Basics Introduction 1

    What is Big Data?

    With all the devices available today to collect data, such as RFID readers, microphones, cameras, sensors, social media, and so on, we are seeing an explosion in data being collected worldwide. Big Data is a term used to describe large collections of data (also known as datasets) that may be unstructured, and grow so large and quickly that they are difficult to manage with regular database or statistics tools.

    What is distributed computing, and where does Hadoop fit in?

    Multiple independent systems appear as one, interacting via a message-passing interface, with no single point of failure.

    Challenges of distributed computing:

    1. Resource sharing. Access any data and utilize CPU resources across the system.

    2. Openness. Extensions, interoperability, portability.

    3. Concurrency. Allows concurrent access to, and updates of, shared resources.

    4. Scalability. Handles extra load, such as an increase in users.

    5. Fault tolerance. By having provisions for redundancy and recovery.

    6. Heterogeneity. Different operating systems and different hardware; a middleware layer allows this.

    7. Transparency. The system should appear as a whole rather than a collection of computers.

    8. The biggest challenge is to hide the details and complexity of meeting the above challenges from the user, and to offer a common, unified interface for interacting with the system. This is where Hadoop comes in.

    Clustered storage is the use of two or more storage servers working together to increase performance, capacity, or reliability. Clustering distributes workloads to each server, manages the transfer of workloads between servers, and provides access to all files from any server regardless of the physical location of the file.
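
    Hadoop's distributed file system (HDFS) exposes clustered storage through exactly this kind of unified interface: a client names a file in a single namespace, and the framework finds its blocks wherever they physically live. A minimal sketch, assuming a configured Hadoop client on the classpath and a hypothetical file path, might read a file from the cluster like this:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadFromCluster {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // handle to the clustered file system
            // The path names a file in the cluster's single namespace; the client never
            // needs to know which machines physically hold its blocks.
            Path file = new Path("/data/sample.txt");   // hypothetical path, for illustration
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }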

    What makes Hadoop unique is its simplified programming model, which allows the user to quickly write and test distributed systems, and its efficient, automatic distribution of data and work across machines, in turn utilizing the underlying parallelism of the CPU cores.
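
    The canonical illustration of this programming model is counting words with MapReduce: the user writes only a map function and a reduce function, and Hadoop takes care of splitting the input, scheduling the pieces of work across machines, and moving the intermediate data between them. Below is a minimal sketch along the lines of the standard Hadoop MapReduce word-count example (the class name and input/output paths are illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // map: emit (word, 1) for every word in an input line
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // reduce: sum the counts emitted for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }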

    How Does Hadoop Resolve the Big Data Problem?

    Scenarios:

    Increase in Volume

    Imagine you have 1GB of data that you need to process.

    The data are stored in a relational database on your desktop computer, and this desktop computer has no problem handling this load. Then your company starts growing very quickly, and that data grows to 10GB, and then 100GB, and you start to reach the limits of your current desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. Then your data grows to 10TB, and then 100TB, and you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on.

    Appendix

    Seek time - is the time taken for a hard disk controller or pointer to locate a specific piece of stored data. Other delays include transfer time (data rate) and rotational delay (latency).

    Latency - is the time required to perform some action or to produce some result. Latency is measured in units of time: hours, minutes, seconds, nanoseconds, or clock periods.

    Throughput - is the amount of data that can traverse a given medium in a given amount of time (bandwidth is the diameter of your medium).
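
    To see how these quantities interact, a rough back-of-the-envelope calculation for a single random read from a spinning disk can help; the figures below are illustrative assumptions, not measurements:

    public class DiskAccessTime {
        public static void main(String[] args) {
            // Illustrative assumptions for a 7200 RPM commodity disk
            double seekMs = 9.0;                                    // average seek time
            double rotationalDelayMs = 0.5 * 60_000.0 / 7200.0;     // half a revolution, about 4.17 ms
            double throughputMBps = 100.0;                          // sustained transfer rate
            double readMB = 1.0;                                    // amount of data to read

            double transferMs = readMB / throughputMBps * 1000.0;   // time spent actually moving data
            double totalMs = seekMs + rotationalDelayMs + transferMs;

            System.out.printf("seek=%.2f ms, rotation=%.2f ms, transfer=%.2f ms, total=%.2f ms%n",
                    seekMs, rotationalDelayMs, transferMs, totalMs);
            // For small reads, seek and rotational delays dominate the total, which is why
            // throughput-oriented systems favor large sequential reads.
        }
    }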

    Fault Tolerance - Your system should have the ability to respond gracefully and continue operating during unexpected failures such as power loss, hardware faults, data corruption, etc.

    RAID - A collection of disks storing the same data (mirroring) in different places. Data redundancy increases fault tolerance.

    Commodity hardware - is nothing but PCs that are affordable and easy to obtain. Typically it is a low-performance, IBM PC-compatible system capable of running Microsoft Windows, Linux, or MS-DOS without requiring any special devices or equipment.