Real-time Hadoop + MapReduce intro


Posted on 15-Jan-2015






Augmented my real-time Hadoop talk to include a programming intro to MapReduce for Google Developer Groups.


1. 2013: Year of real-time access to Big Data?
   Geoffrey Hendrey, @geoffhendrey / @vertascale

2. Agenda
   - Hadoop MapReduce basics
   - Hadoop stack & data formats
   - File access times and mechanics
   - Key-based indexing systems (HBase)
   - MapReduce, Hive/Pig
   - MPP approaches & alternatives

3. A very bad* diagram
   *This diagram makes it appear that data flows through the master node.

4. A better picture

5. Job Configuration

6. Map and Reduce Java Code

7. Reduce

8. Reducer Group Iterators
   - The reducer groups values together by key; your code iterates over the values and emits the reduced result: Bear:[1,1] -> Bear:2.
   - Hadoop reducer value iterators return THE SAME OBJECT on each next(); the object is reused to reduce garbage-collection load.
   - Beware of reused objects: this is a VERY common cause of long and confusing debugging sessions.
   - Cause for concern: you are emitting an object with non-primitive values, which can carry stale reused state from a previous value.

9. Hadoop Writables
   - Values in Hadoop are transmitted (shuffled, emitted) in a binary format.
   - Hadoop includes primitive types: IntWritable, Text, LongWritable, etc.
   - You must implement the Writable interface for custom objects:

        public void write(DataOutput d) throws IOException {
            d.writeUTF(this.string);
            d.writeByte(this.column);
        }

        public void readFields(DataInput di) throws IOException {
            this.string = di.readUTF();
            this.column = di.readByte();
        }

10. Hadoop Keys (WritableComparable)
    - Be very careful to implement equals() and hashCode() consistently with compareTo().
    - compareTo() controls the sort order of keys arriving at the reducer.
    - Hadoop also lets you write a custom partitioner:

        public int getPartition(Document doc, Text v, int numReducers) {
            return doc.getDocId() % numReducers;
        }

11. Typical Hadoop File Formats

12. Hadoop Stack Review

13. Distributed File System

14. HDFS performance characteristics
    - HDFS was designed for high throughput, not low seek latency.
    - Best-case configurations have shown HDFS performing 92K random reads per second.
    - Personal experience: HDFS is very robust and its fault tolerance is real; I've unplugged machines and never lost data.
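Slides 5-7 show the job configuration and map/reduce code only as screenshots. As a rough sketch of the paradigm itself (plain Java, not the Hadoop API; all class and method names here are made up for illustration), the canonical word count runs map, shuffle, and reduce phases like this:

```java
import java.util.*;

public class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    emitted.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return emitted;
    }

    // Shuffle phase: group emitted values by key, e.g. bear -> [1, 1].
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> emitted) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : emitted) {
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return groups;
    }

    // Reduce phase: sum each group's values, e.g. bear -> 2.
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            int sum = 0;
            for (int v : group.getValue()) sum += v;
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("bear river", "car bear");
        System.out.println(reduce(shuffle(map(input)))); // {bear=2, car=1, river=1}
    }
}
```

In Hadoop the shuffle/sort is done for you between the map and reduce tasks; this sketch only makes the three phases visible in one place.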
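The reused-object pitfall from slide 8 can be reproduced outside Hadoop. The plain-Java sketch below (a toy iterator that, like a Hadoop value iterator, mutates and returns one holder object on every next()) shows why storing references to iterated values yields stale state, while copying fixes it:

```java
import java.util.*;

public class ReusedObjectPitfall {
    // Mimics a Hadoop reducer value iterator: next() always returns the
    // SAME holder object, mutated in place to reduce GC load.
    static class ReusingIterator implements Iterator<int[]> {
        private final int[] source;
        private final int[] holder = new int[1]; // the single reused object
        private int pos = 0;
        ReusingIterator(int[] source) { this.source = source; }
        public boolean hasNext() { return pos < source.length; }
        public int[] next() { holder[0] = source[pos++]; return holder; }
    }

    // BUG: stores references to the reused holder, so every stored element
    // ends up showing the LAST value seen -- the stale state of slide 8.
    static List<int[]> collectReferences(int[] values) {
        List<int[]> out = new ArrayList<>();
        ReusingIterator it = new ReusingIterator(values);
        while (it.hasNext()) out.add(it.next());
        return out;
    }

    // FIX: copy each value out of the holder before storing it.
    static List<int[]> collectCopies(int[] values) {
        List<int[]> out = new ArrayList<>();
        ReusingIterator it = new ReusingIterator(values);
        while (it.hasNext()) out.add(it.next().clone());
        return out;
    }

    public static void main(String[] args) {
        int[] values = {10, 20, 30};
        System.out.println(collectReferences(values).get(0)[0]); // 30 -- stale!
        System.out.println(collectCopies(values).get(0)[0]);     // 10 -- correct
    }
}
```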
15. Motivation for Real-time Hadoop
    - Big Data is more opaque than small data: spreadsheets choke and BI tools can't scale.
    - Small samples often fail to replicate issues.
    - Engineers, data scientists, and analysts need faster time-to-answer on Big Data; rapid find, quantify, and extract; and a way to solve "I don't know what I don't know".
    - MapReduce jobs are hard to debug.

16. Survey of real-time capabilities
    - Real-time, in-situ, self-service access is the Holy Grail for the business analyst.
    - A spectrum of real-time capabilities exists on Hadoop, ranging from available-in-Hadoop to proprietary, and from easy (HDFS) through HBase to hard (Drill).

17. Real-time spectrum on Hadoop

    Use case | Support | Real-time?
    Seek to a particular byte in a distributed file | HDFS | Yes
    Seek to a particular value in a distributed file, by key (1-dimensional indexing) | HBase | Yes
    Answer complex questions expressible in MapReduce code (e.g. matching users to music albums); data science | MapReduce (Hive, Pig) | No
    Ad-hoc query for scattered records given simple constraints (field[4] == "music" && field[9] == "dvd") | MPP architectures | Yes

18. Hadoop Underpinned By HDFS
    - The Hadoop Distributed File System (HDFS), inspired by the Google File System (GFS), underpins every piece of data in Hadoop.
    - The Hadoop FileSystem API is pluggable, so HDFS can be replaced with another suitable distributed filesystem: S3, Kosmos, etc.

19. Amazon S3

20. MapFile for real-time access?
    - The index file must be loaded by the client (slow), and must fit in the client's RAM.
    - By default a lookup scans an average of 50% of the sampling interval.
    - Large records make scanning intolerable.
    - Not a viable real-world solution for random access.

21. Apache HBase
    - Clone of Google's BigTable, with a key-based access mechanism.
    - Designed to hold billions of rows; tables are stored in HDFS.
    - Supports MapReduce over tables and into tables.
    - Requires you to think hard, and commit to a key design.

22. HBase Architecture

23. HBase random read performance
    - 7 servers, each with 8 cores, 32 GB DDR3, and 24 x 146 GB SAS 2.0 10K RPM disks.
    - HBase table of 3 billion records in 6,600 regions.
    - Data size between 128 and 256 bytes per row, spread across 1 to 5 columns.
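Slide 20's "scan an average of 50% of the sampling interval" follows from how a MapFile works: only every Nth key is held in the in-RAM index, and a lookup seeks to the nearest indexed key and scans forward. A hypothetical plain-Java sketch of that structure (not Hadoop's MapFile class):

```java
import java.util.*;

public class SparseIndexSketch {
    // MapFile-style lookup over sorted keys: only every `interval`th key is
    // indexed, so get() binary-searches the small index and then scans at
    // most `interval` records -- on average half the sampling interval.
    private final long[] keys;      // all keys, sorted (on "disk")
    private final long[] indexKeys; // every `interval`th key (in "RAM")
    private final int[] indexPos;   // position of each indexed key
    private final int interval;

    SparseIndexSketch(long[] sortedKeys, int interval) {
        this.keys = sortedKeys;
        this.interval = interval;
        int n = (sortedKeys.length + interval - 1) / interval;
        indexKeys = new long[n];
        indexPos = new int[n];
        for (int i = 0; i < n; i++) {
            indexKeys[i] = sortedKeys[i * interval];
            indexPos[i] = i * interval;
        }
    }

    // Returns the position of key, or -1 if absent.
    int get(long key) {
        int i = Arrays.binarySearch(indexKeys, key);
        if (i < 0) i = -i - 2;   // largest indexed key <= target
        if (i < 0) return -1;    // key smaller than every indexed key
        for (int p = indexPos[i]; p < keys.length && p < indexPos[i] + interval; p++) {
            if (keys[p] == key) return p;
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] sorted = new long[1000];
        for (int i = 0; i < 1000; i++) sorted[i] = i * 2; // even keys 0..1998
        SparseIndexSketch idx = new SparseIndexSketch(sorted, 128);
        System.out.println(idx.get(1998)); // 999
        System.out.println(idx.get(7));    // -1 (odd keys absent)
    }
}
```

With large records, each step of that final scan is expensive, which is slide 20's point about scanning becoming intolerable.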
24. Zoomed-in Get time histogram

25. MapReduce
    - "MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers" (Wikipedia).
    - MapReduce is strongly tied to HDFS in Hadoop; systems built on HDFS (e.g. HBase) leverage this common foundation for integration with the MR paradigm.

26. MapReduce and Data Science
    - Many complex algorithms can be expressed in the MapReduce paradigm: NLP, graph processing, image codecs.
    - The more complex the algorithm, the more the map and reduce phases become complex programs in their own right.
    - Multiple MR jobs are often cascaded in succession.

27. Is MapReduce real-time?
    - MapReduce on Hadoop has certain latencies that are hard to improve: copy, shuffle/sort, iterate.
    - Run time depends on both the size of the input data and the number of processors available.
    - In a nutshell, it's a batch process and isn't real-time.

28. Hive and Pig
    - Run on top of MapReduce.
    - Provide a table metaphor familiar to SQL users, with SQL-like (or, in Hive's case, essentially the same) syntax.
    - Store a schema in a database, mapping tables to HDFS files.
    - Translate queries to MapReduce jobs, so they are no more real-time than MapReduce itself.

29. MPP Architectures
    - Massively Parallel Processing: lots of machines, and therefore also lots of memory.
    - Examples: Spark, a general-purpose data science framework, sort of like real-time MapReduce for data science; and Dremel, a columnar approach geared toward answering SQL-like aggregations and BI-style questions.

30. Spark
    - Originally designed for iterative machine learning problems at Berkeley; MapReduce does not do a great job on iterative workloads.
    - Spark makes more explicit use of memory caches than Hadoop.
    - Spark can load data from any Hadoop input source.

31. Effect of Memory Caching in Spark

32. Is Spark real-time?
    - Even if the data fits in memory, execution time for most algorithms still depends on the amount of data to be processed and the number of processors.
    - So "it still depends", but Spark is definitely more focused on fast time-to-answer, with interactive Scala and Java shells.
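Why memory caching helps iterative workloads (slides 30-31) can be shown with a toy simulation, in plain Java rather than Spark: a chain of MapReduce-style passes re-reads its input on every iteration, while a Spark-style cached dataset is loaded once. The load counts here are illustrative, not benchmarks:

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;

public class CachingEffectSketch {
    // Simulated input source; `loads` counts how often it is read.
    static List<Integer> loadFromDisk(AtomicInteger loads) {
        loads.incrementAndGet();
        return Arrays.asList(1, 2, 3, 4);
    }

    // A stand-in for one pass of an iterative algorithm.
    static int step(List<Integer> data, int acc) {
        for (int v : data) acc += v;
        return acc;
    }

    // MapReduce-style chain: the input is re-read on every iteration.
    static int iterateWithoutCache(int iterations, AtomicInteger loads) {
        int result = 0;
        for (int i = 0; i < iterations; i++) {
            List<Integer> data = loadFromDisk(loads); // re-read each pass
            result = step(data, result);
        }
        return result;
    }

    // Spark-style: load once into memory, then iterate over the cache.
    static int iterateWithCache(int iterations, AtomicInteger loads) {
        List<Integer> cached = loadFromDisk(loads);   // read once, keep in RAM
        int result = 0;
        for (int i = 0; i < iterations; i++) {
            result = step(cached, result);
        }
        return result;
    }

    public static void main(String[] args) {
        AtomicInteger a = new AtomicInteger(), b = new AtomicInteger();
        iterateWithoutCache(10, a);
        iterateWithCache(10, b);
        System.out.println(a.get() + " loads vs " + b.get() + " load"); // 10 loads vs 1 load
    }
}
```

As slide 32 notes, caching removes the repeated I/O but not the per-iteration compute, so run time still scales with data size and processor count.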
33. Dremel MPP architecture
    - An MPP architecture for ad-hoc query on nested data; Apache Drill is an open-source clone of Dremel.
    - Dremel was originally developed at Google and features in situ data analysis.
    - "Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations." (Dremel: Interactive Analysis of Web-Scale Datasets)

34. In Situ Analysis
    - Moving Big Data is a nightmare.
    - In situ: the ability to access data in place, e.g. in HDFS or in BigTable.

35. Uses For Dremel At Google
    - Analysis of crawled web documents.
    - Tracking install data for applications on Android Market.
    - Crash reporting for Google products.
    - OCR results from Google Books.
    - Spam analysis.
    - Debugging of map tiles on Google Maps.
    - Tablet migrations in managed Bigtable instances.
    - Results of tests run on Google's distributed build system.
    - Etc.

36. Why so many uses for Dremel?
    - On any Big Data problem or application, the dev team faces the same problems: "I don't know what I don't know" about the data; debugging often requires finding and correlating specific needles in the haystack; and support and marketing often require segmentation analysis (identifying and characterizing wide swaths of data).
    - Every developer and analyst wants faster time-to-answer and fewer trips around the mulberry bush.

37. Column Oriented Approach

38. Dremel MPP query execution tree

39. Is Dremel real-time?

40. Alternative approaches?
    - Both MapReduce and MPP query architectures take a "throw hardware at the problem" approach. Alternatives?
    - Use MapReduce to build distributed indexes on the data.
    - Combine columnar storage and inverted indexes to create columnar inverted indexes.
    - Aim for the sweet spot for data scientists and engineers: ad-hoc queries with results returned in seconds on a single processing node.

41. Contact Info
    - Email:
    - Twitter: @geoffhendrey, @vertascale
    - www:

42. References
    - Dremel: Interactive Analysis of Web-Scale Datasets
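The "columnar inverted indexes" idea from slide 40 can be illustrated with a tiny per-column inverted index (a hypothetical plain-Java sketch, not Dremel or Drill code): ad-hoc conjunctive constraints like slide 17's field[4] == "music" && field[9] == "dvd" reduce to an intersection of posting lists, and only the referenced columns are ever touched:

```java
import java.util.*;

public class InvertedIndexSketch {
    // One inverted index per column: value -> sorted set of row ids.
    private final Map<Integer, Map<String, TreeSet<Integer>>> columns = new HashMap<>();

    void addRow(int rowId, Map<Integer, String> fields) {
        for (Map.Entry<Integer, String> f : fields.entrySet()) {
            columns.computeIfAbsent(f.getKey(), k -> new HashMap<>())
                   .computeIfAbsent(f.getValue(), v -> new TreeSet<>())
                   .add(rowId);
        }
    }

    // Answers conjunctive constraints (column == value && ...) by
    // intersecting the matching posting lists.
    TreeSet<Integer> query(Map<Integer, String> constraints) {
        TreeSet<Integer> result = null;
        for (Map.Entry<Integer, String> c : constraints.entrySet()) {
            TreeSet<Integer> postings = columns
                .getOrDefault(c.getKey(), Collections.emptyMap())
                .getOrDefault(c.getValue(), new TreeSet<>());
            if (result == null) result = new TreeSet<>(postings);
            else result.retainAll(postings);
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addRow(1, Map.of(4, "music", 9, "dvd"));
        idx.addRow(2, Map.of(4, "music", 9, "vinyl"));
        idx.addRow(3, Map.of(4, "books", 9, "dvd"));
        System.out.println(idx.query(Map.of(4, "music", 9, "dvd"))); // [1]
    }
}
```

Built once with MapReduce and distributed, such indexes are one route to the slide-40 sweet spot: seconds-scale ad-hoc queries served from a single node rather than a fleet.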