Under the Hood of Hadoop Processing at OCLC Research

DESCRIPTION

Code4lib 2014 • Raleigh, NC. Roy Tennant, Senior Program Officer. Under the Hood of Hadoop Processing at OCLC Research. Apache Hadoop: a family of open source technologies for parallel processing, including Hadoop core, which implements the MapReduce algorithm.

TRANSCRIPT
The world’s libraries. Connected.
Under the Hood of Hadoop Processing at OCLC Research
Code4lib 2014 • Raleigh, NC
Roy Tennant, Senior Program Officer
Apache Hadoop
• A family of open source technologies for parallel processing:
  • Hadoop core, which implements the MapReduce algorithm
  • Hadoop Distributed File System (HDFS)
  • HBase – Hadoop Database
  • Pig – A high-level data-flow language
  • Etc.
MapReduce
• “…a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.” – Wikipedia
• Two main parts, implemented in separate programs:
  • Mapping – filtering and sorting
  • Reducing – merging and summarizing
• Hadoop marshals the servers, runs the tasks in parallel, manages I/O, and provides fault tolerance
Quick History
• OCLC has been doing MapReduce processing on a cluster since 2005, thanks to Thom Hickey and Jenny Toves
• In 2012, we moved to a much larger cluster using Hadoop and HBase
• Our longstanding experience doing parallel processing made the transition fairly quick and easy
Meet “Gravel”
• 1 head node, 40 processing nodes
• Per processing node:
  • Two AMD 2.6 GHz processors
  • 32 GB RAM
  • Three 2 TB drives
  • 1 dual-port 10 Gb NIC
• Several copies of WorldCat, both “native” and “enhanced”
Using Hadoop
• Java native
• Can use any language you want if you use the “streaming” option
• Streaming jobs require a lot of parameters, best kept in a shell script
• Mappers and reducers don’t even need to be in the same language (mix and match!)
Using HDFS
• The Hadoop Distributed File System (HDFS) takes care of distributing your data across the cluster
• You can reference the data using a canonical address; for example: `/path/to/data`
• There are also various standard file system commands open to you; for example, to test a script before running it against all the data: `hadoop fs -cat /path/to/data/part-00001 | head | ./SCRIPT.py`
• Also, data written to disk is similarly distributed and accessible via HDFS commands; for example: `hadoop fs -cat /path/to/output/* > data.txt`
Using HBase
• Useful for random access to data elements
• We have dozens of tables, including the entirety of WorldCat
• Individual records can be fetched by OCLC number
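The slides don’t show the lookup itself; as a sketch using the standard HBase shell, with a hypothetical table name and column family (not OCLC’s actual schema), fetching a record by OCLC number might look like:

```sh
# Hypothetical HBase shell session -- table and column names are invented
get 'WorldCat', '12345678'                # fetch the row keyed by OCLC number
get 'WorldCat', '12345678', 'data:marc'   # fetch just one column of that row
```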
Browsing HBase
Our “HBase Explorer”
MARC Record
MapReduce Processing
• Some jobs only have a “map” component
• Examples:
  • Find all the WorldCat records with a 765 field
  • Find all the WorldCat records with the string “Tar Heels” anywhere in them
  • Find all the WorldCat records with the text “online” in the 856 $z
• Output is written to disk in the Hadoop filesystem (HDFS)
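As a minimal sketch of such a map-only job (assuming Hadoop Streaming and an invented one-record-per-line input format with fields serialized as tab-separated `TAG$SUBFIELD=value` tokens; this is not OCLC’s actual code or data format), a filter for the 765-field example might look like:

```python
#!/usr/bin/env python
# Hypothetical map-only Streaming job: keep only records with a 765 field.
# The input format (tab-separated "TAG$SUBFIELD=value" tokens per record)
# is invented for illustration.
import sys

def has_field(record, tag):
    """True if any field token in the record starts with the given tag."""
    return any(token.startswith(tag) for token in record.strip().split("\t"))

if __name__ == "__main__":
    for line in sys.stdin:
        if has_field(line, "765"):
            sys.stdout.write(line)
```

With no reducer configured (`-numReduceTasks 0` in Streaming), the mapper output is written straight to HDFS.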
Mapper Process Only
[Diagram: Shell Script → Mapper → Data in HDFS]
MapReduce Processing
• Some jobs also have a “reduce” component
• Example:
  • Find all of the text strings in the 650 $a (map) and count them up (reduce)
Mapper and Reducer Process
[Diagram: Shell Script → Mapper → Reducer → Summarized Data in HDFS]
The JobTracker
Sample Shell Script
[Screenshot: set up variables, remove earlier output, call Hadoop with parameters and files]
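The screenshot itself is not reproduced here; a minimal sketch of such a driver script (paths, jar location, and script names are placeholders, not OCLC’s actual values; the streaming jar location varies by Hadoop version) might look like:

```sh
#!/bin/sh
# Hypothetical driver script for a Hadoop Streaming job.
# All paths below are placeholders.

# Set up variables
INPUT=/path/to/data                   # HDFS input directory
OUTPUT=/path/to/output                # HDFS output directory
STREAMING_JAR=$HADOOP_HOME/contrib/streaming/hadoop-streaming.jar

# Remove earlier output (Hadoop refuses to overwrite an existing directory)
hadoop fs -rmr "$OUTPUT"

# Call Hadoop with parameters and files
hadoop jar "$STREAMING_JAR" \
    -input  "$INPUT" \
    -output "$OUTPUT" \
    -mapper  mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
```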
Sample Mapper
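The mapper screenshot did not survive extraction. As a sketch (assuming Hadoop Streaming and an invented one-record-per-line format with fields serialized as tab-separated `TAG$SUBFIELD=value` tokens; this is not OCLC’s actual code), a mapper for the 650 $a counting example might look like:

```python
#!/usr/bin/env python
# Hypothetical Streaming mapper: emit "subject<TAB>1" for each 650 $a value.
# The input format (tab-separated "TAG$SUBFIELD=value" tokens per record)
# is invented for illustration.
import sys

def map_line(line):
    """Return (subject, 1) pairs for every 650 $a value in one record."""
    prefix = "650$a="
    return [(token[len(prefix):], 1)
            for token in line.strip().split("\t")
            if token.startswith(prefix)]

if __name__ == "__main__":
    for line in sys.stdin:
        for subject, count in map_line(line):
            print("%s\t%d" % (subject, count))
```

Hadoop Streaming sorts the mapper output by key before handing it to the reducer, so identical subjects arrive together.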
Sample Reducer
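Likewise, a matching Streaming reducer (a sketch, not the code in the screenshot) only needs to sum consecutive counts, since Streaming delivers the mapper output sorted by key:

```python
#!/usr/bin/env python
# Hypothetical Streaming reducer: sum "key<TAB>count" lines that arrive
# sorted by key, emitting one "key<TAB>total" line per distinct key.
import sys

def reduce_lines(lines):
    """Collapse sorted (key, count) lines into (key, total) pairs."""
    totals = []
    current, running = None, 0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current and current is not None:
            totals.append((current, running))
            running = 0
        current = key
        running += int(value)
    if current is not None:
        totals.append((current, running))
    return totals

if __name__ == "__main__":
    for key, total in reduce_lines(sys.stdin):
        print("%s\t%d" % (key, total))
```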
Running the Job
[Shell screenshot]
The Blog Post
The Press
When you are really, seriously, lucky.
WorldCat Identities
Kindred Works
Cookbook Finder
VIAF
MARC Usage in WorldCat
Contents of the 856 $3 subfield
Work Records
WorldCat Linked Data Explorer
Roy Tennant
[email protected]
@rtennant
facebook.com/roytennant
roytennant.com