-
1/26
Introduction Basics Hands-on Q & A Conclusion References Files
Big Data: Data Analysis Boot Camp
Hadoop and R
Chuck Cartledge, PhD
24 September 2017
-
2/26
Table of contents (1 of 1)
1 Introduction
2 Basics
3 Hands-on
4 Q & A
5 Conclusion
6 References
7 Files
-
3/26
What are we going to cover?
1 Look at the Hadoop map-reduce programming model
2 Pick apart the “classic” map-reduce word count program
3 Look at how the map-reduce model can be used with complex keys
-
4/26
Hadoop Distributed File System (hdfs)
The Hadoop Distributed File System (HDFS)
“The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.”
A. Staff [2]
-
5/26
Hadoop Distributed File System (hdfs)
HDFS Assumptions and Goals [2]
Hardware Failure: Hardware failure is the norm rather than the exception.
Streaming Data Access: Applications that run on HDFS need streaming access to their data sets.
Large Data Sets: Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.
Simple Coherency Model: HDFS applications need a write-once-read-many access model for files.
Moving Computation is Cheaper than Moving Data: A computation requested by an application is much more efficient if it is executed near its data.
Portability Across Heterogeneous Hardware and Software Platforms: HDFS has been designed to be easily portable from one platform to another.
-
6/26
Hadoop Distributed File System (hdfs)
HDFS Implementations [3]
Hardware Failure: Redundant copies of the data are kept by the system.
Streaming Data Access: Applications that run on HDFS need streaming access to their data sets. Programs read and write data from and to STDIN and STDOUT.
Large Data Sets: An HDFS data file is “chunked” to minimize total program execution time.
Simple Coherency Model: HDFS trades off some POSIX requirements for performance, so some operations may behave differently than you expect them to.
Moving Computation is Cheaper than Moving Data: map() functions are copied to the data and the results are copied to the reducer functions.
Portability Across Heterogeneous Hardware and Software Platforms: Systems built in accordance with (IAW) standards gain market share.
-
7/26
Hadoop Distributed File System (hdfs)
HDFS terminology
Some terms:
namenode: manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree.
client: accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. The client presents a POSIX-like filesystem interface.
datanode: datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode).
Our applications are clients, and the mysteries of the name and data nodes are hidden from us.
Image from [1].
-
9/26
Map/Reduce computing model
Map-reduce model from a 50,000-foot view.
A simple and powerful model:
1 A line of data is presented to a “mapper” function.
2 The “mapper” outputs 0 or more key and value tuples per presented input line.
3 Hadoop sorts and merges all keys and values so that each key appears once, with one or more values.
4 The “reducer” processes each key and its associated values to produce the output.
Image from [1].
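The four steps can be imitated in plain R, without Hadoop, to see the data flow. This is only a sketch: the function names (mapper, reducer, run_map_reduce) and the sample lines are made up for illustration, and a real job would hand the map and reduce functions to Hadoop rather than call them in a loop.

```r
# Mapper: one input line in, zero or more (key, value) pairs out.
mapper <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  data.frame(key = words, value = rep(1L, length(words)),
             stringsAsFactors = FALSE)
}

# Reducer: one key plus all of its values in, one output record out.
reducer <- function(key, values) {
  data.frame(key = key, value = sum(values), stringsAsFactors = FALSE)
}

# Stand-in for what Hadoop does between the two functions.
run_map_reduce <- function(lines, mapper, reducer) {
  pairs   <- do.call(rbind, lapply(lines, mapper))   # step 1 and 2: map each line
  grouped <- split(pairs$value, pairs$key)           # step 3: sort and merge by key
  out     <- Map(reducer, names(grouped), grouped)   # step 4: reduce each key
  result  <- do.call(rbind, out)
  rownames(result) <- NULL
  result
}

# Hypothetical input; the empty line shows a mapper emitting zero pairs.
lines  <- c("the quick brown fox", "", "the lazy dog and the fox")
counts <- run_map_reduce(lines, mapper, reducer)
print(counts)
```

The mapper never sees more than one line and the reducer never sees more than one key, which is what lets Hadoop spread both across many machines.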
-
11/26
Map/Reduce computing model
A lower level view
There are a lot of processes and coordination happening behind the scenes. The client submits a job to Hadoop, mapper functions are copied to the data, key values are sorted, then presented to the reducers, and output is written. Much of this activity can be monitored at port 8787.
Image from [1].
-
13/26
Word count
Classic word count program
The program is in the attached file (Hadoop word count). We’ll:
1 Set some environment variables for Hadoop
2 Load necessary R libraries
3 Download and save the text file
4 Do some HDFS housekeeping
5 Define and execute the map-reduce job
6 See where the results ended up
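A rough outline of steps 1, 2, 5, and 6, assuming the rmr2 package is the R interface to Hadoop. This is a sketch, not the attached program: the two paths are hypothetical and depend on the installation, the input is two inline strings instead of a downloaded file, and the job uses rmr2's local backend so that no cluster is needed. The job only runs if rmr2 is installed.

```r
# Step 1: environment variables that rmr2 reads.
# Both paths are hypothetical; use the ones from your own Hadoop install.
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")

# Step 5 (definitions): map emits (word, 1); reduce sums the 1s for each word.
wc_map <- function(k, lines) {
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  keyval(words, 1L)
}
wc_reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

# Steps 2, 5 (execution), and 6, only when rmr2 is available.
if (requireNamespace("rmr2", quietly = TRUE)) {
  library(rmr2)
  rmr.options(backend = "local")        # assumption: run locally, not on a cluster
  input  <- to.dfs(c("a b a", "b c"))   # stand-in for the downloaded text file
  job    <- mapreduce(input = input, map = wc_map, reduce = wc_reduce)
  result <- from.dfs(job)               # step 6: pull the results back into R
  print(result)
}
```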
-
14/26
Word count
Ways to modify the word count program.
Remove all “words” that are in fact a space
Remove all “stop” words
Remove all words that are numbers
Stem all words
Process a different text file
Create a histogram of the first n most common words
Estimate the “reading” level of the processed text
Create a word cloud in the shape of something associated with the text
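A few of these modifications can be prototyped in base R before they go into the mapper. The sketch below covers empty “words”, stop words, numbers, and the n most common words. The stop-word list is a tiny made-up one, and clean_words and top_n_words are illustrative names, not functions from the attached program.

```r
# A tiny, hypothetical stop-word list; a real analysis would use a fuller one.
stop_words <- c("a", "an", "and", "of", "the", "to")

clean_words <- function(words, stop_words) {
  words <- tolower(trimws(words))
  words <- words[nzchar(words)]                  # drop "words" that are only space
  words <- words[!grepl("^[0-9]+$", words)]      # drop words that are numbers
  words[!(words %in% stop_words)]                # drop stop words
}

# The n most common words, most frequent first.
top_n_words <- function(words, n) {
  tab <- sort(table(words), decreasing = TRUE)
  tab[seq_len(min(n, length(tab)))]
}

# Hypothetical mapper output before cleaning.
raw  <- c("The", " ", "whale", "", "42", "and", "whale",
          "ship", "1851", "sea", "whale", "ship")
kept <- clean_words(raw, stop_words)
top  <- top_n_words(kept, 2)
print(top)
```

Passing the result of top_n_words() to barplot() is one way to get the histogram.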
-
15/26
Airports and travel
Looking at air traffic between US domestic airports (attached Airport route exploration)
Mashing data from different sources.
Use the US Government Bureau of Transportation Statistics to get route data
Use OpenFlights to find airport latitude and longitude
Use the Hadoop map/reduce model to create a pivot table
Plot results
Attached file.
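The pivot table step is where a complex key comes in: the key is the pair (origin, destination) rather than a single word. A minimal base R imitation follows, assuming made-up route records. The column names are illustrative and are not the BTS field names, and tapply() stands in for Hadoop's shuffle and reduce.

```r
# Hypothetical route records; column names are assumptions, not the BTS schema.
routes <- data.frame(
  origin = c("ORF", "ORF", "ATL", "ORF", "ATL"),
  dest   = c("ATL", "ATL", "ORF", "JFK", "ORF"),
  weight = c(100, 250, 300, 50, 25),
  stringsAsFactors = FALSE
)

# Map: build one composite key from two fields.
keys   <- paste(routes$origin, routes$dest, sep = "|")
values <- routes$weight

# Shuffle and reduce: total weight for each composite key.
totals <- tapply(values, keys, sum)

# Split the composite key apart again and lay the totals out as a pivot table.
parts <- do.call(rbind, strsplit(names(totals), "|", fixed = TRUE))
long  <- data.frame(origin = parts[, 1], dest = parts[, 2],
                    weight = as.numeric(totals), stringsAsFactors = FALSE)
pivot <- xtabs(weight ~ origin + dest, data = long)
print(pivot)
```

Pasting the fields together with a separator that cannot occur in an airport code keeps the key a plain string, which is easy to sort and to split later.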
-
17/26
Airports and travel
Bureau of Transportation Statistics home page
https://www.bts.gov
-
18/26
Airports and travel
BTS Airlines and Airports page
https://www.rita.dot.gov/bts/sites/rita.dot.gov.bts/files/subject_areas/airline_information/index.html
-
19/26
Airports and travel
BTS Domestic Data Selection page
https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=258&DB_Short_Name=Air%20Carriers
-
20/26
Airports and travel
OpenFlights home page
https://openflights.org/data.html
-
21/26
Airports and travel
Lessons learned about keyval()
Some “interesting” things about the keyval() function:
1 The last call wins. If your processing creates a collection of key value pairs, the last keyval() call is the data passed to reduce().
2 keyval() is vectorized. There can be more than one key or value passed to the function.
To pass more than one key value combination, use: keyval(c(...), c(...))
Be aware that the shorter argument will be recycled as necessary to match the longer argument.
Execute keyval at the R prompt to see code.
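Recycling is easy to get wrong, so it helps to see it on small inputs. The pair_up() function below is a hypothetical stand-in that applies R's usual recycling rule with rep_len(); it is not rmr2's keyval() code, and the airport codes are made up.

```r
# pair_up() is a hypothetical stand-in that mimics R's recycling rule.
# It is NOT the rmr2 keyval() implementation.
pair_up <- function(key, val) {
  n <- max(length(key), length(val))
  data.frame(key = rep_len(key, n), val = rep_len(val, n),
             stringsAsFactors = FALSE)
}

# Usually intended: one value recycled across many keys.
many_to_one <- pair_up(c("ORF", "ATL", "JFK"), 1L)
print(many_to_one)

# Usually a bug: two keys stretched silently to cover three values.
uneven <- pair_up(c("ORF", "ATL"), c(10, 20, 30))
print(uneven)
```

In the second call the third value is paired with the first key again, so a length check before the final keyval() call is cheap insurance.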
-
22/26
Airports and travel
Ways to modify the airport program.
Change the lines between airports to great circle routes
Reduce the number of routes to those that carry the greatest weight
See the difference between cargo and passenger routes
Modify the routes to show source and destination
Identify the most common carriers by weight
Identify the most frequent carriers
Compute net weight exchange between airports (find sources and sinks)
If the data is for US domestic routes, why are there links to Chile?
Expand the list of airport locations to remove all unknown locations
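As one example, net weight exchange can be sketched in base R as outbound weight minus inbound weight for each airport. The route records and column names below are made up, and net_weight is an illustrative name, not a function from the attached program.

```r
# Hypothetical route records; column names are assumptions.
routes <- data.frame(
  origin = c("ORF", "ATL", "ATL", "JFK"),
  dest   = c("ATL", "ORF", "JFK", "ATL"),
  weight = c(100, 40, 70, 10),
  stringsAsFactors = FALSE
)

# Net weight per airport: positive means a source, negative means a sink.
net_weight <- function(routes) {
  airports <- sort(union(routes$origin, routes$dest))
  outbound <- vapply(airports,
                     function(a) sum(routes$weight[routes$origin == a]),
                     numeric(1))
  inbound  <- vapply(airports,
                     function(a) sum(routes$weight[routes$dest == a]),
                     numeric(1))
  outbound - inbound
}

net <- net_weight(routes)
print(net)
```

In a map-reduce version, the mapper could emit two pairs per route, (origin, +weight) and (dest, -weight), and the reducer would sum the values for each airport.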
-
23/26
Q & A time.
Q: How many Oregonians does it take to screw in a light bulb?
A: Three. One to screw in the light bulb and two to fend off all those Californians trying to share the experience.
-
24/26
What have we covered?
Gained an understanding of how R interfaces with the Hadoop map-reduce programming model
“Played” with a word count program
Looked at things that airlines carry between airports and how to display that data
Next: BDAR Chapter 5, RDBMSs and R
-
25/26
References (1 of 1)
[1] Ricky Ho, How Hadoop Map/Reduce works, 2008.
[2] Apache Staff, HDFS Architecture Guide, https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html, 2017.
[3] Tom White, Hadoop: The Definitive Guide, 4th Edition, O’Reilly Media, Inc., 2015.
-
26/26
Files of interest
1 Hadoop word count
2 Airport route exploration
3 R library script file
4 Route information
rm(list=ls())
source("library.R")
loadLibraries