Hadoop, HDFS, MapReduce and Pig

Uploaded by: tomasz-bednarz

Posted on 30-Jun-2015

Category:

Technology


DESCRIPTION

Open presentation and training material, presented at the CSIRO Big Data 2.0 workshop in North Ryde, Australia, in September 2013. Illustrated with hands-on examples.

TRANSCRIPT

Page 1: Hadoop, HDFS, MapReduce and Pig

Page 9

> hadoop fs

Page 10

hadoop fs

Page 11

$ hadoop fs

● ls

$ hadoop fs -help ls

Page 12

$ hadoop fs -ls <path>
$ hadoop fs -ls /

$ hadoop fs -ls
$ hadoop fs -ls /user/cloudera

Page 13

$ hadoop fs -mkdir data
$ hadoop fs -ls

$ cd ~/bigdata/Exercises/hadoop/data
$ ls -l
$ hadoop fs -put mammograms.zip data

Page 14

● http://localhost:50070

● fsck: an HDFS utility

$ hadoop fsck /user/cloudera/data/mammograms.zip \
    -blocks -locations -files

$ head -n 100 ato_centenary.txt \
    | hadoop fs -put - data/ato100.txt

Page 15

$ head -n 1000 ato_centenary.txt \
    | hadoop fs -put - data/ato100.txt

put: `data/ato100.txt': File exists

$ hadoop fs -rm data/ato100.txt
$ head -n 1000 ato_centenary.txt \
    | hadoop fs -put - data/ato100.txt

Page 16

$ hadoop fs -cat data/ato100.txt | less

$ hadoop fs -get data/ato100.txt ato100.txt

-mv, -cp, -rmdir, -stat ...


Page 22

$ javac -classpath `hadoop classpath` *.java

$ jar cvf csiro.jar *.class

$ hadoop jar csiro.jar Csiro input_dir output_dir

Page 23

map(in_key, in_value) -> (inter_key, inter_value) list

Page 25

let map(key, value) =
    emit(key.toUpper(), value.toUpper())

('csiro', 'cci') -> ('CSIRO', 'CCI')
('csiro', 'cesre') -> ('CSIRO', 'CESRE')
('csiro', 'cmse') -> ('CSIRO', 'CMSE')
('toyota', 'yaris') -> ('TOYOTA', 'YARIS')
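The mapper above can be tried without a cluster. The following is a minimal Python sketch, not Hadoop code: `map_upper` and `run_map` are hypothetical names chosen for illustration, and `emit` is modelled as a generator `yield`.

```python
def map_upper(key, value):
    """Emit exactly one pair, with key and value upper-cased."""
    yield key.upper(), value.upper()


def run_map(mapper, records):
    """Apply the mapper to each (key, value) record and collect every emitted pair."""
    emitted = []
    for key, value in records:
        emitted.extend(mapper(key, value))
    return emitted


records = [("csiro", "cci"), ("csiro", "cesre"), ("csiro", "cmse"), ("toyota", "yaris")]
for pair in run_map(map_upper, records):
    print(pair)
```

Each input record is processed independently of the others, which is what lets a real framework run many mappers in parallel.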

Page 26

let map(key, value) =
    foreach char c in value:
        emit(key, c)

('cci', 'csiro') -> ('cci', 'c'), ('cci', 's'), ('cci', 'i'), ('cci', 'r'), ('cci', 'o')

('open', 'nasa') -> ('open', 'n'), ('open', 'a'), ('open', 's'), ('open', 'a')
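A mapper may emit several pairs for one input record, which is why its result is a list. A small Python sketch of the character-splitting mapper follows; `map_chars` is a hypothetical name and the generator stands in for repeated `emit` calls.

```python
def map_chars(key, value):
    """Emit one (key, character) pair for every character of the value."""
    for c in value:
        yield key, c


# One input record fans out into five output pairs.
for pair in map_chars("cci", "csiro"):
    print(pair)
```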

Page 27

let map(key, value) =
    emit(value.length(), value)

('csiro', 'cci') -> ('3', 'cci')
('csiro', 'cesre') -> ('5', 'cesre')
('csiro', 'cmse') -> ('4', 'cmse')
('toyota', 'yaris') -> ('5', 'yaris')
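A mapper is free to discard the input key and emit a new one. The Python sketch below re-keys each value by its length and then groups the pairs by key, which is what the framework's shuffle step does before reducers run. `map_length` and `group_by_key` are hypothetical names, and the key is kept as an integer here rather than the quoted '3' shown on the slide.

```python
from collections import defaultdict


def map_length(key, value):
    """Ignore the input key; emit the value keyed by its length."""
    yield len(value), value


def group_by_key(pairs):
    """Collect all values that share a key, as a shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


records = [("csiro", "cci"), ("csiro", "cesre"), ("csiro", "cmse"), ("toyota", "yaris")]
pairs = [pair for key, value in records for pair in map_length(key, value)]
print(group_by_key(pairs))
```

Values that came from different input keys ('cesre' and 'yaris') end up in the same group because they now share the key 5.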

Page 29

map(String input_key, String input_value)
    foreach word w in input_value:
        emit(w, 1)

reduce(String output_key, Iterator<int> intermediate_values)
    set count = 0
    foreach v in intermediate_values:
        count += v
    emit(output_key, count)
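The word-count pseudocode can be simulated in memory to see how map, grouping and reduce fit together. This is a sketch under stated assumptions, not the WordCount.java used in the exercise: words are split on whitespace only, the input key is just a line index used as a placeholder, and `map_words`, `reduce_sum` and `word_count` are hypothetical names.

```python
from collections import defaultdict


def map_words(input_key, input_value):
    """Emit (word, 1) for every whitespace-separated word in the line."""
    for w in input_value.split():
        yield w, 1


def reduce_sum(output_key, intermediate_values):
    """Sum all the 1s collected for a single word."""
    count = 0
    for v in intermediate_values:
        count += v
    yield output_key, count


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    groups = defaultdict(list)
    for index, line in enumerate(lines):
        for word, one in map_words(index, line):
            groups[word].append(one)
    result = {}
    for word in sorted(groups):  # reducers receive their keys in sorted order
        for key, total in reduce_sum(word, groups[word]):
            result[key] = total
    return result


print(word_count(["big data big", "data big"]))
```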

Page 30

● Wordcount

$ cd ~/bigdata/Exercises/hadoop/wordcount; ls

$ javac -classpath `hadoop classpath` *.java

$ jar cvf wc.jar *.class

WordCount.java WordMapper.java SumReducer.java

Page 31

$ hadoop jar wc.jar WordCount data/ato100.txt ato_wc

$ hadoop fs -ls ato_wc
$ hadoop fs -cat ato_wc/part-r-00000 | less
$ hadoop fs -cat ato_wc/* | grep 'ATO\|CSIRO'

$ hadoop fs -rm -r ato_wc

Page 32

● Average max temperature

Page 33

$ cd ~/bigdata/Exercises/hadoop/data
$ less nsw_temp.csv
$ less bom_data_Note.txt

Page 34

map(String input_key, String input_value):
    emit(input_value[3], input_value[5])

('IDCJAC0010,061087,1965,01,02,32.2,1,Y') -> ('01', 32.2)

('IDCJAC0010,066062,1890,04,27,20.2,1,Y') -> ('04', 20.2)

('IDCJAC0010,066062,2012,02,03,21.0,1,Y') -> ('02', 21.0)

Page 35

reduce(String month, Iterator<double> values)
    set count = 0
    set sum = 0
    foreach v in values:
        sum += v
        count++
    set mean = sum/count
    emit(month, mean)
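Together with the mapper on the previous slide, this reducer gives the mean maximum temperature per month. The Python sketch below simulates both steps in memory. It is not the AverageTemp.java from the exercise: it assumes every line is comma-separated with the month in field 3 and a numeric temperature in field 5 (zero-based), with no header row or missing values, and the function names are hypothetical.

```python
from collections import defaultdict


def map_month_temp(input_key, input_value):
    """Split the CSV line and emit (month, temperature)."""
    fields = input_value.split(",")
    yield fields[3], float(fields[5])


def reduce_mean(month, values):
    """Average all temperatures collected for one month."""
    count = 0
    total = 0.0
    for v in values:
        total += v
        count += 1
    yield month, total / count


def average_by_month(lines):
    """Run map over every line, group by month, then reduce each group."""
    groups = defaultdict(list)
    for index, line in enumerate(lines):
        for month, temp in map_month_temp(index, line):
            groups[month].append(temp)
    result = {}
    for month in sorted(groups):
        for key, mean in reduce_mean(month, groups[month]):
            result[key] = mean
    return result


# The three sample records shown on the mapper slide.
sample = [
    "IDCJAC0010,061087,1965,01,02,32.2,1,Y",
    "IDCJAC0010,066062,1890,04,27,20.2,1,Y",
    "IDCJAC0010,066062,2012,02,03,21.0,1,Y",
]
print(average_by_month(sample))
```

A real input file may need extra handling, for example skipping lines whose temperature field is empty, before `float` is applied.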

Page 36

$ cd ../averagetemp
$ gedit *.java&

$ cd ../wordcount
$ gedit *.java&

AverageTemp.java AverageTempMapper.java AverageReducer.java

Page 37

$ hadoop fs -put ../data/nsw_temp.csv data

$ javac -classpath `hadoop classpath` *.java
$ jar cvf avt.jar *.class
$ hadoop jar avt.jar AverageTemp data/nsw_temp.csv avt

Page 38

$ hadoop fs -cat avt/part-r-00000

~/bigdata/Exercises/hadoop/averagetemp/sample_solution

Page 54

https://github.com/tomaszbednarz/pig-abc-toilets

● We have a list of local ABC Radio stations in Australia

● We have a list of all public toilets across Australia

● We want to find the closest toilet to each radio station

Demonstration of:

● Data Schemas
● Use of external libraries
● Google Maps API
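The core of the task, finding the nearest point in one list for each point in another, can be sketched outside Pig. The Python example below is not taken from the linked repository and does not use the Google Maps API: it swaps in the haversine great-circle formula as the distance measure, and every name and coordinate in it is invented for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest(station, toilets):
    """Return the (name, lat, lon) toilet record nearest to the station record."""
    _, s_lat, s_lon = station
    return min(toilets, key=lambda t: haversine_km(s_lat, s_lon, t[1], t[2]))


# Hypothetical records: (name, latitude, longitude)
stations = [("station_a", -33.80, 151.10)]
toilets = [("toilet_x", -33.81, 151.11), ("toilet_y", -34.50, 150.00)]

for station in stations:
    print(station[0], "->", closest(station, toilets)[0])
```

Comparing every station with every toilet is quadratic in the input size; on a cluster this corresponds to a cross join followed by a per-station minimum.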
