GOTO 2011 preso: 3x Hadoop
DESCRIPTION
Talk presentation at the 2011 GOTO Amsterdam conference.
TRANSCRIPT
![Page 1: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/1.jpg)
3x
Friso van Vollenhoven
@fzk
![Page 2: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/2.jpg)
![Page 3: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/3.jpg)
![Page 4: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/4.jpg)
86.88.37.142 - - [26/Jul/2011:00:01:46 +0200] "GET /nl/index.html?Referrer=ADVNLGOO22901030000bsl HTTP/1.1" 200 15551 "http://www.google.nl/search?sourceid=navclient&aq=0h&oq=b&hl=nl&ie=UTF-8&q=bol.com.nl" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)" "DYN_USER_ID=12660142780; DYN_USER_CONFIRM=8bc25ea623423bae5c4ce970faf1b13f4; BOL_RFID=ADVNLGOO1322090000bsl; BUI=86.55.31.109.1278181451852406" 0 "Ti3nysCoEI4AAGMfqZAAAAPD" "-" "325886" "ps316"
Millions of these, each day
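A minimal sketch of pulling the standard leading fields out of a log line like the one above. The pattern and the name `parse_access_log_line` are illustrative assumptions, not code from the talk; the trailing cookie and server-specific fields are left unparsed.

```python
import re

# Leading fields of an Apache "combined"-style access log line.
# Everything after the user agent (cookies, server ids) is site-specific
# and deliberately ignored in this sketch.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_access_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None
```

All values come back as strings; converting `status` and `size` to integers is left to the caller.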
![Page 5: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/5.jpg)
Egypt @ Jan 27, 2011
![Page 6: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/6.jpg)
BGP4MP|980099497|A|193.148.15.68|3333|192.37.0.0/16|3333 5378 286 1836|IGP|193.148.15.140|0|0||NAG||
Hundreds of millions of these, each day
the internet works because of these (and cables and routers and money and people and stuff)
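A pipe-delimited record like the one above can be split into named fields with a few lines of code. This is a rough sketch: the field names are my reading of the layout (type, timestamp, announce/withdraw flag, peer, prefix, AS path, origin, next hop) and are assumptions, as is the function name `parse_bgp_update`.

```python
from datetime import datetime, timezone


def parse_bgp_update(line):
    """Split one pipe-delimited BGP update record into named fields.

    Field names are assumed from the layout of the sample record.
    """
    fields = line.strip().split("|")
    record = {
        "type": fields[0],
        "timestamp": datetime.fromtimestamp(int(fields[1]), tz=timezone.utc),
        "action": fields[2],  # "A" = announcement (assumed), "W" = withdrawal
        "peer_ip": fields[3],
        "peer_as": int(fields[4]),
        "prefix": fields[5],
    }
    if record["action"] == "A":
        # Path attributes are only present on announcements.
        # AS numbers stay strings because a path may contain sets like {1,2}.
        record["as_path"] = fields[6].split()
        record["origin"] = fields[7]
        record["next_hop"] = fields[8]
    return record
```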
![Page 7: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/7.jpg)
![Page 8: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/8.jpg)
![Page 9: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/9.jpg)
![Page 10: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/10.jpg)
Name Node: /some/file, /foo/bar
Data Node: DISK, DISK, DISK
Data Node: DISK, DISK, DISK
Data Node: DISK, DISK, DISK
HDFS client: create file, write data, read data
Data Nodes: replicate
Node-local HDFS client: read data
![Page 11: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/11.jpg)
Why?
scalable
open source
cost-efficient
storage and processing in one
good for analytics: schema-less, unstructured
![Page 12: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/12.jpg)
Not for me...
I don’t have a lot of data.
I surely don’t have a cluster of machines to spare.
I just read the paper.
It’d be cool if I could try this stuff sometime, though...
![Page 13: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/13.jpg)
Free data...
![Page 14: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/14.jpg)
Getting it...
curl -u fzk:secret \
  https://stream.twitter.com/1/statuses/sample.json \
  > tweets.json
8 weeks == ~1/4 TB
![Page 15: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/15.jpg)
Tens of millions of these
![Page 17: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/17.jpg)
Step 1: Configure
Step 2: Launch
Step 3: ?
Step 4: Pay
![Page 18: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/18.jpg)
whirr.service-name=hadoop
whirr.cluster-name=my-cluster
whirr.instance-templates=\
  1 hadoop-jobtracker+hadoop-namenode, \
  19 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=SECRET
whirr.credential=EVEN-MORE-SECRET
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.hadoop-install-function=install_cdh_hadoop
whirr.hadoop-configure-function=configure_cdh_hadoop
whirr.hardware-id=c1.xlarge
Step 1: Configure
![Page 19: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/19.jpg)
whirr launch-cluster --config cluster.properties
Step 2: Launch
bash .whirr/my-cluster/hadoop-proxy.sh
wait about 20 minutes...
![Page 20: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/20.jpg)
![Page 21: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/21.jpg)
Twitter mentions
What’s up with Microsoft?
Step 3:
![Page 22: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/22.jpg)
“Hello, Oracle”
“Google vs. Microsoft vs. Apple”
“Apache rocks! Oracle not so much...”
“Apple == iAwesome”
Oracle, 1
Google, 1
Microsoft, 1
Apple, 1
Apache, 1
Oracle, 1
Apple, 1
input: text
split words
emit: $WORD, 1 for ‘interesting’ words
MAP
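The map step on this slide can be sketched in a few lines. This is an illustration, not the code from the talk: the `INTERESTING` word list and the name `map_tweet` are assumptions chosen to match the example tweets.

```python
import re

# Hypothetical list of 'interesting' words; the slide does not give the real one.
INTERESTING = {"oracle", "google", "microsoft", "apple", "apache"}


def map_tweet(text):
    """Split the text into words and yield (word, 1) for each interesting one."""
    for word in re.findall(r"[A-Za-z]+", text):
        if word.lower() in INTERESTING:
            yield (word.capitalize(), 1)


tweets = [
    "Hello, Oracle",
    "Google vs. Microsoft vs. Apple",
    "Apache rocks! Oracle not so much...",
    "Apple == iAwesome",
]
pairs = [pair for tweet in tweets for pair in map_tweet(tweet)]
```

Note that the mapper does no counting of its own: it emits one pair per occurrence and leaves the summing to a later step.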
![Page 23: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/23.jpg)
MAGIC!
![Page 24: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/24.jpg)
map(input record) => (key, value)
ORDER BY key GROUP BY key
reduce(key, values) => (key, value)
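The "ORDER BY key GROUP BY key" part between map and reduce is what the framework does for you. A single-machine sketch of that step, with the hypothetical name `shuffle`, might look like this:

```python
from itertools import groupby
from operator import itemgetter


def shuffle(pairs):
    """Sort (key, value) pairs by key, then collect the values per key."""
    ordered = sorted(pairs, key=itemgetter(0))  # ORDER BY key
    return [
        (key, [value for _, value in group])  # GROUP BY key
        for key, group in groupby(ordered, key=itemgetter(0))
    ]
```

On a real cluster this happens across machines, with every key routed to one reducer; the sketch only shows the resulting shape of the data.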
![Page 25: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/25.jpg)
Apache: [1]
Apple: [1,1]
Google: [1]
Microsoft: [1]
Oracle: [1,1]
REDUCE
Apache: 1
Apple: 2
Google: 1
Microsoft: 1
Oracle: 2
input: text, count
sum values
emit: $KEY, $SUM for all keys
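The reduce step is the simplest of the three. A minimal sketch, with the hypothetical name `reduce_counts`, applied to the grouped input from this slide:

```python
def reduce_counts(key, values):
    """Sum all counts for one key and emit a single (key, total) pair."""
    return (key, sum(values))


grouped = [
    ("Apache", [1]),
    ("Apple", [1, 1]),
    ("Google", [1]),
    ("Microsoft", [1]),
    ("Oracle", [1, 1]),
]
totals = [reduce_counts(key, values) for key, values in grouped]
```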
![Page 26: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/26.jpg)
https://github.com/xebia/BigData-University
![Page 27: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/27.jpg)
mvn clean install
export HADOOP_CONF_DIR=$HOME/.whirr/my-cluster
hadoop jar bigdata-twitter-0.1-SNAPSHOT-job.jar \
  -Dxebia.twitter.terms=oracle,google,microsoft,apache \
  s3://training-hdfs/twitter-sample/* /job-output
wait another 20 minutes...
![Page 28: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/28.jpg)
![Page 29: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/29.jpg)
![Page 30: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/30.jpg)
![Page 31: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/31.jpg)
![Page 32: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/32.jpg)
hadoop fs -get /job-output/part-r-00000 .
whirr destroy-cluster --config cluster.properties
![Page 33: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/33.jpg)
20110807 apache 2
20110807 google 422
20110807 microsoft 44
20110807 oracle 11
20110808 apache 25
20110808 google 1341
20110808 microsoft 160
20110808 oracle 37
20110809 apache 17
20110809 google 1431
20110809 microsoft 184
20110809 oracle 40
20110810 apache 12
20110810 google 1688
20110810 microsoft 179
20110810 oracle 51
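Once the part file is on local disk, turning it into something plottable takes only a few lines. A sketch, assuming one whitespace-separated "date term count" record per line as shown above; the name `parse_job_output` is hypothetical.

```python
from collections import defaultdict


def parse_job_output(lines):
    """Build {date: {term: count}} from 'date term count' records."""
    counts = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        day, term, count = line.split()
        counts[day][term] = int(count)
    return dict(counts)


sample = [
    "20110807 apache 2",
    "20110807 google 422",
    "20110808 apache 25",
    "20110808 google 1341",
]
mentions = parse_job_output(sample)
```

With the real file, `parse_job_output(open("part-r-00000"))` would give the per-day mention counts for each term.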
![Page 34: GOTO 2011 preso: 3x Hadoop](https://reader034.vdocuments.site/reader034/viewer/2022052523/55615055d8b42adb6b8b51a4/html5/thumbnails/34.jpg)
From: [email protected]
Subject: AWS Billing Statement Available
Greetings from Amazon Web Services,
This e-mail confirms that your latest billing statement is available on the AWS web site. Your account will be charged the following:
Total: $218.02
Thank you for using Amazon Web Services.
Sincerely,
Amazon Web Services
Step 4: Pay