playing with hadoop (npw2013)
Nordic Perl Workshop 2013
Playing with Hadoop
Søren Lund (slu)
DISCLAIMER
I have no experience with Hadoop in a real-world project
The installation notes I present are not necessarily suitable for production
The example scripts have not been used on real (big) data
Hence the title Playing with Hadoop
About Hadoop (and Big Data)
The Problem (it's not new)
We have (access to) more and more data
Processing this data takes longer and longer
Not enough memory
Running out of disk space
Our trusty old server can't keep up
Scaling up
Upgrade hardware: bigger and faster
Redundancy: power supply, RAID, hot-swap
Expensive to keep scaling up
Our software will run without modifications
Scaling out
Add more (commodity) servers
Redundancy is replaced by replication
You can keep on scaling out, it's cheap
How do we enable our software to run across multiple servers?
Google solved this
Google published two papers:
Google File System (GFS), 2003
http://research.google.com/archive/gfs.html
MapReduce, 2004
http://research.google.com/archive/mapreduce.html
GFS and MapReduce provided a platform for processing huge amounts of data in an efficient way
Hadoop was born
Doug Cutting read the Google papers
Based on those, he created Hadoop
(named after his son's toy elephant)
It is an implementation of GFS/MapReduce
(Open Source / Apache License)
Written in Java and deployed on Linux
Originally part of Lucene, now an Apache project of its own
https://hadoop.apache.org/
Hadoop Components
Hadoop Common: utilities to control the rest
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: YARN-based parallel processing
This enables us to write software that can handle Big Data by scaling out
Big Data isn't just big
Huge amounts of data (volume)
Unstructured data (form)
Highly dynamic data (burst/change rate)
Big Data is really any data that is hard to handle with traditional tools and methods
Examples of Big Data
Log files, e.g.:
web server access logs
application logs
Internet feeds:
Twitter, Facebook, etc.
RSS
Images (face recognition, tagging)
Installing Hadoop
Needed to run Hadoop
You need the following to run Hadoop:
Linux server
Java JDK
Hadoop tarball
I'm using the following:
Ubuntu 12.04 LTS 64 bit
JDK 1.6.24 64 bit
Hadoop 1.0.4
Could not get JDK7 + Hadoop 2.2 to work
Install Java
Setup Java home and path
Add hadoop user
Create SSH key for hadoop user
Accept SSH key
Install Hadoop and add to path
Disable IPv6
Reboot and check installation
Running an example job
Calculate Pi
Estimated value of Pi
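To give an idea of how estimating Pi fits the map/reduce shape, here is a minimal sketch in Python. It is not the code of Hadoop's bundled example (which, as far as I know, uses a quasi-Monte Carlo method); this version uses plain pseudo-random sampling, and the names pi_mapper, pi_reducer and estimate_pi are made up for illustration.

```python
import random


def pi_mapper(num_samples, seed):
    """One map task: throw random points into the unit square and
    count how many land inside the quarter circle of radius 1."""
    rng = random.Random(seed)  # a separate generator per map task
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, num_samples


def pi_reducer(partials):
    """The reduce step: add up the partial counts from all map tasks.
    The quarter circle covers pi/4 of the square, hence the factor 4."""
    inside = sum(i for i, _ in partials)
    total = sum(n for _, n in partials)
    return 4.0 * inside / total


def estimate_pi(num_maps=4, samples_per_map=10000):
    # Hypothetical driver: here the "map tasks" simply run one after another.
    partials = [pi_mapper(samples_per_map, seed) for seed in range(num_maps)]
    return pi_reducer(partials)
```

Calling estimate_pi() returns a rough estimate; more maps or more samples per map should tighten it. Each map task is independent of the others, which is what makes the job easy to spread across servers.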
Three modes of operation
Pi was calculated in Local standalone mode:
it is the default mode (i.e. no configuration needed)
all components of Hadoop run in a single JVM
Pseudo-distributed mode:
a separate JVM is spawned for each component
components communicate using sockets
it is a mini-cluster on a single host
Fully distributed mode:
components are spread across multiple machines
Create base directory for HDFS
Set JAVA_HOME
Edit core-site.xml
Edit hdfs-site.xml
Edit mapred-site.xml
Log out and log on as hadoop
Format HDFS
Start HDFS
Start MapReduce
Create home directory & test data
Running Word Count
First let's try the example jar
Inspect the result
Compile and run our own jar
https://gist.github.com/soren/7213273
Inspect result
Run improved version
https://gist.github.com/soren/7213453
Inspect (improved) result
Hadoop MapReduce
A reducer will get all values associated with a given key
A precursor job can be used to normalize data
Combiners can be used to pre-aggregate map output locally before it is sent to the reducer
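The points above can be sketched in a few lines of Python. This is only a toy, single-process model, not Hadoop code: run_job, shuffle and the other names are hypothetical, and the combiner is modelled as a local "mini reduce" that sums counts per key for each input split before anything reaches the reducer.

```python
from collections import defaultdict


def map_words(line):
    """Mapper: emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts per key within ONE mapper's output."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def shuffle(pairs):
    """Group values by key, so a reducer sees all values for its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reducer: receives every value associated with the given key."""
    return key, sum(values)


def run_job(splits, use_combiner=True):
    """Return the word counts and the number of pairs 'sent' to reducers."""
    transferred = []
    for split in splits:  # one split per (imaginary) map task
        pairs = [pair for line in split for pair in map_words(line)]
        if use_combiner:
            pairs = combine(pairs)
        transferred.extend(pairs)
    groups = shuffle(transferred)
    result = dict(reduce_counts(key, values) for key, values in groups.items())
    return result, len(transferred)
```

With the splits [["a b a", "a"], ["b b"]] the result is the same either way, but only 3 pairs are transferred with the combiner, compared to 6 without it.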
Perl MapReduce
Playing with MapReduce
We don't need Hadoop to play with MapReduce
Instead we can emulate Hadoop using two scripts
wc_mapper.pl: a Word Count mapper
wc_reducer.pl: a Word Count reducer
We connect them using a pipe (|)
Very Unix-like!
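The same idea can be sketched in Python, with generators standing in for the two scripts and the pipe. This is not a translation of the Perl gists; it assumes the mapper emits "word<TAB>1" lines and that the reducer simply keeps a running total per word in memory. The function names are borrowed from the script names for readability only.

```python
from collections import Counter


def wc_mapper(lines):
    """Emit one 'word<TAB>1' line per word, like a mapper writing to STDOUT."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def wc_reducer(lines):
    """Sum the counts per word, like a reducer reading the mapper's output.
    A Counter is used, so the input does not need to be sorted here."""
    counts = Counter()
    for line in lines:
        word, value = line.split("\t")
        counts[word] += int(value)
    for word in sorted(counts):
        yield f"{word}\t{counts[word]}"


def pipe(lines):
    # In spirit the same as: cat input | mapper | reducer
    return list(wc_reducer(wc_mapper(lines)))
```

For example, pipe(["to be or not to be"]) gives one line each for be, not, or and to, with be and to counted twice.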
Run MapReduce without Hadoop
https://gist.github.com/soren/7596270
https://gist.github.com/soren/7596285
Hadoop's Streaming interface
Enables you to write jobs in any programming language, e.g. Perl
Input from STDIN
Output to STDOUT
Key/Value pairs separated by TAB
Reducers will get values one-by-one
Not to be confused with Hadoop Pipes that provides a native C++ interface to Hadoop
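Because a streaming reducer gets its values one-by-one as plain lines, it has to notice for itself when the key changes. A minimal sketch in Python, assuming the input arrives sorted by key (which Hadoop arranges between the map and reduce steps); parse and stream_reducer are hypothetical names, not part of any Hadoop API.

```python
import io
from itertools import groupby


def parse(stream):
    """Split each 'key<TAB>value' line into a (key, value) pair."""
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition("\t")
        yield key, value


def stream_reducer(stdin, stdout):
    """Sum the values of each run of identical keys.
    groupby only groups ADJACENT lines, so the input must be sorted by key."""
    for key, group in groupby(parse(stdin), key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        stdout.write(f"{key}\t{total}\n")


# Small demonstration using in-memory streams instead of real STDIN/STDOUT.
demo_out = io.StringIO()
stream_reducer(io.StringIO("be\t1\nbe\t1\nnot\t1\nto\t1\nto\t1\n"), demo_out)
```

In an actual streaming job the function would be called with sys.stdin and sys.stdout instead of the in-memory streams.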
Run Perl Word Count
https://gist.github.com/soren/7596270
https://gist.github.com/soren/7596285
Inspect result
Hadoop::Streaming
Perl interface to Hadoop's Streaming interface
Implemented in Moose
You can now implement your MapReduce job as:
a class with a map() and reduce() method
a mapper script
a reducer script
Installing Hadoop::Streaming
Btw, Perl was already installed on the server ;-)
But we want to install Hadoop::Streaming
I also had to install local::lib to make it work
All you have to do is:
sudo cpan local::lib Hadoop::Streaming
Nice and easy
Run Hadoop::Streaming job
https://gist.github.com/soren/7596451
https://gist.github.com/soren/7600134
https://gist.github.com/soren/7600144
Inspect result
Some final notes and loose ends
The Web User Interface
HDFS: http://localhost:8070/
MapReduce: http://localhost:8030/
File Browser: http://localhost:8075/browseDirectory.jsp?namenodeInfoPort=8070&dir=/
Note: this is with port forwarding in VirtualBox:
50030 -> 8030, 50070 -> 8070, 50075 -> 8075
Joins in Hadoop
It's possible to implement joins in MapReduce:
Reduce-joins: simple
Map-joins: less data to transfer
Do you need joins?
Maybe your data has structure (SQL?)
Try Hive (HiveQL)
Or Pig (Pig Latin)
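For the curious, a reduce-join can be sketched like this in Python. It is a single-process toy under made-up assumptions: two inputs called users and orders, both keyed by a user id, with all function names hypothetical. Each mapper tags its records with the source they came from, and the reducer pairs up the two sides for each key.

```python
from collections import defaultdict


def map_users(record):
    """Mapper for the (hypothetical) users input: tag values with 'U'."""
    user_id, name = record
    return user_id, ("U", name)


def map_orders(record):
    """Mapper for the (hypothetical) orders input: tag values with 'O'."""
    user_id, item = record
    return user_id, ("O", item)


def reduce_join(key, tagged_values):
    """Reducer: sees every tagged value for one key and pairs the sides.
    Keys that appear on only one side produce no output (an inner join)."""
    names = [value for tag, value in tagged_values if tag == "U"]
    items = [value for tag, value in tagged_values if tag == "O"]
    return [(key, name, item) for name in names for item in items]


def reduce_side_join(users, orders):
    groups = defaultdict(list)  # stands in for the shuffle step
    for record in users:
        key, value = map_users(record)
        groups[key].append(value)
    for record in orders:
        key, value = map_orders(record)
        groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined
```

Note that every record from both inputs travels to the reducers, which is the price of the simplicity. A map-join avoids that transfer by loading the smaller input into each mapper.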
Hadoop in the Cloud
Elastic MapReduce (EMR)
http://aws.amazon.com/elasticmapreduce/
Essentially Hadoop in the Cloud
Built on EC2 and S3
You can upload JARs or scripts
There's more
Distributions:
Cloudera Distribution for Hadoop (CDH)
http://www.cloudera.com/
Hortonworks Data Platform (HDP)
http://hortonworks.com/
HBase, Hive, Pig and other related projects
https://hadoop.apache.org/
But a basic Hadoop setup is a good start
and a nice place to just play with Hadoop
I like big data and I can not lie
Oh, my God, Becky, look at the data, it's so big
It looks like one of those Hadoop guys' setups
Who understands those Hadoop guys?
They only map/reduce it because it is on a distributed file system
I mean the data, it's just so big
I can't believe it's so huge
It's just out there, I mean, it's gross
Look, it's just so blah
The End
Questions?
Slides will be available at http://www.slideshare.net/slu/
Find me on Twitter https://twitter.com/slu