Big Time: Introducing Hadoop on Azure
DESCRIPTION
Introduction to HDInsight service (aka Hadoop on Azure)
TRANSCRIPT
Big Data
The problem is simple
• While the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up.
• One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes.
• Over 20 years later, one terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
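The arithmetic behind these figures can be checked with a short sketch. The function name and the 100-drive case are assumptions added for illustration (the slide only gives the single-drive numbers), and 1 TB is taken as 1,000,000 MB.

```python
def read_time_seconds(capacity_mb, transfer_mb_per_s, drives=1):
    """Time to read capacity_mb when the data is spread evenly
    over `drives` disks that are read in parallel."""
    return capacity_mb / (transfer_mb_per_s * drives)

# Figures from the slide: a 1990 drive and a modern 1 TB drive.
drive_1990 = read_time_seconds(1_370, 4.4)
drive_1tb = read_time_seconds(1_000_000, 100)

# Hypothetical case: the same terabyte spread over 100 drives.
parallel = read_time_seconds(1_000_000, 100, drives=100)

print(f"1990 drive: {drive_1990 / 60:.1f} minutes")
print(f"1 TB drive: {drive_1tb / 3600:.1f} hours")
print(f"1 TB over 100 drives: {parallel / 60:.1f} minutes")
```

Under these assumptions, reading in parallel brings the full scan back down to minutes, which is the motivation for spreading data across many machines.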
ParallelGo
Cloud computing changes the way applications grow
http://journals.worldnomads.com/davidsgibson/photo/22804/664941/USA/Elephant-shaped-cloud!
Yaniv Rodenski
Senior Consultant, Sela Group
http://blogs.microsoft.co.il/blogs/roadan
Twitter: @YRodenski
BIG-TIME: Introducing Hadoop on Azure
David Ginzburg
Big Data infrastructure consultant
Twitter: @David_Ginzburg
AGENDA
Apache™ Hadoop™
Hadoop Distributed File System (HDFS)
HDFS Client
MapReduce via WordCount
Hello World
Hello Azure
Goodbye Cruel World
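The WordCount flow on the three input lines above can be sketched in plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop API code; the function names are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework
    does between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)

# The three input lines from the slide.
lines = ["Hello World", "Hello Azure", "Goodbye Cruel World"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(
    reduce_phase(word, counts)
    for word, counts in sorted(shuffle(mapped).items())
)
print(word_counts)
# {'Azure': 1, 'Cruel': 1, 'Goodbye': 1, 'Hello': 2, 'World': 2}
```

Each mapper emits a 1 per word, and the reducers produce the totals, with "Hello" and "World" counted twice.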
A new way to MapReduce
DEMO
Hadoop MapReduce Processing
Input Split
Merge
Job Client
MapReduce TMI
Input Split
Partition, sort, and spill to disk
Buffer
Fetch
Sort
Output
Map Output
Merge result
Partitioners
Combiners
The TeraSort Use case
Beginners Pitfalls
Distinct Values Problem Statement
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
Administrating Hadoop in the real world
DEMO
Why did Microsoft choose Hadoop?
Hadoop on Azure
Using hadooponazure.com
DEMO
Windows Azure Compute
Azure Role
Supporting service
Application
Configuration
Hadoop on Azure Roles
Azure Role
Monitoring service (RdAdmin)
Hadoop services
Configuration
Hadoop MapReduce Processing
Head Node
Name Node
Worker Node
Data Node
Worker Node
Data Node
Worker Node
Data Node
Worker Node
Data Node
Fabric Controller
Worker Node
Data Node
Worker Node
Data Node
Worker Node
Data Node
The Head Node Template
The Worker Node Template
Node VM Templates
              HEAD NODE     WORKER NODE
VM Template   Extra Large   Medium
Cores         8             2
Memory        14 GB         3.5 GB
HD            2 TB          489 GB
Cloud Storage
High Availability on Azure
Fabric Controller
Head Node
Name Node
Head Node
Name Node
Azure Storage
Elastic MapReduce
Elastic MapReduce
Storage Client
Amazon S3
Head Node
Jobtracker
Worker Node
Tasktracker
Worker Node
Tasktracker
Worker Node
Tasktracker
Azure Storage
Using Elastic MapReduce
DEMO
Azure Blob Considerations
Storage Size Limitations
IsotopeJS
Using the JavaScript interactive console
DEMO
Using Hive
DEMO
Summary
Q & A
Resources
My Blog: http://bit.ly/roadan
Apache™ Hadoop™: http://hadoop.apache.org
Hadoop on Azure: http://www.hadooponazure.com
Hadoop: The Definitive Guide, Tom White: http://shop.oreilly.com/product/9780596521981.do
Windows Azure Developer Center: http://www.windowsazure.com/en-us/develop/overview
Thanks!
Yaniv Rodenski
Twitter: @YRodenski