Hadoop at LinkedIn
July 27, 2015, Keith Dsouza
2
● Open source framework developed and
maintained by the Apache Software Foundation
● Includes a distributed file system (HDFS)
that stores data in large blocks
● MapReduce framework - Model for large-scale
data processing
● YARN - Resource management platform for
managing computing resources in clusters
What is Hadoop?
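To make the HDFS bullet concrete, here is a minimal sketch of the block arithmetic. The 128 MB block size and replication factor of 3 are assumptions (they are the usual Hadoop 2 defaults for dfs.blocksize and dfs.replication, not figures from this talk), and the function names are hypothetical.

```python
import math

# Assumed values: typical Hadoop 2 defaults, not figures from the talk.
BLOCK_SIZE_MB = 128   # dfs.blocksize
REPLICATION = 3       # dfs.replication


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of HDFS blocks a file of the given size is split into."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Raw cluster storage used once every block is replicated.

    The last block only occupies as much space as the data it holds,
    so the total is the file size times the replication factor.
    """
    return file_size_mb * replication


if __name__ == "__main__":
    print(block_count(1024))     # a 1 GB file
    print(raw_storage_mb(1024))
```

Under these assumptions a 1 GB file is split into 8 blocks, and each block is stored three times on different nodes.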
3
Problem: We have a large number of attendees at this meetup
and want to count the number of Java engineers, C++
engineers, and C engineers present here
MapReduce In Practice
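The counting problem above can be sketched as map, shuffle, and reduce steps in plain Python. This is an illustration of the model only, not Hadoop code and not the example shown in the talk; the attendee list and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical attendee list: (name, primary language). Not from the talk.
ATTENDEES = [
    ("Ana", "Java"),
    ("Bo", "C++"),
    ("Cy", "C"),
    ("Di", "Java"),
    ("Ed", "C++"),
    ("Flo", "Java"),
]


def map_phase(attendee):
    """Map: emit one (language, 1) pair per attendee."""
    _name, language = attendee
    yield (language, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(language, counts):
    """Reduce: sum the counts collected for one language."""
    return (language, sum(counts))


def count_engineers(attendees):
    """Run map, shuffle, and reduce in sequence on a single machine."""
    mapped = (pair for attendee in attendees for pair in map_phase(attendee))
    grouped = shuffle(mapped)
    return dict(reduce_phase(lang, counts) for lang, counts in grouped.items())


if __name__ == "__main__":
    print(count_engineers(ATTENDEES))
```

On a real cluster the map calls run in parallel across nodes and the framework performs the shuffle; here everything runs in one process so the data flow is easy to follow.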
4
MapReduce In Practice
5
MapReduce In Practice
6
Summing it up
7
8
● One of the largest and most active clusters in
the world
● Hadoop 2.0
● 2000+ Nodes
● 50MM blocks of data
● Thousands of workflows
Scale
9
Project Takeout - Member Data Export
10
Project Takeout - Member Data Export
● 350MM LinkedIn Members
● Reads hundreds of Terabytes of data
● Built using Scalding
● Low-maintenance
11
Supported Development Platforms
12
ETL (Extract, Transform and Load)
13
● Project management for Hadoop jobs
● Hadoop job dependencies / job workflows
● Execution Tracker
● Job History
● Continuous job scheduler
Azkaban - Hadoop Project/Workflow Manager
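The job dependency bullet can be illustrated with a small sketch: given each job's dependencies, compute an order in which the jobs can run. This is not Azkaban's API or implementation; the workflow, job names, and function name are hypothetical, and the ordering uses Python's standard graphlib module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
WORKFLOW = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"extract"},
    "report": {"clean", "aggregate"},
}


def execution_order(workflow):
    """Return a job order in which every job runs after its dependencies.

    Raises ValueError if the dependencies contain a cycle, since such a
    workflow can never be scheduled.
    """
    try:
        return list(TopologicalSorter(workflow).static_order())
    except CycleError as err:
        raise ValueError(f"workflow has a dependency cycle: {err.args[1]}")


if __name__ == "__main__":
    print(execution_order(WORKFLOW))
```

In this sketch "clean" and "aggregate" depend only on "extract", so a scheduler could run them in parallel once "extract" finishes.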
14
Azkaban - Hadoop Project/Workflow Manager
15
Azkaban Execution/Logs
16
Hadoop DSL - Workflow
17
Hadoop DSL - Job File
18
Dr. Elephant - Analyze Jobs
19
Dr. Elephant - Analyze Problems
20
Production Workflows
21
● Contribute to Apache Hadoop
● Apache DataFu - http://datafu.incubator.apache.org/
● Azkaban - http://azkaban.github.io/
● Gobblin - https://github.com/linkedin/gobblin
● Pinot - https://github.com/linkedin/pinot
● Samza - http://samza.apache.org/
● Data@LinkedIn - http://data.linkedin.com
● Keep in touch - http://engineering.linkedin.com/
LinkedIn open source contributions