![Page 1: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/1.jpg)
Dylan Valerio
Hadoop Tech Share
![Page 2: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/2.jpg)
Agenda
• Overview
• Demo
• Applications
• Configuration
![Page 3: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/3.jpg)
Data!
• NYSE = 1 TB/day (10^12 bytes)
• FB = 10B photos = 1 PB (10^15 bytes)
• Ancestry.com = 2.5 PB
• Large Hadron Collider = 15 PB/year
Hadoop: The Definitive Guide
![Page 4: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/4.jpg)
Storage and Analysis
• Storage capacity has increased, but disk I/O speed has not kept pace.
• Disk failure is routine once many disks are in use.
• Analysis needs to read and combine large chunks of data.
• Network bandwidth is a limiting resource, especially for big data.
Hadoop: The Definitive Guide
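A rough calculation makes the I/O point concrete. The sketch below assumes 1 TB of data and a sustained transfer rate of 100 MB/s per disk; both figures are illustrative assumptions, not measurements from this talk.

```python
# Back-of-the-envelope: how long does it take to read a full data set?
# Assumed figures (illustrative only): 1 TB of data, 100 MB/s per disk.

TB = 10**12  # bytes
MB = 10**6   # bytes


def read_time_seconds(data_bytes, bytes_per_second, disks=1):
    """Time to read data_bytes spread evenly over `disks` drives
    that are all read in parallel."""
    return data_bytes / (bytes_per_second * disks)


single = read_time_seconds(1 * TB, 100 * MB)
parallel = read_time_seconds(1 * TB, 100 * MB, disks=100)

print(f"1 disk:    {single / 3600:.1f} hours")     # 2.8 hours
print(f"100 disks: {parallel / 60:.1f} minutes")   # 1.7 minutes
```

Under these assumptions, spreading the data over many disks and reading them in parallel is what turns hours into minutes, which is the motivation for a distributed file system.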
![Page 5: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/5.jpg)
Hadoop
• Distributed computing framework to process large amounts of data.
– Accessible – data is replicated on large clusters of commodity machines
– Robust – assume disk failures
– Scalable – add nodes
– Simple – simple MapReduce code
Hadoop in Action
![Page 6: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/6.jpg)
HDFS
• Name Node – master of the HDFS. Holds the file system metadata and directs I/O. “The Index”
– Secondary Name Node – periodically checkpoints the Name Node's metadata (not a hot standby)
• Data Node – where actual data is stored and replicated
Hadoop In Action
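The split between the "index" and the stored blocks can be sketched with a toy model. This is not the real HDFS placement policy: the class name `ToyCluster`, the tiny block size, the replication factor of 3, and the round-robin placement are all simplifying assumptions made for illustration.

```python
# Toy model of the Name Node / Data Node split (not real HDFS behavior).

BLOCK_SIZE = 4    # bytes per block (tiny, for illustration)
REPLICATION = 3   # copies kept of every block (assumed)


class ToyCluster:
    def __init__(self, datanode_names):
        # Data Nodes hold the actual bytes: node -> {block_id: data}
        self.datanodes = {name: {} for name in datanode_names}
        # The Name Node holds only metadata: file -> [(block_id, [nodes])]
        self.namenode = {}
        self._next_block = 0

    def put(self, filename, data):
        names = list(self.datanodes)
        entries = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = self._next_block
            self._next_block += 1
            chunk = data[offset:offset + BLOCK_SIZE]
            # Pick REPLICATION distinct nodes, rotating the starting node.
            targets = [names[(block_id + i) % len(names)]
                       for i in range(REPLICATION)]
            for node in targets:
                self.datanodes[node][block_id] = chunk
            entries.append((block_id, targets))
        self.namenode[filename] = entries

    def get(self, filename, failed=()):
        # Ask the "index" where each block lives, then read from any
        # replica whose node has not failed.
        out = b""
        for block_id, nodes in self.namenode[filename]:
            alive = [n for n in nodes if n not in failed]
            if not alive:
                raise IOError(f"block {block_id} lost")
            out += self.datanodes[alive[0]][block_id]
        return out


cluster = ToyCluster(["dn1", "dn2", "dn3", "dn4"])
cluster.put("/logs/a.txt", b"hello hadoop")
# Still readable with two of four nodes down: each block has 3 copies.
print(cluster.get("/logs/a.txt", failed={"dn1", "dn2"}))  # b'hello hadoop'
```

The point of the sketch is that the metadata lookup and the block reads are separate steps, and that replication lets a read succeed when some nodes are unavailable.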
![Page 7: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/7.jpg)
MapReduce
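• MR is a programming framework that processes data as key-value pairs.
• The mapper turns each input record into intermediate key-value pairs; the reducer aggregates all values that share a key.
• Mappers and reducers do not communicate directly; the framework sorts and shuffles map output to the reducers.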
Hadoop: The Definitive Guide
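The map, shuffle, and reduce phases can be imitated in a single process. The sketch below is not Hadoop's API: `run_mapreduce`, `wc_mapper`, and `wc_reducer` are hypothetical names, and everything runs in memory on one machine.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process imitation of the map, shuffle, and reduce phases."""
    # Map: each input record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle: the framework, not the mapper, groups values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: each key and its list of values is reduced independently.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def wc_mapper(line):
    return [(word.lower(), 1) for word in line.split()]


def wc_reducer(word, counts):
    return sum(counts)


lines = ["Hadoop stores data", "Hadoop processes data"]
result = run_mapreduce(lines, wc_mapper, wc_reducer)
print(result)
# {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

Because the mapper and reducer only see their own inputs, each call could in principle run on a different machine; the grouping step in the middle is the part the framework provides.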
![Page 8: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/8.jpg)
Huy Vo, NYU
![Page 9: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/9.jpg)
Jobs
• Map tasks and reduce tasks are assigned throughout the cluster.
• Job Tracker manages the status of the job(s).
• Each Task Tracker manages the tasks assigned to it.
Hadoop: The Definitive Guide
![Page 10: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/10.jpg)
Architecture
Hadoop in Action
![Page 11: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/11.jpg)
Demo
• Word Count – the “Hello World” of MapReduce.
• Distributed GREP – Filtering Pattern
• Top Child Star – Summarization Pattern
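The grep and "top" demos can be approximated in plain Python. This is a sketch, not the demo code: `grep_mapper` and `top_key` are hypothetical names, and the sample records are made up to stand in for the demo data.

```python
import re
from collections import defaultdict


def grep_mapper(line, pattern):
    """Distributed grep is map-only: emit the line if it matches."""
    return [line] if re.search(pattern, line) else []


def top_key(pairs):
    """Summarization: sum values per key, then keep the key with the largest total."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return max(totals.items(), key=lambda kv: kv[1])


lines = ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]
matches = [hit for line in lines for hit in grep_mapper(line, r"ERROR")]
print(matches)  # ['ERROR disk full', 'ERROR timeout']

# Hypothetical (name, appearances) records standing in for the demo data.
credits = [("alice", 3), ("bob", 5), ("alice", 4)]
print(top_key(credits))  # ('alice', 7)
```

Grep needs no reducer because each line is kept or dropped on its own, while the "top" job needs a reduce step to see all values for a key before picking a winner.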
![Page 12: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/12.jpg)
Bit of History
• Doug Cutting and the Apache Lucene Team
• Google File System (2003) and MapReduce (2004) papers
• Cutting joined Yahoo! (2006).
• Yahoo announced its search index was being processed by a 10,000-core Hadoop cluster. (2008)
Hadoop: The Definitive Guide
![Page 13: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/13.jpg)
Hadoop Stack
• Core – I/O, serialization, Java RPC
• Avro – data serialization
• MapReduce
• HDFS
• Pig – higher-level data flow language for exploring large datasets on HDFS & MR clusters
• HBase – distributed column-oriented DB
• ZooKeeper – distributed coordination service
• Hive – distributed data warehouse + SQL-like query
• Chukwa – data collection and reports
• Mahout – collection of ML algorithms that run on Hadoop clusters
Hadoop: The Definitive Guide
![Page 14: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/14.jpg)
![Page 15: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/15.jpg)
Configuration Checklist
• Rack management
• Java Installation
• Hadoop download and shell environment tweaking
• SSH + VIM
• Default configuration files:
– core-site.xml, hdfs-site.xml, mapred-site.xml
• Formatting the HDFS
• start-all.sh
![Page 16: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/16.jpg)
Hadoop Shell Commands
• hadoop fs -ls
• hadoop jar <jar file> <main class> <input params>
![Page 17: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/17.jpg)
Web-Based Cluster UI
• localhost:50070 – DFS administration (Name Node)
• localhost:50030 – Job administration (Job Tracker)
![Page 18: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/18.jpg)
Hadoop for other languages
• Hadoop streaming uses Unix standard streams.
– So you can use bash scripts, Ruby, Python, etc.
• Hadoop Pipes is a C++ interface to MR.
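With streaming, the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key-value lines to standard output. A minimal sketch of both roles in one script follows; the script layout and the `map` / `reduce` command-line argument are assumptions for illustration, not a required convention.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Hypothetical usage: `script.py map` or `script.py reduce`.
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    if role == "map":
        for out in map_lines(sys.stdin):
            print(out)
    elif role == "reduce":
        for out in reduce_lines(sys.stdin):
            print(out)
```

The same script can be checked locally, without a cluster, by piping a file through the map role, then `sort`, then the reduce role; the `sort` step stands in for the framework's shuffle.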
![Page 19: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/19.jpg)
Benefits to AC Technologies Discussion
![Page 20: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/20.jpg)
Report Generation
• Suppose we have HBase for:
– High Availability: Distributed DB
– Partition Tolerance: Auto-Sharding
– Scalability: Horizontal Scaling
• Then, common scenarios will be:
– Service Management & Monitoring:
• Partitioning by month
• Binning by functional category
• Sampling by file status
– Harvest DB
• Top harvested files per day, per site
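The "top harvested files per day, per site" report can be sketched as a count followed by a regroup. The record layout `(day, site, filename)` and the function name `top_files` are hypothetical; the actual Harvest DB schema is not described here.

```python
from collections import Counter, defaultdict


def top_files(records, n=1):
    """records: iterable of (day, site, filename) harvest events."""
    # Step 1: count events per (day, site, filename).
    counts = Counter(records)
    # Step 2: regroup by (day, site) and keep the n most frequent files.
    grouped = defaultdict(list)
    for (day, site, filename), count in counts.items():
        grouped[(day, site)].append((filename, count))
    return {
        key: sorted(files, key=lambda fc: (-fc[1], fc[0]))[:n]
        for key, files in grouped.items()
    }


# Made-up sample events for illustration.
events = [
    ("2012-05-01", "siteA", "a.pdf"),
    ("2012-05-01", "siteA", "a.pdf"),
    ("2012-05-01", "siteA", "b.pdf"),
    ("2012-05-01", "siteB", "c.pdf"),
]
print(top_files(events))
# {('2012-05-01', 'siteA'): [('a.pdf', 2)], ('2012-05-01', 'siteB'): [('c.pdf', 1)]}
```

In MapReduce terms, the composite `(day, site)` key plays the role of the reduce key, so each day and site pair can be summarized independently.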
![Page 21: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/21.jpg)
Log Mining
• Suppose we have a common repository for all log files (Zenoss)
– Exception counting (WARN – FATAL level)
– Info-level reporting
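Exception counting maps naturally onto the word count shape: the key is the log level instead of the word. The sketch below assumes a log4j-style line layout of `<date> <time> <LEVEL> <message>`; the real layout of the collected logs may differ, and the sample lines are made up.

```python
from collections import Counter

# Levels of interest, per the WARN to FATAL range above.
LEVELS = {"WARN", "ERROR", "FATAL"}


def level_mapper(line):
    """Emit (level, 1) when the third field is a level we care about."""
    parts = line.split(None, 3)
    if len(parts) >= 3 and parts[2] in LEVELS:
        return [(parts[2], 1)]
    return []


def count_levels(lines):
    """Reduce step: sum the ones emitted for each level."""
    totals = Counter()
    for line in lines:
        for level, one in level_mapper(line):
            totals[level] += one
    return dict(totals)


sample = [
    "2012-05-01 10:00:01 INFO started",
    "2012-05-01 10:00:02 WARN slow response",
    "2012-05-01 10:00:03 ERROR connection reset",
    "2012-05-01 10:00:04 WARN retrying",
    "2012-05-01 10:00:05 FATAL out of memory",
]
print(count_levels(sample))  # {'WARN': 2, 'ERROR': 1, 'FATAL': 1}
```

Checking the level by field position rather than searching the whole line avoids counting an INFO line whose message happens to contain the word "ERROR".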
![Page 22: Hadoop tech share](https://reader033.vdocuments.site/reader033/viewer/2022052508/559a73091a28ab07548b474b/html5/thumbnails/22.jpg)
Analytics-Driven Decision Making
• Application Influence Mining through Akamai Logs
– Prevalent and isolated applications over the whole client base
– Prevalent and isolated applications over a single organization
– Projections of Application Influence over time