Here comes the flood
Tools for Big Data analytics
Guy Chesnot - June, 2012
Agenda
• Data flood
• Implementations
• Hadoop
• Not Hadoop
Forecast Data Growth Rates
Computationally Intensive Distributed Data Analytics
[Figure: Real Time, CEP, OLTP, DW and Hadoop placed along an axis running from Structured Data to Unstructured Data; the other axes, Data volume and Processing complexity, run from Low to High]
Pride and Prejudice
Cloud = Hadoop = Big Data
Pride and Prejudice (cont.)
Cloud ≠ Hadoop ≠ Big Data
Implementation
• Several implementation levels
• Application level
• Hardware: disk arrays
• Software layer, close to OS: "Cloud" OS, File system manager, in-between
– Best choice
– Efficiency
– Feature rich
– Not easy to develop
Implementation (cont.)
• Software layer very successful
– Because of open-source Hadoop
– Other software products exist too
• Two main architectures at the file system manager level
– Centralized metadata service (as in popular parallel file systems, PFS): Hadoop
– Peer-to-peer model: the metadata base is fully distributed
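The two architectures above can be contrasted in a minimal Python sketch. It is illustrative only: the class names `CentralMetadataService` and `HashRing` are hypothetical, and consistent hashing is just one common way to distribute metadata among peers, not necessarily what any product mentioned in this talk uses.

```python
import bisect
import hashlib


class CentralMetadataService:
    """Centralized model: a single service holds the whole path-to-node map."""

    def __init__(self):
        self._locations = {}

    def register(self, path, node):
        self._locations[path] = node

    def locate(self, path):
        # Every client has to ask this one service where a file lives.
        return self._locations[path]


class HashRing:
    """Peer-to-peer model: any peer derives the owner from the path itself."""

    def __init__(self, nodes):
        self._ring = sorted((self._hash(node), node) for node in nodes)
        self._keys = [key for key, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def locate(self, path):
        # The first node clockwise from the path's hash owns its metadata.
        index = bisect.bisect_right(self._keys, self._hash(path)) % len(self._ring)
        return self._ring[index][1]


central = CentralMetadataService()
central.register("/logs/day1", "node-1")

nodes = ["node-1", "node-2", "node-3"]
ring_a = HashRing(nodes)
ring_b = HashRing(nodes)  # a second peer builds the same ring independently
```

In the centralized case the lookup table is a single point of coordination; in the peer-to-peer case two peers that know the same node list reach the same answer without consulting anyone.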
Hadoop is a widely used technology for Big Data processing
Advantages
• Economics
• Flexibility
• Scalability
Challenges
• Raw Technology
• Complexity of deployment
• Requires significant resources
• No packaged Applications
Rapid Adoption
• Yahoo!, Facebook, eBay, Twitter
• JPMC, Schwab
• GAP, Walmart
• CIA
• Many more...
Hadoop adoption impetus is greatest when projects combine “Big Analytics” (fast, comprehensive analysis of complex data) and massive, unstructured data sets.
Source: Karmasphere & Booz Allen Hamilton
Hadoop is not a new model: Hadoop and SIMD
[Figure: SIMD diagram showing one Control Unit and four Compute Units, with labels Data in memory, Processor, and Instructions in memory]
Hadoop uses MapReduce to bring processing to data
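As an illustration of the MapReduce model named on this slide, here is a minimal single-process word count in Python. It is a sketch of the programming model only, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` and the sample partitions are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def run_job(partitions):
    mapped = []
    # On a real cluster each partition would be mapped on a node that
    # already stores it; here the partitions are simply processed in turn.
    for partition in partitions:
        for record in partition:
            mapped.extend(map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


partitions = [["big data flood", "data wave"], ["flood of data"]]
counts = run_job(partitions)
```

Only the small (word, count) pairs move between phases; the input records themselves stay in their partitions, which is the sense in which processing is brought to the data.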
Hardware Hadoop implementation: Servers
Excerpt from the Intel whitepaper “Optimizing Hadoop deployments”:
• To maximize the energy efficiency and performance of a Hadoop cluster, it is important to consider that Hadoop deployments do not require many of the features typically found in an enterprise data centre server.
Hardware Hadoop implementation: Network
World Record Benchmark - SGI Hadoop Cluster running Terasort
Terasort @ 100 GB scales super-linearly on a 20-node SGI Rackable™ C2005-TY6 cluster running the Cloudera distribution of Apache™ Hadoop™ (CDH3u0)
[Figure: Terasort Scaling: SGI Rackable C2005-TY6 Hadoop Cluster, 100 GB job size; x-axis: Number of Nodes (1 to 20); y-axis: Scaling; series: Terasort Scaling, Linear Scaling]
[Figure: Terasort Scaling on SGI vs. Sun Hadoop Cluster, 100 GB input data size; x-axis: Number of Nodes; y-axis: Scaling; series: Terasort Scaling SGI Rackable C2005-TY6 cluster, Terasort Scaling Sun X2270 M2 cluster, Linear Scaling]
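To make the super-linear scaling claim on this slide concrete, the sketch below assumes that the "Scaling" axis is the speedup relative to a single node, that is, one-node run time divided by n-node run time. The run times used are invented for illustration and are not the benchmark's measurements.

```python
def scaling(baseline_seconds, seconds):
    """Speedup relative to the single-node run (assumed meaning of 'Scaling')."""
    return baseline_seconds / seconds


def is_super_linear(nodes, baseline_seconds, seconds):
    """Scaling is super-linear when n nodes are more than n times faster."""
    return scaling(baseline_seconds, seconds) > nodes


# Hypothetical timings, NOT taken from the slide: 6000 s on 1 node, 250 s on 20 nodes.
speedup_20 = scaling(6000, 250)
super_linear_20 = is_super_linear(20, 6000, 250)
```

With these made-up numbers the 20-node speedup is 24, which lies above the linear-scaling line (20) and would therefore count as super-linear.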
Choosing the right Application for Hadoop
• Applications need to be written to scale to hundreds and thousands of nodes, and should support tens of millions of files in a single HDFS instance.
• Applications need streaming access to their data sets: they are designed for batch processing rather than interactive use, and need high-throughput data access rather than low-latency data access.
• Applications need a write-once-read-many access model for files.
• Applications need to be able to run in a Java MapReduce framework, and to use HDFS interfaces to move themselves closer to where the data is located.
• Applications need the ability to process unstructured and semi-structured data or information.
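The requirement that applications move closer to where the data is located can be sketched as a locality-aware task assignment. This is a deliberately simple greedy illustration, not Hadoop's scheduler; `assign_tasks`, the block identifiers and the node names are all hypothetical.

```python
def assign_tasks(block_replicas, free_slots):
    """Assign one task per block, preferring a node that stores a replica.

    block_replicas: {block_id: [nodes holding a replica]}
    free_slots:     {node: number of free task slots}
    Returns {block_id: (node, is_local)}.
    """
    slots = dict(free_slots)  # work on a copy, leave the caller's dict untouched
    assignment = {}
    for block, replicas in block_replicas.items():
        local = [node for node in replicas if slots.get(node, 0) > 0]
        if local:
            node = local[0]  # data-local: the task reads from its own disks
        else:
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free task slot left in the cluster")
            node = candidates[0]  # remote: the block must cross the network
        slots[node] -= 1
        assignment[block] = (node, node in replicas)
    return assignment


replicas = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n3"]}
free = {"n1": 1, "n2": 1, "n3": 0, "n4": 1}
plan = assign_tasks(replicas, free)
```

In this example only `b1` runs data-local; `b2` and `b3` fall back to remote nodes because the nodes holding their replicas have no free slot. A greedy pass like this is not optimal (placing `b1` on `n2` would have kept `n1` free for `b2`), which hints at why real schedulers are more elaborate.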
Who is/will be using Hadoop
• Bioscience: pharmacological trials produce massive amounts of data to validate complex interactions of molecular and experimental data.
• Financial services have larger volumes through smaller trading sizes, increased market volatility, and improvements in automated and algorithmic trading. Fraud detection analyzes otherwise unrecognizable patterns and data relationships.
• Science and research are increasingly dominated by initiatives with large data volumes:
– The Large Hadron Collider [LHC] at CERN generates over 15 PB of data per year. The data must be distributed to be retained and processed.
– Continental-scale experiments and environmental monitoring are both politically and technologically feasible (e.g., Ocean Observatories Initiative [OOI], National Ecological Observatory Network [NEON], and USArray, a continental-scale seismic observatory).
– Instrument and sensor technology is improving (e.g., the Large Synoptic Survey Telescope [LSST] has a 3.2 Gpixel camera and will generate over 6 PB of image data per year).
• Retailers collect clickstream data from website interactions and data from traditional retailing operations for customer buying analysis and inventory management.
• Government and military agencies collect and process massive amounts of raw data from a wide variety of sources to arrive at actionable intelligence.
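The yearly volumes quoted for the LHC and the LSST can be turned into average sustained data rates with a little arithmetic. The sketch below assumes decimal petabytes (10^15 bytes) and a 365-day year, and it spreads the volume evenly over the year, which real instruments do not do; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year
PETABYTE = 10 ** 15                    # assumes decimal petabytes
MEGABYTE = 10 ** 6


def sustained_rate_mb_per_s(petabytes_per_year):
    """Average rate in MB/s needed to absorb a given yearly volume."""
    return petabytes_per_year * PETABYTE / SECONDS_PER_YEAR / MEGABYTE


lhc_rate = sustained_rate_mb_per_s(15)   # "over 15 PB of data per year"
lsst_rate = sustained_rate_mb_per_s(6)   # "over 6 PB of image data per year"
```

Under these assumptions 15 PB per year averages out to roughly 476 MB/s and 6 PB per year to roughly 190 MB/s, around the clock.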
Some Hadoop users
SGI Hadoop customer: regular configuration
• The Solution:
– 720-node C2005 compute platform
– SGI Management Center
– Cloudera Hadoop Distribution
– Arista 10GbE IP Switching
– SMC for control of software images
– Ability to monitor and manage power
[Figure: cluster diagram labelled Secondary Name Node, Name Node, Job Tracker, Task Trackers / Data Nodes]
SGI Hadoop customer: not your vanilla Hadoop hardware architecture
• The Solution:
– SGI ICE X
– SGI disk array
– InfiniBand fabric
[Figure: cluster diagram labelled Secondary Name Node, Name Node, Job Tracker, Task Trackers / Data Nodes, InfiniBand fabric]
Big Data analytics without Hadoop
• Fraud detection
• Large memory server
Istituto Nazionale della Previdenza Sociale
Big Data analytics without Hadoop
• Wikipedia’s view of: history, persons, categories, organizations
• Entire edition of English Wikipedia
• Metadata and data
– 4 million pages
– Connections among them
• Some kind of Google Earth view of Big Data
Big Data analytics without Hadoop (cont.)
Big Data analytics without data storage!
• IP packets analysis
• Real-time security enforcement
• Check and let the packets flow
• Extract relevant metadata for later analysis
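The check-and-forward pattern in the bullets above can be sketched as a Python generator that inspects each packet once, lets allowed packets flow on, and keeps only a small metadata record. Everything here is a hypothetical illustration: the `Packet` shape, the `enforce` function and the blocked-port policy are assumptions, not a description of the system presented in the talk.

```python
from collections import namedtuple

Packet = namedtuple("Packet", ["src", "dst", "port", "payload"])


def enforce(packets, metadata_sink, blocked_ports=frozenset({23})):
    """Forward allowed packets and record metadata; payloads are never stored."""
    for packet in packets:
        allowed = packet.port not in blocked_ports
        metadata_sink.append({
            "src": packet.src,
            "dst": packet.dst,
            "port": packet.port,
            "bytes": len(packet.payload),
            "allowed": allowed,
        })
        if allowed:
            yield packet  # the packet flows on as soon as it has been checked


traffic = [
    Packet("10.0.0.1", "10.0.0.9", 443, b"hello"),
    Packet("10.0.0.2", "10.0.0.9", 23, b"login attempt"),
]
metadata = []
forwarded = list(enforce(traffic, metadata))
```

Because `enforce` is a generator, packets are handled one at a time as they arrive; the only thing retained for later analysis is the metadata list, which is far smaller than the traffic itself.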
Big Data analytics without data storage!
• Set-top box events analysis
• Real-time latency
• Analysis performance
Big Data and Cloud with data management
• Cloud for backup/restore, archival
• HDFS only
• Scalability
• Low cost
– Purchase
– TCO
The Data Wave
Stages: Data ingest, Hadoop, Analytics and Visualization, Archiving
• The wave arrives
– Single large memory server
– Numerous regular servers
• Focus the wave
– Hadoop Clusters
• Processing eddies
– Misc servers
• Store the value
– Dense disk arrays
– Archival solution
Big Data Dream Team
• Unstructured data: Rackable
• Structured data: SGI UV
• Archival: ArcFiniti
• Cloud storage: SGI MIS