Big Data Engineering - Top 10 Pragmatics
DESCRIPTION
Very high level, but covers all the essentials. Slides of my talk at the Naval Postgraduate School, Monterey.
TRANSCRIPT
![Page 1: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/1.jpg)
Krishna Sankar, http://doubleclix.wordpress.com
EC4000–PhD Guest Seminar, Naval Postgraduate School
April 27, 2012
The road lies plain before me;--'tis a theme
Single and of determined bounds; …
- Wordsworth, The Prelude
![Page 2: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/2.jpg)
What is Big Data?
Big Data to smart data
Big Data Pipeline
Analytic Algorithms
Storage - NOSQL
Processing - Hadoop
Cloud Architectures
Analytics/Modeling
R
Visualization
o Agenda
o To cover the broad picture
o Touch upon instances of the technologies employed
o Of the Big Data domain …
![Page 3: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/3.jpg)
Thanks to …
The giants whose shoulders I am standing on
Special Thanks to:
Peter Ateshian, NPS
Prof Murali Tummala, NPS
Shirley Bailes, O’Reilly
Ed Dumbill, O’Reilly
Jeff Barr, AWS
Jenny Kohr Chynoweth, AWS
![Page 4: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/4.jpg)
Porcelain vs. Plumbing
• The balance is always interesting …
• This talk has both
• Would be happy to dive deep into plumbing topics like Hadoop, R, MongoDB, Cassandra et al…
![Page 5: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/5.jpg)
① Volume
o Scale
② Velocity
o Data change rate vs. decision window
③ Variety
o Different sources & formats
o Structured vs. Unstructured
④ Variability
o Breadth of interpretation &
o Depth of analytics
EBC322
http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
http://www.quora.com/Business-Intelligence/What-is-the-future-of-business-intelligence
![Page 6: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/6.jpg)
![Page 7: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/7.jpg)
![Page 8: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/8.jpg)
![Page 9: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/9.jpg)
① Volume
o Scale
② Velocity
o Data change rate vs. decision window
③ Variety
o Different sources & formats
o Structured vs. Unstructured
④ Variability
o Breadth of interpretation &
o Depth of analytics
⑤ Contextual
o Dynamic variability
o Recommendation
⑥ Connectedness
EBC322
http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
![Page 10: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/10.jpg)
• “… they didn’t need a genius, … but build the world’s most impressive dilettante … battling the efficient human mind with spectacular flamboyant inefficiency” – Final Jeopardy by Stephen Baker
• 15 TB memory, across 90 IBM Power 750 servers, in 10 racks
• 1 TB of dataset
• 200 Million pages processed by Hadoop
• This is a good example of Connected data
– Contextual w/ variability
– Breadth of interpretation
– Analytics depth
http://doubleclix.wordpress.com/2011/03/01/the-education-of-a-machine-%E2%80%93-review-of-book-%E2%80%9Cfinal-jeopardy%E2%80%9D-by-stephen-baker/
http://doubleclix.wordpress.com/2011/02/17/watson-at-jeopardy-a-race-of-machines/
![Page 11: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/11.jpg)
Ref: http://www.ciol.com/News/News/News-Reports/Vinod-Khosla%E2%80%99s-cool-dozen-tech-innovations/156307/0/
http://yourstory.in/2011/11/vinod-khoslas-keynote-at-nasscom-product-conclave-reject-punditry-believe-in-an-idea-take-risk-and-succeed/
![Page 12: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/12.jpg)
Volume
Velocity
Variety
Variability
Connectedness
Context
Model
Infer-ability
Decomplexify! Contextualize! Network! Reason! Infer!
Logs, Scribe, Flume, Storm, Hadoop…
SQL, NOSQL, HDFS, XML, files, …
SQL, BI Tools, Hadoop, Pig, Hive, .NET Dryad, Various other tools
Internal dashboards, Tableau
Ref: http://goo.gl/Mm83k
Hand coded Programs, R, Mahout, …
![Page 13: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/13.jpg)
Twitter
§ 200 million tweets/day
§ Peak 10,000/second
§ How would you handle the fire hose for social network analytics?
http://goo.gl/dcBsQ
Storage
§ 4U box = 40 TB
§ 1 PB = 25 boxes!
Zynga
§ “Analytics company, not a gaming company!”
§ Harvests data: 15 TB/day
§ Test new features
§ Target advertising
§ 230 million players/month
AWS – 900 Billion objects!
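The figures on this slide invite some back-of-envelope arithmetic. Below is a minimal sketch that uses only the numbers quoted above. It assumes decimal units (1 PB = 1,000 TB) and no replication overhead, and the function names are illustrative.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(events_per_day: int) -> float:
    """Average event rate implied by a daily total."""
    return events_per_day / SECONDS_PER_DAY


def boxes_needed(total_tb: float, tb_per_box: float) -> int:
    """Whole storage boxes needed to hold total_tb of raw data."""
    return math.ceil(total_tb / tb_per_box)


# Twitter: 200 million tweets/day, peak 10,000/second (slide figures).
avg_tweets = average_rate_per_second(200_000_000)
peak_to_average = 10_000 / avg_tweets

# Storage: one 4U box = 40 TB; assumes 1 PB = 1,000 TB, no replication.
boxes_per_pb = boxes_needed(1_000, 40)

print(f"average tweets/second: {avg_tweets:,.0f}")
print(f"peak-to-average ratio: {peak_to_average:.1f}")
print(f"boxes per PB: {boxes_per_pb}")
```

The average works out to roughly 2,300 tweets/second, so the quoted peak is a little over four times the average. A fire-hose consumer has to be sized for the peak, not the mean.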
![Page 14: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/14.jpg)
• 6 Billion Messages per day
• 2 PB (w/compression) online
• 6 PB w/ replication
• 250 TB/Month growth
• HBase Infrastructure
![Page 15: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/15.jpg)
Ref: http://www.hpts.ws/sessions/2011HPTS-TomFastner.pdf
Path Analysis
A/B Testing
50 TB/Day
240 nodes, 84 PB Teradata Installation
Very systematic
Diagram speaks volumes!
eBay Extreme Analytics Architecture
![Page 16: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/16.jpg)
Splunk Scribe Flume Storm
Collect
NOSQL Cassandra MongoDB HBase Neo4j
Store
Hadoop Pig/Hive
R
Transform & Analyze
R Mahout BI Tools
Model & Reason
D3.js Tableau
Dashboard
Predict, Recommend & Visualize
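The stages above (Collect, Store, Transform & Analyze, Model & Reason, Predict/Recommend & Visualize) can be sketched as a chain of small functions. This is a toy, single-process illustration, not how Flume, HBase or Mahout work. The log format, the sample data and the "above average" rule are all assumptions made for the sketch.

```python
from collections import Counter
from statistics import mean

# Hypothetical log lines; the format is an assumption made for this sketch.
RAW_LOGS = [
    "2012-04-27 user=u1 action=view item=a",
    "2012-04-27 user=u2 action=buy item=a",
    "2012-04-27 user=u1 action=buy item=b",
    "2012-04-27 user=u3 action=buy item=a",
]


def collect(lines):
    """Collect: parse raw log lines into key/value records."""
    for line in lines:
        _, *fields = line.split()  # drop the date, keep key=value fields
        yield dict(field.split("=", 1) for field in fields)


def store(records):
    """Store: a plain list stands in for a NOSQL store."""
    return list(records)


def transform(table):
    """Transform & Analyze: count purchases per item."""
    return Counter(r["item"] for r in table if r["action"] == "buy")


def model(purchases):
    """Model & Reason: keep items that sell above the average."""
    threshold = mean(purchases.values())
    return [item for item, n in purchases.items() if n > threshold]


def recommend(popular_items):
    """Predict, Recommend & Visualize: a one-line text 'dashboard'."""
    return "Recommended: " + ", ".join(sorted(popular_items))


summary = recommend(model(transform(store(collect(RAW_LOGS)))))
print(summary)
```

Each stage only depends on the output of the previous one, which is what lets real deployments swap a tool at one stage without rewriting the rest.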
When I think of my own native land,
In a moment I seem to be there;
But, alas! recollection at hand
Soon hurries me back to despair.
- Cowper, The Solitude of Alexander Selkirk
![Page 17: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/17.jpg)
Key Value Column Document Graph
NOSQL
Neo4j
FlockDB
InfiniteGraph
CouchDB
MongoDB
Lotus Domino
Riak
Google BigTable
HBase
Cassandra
HyperTable
In-memory
Disk Based
SimpleDB
Memcached
Redis
Tokyo Cabinet
Dynamo
Voldemort Azure TS
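To make the four NOSQL families concrete, here is a rough sketch of how one small record might be shaped under each model. Plain Python structures stand in for the stores and no real database API is used. The record, the field names and the `followers_of` helper are hypothetical.

```python
# One user ("u1", who follows "u2") expressed in each of the four models.

# Key-value: an opaque value looked up by a single key.
key_value = {"user:u1": '{"name": "Ada", "city": "Monterey"}'}

# Column (wide-column): row key -> column family -> column -> value.
column = {"u1": {"profile": {"name": "Ada", "city": "Monterey"}}}

# Document: a self-describing, nestable record that can be queried by field.
document = {"_id": "u1", "name": "Ada", "city": "Monterey", "follows": ["u2"]}

# Graph: nodes plus explicit, typed edges between them.
graph = {
    "nodes": {"u1": {"name": "Ada"}, "u2": {"name": "Grace"}},
    "edges": [("u1", "FOLLOWS", "u2")],
}


def followers_of(g: dict, node_id: str) -> list:
    """Scan the edge list for nodes that follow node_id."""
    return [src for src, rel, dst in g["edges"]
            if rel == "FOLLOWS" and dst == node_id]


print(followers_of(graph, "u2"))
```

The practical difference is what the store can see: a key-value store treats the value as a blob, while document, column and graph stores expose progressively more structure to queries.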
![Page 18: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/18.jpg)
MapReduce
• Data parallelism
• Large Installations (many ~5000 node clusters!)
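The data-parallel idea behind MapReduce can be shown with the usual word-count example. This is a single-process sketch in plain Python, not the Hadoop API. The function names are illustrative, and the "workers" simply run one after another.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group emitted values by key, as the framework does between phases."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: list[str]) -> dict[str, int]:
    # Each document is mapped independently of the others; that independence
    # is what lets a cluster spread the map work across many nodes.
    mapped = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data", "smart data", "big data pipeline"])
print(counts)
```

Because map calls share no state and each reduce call sees only one key, both phases can be split across machines. That is the property that makes ~5000-node clusters useful.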
![Page 19: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/19.jpg)
Infrastructure As A Service
Platform As A Service
Software As A Service
![Page 20: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/20.jpg)
![Page 21: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/21.jpg)
Amazon – Canonical Cloud
• S3 – Blob storage
• DynamoDB – NOSQL
• EMR – Elastic MapReduce
• EC2 – Compute
• 1% of Internet traffic
http://blog.deepfield.net/2012/04/18/how-big-is-amazons-cloud/
“Scalability is about building wider roads, not about building faster cars” – Steve Swartz
![Page 22: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/22.jpg)
http://www.slideshare.net/AmazonWebServices/keynote-your-future-with-cloud-computing-dr-werner-vogels-aws-summit-2012-nyc
![Page 23: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/23.jpg)
http://openclipart.org/detail/152311/internet-cloud-by-b.gaultier
http://openclipart.org/detail/17847
EC2
EC2
![Page 24: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/24.jpg)
• Social Network Analysis
• Sentiment Analysis
• Brand Strength
• Citation/co-citation ≅ Followed by/Also Follows
• Metrics
– Network diameter,
– Weak-ties,
– Erdős–Rényi model &
– Kronecker Graphs
Tweets Followers Follow/Unfollow
http://www.oscon.com/oscon2012/public/schedule/detail/23130
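Two of the metrics named on this slide are easy to sketch in plain Python: an Erdős–Rényi G(n, p) random graph, and network diameter computed by breadth-first search. The sketch assumes an undirected graph, so follow relationships are treated as symmetric. The function names and the sample graphs are illustrative, not taken from the talk.

```python
import random
from collections import deque


def erdos_renyi(n: int, p: float, seed: int = 0) -> dict[int, set[int]]:
    """G(n, p): each of the n*(n-1)/2 possible edges exists with probability p."""
    rng = random.Random(seed)
    adj: dict[int, set[int]] = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def eccentricity(adj: dict[int, set[int]], source: int) -> int:
    """Longest shortest-path distance from source to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter(adj: dict[int, set[int]]) -> int:
    """Largest eccentricity; on a disconnected graph only reachable pairs count."""
    return max(eccentricity(adj, v) for v in adj)


# A hand-built chain 0-1-2-3: the two ends are three hops apart.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(diameter(path))

random_graph = erdos_renyi(n=50, p=0.1)
print(diameter(random_graph))
```

Running one BFS per node costs O(V·(V+E)), which is fine for a sketch but not for a follower graph with millions of nodes. At that scale the diameter is usually estimated by sampling.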
![Page 25: Big Data Engineering - Top 10 Pragmatics](https://reader034.vdocuments.site/reader034/viewer/2022051609/547d6ca1b4af9fcf6a8b4771/html5/thumbnails/25.jpg)
Was it a vision, or a waking dream?
Fled is that music:—do I wake or sleep?
- Keats, Ode to a Nightingale