data center computing for data science: an evolution of machines, middleware, math, and mesos
DESCRIPTION
Guest lecture 2013-08-27 at General Assembly in SF for the Data Science program taught by Jacob Bollinger and Thomson Nguyen https://generalassemb.ly/education/data-science/san-francisco Many thanks to Thomson, Jacob, and the participants in the course. Excellent Q&A! Received a bottle o' Cardhu (my fave Scotch) in payment for lecture, and since it's Burning Man Week, the city was emptied so we had enough to share with the class :) Evidence: https://plus.google.com/u/0/110794698656267747127/posts/GvjhhQ99CTs
TRANSCRIPT
General Assembly SF, 2013-08-27:
“Data Center Computing for Data Science: an evolution of machines, middleware, math, and Mesos”
Learnings generalized from trends in Data Science:
a 30-year retrospective on Machine Learning,
a 10-year summary of Leading Data Science Teams,
and a 2-year survey of Enterprise Use Cases
Paco Nathan @pacoid
Chief Scientist, Mesosphere
1Saturday, 31 August 13
Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
GA/SF, 2013-08-27
employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables
this approach attempts to understand not just problems and solutions, but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering…
programmers typically don’t think this way… however, both systems engineers and data scientists must
Process Variation Data Tools
Statistical Thinking
Modeling
back in the day, we worked with practices based on data modeling
1. sample the data
2. fit the sample to a known distribution
3. ignore the rest of the data
4. infer, based on that fitted distribution
that served well with ONE computer, ONE analyst, ONE model… just throw away annoying “extra” data
circa late 1990s: machine data, aggregation, clusters, etc.
algorithmic modeling displaced the prior practices of data modeling
because the data won’t fit on one computer anymore
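The four data modeling steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synthetic population, the choice of a normal distribution, and the helper name `fit_and_infer` are all hypothetical and not from the lecture.

```python
import random
import statistics


def fit_and_infer(data, sample_size, threshold, seed=0):
    """Classic data modeling: sample, fit a known distribution, infer from the fit."""
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)                # 1. sample the data
    fitted = statistics.NormalDist.from_samples(sample)   # 2. fit to a known distribution
    # 3. the rest of the data is ignored from here on
    return 1.0 - fitted.cdf(threshold)                    # 4. infer, e.g. P(X > threshold)


# synthetic stand-in for "the data" (assumption: roughly normal values)
rng = random.Random(42)
population = [rng.gauss(100.0, 15.0) for _ in range(100_000)]

p_tail = fit_and_infer(population, sample_size=500, threshold=130.0)
print(f"estimated P(X > 130) = {p_tail:.3f}")
```

Only 500 of the 100,000 values influence the answer, which is the "throw away the extra data" habit that algorithmic modeling displaced.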
Two Cultures
“A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.”
Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L
chronicled a sea change from data modeling (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization) which led in turn to the practice of leveraging inter-disciplinary teams
approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log files, etc., generally by socializing the problem
unfortunately, data-related budgets tend to go into frameworks that can only be used after clean up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to understand the audience and their priorities
‣ learn to socialize the problems, knocking down silos
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making process repeatable
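One of the skills listed above, estimating the confidence for reported results, can be sketched with a percentile bootstrap. This is a minimal illustration, not a method from the lecture; the data values and the helper name `bootstrap_ci` are made up.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, then read off the quantiles."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# hypothetical metric: session length in minutes for ten users
session_minutes = [3.1, 4.7, 2.2, 9.8, 5.5, 4.1, 3.9, 12.4, 6.0, 4.4]
low, high = bootstrap_ci(session_minutes)
print(f"mean = {statistics.mean(session_minutes):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Reporting the interval alongside the point estimate makes it clear how much the number could move with a different sample.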
What is needed most?
Unique Registration
Launched games lobby
NUI:TutorialMode
Birthday Message
Chat PublicRoom voice
Launched heyzap game
ConnectivityTest: test suite started
Create New Pet
Movie View Started: client, community
NUI:MovieMode
Buy an Item: web
Put on Clothing
Address space remaining: 512M
Customer Made Purchase Cart Page Step 2
Feed Pet
Play Pet
Chat Now
Edit Panel
Client Inventory Panel Flip Product Over
Add Friend
Open 3D Window
Change Seat
Type a Bubble
Visit Own Homepage
Take a Snapshot
NUI:BuyCreditsMode
NUI:MyProfileClicked
Address space remaining: 1G
Leave a Message
NUI:ChatMode
NUI:FriendsModedv
Website Login
Add Buddy
NUI:PublicRoomMode
NUI:MyRoomMode
Client Inventory Panel Remove Product
Client Inventory Panel Apply Product
NUI:DressUpMode
Team Process = Needs
discovery: help people ask the right questions
modeling: allow automation to place informed bets
integration: deliver data products at scale to LOB end uses
apps: build smarts into product features
systems: keep infrastructure running, cost-effective
analysts
engineers
inter-disciplinary leadership
Team Composition = Roles
Domain Expert: business process, stakeholder
Data Scientist: data prep, discovery, modeling, etc.
App Dev: software engineering, automation
Ops: systems engineering, availability
data science
introduced capability
leverage non-traditional pairing among roles, to complement skills and tear down silos
Team Composition = Needs × Roles
Alternatively, Data Roles × Skill Sets
Harlan Harris, et al.
datacommunitydc.org/blog/wp-content/uploads/2012/08/SkillsSelfIDMosaic-edit-500px.png
Analyzing the Analyzers
Harlan Harris, Sean Murphy, Marck Vaisman
O’Reilly, 2013
amazon.com/dp/B00DBHTE56
Learning Curves
difficulties in the commercial use of distributed systems often get represented as issues of managing complexity
much of the risk in managing a data science team is about budgeting for learning curve: some orgs practice a kind of engineering “conservatism”, with highly structured process and strictly codified practices – people learn a few things well, then avoid having to struggle with learning many new things perpetually…
that anti-pattern leads to big teams, low ROI
[figure axes: scale ➞, complexity ➞]
ultimately, the challenge is about
managing learning curves within
a social context
Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
Business Disruption through Data
Geoffrey Moore
Mohr Davidow Ventures, author Crossing The Chasm
@Hadoop Summit, 2012: what Amazon did to the retail sector… has put the entire Global 1000 on notice over the next decade… data as the major force… mostly through apps – verticals, leveraging domain expertise
Michael Stonebraker
INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc.
@XLDB, 2012: complex analytics workloads are now displacing SQL as the basis for Enterprise apps
Data Categories
Three broad categories of data
Curt Monash, 2010
dbms2.com/2010/01/17/three-broad-categories-of-data
• Human/Tabular data – human-generated data which fits into tables/arrays
• Human/Nontabular data – all other data generated by humans
• Machine-Generated data
let’s now add other useful distinctions:
• Open Data
• Curated Metadata
• A/D conversion for sensors (IoT)
Open Data notes
successful apps incorporate three components:
• Big Data (consumer interest, personalization)
• Open Data (monetizing public data)
• Curated Metadata
most of the largest Cascading deployments leverage some Open Data components: Climate Corp, Factual, Nokia, etc.
consider buildingeye.com, aggregate building permits:
• pricing data for home owners looking to remodel
• sales data for contractors
• imagine joining data with building inspection history, for better insights about properties for sale…
research notes about Open Data use cases: goo.gl/cd995T
Trends in Public Administration
late 1880s – late 1920s (Woodrow Wilson): as hierarchy, bureaucracy → only for the most educated, elite
late 1920s – late 1930s: as a business, relying on “Scientific Method”, gov as a process
late 1930s – late 1940s (Robert Dale): relationships, behavioral-based → policy not separate from politics
late 1940s – 1980s: yet another form of management → less “command and control”
1980s – 1990s (David Osborne, Ted Gaebler): New Public Management → service efficiency, more private sector
1990s – present (Janet & Robert Denhardt): Digital Age → transparency, citizen-based “debugging”, bankruptcies
Adapted from:
The Roles, Actors, and Norms Necessary to Institutionalize Sustainable Collaborative Governance
Peter Pirnejad
USC Price School of Policy
2013-05-02
Drivers, circa 2013
• governments have run out of money, cannot increase staff and services
• better data infra at scale (cloud, OSS, etc.)
• machine learning techniques to monetize
• viable ecosystem for data products, APIs
• mobile devices enabling use cases
Open Data ecosystem
municipal departments: e.g., Palo Alto, Chicago, DC, etc.
publishing platforms: e.g., Junar, Socrata, etc.
aggregators: e.g., OpenStreetMap, WalkScore, etc.
data product vendors: e.g., Factual, Marinexplore, etc.
end use cases: e.g., Facebook, Climate, etc.
Data feeds structured for public private partnerships
Open Data ecosystem – caveats for agencies
Required Focus
• respond to viable use cases
• not budgeting hackathons
Open Data ecosystem – caveats for publishers
Required Focus
• surface the metadata
• curate, allowing for joins/aggregation
• not scans as PDFs
Open Data ecosystem – caveats for aggregators
Required Focus
• make APIs consumable by automation
• allow for probabilistic usage
• not OSS licensing for data
Open Data ecosystem – caveats for data vendors
Required Focus
• supply actionable data
• track data provenance carefully
• provide feedback upstream, i.e., cleaned data at source
• focus on core verticals
Open Data ecosystem – caveats for end uses
Required Focus
• address consumer needs
• identify community benefits of the data
algorithmic modeling + machine data (Big Data) + curation, metadata + Open Data
⇒ data products, as feedback into automation
⇒ evolution of feedback loops
less about “bigness”, more about complexity
internet of things + A/D conversion + more complex analytics ⇒ accelerated evolution, additional feedback loops
⇒ orders of magnitude higher data rates
Recipes for Success
“A kind of Cambrian explosion”
source: National Geographic
Trendlines
Big Data? we’re just getting started:
• ~12 exabytes/day, jet turbines on commercial flights
• Google self-driving cars, ~1 Gb/s per vehicle
• National Instruments initiative: Big Analog Data™
• 1m resolution satellites skyboximaging.com
• open resource monitoring reddmetrics.com
• Sensing XChallenge nokiasensingxchallenge.org
consider the implications of Jawbone, Nike, etc., plus the effects of Google Glass…
technologyreview.com/...
Internet of Things
Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
in general, apps alternate between learning patterns/rules and retrieving similar things…
machine learning – scalable, arguably quite ad-hoc, generally “black box” solutions, enabling you to make billion dollar mistakes, with oh so much commercial emphasis (i.e., the “heavy lifting”)
statistics – rigorous, much slower to evolve, confidence and rationale become transparent, preventing you from making billion dollar mistakes, any good commercial project has ample stats work used in QA (i.e., “CYA, cover your analysis”)
once Big Data projects get beyond merely digesting log files, optimization will likely become the next overused buzzword :)
Learning Theory
Generalizations about Machine Learning…
great introduction to ML, plus a proposed categorization for comparing different machine learning approaches:
A Few Useful Things to Know about Machine Learning
Pedro Domingos, U Washington
homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
toward a categorization for Machine Learning algorithms:
• representation: classifier must be represented in some formal language that computers can handle (algorithms, data structures, etc.)
• evaluation: evaluation function (objective function, scoring function) is needed to distinguish good classifiers from bad ones
• optimization: method to search among the classifiers in the language for the highest-scoring one
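The three-part categorization above (representation, evaluation, optimization) can be made concrete with a deliberately tiny learner. This is a sketch for illustration only: the threshold classifier, the accuracy score, and the exhaustive search are choices made here, not examples taken from the Domingos paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdClassifier:
    """Representation: predict 1 when x >= threshold, else 0."""
    threshold: float

    def predict(self, x: float) -> int:
        return 1 if x >= self.threshold else 0


def accuracy(clf, xs, ys):
    """Evaluation: fraction of labeled examples classified correctly."""
    return sum(clf.predict(x) == y for x, y in zip(xs, ys)) / len(xs)


def fit(xs, ys):
    """Optimization: exhaustive search over candidate thresholds for the best score."""
    candidates = [ThresholdClassifier(t) for t in sorted(set(xs))]
    return max(candidates, key=lambda clf: accuracy(clf, xs, ys))


# made-up, linearly separable training data
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

best = fit(xs, ys)
print(best, accuracy(best, xs, ys))
```

Swapping any one of the three parts (a richer representation, a different score, a smarter search) yields a different learner while the other two stay unchanged.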
Something to consider about Algorithms…
many algorithm libraries used today are based on implementations from back when people used DO loops in FORTRAN, 30+ years ago
MapReduce is Good Enough?
Jimmy Lin, U Maryland
umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf
astrophysics and genomics are light years ahead of e-commerce in terms of data rates and sophisticated algorithms work, which – as Breiman suggested in 2001 – may take a few years to percolate into industry
other game-changers:
• streaming algorithms, sketches, probabilistic data structures
• significant “Big O” complexity reduction (e.g., skytree.net)
• better architectures and topologies (e.g., GPUs and CUDA)
• partial aggregates – parallelizing workflows
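The last item in the list above, partial aggregates, can be sketched as follows: each partition is summarized independently, and the summaries are merged with an associative operation. The chunk contents and function names are hypothetical.

```python
from functools import reduce


def summarize(chunk):
    """Map side: summarize one partition as (count, total, total of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def merge(a, b):
    """Combine two partial aggregates; associative and commutative, so order is free."""
    return tuple(x + y for x, y in zip(a, b))


def finalize(agg):
    """Turn the merged summary into the statistics of interest."""
    n, total, total_sq = agg
    mean = total / n
    variance = total_sq / n - mean * mean   # population variance
    return mean, variance


# stand-ins for data partitions that would live on different machines
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, variance = finalize(reduce(merge, map(summarize, chunks)))
print(mean, variance)
```

Because `merge` is associative, the summaries can be combined in any grouping, which is what lets a workflow run the `summarize` calls in parallel.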
Make It Sparse…
also, take a moment to check this out… (and related work on sparse Cholesky, etc.)
QR factorization of a “tall-and-skinny” matrix
• used to solve many data problems at scale, e.g., PCA, SVD, etc.
• numerically stable with efficient implementation on large-scale Hadoop clusters
suppose that you have a sparse matrix of customer interactions where there are 100MM customers, with a limited set of outcomes…
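The idea behind QR factorization of a tall-and-skinny matrix can be sketched in plain Python: factor each block of rows on its own, stack the small R factors, then factor that stack once more. This sketch assumes dense rows, full column rank in every block, and at least as many rows per block as columns. It uses modified Gram-Schmidt for brevity, which is not the numerically stable variant used in the work referenced on this slide.

```python
import random


def qr_r(rows):
    """Return the upper-triangular R of a thin QR factorization (modified Gram-Schmidt)."""
    n = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n)]   # column-major copy
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j):
            # project the current column onto the already-normalized column i
            r[i][j] = sum(q * v for q, v in zip(cols[i], cols[j]))
            cols[j] = [v - r[i][j] * q for q, v in zip(cols[i], cols[j])]
        r[j][j] = sum(v * v for v in cols[j]) ** 0.5
        cols[j] = [v / r[j][j] for v in cols[j]]
    return r


def tsqr_r(rows, block_size):
    """Tall-and-skinny QR: factor row blocks independently, then factor the stacked R's."""
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    stacked = [r_row for block in blocks for r_row in qr_r(block)]   # "map" step
    return qr_r(stacked)                                             # "reduce" step


# synthetic 40 x 3 matrix standing in for a tall-and-skinny data set
rng = random.Random(0)
a = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(40)]

r_direct = qr_r(a)
r_tsqr = tsqr_r(a, block_size=10)
max_diff = max(abs(x - y) for ra, rb in zip(r_direct, r_tsqr) for x, y in zip(ra, rb))
print(f"largest difference between the two R factors: {max_diff:.2e}")
```

Each block only ever produces a 3 x 3 matrix, so the amount of data passed between the two steps does not grow with the number of rows.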
cs.purdue.edu/homes/dgleich
stanford.edu/~arbenson
github.com/ccsevers/scalding-linalg
David Gleich, slideshare.net/dgleich
Sparse Matrix Collection
for those times when you really, really need a wide variety of sparse matrix examples…
University of Florida Sparse Matrix Collection
cise.ufl.edu/research/sparse/matrices/
Tim Davis, U Florida
cise.ufl.edu/~davis/welcome.html
Yifan Hu, AT&T Research
www2.research.att.com/~yifanhu/
A Winning Approach…
consider that if you know priors about a system, then you may be able to leverage low dimensional structure within high dimensional data… what impact does that have on sampling rates?
1. real-world data ⇒
2. graph theory for representation ⇒
3. sparse matrix factorization for production work ⇒
4. cost-effective parallel processing for machine learning app at scale
Just Enough Mathematics?
having a solid background in statistics becomes vital, because it provides formalisms for what we’re trying to accomplish at scale
along with that, some areas of math help – regardless of the “calculus threshold” invoked at many universities…
linear algebra: e.g., calculating algorithms for large-scale apps efficiently
graph theory: e.g., representation of problems in a calculable language
abstract algebra: e.g., probabilistic data structures in streaming analytics
topology: e.g., determining the underlying structure of the data
operations research: e.g., techniques for optimization … in other words, ROI
ADMM: a general approach for optimizing learners
Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
Stephen Boyd, Neal Parikh, et al., Stanford
stanford.edu/~boyd/papers/admm_distr_stats.html
“Throughout, the focus is on applications rather than theory, and a main goal is to provide the reader with a kind of ‘toolbox’ that can be applied in many situations to derive and implement a distributed algorithm of practical use. Though the focus here is on parallelism, the algorithm can also be used serially, and it is interesting to note that with no tuning, ADMM can be competitive with the best known methods for some problems.”
“While we have emphasized applications that can be concisely explained, the algorithm would also be a natural fit for more complicated problems in areas like graphical models. In addition, though our focus is on statistical learning problems, the algorithm is readily applicable in many other cases, such as in engineering design, multi-period portfolio optimization, time series analysis, network flow, or scheduling.”
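To give a feel for how ADMM splits work across machines, here is a minimal consensus sketch on a toy problem: each "worker" holds one number a_i and all of them must agree on the x that minimizes the sum of 0.5 * (x - a_i)^2, whose answer is simply the mean. The toy problem, parameter values, and function name are assumptions chosen for illustration; real uses substitute a more interesting local objective.

```python
def admm_consensus_mean(values, rho=1.0, iterations=60):
    """Consensus ADMM for: minimize sum_i 0.5 * (x - a_i)**2."""
    n = len(values)
    x = [0.0] * n      # local estimates, one per worker
    u = [0.0] * n      # scaled dual variables, one per worker
    z = 0.0            # shared consensus variable
    for _ in range(iterations):
        # x-update: each worker solves its own small problem independently
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(values, u)]
        # z-update: one averaging step gathers the local results
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # u-update: each worker accumulates its disagreement with the consensus
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z


local_data = [1.0, 2.0, 6.0]          # one value per hypothetical worker
consensus = admm_consensus_mean(local_data)
print(consensus)
```

Only the z-update needs communication; the x-update and u-update touch local data alone, which is the property that makes the method attractive for distributed learning.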
Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
Enterprise Data Workflows
middleware for Big Data applications is evolving, with commercial examples that include:
Cascading, Lingual, Pattern, etc. (Concurrent)
ParAccel Big Data Analytics Platform (Actian)
Anaconda supporting IPython Notebook, Pandas, Augustus, etc. (Continuum Analytics)
[workflow diagram: data sources → ETL / data prep → predictive model → end uses]
Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
ANSI SQL for ETL
J2EE for business logic
SAS for predictive models
SAS for predictive models, ANSI SQL for ETL: most of the licensing costs…
J2EE for business logic: most of the project costs…
Lingual: DW → ANSI SQL
Pattern: SAS, R, etc. → PMML
business logic in Java, Clojure, Scala, etc.
sink taps for Memcached, HBase, MongoDB, etc.
source taps for Cassandra, JDBC, Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all… one connected DAG:
• optimization
• troubleshooting
• exception handling
• notifications
cascading.org
Lingual: DW → ANSI SQL
FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
Pattern: SAS, R, etc. → PMML
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );
Cascading – functional programming
Key insight: MapReduce is based on functional programming, with roots going back to LISP (which dates from the late 1950s). Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
to ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows:
• leverages JVM and Java-based tools without any need to create new languages
• allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters
Edgar Codd alluded to this (DSLs for structuring data) in his original paper about the relational model
Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments
• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
Cascalog in Clojure (2010)
github.com/nathanmarz/cascalog/wiki
Scalding in Scala (2012)
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in Java to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
[flow diagram: Document Collection → Tokenize (Regex token) → Scrub token → HashJoin Left with Stop Word List (RHS) → GroupBy token → Count → Word Count; map (M) and reduce (R) phases]
data is represented as flows of tuples
operations in the flows bring functional programming aspects into Java
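The notion of data as flows of tuples through composable operations can be sketched with Python generators. This mimics the shape of the word count flow in the diagram for this slide; it is not Cascading's API, and the function names, stop word list, and sample documents are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"a", "the", "of", "and"}   # hypothetical stop word list


def tokenize(docs):
    """Split each (doc_id, text) tuple into (doc_id, token) tuples."""
    for doc_id, text in docs:
        for token in text.split():
            yield doc_id, token


def scrub(tuples):
    """Normalize tokens: lowercase and strip surrounding punctuation."""
    for doc_id, token in tuples:
        cleaned = token.lower().strip(".,()[]")
        if cleaned:
            yield doc_id, cleaned


def drop_stop_words(tuples, stop_words):
    """Left join against the stop word list, keeping only the unmatched tokens."""
    for doc_id, token in tuples:
        if token not in stop_words:
            yield doc_id, token


def count_by_token(tuples):
    """GroupBy token, then Count."""
    return Counter(token for _, token in tuples)


docs = [
    ("doc01", "A rain shadow is a dry area."),
    ("doc02", "The rain falls on the lee side of the mountain."),
]
word_count = count_by_token(drop_stop_words(scrub(tokenize(docs)), STOP_WORDS))
print(word_count.most_common(3))
```

Each stage consumes and produces a stream of tuples, so stages can be rearranged or replaced without touching their neighbors.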
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
Workflow Abstraction – business process
following the essence of literate programming, Cascading workflows provide statements of business process
this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale
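As a rough illustration of what a planner can do once a workflow is declared as a DAG, the sketch below groups steps into stages that could run in parallel. It uses Python's standard `graphlib` module (Python 3.9+) and says nothing about how the Cascading flow planner is actually implemented; the step names are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_stages(graph):
    """Group the steps of a DAG into stages whose members have no unmet dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()                      # also raises CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


# each step maps to the set of steps it depends on (hypothetical workflow)
workflow = {
    "tokenize": {"documents"},
    "scrub": {"tokenize"},
    "join": {"scrub", "stop_words"},
    "count": {"join"},
}

stages = parallel_stages(workflow)
for number, steps in enumerate(stages, start=1):
    print(number, steps)
```

The author of the workflow states only the dependencies; the ordering and the opportunities for parallelism fall out of the graph.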
The Ubiquitous Word Count
Definition:
count how often each word appears in a collection of text documents
this simple program provides an excellent test case for parallel processing:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps
a distributed computing framework that runs Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems
void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));
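The Word Count pseudocode on this slide can be run locally as a small Python sketch. The explicit `shuffle` step stands in for the grouping that a framework such as Hadoop performs between map and reduce; lowercasing and the `\w+` tokenizer are assumptions made here, and the sample documents are made up.

```python
import re
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit a (word, 1) pair for each word in the document text."""
    for word in re.findall(r"\w+", text.lower()):
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, group):
    """Sum the partial counts for one word."""
    return word, sum(group)


docs = {"doc01": "rain shadow rain", "doc02": "dry rain"}
pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
counts = dict(reduce_phase(word, group) for word, group in shuffle(pairs).items())
print(counts)
```

The map calls are independent per document and the reduce calls are independent per word, which is what allows both phases to be spread across a cluster.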
WordCount – conceptual flow diagram
[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
1 map, 1 reduce
18 lines code: gist.github.com/3900702
cascading.org/category/impatient
WordCount – Cascading app in Java
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
WordCount – generated flow diagram
[head]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
map
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
GroupBy('wc')[by:['token']]
reduce
Every('wc')[Count[decl:'count']]
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
[tail]
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient
WordCount – Cascalog / Clojure
github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development (TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
    .read
    .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
WordCount – Scalding / Scala
github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• gentler learning curve than Cascalog
CREATE TABLE text_docs (line STRING);

LOAD DATA LOCAL INPATH 'data/rain.txt'
OVERWRITE INTO TABLE text_docs;

SELECT word, COUNT(*)
FROM
  (SELECT split(line, '\t')[1] AS text FROM text_docs) t
  LATERAL VIEW explode(split(text, '[ ,\.\(\)]')) lTable AS word
GROUP BY word;
WordCount – Apache Hive
hive.apache.org
pro:
‣ most popular abstraction atop Apache Hadoop
‣ SQL-like language is syntactically familiar to most analysts
‣ simple to load large-scale unstructured data and run ad-hoc queries
con:
‣ not a relational engine, many surprises at scale
‣ difficult to represent complex workflows, ML algorithms, etc.
‣ one poorly-trained analyst can bottleneck an entire cluster
‣ app-level integration requires other coding, outside of script language
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of maps+reduces may change unexpectedly
‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
58Saturday, 31 August 13
WordCount – Apache Pig
docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource') AS (doc_id, text);
docPipe = FILTER docPipe BY doc_id != 'doc_id';

-- specify regex to split "document" text lines into token stream
tokenPipe = FOREACH docPipe GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

-- determine the word counts
tokenGroups = GROUP tokenPipe BY token;
wcPipe = FOREACH tokenGroups GENERATE group AS token, COUNT(tokenPipe) AS count;

-- output
STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
EXPLAIN -out dot/wc_pig.dot -dot wcPipe;
[flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]
WordCount – Apache Pig
pig.apache.org
pro:
‣ easy to learn data manipulation language (DML)
‣ interactive prompt (Grunt) makes it simple to prototype apps
‣ extensibility through UDFs
con:
‣ not a full programming language; must extend via UDFs outside of language
‣ app-level integration requires other coding, outside of script language
‣ simple problems are simple to do; hard problems become quite complex
‣ difficult to parameterize scripts externally; must rewrite to change taps!
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of maps+reduces may change unexpectedly
‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
Two Avenues to the App Layer…
[chart axes: scale ➞, complexity ➞]
Enterprise: must contend with complexity at scale every day…
incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
Learnings generalized from trends in Data Science:
1. the practice of leading data science teams
2. strategies for leveraging data at scale
3. machine learning and optimization
4. large-scale data workflows
5. an evolution of cluster computing
GA/SF, 2013-08-27
Q3 1997: inflection point
four independent teams were working toward horizontal scale-out of workflows based on commodity hardware
this effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack emerged from this period
Circa 1996: pre-inflection point
[diagram labels: RDBMS; Stakeholder; SQL Query, result sets; Excel pivot tables, PowerPoint slide decks; Web App; Customers; transactions; Product; strategy; Engineering; requirements; BI Analysts; optimized code]
Circa 1996: pre-inflection point
[diagram as on the previous slide, annotated: “throw it over the wall”]
Circa 2001: post-big ecommerce successes
[diagram labels: RDBMS; SQL Query, result sets; recommenders + classifiers; Web Apps; customer transactions; Algorithmic Modeling; Logs; event history; aggregation; dashboards; Product; Engineering; UX; Stakeholder; Customers; DW; ETL; Middleware; servlets; models]
Circa 2001: post-big ecommerce successes
[diagram as on the previous slide, annotated: “data products”]
Circa 2013: clusters everywhere
[diagram labels: Workflow; RDBMS; near time; batch; services; transactions, content; social interactions; Web Apps, Mobile, etc.; History; Data Products; Customers; RDBMS; Log Events; In-Memory Data Grid; Hadoop, etc.; Cluster Scheduler; Prod; Eng; DW; Use Cases Across Topologies; s/w dev; data science; discovery + modeling; Planner; Ops; dashboard metrics; business process; optimized capacity; taps; Data Scientist; App Dev; Ops; Domain Expert; introduced capability; existing SDLC]
Circa 2013: clusters everywhere
[diagram as on the previous slide, annotated: “optimize topologies”]
Primary Sources
Amazon: “Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay: “The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search): “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google: “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
MIT Media Lab: “Social Information Filtering for Music Recommendation” – Pattie Maes
pubs.media.mit.edu/pubs/papers/32paper.ps
ted.com/speakers/pattie_maes.html
Cluster Computing’s Dirty Little Secret
people like me make a good living by leveraging high ROI apps based on clusters, and so the execs agree to build out more data centers…
clusters for Hadoop/HBase, for Storm, for MySQL, for Memcached, for Cassandra, for Nginx, etc.
this becomes expensive!
a single class of workloads on a given cluster is simpler to manage; but terrible for utilization… various notions of “cloud” help
Cloudera, Hortonworks, probably EMC soon: sell a notion of “Hadoop as OS” ⇒ All your workloads are belong to us
regardless of how architectures change, death and taxes will endure: servers fail, and data must move
[photo: Google Data Center, Fox News, ~2002]
Three Laws, or more?
meanwhile, architectures evolve toward much, much larger data…
pistoncloud.com/ ...
Rich Freitas, IBM Research
Q: what kinds of disruption in topologies could this imply? because there’s no such thing as RAM anymore…
Topologies
Hadoop and other topologies arose from a need for fault-tolerant workloads, leveraging horizontal scale-out based on commodity hardware
because the data won’t fit on one computer anymore
a variety of Big Data technologies has since emerged, which can be categorized in terms of topologies and the CAP Theorem
[CAP diagram labels: C = strong consistency; A = high availability; P = partition tolerance; eventual consistency]
“You can have at most two of these properties for any shared-data system… the choice of which feature to discard determines the nature of your system.” – Eric Brewer, 2000 (Inktomi/YHOO)
cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
julianbrowne.com/article/viewer/brewers-cap-theorem
Some Topologies Other Than Hadoop…
Spark (iterative/interactive)
Titan (graph database)
Redis (data structure server)
ZooKeeper (distributed metadata)
HBase (columnar data objects)
Riak (durable key-value store)
Storm (real-time streams)
ElasticSearch (search index)
MongoDB (document store)
ParAccel (MPP)
SciDB (array database)
“Return of the Borg”
consider that Google is generations ahead of Hadoop, etc., with much improved ROI on its data centers…
Borg serves as a kind of “secret sauce” for data center OS, with Omega as its next evolution:
2011 GAFS Omega
John Wilkes, et al.
youtu.be/0ZFMlO98Jkc
Omega: flexible, scalable schedulers for large compute clusters
Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, John Wilkes
eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
“Return of the Borg”
Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon
Cade Metz
wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
Luiz André Barroso, Urs Hölzle
research.google.com/pubs/pub35290.html
Mesos – definitions
a common substrate for cluster computing
heterogeneous assets in your data center or cloud made available as a homogeneous set of resources
• top-level Apache project
• scalability to 10,000s of nodes
• obviates the need for virtual machines
• isolation between tasks with Linux Containers (pluggable)
• fault-tolerant replicated master using ZooKeeper
• multi-resource scheduling (memory and CPU aware)
• APIs in C++, Java, Python
• web UI for inspecting cluster state
• available for Linux, Mac OS X, OpenSolaris
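To make "multi-resource scheduling (memory and CPU aware)" concrete, here is a toy sketch. It is not the Mesos API and not the Mesos allocation algorithm: it is a simple first-fit placement, with hypothetical node and task names and sizes, that only shows what it means to check more than one resource before placing a task.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpus: float   # free CPU cores
    mem_mb: int   # free memory in MB

@dataclass
class Task:
    name: str
    cpus: float
    mem_mb: int

def place(tasks, nodes):
    """First-fit placement: a task lands on the first node with enough
    free CPU *and* memory. Mutates the nodes' free resources.
    Returns {task name: node name, or None if nothing fits}."""
    placement = {}
    for task in tasks:
        placement[task.name] = None
        for node in nodes:
            if node.cpus >= task.cpus and node.mem_mb >= task.mem_mb:
                node.cpus -= task.cpus
                node.mem_mb -= task.mem_mb
                placement[task.name] = node.name
                break
    return placement

# Hypothetical cluster and workload, for illustration only.
nodes = [Node("node-a", cpus=4, mem_mb=8192), Node("node-b", cpus=8, mem_mb=4096)]
tasks = [
    Task("web", cpus=1, mem_mb=1024),
    Task("cache", cpus=1, mem_mb=6144),
    Task("batch", cpus=6, mem_mb=2048),
    Task("big", cpus=2, mem_mb=8192),
]
placement = place(tasks, nodes)
```

In this example "batch" is pushed to node-b because node-a lacks CPU, and "big" stays unplaced because no node has enough free memory left, even though CPU is still available.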
Mesos – simplifies app development
[stack diagram labels: CHRONOS, SPARK, HADOOP, DPARK, MPI; JVM (JAVA, SCALA, CLOJURE, JRUBY), PYTHON, C++; MESOS]
Mesos – data center OS stack
[stack diagram labels: Apps, OS, Kernel; HADOOP, STORM, CHRONOS, RAILS, JBOSS; TELEMETRY, CAPACITY PLANNING, GUI, SECURITY, SMARTER SCHEDULING; MESOS]
Prior Practice: Dedicated Servers
DATACENTER
• low utilization rates
• longer time to ramp up new services
Prior Practice: Virtualization
DATACENTER PROVISIONED VMS
• even more machines to manage
• substantial performance decrease due to virtualization
• VM licensing costs
Prior Practice: Static Partitioning
DATACENTER STATIC PARTITIONING
• even more machines to manage
• substantial performance decrease due to virtualization
• VM licensing costs
• static partitioning limits elasticity
Mesos: One Large Pool Of Resources
[diagram labels: MESOS, DATACENTER]
“We wanted people to be able to program for the data center just like they program for their laptop.”
– Ben Hindman
What are the costs of Virtualization?
benchmark type     OpenVZ improvement
mixed workloads    210%-300%
LAMP (related)     38%-200%
I/O throughput     200%-500%
response time      order of magnitude
(more pronounced at higher loads)
What are the costs of Single Tenancy?
[four charts of CPU load (0%-100%) over time t: RAILS CPU LOAD; MEMCACHED CPU LOAD; HADOOP CPU LOAD; COMBINED CPU LOAD (RAILS, MEMCACHED, HADOOP)]
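The arithmetic behind the single-tenancy argument can be sketched with made-up numbers. The hourly load curves below are hypothetical, not measurements from the talk. The point is only that dedicated clusters must each be sized for their own peak, while a shared pool is sized for the peak of the combined load.

```python
# Hypothetical hourly CPU demand (in cores) for three workloads over one day.
HOURS = range(24)

def rails_load(h):       # daytime web traffic
    return 80 if 9 <= h < 21 else 20

def memcached_load(h):   # follows web traffic
    return 40 if 9 <= h < 21 else 10

def hadoop_load(h):      # nightly batch jobs
    return 100 if h < 6 else 10

workloads = [rails_load, memcached_load, hadoop_load]

# Single tenancy: each cluster is sized for its own peak.
dedicated_capacity = sum(max(w(h) for h in HOURS) for w in workloads)

# Shared pool: one cluster sized for the peak of the combined load.
combined = [sum(w(h) for w in workloads) for h in HOURS]
pooled_capacity = max(combined)

# Utilization = core-hours actually used / core-hours provisioned.
total_demand = sum(combined)
dedicated_utilization = total_demand / (dedicated_capacity * len(HOURS))
pooled_utilization = total_demand / (pooled_capacity * len(HOURS))
```

With these assumed curves, the dedicated clusters need 220 cores and run at roughly 49% utilization, while a shared pool needs 130 cores and runs at roughly 83%, because the batch peak falls in the hours when web traffic is low.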
Compelling arguments for Data Center OS
• obviates the need for VMs (licensing, adios VMware)
• provides OS-level building blocks for developing new distributed frameworks (learning curve, adios Hadoop)
• removes significant VM overhead (performance)
• requires less h/w to buy (CapEx), power and fix (OpEx)
• implies fewer VMs, thus less Ops overhead (staff)
• removes the complexity of Chef/Puppet (staff)
• allows higher utilization rates (ROI)
• reduces latency for data updates (OLTP + OLAP on same server)
• reshapes cluster resources dynamically (100s of ms vs. minutes)
• runs dev/test clusters on same h/w as production (flexibility)
• evaluates multiple versions without more h/w (vendor lock-in)
Opposite Ends of the Spectrum, One Substrate
Built-in / bare metal
Hypervisors
Solaris Zones
Linux CGroups
Opposite Ends of the Spectrum, One Substrate
Request / Response
Batch
Case Study: Twitter (bare metal / on premise)
“Mesos is the cornerstone of our elastic compute infrastructure – it’s how we build all our new services and is critical for Twitter’s continued success at scale. It's one of the primary keys to our data center efficiency.”
– Chris Fry, SVP Engineering
blog.twitter.com/2013/mesos-graduates-from-apache-incubation
• key services run in production: analytics, typeahead, ads
• Twitter engineers rely on Mesos to build all new services
• instead of thinking about static machines, engineers think about resources like CPU, memory and disk
• allows services to scale and leverage a shared pool of servers across data centers efficiently
• reduces the time between prototyping and launching
Case Study: Airbnb (fungible cloud infrastructure)
“We think we might be pushing data science in the field of travel more so than anyone has ever done before… a smaller number of engineers can have higher impact through automation on Mesos.”
– Mike Curtis, VP Engineering
gigaom.com/2013/07/29/airbnb-is-engineering-itself-into-a-data-driven...
• improves resource management and efficiency
• helps advance engineering strategy of building small teams that can move fast
• key to letting engineers make the most of AWS-based infrastructure beyond just Hadoop
• allowed company to migrate off Elastic MapReduce
• enables use of Hadoop along with Chronos, Spark, Storm, etc.
Resources
Apache Mesos Project: mesos.apache.org
Mesosphere: mesosphe.re
Getting Started: mesosphe.re/tutorials
Documentation: mesos.apache.org/documentation
Research Paper: usenix.org/legacy/event/nsdi11/tech/full_papers/Hindman_new.pdf
Collected Notes/Archives: goo.gl/jPtTP
Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do
monthly newsletter for updates, events, conference summaries, etc.:
liber118.com/pxn/