![Page 1: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/1.jpg)
Agile Data Science
January 2014
Agile Analytics Applications with Hadoop
![Page 2: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/2.jpg)
2
About Me…Bearding.
•Bearding is my #1 natural talent.
•I’m going to beat this guy.
•Seriously.
•Salty Sea Beard
•Fortified with Pacific Ocean Minerals
2
![Page 3: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/3.jpg)
3
Agile Data Science: The Book
A philosophy. Not the only way, but it’s a really good way!
Code: ‘AUTHD’ – 50% off
3
![Page 4: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/4.jpg)
4
We Go Fast, But Don’t Worry!
•Download the slides - click the links - read examples!
•If it’s not on the blog (Hortonworks, Data Syndrome), it’s in the book!
•Order now: http://shop.oreilly.com/product/0636920025054.do
![Page 5: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/5.jpg)
5
Agile Application Development: Check
•LAMP stack mature
•Post-Rails frameworks to choose from
•Enable rapid feedback and agility
+ NoSQL
![Page 6: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/6.jpg)
6
Data Warehousing
6
![Page 7: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/7.jpg)
7
Scientific Computing / HPC
Tubes and Mercury (Old School)
Cores and Spindles (New School)
UNIVAC and Deep Blue both fill a warehouse. We’re back!
‘Smart Kid’ Only: MPI, Globus, etc. Until Hadoop
![Page 8: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/8.jpg)
8
Data Science?
Application Development
Data Warehousing
Scientific Computing / HPC
![Page 9: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/9.jpg)
9
Data Center as Computer
“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.” Click here for a paper on operating a ‘data center as computer.’
Warehouse Scale Computers and Applications
![Page 10: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/10.jpg)
10
Hadoop to the Rescue!
•Easy to use (Pig, Hive, Cascading)
•CHEAP: 1% the cost of SAN/NAS
•A department can afford its own Hadoop cluster!
•Dump all your data in one place: Hadoop DFS
•Silos come CRASHING DOWN!
•JOIN like crazy!
•ETL like whoa!
•An army of mappers and reducers at your command
•OMGWTFBBQ ITS SO GREAT! I FEEL AWESOME!
![Page 11: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/11.jpg)
11
NOW WHAT?
11
![Page 12: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/12.jpg)
12
Analytics Apps: It takes a Team
•Broad skill-set
•Nobody has them all
•Inherently collaborative
12
![Page 13: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/13.jpg)
13
Data Science Team
•3-4 team members with broad, diverse skill-sets that overlap
•Transactional overhead dominates at 5+ people
•Expert researchers: lend 25-50% of their time to teams
•Creative workers. Like a studio, not an assembly line
•Total freedom... with goals and deliverables.
•Work environment matters most
![Page 14: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/14.jpg)
14
How To Get Insight Into Product
•Back-end has gotten THICKER
•Generating $$$ insight can take 10-100x app dev
•Timeline disjoint: analytics vs agile app-dev/design
•How do you ship insights efficiently?
•Can you collaborate on research vs developer timeline?
14
![Page 15: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/15.jpg)
15
The Wrong Way - Part One
“We made a great design.
Your job is to predict the future for it.”
15
![Page 16: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/16.jpg)
16
The Wrong Way - Part Two
“What is taking you so long
to reliably predict the future?”
16
![Page 17: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/17.jpg)
17
The Wrong Way - Part Three
“The users don’t understand
what 86% true means.”
17
![Page 18: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/18.jpg)
18
The Wrong Way - Part Four
GHJIAEHGIEhjagigehganb!!!!!RJ(@J?!!
18
![Page 19: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/19.jpg)
19
The Wrong Way - Conclusion
Inevitable Conclusion
Plane
Mountain
![Page 20: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/20.jpg)
20
Reminds me of... the waterfall model
:(
20
![Page 21: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/21.jpg)
21
Chief Problem
You can’t design insight in analytics applications.
You discover it.
You discover by exploring.
21
![Page 22: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/22.jpg)
22
-> Strategy
So make an app for exploring your data.
Which becomes a palette for what you ship.
Iterate and publish intermediate results.
22
![Page 23: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/23.jpg)
23
Data Design
•Not the 1st query that = insight, it’s the 15th, or 150th
•Capturing “Ah ha!” moments
•Slow to do those in batch…
•Faster, better context in an interactive web application.
•Pre-designed charts wind up terrible. So bad.
•Easy to invest man-years in wrong statistical models
•Semantics of presenting predictions are complex
•Opportunity lies at intersection of data & design
![Page 24: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/24.jpg)
24
How Do We Get Back to Agile?
24
![Page 25: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/25.jpg)
25
Statement of Principles
(Then Tricks With Code)
25
![Page 26: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/26.jpg)
26
Setup An Environment Where:
•Insights repeatedly produced
•Iterative work shared with entire team
•Interactive from day Zero
•Data model is consistent end-to-end
•Minimal impedance between layers
•Scope and depth of insights grow
•Insights form the palette for what you ship
•Until the application pays for itself and more
26
![Page 27: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/27.jpg)
27
Snowballing Audience
27
![Page 28: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/28.jpg)
28
Value Document > Relation
Most data is dirty. Most data is semi-structured or unstructured. Rejoice!
28
![Page 29: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/29.jpg)
29
Value Document > Relation
Note: support for document and array types in Hive, ArrayQL, and NewSQL blurs this distinction.
29
![Page 30: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/30.jpg)
30
Relational Data = Legacy Format
•Why JOIN? Storage is fundamentally cheap!
•Duplicate that JOIN data in one big record type!
•ETL once to document format on import, NOT every job
•Not zero JOINs, but far fewer JOINs
•Semi-structured documents preserve data’s actual structure
•Column compressed document formats beat JOINs!
30
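The "ETL once to document format" idea above can be sketched in a few lines of Python. This is a minimal illustration, assuming two hypothetical in-memory tables (`messages` and `recipients`) shaped like the Enron TSV files used later in this deck; the function name and sample rows are made up for the example and are not from the original slides.

```python
from collections import defaultdict


def to_documents(messages, recipients):
    """Fold recipient rows into their parent message: one document per message."""
    by_message = defaultdict(lambda: {"tos": [], "ccs": [], "bccs": []})
    for row in recipients:
        # reciptype is 'to', 'cc' or 'bcc'; the document field is its plural.
        field = row["reciptype"] + "s"
        by_message[row["message_id"]][field].append(
            {"address": row["address"], "name": row["name"]}
        )
    documents = []
    for msg in messages:
        doc = dict(msg)  # copy the message columns
        headers = by_message[msg["message_id"]]
        doc.update({field: list(people) for field, people in headers.items()})
        documents.append(doc)
    return documents


# Hypothetical sample rows, for illustration only.
messages = [{"message_id": "m1", "subject": "hello", "body": "hi there"}]
recipients = [
    {"message_id": "m1", "reciptype": "to", "address": "a@example.com", "name": None},
    {"message_id": "m1", "reciptype": "cc", "address": "b@example.com", "name": "B"},
]
docs = to_documents(messages, recipients)
```

After this one-time join, every later job reads a single self-contained record per email instead of re-joining two tables.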
![Page 31: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/31.jpg)
31
Value Imperative > Declarative
•We don’t know what we want to SELECT.
•Data is dirty - check each step, clean iteratively.
•85% of data scientist’s time spent munging. ETL.
•Imperative is optimized for our process.
•Process = iterative, snowballing insight
•Efficiency matters, self optimize
31
![Page 32: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/32.jpg)
32
Value Dataflow > SELECT
32
![Page 33: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/33.jpg)
33
Ex. Dataflow: ETL + Email Sent Count
(I can’t read this either. Get a big version here.)
33
![Page 34: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/34.jpg)
34
Value Pig > Hive (for app-dev)
•Pigs eat ANYTHING
•Pig is optimized for refining data, as opposed to consuming it
•Pig is imperative, iterative
•Pig is dataflows, and SQLish (but not SQL)
•Code modularization/re-use: Pig Macros
•ILLUSTRATE speeds dev time (even UDFs)
•Easy UDFs in Java, JRuby, Jython, Javascript
•Pig Streaming = use any tool, period.
•Easily prepare our data as it will appear in our app.
•If you prefer Hive, use Hive.
Actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive.
See: HCatalog for Pig/Hive integration.
![Page 35: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/35.jpg)
35
Localhost vs Petabyte Scale: Same Tools
•Simplicity essential to scalability: highest level tools we can
•Prepare a good sample - tricky with joins, easy with documents
•Local mode: pig -l /tmp -x local -v -w
•Frequent use of ILLUSTRATE
•1st: Iterate, debug & publish locally
•2nd: Run on cluster, publish to team/customer
•Consider skipping Object-Relational-Mapping (ORM)
•We do not trust ‘databases,’ only HDFS @ n=3
•Everything we serve in our app is re-creatable via Hadoop.
35
![Page 36: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/36.jpg)
36
Data-Value Pyramid
Climb it. Do not skip steps. See here.
36
![Page 37: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/37.jpg)
37
0/1) Display Atomic Records On The Web
37
![Page 38: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/38.jpg)
38
0.0) Document - Serialize Events
•Protobuf
•Thrift
•JSON
•Avro - I use Avro because the schema is onboard.
38
![Page 39: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/39.jpg)
39
0.1) Documents Via Relation ETL
enron_messages = load '/enron/enron_messages.tsv' as (
    message_id:chararray,
    sql_date:chararray,
    from_address:chararray,
    from_name:chararray,
    subject:chararray,
    body:chararray);
enron_recipients = load '/enron/enron_recipients.tsv' as (
    message_id:chararray,
    reciptype:chararray,
    address:chararray,
    name:chararray);
split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;
emails = foreach with_headers generate
    enron_messages::message_id as message_id,
    CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
    TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
    enron_messages::subject as subject,
    enron_messages::body as body,
    headers::tos.(address, name) as tos,
    headers::ccs.(address, name) as ccs,
    headers::bccs.(address, name) as bccs;
store emails into '/enron/emails.avro' using AvroStorage(
Example here.
![Page 40: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/40.jpg)
40
0.2) Serialize Events From Streams
class GmailSlurper(object):
    ...
    def init_imap(self, username, password):
        self.username = username
        self.password = password
        try:
            imap.shutdown()
        except:
            pass
        self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
        self.imap.login(username, password)
        self.imap.is_readonly = True
    ...
    def write(self, record):
        self.avro_writer.append(record)
    ...
    def slurp(self):
        if(self.imap and self.imap_folder):
            for email_id in self.id_list:
                (status, email_hash, charset) = self.fetch_email(email_id)
                if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
                    print email_id, charset, email_hash['thread_id']
                    self.write(email_hash)
Scrape your own gmail in Python and Ruby.
![Page 41: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/41.jpg)
41
0.3) ETL Logs
log_data = LOAD 'access_log' USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes);
41
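For working on a small local sample before running the Pig loader, the same log ETL can be approximated in plain Python. This is a rough sketch, assuming logs in the standard Common Log Format; the regular expression and the `parse_line` helper are illustrative and are not how Piggybank implements its loader. The field names mirror the Pig schema above, plus a `status` field that the Common Log Format also carries.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<remoteAddr>\S+) (?P<remoteLogname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = CLF.match(line)
    return match.groupdict() if match else None


# Hypothetical sample line, for illustration only.
sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
record = parse_line(sample)
```

Returning `None` for unparseable lines keeps dirty input visible, so it can be counted and inspected instead of silently dropped.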
![Page 42: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/42.jpg)
42
1) Plumb Atomic Events->Browser
(Example stack that enables high productivity)
42
![Page 43: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/43.jpg)
43
1.1) Cat Avro Serialized Events
me$ cat_avro ~/Data/enron.avro
{
    u'bccs': [],
    u'body': u'scamming people, blah blah',
    u'ccs': [],
    u'date': u'2000-08-28T01:50:00.000Z',
    u'from': {u'address': u'[email protected]', u'name': None},
    u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
    u'subject': u'Re: Enron trade for frop futures',
    u'tos': [ {u'address': u'[email protected]', u'name': None} ]
}
Get cat_avro in python, ruby
![Page 44: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/44.jpg)
44
1.2) Load Events in Pig
me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails
emails: {
    message_id: chararray,
    datetime: chararray,
    from:tuple(address:chararray,name:chararray),
    subject: chararray,
    body: chararray,
    tos: {to: (address: chararray,name: chararray)},
    ccs: {cc: (address: chararray,name: chararray)},
    bccs: {bcc: (address: chararray,name: chararray)}
}
44
![Page 45: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/45.jpg)
45
1.3) ILLUSTRATE Events in Pig
grunt> illustrate enron_emails
---------------------------------------------------------------------------
| emails | message_id:chararray | datetime:chararray | from:tuple(address:chararray,name:chararray) | subject:chararray | body:chararray | tos:bag{to:tuple(address:chararray,name:chararray)} | ccs:bag{cc:tuple(address:chararray,name:chararray)} | bccs:bag{bcc:tuple(address:chararray,name:chararray)} |
---------------------------------------------------------------------------
|        | <1731.10095812390082.JavaMail.evans@thyme> | 2001-01-09T06:38:00.000Z | ([email protected], J.R. Bob Dobbs) | Re: Enron trade for frop futures | scamming people, blah blah | {([email protected],)} | {} | {} |
Upgrade to Pig 0.10+
![Page 46: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/46.jpg)
46
1.4) Publish Events to a ‘Database’
From Avro to MongoDB in one command:
pig -l /tmp -x local -v -w -param avros=enron.avro \
    -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig
Which does this:
/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
/* By default, let's have 5 reducers */
set default_parallel 5
avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();
Full instructions here.
![Page 47: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/47.jpg)
47
1.5) Check Events in ‘Database’
$ mongo enron
MongoDB shell version: 2.0.2
connecting to: enron
> show collections
emails
system.indexes
> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
    "_id" : ObjectId("502b4ae703643a6a49c8d180"),
    "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
    "date" : "2001-01-09T06:38:00.000Z",
    "from" : { "address" : "[email protected]", "name" : "J.R. Bob Dobbs" },
    "subject" : "Re: Enron trade for frop futures",
    "body" : "Scamming more people...",
    "tos" : [ { "address" : "connie@enron", "name" : null } ],
    "ccs" : [ ],
    "bccs" : [ ]
}
![Page 48: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/48.jpg)
48
1.6) Publish Events on the Web
require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'

connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']

get '/email/:message_id' do |message_id|
  data = collection.find_one({:message_id => message_id})
  JSON.generate(data)
end
48
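The same "one record, one URL" pattern can be sketched with only the Python standard library. This is an illustrative WSGI app, not the author's code: the `EMAILS` dict is a hypothetical stand-in for the MongoDB collection, and the sample document is made up.

```python
import json

# Hypothetical stand-in for the MongoDB collection: message_id -> document.
EMAILS = {"<1@example>": {"message_id": "<1@example>", "subject": "Re: hello"}}


def app(environ, start_response):
    """WSGI app that serves /email/<message_id> as a JSON document."""
    path = environ.get("PATH_INFO", "")
    prefix = "/email/"
    doc = EMAILS.get(path[len(prefix):]) if path.startswith(prefix) else None
    if doc is None:
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps(doc).encode("utf-8")]
```

To try it locally, pass `app` to `wsgiref.simple_server.make_server("", 8000, app)` and call `serve_forever()` on the result.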
![Page 49: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/49.jpg)
49
1.6) Publish events on the web
49
![Page 50: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/50.jpg)
50
One-Liner to Transition Stack
50
![Page 51: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/51.jpg)
51
What’s the Point?
•A designer can work against real data.
•An application developer can work against real data.
•A product manager can think in terms of real data.
•Entire team is grounded in reality!
•You’ll see how ugly your data really is.
•You’ll see how much work you have yet to do.
•Ship early and often!
•Feels agile, don’t it? Keep it up!
51
![Page 52: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/52.jpg)
52
1.7) Wrap Events with Bootstrap
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
<div class="container" style="margin-top: 100px;">
<table class="table table-striped table-bordered table-condensed">
<thead>
{% for key in data['keys'] %}
<th>{{ key }}</th>
{% endfor %}
</thead>
<tbody>
<tr>
{% for value in data['values'] %}
<td>{{ value }}</td>
{% endfor %}
</tr>
</tbody>
</table>
</div>
</body>
Complete example here with code here.
![Page 53: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/53.jpg)
53
1.7) Wrap Events with Bootstrap
53
![Page 54: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/54.jpg)
54
Refine. Add Links Between Documents.
Not the Mona Lisa, but coming along... See: here
![Page 55: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/55.jpg)
56
1.8) List Links to Sorted Events
Use your ‘database’, if it can sort.
mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
    "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
    "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
    "from" : [
...
Use Pig, serve/cache a bag/array of email documents:
pig -l /tmp -x local -v -w
emails_per_user = foreach (group emails by from.address) {
    sorted = order emails by date;
    last_1000 = limit sorted 1000;
    generate group as from_address, last_1000 as emails;
};
store emails_per_user into '$mongourl' using MongoStorage();
![Page 56: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/56.jpg)
57
1.8) List Links to Sorted Documents
57
![Page 57: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/57.jpg)
58
1.9) Make It Searchable
If you have a list, search is easy with ElasticSearch and Wonderdog...
/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');
Test it with curl:
curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'
ElasticSearch has no security features. Take note. Isolate.
![Page 58: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/58.jpg)
59
2) Create Simple Charts
59
![Page 59: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/59.jpg)
60
2) Create Simple Tables and Charts
60
![Page 60: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/60.jpg)
61
2) Create Simple Charts
•Start with an HTML table on general principle.
•Then use nvd3.js - reusable charts for d3.js
•Aggregate by properties & displaying is first step in entity resolution
•Start extracting entities. Ex: people, places, topics, time series
•Group documents by entities, rank and count.
•Publish top N, time series, etc.
•Fill a page with charts.
•Add a chart to your event page.
61
![Page 61: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/61.jpg)
62
2.1) Top N (of Anything) in Pig
pig -l /tmp -x local -v -w
top_things = foreach (group things by key) {
    sorted = order things by arbitrary_rank desc;
    top_10_things = limit sorted 10;
    generate group as key, top_10_things as top_10_things;
};
store top_things into '$mongourl' using MongoStorage();
Remember, this is the same structure the browser gets as json.
This would make a good Pig Macro.
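For checking results on a small local sample, the same top-N-per-key grouping can be written in plain Python. This is a sketch under assumed inputs: records are hypothetical `(key, item, rank)` tuples, and the function name is illustrative.

```python
from collections import defaultdict
import heapq


def top_n_per_key(things, n=10):
    """Group (key, item, rank) records by key and keep the n highest-ranked items."""
    groups = defaultdict(list)
    for key, item, rank in things:
        groups[key].append((rank, item))
    return {
        key: [item for _, item in heapq.nlargest(n, ranked, key=lambda pair: pair[0])]
        for key, ranked in groups.items()
    }


# Hypothetical sample records, for illustration only.
things = [("alice", "a", 3), ("alice", "b", 9), ("alice", "c", 5), ("bob", "d", 1)]
top = top_n_per_key(things, n=2)
```

The result is a mapping from each key to an ordered list, which serializes directly to the JSON structure the browser would receive.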
![Page 62: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/62.jpg)
63
2.2) Time Series (of Anything) in Pig
pig -l /tmp -x local -v -w
/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
    generate flatten(group) as (key, month),
             COUNT_STAR(things) as total;
/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
    timeseries = order things_by_month by month;
    generate group as key, timeseries as timeseries;
};
store things_timeseries into '$mongourl' using MongoStorage();
Yet another good Pig Macro.
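The two-step dataflow above (count per key and month, then sort each key's months) looks like this in plain Python. It is a minimal sketch that assumes ISO 8601 datetime strings, so the month is just the `YYYY-MM` prefix; the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def monthly_timeseries(things):
    """Count (key, ISO datetime) records per key and month, ordered by month."""
    # The first 7 characters of an ISO 8601 datetime are 'YYYY-MM'.
    totals = Counter((key, dt[:7]) for key, dt in things)
    series = defaultdict(list)
    for (key, month), total in sorted(totals.items()):
        series[key].append({"month": month, "total": total})
    return dict(series)


# Hypothetical sample records, for illustration only.
things = [
    ("bob@example.com", "2001-01-09T06:38:00.000Z"),
    ("bob@example.com", "2001-01-20T11:00:00.000Z"),
    ("bob@example.com", "2000-12-31T23:59:00.000Z"),
]
ts = monthly_timeseries(things)
```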
![Page 63: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/63.jpg)
64
Data Processing in Our Stack
A new feature in our application might begin at any layer…GREAT!
Any team member can add new features, no problemo!
I’m creative!
I know Pig!
I’m creative too!
I <3 Javascript!
omghi2u!
where r my legs?
send halp
64
![Page 64: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/64.jpg)
65
Data Processing in Our Stack
... but we shift the data-processing towards batch, as we are able.
Ex: Overall total emails calculated in each layer
See real example here.
65
![Page 65: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/65.jpg)
66
3) Exploring with Reports
66
![Page 66: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/66.jpg)
67
3) Exploring with Reports
67
![Page 67: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/67.jpg)
68
3.0) From Charts to Reports
•Extract entities from properties we aggregated by in charts (Step 2)
•Each entity gets its own type of web page
•Each unique entity gets its own web page
•Link to entities as they appear in atomic event documents (Step 1)
•Link most related entities together, same and between types.
•More visualizations!
•Parameterize results via forms.
68
![Page 68: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/68.jpg)
69
3.1) Looks Like This:
69
![Page 69: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/69.jpg)
70
3.2) Cultivate Common Keyspaces
70
![Page 70: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/70.jpg)
71
3.3) Get People Clicking. Learn.
•Explore this web of generated pages, charts and links!
•Everyone on the team gets to know your data.
•Keep trying out different charts, metrics, entities, links.
•See what’s interesting.
•Figure out what data needs cleaning and clean it.
•Start thinking about predictions & recommendations.
‘People’ could be just your team, if data is sensitive.
71
![Page 71: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/71.jpg)
72
4) Predictions and Recommendations
72
![Page 72: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/72.jpg)
73
4.0) Preparation
•We’ve already extracted entities, their properties and relationships
•Our charts show where our signal is rich
•We’ve cleaned our data to make it presentable
•The entire team has an intuitive understanding of the data
•They got that understanding by exploring the data
•We are all on the same page!
73
![Page 73: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/73.jpg)
74
4.2) Think in Different Perspectives
•Networks
•Time Series / Distributions
•Natural Language Processing
•Conditional Probabilities / Bayesian Inference
•Check out Chapter 2 of the book
74
![Page 74: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/74.jpg)
75
4.3) Networks
75
![Page 75: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/75.jpg)
76
4.3.1) Weighted Email Networks in Pig
76
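The slide's Pig code did not survive extraction, so here is one plausible reading of a weighted email network, sketched in Python. It assumes the edge weight is simply the number of emails a sender addressed to each "to" recipient, using the document shape from the earlier steps; the function name and sample emails are hypothetical, and this is not necessarily the author's method.

```python
from collections import Counter


def weighted_edges(emails):
    """Count emails per (sender, recipient) pair; the count is the edge weight."""
    edges = Counter()
    for email in emails:
        sender = email["from"]["address"]
        for recipient in email.get("tos", []):
            edges[(sender, recipient["address"])] += 1
    return edges


# Hypothetical sample documents, for illustration only.
emails = [
    {"from": {"address": "a@example.com"}, "tos": [{"address": "b@example.com"}]},
    {"from": {"address": "a@example.com"},
     "tos": [{"address": "b@example.com"}, {"address": "c@example.com"}]},
    {"from": {"address": "b@example.com"}, "tos": []},
]
edges = weighted_edges(emails)
```

Each `(sender, recipient, weight)` triple can then be written out as one row of an edge list for a graph tool.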
![Page 76: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/76.jpg)
77
4.3.2) Networks Viz with Gephi
77
![Page 77: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/77.jpg)
78
4.3.3) Gephi = Easy
78
![Page 78: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/78.jpg)
4.3.4) Social Network Analysis
![Page 79: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/79.jpg)
4.4) Time Series & Distributions
![Page 80: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/80.jpg)
4.4.1) Smooth Sparse Data
See here.
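One simple way to smooth sparse counts is a centered moving average. This is a stand-in chosen for brevity, not necessarily the smoother used in the linked example; the function name, window size and sample counts are all assumptions.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the edges of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# Hypothetical sparse counts, e.g. emails sent per hour.
hourly_counts = [0, 3, 0, 0, 6, 0]
smoothed = moving_average(hourly_counts, window=3)
print(smoothed)
```

A wider window gives a smoother curve but blurs real peaks, so the window is worth tuning against the chart.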
![Page 81: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/81.jpg)
4.4.2) Regress to Find Trends
JRuby Linear Regression UDF
Pig to use the UDF
Trend Line in your Application
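The trend line comes from an ordinary least-squares fit. The talk does this in a JRuby UDF called from Pig; the sketch below is a plain-Python version of the same arithmetic, assuming evenly spaced observations indexed 0, 1, 2, and so on. The function name and sample values are illustrative.

```python
def linear_trend(ys):
    """Least-squares slope and intercept for ys observed at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical weekly email counts.
slope, intercept = linear_trend([1, 3, 5, 7])
print(slope, intercept)
```

A positive slope means volume is trending up; the fitted line is `intercept + slope * x`.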
![Page 82: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/82.jpg)
4.5.1) Natural Language Processing
Example with code here and macro here.
![Page 83: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/83.jpg)
4.5.2) NLP: Extract Topics!
![Page 84: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/84.jpg)
4.5.3) NLP for All: Extract Topics!
•TF-IDF in Pig - 2 lines of code with Pig Macros:
•http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/
•LDA with Pig and the Lucene Tokenizer:
•http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html
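For intuition, here is what TF-IDF computes, sketched in plain Python rather than with the Pig macro linked above. It uses the textbook form (term frequency times log of inverse document frequency); the macro may weight or normalize differently. The function names and sample documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: dict of doc id -> list of tokens. Returns doc id -> {term: score}."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))  # count each term once per document
    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[doc_id] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

def top_terms(term_scores, n=2):
    """The n highest-scoring terms, used as a crude topic summary."""
    return sorted(term_scores, key=term_scores.get, reverse=True)[:n]

# Hypothetical tokenized email bodies.
docs = {"a": ["pig", "hadoop", "pig"], "b": ["hadoop", "gephi"]}
scores = tf_idf(docs)
print(top_terms(scores["a"], n=1))
```

Terms that appear in every document score zero, so the top terms per document are the ones that set it apart.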
![Page 85: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/85.jpg)
4.6) Probability & Bayesian Inference
![Page 86: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/86.jpg)
4.6.1) Gmail Suggested Recipients
![Page 87: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/87.jpg)
4.6.1) Reproducing it with Pig
![Page 88: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/88.jpg)
4.6.2) Step 1: COUNT (From -> To)
![Page 89: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/89.jpg)
4.6.2) Step 2: COUNT (From, To, Cc)/Total
P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone
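The two counting steps can be sketched in plain Python. This reads "Total" as the Step 1 count of emails per (from, to) pair, which is an interpretation of the slide; the talk itself does the counting in Pig. The field names `from`, `tos`, `ccs` and the sample addresses are assumptions.

```python
from collections import Counter

def cc_given_to(emails):
    """Estimate P(cc | from, to) as COUNT(from, to, cc) / COUNT(from, to)."""
    pair_counts = Counter()    # Step 1: emails per (from, to)
    triple_counts = Counter()  # Step 2: emails per (from, to, cc)
    for email in emails:
        sender = email["from"]
        for to in set(email.get("tos", [])):
            pair_counts[(sender, to)] += 1
            for cc in set(email.get("ccs", [])):
                triple_counts[(sender, to, cc)] += 1
    return {
        triple: count / pair_counts[triple[:2]]
        for triple, count in triple_counts.items()
    }

# Hypothetical sample records, for illustration only.
emails = [
    {"from": "me@example.com", "tos": ["kate@example.com"], "ccs": ["bob@example.com"]},
    {"from": "me@example.com", "tos": ["kate@example.com"], "ccs": ["bob@example.com"]},
    {"from": "me@example.com", "tos": ["kate@example.com"], "ccs": []},
]
probabilities = cc_given_to(emails)
print(probabilities)
```

Sorting the cc candidates for a given (from, to) pair by this probability gives a ranked list of suggested recipients.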
![Page 90: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/90.jpg)
4.6.3) Wait - Stop Here! It Works!
They match…
![Page 91: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/91.jpg)
4.7) Add Predictions to Reports
![Page 92: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/92.jpg)
5) Enable New Actions
![Page 93: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/93.jpg)
Why Doesn’t Kate Reply to My Emails?
•What time is best to catch her?
•Are they too long?
•Are they meant to be replied to (original content)?
•Are they nice? (sentiment analysis)
•Do I reply to her emails (reciprocity)?
•Do I cc the wrong people (my mom)?
![Page 94: Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014](https://reader035.vdocuments.site/reader035/viewer/2022062617/54c671dc4a7959e37d8b45e4/html5/thumbnails/94.jpg)
Thank You!
•Questions & Answers
•Follow: @rjurney
•Read the Blog: datasyndrome.com