the big data exploratorium

TRANSCRIPT

Page 1: The Big Data Exploratorium

The Big Data Exploratorium

A guided tour of open source data analysis tools

Noah Pepper (@noahmp)

Devin Chalmers (@qwzybug)

#exploratorium @osb11

1Thursday, June 23, 2011

Page 2: The Big Data Exploratorium

Hi,

• We’re here because...

• We are...

• Data Exploration Is...

• Example 1: Patents

• (Chalmers et al. 2010; Buchanan et al. 2010; Pepper et al. 2008)

• Example 2: Health Care

• (Pepper et al. Visweek 2010)


Page 3: The Big Data Exploratorium

Hi,

• Exploratorium #1

• Patent citation networks

• Graphviz

• NetworkX

• Exploratorium #2

• Reddit comment word usage


Page 4: The Big Data Exploratorium

Hi,

• Get the code & data samples:

• git clone git@github.com:peppern/exploratorium.git


Page 5: The Big Data Exploratorium

We’re here because...

• There is a really amazing OSS community in the data space.

• This is fantastic news for academics, hobbyists, and professionals alike.

• We want to show what you can do with open source tools, and show you the ones we like.

• We’d love to hear what YOUR favorites are; tag #exploratorium to tell us.

• Data exploration is fun...


Page 6: The Big Data Exploratorium

We are...

• Academic Data Junkies

Our academic home. Research focuses on exploring the nature of evolutionary activity through data mining.

• We’re Sorta Lucky

Our startup, where we build data exploration platforms.

Noah Pepper - @noahmp

Devin Chalmers - @qwzybug


Page 7: The Big Data Exploratorium

We Build Data Exploration Tools!

map.clearhealthcosts.com


Page 8: The Big Data Exploratorium

What is data exploration, and what is an exploratorium?

• Narrow Definition

• Data exploration is having an iterative relationship with your data, analysis, and visualization stack where you build an intuitive cognitive model of the information visualized.

• Why do I say visualization instead of the more general ‘representation’?

exploratorium |ikˌsplôrəˈtôrēəm| noun [usu. in names] a scientific museum or similar center at which visitors have the opportunity of performing prearranged experiments or demonstrations.

Yes! That means there’s code and data


Page 9: The Big Data Exploratorium

Data Exploration Example

• study evolution of technology in patent records

– technology is a window on culture

– patents are a window on technology


Page 10: The Big Data Exploratorium

Patent Networks


Page 11: The Big Data Exploratorium

Citation Analysis of Patents


Page 12: The Big Data Exploratorium

Time Series Text Analysis


Page 13: The Big Data Exploratorium

Some explorations are more open ended


Page 14: The Big Data Exploratorium

Pointwise Mutual Information (PMI)

# patents that contain words x and y
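The formula on this slide did not survive as text, so the following is only a sketch using the standard definition, PMI(x, y) = log(p(x, y) / (p(x) p(y))), with probabilities estimated from document counts. The base-2 logarithm, the whitespace tokenizer, and the toy "patents" are assumptions for illustration, not the authors' pipeline.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(docs):
    """Return {(x, y): PMI} for every word pair that co-occurs in a document.

    Counts are document-level: a word counts once per document,
    no matter how often it appears in that document.
    """
    n_docs = len(docs)
    word_counts = Counter()   # number of documents containing word x
    pair_counts = Counter()   # number of documents containing both x and y
    for text in docs:
        words = sorted(set(text.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    return {
        (x, y): math.log2(n_xy * n_docs / (word_counts[x] * word_counts[y]))
        for (x, y), n_xy in pair_counts.items()
    }


# Toy stand-ins for patent texts (hypothetical data).
toy_patents = [
    "optical fiber lens",
    "optical lens coating",
    "rose cultivar",
    "the optical sensor",
]
scores = pmi_table(toy_patents)
```

Pairs are keyed in alphabetical order, so look up `scores[("lens", "optical")]` rather than the reverse. Pairs that never co-occur are simply absent from the table.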


Page 15: The Big Data Exploratorium

PMI distributions

- see clusters

- different kinds of clusters


Page 16: The Big Data Exploratorium

PMI Comparison: Plotting a different way

[Figure: PMI integral and halfway rank for the words “the”, “optical”, and “cultivar”]

- generality of content?


Page 17: The Big Data Exploratorium

btw, these are older graphs; now we use ggplot2


Page 18: The Big Data Exploratorium

Previous Work in Health Care...

... with @homerstrong at Qmedtrix Systems Inc.

[Figure: bill volume (0 to 500,000) by adjudication type (AMB, ASC, DME, ER, IPH, OPH, PRO); legend: placement in distribution of billed (bottom 5%, upper 5%)]


Page 19: The Big Data Exploratorium

Previous Work in Health Care...

... @hadleywickham is a #ballR http://had.co.nz

[Figure: two panels, bill volume (0 to 120,000) and dollar density (0 to 1.4e+09), plotted against amount ($) on a log scale from 10^1 to 10^7; series: Billed, First Audit, Second Audit]


Page 20: The Big Data Exploratorium

Health Care Data & Code Samples...

...Hahaha Just Kidding


Page 21: The Big Data Exploratorium

But actually:

• Qmedtrix R&D team members made source contributions, see:

• Homer Strong https://github.com/strongh @homerstrong (Lucky Sort)

• Kevin Lynagh https://github.com/lynaghk (Keming Labs)


Page 22: The Big Data Exploratorium

Exploratorium #1 Patent Networks

citations amongst top 10k most cited patents


Page 23: The Big Data Exploratorium

Graphviz Art is Pretty!

Grab the graph data: ~/exploratorium/patents/toplinks.dot


Page 24: The Big Data Exploratorium

Graphviz can graph really big graphs... but they get hard to use ->

<- Psychedelic Patents


Page 25: The Big Data Exploratorium

Graphviz - Play with Graphs (http://www.graphviz.org)

• sudo port install graphviz or sudo apt-get install graphviz

• graphing commands: dot,neato,twopi,circo,fdp

• dot -Tpdf file.dot -o file.pdf

• More options here:

• http://www.graphviz.org/content/command-line-invocation

• Fun options are in the .dot file:

• http://www.graphviz.org/content/dot-language
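As a rough illustration of the workflow above, here is a small Python sketch that writes an edge list out as a DOT file and builds the matching command line. The helper names `write_dot` and `render_command`, the graph name `citations`, and the toy edges are all hypothetical; only the layout program names and the `-T` / `-o` flags come from Graphviz itself.

```python
import os
import tempfile


def write_dot(edges, path):
    """Write (citer, citee) pairs as a directed graph in DOT syntax."""
    lines = ["digraph citations {"]
    for citer, citee in edges:
        lines.append(f'    "{citer}" -> "{citee}";')
    lines.append("}")
    text = "\n".join(lines) + "\n"
    with open(path, "w") as handle:
        handle.write(text)
    return text


def render_command(dot_path, out_path, layout="dot", fmt="pdf"):
    """Build the argument list for one of the Graphviz layout programs.

    layout is one of: dot, neato, twopi, circo, fdp.
    """
    return [layout, f"-T{fmt}", dot_path, "-o", out_path]


# Demo with toy edges; nothing is rendered here, the command is only built.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "toplinks.dot")
dot_text = write_dot([("a", "b"), ("c", "b"), ("b", "d")], demo_path)
command = render_command(demo_path, os.path.join(demo_dir, "toplinks.pdf"), layout="neato")
```

With Graphviz installed, `subprocess.run(command, check=True)` would produce the PDF. Swapping `layout` is a cheap way to compare how the different engines treat the same graph.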


Page 26: The Big Data Exploratorium

Styling dots

• node [shape=point, width="0.15",color="#0000001c"];

• edge [arrowsize="0.50", color="#0000001c"];

• There are tons, read the docs and have fun

• You can also try more complex things

• Like constraints, time for example

• Sometimes too many constraints make Graphviz unhappy...
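One plausible reading of the "time" constraint is to force patents granted in the same year onto the same rank. The sketch below emits DOT text that combines the two style lines from this slide with `rank=same` groups. The function name `styled_dot`, the year mapping, and the toy data are assumptions; the slides do not show how the authors encoded time.

```python
from collections import defaultdict

# Style lines taken from the slide: tiny, mostly transparent points and arrows.
NODE_STYLE = 'node [shape=point, width="0.15", color="#0000001c"];'
EDGE_STYLE = 'edge [arrowsize="0.50", color="#0000001c"];'


def styled_dot(edges, year_of):
    """Return DOT text where nodes sharing a year are pinned to the same rank."""
    by_year = defaultdict(list)
    for node, year in year_of.items():
        by_year[year].append(node)

    lines = ["digraph citations {", "    " + NODE_STYLE, "    " + EDGE_STYLE]
    for year in sorted(by_year):
        members = " ".join(f'"{node}";' for node in sorted(by_year[year]))
        lines.append(f"    {{ rank=same; {members} }}  // granted in {year}")
    for citer, citee in edges:
        lines.append(f'    "{citer}" -> "{citee}";')
    lines.append("}")
    return "\n".join(lines)


# Toy data (hypothetical): three patents from 1990 and 1995.
toy_years = {"a": 1995, "b": 1990, "c": 1995}
toy_dot = styled_dot([("a", "b"), ("c", "b")], toy_years)
```

Each `rank=same` group is one more constraint for the layout engine to satisfy, which is likely where large graphs start to struggle.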


Page 27: The Big Data Exploratorium


Page 28: The Big Data Exploratorium

UbiGraph

• We loved UbiGraph, but don’t know an OSS alternative

• Renders many nodes (50k+) in 3D with a real-time force-directed layout.

• On a Mac Pro with 16 GB of RAM

• Shout out to Apple: thank you for supporting our research!

• It’s ‘free’ but development has stalled and since it’s closed source we can’t build on it!

• Alternatives?


Page 29: The Big Data Exploratorium

Exploratorium #2

• Making graphs of language using Python, Redis, R, and a bunch of awesome libraries

• Thanks

• @hadleywickham

• @homerstrong

• @antirez

• Bryan Lewis (http://illposed.net/)


Page 30: The Big Data Exploratorium

...how?

Mine — Munge — Visualize


Page 31: The Big Data Exploratorium

...how?

github.com/peppern/exploratorium

[ brew | apt-get | port ] install redis

www.r-project.org

github.com/qwzybug/rredis

redis TTR package


Page 32: The Big Data Exploratorium

Best show on TV

Page 36: The Big Data Exploratorium

Best show on TV


Page 37: The Big Data Exploratorium

Mine the data

• gutenberg.org

• google.com/ngrams

• APIs — Twitter, etc.

• http://code.google.com/apis/socialgraph/

• Scrape


Page 39: The Big Data Exploratorium

Store the data

Postgres is not too shabby


Page 40: The Big Data Exploratorium

Store the data

SELECT cite AS patent_num, count
FROM (SELECT cite, count(*) AS count FROM citations GROUP BY cite) AS t1
ORDER BY t1.count DESC
LIMIT 10
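To try the "most cited patents" query without a Postgres server, here is a sketch that uses Python's built-in sqlite3 module as a stand-in. The two-column `citations` table, the toy rows, and the `n_citations` alias are assumptions for illustration; the real schema is not shown in the slides.

```python
import sqlite3

# In-memory SQLite database standing in for Postgres (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citations (patent_num TEXT, cite TEXT)")
conn.executemany(
    "INSERT INTO citations VALUES (?, ?)",
    [("a", "b"), ("c", "b"), ("b", "d")],  # (citing patent, cited patent)
)

# Same idea as the slide's query: count rows per cited patent, largest first.
# The secondary sort on patent_num only makes ties deterministic.
top_cited = conn.execute(
    """
    SELECT cite AS patent_num, count(*) AS n_citations
    FROM citations
    GROUP BY cite
    ORDER BY n_citations DESC, patent_num
    LIMIT 10
    """
).fetchall()
```

Here the grouping and ordering happen in a single SELECT rather than a subquery; both forms should return the same rows for this query.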


Page 41: The Big Data Exploratorium

Store the data

SELECT "cite", count(*), "year"
FROM "citations"
INNER JOIN (SELECT date_part('year', "grantdate") AS "year", "patent_num" AS "patent_num" FROM "patents") AS "t1"
USING ("patent_num")
WHERE (cite IN (12345))
GROUP BY "year", "cite"


Page 42: The Big Data Exploratorium

Store the data

SELECT term, count
FROM (SELECT term, count(*)
      FROM (SELECT patent_num, term FROM tfidfs WHERE (tfidf > 0.05)) AS "t1"
      INNER JOIN (SELECT *
                  FROM (SELECT patent_num FROM patent_lengths WHERE (wordcount > 10)) AS "t1"
                  INNER JOIN (SELECT * FROM patents
                              WHERE (grantdate > '1990-01-01' AND grantdate < '2000-01-01')) AS "t2"
                  USING ("patent_num")) AS "t2"
      USING ("patent_num")
      GROUP BY "term") AS "t3"
ORDER BY count DESC
LIMIT 50;


Page 43: The Big Data Exploratorium

Store the data


Page 44: The Big Data Exploratorium

Store the data

NoSQL is a good fit for web data


Page 48: The Big Data Exploratorium

Reshape the data

citer  citee
a      b
c      b
b      d

{ a : [b], c : [b], b : [d] }

{ b : [a, c], d : [b] }
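The reshaping on this slide, from a two-column edge table to a pair of lookup dictionaries, can be sketched in a few lines of Python. The function name `adjacency` and the variable names are hypothetical.

```python
from collections import defaultdict


def adjacency(edges):
    """Turn (citer, citee) rows into two dictionaries.

    cites[x]    -> patents that x cites
    cited_by[x] -> patents that cite x
    """
    cites = defaultdict(list)
    cited_by = defaultdict(list)
    for citer, citee in edges:
        cites[citer].append(citee)
        cited_by[citee].append(citer)
    return dict(cites), dict(cited_by)


edges = [("a", "b"), ("c", "b"), ("b", "d")]
cites, cited_by = adjacency(edges)
```

Keeping both directions costs twice the memory but makes "who cites this?" and "what does this cite?" equally cheap to answer.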


Page 49: The Big Data Exploratorium

Redis

In-Memory Data Structure Server


Page 50: The Big Data Exploratorium

Redis


Page 51: The Big Data Exploratorium

Redis

• HSET key name value

• SADD key value

• ZUNIONSTORE

• HSETNX

• BRPOPLPUSH

• …


Page 55: The Big Data Exploratorium

Redis

Global variable for all your programs

Memcached with structure

Really fast


Page 56: The Big Data Exploratorium

Redis

Global variable for all your programs

Memcached with structure

Really really fast


Page 57: The Big Data Exploratorium

Redis

Global variable for all your programs

Memcached with structure

Really, really, astonishingly fast


Page 58: The Big Data Exploratorium

Redis

Global variable for all your programs

Memcached with structure

No, faster than that


Page 59: The Big Data Exploratorium

Reddit


Page 72: The Big Data Exploratorium

Reddit

• Count words by hour

ZSET subreddit:2011-06-21:12 -> word [count]

• Comment network

SET thread_id:comments -> “parent_id:child_id”

• User network

SET thread_id:users -> “parent_id:child_id”

SET subreddit:threads -> thread_id
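The key layout on this slide (a sorted set of word counts per subreddit and hour, plus plain sets for comment edges, user edges, and thread ids) could be filled in roughly as follows. This is a sketch under several assumptions: `subreddit` and `thread_id` in the key names are treated as placeholders for real values, the user-network members are assumed to be "parent author:child author" pairs, and the comment dictionary fields are hypothetical. `FakeRedis` is a hypothetical in-memory stand-in that mimics only the redis-py calls `zincrby(name, amount, value)` and `sadd(name, *values)`, so the example runs without a server.

```python
import re
from collections import defaultdict


class FakeRedis:
    """In-memory stand-in for the two redis-py methods used below (hypothetical)."""

    def __init__(self):
        self.zsets = defaultdict(lambda: defaultdict(float))
        self.sets = defaultdict(set)

    def zincrby(self, name, amount, value):
        self.zsets[name][value] += amount
        return self.zsets[name][value]

    def sadd(self, name, *values):
        before = len(self.sets[name])
        self.sets[name].update(values)
        return len(self.sets[name]) - before


def record_comment(client, subreddit, thread_id, hour, comment):
    """Store one comment using the key layout from the slide."""
    # ZSET subreddit:hour -> word [count]
    for word in re.findall(r"[a-z']+", comment["body"].lower()):
        client.zincrby(f"{subreddit}:{hour}", 1, word)
    # SET thread_id:comments -> "parent_id:child_id"
    client.sadd(f"{thread_id}:comments", f"{comment['parent_id']}:{comment['id']}")
    # SET thread_id:users -> "parent:child" (assumed to mean author names)
    client.sadd(f"{thread_id}:users", f"{comment['parent_author']}:{comment['author']}")
    # SET subreddit:threads -> thread_id
    client.sadd(f"{subreddit}:threads", thread_id)


client = FakeRedis()
record_comment(
    client,
    subreddit="programming",
    thread_id="t3_abc",
    hour="2011-06-21:12",
    comment={
        "id": "c2",
        "parent_id": "c1",
        "author": "bob",
        "parent_author": "alice",
        "body": "Graphs, graphs everywhere",
    },
)
```

Passing a real `redis.Redis()` client in place of `FakeRedis()` should work the same way with a recent redis-py, since the function only touches those two methods.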


Page 73: The Big Data Exploratorium

Reddit

github.com/peppern/exploratorium

[ brew | apt-get | port ] install redis

www.r-project.org

github.com/qwzybug/rredis

redis TTR package


Page 74: The Big Data Exploratorium

Reddit

(demo)


Page 77: The Big Data Exploratorium

Reddit

Go forth and graph!

#exploratorium #osb11

We will hire you.

For reals.


Page 78: The Big Data Exploratorium

You Are Now Leaving the Big Data Exploratorium

Please ensure you have your valuables.

Noah Pepper @noahmp

Devin Chalmers @qwzybug

#exploratorium #osb11
