the big data exploratorium

Post on 20-Aug-2015


The Big Data Exploratorium

A guided tour of open source data analysis tools

Noah Pepper (@noahmp), Devin Chalmers (@qwzybug)

#exploratorium @osb11

1Thursday, June 23, 2011

Hi,

• We’re here because...

• We are...

• Data Exploration Is...

• Example 1: Patents

• (Chalmers et al. 2010; Buchanan et al. 2010; Pepper et al. 2008)

• Example 2: Health Care

• (Pepper et al. Visweek 2010)

Hi,

• Exploratorium #1

• Patent citation networks

• Graphviz

• NetworkX

• Exploratorium #2

• Reddit comment word usages

Hi,

• Get the code & data samples:

• git clone git@github.com:peppern/exploratorium.git

We’re here because...

• There is a really amazing OSS community in the data space.

• This is fantastic news for academics, hobbyists, and professionals alike.

• We want to show what you can do with open source tools, and show you the ones we like.

• We’d love to hear what YOUR favorites are; tweet #exploratorium to tell us.

• Data exploration is fun...

We are...

• Academic Data Junkies

• We’re Sorta Lucky

Our startup, where we build data exploration platforms

Our academic home. Research focuses on exploring the nature of evolutionary activity through data mining

Noah Pepper - @noahmp, Devin Chalmers - @qwzybug

We Build Data Exploration Tools!

map.clearhealthcosts.com

What is data exploration, and what is an exploratorium?

• Narrow Definition

• Data exploration is having an iterative relationship with your data, analysis, and visualization stack where you build an intuitive cognitive model of the information visualized.

• Why do I say visualization instead of the more general ‘representation’?

exploratorium |ikˌsplôrəˈtôrēəm| noun [usu. in names]: a scientific museum or similar center at which visitors have the opportunity of performing prearranged experiments or demonstrations.

Yes! That means there’s code and data

Data Exploration Example

• Study the evolution of technology in patent records

– technology is a window on culture

– patents are a window on technology

Patent Networks

Citation Analysis of Patents

Time Series Text Analysis

Some explorations are more open ended

Pointwise Mutual Information (PMI)

# patents that contain words x and y
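PMI compares how often two words appear together with how often they would if they were independent. The sketch below assumes probabilities are estimated as the fraction of patents containing each word (the slide only mentions counting patents that contain words x and y); the function name `pmi` and the toy `patents` data are made up for illustration.

```python
import math

def pmi(docs, x, y):
    """Pointwise mutual information of words x and y over a list of word sets.

    Probabilities are estimated as document fractions (an assumption):
    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ).
    """
    n = len(docs)
    n_x = sum(1 for d in docs if x in d)
    n_y = sum(1 for d in docs if y in d)
    n_xy = sum(1 for d in docs if x in d and y in d)  # docs containing both words
    if n_xy == 0:
        return float("-inf")  # the words never co-occur
    return math.log((n_xy / n) / ((n_x / n) * (n_y / n)))

# Toy "patents", each reduced to its set of words (hypothetical data).
patents = [
    {"optical", "lens", "the"},
    {"optical", "fiber", "the"},
    {"cultivar", "plant", "the"},
    {"plant", "seed", "the"},
]

print(pmi(patents, "optical", "the"))     # 0.0: "the" is in every document
print(pmi(patents, "cultivar", "plant"))  # log(2), about 0.693
```

A word like "the" that appears everywhere scores a PMI of zero against any other word, while a specialized pair such as "cultivar" and "plant" scores above zero.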

PMI distributions

- see clusters

- different kinds of clusters

PMI Comparison: Plotting a different way

[Figure: PMI distributions for the words “the”, “optical”, and “cultivar”; labels: PMI integral, halfway rank]

- generality of content?

btw, these are older graphs; now we use ggplot2

Previous Work in Health Care...

... with @homerstrong at Qmedtrix Systems Inc.

[Figure: bill volume (0 to 500,000) by adjudication type (AMB, ASC, DME, ER, IPH, OPH, PRO); legend: placement in distribution of billed (bottom 5%, upper 5%)]

Previous Work in Health Care...

... @hadleywickham is a #ballR http://had.co.nz

[Figure, two panels: bill volume (0 to 120,000) and dollar density (0 to 1.4e+09), each against amount ($) on a log scale from 10^1 to 10^7; series: Billed, First Audit, Second Audit]

Health Care Data & Code Samples...

...Hahaha Just Kidding

But actually:

• Qmedtrix R&D team members made source contributions, see:

• Homer Strong https://github.com/strongh @homerstrong (Lucky Sort)

• Kevin Lynagh https://github.com/lynaghk (Keming Labs)

Exploratorium #1 Patent Networks

citations amongst the top 10k most cited patents
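One way to cut a citation list down to such a subgraph is to count incoming citations, keep the k most cited patents, and then keep only the citations that run between them. This is a minimal sketch, not the code used for the talk; the function name `citations_among_top` and the sample pairs are hypothetical.

```python
from collections import Counter

def citations_among_top(edges, k):
    """Keep only citations where both patents are among the k most cited."""
    counts = Counter(citee for _, citee in edges)  # incoming citations per patent
    top = {patent for patent, _ in counts.most_common(k)}
    return [(a, b) for a, b in edges if a in top and b in top]

# Hypothetical (citer, citee) pairs.
edges = [("a", "b"), ("c", "b"), ("b", "d"), ("e", "d"), ("d", "f")]
print(citations_among_top(edges, 2))  # [('b', 'd')]
```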

Graphviz Art is Pretty!

Grab the graph data: ~/exploratorium/patents/toplinks.dot

GraphViz can graph really big graphs... but they get hard to use ->

<- Psychedelic Patents

Graphviz - Play with Graphs (http://www.graphviz.org)

• sudo port install graphviz or sudo apt-get install graphviz

• graphing commands: dot, neato, twopi, circo, fdp

• dot -Tpdf file.dot -o file.pdf

• More options here:

• http://www.graphviz.org/content/command-line-invocation

• Fun options are in the .dot file:

• http://www.graphviz.org/content/dot-language

Styling dots

• node [shape=point, width="0.15",color="#0000001c"];

• edge [arrowsize="0.50", color="#0000001c"];

• There are tons, read the docs and have fun

• You can also try more complex things

• Like constraints, time for example

• Sometimes too many constraints make GraphViz unhappy...
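For a feel of how the node and edge defaults fit into a .dot file, here is a small sketch that writes DOT text from a list of citation pairs. The function name `to_dot`, the graph name, and the patent numbers are invented for illustration; only the two style lines come from the slide.

```python
def to_dot(edges, name="patents"):
    """Render (citer, citee) pairs as a Graphviz digraph in the DOT language."""
    lines = [
        f"digraph {name} {{",
        # Default styles from the slide: tiny, mostly transparent dots and arrows.
        '  node [shape=point, width="0.15", color="#0000001c"];',
        '  edge [arrowsize="0.50", color="#0000001c"];',
    ]
    for citer, citee in edges:
        lines.append(f'  "{citer}" -> "{citee}";')
    lines.append("}")
    return "\n".join(lines) + "\n"

# Hypothetical patent numbers, for illustration only.
edges = [("5000001", "4000001"), ("5000002", "4000001")]
dot_text = to_dot(edges)
print(dot_text)
```

Saving the printed text as example.dot and running `dot -Tpdf example.dot -o example.pdf` should render it; swapping `dot` for `neato` or `fdp` gives a different layout of the same file.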

UbiGraph

• We loved UbiGraph, but don’t know an OSS alternative

• Renders many nodes (50k+) in 3D with a real-time force-directed layout.

• On a Mac Pro with 16 GB of RAM

• Shout out to Apple: thank you for supporting our research!

• It’s ‘free’ but development has stalled and since it’s closed source we can’t build on it!

• Alternatives?

Exploratorium #2

• Making graphs of language using Python, Redis, R, and a bunch of awesome libraries

• Thanks

• @hadleywickham

• @homerstrong

• @antirez

• Bryan Lewis (http://illposed.net/)

...how?

Mine — Munge — Visualize

...how?

github.com/peppern/exploratorium

[ brew | apt-get | port ] install redis

www.r-project.org

github.com/qwzybug/rredis

redis TTR package

Best show on TV

Mine the data

• gutenberg.org

• google.com/ngrams

• APIs — Twitter, etc.

• http://code.google.com/apis/socialgraph/

• Scrape

Store the data

Postgres is not too shabby

Store the data

SELECT cite AS patent_num, count
FROM (SELECT cite, count(*) AS count FROM citations GROUP BY cite) AS t1
ORDER BY t1.count DESC
LIMIT 10
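To try the "most cited patents" query without a Postgres server, the sketch below runs the same shape of query against an in-memory SQLite database from Python. SQLite is swapped in purely for convenience, the `citations` schema and rows are assumptions, and the `count` alias is renamed to `n` to keep the example unambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one row per citation, from patent_num to the patent it cites.
conn.execute("CREATE TABLE citations (patent_num TEXT, cite TEXT)")
conn.executemany(
    "INSERT INTO citations VALUES (?, ?)",
    [("a", "b"), ("c", "b"), ("b", "d"), ("e", "b"), ("e", "d")],
)

# Count citations per cited patent, then list the most cited first.
top_cited = conn.execute(
    """
    SELECT cite AS patent_num, n
    FROM (SELECT cite, count(*) AS n FROM citations GROUP BY cite) AS t1
    ORDER BY t1.n DESC
    LIMIT 10
    """
).fetchall()

print(top_cited)  # [('b', 3), ('d', 2)]
```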

Store the data

SELECT "cite", count(*), "year"
FROM "citations"
INNER JOIN (
  SELECT date_part('year', "grantdate") AS "year", "patent_num"
  FROM "patents"
) AS "t1" USING ("patent_num")
WHERE "cite" IN (12345)
GROUP BY "year", "cite"

Store the data

SELECT term, count
FROM (
  SELECT term, count(*)
  FROM (
    SELECT patent_num, term FROM tfidfs WHERE (tfidf > 0.05)
  ) AS "t1"
  INNER JOIN (
    SELECT *
    FROM (
      SELECT patent_num FROM patent_lengths WHERE (wordcount > 10)
    ) AS "t1"
    INNER JOIN (
      SELECT * FROM patents
      WHERE (grantdate > '1990-01-01' AND grantdate < '2000-01-01')
    ) AS "t2" USING ("patent_num")
  ) AS "t2" USING ("patent_num")
  GROUP BY "term"
) AS "t3"
ORDER BY count DESC
LIMIT 50;

Store the data

NoSQL is a good fit for web data

Reshape the data

citer  citee
a      b
c      b
b      d

{ a : [b], c : [b], b : [d] }

{ b : [a, c], d : [b] }
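The reshaping on this slide, from citer/citee rows to a "cites" map and a "cited by" map, can be sketched in a few lines of Python. The variable names are illustrative; the rows are the ones shown on the slide.

```python
from collections import defaultdict

# (citer, citee) rows from the slide.
rows = [("a", "b"), ("c", "b"), ("b", "d")]

cites = defaultdict(list)     # citer -> patents it cites
cited_by = defaultdict(list)  # citee -> patents that cite it

for citer, citee in rows:
    cites[citer].append(citee)
    cited_by[citee].append(citer)

print(dict(cites))     # {'a': ['b'], 'c': ['b'], 'b': ['d']}
print(dict(cited_by))  # {'b': ['a', 'c'], 'd': ['b']}
```

Both maps come out of a single pass over the rows, so the same loop works when the rows are streamed from a database cursor.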

Redis

In-Memory Data Structure Server

Redis

• HSET key name value

• SADD key value

• ZUNIONSTORE

• HSETNX

• BRPOPLPUSH

• …

Redis

Global variable for all your programs

Memcached with structure

Really fast

Really really fast

Really, really, astonishingly fast

No, faster than that

Reddit

• Count words by hour

• Comment network

• User network

ZSET subreddit:2011-06-21:12 -> word [count]

SET thread_id:comments -> “parent_id:child_id”

SET thread_id:users -> “parent_id:child_id”

SET subreddit:threads -> thread_id
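The hourly word-count layout can be tried without a Redis server. In this sketch a plain Python `Counter` per key stands in for each sorted set, and each increment marks where a real client would issue `ZINCRBY key 1 word`. The helper `hour_key`, the use of UTC, and the sample comments are assumptions for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone

def hour_key(subreddit, created_utc):
    """Build a key like 'subreddit:2011-06-21:12' from a Unix timestamp (UTC)."""
    t = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return f"{subreddit}:{t:%Y-%m-%d}:{t:%H}"

# Stand-in for Redis: one Counter per key plays the role of a sorted set.
zsets = defaultdict(Counter)

# Hypothetical comments: (subreddit, created_utc, body).
comments = [
    ("programming", 1308657600, "redis is fast"),       # 2011-06-21 12:00 UTC
    ("programming", 1308659400, "really really fast"),  # 2011-06-21 12:30 UTC
    ("programming", 1308661200, "graphs are fun"),      # 2011-06-21 13:00 UTC
]

for subreddit, created_utc, body in comments:
    key = hour_key(subreddit, created_utc)
    for word in body.lower().split():
        zsets[key][word] += 1  # with Redis: ZINCRBY key 1 word

print(zsets["programming:2011-06-21:12"].most_common(2))
```

Keeping one sorted set per subreddit and hour means the top words for any hour are a single ranged read, and hours can be combined later with ZUNIONSTORE.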

Reddit

github.com/peppern/exploratorium

[ brew | apt-get | port ] install redis

www.r-project.org

github.com/qwzybug/rredis

redis TTR package

Reddit

(demo)

Reddit

Go forth and graph!

#exploratorium #osb11

We will hire you.

For reals.

You Are Now Leaving the Big Data Exploratorium

Please ensure you have your valuables.

Noah Pepper @noahmp, Devin Chalmers @qwzybug

#exploratorium #osb11
