monkigras 2012: networks of data


Upload: matt-biddulph

Post on 21-May-2015

DESCRIPTION

How to think of your data as a graph, and apply social network analysis to understand it.

TRANSCRIPT

Page 1: Monkigras 2012: Networks Of Data

networks of data

Matt Biddulph @mattb | [email protected]

Every data scientist has their own favourite way of representing their data. For some people it’s Excel, and they think in rows and columns. For others it’s matrices, and they use linear algebra to interrogate their data. For me, it’s graphs.

Page 2: Monkigras 2012: Networks Of Data

We’re all pretty used to the idea that you can model human relationships in a social graph.

Page 3: Monkigras 2012: Networks Of Data

“Social network analysis views social relationships in terms of network theory consisting of nodes and ties. Nodes are the individual actors within the networks, and ties are the relationships between the actors.”

There’s a pretty deep area of mathematical study called Social Network Analysis that goes back at least 20 years. It tries to create insight by analysing the structure of social networks, and usually doesn’t incorporate any elements of culture or sociology in doing so.

Page 4: Monkigras 2012: Networks Of Data

Centrality measures

It led to the creation of techniques like centrality measures, which try to find the nodes that are most central to the network. These might be the kind of people on Twitter who have the highest chance of being retweeted.
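As a rough illustration of one of the simplest centrality measures, here is a minimal sketch of in-degree centrality on a follower graph. The edge list and the names in it are invented for the example, and real analyses often use richer measures (betweenness, PageRank) than the one shown here.

```python
from collections import defaultdict

# Hypothetical follower edges as (follower, followed); the names are invented.
FOLLOWS = [
    ("ann", "bob"),
    ("cat", "bob"),
    ("dan", "bob"),
    ("bob", "cat"),
    ("dan", "cat"),
    ("ann", "dan"),
]


def in_degree_centrality(edges):
    """Return {node: number of followers / (n - 1)} for a directed graph."""
    nodes = set()
    followers = defaultdict(set)
    for src, dst in edges:
        nodes.update((src, dst))
        followers[dst].add(src)
    # Normalise by the largest possible number of followers.
    denom = max(len(nodes) - 1, 1)
    return {node: len(followers[node]) / denom for node in sorted(nodes)}


centrality = in_degree_centrality(FOLLOWS)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

In this toy graph the node followed by everyone else comes out on top, which matches the intuition of "most likely to be retweeted".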

Page 5: Monkigras 2012: Networks Of Data

Community detection

There are also community detection algorithms that try to find the most tightly-knit subgraphs and cluster those nodes together. If you ran this over the network of people I follow on Twitter, it might be able to pick out my work colleagues or the people I socialise with face-to-face.
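To make the idea concrete, here is a small sketch of label propagation, one of several community detection algorithms. It is not necessarily the one used for the diagrams in this talk. The graph is invented, and the tie-breaking rule (keep the current label if it is tied for best, otherwise take the largest label) is an assumption made only so that the result is deterministic.

```python
from collections import Counter, defaultdict

# Hypothetical undirected graph: two triangles joined by the single tie c-d.
EDGES = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
]


def label_propagation(edges, max_rounds=20):
    """Group nodes by repeatedly adopting the most common neighbour label."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    # Every node starts in its own community.
    labels = {node: node for node in neighbours}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbours):
            counts = Counter(labels[n] for n in neighbours[node])
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            # Assumed tie-break: stay put if possible, else take the largest label.
            new = labels[node] if labels[node] in candidates else max(candidates)
            if new != labels[node]:
                labels[node] = new
                changed = True
        if not changed:
            break

    groups = defaultdict(set)
    for node, lab in labels.items():
        groups[lab].add(node)
    return sorted(sorted(group) for group in groups.values())


communities = label_propagation(EDGES)
print(communities)
```

On this toy input the two triangles end up as separate communities. On real data the outcome of label propagation depends on update order and tie-breaking, so results should be treated as suggestive rather than definitive.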

Page 6: Monkigras 2012: Networks Of Data

People you may know

Sites like LinkedIn build almost-telepathic “people you may know” features by walking around the graph starting at your node and looking for people that show up a lot in your neighbourhood that you haven’t connected with yet.
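The walk described here can be sketched as a friends-of-friends count. This is only an illustration of the general idea, not LinkedIn's actual method, and the connection data is made up.

```python
from collections import Counter, defaultdict

# Hypothetical undirected connections; the names are invented.
CONNECTIONS = [
    ("me", "ana"), ("me", "ben"), ("me", "cy"),
    ("ana", "dee"), ("ben", "dee"), ("cy", "dee"),
    ("ana", "eli"), ("ben", "cy"),
]


def people_you_may_know(edges, person, limit=3):
    """Rank non-connections by how many of your connections know them."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    counts = Counter()
    for friend in neighbours[person]:
        for candidate in neighbours[friend]:
            # Skip yourself and anyone you are already connected to.
            if candidate != person and candidate not in neighbours[person]:
                counts[candidate] += 1

    # Most mutual connections first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


suggestions = people_you_may_know(CONNECTIONS, "me")
print(suggestions)
```

Here "dee" shows up three times in the neighbourhood of "me" and is suggested first, ahead of "eli" with a single mutual connection.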

Page 7: Monkigras 2012: Networks Of Data

To demonstrate what these techniques can do, I downloaded some data from GitHub’s API. I wanted to identify and map London’s most-connected developers.

Page 8: Monkigras 2012: Networks Of Data

[Diagram: follower network of London developers on GitHub, with each node labelled by username.]

This diagram, created in 2009, has several dimensions. Each node is a London developer with a GitHub account. Lines show follower relationships. Nodes are sized according to number of followers and coloured according to network centrality (red for most central). The layout shows community structure: for example, the top-left cluster is mostly Perl developers.

Page 9: Monkigras 2012: Networks Of Data

[Diagram: the same follower network of London developers on GitHub, shown again.]

Page 10: Monkigras 2012: Networks Of Data

Let’s go beyond purely social data. James Governor suggested I explore the connection between music taste and choice of programming language. I wrote a script to correlate last.fm usernames with GitHub usernames and created a graph structure linking the music genre taste of each developer to the languages their GitHub projects are implemented in.
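The original script is not shown in the slides. As a hedged sketch of the graph-building step only, assume the username matching has already been done and each developer record carries a list of genres and a list of languages. The record layout and the sample values below are invented.

```python
from collections import Counter
from itertools import product

# Hypothetical, already-matched records (one per developer); values are invented.
DEVELOPERS = [
    {"user": "dev1", "genres": ["folk", "acoustic"], "languages": ["Ruby"]},
    {"user": "dev2", "genres": ["rock", "indie"], "languages": ["JavaScript"]},
    {"user": "dev3", "genres": ["folk"], "languages": ["Ruby", "JavaScript"]},
]


def genre_language_edges(developers):
    """Weight each (genre, language) edge by the number of developers sharing it."""
    weights = Counter()
    for dev in developers:
        # Using sets means each developer counts at most once per pair.
        for genre, language in product(set(dev["genres"]), set(dev["languages"])):
            weights[(genre, language)] += 1
    return weights


edges = genre_language_edges(DEVELOPERS)
for (genre, language), weight in sorted(edges.items()):
    print(genre, language, weight)
```

The result is a bipartite graph with genres on one side and languages on the other. The edge weights could then be exported to a tool such as Gephi for layout.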

Page 11: Monkigras 2012: Networks Of Data

This diagram is just a small sample drawn from the people I follow on GitHub and last.fm, not enough to support a statistically significant conclusion.

Page 12: Monkigras 2012: Networks Of Data

In this small sample we can see that my Ruby-coding friends tend towards singer-songwriter acoustic folk, and the JavaScript coders are all about rock and indie.

Page 13: Monkigras 2012: Networks Of Data

This is a great book that goes into these techniques in depth. However, it’s useful for any networked data, not just social networks. And it’s useful to anyone, not just startups.

Page 16: Monkigras 2012: Networks Of Data

So let’s take a step back and think about what other kinds of graph we could form, from what kinds of data.

Page 17: Monkigras 2012: Networks Of Data

I used to work in location apps at Nokia, and so I naturally think of places. Wouldn’t it be interesting to study the connections between cities instead of people? For example, people probably fly more often between NYC and LA than they do between NYC and New Jersey. We could re-draw the map based on closeness in the travel network.

Page 18: Monkigras 2012: Networks Of Data

In 2011 I turned to the Hadoop cluster at Nokia and took a sample of several weeks of logs from our routing servers. These are used every time someone uses our maps application to request a driving route from one place to another. Every time someone drove from A to B, I made an edge in a “place graph” from A to B.
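The real job ran on Hadoop over server logs whose format is not given here. The sketch below shows only the counting logic on an in-memory list, assuming each route request has already been reduced to an (origin, destination) pair of place names. Skipping requests that start and end in the same place is an assumption of this example.

```python
from collections import Counter

# Hypothetical route requests as (origin, destination); the values are invented.
ROUTE_REQUESTS = [
    ("Berlin", "Potsdam"),
    ("Berlin", "Potsdam"),
    ("Potsdam", "Berlin"),
    ("Berlin", "Leipzig"),
    ("Berlin", "Berlin"),
]


def build_place_graph(requests):
    """Count one directed edge per route request from A to B."""
    edges = Counter()
    for origin, destination in requests:
        if origin != destination:  # assumption: ignore trips within one place
            edges[(origin, destination)] += 1
    return edges


def undirected_strength(edges):
    """Merge A->B and B->A counts into a single connection strength."""
    strength = Counter()
    for (a, b), count in edges.items():
        strength[tuple(sorted((a, b)))] += count
    return strength


place_graph = build_place_graph(ROUTE_REQUESTS)
strength = undirected_strength(place_graph)
print(strength.most_common())
```

The merged strengths are the kind of edge weight a clustering or layout tool can use to pull strongly connected towns together.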

Page 19: Monkigras 2012: Networks Of Data

I ran the data through Gephi and asked it to cluster it based on the strength of connections between towns. The result is a not-quite-geographic new map of the world, where two cities are close to each other if people often drive between them.

Page 20: Monkigras 2012: Networks Of Data

[Map: clustered place graph, with cluster labels UK, China, Korea, Japan, etc, Spain, Most of Europe, Russia, Finland, India, Pakistan.]

As you’d expect, the UK is an island and so people don’t drive in and out of it very often. Spain and Portugal are not islands, but they appear separate because they’re attached to the rest of Europe by a very narrow neck of land. So people are much more likely to fly than drive out of Spain.

Page 21: Monkigras 2012: Networks Of Data

Times Square (New York) = Piccadilly Circus (London)

What kind of questions can this data answer? Say I’m coming to London for the first time and I’m familiar with New York. I could ask a friend what the equivalent of Times Square is in London. If they know both towns, they’d probably tell me that Times Square is the Piccadilly Circus of New York.

Page 22: Monkigras 2012: Networks Of Data

What is the Holborn of Amsterdam?

... the De Pijp of New York?

... the Williamsburg of London?

But if we delve into the place graph, we could answer much more interesting questions, and create a “neighbourhood isomorphism” from city to city. People who like the Mission in SF and Shoreditch in London could find out that Williamsburg is probably the best place for them to stay in New York.
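The talk does not say how such a mapping would be computed. One possible approach, sketched here purely as an assumption, is to compare neighbourhoods by the overlap of the people associated with them, using Jaccard similarity. The user sets and the extra New York neighbourhoods are invented for the example.

```python
# Hypothetical data: which (invented) users are associated with each neighbourhood.
FANS = {
    ("San Francisco", "Mission"): {"u1", "u2", "u3"},
    ("London", "Shoreditch"): {"u1", "u2", "u4"},
    ("New York", "Williamsburg"): {"u1", "u2", "u3", "u4"},
    ("New York", "Midtown"): {"u5", "u6"},
    ("New York", "Upper East Side"): {"u3", "u5"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(fans, liked, target_city):
    """Pick the neighbourhood in target_city most similar to the liked ones."""
    # Pool everyone associated with any of the liked neighbourhoods.
    pooled = set().union(*(fans[key] for key in liked))
    scores = {
        key: jaccard(pooled, people)
        for key, people in fans.items()
        if key[0] == target_city and key not in liked
    }
    return max(scores, key=scores.get), scores


liked = [("San Francisco", "Mission"), ("London", "Shoreditch")]
match, scores = best_match(FANS, liked, "New York")
print(match, scores[match])
```

With this toy data the best New York match for fans of the Mission and Shoreditch is Williamsburg. A version built on the place graph itself might compare neighbourhoods by their connection patterns rather than by shared users.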

Page 23: Monkigras 2012: Networks Of Data

the Place Graph is just like the Social Graph

This is just one example of viewing data as a graph and then using Social Graph analytics on it. There are many more possible - the link structure of Wikipedia, the co-occurrence of topics in a newspaper, the implicit social network of @replies on Twitter, etc.
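As one example of turning such data into a graph, here is a minimal sketch that extracts an implicit reply network from tweets. The (author, text) tuples are invented, and treating only a leading @mention as a reply is a simplifying assumption.

```python
import re
from collections import Counter

# Hypothetical tweets as (author, text); the content is invented.
TWEETS = [
    ("alice", "@bob nice talk!"),
    ("bob", "@alice thanks"),
    ("alice", "@bob see you at the pub"),
    ("carol", "Graphs everywhere"),
    ("carol", "@alice @bob agreed"),
]

# Assumption: a reply is a tweet whose text starts with an @mention.
REPLY = re.compile(r"^@(\w+)")


def reply_graph(tweets):
    """Count directed edges author -> person replied to."""
    edges = Counter()
    for author, text in tweets:
        found = REPLY.match(text)
        if found:
            target = found.group(1).lower()
            if target != author.lower():  # ignore replies to yourself
                edges[(author, target)] += 1
    return edges


replies = reply_graph(TWEETS)
print(sorted(replies.items()))
```

The resulting weighted, directed edges can be fed into the same centrality and community detection techniques as an explicit follower graph.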

Page 24: Monkigras 2012: Networks Of Data

Matt Biddulph @mattb | [email protected]

Thanks!