monkigras 2012: networks of data
Post on 21-May-2015
networks of data
Matt Biddulph, @mattb | matt@hackdiary.com
Every data scientist has their own favourite way of representing their data. For some people it’s Excel, and they think in rows and columns. For others it’s matrices, and they use linear algebra to interrogate their data. For me, it’s graphs.
We’re all pretty used to the idea that you can model human relationships in a social graph.
“Social network analysis views social relationships in terms of network theory consisting of nodes and ties. Nodes are the individual actors within the networks, and ties are the relationships between the actors.”
There’s a pretty deep area of mathematical study called Social Network Analysis that goes back at least 20 years. It tries to create insight by analysing the structure of social networks, and usually doesn’t incorporate any elements of culture or sociology in doing so.
Centrality measures
It led to the creation of techniques like centrality measures, which try to find the nodes that are most central to the network. These might be the kind of people on Twitter who have the highest chance of being retweeted.
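As a rough illustration of the idea, here is a minimal sketch of one of the simplest centrality measures, in-degree centrality, applied to a follower graph. The talk does not say which measure was used, so this choice is an assumption, and the edge list and names below are invented for the example.

```python
from collections import defaultdict

def in_degree_centrality(follows):
    """Score each node by the fraction of other nodes that follow it.

    `follows` is an iterable of (follower, followed) pairs.
    """
    nodes = set()
    followers = defaultdict(set)
    for follower, followed in follows:
        nodes.update((follower, followed))
        if follower != followed:  # ignore self-follows
            followers[followed].add(follower)
    # Normalise by the number of *other* nodes in the graph.
    denom = max(len(nodes) - 1, 1)
    return {node: len(followers[node]) / denom for node in nodes}

# Hypothetical follower edges, invented for illustration.
edges = [("ann", "bob"), ("cat", "bob"), ("dan", "bob"), ("bob", "ann")]
scores = in_degree_centrality(edges)
most_central = max(scores, key=scores.get)
```

In-degree only counts direct followers. Measures such as betweenness or eigenvector centrality also take the rest of the network into account, at the cost of more computation.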
Community detection
There are also community detection algorithms that try to find the most tightly-knit subgraphs and cluster those nodes together. If you ran this over the network of people I follow on Twitter, it might be able to pick out my work colleagues or the people I socialise with face-to-face.
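To make "tightly-knit" a little more concrete, here is a deliberately simplified sketch. It is not one of the standard community detection algorithms; it assumes that a tie only counts as strong when its two endpoints share at least one mutual neighbour, drops every other tie, and reports the connected groups that remain. The function name and sample edges are hypothetical.

```python
from collections import defaultdict

def tight_knit_groups(edges):
    """Group nodes that are joined by ties with at least one mutual neighbour."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Keep only edges whose endpoints have a neighbour in common.
    strong = defaultdict(set)
    for a, b in edges:
        if neighbours[a] & neighbours[b]:
            strong[a].add(b)
            strong[b].add(a)

    # Connected components over the remaining strong ties.
    groups, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(strong[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

# Two invented triangles joined by a single weak tie (cat-dan).
edges = [
    ("ann", "bob"), ("bob", "cat"), ("ann", "cat"),
    ("cat", "dan"),
    ("dan", "eve"), ("eve", "fay"), ("dan", "fay"),
]
groups = tight_knit_groups(edges)
```

On real data a modularity-based method would usually give better results; the heuristic above is only meant to show the shape of the problem.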
People you may know
Sites like LinkedIn build almost-telepathic “people you may know” features by walking around the graph starting at your node and looking for people that show up a lot in your neighbourhood that you haven’t connected with yet.
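The graph walk described here can be sketched as a friend-of-friend count. This is only an assumption about the general approach, not a description of how LinkedIn implements it, and the graph below is invented.

```python
from collections import Counter

def people_you_may_know(graph, me, limit=3):
    """Rank people two hops away by how many of my connections know them.

    `graph` maps each person to the set of people they are connected to.
    """
    mine = graph.get(me, set())
    counts = Counter()
    for friend in mine:
        for candidate in graph.get(friend, set()):
            # Skip myself and anyone I am already connected to.
            if candidate != me and candidate not in mine:
                counts[candidate] += 1
    # Highest count first; break ties by name so the result is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:limit]]

# Hypothetical connections, invented for illustration.
graph = {
    "me": {"ann", "bob"},
    "ann": {"me", "cat", "dan"},
    "bob": {"me", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann"},
}
suggestions = people_you_may_know(graph, "me")
```

Here "cat" ranks above "dan" because two of my connections know her, while only one knows him.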
To demonstrate what these techniques can do, I downloaded some data from Github’s API. I wanted to identify and map London’s most-connected developers.
[Diagram: follower network of London developers on Github, with each node labelled by its Github username]
This diagram, created in 2009, has several dimensions. Each node is a London developer with a github account. Lines show follower relationships. Nodes are sized according to number of followers, and coloured according to network centrality (red for most-central). The layout shows community structure - for example the top-left cluster is mostly Perl developers.
Let’s go beyond purely social data. James Governor suggested I explore the connection between music taste and choice of programming language. I wrote a script to correlate last.fm usernames with github usernames and created a graph structure linking the music genre taste of each developer to the languages their github projects are implemented in.
This diagram is just a small sample amongst the people I follow on Github and last.fm - not enough to provide a statistically-significant judgement.
In this small sample we can see that my Ruby-coding friends tend towards singer-songwriter acoustic folk, and the Javascript coders are all about rock and indie.
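The original script is not shown in the talk, so the following is only a sketch of how such a graph could be assembled once usernames have been matched. It assumes two hypothetical lookups, one from developer to languages and one from developer to genres, and counts a weighted edge for every language and genre that co-occur for the same developer. All data below is invented and says nothing about real listening habits.

```python
from collections import Counter

def language_genre_edges(languages_by_dev, genres_by_dev):
    """Count how often each (language, genre) pair co-occurs for a developer."""
    edges = Counter()
    for dev, languages in languages_by_dev.items():
        for language in languages:
            # Developers with no matched music profile contribute nothing.
            for genre in genres_by_dev.get(dev, ()):
                edges[(language, genre)] += 1
    return edges

# Hypothetical, hand-written sample data.
languages_by_dev = {
    "dev1": ["Ruby"],
    "dev2": ["Ruby", "JavaScript"],
    "dev3": ["JavaScript"],
}
genres_by_dev = {
    "dev1": ["folk"],
    "dev2": ["folk", "indie"],
    "dev3": ["indie", "rock"],
}
edges = language_genre_edges(languages_by_dev, genres_by_dev)
```

The edge weights can then be handed to a graph layout tool, with thicker lines for stronger language-to-genre links.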
This is a great book that goes into these techniques in depth. However, it’s useful for any networked data, not just social networks. And it’s useful to anyone, not just startups.
So let’s take a step back and think about what other kinds of graph we could form, from what kinds of data.
I used to work in location apps at Nokia, and so I naturally think of places. Wouldn’t it be interesting to study the connections between cities instead of people? For example, people probably fly more often between NYC and LA than they do between NYC and New Jersey. We could re-draw the map based on closeness in the travel network.
In 2011 I turned to the Hadoop cluster at Nokia and took a sample of several weeks of logs from our routing servers. These are used every time someone uses our maps application to request a driving route from one place to another. Every time someone drove from A to B, I made an edge in a “place graph” from A to B.
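The edge-building step can be sketched in a few lines. The real job ran on Hadoop over routing-server logs whose format is not described, so this assumes each request has already been reduced to a hypothetical (origin, destination) pair of place names; the sample requests are invented.

```python
from collections import Counter

def build_place_graph(route_requests):
    """Turn (origin, destination) route requests into weighted directed edges."""
    weights = Counter()
    for origin, destination in route_requests:
        # One more trip from origin to destination strengthens that edge.
        weights[(origin, destination)] += 1
    return weights

# Hypothetical route requests, invented for illustration.
requests = [
    ("London", "Brighton"),
    ("London", "Brighton"),
    ("Brighton", "London"),
    ("London", "Oxford"),
]
place_graph = build_place_graph(requests)
```

Keeping the edges directed preserves the "from A to B" sense of the original description; the two directions could be summed if only closeness matters.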
I ran the data through Gephi and asked it to cluster it based on the strength of connections between towns. The result is a not-quite-geographic new map of the world, where two cities are close to each other if people often drive between them.
[Diagram: clustered place graph. Cluster labels: UK, China, Korea, Japan, etc, Spain, Most of Europe, Russia, Finland, India, Pakistan]
As you’d expect, the UK is an island and so people don’t drive in and out of it very often. Spain and Portugal are not islands, but they appear separate because they’re attached to the rest of Europe by a very narrow neck of land. So people are much more likely to fly than drive out of Spain.
Times Square (New York) = Piccadilly Circus (London)
What kind of questions can this data answer? Say I’m coming to London for the first time and I’m familiar with New York. I could ask a friend what the equivalent of Times Square is in London. If they know both towns, they’d probably tell me that Piccadilly Circus is the Times Square of London.
What is the Holborn of Amsterdam?
... the De Pijp of New York?
... the Williamsburg of London?
But if we delve into the place graph, we could answer much more interesting questions, and create a “neighbourhood isomorphism” from city to city. People who like the Mission in SF and Shoreditch in London could find out that Williamsburg is probably the best place for them to stay in New York.
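The talk does not say how such a mapping would be computed. One possible sketch, under the assumption that we know which neighbourhoods each person likes, is to score candidate neighbourhoods in the target city by how strongly their fans overlap with the places you already like. The function name, the scoring rule, and all data below are hypothetical.

```python
from collections import Counter

def suggest_neighbourhood(likes, my_places, candidates):
    """Pick the candidate most liked by people who share my tastes.

    `likes` maps each person to the set of neighbourhoods they like.
    """
    my_places = set(my_places)
    scores = Counter()
    for person, places in likes.items():
        overlap = len(my_places & places)
        if overlap == 0:
            continue  # this person shares none of my tastes
        for place in places & candidates:
            # Weight each vote by how much taste we have in common.
            scores[place] += overlap
    if not scores:
        return None
    # Highest score wins; ties are broken alphabetically.
    return min(scores, key=lambda place: (-scores[place], place))

# Hypothetical preferences, invented for illustration.
likes = {
    "p1": {"Mission", "Shoreditch", "Williamsburg"},
    "p2": {"Mission", "Williamsburg"},
    "p3": {"Shoreditch", "Midtown"},
    "p4": {"Soho", "Midtown"},
}
best = suggest_neighbourhood(
    likes, ["Mission", "Shoreditch"], {"Williamsburg", "Midtown"}
)
```

With this invented data the suggestion is Williamsburg, since the people who like both the Mission and Shoreditch also like it.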
the Place Graph is just like the Social Graph
This is just one example of viewing data as a graph and then using Social Graph analytics on it. There are many more possible - the link structure of Wikipedia, the co-occurrence of topics in a newspaper, the implicit social network of @replies on Twitter, etc.
Matt Biddulph, @mattb | matt@hackdiary.com
Thanks!