link prediction with the linkpred tool

Post on 06-Aug-2015

Measuring scholarly impact: Methods and practice

Link prediction with the linkpred tool

Raf Guns, University of Antwerp, raf.guns@uantwerpen.be

If you want to follow along…

Download and install Anaconda Python from http://continuum.io/downloads

Download the example data from http://bit.ly/1HpZvIa

“A pair of scientists who have five mutual previous collaborators, for instance, are about twice as likely to collaborate as a pair with only two, and about 200 times as likely as a pair with none.” (Newman, 2001; emphasis mine)

Agenda

What is link prediction? (and why?)

Example data

The linkpred tool

Link prediction in practice

Conclusion

What is link prediction?

Networks

Networks in informetrics

Citation: papers, journals, authors, patents, …

Collaboration: authors, institutions, countries, …

Co-citation, bibliographic coupling, web links, and so on

Definitions

A network G = (V, E) consists of a set of nodes or vertices V and a set of links or edges E

Each link connects two nodes from V

Neighbourhood N(v) of node v: all nodes connected to v

Node degree |N(v)| of v: number of connected nodes = number of items in set N(v)
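
The definitions above can be made concrete with a minimal Python sketch. It assumes a toy undirected network stored as a set of links; the node names and the helper functions `neighbourhood` and `degree` are invented for illustration and are not part of linkpred.

```python
# Toy undirected network stored as a set of links (node names are invented).
links = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}


def neighbourhood(links, v):
    """Return N(v): all nodes connected to v by a link."""
    result = set()
    for x, y in links:
        if x == v:
            result.add(y)
        elif y == v:
            result.add(x)
    return result


def degree(links, v):
    """Return |N(v)|: the number of items in the set N(v)."""
    return len(neighbourhood(links, v))


print(sorted(neighbourhood(links, "c")))  # ['a', 'b', 'd']
print(degree(links, "c"))                 # 3
```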

Change in networks

Most networks are not static. In a collaboration network, for example: new authors appear, old authors disappear, new collaborations are initiated, previous collaborators stop collaborating

Some changes are more plausible than others

Different mechanisms have been identified

Assortativity: similar nodes are more likely to connect

Preferential attachment: well-connected nodes attract more new connections

Cf. cumulative advantage, Matthew effect

The link prediction question

Liben-Nowell and Kleinberg (2003, 2007):

“Given a snapshot of a social network, can we infer which new interactions among its members are likely to occur in the near future?”

Link prediction steps

1. Data gathering

2. Preprocessing

3. Prediction

4. Evaluation

Why link prediction?

You want to know which links will appear in the future

Recommendation

Finding missing links

Finding ‘anomalous’ links (correct or incorrect)

Evaluating network formation and evolution models

Our example data

Data

Guns and Rousseau (2013): collaboration between cities in Africa and South-Asia

Topic: malaria

In three consecutive time periods

Available as three Pajek network files: http://bit.ly/1HpZvIa

1997-2001

2002-2006

2007-2011

The linkpred tool

About

https://github.com/rafguns/linkpred

Cross-platform (written in Python)

Open source: BSD license

Command-line tool!

Alternative: LPmade (https://github.com/rlichtenwalter/LPmade)

How and where to get linkpred

1. Install Anaconda Python: http://continuum.io/downloads

2. Open command-line window

3. Run command:

> pip install https://github.com/rafguns/linkpred/archive/stable.zip

4. Wait until installation is finished

Basic usage

> linkpred

Should display brief usage instructions

> linkpred --help

Displays more complete help output

> linkpred training-network-file --predictors predictor --output output-type

Read the network in training-network-file, predict using predictor and give output of output-type

> linkpred training-network-file test-network-file --predictors predictor --output output-type

Read the network in training-network-file, compare with test-network-file, predict using predictor and give output of output-type

Link prediction in practice

Preprocessing

Nodes may also appear and disappear

Restrict to intersection of node sets of training and test network (only where a test network is available)

Restrict by degree (default: only discard isolate nodes)

Directed networks: not supported; convert to undirected first
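
The preprocessing steps can be sketched in plain Python. This is an illustration under stated assumptions, not linkpred's own code: the helper names (`to_undirected`, `nodes_of`, `restrict_to_common_nodes`) and the example links are invented. Because the network is stored as a set of links, nodes without any link drop out automatically, which mirrors the default of discarding isolate nodes.

```python
def to_undirected(directed_links):
    """Drop direction: (a, b) and (b, a) collapse into one undirected link."""
    return {frozenset(link) for link in directed_links if link[0] != link[1]}


def nodes_of(links):
    """All nodes that take part in at least one link."""
    return {node for link in links for node in link}


def restrict_to_common_nodes(training, test):
    """Keep only links whose two end points occur in both networks."""
    common = nodes_of(training) & nodes_of(test)
    kept_training = {link for link in training if link <= common}
    kept_test = {link for link in test if link <= common}
    return kept_training, kept_test


# Invented example data: directed links as (source, target) tuples.
training = to_undirected([("a", "b"), ("b", "a"), ("b", "c"), ("c", "d")])
test = to_undirected([("a", "b"), ("a", "c"), ("c", "e")])
training, test = restrict_to_common_nodes(training, test)

print(sorted(sorted(link) for link in training))  # [['a', 'b'], ['b', 'c']]
print(sorted(sorted(link) for link in test))      # [['a', 'b'], ['a', 'c']]
```

Node d only occurs in the training network and node e only in the test network, so every link that touches them is dropped.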

Prediction: choosing predictors

Local: AdamicAdar, AssociationStrength, CommonNeighbours, Cosine, DegreeProduct, Jaccard, MaxOverlap, MinOverlap, NMeasure, Pearson, ResourceAllocation

Global: GraphDistance, Katz, RootedPageRank, SimRank

Other: Community, Copy, Random

Local predictors

Tendency towards triadic closure

Number of common neighbours is a simple but powerful predictor.

Common neighbours

Normalizations of common neighbours: Jaccard coefficient, cosine measure…

Adamic/Adar (Adamic & Adar, 2003)
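
A minimal sketch of these local predictors, using their textbook definitions on an invented toy network. The function names are hypothetical and the code is not taken from linkpred, whose implementations may differ in detail.

```python
import math

# Invented toy network: each node maps to its neighbourhood N(v).
neighbours = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}


def common_neighbours(nb, x, y):
    """Number of nodes linked to both x and y."""
    return len(nb[x] & nb[y])


def jaccard(nb, x, y):
    """Common neighbours normalized by the size of the combined neighbourhood."""
    union = nb[x] | nb[y]
    return len(nb[x] & nb[y]) / len(union) if union else 0.0


def cosine(nb, x, y):
    """Common neighbours normalized by the geometric mean of the degrees."""
    denominator = math.sqrt(len(nb[x]) * len(nb[y]))
    return len(nb[x] & nb[y]) / denominator if denominator else 0.0


def adamic_adar(nb, x, y):
    """Common neighbours, each weighted by 1 / log(degree): rare ones count more."""
    return sum(1 / math.log(len(nb[z])) for z in nb[x] & nb[y])


print(common_neighbours(neighbours, "a", "b"))  # 2
print(jaccard(neighbours, "a", "b"))
print(cosine(neighbours, "a", "b"))
print(adamic_adar(neighbours, "a", "b"))
```

Nodes a and b are not linked but share the neighbours c and d, so all four measures give the pair a non-zero score.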

Weighted networks

In weighted networks, links have weights (e.g. number of joint papers, number of citations…)

Link weights: often ignored!

Most predictors in linkpred can use link weights

General idea: higher link weight (e.g., more common papers), stronger connection
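
As an illustration of the general idea, here is one of several possible weighted variants of common neighbours: each common neighbour contributes the sum of the two link weights that connect it to the pair. This choice is an assumption made for the sketch; the weighting that linkpred applies may differ. The data and the function name are invented.

```python
# Invented weighted toy network: weight[x][y] is the weight of link x-y,
# e.g. the number of joint papers.
weight = {
    "a": {"c": 3, "d": 2},
    "b": {"c": 1, "d": 2},
    "c": {"a": 3, "b": 1},
    "d": {"a": 2, "b": 2},
}


def weighted_common_neighbours(weight, x, y):
    """Sum, over all common neighbours z, of the weights of links x-z and y-z."""
    common = weight[x].keys() & weight[y].keys()
    return sum(weight[x][z] + weight[y][z] for z in common)


print(weighted_common_neighbours(weight, "a", "b"))  # 8
```

The unweighted count for a and b would be 2; the weighted score of 8 also reflects how strong the links to c and d are.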

Global predictors

Graph distance: lowest number of links needed to travel from a to b

Problem: small world phenomenon (most node pairs are only a few links apart, so distance alone discriminates poorly)
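
Graph distance in an unweighted network can be computed with a breadth-first search. The sketch below assumes an invented toy network and a hypothetical helper `graph_distance`; it is not linkpred's implementation. To use distance as a predictor, shorter distances should rank higher, for instance by scoring pairs with 1 / distance.

```python
from collections import deque


def graph_distance(neighbours, source, target):
    """Lowest number of links from source to target, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in neighbours[node]:
            if nb == target:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None


# Invented toy network: a path a-b-c-d and a separate pair e-f.
neighbours = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": {"f"},
    "f": {"e"},
}

print(graph_distance(neighbours, "a", "d"))  # 3
print(graph_distance(neighbours, "a", "e"))  # None
```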

Katz (1953):

$W(i, j) = \sum_{k=1}^{\infty} \beta^k a_{ij}^{(k)}$

$a_{ij}$: 1 if i and j are linked, 0 otherwise

$a_{ij}^{(k)}$: number of walks with length k from i to j

$\beta$: parameter, “probability of effectiveness of a single link”

Longer walks: lower effectiveness
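
A minimal sketch of the Katz idea: count the walks of each length k between two nodes and discount them by the k-th power of the parameter (called `beta` here). The infinite sum is truncated at `max_k`, which is a simplification made for this sketch; the function name and both parameter names are hypothetical. The full series only converges if beta is smaller than the reciprocal of the largest eigenvalue of the adjacency matrix.

```python
def katz_scores(adjacency, beta=0.1, max_k=5):
    """Truncated Katz score for every node pair, from a 0/1 adjacency matrix."""
    n = len(adjacency)
    scores = [[0.0] * n for _ in range(n)]
    walks = [row[:] for row in adjacency]  # number of walks of length 1
    for k in range(1, max_k + 1):
        for i in range(n):
            for j in range(n):
                scores[i][j] += beta ** k * walks[i][j]
        # Multiplying by the adjacency matrix gives the walks of length k + 1.
        walks = [
            [sum(walks[i][m] * adjacency[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return scores


# Invented toy network: a path 0 - 1 - 2.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

scores = katz_scores(adjacency, beta=0.5, max_k=2)
print(scores[0][1])  # 0.5  (one walk of length 1)
print(scores[0][2])  # 0.25 (one walk of length 2)
```

The directly linked pair scores higher than the pair that is two steps apart, which is the "longer walks, lower effectiveness" effect.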

Rooted PageRank
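
Rooted PageRank can be read as a random walk that keeps returning to a root node: at each step the walker follows a random link with some probability and otherwise jumps back to the root. The score of a node is the long-run probability of finding the walker there. The sketch below approximates this by repeated updating; the function name, the parameter name `damping` and the toy network are assumptions for illustration and need not match linkpred's implementation or defaults.

```python
def rooted_pagerank(neighbours, root, damping=0.85, iterations=100):
    """Approximate the stationary distribution of a walk that restarts at root."""
    rank = {v: 0.0 for v in neighbours}
    rank[root] = 1.0  # the walker starts at the root
    for _ in range(iterations):
        new_rank = {v: 0.0 for v in neighbours}
        for v, r in rank.items():
            if neighbours[v]:
                # With probability `damping`, follow one of v's links at random.
                share = damping * r / len(neighbours[v])
                for nb in neighbours[v]:
                    new_rank[nb] += share
            else:
                # A node without links sends the walker back to the root.
                new_rank[root] += damping * r
        # With probability 1 - damping, jump back to the root.
        new_rank[root] += 1 - damping
        rank = new_rank
    return rank


# Invented toy network: a path a - b - c - d, rooted at a.
neighbours = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
}

rank = rooted_pagerank(neighbours, root="a")
for node in sorted(rank, key=rank.get, reverse=True):
    print(node, round(rank[node], 3))
```

Nodes closer to the root receive a higher score, so b is a more likely new partner for a than d.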

SimRank (Jeh & Widom, 2002)

“Objects that link to similar objects are similar themselves.”

Starting point: a node is maximally similar to itself: W(v, v) = 1
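
A minimal sketch of the standard SimRank iteration for an undirected network: the similarity of two different nodes is the average similarity of their neighbours, multiplied by a decay parameter c. The function name, the fixed number of iterations and the toy network are assumptions for illustration; linkpred's implementation may differ.

```python
def simrank(neighbours, c=0.8, iterations=10):
    """Iteratively compute SimRank similarity for all node pairs."""
    nodes = list(neighbours)
    # Starting point: a node is maximally similar to itself, all else is 0.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new_sim = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[a][b] = 1.0
                elif not neighbours[a] or not neighbours[b]:
                    new_sim[a][b] = 0.0
                else:
                    # Average similarity over all pairs of neighbours.
                    total = sum(
                        sim[x][y] for x in neighbours[a] for y in neighbours[b]
                    )
                    new_sim[a][b] = (
                        c * total / (len(neighbours[a]) * len(neighbours[b]))
                    )
        sim = new_sim
    return sim


# Invented toy network: hub h linked to x and y.
neighbours = {"h": {"x", "y"}, "x": {"h"}, "y": {"h"}}

sim = simrank(neighbours, c=0.8)
print(sim["x"]["y"])  # 0.8
```

Nodes x and y are not linked, but both link to the same object h, so they end up similar to each other.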

Demo

Predict

Save predictions to file, then import in e.g. Excel

Evaluation

Step 4: ‘How well does it work?’

How? Compare to a ‘known good’ test network

Four groups:

               Link             Non-link
Predicted      True positive    False positive
Not predicted  False negative   True negative
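
The four groups can be counted with plain set operations. The sketch below uses invented data and a hypothetical helper `confusion_counts`. As a simplification, every possible node pair is treated as a candidate; in practice, pairs that are already linked in the training network would normally be left out.

```python
from itertools import combinations


def confusion_counts(predicted, actual, candidates):
    """Count true/false positives and negatives among the candidate pairs."""
    true_positives = len(predicted & actual)
    false_positives = len(predicted - actual)
    false_negatives = len(actual - predicted)
    true_negatives = len(candidates - predicted - actual)
    return true_positives, false_positives, false_negatives, true_negatives


# Invented example: all 6 possible pairs among four nodes are candidates.
candidates = {frozenset(pair) for pair in combinations("abcd", 2)}
predicted = {frozenset("ab"), frozenset("ac"), frozenset("bd")}
actual = {frozenset("ab"), frozenset("cd")}  # links in the test network

tp, fp, fn, tn = confusion_counts(predicted, actual, candidates)
print(tp, fp, fn, tn)  # 1 2 1 2
```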

Simply save results to text file:

--output cache-evaluations

Create chart: recall-precision, ROC

Evaluation: recall-precision

Precision: fraction of predictions that are correct, TP / (TP + FP)

Recall: fraction of links in the test network that are correctly predicted, TP / (TP + FN)
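
Predictors produce a ranked list, so precision and recall are usually computed for the top k predictions and then plotted for increasing k. A minimal sketch with invented data follows; the helper `precision_recall_at` is hypothetical and assumes a non-empty test network.

```python
def precision_recall_at(ranked, actual, k):
    """Precision and recall when only the top-k ranked pairs count as predicted."""
    top = set(ranked[:k])
    hits = len(top & actual)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(actual)  # assumes at least one link in the test network
    return precision, recall


# Invented example: pairs ranked from most to least likely.
ranked = [("a", "b"), ("a", "c"), ("c", "d"), ("b", "d")]
actual = {("a", "b"), ("c", "d")}  # links in the test network

for k in range(1, len(ranked) + 1):
    print(k, precision_recall_at(ranked, actual, k))
```

Going down the ranking, recall can only rise, while precision typically falls as more incorrect predictions are included.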

Evaluation: ROC

False positive rate: fraction of non-links that are incorrectly predicted as links, FP / (FP + TN)

True positive rate: fraction of links that are correctly predicted, TP / (TP + FN) (= recall)
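
The points of a ROC curve can be obtained by walking down the ranked predictions and updating both rates after each prediction. This is a sketch with invented data, not linkpred's charting code; the helper `roc_points` is hypothetical and assumes that the candidate pairs contain at least one link and one non-link.

```python
def roc_points(ranked, actual, n_candidates):
    """(false positive rate, true positive rate) after each ranked prediction."""
    n_links = len(actual)
    n_non_links = n_candidates - n_links
    true_positives = false_positives = 0
    points = [(0.0, 0.0)]
    for pair in ranked:
        if pair in actual:
            true_positives += 1
        else:
            false_positives += 1
        points.append((false_positives / n_non_links, true_positives / n_links))
    return points


# Invented example: four ranked predictions out of 6 candidate pairs.
ranked = [("a", "b"), ("a", "c"), ("c", "d"), ("b", "d")]
actual = {("a", "b"), ("c", "d")}  # links in the test network

print(roc_points(ranked, actual, n_candidates=6))
# [(0.0, 0.0), (0.0, 0.5), (0.25, 0.5), (0.25, 1.0), (0.5, 1.0)]
```

A good predictor rises quickly towards a true positive rate of 1 while the false positive rate is still low; a random predictor stays close to the diagonal.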

Profiles

A simple way to save and reuse the configuration of a complex prediction run (options, predictors, parameters…)

Usage example:

> linkpred network-file --profile profile.yml

Format: YAML, see https://en.wikipedia.org/wiki/YAML

Example profile

predictors:
  - name: AdamicAdar
    displayname: Adamic/Adar
  - name: GraphDistance
    displayname: Graph distance
    parameters:
      weight: weight
  - name: SimRank
    displayname: SimRank (c=0.4)
    parameters:
      c: 0.4
  - name: SimRank
    displayname: SimRank (c=0.8)
    parameters:
      c: 0.8
output:
  - cache-predictions
  - recall-precision

Conclusion

About link prediction

Link prediction is possible because link formation is not a purely random process

Limitations:

Unaware of social and other circumstantial factors

Which predictor is ‘best’ for a concrete situation?

Trade-off between prediction accuracy and non-triviality

About linkpred

Relatively simple but powerful

Limitations:

Not suitable for very large and/or dense networks

Does not incorporate more complex setups like predictor combinations, machine learning etc.

All results can be exported for analysis in other software (cache-*)

Open source: contributions welcome!
