Implementing Link-Prediction for Social Networks in a Database System (DBSocial 2013)

Sara Cohen, Netanel Cohen-Tzemach

The Hebrew University of Jerusalem

About me

What backend to choose?

● Premise: <1M Nodes

● DIY vs. existing

● Data model

● Limitations/Features

● TPC-C won't help...

Previous work

● Compared databases

● Limitations and Features

● Data model

● Implementation

● Experiments

● Measurements

N. Ruflin, H. Burkhart, and S. Rizzotti. Social-data storage-systems. In Databases and Social Networks, DBSocial '11, pages 7-12, New York, NY, USA, 2011. ACM.

http://dl.acm.org/citation.cfm?id=1996415

Our work

● Implemented 7 Link-Prediction metrics

● Experimented on 10 social-networks

● Over 3 different backends

○ Relational (MySQL)

○ Key-Value (Redis)

○ Graph (Neo4J)

● What did we find?

○ Stay tuned :)

Link Prediction

● Why Link Prediction?

○ Well researched

○ Useful

○ Multiple scoring functions

Link Prediction

D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In CIKM, 2003.

"Given a snapshot of a social network at time t, we seek to accurately predict the edges that will be added to a specific node during the interval from time t to a given future time t'."

http://www.cs.cornell.edu/home/kleinber/link-pred.pdf


Link Prediction examples

● Common Neighbors

○ Only neighbors

● Katz measure

○ Paths

● Rooted PageRank

○ Random walk
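To make the "paths" and "random walk" flavours concrete, here is a minimal Python sketch of the Katz measure and Rooted PageRank over in-memory adjacency sets. It follows the commonly used definitions of these measures and is not the authors' implementation. The function names, the damping values `beta` and `alpha`, the truncation length `max_len` and the iteration count are all assumptions chosen for illustration, and the Katz sum is truncated and counts walks (paths that may revisit nodes).

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build symmetric adjacency sets from an undirected edge list."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def katz_scores(adj, x, beta=0.05, max_len=4):
    """Truncated Katz: sum over l = 1..max_len of beta**l * (#walks of length l from x to y)."""
    scores = defaultdict(float)
    walks = {x: 1}  # walks[v] = number of walks of the current length from x to v
    for length in range(1, max_len + 1):
        nxt = defaultdict(int)
        for node, count in walks.items():
            for nb in adj.get(node, ()):
                nxt[nb] += count
        for node, count in nxt.items():
            scores[node] += (beta ** length) * count
        walks = nxt
    scores.pop(x, None)  # a node is not a candidate link for itself
    return dict(scores)


def rooted_pagerank(adj, x, alpha=0.15, iterations=50):
    """Power iteration for a random walk that jumps back to x with probability alpha."""
    rank = {n: 0.0 for n in adj}
    rank[x] = 1.0
    for _ in range(iterations):
        nxt = {n: 0.0 for n in adj}
        nxt[x] += alpha  # total mass is 1, so the restart mass is exactly alpha
        for node, mass in rank.items():
            nbrs = adj[node]
            if not nbrs:
                nxt[x] += (1 - alpha) * mass  # isolated node: send its mass back to the root
                continue
            share = (1 - alpha) * mass / len(nbrs)
            for nb in nbrs:
                nxt[nb] += share
        rank = nxt
    return rank
```

Both functions return a score per candidate node; sorting by score in descending order gives the predicted links for `x`.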

Storage systems: MySQL

● Relational database

○ Edges table

○ Stored procedures, Indices, "helper" tables

http://www.mysql.com/

InnoDB vs MyISAM: http://www.oracle.com/partners/en/knowledge-zone/mysql-5-5-innodb-myisam-522945.pdf

[Figure: example graph with nodes 1 to 6]

ID1  ID2
1    2
1    3
2    1
2    3
2    4
2    5
3    1
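The edges-table layout above can be tried out with a small Python sketch. It uses the standard-library `sqlite3` module as a stand-in for MySQL (an assumption made only to keep the example self-contained), stores each undirected edge in both directions as the rows above do, and counts shared neighbours with a self-join. The edge list is taken from the rows shown; the function names and the primary key are hypothetical.

```python
import sqlite3

# Undirected edges recovered from the rows shown above, each listed once.
UNDIRECTED_EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 5)]


def build_edges_table(edges):
    """Create an in-memory edges(id1, id2) table holding both directions of every edge."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edges (id1 INTEGER, id2 INTEGER, PRIMARY KEY (id1, id2))"
    )
    rows = [(a, b) for a, b in edges] + [(b, a) for a, b in edges]
    conn.executemany("INSERT INTO edges (id1, id2) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def neighbors(conn, node):
    """Return the sorted neighbour ids of a node."""
    cur = conn.execute("SELECT id2 FROM edges WHERE id1 = ? ORDER BY id2", (node,))
    return [row[0] for row in cur]


def common_neighbor_counts(conn, x, limit=100):
    """Self-join x -> n -> y and count, for each y, the neighbours n shared with x."""
    cur = conn.execute(
        """
        SELECT e2.id2 AS y, COUNT(*) AS neighbor_count
        FROM edges AS e1 JOIN edges AS e2 ON e1.id2 = e2.id1
        WHERE e1.id1 = ? AND e2.id2 <> ?
        GROUP BY e2.id2
        ORDER BY neighbor_count DESC, y ASC
        LIMIT ?
        """,
        (x, x, limit),
    )
    return cur.fetchall()
```

Storing both directions doubles the row count but lets every neighbour lookup use the same indexed column, which is why the self-join needs no special handling for edge direction.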


Storage systems: Redis

http://redis.io/

● Key-Value store

○ Adjacency sets

○ Lua functions, "helper" database

[Figure: example graph with nodes 1 to 6]

1: (2, 3)
2: (1, 3, 4, 5)
3: (1, 2, 5)
4: (2)
5: (2, 3, 6)
6: (5)
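The adjacency-set layout can be mimicked in plain Python to show how a common-neighbours score falls out of two nested set scans. This is only a sketch: a dictionary of sets stands in for the Redis sets, the edge list is an assumption chosen to be symmetric and to resemble the example graph, and the function names are hypothetical.

```python
import heapq
from collections import Counter, defaultdict

# Assumed undirected edge list for a six-node example graph.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 5), (3, 5), (5, 6)]


def build_adjacency(edges):
    """One set of neighbours per node, with every edge stored in both directions."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def top_common_neighbors(adj, x, k=100):
    """Score each node y != x by the number of neighbours it shares with x; keep the top k."""
    counts = Counter()
    for n in adj.get(x, ()):        # every neighbour n of x
        for y in adj.get(n, ()):    # every neighbour y of n
            if y != x:
                counts[y] += 1
    # Highest count first, ties broken by smaller node id.
    return heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, `top_common_neighbors(build_adjacency(EDGES), 1, k=2)` ranks node 5 first, because it shares both of node 1's neighbours (2 and 3).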


Storage systems: Neo4J

http://www.neo4j.org/

● Graph database

○ No modeling required

○ Cypher queries, Lucene "helper" index

[Figure: example graph with nodes 1 to 6, shown twice]



Storage systems

● Why these systems?

○ Popular

○ Open Source

● Perfect implementation?

○ No. But,

■ Unbiased

■ Best practices

■ Same time-frame

Full implementation available on GitHub: github.com/natict/gdbb


Implementation of Common Neighbours

SQL

-- x is the id of the query node
select E2.id2 as y, count(E2.id1) as neighbor_count
from edges as E1 join edges as E2
where E1.id1 = x and E1.id2 = E2.id1
  and E1.id1 <> E2.id2
group by y
order by neighbor_count desc
limit 100;

Cypher

START a=node({n})
MATCH (a)-[:COAUTH]->(b)<-[:COAUTH]-(c)
WHERE a <> c
RETURN a.nid, c.nid, count(b) as score
ORDER BY score DESC
LIMIT 100

Lua

local tc = {}; -- candidate y -> number of neighbours shared with x
local x = KEYS[1];
for k1,n in pairs(redis.call('smembers', x)) do
  for k2,y in pairs(redis.call('smembers', n)) do
    if x ~= y then tc[y] = (tc[y] or 0) + 1; end;
  end;
end;
local ttop = {}; -- Extract top 100 results
local min = math.huge;
local mini = '';
for k,v in pairs(tc) do
  if (#ttop < 100) then
    table.insert(ttop, {k,v});
    if v<min then min=v; mini=table.maxn(ttop); end;
  else
    if v>min then
      ttop[mini] = {k,v};
      min = math.huge;
      for i = 1,#ttop,1 do
        if ttop[i][2]<min then min=ttop[i][2]; mini=i; end;
      end;
    end;
  end;
end;
-- Now we just need to sort, and format the output...
...

Cypher

START a=node({n})
MATCH (a)-[:COAUTH]->(b)<-[:COAUTH]-(c)
WHERE a <> c
RETURN a.nid, c.nid, count(b) as score
ORDER BY score DESC
LIMIT 100

[Figure: nodes a, b and c illustrating the MATCH pattern]

Datasets

● Undirected

● Medium sized

● Socially oriented

● Data sources

○ DBLP

○ SNAP

DBLP in XML format: http://dblp.uni-trier.de/xml/

SNAP Datasets: http://snap.stanford.edu/data/index.html

Name       # Nodes    # Edges
dblp-all   366,600    4,349,796
ca-HepPh   12,006     237,010
enron      36,692     367,662
facebook   4,039      170,174


Experiments

Detailed specifications and results: www.cs.huji.ac.il/~sara/link-prediction.html


Experiments (2)



Experiments (3)



Conclusions

● MySQL is highly optimised

○ mainly for simple queries (with few joins)

● Redis is very flexible and fast

○ mainly with complex metrics

● Neo4J has implementation simplicity

○ with some limitations

○ still evolving at a fast pace

● Future work

○ More databases

○ More algorithms

Thank you

Nati (Netanel) Cohen-Tzemach

linkedin.com/in/natict

Acknowledgments:

● Israel Science Foundation (Grant 143/09)

● Ministry of Science and Technology (Grant 3-8710)

● DBSocial Travel Award
