Implementing Link-Prediction for Social Networks in a Database System (DBSocial 2013)
DESCRIPTION
Our project considers the problem of implementing link-prediction metrics for a social network on three different types of database systems (MySQL, Redis and Neo4J). In particular, we study how the features of each database system affect the ease with which link prediction can be performed.
TRANSCRIPT
Implementing Link-Prediction for Social Networks in a Database System
Sara Cohen, Netanel Cohen-Tzemach
The Hebrew University of Jerusalem
About me
What backend to choose?
● Premise: <1M Nodes
● DIY vs. existing
● Data model
● Limitations/Features
● TPC-C won't help...
Previous work
Compared databases
Limitations and Features
Data model
Implementation
Experiments
Measurements
N. Ruflin, H. Burkhart, and S. Rizzotti. Social-data storage-systems. In Databases and Social Networks, DBSocial '11, pages 7-12, New York, NY, USA, 2011. ACM.
Our work
● Implemented 7 Link-Prediction metrics
● Experimented on 10 social-networks
● Over 3 different backends
○ Relational (MySQL)
○ Key-Value (Redis)
○ Graph (Neo4J)
● What did we find?
○ Stay tuned :)
Link Prediction
● Why Link Prediction?
○ Well researched
○ Useful
○ Multiple scoring functions
Link Prediction
D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In CIKM, 2003.
"Given a snapshot of a social network at time t, we seek to accurately predict the edges that will be added
to a specific node during the interval from time t to a given future time t'."
Link Prediction examples
● Common Neighbors
○ Only neighbors
● Katz measure
○ Paths
● Rooted PageRank
○ Random walk
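To make two of these scoring functions concrete, here is a minimal Python sketch, not the authors' implementation. The edge list is an assumed toy graph, and the Katz score is truncated at a hypothetical `max_len` and computed by counting walks, as in the usual matrix formulation; `beta` is a free damping parameter. Rooted PageRank is left out for brevity.

```python
from collections import defaultdict

# Assumed toy graph (undirected); not one of the datasets used in the talk.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 5), (3, 5), (5, 6)]


def build_adjacency(edges):
    """Return a dict mapping each node to the set of its neighbors."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def common_neighbors(adj, x, y):
    """Score = number of nodes adjacent to both x and y."""
    return len(adj[x] & adj[y])


def katz(adj, x, y, beta=0.05, max_len=4):
    """Truncated Katz score: sum over l <= max_len of beta**l * (#walks of length l from x to y)."""
    walks = {x: 1}  # walks[v] = number of walks of the current length from x to v
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = defaultdict(int)
        for v, count in walks.items():
            for w in adj[v]:
                nxt[w] += count
        walks = nxt
        score += (beta ** length) * walks.get(y, 0)
    return score


adj = build_adjacency(EDGES)
print(common_neighbors(adj, 1, 5))
print(katz(adj, 1, 4, beta=0.5, max_len=2))
```

Common Neighbors only looks one hop away from each endpoint, while Katz also rewards longer connections, with each extra hop discounted by `beta`.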
Storage systems: MySQL
http://www.mysql.com/
InnoDB vs MyISAM: http://www.oracle.com/partners/en/knowledge-zone/mysql-5-5-innodb-myisam-522945.pdf
● Relational database
○ Edges table
○ Stored procedures, Indices, "helper" tables
[Figure: example undirected graph with nodes 1 to 6]
ID1 ID2
1 2
1 3
2 1
2 3
2 4
2 5
3 1
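The relational model can be tried out with a small sketch. This is an illustration only: it uses Python's built-in `sqlite3` as a stand-in for MySQL, the edge list is an assumed toy graph, and the helper names `load_edges` and `common_neighbor_candidates` are hypothetical. Each undirected edge is stored in both directions, and a self-join on the edges table counts shared neighbors.

```python
import sqlite3

# Assumed toy graph (undirected); every edge is stored in both directions.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 5), (3, 5), (5, 6)]


def load_edges(conn, edges):
    """Create the edges table; the composite primary key doubles as an index on id1."""
    conn.execute(
        "CREATE TABLE edges ("
        "id1 INTEGER NOT NULL, id2 INTEGER NOT NULL, PRIMARY KEY (id1, id2))"
    )
    rows = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
    conn.executemany("INSERT INTO edges (id1, id2) VALUES (?, ?)", rows)
    conn.commit()


def common_neighbor_candidates(conn, x, limit=100):
    """Return (y, neighbor_count) pairs for nodes two hops away from x."""
    query = """
        SELECT e2.id2 AS y, COUNT(e2.id1) AS neighbor_count
        FROM edges AS e1 JOIN edges AS e2 ON e1.id2 = e2.id1
        WHERE e1.id1 = ? AND e1.id1 <> e2.id2
        GROUP BY e2.id2
        ORDER BY neighbor_count DESC, y ASC
        LIMIT ?
    """
    return conn.execute(query, (x, limit)).fetchall()


conn = sqlite3.connect(":memory:")
load_edges(conn, EDGES)
print(common_neighbor_candidates(conn, 1))
```

The query does not filter out nodes that are already adjacent to `x`, so existing neighbors can appear among the candidates; the secondary sort on `y` is added here only to make the output deterministic.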
Storage systems: Redis
http://redis.io/
● Key-Value store
○ Adjacency sets
○ Lua functions, "helper" database
[Figure: example undirected graph with nodes 1 to 6]
1: (2, 3)
2: (1, 3, 4, 5)
3: (1, 2, 5)
4: (2)
5: (2, 3, 6)
6: (5)
Storage systems: Neo4J
http://www.neo4j.org/
● Graph database
○ No modeling required
○ Cypher queries, Lucene "helper" index
[Figure: example undirected graph with nodes 1 to 6]
Storage systems
● Why these systems?
○ Popular
○ Open Source
● Perfect implementation?
○ No. But,
■ Unbiased
■ Best practices
■ Same time-frame
Full implementation available on GitHub: github.com/natict/gdbb
Implementation of Common Neighbours
SQL
select E2.id2 as y, count(E2.id1) as neighbor_count
from edges as E1 join edges as E2
where E1.id1 = x and E1.id2 = E2.id1
and E1.id1 <> E2.id2
group by y
order by neighbor_count desc
limit 100;
Cypher
START a=node({n})
MATCH (a)-[:COAUTH]->(b)<-[:COAUTH]-(c)
WHERE a <> c
RETURN a.nid,c.nid,count(b) as score
ORDER BY score DESC
LIMIT 100
Lua
local tc = {}; -- tc[y] = number of neighbors shared by x and y
local x = KEYS[1];
for k1,n in pairs(redis.call('smembers', x)) do
  for k2,y in pairs(redis.call('smembers', n)) do
    if x ~= y then tc[y] = (tc[y] or 0) + 1; end;
  end;
end;
local ttop = {}; -- Extract top 100 results
local min = math.huge; -- smallest score currently in ttop
local mini = ''; -- index of that smallest entry
for k,v in pairs(tc) do
  if (#ttop < 100) then
    table.insert(ttop, {k,v});
    if v<min then min=v; mini=table.maxn(ttop); end;
  else
    if v>min then
      ttop[mini] = {k,v}; -- replace the current minimum, then rescan for the new one
      min = math.huge;
      for i = 1,#ttop,1 do
        if ttop[i][2]<min then min=ttop[i][2]; mini=i; end;
      end;
    end;
  end;
end;
-- Now we just need to sort, and format the output...
...
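The logic of the Lua script is easier to follow in a short Python sketch. This is an illustrative port, not the authors' code: the dictionary `ADJ` is an assumed toy keyspace that stands in for Redis, the helper `smembers` mimics the Redis SMEMBERS command, and the manual top-100 bookkeeping is replaced by `heapq.nlargest`.

```python
import heapq
from collections import Counter

# Stand-in for the Redis keyspace: key = node id, value = adjacency set.
# Assumed toy graph; in Redis these would be sets read with SMEMBERS.
ADJ = {
    "1": {"2", "3"},
    "2": {"1", "3", "4", "5"},
    "3": {"1", "2", "5"},
    "4": {"2"},
    "5": {"2", "3", "6"},
    "6": {"5"},
}


def smembers(key):
    """Mimic SMEMBERS: return the set stored at key, or an empty set."""
    return ADJ.get(key, set())


def top_common_neighbors(x, k=100):
    """Count, for every node y reachable in two hops from x, the neighbors shared with x."""
    counts = Counter()
    for n in smembers(x):
        for y in smembers(n):
            if y != x:
                counts[y] += 1
    # Keep the k highest scores; ties are broken in favor of the larger node id.
    return heapq.nlargest(k, counts.items(), key=lambda kv: (kv[1], kv[0]))


print(top_common_neighbors("1"))
```

Unlike SQL and Cypher, the key-value store offers no declarative grouping or ordering, so both the counting and the top-k selection have to be written by hand.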
Cypher
START a=node({n})
MATCH (a)-[:COAUTH]->(b)<-[:COAUTH]-(c)
WHERE a <> c
RETURN a.nid,c.nid,count(b) as score
ORDER BY score DESC
LIMIT 100
[Figure: graph pattern with nodes a, b and c matched by the query]
Datasets
● Undirected
● Medium sized
● Socially oriented
● Data sources
○ DBLP
○ SNAP
DBLP in XML format: http://dblp.uni-trier.de/xml/
SNAP Datasets: http://snap.stanford.edu/data/index.html
Name # Nodes # Edges
dblp-all 366,600 4,349,796
ca-HepPh 12,006 237,010
enron 36,692 367,662
facebook 4,039 170,174
Experiments
Detailed specifications and results: www.cs.huji.ac.il/~sara/link-prediction.html
Experiments (2)
Experiments (3)
Conclusions
● MySQL is highly optimised
○ mainly for simple queries (with few joins)
● Redis is very flexible and fast
○ mainly with complex metrics
● Neo4J has implementation simplicity
○ with some limitations
○ still evolving at a fast pace
● Future work
○ More databases
○ More algorithms
Thank you
Nati (Netanel) Cohen-Tzemach
linkedin.com/in/natict
Acknowledgments:
● Israel Science Foundation (Grant 143/09)
● Ministry of Science and Technology (Grant 3-8710)
● DBSocial Travel Award