DBconnect: Mining Research Community on DBLP Data
Osmar R. Zaïane, Jiyang Chen, Randy Goebel
Web Mining and Social Network Analysis Workshop in conjunction with ACM SIGKDD, SNA-KDD'07
Presenter: 吳建良

Outline
- Community
- Motivation
  - Understand the research community; recommend collaborations
- Proposed Approach
  - Rank relevance with a random walk approach
- DBconnect
  - A navigational system to investigate community relations
- Conclusion

What is community?
- In graph theory: densely connected groups of vertices, with sparser connections between groups
- In social network analysis: groups of entities that share similar properties or connect to each other via certain relations

Why is community important?
- Interesting data with community structure: researcher collaboration, friendship networks, the WWW, massively multiplayer online gaming, electronic communications, …
- Groups in social networks correspond to social communities, which can be used to understand organizational structure, academic collaboration, shared interests and affinities, etc.

Motivation
- Understand the research network between authors, conferences and topics (rank entities by relevance for given entities)
- Find and recommend research collaborators for given authors
- Explore the academic social network

Proposed Approach
- Build a bipartite graph in the author-conference space
- Limitation of the traditional bipartite graph model
- Extend the bipartite model to include co-authorship information
- Further extend the model to tripartite to include topic information
- Use random walk with restart on such models

An example: author publication records in conferences
- a, b, c, d, e are authors
- ac(3) means that authors a and c published three papers together in the KDD(y) conference

Bipartite model for conference-author social network
- Weight(edge) = publishing frequency of the author in a certain conference
- Limitation: fails to represent any co-authorship
- Options to capture the co-author relations, and why each falls short:
  1. Add a link between a and c: misses the role of KDD
  2. Make the link between a and c connect to KDD: makes the random walk infeasible
  3. Add additional nodes to represent each co-author relation: impractical, since there is a huge number of such relations

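As a minimal sketch of the edge weighting above, the following Python builds the conference-author weight matrix from a list of publication records. The record format, the function name `build_weight_matrix`, and the sample records are assumptions made for illustration; they are not data from the slides.

```python
from collections import Counter

def build_weight_matrix(records, conferences, authors):
    """Return M where M[i][j] = number of papers author j published in conference i."""
    counts = Counter()
    for conference, paper_authors in records:
        for author in paper_authors:
            counts[(conference, author)] += 1
    return [[counts[(c, a)] for a in authors] for c in conferences]

# Hypothetical records: (conference, authors of one paper).
records = [
    ("KDD", ["a", "c"]),
    ("KDD", ["a", "c"]),
    ("KDD", ["a", "c"]),
    ("SIGMOD", ["b"]),
    ("SIGMOD", ["d", "e"]),
]
conferences = ["KDD", "SIGMOD"]
authors = ["a", "b", "c", "d", "e"]
M = build_weight_matrix(records, conferences, authors)
print(M)
```

In this toy input, M records that a and c each have three KDD papers, but not that they wrote them together, which is the limitation named on the slide.
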
Extend the bipartite model to include co-authorship information
- Add a virtual level of nodes (C') to replace the conference partition, and add direction to the edges
- A nodes connect to their own split relation nodes with the original weight
- C' nodes connect to all author nodes
  - If the A node and the C' node have a co-author relation, the edge weight is the co-author frequency × a parameter f
  - Otherwise, the edge keeps its original weight
- Set f = k (k is the total number of authors of a conference)
[Figure: example graph with edge weights 3 and 7; co-author edges are weighted 3f]

Further extend the model to tripartite to include topic information
- Research topic is an important component to differentiate any research community
- Authors that attend the same conferences might work on various topics

Adding topic information
- Very few conference proceedings have their table of contents included in DBLP
  - Tables of contents include session titles
- Extract relevant topics from DBLP
  - Use paper titles, and find frequent co-locations in the title text
- Method
  - Manually select a list of stopwords to remove frequently used but non-topic-related words
  - Ex: Towards, Understanding, Approach, …

Adding topic information (cont.)
- Count the frequency of every co-located pair of stemmed words
- Select the top 1000 most frequent bi-grams as topics
- Manually add several tri-grams
  - Ex: World Wide Web, Support Vector Machine, …

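The topic-extraction steps above (drop stopwords, stem, count co-located pairs, keep the most frequent) can be sketched in a few lines of Python. This is an illustration under stated assumptions: the stopword list beyond the three slide examples, the `crude_stem` helper (a stand-in for a real stemmer), and the sample titles are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list; the slides only name Towards, Understanding, Approach.
STOPWORDS = {"towards", "understanding", "approach",
             "a", "an", "the", "of", "for", "and", "in", "on", "to", "with"}

def crude_stem(word):
    """Rough stand-in for a real stemmer: strips '-ing' and a plural '-s'."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def top_bigrams(titles, k):
    """Count adjacent pairs of stemmed, non-stopword title words; return the top k."""
    counts = Counter()
    for title in titles:
        words = [w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOPWORDS]
        stems = [crude_stem(w) for w in words]
        counts.update(zip(stems, stems[1:]))
    return counts.most_common(k)

# Made-up paper titles.
titles = [
    "Towards Mining Association Rules",
    "Association Rules for Data Streams",
    "An Approach to Association Rule Mining",
]
topics = top_bigrams(titles, 1)
print(topics)
```

Here stopwords are removed before pairing, so words separated only by a stopword count as co-located; whether the original system did the same is not stated on the slides.
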
Random walk on DBLP social network
- Problem to be solved: given an author node $a \in A$, compute a relevance score for each author $b \in A$
- Simple example: conference-author network G
- Relational matrix $M_{3 \times 5}$ (3 conferences, 5 authors)

Random walk on DBLP social network (cont.)
- Normalize M such that every column sums to 1: $Q(M) = \mathrm{col\_norm}(M)$, $Q(M^T) = \mathrm{col\_norm}(M^T)$
- Construct the adjacency matrix J of G after normalization

$$
J = \begin{pmatrix} 0 & Q(M) \\ Q(M^T) & 0 \end{pmatrix}
$$

$$
Q(M) = \begin{pmatrix}
0.62 & 0.2 & 0 & 0 & 0 \\
0.38 & 0 & 1.0 & 0 & 0.77 \\
0 & 0.8 & 0 & 1.0 & 0.22
\end{pmatrix}
\qquad
Q(M^T) = \begin{pmatrix}
0.84 & 0.18 & 0 \\
0.16 & 0 & 0.44 \\
0 & 0.41 & 0 \\
0 & 0 & 0.33 \\
0 & 0.41 & 0.22
\end{pmatrix}
$$

Rows of $Q(M)$ are conferences (SIGMOD last) and columns are authors a to e; $Q(M^T)$ has the transposed layout.

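The normalization and the block layout of J can be sketched in plain Python as follows. The raw weight matrix `M` below is an assumption: its entries were chosen so that the normalized values roughly match the two-decimal numbers on the slide, since the original M is not preserved in this transcript. The helper names are also illustrative.

```python
def transpose(matrix):
    return [list(row) for row in zip(*matrix)]

def col_norm(matrix):
    """Scale each column so it sums to 1; all-zero columns stay zero."""
    sums = [sum(col) for col in zip(*matrix)]
    return [[x / s if s else 0.0 for x, s in zip(row, sums)] for row in matrix]

def build_adjacency(M):
    """Assemble J = [[0, Q(M)], [Q(M^T), 0]] for n conferences and m authors."""
    n, m = len(M), len(M[0])
    QM = col_norm(M)               # n x m: author -> conference probabilities
    QMT = col_norm(transpose(M))   # m x n: conference -> author probabilities
    top = [[0.0] * n + row for row in QM]
    bottom = [row + [0.0] * m for row in QMT]
    return top + bottom

# Assumed raw weights (rows: three conferences, SIGMOD last; columns: authors a..e).
M = [
    [5, 1, 0, 0, 0],
    [3, 0, 7, 0, 7],
    [0, 4, 0, 3, 2],
]
J = build_adjacency(M)
for row in J:
    print([round(x, 2) for x in row])
```

Every column of J sums to 1, so multiplying a probability vector by J yields another probability vector. Small differences from the slide values (for example 0.83 versus 0.84) come from the slide's rounding.
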
Random walk on DBLP social network (cont.)
- Normalized adjacency matrix J of G
[Figure: the matrix J, with its two non-zero blocks labeled $Q(M^T)$ and $Q(M)$]

Random walk on DBLP social network (cont.)
- A random walk on this graph moves from one node to one of its neighbors based on a transition probability
  - Probability: the weight of the edge divided by the sum of the weights of all edges that connect to this node
- Ex: if we start from node SIGMOD, then build u as the start vector
  - u is a one-column vector consisting of (3+5) elements, one per conference and author
  - The value of the element corresponding to SIGMOD is set to 1

Random walk on DBLP social network (cont.)
- Each step applies $u = Ju$
- After step 1 of the first iteration, the random walk hits the author nodes with b = 1×0.44, d = 1×0.33, e = 1×0.22
- After step 2 of the first iteration, the chance that the random walk goes back to SIGMOD is 0.44×0.8 + 0.33×1 + 0.22×0.22 = 0.73, and the other 0.27 goes to the other two conference nodes

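The two steps of this first iteration can be checked with a few lines of Python. The transition values are the rounded two-decimal numbers quoted on the slides; the dictionary names are only illustrative.

```python
# Transition values as given on the slides (rounded to two decimals).
from_sigmod = {"b": 0.44, "d": 0.33, "e": 0.22}  # SIGMOD -> author
to_sigmod = {"b": 0.8, "d": 1.0, "e": 0.22}      # author -> SIGMOD

# Step 1: all probability mass starts on SIGMOD and moves to its authors.
author_mass = {a: 1.0 * p for a, p in from_sigmod.items()}

# Step 2: each author passes its mass back to the conference nodes.
back_to_sigmod = sum(author_mass[a] * to_sigmod[a] for a in author_mass)
elsewhere = sum(author_mass.values()) - back_to_sigmod

print(round(back_to_sigmod, 2), round(elsewhere, 2))
```

With these rounded inputs the author masses sum to 0.99 rather than 1, so the remainder comes out as 0.26 instead of the 0.27 obtained from 1 − 0.73.
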
Random walk on DBLP social network (cont.)
- After a few iterations, the vector converges and gives a stable score to every node
- However, these scores are always the same no matter where the walk begins
- Solved by random walk with restart
  - Given a restarting probability c
  - Use another vector v, in which the element corresponding to SIGMOD is set to 1
  - In each random walk iteration, the walker goes back to the start node with the restart probability: $u = (1-c)u + cv$

Random walk on DBLP social network (cont.)
Random walk with restart algorithm (1)
Input: node $\alpha \in A$, a bipartite graph model G, restarting probability c, convergence threshold $\varepsilon$.
Output: relevance score vector B for author nodes.
1. Compute the adjacency matrix $J_{(n+m) \times (n+m)}$ of G. /* n conferences and m authors */
2. Initialize $v_\alpha = 0$, set the element for $\alpha$ to 1: $v_\alpha(\alpha) = 1$.
3. While ($\Delta u_\alpha > \varepsilon$)
     $u_\alpha = J u_\alpha$
     $u_\alpha = (1 - c)\, u_\alpha + c\, v_\alpha$
4. Set vector $B = u_\alpha(n+1 : n+m)$.
5. Return B.

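A minimal Python sketch of the loop in Algorithm (1) follows. Several details are assumptions, since the slide does not state them: the walk is initialized with u = v, the change Δu is measured as an L1 difference, a `max_iter` guard is added, and the default c = 0.15 and the toy matrix are made up.

```python
def random_walk_with_restart(J, start, c=0.15, eps=1e-9, max_iter=1000):
    """Iterate u = (1 - c) * (J u) + c * v until the L1 change is at most eps."""
    size = len(J)
    v = [0.0] * size
    v[start] = 1.0
    u = v[:]  # assumption: the walk starts from the restart vector
    for _ in range(max_iter):
        Ju = [sum(J[i][j] * u[j] for j in range(size)) for i in range(size)]
        new_u = [(1 - c) * x + c * y for x, y in zip(Ju, v)]
        delta = sum(abs(a - b) for a, b in zip(new_u, u))
        u = new_u
        if delta <= eps:
            break
    return u

# Toy column-normalized J for n = 2 conferences and m = 3 authors (made-up values).
n, m = 2, 3
J = [
    [0.0, 0.0, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0, 0.0],
]
u = random_walk_with_restart(J, start=n)  # restart at the first author node
B = u[n:n + m]                            # author scores, as in step 4
print([round(x, 3) for x in B])
```

Because J is column-normalized and c > 0, each iteration shrinks the difference between successive vectors by a factor of (1 − c), so the loop terminates; the resulting scores depend on the start node, unlike the plain walk.
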
Random walk on DBLP social network (cont.)
- Extend the bipartite model into a directed bipartite graph $G' = (C', A, E')$
  - A has m author nodes, and C has n conference nodes
  - C' is generated based on C and has n*m nodes
  - Assume every node in C is split into m nodes
- First generate a matrix $M_{(n*m) \times m}$ for directional edges from C' to A
- Then form a matrix $N_{m \times (n*m)}$ for edges from A to C'

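One way to read the construction of the two matrices is sketched below in Python. It is an interpretation rather than the authors' implementation: the split-node indexing `c * m + a`, the treatment of an author's edge from its own split node, and the input format are assumptions. The weights 3, 7, 7 and the ac(3) relation come from the slides' example, but assigning them to authors a, c, e of a single conference is a guess.

```python
def split_node(c, a, m):
    """Index of the virtual C' node for conference c as seen by author a."""
    return c * m + a

def directed_weights(W, coauthor):
    """W[c][a]: papers of author a in conference c.
    coauthor[(c, a, b)] with a < b: papers a and b co-wrote in conference c.
    Returns (N, M): N is m x (n*m) for A -> C', M is (n*m) x m for C' -> A."""
    n, m = len(W), len(W[0])
    N = [[0] * (n * m) for _ in range(m)]
    M = [[0] * m for _ in range(n * m)]
    for c in range(n):
        f = sum(1 for w in W[c] if w > 0)  # f = k, number of authors in c
        for a in range(m):
            if W[c][a] == 0:
                continue  # author a never published in conference c
            node = split_node(c, a, m)
            N[a][node] = W[c][a]  # edge to own split node, original weight
            for b in range(m):
                together = coauthor.get((c, min(a, b), max(a, b)), 0)
                if a != b and together > 0:
                    M[node][b] = together * f  # boosted co-author edge
                else:
                    M[node][b] = W[c][b]       # original weight
    return N, M

# One conference with three authors (indices 0, 1, 2 standing for a, c, e).
W = [[3, 7, 7]]
coauthor = {(0, 0, 1): 3}  # a and c wrote three papers together
N, M = directed_weights(W, coauthor)
print(N)
print(M)
```

With k = 3 authors, the co-author edges get weight 3f = 9, so a walk leaving a's split node is biased toward c, and vice versa.
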
Random walk on DBLP social network (cont.)
- The adjacency matrix J of G'
- Algorithm (2): the random walk with restart algorithm for the directed bipartite model

Random walk on DBLP social network (cont.)
- Extend to the tripartite graph model $G'' = (C, A, T, E'')$
  - Assume n conferences, m authors and l topics in G''
  - Three corresponding matrices: $U_{n \times m}$, $V_{m \times l}$ and $W_{n \times l}$
- The adjacency matrix of G'' is formed from these matrices after normalization

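The slide's normalized matrix is not preserved in this transcript, so the following Python is only a plausible sketch of how U, V and W could be assembled. Two assumptions are made: nodes are ordered conferences, authors, topics, and each full column of the block matrix is normalized to sum to 1. The small input matrices are made up.

```python
def transpose(matrix):
    return [list(row) for row in zip(*matrix)]

def col_norm(matrix):
    """Scale each column so it sums to 1; all-zero columns stay zero."""
    sums = [sum(col) for col in zip(*matrix)]
    return [[x / s if s else 0.0 for x, s in zip(row, sums)] for row in matrix]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def hstack(*blocks):
    """Concatenate blocks with equal row counts side by side."""
    return [sum((list(block[i]) for block in blocks), []) for i in range(len(blocks[0]))]

def tripartite_adjacency(U, V, W):
    """U: n x m (conference-author), V: m x l (author-topic), W: n x l (conference-topic)."""
    n, m, l = len(U), len(V), len(V[0])
    rows_c = hstack(zeros(n, n), U, W)
    rows_a = hstack(transpose(U), zeros(m, m), V)
    rows_t = hstack(transpose(W), transpose(V), zeros(l, l))
    return col_norm(rows_c + rows_a + rows_t)

U = [[2, 1]]          # 1 conference x 2 authors
V = [[1, 0], [1, 3]]  # 2 authors x 2 topics
W = [[4, 1]]          # 1 conference x 2 topics
J = tripartite_adjacency(U, V, W)
for row in J:
    print([round(x, 2) for x in row])
```

The resulting (n+m+l)-square matrix is column-stochastic, so it can be used with the same restart iteration as the bipartite case.
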
Random walk on DBLP social network (cont.)
- Algorithm (3): the random walk with restart algorithm for the tripartite model

DBLP dataset
- The publication data for conferences was downloaded from the DBLP website in July 2007
- It contains more than 300,000 authors, about 3,000 conferences and the selected 1,000 N-gram topics
- The entire adjacency matrix becomes too big to make the random walk efficient
  - Use the METIS algorithm to partition the large graph into ten subgraphs of about the same size

The DBconnect System
- http://kingman.cs.ualberta.ca/research/demos/content/dbconnect/
- A navigational system to investigate the community connections and relations
- Displaying researcher statistics from academic search engines
- Providing lists of recommended entities to given authors, topics and conferences

The DBconnect System (cont.)
- Academic Information
  - Conference contribution, earliest publication year and average publications per year
  - H-index is calculated based on information retrieved from Google Scholar
  - Approximate citation numbers
- Related Conferences
  - Based on the author-conference-topic model
- Related Topics
  - Based on the author-conference-topic model

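For reference, the H-index itself is simple to compute once per-paper citation counts are available. The sketch below uses the standard definition; the citation counts are made up, and how DBconnect retrieves them from Google Scholar is not described on the slides.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Made-up citation counts for one author.
print(h_index([10, 8, 5, 4, 3]))  # 4
```
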
The DBconnect System (cont.)
- Co-authors
  - Co-author name and number of papers
- Related Researchers
  - Based on the directed bipartite graph model
- Recommended Collaborators
  - Based on the author-conference-topic model
  - Co-authors' names are not shown here
  - The result implies that the given author shares similar topics and conference experiences with these listed researchers, hence the recommendation

The DBconnect System (cont.)
- Recommended To
  - The recommendation is not symmetric
  - Author A may be recommended as a possible future collaborator to author B but not vice versa
  - Ex: Jiawei Han has been recommended as a collaborator for 6201 authors, but apparently only a few of them are recommended as collaborators to him
  - The given author has been recommended to the authors in this list
- Symmetric Recommendations
  - The listed authors have been recommended to the given author, and the given author to them

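The relation between the three lists can be sketched with plain sets in Python. The data structure (a mapping from each author to their recommended collaborators), the function names, and the sample authors are hypothetical.

```python
def recommended_to(recommendations, author):
    """Authors whose recommendation list contains the given author."""
    return {other for other, recs in recommendations.items()
            if author in recs and other != author}

def symmetric(recommendations, author):
    """Authors recommended to `author` who also have `author` recommended to them."""
    return set(recommendations.get(author, ())) & recommended_to(recommendations, author)

# Hypothetical recommendation lists: author -> recommended collaborators.
recommendations = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["D"],
    "D": ["A", "C"],
}
print(sorted(recommended_to(recommendations, "A")))  # ['B', 'D']
print(sorted(symmetric(recommendations, "A")))       # ['B']
```

In this toy data, C is recommended to A but A is not recommended to C, which is the asymmetry the slide describes.
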
Conclusion
- Extend a bipartite graph model to incorporate co-authorship
- Propose a random walk with restart approach
  - Find related conferences, authors, and topics for a given entity
- Present the DBconnect system
  - Helps explore the relational structure and discover implicit knowledge within the DBLP data collection