Lise Getoor, Professor, Computer Science, UC Santa Cruz, at MLConf SF
DESCRIPTION
Abstract: One of the challenges in big data analytics lies in being able to reason collectively about extremely large, heterogeneous, incomplete, noisy, interlinked data. We need data science techniques which can represent and reason effectively with this form of rich and multi-relational graph data. In this presentation, I will describe some common collective inference patterns needed for graph data, including: collective classification (predicting missing labels for nodes in a network), link prediction (predicting potential edges), and entity resolution (determining when two nodes refer to the same underlying entity). I will describe three key capabilities required: relational feature construction, collective inference, and scaling. Finally, I will briefly describe some of the cutting-edge analytic tools being developed within the machine learning, AI, and database communities to address these challenges.

TRANSCRIPT
Big Graph Data Science
Lise Getoor University of California, Santa Cruz
SF MLConf November 14, 2014
BIG Data is not flat
Data is multi-modal, multi-relational, spatio-temporal, multi-media
NEED: Data Science for Graphs
New V: Vinculate
Patterns Tools Key Ideas
☐ Collective Classification ☐ Link Prediction ☐ Entity Resolution
Collective Classification: inferring the labels of nodes in a graph
Collective Classification
[Figure: a social network with spouse, colleague, and friend edges; question: which way does each node vote? Several nodes' votes are unknown.]
✔ Collective Classification ☐ Link Prediction ☐ Entity Resolution
Link Prediction: inferring the existence of edges in a graph
Link Prediction
• Entities: people, emails
• Observed relationships: communications, co-location
• Predicted relationships: supervisor, subordinate, colleague
✔ Collective Classification ✔ Link Prediction ☐ Entity Resolution
Entity Resolution: determining which nodes refer to the same underlying entity
Ironically, Entity Resolution has many duplicate names: doubles, duplicate detection, record linkage, deduplication, object identification, object consolidation, coreference resolution, entity clustering, reference reconciliation, reference matching, householding, household matching, fuzzy match, approximate match, merge/purge, hardening soft databases, identity uncertainty
[Figure: a reference graph before and after entity resolution]
Entity Resolution & Network Analysis
Tutorial on Entity Resolution in Big Data w/ Ashwin Machanavajjhala @ KDD13 (slides available)
✔ Collective Classification ✔ Link Prediction ✔ Entity Resolution
Graph Identification
• Goal: given an input graph, infer an output graph
• Three major components:
  - Entity Resolution (ER): infer the set of nodes
  - Link Prediction (LP): infer the set of edges
  - Collective Classification (CC): infer the node labels
• Challenge: the components are intra- and inter-dependent
Namata, Kok, Getoor, KDD 2011
Patterns Tools Key Ideas
☐ Feature Construction ☐ Collective Reasoning ☐ Blocking
Key Idea #1: Relational Feature Construction
• Feature informativeness is key to the success of a relational classifier
• Relational feature construction:
  - Node-specific measures
    - Aggregates: summarize attributes of relational neighbors
    - Structural properties: capture characteristics of the relational structure
  - Node-pair measures: summarize properties of (potential) edges
    - Attribute-based measures
    - Edge-based measures
    - Neighborhood similarity measures
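As a concrete illustration of the two families of measures, here is a minimal Python sketch on an invented toy graph; the graph, labels, and feature names are assumptions for illustration, not part of any particular toolkit.

```python
# Relational feature construction on a toy graph (all data invented).
from collections import Counter

# adjacency list: node -> set of neighbors
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
}
labels = {"a": "red", "b": "blue", "c": "red"}  # "d" is unlabeled

def node_features(g, labs, node):
    """Node-specific measures: aggregates over neighbors plus structure."""
    nbrs = g[node]
    counts = Counter(labs[n] for n in nbrs if n in labs)
    total = sum(counts.values())
    return {
        "degree": len(nbrs),                                            # structural property
        "frac_red_neighbors": counts["red"] / total if total else 0.0,  # aggregate
    }

def pair_features(g, u, v):
    """Node-pair measures for a (potential) edge between u and v."""
    common = g[u] & g[v]
    union = g[u] | g[v]
    return {
        "common_neighbors": len(common),                        # edge-based measure
        "jaccard": len(common) / len(union) if union else 0.0,  # neighborhood similarity
    }
```

Both kinds of features can be pre-computed over the observed graph and fed to any standard classifier.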
Relational Classifiers
• Pros:
  - Efficient: can handle large amounts of data; features can often be pre-computed ahead of time
  - Flexible: can take advantage of well-understood classification/regression algorithms
  - One of the most commonly used ways of incorporating relational information
• Cons:
  - Features are based on observed values/evidence; they cannot be based on attributes or relations that are being predicted
  - Makes incorrect independence assumptions
  - Hard to impose global constraints on joint assignments (e.g., when inferring a hierarchy of individuals, we may want to enforce the constraint that it is a tree)
✔ Feature Construction ☐ Collective Reasoning ☐ Blocking
Collective Classification
[Figure: a voter with an unknown vote, alongside their tweet and status update]
Donates(A, ) => Votes(A, ): 5.0
Mentions(A, “Affordable Health”) => Votes(A, ): 0.3
Collective Classification
vote(A,P) ∧ spouse(B,A) → vote(B,P) : 0.8
vote(A,P) ∧ friend(B,A) → vote(B,P) : 0.3
[Figure: the social network with spouse, colleague, and friend edges, with the unknown votes inferred collectively]
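To make the effect of weighted rules like these concrete, here is a hand-rolled sketch of collective inference on an invented four-person chain. It uses simple iterative neighbor averaging with the rule weights as edge weights; this is a crude stand-in for real PSL inference, and all names and numbers are invented.

```python
# Collective classification by iterative neighbor averaging (toy example).
# Relation weights mirror the rules above: spouse 0.8, friend 0.3.
# This is NOT the actual PSL inference algorithm, just an illustration.

# edges: (node, node, relation); scores in [0,1] are P(votes for party P)
edges = [("ann", "bob", "spouse"), ("bob", "carl", "friend"),
         ("carl", "dee", "friend")]
weight = {"spouse": 0.8, "friend": 0.3}
score = {"ann": 1.0, "bob": 0.5, "carl": 0.5, "dee": 0.5}  # only ann observed
observed = {"ann"}

for _ in range(20):                    # iterate until scores stabilize
    for u in list(score):
        if u in observed:
            continue
        num = den = 0.0
        for a, b, rel in edges:
            if u in (a, b):
                v = b if u == a else a
                num += weight[rel] * score[v]   # pull toward neighbor's score
                den += weight[rel]
        if den:
            score[u] = num / den

# ann's known vote propagates along the chain; bob, the spouse,
# is pulled hardest toward ann's vote.
```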
Link Prediction
• People, emails, words, communications, relations
• Use a model to express dependencies:
  - "If email content suggests type X, it is of type X"
  - "If A sends deadline emails to B, then A is the supervisor of B"
  - "If A is the supervisor of B, and A is the supervisor of C, then B and C are colleagues"
[Figure: an email containing the phrases "complete by" and "due"]
HasWord(E, “due”) => Type(E, deadline) : 0.6
Sends(A,B,E) ^ Type(E,deadline) => Supervisor(A,B) : 0.8
Supervisor(A,B) ^ Supervisor(A,C) => Colleague(B,C) : ∞
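One way to see how these three rules compose is a toy grounding in Python using the Łukasiewicz conjunction, the soft logic underlying PSL. Scaling truth values by the rule weight is a deliberate simplification of PSL's weighted-penalty semantics, and all emails, people, and values are invented.

```python
# Toy grounding of the three link-prediction rules above (invented data).

def land(a, b):
    """Łukasiewicz conjunction: AND(a, b) = max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)

has_word = {("e1", "due"): 1.0}            # email e1 contains "due"
sends = {("alice", "bob", "e1"): 1.0}      # alice sends e1 to bob

# Rule 1: HasWord(E, "due") => Type(E, deadline), weight 0.6
type_deadline = {"e1": 0.6 * has_word[("e1", "due")]}

# Rule 2: Sends(A, B, E) ^ Type(E, deadline) => Supervisor(A, B), weight 0.8
supervisor = {}
for (a, b, e), s in sends.items():
    supervisor[(a, b)] = 0.8 * land(s, type_deadline.get(e, 0.0))

# Rule 3 (hard): Supervisor(A, B) ^ Supervisor(A, C) => Colleague(B, C)
supervisor[("alice", "carol")] = 0.9       # assume another observed supervisor edge
colleague = {}
for (a, b), s1 in supervisor.items():
    for (a2, c), s2 in supervisor.items():
        if a == a2 and b != c:
            colleague[(b, c)] = land(s1, s2)
```

Note how the hard colleague rule fires only because the two soft supervisor inferences compose; this inter-rule dependence is exactly what makes the problem collective.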
Entity Resolution
• Entities: people, references
• Attributes: name
• Relationships: friendship
• Goal: identify references that denote the same person
[Figure: references A ("John Smith") and B ("J. Smith") with friends C through H; pairs marked "=" denote the same person]
Entity Resolution
• References, names, friendships
• Use a model to express dependencies:
  - "If two people have similar names, they are probably the same"
  - "If two people have similar friends, they are probably the same"
  - "If A=B and B=C, then A and C must also denote the same person"
A.name ≈{str_sim} B.name => A≈B : 0.8
{A.friends} ≈{} {B.friends} => A≈B : 0.6
A≈B ^ B≈C => A≈C : ∞
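These rules can be sketched end to end: a string-similarity rule proposes matches, and union-find enforces the hard transitivity rule A≈B ∧ B≈C ⇒ A≈C. The names, similarity function, and 0.7 threshold are illustrative assumptions.

```python
# Minimal entity-resolution sketch (invented references and thresholds).
from difflib import SequenceMatcher

names = {"A": "John Smith", "B": "J. Smith", "C": "Jon Smith", "D": "Mary Jones"}

def str_sim(x, y):
    """A cheap stand-in for a real string-similarity measure."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

# Union-find enforces the hard transitivity constraint.
parent = {r: r for r in names}

def find(r):
    while parent[r] != r:
        parent[r] = parent[parent[r]]   # path compression
        r = parent[r]
    return r

def union(r, s):
    parent[find(r)] = find(s)

refs = sorted(names)
for i, r in enumerate(refs):
    for s in refs[i + 1:]:
        if str_sim(names[r], names[s]) > 0.7:   # soft name-match rule
            union(r, s)                         # transitive closure via union-find

clusters = {}
for r in refs:
    clusters.setdefault(find(r), set()).add(r)
```

Here the three Smith variants collapse into one entity while Mary Jones stays separate; a real collective model would additionally weigh friendship overlap, as in the second rule above.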
Challenges
• Collective classification: labeling nodes in a graph
  - Irregular structure: not a chain, not a grid
  - Challenge: one large, partially labeled graph
• Link prediction: predicting edges in a graph
  - Dependencies among edges
  - Don't want to reason about all possible edges
  - Challenge: scaling & extremely skewed probabilities
• Entity resolution: determining which nodes refer to the same entities in a graph
  - Dependencies between clusters
  - Challenge: enforcing constraints, e.g. transitive closure
✔ Feature Construction ✔ Collective Reasoning ☐ Blocking
Blocking: Motivation
• Naïve pairwise matching: |N|² pairwise comparisons
  - 1,000 business listings each from 1,000 different cities across the world
  - 1 trillion comparisons
  - 11.6 days (if each comparison takes 1 μs)
• Mentions from different cities are unlikely to be matches
  - Blocking criterion: city
  - 1 billion comparisons
  - 16 minutes (if each comparison takes 1 μs)
Blocking: Motivation
• Mentions from different cities are unlikely to be matches
  - But blocking may miss potential matches
Blocking: Motivation
[Figure: Venn diagram of the set of all pairs of nodes, the matching pairs of nodes, and the pairs satisfying the blocking criterion]
Blocking Algorithms 1
• Hash-based blocking
  - Each block C_i is associated with a hash key h_i
  - Mention x is hashed to C_i if hash(x) = h_i
  - Within a block, all pairs are compared
  - Each hash function results in disjoint blocks
• What hash function?
  - Deterministic function of attribute values
  - Boolean functions over attribute values [Bilenko et al. ICDM'06, Michelson et al. AAAI'06, Das Sarma et al. CIKM'12]
  - minHash (min-wise independent permutations) [Broder et al. STOC'98]; locality-sensitive hashing
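A minimal sketch of hash-based blocking with the city criterion from the motivation slide; the record layout and listings are invented.

```python
# Hash-based blocking: hash each mention on a deterministic attribute
# (its city), then compare pairs only within a block.
from collections import defaultdict
from itertools import combinations

mentions = [
    {"name": "Joe's Pizza", "city": "NYC"},
    {"name": "Joes Pizza", "city": "NYC"},
    {"name": "Joe's Pizza", "city": "SF"},
    {"name": "Blue Bottle", "city": "SF"},
]

blocks = defaultdict(list)
for m in mentions:
    blocks[m["city"]].append(m)     # hash(x) = city => disjoint blocks

candidate_pairs = [
    pair
    for block in blocks.values()
    for pair in combinations(block, 2)
]
# 2 blocks of size 2 -> 1 + 1 = 2 comparisons instead of C(4,2) = 6
```

The cost of the lost comparisons is exactly the "may miss potential matches" caveat above: the two cross-city Joe's Pizza listings are never compared.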
Blocking Algorithms 2
• Pairwise similarity / neighborhood-based blocking
  - Nearby nodes according to a similarity metric are clustered together
  - Results in non-disjoint canopies
• Techniques
  - Sorted Neighborhood Approach [Hernández et al. SIGMOD'95]
  - Canopy Clustering [McCallum et al. KDD'00]
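Canopy clustering can be sketched in a few lines: a cheap distance and two thresholds (tight < loose) produce possibly overlapping canopies. The points and thresholds here are invented; in this toy example the groups are well separated, so the canopies happen not to overlap.

```python
# Canopy clustering in the style of McCallum et al.: pick a random center,
# everything within t_loose joins its canopy, and everything within t_tight
# is removed from the pool (so it cannot seed another canopy). In general
# the resulting canopies are non-disjoint.
import random

def canopies(points, dist, t_loose, t_tight):
    pool = list(points)
    result = []
    rng = random.Random(0)           # seeded for reproducibility
    while pool:
        center = pool.pop(rng.randrange(len(pool)))
        canopy = [center] + [p for p in pool if dist(center, p) < t_loose]
        result.append(canopy)
        # points tightly bound to this center can't seed another canopy
        pool = [p for p in pool if dist(center, p) >= t_tight]
    return result

points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
out = canopies(points, lambda a, b: abs(a - b), t_loose=1.0, t_tight=0.5)
```

The expensive pairwise comparison then runs only within each canopy, just as with hash blocks, but a point near a canopy boundary can appear in several canopies, so fewer true matches are lost.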
✔ Feature Construction ✔ Collective Reasoning ✔ Blocking
Patterns Tools Key Ideas
http://psl.umiacs.umd.edu
Matthias Broecheler Lily Mihalkova Stephen Bach Stanley Kok Alex Memory
Bert Huang
Angelika Kimmig
Arti Ramesh Jay Pujara Shobeir Fakhraei Hui Miao Ben London
Probabilistic Soft Logic (PSL)
• Declarative language based on logic to express collective probabilistic inference problems
  - Predicate = relationship or property
  - Atom = (continuous) random variable
  - Rule = captures a dependency or constraint
  - Set = defines aggregates
• PSL Program = Rules + Input DB
Collective Classification
vote(A,P) ∧ spouse(B,A) → vote(B,P) : 0.8
vote(A,P) ∧ friend(B,A) → vote(B,P) : 0.3
[Figure: the social network with spouse, colleague, and friend edges]
PSL Foundations
• PSL makes large-scale reasoning scalable by mapping logical rules to convex functions
• Three principles justify this mapping:
  - LP relaxations of MAX SAT with approximation guarantees [Goemans and Williamson '94]
  - Pseudomarginal LP relaxations of Boolean Markov random fields [Wainwright et al. '02]
  - Łukasiewicz logic, a logic for reasoning about continuous values [Klir and Yuan '95]
Hinge-loss Markov Random Fields

P(Y | X) = (1/Z) exp( −Σ_{j=1}^{m} w_j max{ ℓ_j(Y, X), 0 }^{p_j} )
• Continuous variables in [0,1]
• Potentials are hinge-loss functions
• Subject to arbitrary linear constraints
• Log-concave!
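To make the objective concrete, here is a tiny MAP-inference sketch: two of the vote rules plus an invented opposing-evidence potential, squared hinges (p_j = 2), minimized by projected gradient descent on [0,1]. This mimics the shape of the hinge-loss objective, not PSL's actual ADMM-based solver, and all weights are illustrative.

```python
# MAP inference for a tiny hinge-loss MRF: minimize sum_j w_j * max(l_j, 0)^2
# over continuous variables in [0,1], by projected gradient descent.

def clamp(v):
    return min(1.0, max(0.0, v))

yb, yc = 0.5, 0.5            # P(bob votes P), P(carl votes P); ann = 1 observed
lr = 0.05
for _ in range(2000):
    h1 = max(0.0, 1.0 - yb)  # spouse rule vote(ann) => vote(bob), w = 0.8
    h2 = max(0.0, yb - yc)   # friend rule vote(bob) => vote(carl), w = 0.3
    h3 = max(0.0, yc)        # invented evidence that carl opposes P, w = 0.5
    g_b = -2 * 0.8 * h1 + 2 * 0.3 * h2   # gradient of the squared hinges
    g_c = -2 * 0.3 * h2 + 2 * 0.5 * h3
    yb, yc = clamp(yb - lr * g_b), clamp(yc - lr * g_c)
# the spouse rule pulls bob close to ann's vote; carl ends up in between,
# balancing the friend rule against the opposing evidence
```

Because every potential is convex and the domain is a box, any descent method finds the global MAP state; that convexity is exactly what the three foundations above buy.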
PSL in a Slide
• MAP inference in PSL translates into a convex optimization problem → inference is really fast!
• Inference is further enhanced with state-of-the-art optimization and distributed processing paradigms such as ADMM & GraphLab → inference even faster!
• Outperforms discrete MRFs in terms of speed, and (very) often accuracy
• PSL is flexible: applied to image segmentation, activity recognition, stance detection, sentiment analysis, document classification, drug-target prediction, latent social groups and trust, engagement modeling, ontology alignment, and looking for more!
psl.umiacs.umd.edu
Discussion
Patterns Tools Key Ideas
NEED: Data Science for Graphs
Closing Comments
• Need new statistical and ML theory for very large graphs
  - Better understand bias, identifiability, and mechanism design in graph domains
• Important topics not touched on:
  - Generative mechanisms, causal reasoning, dynamic modeling
  - Privacy, users, & visualization
• Many exciting opportunities to develop new theory and algorithms and apply them to compelling business, intelligence, social, and societal problems!