1
Generating Semantic Annotations for Frequent Patterns with
Context Analysis
Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai
University of Illinois at Urbana-Champaign
April 21, 2023
2
Frequent Pattern Mining ([Agrawal & Srikant 94] and many others)
Database:
A B E
C D E
A B F
C D E F
A B E F
……
Frequent Patterns:
A, B, C, D, E, F
AB, AE, AF, BE, BF, CD, CE, DE, EF
ABE, ABF, CDE
Itemsets: diaper milk ; camera film ; …
Sequential Patterns:
... Mining Closed Frequent Graph Patterns …
… Mining Graph and Structured Patterns in ...
Subgraph Patterns: …
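A minimal sketch of frequent itemset mining on the toy database from this slide. It is a brute-force enumeration, not the Apriori algorithm of [Agrawal & Srikant 94]; the function name and the threshold min_support=2 are assumptions chosen for illustration. With that threshold the five visible transactions yield the 18 patterns drawn in the lattice.

```python
from itertools import combinations

# Toy database from the slide: each transaction is a set of items.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]

def frequent_itemsets(database, min_support):
    """Return {itemset: support count} for itemsets found in >= min_support transactions."""
    items = sorted(set().union(*database))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            support = sum(1 for t in database if set(candidate) <= t)
            if support >= min_support:
                result[frozenset(candidate)] = support
                found = True
        if not found:
            # No frequent itemset of this size, so no larger one can be frequent.
            break
    return result

# min_support=2 is an assumed threshold, not a value stated on the slide.
patterns = frequent_itemsets(DATABASE, min_support=2)
```

Here patterns[frozenset("AB")] is 3, i.e. AB occurs in 3 of the 5 transactions.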
3
Toward Understanding the Patterns -- Find Canonical Patterns
Database:
A B E
C D E
A B F
C D E F
A B E F
……
Frequent Patterns:
A, B, C, D, E, F
AB, AE, AF, BE, BF, CD, CE, DE, EF
ABE, ABF, CDE
C D E F
1.0 1.0 0.9 0.8
( Yan et al ‘05)
( Xin et al ‘05)
4
Toward Understanding the Patterns -- How to Interpret Patterns?
Example patterns: diaper beer; female sterile (2) tekele
• Do they all make sense?
• What do they mean?
• How are they useful?
Not all frequent patterns are useful, only those with meanings…
morphological info. and simple statistics
Semantic Information
Our goal: Annotate patterns with semantic information
5
Challenges
• How can we represent the semantics of a frequent pattern? (Annotate a pattern with what?)
• How can we infer pattern semantics? (How to annotate?)
• How can we do it in a general way? (Do it for all kinds of patterns)
• Once such annotations are generated, what can we use them for? (Applications)
6
Word: “pattern” – from Merriam-Webster
A Dictionary Analogy
Non-semantic info.
Examples of Usage
Definitions indicating semantics
Synonyms
Related Words
7
What about a “Pattern Dictionary”?-- Semantic Pattern Annotation (SPA)
Word: pattern
Non-Semantic: function; pronunciation; date; etc.
Definitions: A form or model proposed for …
Examples: a dressmaker’s pattern; a pattern of dissent
Synonyms: design, device, motif, motive…
Related words: original, constellation …

Pattern: “latent semantic analysis”
Non-Semantic: sequential; closed; sup = 0.1%
Context Indicators (CI): “indexing”, “semantic”, “S. Dumais”, “singular value decomposition”, …
Representative Transactions: index by latent semantic analysis; probablist latent semantic analysis
Semantically Similar Patterns (SSP): “latent semantic indexing”, “LSA”, “PLSA”
8
How Can We Generate Such an Entry?
Database:
A B E
C D E
A B F
C D E F
A B E F
Frequent Patterns: P1: AB; P2: CD; P3; …; Pn
Semantic Annotations:
Pattern AB
Non-Semantic: Sup = 60%
CI: AB, E, F, EF …
Trans.: ABE; ABEF
SSPs: CD; …
Pattern CD
… …
How to infer the semantics of a frequent pattern?
How to infer the semantics of a frequent pattern?
9
Continue the Analogy…
You’ll know the meaning of a pattern by its context
“You shall know a word by the company it keeps.”
- Firth 1957
Data … association … pattern … MINE … algorithm …
mountain … Africa … diamond … MINE … weight …
{C,D}: { … Printer, Film, Camera, Lens, … }
{A,B}: { … Baby, Milk, Diaper, Toy, Soymilk… }
Pattern Context
10
Our Approach: Model the Context
Database:
A B E
C D E
A B F
C D E F
A B E F
Frequent Patterns: P1: AB; P2: CD; …; Pn
Context Units = Objects co-occurring with p
<E, F, …, EF, … ABE>
<E, F, …, EF, …,CDEF>
Semantic Annotations:
Pattern AB
Non-Semantic: Sup = 60%
CI: AB, E, F, EF
Trans.: ABE; ABEF
SSPs: CD; …
Pattern CD
… …
11
Semantic Analysis with Context Models
• Task1: Model the context of a frequent pattern
Based on the Context Model…
• Task2: Extract strongest context indicators
• Task3: Extract representative transactions
• Task4: Extract semantically similar patterns
12
Task1: Context Modeling - A Vector Space Model
Database:
A B E
C D E
A B F
C D E F
A B E F
Frequent Patterns: P1: AB; P2: CD; …; Pn
Context Units:
<E, F, …, EF, … ABE>
<E, F, …, EF, …,CDEF>
Context vectors:
< 2.0, 2.0, …, 1.0, … , 1.0 >
< 2.0, 2.0, …, 1.0, … , 1.0 >
Context Unit Weight: Co-occurrence, Mutual Information, ……
Context Similarity: Cosine Similarity, Pearson Coefficient, ……
Semantic Annotations:
Pattern AB
Non-Semantic: Sup = 60%
CI: AB, E, F, EF
Trans.: ABE; ABEF
SSPs: CD; …
Pattern CD
… …
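The vector space model on this slide can be sketched as follows, assuming co-occurrence counts as context unit weights and cosine as the context similarity. The function names and the three context units chosen here are illustrative assumptions; the vectors printed on the slide are schematic and are not reproduced.

```python
import math

# Toy database from the slide: each transaction is a set of items.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]

def cooccurrence(pattern, unit, database):
    """Number of transactions that contain both the pattern and the context unit."""
    return sum(1 for t in database if set(pattern) <= t and set(unit) <= t)

def context_vector(pattern, units, database):
    """Context of a pattern as a vector of co-occurrence weights, one per unit."""
    return [float(cooccurrence(pattern, u, database)) for u in units]

def cosine(x, y):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

# Hypothetical choice of context units for the illustration.
UNITS = ["E", "F", "EF"]
vec_ab = context_vector("AB", UNITS, DATABASE)   # [2.0, 2.0, 1.0]
vec_cd = context_vector("CD", UNITS, DATABASE)   # [2.0, 1.0, 1.0]
similarity = cosine(vec_ab, vec_cd)
```

Swapping in mutual information for the weights or the Pearson coefficient for the similarity only changes cooccurrence or cosine; the rest of the pipeline stays the same.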
13
Context Unit Selection
Transactions:
t1: diaper, milk, babywear, lotion
t2: camera, memory stick, printer
Valid Context Units:
Single items: diaper, milk, printer, …
Itemsets: milk lotion; camera
Transactions: t1, t2
In general, Context Units are frequent patterns
14
Context Unit Selection: Redundancy Removal
• Problem: too many valid context units, and most are redundant
– { Diaper, milk, babywear }: “diaper”, “diaper, milk”, “milk, babywear”, “milk, lotion”, …
• Solution:
– use closed patterns
– micro-clustering (hierarchical, one-pass)
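• Jaccard Distance (γ: threshold to stop clustering):
$D(p_\alpha, p_\beta) = 1 - \frac{|D_\alpha \cap D_\beta|}{|D_\alpha \cup D_\beta|}$
where $D_\alpha$ is the set of transactions that contain pattern $p_\alpha$.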
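A minimal sketch of redundancy removal with the Jaccard distance between the supporting transaction sets of two patterns. The one-pass rule used here (join the first cluster whose members are all within γ, otherwise start a new cluster) is an assumption made for illustration; the slide does not spell out the exact procedure, and gamma=0.1 is an arbitrary value.

```python
# Toy database: each transaction is a set of items.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]

def supporting(pattern, database):
    """Indices of the transactions that contain the pattern."""
    return {i for i, t in enumerate(database) if set(pattern) <= t}

def jaccard_distance(p1, p2, database):
    """1 - |D1 ∩ D2| / |D1 ∪ D2| over the supporting transaction sets."""
    d1, d2 = supporting(p1, database), supporting(p2, database)
    union = d1 | d2
    return 1.0 - len(d1 & d2) / len(union) if union else 0.0

def one_pass_clusters(patterns, database, gamma):
    """Assumed one-pass rule: join the first cluster whose members are all within gamma."""
    clusters = []
    for p in patterns:
        for cluster in clusters:
            if all(jaccard_distance(p, q, database) <= gamma for q in cluster):
                cluster.append(p)
                break
        else:
            clusters.append([p])  # no close cluster found, start a new one
    return clusters

clusters = one_pass_clusters(["A", "B", "AB", "C", "D", "CD", "E"], DATABASE, gamma=0.1)
```

A, B and AB occur in exactly the same transactions, so they collapse into one micro-cluster, and one representative per cluster can then serve as a context unit.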
15
Task2: Extract Context Indicators
Database:
A B E
C D E
A B F
C D E F
A B E F
Frequent Patterns: P1: AB; P2: CD; …; Pn
Context Units: <A, B, AB, C, D, CD, E, F, EF, AE, BF, … ABE, ABF,…, ABEF>
Context Unit Weighting:
< AB, CD, … , EF, … ABE, …>
< 3.0, 0, … 2.0, … , 1.0, …>
Ranked context units: AB 3.0; EF 2.0; ABE 1.0; …
Semantic Annotations:
Pattern AB
Non-Semantic: Sup = 60%
CI: AB, EF, ABE..
Trans.: ABE; ABEF
SSPs: CD; …
Pattern CD
… …
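Extracting context indicators amounts to weighting every context unit and keeping the top-ranked ones. The sketch below assumes mutual information between the presence of the pattern and the presence of the unit as the weight; the function names, the candidate units and k=2 are illustrative assumptions, and the weights on the slide are schematic rather than reproduced.

```python
import math

# Toy database: each transaction is a set of items.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]

def mutual_information(p, u, database):
    """Mutual information (in nats) between presence of p and presence of u."""
    n = len(database)
    counts = {}
    for t in database:
        key = (set(p) <= t, set(u) <= t)
        counts[key] = counts.get(key, 0) + 1
    # Marginal probabilities of presence/absence for p and for u.
    p_p = {x: sum(c for (a, _), c in counts.items() if a == x) / n for x in (True, False)}
    p_u = {y: sum(c for (_, b), c in counts.items() if b == y) / n for y in (True, False)}
    mi = 0.0
    for (x, y), c in counts.items():  # only non-empty cells are stored
        joint = c / n
        mi += joint * math.log(joint / (p_p[x] * p_u[y]))
    return mi

def top_context_indicators(pattern, units, database, k):
    """Rank context units by weight and keep the k strongest."""
    scored = [(u, mutual_information(pattern, u, database)) for u in units]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

indicators = top_context_indicators("AB", ["E", "F", "EF", "AB"], DATABASE, k=2)
```

One caveat of this unsmoothed form: mutual information is also high for units that never co-occur with the pattern (CD scores as high as AB itself on this database), so a practical weighting may need to treat negative association separately.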
16
Task3: Extract Representative Transactions
Database:
A B E
C D E
A B F
C D E F
A B E F
Frequent Patterns: P1: AB
Context Units: < AB, CD, … , EF, … ABE, …>
Pattern AB as a vector over the context units: 3.0, 0, …,2.0, … , 1.0
Transactions (T1, T5) as vectors over the same units, e.g.: 1.0, 0, …,1.0, … , 1.0
Semantic Similarity: T5 0.8; T1 0.6; T3 0.6; …
Semantic Annotations:
Pattern AB
Non-Semantic: Sup = 60%
CI: AB, E, F, EF
Trans.: ABEF; ABE
SSPs: CD; …
Pattern CD
… …
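A minimal sketch of representative transaction extraction: each transaction is mapped onto the same context units as the pattern and ranked by cosine similarity to the pattern's context vector. The binary transaction vectors, the co-occurrence weights, the four units and the function names are assumptions for illustration; the similarity scores on the slide are schematic and this sketch does not reproduce them.

```python
import math

# Toy database keyed by transaction id.
DATABASE = {"T1": set("ABE"), "T2": set("CDE"), "T3": set("ABF"),
            "T4": set("CDEF"), "T5": set("ABEF")}
UNITS = ["AB", "CD", "EF", "ABE"]  # hypothetical selection of context units

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def pattern_vector(pattern, units, database):
    """Co-occurrence count of the pattern with each context unit."""
    return [sum(1 for t in database.values() if set(pattern) <= t and set(u) <= t)
            for u in units]

def transaction_vector(transaction, units):
    """1 if the transaction contains the context unit, else 0."""
    return [1 if set(u) <= transaction else 0 for u in units]

def representative_transactions(pattern, units, database):
    """Transactions sorted by similarity to the pattern's context, best first."""
    pv = pattern_vector(pattern, units, database)
    scored = [(tid, cosine(pv, transaction_vector(t, units)))
              for tid, t in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = representative_transactions("AB", UNITS, DATABASE)
```

With these assumed weights the transactions containing AB rank above those that do not, and a transaction sharing no context unit with the pattern scores 0.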
17
Task4: Extract Semantically Similar Patterns
Database:
A B E
C D E
A B F
C D E F
A B E F
Frequent Patterns: P1: AB; P2: CD; …; Pk: EF
Context Units: < AB, CD, … , EF, … ABE, …>
AB: 3.0, 0, …,2.0, … , 1.0
Other patterns as vectors over the same units, e.g.: 0, 3.0, …,2.0, … , 0.5
Semantic Similarity: CD 0.7; BF 0.5; EF 0.3; …
Semantic Annotations:
Pattern AB
Non-Semantic: Sup = 60%
CI: AB, E, F, EF
Trans.: ABEF; ABE
SSPs: CD; …
Pattern CD
… …
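Semantically similar patterns can be found by comparing the context vector of the target pattern with the context vectors of other patterns. The sketch below assumes co-occurrence weights over single-item context units and the Pearson coefficient as the similarity; the names, the unit selection and the candidate list are illustrative assumptions, and the resulting order differs from the schematic scores on the slide.

```python
import math

# Toy database: each transaction is a set of items.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]
UNITS = ["A", "B", "C", "D", "E", "F"]  # assumed: single items as context units

def context_vector(pattern, units, database):
    """Co-occurrence count of the pattern with each context unit."""
    return [sum(1 for t in database if set(pattern) <= t and set(u) <= t)
            for u in units]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def similar_patterns(target, candidates, units, database):
    """Candidates sorted by context similarity to the target, best first."""
    tv = context_vector(target, units, database)
    scored = [(c, pearson(tv, context_vector(c, units, database))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ssps = similar_patterns("AB", ["CD", "BF", "EF"], UNITS, DATABASE)
```

On this toy data BF comes out closest to AB and CD gets a negative correlation, because the two patterns occur in disjoint sets of transactions.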
18
Experiments
• Three different real world applications
– Annotating DBLP title/authors Patterns
– Motif/Gene-Ontology (GO) matching
– Gene Synonyms extraction
• Study the effectiveness of the proposed SPA methods
• Explore applications of SPA to different real world tasks
19
Annotating DBLP Co-authorship and Title Pattern
Database:
Title: Substructure Similarity Search in Graph Databases
Authors: X. Yan, P. Yu, J. Han
……
Frequent Patterns:
P1: { x_yan, j_han } (Frequent Itemset)
P2: “substructure search” (Frequent Sequential Pattern)
Context Units:
< { p_yu, j_han}, { d_xin }, … , “graph pattern”, … “substructure similarity”, … >
Semantic Annotations:
Pattern: { x_yan, j_han }
Non-Semantic: Sup = …
CI: {p_yu}, graph pattern, …
Trans.: gSpan: graph-base……
SSPs: { j_wang }, {j_han, p_yu}, …
20
DBLP Results: Frequent Itemset
Pattern = {xifeng_yan, jiawei_han}
Annotations:
Context Indicators (CI): graph; {philip_yu}; mine close; graph pattern; index approach; sequential pattern; …
Representative Transactions (Trans):
> gSpan: graph-base substructure pattern mining;
> mining close relational graph connect constraint; …
Semantically Similar Patterns (SSP): {jiawei_han, philip_yu}; {jian_pei, jiawei_han}; {jiong_yang, philip_yu, wei_wang}; …
21
DBLP Results: Freq. Seq. Pattern
Pattern = “Information … retrieval”
Annotations:
Context Indicators (CI): {w_bruce_croft}; web information; full text; {monika_rauch_hezinger}; {james_p_callan}; …
Representative Transactions (Trans):
> web information retrieval
> language model information retrieval
Semantically Similar Patterns (SSP): information use; web information; probabilistic information; information filter; text information; …
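A short sketch of how a sequential pattern such as “Information … retrieval” can be matched against titles. It assumes that the “…” stands for an arbitrary gap, so a title supports the pattern when the pattern's words appear in order; this gap semantics, the function names and the third title are assumptions for illustration.

```python
def contains_sequence(pattern, sequence):
    """True if the words of pattern occur in sequence in order, gaps allowed."""
    remaining = iter(sequence)
    # `word in remaining` advances the iterator, which enforces the order.
    return all(word in remaining for word in pattern)

def sequence_support(pattern, sequences):
    """Number of sequences that contain the pattern as a subsequence."""
    return sum(1 for s in sequences if contains_sequence(pattern, s))

TITLES = [
    "web information retrieval".split(),
    "language model information retrieval".split(),
    "retrieval of information".split(),  # hypothetical title, wrong word order
]
support = sequence_support(["information", "retrieval"], TITLES)
```

The first two titles are the representative transactions listed for this pattern; the third is made up to show that word order matters.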
22
Motif-GO Matching
[Figure: Sequences 1-3 are annotated with GO terms 1-5 and contain motifs motif1 to motif5; the open question is which GO terms match motif2.]
Motif: a subsequence pattern in the sequences
Gene Ontology (GO) terms: annotate the functionality of sequences and motifs
23
Motif-GO Matching (Cont.)
Database: Protein Sequence | GO terms
GO terms of example sequences: “GOTerm1; GOTerm2; GOTerm3” and “GOTerm3”
……
Frequent Patterns:
P1: Motif1 (Sequential Pattern)
P2: GOTerm2 (Single Item Pattern)
Context Units:
< Motif1, Motif3, …, GOTerm1, GOTerm2, … >
Semantic Annotations:
Pattern: Motif1
Non-Semantic:
CI: GOTerm1, GOTerm3, …
Trans.:
SSPs: GOTerm1, GOTerm2, …
Motif-GO matching: Motif1 → GOTerm1, GOTerm2
24
Motif/GO Matching: Evaluation
• Gold standard generated by human experts
• Measure: Mean reciprocal rank (MRR)
– Reflects ranking accuracy (the higher the better)
– 1/Rank (0.5 means the correct answer is ranked 2nd)
• Results (MRR, by weights for context units):

Ranking Strategy      Mutual Information   Co-occurrence
Random Selection      0.0023               0.0023
Context Indicators    0.5877               0.6064
SSPs                  0.4017               0.4681
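For reference, mean reciprocal rank can be computed as in the sketch below. The function name, the GO labels and the convention that a query with no correct answer in its ranking contributes 0 are assumptions for illustration; the data is made up and unrelated to the results in the table.

```python
def mean_reciprocal_rank(rankings, gold):
    """rankings: one ranked candidate list per query; gold: one set of correct answers per query."""
    total = 0.0
    for ranked, correct in zip(rankings, gold):
        for rank, candidate in enumerate(ranked, start=1):
            if candidate in correct:
                total += 1.0 / rank  # reciprocal rank of the first correct answer
                break
    return total / len(rankings)

# Hypothetical rankings: correct answer at rank 1, at rank 2, and missing.
rankings = [["GO1", "GO2", "GO3"], ["GO4", "GO5", "GO6"], ["GO7", "GO8", "GO9"]]
gold = [{"GO1"}, {"GO5"}, {"GO0"}]
mrr = mean_reciprocal_rank(rankings, gold)  # (1 + 0.5 + 0) / 3 = 0.5
```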
25
Gene Synonym Extraction
• Gene Synonyms:
– A Sequential Pattern in the textual database
– Matching gene synonyms: a challenging and important new problem in mining biology data
– Analogy: thesaurus or synonyms in dictionary
Gene_id Gene Synonyms
FBgn0001000 female sterile 2 tekele; fs 2 sz 10; tek; fs 2 tek; tekele; …
26
Gene Synonym Extraction (Cont.)
Database: Biomedical Sentences
… D. melanogaster gene Female sterile (2) Tekele …
… Female sterile (2) Tekele , abbreviated as Fs(2)Tek …
…
Frequent Patterns:
P1: female sterile (2) tekele (Sequential Pattern)
P2: Fs(2)Tek (Sequential Pattern)
Context Units: context units can be single words or sequential patterns
< gene, female, …, d. melanogaster gene , … >
Semantic Annotations:
Pattern: female sterile (2) tekele
Non-Semantic:
CI:
Trans.:
SSPs: Fs(2)Tek, female sterile, fs 2 sz 10, …
Matched Synonyms for female sterile (2) tekele: Fs(2)Tek; fs 2 sz 10; female sterile; …
27
Gene Synonym Extraction: Results
• Effective! MRR > 0.5
• frequent pattern >> single words
• Micro-clustering is useful
[Figure: MRR and running time for hierarchical and one-pass micro-clustering]
28
Conclusions
• A novel problem: semantic pattern annotation
• A structured annotation for frequent patterns
• A general method based on context modeling
• A general post-processing procedure for frequent pattern mining, applicable to any type of pattern
• Applicable to and effective for quite different tasks
• Future work:
– Tune for specific tasks
– Better context unit weights, redundancy removal, etc.