1 generating semantic annotations for frequent patterns with context analysis qiaozhu mei, dong xin,...

29
1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong C heng, Jiawei Han, ChengXiang Zhai University of Illinois at Urbana-C hampaign June 23, 2022

Upload: kory-owens

Post on 17-Jan-2016

224 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

1

Generating Semantic Annotations for Frequent Patterns with

Context Analysis

Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai

University of Illinois at Urbana-Champaign

April 21, 2023

Page 2: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

2

Frequent Patterns

A B

AB

ABE ABF

C

CD

CDE

E F

EFDECE

D

AE

BE BFAF

Frequent Pattern Mining( [Agrawal & Srikant 94] and many others)

A B E

C D E

A B F

C D E F

A B E F

……

Database

Itemsets: diaper milk ; camera film ; …

Sequential Patterns:

... Mining Closed Frequent Graph Patterns …

… Mining Graph and Structured Patterns in ...

Subgraph Patterns: …

Page 3: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

3

Frequent Patterns

A B

AB

ABE ABF

C

CD

CDE

E F

EFDECE

D

AE

BE BFAF

Toward Understanding the Patterns-- Find Canonical Patterns

A B E

C D E

A B F

C D E F

A B E F

……

Database

C D E F

1.0 1.0 0.9 0.8

( Yan et al ‘05)

( Xin et al ‘05)

Page 4: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

4

• Do they all make sense?• What do they mean?• How are they useful?

diaper beer

female sterile (2) tekele

Our goal: Annotate patterns with semantic information

morphological info. and simple statistics

Semantic Information

Not all frequent patterns are useful, only those with meanings…

Toward Understanding the Patterns-- How to Interpret Patterns?

Page 5: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

5

Challenges

• How can we represent the semantics of a frequent pattern? (Annotate a pattern with what?)

• How can we infer pattern semantics? (How to annotate?)

• How can we do it in a general way? (Do it for all kinds of patterns)

• Once such annotations are generated, what can we use them for? (Applications)

Page 6: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

6

Word: “pattern” – from Merriam-Webster

A Dictionary Analogy

Non-semantic info.

Examples of Usage

Definitions indicating semantics

Synonyms

Related Words

Qiaozhu Mei
put this earlier, following the example...our work is motivated by the analogy...remove the sentence, make the figure larger.show only one definition box
Page 7: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

7

What about a “Pattern Dictionary”?-- Semantic Pattern Annotation (SPA)

PatternWord:

function; pronunciation; date; etc.Non-Semantic:

A form or model proposed for …Definitions:

a dressmaker’s patternExamples:

design, device, Synonyms

motif, motive…

a pattern of dissent

original, constellation …Related words:

“latent semantic analysis”Pattern:

sequential; close; sup = 0.1%Non-Semantic:

“indexing”, “semantic”, “S. Dumais”, Context Indicators (CI): “singular value decomposition”, …

index by latent semantic analysisRepresentativeTransactions: probablist latent semantic analysis

“latent semantic indexing”, Semantically similar

Patterns (SSP): “LSA”, “PLSA”

Qiaozhu Mei
make this larger
Page 8: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

8

How Can We Generate Such an Entry?

A B E

C D E

A B F

C D E F

A B E F

Pattern AB

Non Sup = 60%

CI AB, E, F, EF …

Trans. ABE; ABEF

SSPs CD; …

DatabaseSemantic Annotations

P2: CD

P3

:

P1: AB

Pn

:

Frequent Patterns

…Pattern CD

… …

?

How to infer the semantics of a frequent pattern?

Page 9: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

9

Continue the Analogy…

You’ll know the meaning of a pattern by its context

“You shall know a word by the company it keeps.”

- Firth 1957

Data … association … pattern … MINE … algorithm …

mountain … Africa … diamond … MINE … weight …

{C,D}: { … Printer, Film, Camera, Lens, … }

{A,B}: { … Baby, Milk, Diaper, Toy, Soymilk… }

Pattern Context

Page 10: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

10

Our Approach: Model the Context

A B E

C D E

A B F

C D E F

A B E F

Pattern AB

Non Sup = 60%

CI AB, E, F, EF

Trans. ABE; ABEF

SSPs CD; …P2: CD

P1: AB

Pn

:

Database Frequent Patterns

Semantic Annotations

…Pattern CD

… …

<E, F, …, EF, … ABE>

<E, F, …, EF, …,CDEF>

Context Units

Context Units = Objects co-occurring with p

Page 11: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

11

Semantic Analysis with Context Models

• Task1: Model the context of a frequent pattern

Based on the Context Model…• Task2: Extract strongest context indicators • Task3: Extract representative transactions • Task4: Extract semantically similar patterns

Qiaozhu Mei
move this slide to the front. getting the big picture out. do not distract the audience with "redundancy", "weight". Start from the goal and show the big picture first. before showing "context units".add "work for any types of patterns"
Page 12: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

12

Task1: Context Modeling - A Vector Space Model

A B E

C D E

A B F

C D E F

A B E F

Pattern AB

Non Sup = 60%

CI AB, E, F, EF

Trans. ABE; ABEF

SSPs CD; …

P2: CD

P1: AB

Pn

:

Database Frequent Patterns

Semantic Annotations

…Pattern CD

… …

Context Units

<E, F, …, EF, … ABE>

<E, F, …, EF, …,CDEF>

< 2.0, 2.0, …, 1.0, … , 1.0 >

< 2.0, 2.0, …, 1.0, … , 1.0 >

Context Unit Weight:

Context Similarity:

Co-occurrence

Mutual Information

……

Cosine Similarity

Pearson Coefficient

……

<E, F, …, EF, … ABE>

Page 13: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

13

Context Unit Selection

diaper milk babywear lotion

camera memory stick printer

t1

t2

Valid Context Units:

In general, Context Units are frequent patterns

Single itemsdiaper milk printer, , …

,

t1 t2 transactions

milk lotion itemsetscamera

Page 14: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

14

Context Unit Selection: Redundancy Removal

• Problem: too many valid context units, most are redundant– { Diaper, milk, babywear }: “diaper”, “diaper,

milk”, “milk, babywear”, “milk, lotion”, …

• Solution: – use close patterns – micro-clustering: (hierarchical, one-pass)

• Jaccard Distance (γ: threshold to stop clustering):

||

||1),(

DD

DDppD

Page 15: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

15

Task2: Extract Context Indicators

A B E

C D E

A B F

C D E F

A B E F

Pattern AB

Non Sup = 60%

CI AB, EF, ABE..

Trans. ABE; ABEF

SSPs CD; …

P2: CD

P1: AB

Pn

:

Database Frequent Patterns

Semantic Annotations

Pattern CD

… …

Context Units<A, B, AB, C, D, CD, E, F, EF, AE, BF, … ABE, ABF,…, ABEF>

Context Unit Weighting

< 3.0, 0, … 2.0, … , 1.0, …>

AB 3.0EF 2.0ABE 1.0…

< AB, CD, … , EF, … ABE, …>

Page 16: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

16

Task3: Extract Representative Transactions

A B E

C D E

A B F

C D E F

A B E F

Pattern AB

Non Sup = 60%

CI AB, E, F, EF

Trans. ABEF; ABE

SSPs CD; …

P1: AB

Database Frequent Patterns

Semantic Annotations

…Pattern CD

… …

Context Units

3.0, 0, …,2.0, … , 1.0

< AB, CD, … , EF, … ABE, …>

1.0, 0, …,1.0, … , 1.0T1:

Semantic Similarity

T5 0.8T1 0.6T3 0.6…

T5:

Page 17: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

17

Task4: Extract Semantically Similar Patterns

A B E

C D E

A B F

C D E F

A B E F

Pattern AB

Non Sup = 60%

CI AB, E, F, EF

Trans. ABEF; ABE

SSPs CD; …

P1: AB

Database Frequent Patterns

Semantic Annotations

…Pattern CD

… …

Context Units

3.0, 0, …,2.0, … , 1.0

< AB, CD, … , EF, … ABE, …>

0, 3.0, …,2.0, … , 0.5

Semantic Similarity

CD 0.7BF 0.5EF 0.3…

AB:

Pk: EF

P2: CD

Page 18: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

18

Experiments

• Three different real world applications– Annotating DBLP title/authors Patterns– Motif/Gene-Ontology (GO) matching– Gene Synonyms extraction

• Study the effectiveness of the proposed SPA methods

• Explore applications of SPA to different real world tasks

Qiaozhu Mei
present each task as a special case of the big picture.a concrete example before each experiment
Page 19: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

19

Annotating DBLP Co-authorship and Title Pattern

Substructure Similarity Search in Graph Databases

X.Yan, P. Yu, J. Han

……

……

Database:

TitleAuthors

Frequent Patterns

P1: { x_yan, j_han }

Frequent Itemset

P2: “substructure search”

Frequent Sequential Pattern

Pattern { x_yan, j_han}

Non Sup = …

CI {p_yu}, graph pattern, …

Trans. gSpan: graph-base……

SSPs { j_wang }, {j_han, p_yu}, …

Semantic Annotations Context Units

< { p_yu, j_han}, { d_xin }, … , “graph pattern”, … “substructure similarity”, … >

Page 20: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

20

DBLP Results: Frequent Itemset

Context Indicator

(CI)

graph; {philip_yu}; mine close; graph pattern; index approach; sequential pattern; …

Representative

Transactions (Trans)

> gSpan: graph-base substructure pattern mining;> mining close relational graph connect constraint; …

Semantically Similar

Patterns (SSP)

{jiawei_han, philip_yu}; {jian_pei, jiawei_han};{jiong_yang, philip_yu, wei_wang}; …

Pattern= {xifeng_yan, jiawei_han}

Annotations:

Qiaozhu Mei
make analogy to dictionary entry. why this annotation is useful?interpret them. make sure people believe they are useful.
Page 21: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

21

DBLP Results: Freq. Seq. Pattern

Context Indicator

(CI)

{w_bruce_croft}; web information; full text; {monika_rauch_hezinger}; {james_p_callan}; …

Representative

Transactions (Trans)

> web information retrieval> language model information retrieval

Semantically Similar

Patterns (SSP)

information use; web information; probabilistic information; information filter; text information; …

Pattern= “Information … retrieval”

Annotations:

Qiaozhu Mei
make analogy to dictionary entry. why this annotation is useful?interpret them. make sure people believe they are useful.
Page 22: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

22

Motif-GO Matching

GO term 1

GO term 2

GO term 3

GO term 4

GO term 5

Sequence 1

Sequence 2

Sequence 3

motif1 motif2

motif2

motif2

motif3

motif4 motif5

motif2 ?

Motif: a subsequence pattern in the sequences

Gene Ontology (GO) terms: annotating the functionality of sequence, motifs

Page 23: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

23

Motif-GO Matching (Cont.)

GOTerm1; GOTerm2;GOTerm3

GOTerm3

……

Database:

GO termsProtein Sequence

Frequent Patterns

P2: GOTerm2

Single Item Pattern

Pattern Motif1

Non

CI GOTerm1, GOTerm3, …

Trans.

SSPs GOTerm1, GOTerm2, …

Semantic Annotations Context Units

< Motif1, Motif3, …, GOTerm1, GOTerm2, … >

P1: Motif1

Sequential Pattern

Motif 1

Motif-GO matching

Motif1

GOTerm1

GOTerm2

Page 24: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

24

Motif/GO Matching: Evaluation

• Gold standard generated by human experts• Measure: Mean reciprocal rank (MRR)

– Reflects ranking accuracy (the higher the better)– 1/Rank (0.5 means the correct answer is ranked as the 2nd )

• Results:

Mutual Information Co-occurrence

Random Selection 0.0023 0.0023

Context Indicators 0.5877 0.6064

SSPs 0.4017 0.4681

Weights for Context Units:

Ranking Strategy

Qiaozhu Mei
remove the formulaintuitively, 0.5 means...
Page 25: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

25

Gene Synonym Extraction

• Gene Synonyms:– A Sequential Pattern in the textual database

– Matching gene synonyms: a challenging and important new problem in mining biology data

– Analogy: thesaurus or synonyms in dictionary

Gene_id Gene Synonyms

FBgn0001000 female sterile 2 tekele; fs 2 sz 10; tek; fs 2 tek; tekele; …

Page 26: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

26

Gene Synonym Extraction (Cont.)

… D. melanogaster gene Female sterile (2) Tekele …

… Female sterile (2) Tekele , abbreviated as Fs(2)Tek …

Database:

Biomedical Sentences

Frequent Patterns

P1: female sterile (2) tekele

Sequential Pattern

Pattern female sterile (2) tekele

Non

CI

Trans.

SSPs Fs(2)Tek, female sterile, fs 2 sz 10, …

Semantic Annotations Context Units

< gene, female, …, d. melanogaster gene , … >

Matched Synonyms

female sterile (2) tekele

Fs(2)Tek

fs 2 sz 10female sterile …

P2: Fs(2)Tek

Sequential Pattern

Context Units: context units can be single words or sequential patterns

Page 27: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

27

Gene Synonym Extraction: Results

• Effective! MRR > 0.5• frequent pattern >>

single words• Micro-clustering is

useful

Running time: hierarchical Running time:

one-pass

MRR: hierarchicalMRR: one-pass

Page 28: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

28

Conclusions

• A novel problem: semantical pattern annotation• A structured annotation for frequent patterns• A general method based on context modeling• A general post-processing procedure of frequent

pattern mining on any types of pattern • Applicable to and effective for quite different

tasks• Future work:

– Tune for specific tasks– Better context unit weights, redundancy removal, etc

Qiaozhu Mei
general method to solve all kinds of patterns, specific information are useful for some problems...
Page 29: 1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University

29

Thanks and Questions