8/2/2019 Hammouda Webcast May21
1/12
Text Mining: Fast Phrase-based Text Indexing and Matching
Khaled Hammouda, Ph.D. Student
PAMI Research Group
University of Waterloo
Waterloo, Ontario, Canada
LORNET Theme 4
-
The Problem
Information Source: Web / LOR
Text Documents, Web Documents, Discussion Articles, ...
Automatic Clustering/Grouping
Programming Languages
Database Systems
Pattern Recognition
Data Mining
How do we judge similarity?
-
Clustering Documents
Group Similar Documents Together
Maximize intra-cluster similarity
Minimize inter-cluster similarity
Need to accurately calculate document similarity
[Figure: three document clusters, showing intra-cluster similarity within a cluster and inter-cluster similarity between clusters]
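The two objectives can be made concrete with a small sketch. It is only an illustration: `overlap_sim` is a hypothetical word-overlap similarity and the two toy clusters are assumptions, not material from the talk. A good grouping should give a higher average similarity inside a cluster than between clusters.

```python
from itertools import combinations, product


def overlap_sim(a: str, b: str) -> float:
    """Hypothetical similarity: shared words over all distinct words."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


def intra_cluster_similarity(cluster):
    """Average similarity over all document pairs inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(overlap_sim(a, b) for a, b in pairs) / len(pairs)


def inter_cluster_similarity(c1, c2):
    """Average similarity over all pairs with one document from each cluster."""
    pairs = list(product(c1, c2))
    return sum(overlap_sim(a, b) for a, b in pairs) / len(pairs)


# Toy clusters (assumed for illustration only).
rafting = ["river rafting trips", "mild river rafting"]
fishing = ["fishing trips", "booking fishing trips"]

intra = intra_cluster_similarity(rafting)
inter = inter_cluster_similarity(rafting, fishing)
print(intra, inter)
```

Both averages depend entirely on the similarity function, which is why an accurate document similarity measure matters.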
-
Document Similarity
How similar is each document to every other document?
Very time consuming!
$O(n^2)$
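A short sketch shows where the quadratic cost comes from: a naive approach visits every unordered pair of documents, which is n(n-1)/2 comparisons. The function name `count_pairwise_comparisons` is hypothetical.

```python
from itertools import combinations


def count_pairwise_comparisons(n_docs: int) -> int:
    """Count the document pairs a naive all-pairs comparison must visit."""
    return sum(1 for _ in combinations(range(n_docs), 2))


for n in (10, 100, 1000):
    # Ten times more documents means roughly a hundred times more pairs.
    print(n, count_pairwise_comparisons(n))
```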
-
Document Similarity
Information Theoretic Measure (Dekang98):
$$\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
How do we intersect every pair of documents without sacrificing efficiency?
What features should we intersect?
Words
Phrases
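The choice of features can be illustrated with a minimal sketch. It assumes the measure is read as plain set overlap (shared features divided by all distinct features), and it treats a phrase as a run of two consecutive words; the helper names and the two toy documents are assumptions for illustration.

```python
def word_features(doc):
    """Set of distinct words across all sentences of a document."""
    return {w for sentence in doc for w in sentence.split()}


def phrase_features(doc, n=2):
    """Set of distinct n-word phrases (contiguous word runs) in a document."""
    feats = set()
    for sentence in doc:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            feats.add(tuple(words[i:i + n]))
    return feats


def sim(a, b):
    """Overlap of two feature sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy documents, each a list of sentences (assumed for illustration).
doc1 = ["river rafting", "mild river rafting", "river rafting trips"]
doc2 = ["wild river adventures", "river rafting vacation plan"]

word_sim = sim(word_features(doc1), word_features(doc2))
phrase_sim = sim(phrase_features(doc1), phrase_features(doc2))
print(word_sim, phrase_sim)
```

Phrase features are stricter than word features: two documents must share words in the same order to score, so the phrase-based value here is lower than the word-based one.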
-
Fast Phrase-based Document Indexing and Matching
Document Index Graph Structure
A model based on a digraph representation of the phrases in the document set
Nodes correspond to unique terms
Edges maintain phrase representation
A phrase is a path in the graph
The model is an inverted list (terms → documents)
Nodes carry term weight information for each document in which they appear
Shared phrases can be matched efficiently
Phrase-based Features
Phrases: more informative feature than individual words
Local context matching
Represent sentences rather than words
Facilitate phrase-matching between documents
Achieves accurate document pair-wise similarity
Avoid high-dimensionality of vector space model
Allow incremental processing
Document 1
river rafting
mild river rafting
river rafting trips
Document 2
wild river adventures
river rafting vacation plan
Document 3
fishing trips
fishing vacation plan
booking fishing trips
river fishing
[Figure: Document Index Graph with nodes mild, wild, river, rafting, adventures, booking, fishing, trips, vacation, plan]
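The graph idea can be sketched in a few lines. This is a simplified illustration, not the structure used in the talk: nodes are unique terms, a directed edge links each word to the word that follows it in a sentence, and both record the documents they came from. The names `build_graph` and `docs_containing_path` are hypothetical.

```python
from collections import defaultdict

documents = {
    1: ["river rafting", "mild river rafting", "river rafting trips"],
    2: ["wild river adventures", "river rafting vacation plan"],
    3: ["fishing trips", "fishing vacation plan",
        "booking fishing trips", "river fishing"],
}


def build_graph(docs):
    """Return (nodes, edges): term -> doc ids, and (term, next_term) -> doc ids."""
    nodes = defaultdict(set)
    edges = defaultdict(set)
    for doc_id, sentences in docs.items():
        for sentence in sentences:
            words = sentence.split()
            for w in words:
                nodes[w].add(doc_id)
            for u, v in zip(words, words[1:]):
                edges[(u, v)].add(doc_id)
    return nodes, edges


def docs_containing_path(phrase, edges):
    """Candidate documents: those owning every edge along the phrase's path."""
    words = phrase.split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return set()
    result = set(edges.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= edges.get(pair, set())
    return result


nodes, edges = build_graph(documents)
print(len(nodes), len(edges))
print(docs_containing_path("river rafting", edges))
print(docs_containing_path("vacation plan", edges))
```

Because this sketch stores only document ids on each edge, a document that owns every edge of a path is a candidate rather than a confirmed match; confirming that the words are contiguous in one sentence requires sentence and position information on the edges.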
-
Document Index Graph
Document 1
river rafting
mild river rafting
river rafting trips
Document 2
wild river adventures
river rafting vacation plan
Document 3
fishing trips
fishing vacation plan
booking fishing trips
river fishing
[Figure: Document Index Graph after adding each document; nodes: mild, wild, river, rafting, adventures, booking, fishing, trips, vacation, plan]
- river rafting
- river
- vacation plan
- river
- trips
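Matching while indexing can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the talk: each edge remembers the sentence and position where its word pair occurs, and a new sentence is walked edge by edge to extend runs shared with earlier documents. The class `PhraseIndex` and its methods are hypothetical, and the sketch reports only shared phrases of two or more words, so single-word matches such as "river" or "trips" are not covered.

```python
from collections import defaultdict


class PhraseIndex:
    """Simplified incremental index: edges remember where each word pair occurs."""

    def __init__(self):
        # (word, next_word) -> doc_id -> set of (sentence_no, position_of_word)
        self.edges = defaultdict(lambda: defaultdict(set))

    def add_document(self, doc_id, sentences):
        """Match against earlier documents, then index; return {doc: phrases}."""
        matches = defaultdict(set)
        for sentence in sentences:
            for other, phrase in self._match_sentence(sentence.split()):
                matches[other].add(phrase)
        # Index only after matching so a document never matches itself.
        for s_no, sentence in enumerate(sentences):
            words = sentence.split()
            for pos, pair in enumerate(zip(words, words[1:])):
                self.edges[pair][doc_id].add((s_no, pos))
        return dict(matches)

    def _match_sentence(self, words):
        """Yield (doc, phrase) for maximal shared runs of two or more words."""
        # (doc, sentence_no, position of last matched word) -> start index here
        active = {}
        for i, pair in enumerate(zip(words, words[1:])):
            extended = {}
            for other, places in self.edges.get(pair, {}).items():
                for s_no, pos in places:
                    # Continue a run ending at this spot, or start a new one.
                    start = active.pop((other, s_no, pos), i)
                    extended[(other, s_no, pos + 1)] = start
            # Runs that could not be extended end at word i.
            for (other, _, _), start in active.items():
                yield other, " ".join(words[start:i + 1])
            active = extended
        for (other, _, _), start in active.items():
            yield other, " ".join(words[start:])


index = PhraseIndex()
first = index.add_document(
    1, ["river rafting", "mild river rafting", "river rafting trips"])
second = index.add_document(
    2, ["wild river adventures", "river rafting vacation plan"])
third = index.add_document(
    3, ["fishing trips", "fishing vacation plan",
        "booking fishing trips", "river fishing"])
print(first, second, third)
```

Each call touches only the edges of the incoming document, which is what makes incremental processing possible: earlier documents are never rescanned.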
-
Phrase-based Document Indexing
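Document Index Graph (internal structure)
[Figure: node "river" with outgoing edges e0 (to rafting), e1 (to fishing), e2 (to adventures)]
Document Table (node "river")
doc   TF        ET
1     {0,0,3}   e0: s1(1), s2(2), s3(1)
2     {0,0,2}   e0: s2(1); e2: s1(2)
3     {0,0,1}   e1: s4(1)
Edge Tables
An entry sN(P) records that the edge is followed in sentence N, with the source term at position P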
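The internal structure can be sketched as one document table per node, where each row holds term frequencies (TF) and an edge table (ET) of sentence and position entries. This is a minimal sketch under assumptions: the three TF slots are treated as weight levels with plain text counted in the last slot, and edges are keyed by the next term rather than by labels such as e0. The names `DocEntry` and `build_document_tables` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DocEntry:
    """One row of a node's document table."""
    # Term counts per weight level (assumed: last slot is plain text).
    tf: list = field(default_factory=lambda: [0, 0, 0])
    # Next term -> list of (sentence number, position of this term).
    edge_table: dict = field(default_factory=lambda: defaultdict(list))


def build_document_tables(docs, level=2):
    """Return term -> {doc_id: DocEntry}; sentences and positions are 1-based."""
    tables = defaultdict(dict)
    for doc_id, sentences in docs.items():
        for s_no, sentence in enumerate(sentences, start=1):
            words = sentence.split()
            for pos, word in enumerate(words, start=1):
                entry = tables[word].setdefault(doc_id, DocEntry())
                entry.tf[level] += 1
                if pos < len(words):  # the last word has no outgoing edge
                    entry.edge_table[words[pos]].append((s_no, pos))
    return tables


documents = {
    1: ["river rafting", "mild river rafting", "river rafting trips"],
    2: ["wild river adventures", "river rafting vacation plan"],
    3: ["fishing trips", "fishing vacation plan",
        "booking fishing trips", "river fishing"],
}

tables = build_document_tables(documents)
river = tables["river"]
for doc_id, entry in sorted(river.items()):
    print(doc_id, entry.tf, dict(entry.edge_table))
```

Storing positions on the edges is what allows a matcher to check that two consecutive edges belong to the same sentence at adjacent positions, rather than merely to the same document.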
[Figure: Document Index Graph (size scalability)]
[Figure: Document Index Graph (time performance)]
-
Effect of using phrase-based similarity over individual words
[Figure: Effect of using phrase similarity (F-measure)]
[Figure: Effect of using phrase similarity (Entropy)]
-
Applications
Grouping search engine results on-the-fly (incremental processing)
Creating taxonomies of documents (Yahoo! and Open Directory style)
Implementing "Find Related" or "Find Similar" features of information retrieval systems
Automatic generation of descriptive phrases about a set of documents (i.e., labeling clusters)
Detecting plagiarism
-
Collaboration
Provide Data Mining services (primarily text mining) for other groups
Opportunity for collaboration with U of Saskatchewan: I-Help Discussion System
Course Delivery Tools
Others are welcome
-
Questions
Instant Messaging
MSN Messenger: [email protected]
E-mail