![Page 1: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/1.jpg)
Classification and clustering methods development and implementation for unstructured documents collections
by Osipova Nataly
St. Petersburg State University, Faculty of Applied Mathematics and Control Processes, Department of Programming Technology
![Page 2: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/2.jpg)
Contents
- Introduction
- Methods description
- Information Retrieval System
- Experiments
![Page 3: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/3.jpg)
Contextual Document Clustering
was developed in a joint project of the Faculty of Applied Mathematics and Control Processes, St. Petersburg State University, and the Northern Ireland Knowledge Engineering Laboratory (NIKEL), University of Ulster.
![Page 4: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/4.jpg)
Definitions
- Document
- Terms dictionary
- Dictionary
- Cluster
- Word context
- Context or document conditional probability distribution
- Entropy
![Page 5: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/5.jpg)
Document conditional probability distribution
Document x

| y | word1 | word2 | word3 | … | wordn |
|---|---|---|---|---|---|
| tf(y) | 5 | 10 | 6 | … | 16 |
| p(y\|x) | 5/m | 10/m | 6/m | … | 16/m |

- y – words
- tf(y) – frequency of y in document x
- p(y|x) – conditional probability of y in document x
- m – size of document x

(5/m, 10/m, 6/m, …, 16/m) – document conditional probability distribution
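The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `document_distribution` and the toy document are assumptions, and the document size m is taken to be the number of word occurrences.

```python
from collections import Counter


def document_distribution(words):
    """Return p(y|x) = tf(y) / m for each distinct word y of one document."""
    tf = Counter(words)          # tf(y): occurrences of each word
    m = len(words)               # m: document size in word occurrences (assumed)
    return {y: count / m for y, count in tf.items()}


# Toy document with tf(word1)=5, tf(word2)=10, tf(word3)=6, so m = 21.
doc_x = ["word1"] * 5 + ["word2"] * 10 + ["word3"] * 6
p_x = document_distribution(doc_x)
```

The resulting values sum to 1, as a probability distribution should.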
![Page 6: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/6.jpg)
Word context
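Word w occurs in documents x1, x2, …, xk.

Document x1

| y | word1 | word2 | … | wordn1 |
|---|---|---|---|---|
| tf(y) | 5 | 10 | … | 16 |
| p(y\|x1) | 5/m1 | 10/m1 | … | 16/m1 |

Document x2

| y | word1 | word3 | … | wordn2 |
|---|---|---|---|---|
| tf(y) | 7 | 12 | … | 4 |
| p(y\|x2) | 7/m2 | 12/m2 | … | 4/m2 |

…

Document xk

| y | word1 | word4 | … | wordnk |
|---|---|---|---|---|
| tf(y) | 20 | 9 | … | 3 |
| p(y\|xk) | 20/mk | 9/mk | … | 3/mk |

Context of word w

| y | word1 | word2 | word3 | … | wordnk |
|---|---|---|---|---|---|
| tf(y) | 5+7+20=32 | 10 | 12 | … | 3 |
| p(y\|w) | 32/m | 10/m | 12/m | … | 3/m |

(32/m, 10/m, 12/m, …, 3/m) – context conditional probability distribution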
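A minimal sketch of how a word context distribution could be computed, assuming that the context of w is the set of documents containing w and that m is the total size of those documents. The name `context_distribution` and the toy data are hypothetical.

```python
from collections import Counter


def context_distribution(word, documents):
    """Return p(y|w): term frequencies pooled over all documents that
    contain `word`, divided by the total size m of those documents."""
    pooled = Counter()
    for doc in documents:
        if word in doc:              # only documents from the context of `word`
            pooled.update(doc)
    m = sum(pooled.values())         # assumed: m = m1 + m2 + ... + mk
    return {y: count / m for y, count in pooled.items()}


docs = [
    ["w"] + ["word1"] * 5 + ["word2"] * 10,   # contains w, size 16
    ["w"] + ["word1"] * 7 + ["word3"] * 12,   # contains w, size 20
    ["word1"] * 3,                            # does not contain w, ignored
]
p_w = context_distribution("w", docs)
```

Here tf(word1) in the context is 5 + 7 = 12 and m = 36, mirroring the summation shown in the table.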
![Page 7: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/7.jpg)
Contents
- Introduction
- Methods description
- Information Retrieval System
- Experiments
![Page 8: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/8.jpg)
Methods
- document clustering method
- dictionary build methods
- document classification method using a training set

Information retrieval methods:
- keyword search method
- cluster based search method
- similar documents search method
![Page 9: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/9.jpg)
Contextual Documents Clustering
[Diagram of the clustering pipeline. Stages shown: Documents, Dictionary, Narrow context words, Distances calculation, Clusters]
![Page 10: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/10.jpg)
Entropy
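$$H(y) = H(p_1, \dots, p_n) = -\sum_{i=1}^{n} p_i \log(p_i)$$

where $(p_1, p_2, \dots, p_n)$ is the context conditional probability distribution of word y, and $p_1 + p_2 + \dots + p_n = 1$.

Entropy is an uncertainty measure; here it is used to characterize the commonness (narrowness) of a word context.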
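As an illustration of the entropy measure, the sketch below computes it for a few small distributions. The natural logarithm is an assumption (the slide does not state the base), and terms with zero probability are skipped by the usual convention 0 · log 0 = 0.

```python
import math


def entropy(p):
    """Shannon entropy of a discrete distribution given as a list of
    probabilities; zero-probability terms contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


narrow = entropy([0.97, 0.01, 0.01, 0.01])   # mass concentrated on one word
broad = entropy([0.25, 0.25, 0.25, 0.25])    # mass spread evenly over four words
```

A context concentrated on few words gives a low value (narrow context), while a uniform distribution over n words gives the largest possible value, log n.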
![Page 11: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/11.jpg)
Contextual Document Clustering
max H(y) = H( [figure: distribution histogram] )
![Page 12: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/12.jpg)
Entropy
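[Plot: entropy of a two-point distribution with $p_1 = \alpha$, $p_2 = 1 - \alpha$ as a function of α from 0 to 1; the peak value, log 2, is at α = 0.5]

$$H([p_1, p_2]) = -(p_1 \log p_1 + p_2 \log p_2)$$

[Figure: three example distributions with their entropies H( )]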
![Page 13: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/13.jpg)
Word Context - Document Distance
- $p_1$ – y context conditional probability distribution
- $p_2$ – document x conditional probability distribution
- $\bar{p} = \frac{1}{2} p_1 + \frac{1}{2} p_2$ – average conditional probability distribution
![Page 14: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/14.jpg)
Word Context - Document Distance
$$JS[p_1, p_2] = H\left(\tfrac{1}{2} p_1 + \tfrac{1}{2} p_2\right) - 0.5\,H(p_1) - 0.5\,H(p_2)$$
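The distance above can be sketched as follows. Distributions are represented as dictionaries from words to probabilities, so a word context and a document may cover different vocabularies; the function names and the natural logarithm are assumptions.

```python
import math


def entropy(p):
    """Shannon entropy of a distribution given as {word: probability}."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def js_divergence(p1, p2):
    """JS[p1, p2] = H(average) - 0.5 * H(p1) - 0.5 * H(p2)."""
    words = set(p1) | set(p2)
    # Average distribution; a word missing from one side has probability 0 there.
    avg = {y: 0.5 * p1.get(y, 0.0) + 0.5 * p2.get(y, 0.0) for y in words}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)


context = {"court": 0.5, "family": 0.3, "child": 0.2}
document = {"court": 0.4, "family": 0.4, "law": 0.2}
d = js_divergence(context, document)
```

The value is 0 for identical distributions and reaches log 2 when the two distributions share no words.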
![Page 15: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/15.jpg)
Jensen-Shannon divergence
$$JS_{\{\frac{1}{2},\frac{1}{2}\}}[p_1, p_2] \ge 0$$

$$JS_{\{\frac{1}{2},\frac{1}{2}\}}[p_1, p_2] = 0 \iff p_1 = p_2$$
![Page 16: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/16.jpg)
Dictionary construction
Why:
- big volumes: 60,000 documents, 50,000 words => 15,000 words in a context
- importance of words with narrow contexts
![Page 17: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/17.jpg)
Dictionary construction
Delete words with:
1. High or low frequency
2. High or low document frequency
3. Both 1 and 2
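A rough sketch of rule 3 (filtering on both frequencies) is given below. The function name `build_dictionary`, the inclusive bounds and the threshold values are illustrative assumptions, not the settings of the original system.

```python
from collections import Counter


def build_dictionary(documents, tf_min, tf_max, df_min, df_max):
    """Keep words whose collection frequency tf and document frequency df
    both lie within the given inclusive bounds (assumed convention)."""
    tf = Counter()
    df = Counter()
    for doc in documents:
        tf.update(doc)           # every occurrence counts towards tf
        df.update(set(doc))      # each document counts once towards df
    return {y for y in tf
            if tf_min <= tf[y] <= tf_max and df_min <= df[y] <= df_max}


docs = [["the", "cat", "the"], ["the", "dog"], ["the", "cat", "rare"]]
dictionary = build_dictionary(docs, tf_min=2, tf_max=3, df_min=2, df_max=2)
```

In this toy collection "the" is dropped as too frequent and "dog" and "rare" as too infrequent, leaving only "cat".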
![Page 18: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/18.jpg)
Retrieval algorithms
- keyword search method
- cluster based search method
- search by example method
![Page 19: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/19.jpg)
Keyword search method
- Document 1: word 1, word 2, word 3, …, word n1
- Document 2: word 10, word 25, word 30, …, word n2
- Document 3: word 15, word 2, word 32, …, word n3
- Document 4: word 11, word 21, word 3, …, word n4

Request: word 2
Result set: document 1, document 3
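The example above can be reproduced with a small inverted index. This is a sketch in Python for illustration only (the system itself is described as using a database and a C# client); `build_index` and `keyword_search` are hypothetical names.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, words in documents.items():
        for word in words:
            index[word].add(name)
    return index


def keyword_search(index, word):
    """Return the names of all documents containing `word`, sorted."""
    return sorted(index.get(word, set()))


documents = {
    "document 1": ["word 1", "word 2", "word 3"],
    "document 2": ["word 10", "word 25", "word 30"],
    "document 3": ["word 15", "word 2", "word 32"],
    "document 4": ["word 11", "word 21", "word 3"],
}
result = keyword_search(build_index(documents), "word 2")
```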
![Page 20: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/20.jpg)
Cluster based search method
Cluster context words (each cluster is linked to its documents):
- Cluster 1: word 1, word 2, …, word n1
- Cluster 2: word 12, word 26, …, word n2
- Cluster 3: word 1, word 23, …, word n3

Request: word 1
Result set: Cluster 1, Cluster 3
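A minimal sketch of the same lookup at cluster level, assuming each cluster is described by the set of its context words; the function name and data layout are hypothetical.

```python
def cluster_search(cluster_contexts, word):
    """Return the names of clusters whose context words include `word`."""
    return sorted(name for name, context in cluster_contexts.items()
                  if word in context)


cluster_contexts = {
    "Cluster 1": {"word 1", "word 2"},
    "Cluster 2": {"word 12", "word 26"},
    "Cluster 3": {"word 1", "word 23"},
}
clusters = cluster_search(cluster_contexts, "word 1")
```

The documents attached to the returned clusters would then form the answer shown to the user.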
![Page 21: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/21.jpg)
Similar documents search
[Diagram: a cluster, labelled with the cluster name, whose documents (document 1 to document 7) are connected by a minimal spanning tree]

Request: document 3
Result set: document 6, document 7
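One way to realize this search is to build a minimal spanning tree over the pairwise distances inside a cluster and return the tree neighbours of the requested document. The sketch below uses Prim's algorithm; returning direct tree neighbours is an assumption about how the result set is formed, and the positions and names are toy data.

```python
def minimum_spanning_tree(nodes, dist):
    """Prim's algorithm. `dist(a, b)` returns the distance between two nodes.
    Returns the tree as an adjacency dict {node: set of neighbours}."""
    nodes = list(nodes)
    tree = {n: set() for n in nodes}
    if not nodes:
        return tree
    in_tree = {nodes[0]}
    while len(in_tree) < len(nodes):
        # Cheapest edge that leaves the part of the tree built so far.
        a, b = min(((u, v) for u in in_tree for v in nodes if v not in in_tree),
                   key=lambda edge: dist(*edge))
        tree[a].add(b)
        tree[b].add(a)
        in_tree.add(b)
    return tree


def similar_documents(tree, doc):
    """Documents adjacent to `doc` in the spanning tree."""
    return sorted(tree[doc])


# Toy cluster: documents placed on a line, distance = absolute difference.
position = {"d1": 0.0, "d2": 1.0, "d3": 2.5, "d4": 2.6}
tree = minimum_spanning_tree(position, lambda a, b: abs(position[a] - position[b]))
neighbours = similar_documents(tree, "d3")
```

In the actual setting the distance function would be a divergence between document distributions rather than a difference of positions.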
![Page 22: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/22.jpg)
Document classification: method 1
[Diagram elements: Clusters; List of topics; Training set; Topics contexts; Distances between topics and clusters contexts; Test documents]

Classification result:
- cluster 1 – topic 10
- cluster 2 – topic 3
- …
- cluster n – topic 30
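The step "distances between topics and clusters contexts" could look like the sketch below, which assigns every cluster to the topic whose context is nearest in Jensen-Shannon divergence. How topic contexts are derived from the training set is not shown on the slide, so here they are simply given as word distributions; all names and numbers are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy of a distribution given as {word: probability}."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def js_divergence(p1, p2):
    """Jensen-Shannon divergence with weights 1/2, 1/2."""
    words = set(p1) | set(p2)
    avg = {y: 0.5 * p1.get(y, 0.0) + 0.5 * p2.get(y, 0.0) for y in words}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)


def classify_clusters(cluster_contexts, topic_contexts):
    """Map each cluster to the topic with the smallest divergence."""
    return {
        cluster: min(topic_contexts,
                     key=lambda topic: js_divergence(context, topic_contexts[topic]))
        for cluster, context in cluster_contexts.items()
    }


topic_contexts = {
    "topic 3": {"court": 0.5, "family": 0.5},
    "topic 10": {"water": 0.6, "pipe": 0.4},
}
cluster_contexts = {
    "cluster 1": {"water": 0.7, "pipe": 0.2, "river": 0.1},
    "cluster 2": {"family": 0.6, "court": 0.3, "child": 0.1},
}
assignment = classify_clusters(cluster_contexts, topic_contexts)
```

Test documents would then inherit the topic of the cluster they belong to.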
![Page 23: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/23.jpg)
Document classification: method 2

[Diagram elements: Clusters; Topics list; Training set; Test documents; All documents set]

Classification result:
- cluster 1 – topic 10
- cluster 2 – topic 3
- …
- cluster n – topic 30
![Page 24: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/24.jpg)
Contents
- Introduction
- Methods description
- Information Retrieval System
- Experiments
![Page 25: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/25.jpg)
Information Retrieval System
- Architecture
- Features
- Use
![Page 26: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/26.jpg)
Information Retrieval System architecture
- database server
- client
![Page 27: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/27.jpg)
IRS architecture
[Diagram: Data Base; Data Base Server (MS SQL Server 2000); Local Area Network; "thick" client (C#)]
![Page 28: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/28.jpg)
IRS architecture
DBMS MS SQL Server 2000:
- High-performance
- Scalable
- Secure
- Handles huge volumes of data
- T-SQL
- Stored procedures
![Page 29: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/29.jpg)
IRS features
In the IRS the following problems are solved:
- document clustering
- keyword search
- cluster based search
- similar documents search
- document classification using a training set
![Page 30: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/30.jpg)
DB structure
The database of the IRS consists of the following tables:
- documents
- all words dictionary
- dictionary
- table of relations between documents and words: document-word
- words contexts
- words with narrow contexts
- clusters
- intermediate tables used to build the main tables and to implement retrieval
![Page 31: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/31.jpg)
Algorithms implementation

[Diagram elements: Documents; All words dictionary; Dictionary; Table "document-word"; Words contexts; Words with narrow contexts; Clusters; Centroid; Keyword search; Cluster based search; Similar documents search]
![Page 32: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/32.jpg)
Similar documents search

[Diagram: document1 to document5 inside a cluster, connected by edges labelled with distances: 0.16285, 0.98154, 0.57231, 0.23851, 0.26967, 0.211, 0.8731, 0.7231, 0.1011]
![Page 33: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/33.jpg)
Minimal Spanning Tree

[Diagram: the documents of a cluster (document 1 to document 5), labelled with the cluster name, connected by a minimal spanning tree]
![Page 34: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/34.jpg)
Similar documents search

[Diagram elements: Clusters table; Distances table; Tree table; Similar documents search]
![Page 35: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/35.jpg)
IRS use
![Page 36: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/36.jpg)
IRS use
![Page 37: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/37.jpg)
IRS use
![Page 38: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/38.jpg)
IRS use
![Page 39: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/39.jpg)
IRS use
![Page 40: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/40.jpg)
IRS use
![Page 41: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/41.jpg)
Contents
- Introduction
- Methods description
- Information Retrieval System
- Experiments
![Page 42: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/42.jpg)
Experiments
Test goals were:
- algorithm accuracy test
- comparison of different classification methods
- algorithm efficiency evaluation
![Page 43: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/43.jpg)
Experiments
- 60,000 documents
- 100 topics
- Training set volume = 5% of the collection size
![Page 44: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/44.jpg)
Experiments
Frequency bounds for dictionary words:
- document frequency df(y): between 2 and 1000
- term frequency tf(y): between 5 and 1000
![Page 45: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/45.jpg)
Result analysis
- Russian Information Retrieval Evaluation Seminar
- Measures such as macro-averaged recall, precision and F-measure were calculated.
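For reference, macro-averaged measures can be computed as in the sketch below. It assumes one topic label per document and takes the macro F-measure as the mean of per-category F-measures; the evaluation seminar may have used a different convention, and the labels are toy data.

```python
def macro_scores(true_labels, predicted_labels):
    """Return (macro precision, macro recall, macro F-measure)."""
    categories = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    precisions, recalls, f_measures = [], [], []
    for c in categories:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f_measures.append(f)
    n = len(categories)
    return sum(precisions) / n, sum(recalls) / n, sum(f_measures) / n


true_labels = ["a", "a", "b", "b"]
predicted_labels = ["a", "b", "b", "b"]
macro_p, macro_r, macro_f = macro_scores(true_labels, predicted_labels)
```

Because every category contributes equally to the average, small categories weigh as much as large ones.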
![Page 46: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/46.jpg)
Recall
[Bar chart: recall by system, for textan and the other systems (labelled "xxxx"); vertical axis from 0 to 0.6]
![Page 47: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/47.jpg)
Precision
[Bar chart: precision by system, for textan and the other systems (labelled "xxxx"); vertical axis from 0 to 0.7]
![Page 48: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/48.jpg)
F-measure
[Bar chart: F-measure by system, for textan and the other systems (labelled "xxxx"); vertical axis from 0 to 0.35]
![Page 49: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/49.jpg)
Result analysis
Some of the topics into which the test documents were classified:

| № | Category |
|---|---|
| 1 | Family law |
| 2 | Inheritance law |
| 3 | Water industry |
| 4 | Catering |
| 5 | Inhabitants’ consumer services |
| 6 | Rent truck |
| 7 | International law of the space |
| 8 | Territory in international law |
| 9 | Off-economic relations fellows |
| 10 | Off-economic dealerships |
| 11 | Economy free trade zones. Customs unions. |
![Page 50: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/50.jpg)
Result analysis
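Recall results for every category. The best result for each category is shown in bold. All results are in percent.

| System | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| textan | 33 | 34 | 35 | **60** | 46 | 26 | 27 | **98** | **75** | 25 | **100** |
| xxxx | 1 | 0 | 0.2 | 3 | 4 | 0 | 0.9 | 0 | 3 | 0 | 2 |
| xxxx | 0 | 0 | 4.3 | 2.3 | 0 | 5 | 0.9 | 8 | 3 | 0 | 0.8 |
| xxxx | **55** | **86** | **75** | 19 | **59** | **51** | **80** | 0 | 41 | **82** | 0 |
| xxxx | 21 | 39 | 2 | 22 | 15 | 6 | 0 | 1.4 | 0 | 5 | 0 |
| xxxx | 40 | 43 | 16 | 11 | 25 | 23 | 10 | 1.4 | 1.2 | 5 | 0 |
| xxxx | 23 | 4 | 2.5 | 1.1 | 18 | 7 | 0.9 | 0 | 1.2 | 10 | 0 |
| xxxx | 2.7 | 0 | 0 | 0 | 1.5 | 0 | 0 | 0 | 0 | 0 | 0 |
| xxxx | 2.2 | 0 | 0 | 0 | 1.5 | 0 | 0 | 0 | 0 | 0 | 0 |
| xxxx | 37 | 21 | 12 | 22 | 18 | 27 | 51 | 0 | 0 | 0 | 0 |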
![Page 51: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.site/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/51.jpg)
Thank you for your attention!