Programming Collective Intelligence, Ch. 6: Document Filtering
Document Filtering: Programming Collective Intelligence, Ch. 6
허윤
Document Filtering
Filtering == Classification Problem
Data Mining Problem
Estimation, Classification, Prediction
Clustering, Description
Affinity Grouping
Document? A set of features -> text document, image, etc.
Spam Filtering
Binary Classification Problem
‘Spam’ or ‘Ham’
Techniques
Naïve Bayesian Classifier
Support Vector Machine
Decision Tree
Rule vs. Model
Spam Filtering in Practice
Source: Sahil Puri et al., “Comparison and Analysis of Spam Detection Algorithms”, 2013, IJAIEM
Source: Rene, “New insights into Gmail’s spam filtering”, 2012, emailmarketingtipps.de
Naïve Bayesian Classifier
Bayesian Classifier
Naïve?
Bayes’ theorem with a strong independence assumption
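Example
1. Probability that box A is chosen: P( A ) = 7 / 10
2. Probability of drawing a white ball from box A: P( white | A ) = 2 / 10
3. Probability of picking A from the bag and then drawing a white ball from box A
4. Probability of a white ball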
Example
A white ball turned up from somewhere…
Probability that it came from A? P( A | white )
Probability that it came from B? P( B | white )
P( A | white ) = ?
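The question above can be worked through numerically. The sketch below is illustrative only: P(A) = 7/10 and P(white | A) = 2/10 come from the example, while P(B) = 3/10 assumes A and B are the only two boxes, and P(white | B) = 5/10 is a made-up value, since the figure with box B's contents is not available here.

```python
from fractions import Fraction

# Values given in the example
p_a = Fraction(7, 10)               # P(A)
p_white_given_a = Fraction(2, 10)   # P(white | A)

# Assumptions (not stated in the slides)
p_b = 1 - p_a                       # P(B), assuming only boxes A and B exist
p_white_given_b = Fraction(5, 10)   # P(white | B), hypothetical value

# Joint probability: pick A, then draw white
p_a_and_white = p_a * p_white_given_a

# Total probability of drawing a white ball
p_white = p_a_and_white + p_b * p_white_given_b

# Posterior: given a white ball, which box did it come from?
p_a_given_white = p_a_and_white / p_white
p_b_given_white = 1 - p_a_given_white

print(p_a_given_white, p_b_given_white)
```

With these assumed numbers the white ball is slightly more likely to have come from B, even though A is chosen far more often, because A rarely yields a white ball.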
Bayes’ Rule
❶ Conditional prob. of A given B
❷ Conditional prob. of B given A
❸ Bayes’ rule
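The formulas for these three steps are not part of the transcript, so the standard identities are written out here as a reminder. Eliminating the joint probability P(A ∩ B) from the first two lines gives the third.

```latex
\begin{align}
P(A \mid B) &= \frac{P(A \cap B)}{P(B)} \\
P(B \mid A) &= \frac{P(A \cap B)}{P(A)} \\
P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B)}
\end{align}
```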
Implementation
Extracting words from document
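A minimal sketch of word extraction, assuming the features are distinct lowercase words split on non-word characters. The function name `get_words` and the length limits (3 to 19 characters) are illustrative choices, not taken from the slide.

```python
import re

def get_words(doc, min_len=3, max_len=19):
    """Return the set of distinct lowercase words found in doc."""
    # Split on runs of non-word characters (punctuation, whitespace)
    tokens = re.split(r"\W+", doc)
    # Drop very short and very long tokens; the set removes duplicates
    return {t.lower() for t in tokens if min_len <= len(t) <= max_len}

print(sorted(get_words("Quick money! Buy the pills, quick.")))
```

Returning a set means each word counts once per document, regardless of how often it appears.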
Implementation: Preparation
Representation of classifier
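One possible representation, sketched under the assumption that the classifier only needs two count tables: how often each feature appears per category, and how many documents each category has. The class name and the attribute names `fc`, `cc`, and `get_features` are illustrative.

```python
from collections import defaultdict

class Classifier:
    def __init__(self, get_features):
        # feature -> category -> number of documents containing the feature
        self.fc = defaultdict(lambda: defaultdict(int))
        # category -> number of documents seen in that category
        self.cc = defaultdict(int)
        # function that turns a document into a collection of features
        self.get_features = get_features

# Hypothetical usage with a trivial whitespace tokenizer
clf = Classifier(lambda doc: set(doc.lower().split()))
clf.fc["money"]["spam"] += 1
clf.cc["spam"] += 1
```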
How to access dict
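A sketch of helper functions for reading and updating the count tables, assuming they are plain nested dicts. The helper names are illustrative. The point is that lookups for unseen features or categories return 0 instead of raising `KeyError`.

```python
def incf(fc, feature, cat):
    """Increase the count of a feature/category pair."""
    fc.setdefault(feature, {})
    fc[feature][cat] = fc[feature].get(cat, 0) + 1

def incc(cc, cat):
    """Increase the document count of a category."""
    cc[cat] = cc.get(cat, 0) + 1

def fcount(fc, feature, cat):
    """Number of times a feature has appeared in a category (0 if unseen)."""
    return fc.get(feature, {}).get(cat, 0)

def catcount(cc, cat):
    """Number of documents in a category (0 if unseen)."""
    return cc.get(cat, 0)

def totalcount(cc):
    """Total number of documents across all categories."""
    return sum(cc.values())

fc, cc = {}, {}
incf(fc, "money", "spam")
incf(fc, "money", "spam")
incc(cc, "spam")
```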
Training
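Training amounts to counting. The sketch below assumes a whitespace tokenizer and `defaultdict` count tables; `train`, `get_features`, and the sample documents are all made up for illustration.

```python
from collections import defaultdict

def get_features(doc):
    # Simplistic tokenizer, used here only to keep the example short
    return set(doc.lower().split())

def train(fc, cc, doc, cat):
    """Record one labeled document in the count tables."""
    for feature in get_features(doc):
        fc[feature][cat] += 1   # feature seen once more in this category
    cc[cat] += 1                # one more document in this category

fc = defaultdict(lambda: defaultdict(int))
cc = defaultdict(int)
train(fc, cc, "buy cheap pills", "spam")
train(fc, cc, "meeting notes attached", "ham")
train(fc, cc, "cheap meeting room", "ham")
```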
Recall
Bayes’ Theorem
p( category | doc ) = p( doc | category ) * p( category ) / p( doc )
Implementation: Classifier
P( feature | category ), computed first as a building block
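A sketch of the basic estimate, assuming P( feature | category ) is the fraction of documents in the category that contain the feature. The function name `fprob` and the sample counts are illustrative.

```python
def fprob(fc, cc, feature, cat):
    """Fraction of documents in cat that contain feature."""
    n_docs = cc.get(cat, 0)
    if n_docs == 0:
        return 0.0  # no documents in this category yet
    return fc.get(feature, {}).get(cat, 0) / n_docs

# Hypothetical counts: "cheap" appears in 3 of 4 spam docs and 1 of 4 ham docs
fc = {"cheap": {"spam": 3, "ham": 1}}
cc = {"spam": 4, "ham": 4}
print(fprob(fc, cc, "cheap", "spam"), fprob(fc, cc, "cheap", "ham"))
```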
Assumed Probability to resolve data sparseness
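With little training data, a feature seen once in spam and never in ham would get P( feature | ham ) = 0, which is too extreme. One common remedy, sketched here, is a weighted average of the raw estimate and an assumed probability. The parameter names, the default weight of 1.0, and the default assumed probability of 0.5 are assumptions of this sketch.

```python
def weighted_prob(basic_prob, total_count, weight=1.0, assumed_prob=0.5):
    """Blend the raw estimate with an assumed probability.

    total_count is how many times the feature was seen across all categories;
    the more evidence there is, the less the assumed probability matters.
    """
    return (weight * assumed_prob + total_count * basic_prob) / (weight + total_count)

print(weighted_prob(0.0, 0))   # never seen: falls back to the assumed 0.5
print(weighted_prob(0.0, 1))   # seen once, never in this category: 0.25
print(weighted_prob(1.0, 1))   # seen once, always in this category: 0.75
```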
P( document | category ) document representation
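Under the naïve independence assumption, the probability of a whole document is the product of its per-feature probabilities. The sketch below takes those probabilities as a plain list. The log-space variant is an addition of this sketch, included because long products of small numbers can underflow.

```python
import math

def doc_prob(feature_probs):
    """P(document | category) as a product of P(feature | category) values."""
    return math.prod(feature_probs)

def log_doc_prob(feature_probs):
    """Same quantity in log space; requires every probability to be > 0."""
    return sum(math.log(p) for p in feature_probs)

probs = [0.5, 0.25, 0.8]   # hypothetical per-feature probabilities
print(doc_prob(probs))
```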
P( document | category ) * p( category )
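A sketch of the score used to compare categories. It assumes p( category ) is estimated as the category's share of the training documents, and it leaves out p( doc ) because that term is the same for every category. All numbers are made up.

```python
def category_score(doc_prob, cat_count, total_docs):
    """Unnormalized posterior: P(document | category) * P(category)."""
    prior = cat_count / total_docs
    return doc_prob * prior

# Hypothetical inputs: 4 spam and 6 ham documents in the training set
spam_score = category_score(0.1, 4, 10)
ham_score = category_score(0.02, 6, 10)
print(spam_score, ham_score)
```

The scores are not probabilities that sum to 1; only their relative sizes matter.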
Classifier
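A sketch of the final decision step, assuming per-category scores have already been computed. It picks the highest-scoring category, but only if that score beats every other score by a per-category threshold factor; otherwise it returns a default label. The function signature and the `"unknown"` label are illustrative.

```python
def classify(scores, thresholds=None, default="unknown"):
    """Return the best category, or default if the margin is too small."""
    thresholds = thresholds or {}
    if not scores:
        return default
    best = max(scores, key=scores.get)
    required = thresholds.get(best, 1.0)
    for cat, score in scores.items():
        if cat == best:
            continue
        # The winner must be at least `required` times larger than any rival
        if score * required > scores[best]:
            return default
    return best

scores = {"spam": 0.04, "ham": 0.012}
print(classify(scores))                   # spam
print(classify(scores, {"spam": 5.0}))    # unknown
```

A high threshold for spam makes the filter conservative: a message is only marked as spam when the evidence is strong, which reduces the risk of losing legitimate mail.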
Recall: Naïve Bayesian Classifier
First, p( document | category ) = p( feature_1 | category ) * p( feature_2 | category ) * … * p( feature_N | category )
Fisher’s Method
p( category | document ) ?
p( category | feature ) = ( # of documents having feature in category ) / ( # of documents having feature )
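A sketch of Fisher's method under stated assumptions. `category_given_feature` implements the ratio above. The combination step is not shown in this transcript; the version here follows the usual formulation, in which -2 times the sum of the log probabilities is passed through an inverse chi-square function with 2N degrees of freedom. All names are illustrative.

```python
import math

def category_given_feature(docs_with_feature_in_cat, docs_with_feature):
    """p(category | feature) as a ratio of document counts."""
    if docs_with_feature == 0:
        return 0.0
    return docs_with_feature_in_cat / docs_with_feature

def inv_chi2(chi, df):
    """Upper-tail probability of a chi-square value for an even df."""
    m = chi / 2.0
    total = term = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_prob(probs):
    """Combine per-feature probabilities (each must be > 0) into one score."""
    score = -2.0 * sum(math.log(p) for p in probs)
    return inv_chi2(score, 2 * len(probs))

print(category_given_feature(3, 4))
print(fisher_prob([0.9, 0.8]), fisher_prob([0.1, 0.2]))
```

Features that point strongly toward the category yield a combined score near 1, while consistently weak features pull it toward 0.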
Q&A
Thank You