A News Classifier Using Evernote
Background
• Evernote
  • One of my favorite applications
  • I can clip interesting news and technical articles using Evernote
• I will show a web-based application utilizing Evernote
Brief Overview
• News Classifier
  • Divides news articles into 4 categories
    • Tech
      • General technical news
    • My Study
      • IT news relevant to me, such as programming, development, and computer architectures
    • Biz
      • Business news
    • Life
      • Other news, such as politics, economics, sports, and music
  • Displays high-score articles
• Training data
  • Notes clipped into my Evernote
  • The notes in my Evernote are basically divided into the above 4 folders
Naïve Bayes Classifier
• It divides documents into categories
• Given doc, it returns $\arg\max_{cat} P(cat \mid doc)$
• My application also returns $\max_{cat} P(cat \mid doc)$ for scoring
• The classifier calculates $P(cat \mid doc)$ using Bayes' theorem:
  $$P(cat \mid doc) = \frac{P(cat)\,P(doc \mid cat)}{\sum_{cat'} P(cat')\,P(doc \mid cat')}$$
• $P(cat)$ and $P(doc \mid cat)$ are calculated using the multinomial model
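The Bayes' theorem step can be sketched in a few lines of Ruby. This is a minimal illustration, not the application's code: the category names are reused from the overview, while the prior and likelihood values and the helper name `posteriors` are made up for the example.

```ruby
# Hypothetical toy values (not from the slides).
priors = { "Tech" => 0.5, "Life" => 0.5 }          # P(cat)
likelihoods = { "Tech" => 0.02, "Life" => 0.005 }  # P(doc|cat) for one fixed document

# Returns { cat => P(cat|doc) } using Bayes' theorem.
def posteriors(priors, likelihoods)
  # Numerator P(cat) * P(doc|cat) for every category.
  joint = priors.to_h { |cat, p| [cat, p * likelihoods.fetch(cat)] }
  # Denominator: sum over all categories cat' of P(cat') * P(doc|cat').
  evidence = joint.values.sum
  joint.transform_values { |j| j / evidence }
end

post = posteriors(priors, likelihoods)
# arg max gives the predicted category, max gives the score used for ranking.
best_cat, score = post.max_by { |_cat, p| p }
puts "#{best_cat}: #{score.round(3)}"
```

With these toy numbers the posterior for Tech is 0.01 / 0.0125 = 0.8, so the document is assigned to Tech with score 0.8.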
(Simplified) Multinomial Model
• A document is defined as a sequence of n words: $doc := (w_1, w_2, \ldots, w_n)$
• Suppose documents are generated by repeatedly picking one word: $P(doc \mid cat) = P(w_1 \mid cat)\,P(w_2 \mid cat) \cdots P(w_n \mid cat)$
[Figure: each category drawn as a bag of words. Category 1 holds W1, W1, W2, W3; Category 2 holds W2, W2, W3, W3.]
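As a rough illustration of the product formula, the sketch below uses the two word bags from the figure. Treating the relative frequency of a word in a bag as P(w|cat) is an assumption made for this example, and the helper names `word_prob` and `doc_likelihood` are hypothetical.

```ruby
# Word bags read off the slide's figure.
bags = {
  "Category 1" => %w[W1 W1 W2 W3],
  "Category 2" => %w[W2 W2 W3 W3]
}

# P(w|cat), assumed here to be the relative frequency of w in the category's bag.
def word_prob(bag, word)
  bag.count(word).to_f / bag.size
end

# P(doc|cat) = P(w_1|cat) * P(w_2|cat) * ... * P(w_n|cat)
def doc_likelihood(bag, doc)
  doc.reduce(1.0) { |acc, w| acc * word_prob(bag, w) }
end

doc = %w[W2 W3]
bags.each do |cat, bag|
  puts "#{cat}: #{doc_likelihood(bag, doc)}"
end
```

For the document (W2, W3) this gives 0.25 × 0.25 = 0.0625 under Category 1 and 0.5 × 0.5 = 0.25 under Category 2. Any document containing W1 gets probability 0 under Category 2, because W1 never appears in that bag.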
Estimating model parameters
• Simple method: $P(cat) = \frac{N_c}{N}$, $P(w \mid cat) = \frac{n_{w,c}}{n_c}$
  • $N_c$: number of docs in category c
  • $N$: total number of docs
  • $n_{w,c}$: number of occurrences of word w in the documents in category c
  • $n_c$: total number of words in the documents in category c
• Disadvantage: if $n_{w,c} = 0$, then $P(w \mid cat) = 0$, so $P(doc \mid cat) = P(w_1 \mid cat)\,P(w_2 \mid cat) \cdots P(w_n \mid cat) = 0$
• Improved method: smoothing
  • Consider a prior distribution for P(cat), P(w|cat): Dirichlet distribution (α=2)
  ⇒ $P(cat) = \frac{N_c + 1}{N + |cat|}$, $P(w \mid cat) = \frac{n_{w,c} + 1}{n_c + |W|}$
  • $|cat|$: number of categories
  • $|W|$: vocabulary size
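The smoothed estimates can be sketched as follows. This is a minimal example under stated assumptions: the training set is made-up toy data, `train` is a hypothetical helper name, and the returned hash layout is not taken from the application's classifier.json.

```ruby
# Toy training set of [category, words] pairs; hypothetical data.
training = [
  ["Tech", %w[cpu gpu cpu]],
  ["Life", %w[music sports]],
  ["Tech", %w[gpu compiler]]
]

# Estimates P(cat) and P(w|cat) with add-one smoothing.
def train(training)
  categories = training.map(&:first).uniq
  vocab = training.flat_map(&:last).uniq
  doc_counts = Hash.new(0)                              # N_c
  word_counts = Hash.new { |h, c| h[c] = Hash.new(0) }  # n_{w,c}
  training.each do |cat, words|
    doc_counts[cat] += 1
    words.each { |w| word_counts[cat][w] += 1 }
  end
  n = training.size                                     # N
  # P(cat) = (N_c + 1) / (N + |cat|)
  prior = categories.to_h { |c| [c, (doc_counts[c] + 1.0) / (n + categories.size)] }
  # P(w|cat) = (n_{w,c} + 1) / (n_c + |W|)
  likelihood = categories.to_h do |c|
    n_c = word_counts[c].values.sum
    [c, vocab.to_h { |w| [w, (word_counts[c][w] + 1.0) / (n_c + vocab.size)] }]
  end
  { prior: prior, likelihood: likelihood }
end

model = train(training)
puts model[:prior]["Tech"]
puts model[:likelihood]["Life"]["cpu"]
```

Here "cpu" never occurs in a Life document, yet its smoothed probability is (0 + 1) / (2 + 5) = 1/7 rather than 0, so a single unseen word no longer zeroes out P(doc|cat).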
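Software Architecture
[Figure: data flow between components]
• Training: Evernote → Evernote Crawler → notes.json (training data) → Bayes Classifier (Train) → classifier.json (model parameters)
• Classification: URL list → Articles Collector → articles.json (articles) → Bayes Classifier (Classify) → contents.json → Display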
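Crawling Evernote
• I used “Evernote Ruby API” to crawl Evernote
• Each note is saved with to_json:
  { "cat": "Tech", "title": "'90% of smartphones are iPhones', 'I don't buy paid stickers': asking a high-school-girl entrepreneur about smartphone habits", "content": "・・・" }
  • "cat": Category (Notebook)
  • "title": Title
  • "content": Content (XML)
• I used the “title” and the “content” as the training document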
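The record format above can be sketched as follows. The note is represented here by a plain hash standing in for whatever object the Evernote API returns; the keys `:notebook`, `:title`, `:content` and the helper names are assumptions for illustration, and no Evernote API call is shown.

```ruby
require "json"

# Hypothetical stand-in for a fetched note (not an Evernote API object).
notes = [
  { notebook: "Tech", title: "Example title", content: "<en-note>Example body</en-note>" }
]

# Maps a note to the record layout shown on the slide.
def note_to_record(note)
  { "cat" => note[:notebook], "title" => note[:title], "content" => note[:content] }
end

# The training document is the title followed by the content.
def training_document(record)
  "#{record["title"]} #{record["content"]}"
end

records = notes.map { |n| note_to_record(n) }
json = JSON.pretty_generate(records)  # what would be written to notes.json
puts json
puts training_document(records.first)
```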
Training
• Outline
  1. Parse XML content using the Nokogiri library
  2. Divide sentences into words using the MeCab library
  3. Calculate the model parameters
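The outline can be sketched without external dependencies. Note that this is a stand-in, not the slides' method: a regex replaces Nokogiri for stripping markup, and a simple word scan replaces MeCab, which is what Japanese text actually needs since it has no spaces between words. All helper names are hypothetical.

```ruby
# Stand-in for step 1: strip markup with a regex (the slides use Nokogiri).
def extract_text(xml)
  xml.gsub(/<[^>]+>/, " ").gsub(/\s+/, " ").strip
end

# Stand-in for step 2: scan for alphanumeric runs (the slides use MeCab).
def tokenize(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Input to step 3: per-category word counts n_{w,c}.
def count_words(notes)
  counts = Hash.new { |h, cat| h[cat] = Hash.new(0) }
  notes.each do |note|
    words = tokenize(extract_text("#{note["title"]} #{note["content"]}"))
    words.each { |w| counts[note["cat"]][w] += 1 }
  end
  counts
end

notes = [
  { "cat" => "Tech", "title" => "New CPU",
    "content" => "<en-note><div>The CPU is fast</div></en-note>" }
]
counts = count_words(notes)
puts counts["Tech"]["cpu"]
```

The resulting counts are the raw material for the parameter estimates; swapping in a real XML parser and a morphological analyzer would only change the first two helpers.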
Crawling web
• News sites
  • ITmedia
  • Yahoo! News
  • Impress Watch
  • TechCrunch
  • Gizmodo
  • Nikkan Sports
  • Nikkei BP
• I downloaded RSS files using Ruby and saved them as JSON files:
  { "title": "Unpatched vulnerability in IE 10; update to IE 11 as exploit attacks occur", "link": "http://rss.rssad.jp/rss/artclk/VlF3IIxoZHoi/・・・", "desc": "・・・" },
• I used the “title” and the “description” for classification
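A minimal sketch of turning saved records into classifier input is shown below. The field names follow the slide; the sample records, the example.com links, and the helper name `classification_text` are made up for illustration, and the RSS download itself is not shown.

```ruby
require "json"

# Hypothetical articles.json content; field names follow the slide.
articles_json = <<~EOS
  [
    { "title": "Example headline", "link": "http://example.com/a", "desc": "Short summary" },
    { "title": "Another headline", "link": "http://example.com/b", "desc": "" }
  ]
EOS

# The document to classify is the title plus the description, when present.
def classification_text(article)
  [article["title"], article["desc"]].compact.reject(&:empty?).join(" ")
end

articles = JSON.parse(articles_json)
docs = articles.map { |a| classification_text(a) }
puts docs
```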
Application
• Web-based
  • Ruby + Sinatra + jQuery Mobile
• Articles are sorted by $P(cat \mid doc)$
• The application displays the top 5 articles for each category
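The ranking step can be sketched as follows, assuming each classified article already carries its predicted category and its score max P(cat|doc). The record layout, the sample data, and the helper name `top_articles` are hypothetical.

```ruby
# Groups articles by category and keeps the highest-scoring ones in each group.
def top_articles(articles, limit = 5)
  articles
    .group_by { |a| a[:cat] }
    .transform_values { |list| list.sort_by { |a| -a[:score] }.first(limit) }
end

# Hypothetical classified articles.
articles = [
  { title: "T1", cat: "Tech", score: 0.91 },
  { title: "T2", cat: "Tech", score: 0.55 },
  { title: "T3", cat: "Tech", score: 0.99 },
  { title: "T4", cat: "Tech", score: 0.60 },
  { title: "T5", cat: "Tech", score: 0.75 },
  { title: "T6", cat: "Tech", score: 0.80 },
  { title: "T7", cat: "Tech", score: 0.52 },
  { title: "L1", cat: "Life", score: 0.70 }
]

top = top_articles(articles)
top.each do |cat, list|
  puts "#{cat}: #{list.map { |a| a[:title] }.join(', ')}"
end
```

With this data the Tech list is T3, T1, T6, T5, T4: the two lowest-scoring Tech articles are dropped, and Life shows its single article.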