
A News Classifier Using Evernote

Atsushi KOIKE Sokendai

Feb. 18, 2014


Background

• Evernote
  • One of my favorite applications
  • I can clip interesting news and technical articles using Evernote

I will show a web-based application utilizing Evernote


Brief Overview

• News Classifier
  • Divides news articles into 4 categories
    • Tech: general technical news
    • My Study: IT news relevant to me, such as programming, development, and computer architecture
    • Biz: business news
    • Life: other news, such as politics, economics, sports, and music
  • Displays high-score articles
• Training data
  • Notes clipped into my Evernote
  • The notes in my Evernote are basically divided into the above 4 folders


Naïve Bayes Classifier

• It divides documents into categories
  • Given doc, it returns arg max_cat P(cat | doc)
  • My application also returns max_cat P(cat | doc) for scoring
• The classifier calculates P(cat | doc) using Bayes' theorem:

  P(cat | doc) = P(cat) P(doc | cat) / Σ_cat' P(cat') P(doc | cat')

• P(cat) and P(doc | cat) are calculated using the multinomial model
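A minimal sketch of this decision rule, assuming Ruby (the language used elsewhere in the slides); it is illustrative, not the author's actual code. `prior[cat]` stands for P(cat) and `likelihood[cat][w]` for P(w | cat); the product of word probabilities is the multinomial P(doc | cat) defined on the next slide.

  # Sketch of Naive Bayes scoring; `prior` and `likelihood` are assumed hashes
  # (see the parameter-estimation sketch later for how they could be built).
  def doc_likelihood(words, cat, likelihood)
    # P(doc | cat) = P(w1 | cat) * P(w2 | cat) * ... * P(wn | cat)
    words.reduce(1.0) { |p, w| p * likelihood[cat].fetch(w, 1.0) }  # unknown words ignored here
  end

  def posterior(words, prior, likelihood)
    # Bayes' theorem: P(cat | doc) = P(cat) P(doc | cat) / Σ_cat' P(cat') P(doc | cat')
    joint = prior.keys.map { |cat| [cat, prior[cat] * doc_likelihood(words, cat, likelihood)] }.to_h
    evidence = joint.values.sum
    joint.transform_values { |p| p / evidence }
  end

  def classify(words, prior, likelihood)
    scores = posterior(words, prior, likelihood)
    cat, score = scores.max_by { |_, p| p }   # arg max / max of P(cat | doc)
    { category: cat, score: score }
  end

In practice one would sum log probabilities instead of multiplying raw probabilities to avoid floating-point underflow on long documents.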


(Simplified) Multinomial Model

• A document is defined as a sequence of n words:

  doc := (w1, w2, ..., wn)

• Suppose documents are generated by repeatedly picking up one word:

  P(doc | cat) = P(w1 | cat) P(w2 | cat) ... P(wn | cat)

[Figure: each category drawn as a bag of words, e.g. Category 1 = {W1, W1, W2, W3}, Category 2 = {W2, W2, W3, W3}]
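A toy illustration of this generative story (my addition, not from the slides): each category is a distribution over words, and a document is produced by drawing n words independently from it. The word distributions below are made-up placeholders.

  # Hypothetical per-category word distributions P(w | cat)
  WORD_DIST = {
    'Category 1' => { 'W1' => 0.5, 'W2' => 0.25, 'W3' => 0.25 },
    'Category 2' => { 'W2' => 0.5, 'W3' => 0.5 }
  }

  def sample_word(dist)
    r = rand
    dist.each { |word, p| return word if (r -= p) <= 0 }
    dist.keys.last   # guard against floating-point rounding
  end

  def generate_doc(cat, n)
    Array.new(n) { sample_word(WORD_DIST[cat]) }
  end

  p generate_doc('Category 1', 4)   # e.g. ["W1", "W3", "W1", "W2"]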


Estimating model parameters

• Simple method:

  P(cat) = Nc / N,   P(w | cat) = nw,c / nc

  Nc   : num of docs in category c
  N    : num of docs in total
  nw,c : num of occurrences of word w in documents in category c
  nc   : num of total words in documents in category c

  • Disadvantage: if nw,c = 0 then P(w | cat) = 0, so
    P(doc | cat) = P(w1 | cat) P(w2 | cat) ... P(wn | cat) = 0

• Improved method: smoothing
  • Consider a prior distribution for P(cat) and P(w | cat): a Dirichlet distribution (α = 2)

  ⇒ P(cat) = (Nc + 1) / (N + |cat|),   P(w | cat) = (nw,c + 1) / (nc + |W|)

  |cat| : num of categories
  |W|   : num of vocabulary (vocabulary size)
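A minimal sketch of these smoothed estimates in Ruby, assuming `docs` is an array of { cat: "Tech", words: [...] } hashes built from the crawled notes (an assumed structure, not the author's code). It produces the `prior` and `likelihood` hashes used in the earlier classification sketch.

  def train(docs)
    doc_count   = Hash.new(0)                               # Nc
    word_count  = Hash.new { |h, c| h[c] = Hash.new(0) }    # nw,c
    total_words = Hash.new(0)                               # nc

    docs.each do |d|
      doc_count[d[:cat]] += 1
      d[:words].each do |w|
        word_count[d[:cat]][w] += 1
        total_words[d[:cat]]   += 1
      end
    end

    categories = doc_count.keys
    vocab = word_count.values.flat_map(&:keys).uniq
    n = docs.size

    # Add-one (Laplace) smoothing, i.e. a Dirichlet prior with alpha = 2
    prior = categories.map { |c| [c, (doc_count[c] + 1.0) / (n + categories.size)] }.to_h
    likelihood = categories.map do |c|
      probs = vocab.map { |w| [w, (word_count[c][w] + 1.0) / (total_words[c] + vocab.size)] }.to_h
      [c, probs]
    end.to_h

    { prior: prior, likelihood: likelihood, vocab: vocab }
  end

Note that a word in the vocabulary that never occurs in category c still gets the nonzero estimate 1 / (nc + |W|), which is exactly what the smoothing is for.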


Software Architecture

  Evernote --[Evernote Crawler]--> notes.json (training data)
  URL list --[Articles Collector]--> articles.json (articles)
  notes.json --[Bayes Classifier: Train]--> classifier.json (model parameters)
  articles.json + classifier.json --[Bayes Classifier: Classify]--> contents.json
  contents.json --> Display
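A hypothetical driver wiring these stages together; the file names come from the diagram, while `words_from_note`, `train`, and `classify` are the illustrative helpers sketched on the neighboring slides, not the real code.

  require 'json'

  # Bayes Classifier: Train
  notes = JSON.parse(File.read('notes.json'))                          # from the Evernote Crawler
  train_docs = notes.map { |n| { cat: n['cat'], words: words_from_note(n['content']) } }
  model = train(train_docs)
  File.write('classifier.json', JSON.pretty_generate(model))

  # Bayes Classifier: Classify
  articles = JSON.parse(File.read('articles.json'))                    # from the Articles Collector
  contents = articles.map do |a|
    text = "#{a['title']} #{a['desc']}"                                # title + description
    words = text.split(' ')                                            # simplified; MeCab in the real pipeline
    a.merge(classify(words, model[:prior], model[:likelihood]))
  end
  File.write('contents.json', JSON.pretty_generate(contents))          # read by the Display layer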


Crawling Evernote

• I used the "Evernote Ruby API" to crawl Evernote
• Each note is converted (to_json) into a record like:

  {
    "cat": "Tech",
    "title": "「スマホの9割はiPhone」「有料スタンプは買わない」――女子高生起業家に聞くスマホ事情",
    "content": "・・・"
  }

  "cat"     : category (the Evernote notebook)
  "title"   : title
  "content" : content (XML)

• I used the "title" and the "content" as the training document
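A rough sketch of how notes.json could be written with Ruby's standard JSON library; the actual Evernote Ruby API calls are not shown in the slides, so they are stubbed out here and the record below is a placeholder.

  require 'json'

  # Stub: in the real crawler these records come from the Evernote Ruby API,
  # one per clipped note, with the notebook name used as the category label.
  notes = [
    { cat: 'Tech', title: '...', content: '<en-note>...</en-note>' }
  ]

  File.write('notes.json', notes.to_json)   # produces records like the example above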


Training

• Outline (steps 1 and 2 are sketched below)
  1. Parse XML content using the Nokogiri library
  2. Divide sentences into words using the MeCab library
  3. Calculate the model parameters
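A minimal sketch of steps 1 and 2, assuming the nokogiri gem and the SWIG-based MeCab Ruby binding ("-Owakati" makes MeCab return a space-separated word list); this is an assumed implementation, not the author's code.

  require 'nokogiri'
  require 'MeCab'

  def words_from_note(xml_content)
    text = Nokogiri::XML(xml_content).text      # step 1: strip the ENML/XML markup
    tagger = MeCab::Tagger.new('-Owakati')      # step 2: split Japanese text into words
    tagger.parse(text).split(' ')
  end

Step 3 then counts these words per category, as on the parameter-estimation slide above.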


Crawling the web

• News sites
  • IT Media
  • Yahoo! News
  • Impress Watch
  • TechCrunch
  • Gizmodo
  • Nikkan Sports
  • Nikkei BP
• I downloaded RSS feeds using Ruby and saved them as JSON files:

  {
    "title": "IE 10に未解決の脆弱性、悪用攻撃の発生でIE 11に更新を",
    "link": "http://rss.rssad.jp/rss/artclk/VlF3IIxoZHoi/・・・",
    "desc": "・・・"
  }

• I used the "title" and the "desc" (description) for classification
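A rough sketch of that RSS step using only Ruby's standard rss and open-uri libraries (assumed code; FEEDS is a placeholder for the feed URLs of the sites listed above).

  require 'rss'
  require 'open-uri'
  require 'json'

  FEEDS = ['http://example.com/rss.xml']   # placeholder feed URLs

  items = FEEDS.flat_map do |url|
    feed = RSS::Parser.parse(URI.open(url).read, false)   # false: skip strict validation
    feed.items.map do |item|
      { title: item.title, link: item.link, desc: item.description }
    end
  end

  File.write('articles.json', items.to_json)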


Application

• Web-based
  • Ruby + Sinatra + jQuery Mobile
• Articles are sorted by P(cat | doc)
• The application displays the top 5 articles for each category
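A minimal sketch of how such a Sinatra front end could look (assumed code; the real application and its templates are not shown in the slides, and the "category"/"score" fields follow the earlier driver sketch).

  require 'sinatra'
  require 'json'

  CATEGORIES = ['Tech', 'My Study', 'Biz', 'Life']

  get '/' do
    articles = JSON.parse(File.read('contents.json'))     # classified articles with scores
    @top5 = CATEGORIES.map do |cat|
      picks = articles.select { |a| a['category'] == cat }
                      .sort_by { |a| -a['score'] }         # sort by P(cat | doc), descending
                      .first(5)
      [cat, picks]
    end.to_h
    erb :index   # hypothetical jQuery Mobile template rendering @top5
  end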


Discussion

• Most of the displayed articles matched my interests
• Some articles were classified into the wrong categories
  • Possible reasons:
    • No stemming
    • Few stop words
    • Is it appropriate to use P(cat | doc) as the score?
