julia text mining_inmobi

26
Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model Hands-on Introduction to Text Mining with Julia - A Mathematical approach Abhijith Chandraprabhu April 19, 2014 Abhijith Chandraprabhu Julia Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Upload: abhijith-chandraprabhu

Post on 27-Jan-2015

118 views

Category:

Technology


0 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Hands-on Introduction to Text Mining with Julia- A Mathematical approach

Abhijith Chandraprabhu

April 19, 2014

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 2: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Overview

Introduction

Preprocessing

VSM Model

Query and Performance Modeling

LSI Model

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 3: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

About todays session

I We will be dealing with an age old technique(1999).

I Old wine in a New bottle!I The emphasis is on understanding the math behind it.

Vectors, Matrices

Dimension, Space

Projection, Matrix factorization

SVD

I Hands on session.

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 4: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

“For almost a decade the computational linguistics community hasviewed large text collections as a resource to be tapped in order toproduce better text analysis algorithms. In this paper, I haveattempted to suggest a new emphasis: the use of large online textcollections to discover new facts and trends about the world itself.I suggest that to make progress we do not need fully artificialintelligent text analysis; rather, a mixture ofcomputationally-driven and user-guided analysis may open the doorto exciting new results.”Untangling Text Data Mining, Marti A. Hearst

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 5: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Text Mining

I Type of Information retrieval, where we try to extract relevantinformation from huge collection of textual data(documents).

I Documents: web pages, biomedical literature, movie reviews,research articles etc.

I Non-Semantic

I Based Vector-Space model.

I Applications : Web Search Engines, Biomedical InformationRetrieval etc.

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 6: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Document1It’s also important to point out that even though these arrays are generic, they’re not boxed: an Int8 array will takeup much less memory than an Int64 array, and both will be laid out as continuous blocks of memory; Julia can dealseamlessly and generically with these different immediate types as well as pointer types like String.

Document2To celebrate some of the amazing work that’s already been done to make Julia usable for day-to-day data analysis,I’d like to give a brief overview of the state of statistical programming in Julia. There are now several packagesthat, taken as a whole, suggest that Julia may really live up to its potential and become the next generationlanguage for data analysis.

Document3Only later did I realize what makes Julia different from all the others. Julia breaks down the second wall — the wallbetween your high-level code and native assembly. Not only can you write code with the performance of C in Julia,you can take a peek behind the curtain of any function into its LLVM Intermediate Representation as well as itsgenerated assembly code — all within the REPL. Check it out.

Document4Homoiconicity — the code can be operated on by other parts of the code. Again, R kind of has this too! Kind of,because I’m unaware of a good explanation for how to use it productively, and R’s syntax and scoping rules make ittricky to pull off. But I’m still excited to see it in Julia, because I’ve heard good things about macros and I’d like toappreciate them.

Document5Graphics. One of the big advantages of R over similar languages is the sophisticated graphics available in ggplot2or lattice. Work is underway on graphics models for Julia but, again, it is early days still.

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 7: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Term-Document Matrix

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Query 1 Query 2Array 1 0 0 0 0 1 0

continous block 1 0 0 0 0 0 0Julia 1 3 3 1 1 1 1types 1 0 0 0 0 1 0string 1 0 0 0 0 0 0

amazing 0 1 0 0 0 0 0data analysis 0 2 0 0 0 0 1

statistical computing 0 1 0 0 0 0 1high-level 0 0 1 0 0 0 0

performance 0 0 1 0 0 0 0LLVM 0 0 1 0 0 0 0

homoiconicity 0 0 0 1 0 0 0R 0 0 0 2 0 0 0

syntax 0 0 0 4 0 0 0macros 0 0 0 1 0 0 0

graphics 0 0 0 0 3 0 0advantages 0 0 0 0 1 0 0

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 8: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Terminologies explained:

Corpus : Structured set of text, obtained by preprocessing rawtextual data.

Lexicon : Distinctive collection of all the words/terms in thecorpus.

Document, Query : Bag of words/terms, represented as vector.

Keyword, term : Elements of Lexicon.

Term Frequency : Frequency of occurence of a term in aDocument.

TDM : A matrix, with term frequencies as entries across allDocuments.

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 9: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Major Steps in Text Mining

1. Preprocess the documents to form a corpus.

2. Identify the lexicon from the corpus.

3. Form the Term-Document matrix (Numericized Text).

4. Apply VSM, LSI, K-Means to measure the proximity betweenall the documents and the query.

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 10: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Preprocessing using TextAnalysis.jl package

I Raw input Text is usually stream of characters. Need toconvert this to stream of terms(basic processing units).

I The first step would be to create Documents which iscollection of terms.

I Four types of documents can be created in Julia.

I A FileDocument which represents the files on disk

I str=“Julia is a high-level, high-performance dynamicprogramming language for technical computing”

sd = StringDocument(str)

td = TokenDocument(str)

nd = NGramDocument(str)

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 11: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Preprocessing

Most of the times we have huge volumes of unstructured data, withcontent not carrying any useful information w.r.t Text Mining.

I HTML tags

I Numbers

I Stop words

I Prepositions, Articles, Pronouns,

In Julia, we can remove these uneccessary using the functions,

remove_articles!(), remove_indefinite_articles!()

remove_definite_articles!(), remove_pronouns!()

remove_prepositions!(), remove_stop_words!()

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 12: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Stemming

I have a different opinionI differ with your opinionI opine differently

I In Information Retrieval, morphological variants ofwords/terms carrying the same semantic information adds toredundancy.

I Stemming linguistically normalizes the lexicon of a corpus.

stem!(Corpus)

stem!(Document) #sd, td or nd

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 13: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Vector-Space Model

T1

T2

T3

D1

D2

D3

0

I Documents are vectors inthe Term-Document Space

I The elements of the vectorare the weights1, wij ,corresponding to Documenti and term j

I The weights are thefrequencies of the terms inthe documents.

I Proximity of documentscalculated by the cosine ofthe angle between them.

aRefer Weighting Schemes

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 14: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Term-Document Space

I Let Dj , be a collection of i documents.

I T = t1, t2, t3, ..., ti , is the Lexicon set.

I wji , is the frequency of occurence of the term j in document i .

I Document, dj = [w1j ,w2j ,w3j , ...,wij ].

I TDM = [d1 d2 d3 ... dj ]ijI dj ∈ Ri×j

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 15: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Weighting Schemes

I Associating each occurence of a term with a weight thatrepresents its relevance with respect to the meaning of thedocument it appears in. []

I Longer documents do not always carry more informativecontent (or more relevant content wrt a query).

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 16: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

I Binary Scheme : wij = 1, if ti occurs in dj , 0 else not.

I Term Frequency (TF) Scheme : wij = fij , i.e. the number oftimes ti occurs in document dj .

I Term Frequency - Inverse Document Frequency (TF-IDF)

Scheme : Term Frequency tfij =fij

max[f1j ,f2j ,f3j ...,fij ]

Inverse-Document frequency, idfi = log Ndfi

,N: number of documents and dfi : Number of documents inwhich the term ti occurs.wij = tfij × idfi .

m=DocumentTermMatrix(crps)

tf_idf(m)

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 17: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Query Matching

I Finding the relevant documents for a query, q.

I Cosine Distance Measure is used.

I cos(θ) =qTdj

‖q‖2‖dj‖2, where θ is the angle between the query q

and document dj .

I The documents for which cos(θ) > tol, are consideredrelevant, where tol, is the predefined tolerance.

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 18: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Performance ModelingI The tol, decides the number of documents returned.I With low tol, more documents are returned.I But chances of the documents being irrelevant increases.I Ideally we need higher number of documents returned and

majority of the returned documents to be relevant.I Precision, P = Dr

Dt, where Dr is the number of relevant

documents retrieved, and Dt is the total number ofdocuments retrieved.

I Recall, R = DrNr

, where Nr is the total number of relevantdocuments in the database.

VSMModel(QueryNum::Int,A::Array{Float64,2},NumQueries::Int)

This function is used to obtain Dr , Dt & Nr for any specific queryusing Vector Space model.

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 19: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

0.2 0.4 0.6 0.8Tolerance

0

5

10

15

20

25

30

35

40No

. of D

ocum

ents

Query10

DrDtNr

0.2 0.4 0.6 0.8Tolerance

0

20

40

60

80

No. o

f Doc

umen

ts

Query13

DrDtNr

0.2 0.4 0.6 0.8Tolerance

0

50

100

150

200

No. o

f Doc

umen

ts

Query6

DrDtNr

0.2 0.4 0.6 0.8Tolerance

0

5

10

15

20

25

30

35

No. o

f Doc

umen

ts

Query23

DrDtNr

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 20: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Latent Semantic Indexing

I There exists underlying latent semantic structure in the data.

I We can identify this structure through SVD.

I Project the data onto two lower dimensional spaces.

I These are the term space and the document space.

I Dimension reduction is achieved through truncated SVD.

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 21: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Singular Value Decomposition

Theorem (SVD)

For any matrix A ∈ Rm×n, with m > n, there exists twoorthogonal matrices U = (u1, . . . , um) ∈ Rm×m &V = (v1, . . . , vn) ∈ Rn×n such that

A = U

(Σ0

)V T ,

where Σ ∈ Rn×n is diagonal matrix, i.e., Σ = (σ1, ...., σn), withσ1 ≥ σ2 ≥ .... ≥ σn ≥ 0. σ1, . . . , σn are called the singular valuesof A. Columns of U & V are called the right and left singularvectors of A respectively.

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 22: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Low Rank Matrix Approximation using SVD

A =n∑

i=1

σiuivTi ≈

k∑i=1

σiuivTi =: Ak k < r

I The above approximation is based on the Eckart-Youngtheorem.

I It helps in removal of noise, solving ill-conditioned problems,and mainly in dimension reduction of data.

I Using the below function examine the effect of rank reduction.

I Why is SVD good enough to decompose the TDM?

svdRedRank(A,k)

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 23: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

A

m

n

≈ m

U

k ≤ n

k ≤ n

S

k

k ≤ n

V

n

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 24: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

VSMModel(A::Array{Float64,2},nq::Int)

SVDModel(A::Array{Float64,2},nq::Int,rank::Int)

I The above functions returns the Recall and Precision usingthe Vector space model and the LSI model.

I The first qNum vectors are the queries and the rest are thedocument vectors, of matrix A.

I For the LSI(SVDModel), we need to pass in the rank also as aparameter.

Use the functions shown below to view the Precision Vs Recall forVSM and LSI.

plotNew_RecPrec()

plotAdd_RecPrec()

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 25: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

10 20 30 40 50 60 70 80 90Recall

30

40

50

60

70

Prec

isio

n

Precision Vs Recall

VSMLSIKM

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach

Page 26: Julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Thank you.

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach