julia text mining_inmobi

Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model

Hands-on Introduction to Text Mining with Julia- A Mathematical approach

Abhijith Chandraprabhu

April 19, 2014

Abhijith Chandraprabhu Julia

Hands-on Introduction to Text Mining with Julia - A Mathematical approach


Overview

Introduction

Preprocessing

VSM Model

Query and Performance Modeling

LSI Model




About todays session

I We will be dealing with an age old technique(1999).

I Old wine in a New bottle!I The emphasis is on understanding the math behind it.

Vectors, Matrices

Dimension, Space

Projection, Matrix factorization

SVD

I Hands on session.




“For almost a decade the computational linguistics community hasviewed large text collections as a resource to be tapped in order toproduce better text analysis algorithms. In this paper, I haveattempted to suggest a new emphasis: the use of large online textcollections to discover new facts and trends about the world itself.I suggest that to make progress we do not need fully artificialintelligent text analysis; rather, a mixture ofcomputationally-driven and user-guided analysis may open the doorto exciting new results.”Untangling Text Data Mining, Marti A. Hearst




Text Mining

I Type of Information retrieval, where we try to extract relevantinformation from huge collection of textual data(documents).

I Documents: web pages, biomedical literature, movie reviews,research articles etc.

I Non-Semantic

I Based Vector-Space model.

I Applications : Web Search Engines, Biomedical InformationRetrieval etc.




Document1It’s also important to point out that even though these arrays are generic, they’re not boxed: an Int8 array will takeup much less memory than an Int64 array, and both will be laid out as continuous blocks of memory; Julia can dealseamlessly and generically with these different immediate types as well as pointer types like String.

Document2To celebrate some of the amazing work that’s already been done to make Julia usable for day-to-day data analysis,I’d like to give a brief overview of the state of statistical programming in Julia. There are now several packagesthat, taken as a whole, suggest that Julia may really live up to its potential and become the next generationlanguage for data analysis.

Document3Only later did I realize what makes Julia different from all the others. Julia breaks down the second wall — the wallbetween your high-level code and native assembly. Not only can you write code with the performance of C in Julia,you can take a peek behind the curtain of any function into its LLVM Intermediate Representation as well as itsgenerated assembly code — all within the REPL. Check it out.

Document4Homoiconicity — the code can be operated on by other parts of the code. Again, R kind of has this too! Kind of,because I’m unaware of a good explanation for how to use it productively, and R’s syntax and scoping rules make ittricky to pull off. But I’m still excited to see it in Julia, because I’ve heard good things about macros and I’d like toappreciate them.

Document5Graphics. One of the big advantages of R over similar languages is the sophisticated graphics available in ggplot2or lattice. Work is underway on graphics models for Julia but, again, it is early days still.




Term-Document Matrix

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Query 1 Query 2Array 1 0 0 0 0 1 0

continous block 1 0 0 0 0 0 0Julia 1 3 3 1 1 1 1types 1 0 0 0 0 1 0string 1 0 0 0 0 0 0

amazing 0 1 0 0 0 0 0data analysis 0 2 0 0 0 0 1

statistical computing 0 1 0 0 0 0 1high-level 0 0 1 0 0 0 0

performance 0 0 1 0 0 0 0LLVM 0 0 1 0 0 0 0

homoiconicity 0 0 0 1 0 0 0R 0 0 0 2 0 0 0

syntax 0 0 0 4 0 0 0macros 0 0 0 1 0 0 0

graphics 0 0 0 0 3 0 0advantages 0 0 0 0 1 0 0




Terminologies explained:

Corpus : Structured set of text, obtained by preprocessing rawtextual data.

Lexicon : Distinctive collection of all the words/terms in thecorpus.

Document, Query : Bag of words/terms, represented as vector.

Keyword, term : Elements of Lexicon.

Term Frequency : Frequency of occurence of a term in aDocument.

TDM : A matrix, with term frequencies as entries across allDocuments.




Major Steps in Text Mining

1. Preprocess the documents to form a corpus.

2. Identify the lexicon from the corpus.

3. Form the Term-Document matrix (Numericized Text).

4. Apply VSM, LSI, K-Means to measure the proximity betweenall the documents and the query.




Preprocessing using TextAnalysis.jl package

I Raw input Text is usually stream of characters. Need toconvert this to stream of terms(basic processing units).

I The first step would be to create Documents which iscollection of terms.

I Four types of documents can be created in Julia.

I A FileDocument which represents the files on disk

I str=“Julia is a high-level, high-performance dynamicprogramming language for technical computing”

sd = StringDocument(str)

td = TokenDocument(str)

nd = NGramDocument(str)




Preprocessing

Most of the times we have huge volumes of unstructured data, withcontent not carrying any useful information w.r.t Text Mining.

I HTML tags

I Numbers

I Stop words

I Prepositions, Articles, Pronouns,

In Julia, we can remove these uneccessary using the functions,

remove_articles!(), remove_indefinite_articles!()

remove_definite_articles!(), remove_pronouns!()

remove_prepositions!(), remove_stop_words!()




Stemming

I have a different opinionI differ with your opinionI opine differently

I In Information Retrieval, morphological variants ofwords/terms carrying the same semantic information adds toredundancy.

I Stemming linguistically normalizes the lexicon of a corpus.

stem!(Corpus)

stem!(Document) #sd, td or nd




Vector-Space Model

T1

T2

T3

D1

D2

D3

0

I Documents are vectors inthe Term-Document Space

I The elements of the vectorare the weights1, wij ,corresponding to Documenti and term j

I The weights are thefrequencies of the terms inthe documents.

I Proximity of documentscalculated by the cosine ofthe angle between them.

aRefer Weighting Schemes




Term-Document Space

I Let Dj , be a collection of i documents.

I T = t1, t2, t3, ..., ti , is the Lexicon set.

I wji , is the frequency of occurence of the term j in document i .

I Document, dj = [w1j ,w2j ,w3j , ...,wij ].

I TDM = [d1 d2 d3 ... dj ]ijI dj ∈ Ri×j




Weighting Schemes

I Associating each occurence of a term with a weight thatrepresents its relevance with respect to the meaning of thedocument it appears in. []

I Longer documents do not always carry more informativecontent (or more relevant content wrt a query).




I Binary Scheme : wij = 1, if ti occurs in dj , 0 else not.

I Term Frequency (TF) Scheme : wij = fij , i.e. the number oftimes ti occurs in document dj .

I Term Frequency - Inverse Document Frequency (TF-IDF)

Scheme : Term Frequency tfij =fij

max[f1j ,f2j ,f3j ...,fij ]

Inverse-Document frequency, idfi = log Ndfi

,N: number of documents and dfi : Number of documents inwhich the term ti occurs.wij = tfij × idfi .

m=DocumentTermMatrix(crps)

tf_idf(m)




Query Matching

I Finding the relevant documents for a query, q.

I Cosine Distance Measure is used.

I cos(θ) =qTdj

‖q‖2‖dj‖2, where θ is the angle between the query q

and document dj .

I The documents for which cos(θ) > tol, are consideredrelevant, where tol, is the predefined tolerance.




Performance ModelingI The tol, decides the number of documents returned.I With low tol, more documents are returned.I But chances of the documents being irrelevant increases.I Ideally we need higher number of documents returned and

majority of the returned documents to be relevant.I Precision, P = Dr

Dt, where Dr is the number of relevant

documents retrieved, and Dt is the total number ofdocuments retrieved.

I Recall, R = DrNr

, where Nr is the total number of relevantdocuments in the database.

VSMModel(QueryNum::Int,A::Array{Float64,2},NumQueries::Int)

This function is used to obtain Dr , Dt & Nr for any specific queryusing Vector Space model.




0.2 0.4 0.6 0.8Tolerance

0

5

10

15

20

25

30

35

40No

. of D

ocum

ents

Query10

DrDtNr

0.2 0.4 0.6 0.8Tolerance

0

20

40

60

80

No. o

f Doc

umen

ts

Query13

DrDtNr

0.2 0.4 0.6 0.8Tolerance

0

50

100

150

200

No. o

f Doc

umen

ts

Query6

DrDtNr

0.2 0.4 0.6 0.8Tolerance

0

5

10

15

20

25

30

35

No. o

f Doc

umen

ts

Query23

DrDtNr




Latent Semantic Indexing

I There exists underlying latent semantic structure in the data.

I We can identify this structure through SVD.

I Project the data onto two lower dimensional spaces.

I These are the term space and the document space.

I Dimension reduction is achieved through truncated SVD.




Singular Value Decomposition

Theorem (SVD)

For any matrix A ∈ Rm×n, with m > n, there exists twoorthogonal matrices U = (u1, . . . , um) ∈ Rm×m &V = (v1, . . . , vn) ∈ Rn×n such that

A = U

(Σ0

)V T ,

where Σ ∈ Rn×n is diagonal matrix, i.e., Σ = (σ1, ...., σn), withσ1 ≥ σ2 ≥ .... ≥ σn ≥ 0. σ1, . . . , σn are called the singular valuesof A. Columns of U & V are called the right and left singularvectors of A respectively.




Low Rank Matrix Approximation using SVD

A =n∑

i=1

σiuivTi ≈

k∑i=1

σiuivTi =: Ak k < r

I The above approximation is based on the Eckart-Youngtheorem.

I It helps in removal of noise, solving ill-conditioned problems,and mainly in dimension reduction of data.

I Using the below function examine the effect of rank reduction.

I Why is SVD good enough to decompose the TDM?

svdRedRank(A,k)




A

m

n

≈ m

U

k ≤ n

k ≤ n

S

k

k ≤ n

V

n




VSMModel(A::Array{Float64,2},nq::Int)

SVDModel(A::Array{Float64,2},nq::Int,rank::Int)

I The above functions returns the Recall and Precision usingthe Vector space model and the LSI model.

I The first qNum vectors are the queries and the rest are thedocument vectors, of matrix A.

I For the LSI(SVDModel), we need to pass in the rank also as aparameter.

Use the functions shown below to view the Precision Vs Recall forVSM and LSI.

plotNew_RecPrec()

plotAdd_RecPrec()




10 20 30 40 50 60 70 80 90Recall

30

40

50

60

70

Prec

isio

n

Precision Vs Recall

VSMLSIKM




Thank you.



julia text mining_inmobi

Technology

julia dierent

julia usable

untangling text data

large text collections

performance of c

structured set of text

daytoday data analysis

highlevel code