vector space model any text object can be represented by a term vector examples: documents, queries,...
Post on 21-Dec-2015
TRANSCRIPT
![Page 1: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/1.jpg)
Vector Space Model
Any text object can be represented by a term vector.
Examples: documents, queries, sentences, ….
A query is viewed as a short document.
Similarity is determined by distance in a vector space.
Example: the cosine of the angle between the vectors.
The SMART system: developed at Cornell University, 1960-1999; still widely used.
![Page 2: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/2.jpg)
Vector Space Model
Documents are represented as vectors in a multi-dimensional Euclidean space. Each axis = a term (token).
The coordinate of document d in the direction of term t is determined by:
Term frequency TF(d,t): the number of times t occurs in document d, scaled in a variety of ways to normalize for document length.
Inverse document frequency IDF(t): scales down the terms that occur in many documents.
![Page 3: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/3.jpg)
Term Frequency: Scaling
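The number of times t occurs in document d: $n(d,t)$

$$\mathrm{TF}(d,t) = \frac{n(d,t)}{\sum_{\tau \in W} n(d,\tau)}$$

$$\mathrm{TF}(d,t) = \frac{n(d,t)}{\max_{\tau \in W} n(d,\tau)}$$

The Cornell SMART system uses:

$$\mathrm{TF}(d,t) = \begin{cases} 0 & \text{if } n(d,t) = 0 \\ 1 + \log(1 + \log(n(d,t))) & \text{otherwise} \end{cases}$$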
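The three scalings can be compared on a toy document. This is a minimal sketch: the function names and the example document are made up for illustration, and the natural logarithm is assumed because the slide does not state a base.

```python
import math
from collections import Counter

def term_counts(tokens):
    """n(d, t): raw count of each term t in the token list of document d."""
    return Counter(tokens)

def tf_sum(counts, t):
    """n(d,t) divided by the total number of term occurrences in d."""
    total = sum(counts.values())
    return counts[t] / total if total else 0.0

def tf_max(counts, t):
    """n(d,t) divided by the count of the most frequent term in d."""
    peak = max(counts.values(), default=0)
    return counts[t] / peak if peak else 0.0

def tf_smart(counts, t):
    """SMART scaling: 0 if t is absent, else 1 + log(1 + log(n(d,t)))."""
    n = counts[t]
    return 0.0 if n == 0 else 1.0 + math.log(1.0 + math.log(n))

# Made-up example document (not from the slides).
doc = "baby health baby safety baby home".split()
counts = term_counts(doc)

for term in ("baby", "health", "rust"):
    print(term, tf_sum(counts, term), tf_max(counts, term), tf_smart(counts, term))
```

The SMART variant grows very slowly with the raw count, so a term that occurs three times is weighted well under twice as much as a term that occurs once.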
![Page 4: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/4.jpg)
Inverse Document Frequency
Not all axes (terms) in the vector space are equally important.
IDF seeks to scale down the coordinates of terms that occur in many documents. The Cornell SMART system uses:

$$\mathrm{IDF}(t) = \log \frac{1 + |D|}{|D_t|}$$

where $D$ is the document collection and $D_t$ is the set of documents containing term t.
If $|D_t| \ll |D|$, the term t will enjoy a large IDF scale, and vice versa.
Other variants are also used; these are mostly dampened functions of $\frac{|D|}{|D_t|}$.
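A small sketch of the SMART IDF formula follows. The helper names and the four toy documents are assumptions made for illustration, and the natural logarithm is assumed.

```python
import math

def document_frequency(docs):
    """|D_t| for every term t: the number of documents that contain t."""
    df = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return df

def idf_smart(num_docs, doc_freq):
    """IDF(t) = log((1 + |D|) / |D_t|); doc_freq must be positive."""
    return math.log((1 + num_docs) / doc_freq)

# Made-up toy corpus (not from the slides).
docs = [
    ["baby", "health"],
    ["baby", "home"],
    ["baby", "safety"],
    ["rust", "proofing"],
]
df = document_frequency(docs)
idf = {term: idf_smart(len(docs), n) for term, n in df.items()}
print(sorted(idf.items(), key=lambda item: item[1]))
```

In this toy corpus "baby" appears in three of four documents and receives the smallest IDF, while terms that appear in a single document receive the largest.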
![Page 5: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/5.jpg)
TFIDF-space
An obvious way to combine TF and IDF: the coordinate of document d in axis t is given by

$$d_t = \mathrm{TF}(d,t) \cdot \mathrm{IDF}(t)$$

The general form of $d_t$ consists of three parts:

$$d_t = L_{td} \, G_t \, D_d$$

$L_{td}$: local weight for term t occurring in doc d
$G_t$: global weight for term t occurring in the corpus
$D_d$: document normalization factor
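One possible instantiation of the three-part weight is sketched below. It assumes the SMART TF as the local weight, the SMART IDF as the global weight, and the reciprocal of the Euclidean norm as the document normalization factor; the slides do not fix these choices, and the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse vector (dict term -> weight) per tokenized document."""
    n_docs = len(docs)
    # |D_t|: number of documents containing each term.
    df = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        raw = {}
        for term, n in Counter(tokens).items():
            local = 1.0 + math.log(1.0 + math.log(n))        # L_td (assumed: SMART TF)
            global_weight = math.log((1 + n_docs) / df[term])  # G_t (assumed: SMART IDF)
            raw[term] = local * global_weight
        norm = math.sqrt(sum(w * w for w in raw.values()))
        scale = 1.0 / norm if norm else 0.0                    # D_d (assumed: 1 / Euclidean norm)
        vectors.append({term: w * scale for term, w in raw.items()})
    return vectors

# Made-up toy corpus (not from the slides).
docs = [
    ["baby", "health", "baby"],
    ["baby", "home"],
    ["rust", "proofing", "guide"],
]
vectors = tfidf_vectors(docs)
print(vectors[1])
```

With this normalization every document vector has unit length, so the inner product of two vectors is already their cosine.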
![Page 6: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/6.jpg)
Term-by-Document Matrix
A document collection (corpus) composed of n documents that are indexed by m terms (tokens) can be represented as an $m \times n$ matrix $A$.
![Page 7: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/7.jpg)
Summary
Tokenization
Removing stopwords
Stemming
Term Weighting
  TF: Local
  IDF: Global
  Normalization
TF-IDF Vector Space
Term-by-Document Matrix
![Page 8: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/8.jpg)
Reuters-21578
21578 docs, 27000 terms, and 135 classes
Documents 1-14818 belong to the training set; documents 14819-21578 belong to the testing set.
Reuters-21578 includes 135 categories. Using the ApteMod version of the TOPICS set results in 90 categories with 7,770 training documents and 3,019 testing documents.
![Page 9: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/9.jpg)
Preprocessing Procedures (cont.)
After Stopwords Elimination
After Porter Algorithm
![Page 10: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/10.jpg)
[Figure: process diagram with the stages Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment arranged around DATA]
![Page 11: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/11.jpg)
Problems with Vector Space Model
How to define/select the ‘basic concept’?
VS model treats each term as a basic vector.
E.g., q = (‘microsoft’, ‘software’), d = (‘windows_xp’): the query and the document share no term, so they look unrelated even though the topics are related.
How to assign weights to different terms?
Need to distinguish informative words from common, uninformative ones.
Weight in query indicates the importance of the term.
Weight in doc indicates how well the term characterizes the doc.
How to define similarity/distance function?
How to store the term-by-document matrix?
![Page 12: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/12.jpg)
Choice of ‘Basic Concepts’
[Figure: document D1 drawn against the term axes Java, Microsoft, and Starbucks]
Which one is better?
![Page 13: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/13.jpg)
Vector Space Model: Similarity
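Given:
A query $q = (q_1, q_2, \ldots, q_n)$, where $q_i$ is the term frequency of the i-th word
A document $d_k = (d_{k,1}, d_{k,2}, \ldots, d_{k,n})$, where $d_{k,i}$ is the term frequency of the i-th word

Similarity of a query q to a document $d_k$:

$$\mathrm{sim}(q, d_k) = q \cdot d_k = q_1 d_{k,1} + q_2 d_{k,2} + \cdots + q_n d_{k,n} = \|q\|_2 \, \|d_k\|_2 \cos(\angle(q, d_k))$$

$$\mathrm{sim}'(q, d_k) = \cos(\angle(q, d_k)) = \frac{q \cdot d_k}{\|q\|_2 \, \|d_k\|_2} = \frac{q_1 d_{k,1} + q_2 d_{k,2} + \cdots + q_n d_{k,n}}{\sqrt{q_1^2 + q_2^2 + \cdots + q_n^2} \, \sqrt{d_{k,1}^2 + d_{k,2}^2 + \cdots + d_{k,n}^2}}$$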
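The two similarity measures differ only in the length normalization, which a short sketch makes visible. The function names and the three example vectors are made up for illustration.

```python
import math

def dot(q, d):
    """Inner product sim(q, d) = q_1*d_1 + ... + q_n*d_n."""
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    """sim'(q, d): inner product divided by the product of the 2-norms."""
    q_norm = math.sqrt(dot(q, q))
    d_norm = math.sqrt(dot(d, d))
    if q_norm == 0 or d_norm == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot(q, d) / (q_norm * d_norm)

# Made-up term-frequency vectors over a three-term vocabulary.
q = [1, 1, 0]
d1 = [2, 2, 0]   # same direction as q, twice as long
d2 = [0, 0, 3]   # shares no term with q

print(dot(q, d1), cosine(q, d1))
print(dot(q, d2), cosine(q, d2))
```

The raw inner product rewards the longer document d1, while the cosine depends only on direction, so a document with the same term proportions as the query scores 1 regardless of its length.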
![Page 14: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/14.jpg)
Terms
T1: Bab(y,ies,y’s)
T2: Child(ren’s)
T3: Guide
T4: Health
T5: Home
T6: Infant
T7: Proofing
T8: Safety
T9: Toddler

Documents
D1: Infant & Toddler First Aid
D2: Babies & Children’s Room (For Your Home)
D3: Child Safety at Home
D4: Your Baby’s Health and Safety: From Infant to Toddler
D5: Baby Proofing Basics
D6: Your Guide to Easy Rust Proofing
D7: Beanie Babies Collector’s Guide
![Page 15: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/15.jpg)
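The 9 x 7 term-by-document matrix before normalization, where the element $\hat{a}_{ij}$ is the number of times term i appears in document title j:

$$\hat{A} = \begin{bmatrix}
0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix}$$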
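The count matrix can be rebuilt from the seven titles and nine terms listed earlier. The sketch below is only an illustration: it uses lowercase prefix matching as a crude stand-in for stemming (an assumption, not the Porter algorithm), and the names `stems`, `titles`, and `count_matrix` are hypothetical.

```python
import re

# Document titles D1..D7 as listed on the slide.
titles = [
    "Infant & Toddler First Aid",
    "Babies & Children's Room (For Your Home)",
    "Child Safety at Home",
    "Your Baby's Health and Safety: From Infant to Toddler",
    "Baby Proofing Basics",
    "Your Guide to Easy Rust Proofing",
    "Beanie Babies Collector's Guide",
]

# Terms T1..T9 reduced to prefixes (assumption: prefix match approximates the stem).
stems = ["bab", "child", "guide", "health", "home",
         "infant", "proofing", "safety", "toddler"]

def count_matrix(stems, titles):
    """Rows are terms, columns are documents; entry = number of matching tokens."""
    rows = []
    for stem in stems:
        row = []
        for title in titles:
            tokens = re.findall(r"[a-z']+", title.lower())
            row.append(sum(token.startswith(stem) for token in tokens))
        rows.append(row)
    return rows

A_hat = count_matrix(stems, titles)
for row in A_hat:
    print(row)
```

Prefix matching works for this small vocabulary but would over-match in general (for example, "bab" would also match "babble"), which is one reason a real stemmer is used in practice.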
![Page 16: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/16.jpg)
The 9 x 7 term-by-document matrix with unit columns:
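$$A = \begin{bmatrix}
0 & 0.5774 & 0 & 0.4472 & 0.7071 & 0 & 0.7071 \\
0 & 0.5774 & 0.5774 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0.7071 & 0.7071 \\
0 & 0 & 0 & 0.4472 & 0 & 0 & 0 \\
0 & 0.5774 & 0.5774 & 0 & 0 & 0 & 0 \\
0.7071 & 0 & 0 & 0.4472 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0.7071 & 0.7071 & 0 \\
0 & 0 & 0.5774 & 0.4472 & 0 & 0 & 0 \\
0.7071 & 0 & 0 & 0.4472 & 0 & 0 & 0
\end{bmatrix}$$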
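The unit-column matrix follows from the count matrix by dividing each column by its Euclidean norm. A minimal sketch, with `normalize_columns` as a hypothetical helper name:

```python
import math

# 9 x 7 count matrix from the previous slide (rows = terms, columns = documents).
A_hat = [
    [0, 1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
]

def normalize_columns(matrix):
    """Scale every column to unit Euclidean length (zero columns stay zero)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    norms = [math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
             for j in range(n_cols)]
    return [[matrix[i][j] / norms[j] if norms[j] else 0.0 for j in range(n_cols)]
            for i in range(n_rows)]

A = normalize_columns(A_hat)
print([round(x, 4) for x in A[0]])
# → [0.0, 0.5774, 0.0, 0.4472, 0.7071, 0.0, 0.7071]
```

A column with two, three, or five nonzero entries gets the values 1/√2 ≈ 0.7071, 1/√3 ≈ 0.5774, or 1/√5 ≈ 0.4472, which are the three values that appear in the matrix.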
![Page 17: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/17.jpg)
Row-compressed storage (RCS) of the unit-column matrix A above, with 1-based indices. The nonzeros are stored row by row; col_ind holds the column of each stored value, and row_ptr holds the position in val where each row starts (the last entry is the number of nonzeros plus one).

val: 0.5774 0.4472 0.7071 0.7071 0.5774 0.5774 0.7071 0.7071 0.4472 0.5774 0.5774 0.7071 0.4472 0.7071 0.7071 0.5774 0.4472 0.7071 0.4472
col_ind: 2 4 5 7 2 3 6 7 4 2 3 1 4 5 6 3 4 1 4
row_ptr: 1 5 7 9 10 12 14 16 18 20
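The three RCS arrays can be generated from a dense matrix in one pass over its rows. This is a sketch only: `to_rcs` is a hypothetical helper, and the matrix is rebuilt here from its 0/1 pattern and the rounded column weights shown on the slide.

```python
# 0/1 sparsity pattern of the 9 x 7 term-by-document matrix.
pattern = [
    [0, 1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
]
# Rounded value shared by all nonzeros of each column (unit-length columns).
col_weight = [0.7071, 0.5774, 0.5774, 0.4472, 0.7071, 0.7071, 0.7071]
A = [[p * w for p, w in zip(row, col_weight)] for row in pattern]

def to_rcs(matrix):
    """Return (val, col_ind, row_ptr) with 1-based indices, as on the slide."""
    val, col_ind, row_ptr = [], [], [1]
    for row in matrix:
        for j, x in enumerate(row, start=1):
            if x != 0:
                val.append(x)
                col_ind.append(j)
        row_ptr.append(len(val) + 1)  # start of the next row
    return val, col_ind, row_ptr

val, col_ind, row_ptr = to_rcs(A)
print(col_ind)
print(row_ptr)
```

Only 19 of the 63 entries are stored, and row i occupies positions row_ptr[i] through row_ptr[i+1] - 1 of val, so all documents containing a given term can be read off without scanning zeros.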
![Page 18: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/18.jpg)
Column-compressed storage (CCS) of the same matrix A, with 1-based indices. The nonzeros are stored column by column; row_ind holds the row of each stored value, and col_ptr holds the position in val where each column starts (the last entry is the number of nonzeros plus one).

val: 0.7071 0.7071 0.5774 0.5774 0.5774 0.5774 0.5774 0.5774 0.4472 0.4472 0.4472 0.4472 0.4472 0.7071 0.7071 0.7071 0.7071 0.7071 0.7071
row_ind: 6 9 1 2 5 2 5 8 1 4 6 8 9 1 7 3 7 1 3
col_ptr: 1 3 6 9 14 16 18 20
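The CCS arrays can also be derived from the RCS arrays of the previous slide by regrouping the stored entries by column. The sketch below assumes 1-based indices as on the slides; `rcs_to_ccs` is a hypothetical helper name.

```python
# RCS arrays of the 9 x 7 matrix, copied from the previous slide (1-based).
val = [0.5774, 0.4472, 0.7071, 0.7071, 0.5774, 0.5774, 0.7071, 0.7071, 0.4472,
       0.5774, 0.5774, 0.7071, 0.4472, 0.7071, 0.7071, 0.5774, 0.4472, 0.7071,
       0.4472]
col_ind = [2, 4, 5, 7, 2, 3, 6, 7, 4, 2, 3, 1, 4, 5, 6, 3, 4, 1, 4]
row_ptr = [1, 5, 7, 9, 10, 12, 14, 16, 18, 20]

def rcs_to_ccs(val, col_ind, row_ptr, n_cols):
    """Regroup row-ordered nonzeros by column; returns (val, row_ind, col_ptr)."""
    entries = []  # (column, row, value) triples
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i] - 1, row_ptr[i + 1] - 1):
            entries.append((col_ind[k], i + 1, val[k]))
    entries.sort(key=lambda e: (e[0], e[1]))  # column-major order
    ccs_val = [v for _, _, v in entries]
    row_ind = [r for _, r, _ in entries]
    col_ptr = [1]
    for j in range(1, n_cols + 1):
        in_column = sum(1 for c, _, _ in entries if c == j)
        col_ptr.append(col_ptr[-1] + in_column)
    return ccs_val, row_ind, col_ptr

ccs_val, row_ind, col_ptr = rcs_to_ccs(val, col_ind, row_ptr, n_cols=7)
print(row_ind)
print(col_ptr)
```

Because columns are documents here, CCS gives direct access to one document vector at a time, whereas RCS gives direct access to one term's postings.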
![Page 19: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/19.jpg)
Short Review of Linear Algebra
![Page 20: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/20.jpg)
The Terms that You Have to Know!
Basis, Linearly independent, Orthogonal
Column space, Row space, Rank
Linear combination
Linear transformation
Inner product
Eigenvalue, Eigenvector
Projection
![Page 21: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/21.jpg)
Matrix Factorization
LU-Factorization: $A = LU$
Very useful for solving systems of linear equations
Some row exchanges are required

QR-Factorization: $A = QR$, with $A \in \mathbb{R}^{m \times n}$, $Q \in \mathbb{R}^{m \times n}$, $R \in \mathbb{R}^{n \times n}$
Every matrix $A \in \mathbb{R}^{m \times n}$ with linearly independent columns can be factored into $A = QR$. The columns of $Q$ are orthonormal, and $R$ is upper triangular and invertible. When $m = n$ and all matrices are square, $Q$ becomes an orthogonal matrix ($Q^T Q = I$).
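The stated properties of the QR factors can be checked numerically. The sketch below uses modified Gram-Schmidt, which is one of several ways to compute a QR factorization and is chosen here only because it is short; the function name and the 3 x 2 example matrix are made up.

```python
import math

def qr_gram_schmidt(A):
    """Thin QR of an m x n matrix (list of rows) with linearly independent columns."""
    m, n = len(A), len(A[0])
    Q_cols = []                            # orthonormal columns found so far
    R = [[0.0] * n for _ in range(n)]      # upper triangular factor
    for j in range(n):
        v = [A[i][j] for i in range(m)]    # working copy of column j
        for k in range(j):
            # Remove the component of v along the k-th orthonormal column.
            R[k][j] = sum(Q_cols[k][i] * v[i] for i in range(m))
            v = [v[i] - R[k][j] * Q_cols[k][i] for i in range(m)]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        if R[j][j] < 1e-12:
            raise ValueError("columns are (numerically) linearly dependent")
        Q_cols.append([x / R[j][j] for x in v])
    Q = [[Q_cols[j][i] for j in range(n)] for i in range(m)]
    return Q, R

# Made-up example with m = 3, n = 2.
A = [[1.0, 1.0],
     [1.0, 0.0],
     [0.0, 1.0]]
Q, R = qr_gram_schmidt(A)
print(Q)
print(R)
```

For this input Q is 3 x 2 with orthonormal columns and R is 2 x 2 upper triangular with a positive diagonal, matching the dimensions given above.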
![Page 22: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/22.jpg)
QR Factorization Simplifies Least Squares Problem
The normal equation for the LS problem: $A^T A x = A^T b$

$$A^T A x = R^T Q^T Q R x = R^T R x = R^T Q^T b \;\Leftrightarrow\; R x = Q^T b \quad (R^T \text{ is invertible})$$

Note: The orthogonal matrix $Q$ constructs the column space of matrix $A$:

$$A_{\cdot j} = Q \, R_{\cdot j} = \sum_{k=1}^{n} R_{kj} \, Q_{\cdot k}$$
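Because R is upper triangular, the reduced system Rx = Qᵀb is solved by back substitution. The sketch below takes Q and R as given; the small factors and the right-hand side are a hand-made example, and `solve_least_squares_qr` is a hypothetical name.

```python
def solve_least_squares_qr(Q, R, b):
    """Solve R x = Q^T b by back substitution (Q is m x n, R is n x n)."""
    m, n = len(Q), len(R)
    # c = Q^T b
    c = [sum(Q[i][k] * b[i] for i in range(m)) for k in range(n)]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (c[i] - known) / R[i][i]
    return x

# Hand-made factors of A = [[1, 1], [0, 1], [0, 0]].
Q = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
R = [[1.0, 1.0],
     [0.0, 1.0]]
b = [3.0, 2.0, 5.0]

x = solve_least_squares_qr(Q, R, b)
print(x)
```

The third component of b lies outside the column space of A, so it cannot be fitted and only contributes to the residual; the solver never forms AᵀA.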
![Page 23: Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document](https://reader030.vdocuments.site/reader030/viewer/2022032801/56649d555503460f94a32983/html5/thumbnails/23.jpg)
Motivation for Computing QR of the term-by-doc Matrix
The basis vectors of the column space of $A$ can be used to describe the semantic content of the corresponding text collection.
Let $\theta_k$ be the angle between a query $q$ and the document vector $A_{\cdot k}$:

$$\cos\theta_k = \frac{A_{\cdot k}^T q}{\|A_{\cdot k}\|_2 \, \|q\|_2} = \frac{(Q R_{\cdot k})^T q}{\|Q R_{\cdot k}\|_2 \, \|q\|_2} = \frac{R_{\cdot k}^T (Q^T q)}{\|R_{\cdot k}\|_2 \, \|q\|_2}$$

That means we can keep $Q$ and $R$ instead of $A$.
QR can also be applied to dimension reduction.
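The last expression can be evaluated without ever touching A: project the query once with Qᵀ, then compare it with each column of R. A minimal sketch, assuming Q and R are already available; the factors, the query, and the function name are made up for illustration.

```python
import math

def cosines_from_qr(Q, R, q):
    """cos(theta_k) = R[:,k]^T (Q^T q) / (||R[:,k]||_2 ||q||_2) for every document k."""
    m, n = len(Q), len(R)
    qt_q = [sum(Q[i][k] * q[i] for i in range(m)) for k in range(n)]  # Q^T q, computed once
    q_norm = math.sqrt(sum(x * x for x in q))
    result = []
    for k in range(n):
        r_col = [R[i][k] for i in range(n)]
        r_norm = math.sqrt(sum(x * x for x in r_col))
        result.append(sum(r * c for r, c in zip(r_col, qt_q)) / (r_norm * q_norm))
    return result

# Hand-made factors of A = [[1, 1], [0, 1], [0, 0]] (two documents, three terms).
Q = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
R = [[1.0, 1.0],
     [0.0, 1.0]]
q = [1.0, 1.0, 0.0]

print(cosines_from_qr(Q, R, q))
```

The norm of a column of R equals the norm of the corresponding column of A because Q has orthonormal columns, which is why the denominator can use R alone.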