Text Databases
Outline
- Spatial Databases
- Temporal Databases
- Spatio-temporal Databases
- Data Mining
- Multimedia Databases
  - Text databases
  - Image and video databases
  - Time Series databases
Text - Detailed outline
- Text databases
  - problem
  - full text scanning
  - inversion
  - signature files (a.k.a. Bloom Filters)
  - Vector model and clustering
  - information filtering and LSI
Vector Space Model and Clustering
- Keyword (free-text) queries (vs Boolean)
- each document: -> vector (HOW?)
- each query: -> vector
- search for 'similar' vectors
Vector Space Model and Clustering
- main idea: each document is a vector of size d, where d is the number of different terms in the database (the vocabulary size)

[Figure: 'indexing' maps a document ("...data...") to a d-dimensional vector whose positions run over the vocabulary, from 'aaron' to 'zoo'.]
Document Vectors
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
  - A vector is like an array of floating-point numbers
  - Has direction and magnitude
  - Each vector holds a place for every term in the collection
  - Therefore, most vectors are sparse
Document Vectors (one location for each word)

    docs  nova  galaxy  heat  h'wood  film  role  diet  fur
    A      10     5      3
    B       5    10
    C             10     8      7
    D                           9      10     5
    E                                  10    10
    F              9    10
    G              5     7      9
    H              6    10      2       8
    I                           7       5           1    3

Reading the table: "Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in text A, "Heat" occurs 3 times in text A; "Hollywood" occurs 7 times in text I, "Film" occurs 5 times in text I, "Diet" occurs 1 time in text I, "Fur" occurs 3 times in text I. (Blank means 0 occurrences.) The row labels A-I are document ids.
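For concreteness, here is a minimal Python sketch (assuming numpy and scipy are available) that stores this term-count table as a sparse matrix; the variable names are illustrative:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Term counts from the table above: rows = documents A..I,
# columns = nova, galaxy, heat, h'wood, film, role, diet, fur.
counts = np.array([
    [10,  5,  3,  0,  0,  0,  0,  0],  # A
    [ 5, 10,  0,  0,  0,  0,  0,  0],  # B
    [ 0, 10,  8,  7,  0,  0,  0,  0],  # C
    [ 0,  0,  0,  9, 10,  5,  0,  0],  # D
    [ 0,  0,  0,  0, 10, 10,  0,  0],  # E
    [ 0,  9, 10,  0,  0,  0,  0,  0],  # F
    [ 0,  5,  7,  9,  0,  0,  0,  0],  # G
    [ 0,  6, 10,  2,  8,  0,  0,  0],  # H
    [ 0,  0,  0,  7,  5,  0,  1,  3],  # I
])

# Most entries are 0 ("blank means 0 occurrences"), so a sparse
# representation stores only the non-zeros.
X = csr_matrix(counts)
print(X.nnz, "non-zeros out of", counts.size)  # 26 non-zeros out of 72
```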
We Can Plot the Vectors

[Figure: documents plotted in a 2-D space with axes 'Star' and 'Diet': a doc about astronomy and a doc about movie stars lie near the 'Star' axis; a doc about mammal behavior lies near the 'Diet' axis.]
Vector Space Model and Clustering
- Then, group nearby vectors together
  - Q1: cluster search?
  - Q2: cluster generation?
- Two significant contributions
  - ranked output
  - relevance feedback
Vector Space Model and Clustering
- cluster search: visit the (k) closest superclusters; continue recursively

[Figure: clusters of CS TRs and MD TRs in term space.]
Vector Space Model and Clustering
- relevance feedback (brilliant idea) [Rocchio'73]
- How?

[Figure: clusters of CS TRs and MD TRs in term space.]
Vector Space Model and Clustering
- How? A: by adding the 'good' vectors and subtracting the 'bad' ones

[Figure: the adjusted query vector, relative to the CS TRs and MD TRs clusters.]
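A minimal sketch of this update, following Rocchio's rule q' = α·q + β·(mean of relevant doc vectors) - γ·(mean of non-relevant doc vectors); the α, β, γ values below are common illustrative defaults, not taken from the slides:

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query toward the centroid of the 'good' vectors and
    away from the centroid of the 'bad' ones."""
    q = alpha * np.asarray(query, dtype=float)
    if len(relevant) > 0:
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant) > 0:
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return np.clip(q, 0.0, None)  # negative term weights are usually dropped

# toy example: feedback pulls the query toward the first axis
print(rocchio([1, 1], relevant=[[2, 0], [4, 0]], nonrelevant=[[0, 3]]))
# [3.25 0.25]
```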
Cluster generation
- Problem: given N points in V dimensions, group them (typically k-means or AGNES is used)
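A sketch of cluster generation with k-means, using scikit-learn's KMeans as one common implementation (AGNES would be the agglomerative alternative); the data here is a toy stand-in for document vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((1000, 50))          # N points in V dimensions (toy data)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])              # cluster id assigned to the first 10 points
print(km.cluster_centers_.shape)    # (5, 50): one centroid per cluster
```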
Assigning Weights to Terms
- Binary Weights
- Raw term frequency
- tf x idf
  - Recall the Zipf distribution
  - Want to weight terms highly if they are frequent in relevant documents, BUT infrequent in the collection as a whole
Binary Weights
- Only the presence (1) or absence (0) of a term is included in the vector

    docs  t1  t2  t3
    D1     1   0   1
    D2     1   0   0
    D3     0   1   1
    D4     1   0   0
    D5     1   1   1
    D6     1   1   0
    D7     0   1   0
    D8     0   1   0
    D9     0   0   1
    D10    0   1   1
    D11    1   0   1
Raw Term Weights
- The frequency of occurrence for the term in each document is included in the vector

    docs  t1  t2  t3
    D1     2   0   3
    D2     1   0   0
    D3     0   4   7
    D4     3   0   0
    D5     1   6   3
    D6     3   5   0
    D7     0   8   0
    D8     0  10   0
    D9     0   0   1
    D10    0   3   5
    D11    4   0   1
Assigning Weights
- tf x idf measure:
  - term frequency (tf)
  - inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
- Goal: assign a tf x idf weight to each term in each document
tf x idf

$w_{ik} = tf_{ik} \times \log(N / n_k)$

where:
- $T_k$ = term $k$
- $tf_{ik}$ = frequency of term $T_k$ in document $D_i$
- $idf_k$ = inverse document frequency of term $T_k$ in collection $C$; $idf_k = \log(N / n_k)$
- $N$ = total number of documents in collection $C$
- $n_k$ = number of documents in $C$ that contain $T_k$
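A direct transcription of this formula in Python (a sketch; base-10 logs are assumed, matching the next slide's examples):

```python
import numpy as np

def tfidf_weights(tf):
    """tf: (documents x terms) raw term-frequency matrix.
    Returns w_ik = tf_ik * log10(N / n_k)."""
    tf = np.asarray(tf, dtype=float)
    N = tf.shape[0]                          # total number of documents
    n_k = np.count_nonzero(tf, axis=0)       # documents containing term k
    idf = np.log10(N / np.maximum(n_k, 1))   # guard against n_k = 0
    return tf * idf
```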
Inverse Document Frequency
- IDF provides high values for rare words and low values for common words

For a collection of 10000 documents (base-10 logs):
- $\log(10000/1) = 4$
- $\log(10000/20) \approx 2.699$
- $\log(10000/5000) \approx 0.301$
- $\log(10000/10000) = 0$
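These four values can be checked in one line each:

```python
import math

for n_k in (1, 20, 5000, 10000):
    print(f"idf = log10(10000/{n_k}) = {math.log10(10000 / n_k):.3f}")
# 4.000, 2.699, 0.301, 0.000
```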
Similarity Measures for document vectors

- Simple matching (coordination level match): $|Q \cap D|$
- Dice's Coefficient: $\dfrac{2|Q \cap D|}{|Q| + |D|}$
- Jaccard's Coefficient: $\dfrac{|Q \cap D|}{|Q \cup D|}$
- Cosine Coefficient: $\dfrac{|Q \cap D|}{|Q|^{1/2} \cdot |D|^{1/2}}$
- Overlap Coefficient: $\dfrac{|Q \cap D|}{\min(|Q|, |D|)}$
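A sketch computing all five coefficients when Q and D are sets of terms (the binary-weight view above); it assumes both sets are non-empty:

```python
def similarity_coefficients(Q, D):
    inter = len(Q & D)  # |Q intersect D|
    return {
        "simple matching": inter,
        "Dice":    2 * inter / (len(Q) + len(D)),
        "Jaccard": inter / len(Q | D),
        "cosine":  inter / (len(Q) ** 0.5 * len(D) ** 0.5),
        "overlap": inter / min(len(Q), len(D)),
    }

print(similarity_coefficients({"data", "mining", "sql"},
                              {"data", "sql", "index", "tree"}))
```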
tf x idf normalization
- Normalize the term weights (so longer documents are not unfairly given more weight)
- to normalize usually means to force all values to fall within a certain range, usually between 0 and 1, inclusive

$w_{ik} = \dfrac{tf_{ik} \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 \, [\log(N / n_k)]^2}}$
Vector space similarity (use the weights to compare the documents)

Now, the similarity of two documents is:

$sim(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \, w_{jk}$

This is also called the cosine, or normalized inner product.
Computing Similarity Scores

Q = (0.4, 0.8); D1 = (0.8, 0.3); D2 = (0.2, 0.7)

cos θ1 = 0.74 (angle between Q and D1)
cos θ2 = 0.98 (angle between Q and D2)

[Figure: Q, D1, D2 plotted in the unit square (axes 0 to 1.0), with angles θ1 and θ2 marked.]
Vector Space with Term Weights and Cosine Matching

$D_i = (d_{i1}, w_{d_{i1}}; d_{i2}, w_{d_{i2}}; \ldots; d_{it}, w_{d_{it}})$
$Q = (q_1, w_{q_1}; q_2, w_{q_2}; \ldots; q_t, w_{q_t})$

$sim(Q, D_i) = \dfrac{\sum_{j=1}^{t} w_{q_j} w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2} \cdot \sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}}$

Example, with Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7):

$sim(Q, D2) = \dfrac{(0.4)(0.2) + (0.8)(0.7)}{\sqrt{(0.4)^2 + (0.8)^2} \cdot \sqrt{(0.2)^2 + (0.7)^2}} = \dfrac{0.64}{\sqrt{0.42}} \approx 0.98$

$sim(Q, D1) = \dfrac{0.56}{\sqrt{0.58}} \approx 0.74$

[Figure: Term A vs Term B axes (0 to 1.0) showing Q, D1, D2 and the angles θ1, θ2.]
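The two scores can be reproduced with a small cosine routine (numpy assumed); note the exact value for D1 is 0.733, which the slide rounds to 0.74 via the intermediate √0.58:

```python
import numpy as np

def cosine(q, d):
    q, d = np.asarray(q, dtype=float), np.asarray(d, dtype=float)
    return (q @ d) / (np.linalg.norm(q) * np.linalg.norm(d))

Q, D1, D2 = [0.4, 0.8], [0.8, 0.3], [0.2, 0.7]
print(cosine(Q, D2))   # 0.9829... ~ 0.98
print(cosine(Q, D1))   # 0.7328... ~ 0.74 with the slide's rounding
```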
Text - Detailed outline
- Text databases
  - problem
  - full text scanning
  - inversion
  - signature files (a.k.a. Bloom Filters)
  - Vector model and clustering
  - information filtering and LSI
Information Filtering + LSI
- [Foltz+,'92] Goal:
  - users specify interests (= keywords)
  - system alerts them, on suitable news-documents
- Major contribution: LSI = Latent Semantic Indexing
  - latent ('hidden') concepts
Information Filtering + LSI
- Main idea:
  - map each document into some 'concepts'
  - map each term into some 'concepts'
- 'Concept': ~ a set of terms, with weights, e.g. "data" (0.8), "system" (0.5), "retrieval" (0.6) -> DBMS_concept
Information Filtering + LSI
Pictorially: term-document matrix (BEFORE)

           'data'  'system'  'retrieval'  'lung'  'ear'
    TR1      1        1          1
    TR2      1        1          1
    TR3                                     1       1
    TR4                                     1       1
Information Filtering + LSI
Pictorially: concept-document matrix and ...

           'DBMS-concept'  'medical-concept'
    TR1          1
    TR2          1
    TR3                           1
    TR4                           1
Information Filtering + LSI
... and concept-term matrix

               'DBMS-concept'  'medical-concept'
    data             1
    system           1
    retrieval        1
    lung                              1
    ear                               1
Information Filtering + LSI
A: find the corresponding concept(s), via the concept-term matrix:

               'DBMS-concept'  'medical-concept'
    data             1
    system           1
    retrieval        1
    lung                              1
    ear                               1

... and the corresponding documents, via the concept-document matrix:

           'DBMS-concept'  'medical-concept'
    TR1          1
    TR2          1
    TR3                           1
    TR4                           1
Information Filtering + LSI
- Thus, it works like an (automatically constructed) thesaurus:
- we may retrieve documents that DON'T have the term 'system', but contain almost everything else ('data', 'retrieval')
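The thesaurus effect can be seen directly from the slide's concept-term matrix: a query containing only 'data' and 'retrieval' (no 'system') still lands squarely on the DBMS concept, and hence on TR1 and TR2. A minimal sketch:

```python
import numpy as np

# concept-term matrix from the slide
# (columns: data, system, retrieval, lung, ear)
concept_term = np.array([
    [1, 1, 1, 0, 0],   # 'DBMS-concept'
    [0, 0, 0, 1, 1],   # 'medical-concept'
])

q = np.array([1, 0, 1, 0, 0])   # query: 'data', 'retrieval' -- no 'system'
print(concept_term @ q)          # [2 0]: clearly a DBMS query,
                                 # so TR1 and TR2 are retrieved
```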
SVD - Detailed outline
- Motivation
- Definition - properties
- Interpretation
- Complexity
- Case studies
- Additional properties
Problem - specs
- ~10**6 rows; ~10**3 columns; no updates
- random access to any cell(s); small error: OK
SVD - Detailed outline
- Motivation
- Definition - properties
- Interpretation
- Complexity
- Case studies
- Additional properties
SVD - Definition

    A[n x m] = U[n x r]  Λ[r x r]  (V[m x r])^T

- A: n x m matrix (e.g., n documents, m terms)
- U: n x r matrix (n documents, r concepts)
- Λ: r x r diagonal matrix (strength of each 'concept') (r: rank of the matrix)
- V: m x r matrix (m terms, r concepts)
SVD - Properties

THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U Λ V^T, where
- U, V: unique (*)
- U, V: column-orthonormal (i.e., columns are unit vectors, orthogonal to each other): U^T U = I; V^T V = I (I: identity matrix)
- Λ: the diagonal entries (the 'eigenvalues') are positive, and sorted in decreasing order
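These properties are easy to verify numerically; numpy's SVD returns the same U, Λ (as a vector of singular values), V^T factorization:

```python
import numpy as np

A = np.random.default_rng(0).random((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert np.all(s >= 0) and np.all(s[:-1] >= s[1:])  # positive, sorted
assert np.allclose(U.T @ U, np.eye(4))             # U column-orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(4))           # V column-orthonormal
assert np.allclose(U @ np.diag(s) @ Vt, A)         # A = U Lambda V^T
```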
SVD - Example

A = U Λ V^T -- example (rows: documents, the first four from CS, the last three from MD; columns: terms 'data', 'inf.', 'retrieval', 'brain', 'lung'):

    1 1 1 0 0       0.18 0
    2 2 2 0 0       0.36 0
    1 1 1 0 0       0.18 0             9.64 0          0.58 0.58 0.58 0    0
    5 5 5 0 0   =   0.90 0         x   0    5.29   x   0    0    0    0.71 0.71
    0 0 0 2 2       0    0.53
    0 0 0 3 3       0    0.80
    0 0 0 1 1       0    0.27

Reading the factors:
- U is the doc-to-concept similarity matrix; its two columns correspond to the 'CS-concept' and the 'MD-concept'
- 9.64 is the 'strength' of the CS-concept (5.29 that of the MD-concept)
- V^T is the term-to-concept similarity matrix (row 1: CS-concept; row 2: MD-concept)
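The whole example can be reproduced with numpy (up to the sign of the singular vectors, which is arbitrary):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0], [0, 0, 0, 2, 2], [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s[:2], 2))       # [9.64 5.29] -- the concept strengths
print(np.round(U[:, :2], 2))    # doc-to-concept columns (0.18, 0.36, ...)
print(np.round(Vt[:2], 2))      # term-to-concept rows (0.58... / ...0.71)
```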
SVD - Detailed outline
- Motivation
- Definition - properties
- Interpretation
- Complexity
- Case studies
- Additional properties
SVD - Interpretation #1
- 'documents', 'terms' and 'concepts':
  - U: document-to-concept similarity matrix
  - V: term-to-concept similarity matrix
  - Λ: its diagonal elements: 'strength' of each concept
SVD - Interpretation #2

A = U Λ V^T -- the example above, again:
- v1 (the first row of V^T) is the first projection axis
- λ1 = 9.64 gives the variance ('spread') of the data on the v1 axis
- U gives the coordinates of the points on the projection axes

More details -- Q: how exactly is dim. reduction done?
A: set the smallest eigenvalues to zero:

    1 1 1 0 0       0.18 0
    2 2 2 0 0       0.36 0
    1 1 1 0 0       0.18 0             9.64 0       0.58 0.58 0.58 0    0
    5 5 5 0 0   ~   0.90 0         x   0    0   x   0    0    0    0.71 0.71
    0 0 0 2 2       0    0.53
    0 0 0 3 3       0    0.80
    0 0 0 1 1       0    0.27

Dropping the zeroed concept entirely:

    1 1 1 0 0       0.18
    2 2 2 0 0       0.36
    1 1 1 0 0       0.18
    5 5 5 0 0   ~   0.90   x   [ 9.64 ]   x   [ 0.58 0.58 0.58 0 0 ]
    0 0 0 2 2       0
    0 0 0 3 3       0
    0 0 0 1 1       0

which multiplies back to:

    1 1 1 0 0       1 1 1 0 0
    2 2 2 0 0       2 2 2 0 0
    1 1 1 0 0       1 1 1 0 0
    5 5 5 0 0   ~   5 5 5 0 0
    0 0 0 2 2       0 0 0 0 0
    0 0 0 3 3       0 0 0 0 0
    0 0 0 1 1       0 0 0 0 0
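The same truncation in numpy: zero out λ2 and multiply back (a sketch, reusing the example matrix):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0], [0, 0, 0, 2, 2], [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

s[1:] = 0.0                 # set the smallest 'eigenvalues' to zero
A1 = U @ np.diag(s) @ Vt    # rank-1 reconstruction
print(np.round(A1, 1))      # CS rows survive; MD rows become all-zero
```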
SVD - Interpretation #2

Equivalent: 'spectral decomposition' of the matrix:

    A = u1 λ1 v1^T + u2 λ2 v2^T + ...

where each u_i is an n x 1 column of U, each v_i^T is a 1 x m row of V^T, and there are r terms in the sum:

$A = \sum_{i=1}^{r} \lambda_i \, u_i \, v_i^T$

For the running example, u1 and u2 are the two columns of U, λ1 = 9.64, λ2 = 5.29, and v1^T, v2^T are the two rows of V^T.
SVD - Interpretation #2
- approximation / dim. reduction: keep the first few terms of the sum (Q: how many?)
- assume λ1 >= λ2 >= ...
- A (heuristic [Fukunaga]): keep 80-90% of the 'energy' (= the sum of squares of the λi's)
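A sketch of the heuristic: pick the smallest k whose leading singular values retain the target share of the energy:

```python
import numpy as np

def rank_for_energy(singular_values, target=0.9):
    """Smallest k keeping `target` of the energy = sum of squared values."""
    s = np.asarray(singular_values, dtype=float)
    kept = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(kept, target) + 1)

print(rank_for_energy([9.64, 5.29], 0.9))  # 2: one concept keeps only ~77%
```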
SVD - Interpretation #3
- finds non-zero 'blobs' in a data matrix: in the running example, the CS block (documents 1-4 x terms 'data', 'inf.', 'retrieval') and the MD block (documents 5-7 x terms 'brain', 'lung') each produce one concept
SVD - Interpretation #3

Drill: find the SVD, 'by inspection'! Q: rank = ??

    1 1 1 0 0
    1 1 1 0 0
    1 1 1 0 0   =   [??]   x   [??]   x   [??]
    0 0 0 1 1
    0 0 0 1 1

A: rank = 2 (2 linearly independent rows/cols):

    1 1 1 0 0       1 0
    1 1 1 0 0       1 0       ?? 0        1 1 1 0 0
    1 1 1 0 0   =   1 0   x   0  ??   x   0 0 0 1 1
    0 0 0 1 1       0 1
    0 0 0 1 1       0 1

The column vectors are orthogonal -- but not unit vectors. Normalizing them (by 1/√3 and 1/√2):

    1 1 1 0 0       1/√3  0
    1 1 1 0 0       1/√3  0       ?? 0        1/√3 1/√3 1/√3 0    0
    1 1 1 0 0   =   1/√3  0   x   0  ??   x   0    0    0    1/√2 1/√2
    0 0 0 1 1       0     1/√2
    0 0 0 1 1       0     1/√2

... and the eigenvalues are:

    1 1 1 0 0       1/√3  0
    1 1 1 0 0       1/√3  0       3 0       1/√3 1/√3 1/√3 0    0
    1 1 1 0 0   =   1/√3  0   x   0 2   x   0    0    0    1/√2 1/√2
    0 0 0 1 1       0     1/√2
    0 0 0 1 1       0     1/√2
SVD - Interpretation #3

A: SVD properties to check:
- the matrix product should give back the matrix A
- matrix U should be column-orthonormal, i.e., its columns should be unit vectors, orthogonal to each other
- ditto for matrix V
- matrix Λ should be diagonal, with positive values
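The drill can be checked the same way (again up to sign): the singular values come out as 3 and 2, and the singular vectors as the 1/√3 and 1/√2 blocks:

```python
import numpy as np

B = np.array([[1, 1, 1, 0, 0]] * 3 + [[0, 0, 0, 1, 1]] * 2, dtype=float)
U, s, Vt = np.linalg.svd(B, full_matrices=False)
print(np.round(s[:2], 2))      # [3. 2.]
print(np.round(U[:, :2], 2))   # +-0.58 (= 1/sqrt(3)) and +-0.71 (= 1/sqrt(2))
print(np.round(Vt[:2], 2))
```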
SVD - Detailed outline
- Motivation
- Definition - properties
- Interpretation
- Complexity
- Case studies
- Additional properties
SVD - Complexity
- O(n * m * m) or O(n * n * m) (whichever is less)
- less work, if we only want the eigenvalues
- or if we only want the first k eigenvectors
- or if the matrix is sparse [Berry]
- Implemented in any linear algebra package (LINPACK, matlab, Splus, mathematica, ...)
SVD - Complexity
- Faster algorithms for approximate eigenvector computations exist:
  - Alan Frieze, Ravi Kannan, Santosh Vempala: Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations. Proc. 39th FOCS, p. 370, November 8-11, 1998.
  - Sudipto Guha, Dimitrios Gunopulos, Nick Koudas: Correlating Synchronous and Asynchronous Data Streams. KDD 2003: 529-534.
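In the same spirit, standard sparse solvers compute only the top-k singular triplets without ever forming the dense factorization; for example, scipy's iterative svds (the sizes and density below are illustrative):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

X = sparse_random(10_000, 1_000, density=0.001, random_state=0, format="csr")
U, s, Vt = svds(X, k=10)        # only the 10 strongest concepts
print(np.sort(s)[::-1][:3])     # svds returns values in ascending order
```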
SVD - conclusions so far
- SVD: A = U Λ V^T: unique (*)
- U: document-to-concept similarities
- V: term-to-concept similarities
- Λ: strength of each concept
- dim. reduction: keep the first few strongest eigenvalues (80-90% of the 'energy')
- SVD picks up linear correlations
- SVD picks up non-zero 'blobs'