TRANSCRIPT
Gestione Avanzata dell'Informazione — Part A: Full-Text Information Management
Full-Text Indexing
Contents: Introduction, Inverted Indices, Construction, Searching
GAvI - Full-Text Information Management: Full-Text Indexing
Sequential or online searching
- Involves finding the occurrences of a pattern in a text when the text is not preprocessed
- Is appropriate when the text is small
- Is the only choice if the text collection is very volatile (i.e. undergoes modifications very frequently), or the index space overhead cannot be afforded
Indexed searching
Index
- A data structure over the text to speed up the search
- Appropriate when the text collection is large and semi-static
- Semi-static collection: updated at reasonably regular intervals, but not expected to support thousands of insertions of single words per second

Indexing techniques
- Inverted indices, suffix arrays, and signature files
- Evaluated by:
  - search cost
  - space overhead
  - cost of building and updating the indexing structures
Indexing techniques

Inverted indices
- Word-oriented mechanism for indexing a text collection
- Composed of a vocabulary and occurrences
- Currently the best choice for most applications

Suffix arrays
- Faster for phrase searches and other less common queries
- Harder to build and maintain

Signature files
- Word-oriented index structures based on hashing
- Were popular in the 1980s
Notations and Background

Notations
- n: the size of the text database
- m: pattern length
- M: amount of main memory available

Background
- sorted arrays
- binary search tree
- B-tree
- hash table
- trie
Trie (TRIE or PREFIX TREE)
- An in-memory multiway tree storing a set of strings
- Strings are stored in the leaves
- Every edge of the trie is labelled with a letter
- All the descendants of a node share a common prefix: the sequence of letters from the root to the node
Example text: "This is a text. A text has many words. Words are made from letters."
Character positions of each word: 1 6 9 11 17 19 24 28 33 40 46 50 55 60

Strings to be stored and the corresponding starting positions:
letters: 60; made: 50; many: 28; text: 11, 19; words: 33, 40

[Figure: trie over these strings, with root edges labelled 'l', 'm', 't', 'w' (and, under 'm'-'a', the edges 'd' and 'n'); the leaves store letters:60, made:50, many:28, text:11,19, words:33,40]
Trie (cont.)

Construction
- The root of the trie discriminates on the first character
- The children of the root discriminate on the second character, and so on
- If the remaining subtrie contains only one string, that string's identity is stored in a leaf node

Access
- Start from the root
- Follow the path given by the character sequence of the pattern
- Stop when a leaf is found or no character matches the current one
- The access cost is O(m)
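The construction and access steps above can be sketched in Python. This is a minimal uncompressed trie (it stores occurrence positions at word-end nodes rather than collapsing single-string subtries into leaves, a simplification of the slide's scheme); the class and method names are illustrative, not from the slides.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.positions = []  # starting positions, non-empty only where a word ends

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, position):
        """Follow/create one node per character, then record the occurrence."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.positions.append(position)

    def search(self, word):
        """Access cost O(m): one child lookup per pattern character."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                return None   # no character matches the current one
            node = node.children[ch]
        return node.positions or None
```

With the example strings, `Trie().insert("text", 11)` etc. makes `search("text")` return `[11, 19]` after both occurrences are inserted, while `search("xyz")` returns `None`.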
Contents: Introduction, Inverted Indices, Construction, Searching
Unstructured data in 1680
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.
- Why is that not the answer?
  - Slow (for large corpora)
  - NOT Calpurnia is non-trivial
  - Other operations (e.g., find the word Romans near countrymen) are not feasible
Sec. 1.1
Term-document incidence

Term       | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
-----------+----------------------+---------------+-------------+--------+---------+--------
Antony     |          1           |       1       |      0      |    0   |    0    |    1
Brutus     |          1           |       1       |      0      |    1   |    0    |    0
Caesar     |          1           |       1       |      0      |    1   |    1    |    1
Calpurnia  |          0           |       1       |      0      |    0   |    0    |    0
Cleopatra  |          1           |       0       |      0      |    0   |    0    |    0
mercy      |          1           |       0       |      1      |    1   |    1    |    1
worser     |          1           |       0       |      1      |    1   |    1    |    0

An entry is 1 if the play contains the word, 0 otherwise.

Query: Brutus AND Caesar BUT NOT Calpurnia
Incidence vectors
- So we have a 0/1 vector for each term.
- To answer the query:
  - take the vectors for Brutus: 110100, Caesar: 110111, Calpurnia (complemented): 101111
  - bitwise AND: 110100 AND 110111 AND 101111 = 100100
  - select the documents corresponding to 1
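The bitwise AND above can be reproduced directly, treating each incidence vector as a 6-bit integer (leftmost bit = Antony and Cleopatra). This is a minimal sketch; the variable names are illustrative.

```python
# Term incidence vectors from the matrix above, one bit per play
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Complement Calpurnia within the 6-play universe, then AND everything
mask = 0b111111
result = brutus & caesar & (~calpurnia & mask)

print(format(result, '06b'))  # 100100 -> Antony and Cleopatra, Hamlet
```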
Answers to query
- Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, / When Antony found Julius Caesar dead, / He cried almost to roaring; and he wept / When at Philippi he found Brutus slain.
- Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Bigger collection
- Consider N = 10^6 documents, each with about 1,000 tokens ⇒ a total of 10^9 tokens
- On average 6 bytes per token, including spaces and punctuation ⇒ the size of the document collection is about 6 × 10^9 bytes = 6 GB
- Assume there are M = 500,000 distinct terms in the collection (note that we are making a term/token distinction)
Can't build the incidence matrix
- M × N = 500,000 × 10^6 = half a trillion 0s and 1s.
- But the matrix has no more than one billion 1s.
- The matrix is extremely sparse.
- What is a better representation?
- We record only the 1s.
Inverted index
- For each term t, we must store a list of all documents that contain t.
- Identify each document by a docID, a document serial number.
- Can we use fixed-size arrays for this?
Brutus    → 1  2  4  11  31  45  173  174
Caesar    → 1  2  4  5   6   16  57   132
Calpurnia → 2  31 54 101

What happens if the word Caesar is added to document 14?

Sec. 1.2
Definition of inverted index
- A word-oriented mechanism for indexing a text collection in order to speed up the searching task
- Two elements:
  - Vocabulary: the set of all different words in the text
  - Occurrences (posting lists): for each word, a list of the locations where the term occurs
    - Document-based: a list of documents with the corresponding term frequency
    - Word-based: a list of documents with the corresponding word positions
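A document-based inverted index as just defined can be sketched in a few lines of Python: vocabulary = the dictionary keys, posting lists = lists of (docID, term frequency) pairs. The function name and the whitespace tokenization are illustrative assumptions.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """docs: dict mapping docID -> text.
    Returns a document-based inverted index: term -> [(docID, tf), ...]."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                      # keeps posting lists in docID order
        counts = Counter(docs[doc_id].lower().split())
        for term, tf in sorted(counts.items()):
            index[term].append((doc_id, tf))
    return index

idx = build_index({1: "a text has many words", 2: "words are made from letters"})
print(idx["words"])  # [(1, 1), (2, 1)]
```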
Document-based inverted index

[Figure: vocabulary (index terms computer, database, science, system, each with its document frequency df) pointing into posting lists of (Dj, tfj) pairs, e.g. (D1, 3), (D2, 4), (D5, 2), (D7, 4)]
Word-based inverted index
- The positions of the term in the document, wp_i, facilitate proximity searching

[Figure: vocabulary (index terms computer, database, science, system, each with its df) pointing into posting lists of (Dj, wp1, …, wpn) entries, e.g. (D1, 1, 100, 634) and (D7, 50, 90, 150, 800)]
Space requirements
- The space required for the vocabulary is rather small
- HEAPS' LAW: the vocabulary grows as O(n^b), where b ∈ [0,1] (usually [0.4, 0.6])
- The occurrences demand much more space
Occurrence space (relative to the original collection size):

                     | Small collection (1 MB) | Medium collection (200 MB) | Large collection (2 GB)
                     | No stopwords | All text | No stopwords | All text    | No stopwords | All text
Addressing words     |     45%      |   73%    |     36%      |   64%       |     35%      |   63%
Addressing documents |     19%      |   26%    |     18%      |   32%       |     26%      |   47%
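Heaps' law in its common form is V = K · n^b. A quick numeric sketch, where the constants K = 40 and b = 0.5 are illustrative assumptions (both are corpus-dependent and must be fitted empirically):

```python
def heaps_vocabulary(n, K=40.0, b=0.5):
    """Estimated vocabulary size V = K * n**b (K, b are illustrative, corpus-dependent)."""
    return K * n ** b

# For the 10^9-token collection of the earlier example:
print(round(heaps_vocabulary(10**9)))  # about 1.26 million distinct terms
```

Note that the estimate grows sublinearly: multiplying the collection size by 1,000 multiplies the vocabulary by only about 32 when b = 0.5.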
Contents: Introduction, Inverted Indices, Construction, Searching
Inverted index construction

Documents to be indexed: "Friends, Romans, countrymen."
  ↓ Tokenizer
Tokens: Friends  Romans  Countrymen
  ↓ Linguistic modules
Modified tokens: friend  roman  countryman
  ↓ Indexer
Inverted index: countryman → 13, 16;  friend → 2, 4;  roman → 1, 2
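The tokenizer → linguistic modules → indexer pipeline can be sketched as below. The `normalize` step here is a toy stand-in for real linguistic modules: lowercasing plus naive plural stripping (it handles "Friends" → "friend" but not irregular forms like "countrymen" → "countryman", which require a real stemmer or lemmatizer). All function names are illustrative.

```python
def tokenize(text):
    """Split on any non-alphanumeric character."""
    return ''.join(c if c.isalnum() else ' ' for c in text).split()

def normalize(token):
    """Toy linguistic module: lowercase + naive plural stripping (assumption)."""
    t = token.lower()
    return t[:-1] if t.endswith('s') else t

def index_docs(docs):
    """docs: docID -> text. Returns term -> sorted list of docIDs."""
    index = {}
    for doc_id, text in sorted(docs.items()):
        for term in sorted({normalize(t) for t in tokenize(text)}):
            index.setdefault(term, []).append(doc_id)
    return index
```

For example, `index_docs({2: "Friends, Romans, countrymen."})` produces posting lists for `friend` and `roman` containing document 2.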
Construction based on a trie
- All the vocabulary known so far is kept in a trie structure

Steps
1. Read each word of the text
2. Search for the word in the trie
3. If the word is not found in the trie, it is added to the trie with an empty list of occurrences
4. If the word is in the trie, the new position is added to the end of its list of occurrences

Once the text is exhausted, the trie is written to disk together with the lists of occurrences.

Complexity
- O(1) operations per text character (step 2)
- O(1) insertion of the position into the list of occurrences (steps 3 & 4)
- O(n) for the overall process
Splitting the index into two files
- Splitting the index into two files allows the vocabulary to be kept in memory at search time in many cases
- Posting file: the lists of occurrences are stored contiguously
- Vocabulary file: stores the vocabulary and, for each word, the number of documents associated with it and a pointer to its list in the posting file
An example

Vocabulary file (implemented as a sorted array):

Term      Start  n
boundary    0    1
case        1    2
computer    3    2
database    5    1
deliver     6    1
document    7    3
fan        10    3
play       13    1
position   14    1
science    15    2
System     17    1

Posting file (position: docID, frequency):

 0: 002,3   1: 001,2   2: 004,1   3: 001,1   4: 003,2
 5: 002,1   6: 003,1   7: 001,1   8: 002,1   9: 003,1
10: 001,1  11: 002,1  12: 004,2  13: 002,2  14: 001,1
15: 003,2  16: 004,1  17: 003,1

The posting list of "document" starts at position 7 and contains 3 entries: "document" is contained in documents 001, 002, 003, each with frequency 1.
Structures for the vocabulary file

To accelerate vocabulary searches:
- Sorted array
  - The vocabulary is stored in lexicographical order
  - Searched using a standard binary search
  - Complexity O(log2 |vocabulary|)
  - Disadvantage: updating is expensive
- B+-tree
  - The vocabulary is stored in a B+-tree
  - Disadvantage: a B+-tree uses more space than a sorted array
- Trie
- Hash
Construction using PARTIAL INDICES
- For large texts where the index does not fit in main memory

Construction steps
1. The algorithm already described is used until main memory is exhausted.
2. When no more memory is available, the partial index obtained so far is written to disk.
3. The partial index is erased from main memory.
4. Construction continues with the rest of the text.
5. The partial indices accumulated on disk are merged in a hierarchical fashion.
Merging partial indices

Merging step: given two partial indices I1 and I2
1. Merge the sorted vocabularies
   - Complexity O(|I1| + |I2|)
2. Whenever the same word appears in both indices, merge its two lists of occurrences
   - By construction, the occurrences of the smaller-numbered index precede those of the larger-numbered index
   - So we can simply concatenate the lists
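The merging step above is a standard two-pointer merge of sorted lists; when a term appears in both partial indices, the occurrence lists are concatenated (I1's first). A minimal sketch, assuming each partial index is a sorted list of (term, occurrence-list) pairs:

```python
def merge_indices(i1, i2):
    """Merge two partial indices, each a list of (term, occurrences)
    sorted by term. Occurrences of i1 precede those of i2 by construction,
    so shared terms get their lists concatenated. O(|i1| + |i2|)."""
    merged, a, b = [], 0, 0
    while a < len(i1) and b < len(i2):
        if i1[a][0] < i2[b][0]:
            merged.append(i1[a]); a += 1
        elif i1[a][0] > i2[b][0]:
            merged.append(i2[b]); b += 1
        else:  # same word in both indices: concatenate occurrence lists
            merged.append((i1[a][0], i1[a][1] + i2[b][1])); a += 1; b += 1
    merged.extend(i1[a:])
    merged.extend(i2[b:])
    return merged
```

For instance, merging an I1 containing `database` with an I2 that also contains `database` yields one `database` entry whose list is I1's occurrences followed by I2's.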
Merging partial indices

[Figure: merging two partial indices I1 (terms computer, database, system, with occurrences in D1, e.g. system → D1: 20, 50, 90; computer → D1: 30; database → D1: 100) and I2 (terms database, system, with later occurrences such as D1: 120; D2: 10; D2: 20; D2: 50). In the merged index I12 the vocabularies are merged, document frequencies are summed, and for each shared term the occurrence lists are concatenated with I1's entries before I2's.]
Merging partial indices (binary fashion)
- More than two indices can be merged
- Complexity:
  - n/M: number of partial indices
  - O(n) merging cost at each level
  - O(n log(n/M)) overall cost
- To reduce build-time space requirements, it is possible to perform the merging in place

Merge tree (numbers in parentheses give the merge order):

Level 1 (initial dumps):  I-1    I-2    I-3    I-4    I-5    I-6    I-7    I-8
Level 2:                  I-1..2 (1)    I-3..4 (2)    I-5..6 (4)    I-7..8 (5)
Level 3:                  I-1..4 (3)                  I-5..8 (6)
Level 4 (final index):    I-1..8 (7)
Popular indexing systems
- Indexing and search technology
- Enterprise search platforms
- Web crawlers
- Information platforms for the enterprise
- Open semantic search
Additional material
- Inside a Google data center: https://www.youtube.com/watch?v=PBx7rgqeGG8
- Inverted index compression
  - Stanford IR book chapter: http://nlp.stanford.edu/IR-book/html/htmledition/index-compression-1.html
  - Matteo Catena, Craig Macdonald, Iadh Ounis, "On Inverted Index Compression for Search Engine", ECIR 2014 (available online)
- Inverted index distribution
  - Apache SolrCloud: https://cwiki.apache.org/confluence/display/solr/How+SolrCloud+Works
Contents: Introduction, Inverted Indices, Construction, Searching
Three general steps

1. Vocabulary search
   - The patterns and words present in the query are isolated and searched in the vocabulary
   - Phrases and proximity queries are split into single words
   - The cost depends on the structure used:
     - Trie or hash: O(m), thus independent of the text size
     - B-tree or sorted array: O(log n)
2. Retrieval of occurrences
   - The lists of occurrences of all the words found are retrieved
3. Manipulation of occurrences
   - The occurrences are processed to solve basic queries, phrases, proximity, or Boolean operations
Single-word retrieval
- Single-word queries can be searched using any suitable data structure that speeds up the search in the vocabulary file
- Prefix and range queries can be solved with binary search, tries, or B-trees, but not with hashing
Boolean retrieval

Given a Boolean query:
- Parse the query into its syntax tree
  - Leaves: basic queries
  - Internal nodes: operators
- Solve the leaves of the query syntax tree using the appropriate algorithm
- Combine the relevant documents with the composition operators:
  - OR: recursively retrieve e1 and e2 and take the union of the results
  - AND: recursively retrieve e1 and e2 and take the intersection of the results
  - BUT: recursively retrieve e1 and e2 and take the set difference of the results
- Optimization
  - Algebraic optimization, e.g. a OR (a AND b) = a
  - Keep the sets sorted, so intersection, union, etc. can proceed sequentially over both lists
Q: database OR (science AND case)?

        OR
       /  \
database   AND
          /    \
     science    case
Intersecting two postings lists (a "merge" algorithm)
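The merge-style intersection of two sorted posting lists can be sketched as follows; it walks both lists once, so it runs in O(|p1| + |p2|).

```python
def intersect(p1, p2):
    """Intersect two posting lists sorted by docID, merge-style."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists: keep it
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:         # advance the pointer on the smaller docID
            i += 1
        else:
            j += 1
    return answer

# Using the Brutus and Caesar posting lists from the earlier slide:
print(intersect([1, 2, 4, 11, 31, 45, 173, 174],
                [1, 2, 4, 5, 6, 16, 57, 132]))  # [1, 2, 4]
```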
Boolean retrieval

Vocabulary and posting files (same example as before):

Term      Start  n
boundary    0    1
case        1    2
computer    3    2
database    5    1
deliver     6    1
document    7    3
fan        10    3
play       13    1
position   14    1
science    15    2
System     17    1

Posting file (position: docID, frequency):

 0: 002,3   1: 001,2   2: 004,1   3: 001,1   4: 003,2
 5: 002,1   6: 003,1   7: 001,1   8: 002,1   9: 003,1
10: 001,1  11: 002,1  12: 004,2  13: 002,2  14: 001,1
15: 003,2  16: 004,1  17: 003,1

Q: database OR (science AND case)?

database OR (science AND case)
⇒ database OR ({003, 004} ∩ {001, 004})
⇒ {002} ∪ {004}
⇒ {002, 004}
Phrasal retrieval
- A word-based inverted file is better suited:
  1. Retrieve documents and positions for each individual word
  2. Intersect the documents
  3. Check for ordered contiguity of the keyword positions
- It is best to start the contiguity check with the least common word in the phrase
- Simple and effective optimization: process the words in order of increasing frequency
Searching: phrasal retrieval

1. Find the set of documents D in which all keywords (k1 … km) of the phrase occur (using AND query processing)
2. Initialize an empty set R of retrieved documents
3. For each document d in D:
   1. Get the array Pi of positions of occurrences of each ki in d
   2. Find the shortest array Ps among the Pi
   3. For each position p of keyword ks in Ps:
      - For each keyword ki except ks:
        1. Use binary search to look for the position (p + i − s) in the array Pi
   4. If the correct position is found for every keyword, add d to R
4. Return R
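The algorithm above can be sketched in Python using `bisect` for the binary-search step. Keywords are 0-indexed here, which leaves the offset (p + i − s) unchanged since only the difference i − s matters; data-structure shape and names are illustrative assumptions.

```python
from bisect import bisect_left

def phrase_search(positions, keywords):
    """positions: doc -> {keyword: sorted list of word positions}.
    Returns the documents containing `keywords` as a contiguous phrase."""
    result = []
    for d, pos in positions.items():
        if not all(k in pos for k in keywords):     # step 1: AND filter
            continue
        # step 3.2: pick the keyword with the shortest position array
        s = min(range(len(keywords)), key=lambda i: len(pos[keywords[i]]))
        for p in pos[keywords[s]]:                  # step 3.3: scan shortest list
            found_all = True
            for i, k in enumerate(keywords):
                if i == s:
                    continue
                target = p + i - s                  # expected position of k_i
                j = bisect_left(pos[k], target)     # step 3.3.1: binary search
                if j == len(pos[k]) or pos[k][j] != target:
                    found_all = False
                    break
            if found_all:                           # step 3.4
                result.append(d)
                break
    return result
```

With the example that follows (many → {3}, words → {4, 5}), the query "many words" matches at p = 3 because position 4 is found in P2.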
Searching: phrasal retrieval

Text: "This is a text. A text has many words. Words are made from letters."
Indexed word positions: text: 1, 2;  many: 3;  words: 4, 5;  made: 6;  letters: 7

Index (term → df: positions):
letters → 1: 7
made    → 1: 6
many    → 1: 3
text    → 1: 1, 2
words   → 1: 4, 5

Q: "many words"?
P1 (many) = {3}, P2 (words) = {4, 5};  |P1| ≤ |P2|, so Ps = P1
Step 3.3: take p = 3
Step 3.3.1: use binary search to find the position (p + i − s) = (3 + 2 − 1) = 4 in the array P2 — found
Step 4: add the document to R
Searching: proximity retrieval
- Use an approach similar to phrasal search to find documents in which all keywords occur in a context that satisfies the proximity constraints
- During the binary search for the positions of the remaining keywords, find the position of ki closest to p and check that it is within the maximum allowed distance
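The "closest position within the allowed distance" check can be sketched with a single binary search: the nearest occurrence of ki to the anchor position p is one of the two neighbours of p's insertion point. The function name and argument shapes are illustrative.

```python
from bisect import bisect_left

def within_distance(p_anchor, positions, max_dist):
    """True if some value in `positions` (a sorted list of word positions)
    lies within max_dist of p_anchor."""
    j = bisect_left(positions, p_anchor)
    # Only the neighbours straddling the insertion point can be closest
    for cand in positions[max(0, j - 1): j + 1]:
        if abs(cand - p_anchor) <= max_dist:
            return True
    return False
```

For example, with anchor position 10 and occurrences [2, 8, 30], a proximity window of 3 succeeds (|8 − 10| = 2), while with occurrences [2, 30] and a window of 5 it fails.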