TRANSCRIPT
Gestione Avanzata dell'Informazione — Part A: Full-Text Information Management
Full-Text Indexing
Contents: Introduction, Inverted Indices, Construction, Searching
GAvI - Full-Text Information Management: Full-Text Indexing
Sequential or online searching
- Involves finding the occurrences of a pattern in a text when the text is not preprocessed
- Is appropriate when the text is small
- Is the only choice if the text collection is very volatile (i.e. undergoes modifications very frequently), or the index space overhead cannot be afforded
Indexed searching
Index
- A data structure over the text to speed up the search
- Appropriate when the text collection is large and semi-static
- Semi-static collection: updated at reasonably regular intervals, but not expected to support thousands of insertions of single words per second

Indexing techniques
- Inverted indices, suffix arrays, and signature files
- Evaluated by:
  - search cost
  - space overhead
  - cost of building and updating the indexing structures
Indexing techniques

Inverted indices
- Word-oriented mechanism for indexing a text collection
- Composed of a vocabulary and occurrences
- Currently the best choice for most applications

Suffix arrays
- Faster for phrase searches and other less common queries
- Harder to build and maintain

Signature files
- Word-oriented index structures based on hashing
- Were popular in the 1980s
Notations and Background

Notations
- n: the size of the text database
- m: pattern length
- M: amount of main memory available

Background
- sorted arrays
- binary search tree
- B-tree
- hash table
- trie
Trie (TRIE or PREFIX TREE)
- An in-memory multiway tree storing a set of strings
- Strings are stored in the leaves
- Every edge of the trie is labelled with a letter
- All the descendants of a node share a common prefix: the sequence of letters from the root to the node
Example text: "This is a text. A text has many words. Words are made from letters."
Character positions of each word: 1 6 9 11 17 19 24 28 33 40 46 50 55 60

Strings to be stored and the corresponding starting positions:
letters: 60; made: 50; many: 28; text: 11, 19; words: 33, 40

[Figure: trie over these strings, with root edges labelled 'l', 'm', 't', 'w' (and, under 'm'-'a', the edges 'd' and 'n'); the leaves store letters:60, made:50, many:28, text:11,19, words:33,40]
Trie (cont.)

Construction
- The root of the trie discriminates on the first character
- The children of the root discriminate on the second character, and so on
- If the remaining subtrie contains only one string, that string's identity is stored in a leaf node

Access
- Start from the root
- Follow the path given by the character sequence of the pattern
- Stop when a leaf is found or no character matches the current one
- The access cost is O(m)
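The construction and access steps above can be sketched in Python. This is a minimal uncompressed trie (it stores occurrence positions at word-end nodes rather than collapsing single-string subtries into leaves, a simplification of the slide's scheme); the class and method names are illustrative, not from the slides.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.positions = []  # starting positions, non-empty only where a word ends

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, position):
        """Follow/create one node per character, then record the occurrence."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.positions.append(position)

    def search(self, word):
        """Access cost O(m): one child lookup per pattern character."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                return None   # no character matches the current one
            node = node.children[ch]
        return node.positions or None
```

With the example strings, `Trie().insert("text", 11)` etc. makes `search("text")` return `[11, 19]` after both occurrences are inserted, while `search("xyz")` returns `None`.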
Contents: Introduction, Inverted Indices, Construction, Searching
Unstructured data in 1680
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.
- Why is that not the answer?
  - Slow (for large corpora)
  - NOT Calpurnia is non-trivial
  - Other operations (e.g., find the word Romans near countrymen) are not feasible
Sec. 1.1
Term-document incidence

Term       | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
-----------+----------------------+---------------+-------------+--------+---------+--------
Antony     |          1           |       1       |      0      |    0   |    0    |    1
Brutus     |          1           |       1       |      0      |    1   |    0    |    0
Caesar     |          1           |       1       |      0      |    1   |    1    |    1
Calpurnia  |          0           |       1       |      0      |    0   |    0    |    0
Cleopatra  |          1           |       0       |      0      |    0   |    0    |    0
mercy      |          1           |       0       |      1      |    1   |    1    |    1
worser     |          1           |       0       |      1      |    1   |    1    |    0

An entry is 1 if the play contains the word, 0 otherwise.

Query: Brutus AND Caesar BUT NOT Calpurnia
Incidence vectors
- So we have a 0/1 vector for each term.
- To answer the query:
  - take the vectors for Brutus: 110100, Caesar: 110111, Calpurnia (complemented): 101111
  - bitwise AND: 110100 AND 110111 AND 101111 = 100100
  - select the documents corresponding to 1
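The bitwise AND above can be reproduced directly, treating each incidence vector as a 6-bit integer (leftmost bit = Antony and Cleopatra). This is a minimal sketch; the variable names are illustrative.

```python
# Term incidence vectors from the matrix above, one bit per play
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Complement Calpurnia within the 6-play universe, then AND everything
mask = 0b111111
result = brutus & caesar & (~calpurnia & mask)

print(format(result, '06b'))  # 100100 -> Antony and Cleopatra, Hamlet
```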
Answers to query
- Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, / When Antony found Julius Caesar dead, / He cried almost to roaring; and he wept / When at Philippi he found Brutus slain.
- Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Bigger collection
- Consider N = 10^6 documents, each with about 1,000 tokens ⇒ a total of 10^9 tokens
- On average 6 bytes per token, including spaces and punctuation ⇒ the size of the document collection is about 6 × 10^9 bytes = 6 GB
- Assume there are M = 500,000 distinct terms in the collection (note that we are making a term/token distinction)
Can't build the incidence matrix
- M × N = 500,000 × 10^6 = half a trillion 0s and 1s.
- But the matrix has no more than one billion 1s.
- The matrix is extremely sparse.
- What is a better representation?
- We record only the 1s.
Inverted index
- For each term t, we must store a list of all documents that contain t.
- Identify each document by a docID, a document serial number.
- Can we use fixed-size arrays for this?
Brutus    → 1  2  4  11  31  45  173  174
Caesar    → 1  2  4  5   6   16  57   132
Calpurnia → 2  31 54 101

What happens if the word Caesar is added to document 14?

Sec. 1.2
Definition of inverted index
- A word-oriented mechanism for indexing a text collection in order to speed up the searching task
- Two elements:
  - Vocabulary: the set of all different words in the text
  - Occurrences (posting lists): for each word, a list of the locations where the term occurs
    - Document-based: a list of documents with the corresponding term frequency
    - Word-based: a list of documents with the corresponding word positions
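A document-based inverted index as just defined can be sketched in a few lines of Python: vocabulary = the dictionary keys, posting lists = lists of (docID, term frequency) pairs. The function name and the whitespace tokenization are illustrative assumptions.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """docs: dict mapping docID -> text.
    Returns a document-based inverted index: term -> [(docID, tf), ...]."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                      # keeps posting lists in docID order
        counts = Counter(docs[doc_id].lower().split())
        for term, tf in sorted(counts.items()):
            index[term].append((doc_id, tf))
    return index

idx = build_index({1: "a text has many words", 2: "words are made from letters"})
print(idx["words"])  # [(1, 1), (2, 1)]
```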
Document-based inverted index

[Figure: vocabulary (index terms computer, database, science, system, each with its document frequency df) pointing into posting lists of (Dj, tfj) pairs, e.g. (D1, 3), (D2, 4), (D5, 2), (D7, 4)]
Word-based inverted index
- The positions of the term in the document, wp_i, facilitate proximity searching

[Figure: vocabulary (index terms computer, database, science, system, each with its df) pointing into posting lists of (Dj, wp1, …, wpn) entries, e.g. (D1, 1, 100, 634) and (D7, 50, 90, 150, 800)]
Space requirements
- The space required for the vocabulary is rather small
- HEAPS' LAW: the vocabulary grows as O(n^b), where b ∈ [0,1] (usually [0.4, 0.6])
- The occurrences demand much more space
Occurrence space (relative to the original collection size):

                     | Small collection (1 MB) | Medium collection (200 MB) | Large collection (2 GB)
                     | No stopwords | All text | No stopwords | All text    | No stopwords | All text
Addressing words     |     45%      |   73%    |     36%      |   64%       |     35%      |   63%
Addressing documents |     19%      |   26%    |     18%      |   32%       |     26%      |   47%
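Heaps' law in its common form is V = K · n^b. A quick numeric sketch, where the constants K = 40 and b = 0.5 are illustrative assumptions (both are corpus-dependent and must be fitted empirically):

```python
def heaps_vocabulary(n, K=40.0, b=0.5):
    """Estimated vocabulary size V = K * n**b (K, b are illustrative, corpus-dependent)."""
    return K * n ** b

# For the 10^9-token collection of the earlier example:
print(round(heaps_vocabulary(10**9)))  # about 1.26 million distinct terms
```

Note that the estimate grows sublinearly: multiplying the collection size by 1,000 multiplies the vocabulary by only about 32 when b = 0.5.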
Contents: Introduction, Inverted Indices, Construction, Searching
Inverted index construction

Documents to be indexed: "Friends, Romans, countrymen."
  ↓ Tokenizer
Tokens: Friends  Romans  Countrymen
  ↓ Linguistic modules
Modified tokens: friend  roman  countryman
  ↓ Indexer
Inverted index: countryman → 13, 16;  friend → 2, 4;  roman → 1, 2
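The tokenizer → linguistic modules → indexer pipeline can be sketched as below. The `normalize` step here is a toy stand-in for real linguistic modules: lowercasing plus naive plural stripping (it handles "Friends" → "friend" but not irregular forms like "countrymen" → "countryman", which require a real stemmer or lemmatizer). All function names are illustrative.

```python
def tokenize(text):
    """Split on any non-alphanumeric character."""
    return ''.join(c if c.isalnum() else ' ' for c in text).split()

def normalize(token):
    """Toy linguistic module: lowercase + naive plural stripping (assumption)."""
    t = token.lower()
    return t[:-1] if t.endswith('s') else t

def index_docs(docs):
    """docs: docID -> text. Returns term -> sorted list of docIDs."""
    index = {}
    for doc_id, text in sorted(docs.items()):
        for term in sorted({normalize(t) for t in tokenize(text)}):
            index.setdefault(term, []).append(doc_id)
    return index
```

For example, `index_docs({2: "Friends, Romans, countrymen."})` produces posting lists for `friend` and `roman` containing document 2.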
Construction based on a trie
- All the vocabulary known so far is kept in a trie structure

Steps
1. Read each word of the text
2. Search for the word in the trie
3. If the word is not found in the trie, it is added to the trie with an empty list of occurrences
4. If the word is in the trie, the new position is added to the end of its list of occurrences

Once the text is exhausted, the trie is written to disk together with the lists of occurrences.

Complexity
- O(1) operations per text character (step 2)
- O(1) insertion of the position into the list of occurrences (steps 3 & 4)
- O(n) for the overall process
Splitting the index into two files
- Splitting the index into two files allows the vocabulary to be kept in memory at search time in many cases
- Posting file: the lists of occurrences are stored contiguously
- Vocabulary file: stores the vocabulary and, for each word, the number of documents associated with it and a pointer to its list in the posting file
An example

Vocabulary file (implemented as a sorted array):

Term      Start  n
boundary    0    1
case        1    2
computer    3    2
database    5    1
deliver     6    1
document    7    3
fan        10    3
play       13    1
position   14    1
science    15    2
System     17    1

Posting file (position: docID, frequency):

 0: 002,3   1: 001,2   2: 004,1   3: 001,1   4: 003,2
 5: 002,1   6: 003,1   7: 001,1   8: 002,1   9: 003,1
10: 001,1  11: 002,1  12: 004,2  13: 002,2  14: 001,1
15: 003,2  16: 004,1  17: 003,1

The posting list of "document" starts at position 7 and contains 3 entries: "document" is contained in documents 001, 002, 003, each with frequency 1.
Structures for the vocabulary file

To accelerate vocabulary searches:
- Sorted array
  - The vocabulary is stored in lexicographical order
  - Searched using a standard binary search
  - Complexity O(log2 |vocabulary|)
  - Disadvantage: updating is expensive
- B+-tree
  - The vocabulary is stored in a B+-tree
  - Disadvantage: a B+-tree uses more space than a sorted array
- Trie
- Hash
Construction using PARTIAL INDICES
- For large texts where the index does not fit in main memory

Construction steps
1. The algorithm already described is used until main memory is exhausted.
2. When no more memory is available, the partial index obtained so far is written to disk.
3. The partial index is erased from main memory.
4. Construction continues with the rest of the text.
5. The partial indices accumulated on disk are merged in a hierarchical fashion.
Merging partial indices

Merging step: given two partial indices I1 and I2
1. Merge the sorted vocabularies
   - Complexity O(|I1| + |I2|)
2. Whenever the same word appears in both indices, merge its two lists of occurrences
   - By construction, the occurrences of the smaller-numbered index precede those of the larger-numbered index
   - So we can simply concatenate the lists
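The merging step above is a standard two-pointer merge of sorted lists; when a term appears in both partial indices, the occurrence lists are concatenated (I1's first). A minimal sketch, assuming each partial index is a sorted list of (term, occurrence-list) pairs:

```python
def merge_indices(i1, i2):
    """Merge two partial indices, each a list of (term, occurrences)
    sorted by term. Occurrences of i1 precede those of i2 by construction,
    so shared terms get their lists concatenated. O(|i1| + |i2|)."""
    merged, a, b = [], 0, 0
    while a < len(i1) and b < len(i2):
        if i1[a][0] < i2[b][0]:
            merged.append(i1[a]); a += 1
        elif i1[a][0] > i2[b][0]:
            merged.append(i2[b]); b += 1
        else:  # same word in both indices: concatenate occurrence lists
            merged.append((i1[a][0], i1[a][1] + i2[b][1])); a += 1; b += 1
    merged.extend(i1[a:])
    merged.extend(i2[b:])
    return merged
```

For instance, merging an I1 containing `database` with an I2 that also contains `database` yields one `database` entry whose list is I1's occurrences followed by I2's.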
Merging partial indices

[Figure: merging two partial indices I1 (terms computer, database, system, with occurrences in D1, e.g. system → D1: 20, 50, 90; computer → D1: 30; database → D1: 100) and I2 (terms database, system, with later occurrences such as D1: 120; D2: 10; D2: 20; D2: 50). In the merged index I12 the vocabularies are merged, document frequencies are summed, and for each shared term the occurrence lists are concatenated with I1's entries before I2's.]
Merging partial indices (binary fashion)
- More than two indices can be merged
- Complexity:
  - n/M: number of partial indices
  - O(n) merging cost at each level
  - O(n log(n/M)) overall cost
- To reduce build-time space requirements, it is possible to perform the merging in place

Merge tree (numbers in parentheses give the merge order):

Level 1 (initial dumps):  I-1    I-2    I-3    I-4    I-5    I-6    I-7    I-8
Level 2:                  I-1..2 (1)    I-3..4 (2)    I-5..6 (4)    I-7..8 (5)
Level 3:                  I-1..4 (3)                  I-5..8 (6)
Level 4 (final index):    I-1..8 (7)
Popular indexing systems
- Indexing and search technology
- Enterprise search platforms
- Web crawlers
- Information platforms for the enterprise
- Open semantic search
Additional material
- Inside a Google data center: https://www.youtube.com/watch?v=PBx7rgqeGG8
- Inverted index compression
  - Stanford IR book chapter: http://nlp.stanford.edu/IR-book/html/htmledition/index-compression-1.html
  - Matteo Catena, Craig Macdonald, Iadh Ounis, "On Inverted Index Compression for Search Engine", ECIR 2014 (available online)
- Inverted index distribution
  - Apache SolrCloud: https://cwiki.apache.org/confluence/display/solr/How+SolrCloud+Works
Contents: Introduction, Inverted Indices, Construction, Searching
Three general steps

1. Vocabulary search
   - The patterns and words present in the query are isolated and searched in the vocabulary
   - Phrases and proximity queries are split into single words
   - The cost depends on the structure used:
     - Trie or hash: O(m), thus independent of the text size
     - B-tree or sorted array: O(log n)
2. Retrieval of occurrences
   - The lists of occurrences of all the words found are retrieved
3. Manipulation of occurrences
   - The occurrences are processed to solve basic queries, phrases, proximity, or Boolean operations
Single-word retrieval
- Single-word queries can be searched using any suitable data structure that speeds up the search in the vocabulary file
- Prefix and range queries can be solved with binary search, tries, or B-trees, but not with hashing
Boolean retrieval

Given a Boolean query:
- Parse the query into its syntax tree
  - Leaves: basic queries
  - Internal nodes: operators
- Solve the leaves of the query syntax tree using the appropriate algorithm
- Combine the relevant documents with the composition operators:
  - OR: recursively retrieve e1 and e2 and take the union of the results
  - AND: recursively retrieve e1 and e2 and take the intersection of the results
  - BUT: recursively retrieve e1 and e2 and take the set difference of the results
- Optimization
  - Algebraic optimization, e.g. a OR (a AND b) = a
  - Keep the sets sorted, so intersection, union, etc. can proceed sequentially over both lists
Q: database OR (science AND case)?

        OR
       /  \
database   AND
          /    \
     science    case
Intersecting two postings lists (a "merge" algorithm)
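The merge-style intersection of two sorted posting lists can be sketched as follows; it walks both lists once, so it runs in O(|p1| + |p2|).

```python
def intersect(p1, p2):
    """Intersect two posting lists sorted by docID, merge-style."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists: keep it
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:         # advance the pointer on the smaller docID
            i += 1
        else:
            j += 1
    return answer

# Using the Brutus and Caesar posting lists from the earlier slide:
print(intersect([1, 2, 4, 11, 31, 45, 173, 174],
                [1, 2, 4, 5, 6, 16, 57, 132]))  # [1, 2, 4]
```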
Boolean retrieval

Vocabulary and posting files (same example as before):

Term      Start  n
boundary    0    1
case        1    2
computer    3    2
database    5    1
deliver     6    1
document    7    3
fan        10    3
play       13    1
position   14    1
science    15    2
System     17    1

Posting file (position: docID, frequency):

 0: 002,3   1: 001,2   2: 004,1   3: 001,1   4: 003,2
 5: 002,1   6: 003,1   7: 001,1   8: 002,1   9: 003,1
10: 001,1  11: 002,1  12: 004,2  13: 002,2  14: 001,1
15: 003,2  16: 004,1  17: 003,1

Q: database OR (science AND case)?

database OR (science AND case)
⇒ database OR ({003, 004} ∩ {001, 004})
⇒ {002} ∪ {004}
⇒ {002, 004}
Phrasal retrieval
- A word-based inverted file is better suited:
  1. Retrieve documents and positions for each individual word
  2. Intersect the documents
  3. Check for ordered contiguity of the keyword positions
- It is best to start the contiguity check with the least common word in the phrase
- Simple and effective optimization: process the words in order of increasing frequency
Searching: phrasal retrieval

1. Find the set of documents D in which all keywords (k1 … km) of the phrase occur (using AND query processing)
2. Initialize an empty set R of retrieved documents
3. For each document d in D:
   1. Get the array Pi of positions of occurrences of each ki in d
   2. Find the shortest array Ps among the Pi
   3. For each position p of keyword ks in Ps:
      - For each keyword ki except ks:
        1. Use binary search to look for the position (p + i − s) in the array Pi
   4. If the correct position is found for every keyword, add d to R
4. Return R
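The algorithm above can be sketched in Python using `bisect` for the binary-search step. Keywords are 0-indexed here, which leaves the offset (p + i − s) unchanged since only the difference i − s matters; data-structure shape and names are illustrative assumptions.

```python
from bisect import bisect_left

def phrase_search(positions, keywords):
    """positions: doc -> {keyword: sorted list of word positions}.
    Returns the documents containing `keywords` as a contiguous phrase."""
    result = []
    for d, pos in positions.items():
        if not all(k in pos for k in keywords):     # step 1: AND filter
            continue
        # step 3.2: pick the keyword with the shortest position array
        s = min(range(len(keywords)), key=lambda i: len(pos[keywords[i]]))
        for p in pos[keywords[s]]:                  # step 3.3: scan shortest list
            found_all = True
            for i, k in enumerate(keywords):
                if i == s:
                    continue
                target = p + i - s                  # expected position of k_i
                j = bisect_left(pos[k], target)     # step 3.3.1: binary search
                if j == len(pos[k]) or pos[k][j] != target:
                    found_all = False
                    break
            if found_all:                           # step 3.4
                result.append(d)
                break
    return result
```

With the example that follows (many → {3}, words → {4, 5}), the query "many words" matches at p = 3 because position 4 is found in P2.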
Searching: phrasal retrieval

Text: "This is a text. A text has many words. Words are made from letters."
Indexed word positions: text: 1, 2;  many: 3;  words: 4, 5;  made: 6;  letters: 7

Index (term → df: positions):
letters → 1: 7
made    → 1: 6
many    → 1: 3
text    → 1: 1, 2
words   → 1: 4, 5

Q: "many words"?
P1 (many) = {3}, P2 (words) = {4, 5};  |P1| ≤ |P2|, so Ps = P1
Step 3.3: take p = 3
Step 3.3.1: use binary search to find the position (p + i − s) = (3 + 2 − 1) = 4 in the array P2 — found
Step 4: add the document to R
Searching: proximity retrieval
- Use an approach similar to phrasal search to find documents in which all keywords occur in a context that satisfies the proximity constraints
- During the binary search for the positions of the remaining keywords, find the position of ki closest to p and check that it is within the maximum allowed distance
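The "closest position within the allowed distance" check can be sketched with a single binary search: the nearest occurrence of ki to the anchor position p is one of the two neighbours of p's insertion point. The function name and argument shapes are illustrative.

```python
from bisect import bisect_left

def within_distance(p_anchor, positions, max_dist):
    """True if some value in `positions` (a sorted list of word positions)
    lies within max_dist of p_anchor."""
    j = bisect_left(positions, p_anchor)
    # Only the neighbours straddling the insertion point can be closest
    for cand in positions[max(0, j - 1): j + 1]:
        if abs(cand - p_anchor) <= max_dist:
            return True
    return False
```

For example, with anchor position 10 and occurrences [2, 8, 30], a proximity window of 3 succeeds (|8 − 10| = 2), while with occurrences [2, 30] and a window of 5 it fails.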