TRANSCRIPT
ESTER: Efficient Search on Text, Entities, and Relations
Holger Bast, Max-Planck-Institut für Informatik
Saarbrücken, Germany
joint work with
Alexandru Chitea, Fabian Suchanek, Ingmar Weber
Talk at SIGIR’07 in Amsterdam, July 26th
ESTER
It’s about: Fast Semantic Search
Keyword Search vs. Semantic Search
Keyword search
– Query: john lennon
– Answer: documents containing the words john and lennon
Semantic search
– Query: musician
– Answer: documents containing an instance of musician
Combined search
– Query: beatles musician
– Answer: documents containing the word beatles and an instance of musician
Useful by itself or as a component of a QA system
Semantic Search: Challenges + Our System
1. Entity recognition
– approach 1: let users annotate (semantic web)
– approach 2: annotate (semi-)automatically
– our system: uses Wikipedia links + learns from them
2. Query processing
– build a space-efficient index
– which enables fast query answers
– our system: as compact and fast as a standard full-text engine
3. User interface
– easy to use
– yet powerful query capabilities
– our system: standard interface with interactive suggestions
[Item 2, query processing, is the focus of the paper and of this talk]
In the Rest of this Talk …
Efficiency
– three simple ideas (which all fail)
– our approach (which works)
Queries supported
– essentially all SPARQL queries, and
– seamless integration with ordinary full-text search
Experiments
– efficiency (great)
– quality (not so great yet)
Conclusions
– lots of interesting + challenging open problems
Efficiency: Simple Idea 1
Add “semantic tags” to the document
– e.g., add the special word tag:musician before every occurrence of a musician in a document
Problem 1: Index blowup
– e.g., John Lennon is a: Musician, Singer, Composer, Artist, Vegetarian, Person, Pacifist, … (28 classes)
Problem 2: Limited querying capabilities
– e.g., could not produce a list of musicians that occur in documents that also contain the word beatles
– in particular, could not do all SPARQL queries (more on that later)
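The blowup in Problem 1 is easy to see on a toy example. The sketch below is not ESTER code; the class list and tag format are illustrative (only 7 of John Lennon’s 28 classes are shown), but it makes concrete what inserting one tag word per class does to the token stream:

```python
# Illustrative sketch of "Simple Idea 1": insert a tag:<class> word before
# every recognized entity occurrence. Toy class list, not the real ontology.

CLASSES = {"john lennon": ["musician", "singer", "composer", "artist",
                           "vegetarian", "person", "pacifist"]}  # 7 of 28 classes

def tag_document(tokens):
    """Return the token stream with tag words inserted before entities."""
    out = []
    for i, token in enumerate(tokens):
        bigram = " ".join(tokens[i:i + 2]).lower()
        if bigram in CLASSES:
            out.extend("tag:" + c for c in CLASSES[bigram])
        out.append(token)
    return out

doc = "legend says that John Lennon of the Beatles smoked Gitanes".split()
tagged = tag_document(doc)
blowup = len(tagged) / len(doc)   # index grows by one posting per class
```

With all 28 classes (and denser entity mentions), the occurrence count — and hence the index — grows far more than this toy factor suggests.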
Efficiency: Simple Idea 2
Query Expansion
– e.g., replace query word musician by disjunction
musician:aaron_copland OR … OR musician:zarah_leander
(7,593 musicians in Wikipedia)
Problem: Inefficient query processing
– one intersection per element of the disjunction needed
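The inefficiency of Simple Idea 2 is visible even on toy data: the sketch below (made-up postings, three musicians standing in for Wikipedia’s 7,593) performs one intersection per disjunct, exactly the cost the slide points out:

```python
# Sketch of "Simple Idea 2": expand the class word musician into a
# disjunction over all its instances. Toy posting lists, not a real index.

musicians = ["aaron_copland", "john_lennon", "zarah_leander"]  # 7,593 in Wikipedia
postings = {                       # word -> set of document ids
    "beatles": {1, 2, 5},
    "musician:aaron_copland": {9},
    "musician:john_lennon": {2, 5},
    "musician:zarah_leander": {7},
}

def expanded_query(index, word, class_instances):
    """word AND (m1 OR m2 OR ...): one intersection per disjunct."""
    hits = set()
    for m in class_instances:
        hits |= index[word] & index["musician:" + m]
    return hits

result = expanded_query(postings, "beatles", musicians)
```

With thousands of disjuncts per class word, this per-instance intersection dominates query time.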
Efficiency: Simple Idea 3
Use a database
– map semantic queries to SQL queries on suitably constructed tables
– that’s what the Artificial-Intelligence / Semantic-Web people usually do
Problem: Inefficient + Lack of control
– building a search engine on top of an off-the-shelf database is orders of magnitude slower or uses orders of magnitude more space, or both
– very limited control regarding efficiency aspects
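For concreteness, here is one plausible way the query "beatles musician" maps to SQL, as Simple Idea 3 suggests. The schema and data are invented for illustration (an occurrence table for the text, an is_a table for the ontology); the point is the shape of the query, not any particular engine:

```python
# Hedged sketch of "Simple Idea 3": the semantic query "beatles musician"
# as a SQL join over hand-constructed tables. Schema and data are toy.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE occurs (doc INTEGER, word TEXT);   -- full text, one row per word
CREATE TABLE is_a   (entity TEXT, class TEXT);  -- ontology facts
""")
con.executemany("INSERT INTO occurs VALUES (?, ?)",
                [(1, "beatles"), (1, "john_lennon"), (2, "john_lennon"),
                 (3, "beatles"), (3, "gitanes")])
con.executemany("INSERT INTO is_a VALUES (?, ?)",
                [("john_lennon", "musician"), ("john_lennon", "singer")])

# musicians occurring in documents that also contain the word beatles
rows = con.execute("""
    SELECT DISTINCT o2.word
    FROM occurs o1
    JOIN occurs o2 ON o1.doc = o2.doc
    JOIN is_a    i  ON i.entity = o2.word
    WHERE o1.word = 'beatles' AND i.class = 'musician'
""").fetchall()
```

A general-purpose engine has to plan and execute such self-joins over a billion-row occurrence table, which is where the orders-of-magnitude gap comes from.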
Efficiency: Our Approach
Two basic operations
– prefix search of a special kind [will be explained by example]
– join [will be explained by example]
An index data structure
– which supports these two operations efficiently
Artificial words in the documents
– such that a large class of semantic queries reduces to a combination of (few of) these operations
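The two operations can be sketched on a toy index. This is a simplification of the paper’s scheme (flat (doc, position, word) postings, no compression, no scoring); the artificial words mimic the ontology documents shown on the next slides:

```python
# Hedged sketch of ESTER's two operations on a toy index: a prefix search
# returning matching completions, and a join of two completion lists.

postings = [  # (doc, position, word); doc 9 is an "ontology document"
    (1, 3, "beatles"), (1, 5, "entity:john_lennon"),
    (2, 1, "entity:liverpool"), (2, 4, "beatles"),
    (9, 0, "entity:john_lennon"), (9, 1, "relation:is_a"), (9, 2, "class:musician"),
]

def docs_with(word):
    return {d for d, p, w in postings if w == word}

def prefix_search(prefix, docs=None):
    """Completions of `prefix`, optionally restricted to the given docs."""
    return {w for d, p, w in postings
            if w.startswith(prefix) and (docs is None or d in docs)}

# query "beatles musician" as two prefix searches + one join:
ents_near_beatles = prefix_search("entity:", docs_with("beatles"))
ents_musicians = prefix_search("entity:", docs_with("class:musician"))
musicians_near_beatles = ents_near_beatles & ents_musicians   # the join
```

The design choice is that both operations work on sorted lists, so the join is a merge rather than thousands of separate intersections.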
Processing the query “beatles musician”
Document “Gitanes”:
… legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …
Document “John Lennon” (ontology document; the numbers are word positions):
0 entity:john_lennon 1 relation:is_a 2 class:musician 2 class:singer …
Two prefix queries:
– beatles entity:* gives entity:john_lennon, entity:1964, entity:liverpool, etc.
– entity:* . relation:is_a . class:musician gives entity:wolfgang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.
One join of the two completion lists gives entity:john_lennon, etc.
Processing the query “beatles musician”
Problem: entity:* has a huge number of occurrences
– ≈ 200 million for Wikipedia, which is ≈ 20% of all occurrences
– prefix search efficient only for up to ≈ 1% (explanation follows)
Solution: frontier classes
– classes at an “appropriate” level in the hierarchy
– e.g.: artist, believer, worker, vegetable, animal, …
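One simple way to pick such frontier classes (a sketch, not necessarily the paper’s method, with made-up occurrence counts): descend the class tree and stop at the highest class whose occurrence count fits a budget, so that no prefix list like artist:* exceeds the efficient ~1% threshold:

```python
# Illustrative frontier-class selection over a toy class tree.
# Counts and tree are invented; only the descend-until-small idea matters.

children = {"entity": ["person", "location"],
            "person": ["artist", "believer"],
            "location": [], "artist": [], "believer": []}
occurrences = {"entity": 200_000_000, "person": 40_000_000,
               "location": 30_000_000, "artist": 8_000_000,
               "believer": 5_000_000}

def frontier(cls, budget):
    """Highest classes whose occurrence count fits the budget."""
    if occurrences[cls] <= budget or not children[cls]:
        return [cls]
    out = []
    for child in children[cls]:
        out.extend(frontier(child, budget))
    return out

classes = frontier("entity", budget=10_000_000)
```

Each entity is then annotated with its frontier class(es) instead of the unspecific entity: prefix.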
Processing the query “beatles musician”
Document “Gitanes”:
… legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked …
Document “John Lennon” (ontology document; the numbers are word positions):
0 artist:john_lennon 0 believer:john_lennon 1 relation:is_a 2 class:musician …
First figure out: musician falls under the frontier class artist (easy)
Two prefix queries:
– beatles artist:* gives artist:john_lennon, artist:graham_greene, artist:pete_best, etc.
– artist:* . relation:is_a . class:musician gives artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.
One join of the two completion lists gives artist:john_lennon, etc.
The HYB Index [Bast/Weber, SIGIR’06]
Maintains lists for word ranges (not words)
Looks like this for the range abl-abt and for person:*:

abl-abt:    Doc. 12    Doc. 83    Doc. 83    Doc. 187   …
            Pos. 5     Pos. 14    Pos. 124   Pos. 88    …
            Score 0.5  Score 0.2  Score 0.7  Score 0.4  …
            able       ablaze     abroad     abnormal

person:*:   Doc. 17    Doc. 23    Doc. 72    Doc. 72    …
            Pos. 12    Pos. 3     Pos. 55    Pos. 59    …
            Score 0.1  Score 0.5  Score 0.3  Score 0.5  …
            person:john_lennon  person:ringo_starr  person:graham_greene  person:john_lennon

Provably efficient
– no more space than an inverted index (on the same data)
– each query = scan of a moderate number of (compressed) items
Extremely versatile
– can do all kinds of things an inverted index cannot do (efficiently)
– autocompletion, faceted search, query expansion, error correction, select and join, …
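The block layout above can be sketched in a few lines. This is a toy model of the HYB idea only (no compression, one block, dictionary-based ranges), with the same example data as the slide:

```python
# Hedged sketch of a HYB-style block: one block per word *range*, holding
# parallel lists of doc ids, positions, scores, and the word per posting.
# A prefix query scans only the block(s) whose range can contain the prefix.

blocks = {
    ("abl", "abt"): {
        "docs":   [12, 83, 83, 187],
        "pos":    [5, 14, 124, 88],
        "scores": [0.5, 0.2, 0.7, 0.4],
        "words":  ["able", "ablaze", "abroad", "abnormal"],
    },
}

def prefix_query(prefix):
    """Scan candidate blocks; keep postings whose word matches the prefix."""
    hits = []
    for (lo, hi), b in blocks.items():
        if lo[:len(prefix)] <= prefix <= hi:   # range may contain the prefix
            for d, p, s, w in zip(b["docs"], b["pos"], b["scores"], b["words"]):
                if w.startswith(prefix):
                    hits.append((d, p, s, w))
    return hits
```

Because a whole range shares one list, a prefix query is a single scan instead of one list lookup per matching word.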
SPARQL = SPARQL Protocol And RDF Query Language (yes, it’s recursive)
Queries we can handle
We prove the following theorem:
– Any basic SPARQL graph query with m edges can be reduced to at most 2m prefix / join operations
Example (musicians born in the same year as John Lennon):
SELECT ?who WHERE {
  ?who is_a Musician .
  ?who born_in_year ?when .
  John_Lennon born_in_year ?when
}
ESTER achieves seamless integration with full-text search
– SPARQL has no means for dealing with full-text search
– XQuery can handle full-text search, but is not really suitable for semantic search
More about supported queries in the paper
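The theorem’s cost bound can be illustrated by evaluating the example query one graph edge at a time over toy facts. This sketch is not the paper’s reduction; it just charges each edge at most one prefix search plus one join (2 operations), so m edges cost at most 2m operations:

```python
# Illustrative edge-by-edge evaluation of the example SPARQL query over
# toy facts, counting at most 2 operations per edge. Facts are invented.

facts = {("john_lennon", "is_a", "Musician"),
         ("ringo_starr", "is_a", "Musician"),
         ("john_lennon", "born_in_year", "1940"),
         ("ringo_starr", "born_in_year", "1940"),
         ("elvis_presley", "born_in_year", "1935")}

ops = 0

def match(pattern, binding_list):
    """Extend each variable binding by one triple pattern (one graph edge)."""
    global ops
    ops += 2                      # at most one prefix search + one join
    out = []
    for bindings in binding_list:
        for fact in facts:
            b = dict(bindings)
            ok = True
            for slot, val in zip(pattern, fact):
                if slot.startswith("?"):
                    if b.setdefault(slot, val) != val:
                        ok = False
                        break
                elif slot != val:
                    ok = False
                    break
            if ok:
                out.append(b)
    return out

query = [("?who", "is_a", "Musician"),
         ("?who", "born_in_year", "?when"),
         ("john_lennon", "born_in_year", "?when")]
results = [{}]
for edge in query:
    results = match(edge, results)
who = {r["?who"] for r in results}    # musicians born in John Lennon's year
```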
Experiments: Corpus, Ontology, Index
Corpus: English Wikipedia (XML dump from Nov. 2006)
– ≈ 8 GB raw XML
– ≈ 2.8 million documents
– ≈ 1 billion words
Ontology: YAGO (Suchanek/Kasneci/Weikum, WWW’07)
– ≈ 2.5 million facts
– derived from a clever combination of Wikipedia + WordNet (entities from Wikipedia, taxonomy from WordNet)
Our index
– ≈ 1.5 billion words (original + artificial)
– ≈ 3.3 GB total index size; the ontology-only part is a mere 100 MB
Note: our system works for an arbitrary corpus + ontology
Experiments: Efficiency — What Baseline?
SPARQL engines
– can’t do full-text search
– and slow for ontology-only queries too (on Wikipedia: seconds)
XQuery engines
– extremely slow for text search (on Wikipedia: minutes)
– and slow for ontology-only queries too (on Wikipedia: seconds)
Other prototypes which do semantic + full-text search
– efficiency is hardly considered
– e.g., the system of Castells/Fernandez/Vallet (TKDE’07):
“… average informally observed response time on a standard professional desktop computer [of] below 30 seconds [on 145,316 documents and an ontology with 465,848 facts] …”
– our system: ~100 ms, on 2.8 million documents and 2.5 million facts
Experiments: Efficiency — Stress Test 1
Compare to an ontology-only system
– the YAGO engine from WWW’07
– Onto Simple: when was [person] born [1000 queries]
– Onto Advanced: list all people from [profession] [1000 queries]
– Onto Hard: when did people die who were born in the same year as [person] [1000 queries]
Note: comparison very unfair (for our system)

                 Our system (100 MB index)    Onto-Only (4 GB index)
                 avg.       max.              avg.       max.
Onto Simple      2 ms       5 ms              3 ms       20 ms
Onto Advanced    9 ms       31 ms             3 ms       794 ms
Onto Hard        64 ms      208 ms            78 ms      550 ms
Experiments: Efficiency — Stress Test 2
Compare to a text-only search engine
– state-of-the-art system from SIGIR’06
– Onto+Text Easy: counties in [US state] [50 queries]
– Onto+Text Hard: computer scientists [nationality] [50 queries]
– full-text query: e.g. german computer scientists (note: hardly finds relevant documents)
Note: comparison extremely unfair (for our system)

                 Our system             Full-Text Only
                 avg.       max.        avg.       max.
Onto+Text Easy   224 ms     772 ms      90 ms      498 ms
Onto+Text Hard   279 ms     502 ms      44 ms      85 ms
Experiments: Quality — Entity Recognition
Use Wikipedia links as hints
– “… following [[John Lennon | Lennon]] and Paul McCartney, two of the Beatles, …”
– “… The southern terminus is located south of the town of [[Lennon, Michigan | Lennon]] …”
Learn other links
– use words in the neighborhood as features
Accuracy:
                all words   2 senses   3 senses   ≥4 senses
                93.4%       88.2%      84.4%      80.3%
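The learning step can be sketched as a nearest-sense classifier. This is a toy version of the idea only (the training feature sets are invented; the real system learns from many linked occurrences): Wikipedia links supply labeled occurrences of each sense, and an unlinked mention gets the sense whose training contexts share the most neighborhood words:

```python
# Hedged sketch of link-based disambiguation with neighborhood-word
# features. Training feature sets are made up for illustration.

training = {  # sense -> neighborhood words seen at linked occurrences
    "John Lennon":      {"beatles", "mccartney", "song", "guitar"},
    "Lennon, Michigan": {"town", "terminus", "highway", "south"},
}

def disambiguate(context_words):
    """Pick the sense with the largest feature overlap with the context."""
    return max(training, key=lambda s: len(training[s] & set(context_words)))

sense = disambiguate(["following", "lennon", "and", "paul", "mccartney"])
```

Accuracy naturally drops as the number of senses grows, matching the table above.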
Experiments: Quality — Relevance
2 Query Sets
– People associated with [american university] [100 queries]
– Counties of [american state] [50 queries]
Ground truth
– Wikipedia has corresponding lists
e.g., List of Carnegie Mellon University People
Precision and recall:
             precision@10   recall
PEOPLE       37.3%          89.7%
COUNTIES     66.5%          97.8%
Conclusions
Semantic Retrieval System ESTER
– fast and scalable via reduction to prefix search and join
– can handle all basic SPARQL queries
– seamless integration with full-text search
– standard user interface with (semantic) suggestions
Lots of interesting and challenging problems
– simultaneous ranking of entities and documents
– proper snippet generation and highlighting
– search result quality
– …
Dank je wel! (Thank you!)
Context-Sensitive Prefix-Search
Compute completions of last query word
– which together with the previous part of the query would lead to a hit
– [DEMO: show a live example]
Extremely useful
– autocompletion search
– faceted search
– error correction, synonym search, …
– category search: for example, add place:amsterdam; then the query place:* finds all instances of a place
Formal definition in the paper
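Context-sensitive completion can be sketched on a toy index. This is not the formal definition from the paper; it only shows the core filter (a completion counts only if it leads to a hit together with the rest of the query):

```python
# Illustrative context-sensitive prefix search over a toy word->docs index.
# The real system also respects proximity and ranking.

index = {  # word -> set of doc ids
    "beatles": {1, 2},
    "musician:john_lennon": {1, 7},
    "musician:zarah_leander": {7},
    "museum": {2},
}

def complete(context_words, prefix):
    """Completions of `prefix` that co-occur with all context words."""
    ctx_docs = set.intersection(*(index[w] for w in context_words))
    return sorted(w for w, docs in index.items()
                  if w.startswith(prefix) and docs & ctx_docs)

suggestions = complete(["beatles"], "mus")
```

Here musician:zarah_leander is suppressed because no document contains both it and beatles.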
Isn’t the last idea enough for semantic search?
DEMO
Do the following queries [live or recorded]
– beatles
– beatles musi
– beatles musicia
– beatles musician:john_lennon (or beatles entity:john_lennon)
Processing the query “beatles musician”
Document “Liverpool” [one of many documents mentioning John Lennon]:
… in honor of the late Beatle entity:john_lennon …
Document “John Lennon” (ontology document):
0 entity:john_lennon 1 r:is_a 2 class:musician 2 class:singer …
Queries: beatles entity:* and “entity:* r:is_a class:musician”
Problem: entity:* has a huge number of occurrences
– ≈ 200 million for Wikipedia = 20% of all occurrences
– prefix search efficient only up XXX
Solution: frontier set
– classes high up in the hierarchy [explain more]
– e.g.: person, animal, substance, abstraction, …
Processing the query “beatles musician”
Document “Liverpool” [one of many documents mentioning John Lennon]:
… in honour of the late Beatle person:john_lennon …
Document “John Lennon” (ontology document):
0 person:john_lennon 1 is_a: 2 class:musician 2 class:singer …
Two prefix queries:
– beatles person:* gives person:john_lennon, person:the_queen, person:pete_best, etc.
– “person:* r:is_a class:musician” gives person:wolfgang_amadeus_mozart, person:johann_sebastian_bach, person:john_lennon, etc.
One join gives person:john_lennon, etc.
Our Solution, Version 1
Combination of prefix search + join
– Query 1: beatles entity:* gives entities co-occurring with beatles
– Query 2: musician entity:* gives entities which are musicians
– Join the completions from 1 & 2: musicians co-occurring with beatles
Some document about Albert Einstein:
… entity:einstein …
Document “Albert Einstein” (ontology document):
entity:albert_einstein scientist vegetarian intellectual …
But: unspecific prefixes (entity:*) are hard
Our Solution, Version 2
Combination of prefix search + join
– Query 1: translate:singer:* tells us that a singer is a musician
– Query 2: beatles musician:* gives musicians co-occurring with beatles
– Query 3: singer musician:* gives musicians which are singers
– Join the completions from 1 & 2: singers co-occurring with beatles
Some document mentioning John Lennon:
… musician:john_lennon xyz:john_lennon …
Document “John Lennon” (ontology document):
musician:john_lennon xyz:john_lennon …
[Special Doc] TRANSLATE:singer:musician
John Lennon at the Royal Variety Show in 1963, in the presence of members of the British royalty:
"Those of you in the cheaper seats can clap your hands. The rest of you, if you'll just rattle your jewellery."