12/05/2009cs5611 cs561 - research paper presentation kourdounakis harry met. 567
Post on 20-Dec-2015
214 views
TRANSCRIPT
12/05/2009 CS561 2
Topic Introduction
The topic under consideration is full-text search languages and more specifically ‘structered’ full-text queries.
Why bother? Many information systems deal with large document
collections with unknown or ill defined structure Just keyword search isn’t enough. ( what about ordering,
distance etc? ) It is a rather new area of research trying to bridge the gap
between the IR/XML community.
12/05/2009 CS561 3
Expressiveness and Performance of Full-Text Search Languages
Chavdar Botev,Sihem Amer-Yahia, and
Jayavel Shamnugasindaram
12/05/2009 CS561 4
Contents
Introduction The proposed model for full-text search Incompleteness of Existing Full-Text Search
Languages PPRED: Language and Query Evaluation Proof of idea (Experiments)
12/05/2009 CS561 5
The Problem to be addressed
There is no formal basis for comparing full-text languages.
Unlike SQL, there is no well-accepted language for expressing complex full-text search queries
Existing languages are incomplete. Some practical problems on full-text queries:
How to model? How to optimize? How to efficiently evaluate?
12/05/2009 CS561 6
Existing Work. What is missing (1/2)? Some study has been made by the IR
community, but it’s focus was primarily on simple structured full
text queries, no fully composable language was developed, no study of expressiveness, complexity
whatsoever.
12/05/2009 CS561 7
Existing Work. What is missing (2/2)? Text Region Algebras (TRA’s).
A region s is represented as the ordered pair (s.l,s.r) A query operator has the form:
{ sES | exists dED Pred(s,d) } where S,D are sets of regions and Pred is a Boolean expression with the logical operators ‘and’ ‘or’ containing clauses of the form (x*y) where * is one of { =,<,>,<=,>= } and x E{ s.l, s.r, s.l+const, s.l-const, s.r+const, s.r-const }, yE{ d.l,d.r } and const is a constant.
Although efficient algorithms have been devised to evaluate TRA queries, their expressive power is limited ( simultanious inclusion and ordering? )
12/05/2009 CS561 8
Main Contributions of research work A new formal model for structured full-text search
and scoring, based on first order logic (FTC) and the relational algebra (FTA) is introduced.
A notion of completeness of full text languages is defined.
It is shown that existing languages are incomplete with respect to the above completeness
A practical subset of the FTC and FTA which subsumes TRA’s is defined and experimental studies are done upon the performance of it.
12/05/2009 CS561 9
Contents
Introduction A formal basis for Full-Text search languages Incompleteness of Existing Full-Text Search
Languages PPRED: Language and Query Evaluation Proof of idea (Experiments)
12/05/2009 CS561 10
Approach to formalism and the full-text search model (1/3) The full-text queries are not modeled in one
specific syntax. A more general approach is taken using FTC and FTA
The Full-Text Search Model: Let N be the set of nodes, P the set of positions,
and T the set of tokens. The following functions are defined: Positions: N -> 2^p that maps a node to the set of
positions in that node Token: P -> T that maps each position to the token at
that position
12/05/2009 CS561 11
Approach to formalism and the full-text search model (2/3) Queries are specified over a collection of nodes
( text docs, html, XML elements, relational tuples, etc ).
Positions uniquely identifies a token in a node. Each context node is considered separately Completeness requirement: The full-text search
language should be at least as expressive as first order logic formulas specified over the positions of tokens in a context node
12/05/2009 CS561 12
Approach to formalism and the full-text search model (3/3)
Positions(cs561) = {1,2,4,5,6}Token(1) = “An”
12/05/2009 CS561 13
Full-Text Calculus (FTC) 1/2
FTC base predicates: SearchContext(node) is true iff node inN hasPos(node,pos) is true iff pos in
Positions(node) hasAsToken(pos,tok) is true iff tok=Token(pos)
FTC extensions: Position based predicates, pred(p1,…,pm,const1,
…,constn). Example: preds = { distance(pos1,pos2,dist),
ordered(pos1,pos2) }
12/05/2009 CS561 14
Full-Text Calculus (FTC) 2/2
An FTC query has the form: { node| SearchContext(node) and QueryExpresion(node) } where QueryExpression is a FOL expression that specifies
the full-text search condition A query example:
“find context nodes that contain the words ‘IR’ and ‘XML’ in that order”
exists p1,p2 ( hasPos(node,p1) and hasAsToken(p1,”IR”) and hasPos(node,p2) and hasAsToken(pos2,”XML”) and ordered(p1,p2) )
12/05/2009 CS561 15
Full-Text Algebra (FTA) 1/2
FTA is defined upon the full-text relation data model which is of the form R[node,attr1,…,attrN] m>=0 with node in N and atti in P.
FTA can be defined recursively as follows: Rtoken(node,attr1) is an expression. (node,pos) in R iff: node in
D and pos in Positions(node) and token= Token(pos) If Expr1 is a expression then π_node,attr1…,attrM(Expr1) is an
expressio If Expr1 and Expr2 are expression then Expr1 join Expr2 is an
expression. ... same for all other operators
An FTA query is an FTA expression that produces a s full-text relation with a single attribute (the node)
12/05/2009 CS561 16
Full-Text Algebra (FTA) 2/2
An FTA query example Π_node( select_ordered(att1,att2) ((R_”ir” join R_”xml”) ))
cs561 2
cs463 7
cs561 5
cs561 2 5
R_”IR”
R_”XML”
12/05/2009 CS561 17
Relevance Scoring
Instead of hard-coding the scoring method in the framework, a general scoring framework is taken
How is the score incorporated in the full-text search model? A per tuple scoring approach is taken
How do FTA operators transform tuple scores? (if TF*IDF is used) Join: t.score = t1.score/|r2| + t2.score/|r1| Projection: t.score = t1.score+ … + tn.score
12/05/2009 CS561 18
Contents
Introduction A formal basis for Full-Text search languages Incompleteness of Existing Full-Text Search
Languages PPRED: Language and Query Evaluation Proof of idea (Experiments)
12/05/2009 CS561 19
Completeness
A full-text language L is said to be full-text complete with respect to a set of position-based predicates Preds iff all queries that can be expressed in the FTC using Preds can also be expressed in L
12/05/2009 CS561 20
Incompleteness of predicate-based languages 1/2 Let DIST be the following predicate-based query
language: Query := Token | Query op Query |
dist(Token,Token,Integer) op := ‘AND’ | ‘OR’ | ‘AND NOT’ token := StringLiteral | ‘ANY’
The semantics of DIST can be recursively defined in terms of the FTC
Theorem1: if |T| >=3, where T is the set of tokens then there exists a query that can be expressed in FTC with Preds={ distance(p1,p2,d) } that cannot be expressed by DIST
12/05/2009 CS561 21
Incompleteness of predicate-based languages 2/2 No query in DIST can express the following FTC:
exists p1,p1,p3 ( (hasPos(n,p1) and hasAsToken(p1,t1)) and (hasPos(n,p2) and hasAsToken(p2,t2)) and (hasPos(n,p3) and hasAsToken(p3,t3)) and distance(p1,p2,0) and distance(p2,p3,0) )
The proof is by contradiction on the context nodes: CN1 = “t1 t2 t3” CN2 = “t1 t2 t2 t3 t3 t1”
The previous FTC would only return CN1. This is not the case of DIST which returns either both CN1,CN2 or none.
Why? ( dist(token1,token2,int) as opposed to dist(p1,p2,int) )
12/05/2009 CS561 22
Incompleteness of text region algebras 1/2 Theorem2 : There exists a query that can be
expressed in FTC with Preds = { ordered(p1,p2), samepara(p1,p2) } that cannot be expressed in TRA
The following FTC query cannot be expressed using TRA: Exists p1,p2( hasPos(node,p1) and
hasAsToken(p1,t1) and hasPos(node,p2) and hasAsToken(p2,t2) and ordered(p1,p2) and samepara(p1,p2) )
12/05/2009 CS561 23
Incompleteness of text region algebras 2/2 It is proven that TRA’s cannot represent
simultaneously inclusion and ordering constraints. Example?
Find documents with regions sES that contains two other regions tET and uEU such that t comes before u.
Making the assumption that s,t are regions with same start and end and set U to be regions representing paragraphs concludes to the previously defined FTC query. The theorem therefore follows
12/05/2009 CS561 24
Contents
Introduction A formal basis for Full-Text search languages Incompleteness of Existing Full-Text Search
Languages PPRED: Language and Query Evaluation Proof of idea (Experiments)
12/05/2009 CS561 25
Defining Positive Predicates
An FTC query evaluation algorithm can only be: polynomial in the size of the data and exponential in the size of the query
Can we do better? Observation: many full-text search predicates are true in a
contiguous region of the position space. These predicates are called Positive predicates and can be
efficiently evaluated by scanning nodes in increasing order of positions. (single scan)
12/05/2009 CS561 26
PPRED Language Based on previous observations the PPRED
language is defined which is a subset of FTC The syntax is as follows:
PPRED is a strict super set if DIST and TRA’s
Example: some p1 has ‘IR’ some p2 has ‘XML’ ordered(p1,p2)
12/05/2009 CS561 27
Query Evaluation 1/3
Each R_token relation is represented as an inverted list associated to token. Each entry in R_token is of the form (node,PosList,score)
n PosList1
51
89
80 99 139
56 59
96 102 108
n PosList1
75
85 97
83 210
n PosList1
75
90 105 140
81
“judge” inverted list “assignment” inverted list “District” inverted list
12/05/2009 CS561 28
Query Evaluation 2/3
A query is first converted to FTA operators Since it is not wanted to materialize the out put
relation each operator exposes a new API for traversing it’s output
The API maintains the following state: node, which is the current node and list<Position> which track the current positions in the
node
12/05/2009 CS561 29
Query Evaluation 3/3
The API defines the following methods advanceNode():
node = projection_node(R) List<Position> = the smallest positions that appear in R
for that node andvancePosition(i,pos):
List<Position> = the smallest positions that appear in R and satisfy condition pi>pos
12/05/2009 CS561 30
API Example
Node1 5 6
Node1 3 4
Node1 7 8
Node2 19 20
Node2 23 21
advanceNode()advancePosition(p1,4)
12/05/2009 CS561 31
Query Evaluation 4/4
Given the specific query plan the execution would be as follows: advanceNode() is called on the
top project which forwards the call to the distance selection operator below it
The latter continiously calls advancePosition() on the ordered predicate until it finds a satisfying tuple of positions
The ordered predicates advance through the result of the underlying operator until it finds a tuple that satisfies it
The evaluation proceeds down the tree.
12/05/2009 CS561 32
Contents
Introduction A formal basis for Full-Text search languages Incompleteness of Existing Full-Text Search
Languages PPRED: Language and Query Evaluation Proof of idea (Experiments)
12/05/2009 CS561 33
Query Evaluator Complexity
O(entries_per_token*pos_per_entry*toksQ*(predsq+ops+1)) entries_per_token = maximum num of entries in the inverted
list pos_per_entry = maximum number of positions in an entry
in an inverted list toksQ = number of keywords in q predsQ = number of predicates in query opsQ = number of operators in a query
12/05/2009 CS561 34
Experimental Results
•The performance of PPRED was compared to:•BOOL which is a restriction of DIST•REL which is the implementation of FTA using regular relational operators.
12/05/2009 CS561 36
TexQuery: A Full-Text Search Extension to XQuery
Chavdar Botev,Sihem Amer-Yahia, and
Jayavel Shamnugasindaram
12/05/2009 CS561 37
Contents
Introduction Design Goals Limitations of existing framework The TexQuery Language TexQuery Fromal Semantics Conclusions
12/05/2009 CS561 38
Current State and facts
XML also has unstructured information ( ie. Library of Congress in XML, etc)
Current query languages such as Xpath/Xquery provide powerful structured queries but cannot express complex full-text queries. Expressiveness limited to contains() function
12/05/2009 CS561 39
What is missing?
A full-text query language that: can provide complex full-text search conditions ( Boolean
connectives, distance predicates, etc ) could be easily integrated with existing query languages
such as XQuery ( respect to sequence of items’ data model )
is fully compositional ( closed under a particular data model )
Research work’s contribution: TexQuery and the underlying FullMatch data model
12/05/2009 CS561 40
Challenges
Integrating full-text search languages in XQuery is challenging because: Need to identify powerful and composable set of
full-text primitives seamlessly integrate regular XQuery with full-text
search capabilities is not trivial. ( nodes vs keyword tokens ).
A notion of ranking should be supported
12/05/2009 CS561 41
Contents
Introduction Design Goals Limitations of existing framework The TexQuery Language TexQuery Fromal Semantics Conclusions
12/05/2009 CS561 42
Searching over semi-structured data Users should be able to specify the search
context, or the context over which the full-text predicate is to be performed
Users should be able to specify the return context, or part of the document collection to be returned
12/05/2009 CS561 43
Expressive power, extensibility and ranking Users should be able to express complex full-text searches The language should be extensible with respect to new full-text
primitives
Users should be able to obtain relevance scores for the results Users should be able to control how scores are computed Users should be able to obtain the top-k results based on their
relevance score Users should be able to specify a scoring condition which is
possibly different from the full-text search condition
12/05/2009 CS561 44
Integrating with XQuery, Language Syntax and Efficiency Users should be able to embed full-text searches in XQuery
expressions and vice versa XQuery’s capabilities should be leveraged whenever possible There should be no extension to the XQuery data model
It should be possible to statically verify that a query is syntactically correct
The language syntax should allow for static type checking and inference
The language should allow an efficient implementation
12/05/2009 CS561 45
Contents
Introduction Design Goals Limitations of existing framework The TexQuery Language TexQuery Fromal Semantics Conclusions
12/05/2009 CS561 46
Is extending current XQuery functions enough?
How about creating a ‘universal’ contains function for all complex full-text predicates?
Why not create a a function for every complex full-text searsch predicate?
The XQuery data model would need to be changed. Search token positions would need to be captured
Uninterpreted string that is opaque to the XQuery language. ProblemsWhen trying to embed with XQuery
12/05/2009 CS561 47
Contents
Introduction Design Goals Limitations of existing framework The TexQuery Language TexQuery Fromal Semantics Conclusions
12/05/2009 CS561 48
TexQuery Overview
TexQuery introduces 2 new expressions: FTConatinsExpr and FTScoreExpr
These expressions use a set of fully composable full-text primitives called FTSelections
As all other XQuery expressions:
‘Sequence of items’
TexQueryExpr
‘Sequence of items’
12/05/2009 CS561 49
TexQuery Expressions 1/2
FTContainsExpr expr ‘ftcontains’ FTSelection expr is an XQuery expression that defines the search
context and FTSelection specifies the search condition This condition returns true or flase
Examples:
12/05/2009 CS561 50
TexQuery Expressions 2/2 FTScoreExpr
expr ‘ftscore’ FTSelectionWithWeights expr is an XQuery expression that defines the search
context and FTSelectionWithWeights specifies the search condition. Weights are used for scoring
This condition returns: Seq<float(0,1,inclusive)>, A score for every node returned.
Examples:
12/05/2009 CS561 51
FTSelections and FTContextModifiers FTSelection can be either a simple word token or a
more complex ‘predicate’
FTContextModiefiers can be applied to on any FTSelection to modify how the full-text search is performed
12/05/2009 CS561 52
Contents
Introduction Design Goals Limitations of existing framework The TexQuery Language TexQuery Fromal Semantics Conclusions
12/05/2009 CS561 53
The FullMatch Data Model
The XQuery’s data model is inadequate for fully compositional FTSelections since it is defined at the granularity of XML nodes
The solution? A new data model based on the positions of the linguistic tokens within XML nodes
A position represents an occurrence of a linguistic token in an XML and contains information about: the token, a unique identifier, the xml node containing it,
the relative position of the sentence, the relative position of the paragraph, the context ( tag name, attribute, etc )
12/05/2009 CS561 54
The Full Match Data ModelThe building block of FullMatch
An annotated example of XML
12/05/2009 CS561 55
The Full Match Data Model
A Full match is essentially a propositional logic disjunctive normal form (DNF) predicate specified using XML positions
The predicates capture the precise condition that an XML node needs to satisfy in order to be a result of for a full-text search.
Each simple match represents a Possible solution
12/05/2009 CS561 56
The FullMatch Data Model
It is obvious that full match has a hierarchical structure thus can be expressed in XML. A possible DTD follows:
The benefits that derive from the definition of FullMatch are: It ensures that FTSelections are fully composable TexQuery fully extensible. Only semantics in FullMatch need
to be defined for new FTSelection primitives
12/05/2009 CS561 57
Semantics of TexQuery Expressions The big picture:
In order to define the semantics of TexQuery expression the following 2 implementation defined functions are used:
12/05/2009 CS561 58
The Semantics of FTContainsExpr The function
defined returns true iff some node in the search context satisfies at least one of the SimpleMatches
FTContainsExprs
FullMatch
Node*
BOOL
12/05/2009 CS561 59
The Semantics of FTScoreExpr This function
returns a score in the interval [0,1] using a call to implementation specific function fts:score
FTScoreExprs
FullMatch
Node*
FLOAT*
12/05/2009 CS561 60
The Semantics of FTStringSelection FTStringSelection ::= Expr
Transforms a search token into a FullMatch
12/05/2009 CS561 61
FTAndConnective semantics 1/2
FTAndConnective ::= FTSelection ‘&&’ FTSelection
Each SimpleMatch in the Resulting FullMatch is a combination of on SimpleMatch from the first input FullMatch and on SimpleMatch from the second input FullMatch
12/05/2009 CS561 63
FTScopeSelection semantics
The FTScopeSelectionSemantics take takes the FullMatch corresponding to its input and restricts the SimpleMatches