12/05/2009cs5611 cs561 - research paper presentation kourdounakis harry met. 567

12/05/2009 CS561 1

CS561 - Research Paper Presentation

Kourdounakis Harry MET. 567

12/05/2009 CS561 2

Topic Introduction

The topic under consideration is full-text search languages and more specifically ‘structered’ full-text queries.

Why bother? Many information systems deal with large document

collections with unknown or ill defined structure Just keyword search isn’t enough. ( what about ordering,

distance etc? ) It is a rather new area of research trying to bridge the gap

between the IR/XML community.

12/05/2009 CS561 3

Expressiveness and Performance of Full-Text Search Languages

Chavdar Botev,Sihem Amer-Yahia, and

Jayavel Shamnugasindaram

12/05/2009 CS561 4

Contents

Introduction The proposed model for full-text search Incompleteness of Existing Full-Text Search

Languages PPRED: Language and Query Evaluation Proof of idea (Experiments)

12/05/2009 CS561 5

The Problem to be addressed

There is no formal basis for comparing full-text languages.

Unlike SQL, there is no well-accepted language for expressing complex full-text search queries

Existing languages are incomplete. Some practical problems on full-text queries:

How to model? How to optimize? How to efficiently evaluate?

12/05/2009 CS561 6

Existing Work. What is missing (1/2)? Some study has been made by the IR

community, but it’s focus was primarily on simple structured full

text queries, no fully composable language was developed, no study of expressiveness, complexity

whatsoever.

12/05/2009 CS561 7

Existing Work. What is missing (2/2)? Text Region Algebras (TRA’s).

A region s is represented as the ordered pair (s.l,s.r) A query operator has the form:

{ sES | exists dED Pred(s,d) } where S,D are sets of regions and Pred is a Boolean expression with the logical operators ‘and’ ‘or’ containing clauses of the form (x*y) where * is one of { =,<,>,<=,>= } and x E{ s.l, s.r, s.l+const, s.l-const, s.r+const, s.r-const }, yE{ d.l,d.r } and const is a constant.

Although efficient algorithms have been devised to evaluate TRA queries, their expressive power is limited ( simultanious inclusion and ordering? )

12/05/2009 CS561 8

Main Contributions of research work A new formal model for structured full-text search

and scoring, based on first order logic (FTC) and the relational algebra (FTA) is introduced.

A notion of completeness of full text languages is defined.

It is shown that existing languages are incomplete with respect to the above completeness

A practical subset of the FTC and FTA which subsumes TRA’s is defined and experimental studies are done upon the performance of it.

12/05/2009 CS561 9

Contents

Introduction A formal basis for Full-Text search languages Incompleteness of Existing Full-Text Search


12/05/2009 CS561 10

Approach to formalism and the full-text search model (1/3) The full-text queries are not modeled in one

specific syntax. A more general approach is taken using FTC and FTA

The Full-Text Search Model: Let N be the set of nodes, P the set of positions,

and T the set of tokens. The following functions are defined: Positions: N -> 2^p that maps a node to the set of

positions in that node Token: P -> T that maps each position to the token at

that position

12/05/2009 CS561 11

Approach to formalism and the full-text search model (2/3) Queries are specified over a collection of nodes

( text docs, html, XML elements, relational tuples, etc ).

Positions uniquely identifies a token in a node. Each context node is considered separately Completeness requirement: The full-text search

language should be at least as expressive as first order logic formulas specified over the positions of tokens in a context node

12/05/2009 CS561 12

Approach to formalism and the full-text search model (3/3)

Positions(cs561) = {1,2,4,5,6}Token(1) = “An”

12/05/2009 CS561 13

Full-Text Calculus (FTC) 1/2

FTC base predicates: SearchContext(node) is true iff node inN hasPos(node,pos) is true iff pos in

Positions(node) hasAsToken(pos,tok) is true iff tok=Token(pos)

FTC extensions: Position based predicates, pred(p1,…,pm,const1,

…,constn). Example: preds = { distance(pos1,pos2,dist),

ordered(pos1,pos2) }

12/05/2009 CS561 14

Full-Text Calculus (FTC) 2/2

An FTC query has the form: { node| SearchContext(node) and QueryExpresion(node) } where QueryExpression is a FOL expression that specifies

the full-text search condition A query example:

“find context nodes that contain the words ‘IR’ and ‘XML’ in that order”

exists p1,p2 ( hasPos(node,p1) and hasAsToken(p1,”IR”) and hasPos(node,p2) and hasAsToken(pos2,”XML”) and ordered(p1,p2) )

12/05/2009 CS561 15

Full-Text Algebra (FTA) 1/2

FTA is defined upon the full-text relation data model which is of the form R[node,attr1,…,attrN] m>=0 with node in N and atti in P.

FTA can be defined recursively as follows: Rtoken(node,attr1) is an expression. (node,pos) in R iff: node in

D and pos in Positions(node) and token= Token(pos) If Expr1 is a expression then π_node,attr1…,attrM(Expr1) is an

expressio If Expr1 and Expr2 are expression then Expr1 join Expr2 is an

expression. ... same for all other operators

An FTA query is an FTA expression that produces a s full-text relation with a single attribute (the node)

12/05/2009 CS561 16

Full-Text Algebra (FTA) 2/2

An FTA query example Π_node( select_ordered(att1,att2) ((R_”ir” join R_”xml”) ))

cs561 2

cs463 7

cs561 5

cs561 2 5

R_”IR”

R_”XML”

12/05/2009 CS561 17

Relevance Scoring

Instead of hard-coding the scoring method in the framework, a general scoring framework is taken

How is the score incorporated in the full-text search model? A per tuple scoring approach is taken

How do FTA operators transform tuple scores? (if TF*IDF is used) Join: t.score = t1.score/|r2| + t2.score/|r1| Projection: t.score = t1.score+ … + tn.score

12/05/2009 CS561 18

Contents



12/05/2009 CS561 19

Completeness

A full-text language L is said to be full-text complete with respect to a set of position-based predicates Preds iff all queries that can be expressed in the FTC using Preds can also be expressed in L

12/05/2009 CS561 20

Incompleteness of predicate-based languages 1/2 Let DIST be the following predicate-based query

language: Query := Token | Query op Query |

dist(Token,Token,Integer) op := ‘AND’ | ‘OR’ | ‘AND NOT’ token := StringLiteral | ‘ANY’

The semantics of DIST can be recursively defined in terms of the FTC

Theorem1: if |T| >=3, where T is the set of tokens then there exists a query that can be expressed in FTC with Preds={ distance(p1,p2,d) } that cannot be expressed by DIST

12/05/2009 CS561 21

Incompleteness of predicate-based languages 2/2 No query in DIST can express the following FTC:

exists p1,p1,p3 ( (hasPos(n,p1) and hasAsToken(p1,t1)) and (hasPos(n,p2) and hasAsToken(p2,t2)) and (hasPos(n,p3) and hasAsToken(p3,t3)) and distance(p1,p2,0) and distance(p2,p3,0) )

The proof is by contradiction on the context nodes: CN1 = “t1 t2 t3” CN2 = “t1 t2 t2 t3 t3 t1”

The previous FTC would only return CN1. This is not the case of DIST which returns either both CN1,CN2 or none.

Why? ( dist(token1,token2,int) as opposed to dist(p1,p2,int) )

12/05/2009 CS561 22

Incompleteness of text region algebras 1/2 Theorem2 : There exists a query that can be

expressed in FTC with Preds = { ordered(p1,p2), samepara(p1,p2) } that cannot be expressed in TRA

The following FTC query cannot be expressed using TRA: Exists p1,p2( hasPos(node,p1) and

hasAsToken(p1,t1) and hasPos(node,p2) and hasAsToken(p2,t2) and ordered(p1,p2) and samepara(p1,p2) )

12/05/2009 CS561 23

Incompleteness of text region algebras 2/2 It is proven that TRA’s cannot represent

simultaneously inclusion and ordering constraints. Example?

Find documents with regions sES that contains two other regions tET and uEU such that t comes before u.

Making the assumption that s,t are regions with same start and end and set U to be regions representing paragraphs concludes to the previously defined FTC query. The theorem therefore follows

12/05/2009 CS561 24

Contents



12/05/2009 CS561 25

Defining Positive Predicates

An FTC query evaluation algorithm can only be: polynomial in the size of the data and exponential in the size of the query

Can we do better? Observation: many full-text search predicates are true in a

contiguous region of the position space. These predicates are called Positive predicates and can be

efficiently evaluated by scanning nodes in increasing order of positions. (single scan)

12/05/2009 CS561 26

PPRED Language Based on previous observations the PPRED

language is defined which is a subset of FTC The syntax is as follows:

PPRED is a strict super set if DIST and TRA’s

Example: some p1 has ‘IR’ some p2 has ‘XML’ ordered(p1,p2)

12/05/2009 CS561 27

Query Evaluation 1/3

Each R_token relation is represented as an inverted list associated to token. Each entry in R_token is of the form (node,PosList,score)

n PosList1

51

89

80 99 139

56 59

96 102 108

n PosList1

75

85 97

83 210

n PosList1

75

90 105 140

81

“judge” inverted list “assignment” inverted list “District” inverted list

12/05/2009 CS561 28


A query is first converted to FTA operators Since it is not wanted to materialize the out put

relation each operator exposes a new API for traversing it’s output

The API maintains the following state: node, which is the current node and list<Position> which track the current positions in the

node

12/05/2009 CS561 29


The API defines the following methods advanceNode():

node = projection_node(R) List<Position> = the smallest positions that appear in R

for that node andvancePosition(i,pos):

List<Position> = the smallest positions that appear in R and satisfy condition pi>pos

12/05/2009 CS561 30

API Example

Node1 5 6

Node1 3 4

Node1 7 8

Node2 19 20

Node2 23 21

advanceNode()advancePosition(p1,4)

12/05/2009 CS561 31


Given the specific query plan the execution would be as follows: advanceNode() is called on the

top project which forwards the call to the distance selection operator below it

The latter continiously calls advancePosition() on the ordered predicate until it finds a satisfying tuple of positions

The ordered predicates advance through the result of the underlying operator until it finds a tuple that satisfies it

The evaluation proceeds down the tree.

12/05/2009 CS561 32

Contents



12/05/2009 CS561 33

Query Evaluator Complexity

O(entries_per_token*pos_per_entry*toksQ*(predsq+ops+1)) entries_per_token = maximum num of entries in the inverted

list pos_per_entry = maximum number of positions in an entry

in an inverted list toksQ = number of keywords in q predsQ = number of predicates in query opsQ = number of operators in a query

12/05/2009 CS561 34

Experimental Results

•The performance of PPRED was compared to:•BOOL which is a restriction of DIST•REL which is the implementation of FTA using regular relational operators.

12/05/2009 CS561 35

QUESTIONS?

12/05/2009 CS561 36

TexQuery: A Full-Text Search Extension to XQuery

Chavdar Botev,Sihem Amer-Yahia, and

Jayavel Shamnugasindaram

12/05/2009 CS561 37

Contents

Introduction Design Goals Limitations of existing framework The TexQuery Language TexQuery Fromal Semantics Conclusions

12/05/2009 CS561 38

Current State and facts

XML also has unstructured information ( ie. Library of Congress in XML, etc)

Current query languages such as Xpath/Xquery provide powerful structured queries but cannot express complex full-text queries. Expressiveness limited to contains() function

12/05/2009 CS561 39

What is missing?

A full-text query language that: can provide complex full-text search conditions ( Boolean

connectives, distance predicates, etc ) could be easily integrated with existing query languages

such as XQuery ( respect to sequence of items’ data model )

is fully compositional ( closed under a particular data model )

Research work’s contribution: TexQuery and the underlying FullMatch data model

12/05/2009 CS561 40

Challenges

Integrating full-text search languages in XQuery is challenging because: Need to identify powerful and composable set of

full-text primitives seamlessly integrate regular XQuery with full-text

search capabilities is not trivial. ( nodes vs keyword tokens ).

A notion of ranking should be supported

12/05/2009 CS561 41

Contents


12/05/2009 CS561 42

Searching over semi-structured data Users should be able to specify the search

context, or the context over which the full-text predicate is to be performed

Users should be able to specify the return context, or part of the document collection to be returned

12/05/2009 CS561 43

Expressive power, extensibility and ranking Users should be able to express complex full-text searches The language should be extensible with respect to new full-text

primitives

Users should be able to obtain relevance scores for the results Users should be able to control how scores are computed Users should be able to obtain the top-k results based on their

relevance score Users should be able to specify a scoring condition which is

possibly different from the full-text search condition

12/05/2009 CS561 44

Integrating with XQuery, Language Syntax and Efficiency Users should be able to embed full-text searches in XQuery

expressions and vice versa XQuery’s capabilities should be leveraged whenever possible There should be no extension to the XQuery data model

It should be possible to statically verify that a query is syntactically correct

The language syntax should allow for static type checking and inference

The language should allow an efficient implementation

12/05/2009 CS561 45

Contents


12/05/2009 CS561 46

Is extending current XQuery functions enough?

How about creating a ‘universal’ contains function for all complex full-text predicates?

Why not create a a function for every complex full-text searsch predicate?

The XQuery data model would need to be changed. Search token positions would need to be captured

Uninterpreted string that is opaque to the XQuery language. ProblemsWhen trying to embed with XQuery

12/05/2009 CS561 47

Contents


12/05/2009 CS561 48

TexQuery Overview

TexQuery introduces 2 new expressions: FTConatinsExpr and FTScoreExpr

These expressions use a set of fully composable full-text primitives called FTSelections

As all other XQuery expressions:

‘Sequence of items’

TexQueryExpr

‘Sequence of items’

12/05/2009 CS561 49

TexQuery Expressions 1/2

FTContainsExpr expr ‘ftcontains’ FTSelection expr is an XQuery expression that defines the search

context and FTSelection specifies the search condition This condition returns true or flase

Examples:

12/05/2009 CS561 50

TexQuery Expressions 2/2 FTScoreExpr

expr ‘ftscore’ FTSelectionWithWeights expr is an XQuery expression that defines the search

context and FTSelectionWithWeights specifies the search condition. Weights are used for scoring

This condition returns: Seq<float(0,1,inclusive)>, A score for every node returned.

Examples:

12/05/2009 CS561 51

FTSelections and FTContextModifiers FTSelection can be either a simple word token or a

more complex ‘predicate’

FTContextModiefiers can be applied to on any FTSelection to modify how the full-text search is performed

12/05/2009 CS561 52

Contents


12/05/2009 CS561 53

The FullMatch Data Model

The XQuery’s data model is inadequate for fully compositional FTSelections since it is defined at the granularity of XML nodes

The solution? A new data model based on the positions of the linguistic tokens within XML nodes

A position represents an occurrence of a linguistic token in an XML and contains information about: the token, a unique identifier, the xml node containing it,

the relative position of the sentence, the relative position of the paragraph, the context ( tag name, attribute, etc )

12/05/2009 CS561 54

The Full Match Data ModelThe building block of FullMatch

An annotated example of XML

12/05/2009 CS561 55

The Full Match Data Model

A Full match is essentially a propositional logic disjunctive normal form (DNF) predicate specified using XML positions

The predicates capture the precise condition that an XML node needs to satisfy in order to be a result of for a full-text search.

Each simple match represents a Possible solution

12/05/2009 CS561 56

The FullMatch Data Model

It is obvious that full match has a hierarchical structure thus can be expressed in XML. A possible DTD follows:

The benefits that derive from the definition of FullMatch are: It ensures that FTSelections are fully composable TexQuery fully extensible. Only semantics in FullMatch need

to be defined for new FTSelection primitives

12/05/2009 CS561 57

Semantics of TexQuery Expressions The big picture:

In order to define the semantics of TexQuery expression the following 2 implementation defined functions are used:

12/05/2009 CS561 58

The Semantics of FTContainsExpr The function

defined returns true iff some node in the search context satisfies at least one of the SimpleMatches

FTContainsExprs

FullMatch

Node*

BOOL

12/05/2009 CS561 59

The Semantics of FTScoreExpr This function

returns a score in the interval [0,1] using a call to implementation specific function fts:score

FTScoreExprs

FullMatch

Node*

FLOAT*

12/05/2009 CS561 60

The Semantics of FTStringSelection FTStringSelection ::= Expr

Transforms a search token into a FullMatch

12/05/2009 CS561 61

FTAndConnective semantics 1/2

FTAndConnective ::= FTSelection ‘&&’ FTSelection

Each SimpleMatch in the Resulting FullMatch is a combination of on SimpleMatch from the first input FullMatch and on SimpleMatch from the second input FullMatch

12/05/2009 CS561 62

FTAndConnective semantics 2/2

12/05/2009 CS561 63

FTScopeSelection semantics

The FTScopeSelectionSemantics take takes the FullMatch corresponding to its input and restricts the SimpleMatches

12/05/2009 CS561 64

Contents

Introduction Design Goals Limitations of existing framework The TexQuery Language TexQuery Fromal Semantics Questions?

12/05/2009cs5611 cs561 - research paper presentation kourdounakis harry met. 567

Documents

text languages

text queries

text search model

text search languages

region s

text region algebras

existing work

keyword search isnt