text search for fine-grained semi-structured data soumen chakrabarti indian institute of technology,...

Download Text Search for Fine-grained Semi-structured Data Soumen Chakrabarti Indian Institute of Technology, Bombay www.cse.iitb.ac.in/~soumen

Post on 03-Jan-2016

212 views

Category:

Documents

0 download

Embed Size (px)

TRANSCRIPT

  • Text Search for Fine-grained Semi-structured DataSoumen ChakrabartiIndian Institute of Technology, Bombaywww.cse.iitb.ac.in/~soumen/

    Sheet1

    Acknowledgments

    S. SudarshanArvind Hulgeri

    B. AdityaParag

  • Two extreme search paradigmsSearching a RDBMSComplex data model: tables, rows, columns, data typesExpressive, powerful query languageNeed to know schema to queryAnswer = unordered set of rowsRanking: afterthoughtInformation RetrievalCollection = set of documents, document = sequence of termsTerms and phrases present or absentNo (nontrivial) schema to learnAnswer = sequence of documentsRanking: central to IR

  • Convergence?SQLXML searchTrees, reference linksLabeled edgesNodes may containStructured dataFree text fieldsData vs. documentQuery involves node data and edge labelsPartial knowledge of schema okAnswer = set of pathsWeb searchIRDocuments are nodes in a graphHyperlink edges have important but unspecified semanticsGoogle, HITSQuery language remains primitiveNo data typesNo use of tag-treeAnswer = URL list

  • Outline of this tutorialReview of text indexing and information retrieval (IR)Support for text search and similarity join in relational databases with text columnsText search features in major XML query languages (and whats missing)A graph model for semi-structured data with free-form text in nodesProximity search formulations and techniques; how to rank responsesFolding in user feedbackTrends and research problems

  • Text indexing basicsInverted index maps from term to document IDsTerm offset info enables phrase and proximity (near) searchesDocument boundary and limitations of near queriesCan extend inverted index to map terms toTable names, column namesPrimary keys, RIDsXML DOM node IDsMy0 care1 is loss of carewith old care doneYour care is gain ofcare with new care wonD1D2careD1: 1, 5, 8D2: 1, 5, 8newD2: 7oldD1: 7lossD1: 3

  • Information retrieval basicsStopwords and stemmingEach term t in lexicon gets a dimension in vector spaceDocuments and the query are vectors in term spaceComponent of d along axis t is TF(d,t)Absolute term count or scaled by max term countDownplay frequent terms: IDF(t) = log(1+|D|/|Dt|)Better model: document vector d has component TF(d,t) IDF(t) for term tQuery is like another document; documents ranked by cosine similarity with queryScale upScale downcarelossof(query vector)

  • MapNone = nothing more than string equality, containment (substring), and perhaps lexicographic orderingSchema: Extensions to query languages, user needs to know data schema, IR-like ranking schemes, no implicit joinsNo schema: Keyword queries, implicit joins

    Sheet1

    Data model

    RelationalXML-like

    IR supportNoneSQL,DatalogXML-QL, Xquery

    SchemaWHIRLELIXIR, XIRQL

    No schemaDBXplorer, BANKS, DISCOVEREasyAsk, Mercado, DataSpot, BANKS

  • WHIRL (Cohen 1998)place(univ,state) and job(univ,dept)Ranked retrieval from a RDBMS:select univ from job where dept ~ CivilRanked similarity join on text columns:select state, dept from place, job where place.univ ~ job.univLimit answer to best k matches onlyAvoid evaluating full Cartesian productIceberg queryUseful for data cleaning and integration

  • WHIRL scoring functionA where-clause in WHIRL is aBoolean predicate as in SQL (age=35)Score for such clauses are 0/1Similarity predicate (job ~ Web design)Score = cosine(job, Web design)Conjunction or disjunction of clausesSub-clause scores interpreted as probabilitiesscore(B1 Bm; )=1im score(Bi,)score(B1 Bm; )=1 1im (1score(Bi,))

  • Query execution strategyselect state, dept from place, job where place.univ ~ job.univStart with place(U1,S) and job(U2,D) where U1, U2, S and D are freeAny binding of these variables to constants is associated with a scoreGreedily extend the current bindings for maximum gain in scoreBacktrack to find more solutions

  • XQueryQuilt + Lorel + YATL + XML-QLPath expressions { FOR $r IN document("recipes.xml") //recipe[//ingredient[@name="flour"]] RETURN {$r/title/text()} }

  • Early text support in XQueryTitle of books containing some para mentioning both sailing and windsurfingFOR $b IN document("bib.xml")//book WHERE SOME $p IN $b//paragraph SATISFIES (contains($p,"sailing") AND contains($p,"windsurfing")) RETURN $b/titleTitle and text of documents containing at least three occurrences of stocks FOR $a IN view("text_table") WHERE numMatches($a/text_document,"stocks") > 3 RETURN {$a/text_title}{$a/text_document}

  • Tutorial outlineReview of text indexing and information retrievalSupport for text search and similarity join in relational databases with text columns (WHIRL)Adding IR-like text search features to XML query languages (Chinenyanga et al. Fhr et al. 2001)

    Sheet1

    Data model

    RelationalXML-like

    IR supportNoneSQL,DatalogXML-QL, Xquery

    SchemaWHIRLELIXIR, XIRQL

    No schemaDBXplorer, BANKS, DISCOVEREasyAsk, Mercado, DataSpot, BANKS

  • ELIXIR: Adding IR to XQueryRanked select for $t in document(db.xml)/items/(book|cd) where $t/text() ~ Ukrainian recipe return $tRanked similarity join: find titles in recent VLDB proceedings similar to speeches in Macbeth for $vi in document(vldb.xml)/issue[@volume>24], $si in document(macbeth.xml)//speech where $vi//article/title ~ $si return $vi//article/title $si

  • How ELIXIR worksELIXIR queryELIXIR CompilerVLDB.xmlMacbeth.xmlBase XML documentsXQuery filters/ transformersWHIRL select/join filtersRewrite to XMLResultFlatten to WHIRL

  • A more detailed view { for $at in document(VLDB.xml)//issue [volume > 24]//title return { $at } } { for $as in document(Macbeth.xml)//act/scene/speechreturn { $as } } q3($title,$line) :- q21($title), q22($line), $title ~ $lineWHIRL query{ for $row in q3/tuple return $row }Result

  • ObservationsSQL/XQuery + IR-like result rankingSchema knowledge remains essentialFree-form text vs. tagged, typed fieldElement hierarchy, element names, IDREFsTypical Web search is two words longEnd-users dont type SQL or XQueryPossible remedy: HTML form accessLimitation: restricted views and queries

  • Using proximity without schemaGeneral, detailed representation: XMLLowest common representationCollection, document, termsDocument = node, hyperlink = edgeMiddle groundGraph with text (or structured data) in nodesLinks: element, subpart, IDREF, foreign keysAll links hint at unspecified notion of proximityExploit structure where available, but do not impose structure by fiat

  • Two paradigms of proximity searchA single node as query responseFind node that matches query termsor is near nodes matching query terms(Goldman et al., 1998)A connected subgraph as query responseSingle node may not match all keywordsNo natural page boundary

  • Single-node response examplesTravolta, CageActor, Face/OffTravolta, Cage, MovieFace/OffKleiser, MovieGathering, GreaseKleiser, Woo, ActorTravoltaTravoltaCageFace/OffGreaseacted-inMovieis-aActoris-aKleiserDirectoris-aGatheringdirectedWooA3

  • Basic search strategyNode subset A activated because they match query keyword(s)Look for node near nodes that are activatedGoodness of response node dependsDirectly on degree of activationInversely on distance from activated node(s)

  • Ranking a single node responseActivated node set ARank node r in response set R based on proximity to nodes a in ANodes have relevance R and A in [0,1]Edge costs are specified by the systemd(a,r) = cost of shortest path from a to rBond between a and r

    Parameter t tunes relative emphasis on distance and relevance scoreSeveral ad-hoc choices

  • Scoring single response nodesAdditive

    Belief

    Goal: list a limited number of find nodes with the largest scoresPerformance issuesAssume the graph is in memory?Precompute all-pairs shortest path (|V |3)?Prune unpromising candidates?

  • Hub indexingDecompose APSP problem using sparse vertex cuts|A|+|B | shortest paths to p|A|+|B | shortest paths to qd(p,q)To find d(a,b) compared(apb) not through qd(aqb) not through pd(apqb)d(aqpb)Greatest savings when |A||B|Heuristics to find cuts, e.g. large-degree nodesABabpq

  • Connected subgraph as responseSingle node may not match all keywordsNo natural page boundaryTwo scenariosKeyword search on relational dataKeywords spread among normalized relationsKeyword search on XML-like or Web dataKeywords spread among DOM nodes and subtrees

  • Tutorial outlineAdding IR-like text search features to XML query languagesA graph model for relational data with free-form text search and implicit joinsGeneralizing to graph models for XML

    Sheet1

    Data model

    RelationalXML-like

    IR supportNoneSQL,DatalogXML-QL, Xquery

    SchemaWHIRLELIXIR, XIRQL

    No schemaDBXplorer, BANKS, DISCOVEREasyAsk, Mercado, DataSpot, BANKS

  • Keyword search on relational dataTuple = nodeSome columns have textForeign key constraints = edges in schema graphQuery = set of termsNo natural notion of a documentNormalizationJoin may be needed to generate resultsCycles may exist in schema graph: CitesCitesCiting Cited PaperPaperID PaperName WritesAuthorID PaperID AuthorAuthorID AuthorName

    Chart1

    121710

    171121

    222914

    141017

    121710

    191520

    Food

    Gas

    Motel

    Sheet1

    PaperIDPaperName

    P1DBXplorer

    P2BANKS

    Chart1

    121710

    171121

    222914

    141017

    121710

    191520

    Food

    Gas

    Motel

    Sheet1

    AuthorIDAuthorName

    A1Chaudhuri

    A2Sudarshan

    A3Hulgeri

    Chart1

    121710

    171121

    222914

    141017

    121710

    191520

    F

Recommended

View more >