cs276a text information retrieval, mining, and exploitation
DESCRIPTION
CS276A Text Information Retrieval, Mining, and Exploitation. Lecture 15 26 Nov 2002. …/~newbie/. www.ibm.com. /…/…/leaf.htm. Recap: Web Anatomy. E2. E1. WEB. Recap:Size of the Web. Capture – Recapture technique Assumes engines get independent random subsets of the Web. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/1.jpg)
CS276AText Information Retrieval, Mining, and
Exploitation
Lecture 1526 Nov 2002
![Page 2: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/2.jpg)
Recap: Web Anatomy
www.ibm.comwww.ibm.com……//~newbie/~newbie/
/…/…/leaf.htm/…/…/leaf.htm
![Page 3: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/3.jpg)
Recap:Size of the Web
Capture – Recapture technique Assumes engines get independent random
subsets of the Web
E2 contains x% of E1.Assume, E2 contains x% of the Web as well
Knowing size of E2 compute size of the WebSize of the Web = 100*E2/x
E1E2
WEB
Bharat & Broder: 200 M (Nov 97), 275 M (Mar 98) Lawrence & Giles: 320 M (Dec 97)
![Page 4: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/4.jpg)
Recent Measurements
Source: http://www.searchengineshowdown.com/stats/change.shtml
![Page 5: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/5.jpg)
Today’s Topics
Web IR infrastructure Search deployment XML intro XML indexing and search
![Page 6: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/6.jpg)
Web IR Infrastructure
Connectivity Server Fast access to links to support link analysis
Term Vector Database Fast access to document vectors to
augment link analysis
![Page 7: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/7.jpg)
Connectivity Server[CS1: Bhar98b, CS2 & 3: Rand01]
Fast web graph access to support connectivity analysis
Stores mappings in memory from URL to outlinks, URL to inlinks
Applications HITS, Pagerank computations Crawl simulation Graph algorithms: web connectivity, diameter etc. Visualizations
![Page 8: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/8.jpg)
Usage
Input
Graphalgorithm
+URLs
+Values
URLstoFPstoIDs
Execution
Graphalgorithm
runs inmemory
IDstoURLs
Output
URLs+
Values
Translation Tables on DiskURL text: 9 bytes/URL (compressed from ~80 bytes ) FP(64b) -> ID(32b): 5 bytesID(32b) -> FP(64b): 8 bytesID(32b) -> URLs: 0.5 bytes
![Page 9: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/9.jpg)
ID assignment
Partition URLs into 3 sets, sorted lexicographically
High: Max degree > 254 Medium: 254 > Max degree > 24 Low: remaining (75%)
IDs assigned in sequence (densely)
E.g., HIGH IDs:
Max(indegree , outdegree) > 254
ID URL
…
9891 www.amazon.com/
9912 www.amazon.com/jobs/
…
9821878 www.geocities.com/
…
40930030 www.google.com/
…
85903590 www.yahoo.com/
Adjacency lists In memory tables for
Outlinks, Inlinks List index maps from an ID
to start of adjacency list
![Page 10: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/10.jpg)
Adjacency List Compression - I
…
…
9813215398
147153
…
…
104105106
ListIndex
Sequenceof
AdjacencyLists
…
…
-63421-8496
…
…
104105106
ListIndex
DeltaEncoded
AdjacencyLists
• Adjacency List: - Smaller delta values are exponentially more frequent (80% to same host)- Compress deltas with variable length encoding (e.g., Huffman)
• List Index pointers: 32b for high, Base+16b for med, Base+8b for low- Avg = 12b per pointer
![Page 11: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/11.jpg)
List Index Pointers
URL Info
LC:TID
LC:TID
…
LC:TID
FRQ:RL
FRQ:RL
…
FRQ:RL
Base (4 bytes)
Offsets For 16
IDs
offset
ID to adjacency list lookup
ID
Adjacencylists
![Page 12: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/12.jpg)
Adjacency List Compression - II
Inter List Compression Basis: Similar URLs may share links
Close in ID space => adjacency lists may overlap Approach
Define a representative adjacency list for a block of IDs Adjacency list of a reference ID Union of adjacency lists in the block
Represent adjacency list in terms of deletions and additions when it is cheaper to do so
Measurements Intra List + Starts: 8-11 bits per link (580M pages/16GB
RAM) Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM.)
![Page 13: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/13.jpg)
Term Vector Database[Stat00]
Fast access to 50 word term vectors for web pages Term Selection:
Restricted to middle 1/3rd of lexicon by document frequency Top 50 words in document by TF.IDF.
Term Weighting: Deferred till run-time (can be based on term freq, doc freq, doc length)
Applications Content + Connectivity analysis (e.g., Topic Distillation) Topic specific crawls Document classification
Performance Storage: 33GB for 272M term vectors Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk
block)
![Page 14: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/14.jpg)
Architecture
URL Info
LC:TID
LC:TID
…
LC:TID
FRQ:RL
FRQ:RL
…
FRQ:RL
128ByteTV
Record
Terms
Freq
Base (4 bytes)
Bit vectorFor
480 URLids
offset
URLid to Term Vector Lookup
URLid * 64 /480
![Page 15: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/15.jpg)
Search Deployment
Web IR is just one (very specific) type of IR Commercially most important IR
application: Enterprise search (large corporations) Problem different from Web IR
Peer-2-Peer (P2P) search Another search deployment strategy
![Page 16: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/16.jpg)
Enterprise Search Deployment
DatabaseCorporate
Network
Company
Web Site
E-Commerce Web PortalsEnterprises
Proprietary content Public content
World Wide
WebSources
Markets
SearchBoxes
Content
Location
Content
ManagementGroupware
![Page 17: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/17.jpg)
1st Generation:
Classic Information Retrieval
2nd Generation:
Driven by WWW
3rd Generation:
Discovery(Text Mining)
User: Trained specialist Everyone Everyone and software agents
Scope: Small, closed collections Intranet/ExtranetStructured, semi-structured and unstructured information
Technology: Pattern/string matchingPattern/string matching and external factors for relevance ranking + categorization
Introduction of linguistic and semantic processing
1985 - 1993 1994 - 1999 2000+
Evolution of Enterprise Search
![Page 18: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/18.jpg)
Enterprise IR is a lot more than search …
Security Cannot search what you
should not readContent organization & creation
Automatic classification Taxonomy generation Support for multiple
languages, multiple formats
Conduits into databases and other content management -- homes for “valuable” content
Information processing tools
Annotation Range searches Custom ranking
criteria Cross lingual tools,…
Individual preferences Personalization Notification, …
![Page 19: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/19.jpg)
Peer-To-Peer (P2P) Search
No central index Each node in a network builds and
maintains own index Each node has “servent” software
On booting, servent pings ~4 other hosts Connects to those that respond Initiates, propagates and serves requests
![Page 20: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/20.jpg)
Which hosts to connect to?
The ones you connected to last time Random hosts you know of Request suggestions from central (or
hierarchical) nameservers
All govern system’s shape and efficiency
![Page 21: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/21.jpg)
Serving P2P search requests
Send your request to your neighbors They send it to their neighbors
decrement “time to live” for query query dies when ttl = 0
Send search matches back along requesting path
![Page 22: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/22.jpg)
Some P2P Networks
Gnutella Kazaa Bearshare Aimster Grokster Morpheus
![Page 23: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/23.jpg)
P2P: Information Retrieval Issues
Why is this more difficult than centralized IR?
![Page 24: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/24.jpg)
P2P: Information Retrieval Issues
Selection of nodes to query Merging of results Spam
![Page 25: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/25.jpg)
What is XML?
eXtensible Markup Language A framework for defining markup
languages No fixed collection of markup tags Each XML language targeted for
application All XML languages share features Enables building of generic tools
![Page 26: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/26.jpg)
Basic Structure
An XML document is an ordered, labeled tree
character data leaf nodes contain the actual data (text strings) data nodes must be non-empty and non-
adjacent to other character data nodes element nodes, are each labeled with
a name (often called the element type), and a set of attributes, each consisting of a
name and a value, can have child nodes
![Page 27: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/27.jpg)
XML Example
![Page 28: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/28.jpg)
XML Example
<chapter id="cmds"> <chaptitle>FileCab</chaptitle> <para>This chapter describes the commands that manage the <tm>FileCab</tm>inet application.</para> </chapter>
![Page 29: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/29.jpg)
Elements
Elements are denoted by markup tags <foo attr1=“value” … > thetext </foo> Element start tag: foo Attribute: attr1 The character data: thetext Matching element end tag: </foo>
![Page 30: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/30.jpg)
XML vs HTML
Relationship?
![Page 31: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/31.jpg)
XML vs HTML
HTML is a markup language for a specific purpose (display in browsers)
XML is a framework for defining markup languages
HTML can be formalized as an XML language (XHTML)
XML defines logical structure only HTML: same intention, but has evolved into
a presentation language
![Page 32: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/32.jpg)
XML: Design Goals
Separate syntax from semantics to provide a common framework for structuring information
Allow tailor-made markup for any imaginable application domain
Support internationalization (Unicode) and platform independence
Be the future of (semi)structured information (do some of the work now done by databases)
![Page 33: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/33.jpg)
Why Use XML?
Represent semi-structured data (data that are structured, but don’t fit relational model)
XML is more flexible than DBs XML is more structured than simple IR You get a massive infrastructure for free
![Page 34: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/34.jpg)
Applications of XML XHTML CML – chemical markup language WML – wireless markup language ThML – theological markup language
<h3 class="s05" id="One.2.p0.2">Having a Humble Opinion of Self</h3> <p class="First" id="One.2.p0.3">EVERY man naturally desires knowledge <note place="foot" id="One.2.p0.4"> <p class="Footnote" id="One.2.p0.5"><added id="One.2.p0.6"> <name id="One.2.p0.7">Aristotle</name>, Metaphysics, i. 1. </added></p> </note>; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars. <added id="One.2.p0.8"><note place="foot" id="One.2.p0.9"> <p class="Footnote" id="One.2.p0.10"> Augustine, Confessions V. 4. </p> </note></added> </p>
![Page 35: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/35.jpg)
XML Schemas
Schema = syntax definition of XML language
Schema language = formal language for expressing XML schemas
Examples DTD XML Schema (W3C)
Relevance for XML IR Our job is much easier if we have a (one)
schema
![Page 36: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/36.jpg)
XML Tutorial
http://www.brics.dk/~amoeller/XML/index.html
(Anders Møller and Michael Schwartzbach) Previous (and some following) slides are
based on their tutorial
![Page 37: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/37.jpg)
XML Indexing and Search
![Page 38: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/38.jpg)
Native XML Database
Uses XML document as logical unit Should support
Elements Attributes PCDATA (parsed character data) Document order
Contrast with DB modified for XML Generic IR system modified for XML
![Page 39: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/39.jpg)
XML Indexing and Search
Most native XML databases have taken an DB approach Exact match Evaluate path expressions No IR type relevance ranking
Only a few that focus on relevance ranking
![Page 40: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/40.jpg)
Timber: XML as DB extension
DB: search tuples Timber: search trees Main focus
Complex and variable structure of trees (vs. tuples)
Ordering XML query optimization vs relational
optimization
![Page 41: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/41.jpg)
ToXin
Native XML database Exploits overall path structure
Supports any general path query Query evaluation in three stages
Preselection stage Selection stage Postselection stage
![Page 42: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/42.jpg)
ToXin: Motivation
Strawman: Index all paths
occurring in database Does not allow
backward navigation Example query:
find all the titles of articles from 1990
![Page 43: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/43.jpg)
Query Evaluation Stages
Pre-selection First navigation down the tree
Selection Value selection according to filter
Post-selection Navigation up and down again
![Page 44: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/44.jpg)
ToXin
![Page 45: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/45.jpg)
Factors Impacting Performance
Data source specific Document size Number of XML nodes and values Path complexity (degree of nesting) Average value size
Query specific Selectiveness of path constraint Size of query answer Number of elements selected by filter
![Page 46: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/46.jpg)
Benchmark Parameters
![Page 47: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/47.jpg)
Query Classification
![Page 48: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/48.jpg)
Evaluation
![Page 49: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/49.jpg)
ToXin: Summary
Efficient native XML database All paths are indexed (not just from root) Path index linear in corpus size Shortcomings
Order of nodes ignored Semantics of IDRefs ignored
What ismissing?
![Page 50: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/50.jpg)
IR/Relevance Ranking for XML
Why is this difficult?
![Page 51: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/51.jpg)
IR XML Challenge 1: Term Statistics
There is no document unit in XML How do we compute tf and idf? Global tf/idf over all text context is useless Indexing granularity
![Page 52: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/52.jpg)
IR XML Challenge 2: Fragments
IR systems don’t store content (only index) Need to go to document for displaying
fragment Easier in DB framework
![Page 53: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/53.jpg)
Relevance Ranking for XML
Will revisit next week
![Page 54: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/54.jpg)
Querying XML
Semistructured queries XPath XQuery
![Page 55: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/55.jpg)
Types of (Semi)Structured Queries
Location/position (“chapter no.3”) Simple attribute/value
/play/title contains “hamlet” Path queries
title contains “hamlet” /play//title contains “hamlet”
Complex graphs Employees with two managers
All of the above: mixed structure/content Subsumes: hyperlinks
![Page 56: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/56.jpg)
XPath
Declarative language for Addressing (used in XLink/XPointer and in
XSLT) Pattern matching (used in XSLT and in
XQuery) Location path
a sequence of location steps separated by /
Example: child::section[position()<6] /
descendant::cite / attribute::href
![Page 57: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/57.jpg)
Axes in XPath
ancestor, ancestor-or-self, attribute, child, descendent, descendent-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self
![Page 58: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/58.jpg)
Location steps
A single location step has the form: axis :: node-test [ predicate ]
The axis selects a rough set of candidate nodes (e.g. the child nodes of the context node).
The node-test performs an initial filtration of the candidates based on their types (chardata node, processing instruction,
etc.), or names (e.g. element name).
The predicates (zero or more) cause a further, potentially more complex, filtration
child::section[position()<6]
![Page 59: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/59.jpg)
XQuery
SQL for XML Usage scenarios
Human-readable documents Data-oriented documents Mixed documents (e.g., patient records)
Relies on XPath XML Schema datatypes
Turing complete XQuery is still a working draft. More than a hundred open issues as of 2002.11.10
![Page 60: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/60.jpg)
XQuery
The principal forms of XQuery expressions are: path expressions element constructors FLWR ("flower") expressions list expressions conditional expressions quantified expressions datatype expressions
Evaluated with respect to a context
![Page 61: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/61.jpg)
FLWR
FOR $p IN document("bib.xml")//publisher LET $b := document("bib.xml”)//book[publisher = $p] WHERE count($b) > 100 RETURN $p
FOR generates an ordered list of bindings of publisher names to $p
LET associates to each binding a further binding of the list of book elements with that publisher to $b
at this stage, we have an ordered list of tuples of bindings: ($p,$b)
WHERE filters that list to retain only the desired tuples
RETURN constructs for each tuple a resulting value
![Page 62: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/62.jpg)
XQuery vs SQL
Order matters! document("zoo.xml")//chapter[2]//
figure[caption = "Tree Frogs"] XQuery is turing complete, SQL is not.
![Page 63: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/63.jpg)
XQuery Example
Møller and Schwartzbach
![Page 64: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/64.jpg)
XQuery Standard on Ranking (2.3.1)
Document order defines a total ordering among all the nodes seen by the language processor. Within a given document, the document node is the first node, followed by element nodes, text nodes, comment nodes, and processing instruction nodes in the order of their representation in the XML form of the document (after expansion of entities). Element nodes occur before their children. The namespace nodes of an element immediately follow the element node, in implementation-defined order. The attribute nodes of an element immediately follow its namespace nodes, and are also in implementation-defined order.
The relative order of nodes in distinct documents is implementation-defined but stable within a given query or transformation. In other words, given two distinct documents A and B, if a node in document A is before a node in document B, then every node in document A is before every node in document B. The relative order among free-floating nodes (those not in a document) is implementation-defined.
![Page 65: CS276A Text Information Retrieval, Mining, and Exploitation](https://reader036.vdocuments.site/reader036/viewer/2022062304/56813c28550346895da5a29d/html5/thumbnails/65.jpg)
Next Week (12/3)
XML indexing and search II Metadata indexing and search Dublin Core, RDF, DAML+OIL