The WWW as a Database: WWW Query Languages
Curtis Dyreson
James Cook University
(Townsville, Australia)
Aalborg University
Outline
• searching the WWW
  – search engines
  – WWW query languages
• WebSQL
  – WWW graph
  – cost
• Jumping Spider
  – hybrid
Searching the WWW
• search engines
  – Altavista, Infoseek, 2100 others!
• static architecture
  – robot: periodic, slow, non-uniform coverage
  – index: keywords to URLs, fast, ranking algorithm
• example query
  Lecture notes on trees in a data structures course.
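The index half of this architecture can be sketched as an inverted index that maps keywords to URLs. This is only a minimal illustration, not how any particular engine works: the URLs and page texts are invented, and there is no robot and no ranking.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query (no ranking)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical pages, invented for illustration.
pages = {
    "http://example.edu/cs201/": "data structures course home",
    "http://example.edu/cs201/notes/": "lecture notes",
    "http://example.edu/cs201/notes/trees.html": "trees binary search trees",
}
index = build_index(pages)
```

With this toy data, `search(index, "data structures")` finds the course page, but `search(index, "lecture notes trees")` finds nothing, because no single page holds all three words.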
A Search Engine Index
[Figure: a search engine index, built up over six slides; labels: data structures, lecture notes, trees]
WWW Query Languages
• search engines index single pages
• multi-page concepts
• hunting strategy
  – search engine to nearby page
  – manual search
• WWW query languages
  – WebSQL, W3QS, WebLog
WWW Graph Structure
• large (650K servers, 350M pages)
• dynamic, cyclic
• page = node, link = edge
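As a rough sketch of this view, pages can be stored as keys of a dictionary and links as entries in each page's list. The three-page site below is hypothetical; the point is that a traversal needs a visited set, because the graph is cyclic.

```python
from collections import deque

# Hypothetical three-page site; each page maps to the pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],  # link back to the start: the graph is cyclic
}

def reachable(graph, start):
    """Breadth-first traversal; the visited set keeps cycles from looping forever."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen
```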
WebSQL
• SQL-like
• search engine to find pages
• path expression (regular expression of links)
• text manipulation predicates
SELECT <attribute list>
FROM <document list>
WHERE <predicate>;
WebSQL From Clause
• from clause collects a set of documents
• unstructured - primitive schema
• MENTIONS - retrieve from search engine
DOCUMENT x SUCH THAT x MENTIONS ‘data structures’
WebSQL From Clause
• from clause collects a set of documents
• unstructured - primitive schema
Document[URL, text, link to URL, modify date]
• MENTIONS - retrieve from search engine
SELECT z.URL
FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
     DOCUMENT y SUCH THAT x -> y,
     DOCUMENT z SUCH THAT y ->* z
WHERE y CONTAINS ‘lecture notes’ AND z CONTAINS ‘trees’;
WebSQL From Clause
• path expression finds related documents
• URL
• local link: ->
• global link: =>
DOCUMENT x SUCH THAT “http://www.cs.auc.dk”
DOCUMENT y SUCH THAT x -> y
DOCUMENT y SUCH THAT x => y
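One way to tell the two link kinds apart is to compare host names, sketched below in Python. Treating "local" as "same host name" is an assumption made for this illustration, and the second URL in the usage note is only a placeholder.

```python
from urllib.parse import urljoin, urlsplit

def link_kind(source_url, href):
    """Return '->' for a link to the same host, '=>' for a link to another host."""
    target = urljoin(source_url, href)  # resolve relative links first
    same_host = urlsplit(source_url).hostname == urlsplit(target).hostname
    return "->" if same_host else "=>"
```

For example, `link_kind("http://www.cs.auc.dk/index.html", "courses/")` gives `"->"`, while a link from the same page to `"http://www.example.org/"` gives `"=>"`.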
WebSQL From Clause
• at most one link: ?
• any number of links: *
• alternation: |
DOCUMENT y SUCH THAT x ->(->)? y
DOCUMENT y SUCH THAT x (=> | ->*) y
DOCUMENT y SUCH THAT x ->* y
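Because a path expression is a regular expression over links, it can be checked with an ordinary regex engine once each link is encoded as a single character. The sketch below is an illustration only, not WebSQL's implementation: the encoding (`L` for `->`, `G` for `=>`) and the function names are invented here.

```python
import re

def compile_path(expr):
    """Translate a path expression into a regex over 'L' (->) and 'G' (=>)."""
    pattern = expr.replace(" ", "").replace("->", "L").replace("=>", "G")
    return re.compile(pattern)

def matches(expr, path):
    """path is the list of link kinds along a route, e.g. ['->', '->']."""
    encoded = "".join("L" if kind == "->" else "G" for kind in path)
    return compile_path(expr).fullmatch(encoded) is not None
```

Under this encoding, `->(->)?` accepts routes of one or two local links, and `(=> | ->*)` accepts either a single global link or any number of local links.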
WebSQL From Clause: Example
FROM Document X SUCH THAT X MENTIONS ‘Java’,
     Document Y SUCH THAT X -> | ->-> Y
[Figure, built up over four slides; label: Java]
WebSQL From Clause
• path expression limits search space
• local link, search limited to local machine
• global link, can go anywhere
• =>* would search all of WWW
• pre-analysis, filtering
• even three to four local links infeasible
WebSQL Where Clause
• like SQL
• CONTAINS, text search of retrieved document
• can push CONTAINS into navigation
WHERE y CONTAINS ‘lecture notes’ AND y.length < 4000;
WebSQL Query
• Find lecture notes on trees in a data structures course.
SELECT z.URL
FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
     DOCUMENT y SUCH THAT x -> y,
     DOCUMENT z SUCH THAT y ->* z
WHERE y CONTAINS ‘lecture notes’ AND z CONTAINS ‘trees’;
[Figure, built up over eight slides: the query evaluated step by step. Step labels: “data structures -> lecture notes”, then “lecture notes ->* trees”, then “Result”. Node labels: data structures, lecture notes, trees]
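The evaluation of the lecture-notes query can be imitated on a toy graph. The sketch below is a loose Python illustration, not WebSQL itself: the site, its page texts, and the helper names are all invented, and a linear scan stands in for the search engine behind MENTIONS.

```python
# Hypothetical single-site graph: page -> (text, local links).
site = {
    "/cs201/": ("data structures course", ["/cs201/notes/", "/cs201/staff.html"]),
    "/cs201/notes/": ("lecture notes index",
                      ["/cs201/notes/lists.html", "/cs201/notes/trees.html"]),
    "/cs201/notes/lists.html": ("linked lists", ["/cs201/notes/"]),
    "/cs201/notes/trees.html": ("binary trees", ["/cs201/notes/"]),
    "/cs201/staff.html": ("staff", []),
}

def contains(page, phrase):
    """Text predicate: does the page text include the phrase?"""
    return phrase in site[page][0]

def star(starts):
    """Pages reachable from starts by zero or more local links (->*)."""
    seen, stack = set(starts), list(starts)
    while stack:
        for nxt in site[stack.pop()][1]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# x MENTIONS 'data structures'
xs = {p for p in site if contains(p, "data structures")}
# x -> y, with y CONTAINS 'lecture notes'
ys = {y for x in xs for y in site[x][1] if contains(y, "lecture notes")}
# y ->* z, with z CONTAINS 'trees'
zs = sorted(z for z in star(ys) if contains(z, "trees"))
```

On this data `zs` holds the single page `/cs201/notes/trees.html`. Note that every page under the notes directory has to be fetched and scanned to answer the `->*` step.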
WebSQL Example
WebSQL Architecture
• Java implementation
WWW Query Language - Drawbacks
• dynamic architecture
• O(k**p)
- p is length of path expression
- k is branching factor
• a priori knowledge of topology
• back links are a problem
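To get a feel for why dynamic evaluation is expensive, the sketch below counts how many pages a traversal may have to fetch. It assumes every page has exactly k outgoing links and that no page is reached twice; both are simplifications, and the value k = 10 is arbitrary.

```python
def pages_within(k, p):
    """Upper bound on pages reached by paths of 1..p links when every page has k links."""
    return sum(k ** depth for depth in range(1, p + 1))

for p in range(1, 5):
    print(p, pages_within(10, p))
```

Under these assumptions the count grows by roughly a factor of k with each extra link in the path expression.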
Jumping Spider - a Hybrid
• like a search engine
- static architecture
- keyword searches
• like a WWW query language
- uses modified WWW graph
- one kind of path expression
Kinds of Links
• content refinement queries are common
• heuristic
information in subdirectories is refined
• different kinds of links
back - subdirectory to parent
down - parent directory to subdirectory
side - unrelated directories
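The three kinds of links can be told apart by comparing the directory parts of the two URLs. The following Python sketch is one possible reading of the slide, not the Jumping Spider code: it ignores host names (both pages are assumed to be on one server), and it gives links within a single directory their own label, "same", since the slide does not say how they are treated.

```python
import posixpath
from urllib.parse import urlsplit

def directory(url):
    """Directory part of the URL path, always ending in '/'."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        return path
    return posixpath.dirname(path).rstrip("/") + "/"

def classify_link(source_url, target_url):
    src, dst = directory(source_url), directory(target_url)
    if src == dst:
        return "same"  # not covered by the slide; labeled separately here
    if dst.startswith(src):
        return "down"  # parent directory to subdirectory
    if src.startswith(dst):
        return "back"  # subdirectory to parent
    return "side"      # unrelated directories
```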
Re-using the WWW Graph
Directory Trees
Down Links
Back Links
Eliminate Back Links
Transitive Closure of Down Links
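A small sketch of this step, under the assumption that links already carry a kind label: back links (and every other non-down link) are dropped, and the down links that remain are closed transitively. The edge list and the function name are hypothetical, and side links are left out of this illustration.

```python
def down_closure(edges):
    """edges: iterable of (source, target, kind) triples.

    Returns {page: set of pages reachable through one or more 'down' links};
    links of any other kind are ignored.
    """
    down = {}
    for src, dst, kind in edges:
        if kind == "down":
            down.setdefault(src, set()).add(dst)
    closure = {}
    for start in down:
        seen, stack = set(), [start]
        while stack:
            for nxt in down.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        closure[start] = seen
    return closure

# Hypothetical links, invented for illustration.
edges = [
    ("/cs201/", "/cs201/notes/", "down"),
    ("/cs201/notes/", "/cs201/notes/trees.html", "down"),
    ("/cs201/notes/trees.html", "/cs201/", "back"),
]
closure = down_closure(edges)
```

Here the course page reaches the trees page in a single step of the closure, and dropping the back link keeps the cycle out of the result.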
Plus a Side Link
[Figure, built up over five slides: step-by-step evaluation. Step labels: “data structures -> lecture notes”, then “lecture notes -> trees”. Node labels: data structures, lecture notes, trees]
Analysis
• search engine index
- adds a pertinent index
• pertinent index - O(n log n) to O(n**2) space
- all URLs that can reach this URL
- tree-like, so should be close to O(n log n)
• more intersections
• implemented in Perl 5
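One reading of the pertinent index is an inverted reachability map: for each URL, the set of URLs that can reach it. The Python sketch below illustrates that reading only and is not the Perl implementation; the data, the function names, and the shape of the keyword index are assumptions. A refinement step then becomes a set intersection.

```python
def pertinent_index(closure):
    """Invert a reachability map: for each URL, the set of URLs that can reach it."""
    reached_by = {}
    for src, targets in closure.items():
        for dst in targets:
            reached_by.setdefault(dst, set()).add(src)
    return reached_by

def refine(keyword_index, reached_by, first, second):
    """URLs matching `second` that are reachable from some URL matching `first`."""
    sources = keyword_index.get(first, set())
    return {url for url in keyword_index.get(second, set())
            if reached_by.get(url, set()) & sources}

# Hypothetical data, invented for illustration.
closure = {
    "/cs201/": {"/cs201/notes/", "/cs201/notes/trees.html"},
    "/cs201/notes/": {"/cs201/notes/trees.html"},
}
keyword_index = {
    "data structures": {"/cs201/"},
    "lecture notes": {"/cs201/notes/"},
    "trees": {"/cs201/notes/trees.html", "/bio/trees.html"},
}
reached_by = pertinent_index(closure)
```

With this data, `refine(keyword_index, reached_by, "lecture notes", "trees")` keeps only the trees page under the notes directory. Each URL stores at most n ancestors, which gives the O(n**2) worst case; in a tree-like site the sets are much smaller.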
Related Work
• WWW query languages
WebSQL (Arocena et al. - WWW6 ’97)
W3QS (Konopnicki and Shmueli - VLDB ’95)
WebLog (Lakshmanan et al. - RIDE ’96)
AKIRA (Lacroix et al. - ER ’97)
• Indexes that already use directories
Infoseek
WebGlimpse (Manber et al. - Usenix ’97)
• Semi-structured data models - many
Future Work
• scale to size of WWW
• extended query language (negation)
• easier installation