![Page 1: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/1.jpg)
Information Network Analysis and Extraction
Extraction and Integration of the Semi-Structured Web
Tim Weninger
Computer Science and Engineering Department
University of Notre Dame
![Page 2: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/2.jpg)
Rules of this tutorial
1. Ask questions
2. Ask lots of questions
3. If you don’t agree with something, let me know
4. If something is not clear, ask a question
Slides can be found online at: http://web.engr.illinois.edu/~weninge1/publications.html
Google/Bing/Yahoo: ‘Tim Weninger Publications’
![Page 3: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/3.jpg)
The Web
Social Networks
› Early Messenger Networks
› Social Media
› Gaming Networks
› Professional Networks
Hyperlink Networks
› Blog Networks
› Wiki-networks
› Web-at-large
  » Internal links
  » External links
![Page 4: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/4.jpg)
The Web is a Hyperlink Network
![Page 5: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/5.jpg)
Ranking on the Web
Query:
![Page 6: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/6.jpg)
Clustering on the Web
Sim(
![Page 7: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/7.jpg)
This Tutorial is about the structure and content of the Web
Name, Phone, Office, Age, Gender, Email
Author, Dateline, Topic, Persons, Location
![Page 8: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/8.jpg)
Imagine what we could do…
Search
› Show structured information in response to query
› Automatically rank and cluster entities
› Reasoning on the Web
  » Who are the people at some company?
  » What are the courses in some college department?
Analysis
› Expand the known information of an entity
  » What is a professor’s phone number, email, courses taught, research, etc.?
![Page 9: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/9.jpg)
Outline
Preliminaries
Information Extraction
Break (30 min)
Information Integration
Web Information Networks
![Page 10: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/10.jpg)
Databases and Schemas
Databases usually have a well defined schema
![Page 11: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/11.jpg)
Databases and Schemas
Databases usually have a well defined schema
![Page 12: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/12.jpg)
XML – a data description language
XML Schema
![Page 13: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/13.jpg)
XML – a data description language
XML Instance
![Page 14: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/14.jpg)
HTML and Semi-Structured data
![Page 15: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/15.jpg)
HTML and Semi-Structured data
What’s the schema?
![Page 16: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/16.jpg)
HTML and Semi-Structured data
HTML has no schema!
HTML is a markup language
› A description for a browser to render
› HTML describes how the data should be displayed
HTML was never meant to describe the data.
![Page 17: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/17.jpg)
HTML and Semi-Structured data
HTML was never meant to describe the data.
But there is so much data on the Web…we have to try
![Page 18: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/18.jpg)
Document Object Model
HTML -> DOM
› DOM is a tree model of the HTML markup
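A minimal sketch of the HTML -> DOM step, using only Python's standard `html.parser`. The `DomBuilder` class, the dict-based node layout, and the `outline` helper are illustrative names, not a browser's DOM API, and malformed markup is handled far more simply here than in a real browser.

```python
from html.parser import HTMLParser


class DomBuilder(HTMLParser):
    """Builds a nested-dict tree from HTML (illustrative, not the W3C DOM API)."""

    # Elements that never have a closing tag (partial list, an assumption).
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.stack = [self.root]  # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            text = {"tag": "#text", "text": data.strip(), "children": []}
            self.stack[-1]["children"].append(text)


def outline(node, depth=0):
    """Return the tree as indented tag names, one per line."""
    lines = ["  " * depth + node["tag"]]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines


builder = DomBuilder()
builder.feed("<html><body><h1>Faculty</h1><p>Tim<br>Weninger</p></body></html>")
builder.close()
print("\n".join(outline(builder.root)))
```

The tree records nesting only; it says nothing about which nodes hold a name, a phone number, or navigation, which is the gap the rest of the tutorial addresses.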
![Page 19: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/19.jpg)
What the DOM is not
From the W3C:
The Document Object Model does not define what information in a document is relevant or how information in a document is structured. For XML, this is specified by the W3C XML Information Set [Infoset]. The DOM is simply an API to this information set.
![Page 20: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/20.jpg)
Web page rendering
HTML -> DOM -> Web page
› Web page rendering according to Web standards
› Uses the box model
![Page 21: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/21.jpg)
Web databases
LOTS of pages on the Web are database interfaces
![Page 22: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/22.jpg)
Web databases
Some pages are not database interfaces….but they could be
![Page 23: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/23.jpg)
Relational Databases on the Web
WebPages can have relational data
![Page 24: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/24.jpg)
Data can be hidden in text too!
![Page 25: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/25.jpg)
HTML and Semi-Structured data
Our goal is to extract information from the Web
…and make sense out of it!
![Page 26: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/26.jpg)
Outline
Preliminaries
Information Extraction from text
Break (30 min)
Information Extraction from tables and lists
Web Information Networks
![Page 27: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/27.jpg)
Content Extraction
![Page 28: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/28.jpg)
Web Content Extraction
Extract only the content of a page
Taken from The Hutchinson News on 8/14/2008
![Page 29: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/29.jpg)
Web Content Extraction
Two Approaches
1. Heuristic Approaches
   Work one “document-at-a-time”
2. Template Detection Approaches
   Require multiple documents that contain the same template
Benefits of content extraction
• Reduce the noise in the document
  » Reduce document size
  » Better indexing, search processing
  » Easier to fit on small screens
![Page 30: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/30.jpg)
Wrapper Generation
Documents on the Web are made from templates
• Popularity of Content Management Systems
• Database queries are used to “fill out” HTML content
Templates are the framework of the Web page(s)
• The structure is very similar (nearly identical) among template Web pages.
1. Cluster similarly structured documents
2. Generate Wrappers
3. Extract Information
![Page 31: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/31.jpg)
Wrapper Generation
Documents on the Web are made from templates
• Database query “fills in” the content
• Separate AJAX/HTTP calls “fill in” content
![Page 32: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/32.jpg)
Locating Web page templates
Bar-Yossef and Rajagopalan ‘02 first proposed a template recognition algorithm using DOM tree segmentation.
• Template detection via data mining and its applications
Lin and Ho ‘02 developed InfoDiscoverer, which uses the heuristic that template-generated contents appear more frequently.
• Discovering informative content blocks from web documents
Debnath et al. ‘05 developed ContentExtractor, which also includes features like image or script elements.
• Automatic extraction of informative blocks from webpages
![Page 33: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/33.jpg)
Locating Web page templates
Yi, Liu and Li ‘03 use the Site Style Tree (SST) approach, which finds that identically formatted DOM sub-trees denote the template.
• Eliminating noisy information in web pages for data mining
Crescenzi et al. ’01 developed RoadRunner, which uses the Align, Collapse under Mismatch, and Extract (ACME) approach to generate wrappers.
• Towards Automatic Data Extraction from Large Web Sites
Buttler ‘04 proposes the path shingling approach, which makes use of the shingling technique.
• A short survey of document structure similarity algorithms
![Page 34: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/34.jpg)
Wrapper Generation
Generate extraction rules
//div[@class ="content"]/table[1]/tr/td[2]/text()
A home away from school
Day care has after-school duties as some clients start academic year
By Kristen Roderick – The Hutchinson News – [email protected]
The doors at Hadley Day Care opened Wednesday afternoon, and children scurried in with tales of…
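A hedged sketch of applying an extraction rule like the XPath above. The page fragment is hypothetical and shaped to fit the rule; Python's standard `xml.etree.ElementTree` supports only a subset of XPath and has no `text()` step, so the code selects the `td` elements and reads their `.text` instead. It also assumes well-formed markup, which real Web pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a templated news page.
PAGE = """
<html><body>
  <div class="nav"><table><tr><td>Home</td><td>Sports</td></tr></table></div>
  <div class="content">
    <table>
      <tr><td>Headline</td><td>A home away from school</td></tr>
      <tr><td>Byline</td><td>By Kristen Roderick</td></tr>
    </table>
  </div>
</body></html>
"""


def apply_wrapper(page: str) -> list[str]:
    """Return the text of the second cell of every row in the content table."""
    root = ET.fromstring(page)
    # Same path as the slide's rule, minus the unsupported text() step.
    cells = root.findall(".//div[@class='content']/table[1]/tr/td[2]")
    return [cell.text.strip() for cell in cells if cell.text]


print(apply_wrapper(PAGE))
```

The rule ignores the navigation table because it is keyed on `class="content"`; renaming that class in the template would silently return an empty list.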
![Page 35: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/35.jpg)
Wrapper Generation
Advantages
• Easy to implement and learn
• Can have perfect precision and recall
Disadvantages
• Web sites change their templates often
  » Any small change breaks the wrapper
• Need several examples to learn the wrapper
  » Called “domain-centric” approaches
![Page 36: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/36.jpg)
Single Document Content Extraction
Look at a single document at a time
• Use heuristics and data mining principles to find main content.
No template detection
No extraction rule learning
Called “Web-centric” approaches
![Page 37: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/37.jpg)
Early Content Extraction Approaches
Body Text Extraction (BTE)
• Interprets HTML document as word and tag tokens
• Identifies a single, continuous region which contains most words while excluding most tags.
Document Slope Curves (DSC)
• Extension of BTE that looks at several document regions.
Link Quota Filters (LQF)
• Remove DOM elements which consist mainly of text occurring in hyperlink anchors.
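A rough sketch of the Link Quota Filter idea, not any published implementation. The 0.5 threshold, the function names, and the choice to leave bare `<a>` elements alone (so that in-text links survive) are all assumptions, and the input is assumed to be well-formed so that `xml.etree.ElementTree` can parse it.

```python
import xml.etree.ElementTree as ET


def text_len(elem) -> int:
    """Characters of text under an element, ignoring surrounding whitespace."""
    return sum(len(t.strip()) for t in elem.itertext())


def anchor_text_len(elem) -> int:
    """Characters of text that sit inside <a> elements under an element."""
    return sum(text_len(a) for a in elem.iter("a"))


def link_quota_filter(root, threshold=0.5):
    """Drop elements whose text is mostly hyperlink anchor text."""
    for parent in list(root.iter()):
        for child in list(parent):
            total = text_len(child)
            if child.tag == "a" or total == 0:
                continue  # assumption: keep inline links and empty elements
            if anchor_text_len(child) / total > threshold:
                parent.remove(child)
    return root


PAGE = (
    "<body><ul><li><a>Home</a></li><li><a>News</a></li></ul>"
    "<p>Children scurried in with <a>tales</a> of school.</p></body>"
)
cleaned = link_quota_filter(ET.fromstring(PAGE))
print("".join(cleaned.itertext()))
```

The navigation list is removed because all of its text is anchor text, while the paragraph stays because only a small share of it is linked.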
![Page 38: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/38.jpg)
Tag Ratios Content Extraction
Two algorithms
• Same time, same conference
• Same concept
Gottron, et al. ‘07 Content Code Blurring
Weninger, et al. ‘07 Content Extraction via Tag Ratios
![Page 39: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/39.jpg)
Text to Tag Ratio
http://www2010.org/www/2010/04/program-guide/
Text: 21 - Tags: 8 -> TTR: 2.63
Text: 22 - Tags: 8 -> TTR: 2.75
Text: 298 - Tags: 6 -> TTR: 49.67
Text: 0 - Tags: 0 -> TTR: 0
Text: 0 - Tags: 1 -> TTR: 0
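The per-line ratios above (for example 21 / 8 ≈ 2.63 and 298 / 6 ≈ 49.67) can be reproduced with a sketch like the one below. Two details are assumptions rather than something the slide states: whitespace is not counted as text, and a line with zero tags is divided by 1 so the ratio stays defined.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Non-tag characters on a line divided by the number of tags on it."""
    tags = len(TAG_RE.findall(line))
    # Assumption: whitespace does not count as text.
    text = len(re.sub(r"\s+", "", TAG_RE.sub("", line)))
    # Assumption: a tag-free line uses a denominator of 1.
    return text / max(tags, 1)


def ttr_histogram(html: str) -> list[float]:
    """One ratio per source line, in document order."""
    return [text_to_tag_ratio(line) for line in html.splitlines()]


print(text_to_tag_ratio('<li><a href="/x">Program</a></li>'))
```

Lines dominated by markup (menus, link lists) score low, while long runs of prose inside one or two tags score high.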
![Page 40: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/40.jpg)
[Figure: Text to Tag Ratio Histogram. x-axis: Line Number; y-axis: Text To Tag Ratio]
![Page 41: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/41.jpg)
Histogram Clustering in 2-Dimensions
Looks for jumps in the moving average of TTR
[Figure, two panels. First panel: x-axis Line Number, y-axis Text To Tag Ratio. Second panel: x-axis Line Number]
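A simplified sketch of looking for jumps in a smoothed TTR series. A plain moving average stands in for whatever smoothing the original method uses, and the look-ahead difference, window sizes, and function names are assumptions chosen for illustration.

```python
def moving_average(values, radius=1):
    """Smooth a series with a centered window, truncated at the ends."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def jumps(smoothed, lookahead=3):
    """Mean of the next `lookahead` values minus the current value."""
    result = []
    for i, current in enumerate(smoothed):
        # At the end of the series there is nothing ahead, so the jump is 0.
        window = smoothed[i + 1: i + 1 + lookahead] or [current]
        result.append(sum(window) / len(window) - current)
    return result


ttr = [0, 0, 0, 30, 30, 30, 0, 0, 0]
h = moving_average(ttr)
g = jumps(h, lookahead=1)
print(h)
print(g)
```

Positive values mark a rise into a text-heavy region and negative values mark the fall out of it, which is why the magnitude of the jump is informative in both directions.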
![Page 42: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/42.jpg)
Histogram Clustering in 2-Dimensions
Absolute value gives insight
[Figure, two panels. Both x-axes: Line Number; second panel y-axis: gʹ]
![Page 43: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/43.jpg)
Histogram Clustering in 2-Dimensions
Make a scatterplot
[Figure: scatterplots. x-axis: TTR (hʹ); y-axis: Differences (g')]
![Page 44: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/44.jpg)
Modified k-Means
[Figure: scatterplot. x-axis: TTR (hʹ); y-axis: Differences (g')]
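One plausible reading of "modified" k-means, sketched under stated assumptions: one centroid is pinned to the origin so that lines with low TTR and small differences fall into a non-content cluster, and everything else is treated as content. The pinning, the deterministic seeding from the points farthest from the origin, and the number of clusters `k` are assumptions here, not details taken from the slide.

```python
import math


def modified_kmeans(points, k=2, iterations=20):
    """Cluster 2-D points; label 0 is the cluster pinned at the origin."""
    # Assumption: seed the free centroids with the points farthest from (0, 0).
    far = sorted(points, key=lambda p: math.hypot(p[0], p[1]), reverse=True)
    centroids = [(0.0, 0.0)] + far[: k - 1]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: centroid 0 never moves; the others follow their members.
        for c in range(1, k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                xs, ys = zip(*members)
                centroids[c] = (sum(xs) / len(xs), sum(ys) / len(ys))
    return labels


# Each point is (smoothed TTR, absolute difference) for one line of HTML.
points = [(0, 0), (1, 0), (0, 1), (50, 40), (55, 45), (60, 5)]
labels = modified_kmeans(points)
content_lines = [i for i, label in enumerate(labels) if label != 0]
print(labels, content_lines)
```

Keeping one centroid at the origin means the non-content cluster cannot drift toward the content points, even on pages where most lines are boilerplate.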
![Page 45: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/45.jpg)
Single Document Content Extraction
Advantages
› Only need a single document at a time
› Unsupervised
  » No training required
Disadvantages
› Precision and Recall vary
  » Depending on the (1) algorithm, (2) parameters, (3) Web page
› What are other problems?
  » JavaScript!
![Page 46: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/46.jpg)
Rule Extraction
![Page 47: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/47.jpg)
Textual Extraction
Web text holds good information, but full NLP understanding is difficult
Two flavors of text extraction
› Domain-at-a-time
› Web-at-large (domain-agnostic)
Very different techniques required for each
![Page 48: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/48.jpg)
Domain at a time
Documents on the Web are made from templates
› A single domain has similar language
![Page 49: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/49.jpg)
Domain at a time text extraction
If we know the schema/domain, we know the rules
BBC Business – “owned by”, “sales of”, “CEO of”, etc.
![Page 50: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/50.jpg)
Known Domains: Rule Learning
1. User provides initial data
2. Algorithm searches for terms, then induces rules.
[ORGANIZATION]’s headquarters in [LOCATION]
[LOCATION]-based [ORGANIZATION]
[ORGANIZATION], [LOCATION]

“Servers at Microsoft’s headquarters in Redmond…”
“The Armonk-based IBM has introduced…”
“Intel, Santa Clara, cut prices of its Pentium…”

Microsoft Redmond
IBM Armonk
Intel Santa Clara
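The three induced patterns can be applied with ordinary regular expressions. This sketch covers only the application step, not the rule induction itself; the capitalized-word definition of an entity and the exclusion of a leading "The" are simplifying assumptions in place of a real named-entity recognizer.

```python
import re

# Assumption: an entity is a run of capitalized words that is not just "The".
NAME = r"\b(?!The\b)[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*"

RULES = [
    re.compile(rf"(?P<org>{NAME})[’']s headquarters in (?P<loc>{NAME})"),
    re.compile(rf"(?P<loc>{NAME})-based (?P<org>{NAME})"),
    re.compile(rf"(?P<org>{NAME}), (?P<loc>{NAME})"),
]

SENTENCES = [
    "Servers at Microsoft’s headquarters in Redmond…",
    "The Armonk-based IBM has introduced…",
    "Intel, Santa Clara, cut prices of its Pentium…",
]


def extract(sentences):
    """Return (organization, location) pairs matched by any rule."""
    pairs = []
    for sentence in sentences:
        for rule in RULES:
            for match in rule.finditer(sentence):
                pairs.append((match.group("org"), match.group("loc")))
    return pairs


for org, loc in extract(SENTENCES):
    print(org, loc)
```

The third rule is the loosest of the three: any two capitalized phrases separated by a comma will match, which hints at why such rules need per-domain tuning.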
![Page 51: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/51.jpg)
Known Domains: Rule Learning
1. User provides initial data
2. Algorithm searches for terms, then induces rules.
Extraction rules are intricate and break easily
› Different extraction rules per domain
  » Can’t scale
Have to parse all of the text
› Computationally very expensive

Microsoft Redmond
IBM Armonk
Intel Santa Clara
![Page 52: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/52.jpg)
Domain independent – Source dependent
Don’t analyze raw text - use dataset-specific extraction techniques
Yet Another Great Ontology (YAGO)
Finds TYPE relationship in Wikipedia
› Looks at Wikipedia category pages
› Categories can be different
  » Conceptual (naturalized citizens of the US)
  » Relational (1879 births)
  » Thematic (Physics)
  » Administrative (unsourced articles)
  » Only Conceptual ones indicate TYPE
YAGO parses category names and tests whether the head of the name is plural; if so, it’s Conceptual
![Page 53: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/53.jpg)
Domain independent – Source dependent
YAGO/YAGO2
Looks at the Wikipedia structures to learn rules
![Page 54: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/54.jpg)
Domain independent – Source dependent
YAGO/YAGO2
![Page 55: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/55.jpg)
YAGO
Techniques are not general at all
› Limited to 14-100 hand-picked relations
  » Manually generate the relationships we want to look for
Great performance
› Able to extract 40 million facts in YAGO
› 80 million facts in YAGO2
![Page 56: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/56.jpg)
Web-At-Large Text Extraction
“Open Information Extraction”
Discovers rules/predicates on the fly
Does not require domain semantics or much human input.
› Run on the whole Web
Textrunner, Banko et al. ‘07
![Page 57: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/57.jpg)
Open Information Extraction - Textrunner
Self-Supervised Classifier
› Train extraction-classifier using data & features generated by (expensive) linguistic parser
› Dependency Parser
![Page 58: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/58.jpg)
Open Information Extraction - Textrunner
![Page 59: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/59.jpg)
Open Information Extraction - Textrunner
Result Assessment
› Tuple-extraction frequency counts
› Use heuristics
  » not a too-long parse dependency between the two NPs
  » neither NP is simply a pronoun
  » path between NPs does not pass a sentence-like boundary
  » etc.
› Use Naïve Bayes Classifier to find good extractions
  » Features:
  » part-of-speech tags
  » Number of tokens in a relation
  » whether an NP is a proper noun
![Page 60: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/60.jpg)
Open Information Extraction - Textrunner
Compared to Domain-dependent extraction
Better coverage
› It’s not restricted on the types of relations
› It’s not restricted on the domain
Lower precision
› Increase in recall results in lower precision
› More noise introduced from the Web-at-large
![Page 61: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/61.jpg)
Outline
Preliminaries
Information Extraction from text
Break (30 min)
Information Extraction from tables and lists
Web Information Networks
![Page 62: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/62.jpg)
Outline
Preliminaries
Information Extraction from text
Break (30 min)
Information Extraction from tables and lists
Web Information Networks
![Page 63: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/63.jpg)
Record Extraction
![Page 64: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/64.jpg)
Record Extraction
Find structured data in semi-structured HTML
• Find database tables (rows & columns) in a Web page
Data Record Extraction
List Extraction
WebTable Integration
![Page 65: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/65.jpg)
Example of Data Records
![Page 66: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/66.jpg)
Data Record Extraction
Mining Data Records from the Web (MDR), Liu et al. ’03
1. Generate Tag Tree
![Page 67: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/67.jpg)
MDR
2. Find Generalized Nodes
Generalized nodes are adjacent and have subtrees of the same size and depth with a certain string similarity
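A much-simplified sketch of the intuition behind this step, not the MDR algorithm itself: it compares only single adjacent siblings (MDR also considers combinations of several nodes) and swaps in `difflib.SequenceMatcher` as the similarity measure over tag strings. The 0.8 threshold and all function names are assumptions.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def tag_string(node):
    """Tags of a subtree in document order, e.g. ['li', 'a', 'span']."""
    return [element.tag for element in node.iter()]


def similar_adjacent_children(parent, threshold=0.8):
    """Group runs of adjacent children whose tag strings look alike."""
    children = list(parent)
    runs, current = [], []
    for left, right in zip(children, children[1:]):
        matcher = SequenceMatcher(None, tag_string(left), tag_string(right))
        if matcher.ratio() >= threshold:
            if not current:
                current = [left]
            current.append(right)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


PAGE = (
    "<div><h1>Results</h1>"
    "<li><a>x</a><span>1</span></li>"
    "<li><a>y</a><span>2</span></li>"
    "<li><a>z</a><span>3</span></li>"
    "<p>footer</p></div>"
)
for run in similar_adjacent_children(ET.fromstring(PAGE)):
    print([child.find("a").text for child in run])
```

The heading and footer are left out because their tag strings share nothing with the repeated `li` subtrees between them.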
![Page 68: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/68.jpg)
MDR
3. Match identical data records
![Page 69: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/69.jpg)
DEPTA
Zhai, Liu ‘05 DEPTA
• Structured Data Extraction from the Web based on Partial Tree Alignment
3. Match similar data records
![Page 70: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/70.jpg)
Record Extraction using Tag Path Clustering
Inverted Index
![Page 71: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/71.jpg)
Record Extraction using Tag Path Clustering
Derive similarities from the visual signal vectors
Distance between centers of gravity
Interleaving measure
Similarity measure
![Page 72: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/72.jpg)
Record Extraction using Tag Path Clustering
Similarity Matrix of tag paths
![Page 73: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/73.jpg)
MiBAT – Extraction of Records containing UGC
Song et al. ‘10 – Extracts data records containing user generated content (UGC)
![Page 74: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/74.jpg)
MiBAT
Finding Anchor Trees
• Nodes within the record that match across all subtrees
• Use those anchors to tie the data records together
• Those anchor trees need to be predefined
• Are a date, time, or some common structured text that a Regular Expression can find.
![Page 75: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/75.jpg)
DOM Record Extraction
Advantages
• Unsupervised
  » Only needs one page at a time
• Tag-agnostic
  » Doesn’t matter what the type of the HTML tag is
Disadvantages
• Precision and Recall vary
  » Depends on the Web page and assumptions of the algorithm
• HTML is not a schema
  » Misses AJAX, JavaScript, other HTTP calls
  » What is the purpose of HTML?
![Page 76: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/76.jpg)
Visual Based Record Extraction
Assumptions:
• HTML describes the structure of a document
• Repeating Patterns = Records
• HTML is a markup language
We need to render the Web page
![Page 77: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/77.jpg)
Visual Web Page Rendering
![Page 78: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/78.jpg)
VENTex – Visual Record Extraction
Gatterbauer et al. ‘07 Visual Record Extraction VENTex
• Towards Domain-Independent Information Extraction from Web Tables
![Page 79: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/79.jpg)
Visual Record Extraction
VENTex relies on lots of heuristics
Does not consider underlying DOM
![Page 80: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/80.jpg)
Hybrid List Extraction
Property 1: If box a is contained in box b, then b is an ancestor of a in the rendered box tree.
Property 2: If a and b are not related under property 1, then they do not overlap visually on the page.
Fumarola et al. ‘12 Hybrid List Extraction HyLiEn
![Page 81: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/81.jpg)
Candidate Generation based on Visual Features
A list candidate on a rendered Web page consists of a set of vertically and/or horizontally aligned boxes.
Two lists $l$ and $l'$ are related if they have an element in common.
A set of lists $S$ is a tiled structure if for every list $l \in S$ there exists at least one other list $l' \in S$ such that $l$ and $l'$ are related and $l \neq l'$. Lists in a tiled structure are called tiled lists.
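The definitions on this slide can be sketched over rendered box coordinates. The `Box` type and the exact-coordinate alignment test are assumptions for illustration; a real renderer would need tolerances, and this is not the HyLiEn implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Box(NamedTuple):
    name: str
    x: int  # left edge of the rendered box
    y: int  # top edge of the rendered box


def list_candidates(boxes):
    """Group boxes that share a left edge (vertical) or a top edge (horizontal)."""
    by_x, by_y = defaultdict(list), defaultdict(list)
    for box in boxes:
        by_x[box.x].append(box.name)
        by_y[box.y].append(box.name)
    groups = list(by_x.values()) + list(by_y.values())
    return [frozenset(group) for group in groups if len(group) > 1]


def related(first, second):
    """Two lists are related if they have an element in common."""
    return bool(first & second)


def is_tiled(lists):
    """Every list must be related to at least one other, different list."""
    return len(lists) > 1 and all(
        any(one != other and related(one, other) for other in lists)
        for one in lists
    )


grid = [Box("a", 0, 0), Box("b", 100, 0), Box("c", 0, 50), Box("d", 100, 50)]
column = [Box("a", 0, 0), Box("c", 0, 50), Box("e", 0, 100)]
print(is_tiled(list_candidates(grid)), is_tiled(list_candidates(column)))
```

A 2x2 grid yields two vertical and two horizontal candidates that share corner boxes, so it is tiled; a single column yields one vertical list with nothing to relate to.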
![Page 82: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/82.jpg)
Output: Web page annotated
Tiled List
Vertical List
Horizontal List
![Page 83: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/83.jpg)
HyLiEn
![Page 84: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/84.jpg)
HyLiEn
RESTful service: http://dmserv1.cs.illinois.edu/listextractorservice.listextractorsvc.svc/extract/xml/?url= http://cs.illinois.edu/people/faculty
61 Faculty
Tarek A.
Sarita A.
Vikram A.
…and 58 more…
![Page 85: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/85.jpg)
Let’s take a look at a single record
Tarek A.
Name & Link
Title
Phone
Research
![Page 86: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/86.jpg)
Let’s take a look at ANOTHER record
Vikram A.
Name & Link
Title
Phone
Research
![Page 87: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/87.jpg)
Visual Record Extraction
Advantages
• More accurate than DOM-methods
• Unsupervised
» Only needs one page at a time
• Tag-agnostic
» Doesn’t matter what the type of the HTML tag is
Disadvantages
• Precision and Recall vary
» Depends on the Web page and assumptions of the algorithm
» Precision not as good as tag-gnostic methods
» Recall not as good as wrappers
![Page 88: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/88.jpg)
Integrating Web data
![Page 89: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/89.jpg)
WebTables
Cafarella et al. ‘08 – WebTables: Exploring the Relational Web
In a corpus of 14B raw tables, they estimate 154M are “good” relations
› Single-table databases; Schema = attr labels + types
› Largest corpus of databases & schemas available
The WebTables system:
› Recovers good relations from crawl and enables search
› Builds novel apps on the recovered data
![Page 90: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/90.jpg)
Bad table
WebTables
Good table
Slide courtesy Cafarella & Halevy
![Page 91: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/91.jpg)
Some Challenges
Data is semi-structured:
› No schema
› Columns do not have uniform type
› Quality varies a lot
› Finding real tables is hard, as is extraction
Data is about everything.
› You can’t build a schema over everything
![Page 92: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/92.jpg)
Vertical Tables
Slide courtesy Cafarella & Halevy
![Page 93: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/93.jpg)
Winners of the Boston Marathon
Slide adapted from Cafarella & Halevy
…but that information is nowhere in the table
![Page 94: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/94.jpg)
Much better, but schema extraction is needed
Slide courtesy Cafarella & Halevy
![Page 95: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/95.jpg)
Schema Ok, but context is subtle (year = 2006)
Slide courtesy Cafarella & Halevy
![Page 96: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/96.jpg)
Population Table #2
Slide courtesy Cafarella & Halevy
![Page 97: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/97.jpg)
Asian Population Table
Slide courtesy Cafarella & Halevy
![Page 98: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/98.jpg)
WebTables: Exploring the Relational Web
In a corpus of 14B raw tables, Cafarella et al. estimate 154M are “good” relations
› Single-table databases; Schema = attr labels + types
› Largest database ever!
The WebTables system:
› Recovers good relations from crawl and enables search
› Builds novel apps on the recovered data
![Page 99: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/99.jpg)
WebTables
Raw HTML Tables → Recovered Relations → Relation Search
Inverted Index
Attribute Correlation Statistics Db
Job-title, company, date: 104
Make, model, year: 916
Rbi, ab, h, r, bb, avg, slg: 12
Dob, player, height, weight: 4
…
• 2.6M distinct schemas
• 5.4M attributes
Slide courtesy Cafarella & Halevy
![Page 100: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/100.jpg)
Synonym Discovery
Use schema statistics to automatically compute attribute synonyms
› More complete than a thesaurus
Given input “context” attribute set C:
1. A = all attrs that appear with C
2. P = all (a,b) where a∈A, b∈A, a≠b
3. Remove all (a,b) from P where p(a,b)>0
4. For each remaining pair (a,b), compute a synonym score
Slide courtesy Cafarella & Halevy
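The four steps of synonym discovery can be sketched as below. The scoring formula did not survive in the slide text, so this sketch assumes a score of the form p(a)p(b) / (ε + Σ_z (p(z|a,C) − p(z|b,C))²), which is our reading of the WebTables paper rather than something confirmed by the slide. The function name `synonym_scores`, the `eps` value, and the toy schemas are all hypothetical.

```python
from itertools import combinations


def synonym_scores(schemas, context, eps=0.01):
    """Score attribute pairs as synonym candidates given a context set C.

    `schemas` is a list of attribute sets. The score form and `eps` are
    assumptions of this sketch, not values taken from the slide.
    """
    schemas = [frozenset(s) for s in schemas]
    context = frozenset(context)
    n = len(schemas)

    def count(attrs):
        return sum(1 for s in schemas if attrs <= s)

    def p(attrs):
        return count(frozenset(attrs)) / n

    def p_given(z, a):
        # p(z | a, C): how often z appears in schemas containing a and C.
        denom = count(context | {a})
        return count(context | {a, z}) / denom if denom else 0.0

    # Step 1: A = all attributes that appear together with C.
    A = set().union(*(s for s in schemas if context <= s)) - context
    scores = {}
    # Step 2: all unordered pairs (a, b) with a != b.
    for a, b in combinations(sorted(A), 2):
        # Step 3: synonyms should never co-occur in the same schema.
        if p({a, b}) > 0:
            continue
        # Step 4: reward pairs whose co-occurrence profiles look alike.
        diff = sum((p_given(z, a) - p_given(z, b)) ** 2 for z in A - {a, b})
        scores[(a, b)] = p({a}) * p({b}) / (eps + diff)
    return scores


toy_schemas = [
    {"name", "phone", "email"},
    {"name", "telephone", "email"},
    {"name", "phone", "email", "address"},
    {"name", "telephone", "address"},
]
scores = synonym_scores(toy_schemas, context={"name"})
```

In the toy corpus, `phone` and `telephone` are the only pair that never shares a schema, so they are the only surviving candidate.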
![Page 101: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/101.jpg)
Synonym Discovery Examples
Input context: synonym pairs
name: e-mail|email, phone|telephone, e-mail_address|email_address, date|last_modified
instructor: course-title|title, day|days, course|course-#, course-name|course-title
elected: candidate|name, presiding-officer|speaker
ab: k|so, h|hits, avg|ba, name|player
sqft: bath|baths, list|list-price, bed|beds, price|rent
Slide courtesy Cafarella & Halevy
![Page 102: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/102.jpg)
More Work on WebTables
Annotate the data in WebTables with ontology information extracted earlier
[Figure: a catalog is used to annotate a Web table with Title and Author columns]
Catalog type hierarchy: Physicist, Person, Entity, Book
Catalog entities and lemmas: P22 Albert Einstein; B94 The Time and Space of Uncle Albert; B95 Uncle Albert and the Quantum Quest; B41 Relativity: The Special…
Catalog relations: Writes(Book,Person), bornAt(Person,Place), leader(Person,Country)
Annotations added to the table: Type label, Relation label, Entity label
Table rows (Title, Author):
Uncle Albert and the Quantum Quest, Russell Stannard
Relativity: The Special and the General Theory, A Einstein
Uncle Petros and the Goldbach conjecture, A Doxiadis
![Page 103: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/103.jpg)
Further Challenges
Noisy data
› A. Einstien vs Albert Einstein vs Einstien
Ambiguity of entity names
› “Michael Jordan” is both a computer scientist and an athlete
Missing type links in Ontology
› Universities in Rome -> Universities in Italy
![Page 104: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/104.jpg)
Outline
Preliminaries
Information Extraction
Break (30 min)
Information Integration
Web Information Networks
![Page 105: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/105.jpg)
Hyperlink Networks as Homogeneous Info. Networks
![Page 106: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/106.jpg)
Homogeneous Networks lack class
The IMDB Movie Network
Actor, Movie, Director, Movie Studio
The Facebook Network
Heterogeneous networks have type information
![Page 107: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/107.jpg)
Hyperlink Networks as Heterogeneous Info. Networks
![Page 108: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/108.jpg)
Hyperlink Networks as Heterogeneous Info. Networks
Name, Phone, Office, Age, Gender, Email
Author, Dateline, Topic, Persons, Location
![Page 109: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/109.jpg)
Homogeneous -> Heterogeneous Information Networks
Task – Heterogenize the Web
Classification Task with many nuances
› What are the classes?
› Class granularity?
› How do we predict the types computationally?
![Page 110: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/110.jpg)
Heterogenization
What is this thing?
ANIMAL, PERSON, PROFESSOR, FULL PROFESSOR, MAN, DATA MINER, MALE-FULL PROFESSOR-DATA MINER?
![Page 111: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/111.jpg)
Heterogenization
ANIMAL, PERSON, PROFESSOR, FULL PROFESSOR, MAN, DATA MINER, MALE-FULL PROFESSOR-DATA MINER?
This is the goal!
The answer is important
We use these results to do other things
HINT - The network tells us
![Page 112: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/112.jpg)
Extracting Typed-Information networks from the Hierarchical Web
![Page 113: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/113.jpg)
Web Hierarchies
An object’s location within the network indicates:
› Its class
› Its relative class
Network Hierarchy
› Networks have a hidden Hierarchy
» Note: hidden latent
If we can organize a graph according to its hierarchy:
› Information extraction becomes easier
› Topic models become more expressive
› Information retrieval models can be enhanced
› etc.
![Page 114: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/114.jpg)
Some Methods create/learn Taxonomies
Hierarchical LDA (hLDA) Blei et al. ’03,10
TopicBlock Ho et al. ‘12
Hierarchical Pachinko Allocation Model (hPAM) Mimno et al. ’07
![Page 115: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/115.jpg)
We are interested in Hierarchies
Hierarchical Document Topic Model (HDTM) Weninger et al ‘12
![Page 116: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/116.jpg)
From Taxonomies to Hierarchies - Change the Stochastic Process
Major difference: items (documents) can live at non-leaf nodes
• How is this accomplished?
Change the Stochastic Model – CRP, nCRP, SB, DSB
• Random Walk – Brownian Motion
• Especially random walks on a graph
• PageRank – Random Surfer Model
• Jump to a random node with some probability
• Random Walk with Restart (RWR/PPR)
• Jump back to the starting point (root) with some probability
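To make the random walk with restart concrete, here is a minimal power-iteration sketch. It is not the HDTM implementation: the function name `random_walk_with_restart`, the restart probability `gamma=0.15`, the handling of dead ends, and the toy site graph are all assumptions chosen for illustration.

```python
def random_walk_with_restart(graph, root, gamma=0.15, iterations=100):
    """Approximate RWR visit probabilities by power iteration.

    `graph` maps each node to the list of nodes it links to. With
    probability `gamma` the walker jumps back to `root`; otherwise it
    follows a uniformly chosen out-link. Dead ends also return to `root`.
    """
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    prob = {v: 0.0 for v in nodes}
    prob[root] = 1.0
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for u, mass in prob.items():
            targets = graph.get(u, [])
            if targets:
                nxt[root] += gamma * mass
                share = (1.0 - gamma) * mass / len(targets)
                for v in targets:
                    nxt[v] += share
            else:
                nxt[root] += mass  # dead end: restart at the root
        prob = nxt
    return prob


# Toy site: the root links to CS directly and through a longer path.
site = {
    "Illinois": ["Academics", "CS"],
    "Academics": ["Engineering"],
    "Engineering": ["CS"],
    "CS": [],
}
scores = random_walk_with_restart(site, "Illinois")
```

Because every step away from the root loses a `gamma` share of probability to the restart, pages deeper along a chain score lower than pages reachable directly, which is the property that makes the root-anchored walk useful for inferring parent-child structure.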
![Page 117: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/117.jpg)
The Generative Story
HDTM
![Page 118: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/118.jpg)
Drawing paths
p(Ill. → Acad. → Eng. → CS) = log(1−γn) + log(1−γm) + log(1−γl)
p(Ill. → CS) = log(1−γn)
![Page 119: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/119.jpg)
The Generative Story
![Page 120: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/120.jpg)
Sample paths
Similar to standard LDA
RWR Probability
![Page 121: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/121.jpg)
The Generative Story
![Page 122: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/122.jpg)
Sample Words for a topic/document
Similar to standard LDA
RWR Probability
![Page 123: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/123.jpg)
Sample words
Clip of Wikipedia Graph rooted at COMPUTER SCIENCE
𝑐2𝑐1
![Page 124: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/124.jpg)
Example: Hierarchy inferred from Web graph
Colleges
Departments
Engineering Departments
![Page 125: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/125.jpg)
What does this give us?
Given a rooted graph we find a hierarchy
› Random Walk with Restart generates parenthood probabilities
This gives us one possible hierarchy. There are many.
New Challenge - Can’t label
X
Y <: X
Z <: Y
W <: Z
![Page 126: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/126.jpg)
Set of similarly typed pages
What can we say about these pages?› Class Label/Type?› Name?
![Page 127: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/127.jpg)
Exploring Link Paths
Let’s explore link-paths in a hierarchy
Hierarchy #1: People → Faculty → Jiawei Han → Personal Site
Hierarchy #2: Research → Data Mining → Jiawei Han → Personal Site
![Page 128: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/128.jpg)
Exploring Link Paths
What do these pages have in common?
Hierarchy #1: People → Faculty
Hierarchy #2: Research → Data Mining
Name, Phone, Office, Age, Gender, Email
Next Step
![Page 129: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/129.jpg)
Table/Record Attribute Extraction
![Page 130: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/130.jpg)
Extract database records from the Web
RESTful service: http://dmserv1.cs.illinois.edu/listextractorservice.listextractorsvc.svc/extract/xml/?url= http://cs.illinois.edu/people/faculty
61 Faculty
Tarek A.
Sarita A.
Vikram A.
…and 58 more…
![Page 131: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/131.jpg)
Attribute Propagation
Propagate information through the link paths
Name, Phone, Office, Fax, Research, Email
![Page 132: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/132.jpg)
Attribute Propagation Results
Columns match within a Web site (a single hierarchy)
› Columns do not match outside of a hierarchy
Columns cannot be labeled easily.
CalTech, Iowa St., Norfolk St., Stanford
![Page 133: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/133.jpg)
Link Paths for Known Item Search
Anchor texts look like queries.
› Often resemble database records too
› Let’s match Web pages to improve Web search
HT’12
Hierarchy #1: People → Faculty → Jiawei Han → Personal Site
Hierarchy #2: Research → Data Mining → Jiawei Han → Personal Site
#1
![Page 134: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/134.jpg)
Link Paths for Known Item Retrieval
Known Item Retrieval using BM25F
› Fields – Slope determines importance
» Content
» Incoming anchor text (BLP)
» Link Paths (FLP)
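A simplified BM25F scorer shows how the three fields could be combined. This is a sketch under stated assumptions, not the tutorial's system: the function name `bm25f_score`, the field names, the field weights, the shared `b` parameter across fields, and all corpus statistics below are hypothetical.

```python
import math


def bm25f_score(query_terms, doc, field_weights, avg_len, doc_freq, n_docs,
                k1=1.2, b=0.75):
    """Score one document with a simplified BM25F.

    `doc` maps a field name to its token list. Per-field term frequencies
    are length-normalized, weighted, and summed before saturation.
    """
    score = 0.0
    for term in query_terms:
        pseudo_tf = 0.0
        for field, weight in field_weights.items():
            tokens = doc.get(field, [])
            if not tokens:
                continue
            norm = 1.0 - b + b * len(tokens) / avg_len[field]
            pseudo_tf += weight * tokens.count(term) / norm
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score


# Hypothetical page with content, incoming anchor text, and a link path.
page = {
    "content": ["data", "mining", "research", "group"],
    "anchor": ["jiawei", "han"],
    "link_path": ["people", "faculty", "jiawei", "han"],
}
avg_len = {"content": 4.0, "anchor": 2.0, "link_path": 4.0}
doc_freq = {"faculty": 20, "han": 2}
query = ["faculty", "han"]

with_paths = bm25f_score(
    query, page, {"content": 1.0, "anchor": 2.0, "link_path": 3.0},
    avg_len, doc_freq, n_docs=100)
without_paths = bm25f_score(
    query, page, {"content": 1.0, "anchor": 2.0, "link_path": 0.0},
    avg_len, doc_freq, n_docs=100)
```

The query term "faculty" appears only in the link path, so the page scores higher once that field carries weight; this is the intuition behind adding link paths as a retrieval field.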
![Page 135: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/135.jpg)
So what does all this tell us?
What are the other objects?
![Page 136: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/136.jpg)
So what does all this tell us?
What type of object is this?
People, Faculty, Data Mining, Research
![Page 137: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/137.jpg)
So what does all this tell us?
What attributes describe this object?
![Page 138: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/138.jpg)
So what does all this tell us?
How can we best search for this object?
People → Faculty → Jiawei Han → Personal Site
Research → Data Mining → Jiawei Han → Personal Site
…
![Page 139: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/139.jpg)
Graph Search
![Page 140: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/140.jpg)
New types of search - Web Meta-Paths
Objects are connected together via different types of relationships!
› Results from Notre Dame Network collected from the Web
“Bowyer-Viz-Flynn”
“Flynn-CCL-Thain”
“Flynn-CCL-Emrich”
Prof-Group-Prof
“CSE40151 - Bowyer-Viz-Flynn – CSE40535”
“CSE40535 - Flynn-CCL-Thain – CSE20211”
“CSE40535 - Flynn-CCL-Emrich – CSE40532”
Course-Prof-Group-Prof-Course
![Page 141: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/141.jpg)
New types of search - Web Meta-Paths
Objects are connected together via different types of relationships!
› Results from Kentucky Network collected from the Web
“Seales-Viz-Jacobs”
“Seales-Viz-Yang”
“Griffioen-EDUCE-Seales”
Prof-Group-Prof
“CS636-Jacobs-Viz-Yang-CS738”
“CS215-Seales-Viz-Yang-CS738”
“CS485-Griffioen-EDUCE-Seales-CS215”
Course-Prof-Group-Prof-Course
![Page 142: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/142.jpg)
New types of search - Web Meta-Paths
Objects are connected together via different types of relationships!
› Results from New Mexico Network collected from the Web
“Luger-AI-Lane”
“Dorian-SSL-Patrick”
“Lance-SciViz-John”
Prof-Group-Prof
“CS 341 - Dorian-SSL-Patrick – CS 442”
“CS 481 - Dorian-SSL-Patrick – CS 481”
“CS 357 - Lance-AI-Stephanie – CS 691”
Course-Prof-Group-Prof-Course
![Page 143: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/143.jpg)
New types of search - Web Meta-Paths
Objects are connected together via different types of relationships!
› Results from Nebraska Network collected from the Web
“Hong-ADSL-David”
“Matthew-E2-Myra”
“Myra-E2-Anita”
Prof-Group-Prof
“CS 432/832 - Hong-ADSL-David – N/A”
“CS 496/896 - Matthew-E2-Myra – CS 990”
“CS 990 - Myra-E2-Anita – CS 361”
Course-Prof-Group-Prof-Course
![Page 144: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/144.jpg)
New types of search - Web Meta-Paths
Objects are connected together via different types of relationships!
› Results from Illinois Network collected from the Web
“Han-DAIS-Zhai”
“Chang-DAIS-Han”
“Roth-AI-Hockenmaier”
Prof-Group-Prof
“CS412 - Han-DAIS-Zhai – CS410”
“CS512 - Chang-DAIS-Han – CS512”
“CS446 - Roth-AI-Hockenmaier – CS440”
Course-Prof-Group-Prof-Course
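Meta-path results like the ones on these slides can be produced by enumerating paths whose node types follow a given pattern. The sketch below is one simple way to do that on a small typed graph; the function name `meta_path_instances` and the edge list are illustrative assumptions loosely based on the Illinois examples, not the system used to produce the slide results.

```python
def meta_path_instances(edges, node_type, meta_path):
    """Enumerate simple paths whose node types follow `meta_path`.

    `edges` is an iterable of undirected (u, v) pairs and `node_type`
    maps each node to its type label.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    def extend(path):
        if len(path) == len(meta_path):
            yield tuple(path)
            return
        wanted = meta_path[len(path)]
        for nxt in sorted(neighbors.get(path[-1], ())):
            if node_type[nxt] == wanted and nxt not in path:
                yield from extend(path + [nxt])

    for start in sorted(n for n in node_type if node_type[n] == meta_path[0]):
        yield from extend([start])


# Hypothetical typed graph; the edges are assumed for illustration.
node_type = {
    "Han": "Prof", "Zhai": "Prof", "Chang": "Prof",
    "DAIS": "Group", "CS412": "Course", "CS410": "Course",
}
edges = [("Han", "DAIS"), ("Zhai", "DAIS"), ("Chang", "DAIS"),
         ("CS412", "Han"), ("CS410", "Zhai")]

prof_group_prof = list(
    meta_path_instances(edges, node_type, ["Prof", "Group", "Prof"]))
course_paths = list(
    meta_path_instances(
        edges, node_type, ["Course", "Prof", "Group", "Prof", "Course"]))
```

Each path is reported once per direction, so "Han-DAIS-Zhai" and "Zhai-DAIS-Han" both appear; a real system would likely deduplicate symmetric meta-paths and rank the instances.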
![Page 145: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/145.jpg)
Typifying the Web
What to do with a Typed Web?
› Query Processing
» Looking for people, professors, CEOs, etc.?
› New Search Techniques
» Return structured search results for unstructured query
Typed Graphs
› NINA project
» Large scale heterogeneous information network analysis toolkit
• Graph generation, graph statistics, classification, clustering, etc.
» On github - https://github.com/tweninger/nina
![Page 146: Information Network Analysis and Extraction Extraction and Integration of the Semi-Structured Web](https://reader035.vdocuments.site/reader035/viewer/2022062814/5681688f550346895ddf1507/html5/thumbnails/146.jpg)
Thank you