Text Analytic Summit 2010
![Page 1: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/1.jpg)
It's not what you said, it's how you said it.
Jamie Taylor, Ph.D.
Text Analytic Summit Boston 2010
![Page 2: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/2.jpg)
What do y'all mean, "Semantics"?
The Web! Now with Better Flavor!
![Page 3: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/3.jpg)
![Page 4: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/4.jpg)
![Page 5: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/5.jpg)
![Page 6: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/6.jpg)
May 2001
Tim Berners-Lee, James Hendler and Ora Lassila
![Page 7: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/7.jpg)
The Cake
taken from http://www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/layerCake-4.png
The Semantic Web?
![Page 8: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/8.jpg)
Linked Open Data
![Page 9: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/9.jpg)
The Real Web
http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
![Page 10: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/10.jpg)
![Page 11: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/11.jpg)
Wish it were real
![Page 12: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/12.jpg)
Might be real
![Page 13: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/13.jpg)
Is real, but don't believe it
![Page 14: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/14.jpg)
Is currently useful
![Page 15: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/15.jpg)
Entities
![Page 16: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/16.jpg)
Identifiers
Bono, A.K.A. Paul David Hewson
http://rdf.freebase.com/ns/en.paul_david_hewson
Side Step Polysemy
![Page 17: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/17.jpg)
Vocabulary
Manufactures
http://rdf.freebase.com/ns/automotive.make.model_s
![Page 18: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/18.jpg)
A socially managed semantic database
![Page 19: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/19.jpg)
Freebase has Many Types of Things
![Page 20: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/20.jpg)
![Page 21: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/21.jpg)
![Page 22: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/22.jpg)
Many Strong Identifiers
http://www.ellerdale.com/topics/view/0080-6ba0
http://rdf.freebase.com/ns/en.berlin_wall
http://musicbrainz.org/artist/7f347782-eb14-40c3-98e2-17b6e1bfe56c
http://rdf.freebase.com/ns/authority.musicbrainz.7f347782-eb14-40c3-98e2-17b6e1bfe56c
http://www.bbc.co.uk/music/artists/7f347782-eb14-40c3-98e2-17b6e1bfe56c
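The three MusicBrainz-keyed URLs above embed the same GUID, which is what makes them usable as a shared key across sites. A minimal sketch of that idea follows; the helper names `shared_key` and `same_entity` are hypothetical, and the regular expression is only an assumption about what a GUID looks like in these URLs.

```python
import re

# URLs copied from the slide; all three embed the same MusicBrainz GUID.
URLS = [
    "http://musicbrainz.org/artist/7f347782-eb14-40c3-98e2-17b6e1bfe56c",
    "http://rdf.freebase.com/ns/authority.musicbrainz.7f347782-eb14-40c3-98e2-17b6e1bfe56c",
    "http://www.bbc.co.uk/music/artists/7f347782-eb14-40c3-98e2-17b6e1bfe56c",
]

# Assumed GUID shape: 8-4-4-4-12 lowercase hexadecimal digits.
GUID = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def shared_key(url):
    """Return the GUID embedded in a URL, or None if there is none."""
    match = GUID.search(url)
    return match.group(0) if match else None


def same_entity(urls):
    """True when every URL carries the same, non-missing GUID."""
    keys = {shared_key(u) for u in urls}
    return len(keys) == 1 and None not in keys
```

Here `same_entity(URLS)` holds for the three URLs, while an identifier such as `http://rdf.freebase.com/ns/en.berlin_wall` carries no GUID and `shared_key` returns `None` for it.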
![Page 23: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/23.jpg)
350 Million Relations
12 Million Entities
![Page 24: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/24.jpg)
Users extend the data model
Users contribute data
![Page 25: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/25.jpg)
schema = vocabulary
![Page 26: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/26.jpg)
A range of vocabularies...
1500 types with 500+ instances!!
![Page 27: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/27.jpg)
Growing Freebase
![Page 28: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/28.jpg)
Reconciliation
+=
![Page 29: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/29.jpg)
Reconciliation
Record Matching
Identity Matching
Collective Entity Resolution
Relational Learning
Record Linking
Equivalence Mining
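Whatever the name, the core question is whether two records describe the same entity. The sketch below uses the "Harrison Ford" records from the reconciliation slides that follow: equal names are not enough, so it also compares the films attached to each record. The Jaccard measure, the 0.3 cutoff, and the record layout are illustrative assumptions, not the method used by Freebase.

```python
def jaccard(a, b):
    """Overlap between two collections of related names, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_person(rec_a, rec_b, threshold=0.3):
    """Hypothetical rule: equal names AND enough shared films."""
    return (rec_a["name"] == rec_b["name"]
            and jaccard(rec_a["films"], rec_b["films"]) >= threshold)


# Records taken from the slides; the dictionary layout is an assumption.
source = {"name": "Harrison Ford", "films": ["Excuse Me", "Vanity Fair"]}
candidate_a = {"name": "Harrison Ford", "films": ["Excuse Me", "Maytime"]}
candidate_b = {"name": "Harrison Ford", "films": ["Fugitive", "Blade Runner"]}
```

With these records, `same_person(source, candidate_a)` is true because the two share "Excuse Me", and `same_person(source, candidate_b)` is false because the film lists do not overlap.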
![Page 30: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/30.jpg)
Reconciliation
"Harrison Ford"
"Excuse Me"
"Vanity Fair"
"Harrison Ford"
"Excuse Me"
"Maytime"
![Page 31: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/31.jpg)
Reconciliation
"Harrison Ford"
"Excuse Me"
"Vanity Fair"
"Harrison Ford"
"Fugitive"
"Blade Runner"
![Page 32: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/32.jpg)
A Graph of Entities
![Page 33: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/33.jpg)
Vocabulary
education
nationality
located
education
plays-in
plays-in
performed-at
created
released-by
located
contains
![Page 34: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/34.jpg)
![Page 35: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/35.jpg)
Reconciliation as "understanding"
education
nationality
located
education
plays-in
plays-in
performed-at
created
released-by
located
contains
![Page 36: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/36.jpg)
http://data.labs.freebase.com/recon/
{
  "/type/object/name": "Blade Runner",
  "/type/object/type": "/film/film",
  "/film/film/starring/actor": ["Harrison Ford", "Rutger Hauer"],
  "/film/film/director": "Ridley Scott",
  "/film/film/release_date_s": "1981"
}

[
  {
    "id": "/guid/9202a8c04000641f8000000000009e89",
    "name": ["Blade Runner", "Bladerunner"],
    "score": 1.4320519,
    "match": true,
    "type": ["/common/topic", "/film/film", "/media_common/adapted_work", "/award/award_winning_work", ...... ]
  },
  {
    "id": "/guid/9202a8c04000641f80000000002643d0",
    "name": ["Blade"],
    "score": 0.48852453,
    "match": false,
    "type": ["/common/topic", "/film/film", "/award/award_winning_work", "/award/award_nominated_work", ....... ]
  },
  {
    "id": "/guid/9202a8c04000641f800000000e5daaae",
    "name": ["Blade"],
    "score": 0.46398318,
    "match": false,
    .....
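A minimal sketch of how a client might consume a response like the one on this slide. The slide does not show how the query is sent to the service, so the sketch only builds the query object and picks a result from a response that is trimmed to the fields shown in full. The function name `pick_match` is hypothetical.

```python
import json

# Query fields copied from the slide.
query = {
    "/type/object/name": "Blade Runner",
    "/type/object/type": "/film/film",
    "/film/film/starring/actor": ["Harrison Ford", "Rutger Hauer"],
    "/film/film/director": "Ridley Scott",
    "/film/film/release_date_s": "1981",
}

# Response trimmed to id, name, score and match; the "type" lists are
# elided on the slide and are left out here rather than guessed.
response_text = """
[
  {"id": "/guid/9202a8c04000641f8000000000009e89",
   "name": ["Blade Runner", "Bladerunner"], "score": 1.4320519, "match": true},
  {"id": "/guid/9202a8c04000641f80000000002643d0",
   "name": ["Blade"], "score": 0.48852453, "match": false},
  {"id": "/guid/9202a8c04000641f800000000e5daaae",
   "name": ["Blade"], "score": 0.46398318, "match": false}
]
"""


def pick_match(candidates):
    """Return the highest-scoring candidate flagged as a match, or None."""
    matched = [c for c in candidates if c.get("match")]
    return max(matched, key=lambda c: c["score"]) if matched else None


candidates = json.loads(response_text)
best = pick_match(candidates)
```

Relying on the service's own `match` flag, rather than re-deriving a cutoff from `score`, keeps the client independent of how scores are scaled.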
![Page 37: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/37.jpg)
Data Everywhere
![Page 38: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/38.jpg)
![Page 39: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/39.jpg)
Wikipedia Features
![Page 40: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/40.jpg)
Wikipedia Features
Error Prone -- Usually <99%
X
X
![Page 41: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/41.jpg)
![Page 42: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/42.jpg)
![Page 43: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/43.jpg)
![Page 44: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/44.jpg)
![Page 45: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/45.jpg)
WEX
calculate feature counts per type
intersect the two sources
generate type scores for topics
join feature counts with topics
extract features
get types
(Machine) Learning Semantics
37M features
2.4M features
1400 types
5M type assertions
2.8M Wikipedia topics
5M articles
1.6G scores
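The pipeline labels above can be read as: extract features from articles, get known types, intersect the two sources, count features per type, then score each topic against each type. The sketch below walks through those steps on toy data. The step order is one reading of the slide, the feature names and topics are invented, and the smoothed log-ratio score is an assumption, since the slide does not say how scores were computed.

```python
from collections import Counter, defaultdict
import math

# Toy stand-ins for the two sources (all values invented for illustration).
features = {            # topic -> features extracted from its article
    "t1": {"born", "infobox_person"},
    "t2": {"born", "infobox_person", "died"},
    "t3": {"infobox_film", "directed_by"},
    "t4": {"born", "died"},            # untyped topic we want to score
}
types = {"t1": "/people/person", "t2": "/people/person", "t3": "/film/film"}


def feature_counts_per_type(features, types):
    """Intersect the two sources and count features for each type."""
    counts = defaultdict(Counter)
    for topic in features.keys() & types.keys():
        counts[types[topic]].update(features[topic])
    return counts


def type_scores(topic_features, counts, smoothing=1.0):
    """Score each type by summed log-ratios of in-type vs. out-of-type counts."""
    scores = {}
    for t in counts:
        score = 0.0
        for f in topic_features:
            inside = counts[t][f] + smoothing
            outside = sum(c[f] for u, c in counts.items() if u != t) + smoothing
            score += math.log(inside / outside)
        scores[t] = score
    return scores


counts = feature_counts_per_type(features, types)
scores = type_scores(features["t4"], counts)
```

On this toy data the untyped topic `t4` scores higher for `/people/person` than for `/film/film`, because "born" and "died" were only seen on person topics.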
![Page 46: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/46.jpg)
/people/person distribution
untyped topics, person topics, other topics, all topics
Data courtesy Viral Shah
![Page 47: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/47.jpg)
RABJ: Humans in the loop
![Page 48: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/48.jpg)
Thresholding Results
99% threshold at 16.75
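One way to arrive at a cutoff like this is to have people judge a sample of scored assertions and then find the lowest score at which the accepted set still meets the target precision. The sketch below assumes that procedure; the slides do not state how the threshold was actually derived, and the judged sample here is invented toy data.

```python
def precision_threshold(judged, target=0.99):
    """Lowest score cutoff whose accepted set meets the target precision.

    `judged` is a list of (score, is_correct) pairs from human review.
    Everything scoring at or above the returned cutoff is accepted.
    Returns None when no cutoff reaches the target.
    """
    ranked = sorted(judged, key=lambda pair: pair[0], reverse=True)
    best = None
    correct = 0
    for i, (score, is_correct) in enumerate(ranked):
        correct += bool(is_correct)
        # Only evaluate a cutoff once every item tied at this score is counted.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie and correct / (i + 1) >= target:
            best = score
    return best


# Invented toy judgments: (score, judged correct?)
judged = [(9.0, True), (7.5, True), (6.0, True), (4.0, False), (2.0, True)]
cutoff = precision_threshold(judged)
```

Lowering `target` trades precision for coverage: with `target=0.75` the same toy sample admits every judged item.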
![Page 49: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/49.jpg)
/people/person assertions
threshold
53K /people/person assertions
![Page 50: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/50.jpg)
Training Wheels?
Semantics are Everywhere
![Page 51: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/51.jpg)
![Page 52: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/52.jpg)
![Page 54: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/54.jpg)
Widgets: Content Tags
![Page 55: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/55.jpg)
![Page 56: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/56.jpg)
![Page 57: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/57.jpg)
![Page 58: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/58.jpg)
Explicit Semantics
![Page 59: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/59.jpg)
![Page 60: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/60.jpg)
Rich Snippets
<div class="post-item restaurant-gen-info hreview-aggregate">
  <div class="item vcard">
    <h1 class="fn org">Taylor's Refresher</h1>
    <div class="address">
    <div class="ratings">
      <ul class="star-rating-2 rating" title="4.0 star rating across 3 ratings">
        <li class="current-rating average" style="width:80%;">4.0 star rating</li>
        <li class="star"> </li>
        <li class="star"> </li>
        <li class="star"> </li>
        <li class="star"> </li>
        <li class="star"> </li>
      </ul>
      <div class="rating-stats">
        <span class="rating"> <span class="average">4.0</span> </span>
        rating over <span class="count">1</span> review
      </div>
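The class names in this markup (`fn`, `average`, `count`) are what a consumer of the microformat reads. A minimal sketch of such a reader, using only Python's standard `html.parser`, follows. The `SNIPPET` is a trimmed excerpt of the slide's markup with its tags closed so that it parses cleanly; the class `HReviewAggregate` is a hypothetical name, and the parser assumes well-formed markup without void elements.

```python
from html.parser import HTMLParser

# Trimmed, closed excerpt of the markup on the slide.
SNIPPET = """
<div class="item vcard">
  <h1 class="fn org">Taylor's Refresher</h1>
</div>
<div class="rating-stats">
  <span class="rating"> <span class="average">4.0</span> </span>
  rating over <span class="count">1</span> review
</div>
"""


class HReviewAggregate(HTMLParser):
    """Collect the text of elements whose class list contains a wanted name."""

    WANTED = {"fn", "average", "count"}

    def __init__(self):
        super().__init__()
        self.values = {}
        self._stack = []  # one entry per open element: wanted name or None

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        hit = sorted(classes & self.WANTED)
        self._stack.append(hit[0] if hit else None)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        name = self._stack[-1] if self._stack else None
        # Keep the first non-blank text seen for each wanted class.
        if name and data.strip() and name not in self.values:
            self.values[name] = data.strip()


parser = HReviewAggregate()
parser.feed(SNIPPET)
```

After parsing, `parser.values` holds the business name, the average rating, and the review count as strings.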
![Page 61: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/61.jpg)
![Page 62: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/62.jpg)
microformats
HTML5 MicroData
Open Graph Protocol
RDFa
![Page 63: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/63.jpg)
Explicit Semantics in Surprising Places
![Page 64: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/64.jpg)
Blog Tags::Entities
![Page 65: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/65.jpg)
Metaweb Topic Block
![Page 66: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/66.jpg)
Widget Microdata
<div class="fb-widget"
     id="fbtb-9a1f44348ad145b5b7d7d7d2376b0420"
     style="border:0; outline:0; padding:0; margin:0; position:relative;"
     itemscope=""
     itemid="http://www.freebase.com/id/en/taylor_swift"
     itemtype="http://www.freebase.com/id/music/artist">
  .....
</div>
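The `itemscope`, `itemid`, and `itemtype` attributes are what turn this widget into machine-readable data: they name the entity and its type. A minimal sketch of reading them with Python's standard `html.parser` follows. The class name `ItemScopes` is hypothetical, and `WIDGET` keeps only the microdata attributes from the slide, with the elided widget body left empty.

```python
from html.parser import HTMLParser


class ItemScopes(HTMLParser):
    """Record (itemid, itemtype) for every element that declares itemscope."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.items.append((attrs.get("itemid"), attrs.get("itemtype")))


# Microdata attributes copied from the slide; widget body omitted.
WIDGET = (
    '<div class="fb-widget" itemscope="" '
    'itemid="http://www.freebase.com/id/en/taylor_swift" '
    'itemtype="http://www.freebase.com/id/music/artist"></div>'
)

parser = ItemScopes()
parser.feed(WIDGET)
```

Each tuple in `parser.items` pairs an entity identifier with its type, so any page embedding the widget also publishes which entity it is about.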
![Page 67: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/67.jpg)
Thickening the Graph
![Page 68: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/68.jpg)
"Vocabulary" Pattern
taw shooter marksman
marble marksman
photo: http://sarabbit.openphoto.net
http://wordnet.freebaseapps.com
![Page 69: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/69.jpg)
![Page 70: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/70.jpg)
E. Coli
Review (neighborhood) Pattern
Robert Kenner
Michael Pollan
Eric Schlosser
![Page 71: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/71.jpg)
![Page 72: Text Analytic Summit 2010](https://reader033.vdocuments.site/reader033/viewer/2022051110/54c2ee364a795990308b4611/html5/thumbnails/72.jpg)