Text Analytic Summit 2010

Post on 24-Jan-2015


DESCRIPTION

With over 12 million entities and 350 million relationships, Freebase is an excellent resource for performing text analysis. One way to look at document "understanding" is to think about how the entities in the document are connected on a knowledge graph. This is similar to the "reconciliation" process that is used to grow Freebase itself. The web is currently full of semantic hints, whether they are explicit (like those promoted by the Semantic Web) or implicit (like the use of blog widgets). Using these hints, text analytic methods can get a toehold on the web corpus at large.

TRANSCRIPT

It's not what you said, it's how you said it.

Jamie Taylor, Ph.D.

Text Analytic Summit Boston 2010

What do y'all mean "Semantics"

The Web! Now with Better Flavor!

The Cake (taken from http://www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/layerCake-4.png)

The Semantic Web?

Linked Open Data

The Real Web

http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg

Wish it were real

Might be real

Is real, but don't believe it

Is currently useful

Entities

Identifiers

Bono, A.K.A. Paul David Hewson

http://rdf.freebase.com/ns/en.paul_david_hewson
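The two identifier URLs on these slides suggest a simple mapping from a slash-separated Freebase id to a dotted name under the RDF namespace. The sketch below illustrates that pattern only; the function name is hypothetical, and the mapping rule is inferred from the slide URLs rather than taken from any official specification.

```python
FREEBASE_RDF_NS = "http://rdf.freebase.com/ns/"


def freebase_id_to_rdf_uri(freebase_id: str) -> str:
    """Turn an id such as /en/paul_david_hewson into its dotted RDF URI.

    Assumption: slashes in the id become dots after the namespace prefix,
    as the URLs shown on the slides suggest.
    """
    if not freebase_id.startswith("/"):
        raise ValueError("expected a slash-separated id like /en/some_topic")
    return FREEBASE_RDF_NS + freebase_id.strip("/").replace("/", ".")


print(freebase_id_to_rdf_uri("/en/paul_david_hewson"))
print(freebase_id_to_rdf_uri("/automotive/make/model_s"))
```

Because the identifier names one entity rather than one string, "Bono" and "Paul David Hewson" resolve to the same URI, which is how identifiers sidestep ambiguous names.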

Side Step Polysemy

Vocabulary

Manufactures

http://rdf.freebase.com/ns/automotive.make.model_s

A socially managed semantic database

Freebase has Many Types of Things

350 Million Relations

12 Million Entities

Users extend the data model

Users contribute data

schema = vocabulary

A range of vocabularies...

1500 types with 500+ instances!!

Growing Freebase

Reconciliation

+=

Reconciliation

Record Matching

Identity Matching

Collective Entity Resolution

Relational Learning

Record Linking

Equivalence Mining

Reconciliation

"Harrison Ford"

"Excuse Me"

"Vanity Fair"

"Harrison Ford"

"Excuse Me"

"Maytime"

Reconciliation

"Harrison Ford"

"Excuse Me"

"Vanity Fair"

"Harrison Ford"

"Fugitive"

"Blade Runner"
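The two "Harrison Ford" slides can be read as a small entity-resolution exercise: records that share a name are merged only when their neighbors in the graph (here, films) also overlap. The following is a minimal sketch of that idea using Jaccard overlap; the record layout, the `same_entity` helper, and the 0.3 threshold are illustrative assumptions, not the method Freebase used.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# The three records shown on the slides.
records = [
    {"name": "Harrison Ford", "films": {"Excuse Me", "Vanity Fair"}},
    {"name": "Harrison Ford", "films": {"Excuse Me", "Maytime"}},
    {"name": "Harrison Ford", "films": {"Fugitive", "Blade Runner"}},
]


def same_entity(r1, r2, threshold=0.3):
    """Guess whether two records describe one entity.

    Assumption: equal names plus enough shared films is sufficient evidence.
    The threshold is arbitrary and would need tuning on real data.
    """
    if r1["name"] != r2["name"]:
        return False
    return jaccard(r1["films"], r2["films"]) >= threshold


print(same_entity(records[0], records[1]))  # share "Excuse Me"
print(same_entity(records[0], records[2]))  # no films in common
```

The name alone cannot separate the two actors; the shared film is what links the first two records and keeps the third apart.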

A Graph of Entities

Vocabulary

[Figure: graph of entities; edge labels: education, nationality, located, plays-in, performed-at, created, released-by, contains]

Reconciliation as "understanding"

[Figure: graph of entities; edge labels: education, nationality, located, plays-in, performed-at, created, released-by, contains]

http://data.labs.freebase.com/recon/

{
  "/type/object/name": "Blade Runner",
  "/type/object/type": "/film/film",
  "/film/film/starring/actor": ["Harrison Ford", "Rutger Hauer"],
  "/film/film/director": "Ridley Scott",
  "/film/film/release_date_s": "1981"
}

[
  {
    "id": "/guid/9202a8c04000641f8000000000009e89",
    "name": ["Blade Runner", "Bladerunner"],
    "score": 1.4320519,
    "match": true,
    "type": ["/common/topic", "/film/film", "/media_common/adapted_work", "/award/award_winning_work", ...... ]
  },
  {
    "id": "/guid/9202a8c04000641f80000000002643d0",
    "name": ["Blade"],
    "score": 0.48852453,
    "match": false,
    "type": ["/common/topic", "/film/film", "/award/award_winning_work", "/award/award_nominated_work", ....... ]
  },
  {
    "id": "/guid/9202a8c04000641f800000000e5daaae",
    "name": ["Blade"],
    "score": 0.46398318,
    "match": false,
    .....
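A client consuming a response shaped like the one above mostly needs to decide which candidate, if any, to accept. The sketch below parses the first two candidates from the slide and keeps the one flagged as a match. It works on a literal string rather than calling the service, the `pick_match` helper is hypothetical, and the `type` lists are cut to their first two entries because the slide elides the rest.

```python
import json

# First two candidates from the slide; "type" lists are shortened here
# because the original elides their tails.
response_text = """
[
  {"id": "/guid/9202a8c04000641f8000000000009e89",
   "name": ["Blade Runner", "Bladerunner"],
   "score": 1.4320519, "match": true,
   "type": ["/common/topic", "/film/film"]},
  {"id": "/guid/9202a8c04000641f80000000002643d0",
   "name": ["Blade"],
   "score": 0.48852453, "match": false,
   "type": ["/common/topic", "/film/film"]}
]
"""


def pick_match(candidates):
    """Return the highest-scoring candidate flagged match=true, or None."""
    matched = [c for c in candidates if c.get("match")]
    if not matched:
        return None
    return max(matched, key=lambda c: c["score"])


candidates = json.loads(response_text)
best = pick_match(candidates)
print(best["id"], best["name"][0], best["score"])
```

Returning `None` when nothing is flagged leaves room for a fallback, such as sending low-confidence candidates to a human reviewer.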

Data Everywhere

Wikipedia Features

Wikipedia Features

Error Prone -- Usually <99%

X

X

WEX

calculate feature counts per type

intersect the two sources

generate type scores for topics

join feature counts with topics

extract features

get types
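The pipeline steps listed on this slide can be sketched on toy data as follows: extract features per article, get the known types per topic, intersect the two sources, count features per type, and score untyped topics against those counts. Everything concrete here is an assumption for illustration: the feature names, the topic keys, and especially the scoring rule (a plain sum of counts), which the slides do not specify.

```python
from collections import Counter, defaultdict

# "extract features": hypothetical Wikipedia-derived features per topic.
topic_features = {
    "bono": {"infobox:musical_artist", "category:living_people"},
    "ridley_scott": {"infobox:person", "category:living_people"},
    "blade_runner": {"infobox:film"},
    "harrison_ford": {"infobox:person", "category:living_people"},
}

# "get types": hypothetical type assertions already known for some topics.
topic_types = {
    "bono": {"/people/person", "/music/artist"},
    "ridley_scott": {"/people/person"},
    "blade_runner": {"/film/film"},
}


def feature_counts_per_type(topic_features, topic_types):
    """Count how often each feature occurs among topics of each type."""
    counts = defaultdict(Counter)
    # "intersect the two sources": only topics present in both contribute.
    for topic in topic_features.keys() & topic_types.keys():
        for type_id in topic_types[topic]:
            counts[type_id].update(topic_features[topic])
    return counts


def type_scores(topic, topic_features, counts):
    """Score each type for a topic by summing counts of the topic's features.

    Placeholder scoring only; a real system would normalize these counts.
    """
    features = topic_features[topic]
    return {t: sum(c[f] for f in features) for t, c in counts.items()}


counts = feature_counts_per_type(topic_features, topic_types)
scores = type_scores("harrison_ford", topic_features, counts)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy run the untyped topic scores highest for /people/person because it shares features with the topics already typed as people.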

(Machine) Learning Semantics

37M features

2.4M features

1400 types

5M type assertions

2.8M Wikipedia topics

5M articles

1.6G scores

/people/person distribution

[Figure legend: untyped topics, person topics, other topics, all topics]

Data courtesy Viral Shah

RABJ: Humans in the loop

Thresholding Results

99% threshold at 16.75

/people/person assertions

threshold

53K /people/person assertions
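One way to arrive at a cutoff like the "99% threshold" above is to take a sample of scored assertions with human judgments and find the lowest score at which everything above it is still at least 99% correct. The sketch below does that on invented toy data; the function name, the sample values, and the assumption that human review supplies the correctness labels are all illustrative, not taken from the talk.

```python
def lowest_threshold_for_precision(scored, target=0.99):
    """Return the lowest score t such that items with score >= t reach `target` precision.

    `scored` is a list of (score, is_correct) pairs. Returns None if no
    threshold reaches the target.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = None
    correct = 0
    for i, (score, ok) in enumerate(ranked, start=1):
        correct += ok
        # Evaluate only at the end of a run of tied scores, since a
        # threshold cannot split items that share the same score.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if correct / i >= target:
            best = score
    return best


# Invented toy judgments: (score, judged correct by a reviewer).
sample = [(9.0, True), (7.5, True), (6.0, True), (4.0, False), (2.5, True)]
threshold = lowest_threshold_for_precision(sample)
print(threshold)
```

Assertions scoring at or above the returned threshold could then be accepted automatically, with the remainder left for review.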

Training Wheels?

Semantics are Everywhere

A Strong Tag for Food Inc. http://movi.es/BVl43

Widgets: Content Tags

Explicit Semantics

Rich Snippets

<div class="post-item restaurant-gen-info hreview-aggregate">
  <div class="item vcard">
    <h1 class="fn org">Taylor's Refresher</h1>
    <div class="address">
    <div class="ratings">
      <ul class="star-rating-2 rating" title="4.0 star rating across 3 ratings">
        <li class="current-rating average" style="width:80%;">4.0 star rating</li>
        <li class="star">&nbsp;</li>
        <li class="star">&nbsp;</li>
        <li class="star">&nbsp;</li>
        <li class="star">&nbsp;</li>
        <li class="star">&nbsp;</li>
      </ul>
      <div class="rating-stats">
        <span class="rating">
          <span class="average">4.0</span>
        </span>
        rating over <span class="count">1</span> review
      </div>
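Markup like this carries its semantics in class names (`fn`, `average`, `count`), so a consumer only has to collect the text inside elements that carry those classes. The sketch below does this with Python's standard `html.parser` on an excerpt of the slide's markup. The `ClassTextExtractor` class is a hypothetical helper, not a full microformats parser, and it ignores details a real one would handle.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text found inside elements carrying any wanted class name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {name: [] for name in wanted}
        self._stack = []  # (tag, wanted classes on that element) per open element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        hits = [c for c in classes if c in self.wanted]
        for name in hits:
            self.found[name].append("")  # start a new value for this class
        self._stack.append((tag, hits))

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if there is one.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        for _, hits in self._stack:
            for name in hits:
                self.found[name][-1] += data


# Excerpt of the slide's markup.
html = """
<div class="item vcard"><h1 class="fn org">Taylor's Refresher</h1></div>
<ul><li class="current-rating average" style="width:80%;">4.0 star rating</li></ul>
<div class="rating-stats">
  <span class="rating"><span class="average">4.0</span></span>
  rating over <span class="count">1</span> review
</div>
"""

extractor = ClassTextExtractor(["fn", "average", "count"])
extractor.feed(html)
print(extractor.found)
```

Note that `average` appears twice in the slide's markup, once on the star-rating list item and once in the rating statistics, so the sketch keeps every occurrence instead of picking one.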

microformats

HTML5 MicroData

Open Graph Protocol

RDFa

Explicit Semantics in Surprising Places

Blog Tags::Entities

Metaweb Topic Block

Widget Microdata

<div class="fb-widget" id="fbtb-9a1f44348ad145b5b7d7d7d2376b0420" style="border:0; outline:0; padding:0; margin:0; position:relative;" itemscope="" itemid="http://www.freebase.com/id/en/taylor_swift" itemtype="http://www.freebase.com/id/music/artist"> ..... </div>
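Because the widget marks itself with `itemscope`, `itemid`, and `itemtype`, a crawler can tie the surrounding page to a specific entity without any text analysis. The sketch below pulls those attributes out with Python's standard `html.parser`. The `ItemScopeFinder` class is a hypothetical helper, and the widget's inner content, elided on the slide, is left out.

```python
from html.parser import HTMLParser


class ItemScopeFinder(HTMLParser):
    """Record the itemid and itemtype of every element that has itemscope."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes:
            self.items.append({
                "itemid": attributes.get("itemid"),
                "itemtype": attributes.get("itemtype"),
            })


# Opening tag from the slide; the widget body is omitted.
widget_html = (
    '<div class="fb-widget" id="fbtb-9a1f44348ad145b5b7d7d7d2376b0420" '
    'itemscope="" itemid="http://www.freebase.com/id/en/taylor_swift" '
    'itemtype="http://www.freebase.com/id/music/artist"></div>'
)

finder = ItemScopeFinder()
finder.feed(widget_html)
print(finder.items)
```

The `itemid` gives the entity and the `itemtype` gives its type, which is the kind of explicit hint that lets a blog post be linked into the graph.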

Thickening the Graph

"Vocabulary" Pattern

taw shooter marksman

marble marksman

photo: http://sarabbit.openphoto.net

http://wordnet.freebaseapps.com

E. Coli

Review (neighborhood) Pattern

Robert Kenner

Michael Pollan

Eric Schlosser

top related