semtag and seeker: bootstrapping the semantic web via automated semantic annotation

SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation. Presented by: Hussain Sattuwala. Authors: Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, R. Guha, Anant Jhingran, Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John A. Tomlin, Jason Y. Zien. IBM Almaden Research Center. http://www.almaden.ibm.com/webfountain/resources/semtag.pdf


Post on 06-Jan-2016


TRANSCRIPT

Page 1: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Presented by: Hussain Sattuwala

Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, R. Guha, Anant Jhingran, Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John A. Tomlin, Jason Y. Zien

IBM Almaden Research Center

http://www.almaden.ibm.com/webfountain/resources/semtag.pdf

Page 2: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Outline

Motivation

Goal

SemTag: Architecture, Phases, TBD, Results, Methodology

Seeker: Design, Architecture, Environment

Conclusion

Related and Future Work

Page 3: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Motivation

Natural language processing is the most significant obstacle to building a machine-understandable web.

For the Semantic Web to become a reality, we need:

  Web services to maintain and provide metadata.

  Annotated documents (OWL, RDF, XML, ...).

Page 4: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Annotations

The current practice of annotation for knowledge identification, extraction, and other applications:

  is time-consuming

  needs annotation by experts

  is complex

Reduce the burden of text annotation for knowledge management.

Page 5: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Goal

To perform automated semantic tagging of large corpora.

To introduce a new disambiguation algorithm to resolve ambiguities in a natural language corpus.

To introduce a platform that different tagging applications can share.

Page 6: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

SemTag

The goal is to automatically add semantic tags to the existing HTML body of the web.

Example: "The Chicago Bulls announced yesterday that Michael Jordan will..."

Will be: The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan_Michael">Michael Jordan</resource> will...
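The tagging step in this example can be sketched in a few lines of Python. This is only an illustration of the output format, assuming a hypothetical `LABEL_TO_REF` map whose two entries are taken from the slide's example (with the stray spaces in the URLs removed); the helper name `add_resource_tags` is not from the paper, and no disambiguation is attempted here.

```python
import re
from html import escape

# Hypothetical label -> TAP reference map; the two entries mirror the slide's example.
LABEL_TO_REF = {
    "Chicago Bulls": "http://tap.stanford.edu/BasketballTeam_Bulls",
    "Michael Jordan": "http://tap.stanford.edu/AthleteJordan_Michael",
}


def add_resource_tags(text, label_to_ref):
    """Wrap every known label in a <resource ref="..."> element."""
    # Try longer labels first so multi-word labels win over shorter overlaps.
    labels = sorted(label_to_ref, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(label) for label in labels))

    def wrap(match):
        label = match.group(0)
        ref = escape(label_to_ref[label], quote=True)
        return f'<resource ref="{ref}">{label}</resource>'

    return pattern.sub(wrap, text)


tagged = add_resource_tags(
    "The Chicago Bulls announced yesterday that Michael Jordan will...",
    LABEL_TO_REF,
)
print(tagged)
```

A plain string match like this tags every occurrence of a label; deciding whether a given occurrence really refers to the TAP entity is the job of the disambiguation step described later in the deck.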

Page 7: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

SemTag

Uses the TAP knowledge base (KB):

  TAP is a public, broad, shallow knowledge base.

  TAP contains lexical and taxonomic information about popular objects such as music, movies, and sports.

Problem: there is no write access to the original documents, so how do you annotate them?

Uses the concept of a Label Bureau from PICS (Platform for Internet Content Selection):

  An HTTP server that can be queried for annotation information.

  A separate store of semantic annotation information.
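The idea of keeping annotations apart from the pages they describe can be sketched as a small in-memory store keyed by page URL. This is a hypothetical illustration, not the PICS protocol or SemTag's actual service: the `Annotation` and `LabelBureau` names, the character-offset fields, and the example page URL are all assumptions, and the HTTP layer is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # character offset where the label begins (assumed representation)
    end: int    # character offset just past the label
    ref: str    # reference into the knowledge base


class LabelBureau:
    """Stand-off annotation store: pages are never modified, only described."""

    def __init__(self):
        self._by_url = defaultdict(list)

    def add(self, page_url, annotation):
        self._by_url[page_url].append(annotation)

    def lookup(self, page_url):
        # Return annotations in document order; unknown pages yield an empty list.
        return sorted(self._by_url.get(page_url, []), key=lambda a: a.start)


bureau = LabelBureau()
page = "http://example.com/news.html"  # hypothetical page URL
bureau.add(page, Annotation(43, 57, "http://tap.stanford.edu/AthleteJordan_Michael"))
bureau.add(page, Annotation(4, 17, "http://tap.stanford.edu/BasketballTeam_Bulls"))
print(bureau.lookup(page))
```

Because the store is separate, any client can ask for the annotations of a page without needing write access to the page itself.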

Page 8: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Example: Annotated Page

Page 9: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

SemTag Architecture

[Diagram labels: Spotting (retrieve documents, tokenize, find context); Learning (determine distribution of terms); Tagging (disambiguate windows, add to DB). Also labeled: Automatic, Manual.]

Page 10: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

SemTag Phases

1. Spotting:

  Retrieve documents from Seeker.

  Tokenize documents.

  Find contexts (10 words + label + 10 words) around each label that appears in the TAP taxonomy.

2. Learning:

  Scan a representative sample to determine the distribution of terms at each internal node of the taxonomy.

Page 11: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

SemTag Phases, cont’d

3. Tagging:

  Disambiguate windows (using TBD).

  Add to the database.

Ambiguity types:

  The same label appears at multiple locations in the TAP ontology.

  Some entities have labels that occur in contexts that have no representative in the taxonomy.

Training data:

  Automatic metadata

  Manual metadata
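The spotting phase (tokenize, then collect 10 words on each side of a known label) can be sketched as follows. This is a minimal sketch under assumptions: the regular-expression tokenizer, the exact-match lookup, and the function names `tokenize` and `find_spots` are illustrative and not taken from SemTag.

```python
import re


def tokenize(text):
    """Lowercase word tokenizer (an assumed, simplistic choice)."""
    return re.findall(r"\w+", text.lower())


def find_spots(tokens, labels, window=10):
    """Return (label, context) pairs, where context is up to `window`
    tokens before and after each occurrence of a label."""
    label_tokens = {label: tuple(tokenize(label)) for label in labels}
    spots = []
    for i in range(len(tokens)):
        for label, parts in label_tokens.items():
            n = len(parts)
            if n and tuple(tokens[i:i + n]) == parts:
                before = tokens[max(0, i - window):i]
                after = tokens[i + n:i + n + window]
                spots.append((label, before + after))
    return spots


tokens = tokenize("The new Jaguar is a fast car")
spots = find_spots(tokens, ["Jaguar"], window=2)
print(spots)
```

With the default `window=10`, each context holds at most 20 words, matching the window size used in the TBD methodology.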

Page 12: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

TBD Methodology

Each node has a set of labels. E.g., the nodes for cats, football, and cars all contain the label "Jaguar".

Each label in the text is stored with a window of 20 words: the context.

A spot (l, c) is a label l in a context c.

Each node has an associated similarity function that maps a context to a similarity score. The higher the similarity, the more likely the context is to contain a reference to that node.

Page 13: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

TBD - Similarity

Generate a 200k-dimensional vector corresponding to the context.

TF-IDF scheme: each entry of the vector is the frequency of the term occurring at that node divided by the corpus frequency of the term.

IR algorithm: cosine similarity, computed as the vector product of the sparse spot vector and the dense node vector.
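The weighting and similarity described on this slide can be sketched with dictionaries standing in for the 200k-dimensional vectors. The toy counts, the node names, and the choice to use raw term counts for the spot vector are assumptions made for illustration only.

```python
import math
from collections import Counter


def node_vector(node_term_counts, corpus_term_counts):
    """Weight of a term = frequency at the node / frequency in the corpus."""
    return {
        term: count / corpus_term_counts[term]
        for term, count in node_term_counts.items()
        if corpus_term_counts.get(term, 0) > 0
    }


def cosine_similarity(spot_vec, node_vec):
    """Cosine of the angle between a sparse spot vector and a node vector."""
    dot = sum(w * node_vec.get(term, 0.0) for term, w in spot_vec.items())
    norm_spot = math.sqrt(sum(w * w for w in spot_vec.values()))
    norm_node = math.sqrt(sum(w * w for w in node_vec.values()))
    if norm_spot == 0 or norm_node == 0:
        return 0.0
    return dot / (norm_spot * norm_node)


# Toy counts (made up) for two nodes that share the label "Jaguar".
corpus = {"engine": 10, "jungle": 10, "the": 100}
cars = node_vector({"engine": 8, "the": 20}, corpus)
cats = node_vector({"jungle": 9, "the": 20}, corpus)

context = Counter(["the", "engine"])  # words around one spot of "Jaguar"
sim_cars = cosine_similarity(context, cars)
sim_cats = cosine_similarity(context, cats)
print(sim_cars, sim_cats)
```

Dividing by corpus frequency pushes down common words such as "the", so the context is pulled toward the node whose distinctive terms it shares.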

Page 14: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

TBD - Algorithm

Some internal nodes are very popular:

  Associate a measurement M_u^s of how accurate Sim is likely to be at a node u.

  Also a measurement M_u^a of how ambiguous the node is overall (consistency of human judgment).

TBD algorithm: returns 1 or 0 to indicate whether a particular context c is on topic for a node v.

82% accuracy on 434 million spots.
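One plausible way to combine the two measurements into a 0/1 decision is sketched below. This is an assumed reading for illustration and should not be taken as the algorithm on the following slide: it treats `m_s` as the probability that Sim is right at the node and `m_a` as the fraction of spots at the node that humans judged on topic, and it trusts whichever of the two lies farther from 1/2.

```python
def tbd(sim_says_on_topic, m_s, m_a):
    """Return 1 if the context is judged on topic for the node, else 0.

    sim_says_on_topic: bool outcome of the similarity test (assumed input).
    m_s: assumed probability that the similarity test is correct at this node.
    m_a: assumed fraction of spots at this node that humans judged on topic.
    """
    # If the node-level prior is more decisive than Sim, follow the prior.
    if abs(m_a - 0.5) > abs(m_s - 0.5):
        return 1 if m_a > 0.5 else 0
    # Otherwise follow Sim, inverting it where Sim is usually wrong.
    if m_s >= 0.5:
        return 1 if sim_says_on_topic else 0
    return 0 if sim_says_on_topic else 1


print(tbd(True, m_s=0.9, m_a=0.6))    # Sim is reliable here, so follow it
print(tbd(False, m_s=0.55, m_a=0.95)) # node is almost always on topic
```

The design choice worth noting is that both measurements come from human judgments, so the automatic decision is calibrated per node rather than using one global similarity threshold.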

Page 15: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

The TBD Algorithm

Page 16: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

SemTag Results

Applied to 264 million pages.

Produced 550 million labels.

Final set of 434 million spots with 82% accuracy.

Page 17: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

SemTag Methodology

1. Lexicon generation:

  Approximately 90 million total words.

  1.4 million unique words.

  Lexicon: the most frequent 200,000 words.

2. Similarity functions:

  Estimated the distribution of terms corresponding to the 192 most common TAP nodes to derive f_u.

Page 18: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

SemTag Methodology, cont’d

3. Measurement values:

  Determined based on 750 relevant human judgments.

4. Full TBD processing:

  Applied to 550 million spots.

5. Evaluation:

  Compared TBD results with an additional 378 human judgments.
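Step 1 of the methodology, lexicon generation, amounts to counting words over a sample and keeping the most frequent ones. A minimal sketch, assuming a simple regular-expression tokenizer and a hypothetical `build_lexicon` helper (neither is from the paper):

```python
import re
from collections import Counter


def build_lexicon(documents, size=200_000):
    """Map each of the `size` most frequent words to a vector dimension."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"\w+", doc.lower()))
    return {word: index for index, (word, _) in enumerate(counts.most_common(size))}


# Tiny made-up sample; the slide's sample had about 90 million words.
lexicon = build_lexicon(["a b b c c c", "c b"], size=2)
print(lexicon)
```

The lexicon size fixes the dimensionality of the context and node vectors, which is why the 200,000-word cutoff matches the 200k-dimensional vectors used by the similarity function.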

Page 19: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Seeker

A platform used by SemTag and other increasingly sophisticated text analytics applications.

Provides scalable, extensible knowledge extraction from erratic resources.

Erratic resources???

Page 20: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Seeker Design Goals

Composability

Modularity

Extensibility

Scalability

Robustness

Page 21: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Seeker Architecture

[Diagram labels: Web, Crawls, Storage & Communication, Indexing (Tokens), Query Processing, Annotators, Miners, SemTag Components, network-level APIs. Groupings: Scalability & Robustness; Modular & Extensible.]

Page 22: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Seeker Design

To achieve modularity and extensibility:

  A service-oriented architecture (SOA) is used, in which agents communicate through a set of language-independent network-level APIs.

To achieve scalability and robustness:

  Infrastructure components.

Page 23: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Infrastructure Components

The Data Store:

  Central repository for all data storage.

  Communication medium.

The Indexer:

  Indexes sequences of tokens.

The Joiner:

  Query processing component.

Page 24: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Analysis Agents

Annotators:

  Perform local processing on each web page and write the results back to the store in the form of an annotation.

Miners:

  Perform intermediate processing.

  Look at the results of spots on many pages in order to disambiguate them.
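The split between annotators (per-page work) and miners (cross-page work), with the data store as the shared medium, can be sketched as below. Everything here is hypothetical: the `DataStore` class, the `label_annotator` and `label_frequency_miner` functions, and the example URLs are illustrative stand-ins, and the real system communicates over network-level APIs rather than in-process calls.

```python
from collections import Counter, defaultdict


class DataStore:
    """In-memory stand-in for the central store agents read from and write to."""

    def __init__(self):
        self.pages = {}                       # url -> page text
        self.annotations = defaultdict(dict)  # url -> {key: value}

    def put_page(self, url, text):
        self.pages[url] = text

    def annotate(self, url, key, value):
        self.annotations[url][key] = value


def label_annotator(store, labels):
    """Annotator: looks at one page at a time and writes its result back."""
    for url, text in store.pages.items():
        found = [label for label in labels if label in text]
        store.annotate(url, "labels", found)


def label_frequency_miner(store):
    """Miner: aggregates annotator output across many pages."""
    totals = Counter()
    for page_annotations in store.annotations.values():
        totals.update(page_annotations.get("labels", []))
    return totals


store = DataStore()
store.put_page("http://example.com/a", "The Jaguar is a fast car.")
store.put_page("http://example.com/b", "A Jaguar hunts in the jungle.")
label_annotator(store, ["Jaguar", "Bulls"])
print(label_frequency_miner(store))
```

Routing all communication through the store means an annotator and a miner never call each other directly, which is one way to read the modularity and composability goals listed earlier.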

Page 25: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Observation

Advantages:

  Other applications can obtain semantic annotations from a web-available database.

  The TBD algorithm uses both human and computer judgments to resolve ambiguous data.

Disadvantages:

  The system requires a large amount of storage space to store data.

  Requires a much larger and richer KB to build a web-scale ontology.

Page 26: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Conclusion

Automatic semantic tagging is essential to bootstrap the Semantic Web.

It is possible to achieve good accuracy with simple disambiguation approaches.

Page 27: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Future Work

Develop more approaches and algorithms for automated tagging.

Make the annotated data public and offer Seeker as a public service.

Page 28: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Related Work

Systems built as a result of the Semantic Web are divided between two types:

  Create ontologies (semi-automated).

  Page annotation.

  Examples: Protégé, OntoAnnotate, Annotea, SHOE, ...

Some AI approaches were used, but they need a lot of training. Principal tool: wrapping.

Some used other natural language understanding techniques, for example ALPHA.

Page 29: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation

Questions?