Natural Language Processing & Semantic Models in an Imperfect World

Post on 05-Dec-2014

Confidential

Presenter: Marc Hadfield

marc@alitora.com

www.alitora.com

Natural Language Processing & Semantic Models in an Imperfect World

Copyright Alitora Systems, Inc. 2009

Marc Hadfield

• CTO of Alitora Systems
• Computer Science Research in Bioinformatics
• NLP
• Big (Fuzzy) Networks
• Generalized Semantic Data Platform

Alitora Systems

System Approach

…Talk about Systems & Apps more than Modules.

Discussion Today

• Storing Data – Semantic Repository
• Generating Data – NLP
• Modeling Data – Semantic Models
• Analyze Data – Methodology
• Using Data – Application

Alitora Systems Architecture

Alitora Systems API (ASAPI)

Diagram components: User Interfaces, ASAPI, Collaboration, kHarmony™ Semantic DB, Alitora Foundry, Text-Mining, UMIS

• Secure
• Distributed URIs
• URI to Named Graphs
• ASAPI Cloud
• Multi-Billion Triples

kHarmony™ Semantic DB

• Semantic / Graph DB
• Cloud Deployable
• Distribute Data over Servers
• Layers of Cache
• Data Analytics / Clustering
• Determine High-Value Knowledge
• Knowledge Relevancy
• Embedded Scripting
• Data Entitlements
• Users, Teams, Organizations, Colleagues
• Base Ontology

Alitora Foundry

• Manages NLP processes
• Annotators which add metadata to text
• Includes external services like OpenCalais as annotators
• Workflows to link annotators together
• Common data representation across components
• RDF in, RDF out
• Ontology includes representation of certainty, error
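
The idea of chaining annotators over a shared representation can be sketched in a few lines. This is only an illustration, not the Foundry API: the `Annotation` and `Document` classes, both annotator functions, and the certainty values are hypothetical, and a plain Python object stands in for the RDF that Foundry passes between components.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    start: int        # character offset where the annotated text begins
    end: int          # character offset just past the annotated text
    label: str        # entity or feature type
    certainty: float  # assumed scale: 0.0 (no confidence) to 1.0 (certain)
    source: str       # which annotator produced this annotation


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


Annotator = Callable[[Document], Document]


def gene_dictionary_annotator(doc: Document) -> Document:
    """Hypothetical dictionary lookup for gene/protein names."""
    for name in ("Bim", "Gadd45a"):
        pos = doc.text.find(name)
        if pos >= 0:
            doc.annotations.append(
                Annotation(pos, pos + len(name), "gp", 0.9, "dictionary"))
    return doc


def process_rule_annotator(doc: Document) -> Document:
    """Hypothetical rule that tags one biological process term."""
    pos = doc.text.find("apoptosis")
    if pos >= 0:
        doc.annotations.append(
            Annotation(pos, pos + len("apoptosis"), "process", 0.7, "rule"))
    return doc


def run_workflow(doc: Document, annotators: List[Annotator]) -> Document:
    """Apply each annotator in order; every step sees earlier annotations."""
    for annotate in annotators:
        doc = annotate(doc)
    return doc


doc = run_workflow(
    Document("Suppression of endogenous Bim greatly inhibits "
             "Gadd45a induction of apoptosis."),
    [gene_dictionary_annotator, process_rule_annotator],
)
labels = [(doc.text[a.start:a.end], a.label) for a in doc.annotations]
```

Because every annotator reads and writes the same structure, an external service could be wrapped as one more function in the list without changing the workflow runner.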

Foundry Workflow

Independent Workflows based on type of text

Combine ML & Rule-based systems

Foundry Data Model

• Two dimensional representation of tokens
• Labels/Spans to tag token ranges (features in machine learning)
• Allows multiple interpretations of tokens
• Chemical names tokenized differently than personal names
• Sequence Recognition and Categorization (with scoring/likelihood)
• Entities, Entity Types, Normalized (Disambiguated) Entities (ER vs. ER)
• Shared across workflow steps
• Direct RDF representation

“Span”
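
A minimal sketch of what a span over multiple tokenizations might look like, assuming one dimension is the tokenization layer and the other is the token position. The `Span` class, the layer names, the sample text, and the scores are all hypothetical; the slides do not show the actual Foundry structures.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    layer: str    # which tokenization the indices refer to
    start: int    # index of the first token in the span
    end: int      # index one past the last token in the span
    label: str    # tag applied to the token range
    score: float  # likelihood assigned by the recognizer (assumed 0.0 to 1.0)


text = "Dr. Smith studied 2,4-dinitrophenol"

# Two interpretations of the same text: a coarse split keeps the chemical
# name whole, a fine split breaks punctuation apart.
layers: Dict[str, List[str]] = {
    "whitespace": text.split(),
    "fine": re.findall(r"\w+|[^\w\s]", text),
}

spans = [
    Span("whitespace", 3, 4, "chemical", 0.85),
    Span("fine", 0, 3, "person", 0.80),
]


def tokens_of(span: Span) -> List[str]:
    """Return the tokens a span covers within its own layer."""
    return layers[span.layer][span.start:span.end]
```

Keeping the layer name on each span lets a chemical-name recognizer and a person-name recognizer disagree about token boundaries without overwriting each other.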

NLP In Action


Sentence

“Suppression of endogenous Bim greatly inhibits Gadd45a induction of apoptosis.”

Parse

[action, inhibit,
  [action, suppress, [unknown], [gp, endogenous Bim]],
  [action, induce, [gp, Gadd45a], [process, apoptosis]]
]
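
The nested parse can be turned into flat relation statements with a short recursive walk. The sketch below assumes each action node has the shape (action, verb, first argument, second argument) and treats the two arguments as agent and target; that reading, and the helper names `describe` and `flatten`, are assumptions rather than anything stated on the slide.

```python
# The parse from the slide, written as nested tuples.
parse = (
    "action", "inhibit",
    ("action", "suppress", ("unknown",), ("gp", "endogenous Bim")),
    ("action", "induce", ("gp", "Gadd45a"), ("process", "apoptosis")),
)


def describe(node):
    """Return a short readable name for a node."""
    if node[0] == "action":
        return f"{node[1]}({describe(node[2])}, {describe(node[3])})"
    if len(node) == 1:       # e.g. ("unknown",) has a type but no text
        return node[0]
    return node[1]           # e.g. ("gp", "Gadd45a") -> "Gadd45a"


def flatten(node, out=None):
    """Collect (verb, agent, target) triples, innermost actions first."""
    if out is None:
        out = []
    if node[0] == "action":
        _, verb, agent, target = node
        flatten(agent, out)
        flatten(target, out)
        out.append((verb, describe(agent), describe(target)))
    return out


relations = flatten(parse)
```

Here `relations` holds three entries: the two inner actions, then the outer "inhibit" whose arguments are themselves actions. That nesting is what makes reified relations necessary later in the talk.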


Foundry Relationship Extraction

Alitora Knowledge Ontology

Data Representation:

Each Object is a Named Graph with a unique URI.

“chunks” of RDF

OWL2

“Core” Model

Alitora Knowledge Ontology

Named Graphs:

• URI
• “Reified”
• Provenance
• Hash/Signature
• Creation, Modification, Expiration Dates
• Certainty/Error
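
One way to picture a named graph carrying this metadata is a small record that bundles the triples with their provenance, dates, certainty, and a content signature. This is a sketch under stated assumptions: the field names, the `example.org` URIs, and the choice of SHA-256 over sorted triples are illustrative and are not taken from the Alitora Knowledge Ontology.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]


@dataclass
class NamedGraph:
    uri: str                 # unique identifier of this chunk of RDF
    triples: List[Triple]
    source: str              # provenance: where the statements came from
    certainty: float         # assumed scale: 0.0 to 1.0
    created: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    modified: Optional[datetime] = None
    expires: Optional[datetime] = None

    def signature(self) -> str:
        """Hash the triples in sorted order so statement order is irrelevant."""
        canonical = "\n".join(" ".join(t) for t in sorted(self.triples))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def is_expired(self, now: datetime) -> bool:
        """A graph with no expiration date never expires."""
        return self.expires is not None and now >= self.expires


g1 = NamedGraph(
    uri="http://example.org/graph/1",
    triples=[
        ("http://example.org/Bim", "rdf:type", "http://example.org/GeneProduct"),
        ("http://example.org/Bim", "rdfs:label", "Bim"),
    ],
    source="http://example.org/source/abstract-1",
    certainty=0.8,
)
```

Because the metadata hangs off the graph's URI rather than off individual triples, other graphs can refer to the whole chunk, which is what the "Reified" bullet points at.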

Alitora Knowledge Ontology

Lesson:

“Reification” at the model level.

Expose the topology of the knowledge.

Semantic Knowledge Statements

Domain Ontology + Instance Statements

Alitora Knowledge Ontology

Semantic Collaborative Statements

Alitora Knowledge Ontology

Fact Representation

• This example has 9 Named Graphs
• The “Relation” is the head
• Any number of Relation-Parts
• Relation-Parts are chained

“Company Merger”
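
The head-plus-chain shape described above can be sketched as a linked list hanging off a relation object. The class names, role names, URIs, and company names below are placeholders invented for illustration; the slide's actual nine graphs are in a figure that did not survive in this transcript.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class RelationPart:
    uri: str                                   # each part has its own URI
    role: str                                  # what this part contributes
    value: str
    next_part: Optional["RelationPart"] = None  # link to the next part


@dataclass
class Relation:
    uri: str
    relation_type: str
    certainty: float
    first_part: Optional[RelationPart] = None

    def parts(self) -> Iterator[RelationPart]:
        """Follow the chain from the head to the last part."""
        part = self.first_part
        while part is not None:
            yield part
            part = part.next_part


def build_relation(uri: str, relation_type: str, certainty: float,
                   role_values: List[Tuple[str, str]]) -> Relation:
    """Create a head relation and chain one part per (role, value) pair."""
    head = Relation(uri, relation_type, certainty)
    previous = None
    for index, (role, value) in enumerate(role_values, start=1):
        part = RelationPart(f"{uri}/part/{index}", role, value)
        if previous is None:
            head.first_part = part
        else:
            previous.next_part = part
        previous = part
    return head


# Placeholder data: the companies and roles are hypothetical.
merger = build_relation(
    "http://example.org/fact/1", "CompanyMerger", 0.8,
    [("acquirer", "Example Co A"), ("acquired", "Example Co B")],
)
roles = [part.role for part in merger.parts()]
```

Chaining lets a fact take any number of parts without changing the schema of the head relation.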

• OWL
• “Reified”
• Knowledge Representation
• Certainty, Error, Provenance, …
• Graph + Semantic
• Topology Interpretation
• Logical Interpretation

Alitora Knowledge Ontology

MemomicsBio Ontology (Domain)

• Extends Alitora Knowledge Ontology
• Inherits knowledge representation structures
• OWL
• Domain Specific
• Defines types of “facts” specific to biomedical domain
• A general AKO fact can be mapped/asserted into a Memomics BioOntology fact

Where are we?

• Store Data
• Generate data with NLP
• Represent data in a general knowledge model
• Have a domain specific ontology

Where the “action” happens

Need some analysis to push facts into the domain ontology

Query, Inference using the domain ontology

Relevancy

The shape or “topology” of the graph helps to identify relevant knowledge.

The “paths” connecting a User to knowledge, based on search usage, factor into Relevancy

“Knowledge Rank”

“Best” facts

Relevancy based on Graph Topology
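
The slides do not say how “Knowledge Rank” is computed. As one familiar example of a topology-based score that also accounts for the paths leading out from a particular user, the sketch below runs a personalized PageRank-style iteration. The function name, the damping factor, and the toy graph are assumptions and should not be read as Alitora's algorithm.

```python
def personalized_rank(edges, start, damping=0.85, iterations=50):
    """Power iteration where all restart mass returns to `start`."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: (1.0 if node == start else 0.0) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: 0.0 for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: send its mass back to the start node.
                new_rank[start] += damping * rank[node]
        # Restart step: the remaining mass always returns to the start node.
        new_rank[start] += 1.0 - damping
        rank = new_rank
    return rank


# Toy graph: a user reaches facts through a past search query.
edges = [
    ("user", "query1"),
    ("query1", "factA"),
    ("query1", "factB"),
    ("factA", "factC"),
    ("other", "factD"),   # not reachable from this user
]
rank = personalized_rank(edges, "user")
best_first = sorted(rank, key=rank.get, reverse=True)
```

Facts that sit on short paths from the user score higher than facts further away, and facts the user's paths never reach score zero, which matches the intuition that search usage should shape relevancy.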

Scripting, Analysis, Inference

• Submitted Scripts applied over Graph Walk
• Groovy Scripts (Java Interface)
• Can calculate “scores”
• Offline Clustering and Analysis Algorithms
• Grid/Cloud based
• Inference process utilizes knowledge
• Asserting statements (Relation Statement)
• Prolog, HiLog, F-Logic
• Use all features in inferencing (such as certainty)
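
To illustrate a submitted script being applied over a graph walk, here is a breadth-first walk that calls a user-supplied scoring function at each node. The platform described in the slides uses Groovy scripts through a Java interface; this Python version, its `walk_and_score` function, the graph layout, and the decay formula are only a hypothetical stand-in for that mechanism.

```python
from collections import deque


def walk_and_score(graph, start, script, max_depth=2):
    """Visit nodes breadth-first and let `script` score each one."""
    scores = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        scores[node] = script(node, depth, graph[node])
        if depth < max_depth:
            for neighbour in graph[node]["links"]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, depth + 1))
    return scores


# Toy graph with made-up certainty values.
graph = {
    "Bim": {"certainty": 0.9, "links": ["apoptosis"]},
    "apoptosis": {"certainty": 0.8, "links": ["Gadd45a"]},
    "Gadd45a": {"certainty": 0.6, "links": []},
}


def decay_script(node, depth, data):
    """Example script: certainty discounted by distance from the start."""
    return data["certainty"] / (1 + depth)


scores = walk_and_score(graph, "Bim", decay_script)
```

Separating the walk from the script means new scoring ideas can be submitted without touching the traversal code.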

Certainty

• How accurate (F-score) are your NLP extractions?
• How accurate is the source material?
• How dynamic is your domain?
• Can facts be independently verified?
• Do multiple sources reinforce a “fact”?
• Can your community of users curate or validate information?
• How sensitive are you to error?
• Will users tolerate error (such as in search) or are you trying to inference over absolute “truth”?
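
Two of these questions have simple numeric counterparts. The sketch below computes the standard F1 score from extraction counts and shows one possible way for several sources to reinforce a fact, a noisy-OR combination. The noisy-OR rule assumes the sources are independent; it is an illustrative choice, not a method named in the talk.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1: harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def combine_sources(certainties) -> float:
    """Noisy-OR: the fact is doubted only if every source is wrong.

    Assumes the sources are independent, which is rarely exactly true.
    """
    remaining_doubt = 1.0
    for certainty in certainties:
        remaining_doubt *= 1.0 - certainty
    return 1.0 - remaining_doubt


extraction_quality = f_score(true_pos=8, false_pos=2, false_neg=2)
reinforced = combine_sources([0.5, 0.5])
```

Under this rule a second source never lowers the combined certainty, so contradicting evidence would need separate handling.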

Certainty

Choose to assert facts (or not) based on certainty assessments


Guided Inference

Inference is guided by ranked knowledge

Analysis can be performed offline

Guided Inference

• Dynamic Inference / Rules
• A question/query is posed to initiate the inference
• Knowledge base is queried to collect relevant data
• Certainty Thresholds can be used
• Relevancy Thresholds can be used
• AKO Relations are asserted as “facts” to extend the inference
• Process is repeated to add assertions
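
The loop described above can be sketched as a fixed-point iteration: collect candidates that pass both thresholds and touch something already in play, assert them, apply rules, and repeat until nothing new appears. Everything concrete here is an assumption: the `Candidate` record, the threshold defaults, the toy rule, and the placeholder entity names. The platform itself is described as using Prolog, HiLog, or F-Logic rather than Python.

```python
from typing import Callable, List, NamedTuple, Set, Tuple

Fact = Tuple[str, str, str]


class Candidate(NamedTuple):
    fact: Fact
    certainty: float
    relevancy: float


Rule = Callable[[Set[Fact]], Set[Fact]]


def guided_inference(knowledge_base: List[Candidate], rules: List[Rule],
                     seed_entity: str, min_certainty: float = 0.7,
                     min_relevancy: float = 0.5,
                     max_rounds: int = 10) -> Set[Fact]:
    asserted: Set[Fact] = set()
    frontier = {seed_entity}
    for _ in range(max_rounds):
        # Query step: keep candidates above both thresholds that touch
        # an entity the inference has already reached.
        collected = {
            c.fact for c in knowledge_base
            if c.certainty >= min_certainty
            and c.relevancy >= min_relevancy
            and (c.fact[0] in frontier or c.fact[2] in frontier)
        }
        # Rule step: derive new statements from what is known so far.
        derived: Set[Fact] = set()
        for rule in rules:
            derived |= rule(asserted | collected)
        new_facts = (collected | derived) - asserted
        if not new_facts:
            break  # nothing new was asserted, so the process stops
        asserted |= new_facts
        frontier |= {f[0] for f in new_facts} | {f[2] for f in new_facts}
    return asserted


def chain_rule(facts: Set[Fact]) -> Set[Fact]:
    """Toy rule: a suppresses b, b induces c => a reduces c."""
    return {
        (a, "reduces", c)
        for (a, p1, b) in facts if p1 == "suppresses"
        for (b2, p2, c) in facts if p2 == "induces" and b2 == b
    }


kb = [
    Candidate(("entityA", "suppresses", "entityB"), 0.9, 0.8),
    Candidate(("entityB", "induces", "entityC"), 0.8, 0.9),
    Candidate(("entityB", "binds", "entityD"), 0.4, 0.9),  # below threshold
]
result = guided_inference(kb, [chain_rule], "entityA")
```

The low-certainty "binds" statement is never asserted, so it cannot feed any rule. Raising or lowering the thresholds changes how far the inference is allowed to reach.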

Demonstrations

• Alitora News Tracker
• Sage Commons, Biomedical Domain
• Match Engine, Consumer Application

Alitora News Tracker

• Track highly relevant news in domain niche
• Use NLP to extract entities and relations of interest
• Use certainty assessments as thresholds to consider entities/relations
• Use a score (an embedded script) to assign a relevancy to news articles
• Heuristic including entity types in articles, relationship types, et cetera
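
A relevancy heuristic of this kind could look like the sketch below: drop extractions under a certainty threshold, then sum type-specific weights scaled by certainty. The weight tables, type names, threshold, and sample extractions are all hypothetical; the actual News Tracker script is not shown in the slides.

```python
# Hypothetical weights: how much each type matters for this news niche.
ENTITY_WEIGHTS = {"company": 2.0, "person": 1.0}
RELATION_WEIGHTS = {"merger": 3.0}


def article_relevancy(extractions, min_certainty=0.6):
    """Score an article from (kind, type_name, certainty) extractions."""
    score = 0.0
    for kind, type_name, certainty in extractions:
        if certainty < min_certainty:
            continue  # too uncertain to consider at all
        weights = ENTITY_WEIGHTS if kind == "entity" else RELATION_WEIGHTS
        # Unknown types contribute nothing.
        score += weights.get(type_name, 0.0) * certainty
    return score


article = [
    ("entity", "company", 0.9),
    ("entity", "person", 0.5),    # dropped: below the threshold
    ("relation", "merger", 0.8),
]
relevancy = article_relevancy(article)
```

Scaling each weight by certainty means a confident extraction counts for more than a marginal one that only just clears the threshold.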

Application: News Tracker

Application: Sage Commons

• Share networks of biomedical data across the community of researchers
• Million-node networks, billions of triples
• Extended AKO with Sage Ontology
• Use for structured data and unstructured data
• Allow combination of structured data with NLP derived data
• Use certainty thresholds to cut down on noise
• Use relevancy for efficient queries
• Expose data for guided inferencing

Application: Match Engine

Match Engine

• Extended AKO with Match Ontology
• Foundry for extracting music event entities
• Performer, Venue, Price, Genre
• Certainty for reducing noise
• Match Engine uses inference with multiple sources of “evidence” to match users with events

Demo Application: Bandalay Facebook App

NLP and (Un)Certainty

• Capture Error / Uncertainty in Model from NLP
• “Reify” relationships so metadata will “fit”
• Use multiple types of analysis
• Rules, Machine Learning, Topology, Curation, User Feedback
• Separate general model and domain model
• Allows asserting a fact in the domain model or not (don’t “decide” everything at once)
• Use semantics to make decisions about data
• Inference can use thresholds to decide to assert facts (or not)
• Guided Inference can make informed choice about facts to add/remove from model

Contact Information

750 Menlo Ave, Suite 340
Menlo Park, CA 94025
(415) 310-4406
marc@alitora.com

155 Water Street
Brooklyn, NY 11201
(917) 463-4776
peter@alitora.com

