deep machine reading

Deep Machine Reading: Taming Unstructured, Natural Language Data

Naveen Ashish

University of Southern California & Cognie Inc.,

BigData TECHCON, San Francisco, October 29th 2014

This is about …..

DEEP MACHINE READING

The hard nut of having computers “understand” natural language (text) ….

Pushing the boundaries of what we can achieve ….

A True AI Challenge

"It's (the problem of computers understanding natural language) ambitious ... in fact there's no more important project than understanding intelligence and recreating it." - Ray Kurzweil (2013)

"Alan Turing based the Turing Test entirely on written language….To really master natural language …that’s the key to the Turing Test–to a human requires the full scope of human intelligence. …So the point is that natural language is a very profound domain to do artificial intelligence in." - Ray Kurzweil (2013)

“Another example of a good language problem is question answering, like What’s the second-biggest city in California that is not near a river?” Michael Jordan, in response to “What would you do with $1B?”, IEEE Spectrum Interview Oct 2014

Commercial Relevance Today

The problem of taming unstructured data is far from solved!

search

text analytics

big data analytics

health informatics

social-media intelligence

mining research literature

Cognie Inc.: Incorporated in 2006

High-end consulting for semantic-search

Focus is on machine reading technologies

Work leverages information extraction work and systems conceptualized as part of university research

XAR: eXtraction with Adaptive Rules (Ashish and Mehrotra, 2009)

PEP: Pathology Extraction Pipeline (Ashish, Dahm and Boicey 2014)

Team: Developers, student interns, researchers

Blog: http://cognie.blog.com

Today: Building custom text analytics engines


Model

Build custom text understanding engines for domains

Cognie™ Platform for Building Text Analytics Engines

Retail Text Engine

Health NLP Engine

Research Mining Engine

Customization, Application Integration, Evolution

Outline

Deep machine reading: What is, and why needed

State-of-the-art

Fundamentals

Approach

Details

Case studies

Retail, Health, Risk assessment, Customer support, Intelligence

Conclusions

What is “Deep” machine reading ?

Deep Machine Reading is ….

The ability to distill the abstract from text

The ability to comprehensively extract multiple concepts and relationships from the text

The ability to link extracted elements to known concepts

The ability to use the text (data) itself, to improve understanding of that text

The Abstract, in Text

The abstract, not explicitly mentioned !

What falls in this category

Expressions

Contextual sentiment

Aspects or Categories

I think you need better chefs SUGGESTION

The mocha is too sweet NEGATIVE

I used to take Lipitor for … PERSONAL EXPERIENCE

The dim lights have a cozy effect …. AMBIENCE

Classification, rather than Extrication

Much of the technology, up to recently, is extrication focused

Extricate particular terms, elements, concepts from the text

Extrication

Named-Entity extraction PERSONS, ORGANIZATIONS, LOCATIONS, …

Sentiment extraction Based on polar words

Need for much more sophisticated classification of text snippets

Along different dimensions of interest

A Comprehensive Signature of Text

Cognie experience

Many applications have unique requirements of what they want from the text

“ …and for six months I was indeed taking Lipitor but I must say ….” PERSONAL EXPERIENCE

“…there is direct correlation between Cadmium exposure and lung …” CAUSALITY

But many groups of applications have common requirements within

Primary elements required from text: Expressions, Entities, Sentiment (Contextual, Qualified), Emotion, Topics, Categories/Aspects, Specific signal (“directionality”), Relationships

Deeper Text Analysis, Better Insights

Goal: Get actionable insights from data!

Hypothesis: Deeper extraction leads to better insights!

The top advice items for skin rash are aloe vera, vitamin E oil and oatmeal

Complaints comprise 36% of the overall feedback, with top issues being slow service, drinks and coffee

73% of all research articles indicate that Cadmium is a causal factor for lung irritation

Context

COGNIE™: A Platform for Text Analytics

[Diagram: the COGNIE™ platform, with XAR and UCI-PEP, and the applications SHIP, Survey Analytics, Retail Analytics, Risk Assessment]

Modus Operandi

All applications require a structured representation of the (unstructured) data

A structured database/meta-base that powers Analytics dashboards

Data coding processes

Risk assessment computations

Consumer health portals

….

Manual extraction processes are typically in place

Goal is to eliminate or alleviate manual effort

Text Analytics Spectrum

Gamut of Text Analytics Engines in Market

• Lexalytics

• Alchemy API

• Semantria

• Clarabridge

• ConveyAPI

• Linguamatics

• ….

Engines Aiming Deeper

• Luminoso

• Attensity

• …

Availability of Open-source Text Analysis Tools

• UIMA

• GATE

• Deep Learning for Sentiment Analysis (Stanford)

• Recursive Neural Networks

• http://openair.allenai.org

Approach

Approach

natural language processing

machine learning

semantics

Architecture: COGNIE™ Platform

NLP: Segmentation, POS Tagging, Entity extraction, Anaphora, Parsing, Gram analysis

Knowledge Engineering: Existing (DMOZ, SNOMED, UMLS), Creation, Declarative

Machine Learning: Naïve-Bayes, MaxEnt, TFIDF, CRF, RNN Deep Learning, ENSEMBLE

COGNIE™: Open-source Leverage

Framework: UIMA

Classification: Weka, Mallet

NLP: Stanford CoreNLP

Indexing: Lucene

Databases: MySQL, MongoDB

Knowledge Engineering: Protégé

Topic mining: Mallet

Sentiment: Stanford Deep Learner

Step 0: Basic Text Analysis

Text Segmentation

In many cases the “unit” of distillation is a sentence

Segmentation strategies: built-in, such as in UIMA or GATE

Custom segmentation

Sentence decomposition: decompose sentence into individual clauses
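
A minimal sketch of custom segmentation followed by sentence decomposition, in plain Python. The boundary rules here (sentence ends at `.`, `!` or `?` followed by whitespace; clauses split at commas, semicolons and "but") are simplifying assumptions for illustration, not the rules used in the platform or in UIMA/GATE.

```python
import re

# Assumed sentence boundary: ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Assumed clause boundary: a comma/semicolon (optionally followed by
# "but"/"and"), or a bare " but ".
_CLAUSE_SPLIT = re.compile(r"\s*[;,]\s*(?:(?:but|and)\s+)?|\s+but\s+", re.IGNORECASE)


def split_sentences(text):
    """Split raw text into sentences, the usual 'unit' of distillation."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def split_clauses(sentence):
    """Decompose one sentence into rough clauses."""
    return [c.strip() for c in _CLAUSE_SPLIT.split(sentence) if c.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("The food was great, but the service is slow. Aisles are too narrow!"):
        print(split_clauses(sentence))
```

Splitting at the clause level lets a later classifier label "The food was great" and "the service is slow" separately instead of assigning one label to the mixed sentence.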

Expressions

Beyond entities and sentiment: EXPRESSIONS

Introduced in [Ashish et al, 2011]

Expressions

…showers had no hot water !… COMPLAINT

..you should have more veggie options… SUGGESTION

RETAIL/ENTERPRISE

..meats on special this weekend… ANNOUNCEMENT

..this is the best store on the west side… ADVOCACY

There is hardly any evidence to suggest a link between salt and diabetes -

These results confirm that high intake of salt leads to an increase in BP +

RISK ASSESSMENT

Expressions

You should try Vitamin E oil … ADVICE

..I have had arthritis since 1991… EXPERIENCE

HEALTH

..for me lipitor worked like a charm… OUTCOME

The Indicators: “Give Aways”

A combination of multiple types of elements !

…showers had no hot water !… COMPLAINT

(You) should have more veggie options… SUGGESTION

..i have been on lipitor… EXPERIENCE

..this is the best store on the west side… ADVOCACY

Approach: Given Indicators

NLP

Identification of individual elements (unsupervised)

Relationships between elements

Semantics

Identification of individual elements (knowledge driven)

Machine Learning Classification

Combine elements, then classify

Expression Classification: Relevant Features

Curated lexicons of specific indicative phrases

Examples: “could you”, “I took”, ….

Approach: manual creation of “seed” lexicons

Automated expansion from data plus resources such as WordNet

The Sentiment

For instance a Complaint would almost always have negative sentiment

Punctuations, Other expressions or emoticons

Expression Classification Features

Positional information of words, phrases, or part-of-speech patterns in the sentence

Suggestions will usually begin with certain ‘request’ words

Custom patterns

Such as subject-verb-object for PERSONAL EXPERIENCE

Ontology concepts
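
The feature types listed above can be sketched as a small feature extractor. This is an illustration under assumptions: the lexicon entries beyond "could you" and "I took", the set of 'request' words, and the feature names are all hypothetical, and the sentiment value is assumed to come from a separate component.

```python
import re

# Hypothetical seed lexicons; the slides give only "could you" and "I took".
SEED_LEXICONS = {
    "SUGGESTION": ["could you", "you should", "should have"],
    "EXPERIENCE": ["i took", "i have had", "i have been on"],
}
# Assumed 'request' words that suggestions tend to begin with.
REQUEST_WORDS = {"could", "please", "should", "would"}


def extract_features(sentence, sentiment=None):
    """Turn one sentence into a dict of features for an expression classifier."""
    text = sentence.lower()
    tokens = re.findall(r"[a-z']+", text)
    features = {}
    # Lexicon features: does any indicative phrase occur in the sentence?
    for label, phrases in SEED_LEXICONS.items():
        features["lex_" + label] = any(phrase in text for phrase in phrases)
    # Positional feature: does the sentence begin with a request word?
    features["starts_with_request"] = bool(tokens) and tokens[0] in REQUEST_WORDS
    # Punctuation feature.
    features["has_exclamation"] = "!" in sentence
    # Sentiment is passed in from an upstream sentiment step, if available.
    if sentiment is not None:
        features["sentiment"] = sentiment
    return features


if __name__ == "__main__":
    print(extract_features("Could you add more veggie options?"))
    print(extract_features("I took Lipitor for six months", sentiment="neutral"))
```

The resulting dictionaries can be fed to any classifier that accepts named features; part-of-speech patterns and ontology concepts would be added as further keys in the same way.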

Expression Classification: Results

Have achieved 75% precision and recall for all expressions considered

Factors

Feature engineering

Classifier selection

Knowledge engineering

Before Automated Classification: Manual Patterns

SoL: Sequences of Labels

Labels

LEX-FOODADJ: spicy

LEX-EXCESS: too, very

ONT-FOOD

POS-NOUN

Sequences (Patterns)

ANY LEX-EXCESS LEX-FOODADJ ANY → Negative

POS-VB POS-MD * → Suggestion
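
A minimal sketch of how such label sequences might be matched against a sentence. It assumes that `ANY` and `*` both mean "zero or more tokens", and that each token can carry several labels; the word "sweet" is added to the LEX-FOODADJ lexicon purely for illustration. Only lexicon labels are implemented here; ONT- and POS- labels would need an ontology lookup and a tagger.

```python
WILDCARDS = {"ANY", "*"}  # assumption: both match zero or more tokens

# Lexicon labels from the slide; "sweet" is an illustrative addition.
LEXICONS = {
    "LEX-FOODADJ": {"spicy", "sweet"},
    "LEX-EXCESS": {"too", "very"},
}
PATTERNS = [
    (["ANY", "LEX-EXCESS", "LEX-FOODADJ", "ANY"], "Negative"),
]


def label_tokens(tokens):
    """Attach to each token the set of lexicon labels it belongs to."""
    return [{name for name, words in LEXICONS.items() if tok.lower() in words}
            for tok in tokens]


def matches(pattern, token_labels):
    """True if the whole label sequence matches the whole token sequence."""
    if not pattern:
        return not token_labels
    head, rest = pattern[0], pattern[1:]
    if head in WILDCARDS:
        # Let the wildcard absorb 0..n tokens and try the rest of the pattern.
        return any(matches(rest, token_labels[i:])
                   for i in range(len(token_labels) + 1))
    return (bool(token_labels) and head in token_labels[0]
            and matches(rest, token_labels[1:]))


def classify(sentence):
    """Return the outcome of the first matching pattern, or None."""
    labels = label_tokens(sentence.split())
    for pattern, outcome in PATTERNS:
        if matches(pattern, labels):
            return outcome
    return None


if __name__ == "__main__":
    print(classify("The mocha is too sweet"))
    print(classify("The mocha is sweet"))
```

The recursive matcher is exponential in the worst case, which is acceptable for sentence-length inputs and a handful of wildcards.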

Classification: Machine Learning

Classification tasks

Expression

(Contextual) Sentiment

Aspect category

Frameworks

Weka

Mallet

Baseline Classifiers for Expressions

Mallet and Weka

NaiveBayes

MaxEnt

CRF

Gram-based

Uni, Bi and Trigram features

Baseline

~ 10% accuracy
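
For illustration, a gram-based Naive Bayes baseline can be sketched in a few lines of plain Python. The slides use Mallet and Weka for this; the code below is a stand-in that shows the idea (uni-, bi- and trigram counts with add-one smoothing), not those toolkits' APIs, and the training sentences are taken from the examples earlier in the deck plus two invented ones.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n=3):
    """Return all uni-, bi- and trigrams of a whitespace-tokenized sentence."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes over n-gram counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = ngrams(text)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(count / total)  # class prior
            for gram in ngrams(text):
                if gram in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    model = NaiveBayes().fit(
        ["you should have more veggie options", "you should open earlier",
         "showers had no hot water", "the room had no towels"],
        ["SUGGESTION", "SUGGESTION", "COMPLAINT", "COMPLAINT"],
    )
    print(model.predict("you should add parking"))
```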

Expression Classifiers

Trees

Decision Tree (J48)

Functions

Logistic Regression

SVM

Sequence Tagging

CRF: Conditional Random Fields

Entities

Named-entity extractors

The generic PERSON, ORGANIZATION, LOCATION

Ngram and part-of-speech analysis

Frequently mentioned ‘entities’

Improves recall

Ontology driven concept mapping

Using pre-assembled domain ontologies/taxonomies/dictionaries

Based on modules like UIMA ConceptMapper

Scale is a challenge
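
Ontology-driven concept mapping can be pictured as a longest-match dictionary lookup over the token stream. The sketch below is an assumption-laden illustration: the mini-ontology and its concept identifiers are invented, and the code does not reproduce the UIMA ConceptMapper API.

```python
# Hypothetical mini-ontology: surface form -> concept identifier.
ONTOLOGY = {
    "lipitor": "DRUG/Lipitor",
    "vitamin e oil": "REMEDY/VitaminEOil",
    "skin rash": "CONDITION/SkinRash",
}
MAX_PHRASE_LEN = max(len(key.split()) for key in ONTOLOGY)


def map_concepts(text):
    """Return (surface form, concept id) pairs, preferring the longest match."""
    tokens = [tok.strip(".,!?;:").lower() for tok in text.split()]
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in ONTOLOGY:
                found.append((phrase, ONTOLOGY[phrase]))
                i += n
                break
        else:
            i += 1  # no concept starts at this token
    return found


if __name__ == "__main__":
    print(map_concepts("You should try Vitamin E oil for a skin rash."))
```

With an in-memory dictionary each lookup is constant time, but very large ontologies may not fit comfortably in memory, which is one way the scale challenge shows up.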

Contextual Sentiment

(Just) polar words can be misleading !

Polar words may not be present at all !

Combination of elements

The mocha is too sweet

Wait time is over an hour

Aisles are too narrow

Service is slow

Qualified Sentiment

Classify negative comments

Further segregate into

Immediately actionable items

‘Long term’ issues

Approach

Curation of Ngrams for each type of negative comments

Classifier

Topic Mining

Motivated by feedback survey analytics: people can talk about “anything”

Interested in broad ‘topics’ of discussion, but the set of topics is dynamic, not necessarily known

Unsupervised topic mining with LDA: Latent Dirichlet Allocation

As-is, LDA led to very fragmented topics that were not semantically meaningful

Solution: consolidation of terms using WordNet

Expand terms using WordNet synonyms

Consolidate with manual curation afterwards

Semi-automated approach

Cohesive Topic Mining

Problem with WordNet (synonym) expansion

Prone to semantic divergence

Example

Presentation Project(or) Milestones

(Almost) strongly connected components in relationship graph

Manual review after
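
One way to picture the graph step: treat each term as a node, add a directed edge when one term expands to another, and keep only groups of terms that all reach each other. The sketch below computes exact strongly connected components with Kosaraju's algorithm; the "almost" relaxation mentioned on the slide is not implemented, and the edge list is invented for illustration.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Kosaraju's algorithm: return a list of sets of mutually reachable nodes."""
    graph, reverse, nodes = defaultdict(list), defaultdict(list), set()
    for src, dst in edges:
        graph[src].append(dst)
        reverse[dst].append(src)
        nodes.update((src, dst))

    # Pass 1: record nodes in order of DFS completion on the original graph.
    order, seen = [], set()

    def visit(node):
        seen.add(node)
        for nxt in graph[node]:
            if nxt not in seen:
                visit(nxt)
        order.append(node)

    for node in sorted(nodes):
        if node not in seen:
            visit(node)

    # Pass 2: DFS on the reversed graph, latest-finished nodes first.
    components, assigned = [], set()

    def collect(node, component):
        assigned.add(node)
        component.add(node)
        for prev in reverse[node]:
            if prev not in assigned:
                collect(prev, component)

    for node in reversed(order):
        if node not in assigned:
            component = set()
            collect(node, component)
            components.append(component)
    return components


# Hypothetical expansion edges: a mutual cluster and a one-way divergent chain.
EDGES = [
    ("service", "staff"), ("staff", "service"),
    ("staff", "waiter"), ("waiter", "staff"),
    ("presentation", "projector"), ("projector", "project"),
    ("project", "milestones"),
]

if __name__ == "__main__":
    for component in strongly_connected_components(EDGES):
        print(sorted(component))
```

In this toy graph "service", "staff" and "waiter" form one cohesive topic, while the one-way chain from "presentation" to "milestones" stays split into single terms, which is the divergence the consolidation step is meant to avoid.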

Aspect Classification

Binning data into few broad categories

Approach Ngram mining

Classification

Categories over Topics

Consolidate topics into broad, fixed categories

Ontology mapping approach

Each category has associated concepts

Topic signature maps to category concepts

Hershey, Bieber, Cocoa beans

Personnel, Competitors

Yearly reviews
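
A minimal sketch of mapping a topic signature to a category by concept overlap. Which concepts belong to which category is an assumption made here for illustration (the slide does not spell out the mapping), as are the overlap scoring and the `min_overlap` parameter.

```python
# Hypothetical category -> associated concepts.
CATEGORY_CONCEPTS = {
    "Competitors": {"hershey", "nestle", "mars"},
    "Personnel": {"manager", "staff", "yearly reviews"},
}


def map_topic_to_category(topic_terms, min_overlap=1):
    """Pick the category whose concepts overlap most with the topic's terms."""
    terms = {term.lower() for term in topic_terms}
    best, best_overlap = None, 0
    for category, concepts in CATEGORY_CONCEPTS.items():
        overlap = len(terms & concepts)
        if overlap > best_overlap:
            best, best_overlap = category, overlap
    # Topics with too little overlap stay unassigned.
    return best if best_overlap >= min_overlap else None


if __name__ == "__main__":
    print(map_topic_to_category(["Hershey", "cocoa beans"]))
    print(map_topic_to_category(["yearly reviews", "staff"]))
```

Topics that match no category return `None`, so they can be routed to manual curation rather than forced into a bin.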

Emotion Extraction

Plutchik wheel of emotions

Fundamental emotion concepts captured in ontology

Augmented with indicator terms, and their synonyms

Ontology driven extraction for emotion concepts

Semantics is Key

Semantics

Domain knowledge is not ‘nice-to-have’ but critical

HEALTH

• Condition names

• Drug names

• Symptoms

• Procedures

• ..

RETAIL

• Food items

• Other products

• Competitors

• …

RESEARCH

• Chemical substances

• Harmful conditions

• …

INTELLIGENCE

• Manufacturers

• Vehicles

…

Leverage Existing Knowledge Sources

Health informatics

UMLS http://www.nlm.nih.gov/research/umls/

NCI Thesaurus http://ncit.nci.nih.gov/

SNOMED http://www.nlm.nih.gov/snomed

Retail

DMOZ http://www.dmoz.org

Many others

Freebase http://www.freebase.com

Wikipedia, DBPedia

OpenData data.gov


Knowledge Engineering Tools

Getting available ontologies into usable formats

Available as database dumps, RDF, or Web data

“Mini” ontology creation

Curate manually when possible (small dictionaries). Example: list of competitors

API access

Freebase https://www.freebase.com/query (query using ‘MQL’, the Metaweb Query Language, which is SPARQL-like)

BioPortal http://data.bioontology.org/documentation

Sometimes provided by the customer !


Practical Requirements

Confidence Measures

Quantitative confidence score for extracted elements

Binary confidence (Y/N): items marked not confident are routed for manual review

‘Explanation’ for classification

Relevant snippets

“….and the checkout times continue to be long despite …”

Complaint
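
The confidence and explanation requirements can be sketched as a small record plus a routing step. The field names and the 0.8 threshold are assumptions for illustration; the slides do not specify how the score is computed or where the cut-off lies.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    text: str          # the full input text
    label: str         # e.g. "Complaint"
    confidence: float  # quantitative confidence score in [0, 1]
    snippet: str       # relevant snippet offered as the 'explanation'


def is_confident(extraction, threshold=0.8):
    """Binary (Y/N) confidence derived from the quantitative score."""
    return extraction.confidence >= threshold


def route(extractions, threshold=0.8):
    """Split extractions into auto-accepted items and items for manual review."""
    accepted, review = [], []
    for extraction in extractions:
        target = accepted if is_confident(extraction, threshold) else review
        target.append(extraction)
    return accepted, review


if __name__ == "__main__":
    items = [
        Extraction("...", "Complaint", 0.93,
                   "and the checkout times continue to be long despite"),
        Extraction("...", "Suggestion", 0.41, "maybe more registers"),
    ]
    accepted, review = route(items)
    print(len(accepted), "accepted;", len(review), "sent to manual review")
```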

Feedback Learning Mechanisms

Manual overview is not dismissed entirely

Comprehensive pipeline for manual review

Learn and improve from feedback

Applications

Applications

Core Cognie Platform

Retail Analytics Engine

Health Distillation Engine

Survey Analytics Engine

Research Mining Engine

Coding Validation Engine

Risk Analysis System

Coding Processes

Health Insights Portal

Scalability

Scale requirements: large numbers of documents as opposed to large document size

Throughput can be an issue

Complex language processing algorithms

Feature extraction can be complex

Large ontologies in some cases

Solutions

Multi-threading and thread pooling architecture

Hadoop MapReduce [Kahn and Ashish, 2014]
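
Because the load is many small documents, each document can be processed independently by a pool of workers. The sketch below uses Python's standard `concurrent.futures`; the `analyze` function is a hypothetical stand-in for the real per-document pipeline, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(document):
    """Stand-in for the per-document pipeline (segmentation, features, ...)."""
    return {"length": len(document.split())}


def analyze_all(documents, max_workers=4):
    """Process documents concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze, documents))


if __name__ == "__main__":
    docs = ["showers had no hot water", "service is slow"]
    print(analyze_all(docs))
```

In CPython, threads mainly help when the per-document work waits on I/O or releases the interpreter lock; for CPU-bound pure-Python processing, `ProcessPoolExecutor` from the same module is the usual substitute.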

Conclusions

Grand Challenge Projects

Aristo

At AI2, Allen AI Institute http://www.allenai.org

Areas: Knowledge Extraction, Reasoning, Question Answering

Can the system answer 4th, 6th grade exams ?

Project NELL: Never Ending Language Learning http://rtw.ml.cmu.edu/rtw/

“Learnt” 50+ million facts from Web data


Conclusions

Deeper distillation from text is required

Can be achieved by

Detecting and combining multiple elements in text

Feature engineering

Knowledge engineering

Classifier selection

Semantics and Knowledge Engineering are key

Have been successful in leveraging the Cognie™ Platform to develop custom text analytics engines in multiple domains

Thank you
