deep machine reading
DESCRIPTION
Talk on Deep Machine Reading (technologies) given at BigData Techcon 2014
TRANSCRIPT
Deep Machine Reading: Taming Unstructured, Natural Language Data
Naveen Ashish
University of Southern California & Cognie Inc.,
BigData TECHCON, San Francisco, October 29th 2014
This is about …..
DEEP MACHINE READING
The hard nut of having computers “understand” natural language (text) ….
Pushing the boundaries of what we can achieve ….
A True AI Challenge
"It's (the problem of computers understanding natural language) ambitious ...in fact there's no more important project than understanding intelligence and recreating it." - Ray Kurzweil (2013)
Alan Turing based the Turing Test entirely on written language….To really master natural language …that’s the key to the Turing Test–to a human requires the full scope of human intelligence. …So the point is that natural language is a very profound domain to do artificial intelligence in. -Ray Kurzweil (2013)
“Another example of a good language problem is question answering, like What’s the second-biggest city in California that is not near a river?” Michael Jordan, in response to “What would you do with $1B?”, IEEE Spectrum Interview Oct 2014
Commercial Relevance Today
The problem of taming unstructured data is far from solved!
search
text analytics
big data analytics
health informatics
social-media intelligence
mining research literature
Cognie Inc.
Incorporated in 2006
High-end consulting for semantic-search
Focus is on machine reading technologies
Builds on information extraction work and systems conceptualized as part of university research
XAR: eXtraction with Adaptive Rules (Ashish and Mehrotra, 2009)
PEP: Pathology Extraction Pipeline (Ashish, Dahm and Boicey 2014)
Team: Developers, student interns, researchers
Blog http://cognie.blog.com
Today Building custom text analytics engines
Model
Build custom text understanding engines for domains
Cognie™ Platform for Building Text Analytics Engines
Retail Text Engine
Health NLP Engine
Research Mining Engine
Customization, Application Integration, Evolution
Outline
Deep machine reading: what it is, and why it is needed
State-of-the-art
Fundamentals
Approach
Details
Case studies
Retail, Health, Risk assessment, Customer support, Intelligence
Conclusions
What is “Deep” machine reading ?
Deep Machine Reading is ….
The ability to distill the abstract from text
The ability to comprehensively extract multiple concepts and relationships from the text
The ability to link extracted elements to known concepts
The ability to use the text (data) itself, to improve understanding of that text
The Abstract, in Text
The abstract, not explicitly mentioned !
What falls in this category
Expressions
Contextual sentiment
Aspects or Categories
I think you need better chefs SUGGESTION
The mocha is too sweet NEGATIVE
I used to take Lipitor for … PERSONAL EXPERIENCE
The dim lights have a cozy effect …. AMBIENCE
Classification, rather than Extrication
Much of the technology, up to recently, is extrication focused
Extricate particular terms, elements, concepts from the text
Extrication
Named-Entity extraction PERSONS, ORGANIZATIONS, LOCATIONS, …
Sentiment extraction Based on polar words
Need for much more sophisticated classification of text snippets
Along different dimensions of interest
A Comprehensive Signature of Text
Cognie experience
Many applications have unique requirements of what they want from the text
“…and for six months I was indeed taking Lipitor but I must say ….” PERSONAL EXPERIENCE
“…there is direct correlation between Cadmium exposure and lung …” CAUSALITY
But, many groups of applications have common requirements within
Primary elements required from text
Expressions
Entities
Sentiment (Contextual, Qualified)
Emotion
Topics
Categories/Aspects
Specific signal (“directionality”)
Relationships
Deeper Text Analysis → Better Insights
Goal: Get actionable insights from data !
Hypothesis: Deeper extraction → Better insights !
The top advice items for skin rash are aloe vera, vitamin E oil and oatmeal
Complaints comprise 36% of the overall feedback, with the top issues being slow service, drinks and coffee
73% of all research articles indicate that Cadmium is a causal factor for lung irritation
Context
COGNIE™: A PLATFORM for text analytics
[Diagram labels: COGNIE™; XAR; UCI-PEP; SHIP; SURVEY ANALYTICS; RETAIL ANALYTICS; RISK ASSESSMENT]
Modus Operandi
All applications require a structured representation of the (unstructured) data
A structured database/meta-base that powers Analytics dashboards
Data coding processes
Risk assessment computations
Consumer health portals
….
Manual extraction processes are typically in place
Goal is to eliminate or alleviate manual effort
Text Analytics Spectrum
Gamut of Text Analytics Engines in Market
• Lexalytics
• Alchemy API
• Semantria
• Clarabridge
• ConveyAPI
• Linguamatics
• ….
Engines Aiming Deeper
• Luminoso
• Attensity
• …
Availability of Open-source Text Analysis Tools
• UIMA
• GATE
• Deep Learning for Sentiment Analysis (Stanford)
• Recursive Neural Networks
• http://openair.allenai.org
Approach
natural language processing
machine learning
semantics
Architecture: COGNIE™ Platform
NLP: Segmentation, POS Tagging, Entity extraction, Anaphora, Parsing, Gram analysis
Knowledge Engineering: Existing (DMOZ, SNOMED, UMLS), Creation, Declarative
Machine Learning: Naïve-Bayes, MaxEnt, TFIDF, CRF, RNN Deep Learning, ENSEMBLE
COGNIE™: Open-source Leverage
Framework: UIMA
Classification: Weka, Mallet
NLP: Stanford CoreNLP
Indexing: Lucene
Databases: MySQL, MongoDB
Knowledge Engineering: Protégé
Topic mining: Mallet
Sentiment: Stanford Deep Learner
Step 0: Basic Text Analysis
Text Segmentation
In many cases the “unit” of distillation is a sentence
Segmentation strategies
Built-in, such as in UIMA or GATE
Custom segmentation
Sentence decomposition
Decompose sentence into individual clauses
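The two segmentation steps above can be sketched in a few lines. This is a minimal illustration of custom segmentation, assuming simple punctuation and connective rules; it is not the UIMA or GATE segmenter, and the function names and the list of clause connectives are hypothetical.

```python
import re

# Hypothetical clause connectives; a real system would likely rely on a parser.
CLAUSE_BOUNDARY = re.compile(r"\s*(?:;|\bbut\b|\balthough\b)\s*", re.IGNORECASE)


def split_sentences(text):
    """Naive sentence segmentation: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_clauses(sentence):
    """Naive sentence decomposition: break a sentence at a few connectives."""
    parts = CLAUSE_BOUNDARY.split(sentence)
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("The mocha is too sweet. Service is slow!"):
        print(sentence)
    print(split_clauses("The coffee was great but the wait was over an hour"))
```

Rules like these break on abbreviations and on connectives used inside a clause, which is one reason a custom or library segmenter is usually preferred.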
Expressions
Beyond entities and sentiment : EXPRESSIONS
EXPRESSIONS
Introduced in [Ashish et al, 2011]
Expressions
…showers had no hot water !… COMPLAINT
..you should have more veggie options… SUGGESTION
RETAIL/ENTERPRISE
..meats on special this weekend… ANNOUNCEMENT
..this is the best store on the west side… ADVOCACY
There is hardly any evidence to suggest a link between salt and diabetes -
These results confirm that high intake of salt leads to increase in BP +
RISK ASSESSMENT
Expressions
You should try Vitamin E oil … ADVICE
..I have had arthritis since 1991… EXPERIENCE
HEALTH
..for me lipitor worked like a charm… OUTCOME
The Indicators: “Give Aways”
A combination of multiple types of elements !
…showers had no hot water !… COMPLAINT
(You) should have more veggie options… SUGGESTION
..i have been on lipitor… EXPERIENCE
..this is the best store on the west side… ADVOCACY
Approach: Given Indicators
NLP
Identification of individual elements Unsupervised
Relationships between elements
Semantics
Identification of individual elements Knowledge driven
Machine Learning Classification
Combine elements, then classify
Expression Classification: Relevant Features
Curated lexicons of specific indicative phrases
Examples: “could you”, “I took”, ….
Approach: Manual creation of “seed” lexicons
Automated expansion from data plus resources such as WordNet
The Sentiment
For instance a Complaint would almost always have negative sentiment
Punctuations, Other expressions or emoticons
Expression Classification Features
Positional information of words, phrases, or part-of-speech patterns in the sentence
Suggestions will usually begin with certain ‘request’ words
Custom patterns
Such as subject-verb-object for PERSONAL EXPERIENCE
Ontology concepts
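A rough sketch of how features like these might be turned into a feature dictionary for a classifier. The phrases “could you” and “I took” come from the slide; every other lexicon entry, the request-word list, the negative-word list and the function name are assumptions made only for illustration. Part-of-speech patterns and ontology concepts are left out.

```python
import re

# Seed lexicons of indicative phrases ("could you" and "i took" are from the
# talk; the remaining phrases are hypothetical additions).
SEED_LEXICONS = {
    "SUGGESTION": {"could you", "you should", "should have"},
    "EXPERIENCE": {"i took", "i have been", "i have had"},
}
REQUEST_WORDS = {"please", "could", "can", "should"}  # hypothetical
NEGATIVE_WORDS = {"no", "slow", "long"}               # hypothetical stand-in for a sentiment signal


def extract_features(sentence):
    """Build a small feature dictionary for one sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    padded = " " + " ".join(tokens) + " "
    features = {}
    # Lexicon features: does any indicative phrase for a label occur?
    for label, phrases in SEED_LEXICONS.items():
        features["lex_" + label] = any(" " + p + " " in padded for p in phrases)
    # Positional feature: suggestions often begin with a 'request' word.
    features["starts_with_request"] = bool(tokens) and tokens[0] in REQUEST_WORDS
    # Punctuation feature.
    features["has_exclamation"] = "!" in sentence
    # Crude sentiment feature.
    features["has_negative_word"] = any(t in NEGATIVE_WORDS for t in tokens)
    return features


if __name__ == "__main__":
    print(extract_features("Could you add more veggie options?"))
    print(extract_features("showers had no hot water !"))
```

The resulting dictionaries can be fed to any classifier that accepts boolean features.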
Expression Classification: Results
Have achieved 75% precision and recall for all expressions considered
Factors
Feature engineering
Classifier selection
Knowledge engineering
Before Automated Classification: Manual Patterns
SoL: Sequences of Labels
Labels
LEX-FOODADJ: spicy
LEX-EXCESS: too, very
ONT-FOOD
POS-NOUN
Sequences (Patterns)
ANY LEX-EXCESS LEX-FOODADJ ANY → Negative
POS-VB POS-MD * → Suggestion
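A minimal sketch of matching a sequence-of-labels pattern, covering only the lexicon-based Negative pattern above. It assumes ANY means "zero or more tokens"; that reading, the helper names, and the extra adjectives "sweet" and "salty" are assumptions, and the POS- and ONT- labels are not implemented here.

```python
import re

# Label lexicons. "spicy", "too" and "very" come from the slide; "sweet" and
# "salty" are added only so the example sentence matches.
LEXICONS = {
    "LEX-FOODADJ": {"spicy", "sweet", "salty"},
    "LEX-EXCESS": {"too", "very"},
}

# Each pattern is a sequence of labels plus an outcome.
PATTERNS = [
    (["ANY", "LEX-EXCESS", "LEX-FOODADJ", "ANY"], "Negative"),
]


def label_token(token):
    """Return the lexicon label for a token, or 'O' when no lexicon contains it."""
    for label, words in LEXICONS.items():
        if token in words:
            return label
    return "O"


def matches(pattern, labels):
    """Compile the label pattern into a regex over the space-joined label string."""
    parts = []
    for item in pattern:
        if item == "ANY":
            parts.append(r"(?:\S+ )*")          # zero or more labels of any kind
        else:
            parts.append(re.escape(item) + " ")  # exactly this label
    regex = "^" + "".join(parts) + "$"
    return re.match(regex, "".join(label + " " for label in labels)) is not None


def classify(sentence):
    """Label each token, then return the outcome of the first matching pattern."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    labels = [label_token(t) for t in tokens]
    for pattern, outcome in PATTERNS:
        if matches(pattern, labels):
            return outcome
    return None


if __name__ == "__main__":
    print(classify("The mocha is too sweet"))
    print(classify("Service is slow"))
```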
Classification: Machine Learning
Classification tasks
Expression
(Contextual) Sentiment
Aspect category
Frameworks
Weka
Mallet
Baseline Classifiers for Expressions
Mallet and Weka
NaiveBayes
MaxEnt
CRF
Gram-based
Uni, Bi and Trigram features
Baseline
~ 10% accuracy
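To make the gram-based baseline concrete, here is a small multinomial Naive Bayes over unigram, bigram and trigram features in plain Python. It is a stand-in for illustration only: it does not use the Mallet or Weka APIs, the class and function names are hypothetical, and the six training sentences are borrowed from examples elsewhere in this talk rather than from any real training set.

```python
import math
import re
from collections import Counter, defaultdict


def ngrams(text, max_n=3):
    """Return all uni-, bi- and trigrams of a lowercased sentence."""
    tokens = re.findall(r"[a-z']+", text.lower())
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = ngrams(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for feat in ngrams(text):
                if feat in self.vocab:              # ignore unseen n-grams
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (illustrative only).
TRAIN = [
    ("showers had no hot water", "COMPLAINT"),
    ("the checkout times continue to be long", "COMPLAINT"),
    ("you should have more veggie options", "SUGGESTION"),
    ("I think you need better chefs", "SUGGESTION"),
    ("this is the best store on the west side", "ADVOCACY"),
    ("meats on special this weekend", "ANNOUNCEMENT"),
]

if __name__ == "__main__":
    model = NaiveBayes().fit([t for t, _ in TRAIN], [l for _, l in TRAIN])
    print(model.predict("you should have more checkout lanes"))
```

With so little data the output says nothing about real accuracy; the point is only the shape of a gram-based baseline.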
Expression Classifiers
Trees
Decision Tree (J48)
Functions
Logistic Regression
SVM
Sequence Tagging
CRF: Conditional Random Fields
Entities
Named-entity extractors
The generic PERSON, ORGANIZATION, LOCATION
Ngram and part-of-speech analysis
Frequently mentioned ‘entities’
Improves recall
Ontology driven concept mapping
Using pre-assembled domain ontologies/taxonomies/dictionaries
Based on modules like UIMA ConceptMapper
Scale is a challenge
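Ontology-driven concept mapping can be pictured as a longest-match dictionary lookup. The sketch below is not the UIMA ConceptMapper; it is a simplified stand-in, and the mini dictionary, the concept names and the function name are hypothetical.

```python
import re

# Hypothetical mini dictionary: surface phrase -> concept type.
ONTOLOGY = {
    "lipitor": "DRUG",
    "vitamin e oil": "REMEDY",
    "skin rash": "CONDITION",
    "arthritis": "CONDITION",
}
MAX_LEN = max(len(phrase.split()) for phrase in ONTOLOGY)


def map_concepts(text):
    """Scan left to right, preferring the longest dictionary phrase at each position."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in ONTOLOGY:
                found.append((phrase, ONTOLOGY[phrase]))
                i += n
                break
        else:
            i += 1  # no phrase starts here; move on one token
    return found


if __name__ == "__main__":
    print(map_concepts("You should try Vitamin E oil for a skin rash"))
```

With ontologies the size of UMLS or SNOMED, an in-memory dictionary scan like this stops being practical, which is where the scale challenge noted above comes in.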
Contextual Sentiment
(Just) polar words can be misleading !
Polar words may not be present at all !
Combination of elements
The mocha is too sweet
Wait time is over an hour
Aisles are too narrow
Service is slow
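A toy illustration of why a combination of elements matters. The polar-word lexicons and the (aspect, descriptor) rules below are hypothetical and hand-written for the four example sentences; they are not the talk's actual model, which combines such elements in a classifier.

```python
import re

# Hypothetical polar-word lexicons.
POLAR_POSITIVE = {"sweet", "great", "good"}
POLAR_NEGATIVE = {"bad", "terrible", "awful"}

# Hypothetical contextual rules: (aspect term, descriptor term, sentiment).
CONTEXT_RULES = [
    ("mocha", "sweet", "NEGATIVE"),
    ("wait", "hour", "NEGATIVE"),
    ("aisles", "narrow", "NEGATIVE"),
    ("service", "slow", "NEGATIVE"),
]


def tokenize(text):
    return set(re.findall(r"[a-z']+", text.lower()))


def polar_sentiment(text):
    """Sentiment from polar words only."""
    tokens = tokenize(text)
    if tokens & POLAR_NEGATIVE:
        return "NEGATIVE"
    if tokens & POLAR_POSITIVE:
        return "POSITIVE"
    return None


def contextual_sentiment(text):
    """Prefer an aspect + descriptor combination; fall back to polar words."""
    tokens = tokenize(text)
    for aspect, descriptor, sentiment in CONTEXT_RULES:
        if aspect in tokens and descriptor in tokens:
            return sentiment
    return polar_sentiment(text)


if __name__ == "__main__":
    for s in ["The mocha is too sweet", "Wait time is over an hour"]:
        print(s, "|", polar_sentiment(s), "|", contextual_sentiment(s))
```

In the first sentence the polar word misleads, and in the second there is no polar word at all.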
Qualified Sentiment
Classify negative comments
Further segregate into
Immediately actionable items
‘Long term’ issues
Approach
Curation of Ngrams for each type of negative comments
Classifier
Topic Mining
Motivated by feedback survey analytics
People can talk about “anything”
Interested in broad ‘topics’ of discussion
But the set of topics is dynamic, not necessarily known
Unsupervised topic mining
LDA: Latent Dirichlet Allocation
As-is led to very fragmented topics that were semantically not meaningful
Solution: consolidation of terms using WordNet
Expand terms using WordNet synonyms
Consolidate with manual curation after
Semi-automated approach
Cohesive Topic Mining
Problem with WordNet (synonym) expansion
Prone to semantic divergence
Example
Presentation Project(or) Milestones
(Almost) strongly connected components in relationship graph
Manual review after
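One way to picture the "strongly connected" idea: keep only terms that point to each other in the expansion graph, then group them. The sketch below simplifies this to mutual links plus connected components, which is an assumption and not necessarily the criterion used in the talk. The expansion map is hypothetical and hand-written; it does not come from WordNet.

```python
from collections import defaultdict

# Hypothetical directed expansion relation: term -> terms it was expanded to.
EXPANSIONS = {
    "presentation": {"talk", "projector"},
    "talk": {"presentation"},
    "projector": {"project"},
    "project": {"milestones"},
    "milestones": {"project"},
}


def cohesive_groups(expansions):
    """Group terms joined by mutual (two-way) expansion links."""
    graph = defaultdict(set)
    for term, others in expansions.items():
        graph[term]  # make sure every term is a node, even if isolated
        for other in others:
            if term in expansions.get(other, set()):
                graph[term].add(other)
                graph[other].add(term)
    seen = set()
    groups = []
    for term in sorted(graph):
        if term in seen:
            continue
        stack, group = [term], set()
        while stack:  # depth-first walk over mutual links
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


if __name__ == "__main__":
    print(cohesive_groups(EXPANSIONS))
```

One-way links such as presentation to projector to project are dropped, so the drift from "presentation" to "milestones" does not merge the two groups. Manual review still follows.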
Aspect Classification
Binning data into few broad categories
Approach Ngram mining
Classification
Categories over Topics
Consolidate topics into broad, fixed categories
Ontology mapping approach
Each category has associated concepts
Topic signature maps to category concepts
[Diagram labels: Hershey; Bieber; Cocoa beans; Personnel; Competitors; Yearly reviews]
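Mapping a topic signature to a category can be sketched as an overlap count against each category's concepts. The categories, concept sets, threshold and function name below are hypothetical; in practice the concepts would come from a domain ontology.

```python
# Hypothetical categories and their associated concepts.
CATEGORY_CONCEPTS = {
    "SERVICE": {"wait", "checkout", "staff", "line"},
    "FOOD": {"coffee", "mocha", "drinks", "veggie"},
}


def assign_category(topic_terms, min_overlap=1):
    """Return the category whose concepts overlap most with the topic signature."""
    terms = set(topic_terms)
    best, best_overlap = None, 0
    for category, concepts in CATEGORY_CONCEPTS.items():
        overlap = len(concepts & terms)
        if overlap > best_overlap:
            best, best_overlap = category, overlap
    return best if best_overlap >= min_overlap else None


if __name__ == "__main__":
    print(assign_category(["mocha", "sweet", "coffee"]))
    print(assign_category(["parking"]))
```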
Emotion Extraction
Plutchik wheel of emotions
Fundamental emotion concepts captured in ontology
Augmented with indicator terms, and their synonyms
Ontology driven extraction for emotion concepts
Semantics is Key
Semantics
Domain knowledge is not ‘nice-to-have’ but critical
HEALTH
• Condition names
• Drug names
• Symptoms
• Procedures
• ..
RETAIL
• Food items
• Other products
• Competitors
• …
RESEARCH
• Chemical substances
• Harmful conditions
• …
INTELLIGENCE
• Manufacturers
• Vehicles
…
Leverage Existing Knowledge Sources
Health informatics
UMLS http://www.nlm.nih.gov/research/umls/
NCI Thesaurus http://ncit.nci.nih.gov/
SNOMED http://www.nlm.nih.gov/snomed
Retail
DMOZ http://www.dmoz.org
Many others
Freebase http://www.freebase.com
Wikipedia, DBPedia
OpenData data.gov
Knowledge Engineering Tools
Getting available ontologies into usable formats
Available as database dumps, RDF, or Web data
“Mini” ontology creation
Curate manually when possible (small dictionaries)
Example: list of competitors
API access
Freebase https://www.freebase.com/query
Query using ‘MQL’ (Metaweb Query Language, SPARQL-like)
BioPortal http://data.bioontology.org/documentation
Provided sometimes by customer !
Practical Requirements
Confidence Measures
Quantitative confidence score for extracted elements
Binary confidence Y/N: when not confident, routed for manual review
‘Explanation’ for classification
Relevant snippets
“….and the checkout times continue to be long despite …”
Complaint
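A small sketch of turning a quantitative confidence score into the binary Y/N decision with routing to manual review. The threshold value, the field names and the function name are assumptions made for illustration.

```python
def route(label, confidence, snippet, threshold=0.75):
    """Attach a Y/N confidence flag and decide where the item goes next.

    threshold is a hypothetical cut-off; it would be tuned per application.
    """
    confident = confidence >= threshold
    return {
        "label": label,
        "score": confidence,
        "confident": "Y" if confident else "N",
        "explanation": snippet,  # the text snippet that supports the label
        "queue": "auto" if confident else "manual_review",
    }


if __name__ == "__main__":
    print(route("Complaint", 0.91, "checkout times continue to be long"))
    print(route("Complaint", 0.40, "the lines were something else"))
```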
Feedback Learning Mechanisms
Manual overview is not dismissed entirely
Comprehensive pipeline for manual review
Learn and improve from feedback
Applications
Core Cognie Platform
Retail Analytics Engine
Health Distillation Engine
Survey Analytics Engine
Research Mining Engine
Coding Validation Engine
Risk Analysis System
Coding Processes
Health Insights Portal
Scale
Scalability
Scale requirements
Large numbers of documents as opposed to large document size
Throughput can be an issue
Complex language processing algorithms
Feature extraction can be complex
Large ontologies in some cases
Solutions
Multi-threading and thread pooling architecture
Hadoop MapReduce [Kahn and Ashish, 2014]
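Because the load is many small documents, the work parallelizes per document. Below is a minimal thread-pool sketch using Python's standard library; `analyze` is a hypothetical placeholder for the real per-document pipeline, and the worker count is arbitrary. For CPU-bound analysis in Python a process pool would usually be the better fit, since threads share one interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(document):
    """Placeholder for the per-document pipeline (segmentation, features, classification)."""
    return {"tokens": len(document.split())}


def analyze_all(documents, workers=4):
    """Process documents on a fixed-size pool of worker threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze, documents))


if __name__ == "__main__":
    docs = ["The mocha is too sweet", "Service is slow", "Aisles are too narrow"]
    print(analyze_all(docs))
```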
Conclusions
Grand Challenge Projects
Aristo
At AI2, Allen Institute for AI http://www.allenai.org
Areas: Knowledge Extraction, Reasoning, Question Answering
Can the system answer 4th, 6th grade exams ?
Project NELL
Never Ending Language Learning http://rtw.ml.cmu.edu/rtw/
“Learnt” 50+ million facts from Web data
Conclusions
Deeper distillation from text is required
Can be achieved by
Detecting and combining multiple elements in text
Feature engineering
Knowledge engineering
Classifier selection
Semantics and Knowledge Engineering are key
Have been successful in leveraging the Cognie™ Platform to develop custom text analytics engines in multiple domains
Thank you