TRANSCRIPT
NICTA Copyright 2014
Text, Knowledge, and Information Extraction
Lizhen Qu
A Bit About Myself
• PhD: Databases and Information Systems Group (MPII)
  – Advisors: Prof. Gerhard Weikum and Prof. Rainer Gemulla
  – Thesis: “Sentiment Analysis with Limited Training Data”
• Now: machine learning group at NICTA, adjunct research fellow at ANU.
Macquarie
News about Macquarie Bank
Negative News about Macquarie Bank
Simple Math Problem
Bob has 15 apples. He gives 9 to Sarah. How many apples does Bob have now?
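Answering even this toy problem requires first extracting the quantities and the operation from the text. A minimal sketch of that idea (the `solve_give_away` helper and its regex pattern are hypothetical, not from the talk):

```python
import re

def solve_give_away(problem: str) -> int:
    """Toy solver for 'has N ... gives M' word problems.

    Extracts the two quantities in order and subtracts; a real system
    would also have to recognise the operation from the verb ('gives').
    """
    numbers = [int(n) for n in re.findall(r"\d+", problem)]
    start, given = numbers[0], numbers[1]
    return start - given

print(solve_give_away(
    "Bob has 15 apples. He gives 9 to Sarah. How many apples does Bob have now?"
))  # prints 6
```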
Information Extraction
• Named entity recognition
• Named entity disambiguation
• Relation extraction
Knowledge Bases (Open Linked Data)
Entity Graph
Economic Graph
OpenIE (Ollie, Reverb)
(Bob_Dylan, compose, Like_a_Rolling_Stone)
(The_Dark_Knight, directedBy, Christopher_Nolan)
YAGO
#classes: 350,000
#entities: 10 million
#facts: 120 million
#languages: 10
DBpedia
#classes: 735
#entities: 38.3 million
#triples: 6.9 billion
#languages: 128
Freebase
#entities: 50 million
#facts: 3 billion
#languages: almost 70
Construct YAGO from (Semi-)Structured Data
13
IE Challenge: Ambiguity of Natural Language
I made her duck.
i. I cooked waterfowl for her.
ii. I cooked waterfowl belonging to her.
iii. I created the duck she owns.
iv. I caused her to quickly lower her head or body.
v. I waved my magic wand and turned her into a waterfowl.
Named Entity Recognition
Research at Stanford led to a search engine company, founded by Page and Brin.
TASK: label each word with its entity type (O = not part of a named entity):
Research/O at/O Stanford/ORG led/O to/O a/O search/O engine/O company/O ,/O founded/O by/O Page/PER and/O Brin/PER ./O

Machine Learning Problem: sequence labelling.
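Once a sequence labeller has tagged every token, the entity mentions are the maximal runs of tokens sharing a non-O label. A minimal sketch of that decoding step (the `extract_mentions` helper is illustrative, not from the talk):

```python
# Toy illustration: recover entity mentions from per-token labels
# produced by a sequence labeller.
tokens = ["Research", "at", "Stanford", "led", "to", "a", "search",
          "engine", "company", ",", "founded", "by", "Page", "and", "Brin", "."]
labels = ["O", "O", "ORG", "O", "O", "O", "O",
          "O", "O", "O", "O", "O", "PER", "O", "PER", "O"]

def extract_mentions(tokens, labels):
    """Group consecutive tokens that share a non-O label into mentions."""
    mentions, current, current_label = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab == current_label and lab != "O":
            current.append(tok)
        else:
            if current:
                mentions.append((" ".join(current), current_label))
            current, current_label = ([tok], lab) if lab != "O" else ([], None)
    if current:
        mentions.append((" ".join(current), current_label))
    return mentions

print(extract_mentions(tokens, labels))
# [('Stanford', 'ORG'), ('Page', 'PER'), ('Brin', 'PER')]
```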
Learning and Prediction
Labeled sentences (have labels) → feature extraction → train models
Unlabeled sentences (no labels) → feature extraction → prediction
Feature Extraction
• Use features to represent each word.
• Vectorise feature representations.
Features of "Stanford": w-2 = Research, w-1 = at, w0 = Stanford, w+1 = led, w+2 = to, POS = noun, capitalized? = true
Features of "search": w-2 = to, w-1 = a, w0 = search, w+1 = engine, w+2 = company, POS = noun, capitalized? = false

Vectorised (one binary dimension per feature):
w-2 = research → 1, capitalized → 1, w0 = stanford → 1, w0 = search → 0, …
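The two steps above, window features and their binary vectorisation, can be sketched in a few lines (the helper names and the tiny feature vocabulary are illustrative, not from the talk):

```python
def word_features(tokens, i):
    """Window and shape features for token i, as on the slide."""
    feats = {
        f"w-2={tokens[i-2].lower()}" if i >= 2 else "w-2=<pad>",
        f"w-1={tokens[i-1].lower()}" if i >= 1 else "w-1=<pad>",
        f"w0={tokens[i].lower()}",
        f"w+1={tokens[i+1].lower()}" if i + 1 < len(tokens) else "w+1=<pad>",
        f"w+2={tokens[i+2].lower()}" if i + 2 < len(tokens) else "w+2=<pad>",
    }
    if tokens[i][0].isupper():
        feats.add("capitalized")
    return feats

def vectorise(feats, vocabulary):
    """One binary dimension per known feature."""
    return [1 if f in feats else 0 for f in vocabulary]

tokens = "Research at Stanford led to".split()
feats = word_features(tokens, 2)          # features of "Stanford"
vocab = ["w-2=research", "capitalized", "w0=stanford", "w0=search"]
print(vectorise(feats, vocab))            # prints [1, 1, 1, 0]
```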
Standard Model: Conditional Random Fields
• Assigns a local score to each (word, label) pair.
• Joint inference finds the best label sequence.
CRF: p(y|x) = exp[ Σ_{t=1..T} Σ_i λ_i f_i(y_{t-1}, y_t, x_t) ] / Z(x)
Stanford NER [1]: 86%
Best system [8]: 89%
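The CRF probability above can be made concrete with a tiny brute-force version: score every label sequence with weighted feature functions, then normalise by Z. The two indicator features and their weights below are invented for illustration; real CRFs use dynamic programming, not enumeration.

```python
from itertools import product
from math import exp

LABELS = ["O", "PER"]

def features(y_prev, y, x):
    # Two hypothetical indicator feature functions f_i(y_{t-1}, y_t, x_t).
    return [
        1.0 if (x.istitle() and y == "PER") else 0.0,  # capitalised word -> PER
        1.0 if (y_prev == "O" and y == "O") else 0.0,  # O tends to follow O
    ]

WEIGHTS = [2.0, 1.0]  # the lambda_i

def score(x_seq, y_seq):
    """Unnormalised log-score: sum_t sum_i lambda_i * f_i(y_{t-1}, y_t, x_t)."""
    total, y_prev = 0.0, "O"  # fixed start state
    for x, y in zip(x_seq, y_seq):
        total += sum(w * f for w, f in zip(WEIGHTS, features(y_prev, y, x)))
        y_prev = y
    return total

def p(y_seq, x_seq):
    """p(y|x) = exp(score) / Z, with Z computed by brute-force enumeration."""
    Z = sum(exp(score(x_seq, y)) for y in product(LABELS, repeat=len(x_seq)))
    return exp(score(x_seq, y_seq)) / Z

x = ["founded", "by", "Page"]
print(p(("O", "O", "PER"), x))  # the most probable of the 8 label sequences
```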
Named Entity Disambiguation
Research at Stanford led to a search engine company, founded by Page and Brin.
TASK: link each mention to a knowledge-base entity:
Stanford → Stanford University, Page → Larry Page, Brin → Sergey Brin
AIDA-light [2]
First Stage
Second Stage
AIDA-light [2]: 84.8%
DBpedia Spotlight: 75%
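The core idea behind systems like AIDA-light can be illustrated with a toy disambiguator: rank each mention's candidate entities by a mention-entity prior plus keyword overlap with the sentence context. The candidate table, priors, and keyword sets below are invented for illustration; this is not the actual AIDA-light algorithm.

```python
# Hypothetical candidate dictionary: mention -> candidate entities with
# a popularity prior and a few context keywords each.
CANDIDATES = {
    "Page": {
        "Larry_Page": {"prior": 0.3, "keywords": {"google", "search", "engine", "brin"}},
        "Jimmy_Page": {"prior": 0.5, "keywords": {"guitar", "led", "zeppelin"}},
    }
}

def disambiguate(mention, context_words):
    """Pick the candidate maximising prior + context keyword overlap."""
    context = {w.lower() for w in context_words}
    def score(entity):
        info = CANDIDATES[mention][entity]
        return info["prior"] + len(info["keywords"] & context)
    return max(CANDIDATES[mention], key=score)

sentence = ("Research at Stanford led to a search engine company "
            "founded by Page and Brin").split()
print(disambiguate("Page", sentence))  # prints Larry_Page
```

Note that the context contains "led", which weakly supports Jimmy_Page, but "search", "engine", and "brin" together outweigh it.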
Relation Extraction
• Relation mention extraction.
• Expand knowledge bases.
Research at Stanford led to a search engine company, founded by Page and Brin.
Which relations hold between ORG: Stanford_University, PER: Larry_Page, and PER: Sergey_Brin?
Examples of missing facts: (Larry Page, ?, Stanford University), (The Dark Knight, ?, Christopher Nolan)
Relation Mention Extraction
• Multi-class classification.
• Example features of a pair of entity mentions [3].
Research at Stanford led to a search engine company, founded by Page and Brin.
Features of the pair (Stanford, Page):
• words between: led, to, a, search, engine, company, founded, by
• named entity types: (ORG, PER)
• number of mentions between: 0
• …

F-Measure on ACE: 71.2% [3]
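The mention-pair features above can be computed directly from token spans. A small sketch (the `pair_features` helper and the span encoding are illustrative, not the feature extractor of [3]):

```python
def pair_features(tokens, mentions, m1, m2):
    """m1, m2: (start, end, type) token spans of the two entity mentions."""
    lo, hi = m1[1], m2[0]
    # Words strictly between the two mentions, punctuation dropped.
    between = [t for t in tokens[lo:hi] if t.isalnum()]
    # Other entity mentions falling entirely between the pair.
    n_between = sum(1 for m in mentions
                    if m not in (m1, m2) and lo <= m[0] and m[1] <= hi)
    return {
        "words_between": between,
        "entity_types": (m1[2], m2[2]),
        "mentions_between": n_between,
    }

tokens = ("Research at Stanford led to a search engine company , "
          "founded by Page and Brin .").split()
mentions = [(2, 3, "ORG"), (12, 13, "PER"), (14, 15, "PER")]
print(pair_features(tokens, mentions, mentions[0], mentions[1]))
# {'words_between': ['led', 'to', 'a', 'search', 'engine', 'company',
#   'founded', 'by'], 'entity_types': ('ORG', 'PER'), 'mentions_between': 0}
```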
Expand Knowledge Base
• Multi-instance, multi-label learning [4,5].
• Distant supervision.
Freebase provides the relation-level label for the pair (Larry Page, Sergey Brin).
Mention 1: Research at Stanford led to a search engine company, founded by Page and Brin. (mention-level label?)
Mention 2: Larry Page and Sergey Brin explained why they just created Alphabet. (mention-level label?)

MAP [3]: 56%
MAP [4]: 66%
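The distant-supervision setup groups all sentences mentioning an entity pair into one "bag" that inherits the pair's relation labels from the knowledge base, while the per-sentence (mention-level) labels remain latent. A minimal sketch of that data construction (the dictionary layout is hypothetical; the "co-founders" label is the one shown later for this pair):

```python
from collections import defaultdict

# Relation-level labels from the knowledge base.
KB = {("Larry_Page", "Sergey_Brin"): {"co-founders"}}

sentences = [
    (("Larry_Page", "Sergey_Brin"),
     "Research at Stanford led to a search engine company, founded by Page and Brin."),
    (("Larry_Page", "Sergey_Brin"),
     "Larry Page and Sergey Brin explained why they just created Alphabet."),
]

bags = defaultdict(lambda: {"labels": set(), "mentions": []})
for pair, sentence in sentences:
    bags[pair]["labels"] |= KB.get(pair, set())   # relation-level labels
    bags[pair]["mentions"].append(sentence)       # mention-level labels stay latent

bag = bags[("Larry_Page", "Sergey_Brin")]
print(bag["labels"], len(bag["mentions"]))  # prints {'co-founders'} 2
```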
Open Information Extraction
• Extract triples of any relation from the web [6].
• Optional: link triples to knowledge bases.
Sentence: It was exactly 50 years ago today that Bob Dylan walked into Studio A at Columbia Records in New York and recorded "Like a Rolling Stone".
Extracted triple: (“Bob Dylan”, “record”, “Like a rolling stone”)
Linked to the knowledge base: (Bob_Dylan, record, Like_a_Rolling_Stone)

F1 [6]: 19.6%
F1 [9]: 28.3%
Harvest Domain-Specific Knowledge
• Deep learning.
  – Learn cross-domain features.
  – Minimize training data.
• Transfer learning.
source domain: newswire → target domain: nurse handovers
Word Representation
• One-hot representation.
• Distributed representation.
One-hot representation:
stanford   [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
university [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
oxford     [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
conference [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
talk       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

Distributed representation: each word maps to a dense low-dimensional vector, e.g. stanford = [0.01, 0.3, -0.5, 0.6], and similarly for university, oxford, conference, and talk.
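The practical difference shows up in similarity: any two distinct one-hot vectors are orthogonal, while dense vectors give a graded notion of similarity. A small sketch (the dense vector for oxford is invented for illustration; the stanford vector is the one on the slide, truncated to 4 dimensions):

```python
import math

one_hot = {"stanford": [0, 0, 1, 0], "oxford": [1, 0, 0, 0]}
dense = {"stanford": [0.01, 0.3, -0.5, 0.6], "oxford": [0.05, 0.28, -0.45, 0.55]}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(cosine(one_hot["stanford"], one_hot["oxford"]))  # 0.0: unrelated by construction
print(cosine(dense["stanford"], dense["oxford"]))      # close to 1.0: graded similarity
```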
Distributed Representation
Apply Distributed Representations for NER
Example: compare stanford university and oxford, with labels O, UNI, UNI, O, UNI.
The feature matrix for a word is built from the embeddings of the words at fixed positions around it: the 2nd word to the left, the first word to the left, the current word, the first word to the right, and the 2nd word to the right. Represent words based on positions rather than IDs.
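The position-based input can be sketched as a concatenation of window embeddings (the 3-dimensional embedding table and the `window_features` helper are invented for illustration):

```python
# Hypothetical 3-d embeddings for the slide's example words.
EMB = {
    "<pad>": [0.0, 0.0, 0.0],
    "compare": [0.2, -0.1, 0.4],
    "stanford": [0.01, 0.3, -0.5],
    "university": [0.0, 0.25, -0.4],
    "and": [0.1, 0.0, 0.1],
    "oxford": [0.05, 0.28, -0.45],
}

def window_features(tokens, i, size=2):
    """Embeddings for positions i-size .. i+size, concatenated in order."""
    feats = []
    for j in range(i - size, i + size + 1):
        word = tokens[j] if 0 <= j < len(tokens) else "<pad>"
        feats.extend(EMB.get(word, EMB["<pad>"]))
    return feats

tokens = "compare stanford university and oxford".split()
x = window_features(tokens, 2)   # features for "university"
print(len(x))  # prints 15 (5 positions x 3 dimensions)
```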
Results of Named Entity Recognition [7]
• Reduces the amount of training data needed.
• Differences between word-embedding methods are tiny.
NER for Novel Named Entity Types
• Goals:
  – Minimize labeled training data.
  – Leverage existing resources:
    • Labeled corpora.
    • Unlabeled text.
    • Existing knowledge bases.
source domain: person, organization, location → target domain: doctor, patient, corporation, city, hotel, country
Experimental Results on I2B2
Learn Text Representations for Relations
• Unsupervised pre-training.
• Distant supervision.
Freebase: (Larry Page, Sergey Brin) → co-founders (relation-level label)
Mention 1: Research at Stanford led to a search engine company, founded by Page and Brin. (inferred mention-level label)
Mention 2: Larry Page and Sergey Brin explained why they just created Alphabet. (inferred mention-level label)
NICTA Deep Learning for IE Toolkit
• A fully integrated deep learning toolkit for NLP.
  – Pipelines include both NLP preprocessing and DL components.
  – Written in Scala/Java.
  – Easy to write new ML components.
  – Reuses UIMA NLP components.
• Scalable.
  – Easy to switch between GPUs and CPUs.
  – Learning on GPUs.
  – Makes use of UIMA for prediction.
References
• [1] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. "Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling." Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005).
• [2] Nguyen, Dat Ba, et al. "AIDA-light: High-throughput named-entity disambiguation." Linked Data on the Web at WWW 2014 (2014).
• [3] Chan, Yee Seng, and Dan Roth. "Exploiting background knowledge for relation extraction." Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010.
• [4] Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, Christopher D. Manning. “Multi-instance Multi-label Learning for Relation Extraction.” Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing and Natural Language Learning, 2012.
• [5] Riedel, Sebastian, et al. "Relation extraction with matrix factorization and universal schemas." Proceedings of NAACL-HLT 2013.
• [6] Schmitz, Michael, et al. "Open language learning for information extraction." Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012.
• [7] Qu, Lizhen, et al. "Big Data Small Data, In Domain Out-of Domain, Known Word Unknown Word: The Impact of Word Representation on Sequence Labelling Tasks." arXiv preprint arXiv:1504.05319 (2015).
• [8] Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853
• [9] Angeli, Gabor, Melvin Johnson Premkumar, and Christopher D. Manning. "Leveraging Linguistic Structure For Open Domain Information Extraction." Proceedings of ACL 2015.
Resources
• YAGO: http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago
• DBpedia: http://wiki.dbpedia.org/
• Alchemy: http://querybuilder.alchemyapi.com/builder
• Deep learning: http://www.deeplearning.net/
• Word2vec: https://code.google.com/p/word2vec/
• Mallet (Java): http://mallet.cs.umass.edu/
• Factorie (Scala): http://factorie.cs.umass.edu/
• Stanford CoreNLP: http://nlp.stanford.edu:8080/corenlp/
• NLP conferences.
  – ACL, EMNLP, COLING, NAACL, EACL …
• NLP online courses.
  – https://www.coursera.org/course/nlangp
  – https://www.youtube.com/playlist?list=PL6397E4B26D00A269
• ML online courses.
  – https://www.coursera.org/course/ml
  – https://www.coursera.org/course/neuralnets
  – http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial