
Page 1: Information Extraction

Information Extraction

Ruben Izquierdo [email protected]

http://rubenizquierdobevia.com

Page 2: Information Extraction

Text Mining Course
• 1) Introduction to Text Mining
• 2) Introduction to NLP
• 3) Named Entity Recognition and Disambiguation
• 4) Opinion Mining and Sentiment Analysis
• 5) Information Extraction
• 6) NewsReader and Visualisation
• 7) Guest Lecture and Q&A

Page 3: Information Extraction

Outline
1. What is Information Extraction
2. Main goals of Information Extraction
3. Information Extraction Tasks and Subtasks
4. MUC conferences
5. Main domains of Information Extraction
6. Methods for Information Extraction
   o Cascaded finite-state transducers
   o Regular expressions and patterns
   o Supervised learning approaches
   o Weakly supervised and unsupervised approaches
7. How far we are with IE

Page 4: Information Extraction

What is IE?
• Emerged in the late 1970s within the NLP field

• Automatically find and extract a limited set of relevant parts from texts

• Merge information from many pieces of text

Page 5: Information Extraction

What is IE?
• Quite often applied in specialized domains

• Move from unstructured/semi-structured data to structured data
o Schemas
o Relations (as in a database)
o Knowledge bases
o RDF triples

Page 6: Information Extraction

What is IE? Unstructured text

• Natural language sentences
• Historically, NLP systems have been designed to process this type of data
• The meaning → linguistic analysis and natural language understanding

Page 7: Information Extraction

What is IE? Semi-structured text

• The physical layout helps the interpretation
• Processing sits halfway: linguistic features ↔ positional features

Page 8: Information Extraction

What is IE?

Page 9: Information Extraction

Main goals of IE
• Fill a predefined “template” from raw text

• Extract who did what to whom, and when
o Event extraction

• Organize information so that it is useful to people

• Put information in a form that allows further inferences by computers
o Big data

Page 10: Information Extraction

IE. Tasks & Subtasks
• Named Entity Recognition
o Detection: Mr. Smith eats bitterballen → [Mr. Smith] : ENTITY
o Classification: Mr. Smith eats bitterballen → [Mr. Smith] : PERSON

• Event extraction
o The thief broke the door with a hammer
• CAUSE_HARM → Verb: break; Agent: the thief; Patient: the door; Instrument: a hammer

• Coreference resolution
o [Mr. Smith] eats bitterballen. Besides this, [he] only drinks Belgian beer.

Page 11: Information Extraction

IE. Tasks & Subtasks
• Relationship extraction
o Bill works for IBM → PERSON works for ORGANISATION

• Terminology extraction
o Finding relevant (often multi-word) terms in a given corpus

• Some concrete examples
o Extracting earnings, profits, board members and headquarters from company reports
o Searching the WWW for e-mail addresses for advertising (spamming)
o Learning drug-gene product interactions from biomedical research papers

Page 12: Information Extraction

IE Tasks & Subtasks
• Apple Mail

Page 13: Information Extraction

MUC conferences
• Message Understanding Conference (MUC), held between 1987 and 1998

• Domain-specific texts + training examples + template definition

• Precision, Recall and F1 as evaluation measures

• Domains
o MUC-1 (1987), MUC-2 (1989): Naval operations messages
o MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries
o MUC-5 (1993): Joint ventures and microelectronics domain
o MUC-6 (1995): News articles on management changes
o MUC-7 (1998): Satellite launch reports

Page 14: Information Extraction

MUC conferences

Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

Example from MUC-5

Page 15: Information Extraction

Main domains of IE
• Terrorist events

•  Joint ventures

•  Plane crashes

•  Disease outbreaks

•  Seminar announcements

•  Biological and medical domain

Page 16: Information Extraction

Outline
1. What is Information Extraction
2. Main goals of Information Extraction
3. Information Extraction Tasks and Subtasks
4. MUC conferences
5. Main domains of Information Extraction
6. Methods for Information Extraction
   o Cascaded finite-state transducers
   o Regular expressions and patterns
   o Supervised learning approaches
   o Weakly supervised and unsupervised approaches
7. How far we are with IE

Page 17: Information Extraction

Methods for IE
• Cascaded finite-state transducers
o Rule based
o Regular expressions

• Learning-based approaches
o Traditional classifiers
• Naïve Bayes, MaxEnt, SVM …
o Sequence labelling models
• HMM, CMM, CRF

•  Unsupervised approaches

•  Hybrid approaches

Page 18: Information Extraction

Cascaded finite-state transducers

•  Emerging idea from MUC participants and approaches

•  Decompose the task into small sub-tasks

• One element is read at a time from a sequence
o Depending on its type, a certain transition is produced in the automaton to a new state

o  Some states are considered final (the input matches a certain pattern)

•  Can be defined as a regular expression

Page 19: Information Extraction

Cascaded finite-state transducers

Finite automaton for noun groups (a toy sketch in code follows)

⇒ John’s interesting book with a nice cover
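To make the idea concrete, here is a minimal Python sketch of a finite automaton that accepts simple noun groups over POS-tag sequences; the states, tags and transitions are simplified assumptions for illustration, not the exact automaton from the slide.

```python
# Minimal finite-state recognizer for noun groups over POS tags.
# States and transitions are simplified assumptions.
NOUN_GROUP = {
    # state -> {POS tag: next state}
    "START": {"DT": "MOD", "PRP$": "MOD", "JJ": "MOD", "NN": "HEAD"},
    "MOD":   {"JJ": "MOD", "NN": "HEAD"},
    "HEAD":  {"NN": "HEAD"},   # allow noun-noun compounds
}
FINAL = {"HEAD"}               # reaching a head noun means the input matches

def is_noun_group(pos_tags):
    """Return True if the POS-tag sequence is accepted by the automaton."""
    state = "START"
    for tag in pos_tags:
        state = NOUN_GROUP.get(state, {}).get(tag)
        if state is None:      # no transition defined: reject
            return False
    return state in FINAL

print(is_noun_group(["PRP$", "JJ", "NN"]))  # "John's interesting book" -> True
print(is_noun_group(["JJ", "JJ"]))          # no head noun -> False
```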

Page 20: Information Extraction

Cascaded finite-state transducers

• Earlier stages recognize smaller linguistic objects
o Usually domain independent

• Later stages build on top of the previous ones
o Usually domain dependent

• Typical IE system stages:
1. Complex words
2. Basic phrases
3. Complex phrases
4. Domain events
5. Merging structures

Page 21: Information Extraction

Cascaded finite-state transducers

• Complex words
o Multiwords: “set up”, “trading house”
o NE: “Bridgestone Sports Co”

•  Basic Phrases o  Syntactic chunking

•  Noun groups (head noun + all modifiers) •  Verb groups

Page 22: Information Extraction

Cascaded finite-state transducers

Page 23: Information Extraction

Cascaded finite-state transducers

•  Complex phrases o  Complex noun and verb groups on the basis of syntactic information

•  The attachment of appositives to their head noun group o  “The joint venture, Bridgestone Sports Taiwan Co.,”

•  The construction of measure phrases o  “20,000 iron and ‘metal wood’ clubs a month”

Page 24: Information Extraction

Cascaded finite-state transducers

• Domain events
o Recognize events and match them with “fillers” detected in previous steps
o Requires domain-specific patterns
• To recognize phrases of interest
• To define what the roles are
o Patterns can also be defined as finite-state machines or regular expressions (a sketch follows below)
• <Company/ies> <Set-up> <Joint-Venture> with <Company/ies>
• <Company> <Capitalized> at <Currency>
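A minimal Python sketch of one such domain-event pattern, applied to text in which earlier stages have already marked up the phrases of interest. The inline tag format (<COMPANY>…</COMPANY> and so on) is an assumption made for illustration, not the notation of an actual MUC system.

```python
import re

# Output of earlier (chunking/NE) stages, with phrases already marked up.
tagged = ("<COMPANY>Bridgestone Sports Co.</COMPANY> <SETUP>set up</SETUP> "
          "<JV>a joint venture</JV> with <COMPANY>a local concern</COMPANY>")

# <Company/ies><Set-up><Joint-Venture> with <Company/ies>
TIE_UP = re.compile(
    r"<COMPANY>(?P<partner1>.+?)</COMPANY>\s*"
    r"<SETUP>.+?</SETUP>\s*"
    r"<JV>.+?</JV>\s+with\s+"
    r"<COMPANY>(?P<partner2>.+?)</COMPANY>"
)

m = TIE_UP.search(tagged)
if m:
    print("TIE-UP between:", m.group("partner1"), "and", m.group("partner2"))
```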

Page 25: Information Extraction

Cascaded finite-state transducers

Page 26: Information Extraction

Regular Expressions
• 1950s, Stephen Kleene
• A string pattern that describes/matches a set of strings

• A regular expression consists of:
o Characters
o Operation symbols
• Boolean (and/or)
• Grouping (for defining scopes)
• Quantification

Page 27: Information Extraction

Regular Expressions

Character : Description
a      : The character ‘a’
.      : Any single character
[abc]  : Any character in the brackets (OR): ‘a’ or ‘b’ or ‘c’
[^abc] : Any character not in the brackets: any symbol that is not ‘a’, ‘b’ or ‘c’
*      : Quantifier. Matches the preceding element ZERO or more times
+      : Quantifier. Matches the preceding element ONE or more times
?      : Matches the preceding element zero or one time
|      : Choice (OR). Matches the expression before or after the |

Page 28: Information Extraction

Regular Expressions
① .at → ???

Page 29: Information Extraction

Regular Expressions
① .at → hat cat bat xat …
② [hc]at → hat cat
③ [^b]at → all matched by .at but “bat”
④ [^hc]at → all matched by .at but “hat” and “cat”
⑤ s.* → s sssss ssbsd2ck3e

Page 30: Information Extraction

Regular Expressions
① .at → hat cat bat xat …
② [hc]at → hat cat
③ [^b]at → all matched by .at but “bat”
④ [^hc]at → all matched by .at but “hat” and “cat”
⑤ s.* → s sssss ssbsd2ck3e
⑥ [hc]*at → hat cat hhat chat cchhat at …
⑦ cat|dog → cat dog
⑧ …
⑨ …
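The answers are easy to verify with any regex engine; a quick check with Python's re module:

```python
import re

for pattern, text in [
    (r".at",     "hat cat bat xat"),
    (r"[hc]at",  "hat cat bat"),
    (r"[^b]at",  "hat cat bat"),
    (r"cat|dog", "my cat saw a dog"),
]:
    print(pattern, "->", re.findall(pattern, text))
# .at     -> ['hat', 'cat', 'bat', 'xat']
# [hc]at  -> ['hat', 'cat']
# [^b]at  -> ['hat', 'cat']
# cat|dog -> ['cat', 'dog']
```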

Page 31: Information Extraction

Using Regular Expressions

• Typically, extracting information from automatically generated webpages is easy
o Wikipedia
• To know the country for a given city
o Amazon webpages
• From a list of hits
o Weather forecast webpages
o DBpedia

Page 32: Information Extraction
Page 33: Information Extraction
Page 34: Information Extraction

Using Regular Expressions

Page 35: Information Extraction

Using Regular Expressions

•  Some “unstructured” pieces of information keep some structure and are easy to capture by means of regular expressions o  Phone numbers

o  What else?

o  …

o  ...

Page 36: Information Extraction

Using Regular Expressions

• Some “unstructured” pieces of information keep some structure and are easy to capture by means of regular expressions
o Phone numbers
o E-mails
o URLs / websites
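A sketch of what such patterns might look like in Python. These are deliberately simplified assumptions (robust e-mail and URL grammars are much more involved), and the sample contact details are invented:

```python
import re

PHONE = re.compile(r"\+?\d[\d\s\-]{6,}\d")          # rough international format
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+") # user@domain.tld
URL   = re.compile(r"https?://\S+")                 # http(s) links only

text = ("Write to [email protected] or call +1 555 010 9999; "
        "slides at http://rubenizquierdobevia.com")
print(PHONE.findall(text))  # ['+1 555 010 9999']
print(EMAIL.findall(text))  # ['[email protected]']
print(URL.findall(text))    # ['http://rubenizquierdobevia.com']
```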

Page 37: Information Extraction

Using Regular Expressions

• Also used to detect relations and fill events

• Higher-level regular expressions make use of “objects” detected by lower-level patterns

• Some NLP information may help (PoS tags, phrases, semantic word categories)
o Crime-Victim can use things matched by “noun-group”
• Prefiller: [pos: V, type-of-verb: KILL] (WordNet, MCR)
• Filler: [phrase: NOUN-GROUP]

Page 38: Information Extraction

Using Regular Expressions

• Extracting relations between entities
o Which PERSON holds what POSITION in what ORGANIZATION
• [PER], [POSITION] of [ORG]

Entities:
PER: Jose Mourinho
POSITION: trainer
ORG: Chelsea

Relation: (Jose Mourinho, trainer, Chelsea)

Page 39: Information Extraction

Using Regular Expressions

• Extracting relations between entities (a sketch follows below)
o Which PERSON holds what POSITION in what ORGANIZATION
• [PER], [POSITION] of [ORG]
• [ORG] (named, appointed, …) [PER] Prep [POSITION]
o Nokia has appointed Rajeev Suri as President

o Where an ORGANIZATION is located
• [ORG] headquarters in [LOC]
o NATO headquarters in Brussels
• [ORG] [LOC] (division, branch, headquarters, …)
o KFOR Kosovo headquarters
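A minimal Python sketch of such relation patterns, run over sentences whose entities a NER step has already tagged inline. The tag format and the pattern set are illustrative assumptions:

```python
import re

PATTERNS = [
    # [PER], [POSITION] of [ORG]
    (re.compile(r"<PER>(.+?)</PER>,\s*(\w+)\s+of\s+<ORG>(.+?)</ORG>"),
     "HOLDS-POSITION"),
    # [ORG] headquarters in [LOC]
    (re.compile(r"<ORG>(.+?)</ORG>\s+headquarters\s+in\s+<LOC>(.+?)</LOC>"),
     "LOCATED-IN"),
]

sentences = [
    "<PER>Jose Mourinho</PER>, trainer of <ORG>Chelsea</ORG>",
    "<ORG>NATO</ORG> headquarters in <LOC>Brussels</LOC>",
]
for s in sentences:
    for rx, relation in PATTERNS:
        m = rx.search(s)
        if m:
            print(relation, m.groups())
# HOLDS-POSITION ('Jose Mourinho', 'trainer', 'Chelsea')
# LOCATED-IN ('NATO', 'Brussels')
```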

Page 40: Information Extraction

Extracting relations with patterns

• Hearst 1992

• What does Gelidium mean?

• “Αγαρ ισ α συβστανχε πρεπαρεδ φροµ α µιξτυρε οφ ρεδ αλγαε, συχη ασ Gelidium, φορ λαβορατορψ ορ ινδυστριαλ υσε” (the sentence deliberately shown in Greek letters)

Page 41: Information Extraction

Extracting relations with patterns

•  Hearst 1992

•  What does Gelidium mean?

•  “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”

•  How do you know?

Page 42: Information Extraction

Extracting relations with patterns

• Hearst 1992: Automatic Acquisition of Hyponyms (IS-A)

X → Gelidium (sub-type)
Y → red algae (super-type)
X IS-A Y

• “Y such as X”
• “Y, such as X”
• “X or other Y”
• “X and other Y”
• “Y including X”
• …
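A minimal Hearst-pattern extractor in Python. This sketch matches raw strings, whereas the original method matches over NP chunks, so the word-level groups here are a simplifying assumption:

```python
import re

PATTERNS = [
    # "Y(,) such as X"  =>  X IS-A Y
    (re.compile(r"(\w+(?: \w+)?),? such as (\w+)"), (2, 1)),
    # "X and other Y"   =>  X IS-A Y
    (re.compile(r"(\w+) and other (\w+(?: \w+)?)"), (1, 2)),
]

texts = [
    "Agar is a substance prepared from a mixture of red algae, "
    "such as Gelidium, for laboratory or industrial use",
    "bruises, broken bones and other injuries",
]
for text in texts:
    for rx, (hypo, hyper) in PATTERNS:
        for m in rx.finditer(text):
            print(m.group(hypo), "IS-A", m.group(hyper))
# Gelidium IS-A red algae
# bones IS-A injuries
```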

Page 43: Information Extraction

Extracting relations with patterns

Page 44: Information Extraction

Hand-built patterns
• Positive
o Tend to be high-precision
o Can be adapted to specific domains

• Negative
o Human-built patterns are usually low-recall
o A lot of work to think of all possible patterns
o Need to create a lot of patterns for every relation

Page 45: Information Extraction

Learning-based Approaches

• Statistical techniques and machine learning algorithms
o Automatically learn patterns and models for new domains

• Some types
o Supervised learning of patterns and rules
o Supervised learning for relation extraction
o Supervised learning of sequential classifier methods
o Weakly supervised and unsupervised approaches

Page 46: Information Extraction

Supervised Learning of Patterns and Rules

• Aiming to reduce the knowledge engineering bottleneck of creating an IE system for a new domain

• AutoSlog and PALKA → the first IE pattern learning systems
o AutoSlog: syntactic templates, lexico-syntactic patterns and manual review

• Learning algorithms → generate rules from annotated text
o LIEP (Huffman 1996): syntactic paths, role fillers. Patterns that work well on the training data are kept
o (LP)² uses tagging rules and correction rules

Page 47: Information Extraction

Supervised Learning of Patterns and Rules

• Relational learning methods
o RAPIER: rules with pre-filler, filler, and post-filler components. Each component is a pattern that consists of words, POS tags, and semantic classes.

Page 48: Information Extraction

Supervised Learning for relation extraction (I)

• Design a supervised machine learning framework

• Decide what relations we are interested in

• Choose what entities are relevant

• Find (or create) labeled data
o Representative corpus
o Label the entities in the corpus (automatic NER)
o Hand-label relations between these entities
o Split into train + dev + test

•  Train, improve and evaluate

Page 49: Information Extraction

Supervised Learning for relation extraction (II)

• Relation extraction as a classification problem
• 2 classifiers (a sketch follows below)
o To decide if two entities are related
o To decide the class for a pair of related entities

• Why 2?
o Faster training by eliminating most pairs
o Appropriate feature sets for each task

• Find all pairs of NEs (restricted to the sentence)
o For every pair:
1. Are the entities related? (classifier 1)
• No → END
• Yes → guess the class (classifier 2)
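A toy sketch of the two-classifier cascade using scikit-learn. The feature dictionaries and labels below are invented for illustration; realistic features are listed on the next slides.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: one feature dict per candidate entity pair.
X_pairs   = [{"between": "works for"}, {"between": "and"},
             {"between": "headquarters in"}, {"between": "or"}]
y_related = [1, 0, 1, 0]                      # classifier 1 labels
X_rel     = [x for x, r in zip(X_pairs, y_related) if r == 1]
y_type    = ["WORKS_FOR", "LOCATED_IN"]       # classifier 2 labels

detector = make_pipeline(DictVectorizer(), LogisticRegression()).fit(X_pairs, y_related)
typer    = make_pipeline(DictVectorizer(), LogisticRegression()).fit(X_rel, y_type)

def extract(pair_features):
    """Cheap binary filter first; only surviving pairs get typed."""
    if detector.predict([pair_features])[0] == 1:   # classifier 1: related?
        return typer.predict([pair_features])[0]    # classifier 2: which class?
    return None                                     # not related -> END

print(extract({"between": "works for"}))  # WORKS_FOR (on this toy data)
```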

Page 50: Information Extraction

Supervised Learning for relation extraction (III)

•  Are the two entities related? •  What is the type of relation?

Page 51: Information Extraction

Supervised Learning for relation extraction (IV)

“[American Airlines], a unit of AMR, immediately matched the move, spokesman [Tim Wagner] said”
• What features? (a sketch follows below)
o Head words of entity mentions, and their combination
• Airlines, Wagner, Airlines-Wagner
o Bag of words in the two entity mentions
• American, Airlines, Tim, Wagner, American Airlines, Tim Wagner
o Words/bigrams in particular positions to the left and right
• M2 at −1: spokesman; M2 at +1: said
o Bag of words (or bigrams) between the two mentions
• a, AMR, of, immediately, matched, move, spokesman, the, unit
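A sketch of how these word-level features could be computed for the American Airlines / Tim Wagner pair; the span representation and the exact feature names are assumptions:

```python
def mention_pair_features(tokens, m1, m2):
    """Word-level features for one candidate pair; m1 and m2 are
    (start, end) token spans, with m1 before m2."""
    (s1, e1), (s2, e2) = m1, m2
    return {
        "head_m1":   tokens[e1 - 1],               # last token as head heuristic
        "head_m2":   tokens[e2 - 1],
        "head_pair": tokens[e1 - 1] + "-" + tokens[e2 - 1],
        "m2_prev":   tokens[s2 - 1],                               # M2 at -1
        "m2_next":   tokens[e2] if e2 < len(tokens) else "<END>",  # M2 at +1
        "bow_between": " ".join(sorted(set(tokens[e1:s2]))),
    }

tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said").split()
print(mention_pair_features(tokens, (0, 2), (14, 16)))
# head_pair: 'Airlines-Wagner', m2_prev: 'spokesman', m2_next: 'said', ...
```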

Page 52: Information Extraction

Supervised Learning for relation extraction (V)

“[American Airlines], a unit of AMR, immediately matched the move, spokesman [Tim Wagner] said”
• What features?
o Named entity types
• M1: ORG, M2: PERSON
o Entity level (Name, Nominal (NP), Pronoun)
• M1: NAME (“it” or “he” would be PRONOUN)
• M2: NAME (“the company” would be NOMINAL)
o Basic chunk sequence from one entity to the other
• NP NP PP VP NP NP
o Constituency path on the parse tree
• NP ↑ NP ↑ S ↑ S ↓ NP

Page 53: Information Extraction

Supervised Learning for relation extraction (VI)

“[American Airlines], a unit of AMR, immediately matched the move, spokesman [Tim Wagner] said”
• What features?
• Trigger lists
o For family → parent, wife, husband… (WordNet)

• Gazetteers
o Lists of countries…

• …

Page 54: Information Extraction

Supervised Learning for relation extraction (VII)

• Decide on your algorithm
o MaxEnt, Naïve Bayes, SVM

• Train the system on the training data

• Tune it on the dev set

• Test on the evaluation set
o Traditional Precision, Recall and F-score

Page 55: Information Extraction

Sequential Classifier Methods

• IE as a classification problem using sequential learning models

• A classifier is induced from annotated data; it scans the text from left to right and decides whether each piece of text must be extracted or not

•  Decide what you want to extract

•  Represent the annotated data in a proper way

Page 56: Information Extraction

Sequential Classifier Methods

Page 57: Information Extraction

Sequential Classifier Methods

• Typical steps for training (an IOB sketch follows below)
o Get the annotated training data
o Represent the data in IOB format
o Design feature extractors
o Decide which algorithm to use
o Train the models

• Testing steps
o Get the test documents
o Extract features
o Run the sequence models
o Extract the recognized entities
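The IOB step is mechanical; a minimal sketch of how annotated spans become per-token tags (the example sentence and label are from the earlier slides):

```python
def to_iob(tokens, entities):
    """entities: (start, end, label) token spans; end is exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = "B-" + label            # B- opens an entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label            # I- continues it
    return list(zip(tokens, tags))

tokens = ["Mr.", "Smith", "eats", "bitterballen"]
print(to_iob(tokens, [(0, 2, "PER")]))
# [('Mr.', 'B-PER'), ('Smith', 'I-PER'), ('eats', 'O'), ('bitterballen', 'O')]
```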

Page 58: Information Extraction

Sequential Classifier Methods

• Algorithms
o HMM
o CMM
o CRF

• Features
o Words (current, previous, next)
o Other linguistic information (PoS, chunks…)
o Task-specific features (NER…)
• Word shapes: abstract representations of words

Page 59: Information Extraction

Sequential Classifier Methods

• Algorithms
o HMM
o SVM
o CRF

• Features
o Words (current, previous, next)
o Other linguistic information (PoS, chunks…)
o Task-specific features (NER…)
• Word shapes: abstract representations of words (a sketch follows below)
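A sketch of a common word-shape feature: map characters to classes and optionally collapse repeats, so rare tokens still share a feature value. The exact mapping is an assumption; implementations vary.

```python
import re

def shape(word, collapse=True):
    s = re.sub(r"[A-Z]", "X", word)      # uppercase letters -> X
    s = re.sub(r"[a-z]", "x", s)         # lowercase letters -> x
    s = re.sub(r"\d", "d", s)            # digits -> d
    if collapse:
        s = re.sub(r"(.)\1+", r"\1", s)  # squeeze runs: Xxxxx -> Xx
    return s

for w in ["Bridgestone", "MUC-7", "1990"]:
    print(w, shape(w, collapse=False), shape(w))
# Bridgestone Xxxxxxxxxxx Xx | MUC-7 XXX-d X-d | 1990 dddd d
```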

Page 60: Information Extraction

Weakly supervised and unsupervised

• Manual annotation is also “expensive”
o IE is quite domain-specific → little reuse

• AutoSlog-TS:
o Just needs 2 sets of documents: relevant/irrelevant
o Syntactic templates + relevance according to the relevant set

• ExDisco (Yangarber et al. 2000)
o No need for a preclassified corpus
o Uses a small set of patterns to decide relevant/irrelevant

Page 61: Information Extraction

Weakly supervised and unsupervised

• OpeNER: a European project dealing with entity recognition, sentiment analysis and opinion mining, mainly in hotel reviews (also restaurants, attractions, news)

• Double propagation
o A method to automatically gather opinion words and targets
• From a large raw hotel corpus
• Providing a set of seeds and patterns

Page 62: Information Extraction

Weakly supervised and unsupervised

• Seed list
• + → good, nice
• − → bad, ugly

• Patterns
• a [EXP] [TAR]
• the [EXP] [TAR]

• Polarity patterns
• = (same polarity): [EXP] and [EXP]; [EXP], [EXP]
• ! (opposite polarity): [EXP] but [EXP]

Page 63: Information Extraction

Weakly supervised and unsupervised

• Propagation method (a toy sketch follows below)
o 1) Get new targets using the seed expressions and the patterns
• a nice [TAR], a bad [TAR], the ugly [TAR]
• Output → new targets (hotel, room, location)
o 2) Get new expressions using the previous targets and the patterns
• a [EXP] hotel, the [EXP] location
• Output → new expressions (expensive, cozy, perfect…)
o Keep running 1 and 2 to get new EXP and TAR
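A toy sketch of this propagation loop in Python. The corpus, the seeds and the handling of the "a/the [EXP] [TAR]" pattern are heavily simplified assumptions:

```python
corpus = ["a nice hotel", "a bad room", "the ugly location",
          "a cozy hotel", "the expensive location"]
expressions = {"good", "nice", "bad", "ugly"}   # seed opinion words
targets = set()

def new_targets():
    """Step 1: 'a/the [EXP] [TAR]' with a known EXP yields a new TAR."""
    return {last for p in corpus
            for det, mid, last in [p.split()] if mid in expressions}

def new_expressions():
    """Step 2: 'a/the [?] [TAR]' with a known TAR yields a new EXP."""
    return {mid for p in corpus
            for det, mid, last in [p.split()] if last in targets}

while True:                      # alternate steps 1 and 2 until a fixpoint
    grew = False
    for found, known in [(new_targets(), targets),
                         (new_expressions(), expressions)]:
        if not found <= known:
            known |= found
            grew = True
    if not grew:
        break

print(sorted(targets))      # ['hotel', 'location', 'room']
print(sorted(expressions))  # the seeds plus 'cozy' and 'expensive'
```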

Page 64: Information Extraction

Weakly supervised and unsupervised

• Polarity guessing
o Apply the polarity patterns to guess the polarity
• = a nice(+) and cozy(?) → cozy(+)
• ! clean(+) but expensive(?) → expensive(−)

https://github.com/opener-project/opinion-domain-lexicon-acquisition

Page 65: Information Extraction

Outline
1. What is Information Extraction
2. Main goals of Information Extraction
3. Information Extraction Tasks and Subtasks
4. MUC conferences
5. Main domains of Information Extraction
6. Methods for Information Extraction
   o Cascaded finite-state transducers
   o Regular expressions and patterns
   o Supervised learning approaches
   o Weakly supervised and unsupervised approaches
7. How far we are with IE

Page 66: Information Extraction

How good is IE

Page 67: Information Extraction

How good is IE
• Some progress has been made
• Still, the barrier of 60% seems difficult to break
• Most errors are on entities and event coreference
• Propagation of errors
o Entity recognition → ~90%
o One event → 4 entities
o 0.9⁴ ≈ 0.66 → ~60%

• A lot of knowledge is implicit or “common world knowledge”

Page 68: Information Extraction

How good is IE

Information Type : Accuracy
Entities   : 90–98%
Attributes : 80%
Relations  : 60–70%
Events     : 50–60%

• Very optimistic numbers for well-established tasks
• The numbers go down for specific/new tasks

Page 69: Information Extraction

Information Extraction

Ruben Izquierdo [email protected]

http://rubenizquierdobevia.com