Post on 21-Dec-2015
![Page 1: Introduction to Information Extraction Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan chia@csie.ncu.edu.tw](https://reader035.vdocuments.site/reader035/viewer/2022062308/56649d585503460f94a3813a/html5/thumbnails/1.jpg)
Introduction to Information Extraction
Chia-Hui Chang
Dept. of Computer Science and Information Engineering, National Central University, Taiwan
chia@csie.ncu.edu.tw
Problem Definition
• Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form.
• Input → extractor → structured output
• The output template of the IE task
   • Several fields (slots)
   • Several instances of a field
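The output template described above can be sketched as a small data structure. This is only an illustration: the class name `Template`, the slot names, and the sample values are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Template:
    """An IE output template: each slot (field) holds zero or more instances."""
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, slot: str, value: str) -> None:
        # A slot may be filled several times, so values accumulate in a list.
        self.slots.setdefault(slot, []).append(value)


# Hypothetical extraction result with one single-valued and one multi-valued slot.
result = Template()
result.add("person", "Alice Example")
result.add("organization", "Example Corp")
result.add("organization", "Sample Ltd")
print(result.slots)
```

Storing a list per slot keeps "several instances of a field" explicit, and a slot that was never filled is simply absent from the dictionary.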
The difficulty of an IE task depends on …
• Text type
   • From plain text to semi-structured Web pages
   • e.g. Wall Street Journal articles, email messages, or HTML documents
• Domain
   • From financial news or tourist information to texts in various languages
• Scenario
Various IE Tasks
• Free-text IE:
   • For MUC (Message Understanding Conference)
   • E.g. terrorist activities, corporate joint ventures
• Semi-structured IE:
   • E.g. meta-search engines, shopping agents, bio-integration systems
Types of IE from MUC
• Named Entity recognition (NE)
   • Finds and classifies names, places, etc.
• Coreference Resolution (CO)
   • Identifies identity relations between entities in texts.
• Template Element construction (TE)
   • Adds descriptive information to NE results.
• Scenario Template production (ST)
   • Fits TE results into specified event scenarios.
Named Entity Recognition
http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_3.html
NE Recognition (Cont.)
• Spanish: 93%
• Japanese: 92%
• Chinese: 84.51%
Coreference Resolution
• Coreference resolution (CO) involves identifying identity relations between entities in texts.
• For example, in
   Alas, poor Yorick, I knew him well.
  tie "Yorick" with "him".
• The Sheffield system scored 51% recall and 71% precision.
http://www.cs.nyu.edu/cs/faculty/grishman/COtask21.book_4.html
Template Element Production
• Adds descriptive information to named entities
• The Sheffield system scores 71%
Scenario Template Extraction
• STs are the prototypical outputs of IE systems.
• They tie together TE entities into event and relation descriptions.
• Performance for Sheffield: 49%
http://www.cs.nyu.edu/cs/faculty/grishman/IEtask15.book_2.html
Example
The operational domains that users' interests center on are drug enforcement, money laundering, organized crime, terrorism, ….
1. Input: texts dealing with drug enforcement, money laundering, organized crime, terrorism, and legislation;
2. NE: recognizes entities in those texts and assigns them to one of a number of categories drawn from the set of entities of interest (person, company, . . . );
3. TE: associates certain types of descriptive information with these entities, e.g. the location of companies;
4. ST: identifies a set (relatively small to begin with) of events of interest by tying entities together into event relations.
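The four steps above can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the gazetteer, the sample sentence, and the string-matching rules are hypothetical stand-ins for the far richer analysis a real MUC-style system performs.

```python
from dataclasses import dataclass, field

# Hypothetical gazetteer: surface string -> entity category.
GAZETTEER = {
    "Acme Corp": "company",
    "John Doe": "person",
    "Lisbon": "location",
}


@dataclass
class Entity:
    name: str
    category: str
    attributes: dict = field(default_factory=dict)


def named_entities(text):
    """NE step: find gazetteer names in the text and assign their category."""
    return [Entity(name, cat) for name, cat in GAZETTEER.items() if name in text]


def template_elements(text, entities):
    """TE step: attach descriptive information, here a company's location.

    Toy rule (an assumption): '<company>, based in <location>'.
    """
    locations = [e.name for e in entities if e.category == "location"]
    for e in entities:
        if e.category == "company":
            for loc in locations:
                if f"{e.name}, based in {loc}" in text:
                    e.attributes["location"] = loc
    return entities


def scenario_templates(text, entities):
    """ST step: tie entities together into an event relation.

    Toy rule (an assumption): '<person> was charged' plus any company mention.
    """
    people = [e for e in entities if e.category == "person"]
    companies = [e for e in entities if e.category == "company"]
    events = []
    for p in people:
        for c in companies:
            if f"{p.name} was charged" in text:
                events.append(
                    {"event": "charge", "person": p.name, "organization": c.name}
                )
    return events


# Step 1: input text (hypothetical).
text = "John Doe was charged with laundering money through Acme Corp, based in Lisbon."
entities = template_elements(text, named_entities(text))
events = scenario_templates(text, entities)
print(entities)
print(events)
```

Each stage consumes the output of the previous one, which mirrors how TE builds on NE results and ST builds on TE results.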
Example Text
Output Example (NE, TE)
Output (STs)
Another IE Example
• Corporate Management Changes
• Purpose
   • which positions in which organizations are changing hands?
   • who is leaving a position and where is the person going?
   • who is appointed to a position and where is the person coming from?
   • the locations and types of the organizations involved in the succession events;
   • the names and titles of the persons involved in the succession events
http://www.cs.umanitoba.ca/~lindek/ie-ex.htm
Input Text
President Clinton nominated John Rollwagen, the chairman and CEO of Cray Research Inc., as the No. 2 Commerce Department official. Mr. Rollwagen said he wants to push the Clinton administration to aggressively confront U.S. trading partners such as Japan to open their markets, particularly for high-tech industries. In a letter sent throughout the Eagan, Minn.-based company on Friday, Mr. Rollwagen warned: "Whether we like it or not, our country is in an economic war; and we are at a key turning point in that war." ......
Cray said it has appointed John F. Carlson, its president and chief operating officer, to succeed him. ......
Extraction Result

Corporate Management Database

| Person | Organization | Position | Transition |
|---|---|---|---|
| John Rollwagen | Cray Research Inc. | chairman | out |
| John Rollwagen | Cray Research Inc. | CEO | out |
| John F. Carlson | Cray Research Inc. | chairman | in |
| John F. Carlson | Cray Research Inc. | CEO | in |

Organization Database

| Name | Location | Alias | Type |
|---|---|---|---|
| Cray Research Inc. | Eagan, Minn. | Cray | COMPANY |
| Commerce Department | | | GOVERNMENT |
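As a rough sketch, the two result tables can be held as typed records and queried to answer the "who is appointed to a position?" question from the task description. The rows are copied from the slide; the class names, field names, and the `successors` helper are hypothetical.

```python
from typing import NamedTuple, Optional


class Succession(NamedTuple):
    person: str
    organization: str
    position: str
    transition: str  # "in" or "out"


class Organization(NamedTuple):
    name: str
    location: Optional[str]  # None when the text gives no value
    alias: Optional[str]
    org_type: str


management = [
    Succession("John Rollwagen", "Cray Research Inc.", "chairman", "out"),
    Succession("John Rollwagen", "Cray Research Inc.", "CEO", "out"),
    Succession("John F. Carlson", "Cray Research Inc.", "chairman", "in"),
    Succession("John F. Carlson", "Cray Research Inc.", "CEO", "in"),
]

organizations = [
    Organization("Cray Research Inc.", "Eagan, Minn.", "Cray", "COMPANY"),
    Organization("Commerce Department", None, None, "GOVERNMENT"),
]


def successors(position, organization):
    """Return the people moving into the given position at the organization."""
    return [
        r.person
        for r in management
        if r.position == position
        and r.organization == organization
        and r.transition == "in"
    ]


print(successors("CEO", "Cray Research Inc."))
```

Empty cells in the Organization table are represented as `None`, which keeps missing data distinct from an empty string.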
MUC Data Set
• MET2: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/met2/met2package.tar.gz
• MUC3&4: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_data/muc34.tar.gz
• MUC6&7 from LDC: http://www.ldc.upenn.edu/
• MUC-6: http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
• MUC-7: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html
Summary
• Evaluation
   • $\text{Precision} = \frac{\text{\# of correctly extracted fields}}{\text{\# of extracted fields}}$
   • $\text{Recall} = \frac{\text{\# of correctly extracted fields}}{\text{\# of fields to be extracted}}$
• Design Methodology for Text IE
   • Natural Language Processing
   • Machine Learning
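A minimal sketch of the two measures, assuming each field is represented as a (slot, value) pair and counted as correct only on an exact match. The function name and the sample sets are hypothetical; the "expected" set is an invented gold standard, not MUC data.

```python
def precision_recall(extracted, expected):
    """Compute precision and recall over sets of (slot, value) fields."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)  # correctly extracted fields
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical gold standard: the fields that should be extracted.
expected = {
    ("person", "John Rollwagen"),
    ("person", "John F. Carlson"),
    ("organization", "Cray Research Inc."),
    ("position", "chairman"),
    ("position", "CEO"),
}

# Hypothetical system output: three correct fields and one spurious one.
extracted = {
    ("person", "John Rollwagen"),
    ("organization", "Cray Research Inc."),
    ("position", "CEO"),
    ("person", "Clinton"),
}

precision, recall = precision_recall(extracted, expected)
print(precision, recall)  # 0.75 0.6
```

Here 3 of the 4 extracted fields are correct (precision 0.75) and 3 of the 5 expected fields were found (recall 0.6).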
IE from Web pages
• Output Template: k-tuple
• Multiple instances of a field
• Missing data
Web data extraction
• Various Web pages
   • Multiple-record page extraction
   • One-record (singular) page extraction
Multiple-record page extraction
One-record (singular) page extraction
Applications
• Information integration
• Meta Search Engines
• Shopping agents
• Travel agents
Information Integration Systems
[Figure: architecture of an information integration service. Labels recoverable from the diagram: heterogeneous data sources (text, images/video, spreadsheets; hierarchical and network databases; relational databases; object and knowledge bases); wrappers (SQL, ORB) for translation and wrapping; mediators for semantic integration and mediation; agent/module coordination; user services (query, monitor, update) for human and computer users; a scale running from unprocessed, unintegrated details to abstracted information.]
Web Wrappers
• What is a wrapper?
   • An extraction program that pulls desired information from Web pages.
   • Web pages → wrapper → structured information
• Web wrappers wrap...
   • "Query-able" or "search-able" Web sites
   • Web pages with large itemized lists
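A hand-written wrapper for a page with an itemized list might look like the sketch below. The HTML snippet, the tag layout, and the field names are all hypothetical; a real site would need its own patterns, and regular expressions are only one of several ways to write a wrapper.

```python
import re

# Hypothetical itemized-list page; the third item has no price (missing data).
PAGE = """
<ul>
  <li><b>Widget A</b> <i>$9.99</i></li>
  <li><b>Widget B</b> <i>$14.50</i></li>
  <li><b>Widget C</b></li>
</ul>
"""

ITEM = re.compile(r"<li>(.*?)</li>", re.S)   # one record per list item
NAME = re.compile(r"<b>(.*?)</b>")           # name is delimited by <b>...</b>
PRICE = re.compile(r"<i>\$(.*?)</i>")        # price is delimited by <i>$...</i>


def wrap(page):
    """Turn the Web page into structured records (a list of dicts)."""
    records = []
    for item in ITEM.findall(page):
        name = NAME.search(item)
        price = PRICE.search(item)
        records.append({
            "name": name.group(1) if name else None,
            "price": float(price.group(1)) if price else None,
        })
    return records


records = wrap(PAGE)
print(records)
```

The wrapper first splits the page into records and then extracts fields inside each record, so a missing field in one item does not shift the values of the next.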
Summary
• Evaluation
   • $\text{Precision} = \frac{\text{\# of correctly extracted records}}{\text{\# of extracted records}}$
   • $\text{Recall} = \frac{\text{\# of correctly extracted records}}{\text{\# of records to be extracted}}$
• Methodology for Web IE
   • Programming package
   • Machine Learning
   • Pattern Mining
Type III: News Group IE
• Example: Computer-Related Jobs
Output Template
Between free-text IE and semi-structured IE [CaliffRapier 99]
Wrapper Induction Systems
• Wrapper induction (WI) or information extraction (IE) systems are software designed to generate wrappers.
• Taxonomy of Web IE systems by
   • Task domain: free text vs. semi-structured pages
   • Automation degree: supervised vs. unsupervised
   • Techniques applied: machine learning vs. pattern mining
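To make "generating a wrapper" concrete, here is a toy supervised sketch that learns a left and a right delimiter from annotated pages. It is a simplified delimiter-learning illustration chosen for this example, not the algorithm of any system named in these slides; the pages, the function names, and the assumption that each labeled value occurs exactly once per page are all hypothetical.

```python
import os
import re


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_wrapper(examples):
    """Learn (left, right) delimiters from (page, labeled_value) pairs."""
    befores, afters = [], []
    for page, value in examples:
        start = page.index(value)  # assumes the value occurs once in the page
        befores.append(page[:start])
        afters.append(page[start + len(value):])
    # The delimiters are the context shared by all training examples.
    return common_suffix(befores), os.path.commonprefix(afters)


def apply_wrapper(wrapper, page):
    """Extract every string found between the learned delimiters."""
    left, right = wrapper
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, page)


# Hypothetical annotated training pages (supervised setting).
training = [
    ("<p>First book</p><b>Price:</b> <i>9.99</i><br>In stock", "9.99"),
    ("<p>Second item</p><b>Price:</b> <i>14.50</i><br>Sold out", "14.50"),
]

wrapper = induce_wrapper(training)
new_page = "<p>Third thing</p><b>Price:</b> <i>3.25</i><br>In stock"
extracted = apply_wrapper(wrapper, new_page)
print(wrapper, extracted)
```

With more varied training pages the learned delimiters shrink toward the context that truly marks the field; with too few examples they overfit, which is one reason the annotation requirement matters when comparing systems.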
Task Domain
• Document type
• Extraction level
   • Field-level, record-level, page-level
• Extraction target variation
   • Missing attributes
   • Multi-valued attributes
   • Multi-order attribute permutations
   • Nested data objects
• Template variation
   • Various templates for an attribute
   • Common templates for various attributes
• Untokenized attributes
Automation Degree
• Page-fetching support
• Annotation requirement
• Output support
• API support
Techniques Applied
• Scan passes
• Extraction rule types
• Learning algorithms
• Tokenization schemes
• Features used
Conclusion
• Define the IE problem
• Specify the input: training examples
   • with annotation, or
   • without annotation
• Depict the extraction rule
• Use necessary background knowledge
References
• H. Cunningham, Information Extraction – a User Guide, http://www.dcs.shef.ac.uk
• MUC-6, http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
• I. Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, The AAAI-99 Workshop on Machine Learning for Information Extraction.
• Califf, Relational Learning of Pattern-Matching Rule for Information Extraction, AAAI-99.