dwar ev ceremoniously soldered the final connection with gold. the eyes of a dozen television...
TRANSCRIPT
![Page 1: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/1.jpg)
Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the universe a dozen pictures of what he was doing.
He straightened and nodded to Dwar Reyn, then moved to a position beside the switch that would complete the contact when he threw it. The switch that would connect, all at once, all of the monster computing machines of all the populated planets in the universe - ninety-six billion planets - into the supercircuit that would connect them all into one supercalculator, one cybernetics machine that would combine all the knowledge of all the galaxies.
Dwar Reyn spoke briefly to the watching and listening trillions. Then after a moment’s silence he said, “Now, Dwar Ev.”
Dwar Ev threw the switch. There was a mighty hum, the surge of power from ninety-six billion planets. Lights flashed and quieted along the miles-long panel.
Dwar Ev stepped back and drew a deep breath. “The honour of asking the first questions is yours, Dwar Reyn.”
“Thank you,” said Dwar Reyn. “It shall be a question which no single cybernetics machine has been able to answer.”
He turned to face the machine. “Is there a God ?”
The mighty voice answered without hesitation, without the clicking of a single relay.“Yes, now there is a god.”
Sudden fear flashed on the face of Dwar Ev. He leaped to grab the switch.
A bolt of lightning from the cloudless sky struck him down and fused the switch shut.
‘Answer’ by Fredric Brown.©1954, Angels and Spaceships
Information Extraction: 10-707 and 11-748
![Page 2: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/2.jpg)
Information Extraction: 10-707 and 11-748
• Instructor and stuff: – William Cohen (wcohen@cs, Wean 5317)
• Assistant: Sharon Cavlovich (sharonw@cs)
– TA: Vitor Carvalho (vitor@cs)
• Web page:– http://www.cs.cmu.edu/~wcohen/10-707/– Approximate but fairly detailed through spring
break
• Tues-Thus 12-1:20pm, Wean 4615a– Go ahead and eat! – No class Thus 1/25
![Page 3: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/3.jpg)
Information Extraction: 10-707 and 11-748
• Prerequisite: – Machine learning or consent of William
• Grading:– Do a project, in groups of 2-3.
• Typical example: new algorithm on an old dataset, or old algorithm on a new dataset.
• Write it up as a conference paper and present (as poster?)• Your idea doesn’t have to actually work, or be novel.
– Do the readings and participate in class.• Short critique of each assigned paper: ~= 500 words, one thing you
liked and/or one thing you didn’t like, and why.• Start week of Jan 30th.• Presentation of at least one paper (one of the suggested “optional”
papers, or one I approve)
![Page 4: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/4.jpg)
Information Extraction: 10-707 and 11-748
• What is covered?– What is information extraction?
• “(ML Approaches to) Extracting Structured Information from Text”• “Learning How to Turn Words into Data”
– Applications:• Web info extraction: building catalogs, directories, etc from web sites• Biotext info extraction: extracting facts like regulates(CDC23,TNF-1b)• Question-answering: answering Q’s like “who invented the light bulb?”• ….
– Techniques:• Named entity recognition: finding names in text
– …– Graphical models for classifying sequences of tokens
• Extracting facts (aka events, relationships) – classifying pairs of extractions• Normalizing extracted data – classifying pairs of extractions• Semi- and unsupervised approaches to finding information from large corpora (aka
bookstrapping – “read the web” like techniques
• Today:– Admin, motivation– A brief overview of IE, and a less brief overview of named entity recognition
![Page 5: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/5.jpg)
Motivation: Why bother with IE?
![Page 6: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/6.jpg)
Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the universe a dozen pictures of what he was doing.
He straightened and nodded to Dwar Reyn, then moved to a position beside the switch that would complete the contact when he threw it. The switch that would connect, all at once, all of the monster computing machines of all the populated planets in the universe - ninety-six billion planets - into the supercircuit that would connect them all into one supercalculator, one cybernetics machine that would combine all the knowledge of all the galaxies.
Dwar Reyn spoke briefly to the watching and listening trillions. Then after a moment’s silence he said, “Now, Dwar Ev.”
Dwar Ev threw the switch. There was a mighty hum, the surge of power from ninety-six billion planets. Lights flashed and quieted along the miles-long panel.
Dwar Ev stepped back and drew a deep breath. “The honour of asking the first questions is yours, Dwar Reyn.”
“Thank you,” said Dwar Reyn. “It shall be a question which no single cybernetics machine has been able to answer.”
He turned to face the machine. “Is there a God ?”
The mighty voice answered without hesitation, without the clicking of a single relay.“Yes, now there is a god.”
Sudden fear flashed on the face of Dwar Ev. He leaped to grab the switch.
A bolt of lightning from the cloudless sky struck him down and fused the switch shut.
‘Answer’ by Fredric Brown.©1954, Angels and Spaceships
![Page 7: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/7.jpg)
The incredibly rapid growth and increasing pervasiveness of the Internet brings to mind a piece of science fiction, a short story, that I read many years ago in the days when UNIVACs and enormous IBM mainframes represented the popular image of computers. In the story, […]
While such an exaggerated view of the power of networked computers may now seem charmingly quaint, it is in fact not all that far beyond some of the wilder claims that one hears for the future of the global information superhighway, of which the Internet is widely regarded as the prototype. And if those claims are exaggerated, they are at least reflective of the extent to which this technology has caught hold of the popular and commercial imagination.
Technology and the Future, 7th EditionAlbert H. Teich, editor
© 1996
![Page 8: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/8.jpg)
Some observations
• In the distant future:– Complex AI systems are completed by ceremonially soldering
the final connection, not ceremonially compiling the last Java class
– Performance is monitored by clicking relays– A “lightning-from-a-cloudless-sky” peripheral exists
• Writing and debugging device drivers is a dangerous and highly skilled profession
– Question-answering interfaces are still in use• Natural-language query in, answer out
– Answering (some) complex questions requires combining information from many different places
• With different parts contributed by different people?
![Page 9: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/9.jpg)
Two ways to manage information
Xxx xxxx xxxx xxx xxx xxx xx xxxx xxxx xxx
Xxx xxxx xxxx xxx xxx xxx xx xxxx xxxx xxx
Xxx xxxx xxxx xxx xxx xxx xx xxxx xxxx xxx
Xxx xxxx xxxx xxx xxx xxx xx xxxx xxxx xxx
Xxx xxxx xxxx xxx xxx xxx xx xxxx xxxx xxx
Xxx xxxx xxxx xxx xxx xxx xx xxxx xxxx xxx
Xxx xxxx xxxx xxx xxx xxx xx xxxx xxxx xxx
Xxx xxxx xxxx xxx xxx xxx xx xxxx xxxx xxx
retrieval
Query Answer Query Answer
advisor(wc,vc)advisor(yh,tm)
affil(wc,mld)affil(vc,lti)
fn(wc,``William”)fn(vc,``Vitor”)
Xxx xxxx xxxx xxx xxx xxx xx xxxx xxxx xxx
inference
“ceremonial soldering” X:advisor(wc,Y)&affil(X,lti) ? {X=em; X=vc}
ANDIE
![Page 10: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/10.jpg)
Some observations
• Using computers to combine information from multiple places is and has been important…
![Page 11: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/11.jpg)
Some observations
• Using computers to merge information is and has been important…– Data cleaning and integration, record linkage, …– Standards for data exchange:
• KQML, KIF, DAML+OIL, …• Semantic web: N3Logic, OWL, …
– Friend-of-a-friend, GeneOntology, ….– Growth from 456 OWL ontologies in 2004 to 14,600 in 2007
• Number of web pages estimated at 11.5B as of early 2006– #webPages/#ontologies =~ 1,000,000 ?– #webSites/#ontologies =~ 10,000 ?
– It seems to be much easier to generate sharable text than to generate sharable knowledge.
– A lot of accessible knowledge is only accessible in text
![Page 12: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/12.jpg)
How do you extract information?
[Cohen / McCallum tutorial, NIPS 2002, KDD 2003, …]
[Some pilfering from Tom Mitchell’s invited talks]
![Page 13: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/13.jpg)
What is “Information Extraction”
Filling slots in a database from sub-segments of text.As a task:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME TITLE ORGANIZATION
![Page 14: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/14.jpg)
What is “Information Extraction”
Filling slots in a database from sub-segments of text.As a task:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME TITLE ORGANIZATIONBill Gates CEO MicrosoftBill Veghte VP MicrosoftRichard Stallman founder Free Soft..
IE
![Page 15: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/15.jpg)
What is “Information Extraction”
Information Extraction = segmentation + classification + clustering + association
As a familyof techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation
aka “named entity extraction”
![Page 16: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/16.jpg)
What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering
As a familyof techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation
![Page 17: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/17.jpg)
What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering
As a familyof techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation
![Page 18: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/18.jpg)
What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering
As a familyof techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation N
AME
TITLE ORGANIZATION
Bill Gates
CEO
Microsoft
Bill Veghte
VP
Microsoft
Richard Stallman
founder
Free Soft..
*
*
*
*
![Page 19: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/19.jpg)
Example: Finding Jobs Ads on the Web
Martin Baker, a person
Genomics job
Employers job posting form
![Page 20: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/20.jpg)
Example: A Solution
![Page 21: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/21.jpg)
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
![Page 22: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/22.jpg)
Job
Op
enin
gs:
Cat
ego
ry =
Fo
od
Ser
vice
sK
eyw
ord
= B
aker
L
oca
tio
n =
Co
nti
nen
tal
U.S
.
![Page 23: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/23.jpg)
Data Mining the Extracted Job Information
![Page 24: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/24.jpg)
![Page 25: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/25.jpg)
Notice that we get something useful from just identifying the person names and then doing some counting and trending
![Page 26: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/26.jpg)
![Page 27: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/27.jpg)
![Page 28: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/28.jpg)
![Page 29: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/29.jpg)
Landscape of IE Tasks (1/4):Degree of Formatting
Text paragraphswithout formatting
Grammatical sentencesand some formatting & links
Non-grammatical snippets,rich formatting & links Tables
Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
![Page 30: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/30.jpg)
Landscape of IE Tasks (2/4):Intended Breadth of Coverage
Web site specific Genre specific Wide, non-specific
Amazon.com Book Pages Resumes University Names
Formatting Layout Language
![Page 31: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/31.jpg)
Landscape of IE Tasks (3/4):Complexity
Closed set
He was born in Alabama…
Regular set
Phone: (413) 545-1323
Complex pattern
University of ArkansasP.O. Box 140Hope, AR 71802
…was among the six houses sold by Hope Feldman that year.
Ambiguous patterns,needing context andmany sources of evidence
The CALD main office can be reached at 412-268-1299
The big Wyoming sky…
U.S. states U.S. phone numbers
U.S. postal addresses
Person names
Headquarters:1128 Main Street, 4th FloorCincinnati, Ohio 45210
Pawel Opalinski, SoftwareEngineer at WhizBang Labs.
E.g. word patterns:
![Page 32: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/32.jpg)
Landscape of IE Tasks (4/4):Single Field/Record
Single entity
Person: Jack Welch
Binary relationship
Relation: Person-TitlePerson: Jack WelchTitle: CEO
N-ary record
“Named entity” extraction
Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
Relation: Company-LocationCompany: General ElectricLocation: Connecticut
Relation: SuccessionCompany: General ElectricTitle: CEOOut: Jack WelshIn: Jeffrey Immelt
Person: Jeffrey Immelt
Location: Connecticut
![Page 33: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/33.jpg)
A little more depth on named entity recognition (NER)
![Page 34: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/34.jpg)
Models for NER
Lexicons
AlabamaAlaska…WisconsinWyoming
Abraham Lincoln was born in Kentucky.
member?
Classify Pre-segmentedCandidates
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Sliding Window
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Try alternatewindow sizes:
Boundary Models
Abraham Lincoln was born in Kentucky.
Classifier
which class?
BEGIN END BEGIN END
BEGIN
Token Tagging
Abraham Lincoln was born in Kentucky.
Most likely state sequence?
This is often treated as a structured prediction problem…classifying tokens sequentially
HMMs, CRFs, ….
![Page 35: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/35.jpg)
Sliding Windows
![Page 36: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/36.jpg)
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g.Looking forseminarlocation
![Page 37: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/37.jpg)
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g.Looking forseminarlocation
![Page 38: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/38.jpg)
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g.Looking forseminarlocation
![Page 39: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/39.jpg)
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g.Looking forseminarlocation
![Page 40: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/40.jpg)
A “Naïve Bayes” Sliding Window Model[Freitag 1997]
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrunw t-m w t-1 w t w t+n w t+n+1 w t+n+m
prefix contents suffix
If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
… …
Estimate Pr(LOCATION|window) using Bayes rule
Try all “reasonable” windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
![Page 41: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/41.jpg)
A “Naïve Bayes” Sliding Window Model
1. Create dataset of examples like these:+(prefix00,…,prefixColon, contentWean,contentHall,….,suffixSpeaker,…)
- (prefixColon,…,prefixWean,contentHall,….,ContentSpeaker,suffixColon,….)
…
2. Train a NaiveBayes classifier (or YFCL), treating the examples like BOWs for text classification
3. If Pr(class=+|prefix,contents,suffix) > threshold, predict the content window is a location.
• To think about: what if the extracted entities aren’t consistent, eg if the location overlaps with the speaker?
[Freitag 1997]
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrunw t-m w t-1 w t w t+n w t+n+1 w t+n+m
prefix contents suffix
… …
![Page 42: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/42.jpg)
“Naïve Bayes” Sliding Window Results
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
Domain: CMU UseNet Seminar Announcements
Field F1 Person Name: 30%Location: 61%Start Time: 98%
![Page 43: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/43.jpg)
Token Tagging
![Page 44: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/44.jpg)
NER by tagging tokens
Yesterday Pedro Domingos flew to New York.
Yesterday Pedro Domingos flew to New York
Person name: Pedro Domingos Location name: New York
Given a sentence:
2) Identify names based on the entity labels
person name
location name
background
1) Break the sentence into tokens, and classify each token with a label indicating what sort of entity it’s part of:
3) To learn an NER system, use YFCL.
![Page 45: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/45.jpg)
NER by tagging tokens
Yesterday Pedro Domingos flew to New York
person name
location name
background
Another common labeling scheme is BIO (begin, inside, outside; e.g. beginPerson, insidePerson, beginLocation, insideLocation, outside)
BIO also leads to strong dependencies between nearby labels (eg inside follows begin)
Similar labels tend to cluster together in text
![Page 46: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/46.jpg)
NER with Hidden Markov Models
Yesterday Pedro Domingos spoke this example sentence.
Yesterday Pedro Domingos spoke this example sentence.
Person name: Pedro Domingos
Given a sequence of observations:
and a trained HMM:
Find the most likely state sequence: (Viterbi)
Any words said to be generated by the designated “person name”state extract as a person name:
),(maxarg osPs
person name
location name
background
![Page 47: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/47.jpg)
HMM for Segmentation of Addresses
• Simplest HMM Architecture: One state per entity type
CA 0.15
NY 0.11
PA 0.08
… …
Hall 0.15
Wean 0.03
N-S 0.02
… …
[Pilfered from Sunita Sarawagi, IIT/Bombay]
![Page 48: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/48.jpg)
HMMs for Information Extraction
1. The HMM consists of two probability tables• Pr(currentState=s|previousState=t) for s=background, location, speaker, • Pr(currentWord=w|currentState=s) for s=background, location, …
2. Estimate these tables with a (smoothed) CPT• Prob(location|location) = #(loc->loc)/#(loc->*) transitions
3. Given a new sentence, find the most likely sequence of hidden states using Viterbi method:
MaxProb(curr=s|position k)=
Maxstate t MaxProb(curr=t|position=k-1) * Prob(word=wk-1|t)*Prob(curr=s|prev=t)
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun… …
![Page 49: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/49.jpg)
“Naïve Bayes” Sliding Window vs HMMs
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
Domain: CMU UseNet Seminar Announcements
Field F1 Speaker: 30%Location: 61%Start Time: 98%
Field F1 Speaker: 77%Location: 79%Start Time: 98%
![Page 50: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/50.jpg)
What is a “symbol” ???
Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ?
5317 => “5317”, “9999”, “9+”, “number”, … ?
000.. . . .999
3 -d ig i ts
00000 .. . .99999
5 -d ig i ts
0 ..99 0000 ..9999 000000 ..
O th e rs
N u m b e rs
A .. ..z
C h a rs
a a ..
M u lt i -le tte r
W o rds
. , / - + ? #
D e lim ite rs
A ll
Datamold: choose best abstraction level using holdout set
![Page 51: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/51.jpg)
HMM Example: “Nymble”
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]
Task: Named Entity Extraction
Train on ~500k words of news wire text.
Case Language F1 .Mixed English 93%Upper English 91%Mixed Spanish 90%
[Bikel, et al 1998], [BBN “IdentiFinder”]
Person
Org
Other
(Five other name classes)
start-of-sentence
end-of-sentence
Transitionprobabilities
Observationprobabilities
P(st | st-1, ot-1 ) P(ot | st , st-1 )
Back-off to: Back-off to:
P(st | st-1 )
P(st )
P(ot | st , ot-1 )
P(ot | st )
P(ot )
or
Results:
![Page 52: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/52.jpg)
What is a symbol?
Bikel et al mix symbols from two abstraction levels
![Page 53: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/53.jpg)
What is a symbol?
Ideally we would like to use many, arbitrary, overlapping features of words.
St -1
St
Ot
St+1
Ot +1
Ot -1
identity of wordends in “-ski”is capitalizedis part of a noun phraseis in a list of city namesis under node X in WordNetis in bold fontis indentedis in hyperlink anchor…
…
…part of
noun phrase
is “Wisniewski”
ends in “-ski”
Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …
![Page 54: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/54.jpg)
What is a symbol?
St -1
St
Ot
St+1
Ot +1
Ot -1
identity of wordends in “-ski”is capitalizedis part of a noun phraseis in a list of city namesis under node X in WordNetis in bold fontis indentedis in hyperlink anchor…
…
…part of
noun phrase
is “Wisniewski”
ends in “-ski”
Idea: replace generative model in HMM with a maxent model, where state depends on observations
...)|Pr( tt xs
![Page 55: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/55.jpg)
What is a symbol?
St -1
St
Ot
St+1
Ot +1
Ot -1
identity of wordends in “-ski”is capitalizedis part of a noun phraseis in a list of city namesis under node X in WordNetis in bold fontis indentedis in hyperlink anchor…
…
…part of
noun phrase
is “Wisniewski”
ends in “-ski”
Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state
...),|Pr( ,1 ttt sxs
![Page 56: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/56.jpg)
What is a symbol?
St -1 S
t
Ot
St+1
Ot +1
Ot -1
identity of wordends in “-ski”is capitalizedis part of a noun phraseis in a list of city namesis under node X in WordNetis in bold fontis indentedis in hyperlink anchor…
…
…part of
noun phrase
is “Wisniewski”
ends in “-ski”
Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state history
......),|Pr( ,2,1 tttt ssxs
![Page 57: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/57.jpg)
Ratnaparkhi’s MXPOST
• Sequential learning problem: predict POS tags of words.
• Uses MaxEnt model described above.
• Rich feature set.• To smooth, discard features
occurring < 10 times.
![Page 58: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/58.jpg)
Conditional Markov Models (CMMs) aka MEMMs aka Maxent Taggers vs HMMS
St-1 St
Ot
St+1
Ot+1Ot-1
...
i
iiii sossos )|Pr()|Pr(),Pr( 11
St-1 St
Ot
St+1
Ot+1Ot-1
...
i
iii ossos ),|Pr()|Pr( 11
![Page 59: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/59.jpg)
HMMs vs MEMM vs CRF
HMM MEMM CRF
![Page 60: Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the](https://reader030.vdocuments.site/reader030/viewer/2022032804/56649e495503460f94b3d1cd/html5/thumbnails/60.jpg)
Some things to think about
• We’ve seen sliding windows, non-sequential token tagging, and sequential token tagging.– Which of these are likely to work best, and when?– Are there other ways to formulate NER as a learning task?– Is there a benefit from using more complex graphical
models? What potentially useful information does a linear-chain CRF not capture?
– Can you combine sliding windows with a sequential model?
• Next lecture will survey IE of sets of related entities (e.g., person and his/her affiliation).– How can you formalize that as a learning task?