information retrieval
Information Retrieval. March 3, 2003. Handout #5. Course Information. Instructor: Dragomir R. Radev ([email protected]) Office: 3080, West Hall Connector Phone: (734) 615-5225 Office hours: M&F 11-12 Course page: http://tangra.si.umich.edu/~radev/650/
(C) 2003, The University of Michigan 1
Information Retrieval
Handout #5
March 3, 2003
Course Information
• Instructor: Dragomir R. Radev ([email protected])
• Office: 3080, West Hall Connector
• Phone: (734) 615-5225
• Office hours: M&F 11-12
• Course page: http://tangra.si.umich.edu/~radev/650/
• Class meets on Mondays, 1-4 PM in 409 West Hall
The Weka package
Weka
• A general environment for machine learning (e.g. for classification and clustering)
• Book by Witten and Frank
• www.cs.waikato.ac.nz/ml/weka
K-means (continued)
Demos
• http://www.cs.mcgill.ca/~bonnef/project.html
• http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster/
• http://www-2.cs.cmu.edu/~dellaert/software/
• java weka.clusterers.SimpleKMeans -t data/weather.arff
EM algorithm
EM algorithms
[Dempster et al. 77]
• Needed: probabilistic model Θ
• Given estimate Θ0
• Useful in the absence of certain data
• Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. Given a model of data generation and data with some missing values, EM alternately uses the current model to estimate the missing values, and then uses the missing value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the generative parameters giving estimates for the missing values.
[McCallum & Nigam 98]
E-M algorithms
• Initialize probability model
• Repeat
– E-step: use the best available current classifier to classify some datapoints
– M-step: modify the classifier based on the classes produced by the E-step.
• Until convergence
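The E-step/M-step loop above can be illustrated with a minimal sketch: EM for a two-component, one-dimensional Gaussian mixture. The choice of a Gaussian mixture, the function names, the sample data, the starting parameters, and the fixed iteration count (used in place of a convergence test) are all assumptions made for illustration; none of them come from the slides.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given parameters."""
    k = len(means)
    for _ in range(iterations):
        # E-step: use the current model to softly assign each point to a component.
        resp = []
        for x in data:
            scores = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # M-step: re-estimate the model from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the variance from collapsing to zero
            weights[j] = nj / len(data)
    return means, variances, weights

# Hypothetical data: two well-separated groups around 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
```

With data this cleanly separated, the estimated means settle near the two group centers. On harder data the result depends on the starting parameters, since EM only finds a local maximum of the likelihood.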
Demos
• java weka.clusterers.EM -t data/iris.arff
• http://www.neurosci.aist.go.jp/~akaho/MixtureEM.html
• http://www.cs.uic.edu/~liub/S-EM/S-EM-download.html
Question Answering
Q: When did Nelson Mandela become president of South Africa?
A: 10 May 1994
Q: How tall is the Matterhorn?
A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches
Q: How tall is the replica of the Matterhorn at Disneyland?
A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week end for the last 3 1/2 years
Q: If Iraq attacks a neighboring country, what should the US do?
A: ??
Question answering
Q: Why did David Koresh ask the FBI for a word processor?
Q: Name the designer of the shoe that spawned millions of plastic imitations, known as "jellies".
Q: What is the brightest star visible from Earth?
Q: What are the Valdez Principles?
Q: Name a film that has won the Golden Bear in the Berlin Film Festival?
Q: Name a country that is developing a magnetic levitation railway system?
Q: Name the first private citizen to fly in space.
Q: What did Shostakovich write for Rostropovich?
Q: What is the term for the sum of all genetic material in a given organism?
Q: What is considered the costliest disaster the insurance industry has ever faced?
Q: What is Head Start?
Q: What was Agent Orange used for during the Vietnam War?
Q: What did John Hinckley do to impress Jodie Foster?
Q: What was the first Gilbert and Sullivan opera?
Q: What did Richard Feynman say upon hearing he would receive the Nobel Prize in Physics?
Q: How did Socrates die?
Q: Why are electric cars less efficient in the north-east than in California?
The TREC evaluation
• Document retrieval
• Eight years
• Information retrieval?
• Corpus: texts and questions
GuruQA
[Figure: GuruQA architecture. Box labels: documents, Textract, Resporator, Indexer, Index, query, Query Processing, Search, Hit List, AnSel/Werlect, Answer selection, Ranked Hit List.]
Prager et al. 2000 (SIGIR); Radev et al. 2000 (ANLP/NAACL)
QA-Token    Question type        Example
PLACE$      Where                In the Rocky Mountains
COUNTRY$    Where/What country   United Kingdom
STATE$      Where/What state     Massachusetts
PERSON$     Who                  Albert Einstein
ROLE$       Who                  Doctor
NAME$       Who/What/Which       The Shakespeare Festival
ORG$        Who/What             The US Post Office
DURATION$   How long             For 5 centuries
AGE$        How old              30 years old
YEAR$       When/What year       1999
TIME$       When                 In the afternoon
DATE$       When/What date       July 4th, 1776
VOLUME$     How big              3 gallons
AREA$       How big              4 square inches
LENGTH$     How big/long/high    3 miles
WEIGHT$     How big/heavy        25 tons
NUMBER$     How many             1,234.5
METHOD$     How                  By rubbing
RATE$       How much             50 per cent
MONEY$      How much             4 million dollars
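As a rough illustration of how the table could be used, the sketch below maps the opening words of a question to the QA-Tokens listed for that question type. The cue phrases and token lists are transcribed from a subset of the table rows; the prefix-matching strategy and the names `QA_TOKENS` and `candidate_tokens` are assumptions, since the slides do not say how GuruQA performs this mapping.

```python
# Question cue -> candidate QA-Tokens, transcribed from a subset of the table rows.
QA_TOKENS = {
    "how many": ["NUMBER$"],
    "how much": ["RATE$", "MONEY$"],
    "how old": ["AGE$"],
    "how long": ["DURATION$", "LENGTH$"],
    "where": ["PLACE$", "COUNTRY$", "STATE$"],
    "when": ["YEAR$", "TIME$", "DATE$"],
    "who": ["PERSON$", "ROLE$", "NAME$", "ORG$"],
}

def candidate_tokens(question):
    """Return the QA-Tokens whose cue phrase starts the question (longest cue first)."""
    q = question.lower().strip()
    for cue in sorted(QA_TOKENS, key=len, reverse=True):
        if q.startswith(cue):
            return QA_TOKENS[cue]
    return []  # no cue from the table subset applies

tokens = candidate_tokens("When did Nelson Mandela become president of South Africa?")
```

A question whose cue is not in the subset, such as "How tall is the Matterhorn?", returns an empty list under this simplification.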
<p><NUMBER>1</NUMBER></p>
<p><QUERY>Who is the author of the book, "The Iron Lady: A Biography of Margaret Thatcher"?</QUERY></p>
<p><PROCESSED_QUERY>@excwin(*dynamic* @weight(200*Iron_Lady) @weight(200 Biography_of_Margaret_Thatcher) @weight(200 Margaret) @weight(100 author) @weight(100 book) @weight(100 iron) @weight(100 lady) @weight(100 :) @weight(100 biography) @weight(100 thatcher) @weight(400 @syn(PERSON$ NAME$)) )</PROCESSED_QUERY></p>
<p><DOC>LA090290-0118</DOC></p>
<p><SCORE>1020.8114</SCORE></p>
<TEXT><p>THE IRON LADY; A <span class="NAME">Biography of Margaret Thatcher</span> by <span class="PERSON">Hugo Young</span> (<span class="ORG">Farrar , Straus & Giroux</span>) The central riddle revealed here is why, as a woman <span class="PLACEDEF">in a man</span>'s world, <span class="PERSON">Margaret Thatcher</span> evinces such an exclusionary attitude toward women.</p></TEXT>
SYN-set                             N    Score   Score/N
PERSON NAME                         30   16.5    55.0%
PLACE COUNTRY STATE NAME PLACEDEF   21   7.08    33.7%
NAME                                18   3.67    20.4%
DATE YEAR                           18   5.31    29.5%
PERSON ORG NAME ROLE                19   4.62    24.3%
undefined                           19   11.45   60.3%
NUMBER                              18   8.00    44.4%
PLACE NAME PLACEDEF                 14   10.00   71.4%
PERSON ORG PLACE NAME PLACEDEF      10   3.03    30.3%
MONEY RATE                          6    1.50    25%
ORG NAME                            4    1.25    31.2%
SIZE1                               4    2.50    62.5%
SIZE1 DURATION                      3    0.83    27.7%
STATE                               3    2.00    66.7%
COUNTRY                             3    1.33    44.3%
YEAR                                2    1.00    50.0%
RATE                                2    1.50    75.0%
TIME DURATION                       1    0.00    0.0%
SIZE1 SIZE2                         1    0.00    0.0%
DURATION TIME                       1    0.33    33.3%
DATE                                1    0       0.00%
Span                                  Type     Number  Rspanno  Count  Notinq  Type  Avgdst  Sscore    TOTAL
Ollie Matson                          PERSON   3       3        6      2       1     12      0.02507   -7.53
Lou Vasquez                           PERSON   1       1        6      2       1     16      0.02507   -9.93
Tim O'Donohue                         PERSON   17      1        4      2       1     8       0.02257   -12.57
Athletic Director Dave Cowen          PERSON   23      6        4      4       1     11      0.02257   -15.87
Johnny Ceballos                       PERSON   22      5        4      1       1     9       0.02257   -19.07
Civic Center Director Martin Durham   PERSON   13      1        2      5       1     16      0.02505   -19.36
Johnny Hodges                         PERSON   25      2        4      1       1     15      0.02256   -25.22
Derric Evans                          PERSON   33      4        4      2       1     14      0.02256   -25.37
NEWSWIRE Johnny Majors                PERSON   30      1        4      2       1     17      0.02256   -25.47
Woodbridge High School                ORG      18      2        4      1       2     6       0.02257   -28.37
Evan                                  PERSON   37      6        4      1       1     14      0.02256   -29.57
Gary Edwards                          PERSON   38      7        4      2       1     17      0.02256   -30.87
O.J. Simpson                          NAME     2       2        6      2       3     12      0.02507   -37.40
South Lake Tahoe                      NAME     7       5        6      3       3     14      0.02507   -40.06
Washington High                       NAME     10      6        6      1       3     18      0.02507   -49.80
Morgan                                NAME     26      3        4      1       3     12      0.02256   -52.52
Tennesseefootball                     NAME     31      2        4      1       3     15      0.02256   -56.27
Ellington                             NAME     24      1        4      1       3     20      0.02256   -59.42
assistant                             ROLE     21      4        4      1       4     8       0.02257   -62.77
the Volunteers                        ROLE     34      5        4      2       4     14      0.02256   -71.17
Johnny Mathis                         PERSON   4       4        6      -100    1     11      0.02507   -211.33
Mathis                                NAME     14      2        2      -100    3     10      0.02505   -254.16
coach                                 ROLE     19      3        4      -100    4     4       0.02257   -259.67
Features (1)
• Number: position of the span among all spans returned. Example: “Lou Vasquez” was the first span returned by GuruQA on the sample question.
• Rspanno: position of the span among all spans returned within the current passage.
• Count: number of spans of any span class retrieved within the current passage.
• Notinq: the number of words in the span that do not appear in the query. Example: Notinq (“Woodbridge high school”) = 1, because both “high” and “school” appear in the query while “Woodbridge” does not. It is set to –100 when the actual value is 0.
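The Notinq feature can be sketched in a few lines. The whitespace tokenization, the case folding, and the query word list are assumptions: the slides state only that "high" and "school" occur in the sample query, so the list below is hypothetical.

```python
def notinq(span, query_words):
    """Count span words absent from the query; a count of 0 is recoded as -100."""
    query = {w.lower() for w in query_words}
    missing = sum(1 for w in span.split() if w.lower() not in query)
    return missing if missing > 0 else -100

# Hypothetical query words; the full sample question is not given in the slides.
query_words = ["high", "school", "coach"]

woodbridge = notinq("Woodbridge High School", query_words)
coach = notinq("coach", query_words)
```

Here `woodbridge` is 1, as in the slide's example, and `coach` is -100 because every word of that span already occurs in the assumed query.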
Features (2)
• Type: the position of the span type in the list of potential span types. Example: Type (“Lou Vasquez”) = 1, because the span type of “Lou Vasquez”, namely “PERSON” appears first in the SYN-set, “PERSON ORG NAME ROLE”.
• Avgdst: the average distance in words between the beginning of the span and the words in the query that also appear in the passage. Example: given the passage “Tim O'Donohue, Woodbridge High School's varsity baseball coach, resigned Monday and will be replaced by assistant Johnny Ceballos, Athletic Director Dave Cowen said.” and the span “Tim O’Donohue”, the value of avgdst is equal to 8.
• Sscore: passage relevance as computed by GuruQA.
Combining evidence
• TOTAL (span) = – 0.3 * number – 0.5 * rspanno + 3.0 * count + 2.0 * notinq – 15.0 * type – 1.0 * avgdst + 1.5 * sscore
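The linear combination above can be written as a short function. This is a sketch: the weights are copied from the formula, the feature values are the “Tim O'Donohue” row of the span table, and the names `WEIGHTS` and `total_score` are illustrative.

```python
# Weights copied from the TOTAL formula on the slide.
WEIGHTS = {
    "number": -0.3,
    "rspanno": -0.5,
    "count": 3.0,
    "notinq": 2.0,
    "type": -15.0,
    "avgdst": -1.0,
    "sscore": 1.5,
}

def total_score(features):
    """Weighted sum of the seven span features."""
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)

# Feature values from the "Tim O'Donohue" row of the span table.
tim = {"number": 17, "rspanno": 1, "count": 4, "notinq": 2,
       "type": 1, "avgdst": 8, "sscore": 0.02257}

print(round(total_score(tim), 2))  # -12.57, the TOTAL listed for this span
```

The large negative weight on the type feature and the -100 recoding of Notinq explain why spans that merely repeat query words fall to the bottom of the ranking.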
Extracted text
DocumentID      Score   Extract
LA053189-0069   892.5   of O.J. Simpson , Ollie Matson and Johnny Mathis
LA053189-0069   890.1   Lou Vasquez , track coach of O.J. Simpson , Ollie
LA060889-0181   887.4   Tim O'Donohue , Woodbridge High School 's varsity
LA060889-0181   884.1   nny Ceballos , Athletic Director Dave Cowen said.
LA060889-0181   880.9   aced by assistant Johnny Ceballos , Athletic Direc
Results

50 bytes
          First   Second   Third   Fourth   Fifth   TOTAL
# cases   49      15       11      9        4       88
Points    49.00   7.50     3.67    2.25     0.80    63.22

250 bytes
          First   Second   Third   Fourth   Fifth   TOTAL
# cases   71      16       11      6        5       109
Points    71.00   8.00     3.67    1.50     1.00    85.17
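The Points rows appear to follow a reciprocal-rank scheme: each case answered at rank r contributes 1/r points (49/1, 15/2, 11/3, and so on). The slides do not state the formula, so this reading is inferred from the numbers; the sketch below checks that it reproduces both totals. The function name is illustrative.

```python
def reciprocal_rank_points(cases_by_rank):
    """Sum of 1/rank points, given the number of cases answered at ranks 1, 2, 3, ..."""
    return sum(n / rank for rank, n in enumerate(cases_by_rank, start=1))

print(round(reciprocal_rank_points([49, 15, 11, 9, 4]), 2))  # 63.22
print(round(reciprocal_rank_points([71, 16, 11, 6, 5]), 2))  # 85.17
```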
Information Extraction
Types of Information Extraction
• Template filling
• Language reuse
• Biographical information
• Question answering
MUC-4 Example
INCIDENT: DATE                  30 OCT 89
INCIDENT: LOCATION              EL SALVADOR
INCIDENT: TYPE                  ATTACK
INCIDENT: STAGE OF EXECUTION    ACCOMPLISHED
INCIDENT: INSTRUMENT ID
INCIDENT: INSTRUMENT TYPE
PERP: INCIDENT CATEGORY         TERRORIST ACT
PERP: INDIVIDUAL ID             "TERRORIST"
PERP: ORGANIZATION ID           "THE FMLN"
PERP: ORG. CONFIDENCE           REPORTED: "THE FMLN"
PHYS TGT: ID
PHYS TGT: TYPE
PHYS TGT: NUMBER
PHYS TGT: FOREIGN NATION
PHYS TGT: EFFECT OF INCIDENT
PHYS TGT: TOTAL NUMBER
HUM TGT: NAME
HUM TGT: DESCRIPTION            "1 CIVILIAN"
HUM TGT: TYPE                   CIVILIAN: "1 CIVILIAN"
HUM TGT: NUMBER                 1: "1 CIVILIAN"
HUM TGT: FOREIGN NATION
HUM TGT: EFFECT OF INCIDENT     DEATH: "1 CIVILIAN"
HUM TGT: TOTAL NUMBER
On October 30, 1989, one civilian was killed in a reported FMLN attack in El Salvador.
Language reuse
[Figure: the NP “Yugoslav President Slobodan Milosevic”, split into [description] “Yugoslav President” and [entity] “Slobodan Milosevic”, with the label “Phrase to be reused”.]
Example
[Figure: the NP “Andrija Hebrang , The Croatian Defense Minister”, parsed as NP Punc NP, with [entity] “Andrija Hebrang” and [description] “The Croatian Defense Minister”.]
Issues involved
• Text generation depends on lexical resources
• Lexical choice
• Corpus processing vs. manual compilation
• Deliberate decisions by writers
• Difficult to encode by hand
• Dynamically updated (Scott O’Grady)
• No full semantic representation
Named entities
Richard Butler met Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work.
Yitzhak Mordechai will meet Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking.
Sinn Fein deferred a vote on Northern Ireland's peace deal Sunday.
Hundreds of troops patrolled Dili on Friday during the anniversary of Indonesia's 1976 annexation of the territory.
Entities + Descriptions
Chief U.N. arms inspector Richard Butler met Iraq’s Deputy Prime Minister Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work.
Israel's Defense Minister Yitzhak Mordechai will meet senior Palestinian negotiator Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking.
Sinn Fein, the political wing of the Irish Republican Army, deferred a vote on Northern Ireland's peace deal Sunday.
Hundreds of troops patrolled Dili, the Timorese capital, on Friday during the anniversary of Indonesia's 1976 annexation of the territory.
Building a database of descriptions
• Size of database: 59,333 entities and 193,228 descriptions as of 08/01/98
• Text processed: 494 MB (ClariNet, Reuters, UPI)
• Length: 1-15 lexical items
• Accuracy: (precision 94%, recall 55%)
Multiple descriptions per entity
Profile for Ung Huot:
A senior member
Cambodia’s
Cambodian foreign minister
Co-premier
First prime minister
Foreign minister
His excellency
Mr.
New co-premier
New first prime minister
Newly-appointed first prime minister
Premier
Language reuse and regeneration
CONCEPTS + CONSTRAINTS = CONSTRUCTS
Corpus analysis: determining constraints
Text generation: applying constraints
Language reuse and regeneration
• Understanding: full parsing is expensive
• Generation: expensive to use full parses
• Bypassing certain stages (e.g., syntax)
• Not(!) template-based: still requires extraction, analysis, context identification, modification, and generation
• Factual sentences, sentence fragments
• Reusability of a phrase
Context-dependent solution
Redefining the relation:
DescriptionOf (E, C) = {D_{i,C} | D_{i,C} is a description of E in context C}

If named entity E appears in text and the context is C:
Insert DescriptionOf (E, C) in text.
Multiple descriptions per entity
Profile for Bill Clinton:
U.S. President
President
An Arkansas native
Democratic presidential candidate
Choosing the right description
Bill Clinton                        CONTEXT
U.S. President                      foreign relations
President                           national affairs
An Arkansas native                  false bomb alert in AR
Democratic presidential candidate   elections
Pragmatic and semantic constraints on lexical choice.
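The pairing of descriptions with contexts can be sketched as a lookup keyed on (entity, context), mirroring the DescriptionOf (E, C) relation. The four entries are transcribed from the Bill Clinton slide. Treating the context as an exact string label, choosing the first candidate, and prefixing it to the name are simplifying assumptions; the slides use learned rules over richer features, and the example sentence is hypothetical.

```python
# (entity, context) -> descriptions, transcribed from the Bill Clinton slide.
DESCRIPTIONS = {
    ("Bill Clinton", "foreign relations"): ["U.S. President"],
    ("Bill Clinton", "national affairs"): ["President"],
    ("Bill Clinton", "false bomb alert in AR"): ["An Arkansas native"],
    ("Bill Clinton", "elections"): ["Democratic presidential candidate"],
}

def description_of(entity, context):
    """Return the descriptions known for the entity in this context (possibly none)."""
    return DESCRIPTIONS.get((entity, context), [])

def insert_description(text, entity, context):
    """Prefix the first mention of the entity with a context-appropriate description."""
    candidates = description_of(entity, context)
    if not candidates or entity not in text:
        return text
    return text.replace(entity, f"{candidates[0]} {entity}", 1)

# Hypothetical input sentence.
result = insert_description("Bill Clinton arrived in Moscow.", "Bill Clinton", "foreign relations")
```

Simple prefixing reads naturally for title-like descriptions such as "U.S. President"; a description like "An Arkansas native" would need an appositive construction instead.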
Semantic information from WordNet
• All words contribute to the semantic representation
• Only the first sense of each word is used
• What is a synset?
WordNet synset hierarchy
{07063762} director, manager, managing director
{07063507} administrator, decision maker
{07311393} head, chief, top dog
{06950891} leader
{00004123} person, individual, someone, somebody, human
{00002086} life form, organism, being, living thing
{00001740} entity, something
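Reading the list above as a chain in which each synset is the parent of the one listed before it, the hierarchy can be walked with a simple parent map. The offsets and words are transcribed from the slide; the parent-of reading and the names `PARENT`, `WORDS`, and `hypernym_chain` are assumptions made for illustration.

```python
# Child synset offset -> parent synset offset, following the order of the list.
PARENT = {
    "07063762": "07063507",
    "07063507": "07311393",
    "07311393": "06950891",
    "06950891": "00004123",
    "00004123": "00002086",
    "00002086": "00001740",
}

# First word of each synset, for readable output.
WORDS = {
    "07063762": "director",
    "07063507": "administrator",
    "07311393": "head",
    "06950891": "leader",
    "00004123": "person",
    "00002086": "life form",
    "00001740": "entity",
}

def hypernym_chain(offset):
    """Return the offsets from the given synset up to the root, inclusive."""
    chain = [offset]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

chain_words = [WORDS[o] for o in hypernym_chain("07311393")]
```

Starting from {07311393}, `chain_words` is head, leader, person, life form, entity.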
Lexico-semantic matrix
Profile for Ung Huot
Word synsets: {07147929} premier, {07009772} Kampuchean, …
Parent synsets: {07412658} minister, {07087841} associate

Description                            Word synsets   …   Parent synsets
A senior member                                       …   X
Cambodia's                             X              …
Cambodian foreign minister             X              …   X
Co-premier                             X              …   X
First prime minister                   X              …   X
Foreign minister                                      …   X
His excellency                                        …
Mr.                                                   …
New co-premier                         X              …   X
New first prime minister               X              …   X
Newly-appointed first prime minister   X              …   X
Premier                                X              …   X
Prime minister                         X              …   X
Choosing the right description
• Topic approximation by context: words that appear near the entity in the text (bag)
• Name of the entity (set)
• Length of article (continuous)
• Profile: set of all descriptions for that entity (bag)
– parent synset offsets for all words wi.
• Semantic information: WordNet synset offsets (bag)
Choosing the right description
(Context, Entity, Description, Length, Profile, Parent) Classes
Ripper feature vector [Cohen 1996]
Example (training)
T#   Context                                                Entity         Description                          Len
1    Election, promised, said, carry, party …               Kim Dae-Jung   Veteran opposition leader            949
2    Introduced, responsible, running, should, bringing …   Kim Dae-Jung   South Korea's opposition candidate   629
3    Attend, during, party, time, traditionally …           Kim Dae-Jung   A front-runner                       535
4    Discuss, making, party, statement, said …              Kim Dae-Jung   A front-runner                       1114
5    New, party, politics, in, it …                         Kim Dae-Jung   South Korea's president-elect        449

The Profile, Parent, and Classes columns are identical in all five rows:
Profile: Candidate, chief, policymaker, Korean ...
Parent: person, leader, Asian, important person ...
Classes: {07136302} {07486519} {07311393} {06950891} {07486079}
Sample rules
Total number of rules: 4085 for 100,000 inputs
{07136302} IF PROFILE ~ P{07136302} LENGTH <= 603 LENGTH >= 361 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ presidential LENGTH <= 412 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ nominee CONTEXT ~ during .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ case .
{07136302} IF PROFILE ~ P{07136302} LENGTH <= 603 LENGTH >= 390 LENGTH <= 412 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ nominee CONTEXT ~ and .
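A sketch of how one of these rules could be tested against a feature vector follows. It assumes that `~` means "the bag-valued feature contains this item" and that all conditions of a rule must hold together; both are a reading of the rule syntax rather than something the slides spell out. The feature vectors are hypothetical, not taken from the training data.

```python
def rule_matches(example, conditions):
    """True if every (feature, operator, value) condition holds for the example."""
    for feature, op, value in conditions:
        if op == "~" and value not in example[feature]:
            return False
        if op == "<=" and not example[feature] <= value:
            return False
        if op == ">=" and not example[feature] >= value:
            return False
    return True

# Second sample rule: PROFILE ~ P{07136302}, CONTEXT ~ presidential, LENGTH <= 412.
rule = [
    ("PROFILE", "~", "P{07136302}"),
    ("CONTEXT", "~", "presidential"),
    ("LENGTH", "<=", 412),
]

# Hypothetical feature vectors.
short_article = {"PROFILE": {"P{07136302}"}, "CONTEXT": {"presidential", "party"}, "LENGTH": 400}
long_article = {"PROFILE": {"P{07136302}"}, "CONTEXT": {"presidential", "party"}, "LENGTH": 629}

fires_short = rule_matches(short_article, rule)
fires_long = rule_matches(long_article, rule)
```

Under these assumptions the rule fires for the 400-word article and not for the 629-word one; when it fires, it adds class {07136302} to the predicted set.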
Evaluation
• 35,206 tuples; 11,504 distinct entities; 3.06 DDPE (distinct descriptions per entity)
• Training: 90% of corpus (10,353 entities)
• Test: 10% of corpus (1,151 entities)
Evaluation
• Rule format (each matching rule adds constraints):
X [A] (evidence of A)
Y [B] (evidence of B)
X Y [A] [B] (evidence of A and B)
• Classes are in 2^W (powerset of WN nodes)
• P&R on the constraints selected by system
Definition of precision and recall
Model         System        P        R
[A] [B] [C]   [B] [D]       50.0 %   33.3 %
[A] [B] [C]   [A] [B] [D]   66.7 %   66.7 %
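The two rows can be reproduced by treating the model and system outputs as sets of constraints: precision is the share of system constraints that are in the model, recall the share of model constraints that the system found. The function name and the handling of empty sets are illustrative choices.

```python
def precision_recall(model, system):
    """Set-based precision and recall of the system's constraints against the model's."""
    correct = len(model & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(model) if model else 0.0
    return precision, recall

model = {"A", "B", "C"}

p1, r1 = precision_recall(model, {"B", "D"})       # 1 of 2 correct, 1 of 3 found
p2, r2 = precision_recall(model, {"A", "B", "D"})  # 2 of 3 correct, 2 of 3 found
```

The first call gives 50.0 % precision and 33.3 % recall; the second gives 66.7 % for both.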
Precision and recall
               Word nodes only       Word and parent nodes
Training set   Precision   Recall    Precision   Recall
500            64.29%      2.86%     78.57%      2.86%
1000           71.43%      2.86%     85.71%      2.86%
2000           42.86%      40.71%    67.86%      62.14%
5000           59.33%      48.40%    64.67%      53.73%
10000          69.72%      45.04%    74.44%      59.32%
15000          76.24%      44.02%    73.39%      53.17%
20000          76.25%      49.91%    79.08%      58.70%
25000          83.37%      52.26%    82.39%      57.49%
30000          80.14%      50.55%    82.77%      57.66%
50000          83.13%      58.53%    88.87%      63.39%
100000         85.42%      62.81%    89.70%      64.64%
150000         87.07%      63.17%
200000         85.73%      62.86%
250000         87.15%      63.85%