Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing
CoNLL-2005
Preslav Nakov, EECS, Computer Science Division, University of California, Berkeley
Marti Hearst, SIMS, University of California, Berkeley
Outline
  Introduction
  Related Work
  Models and Features
Introduction
Noun compound bracketing -> noun compound interpretation
  liver cell antibody: [[liver cell] antibody]
  liver cell line: [liver [cell line]]
Same POS sequence, different syntactic trees
This Paper
A highly accurate unsupervised method for making bracketing decisions for noun compounds (NCs)
Current: using bigram estimates to compute adjacency and dependency scores
Improvement
  χ2 measure
  a new set of surface features for querying Web search engines
Evaluate on 2 domains: encyclopedia and bioscience
Related Work
NC syntax and semantics
  Still active -> Journal of Computer Speech and Language, Special Issue on Multiword Expressions
Adjacency model
Probabilistic dependency model, Lauer (1995)
  Data sparseness (use categories instead)
  244 NCs from encyclopedia
  Inter-annotator agreement 81.5%
  Baseline 66.8% -> 77.5%
  Adding POS -> state-of-the-art result of 80.7%
2003-2005
Keller and Lapata (2003)
  Use Web search engines to obtain frequencies for unseen bigrams
Lapata and Keller (2004)
  Apply this to six NLP tasks, including disambiguation of NCs
  Simpler version (uses frequency only): 78.68%
Girju et al. (2005)
  Supervised (decision tree), 5 WordNet semantic features: 83.1%
Models and Features
Adjacency and dependency models
w1 w2 w3 -> [w1 [w2 w3]]: two reasons to take on right bracketing
  1. w2 w3 is a compound (modified by w1): home health care
     The adjacency model checks 1.
  2. w1 and w2 independently modify w3: adult male rat
     The (better) dependency model checks 2.
Left bracketing -> only 1 choice: [law enforcement] agent
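The two models above can be sketched as a comparison of association scores. This is only an illustration: it assumes the common frequency-based estimate in which the attachment of a modifier to a head is approximated by #(modifier, head) / #(head). The function names and the toy counts are hypothetical and do not come from the slides.

```python
# Minimal sketch of adjacency vs. dependency bracketing for a 3-word NC.
# Assumption: association is estimated as #(modifier, head) / #(head).
# TOY_COUNTS holds made-up numbers used only to exercise the code.

TOY_COUNTS = {
    ("enforcement",): 1000, ("agent",): 2000,
    ("law", "enforcement"): 400, ("law", "agent"): 10,
    ("enforcement", "agent"): 50,
    ("male",): 5000, ("rat",): 1000,
    ("adult", "male"): 50, ("adult", "rat"): 200, ("male", "rat"): 300,
}


def assoc(modifier, head, counts):
    """Estimate how strongly `modifier` attaches to `head`."""
    head_count = counts.get((head,), 0)
    if head_count == 0:
        return 0.0
    return counts.get((modifier, head), 0) / head_count


def bracket(w1, w2, w3, counts, model="dependency"):
    """Return 'left', 'right' or 'tie' for the compound w1 w2 w3."""
    left = assoc(w1, w2, counts)          # evidence for [[w1 w2] w3]
    if model == "adjacency":
        right = assoc(w2, w3, counts)     # is w2 w3 a compound?
    elif model == "dependency":
        right = assoc(w1, w3, counts)     # does w1 modify w3 directly?
    else:
        raise ValueError(f"unknown model: {model}")
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "tie"


print(bracket("law", "enforcement", "agent", TOY_COUNTS))
print(bracket("adult", "male", "rat", TOY_COUNTS, model="adjacency"))
```

Both models share the left-bracketing evidence and differ only in which pair they consult for right bracketing, which is why the comparison is written as a single function with a `model` switch.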
Computing Probabilities
Alternative
Calculations
χ2 measure
  A = #(wi, wj)
  B = #(wi) - A
  C = #(wj) - A
  D = N - A - B - C
  N = 8T (8 trillion) = Google's 8B pages x 1000 words/page
(Yang and Pedersen, 1997): χ2 better than MI
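The formula itself did not survive in this transcript, so the sketch below assumes the standard χ2 statistic for a 2x2 contingency table, χ2 = N(AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D)), with B, C, D and N as defined above and A taken to be the bigram count #(wi, wj). The function name `chi_square` and its parameters are illustrative.

```python
# Sketch of the chi-square association score for a word pair (wi, wj).
# Assumption: the standard 2x2 contingency-table formula; the default N
# follows the slide's estimate of 8 trillion words.

def chi_square(count_wi, count_wj, count_bigram, n=8e12):
    a = count_bigram            # wi and wj together
    b = count_wi - a            # wi without wj
    c = count_wj - a            # wj without wi
    d = n - a - b - c           # neither word
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Small hand-checkable table: A=20, B=5, C=5, D=20, N=50.
print(chi_square(25, 25, 20, n=50))
```

In a bracketing model, this score would simply replace the raw-frequency association between two words.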
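Example: 蛋包飯 (omurice; literally "egg", "wrap", "rice")
Counts: 蛋 (egg) 2067593; 蛋包 (egg wrap) 2217; 包 (wrap) 10207448; 包飯 (wrap rice) 3398; 飯 (rice) 1672224
χ2: 包飯 750.34 > 蛋包 67.32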
Web-Derived Surface Features (1/2)
Authors sometimes (consciously or not) disambiguate the words they write by using surface-level markers to suggest the correct meaning.
Dash (hyphen)
  left bracketing: cell cycle analysis -> cell-cycle
  right bracketing (less reliable): donor T-cell
  fiber optics-system, t-cell-depletion
Possessive marker
  brain's stem cells, brain stem's cells, brain's stem-cells
Internal capitalization
  Plasmodium vivax Malaria, brain Stem cells
  disable this feature on Roman digits and single-letter words: vitamin D deficiency
Web-Derived Surface Features (2/2)
Embedded slashes
  leukemia/lymphoma cell
Brackets
  growth factor (beta) or (growth factor) beta
  (brain) stem cells
A comma, a dot or a colon (weak indicator)
  "health care, provider" or "lung cancer: patients"
mouse-brain stem cells (weak indicator)
Unfortunately, Web search engines (SE) ignore punctuation characters: hyphens, brackets, apostrophes, etc.
  Collect them indirectly by post-processing the resulting summaries (up to 1000 results)
Some of the above features are clearly more reliable than others, but we do not try to weight them
Features verifying: counts returned by the SE (page hits as a proxy for n-gram frequencies), from 1000 summaries
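Since the punctuation has to be recovered from the returned summaries, the post-processing step can be sketched as a pattern count over snippet text. This is a rough illustration covering only the hyphen and possessive markers. The function name `surface_votes` is hypothetical, and reading "brain's stem cells" as right-bracketing evidence and "brain stem's cells" as left-bracketing evidence is an assumption made here, since the slides list these examples without a direction.

```python
import re

# Sketch: count left/right bracketing evidence for "w1 w2 w3" in snippets.
# Only two surface markers are handled (hyphen and possessive); the snippet
# list stands in for search-engine result summaries.

def surface_votes(w1, w2, w3, snippets):
    a, b, c = (re.escape(w) for w in (w1, w2, w3))
    patterns = {
        "left": [
            rf"\b{a}-{b}\s+{c}\b",       # e.g. cell-cycle analysis
            rf"\b{a}\s+{b}'s\s+{c}\b",   # e.g. brain stem's cells
        ],
        "right": [
            rf"\b{a}\s+{b}-{c}\b",       # e.g. donor T-cell
            rf"\b{a}'s\s+{b}\s+{c}\b",   # e.g. brain's stem cells
        ],
    }
    votes = {"left": 0, "right": 0}
    for snippet in snippets:
        text = snippet.replace("\u2019", "'")  # normalize curly apostrophes
        for side, regexes in patterns.items():
            for regex in regexes:
                votes[side] += len(re.findall(regex, text, flags=re.IGNORECASE))
    return votes


SNIPPETS = [
    "Cell-cycle analysis was performed on all samples.",
    "damage to the brain stem\u2019s cells",
    "the brain's stem cells are multipotent",
]
print(surface_votes("cell", "cycle", "analysis", SNIPPETS))
print(surface_votes("brain", "stem", "cells", SNIPPETS))
```

Ambiguous forms such as "t-cell-depletion" match neither pattern and therefore contribute no vote in this sketch.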
Other Web-Derived Features
Abbreviations
  tumor necrosis factor (NF), tumor necrosis (TN) factor
Concatenation
  health care reform -> healthcare, carereform
Wildcard (*)
  "health care * reform" <-> "health * care reform"
Reorder
  reform health care <-> care reform health
  myosin heavy chain, heavy chain myosin
Internal inflection variability
  tyrosine kinase activation, tyrosine kinases activation
Switching
  "adult male rat", we would also expect "male adult rat".
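Three of the features above (concatenation, wildcard and reorder) amount to issuing a pair of competing queries and comparing their counts. The sketch below only builds the query strings. The function name `query_pairs` is hypothetical, and treating the first member of each pair as left-bracketing evidence and the second as right-bracketing evidence is an assumption based on the order in which the slide lists them.

```python
# Sketch: build (left_evidence_query, right_evidence_query) pairs for the
# compound "w1 w2 w3". Quoted strings denote exact-phrase queries.

def query_pairs(w1, w2, w3):
    return {
        # healthcare vs. carereform
        "concatenation": (f"{w1}{w2}", f"{w2}{w3}"),
        # "health care * reform" vs. "health * care reform"
        "wildcard": (f'"{w1} {w2} * {w3}"', f'"{w1} * {w2} {w3}"'),
        # "reform health care" vs. "care reform health"
        "reorder": (f'"{w3} {w1} {w2}"', f'"{w2} {w3} {w1}"'),
    }


for feature, (left_query, right_query) in query_pairs("health", "care", "reform").items():
    print(feature, left_query, right_query)
```

Whichever query of a pair returns the higher count would then cast that feature's vote.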
New Findings
Paraphrases
Warren (1978) proposes
  stem cells in the brain
  cells from the brain stem
Copula paraphrase
  office building that/which is a skyscraper
  pain associated with arthritis migraine
Search engines lack linguistic annotations
  small set of hand-chosen paraphrases: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for
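The hand-chosen paraphrases can be turned into exact-phrase queries in the same way. The sketch below is one possible reading: it assumes that "w3 LINK w1 w2" (as in "pain associated with arthritis migraine") supports left bracketing and that "w2 w3 LINK w1" (as in "stem cells in the brain", minus the article) supports right bracketing. The names `PARAPHRASE_LINKS` and `paraphrase_queries` are illustrative, and articles are omitted for simplicity.

```python
# Sketch: generate paraphrase queries for the compound "w1 w2 w3".
# The link list expands the slide's "at/in" and "by/in/for" alternatives.

PARAPHRASE_LINKS = [
    "associated with", "caused by", "contained in", "derived from",
    "focusing on", "found in", "involved in", "located at", "located in",
    "made of", "performed by", "preventing", "related to",
    "used by", "used in", "used for",
]


def paraphrase_queries(w1, w2, w3, links=PARAPHRASE_LINKS):
    return {
        # [[w1 w2] w3]: the head w3 is linked to the unit "w1 w2"
        "left": [f'"{w3} {link} {w1} {w2}"' for link in links],
        # [w1 [w2 w3]]: the unit "w2 w3" is linked to w1
        "right": [f'"{w2} {w3} {link} {w1}"' for link in links],
    }


queries = paraphrase_queries("arthritis", "migraine", "pain", links=["associated with"])
print(queries["left"][0])
print(queries["right"][0])
```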
Evaluations
Lauer's Dataset (1995)
  244 unambiguous 3-noun NCs
Biomedical Dataset (Nakov et al., 2005, SIG BioLink)
  Open NLP tools: sentence split, tokenized, POS tagged and shallow parsed a set of 1.4 million MEDLINE abstracts (citations between 1994 and 2003)
  500 NCs: 361 left, 69 right, 70 ambiguous
Experiments
Used MSN Search statistics for the n-grams and the paraphrases (unless the pattern contained a "*")
  MSN always returned exact numbers
Used Google for the surface features
  Google and Yahoo rounded their page hits, which generally leads to lower accuracy (Yahoo was better than Google for these estimates)
Tools Mentioned
UMLS Specialist lexicon
  Provides the alternative spellings of words in the biomedical domain
  http://www.nlm.nih.gov/pubs/factsheets/umlslex.html
Carroll's morphological tools
  http://www.cogs.susx.ac.uk/lab/nlp/carroll/morph.html
UMLS Lexicon
{base=AAA
entry=E0000049
cat=noun
variants=metareg
variants=uncount
acronym_of=abdominal aortic aneurysmectomy|E0429482
acronym_of=acne-associated arthritis|E0429483
acronym_of=acquired aplastic anemia|E0429484
acronym_of=acute anxiety attack|E0429485
acronym_of=androgenic anabolic agent|E0429486
acronym_of=aneurysm of ascending aorta
acronym_of=aromatic amino acid|E0356310
acronym_of=acute apical abscess|E0356309
abbreviation_of=abdominal aortic aneurysm|E0006446
}
{base=AAMD
spelling_variant=A.A.M.D.
entry=E0000050
cat=noun
variants=groupuncount
acronym_of=American Association on Mental Deficiency|E0000277
}
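Records like the two above can be read with a few lines of code. The sketch below is not an official reader for the lexicon: it assumes each record is delimited by braces and holds one `key=value` field per line, and it keeps repeated keys (such as `acronym_of`) as lists. The function name `parse_lexicon_records` is hypothetical.

```python
from collections import defaultdict

# Sketch: parse brace-delimited records with one key=value field per line.
# Repeated keys are collected into lists.

def parse_lexicon_records(text):
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("{"):           # start of a new record
            current = defaultdict(list)
            line = line[1:]
        closes = line.endswith("}")        # end of the current record
        if closes:
            line = line[:-1]
        line = line.strip()
        if line and current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key].append(value)
        if closes and current is not None:
            records.append(dict(current))
            current = None
    return records


SAMPLE = """{base=AAMD
spelling_variant=A.A.M.D.
entry=E0000050
cat=noun
variants=groupuncount
acronym_of=American Association on Mental Deficiency|E0000277
}"""

for record in parse_lexicon_records(SAMPLE):
    print(record["base"], record["spelling_variant"])
```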
Conclusions and Future Work
Improved upon the state-of-the-art approaches to NC bracketing
Future work includes
  test on NCs of more than 3 words
  recognize the ambiguous cases
  include determiners and modifiers
  test on other NLP problems
  refine the parser output
    parsers typically assume right bracketing