Introduction to Text Mining and Natural Language Processing
BIF-30806 January 2010
Judith Risse
2
Outline
Literature and Databases
Natural Language Processing
Information Retrieval
Question Answering
Information Extraction
Indexing
Document Classification
Exercises
3
Definitions
Natural Language Processing (NLP): the study of automated generation and understanding of natural human languages (Wikipedia)
Text Mining: extract high-quality (previously unknown) information from large amounts of unstructured text
4
Biomedical Literature
communication of scientific discoveries
peer-reviewed and community-reviewed
provides additional information on experimental results
basis for annotation of biological databases
5
Literature Databases
NCBI Bookshelf
PubMed Central
PubMed
  currently 19,476,540 citations (Jan 27, 2010)
  5,414 journals in Medline
  unique identifier: PMID
  entries contain author, journal and title info
  more than 50% also contain abstracts
  links to full-text articles
  Medical Subject Headings (MeSH)
6
PubMed
PubMed growth
[Figure: PubMed growth, 1950 to 2007. x-axis: year; y-axis: number of publications in millions (0 to 21); series: entries per year, total number of entries]
7
Pubmed (3)
© NLM 2008
8
A scientific article
journal-specific format: sections, print style
type of article: review, letter
document format: HTML, PDF
9
Article content
Full text: title, authors, abstract, body
Tables, figures, references
10
Biomedical Language
domain-specific terminology
  cytosolic, erythroid precursor
polysemic words
  e.g. Drosophila gene names: coitus interruptus, lost in space
acronyms
  APC (activated protein C), mdh (malate dehydrogenase)
low-frequency words
anaphora (references)
  Overexpression of FumRs and Frds1 resulted in the best citrate-producing strain in the presence of trace manganese concentrations. This strain gave a maximum yield of ….
11
Biomedical Language (2)
synonyms/creating new terms
typographical variants
  malic dehydrogenase
  L-malate dehydrogenase
  NAD-L-malate dehydrogenase
  malic acid dehydrogenase
  NAD-dependent malic dehydrogenase
  NAD-malate dehydrogenase
  NAD-malic dehydrogenase
  malate (NAD) dehydrogenase
  MDH
  L-malate-NAD+ oxidoreductase
12
Natural Language Processing
create computational models of language
multi-disciplinary: information technology, linguistics, artificial intelligence, statistics ….
statistical properties of language
machine learning, rule-based, regular expressions
grammatical, morphological, syntactic and semantic features
13
Grammatical Features
Grammar: rules governing a language
  syntax and morphology
Part of speech (POS)
  noun, verb, adjective, adverb, preposition
  depends on context in sentence
Brill tagger (Eric Brill, PhD thesis, 1993)
  http://www.cst.dk/online/pos_tagger/uk/index.html
  http://en.wikipedia.org/wiki/Brill_Tagger
14
Morphological Features
structure of words
inflection
  enzyme and enzymes (plural form)
  catalyse, catalyses, catalysing (verb inflection)
word formation
  earth, earthworm (compounding)
  dependent, independent (derivation)
stemming and lemmatisation
  reduction of words to a common base form
  am, are, is → be
  catalyse, catalyses, catalysing → catalys
Porter Stemmer (tartarus.org/martin/PorterStemmer)
15
Syntactic Features
relationships between words in a sentence
noun phrase, verb phrase
subject-object relationships
16
POS Tagged Sentence
(NNP Pain) (VBD vanished) (IN for) (IN at) (JJS least) (CD three) (NNS months) (IN in) (NNS rats) (WP who) (VBD were) (VBN injected) (IN in) (DT the) (NN spine) (IN with) (DT a) (NN gene) (IN that) (NNS triggers) (VBZ endorphins) (. .)
Pain - Proper singular noun
vanished - Verb, past tense
for - Preposition
at - Preposition
least - Superlative adjective
three - Cardinal number
months - Plural noun
in - Preposition
rats - Plural noun
who - wh-pronoun
were - Verb, past tense
injected - Verb, past participle
in - Preposition
the - Determiner
spine - Singular noun
with - Preposition
a - Determiner
gene - Singular noun
that - Preposition
triggers - Plural noun
endorphins - Verb, 3rd ps. sing. present
. - Final punctuation
17
Semantic Features
meaning of words given the context
dictionaries, thesauri
Gene Ontology
18
Contextual Analysis
Guilt by association
  co-occurrence analysis
Word frequency
  bag of words
  statistical analysis of word frequency
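The bag-of-words and co-occurrence ideas above can be sketched in a few lines of Python. This is only an illustration: the function names, the simple regular-expression tokenizer and the three example sentences are assumptions made here, not part of the lecture material.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split it into word tokens and count each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

def cooccurrence(sentences, term_a, term_b):
    """Count the sentences in which both terms occur (guilt by association)."""
    count = 0
    for sentence in sentences:
        bag = bag_of_words(sentence)
        if term_a in bag and term_b in bag:
            count += 1
    return count

# Hypothetical example sentences, made up for this sketch.
sentences = [
    "MDH catalyses the oxidation of malate.",
    "Malate is oxidised by MDH in the citric acid cycle.",
    "Citrate accumulates in the presence of manganese.",
]

print(bag_of_words(" ".join(sentences)).most_common(3))
print(cooccurrence(sentences, "mdh", "malate"))
```

Word order is thrown away on purpose: a bag of words only records how often each token occurs, which is what the statistical analysis of word frequency needs.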
19
Exercise 1
Take a gene/protein name of interest to you, query PubMed and retrieve one abstract.
Take a look at what the Porter stemmer does using the abstract.
Describe what problems might occur from stemming.
Porter Stemmer: http://maya.cs.depaul.edu/~classes/ds575/porter.html
Coffee Break
21
Tasks of NLP
Information Extraction (IE)
Question Answering (QA)
Information Retrieval (IR)
machine translation
text proofing
speech recognition
optical character recognition (OCR)
22
Information Retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Introduction to IR (CambUnivPr, 2008)
Indexing
  Tokenization
  Case folding (TNFalpha, Tnfalpha → tnfalpha)
  Stemming
  Stop-word removal (e.g. at, be, from, this …)
Boolean queries
Vector space model queries
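A minimal sketch of the indexing steps listed above (tokenization, case folding, stop-word removal) might look as follows. The stop-word list is a small made-up sample extending the slide's examples, the tokenizer is a plain regular expression, and stemming is left out because it needs a full stemmer such as the Porter stemmer.

```python
import re

# Assumed sample stop-word list; real systems use much longer lists.
STOP_WORDS = {"at", "be", "from", "this", "the", "of", "a", "in", "is"}

def preprocess(text, stop_words=STOP_WORDS):
    """Turn raw text into a list of index terms."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)          # tokenization
    tokens = [token.lower() for token in tokens]        # case folding
    return [t for t in tokens if t not in stop_words]   # stop-word removal

print(preprocess("TNFalpha is released from this cell"))
# prints ['tnfalpha', 'released', 'cell']
```

After case folding, TNFalpha and Tnfalpha map to the same index term, which is the point of the example on the slide.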
23
Zipf’s Law
• A small number of words occur very often
• Those high-frequency words are often function words (e.g. prepositions)
• Most words have a low frequency
24
Boolean Queries
Combination of query terms with Boolean operators: AND, OR, NOT
Google, PubMed
high recall, low precision
unranked result
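Boolean retrieval can be illustrated with Python sets, assuming each term maps to the set of documents that contain it. The tiny index and the document ids below are invented for this sketch.

```python
# Hypothetical term -> document-id sets for a collection of five documents.
index = {
    "prostaglandin": {1, 2, 4},
    "inhibit": {2, 3, 4},
    "aspirin": {4, 5},
}
all_docs = {1, 2, 3, 4, 5}

def docs_for(term):
    """Documents containing the term (empty set for unknown terms)."""
    return index.get(term, set())

def boolean_and(a, b):
    return a & b

def boolean_or(a, b):
    return a | b

def boolean_not(a):
    return all_docs - a

# prostaglandin AND inhibit AND NOT aspirin
result = boolean_and(
    boolean_and(docs_for("prostaglandin"), docs_for("inhibit")),
    boolean_not(docs_for("aspirin")),
)
print(sorted(result))  # prints [2]
```

The result is a plain set: a document either matches or it does not, which is why Boolean results come back unranked.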
25
The vector space model
term weight
  term frequency (TF)
  inverse document frequency (IDF): log(N/DF), where DF is the number of documents containing the term
  corpus size (N)
  weight = (1 + log TF) × log(N/DF)
the vector points in ‘word space’
each dimension corresponds to a word or phrase
© Nat Rev Gen (2002):3 pp 601-610
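The term weight (1+logTF)log(N/DF) can be computed as in the sketch below. The slide does not state the base of the logarithm; base 10 is assumed here, and the function names and the zero weight for absent terms are choices made for this illustration.

```python
import math
from collections import Counter

def term_weight(tf, df, n_docs):
    """(1 + log TF) * log(N / DF), with base-10 logarithms (an assumption)."""
    if tf == 0 or df == 0:
        return 0.0  # term absent from the document or from the corpus
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

def document_vector(tokens, doc_freq, n_docs):
    """Map each term of one document to its weight: a point in 'word space'."""
    counts = Counter(tokens)
    return {
        term: term_weight(tf, doc_freq.get(term, 0), n_docs)
        for term, tf in counts.items()
    }

# A term that occurs 10 times in a document and in 10 of 1000 documents:
print(term_weight(10, 10, 1000))
```

A term that occurs in every document gets log(N/DF) = log(1) = 0, so it contributes nothing to the vector regardless of how often it appears.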
26
IR Evaluation
A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query. Introduction to IR (CambUnivPr, 2008)
document collection
test cases of information need, expressed as queries
measure of relevance
27
Evaluation (2)
Precision
  What fraction of the returned results are relevant to the information need?
Recall
  What fraction of the relevant documents in the collection were returned by the system?
F-score
  harmonic mean of precision and recall: (2×p×r)/(p+r)
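As a small worked example, precision, recall and F-score can be computed from the set of returned documents and the set of relevant documents. The function name and the document ids are made up for this sketch; returning 0 when a denominator is zero is a convention chosen here.

```python
def evaluate(returned, relevant):
    """Return (precision, recall, F-score) for one query."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    p = true_positives / len(returned) if returned else 0.0
    r = true_positives / len(relevant) if relevant else 0.0
    f = (2 * p * r) / (p + r) if (p + r) > 0 else 0.0
    return p, r, f

# 4 documents returned, 4 relevant in the collection, 2 in common.
print(evaluate([1, 2, 3, 4], [2, 4, 5, 6]))  # prints (0.5, 0.5, 0.5)
```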
28
Exercise 2
Compare the retrieval of abstracts between PubMed and Phasar (www.bioinformatics.nl/biometa/applet.html or twoquid.cs.ru.nl/applet.html) given the question:
What does prostaglandin inhibit?
How many results do you get?
Give examples of answers to the question.
Give 5 PMIDs of papers you would read given the results in each search.
Which of the systems was more helpful and why?
Coffee Break
30
Question Answering
question posed in human language
answer extracted from unstructured text
more developed in generic domain
difficult in biomedical domain
31
Information Extraction & Text Mining
extract structured information from unstructured text
Named Entity Recognition
identify relationships
  e.g. protein-protein interactions
32
Information Extraction
extract meaning from a text
combines:
  POS tagging
  ontologies
  regular expressions
© Nat Rev Gen(2002):3 pp 601-610
33
Named Entity Recognition
tagging of biological entities
high precision in generic NLP (0.9 F-score)
difficult in biology
  complex terms, synonyms, disambiguation
  gene symbols
  typographical variations
  no use of official symbols
  gene/protein names
34
Challenges of NLP
Abbreviation
  punctuation can be confused with end of sentence
  Wash. (Washington) vs. wash.
Decimal points
Apostrophes: to split or not to split?
35
Challenges (2)
hyphens
  single or multiple words?
  data-base vs. data base vs. database
  carry-over?
simple stemming
  operate, operating, operates, operation, operative, operatives, operational → oper
case folding
  brown car vs. Mr. Brown
36
Anaphora
co-references
  one expression referring to another
  The monkey took the banana and ate it.
strictly only local antecedent statements
Sortal anaphora
  this gene, the virus
resolution required for increased recall
37
Exercise 3
compare NER programmes
retrieve one PubMed abstract
http://biocreative.sourceforge.net/bionlp_tools_links.html
  NLProt
  TerMine
  Whatizit (http://www.ebi.ac.uk/webservices/whatizit/info.jsf)
What are the differences in recognized entities?
Do they miss any obvious entities?
38
Indexing
Inverted Index (Inverted File)
  for each word in the collection (dictionary), list occurrence and frequency
size of index is proportional to size of corpus
remove stop words, use stemming for a more efficient index
classic version is a Boolean index
can also contain positional information
sparse matrix
39
Example
deterministic
  number of docs containing the term: 20
  document ids: 73 89 90 106 173 194 233 243 251 252 255 257 258 267 276 281 304 312 315 326
  total # of occurrences: 27
  term position in counted words: 36822 44643 45285 53003 53061 86740 86743 97082 116618 121984 125750 125952 125968 126039 127633 128882 128978 129048 133781 133789 138493 140946 140947 152011 156191 157881 163490
deterrence
  number of docs containing the term: 1
  document ids: 60
  total # of occurrences: 4
  term position in counted words: 30309 30345 30444 30452
detonation
  number of docs containing the term: 2
  document ids: 263 264
  total # of occurrences: 4
  term position in counted words: 131781 131956 131995 132303
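A positional inverted index of the kind shown in the example could be built roughly as follows. This is a sketch under assumptions: documents are given as already tokenized lists, positions are counted in words across the whole collection starting at 0, and the entry layout (a dict with "docs" and "positions") is invented here.

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict mapping document id -> list of tokens.

    Returns term -> {"docs": [ids], "positions": [word positions]}.
    """
    index = defaultdict(lambda: {"docs": [], "positions": []})
    position = 0  # running word count over the whole collection
    for doc_id in sorted(docs):
        for token in docs[doc_id]:
            entry = index[token]
            # Record each document id once, in increasing order.
            if not entry["docs"] or entry["docs"][-1] != doc_id:
                entry["docs"].append(doc_id)
            entry["positions"].append(position)
            position += 1
    return dict(index)

# Hypothetical two-document collection.
index = build_index({1: ["gene", "protein", "gene"], 2: ["protein", "enzyme"]})
for term, entry in sorted(index.items()):
    # number of docs, document ids, total occurrences, positions
    print(term, len(entry["docs"]), entry["docs"],
          len(entry["positions"]), entry["positions"])
```

The number of documents containing a term and its total number of occurrences are not stored separately; they are the lengths of the two lists.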
40
Suffix Array
A suffix array is an array that contains pointers to all the text suffixes, listed in lexicographical order.
Text is seen as one long string
A text suffix is a substring from a given position to the end of the string
  position refers to beginning of word
return all occurrences of string W in large text A
41
Example:
Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring
the word: abracadabra
1. create all suffixes
2. sort suffixes alphabetically
3. resulting suffix array
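The three steps for abracadabra can be sketched in Python. This version works on characters with 0-based positions and sorts the suffixes directly, which is fine for a short word but far too slow for a large text; the function names are chosen for this example.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, sorted by the suffix they point to."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """All start positions of pattern: the suffixes that begin with it."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # Binary search for the first suffix whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Binary search for the first suffix whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

word = "abracadabra"
suffix_array = build_suffix_array(word)
print(suffix_array)  # prints [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
print(find_occurrences(word, suffix_array, "abra"))  # prints [0, 7]
```

Because the suffixes are sorted, all suffixes that begin with the query sit next to each other in the array, so two binary searches are enough to find them.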
42
Document Classification
assign a document to a class given its content
manual (ad hoc)
rule-based
decision tree
machine learning approaches
43
Statistical Text Classification
training documents for each class
supervised learning
test data or new data
training data and test data have to be similar
44
Naïve Bayes
Naïve: all words in text are considered independent
Bayes: uses Bayes theorem
P(A | B) = P(B | A) P(A) / P(B)
prior probability: P(A)
posterior probability: P(A | B)
45
Basic Probability Theory
Given that A represents an event, the probability of A occurring is 0 ≤ P(A) ≤ 1
Joint probability: P(A,B) = P(A∩B)
Conditional probability: P(A | B)
Chain rule: P(A,B) = P(A | B)P(B) = P(B | A)P(A)
46
Application to Document Classification
wikipedia.org
probability of a word belonging to category C
probability of a document belonging to category C given its words
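A naïve Bayes text classifier along these lines can be sketched as below. Several choices here are assumptions rather than something stated on the slides: word probabilities use add-one (Laplace) smoothing so that unseen words do not give a zero probability, log probabilities are summed instead of multiplying small numbers, and P(document) is dropped because it is the same for every category. The class name and the toy training data are invented.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.class_docs = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word counts per category
        self.vocabulary = set()

    def train(self, tokens, label):
        self.class_docs[label] += 1
        self.word_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            # log prior: P(C)
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # log P(word | C) with add-one smoothing (an assumption)
                count = self.word_counts[label][token]
                score += math.log(
                    (count + 1) / (total_words + len(self.vocabulary))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data, made up for this sketch.
classifier = NaiveBayes()
classifier.train(["scrum", "try", "tackle"], "rugby")
classifier.train(["serve", "ace", "racket"], "tennis")
print(classifier.classify(["scrum", "tackle"]))  # prints rugby
```

The "naïve" part is the inner loop: each word contributes its own factor P(word | C), as if the words of a document were independent of each other.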
Coffee Break
48
Exercise 4
Try to apply naïve Bayes to a selection of sentences using http://search.cpan.org/~kwilliams/Algorithm-NaiveBayes/
Use rugby.txt and tennis.txt as training and test data.
If you have it implemented, try using this in combination with the Porter Stemmer (http://bionlp.stanford.edu/bionlp.pl)
49
Added Challenge
From sequence to abstract to NER
MSTESMIRDVELAEEALPQKMGGFQNSRRCLCLSLFSFLLVAGATTLFCLLNFGVIGPQR DEKFPNGLPLISSMAQTLTLRSSSQNSSDKPVAHVVANHQVEEQLEWLSQRANALLANGM DLKDNQLVVPADGLYLVYSQVLFKGQGCPDYVLLTHTVSRFAISYQEKVNLLSAVKSPCPKDTPEGAELKPWYEPIYLGGVFQLEKGDQLSAEVNLPKYLDFAESGQVYFGVIAL
retrieve UniProt ID via BLAST (take best hit)
retrieve gene name using getz (GeneName field)
retrieve relevant abstracts from PubMed in Medline format using eSearch and eFetch with the gene name
extract all protein/gene names from these abstracts (http://bionlp.stanford.edu/webservices.html)
how do they relate to the original protein?
compare to the output of ebiMed using the gene name (http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp)
50
Helpful resources
http://www-nlp.stanford.edu/links/statnlp.html
http://nlp.stanford.edu/IR-book/html/htmledition/mybook.html
www.biocreative.org
Drosophila gene names: http://www.curioustaxonomy.net/gene/fly.html
51
Further Reading
Introduction to Information Retrieval, Cambridge University Press, ISBN 978-0-521-86571-5
The Text Mining Handbook, Cambridge University Press, ISBN-13 978-0-521-83657-9