Text Mining in the Biograph Project
Walter Daelemans
(Roser Morante, Vincent Van Asch)
CLiPS-Computational Linguistics Group
University of Antwerp
Biograph (www.biograph.be)
• Funded by University of Antwerp:
- Text Mining: CLiPS CL Group
• Roser Morante, Vincent Van Asch, Walter Daelemans
- Graph Data Mining: ADReM, Department of Mathematics and Computer Science
• Jeroen De Knijf, Bart Goethals
- Genetics: AMG, Department of Molecular Genetics
• Anthony Liekens, Peter De Rijk, Jurgen Del-Favero
Context
• The BIOGRAPH project aims at:
- Assisting researchers in ranking candidate disease-causing genes by putting forward a new methodology for combined text analysis and data mining from heterogeneous information sources
- Mining biomedical texts: providing accurate relations automatically extracted from text and weighted according to their reliability
• Reliability:
- Treatment of negation and modality
- Certainty of the extracted relations
Gene Prioritization
• Candidate region
- Gene responsible for a disease (e.g. schizophrenia or Alzheimer's disease) is in a known area of the genome
- Many genes (> 200) are in this candidate region
• Experimental validation is needed
- Very expensive in time and cost
• Combine information in literature and in databases
- Which genes in the candidate region could be most relevant for the disease and why?
- Provide a prioritization (ranking problem)
Adding Text Mining
• Extract (positive) relations of any kind between biomedical concepts found in biomedical abstracts and full papers
• Add to the biograph
• Long series of unsupervised extraction experiments did not give useful results
• Supervised: trained on available data (BioInfer)
Types of patterns
Example sentence from BioInfer:
A binary complex of birch profilin and skeletal muscle actin could be isolated by gel chromatography.
Various patterns:
- complex of a and b (triggered by a noun)
- a inhibits b (triggered by a verb)
- a-inducing b (inside syntactic chunk)
- ...
Features
Create an instance for every pair of named entities (NE).
An instance contains information about:
- syntactic features of the 2 NE and their close-by context
- syntactic features of the common ancestor of both NEs and of the common ancestor’s ancestor
- syntactic features of the head of the syntactic chunk of NEs
- the pattern of lemmas, syntactic chunks and POS tags between the 2 NEs (shallow and dependency tree)
- Distances: between NEs, between NE and common ancestor, ...
In total 83 different features
Evaluation
Test set               Precision  Recall  F-score
BioInfer (baseline)    41.09      100     58.24
BioInfer (TiMBL)       62.29      69.75   65.81
BioInfer (SVM)         60.91      78.78   68.70
Biograph corpus        79.03      55.98   65.54
All systems were trained on BioInfer
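The F-score column appears to be the balanced harmonic mean of precision and recall. The following is a minimal sketch under that assumption (equal weighting of precision and recall); the function name `f_score` is illustrative and not taken from the slides. The reported values are reproduced to within rounding.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean) of precision and recall, in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the evaluation table
rows = {
    "BioInfer (baseline)": (41.09, 100.0),
    "BioInfer (TiMBL)": (62.29, 69.75),
    "BioInfer (SVM)": (60.91, 78.78),
    "Biograph corpus": (79.03, 55.98),
}

for name, (p, r) in rows.items():
    print(f"{name}: {f_score(p, r):.2f}")
```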
Manual Evaluation on Biograph corpus
                                Precision
System                          65
Gold Standard (upper baseline)  75
The Biograph corpus has been semi-automatically constructed => precision is not 100.
Text Mining (2)
• Reliability / certainty of extracted relations
• Should be useful in graph data mining
- (certainty of the machine learner)
- Handle negation and modality
• [-1, +1]
Contents
• Motivation
• Negation
- Task description
- Related work
- Corpus
- System description
- Results
• Modality
- Related work
- Results
• Negation vs. modality
• Conclusions
• Further Research
Motivation
• Extracted information that falls in the scope of hedge or negation cues cannot be presented as factual information
• Vincze et al. (2008) report that 17.70% of the sentences in the BioScope corpus contain hedge cues and 13% contain negation cues
• Light et al. (2004) estimate that 11% of sentences in MEDLINE abstracts contain speculative fragments
Finding the scope of negation
• Finding the scope of a negation cue means determining at a sentence level which words in the sentence are affected by the negation(s)
Analysis at the phenotype and genetic level showed that lack of CD5 expression was due neither to segregation of human autosome 11, on which the CD5 gene has been mapped, nor to deletion of the CD5 structural gene.
Related work
• Most of the related work focuses on detecting whether a term is negated or not
- Rule or regular expression based systems like NegEx (Chapman et al. 2001) and NegFinder (Mutalik et al. 2001)
- Machine learning systems like Averbuch et al. (2004)
- Huang and Lowe (2007) develop a hybrid system that combines regular expression matching with parsing in order to locate negated concepts
Corpus
• Medical and biological texts annotated with information about negation and speculation
PMA treatment, and <xcope id="X1.4.1"><cue type="negation" ref="X1.4.1">not</cue> retinoic acid treatment of the U937 cells</xcope> acts in inducing NF-KB expression in the nuclei.
• Corpora
          Clinical  Papers  Abstracts
#Docs.    1954      9       1273
#Sent.    6383      2670    11871
#Words    41985     60935   282243
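To make the annotation format concrete, here is a small sketch that reads the `<xcope>` / `<cue>` markup of the example sentence with Python's standard `xml.etree.ElementTree`. The wrapping `<sentence>` element and the helper name `extract_scopes` are assumptions made for illustration; the actual corpus files may be organised differently.

```python
import xml.etree.ElementTree as ET

# Example sentence from the slide, wrapped in a hypothetical <sentence> element
sentence_xml = (
    '<sentence>PMA treatment, and <xcope id="X1.4.1">'
    '<cue type="negation" ref="X1.4.1">not</cue>'
    ' retinoic acid treatment of the U937 cells</xcope>'
    ' acts in inducing NF-KB expression in the nuclei.</sentence>'
)


def extract_scopes(xml_text):
    """Return (cue type, cue text, scope text) for every annotated scope."""
    root = ET.fromstring(xml_text)
    results = []
    for xcope in root.iter("xcope"):
        # All text inside the scope, with whitespace normalised
        scope_text = " ".join("".join(xcope.itertext()).split())
        for cue in xcope.iter("cue"):
            # A cue belongs to the scope whose id it references
            if cue.get("ref") == xcope.get("id"):
                results.append((cue.get("type"), cue.text.strip(), scope_text))
    return results


for cue_type, cue_text, scope_text in extract_scopes(sentence_xml):
    print(cue_type, "|", cue_text, "|", scope_text)
```

Note that the scope includes the cue itself, which matches the convention used later in the post-processing step.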
Experimental Setting
• Abstracts corpus:
- 10-fold cross-validation experiments
• Clinical and papers corpora: robustness test
- Training on abstracts
- Testing on clinical and papers
System Description
• We model the scope finding task as two consecutive classification tasks:
- Finding negation cues: a token is classified as being at the beginning of a negation signal, inside or outside
- Finding the scope: a token is classified as being the first element of a scope sequence, the last, or neither
• Supervised machine learning approach
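The two classification tasks can be pictured as two label sequences over the same tokens. The sketch below is only an illustration: the tokenisation, the label strings `B`/`I`/`O` and `FIRST`/`LAST`/`NONE`, and the function names are assumptions, not the system's actual data format.

```python
def cue_labels(n_tokens, cue_spans):
    """B = first token of a negation signal, I = inside, O = outside."""
    labels = ["O"] * n_tokens
    for start, end in cue_spans:  # inclusive token indices
        labels[start] = "B"
        for i in range(start + 1, end + 1):
            labels[i] = "I"
    return labels


def scope_labels(n_tokens, scope_start, scope_end):
    """FIRST / LAST / NONE labels for one scope (inclusive token indices).

    Simplification: a one-token scope would need a separate convention,
    because FIRST and LAST would fall on the same token.
    """
    labels = ["NONE"] * n_tokens
    labels[scope_start] = "FIRST"
    labels[scope_end] = "LAST"
    return labels


tokens = ("PMA treatment , and not retinoic acid treatment of the U937 cells "
          "acts in inducing NF-KB expression in the nuclei .").split()

# "not" is the cue (token 4); its scope runs from "not" to "cells" (token 11)
for tok, cue, scope in zip(tokens,
                           cue_labels(len(tokens), [(4, 4)]),
                           scope_labels(len(tokens), 4, 11)):
    print(f"{tok}\t{cue}\t{scope}")
```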
System Architecture
Preprocessing
Finding Negation Cues
• We filter out negation cues that are unambiguous in the training corpus (17 out of 30)
• For the rest, a classifier predicts whether a token is the first token of a negation signal, inside or outside of it
- Algorithm: IGTREE as implemented in TiMBL (Daelemans et al. 2007)
- Instances represent all tokens in a sentence
- Features about the token in focus and its context
Features negation cue finding
• Of the token
- Lemma, word, POS and IOB chunk tag
• Of the token context
- Word, POS and IOB chunk tag of 3 tokens to the right and 3 to the left
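A windowed feature vector of this kind could be built as follows. This is a sketch only: the dictionary keys, the padding symbol `_` and the POS and chunk tags in the example are hypothetical, and the real system's feature order is not given on the slide.

```python
def cue_features(tokens, i, window=3, pad="_"):
    """Features for token i: its own lemma, word, POS and chunk tag, plus
    word, POS and chunk tag of `window` tokens to the left and to the right.

    tokens: list of dicts with keys "word", "lemma", "pos", "chunk".
    """
    focus = tokens[i]
    feats = [focus["lemma"], focus["word"], focus["pos"], focus["chunk"]]
    offsets = list(range(-window, 0)) + list(range(1, window + 1))
    for offset in offsets:
        j = i + offset
        if 0 <= j < len(tokens):
            ctx = tokens[j]
            feats += [ctx["word"], ctx["pos"], ctx["chunk"]]
        else:
            # Positions outside the sentence are padded
            feats += [pad] * 3
    return feats


# Illustrative tags, assumed for the example
sentence = [
    {"word": "No", "lemma": "no", "pos": "DT", "chunk": "B-NP"},
    {"word": "signs", "lemma": "sign", "pos": "NNS", "chunk": "I-NP"},
    {"word": "of", "lemma": "of", "pos": "IN", "chunk": "B-PP"},
    {"word": "tuberculosis", "lemma": "tuberculosis", "pos": "NN", "chunk": "B-NP"},
]

print(cue_features(sentence, 0))
```

Each token yields one instance of 4 + 6 × 3 = 22 features, which matches "instances represent all tokens in a sentence".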
Ambiguous Negation Cues in Abstracts Corpus
Results
• Baseline: tagging as negation signals the tokens that are negation signals in at least 50% of their occurrences in the training corpus
BASELINE TOKENS: absence, absent, cannot, could not, fail, failure, impossible, instead of, lack, miss, neither, never, no, none, nor, not, rather than, unable, with the exception of, without
BASELINE   PREC   RECALL  F1     IAA
Abstracts  82.00  95.17   88.09  94.46
Papers     84.01  92.46   88.03  79.42
Clinical   97.31  97.53   97.42  90.70
SYSTEM     PREC   RECALL  F1
Abstracts  84.72  98.75   91.20 (+3.11)
Papers     87.18  95.72   91.25 (+3.22)
Clinical   97.33  98.09   97.71 (+0.29)
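The 50% baseline can be sketched in a few lines. The input format (pairs of token and a flag saying whether that occurrence was annotated as a negation signal) and the function names are assumptions for illustration; multi-word signals such as "could not" or "instead of" are not handled in this simplified version.

```python
from collections import Counter


def baseline_cue_list(training_tokens, threshold=0.5):
    """Collect tokens that are negation signals in at least `threshold`
    of their occurrences.

    training_tokens: iterable of (lowercased token, is_negation_signal).
    """
    total, as_cue = Counter(), Counter()
    for token, is_cue in training_tokens:
        total[token] += 1
        if is_cue:
            as_cue[token] += 1
    return {t for t in total if as_cue[t] / total[t] >= threshold}


def tag_baseline(tokens, cue_list):
    """Mark every token that appears in the baseline list."""
    return [tok.lower() in cue_list for tok in tokens]


# Toy training data, invented for the example
training = (
    [("not", True)] * 3 + [("not", False)]
    + [("no", True)]
    + [("or", False)] * 2
    + [("fail", True), ("fail", False)]
)
cues = baseline_cue_list(training)
print(sorted(cues))
print(tag_baseline(["No", "signs", "of", "tuberculosis"], cues))
```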
Results system vs. baseline in abstracts corpus
• The system performs better
[Bar chart: Prec, Recall and F1 on a 0-100 scale, Baseline vs. System]
Results in the three corpora
• The system is portable
[Bar chart: Prec, Recall and F1 on a 75-100 scale for Abstracts, Papers and Clinical]
Discussion
• Cause of lower recall on papers corpus:
NOT        % negation signals  % classified correctly
Abstracts  58.89               98.25
Papers     53.22               93.68
Clinical   6.72                91.22
• Errors: not is classified as negation signal
However, programs for tRNA identification [...] do not necessarily perform well on unknown ones
The evaluation of this ratio is difficult because not all true interactions are known
Finding Scopes
• Three classifiers predict whether a token is the first token in the scope sequence, the last or neither
- MBL (Daelemans et al. 2007)
- SVMlight (Joachims 1999)
- CRF++ (Lafferty et al. 2001)
• A fourth classifier predicts the same taking as input the output of the previous classifiers
- CRF++
• The features used by the object classifiers and the metalearner are different
Finding Scopes
• Previous attempts: lower results
- Chunk-based classification, instead of word-based
- BIO classification of tokens (EMNLP’08) instead of FOL (First, Other, Last)
- Single classifier approach, instead of metalearner
Features Scope Finding: Object Classifiers
• Of the negation signal: Chain of words.
• Of the paired token: Lemma, POS, chunk IOB tag, type of chunk; lemma of the second and third tokens to the left; lemma, POS, chunk IOB tag, and type of chunk of the first token to the left and three tokens to the right; first word, last word, chain of words, and chain of POSs of the chunk of the paired token and of two chunks to the left and two chunks to the right.
• Of the tokens between the negation signal and the token in focus: Chain of POS types, distance in number of tokens, and chain of chunk IOB tags.
• Others: A feature indicating the location of the token relative to the negation signal (pre, post, same).
Features Scope Finding: Metalearner
• Of the negation signal: Chain of words, chain of POS, word of the two tokens to the right and two tokens to the left, token number divided by the total number of tokens in the sentence.
• Of the paired token: Lemma, POS, word of two tokens to the right and two tokens to the left, token number divided by the total number of tokens in the sentence.
• Of the tokens between the negation signal and the token in focus: Binary features indicating if there are commas, colons, semicolons, verbal phrases or one of the following words between the negation signal and the token in focus: whereas, but, although, nevertheless, notwithstanding, however, consequently, hence, therefore, thus, instead, otherwise, alternatively, furthermore, moreover.
• About the predictions of the three classifiers: prediction, previous and next predictions of each of the classifiers, full sequence of previous and full sequence of next predictions of each of the classifiers.
• Others: A feature indicating the location of the token relative to the negation signal (pre, post, same).
Parameters Classifiers
• TiMBL: IB1
- Similarity metric: overlap
- Feature weighting: gain ratio
- k-NN with k = 7
- Weighting class vote of neighbors as a function of their inverse linear distance
• SVM
- Classification
- Cost factor: 1
- Biased hyperplane
- Linear kernel function
• CRF
- Regularisation algorithm L2 for training
- Cut-off threshold of features: 1
- Unchanged hyper-parameter
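To illustrate the memory-based settings listed for TiMBL, here is a simplified sketch of k-nearest-neighbour classification with a weighted overlap metric and inverse-linear distance voting (nearest neighbour votes 1, furthest votes 0). It is not TiMBL's implementation: the feature weights are passed in by hand instead of being computed as gain ratio, the function names are invented, and details such as tie handling may differ from TiMBL.

```python
from collections import defaultdict


def overlap_distance(a, b, weights):
    """Weighted overlap metric: sum the weights of the mismatching features."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def knn_classify(instance, memory, weights, k=7):
    """Classify `instance` from stored examples.

    memory: list of (feature tuple, label).
    weights: one weight per feature (gain ratio in the slide's setting).
    """
    ranked = sorted(
        ((overlap_distance(instance, feats, weights), label)
         for feats, label in memory),
        key=lambda pair: pair[0],
    )[:k]
    d_near, d_far = ranked[0][0], ranked[-1][0]
    votes = defaultdict(float)
    for dist, label in ranked:
        # Inverse-linear weighting of the class vote
        if d_far == d_near:
            vote = 1.0
        else:
            vote = (d_far - dist) / (d_far - d_near)
        votes[label] += vote
    return max(votes, key=votes.get)


# Toy memory with two symbolic features, invented for the example
memory = [
    (("a", "x"), "FIRST"),
    (("a", "y"), "FIRST"),
    (("b", "y"), "NONE"),
    (("b", "z"), "NONE"),
    (("c", "z"), "LAST"),
]
print(knn_classify(("a", "x"), memory, weights=[1.0, 0.5], k=3))
```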
Post-processing
• Scope is always a consecutive block of scope tokens, including the negation signal
• The classifiers predict the first and last token of the scope sequence:
- None or more than one FIRST and one LAST elements might be predicted
• In the post-processing we apply some rules to select one FIRST and one LAST token
Example:
- If more than one token has been predicted as FIRST, take as FIRST the first token of the negation signal
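A possible shape for this post-processing step is sketched below. Only the rule for multiple FIRST predictions comes from the slide; the handling of a missing FIRST and of missing or multiple LAST predictions is an assumption made to keep the example complete, and the function name and label strings are illustrative.

```python
def select_scope(labels, cue_start, cue_end):
    """Pick exactly one FIRST and one LAST index from per-token predictions.

    labels: predicted "FIRST" / "LAST" / "NONE" for every token.
    cue_start, cue_end: inclusive token indices of the negation signal.
    """
    firsts = [i for i, lab in enumerate(labels) if lab == "FIRST"]
    lasts = [i for i, lab in enumerate(labels) if lab == "LAST"]

    if len(firsts) == 1:
        first = firsts[0]
    else:
        # Rule from the slide for several FIRST predictions: use the first
        # token of the negation signal. Assumption: same fallback when no
        # FIRST was predicted at all.
        first = cue_start

    # Assumption (not from the slides): keep the rightmost LAST prediction,
    # and end the scope at the cue when no LAST was predicted.
    last = max(lasts) if lasts else cue_end

    # The scope is one consecutive block that contains the negation signal
    first = min(first, cue_start)
    last = max(last, cue_end)
    return first, last


predicted = ["NONE"] * 12
predicted[4] = "FIRST"
predicted[6] = "FIRST"
predicted[11] = "LAST"
print(select_scope(predicted, cue_start=4, cue_end=4))
```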
Results
• Baseline: calculating the average length of the scope to the right of the negation signal and tagging that number of tokens as scope tokens
- Motivation: 85.70 % of scopes to the right
BASELINE PCS PCS-2 IAA
Abstracts 7.11 37.45 92.46
Papers 4.76 24.86 70.86
Clinical 12.95 62.27 76.29
SYSTEM PCS PCS-2
Abstracts 66.07 66.93
Papers 41.00 44.44
Clinical 70.75 71.21
SYSTEM (gold negs) PCS PCS-2
Abstracts +7.29 +7.17
Papers +9.26 +9.79
Clinical +16.52 +16.74
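The scope baseline described on this slide can be sketched as follows. The rounding of the average length, the input format and the function names are assumptions for illustration; the slides do not say how the average was converted into a token count.

```python
def average_right_length(training_scopes):
    """Average number of scope tokens to the right of the negation signal.

    training_scopes: list of (cue_end, scope_end) inclusive token indices.
    Scopes that end before the cue are ignored. Rounding to a whole number
    of tokens is an assumption.
    """
    lengths = [scope_end - cue_end
               for cue_end, scope_end in training_scopes
               if scope_end >= cue_end]
    return round(sum(lengths) / len(lengths)) if lengths else 0


def baseline_scope(n_tokens, cue_start, cue_end, avg_len):
    """Scope = the cue plus `avg_len` tokens to its right, clipped to the sentence."""
    return cue_start, min(cue_end + avg_len, n_tokens - 1)


# Toy training scopes, invented for the example
avg = average_right_length([(4, 11), (0, 3), (2, 8)])
print(avg)
print(baseline_scope(n_tokens=21, cue_start=4, cue_end=4, avg_len=avg))
```

Because the baseline always extends to the right by a fixed amount, it can only match a gold scope exactly when that scope happens to have the average length, which is consistent with the low baseline PCS figures.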
Results on the abstracts corpus
[Bar chart: Prec, Rec, F1, PCS and PCS2 on a 0-100 scale for Baseline, System and Gold negs]
The system performs clearly better than baseline
There is a higher upper bound calculated with gold standard negation signals
Results on the three corpora
[Bar chart: Prec, Rec, F1, PCS and PCS2 on a 0-90 scale for Abstracts, Papers and Clinical]
The system is portable
Lower results in the papers corpus
Discussion
• Clinical reports are easier to process than abstracts and papers
- Negation signal no is very frequent (76.65 %) and has a high PCS (73.10 %)
No findings to account for symptoms
No signs of tuberculosis
- Sentences are shorter than in abstracts and papers
• Average length: 7.8 tokens vs. 26.43 and 26.24
• 75.85% of the sentences have 10 or fewer tokens
Discussion
• Papers are more difficult to process than abstracts
- Negation signal not is frequent (53.22%) and has a low PCS (39.50) in papers. Why?
NOT Papers Abstracts
Ambiguity (%¬neg) 25.56 14.29
Av. scope length 6.45 8.85
% Scopes left 23.28 16.41
Av. scope left 5.60 8.82
PCS results of the metalearner compared to the object classifiers
[Bar chart: PCS on a 0-90 scale on Abstracts, Papers and Clinical for TiMBL, SVM, CRF and Meta]
The metalearner performs better than the three object classifiers (except SVMs on the clinical corpus)
Finding the scope of modality
• Finding the scope of a hedge cue means determining at a sentence level which words in the sentence are affected by the hedge cue(s)
These results [suggest that expression of c-jun, jun B and jun D genes [might be involved in terminal granulocyte differentiation [or in regulating granulocyte functionality]]].
Related Work
• Theoretical descriptions that define hedging and modality (Lakoff 1972, Palmer 1986) based on corpora (Hyland 1998, Saurí et al. 2006, Thompson et al. 2008)
• Machine learning experiments that focus on classifying a sentence into speculative or definite (Medlock and Briscoe 2007, Medlock 2008, Szarvas 2008, Kilicoglu and Bergler 2008)
Related work
• The system that we present here is based on the system developed for processing the scope of negation cues
• Our goal is to check whether the same approach can be applied to processing hedge cues
System Architecture
Preprocessing
Finding Hedge Cues
• A classifier predicts whether a token is at the beginning of a hedge cue, inside or outside of it
- Algorithm: IGTREE as implemented in TiMBL (Daelemans et al. 2007)
- Instances represent all tokens in a sentence
- Features about the token in focus and its context
Ambiguity in Hedge Cues: Sample from Abstracts Corpus
# Hedge cues: 110
# Non-ambiguous hedge cues: 40
Results
• Baseline: tagging as hedge cues a list of words extracted from the abstracts corpus
BASELINE TOKENS: appear, apparent, apparently, believe, estimate, hypothesis, hypothesize, if, imply, likely, may, might, or, perhaps, possible, possibly, postulate, potentially, presumably, probably, propose, putative, should, seem, speculate, suggest, support, suppose, suspect, think, uncertain, unclear, unknown, unlikely, whether, would
BASELINE   PREC   RECALL  F1     IAA
Abstracts  55.62  71.77   62.67  79.12
Papers     54.39  61.21   57.60  77.60
Clinical   66.55  40.78   50.57  84.01
SYSTEM     PREC   RECALL  F1
Abstracts  90.81  79.84   84.77
Papers     75.35  68.18   71.59
Clinical   88.10  27.51   41.92
Results system vs. baseline in abstracts corpus
• The system performs better than baseline, with a main increase in precision (+35.19)
[Bar chart: Prec, Recall and F1 on a 0-100 scale, Baseline vs. System]
Results in the three corpora
• The system is portable in terms of precision, but less so in terms of recall, which decreases (-13.27) in the clinical corpus. Why?
[Bar chart: Prec, Recall and F1 on a 0-100 scale for Abstracts, Papers and Clinical]
Discussion
• Cause of lower recall on clinical corpus:
OR         total #  % as hedge  # as hedge  % of hedges  recall
Abstracts  1062     11.29       118         4.42         0.129
Papers     153      16.99       27          4.04         0.137
Clinical   281      98.22       276         24.62        0.007
• The use of OR as hedge cue is difficult to interpret
+CUE: Nucleotide sequence and PCR analyses demonstrated the presence of novel duplications or deletions involving the NF-kappa B motif.
-CUE: In nuclear extracts from monocytes or macrophages, induction of NF-KB occurred only if the cells were previously infected with HIV-1. (= AND)
Finding Scopes
• Three classifiers predict whether a token is the first token in the scope sequence, the last token, or neither
- MBL (Daelemans et al. 2007)
- SVMlight (Joachims 1999)
- CRF++ (Lafferty et al. 2001)
• A fourth classifier (the metalearner) predicts the same classes, taking as input the output of the previous classifiers
- CRF++
• The features used by the object classifiers and the metalearner are different
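A minimal sketch of the stacking idea described above, assuming each object classifier emits one of FIRST, LAST or NONE per token. The function name, the classifier keys and the `pred_` feature names are illustrative only; the actual feature sets are not listed on this slide, and training the metalearner itself is not shown.

```python
from typing import Dict, List

LABELS = ("FIRST", "LAST", "NONE")


def metalearner_features(token_preds: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Combine per-token predictions of the object classifiers into
    one feature row per token for a metalearner (hypothetical layout)."""
    names = sorted(token_preds)
    lengths = {len(token_preds[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all classifiers must label the same number of tokens")
    for name in names:
        for label in token_preds[name]:
            if label not in LABELS:
                raise ValueError(f"unknown label {label!r} from {name}")
    n_tokens = lengths.pop()
    # One row per token; each object classifier contributes one feature.
    return [
        {f"pred_{name}": token_preds[name][i] for name in names}
        for i in range(n_tokens)
    ]
```

Each row could then be extended with further features before it is passed to the fourth classifier.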
Postprocessing
• Scope is always a consecutive block of scope tokens, including the negation signal
• The classifiers predict the first and last token of the scope sequence:
- The classifiers might predict none, or more than one, FIRST or LAST element
• In the postprocessing we apply some rules to select one FIRST and one LAST token
Example:
- If more than one token has been predicted as FIRST, take as FIRST the first token of the negation signal
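The postprocessing step can be sketched as follows. Only the rule for several FIRST predictions comes from the slide; the other fallbacks (no FIRST, several LAST, no LAST) are assumptions made to keep the sketch complete, and the function name and signature are hypothetical.

```python
from typing import List, Tuple


def select_scope(labels: List[str], cue_first: int, cue_last: int) -> Tuple[int, int]:
    """Return (first, last) token indices of one consecutive scope block."""
    firsts = [i for i, lab in enumerate(labels) if lab == "FIRST"]
    lasts = [i for i, lab in enumerate(labels) if lab == "LAST"]

    if len(firsts) == 1:
        first = firsts[0]
    else:
        # Rule from the slide: with several FIRST predictions, use the
        # first token of the negation signal.
        # ASSUMPTION: the same fallback is used when no FIRST is predicted.
        first = cue_first

    if len(lasts) == 1:
        last = lasts[0]
    elif lasts:
        # ASSUMPTION: with several LAST predictions, keep the rightmost one.
        last = max(lasts)
    else:
        # ASSUMPTION: with no LAST prediction, extend to the sentence end.
        last = len(labels) - 1

    # The scope always includes the negation signal, so widen if needed.
    return min(first, cue_first), max(last, cue_last)
```

For example, with two FIRST predictions at positions 1 and 3 and the signal at position 3, the scope starts at the signal.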
Results
• Baseline: calculating the average length of the scope to the right of the hedge cue and tagging that number of tokens as scope tokens
- Motivation: 82.45 % of scopes to the right
| Corpus | Baseline PCS | Baseline PCS-2 | System PCS | System PCS-2 | System, gold cues PCS | System, gold cues PCS-2 |
|---|---|---|---|---|---|---|
| Abstracts | 3.15 | 3.17 | 65.55 | 66.10 | +11.58 | +12.11 |
| Papers | 2.19 | 2.26 | 35.92 | 42.37 | +12.02 | +15.84 |
| Clinical | 2.72 | 3.53 | 26.21 | 27.44 | +34.38 | +36.50 |
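The baseline described on this slide can be sketched as below. This is an illustration under stated assumptions: scope lengths are counted in tokens to the right of the cue, the average is rounded to a whole number of tokens, and the cue itself is included in the tagged block. The function names and the input format are hypothetical.

```python
from typing import List, Tuple


def average_right_scope_length(examples: List[Tuple[int, int]]) -> int:
    """examples: (cue_index, scope_last_index) pairs from training data.
    Returns the rounded mean number of scope tokens right of the cue."""
    if not examples:
        raise ValueError("need at least one training example")
    lengths = [max(0, last - cue) for cue, last in examples]
    return round(sum(lengths) / len(lengths))


def baseline_scope(n_tokens: int, cue_index: int, length: int) -> Tuple[int, int]:
    """Tag the cue and `length` tokens to its right as scope,
    clipped at the end of the sentence. Returns (first, last)."""
    return cue_index, min(n_tokens - 1, cue_index + length)
```

A fixed-length guess like this rarely matches a whole scope exactly, which fits the very low baseline PCS values in the table.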
Baseline Modality vs Negation
• Baseline results are much lower for the hedge scope finder
[Figure: Baseline Results PCS-2 (scale 0-70) for Abstracts, Papers and Clinical; series: Negation, Hedge]
Results on the abstracts corpus
[Figure: Prec, Rec, F1, PCS and PCS2 (scale 0-90); series: Baseline, System, Gold negs]
• The system performs clearly better than the baseline
• Gold-standard hedge cues give a higher upper bound
Results on the three corpora
• Results are lower for papers (PCS -29.63) and clinical (PCS -39.34). Why?
[Figure: Prec, Rec, F1, PCS and PCS2 (scale 0-90); series: Abstracts, Papers, Clinical]
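The two PCS drops quoted on this slide follow from the system PCS scores on the earlier Results slide (Abstracts 65.55, Papers 35.92, Clinical 26.21). A short check, with illustrative variable and function names:

```python
# System PCS per corpus, copied from the Results slide.
system_pcs = {"Abstracts": 65.55, "Papers": 35.92, "Clinical": 26.21}


def drop_vs_abstracts(pcs: dict) -> dict:
    """PCS difference of each corpus relative to the abstracts corpus."""
    reference = pcs["Abstracts"]
    return {
        corpus: round(score - reference, 2)
        for corpus, score in pcs.items()
        if corpus != "Abstracts"
    }


drops = drop_vs_abstracts(system_pcs)
# drops["Papers"] is -29.63 and drops["Clinical"] is -39.34
```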
Discussion
• Why are the results in papers lower?
- 41 cues (47.00%) in papers are not in abstracts
- Some cues that are in abstracts and are frequent in papers get low scores.
• Example: suggest (92.33 PCS in abstracts vs. 62.85 PCS in papers)
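The share of cues missing from the training corpus can be computed with a set difference. The sketch below assumes the percentage is counted over distinct cue types; the slide does not say whether types or occurrences were counted. The cue lists in the usage lines are toy data, not the real corpus inventories.

```python
from typing import Iterable, Set, Tuple


def unseen_cues(target_cues: Iterable[str],
                training_cues: Iterable[str]) -> Tuple[Set[str], float]:
    """Return the cue types of the target corpus that never occur in the
    training corpus, and their share (in %) of all target cue types."""
    target = {cue.lower() for cue in target_cues}
    training = {cue.lower() for cue in training_cues}
    if not target:
        raise ValueError("target corpus has no cues")
    missing = target - training
    return missing, round(100.0 * len(missing) / len(target), 2)


# Toy example: one of four target cue types is unseen in training.
missing, share = unseen_cues(
    ["suggest", "may", "rule out", "or"],
    ["suggest", "may", "or", "indicate"],
)
```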
Discussion
• Errors with the cue suggest:
- Bibliographic references
- Sentences with a format typical of papers and not of abstracts
The conservation from Drosophila to mammals of these two structurally distinct but functionally similar E3 ubiquitin ligases is likely to reflect a combination of evolutionary advantages associated with: (i) specialized expression pattern, as evidenced by the cell-specific expression of the neur gene in sensory organ precursor cells [52]; (ii) specialized function, as suggested by the role of murine MIB in TNFα signaling [32]; (iii) regulation of protein stability, localization, and/or activity.
Discussion
• Why are the results in clinical lower?
- 68 cues (35.45%) in clinical are not in abstracts
- Frequent hedge cues in clinical are not represented in abstracts
| Cue | % Clinical | % Abstracts | PCS Clinical |
|---|---|---|---|
| consistent with | 5.28 | 0.00 | 0.00 |
| evaluate for | 6.67 | 0.00 | 3.84 |
| or | 21.41 | 3.99 | 0.00 |
| rule out | 5.12 | 0.00 | 0.00 |
Hedge Scope Finder Compared to Negation Scope Finder
• Gold hedge cues = no error propagation from the first phase
• The abstracts results show that the same system can be applied to finding the scope of both negation and hedge cues
• The systems are equally portable to the papers corpus
• The negation system is more portable to the clinical corpus
[Figure: PCS - Gold Cues Systems (scale 0-100) for Abstracts, Papers and Clinical; series: Negation, Hedge]
Hedge Scope Finder Compared to Negation Scope Finder
• Error propagation from the first phase:
- The hedge system is much less portable to the clinical corpus than the negation system
[Figure: PCS - Predicted Cues Systems (scale 0-80) for Abstracts, Papers and Clinical; series: Negation, Hedge]
Conclusions
• We have presented a metalearning approach to processing the scope of negation cues. The metalearner performs better than the object classifiers
- We achieve a 32.07% error reduction over previous results (Morante et al. 2008)
• We have shown that the same scope finding approach can be applied to both negation and modality
- Finding the scope of modality cues is more difficult
- Modality cues are more diverse and ambiguous than negation cues
Conclusions
• We have shown that the system is portable to different corpora, although:
- Negation & modality: results are worse for the papers corpus
- In general, modality cues are less portable across corpora (Szarvas 2008)
• Negation: results per corpus are mostly determined by the scores of the negation signals no and not
• Modality: results per corpus are determined by corpus-specific cues
Further Research
• Error analysis to explain:
- why the metalearner performs better than the object classifiers
- why the papers corpus is more difficult to process
- why some negation signals are more difficult to process than others
• Experimenting with more features
- dependency syntax
• Testing on general-domain corpora
• Experimenting with other machine learning approaches (constraint satisfaction)
References
• R. Morante and W. Daelemans. A metalearning approach to processing the scope of negation. Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL), pages 21–29, Boulder, Colorado, June 2009. ACL.
• R. Morante and W. Daelemans. Learning the scope of hedge cues in biomedical texts. Proceedings of the Workshop on BioNLP, pages 28–36, Boulder, Colorado, June 2009. ACL.
• R. Morante, A. Liekens, and W. Daelemans. Learning the scope of negation in biomedical texts. Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 715–724, Honolulu, Hawaii, October 2008. ACL.
Acknowledgements
• GOA project BIOGRAPH of the University of Antwerp
- www.biograph.be
• BioScope team
• Thanks for your attention!
Results Scope Finding
• Baseline: calculating the average length of the scope to the right of the negation signal and tagging that number of tokens as scope tokens (85.70 % of scopes to the right)
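| BASELINE | PREC | RECALL | F1 | PCS | PCS-2 | IAA |
|---|---|---|---|---|---|---|
| Abstracts | 76.68 | 78.26 | 77.46 | 7.11 | 37.45 | 92.46 |
| Papers | 69.34 | 66.92 | 68.11 | 4.76 | 24.86 | 70.86 |
| Clinical | 86.85 | 74.96 | 80.47 | 12.95 | 62.27 | 76.29 |

| SYSTEM | PREC | RECALL | F1 | PCS | PCS-2 |
|---|---|---|---|---|---|
| Abstracts | 81.76 | 83.45 | 82.60 | 66.07 | 66.93 |
| Abstracts (gold) | +8.92 | +7.23 | +8.07 | +7.29 | +7.17 |
| Papers | 72.21 | 69.72 | 70.94 | 41.00 | 44.44 |
| Papers (gold) | +12.26 | +15.23 | +13.77 | +9.26 | +9.79 |
| Clinical | 86.38 | 82.14 | 84.20 | 70.75 | 71.21 |
| Clinical (gold) | +5.27 | +1-.36 | +7.87 | +16.52 | +16.74 |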
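The F1 column can be checked against precision and recall with the usual harmonic mean. The sketch below recomputes F1 for the baseline and system rows (the gold rows are differences and are left out). The reported values agree to within rounding; a tolerance of 0.02 is assumed, since the slide scores were presumably computed from unrounded counts.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "baseline/Abstracts": (76.68, 78.26, 77.46),
    "baseline/Papers": (69.34, 66.92, 68.11),
    "baseline/Clinical": (86.85, 74.96, 80.47),
    "system/Abstracts": (81.76, 83.45, 82.60),
    "system/Papers": (72.21, 69.72, 70.94),
    "system/Clinical": (86.38, 82.14, 84.20),
}

# Absolute gap between recomputed and reported F1, per row.
gaps = {name: abs(f1(p, r) - f) for name, (p, r, f) in reported.items()}
```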