TM and NLP for Biology: Research Issues in HPSG Parsing
Junichi TSUJII
School of Computer Science, National Centre for Text Mining
University of Manchester, UK
Department of Computer Science, School of Information Science and Technology
University of Tokyo, JAPAN
Increase in Medline
[Figure: MEDLINE articles by year, 1964 to 2002. Horizontal axis: year. Left axis: increments (0 to 600,000). Right axis: accumulation (0 to 14,000,000).]
G-protein coupled receptor
• Before 1988: 9 papers
• 1992: 256 papers
• 2005: 14,000 papers
MEDLINE alone
• Articles added: more than 0.5 million per year, more than 1.3 thousand per day
Medline Access
• 1997: 0.163 M accesses/month
• 2006: 82.027 M accesses/month (500 times more) [D.L.Banville 2006]
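The growth figures above can be checked with a few lines of arithmetic. This is only a sanity check on the quoted numbers; the variable names are illustrative and a 365-day year is assumed.

```python
# Figures quoted on the slide
articles_per_year = 500_000      # "more than 0.5 million per year"
accesses_1997 = 0.163e6          # accesses per month in 1997
accesses_2006 = 82.027e6         # accesses per month in 2006

# Assumption: a 365-day year
articles_per_day = articles_per_year / 365
access_growth = accesses_2006 / accesses_1997

print(f"{articles_per_day:.0f} articles per day")    # 1370 articles per day
print(f"{access_growth:.0f} times more accesses")    # 503 times more accesses
```

Both results agree with the rounded values on the slide (more than 1.3 thousand per day, roughly 500 times more).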
NaCTeM (www.nactem.ac.uk)
• First such centre in the world
• Funding: JISC, BBSRC, EPSRC
• Consortium investment
• Chair in TM (Prof. J. Tsujii, Univ. Tokyo)
• Location: Manchester Interdisciplinary Biocentre (MIB), www.mib.ac.uk, funded by the Wellcome Trust
• Initial focus: biomedical academic community
• Extend services to industry
• Extend focus to other domains (social sciences)
Consortium
• Universities of Manchester, Liverpool
• Service activity run by MIMAS (National Centre for Dataset Services), within MC (Manchester Computing)
• Self-funded partners
– San Diego Supercomputing Center
– University of California, Berkeley
– University of Geneva
– University of Tokyo
• Strong industrial & academic support
– IBM, AZ, EBI, Wellcome Trust, Sanger Institute, Unilever, NowGEN, MerseyBio, …
NLP and TM
Text Mining
• Text as a bag of words
• Words as surface strings
Natural Language Processing
• Language as a complex system linking surface strings of characters with their meanings
• Text and words as structured objects
NLP-based TM
• Linking text with knowledge
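A small sketch of why the bag-of-words view loses information that a structured view keeps. The two sentences and the relation triples are made up for illustration; the triples are written by hand and are not parser output.

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    """Lower-cased, whitespace-tokenised word counts; word order is discarded."""
    return Counter(sentence.lower().split())

# Two illustrative sentences that make opposite claims
a = "protein A activates protein B"
b = "protein B activates protein A"

# As bags of words the two sentences are indistinguishable
same_bag = bag_of_words(a) == bag_of_words(b)

# A structured view (predicate, arg1, arg2) keeps who acts on whom
rel_a = ("activate", "protein A", "protein B")
rel_b = ("activate", "protein B", "protein A")
same_relation = rel_a == rel_b

print(same_bag, same_relation)  # True False
```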
Non-Trivial Mappings
Language Domain: linguistic expressions
Knowledge Domain: concepts and relationships among them, motivated independently of language
Terminology, Parsing, Paraphrasing
From surface diversities and ambiguities to conceptual invariants
Example
Non-trivial Mapping
Language Domain, Knowledge Domain (motivated independently of language)
Same relations with different structures
[A] protein activates [B] (Pathway extraction)
• Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …..
• Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
• Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
Retrieval using Regional Algebra
[sentence] > ([arg1_activate] > [protein])
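The query above reads as "sentences containing an arg1-of-activate region that itself contains a protein region". A minimal sketch of that containment operator follows, assuming regions are character-offset pairs; the offsets and the `containing` helper are hypothetical and do not reflect the retrieval engine's actual implementation.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end) character offsets, end exclusive

def containing(outer: List[Region], inner: List[Region]) -> List[Region]:
    """Region-algebra 'A > B': the regions of A that contain at least one region of B."""
    return [a for a in outer
            if any(a[0] <= b[0] and b[1] <= a[1] for b in inner)]

# Hypothetical annotations over a two-sentence text
sentence = [(0, 40), (41, 90)]
arg1_activate = [(0, 12), (60, 70)]   # spans filling arg1 of "activate"
protein = [(4, 12), (75, 82)]         # spans tagged as protein names

# [sentence] > ([arg1_activate] > [protein])
hits = containing(sentence, containing(arg1_activate, protein))
print(hits)  # [(0, 40)]
```

Only the first sentence is returned: the second has an arg1 span and a protein span, but the protein is not inside the arg1 span.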
Predicate-argument structure
Parser based on Probabilistic HPSG (Enju)
p53 has been shown to directly activate the Bcl-2 protein
[Figure: parse tree over the sentence (nodes S, NP, VP, ADVP) with predicate-argument links labelled arg1, arg2, arg3]
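Predicate-argument structure: output of the probabilistic HPSG parser (Enju)
The/DT protein/NN is/VBZ activated/VBN by/IN it/PRP
[Figure: parse tree (dt, np, vp, vp, pp, np; np, pp; vp; vp; s) with predicate-argument links labelled arg1, arg2, mod]
Semantic Retrieval System Using Deep Syntax: MEDIE
• Passive
• Passive and Infinitival Clause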
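To illustrate why predicate-argument structure helps retrieval, here is a toy rule that maps the passive example onto the same structure as its active counterpart. It assumes the common convention that arg1 is the logical subject and arg2 the logical object. The pattern match below is a hand-written stand-in, not how Enju derives its output.

```python
from typing import List, Optional, Tuple

def passive_to_pas(tagged: List[Tuple[str, str]]) -> Optional[dict]:
    """Toy rule: '<subject> be VBN by <agent>' gives arg1=agent, arg2=subject."""
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    if "VBN" not in tags:
        return None
    v = tags.index("VBN")
    # Require an auxiliary before the participle and "by" right after it
    if v == 0 or v + 1 >= len(words) or words[v + 1].lower() != "by":
        return None
    subject = " ".join(words[: v - 1])   # everything before the auxiliary
    agent = " ".join(words[v + 2:])      # everything after "by"
    return {"pred": words[v], "arg1": agent, "arg2": subject}

tagged = [("The", "DT"), ("protein", "NN"), ("is", "VBZ"),
          ("activated", "VBN"), ("by", "IN"), ("it", "PRP")]
pas = passive_to_pas(tagged)
print(pas)  # {'pred': 'activated', 'arg1': 'it', 'arg2': 'The protein'}
```

A query for "X activates the protein" can then match both active and passive sentences by comparing arg1 and arg2 rather than surface word order.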
Demos
• MEDIE
• Info-PubMed
Performance of Semantic Parser

| | Penn Treebank | GENIA |
|---|---|---|
| Coverage | 99.7% | 99.2% |
| F-Value (PA relations) | 87.4% | 86.4% |
| Sentence Precision | 39.2% | 31.8% |
| Processing Time | 0.68 sec | 1.00 sec |
Scalability of TM Tools
Target Corpus: MEDLINE corpus
• The number of papers: 14,792,890
• The number of abstracts: 7,434,879
• The number of sentences: 70,815,480
• The number of words: 1,418,949,650
• Compressed data size: 3.2GB
• Uncompressed data size: 10GB
Suppose, for example, that it takes one second to parse one sentence: 70 million seconds, that is, about 2 years
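The "about 2 years" estimate follows directly from the sentence count. A quick check, assuming sequential processing on one CPU and a 365-day year:

```python
SENTENCES = 70_815_480        # MEDLINE sentence count quoted above
SECONDS_PER_SENTENCE = 1.0    # the slide's hypothetical parsing cost

total_seconds = SENTENCES * SECONDS_PER_SENTENCE
days = total_seconds / 86_400     # seconds per day
years = days / 365

print(f"{days:.0f} days, about {years:.1f} years")  # 820 days, about 2.2 years
```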
TM and GRID
• Solution
– The entire MEDLINE was parsed by distributed PC clusters consisting of 340 CPUs
– Parallel processing was managed by grid platform GXP [Taura2004]
• Experiments
– The entire MEDLINE was parsed in 8 days
• Output
– Syntactic parse trees and predicate argument structures in XML format
– The data sizes of compressed/uncompressed output were 42.5GB/260GB.
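A back-of-the-envelope reading of these cluster figures, using the MEDLINE sentence count (70,815,480) and the 10GB uncompressed input size given for the target corpus. The per-sentence figure is an upper bound on average CPU cost, since it assumes all 340 CPUs were busy for the full 8 days; it is an estimate, not a measured parsing speed.

```python
SENTENCES = 70_815_480
CPUS = 340
WALL_CLOCK_DAYS = 8
INPUT_GB, OUTPUT_GB = 10, 260     # uncompressed input and output sizes

# Ideal wall-clock time if every sentence took exactly one second
ideal_days = SENTENCES * 1.0 / CPUS / 86_400

# CPU-seconds available per sentence, assuming full utilisation for 8 days
cpu_seconds_per_sentence = WALL_CLOCK_DAYS * 86_400 * CPUS / SENTENCES

# How much larger the XML output is than the raw text
expansion = OUTPUT_GB / INPUT_GB

print(f"ideal: {ideal_days:.1f} days")                            # ideal: 2.4 days
print(f"budget: {cpu_seconds_per_sentence:.1f} CPU-s/sentence")   # budget: 3.3 CPU-s/sentence
print(f"output is {expansion:.0f}x the input size")               # output is 26x the input size
```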
Efficient Parsing for HPSG
Background: HPSG
• Head-Driven Phrase Structure Grammar (HPSG) [Pollard and Sag, 1994]
– Lexicalized and constraint-based grammar
– A few rule schemata: general constraints on linguistic constructions
– Constraints embedded in the lexicon: word-specific constraints
– Constraints between phrase structures and semantic structures
I like it
Parsing by HPSG
Parsing by HPSG: Assignment of Lexical Entries
I:    [HEAD noun, SUBJ < >, COMPS < >]
like: [HEAD verb, SUBJ <NP>, COMPS <NP>]
it:   [HEAD noun, SUBJ < >, COMPS < >]
Application of Rule Schema: Head-Complement
I:       [HEAD noun, SUBJ < >, COMPS < >]
like:    [HEAD verb, SUBJ <[1]>, COMPS <[2]>]
it:      [2] [HEAD noun, SUBJ < >, COMPS < >]
like it: [HEAD verb, SUBJ <[1]>, COMPS < >]
Application of Rule Schema: Subject-Head
I:         [1] [HEAD noun, SUBJ < >, COMPS < >]
like:      [HEAD verb, SUBJ <[1]>, COMPS <[2]>]
it:        [2] [HEAD noun, SUBJ < >, COMPS < >]
like it:   [HEAD verb, SUBJ <[1]>, COMPS < >]
I like it: [HEAD verb, SUBJ < >, COMPS < >]
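The two schema applications for "I like it" can be sketched as follows. This is a deliberately simplified model: signs carry only HEAD, SUBJ and COMPS, and matching is done by comparing atomic category labels instead of unifying typed feature structures. The `Sign` class and helper names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Sign:
    head: str                                        # "noun" or "verb"
    subj: List[str] = field(default_factory=list)    # categories still required as subject
    comps: List[str] = field(default_factory=list)   # categories still required as complements

def category(sign: Sign) -> str:
    """In this toy grammar a saturated noun sign counts as an NP."""
    if sign.head == "noun" and not sign.subj and not sign.comps:
        return "NP"
    return sign.head

def head_complement(head: Sign, comp: Sign) -> Optional[Sign]:
    """Head-Complement schema: the head consumes its first COMPS element."""
    if head.comps and head.comps[0] == category(comp):
        return Sign(head.head, list(head.subj), head.comps[1:])
    return None

def subject_head(subj: Sign, head: Sign) -> Optional[Sign]:
    """Subject-Head schema: a COMPS-saturated head consumes its SUBJ element."""
    if not head.comps and head.subj and head.subj[0] == category(subj):
        return Sign(head.head, head.subj[1:], [])
    return None

# Assignment of lexical entries
lexicon = {
    "I": Sign("noun"),
    "like": Sign("verb", subj=["NP"], comps=["NP"]),
    "it": Sign("noun"),
}

vp = head_complement(lexicon["like"], lexicon["it"])   # "like it"
s = subject_head(lexicon["I"], vp)                     # "I like it"
print(vp)
print(s)
```

The final sign has empty SUBJ and COMPS lists, which corresponds to a complete sentence headed by the verb.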
Inefficiency of HPSG Parsing
• Complex DAG: typed feature structures
– Abstract machine for unification (LiLFeS)
• Unification: expensive operation (⇔ CFG approximation: CFG filtering)
• Assignment of lexical entries
– High reduction of search space / supertagging
Filtering with CFG (1/5)
• 2-phased parsing
– Approximate HPSG with a CFG while keeping important constraints.
– The obtained CFG might over-generate, but can be used for filtering.
– Rewriting in CFG is far less expensive than applying rule schemata, principles and so on.
[Figure: HPSG feature structures are compiled into a CFG. Input sentences pass through the built-in CFG parser plus LiLFeS unification parsing. Output: complete parse trees.]
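The two-phase idea can be sketched with a tiny CKY recognizer acting as the cheap filter. The grammar, category names and candidate sequences below are made up for illustration; a real CFG approximation is compiled from the HPSG grammar and is far larger.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# Hypothetical CFG approximation: binary rules over atomic category symbols
RULES: Dict[Tuple[str, str], Set[str]] = {
    ("V_trans", "NP"): {"VP"},
    ("NP", "VP"): {"S"},
}

def cfg_accepts(tags: List[str], goal: str = "S") -> bool:
    """CKY recognition: can this sequence of lexical categories derive `goal`?"""
    n = len(tags)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tag in enumerate(tags):
        chart[i][i + 1].add(tag)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for left, right in product(chart[i][j], chart[j][k]):
                    chart[i][k] |= RULES.get((left, right), set())
    return goal in chart[0][n]

def filter_assignments(candidates: List[List[str]]) -> List[List[str]]:
    """Phase 1: keep only the assignments the cheap CFG accepts.
    Phase 2 (full unification-based parsing) would run on the survivors only."""
    return [tags for tags in candidates if cfg_accepts(tags)]

# Candidate lexical-category sequences for "I like it" (illustrative)
candidates = [
    ["NP", "V_trans", "NP"],
    ["NP", "V_intrans", "NP"],
    ["NP", "P", "NP"],
]
survivors = filter_assignments(candidates)
print(survivors)  # [['NP', 'V_trans', 'NP']]
```

Because the CFG may over-generate, survivors are not guaranteed to parse under the full grammar; the filter only removes assignments that cannot succeed.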
System Overview
[Figure: pipeline for the input sentence "I like it". Supertagger: each word receives a stack of candidate lexical entries ordered by probability P, highest first (for example [HEAD noun, SUBJ < >, COMPS < >] for "I" and "it", [HEAD verb, SUBJ <NP>, COMPS <NP>] for "like"). CFG Filtering: produces candidate sequences of lexical entries for the sentence. Deterministic Shift/Reduce Parser: parses the sequences.]
Experiment Results

| Model | LP(%) | LR(%) | F1(%) | Avg. time |
|---|---|---|---|---|
| Staged/Deterministic model | 86.93 | 86.47 | 86.70 | 30 ms/snt |
| Previous method 1 (Supertagger + ChartParser) | 87.35 | 86.29 | 86.81 | 183 ms/snt |
| Previous method 2 (Unigram + ChartParser) | 84.96 | 84.25 | 84.60 | 674 ms/snt |

6 times faster than previous method 1; 20 times faster than the initial model
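The speed-up factors and F1 values can be recomputed from the table. This sketch assumes F1 is the usual harmonic mean of labelled precision (LP) and labelled recall (LR); the dictionary keys are illustrative names.

```python
# (LP, LR, reported F1, average ms per sentence) for each model
results = {
    "staged_deterministic": (86.93, 86.47, 86.70, 30),
    "supertagger_chart": (87.35, 86.29, 86.81, 183),
    "unigram_chart": (84.96, 84.25, 84.60, 674),
}

def f1(lp: float, lr: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * lp * lr / (lp + lr)

base_ms = results["staged_deterministic"][3]
speedups = {name: row[3] / base_ms for name, row in results.items()}
f1_gaps = {name: abs(f1(row[0], row[1]) - row[2]) for name, row in results.items()}

for name in results:
    print(f"{name}: {speedups[name]:.1f}x slower than staged, F1 gap {f1_gaps[name]:.3f}")
```

The ratios come out at roughly 6 and 22, in line with the rounded "6 times" and "20 times" on the slide, and each recomputed F1 is within 0.01 of the reported value.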
Domain/Text Type Adaptation
| Model | F-score | Training Time (Sec) |
|---|---|---|
| Baseline (PTB-trained, PTB-applied) | 89.81 | 0 |
| Baseline (PTB-trained, GENIA-applied) | 86.39 | 0 |
| Retraining (GENIA) | 88.45 | 14,695 |
| Retraining (PTB+GENIA) | 89.94 | 238,576 |
| Structure with RefDist | 88.18 | 21,833 |
| Lexical with RefDist | 89.04 | 12,957 |
| Lex/Structure with RefDist | 90.15 | 31,637 |
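Adaptation with Reference Distribution

$$p_E(t \mid \mathbf{w}) = \frac{1}{Z_{\mathbf{w}}} \prod_i p_{lex}(l_i \mid \mathbf{w}, i)\; q_{syn}(t \mid \mathbf{l}), \qquad Z_{\mathbf{w}} = \sum_{t \in T(\mathbf{w})} \prod_i p_{lex}(l_i \mid \mathbf{w}, i)\; q_{syn}(t \mid \mathbf{l})$$

Here $p_{lex}$ models lexical assignment and $q_{syn}$ models syntactic preference.

$$p_M(t \mid s) = \frac{1}{Z_s}\, p_0(t \mid s)\, \exp\Big(\sum_j \lambda_j\, g_j(t \mid s)\Big)$$

Here $p_0(t \mid s)$ is the original model, $g_j$ is a feature function and $\lambda_j$ is its feature weight.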
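A minimal numeric sketch of a log-linear model with a reference distribution: the original model's probability is rescaled by exponentiated feature scores and then renormalised over the candidate parses of one sentence. The candidate names, the single feature and the weights are invented for illustration; real weights would be estimated from the target-domain treebank.

```python
import math
from typing import Callable, Dict, List

def adapt_with_reference(
    candidates: List[str],
    p0: Dict[str, float],
    features: List[Callable[[str], float]],
    weights: List[float],
) -> Dict[str, float]:
    """p_M(t|s) proportional to p0(t|s) * exp(sum_j weight_j * g_j(t))."""
    scores = {
        t: p0[t] * math.exp(sum(w * g(t) for w, g in zip(weights, features)))
        for t in candidates
    }
    z = sum(scores.values())   # normaliser Z_s
    return {t: v / z for t, v in scores.items()}

# Two hypothetical parses of one sentence and their original-model probabilities
candidates = ["parse_a", "parse_b"]
p0 = {"parse_a": 0.7, "parse_b": 0.3}

# One illustrative feature that fires only on parse_b
features = [lambda t: 1.0 if t == "parse_b" else 0.0]

unchanged = adapt_with_reference(candidates, p0, features, [0.0])
adapted = adapt_with_reference(candidates, p0, features, [2.0])
print(unchanged)
print(adapted)
```

With all weights at zero the adapted model reduces to the original one, so training only has to learn how the new domain deviates from it.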
[Figure: F-score (83 to 90) against the number of sentences in the GENIA training set (0 to 8,000). Curves: Baseline (PTB), Simple Retraining (GENIA), Retraining (GENIA+PTB), Structure with RefDist, Lexical with RefDist, Lexical/Structure with RefDist.]
[Figure: F-score (83 to 90) against training time in seconds (0 to 30,000). Curves: Retraining (GENIA), Structure with RefDist, Lexical with RefDist, Lex/Structure with RefDist.]
Tool1: POS Tagger
• General-Purpose POS taggers, trained by WSJ
– Brill’s tagger, TnT tagger, MX POST, etc.
– 97%
• General-Purpose POS taggers do not work well for MEDLINE abstracts
The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …
Errors seen in TnT tagger (Brants 2000)
• A/DT chromosomal/JJ translocation/NN in/IN …
• … and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ .
• … two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS …
• … by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN .
• … to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN …
• Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN
Performance of GENIA Tagger

• GENIA tagger

| Training corpus | WSJ | GENIA |
|---|---|---|
| WSJ | 97.0 | 84.3 |
| GENIA | 75.2 | 98.1 |
| WSJ+GENIA | 96.9 | 98.1 |

No degradation of the tagger trained by the mixed corpus

• (Ref.) TnT tagger

| Training corpus | WSJ | GENIA |
|---|---|---|
| WSJ | 96.7 | 84.3 |
| GENIA | 80.1 | 97.9 |
| WSJ+GENIA | 96.5 | 97.5 |

Some degradations (0.2 ~ 0.4) were observed, compared with the taggers trained by “pure” corpora
CRF-based POS + Active Learning: GENIA
• 3,000 sentences: 98.4
• 20,000 sentences: 98.58
CRF-based POS + Active Learning: PTB
• 10,000 sentences: 96.76
• Best Performance: 97.18
Applications
GENIA Event Annotation - example
– For an identified event in the given sentence,
• classify the type of events and record the text span giving the clue of it (ClueType).
• identify the theme of the events and record the text span linking the theme to the event (LinkTheme).
• identify the cause of the events and record the text span linking the cause to the event (LinkCause).
• record the environment (location, time) of the events (ClueLoc, ClueTime).
[Figure: annotated example sentence with spans labelled ClueType, LinkTheme, LinkCause and ClueLoc]
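One way to picture the annotation steps above is as a record with one slot per span type. The class below is an illustrative data structure only; the field names mirror the slide's labels but are not the GENIA corpus file format, and the example sentence is made up.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Span = Tuple[int, int]  # character offsets into the sentence, end exclusive

@dataclass
class EventAnnotation:
    event_type: str                      # e.g. "Localization"
    clue_type: Span                      # text giving the clue to the event type
    theme: Optional[str] = None          # entity the event acts on
    link_theme: Optional[Span] = None    # text linking the theme to the event
    cause: Optional[str] = None          # entity causing the event
    link_cause: Optional[Span] = None    # text linking the cause to the event
    clue_loc: Optional[Span] = None      # location expression
    clue_time: Optional[Span] = None     # time expression

def span_of(sentence: str, text: str) -> Span:
    """Locate the first occurrence of `text` and return its offsets."""
    start = sentence.index(text)
    return (start, start + len(text))

sentence = "protein X translocation to the nucleus"   # made-up example
event = EventAnnotation(
    event_type="Localization",
    clue_type=span_of(sentence, "translocation"),
    theme="protein X",
    clue_loc=span_of(sentence, "to the nucleus"),
)
print(event)
```

Slots left as `None` correspond to cases such as the "NONE" entries counted for ClueLoc, where the sentence gives no explicit location.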
Gene_expression
• Theme patterns observed (2,958)
– Protein 2,308
– DNA 591
– RNA 25
– Peptide 4
– Protein Protein 2
– Erroneous 27
• Keywords
– coexpress, nonexpress, overexpress, express, biosynthesis, product, synthesize, constitute, …
coexpression
Transcription
• Theme patterns observed (929)
– DNA 449
– RNA 272
– Protein 167
– Peptide 2
– Erroneous 22
• Keyword
– Transcrib, transcript, synthesi, express, …
Localization
• Theme patterns observed (730)
– Protein 608
– Lipid 31
– Atom 29
– Other_organic_compound 14
– DNA 12
– Virus 5
– Carbohydrate 5
– RNA 4
– Inorganic 4
– Peptide 3
• Keywords
– Translocation, secretion, release, localization, mobilization, uptake, secrete, import, transport, translocate, sequester, influx, migrate, localisation, move, delivery, export, …
• ClueLoc
– NONE 241
– nuclear 140
– to the nucleus 12
– into the nucleus 11
– Cytoplasmic 8
– in the cytoplasm 7
– macrophages 5
– nuclear … in t lymphocytes 4
– monocytes 4
– in the nucleus 4
– in the cytosol 4
– in colostrum 4
– from the cytoplasm to the nucleus 4
Localization
• Keywords and Locations
– translocation (166)
    • nuclear 108
    • NONE 38
    • …
– secretion (100)
    • NONE 57
    • name_of_cells 43
– release (80)
    • NONE 51
    • name_of_cells 19
    • …
– localization (30)
    • nuclear 25
    • intracellular 3
– uptake (24)
    • NONE 14
    • name_of_cells 20
• Keywords and Themes
– translocation (166)
    • Protein 161
    • Virus 4
    • RNA 1
– secretion (100)
    • Protein 98
    • Lipid 1
    • Peptide 1
– release (80)
    • Protein 67
    • Other_organic_compound 6
    • Lipid 3
– localization (30)
    • Protein 30
– uptake (24)
    • Lipid 15
    • Carbohydrate 5
    • Protein 4
Future Plan
Kitano’s group, Kell’s group
Future Directions
• Domain Adaptation + Inter-operability
– High performance can be obtained by using domain specific characteristics and domain semantics
– Differences among abstracts, full papers, comments in DBs
– Standardized Interfaces (API) of NLP tools
• Text Archives
– Abstracts + Full Papers + Comments/Summary Descriptions in DBs
• Combining NLP tools with Mining tools
– Knowledge Discovery (Disease Gene Association)
– Hypotheses Generation
– Automatic Data Interpretation