ai, text mining and scientific research - jst text mining and scientific research junichi tsujii...
AI, Text Mining
and Scientific Research
Junichi Tsujii
Director
Artificial Intelligence Research Center, AIST
Plan of the talk
• Background
• Challenges in Biology
• Linking Text with Knowledge
• Conclusion
AI which models HI
AI approaching HI
• IBM Watson: NLU, Text and Structured Knowledge, Fact Retrieval, QA
• Computer Chess (Japanese Shogi): Large Search Space, Machine Learning
• Robot for entrance exam of U‐Tokyo: NLU, Problem Solving, Inferences based on Knowledge
• Conversational Agent: Intelligence with bodies, NLU in specific situations, Grounding of language
• Deep Learning: Brain‐inspired computation, Changes of Computation principles, Autonomous Intelligence, New Paradigms of Machine Learning
• Brain Science: Science of HI
AI evolving from Big Data and Data Science
AI which surpasses HI
Another Stream of AI
Machine Learning, Large‐scale Graph, Graph Mining
GPU・HPC, Optimization, Deep Learning
Integration of two AIs
AI with high Affinity for HI
① Data Knowledge Integration AI; AI which can explain
• AI which thinks based on Data
• HI which thinks based on Knowledge
② Brain‐Inspired AI; Revolution of Computation Principles
Modeling HI Surpassing HI
AlphaGo (2016): Machine Learning and Simulation
A game of perfect information
DNN
Database of Games in the past
Complete Simulation
v(s)
p(a|s)
Training Data
Computational Science and AI: Simulation and Machine Learning
A game of perfect information
DNN
Complete Simulation
v(s)
p(a|s)
Training Data
Unknowns
Diverse databases of Omics
Incomplete and Partial Simulation
Database of Games in the past
Cooperation of HI and AI
• Collective Intelligence
– Linked Data, Common Ontology, Shared Knowledge, Technology as Commodities
• Machine Learning
– Autonomous AI, Self‐driving cars, Speech/Vision Processing, etc.
• Communication among NI and AI
• Collectively Solving Challenges
– Social, Technological and Scientific Challenges
Plan of the talk
• Background
• Challenges in Biology
• Linking Text with Knowledge
• Conclusion
Genome‐Wide Association Studies (GWAS)
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
Healthy
Disease (e.g., Alzheimer, Cancer)
2000: “Genetic diagnosis of diseases would be accomplished in 10 years and that treatments would start to roll out perhaps five years after that.” Francis Collins (NIH)
2010: “A Decade Later, Genetic Maps Yield Few New Cures” New York Times, June 2010.
by Hoifung Poon (MSR, 2013)
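The two sequence fragments on this slide can be compared position by position, which is the elementary step behind any variant comparison. The following is a minimal Python sketch, assuming the fragments are already aligned and of equal length; the function name `variant_positions` is illustrative and does not come from the talk.

```python
from typing import List, Tuple


def variant_positions(first: str, second: str) -> List[Tuple[int, str, str]]:
    """Return (0-based index, base in first, base in second) for each mismatch."""
    if len(first) != len(second):
        raise ValueError("fragments must be aligned and of equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(first, second)) if a != b]


# The two fragments as printed on the slide.
fragment_a = "ATTCGGATATTTAAGGC"
fragment_b = "ATTCGGGTATTTAAGCC"

differences = variant_positions(fragment_a, fragment_b)
for index, base_a, base_b in differences:
    print(f"position {index}: {base_a} -> {base_b}")
```

A real GWAS tests such variants statistically across many individuals; this sketch only locates the positions where two fragments differ.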
Traditional Biology
Targeted Experiments Discovery
One hypothesis
by Hoifung Poon (MSR, 2013)
Genomics
High‐Throughput Experiments
Discovery
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
Too many hypotheses
……
Big Data
by Hoifung Poon (MSR, 2013)
Genomics
High‐Throughput Experiments
Discovery
… ATTCGGATATTTAAGGC …
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
… ATTCGGGTATTTAAGCC …
… ATTCGGATATTTAAGGC …
… ATTCGGGTATTTAAGCC …
Many hypotheses
Big Data
……
Oda K, Matsuoka Y, Funahashi A, Kitano H: A comprehensive pathway map of epidermal growth factor receptor signaling. Mol Syst Biol 2005, 1:2005.0010.
Nodes: 652
Links: 444
600 papers were read to construct the pathway
Artificial Intelligence
Natural Intelligence
Data Knowledge
Cooperation
Big Challenges
Big Mechanism: Robot Scientists
DARPA, Chicago Univ., Manchester Univ., AIRC
Reading Assembly Explanation
[Figure: signaling network diagram, repeated three times; node labels include AKT(PKB), B-Raf, ERK1/2, GRB2, H-Ras, K-RAS, MEK1/2, PI3K cat class IA, SOS and c-Raf-1, among roughly 70 in total]
Very large, conflicting (probabilistic) network
Smaller (relevant) grounded model
Computational hypotheses / wet lab experiments controlling states of the network
A. Rzhetsky (U. Chicago)
Big Mechanism
• Project supported by DARPA
• Some of the systems that matter most to the Defense Department are very complicated. Ecosystems, brains and economic and social systems have many parts and processes, but they are studied piecewise, and their literatures and data are fragmented, distributed and inconsistent. It is difficult to build complete, explanatory models of complicated systems, and so effects in these systems that are brought about by many interacting factors are poorly understood.
• Big mechanisms are large, explanatory models of complicated systems in which interactions have important causal effects. The collection of big data is increasingly automated, but the creation of big mechanisms remains a human endeavor made increasingly difficult by the fragmentation and distribution of knowledge. To the extent that the construction of big mechanisms can be automated, it could change how science is done.
Plan of the talk
• Background
• Challenges in Biology
• Linking Text with Knowledge
• Conclusion
Big Mechanism: Reading‐Assembly‐Explanation
Reading Assembly Explanation
[Figure: signaling network diagram, repeated three times; node labels include AKT(PKB), B-Raf, ERK1/2, GRB2, H-Ras, K-RAS, MEK1/2, PI3K cat class IA, SOS and c-Raf-1, among roughly 70 in total]
Very large, conflicting (probabilistic) network
Smaller (relevant) grounded model
Computational hypotheses / wet lab experiments controlling states of the network
By A. Rzhetsky (U. Chicago)
The Need for Text Mining
Types of documents
• Full papers
• Abstracts
• Reports, discharge summaries
• EMR
• Textbooks, monographs
• Grey content, online discussion forums
MEDLINE
• 2005: ~14M
• 2009: ~18M
• 2013: ~22M
• 2015: ~26M
Overwhelming information in textual, unstructured format
By S. Ananiadou (U. Manchester)
Event Extraction
Finding events (trigger mentions, event types, and typed arguments including locations) involving genes or gene products
… In this study we hypothesized that the phosphorylation of TRAF2 inhibits binding to the CD40 cytoplasmic domain. …
• Phosphorylation (trigger: phosphorylation): Theme = TRAF2
• Binding (trigger: binding): Theme = TRAF2, Theme2 = CD40, Site2 = cytoplasmic domain
• Negative_regulation (trigger: inhibits): Cause = Phosphorylation, Theme = Binding
http://www.nactem.ac.uk/EventMine/
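The events on this slide nest: the Negative_regulation event takes two other events as its arguments. A minimal Python sketch of one way to hold that structure follows. The class names `Entity` and `Event`, the entity type labels and the `describe` helper are assumptions made for illustration; this is not EventMine's API or output format.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Entity:
    text: str
    type: str  # illustrative type label, e.g. "Protein"


@dataclass
class Event:
    event_type: str
    trigger: str
    # Maps a role name (Theme, Cause, Site2, ...) to an entity or a nested event.
    args: Dict[str, Union[Entity, "Event"]] = field(default_factory=dict)


def describe(node: Union[Entity, Event]) -> str:
    """Render an entity or a (possibly nested) event as one line of text."""
    if isinstance(node, Entity):
        return node.text
    parts = ", ".join(f"{role}={describe(arg)}" for role, arg in node.args.items())
    return f"{node.event_type}({parts})"


traf2 = Entity("TRAF2", "Protein")
cd40 = Entity("CD40", "Protein")
domain = Entity("cytoplasmic domain", "Site")

phosphorylation = Event("Phosphorylation", "phosphorylation", {"Theme": traf2})
binding = Event("Binding", "binding",
                {"Theme": traf2, "Theme2": cd40, "Site2": domain})
regulation = Event("Negative_regulation", "inhibits",
                   {"Cause": phosphorylation, "Theme": binding})

print(describe(regulation))
```

Keeping events as arguments of other events, instead of flattening them into pairwise relations, is what lets the sentence's causal claim (phosphorylation inhibits binding) survive extraction.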
Finding Evidence ‐ Europe PubMed Central
• Currently: runs on 2,550,328 full texts
• 82,198,474 facts in 38,411,661 sentences
• Full parsing used a version of Enju (Mogura)
• Parsing pipeline run on 60 machines at EBI for ~30 days
http://labs.europepmc.org/evf
By S. Ananiadou (U. Manchester)
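The figures on this slide imply some rough densities and throughput numbers. The short Python calculation below derives them; it assumes that the sentence count refers to the sentences in which facts were found and that the run took exactly 30 days, whereas the slide only says "~30 days". The variable names are illustrative.

```python
# Figures as stated on the slide.
full_texts = 2_550_328
facts = 82_198_474
sentences = 38_411_661
machines = 60
days = 30  # approximate: the slide says "~30 days"

facts_per_article = facts / full_texts
facts_per_sentence = facts / sentences
machine_days = machines * days
articles_per_machine_day = full_texts / machine_days

print(f"facts per full text:    {facts_per_article:.1f}")
print(f"facts per sentence:     {facts_per_sentence:.2f}")
print(f"machine-days used:      {machine_days}")
print(f"articles / machine-day: {articles_per_machine_day:.0f}")
```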
Deep Reading: Reading with a Model
• Goal: evaluate how TM systems process text in relation to what is known about a pathway
• Performers asked to produce
– Relationship/proposed change to the model (new/corroborating/conflicting information)
– A model fragment describing the change
– The source text supporting the change
By L. Hirschman (MITRE)
Reading against a Model (1)
“monoubiquitination of Ras enhances association with the downstream effectors Raf and PI3‐Kinase”
CORROBORATING: We know that Ras binds Raf
By L. Hirschman (MITRE)
Reading against a Model (2)
“monoubiquitination of Ras enhances association with the downstream effectors Raf and PI3‐Kinase”
NEW MECHANISM: Ras binds PI3‐Kinase.
BEL: complex(p(PFH:"Ras family"), p("PI3K"))
By L. Hirschman (MITRE)
Reading against a Model (3)
“Moreover, the RAS‐ASPP interaction enhances the transcription function of p53”
NEW RELATIONSHIP: RAS‐ASPP complex increases transcriptional activity of p53
BEL: complex(p(PFH:"Ras Family"), p(HGNC:ASPP2)) -> act(p(HGNC:P53), ma(tscript))
By L. Hirschman (MITRE)
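The three outcomes named in these slides (corroborating, new, conflicting) can be illustrated by checking an extracted statement against a stored model. This is a minimal Python sketch under strong assumptions: the model is a plain dictionary keyed by (source, target), relations are simple strings, and the pair ("A", "B") is a made-up entry used only to show the conflicting case. It is not the evaluation procedure used in the Big Mechanism program.

```python
from typing import Dict, Tuple

Model = Dict[Tuple[str, str], str]       # (source, target) -> relation
Statement = Tuple[str, str, str]         # (source, relation, target)


def classify(model: Model, statement: Statement) -> str:
    """Label a statement read from text relative to what the model already holds."""
    source, relation, target = statement
    known = model.get((source, target))
    if known is None:
        return "new"
    return "corroborating" if known == relation else "conflicting"


model: Model = {
    ("Ras", "Raf"): "binds",        # from the slide: "We know that Ras binds Raf"
    ("A", "B"): "increases",        # hypothetical entry, for illustration only
}

results = {
    "Ras-Raf": classify(model, ("Ras", "binds", "Raf")),
    "Ras-PI3K": classify(model, ("Ras", "binds", "PI3K")),
    "A-B": classify(model, ("A", "decreases", "B")),
}
for name, label in results.items():
    print(name, label)
```

A realistic system would also need to normalise entity names and treat binding as symmetric before the lookup; both steps are omitted here.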
Epistemic knowledge
• Enriches event‐based search systems
– Discovery of new knowledge
– Negation, uncertainty, speculative claims in literature
Miwa, Thompson, McNaught, Kell, Ananiadou (2012). Extracting semantically enriched events from biomedical literature. BMC Bioinformatics 13, 108
Extracting epistemic knowledge
… In this study we hypothesized that the phosphorylation of TRAF2 inhibits binding to the CD40 cytoplasmic domain. …
Uncertainty
Negation
Analysis
Source
By S. Ananiadou (U. Manchester)
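To make the idea of epistemic labels concrete, the sketch below flags uncertainty and negation with a tiny cue-word lookup in Python. The cue lists are invented for illustration and are far from complete, and a lookup of this kind is a simplification, not the approach of the cited paper.

```python
import re
from typing import Dict, List

# Hypothetical cue lexicon, for illustration only.
CUES: Dict[str, List[str]] = {
    "Uncertainty": ["hypothesized", "may", "might", "suggest", "possibly"],
    "Negation": ["not", "no", "failed to"],
}


def epistemic_cues(sentence: str) -> Dict[str, List[str]]:
    """Return, per label, the cue words found as whole words in the sentence."""
    lowered = sentence.lower()
    return {
        label: [cue for cue in cues
                if re.search(r"\b" + re.escape(cue) + r"\b", lowered)]
        for label, cues in CUES.items()
    }


sentence = ("In this study we hypothesized that the phosphorylation of TRAF2 "
            "inhibits binding to the CD40 cytoplasmic domain.")
found = epistemic_cues(sentence)
print(found)
```

For the slide's example sentence the only cue found is "hypothesized", so the inhibition event would be marked as uncertain and not as an established fact.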
Deep reading
[Figure: component diagram; legend: custom components, existing components supplied with custom resources, existing components]
• Reads passages from remote folder
• Performs tokenisation, POS and chunk tagging; recognises proteins and cell lines
• Makes distinction between genes/proteins and protein families
• Uses model trained on overlapping corpora
• Reads in BioPAX model from a SPARQL endpoint
By R. Batista (U. Manchester)
Words
Terms
Entities
Relations
Events
Wordform co‐occurrence, pattern matching, …
Term recognition and normalisation
Named entity recognition
Relation extraction
Event extraction
Associations
Epistemic extraction
Data mining, Clustering
What is known about this disease, protein, person?
What is linked with X?
{Who, what} Xed {whom, what} where, when and how?
What if…?
Keyword search
Is X possible, certain, probable, suggested, past, to come?
What is this paper about?
Increased sophistication? Increased customisation!
By S. Ananiadou (U. Manchester)
Plan of the talk
• Background
• Challenges in Biology
• Linking Text with Knowledge
• Conclusion
Big Mechanism: Reading‐Assembly‐Explanation
Reading Assembly Explanation
[Figure: signaling network diagram, repeated three times; node labels include AKT(PKB), B-Raf, ERK1/2, GRB2, H-Ras, K-RAS, MEK1/2, PI3K cat class IA, SOS and c-Raf-1, among roughly 70 in total]
Very large, conflicting (probabilistic) network
Smaller (relevant) grounded model
Computational hypotheses / wet lab experiments controlling states of the network
By A. Rzhetsky (U. Chicago)
Artificial Intelligence
Natural Intelligence
Data Knowledge
Cooperation
Big Challenges