Large-scale knowledge aggregation for infectious diseases
ASEAN-China International Bioinformatics Workshop, Singapore, 17th April 2008
Olivo Miotto
Institute of Systems Science and Yong Loo Lin School of Medicine, National University of Singapore
Page 2
Large-scale Research Questions
What can we learn from large-scale studies of pathogens?
Does H5N1 Avian influenza have pandemic potential?
What makes Human flu different from Avian flu?
What are stable potential immune epitopes to use as vaccine candidates for influenza?
How does each serotype of dengue differ from all others?
Large scale
Statistical evidence
Historical data
Systematic analysis
Page 4
We need Metadata!
Metadata = Descriptive data about sequences
If you want to compare avian vs human, you need host organism info
If you want conservation analysis, you need to have serotype and host information
If you want to study a period of virus evolution, you need date information
If you want a balanced dataset, you may need to filter according to country, date, subtype
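The kind of metadata-driven selection listed above can be sketched as follows. This is a minimal illustration, assuming each sequence record has already been reduced to a dictionary with the hypothetical keys `host`, `country`, `year` and `subtype`; the sample records and the cap-per-group balancing strategy are assumptions, not something the slides specify.

```python
from collections import defaultdict

def select(records, **criteria):
    """Keep records whose metadata matches every given field=value pair."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

def balance(records, key, cap):
    """Keep at most `cap` records per distinct value of `key` (e.g. per country)."""
    kept, seen = [], defaultdict(int)
    for r in records:
        if seen[r.get(key)] < cap:
            seen[r.get(key)] += 1
            kept.append(r)
    return kept

# Hypothetical records; real ones would come from curated database entries.
records = [
    {"id": "s1", "host": "Avian", "country": "Vietnam", "year": 2004, "subtype": "H5N1"},
    {"id": "s2", "host": "Human", "country": "Vietnam", "year": 2004, "subtype": "H5N1"},
    {"id": "s3", "host": "Human", "country": "Vietnam", "year": 2005, "subtype": "H5N1"},
    {"id": "s4", "host": "Human", "country": "Indonesia", "year": 2005, "subtype": "H5N1"},
    {"id": "s5", "host": "Human", "country": "Thailand", "year": 2004, "subtype": "H3N2"},
]

human_h5n1 = select(records, host="Human", subtype="H5N1")
balanced = balance(human_h5n1, key="country", cap=1)
print([r["id"] for r in balanced])
```

Without the host, country and subtype fields, neither step is possible, which is the point the slide makes.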
Page 5
Knowledge Mining
[Diagram: knowledge mining workflow with four components: Knowledge Aggregation, Conservation Analysis, Characteristic Mutations Analysis and Active Text Mining]
Diagram labels:
Public Database Records
Viral Sequence and Metadata
Viral Protein References
Biomedical Text
Previous Annotations
Curator's Knowledge
User-defined Dictionaries
User-defined Extraction Rules and Priorities
User-defined Queries
User-defined Patterns
Cross-reference Identifiers
Extract Desired Source Knowledge from Public Databases
Identify Evolutionarily Stable Region across subgroups
Identify mutations in H5N1 that characterize transmissibility amongst humans
Identify Biomedical literature with Cross-reactivity information
Evidence of strain co-circulation
Epitope Vaccine Candidates
H5N1 mutation map
Documents with Cross-reactivity information
Page 6
Scalability in Bioinformatics Knowledge Mining
Integrative scalability: we need to integrate heterogeneous information from multiple data repositories with multiple purposes
Quantitative scalability: we need methods that can leverage and effectively explore large-scale data sets
Hierarchical scalability: we need to cascade analysis tasks, flowing knowledge from one task to the next
Page 7
Obstacles to Scalability
Heterogeneity of Biological Databases
Systemic: access to data in different databases
Syntactic: data formats, use of free text
Structural: different table structures in different databases
Semantic: data with different meaning and intent
Semantic Heterogeneity is particularly insidious
Data is rarely used in the way it was originally intended
Low level of end-user technical expertise
Biologists, not computer scientists
Excel spreadsheets, Web page “scraping”
Does not scale up
Page 9
Semantic Heterogeneity in GenBank
Fields (e.g. country/date) are inconsistently encoded
Inconsistent level of detail between databases
Inconsistent field location within different records of the same database
Implicit encoding of the data (e.g. within the title of a publication)
Multiple usage of the same field
Usage of the isolation_source field in different GenPept records:
AAT85667 /isolation_source="Homo sapiens"
BAC77216 /isolation_source="Samoa"
AAN74539 /isolation_source="isolated in 1993"
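One way to cope with a field that is used for several purposes is to classify each value before using it. The sketch below assumes small, hypothetical dictionaries of host and country names (user-defined dictionaries would play this role) and a simple four-digit year pattern; the three values tested are the isolation_source values shown on this slide.

```python
import re

# Hypothetical, deliberately tiny dictionaries; real ones would be user-defined.
HOSTS = {"homo sapiens": "Human", "gallus gallus": "Avian"}
COUNTRIES = {"samoa", "vietnam", "indonesia"}
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

def interpret(value):
    """Guess what an isolation_source value actually describes."""
    text = value.strip()
    lowered = text.lower()
    if lowered in HOSTS:
        return ("host", HOSTS[lowered])
    if lowered in COUNTRIES:
        return ("country", text)
    match = YEAR.search(text)
    if match:
        return ("year", int(match.group(0)))
    return ("unknown", text)

for accession, value in [
    ("AAT85667", "Homo sapiens"),
    ("BAC77216", "Samoa"),
    ("AAN74539", "isolated in 1993"),
]:
    print(accession, interpret(value))
```

The same field yields a host, a country and a year across three records, so a consumer that assumes one meaning would mislabel two of them.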
Page 10
Influenza Large-Scale Studies
Analyze all influenza protein sequences available
GenBank + GenPept = 92,343 documents
Final dataset comprises 40,169 unique sequences
Various types of analysis, e.g.
Identify amino acid mutation sites that characterize human-transmissible strains
Compare the diversity of viral sequences over different periods of time and geographical areas
Several metadata fields required
Protein name, Subtype, Isolate, Host, Country, Year
Manual Curation is not an Option!
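The slides do not say how the 92,343 documents were reduced to 40,169 unique sequences. The following is a minimal sketch of one plausible step, assuming exact sequence identity is the criterion and that metadata from duplicate records is pooled rather than discarded; the record layout, accessions and sequences are hypothetical.

```python
def deduplicate(records):
    """Group records by exact sequence and pool their metadata values."""
    unique = {}
    for rec in records:
        entry = unique.setdefault(rec["sequence"], {"accessions": [], "metadata": {}})
        entry["accessions"].append(rec["accession"])
        for field, value in rec["metadata"].items():
            # Keep every distinct value so conflicts stay visible for curation.
            entry["metadata"].setdefault(field, set()).add(value)
    return unique

# Hypothetical documents: X1 and X2 describe the same protein sequence.
records = [
    {"accession": "X1", "sequence": "MERIKELRDL", "metadata": {"host": "Human"}},
    {"accession": "X2", "sequence": "MERIKELRDL", "metadata": {"year": 2004}},
    {"accession": "X3", "sequence": "MERIKELRNL", "metadata": {"host": "Avian"}},
]

unique = deduplicate(records)
print(len(records), "documents ->", len(unique), "unique sequences")
```

Pooling matters because two documents for the same sequence often carry complementary metadata, as X1 and X2 do here.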
Page 11
The Aggregator of Biological Knowledge
An end-user environment for data retrieval, extraction and analysis
Uses XML technology and structural rules to allow biologists to extract and reconcile the data needed
Wrapper framework provides access to multiple sources
Manages extracted results
Offers plug-in architecture for analysis tools
[Diagram, drawn once for a generic KDD System and once for ABK: Researcher, Public Repositories, Data Collection, Data Management and Data Analysis, linked by arrows labelled query, input, filter, augment, manage and control]
Page 12
ABK Structural Rules
Concise visualization of XML as name/value tree
Familiar presentation of metadata for biologists
Point-and-click selection of location and constraints
Automatic formation of XML Structural Rule
Hierarchical value reconciliation
Tabulated visualization and manual curation
RDF storage and output
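The idea of a structural rule, a location in the XML tree plus constraints on the nodes along the way, can be illustrated with Python's standard `xml.etree.ElementTree`. This is only a sketch: the record layout, the `steps` rule format and the helper names `to_xpath` and `apply_rule` are hypothetical and are not ABK's actual rule syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified record; real database XML is far more verbose.
RECORD = """
<record accession="AAT85667">
  <feature key="source">
    <qualifier name="organism">Influenza A virus</qualifier>
    <qualifier name="isolation_source">Homo sapiens</qualifier>
  </feature>
  <feature key="Protein">
    <qualifier name="product">polymerase PB2</qualifier>
  </feature>
</record>
"""

def to_xpath(rule):
    """Turn a list of (tag, attribute constraints) steps into an XPath string."""
    steps = []
    for tag, constraints in rule["steps"]:
        predicates = "".join(f"[@{k}='{v}']" for k, v in constraints.items())
        steps.append(tag + predicates)
    return "./" + "/".join(steps)

def apply_rule(root, rule):
    """Return the text of every node selected by the rule."""
    return [n.text.strip() for n in root.findall(to_xpath(rule)) if n.text]

# Location: feature -> qualifier; constraints: key and name attributes.
host_rule = {
    "property": "host",
    "steps": [("feature", {"key": "source"}), ("qualifier", {"name": "isolation_source"})],
}

root = ET.fromstring(RECORD.strip())
print(to_xpath(host_rule))
print(host_rule["property"], "=", apply_rule(root, host_rule))
```

A point-and-click interface could build the `steps` list from the node the user selects, which is why the rule is kept as data rather than as a hand-written path.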
Page 13
Data Extraction and Cleaning
DENV-1 sequences
Different rules (or different documents) produced conflicting values
User can fill in or override values
Values produced by user-defined rules
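Reconciling conflicting values can be sketched as a priority-ordered choice with a conflict flag and a manual override. The slides do not describe ABK's actual reconciliation scheme, so the `reconcile` function, the numeric priorities and the sample values below are assumptions made for illustration.

```python
def reconcile(candidates, override=None):
    """Pick one value for a property from the outputs of several rules.

    `candidates` is a list of (priority, value) pairs, where a lower number
    means a higher priority. Returns (chosen value, conflict flag).
    """
    if override is not None:
        # A value typed in by the curator always wins.
        return override, False
    values = [v for _, v in sorted(candidates, key=lambda c: c[0]) if v]
    if not values:
        return None, False
    conflict = len(set(values)) > 1
    return values[0], conflict

# Two rules disagree on the country of one hypothetical record.
country = [(2, "Viet Nam"), (1, "Thailand")]
print(reconcile(country))
print(reconcile(country, override="Vietnam"))
print(reconcile([(1, None), (2, "")]))
```

Flagging the conflict instead of silently taking the top-priority value is what lets a tabulated view highlight the cells that need a curator's attention.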
Page 14
Rule performance
[Charts: rule performance, 0% to 100%, for genbank and genpept records. Panels: Subtype (rules 1 to 3), Isolate Name (rules 1 to 3), Host (rules 1 to 4), Origin (rules 1 to 6), Year (rules 1 to 5)]
Multiple rules often needed
Some properties are very fragmented
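The slides do not define how rule performance was measured. A minimal sketch, assuming it is coverage (the fraction of records for which a rule yields a value), shows why several rules may be needed for a fragmented property; the `coverage` function and the sample results are hypothetical.

```python
def coverage(results, rule_order):
    """Per-rule and cumulative fraction of records for which a rule gave a value.

    `results` maps record id -> {rule name: extracted value or None}.
    """
    total = len(results)
    covered = set()
    report = []
    for rule in rule_order:
        hits = {rid for rid, values in results.items() if values.get(rule)}
        covered |= hits
        report.append((rule, len(hits) / total, len(covered) / total))
    return report

# Hypothetical extraction results for the Year property of four records.
results = {
    "r1": {"rule1": "2004", "rule2": None, "rule3": None},
    "r2": {"rule1": None, "rule2": "2005", "rule3": None},
    "r3": {"rule1": None, "rule2": "1997", "rule3": "1997"},
    "r4": {"rule1": None, "rule2": None, "rule3": None},
}

for rule, own, cumulative in coverage(results, ["rule1", "rule2", "rule3"]):
    print(f"{rule}: {own:.0%} alone, {cumulative:.0%} cumulative")
```

In this toy case no single rule covers more than half of the records, but the first two together reach three quarters.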
Page 17
Using MI to detect Characteristic Sites
At a characteristic site, the residue observed is strongly associated with a set of sequences
E.g.: Arg -> Avian, Thr -> Human
This association is explored by measuring the mutual information (MI) between
The residue observed at a site
The label of the set in which it is observed
MI is in the range 0 to 1.0
MI = 0.0 -> the residue observed is statistically independent of the set it occurs in
MI = 1.0 -> Residues observed in one set are never observed in the other, and vice versa
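The mutual information between residue and set label at one alignment site can be computed directly from counts. The sketch below uses the standard definition with base-2 logarithms; the function name and the toy data are illustrative, and the slides do not say whether any normalisation or resampling was applied. With two sets, MI cannot exceed the entropy of the set labels, so it reaches 1.0 only when the sets are the same size and share no residue at the site.

```python
from collections import Counter
from math import log2

def mutual_information(residues, labels):
    """MI (in bits) between the residue at one site and the set label."""
    n = len(residues)
    joint = Counter(zip(residues, labels))
    by_residue = Counter(residues)
    by_label = Counter(labels)
    mi = 0.0
    for (residue, label), count in joint.items():
        # p(r, l) / (p(r) * p(l)) simplifies to count * n / (count_r * count_l).
        ratio = count * n / (by_residue[residue] * by_label[label])
        mi += (count / n) * log2(ratio)
    return mi

labels = ["avian"] * 4 + ["human"] * 4

# Fully characteristic site: Arg only in avian, Thr only in human.
print(mutual_information(["R"] * 4 + ["T"] * 4, labels))
# Fully conserved site: the residue says nothing about the set.
print(mutual_information(["R"] * 8, labels))
# Same residue mix in both sets: again no association.
print(mutual_information(["R", "R", "T", "T"] * 2, labels))
```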
Page 18
[Plots: MI and Entropy along the PB2 protein, for A2A (719 sequences) and H2H (1650 sequences)]
Spikes indicate characteristic sites
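The entropy trace in the plot is presumably the per-site Shannon entropy of the aligned sequences, which measures how variable each position is within a set. A minimal sketch under that assumption follows; the toy alignment is hypothetical and assumes sequences of equal length.

```python
from collections import Counter
from math import log2

def site_entropy(column):
    """Shannon entropy (in bits) of the residues observed at one site."""
    n = len(column)
    return sum((c / n) * log2(n / c) for c in Counter(column).values())

def entropy_profile(alignment):
    """Entropy of each column of a list of equal-length aligned sequences."""
    return [site_entropy(column) for column in zip(*alignment)]

# Hypothetical alignment: sites 1 and 3 are conserved, site 2 varies.
alignment = ["MEK", "MDK", "MEK", "MDK"]
profile = entropy_profile(alignment)
for position, value in enumerate(profile, start=1):
    print(position, round(value, 3))
```

Entropy describes variability within one set of sequences, whereas MI compares two sets, so the two traces carry different information about the same site.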
Page 19
RNP proteins: PB2
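PB2 (759 aa): 17 sites
Sites: 9, 44, 64, 81, 105, 199, 271, 292, 368, 475, 567, 588, 613, 627, 661, 674, 702
Residues at these sites, as extracted (per-site grouping lost):
A2A: DE M TITA IVA T A LR AE ASVAAV KDE
H2H: NT T AVM TS MV S MK TK TTII RN
[Figure: sites mapped onto PB2 functional regions: Nuclear Localization Signal, PB1 binding, NP binding, RNA cap binding]
http://www-micro.msb.le.ac.uk/3035/Orthomyxoviruses.html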
Page 20
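[Table: residues at characteristic sites in human H5N1 isolates, one row per isolate (year, country, isolate name); the first row gives the A2A residues and the last row the H2H residues. Spilled column label: PB1]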
A2A V T T T G S I R L Y Q V G V I R L R L R F Q D A F A I V S D P E S S L P D R S G V S L N A K E P A S S T V D A M T T A T I R L D A V E A A K
1997,HONG KONG,A/Hong Kong/156/97 V T T T G S V R F Y Q V G V I R M R L R F Q D V F E I V S E P E N S L P D R S G V S L N A K E L A N S T V D A M T T S T V R L E A V E T A K
1997,HONG KONG,A/Hong Kong/481/97 V T T T G S V R F Y Q V G V I R M R L R F Q D S F E I V S E P E N S L P D R S G V S L N A K E L A N S T V D A M T T A T V R L E A V E T A K
1997,HONG KONG,A/Hong Kong/482/97 V T T T G S V R F Y Q V G V I R M R L R F Q D V F E I V S E P E N S L P D R S G V S L N A K E L A N S T V N A M T T S T V R L E A V E T A K
1997,HONG KONG,A/Hong Kong/483/97 V T T T G S V R F Y Q V G V I R M R L R F Q D A F E I V S E P E N S L P D R S G V S L N A K E L A S S T V D A M T T A T V R L E A V K T A K
1997,HONG KONG,A/Hong Kong/486/97 V T T T G S V R F Y Q V G V I R M R L R F Q D V F E I V S E P E N S L P D R S G V S L N A K E L A N S T V D A M T T S T V R L E A V E T A K
1997,HONG KONG,A/Hong Kong/532/97 V T T T G S V R F Y Q V G V I R M R L R F Q D S F E I V S E P E N S L P D R S G V S L S A K E L A N S T V D A M T T A T V R L E A V E T A R
1997,HONG KONG,A/Hong Kong/538/97 V T T T G S V R F Y Q V G V I R M R L R F Q D A F E I V S E P E N S L P D R S G V S L N A K E L A N S T V D A M T T S T V R L E A V E T A K
1997,HONG KONG,A/Hong Kong/542/97 V T T T G S V R F Y Q V G V I R M R L R F Q D A F E I V S E P E N S L P D R S G V S L N A K E L A N S T V D A M T T A T V R L E A V E T A R
1998,HONG KONG,A/HongKong/97/98 V T T T G S V R F Y Q V G V I R M R L R F Q D S F E I V S E P E N S L P D R S G V S L N A K E L A S S T V D A M T T S T V R L E A V E T A K
2003,HONG KONG,A/HK/212/03 V T T G V I R L R L R F Q D S F A - - S G P E P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V E A A K
2003,HONG KONG,A/HK/213/03 V T T G V I R L R L R F Q D S F A - - S G P E P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V E A A K
2004,THAILAND,A/THAILAND/5(KK-494)/2004 V T T T E S V R L Y Q V G V I R L R L R F K D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V K A A K
2004,VIETNAM,A/Viet Nam/1194/2004 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V K A A K
2004,VIETNAM,A/Viet Nam/1203/2004 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P I S L P D R S G V S L N A K E S A S I T V D A I T A A T I R L D A V K A A K
2004,VIETNAM,A/Viet Nam/3046/2004 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V R L D A V E A A K
2004,VIETNAM,A/Viet Nam/3062/2004 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V K A A K
2004,VIETNAM,A/Vietnam/CL01/2004 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V K A A K
2004,VIETNAM,A/Vietnam/CL26/2004 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V K A A K
2005,INDONESIA,A/Indonesia/5/2005 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K
2005,INDONESIA,A/Indonesia/CDC184/2005 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K
2005,INDONESIA,A/Indonesia/CDC287E/2005 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K
2005,INDONESIA,A/Indonesia/CDC292T/2005 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K
2005,INDONESIA,A/Indonesia/CDC7/2005 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K
2005,THAILAND,A/Thailand/676/2005 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V K A A K
2005,VIETNAM,A/Vietnam/CL105/2005 V T T T E S V R L Y Q V G V I R L R L R F Q E A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V R L D A V K A A K
2005,VIETNAM,A/Vietnam/CL115/2005 V T T T E S V R L Y G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E F A S S T V D A I T A A T I R L D A V E A A K
2005,VIETNAM,A/Vietnam/CL2009/2005 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V K A A K
2006,CHINA,A/human/Zhejiang/16/2006 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G L E I S L D A T T T A T I R L D A V E A A K
2006,INDONESIA,A/Indonesia/CDC326/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K
2006,INDONESIA,A/Indonesia/CDC329/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K
2006,INDONESIA,A/Indonesia/CDC357/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K
2006,INDONESIA,A/Indonesia/CDC390/2006 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A R E S A S S T V D A I T T A T T R L D A V K A A K
2006,INDONESIA,A/Indonesia/CDC523/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G I S L N A K E S A S S T V D A I T T A T T R L D A V E A A K
2006,INDONESIA,A/Indonesia/CDC582/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G I S L N A K E S A S S T V D A I T T A T T R L D A V K A A K
2006,INDONESIA,A/Indonesia/CDC594/2006 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G I S L N A K E S A S S T V D A I T T A T T R L D A V E T A K
2006,INDONESIA,A/Indonesia/CDC595/2006 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G I S L N A K E S A S S T V D A I T T A T T R L D A V E T A K
2006,INDONESIA,A/Indonesia/CDC623/2006 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A N S T V D A I T T A T T R L D A V E A A K
2006,INDONESIA,A/Indonesia/CDC624E/2006 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K
2006,INDONESIA,A/Indonesia/CDC625/2006 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G I S L N A K E S A S S T V D A I T T A T T R L D A V E T A K
2006,INDONESIA,A/Indonesia/CDC634/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G I S L N A K E S A S S T V D A I T T A T T R L D A V E A A K
2006,INDONESIA,A/Indonesia/CDC699/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G I S L N A K E S A S S T V D A I T T A M T R L D A V E A A K
2006,INDONESIA,A/Indonesia/CDC742/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A V T T A T T R L D A V E A A K
H2H I A A I E N V L F H K A D I L V M K P K Y K G S V V M T P I T R N G F L N Q L D A C I Y S R D L S N I S I N S T M V S A T K M N I T K T T R
Protein column labels (order lost in extraction): NS2, PA, PB2, M1, M2, NP, NS1
H2H characteristic mutations in H5N1
Page 21
Ongoing Projects at ISS
InViDiA - Integrated Virus Diversity Analysis
Web-based tool for metadata-enabled diversity analysis
WADE - Web-based Aggregation and Display of Epitopes
Web-based tool for aggregating epitope predictions from multiple prediction systems
Page 22
Thanks to
Johns Hopkins University
Prof. J Thomas August
Dana-Farber Cancer Institute, Harvard
Dr. Vladimir Brusic
Dept. of Biochemistry, NUS
Prof. Tan Tin Wee
AT Heiny, Asif M Khan, Hu Yong Li
Institut Pasteur
Dr. Hervé Bourhy
Partial Grant Support:
National Institute of Allergy and Infectious Diseases, NIH
Grant No. 5 U19 AI56541, Contract No. HHSN2662-00400085C