Large-scale knowledge aggregation for infectious diseases

ASEAN-China International Bioinformatics Workshop, Singapore, 17th April 2008

Olivo Miotto

Institute of Systems Science and Yong Loo Lin School of Medicine, National University of Singapore



Page 2

Large-scale Research Questions

What can we learn from large-scale studies of pathogens?

Does H5N1 Avian influenza have pandemic potential?

What makes Human flu different from Avian flu?

What are stable potential immune epitopes to use as vaccine candidates for influenza?

How does each serotype of dengue differ from all others?

Page 3

Large-scale Research Questions


Large scale

Statistical evidence

Historical data

Systematic analysis

Page 4

We need Metadata!

Metadata = Descriptive data about sequences

If you want to compare avian vs human, you need host organism info

If you want conservation analysis, you need to have serotype and host information

If you want to study a period of virus evolution, you need date information

If you want a balanced dataset, you may need to filter according to country, date, subtype
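A minimal sketch of why each of these questions depends on metadata: once host, country, year and subtype are attached to every sequence, dataset selection becomes a simple filter. The `SequenceRecord` type, the `select` function and the accession values are hypothetical names invented for this illustration; they are not part of any tool described in the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SequenceRecord:
    """One sequence plus the descriptive metadata needed to select it."""
    accession: str            # hypothetical identifiers, for illustration only
    host: Optional[str]
    country: Optional[str]
    year: Optional[int]
    subtype: Optional[str]


def select(records, host=None, subtype=None, countries=None, years=None):
    """Keep records whose metadata match every criterion that was given.

    A record with a missing value fails any criterion on that field,
    which is why incomplete metadata shrinks the usable dataset.
    """
    kept = []
    for r in records:
        if host is not None and r.host != host:
            continue
        if subtype is not None and r.subtype != subtype:
            continue
        if countries is not None and r.country not in countries:
            continue
        if years is not None and (r.year is None or r.year not in years):
            continue
        kept.append(r)
    return kept


records = [
    SequenceRecord("X1", "Human", "Vietnam", 2004, "H5N1"),
    SequenceRecord("X2", "Avian", "Vietnam", 2004, "H5N1"),
    SequenceRecord("X3", "Human", "Indonesia", 2006, "H5N1"),
    SequenceRecord("X4", "Human", None, None, "H3N2"),
]

# Human H5N1 sequences collected in 2004 or 2005.
human_h5n1 = select(records, host="Human", subtype="H5N1", years=range(2004, 2006))
print([r.accession for r in human_h5n1])
```

Record X4 shows the cost of missing metadata: it can never be selected by country or year.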

Page 5

Knowledge Mining

[Diagram: knowledge mining workflow. Labels recovered from the figure, duplicates removed:]

H5N1 mutation map

Knowledge Aggregation

User-defined Dictionaries

User-defined Extraction Rules and Priorities

Cross-reference Identifiers

Identify mutations in H5N1 that characterize transmissibility amongst humans

User-defined Queries

Extract Desired Source Knowledge from Public Databases

Public Database Records

Conservation Analysis

Evidence of strain co-circulation

Viral Protein References

Identify Evolutionarily Stable Region across subgroups

Characteristic Mutations Analysis

Epitope Vaccine Candidates

Active Text Mining

Identify Biomedical literature with Cross-reactivity information

Documents with Cross-reactivity information

Curator's Knowledge

User-defined Patterns

Biomedical Text

Viral Sequence and Metadata

Previous Annotations

Page 6

Scalability in Bioinformatics Knowledge Mining

Integrative scalability

We need to integrate heterogeneous information from multiple data repositories with multiple purposes

Quantitative scalability

We need methods that can exploit and effectively explore large-scale data sets

Hierarchical scalability

We need to cascade analysis tasks, flowing knowledge from one task to the next

Page 7

Obstacles to Scalability

Heterogeneity of Biological Databases

Systemic: access to data in different databases

Syntactic: data formats, use of free text

Structural: different table structures in different databases

Semantic: data with different meaning and intent

Semantic Heterogeneity is particularly insidious

Data is rarely used in the way it was originally intended

Low level of end-user technical expertise

Biologists, not computer scientists

Excel spreadsheets, Web page “scraping”

Does not scale up

Page 8

Semantic Heterogeneity in GenBank

[Figure: GenBank record examples annotated "Good", "Not so Good" and "Pretty Bad"]

Page 9

Semantic Heterogeneity in GenBank

Fields (e.g. country/date) are inconsistently encoded

Inconsistent level of detail between databases

Inconsistent field location within different records of the same database

Implicit encoding of the data (e.g. within the title of a publication)

Multiple usage of the same field

Usage of the isolation_source field in different GenPept records:

AAT85667: /isolation_source="Homo sapiens"

BAC77216: /isolation_source="Samoa"

AAN74539: /isolation_source="isolated in 1993"
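The three isolation_source values on this slide carry a host, a country and a date respectively. A minimal sketch of how a program might decide what a given value means, assuming small user-supplied dictionaries and a year pattern, is shown here. The dictionaries, the pattern and the function name `interpret_isolation_source` are illustrative assumptions, not the extraction logic of any tool in the talk.

```python
import re

# Illustrative dictionaries; a real study would need far larger ones.
KNOWN_HOSTS = {"homo sapiens", "gallus gallus"}
KNOWN_COUNTRIES = {"samoa", "vietnam", "indonesia"}

# A four-digit year between 1800 and 2099, standing alone as a word.
YEAR_PATTERN = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def interpret_isolation_source(value):
    """Guess which kind of metadata an isolation_source value holds.

    Returns a (kind, extracted_value) pair, where kind is one of
    "host", "country", "year" or "unknown".
    """
    text = value.strip()
    lowered = text.lower()
    if lowered in KNOWN_HOSTS:
        return ("host", text)
    if lowered in KNOWN_COUNTRIES:
        return ("country", text)
    match = YEAR_PATTERN.search(text)
    if match:
        return ("year", match.group(1))
    return ("unknown", text)


for raw in ["Homo sapiens", "Samoa", "isolated in 1993"]:
    print(raw, "->", interpret_isolation_source(raw))
```

The same field yields three different metadata properties, which is the semantic heterogeneity the slide describes.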

Page 10

Influenza Large-Scale Studies

Analyze all influenza protein sequences available

GenBank + GenPept = 92,343 documents

Final dataset comprises 40,169 unique sequences

Various types of analysis, e.g.

Identify amino acid mutation sites that characterize human-transmissible strains

Compare the diversity of viral sequences over different periods of time and geographical areas

Several Metadata fields required

Protein name, Subtype, Isolate, Host, Country, Year

Manual Curation is not an Option!
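The reduction from 92,343 documents to 40,169 unique sequences implies that many documents describe the same sequence. The talk does not say how uniqueness was determined; the sketch below assumes the simplest criterion, exact identity of the amino acid string after normalizing case and whitespace. The function name and the document identifiers are hypothetical.

```python
from collections import defaultdict


def group_by_sequence(documents):
    """Group document identifiers by identical sequence.

    documents: iterable of (document_id, sequence) pairs.
    Returns a dict mapping each normalized sequence to the list of
    documents that contain it, so their metadata can later be merged.
    """
    groups = defaultdict(list)
    for document_id, sequence in documents:
        # Normalize so that formatting differences do not split a group.
        key = "".join(sequence.split()).upper()
        groups[key].append(document_id)
    return dict(groups)


documents = [
    ("DOC1", "MERIKELRDL"),
    ("DOC2", "merikelrdl"),    # same sequence, different case
    ("DOC3", "MERIKELRNL"),    # one residue differs
    ("DOC4", "MERIKEL RDL"),   # same sequence, stray whitespace
]

unique = group_by_sequence(documents)
print(len(documents), "documents ->", len(unique), "unique sequences")
```

Keeping the list of source documents per sequence matters because each document may contribute different metadata for the same sequence.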

Page 11

The Aggregator of Biological Knowledge

An end-user environment for data retrieval, extraction and analysis

Uses XML technology and structural rules to allow biologists to extract and reconcile the data needed

Wrapper framework provides access to multiple sources

Manages extracted results

Offers plug-in architecture for analysis tools

[Diagram, shown twice, once labeled "KDD System" and once labeled "ABK". Components: Public Repositories, Data Collection, Data Management, Data Analysis, Researcher. Arrow labels: input, filter, augment, query, manage, control.]

Page 12

ABK Structural Rules

Concise visualization of XML as name/value tree

Familiar presentation of metadata for biologists

Point-and-click selection of location and constraints

Automatic formation of XML Structural Rule

Hierarchical value reconciliation

Tabulated visualization and manual curation

RDF storage and output
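To make the idea of a structural rule concrete, here is a minimal sketch in which a rule is a location in the XML tree plus a constraint on a neighboring name element. The record layout, the accession "EX0001" and the `make_rule` helper are assumptions made for this illustration; ABK's actual rule syntax is not shown in the talk.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical record: qualifiers are name/value pairs
# nested under a feature, similar in spirit to sequence database XML.
RECORD_XML = """
<record accession="EX0001">
  <feature key="source">
    <qualifier><name>organism</name><value>Influenza A virus</value></qualifier>
    <qualifier><name>host</name><value>Homo sapiens</value></qualifier>
    <qualifier><name>country</name><value>Viet Nam</value></qualifier>
  </feature>
</record>
"""


def make_rule(feature_key, qualifier_name):
    """Build a rule from a location (feature/qualifier/value) and
    constraints (the feature key and the qualifier name)."""
    def rule(root):
        for feature in root.findall("feature"):
            if feature.get("key") != feature_key:
                continue
            for qualifier in feature.findall("qualifier"):
                if qualifier.findtext("name") == qualifier_name:
                    return qualifier.findtext("value")
        return None  # the rule does not apply to this record
    return rule


root = ET.fromstring(RECORD_XML)
host_rule = make_rule("source", "host")
print(host_rule(root))
```

A rule that returns None for a record is not an error; it simply leaves that property to be filled by another rule.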

Page 13

Data Extraction and Cleaning

DENV-1 sequences

Different rules (or different documents) produced conflicting values

User can fill in or override values

Values produced by user-defined rules
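The behavior on this slide (rule-produced values, conflicts, user overrides) can be sketched as a small reconciliation function. The priority scheme here, where the lowest rule number wins and disagreement is flagged for the curator, is an assumption for illustration; the talk names priorities and reconciliation but does not specify the algorithm.

```python
def reconcile(rule_values, override=None):
    """Pick one value for a metadata field from several rule outputs.

    rule_values: list of (priority, value) pairs; a lower priority
    number is assumed to mean a more trusted rule, and value is None
    when the rule did not match.
    override: a value typed in by the user, which always wins.
    """
    if override is not None:
        return {"value": override, "source": "user", "conflict": False}

    found = sorted(
        ((p, v) for p, v in rule_values if v is not None),
        key=lambda pair: pair[0],
    )
    if not found:
        # Nothing extracted: the user may fill the value in manually.
        return {"value": None, "source": None, "conflict": False}

    priority, value = found[0]
    distinct_values = {v for _, v in found}
    return {
        "value": value,
        "source": f"rule {priority}",
        # Flag disagreement so it can be reviewed in the table view.
        "conflict": len(distinct_values) > 1,
    }


outputs = [(1, None), (2, "Thailand"), (3, "Viet Nam")]
print(reconcile(outputs))
print(reconcile(outputs, override="Viet Nam"))
```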

Page 14

Rule performance

[Figure: bar charts of rule performance, one panel per metadata field, each with a 0% to 100% axis and a genbank/genpept legend. Panels: Subtype (rules 1 to 3), Isolate Name (rules 1 to 3), Host (rules 1 to 4), Origin (rules 1 to 6), Year (rules 1 to 5).]

Multiple rules often needed

Some properties are very fragmented

Page 15

Can H5N1 viruses spread amongst humans?

Page 16

The Antigenic Variability Analyzer (AVANA)

Page 17

Using MI to detect Characteristic Sites

At a characteristic site, the residue observed is strongly associated with a set of sequences

E.g.: Arg -> Avian, Thr -> Human

This association is explored by measuring the mutual information (MI) of:

The residue observed at a site

The label of the set in which it is observed

MI is in the range 0 to 1.0

MI = 0.0 -> the residue observed is statistically independent of the set in which it occurs

MI = 1.0 -> Residues observed in one set are never observed in the other, and vice versa
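A minimal sketch of the MI calculation for one alignment site, using the standard definition in bits (log base 2). The function name and the toy residue lists are illustrative. Note one assumption: with two set labels, plain MI cannot exceed the entropy of the labels, so it reaches exactly 1.0 only when the two sets are the same size and share no residues; the talk does not say whether AVANA normalizes or resamples to account for unequal set sizes.

```python
import math
from collections import Counter


def mutual_information(residues_a2a, residues_h2h):
    """MI (in bits) between the residue at a site and the set label.

    residues_a2a, residues_h2h: the residues observed at one alignment
    position in each of the two sets of sequences.
    """
    joint = Counter()
    for residue in residues_a2a:
        joint[(residue, "A2A")] += 1
    for residue in residues_h2h:
        joint[(residue, "H2H")] += 1

    total = sum(joint.values())
    residue_totals = Counter()
    label_totals = Counter()
    for (residue, label), count in joint.items():
        residue_totals[residue] += count
        label_totals[label] += count

    mi = 0.0
    for (residue, label), count in joint.items():
        p_joint = count / total
        p_residue = residue_totals[residue] / total
        p_label = label_totals[label] / total
        mi += p_joint * math.log2(p_joint / (p_residue * p_label))
    return mi


# Fully characteristic site: Arg only in one set, Thr only in the other.
print(mutual_information(["R"] * 10, ["T"] * 10))
# Uninformative site: both sets show the same residue mix.
print(mutual_information(["R", "T"] * 5, ["R", "T"] * 5))
```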

Page 18

[Figure: MI and entropy plotted by position along the PB2 protein; sets compared: A2A (719 sequences) and H2H (1650 sequences)]

Spikes indicate characteristic sites
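The entropy track in such a plot can be sketched as the Shannon entropy of each alignment column. This is the textbook definition in bits; the function names and the toy alignment are illustrative, and the talk does not state which entropy variant or gap handling AVANA uses.

```python
import math
from collections import Counter


def site_entropy(residues):
    """Shannon entropy (bits) of the residues seen at one position."""
    counts = Counter(residues)
    total = len(residues)
    return sum(
        (n / total) * math.log2(total / n)  # p * log2(1 / p)
        for n in counts.values()
    )


def entropy_profile(aligned_sequences):
    """Entropy at every column of equal-length aligned sequences."""
    return [site_entropy(column) for column in zip(*aligned_sequences)]


# Toy alignment: column 0 is fully conserved, column 1 is evenly split.
print(entropy_profile(["AR", "AT"]))
```

A conserved position scores 0, and the score rises as more residues share the position more evenly.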

Page 19

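RNP proteins: PB2

[Figure: the 17 characteristic sites of PB2 (759 aa), at positions 9, 44, 64, 81, 105, 199, 271, 292, 368, 475, 567, 588, 613, 627, 661, 674 and 702, with the residues observed in the A2A and H2H sets, drawn against functional regions labeled Nuclear Localization Signal, PB1 binding, NP binding and RNA cap binding. Residue labels as extracted, not aligned to sites: "DE M TITA IVA T A LR AE ASVAAV KDE" and "NT T AVM TS MV S MK TK TTII RN".]

http://www-micro.msb.le.ac.uk/3035/Orthomyxoviruses.html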

Page 20

PB1

A2A V T T T G S I R L Y Q V G V I R L R L R F Q D A F A I V S D P E S S L P D R S G V S L N A K E P A S S T V D A M T T A T I R L D A V E A A K

1997,HONG KONG,A/Hong Kong/156/97 V T T T G S V R F Y Q V G V I R M R L R F Q D V F E I V S E P E N S L P D R S G V S L N A K E L A N S T V D A M T T S T V R L E A V E T A K

1997,HONG KONG,A/Hong Kong/481/97 V T T T G S V R F Y Q V G V I R M R L R F Q D S F E I V S E P E N S L P D R S G V S L N A K E L A N S T V D A M T T A T V R L E A V E T A K

1997,HONG KONG,A/Hong Kong/482/97 V T T T G S V R F Y Q V G V I R M R L R F Q D V F E I V S E P E N S L P D R S G V S L N A K E L A N S T V N A M T T S T V R L E A V E T A K

1997,HONG KONG,A/Hong Kong/483/97 V T T T G S V R F Y Q V G V I R M R L R F Q D A F E I V S E P E N S L P D R S G V S L N A K E L A S S T V D A M T T A T V R L E A V K T A K

1997,HONG KONG,A/Hong Kong/486/97 V T T T G S V R F Y Q V G V I R M R L R F Q D V F E I V S E P E N S L P D R S G V S L N A K E L A N S T V D A M T T S T V R L E A V E T A K

1997,HONG KONG,A/Hong Kong/532/97 V T T T G S V R F Y Q V G V I R M R L R F Q D S F E I V S E P E N S L P D R S G V S L S A K E L A N S T V D A M T T A T V R L E A V E T A R

1997,HONG KONG,A/Hong Kong/538/97 V T T T G S V R F Y Q V G V I R M R L R F Q D A F E I V S E P E N S L P D R S G V S L N A K E L A N S T V D A M T T S T V R L E A V E T A K

1997,HONG KONG,A/Hong Kong/542/97 V T T T G S V R F Y Q V G V I R M R L R F Q D A F E I V S E P E N S L P D R S G V S L N A K E L A N S T V D A M T T A T V R L E A V E T A R

1998,HONG KONG,A/HongKong/97/98 V T T T G S V R F Y Q V G V I R M R L R F Q D S F E I V S E P E N S L P D R S G V S L N A K E L A S S T V D A M T T S T V R L E A V E T A K

2003,HONG KONG,A/HK/212/03 V T T G V I R L R L R F Q D S F A - - S G P E P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V E A A K

2003,HONG KONG,A/HK/213/03 V T T G V I R L R L R F Q D S F A - - S G P E P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V E A A K

2004,THAILAND,A/THAILAND/5(KK-494)/2004 V T T T E S V R L Y Q V G V I R L R L R F K D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V K A A K

2004,VIETNAM,A/Viet Nam/1194/2004 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V K A A K

2004,VIETNAM,A/Viet Nam/1203/2004 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P I S L P D R S G V S L N A K E S A S I T V D A I T A A T I R L D A V K A A K

2004,VIETNAM,A/Viet Nam/3046/2004 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V R L D A V E A A K

2004,VIETNAM,A/Viet Nam/3062/2004 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V K A A K

2004,VIETNAM,A/Vietnam/CL01/2004 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V K A A K

2004,VIETNAM,A/Vietnam/CL26/2004 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V K A A K

2005,INDONESIA,A/Indonesia/5/2005 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K

2005,INDONESIA,A/Indonesia/CDC184/2005 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K

2005,INDONESIA,A/Indonesia/CDC287E/2005 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K

2005,INDONESIA,A/Indonesia/CDC292T/2005 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K

2005,INDONESIA,A/Indonesia/CDC7/2005 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K

2005,THAILAND,A/Thailand/676/2005 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V K A A K

2005,VIETNAM,A/Vietnam/CL105/2005 V T T T E S V R L Y Q V G V I R L R L R F Q E A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V R L D A V K A A K

2005,VIETNAM,A/Vietnam/CL115/2005 V T T T E S V R L Y G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E F A S S T V D A I T A A T I R L D A V E A A K

2005,VIETNAM,A/Vietnam/CL2009/2005 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T A A T I R L D A V K A A K

2006,CHINA,A/human/Zhejiang/16/2006 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G L E I S L D A T T T A T I R L D A V E A A K

2006,INDONESIA,A/Indonesia/CDC326/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K

2006,INDONESIA,A/Indonesia/CDC329/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K

2006,INDONESIA,A/Indonesia/CDC357/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K

2006,INDONESIA,A/Indonesia/CDC390/2006 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A R E S A S S T V D A I T T A T T R L D A V K A A K

2006,INDONESIA,A/Indonesia/CDC523/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G I S L N A K E S A S S T V D A I T T A T T R L D A V E A A K

2006,INDONESIA,A/Indonesia/CDC582/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G I S L N A K E S A S S T V D A I T T A T T R L D A V K A A K

2006,INDONESIA,A/Indonesia/CDC594/2006 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G I S L N A K E S A S S T V D A I T T A T T R L D A V E T A K

2006,INDONESIA,A/Indonesia/CDC595/2006 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G I S L N A K E S A S S T V D A I T T A T T R L D A V E T A K

2006,INDONESIA,A/Indonesia/CDC623/2006 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A N S T V D A I T T A T T R L D A V E A A K

2006,INDONESIA,A/Indonesia/CDC624E/2006 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A I T T A T T R L D A V E A A K

2006,INDONESIA,A/Indonesia/CDC625/2006 V T T T E S V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G I S L N A K E S A S S T V D A I T T A T T R L D A V E T A K

2006,INDONESIA,A/Indonesia/CDC634/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G I S L N A K E S A S S T V D A I T T A T T R L D A V E A A K

2006,INDONESIA,A/Indonesia/CDC699/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G I S L N A K E S A S S T V D A I T T A M T R L D A V E A A K

2006,INDONESIA,A/Indonesia/CDC742/2006 V T A T E I V R L Y Q V G V I R L R L R F Q D A F A - - S G P E I S L P D R S G V S L N A K E S A S S T V D A V T T A T T R L D A V E A A K

H2H I A A I E N V L F H K A D I L V M K P K Y K G S V V M T P I T R N G F L N Q L D A C I Y S R D L S N I S I N S T M V S A T K M N I T K T T R

NS2 PA PB2 M1 M2 NP NS1

H2H characteristic mutations in H5N1

Page 21

Ongoing Projects at ISS

InViDiA - Integrated Virus Diversity Analysis

Web-based tool for metadata-enabled diversity analysis

WADE - Web-based Aggregation and Display of Epitopes

Web-based tool for aggregating epitope predictions from multiple prediction systems

Page 22

Thanks to

Johns Hopkins University: Prof. J Thomas August

Dana-Farber Cancer Institute, Harvard: Dr. Vladimir Brusic

Dept. of Biochemistry, NUS: Prof. Tan Tin Wee, AT Heiny, Asif M Khan, Hu Yong Li

Institut Pasteur: Dr. Hervé Bourhy

Partial Grant Support: National Institute of Allergy and Infectious Diseases, NIH

Grant No. 5 U19 AI56541, Contract No. HHSN2662-00400085C