Functional Annotation

Episode 2: Preliminary Results. The Group, 27th Feb 2012. Lavanya Rishishwar, Artika Nath, Lu Wang, Haozheng Tian, Shengyun Peng, Ashwath Kumar, Hamidreza Hassanzadeh

Page 1: Functional Annotation

Functional Annotation

Episode 2: Preliminary Results

The Group

27th Feb 2012

Lavanya Rishishwar, Artika Nath, Lu Wang, Haozheng Tian, Shengyun Peng, Ashwath Kumar, Hamidreza Hassanzadeh

Page 2: Functional Annotation

Recap

• What is Functional Annotation
• The Importance of Functional Annotation
• The Biology of H. haemolyticus
• Background for Functional Annotation
• Pros/Cons of Available Approaches
• Planned Approach
– Breadth
– Depth

Page 3: Functional Annotation

Flowchart


Page 4: Functional Annotation

Flowchart


Page 5: Functional Annotation

PRELIMINARY RESULTS


Page 6: Functional Annotation

Subject Organisms

fucK: encoding fuculose kinase. fucK deletion has been observed in some Hi isolates.
Hpd: encoding a lipoprotein, protein D.

Strain | Species | Disease State | State Isolated | Hemolysis | Hpd | fuculose-kinase
M19107 | H. haemolyticus | Asymptomatic | Minnesota | Y | - | -
M19501 | H. haemolyticus | Asymptomatic | Minnesota | N | + | -
M21127 | H. haemolyticus | Pathogenic | Georgia | Y | - | -
M21621 | H. haemolyticus | Pathogenic | Texas | Y | - | -
M21639 | H. haemolyticus | Pathogenic | Illinois | N | - | -
M21709 | H. influenzae | Pathogenic | NY | N | - | +

Page 7: Functional Annotation

BLAST: Output and Parsing

• Once the results are received from the gene prediction tools, they are BLASTed against different databases.
• Selected threshold: 0.005
• This is done automatically, for all 6 strains, by ad hoc scripts that use the BioPerl library.
• The results are then processed and certain metrics extracted for further analysis.
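The filtering step can be illustrated with a minimal sketch. The group's scripts were written with BioPerl; the Python below is only an illustration. It assumes the BLAST results are in the standard 12-column tabular format and that the 0.005 threshold is an E-value cutoff, and the function name `best_hits` is hypothetical.

```python
import csv
import io

EVALUE_CUTOFF = 0.005  # threshold from the slide; assumed here to be an E-value cutoff


def best_hits(tabular_text, cutoff=EVALUE_CUTOFF):
    """Return {query_id: (subject_id, evalue, bitscore)} with the best hit per query.

    Assumes the standard 12-column BLAST tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > cutoff:
            continue  # hit is weaker than the chosen threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy input with made-up identifiers, not real results.
example = (
    "gene_1\tsp|P1|A\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200\n"
    "gene_1\tsp|P2|B\t40.0\t100\t60\t0\t1\t100\t1\t100\t0.5\t30\n"
    "gene_2\tsp|P3|C\t35.0\t80\t52\t0\t1\t80\t1\t80\t0.01\t28\n"
)
print(best_hits(example))
```

Keeping only the highest-scoring hit per query is one possible design; a real pipeline might keep all hits under the cutoff for the later coverage analysis.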

Page 8: Functional Annotation


Page 9: Functional Annotation


Page 10: Functional Annotation

BLAST v/s UniProt: Coverage

Organism | # of unique organisms in the hits
M19107 | 2338
M19501 | 2332
M21127 | 2360
M21621 | 2364
M21639 | 2433
M21709 | 2154
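A count like the one in this table could be produced along the following lines. This is a sketch, not the group's script: it assumes each hit carries a UniProt-style description containing an `OS=<organism>` field, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Capture the text after "OS=" up to the next two-letter field tag (e.g. " GN=") or end of line.
OS_FIELD = re.compile(r"OS=(.+?)(?=\s[A-Z]{2}=|$)")


def organism_of(description):
    """Extract the organism name from a UniProt-style description, or None."""
    match = OS_FIELD.search(description)
    return match.group(1).strip() if match else None


def summarize(descriptions):
    """Return (number of unique organisms, Counter of hits per genus)."""
    organisms = [o for o in map(organism_of, descriptions) if o]
    genera = Counter(o.split()[0] for o in organisms)
    return len(set(organisms)), genera


# Toy descriptions for illustration only.
hits = [
    "sp|P0A7G6|RECA_ECOLI Protein RecA OS=Escherichia coli (strain K12) GN=recA PE=1 SV=2",
    "sp|X00001|A_HAEIN Protein A OS=Haemophilus influenzae (strain ATCC 51907) GN=a PE=1 SV=1",
    "sp|X00002|B_HAEIN Protein B OS=Haemophilus influenzae (strain ATCC 51907) GN=b PE=3 SV=1",
]
unique_count, genus_counts = summarize(hits)
print(unique_count, genus_counts.most_common())
```

The per-genus counter corresponds to the kind of breakdown shown in the pie charts for individual strains.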

Page 11: Functional Annotation

BLAST v/s UniProt: M19107

[Pie chart of the genera among the hits. Slice labels: Pasteurella, Ralstonia, Lactobacillus, Mus, Coxiella, Homo, Xylella, Legionella, Klebsiella, Erwinia, Arabidopsis, Rickettsia, Brucella, Rhizobium, Bordetella, Actinobacillus, Acinetobacter, Francisella, Clostridium, Mycobacterium, Buchnera, Neisseria, Xanthomonas, Streptococcus, Shigella, Haemophilus, Bacillus, Vibrio, Staphylococcus, Burkholderia, Yersinia, Shewanella, Pseudomonas, Salmonella, Escherichia, Others.]

Page 12: Functional Annotation

BLAST v/s UniProt: M21709

[Pie chart of the genera among the hits. Slice labels: Listeria, Homo, Coxiella, Legionella, Erwinia, Xylella, Klebsiella, Arabidopsis, Rickettsia, Brucella, Rhizobium, Bordetella, Actinobacillus, Acinetobacter, Francisella, Clostridium, Mycobacterium, Buchnera, Neisseria, Xanthomonas, Streptococcus, Shigella, Vibrio, Burkholderia, Staphylococcus, Haemophilus, Bacillus, Yersinia, Shewanella, Pseudomonas, Salmonella, Escherichia, Others.]

Page 13: Functional Annotation

CONSERVED DOMAIN DATABASE (CDD)

Page 14: Functional Annotation

Introduction

• CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins.

• These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST.

• The PSSMs are meant to be used for compiling RPS-BLAST search databases only.

Page 15: Functional Annotation

RPS-BLAST

• Reverse Position-Specific BLAST
• It searches a query sequence against a database of profiles (the opposite of PSI-BLAST).
• It uses a pre-computed lookup table for the profiles so that the search proceeds faster (architecture dependent).
• The CD-Search databases for RPS-BLAST: ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/

Page 16: Functional Annotation

Strategy

Page 17: Functional Annotation
Page 18: Functional Annotation

FORMATRPSDB

• Formatrpsdb is a utility that converts a collection of input sequences into a database suitable for use with RPS-Blast.

• Formatrpsdb is designed to perform the work of formatdb, makemat and copymat simultaneously, without generating the large number of intermediate files these utilities would need to create an RPS Blast database.

Page 19: Functional Annotation

Build Database

[Screenshot callouts describing the options used:]
• Title for database file
• Input file containing list of ASN.1 Scoremat filenames
• Create index files for database
• Threshold for extending hits for RPS database
• For scoremats that contain only residue frequencies, the scaling factor to apply when creating PSSMs
• Base name of output database

Page 20: Functional Annotation

RUN RPS-BLAST

Page 21: Functional Annotation

Results for CDD: COGs

Organism: M19107

[Chart not preserved in the transcript; the only surviving label is ">10".]

Page 22: Functional Annotation

Results for CDD: COGs

Organism: M21709

[Chart not preserved in the transcript; the only surviving label is ">10".]

Page 23: Functional Annotation

LipoP


Page 24: Functional Annotation

LipoP

• LipoP classifies genes into 4 classes:
– SpI: Signal peptide I
– SpII: Lipoprotein signal peptide
– TMH: N-terminal transmembrane helix (not very reliable; it is used to avoid TMHs being falsely predicted as signal peptides)
– CYT: Cytoplasmic (all the rest)
• The classification system in LipoP uses an HMM with four branches, one each for SpI, SpII, TMH and CYT.
• Protein sets for training and testing were extracted from SWISS-PROT.
• They consisted of lipoproteins, SPase I-cleaved proteins and cytoplasmic proteins from the two Gram-negative phyla Proteobacteria and Spirochetes.
• Transmembrane proteins were taken from the phyla Proteobacteria and Gracilicutes.

Page 25: Functional Annotation

Output Example

# M19107_final_1488 SpI score=11.1193 margin=11.320213 cleavage=31-32
# Cut-off=-3
M19107_final_1488 LipoP1.0:Best SpI 1 1 11.1193
M19107_final_1488 LipoP1.0:Margin SpI 1 1 11.320213
M19107_final_1488 LipoP1.0:Class CYT 1 1 -0.200913
M19107_final_1488 LipoP1.0:Class SpII 1 1 -1.80091
M19107_final_1488 LipoP1.0:Signal CleavI 31 32 11.119 # PISHA|SDLNQ
M19107_final_1488 LipoP1.0:Signal CleavI 30 31 -2.18348 # SPISH|ASDLN
M19107_final_1488 LipoP1.0:Signal CleavII 19 20 -1.80091 # TALFS|CGLLI Pos+2=G

Columns:
1. Sequence ID.
2. Type of prediction. "Best" is the highest-scoring class, "Margin" gives the difference between the best score and the second-best score, "Class" gives the score of the other classes, and "Signal" lines contain predicted cleavage sites.
3. Feature type.
4. Location in the sequence. For lines with a class prediction it is always 1. For cleavage sites it is the last amino acid of the signal peptide, before the predicted cleavage site.
5. Location, as above, except that for cleavage sites it is the first amino acid after the cleavage site.
6. Score. For the "Margin" type it is the difference between the best and the second-best class score.
7. For cleavage sites the ±5 context is shown after the #, and for lipoprotein cleavage sites the amino acid in position +2 is also shown, which may determine whether the lipoprotein is attached to the inner or the outer membrane. An aspartic acid (D) in position +2 after the cleavage site means that the lipoprotein is attached to the inner membrane; most other lipoproteins are attached to the outer membrane ("Testing the '+2 rule' for lipoprotein sorting in the Escherichia coli cell envelope with a new genetic selection", Seydel et al. (1999) Molecular Microbiology 34: 810-821).
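The column description above can be turned into a small parser. The sketch below is illustrative only: it assumes the whitespace-separated layout of the example output, and the names `parse_lipop` and `inner_membrane` are hypothetical.

```python
SAMPLE = """\
# M19107_final_1488 SpI score=11.1193 margin=11.320213 cleavage=31-32
# Cut-off=-3
M19107_final_1488 LipoP1.0:Best SpI 1 1 11.1193
M19107_final_1488 LipoP1.0:Margin SpI 1 1 11.320213
M19107_final_1488 LipoP1.0:Class CYT 1 1 -0.200913
M19107_final_1488 LipoP1.0:Class SpII 1 1 -1.80091
M19107_final_1488 LipoP1.0:Signal CleavI 31 32 11.119 # PISHA|SDLNQ
M19107_final_1488 LipoP1.0:Signal CleavI 30 31 -2.18348 # SPISH|ASDLN
M19107_final_1488 LipoP1.0:Signal CleavII 19 20 -1.80091 # TALFS|CGLLI Pos+2=G
"""


def parse_lipop(text):
    """Group LipoP output lines by sequence ID."""
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # summary and cut-off lines are skipped
        body, _, comment = line.partition("#")
        fields = body.split()
        if len(fields) < 6:
            continue
        seq_id, kind, feature, start, end, score = fields[:6]
        kind = kind.rpartition(":")[2]  # "LipoP1.0:Best" -> "Best"
        rec = results.setdefault(
            seq_id, {"best": None, "margin": None, "classes": {}, "sites": []}
        )
        if kind == "Best":
            rec["best"] = (feature, float(score))
        elif kind == "Margin":
            rec["margin"] = float(score)
        elif kind == "Class":
            rec["classes"][feature] = float(score)
        elif kind == "Signal":
            rec["sites"].append(
                (feature, int(start), int(end), float(score), comment.strip())
            )
    return results


def inner_membrane(site_comment):
    """Apply the '+2 rule': aspartic acid (D) at position +2 suggests inner membrane."""
    return "Pos+2=D" in site_comment


record = parse_lipop(SAMPLE)["M19107_final_1488"]
print(record["best"], record["margin"], record["sites"][0])
```

In the example, the margin equals the best score minus the next-best class score (11.1193 minus -0.200913), which matches the reported 11.320213.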

Page 26: Functional Annotation

Results

Strain | Species | SpI | SpII | Inner Membrane Lipoproteins | TMH | CYT | Total
M19107 | Hh | 164 | 54 | 2 | 241 | 1470 | 1929
M19501 | Hh | 176 | 60 | 3 | 228 | 1293 | 1757
M21127 | Hh | 174 | 67 | 3 | 244 | 1564 | 2049
M21621 | Hh | 178 | 64 | 2 | 244 | 1413 | 1899
M21639 | Hh | 194 | 82 | 4 | 267 | 2072 | 2615
M21709 | Hi | 144 | 53 | 2 | 225 | 1383 | 1805

Page 27: Functional Annotation

SignalP

Page 28: Functional Annotation

Biological background

• Many different types of secretory signals exist. SignalP focuses on predicting classical signal peptides, by far the most common type, which are cleaved by signal peptidase I (SPase I).

• In bacteria, proteins carrying a signal peptide are targeted directly to the cell membrane.

Page 29: Functional Annotation

SignalP

• SignalP 3.0 was the best-performing method in a comparison with PrediSi, SPEPlip, Signal-CF, Signal-3L and Signal-BLAST (Choo, K., Tan, T. & Ranganathan, S. BMC Bioinformatics 10, S2 (2009)).

• SignalP 4.0 improves on it further and was therefore included in our method ("SignalP 4.0: discriminating signal peptides from transmembrane regions", Thomas Nordahl Petersen, et al. Nature Methods, 8:785-786, 2011).

Page 30: Functional Annotation

SignalP

• SignalP 4.0 is a purely neural network–based method.

• Two types of networks in SignalP 4.0:
– SignalP-TM networks
– SignalP-noTM networks

• The decision on which network to use: if SignalP-TM predicts four or more positions as being transmembrane positions, SignalP-TM is used for the final prediction; otherwise SignalP-noTM is used.
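The selection rule can be written down in a few lines. This is only a sketch of the rule as stated on the slide, not SignalP's code; the function name and its argument are hypothetical.

```python
def choose_network(tm_positions_predicted, threshold=4):
    """Pick the network for the final prediction.

    tm_positions_predicted: number of positions that the SignalP-TM networks
    label as transmembrane (hypothetical input for this sketch).
    """
    if tm_positions_predicted >= threshold:
        return "SignalP-TM"
    return "SignalP-noTM"


for n in (0, 3, 4, 12):
    print(n, choose_network(n))
```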

Page 31: Functional Annotation

Results from SignalP

Organism | No. Signal Pep | Total Genes | Percentage
M19107 | 144 | 1929 | 7.47%
M19501 | 150 | 1757 | 8.54%
M21127 | 152 | 2049 | 7.42%
M21621 | 151 | 1899 | 7.95%
M21639 | 178 | 2615 | 6.81%
M21709 | 122 | 1805 | 6.76%

Page 32: Functional Annotation

Comparison between LipoP and SignalP

• The results obtained from LipoP and SignalP were compared with the help of a script.

• Both SpI and SpII were taken from LipoP and all the positive outputs were taken from SignalP.

• They were also analyzed for similar cleavage sites.
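A comparison script of this kind might look like the sketch below. It is not the group's script: it assumes each tool's positive predictions have already been reduced to a mapping from gene ID to predicted cleavage position, and all names and toy values are hypothetical.

```python
def compare(lipop_sites, signalp_sites):
    """Compare two {gene_id: cleavage_position} mappings.

    lipop_sites:   genes LipoP placed in SpI or SpII
    signalp_sites: genes SignalP reported as having a signal peptide
    """
    lipop, signalp = set(lipop_sites), set(signalp_sites)
    common = lipop & signalp
    # A site is "consistent" when both tools predict the same cleavage position.
    consistent = {g for g in common if lipop_sites[g] == signalp_sites[g]}
    return {
        "lipop_unique": len(lipop - signalp),
        "signalp_unique": len(signalp - lipop),
        "common": len(common),
        "consistent_sites": len(consistent),
        "conflicting_sites": len(common - consistent),
    }


# Toy data for illustration only.
lipop_toy = {"g1": 31, "g2": 19, "g3": 22}
signalp_toy = {"g1": 31, "g2": 21, "g4": 25}
print(compare(lipop_toy, signalp_toy))
```

With real data, the "common" count splits into consistent and conflicting cleavage sites.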

Page 33: Functional Annotation

Comparison table

The columns LipoP Unique, SignalP Unique and Common count genes predicted to have signal peptides; Consistent Sites and Conflicting Sites count the cleavage sites detected.

Organism | LipoP Unique | SignalP Unique | Common | Negatives | Total # of Genes | Consistent Sites | Conflicting Sites
M19107 | 75 | 1 | 143 | 1710 | 1929 | 112 | 31
M19501 | 86 | 0 | 150 | 1521 | 1757 | 115 | 35
M21127 | 89 | 0 | 152 | 1808 | 2049 | 114 | 38
M21621 | 91 | 0 | 151 | 1657 | 1899 | 117 | 34
M21639 | 100 | 2 | 176 | 2337 | 2615 | 126 | 50
M21709 | 75 | 0 | 122 | 1608 | 1805 | 93 | 29

Page 34: Functional Annotation

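[Venn diagrams, one per strain, of genes predicted to have signal peptides by SignalP and LipoP (LipoP unique / common / SignalP unique): M19107 75 / 143 / 1; M19501 86 / 150 / 0; M21639 100 / 176 / 2; M21127 89 / 152 / 0; M21621 91 / 151 / 0; M21709 75 / 122 / 0.]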

Page 35: Functional Annotation

Comparison between LipoP and SignalP

• Bottom line: as the Venn diagrams show, SignalP provided little new information compared with LipoP.

Page 36: Functional Annotation

TMHMM

Prediction of transmembrane helices in proteins

Page 37: Functional Annotation

TMHMM

Organism No. Transmembrane Helices Total Genes Percentage

M19107 392 1929 20.32%

M19501 385 1757 21.91%

M21127 417 2049 20.35%

M21621 413 1899 21.75%

M21639 464 2615 17.74%

M21709 361 1805 20.00%

Page 38: Functional Annotation
Page 39: Functional Annotation

Member signature databases

Similar coverage in size; different content

Member Database | Focus/Features
PFAM | divergent domains
PROSITE | functional sites
PRINTS | hierarchical definitions from superfamily to subfamily levels
TIGRFAMs | builds HMMs for functionally equivalent proteins
PIRSF | produces HMMs over the full length of a protein and uses protein length restrictions to gather family members
HAMAP | profiles manually created by expert curators; they identify proteins that are part of well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies
PANTHER | builds HMMs based on the divergence of function within families
SUPERFAMILY | structure-based; uses SCOP as a basis for building HMMs
GENE3D | structure-based; uses the CATH superfamilies as a basis for building HMMs

Page 40: Functional Annotation

Querying with InterProScan

About
• A wrapper of sequence analysis applications
• Database and output file scanning
• Bulk data processing
• Efficient (parallel) internal architecture

[Diagram labels: InterProScan, Query Sequence]

Page 41: Functional Annotation

Querying with InterProScan

• Input
– Nucleotide* or protein sequences
– Recognized sequence formats: raw, FASTA or EMBL
– Reformat and translate (if necessary)

*Nucleotide sequences are translated and scanned in all 6 frames without any further assumption.

Page 42: Functional Annotation

Querying with InterProScan

• Running InterProScan [screenshot; surviving caption fragment: "at <60s"]

Page 43: Functional Annotation

Querying with InterProScan

Page 44: Functional Annotation

Querying with InterProScan

• Output
– InterProScan makes results available in the following formats: raw, ebixml, xml, txt, html

• Parse InterProScan output (BioPerl)
– Bio::SeqIO::interpro

• Interpretation of output data (example)

Page 45: Functional Annotation

Querying with InterProScan

Key | Interpretation
10683_1_ORF1 | the ID of the input sequence
024307F93E501F2C | the crc64 (checksum) of the protein sequence (supposed to be unique)
404 | the length of the sequence (in AA)
HMMPfam | the analysis method launched
PF03453 | the member database entry for this match
MoeA_N | the member database description for the entry
1 | the start of the domain match
163 | the end of the domain match
1.49999999999999999E-56 | the E-value of the match (reported by the member database method)
T | the status of the match (T: true, ?: unknown)
26-Feb-12 | the date of the run
IPR005110 | the corresponding InterPro entry (if iprlookup was requested by the user)
MoeA, N-terminal and linker domain | the description of the InterPro entry
Biological Process: molybdopterin cofactor biosynthetic process (GO:0032324) | the GO (gene ontology) description for the InterPro entry
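The field list above maps directly onto a record parser. The group parsed the output with BioPerl; the Python sketch below is an illustration only. It assumes the raw format is tab-separated with the fields in the order listed, and the field names are hypothetical labels chosen for this example.

```python
FIELDS = [
    "sequence_id", "crc64", "length", "method", "member_entry",
    "member_description", "start", "end", "evalue", "status",
    "run_date", "interpro_entry", "interpro_description", "go_terms",
]


def parse_raw_line(line):
    """Turn one tab-separated raw-format line into a dict keyed by FIELDS."""
    record = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    for key in ("length", "start", "end"):
        record[key] = int(record[key])
    try:
        record["evalue"] = float(record["evalue"])
    except ValueError:
        record["evalue"] = None  # some methods may not report a numeric E-value
    return record


# Example line assembled from the values shown on the slide.
example_line = "\t".join([
    "10683_1_ORF1", "024307F93E501F2C", "404", "HMMPfam", "PF03453", "MoeA_N",
    "1", "163", "1.49999999999999999E-56", "T", "26-Feb-12", "IPR005110",
    "MoeA, N-terminal and linker domain",
    "Biological Process: molybdopterin cofactor biosynthetic process (GO:0032324)",
])
hit = parse_raw_line(example_line)
print(hit["sequence_id"], hit["member_entry"], hit["start"], hit["end"], hit["evalue"])
```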

Page 46: Functional Annotation

Preliminary Results

M19107

Total proteins searched: 1,769

Matched: 1,716

Unmatched: 378

Total hits: 12,393

[Chart labels as extracted: 533251,391]

Page 47: Functional Annotation

Next Up

• Major Challenge: Funneling all the annotation information into a consolidated GenBank/GFF3 entry.

• Level 2!


Page 48: Functional Annotation

Level 2

Operons, Virulence Factors and Metabolic Pathways

Page 49: Functional Annotation

VIRULENCE

Likelihood of a pathogen causing disease

Page 50: Functional Annotation

H. haemolyticus

• As the species name implies, H. haemolyticus is generally hemolytic on blood agar plates.

• The beta-hemolytic phenotype is routinely used in the clinical setting to distinguish H. haemolyticus from NTHi.

• Nonhemolytic H. haemolyticus strains are being isolated and misidentified as NTHi.

Gene(s) encoding hemolysin: unknown (Xin Wang, Meningitis Laboratory, CDC)

Photograph from MicrobeLibrary.org

Page 51: Functional Annotation

Subject Organisms

fucK: encoding fuculose kinase. fucK deletion has been observed in some Hi isolates.
Hpd: encoding a lipoprotein, protein D.

Strain | Species | Disease State | State Isolated | Hemolysis | Hpd | fuculose-kinase
M19107 | H. haemolyticus | Asymptomatic | Minnesota | Y | - | -
M19501 | H. haemolyticus | Asymptomatic | Minnesota | N | + | -
M21127 | H. haemolyticus | Pathogenic | Georgia | Y | - | -
M21621 | H. haemolyticus | Pathogenic | Texas | Y | - | -
M21639 | H. haemolyticus | Pathogenic | Illinois | N | - | -
M21709 | H. influenzae | Pathogenic | NY | N | - | +

Page 52: Functional Annotation

Virulence factors

• The traits, encoded by virulence genes, that equip pathogenic microbes to cause infection.

HOW?
– Attach selectively to host tissues
– Colonize parts of the host body
– Gain access to nutrients by invading or destroying host tissues
– Avoid host defenses

• Virulence factors include:
– Bacterial toxins
– Cell surface proteins that mediate bacterial attachment
– Cell surface carbohydrates and proteins that protect a bacterium
– Hydrolytic enzymes that may contribute to the pathogenicity of the bacterium

Page 53: Functional Annotation

VFDB: Virulence Factor Database

• Set up in 2004
• Up-to-date information on validated VFs from 24 genera of medically important bacterial pathogens
• Detailed tabular comparison of virulence composition in terms of virulence genes and their composition
• Multiple alignment and statistical analysis of homologous VFs
• Graphical comparison of virulence genes
• VFs:
– Adhesion & invasion
– Bacterial secretion systems & effectors
– Toxins
– Iron-acquisition system
• Pathogenicity island

Page 54: Functional Annotation

Operon and Pathway Analysis

• As was pointed out by Alejandro Caro, usually a missing gene in an otherwise complete pathway reflects a hole in the annotation process.

• This path serves to fill such holes in the annotation process.
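The idea of a pathway "hole" can be sketched as a set difference. This is a deliberately simplified illustration, not the hole-filling method of any particular tool; the pathway and gene names are placeholders.

```python
def pathway_holes(pathways, annotated_genes, max_missing=1):
    """Report pathways that are complete except for a few missing genes.

    pathways:        {pathway_name: list of genes the pathway requires}
    annotated_genes: genes found by the annotation so far
    max_missing:     how many absent genes still count as a "hole"
    """
    annotated = set(annotated_genes)
    holes = {}
    for name, required in pathways.items():
        missing = set(required) - annotated
        if 0 < len(missing) <= max_missing:
            holes[name] = sorted(missing)  # candidates for re-annotation
    return holes


# Placeholder data: pathway_X lacks one gene, pathway_Y is complete,
# pathway_Z is mostly absent and so is not treated as a hole.
pathways = {
    "pathway_X": ["geneA", "geneB", "geneC"],
    "pathway_Y": ["geneC", "geneD"],
    "pathway_Z": ["geneE", "geneF", "geneG"],
}
annotated = ["geneA", "geneC", "geneD", "geneE"]
print(pathway_holes(pathways, annotated))
```

Genes reported this way are candidates to search for again among the unannotated or weakly annotated predictions.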


Page 55: Functional Annotation

DOOR (Database of prOkaryotic OpeRons)

Page 56: Functional Annotation

• DOOR (Database of prOkaryotic OpeRons) is an operon database developed by the Computational Systems Biology Lab (CSBL) at UGA. The operons in the database are predicted computationally.

• As of 2009, DOOR was the largest operon database available.

• Its prediction algorithm is reported to perform consistently best on all measures, including sensitivity and specificity, with an overall accuracy of ~90%.

• DOOR currently has operons for 971 prokaryotic genomes.

• Although most operons in DOOR are not experimentally verified, the database also provides some limited literature information, extracted from ODB.

Page 57: Functional Annotation

FOUR STRAINS IN DOOR

Page 58: Functional Annotation
Page 59: Functional Annotation

Strategy

Page 60: Functional Annotation

THE PATHWAY TOOLS

Page 61: Functional Annotation

A Glance at the End of Annotation

Database

Enable
• Browsing of Annotated Genes
• Analysis of pathways

Page 62: Functional Annotation

"Do not use a DBMS when the initial investment in hardware, software, and training is too high.”

- Shamkant Navathe, Georgia Institute of Technology

Page 63: Functional Annotation

The Pathway Tools

"Pathway Tools is a production-quality software environment for creating a type of model- organism database called a Pathway/Genome Database (PGDB)"

Page 64: Functional Annotation

The Pathway Tools

• Prediction
– Metabolic pathways
– Metabolic pathway hole filler
– Operons
• Curating
• PGDB web service
– Publish PGDB
– Query
– Visualization
• Metabolic Network Analysis

Page 65: Functional Annotation

WHY "The Pathway Tools"?

• Pros
– BioCyc Tier 1 and Tier 2 databases are highly curated
– Enables editing (curation) and querying of PGDBs
• Cons
– BioCyc has fewer genomes than other databases
– Some tools are only available in the local version (e.g. PathoLogic)

Page 66: Functional Annotation

The Pathway Tools

• Prediction
– Metabolic pathways
– Metabolic pathway hole filler
– Operons
• Curating
• PGDB web service
– Publish PGDB
– Query
– Visualization
• Metabolic Network Analysis

[Highlighted on this slide: PathoLogic]

Page 67: Functional Annotation

The Pathway Tools Local Version (GUI)

Page 68: Functional Annotation

PathoLogic

Inputs and outputs of the computational inference modules within PathoLogic