
Post on 12-Jan-2016


ISQuest: Finding Insertion Sequences in Prokaryotic Sequence

Fragment Data & Applications

Presented By: Abhishek Biswas, Department Of Computer Science, ODU

Presentation For: Hampton University Interview, 2nd round

Date: 06/18/2015

Advisors:

David Gauthier, Department of Biological Sciences

Desh Ranjan, Department Of Computer Science

Mohammad Zubair, Department of Computer Science


Presentation Outline

Biological Preliminaries

Repeat Structures, Mobile Elements and Insertion Sequences (IS)

ISQuest: Finding Insertion Sequences[1]

Applications of ISQuest Tool

Comparative Genomics

Correlative Algorithm for Repeat Placement (CARP)

Algorithm, Experiment & Results

[1] Biswas A., Gauthier D., Ranjan D., Zubair M., ISQuest: Finding Insertion Sequences in Prokaryotic Sequence Fragment Data, Bioinformatics, Accepted, June 2015

What is a genome?

A genome is an organism’s complete DNA

Collection of genes

Polypeptide codes

Genes, Operons

Non-coding region

Transcription: DNA to mRNA

Translation: mRNA to Protein

Bases: Adenine (A), Cytosine (C), Thymine (T), Guanine (G)

Start Codon: AUG/GUG

Stop Codon: UAA/UAG/UGA
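The start and stop codons listed above delimit a coding region on the mRNA. A minimal sketch of that idea, assuming a single forward-strand scan; the function name `first_orf` and the example sequence are illustrative and not part of the talk:

```python
START_CODONS = {"AUG", "GUG"}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def first_orf(mrna):
    """Return the first start-to-stop reading frame in `mrna`, or None."""
    for i in range(len(mrna) - 2):
        if mrna[i:i + 3] not in START_CODONS:
            continue
        # Walk codon by codon, in frame with the start codon.
        for j in range(i + 3, len(mrna) - 2, 3):
            if mrna[j:j + 3] in STOP_CODONS:
                return mrna[i:j + 3]
    return None


print(first_orf("CCAUGGCUUAAGG"))
```

Real gene finders also scan the reverse strand and apply length and scoring filters, which this sketch leaves out.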

The DNA Sequencing Process

Multiple Copies of the Genome

Randomly Cut Pieces, Size ~3-4 Kbp

DNA Circularization

Linker

Mate pairs: ~400bp reads at a known distance

Single reads: ~400bp, orientation unknown

Depth of Coverage
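Depth of coverage is usually estimated as the total number of sequenced bases divided by the genome size. A small sketch of that standard estimate; the function names and the 4 Mbp genome size are illustrative assumptions, not figures from the talk:

```python
import math


def depth_of_coverage(num_reads, read_length, genome_size):
    """Expected average number of reads covering each base."""
    return num_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Number of reads required to reach a target depth, rounded up."""
    return math.ceil(target_depth * genome_size / read_length)


# Hypothetical 4 Mbp genome sequenced with 400 bp reads.
print(depth_of_coverage(300_000, 400, 4_000_000))
print(reads_needed(30, 400, 4_000_000))
```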

Genome Assembly Process

Correctly ordering the short sequence fragments

Overlap information

Mate-pair links
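Overlap information can be illustrated with an exact suffix-prefix match between two fragments. This is only a toy sketch: real assemblers tolerate sequencing errors and use indexed data structures, and the names `suffix_prefix_overlap` and `merge` are assumptions made for the example.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def merge(a, b, min_len=3):
    """Join two fragments on their overlap, or return None if they do not overlap."""
    k = suffix_prefix_overlap(a, b, min_len)
    return a + b[k:] if k else None


print(merge("ACGTTGCA", "TGCAGGT"))
```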


Genome Assembly Output

Ideally the complete genome sequence

Most assemblies are incomplete

Repeat sequences (e.g. Insertion Sequences)

Error in sequencing process

Contiguous sequences (Contigs)/scaffolds returned

Assemblers terminate at ambiguous points


Assembly Validation

Often requires manual validation

View depth of coverage

Locate areas where mate pairs are stretched or compressed

Design primers and validate the joins

Requires manual effort and is time consuming

Design of primers may be problematic
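The stretched or compressed check can be sketched by comparing the distance observed between two mates in the assembly with the distance expected from the library. The 25% tolerance and the function name are assumptions for illustration only:

```python
def classify_mate_pair(observed, expected, tolerance=0.25):
    """Label a mate pair by how its assembled distance compares to the library distance."""
    low = expected * (1 - tolerance)
    high = expected * (1 + tolerance)
    if observed < low:
        return "compressed"
    if observed > high:
        return "stretched"
    return "ok"


# Hypothetical 3 Kbp mate-pair library.
for distance in (3000, 5000, 1000):
    print(distance, classify_mate_pair(distance, 3000))
```

Regions where many pairs are flagged the same way are the ones worth inspecting by hand.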


Hawkeye Manual Assembly Validation

M. Schatz et al., 2007

Repeat Structures or Mobile Genetic Elements

Repeat structure is a repeating segment of DNA sequence

Biologically significant

Copies not identical

Examples

Insertion Sequences

Transposase (pseudogenes)

Ribosomal RNA is coded by a large number of identical genes that are tandemly repeated to form one or more clusters.

Mobile genetic sequences

Create close copies at different locations

Mobile Elements and Insertion Sequences (IS)

[Figure: three ways an Insertion Sequence can sit in the genome. Intragenic insertion (pseudogene): the IS interrupts Gene X_1. Intergenic insertion: the IS lies between Gene X_1 and Gene X_2. Insertion replacing parts of two genes: the IS replaces parts of Gene X_1 and Gene X_2.]

Why are MGEs important?

Horizontal Gene Transfer

Cause of interesting evolutionary traits not explained by reproduction

Comparative genomics

Genome Assembly

Most assemblers generate incomplete assemblies because of repeat sequences (e.g. Insertion Sequences)

Contribution of ISQuest

Annotate partial repeat structures

MGEs often degenerate during transposition

Requires no prior assembly or annotation of ORFs

Though using a draft assembly improves assembly time

Available on SourceForge

https://sourceforge.net/projects/isquest/


ISQuest Algorithm (1): Obtaining Seed Sequences

Input sequences (Reads/Contigs)

MegaBLAST against local “nt” Database

Select BLAST query sequences whose hits have IS annotations in their GenBank files

Generate seed sequence library
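The seed selection step can be sketched as a filter over BLAST hits. The sketch assumes each hit has already been reduced to a (query id, annotation text) pair, and the keyword pattern used to recognise an IS annotation is a guess for illustration, not the rule ISQuest applies:

```python
import re

# Hypothetical keywords that mark an annotation as IS-related.
IS_PATTERN = re.compile(r"insertion sequence|transposase|\bIS\d+", re.IGNORECASE)


def select_seed_ids(hits):
    """Return query ids, in first-seen order, with at least one IS-annotated hit."""
    seeds = []
    seen = set()
    for query_id, annotation in hits:
        if query_id not in seen and IS_PATTERN.search(annotation):
            seen.add(query_id)
            seeds.append(query_id)
    return seeds


hits = [
    ("read1", "IS6110 transposase"),
    ("read2", "DNA polymerase III"),
    ("read1", "insertion sequence IS1245"),
    ("read3", "Insertion sequence element"),
]
print(select_seed_ids(hits))
```

The sequences behind the selected ids would then form the seed sequence library.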


ISQuest Algorithm (2): Extending Seed Sequences

Assemble raw reads to ends

Find boundary

Generate new seed sequence library
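The slide does not say how the boundary is found. One simple stand-in is a greedy extension that adds one base at a time while the overlapping reads agree, and stops where support runs out or the reads disagree. The function name, the minimum overlap, and the toy reads are all assumptions:

```python
def extend_right(seed, reads, min_overlap=4, max_steps=1000):
    """Greedily extend `seed` to the right using reads whose prefix matches its suffix."""
    for _ in range(max_steps):
        next_bases = set()
        for read in reads:
            # k is the overlap length; read[k] is the base the read proposes next.
            for k in range(min_overlap, min(len(seed), len(read) - 1) + 1):
                if seed.endswith(read[:k]):
                    next_bases.add(read[k])
        if len(next_bases) != 1:
            break  # no support or conflicting support: treat this as the boundary
        seed += next_bases.pop()
    return seed


print(extend_right("ACGTAC", ["GTACGG", "TACGGT", "CGGTTA"]))
```

Extension to the left works the same way on reversed strings. A conflict between reads is what one would expect at the edge of a repeat, where different genomic copies continue into different flanking sequence.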


ISQuest: Output Sequences

ISQuest: Example

Experimental Setup

All sequenced bacterial genomes

3810 Genomic Sequences

We simulated the DNA fragmentation process using the ART[1] simulator

Assembled all read libraries into draft assemblies

Applied ISQuest to find Insertion Sequences

Verified against GenBank annotations

[1]Huang, W., et al. (2012) ART: a next-generation sequencing read simulator, Bioinformatics, 28, 593-594.

Performance of Repeat Quest for ISs

Robinson DG, Lee M-C, and Marx CJ. OASIS: an automated program for global investigation of bacterial and archaeal insertion sequences. Nucleic Acids Research, 2012

[Figure panels: 70% Length Match, 80% Length Match, 90% Length Match]
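The talk does not define "length match". One plausible reading is the fraction of an annotated IS that a predicted IS covers. Under that assumption, with intervals as (start, end) pairs that exclude the end position:

```python
def length_match(predicted, annotated):
    """Fraction of the annotated interval covered by the predicted interval."""
    overlap = min(predicted[1], annotated[1]) - max(predicted[0], annotated[0])
    return max(0, overlap) / (annotated[1] - annotated[0])


def passes(predicted, annotated, thresholds=(0.7, 0.8, 0.9)):
    """Report which length-match thresholds a prediction satisfies."""
    fraction = length_match(predicted, annotated)
    return {t: fraction >= t for t in thresholds}


# Hypothetical 1000 bp annotated IS, of which the prediction recovers 850 bp.
print(passes((150, 1000), (0, 1000)))
```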


Applications : Comparative Genomics


Phylogenetic Tree of M. marinum

Built using kSNP[5]

[5] Gardner SN, Hall BG (2013) When Whole-Genome Alignments Just Won't Work: kSNP v2 Software for Alignment-Free SNP Discovery and Phylogenetics of Hundreds of Microbial Genomes.


Orthogonal Clustering

Phylogenetic tree of 42 Mycobacterium marinum strains

Clustering based on IS sequences

ISQuest to generate IS sequences for each strain

Clustering using CDHit[4]

Find core IS elements

[4] "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik Bioinformatics, (2006) 22:1658-9.
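Once the IS sequences from all strains have been clustered, core IS elements can be picked out from the cluster membership. The sketch assumes "core" means a cluster with at least one member from every strain, and that the clustering output has already been parsed into a dictionary; neither detail is stated in the talk:

```python
def core_clusters(clusters, strains):
    """Return ids of clusters that contain a sequence from every strain."""
    all_strains = set(strains)
    return sorted(
        cluster_id
        for cluster_id, members in clusters.items()
        if {strain for strain, _ in members} >= all_strains
    )


# Hypothetical clusters: cluster id -> list of (strain, sequence id).
clusters = {
    "c1": [("s1", "a"), ("s2", "b"), ("s3", "c")],
    "c2": [("s1", "d"), ("s3", "e")],
    "c3": [("s2", "f"), ("s2", "g"), ("s1", "h"), ("s3", "i")],
}
print(core_clusters(clusters, ["s1", "s2", "s3"]))
```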


Application : Genome Assembly


Genome Assembly Process

Ordering the overlapping short sequence fragments

Reconstruct the source genome


Computationally Guided Draft Assembly

Correlative Algorithm for Repeat Placement (CARP)

Correct repeat placement

Repeat elements identified manually/computationally

Currently use only repeating Insertion Sequences

Adding lines of evidence to joins

Matched repeat elements

Mate-pair evidence

Synteny (gene organization) with reference genomes

User can look at all the evidence for a join

Correlative Algorithm for Repeat Placement (CARP)

Current version of CARP

Works with repeating insertion sequences

Identifies the insertion points

Joins contigs based on insertion sequences and gene synteny

Input to CARP

A set of high confidence contigs

A library of insertion sequences

One or more reference organisms

Step 1: Annotating the Contig Ends

Find the partial repeating IS at the contig ends

Assemblers terminate as repeat cannot be resolved

Therefore, contigs end in repeat regions

Annotation uses MegaBLAST for matching

[Figure: Contigs C1 ... Cn with unknown partial repeats at their ends are matched against Insertion Sequences IS1 ... ISn, giving end-annotated contigs C1 ... Ck with annotated partial repeats]
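Step 1 can be sketched as a pass over the MegaBLAST hits of IS library sequences against the contigs, keeping only hits that touch a contig end. The hit tuple layout, the `slack` allowance, and the function name are assumptions for illustration:

```python
def annotate_ends(contig_lengths, hits, slack=10):
    """Record which IS, if any, matches the left and right end of each contig.

    hits: (contig, start, end, is_name) in 0-based contig coordinates, end exclusive.
    """
    ends = {contig: {"left": None, "right": None} for contig in contig_lengths}
    for contig, start, end, is_name in hits:
        if start <= slack:
            ends[contig]["left"] = is_name
        if contig_lengths[contig] - end <= slack:
            ends[contig]["right"] = is_name
    return ends


contig_lengths = {"C1": 5000, "C2": 3000}
hits = [("C1", 0, 300, "IS4"), ("C1", 4800, 5000, "IS2"), ("C2", 1000, 1400, "IS1")]
print(annotate_ends(contig_lengths, hits))
```

A hit in the middle of a contig, like the IS1 hit on C2 here, is ignored because it does not explain why the assembler stopped.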


Step 2: Identifying Insertion Type

Classify based on two types of insertions possible

Intragenic: insertion within a gene

Intergenic: insertion not within a gene

Database of genes from reference organism

Intergenic: closest neighboring gene

[Figure: Contig 1 ending in IS4 and IS2, with the 200 bp flanks searched by MegaBLAST against the database of genes. Labels: Gene X match, intragenic insertion; Gene Y match, intergenic insertion; no match]
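A rough sketch of Step 2, assuming the 200 bp of contig sequence next to the end IS is what gets searched, that a flank falling inside a reference gene marks an intragenic insertion, and that otherwise the insertion is intergenic and tagged with the closest neighboring gene. The BLAST search itself is not shown; its outcome is passed in as plain values, and all names are hypothetical:

```python
FLANK = 200  # flank length in bp, taken from the figure


def flank_next_to_is(contig_seq, is_start, is_end, side):
    """Return up to FLANK bp of contig sequence adjacent to an IS at a contig end."""
    if side == "left":  # IS sits at the left end, so the flank lies to its right
        return contig_seq[is_end:is_end + FLANK]
    return contig_seq[max(0, is_start - FLANK):is_start]


def classify_insertion(gene_containing_flank, closest_neighboring_gene):
    """Label the insertion from the outcome of the gene database search."""
    if gene_containing_flank is not None:
        return ("intragenic", gene_containing_flank)
    return ("intergenic", closest_neighboring_gene)


contig = "T" * 100 + "C" * 250 + "G" * 100  # toy contig with an IS in the last 100 bp
print(len(flank_next_to_is(contig, 350, 450, "right")))
print(classify_insertion("geneX", None))
print(classify_insertion(None, "geneY"))
```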


Step 3: Matching contigs

Match contigs to be joined

IS and orientation of IS must match

Intragenic: Interrupted gene must be same

Intergenic: Gene synteny with reference organism

Matches are accepted if genes are within threshold

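[Figure: Intragenic Insertion Sequence Match: Contig 1 and Contig 4 both end in IS1, and their flanks together form a complete gene. Intergenic Insertion Sequence Match: Contig 1 and Contig 4 both end in IS1, and their flanks have close BLAST hits in the reference organism.]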
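The matching rules of Step 3 can be written as a predicate over two annotated contig ends. The record fields, the use of a single reference coordinate per end, and the 5000 bp default threshold are assumptions; the talk only says genes must be "within threshold":

```python
from dataclasses import dataclass


@dataclass
class ContigEnd:
    contig: str
    is_name: str
    is_strand: str  # "+" or "-"
    kind: str       # "intragenic" or "intergenic"
    gene: str       # interrupted gene, or closest neighboring gene
    ref_pos: int    # position of that gene in the reference organism


def ends_match(a, b, max_ref_distance=5000):
    """Decide whether two contig ends are candidates for a join."""
    if a.contig == b.contig:
        return False
    if a.is_name != b.is_name or a.is_strand != b.is_strand:
        return False
    if a.kind != b.kind:
        return False
    if a.kind == "intragenic":
        return a.gene == b.gene  # both ends must interrupt the same gene
    # Intergenic: neighboring genes must be close together in the reference.
    return abs(a.ref_pos - b.ref_pos) <= max_ref_distance


a = ContigEnd("C1", "IS1", "+", "intragenic", "geneX", 1000)
b = ContigEnd("C4", "IS1", "+", "intragenic", "geneX", 1000)
print(ends_match(a, b))
```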


Step 4: Applying Mate-Pair Evidence

Mate-pair evidence to validate or discard joins

Priority to strong mate-pair evidence

Major rearrangements must not be missed

Adjustable threshold values

Valid Mate-pairs | CARP Match            | Join
>20              | N.A. (Added Evidence) | Accepted
<20              | Y                     | Accepted
<3               | Y                     | Further Review
None             | Y                     | Rejected*

* Unless the mate-pair distances are too small to cover IS gap in which case the join should be reviewed further.
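The decision table can be expressed as a small function. Two points are assumptions because the table is silent on them: a count of exactly 20 is treated like the "<20" row, and joins with fewer than 20 mate pairs and no CARP match are reported as not covered. The parameter names are hypothetical:

```python
def join_decision(valid_mate_pairs, carp_match, strong=20, weak=3, library_spans_gap=True):
    """Apply the mate-pair evidence table to one candidate join."""
    if valid_mate_pairs > strong:
        return "accepted"  # mate pairs suffice; a CARP match is only added evidence
    if not carp_match:
        return "not covered"  # the table has no row for this case
    if valid_mate_pairs == 0:
        # Footnote: if the library cannot span the IS gap, missing pairs prove nothing.
        return "rejected" if library_spans_gap else "further review"
    if valid_mate_pairs < weak:
        return "further review"
    return "accepted"


for count in (25, 10, 2, 0):
    print(count, join_decision(count, carp_match=True))
```

The `strong` and `weak` arguments stand in for the adjustable thresholds mentioned on the slide.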


Viewer Showing Joins


Experiment Setup

We selected 2 bacteria with a large number of repeating ISs

Bacillus halodurans 

Mycobacterium marinum M

Simulated sequencing read libraries

3Kbp mate-pair libraries

Mean 450bp read length

30x coverage

Assembled using Celera WGS[2, 3] assembler


Results

De novo assembly of M. shottsii

Three 454 read libraries (3, 8 kbp paired-end)

Celera WGS assembler generated 42 scaffolds

6 repeating Insertion Sequences identified

CARP reduced number of scaffolds to 17

Organism      | Contigs | Celera Scaffolds | CARP Scaffolds (a) | CARP Scaffolds (b) | Incorrect Joins (Celera, CARP)
B. halodurans | 857     | 92               | 38                 | 17                 | (16,12,15)
M. marinum M  | 773     | 48               | 30                 | 27                 | (5,3,3)

(a) strict CARP thresholds; (b) weaker CARP thresholds


Summary

Computationally Guided Draft Assembly

Joins based on 3 kinds of evidence

Matched insertion sequences

Gene synteny with one or more reference organisms

Mate-pair information

Generates list of all joins

Each join annotated with evidence

User can assess confidence of join

We show that CARP provides valuable assembly validation


Future Work

ISQuest Algorithm

Metagenomics read handling

Incorporate capability to correctly place any repeat element

RepeatQuest


References

1. M. Schatz, A. Phillippy, B. Shneiderman, and S. Salzberg, "Hawkeye: an interactive visual analytics tool for genome assemblies," Genome Biology, vol. 8, no. 3, R34, March 9, 2007. doi:10.1186/gb-2007-8-3-r34

2. Celera Assembler - http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page

3. E. W. Myers, G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo, M. J. Flanigan, S. A. Kravitz, C. M. Mobarry, K. H. J. Reinert, K. A. Remington, E. L. Anson, R. A. Bolanos, H.-H. Chou, C. M. Jordan, A. L. Halpern, S. Lonardi, E. M. Beasley, R. C. Brandon, L. Chen, P. J. Dunn, Z. Lai, Y. Liang, D. R. Nusskern, M. Zhan, Q. Zhang, X. Zheng, G. M. Rubin, M. D. Adams, and J. C. Venter, "A Whole-Genome Assembly of Drosophila," Science, vol. 287, pp. 2196-2204, March 24, 2000.


Desh Ranjan and Mohammad Zubair, Department of Computer Science

ISQuest: Finding Insertion Sequences

David Gauthier, Department of Biological Sciences

Collaboration: Department of Computer Science at ODU (Jing He, Desh Ranjan, M. Zubair)

Support: NSF-DBI-135662; ODU startup fund, MSF fund and M&S Fellowship; NSF HRD-0420407

Students: Dong Si (PhD, 2015); Lin Chen (current PhD student); Maryam Arab (current MS student)

Protein Structure Research Group

Thank You
