high performance computational analysis of dna sequences from different environments

High performance computational analysis of DNA sequences from different environments Rob Edwards Computer Science Biology edwards.sdsu.edu www.theseed.org

Upload: ursula-moreno

Post on 03-Jan-2016



TRANSCRIPT

Page 1: High  performance computational  analysis of DNA sequences  from  different environments

High performance computational analysis of

DNA sequences from different environments

Rob Edwards

Computer Science, Biology

edwards.sdsu.edu www.theseed.org

Page 2

Outline

• There is a lot of sequence

• Tools for analysis

• More computers

• Can we speed analysis

Page 3

How much has been sequenced?

[Figure: number of known sequences (y-axis) by year (x-axis). Marked milestones: first bacterial genome, 100 bacterial genomes, 1,000 bacterial genomes, environmental sequencing.]

Page 4

How much will be sequenced?

[Figure labels: everybody in San Diego; everybody in USA; all cultured Bacteria; 100 people; one genome from every species; most major microbial environments.]

Page 5

Metagenomics (Just sequence it)

200 liters water, 5-500 g fresh fecal matter, 50 g soil

Sequence

Epifluorescent Microscopy

Concentrate and purify bacteria, viruses, etc

Extract nucleic acids

Publish papers

Page 6

The SEED Family

Page 7

The metagenomics RAST server

Page 8

Automated Processing

Page 9

Summary View

Page 10

Metagenomics Tools: Annotation & Subsystems

Page 11

Metagenomics Tools: Phylogenetic Reconstruction

Page 12

Metagenomics Tools: Comparative Tools

Page 13

Outline

• There is a lot of sequence

• Tools for analysis

• More computers

Page 14

How much data so far

986 metagenomes

79,417,238 sequences

17,306,834,870 bp (17 Gbp)

Average: ~15-20 Mbp per metagenome

~300 GS20, ~300 FLX, ~300 Sanger
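The averages implied by these totals can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes the per-sample average on the slide refers to base pairs per metagenome.

```python
# Totals as reported on the slide.
N_METAGENOMES = 986
N_SEQUENCES = 79_417_238
TOTAL_BP = 17_306_834_870


def mean_bp_per_metagenome(total_bp: int, n_metagenomes: int) -> float:
    """Average number of base pairs contributed by each metagenome."""
    return total_bp / n_metagenomes


def mean_read_length(total_bp: int, n_sequences: int) -> float:
    """Average length, in base pairs, of a single sequence read."""
    return total_bp / n_sequences


if __name__ == "__main__":
    per_metagenome = mean_bp_per_metagenome(TOTAL_BP, N_METAGENOMES)
    per_read = mean_read_length(TOTAL_BP, N_SEQUENCES)
    print(f"{per_metagenome / 1e6:.1f} Mbp per metagenome")
    print(f"{per_read:.0f} bp per sequence")
```

The per-metagenome value falls inside the ~15-20 Mbp range quoted on the slide, and the mean read length comes out at a little over 200 bp.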

Page 15

Computes

Page 16

Linear compute complexity

Page 17

Just waiting …

Page 18

Overall compute time

[Figure: hours of compute time (y-axis) against input size in MB (x-axis).]

~19 hours of compute per input megabyte

Page 19

How much so far

986 metagenomes

79,417,238 sequences

17,306,834,870 bp (17 Gbp)

Average: ~15-20 Mbp per metagenome

Compute time (on a single CPU):

328,814 hours = 13,700 days = 38 years

~300 GS20, ~300 FLX, ~300 Sanger
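The single-CPU estimate is consistent with applying the rate of ~19 hours of compute per input megabyte to the total amount of sequence. The sketch below reproduces the figure under two assumptions that the slides do not state: one base pair is counted as one byte, and only whole megabytes (10^6 bytes) are counted. The function name is hypothetical.

```python
HOURS_PER_MB = 19            # rate quoted in the slides
TOTAL_BP = 17_306_834_870    # total sequence quoted in the slides
BYTES_PER_MB = 1_000_000     # assumption: decimal megabytes, 1 bp = 1 byte


def single_cpu_hours(total_bp: int, hours_per_mb: int = HOURS_PER_MB) -> int:
    """Linear cost model: compute hours grow in proportion to input size."""
    whole_megabytes = total_bp // BYTES_PER_MB
    return whole_megabytes * hours_per_mb


if __name__ == "__main__":
    hours = single_cpu_hours(TOTAL_BP)
    days = hours / 24
    years = days / 365
    print(f"{hours:,} hours = {days:,.0f} days = {years:.0f} years")
```

Because the model is linear, doubling the amount of sequence doubles the compute time on a fixed number of CPUs.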

Page 20

Outline

• There is a lot of sequence

• Tools for analysis

• More computers

• Can we speed analysis

Page 21

Shannon’s Uncertainty

• Shannon’s Uncertainty – Peter’s surprisal

p(x_i) is the probability of the occurrence of each base or string
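The equation on this slide did not survive in the transcript. The standard definition of Shannon's uncertainty is H = -sum_i p(x_i) log2 p(x_i), and the sketch below assumes that this is the form intended. It estimates p(x_i) from the observed frequency of each base (k = 1) or each overlapping string of length k in a sequence. The function names and the choice of log base 2 are assumptions for illustration.

```python
import math
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shannon_uncertainty(seq: str, k: int = 1) -> float:
    """Shannon uncertainty, in bits, of the k-mer distribution of seq."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # sequence shorter than k: nothing to measure
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    print(shannon_uncertainty("ACGTACGTACGT"))       # four equally likely bases
    print(shannon_uncertainty("AAAAAAAAAAAA"))       # a single repeated base
    print(shannon_uncertainty("ACGTACGTACGT", k=3))  # strings of length 3
```

A sequence that uses all four bases equally often has the maximum single-base uncertainty of 2 bits, while a homopolymer has none.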

Page 22

Surprisal in Sequences

Page 23

Uncertainty Correlates With Similarity

Page 24

But it’s not just randomness…

Page 25

Which has more surprisal: coding regions or non-coding regions?

Uncertainty in complete genomes

[Figure panels: coding regions; non-coding regions]

Page 26

More extreme differences with 6-mers

[Figure panels: coding regions; non-coding regions]

Page 27

Can we predict proteins

• Short sequences of 100 bp

• Translate into 30-35 amino acids

• Can we predict which are real and could be doing something?

• Test with bacterial proteins
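As a rough illustration of the translation step, a 100 bp read can be translated in all six reading frames, giving peptides of 32 or 33 amino acids, in line with the 30-35 quoted above. The sketch assumes the standard genetic code and marks any codon containing an ambiguous base as X. The slides do not say which code or which frames were used, and the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str, frame: int = 0) -> str:
    """Translate complete codons starting at offset frame (0, 1 or 2)."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq: str) -> list[str]:
    """Three forward frames followed by three reverse-complement frames."""
    strands = (seq, reverse_complement(seq))
    return [translate(strand, frame) for strand in strands for frame in range(3)]


if __name__ == "__main__":
    read = "ATGGCTAAAGGTCGTCTGAACGAT" * 4 + "ATGC"  # made-up 100 bp read
    for peptide in six_frame_translations(read):
        print(len(peptide), peptide)
```

Each translated fragment could then be scored, for example by its amino acid composition, to ask whether it looks like a real protein.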

Page 28

Kullback-Leibler Divergence

Difference between two probability distributions

Difference between amino acid composition and average amino acid composition

Calculate KLD for 372 bacterial genomes
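A minimal sketch of this calculation uses the usual definition D(P || Q) = sum_i p_i log2(p_i / q_i), where P is the amino acid composition of one genome and Q is the average composition. The pseudocount, the log base, and all function names are assumptions made for illustration. The slides do not describe how the compositions were estimated or averaged.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(proteins: list[str], pseudocount: float = 1.0) -> dict[str, float]:
    """Amino acid frequencies over a set of proteins.

    The pseudocount (an assumption) keeps every frequency above zero.
    """
    counts = Counter()
    for protein in proteins:
        counts.update(ch for ch in protein.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def mean_composition(compositions: list[dict[str, float]]) -> dict[str, float]:
    """Unweighted average of several compositions over the same alphabet."""
    n = len(compositions)
    return {aa: sum(c[aa] for c in compositions) / n for aa in compositions[0]}


def kld(p: dict[str, float], q: dict[str, float]) -> float:
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)


if __name__ == "__main__":
    # Toy "genomes": made-up protein sets, not real data.
    genome_a = composition(["MKKLLNNIIKKYYFF", "MNNKKIIFFYYLL"])
    genome_b = composition(["MAAGGRRPPVVDDEE", "MGGAARRLLVVPP"])
    average = mean_composition([genome_a, genome_b])
    print(kld(genome_a, average), kld(genome_b, average))
```

The divergence is zero when a genome's composition matches the average exactly and grows as the composition becomes more skewed.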

Page 29

KLD varies by bacteria

Colored by taxonomy of the bacteria

Page 30

KLD varies by bacteria

Page 31

Most divergent genomes

• Borrelia garinii – Spirochaetes

• Mycoplasma mycoides – Mollicutes

• Ureaplasma parvum – Mollicutes

• Buchnera aphidicola – Gammaproteobacteria

• Wigglesworthia glossinidia – Gammaproteobacteria

Page 32

Divergence and metabolism

Bifidobacterium

Bacillus

Nostoc

Salmonella

Chlamydophila

Mean of all bacteria

Page 33

Divergence and amino acids

Ureaplasma

Wigglesworthia

Borrelia

Buchnera

Mycoplasma

Bacteria mean

Archaea mean

Eukaryotic mean

Page 34

Predicting KLD

y = 2x^2 - 2x + 0.5

[Figure: KLD per genome (y-axis) against percent G+C (x-axis).]
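The fitted curve on this slide, read here as y = 2x^2 - 2x + 0.5, can be rewritten as y = 2(x - 0.5)^2. That form is zero at x = 0.5 and rises symmetrically on either side. The sketch below evaluates it under the assumption that x is the G+C content expressed as a fraction between 0 and 1 rather than as a percentage. The slide does not state the units, and the function names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the A, C, G and T bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / acgt


def predicted_kld(gc: float) -> float:
    """Evaluate y = 2x^2 - 2x + 0.5, with x assumed to be a G+C fraction."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be a fraction between 0 and 1")
    return 2 * gc ** 2 - 2 * gc + 0.5


if __name__ == "__main__":
    for gc in (0.25, 0.35, 0.50, 0.65, 0.75):
        print(f"G+C = {gc:.2f}  predicted KLD = {predicted_kld(gc):.3f}")
```

Under this reading, genomes with extreme G+C content in either direction get the largest predicted divergence.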

Page 35

Summary

• Shannon’s uncertainty could predict useful sequences

• KLD varies too much to be useful and is driven by %G+C content

Page 36

New solutions for old problems?

Page 37

Xen and the art of imagery

Page 38

The cell phone problem

Page 39

Searching the seed by SMS

[Figure labels: phone keypad (1-9, *, 0, #); "seed search"; "histidine coli"; AUTOSEEDSEARCHES @ GMAIL.COM; edwards.sdsu.edu; SEED databases; "22 proteins in E. coli"; locations: Anywhere, Idaho, GMCS429, Argonne.]

Page 40

Challenges

• Too much data

• Not easy to prioritize

• New models for HPC needed

• New interfaces to look at data

Page 41

Acknowledgements

• Sajia Akhter

• Rob Schmieder

• Nick Celms

• Sheridan Wright

• Ramy Aziz

• FIG

• The mg-RAST team

• Rick Stevens

• Peter Salamon

• Barb Bailey

• Forest Rohwer

• Anca Segall