Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources


Post on 11-Jan-2016


Page 1: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Kerstin Howe, Mario Caccamo, Ian Sealy

The Zebrafish Genome Sequencing Project

Bioinformatics resources

Page 2: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Bioinformatics resources

outline

• clone mapping, sequencing and manual annotation in Vega

• genome assemblies and automated annotation in Ensembl

• integrated ZF-Models data and tools

Page 3: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Clone mapping and sequencing

mapping

• 2 BAC Tuebingen libraries

• 1 BAC and 1 cosmid library from single Tuebingen double-haploid fish

• end sequencing, RH mapping, fingerprinting

• pieced together according to fingerprints, marker mapping, sequence alignment

• currently ~2,500 contigs (ctgs)

Page 4: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Clone mapping and sequencing

sequencing pipeline

• select clones based on position in fpc contig

• subcloning

• sequencing

• automatic assembly/pre-finishing (back to sequencing if necessary)

• finishing

• QC

• automated analysis pipeline

• manual annotation

• submission to EMBL


Page 5: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Manual annotation

unfinished sequence / finished sequence

automated analysis pipeline

• RepeatMasker

• CpG island prediction

• Genscan

• FGenesh

• halfwise (Pfam)

• EPCR

• Blast (ESTs, cDNAs, proteins)

manual annotation

• gene structures

• remarks (gene names, function, similarities)

• other features

EMBL

otter

• mysql database in 'ensembl style'

• acedb or apollo front end

• open to users from the 'outside'

Page 6: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Manual annotation

annotation policy

• follows guidelines for human annotation (havana team, Sanger Institute)

• no "guesses", annotations solely based on supporting evidence

• annotation of: CDSs and UTRs / transcripts, splice variants, pseudogenes, poly A features, transposons, repeats

• approved nomenclature (SI:clone.number)

• collaboration with ZFIN

existing ZFIN records are reported

ZFIN provides new records for newly found genes

Page 7: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Manual annotation

[Screenshot: annotation view with tracks for DNA, repeats, CpG island, Genscan, FGenesH, proteins, ESTs and mRNAs]

Page 8: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

vega.sanger.ac.uk

Page 9: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Vega

contigview

Page 10: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Vega

geneview

Page 11: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

www.sanger.ac.uk/Projects/D_rerio

Page 12: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

www.sanger.ac.uk/Projects/D_rerio

Page 13: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

when to use what

go to vega.sanger.ac.uk if you need

• highly reliable sequence

• highly reliable annotation (with your input)

• ‘your gene’ stable over time (TILLING)

go to www.ensembl.org if you need

• the whole genome

• comparative data

• ZF-Models microarray or insertional mutagenesis data

• complicated searches (BioMart)

Page 14: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Zebrafish Genome Project

[Flow diagram leading to the assembly release (Zv5); labels as far as recoverable]

• clone mapping and sequencing: clone libraries, map, tile path (BACs, fpc ctg), sequencing, (un)finished clones, ~ 8,000 finished clones (~1 Gb)

• whole genome shotgun sequencing: WGS reads, WGS assembly (contig, supercontig)

• integration: markers (T51), clones+ctgs, contigs

• assembly release (Zv5): "finish clone", 1.63 Gb

• automatic annotation, manual annotation

Page 15: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

WGS assembly

Phusion assembler - High Performance Assembly Group (Zemin Ning et al.)

[Diagram: reads → group reads → phrap (groups A, B, C) → contigs → read-pair tracker → supercontigs. The contig labels A, B, C appear in the orders "A C B" and "B A C". A gap between contigs within a supercontig is written as NNNNNNNN.]
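The diagram above ends with contigs joined into a supercontig, with the gap written as NNNNNNNN. A minimal Python sketch of that joining step follows. It assumes the contigs are already ordered and oriented and that a gap size estimate exists for each join. The function name `join_contigs` and the `min_gap` parameter are hypothetical and are not taken from Phusion.

```python
def join_contigs(contigs, gap_sizes, min_gap=10):
    """Join ordered, oriented contigs into one supercontig string.

    Each gap between neighbouring contigs is padded with N characters.
    gap_sizes[i] is the assumed gap estimate between contigs[i] and
    contigs[i + 1]; estimates below min_gap are raised to min_gap so
    that a gap stays visible in the sequence.
    """
    if not contigs:
        return ""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * max(gap, min_gap))  # unknown bases in the gap
        parts.append(contig)
    return "".join(parts)


# Toy example with invented sequences and gap estimates.
supercontig = join_contigs(["ACGTAC", "GGATCC", "TTGACA"], [8, 3], min_gap=5)
print(supercontig)
```

Padding with N keeps the coordinates of the downstream contigs roughly right even though the bases in the gap are unknown.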

Page 16: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Read grouping

• k-mer word hashing

continuous base hash - k=12

ATGGCGTGCAGTCCATGTTCGGATCA

ATGGCGTGCAGT

TGGCGTGCAGTC

GGCGTGCAGTCC

GCGTGCAGTCCA

gap hash k=12 (4x3) - dealing with variation

ATGG CGT GCAG TCC ATGT TCG GATC A

ATGG CGT GCAG TCC ATGT

TGGC GTG CAGT CCA TGTT

GGCG TGC AGTC CAT GTTC

GCGT GCA GTCC ATG TTCG

• word distribution

[Plot: frequency against k-mer occurrence, with regions labelled "seq. errors", "~7" and "repeats"]
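The two hashing schemes on this slide can be sketched in Python. This is an illustration under stated assumptions, not the Phusion implementation. In particular, the gapped version assumes that "4x3" means three hashed blocks of four bases separated by skipped stretches of three bases. The function names are hypothetical.

```python
from collections import Counter


def contiguous_kmers(seq, k=12):
    """All overlapping words of k consecutive bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_kmers(seq, block=4, gap=3, blocks=3):
    """Words built from `blocks` blocks of `block` bases, skipping `gap`
    bases between blocks (assumed reading of 'k=12 (4x3)')."""
    step = block + gap
    span = blocks * block + (blocks - 1) * gap
    words = []
    for i in range(len(seq) - span + 1):
        words.append("".join(seq[i + b * step:i + b * step + block]
                             for b in range(blocks)))
    return words


read = "ATGGCGTGCAGTCCATGTTCGGATCA"  # sequence shown on the slide
print(contiguous_kmers(read)[:4])
print(gapped_kmers(read)[:4])

# A single-base difference in a skipped position changes the contiguous
# word but leaves the gapped word untouched.
variant = read[:5] + "A" + read[6:]
print(contiguous_kmers(variant)[0] == contiguous_kmers(read)[0])
print(gapped_kmers(variant)[0] == gapped_kmers(read)[0])

# Word distribution: how many distinct words occur once, twice, ...
reads = [read, read[2:], variant]  # toy read set, invented for illustration
word_counts = Counter(w for r in reads for w in contiguous_kmers(r))
occurrence_histogram = Counter(word_counts.values())
print(sorted(occurrence_histogram.items()))
```

In a real data set the histogram is what the word distribution plot shows: words seen only rarely are likely to contain sequencing errors, and words seen far more often than the typical occurrence are likely to come from repeats.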

Page 17: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Zebrafish Genome Project

[Flow diagram leading to the assembly release (Zv5); labels as far as recoverable]

• clone mapping and sequencing: clone libraries, map, sequencing, (un)finished clones, ~ 7,000 finished clones (~1 Gb)

• whole genome shotgun sequencing: WGS reads, WGS assembly

• integration: markers (T51)

• automatic annotation, manual annotation

Page 18: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

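Integration

[Diagram: the clone accessions BX005049.6, BX005057.8, BX005153 and BX005123.6 are shown in two orders (BX005049.6, BX005057.8, BX005153, BX005123.6 and BX005153, BX005057.8, BX005049.6, BX005123.6). Tracks: fpc contig, WGS supercontig, marker, cDNA, bacends, BACs. The result is labelled Zv5 scaffoldn, with parts Zv5 scaffoldn.1, Zv5 scaffoldn.3, Zv5 scaffoldn.5 and Zv5 scaffoldn.7.]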

Page 19: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Assemblies

assembly                    Zv5                   Zv4                   Zv3                  Zv2
release date                27.05.05              12.07.04              27.11.03             03.04.03
total length [bp]           1,630,306,866         1,592,025,686         1,459,115,486        1,452,210,772
scaffolds                   16,214                21,333                58,339               83,470
finished clones             4,519 (699 Mb)        2,828 (443 Mb)        1,502 (263 Mb)       -
scaffolds in chr 1-25       1,749                 1,892                 1,490                -
scaffolds in fpc contigs    265 (chrU)            694 (chrU)            1,842                5,677
NA scaffolds                14,676                18,747                54,798               77,793
sum(length) chr 1-25 [bp]   1,200,129,620 (73%)   1,097,507,810 (69%)   718,270,423 (49%)    -
sum(length) ctgs [bp]       183,993,739 (11%)     176,222,396 (11%)     365,271,659 (25%)    1,143,459,008
sum(length) NAs [bp]        246,183,507 (16%)     318,295,480 (20%)     335,615,307 (23%)    308,751,764
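As a worked check of the Zv5 column, the three sum(length) rows can be added up and expressed as shares of the total length. The short Python sketch below uses only the figures from the table. The variable and function names are hypothetical, and the percentages on the slide are rounded, so the computed shares may differ from them slightly.

```python
# Zv5 figures copied from the table above (lengths in bp).
zv5_total_length = 1_630_306_866
zv5_parts = {
    "chr 1-25": 1_200_129_620,
    "ctgs": 183_993_739,
    "NAs": 246_183_507,
}


def shares(parts, total):
    """Percentage of the total length contributed by each category."""
    return {name: 100.0 * length / total for name, length in parts.items()}


# The three categories account for the whole Zv5 assembly length.
assert sum(zv5_parts.values()) == zv5_total_length

zv5_shares = shares(zv5_parts, zv5_total_length)
for name, percent in zv5_shares.items():
    print(f"{name}: {percent:.1f}%")
```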

Page 20: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

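Automatic Annotation

[Pipeline diagram; connections reconstructed from the labels]

• Zebrafish proteins, other proteins → Genewise → Genewise genes

• Zebrafish cDNAs → Exonerate → aligned cDNAs

• Genewise genes + aligned cDNAs → Genewise genes with UTRs

• Genewise genes with UTRs → Genebuilder, with supported ab initio predictions (optional) → final set

• Zebrafish ESTs → Exonerate → aligned ESTs → ClusterMerge → Ensembl EST genes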

Page 21: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Ensembl

Page 22: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Contigview

Page 23: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Geneview

Page 24: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Searching Ensembl

Page 25: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Biomart

start

filter

output
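The three BioMart steps (start, filter, output) map onto a dataset, a set of filters and a set of attributes. A minimal Python sketch of assembling such a query as XML follows. It only builds the query text and does not contact any server. The dataset, filter and attribute names in the example are assumptions chosen for illustration and should be checked against the BioMart interface before use, and `build_query` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes):
    """Build a BioMart-style XML query string.

    dataset    -- what to start from (the 'start' step)
    filters    -- dict of filter name -> value (the 'filter' step)
    attributes -- list of columns to return (the 'output' step)
    """
    query = ET.Element("Query", {"virtualSchemaName": "default",
                                 "formatter": "TSV", "header": "1"})
    ds = ET.SubElement(query, "Dataset", {"name": dataset,
                                          "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed names, for illustration only.
query_xml = build_query(
    dataset="drerio_gene_ensembl",
    filters={"chromosome_name": "1"},
    attributes=["ensembl_gene_id", "start_position", "end_position"],
)
print(query_xml)
```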

Page 26: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources
Page 27: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Dos and Don'ts

go elsewhere (Ensembl) if you

want to know about the whole genome

need comparative data

need ZF-Models microarray or insertional mutagenesis data

need to do complicated searches

go to Vega if you

need highly reliable sequence

need highly reliable annotation

need ‘your gene’ stable over time (TILLING)

Page 28: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

DAS

[Diagram: a DAS client (genome browser with local storage) takes the reference sequence and receives annotation as XML from several DAS servers, each with its own remote storage]
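The client side of this picture can be sketched in Python: each DAS server answers with XML, and the client parses the features and keeps them as separate tracks on the same reference coordinates. The XML layout below is a simplified stand-in modelled on a DAS features response, the data are invented, and the server names and function names are hypothetical. No network access is involved.

```python
import xml.etree.ElementTree as ET

# Invented, simplified responses from two hypothetical DAS servers.
RESPONSE_A = """<DASGFF><GFF>
  <SEGMENT id="1" start="1000" stop="5000">
    <FEATURE id="f1" label="example exon">
      <TYPE id="exon">exon</TYPE><START>1100</START><END>1250</END>
    </FEATURE>
  </SEGMENT>
</GFF></DASGFF>"""

RESPONSE_B = """<DASGFF><GFF>
  <SEGMENT id="1" start="1000" stop="5000">
    <FEATURE id="m7" label="example marker">
      <TYPE id="marker">marker</TYPE><START>3000</START><END>3001</END>
    </FEATURE>
  </SEGMENT>
</GFF></DASGFF>"""


def parse_features(xml_text):
    """Extract features from one server response as plain dictionaries."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "type": feature.find("TYPE").get("id"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return features


def collect_tracks(responses):
    """Keep one track per server; all tracks share the reference coordinates."""
    return {server: parse_features(xml) for server, xml in responses.items()}


tracks = collect_tracks({"server_a": RESPONSE_A, "server_b": RESPONSE_B})
for server, features in tracks.items():
    print(server, features)
```

Keeping one track per server means the annotation stays with whoever maintains it, and the browser only needs the shared reference sequence to display everything together.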

Page 29: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

SNPs and Indels

Page 30: Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources

Ensembl releases

              Zv5      Zv4      Zv3      Zv2      Human    Fugu     Tetraodon
genes         22,877   23,526   22,409   20,062   24,194   22,339   28,005
transcripts   32,143   32,071   30,783   26,587   35,845   22,102   28,005