Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Upload: sheena-robertson

Post on 01-Jan-2016

Page 1: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Ensembl

Steve Searle

Joint project leader, Ensembl Genebuild team

Page 2: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Outline

• Ensembl project overview

• Core database schema and API

• Pipeline

• Genomic annotation

• Comparative genomics

• Variation data

• Ensembl BioMart datamining db

• Making the data available

Page 3: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

What is Ensembl? Project aims

• funded to provide vertebrate genomes to the world

• aims to provide high-quality automated genome annotation

• aims to be a leading group in genome analysis

• all software, data and results freely available

Page 4: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

What is Ensembl? Project background

• Group split between EBI and Sanger

• Mainly Wellcome Trust funded (recently received a new five year grant for 2006-2011)

• Largest dedicated compute in biology in Europe

• Developer community > 300 people, including companies

Page 5: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Ensembl - Technical overview

• Data storage
– MySQL databases (~160Gb in current release)
  • Core databases - annotation for each species
  • Variation databases - variation data for some species
  • Compara - single database containing all comparative genomic data for species in Ensembl
  • Mart - set of denormalised databases for datamining

• Data production
– Pipeline systems running automatic annotation on a compute farm of 800 CPUs

• Interfaces
– Website
– Mart (datamining tool)
– Apollo
– SQL
– APIs (both Perl and Java)

Page 6: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Currently 21 organisms in Ensembl

Page 7: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Open source

• Object model
– standard interface makes it easy for others to build custom applications on top of Ensembl data

• Open discussion of design ([email protected])

• Most major pharmaceutical companies and many academics on mailing list

• Ensembl installs worldwide
– Both public and commercial, e.g. Gramene (CSHL), Ciona-sg (Temasek), Arabidopsis (NASC), Fugu (IMCB)

Page 8: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Outline

• Ensembl project overview

• Core database and API

• Pipeline

• Genomic annotation

• Comparative genomics

• Variation data

• Ensembl BioMart datamining db

• Making the data available

Page 9: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

The Ensembl Core Database

• Relational database (MySQL) containing the genomic sequence and annotations on it (genes, alignments, ab initio predictions etc)

• Data stored in it throughout analysis process and the website displays features from it

• Current schema has 68 tables

• Ensembl core API team control changes

Page 10: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Requirements for the schema

• Store data for human genome

• … and all the other genomes we have

• … and all the genomes we might get

• Flexible to add more data

• Easy to adapt to new genome

• Responds fast enough for web site display and pipelined genebuild

Page 11: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

System Context

[Diagram of how the components connect: Ensembl DBs; Perl API; Java API (EnsJ); www; Pipeline; Apollo; Mart DB; MartShell; MartView; Other Scripts & Applications]

Page 12: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Sequence Tables

assembly: asm_seq_region_id int, cmp_seq_region_id int, asm_start int, asm_end int, cmp_start int, cmp_end int, ori int

seq_region: seq_region_id int, name varchar, coord_system_id int, length int

seq_region_attrib: seq_region_id int, attrib_type_id int, value varchar

attrib_type: attrib_type_id int, code varchar, name varchar, description text

dna: seq_region_id int, sequence mediumtext

dnac: seq_region_id int, sequence mediumblob, n_line text

coord_system: coord_system_id int, name varchar, version varchar, attrib ("default_version", "sequence_level"), rank int

[Diagram: the original slide draws relationship cardinalities (0..1, 0..n, 1, 1..n) between these tables.]
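The assembly table columns can be read as a mapping from a component (cmp) sequence region onto an assembled (asm) one. The sketch below is a minimal illustration in Python rather than the Ensembl API: the class and function names are hypothetical, the example rows are made up, and it assumes 1-based inclusive coordinates with ori = 1 for same orientation and -1 for reversed.

```python
from dataclasses import dataclass


@dataclass
class AssemblyRow:
    """One row of the assembly table (column names as on the slide)."""
    asm_start: int
    asm_end: int
    cmp_start: int
    cmp_end: int
    ori: int  # assumed: 1 = same orientation, -1 = reversed


def component_to_assembled(row: AssemblyRow, cmp_pos: int) -> int:
    """Map a 1-based component position onto the assembled region."""
    if not row.cmp_start <= cmp_pos <= row.cmp_end:
        raise ValueError("position lies outside the mapped part of the component")
    offset = cmp_pos - row.cmp_start
    if row.ori == 1:
        return row.asm_start + offset
    return row.asm_end - offset


# Hypothetical rows: bases 1..1000 of a contig placed at 5001..6000.
forward = AssemblyRow(5001, 6000, 1, 1000, 1)
reverse = AssemblyRow(5001, 6000, 1, 1000, -1)
print(component_to_assembled(forward, 10))
print(component_to_assembled(reverse, 10))
```

In the Perl API this arithmetic is hidden behind the coordinate transformation methods described later in the talk.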

Page 13: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Feature Tables

• Feature tables describe annotations with positions in sequence.

• Each feature is associated with a seq_region and has a start, end, and orientation on the seq_region.

• There is no central feature table. There are tables specific to each feature type (DNA/DNA alignments, DNA/Protein alignments, Repeats, Simple features).

• Different feature tables have different attributes, but always have a seq_region position.

analysis: analysis_id int, created datetime, logic_name varchar, db varchar, db_version varchar, db_file varchar, program varchar, program_version varchar, program_file varchar, parameters varchar, module varchar, module_version varchar, gff_source varchar, gff_feature varchar

simple_feature: simple_feature_id int, seq_region_id int, seq_region_start int, seq_region_end int, seq_region_strand tinyint, display_label varchar, analysis_id int, score double

any feature (general pattern):
– usually has an any_feature_id (int)
– contains a sequence position, with or without strand, on a sequence region
– usually contains a string to display (varchar)
– usually links to the analysis responsible for calculating it (int)
– contains any number of other attributes

[Diagram: the original slide draws a 1 to 0..n relationship between analysis and the feature tables.]
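Because every feature table carries a seq_region position, a region query has the same shape for each of them. The following is a minimal sketch using Python's sqlite3 module in place of MySQL; the table columns follow the simple_feature slide, while the rows and the features_overlapping helper are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE simple_feature (
        simple_feature_id INTEGER PRIMARY KEY,
        seq_region_id     INTEGER,
        seq_region_start  INTEGER,
        seq_region_end    INTEGER,
        seq_region_strand INTEGER,
        display_label     TEXT,
        analysis_id       INTEGER,
        score             REAL
    )""")
# Made-up example rows.
rows = [
    (1, 7, 100, 200, 1, "cpg_a", 3, 12.5),
    (2, 7, 950, 1100, -1, "cpg_b", 3, 40.0),
    (3, 7, 5000, 5100, 1, "cpg_c", 3, 8.0),
    (4, 8, 990, 1010, 1, "other_region", 3, 1.0),
]
conn.executemany("INSERT INTO simple_feature VALUES (?,?,?,?,?,?,?,?)", rows)


def features_overlapping(conn, seq_region_id, start, end):
    """Return labels of features overlapping [start, end] on one seq_region."""
    cur = conn.execute(
        """SELECT display_label FROM simple_feature
           WHERE seq_region_id = ?
             AND seq_region_start <= ?
             AND seq_region_end >= ?
           ORDER BY seq_region_start""",
        (seq_region_id, end, start))
    return [label for (label,) in cur]


print(features_overlapping(conn, 7, 150, 1000))
```

Two intervals overlap when each starts before the other ends, which is why the query compares the feature start with the region end and the feature end with the region start.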

Page 14: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Other features

protein_align_feature: protein_align_feature_id int, sequence position, hit_start int, hit_end int, hit_name varchar, analysis_id int, score double, evalue double, perc_ident float, cigar_line text

dna_align_feature: dna_align_feature_id int, sequence position, hit_start int, hit_end int, hit_strand tinyint, hit_name varchar, analysis_id int, score double, evalue double, perc_ident float, cigar_line text

repeat_feature: repeat_feature_id int, sequence position, repeat_start int, repeat_end int, repeat_consensus_id int, analysis_id int, score double

repeat_consensus: repeat_consensus_id int, repeat_name varchar, repeat_class varchar, consensus text

prediction_transcript: prediction_transcript_id int, sequence position, analysis_id int

prediction_exon: prediction_exon_id int, prediction_transcript_id int, rank int, sequence position, start_phase tinyint, score double, p_value double

[Diagram: the original slide draws two 1 to 1..n relationships between these tables.]

Page 15: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Tables for Genes

exon: exon_id int, seq_region_id int, seq_region_start int, seq_region_end int, seq_region_strand tinyint, phase tinyint, end_phase tinyint

exon_stable_id: exon_id int, stable_id varchar, version int

exon_transcript: exon_id int, transcript_id int, rank int

gene: gene_id int, status enum, description, source varchar, biotype varchar, analysis_id int, seq_region_id int, seq_region_start int, seq_region_end int, seq_region_strand tinyint, display_xref_id int

gene_stable_id: gene_id int, stable_id varchar, version int

transcript: transcript_id int, gene_id int, seq_region_id int, biotype varchar, status enum, seq_region_start int, seq_region_end int, seq_region_strand tinyint, display_xref_id int

transcript_stable_id: transcript_id int, stable_id varchar, version int

transcript_attrib: transcript_id int, attrib_type_id int, value varchar

translation: translation_id int, transcript_id int, seq_start int, start_exon_id int, seq_end int, end_exon_id int

translation_stable_id: translation_id int, stable_id varchar, version int

translation_attrib: translation_id int, attrib_type_id int, value varchar

[Diagram: the original slide draws relationship cardinalities (1, 0..1, 0..n, 1..n) between these tables.]
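The exon_transcript table is what lets one exon belong to several transcripts while keeping the exon order within each transcript. A minimal sketch with Python's sqlite3 module follows; the column names come from the slide (trimmed to those needed), and the rows and the exons_of helper are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exon (
        exon_id INTEGER PRIMARY KEY, seq_region_id INTEGER,
        seq_region_start INTEGER, seq_region_end INTEGER,
        seq_region_strand INTEGER);
    CREATE TABLE exon_transcript (
        exon_id INTEGER, transcript_id INTEGER, "rank" INTEGER);
    -- made-up rows: exon 11 and exon 12 are shared by transcripts 1 and 2
    INSERT INTO exon VALUES
        (10, 1, 500, 600, 1), (11, 1, 100, 200, 1), (12, 1, 900, 950, 1);
    INSERT INTO exon_transcript VALUES
        (11, 1, 1), (10, 1, 2), (12, 1, 3), (11, 2, 1), (12, 2, 2);
""")


def exons_of(conn, transcript_id):
    """Return (exon_id, start, end) for a transcript, in rank order."""
    cur = conn.execute(
        '''SELECT e.exon_id, e.seq_region_start, e.seq_region_end
           FROM exon_transcript et
           JOIN exon e ON e.exon_id = et.exon_id
           WHERE et.transcript_id = ?
           ORDER BY et."rank"''',
        (transcript_id,))
    return cur.fetchall()


print(exons_of(conn, 1))
print(exons_of(conn, 2))
```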

Page 16: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Other tables

• Sets of tables to handle:
– Cross references of Ensembl features to external database
– Markers
– QTLs
– Regulatory regions and factors
– Stable ID archive
– Affymetrix probe data
– Misc features
– Density features

• Tables containing meta information about the database

• Karyotype bands

• Protein annotation

• Supporting evidence

• Assembly exceptions (haplotypes and PARs)

Page 17: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Ensembl APIs

• Programmatic access to Ensembl databases is via three main APIs:
– ensembl core API: access to genome database
– ensembl compara API: access to compara database
– ensembl variation API: access to variation database

• All three have the same basic structure
– Data objects to represent biological entities, e.g. Gene, Homology, Variation
– DataAdaptor objects to store and retrieve data objects from database.

• Data production APIs
– ensembl-pipeline: genebuild pipeline
– ensembl-analysis: analysis wrapper objects
– ensembl-hive: compara pipeline

Page 18: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

The Perl Core API

• The Perl core API provides a layer of abstraction over the Ensembl core databases.

• Written in Object-Oriented Perl.

• Can be used to get information into or out of Ensembl databases.

• Insulates programmers to some extent from changes to the database schema.

• Insulates programmers from coordinate transformations.

Page 19: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Data Objects

• Information is obtained from the API in the form of Data Objects.

• Each object represents some data which is stored in the database.

• A Gene object represents a gene, a Transcript object represents a transcript, a Marker Object represents a Marker, etc.

Page 20: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Data Objects – Code Example

# print out the start, end and strand of a transcript
print $transcript->start(), '-', $transcript->end(), '(', $transcript->strand(), ")\n";

# print out the stable identifier for an exon
print $exon->stable_id(), "\n";

# print out the name of a marker and its primer sequences
print $marker->display_marker_synonym()->name, "\n";
print "left primer: ", $marker->left_primer(), "\n";
print "right primer: ", $marker->right_primer(), "\n";

# set the start and end of a simple feature
$simple_feature->start(10);
$simple_feature->end(100);

Page 21: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Object Adaptors

• Object Adaptors are factories for Data Objects.

• Data Objects are retrieved from and stored in databases using Object Adaptors.

• Each Object Adaptor is responsible for creating objects of only one particular type.

• Object Adaptor fetch, store, and remove methods are used to retrieve, save, and delete information in the database.

• All the SQL is in the Object Adaptors.

Page 22: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Object Adaptors – Code Example

# fetch a gene by its internal identifier
$gene = $gene_adaptor->fetch_by_dbID(1234);

# fetch a gene by its stable identifier
$gene = $gene_adaptor->fetch_by_stable_id('ENSG0000005038');

# store a transcript in the database
$transcript_adaptor->store($transcript);

# remove an exon from the database
$exon_adaptor->remove($exon);

# get all transcripts having a specific interpro domain
@transcripts = @{$transcript_adaptor->fetch_all_by_domain('IPR000980')};

Page 23: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

The DBAdaptor and the Registry

[Diagram: DB, DBAdaptor, Object Adaptors (GeneAdaptor, MarkerAdaptor, …), Data Objects (Gene, Marker, …)]

• The Database Adaptor is a factory for Object Adaptors.

• It is used to connect to the database and to obtain Object Adaptors.

• The Registry enables access to multiple databases using information from a config file (important for compara work).
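The layering described here (a database adaptor that hands out object adaptors, which in turn create data objects and hold all the SQL) can be sketched in a few lines. This is an illustration of the pattern in Python with sqlite3, not the Ensembl Perl API; the class names mirror the slide, but the method names, the table layout and the "GENE_A" identifier are assumptions made for the example.

```python
import sqlite3


class Gene:
    """Data object: holds data only, no SQL."""
    def __init__(self, stable_id, db_id=None):
        self.stable_id = stable_id
        self.db_id = db_id


class GeneAdaptor:
    """Object adaptor: all SQL for Gene objects lives here."""
    def __init__(self, conn):
        self.conn = conn

    def store(self, gene):
        cur = self.conn.execute(
            "INSERT INTO gene (stable_id) VALUES (?)", (gene.stable_id,))
        gene.db_id = cur.lastrowid
        return gene

    def fetch_by_stable_id(self, stable_id):
        row = self.conn.execute(
            "SELECT gene_id, stable_id FROM gene WHERE stable_id = ?",
            (stable_id,)).fetchone()
        return Gene(row[1], row[0]) if row else None


class DBAdaptor:
    """Database adaptor: owns the connection, hands out object adaptors."""
    def __init__(self, database=":memory:"):
        self.conn = sqlite3.connect(database)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS gene "
            "(gene_id INTEGER PRIMARY KEY, stable_id TEXT)")

    def get_gene_adaptor(self):
        return GeneAdaptor(self.conn)


db = DBAdaptor()
gene_adaptor = db.get_gene_adaptor()
gene_adaptor.store(Gene("GENE_A"))  # hypothetical identifier
fetched = gene_adaptor.fetch_by_stable_id("GENE_A")
print(fetched.db_id, fetched.stable_id)
```

Keeping the SQL inside the adaptor is what lets the schema change without touching code that only uses the data objects.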

Page 24: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Slices

• A Slice Data Object represents an arbitrary region of a genome.

• Slices are not directly stored in the database.

• A Slice is used to request sequence or features from a specific region in a specific coordinate system.

[Figure labels: chr20, Clone AC022035]

Page 25: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Slices – Code Example

# get the slice adaptor
$slice_adaptor = $db->get_SliceAdaptor();

# fetch a slice on a region of chromosome 12
$slice = $slice_adaptor->fetch_by_region('chromosome', '12', 1e6, 2e6);

# print out the sequence from this region
print $slice->seq();

# get all clones in the database and print out their names
@slices = @{$slice_adaptor->fetch_all('clone')};
foreach $slice (@slices) {
  print $slice->seq_region_name(), "\n";
}

Page 26: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Features

• Features are Data Objects with associated genomic locations.

• All Features have start, end, strand and slice attributes.

• Features are retrieved from Object Adaptors using limiting criteria such as identifiers or regions (slices).

• Gene

• Transcript

• Exon

• PredictionTranscript

• PredictionExon

• DnaAlignFeature

• ProteinAlignFeature

• SimpleFeature

• MarkerFeature

• QtlFeature

• MiscFeature

• KaryotypeBand

• RepeatFeature

• AssemblyExceptionFeature

• DensityFeature

Page 27: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

A Complete Code Example

use Bio::EnsEMBL::DBSQL::DBAdaptor;

my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
  -host   => 'ensembldb.ensembl.org',
  -dbname => 'homo_sapiens_core_35_35h',
  -user   => 'anonymous');

my $slice_ad = $db->get_SliceAdaptor();
my $slice = $slice_ad->fetch_by_region('chromosome', 'X', 1e6, 10e6);

foreach my $sf (@{$slice->get_all_SimpleFeatures()}) {
  my $start  = $sf->start();
  my $end    = $sf->end();
  my $strand = $sf->strand();
  my $score  = $sf->score();
  print "$start-$end($strand) $score\n";
}

Page 28: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

A Gene Object Code Example

#!/usr/bin/perl -w
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use strict;

my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
  -host   => 'ensembldb.ensembl.org',
  -dbname => 'homo_sapiens_core_35_35h',
  -user   => 'anonymous');

my $slice_ad = $db->get_SliceAdaptor();
my $slice = $slice_ad->fetch_by_region('chromosome', 'X', 1e6, 10e6);

foreach my $gene (@{$slice->get_all_Genes_by_type('ensembl')}) {
  print "Gene ", $gene->stable_id, " ", $gene->start, " ", $gene->end, "\n";
  foreach my $trans (@{$gene->get_all_Transcripts}) {
    print " Trans ", $trans->stable_id, "\n";
    # wrap the translated sequence at 60 residues per line
    my $tlnseq = $trans->translate->seq;
    $tlnseq =~ s/(.{1,60})/$1\n/g;
    print " ", $tlnseq;
  }
}

Page 29: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Coordinate Transformations

• The API provides the means to convert between any related coordinate systems in the database.

• Feature methods transfer, transform, project can be used to move features between coordinate systems.

• The Slice method project can be used to project a slice onto another coordinate system.

Page 30: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Feature::transfer

• The Feature method transfer moves a feature from one Slice to another.

• The Slice may be in the same coordinate system or a different coordinate system.

[Figure labels: Chr20, Chr17, AC099811, Chr17, ChrX]

Page 31: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Feature::transfer – Code Example

# fetch an exon from the database
$exon = $exon_adaptor->fetch_by_stable_id('ENSE00001180238');
print "Exon is on slice: ", $exon->slice()->name(), "\n";
print "Exon coords: ", $exon->start(), '-', $exon->end(), "\n";

# transfer the exon to a small slice just covering it
$exon_slice = $slice_adaptor->fetch_by_Feature($exon);
$exon = $exon->transfer($exon_slice);

print "Exon is on slice: ", $exon->slice()->name(), "\n";
print "Exon coords: ", $exon->start(), '-', $exon->end(), "\n";

Sample output:
Exon is on slice: chromosome:NCBI34:12:1:132078379:1
Exon coords: 56452706-56452951
Exon is on slice: chromosome:NCBI34:12:56452706:56452951:1
Exon coords: 1-246
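The change from 56452706-56452951 to 1-246 in the sample output is plain offset arithmetic relative to the start of the new slice. A minimal Python sketch of that one case follows; the function name is hypothetical, and it covers only a forward-strand slice in the same coordinate system, whereas the real transfer method also handles strand and coordinate-system changes.

```python
def transfer_to_subslice(feature_start, feature_end, slice_start, slice_end):
    """Re-express feature coordinates relative to a forward-strand slice.

    All coordinates are 1-based and inclusive, as in the sample output.
    """
    if feature_start < slice_start or feature_end > slice_end:
        raise ValueError("feature is not fully contained in the target slice")
    return (feature_start - slice_start + 1, feature_end - slice_start + 1)


# Coordinates taken from the sample output above.
print(transfer_to_subslice(56452706, 56452951, 56452706, 56452951))
```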

Page 32: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Stability of API

• Ensembl API changes to meet our needs

• Request for greater stability from users

• Some methods are now labelled as stable and we guarantee that they will not change for at least 2 years.

Page 33: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Outline

• Ensembl project overview

• Core database and API

• Pipeline

• Genomic annotation

• Comparative genomics

• Variation data

• Ensembl BioMart datamining db

• Making the data available

Page 34: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Runnables and RunnableDBs

• Runnables are Perl objects which wrap analysis programs, e.g. the Blast runnable wraps blast. Methods:
– run
– parse_results
  • Generates Ensembl data objects
– output
  • Returns generated data objects

• RunnableDBs are Perl objects which wrap Runnables, allowing them to retrieve input data from and store output data into Ensembl databases. Methods:
– fetch_input
– write_output

Page 35: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

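Runnable example

use Bio::SeqIO;
use Bio::EnsEMBL::Slice;
use Bio::EnsEMBL::CoordSystem;
use Bio::EnsEMBL::Analysis;
use Bio::EnsEMBL::Analysis::Runnable::Genscan;
use Bio::EnsEMBL::Analysis::Runnable::Blast;
use Bio::EnsEMBL::Analysis::Tools::BPliteWrapper;

# read the first sequence from a FASTA file
my $seq = Bio::SeqIO->new(
  -file   => "<test.fa",
  -format => 'Fasta')->next_seq;

# wrap the sequence in a Slice so that runnables can use it
my $slice = Bio::EnsEMBL::Slice->new(
  -seq             => $seq->seq,
  -coord_system    => Bio::EnsEMBL::CoordSystem->new(-name => 'contig', -rank => 1),
  -seq_region_name => $seq->display_id,
  -start           => 1,
  -end             => $seq->length);

# run Genscan on the slice
my $genscan_runnable = Bio::EnsEMBL::Analysis::Runnable::Genscan->new(
  -query    => $slice,
  -analysis => Bio::EnsEMBL::Analysis->new(-logic_name => 'genscan'));

$genscan_runnable->run;

# blast the translation of each Genscan prediction and collect the hits
my @output;
foreach my $prediction (@{$genscan_runnable->output}) {
  my $blast_run = Bio::EnsEMBL::Analysis::Runnable::Blast->new(
    -query    => $prediction->translate,
    -parser   => Bio::EnsEMBL::Analysis::Tools::BPliteWrapper->new(),
    -database => 'embl_vertrna',
    -program  => 'wutblastn',
    -analysis => Bio::EnsEMBL::Analysis->new(-logic_name => 'vertrna'));

  $blast_run->run;
  push(@output, @{$blast_run->output});
}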

Page 36: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

RunnableDB example

#!/usr/local/ensembl/bin/perl -w

use strict;
use Bio::EnsEMBL::Pipeline::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Pipeline::Analysis;

my $db = Bio::EnsEMBL::Pipeline::DBSQL::DBAdaptor->new(
  -host   => 'localhost',
  -user   => 'root',
  -dbname => 'test_db');

my $anal = $db->get_AnalysisAdaptor->fetch_by_logic_name('Uniprot');

# build the RunnableDB class name from the analysis and load that class
my $rdbstr = "Bio::EnsEMBL::Analysis::RunnableDB::" . $anal->module;
eval "require $rdbstr" or die $@;

my $runobj = $rdbstr->new(
  -db       => $db,
  -input_id => 'contig::AL1347153.1.3517:1:3571:1',
  -analysis => $anal);

$runobj->fetch_input;

$runobj->run;

$runobj->write_output;

Page 37: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Writing Runnables and RunnableDBs

• A lot of functionality is implemented in the base classes

• At its simplest just requires implementing:
– parse_results in the Runnable
– get_adaptor in the RunnableDB
– fetch_input in the RunnableDB

• Other methods which may need overriding:
– write_output in the RunnableDB
– run_analysis in the Runnable
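The division of labour between the two base classes can be sketched as follows. This is a toy Python illustration of the pattern, not the ensembl-analysis code: the method names follow the slides, but the classes, the dictionary standing in for a database, and the G+C counting "analysis" are all invented for the example.

```python
class Runnable:
    """Wraps one analysis: run it, parse the raw results, expose output."""
    def __init__(self, query):
        self.query = query
        self._output = []

    def run_analysis(self):
        raise NotImplementedError

    def parse_results(self, raw):
        raise NotImplementedError

    def run(self):
        self._output = self.parse_results(self.run_analysis())

    def output(self):
        return self._output


class GCRunnable(Runnable):
    """Toy analysis: report the G+C count of the query sequence."""
    def run_analysis(self):
        return "gc=%d" % sum(self.query.count(base) for base in "GC")

    def parse_results(self, raw):
        return [{"gc_count": int(raw.split("=")[1])}]


class RunnableDB:
    """Wraps a Runnable: fetches its input from, and writes output to, a store."""
    def __init__(self, db, input_id, runnable_class):
        self.db = db
        self.input_id = input_id
        self.runnable_class = runnable_class
        self.runnable = None

    def fetch_input(self):
        self.runnable = self.runnable_class(self.db["sequences"][self.input_id])

    def run(self):
        self.runnable.run()

    def write_output(self):
        self.db["results"][self.input_id] = self.runnable.output()


# A dict stands in for the database in this sketch.
db = {"sequences": {"contig1": "ACGTGGCC"}, "results": {}}
job = RunnableDB(db, "contig1", GCRunnable)
job.fetch_input()
job.run()
job.write_output()
print(db["results"])
```

A new analysis then only needs its own run_analysis and parse_results, which matches the short list of methods to implement given above.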

Page 38: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Pipeline Summary

Page 39: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Example pipeline

Page 40: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Current hardware

• 8x ES40 Alpha (667 MHz) with 2TB fibre channel storage

• 10x ES45 Alpha (1 GHz) with 5TB fibre channel storage

• 3x Itanium 4 CPU with 1.6TB storage

• 400 HS20 IBM Blades (2x 2.8 or 3.2 GHz PIV + 4GB memory + 2TB clustered SAN filesystem or 600GB clustered IDE filesystem, both IBM GPFS)

• Tru64 UNIX/Linux

• 21 MySQL (v 4.1) instances

• Most binaries and all sequence databases stored locally (avoids using NFS)

Page 41: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Outline

• Ensembl project overview

• Core database and API

• Pipeline

• Genomic annotation

• Comparative genomics

• Variation data

• Ensembl BioMart datamining db

• Making the data available

Page 42: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Genome annotation overview

Raw compute - Alignments against protein and DNA dbs, and other basic analyses

Automatic gene annotation

Protein coding gene models

Pseudogenes (some)

RNA genes

Alignment of species ESTs and cDNAs

Affymetrix probe mapping

Protein domain annotation

Cross reference generation

Page 43: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

The Raw Computes

• Repeat Features
– RepeatMasker
– Dust
– TRF

• Ab Initio Genes
– Genscan (sometimes other programs)

• Blast alignments
– Blastp against Uniprot
– Blastn against EMBL vertebrate RNAs and UniGene Clusters

• Other Features
– CpG islands
– tRNA genes
– Transcription start sites using Eponine

Page 44: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Gene Annotation

[Flowchart of the genebuild. Inputs: species-specific proteins, other proteins, species-specific cDNAs, species-specific ESTs. Programs: Genewise, Exonerate, ClusterMerge, Genebuilder, GeneCombiner. Intermediate and final sets: Genewise genes, aligned cDNAs, aligned ESTs, cDNA genes, Ensembl EST genes, Genewise genes with UTRs, blessed gene set (optional), supported ab initio (optional), preliminary gene set, pseudogenes, final set + pseudogenes, core Ensembl genes.]

Page 45: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

ncRNAs

• Functional RNAs

• Families share conserved secondary structure

• Low sequence identity

• Ribosome

• Spliceosome

• tRNAs

• miRNA

Page 46: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Difficulties in annotating ncRNAs

Ab initio gene prediction programs such as GENSCAN cannot predict non-coding genes.

BLAST performs poorly at detecting non-coding genes where structure is conserved but sequence identity is low.

Cannot use repeat-masked DNA as some ncRNAs look very much like repeats (Alu is related to SRP RNA).

Cannot use ESTs as ncRNAs lack poly-A.

Page 47: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

RFAM

• Hand made alignments

• Use Infernal to make Covariance Models

• Scan models over subset of EMBL to build family alignments

Page 48: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Problems 2

• Infernal does not scale well:

• “Covariance model searches are extremely compute intensive… The compute time scales roughly to the 4th power of the length of the RNA, so larger models quickly become infeasible without significant compute resources”

• How long would it take to run the human genome?

• Rough estimate > a week on the farm

• Need to limit the amount of sequence we run Infernal on
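To make the quoted scaling concrete: if compute time grows roughly with the 4th power of RNA length, doubling the length multiplies the cost by about 16. The short Python sketch below only evaluates that rule of thumb; the function name and the example lengths are illustrative, and real run times will also depend on factors the quote does not cover.

```python
def relative_cost(length, reference_length, exponent=4):
    """Cost of a model of `length` relative to one of `reference_length`,
    under the rough length**4 scaling quoted above."""
    return (length / reference_length) ** exponent


for length in (100, 200, 400):
    print(length, relative_cost(length, 100))
```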

Page 49: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Rfam Scan

• Rfam procedure to speed up Infernal on large eukaryotes
– Uses Blast to narrow search:
  • BLAST is poor at finding ncRNAs with low sequence ID
  • RFAM families contain sequences from all organisms
  • More sequence variation = more chance of Blast making alignment

• In ensembl:
– Separated blast and Infernal steps (using Runnables)
– Determined filtering for blast results to limit time without significant reduction in sensitivity
– Now runs in less than 24hrs.

Page 50: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

miRNA

• Highly conserved across species

• Precursor stem loop sequence ~ 70nt

• Mature miRNA ~ 21nt

• Identified using BLAST genomic vs miRBase precursors

• RNAfold used to test for stem loop

• Mature sequence identified (only 2 nt changes tolerated)

• Start with ~ 290,000 blast hits

• End with 222 miRNA

• 96% of SE miRNAs + additional 60

• Novel c.f. miRBase:

• 1 – chicken, 36 – mouse, 5 – rat
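The "only 2 nt changes tolerated" step can be pictured as sliding the known mature sequence along a candidate precursor and counting mismatches in each window. The Python sketch below shows that idea only; it is not the Ensembl implementation, the function names are hypothetical, and the sequences are made up for the example.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def best_mature_match(precursor, mature, max_mismatches=2):
    """Slide `mature` along `precursor`; return (offset, mismatches) for the
    best window within the tolerance, or None if no window qualifies."""
    best = None
    for offset in range(len(precursor) - len(mature) + 1):
        window = precursor[offset:offset + len(mature)]
        mismatches = count_mismatches(window, mature)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (offset, mismatches)
    return best


# Made-up sequences: the hit carries one substitution relative to `mature`.
mature = "ACGUUGCAAGGCUUAGCUAGC"
precursor = "GGCC" + "UCGUUGCAAGGCUUAGCUAGC" + "UUAA"
print(best_mature_match(precursor, mature))
```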

Page 52: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Structures

Structures identified by Infernal / RNAfold are stored as transcript attributes

    ::::::::::::::::<<-<<<<<-<<<________________>>>>>>>>-->>,,,,
  1 AuCUUUGCGCAGGGGCaaUaucguAgccAGUGAGGcUuuaCCGAggcgcgauUAuuGCUA 60
    A+CUUUGCGCAG GGCA:UAU :UAGCCA+UGAGG+UU++CCGAGGCG: AUUA:UGCUA
181 AGCUUUGCGCAGUGGCAGUAUCAUAGCCAAUGAGGUUUAUCCGAGGCGCAAUUAUUGCUA 240

    <<<<_.________.__>>>>,,,,,<<<.<<<<<<<<<<____......__>>>>>>>>
 61 gUugA.AAACUAUU.CCcaAccgCCCgcc.aagacgacauguua......uauugucggc 111
    :UU A AAA UA AA:+G G:C ::: ::A:::+UUA U :::U::+:
241 AUUAAuAAAUUAAAuAAUAAAAGGG-GACuCUU-UUAGUGCUUAuaaaggUUUACUAACC 298

    >>->>>,,,,,,,,,,,,<<<<____>>>>
112 uuuggcAAUUUUUGGAAGcccuccAaaggg 141
    :: G:CAA UU +AAG ::C+AA::
299 ACAGACAACUU---AAAGGUAACAAACCUA 325

Displayable on website as markup on transcript sequence

Page 53: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Low Coverage Genomes

• 16 mammalian genomes sequenced to 2x coverage are expected over the next 2-3 years.

• Ensembl is aiming to provide gene sets for these based on alignments to human, building predictions on scaffolds which align to genomic locations of human genes

• Test case
– Cow preliminary 3x assembly: 449727 scaffolds, 795212 contigs
– Good test case because 6x assembly was recently made available so we can assess accuracy of method

Page 54: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Method overview

• Details
– Raw alignments grouped using UCSC chain and net method
– Use human as source for cow - human has best annotation
– 'Gene scaffolds' (new coord system) are stored in the database
– Allow scaffolds to be broken at contig gaps
– Retain 'gap' exons

[Figure: gap region drawn as NNNNNNNN]

Page 55: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Dealing with duplication: Iterative Human Net

[Figure labels: Cow BLUE, Cow RED, Cow GREEN, Human]

Page 56: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Cow gene with ‘gap’ exons

Page 57: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

multicontigview comparison of gene structures of Cow, Mouse (and human)

Page 58: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

QC

• Internal QC
– Comparison against Uniprot or Refseq
– Comparison to previous build
– Comparisons to homologs

• External comparisons - the CCDS set

Page 59: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Increase in quality

[Stacked bar chart, 0% to 100%, one bar per comparison: Human-UniSw-33, Human-UniSw-34, Human-UniSw-35, Human-RefSeq-33, Human-RefSeq-34, Human-RefSeq-35, Mouse-UniSw-30, Mouse-UniSw-32, Mouse-UniSw-33, Mouse-UniSw-34, Mouse-RefSeq-30, Mouse-RefSeq-32, Mouse-RefSeq-33, Mouse-RefSeq-34. Bar segments: Missing, Matching, Edge perfect, Identical.]

Page 60: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Using homology to find partial predictions

Page 61: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Comparing builds using homology pipeline

Page 62: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Improving the human build

• CCDS
– Collaboration with Havana, NCBI and UCSC to produce a stable, reliable set of complete (ATG->Stop) CDS structures for human
– NCBI and Ensembl guarantee to retain the set in builds
– Generation:
  • Comparison of merged Ensembl/Vega set with the NCBI Refseq set to find the set of complete CDSs both groups predict identically.
  • UCSC (and the other groups) analyse the complete sets and the CDS intersection set for possible errors.
  • Assign stable ids and release
– This process has been very valuable to both NCBI and us in highlighting problems in our build processes.
– Two rounds of comparisons have taken place

Page 64: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Affymetrix probe mapping

• Exonerate to map probe sequences

• Assign xrefs to transcripts for matched probe sets

• API currently being modified to be less Affymetrix array format (probeset) specific (by Zebrafish annotation team)

• Dog, chicken, fruitfly, zebrafish, rat, mouse, human (worm next month)

Page 65: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Other developments

• Cross reference data
– New system for generating this data, leading to:
  • More reliable generation of xrefs
  • New types of xref, e.g. Unigene

• Monthly cDNA alignment set updates for human
– Genomic alignments of cDNAs using an up to date cDNA set.
– Displayed on the website in the 'cDNAs' track

Page 66: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Outline

• Ensembl project overview

• Core database and API

• Pipeline

• Genomic annotation

• Comparative genomics

• Variation data

• Ensembl BioMart datamining db

• Making the data available

Page 67: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

What is Ensembl Compara?

A single database which links all the Ensembl species databases together through precalculated comparative genomics data analysis.

A perl object API to access, and create that database

A production system for generating that database

Page 68: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Comparing different species

H. sapiens (human) NCBI35 3Gb *
M. musculus (house mouse) NCBIm33 2.6Gb *
R. norvegicus (Norway rat) RGSC3.1 2.6Gb *
T. rubripes (tiger pufferfish) Fugu v2.0 400Mb *
T. nigroviridis (freshwater pufferfish) 400Mb *
D. rerio (zebrafish) WTSI Zv4 1.7Gb *
C. savignyi (sea squirt) 180Mb
D. melanogaster (fruitfly) BDGP3.1 125Mb *
A. gambiae (African malaria mosquito) 230Mb *
C. elegans (nematode) WS116 100Mb *
O. latipes (Japanese medaka) 800Mb
M. mulatta (rhesus macaque) *
P. troglodytes (common chimpanzee) 3Gb *
C. familiaris (domestic dog) BROAD1 2.5Gb *
F. catus (domestic cat)
E. caballus (horse)
S. scrofa (domestic pig)
B. taurus (domestic cattle) Btau 1.0 +
O. aries (domestic sheep)
G. gallus (domestic fowl) WASHUC1 1.2Gb *
X. laevis (African clawed frog) JGI3 3.1Gb
X. tropicalis (tropical clawed frog) 1.7Gb *
C. intestinalis (sea squirt) 180Mb +
A. aegypti (yellow fever mosquito)
A. mellifera (honey bee) Amel1.1 200Mb *
S. cerevisiae (yeast, SGD) S228C 12Mb *
M. domestica (opossum) +
I. scapularis (tick)

* 17 species currently in Ensembl
+ 3 to be added soon
Red: whole genome assembly available
Green: whole genome assembly due in the next 2 years

[Figure: phylogenetic tree of these species with divergence times on a "Million years" axis (100, 200, 300, 400, 500, 1000). Node values as extracted from the slide: 523, 41, 91, 4574, 83, 65, 20, 310, 197, 92, 360, 450, 990, 25, 70, 140, ?, 550, 250, 200?, 340, 170, 1500?]

Page 69: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Compara database

Protein/Protein

• Gene orthology / paralogy predictions

• Protein Family clusters

• Raw protein alignments (wublastp)

Dna/Dna

• Synteny regions

• Whole genome alignments (BLAT, BlastZ, chain/net)

• Whole genome multiple alignments (Mercator, MLagan)

Page 70: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Compara database schema

Page 71: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Gene orthology prediction

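[Diagram: proteins from species1, species2 and species3 are compared with wublastp+sw, each serving as query (qy) and database (db), for the pairs species1:species2, species1:species3 and species2:species3. Steps shown: Best Reciprocal Hits; extra orthologous pairs found based on gene order conservation; Protein->DNA alignments and dN, dS calculation.]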
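The Best Reciprocal Hits step can be sketched as: take each gene's top-scoring hit in the other species, then keep only the pairs that pick each other. The Python below is a minimal illustration with invented gene names and scores; it does not reproduce the compara pipeline's scoring, tie handling, or the later synteny-based additions.

```python
def best_hits(scores):
    """scores: {(query, target): score}. Return {query: best-scoring target}."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def best_reciprocal_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Made-up blast scores between two species.
human_vs_mouse = {("h1", "m1"): 900, ("h1", "m2"): 300,
                  ("h2", "m2"): 500, ("h3", "m2"): 450}
mouse_vs_human = {("m1", "h1"): 880, ("m2", "h2"): 510, ("m2", "h3"): 440}
print(best_reciprocal_hits(human_vs_mouse, mouse_vs_human))
```

Here h3 also has m2 as its best hit, but m2 prefers h2, so only h1:m1 and h2:m2 are reciprocal.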

Page 72: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

RHS, Orphans and Others

Using UBRH and MBRH-DUPs as anchors and comparing genomic coordinates in both species, we identify additional orthologues labeled RHS (Reciprocal Hit supported by Synteny).

[Figure: Human and Mouse gene pairs labelled UBRH, MBRH DUP1.2, RHS, MBRH COMPLEX, MBRH SYN, Orphan and Others; "Others" are matches to some other chromosome.]

Page 73: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

MultiContigView

Page 74: Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

Pairwise whole genome alignments pipeline

[Diagram: pairwise alignment pipeline]
• Species1 DNA chunking and Species2 DNA chunking, giving query (qy) and database (db) chunks
• DNA chunks defined by: size, overlap, masking options, chunk grouping size, dump location
• PairAligner superclass, with Blastz, BLAT and Exonerate
• Filtering (UCSC chain and net code)
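The size and overlap parameters of DNA chunking can be sketched as follows. This is an illustrative function, not the pipeline's own code; the function name and the 0-based half-open coordinates are assumptions, and masking, chunk grouping and dumping are left out.

```python
from typing import Iterator, Tuple


def chunk_sequence(seq_length: int, chunk_size: int, overlap: int) -> Iterator[Tuple[int, int]]:
    """Yield 0-based half-open (start, end) chunks covering [0, seq_length).

    Consecutive chunks share `overlap` bases, so a feature that spans a
    chunk boundary still falls entirely inside at least one chunk if it is
    no longer than the overlap.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    start = 0
    while start < seq_length:
        end = min(start + chunk_size, seq_length)
        yield start, end
        if end == seq_length:
            break
        start += step


if __name__ == "__main__":
    for start, end in chunk_sequence(25, 10, 2):
        print(start, end)
```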


DNA/DNA matches web display


Multiple whole genome alignments pipeline

[Diagram: multiple alignment pipeline]
• Species1, Species2 and Species3 coding exons
• wublastp all vs all (orthology anchors)
• Orthology Map Builder: Mercator
• MultipleAligner superclass, with MLAGAN, MAVID and PECAN


AlignSlice API

Uses whole genome pairwise/multiple alignment data to generate a reference coordinate system common to the aligned species in the genomic region of interest.

• Able to project features (including transcripts) from one species to another through the alignment.

• Gives annotation context information across species.
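The idea of projecting a position from one species to another through an alignment can be shown with a small sketch. It works on two gapped strings of equal length and is not the AlignSlice API; the function name and the 0-based coordinates are assumptions for illustration.

```python
from typing import Optional


def project_position(aligned_a: str, aligned_b: str, pos_a: int) -> Optional[int]:
    """Map a 0-based position in sequence A to sequence B via a gapped alignment.

    Returns None when the base in A is aligned to a gap in B.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have equal length")
    idx_a = idx_b = -1  # last ungapped index seen in each sequence
    for col_a, col_b in zip(aligned_a, aligned_b):
        if col_a != "-":
            idx_a += 1
        if col_b != "-":
            idx_b += 1
        if col_a != "-" and idx_a == pos_a:
            return idx_b if col_b != "-" else None
    raise IndexError("pos_a is outside sequence A")


if __name__ == "__main__":
    # Made-up alignment: A = ACGTA, B = AGCTA
    print(project_position("ACG-TA", "A-GCTA", 2))
```

Projecting a feature such as an exon would apply the same mapping to its start and end, with a policy for ends that fall in gaps.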




Variation database

• Refactored ensembl-variation database replaced ensembl SNP (and lite) database

• New API to access DB from perl and java

• Variation databases for 7 species:
– Homo_Sapiens
– Mus_Musculus
– Anopheles_Gambiae
– Rattus_Norvegicus
– Gallus_Gallus
– Danio_Rerio
– Canis_Familiaris


Ensembl variation db schema


Ensembl-variation API

• Similar in design to ensembl core and compara APIs:
– Adaptor objects onto the database
– Objects to represent biological entities such as:
    • Variation and VariationFeature
    • TranscriptVariation
    • Allele
    • Genotype
    • Population
    • Individual
    • AlleleGroup


Generating SNP gene consequence

• SNPs occurring within transcripts are identified and their consequence for that transcript determined

• Classified into
– Coding
    • Synonymous
    • Non synonymous
    • Frameshift
    • Stop gain / loss
– Non coding UTR exonic
– Intronic
– Upstream or downstream
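The coding classes above can be illustrated for a single-base substitution with a short sketch. It assumes the standard genetic code and a coding sequence given as a plain string with a 0-based position; the function name is hypothetical, frameshifts (which need insertions or deletions) are not handled, and this is not the ensembl-variation code.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def coding_consequence(cds: str, pos: int, alt: str) -> str:
    """Classify a single-base substitution at 0-based CDS position `pos`."""
    if not 0 <= pos < len(cds) - len(cds) % 3:
        raise ValueError("position is outside a complete codon")
    start = pos - pos % 3  # first base of the affected codon
    ref_codon = cds[start:start + 3].upper()
    alt_codon = ref_codon[:pos % 3] + alt.upper() + ref_codon[pos % 3 + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gained"
    if ref_aa == "*":
        return "stop lost"
    return "non-synonymous"


if __name__ == "__main__":
    # Made-up CDS: ATG GCT TAA
    print(coding_consequence("ATGGCTTAA", 5, "C"))
```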


LD calculation

• Calculate pairwise LD in different populations
• Calculate, in each population, how many individuals have genotype AA, Aa and aa
• Done for each pair of variations within a defined window size (100,000)
• Includes 7 populations (involving HapMap and Perlegen) and 309M rows in the individual_genotype table
• 86M rows in the pairwise_ld table (with r2 > 0.05 and population sample_size >= 40)
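For reference, r2 between two biallelic sites can be computed as below when phased haplotypes are available. This is a simplification: the pipeline described above starts from unphased genotype counts, which requires estimating haplotype frequencies first, a step not shown here. The function names are hypothetical; the two thresholds are the ones quoted on the slide.

```python
from typing import Sequence, Tuple

MIN_R2 = 0.05          # threshold quoted on the slide
MIN_SAMPLE_SIZE = 40   # threshold quoted on the slide


def r_squared(haplotypes: Sequence[Tuple[str, str]], allele_1: str, allele_2: str) -> float:
    """r^2 between two sites from phased haplotypes (allele at site 1, allele at site 2)."""
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes given")
    p1 = sum(h[0] == allele_1 for h in haplotypes) / n
    p2 = sum(h[1] == allele_2 for h in haplotypes) / n
    p12 = sum(h[0] == allele_1 and h[1] == allele_2 for h in haplotypes) / n
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    d = p12 - p1 * p2  # linkage disequilibrium coefficient D
    return d * d / denom


def keep_pair(r2: float, sample_size: int) -> bool:
    """Apply the storage filter described on the slide."""
    return r2 > MIN_R2 and sample_size >= MIN_SAMPLE_SIZE


if __name__ == "__main__":
    haps = [("A", "B"), ("A", "B"), ("a", "b"), ("a", "b")]
    print(r_squared(haps, "A", "B"))
```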


Variation Data on the website




BioMart and EnsMart

• Large-scale data retrieval tool

• Query builder interface

• Databases: Ensembl, SNP, Vega, (MSD, UniProt)

• Associated features or sequences

• Flexible output formats

• http://www.biomart.org

• http://www.ensembl.org/Multi/martview




Web code

• Encapsulates
– Input
– Output
– Ensembl API
– Rendering
• Improves
– Maintainability
– Flexibility
– Code re-use

[Diagram labels: databases (core, var, est, snp), Ensembl API, Data, Input, Renderer, Output, View script, Client browser]


Ensembl web site

• Web site is the main access method
• Hardware recently upgraded
– Website now runs on blades (2 CPU Intel boxes) like the compute farm
    • Scale by adding more
– Site speed is important to users
• Code and interface updated during the summer
– Plugins
    • Customisation of the site
– Side bar
    • Quick access / discovery of pages relating to current page
• DAS can be used to put up user features as a track on the site


ContigView

Overview

Detailed View

Basepair View


Data retrieval

BioMart

Data sets on ftp site

MySQL queries of databases

Perl API access to databases

Export View


BioMart - Features


Database access via MySQL

mysql -h ensembldb.ensembl.org -u anonymous

mysql> show databases like 'h%';
+------------------------------+
| Database (h%)                |
+------------------------------+
| homo_sapiens_core_14_31      |
| homo_sapiens_core_15_33      |
| homo_sapiens_core_16_33      |
| homo_sapiens_disease_14_31   |

mysql> use homo_sapiens_core_29_35b;
Database changed
mysql> show tables;
+-----------------------------------+
| Tables_in_homo_sapiens_core_29_35b|
+-----------------------------------+
| analysis                          |
| assembly                          |
| chromosome                        |
| clone                             |
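The database names in the listing appear to follow a species_type_release_assembly pattern (for example homo_sapiens_core_29_35b). A small sketch for splitting such names, assuming that pattern and a two-word species name; the class and function names are hypothetical and the regular expression only covers names shaped like those shown above.

```python
import re
from typing import NamedTuple


class EnsemblDbName(NamedTuple):
    species: str
    db_type: str
    release: int
    assembly: str


# Assumed layout: <genus>_<species>_<type>_<release>_<assembly>
_DB_NAME = re.compile(
    r"^(?P<species>[a-z]+_[a-z]+)_(?P<db_type>[a-z]+)_(?P<release>\d+)_(?P<assembly>\w+)$"
)


def parse_db_name(name: str) -> EnsemblDbName:
    """Split a database name into its assumed components."""
    m = _DB_NAME.match(name)
    if m is None:
        raise ValueError(f"unrecognised database name: {name!r}")
    return EnsemblDbName(m["species"], m["db_type"], int(m["release"]), m["assembly"])


if __name__ == "__main__":
    print(parse_db_name("homo_sapiens_core_29_35b"))
```

Parsing the names this way makes it easy to pick, for instance, the highest release of the core database for a species from the `show databases` output.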


Archive site details

• Each new release is archived onto a web blade

• Plan is to keep each archive site up for 2 years

• Stable links (for 2 years): http://nov2004.archive.ensembl.org/Homo_sapiens/geneview?gene=ENSG00000139618

• Will allow for better handling of retired stable ids


Ensembl Team

Database Schema and Core API: Arne Stabenau, Yuan Chen, Ian Longden, Craig Melsopp, Glenn Proctor, Daniel Ríos, Guy Slater
ENCODE: Damian Keefe, Paul Flicek
Distributed Annotation System: Andreas Kähäri, Stefan Gräf
Project Leader: Ewan Birney (EBI), Tim Hubbard (Sanger)
Ensembl Web Team: James Smith, Fiona Cunningham, Eugene Kulesha, Anne Parker
VectorBase: Martin Hammond, Karyn Megy, Dan Lawson
Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Mario Caccamo, Laura Clarke, Jan Hinnerck-Vogel, Kevin Howe, Kerstin Jekosch, Felix Kokocinski, Simon White, Bronwen Aken, Julio Banet
EnsMart & BioMart: Arek Kasprzyk, Darin London, Damian Smedley
User Support: Xosé Mª Fernández, Michael Schuster, Bert Overduin
Comparative Genomics: Abel Ureta-Vidal, Javier Herrero Sánchez, Jessica Severin
Vega Web Team: Patrick Meidl, Steve Trevianon

October 2005