ensembl manual - post

Access to genes and genomes with

Ensembl

Taipei, November 2004

Table of Contents

Introduction 3 The Ensembl genome browser 7 Ensembl 9

How to access genomic information using Ensembl 16 Comparative genomics and proteomics 37 Ensembl and other resources 50 EnsMart 56 Genetic variation (SNPs, haplotypes) 69 Evaluating genes and transcripts 79 Adding your own data to Ensembl displays 90

Setting up your own DAS server 90 Uploading data for display in Ensembl ContigView via DAS 90 URL-based display 92

Databases and MySQL access 93 Perl API 94 Appendix: MySQL overview 95

2

Introduction

Introduction The aim of this course is to provide an overview of the Ensembl genome browser (http://www.ensembl.org), covering different aspects of this comprehensive source of annotated genomes, in order to allow the researcher to get the most out of the system, applying this knowledge to your own projects. Ensembl is a joint project of the European Bioinformatics Institute and the Wellcome Trust Sanger Institute, funded mainly by the Wellcome Trust, with additional funding from EMBL and NIH-NIAID.

Recent years have seen the release of huge amounts of sequence data from genome sequencing centres. However, this raw sequence data is most valuable to the laboratory biologist when provided along with quality annotation of the genomic sequence. This information can be the starting point for planning experiments, interpreting SNPs, inferring the function of gene products, predicting regulatory sites for gene expression and so on. The currently agreed 'gold standard' for the annotation of eukaryotic genomes is that made by a human being. Manual annotation is based on information derived from sequence homology searches and the results of various ab initio gene prediction methods. Annotation of large genomes such mouse and human meeting this standards is slow and labour intensive, taking large teams of annotators years to complete. As a result, the annotation can almost never be entirely up-to-date and free of inconsistencies (as the annotation process usually begins before the sequencing process is complete). Hence, an automated annotation system is desirable since it is a relatively rapid process that allows frequent updates to accommodate new data. To meet this need, we produced the Ensembl annotation system by observing how annotators build gene structures and condensing this process into a set of rules. Ensembl's genesis was in response to the acceleration of the public effort to sequence the human genome in 1999. At that point it was clear that if annotation of the draft sequence was to be available in a timely fashion it would have to be automatically generated and that

3

Introduction

new software systems would be needed to handle genome data sets that were much larger, much more fragmented and much more rapidly changing than anything previous dealt with. The Ensembl group has productive collaborations with both the UCSC genome informatics group and the NCBI genome group. Ensembl is one of the world's primary resources for genomic research, a resource through which scientists can access the human genome (as well as other model organisms such as chimpanzee, rat, mouse, chicken, Drosophila, honeybee, Tetraodon, zebrafish, Takifugu, Caenorhabditis elegans and C. briggsae). Because of the genome complexity and many different ways in which scientists want to use it, Ensembl has to provide many levels of access with a high degree of flexibility. Through the Ensembl website a wet-lab researcher with a simple web browser can perform BLAST searches against chromosomal DNA, download a genomic contig sequence or search for all members of a given protein family. But Ensembl is also an all-round software and database system that can be installed locally to serve needs of a genomic centre or a bioinformatics division in a pharma company enabling complex data mining of the genome or large-scale sequence annotation. Over the past few years Ensembl has grown into a large scale enterprise, with substantial compute resources enabling it to process and provide live database access to currently nine different genomes and a monthly update frequency to its heavily used web site. It has a large community of users in both industry and academia, using it as a base for their own organisations experimental and computational genome based investigations, some of whom maintain their own local installations.

Species Organism database Home page Human homo_sapiens http://www.ensembl.org/Homo_sapiens Mouse mus_musculus http://www.ensembl.org/Mus_musculus Rat rattus_norvegicus http://www.ensembl.org/Rattus_norvegicus Chimpanzee pan_troglodytes http://www.ensembl.org/Pan_troglodytes Chicken gallus_gallus http://www.ensembl.org/Gallus_gallus Dog n/a http://pre.ensembl.org/Canis_familiaris Cow n/a http://pre.ensembl.org/Bos_taurus Opossum n/a http://pre.ensembl.org/Monodelphis_domestica Zebrafish danio_rerio http://www.ensembl.org/Danio_rerio Takifugu fugu_rubripes http://www.ensembl.org/Fugu_rubripes Tetraodon tetraodon_nigroviridis http://www.ensembl.org/Tetraodon_nigroviridis Mosquito anopheles_gambiae http://www.ensembl.org/Anopheles_gambiae Fruitfly drosophila_melanogaster http://www.ensembl.org/Drosophila_melanogaster Honey bee apis_mellifera http://www.ensembl.org/Apis_mellifera C.elegans caenorhabditis_elegans http://www.ensembl.org/Caenorhabditis_elegans C.briggsae caenorhabditis_briggsae http://www.ensembl.org/Caenorhabditis_briggsae

Ensembl was conceived in three parts: as a scalable way of storing and retrieving genome scale data; as a web site for genome display; and as an automatic annotation method based around a set of heuristics. It was initially written for the draft human genome which was sequenced clone-by-clone but has also been successfully used for whole genome shotgun assemblies such as mouse, rat and Anopheles gambiae. The storage and display parts of

4

Introduction

Ensembl are used for all the genomes, and the automatic gene annotation has been run for human, mouse, rat, chimpanzee, mosquito, honeybee, Takifugu rubripes, Tetraodon nigroviridis zebrafish and C. briggsae. Ensembl provides easy access to genomic information with a number of visualisation tools. The key Ensembl web pages are called Views (e.g. GeneView, TextView, MapView, and ContigView), and will all be introduced appropriately later on. The Ensembl website gives you the opportunity to directly download data, whether it is a DNA sequence of a genomic contig you are trying to identify novel genes in, or positions of SNPs in a gene you are working on. There is also an ftp site that you can use to download large amounts of data from the Ensembl database, as well as a data mining tool (EnsMart) which allows flexible and rapid retrieval of information from the databases. As a software/database system Ensembl can be best described as a hybrid of a scripting programming language (Perl) and a relational database (MySQL, pronounced “My Ess Que Ell”). Ensembl Perl software inherits from a tradition of biological object-design developed through BioPerl (http://www.bioperl.org). That means that developers at Ensembl aimed at creating reusable pieces of software that would faithfully describe biological entities such as gene, transcript, protein, genomic clone, or chromosome. Rules of usage and design of Ensembl and BioPerl objects can be best learned while using them, browsing their code and through a bit of trial-and-error. There is a comprehensive BioPerl tutorial available at the BioPerl website. The Ensembl database is based on a relational database called MySQL. SQL in MySQL stands for: Structured Query Language - a universal database programming language shared by many relational databases. Because MySQL is available free of charge for non-commercial developers, every academic centre can install its own local copy of MySQL (MySQL server) and download Ensembl data from the Ensembl ftp site. Simple queries of the database can be handled using the SQL language (see appendix), but for complex queries demanded by most biological analyses the Ensembl MySQL server is best accessed using Ensembl Perl objects. The Ensembl project presents the latest sequence assembly1 of the human genome (and other species) and provides automatic baseline annotation of that sequence, including gene, transcript and protein predictions. The annotation is integrated with external data sources (such as Genome Ontology and resources from NCBI), making Ensembl a valuable starting and reference point for any work in human biology or medicine that utilises genetic information. The primary goal of Ensembl is to find all interesting genomic features. These include amongst other:

o Genes o Marker and SNP locations

1 Human Ensembl was formerly built on assemblies from UCSC, and is currently built on assemblies from NCBI. We collaborate closely with UCSC and NCBI to help assess, use and develop these assemblies.

5

Introduction

o Repeat locations Ensembl will not itself be storing other information, such as expression patterns or disease information. However, it does provide easy integration with external data, such as Disease and Maps. We are working on a similar relationship with databases such as ArrayExpress.

Ensembl identifiers

ENSG### Ensembl Gene ID ENST### Ensembl Transcript ID ENSP### Ensembl Peptide ID ENSE### Ensembl Exon ID ENSF### Ensembl Family ID

For other species, Ensembl adds a suffix to the ‘ENS’ (e.g. MUS for mouse, RNO for rat, DAR for Zebrafish, ANG for Anopheles and CBR for C briggsae). For Drosophila and C elegans where we import the genes from FlyBase and WormBase we use their identifiers.

Ensembl publications M. Hammond and E. Birney “Genome information resource – developments at Ensembl” Trends in Genetics (2004) 20:6 268-72. E. Birney et al. “An Overview of Ensembl” Genome Research (2004) 14:5 925-28*. E. Birney et al. “Ensembl 2004” Nucleic Acids Research (2004) 32:1 D468-D470. A Kasprzyk et al. “EnsMart: a generic system for fast and flexible access to biological data” Genome Research (2004) 14:1, 160-9. M Clamp et al. “Ensembl 2002: accommodating comparative genomics” Nucleic Acids Research (2003) 31:1, 38-42. T Hubbard et al. “The Ensembl genome database project” Nucleic Acids Research (2002) 30:1, 38-41. *An Ensembl special has been published in Genome Research 14:5 (May 2004) with a series of papers about the Ensembl project (covering from analysis pipeline to GeneWise).

6

The Ensembl genome browser

The Ensembl genome browser Web-based browsers facilitate access to information about whole regions of the genome, allowing the visualisation of features in and around a specific gene, investigate genome organisation and compare different genomes.

Assembly and annotation runs are referred to by “build numbers” (e.g. currently the human assembly is build 35, in which every chromosome has been finished, although there are still 341 gaps). Genome Browsers at UCSC and NCBI are based on the same assemblies, but may make a new assembly, with limited annotation, available earlier than Ensembl. These browsers have their own independent annotation pipelines. Prior to any annotation, sequence data has to be put together into “chromosomal assemblies”, Ensembl does not assemble any genome project directly but rather works in partnership with sequencing centres or consortia which generate the assembly based on clone maps created by sequencing centres, process which currently is undertaken at the NCBI (for human and mouse). Now that the genome is essentially finished, sequencing centres provide the complete final sequence, since no assembly is performed. Assembly methods converged and currently a single assembly is acceptable.

7


The current release of Ensembl displays (amongst other species) the human assembly NCBI 35 (freeze date, June 2004) consisting of 27262 clones assembled in 27288 contigs:

Estimated size: 2,863,476,365 bp Ensembl gene predictions: 24,195 (incl. 1974 pseudogenes) Ensembl gene exons: 245,938 Ensembl gene transcripts: 35,913

Computationally generated assemblies are produced after each sequencing centre produces a minimal tiling path map representing their ordering of clones across the chromosomes for which they were responsible. As sequencing efforts became focused on producing finished sequence and identifying clones to fill in remaining gaps, the responsibility for each chromosome was assigned to a particular sequencing centre, as follows: Sequencing Centre Chromosomes Baylor College of Medicine 3, 12 Genoscope - Centre National de Séquençage 14 Dept. of Energy Joint Genome Institute 5 Los Alamos National Laboratory 16 RIKEN Human Genome Research Group 11, 21 (21 consortium) Sanger Institute 1, 6, 9, 10, 13, 20, 22, X Stanford Human Genome Center 19 Washington University Genome Sequencing Center 2, 4, 7, Y Broad Institute 8, 15, 17, 18

8


Ensembl

I. Regions, maps and markers ContigView can be considered the central point of the Ensembl website. It shows the contigs that make up a genome assembly. It allows you to scroll along entire chromosomes, while viewing the annotated features within a selected region in detail.

CytoView provides a high level view of chromosome regions and their annotations. It gives a good overview of regions of several Mb and lets you zoom out to view an entire chromosome. You can jump to different chromosomes, load regions according to their absolute coordinates, and view cross species synteny blocks, when available. You can also download lists of clones and genes in a defined chromosomal region.

9


SyntenyView provides an interface that plots all regions where gene order is conserved between two species as synteny blocks (chromosomal segments that share common ancestry) along the length of a chromosome with the added functionality of examining homologous genes in these synteny blocks via easy links to detailed GeneView reports.

MarkerView displays information about a particular chromosome map marker or STS (sequence tagged site). Markers are normally defined by a pair of primers that can be used in a PCR.

10


SNPView provides supporting evidence for SNPs (Single Nucleotide Polymorphisms). The SNPView report provides detailed information for each SNP, including its genomic mapping, sequence, allelic variation, validation status, neighbouring SNPs, and SNP type. SNPs are also mapped to their source references at dbSNP, the SNP repository at NCBI, HGVBase (Human Genome Variation Database), and TSC (The SNP Consortium Ltd.) When more than a single accession number exists, all cross-references are provided.

11


GeneSNPView shows details of SNPs and Pfam domains in and around the exons of a particular gene. SNP annotation includes location, alleles, classification, and effects on the different transcripts.

II. Genes and gene products GeneView provides annotation and supporting evidence for the selected gene. The annotation provided consists of a gene's transcripts, structure associated diseases and GO terms, homologies to microarray probes and to other species, known and predicted proteins and domains, and links to external documentation.

12


TransView provides annotation and supporting evidence of the selected transcript (structure, transcribed proteins, Gene Ontology and InterPro associated entries). Divided into two sections, Transcript report panel provides a top-level summary of the transcript, with links to its genomic location, alignments to sequences in external databases, and export options. Underneath the report, the cDNA sequence of the transcript can be shown with codons, peptide sequence and/or SNPs highlighted. ExonView provides annotation and supporting evidence for the exon of a selected transcript. Ensembl's gene predictions for both known and novel genes are based on gene predictions and alignment evidence from external databases.

13


ProteinView shows information about the structure and function of the encoded protein in the transcript's report.

FamilyView shows all the peptides present in the family listed by source (i.e. each Ensembl species, SwissProt and TrEMBL). A protein family is a cluster of similar proteins based on sequence identity that have members between and across Ensembl species. The multiple alignments of all the peptide members can be displayed using the JalView applet link. The user can either chooses to only see Ensembl peptides aligned all together or with the UniProt (formerly SwissProt/TrEMBL) entries present in the family.

14


DomainView shows all the genes sharing a particular InterPro domain. GOView provides access to Ensembl's Gene Ontology (GO)/gene mapping, and allows you to search for genes based on their GO classification. Terms are assigned to Ensembl genes by taking the terms from UniProt entries to which the Ensembl gene has been mapped.

DiseaseView is a tool for browsing detailed information on genetic diseases. It provides access to Ensembl's disease mappings based on OMIMTM (Online Mendelian Inheritance in Man), allowing you to browse diseases by chromosome or search for specific diseases or to search for genes based on their OMIM classification. DiseaseView is only available for human.

15


III. Data retrieval MartView is a data-mining tool which facilitates efficient and flexible access to Ensembl (and other) data in a rapid and interactive manner.

ExportView lets you download/dump data. All the features for a genomic region may be downloaded or exported to various formats (e.g. FASTA, GenBank or EMBL-style flat file, as a feature list or an image).

16

How to access genomic information using Ensembl

STEP 1: Pick Human Ensembl

‘About Ensembl’ plus general help and documentation – multi-species

17


Search function

STEP 2: Pick chromosome X

Help!

Click to display a region

STEP 3: Display the region between markers DXS7072 and DXS1073

Display a region between any two features

Quick access to the different “Views”

18


Entire chromosome

Fwd

stra

nd

Rev

erse

str

and

“Overview” displays chromosomal region chromosome

“Detailed View” use menus on gold bar to customise view

STEP 4: Centre on the G6PD gene

DNA (contigs) sequence assembly

STEP 5: Zoom in using the “+” button

“Basepair View” shows six frame translations and restriction enzyme profile

19


STEP 7: Customise the display. Show SNPs, hide Genscans, CpG islands and EST transcripts

Evidence for gene prediction

STEP 6: ‘Expand’ or ‘compact’ the Unigene evidence track

STEP 8: Jump to genomic region in UCSC browser

20


STEP 9: Jump to “CytoView”

STEP 10:

Dump clones and genes

STEP 11: Use browser BACK button to return to “ContigView” and then click on a marker

21


Map information

STEP 12: Use browser BACK button to return to “ContigView” and then click on a SNP

STEP 13: Use browser BACK button to return to “ContigView” and then click on the G6PD Ensembl transcript (red)

22


View genomic location

STEP 14: Go to “TransView”

Homologies with other species

References from SwissProt

Export data - retrieve sequence and annotation

Links to other databases

Transcript structure

Find related proteins

GO annotation

Protein domains

23


STEP 15: Customise: Select “Codons/peptide/ SNPs” and “number bps”

STEP 16: Go to “ExonView”

Chromosomal location

Transcript sequence

24


STEP 17: Go to “ProteinView”

Exon/intron sequence

Display more or all of the intron sequence

Evidence used to build gene structure

25


Look back to “TransView” or “ExonView”

STEP 18: Go to “FamilyView”

Protein sequence

Protein domains

SNPs

26


Chromosomal location and links to other family members

Family members in other Ensembl species

STEP 19: See multiple alignments

27


STEP 21: Use browser BACK button to return to “GeneView” and then click on ‘Export gene data in EMBL, GenBank or FASTA’

STEP 20: Use browser BACK button to return to “GeneView”. Then try links to LocusLink, OMIM (MIM), & the GO browser

28


STEP 22: Choose the FASTA tab

STEP 23: Include 500bp on either side, select HTML format and click “Export”

STEP 24: Copy part of the sequence, and then click on “BLAST Search”
STEP 25:
Paste the sequence, choose the “Ensembl Peptides (known+novel)” database and “BLASTX”, then click on “RUN”

29


View the alignment

STEP 26: Display the region in “ContigView”

30


View the BLAST hits

STEP 27: Zoom out several steps, and then export an Ensembl gene list

31


STEP 28: Request positional data and external gene ID

STEP 29: Select “Excel” output format, and click “export”

32


Tasks: 1) Retrieve all mouse homologues of human disease genes containing transmembrane

domains located between 1p22 and 1q22.

Hint: Start with a human gene list. Include “Disease OMIM ID” and “Disease description” in the

OUTPUT section.

2) Retrieve the gene structure of the following mouse genes: ENSMUSG00000042351

ENSMUSG00000022393

Hint: This can be done under Filter – (Gene Section) Ensembl Gene IDs.

Select ‘Ensembl Gene ID’, ‘Ensembl Transcript ID’, ‘Ensembl Exon ID’, ‘Exon Start’

and ‘Exon End’. Export the gene structure in HTML format. Then take the link from

the Ensembl genes to GeneView in order to confirm the gene structure.

3) Retrieve the sequences 5kb upstream of all human ‘known’ genes between D1S2806 and

D1S464.

Hint: Use the ‘Genes – transcript information ignored’ under the sequence type options. ‘Known’ genes

in Ensembl are Ensembl gene predictions that could be mapped to external database entries (e.g.

SwissProt) with a high similarity score.

4) Retrieve all human SNPs that have a TSC ID, from chromosome 6 between 15 and 15.2

Mb, with 200 bases flanking sequence.

5) Design your own query!

33


Additional Tasks 1. Exploring features related to a gene Find the gene report for the human HRAS gene. Hint: use the search/find boxes to do a text search. How many transcripts are predicted for this gene? What is the size of the longest predicted mRNA? How many exons does it have? With what diseases is HRAS associated? What is its cellular function? Hint: follow some of the links in GeneView. How many amino acids does it code for? Which InterPro domains does the protein product contain? Find the GO section of GeneView and follow the links to explore the ‘Gene ontology’ terms (describing gene and protein function) in Ensembl GOView. In which chromosomal band is HRAS located? On which clone and contig in the genomic sequence assembly? Which regions in and around the gene correspond to regions shown in the ‘mouse matches’ track? Is there a putative mouse orthologue? If so, where is it in the mouse genome? 2. Exploring a region Display the region between markers D12S1871 and D12S1604 in ContigView. Hint: start by clicking chromosome 12 on the human homepage. Centre the display around the CSAD gene by clicking on this gene in “Overview”. In “Detailed View”, why are some genes above and some below the DNA contigs? Note the large gap in the assembly. Turn on the ‘Gaps’ track under ‘Decorations’ and you will also see smaller gaps. Find out what the small gaps represent, and zoom in to see them in the DNA (contigs) track in ContigView. Export a FASTA file of the sequence of the displayed region and look for the gaps (Ns). What is the nearest marker to the start of the CSAD gene? How many synonyms does this marker have? Zoom in on the RARG gene. Identify a coding SNP and look at the corresponding SNPView page. Hint: turn on the SNP track if necessary.

34


Export an Ensembl gene list for this region. Include Ensembl gene id, known gene external id, known gene external database name, gene description, InterPro short description. Hint: use the drop-down menu at the top of “Detailed View” to get to MartView. 3. Examining the supporting evidence for a gene prediction Display the BRCA2 gene (or your own favourite gene) in ContigView. Examine the evidence for the gene prediction. Hint: collapse the chromosome view and overview; compare the collapsed and expanded view of the supporting evidence, e.g. Unigene. Note that max. 7 tracks can be displayed for each section; if there are more entries in the database the display may change on multiple re-loads. Tick ‘Acembly Transcripts’, ‘Ensembl RefSeq’, ‘NCBI Genomescan’ and ‘NCBI Transcripts’ under ‘DAS sources’ on the yellow bar at the top of ‘Detailed View’. Compare the Ensembl gene prediction to these gene prediction models. (Models from other groups may also be available.) Click on the gene predictions to go to FastaView displaying the sequence and some more information on the method of gene prediction. Have a look at the corresponding region displayed by the UCSC browser. To get there, take the ‘Jump to’ link on the gold bar. Explore the options under ‘Track controls’. Likewise, go to the NCBI browser and customise the display by choosing the relevant options under ‘Maps & Options’. On the Ensembl ContigView display, click on the BRCA2 transcript (or your own gene) and examine the links / alignments given for this gene in GeneView. Note especially the RefSeq and SwissProt entries. What do the Target %id and Query %id figures indicate? Hint: Consult the help page by clicking on ‘Help’ in the top right corner. Follow the link ‘View supporting evidence’ for the transcript ENST00000267071. From there follow the link ‘View Evidence’ to TransView. Examine the evidence from the different entries used to build this gene.

35

Comparative genomics and proteomics


Species available Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

Vertebrates: human, chimpanzee, mouse, rat, chicken, puffer fish, zebrafish, Tetraodon Arthropods: the mosquito Anopheles gambiae and Drosophila melanogaster, honeybee Nematodes: Caenorhabditis elegans, C. briggsae

Reach the home pages for each species via the generic Ensembl home page (http://www.ensembl.org) or bookmark a species homepage with a URL like http://www.ensembl.org/Rattus_norvegicus for those species for which we have an assembly but still being annotated, we have a Preview browser (Pre!) where you can have a look at new assemblies or new species: http://pre.ensembl.org/Bos_taurus/

Rat home page

Additional animal genomes will be incorporated in the future, with the emphasis on those important for biomedical research or for evolutionary comparisons. At the time of writing, sequencing of the following genomes is underway, and some of these will appear in Ensembl during 2005:

36


Vertebrates: Rhesus macaque, and opossum Arthropods: Aedes mosquito

A key element of any genome browser is the display of genes in their chromosomal locations. For most species, Ensembl runs an automated sequence annotation pipeline and gene build to provide annotation including genome-wide gene and protein sets. There are different challenges associated with building a comprehensive gene set in different organisms. The Ensembl gene building process is discussed further later in the workshop. For species where the research community is generating comprehensive manual annotation, Ensembl incorporates those gene and protein sets instead of, or in addition too, its own automated annotation. Thus, manual annotation is displayed for some human chromosomes, alongside the Ensembl predictions, and the manually curated genome-wide gene sets for D. melanogaster and C. elegans are used in place of an Ensembl set. The additional types of annotation available will vary to some extent between species. But because annotation is stored and displayed in a consistent way for all species, your experience working with one species transfers to new species, and comparisons of genomic sequence and homologous genes and proteins between species are facilitated.

Orthologues One kind of comparative analysis focuses on genes and proteins, and attempts to identify the orthologue (the ‘same’ gene) in different genomes. Apart from the value of such data in evolutionary studies, it is very useful to be able to identify the equivalent of important genes in one organism (for example, human disease genes) in other organisms that provide experimentally tractable models for studying that gene. The classic ‘model’ animals are now all represented in Ensembl (Drosophila, C. elegans, mouse) as well as zebrafish, and it is hoped to include other useful models as they become available. The automated identification of orthologues is made more difficult by the existence of families of closely related genes. Under such circumstances, Ensembl may show more than one potential orthologue, and the results need to be treated with caution. Of course there may ‘really’ be more than one orthologue when a lineage has generated additional family members by duplication after the divergence of the two organisms under consideration. See the section below on ‘synteny blocks’ for one way that Ensembl can help you to assess the orthologue pairs. In Ensembl, orthologues are identified starting with comparisons at the protein level. ‘All-versus all’ BLASTP+SW (Smith-Waterman algorithm) is first used to identify those protein pairs that are reciprocal best hits between two sets of proteins that represent every gene in the two organisms. Additional putative orthologues are then sought using synteny information and the reciprocal best hits as ‘RHS’ for Reciprocal Hit supported by Synteny. Where two homologous proteins are encoded by genes each located within 1 Mb of a pair of ‘BRH’, they are good candidates for being an additional orthologous pair. Currently we divide these BRH into UBRH ‘Unique Best Reciprocal Hit’ and MBRH ‘Multiple Best Reciprocal Hit’ the latter when have multiple but identical best hits, as it can happen if there is perfect protein sequence duplication of translated genes within a species. The same approach permits the

37


identification of adjacent family members that may be recently duplicated lineage-specific paralogues. Ensembl shows the information about potential orthologues on each GeneView page (and also in SyntenyView displays of synteny blocks, where these are available). The procedure has been applied to all pairs of vertebrates within Ensembl, to the two nematodes, and to the two insects.

------------------------------------------------------------------------------------

Parts of a GeneView page, showing putative orthologues

EnsMart lets you access and use this orthology information in a variety of ways. A set of genes can be selected that have identified orthologues in another species, and further restricted to those that share conserved upstream sequence. For any set of genes, output can include the details of any orthologues in other species and the locations of conserved upstream sequence.

Protein families What if the orthologue identification procedure fails to find a pair? And what if you are interested in looking at a wider set of potential orthologues and paralogues between two species or within a wider range of organisms? One option is to look for proteins that share particular domains. Ensembl runs domain prediction programs on all its protein sets, and provides access to this information in ProteinView (for individual proteins), and in DomainView (showing all the genes in a species that share a particular InterPro domain). However, some domains are shared by a wide range of proteins that have very different functions. Ensembl’s protein families are an attempt to identify clusters of functionally related proteins, among which one might expect to find most orthologues and paralogues. The family database is generated by running the Tribe-MCL sequence clustering algorithm on a set of peptides consisting of the Ensembl predictions for each Ensembl species, together with all metazoan sequences from Swiss-Prot and SPTrEMBL. On this set of peptides, an all-against-all BLASTP is run to establish similarities. Using these similarities, clusters can be established using the MCL algorithm. [For more detail of the underlying

38


methods, see Enright, A.J. et al. (2002) “An efficient algorithm for large-scale detection of protein families” Nucleic Acids Res. 30, 1575-1584]. The efficiency of this approach permits clustering to be done within a realistic time despite the very large numbers of proteins involved (for a recent release, around half a million protein sequences were loaded and processed in less than 24 hours, using 400 CPUs).

---------------------------------------------------------------------------------

Domain and Family information on a ProteinView page

Both family and domain information are shown in Ensembl GeneView and ProteinView pages, and DomainView and FamilyView pages make it easy to examine all the identified genes and proteins within a species. For families, you can also see the family members in other Ensembl species and the UniProt entries (from all metazoans) that fall in the same cluster. In the future, we plan to introduce a similar multi-species display for protein domains. EnsMart provides the means to rapidly and easily download sets of transcript or protein sequence with particular domains or from particular families, which can be very useful as starting points for alignment and phylogenetic analysis. In addition, the Ensembl database stores pre-calculated protein alignments for all members of a family, and these alignments can be displayed in JalView.

39


Part of a family alignment in JalView

Whole genome DNA-DNA alignments The alignment of the whole DNA sequence from two organisms is computationally demanding, and the algorithms to carry it out are under active development [see for example Ureta-Vidal, A. et al. (2003) “Comparative genomics: genome-wide analysis in metazoan eukaryotes” Nat. Rev. Genet. 4, 251-262]. Such data are of great interest both in studies of the mechanisms of molecular evolution and in attempts to identify conserved functional sequences such as novel genes and regulatory regions. Whole genome alignments become increasingly difficult as the evolutionary distance between two organisms increases. At present, Ensembl displays pair-wise alignments within mammals (human, mouse and rat), and within nematodes (C. elegans and C. briggsae); within these species groups, separation probably occurred <100 Mya. Ensembl is experimenting with different procedures to do the alignments: at present conserved regions are by identified either by BLASTz (data obtained from UCSC Genome Bioinformatics group) or by Phusion/BLASTn (used for C. elegans and C. briggsae); this firstly runs the two genomes through Phusion, which takes unique 17mers from one genome and compares them to the second genome, creating clusters of contigs from both genomes; and then comparing contigs within clusters using Washington University's version of BLASTn (without repeat masking). The output of wublastn is post-processed to keep only high-scoring pairs and to identify diagonals of blast alignments. Regions of these alignments that represent highly conserved regions are then selected using a filtering method devised by Jim Kent [see Schwartz, S. et al. (2003) “Human-Mouse Alignments with BLASTZ” Genome Res. 13, 103-107]. A third method, translated BLAT is used to compare genomes from more evolutionarily distant species, at the amino acid level. Thus regions of similarity will be biased towards those that code for proteins, although highly conserved non-coding regions might be detected as well. You can show a number of tracks (e.g. human vs. chimpanzee, human vs. rat, human vs. mouse, human vs. chicken, human vs. Fugu, human vs. zebrafish, human vs. Tetraodon, etc) displaying the conservation within the cluster (vertebrates, arthropods and nematodes) clustered d/or Caenorhabditis elegans vs. C. briggsae) from the ‘Compara’ menu in ContigView

40


for each comparison, showing two levels of conservation (labelled ‘cons’ for BLASTz or Phusion/BLAST comparisons and ‘high cons’ for highly conserved). Links make it easy to navigate back and forth to see details of the region in the two genomes and to download the sequence of regions of interest.

Part of a human ContigView detailed display panel, showing whole genome alignments with mouse

and rat. Further access to the conserved sequences at Ensembl is provided via DotterView displays of local alignments, while EnsMart provides a route for identifying specifically those highly conserved sequences that are located in regions upstream of pairs of orthologous genes.

DotterView display showing two homologous exons

Synteny blocks The identifications of segments of the genome where the order of particular genes is conserved between two species (‘synteny blocks’) is of interest not only for studying the evolution of chromosome structure, but also for helping to predict and identify pairs of genes between species that are (or are not!) orthologues. Where candidate orthologues in two

41


species are found to be located within well-conserved synteny blocks, you can have more confidence that the pair have been correctly labelled as orthologues. Ensembl finds ‘synteny blocks’ by grouping the conserved regions identified from genomic alignments. To be grouped, matches must represent the same relative orientation of chromosome sequences and must be separated by less than 100 kb. Groups that make up a block of less than 100 kb are discarded (parameters may be varied for different species). The approach requires that the species are close enough for genomic alignments to be attempted.

Part of a human CytoView display of human chromosome 11, showing ‘synteny blocks’ conserved on chimpanzee, mouse, rat and chicken.

Ensembl shows the synteny blocks in ContigView (overview panel) and CytoView displays for mammalian and nematode comparisons. In CytoView only, the blocks provide links to display that region in the other species. In addition, SyntenyView shows all blocks on a whole chromosome, related to the conserved blocks in a second species. (SyntenyView is not available for the C. elegans - C. briggsae comparison, as the C. briggsae genome sequence has not yet been assembled into chromosomes.) Links provide navigation between blocks and between species, and the display also shows genes in the current block together with their putative orthologues in the second species.

42


Comparative genomics and proteomics in Ensembl - examples a) From human gene to putative mouse orthologue and to Ensembl protein family

Human CFTR GeneView

F

Link to mouse homologue

Link to amilyView (human)

43


b) FamilyView

Alignments in JalView

Family members in ‘home’

Family members in other species

44


c) SyntenyView

Human chromosome 8 surrounded by mouse chromosomes with conserved ‘synteny blocks’.

Human genes in the selected block, together with their putative mouse orthologues

45


Comparative genomics and proteins in Ensembl - exercises Main exercise: Explore a protein family in human, mouse and rat, identify putative orthologues, and explore regions of conserved synteny. 1. Find the GeneView page for human SNX5 (Ensembl gene), and scroll down to the first ‘Transcript/Translation Summary’. 2. Take the link to the associated Protein Family.

• How many human genes produce proteins in this family? Are they all ‘known’ genes? • Are there members of the same family in mouse, rat and zebrafish? How many?

What about invertebrate species? • Click on one of the rat peptides and go to rat ProteinView. • From there take the link to the corresponding rat FamilyView. • How many rat genes are part of this family? • Find your way to mouse FamilyView, and follow the link to mouse ‘Sorting Nexin

5’ (GeneView). • Have a look at the section Orthologue Predictions. Follow the link to human

SNX5 (this takes you back to where you started). 3. Examine the genomic context of the human and mouse genes.

• From human SNX5 GeneView, follow the link ‘View gene in genomic location’ to ContigView.

• Which chromosomal region is the human gene located in? • Customise the display of ContigView. Select only Ensembl Trans., mouse (Mm)

cons. and high cons. and rat (Rn) cons. and high cons.; deselect all other options. • Have a look at the mouse and rat conserved regions in relation to the human

Ensembl transcript. Note that there the correspondence with exons, but note also that this is not perfect. Zoom in to examine in more detail.

• The conserved regions are probably showing “grouped” (a red ‘-’ shows to the left of the track label). Note that pointing to a region produces a pop-up with details of and a link to that region in the other species.

• Click on the red ‘-’ to the left of the Mm cons. track: ContigView will reload, the red ‘+’ replaces the ‘-’ and the hits are now “ungrouped”.

• Point to a mouse match in this track, and take the link to DotterView. Note the dots on the diagonals where exons align. Zoom in to examine a smaller region.

• Go back to human ContigView, point to a mouse match, and this time take the link to ‘Jump to Mus musculus’.

• This takes you to the corresponding display in mouse ContigView. In which chromosomal region is the gene located ?

• Zoom and/or customise the ContigView display to focus on the mouse Snx5 transcript, and turn on the rat and human matches tracks if necessary. Compare the amount of sequence showing as matched (the same threshold Blast score is used).

46


4. Examine the synteny blocks that these homologous genes are part of. • In mouse ContigView look at the top of the Overview panel. Note the coloured

blocks indicating regions where gene order is conserved in human and rat (‘synteny blocks’).

• Take the ‘Jump to’ link to mouse CytoView (link on the gold bar in Detailed View). Zoom out several steps to see a large region and compare the number of distinct rat and human synteny. Point to a synteny block and see the information and links.

• Go back to mouse ContigView, and take the ‘Jump to’ link to SyntenyView (Homo sapiens).

• Note that the large chromosome in the middle is the mouse chromosome – as you have come from a mouse page. The red box shows the region you have come from. The smaller chromosomes at the sides are the human chromosomes have blocks where gene order is conserved with the mouse chromosome.

• To the right is a list of genes found in this region, together with their homologues (putative orthologues). You can scroll along this list by using the ‘Upstream’ and ‘Downstream’ links at the bottom. Does the synteny information increase your confidence that the two genes are real orthologues?

• Move your mouse over another synteny block and follow the link ‘Centre gene list’. You now get the list of genes corresponding to the new area indicated by the red box

• Point a synteny block on one of the human chromosomes and take the link ‘Centre display on this chr’. This takes you to human SyntenyView, centred around the human chromosome.

• Move your mouse over a human synteny block and follow the link to the mouse chromosomal region shown in ContigView. You will generally see the overview section only (no detailed view), as the blocks cover rather large regions.

• Find the link from here to SyntenyView. Note this is the mouse site again, centred around the mouse chromosome.

• Compare the SyntenyView displays for mouse v. rat. • For more details on the functionality of SyntenyView, consult the associated help

page – click the red button in the top right corner.

Additional exercises: 1) Compare the BRCA2 gene in human and mouse.

• Find the human BRCA2 gene. • Identify its orthologue in mouse. • Compare the two genes with respect to length and number of exons. Open two

windows, one for human GeneView and one for mouse. Then, for detailed comparison, display human ExonView and mouse ExonView and tick the box in each to ‘View only exon sequence’.

• Display the human gene in ContigView, and examine the ‘mouse matches’ track for the region around the human gene.

• View the regions of conserved synteny that include the genes using SyntenyView.

47


2) Have a look at the protein family ENSF00000000779 • What does the annotation confidence score indicate?

Hint: Consult the help page. • How many genes in the family are there in human/mouse/rat, and what are their

chromosomal locations? Hint: use mouse and rat FamilyView. • View the Ensembl members of the family with JalView. Hint: scroll right to see the region

with good alignments. Try exporting the alignments in FASTA format. • Go to MartView and export a list of human genes in this family and their mouse

and rat homologues. Hint: The ‘Export a list’ link on human FamilyView pages provides a shortcut; include the External gene ids to see the gene names.

• The mouse genes seem to match 1:1, but what about the rat genes? Ensembl sometimes has problems with closely related family members, and of course there will also be problems where a genome assembly is incomplete. Hint: don’t be confused by one human gene appearing twice because of its two alternative transcripts.

3) Browse the human-mouse synteny blocks • Have a look at a number of human and mouse chromosomes in SyntenyView in

order to get some idea as to the size, orientation and distribution of synteny blocks. Hint: Entry points to SyntenyView include the MapView display of a chromosome and ‘Jump to’ links from ContigView or CytoView pages.

• Look at the synteny blocks in CytoView displays. • Export both the human and mouse sequence of a synteny block.

Hint: Display them in ContigView, take the ‘Export’ link.

4) Choose any gene of interest to you, and try to identify an orthologue in another species. Confirm whether they are part of a synteny block. If you have time, take the sequence of a transcript from one species (cut and paste from GeneView) and try a BLAST search in the other species. How do the results compare? 5) Similarly you may have a region of interest. Check whether it is part of a synteny block and which genes it contains in human and mouse.

48

Ensembl and other resources


Resources Ensembl is a key source aim of information about genomes, genes and gene products, and it is also valuable as a starting point for reaching other resources. The aim of this section is to highlight interlinked resources available from within the Ensembl web site.

OMIMTM (Online Mendelian Inheritance in Man) catalogues human genes and inherited disorders associated with those genes. This database maintained at the NCBI (available at http://www.ncbi.nlm.nih.gov/omim) contains textual information and references, as well as links to MEDLINE and sequence records in the Entrez system, providing links to additional related resources at NCBI. Ensembl links both to OMIM entries for any gene, where available, and to a subset of this database, the OMIM Morbid Map presenting the disease-associated genes described in OMIM.

LocusLink at NCBI aims to provide a single entry for each genetic locus, this will discontinued in the near future by Entrez genes. It presents information on official nomenclature, aliases, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites.

49


Sequence accessions include a subset of GenBank accessions for a locus, as well as the relevant NCBI Reference Sequence (RefSeq).

LocusLink allows access to other related resources such as GO (also linked to Ensembl). RefSeq is an attempt by the NCBI to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, tRNA, and protein products, providing a stable reference for gene identification and characterisation, mutation analysis, expression studies, polymorphism discovery, and comparative analyses.

RefSeqs are used as a reagent for the functional annotation of some genome sequencing projects, including those of human and mouse. GO The Gene OntologyTM Consortium (GO) produced a controlled vocabulary that can be applied to organisms even as knowledge of gene and protein roles in cells is accumulating and changing. GO provides three structured networks of defined terms to describe gene product attributes:

50


Molecular Function Ontology. Tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity. Biological Process Ontology. Broad biological goals, such as mitosis or purine metabolism, are accomplished by ordered assemblies of molecular functions. Cellular Component Ontology. Subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex. A gene product has one or more molecular functions and is used in one or more biological processes; it might be associated with one or more cellular components. For instance, cytochrome c can be described by the molecular function term electron transporter activity, the biological process terms oxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane. However, it is noteworthy stressing that GO is not a database of gene sequences, nor a catalogue of gene products. UniProt —the Universal Protein Resource is an integrated documentation resource for protein families, domains and sites. UniProt unifies SwissProt, TrEMBL and PIR. This system relies on the SuperFamily classification system, classifying proteins into a network structure that reflects their evolutionary relationships dealing with biological information in order to derive protein signatures, allowing consistent, rich and accurate proteins annotation. ProteinView, GeneView and DomainView provide links to the relevant entries from UniProt

. eVOC2 is a set of orthogonal controlled vocabularies that unifies gene expression data by facilitating a link between the genome sequence and expression phenotype information based on information retrieved from human cDNA and SAGE libraries. These expression terms are linked to libraries and transcripts allowing the assessment of tissue expression profiles, differential gene expression levels and the physical distribution of expression across the genome. This feature is available at present only within the EnsMart system, were it allows fast retrieval of expression information. SwissProt is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and high level of integration with other databases. It is maintained collaboratively by the Swiss Institute for Bioinformatics and the European Bioinformatics Institute (EBI).

51

2 “eVOC: a controlled vocabulary for unifying gene expression data”, J Kelso et al, Genome Research 13:1222-30 (2003).


SPTrEMBL (Swiss-Prot TrEMBL) Contains the entries that should eventually be incorporated into Swiss-Prot and therefore it can be considered as a preliminary section of Swiss-Prot as all have been assigned Swiss-Prot accession numbers.

Sequence homology search tools incorporated into Ensembl BLAST (Basic Local Alignment Search Tool), although this is not strictly an external resource, as it is embedded into Ensembl (currently WU-BLAST 2.0 supported3). This tool allows rapid, sensitive similarity searches of protein and nucleotide sequences against Ensembl's annotated organism databases. You can search against various genomic, known and novel cDNA, and peptide databases, depending on the available data for each organism. More recently batch BLAST is supported. SSAHA4 (Sequence Search and Alignment by Hashing Algorithm) another useful tool integrated in Ensembl, is designed to detect exact matches, or nearly exact matches, in DNA or protein databases, and is extremely fast. It works by converting a sequence database into a hash table. The hash table is scanned for hits, which are concatenated into matches. The SSAHA search has been optimised for alignments of high percentage identity and display as results the most significant matches for ungapped alignments between sequences. Each exact match in an SSAHA alignment is analogous to finding a high-scoring segment pair in BLAST. A number of consecutive matches on a contig may represent features of a gene such as exons or 5' and 3' untranslated regions, depending on the nature of the query sequence. Results for both BLAST and SSAHA are shown on the genomic assembly (chromosomes or scaffolds depending on the species) displaying the position of hits as arrows. From the karyogram you can jump to the chromosomal location in ContigView. A summary table lists the hits. Each alignment or HSP (high scoring pair) is shown separately (the list is sorted by score as default). For each alignment, links are provided to additional displays:

• [C] - View the hit in a ContigView display of the chromosome region. • [S] - Display the query sequence with the bases of all hit alignments marked. The

location of the current alignment is shown in red, other alignments in the hit are shown in blue.

• [A] - Display the alignment for the selected HSP. • [G] - View the hit in the context of the annotated genome sequence. Ensembl and EST

exons predicted by the Ensembl analysis pipeline, manually curated Vega exons, as well as exons predicted by ab initio methods and SNPs are annotated on the genome sequence.

3 BLASTN 2.0MP-WashU (29-Mar-2003) Copyright 1996-2003 Washington University: W Gish (1996-2003) http://blast.wustl.edu 4 “SSAHA: a fast search method for large DNA databases” Z Ning et al, Genome Research 11:1725-9 (2001).

52


Both the [S] and [A] displays also show the position of the alignment in query, database hit and genome, together with alignment score, E-value, length and percentage identity.

Search sensitivity offers pre-defined sets of optimised search parameters to find

the following alignments:

• Exact matches [Exact sensitivity] • Near-exact matches [Low sensitivity] • Near-exact matches (oligo) [Oligo sensitivity] • Allow some local mismatch [Medium sensitivity] • Distant homologies [High sensitivity] • No optimisation [Default sensitivity]

53


Exercises. 1. Go to ENSG00000001626’s GeneView. From there, follow the link to EMBL’s M55110. Jump to the relevant LocusLink entry and back to the main Ensembl Gene Report. From here, follow the link to OMIM’s 219700 and SwissProt’s P13569. (http://www.ensembl.org/Homo_sapiens/geneview?gene=ENSG00000001626) 2. To which UniGene cluster is mapped the following EST (accession number: BM694439) isolated from a serially normalised cDNA library of human cells? How many entries are in this cluster? Which Ensembl transcripts correspond to this EST? Which OMIM entry does UniGene link this EST? To which RefSeq entry have we mapped this Ensembl transcript? What information can you provide about its genomic location and function? 3. Take exons 10 to 12 from CFTR (ENSG00000001626) and BLAST them against human, rat and mouse Ensembl cDNA sets. Which Ensembl genes are hit? Are any of them exact matches? To which Ensembl family does this correspond? Which InterPro domains are present in these proteins? 4.Use the same sequence to query the genome using SSAHA. Which locations do you find? To which of the genes from the previous exercise do they correspond? 5. Take the human chromosomal region flanked by genetic markers D22S687 and D22S594. What genes are there and what information can you obtain about them?

54

EnsMart

EnsMart The EnsMart system (http://www.ensembl.org/EnsMart) extends Ensembl’s genomic browser capabilities, facilitating rapid retrieval of customized data sets. A wide variety of complex queries are supported, on various types of annotations, for numerous species. These can be applied to many research problems, ranging from SNP selection for candidate gene screening, through cross-species evolutionary comparisons, to microarray annotation. Users can group and refine biological data according to many criteria, including cross-species analyses, disease links, sequence variations, and expression patterns. Both tabulated list data, and biological sequence output can be generated on the fly, in HTML, text, Microsoft Excel, and compressed formats. A wide range of sequence types, such as cDNA, peptides, coding regions, UTRs and exons, with additional upstream and downstream regions, can be retrieved. EnsMart can be accessed via a public website, or through a Java application suite. Both implementations and the database are freely available for local installation, and can be extended or adapted to ‘non-Ensembl’ datasets. EnsMart attempts to organise data from Ensembl and third party databases into one query optimized system, using a data warehousing technique specifically designed for descriptive data. Key features of this approach are: scalability for large amounts of data; rapid and flexible data access; support for easy integration with third party data and/or programs; and intuitive user interfaces. Data The present implementation of the EnsMart system is built on the data in the Ensembl genome databases, extended with additional data sets. The data fall into three broad categories: genomic annotation (relating predominantly to genes and SNPs), functional annotation, and expression. All the species in Ensembl are represented. In addition to Ensembl-generated data and the external data that is imported by Ensembl, for instance annotations for Drosophila, C. elegans, Vega genes and SNPs, the EnsMart database also includes Genomics Institute of the Novartis Research Foundation (GNF) and EST-based expression data sets accessible through a controlled expression vocabulary – eVOC (Kelso 2003). MartView implements the user input abstractions using a “wizard” interface. Users navigate through a series of pages, each designed to gather input for one of the three required user-input abstractions. Certain attributes, such as sequences, gene structure information and SNP data, are functionally separated from the other attributes to facilitate easier grouping of reasonable queries, and efficient server response. Finally, the user selects the output format, and exports the data. Throughout the process, the user interface provides feedback on the number of items that have been selected.

Query organization The querying of EnsMart data is organized into three steps: start, filter and output. Below is a detailed description of each step using MartView as an example.

55

EnsMart

Start

The start stage includes the initial selection of the species and focus for the query. Currently the database contains nine species and four foci (Ensembl genes, EST genes, Vega genes and SNPs). Each species is designated with its genome assembly version (see figure above).

56

EnsMart

Filter

The second, filter, stage allows the user to limit the initial search to a subset with particular characteristics. A wide range of filter types can be applied, in any combination. The system supports batch querying and a set of external identifiers can be uploaded directly from a file. These then act as an external filter, allowing rapid cross-referencing of large numbers of external identifiers and association of items in the set with corresponding genomic annotation and sequences. The region filter allows a search to be carried out on the full genome, on a single chromosome, or on a portion of a chromosome (as determined by markers, bands or base pairs). The availability of other filter options depends on the data content for a particular species and focus. For gene foci, multi-species filters can limit the selection of genes to those associated with homologues in other species, or with an upstream region that is conserved between species. Further filters allow restriction to a particular gene type (e.g. novel genes or disease genes) or to genes that have been mapped to a particular external id set (e.g. Affymetrix, EMBL, Gene Ontology or HUGO identifiers). Searches can also be limited to genes with protein products possessing particular features, such as the presence of a transmembrane domain, signal sequence, or other domain specified

57

EnsMart

using identifiers from domain databases. Access to expression data stored in EnsMart is provided via the eVOC controlled expression vocabulary. Currently two datasets can be accessed in this way: the GNF microarray dataset, and EST-derived expression data. Finally, one can restrict searches to genes with SNPs in particular regions (e.g. coding or UTR), or to genes that have non-synonymous SNPs. The SNP focus allows whole genome or regional querying of SNPs mapped to a particular species genome. SNPs can be filtered to include only those that have been validated, or those with external TSC or HGVBASE ids. In addition SNPs with allele frequency data from a particular geographical populations, or all available populations, can be selected. Finally SNPs mapping to upstream regions, UTRs, coding regions or introns of genes can be selected. SNPs present in coding regions can be further filtered on whether they are synonymous, non-synonymous or stop SNPs. Throughout the process of filter selection, a summary table provides feedback on the number of items that pass the currently-selected filters, allowing users to modify their searches in an interactive way.

Output At the final output stage the data that is available about the set of items that pass the filters is organised into a number of topics, reflecting the kinds of data that are most likely to be required in different types of analyses (see figure). Again, the topics available will depend on the species and focus. For example, with a gene focus you choose between “features”, gene “structures”, “SNPs” and “sequences”. Within gene “features”, the data types available correspond roughly to the types of filters described above. Thus, chromosomal location, identifiers from external databases, protein domain annotation, expression attributes, homologous genes in other species, and locations of conserved upstream regions are all available. The gene “structure” options include information about the genomic and transcript locations of exons, introns and coding sequences, and the GTF output format is supported. The gene “SNPs” options include validation status, location and SNP type (e.g. coding, intronic, UTR, non-synonymous) as well as data relating SNPs to gene function. For non-synonymous SNPs the peptide shift is available, and the ratio of synonymous to non-synonymous SNPs within a gene can be shown. The gene “sequence” options include gene sequence, gene with flanking sequence, upstream or downstream sequences of user-specified length, exons, transcripts, and coding sequence only. The user is guided by a graphical representation of sequence options (see figure). A variety of output formats are supported, including HTML, Microsoft Excel, a number of flat file formats, GTF and FASTA.

58

EnsMart

59

EnsMart

a) Building a query in EnsMart

Link to MartView

‘refresh’ updates thesummary

60

EnsMart

Restrict your query with filters, then ‘next’

61

EnsMart

Select output options

Select output format, then ‘export’

62

EnsMart

Output formats: HTML

Links back to Ensembl

Excel

63

EnsMart

Retrieval of sequences

Select sequence types

64

EnsMart

b) Genes from a candidate region displayed in ContigView

Links to MartView

Region defined already

65

EnsMart

Practical exercise: Exercise - Retrieve all coding SNPs for novel human G-protein coupled receptor genes (GPCRs – IPR000276) on chromosome 2. 1. Go to http://www.ensembl.org/EnsMart

2. Choose the species and the focus for your query on the START page:

• ‘Homo sapiens’ is selected by default.

• Select ‘Ensembl Genes’ as we are looking for a gene list (NOT a SNP list!)

3. Click next. Note that the Summary to the right is updated. It indicates the total number

of Ensembl genes present in the database.

4. Select your region of interest from FILTER:

• Select ‘Chromosome 2’. From ‘Chromosome’ to ‘Chromosome’.

• Tick ‘Known status’ and ‘Excluded’ to retrieve “novel” genes only.

• Tick ‘InterPro ID’ and enter ‘IPR000276’ in the ‘Protein’ section.

• Click refresh at the top of the Summary and note that the number of genes

fulfilling these criteria is given under filter.

• Tick ‘Entries associated with SNPs of type’ and ‘Coding only’.

5. Click refresh again and note that the number of genes is updated. Click next.

6. Select the OUTPUT for your gene list:

• Choose the ‘SNPs’ tab.

• ‘Ensembl Gene ID’ is selected by default.

• Under the GENE ASSOCIATED SNPs panel click ‘Reference ID’, ‘Allele’, ‘SNP

Class’, and ‘Transcript Location (bp)’ and finally ‘Peptide shift’, ‘Synonymous status’

• Choose ‘MS Excel’ as output format.

• Click export Note that the output for this query gives you one row for each SNP. This means that a particular gene cahave multiple entries. Find the coding SNPs in the list.

.

n

66

EnsMart

Additional exercises: 1. Take the list available from http://www.ebi.ac.uk/~xose/Affy_exercise.txt with and knowing that it was obtained from an experiment with GeneChip HG-U133A, retrieve 500bp of conserved upstream sequence. Hint: Remember to untick ‘Region’ to allow genome wide searches. And tick ‘Entries associated with upstream matches in mouse genome’. This is a human probe set 2. Retrieve all mouse homologues of human disease genes containing transmembrane domains located between 1p22.3 and 1q22. Hint: Start with a human gene list. Include “Disease OMIM ID” and “Disease description” in the OUTPUT section.

3. Retrieve the gene structure of the following mouse genes:

ENSMUSG00000042351 ENSMUSG00000022393 Hint: This can be done under Filter – (Gene Section) Ensembl Gene IDs. Select ‘Ensembl Gene ID’, ‘Ensembl Transcript ID’, ‘Ensembl Exon ID’, ‘Exon Start’ and ‘Exon End’. Export the gene structure in HTML format. Then take the link from the Ensembl genes to GeneView in order to confirm the gene structure.

4. Retrieve the sequences 5kb upstream of all human ‘known’ genes between D1S2806 and D1S464. Hint: Use the ‘Genes – transcript information ignored’ under the sequence type options. ‘Known’ genes in Ensembl are Ensembl gene predictions that could be mapped to external database entries (e.g. SwissProt) with a high similarity score.

5. Retrieve all human SNPs that have a TSC ID, from chromosome 6 between 15 and 15.2 Mb, with 200 bases flanking sequence. 6. Design your own query!

67

Genetic variation in Ensembl


SNPs One of the goals of the human genome project is the compilation of a catalogue of common human sequence polymorphisms. Every individual’s genome (with the exception of identical twins) differs by approximately 0.1% from any other. Such a small percentage equates to millions of differences in a genome with 3200 million base pairs. This variation is the result of mutation followed by a combination of drift, migration and selection bringing the frequencies high enough to be observed. The most simple form and most common source of genetic polymorphism in the human genome (90% of all human polymorphisms) takes the form of a single nucleotide polymorphism (SNP) or substitutions at single base pairs. When comparing two genomes, SNPs with a frequency >1% occur perhaps every 1000bp. These are among the differences that account for inherited phenotypes, including disease susceptibility, and therefore identifying SNP sites along the genome will help to track down disease genes. This was the reason that prompted The SNP Consortium Ltd. (TSC) to produce a human SNP map in the first instance to pinpoint the contribution of individual genes to diseases and other phenotypes.

Ensembl displays SNPs retrieved from dbSNP (the SNP repository maintained by NCBI). Those represented in HGVbase (Human Genome Variation Database) and TSC are identified. For each SNP, a SNPView page shows a report with quantitative data (score or ‘map weight’, Validation status) as well a genetic context (SNP neighbourhood covering 20kb centred on the selected SNP, Alleles and Sequence Region). SNPs are colour-coded depending on whether they are located within a gene (coding, intronic or UTR), in the flanking region of a gene (within 10 kb upstream or downstream), or somewhere else. The colour code is provided at the bottom of the illustration.

68


SNPs are also displayed as a track in ContigView, using the same colour coding. The transcript/protein positions of SNPs, and their effects on the protein sequence, can be displayed as mark up on the sequence in TransView and ProteinView pages. ProteinView also shows this information in tabular form.

69


--------------------------------------------------------------------------------

70


EnsMart provides a range of option for exporting annotated lists or sequences of the SNPs in a region, or for showing SNPs associated with genes, and their locations and properties. We have also calculated some values for overall SNP properties on a ‘per gene’ basis.

Haplotypes A haplotype can be considered a particular pattern of sequential SNPs (or alleles) found on a single chromosome that tend to be inherited together over time. For the human genome, a high SNP density and the identification of regions exhibiting linkage disequilibrium (Haplotype map project, see: (http://hapmap.cshl.org) will contribute to the mapping and cloning of genes involved in common multifactorial diseases and to investigating variation in responses to drug treatment (pharmacogenomics and toxicogenomics), for example by reducing the number of SNPs that need to be typed in mapping studies.

71


Ensembl displays such human ‘haplotype blocks’ as a track in ContigView linked to HaploView displays giving details of the block – currently this is only available for a chromosome 22 data set (data from Dawson et al5.). HaploView shows the haplotype identifier, the location on the chromosome of the bounding SNPs for the block, the number of SNPs in the block and the study reporting the haplotype data.

The location shown in bold is where the haplotype maps on the current assembly. Also provided is the minimum number of SNPs that, in theory, have to be typed to provide an unambiguous identification of any of the non-singleton blocks. A graphical representation of the major haplotype patterns is displayed with SNP positions in rows and individual typed chromosomes represented by columns. These patterns are grouped into consensus blocks (singleton patterns are not displayed). The dbSNP ID and chromosomal location of each SNP is provided.

Alternative versions of large chromosomal regions The current human genome assemblies from NCBI includes three alternative haplotypes for a ~33 Mb sections of chromosome 6 around the MHC region. Ensembl displays them as

72

5 “A first-generation linkage disequilibrium map of human chromosome 22”, Nature 418: 544-548 (2002).


'chromosomes' 6_DR51 and 6_DR52 to be able to handle them with our code. Although Ensembl doesn’t display SNPs that hit the assembly >2 locations we make an exception with these ones, since they were mostly tagged (correctly) as mapping to a single place on chromosome 6. We will be implementing some changes in our code in order to handle these and other cases where alternative versions of the same chromosome region need to be displayed and annotated.

Strain or population specific SNPs In highly inbred mammalian species, including mice and rats, SNPs that have different alleles in different inbred strains will help researchers to exploit strain differences as a means of studying quantitative traits. Although the main focus of this workshop is human, it is worth pointing out that strain-specific information for mouse and rat, can be accessed from EnsMart. It is possible to obtain lists of SNPs that represent differences between the strain that was the subject of the genomic sequences (C57BL/6J for mouse and BN/SsNHsd/MCW (BN) for rat), and another strain of interest. For the future, Ensembl has plans to show additional variation information. Where a species has distinct populations, and sequence data is available from more than one population, SNPs between populations may help identify genes that show signs of having been under recent selective pressure. For example, those studying the malaria transmission by mosquitoes may be helped to find genes involved in the evolution of vector competence.

73


Main Exercise:

Retrieve SNPs strain-specific from BALB/cByJ in chromosome 14.

You can jump to EnsMart from different locations within Ensembl

STEP 2

Focus your query: SNPs

Then click ‘Next’

Select mouse asyour species

STEP 1

74


Select chromosome Name

STEP 4

and click ‘Next’

Entries associated with validated SNPs, Non-synonymous SNPs from BALB/cByJ

Select chromosome 14

STEP 3

Select the following attributes: Allele, Reference ID, Strain, Peptide shift and Mapweight

75

Or you could retrieve the actual sequence adding 5’ and 3’ flanking regions.


76


Additional Exercises. 1. From the information available in the ProtView page corresponding to ENSP00000003084, what type of SNP is the corresponding to position 433? And 480? Hint: You can either use the pop-up window that appears when you hover over those SNPs after selecting ‘Exon/SNPs’ and ‘number residues’ from the toggle menu or you can see that information from the list provided at the bottom of the page. Similarly you could obtain this information from TransView. 2. From the ContigView page displaying chr7: 116816801-116823944 select the only coding SNP Hint: Follow the colour code to dbSNP: 1800096 (HGVBase: SNP000002766).

3. Retrieve all the validated SNPs associated to ENSG00000001626. How many o of these SNPs are coding? Are there any in-dels? Hint: Use EnsMart (SNP focus) to retrieve SNPs within 116713968 - 116902666 in chromosome 7.

77

Adding your own data to Ensembl displays

Evaluating genes and transcripts One of the key objectives of the Ensembl project is to present the best possible genome-wide set of genes, transcripts and proteins for each species. Any group attempting to do this is faced with a number of problems. One kind of problem is reconciling experimentally determined mRNA sequences with the assembled genome sequence. mRNA and/or protein sequences must be placed on the assembled genome sequence (which may be incomplete). For very few genes has the genomic sequence been determined experimentally, so a prediction must be made of the details of gene structure (intron and exon boundaries; coding and UTR sequence). A second issue is how to predict genes for which no mRNA sequence is available. One can make use of EST data, of homology matches with known mRNAs from the same or other species, and of ab initio gene prediction programmes. Transcript prediction must also address the existence of alternative transcript forms for many genes. Most people would consider that high quality, manually curated, predictions would make the best possible set, but these are not yet available for most species. This section of the workshop introduces the automated system by which the main Ensembl gene set is produced, and explains where we show the underlying evidence for each prediction. Understanding something about the ‘gene building’ process puts you in a good position to evaluate and make effective use of the data presented in Ensembl. For some species, Ensembl generates a set of alternative gene and transcript predictions, EST genes, based on EST evidence alone. Ensembl also shows sets of ab initio gene predictions, using the programs Genscan or SNAP, for most species. In addition, Ensembl shows some transcript sets generated by other groups. These include manually curated transcript predictions for some human chromosomes, and for D. melanogaster and for C. elegans, plus automated gene predictions generated by various methods. Although for some purposes it is useful to work with a single genome-wide set, if you are working on a single gene you may wish to look in detail at possible alternative gene structures. Note. In genomics and bioinformatics, the sequences of mRNAs are almost always worked with in the form of DNA. These sequences are referred to as cDNAs, but note that the sequence presented is a DNA version of the coding sequence itself, not its complement.

The Ensembl gene set The Ensembl gene build process can be briefly described as follows. Species-specific protein and cDNA sequences are first placed in the genome to create transcript models (the computational procedures are different for the two cases). Proteins from other species are then used to locate additional transcript, by homology. Protein and cDNA based transcripts are combined to obtain transcripts with untranslated region (UTR) information. Redundant transcript structures are eliminated and genes are created by combining overlapping transcripts.

78


Overview of the current Ensembl vertebrate gene build Most genes are predicted using the sequences of known proteins aligned to the genome using the program Genewise (Targetted and Similarity Builds). UTR sequences for these genes are derived from the alignment of cDNAs to the genomic sequence using the program. Exonerate. Transcripts created in this manner are then clustered to form genes with alternative transcripts (GeneBuilder). Finally, novel genes supported solely by cDNA evidence (cDNA Gene Build) are added to the gene set. As an example, the human gene build on the NCBI33 assembly started with 48,176 human protein sequences from the databases. 42,589 proteins were placed at one location in the genome, 3173 at 2 locations and 781 at 3 locations. 492 proteins (1%) could not be located at all due to missing genomic sequence or insufficient coverage of the placed protein. The final gene build step resulted in 23,299 genes containing 32,035 transcripts. Of these 270 (0.84%) were built solely from cDNAs, 6219 (19%) were built from human proteins with no UTR attachment and 2983 (9%) were built from non-human proteins with no UTR attachment. Of the combined transcripts with UTRs there were 21,889 (68%) built from human protein and cDNA and 674 (2%) built from non-human protein and cDNA. Of the transcripts built 962 were tagged as probable pseudogenes.

79


For some species, Ensembl transcripts may also be generated by taking ab initio transcript predictions and validating exons with homology evidence. This approach can permit the inclusion of additional transcripts where the sequence similarity of the homology evidence is insufficient to build a transcript directly, and was used in the past as part of human gene builds. In the current human release, however, this approach did not yield a significant number of additional transcripts and was not used. The overall quality of the gene set depends on the quality of the sequence assembly and the number of high quality cDNA sequences available. Where both of these are good, as for human, the majority of the gene predictions are probably very accurate. For other species, especially those that are evolutionarily distant from the best-studied species, a higher proportion of the predictions will have inaccuracies and missing segments. There are a large number of things that cause Ensembl to not give perfect predictions, or mean that you cannot find your gene of interest – this summary may be useful:

Absence of genomic DNA, either entirely for a gene or partially for an exon or two. Misassembled genomic DNA. There were undetected errors in the software/hardware/network at the time of running the analysis which meant that we dropped that information. Genewise made a mistake in gene prediction You might be looking for a “gene” which in fact was an error in cDNA sequencing or otherwise incorrect.

You are welcome to email the Ensembl helpdesk for help in understanding problematic situations. Extensive testing of the Ensembl dataset for human (against manually annotated genes) indicates that:

Our confirmed gene set is conservative - we are not predicting confirmed genes without real confirming evidence. We fragment genes roughly at a 1.4 rate (i.e., 1.4 Ensembl confirmed genes to a real gene).

For two species, D. melanogaster and C. elegans, genome-wide sets of manually curated genes and transcripts are already available. Ensembl has ‘imported’ these sets and uses them as the main Ensembl gene, transcript and protein set. The Ensembl set is displayed in ContigView as red (known) or black (novel) transcripts, and is what you see in the main GeneView, TransView, ProteinView and ExonView pages.

Ensembl EST Genes Expressed sequence tags (ESTs) are notorious for their variable quality. They are single-read sequences and thus prone to sequencing error. Additionally, the libraries from which they

80


are derived can often be contaminated with genomic sequence that cannot be detected by an automatic annotation system. Finally, they are generally around 400bp long and thus a single EST is unlikely to cover an entire gene. For these reasons we have less confidence in genes built from ESTs so we build them separately from the main gene build. The Ensembl EST gene build process involves two steps. Firstly, ESTs from the genome of interest are aligned to the genomic sequence using either Exonerate or BLAT. With the total human division of dbEST being currently around 5.5 million ESTs, the computing task is a very large one. The variable quality of EST sequence leads us to employ conservative match criteria. We screen for the best-in-genome match for each EST, which must exhibit not worse than 97% identity with the genomic sequence over not less than 90% of its length. In general, only perhaps half of the starting EST dataset will meet these criteria. Then the aligned ESTs are used to build gene structures using the ClusterMerge algorithm and translations are predicted using Genomewise. The EST genes are perhaps most useful for seeing possible alternative splicing patterns for the predicted genes. The Ensembl EST set is displayed in ContigView as purple transcripts, currently for human, mouse and rat, zebrafish and mosquito. These are linked to modified EST versions of GeneView, GeneView, TransView, ProteinView and ExonView pages.

Ab initio predictions As part of the initial annotation of a genome assembly, Ensembl may run ab initio gene prediction programs such as Genscan, Snap or Genefinder. These are programs that apply ‘models’ of gene structure, generally ‘tuned’ for particular organisms, and predict genes based solely on the genome sequence (i.e. without taking any supporting evidence into account). The predictions are displayed in ContigView, but for most purposes you would want to work with the Ensembl transcripts, not the Genscans or Snaps. Genscan transcripts are displayed in green for human, mouse, rat, zebrafish and Fugu. Snap predictions are displayed, also in green, for C. elegans, and mosquito. Genefinder predictions are shown for C. elegans and C. briggsae. In most cases, a link is provided to a simple ‘FASTAView’ display of the predicted peptide sequence only.

Manually curated gene sets shown in Ensembl Comprehensive manually curated sets of gene and transcript structures are available for some human chromosomes and for the D. melanogaster and C. elegans genomes. Where available, these structures are seen by most people as the best set.

81


ContigView display of part of human chromosome 7, showing the main Ensembl transcript predictions together with EST transcripts, Vega manual annotation, Genscans and (as DAS tracks) Geneid and Twinscan predictions. For D. melanogaster, Ensembl imports the FlyBase manually curated set of predictions, and shows it as its main set. Similarly, for C. elegans, Ensembl imports the WormBase manually curated set of predictions and shows it as the main set. In both cases the transcripts are shown in red in ContigView, and are linked to normal GeneView, TransView, ExonView and ProteinView displays. For human, manually curated sets are available in the Vega database for chromosomes 6, 7, 9, 10, 13, 14, 20, 22 and X. They can be viewed in an Ensembl-like genome browser at: http://vega.sanger.ac.uk The same manual annotation can also been seen in human Ensembl ContigView as transcripts in various shades of blue. These are linked to modified GeneView, TransView, ExonView and (in some cases) ProteinView displays.

Predictions from other groups For some genomes, groups other than Ensembl have generated sets of predictions and made them available for display in the Ensembl browser. When this happens, they are generally displayed using the ‘DAS’ (distributed annotation system) format. Tracks showing these sets can be turned on in ContigView using the DAS pull-down menu on the gold bar. For example, sets you can see in human Ensembl include Acembly, NCBI Genomescan, Fgenesh++, Geneid and Twinscan. For most sets, a link is provided to a FASTAView display of the predicted protein sequence, together with a link to the web site of the contributing group.

82


Supporting evidence for gene predictions For Ensembl gene predictions, you can see in ExonView the sequence database entries that were used to support the building of that gene. These may be entries from UniProt, RefSeq or EMBL/GenBank. The display shows which exons are supported by each piece of sequence evidence, and provides links to view the original database entries.

Supporting evidence display from an ExonView page

Other ‘evidence’ tracks For an in depth analysis of a particular gene, you may also want to examine additional sequences that map by homology to the same region of the assembly. Ensembl has several tracks that show the mappings of different sequence sets. Note that the display can only show up to seven entries in a track at a particular position, even if we have mapped more than that. Depending on the species and the data set, the homology mappings may be to the entire genome sequence or just to regions that look like they may represent coding sequences. (The latter is a short cut used for large genomes and large data sets, and is based on homology mapping just to regions identified as potential genes by Genscan). Also note that an evidence track can be set as ‘expanded’ (shows up to seven separate entries) or ‘compact’ (superimposes data into a single line display). Try clicking the red + or – button to the left of the track. When the track is expanded, each item is linked to the original database entry so that you can quickly review its details.

83


Some evidence tracks in mouse ContigView. Two of the tracks are shown ‘compact’, the other are shown ‘expanded’. The mapped sequence sets may be species-specific (e.g. ESTs) or not (e.g. all of UniProt); and they may be cDNAs or proteins. The details of what is available and how the mapping was done vary form species to species. Try checking the species-specific help pages for ContigView (linked to from the main ContigView help page). However, these are incomplete as yet, so you may need to ask the helpdesk for more details.

Known vs. Novel Ensembl transcripts and Naming of genes Naming of genes and transcripts takes place after the gene build is completed. Transcripts or proteins are mapped on the basis of sequence homology to all species-specific UniProt and RefSeq entries. If they can be mapped, Ensembl refers to them as ‘known’ and shown in red in the browser; if not, they are called ‘novel’ and shown in black. The matches are shown in the similarity matches section of GeneView/TransView/ ProteinView. For mapping, we require high sequence similarity, but allow incomplete coverage (so that we can map transcript predictions that may be incomplete). You can see the % values for coverage in the similarity matches section. Obviously, 100% will give you more confidence in the mapping than 60%. The mapping is difficult for families of closely related genes, and wrongly annotated pseudogenes may also cause problems. If you see an ‘align’ link you can click to go straight to an alignment of the transcript with the mapped database entry. Names for transcripts (and hence genes) are taken from mapped database entries.

84


The official HGNC (HUGO) name used, if one is available, for a human transcript (or equivalent for other species); otherwise we use the name of the mapped entry itself (with priority UniProt > RefSeq). Novel transcripts have only Ensembl stable ids Genes are named after the ‘best-named’ transcript, and gene descriptions are also taken from mapped database entries (the source used is stated). Other sources of information about a gene include the GO terms section, the protein family description and any domains identified. The similarity matches section also shows other database entries that have been linked to a transcript. These are either derived from one of the ‘primary’ mappings (UniProt or RefSeq), or, in the case of mappings to microarray sequences, produced in a separate sequence mapping process by Ensembl.

Top sections of a human TransView page

85


Main exercise: Examine the evidence for the KCNS1 gene. 1. Display the human KCNS1 gene in GeneView.

• Enter KCNS1 into the text search box at the top of any human Ensembl page. • Take the link to the GeneView page for the Ensembl gene KCNS1. • What are the external database sources for the description and for the gene name? • Scroll down the GeneView page and have a look at the predicted exon structure.

Note the 5’ and 3’UTRs (untranslated regions). 2. Examine the supporting evidence for the KCNS1 gene in ExonView.

• Click on ‘View exon information’ to go to ExonView. • The bottom section of ExonView shows the supporting evidence that was used

during the Ensembl transcript building process. Which databases did the entries come from, for this example? Which support the UTRs?

• Point to an exon and take the ‘View Exon Alignment’ link to see the alignments of the Ensembl predicted exon against the supporting evidence.

• Take the gene identifier link at the top of ExonView to return to the GeneView ‘gene report’ for KCNS1.

3. Examine homology evidence for the KCNS1 gene in ContigView.

• Click on the link ‘View gene in genomic location’ under Genomic Location. This takes you to ContigView displaying only the region encompassing the gene. Zoom out by clicking on the Minus button next to the Zoom triangle.

• Look at the homology evidence tracks, e.g. proteins, Unigene, human cDNAs. Note that the track labels are links to the help page. Moving your mouse over the evidence brings up a pop-up menu with a link to the external database entry. Try some links.

• Condense the Unigene and Protein track as well as the Overview section by clicking on the Minus boxes.

• Under Features select ‘EMBL mRNAs’, and deselect ‘Markers’. • Under Decorations select ‘Show empty tracks’. This will help you remember which

tracks you have selected. • Examine the evidence for the KCNS1 gene in the new evidence tracks.

4. Compare transcript predictions made by other methods.

• In the Features menu, turn off most of the evidence and make sure Vega Trans., EST Trans., and Genscans are turned on.

• Look at the Genscans track. Zoom out further. How does the Genscan prediction differ from the Ensembl prediction? Note that this track shows ab initio Genscan predictions, not relying on supporting evidence.

• Compare the manually curated Sanger Institute gene prediction (‘Vega trans.’: only available for a few chromosomes), and Ensembl gene transcript predictions.

86


• Click on a Vega (or point and take the pop-up links) to see associated GeneView pages. These are similar to standard Ensembl GeneView pages, but with less annotation.

• Use the DAS sources menu to turn on tracks showing transcript predictions from other groups (e.g. Fgenesh++, Twinscan, Acembly Transcripts, NCBI Gnomon, NCBI Genes). Compare and contrast!

• For most of the predictions, you can access more information by pointing and taking the link on the pop-up menu.

5. Look at the Similarity Matches section.

• Go to TransView by pointing to the Ensembl transcript and taking the direct ‘Transcr.’ link from the pop-up menu.

• ‘Known’ Ensembl transcripts like this one (shown in red in ContigView) have been successfully mapped to external database entries (note that this mapping is done after the genes have been built). These entries are given in the Similarity Matches section of TransView (repeated in the ‘Transcript/Translation Summary’ section of GeneView). Have a look at the types of databases linked out to. What do the Target and Query % ids indicate? Hint: There are context-sensitive help pages in Ensembl - click the red button in the top right corner.

Additional exercise:

1) Examine the evidence for the ADA gene. Have a look at the homology evidence presented in ContigView. Are some exons better supported than others? Compare the Ensembl prediction with the Vega manual curation. (For an explanation of the Vega curation method look at the GeneView page for the Sanger gene.) Most of the transcript looks very similar, but it looks like the Ensembl prediction may miss the most 5’ and 3’ exons (generally the most difficult, especially if they are all or mostly UTR.). Does the difference between the Ensembl and Vega prediction affect the resulting protein sequence? You can use TransView, turning on the ‘Codons/peptides’ mark-up to help check this out. Have a look at the SwissProt entry as well. Is there any indication in the ‘Similarity matches’ section of TransView that the Ensembl transcript may not be perfect?

87


Adding your own data to Ensembl displays Can you add your own data to the Ensembl system? If you work in a large group with a good level of computing/bioinformatics support, you could consider setting up your own ‘in-house’ installation of Ensembl. You could then add data directly to the databases or even add your own features and displays. This approach has been adopted by a few large research groups and by some pharmaceutical companies. Everything you need to set up the site is freely available and there is a detailed set of instructions at: http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/InstallEnsemblWebsite.html The Ensembl project does not have the resources to provide support for such installations, but you can often get help with questions by emailing the Ensembl developers mailing list. But for most researchers this is ‘overkill’ – a more common wish is to be able to show a set of your own data mapped to the genome, alongside the standard Ensembl tracks in ContigView. This might be ‘private’ data, or something that you want to share data with your collaborators. There are a number of different ways you can do this: (1) Set up a ‘DAS server’ on your own computer (this lets you add the data in a permanent

and secure way). (2) Provide a simple file of data to Ensembl in the appropriate ‘DAS format’ and we will

store that data and enable its display on a temporary basis. (3) Put your data in an HTML file in a suitable format and provide the Ensembl browser

with the appropriate URL. The first two approaches use the DAS system, which was devised as a means of permitting independent groups to apply their own annotation to a reference genomic sequence [see Dowell, R.D. et al. (2001) “The Distributed Annotation System” BMC Bioinformatics. 2, 7]. Approach (1) is something you are likely to attempt only with some experience or systems support. Approach (2) and (3), on the other hand, should be usable by anyone who is prepared to spend a bit of time. Some simple examples are described below.

Approach (1): setting up your own DAS server A simple way to set up a DAS server for showing data in Ensembl displays is the LDAS approach described in: http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/InstallLDASServer.html

Approach (2): Uploading data for display in Ensembl ContigView via DAS Procedure for uploading data: 1. Go to any ContigView page for the organism of interest. 2. Using the “DAS Sources” pull-down menu above the Detailed display panel, select the ‘Upload data …’ link.

88


3. Enter your details and the data file on the ‘Upload data’ page. 4. Note the data source id (e.g. ens_4498da542472435aa4799bd3912ab20b) 4. Edit the display options if required. 5. Refresh the ContigView display to see your data in a new track. Example of a data file: (The file describes features at the start of chromosome 2. To see the features after uploading the data, display the first 10 kb of chromosome 2 in ContigView. Upload and display for this example will only work in a species with a chromosome called 1.) #Example Ensembl data upload file (for chromosome coordinates) #Use tab-separated columns #Columns 1 & 2: Group/class and name #Columns 3 & 4: Type and Subtype #Column 5 Reference sequence (eg vhromsome name) #Columns 6 & 7 Range (start & end in bp) #Column 8: Strand #Column 9: Phase (put a dot if not applicable) #Column 10: Score #Columns 11-12: Similarity alignment start and end (OPTIONAL) #<col1> <col2> <col3> <col4> <col5> <col6> <col7> <col8> <col9> <col10> #<group> <name> <type> <subtype> <chromosome> <start> <end> <strand> <phase> <score> Similarity Fake_match_1 homology wublastn 2 4000 4050 + . 100 Transcription Fake_transcript_1 transcript exon 2 4200 4300 + . 100 Transcription Fake_transcript_1 transcript exon 2 4400 4500 + . 100

Display of uploaded features in Ensembl ContigView

To share the track with other people, tell them the data source id (e.g. ens_4498da542472435aa4799bd3912ab20b).

89


Approach (3): URL-based display Procedure for uploading data: 1. Go to any ContigView page for the organism of interest. 2. Using the “DAS Sources” pull-down menu above the Detailed display panel, select the ‘URL based data’ link. 3. Enter the URL for your data file in the box (e.g. http://www.ebi.ac.uk/~xose/ensembl_test_bed.txt) 4. The ContigView display will refresh automatically. Example of a data file: (The file describes features at the start of chromosome 2. Display for this example will only work in a species with a chromosome called 2. The first line tells the browser what region to display. The second line sets some general properties – there should be a space, not a new line, between the colour and the URL information. The last three lines describe the locations of some features on the assembly, in tab-separated columns.) browser position chr2:1-10000 track name=ensembl_test_bed description=“Ensembl test (BED)” color=000000 url=http://www.ebi.ac.uk/~xose/ensembl_test.html 2 1000 1100 bed_feature_1 1000 + 2 2000 2100 bed_feature_2 500 + 2 3000 3100 bed_feature_2 100 + For more details, see the help page at: http://www.ensembl.org/Homo_sapiens/helpview?kw=urlsource The approach is based on the custom track approach developed at UCSC – see: http://genome.ucsc.edu/goldenPath/help/customTrack.html To share the track with other people, tell them the URL (in this example, http://www.ebi.ac.uk/~xose/ensembl_test_bed.txt).

90

Databases and MySQL access

Databases and MySQL access All the data that you see on the Ensembl web site is stored in tables in MySQL databases. MySQL is a freely available relational database management system (RDMS). You may have heard of Oracle, which is a commercial RDMS. Each species has a ‘core’ database, plus additional databases (e.g. human also has disease, snp, est, estgene, lite, haplotypes and vega databases). There are also multi-species databases for EnsMart and for comparative and family data. If you were familiar with MySQL (or any other ‘standard query languages’) then you would find it fairly easy to start pulling information directly out of the Ensembl databases. We have a server with copies of the databases available for anyone to query directly. Just point a MySQL client at ensembldb.ensembl.org like this: [mycomputer]: mysql -h ensembldb.ensembl.org -u anonymous However, before you dive in, you should check if you can get the data you want via EnsMart. If you can, EnsMart is usually the best way to go. It has been designed so that you can access the data quickly and easily. The databases themselves are designed primarily to store the data efficiently.

int ori

int cmp_end

int cmp_start

int asm_end

int asm_start

int cmp_seq_region_id

int asm_seq_region_id

assembly

int coord_system_id

int length

varchar name

intseq_region_id

seq_region

varchar value

int attrib_type_id

intseq_region_id

seq_region_attrib varchar name

text description

varchar code

int attrib_type_id

attrib_type

mediumtext sequence

int seq_region_id

dna

mediumblobsequence

text n_line

int seq_region_id

dnac

“default_version”, “sequence_level”

attrib

varcharversion

int rank

varcharname

int coord_system_id

coord_system

1

0..n

1

1…n

1

0..1

0..1

1

1

0..n

0..1

0..n

0..1

0..n

Sequence region related tables from the ‘core’ database (primary keys are in red, foreign keys in green

or rose, arrows indicate relation between tables)

91

Databases and MySQL access

If you do want to use MySQL to access the databases directly, start by having a look at some of the documentation available on the web site. In particular, there are short descriptions of the database tables at: http://www.ensembl.org/Docs/schema_description.html

tinyint phase

tinyint end_phase

int seq_region_id

int seq_region_start

int seq_region_end

tinyint seq_region_strand

int exon_id

exon

0..n0..n

1

1

0..1

varchar type

int analysis_id

int display_xref_id

int seq_region_id


int seq_region_end


int gene_id

gene

varchar stable_id

int version

int gene_id

gene_stable_id

int gene_id

int display_xref_id

int seq_region_id


int seq_region_end


int transcript_id

transcript

varchar stable_id

int version

int transcript_id

transcript_stable_id

varchar stable_id

int version

int translation_id

translation_stable_id

int transcript_id

int seq_start

int start_exon_id

int seq_end

int end_exon_id

int translation_id

translation

0..n

0..n1

1

text description

int gene_id

gene_description 0..1

1

0..1

1

1

0..1

1

1..n

0..1

1

1

0..1

exon_transcript exon_id int

rank int

transcript_id int

exon_stable_id exon_id int

version int

stable_id varchar

Gene related tables from the ‘core’ database(primary keys are in red, foreign keys in green or rose,

arrows indicate relation between tables).

The tables reflect the Ensembl gene model which states that a gene contains one or more transcripts, which in turn contains one or more exons. Transcripts can have a translation (or not, like in pseudogenes or RNA genes). All of the above objects can have a stable_id that Ensembl tries to map over between different genebuilds.

92

http://www.ensembl.org/Docs/schema_description.html

Perl API

Perl API If you have some experience of programming in Perl, you may be interested to know that Ensembl has an extensive API (‘application-programming interface’) that lets you pull information out of the databases without having to know any of the details of how the data is stored in the tables. For Java fans, there is also a Java API, though this is less comprehensive. Again, before you dive in and learn how to access the data with Perl scripts, you would want to check if you can get the data you want via EnsMart. If you can, EnsMart is still going to be easier. But if you know you are going to be working with Ensembl data a lot, and especially if your research involves a lot of bioinformatics analyses, time spent getting familiar with the Ensembl Perl API could be time well spent. The system is all built around ‘objects’ so if you have used object-based API’s before (for example, BioPerl) you will find it easier to get started. There is a useful tutorial which reflect the latest changes in the schema at: http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/ensembl/docs/tutorial/ensembl_tutorial.pdf

93

Accessing database tables with MySQL

Accessing database tables with MySQL You can access Ensembl databases, both directly and through the Perl API. In order to access the public Ensembl MySQL server (ensembldb.ensembl.org, username ‘anonymous’) and run queries with your own SQL. It is advisable to start accessing the data on ensembldb.ensembl.org with just a MySQL client. If you are familiar with MySQL, then this is pretty easy - please see the start of the tutorial: http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/CodeTutorial.html Ensembl provides an Internet accessible host (ensembldb.ensembl.org) with the latest databases that allows you to do a lot of work from an Internet connected host solely by using “client” software. To check whether you have a MySQL client installed on your Linux box, try:

mysql -u anonymous -h ensembldb.ensembl.org If it wasn’t installed, go to http://www.mysql.com to download the client. To connect to the actual MySQL program, you can use a protocol called telnet. Telnet is basically a connection to your web host server, so that you can perform some basic commands Enter in your MySQL password, and you will be connected to the MySQL Program. If you look on your screen, it will say something like

Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is ## to server version: ## Type 'help;' or '\h' for help. Type '\c' to clear the buffer mysql>

Inside the MySQL prompt you generally want to start with the database “homo_sapiens_core_26_35” (at time of writing this was the most recent build) and the database “ensembl_mart_25_1”, There are also databases for other species, that you should be able to query in the same way as described for human. Try the “show databases” SQL command to see all the databases that are available, followed by a “use” command similar to “use homo_sapiens_core_26_35” to choose the current/most recent database. Other databases are names like “homo_sapiens_snp_26_35”. Starting with the “core” database (homo_sapiens_core_26_35 or the most recent update of this database), try some of these queries to get a taste of what is stored in the databases; Choose the current human core database;

“USE homo_sapiens_core_26_35;”

94

ensembl manual - post

Documents