Post on 02-Jul-2015
Description: GRC Workshop held at Churchill College on Sep 21, 2014. Talk by Bronwen Aken discussing the Ensembl approach to annotating the complete human reference assembly.
1. Ensembl annotation. Bronwen Aken, 21 September 2014. EBI is an Outstation of the European Molecular Biology Laboratory.
2. How Ensembl started: Ewan Birney, Michele Clamp, Tim Hubbard.
3. Ensembl's goals
   - Annotate (vertebrate) genome
   - Integrate with other biological data
   - Make publicly available
   - Stable, automatic annotation
   - High quality
   - Regular release cycles
   - Open source
   - Provide a bioinformatics framework to organise biology around the sequences of large genomes
4. Challenges
   - 1. Find functional elements in a genome: data have lots of noise
   - 2. Software / hardware: storing and manipulating data
   - 3. Intuitive and comprehensive access to data: visualization
5. GRCh38 annotation in Ensembl
6. What is Genebuilding?
   - Automatic, evidence-based annotation of genes
   - Not ab initio
   - Based on sequence alignment
   - Best-in-genome
   - Aim for high specificity: prefer to miss a few features than heavily over-predict
   - Automated gene annotation pipeline is designed around decisions made during manual annotation
7. Advantages of re-annotating
   - Add new genes to new / fixed genomic regions
   - Updated supporting evidence: remove models built on data that has been deleted from archives
   - Move alignments to regions with better mapping
8. Gene annotation pipeline: the basics
   - Identify interesting regions
   - Rough alignment of sequences to genome
   - Exhaustive alignment to produce transcript models
   - Filter models
   - Prioritize data sources
   - Produce best guess gene set
9. [Flowchart] Repeatmasking; Same-species proteins; Other-species proteins; cDNAs/ESTs; UTR addition; Final gene set; Filtering; Protein-coding genebuild; Filtering; TranscriptConsensus; LayerAnnotation. Also: small ncRNAs, lincRNAs, pseudogenes.
10. [Flowchart] Repeatmasking; Same-species proteins; Other-species proteins; cDNAs/ESTs; UTR addition; Final gene set; Filtering; Protein-coding genebuild; Filtering; RNA-Seq models. Also: small ncRNAs, lincRNAs, pseudogenes. MERGE WITH HAVANA.
11. Release cycle. 26 September 2014. [Figure labels: Regulation, Gene, Allele, Conserved sequence. Figure adapted from the ENCODE project, www.nature.com/nature/focus/encode/]
   - Genes: coding & noncoding; protein & mRNA alignments; GTF & BAM files
   - Compara: conserved DNA sequence; multiple genome alignments; homologues; protein families
   - Regulatory regions; DNA methylation; TFBS; open chromatin
   - Variation: SNPs, indels, structural variation; phenotypes; QTLs
12. Integrate with other species: Chimpanzee, Human. Gene SLC12A1.
13. Patch annotation in Ensembl
14. Genome assembly representation
   - Coord_system table: lists the allowed coordinate systems (chromosome, scaffold, contig), with versions (GRCh37, GRCh38). Contigs are shared between assemblies, so they have no version.
   - Toplevel coordinate system: chromosomes + unplaced scaffolds + unlocalized scaffolds + alternate sequences. Most popular means to access the whole genome. API options for including/excluding alternate sequences and PAR.
15-16. Genome assembly representation. [Diagram: GRCh38; Scaffolds; Contigs; Chromosome] DNA only loaded for contigs.
17. Genome assembly representation. [Diagram: GRCh38; Scaffolds; Contigs; Chromosome]
18-19. Genome assembly representation. [Diagram: GRCh38; Scaffolds; Contigs; Chromosome; GRCh37]
20. Seq_region names
   - Regions of the genome are given a slice name; it's like an address, e.g. chromosome:GRCh37:6:133090509:133119701:1
   - Users like to say "chromosome 6"
   - INSDC coordinates are versioned, but less human-readable: chromosome:GRCh37:CM000668.1:133090509:133119701:1
   - Field order: coord_system : assembly : seq_region name : start : end : strand
21. Alternate sequences
   - Assembly_exception table defines "bubbles"
   - Initially set up to handle Y chromosome PAR
   - Adapted to work for MHC haplotypes
   - Now also used for GRC patches
   - Assumes equivalent region will be present in primary assembly
22. Gene annotation on a patched genome. [Figure: two Ensembl browser views, patched chromosome (Hsap HG183_PATCH, about 62.3 to 62.5 Mb) and primary chromosome (Hsap Chr. 17, about 62.225 to 62.475 Mb); scale bars 331.04 kb and 276.06 kb; forward and reverse strands. Tracks include "Genes (GENCODE...", Contigs, "Assembly excepti..." and "H.sap-H.sap lastz-...". Gene legend: protein coding; merged Ensembl/Havana; RNA gene; pseudogene. Alternative alleles: Projection. Alignment legend: insert relative to reference; delete relative to reference; large insert shown truncated due to image scale or edge.]
   - TEX2 gene lies across the patch boundary
   - PECAM1 is annotated only on patch HG183
   - Gap in primary assembly
23. Gene annotation on a patched genome
24. Gene annotation on patches. [Diagram: Patch; Primary]
25. Gene annotation on patches. [Diagram: Patch; Primary]
   - 1. Manual annotation
26. Gene annotation on patches. [Diagram: Patch; Primary]
   - 1. Manual annotation
   - 2. Project models to patch
27. Gene annotation on patches. [Diagram: Patch; Primary]
   - 1. Manual annotation
   - 2. Project models to patch
   - 3. Gap-fill with mini genebuild
28. Ongoing challenges: how strict should we be when aligning proteins and cDNAs to the genome?
   - 1. Genome assembly: sequencing error (inversion, artificial duplication); assembly incomplete. Alignments must allow for truncated matches.
   - 2. Population variation: linear genome is made from one individual vs protein databases contain data from many unknown individuals. Paralogues, gene families, pseudogenes.
   - 3. Public databases, e.g. UniProt: include suspect data and are incomplete for many species. When there's a match, or no match, is it biologically real? Aligning proteins from other species must allow for mismatches.
   - Specificity vs sensitivity
29. Ensembl acknowledgements. Funding: European Commission Framework Programme 7.
30. Questions?
31. Reporting data to users
   Visualisation and data querying:
   - When browsing the primary assembly, how do we make it obvious to users when alternate sequences are available?
   - How do we show when the alternate genomic sequences are identical or differ from one another?
   - How do we show whether the alternate genome sequences result in identical or different transcribed / translated products?
   - How do we make a qualitative call about which allele is better to use? e.g. ABO
   - Data download options
   - Concept of a canonical transcript per gene (per tissue)
   Data analysis:
   - Linking between alternate alleles (and paralogues?)
   - How do we show when data have been mapped from an old to a new assembly, compared to freshly aligned to a new assembly? When is it right to map instead of align?
   - In a non-linear genome model, how will SNPs (rsIDs) work?
   - In a non-linear genome model, what coordinate system should be used?
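The slice-name "address" from slide 20 (coord_system, assembly, seq_region name, start, end, strand, separated by colons) is simple enough to illustrate in a few lines. The following is a minimal Python sketch, not part of the Ensembl API: the `SliceName` class and `parse_slice_name` function are hypothetical names invented here, and the checks on strand and on start <= end are simplifying assumptions rather than rules stated in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceName:
    """Hypothetical container for the six fields of a slice name."""
    coord_system: str     # e.g. "chromosome"
    assembly: str         # e.g. "GRCh37"
    seq_region_name: str  # e.g. "6" or the INSDC accession "CM000668.1"
    start: int            # 1-based, inclusive
    end: int              # 1-based, inclusive
    strand: int           # 1 (forward) or -1 (reverse)

    @property
    def length(self) -> int:
        # Both ends are inclusive, hence the +1.
        return self.end - self.start + 1

    def __str__(self) -> str:
        # Rebuild the colon-separated name in the original field order.
        return ":".join([
            self.coord_system, self.assembly, self.seq_region_name,
            str(self.start), str(self.end), str(self.strand),
        ])


def parse_slice_name(name: str) -> SliceName:
    """Split a colon-separated slice name into its six fields."""
    fields = name.split(":")
    if len(fields) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(fields)}")
    coord_system, assembly, seq_region_name, start, end, strand = fields
    parsed = SliceName(coord_system, assembly, seq_region_name,
                       int(start), int(end), int(strand))
    # Assumption for this sketch: only forward/reverse strands and
    # non-circular regions are handled.
    if parsed.strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    if parsed.start > parsed.end:
        raise ValueError("start must not exceed end")
    return parsed


by_name = parse_slice_name("chromosome:GRCh37:6:133090509:133119701:1")
by_insdc = parse_slice_name("chromosome:GRCh37:CM000668.1:133090509:133119701:1")
print(by_name.seq_region_name, by_insdc.seq_region_name, by_name.length)
```

Both example names from the slide parse to the same coordinates and differ only in the seq_region name field, which is the point the slide makes about human-readable versus versioned INSDC names.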