
Page 1: SUPPLEMENTARY METHODS SECTION Analysis of supporting …profs.scienze.univr.it/~delledonne/Insegnamenti/Genomi/6... · 2012. 9. 10. · UCSC 19984 6056 NONE 4731 59180 66612 AceView

SUPPLEMENTARY METHODS SECTION

Analysis of supporting evidence for alternative splicing

The evidence used for this analysis consists of human transcript sequences from the International Nucleotide Sequence Database Collaboration (Cochrane et al. 2011) databases (GenBank, ENA, and DDBJ). Exonerate (Slater and Birney 2005) RNA alignments from Ensembl (Flicek et al. 2011) and BLAT (Kent 2002) RNA and EST alignments from the UCSC Genome Browser Database (Fujita et al. 2011) are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA (Wilming et al. 2008) and RefSeq (Pruitt et al. 2012) groups are tagged as suspect.

Multi-exon GENCODE annotations are evaluated against two criteria: every intron must be supported by an evidence alignment, and the evidence alignment must not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and are not counted as differences from the annotation. All intron boundaries must match exactly, but transcript start and end locations are allowed to differ. Each evaluated annotation is assigned one of the following categories:

1. Good: all splice junctions of the transcript are supported by at least one non-suspect mRNA
2. Suspect mRNA: the best supporting mRNA is flagged as suspect
3. estN: supported by multiple ESTs
4. est1: supported by a single EST
5. suspectEst: the best supporting EST is flagged as suspect
6. poor: no single transcript supports the model structure
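The category assignment above can be illustrated with a minimal Python sketch. It is not the pipeline used for the analysis. The `Evidence` record and the `classify` function are hypothetical names, and the sketch assumes that an alignment "supports" an annotation when it implies exactly the same set of introns (transcript ends are ignored), that categories are tried in the order listed, and that estN and est1 count only non-suspect ESTs.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple

Intron = Tuple[int, int]  # (intron start, intron end) in genomic coordinates


@dataclass(frozen=True)
class Evidence:
    """Hypothetical record for one evidence alignment."""
    kind: str                    # "mRNA" or "EST"
    introns: FrozenSet[Intron]   # introns implied by the alignment
    suspect: bool = False        # listed as erroneous by one of the groups


def classify(annotation_introns: Iterable[Intron],
             evidence: Iterable[Evidence]) -> str:
    """Assign a support category to one multi-exon annotation."""
    target = frozenset(annotation_introns)
    if not target:
        raise ValueError("single-exon annotations are not evaluated")
    # Assumption: support means an identical intron set, so an alignment
    # with extra introns (unannotated exons) or missing introns is rejected.
    hits = [e for e in evidence if e.introns == target]
    mrnas = [e for e in hits if e.kind == "mRNA"]
    ests = [e for e in hits if e.kind == "EST"]
    if any(not e.suspect for e in mrnas):
        return "Good"
    if mrnas:
        return "Suspect mRNA"
    clean_ests = [e for e in ests if not e.suspect]
    if len(clean_ests) > 1:
        return "estN"
    if len(clean_ests) == 1:
        return "est1"
    if ests:
        return "suspectEst"
    return "poor"
```

For example, an annotation whose only full-structure support is two non-suspect ESTs would be labelled estN under these assumptions, while one supported only by an mRNA with a missing intron would be labelled poor.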

Annotations in the MHC region and other immunological genes are not evaluated because automatic alignments in these regions tend to be unreliable. Methods for evaluating single-exon genes are still under development, so these genes are not included in the current analysis. Supplementary Table 7 shows the number of GENCODE transcripts in each of these categories.

SUPPLEMENTARY FIGURE LEGENDS

Supplementary Figure 1: HAVANA/Ensembl Merge Process. These schematic diagrams show how manually annotated HAVANA gene models and automatically annotated Ensembl gene models are merged to create the GENCODE gene set.

Where HAVANA and Ensembl transcript models agree for all coding exons and all non-coding exons, the HAVANA transcript model is used for the GENCODE gene set and Ensembl's supporting evidence is transferred to this model. This rule also applies where both the HAVANA and Ensembl transcripts are non-coding. (See "Perfect match" in Figure 1a and "Perfect match" in Figure 1c).

Where HAVANA and Ensembl transcript models agree for all coding exons and all but the outer 5'-start or 3'-end non-coding exon boundaries, the HAVANA transcript model is used for the GENCODE gene set and Ensembl's supporting evidence is transferred to this model. (See "Coding match 1" and "Coding match 2" in Figure 1a).

Where HAVANA and Ensembl transcript models agree for all coding exons but their non-coding regions are structurally different, both the HAVANA and Ensembl models are included in the GENCODE gene set and are linked to one another by an external cross-reference. (See "Coding match 3" in Figure 1a).

Transcript models from either HAVANA or Ensembl that are unique in location or coding exon structure are included in the GENCODE gene set. (See "No match" in Figure 1a, "No match 1" and "No match 2" in Figure 1b, and "No match" in Figure 1c).

When the exon structure of a coding transcript model from HAVANA matches the exon structure of a non-coding model from Ensembl, the supporting features from the Ensembl model are transferred to the HAVANA model and the Ensembl model is removed. A tag is added to the HAVANA transcript to show that it is a model shared by both annotation methods. The coding transcript biotype used by HAVANA is retained in the final model. (See "Perfect match 1" in Figure 1b).

When the exon structure of a coding transcript model from Ensembl matches the exon structure of a non-coding model from HAVANA, the Ensembl model is removed during the merge and its supporting features are transferred to the HAVANA transcript. The non-coding transcript biotype used by HAVANA is retained in the final model. (See "Perfect match 2" in Figure 1b). Note that where the Ensembl model is in CCDS, both the Ensembl and HAVANA genes are included, unmerged, in the final GENCODE gene set.

Where HAVANA and Ensembl transcript models are both non-coding and agree for all exons except the outer 5'-start or 3'-end exon boundaries, the longest transcript model is used for the GENCODE gene set and supporting evidence from both transcripts is included in the final model. (See "Match 1" and "Match 2" in Figure 1c).

Supplementary Figure 2: Assignment of long non-coding RNA (lncRNA) transcripts into loci. The screenshot taken from the ZMap annotation interface shows two coding loci: distal-less homeobox 5 (DLX5) and distal-less homeobox 6 (DLX6), shown with open green boxes representing the CDS and filled red boxes the 5' and 3' UTRs, and the lncRNA locus DLX6 antisense RNA 1 (DLX6AS), represented as filled red boxes. The DLX6AS locus contains 7 alternative splice variants which share either common promoter sequences (suggested by shared transcription start sites (TSS), indicated by red arrowheads) or exonic sequence (indicated by green arrowheads) or both.

Supplementary Figure 3: Pseudogene ontology. The schematic diagram shows the full structure of the ontology developed for pseudogene annotation. Inferred type describes the classification of the pseudogene based on its mode of creation. Evidence feature captures information on features associated with creation (pseudo-polyA tail), disablement and further mutation (pseudo-intron gain). Biological features indicate additional features related to the function, confirmed or potential, of the pseudogene.

Supplementary Figure 4: Number of long non-coding RNA (lncRNA) loci annotated on each human chromosome for GENCODE 7. Chromosomes with predominantly automated annotation for GENCODE 7 are 12q, 14, 15, 16, 17, 18 and 19.
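The merge rules given in the Supplementary Figure 1 legend can be read as a decision procedure over a pair of transcript models. The Python sketch below is only an illustration of that reading, not the merge code itself. The `Model` class, the `merge` function and the returned strings are hypothetical. It assumes that "matching exon structure" means identical exon coordinates, that "all but the outer boundaries" means an identical intron chain in a multi-exon model, and that ties in length go to HAVANA. Single-exon models with differing ends are not treated as matches.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # inclusive genomic (start, end)


@dataclass
class Model:
    """Hypothetical transcript model; exons are sorted by start."""
    exons: List[Interval]
    cds: Optional[Interval] = None   # genomic CDS span, None if non-coding
    in_ccds: bool = False            # only meaningful for Ensembl models


def introns(m: Model) -> List[Interval]:
    return [(a[1] + 1, b[0] - 1) for a, b in zip(m.exons, m.exons[1:])]


def coding_exons(m: Model) -> List[Interval]:
    # Exons clipped to the CDS span; exons outside the CDS are dropped.
    lo, hi = m.cds
    return [(max(s, lo), min(e, hi)) for s, e in m.exons if e >= lo and s <= hi]


def span(m: Model) -> int:
    return m.exons[-1][1] - m.exons[0][0] + 1


def merge(havana: Model, ensembl: Model) -> str:
    same_exons = havana.exons == ensembl.exons
    # Same intron chain: only the outer 5'/3' exon boundaries can differ.
    same_chain = len(havana.exons) > 1 and introns(havana) == introns(ensembl)
    h_cod, e_cod = havana.cds is not None, ensembl.cds is not None

    if h_cod and e_cod:
        if coding_exons(havana) != coding_exons(ensembl):
            return "keep both"                       # "No match"
        if same_exons or same_chain:
            return "keep HAVANA, transfer Ensembl evidence"
        return "keep both, cross-reference"          # "Coding match 3"
    if h_cod != e_cod:
        if not same_exons:
            return "keep both"
        if e_cod and ensembl.in_ccds:
            return "keep both"                       # CCDS exception
        return "keep HAVANA, transfer Ensembl evidence"
    # Both models are non-coding.
    if same_exons:
        return "keep HAVANA, transfer Ensembl evidence"
    if same_chain:
        longest = "HAVANA" if span(havana) >= span(ensembl) else "Ensembl"
        return f"keep longest ({longest}), pool evidence"
    return "keep both"
```

Returning a plain string keeps the sketch short; a real implementation would also need to carry the biotype and the shared-model tag described in the legend.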


Supplementary Figure 5: Overlap among GENCODE, RefSeq and AceView at the transcript and CDS levels. Both protein-coding and lncRNA transcripts of all datasets were compared at the transcript level. Two transcripts were considered to match if all their exon junction coordinates were identical in the case of multiexonic transcripts, or if their transcript coordinates were the same in the case of monoexonic transcripts. Similarly, the CDSs of two protein-coding transcripts were considered to match when the CDS boundaries and the encompassed exon junctions were identical. Numbers in the intersections involving GENCODE are specific to this dataset; otherwise they correspond to any of the other datasets. AceView monoexonic transcripts were excluded from this comparison.
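The matching criteria in the Supplementary Figure 5 legend can be sketched in Python as follows. This is an illustrative reading only. The function names are hypothetical, exons are assumed to be sorted (start, end) pairs on the same chromosome and strand, and a CDS is assumed to be given as its genomic (start, end) span.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # inclusive genomic (start, end)


def junctions(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Exon junctions as (end of exon i, start of exon i+1) pairs."""
    return [(a[1], b[0]) for a, b in zip(exons, exons[1:])]


def transcripts_match(a: List[Exon], b: List[Exon]) -> bool:
    # Monoexonic transcripts must have identical transcript coordinates.
    if len(a) == 1 or len(b) == 1:
        return a == b
    # Multiexonic transcripts must share every exon junction; the outer
    # transcript ends are free to differ.
    return junctions(a) == junctions(b)


def cds_match(a: List[Exon], a_cds: Exon, b: List[Exon], b_cds: Exon) -> bool:
    # The CDS boundaries must be identical ...
    if a_cds != b_cds:
        return False
    lo, hi = a_cds

    def inside(exons: List[Exon]) -> List[Tuple[int, int]]:
        return [j for j in junctions(exons) if lo <= j[0] and j[1] <= hi]

    # ... and so must the junctions that fall within the CDS.
    return inside(a) == inside(b)
```

Under this reading, two transcripts that differ only by an extra 5' UTR exon would fail the transcript-level match but could still match at the CDS level.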


SUPPLEMENTARY FIGURES

Supplementary Figure 1a: The Ensembl-HAVANA transcript merge process: Coding transcripts vs. coding transcripts

Supplementary Figure 1b: The Ensembl-HAVANA transcript merge process: Coding transcripts vs. non-coding transcripts


Supplementary Figure 1c: The Ensembl-HAVANA transcript merge process: Non-coding transcripts vs. non-coding transcripts


Supplementary Figure 2: DLX6AS


Supplementary Figure 3: Pseudogene Ontology

Supplementary Figure 4: The number of LncRNAs annotated for each chromosome in GENCODE 7

[Figure text recovered from the image, in extraction order: Pseudogene; Inferred Type; Feature; Unprocessed; Processed; Duplicated; Unitary; Semi Processed; Simple; Processed; Duplicated; Processed; Evidence Feature; Biological Feature; Pseudo-Intron; Disablement; Pseudo PolyA Tail; Polymorphic; Transcribed; Regulatory; Regulatory Element Lost; Premature Stop Codon; Frameshift. Legend entries: has a; is a; Havana; Proposed.]


Supplementary Figure 5: Overlap among GENCODE, RefSeq and AceView at the transcript and CDS levels


Supplementary Table 3: Long non-coding RNA Biotypes at Gene Level

LncRNA statistics
lincRNA                  5058
processed_transcript      930
antisense                3214
sense_intronic            378

Supplementary Table 4: Pseudogene Biotypes in GENCODE 7

Pseudogenes statistics                    3c       7
Total pseudogenes                       8894   11580
Transcript Biotypes:
processed_pseudogene                    6368    8837
unprocessed_pseudogene                  1277    2151
pseudogene                              2179     422
transcribed_unprocessed_pseudogene       148     309
transcribed_processed_pseudogene          62     171
unitary_pseudogene                       123     144
retrotransposed                          290     215
polymorphic_pseudogene                    33      29
IG_V_pseudogene                            0     151
TR_V_pseudogene                            0      21
IG_C_pseudogene                            0       7
IG_J_pseudogene                            0       3
IG_pseudogene                            161       0
TR_pseudogene                             19       0


Supplementary Table 5: Number of loci in GENCODE 3c to 7

Biotype      Level        3c     3d      4      5      6      7
coding       Level 1       0      0     20     20     20    684
coding       Level 2   14596  14464  14721  15343  16411  16370
coding       Level 3    8324   7210   6344   5603   4482   4007
pseudogenes  Level 1    3164   3164   3170   6735   6728   7183
pseudogenes  Level 2    5010   5010   6273   3207   3791   4041
pseudogenes  Level 3     720    720    598    597    549    356
lncRNA*      Level 1       0      0     82     82     82    477
lncRNA*      Level 2    6440   6440   8082   8580   9216   8730
lncRNA*      Level 3      56   3576   1527   1548   1484    433
sRNA         Level 1       0      0      0      0      0      0
sRNA         Level 2       0      0      0      0      0      0
sRNA         Level 3    9243   9203   8801   8801   8801   8801

Supplementary Table 6: Number of transcripts in GENCODE 3c to 7

Biotype      Level        3c     3d      4      5      6      7
coding       Level 1       0      0     20     20     20    926
coding       Level 2   69544  69412  78358  84442  95141  99166
coding       Level 3   32532  31423  28224  27035  23978  23194
pseudogenes  Level 1    3171   3171   3177   6737   6731   7586
pseudogenes  Level 2    5115   5115   6370   3269   3904   5273
pseudogenes  Level 3    1987   1987   2131   2159   2254    917
lncRNA*      Level 1       0      0     82     82     82    507
lncRNA*      Level 2    9867  10110  13301  14334  15562  14218
lncRNA*      Level 3     365   4022   2007   2001   2016    787
sRNA         Level 1       0      0      0      0      0      0
sRNA         Level 2       0      0      0      0      0      0
sRNA         Level 3    9243   9203   8801   8801   8801   8801

*The lncRNA biotype equates to the processed_transcript biotype in earlier GENCODE releases


Supplementary Table 7: Number of transcripts in GENCODE 7 by category

Category              Transcripts  Status
MHC Region                   1619  excluded
Olfactory receptors           723  excluded
Immunoglobulin                295  excluded
T-cell receptors               82  excluded
Single exon                  8949  not analysed
Multi-exon                 135910  analysed
Total                      147578


Supplementary Table 10a: Total number of transcripts in GENCODE, RefSeq, UCSC and AceView datasets

Dataset   Coding Loci  Long Non-coding Loci  Pseudogenes  Single Exon genes  Coding Transcripts  Total Transcripts
GENCODE         20687                  9640        11580               1724               76311             140066
RefSeq          23191                  4888        11022               3234               32170              38157
UCSC            19984                  6056         NONE               4731               59180              66612
AceView         49366                 23048         NONE              31057              159988             211863

Supplementary Table 10b: Comparison of exact matching transcripts and coding sequences found in GENCODE, RefSeq, UCSC and AceView datasets

Data bin                      GENCODE  AceView  RefSeq    UCSC
*aceview_gencode_refseq_ucsc    19.8%    12.7%   71.5%   40.4%
*aceview_gencode_ucsc           10.5%     6.7%       -   21.4%
aceview_refseq_ucsc                 -     1.0%    5.5%    3.1%
*aceview_gencode_refseq          1.2%     0.7%    4.2%       -
*gencode_refseq_ucsc             1.0%        -    3.8%    2.1%
aceview_ucsc                        -     3.9%       -   12.6%
*gencode_ucsc                    2.1%        -       -    4.2%
*gencode_refseq                  0.4%        -    1.7%       -
*aceview_gencode                30.3%    19.9%       -       -
aceview_refseq                      -     0.4%    2.3%       -
refseq_ucsc                         -        -    3.0%    1.5%
*gencode                        34.7%        -       -       -
aceview                             -    54.7%       -       -
refseq                              -        -    8.0%       -
ucsc                                -        -       -   14.7%


Supplementary Table 11: Access points for GENCODE data

Access point: UCSC genome browser
URL: http://genome.ucsc.edu/cgi-bin/hgGateway
Remarks: Visual representation of genes in genomic context. Official ENCODE access point. Not every release is shown.

Access point: Ensembl genome browser
URL: http://www.ensembl.org/Homo_sapiens
Remarks: Visual representation of genes in genomic context. Additional regions and haplotypes are present.

Access point: Ensembl BioMart
URL: http://www.ensembl.org/biomart/martview/
Remarks: Query interface for integrated data export.

Access point: Ensembl Perl API & database access
URL: http://www.ensembl.org/info/data/intro.html; ensembldb.ensembl.org
Remarks: Full programmatic access to gene set.

Access point: GENCODE FTP site
URL: ftp://ftp.sanger.ac.uk/pub/gencode/
Remarks: All releases. Data are available as GTF files.
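Because the FTP site distributes the annotation as GTF files, a short reader is often the first step in working with the data. The Python sketch below parses the nine tab-separated GTF columns and the `key "value";` attribute field using only the standard library. The function name `parse_gtf` and the sample record are hypothetical, and the attribute keys shown (`gene_id`, `transcript_id`, `level`) are assumed to follow the usual GTF conventions rather than taken from a specific release.

```python
import re
from typing import Dict, Iterable, Iterator

# One attribute: a key followed by a quoted or bare value, ending in ';'.
ATTR = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^;]*?))\s*;')


def parse_gtf(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per GTF feature line, skipping comments and blanks."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GTF record
        attrs: Dict[str, str] = {}
        for key, quoted, bare in ATTR.findall(cols[8]):
            # Keep the first value when a key is repeated.
            attrs.setdefault(key, quoted or bare)
        yield {
            "seqname": cols[0],
            "source": cols[1],
            "feature": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "attributes": attrs,
        }


# Illustrative record with made-up identifiers.
sample = ('chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\t'
          'gene_id "GENE_A"; transcript_id "TX_A"; level 2;')
records = list(parse_gtf(["# header comment", sample]))
```

With a downloaded file, the same generator can be fed an open file handle, for example to count transcripts per annotation level.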