SUPPLEMENTARY INFORMATION FOR

Correspondence to Nature Genetics: Exploring pediatric cancer mutation information using ProteinPaint

Xin Zhou1, Michael Edmonson1, Mark R. Wilkinson1, Aman Patel1, Gang Wu1, Yu Liu1, Yongjin Li1, Zhaojie Zhang1, Michael Rusch1, Matthew Parker1, Jared Becksfort1, James R. Downing1,2, Jinghui Zhang1

1Departments of Computational Biology, 2Department of Pathology, St. Jude Children’s Research Hospital, 262 Danny Thomas Place, Memphis, Tennessee 38105

Correspondence should be addressed to: Jinghui Zhang, PhD, Department of Computational Biology, St. Jude Children’s Research Hospital, 262 Danny Thomas Place, Memphis, TN 38105. Email: [email protected]

doi:10.1038/ng.3466

Table of Contents

1. Supplementary Note
1.1 ProteinPaint aids discovery of novel splicing mutations
1.2 ProteinPaint facilitates identification of novel mechanism for aberrant gene expression
1.3 ProteinPaint design and implementation
1.4 ProteinPaint data source and curation
1.5 Comparison to other mutation viewers

2. Supplementary Figures
Supplementary Figure 1: ProteinPaint reveals aberrant splicing of TP53 caused by recurrent “silent” mutations.
Supplementary Figure 2: The unique mutational profile of USP7 suggests it is a driver gene in pediatric cancer.
Supplementary Figure 3: Driver mutation identified in pediatric leukemia informs findings in other cancers.
Supplementary Figure 4: Loss-of-heterozygosity (LOH) in tumors from pediatric patients with germline TP53 R337H mutation.
Supplementary Figure 5: Outlier expression of FLT3 leads to discovery of a potential driver alteration in BALL255.
Supplementary Figure 6: Comparison of JAK2 view on cBioPortal, COSMIC, and ProteinPaint.
Supplementary Figure 7: Comparison of mutation presentation at KRAS G12 hotspot by cBioPortal, COSMIC, and ProteinPaint.
Supplementary Figure 8: An example of integrating user-provided data on ProteinPaint using the TCGA SKCM (skin cutaneous melanoma) somatic mutation data.

3. Supplementary References

1. Supplementary Note

1.1 ProteinPaint aids discovery of novel splicing mutations

ProteinPaint’s default view marks the exon boundaries for each protein, helping identify mutations located in close proximity to splice junctions. We demonstrate the power of this display by showing how “silent” mutations in the tumor suppressor gene TP53 can cause aberrant splicing through subsequent analysis of RNA-Seq data. This approach not only facilitates functional interpretation of somatic mutations but also clarifies pathogenicity of germline mutations occurring at the same site.

Of the 17,062 mutations compiled for the display of TP53, T125T is a recurrent mutation in both the Pediatric (2 somatic and 1 germline) and COSMIC (n=45) data sets (Fig. 1a, Supplementary Fig. 1a). This mutation, also present in dbSNP (rs55863639), is annotated as “coding silent” by COSMIC and “cds-synon” (coding region synonymous) by dbSNP. Silent mutations are generally considered non-functional passenger events; the recurrence of T125T mutation in both data sets was therefore unexpected. By zooming in, ProteinPaint shows that T125T is located at the -1 donor site of exon 4 (Supplementary Fig. 1b), raising the possibility that this mutation affects splicing. Similarly, E224E is another recurrent silent mutation located at the -1 donor site of exon 6 (Supplementary Fig. 1c). RNA-Seq data is available for HGG034, one of the three pediatric tumors that harbor the T125T mutation. Subsequent review of RNA-Seq alignment (Supplementary Fig. 1d) shows that none of the spliced reads harbor the mutant allele because T125T mutation (hg19 chr17:7579312, C/T, or c.375G>A) eliminated the splice donor site at exon 4, causing retention of intron 4. Translation of the aberrant intron-retained transcript introduces a premature stop codon 44 amino acid residues downstream of T125. This event is likely to cause nonsense-mediated decay of the mutant transcript.
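The coordinate arithmetic behind the T125T annotation can be checked with a short sketch. The function name `cds_to_codon` is hypothetical, and the reference codon ACG is an inference rather than something stated in the text: a G>A change at the third codon position that leaves threonine unchanged implies ACG to ACA.

```python
def cds_to_codon(cds_pos):
    """Return (codon_number, position_in_codon), both 1-based, for a 1-based CDS position."""
    return (cds_pos - 1) // 3 + 1, (cds_pos - 1) % 3 + 1


# The four threonine codons of the standard genetic code.
THR_CODONS = {"ACT", "ACC", "ACA", "ACG"}

codon_number, base_in_codon = cds_to_codon(375)  # c.375G>A

# Assumption: reference codon inferred from "G>A, still Thr", not read from a sequence file.
ref_codon = "ACG"
alt_codon = ref_codon[:base_in_codon - 1] + "A" + ref_codon[base_in_codon:]

print(codon_number, base_in_codon)                    # codon 125, third base
print(ref_codon in THR_CODONS, alt_codon in THR_CODONS)
```

The change is synonymous at the protein level, yet it falls on the last coding base before the intron, which is why exon context rather than the amino acid label is needed to see its effect.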

The table view of ProteinPaint (Supplementary Fig. 1e and 1f) displays mutant allele fraction (MAF) in DNA and RNA samples and shows that the mutant allele of T125T is depleted (p-value 0.002, Fisher’s exact test) in RNA (40/160, 25% of the total reads), compared to that of tumor DNA (33/72, 46% of the total reads). This is likely due to nonsense-mediated decay of the mutant transcript with intron 4 retention caused by T125T mutation. By contrast, in the same tumor sample, the mutant allele of R248W is enriched in RNA-Seq (128/190 (67%) in RNA vs 45/83 (54%) in DNA, p-value 0.04, Supplementary Fig. 1f). The opposing pattern of allelic imbalance of these two mutations in RNA indicates that the tumor is likely to have compound heterozygosity of TP53 mutations resulting in bi-allelic loss of wild-type TP53.
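The allelic-imbalance comparison above is a 2x2 test on read counts (mutant versus non-mutant reads in RNA and DNA). The following is a minimal sketch of a two-sided Fisher's exact test using only the Python standard library; it is not the authors' code, and the function name is hypothetical. The read counts are the ones quoted in the text, with non-mutant reads obtained by subtraction.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalized hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    extreme = sum(weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed)
    return float(Fraction(extreme, comb(row1 + row2, col1)))


# T125T in HGG034: 40 of 160 RNA reads and 33 of 72 DNA reads carry the mutant allele.
p_t125t = fisher_exact_two_sided(40, 160 - 40, 33, 72 - 33)
# R248W in HGG034: 128 of 190 RNA reads and 45 of 83 DNA reads carry the mutant allele.
p_r248w = fisher_exact_two_sided(128, 190 - 128, 45, 83 - 45)

print(f"T125T p = {p_t125t:.4f}, R248W p = {p_r248w:.4f}")
```

Integer arithmetic is used until the final division so that the comparison of table probabilities is exact. Small differences from the reported p-values are possible if the original analysis used a different convention for the two-sided test.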

Analysis of RNA-Seq data generated from tumors in other studies showed that the T125T mutation always co-occurs with intron 4 retention. The samples evaluated include TCGA-28-5215-01 (glioblastoma), TCGA-EE-A2MG-06 (melanoma), and ACT007, an adrenocortical tumor (ACT) case with a germline T125T mutation. T125T was previously reported as a germline mutation in patients with Li-Fraumeni syndrome, and retention of TP53 intron 4 was detected by RT-PCR in fibroblast or lymphoblastoid cell lines derived from patients14. Currently, germline T125T is considered pathogenic by ClinVar on the basis of a single submission reporting its association with Li-Fraumeni syndrome (http://www.ncbi.nlm.nih.gov/clinvar/variation/177825/). It has the lowest review status, which indicates either conflicting interpretations (in which case the independent values are enumerated) or that no submitter provided an interpretation. Our analysis of RNA-Seq splicing across multiple tumor types, coupled with the previously published RT-PCR result, shows that T125T is a loss-of-function mutation rather than a silent mutation, confirming the pathogenicity assessed from genetic studies. Based on these observations, ProteinPaint now displays this “silent” mutation as a “splice site” variant with a “T125T” label for cross-referencing with published literature.

Aside from T125T, two additional TP53 mutations, E224E and E224D, are reclassified as splice site mutations. Both mutations are located at the -1 position of the donor splice junction of exon 6 at chr17:7578177 (Supplementary Fig. 1c) and were present in multiple tumors in the COSMIC data set: E224E was found in 8 tumors, and E224D in 15 tumors. RNA-Seq data were available for TCGA-DU-7298-01, a low-grade glioma harboring E224E, and TCGA-64-1678-01, a lung adenocarcinoma harboring E224D. Both mutations cause aberrant splicing leading to retention of intron 6, similar to what was found for T125T. E224E is also a germline variant (rs267605076) currently classified as a variant of unknown significance (VUS) by ClinVar. Like T125T, it was reported by a single submitter, who detected the mutation in hereditary cancer-predisposing syndrome. Our analysis suggests that its current VUS classification may need to be upgraded to pathogenic or likely pathogenic, given its effect on splicing.

1.2 ProteinPaint facilitates identification of novel mechanism for aberrant gene expression

ProteinPaint hosts RNA-Seq data for close to 1,000 pediatric cancer samples. Gene-level expression, presented as FPKM (fragments per kilobase of exon per million fragments mapped), can be highlighted automatically in the expression panel on the right for samples harboring a mutation or a gene fusion selected by a user (Fig. 1b). The expression data can be viewed directly for exploring the global expression profile of the entire cohort, comparison of profiles across multiple cancer subtypes, evaluation of allelic imbalance of somatic mutations in expression, and stratification of cancer subtypes for a selected expression range.

We use FLT3 as an example to illustrate how the expression panel enables the detection of an aberrant expression signature, which led to an investigation of its underlying molecular driver. FLT3 encodes a receptor tyrosine kinase involved in many cellular signaling pathways and is important for differentiation of hematopoietic stem cells. FLT3 mutations are known to cause constitutive kinase activation and to promote cell proliferation and resistance to apoptosis in leukemia. Sequence mutations including the internal tandem duplication are known to be one of the molecular drivers of Ph-like BALL (B-cell acute lymphoblastic leukemia), a subtype of leukemia that lacks the BCR-ABL1 fusion but has a gene-expression profile resembling the kinase activating signature found in BCR-ABL1 positive leukemia. Our previous study reported that 91% of the Ph-like BALL had kinase activating mutations10. However, BALL255 was one of the remaining 9% of Ph-like BALL samples whose causal kinase mutation was not identified by the initial study. The expression view shows that BALL accounts for the majority of the top 5% of samples with the highest FLT3 expression, with BALL255 being the top sample (Supplementary Fig. 5a). An in-frame PAN3-ZCCHC7 fusion (Supplementary Fig. 5b) had been detected in this tumor; however, FLT3 was not considered the target because the fusion disrupted its neighboring gene, PAN3. The aberrant expression of FLT3 prompted a detailed re-examination of the genomic lesions identified in this tumor, revealing that the PAN3-ZCCHC7 fusion was part of a complex rearrangement involving chromosomes 7, 9 and 13. This event was subsequently amplified, resulting in an 8-10 fold copy number gain (Supplementary Fig. 5c) that spanned the entirety of the FLT3 gene. Connection of genomic breakpoints suggests that the initial rearrangement is likely to have formed an episome composed of five genomic segments including FLT3. Replication of the episome is likely to cause the aberrantly high expression of FLT3 in this tumor (Supplementary Fig. 5d). This analysis has unveiled a new mechanism for FLT3 activation, suggesting that FLT3 may be a therapeutic target when its overexpression can be attributed to a complex rearrangement event.

1.3 ProteinPaint design and implementation

ProteinPaint focuses on visualizing somatic and germline sequence mutations as well as gene fusions affecting protein-coding genes. To represent proteins, ProteinPaint retrieves the reference genomic sequence by the genomic coordinates of a RefSeq isoform of this gene, builds the coding region sequence (CDS) according to the isoform structure, and translates the CDS to protein. A preferred isoform is selected for each gene based on isoform annotation in RefSeq and references in the literature.
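The CDS-building and translation step described above can be illustrated with a minimal sketch. This is not ProteinPaint's code (which is written in JavaScript); the function names, the toy genome strings, and the exon coordinates are all hypothetical. Only the standard genetic code table is taken as given.

```python
# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def build_cds(genome, cds_exons, strand):
    """Concatenate coding exons (1-based, inclusive coordinates) into a CDS string."""
    seq = "".join(genome[start - 1:end] for start, end in sorted(cds_exons))
    if strand == "-":
        # Genes on the reverse strand are read as the reverse complement.
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


def translate(cds):
    """Translate codon by codon; '*' marks a stop codon and 'X' an unrecognized codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


# Toy example: two coding exons separated by a 4-bp "intron" of N bases.
exons = [(3, 8), (13, 18)]
forward_genome = "NNATGGAGNNNNCCGTAANN"
reverse_genome = "NNTTACGGNNNNCTCCATNN"  # same gene placed on the reverse strand

print(translate(build_cds(forward_genome, exons, "+")))
print(translate(build_cds(reverse_genome, exons, "-")))
```

Both calls produce the same protein, which is the property that lets a viewer treat forward- and reverse-strand genes (such as TP53) uniformly once the CDS has been built.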

ProteinPaint maps mutations by translating their genomic coordinates to protein coordinates. For single nucleotide variations (SNVs) and deletions, the exact chromosome positions are used; for small insertions, the chromosomal coordinates before the insertion are used. The chromosomal coordinates are then converted to protein coordinates based on the following rules:

1. Mutations in protein-coding regions are mapped to base-pair positions in the CDS (Supplementary Fig. 1b).

2. Intronic mutations located within 10-bp of splice junctions are mapped to exon junctions (Supplementary Tutorial, section 2-7).

3. Mutations affecting the splicing of 5’ and 3’ UTRs are mapped to the start and the end of the protein (Supplementary Tutorial, section 2-6).
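Rule 1 above can be sketched as a lookup over the coding exons of the selected isoform. The sketch below is an illustration under stated assumptions, not the ProteinPaint implementation: the function names and the toy exon coordinates are hypothetical, coordinates are 1-based and inclusive, and rules 2 and 3 (intronic and UTR splice variants) are not covered.

```python
def genomic_to_cds(pos, cds_exons, strand):
    """Map a genomic position to a 1-based CDS position, or None if it is not coding."""
    # Walk the exons in transcription order: ascending for '+', descending for '-'.
    ordered = sorted(cds_exons, reverse=(strand == "-"))
    consumed = 0  # coding bases contributed by exons already passed
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start if strand == "+" else end - pos
            return consumed + within + 1
        consumed += end - start + 1
    return None


def cds_to_protein(cds_pos):
    """Map a 1-based CDS position to a 1-based amino acid position."""
    return (cds_pos - 1) // 3 + 1


# Hypothetical isoform with two coding exons (15 coding bases, 5 codons).
toy_exons = [(100, 105), (200, 208)]

for strand in "+-":
    for pos in (100, 105, 150, 200, 208):
        cds_pos = genomic_to_cds(pos, toy_exons, strand)
        residue = cds_to_protein(cds_pos) if cds_pos else None
        print(strand, pos, cds_pos, residue)
```

Position 150 falls in the intron and returns `None`; under the rules listed above, such a variant would be shown only if it lies close enough to a splice junction.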

ProteinPaint offers two scales for viewing mutations: a full view for the entire protein, positioning mutations by amino acid coordinates by default; and a zoomed-in view, which displays mutations by their genomic coordinates. In the zoomed-in view, mutations are positioned either to nucleotides within their corresponding codons, or between nucleotides for intronic variants. This design allows ProteinPaint to present precise site-specific positions for variants which may be ambiguous at the amino acid level.

Mutations at the same position (amino acid position for the full view or nucleotide coordinates for the zoomed view) are displayed in a “skewer graph” composed of one or multiple stacked discs, each representing a distinct mutation, and a “stem” linking the discs to the amino acid residue or nucleotide, resembling a skewer. Discs are filled by colors representing mutation classes, and are sized by the number of affected samples. Multiple mutant alleles, including SNVs and indels, affecting the same site are shown as stacked discs, thereby differentiating alternative amino acid changes and their respective frequencies (e.g. R248Q and R248W). The fractions of germline and relapse mutations are highlighted by arcs framing a disc. A solid arc for germline mutations is shown first, followed by a hollow arc for mutations found only in relapse tumors. The proportion of each disc’s circumference not covered by an arc represents somatic mutations detected in a tumor at diagnosis.

ProteinPaint shows “skewer graphs” in either expanded or folded mode. When expanded, discs appear as a vertical stack with the larger discs positioned closer to the protein. Labels appear to the right of discs showing their amino acid changes. Discs containing germline/relapse mutations show arcs, and non-singleton discs are labeled by the numbers of affected samples. In folded mode, the labels, arcs, and numbers are hidden, and discs collapse to become concentric (Supplementary Tutorial, section 2-2). As a group, collapsed discs are slightly raised away from the protein by an offset proportional to the number of mutations. A user can click on folded discs to expand them, or click on the label of an expanded disc to fold it, thus allowing the display to be customized to highlight mutations of interest. For fusion genes, ProteinPaint employs a modified version of the “skewer graph”, with stems representing the breakpoint positions on the protein, and discs representing the partner protein of the fusion. The discs are half-filled to indicate which portion of the protein contributed to the fusion protein, while the label of the disc displays the name of the fusion partner. Similar to the “skewer graph” used for SNV/indel mutations, the fusion gene graph can be dynamically expanded or folded.
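The grouping behind a “skewer graph” can be sketched as a small data structure: mutations sharing a position form one skewer, each distinct change is a disc, and the arcs are fractions of the disc's sample count. The class and function names below are hypothetical and the sample counts are invented for illustration; rendering (disc size, color, SVG layout) is deliberately left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Disc:
    label: str         # amino acid change, e.g. "R248Q"
    samples: int       # number of affected samples (drives disc size)
    germline: int = 0  # samples in which the mutation is germline
    relapse: int = 0   # samples in which it was found only at relapse

    def arc_fractions(self):
        """Fractions of the circumference: solid arc, hollow arc, uncovered remainder."""
        germline = self.germline / self.samples
        relapse = self.relapse / self.samples
        return {"germline": germline, "relapse": relapse,
                "diagnosis": 1 - germline - relapse}


def build_skewers(mutations):
    """Group (position, Disc) pairs by position; the largest disc comes first,
    i.e. closest to the protein in expanded mode."""
    grouped = defaultdict(list)
    for position, disc in mutations:
        grouped[position].append(disc)
    return {
        position: sorted(discs, key=lambda d: d.samples, reverse=True)
        for position, discs in sorted(grouped.items())
    }


# Invented counts, for illustration only.
skewers = build_skewers([
    (248, Disc("R248Q", samples=8, germline=2, relapse=2)),
    (248, Disc("R248W", samples=12)),
    (125, Disc("T125T", samples=4, germline=1)),
])

for position, discs in skewers.items():
    print(position, [(d.label, d.samples, d.arc_fractions()) for d in discs])
```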

ProteinPaint supports a synchronized view of gene expression when browsing mutations in the Pediatric data set. When a user loads the Pediatric data set for a protein, a panel appears on the right showing the corresponding gene’s expression in available pediatric cancer samples. This expression data can also be explored independently with convenient functions to analyze gene expression distribution for the entire cohort or a cancer subtype (Fig. 1c, Supplementary Fig. 5a). ProteinPaint can import data from mutation annotation format (MAF) files, allowing users to explore their own data sets (Supplementary Fig. 8, Supplementary Tutorial). During the entire process (file importing, parsing, and visualization), all mutations are handled by programs running within the user’s web browser, and no mutation data is transmitted to the server. ProteinPaint is therefore safe to use for browsing private or unpublished mutation data. ProteinPaint was written in JavaScript. Its server-side component runs in the Node.js environment (https://nodejs.org), retrieves genomic sequence using Samtools15, and queries a PostgreSQL database for gene annotation, mutation, SNP, and expression data. Its data visualization is performed as animated SVG graphs using the D3.js package (http://d3js.org).

1.4 ProteinPaint data source and curation

From the UCSC Genome Browser16 we downloaded the following reference data sets to support mapping of the genomic coordinates to protein coordinates: human reference genome (GRCh37/hg19), RefSeq alignment (refSeqAli table), and dbSNP142 (snp142 table). The protein domain annotation was primarily based on NCBI’s CDD database17. To facilitate interpreting mutation pathogenicity, a subset of proteins frequently mutated in pediatric cancer was manually curated using protein domain annotations from literature. Manual curation of protein domains will be an ongoing effort and we are actively seeking contributions from the broader research community.

The pediatric cancer data set consists of genomic data from the St. Jude/Washington University Pediatric Cancer Genome Project1 (PCGP), the National Cancer Institute’s Therapeutically Applicable Research to Generate Effective Treatments (TARGET) project18,19,20, two Wilms’ tumor studies by the German Cancer Research Center21 and the University of Texas Southwestern Medical Center22, and a leukemia relapse study by Shanghai Children’s Medical Center23. All cataloged mutations have undergone experimental verification by orthogonal sequencing, and those for which a validation assay could not be designed have been manually reviewed to ensure quality. Only mutations in protein coding regions are incorporated into ProteinPaint. Silent mutations are included for the following reasons: (a) the ratio of missense to silent mutations is useful for assessing selective pressure on non-silent mutations, and (b) some variations annotated as silent may affect splicing (e.g. TP53 T125T). dbSNP 142 was used to provide a link between somatic mutations and germline mutations deposited in dbSNP (Supplementary Figure 1), many of which can be of medical importance for assessing germline cancer susceptibility.

We downloaded version 72 of sequence mutation data from the Sanger Institute’s COSMIC database (http://cancer.sanger.ac.uk/cosmic). We included only mutations which were experimentally verified, selected by the tag “Confirmed somatic variant” or “Reported in another cancer sample as somatic”. This reduced the total number of mutations from 3,390,811 to 1,600,460. In this data set, 14,060 tumors were screened by genome-wide sequencing based on the tag “Genome-wide screen”. The resulting mutations were re-annotated using Annovar7, with in-house modifications to report more detailed protein changes for indels, promote silent and missense variants at exon junctions to splice variants, and add a new class for “splice_region”. Mutations from hypermutable samples were excluded and duplicate entries were consolidated. Mutations not annotated with the mutant allele sequence were not included, as they lack the minimum sequence context required for re-annotation by Annovar (e.g. FLT3 ITD mutations marked as “c.?” in COSMIC).
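The selection steps described above (keep verified somatic calls, drop entries without a resolved mutant allele, consolidate duplicates) can be sketched as a simple filter. The two status strings are quoted from the text; everything else is an assumption made for illustration, including the record field names (`status`, `sample`, `gene`, `cds_change`), the duplicate key, and the rule that any `?` in the CDS change marks an unresolved allele. The hypermutable-sample filter is omitted because its criteria are not given.

```python
KEPT_STATUSES = {
    "Confirmed somatic variant",
    "Reported in another cancer sample as somatic",
}


def curate(records):
    """Return the verified, resolvable, de-duplicated subset of mutation records."""
    seen = set()
    kept = []
    for record in records:
        if record["status"] not in KEPT_STATUSES:
            continue  # not experimentally verified as somatic
        if "?" in record["cds_change"]:
            continue  # e.g. "c.?": no mutant allele sequence to re-annotate
        key = (record["sample"], record["gene"], record["cds_change"])
        if key in seen:
            continue  # duplicate entry for the same sample and change
        seen.add(key)
        kept.append(record)
    return kept


# Invented records, for illustration only.
example = [
    {"sample": "S1", "gene": "TP53", "cds_change": "c.375G>A", "status": "Confirmed somatic variant"},
    {"sample": "S1", "gene": "TP53", "cds_change": "c.375G>A", "status": "Confirmed somatic variant"},
    {"sample": "S2", "gene": "FLT3", "cds_change": "c.?", "status": "Confirmed somatic variant"},
    {"sample": "S3", "gene": "KRAS", "cds_change": "c.35G>A", "status": "Variant of unknown origin"},
    {"sample": "S4", "gene": "KRAS", "cds_change": "c.35G>A", "status": "Reported in another cancer sample as somatic"},
]

curated = curate(example)
print([(r["sample"], r["gene"], r["cds_change"]) for r in curated])
```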

The gene expression data were based on RNA-Seq data of 928 tumor samples analyzed from the PCGP. Quantile-normalized FPKM values (fragments per kilobase of exon per million fragments mapped) were used to present gene-level expression for this cohort. Fusion genes were predicted for tumor samples based on their RNA-Seq data, using the software CICERO (Li, Y. and Zhang, J. in preparation). However, this was only available for the PCGP samples and the differences in fusion gene results between pediatric and adult cancer may therefore be attributable to analysis differences.
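For reference, the FPKM unit expands directly into a formula: fragments mapped to the gene, divided by exon length in kilobases and by total mapped fragments in millions. The sketch below shows only this unnormalized calculation with invented numbers; the quantile normalization applied to the cohort is not reproduced, and the function name is hypothetical.

```python
def fpkm(gene_fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped."""
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return gene_fragments / (kilobases * millions)


# Invented example: 1,000 fragments on a gene with 2 kb of exon, 50 million mapped fragments.
print(fpkm(1_000, 2_000, 50_000_000))
```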

We provide loss-of-heterozygosity (LOH) information for the cancer genomes of PCGP pediatric patients analyzed by whole-genome sequencing. These patients all have whole-genome sequencing results for both diagnosis and germline DNA. We used CONSERTING8 to discover LOH events from these tumor genomes. For autosomal regions with sufficient sequencing coverage on dbSNP markers, a “segment mean” value is calculated as a quantitative measure, and the region is flagged as LOH positive if the value is above 0.1 (Supplementary Fig. 4).
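The flagging rule reduces to a threshold check. The sketch below assumes the score is the absolute difference between the mutant allele fraction in tumor and germline DNA, as the legend of Supplementary Fig. 4 describes it; it does not reproduce how CONSERTING derives the segment mean, and the function names and input values are hypothetical.

```python
LOH_THRESHOLD = 0.1  # cutoff stated in the text


def loh_score(tumor_maf, germline_maf):
    """Difference between mutant allele fraction in tumor and germline."""
    return abs(tumor_maf - germline_maf)


def is_loh(tumor_maf, germline_maf, threshold=LOH_THRESHOLD):
    """Flag a region as LOH positive when the score exceeds the threshold."""
    return loh_score(tumor_maf, germline_maf) > threshold


# Invented allele fractions for a heterozygous germline variant (about 0.5 in germline).
print(is_loh(0.90, 0.50))  # strong shift toward the mutant allele in tumor
print(is_loh(0.55, 0.50))  # small shift, below the cutoff
```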

1.5 Comparison to other mutation viewers

ProteinPaint is a web-based tool for visualizing protein-coding mutations, including predisposing germline mutations within normal cells as well as acquired somatic lesions within cancer cells. For somatic mutations, a few existing tools provide similar functions, the best known being cBioPortal4 and COSMIC5. Both portals have a component to display mutation frequency as histograms along the protein sequence. We use the JAK2 and KRAS mutation data as two examples to illustrate how the “skewer graphs” of ProteinPaint enable display of mutation profiles that are not well represented by cBioPortal and COSMIC. This comparison includes only somatic sequence mutations deposited in COSMIC.

JAK2 is a key regulator of hematopoietic stem cell differentiation, and somatic JAK2 mutations are found at high frequency in many blood disorders and cancers. Mutations in the JAK2 pseudokinase domain are known to suppress the inhibitory regulation of this domain, resulting in activation of the accompanying kinase domain. V617F is a well-known missense mutation in the JAK2 pseudokinase domain that has been detected in nearly all cases of polycythemia vera. It has been reported in the literature at such high frequency that the COSMIC database (version 72) recorded 35078 affected samples, the highest of the entire data set. JAK2 is therefore a good test case for evaluating mutation visualization across a wide range of mutation frequencies on the same protein. Although the wealth of JAK2 mutation data is exceptional in the current data set, we anticipate that the high cumulative incidence for targetable lesions like JAK2 V617F will become the norm as genome-wide sequencing becomes a routine assay in the clinical setting.

Graphical display of JAK2 COSMIC mutations in the three portals is shown in Supplementary Fig. 6. cBioPortal adopts a “matchsticks” style, with a “stick” (i.e., a histogram) positioned at each amino acid coordinate, its height drawn in proportion to the number of affected samples. Each “stick” has a dot on the tip colored by its mutation class. The histogram shows that 5 samples harbor the V617F mutation based on the data from the 92 studies that cBioPortal hosts. Presence of the mutation in 34501 samples in the COSMIC data is accessible only via the accompanying table (Supplementary Fig. 6a). Furthermore, the protein domain labeling does not distinguish the pseudokinase domain from the kinase domain, which may compromise the interpretation of this mutation. The COSMIC mutation view presents a relatively “quiet” JAK2 protein with the exception of the V617 hotspot, as the high occurrence of V617F dwarfs all other mutations in the gene. There is no label for amino acid changes unless a user zooms into a small region. By contrast, the design of the “skewer” graph in ProteinPaint not only shows the high occurrence of the V617F mutation, depicted by disc size and text label, but also reveals additional hotspot mutations at R683 and N542 that impair the function of the pseudokinase domain. The “skewer” graphs representing distinct mutation sites are positioned at the same height, allowing ProteinPaint to display non-V617F hotspot mutations alongside the exceptionally abundant V617F mutation.

Presentation of KRAS mutations poses another challenge, as there are multiple allelic variations affecting the same amino acid residue G12 with varying frequencies (Supplementary Fig. 7). In cBioPortal, all variants affecting G12 are merged into a single “matchstick” with a single long label concatenating all amino acid changes, thus losing the relative frequency of each variant in this presentation. In COSMIC, the diversity of mutant alleles at G12 is invisible unless a user zooms into a small region. By comparison, the “skewer” graph in ProteinPaint uses one disc per mutant allele to depict the amino acid change and the associated frequency, while the shared “stem” retains the site-specific information. As a result, multiple mutant alleles at G12 are displayed by default on the full protein view of KRAS.

The JAK2 and KRAS examples illustrate that the novel design of ProteinPaint’s “skewer” graph allows it to scale well for a large amount of data, while simultaneously maintaining legibility for other mutations in the same gene which may have a wide frequency range, for example the JAK2 mutations in COSMIC (Supplementary Fig. 6c). The “skewer” graph supports different types of genetic lesions including SNVs, indels, and fusion genes, and collectively presents them in the same context. Additional novel features include display of coding mutations at nucleotide precision, co-display of multiple mutation data sets stratified by cancer subtypes, and a synchronized side-by-side display of expression and mutation data. These features, along with its intuitive navigation and cross-referencing of highly-curated cancer mutation databases, make ProteinPaint a highly valuable resource for interpretation of germline and somatic mutations in both research and clinical settings.

ProteinPaint focuses on presenting mutation profiles at a single-gene level. By contrast, the OncoPrint view implemented in cBioPortal provides a summary view of mutations across multiple genes, which is useful for assessing the mutual exclusivity or collaboration of mutations across multiple genes. ProteinPaint’s focus on pediatric cancer and presentation of gene-level mutations serves as a complement to existing cancer genome data portals, each of which presents mutation data from a distinct perspective.

2. Supplementary Figures

Supplementary Figure 1: ProteinPaint reveals aberrant splicing of TP53 caused by recurrent “silent” mutations.

a. A full view of TP53 which shows 17,062 mutations in Pediatric (top) and COSMIC (bottom) data sets. Each distinct mutation is represented by a disc sized in proportion to the number of affected samples and filled with the color representing its class based on the legend “Mutation classes”. An arc framing a disc represents either the fraction of germline samples (solid arc) or relapsed tumors (hollow arc), while the portion without an arc represents the fraction of tumors obtained at diagnosis. Mutations are positioned by their amino acid coordinates, and those that affect the same amino acid residue are presented by a “skewer graph” with discs stacked on the same “stem” (T125, E224 and R248 sites). The corresponding amino acid changes and total mutation counts are labeled to the right and at the center of the discs, respectively. The dotted vertical lines inside the protein delineate the boundaries of coding exons.

b. A zoomed-in view centered at T125 of the TP53 protein which switches the mutation positions from amino acid coordinates to codon-based nucleotide coordinates. The “silent” T125T mutation affects the last nucleotide of exon 4 and was detected in 3 and 45 samples in Pediatric and COSMIC, respectively. The solid arc in the Pediatric data set indicates that T125T was detected in the germline DNA of a pediatric patient.

c. A zoomed-in view centered at E224 showing two recurrent mutations in COSMIC, E224D (n=15) and E224E (n=8), both affecting the last nucleotide of exon 6. The two mutations were classified in COSMIC as “missense” and “silent”, respectively.

d. RNA-Seq reads aligned to the donor splice site of exon 4 of TP53 in HGG034, a high-grade glioma harboring the T125T mutation. TP53 is transcribed in the reverse orientation and the T125 codon is marked by a blue box. The mutant allele is absent in spliced RNA while the reads harboring the mutant allele, indicated by the mismatching nucleotide “T” shown in red, invariably retain intron 4. This intron retention results in a premature stop codon 40aa downstream which might cause nonsense mediated decay of the mutant transcript.

e. Table view of samples with T125T mutation in the Pediatric data set which includes the germline of a patient with adrenocortical tumor (ACT) and diagnostic tumors of two high-grade glioma (HGG022 and HGG034). Mutant allele fractions (MAF) in DNA and RNA are shown in colored bars. Mouse-over on tumor HGG034 shows that the MAF in RNA and DNA is 25% and 46%, respectively.

f. Table view of the R248W mutation which was detected in 7 samples of the Pediatric data set including HGG034. Mouse-over shows that the MAF in RNA and DNA of HGG034 is 67% and 54%, respectively.

Supplementary Figure 2: The unique mutational profile of USP7 suggests it is a driver gene in pediatric cancer.

a. ProteinPaint view of USP7 with mutations from Pediatric (above protein) and COSMIC (below protein) data sets. In COSMIC, mutations are mostly missense (blue) or silent (green), and are scattered across the entire protein with no obvious hotspots. By contrast, mutations in the Pediatric data set are enriched in truncation mutations (e.g. frameshift and splice site) and clustered in the catalytic domain.

b. Summary of USP7 mutation count in descending order, stratified by cancer subtype in the Pediatric data set and by tumor tissue in the COSMIC data set. In Pediatric, the frameshift mutations (red) are exclusively in T-cell acute lymphoblastic leukemia (TALL). While silent mutations are absent in the Pediatric data set, there is a 1:2 ratio of silent (green) to missense (blue) mutations in the majority of tissues in COSMIC. The lack of selection for non-silent mutations in COSMIC suggests that USP7 mutations are unlikely to be driver events in adult cancer.

Supplementary Figure 3: Driver mutation identified in pediatric leukemia informs findings in other cancers.

The E1099K mutation in the SET domain of WHSC1 is a hotspot in pediatric B-cell acute lymphoblastic leukemia, and has been shown to increase methyltransferase activity resulting in increased H3K36me2 level11. In the table view of E1099K mutation of the pediatric data set (top), tumors obtained at diagnosis and at relapse of the same patient, PARFTR, are highlighted in yellow and green, respectively. The mutant allele fraction (MAF) of E1099K increased from 1% at diagnosis to 53% at relapse, indicating a minor subclone harboring the E1099K mutation at diagnosis persisted to relapse. In COSMIC, 10 samples harbor WHSC1 E1099K mutations including two solid tumors (lung cancer and Wilms tumor). PubMed links are provided in the table view, which in this case shows that this mutation’s significance was not recognized in the original publications.

Supplementary Figure 4: Loss-of-heterozygosity (LOH) in tumors from pediatric patients with germline TP53 R337H mutation.

From the Pediatric data set, the TP53 R337H mutation occurs exclusively in the germline DNA of adrenocortical tumor (ACT) patients. Loss-of-heterozygosity (LOH) analysis shows that 11 out of the 14 samples show LOH over the R337H position, as indicated in the mutation table’s “Somatic LOH” column (marked by solid black box). The samples are labeled “Yes” if the LOH score (difference between mutant allele fraction in tumor and germline, range 0 to 0.5) is above 0.1. All samples with LOH at the R337H mutation site show high mutant allele frequency in tumor DNA (red bars, greater than 50%). Only one sample (SJACT067) has low tumor purity (~30%) and therefore has low LOH score (0.13) compared to the others.

Supplementary Figure 5: Outlier expression of FLT3 leads to discovery of a potential driver alteration in BALL255, a BCR-ABL1 fusion-negative leukemia with a gene-expression profile resembling the kinase-activating signature found in BCR-ABL1-positive leukemia. No activating kinase mutation or fusion was previously identified in this tumor (ref. 10).

a. FLT3 expression in >900 pediatric tumors analyzed by RNA-Seq, shown in the expression panel of ProteinPaint. The horizontal axis represents the expression level as the RNA-Seq FPKM value. Each circle represents one tumor sample, with its vertical position set by descending order of FPKM values. A tooltip box at the top right shows that BALL255 has the highest FLT3 expression. Selecting samples with FPKM >200 (indicated by the orange shade) brings up a sunburst chart indicating that leukemia (42 BALL and 9 MLL) has the highest FLT3 expression in this cohort.
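The ordering and selection described in panel a can be sketched as follows. This is a minimal illustration, not ProteinPaint code. The sample names, subtype labels and FPKM values are hypothetical, and the 200 cutoff simply mirrors the selection in the legend.

```python
from collections import Counter

# Hypothetical (sample, subtype, FPKM) records, for illustration only.
samples = [
    ("sample_A", "BALL", 512.3),
    ("sample_B", "ACT", 8.1),
    ("sample_C", "MLL", 240.0),
    ("sample_D", "BALL", 75.6),
    ("sample_E", "BALL", 305.9),
]


def rank_by_fpkm(records):
    """Order samples by descending FPKM, as on the vertical axis of the panel."""
    return sorted(records, key=lambda record: record[2], reverse=True)


def subtype_counts_above(records, cutoff=200.0):
    """Tally subtypes among samples whose FPKM exceeds the cutoff."""
    return Counter(subtype for _, subtype, fpkm in records if fpkm > cutoff)


ranked = rank_by_fpkm(samples)
selected = subtype_counts_above(samples)
print(ranked[0][0])    # sample with the highest expression
print(dict(selected))  # subtype composition of the selected samples
```

The per-subtype tally is the kind of summary a sunburst chart would display for the selected samples.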

b. Based on whole-genome sequencing (WGS) data of the BALL255 tumor genome, CREST (ref. 12) identified a translocation between chr9 (segment in blue) and chr13 (segment in green), resulting in an expressed PAN3-ZCCHC7 fusion gene. In the green segment, FLT3 is located approximately 100 kb upstream of the breakpoint neighboring PAN3. The WGS coverage track indicates that both segments, from chr9 and chr13, have a copy number gain of approximately 8-10, raising the possibility that FLT3 is the target gene of the rearrangement followed by amplification.

c. CIRCOS plot of BALL255 indicates the chr9-chr13 translocation is part of a complex rearrangement involving chromosomes 7, 9 and 13.

d. Reconstruction of an episome based on the breakpoints identified in BALL255 DNA, with the FLT3 location shown. Replication of this episome causes high-level amplification of FLT3, resulting in its elevated expression in this tumor.

Supplementary Figure 6: Comparison of JAK2 view on cBioPortal, COSMIC, and ProteinPaint.

a. cBioPortal (ref. 4) viewer showing JAK2 mutations from all available cancers, highlighting only V617F as a mutation hotspot. The table accompanying the graphic view shows that the mutation is present in 34501 samples in COSMIC. The COSMIC column is hidden by default but can be made visible by clicking the "Show/hide columns" button. The V617F mutation is located in the pseudokinase domain, a negative regulator of the tyrosine kinase activity, which is labeled in this view as the kinase domain. This may compromise the interpretation of the V617F mutation, which is known to abrogate the pseudokinase activity (ref. 13).

b. COSMIC (ref. 5) view showing substitutions in JAK2; insertion/deletion mutations, which appear in separate histograms, are omitted from this screenshot. The only visible mutation hotspot is V617F, and the amino acid change label is not visible unless the user zooms in to a small region.

c. ProteinPaint presents the JAK2 mutations using “skewer” graphs that can display mutations spanning a wide dynamic range of abundance. In this example, the V617F mutation affecting 35078 samples is shown in the same view as a singleton R683>TGR insertion. Expanding the “skewer” graphs of the most recurrently mutated sites not only characterizes the V617F hotspot but also reveals two additional hotspots, one of which is an in-frame deletion in a linker between the pseudokinase and SH2 domains (indicated by the arrow).

Supplementary Figure 7: Comparison of mutation presentation at KRAS G12 hotspot by cBioPortal, COSMIC, and ProteinPaint.

a. cBioPortal uses concatenated text to label the various amino acid changes at the G12 site.

b. COSMIC uses a stack of bars to indicate various amino acid changes, which are not labeled unless the view is zoomed in to a small region.

c. ProteinPaint’s “skewer” graph presents each G12 mutation as a disc labeled with its amino acid change. The disc size and the number within indicate the number of affected samples. Additional hotspots such as G13, Q61 and A146 are presented in the same fashion.

Supplementary Figure 8: An example of integrating user-provided data on ProteinPaint using the TCGA SKCM (skin cutaneous melanoma) somatic mutation data.

a. The data set contains somatic mutations discovered in adult skin cancer, and was downloaded as a MAF-format file from the TCGA Data Portal, as described in the Supplementary Tutorial. After importing into ProteinPaint, a summary view of 180,722 somatic mutations in 17,018 genes detected in 253 tumors is shown. The table on the left shows the top 32 genes, ranked by the number of samples with non-silent mutations in each gene, as an indication of recurrence in skin cancer. Additional columns in the table indicate the number of mutations in each class (missense, nonsense, etc.). Most of the listed genes show close to a 2-to-1 ratio of missense (blue) to silent (green) mutations, while BRAF and NRAS stand out with a high missense-to-silent ratio. Clicking on a gene name displays the mutation profile for that gene as a skewer graph. The right panel shows the mutation profiles for TTN, BRAF, and NRAS. The mutation profiles of driver genes such as BRAF and NRAS have mutational hotspots as well as a very high ratio of missense to silent mutations. They are distinct from the passenger mutations in TTN, which are scattered across the entire gene and have a low ratio of missense to silent mutations. TTN mutations are distributed across 4 RefSeq isoforms; the one with the most mutations (NM_133378) is shown. The remaining three (NM_133379 with 157 mutations, NM_133437 with 81 mutations, and NM_133432 with 7 mutations) can be displayed by selecting the RefSeq accession number in ProteinPaint.
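The gene ranking and the missense-to-silent ratio described in panel a can be sketched from a MAF-style table as follows. This is a rough illustration, not the ProteinPaint import code. It uses the standard MAF column names `Hugo_Symbol`, `Variant_Classification` and `Tumor_Sample_Barcode`; the toy rows are hypothetical, and treating every class other than `Silent` as non-silent is a simplifying assumption.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical toy rows; a real MAF file carries many more columns and classes.
TOY_ROWS = [
    ("BRAF", "Missense_Mutation", "S1"),
    ("BRAF", "Missense_Mutation", "S2"),
    ("BRAF", "Missense_Mutation", "S3"),
    ("NRAS", "Missense_Mutation", "S4"),
    ("TTN", "Missense_Mutation", "S1"),
    ("TTN", "Missense_Mutation", "S2"),
    ("TTN", "Silent", "S5"),
]
HEADER = "Hugo_Symbol\tVariant_Classification\tTumor_Sample_Barcode\n"
TOY_MAF = HEADER + "\n".join("\t".join(row) for row in TOY_ROWS)


def missense_to_silent(class_counts):
    """Ratio of missense to silent mutations; None when there are no silent ones."""
    silent = class_counts["Silent"]
    if silent == 0:
        return None
    return class_counts["Missense_Mutation"] / silent


def summarize(maf_text):
    """Rank genes by the number of samples carrying a non-silent mutation."""
    nonsilent_samples = defaultdict(set)  # gene -> set of sample barcodes
    class_counts = defaultdict(Counter)   # gene -> mutation class -> count
    for row in csv.DictReader(io.StringIO(maf_text), delimiter="\t"):
        gene = row["Hugo_Symbol"]
        mutation_class = row["Variant_Classification"]
        class_counts[gene][mutation_class] += 1
        # Assumption: anything not labeled "Silent" counts as non-silent.
        if mutation_class != "Silent":
            nonsilent_samples[gene].add(row["Tumor_Sample_Barcode"])
    ranked = sorted(class_counts, key=lambda g: (-len(nonsilent_samples[g]), g))
    return [(g, len(nonsilent_samples[g]), missense_to_silent(class_counts[g]))
            for g in ranked]


for gene, n_samples, ratio in summarize(TOY_MAF):
    print(gene, n_samples, ratio)
```

Counting distinct samples rather than mutations keeps a long gene with many mutations in a single tumor from outranking a gene that is recurrently mutated across tumors.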

b. Comparison of the user-provided mutation profile with those of the Pediatric and COSMIC data sets in ProteinPaint. In the TCGA SKCM example, mutations from the Pediatric and COSMIC data sets can be displayed alongside those of TCGA SKCM for the most frequently mutated genes in (a). The spanning row at the top shows the data set names and the total number of samples in each data set.

3. Supplementary References

11. Jaffe, J.D. et al. Nat. Genet. 45, 1386-1391 (2013).
12. Wang, J. et al. Nat. Methods 8, 652-654 (2011).
13. Ungureanu, D. et al. Nat. Struct. Mol. Biol. 18, 971-976 (2011).
14. Varley, J.M. et al. Oncogene 20, 2647-2654 (2001).
15. Li, H. et al. Bioinformatics 25, 2078-2079 (2009).
16. Kent, W.J. et al. Genome Res. 12, 996-1006 (2002).
17. Marchler-Bauer, A. et al. Nucl. Acids Res. 43, D222-D226 (2015).
18. Walz, A.L. et al. Cancer Cell 27, 286-297 (2015).
19. Pugh, T.J. et al. Nat. Genet. 45, 279-284 (2013).
20. Ma, X. et al. Nat. Commun. 6, 6604 (2015).
21. Wegert, J. et al. Cancer Cell 27, 298-311 (2015).
22. Rakheja, D. et al. Nat. Commun. 2, 4802 (2015).
23. Li, B. et al. Nat. Med. 21, 563-571 (2015).
