UCSC Cancer Browser Workshop
Mary Goldman [email protected]
First: use Firefox or Chrome
Please do not use Internet Explorer
➔ Download Firefox or Chrome if you need to
Our browser does have some functionality on IE, but it is limited. Use Firefox or Chrome for our full feature set.
What is the Cancer Browser?
It is a tool to visually explore and analyze cancer genomics data and its associated clinical information.
https://genome-cancer.ucsc.edu/
It can be used to:
● analyze data on the browser
● do proof-of-concept visualization to determine if more complicated analysis is worth performing
● visualize analysis results
  ○ for colleagues, papers, presentations, posters, etc.
Outline
● Quick overview of the browser
● Overview of our data (TCGA + more)
● How to use the browser
● Breast cancer PAM50 example
● Lower Grade Glioma Telomere example
Interactive tutorial highlights features of our browser
Figure: genomic data heatmap alongside clinical data heatmap (axes: samples; genomic locations / genes)
Both clinical and genomic heatmaps sorted by left-most clinical feature and then subsorted on following features
Red = amplification
Blue = deletion
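The ordering described above (sort by the left-most clinical feature, then subsort on the following features) can be sketched as a multi-key sort. This is only an illustration: the sample records and feature names are made up, and the browser's own handling of ties or missing values is not described in these slides.

```python
def sort_samples(samples, features):
    """Sort sample records by features[0], breaking ties with features[1], and so on."""
    return sorted(samples, key=lambda s: tuple(s[f] for f in features))

# Hypothetical sample records with two clinical features.
samples = [
    {"id": "S1", "subtype": "LumA", "stage": 2},
    {"id": "S2", "subtype": "Basal", "stage": 3},
    {"id": "S3", "subtype": "LumA", "stage": 1},
]
ordered = sort_samples(samples, ["subtype", "stage"])
print([s["id"] for s in ordered])
```

Because the same row order is applied to both heatmaps, genomic patterns line up with the clinical features next to them.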
View data in summary modes (box plot or proportions)
Also known as stacked bar graphs, proportions view shows the distribution of each column of data
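As a rough sketch of what a proportions view summarizes, the snippet below computes the share of each value within one column of data. The subtype labels are made-up example values, not browser output.

```python
from collections import Counter

def column_proportions(values):
    """Return {value: fraction of samples with that value} for one column."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# Hypothetical column: one subtype call per sample.
column = ["LumA", "LumA", "Basal", "Her2"]
print(column_proportions(column))
```

Each fraction corresponds to the height of one segment in that column's stacked bar.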
Data Sources
● TCGA
● TARGET and other pediatric cancer
● CCLE
● SU2C
● Connectivity Map
➢ 698 datasets, including 526 public datasets
➢ 227,000 samples
http://cancergenome.nih.gov/
Level 3 data
All of the TCGA data we display are Level 3. Level 3 means:
● read-level data has been summarized to gene- and probe-level data
● no longer patient identifiable
● publicly available
TCGA Data types
● Copy Number Variation
● DNA Methylation
● Gene and exon expression
● Somatic mutation (gene-level)
● Protein expression
● Paradigm Pathway activity
Vaske,C.J., Benz,S.C., Sanborn,J.Z., Earl,D., Szeto,C., Zhu,J., Haussler,D. and Stuart,J.M. (2010) Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26, i237-i245.
Multidimensional data is challenging
Figure panels: Gene Expression, DNA Methylation, Copy Number Variation, Mutation
Paradigm
● Infers patient-specific pathway activities using CNV and gene expression data
● Developed at UCSC
● Multiple datasets depending on what data was used to make the calls (e.g. RNAseq + CNV)
Vaske et al., 2010
PANCAN12 Datasets
● TCGA formed an Analysis Working Group to look at genomics abnormalities across cancers
● 12 tumor types: breast cancer, ovarian cancer, GBM, ....
● CNV, expression, mutation, protein
Hoadley,K.A., Yau,C., Wolf,D.M., Cherniack,A.D., Tamborero,D., Ng,S., Leiserson,M.D.M., Niu,B., McLellan,M.D., Uzunangelov,V., et al. (2014) Multiplatform Analysis of 12 Cancer Types Reveals Molecular Classification within and across Tissues of Origin. Cell, 158, 929-944.
Pan-Cancer datasets
● We assembled these datasets
● 19 tumor types: breast cancer, ovarian cancer, GBM, melanoma, thyroid cancer, ....
● CNV, expression, mutation, paradigm
Pan-Cancer mutations
Looking at the most frequently mutated genes in cancer, we can see across almost 4.5K samples that TP53 is by far the most mutated
Pan-Cancer Normalized Gene expression
● Allows you to see differences in expression across all cancer types
● Combines Illumina RNAseq data from all TCGA cohorts
● Mean-normalized per gene
https://genome-cancer.ucsc.edu/proj/site/hgHeatmap/#?bookmark=c347fdabddde3e73d824caff1290a6a8
FOXM1 Pathway
Figure panels: GBM, LGG
TCGA Data Curation
● Map between patient, sample and omic IDs. Same ID on genomic and clinical matrices
● Curated overall and recurrence-free survival
● More easily readable clinical/phenotype data
● Matrix format can be downloaded for both genomic and phenotype/clinical data
Non-TCGA Public Data
● TARGET and Childhood cancer
● Cell line data (CCLE, SU2C, Connectivity Map)
TARGET and Childhood cancer
● TARGET applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers (AML and neuroblastoma).
● Other cancer types, including some from the Pediatric Tumor Affymetrix Database
Cell Line data
● CCLE: Genome-wide information of ~1000 cell lines under baseline condition. Pharmacologic response profiles (IC50) and mutation status analysis.
● SU2C: 50 Breast cancer cell lines. GI50 to 77 therapeutic compounds.
● Connectivity Map: 4 cell lines and 1309 perturbagens at several concentrations. Gene expression change after treatment.
Browser Demo
PAM50
● Breast cancer
● 4 major intrinsic subtypes: Luminal A, Luminal B, Her2-enriched, Basal
● Subtypes are clinically relevant for drug sensitivity and long-term survival
● Determine tumor subtype by looking at the gene expression of 50 genes
Our Goals
● Look at the expression of these 50 genes and their relationship to the subtype calls
● Look at the survivorship of these different subtypes
● Make a bookmark to share with others
Steps
1. Go to https://genome-cancer.ucsc.edu/
2. Open TCGA Breast Agilent dataset
3. Go to genes mode
4. Replace current geneset with the predefined PAM50 geneset from the Favorites menu.
5. Perform KM plot
6. Bookmark the view to share
How to view Kaplan-Meier Plots
Figure: Kaplan-Meier plot (x-axis: Time; y-axis: Survival; curves labeled "worse survival" and "better survival")
Steep curve = Poor survival
Initially Luminal AB have higher survival than Basal / HER2-enriched
As patients age, Basal / HER2-enriched have higher survival than Luminal AB
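For readers who want to see what a Kaplan-Meier curve is computing, here is a minimal sketch of the standard product-limit estimator. It is not the browser's implementation, and the follow-up times and event flags are made-up toy values.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time with at least one event.

    times  -- follow-up time for each patient
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients who survive past t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve

# Toy cohort: five patients, two of them censored.
for t, s in kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0]):
    print(t, round(s, 3))
```

A curve that drops quickly means events happen early, which is why a steep curve indicates poor survival.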
Bonus Question
We know that several of these tumor samples went through both Agilent and RNAseq analysis. Now we want to see if the gene expression patterns we're seeing for these 50 genes are Agilent-specific or if they are cross-platform.
➢ How do you do this?
More information: PAM50
● Tumors can instead be classified by hormone cell surface receptors --> ER, PR and HER2
● Patients who have at least one of these cell surface receptors tend to respond to traditional hormone therapy
● Patients who are triple negative (negative for all 3 cell surface receptors) typically do not respond and have a poor prognosis
Our Goals
● Examine relationship between these two subtyping methods
● Examine survivorship of triple negative patients compared with other patients
But!
There is no 'triple negative' classification in the browser.
➔ We will need to create this classification and load it back into the browser
Steps
1. Download clinical data in view
2. Open clinical data in Excel or other spreadsheet program
Our Goal for Excel
Create a column next to the Sample ID column, where if the sample is triple negative it will be "1". Otherwise it will be "0".
➔ https://genome-cancer.ucsc.edu/download/public/BRCA_modified_clinical.xls
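If you prefer a script to a spreadsheet, the same column can be sketched in Python. The column names (`ER_status`, `PR_status`, `HER2_status`) and the value `Negative` are assumptions for illustration; check the header of the file you actually download.

```python
import csv
import io

# Hypothetical column names and values; the real download may label them differently.
RECEPTOR_COLUMNS = ("ER_status", "PR_status", "HER2_status")

def add_triple_negative(rows):
    """Add a "triple_negative" column: "1" if all three receptors are negative, else "0"."""
    out = []
    for row in rows:
        is_tn = all(row[col].strip().lower() == "negative" for col in RECEPTOR_COLUMNS)
        out.append({**row, "triple_negative": "1" if is_tn else "0"})
    return out

# Made-up example data, tab-separated like a clinical matrix.
example = (
    "sampleID\tER_status\tPR_status\tHER2_status\n"
    "S1\tNegative\tNegative\tNegative\n"
    "S2\tPositive\tNegative\tNegative\n"
)
rows = add_triple_negative(csv.DictReader(io.StringIO(example), delimiter="\t"))
for row in rows:
    print(row["sampleID"], row["triple_negative"])
```

Note that a sample with a missing or indeterminate receptor status gets "0" here, matching the "otherwise 0" rule; you may want to leave such samples blank instead.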
Last Steps:
https://genome-cancer.ucsc.edu/download/public/BRCA_modified_clinical.xls
3. Upload new column
4. Perform KM plot, see differences in survival between triple negative and non-triple negative patients
Telomeres and nervous system cancers
Olena Morozova
Figure panels (ATRX): TCGA Lower Grade Glioma, TARGET Neuroblastoma, TCGA GBM
ATRX mutation frequency is high in TCGA Lower Grade Glioma and TARGET childhood neuroblastoma, but not in TCGA GBM.
ALT and ATRX
● ATRX affects chromatin remodeling and methylation patterns across the genome
● Loss of ATRX is associated with alternative lengthening of telomeres (ALT)
Telomeres
● Repeating sequences at end of chromosomes
● Shorten due to cell replication
● Extended by telomerase in germline cells
● If the telomeres get too short, the cell undergoes apoptosis and dies
● Cancer cells lengthen telomeres as a way to avoid cell death
http://www.sciencemag.org/content/336/6087/1388.full
Hypothesis: Low expression of ATRX leads to ALT
Diagram labels: ATRX mutation; ATRX; alternative lengthening of telomeres; telomeres lengthened (avoids cell death)
➢ What about the telomerase pathway?
Telomerase and TERT
● TERT protein is a subunit of telomerase
● High expression lengthens telomeres
Use gene expression to infer telomere lengthening method
Diagram labels: TERT; increased telomerase activity; telomeres lengthened (avoids cell death)
Diagram labels: ATRX; alternative lengthening of telomeres; telomeres lengthened (avoids cell death)
Our Goals
● Examine how ATRX mutation frequency relates to ATRX expression
  ○ Does a mutation in ATRX lead to lower expression?
● Examine how ATRX expression and TERT expression relate to each other
  ○ What is the relationship between these two telomere lengthening methods?
On the current Cancer Browser
➔ Create a new spreadsheet view
Xena
View multiple types of data together in a large spreadsheet
View mutation position
Xena
https://genome-cancer.ucsc.edu/proj/site/hgHeatmap-cavm/
http://tinyurl.com/op8qk3g
➔ Please use Chrome if you have it
Xena demo
2 pathways to same result
Diagram 1 labels: ATRX; TERT; NO alternative lengthening of telomeres; increased telomerase activity; telomeres lengthened (avoids cell death)
OR
Diagram 2 labels: ATRX; TERT; alternative lengthening of telomeres; controlled telomerase, no lengthening; telomeres lengthened (avoids cell death)
Summary: ATRX/TERT
● ATRX mutations are associated with low ATRX expression
● ATRX and TERT expression are positively related -> one or the other pathway is being activated to lengthen telomeres
More: TERT promoter mutations
Looked at mutations in the promoter region of TERT and found that many samples had mutations
Marked each sample as TERT promoter mutation wildtype or mutant
Our Goals
Examine Olena's TERT promoter calls in the context of other data from TCGA LGG
Is there a relationship between TERT promoter mutations and TERT expression?
Xena
View multiple types of data together in a large spreadsheet
View mutation position
Securely and easily view:
● your own annotations
Xena demo 2
Summary TERT promoter mutations
TERT promoter mutations are associated with increased TERT expression
Xena
View multiple types of data together in a large spreadsheet
View mutation position, including 3D structure
Securely and easily view:
● your own annotations
● your own cohort of data
Analyze data in Galaxy
Galaxy Analysis Tools
Users are continually asking for more and more analysis tools
● keeping up with demand is impossible
--> Integrate with Galaxy to provide users with a huge range of tools
Galaxy
● Large tool workshed
● Import our data, analyze, and then visualize on our browser
● Galaxy keeps track of the analysis done so that you can reproduce it later
● Can currently import and export from Cancer Browser and Xena
Future Xena
● Composite cohorts
● More data (COSMIC, ICGC, LINCS)
● Make it easier to view own data
With your laptop Xena you could ...
● View your own genomic, clinical/phenotype or mutation data.
● View your annotations on TCGA data
● Perform analysis in Galaxy
--> Click on 'Help' in the top menu bar to get started.
The End
Acknowledgements
Brian Craft
Teresa Swatloski
Jingchun Zhu
Melissa Cline
Olena Morozova
Sofie Salama
Maximilian Haeussler
Erich Weiler
Joshua Stuart
David Haussler
Normalization for visualization
● All normalization we've talked about is on the data
  ○ We also do some normalization for visualization only
● Does not affect underlying data
● Subtract the mean of each genomic location
● Automatically turned on for all transcription datasets except for pancan normalized datasets
● Can be turned off and on as desired
Without normalization
Everything is red because, for RNAseq, all values are above zero
With normalization
By subtracting out the mean, we can see places in the genome that are relatively under- or over-expressed compared to other samples
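The mean subtraction described above can be sketched in a few lines. The matrix layout (one row per sample, one column per genomic location) and the values are assumptions for illustration; this is not the browser's code.

```python
def subtract_column_means(matrix):
    """Subtract each genomic location's mean (across samples) from its values.

    matrix -- list of sample rows; each row lists one value per genomic location
    """
    n_samples = len(matrix)
    means = [sum(column) / n_samples for column in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]

# Made-up RNAseq-like values: all positive before normalization.
data = [
    [1.0, 10.0],
    [3.0, 14.0],
]
print(subtract_column_means(data))
```

After the subtraction, each location has both negative and positive values, so relative under- and over-expression can be drawn in different colors. The input matrix is left unchanged, in line with the point that the underlying data is not affected.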
The Cancer Genome Atlas (TCGA)
● Comprehensive study of 20+ cancer types
● Bulk of our data in the browser
● Typically, studies only obtain a few types of genomic data (e.g. only mutation)
● TCGA aims to obtain as many different types of genomic data about one tumor as possible
  ○ It's a comprehensive resource
Copy Number Variation (CNV)
● 2 processing methods: CBS or Gistic2
● Circular binary segmentation (CBS) determines which pieces of DNA were amplified/deleted based on SNP array results
  ○ 2 datasets
  ○ One dataset has germline CNV removed by Broad
  ○ We don't display normal samples
● Gistic2 generates gene level CNV estimates
  ○ 3 datasets
Gistic2 Copy Number Variation Calls
● Called by Firehose, an analysis pipeline
● Separates arm-level and focal alterations (short segments) based on segment length before predicting overall CNV
● GISTIC2 focal: focal alterations only
  ○ TCGA doesn't give arm-level alterations only
● GISTIC2 thresholded: data has been thresholded to -2,-1,0,1,2, representing homozygous deletion, single copy deletion, diploid normal copy, low copy number amplification, or high copy number amplification.
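When working with a downloaded thresholded matrix, a small lookup table makes the five values readable. The labels are taken from the slide above; the function name and the example calls are made up for illustration.

```python
# Meaning of each GISTIC2 thresholded value, as listed above.
GISTIC2_LABELS = {
    -2: "homozygous deletion",
    -1: "single copy deletion",
    0: "diploid normal copy",
    1: "low copy number amplification",
    2: "high copy number amplification",
}

def describe_calls(calls):
    """Translate a list of thresholded values into their descriptions."""
    return [GISTIC2_LABELS[call] for call in calls]

# Hypothetical calls for one gene across three samples.
print(describe_calls([2, 0, -1]))
```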
Figure panels: Ovarian Serous Cystadenocarcinoma and Glioblastoma Multiforme, segmented copy number (delete germline cnv)
DNA methylation
● 2 platforms: 27K and 450K. 90% of 27K in 450K.
● DNA methylation beta values range between 0 (hypomethylated) and 1 (hypermethylated). Bimodal distribution with peaks at 0.1 and 0.9
● Beta values were offset by 0.5 (new range: -0.5 to 0.5)
● In 27K platform, the average of the unshifted beta values is 0.26, thus much of the heatmap appears hypomethylated (blue). In 450K platform, the average is around 0.5.
Figure panels: 27k, 450k
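The beta-value offset described above is a shift by -0.5, so that 0.5 becomes the neutral midpoint. A minimal sketch (the function name is made up):

```python
def offset_beta(beta):
    """Shift a methylation beta value from the range [0, 1] to [-0.5, 0.5]."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError(f"beta value out of range: {beta}")
    return beta - 0.5

# The 27K platform average quoted above lands below zero after the shift.
print(round(offset_beta(0.26), 2))
```

This is why a platform whose average beta is 0.26 looks mostly blue, while one averaging around 0.5 looks balanced.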
Gene expression
● Microarrays - Agilent and Affy
● RNAseq - 2 Illumina sequencers
  ○ Most use RSEM to estimate gene-level transcription
  ○ We log2 transformed the data to normalize the distribution
Exon Expression
● Illumina RNAseq
● Exon-level transcription estimates, as in RPKM values (Reads Per Kilobase of exon model per Million mapped reads)
● We log2 transformed the data to normalize the distribution
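To make the RPKM definition and the log2 step concrete, here is a small sketch. The RPKM formula follows the definition spelled out above; the pseudocount of 1 in the log2 step is an assumption (the slides do not say how zeros were handled), and the read counts are made up.

```python
import math

def rpkm(reads_in_exon, exon_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads."""
    return reads_in_exon * 1e9 / (exon_length_bp * total_mapped_reads)

def log2_transform(value, pseudocount=1.0):
    """log2 of the value plus a pseudocount (assumed here so that zero stays finite)."""
    return math.log2(value + pseudocount)

# Made-up example: 500 reads on a 2,000 bp exon, 25 million mapped reads.
value = rpkm(500, 2000, 25_000_000)
print(value, round(log2_transform(value), 3))
```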
GBM gene-level Illumina HiSeq. We can see a clear correlation in the expression of the genes to the subtypes called in the first clinical feature
Glioblastoma: gene expression subtypes
https://genome-cancer.ucsc.edu/proj/site/hgHeatmap/#?bookmark=6e2048a6ef5cd04fe7c336e9f270c106
Somatic Mutation
● High level view of mutations across genome
● Mutation calls from TCGA pan-cancer analysis
● If there is a non-silent mutation in a coding gene or any mutation in a non-coding gene, then we mark the entire gene as being mutated
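The gene-level rule above can be sketched as follows. The input format (pairs of gene name and a silent flag) and the example gene sets are assumptions for illustration, not the pipeline used to build the datasets.

```python
def mutated_genes(mutations, coding_genes):
    """Return the set of genes marked as mutated for one sample.

    mutations    -- iterable of (gene, is_silent) pairs
    coding_genes -- set of gene names treated as protein-coding
    """
    marked = set()
    for gene, is_silent in mutations:
        # Coding gene: only non-silent mutations count.
        # Non-coding gene: any mutation counts.
        if gene not in coding_genes or not is_silent:
            marked.add(gene)
    return marked

# Hypothetical sample: a non-silent TP53 mutation, a silent ATRX mutation,
# and a mutation in a gene treated as non-coding.
calls = [("TP53", False), ("ATRX", True), ("MALAT1", True)]
print(sorted(mutated_genes(calls, coding_genes={"TP53", "ATRX"})))
```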
Somatic Mutation
● View in genes mode since calls are per gene
● View in proportions to get a feel for what proportion of the samples have a mutation in a particular gene
Protein
● Reverse Phase Protein Array (RPPA)
● 200 antibodies. Most antibodies are for phosphorylated protein level. Some are for total protein.
● Include kinases, cell surface receptors, etc.
● RBN (replicate-based normalization)
  ○ RBN allows you to combine datasets from multiple RPPA runs
Future Xena Data
COSMIC (Catalogue Of Somatic Mutations In Cancer)
● 947,213 Samples with 1,592,109 Mutations
● January 2014
Future Xena Data
LINCS (Library of Integrated Network-based Signatures)
● 42,532 perturbations for 15 cell lines
● April 2014
Outline
● Quick overview of the browser
● Overview of our data (TCGA + more)
● How to use the browser
● Breast cancer PAM50 example
● Lower Grade Glioma Telomere example
● Lower Grade Glioma IDH1 example
TCGA Lower Grade Glioma
● Large survivorship difference within the LGG
● Can we find subtypes within LGG that are predictive of survivorship?
● Ran clustering algorithm on DNA, RNA and methylation data
TCGA (2014) Comprehensive and Integrative Genomic Characterization of Diffuse Lower Grade Gliomas, N. Engl. J. Med., In press.
Clustering Results
● Found IDH1 mutation status to be predictive of survival
● Also correlated with EGFR and PTEN copy number status
TCGA (2014) Comprehensive and Integrative Genomic Characterization of Diffuse Lower Grade Gliomas, N. Engl. J. Med., In press.
Goals
Start up a fresh browser session
View IDH1 mutation status as well as EGFR and PTEN copy number status together to see how they relate to one another.
Steps
1. Open TCGA Lower Grade Glioma cohort
2. Open LGG mutation (broad automated) --> IDH1
3. Open LGG copy number (delete germline cnv) --> EGFR, PTEN
Xena demo 3