computational identification of tumor heterogeneity 2015-03-25 sangwoo kim

Computational Identification of Tumor Heterogeneity

2015-03-25 Sangwoo Kim

Tumor heterogeneity

• Inter-tumor heterogeneity: genetic and phenotypic variation between individuals with the same tumor type

• Intra-tumor heterogeneity: subclonal diversity within a tumor

Tumor heterogeneity in AML

Tumor progression and response

Heterogeneity and resistance

Inferring tumor heterogeneity

1. single-cell sequencing

2. bulk sequencing and reconstruction

COMPUTATIONAL IDENTIFICATION OF TUMOR SUBCLONES

Today’s paper 1 (PyClone)

• Sohrab Shah, Ph.D.

– Associate Professor in the Departments of Pathology and Computer Science, University of British Columbia

– Dr. Shah's work focuses on characterization of cancer genomes for determination of pathogenic driver mutations in cancer subtypes and measuring and quantifying tumour evolution

Conceptual overview

• Sequencing – pool sequencing – unclassified tools

Allele frequency and Cellular prevalence

• Allele frequency (af): ratio of alternative-allele copies to the total number of allele (haploid) copies at the locus

• Cellular prevalence (cp): proportion of tumor cells harboring a mutation

subclone 1 (AA): 70%

subclone 2 (AB): 30%

• allele frequency = 15%

• cellular prevalence = 30%

subclone 1 (AA): 70%

subclone 2 (AAB): 30%

• allele frequency = 10%

• cellular prevalence = 30%

Allele frequency to cellular prevalence

Example      AF    Genotype   CP
mutation 1   10%   AB         20%
mutation 2   10%   AAB        30%
mutation 3   10%   ABB        15%
mutation 5   20%   AABB       40%
mutation 7   50%   ABB        75%

Genotype (copy number) is essential for heterogeneity estimation
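The conversions in the table above appear to follow a simple rule, CP = AF × c(g) / b(g), where c(g) is the copy number of the genotype and b(g) its number of variant (B) alleles. The sketch below reproduces the listed rows under that assumption. The function names are hypothetical, and the rule is a simplification: it treats every sequenced cell as having the same total copy number at the locus.

```python
from fractions import Fraction


def copy_number(genotype: str) -> int:
    """Total number of allele copies in a genotype string, e.g. 'AAB' -> 3."""
    return len(genotype)


def variant_copies(genotype: str) -> int:
    """Number of variant (B) alleles in a genotype string, e.g. 'AAB' -> 1."""
    return genotype.count("B")


def cellular_prevalence(af: Fraction, genotype: str) -> Fraction:
    """Simplified conversion CP = AF * c(g) / b(g).

    Assumption: all cells at the locus share the copy number of `genotype`.
    """
    return af * copy_number(genotype) / variant_copies(genotype)


# Rows taken from the table (AF, genotype).
rows = [
    ("mutation 1", Fraction(10, 100), "AB"),
    ("mutation 2", Fraction(10, 100), "AAB"),
    ("mutation 3", Fraction(10, 100), "ABB"),
    ("mutation 5", Fraction(20, 100), "AABB"),
    ("mutation 7", Fraction(50, 100), "ABB"),
]

for name, af, genotype in rows:
    cp = cellular_prevalence(af, genotype)
    print(f"{name}: AF={float(af):.0%} genotype={genotype} CP={float(cp):.0%}")
```

The same allele frequency of 10% maps to three different cellular prevalences depending on the genotype, which is why copy number information is needed.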

A toy example

Cellular prevalence and evolution model

Assumptions:

1) The clonal population follows a perfect phylogeny: no site mutates more than once in its evolutionary history, and each site harbors at most one somatic mutant genotype.

2) The clonal population follows a persistent phylogeny: mutations do not disappear or revert.

Cellular prevalence and evolution model

[Figure: evolution model annotated with cellular prevalences of 10%, 10%, 20%, 30%, and 30%]

What to infer:

1) number and composition of subclones

2) cellular prevalence (cp): proportion of tumor cells harboring a mutation

Input and Output

• Input (observation):

– a set of deeply sequenced mutations (AF), from one or more loci in each sample

– a measure of allele-specific copy number at each mutation locus (genotype)

• Output:

– CP of each mutation

– clustering among mutations

– overall CP of each cluster

Clusters and CP

CNV

mutation (AF)

PyClone population structure

Allele frequency of this mutation: 6*4*(2/4) / {2*2 + 4*3 + 6*4} = 12/40 = 0.30

Cellular prevalence of this mutation: 6 / (4 + 6) = 0.60
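The arithmetic above can be checked with a short sketch. It assumes, reading the formula term by term, that the example contains 2 normal cells (copy number 2), 4 reference cancer cells (copy number 3), and 6 variant cancer cells (copy number 4, two of which are variant alleles). The dictionary layout and function names are illustrative only.

```python
# Each population: (number of cells, total copy number, variant copies per cell).
# Values are read off the worked formula; the layout is an assumption.
populations = {
    "normal": (2, 2, 0),
    "reference": (4, 3, 0),
    "variant": (6, 4, 2),
}


def allele_frequency(pops):
    """Variant allele copies divided by all allele copies in the sample."""
    variant_alleles = sum(n * b for n, c, b in pops.values())
    total_alleles = sum(n * c for n, c, b in pops.values())
    return variant_alleles / total_alleles


def cellular_prevalence(pops):
    """Variant cells divided by all cancer cells (normal cells excluded)."""
    n_reference = pops["reference"][0]
    n_variant = pops["variant"][0]
    return n_variant / (n_reference + n_variant)


print("allele frequency:", allele_frequency(populations))
print("cellular prevalence:", cellular_prevalence(populations))
```

Normal cells dilute the allele frequency but do not enter the cellular prevalence, which is defined over cancer cells only.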

Things to consider

• fraction of cancer cells: t

– fraction of normal cells = 1-t

• genotypes of the normal, reference, and variant populations for the nth mutation

– gN, gR, gV ∈ {-, A, B, AA, AB, BB, AAA, AAB, ...}

– $\psi_n = (g_N^n, g_R^n, g_V^n) \in \mathcal{G}^3$

• read depth at the locus of the nth mutation: dn

• number of reads harboring the nth mutation: bn

Cellular prevalence of nth mutation

The generative model

prior parameter

posterior parameter

$\psi_n = (g_N^n, g_R^n, g_V^n)$

φn = fraction of cancer cells from the variant population

The probability

the probability of sampling a read containing the variant allele covering a mutation with state ψ = (gN, gR, gV) and cellular prevalence φ

c(g) : copy number of the genotype (e.g. g=AAB, c(g)=3)

b(g) : number of variant alleles of the genotype (e.g. g=AAB, b(g)=1)

µ(g) : probability of sampling a variant allele from a cell = b(g)/c(g)
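The slide's own formula did not survive extraction, so the following is a sketch of one form that is consistent with the definitions of c(g), b(g), and µ(g): each population contributes reads in proportion to its cell fraction times its copy number, and a read from that population carries the variant with probability µ(g). The function name `xi` and the exact weighting are assumptions, not a transcription of the slide.

```python
def c(g: str) -> int:
    """Copy number of a genotype string; '-' denotes a deleted locus."""
    return 0 if g == "-" else len(g)


def b(g: str) -> int:
    """Number of variant (B) alleles in a genotype string."""
    return g.count("B")


def mu(g: str) -> float:
    """Probability of sampling a variant allele from a cell with genotype g."""
    return b(g) / c(g) if c(g) > 0 else 0.0


def xi(g_n: str, g_r: str, g_v: str, phi: float, t: float) -> float:
    """Probability that a read at the locus carries the variant allele.

    Assumed weighting: (cell fraction) * (copy number) per population.
    """
    w_n = (1.0 - t) * c(g_n)        # normal cells
    w_r = t * (1.0 - phi) * c(g_r)  # cancer cells without the mutation
    w_v = t * phi * c(g_v)          # cancer cells with the mutation
    z = w_n + w_r + w_v
    return (w_n * mu(g_n) + w_r * mu(g_r) + w_v * mu(g_v)) / z


# Population example: 2 normal (AA), 4 reference (AAA), 6 variant (AABB) cells.
print(xi("AA", "AAA", "AABB", phi=6 / 10, t=10 / 12))
```

With tumor content t = 10/12 and φ = 6/10, this reproduces the allele frequency of 12/40 from the population structure example.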

The probability of bn


when cp is given we can calculate the probability of observing bn
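As an illustration of this step, the sketch below evaluates a binomial likelihood of bn over a grid of candidate cp values and picks the most likely one. The binomial form, the diploid AB genotype, the pure-tumor setting (t = 1), and the read counts are all assumptions chosen for the example.

```python
from math import comb


def variant_read_probability(phi, t, cn_normal=2, cn_ref=2, cn_var=2, b_var=1):
    """Probability that a read carries the variant allele.

    Assumes normal and reference cells carry no variant alleles.
    """
    w_n = (1.0 - t) * cn_normal
    w_r = t * (1.0 - phi) * cn_ref
    w_v = t * phi * cn_var
    return w_v * (b_var / cn_var) / (w_n + w_r + w_v)


def binomial_pmf(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def likelihood(b_n, d_n, phi, t):
    """P(b_n variant reads | depth d_n, cellular prevalence phi)."""
    return binomial_pmf(b_n, d_n, variant_read_probability(phi, t))


# Hypothetical observation: 15 variant reads out of 100, pure tumor sample.
d_n, b_n, t = 100, 15, 1.0
grid = [i / 100 for i in range(1, 101)]
best_phi = max(grid, key=lambda phi: likelihood(b_n, d_n, phi, t))
print("most likely cellular prevalence:", best_phi)
```

For a heterozygous diploid mutation in a pure tumor, the variant read probability is φ/2, so 15% variant reads point to a cellular prevalence near 30%.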

inferring cp from bn

1. Mutations with the same cellular prevalence are clustered into the same clone.

2. We want to infer the most likely cellular prevalence of each mutation from the observations and find clusters that define subclones, e.g., if the best per-mutation estimates are [0.7, 0.5, 0.5, 0.4, 0.2, 0.5, 1.0, 0.9, 0.1, 0.4]

always problematic!!

Getting cp by sampling

• CP prior ~ Dirichlet process

– to have discrete cp values

• Sampling:

– Metropolis-Hastings algorithm
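To show why a Dirichlet process prior yields discrete cp values, here is a minimal sketch that draws one cp per mutation using the Chinese restaurant process construction. The Uniform(0,1) base distribution, the concentration parameter `alpha`, and the function name are assumptions for illustration.

```python
import random


def sample_cp_from_dp(n_mutations, alpha, rng):
    """Draw cellular prevalences from a Dirichlet process prior.

    Each mutation joins an existing cluster with probability proportional
    to the cluster size, or opens a new cluster (with a fresh Uniform(0,1)
    value) with probability proportional to alpha.
    """
    values = []       # one cp value per cluster
    counts = []       # number of mutations in each cluster
    assignments = []  # cluster index of each mutation
    for i in range(n_mutations):
        r = rng.uniform(0.0, i + alpha)
        acc = 0.0
        for k, n_k in enumerate(counts):
            acc += n_k
            if r < acc:
                counts[k] += 1
                assignments.append(k)
                break
        else:
            # No existing cluster was chosen: open a new one.
            values.append(rng.random())
            counts.append(1)
            assignments.append(len(values) - 1)
    return [values[k] for k in assignments], assignments


rng = random.Random(0)
cps, clusters = sample_cp_from_dp(10, alpha=1.0, rng=rng)
print(clusters)
print([round(cp, 2) for cp in cps])
```

Mutations assigned to the same cluster share exactly the same cp, so the number of distinct values is usually much smaller than the number of mutations.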

Let f(x) be a function that is proportional to the desired probability distribution P(x).

1. Initialization:

• Choose an arbitrary point x0 to be the first sample, and choose an arbitrary probability density Q(x|y) which suggests a candidate for the next sample value x, given the previous sample value y. For the Metropolis algorithm, Q must be symmetric; in other words, it must satisfy Q(x|y) = Q(y|x). A usual choice is to let Q(x|y) be a Gaussian distribution centered at y, so that points closer to y are more likely to be visited next, making the sequence of samples into a random walk. The function Q is referred to as the proposal density or jumping distribution.

2. For each iteration t:

• Generate a candidate x' for the next sample by picking from the distribution Q(x'|xt).

• Calculate the acceptance ratio α = f(x')/f(xt), which will be used to decide whether to accept or reject the candidate. Because f is proportional to the density of P, we have that α = f(x')/f(xt) = P(x')/P(xt).

• If α ≥ 1, then the candidate is more likely than xt; automatically accept the candidate by setting xt+1 = x'. Otherwise, accept the candidate with probability α; if the candidate is rejected, set xt+1 = xt instead.
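The steps above can be sketched as a random-walk Metropolis sampler. As a target, the example uses an unnormalized posterior over the cellular prevalence of a single mutation; the uniform prior, the diploid AB genotype in a pure tumor, and the counts (15 variant reads out of 100) are assumptions made for illustration, not the sampler used by PyClone.

```python
import random


def unnormalized_posterior(phi, b_n=15, d_n=100):
    """f(phi), proportional to the posterior of phi under a uniform prior.

    Assumes a heterozygous diploid mutation in a pure tumor, so the
    variant read probability is phi / 2.
    """
    if not 0.0 < phi <= 1.0:
        return 0.0  # outside the support: such candidates are always rejected
    p = phi / 2.0
    return p ** b_n * (1.0 - p) ** (d_n - b_n)


def metropolis(f, x0, n_samples, step, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    x = x0
    samples = []
    for _ in range(n_samples):
        candidate = rng.gauss(x, step)  # Q(x'|x_t), centered at x_t
        alpha = f(candidate) / f(x)     # acceptance ratio
        if alpha >= 1 or rng.random() < alpha:
            x = candidate               # accept
        samples.append(x)               # on rejection, x_t is repeated
    return samples


rng = random.Random(0)
samples = metropolis(unnormalized_posterior, x0=0.5, n_samples=5000, step=0.1, rng=rng)
kept = samples[1000:]  # discard an arbitrary burn-in period
print("posterior mean of phi:", sum(kept) / len(kept))
```

Only the ratio f(x')/f(xt) is needed, so the normalizing constant of the posterior never has to be computed.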


example of cluster

results (synthetic data)

• accuracy with synthetic data

– di ~ Poisson(10,000), t = 0.75, 8 clusters with CP ~ Uniform(0,1), genotype -> total copy number (1~5)

– AB, BB, NZ, TCN, PCN -> genotype priors (see "prior for mutational genotype")

results (synthetic data)

prior for mutational genotype

• copy number must be measured

– for each mutation site:

• c = total copy number

• c1, c2 = copy number of each homologous chromosome

• 5 different strategies for assigning genotype

– AB prior: gR=AA, gV=AB

– BB prior: gR=AA, gV=BB

– No Zygosity (NZ) prior: gR=AA, c(gV)=c, b(gV)=1

– Total Copy Number (TCN) prior: c(gV)=c, b(gV) ∈ {1, ..., c}

• gR=AA or c(gR)=c, b(gR)=0

– Parental Copy Number (PCN) prior: c(gV)=c, b(gV) ∈ {1, c1, c2}

• if b(gV) ∈ {c1, c2}, gR=gN (AA) => mutation occurred before copy number increase

• if b(gV)=1, then c(gR)=c, b(gR)=0 => mutation occurred after copy number increase

c=4, c1=c2=2

c=3, c1=1, c2=2
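The five strategies can be read as rules that enumerate candidate (gR, gV) pairs for a locus. The sketch below encodes each genotype as a tuple (copy number, variant copies) and takes c, c1, c2 to be the total and per-homolog copy numbers, as in the two labeled examples. It is an illustrative reading of the slide, not PyClone's implementation; the function name and the choice to list states without weights are assumptions.

```python
NORMAL = (2, 0)  # diploid AA, written as (copy number, variant copies)


def genotype_states(prior, c=2, c1=1, c2=1):
    """Enumerate candidate (g_R, g_V) pairs under one prior strategy."""
    if prior == "AB":
        states = {(NORMAL, (2, 1))}
    elif prior == "BB":
        states = {(NORMAL, (2, 2))}
    elif prior == "NZ":
        states = {(NORMAL, (c, 1))}
    elif prior == "TCN":
        # Any number of variant copies; reference is AA or has copy number c.
        states = {
            (g_r, (c, b))
            for b in range(1, c + 1)
            for g_r in (NORMAL, (c, 0))
        }
    elif prior == "PCN":
        states = set()
        for b in {1, c1, c2}:
            if b < 1:
                continue  # a variant genotype needs at least one variant copy
            if b in (c1, c2):
                # Mutation preceded the copy number change.
                states.add((NORMAL, (c, b)))
            if b == 1:
                # Mutation followed the copy number change.
                states.add(((c, 0), (c, b)))
    else:
        raise ValueError(f"unknown prior: {prior}")
    return sorted(states)


# The two labeled examples: c=4, c1=c2=2 and c=3, c1=1, c2=2.
print(genotype_states("PCN", c=4, c1=2, c2=2))
print(genotype_states("PCN", c=3, c1=1, c2=2))
print(len(genotype_states("TCN", c=3)))
```

Using parental copy numbers narrows the candidate set compared with the TCN prior, because only variant copy counts that match a homolog (or a single late copy) remain.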

results (real data)

Data = physical mixture of 4 individuals (from 1000 Genomes) in proportions {0.01, 0.05, 0.20, 0.74}

- NA12156, NA12878, NA18507, NA19240

- generated 7 clusters (4 unique, NA18507+NA19240, NA12878+NA18507+NA19240, all four shared)

BeBin = beta-binomial (instead of binomial) to model overdispersion
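To make the overdispersion point concrete, the sketch below compares a binomial and a beta-binomial distribution with the same mean. The mean/precision parameterization (a = mean × precision, b = (1 − mean) × precision) and the numeric values are assumptions chosen for illustration.

```python
from math import comb, exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def binomial_pmf(k, n, p):
    """Binomial probability of k variant reads out of n."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def beta_binomial_pmf(k, n, mean, precision):
    """Beta-binomial probability with a mean/precision parameterization."""
    a = mean * precision
    b = (1.0 - mean) * precision
    return exp(log(comb(n, k)) + log_beta(k + a, n - k + b) - log_beta(a, b))


def variance(pmf_values):
    """Variance of a distribution given as probabilities of 0..n."""
    m = sum(k * p for k, p in enumerate(pmf_values))
    return sum((k - m) ** 2 * p for k, p in enumerate(pmf_values))


n, mean, precision = 50, 0.3, 20.0  # hypothetical depth and parameters
binom = [binomial_pmf(k, n, mean) for k in range(n + 1)]
bebin = [beta_binomial_pmf(k, n, mean, precision) for k in range(n + 1)]

print("binomial variance:     ", variance(binom))
print("beta-binomial variance:", variance(bebin))
```

Both distributions expect the same number of variant reads, but the beta-binomial spreads its mass more widely; a lower precision gives more spread.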

results (real data)

True answer

PyClone (7 clusters)

naïve (12 clusters): false separation of clusters with homozygous and heterozygous mutations

cluster 1

result (ovarian cancer)

Four spatial samples from a high-grade serous ovarian cancer -> 49 deeply sequenced, validated mutations

LOH

hetero

CNV1~3

IBBMM cluster 1,2,6 should be collapsed to PyClone cluster 1 => single cell sequencing of 25

result (ovarian cancer)

IBBMM clusters 1 and 2 form one cluster (as PyClone predicted)

PyClone cluster (yellow box = cluster 1)

IBBMM

non-somatic

Conclusions

• PyClone can infer clonal population structures in cancer

1. Using beta-binomial emission densities, which models data sets with more variance in allelic prevalence measurements more effectively than a binomial model.

2. Flexible prior probability estimates ('priors') of possible mutational genotypes are used, reflecting how allelic prevalence measurements are deterministically linked to zygosity and coincident copy-number variation events.

3. Bayesian nonparametric clustering is used to discover groupings of mutations and the number of groups simultaneously. This obviates fixing the number of groups a priori and allows for cellular prevalence estimates to reflect uncertainty in this parameter.

4. Multiple samples from the same cancer may be analyzed jointly to leverage the scenario in which clonal populations are shared across samples.

Software

• Implemented in Python

• Freely available at

– http://compbio.bccrc.ca/software/request-to-download/?sw=pyClone

• License: GPL3 (free for academic use)




V-measure
