
Page 1: Copy Number Variations Jean-Baptiste Cazier  Jean-Baptiste.Cazier@well.ox.ac.uk DTC BioInformatics Course

Copy Number Variations

Jean-Baptiste Cazier

http://www.well.ox.ac.uk/dr-jean-baptiste-cazier

Jean-Baptiste.Cazier@well.ox.ac.uk

DTC BioInformatics Course Hilary Term 2010

WTHCG, Thursday 12th of February

Outline

Definitions

• Acronyms:
– CNP: Copy Number Polymorphism
– CNV: Copy Number Variation
– CNA: Copy Number Aberration, or Copy Number Alteration

• Creation: germline vs somatic
– Does the CNV come from the original cell, or did it arise later in only a few cells?

• Very many CNVs are shared across the population, like SNPs or STRs
• Somatic propagation of CNVs is a mark of cancer

Finding the missing heritability of complex diseases. TA Manolio et al. Nature 461, 747-753 (2009)

Gain, Loss, etc.

• Normal:
– 2 copies of each chromosome are inherited, one from each parent
• Deletion:
– Homozygous: 0 copies left
– Hemizygous: 1 copy left
– Sizeable event, so not an InDel
• Gain:
– Can be 3, 4, 5, … copies
– Most often nearby, but not always
– Not LINEs, SINEs, repeats, etc.
• Copy Neutral Loss of Heterozygosity
– Not a Copy Number Polymorphism per se, but needs to be addressed

Copy Number Variation in Human Health, Disease, and Evolution. Zhang F et al., Annu. Rev. Genomics Hum. Genet. 2009 (10) 451-481

Mechanisms

• 4 main mechanisms in the generation of CNVs:
– NAHR: Non-Allelic Homologous Recombination
– NHEJ: Non-Homologous End-Joining
– FoSTeS: Fork Stalling and Template Switching
– L1 retrotransposition

Copy Number Variation in Human Health, Disease, and Evolution. Zhang F et al., Annu. Rev. Genomics Hum. Genet. 2009 (10) 451-481

Characterization

• Identification: a genome-wide test
– Karyotyping
– Multi-color chromosome painting
– Comparative Genomic Hybridization (CGH)
– Array CGH (aCGH)
– "SNP" array

• Validation: a local test
– qPCR: quantitative Polymerase Chain Reaction
– MLPA: Multiplex Ligation-dependent Probe Amplification
– FISH: Fluorescence In Situ Hybridization
– Sequencing

Array technology

• Array CGH
– Agilent, Nimblegen
– 2 channels: compare hybridization level to a common background reference
– Up to 42 million probes genome-wide
  • Resolution up to 200 bp

• SNP array
– Illumina, Affymetrix
– Test one or a few samples at a time
– Initially developed for genotyping
  • 2 channels: allele A/B
– Increasing density of markers
  • From 10,000 linkage SNPs
  • Up to 5M SNPs and CNV probes

[Figure: Affymetrix]

CNV in color

– (a) Aberrations leading to aneuploidy.
– (b) Aberrations leaving the chromosome apparently intact.

[Figure: chromosome aberration types; a "SNP array" row marks aberration types with "+"]

Chromosome aberrations in solid tumors. Albertson DG et al. Nature Genetics 34, 369-376 (2003)

Revival

• Genome-Wide Association has had some success in identifying variants for many diseases:
– AMD, coeliac disease, type 2 diabetes, prostate cancer, colorectal cancer, etc.

• However, most variants are 'only' statistically significant:
– 80% fall outside of coding regions

• The case of Missing Heritability:
– However many variants are identified, they usually account for only a small proportion of the heritability

Finding the missing heritability of complex diseases. TA Manolio et al. Nature 461, 747-753 (2009)

Missing Heritability

• Need to find other "reasons" to explain the difference.

• Heritability definition
– Proportion of phenotypic variance attributable to additive genetic factors

• The Common Variant Common Disease model is challenged
– Look for more markers
  • Rarer with strong effect
  • Common with lower effect
  • Gene-gene interaction
  • Shared environment
– This is essentially a question of power
  • Groups are joining forces in very large consortia
  • Better technological coverage of the rarer variants
– More variant types
  • Copy Number Variation
  • InDels, Segmental Duplications
  • Comparable phenotyping in meta-analysis?

• The 'Dark Matter'
– Does it really exist?
– Can we see it beyond its influence?

[Figure: Feasibility of identifying genetic variants by risk allele frequency and strength of genetic effect (odds ratio).]

Finding the missing heritability of complex diseases. TA Manolio et al. Nature 461, 747-753 (2009)

SNP-array signature

• Sample data for a number of different copy number and LOH events.
– The Log R Ratio scales with copy number
– The distribution of the B allele frequency is governed by a more complex relationship with allowable genotypes

[Figure: simulated and real data panels for Gain, Neutral and Loss events]
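As a rough illustration of the two signals, the sketch below computes an idealized Log R Ratio and the allowable B allele frequency clusters for a given total copy number. It assumes a pure (unmixed) sample, a diploid reference and no noise or probe saturation, so the values are theoretical rather than what an array reports; the function names are hypothetical.

```python
import math

def expected_log_r_ratio(copy_number, reference_copies=2):
    """Idealized Log R Ratio: log2 of total signal relative to the reference."""
    if copy_number == 0:
        return float("-inf")  # homozygous deletion: no signal in theory
    return math.log2(copy_number / reference_copies)

def expected_baf_clusters(copy_number):
    """Allowable B allele frequencies: k copies of B out of copy_number."""
    if copy_number == 0:
        return []  # BAF is undefined when no DNA is present
    return [k / copy_number for k in range(copy_number + 1)]

for cn in range(0, 5):
    print(cn, expected_log_r_ratio(cn), expected_baf_clusters(cn))
```

Copy-neutral LOH is the case this table cannot express: the Log R Ratio stays at 0 while the 0.5 cluster disappears, which is why both signals are read together.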

Copy Number Loss

[Figure: SNP array and aCGH profiles]

Copy Number Loss and Gain

[Figure: SNP array and aCGH profiles]

Mixed Cell Population

[Figure: SNP array and aCGH profiles]

Copy Neutral LOH

[Figure: SNP array and aCGH profiles]

Automatic recognition of CNVs

• Originally done by visual inspection
– Problem of reproducibility
– Problem of accuracy
– With increasing probe density, inspecting everything by eye is no longer feasible

• Automation and test
– Moving average
– Probe selection / compilation
– Segmentation, Hidden Markov Model
– Significance testing

• Need to compile data with uncertainty

Moving average
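A minimal sketch of the moving-average idea, assuming the probes are already sorted by genomic position and that each value is a log2 ratio. The window size and the toy data are illustrative choices, not values from the slides.

```python
def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the ends of the chromosome."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# Toy log2 ratios: a one-copy loss (about -1) in the middle of a diploid region
log_ratios = [0.1, -0.1, 0.0, -0.9, -1.1, -1.0, -1.0, 0.1, 0.0, -0.1]
print(moving_average(log_ratios, window=3))
```

A wider window suppresses more noise but blurs breakpoints and can hide events shorter than the window.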

Automation using a Hidden Markov Model

• Automatically select the optimal copy number sequence over a chromosome to fit the model

• Evaluate the probability of the sequence of intensity signals fitting this model
– Can test various models and select the most appropriate

• The model can be trained simply by feeding it "typical" data sets
– Look for minimum number of changes
– Look for maximum instability
– Select a most likely default state
– …

[Figure: three columns of copy-number states 0, 1, 2]

Process

• Definition:
– Find the underlying states giving the observation
– Underlying states are the number of copies: 0, 1, 2, …
– Observation is the signal intensity
– Defined by 3 probabilistic entities (initial state, transition and emission probabilities)

[Figure: HMM trellis with hidden states 0, 1, 2 for each of the observations Obs1, Obs2, Obs3, Obs4, …, ObsN]
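The process above can be sketched with the Viterbi algorithm, which returns the most likely sequence of hidden copy numbers for a series of observed intensities. Everything numeric here is an assumption made for illustration: the Gaussian emission model, the state means, the noise level and the transition probabilities are not taken from any specific tool.

```python
import math

STATES = [0, 1, 2, 3]                         # hidden copy numbers
MEANS = {0: -3.0, 1: -1.0, 2: 0.0, 3: 0.58}   # assumed mean log2 ratio per state
SIGMA = 0.3                                   # assumed noise level
STAY = 0.99                                   # assumed probability of no change

def log_emission(obs, state):
    """Gaussian log density around the state's mean (up to a shared constant)."""
    return -0.5 * ((obs - MEANS[state]) / SIGMA) ** 2

def log_transition(prev, cur):
    if prev == cur:
        return math.log(STAY)
    return math.log((1.0 - STAY) / (len(STATES) - 1))

def log_initial(state):
    return math.log(0.97 if state == 2 else 0.01)  # diploid as default state

def viterbi(observations):
    """Most likely copy-number sequence given the observed intensities."""
    if not observations:
        return []
    score = {s: log_initial(s) + log_emission(observations[0], s) for s in STATES}
    back = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            pointers[cur] = best_prev
            new_score[cur] = (score[best_prev] + log_transition(best_prev, cur)
                              + log_emission(obs, cur))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

signal = [0.05, -0.1, 0.0, -0.9, -1.1, -1.0, 0.1, -0.05]
print(viterbi(signal))
```

The high probability of staying in the same state is what encodes "look for minimum number of changes"; lowering it makes the model more willing to call short events.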


Segmentation

CNAM employs a powerful optimal segmenting algorithm using dynamic programming to detect inherited and de novo CNVs on a per-sample (univariate) and multi-sample (multivariate) basis.

Unlike Hidden Markov Models, which assume the means of different copy number states are consistent, optimal segmenting properly delineates CNV boundaries in the presence of mosaicism, even at a single probe level, and with controllable sensitivity and false discovery rate.
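To make the dynamic-programming idea concrete, here is a generic penalized least-squares segmentation. It is not the CNAM algorithm, whose details are not given here; the cost function (squared error plus a fixed penalty per segment) and the names are assumptions chosen for a small, self-contained sketch.

```python
def optimal_segments(values, penalty=1.0):
    """Split values into segments minimizing squared error + penalty per segment.

    Returns a list of (start, end) index pairs, end exclusive.
    """
    n = len(values)
    if n == 0:
        return []
    # Prefix sums give the squared error of any segment in O(1)
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(values):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def sse(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best = [0.0] + [float("inf")] * n   # best[j]: minimal cost of values[:j]
    prev = [0] * (n + 1)                # prev[j]: start of the last segment
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + sse(i, j) + penalty
            if cost < best[j]:
                best[j] = cost
                prev[j] = i
    # Walk back through the stored segment starts
    segments = []
    j = n
    while j > 0:
        segments.append((prev[j], j))
        j = prev[j]
    return segments[::-1]

log_ratios = [0.0, 0.1, -0.1, -1.0, -0.9, -1.1, 0.0, 0.1]
print(optimal_segments(log_ratios, penalty=0.5))
```

Because each segment gets its own mean, no fixed level per copy-number state is assumed, which is the property that matters for mosaic samples. The penalty plays the role of the sensitivity control: a smaller value yields more, shorter segments.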


Available software

• Graphical interface:
– Agilent
– Golden Helix
– Partek
– BeadStudio/GenomeStudio
– Golf
– CNAT
– CNAG
– dChip
– PennCNV
– …

• Uneven field of quality and specificity

• Command line:
– QuantiSNP
– BirdSuite
– OncoSNP *
– …

• R packages:
– Somatics *
– DNAcopy
– Aroma
– …

* Cancer-specific tools

Development of recent array

• In 2008 McCarroll and Korn published the identification of CNPs and CNVs using the high-resolution Affymetrix SNP 6.0 array, which they helped design

SNP 6.0 by McCarroll

• "We designed a hybrid genotyping array (Affymetrix SNP 6.0) to simultaneously measure 906,600 SNPs and copy number at 1.8 million genomic locations. By characterizing 270 HapMap samples, we developed a map of human CNV (at 2-kb breakpoint resolution) informed by integer genotypes for 1,320 copy number polymorphisms (CNPs)" (McCarroll)

• Published the analysis together with the chip design and an algorithm suite: BirdSuite
– Performs both genotyping and CNV identification
– First calls known CNPs
– Then looks for new CNVs

• 80% of observed copy number differences are due to common CNPs (MAF > 5%)
• > 99% derive from inheritance rather than new mutation
• Found a common deletion polymorphism in perfect LD with Crohn's disease SNPs
– 2 kb upstream of IRGM
– Affects the level of expression

High density of probes

• Can identify smaller events
– E.g. important to spot residual events in translocations/fusion genes

• Gain confidence in SNP regions by increasing the number of probes

• Can get better resolution, i.e. more accurate breakpoints:
– Can split existing large regions into smaller ones

• Better coverage of CNPs
– These regions were mostly not covered by SNP-only arrays
– Beware of overrepresentation of these regions

• Tiling across the genome
– More exhaustive picture

Increase density

[Figure: copy number profiles (axis ticks 1, 2, 4) for the 10K, 250K Nsp, 250K Sty and SNP 6.0 arrays]

Loss of a 65 kb region confidently identified only with SNP 6.0. Bryan Young et al., Cancer Research UK

Too much data?

[Figure: Log 2 Ratio I, Log 2 Ratio II, t-test on Run I, t-test on Run II, and summation of I and II; copy number axis ticks 1, 2, 4]

Replicates increase the signal-to-noise ratio and help avoid false positives and false negatives. But it costs twice as much!

Potential Issues

• Interpretation
– What to use as a baseline? i.e. define the Ratio

• Variations in probe coverage:
– Gaps
– Overlapping probes

• Inaccurate reference
– Reference build is inaccurate
– Probes cannot match the locus accurately

• Systematic error
– Autocorrelation with GC content
– Preparation, e.g. genome amplification

Overlapping probes in regions of CNP

Probes in repeat elements

SNPs in probes

• The special case of rodents:

• There can be many strains derived from a limited number of founders
– Full sequencing has been limited
– The reference used for probe generation can be far from the strain tested
– This will lead to failures across the genome

Gauguier et al., in preparation

Systematic SNPs in probes

• There can be mosaicism
– Grouping of SNPs in specific regions

• Generates systematic drops in hybridization at specific loci

• Can be misinterpreted as a deletion
– Be aware of the regions with SNPs
  • And correct for the lack of hybridization
– Design specific probes for the strain

Gauguier et al., in preparation

Recent CNV Survey

• Recently, 2 projects started in parallel to identify and characterize CNVs in humans:
– The Genome Structural Variation Consortium (GSV)
  • CNV discovery project to identify common CNVs using aCGH by Nimblegen
  • Detection in 20 CEU, 20 YRI, 1 reference
  • Assayed in 450 HapMap samples
– The Wellcome Trust Case Control Consortium (WTCCC)
  • Test for association of CNVs with diseases in the WTCCC
    – 16,000 cases: WTCCC plus breast cancer
    – 3,000 common controls

The GSV study design

The GSV study outcome

[Figure panels: Localization; Function of CNVs]

The GSV study outcome (II)

• Designed an array with 42 million probes
– Covers 11,700 CNVs larger than 443 bp
– 8,599 validated independently

• Generated reference genotypes for 4,978 CNVs on 450 samples

• Identified 30 loci with CNVs that are candidates for influencing phenotype
• Striking effect of purifying selection
– Acts on exonic and intronic deletions
– So functional variants should be rare

• But most common CNVs are already well tagged by existing SNP arrays
– May need to look elsewhere to solve the missing heritability

The WTCCC study

• Use the WTCCC cohort of 16,000 samples and 3,000 common controls
– Bipolar disorder, type 1 diabetes, type 2 diabetes, coronary artery disease, hypertension, rheumatoid arthritis, Crohn's disease + breast cancer
– Controls: 1,500 from the 1958 Birth Cohort and 1,500 National Blood Donors

• Designed a specific array using the GSV set, McCarroll, 1M and WTCCC1
– 104,000 probes targeting 12,000 putative loci

• Performed the assay on the Agilent platform, run by Oxford Gene Technology (OGT), against a common pooled reference sample

• Attempted to design a robust pipeline to call all CNVs across the different studies
– Used CNVtools by Plagnol and a local method by Cardin ("Chiamesque")

http://www.wtccc.org.uk/ccc1/plus_typing_array.shtml

The WTCCC results

• 3,900 CNVs identified
• 3,100 validated after QC

• Concordance of 99.8% on 420 known duplicates

• Remaining 8,000 CNVs from the original selection:
– False positives in discovery
– Too noisy, but genuine
– Genuine but very rare

• 19 CNVs taken forward to replication on Bayes Factor (~10^-4 p-value)
– 14 failed to replicate, using either tagged SNPs or direct typing
– 5 associations

The WTCCC conclusions

• Each CNV behaves uniquely
  • Size, genomic location, biological sample type, sample preparation
– Designed 16 different pipelines
  • Key parameters:
    – Normalization
    – Integration of the 10 probes
  • Impossible to define a one-pipeline-fits-all approach
– Shows the importance of having duplicates and a large amount of diverse data

• Confirmed the overrepresentation of CNVs in intronic regions

• Confirmed the high level of tagging with SNP 6.0 or HapMap2
– MAF > 10%: 75% tagged at r² > 0.8
– MAF < 5%: 40% tagged at r² > 0.8

• Found few new CNVs associated with phenotype

Conclusions of these studies

• Both identified many CNVs in the human genome

• Characterization of CNVs is very difficult, and not easily streamlined
– Careful interpretation of association results
– Some artifacts will survive confirmation

• Many CNVs co-localize with variants identified by GWAS
– Good functional candidates

• But most of the common CNVs are already well tagged by SNPs
– This will not bring new common variants in common disease
  • i.e. these will not solve the mystery of missing heritability

• Rare CNVs can still be associated with diseases, but just as much as SNPs

What with CNV then?

• Copy Number Variations are key in cancer

• Cancers are typical of somatic variations
– They are therefore mostly unique
– Cannot be tagged
– Relatively common event
– Although still difficult, identifying them is essential

Cancer

Schematic illustration of chromosomal evolution in human solid tumor progression.

The stages of progression are arranged with the earlier lesions at the top.

Cells may begin to proliferate excessively owing to loss of tissue architecture, abrogation of checkpoints and other factors. In general, relatively few aberrations occur before the development of in situ cancer.

A sharp increase in genome complexity (the number of independent chromosomal aberrations) in many (but not all) tumors coincides with the development of in situ disease.

The types and range of aberration number vary markedly between tumors: HCT116, a mismatch repair–defective cell line; T47D, a mismatch repair–proficient cell line.

Chromosome aberrations in solid tumors. Albertson DG et al. Nature Genetics 34, 369-376 (2003)

Germline vs. Somatic

• Germline variants
– The aberration exists from the start, and is inherited
– Such variants are more likely to be common Copy Number Polymorphisms, predisposing variants
– Approach similar to non-cancer studies

• Somatic events
– Aberrations happen during the lifetime
– Happen more than once
– Heterogeneous events => each cancer is unique
– In tumours, recurrent aberrations are more likely to be linked to the cancer as a selective advantage

We want to identify the regions with recurrent events

More issues

• Interpretation
– What to use as a baseline? i.e. define the Ratio
  • A within-sample baseline of 2 copies is no longer a safe assumption

• Heterogeneity of tissue
– A biopsy can be "contaminated" by normal tissue
– Cancers are usually made up of a set of co-existing clones

• CNVs are unique
– Each one has its own breakpoints

• Systematic error
– Preparation, e.g. genome amplification
– Sample quality


Copy Number Variations in Cancer

• It is possible to analyse tumour samples using classic Copy Number tools, but the results are likely to be unsatisfactory as many model assumptions are violated:

– The normalisation of SNP genotyping data can be affected by tumour samples containing large scale chromosomal alterations.

– Most aberrations do not follow the classic diploidy and cannot fit usual clusters

– So Genotype Calls might be forced on the wrong model AA/AB/BB:

• Deletions should be 0 or A / B,

• Copy Neutral LOH should be AA/BB

• Triploid should be AAA/AAB/ABB/BBB

– There can be intra-tumour heterogeneity

• E.g. Mix of triploid and tetraploid

– There can be contamination with normal cells (stromal contamination)

Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs.

Korn et al. Nat Genet. 2008 Oct;40(10):1253-60


A deletion found in tumour AML sample at 8p using unpaired analysis.

[Figure: copy number profile (axis ticks 1, 2, 4), tumour sample vs baseline]

Same deletion found in corresponding diagnostic AML sample at 8p

[Figure: copy number profiles (axis ticks 1, 2, 4): tumour sample vs baseline, and normal sample vs baseline]

Need for pairing

[Figure: copy number profiles (axis ticks 1, 2, 4): tumour sample vs baseline, normal sample vs baseline, and tumour sample vs normal sample]
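One way to read these three panels: when tumour and normal are each measured against the same baseline, subtracting the two log2 ratios gives the tumour-vs-normal profile, in which germline events cancel and only somatic ones remain. The sketch below assumes both profiles cover the same probes in the same order; the data are invented for illustration.

```python
def paired_log_ratio(tumour_vs_baseline, normal_vs_baseline):
    """Log2(tumour/normal) per probe, from two log2 ratios against one baseline."""
    if len(tumour_vs_baseline) != len(normal_vs_baseline):
        raise ValueError("both profiles must cover the same probes")
    return [t - n for t, n in zip(tumour_vs_baseline, normal_vs_baseline)]

# Toy profiles: probes 2-3 carry a germline loss (seen in both samples),
# probes 5-6 carry a somatic loss (seen in the tumour only).
tumour = [0.0, 0.0, -1.0, -1.0, 0.0, -1.0, -1.0, 0.0]
normal = [0.0, 0.0, -1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
print(paired_log_ratio(tumour, normal))
```

With real data the subtraction also adds the noise of the two arrays together, so the paired profile is cleaner in what it shows but not in how noisy it is.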

Outliers, Batches

[Figure: PCA on 118 samples]

Batch effect

[Figure: outlier removed, colored by batch]

Type: Normal vs Tumour

[Figure: outlier removed, colored by type]

Pairs

[Figure: outlier removed, colored by type, paired]

Need to check pairing

Genotype information allows QC of patient samples

• Checking the samples by clustering on genotypes:
– The first 4 pairs group together as expected
– The last two samples were identified as two different individuals
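A simple stand-in for the clustering check is pairwise genotype concordance: a tumour and its matched normal should agree at most SNPs, while samples from different individuals should not. The call encoding ("AA", "AB", "BB", with "NC" for no-call) and the example data are assumptions for illustration.

```python
def genotype_concordance(calls_a, calls_b, missing="NC"):
    """Fraction of SNPs with identical genotype calls, ignoring no-calls."""
    compared = matched = 0
    for a, b in zip(calls_a, calls_b):
        if a == missing or b == missing:
            continue
        compared += 1
        matched += (a == b)
    return matched / compared if compared else float("nan")

tumour = ["AA", "AB", "BB", "AB", "NC", "AA"]
normal = ["AA", "AB", "BB", "AA", "AB", "AA"]
print(genotype_concordance(tumour, normal))  # 4 of 5 compared calls agree
```

A true pair will rarely reach 100%, since LOH in the tumour turns AB calls into AA or BB, so any threshold for "same individual" has to leave room for that.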


Heterogeneity

• Proportion of Cells, “c,” in a heterogeneous tumour sample harboring a Somatic genetic event

• BAF and the logR ratio plots from one chromosome reveal three somatic hemizygous deletions occurring in three different proportions of cells.

• Frequency distribution showing the number of SNPs included in the somatic deletions by the proportion of cells, “c,” in which these events occur. Some somatic deletions occur in over 80% of cells. Assuming that only cancer cells harbor somatic deletions, the proportion of cancer cells is then estimated as 80% in this sample.

• Schematic illustrating the relationship between the chronology of somatic events during tumorigenesis and the proportion of cancer cells with these events. Early somatic events are present in all (or a great majority of) cancer cells, whereas late somatic events are only present in subsets of cells.

SNP arrays in heterogeneous tissue: highly accurate collection of both germline and somatic genetic information from unpaired single tumor samples.

Assié et al Am J Hum Genet. 2008 Apr;82(4):903-15
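The relationship between the proportion of cells "c" and the two signals can be sketched for the simplest case, a hemizygous deletion. This is an idealized mixture model, assuming the signal is proportional to the average number of copies per cell and ignoring noise and array compression; it is not the estimation procedure of the cited paper, and the function names are hypothetical.

```python
import math

def hemizygous_deletion_signature(c):
    """Idealized signals when a fraction c of cells lost one copy.

    Returns (lower BAF band, upper BAF band, log R ratio) for SNPs that are
    heterozygous in the germline.
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie between 0 and 1")
    total = 2.0 - c                     # average copies per cell
    lower = (1.0 - c) / total           # B allele is the deleted one
    upper = 1.0 / total                 # A allele is the deleted one
    return lower, upper, math.log2(total / 2.0)

def proportion_from_baf(lower_band):
    """Invert lower = (1 - c) / (2 - c) to recover c from the lower BAF band."""
    if not 0.0 <= lower_band <= 0.5:
        raise ValueError("the lower band must lie between 0 and 0.5")
    return (1.0 - 2.0 * lower_band) / (1.0 - lower_band)

for c in (0.0, 0.5, 0.8, 1.0):
    print(c, hemizygous_deletion_signature(c))
```

The split of the heterozygous BAF band widens as c grows, which is why three deletions present in different proportions of cells show three different band separations on the same chromosome.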


Mixing proportion identification

• Estimating copy number and mixing proportions from simulated data using OncoSNP.

• The estimated copy number states and mixing proportions (grey) are comparable to the true values used for the simulations (black).

• In the two regions of copy number 3 that are incorrectly classified as copy number 4, an examination of the Bayes Factor shows that although the data favor the 4n amplification state, there is also strong support for the true state (3n amplification).

Identification of DNA copy number changes and loss-of-heterozygosity events in heterogeneous tumor samples: a Bayesian Mixtures of Genotypes approach on SNP array data. Yau C et al., in preparation

Normal-Tumour Titration

• Intra-tumor heterogeneity model (red)

• Stromal contamination only model (black)

• Both models infer the level of normal DNA contamination with good accuracy up to 50% contamination

• At higher contamination levels, the stromal contamination only model has superior performance as it is able to borrow strength from all SNPs to infer the contamination level.

• This provides more power to detect duplications at high contamination levels than the intra-tumor heterogeneity model.

Identification of DNA copy number changes and loss-of-heterozygosity events in heterogeneous tumor samples: a Bayesian Mixtures of Genotypes approach on SNP array data. Yau C et al., in preparation

Detection of alterations

Detecting chromosomal alterations in cancer cell line and tumor samples.

The intra-tumor heterogeneity model (red) indicates that approximately 50% of cells contain a different breakpoint location from the others, whereas this feature is missed entirely by the stromal contamination only model (black).

The near-triploid status of the cell line HT29 is correctly identified and copy number estimates are correctly derived even though the Log R Ratios are centered on zero for the copy number 3 state.

The two heterogeneous deletions are separated by an unaltered region; however, there is still good agreement between the mixing proportion estimates given by the intra-tumor heterogeneity and stromal-only models. This suggests we do not pay too severely when assuming independent mixing proportions in the intra-tumor heterogeneity model.

Identification of DNA copy number changes and loss-of-heterozygosity events in heterogeneous tumor samples: a Bayesian Mixtures of Genotypes approach on SNP array data. Yau C et al., in preparation

Recurrent events

Overview of all genetic aberrations found with SNP array in 45 adult and adolescent ALL cases. Minimally involved regions are shown to the right of each chromosome.

For each type of aberration, each line represents a different case.

– Blue lines are regions of uniparental disomy
– Light green lines are hemizygous deletions
– Dark green lines are homozygous deletions
– Red lines are copy-number gains

Note the high frequency of deletions involving chromosome bands 9p21.3, 9p13.2, 7p12.2, 12p13.2, and 13q14.2, corresponding to the CDKN2A, PAX5, IKZF1, ETV6, and RB1 loci, respectively.

http://www.well.ox.ac.uk/~jcazier/GWA_Viewer.html

Microdeletions are a general feature of adult and adolescent acute lymphoblastic leukemia: Unexpected similarities with pediatric disease. Paulsson K et al., Proc Natl Acad Sci U S A. 2008 May 6;105(18):6708-13
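Finding a minimally involved region amounts to counting, along the chromosome, how many cases carry an overlapping event. The sketch below does this for half-open intervals on a single chromosome; the case names and coordinates are invented, and this is not the method used by the viewer linked above.

```python
def recurrence_profile(events):
    """Count, for each stretch of a chromosome, how many cases carry an event.

    events: list of (case_id, start, end) half-open intervals on one chromosome.
    Returns a list of (start, end, n_cases) for stretches covered by >= 1 case.
    """
    breakpoints = sorted({pos for _, start, end in events for pos in (start, end)})
    profile = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        # A set of case ids, so several events in one case count once
        cases = {case for case, start, end in events if start <= left and right <= end}
        if cases:
            profile.append((left, right, len(cases)))
    return profile

# Toy deletions (positions in Mb) in three hypothetical cases
deletions = [("case1", 20.0, 23.0), ("case2", 21.5, 22.5), ("case3", 21.0, 25.0)]
profile = recurrence_profile(deletions)
print(max(profile, key=lambda stretch: stretch[2]))  # most recurrent stretch
```

Each aberration type (gain, hemizygous loss, homozygous loss, UPD) would be profiled separately, as in the figure.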

Overlap of recurrences

• Aberrations observed on chromosomes 11 and 13 are shown with their bands, a subset of potential target genes in AML, and regions of:
– gain (red)
– loss (green)
– aUPD (blue)

• The scale at the bottom shows the length of each chromosome in megabases (Mb). The color gradient above each kind of aberration summarizes the data for that aberration.

• Beware that GC content can induce systematic, falsely identified aberrations

Novel regions of acquired uniparental disomy discovered in acute myeloid leukemia. Gupta et al. Genes Chromosomes Cancer. 2008 Sep;47(9):729-39.

Typical workflow

• Normalisation
– GC content correction
– Paired
– Unpaired with appropriate baseline

• Determination of aberrations
– Correct genotype
– Copy number

• Identification of recurrent locations
• Test against germline sample if possible
– Could it be an at-risk variant?

• Test against known variations
• Validation
– Identify breakpoints precisely
  • Sequencing
– Identify the frequency
– Identify the associated risk
– Perform functional analysis
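As an example of the first normalisation step, the sketch below removes a linear dependence of the log ratio on probe GC content. A straight-line fit is an assumption made to keep the example short; real pipelines often use a smoother such as loess, and the GC value per probe is taken as given.

```python
def gc_correct(log_ratios, gc_content):
    """Remove the linear trend of log ratio against probe GC content."""
    n = len(log_ratios)
    if n != len(gc_content) or n < 2:
        raise ValueError("need two equally long lists with at least 2 probes")
    mean_gc = sum(gc_content) / n
    mean_lr = sum(log_ratios) / n
    var_gc = sum((g - mean_gc) ** 2 for g in gc_content)
    if var_gc == 0:
        return list(log_ratios)         # no GC variation, nothing to correct
    cov = sum((g - mean_gc) * (r - mean_lr) for g, r in zip(gc_content, log_ratios))
    slope = cov / var_gc
    # Subtract only the GC-dependent part so the overall level is preserved
    return [r - slope * (g - mean_gc) for r, g in zip(log_ratios, gc_content)]

gc = [0.30, 0.40, 0.50, 0.60, 0.70]
raw = [-0.2, -0.1, 0.0, 0.1, 0.2]       # toy signal driven purely by GC content
print(gc_correct(raw, gc))
```

In a tumour with large aberrations the fit should ideally be estimated on regions believed to be copy-neutral, otherwise real gains and losses in GC-rich or GC-poor regions can leak into the correction.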

Summary

• CNPs are very common in the human genome
– It is easier to find a functional role for them

• Common ones are well tagged by existing markers
– They do not bring many new loci, but may bring function

• Hard to characterize uniformly

• Few proven functional so far

• Still key in cancer
– More challenging to identify
– More essential for understanding

Future

• Catalogue of CNPs
– GSV and WTCCC effort
– Use of the 1000 Genomes Project

• Methods
– Improvements of the algorithms
– Improvements of the computing power

• Other technologies
– Use of expression data
– Use of clonal sequencing
– Single-molecule sequencing

Useful references

• Collections of known aberrations:
– Mitelman Database of Chromosome Aberrations in Cancer
  • http://cgap.nci.nih.gov/Chromsomes/Mitelman
  • Cytogenetically confirmed
– Database of Genomic Variants
  • Zhang, J. et al. Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet. Genome Res. (2006).
– Redon, R. et al. Global variation in copy number in the human genome. Nature (2006).
– Iafrate, A.J. et al. Detection of large-scale variation in the human genome. Nat. Genet. (2004).
– McCarroll & Korn (2008)
  • Based on SNP 6.0 in 270 HapMap samples
– Genome Structural Variation Consortium
  • Conrad D et al. Origins and functional impact of copy number variation in the human genome. Nature (2009)

Practical

• R packages:
– DNAcopy:
  • A package for analyzing DNA copy data
  • A faster circular binary segmentation algorithm for the analysis of array CGH data. Venkatraman, E. S. and Olshen, A. B. (2007). Bioinformatics 23: 657-663
– snapCGH:
  • Segmentation, Normalization and Processing of aCGH Data
  • BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Marioni, J. C., Thorne, N. P., and Tavaré, S. (2006). Bioinformatics 22: 1144-1146
– BeadarraySNP:
  • Package for the analysis of Illumina genotyping BeadArray data
  • High-resolution copy number analysis of paraffin-embedded archival tissue using SNP BeadArrays. Oosting J et al. Genome Res. 2007 Mar;17(3):368-76

• Web interface:
– Integration of CNV results across multiple samples
– http://www.well.ox.ac.uk/~jcazier/GWA_Viewer.html