Array Informatics, Mark Gerstein
Do not reproduce without permission 1 Lectures.GersteinLab.org (c) 2007 Array Informatics Mark Gerstein

Post on 30-Jan-2016


DESCRIPTION

Array Informatics, Mark Gerstein. CEGS Informatics: Developing Tools and Technical Analyses Related to Genome Technologies. Main Genome Technologies: Tiling Arrays, Next Generation Sequencing. Main Applications: Transcript mapping, Protein-DNA Binding, CGH. Transitioning to Seq.

TRANSCRIPT

Page 1: Array Informatics Mark Gerstein

Array Informatics

Mark Gerstein

Page 2: Array Informatics Mark Gerstein

CEGS Informatics: Developing Tools and Technical Analyses Related to Genome Technologies

• Main Genome Technologies: Tiling Arrays, Next Generation Sequencing

• Main Applications: Transcript mapping, Protein-DNA Binding, CGH

• Transitioning to Seq....

Page 3: Array Informatics Mark Gerstein

Tools & Tech. Analyses for Processing of Genome Technology Data

• Normalizing Arrays and Measuring & Correcting Artifacts

    COP - Correcting positional artifacts [Yu et al. NAR '07]

    Efficient Pseudomedian Calculation - for Tiling Array Scoring [Royce et al., BMC Bioinfo. '07]

    Measuring Mismatch Effects [Seringhaus et al., BMC Genomics (submitted)]

    Removing Seq. Effects [Royce et al., Bioinfo. '07]

    NN Prediction of Probe Intensity - measuring & exploiting specific cross-hyb [Royce et al. NAR '07]

• Simulating NextGen Sequencing

    ChipSeqSim - simulating ChIP Seq [Zhang et al., PLoS CB '08]
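
The pseudomedian used for tiling array scoring is the Hodges-Lehmann estimator: the median of all pairwise averages of the values in a window. The sketch below is a naive quadratic-time version meant only to show the definition; it is not the efficient algorithm of Royce et al., and the function names and the window measured in probes are assumptions made for illustration.

```python
import numpy as np

def pseudomedian(values):
    """Hodges-Lehmann estimator: median of all pairwise averages (i <= j)."""
    x = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(x))          # all index pairs with i <= j
    return float(np.median((x[i] + x[j]) / 2.0))

def sliding_pseudomedian(intensities, half_width):
    """Score each probe by the pseudomedian of its neighbouring probes.

    The window is counted in probes (an assumption); real tiling-array
    scoring usually defines the window in genomic coordinates.
    """
    x = np.asarray(intensities, dtype=float)
    return np.array([
        pseudomedian(x[max(0, k - half_width): k + half_width + 1])
        for k in range(len(x))
    ])
```

Because every pair in the window is averaged, the cost grows with the square of the window size, which is the motivation for a more efficient calculation.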

Page 4: Array Informatics Mark Gerstein

Tools & Tech. Analyses for Genome Structural Variation

BreakPtr - HMM-based Array Segmentation for CNV detection [Korbel et al., PNAS '07]

MSB - Mean-shift-based Array Segmentation for CNV detection with extension to sequencing [Wang et al. Gen. Res. (submitted)]

PEMer - Paired-end Mapping for SV Detection with simulation calibration and breakpoint DB [Korbel et al., GenomeBiol. (submitted)]

Long-SV-Assembly Simulations [Du et al., Nat. Meth. (submitted)]

SD-CNV-CORR - Approach for correlating the occurrence of CNVs and SDs with genomic features (particularly repeats) [Kim et al., Genome Res. (submitted)]

Page 5: Array Informatics Mark Gerstein

(c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu

A Starting Point: Noisy Raw Signal from Tiling Arrays (Transcription)

Johnson et al. (2005) TIG, 21, 93-102.

Li et al., PLOS one (2007)

Page 6: Array Informatics Mark Gerstein

Specific & Non-specific Cross-Hyb.

[Figure panels: Perfect Match; Specific Cross-hyb.; Non-specific Cross-hyb.]

• Perfect match (PM): probe binding intended target

• Specific cross-hyb.: probes binding non-PM targets with a small number of mismatches

• Non-specific cross-hyb.: probes binding targets with many mismatches, due to general stickiness of oligos

Page 7: Array Informatics Mark Gerstein

Non-Specific Cross Hyb. (Sequence Effects)

Page 8: Array Informatics Mark Gerstein

Creation of Standardized Datasets for Quantifying Effect of Mismatches

[Seringhaus et al., BMC Genomics (in press)]

Types of Mispairs (probe on array is first)

[Figure: normalized intensity, MM vs. PM, for yeast and human; mispair types compared: GA v AG, CA v AC, GT v TG, CT v TC]

Page 9: Array Informatics Mark Gerstein

Source: Royce, T.E., et al (2007), Bioinformatics, 23, 988-97

Avg. intensity of all background probes with a C at position 4

Avg. intensity of all background probes with a T at position 33

Observing Non-specific Cross-hyb. (Probe sequence effects)
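
The averages named on this slide (for example, all background probes with a C at position 4) can be tabulated for every position and nucleotide at once. A minimal sketch, assuming equal-length probe sequences and one intensity per probe; the function name and input layout are hypothetical.

```python
import numpy as np

def mean_intensity_by_position(seqs, intensities):
    """Return {(position, nucleotide): mean intensity}, positions 1-based."""
    seq_arr = np.array([list(s) for s in seqs])   # shape: probes x probe length
    x = np.asarray(intensities, dtype=float)
    means = {}
    for pos in range(seq_arr.shape[1]):
        for nt in "ACGT":
            mask = seq_arr[:, pos] == nt
            if mask.any():                        # skip nucleotides absent at this position
                means[(pos + 1, nt)] = float(x[mask].mean())
    return means
```

If probe sequence had no effect on background signal, the four means at each position would be roughly equal; systematic differences are the sequence effect the slide refers to.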

Page 10: Array Informatics Mark Gerstein

Iterated Quantile Normalization to Correct for Non-specific Cross-hyb.

• Adapt Bolstad et al (2003) approach to tiling arrays

• Force distributions with a given nt at each position to be same

• Distributions at other positions now different so iterate

• Also, robust adaptation of Naef & Magnasco (2003)

T Royce et al (2007), Bioinformatics, 23, 988-97
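
The bullets above can be sketched as a loop: at each probe position, map the intensities of the A, C, G and T groups onto one common distribution, then repeat over positions and over several passes. This is an illustrative reading of the slide, not the published implementation; using the pooled intensities as the reference distribution, the fixed number of passes, and the function names are all assumptions.

```python
import numpy as np

def quantile_match(values, reference_sorted):
    """Map values onto the reference distribution by rank."""
    n = len(values)
    ranks = np.argsort(np.argsort(values))             # 0 .. n-1
    probs = (ranks + 0.5) / n                           # mid-rank quantile levels
    m = len(reference_sorted)
    ref_probs = (np.arange(m) + 0.5) / m
    return np.interp(probs, ref_probs, reference_sorted)

def iterated_quantile_normalize(seqs, intensities, n_iter=3):
    """Force the intensity distribution to be the same for each nucleotide
    at each position, iterating because fixing one position disturbs others."""
    x = np.asarray(intensities, dtype=float).copy()
    seq_arr = np.array([list(s) for s in seqs])         # probes x probe length
    for _ in range(n_iter):                             # number of passes is an assumption
        for pos in range(seq_arr.shape[1]):
            reference = np.sort(x)                      # pooled distribution (assumption)
            for nt in "ACGT":
                mask = seq_arr[:, pos] == nt
                if mask.any():
                    x[mask] = quantile_match(x[mask], reference)
    return x
```

A convergence check (for example, stopping once intensities change by less than some tolerance between passes) could replace the fixed `n_iter`.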

Page 11: Array Informatics Mark Gerstein

Measuring Specific Cross-Hyb

Source: Royce, T.E., et al (2007), Nucleic Acids Res., 35, e99

Page 12: Array Informatics Mark Gerstein

Proof of principle test to “exploit” this

Figure from http://www.members.cox.net/amgough/Fanconi-genetics-genetics-primer.htm

• Using Cheng et al. (2005), predict gene expression levels (and profiles across tissues) for genes on part of chr. #6

• ...Based on closest cross-hyb tiles on part of chr. #7

• Then compare to measured levels and profile on #6

Source: Royce, T.E., et al (2007), Nucleic Acids Res., 35, e99

Page 13: Array Informatics Mark Gerstein

Nearest Nbr Search on Virtual Tiling

Royce, T.E., et al (2007), Nucleic Acids Res., 35, e99
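
A nearest-neighbour search of this kind can be illustrated as follows: for a sequence that is not on the array (a "virtual" tile), find the array probe with the fewest mismatches and borrow its measured profile as the prediction. This is a toy sketch; the Hamming distance on equal-length sequences, the brute-force scan, and the function names are assumptions rather than the method of the cited paper.

```python
import numpy as np

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(ch1 != ch2 for ch1, ch2 in zip(a, b))

def predict_from_nearest_probe(query_seq, probe_seqs, probe_profiles):
    """Predict a profile for query_seq from the closest probe on the array.

    Returns the borrowed profile and the mismatch count to that probe.
    """
    distances = [hamming(query_seq, p) for p in probe_seqs]
    best = int(np.argmin(distances))                  # first probe with fewest mismatches
    return np.asarray(probe_profiles[best], dtype=float), distances[best]
```

Usage: `profile, d = predict_from_nearest_probe("ACGTACGA", probes, profiles)` returns the profile of the closest probe and how many mismatches separate it from the query.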

Page 14: Array Informatics Mark Gerstein

Agreement between predicted tile expression profile and actual one

Source: Royce, T.E., et al (2007), Nucleic Acids Res., 35, e99

• Correlated predicted profiles with the actual profiles of gene expression across cell lines

• Much more correlation than expected by chance (dist. centered on 0)
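
The comparison in these bullets can be sketched as a per-gene Pearson correlation between predicted and actual profiles, set against a chance distribution obtained by pairing each prediction with a randomly chosen gene's actual profile. The shuffling scheme and function names are assumptions; the slide only states that the chance distribution is centered on 0.

```python
import numpy as np

def profile_correlations(predicted, actual):
    """Pearson r between each predicted row and the matching actual row."""
    p = np.asarray(predicted, dtype=float)
    a = np.asarray(actual, dtype=float)
    p = p - p.mean(axis=1, keepdims=True)
    a = a - a.mean(axis=1, keepdims=True)
    return (p * a).sum(axis=1) / np.sqrt((p ** 2).sum(axis=1) * (a ** 2).sum(axis=1))

def chance_correlations(predicted, actual, seed=0):
    """Null distribution: predictions paired with shuffled actual profiles."""
    rng = np.random.default_rng(seed)
    shuffled = np.asarray(actual, dtype=float)[rng.permutation(len(actual))]
    return profile_correlations(predicted, shuffled)
```

Rows are genes and columns are cell lines; plotting histograms of the two returned arrays gives the kind of comparison described on the slide.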

Page 15: Array Informatics Mark Gerstein

Royce, T. E. et al. Nucl. Acids Res. 2007 35:e99

Very Strong ROC Curve: Most genes are accurately detected using nearest-neighbor features' signals

• Illustrates great magnitude of cross-hyb. on hi-density arrays

• High feature density arrays inadvertently resurrecting generic n-mer concept (van Dam & Quake, 2003)

• Suggests that tiling arrays could be exploited to create universal arrays

• Gold std. set of known expressed genes: how well do we find them?

• A set of known positives was defined as the RefSeq genes with at least 75% transfrag coverage. A set of known negatives was constructed by permuting the sequences in the set of known positives. For various thresholds, sensitivity and specificity were computed and then plotted.
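
The threshold sweep described in the last bullet can be sketched as below: given scores for the known positives and for the permuted negatives, compute sensitivity and specificity at each threshold. The score itself (some summary of nearest-neighbour signal per gene) and the function names are assumptions.

```python
import numpy as np

def roc_points(pos_scores, neg_scores):
    """Sensitivity and specificity when calling 'expressed' at score >= threshold."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    thresholds = np.unique(np.concatenate([pos, neg]))[::-1]   # high to low
    sensitivity = np.array([(pos >= t).mean() for t in thresholds])
    specificity = np.array([(neg < t).mean() for t in thresholds])
    return thresholds, sensitivity, specificity

def auc(pos_scores, neg_scores):
    """Area under the ROC curve: chance that a random positive outscores
    a random negative, with ties counted as one half."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())
```

Plotting sensitivity against 1 - specificity gives the ROC curve; an area near 1 corresponds to the "very strong" curve in the slide title.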

Page 16: Array Informatics Mark Gerstein

CEGS Informatics Credits

• Array Corrections: J Rozowsky, T Royce, M Seringhaus

• PEMer, SD-CNV, BreakPtr: P Kim, J Korbel, J Du, X Mu, A Abyzov, N Carriero

• Experimental: M Snyder, S Weissman, A Urban

Page 17: Array Informatics Mark Gerstein

Computational Methods for SV Characterization

Mark Gerstein

Page 18: Array Informatics Mark Gerstein

Computational Methods for SV Characterization

Segmenting Array CGH data

Building a PEM pipeline

Correlating SVs and SDs with Repeats

Page 19: Array Informatics Mark Gerstein

http://breakptr.gersteinlab.org

Korbel*, Urban* et al., PNAS (2007)

[Figure: fluorescence log2 ratio (axis from -0.5 to 0.5) along an example DNA sequence; not to scale]

BreakPtr HMM

• To get highest resolution on breakpoints need to smooth & segment the signal

• BreakPtr: prediction of breakpoints, dosage and cross-hybridization using a system based on Hidden Markov Models
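
To make the idea of HMM-based segmentation concrete, here is a much simplified sketch: a three-state HMM (deletion, normal, duplication) with Gaussian emissions on the log2 ratios, decoded with the Viterbi algorithm. It is not BreakPtr itself; the state means, standard deviation, and self-transition probability are made-up illustrative values, and sequence signatures are ignored.

```python
import numpy as np

STATES = ("deletion", "normal", "duplication")

def viterbi_segment(log2_ratios, means=(-0.5, 0.0, 0.5), sd=0.2, stay=0.99):
    """Most likely state path for a series of array log2 ratios."""
    x = np.asarray(log2_ratios, dtype=float)
    n, k = len(x), len(means)
    # Gaussian log-emissions (shared constant terms dropped)
    log_emit = -0.5 * ((x[:, None] - np.asarray(means)[None, :]) / sd) ** 2
    log_trans = np.full((k, k), np.log((1.0 - stay) / (k - 1)))
    np.fill_diagonal(log_trans, np.log(stay))
    score = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    score[0] = np.log(1.0 / k) + log_emit[0]
    for t in range(1, n):
        cand = score[t - 1][:, None] + log_trans     # cand[i, j]: state i -> state j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]
    path = np.empty(n, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(n - 2, -1, -1):                   # trace back the best path
        path[t] = back[t + 1, path[t + 1]]
    return path

def breakpoints(path):
    """Probe indices at which the decoded state changes."""
    return [int(i) for i in np.flatnonzero(np.diff(path)) + 1]
```

The high self-transition probability is what smooths the noisy signal: a single outlying probe is not enough to pay the cost of entering and leaving another state.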

Page 20: Array Informatics Mark Gerstein

High resolution of tiling arrays allows statistical integration of nucleotide sequence patterns

>4-fold enrichment of the breakpoints of copy number variants near segmental duplications (SDs)[e.g. Sharp et al., Am. J. Hum. Genet. 2005; 77:78-88].

Page 21: Array Informatics Mark Gerstein

BreakPtr statistically integrates array signal and DNA sequence signatures (using a discrete-valued bivariate HMM)

Korbel*, Urban* et al., PNAS (2007)

[Figure: HMM state diagram with states Duplication, Normal and Deletion, linked by transitions A, A’, B and B’; each state is associated with sequence and array values]

Page 22: Array Informatics Mark Gerstein

[Figure: workflow linking gold standards, training data, parameter optimization, model parameter estimation and breakpoint validation (sequencing); inputs shown are SDs and normalized fluorescent intensity log2-ratios. Plot axes: log10(number of CNVs available for parameter estimation) and maximum number of parameters per transition state; points labelled 10 [core], 304 [intermediate A], 1003 [intermediate B], 2503 [full]]

‘Active’ approach for breakpoint identification: initial scoring with preliminary model, targeted validation (with sequencing), retraining, and rescoring

HMM optimized iteratively (using Expectation Maximization, EM)

Korbel*, Urban* et al., PNAS (2007)

CNV breakpoints sequenced in ~10 cases following BreakPtr analysis;

Median resolution <300 bp

No improvement in accuracy with higher resolution (9nt tiling)

Page 23: Array Informatics Mark Gerstein

Moving Beyond Arrays, Computational Methods for Next-Generation Sequencing: Paired End Mapping to Find SVs

Page 24: Array Informatics Mark Gerstein

Overall Strategy for Analysis of NextGen Seq. Data to Detect Structural Variants

[Korbel et al., Science ('07); Korbel et al., GenomeBiol. (submitted)]
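
The core logic of paired-end mapping can be illustrated with a toy classifier: the two ends of a fragment are mapped to the reference, and a span or orientation that differs from what the library predicts suggests a structural variant. This is a generic sketch and not the PEMer pipeline; the expected span, the tolerance, and the convention that concordant ends map to opposite strands are assumptions chosen for illustration.

```python
def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  expected_span=3000, tolerance=1000):
    """Label one mapped read pair relative to the reference genome.

    expected_span and tolerance are hypothetical library parameters (bp).
    Concordant pairs are assumed to map to opposite strands.
    """
    if chrom1 != chrom2:
        return "inter-chromosomal"
    if strand1 == strand2:
        return "inversion candidate"
    span = abs(pos2 - pos1)
    if span > expected_span + tolerance:
        # ends lie farther apart in the reference than in the sample
        return "deletion candidate"
    if span < expected_span - tolerance:
        # ends lie closer together in the reference than in the sample
        return "insertion candidate"
    return "concordant"
```

A real pipeline would call a variant only when several independent pairs support the same event, which is where coverage and simulation-based calibration come in.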

Page 25: Array Informatics Mark Gerstein

Simulation strategy

[Figure panels: Simulation; Experiment (454 sequencing)]

[Korbel et al., GenomeBiol. (submitted)]

Page 26: Array Informatics Mark Gerstein

Reconstruction efficiency at different coverage

[Korbel et al., GenomeBiol. (submitted)]

Page 27: Array Informatics Mark Gerstein

Building a Database of Variants: Complexities

[Korbel et al., GenomeBiol. (submitted)]

Page 28: Array Informatics Mark Gerstein

Analyzing Duplications in the Genome (SDs & CNVs)

pers. photo, see streams.gerstein.info

Page 29: Array Informatics Mark Gerstein

080907_SD_CNV_Slides_MBG_CEGS_PMK

SEGMENTAL DUPLICATIONS AND COPY NUMBER VARIANTS ARE RELATED PHENOMENA AND HAVE BEEN CREATED BY SEVERAL DIFFERENT MECHANISMS

• Copy Number Variants (CNV): intra-species variation

• Segmental Duplications (SD): fixed mutations (differences to other species)

[Diagram: CNVs and SDs linked by fixation]

• NAHR (Non-allelic homologous recombination): flanking repeat (e.g. Alu, LINE…)

• NHEJ (Non-homologous-end-joining): no (flanking) repeats; in some cases <4bp microhomologies

Page 30: Array Informatics Mark Gerstein

PERFORM LARGE SCALE CORRELATION ANALYSIS TO DETECT REPEAT SIGNATURES OF SDs AND CNVs

1. Survey a range of genomic features

2. Count the number of features in each genomic bin (100kb)

3. Calculate correlations / enrichments using robust stats

[Figure: exact breakpoint match in sequence (…ATCAAGG CCGGAA…) and its local environment]

If exact CNV breakpoints are known, we can calculate the enrichment of repeat elements relative to the genome or relative to the local environment

[Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ]
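
Steps 2 and 3 can be sketched as follows: count features per fixed-size bin, then correlate the per-bin counts of two feature types with a rank-based statistic. Spearman correlation stands in here for "robust stats"; that choice, the helper names, and the input format (start coordinates on one chromosome) are assumptions.

```python
import numpy as np

def count_per_bin(positions, bin_size, n_bins):
    """Number of features whose start coordinate falls in each genomic bin."""
    bins = np.asarray(positions, dtype=int) // bin_size
    return np.bincount(bins, minlength=n_bins)

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    sums = np.bincount(inverse, weights=ranks)
    return sums[inverse] / counts[inverse]

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])
```

Usage: with `sd = count_per_bin(sd_starts, 100_000, n_bins)` and `alu = count_per_bin(alu_starts, 100_000, n_bins)`, the value `spearman(sd, alu)` summarizes how strongly the two features co-occur across bins.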

Page 31: Array Informatics Mark Gerstein

SDs ARE CORRELATED WITH ALUS AND OTHER SDs

[Figure: Alu association with SDs by age; sequence-identity bins 90-92%, 92-94%, 94-96%, 96-98%, 98-99%, >99%; bar values 0.08, 0.09, 0.12, 0.14, 0.14, 0.13]

• The co-localization of Alu elements with SDs is highly significant.

• Older SDs have a much higher association with Alus than younger SDs.

• SDs can mediate NAHR and lead to the formation of CNVs

• Such mechanisms (“preferential attachment”) are well studied in physics and should lead to a very skewed (“power-law”) distribution of SDs.

• Hotspots

[Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ]

[Figure: occurrence (f) against number of SDs in genomic bin]

Page 32: Array Informatics Mark Gerstein

ASSOCIATIONS ARE DIFFERENT FOR SDs AND CNVs

CNV association with repeats

    Alu: 0.92

    Microsatellite: <0.001

    Pseudogenes: 0.046

    LINE: 0.001

    [Figure: bar values 0.048, 0.0006, 0.0739, 0.0466]

SD association with repeats

    Alu: <0.001

    Microsatellite: <0.001

    Pseudogenes: 0.046

    LINE: 0.001

    [Figure: bar values 0.07, 0.27, 0.094, 0.21]

[Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ]

CNVs ARE LESS ASSOCIATED WITH SDs THAN THE GENERAL SD TREND

CNV association with SDs

    >99% SDs*: 0.31

    CNVs: 0.11

Page 33: Array Informatics Mark Gerstein

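[Figure: timeline from old (low seq-ID %) to young (high seq-ID %), showing the Alu burst (40 MYA), SDs, fixation and aging (~40 Mya), and CNVs, with the mechanisms NAHR and NHEJ; associated features: Alu, SD, LINE, microsatellite, subtelomeres, fragile sites]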

AFTER THE ALU BURST, THE IMPORTANCE OF ALU ELEMENTS FOR GENOME REARRANGEMENT DECLINED RAPIDLY

• About 40 million years ago there was a burst in retrotransposon activity

• The majority of Alu elements stem from that time

• This, in turn, led to rapid genome rearrangement via NAHR

• The resulting SDs could create more SDs, but with Alu activity decaying, their creation slowed

[Kim et al. Gen. Res. (submitted, '08), arxiv.org/abs/0709.4200v1 ]

Page 34: Array Informatics Mark Gerstein

Future Directions

• Simulations of SV Assembly

• Analysis of Split Reads

• Detailed Analysis of SV and CNVs with Genomic Features

Page 35: Array Informatics Mark Gerstein

CEGS Informatics Credits

• Array Corrections: J Rozowsky, T Royce, M Seringhaus

• PEMer, SD-CNV, BreakPtr: P Kim, J Korbel, J Du, X Mu, A Abyzov, N Carriero

• Experimental: M Snyder, S Weissman, A Urban