base-resolution rna-seq - jeff leek

Upload: australian-bioinformatics-network

Post on 11-May-2015

TRANSCRIPT

Page 1: Base-Resolution rna-seq - Jeff Leek

@simplystats

base-resolution rna-seq

Jeff Leek Johns Hopkins Bloomberg School of Public Health

Page 2: Base-Resolution rna-seq - Jeff Leek

normally

You are free to:
•  Copy, share, adapt and remix
•  Photograph, film and broadcast
•  Live tweet, blog, post video of

Provided: you attribute this work to its author and respect the rights and licenses associated with its components

Adapted  from:    

Page 3: Base-Resolution rna-seq - Jeff Leek

today

1.  types of statistical methods

2.  derfinder

3.  Unexpected expression

Page 4: Base-Resolution rna-seq - Jeff Leek

data generation

Genome

Page 5: Base-Resolution rna-seq - Jeff Leek

data generation

Genome

Transcripts

Page 6: Base-Resolution rna-seq - Jeff Leek

data generation

Genome

Transcripts

Reads

Page 7: Base-Resolution rna-seq - Jeff Leek

“simplest” thing – annotate-identify

Genome

Page 8: Base-Resolution rna-seq - Jeff Leek

exon model

Genome

Count by Exon

Bullard et al. BMC Bioinformatics 2010

Page 9: Base-Resolution rna-seq - Jeff Leek

union model

Genome

Union of all exons

Bullard et al. BMC Bioinformatics 2010

Page 10: Base-Resolution rna-seq - Jeff Leek

union-intersection model

Genome

Union/Intersection

Bullard et al. BMC Bioinformatics 2010
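
As a rough illustration of how the gene-model choice changes the count for one gene, the sketch below counts reads per exon, against the union of exonic bases, and against the intersection of exonic bases across isoforms. The isoform coordinates and read positions are invented, each read is reduced to a start position, and reading "intersection" as "bases shared by every isoform" is a simplifying assumption, not the exact definition in Bullard et al.

```python
def bases(exons):
    """Expand half-open (start, end) exon intervals into a set of positions."""
    covered = set()
    for start, end in exons:
        covered.update(range(start, end))
    return covered

# Hypothetical gene with two isoforms (coordinates invented for illustration).
isoforms = {
    "tx1": [(100, 200), (300, 400)],
    "tx2": [(100, 200), (350, 450)],
}
# Hypothetical reads, each reduced to its start position.
read_starts = [105, 150, 320, 360, 410, 500]

def count_reads(region):
    """Number of reads whose start position falls inside the region."""
    return sum(1 for pos in read_starts if pos in region)

per_isoform = [bases(exons) for exons in isoforms.values()]
union_region = set().union(*per_isoform)
intersection_region = set.intersection(*per_isoform)

# Exon model: one count per distinct annotated exon.
distinct_exons = sorted({exon for exons in isoforms.values() for exon in exons})
exon_counts = {exon: count_reads(bases([exon])) for exon in distinct_exons}

union_count = count_reads(union_region)
intersection_count = count_reads(intersection_region)
print(exon_counts, union_count, intersection_count)
```

The same six reads give different totals under each model, which is the point of the three slides above.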

Page 11: Base-Resolution rna-seq - Jeff Leek

sources of variation in annotate-identify

1.  annotation
2.  gene models
3.  fragment-level biases
4.  technical variation
5.  biological variability

Page 12: Base-Resolution rna-seq - Jeff Leek

annotation variation

Frazee et al. Biostatistics under review

Page 13: Base-Resolution rna-seq - Jeff Leek

gc-variation

Hansen et al. 2011 Biostatistics

Page 14: Base-Resolution rna-seq - Jeff Leek

biological variation

Choy et al. (2008) vs. Pickrell et al. (2010)

Stranger et al. (2007) vs. Montgomery et al. (2010)

Hansen et al. 2010 Nat. Biotech

Page 15: Base-Resolution rna-seq - Jeff Leek

some data

http://bowtie-bio.sourceforge.net/recount/

Page 16: Base-Resolution rna-seq - Jeff Leek

assemble-identify

Genome

Reads

Page 17: Base-Resolution rna-seq - Jeff Leek

assemble-identify (align)

Genome

Page 18: Base-Resolution rna-seq - Jeff Leek

assemble-identify (assemble)

Genome

Fragments

Transcripts

Trapnell et al. 2010 Nat. Biotech

Page 19: Base-Resolution rna-seq - Jeff Leek

assemble-identify (abundance)

Genome

Transcripts

Trapnell et al. 2010 Nat. Biotech

Page 20: Base-Resolution rna-seq - Jeff Leek

inherent ambiguity (boundaries)

Genome

Fragments

Transcripts

Page 21: Base-Resolution rna-seq - Jeff Leek

inherent ambiguity (assembly)

Genome

Alternative Assemblies

Page 22: Base-Resolution rna-seq - Jeff Leek

assembly variation

Frazee et al. in prep

Page 23: Base-Resolution rna-seq - Jeff Leek

result of assembly variation

Frazee et al. in prep

Page 24: Base-Resolution rna-seq - Jeff Leek

result of assembly variation (bio reps)

Frazee et al. in prep

Page 25: Base-Resolution rna-seq - Jeff Leek

result of assembly variation

Frazee et al. in prep

Page 26: Base-Resolution rna-seq - Jeff Leek

result of assembly variation

[Figure: two histograms of Cufflinks p-values (x-axis: p-value, 0.0 to 1.0; y-axis: density). Panels: Cufflinks v1, Cufflinks v2]

Frazee et al. 2012 in prep

Page 27: Base-Resolution rna-seq - Jeff Leek

methods

annotate-identify
1.  align
2.  gene-model
3.  abundances
4.  analyze

Pros:
•  analogous to microarray
•  processed data easy to handle
Cons:
•  incorrect/variable annotation
•  gene model choices have a big impact

assemble-identify
1.  align
2.  assemble
3.  abundances
4.  analyze

Pros:
•  alternative transcription
•  (potentially) less annotation dependent
Cons:
•  ambiguity/variation in assembly

Page 28: Base-Resolution rna-seq - Jeff Leek

differentially expressed region finder

1.  Calculate base pair-resolution coverage
2.  Perform test at each base
3.  Identify regions of differential expression (segment)
4.  Annotate regions (optional)

Pros:
•  processed data still easier to handle
•  less dependent on annotation
•  no assembly variability
Cons:
•  still no transcript-level abundances (but…)

Frazee et al. 2012b in prep

Page 29: Base-Resolution rna-seq - Jeff Leek

derfinder notes

•  Ignores annotation
•  Coverage data at base resolution, designed for “differential” expression analysis
    –  Lose paired end information
    –  Lose junction information
    –  Lose potential mapping quality information
    –  …
•  Annotate the resulting differentially expressed regions (DERs)

Page 30: Base-Resolution rna-seq - Jeff Leek

Solution ir

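[Figure: read pile-up over a short stretch of genome, position axis ticks at 5, 10, 15, 20]

Per-base coverage: 2 2 3 6 11 12 14 15 15 16 15 17 16 14 9 8 6 5 5 4 3 1 1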
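
A per-base coverage vector like the one on this slide can be built from aligned read intervals. The sketch below is one minimal way to do it, assuming reads are given as half-open (start, end) intervals on a single chromosome. The reads and the function name are invented for illustration and are not part of derfinder.

```python
def coverage(reads, length):
    """Per-base coverage for reads given as half-open (start, end) intervals."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    # A running sum of the differences gives the depth at each base.
    depth, running = [], 0
    for change in diff[:length]:
        running += change
        depth.append(running)
    return depth

# Hypothetical alignments on a 10-base toy chromosome.
reads = [(0, 4), (2, 6), (3, 8)]
cov = coverage(reads, 10)
print(cov)
```

The difference-array approach touches each read twice instead of once per covered base, which matters when the vector spans a whole chromosome.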

Page 31: Base-Resolution rna-seq - Jeff Leek

result

n samples →

3 billion nt

Frazee et al. Biostatistics in review

Page 32: Base-Resolution rna-seq - Jeff Leek

base-pair model (case/control)

g() = Transform (Box-Cox, log(+32) etc.)
Yi,j = coverage on sample i at base j
lj = genomic location j
α() = baseline coverage
β() = change in coverage between groups
γk() = adjustments for confounders
Wik = value of kth confounder on ith sample

Frazee et al. Biostatistics in review
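
The slide's equation did not survive in this transcript. One form consistent with the definitions above, offered as a reconstruction and not as the author's exact notation, is a per-base linear model. The group indicator X_i (0 for control, 1 for case), the number of confounders K, and the error term ε_{i,j} are assumed symbols that do not appear in the transcript.

```latex
g(Y_{i,j}) = \alpha(l_j) + \beta(l_j)\, X_i + \sum_{k=1}^{K} \gamma_k(l_j)\, W_{ik} + \varepsilon_{i,j}
```

Under this reading, testing for differential expression at base j amounts to testing whether β(l_j) = 0.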

Page 33: Base-Resolution rna-seq - Jeff Leek

batch-variation

Blue: 3 sds below the mean
Orange: 3 sds above the mean

Human chromosome 16

Horizontal lines delimit process dates

Leek et al. 2010 Nat. Rev. Genet.

Page 34: Base-Resolution rna-seq - Jeff Leek

finding the statistics for d.e. bases

t ~ π0f0 + π1f1 + π2f2 + π3f3

Page 35: Base-Resolution rna-seq - Jeff Leek

empirical bayes

Page 36: Base-Resolution rna-seq - Jeff Leek

estimating parameters

t ~ π0f0 + π1f1 + π2f2 + π3f3

Assumed known – the distribution of zeros

Alternatively – Gottardo and Raftery 2008 JCGS

Page 37: Base-Resolution rna-seq - Jeff Leek

estimating parameters

t ~ π0f0 + π1f1 + π2f2 + π3f3

Estimated null distribution from e.g. Efron 2002

Page 38: Base-Resolution rna-seq - Jeff Leek

estimating parameters

Estimated from 2-groups model, assumed symmetric

t ~ π0f0 + π1f1 + π2f2 + π3f3
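
To make the mixture concrete, the sketch below computes the posterior probability that a statistic t came from each component, using Bayes' rule. It assumes one reading of the slides: f0 covers the zeros, f1 is the null, and f2 and f3 are symmetric components for the two directions of differential expression. All weights and densities are invented placeholders (including the narrow normal standing in for the distribution of zeros), not estimates from any data.

```python
from statistics import NormalDist

# Hypothetical mixing proportions pi0..pi3 (they sum to 1).
WEIGHTS = [0.50, 0.40, 0.05, 0.05]

# Hypothetical component densities f0..f3.
DENSITIES = [
    NormalDist(0.0, 0.01).pdf,  # f0: narrow spike standing in for the zeros
    NormalDist(0.0, 1.0).pdf,   # f1: null (not differentially expressed)
    NormalDist(4.0, 1.0).pdf,   # f2: higher in one group
    NormalDist(-4.0, 1.0).pdf,  # f3: mirror image of f2
]

def posterior(t):
    """Posterior probability of each mixture component given statistic t."""
    joint = [weight * density(t) for weight, density in zip(WEIGHTS, DENSITIES)]
    total = sum(joint)
    return [value / total for value in joint]

for t in (0.0, 1.3, 5.0, -5.0):
    print(t, [round(p, 4) for p in posterior(t)])
```

A large positive statistic lands almost entirely in f2 and a large negative one in f3, while values near zero are split between f0 and f1 according to their weights.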

Page 39: Base-Resolution rna-seq - Jeff Leek

hmm

hidden states: DE, DE, DE, not DE, not DE

emissions are statistics: t1, t2, t3, t4, t5

Frazee et al. Biostatistics in review
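
The segmentation idea on this slide can be sketched with a two-state hidden Markov model decoded by the Viterbi algorithm. Everything numeric below is an assumption chosen for illustration: the normal emission parameters, the 0.9 probability of staying in a state, and the equal starting probabilities. The model in the cited paper may use more states and estimated parameters.

```python
import math

def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution, computed directly to avoid underflow."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

STATES = ["DE", "not DE"]
LOG_START = {"DE": math.log(0.5), "not DE": math.log(0.5)}
LOG_TRANS = {
    "DE": {"DE": math.log(0.9), "not DE": math.log(0.1)},
    "not DE": {"DE": math.log(0.1), "not DE": math.log(0.9)},
}
# Hypothetical (mean, sd) of the statistic in each hidden state.
EMISSION = {"DE": (5.0, 1.0), "not DE": (0.0, 1.0)}

def viterbi(stats):
    """Most likely sequence of hidden states for a sequence of statistics."""
    score = {s: LOG_START[s] + log_normal_pdf(stats[0], *EMISSION[s]) for s in STATES}
    backpointers = []
    for t in stats[1:]:
        previous, score, pointers = score, {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this base.
            best = max(STATES, key=lambda r: previous[r] + LOG_TRANS[r][s])
            pointers[s] = best
            score[s] = previous[best] + LOG_TRANS[best][s] + log_normal_pdf(t, *EMISSION[s])
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

path = viterbi([5.1, 4.8, 5.3, 0.2, -0.1])
print(path)
```

Runs of consecutive bases decoded as "DE" would then be reported as candidate regions.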

Page 40: Base-Resolution rna-seq - Jeff Leek

statistic

Observed  

Frazee et al. Biostatistics in review

Page 41: Base-Resolution rna-seq - Jeff Leek

monte-carlo p-value

Observed   Null    

Frazee et al. Biostatistics in review

Langmead et al. in prep

Jaffe et al. Biostatistics 2011

Page 42: Base-Resolution rna-seq - Jeff Leek

ma-plots

Frazee et al. Biostatistics in review

Page 43: Base-Resolution rna-seq - Jeff Leek

statistical significance

[Figure: eight histograms of p-values (x-axis: p value, 0.0 to 1.0; y-axis: frequency). Panels: DER Finder - sex, DER Finder - males, Cufflinks - sex, Cufflinks - males, EdgeR - sex, EdgeR - males, DESeq - sex, DESeq - males]

Frazee et al. Biostatistics in review

Page 44: Base-Resolution rna-seq - Jeff Leek

percent “correct hits” by ranking

Page 45: Base-Resolution rna-seq - Jeff Leek

caveat

Genome

Bullard et al. BMC Bioinformatics 2010

Page 46: Base-Resolution rna-seq - Jeff Leek

caveat

Genome

Bullard et al. BMC Bioinformatics 2010

Page 47: Base-Resolution rna-seq - Jeff Leek

annotation incorrect

[Figure: chrY: 15016699 - 15017219. Panels by genomic position: log2(count+32) coverage for female and male samples; t statistic; exons; states]

Frazee et al. Biostatistics in review

Page 48: Base-Resolution rna-seq - Jeff Leek

annotation missing

[Figure: chrY: 2715932-2716691. Panels by genomic position: log2(count+32) coverage for female and male samples; t statistic; exons; states]

Frazee et al. Biostatistics in review

Page 49: Base-Resolution rna-seq - Jeff Leek

missed by cufflinks

Frazee et al. Biostatistics in review

Page 50: Base-Resolution rna-seq - Jeff Leek

computational goals

•  Aligned reads (say from TopHat) to DERs in < 24 hours, all within R statistical software
    –  Table of DERs and matrix of mean coverage per sample per region for post-hoc analysis
    –  Annotated using data from UCSC and Ensembl: counts of features and annotation lists
    –  Visualized DERs, including annotation to identify novel transcriptional activity
    –  Easy methods for counting exons from coverage objects (~2-4 hours from aligned reads for all samples)

Page 51: Base-Resolution rna-seq - Jeff Leek

derfinder - fast

1.  Test for differential expression at each base, record statistic (linear modeling)
2.  Identify contiguous/adjacent bases that are differentially expressed above some cutoff (thresholding/“bumphunter”)
3.  Summarize each DER (area)
4.  Perform significance testing on region-level (permutations, empirical p-values)
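
Steps 2 to 4 can be sketched in a few lines: threshold the per-base statistics, merge adjacent bases into regions, summarize each region by its area, and compare that area with areas obtained from permuted data. The statistics, cutoff, and null areas below are invented, and the "+1" in the empirical p-value is one common convention assumed here, not necessarily what derfinder computes.

```python
def find_regions(stats, cutoff):
    """Return (start, end, area) for each run of consecutive bases above cutoff."""
    regions, start = [], None
    # A trailing sentinel below any cutoff closes a run that reaches the last base.
    for i, value in enumerate(list(stats) + [float("-inf")]):
        if value > cutoff and start is None:
            start = i
        elif value <= cutoff and start is not None:
            regions.append((start, i, sum(stats[start:i])))
            start = None
    return regions

def empirical_p(observed_area, null_areas):
    """Share of null region areas at least as large as the observed area."""
    exceed = sum(1 for area in null_areas if area >= observed_area)
    return (exceed + 1) / (len(null_areas) + 1)

# Hypothetical per-base statistics and cutoff.
stats = [0.5, 3.0, 4.0, 3.5, 0.2, 0.1, 5.0, 6.0, 0.3]
regions = find_regions(stats, cutoff=2.0)

# Hypothetical areas of regions found after permuting the sample labels.
null_areas = [2.0, 3.5, 12.0, 1.0]
p_values = [empirical_p(area, null_areas) for _, _, area in regions]
print(regions, p_values)
```

In practice the null areas would come from rerunning the first three steps on each permutation of the group labels.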

Page 52: Base-Resolution rna-seq - Jeff Leek

time and memory needed: derSnyder

•  Load & filter data: 10 cores with mclapply 1hr 15min, 177 GB

•  Make models: 20 min, 52 GB
•  Analysis: 10 permutations, 4 cores each chr, total 59 mins
    –  chr1 41 min, 46 GB
•  Merging: 30 min, 22 GB
•  Report: 27 min, 17 GB
•  Total wallclock time: 3 hr 46 min

20 samples

Page 53: Base-Resolution rna-seq - Jeff Leek

Counts: derSnyder

•  Load & filter data: 10 cores with mclapply 1hr 15min, 177 GB

•  Create count table: 26 min, 24 GB
•  Total wallclock time: 1 hr 41 min

20 samples

Page 54: Base-Resolution rna-seq - Jeff Leek

lieber brain samples

•  DLPFC Paired-end RNAseq Data
•  36 samples across 6 age ranges, n=6/group: Fetal (age < 0); Infant (0-1); Child (1-10); Teen (10-20); Adult (20-50); 50+

•  4 M and 2 F per group; mostly AA, but some Caucasians

•  RINs are evenly distributed across age

Page 55: Base-Resolution rna-seq - Jeff Leek

lieber brain samples

Page 56: Base-Resolution rna-seq - Jeff Leek

test for base-level de

Page 57: Base-Resolution rna-seq - Jeff Leek

thresholding on statistic

F-statistic corresponding to p-value < 10^-8 (F5,30)

Page 58: Base-Resolution rna-seq - Jeff Leek

derfinder results

•  alt model: age group + median coverage
•  null model: median coverage
•  threshold: p-value < 1e-8
•  5,565 DERs with FWER ~ 0 (conservative)
    –  Median length: 148bp [IQR: 112-235]

Page 59: Base-Resolution rna-seq - Jeff Leek

Page 60: Base-Resolution rna-seq - Jeff Leek

Page 61: Base-Resolution rna-seq - Jeff Leek

Page 62: Base-Resolution rna-seq - Jeff Leek

Page 63: Base-Resolution rna-seq - Jeff Leek

Page 64: Base-Resolution rna-seq - Jeff Leek

Page 65: Base-Resolution rna-seq - Jeff Leek

Page 66: Base-Resolution rna-seq - Jeff Leek

annotating

•  Devised “light-weight” R annotation files for UCSC hg19 knownGene and Ensembl GRCh37.p11
•  “Genomic State” objects: each base pair in the genome gets assigned to exactly one “state”, annotations merged across overlapping features
•  Two different configurations:
    –  “Full” (introns, exons, un-annotated/intragenic)
    –  “Coding” (introns, coding exons, UTRs, promoters, un-annotated/intragenic)
•  Very fast, 1000s of regions in seconds
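
The "exactly one state per base" idea can be illustrated with a small sketch. It assumes a priority rule in which an exon annotation overrides an intron annotation wherever features overlap. That rule, the feature coordinates, and the function names are assumptions made for illustration; the slides do not say how overlaps are resolved.

```python
# Assumed priority: labels later in this list overwrite earlier ones.
PRIORITY = ["intron", "exon"]

def genomic_state(length, features):
    """Assign each base exactly one state from (kind, start, end) features."""
    state = ["unannotated"] * length
    for label in PRIORITY:
        for kind, start, end in features:
            if kind == label:
                state[start:end] = [label] * (end - start)
    return state

def collapse(state):
    """Merge consecutive bases with the same state into (label, start, end) runs."""
    runs = []
    for pos, label in enumerate(state):
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1], pos + 1)
        else:
            runs.append((label, pos, pos + 1))
    return runs

# Hypothetical overlapping annotations on a 12-base toy region.
features = [("intron", 2, 8), ("exon", 5, 10)]
state_runs = collapse(genomic_state(12, features))
print(state_runs)
```

Because every base carries a single label, annotating a region reduces to counting the labels it overlaps.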

Page 67: Base-Resolution rna-seq - Jeff Leek

derfinder results

•  2,655 regions (47.7%) show expression of 1+ annotated intron (UCSC: 2,505; 45%)

•  577 regions (10.4%) show expression of an “intragenic” region (UCSC: 800, 14%)

Ensembl   UCSC  

Page 68: Base-Resolution rna-seq - Jeff Leek

derfinder results

•  261 regions (4.7%) crossed a known lincRNA
    –  51 overlapping 535 “intragenic” regions (9.6%; i.e. no exons)

•  Only one region crossed known miRNA, but same region had annotated exon on other strand

Page 69: Base-Resolution rna-seq - Jeff Leek

derfinder results

•  Verifying the 5,565 DERs:
    –  95% of regions had mappability of 100bp reads greater than 99%
    –  Only 16 regions were in tracks excluded by the Duke site of ENCODE (all “BSR/Beta” for satellite repeats) and 0 by the Data Analysis Center of ENCODE
    –  Only 90 regions (1.5%) mapped to known pseudogenes

Page 70: Base-Resolution rna-seq - Jeff Leek

derfinder results

•  Fetal samples had the highest expression in the majority of the regions (84%; 18 [1.7-Inf] fold increase); second highest was 50+ group (7%; 1.4 [1-4.3] fold increase)

Page 71: Base-Resolution rna-seq - Jeff Leek

derfinder results

Page 72: Base-Resolution rna-seq - Jeff Leek

derfinder subgroup

•  Identified DERs within each 6-sample age group based on mean expression
    –  Represents set of expressed sequences for each group at a given coverage threshold
    –  Varied mean coverage cutoff
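
The subgroup analysis can be sketched as a function of the cutoff: average coverage across the samples in a group, keep runs of bases whose mean exceeds the cutoff, and report what share of positions they cover. The coverage values are invented, and both the strict ">" comparison and the `min_length` filter (standing in for the "L ≥ 12" slides) are assumptions.

```python
def percent_expressed(coverages, cutoff, min_length=1):
    """Percent of bases in runs of mean coverage > cutoff spanning >= min_length bases."""
    n_samples = len(coverages)
    mean_cov = [sum(column) / n_samples for column in zip(*coverages)]
    expressed, run = 0, 0
    # A trailing sentinel closes a run that reaches the last base.
    for value in mean_cov + [float("-inf")]:
        if value > cutoff:
            run += 1
        else:
            if run >= min_length:
                expressed += run
            run = 0
    return 100.0 * expressed / len(mean_cov)

# Hypothetical coverage for two samples in one age group over 8 bases.
group = [
    [0, 4, 6, 8, 0, 0, 10, 0],
    [0, 6, 6, 4, 0, 0, 2, 0],
]
pct_all = percent_expressed(group, cutoff=3)
pct_long = percent_expressed(group, cutoff=3, min_length=2)
print(pct_all, pct_long)
```

Sweeping `cutoff` over a range of values for each age group would give curves like the ones on the slides that follow.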

Page 73: Base-Resolution rna-seq - Jeff Leek

% of genome expressed

[Figure: y-axis: Percent of Genome Expressed]

Page 74: Base-Resolution rna-seq - Jeff Leek

scaled % of genome expressed

Fetal is highest at EVERY cutoff

Teen is lowest thru 114 reads

Infant is lowest after 114 reads

Page 75: Base-Resolution rna-seq - Jeff Leek

higher cutoffs create longer DERs

Page 76: Base-Resolution rna-seq - Jeff Leek

% of genome expressed (L ≥ 12)

[Figure: y-axis: Percent of Genome Expressed]

Page 77: Base-Resolution rna-seq - Jeff Leek

Scaled % of genome expressed (L ≥ 12)

Fetal is still highest at EVERY cutoff

Page 78: Base-Resolution rna-seq - Jeff Leek

Higher cutoffs still create longer DERs

Page 79: Base-Resolution rna-seq - Jeff Leek

try that stuff, yo!

https://github.com/lcolladotor/derfinder
https://github.com/lcolladotor/derfinderReport
https://github.com/lcolladotor/derfinderExample

Page 80: Base-Resolution rna-seq - Jeff Leek

acknowledgements

Leek Group: Alyssa Frazee, Prasad Patil, Leo Collado Torres, Abhi Nellore
University of Maryland: Héctor Corrada Bravo
Harvard: Rafael Irizarry
Lieber Institute: Andrew Jaffe, Danny Weinberger, Thomas Hyde
Hopkins: Kasper Hansen, Roger Peng, Ben Langmead, Sarven Sabunciyan, Luigi Marchionni, Donald Geman
Funding: Amazon Web Services, Digital Science, NIH CCNE, Hopkins inHealth