Post on 12-Jul-2020

Page 1: Introduction*to*Large/Scale* Biomedical*Data*Analysis*web1.sph.emory.edu/users/hwu30/teaching/introBDA/IntroBDA_sgse… · Workflow*of*NGS*data*analysisSec-gen sequencing data analysis

Introduction to Large-Scale Biomedical Data Analysis

A grand overview of the course

• Introductory course created for the BIG (Bioinformatics, Imaging and Genetics) concentration.
• Purpose of the course: introduce modern high-dimensional biomedical data from:
  – Bioinformatics and computational biology.
  – Biomedical imaging.
  – Statistical genetics.

Contents of the course

• Focus on:
  – Scientific background: questions and motivations.
  – Technologies.
  – Data and their characteristics.
  – Brief overview of some statistical methods, and of the opportunities and challenges for statisticians.

There will be a lot of material and new terminology!
• Not covered in this course: detailed statistical theories and methods for data analyses.
• This is a knowledge-centric class: the big picture and the concepts are more important.

Format of the course

• Co-taught by multiple instructors.
• Students are evaluated by 3 reading assignments: you need to write reading reports!

Lecture 1: Introduction to next generation sequencing

Outline

• Biological background: DNA and DNA sequencing.
• Next generation sequencing (NGS) technologies.
• Data from NGS.
• Typical NGS data analysis workflow.
• An application of NGS: RNA-seq.

Background: DNA and sequencing

DNA (DeoxyriboNucleic Acid)

• A molecule that contains the genetic instructions of all known living organisms and some viruses.
• Resides in the cell nucleus, where DNA is organized into long structures called chromosomes.
• Most DNA molecules consist of two long polymers (strands), which entwine in the shape of a double helix.
• Each strand is a chain of simple units called nucleotides (bases): A, C, G, T.
• The bases from the two strands are complementary by base pairing: A-T, C-G.
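Because of base pairing, knowing one strand determines the other. A minimal Python sketch of this idea follows; the function name `reverse_complement` is ours, and the reversal reflects the standard fact (not stated on the slide) that the two strands are read in opposite directions.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in its own direction.

    The two strands run in opposite directions, so the complement
    is reversed (illustrative helper, not from the slides).
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

print(reverse_complement("ACAGGTTTGC"))
```

Applying the function twice returns the original sequence, which is a quick sanity check on the pairing table.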

DNA sequence

• The order of occurrence of the bases in a DNA molecule is called the sequence of the DNA. The DNA sequence is usually stored in a big text file:

ACAGGTTTGCTGGTGACCAGTTCTTCATGAGGGACCATCTATCACAACAGAGAAAGCACTTGGATCCACCAGGGGCTGCCAGGGGAAGCAGCATGGGAGCCTGAACCATGAAGCAGGAAGCACCTGTCTGTAGGGGGAAGTGATGGAAGGACATGGGCACAGAAGGGTGTAGGTTTTGTGTCTGGAGGACACTGGGAGTGGCTCCTGGCATTGAAACAGGTGTGTAGAAGGATGTGGTGGGACCTACAGACAGACTGGAATCTAAGGGACACTTGAATCCCAGTGTGACCATGGTCTTTAAGGACAGGTTGGggccaggcacagtggctcatgcctgtaatcccagcact

• Some interesting facts:
  – Total length of the human DNA is 3 billion bases.
  – Difference in DNA sequence between two individuals is less than 1%.
  – Human and chimpanzee have 96% of the sequences identical. Human and mouse: 70%.

Genome size (total length of DNA)

Organism                  Genome size (bp)    # genes
E. coli                   4.6M                4,300
S. cerevisiae (yeast)     12.5M               5,800
C. elegans (worm)         100M                20,000
A. thaliana (plant)       115M                28,000
D. melanogaster (fly)     123M                13,000
M. musculus (mouse)       3G                  23,800
H. sapiens (human)        3.3G                25,000

DNA sequencing

• Technologies to determine the nucleotide bases of a DNA molecule.
• Motivation: decipher the genetic codes hidden in DNA sequences for different biological processes.
• Genome projects: determine DNA sequences for different species, e.g., the Human Genome Project.
• Genomic research (in a nutshell): study the functions of DNA sequences and related components.

Sequencing technologies

• Traditional technology: Sanger sequencing.
  – Slow (low throughput) and expensive: it took the Human Genome Project (HGP) 13 years and $3 billion to sequence the entire human genome.
  – Relatively accurate.
• New technology: different types of high-throughput sequencing.

Next generation sequencing (NGS)

• Aka: high-throughput sequencing, second-generation sequencing.
• Able to sequence a large number of short sequence segments in a short period:
  – High-throughput: hundreds of millions of sequences in a run.
  – Cheap: sequencing an entire human genome costs a few thousand dollars now.
  – Higher error rate compared to Sanger sequencing.

NGS technologies

• The technologies are complicated and involve many biochemical reactions:
  – Sequencing by synthesis.
  – Sequencing by ligation.
  – Pyrosequencing.
• In a nutshell:
  – Cut the long DNA into smaller segments (several hundred to several thousand bases).
  – Sequence each segment: start from one end and sequence along the chain, base by base.
  – The process stops after a while because the noise level becomes too high.
  – Results from sequencing are many sequence pieces. The lengths vary, usually a few thousand bases from Sanger and several hundred from NGS.
  – The sequence pieces are called "reads" for NGS data.

Single-end vs. paired-end sequencing

• Sequence one end or both ends of the DNA segments.
• Paired-end sequencing: reads are "paired", separated by a certain length (the length of the DNA segments).
• Paired-end data can be used as single-end data, but contain extra information from the read pairing.
• The pairing is useful in some cases, for example, detecting structural variations in the genome.

Available NGS platforms

• Major players:
  – Illumina: HiSeq, Genome Analyzer (GA)
  – LifeTech: SOLiD, IonTorrent
  – Roche 454
• Others:
  – Complete Genomics
  – Pacific Biosciences
  – Helicos

Introduction to NGS Applications

Applications of NGS

• NGS has a wide range of applications.
  – DNA-seq: sequence genomic DNA.
  – RNA-seq: sequence RNA products.
  – ChIP-seq: detect protein-DNA interaction sites.
  – Bisulfite sequencing (BS-seq): measure DNA methylation strengths.
  – A lot of others.

DNA-seq

• Sequence the untreated genomic DNA.
  – Obtain DNA from cells, cut it into small pieces, then sequence the segments.
• Goals:
  – Genome re-sequencing: compare to the reference genome and look for genetic variants:
    • Single nucleotide polymorphisms (SNPs).
    • Insertions/deletions (indels).
    • Copy number variations (CNVs).
    • Other structural variations (gene fusion, etc.).
  – De novo assembly of a new, unknown genome.

Variations of DNA-seq

• Targeted sequencing, e.g., exome sequencing.
  – Sequence the genomic DNA at targeted genomic regions.
  – Cheaper than whole-genome DNA-seq, so the money can be spent on a bigger sample size (more individuals).
  – The targeted genomic regions need to be "captured" first using technologies like microarrays.
• Metagenomic sequencing.
  – Sequence the DNA of a mixture of species, mostly microbes, in order to understand microbial environments.
  – The goal is to determine the number of species, their genomes, and their proportions in the population.
  – De novo assembly is required, but the number and proportions of species are unknown, which makes assembly challenging.

ChIP-seq

• Chromatin immunoprecipitation (ChIP) followed by sequencing (seq): the sequencing version of ChIP-chip.
• A type of "captured" sequencing. The ChIP step captures the genomic regions of interest.
• Used to detect locations of certain "events" on the genome: genome-wide location analysis (GWLA). Examples of events:
  – Transcription factor binding.
  – DNA methylation and histone modifications.

RNA-seq

• Sequence the "transcriptome": the set of RNA molecules.
• Goals:
  – Catalog RNA products.
  – Determine transcriptional structures: alternative splicing, gene fusion, etc.
  – Quantify gene expression: the sequencing version of the gene expression microarray.

NGS data and analysis workflow

Workflow of NGS data analysis

(Figure: second-generation sequencing data analysis workflow.)

• Raw images → (imaging analysis) → fluorescent intensities → (base calling) → sequence reads.
• Sequence reads → (alignment) → aligned reads, or sequence reads → (de novo assembly) → contigs.
• Downstream analyses:
  – Variant calling (DNA-seq).
  – DE/splicing (RNA-seq).
  – Peak/DMR detection (ChIP/MeDIP-seq).
  – ...

Base calling

• The "raw" data from the sequencer are many images.
• To obtain the sequence reads, a "base calling" step is needed:
  – First, extract data (fluorescent intensities) from the images.
  – Next, convert the intensities (continuous) to base calls (categorical).
  – Both steps involve many statistical methods to extract signals from noisy data.
• For most people, sequencing data analyses start from the sequence reads (the results of base calling).
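The intensity-to-base conversion can be sketched as a toy example. This is not how production base callers work (they model effects the slides do not cover); it only illustrates turning continuous intensities into categorical calls. The function name `call_bases`, the A/C/G/T channel order, and the `min_ratio` threshold are all hypothetical.

```python
BASES = "ACGT"  # assumed channel order, for illustration only

def call_bases(intensities, min_ratio=1.5):
    """Convert per-cycle channel intensities into a read.

    intensities: list of cycles, each a list of 4 intensities (A, C, G, T).
    A base is called when the brightest channel is at least `min_ratio`
    times the second brightest; otherwise the call is 'N' (unknown).
    """
    read = []
    for cycle in intensities:
        ranked = sorted(range(4), key=lambda k: cycle[k], reverse=True)
        top, second = cycle[ranked[0]], cycle[ranked[1]]
        read.append(BASES[ranked[0]] if top >= min_ratio * second else "N")
    return "".join(read)

# Three cycles: clear A, clear G, and one ambiguous cycle.
toy = [[9.1, 0.3, 0.2, 0.4], [0.2, 0.1, 7.5, 0.3], [1.0, 1.0, 1.0, 1.0]]
print(call_bases(toy))
```

The ambiguous third cycle comes out as "N", which is also how uncalled bases appear in real read files.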

Raw sequence reads data from NGS after base calling

• Large text file (millions of lines) with a simple format.
  – Most frequently used: fasta/fa format for storing the sequences, or fastq format for storing both the sequences and the corresponding quality scores.
• fa format (each record is a read name line starting with ">", followed by the read sequence):

>5_143_428_832
GATATTGTAGCATAACGCAACTTGGGAGGTGAGCTT
>5_143_984_487
GTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATG
>5_143_963_690
GGTATATGCACAAAATGAGATGCTTGCTTATCAACA
>5_143_957_461
GGAGGGTGTCAATCCTGACGGTTATTTCCTAGACAA
>5_143_808_403
GATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCT
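Reading the fa format takes only a few lines of code. The sketch below assumes records laid out as a ">" name line followed by one or more sequence lines; `parse_fasta` is an illustrative helper, not part of any package mentioned in the course.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

# First two reads from the slide.
example = """>5_143_428_832
GATATTGTAGCATAACGCAACTTGGGAGGTGAGCTT
>5_143_984_487
GTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATG
""".splitlines()

records = list(parse_fasta(example))
for read_name, read_seq in records:
    print(read_name, len(read_seq))
```

For real files with millions of reads, one would pass an open file handle as `lines`, so that reads are streamed rather than held in memory.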

fastq format

Each record has four lines: read name (starting with "@"), read sequence, separator ("+"), and quality scores (one character per base).

@HWI-EAS165:1:1:50:908:1
CTGCGGTCTCTAAAGTGCCATCTCATTGTGCTTTGTATCAGTCAGTGCTGGA
+
BCCBCB8ABBBBBBB:BC=8@BBA:@BB@BBBCBB<9BBAC;A<C?BAAB<#
@HWI-EAS165:1:1:50:0:1
NCAACCCCCACAGTAATATGTAAAACAAAAACTAAAACCAGGAGCTGAAGGG
+
#BABABBBBBB@08<@?A@7:A@CCBCCCCBBBCCBB=?BBBB@7@B=A>:2
@HWI-EAS165:1:1:50:708:1
GGTCAGCATGTCTTCTGTTAAGTGCTTGCACAAGCTAGCCTCTGCCTATGGG
+
BB@A;B>@A@@=BB=BB?A>@@>B?ABBA=A?@@>@@A:=?>?A@=B8@@AB
@HWI-EAS165:1:1:50:1494:1
CTGGTGTCACACAAGCAGGTCTCCTGTGTTGACTTCACCAGACACTGTCATT
+
BCBB@AB@1ABBBBBBAAB?BBBBAB<A?AA>BB@?1ABBA@BBBA@;B>>:
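The quality line encodes one score per base as a printable character. A minimal sketch of reading a record and decoding the scores follows. It assumes the Phred convention Q = -10 log10(p), where p is the error probability, and an ASCII offset of 33. The offset is an assumption: some older pipelines used 64, so check the file's provenance. The helper names are ours.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality string) from FASTQ lines, 4 lines per read."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # separator line starting with '+'
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual

def phred_scores(qual, offset=33):
    """Decode a quality string into integer scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in qual]

def error_probability(q):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

# First read from the slide.
record_lines = [
    "@HWI-EAS165:1:1:50:908:1",
    "CTGCGGTCTCTAAAGTGCCATCTCATTGTGCTTTGTATCAGTCAGTGCTGGA",
    "+",
    "BCCBCB8ABBBBBBB:BC=8@BBA:@BB@BBBCBB<9BBAC;A<C?BAAB<#",
]
name, seq, qual = next(parse_fastq(record_lines))
scores = phred_scores(qual)
print(name, len(seq), min(scores), max(scores))
```

Under these assumptions, the trailing "#" in the quality string decodes to a very low score, which is typical of the noisier final cycles of a read.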

Sequence alignment and assembly

• Sequencing a known genome: alignment.
  – Use the known genome (called the "reference genome") as a blueprint.
  – Determine where each read is located in the reference genome.
• Sequencing a whole new genome: assembly.
  – New genome: a species with an unknown genome, or a genome believed to be very different from the reference (e.g., cancer).
  – Basically, the short reads are "stitched" together to form long sequences called "contigs".
  – Overlaps among sequence reads are required, so assembly needs a lot of reads (deep coverage).
  – More computationally intensive.

Alignment

• For most people, alignment is the first step of sequencing data analysis, since the machine outputs raw sequence read files (fastq).
• Need: a sequence reads file and a reference genome.
• It is basically a string search problem: where is the short (50-letter) string located within the reference string of 3 billion letters?
• Brute-force searching is okay for a single read, but computationally infeasible for aligning millions of reads.
• Clever algorithms are needed to preprocess the reference genome (indexing), which is beyond the scope of this class.
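To give a flavor of why preprocessing the reference helps, here is a minimal sketch using a k-mer hash index with exact matching. This is a deliberately simpler technique than the Burrows-Wheeler-based indexes used by the aligners discussed in this lecture, and it ignores mismatches and gaps. The function names and the choice k = 5 are illustrative.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-letter substring of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def align_exact(read, reference, index, k):
    """Look up the read's first k letters, then verify the full read at each hit."""
    hits = []
    for pos in index.get(read[:k], []):
        if reference.startswith(read, pos):
            hits.append(pos)
    return hits

# First 50 bases of the example sequence shown earlier in the lecture.
reference = "ACAGGTTTGCTGGTGACCAGTTCTTCATGAGGGACCATCTATCACAACAG"
k = 5
index = build_index(reference, k)

read = "GACCAGTTCTTC"
print(index[read[:k]])                        # candidate positions from the index
print(align_exact(read, reference, index, k)) # positions surviving verification
```

The index is built once and reused for every read, so each lookup only touches a handful of candidate positions instead of scanning the whole reference.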

Popular alignment software

• Bowtie: fast, but less accurate.
• BWA (Burrows-Wheeler Aligner): same algorithm as Bowtie, but allows gaps in alignments.
  – About 5-10 times slower than Bowtie, but provides better results, especially for paired-end data.
• Maq (Mapping and Assembly with Qualities): with SNP calling capabilities.
• ELAND: Illumina's commercial software.
• A lot of others. See http://en.wikipedia.org/wiki/List_of_sequence_alignment_software for more details.
• Our suggestion: use Bowtie for single-end data, and BWA for paired-end data.

Once you have the reads aligned

• Downstream analyses depend on the purpose.
• Often one wants to manipulate and visualize the alignment results. There are several useful tools:
  – File manipulation (format conversion, counting, etc.): samtools/Rsamtools, BEDTools, bamtools, IGV tools.
  – Visualization: samtools (text version), IGV (Java GUI).

RNA-seq data analysis

A little biological background

Central dogma of molecular biology:
"The central dogma of molecular biology deals with the detailed residue-by-residue transfer of sequential information. It states that information cannot be transferred back from protein to either protein or nucleic acid."

Gene Expression

Gene expression levels are measured through their mRNA abundance.

(Figure: gene → DNA (2 copies) → mRNA (multiple copies) → Protein (multiple copies).)

• Protein: the amount of these matters! But they are difficult to measure.
• mRNA: the amount of these is easy to measure, and it is positively correlated with the protein amount!

High-throughput technologies to measure mRNA abundance

• Traditionally: gene expression microarray.
• Now: RNA-seq.
  – Data have greater dynamic range and higher signal-to-noise ratios.
  – Contain more information: structural changes of genes.
• Results show good correlation between microarray and RNA-seq data.

RNA-seq

• Sequence the set of RNA molecules.
• Scientific goals:
  – Catalogue RNA products.
  – Determine transcriptional structures: alternative splicing, gene fusion, etc.
  – Quantify gene expression: the sequencing version of the gene expression microarray.

Experimental procedures

• Procedures:
  – RNA -> cDNA.
  – Shearing and amplification of cDNA.
  – Sequence the cDNA segments.
• Compared to expression microarray:
  – Clearer signal without hybridization.
  – Better dynamic range for expression.
  – Provides richer information.

RNA-seq data analyses

• Gene expression analysis: compare expression of genes under different biological conditions.
• Structural analysis of the transcriptome: discover and compare the structural variations of mRNA:
  – Alternative splicing.
  – Gene fusion.
• Small RNA analyses.

RNA-seq gene expression analysis

• Goal: compare gene expression between different samples.
• Steps:
  1. Summarization: get a number for each gene to represent its expression level.
  2. Normalization: remove technical artifacts so that data from different samples are comparable.
  3. Differential expression detection: gene-by-gene statistical test, usually based on a negative binomial model.
• Popular software:
  – R/Bioconductor packages: DESeq, edgeR, DSS.
  – Stand-alone: Cufflinks and Cuffdiff.

Differential expression

• The biological motivation is the same as in gene expression microarray: find differentially expressed (DE) genes.
• Microarray methods are not directly applicable (continuous vs. count data), but ideas can be borrowed.
• Usually multiple replicates per sample are needed, so that means and variances can be evaluated.
• The test is carried out gene by gene.
• Two major tasks in developing a test:
  – Within-group variance estimation.
  – Test procedure and inference.

RNA-seq data for DE analysis

• A matrix of integers; rows are genes, columns are samples:

          Cancer  Cancer  Cancer  Cancer  Normal  Normal  Normal  Normal
Gene 1        20      25      16       7       4       1      21      11
Gene 2       493     628     133     450      28      97     462     225
Gene 3       128     146     121      25      30      26     106      80
Gene 4         1       0       0       0       1       0       0       0
Gene 5        58     105      27      40      19      28      88      54
Gene 6         0       0       1       0       0       0       0       0
...
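Before genes are compared across samples, the columns of such a matrix are usually put on a common scale (the normalization step). The sketch below uses the example counts above and a median-of-ratios rule for per-sample size factors. Treat the choice of rule as an assumption made for illustration: the slides do not specify one, and other choices (e.g., total counts) exist. Function and variable names are ours.

```python
import math
from statistics import median

# Example count matrix: rows are genes, columns are samples
# (4 cancer samples followed by 4 normal samples).
counts = [
    [20, 25, 16, 7, 4, 1, 21, 11],
    [493, 628, 133, 450, 28, 97, 462, 225],
    [128, 146, 121, 25, 30, 26, 106, 80],
    [1, 0, 0, 0, 1, 0, 0, 0],
    [58, 105, 27, 40, 19, 28, 88, 54],
    [0, 0, 1, 0, 0, 0, 0, 0],
]

def size_factors(rows):
    """Median-of-ratios size factors, one per sample.

    Each gene's counts are compared with that gene's geometric mean
    across samples; a sample's factor is the median of its ratios.
    Genes with any zero count are skipped (their log is undefined).
    """
    usable = [r for r in rows if all(c > 0 for c in r)]
    log_geo_means = [sum(math.log(c) for c in r) / len(r) for r in usable]
    factors = []
    for j in range(len(rows[0])):
        log_ratios = [math.log(r[j]) - g for r, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

factors = size_factors(counts)
normalized = [[c / s for c, s in zip(row, factors)] for row in counts]
print([round(s, 2) for s in factors])
```

Dividing each column by its size factor gives values on a common scale; count-based tests typically keep the raw counts and use the factors as offsets instead.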

Data model for replicated samples

• For a sample with M replicates, the count for gene i in replicate j, denoted by $Y_{ij}$, is often modeled by the following hierarchical model:

  $Y_{ij} \mid \lambda_i \sim \text{Poisson}(\lambda_i), \quad \lambda_i \sim \text{Gamma}(\alpha, \beta)$

• Marginally, the Gamma-Poisson compound distribution is negative binomial, so the counts for a gene from multiple replicates are modeled as negative binomial:

  $Y_{ij} \sim \text{NB}(\alpha, \beta)$
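The hierarchical model can be checked by simulation: draw a rate from a Gamma distribution, then a count from a Poisson with that rate. The sketch below is only a sanity check under stated assumptions. The Gamma is parameterized by its shape `alpha` and its mean `mu` (so its scale is `mu / alpha`), which is one of several conventions and may differ from the $(\alpha, \beta)$ used above. The Poisson sampler is a simple textbook method suitable only for small rates. All names are illustrative.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def simulate_counts(mu, alpha, n, rng):
    """Gamma-Poisson draws: rate ~ Gamma(shape=alpha, mean=mu), count ~ Poisson(rate)."""
    return [sample_poisson(rng.gammavariate(alpha, mu / alpha), rng) for _ in range(n)]

rng = random.Random(2012)
mu, alpha = 10.0, 2.0
ys = simulate_counts(mu, alpha, 20000, rng)

mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / (len(ys) - 1)
print("sample mean:", round(mean, 2))
print("sample variance:", round(var, 2))
print("mu + mu^2 / alpha:", mu + mu ** 2 / alpha)
```

The sample variance should land far above the sample mean, which is the over-dispersion that a plain Poisson model cannot capture.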

A little more about the NB distribution

• NB is over-dispersed Poisson:
  – Poisson: $\text{var} = \mu$
  – NB: $\text{var} = \mu + \mu^2 \phi$
• Dispersion parameter $\phi$ approximates the squared coefficient of variation: $\phi = \frac{\text{var} - \mu}{\mu^2} \approx \frac{\text{var}}{\mu^2}$
• Dispersion $\phi$ represents the biological variance.
• NB distribution can be parameterized by mean and dispersion.

Excerpt (H. Wu and others):

... "over-dispersion" problem since even non-differentially expressed genes show variation greater than expected by Poisson.

A common model to address the over-dispersion problem is the Negative Binomial (NB) model. The NB model is a Gamma-Poisson mixture and can be interpreted as follows: the Gamma distribution models the unobserved true expression levels in each biological sample, and conditioning on the expression level, the measurement from the sequencing machine follows a Poisson distribution (Anders and Huber, 2010; Robinson and Smyth, 2007; Hardcastle and Kelly, 2010). The variance from a NB distribution depends on the mean $\mu$ in the relationship:

$\text{var} = \mu + \mu^2 \phi$ (1.1)

where the first term represents variance due to Poisson sampling error and the second term represents variance due to variation between biological replicates. The parameter $\phi$ is referred to as the dispersion parameter. Notice that $\phi$ is the reciprocal of the shape parameter in the Gamma distribution, thus is the squared coefficient of variation (CV). Therefore, $\phi$ represents the variation of a gene's expression relative to its mean. This is crucial in the context of gene expression, because although expression is a stochastic phenomenon and under constant regulation, the biological system is also robust and thus the CV is usually small between replicates. Based on RNA-seq data from human populations (Cheung and others, 2010), we observed that 70% of the genes have a CV less than 0.5. CV in inbred animal models or established cell lines is likely even lower.

Most statisticians agree that the over-dispersion problem needs to be addressed. The difference is how the dispersion is modeled, estimated and used in inference. Robinson and Smyth (2008) assumes a common dispersion for all genes and uses information from all genes to estimate a global $\phi$. This stabilizes the estimation for $\phi$, but a common dispersion means identical CV for all genes, while it is known that some genes are more tightly controlled (for example, housekeeping genes) and other genes vary much more relative to their means (for example, immune-modulated and stress-induced genes (Pritchard and others, 2001)). The gene-specific biological variation is reproducible across technologies (Hansen and others, 2011) and is not reflected by a common dispersion. As a result, genes that are naturally more variable are more likely to be reported as DE due to an underestimate of their dispersion. Anders and Huber (2010) introduce DESeq, another method based on the NB model. Instead of assuming the second term in Equation 1.1 to be proportional to $\mu^2$ with a dispersion $\phi$, they let the variance be a smooth function of the mean (with proper offset accounting for sequencing depth). However, their model still assumes that conditioning on the mean expression, the variance is constant.
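The relationship between variance, mean and dispersion suggests a simple moment-based estimate of the dispersion for a single gene: plug the sample mean and sample variance into (var - mean) / mean^2. The sketch below is only an illustration of the formula, not the estimator used by any of the packages named in this lecture; with a handful of replicates it is very noisy. The function name is ours, and the counts are the four cancer samples from the example matrix.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments dispersion: (var - mean) / mean^2, truncated at 0."""
    mu = mean(counts)
    if mu == 0:
        return 0.0  # no reads at all: dispersion cannot be estimated
    var = variance(counts)  # sample variance (divides by n - 1)
    return max((var - mu) / mu ** 2, 0.0)

gene2_cancer = [493, 628, 133, 450]
gene4_cancer = [1, 0, 0, 0]

print(round(moment_dispersion(gene2_cancer), 4))
print(moment_dispersion(gene4_cancer))
```

For the low-count gene the estimate collapses to zero: the data cannot distinguish it from Poisson, which is one motivation for borrowing information across genes.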

Simple ideas for RNA-seq DE analysis

• Transform the data to a continuous scale (e.g., by taking logs), then use microarray methods:
  – Troublesome for genes with low counts.
• For each gene, perform a two-group Poisson or NB test for equal means. But:
  – The number of replicates is usually small, so asymptotic theories don't apply and the results are not reliable.
  – As with microarrays, information from all genes can be combined to improve inference (e.g., variance shrinkage).

Page 46: Introduction*to*Large/Scale* Biomedical*Data*Analysis*web1.sph.emory.edu/users/hwu30/teaching/introBDA/IntroBDA_sgse… · Workflow*of*NGS*data*analysisSec-gen sequencing data analysis

DEseq  (Anders  et  al.  2010,  GB)

• Counts  are  assumed  to  following  NB,  parameterized  by  mean  and  variance:  

• The  variance  is  the  sum  of  shot  noise  and  raw  variance:

• The  raw  variance  is  a  smooth  function  of  the  mean,  so  it  is  assumed  that  genes  with  same  means  will  have  the  same  variances.  

• Hypothesis  testing  using  exact  test:

data, and show that it provides better fits (Section Model). As a result, more balanced selection of differentially expressed genes throughout the dynamic range of the data can be obtained (Section Testing for differential expression). We demonstrate the method by applying it to four data sets (Section Applications) and discuss how it compares to alternative approaches (Section Conclusions).

Results and Discussion

Model

Description

We assume that the number of reads in sample $j$ that are assigned to gene $i$ can be modeled by a negative binomial (NB) distribution,

$$K_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma_{ij}^2), \tag{1}$$

which has two parameters, the mean $\mu_{ij}$ and the variance $\sigma_{ij}^2$. The read counts $K_{ij}$ are non-negative integers. The probabilities of the distribution are given in Supplementary Note A. (All Supplementary Notes are in Additional file 1.) The NB distribution is commonly used to model count data when overdispersion is present [12].

In practice, we do not know the parameters $\mu_{ij}$ and $\sigma_{ij}^2$, and we need to estimate them from the data. Typically, the number of replicates is small, and further modelling assumptions need to be made in order to obtain useful estimates. In this paper, we develop a method that is based on the following three assumptions.

First, the mean parameter $\mu_{ij}$, that is, the expectation value of the observed counts for gene $i$ in sample $j$, is the product of a condition-dependent per-gene value $q_{i,\rho(j)}$ (where $\rho(j)$ is the experimental condition of sample $j$) and a size factor $s_j$,

$$\mu_{ij} = q_{i,\rho(j)} s_j. \tag{2}$$

$q_{i,\rho(j)}$ is proportional to the expectation value of the true (but unknown) concentration of fragments from gene $i$ under condition $\rho(j)$. The size factor $s_j$ represents the coverage, or sampling depth, of library $j$, and we will use the term common scale for quantities, such as $q_{i,\rho(j)}$, that are adjusted for coverage by dividing by $s_j$.

Second, the variance $\sigma_{ij}^2$ is the sum of a shot noise term and a raw variance term,

$$\sigma_{ij}^2 = \underbrace{\mu_{ij}}_{\text{shot noise}} + \underbrace{s_j^2 v_{i,\rho(j)}}_{\text{raw variance}}. \tag{3}$$

Third, we assume that the per-gene raw variance parameter $v_{i,\rho}$ is a smooth function of $q_{i,\rho}$,

$$v_{i,\rho(j)} = v_\rho(q_{i,\rho(j)}). \tag{4}$$

This assumption is needed because the number of replicates is typically too low to get a precise estimate of the variance for gene $i$ from just the data available for this gene. This assumption allows us to pool the data from genes with similar expression strength for the purpose of variance estimation.

The decomposition of the variance in Equation (3) is motivated by the following hierarchical model: We assume that the actual concentration of fragments from gene $i$ in sample $j$ is proportional to a random variable $R_{ij}$, such that the rate that fragments from gene $i$ are sequenced is $s_j r_{ij}$. For each gene $i$ and all samples $j$ of condition $\rho$, the $R_{ij}$ are i.i.d. with mean $q_{i\rho}$ and variance $v_{i\rho}$. Thus, the count value $K_{ij}$, conditioned on $R_{ij} = r_{ij}$, is Poisson distributed with rate $s_j r_{ij}$. The marginal distribution of $K_{ij}$, when allowing for variation in $R_{ij}$, has the mean $\mu_{ij}$ and (according to the law of total variance) the variance given in Equation (3). Furthermore, if the higher moments of $R_{ij}$ are modeled according to a gamma distribution, the marginal distribution of $K_{ij}$ is NB (see, for example, [12], Section 4.2.2).

Fitting

We now describe how the model can be fitted to data. The data are an $n \times m$ table of counts, $k_{ij}$, where $i = 1, \dots, n$ indexes the genes, and $j = 1, \dots, m$ indexes the samples. The model has three sets of parameters:

(i) $m$ size factors $s_j$; the expectation values of all counts from sample $j$ are proportional to $s_j$.

(ii) for each experimental condition $\rho$, $n$ expression strength parameters $q_{i\rho}$; they reflect the expected abundance of fragments from gene $i$ under condition $\rho$, that is, expectation values of counts for gene $i$ are proportional to $q_{i\rho}$.

(iii) The smooth functions $v_\rho : \mathbb{R}^+ \to \mathbb{R}^+$; for each condition $\rho$, $v_\rho$ models the dependence of the raw variance $v_{i\rho}$ on the expected mean $q_{i\rho}$.

The purpose of the size factors $s_j$ is to render counts from different samples, which may have been sequenced to different depths, comparable. Hence, the ratios $(\mathbb{E} K_{ij})/(\mathbb{E} K_{ij'})$ of expected counts for the same gene $i$ in different samples $j$ and $j'$ should be equal to the size ratio $s_j/s_{j'}$ if gene $i$ is not differentially expressed or samples $j$ and $j'$ are replicates. The total number of reads, $\sum_i k_{ij}$, may seem to be a good measure of sequencing depth and hence a reasonable choice for $s_j$. Experience with real data, however, shows this not always to be the case, because a few highly and differentially expressed genes may have strong influence on the total read count, causing the ratio of total read counts not to be a good estimate for the ratio of expected counts.

Anders and Huber, Genome Biology 2010, 11:R106, http://genomebiology.com/2010/11/10/R106


Hence, to estimate the size factors, we take the median of the ratios of observed counts. Generalizing the procedure just outlined to the case of more than two samples, we use:

$$\hat{s}_j = \operatorname*{median}_i \frac{k_{ij}}{\left(\prod_{v=1}^{m} k_{iv}\right)^{1/m}}. \tag{5}$$

The denominator of this expression can be interpreted as a pseudo-reference sample obtained by taking the geometric mean across samples. Thus, each size factor estimate $\hat{s}_j$ is computed as the median of the ratios of the $j$-th sample's counts to those of the pseudo-reference. (Note: While this manuscript was under review, Robinson and Oshlack [13] suggested a similar method.)

To estimate $q_{i\rho}$, we use the average of the counts from the samples $j$ corresponding to condition $\rho$, transformed to the common scale:

$$\hat{q}_{i\rho} = \frac{1}{m_\rho} \sum_{j:\rho(j)=\rho} \frac{k_{ij}}{\hat{s}_j}, \tag{6}$$

where $m_\rho$ is the number of replicates of condition $\rho$ and the sum runs over these replicates. To estimate the functions $v_\rho$, we first calculate sample variances on the common scale

$$w_{i\rho} = \frac{1}{m_\rho - 1} \sum_{j:\rho(j)=\rho} \left(\frac{k_{ij}}{\hat{s}_j} - \hat{q}_{i\rho}\right)^2 \tag{7}$$

and define

$$z_{i\rho} = \frac{\hat{q}_{i\rho}}{m_\rho} \sum_{j:\rho(j)=\rho} \frac{1}{\hat{s}_j}. \tag{8}$$

In Supplementary Note B in Additional file 1 we show that $w_{i\rho} - z_{i\rho}$ is an unbiased estimator for the raw variance parameter $v_{i\rho}$ of Equation (3).

However, for small numbers of replicates, $m_\rho$, as is typically the case in applications, the values $w_{i\rho}$ are highly variable, and $w_{i\rho} - z_{i\rho}$ would not be a useful variance estimator for statistical inference. Instead, we use local regression [14] on the graph $(\hat{q}_{i\rho}, w_{i\rho})$ to obtain a smooth function $w_\rho(q)$, with

$$\hat{v}_\rho(\hat{q}_{i\rho}) = w_\rho(\hat{q}_{i\rho}) - z_{i\rho} \tag{9}$$

as our estimate for the raw variance.

Some attention is needed to avoid estimation biases in the local regression. $w_{i\rho}$ is a sum of squared random variables, and the residuals $w_{i\rho} - w_\rho(\hat{q}_{i\rho})$ are skewed. Following References [15], Chapter 8 and [14], Section 9.1.2, we use a generalized linear model of the gamma family for the local regression, using the implementation in the locfit package [16].
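As a rough illustration of Equations (5) to (8), the following base R sketch computes median-of-ratios size factors and the common-scale quantities for a small made-up count table. The table `k`, the helper name `size_factors`, and the choice to treat all three columns as replicates of one condition are assumptions for illustration only; this is not the DESeq implementation, and the local regression of Equation (9) is left out.

```r
# Toy count table: rows are genes, columns are samples (values are made up).
# Sample s2 is s1 sequenced twice as deep; s3 equals s1 except for one
# strongly up-regulated gene, which inflates its total read count.
k <- cbind(s1 = c(10, 20, 30, 40, 50),
           s2 = c(20, 40, 60, 80, 100),
           s3 = c(10, 20, 30, 40, 5000))

# Equation (5): median over genes of the ratio to the per-gene geometric mean.
size_factors <- function(k) {
  log_geo_mean <- rowMeans(log(k))   # log of the pseudo-reference sample
  keep <- is.finite(log_geo_mean)    # skip genes with a zero count
  apply(k, 2, function(x) exp(median(log(x[keep]) - log_geo_mean[keep])))
}

s_hat <- size_factors(k)

# Equations (6) to (8), treating all columns as replicates of one condition.
common <- sweep(k, 2, s_hat, "/")    # counts on the common scale, k_ij / s_j
q_hat  <- rowMeans(common)           # Equation (6)
w      <- apply(common, 1, var)      # Equation (7), denominator m - 1
z      <- q_hat * mean(1 / s_hat)    # Equation (8)
```

In this toy table the size factors of `s1` and `s3` agree even though their total read counts (150 and 5100) differ by a factor of 34, which is the situation the paragraph on total read counts warns about.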

Testing for differential expression

Suppose that we have $m_A$ replicate samples for biological condition A and $m_B$ samples for condition B. For each gene $i$, we would like to weigh the evidence in the data for differential expression of that gene between the two conditions. In particular, we would like to test the null hypothesis $q_{iA} = q_{iB}$, where $q_{iA}$ is the expression strength parameter for the samples of condition A, and $q_{iB}$ for condition B. To this end, we define, as test statistic, the total counts in each condition,

$$K_{iA} = \sum_{j:\rho(j)=A} K_{ij}, \qquad K_{iB} = \sum_{j:\rho(j)=B} K_{ij}, \tag{10}$$

and their overall sum $K_{iS} = K_{iA} + K_{iB}$. From the error model described in the previous Section, we show below that, under the null hypothesis, we can compute the probabilities of the events $K_{iA} = a$ and $K_{iB} = b$ for any pair of numbers $a$ and $b$. We denote this probability by $p(a, b)$. The P value of a pair of observed count sums $(k_{iA}, k_{iB})$ is then the sum of all probabilities less or equal to $p(k_{iA}, k_{iB})$, given that the overall sum is $k_{iS}$:

$$p_i = \frac{\displaystyle\sum_{\substack{a+b=k_{iS} \\ p(a,b) \le p(k_{iA},k_{iB})}} p(a,b)}{\displaystyle\sum_{a+b=k_{iS}} p(a,b)}. \tag{11}$$

The variables $a$ and $b$ in the above sums take the values $0, \dots, k_{iS}$. The approach presented so far follows that of Robinson and Smyth [11] and is analogous to that taken by other conditioned tests, such as Fisher's exact test. (See Reference [17], Chapter 3 for a discussion of the merits of conditioning in tests.)

Computation of $p(a, b)$. First, assume that, under the null hypothesis, counts from different samples are independent. Then, $p(a, b) = \Pr(K_{iA} = a) \Pr(K_{iB} = b)$. The problem thus is computing the probability of the event $K_{iA} = a$, and, analogously, of $K_{iB} = b$. The random variable $K_{iA}$ is the sum of $m_A$ NB-distributed random variables. We approximate its distribution by a NB distribution whose parameters we obtain from those of the $K_{ij}$. To this end, we first compute the pooled mean estimate from the counts of both conditions,

$$\hat{q}_{i0} = \sum_{j:\rho(j)\in\{A,B\}} k_{ij}/\hat{s}_j, \tag{12}$$

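The P value of Equation (11) can be sketched in a few lines of base R once the null NB distributions of the two count sums are available. In this sketch the means and variances of $K_{iA}$ and $K_{iB}$ are simply assumed to be known (the excerpt stops before showing how they are derived), the function name `nb_exact_pvalue` is hypothetical, and the conversion to the `size` parameter of `dnbinom()` uses variance = mu + mu^2 / size.

```r
# P value of Equation (11) for one gene.
# kA, kB: observed count sums in conditions A and B.
# muA, varA, muB, varB: NB mean and variance of the count sums under the null
# (assumed known here; each variance must exceed its mean).
nb_exact_pvalue <- function(kA, kB, muA, varA, muB, varB) {
  kS <- kA + kB
  a  <- 0:kS                           # all splits with a + b = kS
  b  <- kS - a
  p_ab <- dnbinom(a, size = muA^2 / (varA - muA), mu = muA) *
          dnbinom(b, size = muB^2 / (varB - muB), mu = muB)
  p_obs <- p_ab[kA + 1]                # probability of the observed pair
  # Small relative tolerance so that ties with the observed pair are included.
  sum(p_ab[p_ab <= p_obs * (1 + 1e-7)]) / sum(p_ab)
}

p_balanced <- nb_exact_pvalue(kA = 50, kB = 50, muA = 50, varA = 300,
                              muB = 50, varB = 300)
p_extreme  <- nb_exact_pvalue(kA = 5, kB = 95, muA = 50, varA = 300,
                              muB = 50, varB = 300)
```

With identical null distributions for both conditions, the balanced split (50, 50) is the most probable one, so `p_balanced` is 1, while the lopsided split (5, 95) receives a much smaller P value.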


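Bioconductor package DESeq

• Inputs are:
– an integer matrix of gene counts, with rows for genes and columns for samples.
– the experimental design: a condition label for each column (sample).

    library(DESeq)

    conds = c(0,0,0,1,1,1)                 # condition label for each column of data
    cds = newCountDataSet(data, conds)
    cds = estimateSizeFactors(cds)
    cds = estimateVarianceFunctions(cds)   # named estimateDispersions() in later DESeq releases
    fit.DEseq = nbinomTest(cds, 0, 1)      # exact test, condition 0 versus condition 1
    pval.DEseq = fit.DEseq$pval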


edgeR

• From a series of papers by Robinson et al. (the same group that developed limma): 2007 Bioinformatics, 2008 Biostatistics, 2010 Bioinformatics.

• Empirical Bayes ideas are used to "shrink" gene-specific estimates and obtain better estimates of the variances.

• The parameter to shrink is the over-dispersion (Φ) of the NB distribution, which controls the within-group variances.

• There is no conjugate prior, so shrinkage is not straightforward.

• A conditional weighted likelihood approach is used to establish an approximate EB estimator for Φ.


Bioconductor package edgeR

• Inputs are the same as for DESeq: an integer matrix of counts and column labels for the design.

    library(edgeR)

    d = DGEList(counts=data, group=c(0,0,0,1,1,1), lib.size=colSums(data))
    d = calcNormFactors(d)
    d = estimateCommonDisp(d)
    d = estimateTagwiseDisp(d, trend=TRUE)
    fit.edgeR = exactTest(d)
    pval.edgeR = fit.edgeR$table$p.value   # the column is named PValue in later edgeR releases


DSS  (Wu  et  al.  2012,  Biostatistics)  

• Found that the shrinkage in DESeq and edgeR is too strong.

[Figure: estimated dispersion versus true dispersion, both axes on a log scale from 0.005 to 5. Panel (a): edgeR. Panel (b): DESeq.]


A  hierarchical  model  for  the  counts

• Ygi: observed counts for gene g, sample i.
• Θgi: unobserved true expression for gene g, sample i.
• Φg: dispersion (related to biological variance) for gene g.
• si: library size for sample i.
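A quick way to get a feel for this model is to simulate from it. The sketch below assumes the Poisson and Gamma layers of the hierarchical model (counts are Poisson given the true expression, and the true expression is Gamma distributed with a given mean and dispersion) and checks that the simulated counts have variance close to mean + mean² × dispersion. The numerical values of `mu`, `phi` and `s` are made up for illustration.

```r
set.seed(1)

# Illustrative values, not taken from the slides.
mu  <- 50      # mean expression of the gene in its treatment group
phi <- 0.2     # dispersion of the gene
s   <- 2       # library size (normalizing factor) of the sample
n   <- 100000  # number of simulated draws

# True expression: Gamma with mean mu and dispersion phi (shape = 1 / phi).
theta <- rgamma(n, shape = 1 / phi, scale = mu * phi)
# Observed counts: Poisson with rate theta * s.
y <- rpois(n, lambda = theta * s)

# Marginally the counts are NB with mean mu * s and variance mean + mean^2 * phi.
expected_mean <- mu * s
expected_var  <- expected_mean + expected_mean^2 * phi
c(mean = mean(y), expected_mean = expected_mean,
  var = var(y), expected_var = expected_var)
```

The first variance term is the Poisson sampling noise and the second is the extra variation contributed by the Gamma layer, which is why the dispersion is read as a measure of biological variance.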

8 H. WU AND OTHERS

the estimates do not correlate well with the truth. Although both edgeR and DESeq estimate the central magnitude of the dispersions well, they fail to capture the variations in dispersions among genes. The near horizontal clusters of points show that the dispersions of many genes are estimated to be approximately the same, which indicates over-shrinkage of the dispersions.

Since the dispersion parameter represents biological variation, having a good estimate of $\phi$ is crucial for finding DE that is beyond natural biological variation. In order to account for gene specific dispersion, we take advantage of real RNA-seq data sets with replicates and notice that the distribution of the logarithm of sample dispersion, $\log(\phi)$, is approximately Gaussian as shown in Figure 2. In the next section we describe a novel shrinkage estimator for $\phi$ using a log-normal prior and a NB likelihood. In Section 5 we show that the proposed method greatly improves the estimation of $\phi$ and leads to better detection of DE.

4. METHODS

Denote the observed counts from gene $g$ in sample $i$ by $Y_{gi}$, and the unobserved expression rate by $\theta_{gi}$. We assume the following hierarchical model:

$$Y_{gi} \mid \theta_{gi} \sim \mathrm{Poisson}(\theta_{gi} s_i)$$
$$\theta_{gi} \mid \phi_g \sim \mathrm{Gamma}(\mu_{g,k(i)}, \phi_g)$$
$$\phi_g \sim \text{log-normal}(m_0, \tau^2)$$

Here the Gamma distribution is parametrized with mean and dispersion ($\phi_g$ is the reciprocal of the shape parameter in the common parameterization). $k(i)$ denotes the treatment group of sample $i$.


The  posterior

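• With the negative binomial parameterized by mean and dispersion, the log posterior for the dispersion satisfies:

$$\log[p(\phi_g \mid Y_{gi}, \gamma_{gi}, i = 1, \dots, n)] \propto \sum_i \psi(\phi_g^{-1} + Y_{gi}) - n\psi(\phi_g^{-1}) - \phi_g^{-1} \sum_i \log(1 + \gamma_{gi}\phi_g) + \sum_i Y_{gi}[\log(\gamma_{gi}\phi_g) - \log(1 + \gamma_{gi}\phi_g)] - \frac{[\log(\phi_g) - m_0]^2}{2\tau^2} - \log(\phi_g) - \log(\tau)$$

• Here, $\gamma_{gi} = \mu_{g,k(i)} s_i$ is the expected value for $Y_{gi}$, and $\psi(\cdot)$ is the log gamma function.
• It is a penalized likelihood that penalizes (1) dispersions far away from the prior mean and (2) large dispersions.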
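The penalized likelihood above can be maximized numerically for one gene with base R's `optimize()`. The sketch below is an illustration under stated assumptions rather than the DSS implementation: the counts, normalizing factors and hyperparameters `m0` and `tau` are made up, the function names are hypothetical, the NB means are replaced by plug-in estimates from the group means, and the search is done on the log scale so that the dispersion stays positive.

```r
# Log posterior of the dispersion phi for one gene, up to an additive constant.
# y: counts; gamma: NB means; m0, tau: log-normal prior parameters
# (assumed to have been estimated beforehand).
log_posterior <- function(phi, y, gamma, m0, tau) {
  n <- length(y)
  sum(lgamma(1 / phi + y)) - n * lgamma(1 / phi) -
    (1 / phi) * sum(log(1 + gamma * phi)) +
    sum(y * (log(gamma * phi) - log(1 + gamma * phi))) -
    (log(phi) - m0)^2 / (2 * tau^2) - log(phi) - log(tau)
}

# Posterior mode of phi, found by a one-dimensional search over log(phi).
shrunk_dispersion <- function(y, s, group, m0, tau) {
  mu_hat    <- ave(y / s, group)   # group means on the common scale
  gamma_hat <- mu_hat * s          # plug-in NB means
  fit <- optimize(function(x) log_posterior(exp(x), y, gamma_hat, m0, tau),
                  interval = c(-10, 5), maximum = TRUE)
  exp(fit$maximum)
}

y     <- c(12, 30, 21, 55, 90, 40)   # made-up counts for one gene
s     <- c(1.0, 1.2, 0.9, 1.1, 1.0, 0.8)
group <- c(0, 0, 0, 1, 1, 1)

phi_weak   <- shrunk_dispersion(y, s, group, m0 = log(0.1), tau = 2)
phi_strong <- shrunk_dispersion(y, s, group, m0 = log(0.1), tau = 0.01)
```

With a very small `tau` the prior dominates and `phi_strong` lands essentially at `exp(m0)`, while a large `tau` lets the data of the gene drive the estimate. This is the shrinkage behaviour that the penalty terms encode.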

$s_i$ represents the normalizing factor, such as the relative library size, for sample $i$. This normalization factor could also be a gene specific factor, taking into account varying sequencing efficiency due to GC content and/or length (Hansen and others, 2012; Risso and others, 2012), but would nonetheless be treated as a known constant. Based on the hierarchical model, the marginal distribution of $Y_{gi}$ given $\mu_{g,k(i)}$ and $\phi_g$ is NB with mean $\gamma_{gi} = \mu_{g,k(i)} s_i$ and dispersion $\phi_g$. In the simplest setting where there are only two groups, the goal of DE detection is to test, for each gene, whether the mean expressions are identical in both groups, e.g., $\mu_{g,1} = \mu_{g,2}$, as noted by other researchers (Robinson and Smyth, 2007, 2008; Robinson and others, 2010; Anders and Huber, 2010).

Estimating $\phi_g$ is a crucial step in DE detection and shrinkage estimators have been shown to be useful in typical RNA-seq experiments with small number of replicates (Robinson and others, 2010; Anders and Huber, 2010; Hardcastle and Kelly, 2010). As discussed in previous works (Robinson and Smyth, 2007), there is no conjugate prior for $\phi_g$ to ease the computation of a posterior distribution. Thus we decide to use a prior that approximates the $\phi_g$ in real RNA-seq data. We chose the log-normal distribution, motivated by Figure 2.

For gene $g$, the conditional posterior distribution of $\phi_g$ given all observed counts and means, $p(\phi_g \mid Y_{gi}, \gamma_{gi}, i = 1, \dots, n)$, satisfies (detailed derivation can be found in the Supplementary Materials):

$$\log[p(\phi_g \mid Y_{gi}, \gamma_{gi}, i = 1, \dots, n)] \propto \sum_i \psi(\phi_g^{-1} + Y_{gi}) - n\psi(\phi_g^{-1}) - \phi_g^{-1} \sum_i \log(1 + \gamma_{gi}\phi_g) + \sum_i Y_{gi}[\log(\gamma_{gi}\phi_g) - \log(1 + \gamma_{gi}\phi_g)] - \frac{[\log(\phi_g) - m_0]^2}{2\tau^2} - \log(\phi_g) - \log(\tau), \tag{4.1}$$

where $\psi(\cdot)$ is the log gamma function and $n$ is the number of samples. To obtain a point estimate of $\phi_g$, we could compute the posterior mean by using numerical methods such as importance sampling. It is however too computationally intensive to be practical for RNA-seq data. Instead we compute the posterior mode by maximizing an approximation of Equation 4.1 and denote the estimate $\tilde{\phi}_g$. Specifically, we substitute $\gamma_{gi}$ by $\hat{\gamma}_{gi} \equiv \hat{\mu}_{g,k(i)} s_i$ where $\hat{\mu}_{g,k(i)} = \frac{\sum_{j:k(j)=k(i)} Y_{gj}/s_j}{n_{k(i)}}$, and $n_{k(i)}$ is the number of samples in the same treatment group as $i$. The estimate $\hat{\mu}_{g,k(i)}$ is the same as the sample estimate of true concentration in Anders and Huber (2010). We plug in the hyperparameters $m_0, \tau^2$ estimated from the data, and maximize the approximated Equation 4.1 using the Newton-Raphson method. In practice we use the "optimize" function in R, which provides excellent computational performance.

The hyperparameters $m_0, \tau^2$ are estimated from the data by pooling information from all genes (details in the next subsection). The estimate $\tilde{\phi}_g$ is therefore an empirical Bayes estimate shrunken towards the common prior. It can also be viewed as an estimate that maximizes a penalized pseudo likelihood, as we are maximizing an approximation of the penalized likelihood (Equation 4.1), with penalty $-[\log(\phi_g) - m_0]^2/(2\tau^2) - \log(\phi_g)$, by replacing the nuisance parameter $\gamma_{gi}$ by its estimate $\hat{\gamma}_{gi}$. The first term in the penalty, $[\log(\phi_g) - m_0]^2/(2\tau^2)$, penalizes values that deviate far from the common prior $m_0$


Testing  and  inference  in  two-­‐group  comparison  

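• Wald test:

$$t_g = \frac{\hat{\mu}_{g,1} - \hat{\mu}_{g,2}}{\sqrt{\hat{\sigma}_{g,1}^2 + \hat{\sigma}_{g,2}^2}}$$

– With dispersions, variances can be computed according to the NB distribution: $\mathrm{var} = \mu + \mu^2\phi$.

– The variance for $\hat{\mu}_{g,1}$ is: $\hat{\sigma}_{g,1}^2 = \frac{1}{n_1^2}\left[\hat{\mu}_{g,1}\left(\sum_{j:k(j)=1} \frac{1}{s_j}\right) + n_1 \hat{\mu}_{g,1}^2 \tilde{\phi}_g\right]$, and similarly for group 2.

• Inferences: use normal P-values and local FDR.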
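The Wald statistic for one gene can be written out directly in base R. The sketch below assumes the dispersion estimate `phi` is already available, estimates each group mean as the average of counts divided by their normalizing factors, uses the variance (mean × sum of 1/s + n × mean² × dispersion) / n² for each group mean, and converts the statistic to a two-sided normal P value. The function name `wald_two_group` and the example counts are hypothetical, and the local FDR step is not covered.

```r
# Wald statistic for one gene in a two-group comparison.
# y: counts; s: normalizing factors; group: labels 1 or 2;
# phi: shrunken dispersion estimate for this gene (assumed given).
wald_two_group <- function(y, s, group, phi) {
  est <- function(g) {
    idx <- group == g
    n   <- sum(idx)
    mu  <- mean(y[idx] / s[idx])     # group mean on the common scale
    v   <- (mu * sum(1 / s[idx]) + n * mu^2 * phi) / n^2  # variance of mu
    c(mu = mu, var = v)
  }
  e1 <- est(1)
  e2 <- est(2)
  stat <- unname((e1["mu"] - e2["mu"]) / sqrt(e1["var"] + e2["var"]))
  c(stat = stat, pval = 2 * pnorm(-abs(stat)))  # two-sided normal P value
}

res <- wald_two_group(y = c(12, 30, 21, 55, 90, 40),
                      s = rep(1, 6),
                      group = c(1, 1, 1, 2, 2, 2),
                      phi = 0.1)
```

With all normalizing factors equal to 1, as in this call, each group variance reduces to (mean + mean² × dispersion) / n, which is the NB variance of a single count divided by the group size.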

"over-dispersion" problem since even non-differentially expressed genes show variation greater than expected by Poisson.

A common model to address the over-dispersion problem is the Negative Binomial (NB) model. The NB model is a Gamma-Poisson mixture and can be interpreted as the following: the Gamma distribution models the unobserved true expression levels in each biological sample, and conditioning on the expression level, the measurement from the sequencing machine follows a Poisson distribution (Anders and Huber, 2010; Robinson and Smyth, 2007; Hardcastle and Kelly, 2010). The variance from a NB distribution depends on the mean in the relationship:

$$\mathrm{var} = \mu + \mu^2\phi \tag{1.1}$$

where the first term represents variance due to Poisson sampling error and the second term represents variance due to variation between biological replicates. The parameter $\phi$ is referred to as the dispersion parameter. Notice that $\phi$ is the inverse of the shape parameter in the Gamma distribution, thus is the squared coefficient of variation. This is crucial in the context of gene expression, because although expression is a stochastic phenomenon and under constant regulation, the biological system is also robust and thus the CV is usually small between replicates. Based on RNA-seq data from human populations (Cheung and others, 2010), we observed that 70% of the genes have a CV less than 0.5. CVs in inbred animal models or established cell lines are likely even lower.

HAO: do we have the log(phi.hat) from high counts and can we show that it looks

somewhat normal? This can be our motivation for using a lognormal prior.

Most statisticians agree that the over-dispersion problem needs to be addressed. The

difference is how the dispersion is modeled, estimated and used in inference. Robinson

pseudo datasets we obtain the mean $S^2$ as the baseline. Lastly, we estimate $\tau^2$ as $\hat{\tau}^2 = \max([\mathrm{IQR}\{\log(\hat{\phi}_g)\}/1.349]^2 - S^2, c_0)$. Here $c_0$ is the lower bound for $\hat{\tau}^2$. In practice we use $c_0 = 0.01$.

Notice that $\hat{\phi}_g$ is only used in estimating the hyper-parameters and not directly used to obtain the shrinkage estimate $\tilde{\phi}_g$. Although each $\hat{\phi}_g$ is a rather crude estimate of the dispersion, especially for small number of samples, the large number of genes in RNA-seq data allows us to obtain stable estimates of the hyper-parameters. Once the prior is established in the empirical Bayesian method, our shrinkage estimate $\tilde{\phi}_g$ directly comes from maximizing Equation 4.1.

Test statistic and False Discovery Rate

With all parameters estimated, a hypothesis test for the comparison of any two groups can be carried out using a Wald test or exact test. Previous reports (Robinson and Smyth, 2007) and our simulation show that both procedures provide similar performances. We choose to use the Wald test for its better computational performance and simplicity. In addition, it can be generalized to more complex experimental designs.

The Wald test statistic for two-group comparison is constructed as:

$$t_g = \frac{\hat{\mu}_{g,1} - \hat{\mu}_{g,2}}{\sqrt{\hat{\sigma}_{g,1}^2 + \hat{\sigma}_{g,2}^2}}$$

where $\hat{\sigma}_{g,1}^2 \equiv \frac{1}{n_1^2}\left[\hat{\mu}_{g,1}\left(\sum_{j:k(j)=1} \frac{1}{s_j}\right) + n_1 \hat{\mu}_{g,1}^2 \tilde{\phi}_g\right]$ is the estimated variance for $\hat{\mu}_{g,1}$ (detailed derivation in Supplementary Materials), and $\hat{\sigma}_{g,2}^2$ is similarly defined. When there is no need for normalization ($s_j = 1\ \forall j$), $\hat{\sigma}_{g,1}^2$ reduces to $(\hat{\mu}_{g,1} + \hat{\mu}_{g,1}^2 \tilde{\phi}_g)/n_1$, an estimate of the variance in Equation 1.1.

It is not trivial to derive the null distribution for the test statistic and asymptotic



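DSS Bioconductor package

• Inputs are the same as for DESeq and edgeR: an integer matrix of counts and column labels for the design.

    library(DSS)

    conds = c(0,0,0,1,1,1)               # condition label for each column of X
    seqData = newSeqCountSet(X, conds)   # X: integer matrix of counts
    seqData = estNormFactors(seqData)
    seqData = estDispersion(seqData)
    result = waldTest(seqData, 0, 1)     # Wald test, condition 0 versus condition 1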


Summary for RNA-seq DE test

• Based on my experience and simulation results:
– All methods provide very similar results when the variation of dispersions is small.
– When the variation of dispersions is large, DSS outperforms the other methods.

• Testing procedures for multiple-factor designs are still not well developed; they are mostly based on GLMs.

• There is still room for statistical research.


Review  of  this  class

• I have covered some aspects of next generation sequencing:
– Biological backgrounds.
– Sequencing technologies.
– Brief introduction to NGS applications: DNA-seq, RNA-seq and ChIP-seq.
– RNA-seq analyses: motivation, statistical methods for DE analysis (very briefly), useful software.