Implementation of computational pipelines to support next gen applications

Winter School in Mathematical and Computational Biology, 6th – 10th July 2009
Roberto Barrero ([email protected])
miR-Seq: microRNA profiling; ChIP-Seq: Chromatin modification; Bi-Seq: DNA methylation analysis


  • Implementation of computational pipelines to support next gen applications

    Winter School in Mathematical and Computational Biology, 6th – 10th July 2009

    Roberto Barrero ([email protected])

    miR-Seq: microRNA profiling
    ChIP-Seq: Chromatin modification
    Bi-Seq: DNA methylation analysis

  • Bioplatforms Australia

    NCRIS 5.1 Evolving Biomolecular Platforms and Informatics

  • NCRIS 5.1

    Australian Bioinformatics Facility

    Platforms: Genomics Australia, Proteomics Australia, Metabolomics Australia
    (each with Embedded Activities, Non-Embedded Activities, and Projects 1 to 3)

    Development of cross-omics Platform Projects
    Development of cross-NCRIS Investment Projects

    NCRIS 5.16 NeAT: BioNeAT (pending)
    NCRIS 5.16 NCI: Specialised Facility in Bioinformatics (pending)

  • Overview

    •  Implementation of a short read mapping pipeline
       –  Benchmarking of freely available aligners
    •  miR-Seq: Profiling of miRNAs and miRNAs*
    •  ChIP-Seq: Defining genomic regions associated with histone modifications
    •  Bi-Seq: Determining genome-wide CpG, CHG and CHH methylation marks


  • Initial list of available tools (as of April 2008)

    Tool      | Name                                               | Performance | Sanger Capillary | ILLUMINA | SOLiD | 454 | Finds Mismatches | Finds Indels | Uses Quality Information | Tested platforms
    ELAND     | Large-Scale Alignment of Nucleotide Databases      | FAST        | N                | Y        | N     | N   | Y                | N            | Y                        | Linux, OSX
    GMAP      | Genomic Mapping and Alignment Program              | FAST        | Y                | N        | N     | Y   | Y                | Y            | N                        | Linux
    MOSAIK    | Reference guided aligner/assembler                 | SLOW        | Y                | Y        | N     | Y   | Y                | Y            | Y                        | Linux, OSX
    SHRiMP    | Maps short reads to a reference sequence           | SLOW        | N                | Y        | Y     | N   | Y                | Y            | N                        | Linux
    MAQ       | Mapping and Assembly with Qualities                | FAST        | N                | Y        | Y     | N   | Y                | Y            | Y                        | All BSD Platforms/Linux/OSX
    NOVOALIGN | Genomic Mapping and SNP/indel finder               | FAST        | N                | Y        | N     | N   | Y                | Y            | Y                        | Linux-64/OSX
    SOAP      | Short Oligonucleotide Alignment Program            | VARIABLE    | N                | Y        | N     | N   | Y                | Y            | N                        | Linux-64/32/OSX
    SSAHA     | Sequence Search and Alignment by Hashing Algorithm | FAST        | Y                | Y        | N     | Y   | Y                | N            | N                        | Linux

  • Mapping Tools

    •  ELAND performs ungapped alignment of SE/PE reads up to 32 nt in length and generates accurate mapping qualities.
    •  MAQ uses probability models to measure the alignment quality of each read using sequence quality information.
    •  SHRiMP uses seeding and a Smith-Waterman algorithm to align short reads to a reference genome.
    •  RMAP maps reads taking base-call quality scores into account to determine important positions.
    •  NOVOALIGN finds globally optimal alignments using the full Needleman-Wunsch algorithm with affine gap penalties.
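To make the phrase "Needleman-Wunsch with affine gap penalties" concrete, here is a minimal sketch of global alignment scoring with the standard three-matrix (Gotoh) recurrence. It is an illustration only, not NOVOALIGN's implementation; the function name and the scoring values (match, mismatch, gap_open, gap_extend) are assumptions chosen for the example.

```python
def affine_global_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best global alignment score; a gap of length k costs gap_open + k * gap_extend.

    Scoring values are illustrative assumptions, not those of any real aligner.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap
    M = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    X = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    Y = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a gap pays gap_open once; extending an existing gap does not.
            X[i][j] = max(M[i - 1][j] + gap_open + gap_extend,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open + gap_extend,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


# One 2-base deletion is scored as a single gap (open once, extend twice).
print(affine_global_score("ACGTTTACGT", "ACGTACGT"))
```

With affine penalties a single long indel is cheaper than several scattered one-base gaps, which is usually the more plausible explanation for a short read that differs from the reference.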

  • Genome coverage range by distinct next gen applications

    [Figure labels: Bi-Seq, ChIP-Seq, Small RNAs, Genome]

  • Generation of Simulated Short Reads (1)

    Tool: MAQ-simulate
    DNA template: Human chromosome 22
    Read length: 36 bases
    Mutation rates: 0.1% up to 16%
    Number of reads: 70,000 x 3 per mutation rate
    Number of SNPs: 220~3,500
    Indels: 10% probability of SNPs
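The simulation setup above can be sketched in a few lines of Python. This is not MAQ's simulator: it is a toy model that assumes each template base mutates independently at the given rate and that 10% of mutations are indels (one reading of "Indels: 10% probability of SNPs"). The function names and the random toy template are hypothetical.

```python
import random


def mutate_template(template, rate, indel_fraction=0.1, rng=None):
    """Return a mutated copy of `template` (toy model, not MAQ's algorithm)."""
    rng = rng or random.Random(0)
    out = []
    for base in template:
        if rng.random() >= rate:
            out.append(base)              # base left unchanged
        elif rng.random() < indel_fraction:
            if rng.random() < 0.5:
                continue                  # deletion: drop the base
            out.append(base)
            out.append(rng.choice("ACGT"))  # insertion after the base
        else:
            # substitution by a different nucleotide
            out.append(rng.choice([b for b in "ACGT" if b != base]))
    return "".join(out)


def sample_reads(genome, n_reads, read_length=36, rng=None):
    """Draw fixed-length reads from uniformly chosen start positions."""
    rng = rng or random.Random(0)
    last_start = len(genome) - read_length
    return [genome[s:s + read_length]
            for s in (rng.randint(0, last_start) for _ in range(n_reads))]


rng = random.Random(42)
template = "".join(rng.choice("ACGT") for _ in range(5000))  # stand-in for chr22
mutated = mutate_template(template, rate=0.01, rng=rng)
reads = sample_reads(mutated, n_reads=1000, rng=rng)
print(len(reads), len(reads[0]))
```

Because the true origin of every simulated read is known, mapping accuracy at each mutation rate can be measured directly, which is the point of benchmarking on simulated data.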

  • Single-end mapping performance at various mutation rates

    •  70,000 reads
    •  Triplicate

  • Paired-end mapping and SNP calling

  • Mapping performance of real data

    HapMap project NA18507 10.2 million SE reads

    5.1 million PE reads

  • Generation of Simulated Short Reads (2)

    Template: Arabidopsis thaliana (chr 5)
    36 bp read datasets were generated using MAQ-simulate
    Selected tools: NOVOALIGN, MAQ, BOWTIE, BWA

    Dataset                        | Simulated Set 1 | Simulated Set 2
    Mutation rate                  | 0.1%            | 1.0%
    Number of single-end reads (1) | 8,453,489       | 8,453,489
    Number of paired-end reads (2) | 16,906,978      | 16,906,978
    Number of insertions           | 1,512           | 15,131
    Number of deletions            | 1,514           | 15,166
    Total number of indels         | 3,026           | 30,297
    Number of heterozygous SNPs    | 18,024          | 182,055
    Number of homozygous SNPs      | 9,034           | 90,985
    Total number of SNPs           | 27,058          | 273,040
    Total number of SNPs + indels  | 30,084          | 303,337

    (1) Total number of single-end (SE) reads utilized in the comparisons
    (2) Total number of paired-end (PE) reads utilized in the comparisons

  • Mapping

    (Colin et al. Submitted)

  • SNPs

    (Colin et al. Submitted)

  • Indels

    (Colin et al. Submitted)

  • Run Time

    (Colin et al. Submitted)

  • Benchmarking Conclusion

    NOVOALIGN is the best overall aligner for mapping both SE and PE reads as well as SNP calling and indel detection.

  • microRNA pathway

    miRNA function in Drosophila:
    •  Cell proliferation/anti-apoptosis (bantam)
    •  Fat storage/anti-apoptosis (mir-14)
    •  Homeostasis/anti-apoptosis (mir-278)
    •  Anti-apoptosis (mir-2)
    •  Photoreceptor differentiation (mir-7)
    •  Neurogenesis/neurodegeneration (mir-9)
    •  Muscle differentiation (mir-1)
    •  Homeotic transformation (iab-4)
    •  Energy metabolism/fat storage (mir-278)
    •  Metamorphosis (let-7, mir-100, 125, 34)


  • Drosophila melanogaster microRNA clusters

    Chromosome arms: 2L, 2R, 3L, 3R, X, 4, U

    Clusters:
    •  mir-959, 960, 961, 962, 963, 964
    •  mir-275, 305
    •  mir-1002, 968
    •  mir-306, 79, 9b
    •  mir-100, 125, let-7
    •  mir-2a-2, 2a-1, 2b-2
    •  mir-281-2, 281-1
    •  mir-6-3, 6-2, 6-2, 5, 4, 286, 3, 309
    •  mir-310, 311, 312, 313, 991, 992
    •  mir-277, 34
    •  mir-994, 318
    •  mir-13b-1, 13a, 2c
    •  mir-998, 11
    •  mir-982, 303, 983-1, 983-2, 984
    •  mir-283, 304, 12
    •  mir-972, 973, 974
    •  mir-975, 976, 977, 978, 979

    •  148 dme-miRNAs
    •  17 clusters (=

  • Prediction of miRNA targets

    Method     | Type of method       | Resource
    miRanda    | Complementarity      | http://www.microrna.org/
    TargetScan | Seed complementarity | http://www.targetscan.org/
    PicTar     | Thermodynamics       | http://pictar.bio.nyu.edu/

    [Figure labels: Canonical site, Dominant seed, Compensatory site]
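As an illustration of the "seed complementarity" idea in the table above, the sketch below searches a target sequence for perfect matches to a miRNA seed. It assumes the common convention that the seed is miRNA nucleotides 2 to 8 and that a site is the reverse complement of that seed; it is not the TargetScan, miRanda or PicTar algorithm, and the function names and the toy UTR are hypothetical.

```python
# RNA complement table (Watson-Crick pairs only; G:U wobble is ignored here)
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna, start=2, end=8):
    """Reverse complement of the miRNA seed (1-based positions start..end)."""
    seed = mirna[start - 1:end]
    return seed.translate(COMPLEMENT)[::-1]


def find_seed_matches(utr, mirna):
    """Return 0-based start positions of perfect seed matches in `utr`."""
    site = seed_site(mirna)
    positions = []
    i = utr.find(site)
    while i != -1:
        positions.append(i)
        i = utr.find(site, i + 1)
    return positions


LET7 = "UGAGGUAGUAGGUUGUAUAGUU"          # let-7 mature sequence
toy_utr = "AAACUACCUCAAAGGGCUACCUCUUU"   # hypothetical 3' UTR fragment
print(seed_site(LET7), find_seed_matches(toy_utr, LET7))
```

Real predictors add further evidence on top of the seed match, such as site conservation or duplex free energy, which is what distinguishes the three methods listed.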

  • A timescale of arthropod evolution

    [Figure labels: Chelicerata, Crustacea, Myriapoda, Insecta]

    Pisani et al. (2004) BMC Biology

  • Tick Genomic Resources

    •  Ixodes scapularis
       –  ESTs: 183,834
       –  Genome Project (version: IscaW1; released 3 Dec 2008)
          •  Supercontigs: 369,492
          •  Annotated genes: 24,925
          •  Peptides: 20,486
    •  Rhipicephalus microplus
       –  ESTs: 13,643

  • microRNA locus

    [Figure labels: Genome; Precursor miRNA (Pre-miRNA); miRNA, miRNA*; 5P, 3P; pre-dme-miR-33; Drosophila melanogaster; Rhipicephalus microplus]


  • Simplified data processing pipeline

    Illumina short reads
    Unique Seq Reads (USR)
      •  Retain clone count
    Adaptor removal
    USR w/o adaptors
    Map onto genome
      •  NOVOALIGN
      •  Up to 3 mismatches
      •  Single-locus mapping
    Mapped reads
    miRBase
      •  Extract coordinates of miRNAs
         -  mature miRNAs
         -  pre-miRNAs
    miRNA clusters
    OUTPUTS: miRNA, miRNA*, multiple alignments, etc.
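The first pipeline steps (collapse reads to unique sequences, keep the clone count, strip adaptors) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the `min_overlap` parameter, and the short adaptor string are assumptions for the example, and only exact adaptor matches are handled.

```python
from collections import Counter


def trim_adaptor(read, adaptor, min_overlap=6):
    """Remove a 3' adaptor (full, or a prefix of it at the read end)."""
    idx = read.find(adaptor)
    if idx != -1:
        return read[:idx]
    # Look for a partial adaptor hanging off the 3' end of the read.
    for k in range(min(len(adaptor), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adaptor[:k]):
            return read[:-k]
    return read


def collapse_reads(reads):
    """Unique Seq Reads (USR) with their clone counts."""
    return Counter(reads)


def strip_adaptors(unique_counts, adaptor):
    """USR without adaptors; clone counts of identical inserts are summed."""
    trimmed = Counter()
    for seq, count in unique_counts.items():
        insert = trim_adaptor(seq, adaptor)
        if insert:                      # drop reads that are adaptor only
            trimmed[insert] += count
    return trimmed


ADAPTOR = "TCGTATGCCG"  # placeholder adaptor prefix, an assumption for this example
raw = ["TGAGGTAGTAGG" + ADAPTOR, "TGAGGTAGTAGG" + ADAPTOR,
       "TGAGGTAGTAGG" + ADAPTOR[:7], "AAAA"]
print(strip_adaptors(collapse_reads(raw), ADAPTOR))
```

Collapsing before mapping means each distinct sequence is aligned once, while the retained clone count still allows expression to be estimated afterwards.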


  • microRNA discovery

    1.  Collect total RNA/small RNA fraction
        •  Eggs
        •  Larvae (frustrated larvae, larvae)
        •  Adult ticks (female, male)
    2.  Construct small RNA libraries
    3.  Illumina/Solexa sequencing: 21,972,942 short reads
        •  Eggs: 4,215,404
        •  Larvae: 9,437,803
        •  Adult ticks: 8,319,734
    4.  Data Analysis Pipeline

    [Life cycle figure labels: EGGS, LARVA, NYMPH, ADULT (female, male)]


  • Highly conserved tick miRNAs

    We found 58 miRNAs in Rhipicephalus microplus expressed at various life cycle stages that are highly conserved in Drosophila melanogaster.

    [Venn diagram: Eggs (37), Larvae (46), Adults (44); region counts: 26, 2, 1, 9, 5, 1, 0]

    [Chart axes: Fold-increment in miRNA expression (log scale, 0.001 to 1.0); Reads Per Million (0 to 140); samples: Eggs, Frus Larvae, Larvae, Female, Male]
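The "Reads Per Million" axis implies a library-size normalisation so that stages sequenced to different depths can be compared. A minimal sketch is given below; the choice of denominator (all reads in the library rather than mapped reads only) and the example clone counts are assumptions, not values from the talk.

```python
def reads_per_million(count, library_size):
    """Normalise a raw clone count by library size (reads per million)."""
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return count * 1_000_000 / library_size


def fold_change(rpm_a, rpm_b):
    """Ratio of two normalised expression values (rpm_b must be non-zero)."""
    return rpm_a / rpm_b


# Library sizes reported for the egg and larval libraries; the clone
# counts (500 and 2,000) are hypothetical.
eggs_rpm = reads_per_million(500, 4_215_404)
larvae_rpm = reads_per_million(2_000, 9_437_803)
print(round(eggs_rpm, 1), round(larvae_rpm, 1), round(fold_change(larvae_rpm, eggs_rpm), 2))
```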


  • miRNA:miRNA* co-expression

    [Figure labels: Eggs, Larvae, Adults; Pre-miRNA; Mature miRNA]

  • To whom it may concern:

    Slides containing unpublished data were removed.

    We appreciate your understanding.

    RB.


  • mir-9a is conserved in the Ixodes scapularis genome

    369,492 supercontigs

    Finding I. scapularis miRNAs

    BLAT onto D. melanogaster genome

    Inspect known miRNA loci

    Only mir-9a was identified in the current I. scapularis supercontigs

  • Implications of Chromatin Modifications

    [Figure labels: Acetylation, Methylation; Active Gene Expression, Less Gene Expression]

  • ChIP-Seq simplified processing pipeline

    Illumina short reads
    Map onto genome
      •  NOVOALIGN
    Mapped reads
    cisGenome, FindPeaks
    OUTPUTS: Genomic regions associated with chromatin modifications

  • cisGenome

    Ji et al. (2008) Nature Biotechnology 26: 1293-1300

    One Sample Data Processing
    •  Scans the genome with sliding windows and identifies regions with read counts greater than a user-chosen cutoff as bona fide binding regions.
    •  False Discovery Rates (FDRs) are estimated by modeling the read count in nonbinding windows with a negative binomial distribution. This allows the background read rate to vary across the genome according to a flexible gamma distribution.
    •  Uses the directionality of reads to refine peak boundaries and filter out low-quality predictions.
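The sliding-window scan described in the first bullet can be sketched as below. This is not cisGenome's code: the window size, step, cutoff and function names are assumptions for the example, and the FDR estimation and strand-based boundary refinement from the other bullets are deliberately left out.

```python
from bisect import bisect_left


def window_counts(read_starts, genome_length, window=100, step=25):
    """Count read start positions falling in each half-open window [w, w + window)."""
    starts = sorted(read_starts)
    counts = []
    for w in range(0, max(genome_length - window, 0) + 1, step):
        n = bisect_left(starts, w + window) - bisect_left(starts, w)
        counts.append((w, w + window, n))
    return counts


def call_regions(read_starts, genome_length, cutoff, window=100, step=25):
    """Merge overlapping windows whose read count is greater than `cutoff`."""
    regions = []
    for start, end, n in window_counts(read_starts, genome_length, window, step):
        if n <= cutoff:
            continue
        if regions and start <= regions[-1][1]:
            regions[-1] = (regions[-1][0], end)   # extend the previous region
        else:
            regions.append((start, end))
    return regions


# Toy data: a cluster of reads around position 500 plus two background reads.
toy_reads = [50, 510, 520, 530, 540, 550, 560, 900]
print(call_regions(toy_reads, genome_length=1000, cutoff=4))
```

In practice the cutoff would be chosen from the estimated FDR rather than fixed by hand, which is where the negative binomial background model comes in.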

  • Protein-DNA Interactions

  • Diverse genomic contexts for chromatin marks

  • Arabidopsis thaliana nucleosome

  • Bisulfite Sequencing (Bi-Seq; BS-Seq)

    Next Gen Sequencing


  • Genome-wide Methylation Marks

    Illumina short reads
    Map onto genome
      •  MAQ
    Mapped reads
    Check Bisulfite Conversion
    OUTPUTS: Bisulfite conversion report; genome-wide methylation marks

    [Figure labels: C, T; CpG, CHG, CHH]
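The three sequence contexts reported by the pipeline are defined by the two bases following a cytosine, where H stands for A, C or T. The sketch below classifies a cytosine on the forward strand of a reference sequence; the function name is hypothetical and reverse-strand cytosines (G on the forward strand) are not handled.

```python
def cytosine_context(seq, i):
    """Classify the cytosine at 0-based position i as CpG, CHG or CHH.

    Returns None when the context cannot be determined (too close to the
    end of the sequence, or an ambiguous base such as N follows).
    """
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError("position %d is not a cytosine" % i)
    following = seq[i + 1:i + 3]
    if following[:1] == "G":
        return "CpG"
    if len(following) == 2 and following[0] in "ACT":
        if following[1] == "G":
            return "CHG"
        if following[1] in "ACT":
            return "CHH"
    return None


toy = "CGACAGCTTC"
print([(i, cytosine_context(toy, i)) for i, b in enumerate(toy) if b == "C"])
```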

  • Checking Bisulfite Conversion Efficiency

    Aligned reads onto the Arabidopsis chloroplast genome

    Sample | Read Sequences | Unique Alignments | Gapped Alignments | Aligned
    run 1  | 11,653,511     | 133,712           | 4,129             | 275,944
    run 2  | 11,540,171     | 132,690           | 4,251             | 273,806

    Bisulfite conversion efficiency of the chloroplast genome

    Sample | C      | T              | Y          | Unconverted (Percentage) | Converted (Percentage)
    run 1  | 10,806 | 10,577 (97.88) | 183 (1.69) | 14 (0.13)                | 10,760 (99.57)
    run 2  | 10,837 | 10,570 (97.54) | 219 (2.02) | 11 (0.10)                | 10,789 (99.56)
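The percentages in the conversion table can be reproduced from the raw counts. In the reported numbers, "Converted" equals T + Y, and each percentage is taken relative to the number of reference cytosines (column C). The sketch below recomputes them; reading Y as an ambiguous C/T consensus call is an assumption, and the function name is hypothetical.

```python
def conversion_summary(total_c, called_t, called_y, unconverted):
    """Recompute the conversion percentages from raw counts."""
    def pct(n):
        return round(100.0 * n / total_c, 2)

    converted = called_t + called_y   # matches the "Converted" column
    return {
        "T_pct": pct(called_t),
        "Y_pct": pct(called_y),
        "unconverted_pct": pct(unconverted),
        "converted": converted,
        "converted_pct": pct(converted),
    }


# Counts taken from the table: (C, T, Y, Unconverted)
RUNS = {"run 1": (10806, 10577, 183, 14), "run 2": (10837, 10570, 219, 11)}
for name, counts in RUNS.items():
    print(name, conversion_summary(*counts))
```

Because the chloroplast genome is expected to be unmethylated, the converted fraction serves as a direct estimate of bisulfite conversion efficiency for each run.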

  • Bisulfite conversion of the Arabidopsis thaliana chloroplast genome

  • Genome-wide Methylation Marks

    [Figure tracks: Coverage, CHG, CHH, CpG; chromosomes: chr1, chr2, chr3, chr4, chr5]

  • http://ccg.murdoch.edu.au/yabi

    Web HPC - Enabled


  • Acknowledgements

    •  Zhang Bing
    •  Ala Lew-Tabor
    •  Zayed Albertyn
    •  Matthew Bellgard
    •  Frances Shannon
    •  Jun Fan
    •  Liz Dennis
    •  Ian Greaves
    •  Sameer Tiwari
    •  Colin Hercus

    NCRIS 5.1
    National Collaborative Research Infrastructure Strategy (An Australian Government Initiative)
    Department of Primary Industries and Fisheries, Queensland Government