
Implementation of computational pipelines to support next gen applications

Winter School in Mathematical and Computational Biology, 6th – 10th July 2009

Roberto Barrero  ([email protected]) 

miR-Seq: microRNA profiling
ChIP-Seq: Chromatin modification
Bi-Seq: DNA methylation analysis


Bioplatforms Australia

NCRIS 5.1 Evolving Biomolecular Platforms and Informatics

[Organisation chart: Australian Bioinformatics Facility, Genomics Australia, Proteomics Australia and Metabolomics Australia; Embedded Activities (Project 1, Project 2, Project 3) and Non-Embedded Activities; Development of cross-omics Platform Projects; Development of cross NCRIS Investment Projects; NCRIS 5.16 NeAT (BioNeAT pending); NCRIS 5.16 NCI (Specialised Facility in Bioinformatics pending)]

•  Implementation of a short read mapping pipeline
   –  Benchmarking of freely available aligners
•  miR-Seq: Profiling of miRNAs and miRNAs*
•  ChIP-Seq: Defining genomic regions associated with histone modifications
•  Bi-Seq: Determining genome-wide CpG, CHG and CHH methylation marks

Overview 





Tool       Performance  Sanger capillary  Illumina  SOLiD  454  Finds mismatches  Finds indels  Uses quality information  Tested platforms
ELAND      FAST         N                 Y         N      N    Y                 N             Y                         Linux, OSX
GMAP       FAST         Y                 N         N      Y    Y                 Y             N                         Linux
MOSAIK     SLOW         Y                 Y         N      Y    Y                 Y             Y                         Linux, OSX
SHRiMP     SLOW         N                 Y         Y      N    Y                 Y             N                         Linux
MAQ        FAST         N                 Y         Y      N    Y                 Y             Y                         All BSD Platforms/Linux/OSX
NOVOALIGN  FAST         N                 Y         N      N    Y                 Y             Y                         Linux-64/OSX
SOAP       VARIABLE     N                 Y         N      N    Y                 Y             N                         Linux-64/32/OSX
SSAHA      FAST         Y                 Y         N      Y    Y                 N             N                         Linux

Tool names:
•  ELAND: Large-Scale Alignment of Nucleotide Databases
•  GMAP: Genomic Mapping and Alignment Program
•  MOSAIK: Reference guided aligner/assembler
•  SHRiMP: Maps short reads to a reference sequence
•  MAQ: Mapping and Assembly with Qualities
•  NOVOALIGN: Genomic Mapping and SNP/indel finder
•  SOAP: Short Oligonucleotide Alignment Program
•  SSAHA: Sequence Search and Alignment by Hashing Algorithm

Initial list of available tools (as of April 2008)

•  ELAND does ungapped alignment of SE/PE reads up to 32 nt in length and generates accurate mapping qualities.

•  MAQ uses probability models to measure the alignment quality of each read using sequence quality information.

•  SHRiMP uses seeding and a Smith-Waterman algorithm for aligning short reads to a reference genome.

•  RMAP maps reads taking into account base-call quality scores to determine important positions.

•  NOVOALIGN finds global optimum alignments using the full Needleman-Wunsch algorithm with affine gap penalties.

Mapping Tools
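The NOVOALIGN bullet above refers to Needleman-Wunsch alignment with affine gap penalties. As a minimal sketch of that general technique (not NOVOALIGN's actual implementation), the function below computes the best global alignment score with the standard three-matrix recurrence. The scoring values are illustrative assumptions; a gap of length k costs gap_open + (k - 1) * gap_extend.

```python
def global_affine_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best global alignment score of a and b with affine gap penalties.

    Scoring values are illustrative assumptions, not NOVOALIGN defaults.
    """
    neg = float("-inf")
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]
    # X: a[i-1] aligned to a gap (gap in b)
    # Y: b[j-1] aligned to a gap (gap in a)
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap costs gap_open; extending an existing one costs gap_extend.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

With these scores, one gap of length 2 is cheaper than two separate gaps of length 1, which is the property that makes affine penalties suitable for indel detection.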

[Figure: Genome coverage range by distinct next gen applications (Bi-Seq, ChIP-Seq, small RNAs) relative to the genome]

Generation of Simulated Short Reads (1)

•  Tool: MAQ-simulate
•  DNA template: Human chromosome 22
•  Read length: 36 bases
•  Mutation rates: 0.1% up to 16%
•  Number of reads: 70,000 x 3 per mutation rate
•  Number of SNPs: 220~3,500
•  Indels: 10% probability of SNPs
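The idea behind the simulated datasets can be sketched in a few lines: mutate a template at a chosen rate, then sample fixed-length reads from the mutated copy. This is only an illustration of the principle, not MAQ's simulator. It introduces substitutions only (no indels, no sequencing errors, no diploid genotypes), and the toy template and function names are assumptions.

```python
import random

def mutate_template(template, mutation_rate, rng):
    """Return a substituted copy of the template and the mutated positions."""
    bases = list(template)
    snp_positions = []
    for i, base in enumerate(bases):
        if rng.random() < mutation_rate:
            # Replace with a different base so every event is a real change.
            bases[i] = rng.choice([b for b in "ACGT" if b != base])
            snp_positions.append(i)
    return "".join(bases), snp_positions

def sample_reads(sequence, n_reads, read_len, rng):
    """Sample (start, read) pairs uniformly along the sequence."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(sequence) - read_len + 1)
        reads.append((start, sequence[start:start + read_len]))
    return reads

rng = random.Random(42)
# Toy stand-in for a chromosome; the benchmark used human chromosome 22.
template = "".join(rng.choice("ACGT") for _ in range(5000))
mutated, snps = mutate_template(template, 0.01, rng)
reads = sample_reads(mutated, 1000, 36, rng)
```

Because the true origin of every read and the position of every substitution are known, mapping accuracy and SNP calls can be scored directly against them.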

Single-end mapping performance at various mutation rates

•  70,000 reads •  Triplicate

Paired-end mapping and SNP calling

Mapping performance of real data

HapMap project NA18507 10.2 million SE reads

5.1 million PE reads

Generation of Simulated Short Reads (2)

Dataset                             Simulated Set 1   Simulated Set 2
Mutation rate                       0.1%              1.0%
Number of single-end reads (1)      8,453,489         8,453,489
Number of paired-end reads (2)      16,906,978        16,906,978
Number of insertions                1,512             15,131
Number of deletions                 1,514             15,166
Total number of indels              3,026             30,297
Number of heterozygous SNPs         18,024            182,055
Number of homozygous SNPs           9,034             90,985
Total number of SNPs                27,058            273,040
Total number of SNPs + indels       30,084            303,337

(1) Total number of single-end (SE) reads utilized in the comparisons
(2) Total number of paired-end (PE) reads utilized in the comparisons

36bp-long reads datasets were generated using MAQ-simulate

Selected tools: NOVOALIGN, MAQ, BOWTIE, BWA

Arabidopsis thaliana (chr 5)

Mapping

(Colin et al. Submitted)

SNPs


Indels


Run Time


Benchmarking Conclusion

In these benchmarks, NOVOALIGN was the best overall aligner for mapping both SE and PE reads, as well as for SNP calling and indel detection.





Overview 

miRNA function in Drosophila:
•  Cell proliferation/anti-apoptosis (bantam)
•  Fat storage/anti-apoptosis (mir-14)
•  Homeostasis/anti-apoptosis (mir-278)
•  Anti-apoptosis (mir-2)
•  Photoreceptor differentiation (mir-7)
•  Neurogenesis/neurodegeneration (mir-9)
•  Muscle differentiation (mir-1)
•  Homeotic transformation (iab-4)
•  Energy metabolism/fat storage (mir-278)
•  Metamorphosis (let-7, mir-100, 125, 34)

microRNA  pathway 

Drosophila melanogaster microRNA clusters

[Figure: miRNA clusters placed on chromosome arms 2L, 2R, 3L, 3R, X, 4 and U]

•  mir-959, 960, 961, 962, 963, 964
•  mir-275, 305
•  mir-1002, 968
•  mir-306, 79, 9b
•  mir-100, 125, let-7
•  mir-2a-2, 2a-1, 2b-2
•  mir-281-2, 281-1
•  mir-6-3, 6-2, 6-1, 5, 4, 286, 3, 309
•  mir-310, 311, 312, 313, 991, 992
•  mir-277, 34
•  mir-994, 318
•  mir-13b-1, 13a, 2c
•  mir-998, 11
•  mir-982, 303, 983-1, 983-2, 984
•  mir-283, 304, 12
•  mir-972, 973, 974
•  mir-975, 976, 977, 978, 979

•  148 dme-miRNAs
•  17 clusters (=

Method       Type of method         Resource
miRanda      Complementarity        http://www.microrna.org/
TargetScan   Seed complementarity   http://www.targetscan.org/
PicTar       Thermodynamics         http://pictar.bio.nyu.edu/

Canonical site

Dominant seed

Compensatory site

Prediction of miRNA targets

A timescale of arthropod evolution

[Figure: timescale tree covering Chelicerata, Crustacea, Myriapoda and Insecta. Pisani et al. (2004) BMC Biology]

•  Ixodes scapularis
   –  ESTs: 183,834
   –  Genome Project (version: IscaW1; released 3 Dec 08)
      •  Supercontigs: 369,492
      •  Annotated genes: 24,925
      •  Peptides: 20,486

•  Rhipicephalus microplus
   –  ESTs: 13,643

Tick Genomic Resources

microRNA locus

[Figure: genome locus giving rise to a precursor miRNA (pre-miRNA) whose 5P and 3P arms yield the miRNA and miRNA*; example pre-dme-miR-33 compared between Drosophila melanogaster and Rhipicephalus microplus]

Simplified data  processing pipeline 

Unique Seq Reads (USR) 

USR w/o adaptors 

Retain clone count 

Map onto genome 

•  NOVOALIGN •  Up to 3 mismatches •  Single‐locus mapping 

Mapped reads 

miRBase 

miRNA clusters 

•  Extract coordinates of miRNAs
   –  mature miRNAs
   –  pre-miRNAs

Illumina short reads

Adaptor removal

OUTPUTS: miRNA, miRNA*, multiple alignments, etc.
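The pre-mapping steps of this pipeline (adaptor removal, collapsing to unique sequence reads while retaining the clone count) and the post-mapping filter (up to 3 mismatches, single-locus mapping) can be sketched as follows. This is an assumed reading of the diagram rather than the actual pipeline code: the order of trimming and collapsing, the adaptor string, the minimum insert length, and the `hits` structure standing in for parsed aligner output are all hypothetical.

```python
from collections import Counter

def trim_adaptor(read, adaptor, min_overlap=6):
    """Remove a 3' adaptor, including one that runs off the end of the read."""
    idx = read.find(adaptor)
    if idx != -1:
        return read[:idx]
    # Look for the longest adaptor prefix sitting at the 3' end of the read.
    for k in range(min(len(adaptor), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adaptor[:k]):
            return read[:-k]
    return read

def collapse_reads(reads, adaptor, min_len=18):
    """Map each unique trimmed sequence to its clone count."""
    counts = Counter()
    for read in reads:
        insert = trim_adaptor(read, adaptor)
        if len(insert) >= min_len:  # min_len is an assumed cutoff
            counts[insert] += 1
    return counts

def single_locus_hits(hits, max_mismatches=3):
    """Keep reads with exactly one alignment within the mismatch limit.

    hits: {read: [(locus, mismatches), ...]}, a hypothetical structure.
    """
    kept = {}
    for read, alignments in hits.items():
        good = [a for a in alignments if a[1] <= max_mismatches]
        if len(good) == 1:
            kept[read] = good[0]
    return kept
```

Collapsing before mapping means each distinct sequence is aligned once, and the retained clone count can later serve as the expression estimate for that sequence.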

1.  Collect total RNA/small RNA fraction
    •  Eggs
    •  Larvae (frustrated larvae, larvae)
    •  Adult ticks (female, male)

2.  Construct small RNA libraries

3.  Illumina/Solexa sequencing
    •  Eggs: 4,215,404
    •  Larvae: 9,437,803
    •  Adult ticks: 8,319,734

4.  Data analysis pipeline

21,972,942 short reads
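Because the three libraries differ in depth, raw clone counts are not directly comparable between stages. A common normalization is reads per million, sketched below with the library sizes listed above. Using the total number of sequenced reads as the denominator (rather than, say, mapped reads) is an assumption, and the function name is illustrative.

```python
# Library sizes taken from the sequencing step above.
LIBRARY_SIZES = {
    "eggs": 4_215_404,
    "larvae": 9_437_803,
    "adults": 8_319_734,
}

def reads_per_million(count, library_size):
    """Scale a raw read count to a library of one million reads."""
    return count * 1_000_000 / library_size

def normalise(counts_by_stage):
    """Convert {stage: raw_count} for one miRNA into {stage: RPM}."""
    return {
        stage: reads_per_million(count, LIBRARY_SIZES[stage])
        for stage, count in counts_by_stage.items()
    }
```

For example, the same raw count of 1,000 reads corresponds to a higher normalized expression in the egg library than in the larger larval library.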

[Figure: tick life cycle stages sampled: eggs, larva, nymph, adult (female, male)]

microRNA discovery 


We found 58 miRNAs in Rhipicephalus microplus  expressed at various life cycle stages that are highly conserved in Drosophila melanogaster. 

Highly conserved tick miRNAs

[Venn diagram: conserved miRNAs detected in eggs (37), larvae (46) and adults (44); region counts as extracted: 26, 2, 1, 9, 5, 1, 0]

[Figure: miRNA expression across eggs, frustrated larvae, larvae, females and males. Axes: fold-increment in miRNA expression; reads per million (0 to 140). Panels for eggs, larvae and adults showing pre-miRNA and mature miRNA.]

miRNA:miRNA* co-expression

To whom it may concern: 

Slides containing unpublished data were removed. 

We appreciate your understanding. 

RB. 

mir-9a is conserved in the Ixodes scapularis genome

369,492 supercontigs

Finding I. scapularis miRNAs

BLAT onto D. melanogaster genome

Inspect known miRNA loci

Only mir-9a was identified in the current I. scapularis supercontigs





Overview 

Active Gene Expression

Less Gene Expression

Acetylation Methylation

Implications of Chromatin Modifications

cisGenome 

ChIP‐Seq simplified  processing pipeline 


Mapped reads 

FindPeaks 


OUTPUTS: Genomic regions associated with chromatin modifications

NOVOALIGN

Ji et al. (2008) Nature Biotechnology 26: 1293-1300

One Sample Data Processing

•  Scans the genome with sliding windows and identifies regions whose read counts exceed a user-chosen cutoff as bona fide binding regions.

•  False Discovery Rates (FDRs) are estimated by modeling the read count in nonbinding windows using a negative binomial distribution.

   This allows the background rate of occurrence of the reads to vary across the genome according to a gamma distribution, which is more flexible than a single fixed rate.

•  Uses the directionality of reads to refine peak boundaries and filter out low-quality predictions.

cisGenome
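The one-sample procedure described above can be sketched as: count reads in sliding windows, model background counts with a negative binomial, and estimate the FDR at a cutoff as the expected number of background windows reaching it divided by the number observed. This is an illustration of the idea only, not cisGenome's code. The slides do not describe how cisGenome fits the distribution, so the method-of-moments fit below is a stand-in, and all function names are assumptions.

```python
import math
from bisect import bisect_left

def sliding_window_counts(positions, genome_len, width, step):
    """Return (window_start, read_count) for windows [start, start + width)."""
    pos = sorted(positions)
    starts = range(0, max(genome_len - width, 0) + 1, step)
    return [(s, bisect_left(pos, s + width) - bisect_left(pos, s)) for s in starts]

def nb_pmf(k, r, p):
    """Negative binomial P(K = k) with size r and success probability p."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)

def nb_tail(cutoff, r, p):
    """P(K >= cutoff) under the negative binomial."""
    return max(0.0, 1.0 - sum(nb_pmf(k, r, p) for k in range(cutoff)))

def fit_nb_moments(counts):
    """Method-of-moments estimate of (r, p); a stand-in for the real fit."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    if var <= mean:
        raise ValueError("no overdispersion; a Poisson model would suffice")
    return mean * mean / (var - mean), mean / var

def estimate_fdr(counts, cutoff, r, p):
    """Expected background windows at or above cutoff over observed ones."""
    observed = sum(1 for c in counts if c >= cutoff)
    if observed == 0:
        return 0.0
    expected_null = len(counts) * nb_tail(cutoff, r, p)
    return min(1.0, expected_null / observed)
```

Raising the cutoff lowers the expected number of background windows faster than it removes true peaks, so the estimated FDR generally drops as the cutoff increases. Fitting on all windows, as this sketch does, lets true peaks inflate the background estimate, so the result is conservative.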

Protein-DNA Interactions

Diverse genomic contexts for chromatin marks

Arabidopsis thaliana nucleosome





Overview 

Bisulfite Sequencing (Bi-Seq; BS-Seq)

Next Gen Sequencing 

Genome-wide Methylation Marks


Mapped reads 

Check Bisulfite Conversion 


OUTPUTS: Bisulfite conversion report; genome-wide methylation marks

MAQ

C T CpG CHG CHH

Sample   C        T (Percentage)    Y (Percentage)   Unconverted (Percentage)   Converted (Percentage)
run 1    10,806   10,577 (97.88)    183 (1.69)       14 (0.13)                  10,760 (99.57)
run 2    10,837   10,570 (97.54)    219 (2.02)       11 (0.10)                  10,789 (99.56)

Sample   Read Sequences   Unique Alignments   Gapped Alignments   Aligned
run 1    11,653,511       133,712             4,129               275,944
run 2    11,540,171       132,690             4,251               273,806

Checking Bisulfite Conversion Efficiency

Aligned reads onto the Arabidopsis chloroplast genome

Bisulfite conversion efficiency of the chloroplast genome

Bisulfite conversion of the Arabidopsis thaliana chloroplast genome
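The percentages in the conversion table can be reproduced from the raw counts. In the table, the converted total equals T + Y in both runs; reading Y as the IUPAC code for a mixed C/T call is an interpretation not spelled out in the slides. The sketch below recomputes the report for the two runs, and the function and key names are illustrative.

```python
def conversion_report(total_c, t, y, unconverted):
    """Percentages of reference cytosines by consensus call after bisulfite treatment."""
    def pct(x):
        return round(100 * x / total_c, 2)

    converted = t + y  # matches the 'Converted' column in the table
    return {
        "T_pct": pct(t),
        "Y_pct": pct(y),
        "unconverted_pct": pct(unconverted),
        "converted": converted,
        "converted_pct": pct(converted),
    }

run1 = conversion_report(total_c=10_806, t=10_577, y=183, unconverted=14)
run2 = conversion_report(total_c=10_837, t=10_570, y=219, unconverted=11)
```

The chloroplast genome is used as the control because it is expected to be unmethylated, so any cytosine left unconverted there reflects incomplete bisulfite treatment rather than methylation.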

Genome-wide Methylation Marks

[Figure: coverage and CHG, CHH and CpG methylation tracks along chr1, chr2, chr3, chr4 and chr5]
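Methylation marks are reported separately for the CpG, CHG and CHH contexts, where H stands for A, C or T. A minimal sketch of how each reference cytosine can be assigned to a context, and how a per-site methylation level can be derived from read counts, is given below. It handles the forward strand only, and the function names are assumptions rather than part of the pipeline shown in the slides.

```python
def cytosine_contexts(seq):
    """Return [(position, context)] for forward-strand cytosines.

    Cytosines too close to the end of the sequence, or followed by an
    ambiguous base, are skipped.
    """
    seq = seq.upper()
    out = []
    for i, base in enumerate(seq):
        if base != "C":
            continue
        nxt = seq[i + 1:i + 3]
        if nxt[:1] == "G":
            out.append((i, "CpG"))
        elif len(nxt) == 2 and nxt[0] in "ACT":
            if nxt[1] == "G":
                out.append((i, "CHG"))
            elif nxt[1] in "ACT":
                out.append((i, "CHH"))
    return out

def methylation_level(c_reads, t_reads):
    """Fraction of reads at a site that still show C after bisulfite treatment."""
    total = c_reads + t_reads
    return c_reads / total if total else None
```

After bisulfite treatment an unmethylated cytosine reads as T and a methylated one stays C, so the fraction of C calls at a site estimates its methylation level.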

http://ccg.murdoch.edu.au/yabi

Web HPC ‐ Enabled 

Acknowledgements

•  Zhang Bing
•  Ala Lew-Tabor
•  Colin Hercus
•  Zayed Albertyn
•  Matthew Bellgard
•  Frances Shannon
•  Jun Fan
•  Liz Dennis
•  Ian Greaves
•  Sameer Tiwari

NCRIS 5.1: National Collaborative Research Infrastructure Strategy, An Australian Government Initiative

Department of Primary Industries and Fisheries, Queensland Government
