TRANSCRIPT
-
Implementation of computational pipelines to support next gen applications
Winter School in Mathematical and Computational Biology, 6th – 10th July 2009
Roberto Barrero ([email protected])
miR-Seq: microRNA profiling
ChIP-Seq: Chromatin modification
Bi-Seq: DNA methylation analysis
-
Bioplatforms Australia
NCRIS 5.1 Evolving Biomolecular Platforms and Informatics
-
NCRIS 5.1 (organisation chart)
Australian Bioinformatics Facility
Genomics Australia, Proteomics Australia, Metabolomics Australia: Embedded Activities and Non-Embedded Activities (Project 1, Project 2, Project 3)
Development of cross-omics Platform Projects
Development of cross-NCRIS Investment Projects
NCRIS 5.16 NeAT: BioNeAT (pending)
NCRIS 5.16 NCI: Specialised Facility in Bioinformatics (pending)
-
• Implementation of a short read mapping pipeline
– Benchmarking of freely available aligners
• miR-Seq: Profiling of miRNAs and miRNAs*
• ChIP-Seq: Defining genomic regions associated with histone modifications
• Bi-Seq: Determining genome-wide CpG, CHG and CHH methylation marks
Overview
-
Tool | Description | Performance | Sanger capillary | Illumina | SOLiD | 454 | Finds mismatches | Finds indels | Uses quality information | Tested platforms
ELAND | Large-Scale Alignment of Nucleotide Databases | FAST | N | Y | N | N | Y | N | Y | Linux, OSX
GMAP | Genomic Mapping and Alignment Program | FAST | Y | N | N | Y | Y | Y | N | Linux
MOSAIK | Reference guided aligner/assembler | SLOW | Y | Y | N | Y | Y | Y | Y | Linux, OSX
SHRiMP | Maps short reads to a reference sequence | SLOW | N | Y | Y | N | Y | Y | N | Linux
MAQ | Mapping and Assembly with Qualities | FAST | N | Y | Y | N | Y | Y | Y | All BSD platforms/Linux/OSX
NOVOALIGN | Genomic Mapping and SNP/indel finder | FAST | N | Y | N | N | Y | Y | Y | Linux-64/OSX
SOAP | Short Oligonucleotide Alignment Program | VARIABLE | N | Y | N | N | Y | Y | N | Linux-64/32/OSX
SSAHA | Sequence Search and Alignment by Hashing Algorithm | FAST | Y | Y | N | Y | Y | N | N | Linux
Initial list of available tools (as of April 2008)
-
• ELAND performs ungapped alignment of SE/PE reads up to 32 nt in length and generates accurate mapping qualities.
• MAQ uses probability models to measure the alignment quality of each read using sequence quality information.
• SHRiMP uses seeding and the Smith-Waterman algorithm to align short reads to a reference genome.
• RMAP maps reads, taking base-call quality scores into account to determine important positions.
• NOVOALIGN finds globally optimal alignments using the full Needleman-Wunsch algorithm with affine gap penalties.
Mapping Tools
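As an illustration of what global alignment with affine gap penalties means, here is a minimal sketch of the textbook three-matrix recurrence (often attributed to Gotoh). It is not NOVOALIGN's implementation; the function name and the scoring values (match, mismatch, gap_open, gap_extend) are assumptions chosen for the example. A gap of length k costs gap_open + (k - 1) * gap_extend.

```python
def affine_global_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Score of the best global alignment of a and b under affine gap costs.

    Scoring values are illustrative assumptions, not any aligner's defaults.
    """
    neg = float("-inf")
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a new gap costs gap_open; extending an existing one costs gap_extend
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

With affine costs one long gap is cheaper than several short ones, which is why this scheme suits indel detection better than a fixed per-base gap penalty.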
-
Genome coverage range by distinct next gen applications
(Figure labels: Bi-Seq, ChIP-Seq, Small RNAs, Genome)
-
Generation of Simulated Short Reads (1)
Tool: MAQ-simulate
DNA template: Human chromosome 22
Read length: 36 bases
Mutation rates: 0.1% up to 16%
Number of reads: 70,000 x 3 per mutation rate
Number of SNPs: 220~3,500
Indels: 10% probability of SNPs
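The idea behind such a simulation can be sketched in a few lines: mutate a template at a chosen rate, then sample fixed-length reads whose true origin is known. This is not MAQ-simulate's code or model (it ignores indels, base qualities and sequencing errors); the function names and the uniform substitution model are assumptions made for illustration.

```python
import random


def mutate(template, mutation_rate, rng):
    """Substitute each base with probability mutation_rate (uniform model, an assumption)."""
    out = []
    for base in template:
        if rng.random() < mutation_rate:
            # Pick a different base so every mutation event is a real substitution
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def sample_reads(sequence, n_reads, read_length, rng):
    """Return (start, read) pairs so mapped positions can later be checked against the truth."""
    last_start = len(sequence) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [(s, sequence[s:s + read_length]) for s in starts]


if __name__ == "__main__":
    rng = random.Random(2009)
    template = "".join(rng.choice("ACGT") for _ in range(10_000))  # stand-in for a chromosome
    for rate in (0.001, 0.01, 0.16):
        reads = sample_reads(mutate(template, rate, rng), 5, 36, rng)
        print(rate, reads[0])
```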
-
Single-end mapping performance at various mutation rates
• 70,000 reads
• Triplicate
-
Paired-end mapping and SNP calling
-
Mapping performance of real data
HapMap project NA18507 10.2 million SE reads
5.1 million PE reads
-
Generation of Simulated Short Reads (2)
Dataset | Simulated Set 1 | Simulated Set 2
Mutation rate | 0.1% | 1.0%
Number of single-end reads (1) | 8,453,489 | 8,453,489
Number of paired-end reads (2) | 16,906,978 | 16,906,978
Number of insertions | 1,512 | 15,131
Number of deletions | 1,514 | 15,166
Total number of indels | 3,026 | 30,297
Number of heterozygous SNPs | 18,024 | 182,055
Number of homozygous SNPs | 9,034 | 90,985
Total number of SNPs | 27,058 | 273,040
Total number of SNPs+indels | 30,084 | 303,337
(1) Total number of single-end (SE) reads utilized in the comparisons
(2) Total number of paired-end (PE) reads utilized in the comparisons
Datasets of 36 bp reads were generated using MAQ-simulate
Selected tools: NOVOALIGN, MAQ, BOWTIE, BWA
Arabidopsis thaliana (chr 5)
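Because a simulated dataset records every SNP and indel that was introduced, the calls produced after mapping can be scored against that truth set. The sketch below shows one common way to do this (sensitivity and precision over exact variant matches). The representation of a variant as a (position, ref, alt) tuple and the function name are assumptions; the slides do not state how the benchmark was scored.

```python
def score_calls(truth, calls):
    """Compare called variants with the simulated truth set.

    Both arguments are iterables of hashable variants, here assumed to be
    (position, ref, alt) tuples.
    """
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # simulated variants that were recovered
    fp = len(calls - truth)   # calls with no simulated counterpart
    fn = len(truth - calls)   # simulated variants that were missed
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": tp / (tp + fn) if truth else 0.0,
        "precision": tp / (tp + fp) if calls else 0.0,
    }


if __name__ == "__main__":
    simulated = [(10, "A", "G"), (20, "C", "T"), (30, "G", "A")]
    called = [(10, "A", "G"), (20, "C", "T"), (40, "T", "C")]
    print(score_calls(simulated, called))
```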
-
Mapping
(Colin et al. Submitted)
-
SNPs
(Colin et al. Submitted)
-
Indels
(Colin et al. Submitted)
-
Run Time
(Colin et al. Submitted)
-
Benchmarking Conclusion
NOVOALIGN is the best overall aligner for mapping both SE and PE reads as well as SNP calling and indel detection.
-
• Implementation of a short read mapping pipeline
– Benchmarking of freely available aligners
• miR-Seq: Profiling of miRNAs and miRNAs*
• ChIP-Seq: Defining genomic regions associated with histone modifications
• Bi-Seq: Determining genome-wide CpG, CHG and CHH methylation marks
Overview
-
miRNA function in Drosophila:
• Cell proliferation/anti-apoptosis (bantam)
• Fat storage/anti-apoptosis (mir-14)
• Homeostasis/anti-apoptosis (mir-278)
• Anti-apoptosis (mir-2)
• Photoreceptor differentiation (mir-7)
• Neurogenesis/neurodegeneration (mir-9)
• Muscle differentiation (mir-1)
• Homeotic transformation (iab-4)
• Energy metabolism/fat storage (mir-278)
• Metamorphosis (let-7, mir-100, 125, 34)
microRNA pathway
-
Drosophila melanogaster microRNA clusters
Chromosome arms shown: 2L, 2R, 3L, 3R, X, 4, U
Clusters:
mir-959, 960, 961, 962, 963, 964
mir-275, 305
mir-1002, 968
mir-306, 79, 9b
mir-100, 125, let-7
mir-2a-2, 2a-1, 2b-2
mir-281-2, 281-1
mir-6-3, 6-2, 6-2, 5, 4, 286, 3, 309
mir-310, 311, 312, 313, 991, 992
mir-277, 34
mir-994, 318
mir-13b-1, 13a, 2c
mir-998, 11
mir-982, 303, 983-1, 983-2, 984
mir-283, 304, 12
mir-972, 973, 974
mir-975, 976, 977, 978, 979
• 148 dme-miRNAs
• 17 clusters (=
-
Method | Type of method | Resource
miRanda | Complementarity | http://www.microrna.org/
TargetScan | Seed complementarity | http://www.targetscan.org/
PicTar | Thermodynamics | http://pictar.bio.nyu.edu/
Canonical site
Dominant seed
Compensatory site
Prediction of miRNA targets
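To make "seed complementarity" concrete: the seed is commonly taken as nucleotides 2 to 8 of the mature miRNA, and a candidate site is a stretch of the target that is the reverse complement of that seed. The sketch below only scans for perfect 7-mer seed matches. It is not the algorithm of TargetScan, miRanda or PicTar, and the function names and the example UTR are hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_site(mirna, start=2, end=8):
    """Target sequence (5'->3', RNA alphabet) pairing perfectly with miRNA positions start..end."""
    seed = mirna.upper().replace("T", "U")[start - 1:end]
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_matches(mirna, utr):
    """0-based start positions of perfect seed matches in a UTR given as RNA or DNA."""
    site = seed_site(mirna)
    utr = utr.upper().replace("T", "U")
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)  # allow overlapping sites
    return hits


if __name__ == "__main__":
    let7 = "UGAGGUAGUAGGUUGUAUAGUU"
    toy_utr = "AAAACUACCUCGGGGCTACCTC"  # hypothetical UTR with two sites
    print(seed_site(let7), find_seed_matches(let7, toy_utr))
```

Real predictors add further evidence on top of the seed match, such as conservation, site context or duplex free energy.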
-
Chelicerata
Crustacea
Myriapoda
Insecta
Pisani et al. (2004) BMC Biology
A timescale of arthropod evolution
-
• Ixodes scapularis
– ESTs: 183,834
– Genome Project (version: IscaW1; released 3 Dec 2008)
   • Supercontigs: 369,492
   • Annotated genes: 24,925
   • Peptides: 20,486
• Rhipicephalus microplus
– ESTs: 13,643
Tick Genomic Resources
-
Genome
Precursor miRNA (Pre-miRNA)
miRNA miRNA*
5P 3P
pre-dme-miR-33
Drosophila melanogaster
Rhipicephalus microplus
microRNA locus
-
Simplified data processing pipeline
1. Illumina short reads
2. Unique Seq Reads (USR): retain clone count
3. USR w/o adaptors: adaptor removal
4. Map onto genome
   • NOVOALIGN
   • Up to 3 mismatches
   • Single-locus mapping
5. Mapped reads
6. miRBase and miRNA clusters
   • Extract coordinates of miRNAs (mature miRNAs, pre-miRNAs)
OUTPUTS: miRNA, miRNA*, multiple alignments, etc.
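The read pre-processing part of this pipeline (collapsing reads to unique sequences while retaining the clone count, and removing the 3' adaptor) can be sketched as follows. The adaptor string is a placeholder, and min_overlap and min_length are assumed parameters; the slides do not give the actual sequence or thresholds. Collapsing before or after trimming yields the same final counts.

```python
from collections import Counter

ADAPTOR = "TCGTATGCCG"  # placeholder 3' adaptor prefix, not taken from the slides


def trim_adaptor(seq, adaptor=ADAPTOR, min_overlap=6):
    """Remove a full adaptor, or a partial adaptor prefix at the 3' end of the read."""
    hit = seq.find(adaptor)
    if hit != -1:
        return seq[:hit]
    longest = min(len(adaptor) - 1, len(seq))
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if seq.endswith(adaptor[:k]):
            return seq[:len(seq) - k]
    return seq


def unique_reads_without_adaptors(reads, adaptor=ADAPTOR, min_length=18):
    """Map each unique trimmed sequence to its clone count."""
    clone_counts = Counter(reads)            # unique sequence reads (USR) with counts
    trimmed = Counter()
    for seq, count in clone_counts.items():
        insert = trim_adaptor(seq, adaptor)
        if len(insert) >= min_length:        # drop inserts too short to be a miRNA (assumed cutoff)
            trimmed[insert] += count
    return trimmed
```

The resulting sequence-to-count table is what would be handed to the aligner, so each distinct sequence is mapped once and its clone count is carried along as the expression measure.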
-
1. Collect total RNA/small RNA fraction
• Eggs
• Larvae (frustrated larvae, larvae)
• Adult ticks (female, male)
2. Construct small RNA libraries
3. Illumina/Solexa sequencing
• Eggs: 4,215,404
• Larvae: 9,437,803
• Adult ticks: 8,319,734
4. Data Analysis Pipeline
21,972,942 short reads
LARVA
NYMPH
ADULT
EGGS
female male
microRNA discovery
-
Highly conserved tick miRNAs
We found 58 miRNAs in Rhipicephalus microplus, expressed at various life cycle stages, that are highly conserved in Drosophila melanogaster.
Venn diagram of miRNAs per stage: Eggs (37), Larvae (46), Adults (44); region counts (positions not recoverable): 26, 2, 1, 9, 5, 1, 0
Figure axes: fold increment in miRNA expression; reads per million
Samples: Eggs, Frus Larvae, Larvae, Female, Male
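The expression plots are in reads per million, which rescales a raw clone count by the size of its library so that stages sequenced to different depths can be compared. A minimal sketch follows, using the library sizes from the sequencing slide. The fold_change helper and its pseudocount are assumptions added for illustration, not the authors' stated method.

```python
LIBRARY_SIZES = {"eggs": 4_215_404, "larvae": 9_437_803, "adults": 8_319_734}


def reads_per_million(count, library_size):
    """Normalise a raw read count to reads per million sequenced reads."""
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return count * 1_000_000 / library_size


def fold_change(rpm_a, rpm_b, pseudocount=1.0):
    """Ratio of two normalised values; the pseudocount avoids division by zero."""
    return (rpm_a + pseudocount) / (rpm_b + pseudocount)


if __name__ == "__main__":
    # Hypothetical raw counts for one miRNA in two stages
    eggs = reads_per_million(421, LIBRARY_SIZES["eggs"])
    larvae = reads_per_million(421, LIBRARY_SIZES["larvae"])
    print(eggs, larvae, fold_change(eggs, larvae))
```

The same raw count of 421 gives a higher normalised value in eggs than in larvae because the egg library is less than half the size.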
-
Eggs
Larvae
Adults
Pre-miRNA
Mature miRNA
miRNA:miRNA* co-expression
-
To whom it may concern:
Slides containing unpublished data were removed.
We appreciate your understanding.
RB.
-
mir-9a is conserved in the Ixodes scapularis genome
369,492 supercontigs
Finding I. scapularis miRNAs
BLAT onto D. melanogaster genome
Inspect known miRNA loci
Only mir-9a was identified in the current I. scapularis supercontigs
-
• Implementation of a short read mapping pipeline
– Benchmarking of freely available aligners
• miR-Seq: Profiling of miRNAs and miRNAs*
• ChIP-Seq: Defining genomic regions associated with histone modifications
• Bi-Seq: Determining genome-wide CpG, CHG and CHH methylation marks
Overview
-
Active Gene Expression
Less Gene Expression
Acetylation Methylation
Implications of Chromatin Modifications
-
ChIP-Seq simplified processing pipeline
1. Illumina short reads
2. Map onto genome (NOVOALIGN)
3. Mapped reads
4. cisGenome, FindPeaks
OUTPUTS: Genomic regions associated with chromatin modifications
-
Ji et al. (2008) Nature Biotechnology 26: 1293-1300
One Sample Data Processing
• Scans the genome with sliding windows and identifies regions whose read counts exceed a user-chosen cutoff as bona fide binding regions.
• False Discovery Rates (FDRs) are estimated by modeling the read count in nonbinding windows using a negative binomial distribution. This allows the background rate of occurrence of the reads to vary across the genome, following a flexible gamma distribution.
• Uses the directionality of reads to refine peak boundaries and filter out low-quality predictions.
cisGenome
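The first step described above, scanning with sliding windows and keeping those above a cutoff, can be sketched as below. This is not cisGenome's code: it omits the negative binomial FDR model and the strand-based boundary refinement, and the window size, step and cutoff values are assumed parameters.

```python
from bisect import bisect_left


def windows_above_cutoff(read_starts, genome_length, window=100, step=25, cutoff=10):
    """Return (start, end, count) for each window holding more than cutoff read starts."""
    starts = sorted(read_starts)
    hits = []
    for w_start in range(0, max(genome_length - window, 0) + 1, step):
        w_end = w_start + window
        # Number of read starts in the half-open interval [w_start, w_end)
        count = bisect_left(starts, w_end) - bisect_left(starts, w_start)
        if count > cutoff:
            hits.append((w_start, w_end, count))
    return hits


def merge_windows(hits):
    """Merge overlapping or adjacent windows into candidate regions, keeping the peak count."""
    merged = []
    for start, end, count in hits:
        if merged and start <= merged[-1][1]:
            prev = merged[-1]
            merged[-1] = (prev[0], max(prev[1], end), max(prev[2], count))
        else:
            merged.append((start, end, count))
    return merged
```

For example, 20 read starts piled up at positions 500 to 519 on a 1,000 bp sequence, scanned with window=100, step=50 and cutoff=10, produce two overlapping windows that merge into the single region (450, 600).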
-
Protein-DNA Interactions
-
Diverse genomic contexts for chromatin marks
-
Arabidopsis thaliana nucleosome
-
• Implementation of a short read mapping pipeline
– Benchmarking of freely available aligners
• miR-Seq: Profiling of miRNAs and miRNAs*
• ChIP-Seq: Defining genomic regions associated with histone modifications
• Bi-Seq: Determining genome-wide CpG, CHG and CHH methylation marks
Overview
-
Bisulfite Sequencing (Bi-Seq; BS-Seq)
Next Gen Sequencing
-
Genome-wide Methylation Marks
1. Illumina short reads
2. Map onto genome (MAQ)
3. Mapped reads
4. Check bisulfite conversion (C to T)
OUTPUTS: Bisulfite conversion report; genome-wide methylation marks (CpG, CHG, CHH)
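The three methylation contexts are defined by the bases that follow a cytosine, where H stands for A, C or T: CpG (C followed by G), CHG (C, then H, then G) and CHH (C followed by two H bases). A minimal sketch of classifying forward-strand cytosines in a reference sequence is shown below; the function names are hypothetical, and reverse-strand cytosines would be handled by running the same code on the reverse complement.

```python
from collections import Counter


def cytosine_context(seq, i):
    """Return 'CpG', 'CHG', 'CHH', or None if seq[i] is not a classifiable cytosine."""
    seq = seq.upper()
    if seq[i] != "C":
        return None
    following = seq[i + 1:i + 3]
    if len(following) >= 1 and following[0] == "G":
        return "CpG"
    if len(following) == 2 and following[0] in "ACT":
        if following[1] == "G":
            return "CHG"
        if following[1] in "ACT":
            return "CHH"
    return None  # too close to the sequence end, or followed by an ambiguous base


def context_counts(seq):
    """Count forward-strand cytosines in each context."""
    counts = Counter()
    for i in range(len(seq)):
        context = cytosine_context(seq, i)
        if context is not None:
            counts[context] += 1
    return counts
```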
-
Checking Bisulfite Conversion Efficiency
Aligned reads onto the Arabidopsis chloroplast genome
Sample | Read sequences | Unique alignments | Gapped alignments | Aligned
run 1 | 11,653,511 | 133,712 | 4,129 | 275,944
run 2 | 11,540,171 | 132,690 | 4,251 | 273,806
Bisulfite conversion efficiency of the chloroplast genome
Sample | C | T (percentage) | Y (percentage) | Unconverted (percentage) | Converted (percentage)
run 1 | 10,806 | 10,577 (97.88) | 183 (1.69) | 14 (0.13) | 10,760 (99.57)
run 2 | 10,837 | 10,570 (97.54) | 219 (2.02) | 11 (0.10) | 10,789 (99.56)
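The percentages in the conversion table can be reproduced from the raw counts. The figures are consistent with "converted" being the sum of the T and Y calls, each expressed as a percentage of the reference cytosines (C); that reading, the interpretation of Y as the IUPAC code for C or T, and the function name are assumptions. The chloroplast genome serves as the control because it is expected to be essentially unmethylated, so nearly every cytosine should convert.

```python
def conversion_report(total_c, t_calls, y_calls, unconverted):
    """Percentages of reference cytosines called T, Y, unconverted and converted (T + Y)."""
    def pct(n):
        return round(100.0 * n / total_c, 2)

    converted = t_calls + y_calls
    return {
        "T_pct": pct(t_calls),
        "Y_pct": pct(y_calls),
        "unconverted_pct": pct(unconverted),
        "converted": converted,
        "converted_pct": pct(converted),
    }


if __name__ == "__main__":
    # Counts taken from the table for run 1 and run 2
    print(conversion_report(10_806, 10_577, 183, 14))
    print(conversion_report(10_837, 10_570, 219, 11))
```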
-
Bisulfite conversion of the Arabidopsis thaliana chloroplast genome
-
Genome-wide Methylation Marks
(Figure tracks: Coverage, CHG, CHH, CpG; across chr1, chr2, chr3, chr4, chr5)
-
http://ccg.murdoch.edu.au/yabi
Web HPC ‐ Enabled
-
Acknowledgements
• Zhang Bing
• Ala Lew-Tabor
• Colin Hercus
• Zayed Albertyn
• Matthew Bellgard
• Frances Shannon
• Jun Fan
• Liz Dennis
• Ian Greaves
• Sameer Tiwari
NCRIS 5.1
National Collaborative Research Infrastructure Strategy: An Australian Government Initiative
Department of Primary Industries and Fisheries, Queensland Government