Corona Lite Introduction
© 2009 Applied Biosystems
Section Outline – Corona Lite Introduction
• Workflow and Setup
• Matching pipeline
• Pairing pipeline
• Variation pipeline
Corona Lite Overview
GlobalSETS versus Corona Lite
Category | SOLiD™ Global SETS v3.0 | Corona_Lite v4.2
Mapping algorithm | MapReads | MapReads
Mapping scheme: progressive | Yes, for max throughput (default) | No
Mapping scheme: full length with fixed number of mismatches | Yes | Yes
Repeat Classifier | Yes, new in v3.0 | No
MatchingRepeat, Random, and Consolidate | Yes, new in v3.0 | No
SNP algorithm | diBayes | SNP caller
Multiple run combination analysis | No | Yes
Integrated small indel analysis | June 2009 | Yes
Global SETS Versus Corona Lite
Category | SOLiD™ Global SETS v3.0 | Corona_Lite v4.2
Matching: Fasta-like .ma files | Yes | Yes
Matching: gff v.2 | Default | Optional using MaToGff.sh
Pairing: .mates | Yes | Yes
Pairing: gff v.2 | Default | Optional using MatesToGff.sh
SNP pipe: SNP summary | Gff v.3 | SNP list text file
SNP pipe: Consensus base sequence | Yes | Yes
Stats files | New format | Old stats file
Corona Lite Setup
• Before you start
  • Set the correct environment
  • Make cmap file
  • Validate reference
  • Generate double-encoded reference
Corona Lite Setup – Environment Variables
• Set up environment variables
  • for csh/tcsh:
    • setenv CORONAROOT /share/apps/corona_lite
    • source $CORONAROOT/etc/profile.d/corona.csh
  • for sh/ksh/bash:
    • export CORONAROOT=/share/apps/corona_lite
    • source $CORONAROOT/etc/profile.d/corona.sh
Corona Lite Setup – Chromosome Map (cmap) File
• Prepare the chromosome map file (tab-delimited):
  • Chromosome ID
  • Chromosome Name
  • FASTA Reference
  • Double-Encoded Reference
• For example
1	chr1	/path/to/file/chr1.fa	/path/to/file/de_chr1.fa
2	chr2	/path/to/file/chr2.fa	/path/to/file/de_chr2.fa
3	chr3	/path/to/file/chr3.fa	/path/to/file/de_chr3.fa
4	chr4	/path/to/file/chr4.fa	/path/to/file/de_chr4.fa
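A cmap file with the four tab-delimited columns listed on this slide can be generated rather than typed by hand. The following is a minimal sketch, not a Corona Lite tool: the helper names `build_rows` and `write_cmap` are hypothetical, and the sequential IDs and the `de_` file prefix are assumptions that simply mirror the slide's example.

```python
import csv
import io


def build_rows(chromosome_names, reference_dir):
    """Return (id, name, fasta, double-encoded fasta) tuples.

    Assumption: IDs count up from 1 and the double-encoded file uses a
    'de_' prefix, as in the slide's example. Adjust to your own layout.
    """
    return [
        (index, name, f"{reference_dir}/{name}.fa", f"{reference_dir}/de_{name}.fa")
        for index, name in enumerate(chromosome_names, start=1)
    ]


def write_cmap(rows, handle):
    """Write the rows as tab-delimited lines, one chromosome per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(row)


# Build the four-chromosome example in memory and show it.
buffer = io.StringIO()
write_cmap(build_rows(["chr1", "chr2", "chr3", "chr4"], "/path/to/file"), buffer)
cmap_text = buffer.getvalue()
print(cmap_text, end="")
```

To write a real file, pass an open file handle (for example `open("genome.cmap", "w", newline="")`, where the file name is only a placeholder) instead of the in-memory buffer.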
Corona Lite Setup – Validate and Double-Encode Reference
• Validate reference
  • reference_validation.pl -r chr1.fa -s 9999999999 -o chr1_validated.fa
• Generate double-encoded sequence
  • encodeFasta.py -n -l sequence.fasta > de_sequence.fasta
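For readers unfamiliar with the term, the sketch below illustrates what "double encoding" generally means for SOLiD data: each pair of adjacent bases becomes a color (0 to 3), and the colors are then written with the letters A, C, G, T. It is an illustration of the concept only and does not reproduce `encodeFasta.py`. The function names are hypothetical, and the handling of N and of the leading base is an assumption.

```python
BASES = "ACGT"
# Two-bit code per base; the color of a base pair is the XOR of the two codes
# (AA=0, AC=1, AG=2, AT=3, and so on).
CODE = {base: index for index, base in enumerate(BASES)}


def to_colors(sequence):
    """Return one color per adjacent base pair, or None where a base is unknown."""
    colors = []
    for left, right in zip(sequence, sequence[1:]):
        if left in CODE and right in CODE:
            colors.append(CODE[left] ^ CODE[right])
        else:
            colors.append(None)  # assumption: any non-ACGT base gives an unknown color
    return colors


def double_encode(sequence):
    """Write colors 0,1,2,3 as A,C,G,T; unknown colors become N (assumption)."""
    colors = to_colors(sequence.upper())
    return "".join("N" if color is None else BASES[color] for color in colors)


encoded = double_encode("ATGC")
print(encoded)  # TCT
```

The result is one character shorter than the input because colors describe transitions between bases.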
Section Outline – Corona Lite Introduction
• Workflow and Setup
• Matching pipeline
• Pairing pipeline
• Variation pipeline
Things To Consider
• Number of hits to report (-z)
  • Default is 10 per chromosome
  • What does it mean if a read hits 10 times?
• Recommended mismatches
  • 2 for 25 bp reads
  • 3-4 for 35 bp reads
  • 4-6 for 50 bp reads
  • Can consider counting valid adjacent mismatches as 1 (-a=1)
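The recommendations above can be captured in a small lookup. This is a sketch with hypothetical helper names. The second function encodes one plausible reading of the "10 hits" question, which is an assumption, not documented behavior: a read whose hit count reaches the -z cap may have further hits that were not reported.

```python
# Recommended (minimum, maximum) mismatches per read length, from the slide.
RECOMMENDED_MISMATCHES = {25: (2, 2), 35: (3, 4), 50: (4, 6)}


def recommended_mismatches(read_length):
    """Return the (min, max) recommended mismatches for a listed read length."""
    try:
        return RECOMMENDED_MISMATCHES[read_length]
    except KeyError:
        raise ValueError(f"no recommendation listed for {read_length} bp reads")


def hits_may_be_truncated(hit_count, max_hits=10):
    """Assumption: reaching the -z cap means more hits may exist than were reported."""
    return hit_count >= max_hits


print(recommended_mismatches(35))
print(hits_may_be_truncated(10))
```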
Matching Parameters (Required)
• matching_large_genomes_cmap_save.pl
  • -csfasta – F3 or R3 reads
  • -dir – Output directory
  • -cmap – Chromosome map file
  • -t – Tag length
  • -e – Number of errors allowed
Matching Parameters (Optional)
• matching_large_genomes_cmap_save.pl
  • -p – Pattern mask for reads
  • -a – Count adjacent errors as one: 0 = no; 1 = valid adjacent errors; 2 = all adjacent errors (default: 0)
  • -z – Maximum number of hits per chromosome (default: 10)
  • -incremental – Remove reads that have already mapped
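The required and optional parameters can be assembled into a command line programmatically, which helps when generating many jobs. The sketch below is hedged: the function name and the file names are hypothetical, and the `-flag=value` form is an assumption based only on the slide's "-a=1" notation, so check the script's usage message before relying on it. `-incremental` is left out because the slides do not show how it is passed.

```python
import shlex


def build_matching_command(csfasta, out_dir, cmap, tag_length, errors,
                           pattern=None, adjacent=None, max_hits=None):
    """Return the matching command as a list of arguments.

    Assumption: options are written as -flag=value (inferred from "-a=1").
    """
    if adjacent is not None and adjacent not in (0, 1, 2):
        raise ValueError("-a must be 0, 1 or 2")
    options = [
        ("-csfasta", csfasta),
        ("-dir", out_dir),
        ("-cmap", cmap),
        ("-t", tag_length),
        ("-e", errors),
    ]
    optional = [("-p", pattern), ("-a", adjacent), ("-z", max_hits)]
    options += [(flag, value) for flag, value in optional if value is not None]
    return ["matching_large_genomes_cmap_save.pl"] + [
        f"{flag}={value}" for flag, value in options
    ]


# Placeholder inputs: 35 bp reads, 3 errors, valid adjacent errors counted as one.
command = build_matching_command("reads_F3.csfasta", "output", "genome.cmap", 35, 3, adjacent=1)
print(shlex.join(command))
```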
Submitting Jobs
• For PBS, use submit_scripts_to_PBS.pl
• Submission scripts exist for LSF, SGE and SMP machines
• Required options
  • -j – Job list file
• Optional options
  • -h – Usage description
  • -q – Specify a queue
  • -i – Interactive queue
Section Outline – Corona Lite Introduction
• Workflow and Setup
• Matching pipeline
• Pairing pipeline
• Variation pipeline
Find Insert Size
• pairing_by_group.pl
  • -F3 – F3 match file (.csfasta.ma)
  • -R3 – R3 match file (.csfasta.ma)
  • -e – Total errors allowed during mapping
  • -output_dir – Output directory
  • -find_pairing_dist – Flag for finding distance distribution
• Look at pairingDist.freq.binned file
Insert Size Distribution
[Figure: insert size distribution plot; axis ranges 0-3000 and 0-800]
Perform Mate-pair Rescue
• pairing_by_group.pl
  • -F3 – F3 match file (.csfasta.ma)
  • -R3 – R3 match file (.csfasta.ma)
  • -e – Total errors allowed during mapping
  • -output_dir – Output directory
  • -min_insert_size – From distribution
  • -max_insert_size – From distribution
  • -ref – Multi FASTA reference file
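The -min_insert_size and -max_insert_size values are read off the insert size distribution. One way to pick them is to take the window that covers the central bulk of the pairs; the sketch below does this for a 95% window. It is only an illustration: the function name and the 95% choice are assumptions, the example counts are made up, and because the layout of pairingDist.freq.binned is not shown in the slides, the function takes already-parsed (insert size, count) pairs.

```python
def insert_size_window(binned, coverage=0.95):
    """Return (min_size, max_size) covering the central `coverage` of all pairs.

    binned: iterable of (insert_size, count) pairs, in any order.
    """
    bins = sorted(binned)
    total = sum(count for _, count in bins)
    if total <= 0:
        raise ValueError("distribution is empty")
    tail = (1.0 - coverage) / 2.0 * total  # pairs to discard on each side

    def first_bin_past_tail(ordered_bins):
        # Walk inward from one end until the discarded tail is exceeded.
        running = 0
        for size, count in ordered_bins:
            running += count
            if running > tail:
                return size
        return ordered_bins[-1][0]

    return first_bin_past_tail(bins), first_bin_past_tail(bins[::-1])


# Made-up example distribution (insert size, number of pairs).
example = [(500, 5), (1000, 40), (1500, 400), (2000, 500), (2500, 50), (3000, 5)]
window = insert_size_window(example)
print(window)  # (1000, 2500)
```

A wider or narrower window trades rescued pairs against false pairings, so the coverage value is worth adjusting per library.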
Mate-pair Descriptions
• Mate-pairs are annotated with a three-letter code
Section Outline – Corona Lite Introduction
• Workflow and Setup
• Matching pipeline
• Pairing pipeline
• Variation pipeline
SNP Pipeline
• Preparation
  • Single tag: split_by_chromosome.pl
    • -f – Unique match file (.unique.csfasta.ma)
    • -c – Output chromosome directory
  • Mate pair: multi_chr_pairing_parser.pl
    • -mates – Mates file from pairing pipeline (.mates)
    • -o_dir – Output directory
SNP Pipeline
• Consensus and SNP calling
  • consensus_prep_and_wrapper_cmap_save_script.pl
    • -mates/match_dir – Output from preparation step
    • -cmap – Chromosome map file
    • -mlf3/mlr3 – Tag length
    • -ef3/er3 – Mismatches allowed
    • -o_dir – Output directory
    • -insert_start/_end – Pairing size for mate pair run
SNP Pipeline
• Consensus sequence generated from alignment to the reference sequence
• Files
  • snps.txt
  • snps_sorted.txt
  • snp_probs.dat
  • bp_consensus_confirmed_sequence_with_Ns.fasta
Corona Lite Overview
Quiz
• What do you need to do before running Corona Lite?
• What are the three main steps of Corona Lite?
• What is the workflow of each pipeline in Corona Lite?
• What is the meaning of the three-letter annotations of mate-pairs (e.g., AAA, ABA, etc.)?
• What are the main differences between Corona Lite and GlobalSETS?
Appendix
General – Global SETS Versus Corona Lite
Category | SOLiD™ Global SETS v3.0 | Corona_Lite v4.2
Supported OS | Linux CentOS, Scyld ClusterWare, PBS (Torque); will test LSF, PBS Pro and SGE by June 2009 | Linux, PBS, LSF, SGE
Programming language | Java (algorithms in C++) | Scripting languages (some algorithms in C++)
Analysis setup and execution | Automatic through SETS GUI; integrated command line | Integrated command; can run in batch mode
Integrate with custom pipeline | Yes (SAI): GUI and command line | Yes, command line interface
Speed | Optimized for compute performance for complex genome analysis | Supports complex genome analysis
Warranty | Yes | No
AB support to end users | Yes | Yes
License fee | Comes with SOLiD 3 System; contact AB sales | Free, open source