dr. sonika tyagi, - australia bioinformatics resource · dr philippa griffin open data coordinator...

Post on 05-Jun-2018

A/Prof Vicky Schneider, Deputy Director
A/Prof Andrew Lonie, Director
Dr Philippa Griffin, Open Data Coordinator
Dr. Sonika Tyagi, Bioinformatics Supervisor, AGRF
Training Coordination, EMBL-ABR

Outline

•  Stacks pipeline workflow
– Single vs PE RADseq
– Overview of different steps in the pipeline
– De novo vs reference-based SNP calls
– Outputs

Single-end RAD

Slides from Yvan le Bras, Anthony Bretaudeau, Cyrin Monjeaud, Gildas le Corguille at CESCO (Centre d’Écologie et des Sciences de la Conservation), IFB (French Institute of Bioinformatics). Some slides originally from Karim Gharbi – Edinburgh Genomics, University of Edinburgh

Single-end RAD

Cut 5’->3’

Complementary strand

Hohenlohe et al., PLoS Genetics 2010

Single-end RAD

Paired-end RAD

Paired-end RAD

Paired-end RAD

Single vs Paired-end RAD

ddRAD

ddRAD

~500 bp

Paired-end ddRAD

RAD vs ddRAD

•  classic RAD: reads lie between the restriction site and a random site (shearing/sonication)
•  ddRAD: reads lie between the 2 restriction sites, which gives more flexibility in the trade-off between coverage and depth of coverage
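
The ddRAD idea above (fragments bounded by two different restriction sites, then size-selected) can be illustrated with a minimal in-silico digest. This is only a sketch under simplifying assumptions: the cut position is taken to be the start of the recognition site, only one strand is searched, and the function names, the toy sequence and the size window are hypothetical.

```python
import re

def find_sites(genome, site):
    """Start index of every (possibly overlapping) occurrence of a recognition site."""
    return [m.start() for m in re.finditer("(?=" + re.escape(site) + ")", genome)]

def ddrad_fragments(genome, site_a, site_b, min_len, max_len):
    """Return (start, end) of fragments flanked by one site_a cut and one site_b cut.

    Only neighbouring cut sites are paired, and only fragments whose length
    falls inside the size-selection window [min_len, max_len] are kept.
    """
    cuts = sorted([(p, "A") for p in find_sites(genome, site_a)] +
                  [(p, "B") for p in find_sites(genome, site_b)])
    fragments = []
    for (p1, e1), (p2, e2) in zip(cuts, cuts[1:]):
        length = p2 - p1
        if e1 != e2 and min_len <= length <= max_len:
            fragments.append((p1, p2))
    return fragments

# Toy sequence with two GAATTC sites and one CATG site between them.
toy = "TT" + "GAATTC" + "A" * 5 + "CATG" + "T" * 10 + "GAATTC" + "CC"
print(ddrad_fragments(toy, "GAATTC", "CATG", min_len=0, max_len=100))
print(ddrad_fragments(toy, "GAATTC", "CATG", min_len=12, max_len=20))
```

Narrowing the size window reduces the number of fragments (loci) kept, which is the lever ddRAD offers for trading number of loci against depth per locus.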

RAD vs ddRAD

Sampling the genome

source: Karim Gharbi - Edinburgh Genomics / University of Edinburgh

Biases

Because all reads begin with [half of] the restriction site

•  Consequence:
   •  The Illumina sequencer has difficulty separating polonies/clusters during the imaging step of the first cycles
•  Solution:
   •  use a set of barcodes with different sizes
   •  mix different experiments which use different restriction enzymes

Main Bioinformatics pipelines

•  STACKS
   •  Website: http://catchenlab.life.illinois.edu/stacks/
   •  mbRAD, ddRAD, ezRAD & 2bRAD?
   •  STACKS does not handle INDELs, so any locus near an INDEL is lost
   •  STACKS does not call SNPs from paired-end reads natively, and does especially poorly with paired-end fragments that are not of a random length (e.g., ddRAD and ezRAD)
•  dDocent
   •  Website: https://ddocent.wordpress.com/ddocent-pipeline-user-guide/
   •  ddRAD & ezRAD
•  PyRAD
   •  Website: http://dereneaton.com/software/pyrad/
   •  mbRAD, ddRAD, PE-ddRAD, GBS, PE-GBS, EzRAD, PE-EzRAD, 2B-RAD
   •  use of an alignment-clustering method (vsearch)
•  2bRAD (Wang et al 2012)
   •  de novo: https://github.com/z0on/2bRAD_denovo
   •  With reference genome: https://github.com/z0on/2bRAD_GATK
   •  2bRAD

SOFTWARES

http://catchenlab.life.illinois.edu/stacks J. Catchen, A. Amores, P. Hohenlohe, W. Cresko, and J. Postlethwait. Stacks: building and genotyping loci de novo from short-read sequences. G3: Genes, Genomes, Genetics, 1:171-182, 2011.

Stacks

STACKS: piling similar reads together

A.   Pile exact matches together
B.   Making dictionary based on K-mers
C.   Matching reads based on K-mer similarity

Things to remember:
1.  Stacks is not optimized for different read lengths (all reads from different barcodes should be trimmed uniformly).
2.  PCR duplicates are not recognizable.
3.  INDELs are not handled*
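
Steps A to C above can be sketched in a few lines. This is a toy illustration of the idea, not the Stacks implementation: the function names and the parameters min_depth, k and max_mismatches are hypothetical, and reads are assumed to be trimmed to the same length, as the first reminder requires.

```python
from collections import Counter, defaultdict

def build_stacks(reads, min_depth=2):
    """Step A: pile identical reads together; keep piles with enough depth."""
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n >= min_depth}

def kmer_index(stacks, k):
    """Step B: dictionary mapping each k-mer to the stacks that contain it."""
    index = defaultdict(set)
    for seq in stacks:
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq)
    return index

def similar_stacks(query, index, k, max_mismatches):
    """Step C: find stacks sharing a k-mer with the query, then confirm
    each candidate by counting mismatches position by position."""
    candidates = set()
    for i in range(len(query) - k + 1):
        candidates |= index.get(query[i:i + k], set())
    hits = []
    for seq in candidates:
        if seq != query and len(seq) == len(query):
            mismatches = sum(a != b for a, b in zip(seq, query))
            if mismatches <= max_mismatches:
                hits.append(seq)
    return sorted(hits)

reads = ["ACGTACGT"] * 3 + ["ACGTACGA"] * 2 + ["TTTTCCCC"] * 2 + ["GGGGGGGG"]
stacks = build_stacks(reads, min_depth=2)
index = kmer_index(stacks, k=4)
print(similar_stacks("ACGTACGT", index, k=4, max_mismatches=1))
```

The k-mer dictionary avoids comparing every stack against every other stack: only stacks that share at least one k-mer are compared in full. Note that, like the reminders above say, this position-by-position comparison cannot match reads that differ by an INDEL.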

STACKS: piling similar reads together

D. Matches secondary reads that were not initially placed in a stack against putative loci to increase stack depth.
E. Calls a consensus sequence and records SNP and haplotype data.
F. Putting consensus sequence into catalogue

Stacks denovo_map.pl
1.   ustacks
2.   cstacks
3.   sstacks

ref_map pipeline
pstacks
cstacks
sstacks

Stacks

[Figure; the only recoverable labels are three files marked .bam]

Results

•  Building loci: Generates 3 files per sample:
–  sample_BARCODE.alleles.tsv
–  sample_BARCODE.snps.tsv
–  sample_BARCODE.tags.tsv
•  Cataloguing of observed SNPs:
–  batch_1001.catalog.alleles.tsv
–  batch_1001.catalog.snps.tsv
–  batch_1001.catalog.tags.tsv
•  Verifying individual samples against catalogue
–  batch_1001.catalog.matches.tsv
–  sample_BARCODE.matches.tsv

So I’ve got my SNPs… what next?

•  What is your research question?
•  Are you interested in
– Population structure
– Genetic diversity
– Phylogeography
– Phylogenetic history
– ???

‘Typical’ downstream analysis workflow

•  If it’s the first time you’re working with this species and library design:
– Explore haplotype and SNP calls in Stacks interface to assess parameter setting effects

‘Typical’ downstream analysis workflow

•  Export a fairly ‘permissive’ vcf file from populations (the Stacks program)
•  Further project-specific filtering using vcftools or custom scripts
•  E.g. ‘Exclude individuals with missing data at >50% of loci; then exclude all loci missing in >30% of individuals per population’
•  Data visualisation is useful in assessing filtering
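
The example filter quoted above could be written as a custom script along the following lines. This is a minimal sketch, not a vcftools replacement: the in-memory data layout (a dict of genotype lists with None for missing calls) and all names are hypothetical, and "per population" is interpreted here as "a locus is dropped if it exceeds the threshold in any population", which is an assumption.

```python
def filter_genotypes(genotypes, populations,
                     max_ind_missing=0.5, max_locus_missing=0.3):
    """Two-step missing-data filter.

    genotypes:   {individual: [call or None, ...]}, same locus order for everyone
    populations: {individual: population name}
    Returns (filtered genotypes, indices of the loci that were kept).
    """
    if not genotypes:
        return {}, []
    n_loci = len(next(iter(genotypes.values())))

    # Step 1: exclude individuals with missing data at >50% of loci.
    kept = {ind: calls for ind, calls in genotypes.items()
            if sum(c is None for c in calls) / n_loci <= max_ind_missing}

    # Group the remaining individuals by population.
    by_pop = {}
    for ind in kept:
        by_pop.setdefault(populations[ind], []).append(ind)

    # Step 2: exclude loci missing in >30% of individuals in any population.
    kept_loci = []
    for j in range(n_loci):
        too_sparse = any(
            sum(kept[ind][j] is None for ind in members) / len(members)
            > max_locus_missing
            for members in by_pop.values()
        )
        if not too_sparse:
            kept_loci.append(j)

    filtered = {ind: [calls[j] for j in kept_loci] for ind, calls in kept.items()}
    return filtered, kept_loci
```

The order of the two steps matters: removing poor-quality individuals first means they no longer inflate the per-locus missingness, so fewer loci are discarded in the second step.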

Some ideas for downstream analysis

–  Population structure
•  Can obtain F-statistics from populations itself
•  F-statistics in GENEPOP (populations exports GENEPOP format)
•  F-statistics in R (e.g. adegenet package)
•  More complex clustering approaches to explore structure without assumptions
–  PCA, DAPC
–  STRUCTURE
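
To make the F-statistics mentioned above concrete, here is a minimal sketch of the textbook definition FST = (HT - HS) / HT for a single biallelic locus. It assumes equal population sizes and known allele frequencies, and it is not necessarily the estimator used by populations, GENEPOP or adegenet; the function name is hypothetical.

```python
def fst_biallelic(allele_freqs):
    """FST for one biallelic locus from per-population frequencies of one allele.

    HS is the mean expected heterozygosity within populations,
    HT the expected heterozygosity of the pooled (average) frequency.
    """
    if not allele_freqs:
        raise ValueError("need at least one population")
    p_bar = sum(allele_freqs) / len(allele_freqs)
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # locus is monomorphic overall, no differentiation to measure
    h_s = sum(2 * p * (1 - p) for p in allele_freqs) / len(allele_freqs)
    return (h_t - h_s) / h_t

print(fst_biallelic([0.5, 0.5]))  # identical populations
print(fst_biallelic([1.0, 0.0]))  # populations fixed for different alleles
```

Values near 0 indicate little differentiation between populations at that locus, and values near 1 indicate that populations are close to fixed for different alleles.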

Some ideas for downstream analysis

– Genetic diversity
•  populations can output heterozygosity, pi, FIS per population
•  Alternatively, output as vcf and calculate in R or other software
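
For readers who choose to calculate these statistics themselves, the sketch below computes observed heterozygosity, expected heterozygosity and FIS = 1 - Ho/He for one locus in one population. It is an illustration of the standard definitions without any small-sample correction, so its numbers may differ slightly from what populations reports; the function name and the tuple-based genotype layout are hypothetical.

```python
from collections import Counter

def locus_diversity(genotypes):
    """Return (Ho, He, FIS) for one locus in one population.

    genotypes: list of (allele, allele) tuples, or None for a missing call.
    FIS is None when the locus is monomorphic (He == 0).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes at this locus")
    n_ind = len(called)

    # Observed heterozygosity: fraction of individuals carrying two different alleles.
    h_obs = sum(a != b for a, b in called) / n_ind

    # Expected heterozygosity under Hardy-Weinberg: 1 - sum of squared allele frequencies.
    counts = Counter(allele for g in called for allele in g)
    n_alleles = 2 * n_ind
    h_exp = 1 - sum((c / n_alleles) ** 2 for c in counts.values())

    f_is = 1 - h_obs / h_exp if h_exp > 0 else None
    return h_obs, h_exp, f_is

print(locus_diversity([("A", "A"), ("A", "G"), ("G", "G"), ("A", "G")]))
```

A positive FIS means fewer heterozygotes than expected (for example from inbreeding or undetected substructure), and a negative FIS means an excess of heterozygotes.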

Some ideas for downstream analysis

– Phylogeography
•  Output as vcf, use tools in R or other software
– Phylogeny
•  Output as fasta or phylip format; easily converted to nexus (for example)
•  Use any number of tree-building approaches: BEAST, MrBayes, SplitsTree, PAUP...
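
As an illustration of how simple the phylip-to-nexus conversion mentioned above can be, here is a minimal sketch. It assumes a sequential, "relaxed" PHYLIP file (a header line with the number of taxa and characters, then one "name sequence" line per taxon) containing DNA data; interleaved files and strict 10-character names are not handled, and the function name is hypothetical.

```python
def phylip_to_nexus(phylip_text):
    """Convert sequential relaxed-PHYLIP text into a NEXUS DATA block."""
    lines = [ln for ln in phylip_text.splitlines() if ln.strip()]
    ntax, nchar = (int(x) for x in lines[0].split())

    records = []
    for ln in lines[1:1 + ntax]:
        name, seq = ln.split(None, 1)
        seq = "".join(seq.split())  # drop any spaces inside the sequence
        if len(seq) != nchar:
            raise ValueError(f"{name}: expected {nchar} characters, got {len(seq)}")
        records.append((name, seq))
    if len(records) != ntax:
        raise ValueError(f"expected {ntax} taxa, found {len(records)}")

    width = max(len(name) for name, _ in records)
    out = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={ntax} NCHAR={nchar};",
        "  FORMAT DATATYPE=DNA MISSING=? GAP=-;",
        "  MATRIX",
    ]
    out += [f"  {name.ljust(width)}  {seq}" for name, seq in records]
    out += ["  ;", "END;"]
    return "\n".join(out) + "\n"

print(phylip_to_nexus("2 4\ns1 ACGT\ns2 AC-T\n"))
```

The length checks are worth keeping: a mismatch between the header and the sequences is a common reason for tree-building programs to reject a file.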

Questions?

How can I use Stacks in Galaxy on my own data after this course?

Public Galaxy Servers Training: http://galaxy-tut.genome.edu.au

Research work: http://galaxy-mel.genome.edu.au http://galaxy-qld.genome.edu.au http://usegalaxy.org

List of other public Galaxy servers: https://wiki.galaxyproject.org/PublicGalaxyServers

galaxy-mel galaxy-mel.genome.edu.au

•  Galaxy server for (primarily Melbourne) researchers
•  Available to everyone
•  Users get 100GB disk but can get more
•  Helpdesk available: [email protected]
•  Stacks software installed and available.