Page 1: A. Dereeper, G. Sarah, F. Sabot Explore SNP polymorphism data Bioinformatics trainings, Vietnam Hanoi, November, 2015

A. Dereeper, G. Sarah, F. Sabot

Explore SNP polymorphism data

Bioinformatics trainings, Vietnam Hanoi, November, 2015

Page 2:

Tablet

• Graphical tool to visualize assemblies

• Accepts many formats: ACE, SAM, BAM

A. Dereeper, G. Sarah, F. Sabot, Y. Hueber, Bioinformatics training, 9 to 13 February 2015

Page 3:

GATK (Genome Analysis ToolKit)

• Software package to analyse NGS data

• Originally developed to analyse human resequencing data for medical purposes (1000 Genomes, The Cancer Genome Atlas)

• Includes: depth analyses, quality score recalibration, SNP/InDel detection

• Complementary with other packages: SAMtools, Picard tools, VCFtools, BEDTools

PREPROCESS:

* Index human genome (Picard), we used HG18 from UCSC.
* Convert Illumina reads to Fastq format
* Convert Illumina 1.6 read quality scores to standard Sanger scores
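The quality-score conversion in the last preprocessing step can be sketched as follows. This is a minimal illustration, assuming that Illumina 1.6 quality strings use an ASCII offset of 64 (Phred+64) and that Sanger scores use an offset of 33 (Phred+33). The function name is hypothetical and not part of any tool named in the slides.

```python
def illumina64_to_sanger(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    converted = []
    for ch in quality:
        phred = ord(ch) - 64  # decode the Phred+64 character to a score
        if phred < 0:
            raise ValueError(f"{ch!r} is not a valid Phred+64 character")
        converted.append(chr(phred + 33))  # encode the score as Phred+33
    return "".join(converted)


# 'h' encodes Q40 in Phred+64 and becomes 'I' in Phred+33.
print(illumina64_to_sanger("hhhB"))  # III#
```

In practice this step is usually delegated to an existing converter; the sketch only shows what the conversion does to each character.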

FOR EACH SAMPLE:

1. Align samples to genome (BWA), generates SAI files.
2. Convert SAI to SAM (BWA)
3. Convert SAM to BAM binary format (SAMtools)
4. Sort BAM (SAMtools)
5. Index BAM (SAMtools)
6. Identify target regions for realignment (Genome Analysis Toolkit)
7. Realign BAM to get better Indel calling (Genome Analysis Toolkit)
8. Reindex the realigned BAM (SAMtools)
9. Call Indels (Genome Analysis Toolkit)
10. Call SNPs (Genome Analysis Toolkit)
11. View aligned reads in BAM/BAI (Integrative Genomics Viewer)
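Steps 1 to 10 above can be sketched as a list of command lines built in Python. This is only an illustration under several assumptions: single-end reads, a recent SAMtools (`samtools sort -o`), GATK 3.x command-line syntax, and UnifiedGenotyper as the caller for both Indels and SNPs (the slides do not say which GATK walkers were used). File names and the jar path are hypothetical, and the BAM is assumed to carry read groups already.

```python
from typing import List


def per_sample_commands(sample: str, ref: str = "hg18.fa",
                        gatk_jar: str = "GenomeAnalysisTK.jar") -> List[List[str]]:
    """Build (but do not run) the per-sample commands, one per step."""
    fq, sai, sam = f"{sample}.fastq", f"{sample}.sai", f"{sample}.sam"
    bam, sorted_bam = f"{sample}.bam", f"{sample}.sorted.bam"
    intervals, realigned = f"{sample}.intervals", f"{sample}.realigned.bam"
    gatk = ["java", "-jar", gatk_jar]
    return [
        ["bwa", "aln", "-f", sai, ref, fq],                      # 1. align
        ["bwa", "samse", "-f", sam, ref, sai, fq],               # 2. SAI to SAM
        ["samtools", "view", "-b", "-S", "-o", bam, sam],        # 3. SAM to BAM
        ["samtools", "sort", "-o", sorted_bam, bam],             # 4. sort
        ["samtools", "index", sorted_bam],                       # 5. index
        gatk + ["-T", "RealignerTargetCreator", "-R", ref,
                "-I", sorted_bam, "-o", intervals],              # 6. targets
        gatk + ["-T", "IndelRealigner", "-R", ref, "-I", sorted_bam,
                "-targetIntervals", intervals, "-o", realigned], # 7. realign
        ["samtools", "index", realigned],                        # 8. reindex
        gatk + ["-T", "UnifiedGenotyper", "-R", ref, "-I", realigned,
                "-glm", "INDEL", "-o", f"{sample}.indels.vcf"],  # 9. Indels
        gatk + ["-T", "UnifiedGenotyper", "-R", ref, "-I", realigned,
                "-glm", "SNP", "-o", f"{sample}.snps.vcf"],      # 10. SNPs
    ]


for command in per_sample_commands("sample1"):
    print(" ".join(command))
```

Each command could then be passed to `subprocess.run(command, check=True)`; building the list first makes it easy to review the pipeline before running anything.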

Page 4:

[Workflow diagram. For each Fastq file (RC1, RC2, RC3, RC4, …): Cutadapt, then Mapping BWA, then Add or Replace Groups, giving a BAM with read group. mergeSam combines the per-sample BAM files into a global BAM with read group, from which the VCF file is produced.]
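The read-group and merge steps of this workflow can be sketched as below. The sketch assumes that "Add or Replace Groups" and "mergeSam" correspond to Picard's AddOrReplaceReadGroups and MergeSamFiles tools, called with Picard's `KEY=value` argument style. The jar path, file names, and the library, platform and unit values are placeholders.

```python
from typing import List


def read_group_and_merge_commands(samples: List[str],
                                  merged_bam: str = "global.bam",
                                  picard_jar: str = "picard.jar") -> List[List[str]]:
    """Build one read-group command per sample, then a single merge command."""
    picard = ["java", "-jar", picard_jar]
    commands = []
    tagged_bams = []
    for sample in samples:
        tagged = f"{sample}.rg.bam"
        commands.append(picard + [
            "AddOrReplaceReadGroups",
            f"I={sample}.bam", f"O={tagged}",
            f"RGID={sample}", f"RGSM={sample}",           # read group = sample
            "RGLB=lib1", "RGPL=illumina", "RGPU=unit1",   # placeholder values
        ])
        tagged_bams.append(tagged)
    # Every tagged BAM becomes an input of the merge.
    commands.append(picard + ["MergeSamFiles"]
                    + [f"I={bam}" for bam in tagged_bams]
                    + [f"O={merged_bam}"])
    return commands


for command in read_group_and_merge_commands(["RC1", "RC2", "RC3", "RC4"]):
    print(" ".join(command))
```

Because each read carries its sample name in the read group, the variant caller can still assign genotypes per individual after the BAM files are merged.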

Page 5:

TASSEL-GBS (PLoS ONE, 2014)

For GBS data

Tassel pipeline, version 5

Page 6:

[Overview diagram. Input data: GBS, RAD-Seq, RNA-Seq, WGRS. Steps: reads pre-processing and mapping; SNP calling and genotype assignment; genotyping data storage and mining; genotyping data analyses and visualization (GWAS, diversity…). Tools shown: Galaxy workflow + Tassel.]

Page 7:

VCF format (Variant Call Format)

Advantages: describes the variation at each position, together with the genotype assignments.

Flat files that can be indexed. A binary equivalent also exists: the BCF format.
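To make the "variation + genotype" layout concrete, here is a minimal sketch that reads one VCF data line by hand. It assumes the standard tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample); the example line and the sample names are invented for illustration.

```python
def parse_vcf_line(line, sample_names):
    """Return (chrom, pos, {sample: (alleles, phased)}) for one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    alleles = [ref] + alt.split(",")            # allele 0 is REF, 1.. are ALT
    gt_index = fields[8].split(":").index("GT") # position of GT in FORMAT
    genotypes = {}
    for name, sample_field in zip(sample_names, fields[9:]):
        gt = sample_field.split(":")[gt_index]
        phased = "|" in gt                      # "|" phased, "/" unphased
        codes = gt.replace("|", "/").split("/")
        called = tuple(None if c == "." else alleles[int(c)] for c in codes)
        genotypes[name] = (called, phased)
    return chrom, int(pos), genotypes


example = "Chr1\t1203\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0/1:12\t1|1:18\t./.:0"
print(parse_vcf_line(example, ["ind1", "ind2", "ind3"]))
```

For real files a dedicated VCF library is preferable; the point here is only that each line holds the variant description followed by one genotype field per individual.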

Page 8:

Other GATK functionalities

• Module DepthOfCoverage: reports the sequencing depth for each gene, each position and each individual

• Module ReadBackedPhasing: where possible, establishes the associations between alleles (phase and haplotypes) at heterozygous sites.

[Phasing figure. Caption fragment: "and not AGGGGA"]

Page 9:

Pileup format

- Another format for variant calling (generated by samtools)
- Describes the alignment position by position (one line per reference position), not read by read as in SAM format
- Used by VarScan-like software (varscan pileup2snp)
- Frequently used to detect rare, low-frequency variants (e.g. viral populations)
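As an illustration of the per-position layout, the sketch below counts the alleles in the read-bases column of one pileup line. It is a simplified reading of the samtools pileup encoding: `.` and `,` match the reference, letters are mismatches, `^` plus one character marks a read start, `$` a read end, `+n`/`-n` introduce an indel of n bases, and `*` is a deleted base. Other symbols are not handled specially, and the example string is invented.

```python
from collections import Counter


def count_pileup_bases(ref_base: str, read_bases: str) -> Counter:
    """Count the alleles observed at one position of a pileup line."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":              # read start: skip marker + mapping quality
            i += 2
        elif ch == "$":            # read end marker
            i += 1
        elif ch in "+-":           # indel: skip its length and its bases
            j = i + 1
            while j < len(read_bases) and read_bases[j].isdigit():
                j += 1
            i = j + int(read_bases[i + 1:j])
        elif ch in ".,":           # match to the reference on either strand
            counts[ref_base.upper()] += 1
            i += 1
        else:                      # mismatch base or "*" (deleted base)
            counts[ch.upper()] += 1
            i += 1
    return counts


counts = count_pileup_bases("A", "..,^F.Tt$,+2AG.*")
depth = sum(counts.values())
print(counts, "frequency of T:", counts["T"] / depth)
```

Keeping the raw counts per position is what makes the format convenient for low-frequency variants: a variant seen in a small fraction of reads is still visible before any genotype is imposed.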

Page 10:

Gigwa project, for the management of massive variant datasets (GBS, RADSeq, WGRS)

- Based on NoSQL technology

- Handles VCF files (Variant Call Format) and annotations

- Supports multiple variant types: SNPs, InDels, SSRs, SV

- Powerful genotyping queries

- Easily scalable with MongoDB sharding

- Transparent access

- Takes phasing information into account when importing/exporting in VCF format

« With NGS arise serious computational challenges in terms of storage, search, sharing, analysis, and data visualization, that redefine some practices in data management. »

Page 11:

http://gigwa.southgreen.fr/gigwa/

Page 12:

SNiPlay: Web application for polymorphism analyses

http://sniplay.southgreen.fr

Page 13:

• IFB project “Galaxy4Sniplay” (WP4 IFB, Plant node)

Page 14:

• Available through the Galaxy Tool Shed; installable on any Galaxy instance

Page 15:

Upload a VCF file in SNiPlay

Upload a VCF file (+ reference if not available in genome collection)

Select rice genome

The reference corresponds to mRNA

http://sniplay.southgreen.fr

Page 16:

• Filters using VCFtools or Gigwa

• MAF (minor allele frequency)

• Missing data

• Annotation

• Position…
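The MAF and missing-data filters can be sketched as follows for one biallelic site. This is a hand-written illustration, not the VCFtools or Gigwa implementation; genotypes are given as VCF-style strings, and the thresholds (5% MAF, 20% missing) are arbitrary example values.

```python
def site_stats(genotypes):
    """Return (minor allele frequency, missing-data rate) for one site."""
    alt_alleles = 0
    called_alleles = 0
    missing = 0
    for gt in genotypes:
        codes = gt.replace("|", "/").split("/")
        if "." in codes:                 # uncalled genotype such as "./."
            missing += 1
            continue
        called_alleles += len(codes)
        alt_alleles += sum(1 for code in codes if code != "0")
    missing_rate = missing / len(genotypes)
    if called_alleles == 0:
        return 0.0, missing_rate
    alt_freq = alt_alleles / called_alleles
    return min(alt_freq, 1 - alt_freq), missing_rate


def passes_filters(genotypes, min_maf=0.05, max_missing=0.2):
    """Keep the site only if it is polymorphic enough and well genotyped."""
    maf, missing_rate = site_stats(genotypes)
    return maf >= min_maf and missing_rate <= max_missing


site = ["0/0", "0/1", "1/1", "./."]
print(site_stats(site), passes_filters(site), passes_filters(site, max_missing=0.3))
```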

Page 17:

SNP annotation using SnpEff

Page 18:

Page 19:

sNMF

• Tests different values of K (estimates, through likelihood tests, the probability that the samples are structured in K populations)

• For the best value of K, the application shows the Q estimates for each individual (admixture percentages, i.e. the probability that the individual belongs to each population)
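One common way to read the Q estimates is to assign each individual to its majority population and to flag the others as admixed. The sketch below shows that reading for a toy Q matrix; the 0.8 cut-off and the individual names are assumptions chosen for the example, not values taken from sNMF or SNiPlay.

```python
def assign_populations(q_matrix, threshold=0.8):
    """Map each individual to 'popN' or 'admixed' from its Q estimates."""
    assignments = {}
    for individual, q_values in q_matrix.items():
        total = sum(q_values)
        proportions = [value / total for value in q_values]  # rows sum to 1
        best = max(range(len(proportions)), key=proportions.__getitem__)
        if proportions[best] >= threshold:
            assignments[individual] = f"pop{best + 1}"
        else:
            assignments[individual] = "admixed"
    return assignments


q_estimates = {          # toy example with K = 2
    "ind1": [0.95, 0.05],
    "ind2": [0.40, 0.60],
    "ind3": [0.10, 0.90],
}
print(assign_populations(q_estimates))
```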

Page 20:

MDS (Multi-Dimensional Scaling) plot

SNP-based distance tree with FastME

Page 21:

Comparison between individuals

Diversity analysis

Pi (nucleotide diversity): average number of nucleotide differences per site between any two DNA sequences chosen randomly from the sample population.

Used to measure the degree of polymorphism within a population.
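The definition of Pi translates directly into a pairwise computation. The sketch below assumes aligned sequences of equal length with no gaps or missing data, and the three toy sequences are invented for the example.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average number of differences per site over all pairs of sequences."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are needed")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(sequences, 2))
    total_differences = sum(
        sum(a != b for a, b in zip(seq1, seq2)) for seq1, seq2 in pairs
    )
    return total_differences / (len(pairs) * length)


# Pairwise differences are 1, 2 and 1 over 8 sites: Pi = 4 / (3 * 8).
print(nucleotide_diversity(["ACGTACGT", "ACGTACGA", "ACCTACGA"]))
```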

Page 22:

=> Can allow the detection of introgression

Introgression = movement of an exogenous region (gene flow) from one species into the gene pool of another by the repeated backcrossing of an interspecific hybrid with one of its parent species

Widely used in agronomy, but can also occur naturally

Page 23:

Haplotypes

• Haplotype reconstruction using Gevalt

• Network with Haplophyle

• Available only for regions presenting few variants (short regions, genes)

• Exploit phased VCF (in progress…)

[Haplotype network figure. Labels: high-frequency haplotypes; low-frequency haplotype; group distribution within this haplotype; distance between 2 haplotypes (number of mutations)]
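Two quantities shown in a haplotype network, the frequency of each haplotype and the distance between two haplotypes counted in mutations, can be sketched as below. This is only an illustration on invented haplotype strings; it does not reproduce what Gevalt or Haplophyle do internally.

```python
from collections import Counter


def haplotype_frequencies(haplotypes):
    """Frequency of each distinct haplotype among the reconstructed ones."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return {haplotype: count / total for haplotype, count in counts.items()}


def mutation_distance(hap_a, hap_b):
    """Number of variant positions at which two haplotypes differ."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same variant positions")
    return sum(a != b for a, b in zip(hap_a, hap_b))


observed = ["AGGTC", "AGGTC", "AGGTC", "AGATT"]
print(haplotype_frequencies(observed))
print(mutation_distance("AGGTC", "AGATT"))
```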

Page 24:

Illumina chip design

Cartesian coordinates

Genotyping file

Submission file for Illumina

Analysis with BeadStudio software

Page 25:

GWAS (Genome-Wide Association Studies)

• Estimate the association between a marker and a phenotypic trait

• Manhattan plots: display GWAS statistical tests (-log10 p-value) along chromosomes

• TASSEL, MLMM software

• False positives arise because of the structure of the studied panel => correction using population structure and kinship
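The values plotted in a Manhattan plot can be sketched as follows: each marker's p-value is turned into -log10(p) and compared with a significance line. The Bonferroni line used here (alpha divided by the number of markers) is an assumption added for the example, as are the marker positions and p-values; p-values are assumed to be strictly positive.

```python
import math


def manhattan_points(results, alpha=0.05):
    """Return ([(chrom, pos, -log10 p, significant)], threshold)."""
    threshold = -math.log10(alpha / len(results))  # Bonferroni line
    points = []
    for chrom, pos, p_value in results:
        score = -math.log10(p_value)
        points.append((chrom, pos, score, score > threshold))
    return points, threshold


gwas_results = [("1", 100, 0.5), ("1", 200, 1e-6), ("2", 50, 0.01)]
points, threshold = manhattan_points(gwas_results)
for chrom, pos, score, significant in points:
    print(chrom, pos, round(score, 2), significant)
```

With real data the number of markers is large, so the significance line sits much higher than in this three-marker toy example.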

Page 26:

Practical session: study of root characters using GWAS in Oryza sativa japonica. Influence of a correction using structure and kinship

Page 27:

Population structure analysis

• Tests different values of K (estimates the probability that the samples are structured in K populations)

• For the best value of K, the application shows the Q estimates for each individual (admixture percentages)

Page 28:

Relatedness between individuals (kinship matrix)

• TASSEL and PLINK software

• Estimation of relatedness between individuals using a distance matrix
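A distance matrix between individuals can be sketched with a simple identity-by-state style measure. This is an assumption chosen for illustration and is not necessarily the estimator used by TASSEL or PLINK. Genotypes are coded as the number of alternate alleles (0, 1 or 2), with `None` for missing data, and the three individuals are invented.

```python
def ibs_distance_matrix(genotypes):
    """Pairwise distance: mean allele-count difference / 2 over shared sites."""
    names = list(genotypes)
    matrix = {name: {} for name in names}
    for a in names:
        for b in names:
            shared = [(x, y) for x, y in zip(genotypes[a], genotypes[b])
                      if x is not None and y is not None]
            if not shared:                       # no site genotyped in both
                matrix[a][b] = float("nan")
                continue
            differences = sum(abs(x - y) for x, y in shared)
            matrix[a][b] = differences / (2 * len(shared))
    return matrix


panel = {
    "ind1": [0, 1, 2, 0],
    "ind2": [0, 1, 2, 0],
    "ind3": [2, 1, 0, None],
}
distances = ibs_distance_matrix(panel)
print(distances["ind1"]["ind2"], distances["ind1"]["ind3"])
```

A distance of 0 means identical genotypes at every shared site and 1 means opposite homozygotes everywhere; one minus the distance gives a similarity that plays the role of a simple relatedness estimate.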
