Key tasks in sequence analysis: alignment, variant calling, assembly

Upload: anand-maurya

Post on 07-Aug-2018


  • 8/19/2019 Key Tasks in Sequence Analysis- Alignment-Variant Call-Assembly

    1/27

    Aylwyn Scally, Wellcome Trust Sanger Institute

    March 2012


    Key tasks in sequence analysis

    • Data handling
    • Alignment to a reference sequence
    • Alignment file handling
    • Variant calling
        • SNPs, genotypes
        • structural variation
    • Sequence assembly


    Data handling

    • Important to have a data hierarchy corresponding to experimental factors
        • lane/run
        • library
        • sequencing technology
        • individual
        • strain/subspecies/population
        • species

    [Figure: example hierarchy tree. Species: hsa. Populations: YRI, CEU. Individuals: NA12878, NA19240. Technologies: SLX, 454. Libraries: NA12878-WG, NA19240-WG. Lanes: 297_1, 297_2, 505_7, 505_8.]
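One way to make such a hierarchy concrete is to map each experimental factor to one directory level. The sketch below is only an illustration: the `lane_path` helper, the `data` root and the particular combination of labels are assumptions, not something prescribed by the slides.

```python
from pathlib import PurePosixPath

# Hierarchy levels from coarsest to finest (the slide lists them finest first).
LEVELS = ("species", "population", "individual", "technology", "library", "lane")

def lane_path(root, **factors):
    """Build a directory path with one component per hierarchy level."""
    missing = [level for level in LEVELS if level not in factors]
    if missing:
        raise ValueError(f"missing hierarchy levels: {missing}")
    return PurePosixPath(root, *(str(factors[level]) for level in LEVELS))

# Hypothetical combination of labels; the grouping is assumed for illustration.
example = lane_path(
    "data", species="hsa", population="CEU", individual="NA12878",
    technology="SLX", library="NA12878-WG", lane="297_1",
)
print(example)
```

Keeping one level per factor means that merging by library or by individual later becomes a matter of collecting everything below one directory.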


    Raw sequence data

    • FASTQ format
        • original Sanger standard for capillary data
        • derived from FASTA format
        • sequence and an associated per-base quality score
    • PHRED quality scores encoded as ASCII printable characters (ASCII 33–126)
    • standard offset 33, but older Solexa/Illumina variants used 64

    @title
    sequence
    +optional_text
    quality

    @SRR010930.8436795/1
    ACCCCAGGATCAACACTTCACATGCATTAGCAGAGAGAGATAAATCAA
    +
    =>=??A?
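The four-line record layout can be read with a few lines of Python. This is a minimal sketch that assumes unwrapped records (exactly four lines each); the `read_fastq` name and the two demo records are hypothetical and not taken from the slides.

```python
import io

def read_fastq(handle):
    """Yield (title, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {title!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {title!r}")
        yield title[1:], sequence, quality

# Hypothetical two-record input, for illustration only.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+read2\n!!#5\n")
records = list(read_fastq(demo))
print(records)
```

The length check matters in practice: a record whose quality string is shorter than its sequence usually indicates a truncated file.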


    PHRED quality scores

    • Encodes the probability of an erroneous call
        • quality score $Q = -10 \log_{10} P$
        • error probability $P = 10^{-Q/10}$
        • example: call with $Q = 30$ has error probability $P = 10^{-3}$ = 1 in 1000
    • ASCII encoding

    encoding  !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /  0  1  2  3  4
    Q score   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
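The two formulas and the ASCII encoding translate directly into code. The function names below are illustrative choices; the default offset of 33 and the older offset of 64 are the ones mentioned for FASTQ.

```python
import math

def error_probability(q):
    """P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def phred_score(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_qualities(quality_string, offset=33):
    """Convert an ASCII-encoded quality string to a list of integer Q scores."""
    return [ord(ch) - offset for ch in quality_string]

print(error_probability(30))               # about 0.001, i.e. 1 in 1000
print(decode_qualities("!+5I"))            # offset 33
print(decode_qualities("h", offset=64))    # older Solexa/Illumina offset
```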


    Alignment pipeline

    • Get data
    • Prepare and index reference
        • sequence names; alternate haplotypes etc
    • Align data
        • by lane or smaller unit, to optimise throughput
    • Sort by position
    • Merge alignments
    • Improve alignments
    • Merge libraries
    • Index final alignment

    [Slide graphic: stage labels DATA PROCESSING, ALIGNMENT, SAM FILE PROCESSING]


    Alignment pipeline

    [Figure: pipeline diagram from Fastq files to BAM files. Step labels: Alignment, Improvement, Library merge, merge. Level labels: Lane/plex, Library, Sample/Platform, Sample.]


    bwa

    • bwa index [-a bwtsw|div|is] [-c]
        • Burrows-Wheeler transform construction algorithm
        • ‘bwtsw’ for vertebrate-sized genomes, ‘is’ for smaller genomes
    • bwa aln [options]
        • align each single-ended fastq file individually
        • the reference argument is the name of the reference file
        • options control alignment parameters, scoring matrix, seed length
    • bwa sampe [options]
        • generates paired-end alignments from the sai files produced by bwa aln
        • produces SAM output
    • bwa bwasw [options]
        • alignment of long reads in query.fa
        • produces SAM output
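As a rough sketch of how these subcommands fit together, the helpers below only build argument lists; nothing is executed. The helper names and the file names are hypothetical, and the positional order (reference first, then inputs) reflects my understanding of the bwa usage text, so check `bwa` on your own installation before relying on it.

```python
def bwa_index_cmd(reference, algorithm="bwtsw"):
    """Argument list for indexing a reference with the chosen BWT algorithm."""
    return ["bwa", "index", "-a", algorithm, reference]

def bwa_aln_cmd(reference, fastq, quality_clip=None):
    """Argument list for aligning one single-ended fastq file (output goes to stdout)."""
    cmd = ["bwa", "aln"]
    if quality_clip is not None:
        cmd += ["-q", str(quality_clip)]
    return cmd + [reference, fastq]

def bwa_sampe_cmd(reference, sai1, sai2, fastq1, fastq2):
    """Argument list for pairing two bwa aln results into SAM output on stdout."""
    return ["bwa", "sampe", reference, sai1, sai2, fastq1, fastq2]

# Hypothetical file names, for illustration only.
for cmd in (
    bwa_index_cmd("ref.fa"),
    bwa_aln_cmd("ref.fa", "lane_1.fq", quality_clip=20),
    bwa_aln_cmd("ref.fa", "lane_2.fq", quality_clip=20),
    bwa_sampe_cmd("ref.fa", "lane_1.sai", "lane_2.sai", "lane_1.fq", "lane_2.fq"),
):
    print(" ".join(cmd))
```

Because `bwa aln` and `bwa sampe` write to standard output, a caller would normally redirect each result to a file, for example through the `stdout` argument of `subprocess.run`.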


    bwa usage notes

    • bwa finds matches up to a finite edit distance
        • by default for 100-bp reads allows 5 edits
    • Important to quality-clip reads
        • -q in bwa aln, e.g. set to 20
    • Non-ACGT bases on reads are treated as mismatches
    • Parallelise for speed
        • split data into 1 Gbp blocks
        • bwa takes ~8 hrs per block
    • Check for truncated BAM files
        • e.g. with samtools flagstat
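Splitting reads into blocks of roughly fixed total size can be sketched as below. The `split_into_blocks` helper is hypothetical and works on (name, sequence, quality) tuples; for paired-end data both mate files would need to be split at the same read boundaries, which this sketch does not handle.

```python
def split_into_blocks(reads, block_bases=1_000_000_000):
    """Group (name, sequence, quality) reads into blocks of about block_bases bases."""
    block, total = [], 0
    for read in reads:
        block.append(read)
        total += len(read[1])
        if total >= block_bases:
            yield block
            block, total = [], 0
    if block:
        yield block  # final, possibly smaller block

# Tiny hypothetical input with a 5-base block size so the split is visible.
demo_reads = [("a", "ACGT", "IIII"), ("b", "ACG", "III"), ("c", "AC", "II")]
blocks = list(split_into_blocks(demo_reads, block_bases=5))
print([[name for name, _, _ in block] for block in blocks])
```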


    Alignment improvement

    • Library duplicate removal
        • samtools, Picard
    • Realignment around indels
        • GATK
    • Base quality recalibration
        • GATK


    Library duplicate removal

    • PCR amplification step in library preparation can result in duplicate DNA fragments
        • PCR-free protocols exist but require larger volumes of input DNA
    • Generally a low number of duplicates in good libraries; increases with depth of sequencing
    • Duplicates can result in false SNP calls
        • manifest as high read depth support
    • Removal method
        • Identify read-pairs where outer ends map to the same position on the genome and remove all but one copy
        • samtools rmdup
        • Picard/GATK MarkDuplicates
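The removal method can be illustrated with a toy implementation that groups read pairs by their outer coordinates. This is not how samtools or Picard are implemented; the `ReadPair` record, its fields and the rule of keeping the copy with the highest summed base quality are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical minimal description of an aligned read pair.
ReadPair = namedtuple("ReadPair", "name chrom start end strand quality_sum")

def remove_duplicates(pairs):
    """Keep one pair per (chrom, outer start, outer end, strand) key.

    When several pairs share a key, the one with the highest quality_sum is kept.
    """
    best = {}
    for pair in pairs:
        key = (pair.chrom, pair.start, pair.end, pair.strand)
        kept = best.get(key)
        if kept is None or pair.quality_sum > kept.quality_sum:
            best[key] = pair
    return sorted(best.values(), key=lambda p: (p.chrom, p.start, p.end))

# Hypothetical pairs: p1 and p2 share both outer ends, p3 differs at the start.
demo_pairs = [
    ReadPair("p1", "X", 100, 400, "+", 900),
    ReadPair("p2", "X", 100, 400, "+", 950),
    ReadPair("p3", "X", 120, 400, "+", 800),
]
print([p.name for p in remove_duplicates(demo_pairs)])
```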


    Realignment

    • Short indels in the sample relative to the reference pose difficulties for alignment
        • Indels occurring near the ends of reads are often not aligned correctly
        • Aligners prefer to introduce SNPs rather than an indel
    • Realignment algorithm
        • Input set of known indel sites and a BAM file
            • Previously published indel sites, dbSNP, 1000 Genomes, or estimate from alignment
        • At each site, model the indel and reference haplotypes and select best fit with data
        • New BAM file produced, modified where indels have been introduced by realignment
    • Implemented in GATK (IndelRealigner function)


    Additional alignment issues

    • Separate chromosomal BAMs
        • easier to process in parallel
    • Realign/assemble unmapped reads
        • recover sequence missed due to reference incompatibility or incompleteness


    SAM/BAM

    • Sequence Alignment/Map format
        • unified format for storing read alignments to a reference genome
    • BAM (Binary Alignment/Map) format
        • binary equivalent of SAM
    • Features
        • stores alignments from most alignment programs
        • supports multiple sequencing technologies
        • supports indexing for quick retrieval
        • reads can be classed into logical groups
            • e.g. lanes, libraries, individuals


    SAM file format

    • Header
    • Alignment lines (one per read)
        • 11 mandatory fields
        • several optional fields (format TAG:TYPE:VALUE)

    Col  Field  Type  Description
    1    QNAME  str   query name of the read or the read pair
    2    FLAG   int   bitwise flag (pairing, mapped, mate mapped, etc.)
    3    RNAME  str   reference sequence name
    4    POS    int   1-based leftmost position of clipped alignment
    5    MAPQ   int   mapping quality (Phred scaled)
    6    CIGAR  str   extended CIGAR string (details of alignment)
    7    RNEXT  str   mate reference name (‘=’ if same as RNAME)
    8    PNEXT  int   position of mate/next segment
    9    TLEN   int   observed template length
    10   SEQ    str   segment sequence
    11   QUAL   str   ASCII of Phred-scaled base quality
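A tab-separated alignment line maps naturally onto a small record type. The sketch below is an assumption-laden illustration: `SamRecord` and `parse_sam_line` are made-up names, only the integer tag type `i` is converted, and the trailing `NM:i:1` tag in the demo line is hypothetical.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord", "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags"
)
INT_FIELDS = {1, 3, 4, 7, 8}  # zero-based indexes of FLAG, POS, MAPQ, PNEXT, TLEN

def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    mandatory = [int(v) if i in INT_FIELDS else v for i, v in enumerate(fields[:11])]
    tags = {}
    for item in fields[11:]:
        tag, type_, value = item.split(":", 2)  # TAG:TYPE:VALUE
        tags[tag] = int(value) if type_ == "i" else value
    return SamRecord(*mandatory, tags)

demo_line = (
    "IL4_315:7:105:408:43\t177\tX\t1741\t0\t1S35M\tX\t56845228\t0\t"
    "ATTTGGCTCTCTGCTTGTTTATTATTGGTGTATNGG\t"
    "+1,1+16;>;166>;>;;>>;>>>>>>,>>>>>+>>\tNM:i:1"
)
record = parse_sam_line(demo_line)
print(record.qname, record.flag, record.pos, record.cigar, record.tags)
```

For real files, a library such as Pysam is the safer route; hand parsing is shown only to make the column layout tangible.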


    SAM format

    • Example
    • http://picard.sourceforge.net/explain-flags.html

    QNAME  IL4_315:7:105:408:43
    FLAG   177
    RNAME  X
    POS    1741
    MAPQ   0
    CIGAR  1S35M
    RNEXT  X
    PNEXT  56845228
    TLEN   0
    SEQ    ATTTGGCTCTCTGCTTGTTTATTATTGGTGTATNGG
    QUAL   +1,1+16;>;166>;>;;>>;>>>>>>,>>>>>+>>
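The FLAG value is a sum of bits, so it can be unpacked with a simple mask table. The descriptions below paraphrase the first eleven bits of the SAM specification; the `explain_flag` name is an illustrative choice.

```python
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
]

def explain_flag(flag):
    """Return the descriptions of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]

# 177 = 128 + 32 + 16 + 1, the FLAG of the example record.
print(explain_flag(177))
```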


    SAM/BAM file processing tools

    • samtools
        • C program and library
        • http://samtools.sourceforge.net
        • view: SAM-BAM conversion
        • sort, index, merge multiple BAM files
        • flagstat: summary counts of mapping flags
    • Picard
        • Java program suite
        • http://picard.sourceforge.net
        • MarkDuplicates, CollectAlignmentSummaryMetrics, CreateSequenceDictionary, SamToFastq, MeanQualityByCycle
    • Pysam
        • Python interface to samtools API
        • http://code.google.com/p/pysam/


    Variant calling

    • Call SNPs with genotypes (heterozygous and homozygous), indels and structural variants
    • Tools
        • samtools, bcftools
        • GATK, SOAPsnp, Dindel
        • SVMerge
    • File formats:
        • VCF, pileup
    • Filters and calling protocols
        • depth, quality, strand bias, multiple samples
    • Indels harder to call accurately than SNPs
        • structural variation harder still


    Variant Call Format (VCF)

    • Stores polymorphism data with annotation
        • SNPs, insertions, deletions and structural variants
    • Can be indexed for fast data retrieval
    • Variant calls across many samples
    • Metadata
        • e.g. dbSNP accession, filter status, validation status
    • Arbitrary tags can be used to describe new types of variant
    • Note: binary BCF produced by samtools
        • get vcf with samtools mpileup | bcftools view


    VCF Format

    • Header
        • Arbitrary number of INFO definition lines starting with ‘##’
        • Column definition line starts with single ‘#’
    • Mandatory columns
        • Chromosome (CHROM)
        • Position of the start of the variant (POS)
        • Unique identifiers of the variant (ID)
        • Reference allele (REF)
        • Comma separated list of alternate non-reference alleles (ALT)
        • Phred-scaled quality score (QUAL)
        • Site filtering information (FILTER)
        • User extensible annotation (INFO)
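The eight mandatory columns can be pulled out of a data line with a short parser. This is a sketch under simplifying assumptions: the function names and the demo record are hypothetical, INFO values are left as strings, and flag-style INFO keys without a value are stored as `True`.

```python
def parse_info(info):
    """Turn 'KEY=VALUE;FLAG' into a dict; keys without a value map to True."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result

def parse_vcf_line(line):
    """Parse the eight mandatory columns of one tab-separated VCF data line."""
    chrom, pos, vid, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": vid,
        "REF": ref,
        "ALT": alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": filt,
        "INFO": parse_info(info),
    }

# Hypothetical record, for illustration only.
variant = parse_vcf_line("1\t12345\trs0001\tA\tC,T\t50\tPASS\tDP=20;DB")
print(variant)
```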


    VCF format

    • Example

    CHROM    3
    POS      74393
    ID       .
    REF      G
    ALT      T
    QUAL     999
    FILTER   .
    INFO     DP=31;AF1=0.7002;AC1=4;DP4=4,0,22,2;…
    FORMAT   GT:PL:DP:GQ
    sample1  1/1:181,57,0:19:57
    sample2  1/1:90,15,0:5:16
    sample3  0/0:0,12,85:4:7

    see H. Li, Bioinformatics 27(21): 2987–2993 (2011) for details of likelihood and population genetic calculations
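The per-sample columns are read by pairing the FORMAT keys with each sample's colon-separated values. The sketch below uses the three sample strings from the example; the helper names are made up, and the genotype ordering 0/0, 0/1, 1/1 for PL assumes a site with a single alternate allele.

```python
GENOTYPES = ["0/0", "0/1", "1/1"]  # PL order at a biallelic site

def parse_sample(format_string, sample_string):
    """Pair FORMAT keys with one sample's values."""
    return dict(zip(format_string.split(":"), sample_string.split(":")))

def most_likely_genotype(pl_string):
    """PL is Phred-scaled, so the smallest value marks the most likely genotype."""
    pls = [int(v) for v in pl_string.split(",")]
    return GENOTYPES[pls.index(min(pls))]

FORMAT = "GT:PL:DP:GQ"
samples = ["1/1:181,57,0:19:57", "1/1:90,15,0:5:16", "0/0:0,12,85:4:7"]

parsed = [parse_sample(FORMAT, s) for s in samples]
calls = [most_likely_genotype(p["PL"]) for p in parsed]
for fields, call in zip(parsed, calls):
    print(fields["GT"], fields["PL"], "depth", fields["DP"], "->", call)
```

In each of the three samples the genotype with PL = 0 agrees with the reported GT field.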


    More information

    • SNP calling and genotyping
        • Samtools
            • http://bioinformatics.oxfordjournals.org/content/25/16/2078.long
            • http://samtools.sourceforge.net
        • GATK
            • http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
    • VCF
        • VCFtools
            • http://vcftools.sourceforge.net
            • Danecek et al. Bioinformatics 27(15): 2156-2158 (2011)
        • http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41



    Structural variation

    • Read depth and pairing information used to detect events
        • deviations from the expected fragment size
        • presence/absence of mate pairs
        • excessive/reduced read depth (CNV)
    • Several methods/tools released
    • SVMerge pipeline
        • makes SV predictions using a collection of callers
        • Input is one BAM file per sample
        • callers run individually & outputs converted into standard BED format
        • calls merged
        • computationally validated using local de novo assembly
        • http://svmerge.sourceforge.net/
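The first signal in the list, deviation from the expected fragment size, can be illustrated with a toy outlier test. This is not the method used by SVMerge or any particular caller; the median/MAD threshold, the `flag_discordant` name and the demo values are assumptions chosen for the example.

```python
import statistics

def flag_discordant(insert_sizes, n_mads=5.0):
    """Return indexes of pairs whose insert size is far from the library median.

    'Far' means more than n_mads median absolute deviations (MADs) away,
    with the MAD floored at 1 so a perfectly uniform library still works.
    """
    median = statistics.median(insert_sizes)
    mad = statistics.median([abs(size - median) for size in insert_sizes])
    threshold = n_mads * max(mad, 1.0)
    return [i for i, size in enumerate(insert_sizes) if abs(size - median) > threshold]

# Hypothetical insert sizes; the 2500 bp pair could indicate a deletion.
demo_sizes = [300, 310, 295, 305, 290, 2500, 302]
print(flag_discordant(demo_sizes))
```

A single discordant pair is weak evidence; callers typically require a cluster of such pairs pointing at the same breakpoints.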


    Assembly

    • Tools
        • Abyss
            • http://www.bcgsc.ca/platform/bioinfo/software/abyss
        • SGA
            • https://github.com/jts/sga
        • SOAPdenovo
            • http://soap.genomics.org.cn/soapdenovo.html
        • ALLPATHS-LG
            • http://www.broadinstitute.org/software/allpaths-lg/blog
        • Cortex
            • http://cortexassembler.sourceforge.net/
        • Velvet
            • http://www.ebi.ac.uk/~zerbino/velvet/


    Assembly metrics

    • N50, N10, N90 etc
        • x% of assembly is in fragments larger than Nx
    • Number of contigs, mean/max contig length
    • Realignment
        • fraction of read pairs mapped correctly
        • correct homozygous SNPs
        • identify breakpoints
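The Nx statistics can be computed by walking the contigs from longest to shortest. The sketch below uses the common convention that contigs of length Nx or longer cover at least x% of the assembly; the `nx` name and the contig lengths are hypothetical.

```python
def nx(lengths, x=50):
    """Return Nx: the contig length at which the running total, taken from the
    longest contig downwards, first reaches x% of the assembly size."""
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    raise ValueError("no contig lengths given")

# Hypothetical contig lengths (total 290).
contigs = [80, 70, 50, 40, 30, 20]
print(nx(contigs, 10), nx(contigs, 50), nx(contigs, 90))
```

Since N10 >= N50 >= N90 by construction, a large gap between them signals an assembly dominated by a few long contigs.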


    Thanks to Thomas Keane and the Vertebrate Resequencing team at WTSI for several slides