dealing with the output of a next generation experiment
TRANSCRIPT
Dealing with the output of a next
generation experiment.
Mike Cornell
NGRL Manchester
www.ngrl.org.uk/Manchester
Aims
Discuss the file types encountered in NGS,
Discuss how these files can be manipulated.
Highlight issues for clinical NGS
NGS in Manchester
Currently there are three SOLiD sequencers in Manchester:
- Patterson Institute
- Faculty of Life Sciences
- St Mary’s Hospital
SOLiD Bioinformatics Group covering all three institutes.
Most work on ChIP-seq at present.
Started targeted exome sequencing at St Mary’s
NGS file types
Four types of file reflecting different stages in
experiment.
Individual sequences - FASTQ
Alignment to reference sequences – SAM/BAM
Pileup of alignments – Pileup File
Variations from reference sequence – GFF file
Current technology
Lots of information in the trace file.
Not just sequence, there’s also information about sequence quality.
Experienced scientist can use trace to revise sequence – reclassifying N’s or amending GGGGG to GGGGGG.
Converted to FASTA
>Identifier | Some other stuff here if you wantAAATTTTTGGGAATTCCCCATCGAGGTACTGATCCTAGTANTACTAGGNTTAAATTGGAATNGGTACTAGTANCCTCCTAGTNAATA
Quality information is not kept in the FASTA file
NGS files contain sequence quality information
FASTQ
Represents the output sequence.
Extension of FASTA developed by Sanger Institute to store quality data for each nucleotide in a sequence
Different versions of FASTQ exist (e.g. Solexa FASTQ,
CSFASTQ)
Compatability issues between different versions
Use different ASCI offsets for the PHRED scores.
ABI’s Color Space format CSFASTQ - uses 0-3 to encode
colour calls
www.ngrl.org.uk/Manchester
FASTQ Example
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
@;;;;;;;@;;9;7;;.7;393333
www.ngrl.org.uk/Manchester
@title and optional description
quality line(s) – can be wrapped –
quality stored as ASCII with 33
offset
+optional repeat of title line
sequence line(s) – can be wrapped
Four types of line in a FASTQ entry
Initial QC
The quality scores for each base can be used to
analyse the quality of the run.
www.ngrl.org.uk/Manchester
Alignment to reference sequence
Bowtie (http://bowtie.cbcb.umd.edu ) algorithm for a aligning short reads
Uses a Burrows–Wheeler transform to compress data.
Small memory requirements – can run on a desktop machine with 2 GB of RAM.
Very fast - 35 times faster than Maq and 300 times faster than SOAP.
If the best match is an inexact one then Bowtie is not guaranteed in all cases to find the highest quality alignment (Langmead et al., 2009).
Alignment Files
Sequence Alignment/Map (SAM) format
Stores read alignments against reference
sequences.
Used for release of 1000 Genomes Project
alignments
Supports short and long reads (up to 128Mbp)
produced by different sequencing platforms.
www.ngrl.org.uk/Manchester
SAM example
File consists of two sections:
1. Header starts with ‘@’
2. Alignment section, containing 11 mandatory fields
Bitwise Flag
The SAM flag is a bitwise field - a single number containing several pieces of information.
Store true/false answers in binary
1101 = Q1 true, Q2 false, Q3 true, Q4 true.
1101 = 8 + 4 + 1 = 9
Bitwise Flag
Bit Description
0 the read is paired in sequencing, no matter whether it is mapped in a pair
1
1 the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment)
2
2 query sequence itself is unmapped 4 3 the mate is unmapped 8 4 strand of the query (0 for forward; 1 for reverse
strand) 16
5 strand of the mate 32 6 the read is the first read in a pair 64 7 the read is the second read in a pair 128 8 the alignment is not primary (a read having split
hits may have multiple primary alignment records) 256
9 the read fails platform/vendor quality checks 512 10 the read is either a PCR duplicate or an optical
duplicate 1024
163 = 0001010011 = 1+2+32+128
TRUE
TRUE
TRUE
TRUE
Extended CIGAR
A CIGAR string is comprised of a series of operation lengths plus the operations.
Extended CIGAR format allows for the following operations:
– M = match/mismatch
– I = insertion
– D = deletion
– N = skipped region from the reference
– S = soft clip on the read (clipped sequence present in seq)
– H = hard clip on the read (clipped sequence NOT present in seq)
– P = padding (silent deletion from the padded reference sequence)
CIGAR example
S is used to describe softly clipped alignment
(where subsequences at the ends are ignored)
3S8M1D6M4S = 3 soft clipped, 8 match/missmatch,
1 deletion, 6 match/missmatch, 4 soft clipped
Pileup format
Format consists of:reference (chromosome)coordinatereference basecoverage5th column
dot = match on forward strandcomma = identical match on reverse strandAGTC = mismatch on forward strandagtc = mismatch on reverse strand
Describing SNPs
General Feature Format (GFF3) - a format for
describing genes and other features associated
with DNA, RNA and Protein sequences.
chrID | source of result | feature type | start | end
16 | SoapSNP | SNP | 7475 | 7475 |
File Manipulation
At Manchester they are using BioScope software framework.
Also:
Partek (http://www.partek.com)
Galaxy (http://usegalaxy.org ) data analysis platform.
SeattleSeq – SNP annotation
For Illumina there is the CASAVA genome studio pipeline
Leeds - writing software in Perl, Python and R, but some also in .Net framework.
Summary
Lots of new file formats to understand.
More information contained in files.
Easier to share data because quality data is stored.
The content of files will depend on bioinformatics tools used – there may be more than one alignment output possible.
For clinical NGS we need bioinformatics skills to evaluate software and develop SOPs.
This is a fast-moving field.
NGS Platforms
1. illumina (Avantome, GAIIx, HisSeq2000)2. Roche (454 Genome Sequencer FLX, GS Junior) 3. Life Technologies (SOLiD, Single Molecule Sequencing) 4. Helicos (True Single Molecule Sequencing) 5. Pacific Biosciences (Real Time Single Molecule Sequencing) 6. Ion Torrent (Post-Light Sequencing™ technology) 7. Oxford nanopore (Single Molecule Sequencing) 8. IBM nanopore sequencing (Single Molecule Sequencing) 9. Nabsys (nanopore + electronic sequencing) 10. ZS Genetics 11. Visigen 12. Halcyon Molecular 13. Complete Genomics 14. Mobious (Nexus I) 15. Polonator (MPS by ligation) 16. Cracker (SMRT on a chip) 17. Kavli Institute of Nanoscience; Graphene Nanopores 18. IBM/Roche – nanopore based sequencing
NGS software
Lots of software already available
Sunday 5 September 2010 SEQwiki contains
pages for 304 bioinformatics applications.
How do we share knowledge about using
bioinformatics tools for clinical NGS.
Possible changes
The available NGS platforms may be very different in 5 years.
NGS will be cheaper and faster
8th June 2010 - GnuBio Proposes to Sequence Human Genome for $30 at 30 x coverage in 10 hours
Sequence reads will be (much) longer (SMRT 10,361 bases at AGBT 2010)
Current bioinformatics problems (e.g. aligning short reads) may disappear.
WGS will be feasible.
Data sharing will be more beneficial.