dealing with the output of a next generation experiment

Dealing with the output of a next

generation experiment.

Mike Cornell

NGRL Manchester

[email protected]

www.ngrl.org.uk/Manchester

Aims

Discuss the file types encountered in NGS,

Discuss how these files can be manipulated.

Highlight issues for clinical NGS

NGS in Manchester

Currently there are three SOLiD sequencers in Manchester:

- Patterson Institute

- Faculty of Life Sciences

- St Mary’s Hospital

SOLiD Bioinformatics Group covering all three institutes.

Most work on ChIP-seq at present.

Started targeted exome sequencing at St Mary’s

NGS file types

Four types of file reflecting different stages in

experiment.

Individual sequences - FASTQ

Alignment to reference sequences – SAM/BAM

Pileup of alignments – Pileup File

Variations from reference sequence – GFF file

Current technology

Current technology

Lots of information in the trace file.

Not just sequence, there’s also information about sequence quality.

Experienced scientist can use trace to revise sequence – reclassifying N’s or amending GGGGG to GGGGGG.

Converted to FASTA

>Identifier | Some other stuff here if you wantAAATTTTTGGGAATTCCCCATCGAGGTACTGATCCTAGTANTACTAGGNTTAAATTGGAATNGGTACTAGTANCCTCCTAGTNAATA

Quality information is not kept in the FASTA file

NGS files contain sequence quality information

FASTQ

Represents the output sequence.

Extension of FASTA developed by Sanger Institute to store quality data for each nucleotide in a sequence

Different versions of FASTQ exist (e.g. Solexa FASTQ,

CSFASTQ)

Compatability issues between different versions

Use different ASCI offsets for the PHRED scores.

ABI’s Color Space format CSFASTQ - uses 0-3 to encode

colour calls


FASTQ Example

@EAS54_6_R1_2_1_413_324

CCCTTCTTGTCTTCAGCGTTTCTCC

+

;;3;;;;;;;;;;;;7;;;;;;;88

@EAS54_6_R1_2_1_540_792

TTGGCAGGCCAAGGCCGATGGATCA

+

;;;;;;;;;;;7;;;;;-;;;3;83

@EAS54_6_R1_2_1_443_348

GTTGCTTCTGGCGTGGGTGGGGGGG

+EAS54_6_R1_2_1_443_348

@;;;;;;;@;;9;7;;.7;393333


@title and optional description

quality line(s) – can be wrapped –

quality stored as ASCII with 33

offset

+optional repeat of title line

sequence line(s) – can be wrapped

Four types of line in a FASTQ entry

Initial QC

The quality scores for each base can be used to

analyse the quality of the run.


Alignment to reference sequence

Bowtie (http://bowtie.cbcb.umd.edu ) algorithm for a aligning short reads

Uses a Burrows–Wheeler transform to compress data.

Small memory requirements – can run on a desktop machine with 2 GB of RAM.

Very fast - 35 times faster than Maq and 300 times faster than SOAP.

If the best match is an inexact one then Bowtie is not guaranteed in all cases to find the highest quality alignment (Langmead et al., 2009).

http://bowtie.cbcb.umd.edu/

Alignment Files

Sequence Alignment/Map (SAM) format

Stores read alignments against reference

sequences.

Used for release of 1000 Genomes Project

alignments

Supports short and long reads (up to 128Mbp)

produced by different sequencing platforms.


SAM example

File consists of two sections:

1. Header starts with ‘@’

2. Alignment section, containing 11 mandatory fields

Bitwise Flag

The SAM flag is a bitwise field - a single number containing several pieces of information.

Store true/false answers in binary

1101 = Q1 true, Q2 false, Q3 true, Q4 true.

1101 = 8 + 4 + 1 = 9

Bitwise Flag

Bit Description

0 the read is paired in sequencing, no matter whether it is mapped in a pair

1

1 the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment)

2

2 query sequence itself is unmapped 4 3 the mate is unmapped 8 4 strand of the query (0 for forward; 1 for reverse

strand) 16

5 strand of the mate 32 6 the read is the first read in a pair 64 7 the read is the second read in a pair 128 8 the alignment is not primary (a read having split

hits may have multiple primary alignment records) 256

9 the read fails platform/vendor quality checks 512 10 the read is either a PCR duplicate or an optical

duplicate 1024

163 = 0001010011 = 1+2+32+128

TRUE

TRUE

TRUE

TRUE

Extended CIGAR

A CIGAR string is comprised of a series of operation lengths plus the operations.

Extended CIGAR format allows for the following operations:

– M = match/mismatch

– I = insertion

– D = deletion

– N = skipped region from the reference

– S = soft clip on the read (clipped sequence present in seq)

– H = hard clip on the read (clipped sequence NOT present in seq)

– P = padding (silent deletion from the padded reference sequence)

CIGAR example

S is used to describe softly clipped alignment

(where subsequences at the ends are ignored)

3S8M1D6M4S = 3 soft clipped, 8 match/missmatch,

1 deletion, 6 match/missmatch, 4 soft clipped

Quality scores

Mapping quality range 0 – 28 -1

Query quality in ASCII -33

Pileup format

Format consists of:reference (chromosome)coordinatereference basecoverage5th column

dot = match on forward strandcomma = identical match on reverse strandAGTC = mismatch on forward strandagtc = mismatch on reverse strand

Describing SNPs

General Feature Format (GFF3) - a format for

describing genes and other features associated

with DNA, RNA and Protein sequences.

chrID | source of result | feature type | start | end

16 | SoapSNP | SNP | 7475 | 7475 |

File Manipulation

At Manchester they are using BioScope software framework.

Also:

Partek (http://www.partek.com)

Galaxy (http://usegalaxy.org ) data analysis platform.

SeattleSeq – SNP annotation

For Illumina there is the CASAVA genome studio pipeline

Leeds - writing software in Perl, Python and R, but some also in .Net framework.

Summary

Lots of new file formats to understand.

More information contained in files.

Easier to share data because quality data is stored.

The content of files will depend on bioinformatics tools used – there may be more than one alignment output possible.

For clinical NGS we need bioinformatics skills to evaluate software and develop SOPs.

This is a fast-moving field.

NGS – what’s next

NGS Platforms

1. illumina (Avantome, GAIIx, HisSeq2000)2. Roche (454 Genome Sequencer FLX, GS Junior) 3. Life Technologies (SOLiD, Single Molecule Sequencing) 4. Helicos (True Single Molecule Sequencing) 5. Pacific Biosciences (Real Time Single Molecule Sequencing) 6. Ion Torrent (Post-Light Sequencing™ technology) 7. Oxford nanopore (Single Molecule Sequencing) 8. IBM nanopore sequencing (Single Molecule Sequencing) 9. Nabsys (nanopore + electronic sequencing) 10. ZS Genetics 11. Visigen 12. Halcyon Molecular 13. Complete Genomics 14. Mobious (Nexus I) 15. Polonator (MPS by ligation) 16. Cracker (SMRT on a chip) 17. Kavli Institute of Nanoscience; Graphene Nanopores 18. IBM/Roche – nanopore based sequencing

NGS software

Lots of software already available

Sunday 5 September 2010 SEQwiki contains

pages for 304 bioinformatics applications.

How do we share knowledge about using

bioinformatics tools for clinical NGS.

Possible changes

The available NGS platforms may be very different in 5 years.

NGS will be cheaper and faster

8th June 2010 - GnuBio Proposes to Sequence Human Genome for $30 at 30 x coverage in 10 hours

Sequence reads will be (much) longer (SMRT 10,361 bases at AGBT 2010)

Current bioinformatics problems (e.g. aligning short reads) may disappear.

WGS will be feasible.

Data sharing will be more beneficial.

Sharing Data

Can we use a similar model to the European Genome-phenome archive (EGA)?

Ontologies to describe patient phenotypes e.g. HPO and Orphanet ontology.

GENPHEN developing software for data sharing and integration.

Incorporating LIMS systems.

dealing with the output of a next generation experiment

Documents