NGS data format and General Quality Control

Upload: hugh-gaines

Post on 23-Dec-2015


TRANSCRIPT

Page 1: NGS data format and General Quality Control. Data format “Flowchart” Sequencer raw data FastqSAM/BAM

NGS data format and General Quality Control


Data format “Flowchart”

Sequencer raw data → Fastq → SAM/BAM


Fastq file

• Used to record raw reads coming off the sequencers

• Each record contains four lines

• Parameters such as read length are usually set by the sequencer


Fastq file


• Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).

• Line 2 is the raw sequence letters. The read length is the length of the string.

• Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.

• Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

http://en.wikipedia.org/wiki/FASTQ_format
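The four-line layout described above can be read with a few lines of Python. This is a minimal sketch, assuming every record occupies exactly four lines (no wrapped sequences); the names `FastqRecord` and `read_fastq` and the example read are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # Line 1 without the leading '@'
    sequence: str    # Line 2, the raw sequence letters
    quality: str     # Line 4, one symbol per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream (assumed layout)."""
    while True:
        header = handle.readline()
        if not header:           # end of file
            return
        header = header.rstrip("\n")
        if not header:           # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"Line 1 must start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"Line 3 must start with '+': {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError("Line 4 must be as long as Line 2")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical record, used only to exercise the parser.
example = "@read1 demo\nGATTACA\n+\nIIIIIII\n"
records = list(read_fastq(io.StringIO(example)))
```

Here the read length is simply `len(records[0].sequence)`, as stated for Line 2.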


General quality control of raw reads

• Using FastQC

– A tool that implements some general rules
– Basic Statistics
– Per base sequence quality
– Per sequence quality scores
– Per base sequence content
– Per base GC content
– Per sequence GC content
– Per base N content
– Sequence Length Distribution
– Sequence Duplication Levels
– Overrepresented sequences
– Kmer Content
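Two of the checks listed above, per sequence GC content and the sequence length distribution, are simple enough to sketch by hand. The Python below only illustrates what those metrics measure and is not FastQC's own implementation; the function names `gc_content` and `length_distribution` are made up for this example.

```python
from collections import Counter
from typing import Iterable


def gc_content(sequence: str) -> float:
    """Percentage of G and C bases in one read (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return 100.0 * gc / len(sequence)


def length_distribution(sequences: Iterable[str]) -> Counter:
    """Map each read length to the number of reads with that length."""
    return Counter(len(seq) for seq in sequences)


# Hypothetical reads, for illustration only.
reads = ["ACGT", "AC", "GG"]
gc_values = [gc_content(r) for r in reads]
lengths = length_distribution(reads)
```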


Quality scores
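Each symbol on Line 4 of a FASTQ record stands for a Phred quality score Q, where the estimated probability that the base call is wrong is 10^(-Q/10). The sketch below decodes a quality string under the assumption of a Phred+33 ASCII offset (files from some older pipelines use an offset of 64); the function names are illustrative.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to integer Phred scores.

    The default offset of 33 is an assumption; pass offset=64 for
    files that use the older Phred+64 encoding.
    """
    return [ord(symbol) - offset for symbol in quality]


def error_probability(q: int) -> float:
    """Estimated probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)


scores = phred_scores("I!5")  # 'I' -> 40, '!' -> 0, '5' -> 20
```

A score of 20 therefore corresponds to an estimated error rate of 1 in 100, and a score of 30 to 1 in 1000.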


Per base “N” percentage
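The per base N percentage is the share of reads that have an uncalled base ("N") at each read position. A minimal sketch of that calculation follows; it assumes reads may differ in length, so each position is divided by the number of reads that reach it. The function name `per_base_n_percentage` is hypothetical, and this is not FastQC's code.

```python
from typing import List, Sequence


def per_base_n_percentage(reads: Sequence[str]) -> List[float]:
    """Percentage of 'N' calls at each read position (0-based index)."""
    max_len = max((len(read) for read in reads), default=0)
    n_counts = [0] * max_len
    totals = [0] * max_len
    for read in reads:
        for i, base in enumerate(read):
            totals[i] += 1            # reads long enough to cover position i
            if base in "Nn":
                n_counts[i] += 1      # uncalled base at position i
    return [100.0 * n / t for n, t in zip(n_counts, totals)]


# Hypothetical reads, for illustration only.
percentages = per_base_n_percentage(["ACGN", "ANGT", "AC"])
```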


Data format “Flowchart”

Sequencer → Fastq → SAM/BAM


SAM/BAM

• SAM stands for Sequence Alignment/Map

• BAM is the binary form of SAM

• Used for mapped/aligned reads

• Generated by NGS mappers/aligners


SAM
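A SAM file is tab-delimited text: header lines start with '@', and each alignment line has 11 mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) followed by optional tags. The sketch below splits one alignment line into those fields; the `SamAlignment` type, the helper names, and the example line are made up for illustration, and production code would normally rely on an existing SAM/BAM library instead.

```python
from typing import List, NamedTuple


class SamAlignment(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities, same encoding idea as FASTQ Line 4
    tags: List[str]  # optional TAG:TYPE:VALUE fields, left unparsed


def parse_sam_line(line: str) -> SamAlignment:
    """Parse one alignment line; header lines ('@...') are rejected."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment record")
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("an alignment line needs 11 mandatory fields")
    return SamAlignment(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                        f[5], f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


def is_unmapped(alignment: SamAlignment) -> bool:
    """FLAG bit 0x4 marks a read that the aligner could not map."""
    return bool(alignment.flag & 0x4)


# Hypothetical alignment line, for illustration only.
example = "read1\t0\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0"
aln = parse_sam_line(example)
```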


BAM