Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Post on 13-Dec-2015

Page 1: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Achim Tresch
Computational Biology

‘Omics’ - Analysis of high dimensional Data

Page 2: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Next Generation Sequencing

Today:
• Illumina NGS platform, FASTQ files
• Sequence bioinformatics:
– Hash tables
– Suffix arrays
– Burrows-Wheeler transform

Page 3: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Illumina

Slides from Kurt Strueber, Genome Center, MPIPZ Cologne

Page 4: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Short Read Applications

• Genotyping
– Goal: identify variations
• RNA-seq, ChIP-seq, Methyl-seq
– Goal: classify, measure significant peaks

[Figure: short reads stacked under the reference genome …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]

Pages 5-16: Illumina [figure-only slides]
Page 17: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Hash tables

Slides taken from Michael Main, University of Colorado

Page 18: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Hash tables

• The simplest kind of hash table is an array of records.
• This example has 701 records.

[Figure: an array of records with slots [0], [1], [2], [3], [4], [5], ..., [700]]

Page 19: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Hash tables

• Each record has a special field, called its key.
• In this example, the key is a long integer field called Number.

[Figure: the array [0] ... [700]; the record in slot [4] has Number 506643548]

Page 20: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Hash tables

• The number might be a person's identification number, and the rest of the record has information about the person.

[Figure: the array [0] ... [700]; the record in slot [4] has Number 506643548]

Page 21: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Hash tables

• When a hash table is in use, some spots contain valid records, and other spots are "empty".

[Figure: the array [0] ... [700] holding Number 506643548, Number 233667136, Number 281942902 and Number 155778322; the remaining slots are empty]

Page 22: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Inserting a new record

• In order to insert a new record, the key must somehow be converted to an array index.
• The index is called the hash value of the key.

In our case: the keys are short sequences, and the records contain their location in the genome.

[Figure: the array [0] ... [700] with its four records; the new record Number 580625685 waits to be inserted]

Page 23: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Inserting a new record

• Typical way to create a hash value: (Number mod 701)

What is (580625685 mod 701)?

[Figure: the array [0] ... [700] with its four records; the new record Number 580625685 waits to be inserted]

Page 24: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Inserting a new record

• Typical way to create a hash value: (Number mod 701)

What is (580625685 mod 701)? Answer: 3

[Figure: the array [0] ... [700] with its four records; the new record Number 580625685 waits to be inserted]
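The arithmetic on this slide can be checked directly. The following is a minimal sketch, assuming the table size of 701 from the earlier slides; the names TABLE_SIZE and hash_value are illustrative and do not come from the slides.

```python
TABLE_SIZE = 701  # number of slots in the example array


def hash_value(number: int) -> int:
    """Map an integer key to an array index by taking the remainder."""
    return number % TABLE_SIZE


print(hash_value(580625685))  # → 3
```

Any key is mapped into the range 0..700, so distinct keys can end up with the same index.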

Page 25: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Inserting a new record

• The hash value is used for the location of the new record.

[Figure: the new record Number 580625685 is directed to slot [3] of the array]

Page 26: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Inserting a new record

• The hash value is used for the location of the new record.

[Figure: the array now also holds Number 580625685]

Page 27: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Collisions

• Here is another new record to insert, with a hash value of 2.

[Figure: the new record Number 701466868 says "My hash value is [2]."]

Page 28: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Collisions

• This is called a collision, because there is already another valid record at [2].
• When a collision occurs, move forward until you find an empty spot.

[Figure: the new record Number 701466868 moves forward along the occupied slots of the array]


Page 31: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Collisions

• This is called a collision, because there is already another valid record at [2].
• The new record goes in the empty spot.

[Figure: Number 701466868 is stored in the first empty slot after Number 580625685]
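The collision rule from these slides (move forward until an empty spot is found) is commonly called linear probing. A minimal sketch using the record numbers from the slides follows; the function name insert and the wrap-around from slot [700] back to slot [0] are assumptions, since the slides only say "move forward".

```python
TABLE_SIZE = 701


def insert(table, number):
    """Store `number` at its hash slot, or at the next free slot after it."""
    index = number % TABLE_SIZE
    for _ in range(TABLE_SIZE):
        if table[index] is None:
            table[index] = number
            return index
        # Collision: move forward one slot (wrapping around is an assumption).
        index = (index + 1) % TABLE_SIZE
    raise RuntimeError("hash table is full")


table = [None] * TABLE_SIZE
for key in (506643548, 233667136, 281942902, 155778322, 580625685):
    insert(table, key)

# 701466868 hashes to 2, but slots 2, 3 and 4 are already taken.
slot = insert(table, 701466868)
print(slot)  # → 5
```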

Page 32: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

A Quiz

If the keys were short sequences, where would you place the sequence ATACCG? (NB: this is an ill-posed question)

[Figure: the array [0] ... [700] holding all six records]
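The question is ill-posed because a sequence is not a number: the slot depends on how the sequence is first turned into an integer. One possible choice, sketched here purely as an assumption, is to read the sequence as a base-4 number with A=0, C=1, G=2, T=3 and then apply the same mod 701 rule. The names BASE_CODE, kmer_to_int and kmer_hash are illustrative.

```python
TABLE_SIZE = 701
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def kmer_to_int(kmer):
    """Read a DNA string as a base-4 number, most significant base first."""
    value = 0
    for base in kmer:
        value = value * 4 + BASE_CODE[base]
    return value


def kmer_hash(kmer):
    """Hash value of a short sequence under the assumed encoding."""
    return kmer_to_int(kmer) % TABLE_SIZE


print(kmer_to_int("ATACCG"), kmer_hash("ATACCG"))  # → 790 89
```

A different encoding or table size would give a different slot, which is exactly why the quiz has no single answer.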

Page 33: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Searching for a Key

• The data that's attached to a key can be found fairly quickly.

[Figure: the array [0] ... [700] holding all six records; the key being searched for is Number 701466868]

Page 34: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Searching for a Key

• Calculate the hash value.
• Check that location of the array for the key.

[Figure: the key Number 701466868 says "My hash value is [2]."; the record in that slot answers "Not me."]

Page 35: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Searching for a Key

• Keep moving forward until you find the key, or you reach an empty spot.

[Figure: the key Number 701466868, with hash value [2], moves forward; the next record answers "Not me."]


Page 37: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Searching for a Key

• Keep moving forward until you find the key, or you reach an empty spot.

[Figure: the key Number 701466868, with hash value [2], reaches its record, which answers "Yes!"]

Page 38: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Searching for a Key

• When the item is found, the information can be copied to the necessary location.

[Figure: the key Number 701466868, with hash value [2], has found its record ("Yes!")]
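The search procedure from these slides can be sketched as follows. The table is filled directly with the slot layout that results from the insertions on the earlier slides (each record sits at Number mod 701, except Number 701466868, which was displaced from [2] to [5]). The function name search, the wrap-around at the end of the array and the key 123456789 used for the unsuccessful lookup are assumptions for illustration.

```python
TABLE_SIZE = 701


def search(table, number):
    """Return the slot holding `number`, or None if it is not in the table."""
    index = number % TABLE_SIZE
    for _ in range(TABLE_SIZE):
        if table[index] is None:      # reached an empty spot: key is absent
            return None
        if table[index] == number:    # "Yes!"
            return index
        index = (index + 1) % TABLE_SIZE  # "Not me.": keep moving forward
    return None


table = [None] * TABLE_SIZE
layout = {1: 281942902, 2: 233667136, 3: 580625685,
          4: 506643548, 5: 701466868, 700: 155778322}
for slot, number in layout.items():
    table[slot] = number

print(search(table, 701466868))  # → 5
print(search(table, 123456789))  # → None
```

The lookup for 701466868 inspects slots 2, 3, 4 and 5, mirroring the "Not me." steps on the slides.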

Page 39: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Summary

• Hash tables store a collection of records with keys.
• The location of a record depends on the hash value of the record's key.
• When a collision occurs, the next available location is used.
• Searching for a particular key is generally quick.

THE END

Page 40: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data
Page 41: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Suffix Arrays

• Suffix arrays were introduced by Manber and Myers in 1993
• More space efficient than suffix trees
• A suffix array for a string x of length m is an array of size m that specifies the lexicographic ordering of the suffixes of x.

Idea: Every substring is a prefix of a suffix

Page 42: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Suffix Arrays

Example of a suffix array for acaaacatat$

3 4 1 5 7 9 2 6 8 10 11

Each entry is the starting position of that suffix in the search string.
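The example array can be reproduced by sorting the suffix start positions. This is a minimal sketch of the naive approach (plain sorting of suffixes), not the Manber and Myers algorithm. It assumes that the terminator $ ranks above every letter, which is the convention that yields the array on the slide, and it reports 1-based positions as the slide does. The function name suffix_array is illustrative.

```python
def suffix_array(text, sentinel="$"):
    """Return 1-based start positions of the suffixes in sorted order."""
    def sort_key(start):
        # Assumption: the sentinel is ranked above every other character.
        return [(1, "") if ch == sentinel else (0, ch) for ch in text[start:]]

    order = sorted(range(len(text)), key=sort_key)
    return [start + 1 for start in order]  # 1-based, as on the slide


print(suffix_array("acaaacatat$"))
# → [3, 4, 1, 5, 7, 9, 2, 6, 8, 10, 11]
```

Other texts and tools rank $ below every letter instead, which moves the suffix "$" to the front and changes the order of suffixes such as at$ and atat$.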

Page 43: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Suffix Array Construction

• Naive in-place construction
– Similar to insertion sort
– Insert all the suffixes into the array one by one, making sure that the newly inserted suffix is in its correct place
– Running time complexity: O(m^2), where m is the length of the string
• Manber and Myers give an O(m log m) construction in their 1993 paper.

Page 44: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Suffix Array Construction

• O(n) space, where n is the size of the database string
• Space efficient. However, there’s an increase in query time
• Lookup query
– Binary search
– O(m log n) time; m is the size of the query
– Can reduce time to O(m + log n) using a more efficient implementation

Page 45: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Suffix Array Search

find(Pattern P in SuffixArray A of string S):
    lo = 0, hi = length(A)
    for 0 <= i < length(P):
        binary search for x, y with lo <= x <= y <= hi
            such that P[i] = S[A[j]+i] exactly for x <= j < y
        lo = x, hi = y
    return {A[lo], A[lo+1], ..., A[hi-1]}

Page 46: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Suffix Array Search

Search ‘is’ in mississippi$

Index  Position  Suffix
0      11        i$
1      8         ippi$
2      5         issippi$
3      2         ississippi$
4      1         mississippi$
5      10        pi$
6      9         ppi$
7      7         sippi$
8      4         sissippi$
9      6         ssippi$
10     3         ssissippi$
11     12        $

Examine the pattern letter by letter, reducing the range of occurrence each time.
- First letter i: occurs in indices from 0 to 3
- Second letter s: occurs in indices from 2 to 3

Done. Output: issippi$ and ississippi$
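The letter-by-letter narrowing can be sketched in Python as follows. This is a sketch under stated assumptions, not the slide author's code: the suffix array is built by plain sorting with Python's default character order, in which $ sorts before every letter, so the row indices are shifted by one relative to the table on the slide while the reported text positions are the same. The helper names build_suffix_array, char_at and find are illustrative.

```python
def build_suffix_array(text):
    """Naive construction: sort 0-based suffix start positions."""
    return sorted(range(len(text)), key=lambda start: text[start:])


def char_at(text, sa, row, offset):
    """Character `offset` of the suffix in `row`; "" past the end of text."""
    pos = sa[row] + offset
    return text[pos] if pos < len(text) else ""


def find(pattern, text, sa):
    """Return sorted 1-based positions where `pattern` occurs in `text`."""
    lo, hi = 0, len(sa)
    for offset, letter in enumerate(pattern):
        # Lower bound: first row in [lo, hi) whose character is >= letter.
        left, right = lo, hi
        while left < right:
            mid = (left + right) // 2
            if char_at(text, sa, mid, offset) < letter:
                left = mid + 1
            else:
                right = mid
        first = left
        # Upper bound: first row in [first, hi) whose character is > letter.
        right = hi
        while left < right:
            mid = (left + right) // 2
            if char_at(text, sa, mid, offset) <= letter:
                left = mid + 1
            else:
                right = mid
        lo, hi = first, left
    return sorted(sa[row] + 1 for row in range(lo, hi))


text = "mississippi$"
sa = build_suffix_array(text)
print(find("is", text, sa))  # → [2, 5]
```

Positions 2 and 5 are the starts of ississippi$ and issippi$, the two suffixes reported on the slide. The number of occurrences is simply the size of the final range.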

Page 47: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Summary

• It can be built very fast.
• It can answer queries very fast:
– How many times does ATG appear?
• Disadvantages:
– Can’t do approximate matching
– Hard to insert new stuff dynamically (need to rebuild the array)

Page 48: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Links

• http://pauillac.inria.fr/~quercia/documents-info/Luminy-98/albert/JAVA+html/SuffixTreeGrow.html
• http://home.in.tum.de/~maass/suffix.html
• http://homepage.usask.ca/~ctl271/857/suffix_tree.shtml
• http://homepage.usask.ca/~ctl271/810/approximate_matching.shtml
• http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/
• http://dogma.net/markn/articles/suffixt/suffixt.htm
• http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix/

Page 49: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data
Page 50: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Bowtie: A Highly Scalable Tool for Post-Genomic Datasets

(Slides by Ben Langmead)

Page 51: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Short Read Applications

• Genotyping
– Goal: identify variations
• RNA-seq, ChIP-seq, Methyl-seq
– Goal: classify, measure significant peaks

[Figure: short reads stacked under the reference genome …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]

Page 52: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Short Read Applications

Finding the alignments is typically the performance bottleneck

[Figure: short reads stacked under the reference genome …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]

Page 53: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Short Read Alignment

• Given a reference and a set of reads, report at least one “good” local alignment for each read if one exists
– Approximate answer to: where in genome did read originate?

• What is “good”? For now, we concentrate on:

– Fewer mismatches is better

  …TGATCATA…                   …TGATCATA…
    GATCAA       better than     GAGAAT

– Failing to align a low-quality base is better than failing to align a high-quality base

  …TGATATTA…                   …TGATcaTA…
    GATcaT       better than     GTACAT
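The two criteria can be made concrete with a small scoring sketch. It is not Bowtie's actual scoring code: the function names, the reference window GATCAT (taken from …TGATCATA… at the offset where the reads line up) and the Phred-like quality values are assumptions chosen for illustration, with low values standing in for the low-quality bases of the slide.

```python
def mismatches(read, window):
    """Number of positions where the read differs from the reference window."""
    return sum(a != b for a, b in zip(read, window))


def quality_penalty(read, window, qualities):
    """Sum of base qualities at mismatching positions (lower is better)."""
    return sum(q for a, b, q in zip(read, window, qualities) if a != b)


# Criterion 1: fewer mismatches is better.
print(mismatches("GATCAA", "GATCAT"))  # → 1
print(mismatches("GAGAAT", "GATCAT"))  # → 2

# Criterion 2: mismatches at low-quality bases cost less.
low_quality_miss = quality_penalty("GATCAT", "GATATT", [30, 30, 30, 5, 5, 30])
high_quality_miss = quality_penalty("GTACAT", "GATCAT", [30, 30, 30, 30, 30, 30])
print(low_quality_miss, high_quality_miss)  # → 10 60
```

Both reads in the second comparison have two mismatches, but the first one mismatches only at bases with low assumed quality and therefore receives the smaller penalty.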

Page 54: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Indexing

• Genomes and reads are too large for direct approaches like dynamic programming
• Indexing is required
• Choice of index is key to performance

[Figure: three index types: suffix tree, suffix array, seed hash tables (many variants, incl. spaced seeds)]

Page 55: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Indexing

• Genome indices can be big. For human:

[Figure: the three index types (suffix tree, suffix array, seed hash tables) labelled > 35 GBs, > 12 GBs and > 12 GBs]

• Large indices necessitate painful compromises
1. Require big-memory machine
2. Use secondary storage
3. Build new index each run
4. Subindex and do multiple passes

Page 56: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Burrows-Wheeler Transform

Text T: a c a a c g $

1. Rotate string one by one in each row
2. Sort suffixes lexicographically

Burrows-Wheeler Matrix:

$ a c a a c g
a a c g $ a c
a c a a c g $
a c g $ a c a
c a a c g $ a
c g $ a c a a
g $ a c a a c

Last column contains the characters preceding the characters in the first column.

BWT(T): g c $ a a a c
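The construction on this slide can be written down in a few lines. This is a minimal sketch that builds the full matrix of rotations, which is fine for illustration but far too memory-hungry for a genome. It assumes the text ends with the terminator $ and that $ sorts before every letter, which is Python's default ordering. The function name bwt is illustrative.

```python
def bwt(text):
    """Burrows-Wheeler transform via the sorted matrix of rotations."""
    assert text.endswith("$"), "the text must end with the terminator $"
    # One rotation per row, then sort the rows lexicographically.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # BWT(T) is the last column of the sorted matrix.
    return "".join(row[-1] for row in rotations)


print(bwt("acaacg$"))  # → gc$aaac
```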

Page 57: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Burrows-Wheeler Transform

• Reversible permutation used originally in compression
• Once BWT(T) is built, all else shown here is discarded
– Matrix will be shown for illustration only
• In long texts, BWT(T) contains more runs of repeated characters than the original text, which makes it easier to compress!

[Figure: text T, the Burrows-Wheeler Matrix and its last column BWT(T)]

Page 58: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Burrows-Wheeler Transform

• Property that makes BWT(T) reversible is “LF Mapping”
– The ith occurrence of a character in the Last column is the same text occurrence as the ith occurrence in the First column

[Figure: Burrows-Wheeler Matrix with BWT(T); the same character occurrence is marked "Rank: 2" in the first and in the last column]

Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA, Technical Report 124; 1994

Page 59: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Burrows-Wheeler Transform

• To recreate T from BWT(T), repeatedly apply rule: T ← BWT[ LF(i) ] + T; i = LF(i)
– Where LF(i) maps row i to row whose first character corresponds to i’s last per LF Mapping
• Could be called “unpermute” or “walk-left” algorithm

[Figure: successive steps of the walk, ending with the final T]
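The walk-left idea can be sketched as follows. It is one possible formulation under stated assumptions, not the code behind the slides: the text ends with a unique terminator $ that sorts before every letter, so row 0 of the matrix is the rotation starting with $, and LF(i) is computed from the rank of the character in the last column plus the first row at which that character appears in the first column. The function name inverse_bwt is illustrative.

```python
def inverse_bwt(last_column):
    """Rebuild T from BWT(T) by repeatedly following the LF mapping."""
    # Rank of each character occurrence within the last column.
    seen = {}
    ranks = []
    for ch in last_column:
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    # First row of the sorted first column at which each character appears.
    first_row = {}
    total = 0
    for ch in sorted(seen):
        first_row[ch] = total
        total += seen[ch]

    # Start in row 0 (the rotation beginning with $) and walk left.
    row = 0
    text = "$"
    while last_column[row] != "$":
        ch = last_column[row]
        text = ch + text                    # prepend the preceding character
        row = first_row[ch] + ranks[row]    # LF mapping
    return text


print(inverse_bwt("gc$aaac"))  # → acaacg$
```

Only the last column and two small tables are needed, which is why the matrix itself can be discarded once BWT(T) is built.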

Page 60: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

BWT in Bioinformatics

• Oligomer counting
– Healy J et al: Annotating large genomes with exact word matches. Genome Res 2003, 13(10):2306-2315.
• Whole-genome alignment
– Lippert RA: Space-efficient whole genome comparisons with Burrows-Wheeler transforms. J Comp Bio 2005, 12(4):407-415.
• Smith-Waterman alignment to large reference
– Lam TW et al: Compressed indexing and local alignment of DNA. Bioinformatics 2008, 24(6):791-797.

Page 61: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Comparison to Maq & SOAP

• PC: 2.4 GHz Intel Core 2, 2 GB RAM
• Server: 2.4 GHz AMD Opteron, 32 GB RAM
• Bowtie v0.9.6, Maq v0.6.6, SOAP v1.10
• SOAP not run on PC due to memory constraints
• Reads: FASTQ 8.84 M reads from 1000 Genomes (Acc: SRR001115)
• Reference: Human (NCBI 36.3, contigs)

Program               CPU time      Wall clock time  Reads per hour  Peak virtual memory footprint  Bowtie speedup  Reads aligned (%)
Bowtie -v 2 (server)  15m:07s       15m:41s          33.8 M          1,149 MB                       -               67.4
SOAP (server)         91h:57m:35s   91h:47m:46s      0.08 M          13,619 MB                      351x            67.3
Bowtie (PC)           16m:41s       17m:57s          29.5 M          1,353 MB                       -               71.9
Maq (PC)              17h:46m:35s   17h:53m:07s      0.49 M          804 MB                         59.8x           74.7
Bowtie (server)       17m:58s       18m:26s          28.8 M          1,353 MB                       -               71.9
Maq (server)          32h:56m:53s   32h:58m:39s      0.27 M          804 MB                         107x            74.7

• Bowtie delivers about 30 million alignments per CPU hour
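The speedup column appears to be the ratio of wall-clock times between each tool and the Bowtie run listed directly above it. The following sketch checks that reading of the table; the helper names to_seconds and speedup are illustrative, and the time strings are copied from the table.

```python
UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def to_seconds(stamp):
    """Convert a time such as '91h:47m:46s' or '15m:41s' to seconds."""
    return sum(int(part[:-1]) * UNIT_SECONDS[part[-1]]
               for part in stamp.split(":"))


def speedup(other_wall_clock, bowtie_wall_clock):
    """How many times faster Bowtie finished, by wall-clock time."""
    return to_seconds(other_wall_clock) / to_seconds(bowtie_wall_clock)


print(round(speedup("91h:47m:46s", "15m:41s")))     # SOAP (server) → 351
print(round(speedup("17h:53m:07s", "17m:57s"), 1))  # Maq (PC)      → 59.8
print(round(speedup("32h:58m:39s", "18m:26s")))     # Maq (server)  → 107
```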

Page 62: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

TopHat: Bowtie for RNA-seq

• TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads using Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
– Contact: Cole Trapnell ([email protected])
– http://tophat.cbcb.umd.edu

Page 63: Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

Acknowledgements

NGS Exercises were designed by
Nicolas Delhomme, EMBL Heidelberg
University of Umeå