Open access toolkit for nonparametric explorative pattern mining to detect events relating to disease in large scale genome sequences Thahir P. Mohamed, Asia D. Mitchell and Madhavi Ganapathiraju Department of Biomedical Informatics University of Pittsburgh School of Medicine Pittsburgh PA USA Advancing Practice , Innovation, and Instruction through Informatics October 20, 2008


Page 1: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Open access toolkit for nonparametric explorative pattern mining to detect events relating to disease in large scale genome sequences

Thahir P. Mohamed, Asia D. Mitchell and Madhavi Ganapathiraju
Department of Biomedical Informatics
University of Pittsburgh School of Medicine
Pittsburgh PA USA

Advancing Practice, Innovation, and Instruction through Informatics
October 20, 2008

Page 2: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

The Genome Sequence

3 billion nucleotides

20 to 25 thousand genes

Two-thirds of the genome is made of repetitive elements (2 billion nucleotides)

ATGGCACTGAGCTCCCAGATCTGGGCCGCTTGCCTCCTGCTCCTCCTCCTCCTCGCCAGCCTGACCAGTGGCTCTGTTTTCCCACAACAGGTGAGAGCCCAGTGGCCTGGGTCCTTAGCAGGGCAGCAGGGATGGGAGAGCCAGGCCTCAGCCTAGGGCACTGGAGACACCCGAGCACTGAGCAGAGCTCAGGACGTCTCAGGAGTACTGGCAGCTGAACAGGAACCAGGACAGGCACGGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGTTGAGGCAGGCAGCCCACTTGAGGTCAGTTTGAGACCAGCCTGGCCAACATGGTAAAACCCCGTCTCTACTAAAAATACAAAAGTTAGCCAGGCTTGGTGGCAGGTGCCTGTAATCCCAGCTACTCGGGAGACTGAGGCAGGAGAATTGCTTGAACCCGCAAGGTGGAGGTTGCACAGTGAGCTGAGATTGCACCACTGCACTCCAGCCTGGCAACAGAGCAAGACTCCATCTCCAAAAAAGAACAGAAATCAATGAAGCACCGAGTGACAGGGACTGGAAGGTCCTAATTCCATGGGTATTTACGGAACCCCTACGCCGTGTGGAGTCTTATTCTAGACAGTGGGGACGAGGCCATGAACAAGGTAGATGAGAGAGGAGATTTCTCCATCCTGGTCAGGGAATTTGTTAAAGACTGATGAAAACATGAATAAATAATTGTGTCTAGTACATTCTATTCGTGAATCTCATAACAGACAGTGGTAGAGTGACCGTGACCCATTCGCCACACAGTAGAGTCACTTTTTTGGTTTGTTTTTTAGAGACAGGGTCTTCCTCTGTTGCTGAGGCTGGAGTGCAGTGGTGCAGTCATAGTTCACTGCAGCCTCAACCTCCTGTGCTCAAGCAATCCTCCCACCTCAGCGTCCCAAGTAGCTGGGACAGCAGGCACATGCCACGGGTTGGGGGACCACAGGCATGGTCAAGGGGCTGGCAGTCAAGCAAGTG

The human genome contains…

Page 3: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Genomic Patterns

Short Tandem Repeats (STRs)
1 to 6 nucleotides repeated in tandem

Variable Number Tandem Repeats (VNTRs)
Same as short tandem repeats
Number of repeats variable across individuals

CpG Islands
A sequence of > 500 nucleotides
C+G content of > 55%
High frequency of CG dinucleotides

…CGCGCCGGACGTTACGCGCGCCGCGAAACGCGCGCCGGACGGCGCCGCAAACGGCCGCGCGTAC…
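The CpG island criteria listed on this slide can be turned into a simple window scan. The following is a minimal sketch, not the toolkit's implementation: the function names, the fixed 500-nucleotide window, the step size, and the observed/expected CG ratio threshold of 0.6 are all assumptions chosen for illustration (the slide only says "high frequency of CG dinucleotides").

```python
def cpg_stats(window):
    """Return (gc_fraction, observed/expected CG ratio) for one window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cg = window.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    expected = (c * g) / n if n else 0.0  # CG count expected by chance
    ratio = cg / expected if expected else 0.0
    return gc_fraction, ratio


def find_cpg_windows(seq, size=500, step=100, min_gc=0.55, min_ratio=0.6):
    """Scan fixed-size windows; report those passing both thresholds.

    size, step and min_ratio are illustrative assumptions, not values
    taken from the slides (only the 55% C+G threshold is).
    """
    hits = []
    for start in range(0, len(seq) - size + 1, step):
        gc, ratio = cpg_stats(seq[start:start + size])
        if gc > min_gc and ratio > min_ratio:
            hits.append((start, gc, ratio))
    return hits
```

Overlapping hits would normally be merged into longer islands; that step is left out to keep the sketch short.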

Page 4: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Genomic Patterns

Palindromes
A sequence that is like a normal palindrome (mom, racecar, …)
One half is a complement of the other in reverse order.

ALU Elements (~300 bp)
Retrotransposon of ~300 nucleotides with
High G+C content
Recognition site for Alu endonuclease
Segment high in A content
A poly A tail

LINE-1 Elements (>1,000 bp)
Retrotransposon of >1,000 nucleotides
High A+T content
Poly A tail
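The palindrome definition on this slide (one half is the complement of the other in reverse order) means that a DNA palindrome equals its own reverse complement. A minimal sketch of that check follows; the function names and the brute-force scan are illustrative assumptions, and it finds exact palindromes only, with no gaps or mismatches.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_dna_palindrome(seq):
    """True if the sequence reads the same as its reverse complement."""
    seq = seq.upper()
    return len(seq) > 0 and seq == reverse_complement(seq)


def find_palindromes(seq, min_len=6, max_len=12):
    """Brute-force scan for exact palindromes of even length.

    Odd lengths are skipped: the middle base would have to equal its
    own complement, which is impossible.
    """
    seq = seq.upper()
    hits = []
    for length in range(min_len, max_len + 1, 2):
        for start in range(len(seq) - length + 1):
            piece = seq[start:start + length]
            if piece == reverse_complement(piece):
                hits.append((start, piece))
    return hits
```

For example, `GAATTC` is a palindrome in this sense: its complement is `CTTAAG`, which reversed gives `GAATTC` again.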

Page 5: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Disease Relevance

Page 6: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Challenges in Pattern Mining

Computational tools for pattern mining must be…

Scalable
Genomes are large: 3 billion nucleotides
Genes are small: 3 thousand nucleotides
Genomes of different organisms vary greatly in size

Flexible
Types of patterns differ
There are variations within a single type of pattern
Flexibility in resolution of analysis

Nonparametric
New and unknown patterns
Explorative analysis

Currently, there are no tools that are scalable, flexible, and nonparametric for genomic pattern mining

Page 7: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Pattern Mining Toolkit

The applications layer contains programs that use features computed by the tools layer, and also by the preprocessing layer, to compute specific, commonly known patterns such as short tandem repeats, DNA palindromes, and short and long interspersed nuclear elements.

Page 8: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Foundation Layer

Efficient Preprocessing of Genome Sequence

Data Preprocessing:
Suffix array computation
Longest common prefix array computation

In the sorted suffix array, repetitive patterns appear next to each other
This allows for efficient computation of patterns
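To illustrate why suffix arrays and longest common prefix (LCP) arrays make repeats easy to find, here is a small sketch. It is not the toolkit's code: the function names are hypothetical, the suffix array is built by naive sorting (far too slow and memory-hungry for a genome, where a linear-time construction would be needed), and the LCP array uses Kasai's algorithm.

```python
def suffix_array(text):
    """Start positions of all suffixes in sorted order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Kasai's algorithm: lcp[i] is the length of the common prefix of the
    suffixes at sa[i-1] and sa[i]; lcp[0] is 0 by convention."""
    n = len(text)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix sorted immediately before suffix i
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def longest_repeat(text):
    """Longest substring occurring at least twice: the largest LCP value."""
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    if not lcp:
        return ""
    best = max(range(len(lcp)), key=lcp.__getitem__)
    return text[sa[best]:sa[best] + lcp[best]]
```

Because sorting places suffixes that share a prefix side by side, every repeated substring shows up as a run of adjacent suffix array entries with a high LCP value, so repeats can be read off in a single pass over the LCP array.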

Page 9: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Tools Layer

Find Ngram Counts
Ngram = CG

Window  Count
1       76
2       108
3       90
4       106
5       185
6       0

Compare Ngram Counts
Ngram = GCC

Window  Chrom A  Chrom B
1       42       100
2       98       165
3       63       79
4       72       60
5       25       151

Locate Specific Patterns

TTAAAAAAAA-TTTTTTAAAA            10  251555
TAAAAAAC-GTTTTTAA                8   276649
CAAAAAAG-CTTTTTAG                8   312629
TCTCTACTAAAAAT-ATTTTTAAAAAAAA    14  364179
TGAAAAACA-TGTTTTAAA              9   449648
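The windowed ngram counts shown on this slide could be produced along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, windows are non-overlapping and fixed-size, and occurrences that straddle a window boundary are not counted. The toolkit itself presumably works from the preprocessed suffix array rather than rescanning the sequence.

```python
def count_ngram(seq, ngram):
    """Count occurrences of ngram in seq, allowing overlaps."""
    seq, ngram = seq.upper(), ngram.upper()
    k = len(ngram)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == ngram)


def window_counts(seq, ngram, window):
    """Ngram count in each consecutive, non-overlapping window."""
    return [count_ngram(seq[start:start + window], ngram)
            for start in range(0, len(seq), window)]


def compare_counts(seq_a, seq_b, ngram, window):
    """Rows of (window number, count in A, count in B), numbered from 1."""
    a = window_counts(seq_a, ngram, window)
    b = window_counts(seq_b, ngram, window)
    return [(i + 1, ca, cb) for i, (ca, cb) in enumerate(zip(a, b))]
```

`compare_counts` stops at the shorter of the two sequences, which is a design choice made here for simplicity.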

Page 10: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Tools Layer

Large Repeats
Find RegEx

23  17   29441   CAGATTTGAAACACTCTTTTTGT
24  93   4161    ATATCTTCGTATAAAAACAAGACA
25  123  292054  TTTTCAGAAACTGCTTTGTGATGTG
31  255  3983    GAAACGGGATTTCTTTATATTATGCTAGACA

Find Perplexity

Page 11: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Explorative pattern analysis in chromosome 19

[Figure; scale label: 5 MB]

Page 12: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Explorative pattern analysis in chromosome 19

[Figure; scale labels: 5 MB, 250 KB]

Page 13: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Explorative pattern analysis in chromosome 19

[Figure; scale labels: 5 MB, 250 KB, 10 KB]

Page 14: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Explorative pattern analysis in chromosome 19

[Figure; scale labels: 5 MB, 250 KB, 10 KB, 1 KB]

Page 15: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Feature analysis of the centromere of the X chromosome

Perplexity drops near the centromere, a highly repetitive region that contains ngrams unique to it.
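The slides do not say how perplexity is computed, so the following is only one plausible reading: treat the ngrams inside a window as samples from a distribution and report 2 raised to its Shannon entropy. Under that assumption a repetitive window, which uses few distinct ngrams, scores low. The function name and the default ngram length are hypothetical, and the toolkit may well use a different model (for example, one estimated over the whole chromosome).

```python
import math
from collections import Counter


def ngram_perplexity(window, k=3):
    """Perplexity (2 ** entropy) of the empirical ngram distribution
    inside one window. An assumed definition, not the toolkit's."""
    window = window.upper()
    ngrams = [window[i:i + k] for i in range(len(window) - k + 1)]
    if not ngrams:
        return 0.0
    total = len(ngrams)
    counts = Counter(ngrams)
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values())
    return 2 ** entropy
```

With this definition the value ranges from 1 (a single ngram repeated throughout the window) up to the number of distinct ngrams in the window when they are all equally frequent.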

Page 16: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Pattern landscape of chromosome 19

Duplication events

Page 17: Thahir P. Mohamed,  Asia D. Mitchell  and Madhavi Ganapathiraju

Acknowledgements

Madhavi Ganapathiraju
Thahir Mohamed

Kamiya Mopwani

Thank you! Visit us at

Department of Biomedical Informatics University of Pittsburgh

Cathedral of Learning, University of Pittsburgh

www.dbmi.pitt.edu/madhavi