Open access toolkit for nonparametric explorative pattern mining to detect events relating to disease in large scale genome sequences
Thahir P. Mohamed, Asia D. Mitchell and Madhavi Ganapathiraju
Department of Biomedical Informatics
University of Pittsburgh School of Medicine
Pittsburgh PA USA
Advancing Practice, Innovation, and Instruction through Informatics
October 20, 2008
The Genome Sequence
The human genome contains…
3 billion nucleotides
20 to 25 thousand genes
Two-thirds of the genome is made up of repetitive elements (2 billion nucleotides)
ATGGCACTGAGCTCCCAGATCTGGGCCGCTTGCCTCCTGCTCCTCCTCCTCCTCGCCAGCCTGACCAGTGGCTCTGTTTTCCCACAACAGGTGAGAGCCCAGTGGCCTGGGTCCTTAGCAGGGCAGCAGGGATGGGAGAGCCAGGCCTCAGCCTAGGGCACTGGAGACACCCGAGCACTGAGCAGAGCTCAGGACGTCTCAGGAGTACTGGCAGCTGAACAGGAACCAGGACAGGCACGGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGTTGAGGCAGGCAGCCCACTTGAGGTCAGTTTGAGACCAGCCTGGCCAACATGGTAAAACCCCGTCTCTACTAAAAATACAAAAGTTAGCCAGGCTTGGTGGCAGGTGCCTGTAATCCCAGCTACTCGGGAGACTGAGGCAGGAGAATTGCTTGAACCCGCAAGGTGGAGGTTGCACAGTGAGCTGAGATTGCACCACTGCACTCCAGCCTGGCAACAGAGCAAGACTCCATCTCCAAAAAAGAACAGAAATCAATGAAGCACCGAGTGACAGGGACTGGAAGGTCCTAATTCCATGGGTATTTACGGAACCCCTACGCCGTGTGGAGTCTTATTCTAGACAGTGGGGACGAGGCCATGAACAAGGTAGATGAGAGAGGAGATTTCTCCATCCTGGTCAGGGAATTTGTTAAAGACTGATGAAAACATGAATAAATAATTGTGTCTAGTACATTCTATTCGTGAATCTCATAACAGACAGTGGTAGAGTGACCGTGACCCATTCGCCACACAGTAGAGTCACTTTTTTGGTTTGTTTTTTAGAGACAGGGTCTTCCTCTGTTGCTGAGGCTGGAGTGCAGTGGTGCAGTCATAGTTCACTGCAGCCTCAACCTCCTGTGCTCAAGCAATCCTCCCACCTCAGCGTCCCAAGTAGCTGGGACAGCAGGCACATGCCACGGGTTGGGGGACCACAGGCATGGTCAAGGGGCTGGCAGTCAAGCAAGTG
Genomic Patterns

Short Tandem Repeats (STRs)
1 to 6 nucleotides repeated in tandem

Variable Number Tandem Repeats (VNTRs)
Same as short tandem repeats
Number of repeats variable across individuals

CpG Islands
A sequence of > 500 nucleotides
C+G content of > 55%
High frequency of CG dinucleotides
…CGCGCCGGACGTTACGCGCGCCGCGAAACGCGCGCCGGACGTTACGCGCGCCGCGAAACGGCCGCGCGTAC…

Palindromes
A sequence that is like a normal palindrome (mom, racecar, …)
One half is a complement of the other in reverse order.

ALU Elements (300 bp)
Retrotransposon of ~300 nucleotides with
High G+C content
Recognition site for Alu endonuclease
Segment high in A content
A poly A tail

LINE-1 Elements (>1,000 bp)
Retrotransposon of >1,000 nucleotides
High A+T content
Poly A tail
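Two of the pattern definitions above translate directly into small checks. The following is a minimal Python sketch, not the toolkit's own code: the function names are hypothetical, the palindrome test requires an exact reverse-complement match, and the CpG check covers only the length and C+G thresholds, because the slides do not quantify "high frequency of CG dinucleotides".

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A<->T, C<->G)."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def is_dna_palindrome(seq: str) -> bool:
    """True when the sequence equals its own reverse complement,
    i.e. one half is the complement of the other in reverse order."""
    return seq.upper() == reverse_complement(seq)


def gc_content(seq: str) -> float:
    """Fraction of bases that are C or G (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("C") + seq.count("G")) / len(seq) if seq else 0.0


def cg_dinucleotide_count(seq: str) -> int:
    """Number of CG dinucleotides; CG cannot overlap itself, so str.count is exact."""
    return seq.upper().count("CG")


def meets_cpg_length_and_gc(seq: str, min_len: int = 500, min_gc: float = 0.55) -> bool:
    """Check only the two quantified criteria: > 500 nucleotides and C+G > 55%.
    The CG dinucleotide frequency threshold is left to the caller (assumption)."""
    return len(seq) > min_len and gc_content(seq) > min_gc
```

For example, `is_dna_palindrome("GAATTC")` holds because reversing and complementing GAATTC gives GAATTC again.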
Genomic Patterns
Disease Relevance
Challenges in Pattern Mining
Computational tools for pattern mining must be…

Scalable
Genomes are large: 3 billion nucleotides
Genes are small: 3 thousand nucleotides
Genomes of different organisms vary greatly in size

Flexible
Types of patterns differ
There are variations within a single type of pattern
Flexibility in resolution of analysis

Nonparametric
New and unknown patterns
Explorative analysis

Currently, there are no tools that are scalable, flexible, and nonparametric for genomic pattern mining
Pattern Mining Toolkit
The applications layer contains programs that use features computed by the tools layer, together with the preprocessed data, to compute specific commonly known patterns such as short tandem repeats, DNA palindromes, short and long interspersed nuclear elements, etc.
Foundation Layer
Data Preprocessing:
Suffix array computation
Longest common prefix array computation
Efficient Preprocessing of Genome Sequence
Repetitive patterns appear next to each other
Allows for efficient computation of patterns
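Why repetitive patterns end up next to each other can be seen in a small example. The sketch below is illustrative only and assumes nothing about the toolkit's implementation: it builds the suffix array by naive sorting (far too slow for a genome, where a linear-time construction would be needed) and the longest common prefix (LCP) array with Kasai's algorithm. All function names are hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction: sort suffix start positions lexicographically.
    Fine for short strings; not suitable for genome-scale input."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text: str, sa: list[int]) -> list[int]:
    """Kasai's algorithm. lcp[r] is the length of the common prefix of the
    suffixes at sa[r - 1] and sa[r]; lcp[0] is 0 by convention."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0  # current matched length, reused across iterations
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix just before suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def longest_repeat(text: str) -> str:
    """Longest substring occurring at least twice: the largest LCP value
    marks two adjacent sorted suffixes that share that prefix."""
    sa = build_suffix_array(text)
    lcp = build_lcp_array(text, sa)
    if not lcp:
        return ""
    r = max(range(len(lcp)), key=lcp.__getitem__)
    return text[sa[r]: sa[r] + lcp[r]]
```

Because sorting places suffixes with a shared prefix side by side, every repeated substring shows up as a run of neighbouring suffix array entries with a high LCP value, so repeats can be read off in a single pass over the two arrays.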
Tools Layer

Locate Specific Patterns
TTAAAAAAAA-TTTTTTAAAA 10 251555
TAAAAAAC-GTTTTTAA 8 276649
CAAAAAAG-CTTTTTAG 8 312629
TCTCTACTAAAAAT-ATTTTTAAAAAAAA 14 364179
TGAAAAACA-TGTTTTAAA 9 449648

Find Ngram Counts
Ngram = CG
Window  Count
1       76
2       108
3       90
4       106
5       185
6       0

Compare Ngram Counts
Ngram = GCC
Window  Chrom A  Chrom B
1       42       100
2       98       165
3       63       79
4       72       60
5       25       151
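The per-window n-gram counts and the two-sequence comparison listed above can be approximated as follows. This is a hedged sketch, not the toolkit's code: the function names, the fixed non-overlapping windows, and the choice to count overlapping occurrences by start position are all assumptions.

```python
def count_ngram_in_windows(seq: str, ngram: str, window_size: int) -> list[int]:
    """Count occurrences of ngram per consecutive, non-overlapping window.
    An occurrence is assigned to the window containing its start position;
    overlapping occurrences are counted (assumption)."""
    if not ngram or window_size <= 0:
        raise ValueError("ngram must be non-empty and window_size positive")
    seq, ngram = seq.upper(), ngram.upper()
    n_windows = (len(seq) + window_size - 1) // window_size
    counts = [0] * n_windows
    start = seq.find(ngram)
    while start != -1:
        counts[start // window_size] += 1
        start = seq.find(ngram, start + 1)
    return counts


def compare_ngram_counts(seq_a: str, seq_b: str, ngram: str,
                         window_size: int) -> list[tuple[int, int, int]]:
    """Return (window number, count in A, count in B) rows, numbering windows
    from 1 and padding the shorter sequence with zero counts."""
    a = count_ngram_in_windows(seq_a, ngram, window_size)
    b = count_ngram_in_windows(seq_b, ngram, window_size)
    n = max(len(a), len(b))
    a += [0] * (n - len(a))
    b += [0] * (n - len(b))
    return [(w + 1, a[w], b[w]) for w in range(n)]
```

Calling `compare_ngram_counts(chrom_a, chrom_b, "GCC", window_size)` on two sequences yields rows in the same Window / Chrom A / Chrom B layout as the comparison table, which makes windows with sharply different counts easy to spot.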
Tools Layer
Large Repeats
Find RegEx
23 17 29441 CAGATTTGAAACACTCTTTTTGT
24 93 4161 ATATCTTCGTATAAAAACAAGACA
25 123 292054 TTTTCAGAAACTGCTTTGTGATGTG
31 255 3983 GAAACGGGATTTCTTTATATTATGCTAGACA
Find Perplexity
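Of the tools listed above, a regular-expression search is the simplest to illustrate. The sketch below uses Python's standard `re` module and is only an approximation of what a "Find RegEx" tool might do; `find_regex` and `STR_PATTERN` are hypothetical names, and the requirement of at least four tandem copies is an arbitrary choice made for the example.

```python
import re


def find_regex(seq: str, pattern: str) -> list[tuple[int, int, str]]:
    """Return (start position, length, matched text) for every
    non-overlapping match of pattern in seq, scanning left to right."""
    return [(m.start(), len(m.group(0)), m.group(0))
            for m in re.finditer(pattern, seq)]


# Hypothetical short tandem repeat pattern: a unit of 1 to 6 nucleotides
# (shortest unit tried first) followed by at least 3 further copies of it.
STR_PATTERN = r"([ACGT]{1,6}?)\1{3,}"
```

Searching `"TTGACACACACAGG"` with `STR_PATTERN` reports the run `ACACACAC` starting at position 3 (0-based) with length 8.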
Explorative pattern analysis in chromosome 19
[Figure sequence: the analysis shown at successive scales of 5 MB, 250 KB, 10 KB, and 1 KB]
Feature analysis of the centromere of the X chromosome
Perplexity drops near the centromere region that is highly repetitive, containing ngrams that are unique to this region.
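The slides do not state how perplexity is computed. One common definition, assumed here, is 2 raised to the Shannon entropy (in bits) of the n-gram distribution inside a window; under that definition a highly repetitive window, dominated by a few n-grams, scores low. The function names and default n-gram length below are hypothetical.

```python
import math
from collections import Counter


def window_perplexity(window: str, n: int = 3) -> float:
    """Perplexity of the n-gram distribution in one window, taken as
    2 ** (Shannon entropy in bits). Returns 0.0 if the window is shorter
    than n. This definition is an assumption, not the toolkit's."""
    ngrams = [window[i:i + n] for i in range(len(window) - n + 1)]
    if not ngrams:
        return 0.0
    total = len(ngrams)
    counts = Counter(ngrams)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2.0 ** entropy


def perplexity_profile(seq: str, window_size: int, n: int = 3) -> list[float]:
    """Perplexity of each consecutive, non-overlapping window along seq."""
    return [window_perplexity(seq[s:s + window_size], n)
            for s in range(0, len(seq), window_size)]
```

A window made of a single repeated n-gram has perplexity 1, while a window in which k different n-grams are equally frequent has perplexity k, so a dip in the profile marks a region with little n-gram diversity.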
Pattern landscape of chromosome 19
Duplication events
Acknowledgements
Madhavi Ganapathiraju
Thahir Mohamed
Kamiya Mopwani
Thank you! Visit us at
Department of Biomedical Informatics University of Pittsburgh
Cathedral of Learning, University of Pittsburgh
www.dbmi.pitt.edu/madhavi