![Page 1: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/1.jpg)
The MoBIoS Project
Molecular Biological Information System
Daniel P. Miranker
Dept. of Computer Sciences &
Center for Computational Biology and Bioinformatics
University of Texas
Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang
![Page 2: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/2.jpg)
Problem:
In the life sciences, database management systems (DBMS) serve as glorified file managers.
Little use of sophisticated data and pattern-based retrieval
Real scientific and technological problems
![Page 3: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/3.jpg)
When biological data is put into an RDBMS
• Primary data is stored in text or blob fields
– Annotations may be relational
• Data retrieval
– Filter DB, sequential dump, O(n), to utilities
• E.g. BLAST
Organism Function Sequence
Yeast membrane AACCGGTTT
Yeast mitosis TATCGAAA
E. Coli membrane AGGCCTA
![Page 4: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/4.jpg)
Linear Data Scans, O(n), Endemic in Life Sciences
• Sequences: DNA, RNA, protein databases
• Mass spectra: proteomics
• Small molecules & protein structure: protein interaction, rational drug design
• Pathways (graphs)
• Phylogenies (graphs, trees in particular)
![Page 5: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/5.jpg)
Scope: To Find Common Ground Both Biology and DBMS’ Have to Move
DBMS
Biological
Information
System
Metric-Space Database as the Common Ground
![Page 6: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/6.jpg)
A metric space is a pair, M = (D, d),
where D is a set of points and d is a [metric] distance function with the following properties:
• d(x,y) = d(y,x) (symmetry)
• d(x,y) >= 0, d(x,x) = 0 (non-negativity)
• d(x,z) <= d(x,y) + d(y,z) (triangle inequality)
(Figure: triangle with vertices x, y, z)
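As a rough illustration of the three properties, the sketch below brute-force checks them on a small sample of points. The helper names `hamming` and `is_metric` and the sample strings are assumptions made for this example; they are not part of MoBIoS.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def is_metric(points, d) -> bool:
    """Check the metric properties on a finite sample (a test, not a proof)."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):               # symmetry
            return False
        if d(x, y) < 0 or d(x, x) != 0:      # non-negativity, d(x,x) = 0
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):      # triangle inequality
            return False
    return True


sample = ["ACA", "CAA", "AAC", "ATC", "AAA"]
print(is_metric(sample, hamming))

# A squared difference is symmetric and non-negative but breaks the
# triangle inequality, so it is rejected.
print(is_metric([0, 1, 3], lambda a, b: (a - b) ** 2))
```

A check like this only covers the sampled points, so it can reject a candidate distance function but cannot certify one.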
![Page 7: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/7.jpg)
Definition - By Analogy
A Spatial Database Management System:
• Extend relational DBMS
• Special indexes for 2D and 3D data; k-d and R-trees
• New data types
• Geographic information systems: topographic maps, buildings and the like
A Metric-Space Database Management System:
• Extend relational DBMS
• Special indexes for metric spaces
• New data types
• Biological information system: life science data types
![Page 8: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/8.jpg)
Develop index structures to support distance & nearest-neighbor queries
• Well studied in main memory
– But by no means a closed problem
• In databases (external/disk-based methods)
– Embryonic
– Many myths
• Often assumed to be the basis of multimedia database systems
![Page 9: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/9.jpg)
How to build a metric-space index
• Three algorithmic classes [Tasan, Ozsoyoglu 04]
– Vantage points
– Hyperplanes
– Bounding spheres
![Page 10: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/10.jpg)
Vantage Point Method [Burkhard & Keller 73]
![Page 11: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/11.jpg)
Vantage Point Method
Choose a point, VP
And a radius, R
![Page 12: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/12.jpg)
Vantage Point Method
Choose a point, VP
And a radius, R
• Given VP, R
The predicates
• d(VP,x) < R
• d(VP,x) ≥ R
Divide the set into two equal halves
• apply recursively
![Page 13: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/13.jpg)
Query, q, range r
(Figure: query point q with range r)
![Page 14: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/14.jpg)
Query, q, range r
(Figure: sphere of radius R around VP; query point q with range r)
if
• d(q,VP) > R + r
then
• all neighbors are outside the sphere
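The partitioning and pruning rules can be sketched as a small vantage-point tree. This is a minimal in-memory sketch under stated assumptions: the vantage point is picked at random, R is the median distance, and the names `Node`, `build`, and `range_search` are hypothetical. It is not the MoBIoS index.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Node:
    vp: Any                      # vantage point stored at this node
    radius: float                # R: splits the remaining points
    inside: Optional["Node"]     # points x with d(vp, x) < R
    outside: Optional["Node"]    # points x with d(vp, x) >= R


def build(points: List[Any], d: Callable, rng: random.Random) -> Optional[Node]:
    """Recursively split the points around a randomly chosen vantage point."""
    if not points:
        return None
    rest = list(points)
    vp = rest.pop(rng.randrange(len(rest)))
    if not rest:
        return Node(vp, 0.0, None, None)
    dists = sorted(d(vp, x) for x in rest)
    radius = dists[len(dists) // 2]          # median gives roughly equal halves
    inside = [x for x in rest if d(vp, x) < radius]
    outside = [x for x in rest if d(vp, x) >= radius]
    return Node(vp, radius, build(inside, d, rng), build(outside, d, rng))


def range_search(node: Optional[Node], q: Any, r: float, d: Callable, out: list) -> list:
    """Collect every stored point x with d(q, x) <= r."""
    if node is None:
        return out
    dq = d(q, node.vp)
    if dq <= r:
        out.append(node.vp)
    # If d(q,VP) >= R + r, no point inside the sphere can match.
    if dq - r < node.radius:
        range_search(node.inside, q, r, d, out)
    # If d(q,VP) + r < R, no point outside the sphere can match.
    if dq + r >= node.radius:
        range_search(node.outside, q, r, d, out)
    return out


def abs_diff(a, b):
    return abs(a - b)


data_rng = random.Random(0)
points = [data_rng.randrange(1000) for _ in range(200)]
tree = build(points, abs_diff, random.Random(1))
print(sorted(range_search(tree, 500, 10, abs_diff, [])))
```

Both pruning tests follow from the triangle inequality alone, which is why the same structure works for any metric, not only for numbers on a line.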
![Page 15: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/15.jpg)
Multi-vantage point method
![Page 16: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/16.jpg)
Multi-vantage point method
• Consider d(VPi, x) a projection onto an axis
• Looks like a k-d tree
– Choose number k & d
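One way to see the projection idea is a flat table of distances to k vantage points, used to filter candidates before the real distance is computed. The sketch below is that flat filter, not the tree itself. The vantage points `AAA` and `CCC`, the use of Hamming distance, and the names `project` and `range_query` are assumptions for illustration.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def project(points, vps, d):
    """Map each point x to the coordinate vector (d(VP1, x), ..., d(VPk, x))."""
    return [[d(vp, x) for vp in vps] for x in points]


def range_query(points, coords, vps, q, r, d):
    """Return (hits, number of real distance computations against data points)."""
    qc = [d(vp, q) for vp in vps]            # k distance computations for q
    hits, computed = [], 0
    for x, xc in zip(points, coords):
        # Triangle inequality: |d(q,VPi) - d(x,VPi)| <= d(q,x).
        # If any axis differs by more than r, x cannot be within r of q.
        if any(abs(a - b) > r for a, b in zip(qc, xc)):
            continue
        computed += 1
        if d(q, x) <= r:
            hits.append(x)
    return hits, computed


points = ["".join(p) for p in product("ACGT", repeat=3)]
vps = ["AAA", "CCC"]
coords = project(points, vps, hamming)
hits, computed = range_query(points, coords, vps, "ACA", 1, hamming)
print(len(hits), "hits;", computed, "of", len(points), "distances computed")
```

The filter never discards a true match, so adding vantage points trades extra stored coordinates for fewer full distance computations.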
![Page 17: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/17.jpg)
Myths
• Solved problem; M-trees [Ciaccia et al. 96, 97]
– I can't get them to work on anything but their original synthetic data generator
• Good choice for vantage points is to find corners [Yianilos 93] (farthest-first clustering)
– Might be true for Euclidean spaces
– Early result, not true for our data
• High dimensional indexing always asymptotically reduces to linear scans.
– Formal result based on an assumption of uniform data distributions.
![Page 18: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/18.jpg)
[Figure: two plots against query radius (0 to 10), each with one curve for RBT, GHT, and MVPT. Panel "#dist. cal.: RBT vs. GHT vs. MVPT" shows the number of distance calculations; panel "#I/O: RBT vs. GHT vs. MVPT" shows the number of I/Os.]
Figure 9. Comparison of metric-space index structures: RBT, GHT, and VPT
Comparison of Three Methods of Metric-Space Indexing
![Page 19: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/19.jpg)
Open problems
• Is there a general metric-space index structure that is good for most workloads?
– We are optimistic that MVP trees, with further tuning, will be a useful answer
– Hyperplane methods are fair game: there is circumstantial evidence that they are a key component in Google's search engine.
• No work addresses clustering data pages on disk.
• Metric-space join algorithms
![Page 20: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/20.jpg)
Biological Models are Usually Based on Similarity
Similarity
• Biologists like scoring functions that reward each similar feature with a positive number
• Intuitive
Distance:
• More similar → smaller numbers
• Identical → 0
![Page 21: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/21.jpg)
But Do Metric Models Capture Biology?
• Metrics are a subset of possible mathematical models.
![Page 22: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/22.jpg)
Sequence Problem 1
Sequence similarity based on weighted edit distance
Accepted weight matrices, PAM & BLOSUM, are not metric
Log-odds matrices – negative values
Defy simple algebraic normalization [Taylor & Jones 93, Linial et al. 97]
![Page 23: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/23.jpg)
Our First Result: mPAM [Xu&Miranker04]
Dayhoff et al.'s PAM Derivation [74]
• Took a set of closely related protein sequences
• Developed a phylogenetic tree
• Counted substitutions to transform one sequence to another
• Tree determines a measure of time
![Page 24: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/24.jpg)
PAM vs. mPAM: t = 1/f
Using original substitution counts
PAM: frequency of substitution
$S(a,b \mid t) = \log \dfrac{P(b \mid a,t)}{q_b}$
mPAM: expected time between substitutions
$D(a,b) = \dfrac{1}{\log\bigl(1 - \sum_{x} P(a,x)\,P(b,x)\bigr)}$
![Page 25: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/25.jpg)
Sequence Problem 2
• Sequences are long units (identity for storage and retrieval)
– Genes
– Chromosomes
• Analysis comprises comparing small substrings
![Page 26: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/26.jpg)
Soln: Sequence View
• New view type
• Breaks sequences into q-grams
create SEQUENCEVIEW rice_sview as
SELECT CREATE FRAGMENTS (…, 3, 1)
FROM …
WHERE …
USING HAMMING-DISTANCE
![Page 27: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/27.jpg)
Materialize as an Index
Genomes
Rowid  Seq
R1     CAACA
R2     ATCAAA
R3     …

Rowid  Offset  Logical Fragment
R1     1       A C A
R1     2       C A A
R1     3       A A C
R1     4       A C A
…      …       …
R2     1       A T C
R2     2       T C A
R2     3       C A A
R2     4       A A A
…      …       …

Index entries (from the figure): D(ACA) ≤ 1, D(CAA) ≤ 0, D(ATC) ≤ 1, D(AAA) ≤ 2
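The fragment table can be reproduced with a short sketch. It assumes that the arguments 3 and 1 in CREATE FRAGMENTS mean fragment length 3 and step 1, and that offsets are 1-based as in the table. The names `fragments`, `materialize`, and `search` are hypothetical, and the search is a plain linear scan rather than a metric-space index.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def fragments(seq: str, q: int = 3, step: int = 1):
    """Return (offset, fragment) pairs; offsets are 1-based."""
    return [(i + 1, seq[i:i + q]) for i in range(0, len(seq) - q + 1, step)]


def materialize(rows: dict, q: int = 3, step: int = 1):
    """Expand every stored sequence into (rowid, offset, fragment) entries."""
    return [(rowid, off, frag)
            for rowid, seq in rows.items()
            for off, frag in fragments(seq, q, step)]


def search(view, pattern: str, max_dist: int):
    """Linear scan for fragments within max_dist of the pattern."""
    return [entry for entry in view if hamming(entry[2], pattern) <= max_dist]


rows = {"R2": "ATCAAA"}
view = materialize(rows)
for entry in view:
    print(*entry)
print(search(view, "AAA", 1))
```

In the design described on the slides, the materialized fragments would sit in a metric-space index so that the search step avoids scanning every entry.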
![Page 28: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/28.jpg)
Status
• Started with McKoi
– A Java open source object-relational DBMS
– (Think of Postgres written in Java)
• Added
– Biological data types
– Metric-space index
– Extending SQL engine (in progress)
![Page 29: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/29.jpg)
Computed in MoBIoS Compare Arabidopsis Genome X Rice Genome
1. Locate nucleotide patterns of the form: primer pair candidate
   Rice:  18 matching nucleotides, gap 400–3000 long, 18 matching nucleotides
   Arab.: 18 matching nucleotides, gap 400–3000 long, 18 matching nucleotides
2. Eliminate non-unique primer candidates
3. Merge overlapping primer candidates
• Usual implementations O(n^2), n = 10^9
![Page 30: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/30.jpg)
mSQL Query to locate candidate primer pairs
SELECT merge(R1.fragment, A1.fragment)
FROM
G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2
WHERE
distance('HAMMINGDISTANCE', R1.fragment, A1.fragment) <= 1.0 AND
distance('HAMMINGDISTANCE', R2.fragment, A2.fragment) <= 1.0 AND
(FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) >= 400 AND
(FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) <= 3000 AND
(FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) >= 400 AND
(FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) <= 3000
GROUP BY R1.fragment, A1.fragment;
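To make the predicates of the query concrete, here is a naive evaluation on toy sequences. It is a sketch of the quadratic approach the index is meant to avoid, not how MoBIoS executes the query. The function name `primer_pair_candidates`, the toy sequences, and the scaled-down fragment length and gap bounds are all assumptions, and the merge and GROUP BY steps are left out.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def fragments(seq: str, q: int):
    """Return (offset, fragment) pairs with 0-based offsets and step 1."""
    return [(i, seq[i:i + q]) for i in range(len(seq) - q + 1)]


def primer_pair_candidates(g1, g2, q, min_gap, max_gap, max_dist):
    """Return (r1, r2, a1, a2) offsets satisfying the WHERE clause."""
    f1, f2 = fragments(g1, q), fragments(g2, q)
    # Offsets of fragment pairs, one from each genome, within max_dist.
    similar = [(ro, ao)
               for ro, rf in f1
               for ao, af in f2
               if hamming(rf, af) <= max_dist]
    result = []
    for r1, a1 in similar:
        for r2, a2 in similar:
            # Both genomes must show a gap of the allowed length.
            if min_gap <= r2 - r1 <= max_gap and min_gap <= a2 - a1 <= max_gap:
                result.append((r1, r2, a1, a2))
    return result


g1 = "ACGTAC" + "TTTT" + "GGCATG"       # toy stand-in for one genome
g2 = "ACGTAC" + "CCCCCC" + "GGCATG"     # toy stand-in for the other
print(primer_pair_candidates(g1, g2, q=6, min_gap=8, max_gap=14, max_dist=0))
```

Computing the similar pairs by nested loops costs time proportional to the product of the two genome lengths, which is the step an indexed nested loop over the sequence view replaces.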
![Page 31: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/31.jpg)
Query Plan
Inputs: Arab. Genome, O(n); Rice Genome, O(m)
1. Offline: Build Sequence View, O(n log n)
2. Compare, O(m log n), Indexed Nested Loop
3. Eliminate Duplicates
4. Eliminate Low-Complexity Primers (LZ compression)
5. Merge Overlapping Primers
Result: ~10,000 conserved primer-pair candidates
![Page 32: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/32.jpg)
Preliminary Results
• Found 13,418 possible primer pairs from MoBIoS
• 100 best candidates BLASTed for matches in GenBank
– 15 matched other plant genes and the primers
– At least 2 of 15 showed potential after PCR amplification against Helianthus and Phalaenopsis.
![Page 33: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/33.jpg)
MoBIoS Architecture (Molecular Biological Information System)
![Page 34: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/34.jpg)
Analysing Mass-Spectra
Spectrum = Histogram of Mass/Charge Ratios of a collection of peptides
Similarity = Shared peaks count = Inner Product
(0100101) • (0111100) = 2
![Page 35: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/35.jpg)
Cosine Distance Approx. Inner Product
$D_{rs} = 1 - \dfrac{x_r x_s'}{(x_r x_r')^{1/2}\,(x_s x_s')^{1/2}}$
Shown: mass spectra can be stored and retrieved using cosine distance, and it scales
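A small numeric sketch of the two measures, using the binary spectra from the shared-peaks example, (0100101) and (0111100). The function names `shared_peaks` and `cosine_distance` are assumptions for illustration.

```python
import math


def shared_peaks(x, y):
    """Inner product; for 0/1 spectra this counts peaks present in both."""
    return sum(a * b for a, b in zip(x, y))


def cosine_distance(x, y):
    """1 minus the inner product normalized by both vector lengths."""
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine distance is undefined for an all-zero spectrum")
    return 1.0 - shared_peaks(x, y) / (norm_x * norm_y)


r = [0, 1, 0, 0, 1, 0, 1]
s = [0, 1, 1, 1, 1, 0, 0]
print(shared_peaks(r, s))
print(cosine_distance(r, s))
```

The normalization is what turns the raw shared-peaks count into a value that is 0 for identical spectra and grows as fewer peaks are shared.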
![Page 36: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/36.jpg)
mSQL Query for Protein Identification by Mass-Spec.
Signature Database Look
SELECT Prot.accesion_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
WHERE
MS.enzyme = DS.enzyme = E and
Cosine_Distance(S, MS.spectrum, range1) and
DS.accession_id = MS.accession_id = Prot.accesion_id and
DS.ms_peak = P and MPAM250(PS, DS.sequence, range2);
![Page 37: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/37.jpg)
Matching Electrostatic Shape of Molecules
![Page 38: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/38.jpg)
Still benefit from grid services:
• Intermittently, but regularly, compile (recluster) the indices: O(n log n), n > 10^6
• Rational drug design: O(log n) finite element solutions to traverse the search tree
• Make a service call to the grid for these operations only
• Mirror data contents to minimize I/O
• Since the need is intermittent, one grid serves many MoBIoS servers
(Figure labels: GRID, Mirror DB-Contents, MoBIoS Server, recluster, New index, Shape match (FEM), Distance (real), High speed I/O)
![Page 39: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/39.jpg)
Hyper-planes [Uhlmann 91]
• If d(x,h1) < d(x,h2) then x is assigned to h1
(Figure: points h1, h2, and x)
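The assignment rule can be sketched together with the pruning test it enables for range queries. The bound used here, skip a side when the query is more than 2r closer to the other point, follows from the triangle inequality. The names `partition` and `sides_to_search`, and the integer toy data, are assumptions for illustration.

```python
def partition(points, h1, h2, d):
    """Assign each x to h1 if d(x,h1) < d(x,h2), otherwise to h2."""
    side1, side2 = [], []
    for x in points:
        (side1 if d(x, h1) < d(x, h2) else side2).append(x)
    return side1, side2


def sides_to_search(q, r, h1, h2, d):
    """Return (search_side1, search_side2) for a range query (q, r)."""
    diff = d(q, h1) - d(q, h2)
    # For x on side 1: d(q,h1) - d(q,h2) < 2*d(q,x), so diff > 2r rules it out.
    # For x on side 2: d(q,h2) - d(q,h1) <= 2*d(q,x), so -diff > 2r rules it out.
    return diff <= 2 * r, -diff <= 2 * r


def abs_diff(a, b):
    return abs(a - b)


points = list(range(11))
side1, side2 = partition(points, 0, 10, abs_diff)
print(side1, side2)
print(sides_to_search(1, 1, 0, 10, abs_diff))
```

Applying `partition` recursively to each side, with two new points chosen per level, gives a generalized hyperplane tree.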
![Page 40: The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics](https://reader034.vdocuments.site/reader034/viewer/2022042717/56649e2f5503460f94b1f93f/html5/thumbnails/40.jpg)
Develop a Hierarchical Clustering
Hierarchy of bounding spheres (center, radius)
• Bounding spheres may overlap
• Inspired by R-trees
(Figure: overlapping bounding spheres labeled A to F)
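A brief sketch of how a (center, radius) pair supports pruning, and of why overlap matters: a query that falls in the overlap has to visit both spheres. The function names and the integer toy clusters are assumptions for illustration, and only a single level of the hierarchy is shown.

```python
def bounding_sphere(center, members, d):
    """Smallest radius around the given center that covers every member."""
    return center, max((d(center, x) for x in members), default=0)


def may_contain_match(sphere, q, r, d):
    """False only when no member can be within r of q (triangle inequality)."""
    center, radius = sphere
    return d(q, center) <= radius + r


def abs_diff(a, b):
    return abs(a - b)


sphere_a = bounding_sphere(10, [8, 10, 13], abs_diff)
sphere_b = bounding_sphere(14, [12, 14, 15], abs_diff)

# 12 lies in the overlap of the two spheres, so both must be searched.
print(may_contain_match(sphere_a, 12, 0, abs_diff),
      may_contain_match(sphere_b, 12, 0, abs_diff))

# 20 is far from both spheres, so both are pruned.
print(may_contain_match(sphere_a, 20, 1, abs_diff),
      may_contain_match(sphere_b, 20, 1, abs_diff))
```

The more the spheres overlap, the more subtrees a query has to descend into, so the quality of the clustering largely determines how much the index saves.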