![Page 1: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/1.jpg)
Supporting on-the-fly data Integration for bioinformatics
Candidate: Xuan Zhang
Advisor: Gagan Agrawal
![Page 2: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/2.jpg)
Road Map
• Mission Statement
• Motivation
• Implementation
• Comprehensive Examples
• Future work
• Conclusion
![Page 3: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/3.jpg)
Mission Statement
• Enhance information integration systems on
  – Functionality
    • On-the-fly data incorporation
    • Flat file data process
  – Usability
    • Declarative interface
    • Low programming requirement
![Page 4: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/4.jpg)
Motivation
• Integration is essential for biological research
  – Biological data include
    • Sequences: DNA (GenBank), protein (Swiss-Prot)
    • Structure: RNA (RNAbase), protein (PDB)
    • Interaction: pathway (KEGG), regulation (GRBase)
    • Function: disease (OMIM)
    • Secondary: protein family (Pfam)
  – Biological data are inter-related.
![Page 5: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/5.jpg)
Motivation
• Challenges of bioinformatics integration
  – Data volume: overwhelming
    • DNA sequence: 100 gigabases (August, 2005)
  – Data growth: exponential
[Figure: data growth chart, provided by PDB]
![Page 6: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/6.jpg)
Motivation
• Challenges of bioinformatics integration (cont.)
  – Tools: Many and more
  – Service interfaces: Variety
    • Web pages
    • Web service
    • Grid service
![Page 7: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/7.jpg)
Motivation
• Challenges of bioinformatics integration (cont.)
  – Inter-operability: Low
    • Heterogeneous data sources
      – Semi-structured by nature
      – Flat file, relational, object-oriented databases
    • Independently developed tools
    • No data exchange standard
  – Little Collaboration
![Page 8: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/8.jpg)
Road Map
• Mission Statement
• Motivation
• Implementation
  – Approach Overview
  – Advantage
  – Components
• Future
• Conclusion
![Page 9: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/9.jpg)
Approach Summary
• Metadata
  – Declarative description of data
  – Data mining algorithms for semi-automatic writing
  – Reusable by different requests on same data
• Code generation
  – Request analysis and execution separated
  – General modules with plug-in data module
![Page 10: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/10.jpg)
System Overview
[Figure: system overview in two stages, Understand Data and Process Data. Components shown: Data File, Layout Miner, Schema Miner, Metadata Description (Layout Descriptor and Schema Descriptor), User Request, Code Generation, Request Processor, Answer, Information Integration System]
![Page 11: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/11.jpg)
Advantages
• Simple interface
  – At metadata level, declarative
• General data model
  – Semi-structured data
  – Flat file data
• Low human involvement
  – Semi-automatic data incorporation
  – Low maintenance cost
• OK Performance
  – Linear scale guaranteed
![Page 12: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/12.jpg)
Road Map
• Mission Statement
• Motivation
• Implementation
  – Approach Overview
  – Advantage
  – Components
• Future
• Conclusion
![Page 13: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/13.jpg)
System Components
• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Wrapper generation
  – Query Process
  – Query Process with indices
![Page 14: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/14.jpg)
Layout Mining
• Goal 1: Separate delimiters from values
  – D-score: location & frequency
• Goal 2: Organize delimiters and values
  – NFA
[Figure: layout mining pipeline: Data File → Token Parser → Tokens → Delimiter Mining → Candidate Delimiters → Layout Learning → Layout Descriptor]
![Page 15: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/15.jpg)
Schema Mining Road Map
• Schema Mining
  – Overview
  – Mining System
  – Core Mining Algorithm
  – Experiments
![Page 16: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/16.jpg)
Schema Mining Goals
• Ultimate goal: discover schema about an unknown flat file dataset
• Immediate goal: Assign attributes with meaningful labels
![Page 17: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/17.jpg)
Our Approach
• Summarize values from bottom up
• Use knowledge from
  – Ontology
  – Heuristics
• A heads-up: attribute label ≠ attribute name
  – What we can mine
    • date
  – What we cannot do
    • Creation date, last modification date, birthday, …
![Page 18: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/18.jpg)
Schema Mining Road Map
• Schema Mining
  – Overview
  – Mining System
  – Core Mining Algorithm
  – Experiments
![Page 19: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/19.jpg)
Schema Mining System
• Major Components
  – Data cleaning and summarization
  – Score calculation
    • Score function
    • Ontology
    • Heuristics
  – Score Clustering
[Figure: schema mining pipeline: Raw attribute values → Value cleaning and summarization → Attribute summaries → Score calculation → Scores → Clustering algorithm → Cutoff values → Labeling → Attribute labels]
![Page 20: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/20.jpg)
Data Summarization
• Goal: reduce amount of data
• Collect frequent tokens
  – Approximate frequent token mining algorithm
• Token categorization by profile
  – Token profile: an ordered list of N (numerical), A (alphabetic) and special characters
  – Token categories:
    • Word, number, else and other user-defined categories
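The token profile idea can be sketched in Python as follows. This is only an illustration, not the system's implementation: the names `token_profile` and `categorize`, the rule of collapsing each run of digits or letters into a single N or A, and the exact category rules are assumptions made for this example.

```python
import re

# One alternative per kind of run: digits (N), ASCII letters (A), any other single character (S).
_RUN = re.compile(r"(?P<N>[0-9]+)|(?P<A>[A-Za-z]+)|(?P<S>.)", re.DOTALL)


def token_profile(token):
    """Return the profile of a token, e.g. '12-SEP-1993' -> 'N-A-N'."""
    parts = []
    for match in _RUN.finditer(token):
        if match.lastgroup == "S":
            parts.append(match.group("S"))  # special characters are kept as they are
        else:
            parts.append(match.lastgroup)   # 'N' or 'A'
    return "".join(parts)


def categorize(token):
    """Map a token to a coarse category (hypothetical rules for illustration)."""
    profile = token_profile(token)
    if profile == "A":
        return "word"
    if profile in ("N", "N.N"):
        return "number"
    return "else"
```

Under these assumptions a date such as `12-SEP-1993` has the profile `N-A-N`, which a user-defined category could match in the same way as the built-in ones.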
![Page 21: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/21.jpg)
Score Function Template
• Desired property
  – Simple
  – Adjustable trade-off between sensitivity and error tolerance
[Figure: score function template, plotted on a 0.0 to 1.0 scale with points F_pt, B_pt and t marked; label: Temperature]
![Page 22: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/22.jpg)
Score Clustering
• Goal: Sort attributes into three groups, H (high), L (low) and M (middle), by scores
• Mathematically, find two scores, score_i and score_j, from {score_1, score_2, score_3, …, score_N}, to minimize the standard deviation
• N (number of attributes) is not large. Exact answer can be found.
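Because N is small, the exact search can be done by brute force. The sketch below tries every pair of cut positions over the sorted scores. The slide does not say how the three groups' deviations are combined, so minimizing the sum of the per-group population standard deviations is an assumption, as is the function name `cluster_scores`.

```python
from itertools import combinations
from statistics import pstdev


def cluster_scores(scores):
    """Split scores into (low, middle, high) groups by exhaustive search.

    Assumption: the objective is the sum of the population standard
    deviations of the three groups; the slide does not spell this out.
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three scores to form three groups")
    best_cost = None
    best_groups = None
    # i and j are the positions where the sorted list is cut (1 <= i < j < n).
    for i, j in combinations(range(1, n), 2):
        groups = (ordered[:i], ordered[i:j], ordered[j:])
        cost = sum(pstdev(group) for group in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups
```

The search examines O(N²) cut pairs, which is affordable when N is the number of attributes in a dataset.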
![Page 23: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/23.jpg)
Schema Mining Road Map
• Schema Mining
  – Overview
  – Mining System
  – Core Mining Algorithm
    • Mining with ontology
    • Mining with heuristics
  – Experiments
![Page 24: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/24.jpg)
Use of Ontology
• An observation: a similarity between ontology and schema
  – Both satisfy “is-a” relation
    • E.g. “Diabetes is a disease.”
    • Ontology: “diabetes” is a child of “disease”
    • Schema: “diabetes” is a valid instance of attribute “disease”
• Common ancestors in ontology ~ attribute label
![Page 25: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/25.jpg)
Real-world Complications
• To find an arbitrary value in an ontology
  – Complete and comprehensive ontology?
    • Selective sampling
  – Error-free dataset?
    • Adjustable sensitivity & fault tolerance
• Performance
![Page 26: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/26.jpg)
Ontology Database
• Goal: to approximate a complete comprehensive ontology database
• Approach
  – “Complete”: sample popular terms
  – “Comprehensive”: public ontology databases + common facts
• Result
  – 6 major categories
  – 386 terms
![Page 27: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/27.jpg)
Ontology Based Metrics (1)
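1. $$
\mathrm{Occurrence}(term) =
\begin{cases}
\mathrm{Frequent\_Count}[i], & \text{if } term = \mathrm{Frequent\_Token}[i] \\
\min_{i \in [0,\, t]} \mathrm{Frequent\_Count}[i], & \text{if } term = \mathrm{Frequent\_Token}[0] \mid \dots \mid \mathrm{Frequent\_Token}[t] \\
0, & \text{else}
\end{cases}
$$
2. $$\mathrm{Strength}(term) = \mathrm{Occurrence}(term) + \mathrm{Strength}(child\_term)$$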
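The two metrics can be sketched as below. Two readings are assumed here and are not confirmed by the slide: a term written as `Frequent_Token[0]|…|Frequent_Token[t]` is taken to be a multi-word term whose words are all frequent tokens, and `Strength` is taken to add the strength of every child term. The dictionary layout (`frequent_counts`, `children`) is also hypothetical.

```python
def occurrence(term, frequent_counts):
    """Occurrence of an ontology term given a {token: count} table.

    A single-token term gets that token's count; a multi-token term gets the
    minimum count over its tokens; any token missing from the table gives 0.
    """
    tokens = term.split()
    if not tokens:
        return 0
    return min(frequent_counts.get(token, 0) for token in tokens)


def strength(term, children, frequent_counts):
    """Occurrence of the term plus the strength of all of its child terms.

    `children` maps a term to the list of its direct children in the ontology
    (assumed to be a tree, so the recursion terminates).
    """
    total = occurrence(term, frequent_counts)
    for child in children.get(term, []):
        total += strength(child, children, frequent_counts)
    return total
```

With this reading, a parent concept such as "disease" accumulates evidence from every value that names one of its descendants, which is what lets a common ancestor emerge as a candidate attribute label.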
![Page 28: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/28.jpg)
Ontology Based Metrics (2)
• Two factors
  – Relative strength compared with other concepts
  – Completeness of ontology as a whole
• Ontology score = product of two factors
  – Each modulated by the template score function
![Page 29: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/29.jpg)
Mining With Heuristics (1)
• Use token profile
  – “number”: {N, N.N}
  – “date”: {N-A-N, N/N/N}
• Use frequent token counts
  – “identification”: Frequent_Counts[]=1
• Use other token information
  – “biological sequence”: length >45, or in 10’s
![Page 30: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/30.jpg)
Mining With Heuristics (2)
• Use token sequence information
  – “people name”: length (2~3), separator (“,” or “and”), profile (not number, date)
• Again, these counts are modulated by the template function to calculate scores
![Page 31: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/31.jpg)
Schema Mining Road Map
• Schema Mining
  – Overview
  – Mining System
  – Core Mining Algorithm
  – Experiments
![Page 32: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/32.jpg)
Schema Mining Experiment Design
• Datasets
  – GenBank, UniProt SWISSPROT and Pfam
• Cutoff values
  – Exact clustering
• Evaluation
  – Weighted Cohen’s Kappa
  – Compare group most, middle and little with true label Y (yes), P (partial) and N (no)
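For reference, a weighted Cohen's kappa over three ordered categories can be computed as in the sketch below. The slides do not state which weighting scheme was used, so the linear disagreement weights |i − j| are an assumption, as is the encoding of both the predicted groups (little, middle, most) and the true labels (N, P, Y) as ranks 0, 1, 2.

```python
def weighted_kappa(predicted, actual, num_categories=3):
    """Linear-weighted Cohen's kappa for two lists of ordinal ranks (0-based)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    k = num_categories
    # Confusion counts: rows are predicted ranks, columns are true ranks.
    counts = [[0] * k for _ in range(k)]
    for p, a in zip(predicted, actual):
        counts[p][a] += 1
    row_totals = [sum(counts[i]) for i in range(k)]
    col_totals = [sum(counts[i][j] for i in range(k)) for j in range(k)]
    # Weighted observed and chance-expected disagreement.
    observed = sum(abs(i - j) * counts[i][j]
                   for i in range(k) for j in range(k)) / n
    expected = sum(abs(i - j) * row_totals[i] * col_totals[j]
                   for i in range(k) for j in range(k)) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when all items fall in one category")
    return 1.0 - observed / expected
```

A value of 1 means perfect agreement, and larger rank differences (for example most versus N) are penalized more heavily than adjacent ones.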
![Page 33: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/33.jpg)
Result Summary: Kappa
[Figure: Kappa value per attribute label, with agreement bands marked Very good, Good and Moderate]
1: cellular component, 2: database, 3: date, 4: free text, 5: ID, 6: molecule type, 7: name, 8: number, 9: organism, 10: publication method, 11: sequence
![Page 34: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/34.jpg)
Cellular Component (O)
![Page 35: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/35.jpg)
Date (H)
![Page 36: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/36.jpg)
Organism Name (O)
![Page 37: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/37.jpg)
Schema Mining Summary
• According to Kappa tests, results are good or very good
• Possible improvement
  – Clustering method with better intelligence
  – Better ontology database
  – More involved language analysis
  – Hybrid of bottom-up and top-down approaches
![Page 38: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/38.jpg)
System Components
• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Metadata description language
  – Wrapper generation
  – Query Process
  – Query Process with indices
![Page 39: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/39.jpg)
Data Process Overview
• Automatic code generation approach
• Input
  – Metadata about datasets involved
  – Optional:
    • Implicit data transformation task
    • Request by users
    • Indexing functions
• Output
  – Executable programs
    • General modules
    • Task-specific data module
![Page 40: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/40.jpg)
Metadata Description
• Two aspects of data in flat files
  – Logical view of the data
  – Physical data organization
• Two components of every data descriptor
  – Schema description
  – Layout description
• Design goals
  – Powerful
  – Easy for writing and interpretation
![Page 41: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/41.jpg)
Metadata Challenges
• Examples of sequence formats
  – ALN/ClustalW format
  – AMPS Block file format
  – ClustalW
  – Codata
  – EMBL
  – GCG/MSF
  – GDE
  – GenBank
  – Fasta (Pearson)
  – NBRF/PIR
  – PDB format
  – Pfam/Stockholm format
  – Phylip
  – Raw
  – RSF
  – UniProtKB/Swiss-Prot
List and example provided by EMBL-EBI
>FOSB_MOUSE Protein fosB. 338 bp MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
{
name "Short name for sequence"
longname "Long (more descriptive) name for sequence"
sequence-ID "Unique ID number"
creation-date "mm/dd/yy hh:mm:ss"
direction [-1|1]
strandedness [1|2]
type [DNA|RNA||PROTEIN|TEXT|MASK]
offset (-999999,999999)
group-ID (0,999)
creator "Author's name"
descrip "Verbose description"
comments "Lines of comments that can be fairly arbitrary text about a sequence. Return characters are allowed, but no internal double quotes or brace characters. Remember to close with a double quote"
sequence "gctagctagctagctagctcttagctgtagtcgtagctgatgctagct
gatgctagctagctagctagctgatcgatgctagctgatcgtagctgacg
gactgatgctagctagctagctagctgtctagtgtcgtagtgcttattgc"
}
LOCUS MMFOSB 4145 bp mRNA linear ROD 12-SEP-1993
DEFINITION Mouse fosB mRNA.
ACCESSION X14897
VERSION X14897.1 GI:50991
KEYWORDS fos cellular oncogene; fosB oncogene; oncogene.
SOURCE Mus musculus.
ORGANISM Mus musculus
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE 1 (bases 1 to 4145)
AUTHORS Zerial,M., Toschi,L., Ryseck,R.P., Schuermann,M., Muller,R. and Bravo,R.
TITLE The product of a novel growth factor activated gene, fos B, interacts with JUN proteins enhancing their DNA binding activity
JOURNAL EMBO J. 8 (3), 805-813 (1989)
MEDLINE 89251612
PUBMED 2498083
COMMENT clone=AC113-1; cell line=NIH3T3.
FEATURES Location/Qualifiers
source 1..4145
/organism="Mus musculus"
/db_xref="taxon:10090"
CDS 1202..2218 /note="fosB protein (AA 1-338)" /codon_start=1 /protein_id="CAA33026.1" /db_xref="GI:50992" /db_xref="MGD:95575" /db_xref="SWISS-PROT:P13346" /translation="MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQEC AGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGT SYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRV RRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAH KPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNL TASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPS LLAL" BASE COUNT 960 a 1186 c 1007 g 991 t 1 others ORIGIN 1 ataaattctt attttgacac tcaccaaaat agtcacctgg aaaacccgct ttttgtgaca 61 aagtacagaa ggcttggtca catttaaatc actgagaact agagagaaat actatcgcaa 121 actgtaatag acattacatc cataaaagtt tccccagtcc ttattgtaat attgcacagt 181 gcaattgcta catggcaaac tagtgtagca tagaagtcaa agcaaaaaca aaccaaagaa 241 aggagccaca agagtaaaac tgttcaacag ttaatagttc aaactaagcc attgaatcta 301 tcattgggat cgttaaaatg aatcttccta caccttgcag tgtatgattt aacttttaca 361 gaacacaagc caagtttaaa atcagcagta gagatattaa aatgaaaagg tttgctaata 421 gagtaacatt aaataccctg aaggaaaaaa aacctaaata tcaaaataac tgattaaaat 481 tcacttgcaa attagcacac gaatatgcaa cttggaaatc atgcagtgtt ttatttaaga 541 aaacataaaa caaaactatt aaaatagttt tagagggggt aaaatccagg tcctctgcca 601 ggatgctaaa attagacttc aggggaattt tgaagtcttc aattttgaaa cctattaaaa 661 agcccatgat tacagttaat taagagcagt gcacgcaaca gtgacacgcc tttagagagc 721 attactgtgt atgaacatgt tggctgctac cagccacagt caatttaaca aggctgctca 781 gtcatgaact taatacagag agagcacgcc taggcagcaa gcacagcttg ctgggccact 841 ttcctccctg tcgtgacaca atcaatccgt gtacttggtg tatctgaagc gcacgctgca 901 ccgcggcact gcccggcggg tttctgggcg gggagcgatc cccgcgtcgc cccccgtgaa 961 accgacagag cctggacttt caggaggtac agcggcggtc tgaaggggat ctgggatctt 1021 gcagagggaa cttgcatcga aacttgggca gttctccgaa ccggagacta agcttccccg 1081 agcagcgcac tttggagacg tgtccggtct actccggact cgcatctcat tccactcggc 1141 catagccttg gcttcccggc gacctcagcg tggtcacagg ggcccccctg tgcccaggga 1201 aatgtttcaa gcttttcccg 
gagactacga ctccggctcc cggtgtagct catcaccctc 1261 cgccgagtct cagtacctgt cttcggtgga ctccttcggc agtccaccca ccgccgccgc 1321 ctcccaggag tgcgccggtc tcggggaaat gcccggctcc ttcgtgccaa cggtcaccgc 1381 aatcacaacc agccaggatc ttcagtggct cgtgcaaccc accctcatct cttccatggc 1441 c
• Major Challenges:
1. Various representation
2. Semi-structured data
![Page 42: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/42.jpg)
Schema Descriptors
• Follow XML DTD standard for semi-structured data
• Simple attribute list for relational data
<?xml version='1.0' encoding='UTF-8'?>
<!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT DESCRIPTION (#PCDATA)>
<!ELEMENT SEQ (#PCDATA)>

[FASTA] //Schema Name
ID = string //Data type definitions
DESCRIPTION = string
SEQ = string
![Page 43: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/43.jpg)
Layout Descriptors
• Overall structure (FASTA example)
DATASET “FASTAData” { //Dataset name
  DATATYPE {FASTA} //Schema name
  DATASPACE LINESIZE=80 {
    // ---- File layout details go here ----
  }
  DATA {osu/fasta} //File location
}
![Page 44: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/44.jpg)
File Layout
• Key observations on line-based biological data files
  – Strings of variable length
  – Delimiters widely used
  – Data fields may be divided into variables
  – Repetitive structures
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …
![Page 45: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/45.jpg)
Layout Descriptors
• File layout (FASTA example)
DATASPACE LINESIZE=80 {
  <
    “>” ID “ ” DESCRIPTION
    < “\n” SEQ >
    “\n” | EOF
  >
}
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …
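To make the FASTA layout concrete, the sketch below is a hand-written reader that follows the same delimiters: “>” opens an entry and precedes ID, the first space precedes DESCRIPTION, and each following line contributes to SEQ. It is not the generated code described in this talk, and the function name `read_fasta_entries` and the dictionary representation of an entry are assumptions.

```python
def read_fasta_entries(text):
    """Parse FASTA text into a list of {'ID', 'DESCRIPTION', 'SEQ'} entries."""
    entries = []
    current = None
    for line in text.splitlines():
        if line.startswith(">"):
            # ">" delimits ID; the first space delimits DESCRIPTION.
            identifier, _, description = line[1:].partition(" ")
            current = {"ID": identifier,
                       "DESCRIPTION": description.strip(),
                       "SEQ": ""}
            entries.append(current)
        elif current is not None:
            # Each newline-delimited line is one repetition of the SEQ variable.
            current["SEQ"] += line.strip()
    return entries
```

A descriptor-driven system would derive this logic from the layout descriptor rather than hard-coding the delimiters as done here.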
![Page 46: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/46.jpg)
System Component
• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Metadata description language
  – Wrapper generation
  – Query execution
  – Query execution with indices
![Page 47: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/47.jpg)
Wrapper Generation Road Map
• Motivation and overview
• System structure
• Wrapper generation
• Wrapper execution
• Experiments
![Page 48: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/48.jpg)
Wrapper Generation Motivation
• Wrappers are essential for bioinformatics integration
  – Heterogeneous data sources
  – Function: transform data
• Current solutions
  – Manually written wrappers
  – Scripts
![Page 49: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/49.jpg)
Wrapper Generation Advantages
• Wrapper generated automatically
  – Stand-alone programs for integration systems and workflows
  – Little human interference. New resources can be integrated on-the-fly
  – Direct transformation. No unnecessary intermediate form needed
  – Only requires data description at metadata level, one descriptor/data source
• Transfer data from flat files directly
  – No DB support required
  – No other domain or format heuristics
![Page 50: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/50.jpg)
Wrapper Generation System Overview
[Figure: wrapper generation system. Components shown: Schema Descriptors, Layout Descriptor, Mapping File, Layout Parser, Mapping Generator, Mapping Parser, Schema Mapping, Data Entry Representation, Application Analyzer, WRAPINFO, and the generated wrapper (DataReader, Synchronizer, DataWriter) between Source Dataset and Target Dataset]
![Page 51: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/51.jpg)
Layout Parse Tree
• FASTA example
DATASPACE LINESIZE=80 {
  <
    “>” ID “ ” DESCRIPTION
    < “\n” SEQ >
    “\n” | EOF
  >
}
[Figure: parse tree of the layout above]
DATASPACE root (linesize = 80)
  < >
    “>”-ID
    “ ”-DESCRIPTION
    < >
      “\n”-SEQ
    “\n”-DUMMY | EOF
Leaf: delimiter-variable (DLM-VAR) pair
Internal node: environment
![Page 52: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/52.jpg)
Schema Mapping
• Algorithm: strict name matching
for field ft in target schema
  for field fs in source schema
    if ft = fs then add pair (fs, ft) to the mapping
• Output
  – A list of attribute pairs
  – An editable file for user to verify and modify
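The strict name matching step is small enough to show in full. The sketch below returns the list of attribute pairs; writing the editable mapping file is left out, and the function name `strict_name_mapping` is an assumption.

```python
def strict_name_mapping(source_fields, target_fields):
    """Pair each target field with the source field of exactly the same name.

    Returns (source_field, target_field) pairs in target-schema order.
    Fields without an exact match are left for the user to map by hand.
    """
    source_names = set(source_fields)
    return [(name, name) for name in target_fields if name in source_names]
```

Using a set for the source names gives the same result as the nested loop on the slide while avoiding the quadratic scan.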
![Page 53: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/53.jpg)
Wrapping Assumptions
• Convert semi-structured (and structured) data to structured data
• Both datasets are stored record-wise
• Order of records not disturbed after wrapping
Semi-structured → Structured
Data can be transformed entry by entry
![Page 54: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/54.jpg)
Application Analyzer
• Task: to generate clear directions for wrapper and organize them in WRAPINFOR
• Sub-tasks
  – What values to store
  – How to extract values
  – How to store values
  – How to write values
![Page 55: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/55.jpg)
Important Concepts (1)
• “Useful”
  – An attribute is useful iff its values are in target
• “Reachable”
  – Node b is reachable from node a if there exists a valid layout configuration such that a.DLM and b.DLM define the boundaries of a.VAR, i.e. “… a.DLM a.VAR b.DLM …”
  – A value instance is between
    • Its own delimiter
    • The first appearance of its reachable delimiters
![Page 56: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/56.jpg)
Important Concepts (2)
• Attribute Cardinality
  – Regular attribute: fixed number of values per entry
    • ID
  – Semi-structured attribute: varied number of values per entry
    • References
![Page 57: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/57.jpg)
WRAPINFOR
• Contents: information to answer a particular wrapping task
• Forms: in XML
  – 5 look-up tables
    • Delimiter, Usefulness, Cardinality, Label, Reachable
  – 3 parameters
    • one_to_one_total, one_to_multiple_total, complete_in
• Function: plug into general modules to form a functional wrapper
![Page 58: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/58.jpg)
Wrapper Generation Road Map
• Motivation and overview of our approach
• System structure
• Wrapper generation
• Wrapper execution
• Experiments
![Page 59: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/59.jpg)
Wrapper Overview
[Figure: wrapper overview. Components shown: Input dataset, Dataset buffer, DataReader, Value buffer (one_to_one_values, one_to_multiple_values), DataWriter, Output dataset, Synchronizer. Transition labels: load, run, FARA, run, RA, halt]
![Page 60: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/60.jpg)
Wrapper Structure
• One data module: WRAPINFO
• Three general action modules
  – Synchronizer: central controller
  – DataReader, DataWriter: interact with datasets
• One value buffer
• Suitable for data grid
• Transform data one entry at a time
![Page 61: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/61.jpg)
Wrapper Execution
• DataReader
  – Extract attribute value
    • Delimiter table + Reachable table
  – Fill value buffer: Label look-up table
• DataWriter
  – Retrieve from value buffer: Label look-up table
  – Write target file
    • Delimiter table + Reachable table + label table
• Synchronizer
  – Call DataReader on source: parameters
  – Call DataWriter on target: parameters
![Page 62: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/62.jpg)
Wrapper Experiments (1)
TRANSFAC-to-Reference Problem
[Figure: performance plot, both axes in logarithm]
• Analysis time constant
• Execution time linear
![Page 63: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/63.jpg)
Wrapper Experiments (2)
SWISSPROT-to-FASTA Problem
• Performance comparable to handwritten code
![Page 64: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/64.jpg)
System Components
• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Metadata description language
  – Wrapper generation
  – Query execution
  – Query execution with indices
![Page 65: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/65.jpg)
Query Execution Road Map
• Motivation
• System Overview
• System Implementation
  – Languages
  – System
• Experiments
![Page 66: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/66.jpg)
Limitation of Wrapper
• Data Wrapping = Data formatting + Data projection
• Other query types
  – Selection
  – Cross Product
  – Join
New Functionalities
• Value examination
• Multiple datasets
![Page 67: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/67.jpg)
Advantages
• Retrieve multiple pieces of information all at once
• Data easily available
• Declarative languages only
• High flexibility
• Low over-head
• Suitable for data grid
![Page 68: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/68.jpg)
System
[Figure: query system in two stages, Query analysis and Query execution. Components shown: Enhanced query, Query parser, Metadata collection, Dataset descriptors, Descriptor parser, Source/target names, Schema & Layout information, mappings, Application analyzer, QUERYINFOR, DataReader, Synchronizer, DataWriter, Source data files, Target data file]
![Page 69: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/69.jpg)
Query Execution Road Map
• Motivation
• System Overview
• System Implementation
  – Languages
    • Metadata Description Language
    • Query Language
  – System
    • Query Analysis
    • Query Execution
• Experiments
![Page 70: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/70.jpg)
Query Language
• Declarative, SQL-like
• Projection, selection, cross product, join queries
• Example:
AUTOWRAP POSTBLAST
FROM BLASTP, SWISSPROT
BY BLASTP.SP_ID = SWISSPROT.ID
WHERE
POSTBLAST.QUERY = BLASTP.QUERY
POSTBLAST.SP_AC = BLASTP.SP_AC
POSTBLAST.SP_ID = BLASTP.SP_ID
POSTBLAST.FULL_DESCR = SWISSPROT.DE
POSTBLAST.SEQUENCE = SWISSPROT.SQ
POSTBLAST.SCORE = BLASTP.SCORE
POSTBLAST.E_VALUE = BLASTP.E_VALUE
• Clause labels: AUTOWRAP = target dataset; FROM = source datasets; BY = join criteria; WHERE = attribute pairs
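As a rough illustration of how the four parts of such a query could be separated, here is a minimal Python sketch. The function name `parse_autowrap` and the one-clause-per-line layout are assumptions made for this example; the slides do not show the system's actual query parser.

```python
def split_pair(text):
    """Split 'A.X = B.Y' into ('A.X', 'B.Y')."""
    left, _, right = text.partition("=")
    return left.strip(), right.strip()


def parse_autowrap(query):
    """Parse an AUTOWRAP query written with one clause or attribute pair per line.

    Hypothetical helper: returns the target dataset, the source datasets,
    the join criteria, and the attribute pairs.
    """
    result = {"target": None, "sources": [], "join": [], "pairs": []}
    in_where = False
    for raw in query.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        keyword, _, rest = line.partition(" ")
        if keyword == "AUTOWRAP":
            result["target"] = rest.strip()
        elif keyword == "FROM":
            result["sources"] = [name.strip() for name in rest.split(",")]
        elif keyword == "BY":
            result["join"].append(split_pair(rest))
        elif keyword == "WHERE":
            in_where = True
            if rest.strip():
                result["pairs"].append(split_pair(rest))
        elif in_where:
            # Every line after WHERE is a target = source attribute pair.
            result["pairs"].append(split_pair(line))
        else:
            raise ValueError(f"unexpected line: {line!r}")
    return result


example = """AUTOWRAP POSTBLAST
FROM BLASTP, SWISSPROT
BY BLASTP.SP_ID = SWISSPROT.ID
WHERE
POSTBLAST.QUERY = BLASTP.QUERY
POSTBLAST.SEQUENCE = SWISSPROT.SQ"""
parsed = parse_autowrap(example)
```

Here `parsed["target"]` holds the target dataset name and `parsed["pairs"]` the list of target/source attribute pairs.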
![Page 71: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/71.jpg)
Application Analyzer Enhancement
• Constant values in query
  – Pseudo-label look-up table
• Other query information
  – Parameters: comparing field pairs
• Output: QUERYINFOR
![Page 72: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/72.jpg)
Query Execution
• Query-Proc Structure
• DataReader and DataWriter
  – Similar to wrapper
• Value buffer
  – Stores useful values from one data entry of every source dataset
[Figure: Query-Proc structure showing QUERYINFOR, DataReader, DataWriter, Synchronizer, source data files, and target data file]
![Page 73: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/73.jpg)
Enhanced Synchronizer
• Synchronizer
  – Set up pseudo-attributes: pseudo-label look-up table
  – Call DataReader on source 1 and 2; call DataWriter on target: parameters
  – Test join conditions: parameters
  – Clean value buffer: parameters
![Page 74: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/74.jpg)
Post-BLAST Query
• Goal: Enhance BLAST output to FASTA format
• Query: Join query between BLAST output (source 1) and SWISSPROT (source 2)
• 2 modes
  – UNIQUE: halt once a match is found in source 2
  – ALL: search all source 2 entries
[Figure: time (sec, axis 0 to 2000) versus query size (3, 5, 12 sequences) for the UNIQUE and ALL modes]
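The difference between the UNIQUE and ALL modes can be sketched as a nested-loop join. This is only an illustration under assumed data structures (entries as Python dictionaries held in memory); the actual query-proc program reads flat files through DataReader and DataWriter, and the function name `nested_loop_join` is hypothetical.

```python
def nested_loop_join(source1, source2, key1, key2, mode="ALL"):
    """Join two lists of dict entries on source1[key1] == source2[key2].

    mode="UNIQUE": stop scanning source 2 after the first match.
    mode="ALL":    scan every source 2 entry for each source 1 entry.
    """
    if mode not in ("UNIQUE", "ALL"):
        raise ValueError("mode must be 'UNIQUE' or 'ALL'")
    results = []
    for entry1 in source1:
        for entry2 in source2:
            if entry1[key1] == entry2[key2]:
                # Merge the buffered values of both entries into one output entry.
                results.append({**entry1, **entry2})
                if mode == "UNIQUE":
                    break
    return results


# Illustrative data only, not taken from BLAST or Swiss-Prot.
blast_hits = [{"SP_ID": "A"}, {"SP_ID": "B"}]
proteins = [
    {"ID": "A", "DE": "first"},
    {"ID": "A", "DE": "second"},
    {"ID": "B", "DE": "third"},
]
all_rows = nested_loop_join(blast_hits, proteins, "SP_ID", "ID", mode="ALL")
unique_rows = nested_loop_join(blast_hits, proteins, "SP_ID", "ID", mode="UNIQUE")
```

UNIQUE saves work only when a match occurs before the end of source 2; in the worst case both modes scan source 2 completely for every source 1 entry.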
![Page 75: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/75.jpg)
Chip-Supplement Query
• Goal: Look up information on microarray genes and output it in tabular format
• Query: Join query between protein array and yeast genome database
• 2 queries
  – Chip-Supplement: array join genome
  – Chip-Supplement-Sorted: genome join array
[Figure: time (sec, axis 0 to 90) by query type (Chip-Supplement, Chip-Supplement-Sorted) for the UNIQUE and ALL modes]
![Page 76: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/76.jpg)
OMIM-Plus Query
• Add reverse links of proteins to disease database
• Join query between OMIM database and SWISSPROT database
• Results in OMIM form
• 86.38 seconds/entry × 12,158 OMIM entries ≈ 291.7 hours
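Worked out step by step (assuming the total is the per-entry time multiplied by the number of entries), the arithmetic is:

$$
86.38\ \tfrac{\text{s}}{\text{entry}} \times 12{,}158\ \text{entries} \approx 1{,}050{,}208\ \text{s}
= \frac{1{,}050{,}208}{3600}\ \text{h} \approx 291.7\ \text{h} \approx 12.2\ \text{days}
$$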
![Page 77: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/77.jpg)
System Components
• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Metadata description language
  – Wrapper generation
  – Query execution
  – Query execution with indices
![Page 78: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/78.jpg)
Query with Indices Road Map
• Motivation and Overview
• System
• System Enhancement
  – Language
  – System Implementation
• Experiments
![Page 79: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/79.jpg)
Query With Indices Motivation
• Goal
  – Improve the performance of query-proc program
• Index
  – Maintain the advantages
    • Flat file based
    • Low requirement on programming
![Page 80: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/80.jpg)
Challenges & Approaches
• Various indexing algorithms for various biological data
  – User defined indexing functions
  – Standard function interfaces
• Flat file data
  – Values parsed implicitly and ready to be indexed
  – Byte offset as pointer
• Metadata about indices
  – Layout descriptor
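The "byte offset as pointer" idea can be sketched in Python as follows. This is a minimal illustration, assuming one entry per line and a caller-supplied key extractor; `build_offset_index` and `fetch_at` are hypothetical names, not functions from the system described here.

```python
import io


def build_offset_index(f, key_of):
    """Scan a binary file object and map each key to the byte offsets
    of the lines where it occurs."""
    index = {}
    while True:
        offset = f.tell()          # byte position where this line starts
        line = f.readline()
        if not line:
            break
        key = key_of(line.decode("utf-8"))
        if key is not None:
            index.setdefault(key, []).append(offset)
    return index


def fetch_at(f, offset):
    """Jump straight to a stored byte offset and read that entry."""
    f.seek(offset)
    return f.readline().decode("utf-8").rstrip("\n")


# Illustrative two-line flat file (made-up identifiers).
flat_file = io.BytesIO(b"G1\talpha\nG2\tbeta\n")
index = build_offset_index(flat_file, lambda line: line.split("\t")[0])
entry = fetch_at(flat_file, index["G2"][0])
```

Because the index stores only keys and offsets, the data itself stays in the original flat file and is read on demand with a single seek.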
![Page 81: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/81.jpg)
System Revisit
[Figure: system architecture diagram, extended with indices]
• Query analysis: query, query parser, metadata collection, dataset descriptors, descriptor parser, application analyzer; intermediate information: source/target names, schema & layout information, mappings; output: QUERYINFOR
• Query execution: Synchronizer, DataReader, DataWriter; source data files, target data file
• New components: index file, index functions
![Page 82: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/82.jpg)
Language Enhancement
• Describe indices
  – Indexing is a property of dataset
  – Extend layout descriptors
  – Maintain query format
DATASET “name” {
  …
  INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
    [, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
}
AUTOWRAP GNAMES
FROM CHIPDATA, YEASTGENOME
BY CHIPDATA.GENE = YEASTGENOME.ID
WHERE …
New meaning of “=”:
  If index available, use index retrieving function
  Else, compare values directly
![Page 83: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/83.jpg)
System Enhancement
• Metadata Descriptor Parser
  + parse index information
• Application Analyzer
  + index information: index look-up table
  + test condition: compare_field_indexing
![Page 84: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/84.jpg)
Query-Proc Enhancement
• Synchronizer
  + If index is applicable, check availability of index data file
    • If not available, call index generation function
  + Load indices
  + Call index retrieving function first for candidate entry list
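The Synchronizer steps on this slide (check for the index file, generate it if missing, load it, then ask the retrieving function for candidates) might look roughly like the Python sketch below. The JSON index format and the names `load_or_build_index` and `match` are assumptions for illustration; in the described system the generation and retrieving functions are user defined.

```python
import json
import os


def load_or_build_index(index_path, generate):
    """Load the index file, calling the generation function first if it is missing."""
    if not os.path.exists(index_path):
        with open(index_path, "w") as out:
            json.dump(generate(), out)
    with open(index_path) as f:
        return json.load(f)


def match(value, entries, key, index=None, retrieve=None):
    """Evaluate 'entry[key] = value'.

    With an index: ask the retrieving function for candidate positions first.
    Without an index: compare every entry directly.
    """
    if index is not None and retrieve is not None:
        candidates = [entries[pos] for pos in retrieve(index, value)]
    else:
        candidates = entries
    return [entry for entry in candidates if entry[key] == value]


def lookup(index, value):
    """Example retrieving function: exact-match lookup in a dict index."""
    return index.get(value, [])
```

A possible call sequence is `index = load_or_build_index("genes.idx", make_index)` followed by `match("G2", entries, "ID", index, lookup)`, where `make_index` stands for a user-supplied generation function and `"genes.idx"` is a placeholder path.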
![Page 85: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/85.jpg)
Microarray Gene Information Look-up
• Goal: gather information about genes (120)
• Query: microarray output join genome database
• Index: gene names in genome
[Figure: bar chart of performance (sec, axis 0 to 90); bars and data labels in the order extracted: query analysis 0.01, index generation 0.72, query with indices 20.89, query w/o indices 81.59]
![Page 86: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/86.jpg)
BLAST-ENHANCE Query
• Goal: Add extra information to BLAST output
• Query: BLAST output join Swiss-Prot database
• Index: protein ID in Swiss-Prot
[Figure: performance (sec, axis 0 to 1200) versus query size (3, 5, 12); series: index generation, query w/ indices, query w/o indices]
![Page 87: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/87.jpg)
OMIM-PLUS Query
• Goal: add Swiss-Prot link to OMIM
• Query: OMIM join Swiss-Prot
• Index: protein ID in Swiss-Prot
[Figure: performance (sec, logarithmic axis from 1 to 10,000,000); bars: index generation, query w/ indices, query w/o indices]
![Page 88: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/88.jpg)
Homology Search Query
• Goal: find similar sequences
• Query: query sequence list * sequence database
• Indexing algorithm
  – Sequence-based
  – Transformation of sub-string composition
  – Indexing n-D numerical values
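To make "transformation of sub-string composition" into "n-D numerical values" concrete, the sketch below turns a DNA sequence into a fixed-length vector of k-mer frequencies. This is a generic illustration chosen for this example, not the specific indexing algorithms used in the experiments; the function name `kmer_vector` and the choice of k are assumptions.

```python
from itertools import product


def kmer_vector(sequence, k=2, alphabet="ACGT"):
    """Return the relative frequency of every length-k substring over the alphabet.

    The result always has len(alphabet) ** k dimensions, so sequences of any
    length map to points in the same numerical space.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:       # skip substrings with symbols outside the alphabet
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


vector = kmer_vector("ACGTACGT", k=2)
```

Vectors of this kind can then be placed in a multi-dimensional index structure so that similar sequences are found by searching nearby points.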
![Page 89: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/89.jpg)
Homology Search (1)
• Index (Singh’s algorithm)
  – Data: yeast genome
  – Wavelet coefficients
  – Minimum bounding rectangles
[Figure: performance (sec, axis 0 to 350) versus database size (ticks 1 to 5, axis label "Database size (9.8MB)"); series: Index generation, 10, 20, 40]
![Page 90: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/90.jpg)
Homology Search (2)
• Index (Ferhatosmanoglu’s algorithm)
  – Data: GenBank
  – Wavelet coefficients
  – Scalar quantization
  – R-tree
[Figure: performance (sec, axis 0 to 30) versus database size (ticks 1 to 5, axis label "Database size (250MB)"); series: 10, 20, 40]
![Page 91: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/91.jpg)
Road Map
• Mission Statement
• Motivation
• Implementation
• Comprehensive Example
• Future work
• Conclusion
![Page 92: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/92.jpg)
Gene Name Nomenclature
• It is crucial to identify genes CORRECTLY and UNAMBIGUOUSLY
  – Genes with multiple names
  – Multiple genes sharing the same name
• Historically, little central control over the naming process
“…As biologists strive to make sense of the growing wealth of genomic information, this messy nomenclature is becoming a bugbear…”
Helen Pearson, Nature, 2001
![Page 93: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/93.jpg)
Gene Name in DBs
• Databases related to genes
  – Genome databases (main force in nomenclature)
    • SGD (yeast)
    • HGNC (human)
    • TAIR (a plant)
    • dictyBase (a one-cell amoeba)
  – Curated gene databases
    • Entrez Gene by NCBI
  – Curated gene product databases
    • Swiss-Prot by SIB and EBI
![Page 94: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/94.jpg)
Queries About Gene Name
• Gene identifier usage in databases
  – How are gene symbols in DB A used in DB B?
  – How are gene aliases in DB A used in DB B?
• Nomenclature across species
  – Q1-Q2: genome – Entrez Gene, Swiss-Prot
  – Q3-Q4: Entrez Gene – Swiss-Prot
• Nomenclature over time
  – Q5-Q7: Swiss-Prot – genome
![Page 95: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/95.jpg)
Challenges
• Various data representations
  – Line-based texts
  – Tabular forms with or without title
  – Format evolves over time
• Data storage
  – Large volume
  – Each file is queried only a limited number of times
• Approaches
  – Metadata descriptors
  – Format and schema learning
  – Flat file processing
![Page 96: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/96.jpg)
Integration System Revisit
[Figure: information integration system diagram]
• Understand Data: Data File, Layout Miner, Schema Miner, Metadata Description (a Layout Descriptor and a Schema Descriptor per dataset)
• Process Data: User Request, Code Generation, Query Processor
• Datasets: Genome, Entrez Gene, Swiss-Prot
• Queries: join queries
![Page 97: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/97.jpg)
Nomenclature Results (1)
• Across Species
[Figure, panel Q1-Q2: percentage (%, axis 0 to 90) for Entrez Gene ID, Entrez Gene Alias, Swiss-Prot ID, Swiss-Prot Alias; series: SGD, HGNC, TAIR, dictyBase]
[Figure, panel Q3-Q4: percentage (%, axis 0 to 60) for Swiss-Prot ID, Swiss-Prot Alias; series: SGD, HGNC, TAIR, dictyBase]
![Page 98: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/98.jpg)
Nomenclature Results (2)
• Over time
Q5: How many gene IDs in Swiss-Prot are gene IDs in genome?
Q6: How many gene IDs in Swiss-Prot are aliases in genome?
Q7: How many gene aliases in Swiss-Prot are gene IDs in genome?
![Page 99: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/99.jpg)
Performance
• Linear w.r.t. source 1 size
![Page 100: Supporting on-the-fly data Integration for bioinformatics](https://reader036.vdocuments.site/reader036/viewer/2022062301/568145e9550346895db2eb47/html5/thumbnails/100.jpg)
Conclusion
• A framework and a set of tools for on-the-fly flat file data integration
  – New data sources understood semi-automatically by data mining tools
  – New data processed automatically by generated programs
• Advantages
  – High-level interface, flat file based, acceptable performance, low maintenance cost