Toward Making Online Biological Data Machine Understandable
Cui Tao
Data Extraction Research Group, Department of Computer Science, Brigham Young University, Provo, UT 84602
Introduction
Source Location by Semantic Indexing
Contact Information
Data Extraction Research Group, Department of Computer Science, Brigham Young University, Provo, UT 84602
Cui Tao, [email protected]
http://www.deg.byu.edu/
Conclusions
PROBLEMS:
Huge, evolving number of bio-databases, e.g. the molecular biology database collection
2004: total 548, 162 more than 2003
2005: total 719, 171 more than 2004
Different access capabilities
Syntactic heterogeneity
Semantic heterogeneity
Updated at any time by independent authorities
SOLUTION:
Source page understanding
Table Interpretation
Aligning with an ontology
Source location through semantic annotation
Metadata vs. instance data annotation
Use of annotation in query processing
Ontology evolution
Adjustments to ISA and Part-Of hierarchies
Addition of attributes
GOALS:
To help biologists search across various resources
Examples:
Cross-linked information (join queries)
“Find genes that are longer than 5 kbp, whose products have at least two helices, and that participate in glycolysis” – GenBank, PDB, KEGG
Collecting information from similar data sources (union queries)
“Find genes newly annotated after Jan. 2003 in the fly and worm genomes” – FlyBase, WormBase
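The two example queries above each depend on knowing which sources cover which concepts. A minimal sketch of that lookup follows, assuming a simple in-memory index from ontology concepts to annotated sources. The class name `SemanticIndex`, the concept names, and the concept-to-source annotations are illustrative assumptions, not the poster's actual index.

```python
from collections import defaultdict


class SemanticIndex:
    """Toy index from ontology concepts to the sources annotated with them."""

    def __init__(self):
        self._sources_by_concept = defaultdict(set)

    def annotate(self, source, concepts):
        """Record that a source carries data for each of the given concepts."""
        for concept in concepts:
            self._sources_by_concept[concept].add(source)

    def locate(self, query_concepts):
        """Return, per query concept, the sorted list of sources annotated with it."""
        return {
            concept: sorted(self._sources_by_concept.get(concept, ()))
            for concept in query_concepts
        }


index = SemanticIndex()
# Hypothetical annotations; which concepts each source really covers is assumed.
index.annotate("GenBank", ["Gene", "Gene Length"])
index.annotate("PDB", ["Protein", "Helix"])
index.annotate("KEGG", ["Gene", "Pathway"])
index.annotate("FlyBase", ["Gene", "Annotation Date"])
index.annotate("WormBase", ["Gene", "Annotation Date"])

# Join query: each concept is answered by a different source.
join_plan = index.locate(["Gene Length", "Helix", "Pathway"])
# Union query: one concept is answered by several similar sources.
union_plan = index.locate(["Annotation Date"])
```

In this sketch a join query shows up as several concepts that resolve to different sources, while a union query shows up as one concept that resolves to more than one source.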
[Figure: two sample sibling tables and their DOM trees (table, tr, td nodes)]

Sibling table 1 (one data row):
Gene Model | Status | Nucleotides (coding/transcript) | Protein | Swissprot | Amino Acids
F47G6.1 1, 2 | confirmed by cDNA(s) | 1773/7391 bp | WP:CE26812 | DTN1_CAEEL | 590 aa

Sibling table 2 (two data rows):
Gene Model | Status | Nucleotides (coding/transcript) | Protein | Amino Acids
F18H3.5b 1, 2, 3 | confirmed by cDNA(s) | 1029/3051 bp | WP:CE18608 | 342 aa
F18H3.5a 1, 2 | partially confirmed by cDNA(s) | 1221/1704 bp | WP:CE28918 | 406 aa
SAMPLE ONTOLOGY OBJECT RECOGNITION
Key Concepts: sample ontology object, expected values
Steps:
Map the values with the sample ontology object set
Map the labels with the ontology concepts
Understand all pages from the same web site
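The first two steps above can be sketched as a value-driven label mapping: a page value that matches an expected value of the sample ontology object tells us which ontology concept the page label stands for. This is a simplified illustration, not the poster's implementation. The function names, the token-subset matching rule, and the pairing of `DTN1_CAEEL` with the Protein Name concept are assumptions.

```python
import re


def tokens(text):
    """Split text into a set of lowercase alphanumeric tokens (underscores kept)."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def recognize(sample_object, page_fields):
    """Map page labels to ontology concepts through matching values.

    sample_object: concept -> expected value of the sample ontology object
    page_fields:   page label -> value found on a page describing that object
    A label is mapped when every token of an expected value occurs in its value.
    """
    mapping = {}
    for label, value in page_fields.items():
        page_tokens = tokens(value)
        for concept, expected in sample_object.items():
            expected_tokens = tokens(expected)
            if expected_tokens and expected_tokens <= page_tokens:
                mapping[label] = concept
                break
    return mapping


# Assumed sample ontology object (partial information).
sample_object = {"Protein Name": "DTN1_CAEEL", "Length in Amino Acid": "590"}

# Label/value pairs as they might be read from one source page.
page_fields = {
    "Gene Model": "F47G6.1",
    "Status": "confirmed by cDNA(s)",
    "Nucleotides (coding/transcript)": "1773/7391 bp",
    "Swissprot": "DTN1_CAEEL",
    "Amino Acids": "590 aa",
}

label_to_concept = recognize(sample_object, page_fields)
```

Once the labels of one page are mapped, the same mapping can in principle be reused for the other pages of the site, since sibling pages share their labels.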
Ontology Evolution
Source Page Understanding
Key Concepts: sibling pages and sibling tables
Main Idea:
Compare two sibling tables:
Variable fields correspond to values; fixed fields correspond to labels
The structure pattern for one pair of sibling tables is generalized to a structure pattern for all sibling tables
SIBLING PAGE COMPARISON
Steps:
Transfer each HTML table to a DOM tree
Find sibling tree pairs
Compare and find matched nodes
Generate a structure pattern for all sibling tables
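The comparison step can be sketched with Python's standard `html.parser`. This is a simplified illustration under stated assumptions: each table is reduced to cells addressed by (row, column) rather than a full DOM tree, nested tables are not handled, and the names `TableParser`, `parse_table`, and `compare_siblings` are hypothetical. The cell values are taken from the sample tables, cut down to one data row each.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each td/th cell, keyed by its (row, column) position."""

    def __init__(self):
        super().__init__()
        self.cells = {}
        self._row = -1
        self._col = -1
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = -1
        elif tag in ("td", "th"):
            self._col += 1
            self._in_cell = True
            self.cells[(self._row, self._col)] = ""

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[(self._row, self._col)] += data


def parse_table(html):
    """Return {(row, col): whitespace-normalized cell text} for one HTML table."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return {pos: " ".join(text.split()) for pos, text in parser.cells.items()}


def compare_siblings(table_a, table_b):
    """Split shared cell positions into fixed fields (labels) and variable fields (values)."""
    labels, values = [], []
    for pos in sorted(table_a.keys() & table_b.keys()):
        (labels if table_a[pos] == table_b[pos] else values).append(pos)
    return labels, values


page_1 = """<table>
<tr><td>Status</td><td>Nucleotides (coding/transcript)</td><td>Protein</td><td>Amino Acids</td></tr>
<tr><td>confirmed by cDNA(s)</td><td>1773/7391 bp</td><td>WP:CE26812</td><td>590 aa</td></tr>
</table>"""
page_2 = """<table>
<tr><td>Status</td><td>Nucleotides (coding/transcript)</td><td>Protein</td><td>Amino Acids</td></tr>
<tr><td>partially confirmed by cDNA(s)</td><td>1221/1704 bp</td><td>WP:CE28918</td><td>406 aa</td></tr>
</table>"""

table_1, table_2 = parse_table(page_1), parse_table(page_2)
labels, values = compare_siblings(table_1, table_2)
label_texts = [table_1[pos] for pos in labels]
```

A single pair can misclassify a value that happens to be identical on both pages (for example, two genes that are both "confirmed by cDNA(s)"), which is one reason to generalize the pattern over all sibling tables rather than rely on one pair.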
Source Organism
Accession Number
Protein Name
Length in Amino Acid
Molecular Weight in Da
ProtoNet
Semantic Web
Semantic annotation
Query
META-DATA ANNOTATION
DATA ANNOTATION
Likely to have “imperfect” ontologies
Can enrich semi-automatically
Two possibilities:
Value enrichment
Object-set and relationship-set enrichment
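Value enrichment can be sketched as a two-phase update of per-concept value lexicons: first propose values that the ontology has not seen, then add only the accepted ones, which keeps the process semi-automatic. The function names, the page label `Organism`, and the example organisms are hypothetical and chosen only for illustration.

```python
def propose_values(lexicons, label_to_concept, page_fields):
    """List (concept, value) pairs found on a page but missing from the lexicons.

    lexicons:         concept -> set of known values
    label_to_concept: page label -> ontology concept it was mapped to
    page_fields:      page label -> value found on the page
    """
    proposals = []
    for label, value in page_fields.items():
        concept = label_to_concept.get(label)
        if concept is None:
            continue  # unmapped labels cannot enrich any lexicon
        if value not in lexicons.get(concept, set()):
            proposals.append((concept, value))
    return proposals


def accept_values(lexicons, accepted):
    """Add reviewed (concept, value) pairs to the lexicons in place."""
    for concept, value in accepted:
        lexicons.setdefault(concept, set()).add(value)


# Hypothetical starting lexicon and page content.
lexicons = {"Source Organism": {"Caenorhabditis elegans"}}
label_to_concept = {"Organism": "Source Organism"}
page_fields = {"Organism": "Drosophila melanogaster", "Comment": "n/a"}

proposals = propose_values(lexicons, label_to_concept, page_fields)
accept_values(lexicons, proposals)  # in practice a person would review these first
```

Separating the proposal from the update leaves room for a review step between the two calls.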
VALUE ENRICHMENT
Source
Target
Source Organism
Accession Number
Protein Name
Length in Amino Acid
Molecular Weight in Da
RELATIONSHIP-SET ENRICHMENT
OBJECT-SET ENRICHMENT
Start End
Length in Amino Acid
Location
Gene
“37,612,680”;
“37,610,585”;
“3,095”:
A sample ontology object (partial information)
Two sample pages (partial information)
Species
Protein Name
Map to
Update values
Implementation Status:
Finished: sibling table comparison technique
Working on: sample ontology object recognition; ontology generation in the biological domain
Delimitations:
Ontology: will not cover everything in the domain
Source page understanding: structured/semi-structured
Value enrichment: only value lexicons
Object-set and relationship-set enrichment: only ISA and Part-Of hierarchies and simple attribute additions
Old ontology
Updated ontology
Possible new object sets that could be added to the ontology