dongkyoo shin ([email protected]) sejong university, incob2007

Exploitation of Structural Similarity in Semi-Structured Bioinformatics

Data for Efficient Storage Construction

Dongkyoo Shin ([email protected])

Sejong University, InCob2007

Multimedia & Internet Laboratory, Sejong University 2/20

Table of contents

• Abstract• Background• Methods• Results• Conclusions


Abstract (1)

• Background– Many researches related to storing XML data

• Reduce the number of joins between tables• Not proper to microarray data with distinctive hierarchy

– Hierarchical feature of microarray data model• a few core values occurs iteratively

– New approach for capturing the feature• Class elements with similar structure into a group• Design common database table for the group


Abstract (2)

• Results– Database schema created by our approach

• Reduce the number of table joins remarkably• Improve performance of storing and loading XML-based

microarray data

• Conclusions– Efficient way to improve performance of microarray

data is mining structural similarity of elements


Background (1)

• DTD (Data Type Definition)-dependent base– Map one element into one table

For each e E, #(S) ≥1 OR #(A) ≥1 -> define_Class(e)For each Se S -> Add_attributes_of_Class(e)

Se SequenceType -> Define_multivalued_att(Se, e)

<?xml?><!ELEMENT Dept(A*)><!ATTLIST Dept dept_id ID #REQUIRED><!ELEMENT Employee (Name, Enroll*)><!ATTLIST Employee emplopyee_id ID #REQUIRED><!ELEMENT Name #PCDATA><!ELEMENT Enroll #PCDATA>

ParentID ID dept_id

1 2 “dept1”

The Dept table

ParentID ID TEXT

3 7 “CS10”

3 8 “CS20”

The Enroll table

ParentID ID TEXT

3 5 “St1”

3 6 “St2"

The Name tableThe Employee table

ParentID ID Employee_id

2 3 “123”

2 4 “124”


Background (2)

• Inline technique base– Reduce the complexity of DTD (Data Type Definition)

For each e, #(S) == 1 AND Se SequenceType -> Add_Multi-valued_attribute_of_Paren-tClass(e)

<?xml?><!ELEMENT A(B, C)><!ELEMENT B(#PCDATA))><!ELEMENT C(D)* ><!ELEMENT D(E, F*, G)> <!ELEMENT E(H)><!ELEMENT F(#PCDATA)><!ELEMENT G(#PCDATA)><!ELEMENT H(#PCDATA)>

A

D

FE G

H

*

*

B C

inlining

A

C

FE G

H

*

*

B


Background (3)

• Drawback of previous approaches– DTD-dependent

• Database schema has the same complexity with DTD

– Inline technique• Strongly depend on the number of omissible elements

• New design approach for microarray database– Capture similar structural features of microarray data– Need fast and simple way to mine the structural

features


Background (5)

• Microarray data and MAGE (Microarray Gene Expression) standards– Research groups share microarray data with others,

and use it to solve their biological questions– MGED society’s standard definitions

• MIAME (Minimum Information for the Annotation of a Microarray Experiment)

• MAGE-OM and MAGE-ML – Exchange object model and format for MIAME

– Structural feature of MAGE-OM• a variety set of objects defining the same data types

including complex types.


Background (6)

• Decision Tree– a simple model for easy understanding classification

rules correlations, and effects between variables

– Proper for mining structural features of MAGE-ML DTD itself (Not MAGE-ML instances !!!)

• Possible to classify all elements three levels:– A root, mediators group, and bottoms group

A

B C

E FDleaf

depthnode

root


Methods (1)

• Classification of core features using decision tree– Terminologies for expression of a complexType

• e: an element defined in XML schema• E: an elements set of e• SE: a sub-elements set of e• a: an attribute of e• A: an attributes set of e• SA: an attributes set for all sub-elements of e• complexType: Structural information that consists of

SE and (or) A of e. • Lowest child: an element without a sub-element• Lowest parent: an element with a sub-element that

is one of the lowest child elements• PG (Parent Group): a set of candidate elements to be parents of a Lowest Child• LPCG (The Lowest Parent Candidate Group): a set of candidates to be Lowest Parent• LCG (The Lowest Child Group): a set of Lowest child elements• LPG (The Lowest Parent Group): a set of Lowest Parent elements• ULPG (Upper Level Parent Group): a set of upper level parents, including elements that are neither

Lowest Child nor Lowest Parent

<xsd:element name="A"><xsd:complexType>

<xsd:sequence><xsd:element ref="B"/><xsd:element ref="C"/>

</xsd:sequence></xsd:complexType>

</xsd:element><xsd:element name="B">

<xsd:complexType><xsd:sequence>

<xsd:element ref="D"/></xsd:sequence>

</xsd:complexType></xsd:element><xsd:element name="C">


<xsd:element ref="E"/><xsd:element ref="F"/>


</xsd:element>

<xsd:element name="D"><xsd:complexType>

<xsd:sequence><xsd:element ref="G"/><xsd:element ref="H"/>


</xsd:element><xsd:element name="E">


<xsd:element ref="I"/></xsd:sequence>

</xsd:complexType></xsd:element><xsd:element name="F">


<xsd:element ref="I"/><xsd:element ref="J"/>


</xsd:element>

<xsd:element name="G"><xsd:complexType>

<xsd:attribute name="identifier" use="required"/><xsd:attribute name="name"/>

</xsd:complexType></xsd:element><xsd:element name="H">

<xsd:complexType><xsd:attribute name="identifier" use="required"/><xsd:attribute name="name"/>

</xsd:complexType></xsd:element><xsd:element name="I">

<xsd:complexType><xsd:attribute name="width" use="required"/><xsd:attribute name="hight" use=”required”/><xsd:attribute name="length" use=”required”/>

</xsd:complexType></xsd:element><xsd:element name="J">

<xsd:complexType><xsd:attribute name="identifier" use="required"/><xsd:attribute name="name"/>

</xsd:complexType></xsd:element>

A

B C

E FD

I JG H

upper level parent

lowest parent

lowest child

parent group


Methods (2)

• Expression of a complexType– A complexType defines structural information of

elements• A set of arrays including data type

• Definition of structural similarity

SEelex = {e1, e2, … , en}, SAelex = {Ae1, Ae2, … , Aen}

complexType(elex) = {SEelex, SAelex}

complexType(elex) == complexType(eley)


Methods (3)

• Decision Tree for recognizing the core features

– Condition 1: If rule 1 is satisfied, then e arrives at LCG. Otherwise, it arrives at PG.– Condition 2: If rule 2 is satisfied, then e and its similar element e arrive at a new LCG.– Condition 3: If rule 3 is satisfied, then e arrives at LPG. Otherwise, it arrives at ULPG.– Condition 4: If rule 4 is satisfied, then e and elements similar to e arrive at a new LPG.

Elements

PGLCG

LCGi ULPG

yes no

no yes

Condition 1

Condition 2 Condition 3

Condition 4LPG

LPGi


Methods (4)

• Classification rules– Rule 1

• Decide that an element should belong to group LCG or PG

For each ei E { if(number of elements in SEei == 0){ ei is classified into LCG; }else{ ei is classified into PG; }}


Methods (5)


• Classify multiple sets of LCGp = 0;For each ei LCG0 { Flag=0; If (p>0) { For q=1 to p If (complexType(ei) = complexType(element in LCGq) { ei is classified into LCGq; Flag=1; } } If (Flag==0) { For each ej LCG0

if(complexType(ei) = complexType(ej) { p=p+1; ei and ej are classified into a new group of LCGp; } } }


Methods (6)


• Separate elements in PG into two groups: LPG and ULPG

For each ei PG { if(SEei LCG) { ei is classified into LPG; }else{ ei is classified into ULPG; }}


Methods


• Classify multiple sets of LPGp = 0;For each ei LPG0 { Flag=0; If (p>0) { For q=1 to p If (complexType(ei) = complexType(element in LPGq) { ei is classified into LPGq; Flag=1; } } If (Flag==0) { For each ej LPG0

if(complexType(ei) = complexType(ej) { p=p+1; ei and ej are classified into a new group of LPGp; } } }


Result (1)

• Database design by the proposed decision tree

Parent_1 Parent_2 Parent_3 Parent_4 Parent_N

Lowest child_1

Lowest child_2

Lowest child_3

Lowest child_4

Lowest child_N

Upper level elements


Parent_1 Parent_2 LPG Parent_4 Parent_N

Lowest child_1

Lowest child_2

LCGLowest child_4

Lowest child_N

Upper level elements


upper level parent

lowest parentgroup

lowest child group

identifer Object_type Parent_id

1354979 Array_assnlist 1256789

4661459 Bioassay_assnlist 1264564

... ... ...

Name Object_type Parent_id

null Array_ref 1354979

null Bioassay_ref 4661459

... ... ...

identifer

R:A- UMCU- 2:h000001_01

R:A- UMCU- 2:h000001_02

...


Result (2)

• Database space complexity

• Time complexity

Raw schema Classified schema

Total classes 455 314

Total tables 455 314

Total records 2012 160

Total DB size 710 (Kb) 27 (Kb)

0

200

400

600

800

1000

1200

1400

1600

Sto

ring

tim

e (m

s)

storing 1515 528

raw schema classified schema0

2000

4000

6000

8000

10000

12000

14000

16000

18000L

oadin

g tim

e (

ms)

loading 15890 6362

raw schema classified schema


Result (3)

• Reconstructing the XML Document...<LPG> <LCG identifier=”R:A-UMCU-2:h000001_01" name=”” objectType=”Array_ref” parent_id=”1354979"/></LPG>...

...<Array_assnlist> <Array_ref identifier=”R:A-UMCU-2:h000001_01" name=””/> ...</Array_assnlist>...

XSLT Processor

OriginalXML schema

RelationalDatabase

tuples

objects

JAXB

SimplifiedXML schema


Conclusions

• Proposed approach– Mine elements with structural similarity from XML

Schema for biological information– Experimental result

• Mining structural similarity of object model is proper to microarray data and more efficient than previous approaches

• Future work– Plan to extend current classification rules to root,

LCG, LPG, ULPG respectively

dongkyoo shin ([email protected]) sejong university, incob2007

Documents

performance of microarray

sejong universitybackground

sejong universityabstract

data types

sejong universitymethods

microarray experimentmageom

number of table

number of joins