
[IEEE 2009 24th International Symposium on Computer and Information Sciences (ISCIS) - Guzelyurt, Cyprus (2009.09.14-2009.09.16)]

Multi-Relational Concept Discovery with Aggregation

Yusuf Kavurucu
Computer Engineering
Middle East Technical University
Ankara, Turkey
Email: [email protected]

Pinar Senkul
Computer Engineering
Middle East Technical University
Ankara, Turkey
Email: [email protected]

I. Hakki Toroslu
Computer Engineering
Middle East Technical University
Ankara, Turkey
Email: [email protected]

Abstract—Concept discovery aims at finding the rules that best describe the given target predicate (i.e., the concept). Aggregation information such as average, count, and max is descriptive for domains in which an aggregated value takes part in the definition of the concept. Therefore, a concept discovery system needs aggregation capability in order to construct high-quality rules (with high accuracy and coverage) for such domains. In this work, we describe a method for concept discovery with aggregation in an ILP-based concept discovery system, namely C2D-A. C2D-A extends C2D by considering all instances together and thus improves the quality of the generated rules. Together with this extension, the aggregation handling mechanism is modified accordingly, leading to more accurate aggregate values as well.

I. INTRODUCTION

The amount of data collected in relational databases has been increasing due to the growing use of complex data in real-life applications. This has called for the development of multi-relational learning algorithms that can be directly applied to multi-relational data in databases [1]. For such learning systems, first-order predicate logic is generally employed as the representation language. Learning systems that induce logical patterns valid for given background knowledge have been investigated in a research area called Inductive Logic Programming (ILP) [2].

Confidence-based Concept Discovery (C2D) [3], [4], [5], [6] is a predictive concept learning ILP system that employs relational association rule mining concepts and techniques to find frequent and strong concept definitions according to a given target relation and background knowledge. This approach is further improved in C2D-A [7] by considering the number of occurrences of constant-value arguments in the rule, using all of the target instances together. In this way, the effect of target instance ordering on concept discovery is eliminated and rule quality is improved.

An important feature for a concept discovery method is the ability to incorporate aggregated information into the concept discovery. Such information becomes descriptive, as in the example "the total charge of the atoms of a compound is descriptive for the usefulness or harmfulness of the compound". Following the basics given in [4], [6], in C2D-A the well-known aggregate functions COUNT, SUM, MIN, MAX and AVG are defined in first-order logic and used as aggregate predicates for situations where one-to-many relationships exist in the data set. The aggregation handling mechanism in C2D-A considers the whole domain of an aggregated attribute, resulting in an increase in the quality of the discovered rules in certain domains.

This paper is organized as follows: Section 2 presents the related work. Section 3 introduces C2D-A. Section 4 describes the usage of aggregate predicates in C2D-A. Section 5 presents experimental work. Finally, Section 6 includes concluding remarks.

II. RELATED WORK

In this section, first, similar concept learning systems are presented. Then, multi-relational learning techniques with aggregation are summarized and discussed.

FOIL, PROGOL, ALEPH and WARMR are some of the well-known ILP-based systems in the literature. FOIL [8] is one of the earliest concept discovery systems. It is a top-down relational ILP system that uses a refinement graph in the search process. In FOIL, negative examples are not explicitly given; they are generated on the basis of the closed-world assumption (CWA).

PROGOL [9] is a top-down relational ILP system based on inverse entailment. A bottom clause is a maximally specific clause that covers a positive example and is derived using inverse entailment. PROGOL extends clauses by traversing the refinement lattice. ALEPH [10] is similar to PROGOL, but it makes it possible to apply different search strategies and evaluation functions.

The design of algorithms for frequent pattern discovery has become a popular topic in data mining. Almost all such algorithms use the same level-wise search technique, known as the APRIORI algorithm. WARMR [11] is a descriptive ILP system that employs the Apriori rule to find frequent queries involving the target relation by using a support criterion.

C2D-A is similar to ALEPH in that both systems produce concept definitions from a given target. WARMR is another similar work in that both systems employ Apriori-based search methods. Unlike ALEPH and WARMR, C2D-A does not need mode declarations. It only requires type specifications of the arguments, which already exist together with the relational tables corresponding to predicates. Most ILP-based systems require negative information, whereas C2D-A directly works on databases that contain only positive data. Similar to FOIL, negative information is implicitly described according to the CWA. Finally, C2D-A uses a novel confidence-based hypothesis evaluation criterion and search space pruning method.

ALEPH and WARMR can generate transitive rules only by using strict mode declarations. In C2D-A, transitive rules are generated without the guidance of mode declarations.

There are several techniques that incorporate aggregation into multi-relational learning. Crossmine [12] is an ILP-based multi-relational classifier that uses TupleID propagation. The multi-relational g-mean decision tree, called Mr.G-Tree [13], extends the propagation concepts described in Crossmine by introducing the g-mean TupleID propagation algorithm, also known as the GTIP algorithm. Classification with Aggregation of Multiple Features (CLAMF) extends TupleID propagation in order to efficiently perform single- and multi-feature aggregation over related tables [14].

Multi-Relational Decision Tree Learning (MRDTL) is a multi-relational learning method that constructs a Selection Graph (SG) for rule discovery. SG is a graphical language developed to express multi-relational patterns. These graphs can be translated into SQL or first-order logic expressions. The Generalised Selection Graph (GSG) is an extended version of SG that uses aggregate functions [15]. In C2D-A, we followed a logic-based approach and included aggregate predicates in an ILP-based context for concept discovery.

III. C2D-A: CONFIDENCE-BASED CONCEPT DISCOVERY WITH ALL INSTANCES

A concept is a set of frequent patterns embedded in the features of the concept instances and in the relations of objects belonging to the concept with other objects. C2D [16], [3], [4], [5], [6] is a concept discovery system that uses first-order logic as the concept definition language and generates a set of definite clauses having the target concept in the head.

Experiments with the C2D algorithm show that the selection order of the target instances (the order in the target relation) may change the resulting hypothesis set. In each coverage step, the induced rules depend on the selected target instance, and the target instances covered in each step have no effect on the rules induced in the following coverage steps.

As a remedy to this problem, a new mechanism is developed and used in the improved version of C2D, namely C2D-A [7] (the flowchart is given in Figure 2). C2D-A modifies the generalization step of C2D in an efficient manner.

In concept discovery systems, an important problem is to cope with an intractably large search space while still constructing high-quality rules. C2D-A uses three basic mechanisms for pruning the search space:

1. The first one is a generality ordering on the concept clauses based on θ-subsumption, defined as follows: a definite clause C θ-subsumes a definite clause C′ (i.e., C is at least as general as C′) if and only if ∃θ such that head(C) = head(C′) and body(C′) ⊇ body(C)θ.
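As an illustration of this ordering, the following Python sketch checks θ-subsumption between two clauses by a brute-force search for a substitution. The clause representation (tuples with uppercase strings as variables, lowercase as constants) is our own illustrative assumption, not C2D-A's actual implementation:

```python
# Minimal theta-subsumption check (hypothetical representation).
# Variables are uppercase strings, constants are lowercase strings.

def is_var(t):
    return t[0].isupper()

def match(lit, target, sub):
    """Try to extend substitution sub so that lit under sub equals target."""
    name, args = lit
    tname, targs = target
    if name != tname or len(args) != len(targs):
        return None
    sub = dict(sub)
    for a, t in zip(args, targs):
        if is_var(a):
            if a in sub and sub[a] != t:
                return None
            sub[a] = t
        elif a != t:
            return None
    return sub

def subsumes(c, c2):
    """True if clause c theta-subsumes clause c2 (c is at least as general)."""
    head, body = c
    head2, body2 = c2

    def search(lits, sub):
        if not lits:
            return True
        first, rest = lits[0], lits[1:]
        for target in body2:
            s = match(first, target, sub)
            if s is not None and search(rest, s):
                return True
        return False

    s0 = match(head, head2, {})
    return s0 is not None and search(list(body), s0)

# d(A,B) :- p(C,A) theta-subsumes d(A,B) :- p(B,A), f(A)
general = (('d', ('A', 'B')), [('p', ('C', 'A'))])
specific = (('d', ('A', 'B')), [('p', ('B', 'A')), ('f', ('A',))])
print(subsumes(general, specific))  # True
```

The reverse direction fails, as expected: the more specific clause cannot subsume the more general one.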

2. The second one is the use of confidence as follows: if the confidence value of a clause is not higher than the confidence values of its two parent clauses in the Apriori search lattice, then it is pruned. A similar approach is used in the Dense-Miner system [17] for traditional association rule mining.

3. The last pruning strategy utilizes the primary-foreign key relationship between the head and body relations. If such a relationship exists between the head and the body predicates, the foreign key argument of the body relation can only take the same variable as the primary key argument of the head predicate in the generalization step.

Another useful feature is the parametric structure for support, confidence, recursion and f-metric definitions. The user can set support/confidence thresholds in order to find rules having acceptable support/confidence values. Similarly, it is possible to allow/disallow recursion in discovered concepts. F-metric is the rule evaluation function whose definition is given in [6].

The database given in Figure 1 is used as a running example in this section. In this example, daughter (d) is the concept to be learned, and two concept instances are given. Background facts of two relations, namely parent (p) and female (f), are provided. Finally, the types of the attributes of the relations are listed.

Concept Instances    Background Facts    Type Declarations
d(mary, ann).        p(ann, mary).       d(person, person).
d(eve, tom).         p(ann, tom).        p(person, person).
                     p(tom, eve).        f(person).
                     f(ann).
                     f(mary).
                     f(eve).

Fig. 1. The daughter database

Generalization: In C2D-A, for the target relation t(a, b), the induced rules have a head including either a constant value or a variable for each argument of t. Each argument can be handled independently to find the feasible head relations for the hypothesis set. As an example, for the first argument a, a constant value must be repeated at least min_sup × number_of_uncovered_instances times in the target relation so that it can take part as a constant value in a frequent rule. To find such values for a, the following SQL statement is executed:

SELECT a
FROM t
GROUP BY a
HAVING COUNT(*) ≥ (min_sup * num_of_uncov_inst)
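As a sketch of how this query behaves, the following Python snippet runs it with sqlite3 on a toy target relation; the table contents and parameter values are illustrative assumptions, not data from the paper:

```python
import sqlite3

# Toy target relation t(a, b) with four uncovered instances.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("ann", "x"), ("ann", "y"), ("ann", "z"), ("tom", "x")])

min_sup = 0.5
num_of_uncov_inst = 4  # uncovered instances in the target relation

# Constants for argument a occurring at least min_sup * num_of_uncov_inst times.
rows = conn.execute(
    "SELECT a FROM t GROUP BY a HAVING COUNT(*) >= ?",
    (min_sup * num_of_uncov_inst,)).fetchall()
print(rows)  # [('ann',)] -- only 'ann' is frequent enough to act as a constant
```

Here 'ann' appears three times (≥ 2 = 0.5 × 4) and so qualifies as a constant, while 'tom' does not.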

For example, in the PTE-1 database [18] the target relation pte_active has only one argument (drug). Initially, there are 298 uncovered instances in pte_active. Assume that the min_sup parameter is set to 0.1. Under this min_sup, the following SQL statement returns the empty set, which means there cannot be a constant value for the argument drug of pte_active:

SELECT drug
FROM pte_active
GROUP BY drug
HAVING COUNT(*) ≥ 298 * 0.1

The result of the SQL statement (the empty set) indicates that the argument drug of pte_active can only be a variable in the head of the solution hypothesis clauses.

[Figure 2 shows the flowchart of the C2D-A algorithm. Inputs are the database (target relation and background facts) and the parameters min_sup, min_conf and max_depth. Feasible values for the head and body relations are calculated; the generalization step finds general rules (one head and one body literal) using absorption at depth 1; the specialization step refines general rules using APRIORI, incrementing the depth; the filter step discards infrequent and non-strong rules; and the coverage step finds solution rules and covers target instances. The loop continues until the candidate rule set is empty, the maximum depth is reached, or all target instances are covered, after which the hypothesis is printed.]

Fig. 2. The flowchart of the C2D-A algorithm

In the same manner, for a background relation r(a, b, c), if a constant value occurs frequently for the same argument in r, then it is a frequent value for that argument of r and may take part in a solution clause of the hypothesis set. As an example, in the PTE-1 database, pte_atm(drug, atom, element, integer, charge) is a background relation, and the constant values that can take part in the hypothesis set can be found for each argument of pte_atm by executing SQL statements with the same structure as the one given above.

As in C2D, for numeric attributes, less than/greater than operators are applied to constant values. As an example, for the argument charge of pte_atm, first the corresponding values in the database are sorted in ascending order. Then, under min_sup 0.1, there will be 9 thresholds ((1 / 0.1) - 1, excluding the minimum and the maximum) for each less than/greater than operator. The pte_atm relation has 9189 records; after ordering from smallest to largest, the less than/greater than operators are applied to the 920th value, the 1840th value, and so on. In addition, this argument may be a variable as well. However, it is experimentally observed that this technique decreases time efficiency considerably. For this reason, in the experiments, only the median element of the ordered sequence is taken as a frequent value to which the less than/greater than operators are applied.

The frequent values for each argument of pte_atm are as follows:
drug: empty set (only variable)
atom: empty set (only variable)
element: c, h (and variable)
integer: 3, 10, 22 (and variable)
charge: less than/greater than operators applied on the median value after ordering the feasible values in the relation (and variable).
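A minimal sketch of this threshold selection, as we reconstruct it (our own illustration, not the system's code): sort the values and take (1 / min_sup) - 1 evenly spaced order statistics, of which the experiments keep only the middle one (the median):

```python
def numeric_thresholds(values, min_sup):
    """Pick (1/min_sup - 1) order statistics as candidate comparison points."""
    vals = sorted(values)
    n = len(vals)
    k = int(round(1 / min_sup)) - 1       # 9 thresholds when min_sup = 0.1
    step = n * min_sup                    # roughly 919 for the 9189 pte_atm rows
    return [vals[int(round((i + 1) * step)) - 1] for i in range(k)]

# Toy data: 100 evenly spread values, min_sup = 0.1.
print(numeric_thresholds(list(range(1, 101)), 0.1))
# [10, 20, 30, 40, 50, 60, 70, 80, 90]
```

For the 9189 pte_atm records this picks roughly every 919th value, matching the 920th, 1840th, … positions mentioned above; keeping only the median threshold trades rule granularity for time efficiency.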

As a result, the pte_atm predicate has 36 (1*1*3*4*(2+1)) frequent variations for each valid head relation in the generalization step of C2D-A. Some example clauses generated in the generalization step for PTE-1 are as follows:

pte_active(A) :- pte_atm(A, B, c, 3, ≤ 0.047)
pte_active(A) :- pte_atm(A, B, c, 3, ≥ 0.047)
pte_active(A) :- pte_atm(A, B, c, 3, C)
pte_active(A) :- pte_atm(A, B, c, 10, ≤ 0.047)
pte_active(A) :- pte_atm(A, B, c, 10, ≥ 0.047)
pte_active(A) :- pte_atm(A, B, c, 10, C)
pte_active(A) :- pte_atm(A, B, c, 22, ≤ 0.047)
pte_active(A) :- pte_atm(A, B, c, 22, ≥ 0.047)
pte_active(A) :- pte_atm(A, B, c, 22, C)
pte_active(A) :- pte_atm(A, B, c, C, ≤ 0.047)
pte_active(A) :- pte_atm(A, B, c, C, ≥ 0.047)
pte_active(A) :- pte_atm(A, B, c, C, D)
pte_active(A) :- pte_atm(A, B, h, 3, ≤ 0.047)
pte_active(A) :- pte_atm(A, B, h, 3, ≥ 0.047)
pte_active(A) :- pte_atm(A, B, h, 3, C)
...

Support and confidence values for each clause are calculated, and the infrequent clauses are eliminated.

In the daughter example, the target and background relations can only have variables for the arguments in the hypothesis set. Therefore, the following clauses are generated in the generalization step:

d(A, B) :- p(A, A).    d(A, B) :- p(A, B).
d(A, B) :- p(A, C).    d(A, B) :- p(B, A).
d(A, B) :- p(B, B).    d(A, B) :- p(B, C).
d(A, B) :- p(C, A).    d(A, B) :- p(C, B).
d(A, B) :- p(C, C).    d(A, B) :- p(C, D).
d(A, B) :- f(A).       d(A, B) :- f(B).
d(A, B) :- f(C).

Under the support threshold value 0.8, among the 13 generated clauses, only the following six satisfy the threshold:

d(A, B) :- p(B, A).    d(A, B) :- p(C, A).
d(A, B) :- p(B, C).    d(A, B) :- p(C, D).
d(A, B) :- f(A).       d(A, B) :- f(C).

Refinement of Generalization: C2D-A refines the two-literal concept descriptions with an Apriori-based specialization operator that searches the definite clause space in a top-down manner, from general to specific. As in Apriori, the search proceeds level-wise in the hypothesis space and is mainly composed of two steps: frequent clause set selection


from candidate clauses and candidate clause set generation as refinements of the frequent clauses in the previous level. The standard Apriori search lattice is extended in order to capture first-order logical clauses, and the candidate generation and frequent pattern selection tasks are customized for first-order logical clauses.

The candidate clauses for the next level of the search space are generated in three important steps:

1. Frequent clauses of the previous level are joined to generate the candidate clauses via the union operator. In order to apply the union operator to two frequent definite clauses, these clauses must have the same head literal, and their bodies must have all but one literal in common.

2. For each frequent union clause, a further specialization step is employed that unifies the existential variables of relations having the same type in the body of the clause. In this way, clauses with relations indirectly bound to the head predicate can be captured.

3. Except for the first level, the candidate clauses whose confidence values are not higher than their parents' confidence values are eliminated.
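Step 1 can be sketched in a few lines of Python. This is a simplified illustration under the assumption that literals are plain comparable tuples; the actual operator must also handle variable renaming and duplicate elimination:

```python
# Sketch of the union (join) step: two frequent clauses with the same head
# whose bodies differ in exactly one literal are merged into a candidate.
def union_clause(c1, c2):
    (h1, b1), (h2, b2) = c1, c2
    if h1 != h2:
        return None                       # heads must be identical
    extra1 = [l for l in b1 if l not in b2]
    extra2 = [l for l in b2 if l not in b1]
    if len(extra1) == 1 and len(extra2) == 1:
        return (h1, b1 + extra2)          # body: common part + both extras
    return None

# Two level-1 clauses from the daughter example:
c1 = (('d', ('A', 'B')), [('p', ('B', 'A'))])
c2 = (('d', ('A', 'B')), [('f', ('A',))])
print(union_clause(c1, c2))
# (('d', ('A', 'B')), [('p', ('B', 'A')), ('f', ('A',))])
```

Joining the two one-literal clauses yields the two-literal candidate d(A, B) :- p(B, A), f(A), which is exactly the clause selected in the daughter example below.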

Evaluation: For the first instance of the target concept that has not yet been covered by the hypothesis, the system constructs the search tree consisting of frequent and confident candidate clauses that induce the current concept instance. It then eliminates the clauses whose confidence values are below the confidence threshold. Finally, the system decides which clause in the search tree represents a better concept description than the others according to the f-metric definition.

Covering: After the best clause is selected, concept instances covered by this clause are removed from the concept instance set. The main iteration continues until all concept instances are covered or no more feasible candidate clauses can be found for the uncovered concept instances.

In the daughter example, the search tree constructed for the instance d(mary, ann) is traversed for the best clause. The clause d(A, B) :- p(B, A), f(A), with a support value of 1.0 and a confidence value of 1.0 (f-metric = 1.0), is selected and added to the hypothesis. Since all the concept instances are covered by this rule, the algorithm terminates and outputs the following hypothesis:

d(A, B) :- p(B, A), f(A).
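The coverage of this clause can be checked on the toy database with a few lines of Python. The set-based support/confidence definitions below are our simplification for illustration (fraction of target instances covered, and fraction of derived heads that are known instances), not the system's exact metrics:

```python
# The daughter database from Fig. 1.
daughters = {("mary", "ann"), ("eve", "tom")}
parent = {("ann", "mary"), ("ann", "tom"), ("tom", "eve")}
female = {"ann", "mary", "eve"}

# d(A, B) :- p(B, A), f(A): derive every (A, B) the rule predicts.
predicted = {(a, b) for (b, a) in parent if a in female}

support = len(daughters & predicted) / len(daughters)
confidence = len(daughters & predicted) / len(predicted)
print(support, confidence)  # 1.0 1.0
```

Both measures come out at 1.0: the rule derives exactly the two given instances, d(mary, ann) and d(eve, tom), and nothing else.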

IV. AGGREGATE PREDICATES IN C2D-A

In relational database queries, aggregate functions characterize groups of records gathered around a common property. In concept discovery, aggregate functions are utilized in order to construct aggregate predicates that capture aggregate information over one-to-many relationships. In [4], [6], aggregate predicates are formally defined for C2D. In this work, these basic definitions for aggregate predicate and aggregate query are used, as given below.

Definition 1: An Aggregate Predicate (Π) is a predicate that defines aggregation over an attribute of a given Predicate (α). We use a notation similar to the one given in [19] to represent the general form of aggregate predicates as follows:

Π^{α;β}_{γ;ω}(γ, σ)

where α is the predicate over which the Aggregate Function ω (COUNT, MIN, MAX, SUM and AVG are frequently used functions) is computed, the Key γ is the set of arguments that forms the key for Π, and the Aggregate Value σ is the value of ω applied to the set of values defined by the Aggregate Variable List β.

As an example, the PTE-1 database [18] is used for explaining the usage of aggregate functions. There is a one-to-many relationship between the pte_active and pte_atm relations over the drug argument. A similar relation exists between the pte_active and pte_bond tables. There is also a one-to-many relationship between the pte_atm and pte_bond relations over the atm-id argument.

Π^{pte_atm;atm-id}_{drug;COUNT}(drug, cnt) is an example aggregate predicate that can be defined in the PTE-1 database. For simplicity, we abbreviate it as pte_atm_count(drug, cnt).

The other aggregate predicates that can be defined in the database are:

pte_bond_count(drug, cnt).
pte_atm_b_count(atm-id, cnt).
pte_atm_charge_max(drug, mx).
pte_atm_charge_min(drug, mn).

Definition 2: An aggregate rule is a definite clause which has at least one aggregate predicate among its body relations. An example aggregate rule is:

pte_active(d1, true) :- pte_atm_count(d1, A), A ≥ 26.

Definition 3: An aggregate query is an SQL statement having SELECT and GROUP BY commands and aggregate functions defined in SQL. The instances of aggregate predicates are created by using aggregate queries. Given Π^{α;β}_{γ;ω}(γ, σ), the corresponding aggregate query is

SELECT γ, ω(β) AS σ
FROM α
GROUP BY γ

For example, the instances of the pte_atm_count(drug, cnt) aggregate predicate on the pte_atm relation are constructed by the following query:

SELECT drug, COUNT(atm-id) AS cnt
FROM pte_atm
GROUP BY drug
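The following sqlite3 sketch materializes such aggregate predicate instances on a toy pte_atm table; the schema and rows here are illustrative assumptions, not the real PTE-1 data:

```python
import sqlite3

# Toy pte_atm table: one row per atom, many atoms per drug.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pte_atm (drug TEXT, atm_id TEXT, element TEXT)")
conn.executemany("INSERT INTO pte_atm VALUES (?, ?, ?)",
                 [("d1", "a1", "c"), ("d1", "a2", "h"),
                  ("d1", "a3", "c"), ("d2", "a4", "o")])

# One row per key value; these rows become the facts of pte_atm_count(drug, cnt).
rows = conn.execute(
    "SELECT drug, COUNT(atm_id) AS cnt FROM pte_atm "
    "GROUP BY drug ORDER BY drug").fetchall()
print(rows)  # [('d1', 3), ('d2', 1)]
```

Each result row, e.g. pte_atm_count(d1, 3), is then added to the background knowledge like any other fact.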

Aggregate predicates have numeric attributes by their nature. For this reason, it is crucial to represent numeric attributes in the concept description as described in Section III. For aggregate predicates having numerical attributes, it is infeasible to generate rules that test such numeric values through equality. For example, the pte_atm_count relation in the PTE-1 database has the argument "cnt", which takes integer values. It is infeasible to search for a rule such as "a drug is active if it has 26 atoms": as there are many possible numeric values in the relation, such a rule would most likely be eliminated by the minimum support criterion. Because of this, the following modification is applied in C2D-A for numeric attributes in aggregate predicates:


As the first step, the domains of the numerical attributes are explicitly defined as "infinite" in the generalization step. For a given target concept t(a, x) and a related fact such as p(a, b, num), where a and b are nominal values and num is a numeric value, instead of a single clause, the following two clauses are generated:

t(a, x) :- p(a, b, A), A ≥ num.
t(a, x) :- p(a, b, A), A ≤ num.

The num value is found as follows: the feasible values for the cnt argument of the pte_atm_count relation are sorted in ascending order. Then, because of search space and execution time restrictions, the median element of the ordered list is used as the num value for the aggregate predicate, as described in Section III.

Once the comparison is defined on numeric attributes, aggregate predicates are included in C2D-A. For this purpose, one-to-many relationships between the target concept and background relations are defined according to schema information. Then, aggregate predicates are generated by using pre-defined SQL commands. In the generalization step, the instances of these predicates are considered for clause generation.

As an example, for the pte_atm_count predicate defined in the PTE-1 database, the following example clauses are created in the generalization step:

pte_active(A, true) :- pte_atm_count(A, B), B ≥ 22.
pte_active(A, true) :- pte_atm_count(A, B), B ≤ 22.

V. EXPERIMENTAL RESULTS

In order to test the effect of aggregate predicates on rule generation, experiments on the Predictive Toxicology Evaluation (PTE) data set are conducted. Carcinogenicity tests of compounds are important, yet time-consuming and expensive. For this reason, discovering rules that describe toxicological compounds helps facilitate the process. In the National Toxicity Program, 367 compounds were classified as the result of tests conducted on rodents. Among these compounds, 298 were separated as the training set, 39 formed the test set of PTE-1, and the other 30 chemicals constitute the test set of the PTE-2 challenge [18].

In this work, the experiment is conducted on the PTE-1 data. In the C2D-A system configuration, recursion is disallowed, the maximum rule length is set to 3 predicates, and the support and confidence thresholds are set to 0.1 and 0.7, respectively.

For the PTE-1 data set, the aggregate predicates given in Figure 3 are defined and their instances are added to the background information.

The predictive accuracies of the state-of-the-art methods and of C2D with its improved version for the PTE-1 data set are listed in Figure 4. As seen from the figure, C2D-A has better predictive accuracy than the basic C2D algorithm. In addition, it achieves the best results (highest accuracy) with respect to the other systems.

An example rule including an aggregate predicate is shown below:

pte_active(A, false) :- pte_atm(A, B, c, 22, X), X ≥ -0.020,
    pte_has_property(A, salmonella, n),
    pte_has_property(A, mouse_lymph, p).

Predicate                   Description                    SQL Query Definition
pte_atm_count(drug, cnt)    Number of atoms for each drug  SELECT drug, COUNT(atm-id) FROM pte_atm GROUP BY drug
pte_bond_count(drug, cnt)   Number of bonds for each drug  SELECT drug, COUNT(atm-id) FROM pte_bond GROUP BY drug
pte_atm_b_cnt(atm-id, cnt)  Number of bonds for each atom  SELECT atm-id1, COUNT(atm-id2) FROM pte_bond GROUP BY atm-id1
pte_charge_max(drug, mx)    Max charge of atoms in a drug  SELECT drug, MAX(charge) FROM pte_atm GROUP BY drug
pte_charge_min(drug, mn)    Min charge of atoms in a drug  SELECT drug, MIN(charge) FROM pte_atm GROUP BY drug

Fig. 3. The aggregate predicates in the PTE-1 data set

Method              Type                     Pred. Acc.
C2D-A (with aggr.)  ILP + DM                 0.88
C2D-A               ILP + DM                 0.86
Ashby               Chemist                  0.77
PROGOL              ILP                      0.72
RASH                Biological Potency An.   0.72
C2D (with aggr.)    ILP + DM                 0.70
TIPT                Propositional ML         0.67
Bakale              Chemical Reactivity An.  0.63
Benigni             Expert-guided Regr.      0.62
DEREK               Expert System            0.57
TOPCAT              Statistical Disc.        0.54
COMPACT             Molecular Modeling       0.54

Fig. 4. Predictive accuracies for PTE-1

Within this work, the effect of including aggregate predicates on the execution time of the system is experimentally analyzed. For the experiments, the proposed method is applied on the PTE-1 data set with none, one, and five aggregate predicates included in the background knowledge. The experiments are conducted on a computer with 1 GB RAM and a 1.6 GHz Core Duo CPU. The results are presented in Figure 5. As seen in the figure, including a single aggregate predicate in the rule discovery mechanism causes a large increase in execution time (i.e., the duration of concept discovery). However, the rate of increase drops as further aggregate predicates are included. For domains where the aggregate predicates are descriptive for the concept, the experimentally observed increase in execution time can be tolerated. Furthermore, more than one aggregate predicate can be included in the system, since the cost of adding an additional aggregate predicate does not double the execution time.

VI. CONCLUSION

This work presents a method for including aggregation pred-icates in multi-relational concept discovery. Proposed methodis designed and implemented on a concept discovery system,named C2D-A. It is extended over an ILP-based conceptdiscovery system, named C2D, which combines rule extractionmethods in ILP and Apriori-based specialization operator.By this way, strong declarative biases are relaxed; instead,support and confidence values are used for pruning the search


[Figure: "Execution Time for PTE-1 Data Set" — x-axis: Number of Aggregate Predicates (0, 1, 5); y-axis: Execution Time (minutes), with readings 19, 156 and 256]

Fig. 5. Execution time graph for concept discovery with aggregation

space. In addition, C2D does not require user specification of input/output modes of predicate arguments or of negative concept instances. In C2D-A, the generalization step of C2D is modified in such a way that the most general rules are constructed by considering the number of occurrences of constant value arguments in the rules. By this new method, the effect of target instance order on rule generation is eliminated and rule quality is improved.
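The occurrence-counting generalization summarized above can be illustrated with a toy sketch. The instance tuples and the keep-ratio threshold below are invented, and C2D-A's actual operator over the database is more involved; the sketch only shows the core idea of deciding, per argument position and over all target instances together, whether a constant is frequent enough to be kept or should be generalized to a variable:

```python
from collections import Counter

# Invented target instances of pte_has_property(drug, test, result):
instances = [("d1", "mouse_lymph", "p"),
             ("d2", "mouse_lymph", "p"),
             ("d3", "rat", "p")]

KEEP_RATIO = 0.6  # invented: keep a constant seen in >= 60% of instances

def argument_choices(instances, ratio):
    """For each argument position, return the constant to keep in the most
    general rule, or None to indicate generalization to a variable."""
    n = len(instances)
    choices = []
    for pos in range(len(instances[0])):
        counts = Counter(inst[pos] for inst in instances)
        value, freq = counts.most_common(1)[0]
        choices.append(value if freq / n >= ratio else None)
    return choices

print(argument_choices(instances, KEEP_RATIO))  # [None, 'mouse_lymph', 'p']
```

Because the counts are taken over all instances at once, the outcome does not depend on the order in which target instances are processed.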

In order to generate successful rules for the domains where aggregated values such as sum and avg are descriptive in the semantics of the target concept, it is essential for a concept discovery system to support the definition of aggregation and its inclusion in the concept discovery mechanism. In both C2D and C2D-A, aggregation information is defined in the form of aggregate predicates, which are included in the background knowledge of the concept. These systems differ in the way aggregate predicates are included in the rule discovery. In C2D-A, the aggregate value that takes part in the aggregate predicate, and hence in the overall rule, is generated by considering all values of the attribute. The effect of aggregation on the quality of generated rules is tested on the PTE-1 challenge data set. The experiments show that aggregate rules and comparisons on numerical data provide better accuracy for data that has one-to-many relationships between the target and background relations. Inclusion of aggregate predicates and consideration of all possible values for the aggregate attribute cause an increase in execution time. However, due to the satisfactory results in rule quality, this decrease in efficiency may be considered acceptable.
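As a toy illustration of an aggregate predicate taking part in a rule body over a one-to-many relation: the facts, the comparison threshold and the candidate rule below are invented, and only the predicate names echo the PTE-1 examples in the text.

```python
# One-to-many relation pte_atm: each drug has several atoms with charges.
pte_atm = [("d1", -0.25), ("d1", 0.10), ("d2", 0.05), ("d2", 0.30)]
positives = {"d1"}  # drugs with the target property (invented)

# Aggregate predicate pte_charge_min(drug, mn), computed over ALL atom
# charges of each drug, as in C2D-A:
charge_min = {}
for drug, charge in pte_atm:
    charge_min[drug] = min(charge, charge_min.get(drug, charge))

# Invented candidate rule:
#   pte_has_property(D, ...) :- pte_charge_min(D, M), M <= -0.1.
covered = {d for d, m in charge_min.items() if m <= -0.1}
accuracy = len(covered & positives) / len(covered)
print(covered, accuracy)  # {'d1'} 1.0
```

The numerical comparison on the aggregated value lets a single rule summarize a property of the whole one-to-many neighborhood of each target instance, which is exactly where plain (non-aggregate) literals fall short.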

REFERENCES

[1] S. Dzeroski, “Multi-relational data mining: an introduction,” SIGKDD Explorations, vol. 5, no. 1, pp. 1–16, 2003.

[2] S. Muggleton, Ed., Inductive Logic Programming. London: Academic Press, 1992.

[3] Y. Kavurucu, P. Senkul, and I. H. Toroslu, “Confidence-based concept discovery in multi-relational data mining,” in Proceedings of the International Conference on Data Mining and Applications (ICDMA), Hong Kong, March 2008, pp. 446–451.

[4] Y. Kavurucu, P. Senkul, and I. H. Toroslu, “Aggregation in confidence-based concept discovery for multi-relational data mining,” in Proceedings of the IADIS European Conference on Data Mining (ECDM), Amsterdam, Netherlands, July 2008, pp. 43–50.

[5] Y. Kavurucu, P. Senkul, and I. H. Toroslu, “Analyzing transitive rules on a hybrid concept discovery system,” in Proceedings of the 4th International Conference on Hybrid Artificial Intelligent Systems (HAIS), Salamanca, Spain, June 2009.

[6] Y. Kavurucu, P. Senkul, and I. H. Toroslu, “ILP-based concept discovery in multi-relational data mining,” Expert Syst. Appl., vol. 41, 2009.

[7] Y. Kavurucu, P. Senkul, and I. H. Toroslu, “Confidence-based concept discovery in relational databases,” in Proceedings of the 2009 World Congress on Computer Science and Information Engineering (CSIE 2009), Los Angeles, USA, April 2009.

[8] J. R. Quinlan, “Learning logical definitions from relations,” Mach. Learn., vol. 5, no. 3, pp. 239–266, 1990.

[9] S. Muggleton, “Inverse entailment and Progol,” New Generation Computing, Special issue on Inductive Logic Programming, vol. 13, no. 3-4, pp. 245–286, 1995.

[10] A. Srinivasan, “The Aleph manual,” 1999. [Online]. Available: http://www.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/

[11] L. Dehaspe and L. D. Raedt, “Mining association rules in multiple relations,” in ILP’97: Proceedings of the 7th International Workshop on Inductive Logic Programming. London, UK: Springer-Verlag, 1997, pp. 125–132.

[12] X. Yin, J. Han, J. Yang, and P. S. Yu, “CrossMine: Efficient classification across multiple database relations,” in ICDE, 2004, pp. 399–411.

[13] C.-I. Lee, C.-J. Tsai, T.-Q. Wu, and W.-P. Yang, “An approach to mining the multi-relational imbalanced database,” Expert Syst. Appl., vol. 34, no. 4, pp. 3021–3032, 2008.

[14] R. Frank, F. Moser, and M. Ester, “A method for multi-relational classification using single and multi-feature aggregation functions,” in PKDD, 2007, pp. 430–437.

[15] A. J. Knobbe, A. Siebes, and B. Marseille, “Involving aggregate functions in multi-relational search,” in PKDD ’02: Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery. London, UK: Springer-Verlag, 2002, pp. 287–298.

[16] S. D. Toprak, P. Senkul, Y. Kavurucu, and I. H. Toroslu, “A new ILP-based concept discovery method for business intelligence,” in ICDE Workshop on Data Mining and Business Intelligence, April 2007.

[17] R. J. Bayardo Jr., R. Agrawal, and D. Gunopulos, “Constraint-based rule mining in large, dense databases,” Data Mining and Knowledge Discovery, vol. 4, no. 2-3, pp. 217–240, 2000.

[18] A. Srinivasan, R. D. King, S. H. Muggleton, and M. Sternberg, “The predictive toxicology evaluation challenge,” in Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97). Morgan Kaufmann, 1997, pp. 1–6. [Online]. Available: citeseer.ist.psu.edu/108020.html

[19] L. Getoor and J. Grant, “PRL: A probabilistic relational language,” Mach. Learn., vol. 62, no. 1-2, pp. 7–31, 2006.
