Using Domain Knowledge to Optimize the Knowledge Discovery Process in Databases


M. Mehdi Owrang O.
The American University, Department of Computer Science and Information Systems, Washington, D.C. 20016

Modern database technologies process large volumes of data to discover new knowledge. Some large databases make discovery computationally expensive. Additional knowledge, known as domain or background knowledge, can often guide and restrict the search for interesting knowledge. This paper discusses mechanisms by which domain knowledge can be used effectively in discovering knowledge from databases. In particular, we look at the use of domain knowledge to reduce the size of the database for discovery, to optimize the hypotheses which represent the interesting knowledge to be discovered, to optimize the queries used to prove the hypotheses, and to avoid possible redundant and contradictory rule discovery. Some experimental results using the IDIS knowledge discovery tool are provided.

International Journal of Intelligent Systems, Vol. 15, 45–60 (2000). © 2000 John Wiley & Sons, Inc. CCC 0884-8173/00/010045-16

I. INTRODUCTION

Modern database technology involves processing a large volume of data in databases in order to discover new knowledge. Knowledge discovery is defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [1–8]. Many organizations have started to develop or employ tools to discover knowledge from databases. For example, banks are analyzing data to find better rules for credit assessment. Similarly, several systems, e.g., RX [9], have been developed to discover knowledge from medical databases.

Tools specifically designed for knowledge discovery have been released recently. These tools differ substantially in the types of problems they are designed to address and in the ways in which they work. DataLogic/R (Reduct Systems) [10] is a PC-based package, based on "rough sets," that helps the user to ferret out rules that characterize the data in the database and that suggests how to make decisions on categorizing the data for optimum analysis. DataLogic/R provides pattern recognition, modeling, and data analysis techniques that are used to discover new knowledge in the form of rules from the database. IDIS:2, the information discovery system [11,12], also examines databases with the intent of hypothesizing possible rules for explaining the relationship among variables. It can uncover information based on questions no one thought to ask by posing a hypothesis and then testing it for accuracy and relevancy. It concludes with a list of rules in two- and three-dimensional, hypermedia graphs. IDIS uses induction, guided by the user, to assign weights to attributes used in the rules. It finds suspicious entries and unusual patterns automatically, including data items which violate correlations, extreme boundary items, and items which are beyond normal standard deviations. IDIS has been used to discover knowledge in areas as diverse as financial analysis, marketing, scientific discovery, quality control, medical discovery, and manufacturing [11,12].

While promising, these tools are limited in many ways. Some databases are so large that they make the discovery process computationally expensive. The problem of searching for all possible relationships in a database is NP-hard [13,14]. The vastness of the data forces the use of techniques for focusing on specific portions of the data, which requires some additional information about the form of the data and the constraints on it. This information, known as domain or background knowledge, can be defined as any information that is not explicitly presented in the data [1,3–5,12,15,6,16]. A system for knowledge discovery in databases must be able to represent and appropriately use domain knowledge in conjunction with the application of discovery algorithms. Domain knowledge assists knowledge discovery by focusing the search [3,4,14,15]. However, we should be careful in using domain knowledge to narrow the search in a database in order to avoid blocking the discovery of unexpected knowledge.

Domain knowledge has been used in different aspects of knowledge discovery in a few systems. For example, Metadendral [17,16] uses domain knowledge (knowledge of chemistry) heavily for both generating and testing hypotheses (which represent the knowledge to be discovered). Prospector [17,16] uses its domain (geological) knowledge in the same areas as Metadendral. RX [9] employs a little domain knowledge for generating correlations in its medical database, but uses more domain knowledge for testing.

Although the use of domain knowledge in knowledge discovery has been mentioned by researchers [3–5,12,15,16], the literature does not have any detailed discussion of the use of domain knowledge in different aspects of knowledge discovery. In this paper, we discuss the areas in which domain knowledge can be used to reduce the search time in discovering knowledge from databases. In particular, we look at the use of domain knowledge

(1) To reduce the size of the databases.
(2) To optimize the hypotheses, which represent the interesting knowledge to be discovered.
(3) To optimize queries used to prove the hypotheses.
(4) To verify possible contradictory rule discovery.
(5) To avoid possible redundant rule discovery.

II. DOMAIN KNOWLEDGE

Domain or background knowledge can be defined as any information that is not explicitly presented in the database [3,4]. In a medical database, for example, the knowledge "male patients cannot be pregnant" or "male patients do not get breast cancer" is considered to be domain knowledge since it is not contained in the database directly. Similarly, in a business database, the domain knowledge "customers with high income are good credit risks" may be useful even though it is not always true. Other types of knowledge, like interfield knowledge (e.g., experience and salary being positively correlated) and interinstance knowledge, can be related to data, but they are more related to the semantics of the domain [4,5].

Domain knowledge originates from many sources. A data dictionary is the most basic form of domain knowledge [1,4,5]. Typical information in the data dictionary includes the types of attributes, size of attributes, name of attributes, meaning of each attribute, format, constraints, domain of attribute, usage statistics, access control, mapping definitions, etc. [18]. Additional information about the specific analysis objectives may come from the domain expert (although it may be generated automatically from the database [9,5]) and can assume many forms. A few examples include [5]: lists of relevant fields on which to focus for the discovery purposes; definition of new fields (e.g., age = current date − birth date); lists of useful classes or categories of fields or records (e.g., revenue fields: profits, expenses, ...); generalization hierarchies (e.g., A is-a B is-a C); functional or causal models.

Formally, domain knowledge can be represented as $X \Rightarrow Y$ (meaning X implies Y), where X and Y are simple or conjunctive predicates over some attributes in the database. For example, consider the following relations that are part of a medical database:

Patient(Patient#, name, age, place of birth, ...)
Medical-history(Patient#, disease, medication, effects, ...)

Assume that we are trying to discover whether drug X has any effects on patients who have malaria. The available domain knowledge includes "people born in the United States and Europe have had malaria vaccine in their childhood," and "people who had malaria vaccine cannot get malaria," which can be represented as:

$(\text{birth place} = \text{Europe/USA}) \Rightarrow (\text{malaria vaccination} = \text{yes})$
$(\text{malaria vaccination} = \text{yes}) \Rightarrow (\text{getting malaria} = \text{no})$

It is also possible to derive domain knowledge from a set of given domain knowledge. For instance, through transitive dependency, one could establish the new domain knowledge "people born in the United States and Europe cannot get malaria," which can be represented as:

$(\text{birth place} = \text{Europe/USA}) \Rightarrow (\text{getting malaria} = \text{no})$

Let $DK$ be the set of all domain knowledge available for a database. We define $DK^{+}$, the closure of $DK$, as:

$DK^{+} = DK \cup \{\, DK_i \mid DK_i \text{ is implied by } DK \,\}$

That is, the set of all domain knowledge consists of those defined by the domain expert and those that can be derived from the defined domain knowledge. The derivation process can be accomplished by using augmentation and transitive rules, the same way they are used to operate on functional dependencies in a database [18].
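As an illustration of the closure, the following is a minimal Python sketch that derives new domain knowledge by the transitive rule only (augmentation is not covered). The tuple encoding of predicates and the names `closure`, `dk`, and `dk_plus` are assumptions made for this example and are not part of the original paper.

```python
from itertools import product


def closure(dk):
    """Return DK+ under transitivity only: if X => Y and Y => Z, add X => Z.

    Each element of dk is a pair (X, Y) meaning "X implies Y", where X and Y
    are (attribute, value) predicates. This encoding is an assumption.
    """
    closed = set(dk)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so that the set can grow inside the loop.
        for (x, y), (y2, z) in product(list(closed), repeat=2):
            if y == y2 and x != z and (x, z) not in closed:
                closed.add((x, z))
                changed = True
    return closed


dk = {
    (("birth_place", "Europe/USA"), ("malaria_vaccination", "yes")),
    (("malaria_vaccination", "yes"), ("getting_malaria", "no")),
}
dk_plus = closure(dk)
```

Under these assumptions, `dk_plus` contains the two given rules plus the derived rule that people born in Europe/USA cannot get malaria.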

III. USING DOMAIN KNOWLEDGE FOR KNOWLEDGE DISCOVERY

Knowledge discovery is the process of extracting implicit, previously unknown, and potentially useful information from databases [3,4,6]. In practice, however, some databases are so large that even the fastest algorithms for rule discovery can be too expensive to apply to all data. There are several approaches that can be utilized in order to minimize the search effort. In the first approach, the size of the database can be reduced by eliminating the attributes that do not participate in the discovery. Ziarko [20] uses the theory of rough sets for the identification and analysis of data dependencies or cause–effect relationships in databases. He demonstrates how to evaluate the degree of the relationship and identify the most critical factors contributing to the relationship. Identification of the most critical factors allows for the elimination of irrelevant attributes prior to the generation of rules describing the dependency.

Limiting the number of fields alone may not sufficiently reduce the size of the data set, in which case a subset of records must be selected. In the second approach, we can apply the discovery algorithms to a random sample of data. However, the rules discovered in a sample can be invalid on the full data set. Piatetsky-Shapiro [15] presents a formal statistical analysis for estimating the accuracy of sample-derived rules when applied to a full data set.

Finally, in the third approach (taken by the author), additional information, called domain knowledge [1,3–5], can be used to guide and constrain the search for interesting knowledge. The search time can be minimized by reducing the size of the database, optimizing the hypothesis that represents the knowledge to be discovered, and optimizing the queries that are used to process the data to prove the hypothesis.

A. Using Domain Knowledge to Reduce Database Size

Domain knowledge can be used to reduce the size of the database that is being searched for discovery by eliminating data records that are not needed for discovery. Consider a medical database in which simple domain knowledge could be "male patients cannot be pregnant." If the knowledge to be discovered is "whether drug X has effects on pregnant patients," domain knowledge can be used to reduce the size of the database by eliminating the records for male patients from consideration. Other domain knowledge, for example, "female patients under 12 or above 65 cannot be pregnant," can be applied to further reduce the size of the database. To formalize the process, assume that the set of domain knowledge is represented as:

$DK = \{(\text{sex} = \text{female}) \Rightarrow (\text{pregnancy} = \text{yes}),\ (\text{age} > 12) \Rightarrow (\text{pregnancy} = \text{yes}),\ (\text{age} < 65) \Rightarrow (\text{pregnancy} = \text{yes}),\ \ldots\}$

The initial hypothesis (note that the actual hypothesis for discovery may include other attributes of the patients, e.g., race, weight, etc.) can be represented as a rule as follows:

    IF pregnancy = Yes AND drug taken = X
    THEN Effects = Yes

The database reduction algorithm can apply the domain knowledge to the initial hypothesis to create a set of constraints. Basically, for each condition in the hypothesis, the reduction algorithm searches the set of domain knowledge. If the condition is found to be in the Y part of a domain knowledge, then the X part of the domain knowledge is selected as a constraint. The set of constraints can then be used to create an SQL statement to be executed in order to produce the reduced database. For the above hypothesis, the following SQL statement can be created and executed to produce the reduced database.

    SELECT *
    FROM Patient-File
    WHERE sex = 'female' AND age > 12 AND age < 65
    INTO Reduced-Patient-File

Database Reduction Algorithm:

Begin
    Let $C = \{C_k \mid k = 1, \ldots, n\}$, where $C_k$ is a condition in the premise of the hypothesis, and $n$ is the number of conditions.
    Let $DK$ = set of all domain knowledge $\{(C_i, C_j) \mid C_i \Rightarrow C_j,\ i \neq j\}$;
    Let $N$ = set of constraints to be used to reduce the database, initialized to $\emptyset$
    For $k = 1$ to $n$ do
        If $C_k$ in $C$ = $C_j$, such that $(C_i, C_j)$ in $DK$
        Then $N = N \cup \{C_i\}$
End.
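The reduction step can be sketched in Python as follows. This is only an illustrative sketch: conditions are modeled as plain strings, the function names `reduction_constraints` and `reduction_query` and the table name `patient_file` are hypothetical, and the step that stores the result in a new table is left out because its syntax differs between SQL dialects.

```python
def reduction_constraints(premise, dk):
    """Collect the X part of every rule (X, Y) whose Y part is a premise condition."""
    constraints = []
    for condition in premise:
        for x, y in dk:
            if y == condition and x not in constraints:
                constraints.append(x)
    return constraints


def reduction_query(table, constraints):
    """Build a SELECT statement that keeps only records satisfying all constraints."""
    query = f"SELECT * FROM {table}"
    if constraints:
        query += " WHERE " + " AND ".join(constraints)
    return query


# Domain knowledge as (X, Y) pairs, following the pregnancy example in the text.
dk = [
    ("sex = 'female'", "pregnancy = 'yes'"),
    ("age > 12", "pregnancy = 'yes'"),
    ("age < 65", "pregnancy = 'yes'"),
]
premise = ["pregnancy = 'yes'", "drug_taken = 'X'"]

constraints = reduction_constraints(premise, dk)
sql = reduction_query("patient_file", constraints)
```

Conditions with no matching domain knowledge (here `drug_taken = 'X'`) contribute no constraint, which mirrors the algorithm above.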

B. Using Domain Knowledge to Optimize Hypothesis

In addition to reducing the size of the database, domain knowledge can be used to define an optimal hypothesis by eliminating unnecessary conditions in the hypothesis, thereby reducing the search time for discovering interesting knowledge from the database. A discovery process can be guided by specifying the criteria to focus on, although it may be set to move freely through the database to find any pattern or relationship. For each pattern or relationship to be discovered, one would [12]:

(1) Form hypotheses.
(2) Make some queries to the database.
(3) View the result and modify the hypotheses if needed.
(4) Continue this cycle until a pattern emerges.

This task can be automated with a discovery module which can repeatedly query the database until knowledge is discovered. Initially, hypotheses can be formed by domain experts (or by the discovery system automatically) and should aim toward one specific concept. The basic form of representing a hypothesis is the rule representation:

    IF Premise THEN Conclusion

where Premise is a set of conditions (or criteria), ANDed together, defined by the domain expert to focus the search, and the Conclusion will be the discovered knowledge when the Premise is satisfied by the database. In general, there may be some interdependency between or among conditions, so that some conditions are implied by others. These dependencies can be identified by domain knowledge. Subsequently, those conditions that are implied by others may be removed from the hypothesis (since they provide no additional information in knowledge discovery), resulting in a faster discovery process. The following algorithm shows the process for eliminating the unnecessary conditions in a hypothesis.

Hypothesis Optimization Algorithm:

Begin
    Let $DK$ be the set of all defined domain knowledge $\{(C_i, C_j) \mid C_i \Rightarrow C_j,\ i \neq j\}$;
    Let $DK^{+}$ be the closure of $DK$;
    Let $C$ be the set of all conditions in the premise of the hypothesis;
    For every $(C_i, C_j)$ in $DK^{+}$ do
        if $C_i \in C$ and $C_j \in C$
        then $C = C - \{C_j\}$;
End.
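A minimal Python sketch of this algorithm is given here, applied to the CAR hypothesis discussed in this section. The string encoding of conditions and the name `optimize_hypothesis` are assumptions for illustration, and the closure $DK^{+}$ is assumed to have been computed beforehand.

```python
def optimize_hypothesis(premise, dk_plus):
    """Drop every premise condition that is implied by another premise condition.

    dk_plus is a list of (X, Y) pairs meaning "X implies Y". A condition Y is
    removed only while its implying condition X is still in the premise.
    """
    kept = list(premise)
    for x, y in dk_plus:
        if x in kept and y in kept:
            kept.remove(y)
    return kept


premise = [
    "SIZE = subcompact", "CYL = 4", "TURBO = no", "FUELSYS = efi",
    "DISPLACE = small", "COMP = high", "POWER = medium",
    "TRANS = manual", "WEIGHT = light",
]
dk_plus = [
    ("SIZE = subcompact", "WEIGHT = light"),
    ("TURBO = no", "POWER = medium"),
]
optimized = optimize_hypothesis(premise, dk_plus)
```

Because membership is checked against the shrinking list `kept`, two conditions that imply each other do not eliminate one another; one of them always survives.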

To show how the algorithm works, consider the CAR data relation in Figure 1. A collection of cars is described in terms of such attributes as overall length (SIZE), number of cylinders (CYL), presence of a turbocharger (TURBO), type of fuel system (FUELSYS), engine displacement (DISPLACE), compression ratio (COMP), POWER, type of transmission (TRANS), WEIGHT, and mileage (MILEAGE) [20]. Suppose we are interested in factors affecting high car mileage. The full functional dependency means that the mileage of a car is affected by interactions of all or some possible causes represented by attributes contained in the CAR relation. A discovery system (or a domain expert) may start with the following hypothesis represented as a rule:

    IF SIZE = subcompact
    and CYL = 4
    and TURBO = no
    and FUELSYS = efi
    and DISPLACE = small
    and COMP = high
    and POWER = medium
    and TRANS = manual
    and WEIGHT = light
    THEN MILEAGE = high

Figure 1. A sample CAR data relation.

The set of domain knowledge may include:

$(\text{SIZE} = \text{subcompact}) \Rightarrow (\text{WEIGHT} = \text{light})$
$(\text{TURBO} = \text{no}) \Rightarrow (\text{POWER} = \text{medium})$

By applying domain knowledge to the initial hypothesis, conditions 7 (POWER = medium) and 9 (WEIGHT = light) can be removed from the hypothesis. The discovery system will evaluate the hypothesis with respect to actual data and may remove additional irrelevant conditions from the hypothesis in discovering knowledge.

We have done several runs on the CAR relation using a microcomputer-based knowledge discovery tool called IDIS [11]. Figure 2 shows the rules generated by IDIS when all conditions are applied. In the second run, we eliminated the WEIGHT condition because of the existence of domain knowledge (e.g., SIZE implies WEIGHT). Except for rule 2 in Figure 2, the rest of the rules were generated in the second run. We note that rule 2 seems to be another piece of domain knowledge and not a new discovery.

Figure 2. Rules generated by the IDIS tool based on the CAR relation in Figure 1, with mileage as the goal and the rest of the attributes as conditions.

In the third run, the POWER condition was removed using the domain knowledge (e.g., TURBO implies POWER). Except for rule 5 in Figure 2, the rest of the rules were generated in the third run. Note again that if we replace the POWER condition with the TURBO condition, then rule 5 becomes a subsumption of rule 6 (a redundant discovery). Therefore, the absence of rule 5 in the third run is not an indication of blocking the unexpected discovery.

Finally, in the fourth run, the WEIGHT and POWER conditions were both removed from the initial hypothesis. Except for rules 2 and 5, the rest of the rules in Figure 2 were generated. The absence of rules 2 and 5 does not mean the blocking of unexpected discovery, as explained for the second and third runs above.

C. Using Domain Knowledge to Optimize Query Used to Prove Hypothesis

To discover patterns, a discovery system forms the hypotheses, makes queries to the database, views the result, and modifies the hypotheses if needed. This process continues until a pattern emerges. A major component of a discovery system is the database interface. Raw data is selected from the DBMS (using queries) and then processed by the extraction algorithms which produce the discovered patterns. The queries can be posed in SQL, a standard query language for many relational databases [18]. The DBMS interface is where database queries are generated. Domain knowledge can be used to optimize a query used to prove a hypothesis. For example, consider the following data relations in a database:

employee(E#, Ename, title, experience, seniority)
money(title, seniority, salary, responsibilities)

Assume the knowledge to be discovered is "what are the criteria for an employee to earn more than $50,000." An expert may suggest that experience and seniority are the two criteria contributing to having a salary of more than $50,000. The hypothesis may be represented as the following rule:

    IF has experience and has seniority
    THEN earn more than 50,000

To prove (or disprove) the hypothesis, a discovery system may execute the following SQL statement:

    SELECT E.experience, E.seniority
    FROM employee E, money M
    WHERE M.salary >= 50000
      AND E.title = M.title AND E.seniority = M.seniority

Now, assume that we have the following domain knowledge: only level-1 and level-2 managers have a salary of more than 50,000, represented as

$(\text{title} = \text{level-1}) \Rightarrow (\text{salary} \geq 50{,}000)$
$(\text{title} = \text{level-2}) \Rightarrow (\text{salary} \geq 50{,}000)$

We can use this domain knowledge to minimize our search by eliminating the unnecessary join operation. Basically, for each condition in the hypothesis, the query optimization algorithm searches the set of domain knowledge. If the condition is found to be in the Y part of a domain knowledge, then the X part of the domain knowledge will replace the condition and the unnecessary join operation will be removed from the query. Because only these two titles reach that salary, the title test selects the same employees as the salary test. The optimized SQL statement for the above example would be:

    SELECT experience, seniority
    FROM employee
    WHERE title = 'level-1' OR title = 'level-2'

The following algorithm shows the process for the optimization of the query used to prove/disprove a hypothesis.

Query Optimization Algorithm:

Begin
    Let $C = \{C_k \mid k = 1, \ldots, n\}$, where $C_k$ is a condition in the Where clause of the SQL statement used to prove (or disprove) the hypothesis, and $n$ is the number of conditions.
    Let $DK$ = set of all domain knowledge $\{(C_i, C_j) \mid C_i \Rightarrow C_j,\ i \neq j\}$;
    For every domain knowledge $(C_i, C_j)$
        If $C_k$ in $C$ = $C_j$
        Then Replace the condition $C_k$ with $C_i$;
             Remove the unnecessary join condition from the SQL statement;
End.
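The rewriting can be sketched in Python under some simplifying assumptions that are not in the original: each predicate is a `(table, expression)` pair, each join is a `(table_a, table_b, expression)` triple, alternative X parts for the same Y are combined with OR, and a join is dropped when one of its tables is no longer referenced. The function name `optimize_query` is hypothetical.

```python
def optimize_query(columns, base_table, predicates, joins, dk):
    """Replace predicates matched by domain knowledge and drop unneeded joins."""
    rewritten = []
    for pred in predicates:
        # All X parts whose Y part equals this predicate; keep pred if none match.
        alternatives = [x for x, y in dk if y == pred]
        rewritten.append(alternatives or [pred])

    # Tables still referenced after rewriting.
    used = {base_table} | {table for group in rewritten for table, _ in group}
    kept_joins = [j for j in joins if j[0] in used and j[1] in used]

    where = ["(" + " OR ".join(expr for _, expr in group) + ")" for group in rewritten]
    where += [expr for _, _, expr in kept_joins]

    query = f"SELECT {', '.join(columns)} FROM {', '.join(sorted(used))}"
    if where:
        query += " WHERE " + " AND ".join(where)
    return query


salary = ("money", "salary >= 50000")
dk = [
    (("employee", "title = 'level-1'"), salary),
    (("employee", "title = 'level-2'"), salary),
]
joins = [("employee", "money",
          "employee.title = money.title AND employee.seniority = money.seniority")]

optimized = optimize_query(["experience", "seniority"], "employee", [salary], joins, dk)
unoptimized = optimize_query(["experience", "seniority"], "employee", [salary], joins, [])
```

With the domain knowledge supplied, the salary predicate is replaced by the two title tests and the `money` table disappears from the query; without it, the join is retained.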

D. Using Domain Knowledge to Verify Possible Contradictory Rule Discovery

Domain knowledge can be used to test the validity of the discovered knowledge. In general, domain knowledge can be used to verify whether seemingly contradictory discovered knowledge is indeed contradictory or whether seemingly consistent discovered knowledge is, in fact, inaccurate. For example, consider our CAR relation. Suppose one is interested in finding what affects highway mileage. A discovery system may discover the following knowledge:

    RULE 1. If CarModel = Honda AND Cylinders = 4
            Then Mileage = High

    RULE 2. If CarModel = Honda AND Cylinders = 4
            Then Mileage = Low

At first glance, it seems that the two discovered rules are contradictory. However, we have the available domain knowledge that cars produced after 1980 have special features that cause better performance and better mileage. Thus, domain knowledge verifies that the discovered knowledge is accurate rather than contradictory.

This brings up an interesting question as to whether we could use the domain knowledge in defining a more accurate hypothesis in order to avoid generating rules that would otherwise seem contradictory. The basic idea is to expand the hypothesis by adding more conditions based on the available domain knowledge. The process is to examine the set of available domain knowledge and find any that involve the goal defined for the discovery. In the above CAR example, let us assume we have the following domain knowledge:

$(\text{Car Year} > 1980) \Rightarrow (\text{Mileage} = \text{High})$

Subsequently, we (or the discovery system) should include the Car Year attribute in the hypothesis. Then, we may get the following rules that do not seem to be contradictory.

    RULE 1. If CarModel = Honda AND Cylinders = 4 AND Car Year > 1980
            Then Mileage = High

    RULE 2. If CarModel = Honda AND Cylinders = 4 AND Car Year ≤ 1980
            Then Mileage = Low
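The two steps described in this section, spotting apparently contradictory rules and finding attributes that domain knowledge ties to the goal, can be sketched in Python as follows. The rule encoding and the names `contradictory_pairs` and `attributes_to_add` are assumptions made for illustration only.

```python
def contradictory_pairs(rules):
    """Return pairs of rules with identical premises but different conclusions.

    Each rule is (premise, conclusion): premise is a frozenset of
    (attribute, value) pairs and conclusion is one (attribute, value) pair.
    """
    pairs = []
    for i, (p1, c1) in enumerate(rules):
        for p2, c2 in rules[i + 1:]:
            if p1 == p2 and c1[0] == c2[0] and c1[1] != c2[1]:
                pairs.append(((p1, c1), (p2, c2)))
    return pairs


def attributes_to_add(goal_attr, hypothesis_attrs, dk):
    """List attributes that domain knowledge links to the goal but the hypothesis lacks.

    dk holds (X, Y) pairs; X and Y are (attribute, operator, value) triples.
    """
    return [x[0] for x, y in dk
            if y[0] == goal_attr and x[0] not in hypothesis_attrs]


premise = frozenset({("CarModel", "Honda"), ("Cylinders", 4)})
rules = [(premise, ("Mileage", "High")), (premise, ("Mileage", "Low"))]
dk = [(("CarYear", ">", 1980), ("Mileage", "=", "High"))]

conflicts = contradictory_pairs(rules)
missing = attributes_to_add("Mileage", {"CarModel", "Cylinders"}, dk)
```

In this example the two Honda rules are reported as one conflicting pair, and `CarYear` is suggested as the attribute to add to the hypothesis.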

E. Using Domain Knowledge to Avoid Possible Redundant Rule Discovery

Databases normally contain redundant data and definitions that could lead to discovering redundant rules. The redundant data/definitions are generally different syntactically. For instance, consider the CAR relation in Section IV.A. The relation contains the attributes Engine Size, Bore, Stroke, and Cylinder among other attributes. The redundant attribute Engine Size is defined as:

$\text{Engine Size} = \text{Bore} \times \text{Stroke} \times \text{Cylinder}$

In our discovery experiment, we defined High MPG as the goal and the rest of the attributes as the premise. The discovery tool IDIS discovered rules relating Engine Size to High MPG as well as rules relating (Bore, Stroke, Cylinder) to High MPG. Obviously, the discovered rules based on Engine Size and (Bore, Stroke, Cylinder) are syntactically different, but they are semantically identical.

We can define the redundant information in the database as domain knowledge and apply it in the discovery process in order to avoid generating rules that are syntactically different but semantically equivalent. Before knowledge discovery, the user (or the discovery system) should check the available domain knowledge to find domain knowledge that has attributes involved in the discovery hypothesis. If there is such domain knowledge, then only the attributes on one side of the domain knowledge should be included in the discovery process. For the above CAR relation, we could use the Engine Size attribute or the (Bore, Stroke, Cylinder) attributes in the discovery process. The choice depends on whether we are looking to generate more general rules or more detailed rules.

The advantage of using this process is not only avoiding redundant rules, but also generating rules that are more meaningful. In our experiment, IDIS generated rules for High MPG based on (Engine Size and Bore alone) and (Engine Size and Stroke alone), which do not seem to be meaningful since none of the attributes Bore, Stroke, or Cylinder by itself has any connection with Engine Size.
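The attribute selection described above might be sketched in Python as below. The dictionary encoding of a definition such as Engine-Size = Bore × Stroke × Cylinders, the `prefer` flag, and the name `select_attributes` are all hypothetical choices for this example.

```python
def select_attributes(attributes, derived, prefer="derived"):
    """Keep only one side of each redundant definition.

    derived maps a derived attribute to the list of attributes it is computed
    from. prefer="derived" keeps the derived attribute (more general rules);
    any other value keeps its components (more detailed rules).
    """
    selected = list(attributes)
    for target, sources in derived.items():
        if target in selected and all(s in selected for s in sources):
            drop = sources if prefer == "derived" else [target]
            selected = [a for a in selected if a not in drop]
    return selected


attributes = ["Engine-Size", "Bore", "Stroke", "Cylinders", "Weight"]
derived = {"Engine-Size": ["Bore", "Stroke", "Cylinders"]}

general = select_attributes(attributes, derived, prefer="derived")
detailed = select_attributes(attributes, derived, prefer="components")
```

Either choice removes the overlap, so rules that mix Engine-Size with one of its own components can no longer be generated.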

We should note that the discussion provided here is an expansion of the hypothesis optimization process given in Section III.B. However, we created a separate section for clarity.

IV. EVALUATION OF USING DOMAIN KNOWLEDGE

A. Experimental Results

We have done several experiments on the following Car relation, with 26 attributes and 205 records, using the IDIS discovery tool on a 486 IBM-compatible PC. We were interested in discovering the relationship between highway mileage and the rest of the attributes. The following discussions show our findings on the Car relation without and with the use of domain knowledge. We should note that the current discovery tools lack the ability to represent domain knowledge and use it automatically in discovering the rules. The use of domain knowledge is handled manually by eliminating any irrelevant attributes from consideration in the discovery process when defining a particular discovery case.

CAR(Symboling, Losses, Make, Fuel-Type, Aspiration, Doors, Body, Drive, Engine-Loc, Wheel-Base, Length, Width, Height, Weight, Engine-Type, Cylinders, Engine-Size, Fuel-Sys, Bore, Stroke, Compress, Horse-Power, Peak-RPM, City-MPG, High-MPG, Price)

1. Results without Using Domain Knowledge

(1) The discovery process was too slow. It took 2 1/2 days to generate 121 rules. The reason was that the discovery process had to consider all possible combinations of attributes even though some of them were inappropriate (i.e., price of the car, which is not related to highway mileage).

(2) Most of the generated rules were uninteresting and/or known facts. For example, the tool discovered that "the smaller the Engine-Size, the better High-MPG" (which is a trivial discovery since it is a known fact or a piece of domain knowledge). Similarly, the discovered rule "the more expensive the car, the better High-MPG" seems to be uninteresting since there is no relationship between the price of the car and the highway mileage.

(3) Some of the discovered rules were redundant. In general, databases have redundant attributes that could lead to the discovery of redundant rules. In the Car relation, for example, we have the attribute Engine-Size, which is the same as $\text{Bore} \times \text{Stroke} \times \text{Cylinders}$. The discovery tool discovered rules relating highway mileage to Engine-Size and rules relating highway mileage to Bore, Stroke, and Cylinders. Thus, the rules relating highway mileage to Bore, Stroke, and Cylinders appear to be redundant.

2. Results Using Domain Knowledge

In this experiment, we eliminated some of the attributes (i.e., Price, Doors) from consideration in the discovery process, based on the available domain knowledge. Some of the domain knowledge was:

• The smaller the Engine-Size, the better the High-MPG.
• The lighter the car, the better the High-MPG.
• Price of the car is not related to High-MPG.
• $\text{Engine-Size} = \text{Bore} \times \text{Stroke} \times \text{Cylinders}$.

When domain knowledge was used, the discovery process was much faster (it took 3 h). Also, the generated rules were fewer but more interesting and nontrivial.

In other experiments, as noted in Section III.B, our runs did not show any blocking of unexpected discovery when domain knowledge was used.

B. Avoid Blocking Unexpected Discovery

The main purpose of using domain knowledge is to bias the search forinteresting patterns. This can be achieved by focusing the discovery on portionsof the data. The benefits are greater efficiency and more relevant discoveries.Too much reliance on domain knowledge, however, may unduly constrain theknowledge discovery and may block unexpected discovery by leaving portions ofthe database unexplored. Consider the following hospital data file:

hospital(pa, pname, age, diagnosis, drugs, effects, ...)

Assume that the knowledge to be discovered is ‘‘the effects of drug X on patients with heart disease’’ and the domain knowledge is ‘‘People under 20 do not have heart disease.’’ This domain knowledge helps us to reduce the size of our database by eliminating the records for patients under 20. Suppose the discovered knowledge is:

Drug X has such and such effects on people with heart disease

If we avoid using domain knowledge, the knowledge discovery system may find a more reasonable result such as:

Drug X has these effects on people over 20 with heart disease ...

and these effects on people under 20 with heart disease ...

Excluding this domain knowledge during discovery may help to classify the data more efficiently. For example, our data may support that drug X has different effects on people under 20 and over 20. However, due to the elimination of part of the database (records for patients under 20), the discovery scheme cannot find enough data to support this. In another example, if we use the domain knowledge ‘‘male patients do not get breast cancer’’ for the hypothesis ‘‘effects of drug X on patients with breast cancer,’’ we may never discover that male patients can have breast cancer (an unexpected discovery, as found in Ref. 21).
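The blocking effect in the hospital example can be made concrete with a small sketch. The records, the effect labels, and the helper `effects_by_age_group` are all hypothetical; the point is only that filtering on the domain knowledge removes the very group whose behavior would have been the unexpected discovery.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set


def effects_by_age_group(records: List[Dict[str, Any]], cutoff: int = 20) -> Dict[str, Set[str]]:
    """Collect the observed effects of drug X on heart-disease patients, per age group."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for rec in records:
        if rec["diagnosis"] != "heart disease" or rec["drug"] != "X":
            continue
        label = "under 20" if rec["age"] < cutoff else "20 and over"
        groups[label].add(rec["effect"])
    return dict(groups)


# Invented sample of the hospital file; effect values are placeholders.
patients = [
    {"age": 17, "diagnosis": "heart disease", "drug": "X", "effect": "effect A"},
    {"age": 45, "diagnosis": "heart disease", "drug": "X", "effect": "effect B"},
    {"age": 62, "diagnosis": "heart disease", "drug": "X", "effect": "effect B"},
    {"age": 30, "diagnosis": "asthma", "drug": "Y", "effect": "effect C"},
]

# Without domain knowledge: both age groups are visible to the discovery step.
full = effects_by_age_group(patients)

# With "people under 20 do not have heart disease": those records are removed
# before discovery, so the differing effect on younger patients cannot be found.
reduced = effects_by_age_group([p for p in patients if p["age"] >= 20])
```

Here `full` contains an entry for each age group, while `reduced` contains only the "20 and over" group.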


There are several things that we can do to improve the effective use of domain knowledge in knowledge discovery and to avoid blocking unexpected discovery. First, the domain expert can assign a confidence factor to each item of domain knowledge and use it only if its confidence factor is greater than a specified threshold. The confidence factor assigned to an item of domain knowledge depends on how close that knowledge is to the established facts. For instance, given known facts, domain knowledge such as ‘‘males cannot be pregnant’’ should get a higher confidence factor than the domain knowledge that ‘‘females over 65 or under 12 may not be pregnant,’’ as the former can be medically proven to be true, whereas in the latter there may be a slight chance that female patients under 12 or above 65 can get pregnant. The domain expert also needs to define a mechanism for calculating the confidence factor of domain knowledge that is derived from the given domain knowledge.
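A minimal sketch of the confidence-factor guideline is given below. The class `DomainRule`, the function `usable_rules`, the numeric confidence values, and the threshold of 0.95 are assumptions chosen for illustration; the text prescribes only that a rule is used when its confidence factor exceeds a specified threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DomainRule:
    statement: str
    confidence: float  # expert-assigned confidence factor in [0, 1] (assumed scale)


def usable_rules(rules: List[DomainRule], threshold: float) -> List[DomainRule]:
    """Keep only the rules whose confidence factor is greater than the threshold."""
    return [rule for rule in rules if rule.confidence > threshold]


# Confidence values are hypothetical; only their ordering follows the text.
rules = [
    DomainRule("Males cannot be pregnant", 1.0),
    DomainRule("Females over 65 or under 12 may not be pregnant", 0.9),
]

selected = usable_rules(rules, threshold=0.95)
```

With this threshold only the first rule is used to restrict the search, so the records for female patients under 12 or over 65 stay available to the discovery process.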

Second, discovered knowledge is rarely true across all the data. It is important to represent and convey the degree of certainty so that the system or user can decide how much confidence to put into a discovery. Certainty involves several factors, including the integrity of the data, the size of the sample on which the discovery is performed, and perhaps the degree of support from available domain knowledge (Ref. 4). Therefore, if the size of the database is reduced too drastically after using some domain knowledge, we may consider using less domain knowledge, or none of it, in order to avoid blocking unexpected discovery results. Otherwise, the discovered knowledge may not have a high enough confidence factor to be considered interesting.
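One way to guard against an overly drastic reduction is sketched below. The function `reduce_with_guard` and its `min_fraction` parameter are hypothetical, and the value 0.5 is an arbitrary assumption, since the text gives no numeric limit. Each item of domain knowledge is modeled as a predicate that says which records to keep.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Filter = Tuple[str, Callable[[Record], bool]]


def reduce_with_guard(records: List[Record], filters: List[Filter],
                      min_fraction: float = 0.5) -> Tuple[List[Record], List[str]]:
    """Apply domain-knowledge filters in order, skipping any filter that would
    leave fewer than min_fraction of the original records."""
    floor = min_fraction * len(records)
    kept = list(records)
    applied: List[str] = []
    for name, keep in filters:
        candidate = [rec for rec in kept if keep(rec)]
        if len(candidate) >= floor:
            kept = candidate
            applied.append(name)
    return kept, applied


# Ten invented records with ages 0, 10, ..., 90.
records = [{"age": age} for age in range(0, 100, 10)]
filters = [
    ("age >= 20", lambda rec: rec["age"] >= 20),  # leaves 8 of 10: accepted
    ("age >= 70", lambda rec: rec["age"] >= 70),  # would leave 3 of 10: skipped
]

kept, applied = reduce_with_guard(records, filters)
```

In this run only the first filter is applied, because the second would shrink the sample below half of the original database.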

Third, using too much domain knowledge can produce a specialized discovery scheme that can be more efficient than any general scheme in its domain, but will not be useful outside its domain (Refs. 4, 15). Domain knowledge can be used more effectively by developing a general scheme for knowledge discovery and then augmenting it with the specific domain knowledge (Ref. 4). The interface (integration) of the general purpose discovery scheme and domain knowledge may require a set of rules that can recommend when and how much of the domain knowledge to use in different phases (e.g., creation of the hypothesis, querying the database, modifying the hypothesis) of the general purpose discovery scheme. The hypothesis optimization algorithm given in Section III.B, for example, can be part of such an interface and can automatically be applied to every hypothesis generated by the general purpose discovery scheme. The interface should have a mechanism for reducing the size of the database (if possible) by using all the available domain knowledge efficiently (e.g., perhaps by using the criteria explained in this section and other criteria). The reduced database should be provided to the general purpose discovery scheme for knowledge discovery.

V. CONCLUSION

Databases are becoming larger, and they continue to contain incomplete and inaccurate data, which makes knowledge discovery more difficult. Domain knowledge can provide assistance in different aspects of knowledge discovery. We have discussed the benefits of using domain knowledge to constrain the search when discovering knowledge from databases. Domain knowledge can be used to reduce the search by reducing the size of the databases, to reduce the size of the hypotheses by eliminating unnecessary conditions from them, and to remove unnecessary operations from a query that is used to process the data to prove (or disprove) the hypotheses.

The problem with the use of domain knowledge in knowledge discovery is the likelihood of blocking unexpected discovery. This may happen when too much domain knowledge is used, resulting in a large reduction of the data in the database or of the conditions in the hypothesis. Subsequently, the discovered knowledge may not have a high enough confidence to be considered interesting. We defined some guidelines for using domain knowledge effectively to avoid some of the above problems. In particular, we suggest assigning confidence factors to domain knowledge and using it only when these confidence factors are high enough according to the user's specification. In addition, we recommend using domain knowledge only when its use does not lead to a major reduction in the database, in order to avoid having too few sample data for discovery or missing interesting data. Finally, domain knowledge should be used as a separate and supplementary resource to general knowledge discovery systems, making these systems more efficient yet domain independent, rather than being used directly in developing the general methods for knowledge discovery. Currently, we are studying this aspect of domain knowledge utilization in the discovery process.

A major area in which the use of domain knowledge may be beneficial is the validation of the discovered knowledge. One possible scheme is to check whether the discovered knowledge contradicts the available domain knowledge. (In some cases, there may not be any applicable domain knowledge to use for the evaluation.) If a contradiction is found, then either the domain knowledge or the discovered knowledge (or both) is wrong. If the discovered knowledge does not contradict the domain knowledge, then we may have some confidence in its accuracy. Currently, we are investigating this aspect of the use of domain knowledge in knowledge discovery.
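The validation scheme outlined above might look like the following sketch. The encoding of a rule as an (attribute, target, direction) triple, the direction symbols, and the function `validate` are assumptions made for illustration; the paper does not fix a representation. The sample domain knowledge reuses the Car relation from the experiments.

```python
from typing import Dict, Tuple

# (attribute, target) -> direction of the relationship:
# "+" target increases with attribute, "-" target decreases, "0" unrelated.
DomainKnowledge = Dict[Tuple[str, str], str]
Rule = Tuple[str, str, str]


def validate(rule: Rule, domain: DomainKnowledge) -> str:
    """Compare one discovered rule against the available domain knowledge."""
    attribute, target, direction = rule
    known = domain.get((attribute, target))
    if known is None:
        return "no applicable domain knowledge"
    return "consistent" if known == direction else "contradiction"


domain = {
    ("Engine-Size", "High-MPG"): "-",  # smaller engine, better mileage
    ("Weight", "High-MPG"): "-",       # lighter car, better mileage
    ("Price", "High-MPG"): "0",        # price is not related to mileage
}

results = {
    "price": validate(("Price", "High-MPG", "+"), domain),
    "engine": validate(("Engine-Size", "High-MPG", "-"), domain),
    "doors": validate(("Doors", "High-MPG", "+"), domain),
}
```

A "contradiction" result does not say which side is wrong; it only flags that the domain knowledge, the discovered rule, or both need to be re-examined.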

In the future, we need to define precisely the significance and role of domain knowledge in knowledge discovery, identifying the sources and mechanisms for acquiring domain knowledge from domain experts or from automated discovery tools. We also need to consider knowledge representation and manipulation in the discovery process so that domain knowledge can be used effectively. Furthermore, we need to define other mechanisms to guarantee that unexpected discovery is not blocked when domain knowledge is used to narrow the search in the databases. Finally, we should see how domain knowledge can be used to make the discovered knowledge more understandable to end users.

References

1. Agrawal R, Imielinski T, Swami A. Database mining: A performance perspective. IEEE Trans Knowledge Data Eng 1993;5(6):914–925.

2. Brachman RJ, Anand T. The process of knowledge discovery in databases. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, editors. Advances in knowledge discovery and data mining. Menlo Park, CA: AAAI Press/MIT Press; 1996. pp 37–57.

3. Fayyad UM, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery: An overview. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, editors. Advances in knowledge discovery and data mining. Menlo Park, CA: AAAI Press/MIT Press; 1996. pp 1–34.

4. Frawley WJ, Piatetsky-Shapiro G, Matheus CJ. Knowledge discovery in databases: An overview. AI Mag, Fall 1992;14(3):57–70.

5. Matheus CJ, Chan PK, Piatetsky-Shapiro G. Systems for knowledge discovery in databases. IEEE Trans Knowledge Data Eng 1993;5(6):903–913.

6. Uthurusamy R. From data mining to knowledge discovery: Current challenges and future directions. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, editors. Advances in knowledge discovery and data mining. Menlo Park, CA: AAAI Press/MIT Press; 1996. pp 37–57, 561–569.

7. Vasant D, Tuzhilin A. Abstract-driven pattern discovery in databases. IEEE Trans Knowledge Data Eng 1993;5(6):926–938.

8. Yoon JP, Kerschberg L. A framework for knowledge discovery and evolution in databases. IEEE Trans Knowledge Data Eng 1993;5(6):973–979.

9. Blum RL. Induction of causal relationships from a time-oriented clinical database: An overview of the RX project. Proc of the Second National Conf on Artificial Intelligence, Pittsburgh, PA. Menlo Park, CA: AAAI Press. pp 355–357.

10. Szladow A. Datalogic/R: Mining the knowledge in databases. PC AI Jan./Feb. 1993;25, 40–41.

11. IDIS user’s manual. Los Angeles, CA: Intelligenceware, Inc. 1990.

12. Parsaye K, Chignell M, Khoshafian S, Wong H. Intelligent data base and automatic discovery. In: Soucek B, the IRIS Group, editors. Neural and intelligent systems integration. New York: John Wiley & Sons; 1991.

13. Hong J, Mao C. Incremental discovery of rules and structure by hierarchical and parallel clustering. Knowledge discovery in databases. Menlo Park, CA: AAAI Press/MIT Press; 1991. pp 177–194.

14. Long JM, Irani EA, Slagle JR, POSCH Group. Automating the discovery of causal relationships in a medical records database. Knowledge discovery in databases. Menlo Park, CA: AAAI Press/MIT Press; 1991. pp 465–476.

15. Piatetsky-Shapiro G. Discovery, analysis, and presentation of strong rules. Knowledge discovery in databases. Menlo Park, CA: AAAI Press/MIT Press; 1991. pp 229–248.

16. Walker MG. How feasible is automated discovery? IEEE Expert Spring 1987;69–82.

17. Buchanan BG, Feigenbaum EA. Dendral and meta-dendral: Their applications dimension. Artif Intell 1978;11(1):5–24.

18. Date CJ. An introduction to database systems. Vol. I, 5th ed. Reading, MA: Addison-Wesley; 1990.

19. Chiang RHL, Barron TM, Storey VC. Extracting domain semantics for knowledge discovery in relational databases. AAAI workshop on knowledge discovery in databases. Seattle, WA: AAAI Press; 1994. pp 299–310.

20. Ziarko W. The discovery, analysis, and presentation of data dependencies in databases. Knowledge discovery in databases. Menlo Park, CA: AAAI Press/MIT Press; 1991. pp 195–209.

21. Hayward J. Hormones and human breast cancer. Berlin/New York: Springer-Verlag; 1970.