
KNOWLEDGE REPRESENTATION FOR EXPERT SYSTEMS

MAREK PETRIK

Abstract. The purpose of this article is to summarize the state of the art of the expert systems research field. First, we introduce the basic notion of knowledge, and specifically of shallow knowledge and deep knowledge.

The first section of the document summarizes the history of the field. We analyze the differences between the first generation of expert systems, based primarily upon rule-based and frame-based representation of shallow knowledge. We concentrate mainly on the most important expert systems and their impact on subsequent research. These are the traditional Mycin and Prospector expert systems, but also less famous ones such as the General Problem Solver, the Logic Theory Machine, and others. Finally, we present some modern expert systems and shells such as Gensym's G2, and also some light-weight Prolog-based expert systems, usually based on deep knowledge of the domain.

In the fourth section, we compare various knowledge representation languages. We briefly describe each, present some inference techniques, and discuss the primary upsides and downsides. For each, we finally present successful expert systems and shells using the language. As for shallow knowledge, we review mainly rule-based and frame-based knowledge representation languages. We argue why these are not very suitable for modeling the complex relationships present in many real-world applications, and therefore not suitable for deep knowledge representation. Subsequently, we present early semantic networks as the first attempt to model deep knowledge. Then, in more depth, we analyze the approaches based on simplifications and extensions of traditional logic: propositional logic, first-order predicate logic, modal logic, and finally logic programming (Prolog). We then continue with constraint programming. Next we discuss nonmonotonic knowledge representation languages, such as answer set programming and default logic. At last we analyze the representation of knowledge for continuous domains, mostly addressed by qualitative and semi-qualitative simulation.

The fifth section presents various uncertainty measures and their combinations with the previously mentioned representation languages. First we explain why we need to represent uncertainty. Then we present the properties generally required of such a measure. We follow with classic probability theory and its combinations with propositional logic, first-order predicate logic, modal logic, and logic programs. We discuss the most popular representation model, Bayesian belief networks. We point out the properties and reasons why probability is a measure that works perfectly for statisticians, but is not completely satisfactory for many artificial intelligence domains. We continue with Dempster-Shafer theory. We introduce the Transferable Belief Model that employs this measure. Next we present possibility theory and its combination with predicate logic, known as fuzzy logic. We explain why this seems to be a very popular choice for simple systems and why it seems unsuitable for large and complex expert systems. We also present systems trying to combine rule-based systems with neural networks.

1. Introduction

Even before the conception of the first computer, people dreamt of creating an intelligent machine. With the arrival of the first computers, it seemed the idea would soon be materialized. Despite the tremendous growth in the processing power of modern computers, their intelligence, as understood by most people, remains very limited. And it is only in limited domains that computers have achieved the most success. We now live in a world where the best chess player is not a human, and hardly anyone would take their chances building a bridge without computer aid.

Key words and phrases. Expert Systems, Uncertainty, Knowledge Representation, Adaptive Systems, Multiagent Systems.



The feature of computers people value the most is their precision. Computers do not make mistakes. Computers are made to deal with perfect information, and they do that well. Unfortunately, the world is not perfect; incomplete, noisy, or misleading data is present in every aspect of our lives. That is why computers are usually able to make decisions only with heavy support from human operators. The main research field that tries to overcome this serious drawback is the domain of Expert Systems (ES). The crucial question for every such system is how to represent and acquire knowledge. This is the question we try to answer in this article.

2. History

We will not try to introduce the general concept of expert systems; we only point interested readers to [Girratano, 1998], which provides an excellent overview. In the following we use this definition of an expert system.

An Expert System (knowledge-based system, ES) is a reasoning system that performs comparably to or better than a human expert within a specified domain. We further extend the term ES to also cover traditional AI agents that are more knowledge-intensive than computationally intensive.

Expert systems have experienced tremendous success in the last two decades. For example, one of the most successful companies developing ES is Gensym Corporation, the creator of the G2 system. By 1995, all of the 10 biggest world corporations were using the G2 expert system at some point of their operations [Jozef Kelemen, 1996]. In the next section we chronologically describe the characteristics of expert systems in the various eras. This overview is mostly based upon [Girratano, 1998] and [Eric J. Horvitz, 1988].

2.1. Beginnings. Probably the first expert systems were built in the early 1960s. One of the most famous was the General Problem Solver, built by Newell and Simon in 1961. It was built to imitate human problem-solving protocols. The system had a goal, which could be achieved by achieving a series of subgoals. These were chosen by an arbitrary heuristic function. In fact, many of the subsequent expert systems were based on human cognition processes.

2.2. Probabilistic Systems. The first probabilistic ES were conceived for diagnostic problems. This was a very early approach influenced by the already well-developed decision theory, which deals with probability and utility theory. Because of the exponential inference complexity, conditional independence among the variables was assumed in many cases, which created some concerns among opponents of the approach. Despite the questionable assumptions, these systems actually outperformed human experts in some domains. For example, the system of de Dombal and his colleagues averaged over 90% correct diagnoses of acute abdominal pain, where expert physicians were averaging 65%-80% correct.

These results were very surprising, because the models not only used simplifying assumptions, but were also given only a fraction of the knowledge available to human experts.

2.3. Production Systems. Unfortunately, the enthusiasm for ES based strictly on probability faded. The reasons were various. Some of them were reasonable; others stemmed from the poor user interfaces of the early systems. The main reason was the computational explosion of systems using more complex representations, which confined these systems to very small and thus impractical domains.

In the early 1970s, a new and promising approach emerged. Since the exact methods are usually intractable, the requirement of precision was relaxed to gain the possibility of reasoning with extremely large expert knowledge bases. Simple if-then rules emerged as a very useful tool for expressing human cognitive processes. Research found it desirable to have a graded measure of uncertainty, which could be used to compare the strength of evidence and hypotheses. The most famous systems of the era were Mycin and Prospector. The same trends could be observed in other domains, such as combinatorial optimization problems [Anant Singh Jain, 1998, Jones and Rabelo, 1998].


2.4. Modern Systems. Expert systems are a reality today. They generate huge revenues and make companies successful [Jozef Kelemen, 1996]. The perception of most of the audience is in many cases the strict opposite. Since diagnoses are still made almost exclusively by human doctors, this fact is taken to mean that expert systems have so far failed to fulfill their promise. This is not entirely accurate. Since the mid-1980s, a shift away from the medical domain became apparent. Money is usually sparse in the ordinary medical domain, which leads to decreased interest from industry (cite: ???). Moreover, there are a lot of unanswered ethical questions and doubts. As a result, expert systems are almost invisible in medical domains. Business is taking the greatest advantage of expert systems, mostly financial, production, and research companies, respectively.

Despite the promising outlook in the beginning, rule-based ES achieved only limited success. Rule engines are still somewhat popular, mostly for simple domains. Some of the most popular are:

(1) CLIPS - a simple rule engine developed by NASA, with no uncertainty measure.
(2) Jess - a CLIPS descendant coded entirely in Java.
(3) ABLE - IBM's Agent Building and Learning Environment. Besides a standard rule engine with fuzzy measures, it offers the possibility to use neural networks, decision trees, Bayes networks, and other common decision and learning techniques.
(4) iLog Rules - a commercial shell for creating business rules.

Rule systems suffer both from a lack of expressivity and from the increased performance of modern computer systems, which makes more expressive representations tractable. This leads to their continually decreasing popularity. Other possibilities for representing and acquiring knowledge will be presented in the following sections.

Most expert systems are built either in-house or by specialized contractors. Since each domain is very different, most companies specialize in only a part of the market. General expert system shells are very rare and are developed only by very large companies. Probably the most successful general expert system producer is Gensym, the producer of the world-famous G2 ES shell. Unfortunately for research, their technologies are very well guarded. On the other hand, most of the simpler systems are built using standard tools, such as Prolog, Lisp, or even Visual Basic, Java, or .NET. This is also erasing the distinction between standard computer systems and expert systems. The main distinction is that expert systems are usually more free in managing their knowledge and making decisions based on it. While standard computer systems are programmed to make decisions, expert systems are programmed how to make decisions from knowledge. This leads to a separation of the control and knowledge parts of the system.

3. Knowledge

As with most basic terms, it is hard to define precisely what knowledge means. In the scope of this work, we understand knowledge as the ability to reason about a consistent environment, and to forecast and hypothesize in it. A consistent environment is an environment whose rules are time-invariant. Rules may be either first-order or of any higher order, enabling us to have rules about rules.

Some of the modern trends in Artificial Intelligence, such as Multi-agent Systems, criticize the failure of traditional artificial intelligence to achieve breakthrough results during its half-century of research. They claim that the world is the best representation of itself, and as an example they present the famous Subsumption Architecture. They also reason that "artificial" knowledge representation schemes are doomed to failure. We think that using the real world as a representation of itself may be a very precise model, but too complex to be practical. Usually, the richer the representation, the harder the reasoning with it. This is why we model human language with context-free grammars, and why we use linear or polynomial functions to approximate functions that are much more complex. This is also the reason why we use simple functions as heuristics to direct us in search.


We look at knowledge only as a heuristic that does not necessarily measure the world exactly, but instead offers an approximation. Obviously, there is hardly a representation that is superior to all others on all domains, because each domain imposes different requirements on the quality, structure, and conditions of the representation scheme.

A basic concept in the ES domain is the term knowledge engineering. This is the process of encoding human knowledge into a formalized framework. Many of the knowledge representation schemes were developed mainly to ease the development and maintenance of extensive knowledge bases. We will deal with these issues only marginally, and focus mostly on expressive power, stability, scalability, and inference performance.

Generally, knowledge can be divided into the two following main categories. Procedural knowledge is the kind that has been exploited the most by computer systems. Every program represents procedural knowledge of how to achieve the desired result. Although it is enough for most simple applications, it is not satisfactory when dealing with partly or fully undefined situations. This is the reason why computers crash and cannot write an essay for you. Relational knowledge describes the relationships among the events in the world. Some sources define a third kind, hierarchical knowledge, but we consider it only a subset of the relational knowledge framework. In the following text we will deal mainly with relational knowledge representations.

Further, we can distinguish knowledge into two quantitatively different categories: shallow knowledge and deep knowledge. The main distinction between the two is the precision of the representation. Shallow knowledge concentrates on relationships among perceivable attributes. Systems with shallow knowledge then concentrate on gathering as many facts as possible and inferring the result from these. Deep knowledge involves not only observable attributes, but also hidden attributes resulting from modeling the domain. These systems concentrate not only on gathering as much data as possible; they also create and evaluate a number of hypotheses. Such systems were in the past almost impossible to build, both because of the lack of supporting theory and algorithms, and because of the lack of computational power. This has been changing recently.

There have been many successful expert systems based upon shallow knowledge of the domain. Most of the early statistical ES, as well as most of the rule-based ES, do not have a real model of the world, or the model is very simplified. On the other hand, we think that deep knowledge might be very helpful in the construction of flexible reasoning ES.

Expert systems usually have additional requirements on both the knowledge and the conclusions they provide. Since they are built to save money, it is imperative that the knowledge can be managed with as little effort as possible. It is also in many cases essential that they can support their conclusions by arguments. People making decisions based on the output of computer systems are more comfortable if they can understand the conclusions, check them, or find errors in the knowledge base. Therefore we prefer representations that are modular and predictable and that mainly contain symbolic information. These requirements make, for example, neural networks infeasible.

In the next two sections we summarize the existing representation schemes and describe their applicability and drawbacks. After that, we try to point out how this knowledge could be acquired automatically.

4. Structural Knowledge

The most common representation of the world as a consistent environment is as a set of states and state-transformation rules in time. Since the enumeration of all possible states and their transformations is infeasible even in small environments, we need a different method to represent which states are possible.

A common method that helps decrease problem complexity is decomposition. Fortunately, most practical environments may be decomposed into a dynamic set of variables, each representing a fraction of the state. This not only dramatically decreases the complexity of the representation, but also enables generalization and the identification of a state based on a


partial observation. Further, to process the variables, it is very useful to define constraints binding the values of the possible states and their transformations. A knowledge base is usually this decomposition together with the constraints binding the variables; we will also use this definition. An interpretation is generally a set of variables and the values assigned to them. We then say that an interpretation is a model, or a possible world, when it fulfills all required constraints of the knowledge base.
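To make these definitions concrete, the following minimal Python sketch (the variable names and the tiny weather domain are purely illustrative, not from any particular system) represents a knowledge base as a variable decomposition with constraints, and enumerates the interpretations that qualify as models:

```python
from itertools import product

# Illustrative knowledge base: variables with finite domains, plus
# constraints expressed as Boolean predicates over an interpretation.
domains = {"weather": ["sun", "rain"], "ground": ["dry", "wet"]}
constraints = [
    # rain implies wet ground
    lambda i: not (i["weather"] == "rain" and i["ground"] == "dry"),
]

def is_model(interpretation):
    """An interpretation is a model iff it satisfies every constraint."""
    return all(c(interpretation) for c in constraints)

# Enumerate all interpretations and keep the possible worlds (models).
names = list(domains)
models = [dict(zip(names, values))
          for values in product(*(domains[n] for n in names))
          if is_model(dict(zip(names, values)))]
print(models)  # three models: every world except weather=rain, ground=dry
```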

In the next part we analyze the possible methods. We start with the basic methods employed by the first expert systems. Since most of these methods have been applied in tens of commercial and research applications, there are various advanced extensions and heuristic inference methods. We do not try to describe all these new methods; we only try to describe the principles and point out the differences.

These methods enable representations of the world as states fulfilling specified constraints. It is worth adding that all these representations assume strictly consistent environments: the transformation rules are exactly the same for the state at any instant of time. This is a valid assumption for many domains, but there are strong reasons why this assumption should be weakened. We discuss these reasons in section 5.

4.1. Semantic Networks. A semantic network is a classical representation of hierarchical relational knowledge in AI (Artificial Intelligence). It is basically a graph whose nodes are labeled by atomic formulas and whose arcs represent relations between them. The nodes of this graph represent entities and classes of entities. These classes may then be hierarchically ordered to represent the knowledge. This leads to two basic relations between nodes: subclass and entity-of. A simple example of the network is in figure ????.

They were first developed as a method of representing human knowledge. In fact, every semantic network can be represented in the language of first-order logic (see below). Even better, semantic nets can be directly translated to logic programs. Therefore, we will not deal much further with this representation until the sections devoted to logic and non-monotonic logic. Most of the principles of semantic networks are also employed in rule-based systems.

There is, though, a basic feature of semantic networks that is employed in many very large ES: the hierarchical classification of knowledge. For people this is a very natural reasoning process, but not as much for computers. Hierarchical classification leads to enhanced generalization and information reduction, and can also dramatically increase performance. On the other hand, it can lead to reduced precision. For a classic example of this feature, consider an expert system classifying animal species. It is much more convenient to define that dogs bark than to define for each breed of dog that it barks. Obviously, it is computationally much simpler to consider the hypothesis dog instead of considering a large number of dog breeds. Also, adding a new dog to the knowledge base is easier when all common features are specified for the class of dogs and not for each entity separately.
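A minimal Python sketch of this inheritance mechanism follows (the animal taxonomy is the classic illustrative example, not any particular system's format):

```python
# Minimal semantic-network sketch: nodes linked by "subclass" and
# "entity_of" arcs, with properties inherited down the class hierarchy.
subclass = {"beagle": "dog", "dog": "mammal"}
entity_of = {"snoopy": "beagle"}
properties = {"dog": {"barks"}, "mammal": {"has_fur"}}

def inherited_properties(node):
    """Collect properties of a node and all of its ancestor classes."""
    props = set()
    node = entity_of.get(node, node)   # an entity starts at its class
    while node is not None:
        props |= properties.get(node, set())
        node = subclass.get(node)      # climb the subclass hierarchy
    return props

print(inherited_properties("snoopy"))  # {'barks', 'has_fur'}
```

Defining "barks" once at the dog node suffices for every breed and entity below it, which is exactly the information reduction described above.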

4.2. Frames. The following description of frames is based on many sources, but mainly on [Vladimir Marik, 1993]. Frames were among the first attempts to mimic human reasoning and the hierarchical representation of knowledge. They were first proposed by Minsky and found tremendous application in early expert systems. Frames are groupings of slots that represent semantically close knowledge. Despite their widespread application, their background is mostly technical; they simplify development for humans more than they offer a solid base for sound inference. The principle of frames has been further enhanced and refined in the Object Oriented Programming paradigm and in Multi-Agent Systems.

4.3. Rules. Rules represent a very human-friendly knowledge representation. They are composed of simple if-then clauses that are activated, usually according to a custom heuristic function. Among the often cited advantages of rule-based systems are their modularity, simplicity, and good performance [Girratano, 1998].
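As an illustration of the idea, here is a minimal forward-chaining sketch in Python; the rules and facts are invented for the example, and real engines such as CLIPS add conflict-resolution strategies and the Rete matching algorithm discussed below:

```python
# A minimal forward-chaining rule engine: a rule fires when all of its
# premises are in working memory, adding its conclusion as a new fact.
rules = [
    ({"temperature_high", "pressure_rising"}, "open_valve"),
    ({"open_valve"}, "log_event"),
]
facts = {"temperature_high", "pressure_rising"}

changed = True
while changed:                       # iterate to a fixed point
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)    # fire the rule
            changed = True

print(facts)  # now includes 'open_valve' and 'log_event'
```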



Rule engines are generally not suitable for modeling complex world relationships and creating world models, but instead can be used to represent procedural or shallow knowledge. Reasoning in a partially observable domain requires a measure of certainty in the propositions. There have been many more or less successful approaches to representing uncertainty in a rule-based framework. The early trials, as well as recent research, have shown that once uncertainty measures are introduced, the system is no longer modular [Eric J. Horvitz, 1988], and in the case of Mycin, the ad-hoc certainty factors could lead to disastrous results [Stuart Russel, 2003]. Some of the most successful measures were fuzzy sets and graded logic values. Both offer a similar approach to reasoning, but while graded logic offers discrete values of membership, in fuzzy logics the uncertainty is usually measured by values from an infinite set of real numbers. For more information, refer to the section dedicated to fuzzy logic below.

This approach proved very efficient in simple domains, also thanks to the very fast Rete matching algorithm, but its use in complex systems is at least very questionable [Stuart Russel, 2003]. We point readers interested in this knowledge representation model to [Girratano, 1998], and for an example of a more advanced approach to [Poli and Brayshaw, 1995].

4.4. Logic. The basic notion of logic has been known since the ancient Greeks. It is a system that defines a framework for representing relational knowledge and reasoning about it. Unlike rule systems, logic is a very suitable tool for representing real-world models. It can represent very complex relationships among objects, it can represent hierarchies, and it is very extensible. The main problem of reasoning with logic is that inference is usually an NP-complete problem, and there have not been many successful methods of expressing heuristic shallow knowledge using logic.

The reasoning is performed according to strictly defined rules of inference. We will first introduce propositional logic, then move on to first-order logic, and finally introduce modal logics.

4.4.1. Propositional Logic. The syntax of propositional logic defines the allowed sentences. The atomic sentences, indivisible syntactic elements, consist of single propositional elements. From these, complex sentences are composed using the unary operator ¬ and the binary operators ∨, ∧, ⇒. Please note that the logic can also be defined using only a subset of these operators. For a detailed description of the semantics, please see any of the books dealing with logic representation, for example [Stuart Russel, 2003] or [Vladimir Marik, 1993].

Further important concepts are the following.

(1) A model is an assignment of values to every propositional element.
(2) Entailment: α |= β if β is true in all models where α is true.
(3) A valid sentence is true in all models.
(4) Sentences are logically equivalent if they are true in the same set of models (their logical value is the same in all models).
(5) A satisfiable sentence is true in at least one model.

The aim of inference is to determine whether KB |= α for some sentence α, where KB is the knowledge available to the agent. We say that the entailment procedure is sound when everything it infers from KB is entailed, and complete when everything that is entailed is inferred. The main inference rules are Modus Ponens and And-Elimination.

Resolution is a single inference rule that yields both sound and complete inference. It is usually applied to sentences in Conjunctive Normal Form (CNF), a conjunction of disjunctions of literals (a literal is an atom or a negated atom). It can be shown that every logical sentence can be expressed in CNF.

Theorem 4.1. Every logical sentence can be expressed in the CNF form.



Proof. Let us have a sentence α and express its negation ¬α in Disjunctive Normal Form (DNF). DNF is a disjunction of conjunctions, and therefore for each model in which ¬α is true we can write a single conjunction that is true in this model only. Therefore ¬α = (α1 ∧ α2 ∧ . . .) ∨ . . . ∨ (αk ∧ . . . ∧ αn), where the αi are propositional literals. Then α = (¬α1 ∨ ¬α2 ∨ . . .) ∧ . . . ∧ (¬αk ∨ . . . ∨ ¬αn), which is in CNF.

Now, resolution is a simple procedure in which we join two disjunctive clauses γ = (αi ∨ β1) and δ = (¬αi ∨ β2), where αi is a propositional atom and β1, β2 are disjunctive clauses. The result is (β1 ∨ β2). We can use this method to determine entailment, because α |= β is equivalent to the unsatisfiability of α ∧ ¬β. Therefore, if α entails β, we obtain the empty clause by resolution; otherwise there are some models in which α is true and β is not.
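A minimal Python sketch of this refutation procedure follows; the encoding of clauses as frozensets of signed literals is our own choice for the example, and production provers add many optimizations:

```python
# Propositional resolution refutation: a clause is a frozenset of
# literals, where a literal is ("p", True) for p or ("p", False) for ¬p.
def resolve(c1, c2):
    """Yield all resolvents of two clauses."""
    for (atom, sign) in c1:
        if (atom, not sign) in c2:
            yield (c1 - {(atom, sign)}) | (c2 - {(atom, not sign)})

def entails(kb_clauses, negated_query_clauses):
    """KB |= q iff KB together with ¬q is unsatisfiable (derives {})."""
    clauses = set(kb_clauses) | set(negated_query_clauses)
    while True:
        new = set()
        for a in list(clauses):
            for b in list(clauses):
                if a != b:
                    for r in resolve(a, b):
                        if not r:          # empty clause: contradiction
                            return True
                        new.add(r)
        if new <= clauses:                 # no progress: satisfiable
            return False
        clauses |= new

# (p ∨ q) and (¬p ∨ q) entail q: adding ¬q yields the empty clause.
kb = {frozenset({("p", True), ("q", True)}),
      frozenset({("p", False), ("q", True)})}
print(entails(kb, {frozenset({("q", False)})}))  # True
```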

Theorem 4.2. Resolution is both sound and complete.

Complexity. While resolution offers a good method of theorem proving, it works in O(2^n) time, where n is the number of clauses. The following theorem says that we can hardly expect an algorithm which would be asymptotically faster.

Theorem 4.3. Inference in propositional logic is NP-complete in the number of clauses in the knowledge base.

Some of the algorithms which use heuristic or probabilistic principles to decide entailment follow.

(1) DPLL (Davis-Putnam) - essentially a heuristic depth-first enumeration of possible models; a minimal sketch follows the list.
(2) WalkSAT - a probabilistic Monte Carlo algorithm which uses the min-conflicts CSP (Constraint Satisfaction Problem) heuristic to find a model satisfying the formula. In most cases it is more than twice as fast as DPLL.
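The following is a minimal DPLL sketch in Python, with unit propagation but without the clever branching heuristics of real solvers; the clause encoding is the same illustrative one used for resolution above:

```python
# Minimal DPLL: clauses are lists of signed literals (atom, sign).
def dpll(clauses, assignment):
    # Simplify the clause set under the current partial assignment.
    simplified = []
    for clause in clauses:
        if any(assignment.get(a) == s for a, s in clause):
            continue                       # clause already satisfied
        rest = [(a, s) for a, s in clause if a not in assignment]
        if not rest:
            return None                    # clause falsified: backtrack
        simplified.append(rest)
    if not simplified:
        return assignment                  # all clauses satisfied: a model
    # Unit propagation: a one-literal clause forces its assignment.
    for clause in simplified:
        if len(clause) == 1:
            atom, sign = clause[0]
            return dpll(simplified, {**assignment, atom: sign})
    # Branch on the first unassigned atom (real solvers choose smarter).
    atom = simplified[0][0][0]
    for sign in (True, False):
        model = dpll(simplified, {**assignment, atom: sign})
        if model is not None:
            return model
    return None

# (p ∨ q) ∧ (¬p) is satisfiable with p=False, q=True.
print(dpll([[("p", True), ("q", True)], [("p", False)]], {}))
```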

4.4.2. First Order Predicate Logic. Predicate logic is enriched by two additional quantifiers, for-all ∀ and exists ∃, where the exists quantifier may be defined through for-all as ∃x α ≡ ¬∀x (¬α). Atoms are now predicates, that is, relations on the universe of possible values. Unlike in propositional logic, we also allow functions that map several terms into another term. The logic without functions is called datalog.

The concepts of first-order logic and inference are too complex and too well known to cover here, so we only point interested readers to the classical literature. We only mention that inference in first-order logic is semidecidable: there exist algorithms that say yes to every entailed sentence, but may never say that a sentence is not entailed.

The key concepts are:

(1) Lifting is a generalized Modus Ponens.
(2) Unification is the process of assigning symbols and functions in place of quantified expressions in two sentences. When two sentences can be unified, a unique most general unifier exists; a minimal sketch follows the list.
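Here is a minimal sketch of unification in Python (our own illustrative term encoding; the occurs check is omitted for brevity):

```python
# Unification sketch: variables are strings starting with '?', compound
# terms are tuples like ('knows', '?x', 'john'), constants are strings.
def unify(t1, t2, subst=None):
    """Return a most general unifier (a dict) or None if none exists."""
    if subst is None:
        subst = {}
    t1, t2 = walk(t1, subst), walk(t2, subst)
    if t1 == t2:
        return subst
    if isinstance(t1, str) and t1.startswith("?"):
        return {**subst, t1: t2}           # bind variable t1
    if isinstance(t2, str) and t2.startswith("?"):
        return {**subst, t2: t1}           # bind variable t2
    if isinstance(t1, tuple) and isinstance(t2, tuple) and len(t1) == len(t2):
        for a, b in zip(t1, t2):           # unify argument by argument
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None                            # clash of constants or functors

def walk(term, subst):
    """Follow variable bindings to their current value."""
    while isinstance(term, str) and term.startswith("?") and term in subst:
        term = subst[term]
    return term

print(unify(("knows", "?x", ("mother", "?x")),
            ("knows", "john", ("mother", "john"))))  # {'?x': 'john'}
```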

The concepts of resolution, Horn clauses (definite clauses), and forward and backward chaining are defined in the same way as in propositional logic. The only principal difference is the need for unification between the clauses. Also, due to the semidecidability mentioned above, resolution is only refutation-complete.

4.4.3. Modal Logic. Modal logic extends propositional logic with the operators of possibility ♦ and necessity □. It has been invented to represent not only the model of the world, but also the agent's beliefs about it. It was invented by Lewis in 1913 in an attempt to avoid the paradoxes of implication, where a false proposition implies anything.

There are many families of modal logic. Many of them are based upon a weak logic called K (after Saul Kripke), an extension of propositional logic. The symbols of the K logic include ∼ for not, ⇒ for if-then, and □ for necessity. Possibility can be expressed through necessity in



a similar fashion as the exists quantifier is expressed through for-all: ♦α ≡ ¬□(¬α). A good reference for the modal logic framework is Modal Logic.

In propositional logic, a valuation of the atomic sentences (a row of a truth table) assigns a truth value (T or F) to each propositional variable p. The truth values of complex sentences are then calculated with truth tables. In modal semantics, a set W of possible worlds is introduced. A valuation then gives a truth value to each propositional variable for each of the possible worlds in W. This means that the value assigned to p for world w may differ from the value assigned to p for another world w′. The possible worlds are connected by an accessibility relation which conforms to certain rules; the details are beyond the scope of this article.
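The semantics is easy to operationalize for atomic propositions; the following Python sketch evaluates □ and ♦ over a small, invented Kripke model:

```python
# Kripke-model sketch: worlds, an accessibility relation, and a valuation
# assigning each propositional variable a truth value per world.
access = {"w1": {"w1", "w2"}, "w2": {"w2"}}    # worlds each world "sees"
valuation = {("p", "w1"): True, ("p", "w2"): True,
             ("q", "w1"): True, ("q", "w2"): False}

def necessarily(p, w):
    """□p holds at w iff p holds in every world accessible from w."""
    return all(valuation[(p, v)] for v in access[w])

def possibly(p, w):
    """♦p holds at w iff p holds in some world accessible from w."""
    return any(valuation[(p, v)] for v in access[w])

print(necessarily("p", "w1"))  # True:  p holds in both w1 and w2
print(necessarily("q", "w1"))  # False: q fails in the accessible w2
print(possibly("q", "w1"))     # True:  q holds in the accessible w1
```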

Some of the extensions of modal logics are:

(1) Temporal Logic - a modal logic in which the operators are understood as time-validity specifications, e.g., that a statement is possible in some time moments, or that it is necessary in some time moments.
(2) Deontic Logic - defines obligatory and permissible operators.
(3) Conditional Logic - tries to avoid the above-mentioned paradoxes of implication, such as a ⇒ (¬a ⇒ b).

4.4.4. Non-monotonic Logic. Non-monotonic logic is the result of a synthesis of cognitive science and traditional logic representation. One of the findings of cognitive science, as well as of daily life, is that people tend to come to conclusions that are not valid in all models of their knowledge. This leads to behavior in which additional information can cause the rejection of a previous conclusion, hence the name non-monotonic.

There have been many disputes about how the cognitive models are built and used. As a consequence, a number of possible representations have been proposed. Some of these are Default Logics, the Closed/Open World Assumption in logic systems, and Answer Set Programming. Answer set programming is a very interesting and dynamically growing representation, similar to the constraint programming mentioned in subsection 4.6. There are two main systems: DLV and Smodels. There are no widely known expert systems in this paradigm, but it seems a promising, though yet immature, field.

4.5. Logic Programming. Since inference in logic systems is in most cases intractable, researchers have sought ways to improve the reasoning process. The general approach is to restrict the expressivity of the knowledge representation in order to boost performance. In this section we present such an approach for both propositional and first-order logic.

As mentioned above, entailment is an NP-complete problem even in a logic as weak as propositional logic. One of the ways to restrict the expressive power and achieve much better performance is to permit only Horn clauses. A Horn clause is a disjunction of literals of which at most one is positive. With such rules we are able to represent even very complex structures, with very fast inference.

A definite logic program is a set of Horn clauses of first-order predicate logic, among which conjunction is assumed. The main methods of inference in these programs are forward chaining and backward chaining, which both have the same asymptotic computational complexity, but differ in their approach to data and queries. Their principles are beyond the scope of this article; a good introduction may be found in [Stuart Russel, 2003], and another good source of information is [Ulf Nilsson, 1990].
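Still, the flavor of backward chaining is easy to convey. The following propositional Python sketch (invented atoms; no unification, which first-order programs would add as sketched earlier) tries to prove a goal by recursively proving the body of a matching clause:

```python
# Backward chaining over a propositional definite program: each Horn
# clause is (head, [body atoms]); facts are clauses with empty bodies.
program = [
    ("grandparent", ["parent", "parent_of_parent"]),
    ("parent", []),                 # fact
    ("parent_of_parent", []),       # fact
]

def prove(goal):
    """A goal holds if some clause for it has a provable body."""
    return any(head == goal and all(prove(b) for b in body)
               for head, body in program)

print(prove("grandparent"))  # True: both body atoms are facts
```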

The application of logic programs in the ES domain is probably even wider than that of rule engines. Logic programs can be used to express even very complex structures. Like rule-based systems, they are very modular and stable. Their main drawback is their lack of uncertainty representation. Though there are ways to deal with this problem, none of them has proved successful in a general sense.



4.6. Constraint Programming. Constraint programming is a paradigm for solving constraint satisfaction and optimization problems just by their specification. In principle, a programmer only needs to know what a solution to the problem looks like. The specification of the problem is accomplished by a set of constraints over a set of variables. Both of these sets may be dynamic and unbounded. The programming environment then uses standard methods for solving constraint satisfaction problems. These usually employ constraint propagation, value distribution, and branch-and-bound.

There is a number of different domain and constraint specifications. The most widely used domain is a finite set of integers, but real numbers and tree structures are also common. The most popular approach to specifying the constraints is a logic programming language very similar to Prolog. A common feature is the possibility to specify meta-constraints, i.e., constraints on constraints. This can also be addressed very elegantly by standard logic programming. This combination is known as Constraint Logic Programming, CLP(D), where D stands for the domain of the variables.
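As a toy illustration of the search such systems perform, the following Python sketch solves a small finite-domain problem by chronological backtracking; a real CLP(FD) engine would add constraint propagation and much smarter search, and the variables and constraints here are invented for the example:

```python
# Backtracking search over finite domains. A constraint is a pair
# (variables, predicate over their values); it is checked once bound.
def solve(domains, constraints, assignment=None):
    assignment = assignment or {}
    if len(assignment) == len(domains):
        return assignment                  # every variable assigned
    var = next(v for v in domains if v not in assignment)
    for value in domains[var]:
        trial = {**assignment, var: value}
        if all(pred(*(trial[v] for v in vs))
               for vs, pred in constraints
               if all(v in trial for v in vs)):
            result = solve(domains, constraints, trial)
            if result is not None:
                return result
    return None                            # no consistent value: backtrack

domains = {"x": [1, 2, 3], "y": [1, 2, 3], "z": [1, 2, 3]}
constraints = [(("x", "y"), lambda a, b: a < b),
               (("y", "z"), lambda a, b: a < b)]
print(solve(domains, constraints))  # {'x': 1, 'y': 2, 'z': 3}
```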

The applicability of pure constraint programming to the ES domain is somewhat limited due to the large scale of these systems. Still, there have been some efforts at using this paradigm in the domain of expert systems, most notably in [Bartak, 1999]. The employed strategy implements the Hierarchical Constraint Logic Programming (HCLP) discussed in [Wilson and Borning, 1993]. HCLP uses a partial preference ordering on the constraints, identifying a preference for each constraint being satisfied.

4.7. Mathematical Programming. Mathematical programming is a subset of the constraint programming model in which the domains are real numbers and the constraints are specified as functions. The standard model is:

min g0(x)
subject to gi(x) ≤ 0, i = 1, . . . , m
x ∈ X ⊆ ℝ^n

The following common variations of mathematical programming exist:

(1) Convex Programming - both g0 and all gi are convex functions and X is a convex set (informally, a convex set is a set in which any two points can be connected by a straight line lying wholly inside the set).
(2) Linear Programming - convex programming in which all constraint functions are linear.
(3) Integer Linear Programming - convex programming with linear constraints and X ⊆ ℤ^n.
(4) Mixed Integer Linear Programming - a combination of linear and integer programming.

Linear programming, the simplest model presented, is also the one most widely used. The first method that addressed this type of problem was the simplex method. Though the method is exponential in the worst case, it has been applied with much success. The more recent polynomial methods belong to the set of interior-point methods [Karmarkar, 1984]. Complexities for the other problems are much less optimistic. For instance, the integer programming problem has been proved to be NP-complete (this can be shown by a transformation from the SAT problem). A prominent method for solving it is Lagrangian relaxation.
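As a small worked example, assuming SciPy is available, scipy.optimize.linprog solves a two-variable linear program; the objective and constraints below are invented for the illustration:

```python
# Maximize 3x + 2y subject to x + y <= 4, x + 3y <= 6, x, y >= 0.
# linprog minimizes, so we negate the objective coefficients.
from scipy.optimize import linprog

result = linprog(
    c=[-3, -2],                      # minimize -(3x + 2y)
    A_ub=[[1, 1], [1, 3]],           # left-hand sides of <= constraints
    b_ub=[4, 6],                     # right-hand sides
    bounds=[(0, None), (0, None)],   # x >= 0, y >= 0
)
print(result.x, -result.fun)         # optimum at x=4, y=0, objective 12
```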

Mathematical programming is very often employed in economic and optimization systems, but it is questionable whether these are expert systems. Since most economic knowledge can be written in the form of mathematical equations, it is a field where standard logic usually fails.



5. Uncertain Knowledge

Most of the models mentioned in section 4 were not able to deal with uncertainty. This section describes common methods for expressing uncertainty, as well as their applications with the knowledge representation languages.

As mentioned in section 3, we assume that the environment is consistent. Yet there are many reasons that lead us to represent uncertainty. The most common of these reasons are probably the need to simplify the world representation, to represent noisy or biased measurements of the world, and to represent the weight of knowledge in the KB.

Three different models of representing uncertainty... [Halpern, 89]

During the study of general uncertainty measures, various requirements for the measure were introduced. The ones we present here are from [Eric J. Horvitz, 1988].

(1) Clarity - propositions should be well defined.
(2) Scalar continuity - a single real number is both necessary and sufficient for representing a degree of belief in a proposition.
(3) Completeness - a degree of belief can be assigned to any well-defined proposition.
(4) Context dependency - the belief assigned to a proposition can depend on the belief in other propositions.
(5) Hypothetical conditioning - there exists some function that allows the belief in a conjunction of propositions, B(X ∧ Y), to be calculated from the belief in one proposition and the belief in the other proposition given that the first proposition is true. That is, B(X ∧ Y) = f(B(X|Y), B(Y)).
(6) Complementarity - the belief in the negation of a proposition is a monotonically decreasing function of the belief in the proposition itself.
(7) Consistency - there will be equal belief in propositions that are logically equivalent.

These principles model the general requirements well, but many measures defy some of them. For each of the measures below, we will note which properties are satisfied and which are not. We will further concentrate mainly on methods that are interesting but generally not very well known.

5.1. Probability. The notion of probability was first introduced no later than the 17th century in the works of Pascal, Bernoulli, and Fermat [Eric J. Horvitz, 1988]. It has been ever since the most widely used and the most thoroughly developed system for representing uncertain beliefs.

Ramsey and De Finetti argued, in the famous Dutch book argument, that any uncertainty measure must fulfill the following properties. The presentation of the argument here is based upon [Freedman, 2003].

(1) P(E) ∈ 〈0, 1〉
(2) If event E is certain, then P(E) = 1.
(3) If E = E1 ∪ E2 and E1 and E2 are exclusive events, then P(E) = P(E1) + P(E2).

De Finetti has shown that an agent which did not adhere to these rules could be forced into an all-negative-gain situation. The proof interprets a degree of certainty p as willingness to bet money at odds 1/p : 1, meaning that the agent would be willing to pay $1 to get $1/p − 1.

Probability is defined as an additive measure on a σ-algebra, where additive means that for exclusive sets E1, E2 it holds that P(E1 ∪ E2) = P(E1) + P(E2). Some of the more recent methods have tried to define probability as a non-additive measure. We will describe these in later sections.

The probability model assumes that the world is consistent. Usually, it is used to represent uncertainty, not imprecision. There has been an ongoing dispute about what probabilities actually mean. Some researchers hold that probability is a fundamental property of the world; others prefer the fully Bayesian approach and assume that probabilities are subjective and represent only the view of the agent.



Unfortunately, probability does not appear to be as fundamentally good a concept for uncertain reasoning as it is for statistics. The reason might be its inability to distinguish between a lack of knowledge and conflicting knowledge. For example, if we know nothing about E, we can only assume the prior probability, which might not be a very good approach in many cases. Though probability is more suitable for representing uncertainty and inconsistency, it might be possible to use it for the representation of imprecision. There is also no standard tool for meta-reasoning, that is, reasoning about the reasoning and the uncertainties themselves.

Probability has been very widely applied in the ES domain. We discuss some of the combinations that have proved most successful.

5.2. Bayes Networks and Influence Diagrams. A Bayes network is a graphical model that depicts conditional dependencies among the concerned random variables, hence the name Bayesian. It takes advantage of a powerful property of probability theory: a conditional probability can be inverted through Bayes' rule. This means that we can calculate Pr[a|b] = Pr[b|a] · Pr[a] / Pr[b] knowing only the atomic probabilities. Instead of representing the probability of each variable as dependent on all others, we choose only the ones that are most influential.
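A minimal worked example of this inversion, with invented numbers for a single Disease to Symptom edge of the kind a diagnostic network encodes:

```python
# Bayes' rule sketch: invert a conditional probability from the
# "atomic" quantities stored at a Disease -> Symptom network edge.
p_disease = 0.01                     # prior Pr[d]
p_symptom_given_disease = 0.90       # Pr[s | d]
p_symptom_given_healthy = 0.05       # Pr[s | not d]

# Total probability of the evidence: Pr[s].
p_symptom = (p_symptom_given_disease * p_disease
             + p_symptom_given_healthy * (1 - p_disease))

# Posterior Pr[d | s] = Pr[s | d] * Pr[d] / Pr[s].
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
print(round(p_disease_given_symptom, 3))  # ~0.154
```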

Inference on the model is an NP-complete or harder problem [Park and Darwiche, 2004], depending on the questions asked. On the other hand, there are a number of algorithms applicable to structures fulfilling special criteria, e.g., poly-trees, as well as statistical sampling algorithms such as Markov Chain Monte Carlo. A very nice introduction to Bayesian networks can be found in [Heckerman, 1996] and [Stuart Russel, 2003].

This is a model that was put into use by the first expert systems and has been very thoroughly studied ever since. There have been many applications of Bayes networks to expert systems and knowledge-intensive agents. Applications range from sorting spam out of incoming e-mail correspondence to deciding which design of a car would be more beneficial. Compared to traditional rule-based expert systems, this approach is much more precise and offers a well-founded model for dealing with uncertain data. Since the model is based on the traditional probability measure, the Dutch book argument is inapplicable and standard decision theory methods can be applied.

5.3. First Order Probability Languages. While Bayes networks have proved very beneficial to the whole AI community, they suffer from the same problem as propositional logic. They can only express finite knowledge, which leads to very static problem representations and clumsy knowledge bases. Imagine the following example: The colt John has been born recently on a stud farm. John suffers from a life-threatening hereditary disease carried by a recessive gene. The disease is so serious that John is removed instantly, and because the stud farm wants the gene out of production, his parents are taken out of breeding. What are the probabilities for the remaining horses to be carriers of the unwanted gene? The only possible representation through Bayes networks is to create a random variable for each horse and its probability of carrying the gene. Obviously this is too complex and inflexible. Some research has therefore been devoted to extending the standard Bayes model with the expressivity of first-order representations. Some of these representations are Bayesian Logic Programs [Kersting and Raedt, 2000], Stochastic Logic Programs [Muggleton, 1995], and Probabilistic Logic Programs. None of the presented approaches has proved as successful as ordinary Bayes networks.

5.4. Stochastic Programming.

5.5. Dempster Shafer Theory. This theory tries to address the main problem of probability theory: the required prior beliefs. It was pioneered by Dempster and later refined by Shafer. Sometimes it is called the Theory of Evidence.

We assume a finite set of mutually exclusive and exhaustive elements called the frame of discourse, symbolized by Θ. We also define a "mass" function m. Its basic properties are:

(5.1) m(∅) = 0

∑_{A ∈ 2^Θ} m(A) = 1

This setup allows us to assign a quasi-probability to each set from the power set of the environment. This is different from probability theory, where we assign probabilities only to atomic events. We then need to define one more function to obtain a measure on any subset of the environment.

Definition 5.1. Let m be the mass function defined over the frame Θ. Then we define the belief function Bel as

Bel(A) = ∑_{B ⊆ A} m(B), for all A ⊆ Θ.

Unlike probability, where the measure is a single scalar, this theory introduces another measure, plausibility, which measures the weight of the evidence that does not contradict the given event. This concept is also used in possibility theory, which is discussed below.

Definition 5.2. Let m be the mass function defined over the frame Θ. Then we define the plausibility function Pl as

Pl(A) = ∑_{B : A ∩ B ≠ ∅} m(B), for all A ⊆ Θ.

The important, and also questionable, part of the theory is the combination of evidence. This is done by the Dempster rule. In the theory of probability, this corresponds to the combination of prior beliefs and evidence.

Definition 5.3. Assume that m1 and m2 are mass functions over Θ such that ∑_{Ai ∩ Bj ≠ ∅} m1(Ai) · m2(Bj) > 0. The combination m1 ⊕ m2 of these mass functions according to the Dempster rule is, for A ≠ ∅,

m1 ⊕ m2(A) = ∑_{Ai ∩ Bj = A} m1(Ai) · m2(Bj) / ∑_{Ai ∩ Bj ≠ ∅} m1(Ai) · m2(Bj)

and we set m1 ⊕ m2(∅) = 0.
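A direct Python transcription of this rule follows; the two witness mass functions are invented for the example:

```python
# Dempster's rule of combination: a mass function is a dict mapping
# frozenset focal elements to weights summing to 1.
def combine(m1, m2):
    joint, conflict = {}, 0.0
    for a, w1 in m1.items():
        for b, w2 in m2.items():
            inter = a & b
            if inter:
                joint[inter] = joint.get(inter, 0.0) + w1 * w2
            else:
                conflict += w1 * w2    # mass falling on the empty set
    norm = 1.0 - conflict              # renormalize non-conflicting mass
    return {a: w / norm for a, w in joint.items()}

# Two witnesses over the frame {rain, sun}: one leans rain, one sun.
m1 = {frozenset({"rain"}): 0.6, frozenset({"rain", "sun"}): 0.4}
m2 = {frozenset({"sun"}): 0.7, frozenset({"rain", "sun"}): 0.3}
print(combine(m1, m2))
# {'rain'}: ~0.310, {'sun'}: ~0.483, {'rain','sun'}: ~0.207
```

The quadratic blow-up of the double loop over focal elements is exactly the computational-complexity drawback noted below.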

There are some applications of this theory, but very few of them are used practically. Its main drawback is the computational complexity of combining the beliefs.

5.5.1. Transferable Belief Model. The Transferable Belief Model is one of the most researched models based on the Dempster-Shafer theory (DST). The research is mostly led by professor Philippe Smets. It introduces an interesting idea of having two levels of representation. The first is the credal level, used to model one's beliefs; this level is built upon a slightly modified DST. As mentioned in section 5.1, if the measure used for decision making does not conform to the laws of probability, then the agent can be forced into an all-lose position. Therefore, since DST does not conform to these laws, it cannot be directly used to make decisions. For this purpose a second, pignistic level is introduced, with a regular probability distribution over the evidence. The transformation from the credal to the pignistic level of representation is based upon the generalized insufficient reason principle. For more details, please see the thorough description in [Smets et al., ].

5.6. Fuzzy Logic. An alternative approach to representing uncertain or indeterministic knowledge was proposed by Lotfi Zadeh in 1965. It is an attempt to enrich the traditional 2-valued logic with graded truth values. Instead of being able to say only that a statement is either true or false, we can say that it is true to a certain degree. We use the definitions from [Vladimir Marik, 2003].

Fuzzy logic is defined in the same manner as first-order logic. The first notable difference is the definition of predicates. Predicates in fuzzy logic are not ordinary relations, but instead functions mapping terms from the universe of possible values to a real number from 〈0, 1〉. The second is that the logical operators are defined as functions f : 〈0, 1〉 × 〈0, 1〉 → 〈0, 1〉. Finally,


we add a default atom 0, which always has truth value 0. Truth value 1 represents something that is absolutely true; the truth value 0, on the other hand, represents something that is absolutely false.

The most usual way of defining the logical operators is by using a t-norm and its residuum. A t-norm, or triangular norm, is a function ⋆ which obeys the following properties.

x ⋆ y = y ⋆ x

x ⋆ (y ⋆ z) = (x ⋆ y) ⋆ z

x ≤ x′ ∧ y ≤ y′ ⇒ x ⋆ y ≤ x′ ⋆ y′

1 ⋆ x = x

Some of the most widely used t-norms are

x ⋆ y = min(x, y) (the Gödel t-norm)
x ⋆ y = max(0, x + y − 1) (the Łukasiewicz t-norm)

We understand this operation to mean &, which is not equal to the standard conjunction operation ∧. Every t-norm has a residuum, which is used as the definition of the ⇒ operation. We define it as

(5.2) x ⇒ y = max{z | x ⋆ z ≤ y}

We define the values of the other logical operators from this t-norm as follows:

¬α = α ⇒ 0
α ∧ β = α & (α ⇒ β)
α ∨ β = ((α ⇒ β) ⇒ β) ∧ ((β ⇒ α) ⇒ α)

For all t-norms we get that α ∧ β = min(α, β) and α ∨ β = max(α, β). In particular, for the widely used Łukasiewicz t-norm we get ¬α = 1 − α.
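These definitions translate directly to code; the following Python sketch implements the two t-norms above together with their residua:

```python
# Goedel (min) and Lukasiewicz t-norms with their residua.
goedel = lambda x, y: min(x, y)
lukasiewicz = lambda x, y: max(0.0, x + y - 1.0)

def residuum_goedel(x, y):
    """x => y is the largest z with min(x, z) <= y."""
    return 1.0 if x <= y else y

def residuum_lukasiewicz(x, y):
    """x => y is the largest z with max(0, x + z - 1) <= y."""
    return min(1.0, 1.0 - x + y)

x, y = 0.8, 0.3
print(lukasiewicz(x, y))             # 0.1 -- the strong conjunction &
print(residuum_lukasiewicz(x, 0.0))  # 0.2 -- the negation 1 - x
```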

The only things left to be defined are the quantifiers. For these we use the following definitions (we deliberately do not mention the interpretations explicitly, committing a small crime against formal precision in order to focus on the principle rather than the technical difficulties; for precise definitions, please consult the references):

(∀x)α = inf_{all interpretations of x} α

(∃x)α = sup_{all interpretations of x} α

Now, the truth value of a formula is calculated in the same manner as in ordinary predicate logic.

Fuzzy logic has experienced a tremendous growth in applications in the last decades. It is used in anything from car transmissions to washing machines. According to [Elkan, 1993], however, there have been no successful ES applications based on fuzzy logic. Elkan argues that under some circumstances fuzzy logic collapses to the ordinary 2-valued logic. The reason is that if we have two logically equivalent formulas A = ¬(α ∧ ¬β) and B = β ∨ (¬β ∧ ¬α), their truth values represented by t should be equal. Then for fuzzy logic we get that either t(α) = t(β) or t(α) = 1 − t(β). Therefore the only possible truth values for α and β are 0 and 1, which is the same as in ordinary logic. He explains the success in small devices as due to very short inference chains and empirically learned coefficients.

5.7. Possibility. Possibility theory, like probability theory, is a measure theory. It was developed by Lotfi Zadeh in the early 1980s and is based upon fuzzy sets. One of the motivations for its conception was the fact that human reasoning does not correspond to the rules of probability.

The principal difference from the theory of probability is that the measure of a union is not defined as a sum of the measures of subsets, but as their maximum value. The basic measure is



Possibility, Pos, and as in Dempster-Shafer theory it has a conjugate measure, Necessity, Nec. The basic properties of Possibility are:

Pos(∅) = 0
Pos(Ω) = 1

Pos(A ∪ B) = max(Pos(A), Pos(B))

For the necessity measure we then get:

Nec(A) = 1 − Pos(Ā), where Ā denotes the complement of A.
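A minimal Python sketch, with an invented possibility distribution over three weather states, shows how Pos and Nec are computed and how necessity is derived from the complement:

```python
# Possibility and necessity from a possibility distribution:
# Pos(A) = max over A, Nec(A) = 1 - Pos(complement of A).
pi = {"sun": 1.0, "clouds": 0.7, "rain": 0.3}
universe = set(pi)

def pos(event):
    return max(pi[w] for w in event)

def nec(event):
    complement = universe - event
    return 1.0 - (pos(complement) if complement else 0.0)

print(pos({"clouds", "rain"}))       # 0.7 = max(0.7, 0.3)
print(nec({"sun"}))                  # 0.3 = 1 - max(0.7, 0.3)
print(pos({"sun"}) >= nec({"sun"}))  # True: Pos always dominates Nec
```

Note that combining overlapping events needs only a maximum, which is the computational simplicity discussed below.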

The basic advantage of this theory over the theory of probability is the possibility to combine the measures of overlapping sets by simply calculating the maximum of the measures. The theory has been rigorously axiomatized, but there is no "Dutch Book" argument supporting it. This is a very active area of research.

We are not aware of any applications of possibility theory. There is a lot of theoretical research, but specific applications are rarely discussed. Some authors claim that this theory better represents indeterministic processes, and is therefore more suitable for representing uncertainty. Since these claims are mostly unsupported, we believe that the only advantage of possibility theory lies in its computational simplicity. Therefore we consider it to be useful mostly for heuristic approximations of probabilistic models.


References

[Anant Singh Jain, 1998] Anant Singh Jain, S. M. (1998). A state-of-the-art review of job-shop scheduling techniques. Technical report.

[Bacchus, 1991] Bacchus, F. (1991). Probabilistic belief logic.

[Bartak, 1999] Bartak, R. (1999). Expert Systems Based On Constraints. PhD thesis.

[Elkan, 1993] Elkan, C. (1993). The paradoxical success of fuzzy logic. IEEE Expert.

[Eric J. Horvitz, 1988] Eric J. Horvitz, J. S. B. (1988). Decision theory in expert systems and artificial intelligence. International Journal of Approximate Reasoning, (2):247–302.

[Freedman, 2003] Freedman, D. A. (2003). Notes on the Dutch book argument. Unpublished.

[Girratano, 1998] Girratano, R. (1998). Expert Systems, Principles and Programming. PWS.

[Halpern, 89] Halpern, J. Y. (1989). An analysis of first-order logics of probability. Artificial Intelligence, 46:311–350.

[Heckerman, 1996] Heckerman, D. (1996). A tutorial on learning with Bayesian networks. Technical report, Microsoft Co.

[Jones and Rabelo, 1998] Jones, A. and Rabelo, L. C. (1998). Survey of job-shop scheduling techniques.

[Jozef Kelemen, 1996] Jozef Kelemen, M. L. (1996). Expertne Systemy pre Prax. SOFA.

[Karmarkar, 1984] Karmarkar, N. (1984). A new polynomial-time algorithm for linear programming. Combinatorica, 4.

[Kass and Raftery, 1994] Kass, R. E. and Raftery, A. E. (1994). Bayes factors. Technical report.

[Kersting and Raedt, 2000] Kersting, K. and Raedt, L. D. (2000). Bayesian logic programs. In Cussens, J. and Frisch, A., editors, Proceedings of the Work-in-Progress Track at the 10th International Conference on Inductive Logic Programming, pages 138–155.

[Kuipers, 2001] Kuipers, B. (2001). Qualitative simulation.

[McCarthy, 1987] McCarthy, J. (1987). Generality in artificial intelligence. Commun. ACM, 30(12):1030–1035.

[Muggleton, 1995] Muggleton, S. (1995). Stochastic logic programs. In De Raedt, L., editor, Proceedings of the 5th International Workshop on Inductive Logic Programming, page 29. Department of Computer Science, Katholieke Universiteit Leuven.

[Park and Darwiche, 2004] Park, J. D. and Darwiche, A. (2004). Complexity results and approximation strategies for MAP explanations. Journal of Artificial Intelligence Research, 21:101–133.

[Poli and Brayshaw, 1995] Poli, R. and Brayshaw, M. (1995). A hybrid trainable rule-based system. Technical Report CSRP-95-3.

[Smets et al., ] Smets, P., Hsia, Y., Saffiotti, A., Kennes, R., Xu, H., and Umkehrer, E. The transferable belief model. pages 91–98.

[Stuart Russel, 2003] Stuart Russel, P. N. (2003). Artificial Intelligence: A Modern Approach. ???

[Ulf Nilsson, 1990] Ulf Nilsson, J. M. (1990). Logic, Programming and Prolog. John Wiley and Sons Ltd.

[Vladimir Marik, 1993] Vladimir Marik, e. a. (1993). Umela Inteligence 1. ACADEMIA.

[Vladimir Marik, 2003] Vladimir Marik, e. a. (2003). Umela Inteligence 4. ACADEMIA.

[Wilson and Borning, 1993] Wilson, M. and Borning, A. (1993). Hierarchical constraint logic programming. Technical Report TR-93-01-02.

E-mail address: [email protected]