Machine Learning, 5, 355-381 (1990). © 1990 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Learning Sequential Decision Rules Using Simulation Models and Competition
JOHN J. GREFENSTETTE (GREF@AIC.NRL.NAVY.MIL)
CONNIE LOGGIA RAMSEY (RAMSEY@AIC.NRL.NAVY.MIL)
ALAN C. SCHULTZ (SCHULTZ@AIC.NRL.NAVY.MIL)
Navy Center for Applied Research in Artificial Intelligence, Naval Research Laboratory, Washington, DC 20375-5000
Abstract. The problem of learning decision rules for sequential tasks is addressed, focusing on the problem of learning tactical decision rules from a simple flight simulator. The learning method relies on the notion of competition and employs genetic algorithms to search the space of decision policies. Several experiments are presented that address issues arising from differences between the simulation model on which learning occurs and the target environment on which the decision rules are ultimately tested.
Keywords. Sequential decision rules, competition-based learning, genetic algorithms.
1. Introduction
In response to the knowledge acquisition bottleneck associated with the design of expert systems, research in machine learning attempts to automate the knowledge acquisition process and to broaden the base of accessible sources of knowledge. The choice of an appropriate learning technique depends on the nature of the performance task and the form of available knowledge. If the performance task is classification, and a large number of training examples are available, then inductive learning techniques (Michalski, 1983) can be used to learn classification rules. If there exists an extensive domain theory and a source of expert behavior, then explanation-based methods may be applied (Mitchell, Mahadevan & Steinberg, 1985). Many interesting practical problems that may be amenable to automated learning do not fit either of these models. One such class of problems is the class of sequential decision tasks. For many interesting sequential decision tasks, there exists neither a database of examples nor a complete and tractable domain theory that might support traditional machine learning methods. In these cases, one method for manually developing a set of decision rules is to test a hypothetical set of rules against a simulation model of the task environment, and to incrementally modify the decision rules on the basis of the simulated experience. Research in machine learning may help to automate this process of learning from a simulation model. This paper presents some initial efforts in that direction.
Sequential decision tasks may be characterized by the following general scenario: A decision making agent interacts with a discrete-time dynamical system in an iterative fashion. At the beginning of each time step, the system is in some state. The agent observes a representation of the current state and selects one of a finite set of actions, based on the agent's decision rules. As a result, the dynamical system enters a new state and returns a (perhaps null) payoff. This cycle repeats indefinitely. The objective is to find a set of decision rules that maximizes the expected total payoff.¹ For many sequential decision problems, including the one considered here, the most natural formulation of the problem includes delayed payoff, in the sense that non-null payoff occurs only when some special condition occurs. While the tasks we consider here have a naturally graduated payoff function, it should be noted that any problem solving task may be cast into the sequential decision paradigm, by defining the payoff to be a positive constant for any goal state and null for non-goal states (Barto et al., 1989).
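The observe-act-payoff cycle described above can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the names `run_episode`, `toy_step`, `always_right`, and `always_left` are hypothetical, the toy system (positions 0 to 5 on a line) is invented, and the cycle is cut off after `max_steps` steps, whereas the scenario in the text repeats indefinitely. The toy payoff follows the goal-state construction mentioned above: a positive constant at the goal state and null elsewhere.

```python
from typing import Callable, Tuple

State = int
Action = int
# A step function maps (state, action) to (new state, payoff, finished?).
StepFn = Callable[[State, Action], Tuple[State, float, bool]]
Policy = Callable[[State], Action]


def run_episode(step: StepFn, policy: Policy, start: State, max_steps: int) -> float:
    """Run the observe-act cycle and return the total payoff collected."""
    state = start
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)                      # agent consults its decision rules
        state, payoff, done = step(state, action)   # system enters a new state
        total += payoff                             # payoff may be null (0.0)
        if done:
            break
    return total


GOAL = 5  # hypothetical toy system: positions 0..5, goal at position 5


def toy_step(state: State, action: Action) -> Tuple[State, float, bool]:
    """Delayed payoff: 1.0 on reaching the goal state, 0.0 otherwise."""
    nxt = max(0, min(GOAL, state + action))
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def always_right(state: State) -> Action:
    return 1


def always_left(state: State) -> Action:
    return -1
```

Under these assumptions, `run_episode(toy_step, always_right, 0, 20)` collects the single goal payoff, while `always_left` never receives a non-null payoff.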
Several laboratory-scale sequential decision tasks have been investigated in the machine learning literature, including pole balancing (Selfridge, Sutton & Barto, 1985), gas pipeline control (Goldberg, 1983), and the animat problem (Wilson, 1985; Wilson, 1987). In addition, sequential decision problems include many important practical problems, and much work has been devoted to their solution. The field of adaptive control theory has developed sophisticated techniques for sequential decision problems for which sufficient knowledge of the dynamical system is available in the form of a tractable mathematical model. For problems lacking a complete mathematical model of the dynamical system, dynamic programming methods can produce optimal decision rules, as long as the number of states is fairly small. The Temporal Difference (TD) method (Sutton, 1988) addresses learning control rules through incremental experience. Like dynamic programming, the TD method requires sufficient memory (perhaps distributed among the units of a neural net) to store information about the individual states of the dynamical system (Barto et al., 1989). For very large state spaces, genetic algorithms offer the chance to learn decision rules without partitioning of the state space a priori. Classifier systems (Holland, 1986; Goldberg, 1983) use genetic algorithms at the level of individual rules, or classifiers, to derive decision rules for sequential tasks.
The system described in this paper adopts a distinctly different approach, applying genetic algorithms at the level of the tactical plan, rather than the individual rule, with each tactical plan comprising an entire set of decision rules for the given task. This approach is especially designed for sequential decision tasks involving a rapidly changing state and other agents, and therefore best suited to reactive rather than projective planning (Agre & Chapman, 1987).
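To make the idea of a genetic algorithm operating on whole plans concrete, the following Python sketch treats each individual in the population as a complete set of decision rules and assigns fitness to the plan as a unit. It is a minimal sketch under stated assumptions and does not reproduce SAMUEL's representation or operators: the toy task, the table-style plan (one action per state), the graduated payoff, and the functions `evaluate`, `crossover`, `mutate`, and `evolve` are all hypothetical.

```python
import random

GOAL = 5            # hypothetical toy task: positions 0..5, start at 0
ACTIONS = (-1, 1)   # move left or right


def evaluate(plan):
    """Graduated payoff for a whole plan: fraction of the way to GOAL reached."""
    state = furthest = 0
    for _ in range(20):
        state = max(0, min(GOAL, state + plan[state]))
        furthest = max(furthest, state)
    return furthest / GOAL


def crossover(a, b, rng):
    """One-point crossover between two complete plans."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(plan, rng, rate=0.1):
    """Replace each rule's action with a random one with probability `rate`."""
    return [rng.choice(ACTIONS) if rng.random() < rate else act for act in plan]


def evolve(pop_size=20, generations=30, seed=0):
    """Evolve plans; return the best final plan and the best payoff per generation."""
    rng = random.Random(seed)
    pop = [[rng.choice(ACTIONS) for _ in range(GOAL + 1)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(pop, key=evaluate, reverse=True)
        history.append(evaluate(ranked[0]))
        parents = ranked[: pop_size // 2]   # the better half survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=evaluate), history
```

Because the better half of each generation is carried over unchanged in this sketch, the best payoff recorded in `history` never decreases. The point of the example is only that selection and recombination act on entire plans, in contrast to classifier systems, where the genetic algorithm acts on individual rules.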
The approach described here reflects a particular methodology for learning via a simulation model. The motivation behind the methodology is that making mistakes on real systems may be costly or dangerous. Since learning may require experimenting with tactical plans that might occasionally produce unacceptable results if applied to the real world, we assume that hypothetical plans will be evaluated in a simulation model (see Figure 1).
Periodically, a plan is extracted from the learning system to represent the learning system's current plan. This plan is tested in the target environment, and the resulting performance is plotted on a learning curve. In principle, this mode of learning might continue indefinitely, with the user periodically updating the decision rules used in the target environment with the current plan suggested by the learning system.
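The methodology above (experiment only in the simulation model, and periodically test the current plan in the target environment) can be sketched as follows. This is an illustrative assumption-laden example rather than the system described in this paper: a simple hill climber stands in for the learning method, the toy task is invented, and the single parameter `slip` (the probability that an action is reversed) is a hypothetical way of making the simulation model and the target environment differ.

```python
import random

GOAL = 5  # hypothetical toy task: positions 0..5, start at 0


def run(plan, slip, rng, steps=20):
    """Payoff of one trial: fraction of the way to GOAL reached.

    With probability `slip`, the chosen action is reversed.
    """
    state = furthest = 0
    for _ in range(steps):
        move = plan[state]
        if rng.random() < slip:
            move = -move
        state = max(0, min(GOAL, state + move))
        furthest = max(furthest, state)
    return furthest / GOAL


def learn_offline(iterations=200, test_every=20, sim_slip=0.0, target_slip=0.2, seed=0):
    """Learn in the simulation model; periodically test the current plan in the target."""
    rng = random.Random(seed)
    plan = [rng.choice((-1, 1)) for _ in range(GOAL + 1)]
    curve = []  # learning curve: (iteration, mean payoff in the target environment)
    for i in range(1, iterations + 1):
        candidate = list(plan)
        k = rng.randrange(len(candidate))
        candidate[k] = -candidate[k]        # try changing one rule
        # All trial and error happens in the simulation model.
        if run(candidate, sim_slip, rng) >= run(plan, sim_slip, rng):
            plan = candidate
        if i % test_every == 0:
            # Only the extracted current plan is run in the target environment.
            score = sum(run(plan, target_slip, rng) for _ in range(50)) / 50
            curve.append((i, score))
    return plan, curve
```

Calling `learn_offline()` returns the final plan together with a list of `(iteration, score)` pairs that could be plotted as a learning curve. Varying `sim_slip` relative to `target_slip` is one simple way to explore mismatches between the training model and the target environment in this toy setting.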
Simulation models have played an important role in several machine learning efforts. The idea of using a simulator to generate examples goes back to Samuel (1963) in the domain of checkers. Buchanan, Sullivan, Cheng and Clearwater (1988) use the RL system to learn error classification rules from a model of a particle beam accelerator, but do not explicitly address the effects of differences between the simulation model and a real target system. Goldberg (1983) describes a classifier system that learns control rules for a simulated gas pipeline. Booker (1982, 1988) and Wilson (1985) present classifier systems for organisms learning in a simulated environment. Research on classifier systems does not usually distinguish between the simulated environment used for learning and a separate target environment. One exception is Booker (1988), who discusses the possibility of an organism using a classifier system to build an internal model of the external environment. Making a clear distinction between the simulation model used for training and the target environment used for testing suggests a number of experiments that measure the effects of differences between the training model and the target environment. The experiments described here represent some steps in this direction.
Figure 1. A model for learning from a simulation model. [Diagram labels: ON-LINE SYSTEM (TARGET ENVIRONMENT, RULE INTERPRETER); OFF-LINE SYSTEM (SIMULATION MODEL, RULE INTERPRETER, ... MODULE).]
The remainder of the paper is organized as follows: Section 2 describes the particular sequential decision task, called the Evasive Maneuvers problem, that provides the context for the current research. Section 3 describes the learning system SAMUEL, including its knowledge representation, its performance module, and its learning methods. Section 4 presents a case study of the application of SAMUEL to the Evasive Maneuvers problem. First some experiments are described that focus on specific mechanisms of the learning systems. These are followed by studies that deal with the effects of the difference between the simulation model on which learning occurs and the target environment on which the results of learning will be tested. Section 5 summarizes our results and presents topics for further research.
2. The evasive maneuvers problem
The experiments described here concern a particular sequential decision task called the Evasive Maneuvers (EM) problem, inspired in part by Erickson and Zytkow (1988). In the EM problem, there are two objects of interest, a plane and a missile. The tactical objective is to maneuver the plane to avoid being hit by the approaching missile. The missile tracks the motion of the plane and steers toward the plane's anticipated position. The initial speed of the missile is greater than that of the plane, but the missile loses speed as it maneuvers. If the missile speed drops below some threshold, it loses maneuverability and drops out of the sky. It is assumed that the plane is more maneuverable than the missile;