Normative Reasoning with an Adaptive Self-interested Agent Model Based on Markov Decision Processes

Moser Silva Fagundes, Holger Billhardt, and Sascha Ossowski

Centre for Intelligent Information Technologies (CETINIA), University Rey Juan Carlos, Madrid, Spain

[email protected]

Abstract. Rational self-interested agents, which act so as to achieve the best expected outcome, should violate the norms if the expected rewards obtained by defecting from the norms surpass the expected rewards obtained by being norm-compliant. This means they should estimate the earnings brought about by the violations and the losses caused by the respective reactions. In this paper, we present a rational self-interested agent model that takes into account the possibility of breaking norms. To develop such a model, we employ Markov Decision Processes (MDPs). Our approach consists of representing the reactions to norm violations within the MDPs in such a way that the agent is able to reason about how those violations affect her expected utilities and future options. Finally, we perform an experiment in order to compare the model presented in this work with its norm-compliant version.

Keywords: Agent Architecture, Behaviour Adaptation, Norms, MDP.

1 Introduction

In Regulated Multiagent Systems, hereafter referred to as RMAS, the agents have to deal with mechanisms for adjusting their behaviour at some level in order to orchestrate global behaviours. The usefulness of such regulative mechanisms becomes more prominent in open systems, where heterogeneous agents are able to join and leave the RMAS at runtime. In these open RMAS there is no guarantee that the agents will act in a particular manner, and in this case, establishing some type of control over them makes possible the coordination of tasks and the avoidance of particular undesired states of the world. One way to regulate the agents' behaviour is through the implementation of norms into the MAS. Such specifications of principles of proper and acceptable conduct are used to guide regulative mechanisms, which monitor the environmental changes and enforce the norms through the execution of reactions.

Although autonomous agents may be persuaded to assume a particular behaviour, they are supposed to be free to accept or refuse to comply with a set of norms, and to handle the consequences of their choices. If there are norms governing the environment, an autonomous agent needs to be able to choose dynamically which of them to obey so as to successfully fulfil her purpose. Thus, in order to achieve an adaptive behaviour, the normative structures cannot be hard-wired into the agents.



A rational self-interested agent, which acts so as to achieve the best expected outcome, considers violating a set of norms in those cases where the expected rewards obtained by the violation surpass the expected rewards obtained by being norm-compliant. By this we mean she estimates the gains brought about by the violations and the losses caused by the reactions to those defections. Based on such an estimation, the agent is able to choose a course of action, possibly non-compliant with the norms, that maximizes her expected rewards.

This paper presents an architecture for norm-aware agents capable of adapting their policy (plan) to the norms in a self-interested way. During the adaptation process, the agent considers the possibility of disobeying (some of) the norms governing the RMAS. We performed an experiment to compare the proposed architecture with its norm-compliant version; it was carried out in the car traffic domain with different impunity degrees and different normative setups.

The agent model is based on Markov Decision Processes (MDPs), a formal mathematical framework widely used for modelling decision-making, planning and control in stochastic domains. MDPs provide a model of an agent interacting synchronously with a world in which there may be uncertainty about the outcomes of the agent's actions. An MDP can be described as a tuple ⟨S, s0, A, A(·), T, R⟩, where: S denotes a finite set of states of the world; s0 is the initial state; A is a finite set of actions; A(si) ⊆ A denotes the set of admissible actions in the state si (the agent's capabilities); T : S × A → Π(S) is a state-transition function giving, for each state and action, a probability distribution over states (T(si, a, sj) is the probability of executing a at si and ending at sj); and R : S → ℝ is a reward function that gives the expected immediate reward gained by the agent for achieving a given state of the world (R(si) is the reward for achieving the state si).
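
As a concrete illustration, the tuple above could be held in code roughly as follows (a minimal Python sketch; the field names are ours, not part of the paper):

from dataclasses import dataclass
from typing import Dict, Set, Tuple, Hashable

State = Hashable
Action = str

@dataclass
class MDP:
    """The tuple <S, s0, A, A(.), T, R> described above."""
    states: Set[State]                                     # S
    initial_state: State                                   # s0
    actions: Set[Action]                                   # A
    admissible: Dict[State, Set[Action]]                   # A(si): the agent's capabilities
    transitions: Dict[Tuple[State, Action, State], float]  # T(si, a, sj)
    rewards: Dict[State, float]                            # R(si)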

This paper is organized as follows: Section 2 defines the normative structure; Section 3 introduces our normative agent architecture; in Section 4, we describe our experiment; Section 5 presents related work; and finally, in Section 6 we draw conclusions and point out directions for future work.

2 Normative Structure

In RMAS the norms provide specifications of how the agent should behave. It is assumed that such norms can be violated – enforcement instead of regimentation [9]. If a norm is violated, then a reaction against the transgression takes place. Reactions can be carried out by the RMAS or by another empowered entity acting on its behalf (i.e. agents, regulative mechanisms). In this paper, we are not interested in who monitors the transgressions or executes the reactions, but in how these reactions affect the agents.

Our normative structure specifies only prohibitions and obligations. This means that if something is neither prohibited nor obliged, then it is permitted. The violation of a prohibition happens when an agent executes a prohibited action, while the violation of an obligation occurs when an agent does not perform an obliged action. Norm enforcement demands the detection of violations and the execution of reactions against the transgressors. Such a check-react mechanism is resource-bounded, like other computational processes. Therefore, from a practical viewpoint, impunity might be unavoidable (the execution of a reaction is uncertain).


Regarding how the reactions affect the agents, we adapt the concept of regulative mechanisms proposed in [6]. Originally, they were conceived as pro-active attempts to raise the overall utility of the MAS through adjustments to the agents' transition models and capabilities. Here they are employed as reactions to norm violations.

In order to represent a norm, we use the following form:

norm(deontic modality, agent, action, state, reaction), where:

– deontic modality ∈ {prohibition, obligation};
– agent ∈ Γ, and Γ is the set of agents participating in the RMAS;
– action ∈ A, and A is the agent's action set;
– state ∈ S, and S is the agent's state space;
– reaction has the form:

reaction(outcomeA(·), outcomeT ), where:

– outcomeA(·) specifies the set of modifications to be made to the agent's capability function; an element of this set has the form (statei, action, {0,1}), where 1 means that the action is admissible in statei, and 0 otherwise;

– outcomeT specifies the set of adjustments to be made to the probabilities of the agent's state-transition model; an element of this set has the form (statei, action, statej, [0 . . . 1]), which indicates the probability of executing the action at statei and ending at statej.
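
For illustration, the same normative structure can be encoded as plain records (a Python sketch under our own naming; not the authors' implementation):

from dataclasses import dataclass
from typing import Dict, Tuple, Hashable

State = Hashable
Action = str

@dataclass
class Reaction:
    # outcomeA(.): (statei, action) -> 1 if the action becomes admissible, 0 otherwise
    outcome_A: Dict[Tuple[State, Action], int]
    # outcomeT: (statei, action, statej) -> new transition probability in [0 .. 1]
    outcome_T: Dict[Tuple[State, Action, State], float]

@dataclass
class Norm:
    deontic_modality: str   # "prohibition" or "obligation"
    agent: str              # an element of Gamma
    action: Action          # an element of A
    state: State            # an element of S
    reaction: Reaction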

3 Reasoning about Norm Violations

This section presents our proposal of a normative agent architecture based on MDPs. Figure 1 gives an overview of the agent's internal architecture by showing its components (grey boxes) and the flow of information (white rounded boxes) among them. The agent's initial information consists of a set of Norms and the Original MDP¹. The agent interacts with a Regulated Environment, illustrated on the left-hand side of Figure 1. The agent architecture is composed of five components: the Adaptive Component adapts the agent's knowledge in order to express the reactions for violating norms; the MDP Connector Component establishes connections between the Original MDP and the Adapted MDPs through the violating actions; the Utility Computation and the Policy Constructor compute the agent's policy; finally, the Execution Component handles the perceptions and executes the actions specified within the policy.

As stated before in this paper, our model of a rational agent violates the norms only if the expected rewards obtained with the violations are higher than the expected rewards obtained by assuming a norm-compliant behaviour. This is the reason we have created an Adaptive Component to represent how the world would be if the agent violated a given norm.

¹ The agent's parameters are specified without taking into account any particular set of norms. Such an MDP tuple, which does not represent any type of reaction (sanction or penalty) to norm violations, is called the Original MDP.


[Figure: the agent, composed of the Adaptive Component, the MDP Connector Component, the Utility Computation, the Policy Constructor and the Execution Component (grey boxes), together with the Original MDP ⟨S,s0,A,A(·),T,R⟩, the A-MDPs (mdp1 … mdpm), the connected C-MDP (mdp0), the utilities U[mdp0] and the policy π (white boxes), interacting with the Regulated Environment and its norms norm1 … normn.]

Fig. 1. Overview of the agent's internal architecture

The adaptation process begins by replicating the Original MDP and then representing within this replica the effects of the reactions for violating the norm. According to the normative structure defined in Section 2, the reactions affect the agent's capability function and the transition model. The adapted tuples are named A-MDPs. The whole process is carried out online, since the norms are assumed to be unknown at the agent's design time (not hard-wired).
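
Assuming the MDP and Norm records sketched earlier, the adaptation step can be read as a deep copy followed by an overwrite of the affected entries (our reading of the text above, not the authors' code):

import copy

def build_a_mdp(original_mdp, norm):
    """Replicate the Original MDP and apply the reaction of `norm` to the replica."""
    a_mdp = copy.deepcopy(original_mdp)
    # outcomeA(.): update the capability function A(.)
    for (state, action), admissible in norm.reaction.outcome_A.items():
        if admissible:
            a_mdp.admissible.setdefault(state, set()).add(action)
        else:
            a_mdp.admissible.get(state, set()).discard(action)
    # outcomeT: overwrite the affected transition probabilities
    for (si, action, sj), prob in norm.reaction.outcome_T.items():
        a_mdp.transitions[(si, action, sj)] = prob
    return a_mdp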

Differently from the Adaptive Component, whose purpose consists of representing how the world would be if the agent violated a norm, the MDP Connector Component focuses on the construction of a single MDP by connecting the Original MDP with the A-MDPs. These connections are made by changing the outcomes of the violating actions. Instead of arriving exclusively at states of the Original MDP, the execution of a violating action may lead to states of the A-MDPs.

Figure 2 illustrates how the connections between the Original MDP and the A-MDPs are created. In this example, there is a normi indicating that the action a is prohibited for the agent in the state s0. To model the chances of being caught and suffering the reactions, we replicate the transitions for the action a starting at the state s0. But instead of arriving exclusively at states of the Original MDP, the new transitions arrive at their analogous² states {s1, . . ., sk} in the A-MDP(normi). The probabilities of these new transitions are multiplied by Pi – the probability of the violation being detected. However, there is a chance of going undetected. To model this chance, we multiply by (1 – Pi) the probabilities of arriving at the states {s1, . . ., sk} of the Original MDP.
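
In code, this rearrangement amounts to splitting the probability mass of the violating action between the two copies (a sketch; `p_detect` stands for Pi, and `analogous(s)` for the mapping from a state to its copy in A-MDP(normi), both our own names):

def connect(c_mdp, norm, p_detect, analogous):
    """Rearrange T(s0, a, -) so that a detected violation (probability Pi) leads to
    the analogous A-MDP states, while an undetected one (1 - Pi) stays in the
    Original MDP."""
    s0, a = norm.state, norm.action
    updated = {}
    for (si, action, sj), prob in c_mdp.transitions.items():
        if si == s0 and action == a:
            updated[(si, action, sj)] = (1.0 - p_detect) * prob        # undetected
            updated[(si, action, analogous(sj))] = p_detect * prob     # detected
        else:
            updated[(si, action, sj)] = prob
    c_mdp.transitions = updated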

The Utility Computation component computes the expected utilities of the states, while the Policy Constructor finds a policy based on those expected utilities. These two components address the question of which norms are worth breaking. A wide range of algorithms can be employed to implement these two components; the best known ones are Value Iteration and Policy Iteration.
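
A textbook Value Iteration followed by a greedy policy extraction would serve these two roles (a sketch, assuming the MDP record introduced earlier; not the authors' implementation):

def value_iteration(mdp, gamma=1.0, epsilon=1e-4):
    """Utility Computation: expected utilities U(s) by repeated Bellman backups."""
    U = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            actions = mdp.admissible.get(s, set())
            if actions:
                best = max(sum(mdp.transitions.get((s, a, s2), 0.0) * U[s2]
                               for s2 in mdp.states)
                           for a in actions)
                new_u = mdp.rewards[s] + gamma * best
            else:                                   # terminal state: no admissible action
                new_u = mdp.rewards[s]
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < epsilon:
            return U

def greedy_policy(mdp, U):
    """Policy Constructor: pick, in every state, the action maximising expected utility."""
    policy = {}
    for s in mdp.states:
        actions = mdp.admissible.get(s, set())
        if actions:
            policy[s] = max(actions, key=lambda a: sum(
                mdp.transitions.get((s, a, s2), 0.0) * U[s2] for s2 in mdp.states))
    return policy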

² Originally, every A-MDP is a replica of the Original MDP. They have identical state spaces, and consequently, every state of every A-MDP has exactly one analogous state in the Original MDP and in the other A-MDPs.


[Figure: the Original MDP and the A-MDP(normi) for normi(prohibition, agent, a, s0, reactioni). The transitions T(s0,a,–) towards {s1, …, sk} are split: with probability (1–Pi)·T(s0,a,sj) they remain within the Original MDP, and with probability Pi·T(s0,a,sj) they lead to the analogous states of the A-MDP(normi).]

Fig. 2. Rearrangement of the state-transitions T(s0,a,–) of the Original MDP in order to establish the connection with the A-MDP(normi)

The Execution Component is the agent's interface with the RMAS. Assuming the perceptions are correct and complete, the agent always knows the current environmental state, and consequently, she knows when the norms have been enforced. In this way, she is able to behave rationally according to the policy.

4 Experiment

To validate our model, we have implemented a car traffic RMAS with roads and junctions, where the agent is able to drive around. It is a synchronized environment, where the driver is able to execute one action per time step. Figure 3 illustrates the environment, composed of 57 cells that can be occupied by the driver (white squares, identified by their respective numbers) and 16 cells that cannot be occupied by the driver (grey squares). The arrows indicate the possible traffic flows in each cell, except for the 12 terminal cells (labelled T), where no action is admissible. Next to each terminal cell there is an entry point to the environment; such cells are used as initial positions for the driver. There are 9 junction cells, where movements in all directions (up, down, right and left) are possible. The agent is able to execute the action skip in all non-terminal states in order to stay in the current cell. Initially, all actions are deterministic with respect to the resulting cell; however, during the experiment the state-transition model is changed by the reactions to norm violations.

In order to regulate the system, we have placed a traffic light in the environment at junction 24. It assumes one of two possible states: red or green. The transition from one state to the other happens with probability 0.5 at each time step. A traffic light by itself is an informative mechanism; therefore, it requires a set of norms to specify the reactions for entering the junction when it is red. We create the following norm for a driver coming from cell 18:

norm1(prohibition, driver, up, {cell=18, light=red}, reaction)


[Figure: grid of cells with arrows indicating the permitted traffic flows, the 12 terminal cells (T), the traffic light at junction 24 and the driver's initial cell.]

Fig. 3. Car traffic environment

We create four different reactions to be used when the norm1 above is broken. All of them affect exclusively the transition model of the driver. The first, reaction1, is the most restrictive – it keeps the driver in cell 24 with probability 0.9. The last one, reaction4, is the most relaxed – it holds the driver in cell 24 with probability 0.6.

reaction1(∅, {{cell=24, up, cell=24, 0.9}, {cell=24, up, cell=35, 0.1},
  {cell=24, down, cell=24, 0.9}, {cell=24, down, cell=17, 0.1},
  {cell=24, left, cell=24, 0.9}, {cell=24, left, cell=30, 0.1},
  {cell=24, right, cell=24, 0.9}, {cell=24, right, cell=25, 0.1}})

reaction2(∅, {{cell=24, up, cell=24, 0.8}, {cell=24, up, cell=35, 0.2},
  {cell=24, down, cell=24, 0.8}, {cell=24, down, cell=17, 0.2},
  {cell=24, left, cell=24, 0.8}, {cell=24, left, cell=30, 0.2},
  {cell=24, right, cell=24, 0.8}, {cell=24, right, cell=25, 0.2}})

reaction3(∅, {{cell=24, up, cell=24, 0.7}, {cell=24, up, cell=35, 0.3},
  {cell=24, down, cell=24, 0.7}, {cell=24, down, cell=17, 0.3},
  {cell=24, left, cell=24, 0.7}, {cell=24, left, cell=30, 0.3},
  {cell=24, right, cell=24, 0.7}, {cell=24, right, cell=25, 0.3}})

reaction4(∅, {{cell=24, up, cell=24, 0.6}, {cell=24, up, cell=35, 0.4},
  {cell=24, down, cell=24, 0.6}, {cell=24, down, cell=17, 0.4},
  {cell=24, left, cell=24, 0.6}, {cell=24, left, cell=30, 0.4},
  {cell=24, right, cell=24, 0.6}, {cell=24, right, cell=25, 0.4}})
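
Since the four reactions differ only in the probability of being held at junction 24, they can be generated in a single loop (a sketch; the exit cells 35, 17, 30 and 25 are taken from the listing above, and only outcomeT is populated because outcomeA is empty):

def make_reaction(p_hold):
    """outcomeT of a reaction holding the driver in cell 24 with probability p_hold."""
    exits = {"up": 35, "down": 17, "left": 30, "right": 25}
    outcome_T = {}
    for action, exit_cell in exits.items():
        outcome_T[(24, action, 24)] = p_hold
        outcome_T[(24, action, exit_cell)] = 1.0 - p_hold
    return outcome_T

# reaction1 .. reaction4, from most restrictive (0.9) to most relaxed (0.6)
reactions = {i: make_reaction(p) for i, p in enumerate((0.9, 0.8, 0.7, 0.6), start=1)}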


By holding an agent in a given cell we directly affect her expected utilities – for each time step the driver stays in cell 24, she accumulates –0.05 on her sum of rewards. In the real world, this could correspond to a situation where an officer stops the car for a warning and keeps the driver there for a while. Despite the fact that no explicit fine is issued, the driver loses time and accumulates punishment.

4.1 Specification of the Original MDP

The state space and the action space of the driver correspond, respectively, to:

S = {0, 1, 2, . . ., 56} × {red, green}
A = {up, down, right, left, skip}

The admissible actions (capability function A(·)) are given by the arrows shown in Figure 3. The action skip is admissible in all non-terminal states. Regarding the state-transition function, all actions are initially deterministic and the agent reaches the intended cell with certainty. We say initially because the state-transition model may be changed by normative reactions, as shown in the next subsection. The traffic light, as stated before, changes its state with probability 0.5 every time step. To complete the parameters of the Original MDP, we have to specify the reward function. The reward for all states is –0.05, except for the following ones:

R({cell=52, light=red}) = +1.0
R({cell=52, light=green}) = +1.0

Finally, the agent’s initial state is {cell=1, light=green}.
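
These parameters translate almost directly into code (a sketch; the capability function, which encodes the arrows of Figure 3, is omitted since the grid layout is only given graphically):

from itertools import product

CELLS = range(57)                            # driver cells 0 .. 56
LIGHTS = ("red", "green")
STATES = set(product(CELLS, LIGHTS))         # S = cells x traffic-light status
ACTIONS = {"up", "down", "right", "left", "skip"}

def reward(state):
    """R: +1.0 for cell 52 (either light colour), -0.05 everywhere else."""
    cell, _light = state
    return 1.0 if cell == 52 else -0.05

INITIAL_STATE = (1, "green")                 # {cell=1, light=green}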

4.2 Experiment and Results

In this subsection we compare the agent model presented in Section 3, referred to as the NNC Agent (non-norm-compliant), with its norm-compliant version [8], referred to as the NC Agent (norm-compliant). Both agents have the same Original MDP.

Two parameters have a direct impact on the agent's expected utility (EU): the detection rate of violations and the reactions' outcomes. Regarding the first parameter, we analyse it from 0% to 100% in intervals of 10%. Regarding the second one, we use the four reactions specified previously in this section, which have different degrees of restrictiveness. The goal of our experiment consists of estimating the EU for the agents under different settings of those two parameters. The EU is found by running the Value Iteration algorithm with a maximum error of 0.0001 (we do not limit the number of iterations) and a discount factor of 1.0. In the comparisons we use the EU of the agents' initial state ({cell=1, light=green}).
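
The experimental procedure thus reduces to a double loop over detection rates and reactions, reading off the EU of the initial state (a sketch reusing the earlier helpers; `build_connected_mdp`, `original_mdp` and `norm1` are placeholders for the adaptation and connection steps of Section 3 and the objects defined above):

detection_rates = [i / 10 for i in range(11)]      # 0%, 10%, ..., 100%
results = {}

for reaction_id, reaction in reactions.items():    # reaction1 .. reaction4
    for p_detect in detection_rates:
        # hypothetical helper: replicate, adapt and connect the MDPs for norm1
        mdp = build_connected_mdp(original_mdp, norm1, reaction, p_detect)
        U = value_iteration(mdp, gamma=1.0, epsilon=1e-4)
        results[(reaction_id, p_detect)] = U[INITIAL_STATE]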

Figure 4 shows the results. The vertical axis indicates the EU for the agent in the initial state, while the horizontal axis indicates the detection rate of norm violations in the RMAS. The NC Agent does not violate norm1, therefore her EU is constant (0.6500, dotted line) – it depends neither on the reactions nor, consequently, on the detection rate. On the other hand, for the NNC Agent, the EU decreases as the detection rate increases, until it reaches 0.6500. At this point, the NNC Agent has assumed a norm-compliant behaviour, because the EU of being non-normative is lower than the EU of being norm-compliant. In the worst case, the EU obtained by the NNC Agent is equal to the EU obtained by the NC Agent. For example, assume reaction2 is executed when norm1 is violated. In this case, the NNC Agent breaks norm1 only if the detection rate is under 50%. Otherwise, she follows norm1 and executes skip when the traffic light is red.

Comparing the effect of the four different reactions under the same detection rate, we conclude that the EU is lower for reaction1 (most restrictive) and higher for reaction4 (most relaxed). The only setting in which the NNC Agent never follows norm1 is when reaction4 is the chosen one; in this case, reaction4 is too soft.

[Chart: EU(cell=1) on the vertical axis (from 0.65 to 0.70), violation detection rate (0–100%) on the horizontal axis; one curve per setting: NNC/reaction1 to NNC/reaction4, plus the constant NC line.]

Fig. 4. Chart showing the agents' expected utility (EU) for the initial state, taking into account the detection rate of norm violations in the RMAS. Each line depicts the EU for a particular agent (NC or NNC) with different reactions for violating the norm governing the RMAS.

5 Related Work

The idea of norm violation by autonomous agents has been discussed by the MAS community in recent years. Most work on normative reasoning has been done from the practical reasoning perspective, especially through the well-known goal-oriented architectures [5,7,2,3,11,13]. In these models there is usually some deliberation for weighing competing alternatives, which can be personal goals of the agent or normative goals generated from the norms ruling the MAS. If these normative goals survive the deliberation process, then the agent complies with the respective norms; otherwise norm violations may take place. Assuming the deliberation is done on the basis of a preference ordering, the main problem consists of computing the preference values to be assigned to the normative goals. However, the impact of the planning task (also referred to as means-end reasoning) on normative reasoning has received less attention.

Another research direction on norm violations by autonomous agents consists of employing essentially utility-oriented agents. Cardoso and Oliveira [4] study adaptive mechanisms that enable a normative framework to change deterrence sanctions according to an agent population. The agents are characterized by a risk tolerance parameter and a social awareness parameter. Although these agents are utility maximizers, they are not all equally self-interested, which makes it possible to generate heterogeneous populations. In this work, the agents decide to violate a norm exclusively based on those two parameters. The relation between norm deterrence and planning is not considered. Agotnes et al. [1] develop a model of normative multiagent systems, whose possible transitions are represented through Kripke structures and whose norms are implemented as constraints over these structures. The agent's goals are specified as a hierarchy of formulae of Computational Tree Logic. Such a hierarchy defines a model of ordinal utility, which makes it possible to interpret the Kripke-based normative systems as games. This paper focuses on the problem of whether the agents should defect from the normative system or not. To address this question, the authors use game theory: an agent takes into account not just whether the normative system would be beneficial for itself, but also whether other agents will rationally choose to participate. The normative reasoning is based on the utility of the goals and the pay-offs of the game. It is assumed that the reasoning process takes place before the agents join the system (design time).

6 Conclusion

This paper presents an architecture for adaptive norm-aware agents that estimate the benefits of breaking norms. Based on such an estimation, the agent decides whether or not to comply with the norms. We use MDPs for normative reasoning. This makes explicit the cost of planning with norms (the amount of time needed to make a rational choice in a regulated environment) and the benefits of the plans (through the expected rewards). Internally, the agent does not assign preference values or desirability degrees to the normative structures. Instead, the impact of the norms and reactions on the agent is observed in her expected utilities and policy.

Our results have shown that the presented agent model obtains more rewards than its norm-compliant version. However, the main drawback of the model, when compared with the norm-compliant agent, is the amount of time taken to compute the policy. Our first line of future work concerns exploring techniques to improve the efficiency of the reasoning process; among the possible techniques, we cite Factored MDPs [10] and hybrid agent architectures [12]. Our second line of future work consists of experimenting with our model in an RMAS with multiple agents, where the norms can change dynamically in order to successfully fulfil their purpose.

Acknowledgements

This work is supported by the Spanish Ministry of Science and Innovation through the projects "AT" (CSD2007-0022, CONSOLIDER-INGENIO 2010) and "OVAMAH" (TIN2009-13839-C03-02).


References

1. Agotnes, T., van der Hoek, W., Wooldridge, M.: Normative system games. In: 6th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2007), Honolulu, Hawaii, USA, May 14-18, pp. 1–8. IFAAMAS (2007)

2. Boella, G., Damiano, R.: An Architecture for Normative Reactive Agents. In: Kuwabara, K., Lee, J. (eds.) PRIMA 2002. LNCS (LNAI), vol. 2413, pp. 1–17. Springer, Heidelberg (2002)

3. Broersen, J., Dastani, M., Hulstijn, J., van der Torre, L.: Goal Generation in the BOID Architecture. Cognitive Science Quarterly Journal 2(3-4), 428–447 (2002)

4. Cardoso, H.L., Oliveira, E.: Adaptive Deterrence Sanctions in a Normative Framework. In: Proceedings of the 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology (IAT 2009), Milan, Italy, September 15-18, pp. 36–43. IEEE, Los Alamitos (2009)

5. Castelfranchi, C., Dignum, F., Jonker, C., Treur, J.: Deliberative Normative Agents: Principles and Architecture. In: Jennings, N.R. (ed.) ATAL 1999. LNCS, vol. 1757, pp. 364–378. Springer, Heidelberg (2000)

6. Centeno, R., Billhardt, H., Hermoso, R., Ossowski, S.: Organising MAS: A Formal Model Based on Organisational Mechanisms. In: Proceedings of the 2009 ACM Symposium on Applied Computing (SAC), Honolulu, Hawaii, USA, March 9-12, pp. 740–746. ACM, New York (2009)

7. Dignum, F., Morley, D.N., Sonenberg, L., Cavedon, L.: Towards socially sophisticated BDI agents. In: 4th International Conference on Multi-Agent Systems (ICMAS 2000), Boston, MA, USA, July 10-12, pp. 111–118. IEEE Computer Society, Los Alamitos (2000)

8. Fagundes, M.S., Billhardt, H., Ossowski, S.: Behavior Adaptation in RMAS: An Agent Architecture based on MDPs. In: 20th European Meeting on Cybernetics and Systems Research (EMCSR 2010), Vienna, Austria, April 6-9, pp. 453–458. Austrian Society for Cybernetic Studies (2010)

9. Grossi, D., Aldewereld, H., Dignum, F.: Ubi Lex, Ibi Poena: Designing Norm Enforcement in E-Institutions. In: Noriega, P., Vazquez-Salceda, J., Boella, G., Boissier, O., Dignum, V., Fornara, N., Matson, E. (eds.) COIN 2006. LNCS (LNAI), vol. 4386, pp. 101–114. Springer, Heidelberg (2006)

10. Guestrin, C., Koller, D., Parr, R., Venkataraman, S.: Efficient Solution Algorithms for Factored MDPs. Journal of Artificial Intelligence Research (JAIR) 19, 399–468 (2003)

11. y Lopez, F.L., Luck, M., d'Inverno, M.: A Normative Framework for Agent-Based Systems. In: Normative Multi-agent Systems. Dagstuhl Seminar Proceedings, vol. 07122 (2007)

12. Nair, R., Tambe, M.: Hybrid BDI-POMDP Framework for Multiagent Teaming. Journal of Artificial Intelligence Research (JAIR) 23, 367–420 (2005)

13. Pacheco, N.C., Argente, E., Botti, V.: A BDI Architecture for Normative Decision Making (extended abstract). In: 9th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2010), Toronto, Canada, pp. 1383–1384. IFAAMAS (May 2010)