
Page 1:

Security in Multiagent Systems by Policy Randomization

Praveen Paruchuri, Milind Tambe, Fernando Ordonez

University of Southern California

Sarit Kraus

Bar-Ilan University, Israel

University of Maryland, College Park

Page 2:

Motivation: The Prediction Game

A UAV (Unmanned Aerial Vehicle) flies between the 4 regions.

Can you predict the UAV's flight pattern?

Pattern 1: 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, ...
Pattern 2: 1, 4, 3, 1, 1, 4, 2, 4, 2, 3, 4, 3, ... (as generated by a 4-sided die)
Can you predict the next region even if 100 numbers of Pattern 2 are given?

Randomization decreases predictability and increases security.

Region 1 Region 2

Region 3 Region 4
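As a minimal illustration (not from the slides), the two patterns could be generated in Python roughly as follows; the deterministic cycle is fully predictable, while the die-roll sequence is not:

```python
import random

# Pattern 1: a fixed round-robin visit order -- fully predictable.
pattern_1 = [(i % 4) + 1 for i in range(12)]

# Pattern 2: each region drawn uniformly at random, like rolling a 4-sided die.
# Observing any number of past draws does not help predict the next one.
pattern_2 = [random.randint(1, 4) for _ in range(12)]

print(pattern_1)  # [1, 2, 3, 4, 1, 2, 3, 4, ...]
print(pattern_2)  # e.g. [1, 4, 3, 1, 1, 4, 2, 4, ...]
```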

Page 3:

Problem Definition

Problem: Increase security by decreasing predictability for an agent or agent team acting in uncertain, adversarial environments. Even if the policy is given to the adversary, it remains secure. Efficient algorithms for the reward/randomness tradeoff.

Assumptions for the agent/agent team: The adversary is unobservable.

– The adversary's actions, capabilities, and payoffs are unknown.

Assumptions for the adversary: Knows the agents' plan/policy. Exploits the action predictability. Can see the agent's state (or belief state).

Page 4:

Solution Technique

Technique developed: Intentional policy randomization within the MDP/POMDP framework

– Sequential decision making

– MDP Markov Decision Process

– POMDP Partially Observable MDP

Increasing security => solve a multi-criteria problem for the agents: maximize action unpredictability (policy randomization) while maintaining reward above a threshold (quality constraints).

Page 5:

Domains

Scheduled activities at airports, such as security checks and refueling, are observable by anyone; randomization of the schedules is helpful.

A UAV or UAV team patrolling a humanitarian mission. The adversary disrupts the mission: it can disrupt food delivery, harm refugees, shoot down UAVs, etc. Randomize the UAV patrol policy.

Page 6:

My Contributions

Two main contributions

Single agent case:

– Formulate as a non-linear program with an entropy-based metric

– Convert to a linear program, BRLP (Binary Search for Randomization LPs)

– Randomize single-agent policies while keeping reward > threshold

Multi Agent Case : RDR (Rolling Down Randomization)

– Randomized policies for decentralized POMDPs

– Threshold on team reward

Page 7:

MDP based single agent case

An MDP is a tuple <S, A, P, R>, where S is the set of states, A the set of actions, P the transition function, and R the reward function.

Basic terms used: x(s,a) is the expected number of times action a is taken in state s. The policy, as a function of the MDP flows, is

$$\hat{\pi}(s,a) = \frac{x(s,a)}{\sum_{a' \in A} x(s,a')}$$
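A small Python sketch of this flows-to-policy conversion; the dict-based representation of x(s,a) is an assumption made for illustration, not the authors' code:

```python
from collections import defaultdict

def policy_from_flows(x):
    """Convert MDP flow variables x(s, a) -- the expected number of times
    action a is taken in state s -- into a stochastic policy pi(s, a)."""
    flow_per_state = defaultdict(float)
    for (s, a), flow in x.items():
        flow_per_state[s] += flow

    pi = {}
    for (s, a), flow in x.items():
        # States with zero total flow are never reached; leave them out.
        if flow_per_state[s] > 0:
            pi[(s, a)] = flow / flow_per_state[s]
    return pi

# Example: in state 's0' the flow splits 3:1 between actions 'a0' and 'a1'.
x = {('s0', 'a0'): 3.0, ('s0', 'a1'): 1.0}
print(policy_from_flows(x))  # {('s0', 'a0'): 0.75, ('s0', 'a1'): 0.25}
```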

Page 8:

Entropy : Measure of randomness

Randomness or information content: entropy (Shannon, 1948). Entropy for an MDP can be defined in two ways.

Additive entropy – add the entropies of each state ($\hat{\pi}$ is a function of x):

$$H_A(x) = -\sum_{s \in S} \sum_{a \in A} \hat{\pi}(s,a)\,\log \hat{\pi}(s,a)$$

Weighted entropy – weigh each state's entropy by its contribution to the total flow:

$$H_W(x) = -\sum_{s \in S} \frac{\sum_{a \in A} x(s,a)}{\sum_{j \in S} \alpha_j} \sum_{a \in A} \hat{\pi}(s,a)\,\log \hat{\pi}(s,a)$$

where $\alpha_j$ is the initial flow of the system into state $j$.
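Both entropies could be computed from the flows along the following lines; this is a hedged sketch in which the dict-based flow representation and the base-2 logarithm are assumptions:

```python
import math
from collections import defaultdict

def mdp_entropies(x, alpha):
    """Additive and weighted entropy of the policy induced by flows x(s, a).

    x     : dict (state, action) -> expected flow
    alpha : dict state -> initial flow alpha_j into that state
    """
    flow_per_state = defaultdict(float)
    for (s, a), f in x.items():
        flow_per_state[s] += f

    total_initial_flow = sum(alpha.values())
    h_additive = 0.0
    h_weighted = 0.0
    for s, sf in flow_per_state.items():
        if sf <= 0:
            continue
        # Entropy of the action distribution pi(s, .) = x(s, .) / sum_a x(s, a).
        h_s = -sum((f / sf) * math.log2(f / sf)
                   for (s2, _), f in x.items() if s2 == s and f > 0)
        h_additive += h_s
        # Weighted entropy: each state's entropy is scaled by its share of flow.
        h_weighted += (sf / total_initial_flow) * h_s
    return h_additive, h_weighted
```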

Page 9:

Tradeoff : Reward vs Entropy

Non-linear program: maximize entropy with reward above a threshold. The objective (entropy) is non-linear.

BRLP (Binary Search for Randomization LPs): a linear program. No entropy calculation; entropy enters only as a function of the flows.

$$\begin{array}{ll} \max & H_W(x) \\ \text{s.t.} & Ax = \alpha \\ & R \cdot x \ge R_{\min} \\ & x \ge 0 \end{array}$$

Page 10:

BRLP

Input policy and target reward (n% of the maximum reward)

Poly-time convergence

Monotonicity: entropy decreases or stays constant as the reward threshold increases. Control is through the parameter β.

Input can be any high entropy policy

One such input is the uniform policy: equal probability for all actions out of every state.

Page 11:

LP for Binary Search

Policy as a function of β and the high-entropy flows x̂(s,a)

Linear Program

$$\begin{array}{ll} \max & R \cdot x \\ \text{s.t.} & Ax = \alpha \\ & x(s,a) \ge \beta\,\hat{x}(s,a) \quad \forall\, s \in S,\ a \in A \\ & x \ge 0 \end{array}$$
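A sketch of the BRLP binary search over β built on this linear program. The use of scipy.optimize.linprog and the vector/matrix encoding of the flow-conservation constraints are assumptions for illustration, not the authors' implementation:

```python
from scipy.optimize import linprog

def brlp(R, A_eq, alpha, x_hat, target_reward, tol=1e-3):
    """Binary search on beta for the BRLP linear program (sketch).

    Solves  max R.x  s.t.  A_eq x = alpha,  x >= beta * x_hat
    for a given beta; beta = 0 recovers the deterministic max-reward policy,
    beta = 1 forces the flows of the high-entropy input x_hat.
    """
    def solve(beta):
        # linprog minimizes, so negate R; lower bounds encode x >= beta * x_hat.
        bounds = [(beta * xh, None) for xh in x_hat]
        res = linprog([-r for r in R], A_eq=A_eq, b_eq=alpha,
                      bounds=bounds, method="highs")
        return res.x, -res.fun

    lo, hi = 0.0, 1.0
    best_x, best_reward = solve(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        x_mid, reward_mid = solve(mid)
        if reward_mid >= target_reward:
            # Still above the target: we can afford more randomization.
            lo, best_x, best_reward = mid, x_mid, reward_mid
        else:
            hi = mid
    return best_x, best_reward
```

The monotonic relationship between reward and β noted on the previous slide is what lets a simple binary search home in on the target reward with a small number of LP solves.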

Page 12:

BRLP in Action

[Diagram: binary search over β. β = 1 forces the maximum-entropy input flows; β = 0 yields the deterministic maximum-reward policy; intermediate values (e.g. β = .5) are probed, via the constraint x(s,a) ≥ β·x̂(s,a), until the LP's reward meets the target reward.]

Page 13:

Results (Averaged over 10 MDPs)

[Plot: average weighted entropy vs. reward threshold (%) for BRLP, Hw(x), Ha(x), and the maximum-entropy bound.]

Highest entropy: the expected-entropy method, with a 10% average gain over BRLP. Fastest: BRLP, with a 7-fold average speedup over the expected-entropy method.

[Plot: execution time (sec) vs. reward threshold (%) for BRLP, Hw(x), and Ha(x).]

Page 14:

Multi Agent Case: Problem

Maximize entropy for agent teams subject to reward threshold

For the agent team: a decentralized POMDP framework is used. Agents know the initial joint belief state. No communication is possible between agents.

For the adversary: knows the agents' policy, exploits the action predictability, and can calculate the agents' belief state.

Page 15:

RDR : Rolling Down Randomization

Input: the best (local or global) deterministic policy, the percent of reward loss allowed, and the d parameter, which sets the number of turns each agent gets.

– Example: d = .5 => number of steps = 1/d = 2
– Each agent gets one turn (for the 2-agent case)
– A single-agent MDP problem is solved at each step

For agent 1's turn: fix the policy of the other agents (agent 2) and find a randomized policy that

– Maximizes joint entropy ( w1 * Entropy(agent1) + w2 * Entropy(agent2) )

– Maintains joint reward above threshold
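A structural sketch of the RDR loop, with the threshold rolled down by d times the allowed loss at each turn (as in the example on the next slide). The randomize_one_agent callback, which performs the per-turn single-agent solve (e.g. via BRLP) with the other agents' policies fixed, is a hypothetical placeholder rather than the paper's code:

```python
def rdr(joint_policy, max_reward, reward_loss_pct, d, randomize_one_agent):
    """Rolling Down Randomization (RDR) -- structural sketch.

    joint_policy    : list of per-agent policies, initialized to the best
                      deterministic joint policy
    reward_loss_pct : fraction of the maximum joint reward we may give up
                      (e.g. 0.2 for a final threshold of 80% of max)
    d               : step size; 1/d turns are taken in total
    """
    n_agents = len(joint_policy)
    n_steps = int(round(1 / d))
    threshold = max_reward
    for step in range(n_steps):
        agent = step % n_agents                          # agents take turns
        threshold -= d * reward_loss_pct * max_reward    # roll the threshold down
        # With all other policies fixed, maximize joint entropy subject to
        # the joint reward staying above the current threshold.
        joint_policy[agent] = randomize_one_agent(agent, joint_policy, threshold)
    return joint_policy
```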

Page 16:

RDR : d = .5

[Diagram: RDR rolls the reward threshold down from the maximum reward to 80% of the maximum reward in two steps. Agent 1's turn: maximize joint entropy with joint reward > 90%. Agent 2's turn: maximize joint entropy with joint reward > 80%.]

Page 17:

Experimental Results: Reward Threshold vs. Weighted Entropy (averaged over 10 instances)

[Plot: weighted entropy vs. reward threshold (%) for time horizons T=2 and T=3, together with the corresponding maximum-entropy bounds (T=2 Max, T=3 Max).]

Page 18:

Summary

Intentional randomization as main focus

Single agent case : BRLP algorithm introduced

Multi agent case : RDR algorithm introduced

A multi-criteria problem is solved that maximizes entropy while maintaining reward > threshold

Page 19:

Thank You

Any comments/questions?

Page 20:

Page 21:

Difference between safety and security?

Security: It is defined as the ability of the system to deal with threats that are intentionally caused by other intelligent agents and/or systems.

Safety : A system's safety is its ability to deal with any other threats to its goals.

Page 22:

Probing Results : Single agent Case

[Plot: number of observations vs. entropy for the three adversary observation models: Observe All, Observe Select, and Observe Noisy.]

Page 23:

Probing Results : Multi agent Case

[Plot: joint number of observations vs. joint entropy for Observe All, Observe Select, and Observe Noisy.]

Page 24:

Define POMDP

Page 25:

Define Distributed POMDP

A Dec-POMDP is a tuple <S, A, P, Ω, O, R>, where

S – Set of states

A – Joint action set <a1,a2,…,an>

P – Transition function

Ω – Set of joint observations

O – Observation function: the probability of a joint observation given the current state and the previous joint action. The agents' observations are independent of each other.

R – Immediate, Joint reward

A DEC-MDP is a DEC-POMDP with the restriction that at each time step the agents' observations together uniquely determine the state.
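One possible container for this tuple, shown purely as an illustrative Python sketch (the concrete types and field names are assumptions):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class DecPOMDP:
    """Illustrative container for the tuple <S, A, P, Omega, O, R>."""
    states: List[str]                                         # S
    joint_actions: List[Tuple[str, ...]]                      # A = <a1, ..., an>
    transition: Callable[[str, Tuple[str, ...], str], float]  # P(s' | s, a)
    joint_observations: List[Tuple[str, ...]]                 # Omega
    observation: Callable[[Tuple[str, ...], str, Tuple[str, ...]], float]  # O(omega | a, s')
    reward: Callable[[str, Tuple[str, ...]], float]           # R(s, a): immediate joint reward
```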

Page 26:

Counterexample : Entropy

Let's say the adversary shoots down the UAV; hence it targets the most probable action. The probability of that action is called the hit rate.

Assume UAV has 3 actions.

Two possible probability distributions: H(1/2, 1/2, 0) = 1 (log base 2); H(1/2 - δ, 1/4 + δ, 1/4) ≈ 3/2

Entropy = 3/2, Hit rate = 1/2-delta

Entropy = 1, Hit rate = 1/2

Higher entropy but lower hit rate
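A quick numeric check of these two distributions (assuming base-2 entropy, as stated above, and taking the hit rate to be the probability of the most likely action):

```python
import math

def entropy(p):
    # Shannon entropy in bits (log base 2), ignoring zero-probability entries.
    return -sum(q * math.log2(q) for q in p if q > 0)

delta = 0.01
p1 = [0.5, 0.5, 0.0]                      # entropy 1.0,  hit rate 0.5
p2 = [0.5 - delta, 0.25 + delta, 0.25]    # entropy ~1.5, hit rate 0.5 - delta

for p in (p1, p2):
    print(f"entropy = {entropy(p):.3f}, hit rate = {max(p):.3f}")
```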

Page 27:

d-parameter & Comments on Results

Effect of d-parameter (avg of 10 instances)

RDR: average runtime in seconds, with entropy in parentheses, for T = 2 and varying d (see the table below).

Conclusions: Greater tolerance of reward loss => higher entropy. Reaching maximum entropy is tougher than in the single-agent case. Lower miscoordination cost implies higher entropy. A d parameter of .5 is good for practical purposes.

Reward Threshold   d = 1        d = .5        d = .25       d = .125
90%                .67 (.59)    1.73 (.74)    3.47 (.75)    7.07 (.75)
50%                .67 (1.53)   1.47 (2.52)   3.4 (2.62)    7.47 (2.66)

Page 28:

Example where uniform policy is not best

Page 29:

Entropies

For uniform policy – 1 + ½ * 1 + 2 * ¼ * 1 + 4 * 1/8 * 1 = 2.5

If initially deterministic policy and then uniform – 0 + 1 * 1 + 2 * ½ * 1 + 4 * ¼ * 1 = 3

Hence, uniform policies need not always be optimal.
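A plain arithmetic re-check of the two sums above (the structure of the example comes from the figure on the previous slide, which is not reproduced here):

```python
# Weighted-entropy sums quoted on the slide.
uniform_policy = 1 + (1/2) * 1 + 2 * (1/4) * 1 + 4 * (1/8) * 1              # = 2.5
deterministic_then_uniform = 0 + 1 * 1 + 2 * (1/2) * 1 + 4 * (1/4) * 1      # = 3.0
print(uniform_policy, deterministic_then_uniform)
```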