
Page 1:

Security in Multiagent Systems by Policy Randomization

Praveen Paruchuri, Milind Tambe, Fernando Ordonez

University of Southern California

Sarit Kraus

Bar-Ilan University, Israel

University of Maryland, College Park

Page 2:

Motivation: The Prediction Game

A UAV (Unmanned Aerial Vehicle) flies between the 4 regions.

Can you predict the UAV's flight pattern?

Pattern 1: 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, ...
Pattern 2: 1, 4, 3, 1, 1, 4, 2, 4, 2, 3, 4, 3, ... (as generated by a 4-sided die)
Can you predict the next region even if 100 numbers of Pattern 2 are given?

Randomization decreases predictability and increases security.

Region 1 Region 2

Region 3 Region 4
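As a minimal illustration (not from the slides), the two patterns could be generated in Python roughly as follows; the deterministic cycle is fully predictable, while the die-roll sequence is not:

```python
import random

# Pattern 1: a fixed round-robin visit order -- fully predictable.
pattern_1 = [(i % 4) + 1 for i in range(12)]

# Pattern 2: each region drawn uniformly at random, like rolling a 4-sided die.
# Observing any number of past draws does not help predict the next one.
pattern_2 = [random.randint(1, 4) for _ in range(12)]

print(pattern_1)  # [1, 2, 3, 4, 1, 2, 3, 4, ...]
print(pattern_2)  # e.g. [1, 4, 3, 1, 1, 4, 2, 4, ...]
```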

Page 3:

Problem Definition

Problem: Increase security by decreasing predictability for an agent or agent team acting in uncertain, adversarial environments. Even if the policy is given to the adversary, it remains secure. Efficient algorithms for the reward/randomness tradeoff.

Assumptions for the agent/agent team: The adversary is unobservable.

– The adversary's actions, capabilities, and payoffs are unknown.

Assumptions for the adversary: Knows the agents' plan/policy. Exploits the action predictability. Can see the agent's state (or belief state).

Page 4:

Solution Technique

Technique developed: Intentional policy randomization within the MDP/POMDP framework

– Sequential decision making

– MDP Markov Decision Process

– POMDP Partially Observable MDP

Increasing security => solve a multi-criteria problem for the agents: maximize action unpredictability (policy randomization) while maintaining reward above a threshold (quality constraints).

Page 5:

Domains

Scheduled activities at airports, such as security checks and refueling, are observable by anyone; randomization of the schedules is helpful.

A UAV or UAV team patrolling a humanitarian mission. The adversary disrupts the mission: it can disrupt food delivery, harm refugees, shoot down UAVs, etc. Randomize the UAV patrol policy.

Page 6:

My Contributions

Two main contributions

Single agent case:

– Formulate as a non-linear program with an entropy-based metric

– Convert to a linear program, BRLP (Binary Search for Randomization LPs)

– Randomize single-agent policies while keeping reward > threshold

Multi Agent Case : RDR (Rolling Down Randomization)

– Randomized policies for decentralized POMDPs

– Threshold on team reward

Page 7:

MDP based single agent case

An MDP is a tuple <S, A, P, R>, where S is the set of states, A the set of actions, P the transition function, and R the reward function.

Basic terms used: x(s,a) is the expected number of times action a is taken in state s. The policy, as a function of the MDP flows, is

$$\hat{\pi}(s,a) = \frac{x(s,a)}{\sum_{a' \in A} x(s,a')}$$
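A small Python sketch of this flows-to-policy conversion; the dict-based representation of x(s,a) is an assumption made for illustration, not the authors' code:

```python
from collections import defaultdict

def policy_from_flows(x):
    """Convert MDP flow variables x(s, a) -- the expected number of times
    action a is taken in state s -- into a stochastic policy pi(s, a)."""
    flow_per_state = defaultdict(float)
    for (s, a), flow in x.items():
        flow_per_state[s] += flow

    pi = {}
    for (s, a), flow in x.items():
        # States with zero total flow are never reached; leave them out.
        if flow_per_state[s] > 0:
            pi[(s, a)] = flow / flow_per_state[s]
    return pi

# Example: in state 's0' the flow splits 3:1 between actions 'a0' and 'a1'.
x = {('s0', 'a0'): 3.0, ('s0', 'a1'): 1.0}
print(policy_from_flows(x))  # {('s0', 'a0'): 0.75, ('s0', 'a1'): 0.25}
```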

Page 8:

Entropy : Measure of randomness

Randomness or information content: entropy (Shannon, 1948). Entropy for an MDP can be defined in two ways.

Additive entropy – add the entropies of each state ($\hat{\pi}$ is a function of x):

$$H_A(x) = -\sum_{s \in S} \sum_{a \in A} \hat{\pi}(s,a)\,\log \hat{\pi}(s,a)$$

Weighted entropy – weigh each state's entropy by its contribution to the total flow:

$$H_W(x) = -\sum_{s \in S} \frac{\sum_{a \in A} x(s,a)}{\sum_{j \in S} \alpha_j} \sum_{a \in A} \hat{\pi}(s,a)\,\log \hat{\pi}(s,a)$$

where $\alpha_j$ is the initial flow of the system into state $j$.
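Both entropies could be computed from the flows along the following lines; this is a hedged sketch in which the dict-based flow representation and the base-2 logarithm are assumptions:

```python
import math
from collections import defaultdict

def mdp_entropies(x, alpha):
    """Additive and weighted entropy of the policy induced by flows x(s, a).

    x     : dict (state, action) -> expected flow
    alpha : dict state -> initial flow alpha_j into that state
    """
    flow_per_state = defaultdict(float)
    for (s, a), f in x.items():
        flow_per_state[s] += f

    total_initial_flow = sum(alpha.values())
    h_additive = 0.0
    h_weighted = 0.0
    for s, sf in flow_per_state.items():
        if sf <= 0:
            continue
        # Entropy of the action distribution pi(s, .) = x(s, .) / sum_a x(s, a).
        h_s = -sum((f / sf) * math.log2(f / sf)
                   for (s2, _), f in x.items() if s2 == s and f > 0)
        h_additive += h_s
        # Weighted entropy: each state's entropy is scaled by its share of flow.
        h_weighted += (sf / total_initial_flow) * h_s
    return h_additive, h_weighted
```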

Page 9:

Tradeoff : Reward vs Entropy

Non-linear program: maximize entropy with reward above a threshold. The objective (entropy) is non-linear.

BRLP (Binary Search for Randomization LPs): a linear program. No entropy calculation; entropy enters only as a function of the flows.

$$\begin{array}{ll} \max & H_W(x) \\ \text{s.t.} & Ax = \alpha \\ & R \cdot x \ge R_{\min} \\ & x \ge 0 \end{array}$$

Page 10:

BRLP

Input policy and target reward (n% of the maximum reward)

Poly-time convergence

Monotonicity: entropy decreases or stays constant as the reward threshold increases. Control is through the parameter β.

Input can be any high entropy policy

One such input is the uniform policy: equal probability for all actions out of every state.

Page 11:

LP for Binary Search

Policy as a function of β and the high-entropy flows x̂(s,a)

Linear Program

$$\begin{array}{ll} \max & R \cdot x \\ \text{s.t.} & Ax = \alpha \\ & x(s,a) \ge \beta\,\hat{x}(s,a) \quad \forall\, s \in S,\ a \in A \\ & x \ge 0 \end{array}$$
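A sketch of the BRLP binary search over β built on this linear program. The use of scipy.optimize.linprog and the vector/matrix encoding of the flow-conservation constraints are assumptions for illustration, not the authors' implementation:

```python
from scipy.optimize import linprog

def brlp(R, A_eq, alpha, x_hat, target_reward, tol=1e-3):
    """Binary search on beta for the BRLP linear program (sketch).

    Solves  max R.x  s.t.  A_eq x = alpha,  x >= beta * x_hat
    for a given beta; beta = 0 recovers the deterministic max-reward policy,
    beta = 1 forces the flows of the high-entropy input x_hat.
    """
    def solve(beta):
        # linprog minimizes, so negate R; lower bounds encode x >= beta * x_hat.
        bounds = [(beta * xh, None) for xh in x_hat]
        res = linprog([-r for r in R], A_eq=A_eq, b_eq=alpha,
                      bounds=bounds, method="highs")
        return res.x, -res.fun

    lo, hi = 0.0, 1.0
    best_x, best_reward = solve(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        x_mid, reward_mid = solve(mid)
        if reward_mid >= target_reward:
            # Still above the target: we can afford more randomization.
            lo, best_x, best_reward = mid, x_mid, reward_mid
        else:
            hi = mid
    return best_x, best_reward
```

The monotonic relationship between reward and β noted on the previous slide is what lets a simple binary search home in on the target reward with a small number of LP solves.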

Page 12:

BRLP in Action

[Diagram: binary search over β. β = 1 forces the maximum-entropy input flows; β = 0 yields the deterministic maximum-reward policy; intermediate values (e.g. β = .5) are probed, via the constraint x(s,a) ≥ β·x̂(s,a), until the LP's reward meets the target reward.]

Page 13:

Results (Averaged over 10 MDPs)

[Plot: average weighted entropy vs. reward threshold (%) for BRLP, Hw(x), Ha(x), and the maximum-entropy bound.]

Highest entropy: the expected-entropy method, with a 10% average gain over BRLP. Fastest: BRLP, with a 7-fold average speedup over the expected-entropy method.

[Plot: execution time (sec) vs. reward threshold (%) for BRLP, Hw(x), and Ha(x).]

Page 14:

Multi Agent Case: Problem

Maximize entropy for agent teams subject to reward threshold

For the agent team: a decentralized POMDP framework is used. Agents know the initial joint belief state. No communication is possible between agents.

For the adversary: knows the agents' policy, exploits the action predictability, and can calculate the agents' belief state.

Page 15:

RDR : Rolling Down Randomization

Input: the best (local or global) deterministic policy, the percent of reward loss allowed, and the d parameter, which sets the number of turns each agent gets.

– Example: d = .5 => number of steps = 1/d = 2
– Each agent gets one turn (for the 2-agent case)
– A single-agent MDP problem is solved at each step

For agent 1's turn: fix the policy of the other agents (agent 2) and find a randomized policy that

– Maximizes joint entropy ( w1 * Entropy(agent1) + w2 * Entropy(agent2) )

– Maintains joint reward above threshold
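A structural sketch of the RDR loop, with the threshold rolled down by d times the allowed loss at each turn (as in the example on the next slide). The randomize_one_agent callback, which performs the per-turn single-agent solve (e.g. via BRLP) with the other agents' policies fixed, is a hypothetical placeholder rather than the paper's code:

```python
def rdr(joint_policy, max_reward, reward_loss_pct, d, randomize_one_agent):
    """Rolling Down Randomization (RDR) -- structural sketch.

    joint_policy    : list of per-agent policies, initialized to the best
                      deterministic joint policy
    reward_loss_pct : fraction of the maximum joint reward we may give up
                      (e.g. 0.2 for a final threshold of 80% of max)
    d               : step size; 1/d turns are taken in total
    """
    n_agents = len(joint_policy)
    n_steps = int(round(1 / d))
    threshold = max_reward
    for step in range(n_steps):
        agent = step % n_agents                          # agents take turns
        threshold -= d * reward_loss_pct * max_reward    # roll the threshold down
        # With all other policies fixed, maximize joint entropy subject to
        # the joint reward staying above the current threshold.
        joint_policy[agent] = randomize_one_agent(agent, joint_policy, threshold)
    return joint_policy
```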

Page 16:

RDR : d = .5

[Diagram: RDR rolls the reward threshold down from the maximum reward to 80% of the maximum reward in two steps. Agent 1's turn: maximize joint entropy with joint reward > 90%. Agent 2's turn: maximize joint entropy with joint reward > 80%.]

Page 17:

Experimental Results: Reward Threshold vs. Weighted Entropy (averaged over 10 instances)

[Plot: weighted entropy vs. reward threshold (%) for time horizons T=2 and T=3, together with the corresponding maximum-entropy bounds (T=2 Max, T=3 Max).]

Page 18:

Summary

Intentional randomization as main focus

Single agent case : BRLP algorithm introduced

Multi agent case : RDR algorithm introduced

A multi-criteria problem is solved that maximizes entropy while maintaining reward > threshold

Page 19:

Thank You

Any comments/questions?

Page 20:

Page 21:

Difference between safety and security?

Security: It is defined as the ability of the system to deal with threats that are intentionally caused by other intelligent agents and/or systems.

Safety : A system's safety is its ability to deal with any other threats to its goals.

Page 22:

Probing Results : Single agent Case

[Plot: number of observations vs. entropy for the three adversary observation models: Observe All, Observe Select, and Observe Noisy.]

Page 23:

Probing Results : Multi agent Case

[Plot: joint number of observations vs. joint entropy for Observe All, Observe Select, and Observe Noisy.]

Page 24:

Define POMDP

Page 25:

Define Distributed POMDP

A Dec-POMDP is a tuple <S, A, P, Ω, O, R>, where

S – Set of states

A – Joint action set <a1,a2,…,an>

P – Transition function

Ω – Set of joint observations

O – Observation function: the probability of a joint observation given the current state and the previous joint action. The agents' observations are independent of each other.

R – Immediate, Joint reward

A DEC-MDP is a DEC-POMDP with the restriction that at each time step the agents' observations together uniquely determine the state.
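One possible container for this tuple, shown purely as an illustrative Python sketch (the concrete types and field names are assumptions):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class DecPOMDP:
    """Illustrative container for the tuple <S, A, P, Omega, O, R>."""
    states: List[str]                                         # S
    joint_actions: List[Tuple[str, ...]]                      # A = <a1, ..., an>
    transition: Callable[[str, Tuple[str, ...], str], float]  # P(s' | s, a)
    joint_observations: List[Tuple[str, ...]]                 # Omega
    observation: Callable[[Tuple[str, ...], str, Tuple[str, ...]], float]  # O(omega | a, s')
    reward: Callable[[str, Tuple[str, ...]], float]           # R(s, a): immediate joint reward
```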

Page 26:

Counterexample : Entropy

Let's say the adversary shoots down the UAV; hence it targets the most probable action. The probability of that action is called the hit rate.

Assume UAV has 3 actions.

Two possible probability distributions: H(1/2, 1/2, 0) = 1 (log base 2); H(1/2 - δ, 1/4 + δ, 1/4) ≈ 3/2

Entropy = 3/2, Hit rate = 1/2-delta

Entropy = 1, Hit rate = 1/2

Higher entropy but lower hit rate
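A quick numeric check of these two distributions (assuming base-2 entropy, as stated above, and taking the hit rate to be the probability of the most likely action):

```python
import math

def entropy(p):
    # Shannon entropy in bits (log base 2), ignoring zero-probability entries.
    return -sum(q * math.log2(q) for q in p if q > 0)

delta = 0.01
p1 = [0.5, 0.5, 0.0]                      # entropy 1.0,  hit rate 0.5
p2 = [0.5 - delta, 0.25 + delta, 0.25]    # entropy ~1.5, hit rate 0.5 - delta

for p in (p1, p2):
    print(f"entropy = {entropy(p):.3f}, hit rate = {max(p):.3f}")
```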

Page 27:

d-parameter & Comments on Results

Effect of d-parameter (avg of 10 instances)

RDR: average runtime in seconds, with entropy in parentheses, for T = 2 and varying d (see the table below).

Conclusions: Greater tolerance of reward loss => higher entropy. Reaching maximum entropy is tougher than in the single-agent case. Lower miscoordination cost implies higher entropy. A d parameter of .5 is good for practical purposes.

Reward Threshold   d = 1        d = .5        d = .25       d = .125
90%                .67 (.59)    1.73 (.74)    3.47 (.75)    7.07 (.75)
50%                .67 (1.53)   1.47 (2.52)   3.4 (2.62)    7.47 (2.66)

Page 28:

Example where uniform policy is not best

Page 29:

Entropies

For uniform policy – 1 + ½ * 1 + 2 * ¼ * 1 + 4 * 1/8 * 1 = 2.5

If initially deterministic policy and then uniform – 0 + 1 * 1 + 2 * ½ * 1 + 4 * ¼ * 1 = 3

Hence, uniform policies need not always be optimal.
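A plain arithmetic re-check of the two sums above (the structure of the example comes from the figure on the previous slide, which is not reproduced here):

```python
# Weighted-entropy sums quoted on the slide.
uniform_policy = 1 + (1/2) * 1 + 2 * (1/4) * 1 + 4 * (1/8) * 1              # = 2.5
deterministic_then_uniform = 0 + 1 * 1 + 2 * (1/2) * 1 + 4 * (1/4) * 1      # = 3.0
print(uniform_policy, deterministic_then_uniform)
```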