TRANSCRIPT
Learning Equilibria with Partial Information in Cognitive Radio Networks
Samir M. Perlaza, Alcatel Lucent Chair in Flexible Radio at Supelec.
Joint work with L. Rose and C. Le Martret from Thales Communications and M. Debbah from the Alcatel Lucent Chair in Flexible Radio at Supelec.
10 ans de Radio Intelligente : bilan et perspectives. GDR-ISIS Seminar. May 2011. Paris, France.
Outline
Motivation
Learning Equilibria in Cognitive Radio Networks
Study Cases
Conclusions
Motivation Learning Equilibria in Cognitive Radio Networks Study Cases Conclusions
Cognitive Radios (CR)
A CR is a radio that can change its transmitter parameters based on interaction with the environment in which it operates.
Federal Communications Commission (FCC)
Game Theory
GT is... a branch of mathematics which studies the interaction between several decision-makers, each aiming to maximize a common or individual benefit.
One decision-maker: Optimization and Control Theory.
Several decision-makers: Game Theory (due to the interactions).
Modeling Cognitive Radio Networks as Games
Consider the game in normal form:

    G = \left(\mathcal{K}, \{\mathcal{A}_k\}_{k \in \mathcal{K}}, \{u_k\}_{k \in \mathcal{K}}\right)    (1)

Players: transmitters or receivers,

    \mathcal{K} = \{1, \ldots, K\}.    (2)

Actions (power allocation, modulation-coding, decoding order, ...):

    \mathcal{A}_k = \left\{A_k^{(1)}, \ldots, A_k^{(N_k)}\right\}.    (3)

Utility function (data rate, BER, energy efficiency, ...):

    u_k : \mathcal{A}_1 \times \ldots \times \mathcal{A}_K \to \mathbb{R}.    (4)
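The normal form (1)-(4) can be sketched as a small data structure. This is an illustrative container only; the class name and the toy payoffs below are ours, not from the talk:

```python
# Illustrative container (not from the talk) for a normal-form game
# G = (K, {A_k}, {u_k}): a set of players, one finite action set per
# player, and one utility function per player over joint profiles.
from itertools import product

class NormalFormGame:
    def __init__(self, action_sets, utilities):
        # action_sets[k]: list of actions A_k^(1), ..., A_k^(N_k)
        # utilities[k]: callable mapping a joint profile (a_1, ..., a_K) to R
        self.action_sets = action_sets
        self.utilities = utilities
        self.K = len(action_sets)

    def profiles(self):
        """Enumerate the joint action space A_1 x ... x A_K."""
        return product(*self.action_sets)

    def payoff(self, k, profile):
        """Utility u_k evaluated at a joint action profile."""
        return self.utilities[k](profile)

# Toy identical-interest example: both players want the profile (1, 1).
g = NormalFormGame(
    action_sets=[[0, 1], [0, 1]],
    utilities=[lambda a: a[0] * a[1], lambda a: a[0] * a[1]],
)
assert len(list(g.profiles())) == 4
assert g.payoff(0, (1, 1)) == 1
```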
Nash Equilibrium and Cognitive Radio Networks
J. F. Nash, "Equilibrium points in n-person games", Proceedings of the National Academy of Sciences of the United States of America, vol. 36, no. 1, pp. 48-49, 1950.

Player k chooses its action a_k \in \mathcal{A}_k following a probability distribution [Borel-1921],

    \pi_k = \left(\pi_{k,A_k^{(1)}}, \ldots, \pi_{k,A_k^{(N_k)}}\right) \in \triangle(\mathcal{A}_k).    (5)

Here, \pi_{k,A_k^{(n_k)}} = \Pr\left(a_k = A_k^{(n_k)}\right).

Definition (Nash Equilibrium [Nash-1950]): A strategy profile \pi^* \in \triangle(\mathcal{A}_1) \times \ldots \times \triangle(\mathcal{A}_K) is an NE if, for all players k \in \mathcal{K} and for all \pi_k \in \triangle(\mathcal{A}_k),

    u_k(\pi_k^*, \pi_{-k}^*) \geq u_k(\pi_k, \pi_{-k}^*).
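The pure-strategy version of this definition can be checked by enumeration. A minimal sketch for a two-player matrix game; the payoff tables are hypothetical examples, not from the talk:

```python
# Hedged sketch: checking the pure-strategy NE condition
# u_k(a_k*, a_-k*) >= u_k(a_k, a_-k*) for every k and every unilateral
# deviation a_k, in a 2-player matrix game with payoff tables U1, U2.
def pure_nash_equilibria(U1, U2):
    """Return all (i, j) where neither player gains by deviating alone."""
    eq = []
    rows, cols = len(U1), len(U1[0])
    for i in range(rows):
        for j in range(cols):
            # Row player cannot improve by changing its row i ...
            best_row = all(U1[i][j] >= U1[r][j] for r in range(rows))
            # ... and column player cannot improve by changing its column j.
            best_col = all(U2[i][j] >= U2[i][c] for c in range(cols))
            if best_row and best_col:
                eq.append((i, j))
    return eq

# Coordination game: two pure NE, on the diagonal.
U1 = [[2, 0], [0, 1]]
U2 = [[2, 0], [0, 1]]
assert pure_nash_equilibria(U1, U2) == [(0, 0), (1, 1)]
```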
Example: A Parallel Multiple Access Channel (1)
S. M. Perlaza, H. Tembine, S. Lasaulce, and V. Quintero-Florez, "On the Fictitious play and channel selection games", in IEEE Latin-American Conference on Communications (LATINCOM), Bogota, Colombia, 2010.

Consider the game G = \left(\mathcal{K}, \{\mathcal{A}_k\}_{k \in \mathcal{K}}, \{u_k\}_{k \in \mathcal{K}}\right).

\mathcal{K} = \{1, 2\} and, for all k \in \mathcal{K}, \mathcal{A}_k = \{(0, p_{max}), (p_{max}, 0)\}.

    u_k = \frac{1}{2} \sum_{n=1}^{2} \log_2\left(1 + \frac{p_{k,n} g_{k,n}}{\sigma^2 + p_{-k,n} g_{-k,n}}\right).

Payoff table:

Tx1 \ Tx2                  A_2^{(1)} = (p_max, 0)                            A_2^{(2)} = (0, p_max)
A_1^{(1)} = (p_max, 0)     ½ log2(σ² + p_max(g11 + g21)) + ½ log2(σ²)        ½ log2(σ² + p_max g11) + ½ log2(σ² + p_max g22)
A_1^{(2)} = (0, p_max)     ½ log2(σ² + p_max g12) + ½ log2(σ² + p_max g21)   ½ log2(σ² + p_max(g12 + g22)) + ½ log2(σ²)
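As a quick sanity check, the utility of this channel-selection game can be evaluated numerically. The channel gains below are made-up values for illustration, not the talk's:

```python
# Sketch of the parallel MAC channel-selection game from the slide,
# u_k = (1/2) sum_n log2(1 + p_{k,n} g_{k,n} / (sigma^2 + p_{-k,n} g_{-k,n})).
# The gains in G are hypothetical example values.
from math import log2

P_MAX, SIGMA2 = 1.0, 0.1
G = {1: (0.9, 0.2), 2: (0.3, 0.8)}      # G[k] = (g_k1, g_k2), made up
ACTIONS = [(P_MAX, 0.0), (0.0, P_MAX)]  # A_k = {(pmax, 0), (0, pmax)}

def utility(k, a_k, a_other):
    """Sum rate of player k over the two sub-channels."""
    other = 2 if k == 1 else 1
    return 0.5 * sum(
        log2(1 + a_k[n] * G[k][n] / (SIGMA2 + a_other[n] * G[other][n]))
        for n in range(2)
    )

# Player 1 on channel 1: it earns more when player 2 moves to channel 2.
u_share = utility(1, ACTIONS[0], ACTIONS[0])  # both transmit on channel 1
u_split = utility(1, ACTIONS[0], ACTIONS[1])  # player 2 uses channel 2
assert u_split > u_share  # the interference-free channel pays more
```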
Example: A Parallel Multiple Access Channel (2)
S. M. Perlaza, H. Tembine, S. Lasaulce, and V. Quintero-Florez, "On the Fictitious play and channel selection games", in IEEE Latin-American Conference on Communications (LATINCOM), Bogota, Colombia, 2010.

[Figure: NE action profiles as a function of the ratios g11/g12 and g21/g22, with region boundaries at ψ(g11), 1/ψ(g12), ψ(g21) and 1/ψ(g22). The four regions correspond to the NE profiles (p1, p2) = ((0, pmax), (pmax, 0)), ((pmax, 0), (0, pmax)), ((0, pmax), (0, pmax)) and ((pmax, 0), (pmax, 0)). The function ψ : R+ → R+ is defined as ψ(x) = 1 + SNR·x. Here, it has been arbitrarily assumed that ψ(g11) < ψ(g21).]
The Big Challenge
How to achieve a (Nash) equilibrium in cognitive radio networks?
A CR possesses only local information.
Topology is dynamically changing.
Signaling between CRs is limited.
Spectrum sensing is not reliable.

Some existing contributions: [Sastry-1994], [Yu-2004], [Pang-2008], [Scutari-2009], [Leshem-2009], [Larsson-2009], [Iellamo-2011], [Rose-2011], [Perlaza-2011a].
Learning Equilibria in Cognitive Radio Networks
Learning Iterative Steps:

1. Choose action a_k(t) ∼ π_k(t).
2. Observe the game outcome, e.g., (i) a_{-k}(t), or (ii) u_k(a_k(t), a_{-k}(t)).
3. Improve π_k(t + 1).

[Diagram: player k maps the history h(t) and its strategy π_k(t) to an action a_k(t), then updates π_k from the observation (i) a_{-k}(t) or (ii) u_k(t).]

Thus, we can expect that, for all k ∈ \mathcal{K},

    \pi_k(t) \to \pi_k^* as t \to \infty,    (6)
    u_k(\pi_k(t), \pi_{-k}(t)) \to u_k(\pi_k^*, \pi_{-k}^*) as t \to \infty,    (7)

where \pi^* = (\pi_1^*, \ldots, \pi_K^*) is an NE strategy profile.
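The three iterative steps above can be sketched as a generic loop. The "improve" rule below is a deliberately toy placeholder standing in for any concrete algorithm, not one of the named techniques:

```python
# Generic sketch of the learning loop on the slide:
# draw a_k(t) ~ pi_k(t), observe an outcome, update pi_k(t+1).
# The update rule 'reinforce' is a toy stand-in, not a specific algorithm.
import random

def learn(actions, improve, rounds, seed=0):
    """actions[a]: payoff of action a; improve: (pi, a, outcome) -> new pi."""
    rng = random.Random(seed)
    pi = [1.0 / len(actions)] * len(actions)  # uniform initial strategy
    for _ in range(rounds):
        a = rng.choices(range(len(actions)), weights=pi)[0]  # a_k(t) ~ pi_k(t)
        outcome = actions[a]                                 # observed payoff
        pi = improve(pi, a, outcome)                         # pi_k(t+1)
    return pi

def reinforce(pi, a, outcome, step=0.05):
    # Toy rule: add probability mass to the played action in proportion
    # to its payoff, then renormalize so pi stays a distribution.
    new = list(pi)
    new[a] += step * outcome
    total = sum(new)
    return [p / total for p in new]

pi = learn(actions=[0.2, 1.0], improve=reinforce, rounds=500)
assert abs(sum(pi) - 1.0) < 1e-9            # pi remains a distribution
assert reinforce([0.5, 0.5], 1, 1.0)[1] > 0.5  # played action gains mass
```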
Learning Techniques in Cognitive Radio Networks
Best Response Dynamics (BRD):[Scutari-2008a] [Pang-2008] [Larsson-2009] [Perlaza-2009]
Fictitious Play (FP):[Perlaza-2010]
Reinforcement Learning (RL):[Sastry-1994]
Joint Utility and STrategy lEarning (JUSTE): [Perlaza-2010c]

Other techniques not treated here: Regret Matching Learning [Altman-2008], Q-Learning [Bennis-2010], Bandits [Liu-2010], Imitation Learning [Iellamo-2011], etc.
Best Response Dynamics (1)
Definition (Best-Response Correspondence)
In the game G, the correspondence BR_k : \mathcal{A}_{-k} \to \mathcal{A}_k, such that

    BR_k(a_{-k}) = \arg\max_{a_k' \in \mathcal{A}_k} u_k(a_k', a_{-k}),    (8)

is the best-response correspondence of player k, given the actions a_{-k}.

Sequential BRD: one player at a time updates its action.
Simultaneous BRD: all players update their actions.
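A sketch of sequential BRD in a two-player matrix game: at each step a player switches to a best response (8), and the dynamics stop when no unilateral deviation is profitable. The coordination payoffs are a hypothetical example:

```python
# Sketch of sequential best-response dynamics: each player in turn
# switches to arg max u_k(a_k, a_-k); stop when nobody wants to move.
# The payoff tables used below are a hypothetical coordination game.
def sequential_brd(U, profile, max_steps=100):
    """U[k][i][j]: payoff of player k when the joint action is (i, j)."""
    profile = list(profile)
    for _ in range(max_steps):
        moved = False
        for k in range(2):
            other = profile[1 - k]
            candidates = range(len(U[k]) if k == 0 else len(U[k][0]))
            def u(a):  # utility of player k for its own action a
                return U[k][a][other] if k == 0 else U[k][other][a]
            best = max(candidates, key=u)
            if u(best) > u(profile[k]):  # strictly profitable deviation
                profile[k] = best
                moved = True
        if not moved:  # no profitable unilateral deviation: a pure NE
            return tuple(profile)
    return tuple(profile)

U = [[[2, 0], [0, 1]], [[2, 0], [0, 1]]]  # coordination game
assert sequential_brd(U, (1, 0)) == (0, 0)  # converges to the (0, 0) NE
```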
Best Response Dynamics (2)
                        BRD
Convergence             NE
Observations            a_{-k}(t)
Closed expr. for u_k    Yes
Calculation cap.        Optimization
Noise tolerance         No
Environment             Static

BRD converges to NE in potential games [Monderer-Shapley-1996b].
Fictitious Play (1)
G. W. Brown, "Iterative solution of games by Fictitious play", Activity Analysis of Production and Allocation, vol. 13, no. 1, pp. 374-376, 1951.

At game stage t, player k observes a_{-k}(t - 1).

Each player calculates the empirical frequency of all actions A_k^{(n_k)}, i.e.,

    f_{k,A_k^{(n_k)}}(t) = \frac{1}{t-1} \sum_{s=0}^{t-1} 1_{\left\{a_k(s) = A_k^{(n_k)}\right\}},    (9)

Player k chooses its action a_k(t) to maximize its expected utility:

    a_k(t) \in \arg\max_{a_k' \in \mathcal{A}_k} \sum_{a_{-k} \in \mathcal{A}_{-k}} u_k(a_k', a_{-k}) \prod_{j \in \mathcal{K} \setminus \{k\}} f_{j,a_j}(t).    (10)
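Equations (9)-(10) can be sketched for a two-player matrix game, where each player best-responds to the empirical counts of the opponent's past actions (counts and frequencies give the same arg max). The payoff matrices form a hypothetical coordination game:

```python
# Sketch of fictitious play in a 2-player matrix game: each player
# best-responds to the empirical frequency of the opponent's history.
# Payoff matrices U1 (row player) and U2 (column player) are hypothetical.
def fictitious_play(U1, U2, rounds, start=(0, 0)):
    n1, n2 = len(U1), len(U1[0])
    c1 = [0] * n1  # how often player 1 has played each row (seen by 2)
    c2 = [0] * n2  # how often player 2 has played each column (seen by 1)
    a1, a2 = start
    for _ in range(rounds):
        c1[a1] += 1
        c2[a2] += 1
        # Best response to the opponent's empirical play (eq. 10);
        # using raw counts instead of frequencies leaves arg max unchanged.
        a1 = max(range(n1), key=lambda i: sum(U1[i][j] * c2[j] for j in range(n2)))
        a2 = max(range(n2), key=lambda j: sum(U2[i][j] * c1[i] for i in range(n1)))
    return a1, a2

U1 = [[2, 0], [0, 1]]
U2 = [[2, 0], [0, 1]]
assert fictitious_play(U1, U2, 50, start=(1, 1)) == (1, 1)  # stays at an NE
```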
Fictitious Play (2)
                        BRD         FP
Convergence             NE          NE
Observations            a_{-k}(t)   a_{-k}(t)
Closed expr. for u_k    Yes         Yes
Calculation cap.        Optim.      Optim.
Noise tolerance         No          No
Environment             Static      Stationary

FP converges to NE in:
Zero-sum games [Robinson-1951],
Potential games [Monderer-Shapley-1996a],
Generic 2×2 games [Miyazawa-1961].

Does FP really converge in potential games? [Perlaza-2010b]
Cumulative Payoff Matching (1)
Player k observes u_k(t) = u_k(a_k(t), a_{-k}(t)) at each stage.

Player k calculates its cumulative utility \theta_{k,A_k^{(n_k)}} with each action A_k^{(n_k)}:

    \theta_{k,A_k^{(n_k)}}(t) = \sum_{s=0}^{t-1} u_k(s) 1_{\left\{a_k(s) = A_k^{(n_k)}\right\}}.    (11)

The higher \theta_{k,A_k^{(n_k)}}(t), the higher the probability \pi_{k,A_k^{(n_k)}}(t), i.e.,

    \pi_{k,A_k^{(n_k)}}(t) = \frac{\theta_{k,A_k^{(n_k)}}(t)}{\sum_{a_k \in \mathcal{A}_k} \theta_{k,a_k}(t)}.    (12)
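Equations (11)-(12) can be sketched directly. The payoff values and the unit prior on θ (which keeps (12) well defined before any action has been played) are illustrative choices of ours, not part of the slide's rule:

```python
# Sketch of cumulative payoff matching: theta accumulates the payoff
# earned with each action (eq. 11) and the mixed strategy is theta
# normalized to a distribution (eq. 12). Payoffs here are hypothetical.
import random

def cumulative_payoff_matching(payoff, n_actions, rounds, seed=0):
    rng = random.Random(seed)
    # Unit prior mass so eq. (12) is defined at t = 0 (our choice).
    theta = [1.0] * n_actions
    for _ in range(rounds):
        total = sum(theta)
        pi = [th / total for th in theta]                # eq. (12)
        a = rng.choices(range(n_actions), weights=pi)[0]  # play a ~ pi
        theta[a] += payoff(a)                             # eq. (11)
    total = sum(theta)
    return [th / total for th in theta]

# Two actions with (hypothetical) payoffs 0.1 and 1.0.
pi = cumulative_payoff_matching(lambda a: [0.1, 1.0][a], 2, 2000)
assert abs(sum(pi) - 1.0) < 1e-9        # valid probability distribution
assert all(0.0 <= p <= 1.0 for p in pi)
```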
Cumulative Payoff Matching (2)
                        BRD         FP          RL
Convergence             NE          NE          −−
Observations            a_{-k}(t)   a_{-k}(t)   u_k(t)
Closed expr. for u_k    Yes         Yes         No
Calculation cap.        Optim.      Optim.      Algebraic oper.
Noise tolerance         No          No          No
Environment             Static      Stationary  Stationary

In some cases, RL can converge to NE [Sastry-1994].
Joint Utility and STrategy lEarning - JUSTE
S. M. Perlaza, S. Lasaulce, H. Tembine, and M. Debbah, "Radio resource sharing in decentralized wireless networks: A logit equilibrium approach", submitted to the IEEE Journal of Selected Topics in Signal Processing, June 2011.

Player k observes u_k(t) = u_k(a_k(t), a_{-k}(t)) at each stage.

Player k calculates its expected utility \hat{u}_{k,A_k^{(n_k)}} with each of its actions A_k^{(n_k)}:

    \hat{u}_{k,A_k^{(n_k)}}(t) = \frac{1}{T_{k,A_k^{(n_k)}}(t)} \sum_{s=0}^{t-1} u_k(s) 1_{\left\{a_k(s) = A_k^{(n_k)}\right\}},    (13)

where T_{k,A_k^{(n_k)}}(t) is the number of times action A_k^{(n_k)} has been played up to stage t.

The higher \hat{u}_{k,A_k^{(n_k)}}(t), the higher the probability \pi_{k,A_k^{(n_k)}}(t), i.e.,

    \pi_{k,A_k^{(n_k)}}(t) = \pi_{k,A_k^{(n_k)}}(t-1) + \lambda_k(t) \left( \beta_{k,A_k^{(n_k)}}(u_k(t)) - \pi_{k,A_k^{(n_k)}}(t-1) \right),

where \beta_k^{(\gamma_k)}(u_k(\cdot, \pi_{-k})) is a logit function.
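One JUSTE-style update step can be sketched as a logit (softmax) of the per-action average payoffs followed by a convex step of size λ toward it. The values of γ, λ and the average payoffs below are illustrative, not the paper's:

```python
# Hedged sketch of a JUSTE-style strategy update: map the running
# average payoffs through a logit function, then move the strategy a
# step lambda toward that logit distribution. Parameters are illustrative.
from math import exp

def logit(u_hat, gamma):
    """beta_a = exp(gamma * u_a) / sum_b exp(gamma * u_b)."""
    z = [exp(gamma * u) for u in u_hat]
    total = sum(z)
    return [x / total for x in z]

def juste_step(pi, u_hat, gamma, lam):
    # pi(t) = pi(t-1) + lambda * (beta(u_hat) - pi(t-1))
    beta = logit(u_hat, gamma)
    return [p + lam * (b - p) for p, b in zip(pi, beta)]

pi = [0.5, 0.5]
u_hat = [0.2, 1.0]  # hypothetical average payoffs per action
pi = juste_step(pi, u_hat, gamma=5.0, lam=0.5)
assert abs(sum(pi) - 1.0) < 1e-9  # convex combination of distributions
assert pi[1] > pi[0]              # mass moves toward the better action
```

A large γ makes the logit close to a hard best response; a small γ keeps the strategy nearly uniform, which is the temperature trade-off illustrated on the next slide.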
Joint Utility and STrategy lEarning - JUSTE (1)
[Figure: (left) Mean utility achieved with each of the actions, versus the action index n. (right) Corresponding probability distribution given by the logit function, shown for smooth best responses with κ = 0.001, κ = 0.1 and κ = 1.]
Joint Utility and STrategy lEarning - JUSTE (2)
                        BRD         FP          RL              JUSTE
Convergence             NE          NE          −−              ε-NE
Observations            a_{-k}(t)   a_{-k}(t)   u_k(t)          u_k(t)
Closed expr. for u_k    Yes         Yes         No              No
Calculation cap.        Optim.      Optim.      Algebraic oper. Algebraic oper.
Noise tolerance         No          No          No              Yes
Environment             Static      Stationary  Stationary      Stationary

JUSTE converges to an ε-NE in potential games [Perlaza-2011a] and in two-player zero-sum games [Leslie-2003].
Example: A Parallel Interference Channel (1)
L. Rose, S. M. Perlaza, and M. Debbah, "On the Nash equilibria in decentralized parallel interference channels", ICC 2011, Kyoto, Japan, Jun. 2011.

Consider the game G = \left(\mathcal{K}, \{\mathcal{A}_k\}_{k \in \mathcal{K}}, \{u_k\}_{k \in \mathcal{K}}\right).

\mathcal{K} = \{1, 2\} and, for all k \in \mathcal{K}, \mathcal{A}_k = \{1, 0\}.

    u_k(\alpha_k, \alpha_{-k}) = \log_2\left(1 + \frac{\alpha_k g_{k,k}^{(1)}}{1 + \alpha_{-k} g_{k,-k}^{(1)}}\right) + \log_2\left(1 + \frac{(1-\alpha_k) g_{k,k}^{(2)}}{1 + (1-\alpha_{-k}) g_{k,-k}^{(2)}}\right).

Tx1 \ Tx2    α2 = 1                  α2 = 0
α1 = 1       (u1(1,1), u2(1,1))      (u1(1,0), u2(0,1))
α1 = 0       (u1(0,1), u2(1,0))      (u1(0,0), u2(0,0))

Single-antenna radios with limited power p_max.
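The 2×2 payoff table can be evaluated numerically from the utility above. The channel gains g_{k,j}^{(c)} below are made-up values for illustration:

```python
# Sketch of the 2x2 parallel interference channel game on the slide:
# alpha_k = 1 puts player k's power on channel 1, alpha_k = 0 on channel 2.
# The channel gains in g are hypothetical example values.
from math import log2

# g[(k, j, c)]: gain from transmitter j to receiver k on channel c
g = {(1, 1, 1): 2.0, (1, 1, 2): 1.0, (1, 2, 1): 0.5, (1, 2, 2): 0.3,
     (2, 2, 1): 1.5, (2, 2, 2): 2.5, (2, 1, 1): 0.4, (2, 1, 2): 0.6}

def u(k, a_k, a_other):
    """u_k(alpha_k, alpha_-k): sum rate over the two channels."""
    j = 2 if k == 1 else 1
    return (log2(1 + a_k * g[(k, k, 1)] / (1 + a_other * g[(k, j, 1)]))
            + log2(1 + (1 - a_k) * g[(k, k, 2)] / (1 + (1 - a_other) * g[(k, j, 2)])))

# Build the 2x2 payoff table of the slide.
table = {(a1, a2): (u(1, a1, a2), u(2, a2, a1)) for a1 in (1, 0) for a2 in (1, 0)}
assert len(table) == 4
# With no interference on a channel, the rate term reduces to log2(1 + g).
assert abs(u(1, 1, 0) - log2(1 + 2.0)) < 1e-9
```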
Example: A Parallel Interference Channel (2)
L. Rose, S. M. Perlaza, and M. Debbah, "On the Nash equilibria in decentralized parallel interference channels", ICC 2011, Kyoto, Japan, Jun. 2011.

[Figure: Probability of observing either one or two NE in the game G = \left(\mathcal{K}, \{\mathcal{A}_k\}_{k \in \mathcal{K}}, \{u_k\}_{k \in \mathcal{K}}\right), as a function of the signal to noise ratio [dB].]

Two cases can be observed:

Unique NE: (\alpha_1^*, \alpha_2^*) is one of the profiles (\alpha_1, \alpha_2) with \alpha_1, \alpha_2 \in \{0, 1\}.
2 NE: (\alpha_1^*, \alpha_2^*) \in \{(0, 1), (1, 0)\}.
Example: A Parallel Interference Channel (4)
L. Rose, S. M. Perlaza, S. Lasaulce, and M. Debbah, "Learning Equilibria with Partial Information in Wireless Networks", IEEE Communications Magazine, Special Issue in Game Theory for Wireless Communications, August 2011.

Figure: Average number of iterations required to observe convergence to a NE in the game G = \left(\mathcal{K}, \{\mathcal{A}_k\}_{k \in \mathcal{K}}, \{u_k\}_{k \in \mathcal{K}}\right).
Example: A Parallel Interference Channel (5)
L. Rose, S. M. Perlaza, S. Lasaulce, and M. Debbah, "Learning Equilibria with Partial Information in Wireless Networks", IEEE Communications Magazine, Special Issue in Game Theory for Wireless Communications, August 2011.

Figure: Average sum spectral efficiency [bps/Hz] as a function of the number of iterations when the signal to noise ratio (SNR) is fixed to 10 dB.
Example: A Parallel Interference Channel (6)
L. Rose, S. M. Perlaza, S. Lasaulce, and M. Debbah, "Learning Equilibria with Partial Information in Wireless Networks", IEEE Communications Magazine, Special Issue in Game Theory for Wireless Communications, August 2011.

Figure: Average sum spectral efficiency [bps/Hz] as a function of the signal to noise ratio (SNR) when the maximum number of iterations has been fixed to 40.
Example: A Parallel Interference Channel (7)
L. Rose, S. M. Perlaza, S. Lasaulce, and M. Debbah, "Learning Equilibria with Partial Information in Wireless Networks", IEEE Communications Magazine, Special Issue in Game Theory for Wireless Communications, August 2011.

[Figure: Achieved sum spectral efficiency [bps/Hz] in the game G with JUSTE, as a function of the number of channels S (from 2 to 20), at the logit equilibrium with 1/\gamma_k = 0.2, 0.4 and 1 for all k \in \mathcal{K}. Here \alpha_1(n) = \alpha_2(n) = 1/n^{3/4}, \lambda_2(n) = 1/n^{2/3}, \lambda_1(n) = 1/n, and SNR = p_{k,max}/\sigma^2 = 10 dB.]
Conclusions
Properties of NE (existence, multiplicity) in CRNs are topology-dependent. A general algorithm for achieving NE in CRNs is still unknown.

It is possible to converge to NE using only local information (always?).

More information does not imply better convergence properties.

More choices do not imply better performance.
References

Borel-1921 Émile Borel, "La théorie du jeu et les équations à noyau symétrique," Comptes Rendus de l'Académie des Sciences, vol. 173, pp. 1304-1308, Sept. 1921.

Borkar-1997 V. Borkar, "Stochastic approximation with two timescales," Systems & Control Letters, vol. 29, pp. 291-294, 1997.

Larsson-2009 E. Larsson, E. Jorswieck, J. Lindblom, and R. Mochaourab, "Game theory and the flat fading Gaussian interference channel," IEEE Transactions on Signal Processing, October 2009.

Leshem-2009 A. Leshem and E. Zehavi, "Game theory and the frequency selective interference channel," IEEE Transactions on Signal Processing, October 2009.

Leslie-2003 D. S. Leslie and E. J. Collins, "Convergent multiple-timescales reinforcement learning algorithms in normal form games," Annals of Applied Probability, vol. 13, no. 4, pp. 1231-1251, 2003.

Liu-2010 K. Liu and Q. Zhao, "Distributed learning in multi-armed bandit with multiple players," IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667-5681, November 2010.

Miyazawa-1961 K. Miyazawa, "On the convergence of the learning process in a 2×2 non-zero-sum two-person game," Econometric Research Program, Princeton, 1961.

Monderer-Shapley-1996a D. Monderer and L. S. Shapley, "Fictitious play property for games with identical interests," Journal of Economic Theory, vol. 68, pp. 258-265, 1996.

Monderer-Shapley-1996b D. Monderer and L. S. Shapley, "Potential games," Games and Economic Behavior, vol. 14, pp. 124-143, 1996.

Nash-1950 J. Nash, "Equilibrium points in n-person games," Proceedings of the National Academy of Sciences, vol. 36, no. 1, pp. 48-49, 1950.

Pang-2008 J.-S. Pang, G. Scutari, F. Facchinei, and C. Wang, "Distributed power allocation with rate constraints in Gaussian parallel interference channels," IEEE Transactions on Information Theory, vol. 54, no. 8, pp. 3471-3489, Aug. 2008.
References

Perlaza-2011a S. M. Perlaza, S. Lasaulce, H. Tembine, and M. Debbah, "Radio resource sharing in decentralized wireless networks: A logit equilibrium approach," submitted to the IEEE Journal of Selected Topics in Signal Processing, Special Issue on Heterogeneous Networks for Future Broadband Wireless Systems, June 2011.

Perlaza-2011b S. M. Perlaza, H. Tembine, S. Lasaulce, and M. Debbah, "A general framework for quality-of-service provisioning in decentralized networks," submitted to the IEEE Journal of Selected Topics in Signal Processing, Special Issue on Game Theory for Signal Processing, 2011.

Perlaza-2010c S. M. Perlaza, H. Tembine, and S. Lasaulce, "How can ignorant but patient cognitive terminals learn their strategy and utility?," SPAWC 2010, Marrakech, Morocco, June 2010.

Perlaza-2010d S. M. Perlaza, H. Tembine, S. Lasaulce, and M. Debbah, "Learning to use the spectrum in self-configuring heterogeneous networks: A logit equilibrium approach," 4th International ICST Workshop on Game Theory in Communication Networks, 2011.

Perlaza-2010e S. M. Perlaza, H. Tembine, S. Lasaulce, and V. Quintero-Florez, "On the fictitious play and channel selection games," IEEE Latin-American Conference on Communications (LATINCOM), Bogota, Colombia, Sept. 2010, pp. 1-5.

Robinson-1951 J. Robinson, "An iterative method of solving a game," The Annals of Mathematics, Second Series, vol. 54, no. 2, pp. 296-301, Sep. 1951.

Rose-2011 L. Rose, S. M. Perlaza, S. Lasaulce, and M. Debbah, "Learning equilibria with partial information in wireless networks," IEEE Communications Magazine, Special Issue on Game Theory for Wireless Communications, Aug. 2011.

Rose-2010 L. Rose, S. M. Perlaza, and M. Debbah, "On the Nash equilibria in decentralized parallel interference channels," IEEE ICC 2011 Workshop on Game Theory and Resource Allocation for 4G, Kyoto, Japan, June 2011.
References

Sastry-1994 P. Sastry, V. Phansalkar, and M. Thathachar, "Decentralized learning of Nash equilibria in multi-person stochastic games with incomplete information," IEEE Transactions on Systems, Man and Cybernetics, vol. 24, no. 5, pp. 769-777, May 1994.

Scutari-2009 G. Scutari, D. P. Palomar, and S. Barbarossa, "The MIMO iterative waterfilling algorithm," IEEE Transactions on Signal Processing, vol. 57, no. 5, pp. 1917-1935, May 2009.

Scutari-2008a G. Scutari, D. P. Palomar, and S. Barbarossa, "Asynchronous iterative water-filling for Gaussian frequency-selective interference channels," IEEE Transactions on Information Theory, vol. 54, no. 7, pp. 2868-2878, July 2008.

Scutari-2008b G. Scutari, D. P. Palomar, and S. Barbarossa, "Optimal linear precoding strategies for wideband noncooperative systems based on game theory - Part I: Nash equilibria," IEEE Transactions on Signal Processing, vol. 56, no. 3, pp. 1230-1249, March 2008.

Scutari-2008c G. Scutari, D. P. Palomar, and S. Barbarossa, "Optimal linear precoding strategies for wideband noncooperative systems based on game theory - Part II: Algorithms," IEEE Transactions on Signal Processing, vol. 56, no. 3, pp. 1250-1267, March 2008.

Yu-2004 W. Yu, W. Rhee, S. Boyd, and J. Cioffi, "Iterative water-filling for Gaussian vector multiple-access channels," IEEE Transactions on Information Theory, vol. 50, no. 1, pp. 145-152, Jan. 2004.