
Game Theoretic Issues in Cognitive Radio Systems (Invited Paper)

Jane Wei Huang and Vikram Krishnamurthy

Statistical Signal Processing Research Lab, University of British Columbia, Vancouver, Canada

Email: {janeh, vikramk}@ece.ubc.ca

Abstract— The ability to model independent decision makers whose actions potentially affect all other decision makers makes game theory attractive for analyzing the performance of wireless communication systems. Recently, there has been growing interest in applying game theoretic methods to cognitive radio networks for power control, rate adaptation and channel access schemes. This work presents several results in game theory and their applications in cognitive radio systems. First, we compute the Nash equilibrium power allocation and rate adaptation policies in cognitive radio systems using static game and dynamic Markovian game frameworks. We then describe how mechanism design helps to design a truth revealing channel access scheme. Finally, we introduce the correlated equilibrium concept in stochastic games and its application to the transmission control problem in a cognitive radio system.

Index Terms— Cognitive Radio, Game Theory, General-Sum Markovian Dynamic Game, Switching Control Game, Nash Equilibrium, Mechanism Design, Correlated Equilibrium

I. INTRODUCTION

Game theory was first introduced by J. von Neumann and O. Morgenstern in [1] in 1944. It is a discipline that models situations in which decision makers take actions that have mutual, possibly conflicting consequences. Game theory has been widely used as an analysis tool in economic systems [2], [3]. Recently, with the introduction of ad hoc networks and cognitive radio systems, game theory has also been considered an adequate tool for designing self-organized wireless networks [4], [5]. Specifically, game theoretic models have been developed to better understand congestion control, routing, power control, trust management and other issues in wired and wireless communication systems. The goal of this paper is to present recent results in game theory applied to the design and analysis of cognitive radio networks.

A. Why is Game Theory Relevant to Cognitive Radio Systems?

With increasing demand for higher data rates and new applications, spectrum crowding and congestion continue to grow. A report released by the Federal Communications Commission (FCC) in 2002 [6] suggests that while some spectrum bands are over-utilized and crowded, many other licensed spectrum bands are under-utilized. This fact motivates the development of new technologies and standards in wireless communication systems that seek to use these under-utilized licensed bands. Cognitive radio [7]–[9] is one possible method to achieve more efficient utilization of the available spectrum resources. While the traditional approach to ensuring the co-existence of multiple systems is to split the available spectrum into frequency bands and allocate them to different licensed (primary) users, dynamic spectrum access in cognitive radio systems improves spectrum utilization by detecting unoccupied spectrum holes and assigning them to unlicensed (secondary) users.

Manuscript received June 29, 2009; revised August 3, 2009; accepted September 2, 2009.

The term cognitive radio was coined in 1999 by J. Mitola III in [10]. Dynamic spectrum access is an important aspect of cognitive radio. It can be achieved in various ways, including underlaying, overlaying, or interweaving the signals of secondary users with those of the primary users, while keeping the interference as low as possible.

In cognitive radio networks with reduced functionality base stations (no central authority) and autonomous cognitive radios, game theory can be naturally applied to achieve decentralized operation and self-configuration. In a game theoretic setting, cognitive radios can be viewed as selfish rational players, each seeking to optimize its own utility. The interest of an individual cognitive radio may conflict with that of the network, in which case game theory applies directly, as it traditionally analyzes situations where player objectives are in conflict.

B. Cognitive Radio Power Control with Static Game Theoretic Approach

In a cognitive radio network, proper power control is important to ensure efficient operation of both primary and secondary users. Even in the absence of primary users, power control remains an issue among secondary users, since the signal of one user may interfere with the transmissions of others. The question of interest is therefore how to develop an efficient power allocation scheme that jointly optimizes the performance of multiple cognitive radios in the presence of mutual interference.

JOURNAL OF COMMUNICATIONS, VOL. 4, NO. 10, NOVEMBER 2009

© 2009 ACADEMY PUBLISHER, doi:10.4304/jcm.4.10.790-802

The static game framework has been used to compute the Nash equilibrium power allocation policies in cognitive radio networks [8], [11], [12]. Unlike a dynamic game, where players make a sequence of decisions, a static game is one in which all players make their decisions once and simultaneously, without knowledge of the strategies of the other players. The Nash equilibrium of a game is a set of strategies, one for each user, such that no user has an incentive to unilaterally change its action. In a Nash equilibrium, any change in strategy by a user would leave that user with a lower payoff than if it kept its current strategy. Section II gives an example of a cognitive radio system where each user aims to maximize its information rate subject to a transmission power constraint. A distributed asynchronous iterative water-filling algorithm is used to compute the Nash equilibrium power allocation policy of such a system using the static game theoretic approach [12].
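The Nash equilibrium condition described above can be checked mechanically for a small game. The following is a minimal sketch, assuming a hypothetical two-radio game in which each radio picks low (0) or high (1) transmit power; the payoff tables and the function name `pure_nash_equilibria` are illustrative and do not come from the cited works. The check uses weak inequalities, so a profile counts as an equilibrium when no unilateral deviation strictly improves the deviating player's payoff.

```python
from itertools import product

def pure_nash_equilibria(payoff_a, payoff_b):
    """Return all pure-strategy Nash equilibria (row, col) of a two-player game."""
    n_rows, n_cols = len(payoff_a), len(payoff_a[0])
    equilibria = []
    for r, c in product(range(n_rows), range(n_cols)):
        # Row player cannot gain by switching rows while column c stays fixed.
        row_ok = all(payoff_a[r][c] >= payoff_a[r2][c] for r2 in range(n_rows))
        # Column player cannot gain by switching columns while row r stays fixed.
        col_ok = all(payoff_b[r][c] >= payoff_b[r][c2] for c2 in range(n_cols))
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria

# Hypothetical payoffs (illustrative only): action 0 = low power, 1 = high power.
PAYOFF_A = [[3, 1], [4, 2]]   # payoff of radio A, indexed [action_A][action_B]
PAYOFF_B = [[3, 4], [1, 2]]   # payoff of radio B, indexed [action_A][action_B]
NASH = pure_nash_equilibria(PAYOFF_A, PAYOFF_B)
```

In this toy instance the only equilibrium is the profile where both radios use high power, even though both would be better off at low power, which illustrates the conflict between individual and network interests.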

C. Switching Control Game: A Special Type of Stochastic Dynamic Game

Most games considered in wireless communication systems to date are static games. Stochastic dynamic game theory is an essential tool for cognitive radio systems, as it can exploit channel correlation in the analysis of the decentralized behavior of cognitive radios.

The concept of a stochastic game, first introduced by Lloyd Shapley in the early 1950s, is a dynamic game played by one or more players. The elements of a stochastic game include the system state set, action sets, transition probabilities and utility functions. It extends the single player Markov decision process (MDP) to multiple players whose actions all impact the resulting payoffs and the next state. A switching control game [13]–[15] is a special type of stochastic dynamic game where the transition probability in any given state depends on only one player. It is known that the Nash equilibrium for such a game can be computed by solving a sequence of Markov decision processes. Section III shows an example where the rate adaptation problem in a Time Division Multiple Access (TDMA) cognitive radio system is formulated as a switching control Markovian game and a value iterative optimization algorithm is proposed to compute the Nash equilibrium for such a game [16].

D. Mechanism Design in Cognitive Radio Systems

An efficient spectrum assignment technology is essential to a cognitive radio system, as it allows secondary users to opportunistically utilize unoccupied spectrum holes based on agreements and constraints. These secondary users have to coordinate with each other in order to maintain order and achieve maximum efficiency. This motivates the development of spectrum access approaches in cognitive radio systems. Opportunistic scheduling in cognitive networks, assuming the scheduler is fully aware of primary user transmissions, is considered in [17] and [18], while [19]–[21] consider the scenario where only partial primary user activity information is available.

However, all the existing opportunistic scheduling approaches overlook the fact that the secondary users may be owned by different agents and may act competitively rather than cooperatively. These selfish users can become so sophisticated that they lie about their states to optimize their own utilities at the cost of reducing the overall system performance. Mechanism design theory is required to prevent this from happening. Mechanism design is the study of designing rules for strategic, autonomous and rational players to achieve a predictable global outcome [22] using a game theoretic approach. A milestone in mechanism design is the Vickrey-Clarke-Groves (VCG) mechanism, which is a generalization of Vickrey's second price auction [23] proposed by Clarke [24] and Groves [25]. The particular pricing policy of the VCG mechanism makes reporting true values the dominant strategy for all the players. Section IV gives an example where we model each user in a cognitive radio system as a selfish player aiming to optimize its own utility, and we seek a mechanism that ensures efficient resource allocation within the network [26].
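The truth revealing property of Vickrey's second price auction, which the VCG mechanism generalizes, can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: a single item, a hypothetical bidder 0 with a private value, and a fixed set of competing bids. The function names and numbers are illustrative and are not the mechanism of [26].

```python
def second_price_auction(bids):
    """Highest bid wins; the winner pays the second-highest bid."""
    order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    winner = order[0]
    price = bids[order[1]] if len(bids) > 1 else 0.0
    return winner, price

def utility(true_value, bidder, bids):
    """Utility of `bidder`: value minus price if it wins, zero otherwise."""
    winner, price = second_price_auction(bids)
    return true_value - price if winner == bidder else 0.0

# Hypothetical scenario: bidder 0 values the slot at 10, competitors bid 7 and 4.
TRUE_VALUE = 10.0
OTHERS = [7.0, 4.0]
TRUTHFUL = utility(TRUE_VALUE, 0, [TRUE_VALUE] + OTHERS)
BEST_MISREPORT = max(
    utility(TRUE_VALUE, 0, [bid] + OTHERS) for bid in [0.0, 5.0, 8.0, 12.0, 20.0]
)
```

In this instance no misreport does better than bidding the true value: underbidding can only lose the item, and overbidding never lowers the price paid, since the price is set by the competing bids.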

E. Correlated Equilibrium of a Dynamic Markovian Game

The fundamental solution concept for dynamic Markovian games is the Nash equilibrium. However, it suffers from limitations such as non-uniqueness, loss of efficiency and lack of guaranteed existence. In game theory, a correlated equilibrium is a solution concept that is more general than the Nash equilibrium [27], [28]. A correlated equilibrium is defined as follows. Each player in a game chooses its action according to its observation of the value of a signal. A strategy assigns an action to every possible observation a player can make. If no player would deviate from the recommended strategy, the distribution is called a correlated equilibrium. Compared to Nash equilibria, correlated equilibria offer a number of conceptual and computational advantages, including the facts that new and sometimes more "fair" payoffs can be achieved, that correlated equilibria can be computed efficiently for games in standard normal form, and that correlated equilibria are the convergence notion for several natural learning algorithms. Furthermore, it has been argued that correlated equilibria are the natural equilibrium concept consistent with the Bayesian perspective [28]. Section V gives one such example, formulating the user scheduling problem in a cognitive radio network as a stochastic dynamic game with the goal of obtaining the correlated equilibrium policy [29].
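The "no player would deviate from the recommendation" condition can be verified numerically for a small static game. The sketch below assumes a hypothetical two-player access game of the "chicken" type (action 0 = transmit aggressively, action 1 = back off); the payoffs, the signal distribution and the function name are illustrative assumptions, not taken from [27]–[29]. For each recommended action, it checks that no alternative action yields a higher expected payoff under the joint distribution.

```python
def is_correlated_equilibrium(dist, payoff_1, payoff_2, tol=1e-9):
    """dist[a][b] is the probability that the signal recommends (a, b)."""
    n1, n2 = len(dist), len(dist[0])
    # Player 1: following recommendation a must be at least as good as any a_dev.
    for a in range(n1):
        for a_dev in range(n1):
            gain = sum(dist[a][b] * (payoff_1[a_dev][b] - payoff_1[a][b])
                       for b in range(n2))
            if gain > tol:
                return False
    # Player 2: following recommendation b must be at least as good as any b_dev.
    for b in range(n2):
        for b_dev in range(n2):
            gain = sum(dist[a][b] * (payoff_2[a][b_dev] - payoff_2[a][b])
                       for a in range(n1))
            if gain > tol:
                return False
    return True

# Hypothetical payoffs: both transmitting collides (0, 0); both backing off gives (6, 6).
PAYOFF_1 = [[0, 7], [2, 6]]
PAYOFF_2 = [[0, 2], [7, 6]]
# Signal that never recommends a collision and mixes the other three outcomes equally.
SIGNAL = [[0.0, 1 / 3], [1 / 3, 1 / 3]]
IS_CE = is_correlated_equilibrium(SIGNAL, PAYOFF_1, PAYOFF_2)
```

The signal above passes the check, whereas a signal that always recommends a collision does not, since either player would gain by backing off.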

F. Organization of this Paper

This paper is organized as follows: Section II formulates the power allocation problem in a cognitive radio system using the static game framework. Section III then introduces a special type of dynamic game, the switching control game, and uses it to solve the rate adaptation problem in a TDMA cognitive radio system. Section IV applies mechanism design to obtain a truth revealing opportunistic scheduling algorithm, and Section V computes the correlated equilibrium in a stochastic dynamic game. Finally, Section VI concludes the paper with some open issues on applying game theory in cognitive radio systems.

II. DISTRIBUTED TRANSMISSION POWER CONTROL: ASYNCHRONOUS ITERATIVE WATER-FILLING

An iterative water-filling algorithm (IWFA) is proposed in [11] to obtain the Nash equilibrium for the multiuser power control problem in a digital subscriber line system. In that formulation, the user power allocation problem in an interference channel system is modeled as a noncooperative game, and the existence and uniqueness of a Nash equilibrium are established for a two-player version of the game. However, the IWFA suffers from a low convergence rate in systems with a large number of users. To overcome this disadvantage, an improved asynchronous iterative water-filling algorithm (AIWFA) was proposed in [12]. The AIWFA is based on the asynchronous framework described in [30], which allows all the users to update in a completely asynchronous way. This feature makes the AIWFA well suited to practical cognitive radio systems.

The system model we consider here is a Gaussian frequency-selective interference channel with multiple cognitive radio users and multiple receivers. The aim is to find a distributed power allocation scheme that requires no coordination among users. We assume there are $K$ secondary users and each user has $G$ subcarriers. $\mathcal{K} = \{1, 2, \dots, K\}$ denotes the set of users. Denoting the power allocation of user $k$ over subcarrier $g$ as $p_k(g)$, the system constraints can be written as follows:

$$\sum_{g=1}^{G} p_k(g) \le P_k, \tag{1}$$

$$p_k(g) \le P_k^{\max}(g), \tag{2}$$

where $P_k$ denotes the total transmission power of the $k$th user and $P_k^{\max}(g)$ denotes the power limit on the $g$th subcarrier of the $k$th user. Constraint (1) is the total power constraint on each user, and (2) is the spectral mask constraint, imposed to limit the interference from each user over specified spectrum bands.

A. Problem Formulation

With the above system setup and constraints (1) and (2), each user aims to maximize its transmission rate in a distributed way. Let $R_k$ denote the maximum achievable rate of the $k$th user, which can be expressed as:

$$R_k = \frac{1}{G} \sum_{g=1}^{G} \log\left(1 + \frac{|h_{k,k}(g)|^2 p_k(g)}{\sigma_k^2(g) + \sum_{l=1, l \ne k}^{K} |h_{k,l}(g)|^2 p_l(g)}\right), \tag{3}$$

with $h_{i,j}(g)$ denoting the channel quality between the $j$th cognitive radio user and the $i$th receiver on the $g$th subcarrier. $\sigma_k^2(g)$ denotes the variance of the zero-mean circularly symmetric complex Gaussian noise at the $k$th receiver over the $g$th subcarrier. The term $\sum_{l=1, l \ne k}^{K} |h_{k,l}(g)|^2 p_l(g)$ is the total interference caused by all other users to user $k$.
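The rate expression (3) translates directly into array operations. The following is a minimal NumPy sketch, assuming the natural logarithm (rates in nats) and an array layout chosen here for illustration: `h[k, l, g]` is the channel from user `l` to receiver `k` on subcarrier `g`, `p[k, g]` is the power allocation, and `noise_var[k, g]` is the noise variance. The function name and layout are assumptions, not part of [12].

```python
import numpy as np

def achievable_rates(h, p, noise_var):
    """Evaluate R_k of (3) for every user; returns an array of length K."""
    gain = np.abs(h) ** 2                              # |h_{k,l}(g)|^2, shape (K, K, G)
    idx = np.arange(p.shape[0])
    signal = gain[idx, idx, :] * p                     # |h_{k,k}(g)|^2 p_k(g), shape (K, G)
    total = np.einsum("klg,lg->kg", gain, p)           # received power summed over all users l
    interference = total - signal                      # contribution of users l != k
    sinr = signal / (noise_var + interference)
    return np.mean(np.log(1.0 + sinr), axis=1)         # (1/G) * sum over subcarriers

# Hypothetical example: two users, one subcarrier, unit gains, unit powers, unit noise.
H_EX = np.ones((2, 2, 1))
P_EX = np.ones((2, 1))
NOISE_EX = np.ones((2, 1))
RATES = achievable_rates(H_EX, P_EX, NOISE_EX)
```

With unit cross gains each user sees an interference power equal to its own signal power, so the SINR is 1/2 in this toy example; setting the cross gains to zero recovers the single-user value log(2).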

We use $\mathbf{p}_k = \{p_k(1), p_k(2), \dots, p_k(G)\}$ to denote the power allocation vector of the $k$th user and $\mathbf{p}_{-k} = \{\mathbf{p}_1, \dots, \mathbf{p}_{k-1}, \mathbf{p}_{k+1}, \dots, \mathbf{p}_K\}$ to denote the power allocation strategies of the remaining $K-1$ users. The power allocation strategy of all the users in the system can then be denoted as $\mathbf{p} = \{\mathbf{p}_1, \dots, \mathbf{p}_K\} = \{\mathbf{p}_k, \mathbf{p}_{-k}\}$. We denote by $\mathcal{P}_k$ the set of transmission policies of user $k$ that satisfy the system constraints (1) and (2):

$$\mathcal{P}_k = \left\{ \mathbf{p}_k : \sum_{g=1}^{G} p_k(g) \le P_k,\; p_k(g) \le P_k^{\max}(g) \right\}. \tag{4}$$

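We use $\mathbf{p}^* = \{\mathbf{p}_1^*, \dots, \mathbf{p}_K^*\}$ to denote the Nash equilibrium power allocation strategy. Given $\mathbf{p}_{-k}^*$, the optimal power allocation strategy $\mathbf{p}_k^*$ of the $k$th user is the solution to the following optimization problem:

$$\max_{\mathbf{p}_k} R_k(\mathbf{p}_k, \mathbf{p}_{-k}), \quad \text{s.t. } \mathbf{p}_k \in \mathcal{P}_k, \quad \forall k \in \mathcal{K}. \tag{5}$$

Here, $R_k(\mathbf{p}_k, \mathbf{p}_{-k})$, specified in (3), is the maximum achievable transmission rate of the $k$th user.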

B. Asynchronous Iterative Water-filling Algorithm

Based on the above problem formulation, an asynchronous iterative water-filling algorithm is proposed to obtain the Nash equilibrium policy [12]. We use $n$ to denote the iteration index and $\mathcal{N} = \{0, 1, 2, \dots\}$ to denote the iteration index set. Due to the asynchronous nature of the AIWFA, not every user updates its power allocation strategy at each iteration $n$. We use $\mathcal{N}_k$ to denote the set of iteration indices at which user $k$ updates its policy $\mathbf{p}_k^n$. Here, $\mathbf{p}_k^n$ is the power allocation policy of the $k$th user at the $n$th iteration.

The AIWFA is outlined in Algorithm 1. $\mu_k$ is the water-level parameter chosen to satisfy the power constraint (1) of the $k$th user, and $[x]_a^b$ is the Euclidean projection of $x$ onto the interval $[a, b]$. The algorithm can be summarized as follows. In Step 1, we initialize the iteration index $n$ and the initial power allocation vector $\mathbf{p}^0$, where $\mathbf{p}^0$ satisfies the system constraints (1) and (2). If $n \in \mathcal{N}_k$, we update the transmission policy $\mathbf{p}_k^n$ of the $k$th user at the $n$th iteration according to the water-filling algorithm; otherwise, the power allocation policy of the $k$th user remains unchanged. The algorithm terminates when $\mathbf{p}^n$ converges. It is shown in [12] that the convergence of Algorithm 1 is guaranteed if one of the following conditions is satisfied:

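$$\frac{1}{w_k} \sum_{l \ne k} \max_{n \in \mathcal{D}_k \cap \mathcal{D}_l} \frac{|h_{k,l}(n)|^2}{|h_{k,k}(n)|^2} \frac{P_l}{P_k} w_l < 1, \quad \forall k \in \mathcal{K};$$

$$\frac{1}{w_l} \sum_{k \ne l} \max_{n \in \mathcal{D}_k \cap \mathcal{D}_l} \frac{|h_{k,l}(n)|^2}{|h_{k,k}(n)|^2} \frac{P_l}{P_k} w_k < 1, \quad \forall l \in \mathcal{K}; \tag{6}$$

where $\mathbf{w} = \{w_1, w_2, \dots, w_K\}$ is any positive vector and, in (6) only, $n$ indexes subcarriers. $\mathcal{D}_k$ denotes the set $\{1, 2, \dots, G\}$, possibly deprived of the subcarrier indices that user $k$ would never use in its best response to any strategies of the other users, for the given set of transmission powers and propagation channels [12].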



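Algorithm 1 Asynchronous Iterative Water-filling Algorithm

Step 1: Set $n = 0$; initialize $\mathbf{p}^0$ with $\mathbf{p}_k^0 \in \mathcal{P}_k$ for all $k \in \mathcal{K}$.
Step 2: Update the transmission policy of each user:
for $k = 1 : K$ do
    if $n \in \mathcal{N}_k$ then
        for $g = 1 : G$ do
            $p_k^{n+1}(g) = \left[ \mu_k - \dfrac{\sigma_k^2(g) + \sum_{l=1, l \ne k}^{K} |h_{k,l}(g)|^2 p_l^n(g)}{|h_{k,k}(g)|^2} \right]_0^{P_k^{\max}(g)}$
        end for
    else
        $\mathbf{p}_k^{n+1} = \mathbf{p}_k^n$.
    end if
end for
Step 3: If $\mathbf{p}^{n+1} \ne \mathbf{p}^n$, set $n = n + 1$ and return to Step 2; otherwise, $\mathbf{p}^{n+1}$ is the system Nash equilibrium power allocation policy.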
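The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch rather than the implementation of [12], and it makes several assumptions that are not in the source: the water level $\mu_k$ is found by bisection, user `k` updates whenever the iteration index is a multiple of a hypothetical `periods[k]` (one simple way to realize the sets $\mathcal{N}_k$), exact equality in Step 3 is replaced by a numerical tolerance held over a full update cycle, and the channel values are made up for the example.

```python
import numpy as np

def water_fill(inv_gain, p_total, p_mask, iters=60):
    """Per-user update: p(g) = [mu - inv_gain(g)] projected onto [0, p_mask(g)]."""
    if p_mask.sum() <= p_total:              # the mask alone already meets the budget
        return p_mask.copy()
    lo, hi = inv_gain.min(), (inv_gain + p_mask).max()
    for _ in range(iters):                   # bisection on the water level mu
        mu = 0.5 * (lo + hi)
        if np.clip(mu - inv_gain, 0.0, p_mask).sum() > p_total:
            hi = mu
        else:
            lo = mu
    return np.clip(lo - inv_gain, 0.0, p_mask)

def aiwfa(h, noise_var, p_total, p_mask, periods, max_iter=500, tol=1e-10):
    """Asynchronous iterative water-filling; h[k, l, g] is the channel from user l to receiver k."""
    gain = np.abs(h) ** 2
    n_users, n_sub = noise_var.shape
    p = np.minimum(p_mask, p_total[:, None] / n_sub)   # feasible starting point
    quiet = 0
    for n in range(max_iter):
        p_next = p.copy()
        for k in range(n_users):
            if n % periods[k] != 0:          # user k does not update at this iteration
                continue
            interference = sum(gain[k, l] * p[l] for l in range(n_users) if l != k)
            inv_gain = (noise_var[k] + interference) / gain[k, k]
            p_next[k] = water_fill(inv_gain, p_total[k], p_mask[k])
        quiet = quiet + 1 if np.max(np.abs(p_next - p)) < tol else 0
        p = p_next
        if quiet >= max(periods):            # every user has updated without any change
            break
    return p

# Hypothetical two-user, four-subcarrier example with weak cross channels.
H = np.full((2, 2, 4), np.sqrt(0.1))
H[0, 0], H[1, 1] = 1.0, 1.0
NOISE = np.array([[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]])
P_TOTAL = np.array([1.0, 1.0])
P_MASK = np.full((2, 4), 0.6)
P_STAR = aiwfa(H, NOISE, P_TOTAL, P_MASK, periods=[1, 2])
```

At the returned allocation each user's power vector is, up to the tolerance, its own water-filling response to the interference generated by the other user, which is the fixed-point property that characterizes the Nash equilibrium of (5).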

III. SWITCHING CONTROL MARKOVIAN DYNAMIC GAME

In this section, we formulate the secondary user rate adaptation problem in a cognitive radio network as a constrained general-sum switching control Markovian dynamic game. A switching control game [13]–[15] is a special type of game where the transition probability in any given state depends on only one player. It turns out that such a game can be solved by a finite sequence of Markov decision processes.

The system model considers the secondary user rate adaptation problem in cognitive radio networks where multiple secondary users attempt to access a spectrum hole [16]. We assume a Time Division Multiple Access (TDMA) cognitive radio system model (as specified in the IEEE 802.16 standard [31]) that schedules one user per spectrum hole at each time slot according to a predefined decentralized scheduling policy. The interaction among secondary users is therefore characterized as a competition for the spectrum hole and can naturally be formulated as a dynamic game. By modeling transmission channels as correlated Markovian sources, the transmission rate adaptation problem for each user can be formulated as a general-sum switching control Markovian dynamic game with a latency constraint. The transmission policy of such a game takes into account the secondary user channel qualities, as well as the transmission delay of each secondary user.

A. System Description

This subsection introduces the system model (Fig. 1). We consider a TDMA system with $K$ secondary users where only one user can access the channel at each time slot according to a predefined decentralized access rule, described later in this section. The correlated block fading channel of each user is modeled as a Markov chain. The rate control problem of each secondary user can then be formulated as a constrained Markovian dynamic game. More specifically, under the predefined decentralized access rule, the problem is a special type of game, namely a switching control Markovian dynamic game.

Figure 1. A $K$-user cognitive radio network where all the users are trying to access the spectrum hole.

1) System States and TDMA Access Rule: We denote the time slot index as $t$, with $t \in \mathcal{T} = \{0, 1, 2, \dots\}$. The channel quality state of user $k$ at time $t$ is denoted as $h_k^t$ and is assumed to belong to a finite set $\{0, 1, \dots, Q_h\}$. The channel state can be obtained by quantizing a continuous-valued channel model comprising circularly symmetric complex Gaussian random variables that depend only on the previous time slot. The composition of the channel states of all $K$ users is written as $\mathbf{h}^t = \{h_1^t, \dots, h_K^t\}$. Assuming that the channel state $\mathbf{h}^t \in \mathcal{H}$, $t \in \mathcal{T}$, is block fading with each block length equal to one time slot, the channel state can be modeled as a finite-state Markov chain. The transition probability of the channel states from time $t$ to $t + 1$ is denoted as $\mathbb{P}(\mathbf{h}^{t+1} | \mathbf{h}^t)$.

Let $b_k^t$ denote the buffer occupancy state of user $k$ at time $t$; it belongs to a finite set, $b_k^t \in \{0, 1, \dots, L\}$. The composition of the buffer states of all $K$ users is denoted as $\mathbf{b}^t = \{b_1^t, \dots, b_K^t\}$, and $\mathbf{b}^t$ is an element of the secondary user buffer state space $\mathcal{B}$.

New packets arrive at the buffer at each time slot, and we denote the number of new incoming packets of the $k$th user at time $t$ as $f_k^t$, $f_k^t \in \{0, 1, 2, \dots\}$. The composition of the incoming traffic of all $K$ users is denoted as $\mathbf{f}^t = \{f_1^t, \dots, f_K^t\}$, which is an element of the incoming traffic space $\mathcal{F}$. For simplicity, the incoming traffic is assumed to be independent and identically distributed (i.i.d.) over the time index $t$ and the user index $k$. The incoming traffic is not part of the system state, but it affects the buffer state evolution.

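Using $s_k^t = [h_k^t, b_k^t]$ to denote the state of user $k$ at time $t$, the system state at time $t$ can be denoted as $s^t = \{s_1^t, \dots, s_K^t\}$. The finite system state space is denoted as $\mathcal{S}$, which comprises the channel state space $\mathcal{H}$ and the secondary user buffer state space $\mathcal{B}$. That is, $\mathcal{S} = \mathcal{H} \times \mathcal{B}$, where $\times$ denotes the Cartesian product. Furthermore, $\mathcal{S}_k$ denotes the set of states in which user $k$ is scheduled for transmission. $\mathcal{S}_1, \mathcal{S}_2, \dots, \mathcal{S}_K$ are disjoint subsets of $\mathcal{S}$ with the property that $\mathcal{S} = \mathcal{S}_1 \cup \mathcal{S}_2 \cup \dots \cup \mathcal{S}_K$.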

The system adopts a TDMA cognitive radio system model (IEEE 802.16 [31]). A decentralized channel access algorithm can be constructed as follows. At the beginning of a time slot, each user $k$ attempts to access the channel after a certain time delay $WT_k^t$. The time delay of user $k$ can be specified via an opportunistic scheduling algorithm [32], such as

$$WT_k^t = \frac{\gamma_k}{b_k^t h_k^t}. \tag{7}$$

Here $\gamma_k$ is a user specified quality of service (QoS) parameter with $\gamma_k \in \{\gamma_p, \gamma_s\}$. If user $k$ is a primary user, $\gamma_k = \gamma_p$; otherwise, $\gamma_k = \gamma_s$. By setting $\gamma_p \ll \gamma_s$, the network does not allow secondary users to transmit in the presence of primary users. As soon as a user successfully accesses a channel, the remaining users detect the channel occupancy and stop their attempts to access it. We use $k^{*t}$ to denote the index of the first user that successfully accesses the spectrum hole. When multiple users have the same minimum waiting time, $k^{*t}$ is chosen from those users with equal probability.
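The access rule built on (7) amounts to selecting the user with the smallest waiting time, with uniform tie-breaking. The sketch below is a minimal illustration, assuming that a user with an empty buffer or a zero channel state never attempts access (its waiting time is treated as infinite); that convention, the function name and the sample values are assumptions made here, not part of [32].

```python
import random

def schedule(buffers, channels, gammas, rng=random):
    """Return the index of the user with the smallest WT = gamma / (b * h), or None."""
    waits = []
    for b, h, gamma in zip(buffers, channels, gammas):
        # Assumption: b * h == 0 means the user has nothing to send or no usable channel.
        waits.append(gamma / (b * h) if b * h > 0 else float("inf"))
    best = min(waits)
    if best == float("inf"):
        return None                       # nobody attempts to access the slot
    winners = [k for k, w in enumerate(waits) if w == best]
    return rng.choice(winners)            # ties are broken with equal probability

# Hypothetical slot: three secondary users with identical QoS parameters.
WINNER = schedule(buffers=[2, 5, 1], channels=[1, 2, 3], gammas=[1.0, 1.0, 1.0])
# Hypothetical slot with a primary user (index 0) using a much smaller gamma.
PRIMARY_WINS = schedule(buffers=[1, 5, 5], channels=[1, 2, 2], gammas=[0.001, 1.0, 1.0])
```

A user with a long queue and a good channel waits the least, and a primary user with $\gamma_p \ll \gamma_s$ preempts the secondary users whenever it has data to send.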

2) Action and Costs: If the $k$th user is scheduled for transmission at time slot $t$, its action $a_k^t$ represents the bits/symbol rate of the transmission. Assuming the system uses uncoded $M$-ary quadrature amplitude modulation (QAM), the bits/symbol rate determines the modulation scheme, that is, $M = 2^{a_k^t}$.

Transmission cost: When user $k$ is scheduled for transmission at time instant $t$, that is, $s^t \in \mathcal{S}_k$, the cost function of user $k$ depends only on $a_k^t$, as all the other users are inactive. Let $c_i(s^t, a_k^t)$ denote the transmission cost of user $i$, $i \in \mathcal{K}$, at time $t$. Specifically, $c_i(s^t, a_k^t)$ is chosen to be the bit error rate (BER) of user $i$ during the transmission. Thus, the costs of all the users in the system can be specified as:

$$c_k(s^t, a_k^t) \ge 0, \qquad c_{i, i \ne k}(s^t, a_k^t) = 0. \tag{8}$$

Holding cost: Each user has an instantaneous QoS constraint denoted as $d_i(s^t, a_k^t)$, $i = 1, \dots, K$. If the QoS constraint is chosen to be the delay (latency constraint), then $d_i(s^t, a_k^t)$ is a function of the buffer state $b_i^t$. The instantaneous holding costs will subsequently be included in an infinite horizon latency constraint.

B. Transition Probabilities and Switching Control Game Formulation

1) Transition Probabilities: With the above setup, the decentralized transmission control problem in a Markovian block fading channel cognitive radio system can now be formulated as a switching control game. In such a game [14], the transition probabilities depend only on the action of the $k$th user when $s \in \mathcal{S}_k$. This feature enables us to solve such a game by a finite sequence of Markov decision processes. According to the property of the switching control game, when the $k$th user is scheduled for transmission, the transition probability between the current composite state $s^t = [\mathbf{h}^t, \mathbf{b}^t]$ and the next state $s^{t+1} = [\mathbf{h}^{t+1}, \mathbf{b}^{t+1}]$ depends only on the action $a_k^t$ of the $k$th user, and can be specified as

$$\mathbb{P}(s^{t+1} | s^t, a_k^t) = \prod_{i=1}^{K} \mathbb{P}(h_i^{t+1} | h_i^t) \cdot \prod_{i=1, i \ne k}^{K} \mathbb{P}(b_i^{t+1} | b_i^t) \cdot \mathbb{P}(b_k^{t+1} | b_k^t, a_k^t). \tag{9}$$

The buffer occupancy of user $k$ evolves according to Lindley's equation [33]

$$b_k^{t+1} = \min\left( [b_k^t - a_k^t]^+ + f_k^t,\; L \right). \tag{10}$$

The buffer state of user $i \in \mathcal{K}$, $i \ne k$, evolves according to the following rule:

$$b_i^{t+1} = \min(b_i^t + f_i^t,\; L).$$

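The buffer state transition probability of user $k$ depends on its incoming traffic distribution and its action, and is given by

$$\mathbb{P}(b_k^{t+1} | b_k^t, a_k^t) = \begin{cases} \mathbb{P}\left(f_k^t = b_k^{t+1} - [b_k^t - a_k^t]^+\right), & b_k^{t+1} < L, \\ \sum_{x = L - [b_k^t - a_k^t]^+}^{\infty} \mathbb{P}(f_k^t = x), & b_k^{t+1} = L. \end{cases}$$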

For those users who are not scheduled for transmission, the buffer state transition probabilities depend only on the incoming traffic, and can be written as

$$\mathbb{P}(b_i^{t+1} | b_i^t) = \begin{cases} \mathbb{P}(f_i^t = b_i^{t+1} - b_i^t), & b_i^{t+1} < L, \\ \sum_{x = L - b_i^t}^{\infty} \mathbb{P}(f_i^t = x), & b_i^{t+1} = L. \end{cases} \tag{11}$$
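The buffer transition probabilities implied by (10) and (11) can be tabulated once an arrival distribution is fixed. The sketch below assumes Poisson arrivals with a hypothetical rate; the source only requires i.i.d. traffic, so the Poisson choice, the function names and the numbers are illustrative assumptions. Setting the action to zero gives the unscheduled-user case (11).

```python
import math

def poisson_pmf(x, rate):
    """P(f = x) for Poisson arrivals (assumed traffic model)."""
    return math.exp(-rate) * rate ** x / math.factorial(x)

def buffer_transition_row(b, a, buffer_size, rate):
    """Return [P(b_next = 0), ..., P(b_next = L)] for b_next = min([b - a]^+ + f, L)."""
    residual = max(b - a, 0)                       # packets left after transmission
    row = [0.0] * (buffer_size + 1)
    for b_next in range(residual, buffer_size):    # buffer not full: exact arrival count
        row[b_next] = poisson_pmf(b_next - residual, rate)
    row[buffer_size] = 1.0 - sum(row[:buffer_size])  # buffer full: tail of the arrival pmf
    return row

# Hypothetical example: buffer size L = 5, current occupancy 3, action 1, arrival rate 0.8.
ROW_SCHEDULED = buffer_transition_row(b=3, a=1, buffer_size=5, rate=0.8)
ROW_IDLE = buffer_transition_row(b=3, a=0, buffer_size=5, rate=0.8)
```

States below the residual occupancy have probability zero because arrivals cannot be negative, and the last entry collects all arrival counts that would overflow the buffer.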

2) Switching Controlled Markovian Game Formulation: We use $\pi_i$ ($i = 1, 2, \dots, K$) to denote the transmission policy vector of the $i$th user. With a slight abuse of notation, $\pi_i(s)$ denotes the transmission policy of user $i$ in state $s$ and is a component of $\pi_i$. $\pi_i(s)$ lives in the same space as the action $a_i$ of the $i$th user. Assume that at time instant $t$ user $k$ is scheduled for transmission according to the system access rule specified in (7). The infinite horizon expected total discounted cost of the $i$th ($i = 1, 2, \dots, K$) user under transmission policy $\pi_i$ can be written as:

$$C_i(\pi_i) = \mathbb{E}_{\pi_i}\left[ \sum_{t=1}^{\infty} \beta^{t-1} \cdot c_i(s^t, a_k^t) \right], \tag{12}$$

where $0 \le \beta < 1$ is the discount factor. The expectation is taken over the system state $s^t$, which evolves over the time index $t$. If we denote the holding cost of user $i$ at the $t$th time slot as $d_i(s^t, a_k^t)$, the infinite horizon expected total discounted latency constraint can be written as

$$D_i(\pi_i) = \mathbb{E}_{\pi_i}\left[ \sum_{t=1}^{\infty} \beta^{t-1} \cdot d_i(s^t, a_k^t) \right] \le \tilde{D}_i, \tag{13}$$

where $\tilde{D}_i$ is a system parameter depending on the system requirement. Note that we assume the latency constraint is feasible in our problem formulation: $\tilde{D}_i$ is chosen so that the set of policies that satisfy the constraint is non-empty. This assumption will be discussed more specifically in Section III-C.

Equations (9), (12) and (13) define a constrained switching control Markovian game. Our goal is to compute a Nash equilibrium policy $\pi_i^*$, $i \in \mathcal{K}$ (which is not necessarily unique), that minimizes the discounted transmission cost (12) subject to the latency constraint (13). The following result shows that a Markovian switching control game can be solved using a sequence of Markov decision processes (MDPs).

Result: [14, Chapter 3.2] The constrained switching control Markovian game (12), (13) can be solved by a finite sequence of MDPs (as described in Algorithm 2). At each step, the algorithm iteratively updates the transmission policy $\pi_i^n$ of user $i$ given the transmission policies of the remaining users. The optimization problem at each iteration can be written as:

$$\pi_i^{*n} = \left\{ \pi_i^n : \min_{\pi_i} C_i^n(\pi_i) \;\; \text{s.t.} \;\; D_i^n(\pi_i) \le \tilde{D}_i, \; i \in \mathcal{K} \right\}. \tag{14}$$

C. Value Iteration Algorithm

We present a value iterative algorithm in this subsection to compute a Nash equilibrium solution to the constrained Markovian dynamic game optimization problem described in (14). A value iterative optimization algorithm was designed in [14] to calculate the Nash equilibrium for an unconstrained general-sum dynamic Markovian switching control game. Therefore, we first transform the problem in (14) into an unconstrained one using Lagrangian dynamic programming and then apply the value iterative algorithm specified in Algorithm 2 to compute the Nash equilibrium solution.

Algorithm 2 Value Iterative Optimization Algorithm

Step 1: Set $m = 0$; initialize $l$. Initialize $\{V_1^0, V_2^0, \dots, V_K^0\}$ and $\{\lambda_1^0, \lambda_2^0, \dots, \lambda_K^0\}$.
Step 2: Inner loop: set $n = 0$.
Step 3: Inner loop: update transmission policies.
for $k = 1 : K$ do
    for each $s \in \mathcal{S}_k$,
        $\pi_k^n(s) = \arg\min_{\pi_k^n(s)} \left\{ c(s, a_k) + \lambda_k^m \cdot d_k(s, a_k) + \beta \sum_{s'=1}^{|\mathcal{S}|} \mathbb{P}(s' | s, a_k) v_k^n(s') \right\}$;
        $v_k^{n+1}(s) = c(s, \pi_k^n(s)) + \lambda_k^m \cdot d_k(s, \pi_k^n(s)) + \beta \sum_{s'=1}^{|\mathcal{S}|} \mathbb{P}(s' | s, \pi_k^n(s)) v_k^n(s')$;
        $v_{i=1:K, i \ne k}^{n+1}(s) = \lambda_i^m \cdot d_i(s, \pi_k^n(s)) + \beta \sum_{s'=1}^{|\mathcal{S}|} \mathbb{P}(s' | s, \pi_k^n(s)) v_i^n(s')$;
end for
Step 4: If $V_k^{n+1} \le V_k^n$, $k \in \mathcal{K}$, set $n = n + 1$ and return to Step 3; otherwise, go to Step 5.
Step 5: Update Lagrange multipliers.
for $k = 1 : K$ do
    $\lambda_k^{m+1} = \lambda_k^m + \frac{1}{l} \left[ D_k(\pi_1^n, \pi_2^n, \dots, \pi_K^n) - \tilde{D}_k \right]$
end for
Step 6: The algorithm stops when $\lambda_k^m$, $k \in \mathcal{K}$, converge; otherwise, set $m = m + 1$ and return to Step 2.

The algorithm can be summarized as follows. We use $V_k^n$, $k = 1, 2, \dots, K$, to represent the value vectors at the $n$th inner iteration and $\lambda_k^m$, $k = 1, 2, \dots, K$, to represent the Lagrange multipliers at the $m$th outer iteration. The algorithm consists of two parts: the outer loop and the inner loop. The outer loop updates the Lagrange multiplier of each user, and the inner loop optimizes the transmission policy of each user under fixed Lagrange multipliers. The outer loop index and inner loop index are $m$ and $n$, respectively. It can be seen from Algorithm 2 that the interaction among the secondary users takes place through the update of the value vectors, since $v_{i=1:K, i \ne k}^{n+1}(s)$ is a function of $\pi_k^n(s)$. The inner loop runs at every time slot, and the inner loop iteration period equals the time slot period.

In Step 1, we set the outer loop index $m$ to 0 and initialize the step size $l$, the value vectors $V_k^0$, $k = 1, 2, \dots, K$, and the Lagrange multipliers $\lambda_k^0$, $k = 1, 2, \dots, K$. Step 3 is the inner loop, where at each step we solve the $k$th user controlled game and obtain the new optimal strategy for that user with the strategies of the remaining players fixed. Step 5 updates the Lagrange multipliers based on the discounted delay value of each user given the transmission policies $\{\pi_1^n, \pi_2^n, \dots, \pi_K^n\}$. $\frac{1}{l}$ is the step size, which satisfies the conditions for convergence of the Robbins-Monro algorithm. Thus the sequence of Lagrange multipliers $\{\lambda_1^m, \dots, \lambda_K^m\}$ with $m = 0, 1, 2, \dots$ converges in probability to $\{\lambda_1^*, \dots, \lambda_K^*\}$, which satisfy the constrained problem defined in (14) [34], [35]. The algorithm terminates when a specified accuracy of $\lambda_k^m$, $k = 1, 2, \dots, K$, is obtained; otherwise, it returns to Step 2.
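The inner loop (Step 3) and the multiplier update (Step 5) of Algorithm 2 can be sketched as follows. This is an illustrative sketch under several assumptions that are not in the source: states are indexed by integers, `owner[s]` names the user scheduled in state `s` (the partition into $\mathcal{S}_k$), the inner loop runs for a fixed number of sweeps instead of using the test in Step 4, and the toy transition, cost and holding-cost arrays are invented for the example.

```python
import numpy as np

def inner_loop(P, cost, hold, owner, lam, beta, n_iter=200):
    """P[s, a, s2]: transitions; cost[s, a]: cost of the scheduled user; hold[i, s, a]: holding cost of user i."""
    n_users, n_states, _ = hold.shape
    v = np.zeros((n_users, n_states))
    policy = np.zeros(n_states, dtype=int)
    for _ in range(n_iter):
        v_next = np.empty_like(v)
        for s in range(n_states):
            k = owner[s]                                 # only user k controls state s
            future = P[s] @ v.T                          # (A, K): expected next values per action
            q = cost[s] + lam[k] * hold[k, s] + beta * future[:, k]
            a = int(np.argmin(q))                        # greedy action of the controlling user
            policy[s] = a
            # Users other than k pay only their weighted holding cost in state s.
            v_next[:, s] = lam * hold[:, s, a] + beta * future[a]
            v_next[k, s] = q[a]                          # user k also pays the transmission cost
        v = v_next
    return policy, v

def update_multipliers(lam, delay, delay_target, step):
    """Step 5: move each multiplier in proportion to the latency constraint violation."""
    return lam + step * (delay - delay_target)

# Hypothetical toy game: 4 states, 2 actions, 2 users; action 1 costs more but empties the buffer.
P = np.full((4, 2, 4), 0.25)
COST = np.tile([0.0, 1.0], (4, 1))
HOLD = np.zeros((2, 4, 2))
HOLD[:, :, 0] = 2.0
OWNER = np.array([0, 0, 1, 1])
POLICY_LOW, V_LOW = inner_loop(P, COST, HOLD, OWNER, np.zeros(2), beta=0.9)
POLICY_HIGH, V_HIGH = inner_loop(P, COST, HOLD, OWNER, np.full(2, 5.0), beta=0.9)
```

With zero multipliers the controlling user ignores delay and always picks the cheap action, while a large multiplier makes the holding cost dominate and switches the policy to the faster action, which is the trade-off the outer loop tunes.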

Since this is a constrained optimization problem, the optimal transmission policy is a randomization of two deterministic policies [33]. We use $\lambda_k^*$, $k = 1, 2, \dots, K$, to represent the Lagrange multipliers obtained with the above algorithm. The randomized policy of each user can be written as:

$$\pi_k^*(s) = q_k \pi_k^*(s, \lambda_{k,1}) + (1 - q_k) \pi_k^*(s, \lambda_{k,2}), \tag{15}$$

where $0 \le q_k \le 1$ is the randomization factor and $\pi_k^*(s, \lambda_{k,1})$, $\pi_k^*(s, \lambda_{k,2})$ are the unconstrained optimal policies with Lagrange multipliers $\lambda_{k,1}$ and $\lambda_{k,2}$. Specifically, $\lambda_{k,1} = \lambda_k^* - \Delta$ and $\lambda_{k,2} = \lambda_k^* + \Delta$ for a perturbation parameter $\Delta$. The randomization factor $q_k$ of the $k$th user is calculated by:

$$q_k = \frac{\tilde{D}_k - D_k(\lambda_{1,2}, \dots, \lambda_{K,2})}{D_k(\lambda_{1,1}, \dots, \lambda_{K,1}) - D_k(\lambda_{1,2}, \dots, \lambda_{K,2})}. \tag{16}$$
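The randomization factor in (16) is the weight that makes the mixed delay meet the latency target exactly. A minimal sketch follows, assuming the discounted delays under the two perturbed multipliers are already available as numbers; clipping the result to [0, 1] and the sample values are assumptions made here for illustration.

```python
def randomization_factor(delay_target, delay_lambda_1, delay_lambda_2):
    """q_k from (16): weight on the policy obtained with lambda_{k,1} = lambda* - Delta."""
    q = (delay_target - delay_lambda_2) / (delay_lambda_1 - delay_lambda_2)
    return min(max(q, 0.0), 1.0)          # assumption: keep q a valid probability

# Hypothetical delays: the smaller multiplier tolerates more delay than the larger one.
D_TARGET, D_LAMBDA_1, D_LAMBDA_2 = 5.0, 8.0, 4.0
Q = randomization_factor(D_TARGET, D_LAMBDA_1, D_LAMBDA_2)
MIXED_DELAY = Q * D_LAMBDA_1 + (1.0 - Q) * D_LAMBDA_2
```

With these numbers the weight is 0.25, and the mixture of the two delays equals the target of 5, which is the sense in which the randomized policy satisfies the constraint with equality.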

The convergence proof of the inner loop of Algorithm 2 can be found in [14, Chapter 6.3]. The intuition behind the proof is as follows. The value vector $V_k^{(n)}$ ($k \in \mathcal{K}$) is nonincreasing in the iteration index $n$ of the value iteration algorithm. There are only a finite number of strategies available for the optimal policy $\pi_k^*$, $k \in \mathcal{K}$. It follows that the algorithm converges in a finite number of iterations. This value iterative algorithm obtains a Nash equilibrium solution to the constrained switching control Markovian game with general-sum reward and general-sum constraint.

Fig. 2 gives an example of the performance of Algorithm 2. The system has 2 cognitive radio users, each user has a buffer of size 10, and the channel quality measurements are quantized into two states, namely $\{1, 2\}$. It can be seen from Fig. 2 that the Nash equilibrium policy of user 1 is a randomized mixture of two pure policies.

Figure 2. The Nash equilibrium transmission control policy obtained via the value iterative optimization algorithm (Algorithm 2). A 2-user system is considered, and each user has a size 10 buffer. (Plot title: Randomized Optimal Policy; axes: Buffer State, Channel State, Policy.)

IV. TRUTH REVEALING OPPORTUNISTIC SCHEDULING

The decentralized channel access algorithm adopted in Section III-A.1 is based on the opportunistic access scheme, where each cognitive radio waits for a certain time at the beginning of each time slot and the first user to access the channel can use it for that time slot. The system is equivalent to having a virtual central scheduler that decides which user is scheduled for transmission at each time slot. However, if different cognitive radios belong to different agents, selfish cognitive radios aiming to maximize their own payoffs may not reveal their true state information to the virtual central scheduler. Mechanism design theory can be applied to prevent this from happening.

This section considers a multiple-user cognitive radio system. We propose a truth revealing system access protocol based on the opportunistic scheduling algorithm [26]. The proposed protocol provides better system performance than conventional approaches. By applying mechanism design theory to opportunistic scheduling, users are given no incentive to lie and the optimality of the overall system performance is ensured. The pricing mechanism we propose is based on the VCG mechanism and maintains the same desirable economic properties as the VCG mechanism.

A. Conventional Opportunistic Accessing Scheme

This subsection describes a conventional opportunistic scheduling algorithm in a $K$-user cognitive radio system. Following the previous notation, $b_k$ indicates the buffer state of user $k$ and $\mathbf{b} = \{b_1, b_2, \ldots, b_K\}$. $\hat{b}_k$ represents the buffer state that the $k$th user reports to the virtual central scheduler. In a truth revealing system, each user reports the true state value and $\hat{b}_k = b_k$, $k \in \mathcal{K}$.

The channel quality of user $k$ is denoted as $h_k$; specifically, $h_k$ measures the signal-to-noise ratio (SNR). The composition of the channel states of all $K$ users is denoted as $\mathbf{h} = \{h_1, h_2, \ldots, h_K\}$. Let the transmission rate of user $k$ in symbols per second be $w_k$ and the rate in bits per symbol be $m_k$ (different values of $m_k$ lead to different modulation schemes). Assuming each user uses unit transmission power, the instantaneous throughput is:

$$\rho_k = w_k m_k \bigl(1 - p_e(h_k, m_k)\bigr)^{s_k}. \tag{17}$$

Here, $s_k$ denotes the average packet size in bits and $p_e(h_k, m_k)$ denotes the bit error rate (BER), which is a function of the SNR and modulation mode of the current user $k$. Assuming the system uses uncoded M-ary quadrature amplitude modulation (QAM), $p_e(h_k, m_k)$ can be approximated as [33]:

$$p_e(h_k, m_k) = 0.2 \times \exp\!\left(\frac{-1.6\, h_k}{2^{m_k} - 1}\right). \tag{18}$$
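Equations (17) and (18) can be evaluated directly. The sketch below is a plain transcription under the assumption that $h_k$ is the SNR on a linear (not dB) scale; the function and argument names are illustrative only.

```python
import math


def bit_error_rate(h, m):
    """Approximate BER of uncoded M-ary QAM as in (18).

    h -- SNR (assumed here to be on a linear scale)
    m -- bits per symbol
    """
    return 0.2 * math.exp(-1.6 * h / (2 ** m - 1))


def throughput(w, m, h, s):
    """Instantaneous throughput as in (17).

    w -- symbols per second
    m -- bits per symbol
    h -- SNR
    s -- average packet size in bits
    """
    return w * m * (1.0 - bit_error_rate(h, m)) ** s


low_snr = throughput(w=1e6, m=2, h=10.0, s=100)
high_snr = throughput(w=1e6, m=2, h=20.0, s=100)
```

For a fixed modulation order, the throughput grows with the SNR and never exceeds the raw bit rate `w * m`.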

Applying quantization to the instantaneous throughput $\rho_k$, we have $\rho_k \in \{0, 1, 2, \ldots, Q_\rho\}$ ($k = 1, 2, \ldots, K$), with $Q_\rho$ indicating the maximum throughput quantization level. Similarly, $\hat{\rho}_k$ indicates the instantaneous throughput state that user $k$ reports to the central scheduler, and $\hat{\rho}_k \in \{0, 1, 2, \ldots, Q_\rho\}$.

For notational convenience, we use $\Theta_k = \{\rho_k, b_k\}$ to represent the true states of the $k$th user and $\hat{\Theta}_k$ to represent the reported states of the $k$th user. $\Theta_{-k}$ denotes the true states of the remaining $K - 1$ users (excluding user $k$) and $\hat{\Theta}_{-k}$ denotes the corresponding reported values. We use $\Theta = \{\Theta_k, \Theta_{-k}\}$ to denote the true states of all $K$ users and $\hat{\Theta} = \{\hat{\Theta}_k, \hat{\Theta}_{-k}\}$ to denote the reported states.

The opportunistic access scheme is based on the reported buffer and throughput states. Define $A$ as a feasible set, that is, a subset of $\mathcal{K} = \{1, 2, \ldots, K\}$ ($A \subseteq \mathcal{K}$) that satisfies the system constraint. We specify the system constraint to be the overall transmission power in the system: the overall transmission power should be less than or equal to the system power limit $P$. As specified earlier in this subsection, each user transmits with unit power. Thus, the transmission power constraint on the system is equivalent to a constraint on the total number of transmitting users: the number of users transmitting simultaneously should be less than or equal to $P$. The optimal feasible set $A^*$ of the conventional opportunistic access scheme is a solution to the following optimization problem:

$$A^* = \arg\max_{A} \sum_{k \in A} \gamma_k \cdot U_k(\rho_k, b_k), \tag{19}$$

$$\text{s.t. } |A| \le P. \tag{20}$$

$|A|$ denotes the number of users in set $A$ and $U_k(\rho_k, b_k)$ denotes the corresponding utility of user $k$ given throughput $\rho_k$ and buffer state $b_k$. $\gamma_k$ in (19) is the QoS parameter, which depends on the user type. In a cognitive radio system there are two types of users, primary and secondary, so $\gamma_k \in \{\gamma_p, \gamma_s\}$: if user $k$ is a primary user, $\gamma_k = \gamma_p$; otherwise, $\gamma_k = \gamma_s$. Furthermore, by setting $\gamma_p \gg \gamma_s$, the network does not allow secondary users to transmit in the presence of a primary user. If multiple sets optimize (19) subject to the constraint (20), $A^*$ is chosen randomly from these sets with equal probability. In the decentralized channel access algorithm specified in Section III-A.1, the utility function is $U_k(\rho_k, b_k) = h_k b_k$ and the transmission power limit is $P = 1$, since only one user is allowed to transmit per time slot.
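Because the objective (19) is a sum over the selected users and the constraint (20) only limits their number, a feasible set of maximum value can be found by ranking the users by weighted utility. The sketch below assumes nonnegative utilities and, for reproducibility, breaks ties by user index instead of the uniform random choice described above; the function name and the sample numbers are hypothetical.

```python
def optimal_feasible_set(gammas, utilities, power_limit):
    """Return one maximizer of (19) subject to (20).

    gammas      -- QoS parameter gamma_k of each user
    utilities   -- utility U_k of each user (assumed nonnegative)
    power_limit -- system power limit P (maximum number of transmitting users)
    """
    weighted = [g * u for g, u in zip(gammas, utilities)]
    # Rank users by decreasing weighted utility; ties go to the lower index
    # (an assumption of this sketch, the text uses a random choice).
    ranked = sorted(range(len(weighted)), key=lambda k: (-weighted[k], k))
    return set(ranked[:power_limit])


# User 0 is a primary user (large gamma); users 1-3 are secondary users.
example_gammas = [100, 1, 1, 1]
example_utilities = [1, 20, 30, 5]
scheduled_two = optimal_feasible_set(example_gammas, example_utilities, 2)
scheduled_one = optimal_feasible_set(example_gammas, example_utilities, 1)
```

With $P = 1$ the primary user is always selected in this example, which mirrors the effect of choosing $\gamma_p \gg \gamma_s$.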

The conventional opportunistic access scheme assumes that the state information received by the virtual central scheduler is true, that is, $\hat{\rho}_k = \rho_k$ and $\hat{b}_k = b_k$. However, this assumption may be challenged when the users become sophisticated enough to reconfigure themselves to make efficient use of local resources (e.g., by managing their own reported data to obtain the most efficient data transmission). It becomes increasingly important to design a mechanism that optimizes the overall system performance while ensuring the profit of each user.

B. The Pricing Mechanism

We now apply the VCG pricing mechanism to the opportunistic scheduling algorithm. The new mechanism enforces the truth revealing property of each user: the Nash equilibrium of the resulting game is attained when each user reports its true values.

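1) The Pricing Mechanism: Different from the centralized conventional opportunistic scheduling algorithm, the proposed pricing mechanism is a distributed algorithm in which each user tries to maximize its own utility function by choosing the reported state values. The buffer and throughput states that user $k$ chooses to report to the central scheduler are a solution of the following optimization problem:

$$\{\hat{\rho}_k, \hat{b}_k\} := \arg\max_{\hat{\Theta}_k} v_k(\Theta_k, \hat{\Theta}_k, \hat{\Theta}_{-k}) = \arg\max_{\hat{\rho}_k, \hat{b}_k} \left[ \alpha^{\gamma_k \rho_k b_k} \times \frac{\prod_{j \in A^*, j \neq k} \alpha^{\gamma_j \hat{\rho}_j \hat{b}_j}}{\prod_{j \in A'} \alpha^{\gamma_j \hat{\rho}_j \hat{b}_j}} \, I_{k \in A^*} + I_{k \notin A^*} \right],$$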

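where the sets $A^*$ and $A'$ are defined as follows:

$$A^* := \arg\max_{A} \sum_{j \in A} \gamma_j \hat{\rho}_j \hat{b}_j, \quad \text{s.t. } |A| \le P;$$

$$A' := \arg\max_{A,\ \hat{\rho}_k = 0} \sum_{j \in A, j \neq k} \gamma_j \hat{\rho}_j \hat{b}_j, \quad \text{s.t. } |A|_{k \notin A} \le P.$$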

In this optimization problem, $\alpha$ is a fixed system constant chosen such that $\alpha > 1$, and $I_{\{\cdot\}}$ is the indicator function, whose value is 1 when the condition is true and 0 otherwise.

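$v_k(\Theta_k, \hat{\Theta}_k, \hat{\Theta}_{-k})$ is the utility function of user $k$, which is a function of the true states of the $k$th user, the reported states of the $k$th user and the reported states of all the remaining users. When a user is not scheduled for transmission, its utility equals 1; when a user is scheduled for transmission, its utility equals the first part, that is:

$$v_k(\Theta_k, \hat{\Theta}_k, \hat{\Theta}_{-k}) = \begin{cases} \alpha^{\gamma_k \rho_k b_k} \cdot \dfrac{\prod_{j \in A^*, j \neq k} \alpha^{\gamma_j \hat{\rho}_j \hat{b}_j}}{\prod_{j \in A'} \alpha^{\gamma_j \hat{\rho}_j \hat{b}_j}} & \text{if } k \in A^*, \\ 1 & \text{if } k \notin A^*. \end{cases}$$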

In the above equation, the first part $\alpha^{\gamma_k \rho_k b_k}$ is the gain of user $k$ per unit of subcarrier with throughput state $\rho_k$ and buffer state $b_k$. The second term, $\prod_{j \in A^*, j \neq k} \alpha^{\gamma_j \hat{\rho}_j \hat{b}_j} / \prod_{j \in A'} \alpha^{\gamma_j \hat{\rho}_j \hat{b}_j}$, can be interpreted as the number of units of subcarrier that user $k$ will be allocated, which is a function of the states of the remaining users in the system. In other words, the inverse of the second term can be interpreted as the price that user $k$ has to pay to the system if it is scheduled for transmission. Each user selects $\{\hat{\rho}_k, \hat{b}_k\}$ to report to the central scheduler, aiming to maximize its own utility function.
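The product-form utility is easiest to evaluate through its exponent, $\log_\alpha v_k$, which turns the products into sums. The following Python sketch is an illustration under stated assumptions: ties in the choice of $A^*$ and $A'$ are broken by user index, $A'$ is computed by simply excluding user $k$, and all names and numbers are hypothetical rather than taken from [26].

```python
def top_users(weights, power_limit, exclude=None):
    """Users with the largest reported weights (ties go to the lower index)."""
    candidates = [j for j in range(len(weights)) if j != exclude]
    candidates.sort(key=lambda j: (-weights[j], j))
    return set(candidates[:power_limit])


def log_utility(k, true_state, reports, gammas, power_limit):
    """Exponent of alpha in v_k, i.e. log_alpha(v_k).

    true_state -- (rho_k, b_k), the true states of user k
    reports    -- list of reported (rho_j, b_j) for all users, including k
    """
    reported = [g * rho * b for g, (rho, b) in zip(gammas, reports)]
    a_star = top_users(reported, power_limit)
    if k not in a_star:
        return 0  # v_k = alpha**0 = 1 when user k is not scheduled
    a_prime = top_users(reported, power_limit, exclude=k)
    rho_k, b_k = true_state
    own_gain = gammas[k] * rho_k * b_k
    others_with_k = sum(reported[j] for j in a_star if j != k)
    others_without_k = sum(reported[j] for j in a_prime)
    return own_gain + others_with_k - others_without_k


def utility(k, true_state, reports, gammas, power_limit, alpha=2.0):
    """v_k in product form; alpha > 1 is the system constant."""
    return alpha ** log_utility(k, true_state, reports, gammas, power_limit)


# User 0 considers every possible report while the other reports stay fixed.
gammas = [1, 1, 1, 1]
others = [(2, 2), (1, 5), (4, 1)]
true_state = (3, 2)
truthful = log_utility(0, true_state, [true_state] + others, gammas, 2)
best = max(
    log_utility(0, true_state, [(rho, b)] + others, gammas, 2)
    for rho in range(6)
    for b in range(6)
)
```

In this example no report gives user 0 a larger utility than the truthful one (`best == truthful`), which is the behaviour the mechanism is designed to produce.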

One condition is necessary in order to achieve an efficient allocation scheme with selfish agents [36], [37]: if reporting false state values $\hat{\Theta}_k \neq \Theta_k$ gives user $k$ ($k = 1, 2, \ldots, K$) the same utility as reporting the true values, that is, $v_k(\Theta_k, \hat{\Theta}_k, \hat{\Theta}_{-k}) = v_k(\Theta_k, \Theta_k, \hat{\Theta}_{-k})$ with $\hat{\Theta}_k \neq \Theta_k$, then the user will choose to report the true values. We call this the truth preferred rule. Its interpretation is that when lying about the states brings no benefit, a user prefers telling the truth.

The pricing mechanism proposed above is based on the VCG mechanism, with the conventional summation form of the utility function modified into a product form. Such a pricing mechanism can easily be related to a practical 802.11 system, and its utility function can be interpreted in terms of real physical parameters.

2) Economic Properties of the Pricing Mechanism: The pricing mechanism proposed above maintains the same desirable economic properties as the VCG mechanism. These properties are as follows [38], [39]:

1) The mechanism is incentive-compatible in ex-post Nash equilibrium. The best response strategy of each user is to reveal its true state information, $\hat{\Theta}_k = \Theta_k$, even after it has complete information about the other users $\Theta_{-k}$.

2) The mechanism is individually rational. A selfish agent will join the mechanism rather than choosing not to, because the value of the utility function is non-negative.

3) The mechanism is efficient. Since all the users truthfully reveal their state information, the opportunistic scheduling algorithm carried out by the central scheduler maximizes the system performance.

The detailed proof of these properties is given in [26].

Fig. 3 is a numerical example showing the performance of the pricing mechanism we designed. We simulate a 30-user cognitive radio system in which each user has 5 buffer states and 10 throughput states. The transmission power constraint on the system is $P = 3$, which means that at most 3 users can transmit simultaneously.

In Fig. 3, the x-axis represents the number of iterations and the y-axis represents the mean squared error (MSE) of the reported buffer states and throughput states. $n$ denotes the iteration index. Defining $\hat{\rho}^n = \{\hat{\rho}^n_1, \ldots, \hat{\rho}^n_K\}$ and $\hat{b}^n = \{\hat{b}^n_1, \ldots, \hat{b}^n_K\}$, the MSE of the reported buffer states and throughput states can be written as:

$$\mathrm{MSE}(\hat{\rho}^n) = \frac{1}{K} \times \sum_{k=1}^{K} (\hat{\rho}^n_k - \rho^n_k)^2; \tag{21}$$

$$\mathrm{MSE}(\hat{b}^n) = \frac{1}{K} \times \sum_{k=1}^{K} (\hat{b}^n_k - b^n_k)^2. \tag{22}$$

In the figure, the solid curve and the dashed curve show the MSE of the reported buffer states and throughput states, respectively. We can see from the figure that the MSE converges to 0 after 11 iterations, at which point all the users in the system report truthfully and $\hat{\Theta} = \Theta$.

Figure 3. The MSE of the reported buffer states and throughput states after applying the pricing mechanism. The result is for a 30-user system with $L = 5$, $Q_\rho = 10$ and $P = 3$. (Plot title: Result with Mechanism Learning Algorithm; x-axis: Number of Iterations; y-axis: MSE; curves: Buffer State, Throughput State.)

V. CORRELATED EQUILIBRIUM IN STOCHASTIC DYNAMICAL GAME

This section considers the user scheduling problem in a cognitive radio network, with the goal of obtaining the correlated equilibrium policy in a general-sum stochastic dynamical game setting. The correlated equilibrium policy of each user takes into account its transmission rate and transmission delay [29]. The dynamics of the channel quality and buffer state are formulated as a Markov chain, and the correlated equilibrium of such a stochastic game is defined based on the Q-functions. The existing non-regret learning algorithm [40], [41] can be readily applied to obtain such correlated equilibrium policies. Each user learns its correlated equilibrium policy independently of the other users, which enables the decentralized feature of the system. We propose two algorithms, namely the iterative correlated equilibrium algorithm (Algorithm 3) and the correlated Q-learning algorithm (Algorithm 4). Compared to Algorithm 3, Algorithm 4 combines Q-learning with non-regret learning, which offers a decentralized feature and is more implementable. Distributed learning is an important feature of a cognitive radio system, since it does not require central control and each user is able to learn the correlated equilibrium policy independently. Both algorithms proposed in this section are novel and have not been applied in cognitive radio networks before.

A. System Model and Transmission Control Utility Function

Different from the system model used in Section III, the system model used here allows more than one user to access the spectrum in each time slot.

As before, we assume there are $K$ users in the system and $t$ is the time slot index. $h_k$ denotes the channel quality of user $k$ and the composition of the channel states is $\mathbf{h} = \{h_1, \ldots, h_K\}$. The channel state over time is modeled as a Markov chain with transition probability $\mathbb{P}(\mathbf{h}^{t+1} | \mathbf{h}^t)$. The buffer state of user $k$ is denoted as $b_k$ and $\mathbf{b} = \{b_1, \ldots, b_K\}$. The incoming traffic of user $k$ is $f_k$ and $\mathbf{f} = \{f_1, \ldots, f_K\}$. The system state of user $k$ is $s_k = \{h_k, b_k\}$ and the composition of the states of all the users is denoted as $\mathbf{s} = \{s_1, \ldots, s_K\}$.

The system is designed to perform effective user scheduling. In each time slot, each user $k$ ($k \in \mathcal{K}$) chooses an optimal action $a_k$ from the action set $\mathcal{A}_k = \{0, 1\}$, where 0 represents no transmission and 1 represents transmission. The joint action of all the users is denoted as $\mathbf{a} = \{a_1, \ldots, a_K\}$, which is an element of the joint action space, $\mathbf{a} \in \mathcal{A}$. Using standard game theoretic notation, we can write $\mathbf{a} = \{a_k, \mathbf{a}_{-k}\}$, with $\mathbf{a}_{-k}$ standing for the joint action of the other users excluding user $k$.

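The transition probability between the current composite state $\mathbf{s} = [\mathbf{h}, \mathbf{b}]$ and the next state $\mathbf{s}' = [\mathbf{h}', \mathbf{b}']$ is a function of $\mathbf{a}$, which can be expressed as:

$$\mathbb{P}(\mathbf{s}' | \mathbf{s}, \mathbf{a}) = \prod_{k=1}^{K} \mathbb{P}(h'_k | h_k) \cdot \prod_{k=1}^{K} \mathbb{P}(b'_k | b_k, a_k), \tag{23}$$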

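where the expression of $\mathbb{P}(b'_k | b_k, a_k)$ is given in (11).

The utility function of user $k$ in state $\mathbf{s}$ given action $\mathbf{a}$ is denoted as $u_k(\mathbf{s}, \mathbf{a})$, and it is specified as:

$$u_k(\mathbf{s}, \mathbf{a}) = \gamma_k \left[ \log\left(1 + \frac{|h_k|^2 a_k}{\sigma_k^2(n) + \sum_{j=1, j \neq k}^{K} |\tau_{k,j} h_j|^2 a_j}\right) - \beta_k \frac{b_k}{\frac{1}{K} \sum_{j=1}^{K} b_j} \right],$$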

where $\gamma_k$ is the QoS parameter and $\gamma_k \in \{\gamma_p, \gamma_s\}$. If user $k$ is a primary user, $\gamma_k = \gamma_p \ge 0$, while if user $k$ is a secondary user, $\gamma_k = \gamma_s \ge 0$. We set $\gamma_p \gg \gamma_s$ so that primary users have priority over secondary users. The first half of the utility function, $\log(\ldots)$, represents the maximum achievable transmission rate of user $k$ and is similar to that in (3), while the second half accounts for the transmission delay; $\beta_k \ge 0$ denotes the weighting parameter between the transmission rate and the delay. $\tau_{k,j}$ is the channel coefficient determined by the location of the users, and $\tau_{k,j} \cdot h_j$ measures the channel quality between the $j$th transmitter and the $k$th receiver.
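The utility function above can be transcribed as a short Python sketch. The natural logarithm, the zero-based user indices and the assumption that at least one buffer is non-empty (so that the average buffer occupancy is nonzero) are choices made for this example only; the argument names are hypothetical.

```python
import math


def transmission_utility(k, h, b, a, tau, noise_var, gamma_k, beta_k):
    """Utility u_k(s, a) of user k.

    h, b, a   -- channel states, buffer states and actions of all K users
    tau       -- tau[k][j], channel coefficient from transmitter j to receiver k
    noise_var -- noise variance sigma_k^2 at receiver k
    Assumes sum(b) > 0 and uses the natural logarithm.
    """
    K = len(h)
    interference = sum(
        abs(tau[k][j] * h[j]) ** 2 * a[j] for j in range(K) if j != k
    )
    sinr = abs(h[k]) ** 2 * a[k] / (noise_var + interference)
    mean_buffer = sum(b) / K
    delay_cost = beta_k * b[k] / mean_buffer
    return gamma_k * (math.log(1.0 + sinr) - delay_cost)


h = [1.0, 1.0]
b = [2, 2]
tau = [[0.0, 0.5], [0.5, 0.0]]
alone = transmission_utility(0, h, b, [1, 0], tau, 1.0, 1.0, 0.1)
shared = transmission_utility(0, h, b, [1, 1], tau, 1.0, 1.0, 0.1)
silent = transmission_utility(0, h, b, [0, 1], tau, 1.0, 1.0, 0.1)
```

Here `alone > shared > silent`: interference from the other transmitter lowers the rate term, and a silent user only incurs the delay cost.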

B. Correlated Equilibrium in a Markovian Dynamic Game

1) A Review of Correlated Equilibrium in Static Games: To motivate the correlated equilibrium for a dynamic game, we first review the correlated equilibrium for a static game. In a $K$-user static game, each user $k \in \mathcal{K}$ aims to devise a rule for selecting an action $a_k$ from its action set $\mathcal{A}_k$ so as to maximize the expected value of its utility function $u_k(\{a_1, a_2, \ldots, a_K\})$. Since each user can only adapt its own action, the optimal action policy depends on rational consideration of the policies of the other users. The adopted solution for such a problem is called an equilibrium. Many equilibrium concepts have been developed, the most common being the Nash equilibrium. In this section, we focus on an important generalization of the Nash equilibrium, known as the correlated equilibrium [27], [28], which is defined as follows.

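Definition 1: Define a joint policy $\pi$ to be a probability distribution on the joint action space $\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2 \times \cdots \times \mathcal{A}_K$. Given the actions of the other users $\mathbf{a}_{-k}$, the policy $\pi$ is a correlated equilibrium if, for every $a_k \in \mathcal{A}_k$ ($k = 1, 2, \ldots, K$) such that $\pi(\mathbf{a}_{-k}, a_k) > 0$, and any alternative action $a'_k \in \mathcal{A}_k$, it holds that

$$\sum_{\mathbf{a}_{-k} \in \mathcal{A}_{-k}} \pi(\mathbf{a}_{-k}, a_k)\, u_k(\{\mathbf{a}_{-k}, a_k\}) \ge \sum_{\mathbf{a}_{-k} \in \mathcal{A}_{-k}} \pi(\mathbf{a}_{-k}, a_k)\, u_k(\{\mathbf{a}_{-k}, a'_k\}). \tag{24}$$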

One intuitive interpretation of correlated equilibrium is that $\pi$ provides the $K$ users with a strategy recommendation from a trusted third party. The implicit assumption is that the $K - 1$ other users follow this recommendation, and user $k$ asks itself whether it is in its best interest to follow the recommendation as well. The equilibrium condition states that there is no deviation rule that could award user $k$ a better expected utility. We also note that the set of Nash equilibria is obtained by intersecting the polyhedron described above with the additional constraint $\pi(\mathbf{a}_{-k} | a_k) = \pi(\mathbf{a}_{-k} | a'_k)$ for all $a_k, a'_k \in \mathcal{A}_k$ and $\mathbf{a}_{-k} \in \mathcal{A}_{-k}$. Any Nash equilibrium can be represented as a correlated equilibrium in which the users generate their recommendations independently.

One advantage of using correlated equilibrium is that it permits coordination among users, generally through observation of a common signal, which can lead to improved performance over a Nash equilibrium [28].
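Condition (24) is a finite set of linear inequalities in $\pi$ and can be checked directly. The sketch below does so for a joint distribution stored as a dictionary from joint actions to probabilities. The two-player "chicken" payoffs used to exercise it are a textbook illustration, not an example from this paper, and all names are hypothetical.

```python
def is_correlated_equilibrium(pi, utilities, action_sets, tol=1e-12):
    """Check (24) for every user k, recommended action a_k and deviation a'_k.

    pi          -- dict: joint action (tuple) -> probability
    utilities   -- utilities[k] is a dict: joint action -> utility of user k
    action_sets -- action_sets[k] is the list of actions of user k
    """
    for k, actions in enumerate(action_sets):
        for recommended in actions:
            for alternative in actions:
                gain = 0.0  # expected gain of playing `alternative` instead
                for joint, prob in pi.items():
                    if joint[k] != recommended:
                        continue
                    deviated = joint[:k] + (alternative,) + joint[k + 1:]
                    gain += prob * (utilities[k][deviated] - utilities[k][joint])
                if gain > tol:
                    return False
    return True


# Game of chicken: action 0 = yield, action 1 = dare.
u0 = {(0, 0): 6, (0, 1): 2, (1, 0): 7, (1, 1): 0}
u1 = {(0, 0): 6, (0, 1): 7, (1, 0): 2, (1, 1): 0}
actions = [[0, 1], [0, 1]]

# A public signal that never recommends (dare, dare).
signal = {(0, 0): 1 / 3, (0, 1): 1 / 3, (1, 0): 1 / 3}
# Always recommending (yield, yield) is not stable: each user would dare.
both_yield = {(0, 0): 1.0}

signal_is_ce = is_correlated_equilibrium(signal, [u0, u1], actions)
both_yield_is_ce = is_correlated_equilibrium(both_yield, [u0, u1], actions)
```

The first distribution cannot be written as a product of independent strategies, which illustrates the coordination through a common signal mentioned above.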

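2) The Definition of Correlated Equilibrium in Markovian Games: A stationary policy of the system is a function of the state only, not of time. We use $\pi_{\mathbf{s}}$ to denote the transmission policy vector of all the users in state $\mathbf{s}$. The policies of all the users over all the states are denoted as $\pi$, with $\pi_{\mathbf{s}} \in \pi$. Any $\pi_{\mathbf{s}}$ can be decomposed into marginals $(\pi_{\mathbf{s},k}, \pi_{\mathbf{s},-k})$ for any $k$, where $\pi_{\mathbf{s},k}$ is the marginal probability distribution of the strategy of user $k$, while $\pi_{\mathbf{s},-k}$ is the marginal distribution of all users but $k$. Furthermore, each entry of $\pi_{\mathbf{s}}$ is denoted as $\pi_{\mathbf{s}}(\mathbf{a}_{-k}, a_k)$ ($\forall \mathbf{a}_{-k} \in \mathcal{A}_{-k}$, $\forall a_k \in \mathcal{A}_k$), which represents the joint probability of taking actions $\mathbf{a}_{-k}$ and $a_k$ in state $\mathbf{s}$. The infinite horizon expected total discounted value of user $k$ with initial state $\mathbf{s}_0 = \mathbf{s}$ under transmission policy $\pi$ is:

$$V^{\pi}_k(\mathbf{s}) = E_{\pi}\left[ \sum_{t=0}^{\infty} \beta^t u_k(\mathbf{s}_t, \mathbf{a}_t) \,\Big|\, \mathbf{s}_0 = \mathbf{s} \right], \tag{25}$$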

where $0 < \beta < 1$ is the economic discount factor chosen by the system.

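Based on the above definitions, each user $k \in \mathcal{K}$ updates its value vector from iteration $t - 1$ to $t$ in the following way:

$$V^{t}_k(\mathbf{s}) = \max_{\pi_{\mathbf{s},k}} \sum_{\mathbf{a} \in \mathcal{A}} \pi_{\mathbf{s}}(\mathbf{a}) \left[ u_k(\mathbf{s}, \mathbf{a}) + \gamma \sum_{\mathbf{s}' \in \mathcal{S}} \mathbb{P}(\mathbf{s}' | \mathbf{s}, \mathbf{a})\, V^{t-1}_k(\mathbf{s}') \right].$$

Here $\gamma$ plays the role of the discount factor, written as $\beta$ in (25).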

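The above optimization problem defines how the strategy $\pi_{\mathbf{s},k}$ of each user is updated at each iteration. Denote by $\pi^*$ the solution to the problem when all the users perform the value vector update simultaneously. The value vector of user $k$ converges to $\mathbf{V}^{\pi^*}_k$ under policy $\pi^*$, with $V^{\pi^*}_k(\mathbf{s}) \in \mathbf{V}^{\pi^*}_k$.

We are now ready to define the Q-function. The Q-function $Q^{\pi^*}_k(\mathbf{s}, \mathbf{a})$ of user $k$ is the total discounted reward of taking action $\mathbf{a}$ in state $\mathbf{s}$ and then following $\pi^*$ thereafter, and it is defined as:

$$Q^{\pi^*}_k(\mathbf{s}, \mathbf{a}) = u_k(\mathbf{s}, \mathbf{a}) + \gamma \cdot \sum_{\mathbf{s}' \in \mathcal{S}} \mathbb{P}(\mathbf{s}' | \mathbf{s}, \mathbf{a})\, V^{\pi^*}_k(\mathbf{s}'). \tag{26}$$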

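Therefore, we have the following relationship between the value vector and the Q-function [42]:

$$V^{\pi^*}_k(\mathbf{s}) = \sum_{\mathbf{a} \in \mathcal{A}} \pi^*_{\mathbf{s}}(\mathbf{a}) \cdot Q^{\pi^*}_k(\mathbf{s}, \mathbf{a}). \tag{27}$$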
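The two relations (26) and (27) amount to one backup step for a single user. The sketch below stores states and joint actions as integer indices in nested lists; this layout, the function names and the numbers are assumptions made for illustration.

```python
def q_backup(u, P, V, gamma):
    """Q-function as in (26).

    u[s][a]     -- utility of the user in state s under joint action a
    P[s][a][s2] -- transition probability from s to s2 under joint action a
    V[s2]       -- current value of state s2
    """
    return [
        [
            u[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(len(V)))
            for a in range(len(u[s]))
        ]
        for s in range(len(u))
    ]


def value_from_q(pi, Q):
    """Value vector as in (27): V[s] = sum_a pi[s][a] * Q[s][a]."""
    return [sum(p * q for p, q in zip(pi[s], Q[s])) for s in range(len(Q))]


u = [[1.0, 0.0], [0.0, 2.0]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0]],
]
V_old = [4.0, 8.0]
pi = [[0.5, 0.5], [1.0, 0.0]]

Q = q_backup(u, P, V_old, gamma=0.5)
V_new = value_from_q(pi, Q)
```

Alternating these two steps, with the policy refreshed in between, is the structure of the outer loop of the iterative algorithm described in the next subsection.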

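The correlated equilibrium in a Markovian dynamic game is then defined as follows.

Definition 2: The stationary policy $\pi^*$ is a correlated equilibrium for the Markovian game described above if, $\forall k \in \mathcal{K}$, $\forall \mathbf{s} \in \mathcal{S}$ and $\forall a_k, a'_k \in \mathcal{A}_k$,

$$\sum_{\mathbf{a}_{-k} \in \mathcal{A}_{-k}} \pi^*_{\mathbf{s}}(\mathbf{a}_{-k}, a_k) \cdot Q^{\pi^*}_k(\mathbf{s}, \{\mathbf{a}_{-k}, a_k\}) \ge \sum_{\mathbf{a}_{-k} \in \mathcal{A}_{-k}} \pi^*_{\mathbf{s}}(\mathbf{a}_{-k}, a_k) \cdot Q^{\pi^*}_k(\mathbf{s}, \{\mathbf{a}_{-k}, a'_k\}). \tag{28}$$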

C. Distributed Correlated Equilibrium Algorithm

In this section, we first introduce an iterative correlated equilibrium algorithm, which uses the non-regret learning algorithm to obtain the correlated equilibrium in the Markovian game. Then, we propose a correlated Q-learning algorithm, which combines the non-regret algorithm and Q-learning so as to calculate the correlated equilibrium policy in a distributed way. Both algorithms are novel and have not been applied in cognitive radio networks.

First of all, the existence of a correlated equilibrium in any general-sum Markov game is guaranteed by the following result from [42].

Theorem 3: [42] Every general-sum Markov game has a stationary correlated equilibrium policy.

This result enables the application of the following algorithms.

1) Iterative Correlated Equilibrium Algorithm: The non-regret learning algorithm was first proposed by Hart and Mas-Colell in [40], where the probability that a player departs from its current play is proportional to a measure of regret for not having used other strategies in the past. Under the non-regret learning algorithm, the empirical distribution of play has been proved to converge to the set of correlated equilibria of the game with probability one. An iterative correlated equilibrium algorithm is designed to obtain the correlated equilibrium in the Markovian game system model of this section. With this algorithm, each user in the system does not need to keep the information of the Q-functions of the other users, so the problem is solved in a distributed way to some extent.

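The iterative correlated equilibrium algorithm is summarized in the table for Algorithm 3.

Algorithm 3 Iterative Correlated Equilibrium Algorithm
Outer Loop
Step 1 Initialization: set $n = 0$ and initialize $Q^0_k(\mathbf{s}, \mathbf{a})$ for $\forall k \in \mathcal{K}$, $\forall \mathbf{s} \in \mathcal{S}$ and $\forall \mathbf{a} \in \mathcal{A}$.
Step 2 For $\forall k \in \mathcal{K}$ and $\forall \mathbf{s} \in \mathcal{S}$ do the following:
    Inner Loop: Non-regret Learning
    Initialize the inner loop index $l = 0$, the initial regret matrix $R^0_k$ and the action $a^0_k$.
    Action update: let $i = a^l_k$ and let $a^{l+1}_k = j$ with probability
        $P^{l+1}_k(j) = \max\{R^l_k(i, j), 0\} / \mu$ if $j \neq i$;
        $P^{l+1}_k(j) = 1 - \sum_{m \neq i} \max\{R^l_k(i, m), 0\} / \mu$ if $j = i$.
    Regret matrix update:
        $H^{i,j}_k(\mathbf{a}^{l+1}) = I(a^{l+1}_k = i) \cdot \bigl(Q_k(j, \mathbf{a}^{l+1}_{-k}) - Q_k(i, \mathbf{a}^{l+1}_{-k})\bigr)$;
        $R^{l+1}_k = R^l_k + \frac{1}{l+1} \bigl(H_k(\mathbf{a}^{l+1}) - R^l_k\bigr)$.
    Exit the inner loop when $\mathbf{P}^l_k$ converges and set $\pi^n = \mathbf{P}^{l+1}$; else set $l = l + 1$ and return to Inner Loop.
Step 3 Update value vectors: $V^n_k(\mathbf{s}) = \sum_{\mathbf{a} \in \mathcal{A}} \pi^n_{\mathbf{s}}(\mathbf{a}) \cdot Q^n_k(\mathbf{s}, \mathbf{a})$, for $\forall k \in \mathcal{K}$, $\forall \mathbf{s} \in \mathcal{S}$.
Step 4 Update Q-function values: $Q^{n+1}_k(\mathbf{s}, \mathbf{a}) = u_k(\mathbf{s}, \mathbf{a}) + \gamma \cdot \sum_{\mathbf{s}' \in \mathcal{S}} \mathbb{P}(\mathbf{s}' | \mathbf{s}, \mathbf{a})\, V^n_k(\mathbf{s}')$ for $\forall k \in \mathcal{K}$, $\forall \mathbf{s} \in \mathcal{S}$ and $\forall \mathbf{a} \in \mathcal{A}$.
The iteration terminates when the values of the parameters $\pi^n$ converge; else set $n = n + 1$ and return to Step 2.

In the algorithm, we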

first initialize the outer loop iteration index $n = 0$ and the initial Q-function values in Step 1. Then, we run the non-regret learning algorithm for each $k$ and $\mathbf{s}$ to learn the new policy. Based on the new policy, we update the value vectors $V^n_k$ and the Q-function values accordingly.

The inner loop of the algorithm is non-regret learning, where $l$ denotes the inner loop iteration index. $R^l_k$ denotes the regret matrix of user $k$ at the $l$th iteration, and each entry $R^l_k(i, j)$ indicates the regret value from action $i$ to action $j$. $\mathbf{P}^l$ denotes the policy of all the users at the $l$th iteration, while $\mathbf{P}^l_k$ denotes that of user $k$, with $\mathbf{P}^l = \{\mathbf{P}^l_1, \ldots, \mathbf{P}^l_K\}$. Each entry $P^l_k(j)$ of $\mathbf{P}^l_k$ represents the probability of taking action $j$ at iteration $l$. The new policy of each user is to pick an action according to the action probability $\mathbf{P}^l$. The constant $\mu$ is a normalization factor chosen such that $\mu > \sum_{m \neq i} \max\{R^l_k(i, m), 0\}$, ensuring that the update of $P^{l+1}_k$ yields valid probabilities. It can be viewed as an inertia parameter, i.e., a higher $\mu$ yields a lower probability of switching. The term $I(f(x))$ is the indicator function, which equals 1 when $f(x)$ is true and 0 otherwise.

The non-regret learning algorithm applied here is based on [41], where its convergence is proved. We observe that in Algorithm 3, each user does not need to know the utility functions and policies of the other users, which makes the algorithm distributed. Note that in the inner loop the users play a repeated game and the process can be implemented online, whereas the outer loop relies on knowledge of the transition probability matrix and is hence an offline update. This motivates us to propose the next algorithm, which lifts the requirement of the transition matrix and is implementable online.
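The inner loop of Algorithm 3 can be sketched in Python as follows for one fixed state. The sketch is restricted to two users with the same number of actions, uses a fixed number of iterations instead of a convergence test, and takes the Q-values of the fixed state as a small table; these simplifications, the names and the sample payoffs are assumptions of this example.

```python
import random


def switch_probabilities(regret_row, current, mu):
    """Action-update step: probability of each next action given action `current`."""
    probs = [max(r, 0.0) / mu for r in regret_row]
    probs[current] = 0.0
    probs[current] = 1.0 - sum(probs)  # remaining mass stays on the current action
    return probs


def update_regret(R, l, played, q_row):
    """Regret update R^{l+1} = R^l + (H - R^l) / (l + 1).

    q_row[j] is the Q-value of playing j against the others' latest actions;
    H[i][j] = q_row[j] - q_row[i] if i was played, and 0 otherwise.
    """
    n = len(R)
    new_R = []
    for i in range(n):
        row = []
        for j in range(n):
            h = q_row[j] - q_row[i] if played == i else 0.0
            row.append(R[i][j] + (h - R[i][j]) / (l + 1))
        new_R.append(row)
    return new_R


def run_inner_loop(q_tables, mu, iterations, seed=0):
    """Two-user inner loop; q_tables[k][own_action][other_action]."""
    rng = random.Random(seed)
    n = len(q_tables[0])
    actions = [0, 0]
    regrets = [[[0.0] * n for _ in range(n)] for _ in range(2)]
    for l in range(iterations):
        probs = [
            switch_probabilities(regrets[k][actions[k]], actions[k], mu)
            for k in range(2)
        ]
        actions = [rng.choices(range(n), weights=probs[k])[0] for k in range(2)]
        for k in range(2):
            other = actions[1 - k]
            q_row = [q_tables[k][j][other] for j in range(n)]
            regrets[k] = update_regret(regrets[k], l, actions[k], q_row)
    return regrets, actions


# Symmetric example table (chicken game): action 0 = yield, action 1 = dare.
table = [[6.0, 2.0], [7.0, 0.0]]
final_regrets, final_actions = run_inner_loop([table, table], mu=8.0, iterations=200)
```

Each user only reads its own Q-table and the actions that were actually played, which is the sense in which the inner loop is distributed. The value `mu=8.0` exceeds the largest possible regret in this table, so the probabilities stay valid.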

2) Correlated Q-learning Algorithm: Q-learning is a reinforcement learning technique that learns the Q-function values without knowledge of the system model. In this section, we propose a correlated Q-learning algorithm to calculate the correlated equilibrium policy for the Markovian game. Compared to Algorithm 3, this algorithm does not require the system state transition probability matrix, which makes it more practical and implementable.

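Algorithm 4 Correlated Q-learning Algorithm
Step 1 Initialize $n = 0$ and $Q^0_k(\mathbf{s}, \mathbf{a})$ for $\forall k \in \mathcal{K}$, $\forall \mathbf{s} \in \mathcal{S}$ and $\forall \mathbf{a} \in \mathcal{A}$.
Step 2 Update $\pi^n$ by non-regret learning.
Step 3 Update the Q-function values: for $\forall \mathbf{s} \in \mathcal{S}$ and $\forall \mathbf{a} \in \mathcal{A}$:
    $Q^{n+1}_k(\mathbf{s}, \mathbf{a}) = (1 - \alpha_n) \cdot Q^n_k(\mathbf{s}, \mathbf{a}) + \alpha_n \bigl[u_k(\mathbf{s}, \mathbf{a}) + \gamma\, \mathbb{P}(\mathbf{s}' | \mathbf{s}, \mathbf{a}) \sum_{\mathbf{a}' \in \mathcal{A}} \pi^n_k(\mathbf{s}', \mathbf{a}') \cdot Q^n_k(\mathbf{s}', \mathbf{a}')\bigr]$
The algorithm terminates when the values of the parameters $\pi^n$ converge; else set $n = n + 1$ and return to Step 2.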

The correlated Q-learning algorithm (Algorithm 4) can be summarized as follows. The system initializes the iteration index $n$ and the initial Q-function values in Step 1. Step 2 calculates the correlated equilibrium policy $\pi^n$ for the given Q-function values by using non-regret learning, similar to that in Algorithm 3. The system runs the non-regret learning offline. Based on the updated transmission policies, each user updates its Q-function values by using Q-learning. The parameter $\alpha_n$ is the learning rate sequence, chosen such that $\sum_{n=1}^{\infty} \alpha_n = \infty$ and $\sum_{n=1}^{\infty} \alpha_n^2 < \infty$. The algorithm terminates when the Q-function converges. The optimal correlated equilibrium and the value function can thus be obtained from the final Q-function.
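A single Q-function update can be sketched as below. Note that this sketch uses the usual sample-based form of Q-learning, in which an observed next state $\mathbf{s}'$ stands in for the transition term of Step 3; that substitution, the learning rate $\alpha_n = 1/(n+1)$ and all names are assumptions of this example rather than the paper's specification.

```python
def learning_rate(n):
    """alpha_n = 1 / (n + 1) for n = 0, 1, 2, ...

    The sum of this sequence diverges while the sum of its squares
    converges, as required of the learning rate sequence.
    """
    return 1.0 / (n + 1)


def expected_next_value(policy_next, q_next):
    """sum_{a'} pi(s', a') * Q(s', a') for the observed next state s'.

    policy_next -- dict: joint action -> probability in state s'
    q_next      -- dict: joint action -> Q-value in state s'
    """
    return sum(prob * q_next[action] for action, prob in policy_next.items())


def q_update(q_sa, reward, next_value, alpha, gamma):
    """Blend the old estimate with the one-step target."""
    return (1.0 - alpha) * q_sa + alpha * (reward + gamma * next_value)


policy_next = {"a": 0.5, "b": 0.5}
q_next = {"a": 4.0, "b": 2.0}
next_value = expected_next_value(policy_next, q_next)
updated = q_update(q_sa=1.0, reward=2.0, next_value=next_value, alpha=0.5, gamma=0.5)
```

Because only the observed reward and next state enter the update, no transition probability matrix has to be stored, which is the property that makes the correlated Q-learning algorithm implementable online.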

In both Algorithm 3 and Algorithm 4, each inner loop iteration requires knowledge of the Q-function calculated in the outer loop. Based on this information, each user performs one table lookup in its Q-function to calculate the Q-utility given the current readings of the states and the actions $\mathbf{a}^l$. Two additions and two multiplications update the regret value, and one random number, one multiplication and one comparison suffice to calculate the next action. In the outer loop, an update is done for each state $\mathbf{s}$ and action profile $\mathbf{a}$. The computational complexity of the inner loop is small and hence suitable for implementation. The complexity of the outer loop, however, depends on the number of states and the number of action profiles, and the size of the joint action set is exponential in the number of users. It is nevertheless possible to regard the actions of the other users as one virtual action so as to reduce the dimension of the set of action profiles. This dimension reduction technique enables a dramatic reduction in computational complexity, making the update of the Q-functions in the algorithms efficient.

VI. OPEN ISSUES AND CONCLUSIONS

This paper adopts a game theoretic approach to solve various problems in cognitive radio systems, such as power allocation, rate adaptation and access control. The main game theoretic concepts used include Nash equilibria in static games and switching control games, correlated equilibria in stochastic Markovian games, and mechanism design.

Despite the recent popularity of game theory in wireless communications and cognitive radio networks, the potential for further research is vast. Here are some open issues related to this paper:

1) The dimension of the system state in a stochastic game is exponential in the number of users of the system. Thus, the convergence speed of Algorithms 2, 3 and 4 suffers from the “curse of dimensionality” when the system has a large number of users. How to efficiently reduce the state space remains an open issue.

2) Section III only focuses on a special type of game, namely the switching control Markovian game. The study of more general types of Markovian games is still a very interesting area. Examples include proving the existence of a Nash equilibrium in a general Markovian game under certain conditions, or proposing efficient algorithms to obtain a Nash equilibrium policy in a general Markovian game.

3) The pricing mechanism used in Section IV is based on the well-known VCG mechanism. The development of other pricing mechanisms or reputation-based mechanisms is one direction for future work.

4) More analytical results on the correlated equilibrium in a general-sum stochastic game (e.g., structural results on the correlated equilibrium policy) can be explored.

REFERENCES

[1] J. V. Neumann and O. Morgenstern,Theory of Games andEconomic Behavior. Princeton University Press, 1944.

[2] R. Gibbons, Game Theory for Applied Economists.Princeton University Press, 1992.

[3] S. Hart and A. Neyman,Game and Economic Theory.Ann Arbor: The University of Michigan Press, 1995.

[4] A. B. MacKenzie and L. A. DaSilva,Game Theory forWireless Engineers. San Rafael, California: Morgan andClaypool Publishers, 2006.

[5] S. G. Glisic, Advanced Wireless Communications: 4GCognitive and Cooperative Broadband Technology. Wiley,2007.

[6] F. C. Commussion,Spectrum Policy Task Force. Rep. ETDocket no. 02-135, November, 2007.

[7] J. Mitola III, “Cognitive radio for flexible mobile multi-media communications,” inProceedings of IEEE Interna-tional Workshop on Mobile Multimedia Communications,1999, pp. 3–10.

[8] S. Haykin, “Cognitive radio: Brain-empowered wirelesscommunications,” IEEE Journal on Selected Areas inCommunications, vol. 23, no. 2, pp. 201–220, February2005.

[9] V. K. Bhargava and E. Hossain,Cognitive Wireless Com-munication Networks. New York: Springer-Verlag, 2007.

[10] J. Mitola III and G. Q. Maguire Jr., “Cognitive radio:Making software radios more personal,”IEEE PersonalCommunications, vol. 6, no. 4, pp. 13–18, August 1999.

[11] W. Yu, G. Ginis, and J. M. Cioffi, “Distributed multiuserpower control for digital subscriber lines,”IEEE Journalon Selected Areas in Communications, vol. 20, no. 5, pp.1105–1115, June 2002.

[12] G. Scutari, D. Palomar, and S. Barbarossa, “Asynchronousiterative water-filling for Gaussian frequency-selective in-terference channels,”IEEE Transactions on InformationTheory, vol. 54, no. 7, pp. 2868–2878, July 2008.

[13] S. R. Mohan and T. E. S. Raghavan, “An algorithmfor discounted switching control stochastic games ,”ORSpektrum, vol. 55, no. 10, pp. 5069–5083, October 2007.

[14] J. A. Filar and K. Vrieze,Competitive Markov DecisionProcesses. New York: Springer-Verlag, 1997.

[15] O. J. Vrieze, S. H. Tijs, T. E. S. Raghavan, and J. A. Filar,“A finite algorithm for the switching control stochasticgame,”OR Spektrum, vol. 5, no. 1, pp. 15–24, March 1983.

[16] J. W. Huang and V. Krishnamurthy, “Transmission controlin cognitive radio systems with latency constraints as aswitching control dynamic game,” inProceedings of IEEECDC, December 2008, pp. 3823–3828.

[17] C. Peng, H. Zheng, and B. Y. Zhao, “Utilization andfairness in spectrum assignment for opportunistic spectrumaccess,”Mobile Networks and Applications, vol. 11, no. 4,pp. 555–576, August 2006.

[18] W. Wang, X. Liu, and H. Xiao, “Exploring opportunis-tic spectrum availability in wireless communication net-works,” in Proceedings of IEEE VTC, September 2005.

[19] Y. Chen, Q. Zhao, and A. Swami, “Joint design andseperation principle for opportunistic spectrum access,” inProceedings of IEEE ACSSC, October/November 2006, pp.696–700.

[20] Q. Zhao, L. Tong, A. Swami, and Y. Chen, “Decentralizedcognitive MAC for opportunistic spectrum access in adhoc networks: A POMDP framework,”IEEE Journal onSelected Areas in Communications, vol. 25, no. 3, pp. 589–600, April 2007.

[21] R. Urgaonkar and M. J. Neely, “Opportunistic schedulingwith reliability guarantees in cognitive radio networks,” inProceedings of IEEE INFOCOM, April 2008, pp. 1301–1309.

[22] A. MasColell, M. Whinston, and J. R. Green,Microeco-nomic Theory. Oxford: Oxford University Press, 1995.

[23] W. Vickrey, “Counterspeculation auctions and competitivesealed tenders,”Journal of Finance, vol. 16, no. 1, pp.8–37, 1961.

[24] E. H. Clarke, “Multipart pricing of public goods,”PublicChoice, vol. 2, pp. 19–33, 1971.

[25] T. Groves, “Incentives in team,”Econometrica, vol. 41,no. 4, pp. 617–631, 1973.

[26] J. W. Huang and V. Krishnamurthy, “Truth revealing op-portunistic scheduling in cognitive radio systems,” inThe10th IEEE International Workshop in Signal ProcessingAdvances in Wireless Communications, June 2009.

[27] R. J. Aumann, “Subjectivity and correlation in randomizedstrategies,”Journal of Mathematical Economics, vol. 1, pp.67–96, March 1974.

[28] ——, “Correlated equilibrium as an expression ofBayesian rationality,”Econometrica, vol. 55, no. 1, pp.1–18, 1987.

[29] J. W. Huang, Q. Zhu, V. Krishnamurthy, and T. Basar,“Distributed Correlated Q-learning for Dynamic Trans-mission Control of Sensor NetworksSubmitted to IEEEGlobecom,” 2009.

[30] D. P. Bertsekas and J. N. Tsitsiklis,Parallel and Distrib-uted Computation: Numerical Methods. Belmont, MA:Athena Scientific, 1989.

[31] IEEE Standard for Local and Metropolitan Area Networks, Part 16: Air Interface for Fixed Broadband Wireless Access Systems, IEEE Std 802.16-2004, 2004.



[32] A. Farrokh and V. Krishnamurthy, “Opportunistic scheduling for streaming users in HSDPA multimedia systems,” IEEE Transactions on Multimedia Systems, vol. 8, no. 4, pp. 844–855, August 2006.

[33] D. Djonin and V. Krishnamurthy, “MIMO transmission control in fading channels - a constrained Markov decision process formulation with monotone randomized policies,” IEEE Transactions on Signal Processing, vol. 55, no. 10, pp. 5069–5083, October 2007.

[34] J. C. Spall, Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control. Wiley-Interscience, 2003.

[35] V. Krishnamurthy and G. G. Yin, “Recursive algorithms for estimation of hidden Markov models and autoregressive models with Markov regime,” IEEE Transactions on Information Theory, vol. 48, no. 2, pp. 458–476, February 2002.

[36] J. A. Mirrlees, “An exploration in the theory of optimum income taxation,” Review of Economic Studies, vol. 38, pp. 175–208, 1971.

[37] P. Dasgupta and E. Maskin, “Efficient auctions,” Quarterly Journal of Economics, vol. 115, pp. 341–388, 2000.

[38] R. K. Dash, D. C. Parkes, and N. R. Jennings, “Computational mechanism design: A call to arms,” IEEE Intelligent Systems, vol. 18, no. 6, pp. 40–47, November/December 2003.

[39] R. K. Dash, A. Rogers, N. R. Jennings, S. Reece, and S. Roberts, “Constrained bandwidth allocation in multi-sensor information fusion: a mechanism design approach,” in Proceedings of IEEE Information Fusion Conference, July 2005, pp. 1185–1192.

[40] S. Hart and A. Mas-Colell, “A simple adaptive procedure leading to correlated equilibrium,” Econometrica, vol. 68, no. 5, pp. 1127–1150, September 2000.

[41] V. Krishnamurthy, M. Maskery, and G. Yin, “Decentralized adaptive filtering algorithm for sensor activation in an unattended ground sensor network,” IEEE Transactions on Signal Processing, vol. 56, no. 12, pp. 6086–6101, December 2008.

[42] A. Greenwald and M. Zinkevich, “A direct proof of the existence of correlated equilibrium policies in general-sum Markov games,” Brown CS: Tech Report CS-05-07, July 2005.

Jane Wei Huang (S’05) received her bachelor’s degree from Zhejiang University, China, in 2005, and her M.Phil. from the Hong Kong University of Science and Technology, Hong Kong, in 2007. She is currently a Ph.D. student at the University of British Columbia.

Her research interests include game theory, Markov decision processes, cognitive radio, and sensor networks.

Vikram Krishnamurthy (S’90-M’91-SM’99-F’05) was born in 1966. He received his bachelor’s degree from the University of Auckland, New Zealand, in 1988, and his Ph.D. from the Australian National University, Canberra, in 1992. He is currently a professor and holds the Canada Research Chair at the Department of Electrical Engineering, University of British Columbia, Vancouver, Canada. Prior to 2002, he was a chaired professor at the Department of Electrical and Electronic Engineering, University of Melbourne, Australia, where he also served as deputy head of department. His current research interests include computational game theory, stochastic dynamical systems for modeling of biological ion channels, and stochastic optimization and scheduling.

Dr. Krishnamurthy has served as associate editor for several journals including IEEE Transactions on Automatic Control, IEEE Transactions on Signal Processing, IEEE Transactions on Aerospace and Electronic Systems, IEEE Transactions on Nanobioscience, and Systems and Control Letters. From 2009 to 2010 he serves as distinguished lecturer for the IEEE Signal Processing Society. From 2010, he serves as editor-in-chief of IEEE Journal of Selected Topics in Signal Processing.
