Optimal Energy-Delay Tradeoff for Opportunistic Spectrum Access in Cognitive Radio Networks

Oussama Habachi, Yezekael Hayel and Rachid El-azouzi

CERI/LIA, University of Avignon, France

    Abstract

Cognitive radio (CR) has been considered a promising technology to enhance spectrum efficiency via opportunistic transmission at the link level. Basic CR features allow secondary users to transmit only when the licensed channel is not occupied by primary users. However, waiting for an idle time slot may incur large packet delays and high energy consumption. Thus, we consider an Opportunistic Spectrum Access (OSA) mechanism that takes into account packet delay and energy consumption. We formulate the OSA problem as a Partially Observable Markov Decision Process (POMDP) by explicitly considering the energy constraint as well as the delay constraint, which are often ignored in existing OSA solutions. Specifically, we consider a POMDP with an average reward criterion. We further consider that the secondary user may decide, at any moment, to use a dedicated means of communication (3G) in order to transmit its packets. We derive structural properties of the value function and we show the existence of optimal strategies in the class of threshold strategies. For implementation purposes, we propose online learning mechanisms that estimate the statistics of the primary user activity. Numerical illustrations validate our theoretical findings: the optimal policy is shown to have a threshold structure. We also present numerical illustrations of the convergence of the proposed algorithms for estimating the primary user activity.

    Index Terms

    POMDP, Cognitive Radio Networks, QoS.

    I. INTRODUCTION

Access to spectrum frequencies is defined by licenses assigned to primary users. The latter must conform to the specifications described in the license (e.g. location of the base station, frequency and

    January 16, 2012 DRAFT

the maximum transmission power). Nonetheless, a recent study by the Federal Communications

    Commission (FCC) has proved that some frequency bands are not sufficiently used by licensed users at

    a particular time and in a specific location [1].

    Cognitive radio, which is a new paradigm for designing wireless communication systems, has appeared

in order to enhance the utilization of the radio frequency spectrum. Cognitive radio has been considered the key technology that enables secondary users to access the licensed spectrum. A cognitive user, as defined in [2], is a mobile device that has the ability to adapt its transmission parameters (e.g. frequency and modulation) to the wireless environment, and to support different communication standards (e.g. GSM,

    CDMA, WiMAX and WiFi). Moreover, when there is no opportunity to transmit over the licensed

    channels, the secondary users may have the possibility to transmit on dedicated channels, generally,

    with a higher cost and/or a lower throughput than transmitting over licensed channels. The possibility of

having dedicated channels reserved for secondary mobiles has been proposed in [3], [4] and [5]. Those CR

    architectures are described in [6] where the authors also present the network components, the spectrum

and network heterogeneity, and the spectrum management framework. We focus, in this paper, on a CR network where a secondary user communicates with other secondary users through an ad-hoc connection

    using a spectrum hole of a licensed frequency (see Figure 1). A secondary user can be considered as a

pair of transmitter-receiver nodes. We assume that there are no interactions with other secondary users. This model is also suited to the scenario depicted in Figure 2, where the secondary user is a cognitive radio base station which is able to sense the activity of a primary base station and then exploit spectrum holes for transmitting on the downlink. Our main contribution is to consider, in this cognitive

    radio setting, an optimal opportunistic spectrum access (OSA) mechanism that takes into account energy

and delay constraints. Many works have studied optimal sensing and access policies in cognitive radio networks (see [7], [8] and [9]), focusing on either spectrum sensing or dynamic spectrum sharing. In [10], the authors focused on an OSA problem with an energy constraint.

    The authors have formulated their problem as a POMDP and derived some properties of the optimal

    sensing control policies. Their control parameter is the duration of sensing used by a secondary user at

    each time slot for determining the primary user activity. They provided heuristic control policies based

on grid-based approximation, myopic policies and static policies, which have low complexity but give suboptimal control policies. Finally, they compare their heuristic methods with optimal solutions obtained

    using a POMDP solver. Authors of [11] incorporate the energy constraint in the design of the optimal

policy of sensing and access in cognitive radio networks. They also formulate the problem as a POMDP, but with a finite horizon, and establish a threshold structure of the optimal policy for the single channel


model. However, they did not provide an analytical expression of the optimal control policy. It is noteworthy that the impact of the energy constraint, or of the capacity of cognitive radio to support additional Quality-of-Service (QoS) requirements such as the expected delay, has been somewhat ignored in the literature. In fact, it is very important for today's multimedia applications on wireless networks to provide reliable communication while sustaining a certain level of QoS. Moreover, taking into account the delay constraint as well as the energy constraint significantly complicates the optimization problem. Without considering the delay

    constraint, the secondary user achieves the best tradeoff between trying to access the licensed channel

and sleeping to conserve energy. The design of such a tradeoff involves several conflicting objectives: gaining immediate access, gaining spectrum occupancy information, conserving energy and minimizing packet delay. The goal of our paper is thus to study this energy-QoS tradeoff in order to determine an

    optimal OSA mechanism for secondary users in a cognitive radio network. The major contributions of

    our work are:

The problem is formulated as an infinite horizon POMDP with the average reward criterion. The average criterion is better suited than the discounted or total reward criteria, since the secondary user takes decisions frequently.

In order to gain insight into the energy-delay constrained OSA problem, we derive structural properties of the value function. We are able to show that the value function is increasing with the belief and decreasing with the packet delay. These structural results not only give us the fundamental design thresholds but also reduce the computational complexity of seeking the optimal policies.

We show that the secondary user can maximize its average reward by adopting a simple threshold policy, and we derive closed-form expressions for these thresholds.

    Since the secondary user may use a dedicated channel for its packets, the optimal threshold policy

    guarantees a bounded delay.

    The organization of the paper is as follows. In the next section, we describe the primary and the

    secondary user models. Section III presents our Markov decision process framework. In Section IV, we

study the existence of an optimal threshold policy for our opportunistic spectrum access with an energy-QoS tradeoff. We propose two learning-based protocols for the estimation of the state transition rates in Section V. Before concluding the paper and giving some perspectives, we present some numerical illustrations in Section VI.

    II. COGNITIVE RADIO NETWORK MODEL

    We consider a wireless system with N independent channels licensed to primary users. The state of

each channel n ∈ {1, . . . , N} is modeled by a time-homogeneous discrete Markov process sn(t). The


state space is {0, 1}, where sn(t) = 0 means that the channel n is free for secondary access and sn(t) = 1 means that the channel n is occupied by a primary user. The transition probabilities of the channel n are given by the following matrix:

Pn = ( αn   1 − αn
       βn   1 − βn ),

where αn = Pr(sn(t+1) = 0 | sn(t) = 0) and βn = Pr(sn(t+1) = 0 | sn(t) = 1). The transition rates evolve as illustrated in Figure 3.

The global system state, composed of the N channels, is denoted by the vector s(t) = [s1(t), ..., sN(t)] and the global state space is S = {0, 1}^N. The transition probabilities can be determined by the statistics of the primary network traffic and are assumed to be known by secondary users. We present in Section V some methods allowing the secondary user to estimate these transition probabilities on the fly.

We consider a secondary user having the possibility to access any one of the N licensed channels. The objective of the secondary user is to detect the channels that are free during a given time slot. However, waiting for an idle time slot may incur large packet delays and high energy consumption due to sensing. To overcome this, we consider an OSA mechanism that takes into account packet delay, throughput and energy consumption. Since today's wireless networks are highly heterogeneous, with mobile devices

    consisting of multiple wireless network interfaces, we assume that at any time, the secondary user has

    access to the network through another technology like 3G. This is typically the case with the 802.22

standard, in which secondary users transmit over the TV bands [12]. The secondary user prefers to transmit its packet on a licensed channel because it is cheaper than a dedicated communication, while the dedicated channel guarantees perfect access.

The goal of each secondary user is to minimize the expected delay of its packets, accounting for energy, throughput and monetary costs. In order to achieve such a goal, a secondary user has to choose, at each time slot, one of the following actions:

to remain inactive during the slot,

to sense a primary channel and to transmit if the channel is available during the time slot, else to wait for the next time slot,

or to sense a primary channel and to transmit if the channel is available during the time slot, else to use the dedicated channel.

An important contribution of ours is to consider the average transmission delay of a packet in the optimal decision. Indeed, sensing a primary channel has a cost for the secondary user. We look for an optimal sensing policy which depends on the history of observations and actions.


III. PARTIAL OBSERVATION MARKOV DECISION PROCESS FRAMEWORK

    Due to partial spectrum sensing, the global system state s(t) cannot be directly observed by a secondary

user. To overcome this difficulty, the secondary user infers the global system state based on observations, which can be summarized in a belief vector Ω(t) = {ω1(t), ..., ω_{2^N}(t)}, where ωj(t) is the conditional probability (given the observation and decision history) that the system state is s(t) = j in slot t. Since the N channels are independent, it has been proved in [13] that we can consider the following simpler belief vector:

λ(t) = [λ1(t), ..., λN(t)],

where λi(t) is the conditional probability that the channel i is available in slot t. Hence, we study the problem of OSA for the secondary user as a POMDP problem.

    A. Description of the POMDP

1) State: The state of the system at time slot t is given by (λ(t), l(t)), where l(t) is the delay of the packet held by the secondary user at time t. The delay of a new packet equals one, and it increases by one every time slot, except when the secondary user transmits the packet.

2) Action: For each time slot t and each state (λ(t), l(t)), the three possible actions are:

a(t) = 0, to remain inactive;
a(t) = 1, to sense and to transmit only if the channel is available during the time slot;
a(t) = 2, to sense and to transmit if the channel is available during the time slot, else to transmit through the dedicated channel.

3) Observation and belief: When the secondary user decides to sense (i.e. to take action a(t) ∈ {1, 2}), one channel n(t) is determined and the secondary user observes the channel occupancy state s_{n(t)}(t) ∈ {0, 1}. Let θ(t) be the observation outcome at time t, where θ(t) = 0 if the sensed channel is idle and θ(t) = 1 otherwise. The user updates the belief vector λ(t) after the observation outcome. For each channel n, the conditional probability λn(t+1) is therefore defined as follows:

λn(t+1) := Pr(sn(t+1) = 0 | a(t), θ(t)) =
  βn + (αn − βn) λn(t),  if a(t) = 0 or n ≠ n(t),
  αn,                    if a(t) ≠ 0, θ(t) = 0 and n = n(t),
  βn,                    if a(t) ≠ 0, θ(t) = 1 and n = n(t).     (1)
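For concreteness, the belief update (1) can be sketched in Python. This is a minimal illustration under our own naming conventions (the function and variable names are ours, not from the paper):

```python
import numpy as np

def update_belief(lam, alpha, beta, action, sensed=None, theta=None):
    """One-step belief update following Eq. (1).

    lam, alpha, beta: per-channel beliefs and transition rates (arrays);
    action: 0 (inactive) or 1/2 (sense); sensed: index n(t) of the sensed
    channel; theta: observation (0 = idle, 1 = busy)."""
    # Markov prediction for every channel that is not observed
    nxt = beta + (alpha - beta) * lam
    if action != 0:
        # the sensed channel is resolved exactly by the observation
        nxt[sensed] = alpha[sensed] if theta == 0 else beta[sensed]
    return nxt
```

Sensing a channel idle pins its next-slot belief to αn, since αn is precisely the probability that an idle channel stays idle; the unsensed channels simply follow the Markov prediction βn + (αn − βn)λn(t).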


Note that we can easily extend our model to sense not only one channel but a subset of the primary channels.

4) Channel choice policy: At each time slot t, based on its belief vector λ(t), the secondary user chooses a channel n(t) ∈ {1, ..., N} to be sensed. There exist several channel choice policies in the literature, such as deterministic, randomized and periodic ones (see [1]). An example of a channel choice policy is to sense the channel which has the highest probability of being idle, i.e. n(t) := arg max_n λn(t).

5) Policies: The strategy of the secondary user is defined by the probability of choosing a given action depending on the system state. We define a sensing and access policy as a vector π = [π1, π2, . . .], where πt is a mapping from a state (λ(t), l(t)) to an action a(t). The set of policies is denoted by Π. A stationary policy is a mapping that specifies for each state, independently of the time slot t, an action to be chosen. In the next section, we show that our POMDP problem has an optimal stationary policy, which allows us to restrict our attention to stationary policies.

6) Reward and costs:

Reward: Let ρ be the reward representing the number of delivered bits when the secondary user transmits its packet.

Costs: Let cs be the energy cost for sensing a primary channel, measured in monetary units. This cost depends on the action a(t) as:

cs(a(t)) = cs, if a(t) > 0,
           0,  if a(t) = 0.

The primary user and the service provider for the dedicated access charge a price for each packet transmitted. Those prices are respectively Pp for a transmission over a primary channel and P3G for a transmission over the dedicated channel.

Hence, when the secondary user successfully transmits a packet, it gets the reward zt(a(t), θ(t)), which depends on the action a(t) and the observation θ(t) as:

zt(a(t), θ(t)) = 0,        if a(t) = 0,
                 ρ − Pp,   if a(t) ≥ 1 and θ(t) = 0,
                 ρ − P3G,  if a(t) = 2 and θ(t) = 1.

In order to model the impact of the delay, we introduce an additional cost when a packet is not transmitted. This cost depends on the current delay l of the packet and is defined by the function f(l). This function is assumed to be increasing in l, in order to increase the incentive to transmit a packet that has been delayed for a long time.


Instantaneous reward: At time slot t, the instantaneous reward rt of a secondary user depends on the system state (λ(t), l(t)) and the action a(t), and is expressed by:

rt((λ(t), l(t)), a(t)) = zt(a(t), θ(t)) − f(l(t)) 1{packet not transmitted} − cs(a(t)),

where the delay penalty f(l(t)) is incurred only when the packet is not transmitted in slot t.

The problem faced by the secondary user consists of finding the sensing policy that maximizes its expected average reward, defined by:

R(π) = lim_{T→∞} (1/T) E[ Σ_{t=1}^{T} rt((λ(t), l(t)), a(t)) | λ(0) ],

where λ(0) is the initial belief vector. Our objective is then to find an optimal sensing policy π* that maximizes the average reward R(π), i.e.:

π* = arg max_{π∈Π} lim_{T→∞} (1/T) E[ Σ_{t=1}^{T} rt((λ(t), l(t)), a(t)) | λ(0) ].     (2)

In some particular MDP and POMDP problems, we are able to determine an optimal policy within the smaller set of stationary policies. We prove in the following proposition that there exists an average-reward optimal stationary policy for our POMDP problem.

    Proposition 1: There exists an average optimal stationary policy for our POMDP formulation described

    in (2).

    Proof: see Appendix A.

Given this result, we can restrict our problem to the set ΠS of stationary policies. For the remainder of this paper, we thus omit the time index t and look for an optimal sensing policy that maps a system state (λ, l) to an action a, independently of the time slot t. We now make a first analysis of the value function of the POMDP.

We denote by Λns(λ) the function that updates the belief vector λ when the user chooses to be inactive in the current slot, i.e. when the secondary user takes action 0. The function Λs(λ|θ) updates the belief vector λ when the secondary user senses a licensed channel in the current slot and observes θ, i.e. when the secondary user takes action 1 or 2.

The value function is denoted V(λ, l). Let us denote by Qa(λ, l) the action-value function of taking the action a in the current slot when the information state is (λ, l). The value function is then expressed by

gu + V(λ, l) = max_{a∈A} Qa(λ, l),     (3)

where gu is a constant, and the optimal action is given by

a*(λ, l) = arg max_{a∈A} Qa(λ, l).     (4)


We determine the action-value function for each action 0, 1 and 2. When the secondary user decides to wait, i.e. to take action a = 0, we have:

Q0(λ, l) = −f(l) + V(Λns(λ), l + 1).     (5)

When the secondary user chooses to sense the channel n and decides to wait for the next time slot if the channel n is busy, i.e. to take action 1, we have:

Q1(λ, l) = −cs + λn (ρ − Pp + V(Λs(λ|θ = 0), 1)) + (1 − λn)(−f(l) + V(Λs(λ|θ = 1), l + 1)).     (6)

When the secondary user chooses to sense the channel n and to transmit using the dedicated channel if the channel n is busy, i.e. to take action 2, we have:

Q2(λ, l) = −cs + λn (ρ − Pp + V(Λs(λ|θ = 0), 1)) + (1 − λn)(ρ − P3G + V(Λs(λ|θ = 1), 1)).     (7)
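To make Eqs. (5)-(7) concrete, here is a Python sketch of the three action values for the single-channel case, where sensing collapses the belief to α (idle observed) or β (busy observed). The value function V is passed in as a callable (e.g. an interpolated approximation); all names are ours, and the block is an illustration under these assumptions, not the authors' implementation:

```python
import numpy as np

def action_values(lam, l, V, alpha, beta, rho, Pp, P3G, cs, f):
    """Q0, Q1, Q2 of Eqs. (5)-(7), single-channel case."""
    lam_ns = beta + (alpha - beta) * lam            # belief after staying idle
    q0 = -f(l) + V(lam_ns, l + 1)
    q1 = (-cs + lam * (rho - Pp + V(alpha, 1))
          + (1 - lam) * (-f(l) + V(beta, l + 1)))   # wait if busy
    q2 = (-cs + lam * (rho - Pp + V(alpha, 1))
          + (1 - lam) * (rho - P3G + V(beta, 1)))   # dedicated channel if busy
    return q0, q1, q2

def best_action(lam, l, **kw):
    """Greedy action of Eq. (4) with respect to the supplied V."""
    return int(np.argmax(action_values(lam, l, **kw)))
```

For instance, with the prices used later in the paper (P3G = 80, Pp = 10, ρ = 35, cs = 5), a zero value function and a high belief, sensing and then waiting on a busy observation (action 1) dominates the other two actions.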

We focus on the case of one licensed channel; the multichannel case will be studied in Section III-C. We make the assumption that there exists a packet delay l̄ such that the secondary user transmits its packet using the dedicated channel if the observation is θ = 1. This assumption is rather realistic, as the user has no interest in keeping the packet in its buffer indefinitely. We denote by α and β the transition rates of the channel, and by λ the belief of the secondary user. We consider that α ≥ β. When α ≤ β, the analysis is similar and the results are unchanged.

    B. The single channel model

Let us focus on the belief update function Λns.

Lemma 1: We have the following properties of the belief update function Λns.
1) The update function Λns(λ) is increasing in the belief λ.
2) We have the following equivalences:

Λns(λ) ≥ λ if and only if λ ≤ π(0),

and

Λns(λ) ≤ λ if and only if λ ≥ π(0),

where π(0) = β / (1 − α + β) is the stationary probability that the primary channel is idle. Figure 4 depicts the belief evolution.

Proof: See Appendix B.
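Lemma 1 says that, without observations, the belief is driven monotonically toward the stationary idle probability π(0) = β/(1 − α + β). A quick numerical check (the rates 0.8 and 0.2 are illustrative values of ours, not from the paper):

```python
def stationary_idle(alpha, beta):
    """pi(0) = beta / (1 - alpha + beta)."""
    return beta / (1 - alpha + beta)

def ns_update(lam, alpha, beta):
    """Belief update when staying inactive: Lambda_ns(lam)."""
    return beta + (alpha - beta) * lam

alpha, beta = 0.8, 0.2          # alpha >= beta, as assumed in the text
traj, lam = [], 0.1             # start below pi(0) = 0.5
for _ in range(20):
    traj.append(lam)
    lam = ns_update(lam, alpha, beta)
```

Starting below π(0), the sequence increases strictly at every step and converges geometrically (at rate α − β) to π(0), exactly as the lemma predicts.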


It has been shown in [15] that the value function of a POMDP over a finite time horizon is piecewise linear and convex with respect to the belief vector. In Proposition 2, we show that the value function of our POMDP problem over an infinite horizon with the average criterion also has this property.

Proposition 2: The value function V(λ, l) given in (3) is piecewise linear and convex with respect to the belief vector λ.

    Proof: See Appendix C.

Note that monotonicity results help us establish the structure of the optimal policies (see [16] for an example) and provide insights into the underlying problem. The following propositions state monotonicity results of the value function with respect to each of its parameters.

Proposition 3: For each belief vector λ, the value function is monotonically decreasing with the packet delay l, i.e. V(λ, l) ≥ V(λ, l′) for l ≤ l′.

Proof: See Appendix D.

This result is intuitive: for the same belief, the maximum expected remaining reward that can be accrued with a given packet delay is lower than the one the secondary user can get with a smaller packet delay.

Proposition 4: The value function is monotonically increasing with the belief vector λ, i.e. V(λ, l) ≤ V(λ′, l) for λ ≤ λ′.

Proof: See Appendix F.

Again, this result is somewhat intuitive: for the same packet delay, a higher belief vector yields a higher maximum expected remaining reward.

    Given all the previous results on the value function V (, l), we are able to show the existence of

    an optimal sensing policy for our POMDP problem. Moreover, we determine explicitly the threshold

    structure of such optimal policy.

    C. The multichannel model

Lemma 1 holds for the multichannel model. In fact, if λ1 ≤ λ2 componentwise, then λ1,n ≤ λ2,n and Λns(λ1,n) ≤ Λns(λ2,n) for every channel n, and therefore Λns(λ1) ≤ Λns(λ2). Second, if λn ≤ π(0) for every channel n, then Λns(λn) ≥ λn, and thus Λns(λ) ≥ λ. Otherwise, we have Λns(λ) ≤ λ.

Proposition 2 can be straightforwardly extended to the multichannel model. Furthermore, we studied in Proposition 3 the monotonicity of the value function with respect to the packet delay for a fixed belief value. This proposition can also be extended to the multichannel model.



Let us focus on Proposition 4. The monotonicity with respect to the belief vector depends on the order relation over the belief set and also on the monotonicity of the belief update functions Λs(λ|θ = 0) and Λs(λ|θ = 1) with respect to the belief vector.

    IV. OPTIMAL THRESHOLD POLICY

Let us focus on the characteristics of an optimal policy for the secondary user. Intuitively, when the delay l and the belief probability λ are small, the secondary user waits for a better opportunity. Thus, depending on the belief probability, the secondary user makes the decision to sense a primary channel or not. We prove in this section that this intuition is true: there exists an optimal sensing policy which has a threshold structure.

The first decision for a secondary user is whether to sense licensed channels or to wait, depending on its belief λ and the current delay l of the packet. The following result gives us a threshold on the belief probability that answers this question.

Proposition 5: For all packet delays l, the optimal action for the secondary user is to wait for the next slot, i.e. a*(λ, l) = 0, if and only if λ ≤ τ*(l), where τ*(l) is the solution of the equation λ = max(0, min{Th1(λ, l), Th2(λ, l)}) with

Th1(λ, l) = [V(Λns(λ), l + 1) − V(β, l + 1) + Cs] / [f(l) + ρ − Pp + V(α, 1) − V(β, l + 1)], and

Th2(λ, l) = [V(Λns(λ), l + 1) − V(β, 1) + Cs − f(l) − ρ + P3G] / [−Pp + V(α, 1) + P3G − V(β, 1)],

where we use the fact that, in the single channel model, Λs(λ|θ = 0) = α and Λs(λ|θ = 1) = β.

Proof: see Appendix G.

This proposition gives us a necessary and sufficient condition for the use of action 0, depending on the belief probability λ. Consequently, if λ > τ*(l) then the optimal action is to sense a primary channel, i.e. a*(λ, l) ≠ 0. Furthermore, we have the following property of the optimal policy.

Proposition 6: For all λ > π(0) and all delays l, the secondary user never takes action 0, and thus Q0(λ, l) < max(Q1(λ, l), Q2(λ, l)); in particular, the threshold τ*(l) never exceeds π(0). Furthermore, we have the following result about the use of the dedicated channel.



Proposition 7: For all beliefs λ, the secondary user chooses to use the dedicated channel instead of waiting for the next slot if and only if the delay l of the current packet verifies:

f(l) + ρ − P3G − V(β, l + 1) + V(β, 1) > 0.

    Proof: See Appendix I.

We note that this expression depends neither on the sensing cost Cs nor on the belief λ. This is natural, as the expression determines the best action to take after having sensed a channel. We conclude with a last property of the optimal threshold policy.

Corollary 1 (Never Wait After Sensing): If, for all l, the penalty cost −f(l) is lower than ρ − P3G, then the secondary user transmits on the dedicated channel whenever the sensed channel is not idle.

Proof: See Appendix J.

This result is also somewhat intuitive. In fact, when the secondary user senses the channel as busy, it gets ρ − P3G as reward if it uses the dedicated channel, whereas it incurs the penalty −f(l) if it decides to wait. Thus, if ρ − P3G + f(l) is positive, the secondary user has no incentive to wait after sensing the licensed channel.
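Proposition 7 and Corollary 1 translate directly into a decision rule. A hedged sketch follows (V is supplied by the caller, and the function name and the test values, including f(l) = 10·l and a zero value function, are our own assumptions):

```python
def prefer_dedicated(l, f, rho, P3G, V, beta):
    """Proposition 7: after sensing a busy channel, transmit on the dedicated
    channel rather than wait iff f(l) + rho - P3G - V(beta, l+1) + V(beta, 1) > 0."""
    return f(l) + rho - P3G - V(beta, l + 1) + V(beta, 1) > 0
```

With a zero value function, the paper's prices (ρ = 35, P3G = 80) and f(l) = 10·l, the rule flips at l = 5: below that delay the user waits, from l = 5 on it pays for the dedicated channel. The smallest such l plays the role of the maximum delay l̄.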

In all these results, the optimal sensing policy depends on the transition rates α and β of the primary user activity. In the literature, those parameters are assumed to be known by the secondary user. We focus in the next section on online learning algorithms that allow the secondary user to estimate those rates on the fly.

V. ONLINE LEARNING OF PRIMARY USER ACTIVITY

We proved that the secondary user has an optimal energy-delay constrained policy given perfect knowledge of the channel transition rates. However, in practice, information such as the transition rates α and β is not available to the secondary user. In this section, we consider a model where the secondary user does not have external information about the state transition rates. We present two learning-based protocols for the secondary user to estimate the primary channel dynamics: a rate estimator, and a transition matrix estimator.

    A. Rate Estimator

In this approach, the secondary user begins with initial arbitrary values of α̂ and β̂. The secondary user updates them every time slot, depending on the information about the system state. The secondary user then computes its sensing policy based on the estimators α̂ = {α̂1, ..., α̂N} and β̂ = {β̂1, ..., β̂N}, where α̂i (resp. β̂i) is the estimator of αi (resp. βi).



First, the secondary user estimates αi, which is the probability that the channel i will be sensed idle given that it was idle in the previous slot. Second, the secondary user estimates πi(0), the stationary probability for this channel to be idle. The secondary user obtains the estimated value of βi from the relation β̂i = (1 − α̂i) π̂i(0) / (1 − π̂i(0)).

Formally, we consider the following counting processes for the estimation of αi and πi(0):

The vector K = {K1, ..., KN}, where Ki represents the number of time slots a channel stays in the idle state, i.e. Ki is incremented if the channel i is sensed and is idle at both time slots t and t − 1.

The vector I = {I1, ..., IN}, where Ii represents the number of time slots in which the channel is sensed and is idle.

The vector M = {M1, ..., MN}, where Mi represents the number of time slots in which the channel is sensed.

Therefore, the secondary user estimates the state transition rates and πi(0) through the expressions α̂i = Ki/Ii and π̂i(0) = Ii/Mi.

    B. Transition Matrices Estimator

The convergence of the previous estimators α̂ and β̂ depends on the occurrence of two successive sensing actions on the same channel. The secondary user may not frequently sense the same channel in two successive time slots. Therefore, the previous learning mechanism converges slowly. We present, in this section, a learning protocol which estimates the transition matrices. We define the set of transition matrices {Pi(0), Pi(1), ...}, where Pi(j) is the transition matrix of the channel i when this channel was not sensed during j consecutive slots. For example, if the channel i was sensed j slots before as idle, the current belief on the state of this channel is (1, 0) · Pi(j). As for the rate estimator, the transition matrices are estimated using a counting process. The previous learning protocol is somehow a particular case of this approach. In fact, estimating α and β is equivalent to estimating the set of transition matrices for channels sensed in the previous slot, {P1(0), ..., PN(0)}. Therefore, this learning-based protocol gives a more accurate estimation of the primary user activity. However, it needs more space and computational complexity compared to the rate estimator method.
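The link between the two protocols can be checked numerically: for a channel unobserved for k slots, multiplying the one-step matrix k times reproduces k applications of the scalar update β + (α − β)λ, so the multi-slot matrices are powers of the one-step matrix. A small sanity check (the rates are illustrative values of ours):

```python
import numpy as np

alpha, beta = 0.8, 0.2
P = np.array([[alpha, 1 - alpha],   # one-step transition matrix, state 0 = idle
              [beta,  1 - beta]])

k = 5
# belief after k unobserved slots, starting from "observed idle" = (1, 0)
b_matrix = (np.array([1.0, 0.0]) @ np.linalg.matrix_power(P, k))[0]

lam = 1.0
for _ in range(k):                  # k applications of Lambda_ns
    lam = beta + (alpha - beta) * lam
```

Estimating the family {Pi(j)} thus amounts to estimating powers of Pi(0); the matrix estimator trades memory for faster convergence, exactly as argued above.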

    VI. NUMERIC ILLUSTRATIONS

We illustrate our results through simulations of the system over a large number of packets (we consider 3000 packets). It was shown in [17] that, in practice, the average number of available primary channels is about 15. However, we consider only 4 i.i.d. primary channels, i.e. N = 4, due to the



exponential state space (with 4 primary channels, we have approximately 10^6 states). Furthermore, we consider the following system parameters: P3G = 80, Pp = 10, cs = 5 and ρ = 35.

We propose to illustrate our results in three scenarios with symmetric channels:

1) Scenario 1: Primary channels are often occupied (α1 = α2 = α3 = α4 = 0.15 and β1 = β2 = β3 = β4 = 0.1),

2) Scenario 2: Primary channels are often idle (α1 = α2 = α3 = α4 = 0.85 and β1 = β2 = β3 = β4 = 0.7),

3) Scenario 3: Primary channels have low transition rates (α1 = α2 = α3 = α4 = 0.95 and β1 = β2 = β3 = β4 = 0.05). This last scenario is realistic if we consider TV white space [17].
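The qualitative labels of the three scenarios can be checked from the stationary distribution of each channel. Assuming the convention, inferred from the scenario descriptions, that α is the probability of staying idle and β the probability of moving from occupied to idle, the stationary idle probability is β/(1 − α + β):

```python
def stationary_idle(alpha, beta):
    """Stationary probability that a two-state Markov channel is idle,
    assuming alpha = P(idle -> idle) and beta = P(busy -> idle)."""
    return beta / (1.0 - alpha + beta)

# Symmetric channels: one (alpha, beta) pair per scenario.
scenarios = {
    "Scenario 1 (often occupied)": (0.15, 0.10),
    "Scenario 2 (often idle)":     (0.85, 0.70),
    "Scenario 3 (static)":         (0.95, 0.05),
}
for name, (a, b) in scenarios.items():
    print(name, round(stationary_idle(a, b), 3))
```

Under this assumed convention, the idle probabilities are roughly 0.105, 0.824 and 0.5, matching the "often occupied", "often idle" and "static" descriptions.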

We first describe the optimal threshold policy given perfect knowledge of the transition rates of the primary channels. Second, we give some results using estimated values of the transition rates.

A. Single channel model

We consider only one licensed channel, with transition rates α = 0.15 and β = 0.1. Figure 5 illustrates the optimal policy of the secondary user depending on the belief and the packet delay. For each packet delay, the secondary user has a threshold policy depending on the belief. Moreover, the threshold belief probability is decreasing with the packet delay. We observe that the maximum packet delay is 13 slots.

Consider the same scenario with transition rates α = 0.2 and β = 0.25. We observe in Figure 6 that the secondary user policy also has a threshold structure. A packet has at most a delay of 3 slots.
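The threshold behavior of Figure 5 can be reproduced qualitatively with a discretized relative value iteration. The sketch below is our reconstruction of the single-channel model, not the paper's exact setting: the belief grid, the delay cap L, the linear penalty f(l) = l, and the convention α = P(idle → idle), β = P(busy → idle) are all assumptions; actions are 0 = wait, 1 = sense and transmit on the primary channel if idle, 2 = sense and use the dedicated (3G) channel if the primary channel is busy.

```python
import numpy as np

alpha, beta = 0.15, 0.10               # assumed transition-rate convention
R, P3G, Pp, cs = 35.0, 80.0, 10.0, 5.0 # system parameters from Section VI
f = lambda l: float(l)                 # assumed increasing delay penalty
L = 15                                 # delay cap (sketch only)
grid = np.linspace(0.0, 1.0, 201)      # discretized belief
snap = lambda w: int(round(w * 200))   # nearest grid index
ns_idx = np.array([snap(beta + (alpha - beta) * w) for w in grid])
ia, ib = snap(alpha), snap(beta)       # beliefs after sensing idle / busy

h = np.zeros((len(grid), L + 1))       # relative value function h(belief, delay)
q = np.zeros((3, len(grid), L + 1))
for _ in range(2000):                  # relative value iteration (average reward)
    for li in range(1, L + 1):
        ln = min(li + 1, L)
        q[0, :, li] = -f(li) + h[ns_idx, ln]
        q[1, :, li] = -cs + grid * (R - Pp + h[ia, 1]) \
                          + (1 - grid) * (-f(li) + h[ib, ln])
        q[2, :, li] = -cs + grid * (R - Pp + h[ia, 1]) \
                          + (1 - grid) * (R - P3G + h[ib, 1])
    hn = q.max(axis=0)
    hn -= hn[snap(0.5), 1]             # normalize at a reference state
    if np.abs(hn - h)[:, 1:].max() < 1e-9:
        h = hn
        break
    h = hn
policy = q.argmax(axis=0)              # best action for each (belief, delay) pair
```

Plotting `policy` against belief and delay should exhibit, for each delay, a belief threshold between waiting and sensing, qualitatively as in Figure 5.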

B. Optimal policy with perfect knowledge of α and β

We simulate the first scenario and depict in Figure 7 the thresholds determined in Proposition 5, depending on the packet delay l. For each packet delay l, the best action for the secondary user is to wait for the next slot if its belief probability is lower than the threshold. Otherwise, the secondary user decides to sense the primary channels. In this context, where the primary channels are often occupied (Scenario 1, Figure 7), the maximum packet delay l̄ obtained with Proposition 7 equals 9. Then, when the packet delay is l = 9, the user decides to sense and to transmit using the dedicated channel if the sensed channel is occupied. We describe the optimal policy for Scenario 2 in Figure 8. The maximum packet delay in this case is l̄ = 5. This result is intuitive since, in this scenario, the primary channels are more often idle, inducing a lower packet delay. Finally, the last scenario, depicted in Figure 9, yields a maximum packet delay of 5. We observe that the secondary user policy also has a threshold


structure. However, the threshold belief probability is not decreasing with the packet delay. In fact, since the primary channels are more static (the probability for each channel to stay occupied or idle is high enough), a kind of periodic threshold strategy appears.

C. Average reward using estimated values of α and β

We consider the learning approaches proposed in Section V. Let us first compare the average reward and the average delay obtained with the two learning-based protocols to those obtained with perfect knowledge of the channels' transition rates. Figures 10 and 11 show that both learning protocols converge: we observe that both protocols converge within 400 iterations. However, in Figures 12 and 13, we can observe that the transition matrices estimation method converges 3 times faster (about 1000 iterations) than the rate estimators method (about 3000 iterations). Moreover, the average reward and the average packet delay obtained using the estimated transition rates are close to those obtained with known channel transition rates.

VII. CONCLUSION AND PERSPECTIVES

In this paper, we have used a POMDP framework to determine an optimal sensing policy for opportunistic spectrum sensing and access (OSA), taking into account an energy-delay tradeoff for secondary users. Introducing a QoS metric in the spectrum sensing policy is very important with the emergence of heterogeneous mobiles that are able to transmit their traffic, with possibly high QoS constraints, at any time over different communication technologies like 3G, WiFi and TV White Space. We have provided some structural properties of the value function and then proved the existence of an optimal average-reward stationary spectrum sensing policy. We have been able to determine explicitly the threshold structure of the optimal policy. The interaction between several secondary users has not been considered here, and only rarely in the literature. This perspective is also very important because, if the channel choice policy is the same for all the secondary users, there could be many collisions between secondary users that have sensed the same idle primary channel. This decentralized system with partial information can be modeled using decentralized POMDPs or interactive POMDPs and will be studied in future work.

APPENDIX

A. Proof of Proposition 1

We use Theorems 8.10.9 and 8.10.7 from [14] to prove the existence of an optimal stationary policy for our problem. First, the immediate reward rt((s, l), a) is finite, i.e. −∞ < rt((s, l), a) < +∞


(as all costs and rewards are finite). Second, we prove that there exists a stationary policy d for which the derived Markov chain is positive recurrent.

Let us focus on the following belief vector:

Ω0 = (Ω1, Ω2, ..., ΩN) such that Ωj = ns^(j)(Ω | Θ = 0), for j = 1, . . . , N,

where Ωj represents the belief of a channel that was not sensed for j successive slots.

Denote by d the stationary policy which senses licensed channels at every slot, with a periodic channel choice policy. Let us prove that the derived Markov chain is positive recurrent. The probability that the system returns to the initial belief from any state is p(Ω) = ∏_{n=1}^{N} (1 − Ωn) > 0, so the return time to the initial belief follows a geometric distribution with E{τ} = 1/p(Ω), and therefore all states are positive recurrent under d.

Third, let us prove that g_d > −∞ and that the set {b ∈ Sb : rt((s, l), a) > g_d for some a ∈ A} is finite and nonempty. As the policy d senses licensed channels at every slot, g_d = E[−f(l(t)) − cs + (f(l(t)) + R − Pp)Ωn]. If we have

−f(l(t)) − cs + (f(l(t)) + R − Pp)Ωn > max{−f(l(t)), −cs + R − P3G + (P3G − Pp)Ωn}

for all beliefs b, then the policy that always senses primary channels is optimal and we have achieved our goal. Otherwise, the set {b ∈ Sb : rt((s, l), a) > g_d for some a ∈ A} is finite and nonempty. Finally, we obtain from Theorems 8.10.9 and 8.10.7 of [14] that there exists an average optimal stationary policy.

B. Proof of Lemma 1

First, the update function ns is linear in the belief, because ns(Ω) = β + (α − β)Ω. As we consider the case where α ≥ β, the update function is increasing with the belief.

Second, let us prove that ns(Ω) ≥ Ω if Ω ≤ π(0), by induction on the belief:

1) We have the initial condition: π(0) = β/(1 − α + β) and ns(Ω) = β + (α − β)Ω ≥ Ω.
2) We assume that ns(Ω) ≥ Ω for a given Ω ≤ π(0).
3) The induction operator gives: ns(ns(Ω)) = β + (α − β)ns(Ω) ≥ β + (α − β)Ω = ns(Ω).

Thus, ns(Ω) ≥ Ω for all Ω ≤ π(0). The analysis for Ω ≥ π(0) is similar.
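The fixed point π(0) = β/(1 − α + β) of the update ns(Ω) = β + (α − β)Ω, and the monotone convergence used in the induction above, can be checked numerically; a small sketch under the α ≥ β assumption:

```python
def ns(omega, alpha, beta):
    """Belief update when the channel is not sensed: ns(w) = beta + (alpha - beta) * w."""
    return beta + (alpha - beta) * omega

def fixed_point(alpha, beta):
    """pi(0) = beta / (1 - alpha + beta), the unique fixed point of ns."""
    return beta / (1.0 - alpha + beta)

alpha, beta = 0.8, 0.3            # any pair with alpha >= beta (Lemma 1's assumption)
omega = 0.0                       # start below pi(0)
for _ in range(200):
    nxt = ns(omega, alpha, beta)
    assert nxt >= omega           # ns(w) >= w below the fixed point, as in Lemma 1
    omega = nxt
# omega has now converged to pi(0) = 0.3 / 0.5 = 0.6
```

The map contracts with factor |α − β| < 1, so the iterates converge geometrically to π(0) from any starting belief.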


C. Proof of Proposition 2

The proof of Proposition 2 is similar to [15], where the authors consider the finite-time-horizon problem. Hence, we briefly describe the procedure of the proof. Considering the maximum packet delay l̄ and any belief vector Ω, the value function V(Ω, l̄) is linear in the belief because

V(Ω, l̄) = Q2(Ω, l̄) − gu
= −gu − cs + R − P3G + V(Ω^s(Ω|Θ = 1), 1) + Ωn(P3G − Pp + V(Ω^s(Ω|Θ = 0), 1) − V(Ω^s(Ω|Θ = 1), 1)).

Then the value function V(Ω, l̄) can be rewritten as an inner product of the belief vector and a γ-vector. As Q2(Ω, l) = Q2(Ω, l̄) for all l, the action-value function Q2(Ω, l) can also be rewritten as an inner product of the belief vector and a γ-vector. We suppose that Proposition 2 holds for all packet delays higher than l + 1, and we prove that the proposition is true for packet delay l. After some algebra, we can rewrite the action-value functions given in (5) and (7) in terms of γ-vectors:

Q0(Ω, l) = −f(l) + max_{γ_{l+1}} ⟨ns(Ω|·), γ_{l+1}⟩ = −f(l) + Σ_{s∈S} Ωs [Σ_{s′∈S} P(s′|s) γ^{ns(Ω|·)}_{l+1}(s′)],   (8)

and

Q1(Ω, l) = −cs + Ωn(R − Pp + V(Ω^s(Ω|Θ = 0), 1)) + (1 − Ωn)(−f(l) + max_{γ_{l+1}} ⟨Ω^s(Ω|Θ = 1), γ_{l+1}⟩)
= −cs + Ωn(R − Pp + V(Ω^s(Ω|Θ = 0), 1)) + (1 − Ωn)(−f(l) + Σ_{s∈S} Ωs [Σ_{s′∈S} P(s′|s) γ^{Ω^s(Ω|Θ=1)}_{l+1}(s′)]),   (9)

where γ^{ns(Ω|·)}_{l+1} and γ^{Ω^s(Ω|Θ=1)}_{l+1} are the γ-vectors of the regions containing the belief vectors ns(Ω|·) and Ω^s(Ω|Θ = 1), respectively. Each term in the square brackets of (8) and (9) is an element γ_l(s) of a γ-vector γ_l. Then the action-value functions can be rewritten as an inner product of the belief vector and a γ-vector γ_l. Moreover, there is only a finite number of such γ-vectors γ_l, since we have a finite set of beliefs for each l. As the maximum of a finite set of piecewise linear and convex functions is also piecewise linear and convex, Proposition 2 holds.

D. Proof of Proposition 3

Let us first prove that the value function V(Ω, l) is monotonically decreasing with the packet delay l for all belief vectors Ω. The secondary user takes action 2 for all Ω when the packet delay is l̄; thus,


we have:

V(Ω, l̄) = −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(R − P3G + V(β, 1)) − gu.

The secondary user chooses the action that maximizes its average utility, and thus:

V(Ω, l̄ − 1) = max_a Qa(Ω, l̄ − 1) − gu ≥ Q2(Ω, l̄ − 1) − gu
= −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(R − P3G + V(β, 1)) − gu
= V(Ω, l̄).

Let us prove that this property holds for all packet delays using a backward induction on l:

1) Initial condition: for all belief vectors Ω, V(Ω, l̄) ≤ V(Ω, l̄ − 1).
2) We suppose that V(Ω, l + 2) ≤ V(Ω, l + 1) for all Ω.
3) We have:

Q0(Ω, l) = −f(l) + V(ns(Ω|·), l + 1)
≥ −f(l + 1) + V(ns(Ω|·), l + 2)
= Q0(Ω, l + 1).

Q1(Ω, l) = −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−f(l) + V(β, l + 1))
≥ −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−f(l + 1) + V(β, l + 2))
= Q1(Ω, l + 1).

Q2(Ω, l) = −cs + R − P3G + V(β, 1) + Ω(P3G − Pp + V(α, 1) − V(β, 1))
= Q2(Ω, l + 1).

The inequalities come from the induction assumption and the monotonicity of the penalty function f(l). Thus, we have:

∀Ω, V(Ω, l) ≥ V(Ω, l + 1).

The value function is therefore decreasing with the packet delay.

Lemma 2: We have the following inequality:

−Pp + V(α, 1) ≥ −P3G + V(β, 1).


E. Proof of Lemma 2

We prove this lemma by contradiction, so we suppose that −Pp + V(α, 1) < −P3G + V(β, 1). We first prove the following:

gu + V(α, 1) ≥ Q2(α, 1),
gu + V(α, 1) ≥ −cs + α(R − Pp + V(α, 1)) + (1 − α)(R − P3G + V(β, 1)),
gu + V(α, 1) > −cs + R − Pp + V(α, 1),
gu > R − cs − Pp,

and we take the assumption that the immediate reward when the channel is idle is positive, i.e. R − cs − Pp ≥ 0.

We know that the secondary user takes action 2 in the state (Ω, l̄) for all belief vectors Ω, i.e. a(Ω, l̄) = 2. We have:

gu + V(Ω, l̄) = −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(R − P3G + V(β, 1)).

Let us focus on the packet delay l̄ − 1. If Ω ≤ π(0), we have:

Q0(Ω, l̄ − 1) = −f(l̄ − 1) + V(ns(Ω), l̄)
= −gu − f(l̄ − 1) − cs + ns(Ω)(R − Pp + V(α, 1)) + (1 − ns(Ω))(R − P3G + V(β, 1))
= V(Ω, l̄) − f(l̄ − 1) + (ns(Ω) − Ω)(P3G − Pp + V(α, 1) − V(β, 1))
< V(Ω, l̄).

The inequality is due to the assumption that −Pp + V(α, 1) < −P3G + V(β, 1), the fact that ns(Ω) ≥ Ω, and the positivity of f(l̄ − 1). As the value function V(Ω, l) is decreasing with the packet delay l (see Proposition 3), we get Q0(Ω, l̄ − 1) < V(Ω, l̄) ≤ V(Ω, l̄ − 1). As we proved that gu ≥ 0, the secondary user does not take action 0 when the packet delay is l̄ − 1. For action 1, we have:

Q1(Ω, l̄ − 1) = −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−f(l̄ − 1) + V(β, l̄))
= −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−gu − f(l̄ − 1) − cs + β(R − Pp + V(α, 1)) + (1 − β)(R − P3G + V(β, 1)))
< −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−gu − f(l̄ − 1) − cs + R − P3G + V(β, 1))
< −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(R − P3G + V(β, 1))
= Q2(Ω, l̄ − 1).


The first inequality is due to the assumption that −Pp + V(α, 1) < −P3G + V(β, 1), and the second one holds because gu, f(l̄ − 1) and cs are positive. Thus, the optimal strategy is to take action 2 when the packet delay is l̄ − 1.

Let us now prove by backward induction on l that the optimal action is action 2 for all belief vectors Ω ≤ π(0). If the secondary user takes action 2 when the packet delay is l̄, then it also takes action 2 when the packet delay is l̄ − 1. We suppose that the secondary user takes action 2 when the packet delay is l′ < l̄ − 1. We have the following inequalities:

Q0(Ω, l′ − 1) = −f(l′ − 1) + V(ns(Ω), l′)
= −gu − f(l′ − 1) − cs + ns(Ω)(R − Pp + V(α, 1)) + (1 − ns(Ω))(R − P3G + V(β, 1))
= V(Ω, l′) − f(l′ − 1) + (ns(Ω) − Ω)(P3G − Pp + V(α, 1) − V(β, 1))
< V(Ω, l′).

The inequality is due to the assumption that −Pp + V(α, 1) < −P3G + V(β, 1), the fact that ns(Ω) ≥ Ω, and the positivity of f(l′ − 1). As the value function is decreasing with the packet delay (see Proposition 3), Q0(Ω, l′ − 1) < V(Ω, l′ − 1) + gu, i.e. the secondary user does not take action 0 with packet delay l′ − 1.

Q1(Ω, l′ − 1) = −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−f(l′ − 1) + V(β, l′))
= −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−gu − f(l′ − 1) − cs + β(R − Pp + V(α, 1)) + (1 − β)(R − P3G + V(β, 1)))
< −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−gu − f(l′ − 1) − cs + R − P3G + V(β, 1))
< −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(R − P3G + V(β, 1))
= Q2(Ω, l′ − 1).

The first inequality is due to the assumption that −Pp + V(α, 1) < −P3G + V(β, 1), and the second one holds because gu, f(l′ − 1) and cs are positive. Thus, the optimal strategy is to take action 2 when the packet delay is l′ − 1, and the secondary user does not take action 1 with packet delay l′ − 1. Finally, the secondary user takes action 2 for all packet delays and all beliefs lower than π(0).


We now look at the action-value function Q2(α, 1) when the packet delay is l = 1:

Q2(α, 1) = −cs + α(R − Pp + V(α, 1)) + (1 − α)(R − P3G + V(β, 1)),
Q2(α, 1) = −cs + R − P3G + V(β, 1) + α(P3G − Pp + V(α, 1) − V(β, 1)),
gu + Q2(α, 1) = gu + V(α, 1) + R − Pp − cs + (α − 1)(P3G − Pp + V(α, 1) − V(β, 1)).

As the secondary user also takes action 2 in the state (β, 1), we have:

gu + V(β, 1) = −cs + β(R − Pp + V(α, 1)) + (1 − β)(R − P3G + V(β, 1)),
gu + V(β, 1) = −cs + R − P3G + V(β, 1) + β(P3G − Pp + V(α, 1) − V(β, 1)),
gu = −cs + R − P3G + β(P3G − Pp + V(α, 1) − V(β, 1)).

Thus, we obtain:

Q2(α, 1) − gu − V(α, 1) = P3G − Pp + (α − β − 1)(P3G − Pp + V(α, 1) − V(β, 1)).

As we assumed that P3G − Pp + V(α, 1) − V(β, 1) < 0, and since P3G > Pp and α − β − 1 < 0, we obtain V(α, 1) + gu ≤ Q2(α, 1); therefore, the secondary user also takes action 2 in the state (α, 1). Then we get:

gu + V(α, 1) = Q2(α, 1) = −cs + α(R − Pp + V(α, 1)) + (1 − α)(R − P3G + V(β, 1)).

Let us finally evaluate the difference V(α, 1) − V(β, 1):

V(α, 1) − V(β, 1) = (α − β)(P3G − Pp + V(α, 1) − V(β, 1)),
V(α, 1) − V(β, 1) < 0,

and

V(α, 1) − V(β, 1) = (α − β)(P3G − Pp + V(α, 1) − V(β, 1)),
(V(α, 1) − V(β, 1))(1 − α + β) = (α − β)(P3G − Pp),
V(α, 1) − V(β, 1) = (α − β)(P3G − Pp) / (1 − α + β) > 0,

which leads to a contradiction; therefore, −Pp + V(α, 1) ≥ −P3G + V(β, 1). The analysis is similar when Ω > π(0).


F. Proof of Proposition 4

Let us prove that the value function V(Ω, l) is increasing with the belief vector Ω for any packet delay l. For all Ω1 ≤ Ω2, we have:

V(Ω1, l̄) = −gu − cs + R − P3G + V(β, 1) + Ω1(P3G − Pp + V(α, 1) − V(β, 1))
≤ −gu − cs + R − P3G + V(β, 1) + Ω2(P3G − Pp + V(α, 1) − V(β, 1))
= V(Ω2, l̄).

This inequality results from Lemma 2. Let us prove that this property holds for all packet delays l using backward induction:

Initial condition: there exists a packet delay l̄ such that V(Ω1, l̄) ≤ V(Ω2, l̄) for all Ω1 ≤ Ω2. We suppose that V(Ω1, l + 1) ≤ V(Ω2, l + 1) for all Ω1 ≤ Ω2.

First case: We assume that R + f(l) − Pp + V(α, 1) − V(β, l + 1) ≥ 0; then:

Q0(Ω1, l) = −f(l) + V(ns(Ω1|·), l + 1)
≤ −f(l) + V(ns(Ω2|·), l + 1)
= Q0(Ω2, l).

The inequality is a direct result of the induction assumption and Lemma 1. We also have:

Q1(Ω1, l) = −cs − f(l) + V(β, l + 1) + Ω1(R + f(l) − Pp + V(α, 1) − V(β, l + 1))
≤ −cs − f(l) + V(β, l + 1) + Ω2(R + f(l) − Pp + V(α, 1) − V(β, l + 1))
= Q1(Ω2, l).

Q2(Ω1, l) = −cs + R − P3G + V(β, 1) + Ω1(P3G − Pp + V(α, 1) − V(β, 1))
≤ −cs + R − P3G + V(β, 1) + Ω2(P3G − Pp + V(α, 1) − V(β, 1))
= Q2(Ω2, l).

The inequalities come from the assumption of the first case and from Lemma 2. Thus, we have proved that V(Ω1, l) ≤ V(Ω2, l).


Second case: We suppose that R + f(l) − Pp + V(α, 1) − V(β, l + 1) < 0; then, for all Ω, we have:

Q1(Ω, l) = −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−f(l) + V(β, l + 1))
≤ −cs − f(l) + V(β, l + 1)
≤ −f(l) + V(β, l + 1)
≤ −f(l) + V(ns(Ω|·), l + 1)
= Q0(Ω, l).

In fact, we have β ≤ ns(Ω|·) for all belief vectors Ω, and the value function V(Ω, l) is increasing with the belief for the packet delay l + 1 (induction assumption). Thus, gu + V(Ω, l) = max{Q0(Ω, l), Q2(Ω, l)}. Moreover, we have:

Q0(Ω1, l) = −f(l) + V(ns(Ω1|·), l + 1)
≤ −f(l) + V(ns(Ω2|·), l + 1)
= Q0(Ω2, l).

The inequality is a direct result of the induction assumption. Finally, we have:

Q2(Ω1, l) = −cs + R − P3G + V(β, 1) + Ω1(P3G − Pp + V(α, 1) − V(β, 1))
≤ −cs + R − P3G + V(β, 1) + Ω2(P3G − Pp + V(α, 1) − V(β, 1))
= Q2(Ω2, l).

The inequality comes from Lemma 2.

Thus, V(Ω1, l) ≤ V(Ω2, l) for all belief vectors Ω1 ≤ Ω2 and for all packet delays l.

G. Proof of Proposition 5

In this proposition, we determine explicitly the best action a(Ω, l) for the secondary user, depending on the belief Ω and the packet delay l. At each time slot and for a given information state (Ω, l), the secondary user decides to take action 0 if Q0(Ω, l) ≥ max{Q1(Ω, l), Q2(Ω, l)}.

First, we assume that Q1(Ω, l) > Q2(Ω, l); then, let us compare Q0(Ω, l) and Q1(Ω, l). The inequality Q0(Ω, l) ≥ Q1(Ω, l) is equivalent to:

−f(l) + V(ns(Ω|·), l + 1) ≥ −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−f(l) + V(β, l + 1)),
V(ns(Ω|·), l + 1) ≥ V(β, l + 1) − cs + Ω(f(l) + R − Pp + V(α, 1) − V(β, l + 1)).


As the value function V(Ω, l) is decreasing with the packet delay l and increasing with the belief Ω, we have V(α, 1) ≥ V(β, l + 1). As we assumed that the immediate reward R is higher than the cost Pp, we obtain that f(l) + R − Pp + V(α, 1) − V(β, l + 1) is positive. Then, we have the following equivalence:

Q0(Ω, l) ≥ Q1(Ω, l) ⟺ V(ns(Ω|·), l + 1) ≥ V(β, l + 1) − cs + Ω(f(l) + R − Pp + V(α, 1) − V(β, l + 1)).

Define the functions F and G as follows:

F(Ω, l) = V(ns(Ω|·), l + 1),
G(Ω, l) = V(β, l + 1) − cs + Ω(f(l) + R − Pp + V(α, 1) − V(β, l + 1)).

We proved in Proposition 2 that the value function is piecewise linear and convex (PWLC). Therefore, for all packet delays, the function F(Ω, l) is PWLC and increasing with Ω, and the function G(Ω, l) is linear and increasing with Ω. Note that if F(Ω, l) ≥ G(Ω, l), then Q0(Ω, l) ≥ Q1(Ω, l) and therefore the best action is 0; if F(Ω, l) < G(Ω, l), then Q0(Ω, l) < Q1(Ω, l) and therefore the best action is 1.

Let us study the sign of the function H(Ω, l) = F(Ω, l) − G(Ω, l). Under these settings, six cases arise:

1) F(Ω, l) is always higher than G(Ω, l), see Figure 14, case 1.
2) F(Ω, l) is always lower than G(Ω, l), see Figure 14, case 2.
3) F(Ω, l) and G(Ω, l) intersect once and F(π(0), l) < G(π(0), l), see Figure 14, case 3.
4) F(Ω, l) and G(Ω, l) intersect once and F(π(0), l) ≥ G(π(0), l), see Figure 14, case 4.
5) F(Ω, l) and G(Ω, l) intersect twice and F(π(0), l) ≥ G(π(0), l), see Figure 14, case 5.
6) G(Ω, l) is tangent to F(Ω, l), see Figure 14, case 6.

Let us focus on F(π(0), l) and G(π(0), l), and let us first prove that gu > −f(l). We have:

gu + V(α, 1) ≥ Q0(α, 1),
gu + V(α, 1) ≥ −f(l) + V(ns(α), l + 1),
gu + V(α, 1) − V(ns(α), l + 1) ≥ −f(l),
gu > −f(l).


The inequality follows from the monotonicity of the value function and from ns(α) < α. Suppose now that the secondary user chooses action 0 in the state (π(0), l). We have:

gu + V(π(0), l) = −f(l) + V(ns(π(0)), l + 1),
gu + V(π(0), l) ≤ −f(l) + V(ns(π(0)), l),
gu + V(π(0), l) ≤ −f(l) + V(π(0), l),
gu ≤ −f(l).

This leads to a contradiction, since gu > −f(l). Thus, Q0(π(0), l) < Q1(π(0), l), and therefore F(π(0), l) < G(π(0), l).

Suppose now that Q2(Ω, l) ≥ Q1(Ω, l); then we have to compare actions 0 and 2, which is equivalent to comparing the action-value functions Q0(Ω, l) and Q2(Ω, l). The secondary user takes action 0 instead of action 2 if Q0(Ω, l) ≥ Q2(Ω, l), which is equivalent to:

−f(l) + V(ns(Ω|·), l + 1) ≥ −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(R − P3G + V(β, 1)),
V(ns(Ω|·), l + 1) ≥ V(β, 1) + R + f(l) − cs − P3G + Ω(P3G − Pp + V(α, 1) − V(β, 1)).

We have from Lemma 2 that P3G − Pp + V(α, 1) − V(β, 1) ≥ 0. Then, we can carry out the same analysis as in the previous case with the function F(Ω, l) = V(ns(Ω|·), l + 1) and the function G(Ω, l) = V(β, 1) + R + f(l) − cs − P3G + Ω(P3G − Pp + V(α, 1) − V(β, 1)). The latter is linear and increasing in Ω. We obtain the following threshold policy:

The secondary user takes action 0 for all beliefs lower than the threshold

Th2(Ω, l) = [V(ns(Ω|·), l + 1) − V(β, 1) − R − f(l) + cs + P3G] / [P3G − Pp + V(α, 1) − V(β, 1)],

and takes action 2 otherwise.


H. Proof of Proposition 6

We have from Lemma 1 that if Ω > π(0), then ns(Ω) ≤ Ω. Suppose that the secondary user takes action 0 for a belief Ω > π(0) and a packet delay l. Thus, we have:

gu + V(Ω, l) = −f(l) + V(ns(Ω), l + 1),
gu + V(Ω, l) ≤ −f(l) + V(ns(Ω), l),
gu + V(Ω, l) ≤ −f(l) + V(Ω, l),
gu ≤ −f(l).

This leads to a contradiction, since gu > −f(l). The first inequality holds because the value function is decreasing with the packet delay, and the second one because the value function is increasing with the belief and ns(Ω) ≤ Ω. Thus, if Ω > π(0), the secondary user never takes action 0, i.e. Q0(Ω, l) < max{Q1(Ω, l), Q2(Ω, l)}.

I. Proof of Proposition 7

Let us compare the action-value functions Q1(Ω, l) and Q2(Ω, l) for all belief vectors Ω and packet delays l. The secondary user waits for the next time slot after sensing if Q1(Ω, l) ≥ Q2(Ω, l), which is equivalent to:

−cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−f(l) + V(β, l + 1)) ≥ −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(R − P3G + V(β, 1)),
−f(l) + V(β, l + 1) − R + P3G − V(β, 1) ≥ 0.

Remark that this condition depends only on the packet delay l and not on the belief vector Ω.

J. Proof of Corollary 1

If P3G − R is lower than f(1), then −f(l) − R + P3G + V(β, l + 1) − V(β, 1) is always negative. In fact, V(β, 2) − V(β, 1) is negative, and −f(l) − R + P3G + V(β, l + 1) − V(β, 1) is decreasing with l. Therefore, the previous expression is negative for all l ≥ 1.


REFERENCES

[1] E. Hossain, D. Niyato and Z. Han, Dynamic Spectrum Access and Management in Cognitive Radio Networks, Cambridge University Press, 2009.

[2] J. Mitola, Cognitive Radio: An Integrated Agent Architecture for Software Defined Radio, PhD Dissertation, Royal Inst. Technol. (KTH), Stockholm, Sweden, 2000.

[3] I. F. Akyildiz, W.-Y. Lee et al., "NeXt generation/dynamic spectrum access/cognitive radio wireless networks: A survey," Computer Networks, 2006.

[4] K. Jaganathan, I. Menache, E. Modiano, and G. Zussman, "Non-cooperative spectrum access: The dedicated vs. free spectrum choice," in Proc. ACM MobiHoc '11, May 2011.

[5] O. Habachi and Y. Hayel, "Optimal sensing strategy for opportunistic secondary users in a cognitive radio network," in Proc. 13th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM), 2010.

[6] I. Akyildiz, W. Lee, M. Vuran, and S. Mohanty, "A survey on spectrum management in cognitive radio networks," IEEE Communications Magazine, 2008.

[7] Q. Zhao et al., "Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework," IEEE Journal on Selected Areas in Communications, vol. 25, no. 3, April 2007.

[8] H. Liu, B. Krishnamachari and Q. Zhao, "Cooperation and learning in multiuser opportunistic spectrum access," in Proc. ICC, 2008.

[9] H. Zheng and C. Peng, "Collaboration and fairness in opportunistic spectrum access," in Proc. IEEE International Conference on Communications (ICC), 2005.

[10] A. T. Hoang, Y. C. Liang, D. T. C. Wong, Y. Zeng, and R. Zhang, "Opportunistic spectrum access for energy-constrained cognitive radios," IEEE Transactions on Wireless Communications, 2008.

[11] Y. Chen, Q. Zhao and A. Swami, "Distributed spectrum sensing and access in cognitive radio networks with energy constraint," IEEE Transactions on Signal Processing, February 2009.

[12] K. Challapali, C. Cordeiro, and D. Birru, "Evolution of spectrum-agile cognitive radios: first wireless internet standard and beyond," in Proc. WICON, 2006.

[13] Q. Zhao, L. Tong, and A. Swami, "Decentralized cognitive MAC for dynamic spectrum access," in Proc. 1st IEEE Symp. New Frontiers in Dynamic Spectrum Access Networks, Nov. 2005.

[14] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley Series in Probability and Statistics, 2005.

[15] R. D. Smallwood and E. J. Sondik, "The optimal control of partially observable Markov decision processes over a finite horizon," Operations Research, vol. 21, pp. 1071-1088, 1973.

[16] W. S. Lovejoy, "Some monotonicity results for partially observed Markov decision processes," Operations Research, vol. 35, no. 5, pp. 736-743, Sept. 1987.

[17] S. Shellhammer, A. Sadek and W. Zhang, "Technical challenges for cognitive radio in the TV white space spectrum," in Information Theory and Applications Workshop, 2009.


    Fig. 1. Using cognitive radio in ad-hoc communication. If the licensed frequency f1 is not used by primary users, secondary

    users can communicate in ad-hoc mode using f1.

    Fig. 2. Cognitive radio network architecture

    Fig. 3. The channel transition probabilities for channel i.

    Fig. 4. The belief update function ns with respect to the packet delay.


Fig. 5. Optimal policy with one licensed channel (α = 0.15, β = 0.1).

Fig. 6. Optimal policy with one licensed channel (α = 0.2, β = 0.25).

Fig. 7. Optimal policy for the secondary user in Scenario 1.


Fig. 8. Optimal policy for the secondary user in Scenario 2.

Fig. 9. Optimal policy for the secondary user in Scenario 3.

Fig. 10. Average reward depending on the number of iterations for Scenario 2.


Fig. 11. Average delay depending on the number of iterations for Scenario 2.

Fig. 12. Average reward depending on the number of iterations for Scenario 1.

Fig. 13. Average delay depending on the number of iterations for Scenario 1.


Fig. 14. The functions F(Ω, l) and G(Ω, l).
