Optimal Energy-Delay Tradeoff for Opportunistic Spectrum Access in Cognitive Radio Networks

Oussama Habachi, Yezekael Hayel and Rachid El-azouzi

CERI/LIA, University of Avignon, France

    Abstract

Cognitive radio (CR) has been considered a promising technology to enhance spectrum efficiency via opportunistic transmission at the link level. Basic CR features allow secondary users to transmit only when the licensed channel is not occupied by primary users. However, waiting for an idle time slot may incur large packet delays and high energy consumption. Thus, we consider an Opportunistic Spectrum Access (OSA) mechanism that takes into account packet delay and energy consumption. We formulate the OSA problem as a Partially Observable Markov Decision Process (POMDP) by explicitly considering the energy constraint as well as the delay constraint, which are often ignored in existing OSA solutions. Specifically, we consider a POMDP with an average reward criterion. We further consider that the secondary user may decide, at any moment, to use a dedicated means of communication (3G) in order to transmit its packets. We derive structural properties of the value function and we show the existence of optimal strategies in the class of threshold strategies. For implementation purposes, we propose online learning mechanisms that estimate the statistics of the primary user activity. Numerical illustrations validate our theoretical findings: the optimal policy is shown to have a threshold structure. We also present numerical illustrations of the convergence of the proposed algorithms for estimating the primary user activity.

    Index Terms

    POMDP, Cognitive Radio Networks, QoS.

    I. INTRODUCTION

Access to spectrum frequencies is defined by licenses assigned to primary users. The latter must conform to the specifications described in the license (e.g. location of the base station, frequency and

    January 16, 2012 DRAFT

the maximum transmission power). Nonetheless, a recent study by the Federal Communications

    Commission (FCC) has proved that some frequency bands are not sufficiently used by licensed users at

    a particular time and in a specific location [1].

    Cognitive radio, which is a new paradigm for designing wireless communication systems, has appeared

in order to enhance the utilization of the radio frequency spectrum. Cognitive radio has been considered the key technology that enables secondary users to access the licensed spectrum. A cognitive user, as defined in [2], is a mobile device that has the ability to adapt its transmission parameters (e.g. frequency and modulation) to the wireless environment, and to support different communication standards (e.g. GSM,

    CDMA, WiMAX and WiFi). Moreover, when there is no opportunity to transmit over the licensed

    channels, the secondary users may have the possibility to transmit on dedicated channels, generally,

    with a higher cost and/or a lower throughput than transmitting over licensed channels. The possibility of

having dedicated channels reserved for secondary mobiles has been proposed in [3], [4] and [5]. Those CR

    architectures are described in [6] where the authors also present the network components, the spectrum

and network heterogeneity, and the spectrum management framework. We focus, in this paper, on a CR network where a secondary user communicates with other secondary users through an ad-hoc connection

    using a spectrum hole of a licensed frequency (see Figure 1). A secondary user can be considered as a

pair of transmitter-receiver nodes. We assume that there are no interactions with other secondary users. This model is also suited to the scenario depicted in Figure 2, where the secondary user is a cognitive radio base station which is able to sense the activity of a primary base station and then exploit spectrum holes for transmitting on the downlink. Our main contribution is to consider, in this cognitive

    radio setting, an optimal opportunistic spectrum access (OSA) mechanism that takes into account energy

and delay constraints. Many works have studied optimal sensing and access policies in cognitive radio networks (see [7], [8] and [9]), focusing on either spectrum sensing or dynamic spectrum sharing. In [10], the authors focused on an OSA problem with an energy constraint.

    The authors have formulated their problem as a POMDP and derived some properties of the optimal

    sensing control policies. Their control parameter is the duration of sensing used by a secondary user at

    each time slot for determining the primary user activity. They provided heuristic control policies based

on grid-based approximation, myopic policies and static policies, which have low complexity but give suboptimal control policies. Finally, they compare their heuristic methods with optimal solutions obtained

    using a POMDP solver. Authors of [11] incorporate the energy constraint in the design of the optimal

policy of sensing and access in cognitive radio networks. They also formulate the problem as a POMDP, but with a finite horizon, and establish a threshold structure of the optimal policy for the single channel


model. However, they did not provide an analytical expression of the optimal control policy. It is noteworthy that the impact of the energy constraint, or of the capacity of cognitive radio to support additional Quality-of-Service (QoS) requirements such as the expected delay, has been somewhat ignored in the literature. In fact, it is very important for today's multimedia applications on wireless networks to provide reliable communication while sustaining a certain level of QoS. Moreover, taking into account the delay constraint as well as the energy constraint significantly complicates the optimization problem. Without considering the delay

    constraint, the secondary user achieves the best tradeoff between trying to access the licensed channel

and sleeping to conserve energy. The design of such a tradeoff involves several conflicting objectives: gaining immediate access, gaining spectrum occupancy information, conserving energy and minimizing packet delay. The goal of our paper is thus to study this energy-QoS tradeoff in order to determine an

    optimal OSA mechanism for secondary users in a cognitive radio network. The major contributions of

    our work are:

The problem is formulated as an infinite horizon POMDP with the average reward criterion. The average criterion is better suited than the discounted or total reward criteria, since the secondary user takes decisions frequently.

In order to gain insight into the energy-delay constrained OSA problem, we derive structural properties of the value function. We are able to show that the value function is increasing with the belief and decreasing with the packet delay. These structural results not only give us the fundamental design thresholds but also reduce the computational complexity of seeking the optimal policies.

We show that the secondary user can maximize its average reward by adopting a simple threshold policy, and we derive closed-form expressions for these thresholds.

    Since the secondary user may use a dedicated channel for its packets, the optimal threshold policy

    guarantees a bounded delay.

    The organization of the paper is as follows. In the next section, we describe the primary and the

    secondary user models. Section III presents our Markov decision process framework. In Section IV, we

study the existence of an optimal threshold policy for our opportunistic spectrum access with an energy-QoS tradeoff. We propose two learning-based protocols for the estimation of the state transition rates in Section V. Before concluding the paper and giving some perspectives, we present some numerical illustrations in Section VI.

    II. COGNITIVE RADIO NETWORK MODEL

    We consider a wireless system with N independent channels licensed to primary users. The state of

each channel n ∈ {1, . . . , N} is modeled by a time-homogeneous discrete Markov process sn(t). The


state space is {0, 1}, where sn(t) = 0 means that the channel n is free for secondary access and sn(t) = 1 means that the channel n is occupied by a primary user. The transition probabilities of the channel n are given by the following matrix:

Pn = ( αn   1 − αn
       βn   1 − βn ),

where αn = Pr(sn(t+1) = 0 | sn(t) = 0) and βn = Pr(sn(t+1) = 0 | sn(t) = 1). The transition rates evolve as illustrated in Figure 3.

The global system state, composed of the N channels, is denoted by the vector s(t) = [s1(t), ..., sN(t)] and the global state space is S = {0, 1}^N. The transition probabilities can be determined by the statistics of the primary network traffic and are assumed to be known by secondary users. We present in Section V some methods allowing the secondary user to estimate these transition probabilities on the fly.

We consider a secondary user having the possibility to access any one of the N licensed channels. The objective of the secondary user is to detect the channels that are free during a given time slot. However, waiting for an idle time slot may incur large packet delays and high energy consumption due to sensing. To overcome this, we consider an OSA mechanism that takes into account packet delay, throughput and energy consumption. Since today's wireless networks are highly heterogeneous, with mobile devices

    consisting of multiple wireless network interfaces, we assume that at any time, the secondary user has

    access to the network through another technology like 3G. This is typically the case with the 802.22

standard, in which secondary users transmit over the TV bands [12]. The secondary user prefers to transmit its packet on a licensed channel because it is cheaper than a dedicated communication, while the dedicated channel guarantees perfect access.

The goal of each secondary user is to minimize the expected delay of its packets, accounting for energy, throughput and monetary costs. In order to achieve such a goal, a secondary user has to choose, at each time slot, one of the following actions:

to remain inactive during the slot,

to sense a primary channel and to transmit if the channel is available during the time slot, else to wait for the next time slot,

or to sense a primary channel and to transmit if the channel is available during the time slot, else to use the dedicated channel.

An important contribution of ours is to consider the average transmission delay of a packet in the optimal decision. Indeed, sensing a primary channel has a cost for the secondary user. We look for an optimal sensing policy which depends on the history of observations and actions.


III. PARTIAL OBSERVATION MARKOV DECISION PROCESS FRAMEWORK

    Due to partial spectrum sensing, the global system state s(t) cannot be directly observed by a secondary

user. To overcome this difficulty, the secondary user infers the global system state based on observations, which can be summarized in a belief vector Ω(t) = {ω1(t), ..., ω_{2^N}(t)}, where ωj(t) is the conditional probability (given the observation and decision history) that the system state is s(t) = j in slot t. Since the N channels are independent, it has been proved in [13] that we can consider the following simpler belief vector:

λ(t) = [λ1(t), ..., λN(t)],

where λi(t) is the conditional probability that the channel i is available in slot t. Hence, we study the problem of OSA for the secondary user as a POMDP problem.

    A. Description of the POMDP

1) State: The state of the system at time slot t is given by (λ(t), l(t)), where l(t) is the delay of the packet held by the secondary user at time t. The delay of a new packet equals one, and it increases by one every time slot, except when the secondary user transmits the packet.

2) Action: For each time slot t and each state (λ(t), l(t)), the three possible actions are:

a(t) = 0, to remain inactive;
a(t) = 1, to sense and to transmit only if the channel is available during the time slot;
a(t) = 2, to sense and to transmit if the channel is available during the time slot, else to transmit through the dedicated channel.

3) Observation and belief: When the secondary user decides to sense (i.e. to take action a(t) ∈ {1, 2}), one channel n(t) is determined and the secondary user observes the channel occupancy state s_{n(t)}(t) ∈ {0, 1}. Let θ(t) be the observation outcome at time t, where θ(t) = 0 if the sensed channel is idle and θ(t) = 1 otherwise. The user updates the belief vector λ(t) after the observation outcome. For each channel n, the conditional probability λn(t+1) is therefore defined as follows:

λn(t+1) := Pr(sn(t+1) = 0 | a(t), θ(t)) =
  βn + (αn − βn) λn(t),  if a(t) = 0 or n ≠ n(t),
  αn,                    if a(t) ≠ 0, θ(t) = 0 and n = n(t),
  βn,                    if a(t) ≠ 0, θ(t) = 1 and n = n(t).     (1)
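For concreteness, the belief update (1) can be sketched in Python. This is a minimal illustration under our own naming conventions (the function and variable names are ours, not from the paper):

```python
import numpy as np

def update_belief(lam, alpha, beta, action, sensed=None, theta=None):
    """One-step belief update following Eq. (1).

    lam, alpha, beta: per-channel beliefs and transition rates (arrays);
    action: 0 (inactive) or 1/2 (sense); sensed: index n(t) of the sensed
    channel; theta: observation (0 = idle, 1 = busy)."""
    # Markov prediction for every channel that is not observed
    nxt = beta + (alpha - beta) * lam
    if action != 0:
        # the sensed channel is resolved exactly by the observation
        nxt[sensed] = alpha[sensed] if theta == 0 else beta[sensed]
    return nxt
```

Sensing a channel idle pins its next-slot belief to αn, since αn is precisely the probability that an idle channel stays idle; the unsensed channels simply follow the Markov prediction βn + (αn − βn)λn(t).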


Note that we can easily extend our model to sense not only one channel but a subset of the primary channels.

4) Channel choice policy: At each time slot t, based on its belief vector λ(t), the secondary user chooses a channel n(t) ∈ {1, ..., N} to be sensed. There exist several channel choice policies in the literature, such as deterministic, randomized and periodic ones (see [1]). An example of a channel choice policy is to sense the channel which has the highest probability of being idle, i.e. n(t) := arg max_n λn(t).

5) Policies: The strategy of the secondary user is defined by the probability of choosing a given action depending on the system state. We define a sensing and access policy as a vector π = [π1, π2, . . .], where πt is a mapping from a state (λ(t), l(t)) to an action a(t). The set of policies is denoted by Π. A stationary policy is a mapping that specifies for each state, independently of the time slot t, an action to be chosen. In the next section, we show that our POMDP problem has an optimal stationary policy, which allows us to restrict our attention to stationary policies.

6) Reward and costs:

Reward: Let ρ be the reward representing the number of delivered bits when the secondary user transmits its packet.

Costs: Let cs be the energy cost for sensing a primary channel, measured in monetary units. This cost depends on the action a(t) as:

cs(a(t)) = cs, if a(t) > 0,
           0,  if a(t) = 0.

The primary user and the service provider for the dedicated access charge a price for each packet transmitted. Those prices are respectively Pp for a transmission over a primary channel and P3G for a transmission over the dedicated channel.

Hence, when the secondary user successfully transmits a packet, it gets the reward zt(a(t), θ(t)), which depends on the action a(t) and the observation θ(t) as:

zt(a(t), θ(t)) = 0,        if a(t) = 0,
                 ρ − Pp,   if a(t) ≥ 1 and θ(t) = 0,
                 ρ − P3G,  if a(t) = 2 and θ(t) = 1.

In order to model the impact of the delay, we introduce an additional cost when a packet is not transmitted. This cost depends on the current delay l of the packet and is defined by the function f(l). This function is assumed to be increasing in l, in order to increase the incentive to transmit a packet that has been delayed for a long time.


Instantaneous reward: At time slot t, the instantaneous reward rt of a secondary user depends on the system state (λ(t), l(t)) and the action a(t), and is expressed by:

rt((λ(t), l(t)), a(t)) = zt(a(t), θ(t)) − f(l(t)) 1{packet not transmitted} − cs(a(t)),

where the delay penalty f(l(t)) is incurred only when the packet is not transmitted in slot t.

The problem faced by the secondary user consists of finding the sensing policy that maximizes its expected average reward, defined by:

R(π) = lim_{T→∞} (1/T) E[ Σ_{t=1}^{T} rt((λ(t), l(t)), a(t)) | λ(0) ],

where λ(0) is the initial belief vector. Our objective is then to find an optimal sensing policy π* that maximizes the average reward R(π), i.e.:

π* = arg max_{π∈Π} lim_{T→∞} (1/T) E[ Σ_{t=1}^{T} rt((λ(t), l(t)), a(t)) | λ(0) ].     (2)

In some particular MDP and POMDP problems, we are able to determine an optimal policy within the smaller set of stationary policies. We prove in the following proposition that there exists an average-reward optimal stationary policy for our POMDP problem.

    Proposition 1: There exists an average optimal stationary policy for our POMDP formulation described

    in (2).

    Proof: see Appendix A.

Given this result, we can restrict our problem to the set ΠS of stationary policies. For the remainder of this paper, we thus omit the time index t and look for an optimal sensing policy that maps a system state (λ, l) to an action a, independently of the time slot t. We now make a first analysis of the value function of the POMDP.

We denote by Λns(λ) the function that updates the belief vector λ when the user chooses to be inactive in the current slot, i.e. when the secondary user takes action 0. The function Λs(λ|θ) updates the belief vector λ when the secondary user senses a licensed channel in the current slot and observes θ, i.e. when the secondary user takes action 1 or 2.

The value function is denoted V(λ, l). Let us denote by Qa(λ, l) the action-value function of taking the action a in the current slot when the information state is (λ, l). The value function is then expressed by

gu + V(λ, l) = max_{a∈A} Qa(λ, l),     (3)

where gu is a constant, and the optimal action is given by

a*(λ, l) = arg max_{a∈A} Qa(λ, l).     (4)


We determine the action-value function for each action 0, 1 and 2. When the secondary user decides to wait, i.e. to take action a = 0, we have:

Q0(λ, l) = −f(l) + V(Λns(λ), l + 1).     (5)

When the secondary user chooses to sense the channel n and decides to wait for the next time slot if the channel n is busy, i.e. to take action 1, we have:

Q1(λ, l) = −cs + λn (ρ − Pp + V(Λs(λ|θ = 0), 1)) + (1 − λn)(−f(l) + V(Λs(λ|θ = 1), l + 1)).     (6)

When the secondary user chooses to sense the channel n and to transmit using the dedicated channel if the channel n is busy, i.e. to take action 2, we have:

Q2(λ, l) = −cs + λn (ρ − Pp + V(Λs(λ|θ = 0), 1)) + (1 − λn)(ρ − P3G + V(Λs(λ|θ = 1), 1)).     (7)
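To make Eqs. (5)-(7) concrete, here is a Python sketch of the three action values for the single-channel case, where sensing collapses the belief to α (idle observed) or β (busy observed). The value function V is passed in as a callable (e.g. an interpolated approximation); all names are ours, and the block is an illustration under these assumptions, not the authors' implementation:

```python
import numpy as np

def action_values(lam, l, V, alpha, beta, rho, Pp, P3G, cs, f):
    """Q0, Q1, Q2 of Eqs. (5)-(7), single-channel case."""
    lam_ns = beta + (alpha - beta) * lam            # belief after staying idle
    q0 = -f(l) + V(lam_ns, l + 1)
    q1 = (-cs + lam * (rho - Pp + V(alpha, 1))
          + (1 - lam) * (-f(l) + V(beta, l + 1)))   # wait if busy
    q2 = (-cs + lam * (rho - Pp + V(alpha, 1))
          + (1 - lam) * (rho - P3G + V(beta, 1)))   # dedicated channel if busy
    return q0, q1, q2

def best_action(lam, l, **kw):
    """Greedy action of Eq. (4) with respect to the supplied V."""
    return int(np.argmax(action_values(lam, l, **kw)))
```

For instance, with the prices used later in the paper (P3G = 80, Pp = 10, ρ = 35, cs = 5), a zero value function and a high belief, sensing and then waiting on a busy observation (action 1) dominates the other two actions.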

We focus on the case of one licensed channel; the multichannel case will be studied in Section III-C. We make the assumption that there exists a packet delay l̄ such that the secondary user transmits its packet using the dedicated channel if the observation is θ = 1. This assumption is rather realistic, as the user has no interest in keeping the packet in its buffer indefinitely. We denote by α and β the transition rates of the channel, and by λ the belief of the secondary user. We consider that α ≥ β. When α ≤ β, the analysis is similar and the results are unchanged.

    B. The single channel model

Let us focus on the belief update function Λns.

Lemma 1: We have the following properties of the belief update function Λns.
1) The update function Λns(λ) is increasing in the belief λ.
2) We have the following equivalences:

Λns(λ) ≥ λ if and only if λ ≤ π(0),

and

Λns(λ) ≤ λ if and only if λ ≥ π(0),

where π(0) = β / (1 − α + β) is the stationary probability that the primary channel is idle. Figure 4 depicts the belief evolution.

Proof: See Appendix B.
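Lemma 1 says that, without observations, the belief is driven monotonically toward the stationary idle probability π(0) = β/(1 − α + β). A quick numerical check (the rates 0.8 and 0.2 are illustrative values of ours, not from the paper):

```python
def stationary_idle(alpha, beta):
    """pi(0) = beta / (1 - alpha + beta)."""
    return beta / (1 - alpha + beta)

def ns_update(lam, alpha, beta):
    """Belief update when staying inactive: Lambda_ns(lam)."""
    return beta + (alpha - beta) * lam

alpha, beta = 0.8, 0.2          # alpha >= beta, as assumed in the text
traj, lam = [], 0.1             # start below pi(0) = 0.5
for _ in range(20):
    traj.append(lam)
    lam = ns_update(lam, alpha, beta)
```

Starting below π(0), the sequence increases strictly at every step and converges geometrically (at rate α − β) to π(0), exactly as the lemma predicts.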


It has been shown in [15] that the value function of a POMDP over a finite time horizon is piecewise linear and convex with respect to the belief vector. In Proposition 2, we show that the value function of our POMDP problem over an infinite horizon with the average criterion also has this property.

Proposition 2: The value function V(λ, l) given in (3) is piecewise linear and convex with respect to the belief vector λ.

    Proof: See Appendix C.

Note that monotonicity results help us establish the structure of the optimal policies (see [16] for an example) and provide insights into the underlying problem. The following propositions state monotonicity results of the value function with respect to each of its parameters.

Proposition 3: For each belief vector λ, the value function is monotonically decreasing with the packet delay l, i.e. V(λ, l) ≥ V(λ, l′) for l ≤ l′.

Proof: See Appendix D.

This result is intuitive: for the same belief, the maximum expected remaining reward that can be accrued with a given packet delay is lower than the one the secondary user can get with a smaller packet delay.

Proposition 4: The value function is monotonically increasing with the belief vector λ, i.e. V(λ, l) ≤ V(λ′, l) for λ ≤ λ′.

Proof: See Appendix F.

Again, this result is somewhat intuitive: for the same packet delay, a higher belief vector yields a higher maximum expected remaining reward.

    Given all the previous results on the value function V (, l), we are able to show the existence of

    an optimal sensing policy for our POMDP problem. Moreover, we determine explicitly the threshold

    structure of such optimal policy.

    C. The multichannel model

Lemma 1 holds for the multichannel model. In fact, if λ1 ≤ λ2 componentwise, then λ1,n ≤ λ2,n and Λns(λ1,n) ≤ Λns(λ2,n) for every channel n, and therefore Λns(λ1) ≤ Λns(λ2). Second, if λn ≤ π(0) for every channel n, then Λns(λn) ≥ λn, and thus Λns(λ) ≥ λ. Otherwise, we have Λns(λ) ≤ λ.

Proposition 2 can be straightforwardly extended to the multichannel model. Furthermore, we studied in Proposition 3 the monotonicity of the value function with respect to the packet delay for a fixed belief value. This proposition can also be extended to the multichannel model.



Let us focus on Proposition 4. The monotonicity with respect to the belief vector depends on the order relation over the belief set and also on the monotonicity of the belief update functions Λs(λ|θ = 0) and Λs(λ|θ = 1) with respect to the belief vector.

    IV. OPTIMAL THRESHOLD POLICY

Let us focus on the characteristics of an optimal policy for the secondary user. Intuitively, when the delay l and the belief probability λ are small, the secondary user waits for a better opportunity. Thus, depending on the belief probability, the secondary user makes the decision to sense a primary channel or not. We prove in this section that this intuition is true: there exists an optimal sensing policy which has a threshold structure.

The first decision for a secondary user is whether to sense licensed channels or to wait, depending on its belief λ and the current delay l of the packet. The following result gives us a threshold on the belief probability that answers this question.

Proposition 5: For all packet delays l, the optimal action for the secondary user is to wait for the next slot, i.e. a*(λ, l) = 0, if and only if λ ≤ τ*(l), where τ*(l) is the solution of the equation λ = max(0, min{Th1(λ, l), Th2(λ, l)}) with

Th1(λ, l) = [V(Λns(λ), l + 1) − V(β, l + 1) + Cs] / [f(l) + ρ − Pp + V(α, 1) − V(β, l + 1)], and

Th2(λ, l) = [V(Λns(λ), l + 1) − V(β, 1) + Cs − f(l) − ρ + P3G] / [−Pp + V(α, 1) + P3G − V(β, 1)],

where we use the fact that, in the single channel model, Λs(λ|θ = 0) = α and Λs(λ|θ = 1) = β.

Proof: see Appendix G.

This proposition gives us a necessary and sufficient condition for the use of action 0, depending on the belief probability λ. Consequently, if λ > τ*(l) then the optimal action is to sense a primary channel, i.e. a*(λ, l) ≠ 0. Furthermore, we have the following property of the optimal policy.

Proposition 6: For all λ > π(0) and all delays l, the secondary user never takes action 0, and thus Q0(λ, l) < max(Q1(λ, l), Q2(λ, l)); in particular, the threshold τ*(l) never exceeds π(0). Furthermore, we have the following result about the use of the dedicated channel.



Proposition 7: For all beliefs λ, the secondary user chooses to use the dedicated channel instead of waiting for the next slot if and only if the delay l of the current packet verifies:

f(l) + ρ − P3G − V(β, l + 1) + V(β, 1) > 0.

    Proof: See Appendix I.

We note that this expression depends neither on the sensing cost Cs nor on the belief λ. This is natural, as the expression determines the best action to take after having sensed a channel. We conclude with a last property of the optimal threshold policy.

Corollary 1 (Never Wait After Sensing): If, for all l, the penalty cost −f(l) is lower than ρ − P3G, then the secondary user transmits on the dedicated channel whenever the sensed channel is not idle.

Proof: See Appendix J.

This result is also somewhat intuitive. In fact, when the secondary user senses the channel as busy, it gets ρ − P3G as reward if it uses the dedicated channel, whereas it incurs the penalty −f(l) if it decides to wait. Thus, if ρ − P3G + f(l) is positive, the secondary user has no incentive to wait after sensing the licensed channel.
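Proposition 7 and Corollary 1 translate directly into a decision rule. A hedged sketch follows (V is supplied by the caller, and the function name and the test values, including f(l) = 10·l and a zero value function, are our own assumptions):

```python
def prefer_dedicated(l, f, rho, P3G, V, beta):
    """Proposition 7: after sensing a busy channel, transmit on the dedicated
    channel rather than wait iff f(l) + rho - P3G - V(beta, l+1) + V(beta, 1) > 0."""
    return f(l) + rho - P3G - V(beta, l + 1) + V(beta, 1) > 0
```

With a zero value function, the paper's prices (ρ = 35, P3G = 80) and f(l) = 10·l, the rule flips at l = 5: below that delay the user waits, from l = 5 on it pays for the dedicated channel. The smallest such l plays the role of the maximum delay l̄.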

In all these results, the optimal sensing policy depends on the transition rates α and β of the primary user activity. In the literature, those parameters are assumed to be known by the secondary user. We focus in the next section on online learning algorithms that allow the secondary user to estimate those rates on the fly.

V. ONLINE LEARNING OF PRIMARY USER ACTIVITY

We proved that the secondary user has an optimal energy-delay constrained policy given perfect knowledge of the channel transition rates. However, in practice, information such as the transition rates α and β is not available to the secondary user. In this section, we consider a model where the secondary user does not have external information about the state transition rates. We present two learning-based protocols for the secondary user to estimate the primary channel dynamics: a rate estimator, and a transition matrix estimator.

    A. Rate Estimator

In this approach, the secondary user begins with initial arbitrary values of α̂ and β̂. The secondary user updates them every time slot, depending on the information about the system state. The secondary user then computes its sensing policy based on the estimators α̂ = {α̂1, ..., α̂N} and β̂ = {β̂1, ..., β̂N}, where α̂i (resp. β̂i) is the estimator of αi (resp. βi).



First, the secondary user estimates αi, which is the probability that the channel i will be sensed idle given that it was idle in the previous slot. Second, the secondary user estimates πi(0), the stationary probability for this channel to be idle. The secondary user obtains the estimated value of βi from the relation β̂i = (1 − α̂i) π̂i(0) / (1 − π̂i(0)).

Formally, we consider the following counting processes for the estimation of αi and πi(0):

The vector K = {K1, ..., KN}, where Ki represents the number of time slots a channel stays in the idle state, i.e. Ki is incremented if the channel i is sensed and is idle at both time slots t and t − 1.

The vector I = {I1, ..., IN}, where Ii represents the number of time slots in which the channel is sensed and is idle.

The vector M = {M1, ..., MN}, where Mi represents the number of time slots in which the channel is sensed.

Therefore, the secondary user estimates the state transition rates and πi(0) through the expressions α̂i = Ki/Ii and π̂i(0) = Ii/Mi.

    B. Transition Matrices Estimator

The convergence of the previous estimators α̂ and β̂ depends on the occurrence of two successive sensing actions on the same channel. The secondary user may not frequently sense the same channel in two successive time slots. Therefore, the previous learning mechanism converges slowly. We present, in this section, a learning protocol which estimates the transition matrices. We define the set of transition matrices {Pi(0), Pi(1), ...}, where Pi(j) is the transition matrix of the channel i when this channel was not sensed during j consecutive slots. For example, if the channel i was sensed j slots before as idle, the current belief on the state of this channel is (1, 0) · Pi(j). As for the rate estimator, the transition matrices are estimated using a counting process. The previous learning protocol is somehow a particular case of this approach. In fact, estimating α and β is equivalent to estimating the set of transition matrices for channels sensed in the previous slot, {P1(0), ..., PN(0)}. Therefore, this learning-based protocol gives a more accurate estimation of the primary user activity. However, it needs more space and computational complexity compared to the rate estimator method.
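The link between the two protocols can be checked numerically: for a channel unobserved for k slots, multiplying the one-step matrix k times reproduces k applications of the scalar update β + (α − β)λ, so the multi-slot matrices are powers of the one-step matrix. A small sanity check (the rates are illustrative values of ours):

```python
import numpy as np

alpha, beta = 0.8, 0.2
P = np.array([[alpha, 1 - alpha],   # one-step transition matrix, state 0 = idle
              [beta,  1 - beta]])

k = 5
# belief after k unobserved slots, starting from "observed idle" = (1, 0)
b_matrix = (np.array([1.0, 0.0]) @ np.linalg.matrix_power(P, k))[0]

lam = 1.0
for _ in range(k):                  # k applications of Lambda_ns
    lam = beta + (alpha - beta) * lam
```

Estimating the family {Pi(j)} thus amounts to estimating powers of Pi(0); the matrix estimator trades memory for faster convergence, exactly as argued above.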

    VI. NUMERIC ILLUSTRATIONS

We illustrate our results through simulations of the system over a large number of packets (we consider 3000 packets). It was shown in [17] that, in practice, the average number of available primary channels is about 15. However, we consider only 4 i.i.d. primary channels, i.e. N = 4, due to the



exponential state space (with 4 primary channels, we have approximately 10^6 states). Furthermore, we consider the following system parameters: P3G = 80, Pp = 10, cs = 5 and ρ = 35.

We propose to illustrate our results in three scenarios with symmetric channels:

1) Scenario 1: Primary channels are often occupied (α1 = α2 = α3 = α4 = 0.15 and β1 = β2 = β3 = β4 = 0.1),

2) Scenario 2: Primary channels are often idle (α1 = α2 = α3 = α4 = 0.85 and β1 = β2 = β3 = β4 = 0.7),

3) Scenario 3: Primary channels have low transition rates (α1 = α2 = α3 = α4 = 0.95 and β1 = β2 = β3 = β4 = 0.05). This last scenario is realistic if we consider TV white space [17].
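The qualitative labels of the three scenarios can be checked from the stationary distribution of each channel. Assuming the convention, inferred from the scenario descriptions, that α is the probability of staying idle and β the probability of moving from occupied to idle, the stationary idle probability is β/(1 − α + β):

```python
def stationary_idle(alpha, beta):
    """Stationary probability that a two-state Markov channel is idle,
    assuming alpha = P(idle -> idle) and beta = P(busy -> idle)."""
    return beta / (1.0 - alpha + beta)

# Symmetric channels: one (alpha, beta) pair per scenario.
scenarios = {
    "Scenario 1 (often occupied)": (0.15, 0.10),
    "Scenario 2 (often idle)":     (0.85, 0.70),
    "Scenario 3 (static)":         (0.95, 0.05),
}
for name, (a, b) in scenarios.items():
    print(name, round(stationary_idle(a, b), 3))
```

Under this assumed convention, the idle probabilities are roughly 0.105, 0.824 and 0.5, matching the "often occupied", "often idle" and "static" descriptions.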

We first describe the optimal threshold policy given perfect knowledge of the transition rates of the primary channels. Second, we give some results using estimated values of the transition rates.

A. Single channel model

We consider only one licensed channel, with transition rates α = 0.15 and β = 0.1. Figure 5 illustrates the optimal policy of the secondary user depending on the belief and the packet delay. For each packet delay, the secondary user has a threshold policy depending on the belief. Moreover, the threshold belief probability is decreasing with the packet delay. We observe that the maximum packet delay is 13 slots.

Consider the same scenario with transition rates α = 0.2 and β = 0.25. We observe in Figure 6 that the secondary user policy also has a threshold structure. A packet has at most a delay of 3 slots.
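The threshold behavior of Figure 5 can be reproduced qualitatively with a discretized relative value iteration. The sketch below is our reconstruction of the single-channel model, not the paper's exact setting: the belief grid, the delay cap L, the linear penalty f(l) = l, and the convention α = P(idle → idle), β = P(busy → idle) are all assumptions; actions are 0 = wait, 1 = sense and transmit on the primary channel if idle, 2 = sense and use the dedicated (3G) channel if the primary channel is busy.

```python
import numpy as np

alpha, beta = 0.15, 0.10               # assumed transition-rate convention
R, P3G, Pp, cs = 35.0, 80.0, 10.0, 5.0 # system parameters from Section VI
f = lambda l: float(l)                 # assumed increasing delay penalty
L = 15                                 # delay cap (sketch only)
grid = np.linspace(0.0, 1.0, 201)      # discretized belief
snap = lambda w: int(round(w * 200))   # nearest grid index
ns_idx = np.array([snap(beta + (alpha - beta) * w) for w in grid])
ia, ib = snap(alpha), snap(beta)       # beliefs after sensing idle / busy

h = np.zeros((len(grid), L + 1))       # relative value function h(belief, delay)
q = np.zeros((3, len(grid), L + 1))
for _ in range(2000):                  # relative value iteration (average reward)
    for li in range(1, L + 1):
        ln = min(li + 1, L)
        q[0, :, li] = -f(li) + h[ns_idx, ln]
        q[1, :, li] = -cs + grid * (R - Pp + h[ia, 1]) \
                          + (1 - grid) * (-f(li) + h[ib, ln])
        q[2, :, li] = -cs + grid * (R - Pp + h[ia, 1]) \
                          + (1 - grid) * (R - P3G + h[ib, 1])
    hn = q.max(axis=0)
    hn -= hn[snap(0.5), 1]             # normalize at a reference state
    if np.abs(hn - h)[:, 1:].max() < 1e-9:
        h = hn
        break
    h = hn
policy = q.argmax(axis=0)              # best action for each (belief, delay) pair
```

Plotting `policy` against belief and delay should exhibit, for each delay, a belief threshold between waiting and sensing, qualitatively as in Figure 5.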

B. Optimal policy with perfect knowledge of α and β

We simulate the first scenario and depict in Figure 7 the thresholds determined in Proposition 5, depending on the packet delay l. For each packet delay l, the best action for the secondary user is to wait for the next slot if its belief probability is lower than the threshold. Otherwise, the secondary user decides to sense the primary channels. In this context, where the primary channels are often occupied (Scenario 1, Figure 7), the maximum packet delay l̄ obtained with Proposition 7 equals 9. Then, when the packet delay is l = 9, the user decides to sense and to transmit using the dedicated channel if the sensed channel is occupied. We describe the optimal policy for Scenario 2 in Figure 8. The maximum packet delay in this case is l̄ = 5. This result is intuitive since, in this scenario, the primary channels are more often idle, inducing a lower packet delay. Finally, the last scenario, depicted in Figure 9, yields a maximum packet delay of 5. We observe that the secondary user policy also has a threshold


structure. However, the threshold belief probability is not decreasing with the packet delay. In fact, since the primary channels are more static (the probability for each channel to stay occupied or idle is high enough), a kind of periodic threshold strategy appears.

C. Average reward using estimated values of α and β

We consider the learning approaches proposed in Section V. Let us first compare the average reward and the average delay obtained with the two learning-based protocols to those obtained with perfect knowledge of the channels' transition rates. Figures 10 and 11 show that both learning protocols converge: we observe that both protocols converge within 400 iterations. However, in Figures 12 and 13, we can observe that the transition matrices estimation method converges 3 times faster (about 1000 iterations) than the rate estimators method (about 3000 iterations). Moreover, the average reward and the average packet delay obtained using the estimated transition rates are close to those obtained with known channel transition rates.

VII. CONCLUSION AND PERSPECTIVES

In this paper, we have used a POMDP framework to determine an optimal sensing policy for opportunistic spectrum sensing and access (OSA), taking into account an energy-delay tradeoff for secondary users. Introducing a QoS metric in the spectrum sensing policy is very important with the emergence of heterogeneous mobiles that are able to transmit their traffic, with possibly high QoS constraints, at any time over different communication technologies like 3G, WiFi and TV White Space. We have provided some structural properties of the value function and then proved the existence of an optimal average-reward stationary spectrum sensing policy. We have been able to determine explicitly the threshold structure of the optimal policy. The interaction between several secondary users has not been considered here, and only rarely in the literature. This perspective is also very important because, if the channel choice policy is the same for all the secondary users, there could be many collisions between secondary users that have sensed the same idle primary channel. This decentralized system with partial information can be modeled using decentralized POMDPs or interactive POMDPs and will be studied in future work.

APPENDIX

A. Proof of Proposition 1

We use Theorems 8.10.9 and 8.10.7 from [14] to prove the existence of an optimal stationary policy for our problem. First, the immediate reward rt((s, l), a) is finite, i.e. −∞ < rt((s, l), a) < +∞


(as all costs and rewards are finite). Second, we prove that there exists a stationary policy d for which the derived Markov chain is positive recurrent.

Let us focus on the following belief vector:

Ω0 = (Ω1, Ω2, ..., ΩN) such that Ωj = ns^(j)(Ω | Θ = 0), for j = 1, . . . , N,

where Ωj represents the belief of a channel that was not sensed for j successive slots.

Denote by d the stationary policy which senses licensed channels at every slot, with a periodic channel choice policy. Let us prove that the derived Markov chain is positive recurrent. The probability that the system returns to the initial belief from any state is p(Ω) = ∏_{n=1}^{N} (1 − Ωn) > 0, so the return time to the initial belief follows a geometric distribution with E{τ} = 1/p(Ω), and therefore all states are positive recurrent under d.

Third, let us prove that g_d > −∞ and that the set {b ∈ Sb : rt((s, l), a) > g_d for some a ∈ A} is finite and nonempty. As the policy d senses licensed channels at every slot, g_d = E[−f(l(t)) − cs + (f(l(t)) + R − Pp)Ωn]. If we have

−f(l(t)) − cs + (f(l(t)) + R − Pp)Ωn > max{−f(l(t)), −cs + R − P3G + (P3G − Pp)Ωn}

for all beliefs b, then the policy that always senses primary channels is optimal and we have achieved our goal. Otherwise, the set {b ∈ Sb : rt((s, l), a) > g_d for some a ∈ A} is finite and nonempty. Finally, we obtain from Theorems 8.10.9 and 8.10.7 of [14] that there exists an average optimal stationary policy.

B. Proof of Lemma 1

First, the update function ns is linear in the belief, because ns(Ω) = β + (α − β)Ω. As we consider the case where α ≥ β, the update function is increasing with the belief.

Second, let us prove that ns(Ω) ≥ Ω if Ω ≤ π(0), by induction on the belief:

1) We have the initial condition: π(0) = β/(1 − α + β) and ns(Ω) = β + (α − β)Ω ≥ Ω.
2) We assume that ns(Ω) ≥ Ω for a given Ω ≤ π(0).
3) The induction operator gives: ns(ns(Ω)) = β + (α − β)ns(Ω) ≥ β + (α − β)Ω = ns(Ω).

Thus, ns(Ω) ≥ Ω for all Ω ≤ π(0). The analysis for Ω ≥ π(0) is similar.
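The fixed point π(0) = β/(1 − α + β) of the update ns(Ω) = β + (α − β)Ω, and the monotone convergence used in the induction above, can be checked numerically; a small sketch under the α ≥ β assumption:

```python
def ns(omega, alpha, beta):
    """Belief update when the channel is not sensed: ns(w) = beta + (alpha - beta) * w."""
    return beta + (alpha - beta) * omega

def fixed_point(alpha, beta):
    """pi(0) = beta / (1 - alpha + beta), the unique fixed point of ns."""
    return beta / (1.0 - alpha + beta)

alpha, beta = 0.8, 0.3            # any pair with alpha >= beta (Lemma 1's assumption)
omega = 0.0                       # start below pi(0)
for _ in range(200):
    nxt = ns(omega, alpha, beta)
    assert nxt >= omega           # ns(w) >= w below the fixed point, as in Lemma 1
    omega = nxt
# omega has now converged to pi(0) = 0.3 / 0.5 = 0.6
```

The map contracts with factor |α − β| < 1, so the iterates converge geometrically to π(0) from any starting belief.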


C. Proof of Proposition 2

The proof of Proposition 2 is similar to [15], where the authors consider the finite-time-horizon problem. Hence, we briefly describe the procedure of the proof. Considering the maximum packet delay l̄ and any belief vector Ω, the value function V(Ω, l̄) is linear in the belief because

V(Ω, l̄) = Q2(Ω, l̄) − gu
= −gu − cs + R − P3G + V(Ω^s(Ω|Θ = 1), 1) + Ωn(P3G − Pp + V(Ω^s(Ω|Θ = 0), 1) − V(Ω^s(Ω|Θ = 1), 1)).

Then the value function V(Ω, l̄) can be rewritten as an inner product of the belief vector and a γ-vector. As Q2(Ω, l) = Q2(Ω, l̄) for all l, the action-value function Q2(Ω, l) can also be rewritten as an inner product of the belief vector and a γ-vector. We suppose that Proposition 2 holds for all packet delays higher than l + 1, and we prove that the proposition is true for packet delay l. After some algebra, we can rewrite the action-value functions given in (5) and (7) in terms of γ-vectors:

Q0(Ω, l) = −f(l) + max_{γ_{l+1}} ⟨ns(Ω|·), γ_{l+1}⟩ = −f(l) + Σ_{s∈S} Ωs [Σ_{s′∈S} P(s′|s) γ^{ns(Ω|·)}_{l+1}(s′)],   (8)

and

Q1(Ω, l) = −cs + Ωn(R − Pp + V(Ω^s(Ω|Θ = 0), 1)) + (1 − Ωn)(−f(l) + max_{γ_{l+1}} ⟨Ω^s(Ω|Θ = 1), γ_{l+1}⟩)
= −cs + Ωn(R − Pp + V(Ω^s(Ω|Θ = 0), 1)) + (1 − Ωn)(−f(l) + Σ_{s∈S} Ωs [Σ_{s′∈S} P(s′|s) γ^{Ω^s(Ω|Θ=1)}_{l+1}(s′)]),   (9)

where γ^{ns(Ω|·)}_{l+1} and γ^{Ω^s(Ω|Θ=1)}_{l+1} are the γ-vectors of the regions containing the belief vectors ns(Ω|·) and Ω^s(Ω|Θ = 1), respectively. Each term in the square brackets of (8) and (9) is an element γ_l(s) of a γ-vector γ_l. Then the action-value functions can be rewritten as an inner product of the belief vector and a γ-vector γ_l. Moreover, there is only a finite number of such γ-vectors γ_l, since we have a finite set of beliefs for each l. As the maximum of a finite set of piecewise linear and convex functions is also piecewise linear and convex, Proposition 2 holds.

D. Proof of Proposition 3

Let us first prove that the value function V(Ω, l) is monotonically decreasing with the packet delay l for all belief vectors Ω. The secondary user takes action 2 for all Ω when the packet delay is l̄; thus,


we have:

V(Ω, l̄) = −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(R − P3G + V(β, 1)) − gu.

The secondary user chooses the action that maximizes its average utility, and thus:

V(Ω, l̄ − 1) = max_a Qa(Ω, l̄ − 1) − gu ≥ Q2(Ω, l̄ − 1) − gu
= −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(R − P3G + V(β, 1)) − gu
= V(Ω, l̄).

Let us prove that this property holds for all packet delays using a backward induction on l:

1) Initial condition: for all belief vectors Ω, V(Ω, l̄) ≤ V(Ω, l̄ − 1).
2) We suppose that V(Ω, l + 2) ≤ V(Ω, l + 1) for all Ω.
3) We have:

Q0(Ω, l) = −f(l) + V(ns(Ω|·), l + 1)
≥ −f(l + 1) + V(ns(Ω|·), l + 2)
= Q0(Ω, l + 1).

Q1(Ω, l) = −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−f(l) + V(β, l + 1))
≥ −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−f(l + 1) + V(β, l + 2))
= Q1(Ω, l + 1).

Q2(Ω, l) = −cs + R − P3G + V(β, 1) + Ω(P3G − Pp + V(α, 1) − V(β, 1))
= Q2(Ω, l + 1).

The inequalities come from the induction assumption and the monotonicity of the penalty function f(l). Thus, we have:

∀Ω, V(Ω, l) ≥ V(Ω, l + 1).

The value function is therefore decreasing with the packet delay.

Lemma 2: We have the following inequality:

−Pp + V(α, 1) ≥ −P3G + V(β, 1).


E. Proof of Lemma 2

We prove this lemma by contradiction, so we suppose that −Pp + V(α, 1) < −P3G + V(β, 1). We first prove the following:

gu + V(α, 1) ≥ Q2(α, 1),
gu + V(α, 1) ≥ −cs + α(R − Pp + V(α, 1)) + (1 − α)(R − P3G + V(β, 1)),
gu + V(α, 1) > −cs + R − Pp + V(α, 1),
gu > R − cs − Pp,

and we take the assumption that the immediate reward when the channel is idle is positive, i.e. R − cs − Pp ≥ 0.

We know that the secondary user takes action 2 in the state (Ω, l̄) for all belief vectors Ω, i.e. a(Ω, l̄) = 2. We have:

gu + V(Ω, l̄) = −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(R − P3G + V(β, 1)).

Let us focus on the packet delay l̄ − 1. If Ω ≤ π(0), we have:

Q0(Ω, l̄ − 1) = −f(l̄ − 1) + V(ns(Ω), l̄)
= −gu − f(l̄ − 1) − cs + ns(Ω)(R − Pp + V(α, 1)) + (1 − ns(Ω))(R − P3G + V(β, 1))
= V(Ω, l̄) − f(l̄ − 1) + (ns(Ω) − Ω)(P3G − Pp + V(α, 1) − V(β, 1))
< V(Ω, l̄).

The inequality is due to the assumption that −Pp + V(α, 1) < −P3G + V(β, 1), the fact that ns(Ω) ≥ Ω, and the positivity of f(l̄ − 1). As the value function V(Ω, l) is decreasing with the packet delay l (see Proposition 3), we get Q0(Ω, l̄ − 1) < V(Ω, l̄) ≤ V(Ω, l̄ − 1). As we proved that gu ≥ 0, the secondary user does not take action 0 when the packet delay is l̄ − 1. For action 1, we have:

Q1(Ω, l̄ − 1) = −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−f(l̄ − 1) + V(β, l̄))
= −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−gu − f(l̄ − 1) − cs + β(R − Pp + V(α, 1)) + (1 − β)(R − P3G + V(β, 1)))
< −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−gu − f(l̄ − 1) − cs + R − P3G + V(β, 1))
< −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(R − P3G + V(β, 1))
= Q2(Ω, l̄ − 1).


The first inequality is due to the assumption that −Pp + V(α, 1) < −P3G + V(β, 1), and the second one holds because gu, f(l̄ − 1) and cs are positive. Thus, the optimal strategy is to take action 2 when the packet delay is l̄ − 1.

Let us now prove by backward induction on l that the optimal action is action 2 for all belief vectors Ω ≤ π(0). If the secondary user takes action 2 when the packet delay is l̄, then it also takes action 2 when the packet delay is l̄ − 1. We suppose that the secondary user takes action 2 when the packet delay is l′ < l̄ − 1. We have the following inequalities:

Q0(Ω, l′ − 1) = −f(l′ − 1) + V(ns(Ω), l′)
= −gu − f(l′ − 1) − cs + ns(Ω)(R − Pp + V(α, 1)) + (1 − ns(Ω))(R − P3G + V(β, 1))
= V(Ω, l′) − f(l′ − 1) + (ns(Ω) − Ω)(P3G − Pp + V(α, 1) − V(β, 1))
< V(Ω, l′).

The inequality is due to the assumption that −Pp + V(α, 1) < −P3G + V(β, 1), the fact that ns(Ω) ≥ Ω, and the positivity of f(l′ − 1). As the value function is decreasing with the packet delay (see Proposition 3), Q0(Ω, l′ − 1) < V(Ω, l′ − 1) + gu, i.e. the secondary user does not take action 0 with packet delay l′ − 1.

Q1(Ω, l′ − 1) = −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−f(l′ − 1) + V(β, l′))
= −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−gu − f(l′ − 1) − cs + β(R − Pp + V(α, 1)) + (1 − β)(R − P3G + V(β, 1)))
< −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−gu − f(l′ − 1) − cs + R − P3G + V(β, 1))
< −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(R − P3G + V(β, 1))
= Q2(Ω, l′ − 1).

The first inequality is due to the assumption that −Pp + V(α, 1) < −P3G + V(β, 1), and the second one holds because gu, f(l′ − 1) and cs are positive. Thus, the optimal strategy is to take action 2 when the packet delay is l′ − 1, and the secondary user does not take action 1 with packet delay l′ − 1. Finally, the secondary user takes action 2 for all packet delays and all beliefs lower than π(0).


We now look at the action-value function Q2(α, 1) when the packet delay is l = 1:

Q2(α, 1) = −cs + α(R − Pp + V(α, 1)) + (1 − α)(R − P3G + V(β, 1)),
Q2(α, 1) = −cs + R − P3G + V(β, 1) + α(P3G − Pp + V(α, 1) − V(β, 1)),
gu + Q2(α, 1) = gu + V(α, 1) + R − Pp − cs + (α − 1)(P3G − Pp + V(α, 1) − V(β, 1)).

As the secondary user also takes action 2 in the state (β, 1), we have:

gu + V(β, 1) = −cs + β(R − Pp + V(α, 1)) + (1 − β)(R − P3G + V(β, 1)),
gu + V(β, 1) = −cs + R − P3G + V(β, 1) + β(P3G − Pp + V(α, 1) − V(β, 1)),
gu = −cs + R − P3G + β(P3G − Pp + V(α, 1) − V(β, 1)).

Thus, we obtain:

Q2(α, 1) − gu − V(α, 1) = P3G − Pp + (α − β − 1)(P3G − Pp + V(α, 1) − V(β, 1)).

As we assumed that P3G − Pp + V(α, 1) − V(β, 1) < 0, and since P3G > Pp and α − β − 1 < 0, we obtain V(α, 1) + gu ≤ Q2(α, 1); therefore, the secondary user also takes action 2 in the state (α, 1). Then we get:

gu + V(α, 1) = Q2(α, 1) = −cs + α(R − Pp + V(α, 1)) + (1 − α)(R − P3G + V(β, 1)).

Let us finally evaluate the difference V(α, 1) − V(β, 1):

V(α, 1) − V(β, 1) = (α − β)(P3G − Pp + V(α, 1) − V(β, 1)),
V(α, 1) − V(β, 1) < 0,

and

V(α, 1) − V(β, 1) = (α − β)(P3G − Pp + V(α, 1) − V(β, 1)),
(V(α, 1) − V(β, 1))(1 − α + β) = (α − β)(P3G − Pp),
V(α, 1) − V(β, 1) = (α − β)(P3G − Pp) / (1 − α + β) > 0,

which leads to a contradiction; therefore, −Pp + V(α, 1) ≥ −P3G + V(β, 1). The analysis is similar when Ω > π(0).


F. Proof of Proposition 4

Let us prove that the value function V(Ω, l) is increasing with the belief vector Ω for any packet delay l. For all Ω1 ≤ Ω2, we have:

V(Ω1, l̄) = −gu − cs + R − P3G + V(β, 1) + Ω1(P3G − Pp + V(α, 1) − V(β, 1))
≤ −gu − cs + R − P3G + V(β, 1) + Ω2(P3G − Pp + V(α, 1) − V(β, 1))
= V(Ω2, l̄).

This inequality results from Lemma 2. Let us prove that this property holds for all packet delays l using backward induction:

Initial condition: there exists a packet delay l̄ such that V(Ω1, l̄) ≤ V(Ω2, l̄) for all Ω1 ≤ Ω2. We suppose that V(Ω1, l + 1) ≤ V(Ω2, l + 1) for all Ω1 ≤ Ω2.

First case: We assume that R + f(l) − Pp + V(α, 1) − V(β, l + 1) ≥ 0; then:

Q0(Ω1, l) = −f(l) + V(ns(Ω1|·), l + 1)
≤ −f(l) + V(ns(Ω2|·), l + 1)
= Q0(Ω2, l).

The inequality is a direct result of the induction assumption and Lemma 1. We also have:

Q1(Ω1, l) = −cs − f(l) + V(β, l + 1) + Ω1(R + f(l) − Pp + V(α, 1) − V(β, l + 1))
≤ −cs − f(l) + V(β, l + 1) + Ω2(R + f(l) − Pp + V(α, 1) − V(β, l + 1))
= Q1(Ω2, l).

Q2(Ω1, l) = −cs + R − P3G + V(β, 1) + Ω1(P3G − Pp + V(α, 1) − V(β, 1))
≤ −cs + R − P3G + V(β, 1) + Ω2(P3G − Pp + V(α, 1) − V(β, 1))
= Q2(Ω2, l).

The inequalities come from the assumption of the first case and from Lemma 2. Thus, we have proved that V(Ω1, l) ≤ V(Ω2, l).


Second case: We suppose that R + f(l) − Pp + V(α, 1) − V(β, l + 1) < 0; then, for all Ω, we have:

Q1(Ω, l) = −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−f(l) + V(β, l + 1))
≤ −cs − f(l) + V(β, l + 1)
≤ −f(l) + V(β, l + 1)
≤ −f(l) + V(ns(Ω|·), l + 1)
= Q0(Ω, l).

In fact, we have β ≤ ns(Ω|·) for all belief vectors Ω, and the value function V(Ω, l) is increasing with the belief for the packet delay l + 1 (induction assumption). Thus, gu + V(Ω, l) = max{Q0(Ω, l), Q2(Ω, l)}. Moreover, we have:

Q0(Ω1, l) = −f(l) + V(ns(Ω1|·), l + 1)
≤ −f(l) + V(ns(Ω2|·), l + 1)
= Q0(Ω2, l).

The inequality is a direct result of the induction assumption. Finally, we have:

Q2(Ω1, l) = −cs + R − P3G + V(β, 1) + Ω1(P3G − Pp + V(α, 1) − V(β, 1))
≤ −cs + R − P3G + V(β, 1) + Ω2(P3G − Pp + V(α, 1) − V(β, 1))
= Q2(Ω2, l).

The inequality comes from Lemma 2.

Thus, V(Ω1, l) ≤ V(Ω2, l) for all belief vectors Ω1 ≤ Ω2 and for all packet delays l.

G. Proof of Proposition 5

In this proposition, we determine explicitly the best action a(Ω, l) for the secondary user, depending on the belief Ω and the packet delay l. At each time slot and for a given information state (Ω, l), the secondary user decides to take action 0 if Q0(Ω, l) ≥ max{Q1(Ω, l), Q2(Ω, l)}.

First, we assume that Q1(Ω, l) > Q2(Ω, l); then, let us compare Q0(Ω, l) and Q1(Ω, l). The inequality Q0(Ω, l) ≥ Q1(Ω, l) is equivalent to:

−f(l) + V(ns(Ω|·), l + 1) ≥ −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−f(l) + V(β, l + 1)),
V(ns(Ω|·), l + 1) ≥ V(β, l + 1) − cs + Ω(f(l) + R − Pp + V(α, 1) − V(β, l + 1)).


As the value function V(Ω, l) is decreasing with the packet delay l and increasing with the belief Ω, we have V(α, 1) ≥ V(β, l + 1). As we assumed that the immediate reward R is higher than the cost Pp, we obtain that f(l) + R − Pp + V(α, 1) − V(β, l + 1) is positive. Then, we have the following equivalence:

Q0(Ω, l) ≥ Q1(Ω, l) ⟺ V(ns(Ω|·), l + 1) ≥ V(β, l + 1) − cs + Ω(f(l) + R − Pp + V(α, 1) − V(β, l + 1)).

Define the functions F and G as follows:

F(Ω, l) = V(ns(Ω|·), l + 1),
G(Ω, l) = V(β, l + 1) − cs + Ω(f(l) + R − Pp + V(α, 1) − V(β, l + 1)).

We proved in Proposition 2 that the value function is piecewise linear and convex (PWLC). Therefore, for all packet delays, the function F(Ω, l) is PWLC and increasing with Ω, and the function G(Ω, l) is linear and increasing with Ω. Note that if F(Ω, l) ≥ G(Ω, l), then Q0(Ω, l) ≥ Q1(Ω, l) and therefore the best action is 0; if F(Ω, l) < G(Ω, l), then Q0(Ω, l) < Q1(Ω, l) and therefore the best action is 1.

Let us study the sign of the function H(Ω, l) = F(Ω, l) − G(Ω, l). Under these settings, six cases arise:

1) F(Ω, l) is always higher than G(Ω, l), see Figure 14, case 1.
2) F(Ω, l) is always lower than G(Ω, l), see Figure 14, case 2.
3) F(Ω, l) and G(Ω, l) intersect once and F(π(0), l) < G(π(0), l), see Figure 14, case 3.
4) F(Ω, l) and G(Ω, l) intersect once and F(π(0), l) ≥ G(π(0), l), see Figure 14, case 4.
5) F(Ω, l) and G(Ω, l) intersect twice and F(π(0), l) ≥ G(π(0), l), see Figure 14, case 5.
6) G(Ω, l) is tangent to F(Ω, l), see Figure 14, case 6.

Let us focus on F(π(0), l) and G(π(0), l), and let us first prove that gu > −f(l). We have:

gu + V(α, 1) ≥ Q0(α, 1),
gu + V(α, 1) ≥ −f(l) + V(ns(α), l + 1),
gu + V(α, 1) − V(ns(α), l + 1) ≥ −f(l),
gu > −f(l).


The inequality follows from the monotonicity of the value function and from ns(α) < α. Suppose now that the secondary user chooses action 0 in the state (π(0), l). We have:

gu + V(π(0), l) = −f(l) + V(ns(π(0)), l + 1),
gu + V(π(0), l) ≤ −f(l) + V(ns(π(0)), l),
gu + V(π(0), l) ≤ −f(l) + V(π(0), l),
gu ≤ −f(l).

This leads to a contradiction, since gu > −f(l). Thus, Q0(π(0), l) < Q1(π(0), l), and therefore F(π(0), l) < G(π(0), l).

Suppose now that Q2(Ω, l) ≥ Q1(Ω, l); then we have to compare actions 0 and 2, which is equivalent to comparing the action-value functions Q0(Ω, l) and Q2(Ω, l). The secondary user takes action 0 instead of action 2 if Q0(Ω, l) ≥ Q2(Ω, l), which is equivalent to:

−f(l) + V(ns(Ω|·), l + 1) ≥ −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(R − P3G + V(β, 1)),
V(ns(Ω|·), l + 1) ≥ V(β, 1) + R + f(l) − cs − P3G + Ω(P3G − Pp + V(α, 1) − V(β, 1)).

We have from Lemma 2 that P3G − Pp + V(α, 1) − V(β, 1) ≥ 0. Then, we can carry out the same analysis as in the previous case with the function F(Ω, l) = V(ns(Ω|·), l + 1) and the function G(Ω, l) = V(β, 1) + R + f(l) − cs − P3G + Ω(P3G − Pp + V(α, 1) − V(β, 1)). The latter is linear and increasing in Ω. We obtain the following threshold policy:

The secondary user takes action 0 for all beliefs lower than the threshold

Th2(Ω, l) = [V(ns(Ω|·), l + 1) − V(β, 1) − R − f(l) + cs + P3G] / [P3G − Pp + V(α, 1) − V(β, 1)],

and takes action 2 otherwise.


H. Proof of Proposition 6

We have from Lemma 1 that if Ω > π(0), then ns(Ω) ≤ Ω. Suppose that the secondary user takes action 0 for a belief Ω > π(0) and a packet delay l. Thus, we have:

gu + V(Ω, l) = −f(l) + V(ns(Ω), l + 1),
gu + V(Ω, l) ≤ −f(l) + V(ns(Ω), l),
gu + V(Ω, l) ≤ −f(l) + V(Ω, l),
gu ≤ −f(l).

This leads to a contradiction, since gu > −f(l). The first inequality holds because the value function is decreasing with the packet delay, and the second one because the value function is increasing with the belief and ns(Ω) ≤ Ω. Thus, if Ω > π(0), the secondary user never takes action 0, i.e. Q0(Ω, l) < max{Q1(Ω, l), Q2(Ω, l)}.

I. Proof of Proposition 7

Let us compare the action-value functions Q1(Ω, l) and Q2(Ω, l) for all belief vectors Ω and packet delays l. The secondary user waits for the next time slot after sensing if Q1(Ω, l) ≥ Q2(Ω, l), which is equivalent to:

−cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(−f(l) + V(β, l + 1)) ≥ −cs + Ω(R − Pp + V(α, 1)) + (1 − Ω)(R − P3G + V(β, 1)),
−f(l) + V(β, l + 1) − R + P3G − V(β, 1) ≥ 0.

Remark that this condition depends only on the packet delay l and not on the belief vector Ω.

J. Proof of Corollary 1

If P3G − R is lower than f(1), then −f(l) − R + P3G + V(β, l + 1) − V(β, 1) is always negative. In fact, V(β, 2) − V(β, 1) is negative, and −f(l) − R + P3G + V(β, l + 1) − V(β, 1) is decreasing with l. Therefore, the previous expression is negative for all l ≥ 1.


REFERENCES

[1] E. Hossain, D. Niyato and Z. Han, Dynamic Spectrum Access and Management in Cognitive Radio Networks, Cambridge University Press, 2009.

[2] J. Mitola, Cognitive Radio: An Integrated Agent Architecture for Software Defined Radio, PhD Dissertation, Royal Inst. Technol. (KTH), Stockholm, Sweden, 2000.

[3] I. F. Akyildiz, W.-Y. Lee et al., "NeXt generation/dynamic spectrum access/cognitive radio wireless networks: A survey," Computer Networks, 2006.

[4] K. Jaganathan, I. Menache, E. Modiano, and G. Zussman, "Non-cooperative spectrum access: The dedicated vs. free spectrum choice," in Proc. ACM MobiHoc '11, May 2011.

[5] O. Habachi and Y. Hayel, "Optimal sensing strategy for opportunistic secondary users in a cognitive radio network," in Proc. 13th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM), 2010.

[6] I. Akyildiz, W. Lee, M. Vuran, and S. Mohanty, "A survey on spectrum management in cognitive radio networks," IEEE Communications Magazine, 2008.

[7] Q. Zhao et al., "Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework," IEEE Journal on Selected Areas in Communications, vol. 25, no. 3, April 2007.

[8] H. Liu, B. Krishnamachari and Q. Zhao, "Cooperation and learning in multiuser opportunistic spectrum access," in Proc. ICC, 2008.

[9] H. Zheng and C. Peng, "Collaboration and fairness in opportunistic spectrum access," in Proc. IEEE International Conference on Communications (ICC), 2005.

[10] A. T. Hoang, Y. C. Liang, D. T. C. Wong, Y. Zeng, and R. Zhang, "Opportunistic spectrum access for energy-constrained cognitive radios," IEEE Transactions on Wireless Communications, 2008.

[11] Y. Chen, Q. Zhao and A. Swami, "Distributed spectrum sensing and access in cognitive radio networks with energy constraint," IEEE Transactions on Signal Processing, February 2009.

[12] K. Challapali, C. Cordeiro, and D. Birru, "Evolution of spectrum-agile cognitive radios: first wireless internet standard and beyond," in Proc. WICON, 2006.

[13] Q. Zhao, L. Tong, and A. Swami, "Decentralized cognitive MAC for dynamic spectrum access," in Proc. 1st IEEE Symp. New Frontiers in Dynamic Spectrum Access Networks, Nov. 2005.

[14] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley Series in Probability and Statistics, 2005.

[15] R. D. Smallwood and E. J. Sondik, "The optimal control of partially observable Markov decision processes over a finite horizon," Operations Research, vol. 21, pp. 1071-1088, 1973.

[16] W. S. Lovejoy, "Some monotonicity results for partially observed Markov decision processes," Operations Research, vol. 35, no. 5, pp. 736-743, Sept. 1987.

[17] S. Shellhammer, A. Sadek and W. Zhang, "Technical challenges for cognitive radio in the TV white space spectrum," in Information Theory and Applications Workshop, 2009.


    Fig. 1. Using cognitive radio in ad-hoc communication. If the licensed frequency f1 is not used by primary users, secondary

    users can communicate in ad-hoc mode using f1.

    Fig. 2. Cognitive radio network architecture

    Fig. 3. The channel transition probabilities for channel i.

    Fig. 4. The belief update function ns with respect to the packet delay.


Fig. 5. Optimal policy with one licensed channel (α = 0.15, β = 0.1).

Fig. 6. Optimal policy with one licensed channel (α = 0.2, β = 0.25).

Fig. 7. Optimal policy for the secondary user in Scenario 1.


Fig. 8. Optimal policy for the secondary user in Scenario 2.

Fig. 9. Optimal policy for the secondary user in Scenario 3.

Fig. 10. Average reward depending on the number of iterations for Scenario 2.


Fig. 11. Average delay depending on the number of iterations for Scenario 2.

Fig. 12. Average reward depending on the number of iterations for Scenario 1.

Fig. 13. Average delay depending on the number of iterations for Scenario 1.


Fig. 14. The functions F(Ω, l) and G(Ω, l).
