
A Reinforcement Learning Based QAM/PSK Symbol Synchronizer

MARCO MATTA1, GIAN CARLO CARDARILLI1 (Member, IEEE), LUCA DI NUNZIO1, ROCCO FAZZOLARI1, DANIELE GIARDINO1, ALBERTO NANNARELLI2 (Senior Member, IEEE), MARCO RE1 (Member, IEEE), and SERGIO SPANÒ1
1Department of Electronic Engineering, University of Rome “Tor Vergata”
2Department of Applied Mathematics and Computer Science, Danmarks Tekniske Universitet, Lyngby, Denmark

Corresponding author: Marco Matta (e-mail: [email protected]).

ABSTRACT Machine Learning (ML) based on supervised and unsupervised learning models has recently been applied in the telecommunication field. However, such techniques rely on application-specific large datasets and the performance deteriorates if the statistics of the inference data change over time. Reinforcement Learning (RL) is a solution to these issues because it is able to adapt its behavior to the changing statistics of the input data. In this work, we propose the design of an RL Agent able to learn the behavior of a Timing Recovery Loop (TRL) through the Q-Learning algorithm. The Agent is compatible with popular PSK and QAM formats. We validated the RL synchronizer by comparing it to the Mueller and Müller TRL in terms of Modulation Error Ratio (MER) in a noisy channel scenario. The results show a good trade-off in terms of MER performance: the RL-based synchronizer loses less than 1 dB of MER with respect to the conventional one, but it is able to adapt its behavior to different modulation formats without the need for any tuning of the system parameters.

INDEX TERMS Artificial Intelligence, Machine Learning, Reinforcement Learning, Q-Learning, Synchronization, Timing Recovery Loop

I. INTRODUCTION

Machine Learning (ML) is a field of Artificial Intelligence based on statistical methods to enhance the performance of algorithms in data pattern identification [1]. ML is applied in several fields, such as medicine [2], financial trading [3], big data management [4], imaging and image processing [5], security [6], [7], mobile apps [8] and more.

In recent years, the advancement of electronics, information sciences and Artificial Intelligence supported the research and development in the telecommunication (TLC) field. In modern TLC systems, important aspects are flexibility and compatibility with multiple standards, often addressed with ML approaches. Some examples are Software Defined Radio (SDR) [9], Cognitive Radio (CR) [10] and Intelligent Radio (IR) systems. IRs are capable of autonomously estimating the optimal communication parameters when the system operates in a time-variant environment. This intelligent behavior can be obtained by using conventional adaptive signal processing techniques or by using ML approaches.

In [11] the authors present a survey of different innovative ML techniques applied to telecommunications. In [12] the authors compare the modulation recognition performance of different ML techniques such as Logistic Regression (LR), Artificial Neural Networks (ANN) and Support Vector Machines (SVM) over different datasets (FSK and PSK-QAM signals). The authors in [13] propose the use of Deep Belief Networks in a demodulator architecture. Other noteworthy research papers in this context are [14]–[18]. Another important research field is the development of ML hardware accelerators that allow time- and energy-efficient execution of ML algorithms [19]–[24].

ML techniques are usually classified in three main categories: Supervised, Unsupervised and Reinforcement Learning. The first two require a training phase to obtain an expert algorithm ready to be deployed in the field (inference phase). Supervised and unsupervised ML approaches rely on massive amounts of data, intensive offline training sessions and large parameter spaces [25]. Moreover, the inference performance degrades when the statistics of the input data differ from those of the training examples. In these cases, multiple training sessions are required to update the model [26]. In Reinforcement Learning (RL) the training and inference phases are not separated. The learner interacts with the environment to collect information and receives an immediate reward for each decision and action it takes. The reward is a numerical value that quantifies the quality of the action performed by the learner. Its aim is to maximize the reward while interacting with the environment through an iterative sequence of actions [27].

Consequently, in a scenario where the input data statistics evolve, a possible solution is the use of an RL approach. The main feature of RL is its capability to solve the task by interacting with the environment through trial-and-error. For this reason, this methodology brings the following advantages:

• The off-line training phase is not required, avoiding the need for cumbersome training datasets;
• Parameter space dimensions are generally smaller than in supervised and unsupervised approaches;
• The model is self-adapting using the data collected in the field.

RL has become increasingly popular in robotics [28]–[31], Internet of Things (IoT) [32], financial trading [33] and industrial applications [34]. RL for multi-Agent systems is a growing research field as well. It is often bio-inspired [35] and the learners are organized in ensembles, i.e. swarms, to improve learning capabilities [36], [37].

However, the application of RL models to TLC is a fairly new and unexplored line of research. As detailed later, a conventional TLC receiver is a cascade of processing units based on tunable feedback loops and application-specific modules. For example, a Timing Recovery Loop (TRL) is a system that includes an error detector and a feedback loop filter to control the synchronization of the receiver resampler. Conventional approaches require the tuning of filter parameters and error detectors that are specific to the transmission parameters, such as the type of modulation scheme.

In this paper, we propose a TRL based on RL capable of operating independently of the modulation format, thanks to its adaptation capability. Moreover, a large training dataset is not required, unlike the cited research on the use of supervised and unsupervised ML approaches in TLC. Our solution is based on Q-Learning, an RL algorithm developed by Watkins [38]. This model learns from the input data how to synthesize the behavior of a symbol synchronizer through a trial-and-error iterative process. The system performance has been validated by comparing the proposed approach to a standard synchronization method, the Mueller and Müller gate [39]. With respect to [40], we extend the method to a wider set of modulation schemes by using a modified and enhanced ML model.

The main advantages of our approach are:

• It is capable of operating on different modulation schemes (static conditions);
• It is capable of self-adaptation to changes in the channel noise and modulation formats (changes in the environment);
• It avoids the loop filter parameter tuning phase;
• It avoids the need to allocate specific sub-channels for synchronization, because the timing recovery is performed by using the input data.

In this paper, we prove that the use of a Reinforcement Learning based QAM/PSK symbol synchronizer is a valid timing recovery method by testing it in a generic telecommunication system.

This paper is organized as follows. In Sect. II an overview of conventional symbol synchronization methods, RL and Q-Learning is provided. In Sect. III we illustrate the architecture and the sub-modules of the proposed RL based TRL. In Sect. IV we provide the experimental results regarding the choice of the Q-Learning hyperparameters and we compare the performance of the system to conventional synchronization methods. In Sect. V the conclusions are drawn.

II. BACKGROUND

In this section, we provide insights about conventional timing recovery methods and Reinforcement Learning.

A. SYMBOL SYNCHRONIZATION METHODS

A generic receiver front-end consists of a receive filter followed by a carrier and timing recovery system. The carrier recovery system is implemented with first- or second-order Phase-Locked Loops (PLL). The symbol synchronization is performed using Timing Recovery Loops (TRL), as shown in [41], [42]. Examples of recovery loops are the Costas Loop for carrier recovery [43], and the Early-Late [42], [44], the Mueller and Müller [39], and the Gardner Loop [45] for symbol synchronization. The timing recovery is often implemented as a first-order TRL. A TRL consists of a timing phase detector (TPD), a loop filter, and a Numerically Controlled Oscillator (NCO) for resampling.

In this paper we compare the RL based synchronizer to the Mueller and Müller (M&M) method for two main reasons: the M&M recovery loop is a data-assisted timing synchronization algorithm and it uses only one sample per symbol. The M&M synchronizer (the BPSK version is shown in Fig. 1) resamples the signal y[k] and computes the synchronization error e[n] = x[n]a[n−1] − x[n−1]a[n]. A decision device estimates the expected values a[n] and a[n−1], providing the constellation coordinates of the nearest symbol. The error signal drives a loop filter which controls the NCO that resamples the input waveform y[k].
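As a reference for the comparison above, the following is a minimal Python sketch of the M&M error computation for BPSK; the decision device is assumed to be a simple sign slicer, and the loop filter and NCO are omitted.

import numpy as np

def mm_timing_error(x, x_prev):
    """Mueller and Müller timing error for BPSK (one sample per symbol).

    x, x_prev: current and previous resampled samples x[n], x[n-1].
    The decision device is assumed to be a sign slicer, a[n] = sign(x[n]).
    Returns e[n] = x[n]*a[n-1] - x[n-1]*a[n].
    """
    a, a_prev = np.sign(x), np.sign(x_prev)
    return x * a_prev - x_prev * a

# Example: mm_timing_error(0.9, -1.1) returns approximately 0.2.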

FIGURE 1. Mueller and Müller Timing Recovery Loop (BPSK implementation).

FIGURE 2. Principle of Reinforcement Learning.

B. REINFORCEMENT LEARNING

Among the ML paradigms described in the Introduction, RL is the most appropriate method for symbol timing synchronization, because many of the new TLC applications imply dynamic scenarios (environments). RL is an ML methodology used to train an object, called the Agent, to perform a certain task by interacting with the environment. As shown in Fig. 2, the RL Agent adjusts its behavior through a trial-and-error process driven by a rewarding mechanism called reinforcement, which is a measure of its overall task-solving capability [46]. The effects of the actions are gathered and evaluated by an entity called the Interpreter, which computes the new state and rewards the Agent accordingly. The Interpreter can be external or internal to the Agent itself: in the latter case, the Agent and the Interpreter are merged into a more complex Agent, able to sense the environment and capable of self-criticism. This is the type of Agent we use in this paper.

The proposed RL-based synchronizer interacts with the received signal generated by an evolving TLC environment and is capable of deciding the best timing compensation value. This process takes place without the need for extensive training phases and large training datasets.

In standard RL algorithms, the environment is often modeled as a Markov Decision Process (MDP) [46]. The Agent is characterized by the space S × A, where S is the set of states and A is the set of actions. The Agent evaluates the states s_i ∈ S and decides to perform the action a_i ∈ A. The process by which the Agent senses the environment, collecting pieces of information and knowledge, is called exploration. At a certain time t, the Agent is in the state s_t and takes the action a_t. The decision to perform the action a_t takes place because the Agent iteratively learns and builds an optimal action-selection policy π : S × A → [0, 1]. Each π(s, a) is a real number in the range [0, 1]: π is a probability map giving the probability that the Agent takes each of the actions in A when in the current state s_t. Then, the Agent observes the effects generated by the action a_t on the environment.

The information derived from the environment response is called the reward, a real number (positive or negative). Its value depends on how much the action a_t brings the system closer to the task solution. The Agent aims to accumulate positive rewards and to maximize its expected return by iterating actions over time. This process is modeled by the State-Value Function Vπ(s), which is related to the current policy map and is updated with the reward value at each iteration. Through Vπ(s) the Agent learns the quality of the new state s_{t+1} reached starting from s_t and executing a_t. The sequence of iterations taken by the Agent to complete the given task is called an episode. The Agent reaches a terminal state s_goal and the updated State-Value Function is stored for further use. The subsequent episode starts with the updated State-Value Function available to the Agent. As Vπ(s) is updated, the Agent manages its set of actions in order to solve the task by maximizing the cumulative reward in fewer iterations. This process is called exploitation. The Agent explores and exploits the environment using the knowledge collected in the form of the π map.

In general, the environment is dynamic and the effect of an action in a certain state can change over time. By alternating exploration and exploitation, the RL Agent adapts its State-Value Function Vπ(s) over time, following the environment changes. There are several RL algorithms that approximate and estimate the State-Value function, both for single and multi-Agent settings, and the most prominent example is Q-Learning by Watkins [38].

Watkins' Q-Learning

Q-Learning is an algorithm able to approximate the State-Value function Vπ(s) of an Agent in order to find an optimal action-selection policy π∗. As mentioned before, the Agent's overall goal is to maximize its total accumulated future discounted reward derived from current and future actions. The State-Value function is defined as Vπ(s) = E[R], where R is the Discounted Cumulative Future Reward defined in (1):

R = \sum_{t=0}^{\infty} \gamma^t r_t    (1)

where γ ∈ [0, 1] is the discount factor and r_t is the reward at time t. In [38] the authors demonstrate that, to approximate the State-Value function Vπ∗(s) related to the optimal policy π∗, it is sufficient to compute a map called the Action-Value Matrix Qπ∗ : S × A. The elements of Q represent the quality of the related policy π(s_t, a_t). The Q values are updated at every iteration (time t) using the Q-Learning update rule, in the form of the Bellman equation [47], reported in (2):

Q(s_t, a_t) \leftarrow (1 - \alpha) Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_a Q(s_{t+1}, a) \right)    (2)

where α is the learning rate, γ is the discount factor and r_t is the reward at time t. Some state-of-the-art action-selection policies widely used in Reinforcement Learning frameworks [46] are listed below (a minimal code sketch of the update rule (2) and of the ε-Greedy selection follows the list):

• Random: the Agent selects an action a_i ∈ A randomly, ignoring the Q values;
• Greedy: the Agent selects the action with the highest corresponding Q value;
• ε-Greedy: the chance of taking a random action is a value ε ∈ [0, 1] and the probability of a greedy choice is 1 − ε;
• Boltzmann: the Q values are used to weight the action-selection probabilities according to the Boltzmann distribution, based on the temperature parameter τ.
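The following Python sketch illustrates the tabular update in (2) together with an ε-Greedy action selection over a generic flat Q matrix; it is a minimal illustration of the algorithm, not the branched implementation used by the proposed synchronizer.

import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Return a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Tabular Q-Learning update of Eq. (2)."""
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))

# Minimal usage example on an arbitrary 4-state, 3-action table.
rng = np.random.default_rng(0)
Q = np.zeros((4, 3))
a = epsilon_greedy(Q, s=0, epsilon=0.1, rng=rng)
q_update(Q, s=0, a=a, r=1.0, s_next=1, alpha=0.1, gamma=0.01)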

In the case of an Agent required to perform multiple actions simultaneously, it is possible to apply the concept of Action Branching [48], which extends the dimensionality of the action space.

III. TIMING SYNCHRONIZER BASED ON Q-LEARNING

In this section, we present the design of a TRL based on Q-Learning able to recover the timing of a raised-cosine-shaped signal when QAM and PSK modulation schemes are used. In our system, the input signal y[k] is downsampled by a factor M, obtaining the signal x[n] (as for the Mueller and Müller gate). The Agent task is to compensate for the symbol timing error by delaying or anticipating the resampling time. In our experiments, the downsampling factor M is equal to the number of samples per symbol. The definition of the Agent state is related to a measurement on the resampled symbol (detailed in Sect. III-A2). The Agent action is defined as a double control (Sect. III-C): the first one is used to manage the resampling timing and the second one to modify the input signal amplitude.

With respect to Fig. 3, the Agent is required to fulfill two main objectives:

1) To minimize the variation in amplitude of consecutive resampled symbols x[n] with respect to an expected constellation (dark blue arrows);
2) To maximize the eye-opening (orange arrow).

FIGURE 3. Objectives: minimization of the amplitude variation and maximization of the eye-opening (BPSK waveform).

With respect to [40], the algorithm is extended to support QAM synchronization.

FIGURE 4. Q-Learning timing synchronizer architecture.

The proposed RL Agent architecture is illustrated in Fig. 4 and consists of four main sub-modules:

1) State and Reward Evaluator block: it is the Interpreter in the RL framework;
2) Q-Learning Engine: it updates the Q matrix and implements the action-selection policy π;
3) Action Decoder: it carries out the Agent decisions to manage the signal sampling delay and its amplitude;
4) Numerically Controlled Oscillator: it provides the resampling timing.

Additional details and accurate descriptions of the TRL blocks are provided in the following subsections.

A. STATE AND REWARD EVALUATOR

The basic idea behind the RL based TRL is illustrated in Fig. 5, where the input y[k] is a BPSK signal. The variation range of the resampled signal absolute value |x[n]| is represented by the pink band (Fig. 5a) and the green band (Fig. 5b). The Agent goal is the minimization of this variation and, in addition, the minimization of the difference between the amplitude of the resampled signal x[n] (red crosses) and the expected symbol value x∗ (blue line, equal to 1 in the depicted example). We design an RL Interpreter, called State and Reward Evaluator (SRE), that processes x[n] to fulfill these objectives.

We design the Agent states s_i ∈ S and the reward r depending on the input signal amplitude and its variation over time. As depicted in Fig. 6, the SRE module is built as two blocks in cascade, called Constellation Conditioning and State & Reward Computing.

FIGURE 5. Resampling of a BPSK waveform (M = 8): (a) before synchronization, (b) after synchronization.

FIGURE 6. State and Reward Evaluator (SRE) block.

1) Constellation Conditioning

The main limitation of the approach proposed in [40] is the inability to synchronize QAM signals. To address this problem we use a coordinate transformation approach, implemented as shown in Fig. 7.

FIGURE 7. Constellation Conditioning block.

The following equation represents the constellation conditioning algorithm:

\hat{x}[n] = \left| \left[ G \cdot x_r - \mathrm{round}(G \cdot x_r) \right] + 1 \right|    (3)

where x̂[n] is the conditioned symbol, x_r is the real part of x[n], and the gain value G[n] is controlled by the Agent with the action a^2[n], detailed later. Using this approach, the transformed signal x̂[n] tends to an optimal value x̂∗ that depends on the value of G[n]. The effect of the wrap block is modeled by the term G · x_r − round(G · x_r). The constellation conditioning process is depicted in Fig. 8. In this example, we consider a 16QAM constellation where the coordinates of x[n] are projected on the real axis, aligned using the gain block, collapsed and merged on the expected symbol coordinate x̂∗.
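A minimal numeric sketch of the conditioning in (3), assuming the resampled symbol x[n] is available as a complex value and G is the current gain G[n]; np.round (round-half-to-even) stands in here for the round operator of (3).

import numpy as np

def constellation_conditioning(x, G):
    """Constellation conditioning of Eq. (3).

    The wrap block maps G*Re{x[n]} to its signed distance from the nearest
    integer, so that all constellation columns collapse around the expected
    value after the +1 offset and the absolute value.
    """
    xr = np.real(x)
    return np.abs((G * xr - np.round(G * xr)) + 1.0)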

2) State & Reward Computing

In Fig. 9 the State and Reward Computing block is shown.


The state function (4) and the reward function (5) are:

s[n] = \frac{N}{2} \left( \nabla^2 \hat{x}[n] + 1 \right) + \hat{x}[n-2]    (4)

r[n] = \left[ \hat{x}[n-2] - \left| \nabla^2 \hat{x}[n] \right| \right]^q    (5)

where N is the number of states and q must be an odd integer to keep the sign. ∇²x̂[n] is the second-order finite difference of x̂[n], as shown in (6):

\nabla^2 \hat{x}[n] = \hat{x}[n] - 2\hat{x}[n-1] + \hat{x}[n-2]    (6)

To compute the state s[n], the value of ∇²x̂[n] is scaled by N/2 to be compatible with the coordinates of the Agent Q-matrix. If the Agent retrieves the correct symbol phase, the differential ∇²x̂[n] tends to 0 and x̂[n] tends to x̂∗. For this reason, according to (4), the target state s∗ is placed in the neighborhood of s_i = N/2 + x̂∗ and the reward in (5) is maximized.
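A minimal sketch of the State & Reward Computing rules (4)-(6), operating on the last three conditioned symbols; the exponent q is not specified in the text, so q = 3 below is only an illustrative odd value.

import numpy as np

def state_and_reward(x_hat, N, q=3):
    """State and reward of Eqs. (4)-(6).

    x_hat: the last three conditioned symbols [x̂[n-2], x̂[n-1], x̂[n]].
    N:     number of Agent states.
    """
    d2 = x_hat[2] - 2.0 * x_hat[1] + x_hat[0]   # second-order finite difference, Eq. (6)
    s = N / 2.0 * (d2 + 1.0) + x_hat[0]         # Eq. (4), with x_hat[0] = x̂[n-2]
    r = (x_hat[0] - abs(d2)) ** q               # Eq. (5), odd q keeps the sign
    s_idx = int(np.clip(round(s), 0, N - 1))    # rounded (and clipped here for safety) to index Q
    return s_idx, r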

B. Q-LEARNING ENGINE

The purpose of this module is to update the Action-Value elements of Q using equation (2). As shown in Fig. 4, the inputs are the Agent state s[n] and the reward r[n]. The output is the action a[n], determined using one of the policies described in Sect. II-B. In particular, the action is branched in two components:

• Branch a^1 = {a^1_0, a^1_1, a^1_2} represents the action vector used to delay, stop or anticipate the resampler;
• Branch a^2 = {a^2_0, a^2_1, a^2_2} is an action vector used to increase, hold or decrease G[n].

The Agent action space A = a^1 × a^2 is bi-dimensional and it is processed and actuated by the Action Decoder block, detailed in the next Section. The mapping of the state-action space S × A into Q is a tensor with one dimension for the states and two for the actions, as illustrated in Fig. 10. There are 3 × 3 = 9 available actions for each state s_i. The hyperparameter N ∈ ℕ is the cardinality of the state space S (hence s_i ∈ ℕ). For this reason, the state values s[n] calculated in (4) are rounded to address the Q-tensor.
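As an illustration of how the branched 3 × 3 action space can be addressed, the sketch below performs an ε-Greedy selection of the action pair for a given state; choosing both branches jointly from the same N × 3 × 3 Q-tensor slice is an assumption of this sketch, not a detail stated in the text.

import numpy as np

def select_branched_action(Q, s_idx, epsilon, rng):
    """Pick the pair (a1, a2) for state s_idx from an N x 3 x 3 Q-tensor.

    Returns two indices in {0, 1, 2}: one for the timing branch a1 and one
    for the gain branch a2 (9 possible combinations per state).
    """
    if rng.random() < epsilon:                       # exploration
        return int(rng.integers(3)), int(rng.integers(3))
    a1, a2 = np.unravel_index(np.argmax(Q[s_idx]), Q[s_idx].shape)  # greedy choice
    return int(a1), int(a2)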

FIGURE 8. Example of Constellation Conditioning on a 16QAM modulation scheme.

FIGURE 9. State and Reward Computing block.

FIGURE 10. Mapping of the State-Action space into Q values.

C. ACTION DECODER

The Action Decoder implements the decisions taken by the Agent. The action a = {a^1, a^2} is the input of this block. The outputs are used by the Agent to control the NCO timing and G[n], respectively. This block computes the increments to be assigned to the NCO timing and to the gain G[n].

The action vector a^1 controls the resampler using three different actions:

• a^1_0 = 0: the Agent decides that the current timing is correct;
• a^1_1 = +1: the Agent anticipates the resampling;
• a^1_2 = −1: the Agent delays the resampling.

The purpose of the action a^2 is to modify the amplitude of the input signal, as shown in Fig. 8 and discussed in Sect. III-A. The elements of a^2 are the increments used by the Action Decoder to update the input gain G[n]:

• a^2_0 = 0: the current x̂∗ value is appropriate and the Agent decides to keep G[n] constant;
• a^2_1 = +0.02: G[n] is increased;
• a^2_2 = −0.02: G[n] is decreased.

The initial control values for the NCO timing and G[n] are 0 and 1, respectively (a minimal sketch of the decoder is given below).
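A minimal sketch of the Action Decoder mapping, assuming the branch indices produced by the Q-Learning Engine are 0, 1, 2 in the order listed above:

# Increments associated with the two action branches (order as listed above).
TIMING_INC = (0, +1, -1)         # a1: keep, anticipate, delay the resampling
GAIN_INC = (0.0, +0.02, -0.02)   # a2: hold, increase, decrease G[n]

def action_decoder(a1, a2, timing, gain):
    """Apply the branched action (a1, a2) to the NCO timing control and to G[n]."""
    return timing + TIMING_INC[a1], gain + GAIN_INC[a2]

# Initial control values: NCO timing 0 and gain G[n] = 1.
timing, gain = 0, 1.0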

IV. EXPERIMENTS AND RESULTS

In this section, we present three experiments. In the first one we find suitable Q-Learning hyperparameters, in the second one we test the adaptivity properties of the Q-Learning Agent, and in the last one we compare our algorithm to the Mueller and Müller loop when signals with different modulation formats are used in a generic telecommunication system. The experimental setup is common among the experiments and is shown in Fig. 11.

FIGURE 11. Experimental setup used to test and characterize the proposed Q-Learning TRL.

The TX-Module is a base-band transmitter in which the modulation scheme and the Root-Raised Cosine (RRC) shaping filter are parametric. The AWGN Channel Emulator is configurable in delay and noise power (measured in terms of Eb/N0). The RX-Module includes an RRC receive filter and the Q-Learning Synchronizer. In these experiments, we assume a recovered carrier. The performance of the synchronizer is evaluated in terms of Modulation Error Ratio (MER), defined as in [49].
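For reference, the MER of a block of received symbols can be computed as the ratio between the average power of the ideal constellation points and the average power of the error vectors, consistent with the definition in [49]; the sketch below assumes the transmitted (ideal) symbols are known.

import numpy as np

def mer_db(rx_symbols, ideal_symbols):
    """Modulation Error Ratio in dB: ideal symbol power over error-vector power."""
    rx = np.asarray(rx_symbols, dtype=complex)
    ideal = np.asarray(ideal_symbols, dtype=complex)
    err = rx - ideal
    return 10.0 * np.log10(np.sum(np.abs(ideal) ** 2) / np.sum(np.abs(err) ** 2))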

In the first experiment, we assess the optimal Q-Learning hyperparameters a posteriori. At the same time, we aim to reduce the Agent complexity, i.e. the dimensionality of the hyperparameter space. In the second experiment, we change the environment properties, i.e. the modulation format, to evaluate the Q-Learning Agent adaptation capability. In the last experiment, we compare the MER of the Q-Learning synchronizer with that obtained by the Mueller and Müller method for PSK and QAM modulation schemes. The experiments are characterized by the following Q-Learning hyperparameters (the values that were validated a posteriori) and TLC parameters, also collected in the short sketch after the list:

• The N × 3 × 3 Q-Tensor is initialized with all zero values;
• The Agent number of states is N = 64;
• The values of α and γ in (2) are 0.1 and 0.01, respectively;
• The Agent action-selection policy is ε-Greedy with ε = 10⁻⁵;
• The number of samples per symbol of the transmitted waveform is M = 32.
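For reference, the same settings expressed as a minimal configuration sketch (variable names are illustrative, not taken from the authors' implementation):

import numpy as np

N_STATES = 64                    # number of Agent states N
ALPHA, GAMMA = 0.1, 0.01         # learning rate and discount factor in (2)
EPSILON = 1e-5                   # epsilon-Greedy exploration probability
M = 32                           # samples per symbol of the transmitted waveform

Q = np.zeros((N_STATES, 3, 3))   # N x 3 x 3 Q-tensor initialized to zero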

The MER measurements, if not explicitly indicated otherwise, are obtained as the average of 200 simulations where we select the last 200 symbols in steady-state condition. The transmitted PSK symbols have a constant amplitude equal to 1. In QAM mode, the symbols have a fixed minimum distance of 2.

A. HYPERPARAMETERS ANALYSIS

Considering equation (2) and the Agent features, the hyperparameters are:

• Action-selection policy π: it represents the action choice rule according to the Q values. In our work we employed ε-Greedy with ε = 10⁻⁵ to facilitate the state-space exploration; moreover, it is important to avoid the local maxima of the reward function.
• Learning rate α: it defines the convergence speed of the Q matrix towards the ideal State-Value function. In the case of a dynamic environment, it also defines the adaptation speed of the Agent. For α = 0.1, we found a good trade-off between algorithm convergence speed and local-maxima avoidance.
• Discount factor γ: it weights the outcome of the present actions with respect to the future ones. A suitable γ value is 0.01.
• Number of states N: it defines the state-space quantization, hence the Q-tensor dimension (Fig. 10).

In the following analysis, different simulations have been carried out for PSK and QAM formats; in the next subsections the QPSK and 16QAM cases are illustrated.

1) MER variation as a function of α and γ

To study the range of variation of the Q-Learning TRL hyperparameters with respect to the modulation format and Eb/N0, we set an initial constant timing delay d = M/4, where M is the number of samples per symbol. The graphs in Fig. 12 show that the variations of α (Fig. 12a and Fig. 12b) and γ (Fig. 12c and Fig. 12d) do not affect the MER performance significantly. Consequently, the default set α = 0.1 and γ = 0.01 is a proper design choice for different modulation formats.

2) Number of States N

The number of states determines the environment observation precision of the Agent. To analyze the system MER performance, we simulate our system with N in the range 8-1024 and Eb/N0 = 5-30 dB for PSK and QAM modulation schemes. The MER results are evaluated after 1000 exploration symbols.

Figure 13a shows that the MER is affected by N only in the case of high SNR. In this case, the MER grows when N increases (10 dB of spread at Eb/N0 = 30 dB). In the 16QAM case in Fig. 13b, we observe the same dependence with respect to N but with a smaller spread (2.5 dB at Eb/N0 = 25-30 dB).

This experiment allows a proper selection of the number of states N. In the QPSK case, the MER curves are monotonically growing for N ≥ 64; consequently, there is no advantage in using an Agent with more than 64 states. In the 16QAM tests, the curves show a similar trend. The best trade-off in terms of MER and number of states is N = 64. The resulting Q-Tensor consists of 64 × 3 × 3 = 576 elements.

FIGURE 12. MER variation as a function of α and γ for different Eb/N0 and different modulation formats.

FIGURE 13. Variation of N. MER measured for different channel noise levels.

FIGURE 14. Q-Learning adaptivity: timing compensation and gain G[n].

B. Q-LEARNING TRL ADAPTIVITY EXPERIMENT

As mentioned in Sect. I and Sect. II-B, an RL Agent is able to adapt itself to the evolution of the environment. This property ensures high system availability in the field and no need for training sessions. With the following experiment we show the adaptivity properties that the RL approach provides and how the proposed TRL is able to operate independently of the modulation scheme.

In Fig. 14, the initial timing delay is set to d = 7 (in samples, horizontal dashed gray line) and the modulation format changes from QPSK to 16QAM (vertical red line). This change in the modulation scheme models an environment evolution and is useful to observe the Agent behavior in terms of adaptivity.

The Agent manages its actions to carry out the best TRL controls, i.e. the timing compensation (blue curve) and the value of G (orange curve). The Agent experiences two exploration phases (green background) and two exploitation phases (white background). After the first exploration phase (~500 iterations), the Agent decides that the best timing compensation and gain values are 7 and ~1 in a QPSK environment. The values of the Q-matrix converge to the optimal ones and the Agent begins to exploit its knowledge by maintaining the optimal TRL controls.

At a certain point, the environment changes from QPSK to 16QAM. Because of that, the Agent is forced to adapt itself by entering the second exploration phase to update its Q-matrix. After about 2000 Q-Learning iterations, the Agent enters the last exploitation phase and decides to set the timing compensation correctly to 7 and the gain to ~0.7. The modification of G[n] from ~1 to ~0.7 is related to the change of the real part of the coordinates of the received symbols from the QPSK to the QAM format.

C. MODULATION SCHEME EXPERIMENTS AND COMPARISON WITH CONVENTIONAL METHODS

In this experiment we test the Q-Learning synchronizer performance when different modulation schemes are used. Moreover, we compare its performance to conventional synchronization methods. The Q-Learning hyperparameters are N = 64, α = 0.1, γ = 0.01, values suitable for BPSK, QPSK, 16QAM, 64QAM and 256QAM. The channel delay is constant and equal to d = M/4, where M is the number of samples per symbol. The Q-Learning synchronizer is compared to the following methods:

• A Reference ideal resampler: a resampler with the exact timing compensation for the channel delay d;
• A Mueller and Müller timing recovery loop: the state-of-the-art TRL discussed in Sect. II-A.

The graphs in Fig. 15 are obtained with the same set of Q-Learning hyperparameters. The Mueller and Müller TRL filter parameters depend on the modulation method (one set for PSK modulations and one for QAM) and they are reported in Table 1.

TABLE 1. Loop Filter Parameters of the Mueller and Müller TRLs

Parameter                  PSK   QAM
Damping Factor             1     2
Normalized Loop Bandwidth  0.1   0.005
Phase Detector Gain        2.7   2.7

Figure 15 and Table 2 show the experimental results. The number of simulations is 200, the number of exploration symbols is 10000 and the MER of a single simulation is computed over 200 symbols. Each value in Table 2 is the average MER plus or minus its standard deviation.

Figures 15a and 15b show the MER measurements for BPSK and QPSK, respectively. The MER values for BPSK and QPSK are similar. The only exception is in the QPSK configuration, where the Q-Learning synchronizer loses up to 0.7 dB at Eb/N0 = 25 dB with respect to the reference model.

The QAM related graphs (Fig. 15c, Fig. 15d and Fig. 15e) show an interesting behavior. In the case of high noise (Eb/N0 = 5-10 dB), the three systems perform similarly. In the mid-range (10-20 dB), the Q-Learning performance is similar to the Mueller and Müller loop for 16QAM and 256QAM. In the same range, for 64QAM, we measured MER values lower (1.5 dB on average) than the Mueller and Müller loop. In the 20-25 dB range, the Q-Learning synchronizer is outperformed by about 2 dB due to a poorer efficacy of the exploration phase of the Agent. This leads to the conclusion that the presence of white noise in the RL environment forces the Agent to explore more states, thus converging to its optimal Q-matrix much more efficiently. Taking into account all MER measurements, the average difference between our method and the conventional TRL is lower than 1 dB.

TABLE 2. Comparison of the Q-Learning synchronizer (QLearn) with the Reference ideal sampler (Ref) and the Mueller and Müller Timing Recovery Loop (M&M).

Eb/N0   System   BPSK MER (dB)   QPSK MER (dB)   16QAM MER (dB)   64QAM MER (dB)   256QAM MER (dB)
5 dB    Ref      4.03 ± 0.00     7.05 ± 0.00     10.64 ± 0.03     12.35 ± 0.04     13.65 ± 0.02
        M&M      3.43 ± 0.19     6.91 ± 0.02     10.02 ± 0.12     11.89 ± 0.11     13.16 ± 0.10
        QLearn   4.00 ± 0.01     7.03 ± 0.02     10.44 ± 0.04     12.19 ± 0.06     13.30 ± 0.05
10 dB   Ref      9.04 ± 0.00     12.05 ± 0.00    15.63 ± 0.07     17.42 ± 0.04     18.69 ± 0.05
        M&M      8.97 ± 0.02     11.94 ± 0.02    15.25 ± 0.08     17.00 ± 0.05     17.92 ± 0.26
        QLearn   9.01 ± 0.01     12.03 ± 0.01    15.22 ± 0.11     16.62 ± 0.09     17.75 ± 0.09
15 dB   Ref      14.04 ± 0.00    17.05 ± 0.00    20.72 ± 0.03     22.46 ± 0.03     23.66 ± 0.04
        M&M      13.96 ± 0.02    16.95 ± 0.02    20.17 ± 0.09     22.14 ± 0.07     21.81 ± 0.55
        QLearn   14.02 ± 0.01    17.02 ± 0.01    20.17 ± 0.12     20.80 ± 0.12     21.82 ± 0.16
20 dB   Ref      19.04 ± 0.00    22.05 ± 0.00    25.65 ± 0.06     27.37 ± 0.04     28.67 ± 0.04
        M&M      18.95 ± 0.02    21.96 ± 0.02    24.30 ± 0.21     26.91 ± 0.14     25.59 ± 1.27
        QLearn   19.01 ± 0.03    22.02 ± 0.03    25.30 ± 0.15     24.51 ± 0.36     25.30 ± 0.31
25 dB   Ref      24.04 ± 0.00    27.05 ± 0.00    30.59 ± 0.03     32.45 ± 0.03     33.58 ± 0.04
        M&M      23.91 ± 0.02    26.97 ± 0.01    28.29 ± 0.30     30.84 ± 0.39     29.96 ± 0.49
        QLearn   23.94 ± 0.07    26.56 ± 0.20    29.33 ± 0.42     28.78 ± 0.84     27.96 ± 0.51

FIGURE 15. MER for different modulation schemes and comparison with the Reference ideal resampler and the Mueller and Müller approach.

V. CONCLUSIONS

In this paper, we presented an innovative method based on RL to implement a QAM/PSK symbol synchronizer. This approach employs an RL Agent able to synthesize a TRL behavior compatible with a number of modulation schemes.

The RL algorithm used to approximate the State-Value function is Q-Learning. The state and reward are computed from the sampled symbol amplitude x̂[n] and its differential ∇²x̂[n]. The Agent action is branched into two sub-actions: the first one to manage the timing and the second one to adjust the input signal amplitude. The proposed Q-Learning TRL includes four modules: a State and Reward Evaluator (the RL Interpreter), a Q-Learning Engine, an Action Decoder, and an NCO. The compatibility with the different modulation schemes is obtained through the Constellation Conditioning block. This solution extends the work presented in [40] to QAM formats.

The proposed method was validated through the simulation of a base-band transmitter, a receiver, and an AWGN channel emulator, with Eb/N0 in the range 5-30 dB, through three experiments. In the first one, we estimated a set of hyperparameters for BPSK, QPSK, 16QAM, 64QAM and 256QAM. The performance of the RL synchronizer was evaluated in terms of Modulation Error Ratio (MER).

In the second experiment, we proved the autonomous adaptation capability of our approach, suitable for the new intelligent communication systems. In the last experiment, we compared the RL approach to the Mueller and Müller TRL: for BPSK and QPSK modulations, the two synchronizers achieve similar MER values, close to the Reference resampler. The proposed TRL is slightly outperformed for 64QAM and 256QAM in low-noise scenarios (Eb/N0 ≥ 15 dB). The RL synchronizer performance is the same as the Mueller and Müller TRL for 16QAM in the full noise range.

These results represent a good trade-off between flexibility and performance. Moreover, unlike conventional methods, the Q-Learning based TRL is able to adapt itself to evolving scenarios, avoiding parameter tuning. We plan to expand our research to support additional modulation formats and to implement suitable hardware architectures.

REFERENCES[1] C. M. Bishop, Pattern recognition and machine learning. springer, 2006.[2] M. Chen, Y. Hao, K. Hwang, L. Wang, and L. Wang, “Disease prediction

VOLUME 4, 2016 9

Page 11: A Reinforcement Learning Based QAM/PSK Symbol Synchronizer · M. Matta et al.: A Reinforcement Learning Based QAM/PSK Symbol Synchronizer parameter spaces [25]. Moreover, the inference

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI10.1109/ACCESS.2019.2938390, IEEE Access

M. Matta et al.: A Reinforcement Learning Based QAM/PSK Symbol Synchronizer

by machine learning over big data from healthcare communities,” IeeeAccess, vol. 5, pp. 8869–8879, 2017.

[3] D. Cabrera, C. Cubillos, A. Cubillos, E. Urra, and R. Mellado, “Affectivealgorithm for controlling emotional fluctuation of artificial investors instock markets,” IEEE Access, vol. 6, pp. 7610–7624, 2018.

[4] A. L’heureux, K. Grolinger, H. F. Elyamany, and M. A. Capretz, “Machinelearning with big data: Challenges and approaches,” IEEE Access, vol. 5,pp. 7776–7797, 2017.

[5] G. Wang, “A perspective on deep imaging,” IEEE Access, vol. 4, pp. 8914–8924, 2016.

[6] K. Muhammad, J. Ahmad, I. Mehmood, S. Rho, and S. W. Baik, “Convo-lutional neural networks based fire detection in surveillance videos,” IEEEAccess, vol. 6, pp. 18 174–18 183, 2018.

[7] T. Petrovic, K. Echigo, and H. Morikawa, “Detecting presence from a wifirouter’s electric power consumption by machine learning,” IEEE Access,vol. 6, pp. 9679–9689, 2018.

[8] A. Sehgal and N. Kehtarnavaz, “A convolutional neural network smart-phone app for real-time voice activity detection,” IEEE Access, vol. 6, pp.9017–9026, 2018.

[9] M. Dillinger, K. Madani, and N. Alonistioti, Software Defined Radio:Architectures, Systems and Functions. John Wiley & Sons, 2005.

[10] J. Mitola, “Cognitive Radio—an Integrated Agent Architecture for Soft-ware Defined Radio,” Ph.D. dissertation, Royal Institute of Technology(KTH), 2000.

[11] A. He et al., “A Survey of Artificial Intelligence for Cognitive Radios,”IEEE Transactions on Vehicular Technology, vol. 59, no. 4, pp. 1578–1592, 2010.

[12] M. Bari, H. Taher, S. S. Sherazi, and M. Doroslovacki, “SupervisedMachine Learning for Signals having RRC Shaped Pulses,” in Signals,Systems and Computers, 2016 50th Asilomar Conference on. IEEE, 2016,pp. 652–656.

[13] M. Fan and L. Wu, “Demodulator based on deep belief networks in com-munication system,” in 2017 International Conference on Communication,Control, Computing and Electronics Engineering (ICCCCEE). IEEE,2017, pp. 1–5.

[14] Q. Chen, Y. Wang, and C. W. Bostian, “Universal Classifier SynchronizerDemodulator,” in Performance, Computing and Communications Confer-ence, 2008. IPCCC 2008. IEEE International. IEEE, 2008, pp. 366–371.

[15] M. Zhang, Z. Liu, L. Li, and H. Wang, “Enhanced efficiency bpsk de-modulator based on one-dimensional convolutional neural network,” IEEEAccess, vol. 6, pp. 26 939–26 948, 2018.

[16] T. Chadov, S. Erokhin, and A. Tikhonyuk, “Machine learning approachon synchronization for fec enabled channels,” in 2018Systems of Sig-nal Synchronization, Generating and Processing in Telecommunications(SYNCHROINFO). IEEE, 2018, pp. 1–4.

[17] J. J. G. Torres, A. Chiuchiarelli, V. A. Thomas, S. E. Ralph, A. M. C.Soto, and N. G. González, “Adaptive nonsymmetrical demodulation basedon machine learning to mitigate time-varying impairments,” in 2016 IEEEAvionics and Vehicle Fiber-Optics and Photonics Conference (AVFOP).IEEE, 2016, pp. 289–290.

[18] Y. Liu, Y. Shen, L. Li, and H. Wang, “Fpga implementation of a bpsk 1d-cnn demodulator,” Applied Sciences, vol. 8, no. 3, p. 441, 2018.

[19] G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Matta,M. Re, F. Silvestri, and S. Spanò, “Efficient ensemble machine learningimplementation on fpga using partial reconfiguration,” in InternationalConference on Applications in Electronics Pervading Industry, Environ-ment and Society. Springer, 2018, pp. 253–259.

[20] D. Giardino, M. Matta, M. Re, F. Silvestri, and S. Spanò, “Ip generatortool for efficient hardware acceleration of self-organizing maps,” in In-ternational Conference on Applications in Electronics Pervading Industry,Environment and Society. Springer, 2018, pp. 493–499.

[21] L. M. Da Silva, M. F. Torquato, and M. A. Fernandes, “Parallel imple-mentation of reinforcement learning q-learning technique for fpga,” IEEEAccess, vol. 7, pp. 2782–2798, 2018.

[22] R. Li, Z. Zhao, Q. Sun, I. Chih-Lin, C. Yang, X. Chen, M. Zhao, andH. Zhang, “Deep reinforcement learning for resource management innetwork slicing,” IEEE Access, vol. 6, pp. 74 429–74 441, 2018.

[23] C. Savaglio, P. Pace, G. Aloi, A. Liotta, and G. Fortino, “Lightweightreinforcement learning for energy efficient communications in wirelesssensor networks,” IEEE Access, 2019.

[24] R. Ali, N. Shahin, Y. B. Zikria, B.-S. Kim, and S. W. Kim, “Deepreinforcement learning paradigm for performance optimization of channelobservation–based mac protocols in dense wlans,” IEEE Access, vol. 7,pp. 3500–3511, 2018.

[25] H. Kim, H. Nam, W. Jung, and J. Lee, “Performance analysis of CNN frameworks for GPUs,” in 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2017, pp. 55–64.

[26] A. Ray, Compassionate Artificial Intelligence: Frameworks and Algorithms. Compassionate AI Lab (An Imprint of Inner Light Publishers), 2018.

[27] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. MIT Press, 2018, pp. 7–8.

[28] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.

[29] A. Konar, I. G. Chakraborty, S. J. Singh, L. C. Jain, and A. K. Nagar, “A deterministic improved Q-learning for path planning of a mobile robot,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 43, no. 5, pp. 1141–1153, 2013.

[30] J.-L. Lin, K.-S. Hwang, W.-C. Jiang, and Y.-J. Chen, “Gait balance and acceleration of a biped robot based on Q-learning,” IEEE Access, vol. 4, pp. 2439–2449, 2016.

[31] S. Wu, “Illegal radio station localization with UAV-based Q-learning,” China Communications, vol. 15, no. 12, pp. 122–131, 2018.

[32] J. Zhu, Y. Song, D. Jiang, and H. Song, “A new deep-Q-learning-based transmission scheduling mechanism for the cognitive Internet of Things,” IEEE Internet of Things Journal, vol. 5, no. 4, pp. 2375–2385, 2018.

[33] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, “Deep direct reinforcement learning for financial signal representation and trading,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 653–664, 2017.

[34] C. Wei, Z. Zhang, W. Qiao, and L. Qu, “Reinforcement-learning-based intelligent maximum power point tracking control for wind energy conversion systems,” IEEE Transactions on Industrial Electronics, vol. 62, no. 10, pp. 6360–6370, 2015.

[35] I. Navarro and F. Matía, “An introduction to swarm robotics,” ISRN Robotics, vol. 2013, 2012.

[36] M. Brambilla, E. Ferrante, M. Birattari, and M. Dorigo, “Swarm robotics: a review from the swarm engineering perspective,” Swarm Intelligence, vol. 7, no. 1, pp. 1–41, 2013.

[37] M. Matta, S. Spanò, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, F. Silvestri et al., “Q-RTS: a Real-Time Swarm Intelligence based on Multi-Agent Q-Learning,” Electronics Letters, 2019.

[38] C. J. Watkins and P. Dayan, “Q-Learning,” Machine Learning, vol. 8, no. 3–4, pp. 279–292, 1992.

[39] K. Mueller and M. Müller, “Timing recovery in digital synchronous data receivers,” IEEE Transactions on Communications, vol. 24, no. 5, pp. 516–531, 1976.

[40] G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Matta, F. Silvestri, M. Re, and S. Spanò, “A Q-Learning based PSK Symbol Synchronizer,” in 2019 International Symposium on Signals, Circuits and Systems (ISSCS 2019). IEEE, 2019, accepted.

[41] J. Barry, E. Lee, and D. Messerschmitt, Digital Communication, 3rd ed. Springer Science & Business Media, 2012, pp. 137–139 and 727–760.

[42] F. Ling and J. Proakis, Synchronization in Digital Communication Systems. Cambridge University Press, 2017.

[43] J. P. Costas, “Synchronous communications,” Proceedings of the IRE, vol. 44, no. 12, pp. 1713–1718, Dec. 1956.

[44] B. Sklar, Digital Communications: Fundamentals and Applications. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.

[45] F. Gardner, “A BPSK/QPSK timing-error detector for sampled receivers,” IEEE Transactions on Communications, vol. 34, no. 5, pp. 423–429, May 1986.

[46] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.

[47] R. Bellman, Dynamic Programming, 1st ed. Princeton, NJ, USA: Princeton University Press, 1957.

[48] A. Tavakoli, F. Pardo, and P. Kormushev, “Action branching architectures for deep reinforcement learning,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[49] ETSI, “290, Digital Video Broadcasting (DVB): Measurement guidelines for DVB systems,” DVB Document AO, vol. 14, pp. 43–44, 1997.


MARCO MATTA was born in Cagliari, Italy, in 1989. He received the B.S. degree in electronic engineering from the University of Rome “Tor Vergata”, Italy, in 2014 and the M.S. degree in electronic engineering from the same institute in 2017. Since 2017, he has been a member of the DSPVLSI research group of the same University. He is currently pursuing the Ph.D. degree in electronic engineering. His research interests include the development of hardware platforms and low-power accelerators aimed at Machine Learning algorithms and telecommunications. In particular, he is currently focused on the implementation of Reinforcement Learning models on FPGA.


GIAN CARLO CARDARILLI (S’79–M’81) was born in Rome, Italy, and received the laurea degree (summa cum laude) from the University of Rome “La Sapienza” in 1981. He has been with the University of Rome “Tor Vergata” since 1984, where he is currently full professor of digital electronics and electronics for communication systems. During 1992–1994, he was with the University of L’Aquila. During 1987–1988, he was with the Circuits and Systems team at EPFL, Lausanne, Switzerland. His interests are in the area of VLSI architectures for signal processing and IC design. In this field, he has published more than 160 papers in international journals and conferences. He also cooperates regularly with companies such as Alcatel Alenia Space, Italy; STM, Agrate Brianza, Italy; Micron, Italy; and Selex S.I., Italy. His scientific interest concerns the design of special architectures for signal processing. He works in the field of computer arithmetic and its application to the design of fast digital signal processors.

LUCA DI NUNZIO is Adjunct Professor in the Digital Electronics Lab at the University of Rome “Tor Vergata” and Adjunct Professor of Digital Electronics at the University Guglielmo Marconi. He received the Master degree (summa cum laude) in Electronics Engineering in 2006 from the University of Rome “Tor Vergata” and the Ph.D. in Systems and Technologies for Space in 2010 from the same University. He has a working history with several companies in the fields of Electronics and Communications. His research activity is in the fields of Reconfigurable Computing, Communication Circuits, Digital Signal Processing, and Machine Learning.

ROCCO FAZZOLARI received the Master degree in Electronic Engineering in May 2009 and the Ph.D. in “Space Systems and Technologies” in June 2013 from the University of Rome “Tor Vergata”, Italy. At present, he is a postdoctoral fellow and assistant professor at the University of Rome “Tor Vergata” (Dept. of Electronic Engineering). He works on the hardware implementation of high-speed systems for Digital Signal Processing, Machine Learning, arrays of Wireless Sensor Networks, and systems for data analysis of Acoustic Emission (AE) sensors (based on ultrasonic waves).

DANIELE GIARDINO received the B.S. and M.S. degrees in electronic engineering from the University of Rome “Tor Vergata”, Italy, in 2015 and 2017, respectively. He is currently a Ph.D. student in electronic engineering at the same institute and a member of the DSPVLSI research group. He works on the digital development of wideband signal architectures, telecommunications, digital signal processing, and machine learning. Specifically, he is focused on the digital implementation of MIMO systems for wideband signals.

ALBERTO NANNARELLI (S’94–M’99–SM’13) graduated in electrical engineering from the University of Roma “La Sapienza”, Rome, Italy, in 1988, and received the M.S. and Ph.D. degrees in electrical and computer engineering from the University of California at Irvine, CA, USA, in 1995 and 1999, respectively. He is an Associate Professor at the Technical University of Denmark, Lyngby, Denmark. He worked for SGS-Thomson Microelectronics and for Ericsson Telecom as a Design Engineer, and for Rockwell Semiconductor Systems as a summer intern. From 1999 to 2003, he was with the Department of Electrical Engineering, University of Roma “Tor Vergata”, Italy, as a Postdoctoral Researcher. His research interests include computer arithmetic, computer architecture, and VLSI design. Dr. Nannarelli is a Senior Member of the IEEE Computer Society.

MARCO RE (M’92) holds a Ph.D. in Microelectronics and is Associate Professor at the University of Rome “Tor Vergata”, where he teaches Digital Electronics and Hardware Architectures for DSP. He was awarded two NATO fellowships at the University of California at Berkeley, working as a visiting scientist at Cadence Berkeley Laboratories. He was awarded the Otto Moensted fellowship as visiting professor at the Technical University of Denmark. He collaborates on many research projects with different companies in the field of DSP architectures and algorithms. His main scientific interests are in the fields of low-power DSP algorithms and architectures, Hardware-Software Codesign, Fuzzy Logic and Neural hardware architectures, low-power digital implementations based on non-traditional number systems, Computer Arithmetic, and CAD tools for DSP. He is the author of about 200 papers in international journals and conferences. He is a member of the IEEE and the Audio Engineering Society (AES). He is Director of a Master in Audio Engineering at the University of Rome “Tor Vergata”, Department of Electronic Engineering.


SERGIO SPANÒ received the B.S. degree in Electronic Engineering from the University of Rome “Tor Vergata”, Italy, in 2015 and the M.S. degree in Electronic Engineering from the same University in 2018. He is currently pursuing the Ph.D. degree in Electronic Engineering at “Tor Vergata” as a member of the DSPVLSI research group. His interests include Digital Signal Processing, Machine Learning, Telecommunications, and ASIC/FPGA hardware design. He has industrial experience in the Space and Telecommunications fields. His current research topics relate to Machine Learning hardware implementations for embedded and low-power systems.
