A 675 Mbps, 4×4 64-QAM K-best MIMO detector in 0.13 µm CMOS

Upload: tariksuleiman

Post on 05-Apr-2018


  • 8/2/2019 A 675 Mbps, 4 4 64-Qam K-best Mimo Detector in 0.13 Cmos

    1/13

    This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

    A 675 Mbps, 4×4 64-QAM K-Best MIMO Detector in 0.13 µm CMOS

    Mahdi Shabany, Associate Member, IEEE, and P. Glenn Gulak, Senior Member, IEEE

    Abstract: This paper introduces a novel scalable pipelined VLSI architecture for a 4×4 64-QAM hard-output multiple-input multiple-output (MIMO) detector based on K-best lattice decoders. The key contribution is a means of expanding the intermediate nodes of the search tree on demand, rather than exhaustively, along with three types of distributed sorters operating in a pipelined structure. The proposed architecture has a fixed critical path independent of the constellation size, an on-demand expansion scheme, and efficient distributed sorters, and is scalable to a higher number of antennas. Fabricated in 0.13 µm CMOS, it occupies 0.95 mm² core area. Operating at a 282 MHz clock frequency, it dissipates 135 mW at a 1.3 V supply with no BER performance loss. It achieves an SNR-independent throughput of 675 Mbps, satisfying the requirements of IEEE 802.16m and long term evolution (LTE) systems. The measurements confirm that this design consumes 3.0× less energy/bit and operates at a significantly higher throughput compared to the best previously published design.

    Index Terms: K-best detectors, long term evolution (LTE) systems, multiple-input multiple-output (MIMO) detection, WiMAX systems.

    I. INTRODUCTION

    Due to their high spectral efficiency, multiple-input multiple-output (MIMO) systems have attracted significant attention as the technology of choice in many standards such as IEEE 802.11n, IEEE 802.16e, IEEE 802.16m and the long term evolution (LTE) project. One of the main challenges in exploiting the potential of MIMO systems is to design low-complexity, high-throughput detection schemes with near maximum-likelihood (ML) performance that are suitable for efficient very large scale integration (VLSI) realization. Unfortunately, the complexity of the optimal ML detection scheme grows exponentially with the number of transmit antennas and the constellation size. Lower-complexity detectors such as zero-forcing (ZF), minimum mean-square error (MMSE) or successive interference cancelation (SIC) detectors can greatly reduce the computational complexity. However, they suffer from significant performance loss.

    The other alternative is to use near-optimal non-linear detectors [2]. Depending on how they carry out the non-exhaustive search, near-optimal non-linear detection methods generally fall into a few main categories, namely depth-first search, breadth-first search, and best-first search. Depth-first sphere decoding (SD) [3] is one of the most attractive depth-first approaches, whose performance is optimal under the assumption of unlimited execution time [2]. However, the actual runtime of the algorithm depends not only on the channel realization, but also on the operating signal-to-noise ratio (SNR) [4]. This leads to a variable sustained throughput, which results in extra hardware overhead due to the additional I/O buffers required and lower hardware utilization.

    Manuscript received March 13, 2010; revised July 03, 2010 and September 12, 2010; accepted October 19, 2010. This work was published in part at the International Solid-State Circuits Conference (ISSCC) 2009.

    M. Shabany is with the Electrical Engineering Department, Sharif University of Technology, Tehran 11556-74513, Iran (e-mail: [email protected]).

    P. G. Gulak is with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 2E4, Canada (e-mail: [email protected]).

    Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TVLSI.2010.2090367

    Among the breadth-first search methods, the most well-known approach is the K-best algorithm [5]. The K-best detector guarantees an SNR-independent fixed throughput with a performance close to ML. Being fixed-throughput in nature, along with the fact that the breadth-first approaches are feed-forward detection schemes, makes them especially attractive for VLSI implementation. There has been some effort in the literature directed towards their VLSI implementation [6], [7]. However, the current child expansion and sorting schemes in those architectures are not efficient/scalable for higher-order constellation schemes such as 64-QAM and 256-QAM. In most of these architectures, the delay of the critical path increases for higher modulation orders, which ultimately limits the maximum achieved throughput. Moreover, in spite of various published architectures for the implementation of 4×4 16-QAM systems, an efficient high-throughput application-specific integrated circuit (ASIC) implementation for 64-QAM systems at high data rates is still a major challenge and has not been fully addressed in the literature.

    In this paper, an efficient VLSI architecture, its chip implementation and test results for a 4×4 64-QAM K-best MIMO detector are reported, which alleviates the above problems and operates at a significantly higher throughput than currently reported schemes. The promising features of the proposed ASIC are as follows. Simulation results indicate that $K$ scales sub-linearly with the constellation size. For instance, for 16-QAM, $K$ is chosen to be 5 while for 64-QAM $K = 10$, meaning that the constellation quadruples but the $K$ value only doubles, hence the sub-linear increase. It also has a fixed critical path delay independent of the constellation order, the $K$ value, and the number of antennas. Moreover, it efficiently expands a very small fraction of all possible children in the K-best algorithm and can be applied to infinite lattices. Finally, it provides the exact K-best solution, i.e., the solution that implements the original K-best algorithm with all needed expansions.

    II. K-BEST ALGORITHM

    1063-8210/$26.00 © 2010 IEEE

    Let us consider a spatial multiplexing MIMO system with $N_T$ transmit and $N_R$ receive antennas whose equivalent baseband model of the Rayleigh fading channel is described by a complex-valued channel matrix $\tilde{H}$ of size $N_R \times N_T$. The complex baseband equivalent model can be expressed as $\tilde{y} = \tilde{H}\tilde{s} + \tilde{v}$, where $\tilde{s} = [\tilde{s}_1, \ldots, \tilde{s}_{N_T}]^T$ denotes the $N_T$-dimensional complex transmit signal vector, in which each element is independently drawn from a complex constellation $\mathcal{O}$ (a symmetric $M$-QAM scheme with $\log_2 M$ bits per symbol, i.e., $|\mathcal{O}| = M$), $\tilde{y} = [\tilde{y}_1, \ldots, \tilde{y}_{N_R}]^T$ is the $N_R$-dimensional received symbol vector, and $\tilde{v} = [\tilde{v}_1, \ldots, \tilde{v}_{N_R}]^T$ represents the $N_R$-dimensional independent identically distributed (i.i.d.) complex zero-mean Gaussian noise vector with variance $\sigma^2$, i.e., $\tilde{v} \sim \mathcal{CN}(0, \sigma^2 I_{N_R})$. The real-valued equivalent of this system can also be derived using a real-valued decomposition (RVD) model [6] as follows:

    $$y = Hs + v \tag{1}$$

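    where $s$, $y$, and $v$ are the equivalent real-valued vectors of dimensions $2N_T$, $2N_R$, and $2N_R$, formed from the real and imaginary parts $\Re\{\cdot\}$ and $\Im\{\cdot\}$ of the corresponding complex vectors, and $H$ is the $2N_R \times 2N_T$ real-valued channel matrix, decomposed accordingly. Note that $s \in \Omega^{2N_T}$, where $\Omega = \{-\sqrt{M}+1, \ldots, -1, 1, \ldots, \sqrt{M}-1\}$ is the set of possible real entries in the constellation for the in-phase and quadrature parts with $|\Omega| = \sqrt{M}$. The objective of the MIMO ML detection method is to find the closest transmitted vector $\hat{s}$ based on the observation $y$, i.e.,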

    $$\hat{s} = \arg\min_{s \in \Omega^{2N_T}} \|y - Hs\|^2 \tag{2}$$

    The exhaustive-search ML detection is infeasible to implement for large constellation sizes (i.e., 64-QAM and larger) because of its exponential complexity. The K-best algorithm, a.k.a. the M-algorithm, is a near-ML technique to solve the above problem with a much lower complexity.

    The problem in (2) can be considered as a tree-search problem with $2N_T$ levels. In fact, the K-best algorithm explores the tree from the root to the leaves by expanding each level and selecting the $K$ best candidates with the lowest path metric in each level, which are the surviving nodes of that level. Consider the problem in (2), and let us denote the QR-decomposition of the channel matrix as $H = QR$, where $Q$ is a unitary matrix of size $2N_R \times 2N_T$ and $R$ is an upper triangular $2N_T \times 2N_T$ matrix. Applying $Q^H$ to (1) results in

    $$z = Q^H y = Rs + w \tag{3}$$

    where $w = Q^H v$. Since the nulling matrix is unitary, the noise, $w$, remains spatially white and the norm in (2), which represents the ML detection rule, can be rewritten as $\|z - Rs\|^2$. Exploiting the upper triangular nature of $R$, this norm can be further expanded as

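    $$\hat{s} = \arg\min_{s \in \Omega^{2N_T}} \sum_{i=1}^{2N_T} \Big| z_i - \sum_{j=i}^{2N_T} r_{ij} s_j \Big|^2 \tag{4}$$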

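    which is a tree-search problem with $2N_T$ levels. Starting from $i = 2N_T$, (4) can be evaluated recursively as follows:

    $$T_i(s^{(i)}) = T_{i+1}(s^{(i+1)}) + |e_i(s^{(i)})|^2 \tag{5}$$

    $$e_i(s^{(i)}) = z_i - \sum_{j=i}^{2N_T} r_{ij} s_j \tag{6}$$

    for $i = 2N_T, 2N_T - 1, \ldots, 1$, where $s^{(i)} = [s_i, s_{i+1}, \ldots, s_{2N_T}]^T$, $T_i(s^{(i)})$ is the accumulated partial Euclidean distance (PED) with $T_{2N_T+1}(s^{(2N_T+1)}) = 0$, $|e_i(s^{(i)})|^2$ denotes the distance increment between two successive nodes/levels in the tree, and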

    (7)

    where , , and denote the scaled , , andby , respectively, i.e., , , and

    . Based on the above formulation,¹ the K-best algorithm can be described as in Table I.

    The path with the lowest PED at the last level of the tree is thehard-decision output of the detector, whereas, for a soft-decisionoutput, all of the existing paths at the last level are consideredto calculate the Log-Likelihood Ratios (LLRs).

    Let us consider a $2N_R \times 2N_T$ real-model MIMO system with channel matrix $H$ as in (1). As mentioned, the system can be thought of as a detection problem in a tree with $2N_T$ levels, $K$ nodes per level and $\sqrt{M}$ children per node. Because of the upper triangular structure of the matrix $R$, the algorithm starts from the last row of the matrix (the $2N_T$-th row, which is the $2N_T$-th level of the detection tree) and goes all the way up to the first row of the matrix, which is the first level of the detection tree. Note that in this scheme, all the possible children of a level are expanded exhaustively. The size of this exhaustive expansion grows significantly when the constellation size is scaled upward. Therefore, better ways are needed to calculate the $K$ best candidates of each level without performing an exhaustive search.
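The exhaustive K-best search described above can be sketched in a few lines. This is a minimal illustration only, assuming the standard PED recursion (accumulated metric plus the squared residual of row i of R); the function name `kbest_detect` and its argument layout are hypothetical and do not come from the paper.

```python
def kbest_detect(R, z, omega, K):
    """Exhaustive-expansion K-best search for z = R s + w, R upper triangular.

    R: list of rows, z: list of observations, omega: real symbol alphabet.
    Returns (PED, symbols) of the best surviving path.
    """
    n = len(z)
    # Each survivor is (accumulated PED, decided symbols for levels i+1..n-1).
    survivors = [(0.0, [])]
    for i in range(n - 1, -1, -1):            # start from the last row of R
        children = []
        for ped, tail in survivors:
            # Interference from the symbols already decided on this path.
            interf = sum(R[i][i + 1 + k] * s for k, s in enumerate(tail))
            for s_i in omega:                 # every child is enumerated
                e = z[i] - interf - R[i][i] * s_i
                children.append((ped + e * e, [s_i] + tail))
        children.sort(key=lambda c: c[0])     # full sort of all children
        survivors = children[:K]              # keep the K best
    return survivors[0]


# Noise-free toy example: s = [1, -3], so z = R s.
print(kbest_detect([[2.0, 1.0], [0.0, 2.0]], [-1.0, -6.0], [-3, -1, 1, 3], 2))
```

Every level evaluates all children of all survivors and sorts them, which is exactly the cost that the expansion and sorting schemes discussed next aim to avoid.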

    Regardless of whether dealing with the hard-decision or soft-decision output, there are two main computations that play critical roles in the total computational complexity of the algorithm, namely, 1) the expansion of the surviving paths, and 2) the sorting.² Therefore, the important part of any VLSI realization of the K-best algorithm is an efficient architecture to implement these two computational cores. The approach used in previously published work and that used in this paper are described in the following.

    1) Expansion: The K-best algorithm enumerates all the possible children of a parent node in each level. Since there are $K$ parent nodes at each level and $\sqrt{M}$ children per parent, the path metrics of $K\sqrt{M}$ children need to be computed in

    ¹ A typical detection core consists of a preprocessing core whose output is the R matrix. In this paper, we assume that the sorted QR-decomposition algorithm is applied to the channel matrix to generate the R matrix using the scaled and decoupled architectures [8].

    ² If these bottlenecks are resolved, the extension of the hard-decision scheme to the soft version is shown to be straightforward in [6].


    TABLE I: K-BEST ALGORITHM

    each level, which incurs a large computational complexity.³ The phase shift keying (PSK) enumeration scheme [9], which is based on the search over multiple base-centric circles, or its simplified version for $M$-QAM systems [10], have been proposed to simplify the enumeration process. Moreover, in [11], a different variation of the base-centric search methodology is used, in which the joint SD algorithm and successive interference cancelation are employed. A relaxed K-best enumeration scheme is also proposed in [12] based on the PSK enumeration idea with local sorters. Although these methods are simpler to implement, they do not linearly scale with the constellation size (such as [9]) and/or have performance loss compared to the exact K-best implementation (such as [12] and [11]). In this paper, we propose an efficient expansion method called the on-demand expansion scheme, which avoids the exhaustive enumeration of the children while providing all the information required for the exact K-best implementation with no performance degradation. To the best of our knowledge, it is the only expansion scheme to date with a computational complexity proportional to the $K$ value and independent of the constellation size.

    2) Sorting: Based on the algorithm in Table I, in each level of the tree there are $K\sqrt{M}$ children to be sorted. Among all the sorting algorithms addressed in [13], bubble sorting is the most effective one, which distributes the sorting over multiple cycles [5]. Using bubble sorting, it takes $K\sqrt{M}$ cycles to obtain the sorted list in each level. This is time-intensive for large values of $K$ and $M$, which ultimately limits the throughput. In [7], a distributed sorting method is proposed based on the Schnorr-Euchner (SE) ordered search technique [2], [14]. However, it requires all the children of a parent node to be calculated by a metric computation unit and is applicable only for certain values of $K$, and thus cannot be applied in general. Moreover, for higher values of $K$, the proposed single-cycle merge core in [7] becomes increasingly complex, resulting in a long critical path. Therefore, [7] is not a suitable platform to achieve high throughput for higher-order modulations like 64-QAM and 256-QAM, where the value of $K$ is large (e.g., $K = 10$ for 64-QAM). Finally, in [12], a relaxed approach to implement the K-best algorithm using a distributed sorting scheme is proposed. This approach is simpler to implement but results in a performance loss compared to the exact K-best solution. Moreover, the implemented ASIC occupies a large silicon area while having moderate throughput. In this paper, we propose a distributed sorter, working in a pipelined structure with the on-demand expansion scheme, which finds the $K$ best candidates in $K$ clock cycles. It works for any value of $K$ and $M$, and its complexity is proportional to the $K$ value and independent of the constellation size. It also does not compromise the BER performance, provides the exact K-best solution, and can be easily extended to the complex domain [15].

    ³ In some implementations, a metric such as a radius constraint is used to limit the number of expanded children [6]. In this paper, we consider a general K-best implementation without a metric restriction.

    Fig. 1. Order of the SE row-enumeration for four consecutive enumerations in 16-QAM.

    III. PROPOSED K-BEST DETECTION SCHEME

    Consider level $i$ of the tree and assume that the set of K-best candidates in level $i+1$ (denoted by $\mathcal{K}_{i+1}$) is known. Each node in level $i+1$ has $\sqrt{M}$ possible children, so there are $K\sqrt{M}$ possible children in level $i$. One of the main elements of our proposed scheme is to find the children of each node on demand and in the order of increasing PED, rather than calculating the PED of all the children exhaustively. In other words, the key idea of the proposed distributed K-best scheme is to find the first child⁴ (FC) of each parent node in $\mathcal{K}_{i+1}$. Among these first children, the one with the lowest PED is definitely one of the K-best candidates in $\mathcal{K}_i$. That child is selected and replaced by its next best sibling.⁵ This process repeats $K$ times to find the K-best candidates in level $i$. The same procedure is performed for each level of the tree.
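The select-and-replace procedure above can be sketched in software as follows. This is an illustrative sketch under stated assumptions, not the hardware architecture: a binary heap stands in for the distributed sorters, and `children_in_order` is a hypothetical stand-in that orders symbols by distance from the unconstrained minimizer (which yields the same order as the SE enumeration). All names are assumptions.

```python
import heapq


def children_in_order(center, omega):
    # Stand-in for the SE enumeration: symbols by increasing distance.
    return iter(sorted(omega, key=lambda s: abs(s - center)))


def kbest_level_on_demand(parents, r_row, i, z_i, omega, K):
    """Return the K best (PED, symbols) children of `parents` at level i.

    parents: list of (PED, decided symbols for levels i+1..n-1).
    r_row:   row i of the upper triangular matrix R (r_row[i] != 0).
    """
    heap = []
    for idx, (ped, tail) in enumerate(parents):
        interf = sum(r_row[i + 1 + k] * s for k, s in enumerate(tail))
        it = children_in_order((z_i - interf) / r_row[i], omega)
        s = next(it)                          # first child of this parent
        e = z_i - interf - r_row[i] * s
        heapq.heappush(heap, (ped + e * e, idx, s, interf, it))
    best = []
    while heap and len(best) < K:
        cost, idx, s, interf, it = heapq.heappop(heap)
        ped, tail = parents[idx]
        best.append((cost, [s] + tail))       # announce the current minimum
        nxt = next(it, None)                  # replace it by its next sibling
        if nxt is not None:
            e = z_i - interf - r_row[i] * nxt
            heapq.heappush(heap, (ped + e * e, idx, nxt, interf, it))
    return best


parents = [(0.0, [-3]), (16.0, [-1])]
print(kbest_level_on_demand(parents, [2.0, 1.0], 0, -1.0, [-3, -1, 1, 3], 3))
```

Only one PED is evaluated per parent up front plus one per announced candidate, so the number of PED computations grows with the number of parents and K rather than with the constellation size.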

    A. First/Next Child Calculation

    In the on-demand scheme described above, the first and next child need to be determined. Based on the system model in (5), the first child of a node in $\mathcal{K}_{i+1}$ is the one minimizing $|e_i(s^{(i)})|$, i.e.,

    $$s_i^{[1]} = \arg\min_{s_i \in \Omega} |e_i(s^{(i)})| \tag{8}$$

    This is because $T_{i+1}(s^{(i+1)})$ is common to all children of a parent.

    Therefore, $s_i^{[1]}$ can be found by rounding $\big(z_i - \sum_{j=i+1}^{2N_T} r_{ij} s_j\big)/r_{ii}$ to the nearest value in $\Omega$. In order to find the next children (NC), the Schnorr-Euchner technique [14] is employed, which implies a zig-zag movement around $s_i^{[1]}$ to select the consecutive elements in $\Omega$. Fig. 1 shows such an enumeration for 16-QAM. In fact, the SE enumeration finds the closest points in a real domain one-by-one by changing the search direction. The procedure of selecting the first/next child of a node in level $i$ is described in Table II, which tracks the number of moves and the direction of each move. In fact, the direction alternates between positive and negative unless the enumeration reaches a boundary of $\Omega$. The number of moves also increases by 2 every time and is reset to 2 if the boundaries of $\Omega$ are reached.
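The zig-zag rule just described can be sketched as a generator. Table II itself is not reproduced in this text, so the following is only one plausible reading of it, assuming $\Omega$ is the set of odd integers in [-omega_max, omega_max]; the name `se_enumerate` and its variables are hypothetical.

```python
import math


def se_enumerate(center, omega_max):
    """Yield the odd integers in [-omega_max, omega_max] in order of
    increasing distance from `center` (Schnorr-Euchner zig-zag)."""
    lo, hi = -omega_max, omega_max
    first = 2 * math.floor(center / 2) + 1    # nearest odd integer
    current = min(max(first, lo), hi)         # limit to the constellation
    yield current
    direction = 1 if center >= current else -1
    step, one_sided = 2, False
    for _ in range(omega_max):                # omega_max points remain
        candidate = current + direction * step
        if not (lo <= candidate <= hi):
            # Boundary reached: stop alternating, walk the other way by 2.
            direction, step, one_sided = -direction, 2, True
            candidate = current + direction * step
        current = candidate
        yield current
        if not one_sided:
            direction, step = -direction, step + 2


print(list(se_enumerate(2.2, 3)))    # 16-QAM, one real dimension
print(list(se_enumerate(-2.7, 7)))   # 64-QAM, one real dimension
```

Each call to `next` on the generator costs one addition and one comparison against the boundaries, which is why the next sibling can be produced without evaluating the metric of every symbol in $\Omega$.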

    The proposed scheme is pictorially depicted in Fig. 2 for levelwhere and . It shows the way that is de-

    ⁴ The first child refers to the child with the lowest local PED among all children of a parent.

    ⁵ The next sibling refers to the child with the next lowest local PED.


    Fig. 3. Proposed pipelined VLSI architecture of the K-best algorithm for the detection of a 4×4, 64-QAM system with K = 10.

    this sorter, the number of clock cycles required for sorting is half that of classic bubble sorting. The key idea that makes this sorter faster is the implementation of two tasks (max/min and the data exchange) in 1 clock cycle through the introduction of intermediate registers. The detailed architecture of this block will be discussed in Section IV-B-3.

    The output of the Sorter block is the sorted list of FCs of level 7 (FC-L7 in Fig. 3), which are all loaded simultaneously into the next stage (i.e., PE I).⁸ Generally speaking, in each level, 1 PE II block is used to generate and sort the list of all FCs of the current level and 1 PE I block is used to generate the K-best list of the current level, denoted by FC-L$i$ and NC-L$i$ in Fig. 3.

    The task of the PE I block is to take the FCs of each level as an input and generate the K-best list of that level one-by-one. The FC with the lowest PED is definitely one of the K-best candidates in level 7. This value is passed to the PE II block in FC-L6. Upon the removal of this FC, its next sibling needs to be calculated, which is done by the core called NC-Block in the feedback loop of the PE I block (Section IV-B-4 and IV-B-5), and which substitutes the removed FC. The PED of this sibling needs to be compared with the other FCs already present in the NC-L7 stage. The next K-best candidate has the lowest PED among this new set. This process is repeated 10 times (taking 10 cycles) until all the K-best values of the second level of the tree are generated and passed to the PE II block in FC-L6.

    The PE II block receives the K-best candidates of level 7, one after the other, generates the FC of each received K-best candidate one-by-one, and sorts them as they arrive. It finally transfers them to its following PE I block. This process repeats for all the levels down to the first level. Since at the first level only the FC with the lowest PED is of concern, only 1 PE II block is used for the first level (FC-L1), whose output is the hard-detection solution.⁹

    ⁸ The data transfers happen between blocks every 10 clock cycles. The dashed gray arrows in Fig. 3 imply that the data is loaded only once every 10 clock cycles after the completion of the previous stage, and the number on the arrow shows how many cycles after the completion of the previous stage the data is loaded. Note also that the utilization factor for all the blocks except the first three is 100%. This means the PE I/II blocks require 10 cycles to produce an output while the Sorter block is active for only 4 cycles every 10 clock cycles. The output of the Sorter is loaded into the following PE I block once every 10 cycles.
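As a rough consistency check on the reported throughput (an illustration only, assuming the pipeline delivers one detected 4×4 64-QAM vector, i.e., $4 \times \log_2 64 = 24$ bits, every $K = 10$ clock cycles as the 10-cycle transfer interval suggests):

```latex
\text{throughput} \approx \frac{4 \cdot \log_2 64\ \text{bits}}{10\ \text{cycles}} \times 282\ \text{MHz}
                  = 2.4\ \tfrac{\text{bits}}{\text{cycle}} \times 282\ \text{MHz}
                  = 676.8\ \text{Mbps}
```

This is close to the 675 Mbps figure quoted in the abstract, and it does not depend on the SNR because the number of cycles per vector is fixed.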

    B. Detailed VLSI Architecture

    The inputs to the architecture are the entries of the $R$ matrix as well as the vector $z$ in (3). The $R$ matrix resulting from the QR-decomposition on (1) has some nice features, which are explained by the following example. Let us consider a 4×4, 64-QAM MIMO system (8×8 with RVD). The $R$ matrix is as follows:¹⁰

    (9)

    This implies that two consecutive rows of the $R$ matrix share the same entries with a possible sign flip. Therefore, in the VLSI architecture, the input values of two consecutive levels share the $r_{ij}$ values; thus, the above RVD reduces both the number of input pads and the memory required to buffer the $r_{ij}$ values. The other implication of this structure is that the first children of the odd rows do not depend on the K-best list of the preceding even row. This is because $r_{12}$, $r_{34}$, $r_{56}$, and $r_{78}$ are all zero. For

    ⁹ The goal of the PE II block is to generate the sorted list of all FCs, not just the minimum FC. This is because this sorted list is fed to the following PE I block to find the K best nodes in the current level. This works as follows. In the first clock cycle, the minimum FC is announced as the next child. The next sibling of this announced node is calculated. The PED of this new sibling has to be compared to the PED of the previous 9 FCs provided by the preceding PE II. Since we have already sorted them in the previous stage, we just need 1 clock cycle to compare the PED of the new sibling with the 2nd lowest FC and announce the winner. Had we not sorted them in PE II, we would have had to find the minimum of the PED of the new sibling and all the 9 remaining FCs of the last stage, which incurs a long critical path.

    ¹⁰ The R matrix in (9) is derived as follows: First, the columns of the complex-valued channel matrix are sorted. Then, this sorted version is transformed into the real-valued domain, where finally the QRD is applied. Note that the proposed algorithmic and architectural ideas in this paper can be easily reconfigured/modified, with the aid of a small control circuitry, to accommodate any input R matrix with any arbitrary sorted inputs. The only thing that has to be modified is the way we store the input r values and the multiplexing scheme at the input.
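The row-sharing and zero-entry structure attributed to (9) can be checked numerically. The sketch below is an illustration under assumptions that are not confirmed by the text: it uses an interleaved real-valued decomposition (real and imaginary parts of each antenna placed in adjacent rows and columns), skips the column sorting mentioned in footnote 10, and computes R with a plain Gram-Schmidt procedure. The helper names are hypothetical.

```python
import random


def interleaved_rvd(Hc):
    """Map a complex N_R x N_T matrix to a real 2N_R x 2N_T matrix with the
    real and imaginary parts of each entry interleaved (assumed ordering)."""
    rows = []
    for row in Hc:
        top, bottom = [], []
        for h in row:
            top += [h.real, -h.imag]
            bottom += [h.imag, h.real]
        rows += [top, bottom]
    return rows


def qr_upper(H):
    """Return the upper triangular factor R of H (modified Gram-Schmidt)."""
    m, n = len(H), len(H[0])
    Q, R = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [H[r][j] for r in range(m)]       # j-th column of H
        for i, q in enumerate(Q):
            R[i][j] = sum(a * b for a, b in zip(q, v))
            v = [a - R[i][j] * b for a, b in zip(v, q)]
        R[j][j] = sum(a * a for a in v) ** 0.5
        Q.append([a / R[j][j] for a in v])
    return R


random.seed(0)
Hc = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(4)]
      for _ in range(4)]
R = qr_upper(interleaved_rvd(Hc))
for k in range(0, 8, 2):
    # Paired rows: zero coupling term and identical diagonal entries.
    print(round(R[k][k + 1], 12), round(R[k][k] - R[k + 1][k + 1], 12))
```

Under this ordering, the entries coupling the two real dimensions of the same antenna vanish and the paired diagonal entries coincide, which matches the properties the text relies on (for example, the zero coupling entries listed above).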


    Fig. 7. Architecture for Level II with the critical path highlighted.

    a long critical path and implies that the critical path of the architecture contains only 1 multiplication. It is assumed that all the FFs used in this paper are triggered by the positive edge of the clock.

    2) Level II Block: The input to Level II is the PED values of the 8th level and its output is the PED values of the first children in the 7th level of the tree. In fact, in the Level II block, the first children of the eight nodes in the 8th level are determined. Note that due to the structure of the $R$ matrix in (9), the first children in the 7th level of the tree are all the same and independent of their parents in level 8 (because $r_{78} = 0$). This child is determined and is used to calculate the updated PED values of the nodes in the 7th level. Since $r_{77} = r_{88}$, no extra input is required for the calculations in Level II. The equation of the 7th level can be written as $e_7 = z_7 - r_{77}s_7 - r_{78}s_8 = z_7 - r_{77}s_7$. This implies that in order to find the first child in the 7th level, the normalized input $z_7/r_{77}$ is applied to the input of the Mapper/Limiter block, whose output is the first child. The architecture of the Level II block is shown in Fig. 7. Once the first child is determined, it is multiplied by $r_{77}$ using the MU block. The normalized input value is also multiplied by $r_{77}$, after which the Euclidean distance between the first child and the received vector (i.e., $|z_7 - r_{77}s_7|^2$) is calculated and the result is added to the PED values of the 8th level to derive the eight updated PEDs of the 7th level. The fine-grained pipelining technique has also been employed in this block to break it into four stages in order to limit the length of the critical path.

    3) Sorter Block: The input to the Sorter block is the set of eight PED values of the 7th level FCs, and the main task of this block is to generate the sorted list of these PED values. The architecture of the Sorter is shown in Fig. 8. The block has eight inputs, and the outputs are stored in eight registers shown by grey flip-flops labeled by the letter N. The Ctrl signal is used to load the data in 1 clock cycle. Using this architecture, it takes four clock cycles to sort all the eight PED values. This architecture can be used as a general sorter, which sorts $n$ numbers in $n/2$ clock cycles because it implements two tasks (max/min and the data exchange) in one clock cycle through the introduction of intermediate registers. One such set of consecutive minimizations is highlighted in Fig. 8, which is also the critical path of the Sorter block. Note that the factor $N$ on the FFs, shown in Fig. 8, represents a register bank of length $N$ bits, used to store the child list (path history) as well as the updated PED values.¹²
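The halving of the cycle count can be illustrated with a behavioural model. Fig. 8 is not reproduced in this text, so the sketch below is an analogue rather than the actual circuit: it assumes an odd-even transposition network in which each modelled clock cycle performs two compare-exchange phases back to back. The function name is hypothetical.

```python
def sort_in_half_cycles(values):
    """Sort n values in ceil(n/2) modelled clock cycles.

    Each cycle performs two compare-exchange phases, which is the assumed
    analogue of doing max/min and the data exchange in one clock cycle.
    """
    regs = list(values)
    n = len(regs)
    cycles = 0
    for _ in range((n + 1) // 2):
        for j in range(0, n - 1, 2):          # phase 1: pairs (0,1), (2,3), ...
            if regs[j] > regs[j + 1]:
                regs[j], regs[j + 1] = regs[j + 1], regs[j]
        for j in range(1, n - 1, 2):          # phase 2: pairs (1,2), (3,4), ...
            if regs[j] > regs[j + 1]:
                regs[j], regs[j + 1] = regs[j + 1], regs[j]
        cycles += 1
    return regs, cycles


print(sort_in_half_cycles([8, 7, 6, 5, 4, 3, 2, 1]))
```

An odd-even transposition network needs at most n phases to sort n values, so packing two phases into one cycle gives the four-cycle figure quoted for eight PED values.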

    4) PE I Block: PE I is a general block used for all the levels from level 7 to level 2. It receives the sorted list of the first children of each level and generates the $K$ best candidates of that level. For instance, the output of the PE I in level 7, called NC-L7 in Fig. 3, is the $K = 10$ consecutive best candidates with the lowest PED values in level 7, generated one-by-one in series at the output. The architecture of PE I is shown in Fig. 9. It consists of a sorter and a block called NC-Block on the feedback path. In fact, PE I receives the sorted list of the PEDs from the preceding stage. It finds the best one with the lowest PED and announces it as the next K-best candidate at the output, then calculates the next best sibling of the announced child through the NC-Block and feeds it back to the sorter to find the correct location of the new sibling in the already sorted list in PE I. The following points clarify the details of this architecture:

    • The main task of the sorter in this block is to receive a sorted list and to find the correct position of a new entry in the sorted list, while announcing the entry with the lowest PED every clock cycle.

    • Before the sorted PED values of the preceding stage are loaded into the PE I block, there is a reset signal, Rst in Fig. 9, that initializes all the register banks (except the one attached to the output) to the maximum possible number. This is necessary to avoid any interference from the previous values stored in them and makes them ready to process the new list. The Rst signal also initializes the control signal, Ctrl in Fig. 9, to zero, which is used to load the sorted list from the preceding stage to the PE I block. Note that the data in the sorted list is loaded one pair at a time, with the value of Ctrl selecting which pair is loaded. The reason is to guarantee the proper functionality of the architecture when PE I and PE II are operating together. A snapshot of the timing relationship between the Clk, Rst, and Ctrl signals is also shown by an example in Fig. 9.

    • The critical path of the PE I block is highlighted in Fig. 9. It contains a MUX, a comparator and the NC-Block. The main task of the NC-Block is to determine the next best sibling of an already announced best child. It also finds the PED value of this sibling and sends the information to the sorter in the PE I block. Since the NC-Block is on the feedback portion of the architecture, pipelining cannot help to increase the throughput of the architecture. Since this path is the critical path of the whole MIMO architecture, an efficient architecture needs to be proposed for the NC-Block to ensure an overall high-throughput architecture.
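The announce-and-insert behaviour of PE I can be modelled in a few lines. This is a behavioural sketch only, assuming a sorted Python list in place of the register banks and a caller-supplied function `next_child_ped` in place of the NC-Block; both names, and the tuple layout, are hypothetical.

```python
import bisect


def pe1_block(sorted_fcs, next_child_ped, K):
    """Emit the K lowest-PED candidates one per modelled clock cycle.

    sorted_fcs:     ascending list of (PED, parent id, child rank) tuples.
    next_child_ped: function (parent id, rank) -> PED of that child of the
                    parent, or None when the parent has no more children.
    """
    regs = list(sorted_fcs)
    announced = []
    for _ in range(K):
        if not regs:
            break
        ped, parent, rank = regs.pop(0)       # head of the list is the minimum
        announced.append((ped, parent, rank))
        sibling = next_child_ped(parent, rank + 1)
        if sibling is not None:
            # Insert the new sibling at its correct position in the list.
            bisect.insort(regs, (sibling, parent, rank + 1))
    return announced


# Toy PED tables: child PEDs of each parent, already in increasing order.
child_peds = {0: [1.0, 4.0, 9.0], 1: [2.0, 3.0, 10.0]}


def lookup(parent, rank):
    peds = child_peds[parent]
    return peds[rank - 1] if rank <= len(peds) else None


print(pe1_block([(1.0, 0, 1), (2.0, 1, 1)], lookup, 4))
```

Because the list stays sorted, announcing the next candidate only needs the head of the list and the new sibling, which mirrors the single comparison per clock cycle described for the hardware.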

    5) NC-Block: The details of the NC-Block architecture are shown in Fig. 10(a). The NC-Block in the $i$-th level needs to calculate the number of jumps (see Table II) and the direction of the next move, SignBit, and finally to calculate the PED value of the new sibling.¹³ These 3 tasks are implemented by

    ¹² The path list grows from one level to another and therefore so does the value of N, i.e., the value of N is different for each stage.

    ¹³ The signal SB in Fig. 10(a) represents the sign bit of the result of the adder.


    Fig. 8. Architecture for the Sorter block with the critical path highlighted.

    Fig. 9. Architecture for the PE I block with the critical path highlighted.

    the architecture shown in Fig. 10(a), where SignBit determines the direction of the SE enumeration for the next child and Uout (Lout) determines whether the SE enumeration has reached the upper (lower) boundary of the $\Omega$ set. In fact, according to (5) and (6), the PED value of the new sibling can be determined as follows:

    (10)

    where refers to the quantity defined in (7) while was

    omitted for brevity of discussion and .

    As mentioned, any effort to simplify this block and/or reduce its critical path has a direct and significant effect on the total achievable data rate. To optimize this block, the following two techniques were utilized in our VLSI architecture:

    1) Avoid multiplication: Since the quantity defined in (7) depends only on the symbols selected so far and is independent of the current sibling, the values derived from it can be calculated using the FC-Block in the preceding block14 (Section IV-B-7) and forwarded to the NC-Block as an input [see Fig. 10(a)]. This is the preferred approach because the multiplication required for this calculation is rescheduled: it is removed from the critical path and shifted to a block that can be pipelined. Moreover, the second multiplication is realized using the MU block.

    14 For PE I of NC-L7, the preceding block is Level II, while for all other PE I blocks, this block is the PE II block in the preceding stage. For instance, for PE I in NC-L3, this is done in PE II in FC-L3.

    2) Broken critical path: As can be seen from the NC-Block architecture [Fig. 10(a)], the critical path has 3 adders (one 4-bit and two 16-bit adders), as well as the MU block. The critical path associated with this architecture is 4.8 ns in 0.13 µm CMOS technology using a commercial standard cell library. The first part of the critical path [specified by the 1st section in Fig. 10(a)] calculates the next sibling, which can be transferred to the FC-Block in the preceding block. This means that the FC-Block calculates both the first and second best child of each parent and sends them to the NC-Block. The NC-Block calculates the PED value of the second best child while determining the third best child, and so on. This implies that the NC-Block always calculates one child ahead. Using this approach in our ASIC implementation yields a critical path of length 3.65 ns, and thus a higher overall throughput. The use of this scheduling technique effectively breaks the critical path of the NC-Block down into two smaller parts [1st and 2nd section in Fig. 10(a)]. This is shown in Fig. 10(b) with the improved critical path. The first section of the NC-Block on the right hand side of Fig. 10(b) is denoted by NCSub, which is the block that will be added to the preceding FC-Block in order to calculate the second best child. The second section consists of two adders, whose complexity is independent of the constellation size and the value.
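As an illustration of the sibling order that the NC-Block produces, the following is a minimal behavioral sketch. It assumes a real-valued axis with the 8 odd-integer levels from -7 to 7 (one axis of 64-QAM); the function name `se_enumerate` and its loop structure are hypothetical and are not taken from the paper. The hardware described above works incrementally from the jump count, SignBit and the Uout/Lout boundary flags, whereas this sketch simply reproduces the resulting zig-zag order in software.

```python
def se_enumerate(center, levels=(-7, -5, -3, -1, 1, 3, 5, 7)):
    """Yield the levels in zig-zag (Schnorr-Euchner style) order around `center`.

    Assumes `levels` is sorted with a uniform spacing of 2.
    """
    lo, hi = levels[0], levels[-1]
    # First child: the level closest to the unconstrained value.
    first = min(levels, key=lambda s: abs(s - center))
    yield first
    # Direction of the first move (plays the role of SignBit in this sketch).
    sign = 1 if center >= first else -1
    emitted, offset = 1, 1
    while emitted < len(levels):
        # Alternate around the first child; skip candidates that fall
        # outside the set (the role of the Uout/Lout flags in this sketch).
        for direction in (sign, -sign):
            candidate = first + 2 * offset * direction
            if lo <= candidate <= hi:
                yield candidate
                emitted += 1
        offset += 1


if __name__ == "__main__":
    print(list(se_enumerate(0.3)))
    print(list(se_enumerate(6.2)))
```

When the first child sits next to a boundary, as for a center of 6.2, the enumeration degenerates into a one-sided walk away from that boundary.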

    6) PE II Block: The output of PE I is the serial list of K-best candidates of the current level, generated one-by-one at the output. As each of the K-best candidates is generated, it is sent to the PE II block to calculate the first children of the next


    Fig. 10. Architecture for the NC-Block inside the PE I block with the critical path highlighted. (a) Original. (b) Improved.

    Fig. 11. Architecture for the PE II block with the critical path highlighted.

    level and sort them as they arrive. The architecture of the PE II block, with its input port and output ports, is shown in Fig. 11. At the beginning, the first child of the K-best candidate of the previous stage and its updated PED value are calculated by the FC-Block, and then, using the sequential sorter, the calculated PED values are sorted as they arrive. Note that this process is performed on a cycle basis since the PE II block is connected to the output of the PE I block in a pipelined fashion. In the proposed architecture for PE II, the sorted PEDs are stored in the register banks shown in Fig. 11. At every clock cycle, 2 register banks are updated at the same time, because the registers on the upper part of the sorter are located in every other stage. The sorter shifts the larger values to the right and the smaller values to the left.

    Once the last element (10th element in 64-QAM) enters the sorter, it updates the first two register banks, thus the first two are guaranteed to have the 2 smallest PED values. Therefore, at the next clock cycle, they can be transferred to the following PE I block. After the second clock cycle, the next 2 register banks are updated. Therefore, the PED values are transferred to the next level on a pair-by-pair basis. This is shown in Fig. 3 with grey lines between the PE II block and the PE I block; the numbers on them represent the number of clock cycles, after the arrival of the last K-best candidate at PE II, in which they are transferred.

    Note also that once the last element comes in and the first two register banks are sent to the next stage, the internal min/max functions should be initialized to the highest positive number to avoid a comparison between the first element of the next iteration and the last element of the current iteration (done using a signal shown in Fig. 11). This makes the core utilization 100%, as PE I and PE II are fully pipelined with zero latency with respect to one another.15
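The behavior of the sequential sorter can be sketched in software as follows. This is a behavioral model only, under the assumption that one PED arrives per clock cycle and that the sorted list is read out two entries at a time; the names `sequential_sorter` and `pairwise_readout` are hypothetical, and the min/max cell structure, the every-other-stage register placement and the reset to the highest positive number are not modeled.

```python
import bisect


def sequential_sorter(ped_stream, k):
    """Keep the k smallest PEDs in ascending order as values arrive one by one."""
    regs = []
    for ped in ped_stream:          # one arrival per clock cycle
        bisect.insort(regs, ped)    # larger entries move right, smaller stay left
        del regs[k:]                # the register bank holds at most k entries
    return regs


def pairwise_readout(sorted_regs):
    """Hand the sorted entries to the next stage two at a time."""
    return [sorted_regs[i:i + 2] for i in range(0, len(sorted_regs), 2)]


if __name__ == "__main__":
    peds = [5.0, 3.0, 9.0, 1.0, 7.0, 2.0, 8.0, 6.0, 4.0, 0.5]
    regs = sequential_sorter(peds, k=10)
    print(pairwise_readout(regs))
```

The pairwise readout mirrors the pair-by-pair transfer to the following PE I block; in the chip this transfer overlaps with the arrival of the next list, which a sequential software model does not capture.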

    7) FC-Block: The main task of the FC-Block is to calculate the quantity in (7), the first child of the current parent based on that quantity, and its PED. It also determines the second best child and its corresponding PED value, as well as the value required by the following NC-Block, as mentioned in Section IV-B-5. The proposed architecture for the FC-Block is shown in Fig. 12. In order to increase the total throughput in

    15 Note the difference between the functionality of the PE I and PE II blocks. The PE I block gets a sorted list as an input and finds the location of the newly generated nodes in this sorted list. It also announces the smallest node at the same time. The PE II block, however, is a sorter that takes as input the nodes that come into the sorter one by one and announces the sorted list two entries at a time.


    Fig. 12. Architecture for the FC-Block inside the PE II block with the critical path highlighted.

    the architecture, pipelining has been used by the introduction of FFs on all the forward paths. The proposed architecture for the FC-Block consists of 5 pipeline levels. Each FC-Block is used inside a PE II block. In the first pipeline level (see Fig. 12), there are 6 MU blocks. However, depending on the PE II block in which it is used, only part of these MU blocks are used. For instance, for the PE II blocks of stages FC-L6 and FC-L5, only the first 2 MUs are implemented, whereas for the PE II blocks of stages FC-L2 and FC-L1, all 6 MUs are implemented.

    The first 2 levels of the architecture calculate the quantity in (7), and the third level computes the value that is used to calculate the first child using the Mapper and Limiter blocks. The number of moves and the direction of the move for the SE enumeration to determine the next best child required by the NC-Block are also determined in level 4 through the SignBit, Uout, and Lout signals. Finally, the blocks in level 5 calculate the PED value of the announced first child. Level 5 also calculates the second best child through the introduction of the NCSub block, which was described in Section IV-B-5. The updated values of SignBit, Uout, Lout, the second best child and its scaled value are sent to the output. All of the above blocks are interconnected in a pipelined fashion, and at every clock cycle a data exchange occurs between the adjacent blocks. This means all the data are calculated and transferred sequentially, operand-by-operand, between the blocks. A proper scheduling

    TABLE III
    FIXED-POINT WORD-LENGTH (bits) OF PARAMETERS

    [n : m]: an n-bit number with m bits for the fractional part.

    TABLE IV
    COMPARISON BETWEEN DIFFERENT K-BEST IMPLEMENTATIONS

    The number in parentheses represents the latency after pipelining.

    scheme at the input of the chip guarantees the delivery of the correct input values to the blocks.
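The Mapper and Limiter step of the FC-Block can be illustrated with a short sketch. It assumes a real-valued axis whose levels are the odd integers from -7 to 7; the function name `map_and_limit` and the returned tuple layout are hypothetical, and the two boolean flags are only loosely modeled on the Uout and Lout signals mentioned above.

```python
import math


def map_and_limit(z, max_level=7):
    """Return (first_child, direction, at_upper, at_lower) for the value z."""
    # Mapper: round z to the nearest odd integer.
    nearest_odd = 2 * math.floor(z / 2) + 1
    # Limiter: clip the result to the boundaries of the level set.
    child = max(-max_level, min(max_level, nearest_odd))
    # Direction of the next move in the enumeration (+1 up, -1 down).
    direction = 1 if z >= child else -1
    return child, direction, child == max_level, child == -max_level


if __name__ == "__main__":
    print(map_and_limit(0.3))
    print(map_and_limit(9.5))
    print(map_and_limit(-8.4))
```

When the first child lies on a boundary and the direction points outward, as for 9.5 or -8.4, the next sibling can only be found by moving inward, which is the case the boundary flags are meant to signal.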

    C. Latency and Bit-True Simulation

    The fine-grained pipelining used inside the blocks improves the throughput at the cost of a larger latency. Starting from the first block, the latency of Level I is 2 cycles, Level II has a 3-cycle latency, the Sorter block's latency is 4 cycles, and PE II's latency is 10 cycles plus an additional 6 cycles for the pipelined FC-Block; the PE I block also contributes to the latency. The total latency of the architecture follows from accumulating these per-block latencies along the pipeline of Fig. 3.

    Table III shows the number of bits associated with different variables in the algorithm for the fixed-point simulation of a 4 × 4, 64-QAM system, using the [n : m] notation defined in the table note. The fixed-point simulations are performed using the 2's complement number representation. Note that the word lengths in Table III have been derived based on extensive simulation results to minimize the BER loss relative to the floating-point result (i.e., less than 0.5 dB loss).
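To make the [n : m] notation concrete, the following sketch quantizes a real value to an n-bit 2's complement number with m fractional bits. The rounding mode (round half up) and the saturating overflow behavior are assumptions made for this illustration, since the text does not state them; the function name `quantize` and the [8 : 4] format in the example are likewise hypothetical and are not taken from Table III.

```python
import math


def quantize(x, n, m):
    """Quantize x to an n-bit two's complement value with m fractional bits."""
    scale = 1 << m
    # Round to the nearest representable step (round half up, assumed).
    code = math.floor(x * scale + 0.5)
    # Saturate to the n-bit two's complement range (assumed overflow handling).
    code = max(-(1 << (n - 1)), min((1 << (n - 1)) - 1, code))
    return code / scale


if __name__ == "__main__":
    print(quantize(1.30, 8, 4))    # nearest multiple of 1/16
    print(quantize(100.0, 8, 4))   # saturates at the positive limit
    print(quantize(-100.0, 8, 4))  # saturates at the negative limit
```

Running a detector model with and without such a quantizer on every intermediate value is one common way to arrive at word lengths like those in Table III.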

    D. Complexity Analysis

    Table IV shows the complexity comparison between different schemes. For the sake of comparison, the number of visited children (expand), the required number of clock cycles to do the sorting (sort), and the total latency are considered. The values listed in the expand row refer to the number of PED values that must be calculated; a larger number directly translates to more area and power. A key feature of our architecture is that only the children expanded on demand need to be calculated at each level, whereas in other approaches, e.g. [7], the PEDs of all the children of all parent nodes must be calculated. The last row of the table indicates whether the length of the critical path of the architecture grows with the constellation order. The number in parentheses refers to the latency of the pipelined architecture.
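The saving from on-demand expansion can be illustrated with a small software model. It is a sketch under simplifying assumptions: a real-valued axis with 8 levels, a child PED equal to the parent PED plus the squared distance to a per-parent center, and a binary heap standing in for the distributed sorters of the chip. The name `k_best_on_demand` and the data layout are hypothetical.

```python
import heapq

LEVELS = (-7, -5, -3, -1, 1, 3, 5, 7)


def k_best_on_demand(parents, k):
    """parents: list of (parent_ped, center). Returns (best, evaluations).

    best is a list of (ped, parent_index, symbol) for the k smallest children;
    evaluations counts how many child PEDs were actually computed.
    """
    evaluations = 0
    heap, siblings = [], []
    for idx, (parent_ped, center) in enumerate(parents):
        # Stand-in for the SE enumeration: children of one parent in
        # order of increasing PED.
        order = iter(sorted(LEVELS, key=lambda s, c=center: abs(s - c)))
        siblings.append(order)
        first = next(order)
        heapq.heappush(heap, (parent_ped + (center - first) ** 2, idx, first))
        evaluations += 1
    best = []
    while heap:
        ped, idx, sym = heapq.heappop(heap)
        best.append((ped, idx, sym))
        if len(best) == k:
            break
        # Expand only the next sibling of the parent whose child just won.
        nxt = next(siblings[idx], None)
        if nxt is not None:
            parent_ped, center = parents[idx]
            heapq.heappush(heap, (parent_ped + (center - nxt) ** 2, idx, nxt))
            evaluations += 1
    return best, evaluations


if __name__ == "__main__":
    parents = [(0.0, 0.3), (0.5, -2.9), (1.0, 6.2)]
    best, count = k_best_on_demand(parents, k=3)
    print(best, count, "of", len(parents) * len(LEVELS), "possible children")
```

In this model, selecting the 3 best children of 3 parents requires 5 PED evaluations, whereas an exhaustive expansion would compute all 24.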


    Fig. 13. Micrograph of the implemented ASIC.

    Fig. 14. Measurement plots for maximum clock frequency and power dissipation vs. supply voltage (V) at 25 °C.

    with the expected values from the bit-true simulations both from MATLAB and NC-Verilog.

    Fig. 14 shows a Shmoo plot depicting the maximum operating frequency and the total power dissipation of the design versus the supply voltage. All fabricated chips were tested, and the average and the max/min values of the achieved frequency are shown in Fig. 14. Operating at a clock rate of 282 MHz with an overall latency of 0.6 µs results in a measured sustained throughput of 675 Mbps, dissipating 135 mW at 1.3 V supply and 25 °C. (This translates into a sustained throughput of about 170 Mbps per transmit antenna in a 4 × 4 MIMO receiver.) The temperature was forced to 25 °C using the Temptronic TP04300 thermal forcing unit (TFU). Using the TFU, test results at 80 °C yield a clock rate of 250 MHz while dissipating 104 mW at 1.2 V supply, producing a sustained data rate of 600 Mbps.
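The derived figures for the two measured operating points can be checked with a few lines of arithmetic. The helper name `summarize_operating_point` is hypothetical; the inputs are the measured values quoted above, and the cycles-per-vector figure is an inference from those numbers (4 antennas times 6 bits per 64-QAM symbol) rather than a value stated in the text.

```python
def summarize_operating_point(throughput_mbps, power_mw, clock_mhz,
                              antennas=4, bits_per_symbol=6):
    """Return (Mbps per antenna, nJ per bit, clock cycles per symbol vector)."""
    per_antenna = throughput_mbps / antennas
    # mW divided by Mbps gives nJ per bit (1e-3 W / 1e6 bit/s = 1e-9 J/bit).
    energy_nj_per_bit = power_mw / throughput_mbps
    bits_per_cycle = throughput_mbps / clock_mhz
    cycles_per_vector = antennas * bits_per_symbol / bits_per_cycle
    return per_antenna, energy_nj_per_bit, cycles_per_vector


if __name__ == "__main__":
    print(summarize_operating_point(675, 135, 282))  # 1.3 V, 25 degrees C
    print(summarize_operating_point(600, 104, 250))  # 1.2 V, 80 degrees C
```

At 1.3 V this gives 168.75 Mbps per antenna (the "170 Mbps" quoted above) and 0.2 nJ/bit; both operating points correspond to roughly 10 clock cycles per 24-bit symbol vector.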

    Fig. 15 shows a comparison between the reported ASIC implementations of 4 × 4 64-QAM as well as 16-QAM MIMO detectors. Previous publications with measured or estimated power dissipations are shown in the figure. The values of [6] and

    Fig. 15. Measured throughput vs. energy/bit. Results of the designs in [6] and [19] have been scaled to a 0.13 µm equivalent CMOS process.

    Fig. 16. Measured BER at a sustained throughput of 675 Mbps (282 MHz clock frequency) dissipating 135 mW @ 1.3 V supply and 25 °C.

    [19] have been scaled to a 0.13 µm equivalent CMOS process. The comparison graph confirms that this design achieves 2.6× better throughput per area compared to the best reported design and at the same time consumes 3.0× less energy per bit compared to the previous best design. This advantage is due to the efficient expansion and sorting operations used in this design.

    The measured BER results are shown in Fig. 16, which represents the result for a single-carrier 4 × 4 64-QAM MIMO system. Test vectors used to test the chip and generate the BER curve represent a total of 100,000 packets, where each packet consists of 96 bits20 (9.6 Mbits in total). Test vectors are created using: 1) pseudo-random data, 2) a complex-valued random Gaussian channel matrix with statistically independent elements, updated every four channel uses, and 3) additive white Gaussian (circularly symmetric) complex random noise. Test results agree with the expected golden vector set, confirming correct operation of the test chip. The partial-scan methodology was employed to provide a sufficient level of observability and controllability

    20 This is because 96 = 4 (N) × 6 (number of bits per constellation symbol) × 4 (channel updates every four channel uses).


    TABLE VI
    CHARACTERISTICS SUMMARY OF DETECTOR AND MEASURED RESULTS

    during the testing procedure. Table VI summarizes the design and performance characteristics.
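The test-vector construction described for the BER measurement can be sketched as follows. This is an illustrative model only: symbols are drawn directly from the 64-QAM grid instead of being mapped from bits, the noise variance is an arbitrary placeholder, and the names `complex_gaussian` and `make_packet` are hypothetical. Only the Python standard library is used.

```python
import random

QAM64_AXIS = (-7, -5, -3, -1, 1, 3, 5, 7)


def complex_gaussian(rng, variance=1.0):
    """Circularly symmetric complex Gaussian sample with the given variance."""
    sigma = (variance / 2) ** 0.5
    return complex(rng.gauss(0, sigma), rng.gauss(0, sigma))


def make_packet(rng, n=4, uses_per_channel=4, noise_var=0.1):
    """One packet: a channel H held for 4 channel uses of a 4 x 4 system."""
    h = [[complex_gaussian(rng) for _ in range(n)] for _ in range(n)]
    packet = []
    for _ in range(uses_per_channel):
        # Pseudo-random 64-QAM symbols, one per transmit antenna.
        s = [complex(rng.choice(QAM64_AXIS), rng.choice(QAM64_AXIS))
             for _ in range(n)]
        # Received vector y = H s + noise.
        y = [sum(h[i][j] * s[j] for j in range(n))
             + complex_gaussian(rng, noise_var) for i in range(n)]
        packet.append((s, y))
    return h, packet


if __name__ == "__main__":
    rng = random.Random(0)
    h, packet = make_packet(rng)
    bits_per_packet = 4 * 6 * len(packet)
    print(bits_per_packet, 100_000 * bits_per_packet)
```

Each packet carries 4 × 6 × 4 = 96 bits, so 100,000 packets amount to the 9.6 Mbits quoted above.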

    VII. CONCLUSIONS

    A high-throughput silicon implementation of the K-best algorithm suitable for high-order constellation schemes was presented. It has the smallest number of visited nodes as well as the highest achieved throughput reported to date. The key innovation is the introduction of an on-demand expansion and distributed sorting scheme operating in a pipelined fashion. In 0.13 µm CMOS, it achieves a throughput of 675 Mbps at a clock frequency of 282 MHz, satisfying the throughput requirements of next-generation WiMAX and LTE systems.

    REFERENCES

    [1] M. Shabany and P. G. Gulak, "A 0.13 µm CMOS, 655 Mb/s, 64-QAM, K-best 4 × 4 MIMO detector," in Proc. IEEE Int. Solid State Circuits Conf., Feb. 2009, pp. 256-257.

    [2] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, "Closest point search in lattices," IEEE Trans. Inf. Theory, vol. 48, no. 8, pp. 2201-2214, Aug. 2002.

    [3] U. Fincke and M. Pohst, "Improved methods for calculating vectors of short length in a lattice, including a complexity analysis," Math. Comput., vol. 44, pp. 463-471, Apr. 1985.

    [4] J. Jaldén and B. Ottersten, "On the complexity of sphere decoding in digital communications," IEEE Trans. Signal Proc., vol. 53, no. 4, pp. 1474-1484, Apr. 2005.

    [5] K. W. Wong, C. Y. Tsui, R. S. K. Cheng, and W. H. Mow, "A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels," in Proc. IEEE Int. Symp. Circuits Syst., May 2002, vol. 3, pp. 273-276.

    [6] Z. Guo and P. Nilsson, "Algorithm and implementation of the K-best sphere decoding for MIMO detection," IEEE J. Sel. Areas Commun., vol. 24, no. 3, pp. 491-503, Mar. 2006.

    [7] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, "K-best MIMO detection VLSI architectures achieving up to 424 Mbps," in Proc. IEEE Int. Symp. Circuits Syst., 2006, pp. 1151-1154.

    [8] L. Davis, "Scaled and decoupled Cholesky and QR decompositions with application to spherical MIMO detection," in Proc. Wireless Commun. Netw. Conf., Mar. 2003, vol. 1, pp. 326-331.

    [9] B. M. Hochwald and S. ten Brink, "Achieving near-capacity on a multiple-antenna channel," IEEE Trans. Commun., vol. 51, no. 3, pp. 389-399, Mar. 2003.

    [10] A. Burg et al., "VLSI implementation of MIMO detection using the sphere decoding algorithm," IEEE J. Solid-State Circuits, vol. 40, no. 7, pp. 1566-1577, Jul. 2005.

    [11] H.-L. Lin, R. C. Chang, and H. Chan, "A high-speed SDM-MIMO decoder using efficient candidate searching for wireless communication," IEEE Trans. Circuits Syst. II, vol. 55, no. 3, pp. 289-293, Mar. 2008.

    [12] S. Chen, T. Zhang, and Y. Xin, "Relaxed K-best MIMO signal detector design and VLSI implementation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 3, pp. 328-337, Mar. 2007.

    [13] P. A. Bengough and S. J. Simmons, "Sorting-based VLSI architecture for the M-algorithm and T-algorithm trellis decoders," IEEE Trans. Commun., vol. 43, no. 3, pp. 514-522, Mar. 1995.

    [14] C. P. Schnorr and M. Euchner, "Lattice basis reduction: Improved practical algorithms and solving subset sum problems," Math. Program., vol. 66, pp. 181-191, 1994.

    [15] M. Shabany, K. Su, and P. G. Gulak, "A pipelined high-throughput implementation of near-optimal complex K-best lattice decoders," in Proc. IEEE Conf. Acoust., Speech, Signal Process., 2008, pp. 3173-3176.

    [16] S. Chen and T. Zhang, "Low power soft-output signal detector design for wireless MIMO communication systems," in Proc. Int. Symp. Low Power Electron. Design, 2007, pp. 232-237.

    [17] WiMAX Forum, "WiMAX forum mobile system profile release 1.0 approved specification (revision 1.4.0: 2007-05-02)," in WiMAX Forum, May 2005.

    [18] Q. Li and Z. Wang, "An improved K-best sphere decoding architecture for MIMO systems," in 40th Asilomar Conf. Signals, Syst. Comput., Nov. 2006, pp. 2190-2194.

    [19] C.-Y. Yang and D. Markovic, "A flexible DSP architecture for MIMO sphere decoding," IEEE J. Solid-State Circuits, vol. 56, no. 10, pp. 2301-2314, Oct. 2009.

    Mahdi Shabany (S'04-A'08) received the B.Sc. degree in electrical engineering from Sharif University of Technology, Tehran, Iran, in 2002, and the M.A.Sc. and Ph.D. degrees, both in electrical engineering, from the University of Toronto, Toronto, Canada, in 2004 and 2008, respectively.

    From 2007 to 2008, he was with Redline Communications Co., Toronto, Canada, where he developed and patented designs for WiMAX systems. He also served as a post-doctoral fellow at the University of Toronto in 2009. Currently he is an assistant professor in the Electrical Engineering Department at Sharif University of Technology, Tehran, Iran. His main research interests include digital electronics and VLSI architecture/algorithm design for broadband communication systems.

    P. Glenn Gulak (S'82-M'83-SM'96) received the Ph.D. degree from the University of Manitoba, Winnipeg, MB, Canada.

    While at the University of Manitoba, he held a Natural Sciences and Engineering Research Council of Canada Postgraduate Scholarship. He is a Professor with the Department of Electrical and Computer Engineering, University of Toronto, ON, Canada, as well as a Senior Member of the IEEE and a registered Professional Engineer in the Province of Ontario. His present research interests are focused on algorithms, circuits, and system-on-chip architectures for digital communication systems and for biological lab-on-chip microsystems. He has authored or co-authored more than 100 publications in refereed journals and refereed conference proceedings. In addition, he has received numerous teaching awards for undergraduate courses taught in both the Department of Computer Science and the Department of Electrical and Computer Engineering at the University of Toronto. He held the L. Lau Chair in Electrical and Computer Engineering for the five-year period from 1999 to 2004. He currently holds the Canada Research Chair in Signal Processing Microsystems and the Edward S. Rogers Sr. Chair in Engineering. From January 1985 to January 1988, he was a Research Associate in the Information Systems Laboratory and the Computer Systems Laboratory at Stanford University, Stanford, CA. From March 2001 to March 2003, he was the Chief Technical Officer and Senior Vice President of LSI Engineering, a fabless semiconductor startup headquartered in Irvine, CA, with $70M USD of financing, that focused on wireline and wireless communication ICs.

    Dr. Gulak served on the ISSCC Signal Processing Technical Subcommittee from 1990 to 1999, was ISSCC Technical Vice-Chair in 2000, and served as the Technical Program Chair for ISSCC 2001. He currently serves on the Technology Directions Subcommittee for ISSCC. He was the recipient of the IEEE Millennium Medal in 2001.
