nvd-27. a practical noc design for parallel des computation - ieee 2013



  • A Practical NoC Design for Parallel DES Computation

    R. Yuan, S.-J. Ruan and J. Götze

    National Taiwan University of Science and Technology, Low-Power Systems Lab, Taipei, Taiwan
    E-Mails: {D9902102, sjruan}@mail.ntust.edu.tw

    TU Dortmund, Information Processing Lab, Dortmund, Germany
    E-Mail: juergen.goetze@tu-dortmund.de

    Abstract: The Network-on-Chip (NoC) is considered a new SoC paradigm for supporting a large number of processing cores in next-generation designs. Combining a NoC with homogeneous processors to construct a Multi-Core NoC (MCNoC) is one way to achieve high computational throughput for a specific purpose such as cryptography. Many studies use cryptography standards to demonstrate performance but rarely discuss a NoC suited to such a standard. The goal of this paper is to present a practical methodology, without complicated virtual channel or pipeline technologies, that provides high-throughput Data Encryption Standard (DES) computation on FPGA. The results show that a mesh-based NoC whose packet and Processing Element (PE) designs follow the DES specification outperforms previous works. Moreover, the deterministic XY routing algorithm proves competitive in a high-throughput NoC, and West-First routing offers the best performance among the Turn-Model routings, which represent adaptive routing.

    I. INTRODUCTION

    The advantages of the Network-on-Chip (NoC) over traditional bus-based architectures have been reported in many studies. The NoC architecture offers both scalability and flexibility, so it can be organized to run homogeneous cores in parallel to improve performance for specific purposes [1]. Such an approach is a suitable way to realize a high-throughput computational system on FPGA.

    Data encryption/decryption is a computational algorithm often implemented in research for performance demonstration. The characteristics of a cryptographic algorithm affect the selection of the flit size for routing, the packet size in traffic communication and the architecture of the Processing Element (PE). Given the growing demand for data protection, a high-performance NoC specific to cryptography needs to be analyzed.

    Our work realizes a 5×5 2-D mesh with Virtual Cut-Through (VCT) switching, running 25 Data Encryption Standard (DES) computations in parallel. The goal of this paper is to evaluate the throughput of a heavily loaded NoC. The main contribution is the set of performance verification results for MCNoC architectures performing parallel DES computation. Our results indicate that the proposed work achieves a considerable speedup over previous works.

    This paper is organized as follows: Section II describes the related work on DES in other NoC systems. Section III introduces the proposed architecture. Section IV describes the configurations of the proposed MCNoC, including the packet format, routing algorithm, flow control and architecture of the PE. Section V describes the experimental methodology and shows the results. In the last section, brief statements conclude this paper.

    II. RELATED WORKS

    Some NoC proposals use soft-core processors, MicroBlaze or Networked Processor Array (NePA), as processing elements to implement DES computations [2], [3], [4]. These processors provide far more complicated functions than DES needs. Adding cores thus becomes costly because of the dramatic increase in complexity and traffic load, resulting in limited performance improvement.

    The work in [5] realizes one DES encryption that is used sporadically in the network for brute-force testing. Its performance is not fully exploited because the architecture was essentially not designed with high throughput in mind.

    This paper presents a practical MCNoC for parallel DES processing that meets high-throughput demands. Our proposed MCNoC has all boundary ports open to other resources for the sake of high throughput, and this sharing scheme has been applied in state-of-the-art commercial NoC chips such as the Tilera TILE64. Without complicated pipeline or virtual channel designs, the routing and flow control components can be kept simple, so the NoC is low-cost and consumes little power.

    III. ARCHITECTURE OF MCNOC FOR DES

    MCNoC is a specific NoC whose parallel computing power can be shared by multiple components connected to it. A typical architecture of a 5×5 NoC is shown in Fig. 1. Tiles numbered 11, 12, 13, 14, 15, 21, 25, 31, 35, 41, 45, 51, 52, 53, 54, and 55 are boundary tiles. Each boundary tile has either 1 side or 2 sides connected to external resources rather than switches. These external resources can be packet generators, receivers or both, and are represented as tiles drawn with dotted lines, called terminal tiles, numbered 01, 02, 03, 04, 05, 10, 16, 20, 26, 30, 36, 40, 50, 56, 61, 62, 63, 64, and 65. Terminal tiles are dummy tiles, so no PE is connected to them. N, E, W and S represent North, East, West and South respectively. The remaining tiles are normal tiles without any specific name. Every tile except a terminal tile is composed of one router and one PE.
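    The tile naming above can be reproduced with a short sketch. It assumes, as an interpretation of Fig. 1 rather than something the text states, that each two-digit label is a row digit followed by a column digit, both counted from 1. The function name `classify_tiles` is hypothetical.

```python
# Sketch only: labels are assumed to be "<row><column>", both 1-based.
MESH_SIZE = 5  # the paper's 5x5 mesh


def classify_tiles(n=MESH_SIZE):
    """Split the n x n mesh tiles into boundary tiles and inner (normal) tiles."""
    boundary, inner = [], []
    for row in range(1, n + 1):
        for col in range(1, n + 1):
            label = f"{row}{col}"
            # A tile is a boundary tile when it lies on any edge of the mesh.
            on_edge = row in (1, n) or col in (1, n)
            (boundary if on_edge else inner).append(label)
    return boundary, inner


boundary_tiles, inner_tiles = classify_tiles()
print(boundary_tiles)
print(len(boundary_tiles), len(inner_tiles))
```

    Under this assumption the sketch yields the 16 boundary tiles listed in the text and leaves 9 normal tiles.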

    978-1-4673-4436-4/13/$31.00 ©2013 IEEE

  • Fig. 1: A 5×5 mesh-based MCNoC

    A terminal tile injects packets into only one boundary tile. The boundary tile accepts the packet if its PE is available; otherwise it routes the packet to a neighbor tile according to its routing algorithm. The packet is routed until an available PE is found. When the DES operation is finished, the done bit of the packet is set to HIGH and the packet is routed back towards the terminal tile it came from.

    Unlike other works that use end-to-end traffic or only a few input nodes as packet injection points, our work takes input traffic from North, East, West and South and returns packets in all directions. To keep the NoC low-cost and scalable, it is constructed on the following principles:

    1) No virtual channel or pipeline is used.
    2) All PEs are designed for data processing only.
    3) Architectures of all tiles are identical.
    4) Unfinished packets should not leave the NoC.

    The third and fourth principles are the difficult ones. Unlike boundary tiles, tiles at the center or some other locations do not need criteria to block packets. Applying these controls adds two extra clock states to the routing decision in each direction of traffic, and 9 out of 25 tiles are affected in a 5×5 NoC. For the sake of the overall flexibility and scalability of the MCNoC, we consider this overhead acceptable.

    IV. NETWORK ON CHIP CONFIGURATIONS

    A. Packet Format

    A packet is transferred as 5 consecutive flits, each 32 bits long, as shown in Fig. 2. The first flit is the Header flit, which stores information for the routing decision and the PE computation. The source and terminal values represent the coordinates of the packet generator and receiver respectively. The destination number always stores the coordinate of one boundary tile. The two-bit done signal indicates whether the Data flit holds the final data of the DES operation. The iteration number tells which stage of the DES operation the packet has reached. Finally, the serial number is used as a tracking number, and bit 7 is reserved.

    Fig. 2: Packet flits

    The Key flit, composed of the second and third flits, stores the complete 64-bit initial key needed in the DES operation, e.g. for subkey generation according to the iteration number. The fourth and fifth flits are grouped as the Data flit, which stores the plaintext at the beginning, the intermediate cipher during DES processing, and the final ciphertext when DES encryption is done. The Header, Key and Data flits are referred to repeatedly in the following sections, where their relationship is clarified.
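    A minimal sketch of the 5-flit packet may help fix the sizes involved. The Header flit is kept as an opaque 32-bit word, because the exact bit positions of its fields belong to Fig. 2 and are not modeled here. The class name `Packet`, its methods, and the choice to send the high 32-bit word of the key and data first are assumptions made for illustration only.

```python
from dataclasses import dataclass

FLIT_BITS = 32
FLIT_MASK = (1 << FLIT_BITS) - 1


@dataclass
class Packet:
    header: int  # opaque 32-bit Header flit (field layout of Fig. 2 not modeled)
    key: int     # 64-bit initial DES key, carried by flits 2 and 3 (Key flit)
    data: int    # 64-bit plaintext, intermediate cipher or ciphertext (Data flit)

    def to_flits(self):
        """Return the five 32-bit flits; high-word-first order is an assumption."""
        return [
            self.header & FLIT_MASK,
            (self.key >> FLIT_BITS) & FLIT_MASK,
            self.key & FLIT_MASK,
            (self.data >> FLIT_BITS) & FLIT_MASK,
            self.data & FLIT_MASK,
        ]

    @classmethod
    def from_flits(cls, flits):
        """Rebuild a packet from exactly five 32-bit flits."""
        if len(flits) != 5:
            raise ValueError("a packet consists of exactly 5 flits")
        header, key_hi, key_lo, data_hi, data_lo = flits
        return cls(header, (key_hi << FLIT_BITS) | key_lo, (data_hi << FLIT_BITS) | data_lo)


pkt = Packet(header=0x0, key=0x133457799BBCDFF1, data=0x0123456789ABCDEF)
flits = pkt.to_flits()
print([f"{f:08X}" for f in flits])
```

    Each packet therefore occupies 5 × 32 = 160 bits on a link, of which 128 bits are key and data.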

    B. Routing Algorithm

    In this proposal four routing algorithms are tested: the deterministic XY routing and three adaptive Turn-Model routings, West-First (WF), North-Last (NL) and Negative-First (NF). All of these algorithms are deadlock-free and livelock-free, and each is implemented separately to evaluate its characteristics in a heavily loaded DES MCNoC.

    Fig. 3 shows the architecture of the router. When a packet is sent into a tile, the router first checks its done bit. If it is LOW, the packet is either sent to the PE or directed to another tile. If the done bit is HIGH, the packet is routed towards its desired boundary tile. Once it reaches the boundary tile, it is sent immediately to the desired terminal tile. Because the router checks the destination number, the terminal number and the done bit in the Header flit, an unfinished packet cannot pass through a boundary tile until it has finished all steps of the DES computation.
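    The decision described above can be sketched for the deterministic XY case. This is an illustrative model, not the router of Fig. 3: coordinates are assumed to be (row, column) pairs with rows growing to the South and columns to the East, and the function names as well as the "HOLD" outcome for an unfinished packet that has nowhere to go are hypothetical.

```python
def xy_next_hop(cur, dst):
    """Deterministic XY routing: correct the column (X) first, then the row (Y).

    cur and dst are (row, col) tuples. Rows are assumed to grow to the South
    and columns to the East; this orientation is an assumption.
    """
    cur_row, cur_col = cur
    dst_row, dst_col = dst
    if cur_col < dst_col:
        return "E"
    if cur_col > dst_col:
        return "W"
    if cur_row < dst_row:
        return "S"
    if cur_row > dst_row:
        return "N"
    return "LOCAL"


def router_decision(cur, dest, done, pe_available):
    """Return where the router sends a packet, following the done-bit check."""
    if not done:
        if pe_available:
            return "PE"  # unfinished packet and free PE: process it here
        hop = xy_next_hop(cur, dest)
        # An unfinished packet must not leave the NoC; "HOLD" is a
        # hypothetical placeholder for that blocked case.
        return "HOLD" if hop == "LOCAL" else hop
    hop = xy_next_hop(cur, dest)
    # A finished packet exits to its terminal tile once it reaches the
    # boundary tile named in the destination field.
    return "TERMINAL" if hop == "LOCAL" else hop


print(router_decision((2, 2), (5, 3), done=False, pe_available=True))
print(router_decision((2, 2), (5, 3), done=True, pe_available=False))
print(router_decision((5, 3), (5, 3), done=True, pe_available=False))
```

    The adaptive WF, NL and NF routings would replace `xy_next_hop` with a function that may choose among several permitted directions.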

    Fig. 3: Router Architecture

    C. Flow Control

    VCT delivers higher throughput as the load increases, because wormhole switching has the drawback of quickly saturating resources when packets are blocked [6]. Banerjee et al. [7] report that VCT gave lower latencies at higher acceptance rates and provided better performance than wormhole switching.

  • D. Architecture of PE

    According to the structure of DES, the reasonable numbers of iterations per PE are the divisors of 16, i.e. 1, 2, 4, 8 and 16 (with 16, one PE completes one DES operation). A small PE that performs only one iteration needs another 15 computations to complete one DES operation, causing more packet routing in the network. By contrast, a large PE containing all 16 iterations keeps a packet inside the PE longer, so the network traffic allows more data to fill up other PEs. Whether the fast reaction of a small PE improves the throughput of the overall network therefore becomes a factor to consider. This is tested in Section V.
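    The trade-off can be made concrete by counting how many PE visits one DES operation needs for each PE size. The sketch below only restates the arithmetic of the paragraph above; the function name `pe_visits` is hypothetical.

```python
DES_ROUNDS = 16  # DES consists of 16 iterations (rounds)


def pe_visits(iterations_per_pe):
    """Number of PE computations needed to finish one DES operation."""
    if iterations_per_pe <= 0 or DES_ROUNDS % iterations_per_pe != 0:
        raise ValueError("PE size must be a divisor of 16")
    return DES_ROUNDS // iterations_per_pe


for size in (1, 2, 4, 8, 16):
    visits = pe_visits(size)
    # "extra" is the number of additional computations after the first one.
    print(f"{size:2d}-iterative PE: {visits:2d} visits, {visits - 1:2d} extra")
```

    A 1-iterative PE thus needs 15 further computations after the first, each of which puts the packet back on the network, while a 16-iterative PE needs none.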

    V. EXPERIMENT METHODOLOGY AND RESULTS

    A. Experiment Setup

    The full circuit design was done in Xilinx ISE 11.4, targeting the XC5VLX220-1FF1760 device. All simulations were run in ModelSim-SE 6.2g under a high consecutive packet injection rate, with over 95% overlap between any two inputs, which guaranteed that the MCNoC was fully loaded and that the maximum throughput was measured without saturating it.

    B. Simulation Results

    1) Simulation Results of PE Size: Table I lists the performance and evaluation of 1-, 8- and 16-iterative PEs. The resulting slice utilization shows that the 16-iterative PE architecture fits the slice architecture better than the others.

    In the throughput experiments, the benefit of the short data processing period of a low-iterative PE does not compensate for the loss of throughput caused by congestion. The 1-iterative PE saturates the NoC quickly because the routing time is much longer than the data processing time. Consequently, more packets stay on the links rather than in PEs, resulting in congestion. When the insertion rate reaches 727Mbps, packet congestion occurs in the 8-iterative design, and only 15.73% of the packets return to their original terminal tiles.

    The processing times of one packet in Table II show that 1- and 8-iterative PEs process faster than the router, since they implement only part of the DES computation. A 16-iterative PE holds a packet longer, giving the router more chances to service another packet, which further helps reduce traffic in the network.

    TABLE II: Processing Time of One Packet

    PE Size          Processing Time
                     PE       Router
    1 iteration      35ns     80ns
    8 iterations     75ns     80ns
    16 iterations    115ns    90ns

    This section concludes that the DES MCNoC equipped with 16-iterative PEs offers the best performance and slice utilization. To achieve high throughput, the PE processing time has to be longer than the router processing time, which helps improve the utilization of both PE and router.
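    The criterion stated above (PE processing time longer than router processing time) can be checked directly against the Table II figures. The sketch below is only a restatement of that comparison; the dictionary and variable names are hypothetical.

```python
# Table II values: PE size (iterations) -> (PE time in ns, router time in ns)
TABLE_II_NS = {
    1: (35, 80),
    8: (75, 80),
    16: (115, 90),
}

# Keep the PE sizes whose processing time exceeds the router's.
satisfying = [size for size, (pe_ns, router_ns) in TABLE_II_NS.items() if pe_ns > router_ns]

for size, (pe_ns, router_ns) in TABLE_II_NS.items():
    verdict = "PE slower than router" if pe_ns > router_ns else "PE faster than router"
    print(f"{size:2d} iterations: PE {pe_ns}ns, router {router_ns}ns -> {verdict}")
print(satisfying)
```

    Only the 16-iterative PE meets the criterion, which agrees with the conclusion of this subsection.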

    2) Simulation Results of Traffic Latency on Terminal Tiles: Fig. 4 presents the average latencies measured on all terminal tiles for the XY and Turn-Model routing algorithms. The X-axis is the terminal tile number and the Y-axis is the latency in nanoseconds. With fewer than 4,000 packets, the designs are indistinguishable. When the number of packets increases, some delays occur, reflecting congestion at a few boundary tiles, especially those at the corners. This is most obvious in XY routing, since a packet at a corner waits for its only route to become available, which blocks other packets in other tiles. The adaptive routings, on the other hand, show several different distributions of hot-spot corners due to the differences in their routing sequences.

    Fig. 4: Packet Latency on Terminal Tile in DES MCNoC. Panels: (a) XY routing, (b) WF routing, (c) NF routing, (d) NL routing.

    For instance, WF routing sends packets leaving a PE West as the first choice and South as the second choice. Combined with the facts that packets from the North are routed South as the first choice, and packets from the East are routed West and South as the first and second choices respectively, most packets take West or South as their first routing path. This makes it hard for terminal tiles at the lower left corner to inject more packets, so the terminal tiles numbered 04, 05 and 06 become high-latency entries. This experiment shows that the numbers of high-latency entries for XY, WF, NF and NL are 9, 3, 5 and 6 respectively.

    3) Simulation Results of PE Utilization: The computing power of the MCNoC derives from the sum of the computing power of every PE. Theoretically, if an MCNoC can distribute the workload equally to each PE, the throughput should be at its maximum. Fig. 5 presents the utilization rate of each PE for the different routing algorithms in the 10,000-packet test set.

    XY routing utilizes the PEs most unevenly: it has very high utilization at the tiles numbered 21, 25 and 52, but extremely low or zero utilization at tiles around the center of the NoC. The WF and NL routings have five PEs, and the NF routing has seven, at utilization lower than 1%. This illustrates that the Turn-Model is a biased routing algorithm that loses approximately a quarter of the total computing power of the MCNoC.

    Fig. 5: PE Utilization in DES MCNoC

  • TABLE I: DES MCNoC Performance Comparisons with Various PE Architectures

    All tests run at 200MHz. Latency = average latency; Return = packets return ratio.

    PE Size in      Maximum       Slice Usage       100 packets,         10,000 packets,       10,000 packets,
    Tested MCNoC    Frequency     Register  LUT     320Mbps insertion    320Mbps insertion     727Mbps insertion
                                                    Latency   Return     Latency    Return     Latency    Return
    1 iteration     229.764MHz    12%       28%     2.02us    75%        2.71us     0.89%      X          X
    8 iterations    263.832MHz    15%       23%     2.26us    100%       279.48us   100%       23.70us    15.73%
    16 iterations   263.832MHz    13%       24%     2.25us    100%       279.45us   100%       141.62us   100%

    4) Simulation Results of Throughput: According to the comparison results described in the previous sections, the DES MCNoC using the WF routing algorithm has the best performance of all. It has the highest packet insertion rate and the lowest processing latency, which is attributed to higher PE utilization and lower traffic contention than the other algorithms. The XY routing has a higher packet insertion rate than the NF and NL routings, but gives the lowest throughput due to its vulnerability to network congestion. Even so, XY shows very competitive performance in a high-throughput design. All designs have maximum frequencies over 250MHz, and the throughputs, in gigabits per second, are listed in Table III.

    Compared with the previous works listed in Table IV, the proposed work is 6.17 times faster than [3], which is composed of soft-core processors and pipeline technology, and 14.71 times faster than [4], which is also a complicated design applying NePA and group pipelining.
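    The speedup figures follow from dividing the WF throughput of the proposed MCNoC by the throughput of each previous work in Table IV. The sketch below reproduces that arithmetic; the variable and function names are hypothetical, and the ratio for [2] is computed only for completeness since the text does not quote it.

```python
# Throughputs from Table IV, all converted to Mbps.
PROPOSED_WF_MBPS = 5650.0  # 5.65Gbps, proposed MCNoC with WF routing
PREVIOUS_MBPS = {"[2]": 12.8, "[3]": 915.0, "[4]": 384.0}


def speedup(ours_mbps, theirs_mbps):
    """Throughput ratio of the proposed design over a previous work."""
    return ours_mbps / theirs_mbps


for work, mbps in PREVIOUS_MBPS.items():
    print(f"{work}: {speedup(PROPOSED_WF_MBPS, mbps):.2f}x")
```

    The ratios for [3] and [4] round to 6.17 and 14.71, matching the figures quoted above.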

    TABLE III: Throughput Testing Results (10,000 packets @250MHz)

    Routing   Max. Freq.   Insertion Rate   Average Latency   Thro. of DES
    XY        265MHz       784Mbps          133us             4.80Gbps
    WF        264MHz       909Mbps          113us             5.65Gbps
    NF        265MHz       769Mbps          133us             4.82Gbps
    NL        264MHz       889Mbps          116us             5.54Gbps

    TABLE IV: Throughput Comparison with Previous Works

    Work                  PE Arch.           Frequency   Throughput
    Proposed MCNoC, XY    16-iterative DES   250MHz      4.80Gbps
    Proposed MCNoC, WF    16-iterative DES   250MHz      5.65Gbps
    Proposed MCNoC, NF    16-iterative DES   250MHz      4.82Gbps
    Proposed MCNoC, NL    16-iterative DES   250MHz      5.54Gbps
    [2]                   MicroBlaze         100MHz      12.8Mbps
    [3]                   MicroBlaze         100MHz      915Mbps
    [4]                   NePA               100MHz      384Mbps

    VI. CONCLUSIONS

    The results show that a high-throughput DES computation design can be achieved with low-cost switching, packet format and routing algorithms in a 5×5 mesh-based MCNoC. Using a large PE is area efficient on FPGA, and having a PE processing time longer than the routing time is a key factor in PE architecture selection. This paper also shows that the uneven PE utilization of XY routing and the biased routing of the Turn-Model algorithms cause non-negligible performance loss. Finally, a NoC designed with the DES architecture in mind adds great throughput to the final design, 5 times faster than the best performance in previous works.

    To the best of our knowledge, this is the first NoC design constructed from the point of view of cryptology for high-throughput purposes.

    VII. ACKNOWLEDGMENTS

    This work was supported by Taiwan National Science Council grants PPP 101-2911-I-011-502.

    REFERENCES

    [1] H.C. Freitas, L.M. Schnorr, M.A.Z. Alves, and P.O.A. Navaux. Impact of Parallel Workloads on NoC Architecture Design. In Parallel, Distributed and Network-Based Processing (PDP), 2010 18th Euromicro International Conference on, pages 551–555, Feb. 2010.

    [2] T.T.-O. Kwok and Y.-K. Kwok. On the Design, Control, and Use of A Reconfigurable Heterogeneous Multi-Core System-on-a-Chip. In IPDPS 2008, pages 1–11, Apr. 2008.

    [3] X. Li and O. Hammami. An Automatic Design Flow for Data Parallel and Pipelined Signal Processing Applications on Embedded Multiprocessor with NoC: Application to Cryptography. Int. J. Reconfig. Comput., 2009, Jan. 2009.

    [4] Y.S. Yang, J.H. Bahn, S.E. Lee, and N. Bagherzadeh. Parallel and Pipeline Processing for Block Cipher Algorithms on a Network-on-Chip. In ITNG 2009, pages 849–854, Apr. 2009.

    [5] G. Schelle and D. Grunwald. Exploring FPGA Network on Chip Implementations across Various Application and Network Loads. In FPL 2008, pages 41–46, Sep. 2008.

    [6] J. Duato, A. Robles, F. Silla, and R. Beivide. A Comparison of Router Architectures for Virtual Cut-Through and Wormhole Switching in a NOW Environment. In Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing, pages 240–247, 1999.

    [7] N. Banerjee, P. Vellanki, and K.S. Chatha. A Power and Performance Model for Network-on-Chip Architectures. In Design, Automation and Test in Europe Conference and Exhibition. Proceedings, volume 2, pages 1250–1255 Vol.2, Feb. 2004.