nvd-27. a practical noc design for parallel des computation - ieee 2013

A Practical NoC Design for Parallel DES Computation

R. Yuan, S.-J. Ruan and J. Gotze

National Taiwan University of Science and Technology
Low-Power Systems Lab, Taipei, Taiwan
E-Mails: {D9902102, sjruan}@mail.ntust.edu.tw

TU Dortmund
Information Processing Lab, Dortmund, Germany

E-Mail: [email protected]

Abstract: The Network-on-Chip (NoC) is considered a new SoC paradigm for supporting a large number of processing cores in next-generation systems. Combining a NoC with homogeneous processors to construct a Multi-Core NoC (MCNoC) is one way to achieve high computational throughput for specific purposes such as cryptography. Many studies use cryptography standards for performance demonstration but rarely discuss a NoC suited to such a standard. The goal of this paper is to present a practical methodology, without complicated virtual channel or pipeline technologies, that provides high-throughput Data Encryption Standard (DES) computation on FPGA. The results show that a mesh-based NoC whose packet and Processing Element (PE) designs follow the DES specification can outperform previous works. Moreover, the deterministic XY routing algorithm proves competitive in a high-throughput NoC, and West-First routing offers the best performance among the Turn-Model routings, which are representative of adaptive routing.

I. INTRODUCTION

The advantages of the Network-on-Chip (NoC) over traditional bus-based architectures have been reported in many studies. The NoC architecture offers both scalability and flexibility, so it can be organized to run homogeneous cores in parallel to improve performance for specific purposes [1]. Such an approach is a suitable way to realize a high-throughput computational system on FPGA.

Data encryption/decryption is a computational algorithm often implemented in research for performance demonstration. The characteristics of a cryptographic algorithm affect the selection of the flit size for routing, the packet size in traffic communication and the architecture of the Processing Element (PE). Given today's widespread demand for data protection, a high-performance NoC specific to cryptography must be analyzed.

Our work realizes a 5×5 2-D mesh with virtual cut-through (VCT) switching, running 25 Data Encryption Standard (DES) computations in parallel. The goal of this paper is to evaluate the throughput of a heavily loaded NoC. The main contribution is the set of performance verification results of MCNoC architectures for parallel DES computation. Our results indicate that the proposed work achieves a considerable speedup over previous works.

This paper is organized as follows: Section II describes the related work on DES in other NoC systems. Section III introduces the proposed architecture. Section IV describes the configurations of the proposed MCNoC, including the packet format, routing algorithm, flow control and PE architecture. Section V describes the experimental methodology and shows the results. The last section concludes the paper with brief statements.

II. RELATED WORKS

Some NoC proposals use soft-core processors, MicroBlaze or the Networked Processor Array (NePA), as processing elements to implement DES computations [2], [3], [4]. These processors provide far more complicated functions than traditional DES needs. Adding cores therefore becomes costly because of the dramatic increase in complexity and traffic load, resulting in limited performance improvement.

The work in [5] realizes one DES encryption that is used sporadically in the network for brute-force testing. Its performance is not fully exploited because the architecture was not designed with high throughput in mind.

This paper presents a practical MCNoC for parallel DES processing to meet high-throughput demand. The proposed MCNoC has all boundary ports open to other resources for high throughput, and this sharing scheme has been applied in state-of-the-art commercial NoC chips such as the Tilera TILE64. Without complicated pipeline or virtual channel designs, the routing and flow control components can be kept simple, so the NoC is low-cost and consumes little power.

III. ARCHITECTURE OF MCNOC FOR DES

The MCNoC is a specific NoC whose parallel computing power can be shared by multiple components connected to it. A typical architecture of a 5×5 NoC is shown in Fig. 1. Tiles numbered 11, 12, 13, 14, 15, 21, 25, 31, 35, 41, 45, 51, 52, 53, 54, and 55 are boundary tiles. Each boundary tile has either 1 side or 2 sides connected to external resources rather than switches. These external resources can be packet generators, receivers or both, and are represented as tiles with dotted lines called terminal tiles, numbered 01, 02, 03, 04, 05, 10, 16, 20, 26, 30, 36, 40, 50, 56, 61, 62, 63, 64, and 65. Terminal tiles are dummy tiles, so no PE is connected to them. N, E, W and S represent North, East, West and South respectively. The remaining tiles are normal tiles without any specific name. Every tile except the terminal tiles is composed of one router and one PE.
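As an illustration of the tile naming above, the following minimal sketch classifies the tiles of an N×N mesh by position. It assumes that a tile number "rc" encodes row r and column c (both running from 1 to N); the function name `outward_sides` is hypothetical and not taken from the design.

```python
# Sketch: classify tiles of an N x N mesh by position.
# Assumption: tile number "rc" encodes row r and column c, each in 1..N.
N = 5


def outward_sides(row, col, n=N):
    """Count the sides of tile (row, col) that face outside the mesh."""
    return sum([row == 1, row == n, col == 1, col == n])


tiles = [(r, c) for r in range(1, N + 1) for c in range(1, N + 1)]

# Boundary tiles have at least one side facing an external resource.
boundary = [10 * r + c for r, c in tiles if outward_sides(r, c) > 0]
# The remaining tiles are the "normal" tiles.
normal = [10 * r + c for r, c in tiles if outward_sides(r, c) == 0]

print("boundary tiles:", boundary)
print("normal tiles:  ", normal)
```

Under this numbering assumption the sketch yields 16 boundary tiles, matching the list in the text, and 9 normal tiles; corner tiles have 2 outward sides and the other boundary tiles have 1.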

978-1-4673-4436-4/13/$31.00 ©2013 IEEE

Fig. 1: A 5×5 mesh-based MCNoC

A terminal tile injects packets into only one boundary tile. The boundary tile receives the packet if its PE is available; otherwise it routes the packet to a neighboring tile according to its routing algorithm. The packet is routed until an available PE is found. When the DES operation is finished, the packet has its done bit set to HIGH and is routed back towards the terminal tile it came from.

Unlike other works that use end-to-end traffic or only a few input nodes as packet injection points, our work takes input traffic from North, East, West and South and returns packets in all directions. To keep the NoC low-cost and scalable, it is constructed based on the following principles:

1) No virtual channel or pipeline is used.
2) All PEs are designed for data processing only.
3) Architectures of all tiles are identical.
4) Unfinished packets should not leave the NoC.

The third and fourth principles are the difficult ones. Unlike boundary tiles, tiles at the center or some other locations do not need criteria to block packets. Applying these controls adds two extra clock states to the routing decision in each direction of traffic, and 9 out of 25 tiles are affected in a 5×5 NoC. For the sake of the overall flexibility and scalability of the MCNoC, we consider this overhead acceptable.

IV. NETWORK ON CHIP CONFIGURATIONS

A. Packet Format

A packet is transferred as 5 consecutive flits, each 32 bits long, as shown in Fig. 2. The first flit is the Header flit, which stores information for the routing decision and the PE computation. The source and terminal values are the coordinates of the packet generator and receiver respectively. The destination number always stores the coordinate of one boundary tile. The two-bit done signal indicates whether the Data flit holds the final data of the DES operation. The iteration number tells which stage of the DES operation the packet is in. Finally, the serial number is used as a tracking number, and bit 7 is reserved.

The Key flit, composed of the second and third flits, stores the complete 64-bit initial key needed in the DES operation, e.g. for subkey generation according to the iteration number. The fourth and fifth flits are grouped as the Data flit, which stores the plaintext at the beginning, the intermediate cipher during DES processing, and the final ciphertext when DES encryption is done. The Header, Key and Data flits are mentioned repeatedly in the following sections, where their relationship is clarified.

Fig. 2: Packet flits
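The five-flit layout described above can be sketched as follows. This is only an illustration under stated assumptions: the header is treated as an opaque 32-bit word because its exact bit positions are given in Fig. 2 rather than in the text, the high half of each 64-bit value is assumed to travel first, and the function names and sample values are hypothetical.

```python
# Sketch: a packet as five 32-bit flits (header, 2 key flits, 2 data flits).
# Assumptions: header is an opaque 32-bit word; high 32 bits are sent first.
FLIT_BITS = 32
FLIT_MASK = (1 << FLIT_BITS) - 1


def split64(value):
    """Split a 64-bit value into (high, low) 32-bit flits."""
    return (value >> FLIT_BITS) & FLIT_MASK, value & FLIT_MASK


def join64(high, low):
    """Rebuild a 64-bit value from its (high, low) 32-bit flits."""
    return ((high & FLIT_MASK) << FLIT_BITS) | (low & FLIT_MASK)


def build_packet(header, key, data):
    """Return the five flits in transfer order."""
    return [header & FLIT_MASK, *split64(key), *split64(data)]


def parse_packet(flits):
    """Recover (header, 64-bit key, 64-bit data) from five flits."""
    header, key_hi, key_lo, data_hi, data_lo = flits
    return header, join64(key_hi, key_lo), join64(data_hi, data_lo)


# Arbitrary sample values, chosen only to exercise the round trip.
packet = build_packet(0x00000000, 0x133457799BBCDFF1, 0x0123456789ABCDEF)
print([f"{flit:08X}" for flit in packet])
```

Each packet therefore carries 160 bits in total, of which 64 bits are payload data.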

B. Routing Algorithm

In this proposal four routing algorithms are tested: the deterministic XY routing and three adaptive Turn-Model routings, West-First (WF), North-Last (NL) and Negative-First (NF). All of these algorithms are deadlock-free and livelock-free, and each is implemented separately to evaluate its characteristics in a heavily loaded DES MCNoC.
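For readers unfamiliar with XY routing, a minimal sketch of the generic textbook algorithm is given here: a packet first travels along the X dimension until the column matches, then along the Y dimension. It is not the router of this paper, which additionally takes PE availability into account. The coordinate convention (row grows towards South, column grows towards East) and all names are assumptions made for illustration.

```python
# Sketch: generic dimension-ordered XY routing on a 2-D mesh.
# Assumption: tiles are (row, col); row grows to the South, col to the East.
STEPS = {"E": (0, 1), "W": (0, -1), "S": (1, 0), "N": (-1, 0)}


def xy_next_hop(current, destination):
    """Return 'E', 'W', 'S', 'N', or 'LOCAL' for one XY routing step."""
    row, col = current
    dst_row, dst_col = destination
    if col != dst_col:  # resolve the X dimension first
        return "E" if dst_col > col else "W"
    if row != dst_row:  # then resolve the Y dimension
        return "S" if dst_row > row else "N"
    return "LOCAL"


def xy_path(current, destination):
    """List every tile visited from current to destination."""
    path = [current]
    while (hop := xy_next_hop(current, destination)) != "LOCAL":
        d_row, d_col = STEPS[hop]
        current = (current[0] + d_row, current[1] + d_col)
        path.append(current)
    return path


print(xy_path((1, 1), (3, 4)))
```

Because the route between two tiles is unique, a blocked link stalls every packet that needs it; the Turn-Model routings relax this by allowing a choice among permitted turns.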

Fig. 3 shows the architecture of the router. When a packet is sent into a tile, the router first checks its done bit. If it is LOW, the packet is either sent to the PE or directed to another tile. If the done bit is HIGH, the packet is routed towards its desired boundary tile. Once it reaches that boundary tile, it is immediately burst to the desired terminal tile. By checking the destination number, the terminal number and the done bit in the Header flit, the router ensures that an unfinished packet cannot pass through the boundary tile until it finishes all processes of the DES computation.
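The decision described above can be summarized by the following sketch. It is a simplification under stated assumptions: the two-bit done signal is reduced to a boolean, the choice of output direction is left to the routing algorithm, and the function and action names are hypothetical.

```python
# Sketch: per-packet decision made by a router, as described in the text.
# Assumptions: `done` is a boolean; action names are illustrative only.
def router_action(done, pe_available, at_destination_boundary):
    """Return what the router does with an incoming packet."""
    if done:
        # Finished packets head for their boundary tile, then leave the NoC.
        return "eject_to_terminal" if at_destination_boundary else "forward"
    # Unfinished packets never leave the NoC: use the local PE or move on.
    return "to_pe" if pe_available else "forward"


for done in (False, True):
    for pe_available in (False, True):
        for at_boundary in (False, True):
            print(done, pe_available, at_boundary,
                  router_action(done, pe_available, at_boundary))
```

Note that an unfinished packet is never ejected in this sketch, whatever tile it is in, which reflects the fourth design principle.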

Fig. 3: Router Architecture

C. Flow Control

Virtual cut-through (VCT) switching delivers higher throughput than wormhole switching as the load increases, because wormhole switching saturates network resources quickly once packets start to block [6]. Banerjee et al. [7] report that VCT gave lower latencies at higher acceptance rates and provided better performance than wormhole switching.

D. Architecture of PE

According to the structure of DES, the reasonable numbers of iterations per PE are the divisors of 16, i.e. 1, 2, 4, 8 and 16 (with 16, one PE completes one DES operation). A small PE that performs only one iteration needs another 15 computations to complete one DES operation, causing more packet routing in the network. By contrast, a large PE containing all 16 iterations keeps a packet inside the PE longer, so the network traffic allows more data to fill up other PEs. Whether the fast reaction of a small PE improves the throughput of the overall network is therefore a factor to consider. This is tested in Section V.
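The trade-off above can be made concrete with a short calculation of how many PE computations one DES operation needs for each candidate PE size. The sketch assumes that every visit to a PE performs exactly k of the 16 DES iterations; the variable names are illustrative.

```python
# Sketch: PE computations needed per DES operation for each PE size.
# Assumption: each PE visit performs exactly k of the 16 DES iterations.
DES_ITERATIONS = 16

# Candidate PE sizes are the divisors of 16.
pe_sizes = [k for k in range(1, DES_ITERATIONS + 1) if DES_ITERATIONS % k == 0]

# Number of PE computations one packet needs before it is finished.
visits = {k: DES_ITERATIONS // k for k in pe_sizes}

for k, n in visits.items():
    print(f"{k:2d}-iterative PE: {n:2d} PE computation(s) per DES operation")
```

A 1-iterative PE thus requires 16 computations (the first plus another 15), while a 16-iterative PE requires only one, at the cost of holding the packet longer.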

V. EXPERIMENT METHODOLOGY AND RESULTS

A. Experiment Setup

The full circuit design was done in Xilinx ISE 11.4, targeting the XC5VLX220-1FF1760 device. All simulations were run in ModelSim-SE 6.2g under a high rate of consecutive packet injection, with over 95% overlap between any two inputs, which guaranteed that the MCNoC was fully loaded and that the maximum throughput was measured without saturating it.

B. Simulation Results

1) Simulation Results of PE Size: Table I lists the performance and evaluation of 1-, 8- and 16-iterative PEs. The resulting slice utilization shows that the 16-iterative PE architecture fits the slice architecture better than the others.

In the throughput experiments, the benefit of the short data processing period of a low-iterative PE does not compensate for the loss of throughput caused by congestion. The 1-iterative PE saturates the NoC quickly because the routing time is much longer than the data processing time. Consequently, more packets stay on links rather than in PEs, resulting in congestion. When the insertion rate reaches 727Mbps, packet congestion occurs in the 8-iterative design, and only 15.73% of packets return to their original terminal tiles.

The processing time of one packet in Table II shows that 1- and 8-iterative PEs process faster than the router, since they implement only part of the DES computation. A 16-iterative PE holds a packet longer, giving the router more chances to service another packet, which further helps reduce traffic in the network.

TABLE II: Processing Time of One Packet

PE Size         PE Processing Time   Router Processing Time
1 iteration     35ns                 80ns
8 iterations    75ns                 80ns
16 iterations   115ns                90ns

This section concludes that the DES MCNoC equipped with 16-iterative PEs offers the best performance and slice utilization. To achieve high throughput, the PE processing time has to be longer than the router processing time, which helps improve the utilization of both PE and router.
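The condition stated above can be checked directly against the values of Table II. The following sketch simply compares the two processing times for each PE size; the dictionary layout and names are illustrative.

```python
# Sketch: check the "PE time longer than router time" condition on Table II.
# Values are (PE processing time, router processing time) in nanoseconds.
processing_ns = {1: (35, 80), 8: (75, 80), 16: (115, 90)}

# PE sizes whose PE processing time exceeds the router processing time.
pe_longer = [size for size, (pe_ns, router_ns) in processing_ns.items()
             if pe_ns > router_ns]

for size, (pe_ns, router_ns) in processing_ns.items():
    verdict = "longer" if pe_ns > router_ns else "shorter"
    print(f"{size:2d} iteration(s): PE {pe_ns}ns is {verdict} "
          f"than router {router_ns}ns")
```

Only the 16-iterative PE satisfies the condition, which is consistent with the conclusion of this section.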

2) Simulation Results of Traffic Latency on Terminal Tiles: Fig. 4 presents the average latencies measured on all terminal tiles for the XY and Turn-Model routing algorithms. The X-axis is the terminal tile number and the Y-axis is the latency in nanoseconds. With fewer than 4,000 packets, the designs are indistinguishable. As the number of packets increases, delays appear that reflect the difficulty experienced by a few boundary tiles, especially those at the corners. This is most evident in XY routing, since a packet at a corner waits for its only route to become available, which blocks other packets in other tiles. The adaptive routings, on the other hand, show different distributions of hot-spot corners due to their different routing sequences.

Fig. 4: Packet Latency on Terminal Tile in DES MCNoC. Panels: (a) XY routing, (b) WF routing, (c) NF routing, (d) NL routing.

For instance, WF routing sends packets leaving a PE to the West as the first choice and to the South as the second choice. In addition, packets arriving from the North are routed South as the first choice, and packets arriving from the East are routed West and South as the first and second choices respectively. As a result, most packets take West or South as their first routing path, which makes it hard for terminal tiles at the lower left corner to inject more packets, and terminal tiles numbered 04, 05 and 06 become high-latency entries. In this experiment, the number of high-latency entries for XY, WF, NF, and NL is 9, 3, 5, and 6 respectively.

3) Simulation Results of PE Utilization: The computing power of the MCNoC is the sum of the computing power of every PE. In theory, if an MCNoC can distribute the workload equally to each PE, the throughput should reach its maximum. Fig. 5 presents the utilization rate of each PE for the different routing algorithms in the 10,000-packet test set.

The XY routing utilizes PEs the most unevenly: it has very high utilization at tiles numbered 21, 25 and 52, but extremely low or zero utilization at tiles around the center of the NoC. The WF and NL routings have five PEs, and the NF routing has seven PEs, with utilization lower than 1%. These results illustrate that the Turn-Model is a biased routing algorithm that loses approximately a quarter of the total computing power in the MCNoC.

Fig. 5: PE Utilization in DES MCNoC

TABLE I: DES MCNoC Performance Comparisons with Various PE Architectures

PE Size in      Maximum      Slice Usage     100 packets @200MHz          10,000 packets @200MHz       10,000 packets @200MHz
Tested          Frequency                    Insertion Rate 320Mbps       Insertion Rate 320Mbps       Insertion Rate 727Mbps
MCNoC                        Register  LUT   Avg. Latency  Return Ratio   Avg. Latency  Return Ratio   Avg. Latency  Return Ratio
1 iteration     229.764MHz   12%       28%   2.02us        75%            2.71us        0.89%          X             X
8 iterations    263.832MHz   15%       23%   2.26us        100%           279.48us      100%           23.70us       15.73%
16 iterations   263.832MHz   13%       24%   2.25us        100%           279.45us      100%           141.62us      100%

4) Simulation Results of Throughput: According to the comparison results described in the previous sections, the DES MCNoC using the WF routing algorithm has the best performance of all. It has the highest packet insertion rate and the lowest processing latency, which is attributed to its higher PE utilization and lower traffic contention compared with the other algorithms. The XY routing has a higher packet insertion rate than the NF and NL routings, but gives the lowest throughput due to its vulnerability to network congestion. Even so, XY shows very competitive performance in a high-throughput design. All designs have maximum frequencies over 250MHz, and their throughputs, calculated in gigabits per second, are listed in Table III.

Compared with the previous works listed in Table IV, the proposed work is 6.17 times faster than [3], which is composed of soft-core processors and pipeline technology, and 14.71 times faster than [4], which is also a complicated design applying NePA and group pipelining.
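The speedup factors quoted above follow from dividing the best throughput of the proposed design (WF routing, 5.65Gbps) by the throughputs reported for [3] and [4]. A short sketch of this arithmetic, with all values converted to Mbps and with illustrative variable names, is:

```python
# Sketch: reproduce the quoted speedups from the reported throughputs.
# All throughputs are expressed in Mbps.
proposed_mbps = {"XY": 4800, "WF": 5650, "NF": 4820, "NL": 5540}
previous_mbps = {"[3]": 915, "[4]": 384}

# Best proposed configuration (WF routing) against each previous work.
best_speedup = {ref: round(proposed_mbps["WF"] / mbps, 2)
                for ref, mbps in previous_mbps.items()}

# Slowest proposed configuration against the fastest previous work.
worst_case = round(min(proposed_mbps.values()) / max(previous_mbps.values()), 2)

print(best_speedup)
print(worst_case)
```

Even the slowest proposed configuration (XY routing) remains more than 5 times faster than the fastest previous work, [3].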

TABLE III: Throughput Testing Results (10,000 packets @250MHz)

Routing   Max. Freq.   Insertion Rate   Average Latency   Throughput of DES
XY        265MHz       784Mbps          133us             4.80Gbps
WF        264MHz       909Mbps          113us             5.65Gbps
NF        265MHz       769Mbps          133us             4.82Gbps
NL        264MHz       889Mbps          116us             5.54Gbps

TABLE IV: Throughput Comparison with Previous Works

Work                 PE Arch.           Frequency   Throughput
Proposed MCNoC, XY   16-iterative DES   250MHz      4.80Gbps
Proposed MCNoC, WF   16-iterative DES   250MHz      5.65Gbps
Proposed MCNoC, NF   16-iterative DES   250MHz      4.82Gbps
Proposed MCNoC, NL   16-iterative DES   250MHz      5.54Gbps
[2]                  MicroBlaze         100MHz      12.8Mbps
[3]                  MicroBlaze         100MHz      915Mbps
[4]                  NePA               100MHz      384Mbps

VI. CONCLUSIONS

The results show that a high-throughput DES computation design can be achieved with low-cost switching, packet format and routing algorithms in a 5×5 mesh-based MCNoC. Using a large PE is area-efficient on FPGA, and having a PE processing time longer than the routing time is a key factor in selecting the PE architecture. This paper also shows that the uneven PE utilization of XY routing and the biased routing of the Turn-Model algorithms cause non-negligible performance loss. Finally, a NoC designed with the DES architecture in mind adds great throughput to the final design, which is 5 times faster than the best performance in previous works.

To the best of our knowledge, this is the first NoC design constructed from the point of view of cryptology for high-throughput purposes.

VII. ACKNOWLEDGMENTS

This work was supported by Taiwan National Science Council grants PPP 101-2911-I-011-502.

REFERENCES

[1] H.C. Freitas, L.M. Schnorr, M.A.Z. Alves, and P.O.A. Navaux. Impact of Parallel Workloads on NoC Architecture Design. In Parallel, Distributed and Network-Based Processing (PDP), 2010 18th Euromicro International Conference on, pages 551-555, Feb. 2010.

[2] T.T.-O. Kwok and Y.-K. Kwok. On the Design, Control, and Use of A Reconfigurable Heterogeneous Multi-Core System-on-a-Chip. In IPDPS 2008, pages 1-11, Apr. 2008.

[3] X. Li and O. Hammami. An Automatic Design Flow for Data Parallel and Pipelined Signal Processing Applications on Embedded Multiprocessor with NoC: Application to Cryptography. Int. J. Reconfig. Comput., 2009, Jan. 2009.

[4] Y.S. Yang, J.H. Bahn, S.E. Lee, and N. Bagherzadeh. Parallel and Pipeline Processing for Block Cipher Algorithms on a Network-on-Chip. In ITNG 2009, pages 849-854, Apr. 2009.

[5] G. Schelle and D. Grunwald. Exploring FPGA Network on Chip Implementations across Various Application and Network Loads. In FPL 2008, pages 41-46, Sep. 2008.

[6] J. Duato, A. Robles, F. Silla, and R. Beivide. A Comparison of Router Architectures for Virtual Cut-Through and Wormhole Switching in a NOW Environment. In Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing, pages 240-247, 1999.

[7] N. Banerjee, P. Vellanki, and K.S. Chatha. A Power and Performance Model for Network-on-Chip Architectures. In Design, Automation and Test in Europe Conference and Exhibition. Proceedings, volume 2, pages 1250-1255, Feb. 2004.
