rtl datapath optimization using system-level transformations

8
RTL Datapath Optimization Using System-level Transformations Samaneh Ghandali 1 , Bijan Alizadeh 1,2 , Masahiro Fujita 3 , Zainalabedin Navabi 1 1 School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran 2 School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran 3 VLSI Design and Education Center (VDEC), University of Tokyo, Tokyo, Japan [email protected], [email protected], [email protected], [email protected] Abstract This paper describe a system-level approach to improve the area and delay of datapath designs that perform polynomial computations over , which are used in many applications such as computer graphics and digital signal processing domains. This approach optimizes the implementation of multivariate polynomial systems in terms of the number of arithmetic operations by performing optimization on a system level prior to high-level synthesis. Univariate functional decomposition of polynomial expressions and canonization form over are used in this method. We use GAUT high-level synthesis tool to generate RTL datapath architectures for the optimized polynomials. Experimental results on a set of benchmark applications with polynomial expressions show that this method outperforms conventional methods in terms of the area of the sequential datapath architectures in speed optimization mode with an average improvement of 25.81%, and the required clock cycles in two modes of speed optimization and area optimization, with an average improvement of 23.48% and 38.24%, respectively. Keywords High-level synthesis, system-level transformations, register transfer level (RTL), polynomial datapath, univariate functional decomposition, canonization form 1. Introduction As the complexity and size of modern embedded application is continuously increasing, designing hardware at higher levels of abstraction for faster design adjustments and higher simulation speed is necessary. Conventional high level synthesis techniques are not efficient to eliminate redundancy and common sub-expression for polynomial datapaths over . Such polynomial functions have been optimized manually to achieve efficient register-transfer- level (RTL) implementation. This process can be time consuming and error prone. Hence, developing high level synthesis and optimization techniques to automate the design of custom polynomial datapaths from a behavioral description is desirable. The Horner form of a polynomial expression is a normal form representation using a nested format. This method transforms the expression into a sequence of nested additions and multiplications, which are suitable for univariate polynomials and for sequential machine evaluation using multiplier-accumulator units. Another algebraic technique is based on kernel/co-kernel computation [6], in which first, lowest cost form of given polynomials from canonization, square-free factorization and original forms is taken into consideration. Then common coefficients and common cubes are extracted using the kernel/co-kernel extraction technique from [7]. Common sub-expressions are determined using algebraic division technique. This method is only applicable to those polynomials in which linear blocks exist explicitly. In [7] and [8], a factoring method was proposed employing kernel/co-kernel extraction with common sub- expression elimination to reduce the size of implementation. The approximate factorization algorithm presented in [8] represents an arithmetic function f as a product of sub- functions f = f 1 ×f 2 ×…×f n where f i is a multivariate polynomial. However, this algorithm is able to factorize square-free polynomials and cannot deal with a sub-function f i with a degree higher than one. Another algebraic method has been proposed in [3] and then improved in [4]. The main idea is somehow similar to algebraic division techniques used in logic synthesis. This technique tries to decompose the original polynomial poly as poly = p 1 × p 2 + p 3 while p 3 should be minimized. For doing so, all possible initial values of p 1 and p 2 must be evaluated. Then for each initialization it is necessary to check whether other monomials in poly can be represented in the form p 1 × p 2 . Finally, the best initialization, which constitutes the lowest complexity p 3 , is chosen. The algebraic technique in [2] improves the optimization heuristics in [3] and [4] to extract more common sub-expressions by considering single- variable and hidden monomials. This technique makes use of finite ring algebra and Modular Horner Expansion Diagram [5]. This method first reduces the original polynomials over . Then common sub-expressions are extracted based on two heuristics. The main disadvantage of this technique is that decompositions are started from reduced polynomials while if the original polynomials are used more common sub- expressions would be extracted. The Algebraic method in [1] proposed for the first time a kind of polynomial optimization technique based on redundancy addition/removal. The main idea is somehow similar to logic optimization based on redundancy addition/removal which has been developed in logic synthesis area. In this method, first, kernels/co-kernels of given polynomials are extracted as good building blocks, then a large number of vanishing polynomials over , which are equal to 0 over , are generated as redundancy in order to transform the given polynomials in such a way that more common sub-expressions can be extracted. Finally, using algebraic division common sub-expressions are determined. In the current paper, we introduce some system-level techniques for transformation of the given system of 978-1-4799-3946-6/14/$31.00 ©2014 IEEE 309 15th Int'l Symposium on Quality Electronic Design

Upload: don-hass

Post on 17-Aug-2015

214 views

Category:

Documents


0 download

DESCRIPTION

This approach optimizes the implementation of multivariate polynomial systems in terms of the number of arithmetic operations by performing optimization on a system level prior to high-level synthesis.

TRANSCRIPT

RTL Datapath Optimization Using System-level TransformationsSamaneh Ghandali1, Bijan Alizadeh1,2, Masahiro Fujita3, Zainalabedin Navabi1 1School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran 2School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran 3VLSI Design and Education Center (VDEC), University of Tokyo, Tokyo, Japan [email protected], [email protected], [email protected], [email protected] Abstract Thispaperdescribeasystem-levelapproachtoimprovethe areaanddelayofdatapathdesignsthatperformpolynomial computations over Z2m, which are used in many applications suchascomputergraphicsanddigitalsignalprocessing domains.Thisapproachoptimizestheimplementationof multivariatepolynomialsystemsintermsofthenumberof arithmeticoperationsbyperformingoptimizationona systemlevelpriortohigh-levelsynthesis.Univariate functionaldecompositionofpolynomialexpressionsand canonization form over Z2m are used in this method. We use GAUThigh-levelsynthesistooltogenerateRTLdatapath architecturesfortheoptimizedpolynomials.Experimental resultsonasetofbenchmarkapplicationswithpolynomial expressions show that thismethod outperforms conventional methodsintermsoftheareaofthesequentialdatapath architecturesinspeedoptimizationmodewithanaverage improvement of 25.81%, and the required clock cycles in two modesofspeedoptimizationandareaoptimization,withan average improvement of 23.48% and 38.24%, respectively. Keywords High-levelsynthesis,system-leveltransformations,register transferlevel(RTL),polynomialdatapath,univariate functional decomposition, canonization form 1. Introduction Asthecomplexityandsizeofmodernembedded application is continuously increasing, designing hardware at higher levels of abstraction for faster design adjustments and highersimulationspeedisnecessary.Conventionalhigh levelsynthesistechniquesarenotefficienttoeliminate redundancyandcommonsub-expressionforpolynomial datapathsoverZ2m.Suchpolynomialfunctionshavebeen optimizedmanuallytoachieveefficientregister-transfer-level(RTL)implementation.Thisprocesscanbetime consuminganderrorprone.Hence,developinghighlevel synthesis and optimization techniques to automate the design ofcustompolynomialdatapathsfromabehavioral description is desirable. The Horner form of a polynomial expression is a normal formrepresentationusinganestedformat.Thismethod transforms the expression into a sequence of nested additions andmultiplications,whicharesuitableforunivariate polynomialsandforsequentialmachineevaluationusing multiplier-accumulator units. Another algebraic technique is based on kernel/co-kernel computation[6],inwhichfirst,lowestcostformofgiven polynomials from canonization, square-free factorization and originalformsistakenintoconsideration.Thencommon coefficientsandcommoncubesareextractedusingthe kernel/co-kernelextractiontechniquefrom[7].Common sub-expressionsaredeterminedusingalgebraicdivision technique.Thismethodisonlyapplicabletothose polynomials in which linear blocks exist explicitly.In[7]and[8],afactoringmethodwasproposed employingkernel/co-kernelextractionwithcommonsub-expressioneliminationto reducethesizeofimplementation. Theapproximatefactorizationalgorithmpresentedin[8] representsanarithmeticfunctionfasaproductofsub-functionsf=f1f2fnwherefiisamultivariate polynomial.However,thisalgorithmisabletofactorize square-free polynomials and cannot deal with a sub-function fi with a degree higher than one. Anotheralgebraicmethodhasbeenproposedin[3]and thenimprovedin[4].Themainideaissomehowsimilarto algebraicdivisiontechniquesusedinlogicsynthesis.This technique tries to decompose the original polynomial poly as poly = p1 p2 + p3 while p3 should be minimized. For doing so, all possible initial values of p1 and p2 must be evaluated. Thenforeachinitializationitisnecessarytocheckwhether other monomials in poly can be represented in the form p1 p2.Finally,thebestinitialization,whichconstitutesthe lowestcomplexityp3,ischosen.Thealgebraictechniquein [2]improvestheoptimizationheuristicsin[3]and[4]to extract more common sub-expressions by considering single-variable and hidden monomials. This technique makes use of finiteringalgebraandModularHornerExpansionDiagram [5].Thismethodfirstreducestheoriginalpolynomialsover Z2m.Thencommonsub-expressionsareextractedbasedon twoheuristics.Themaindisadvantageofthistechniqueis thatdecompositionsarestartedfromreducedpolynomials while if the original polynomials are used more common sub-expressions would be extracted. The Algebraic method in [1] proposed for the first time a kindofpolynomialoptimizationtechniquebasedon redundancyaddition/removal.Themainideaissomehow similartologicoptimizationbasedonredundancy addition/removalwhichhasbeendevelopedinlogic synthesisarea.Inthismethod,first,kernels/co-kernelsof givenpolynomialsareextractedasgoodbuildingblocks, thenalargenumberofvanishingpolynomialsoverZ2m, whichare equalto 0 over Z2m,aregeneratedas redundancy inordertotransformthegivenpolynomialsinsuchaway that more common sub-expressions can be extracted. Finally, usingalgebraicdivisioncommonsub-expressionsare determined.Inthecurrentpaper,weintroducesomesystem-level techniquesfortransformationofthegivensystemof 978-1-4799-3946-6/14/$31.00 2014 IEEE30915th Int'l Symposium on Quality Electronic Design polynomials,whichoffermorecommonsub-expressions. Ouroptimizationmethodreducesthecomplexityof polynomialdatapathsintermsofthenumberofarithmetic operationsbyperformingoptimizationonasystem-level priortohigh-levelsynthesis.Furthermore,inorderto generateRTLdatapatharchitecturefortheoptimized polynomials, we use GAUT high-level synthesis tool [12] as ahigh-levelsynthesistool,althoughanyotherhighlevel synthesistoolscanbeutilized.Ouroptimizationmethod reducestheareaandthenumberofclockcyclesattheRTL datapatharchitectures.Inthismethod,weusemathematics conceptofunivariatefunctionaldecompositionof polynomialexpressionsinordertoobtaingoodbuilding blocks and hence extract more common sub-expressions. In summary, our design flow in this paper consists of the following tasks:System-leveltransformationstooptimizedatapath designs that perform polynomial computations over Z2m usingunivariatefunctionaldecompositionand canonization form. Univariatefunctionaldecompositionofthegiven polynomialstoobtaingoodbuildingblocksandextract suitable common sub-expressions. High-levelsynthesisusingGAUT[12]togenerate datapatharchitecturesfortheoptimizedpolynomialsas sequential circuits. Evaluating the performance of the proposed method and showing its effectiveness by comparing it with the state-of-the-artpolynomialoptimizationmethodsinthe literature. Theremainderofthispaperisorganizedasfollows. Section2introducessomepreliminarieswhichareusedin the rest of the paper. A motivational example is presented in section3.Section4explains,indetail,ourproposed polynomialoptimizationmethod.Section5evaluatesthe performanceofouralgorithmsandpresentsexperimental results that demonstrate their effectiveness. Finally, section 6 provides our conclusion. 2. Preliminaries Thissectionintroducessomepreliminarieswhichare usedintherestofthepaper.Inthispaperarithmeticdata pathsaremodeledaspolynomialfunctionsoverZ2n1 Z2n2 Z2nd to Z2m[9].Let1(x ),,p(x )bepgiven polynomialfunctionsoverZ2n1 Z2n2 Z2nd toZ2m as the specification wherex= < x1,x2,,xd > is a vector of d inputvariablesandn1,n2,,nddenotesizeofthe correspondingvariables.Z2nrepresentsthefinitesetof integers {0, 1, , 2n-1}. m is the size of the output bit-vector f. Theorem1:Letfbeapolynomialfunctionfrom Z2n1 Z2ndtoZ2m.Thenaccordingto[9],fcanbe uniquelyrepresentedinacanonicalformas(1),whereYkis fallingfactorialofdegree k e Z(Zdenotestheringof integers) and is defined as follows, Y0(x)=1,Y1(x) = x,Y2(x)= x(x-1), , Yk(x)=Yk-1(x)(x-k+1). aKisanintegersuchthat1aK < 2mgcd(2m, ki!di=1),K= for each ki= 1, 2, , i, and p = min{2ni, SF(2m)]. SF(n)istheleastk e Hsuchthatndividesk!,anddenotes Smarandachefunction[10].gcd(x,y)computesthegreatest common divisor of x and y. = oKK K= oK k1(x1) kd(xd)K(1)For example, let f = 2x5+x4+x2-2x, the canonical form of foverZ23is2x2.Notethatthecanonicalformofa polynomialoverZ2n1 Z2n2 Z2nd to Z2mmaybe zero. Definition 1: Ifg and hareunivariate polynomials,then univariatepolynomialf(x)=g(x)oh(x)istheirfunctional composition,and(g,h)isaunivariatefunctional decompositionoff,wheregandharepolynomialswith lowerdegreethanfandarecalledleftdecompositionfactor andrightdecompositionfactoroff,respectively.oisthe compositionoperatorviacomputingtheoutputofgwhenit hasanargumentofh(x)insteadofx(i.e.,f(x)=g(x)oh(x) =g(h(x))).Example1:Letf(x)=x4+x2-3,thenf(x)=g(x)oh(x)= (x2+x-3)ox2 isaunivariatefunctionaldecompositionoff, where g(x) = x2+x-3 and h(x) = x2. 3. Motivational Example Inthissection,wepresentanexampletomotivatethe optimizationtechniquetobepresented.Inorderto demonstrate the effectiveness of the proposed method, let us consider the following polynomial system. f1(x) = x4+2x3+x2+xy3-3xy2+2xy f2(x) = x6+3x5+3x4+x3+x2+ x. This system needs 32 multiplications and 10 additions. Afterapplyingthefactorizationtechniqueusing MATLAB[11]tothesepolynomials,f1andf2are transformed to the following forms f1(x) = x(x3+2x2+x+y3-3y2+2y) f2(x) = x(x+1)(x4+2x3 +x2+ 1), which need 19 multiplications and 9 additions. Byapplyingourproposedoptimizationmethodover Z22 totheoriginalpolynomials,f1isconvertedtothefollowing form, h(x) = x2+x, g(x) = x2

f1(x) = g o h + xy3-3xy2+2xy = x2 o (x2+x) + xy3-3xy2+2xy,because canonical form ofxy3-3xy2+2xyover Z22 is 0,f1(x)= x2 o (x2+x) = (x2+x)2 = h2, and f2 is converted as follows. h(x) = x2 + x, g(x) =x3 + x, (a) (b)Figure1:(a)Datapatharchitectureofthepolynomials,implementedusingfactorization,(b)Datapatharchitectureofthepolynomials, implemented using our proposed methodf2(x) = g o h = (x3 + x) o (x2+x) = (x2+ x)3+ x2+x = h3+h = h(h2+1) = h(f1+1). Theoptimizedpolynomialsystemrequiresonly3 multiplicationsand2additions.WehaveusedGAUTasa high-levelsynthesistooltogeneratedatapatharchitectures forthepolynomialsystems.GAUTtoolhasbeenusedin many academic projects, and its HLS algorithms for binding, allocation, and scheduling are well documented [12]. WehaveusedGAUTtogeneratedatapatharchitectures fortwomodes;speedoptimizationandareaoptimizationin whichonlyonefunctionalunitisconsideredforeach operationtypeexistedinthedesign.Thedatapath architectureofthepolynomials,implementedusing factorization,inthespeedoptimizationmodeisshownin Fig.1(a).Thedatapatharchitectureofthepolynomials, implementedusingourproposedmethodisshowninFig. 1(b).TheresultsreportedbyGAUTforthepolynomials, implementedusingfactorizationandourproposedmethod areshowninTable1.Wehaveusednotchlibrary, provided by GAUT, and we have set clock cycle to 20. This tablereportsareaandnumberoftheclockcycles,registers, multiplexers,andfunctionalunits(adder,subtracter, multiplier)inthedatapatharchitecturesofthefactored polynomialsandoptimizedpolynomialsusingourproposed method, in speed optimization and area optimization modes.Table 1.Gautreportforthepolynomials,implemented usingfactorization,andforthepolynomials,implemented usingourproposedmethod,inspeedoptimizationandarea optimization modes. Factorization Proposed Method Speed OptimizationCycles66 Registers144 Muxes16032 FU +41 -00 61 Area53091 Area OptimizationCycles226 Registers134 Muxes32032 FU +11 -00 11 Area9191 4. Proposed System-level Optimization Method Weintroducesomesystem-leveltechniquesfor transformationofthegivensystemofpolynomials,which offermorecommonsub-expressions.Ouroptimization methodreducesthecomplexityofpolynomialdatapathsin termsofthenumberofarithmeticoperationsbyperforming optimizationonasystem-levelpriortohigh-levelsynthesis. Furthermore,togeneratedatapatharchitectureforthe optimizedpolynomialsassequentialcircuits,weuseGAUT high-levelsynthesistool.Ouroptimizationmethodreduces area and number of clock cycles in the datapath architectures. Inthefirstphaseoftheproposedsystem-level optimizationmethod,eachgivenmultivariatepolynomial f(x1,,xd)istransformedtoseveralunivariatepolynomials byrepresentingfbasedoneachinputvariablexi(1 i J). Then each obtained univariate polynomial is decomposed throughunivariatefunctionaldecompositionalgorithm explainedinsubsection4.1inordertoobtaingoodbuilding blocks.Inthesecondphase,toextractcommonsub-expressionsamongthegivenpolynomials,wemakeuseof univariatefunctionaldecompositionalgorithmunlikeother worksthatutilizealgebraicdivisiontechnique[1][6][7]. Finally, among various forms of the polynomials in terms of theextractedcommonsub-expressions,theformwith smallestnumberofthearithmeticoperationsisselected. Thesephasesareexplainedinmoredetailsinthefollowing subsections. 4.1. Determining Building Blocks (Phase I) Inthisphase,eachgivenmultivariatepolynomialfis transformedtoseveralunivariatepolynomialsby representingfbasedoneachinputvariables.Theneach obtainedunivariatepolynomialisdecomposedthrough univariatefunctionaldecompositionalgorithminorderto obtaingoodbuildingblocks.Thisphaseisexplainedinthe following steps. Step 1: Each given multivariate polynomial f(x1,, xd) is rewritten based on each input variable xi(1 i J) as (2). =c1,..,ci-1,ci+1,..,cdx1c1x-1ci-1x+1ci+1xdcdc1,..,ci-1,ci+1,...,cd>0(2)wherefe1,..,ei-1,ei+1,..,ed(x)isaunivariatepolynomialwhich representsthepolynomialfbasedonthevariablexi,and e1,...,ei-1,ei+1,...,edaredegreesofd-1variablesx1,,xi-1, xi+1,, xd in polynomial f. Afterapplyingthistransformationtoallgiven polynomialsfj (1 ] p)wherepisthenumberofgiven polynomials,allobtainedunivariatepolynomials fe1,..,ei-1,ei+1,..,ed(x) (1 i J)fromallfj arestoredinaset named AllSubPolyxi(1 i J). Example2:Supposef1(x,y)=x6y2+5x6y+2x5y2+10x5y-x4y2-5x4y+x3y4-x3y2-5x3y+2x2y4+2x2y2+10x2y-xy2-5xy,and f2(x,y) = x6y3+x4y4-2x4y3+2x4y2-2x4y+x2y3+xy4+2xy2-2xy.Then f1 based on the variable y is represented as f1 =x6(y2+5y)+x5(2y2+10y)+x4(-y2-5y)+x3(y4-y2-5y)+x2(2y4+ 2y2+10y)+x(-y2-5y), soAllSubPoly={f1(y)=-y2-5y,f2(y)=2y4+2y2+10y, f3(y)=y4-y2-5y, f4(y)=-y2-5y, f5(y)=2y2+10y, f6(y)=y2+5y}.And f1 based on the variable x is represented as f1 =y4(x3+2x2)+y2(x6+2x5-x4-x3+2x2-x)+y(5x6+10x5-5x4-5x3+10x2-5x),so AllSubPolyx={f1(x)=5x6+10x5-5x4-5x3+10x2-5x,f2(x)=x6 +2x5-x4-x3+2x2-x, f4(x)= x3+2x2}. f2 based on the variable y is represented as f2 = x6(y3)+x4(y4-2y3+2y2-2y)+x2(y3)+x(y4+2y2-2y), soAllSubPoly= AllSubPoly U{y3,y4-2y3+2y2-2y,y4+2y2 -2y}.And f2based on the variable x is represented as f2 = y4(x4+x)+y3(x6-2x4+x2)+y2(2x4+2x)+y(-2x4-2x), soAllSubPolyx=AllSubPolyx U{x4+x,x6-2x4+x2,2x4 +2x, -2x4-2x}. Step2:univariatefunctionaldecompositioniscomputed foreachmemberofAllSubPolyxi(1 i J)(i.e., fe1,..,ei-1,ei+1,..,ed(x))byusingtheunivariatefunctional decomposition algorithm explained in the follow. Univariatefunctionaldecompositionalgorithm:Letg andhbepolynomialsofdegreesrandsoverafield.Their functional composition f = g o h = g(h) has degree n = rs. Theunivariatefunctionaldecompositionproblemcanbe statedasfollows:givenfofdegreen=rs,determine whethersuchgandhexist,andintheaffirmativecase, compute them [13]. Thepseudocodeoftheunivariatefunctional decompositionalgorithm[14],whichisslightlymodifiedin our method to also calculate indecomposable part of an input polynomial,isshowninFig.2.Forevery randsvaluesfor whichrs=n,UniDecprocedureinFig.2withfandras inputscalculatesaunivariatefunctionaldecompositionforf as (3), where f0 is indecomposable part of f.(x) = g(x) b(x) + 0 = g(b(x)) + 0(3) Asexplainedin[14],f,gandhareinthefollowing forms,f=xrs+ars-1xrs-1+...+a0,h=xs+cs-1xs-1+...+c1x,g= xr+br-1xr-1+...+b0,respectively.Inthisalgorithm,first, coefficientsofh,i.e.,(c1,...,cs-1),arecalculatedfrom coefficientsoffbyh_UniDecprocedure(lines4-11inFig. 2). For this purpose, polynomial qk is defined as follows:qk = xs +cs-1xs-1+ ... + cs-k xs-k, 0 k s. Then q0=xs, qs= qs-1=h, and qk = qk-1+ cs-k xs-k, 1 k s.Accordingto[14],wecancalculatethefirstk+1 coefficientsofhrfromcoefficientsofqkr.Thek+1st coefficient of qkr is the coefficient of xrs-k, this agree with ars-k, i.e.,thek+1st coefficient off, 1 k s-1. Thusiftheearlier coefficientscs-1,,cs-k+1ofhareknown,thencs-kcanbe determined by computing cs-k=urs-k-dk,1 k s - 1, where, dk is the coefficient of xrs-k in qk-1r [14]. Second,fromfandh,coefficientsofg,i.e.,(b0,...,br-1), are calculated by g_UniDec procedure (lines 12-16 in Fig. 2), let A[i,j] be the coefficient ofxis in hj, 0 i, j r. Then b = (b0,,br-1),canbedeterminedbysolvingthefollowing equation: Ab = a, where a=(a0, as,, ars) are the coefficients of f. Then,compositionofhandgiscomputedbyusingthe function subs,whichisafunctionlibraryofMaple [15]and computes the value of g o h. The difference between f and g o h is considered as indecomposable part of f and refereed as f0 (line 15 in Fig. 2). Figure 2: Univariate functional decomposition algorithm Example 3: Let us consider a member of AllSubPolyx in example 2; f = x6 +2x5-x4-x3+2x2-x. Because n = 6, one of the situation that r and s >1 are r =2, s =3. So g and h are in the following forms. h = x3 + c2x2 + c1x,g = x2 + b1x+b0. The steps of the univariate decomposition algorithm are as follows.q00 = 1, q01 = x3, q02 = x6 step 1: ( k = 1) J1 = coc(q02, x6-1) = u, c2 =(uS-d1)2= 1q10 = 1, q11 = x3 + x2, q12 = x6 + 2x5 +x4 step 2: ( k=2 ) J2 = coc(q12, x6-2) = 1, c1 =(u4-d2)2= -1q10 = 1, q11 = x3 + x2 - x,q12 = x6 + 2x5 - x4 - 2x3 + x2 So h is obtained as h= x3 + x2 - x. Then by using the coefficients of f and h, coefficients of g are calculated as follows. A = _1 u uu 1 -2u u 1_, o = _u-11_ , b = _u11_ So g is obtained as g = x2+x. Thegeneralformresultingbyapplyingtheunivariate functionaldecompositionalgorithmtoeachmemberof AllSubPolyxi(1 i J) is shown in (4). fe1,..,ei1,ei+1,..,eu(x) = g(x) b(x) + 0(x), fe1,..,ei1,ei+1,..,eu(x) e AllSubPolyxi, (1 i J) (4) Step 3: In the third step of the first phase of the proposed method,allobtainedrightdecompositionfactorsb(x)ofall members of AllSubPolyxiarestoredin asetnamed b_sctxi as good building blocks. Example4:Letusconsiderexample2again.By computingtheunivariatefunctionaldecompositionofall membersofAllSubPolyxandAllSubPoly,b_sctxfor variable x and b_sct for variable y are obtained as follows. b_sctx= {x3+x2-x, x2+x, x3-x, x2}, b_sct= {y2, y2-5y}.4.2. Common sub-expression extraction (Phase II) The aim of the second phase of the proposed method is to extractcommonsub-expressionsbetweenallgiven multivariatepolynomials,whichisequaltoextractcommon sub-expressionsbetweentheirequivalentunivariate polynomials which are stored in AllSubPolyxi (1 i J). Toextractcommonsub-expressionswemakeuseof univariatefunctionaldecompositionalgorithmunlikeother worksthatutilizealgebraicdivisiontechnique[1][6][7].By considering members of b_sctxi as good building blocks, we trytore-decomposeallmembersofAllSubPolyxibythese buildingblocksandfindcommonsub-expressionsbetween them.Byusingg_UniDecproceduredescribedinsubsection 4.1,eachfe1,..,ei-1,ei+1,..,ed(x) e AllSubPolyxiisassessed whetherapolynomialg'canbecalculatedfromthis polynomial and each member of b_sctxi as shown in (5), fe1,..,ei1,ei+1,..,eu(x) = g'(x) b'(x) +0 (x) b'(x) e b_sctxi(1 i J)(5) whereg(x)isanewrightdecompositionfactorof fe1,..,ei-1,ei+1,..,ed,0 (x)isanewindecomposablepartof fe1,..,ei-1,ei+1,..,ed, and b'(x) is a member of b_sctxi. Please note thateachmemberofb_sctximaybebelongedtodifferent members of AllSubPolyxi.Toreducecostofthecorrespondinghardware implementationofeachpolynomial,wemakeuseof canonical representation of 0 (x) over Z2ni to Z2m, which is explainedinsection2,andthencomparecostofthe canonicalformwiththeoriginalformof0 (x),andselect the lower cost form. Example5:Letusconsidertwomembersof AllSubPolyx in example 2; p1= x6+2x5-x4-x3+2x2-x belonged tof1,andp2=x6-2x4+x2belongedtof2.Twomembersof h_setx in example 4 are h1=x3+x2-x, and h2=x3-x which are a rightdecompositionfactorofp1andp2respectively.By applying the common sub-expression phase to p1 and p2 over Z23, two obtained forms of these polynomials are as follows.Form 1:p1=(x2-x) o h2 + f0 = (x2-x) o (x3-x) +2x5+x4+x2-2x, canonical_form(f0) = 2x2, so p1=(x2-x) o (x3-x) +2x2 over Z23, p2= x2 o h2= x2 o (x3-x). Form 2: p1=(x2+x) o h1=(x2-x) o (x3+x2-x),p2=(x2+2x) o h1 + f0= (x2+2x) o (x3+x2-x) -2x5-x4-2x2+2x.canonical_form(f0)=-3x2,sop2=(x2+2x)o(x3+x2-x)-3x2 over Z23.Therefore result of the common sub-expression extraction phaseisvariousdecompositionsofeachfe1,..,ei-1,ei+1,..,edbased onthedifferentbuildingblocksbelongedtodifferent members of AllSubPolyxi. 4.3. Complete system-level optimization algorithm Thecompleteproposedsystem-leveloptimization algorithm is explained in this subsection. The pseudo-code of the proposed method is shown in Fig. 3. Lines 3-10 describe thefirstphaseoftheproposedmethodinwhicheachinput multivariatepolynomialistransformedtoseveralunivariate polynomialswhicharestoredinAllSubPolyxi (1 i J) (lines3-6).Thenunivariatefunctionaldecomposition algorithm in Fig. 2 is applied to these univariate polynomials andthenallobtainedrightdecompositionfactorsarestored inb_sctxiasgood building blocks (lines7-10).Lines 11-16 describethesecondphaseinwhicheachmemberof AllSubPolyxiisre-decomposedbymembersofb_sctxi usingg_UniDecprocedureinFig.2(lines11-14),and common sub-expressions are determined. The canonical form overZ2ntoZ2miscalculatedforf0'(xi)inordertoreduce cost of the implementation (lines 15-16). Finally, to select the formwiththesmallestnumberofarithmeticoperations, every new generated form of all members of AllSubPolyxi is consideredtoevaluaterelatedhardwareimplementationby computingcost_funcfunction.Thisfunctiondeterminesthe numberofarithmeticoperationssuchasadditionsand multiplicationsneededforimplementationofthegiven polynomials (lines 17-22). Figure 3: System-level optimization algorithm Example6:Letusconsiderthepolynomialsystemin example2.Thispolynomialsystemoriginallyneeds115 multiplicationsand23additions.Byapplyingourproposed method,wegetanimplementationwithonly16 multiplications and 10 additions as shown below.t1= x2,t2= x(t1-1),t3= t1+t2,t4= y2, f1= t42t1(x+2) + y(y+5)t3(t3+1) , f2= (t12+x)(t4(2+t4)-2y)+ yt4t22. The results reported by GAUT for the datapath architectures oftheoriginalandoptimizedpolynomialsassequential circuits, in speed optimization mode, are shown in Table 2. Table 2.GAUTreportfortheoriginalandoptimized polynomials. Cycles Registers Muxes FU Area +- Original152132031121028Optimized812224214356 5. Experimental Results Inordertoshowtheeffectivenessofourproposed optimizationmethod,wehaveemployeddifferent polynomialsextractedfromrealembeddedsystems.Various combinationsofmultivariatecosinewavelet(MVCS)for graphic applications [7], Savitzky-Golay (SG) filters [16] and digitalimagerejectionunit(DIRU)forimageprocessing applications,Quadraticfilters(Quad)forDSPapplications [17],Phase-ShiftKeying(PSK)fordigitalcommunication [18]havebeentakenintoaccountasmulti-output polynomials.Table 3. Comparison of the datapath architectures in the speed optimization mode. DIRU PSK Quad DIRU PSK SG2 DIRU Quad SG2 DIRU MVCSSG2 % Horner Cycles171716160.00 Registers22242020 Muxes384352324336 FU +2234 -1100 111377 Area9371103605613 Technique in [6] Cycles1215171313.42Registers19372124 Muxes272368306304 FU +3334 -1100 62177 Area5301775605613 Technique in [1] Cycles1515111318.32Registers16152026 Muxes243256304288 FU +2244 -1100 55715 Area4394396131277 Our approach Cycles1313111127.39Registers17192022 Muxes320336288272 FU +2223 -1100 4577 Area356439597605 Wehaveimplementedtheproposedmethodalongwith themethodsin[1]and[6]inMaple[15],andthenwehave usedGAUTasahigh-levelsynthesistool[12]to automaticallygeneratedatapatharchitecturesforobtained polynomials. Togeneratedatapatharchitecturesbasedonoptimized polynomials,weconsidered twomodes provided byGAUT; speed optimization andarea optimization. Table 3illustrates theresultsobtainedusingourproposedmethod,Horner form,andmethodsin[1]and[6]forspeedoptimization mode. This table reports area and number of the clock cycles, registers,multiplexers,andfunctionalunits(adder, subtracter,multiplier)intheobtaineddatapatharchitectures. %indicatesthepercentofimprovementinthenumberof required clock cycles in all methods compared to the Horner form.Theresultsinthetableindicatethatinourmethod requiredclockcyclesarereducedbyanaverageof38.11%, 20.10%,and12.24%incomparisonwiththeHornerform, themethodin[6]andthemethodin[1]acrossall benchmarks.Thisreductionindicatesthatourgoalof reducing criticalpath delayhasbeen achieved.Furthermore, area is improved in our method with an average improvement of 25.81% in comparison with other works. Table 4. Comparison of the datapath architectures in the area optimization mode. DIRU PSK Quad DIRU PSK SG2 DIRU Quad SG2 DIRU MVCSSG2 % Horner Cycles728668860.00 Registers13151316 Muxes592656560672 FU +1111 -1100 1111 Area99999191 Technique in [6] Cycles529172718.38 Registers16271722 Muxes6721056800832 FU +1111 -1100 1111 Area99999191 Technique in [1] Cycles3636556537.92Registers12111819 Muxes411432752832 FU +1111 -1100 1111 Area99999191 Our approach Cycles4148495238.68Registers16171818 Muxes624704720752 FU +1111 -1111 1100 Area99999191

In the area optimization mode, only one functional unit is considered for each operation type existed in the design (i.e., all operations with the same type should be bound to a same functional unit). Table 4 illustrates the results obtained using ourproposedmethod,Hornerform,andmethodsin[1]and [6]forareaoptimizationmode.Theresultsreportedinthe tableindicatethatourproposedmethodprovidesfewer requiredclockcyclesincomparisonwiththeHornerform, themethodin[6]andthemethodin[1]withanaverageof 64.73%,49.97%,and0.01%,respectively,acrossall benchmarks.Suchimprovementinthe required clockcycles withafixednumberoffunctionalunitsindicatesthatby usingourmethodthenumberofoperationswouldbefewer thanthoseofotherworks.Inotherwords,ourproposed method can efficiently determine common sub-expressions. 6. CONCLUSION Inthispaperwehaveproposedasystem-level optimizationmethodforthe data pathsimplementedusing a systemofpolynomials.Ourmethodoptimizespolynomials to reduce the complexity of polynomial datapaths in terms of thenumberofarithmeticoperationsoverZ2m.Inthe proposed method first all givenmultivariate polynomials are transformedtoseveralunivariatepolynomials.Then univariate functional decompositions are calculated for them toobtaingoodbuildingblocks.Toextractcommonsub-expressionswemakeuseofunivariatefunctional decompositionalgorithm.WehaveusedGAUThigh-level synthesis tool to generate RTL datapath architectures for the optimizedpolynomialsassequentialcircuits.Experimental results show superiority of our approach in the area and delay savingsincontrastwiththeotherrelatedworks.Asafuture work,wearegoingtoutilizemultivariatefunctional decomposition algorithm to extract better building blocks. References [1] S.Ghandali,B.Alizadeh,Z.NavabiandM.Fujita, "PolynomialDatapathSynthesisandOptimization BasedonVanishingPolynomialoverZ2mand AlgebraicTechniques,"10thACM-IEEEconference onFormalMethodsandModelsforCo-Design (MEMOCODE), 2012, pp.65-74. [2] B.AlizadehandM.Fujita,"ModularDatapath OptimizationandVerificationBasedonModular-HED,"IEEETrans.onComputer-AidedDesignof IntegratedCircuitsandSystems(TCAD),vol.29,pp. 1422-1435, 2010. [3] B.AlizadehandM.Fujita,"Improvedheuristicsfor finiteword-lengthpolynomialdatapathoptimization," ACM-IEEEInternationalConferenceonComputer-AidedDesign-DigestofTechnicalPapers(ICCAD), 2009, pp. 739-744. [4] O.Sarbishei,B.AlizadehandM.Fujita,"Polynomial datapathoptimizationusingpartitioningand compensationheuristics,"DesignAutomation Conference (DAC), 2009, pp. 931-936. [5] B.AlizadehandM.Fujita,"Modular-HED:A CanonicalDecisionDiagramforModularEquivalence VerificationofPolynomialFunctions,"fifthWorkshop onConstraintsinFormalVerification(CFV),2008, pp. 22-40. [6] S.GopalakrishnanandP.Kalla,"algebraictechniques toenhancecommonsub-expressioneliminationfor polynomialsystemsynthesis,"Design,Automation& TestinEurope(DATE)Conference,2009,pp.1452- 1457.[7] A.Hosangadi,F.FallahandR.Kastner,"Optimizing polynomialexpressionsbyalgebraicfactorizationand commonsubexpressionelimination,"IEEETrans.on Computer-AidedDesignofIntegratedCircuitsand Systems(TCAD), vol. 25, pp. 20122022, 2006. [8] A.Hosangadi,F.Fallah,andR.Kastner,"Factoring and eliminating commonsub expressions in polynomial expressions,"inProc.,ACM-IEEEInternational ConferenceonComputer-AidedDesign(ICCAD), 2004, pp. 169-174. [9] Z. CHEN,"On polynomial functions from Zn1Zn2 Znr to Zm," Discrete Math., Vol. 162, No. 13, pp. 6776, 1996. [10]F.Smarandache,"Afunctioninnumbertheory,"AnaleleUniv.Timisoara,Fascicle1,vol.XVII,pp. 7988, 1980. [11]MATLABversion8.2,(2013),(computersoftware), The MathWorks Inc., Natick, Massachusetts. [12]P.Coussy,etal.,"GAUT:AHigh-LevelSynthesis ToolforDSPApplications,"High-LevelSynthesis: FromAlgorithmtoDigitalCircuits,Springer,Berlin, Germany, 2008. [13]J. von zur Gathen, J. Gutierrez, R. Rubio, "Multivariate PolynomialDecomposition,"ApplicableAlgebrain Engineering,CommunicationandComputing,Vol. 14, No. 1 , pp. 11-31, 2003. [14] D.KozenandS.Landau,"Polynomialdecomposition algorithms,"J.SymbolicComputation,Vol.7,No.5, pp. 445456, 1989. [15] Maple, 2013, http://www.maplesoft.com. [16] J.Krumm,"Savitzky-GolayFiltersfor2DImages," http://homepages.inf.ed.ac.uk/rf/CVonline/LOCAL_COPIES/KRUMMI/SavGol.htm.[17] V. J. Mathews and G. L. Sicuranza, "Polynomial Signal processing," Wiley-Interscience, 2000. [18] A.PeymandoustandG.DeMicheli,"Applicationof symboliccomputeralgebrainhigh-leveldata-flow synthesis,"IEEE Trans. on Computer-Aided Design of IntegratedCircuitsandSystems(TCAD),vol.22,pp. 11541165, Sep. 2003.