Optimizing Polynomial Expressions by Algebraic Factorization and Common Subexpression Elimination
Anup Hosangadi (University of California, Santa Barbara) [email protected]
Farzan Fallah (Fujitsu Labs of America, Inc.) [email protected]
Ryan Kastner (University of California, Santa Barbara) [email protected]
Abstract
Polynomial expressions are frequently encountered in many application domains, particularly in signal
processing and computer graphics. Conventional compiler techniques for redundancy elimination such as Common
Subexpression Elimination (CSE) are not suited for manipulating polynomial expressions, and designers often
resort to hand optimizing these expressions. This paper leverages the algebraic techniques originally developed for
multi-level logic synthesis to optimize polynomial expressions by factoring and eliminating common subexpressions.
We tested our algorithm on a set of benchmark polynomial expressions and observed savings of 26.7% in latency and
26.4% in energy consumption when computing these expressions on the StrongARM SA1100 processor core. When these
expressions were synthesized in custom hardware, we observed average energy savings over CSE of 63.4% under
minimum hardware constraints and 24.6% under medium hardware constraints.
1. Introduction
In recent years, the embedded systems market has seen a huge demand for high performance and low power
solutions to products such as cellular phones, network routers and video cameras. This demand has led to the
research and development of various high-level optimization techniques for both software [4-6] and hardware [7-9].
Modern optimizing compilers often produce code that is 20-30%, and sometimes two to three times, faster than that
produced by traditional compilers [10]. A number of optimizations are well known in the compiler design
community [4, 5]: Common Subexpression Elimination, scalar replacement of aggregates, value numbering, and
constant and copy propagation are applied both locally to basic blocks and globally across the whole Control
Flow Graph (CFG). Other significant optimizations that are not directly related to the optimizations presented in this
paper are loop and procedure optimizations, register allocation, and instruction scheduling.
[Footnote 1: Preliminary versions of this paper appeared in the International Conference on Computer Aided Design [1], the International Conference on VLSI Design [2], and the International Workshop on Logic and Synthesis [3].]
With the rapid
advancements in VLSI technology, it has become possible to build low cost Application Specific Integrated Circuits
(ASICs) to be included in embedded systems. The growing complexity of these ASICs has necessitated the
development of High-Level synthesis tools to automate the design of custom datapaths from a behavioral
description. These tools mainly perform controller synthesis, scheduling and resource allocation, in addition to
performing some limited transformations for algebraic restructuring and redundancy elimination.
Unfortunately, most of the work that has been done in these compiler tools has been done for general purpose
applications. These techniques do not do a good job of optimizing special functions such as those consisting of
polynomial expressions, which is the focus of this paper. Polynomial expressions are present in a wide variety of
applications domains since any continuous function can be approximated by a polynomial to the desired degree of
accuracy [11-13]. There are a number of scientific applications that use polynomial interpolation on data from physical
phenomena. These polynomials need to be computed frequently during simulation. Real time computer graphics
applications often use polynomial interpolations for surface and curve generation, texture mapping, etc. [14]. Many
DSP algorithms use trigonometric functions like sine and cosine, which can be approximated by their Taylor
series as long as the resulting inaccuracy can be tolerated [9, 15]. Furthermore, many adaptive signal processing
applications need to use polynomials to model the nonlinearities in high speed communication channels [16]. Most
often, system designers have to depend upon hand optimized library routines for implementing such functions,
which can be both time consuming and error prone. It is important to optimize polynomial expressions well, since
they generally involve many multiplications, which are expensive in resource-constrained embedded systems.
Conventional techniques for redundancy elimination such as Common Subexpression Elimination (CSE) and
Value Numbering are very limited in optimizing polynomials, mainly because they are unable to perform
factorization of these expressions. The Horner form is a popular evaluation scheme for polynomials in many
libraries including the GNU C library [17]. But this form gives good results mainly for single variable polynomial
expressions. This paper presents an optimization technique for polynomials aimed at reducing the number of
operations by algebraic factorization and common subexpression elimination. This technique can be integrated into
the front end of modern optimizing compilers and high level synthesis frameworks, for optimizing programs or
datapaths consisting of a number of polynomial expressions. As a motivating example for demonstrating the benefits
of our technique, consider the polynomial expression shown in Figure 1a, which is a multivariate polynomial used in
Computer Graphics for modeling surfaces and textures [14]. This polynomial originally has 23 multiplications.
Applying the iterative two-term CSE algorithm [4] to this expression gives an implementation with 16
multiplications (Figure 1b). Applying the Horner form gives an implementation with 17 multiplications (Figure 1c).
But applying the algebraic method that we develop in this paper gives an implementation with only 13
multiplications (Figure 1d). Such an optimization is impossible to perform using conventional optimization techniques.
Figure 1. Optimizations on Multivariate Quartic-Spline Polynomial
Algebraic techniques have been successfully applied in multi-level Logic Synthesis for reducing the number of
literals in a set of Boolean expressions [18-21]. These techniques are typically applied to a set of Boolean
expressions consisting of hundreds of variables. In this paper we show that the same techniques can be applied to
reduce the number of operations in a set of polynomial expressions. The rest of the paper is organized as follows.
Section 2 presents some related work on optimizing arithmetic expressions and Boolean expressions. In Section 3,
we present the transformation of polynomial expressions. In Section 4, we present the algorithms for factorization
and common subexpression elimination. Experimental results are presented in Section 5, where we evaluate the
benefits of our optimization for execution on an ARM processor and also for custom datapath synthesis. Finally we
conclude and talk about future work in Section 6.
2. Related Work
There has been a lot of early work on generation of code for arithmetic expressions [22-24]. In [22, 24], the
authors propose algorithms for minimizing the number of program steps and storage references in the evaluation of
arithmetic expressions, given a fixed number of registers. This work was later extended in [23] to include
expressions with common subexpressions. The above works performed optimizations by code generation
techniques, and did not really present an algorithm that would efficiently reduce the number of operations in a set of
[Figure 1]
(a) Unoptimized polynomial:
    P = zu^4 + 4avu^3 + 6bu^2v^2 + 4uv^3w + qv^4

(b) Optimization using CSE:
    d1 = u^2;  d2 = v^2;  d3 = uv
    P = d1^2·z + 4a·d1·d3 + 6b·d1·d2 + 4w·d2·d3 + q·d2^2

(c) Optimization using Horner transform:
    P = zu^4 + (4au^3 + (6bu^2 + (4uw + qv)·v)·v)·v

(d) Optimization using our algebraic technique:
    d1 = u^2;  d2 = 4·u
    P = u(uz + a·d2)·d1 + v^2(u(d2·w + vq) + 6b·d1)
arithmetic expressions. Some work was done to optimize code with arithmetic expressions by factorization of the
expressions [25]. This work developed a canonical form for representing the arithmetic expressions and an
algorithm for finding common subexpressions. The drawback of this algorithm is that it takes advantage of only the
associative and commutative properties of arithmetic operators and therefore can only optimize expressions
consisting of only one type of associative and/or commutative operator at a time. As a result, it cannot perform
factorization of expressions and generate complex common subexpressions consisting of additions and
multiplications.
There has been some work on optimization of arithmetic expressions for critical path minimization [26-28] by
utilizing the commutative, associative and distributive properties of arithmetic operators, and retiming of latches in
the circuit. These methods do not present advanced techniques for common subexpression elimination and
factorization, which is the contribution of this paper. We can explore different high speed implementations by
combining our optimizations with those originally developed for critical path minimization.
Horner form is a popular representation of polynomial expressions and is the default way of evaluating
polynomial approximations of trigonometric functions in many libraries including the GNU C library [17]. The
Horner form of evaluating a polynomial transforms the expression into a sequence of nested additions and
multiplications, which are suitable for sequential machine evaluation using multiply accumulate (MAC) instructions.
For example, a polynomial in the variable x,

    p(x) = a_0·x^n + a_1·x^(n-1) + a_2·x^(n-2) + ... + a_n,

is evaluated as

    p(x) = (...((a_0·x + a_1)·x + a_2)·x + ... + a_(n-1))·x + a_n
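The nested evaluation above is straightforward to express in code. The sketch below is our own illustration, not code from any of the cited libraries; it evaluates a polynomial from a coefficient list (highest degree first), using one multiplication and one addition per step, exactly the multiply-accumulate pattern the text describes:

```python
def horner(coeffs, x):
    """Evaluate a0*x^n + a1*x^(n-1) + ... + an by nested multiply-adds.

    coeffs lists a0..an from the highest-degree coefficient down; the loop
    performs exactly n multiplications and n additions (one MAC per step).
    """
    result = 0
    for a in coeffs:
        result = result * x + a
    return result

def naive(coeffs, x):
    """Term-by-term evaluation, for comparison; uses many more multiplications."""
    n = len(coeffs) - 1
    return sum(a * x ** (n - i) for i, a in enumerate(coeffs))

# horner([2, 0, 1, 5], 3) == naive([2, 0, 1, 5], 3) == 62
```

Both routines compute the same value; the Horner version simply reassociates the arithmetic so each coefficient costs a single MAC.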
The disadvantage of this technique is that it optimizes only a single polynomial expression at a time, and does not look
for common subexpressions among a set of polynomial expressions. Furthermore, it is not good at optimizing
multivariate polynomial expressions of the type commonly found in computer graphics applications [29].
Symbolic algebra has been shown to be a good technique for manipulating polynomials, and it has
applications in both high level synthesis [9] and low power embedded software optimization [15]. The techniques
discussed in these works make use of Gröbner bases in symbolic algebra to decompose a dataflow represented
by a polynomial expression to a set of library polynomials. These techniques have been applied in high level
datapath synthesis for producing minimum-component and minimum-latency implementations. The quality of
results from these techniques depends heavily on the set of library elements. For example, the decomposition of the
polynomial expression (a^2 - b^2) into (a+b)*(a-b) is possible only if there is a library element (a+b) or (a-b). It should
be noted that just having an adder in the library will not suffice for this decomposition. Though this problem can be
addressed by having a large number of library elements, the run time of the algorithm can become excessively large.
This method is useful for arithmetic decomposition of a datapath to library elements, but it is not really useful for
simplifying a set of polynomial expressions by factoring and eliminating common subexpressions.
Power optimization techniques for dataflow architectures have been explored in detail at the behavioral level
[30-32], though not particularly for arithmetic expressions. [31, 32] present good surveys of all the relevant work in
this area, where different techniques such as operation reduction and operation substitution, power aware scheduling
and resource allocation, multiple voltage, multiple frequency scheduling and bus encoding are explored. Reducing
the number of operations is a good technique for reducing dynamic power, since it is directly proportional to the
amount of switched capacitance. In [32], the authors suggest the Horner scheme for reducing the number of
operations. But the Horner scheme has many disadvantages as discussed earlier.
Arithmetic expressions have to be computed during address calculation for Data Transfer Intensive (DTI)
applications. In [33], different word level optimizing transformations for the synthesis of these expressions such as
expression splitting and clustering, induction variable analysis and common subexpression and factoring are
explored. The algebraic techniques that we present in this paper can be used for factoring and eliminating common
subexpressions for general polynomial expressions, and can be used as one of the transformations in the synthesis of
address generators.
Algebraic methods have been successfully applied to the problem of multi-level logic synthesis for minimizing
the number of literals in a set of Boolean expressions [18, 19, 21]. These techniques are typically applied to a set of
Boolean expressions consisting of thousands of literals and hundreds of variables. The optimization is achieved by
decomposition and factorization of the Boolean expressions. Rectangle covering methods have been used to perform
this optimization [20], but due to the high algorithmic complexity, faster heuristics like Fast-Extract (Fx) have also
been developed [21]. There are two main algorithms in the rectangle covering method: Distill and Condense. The
Distill algorithm is preceded by a kernelling algorithm, in which a subset of the algebraic divisors is generated. Using these
divisors, Distill performs multiple cube decomposition and factorization. This is followed by the Condense
algorithm, which performs single cube decomposition. Our optimization technique for polynomial expressions is
based on the Distill and Condense algorithms, which we generalized to handle polynomial expressions of any order.
3. Transformation of polynomial expressions
The goal of this section is to introduce our matrix representation of arithmetic expressions and their
transformations that allow us to perform our optimizations. The transformation is achieved by finding a subset of all
possible subexpressions and writing them in a matrix form. We also point out the key differences between the
algebraic methods for polynomials and those originally developed for multi-level logic synthesis.
3.1 Representation of polynomial expressions
Each polynomial expression is represented by an integer matrix, where there is one row for each product term
(cube), and one column for each variable/constant in the set of expressions. Each element (i,j) in the matrix is a non-negative
integer that represents the exponent of the variable j in the product term i. There is an additional field in each row of
the matrix for the sign (+/-) of the corresponding product term. For example, the expression for the Taylor’s series
approximation for sin(x) (as shown in Figure 4) is represented in the matrix form shown in Figure 2.
Figure 2. Matrix representation of polynomial expressions
3.2 Definitions
We use the following terminology to explain our technique. A literal is a variable or a constant (e.g. a, b, 2,
3.14…). A cube is a product of the variables each raised to a non-negative integer power. In addition, each cube has
a positive or a negative sign associated with it. Examples of cubes are +3a^2b and -2a^3b^2c. An SOP representation of a
polynomial is the sum of the cubes ( +3a^2b + (-2a^3b^2c) + ...). An SOP expression is said to be cube-free if there is
no cube that divides all the cubes of the SOP expression. For a polynomial P and a cube c, the expression P/c is a
kernel if it is cube-free and has at least two terms (cubes). For example, in the expression P = 3a^2b - 2a^3b^2c, the
expression P/(a^2b) = (3 - 2abc) is a kernel. The cube that is used to obtain a kernel is called a co-kernel. In the
above example the cube a^2b is a co-kernel. The literals, cubes, kernels and co-kernels are represented in matrix form
in our technique.
(Figure 2 matrix: one row per term of the sin(x) approximation, one column per literal)

    +/-   x   S3   S5   S7
     +    1    0    0    0
     -    3    1    0    0
     +    5    0    1    0
     -    7    0    0    1
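The integer-matrix form can be sketched directly in code. The encoding below is our own illustration of the representation just described: one row per product term, holding a sign and an exponent vector over the literals [x, S3, S5, S7]:

```python
# One row per product term: (sign, exponent vector over literals [x, S3, S5, S7]).
# This mirrors the matrix form for sin(x) ≈ x - S3*x^3 + S5*x^5 - S7*x^7.
SIN_MATRIX = [
    (+1, (1, 0, 0, 0)),   # +x
    (-1, (3, 1, 0, 0)),   # -S3*x^3
    (+1, (5, 0, 1, 0)),   # +S5*x^5
    (-1, (7, 0, 0, 1)),   # -S7*x^7
]

def evaluate(matrix, values):
    """Evaluate the encoded polynomial at the given literal values."""
    total = 0.0
    for sign, exps in matrix:
        term = float(sign)
        for v, e in zip(values, exps):
            term *= v ** e
        total += term
    return total
```

Evaluating at x = 0.5 with S3 = 1/3!, S5 = 1/5!, S7 = 1/7! reproduces the Taylor approximation of sin(0.5).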
3.3 Algebraic techniques for polynomial expressions and Boolean expressions in multi-level logic
synthesis
The techniques for polynomial expressions that we present in this paper are based on the algebraic methods
originally developed for Boolean expressions in multi-level logic synthesis [18-20]. We have generalized these
methods to handle polynomial expressions of any order and consisting of any number of variables. The changes
made are mainly because of the differences between Boolean and arithmetic operations. These differences are
detailed below:
1. Restriction on Single Cube Containment (SCC): In [19], the Boolean expressions are restricted to be a set of non-
redundant cubes, such that no cube properly contains another. An example of redundant cubes is in the Boolean
expression f = ab + b. Here the cube b can be written as b = ab + a’b. Therefore cube ‘b’ contains the other cube,
‘ab’. For polynomial expressions, this notion of cube containment does not exist, and we can have a polynomial P =
ab + b. Dividing this polynomial by b, P/b = (ab + b)/b = a + 1, which we treat as a valid expression. To handle this,
we treat ‘1’ as a distinct literal in our representation for polynomial expressions.
For simplifying our algorithm, we place a different restriction on the polynomial expressions. We define the
variable cube of a cube (term) of a polynomial expression as the cube obtained after ignoring the constant. For
example, the variable cube of the cube 3a^2b is a^2b. Our restriction is that no two terms of a polynomial expression
can have the same variable cube. If we have an expression having some terms with the same variable cubes, then we
collapse them into a single term by adding all those terms. For example, the expression P = a^2 + ab + ab + b^2 is
converted into the expression P = a^2 + 2ab + b^2. If we allow terms with the same variable cubes in the expressions,
then the kernels of these expressions will have repeated terms and/or multiple terms consisting only of constants,
both of which are undesirable in the construction of the Kernel Cube Matrix (KCM, described in Section 3.5). For
instance, in the above expression P = a^2 + ab + ab + b^2, we have the co-kernel and kernel pairs (a)(a + b + b) and
(b)(a + a + b), both of which have repeated terms. We cannot represent these kernels in the KCM.
This restriction prevents us from factorizing the expression P as (a + b)(a + b) (instead, P is factorized as a(a + 2b)
+ b^2), but it simplifies the optimization algorithm.
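The collapsing step just described amounts to grouping terms by their variable cube and summing coefficients. A minimal sketch, using our own encoding of a term as a coefficient plus an exponent tuple:

```python
from collections import defaultdict

def collapse(terms):
    """Merge terms that share a variable cube by summing their coefficients.

    A term is (coefficient, exponents), where exponents holds one integer
    per variable; terms whose coefficients cancel to zero are dropped.
    """
    acc = defaultdict(int)
    for coeff, exps in terms:
        acc[exps] += coeff
    return sorted((exps, c) for exps, c in acc.items() if c != 0)

# P = a^2 + ab + ab + b^2 over variables (a, b):
P = [(1, (2, 0)), (1, (1, 1)), (1, (1, 1)), (1, (0, 2))]
# collapse(P) -> [((0, 2), 1), ((1, 1), 2), ((2, 0), 1)], i.e. a^2 + 2ab + b^2
```

After collapsing, no two rows of the matrix representation share a variable cube, which is exactly the restriction the algorithm requires.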
2. Algebraic division and the kernel generation algorithm: For Boolean expressions, there is a restriction on the
division procedure for obtaining kernels. During division, the quotient and the divisor are required to have
orthogonal variable supports. For example, if there is a Boolean expression f, and we need to divide f by another
expression g (f/g), then the quotient expression h is defined [19] as “the largest set of cubes in f, such that h ⊥ g (h
is orthogonal to g, that is, the variable supports of h and g are disjoint), and hg ⊆ f.”
For example, if f = a(b + c) + d and g = a, then f/g = h = (b + c). We can see that h ⊥ g and hg = a(b + c) ⊆ f.
For polynomial expressions we remove this restriction for orthogonality, so that we can handle expressions of
any order. This is based on one key difference between the multiplication and the Boolean AND operation. For
example, the multiplication a*a = a^2, but the AND operation a·a = a. Therefore, we can have a polynomial F = a(ab
+ c) + d and a divisor G = a. The division F/G = H = (ab + c). We can see that H and G are not orthogonal since they
share the variable ‘a’. Since the kernel generation algorithm generates kernels by recursive division, we have
modified that algorithm to take this difference into account. This algorithm is detailed in Section 3.4 (Figure 3). In
[19, 34], a central theorem is presented on which the decomposition and factorization of Boolean expressions is
based. This theorem shows how all multiple-cube intersections and algebraic factorizations can be obtained from
the set of kernels. For our modified problem, this theorem still holds and multiple term common subexpressions and
algebraic factorizations can be detected by the set of kernels for the polynomial expressions. This is shown in detail
in Section 3.4.
3. Finding single cube intersections: In the decomposition and factorization methodology of Boolean expressions
presented in [19], multiple term common factors are extracted first and then single cube common factors are
extracted. For polynomial expressions, we follow the same order. We modify the algorithm for finding single cube
intersections, since we have higher order terms. All single cube common subexpressions can be detected by our
method, as described in detail in Section 3.5 and Section 4.2.
3.4 Generation of kernels and co-kernels
The kernels and co-kernels of polynomial expressions are important because of two reasons. Firstly, all possible
minimal algebraic factorizations of an expression can be obtained from the set of kernels and co-kernels of the
expression. An algebraic factorization of an expression P is the factorization P = C*F1 + F2, where C is a cube and
F1 and F2 are subexpressions consisting of sets of terms. This algebraic factorization is called minimal if the only
common cube among the set of terms in F1 is ‘1’.
Secondly, all multiple term (algebraic) common subexpressions can be detected by an intersection among the set
of kernels. These two properties are illustrated by the following two theorems.
Theorem 1: All minimal algebraic factorizations of a polynomial expression can be obtained from the set of kernels
and co-kernels of the expression.
Proof: Consider the minimal algebraic factorization of P = C*F1 + F2. By definition, the subexpression F1 is cube-
free, since the only cube that commonly divides all the terms of F1 is ‘1’. We have to prove that F1 is a part of a
kernel expression, with the corresponding co-kernel C. Let {ti} be the set of original terms of expression P, that
cover the terms in F1. Since F1 is obtained from {ti} by dividing it by C (F1 = {ti}/C), the common cube of the terms
{ti} is C. Let {tj} be the set of original terms of P that also have C as the common cube among them and do not
overlap with any of the terms in {ti}. Now consider the expression K1 = ( {ti} + {tj})/C = {ti}/C + {tj}/C = F1 + {t’j},
where {t’j} = {tj}/C. This expression K1 is cube-free since we know that F1 is cube-free. K1 covers all the terms of
the original expression P that contain cube C. Therefore by definition, K1 is a kernel of expression P with the cube
C as the corresponding co-kernel. Since our kernel generation procedure generates all kernels and co-kernels of the
polynomial expressions, the kernel K1 = F1 + {t’j} will be generated with the corresponding co-kernel C. Since F1 is
a part of this kernel expression, we proved the theorem.
Theorem 2: There is a multiple term common subexpression in the set of polynomial expressions iff there is a
multiple term intersection among the set of kernels belonging to the expressions.
Explanation: By a multiple term intersection, we mean an intersection between kernel expressions yielding two or
more terms (cubes). For example, when we intersect the kernels K1 = a + bc + d and K2 = bc + d + f, there is a
multiple term intersection = (bc + d).
Proof: The If case is easy to prove. Let K1 be the multiple term intersection. It means that there are multiple
instances of the subexpression K1 among the set of expressions. For the Only If case, assume that f is a multiple
term common subexpression among the set of expressions. If the expression f is cube-free, then f is a part of some
kernel expressions as proved in Theorem 1. Each instance of expression f will therefore be a part of some kernel
expression, and an intersection among the set of kernels will detect the common subexpression f. If f is not a cube-
free expression, then let C be the biggest cube that divides all terms of f, and let f ’ = f /C. The expression f ’ is now a
cube-free expression, and reasoning as above, each instance of f ’ will be detected by an intersection among the set
of kernels.
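A multiple term intersection like the (bc + d) example above reduces to a set intersection once kernels are encoded as sets of cubes. A sketch under our own encoding (a cube as a coefficient plus an exponent tuple):

```python
def multiple_term_intersection(k1, k2):
    """Cubes common to two kernels; a common subexpression needs two or more."""
    common = k1 & k2
    return common if len(common) >= 2 else set()

# K1 = a + bc + d and K2 = bc + d + f over variables (a, b, c, d, f);
# each cube is (coefficient, exponent tuple).
a  = (1, (1, 0, 0, 0, 0))
bc = (1, (0, 1, 1, 0, 0))
d  = (1, (0, 0, 0, 1, 0))
f  = (1, (0, 0, 0, 0, 1))
K1, K2 = {a, bc, d}, {bc, d, f}
# multiple_term_intersection(K1, K2) -> {bc, d}, i.e. the subexpression bc + d
```

Since exponent tuples are hashable, kernels can be stored as Python sets and intersected directly; a single shared cube is discarded, because a one-term overlap is not a multiple term common subexpression.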
The algorithm for extracting all kernels and co-kernels of a set of polynomial expressions is shown in Figure 3.
This algorithm is analogous to kernel generation algorithms in [19], and has been generalized to handle polynomial
expressions. The main difference is the way division is performed in our technique, where the values of the elements
in the divisor are subtracted from the suitable set of rows in the dividend. Another difference is that, in the
recursive algorithm, we continue to divide by the current variable until it no longer appears in at least two rows of the
matrix form of the expression.
Figure 3. Algorithms for kernel and co-kernel extraction
The main algorithm Kernels is recursive and is called for each expression in the set of polynomial expressions.
The arguments to this function are the literal index i, and the cube d, which is the co-kernel extracted in the previous
call to the algorithm. The variables in the set of expressions are ordered randomly (the ordering is inconsequential),
FindKernels({Pi}, {Li})
{
    {Pi} = set of polynomial expressions; {Li} = set of literals;
    {Di} = set of kernels and co-kernels = φ;
    for each expression Pi in {Pi}
        {Di} = {Di} ∪ Kernels(0, Pi, φ) ∪ {Pi, 1};
    return {Di};
}

Kernels(i, P, d)
{
    i = literal number; P = expression in SOP; d = cube;
    D = set of divisors = φ;
    for (j = i; j < |L|; j++)
    {
        if (Lj appears in more than 1 row)
        {
            Ft = Divide(P, Lj);
            C = largest cube dividing each cube of Ft;
            if (Lk ∉ C for all k < j)
            {
                F1 = Divide(Ft, C);      // kernel
                D1 = Merge(d, C, Lj);    // co-kernel
                D = D ∪ D1 ∪ F1;
                D = D ∪ Kernels(j, F1, D1);
            }
        }
    }
    return D;
}

Divide(P, d)
{
    P = expression; d = cube;
    Q = set of rows of P that contain cube d;
    for each row Ri of Q
        for each column j in row Ri
            Ri[j] = Ri[j] - d[j];
    return Q;
}

Merge(C1, C2, C3)
{
    C1, C2, C3 = cubes; M = cube;
    for (i = 0; i < number of literals; i++)
        M[i] = C1[i] + C2[i] + C3[i];
    return M;
}
and the variable index represents the position of the variable in this order. Before the first call to the Kernels
algorithm, the variable index is initialized to ‘0’ and the cube ‘d’ is initialized to φ. The recursive nature of the
algorithm extracts kernels and co-kernels within the kernel extracted and returns when there are no kernels present.
The algorithm Divide is used to divide an SOP expression by a cube d. It first collects those rows that contain
cube d (those rows R such that all elements R[i] ≥ d[i]). The cube d is then subtracted from these rows and these
rows form the quotient.
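This division can be expressed directly on the exponent-matrix encoding. The version below is our own illustration, ignoring signs and constant coefficients for brevity:

```python
def divide(rows, d):
    """Divide an SOP expression (rows of exponents) by the cube d.

    A row "contains" d when every exponent is at least d's; those rows,
    with d's exponents subtracted element-wise, form the quotient.
    """
    quotient = []
    for row in rows:
        if all(r >= e for r, e in zip(row, d)):
            quotient.append(tuple(r - e for r, e in zip(row, d)))
    return quotient

# The sin(x) matrix over literals (x, S3, S5, S7), divided by the cube x:
rows = [(1, 0, 0, 0), (3, 1, 0, 0), (5, 0, 1, 0), (7, 0, 0, 1)]
# divide(rows, (1, 0, 0, 0)) -> [(0,0,0,0), (2,1,0,0), (4,0,1,0), (6,0,0,1)]
```

Rows that do not contain the cube simply drop out of the quotient, which matches the containment test R[i] ≥ d[i] described above.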
The biggest cube dividing all the cubes of an SOP expression is the cube C with the greatest literal count (the literal
count of C is Σi C[i]) that is contained in each cube of the expression.
The algorithm Merge is used to find the product of the cubes d, C and Lj and is done by adding up the
corresponding elements of the cubes. In addition to the kernels generated from the recursive algorithm, the original
expression is also added as a kernel, with co-kernel ‘1’. Passing the index i to the Kernels algorithm, checking that
Lk ∉ C for all k < j, and iterating from literal index i to |L| (the total number of literals) in the for loop prevent the
same kernels from being generated again.
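The recursion of Figure 3 can be sketched end to end on the exponent-matrix encoding. This is our own simplified rendering (signs and coefficients omitted, literals indexed 0..n-1), not the authors' implementation:

```python
def contains(row, cube):
    return all(r >= c for r, c in zip(row, cube))

def divide(rows, d):
    return [tuple(r - c for r, c in zip(row, d)) for row in rows if contains(row, d)]

def common_cube(rows):
    """Biggest cube dividing every row: column-wise minimum of exponents."""
    return tuple(min(col) for col in zip(*rows))

def merge(*cubes):
    """Product of cubes: element-wise sum of exponents."""
    return tuple(sum(col) for col in zip(*cubes))

def kernels(i, rows, d, n, out):
    """Collect (co-kernel, kernel) pairs by recursive division, as in Figure 3."""
    for j in range(i, n):
        lj = tuple(1 if k == j else 0 for k in range(n))
        if sum(contains(row, lj) for row in rows) > 1:
            ft = divide(rows, lj)
            c = common_cube(ft)
            if all(c[k] == 0 for k in range(j)):     # avoid regenerating kernels
                f1 = divide(ft, c)                   # kernel
                d1 = merge(d, c, lj)                 # co-kernel
                out.append((d1, f1))
                kernels(j, f1, d1, n, out)
    return out

# sin(x) over literals (x, S3, S5, S7):
rows = [(1, 0, 0, 0), (3, 1, 0, 0), (5, 0, 1, 0), (7, 0, 0, 1)]
found = kernels(0, rows, (0, 0, 0, 0), 4, [])
# Co-kernels found: x, x^3, x^5; the original expression with co-kernel '1'
# is added separately by the top-level FindKernels.
```

Running this on the sin(x) matrix reproduces the co-kernel/kernel pairs worked out in the example below: [x], [x^3], and [x^5], each paired with the corresponding kernel.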
We use the example of the polynomial shown in Figure 4, to explain our technique. The numbers in the
subscripts for each term indicate the term number in the set of expressions. This polynomial is a Taylor series
approximation for sin(x). The literal set is {L} = {x, S3, S5, S7}. Dividing first by x gives Ft = (1 - S3·x^2 + S5·x^4 - S7·x^6).
There is no cube that divides it completely, so we record the first kernel F1 = Ft with co-kernel x. In the next call to the
Kernels algorithm, dividing F1 by x gives Ft = -S3·x + S5·x^3 - S7·x^5.
The biggest cube dividing this completely is C = x. Dividing Ft by C gives our next kernel F1 = (-S3 + S5·x^2 -
S7·x^4), and the co-kernel is x*x*x = x^3. In the next iteration we obtain the kernel (S5 - S7·x^2) with co-kernel x^5. Finally,
we also record the original expression for sin(x) with co-kernel ‘1’. The set of all co-kernels and kernels generated
is: [x](1 - S3·x^2 + S5·x^4 - S7·x^6); [x^3](-S3 + S5·x^2 - S7·x^4); [x^5](S5 - S7·x^2); [1](x - S3·x^3 + S5·x^5 - S7·x^7).
Figure 4. Evaluation of sin(x)
sin(x) = x(1) - S3·x^3(2) + S5·x^5(3) - S7·x^7(4)
S3 = 1/3!,  S5 = 1/5!,  S7 = 1/7!
3.5 Constructing kernel cube matrix (KCM)
All kernel intersections and multiple term factors can be identified by arranging the kernels and co-kernels in a
matrix form. There is a row in the matrix for each kernel (co-kernel) generated, and a column for each distinct cube
of a kernel generated. The KCM for our sin(x) expression is shown in Figure 5. The kernel corresponding to the
original expression (with co-kernel ‘1’) is not shown for ease of presentation. An element (i,j) of the KCM is ‘1’, if
the product of the co-kernel in row i and the kernel-cube in column j gives a product term in the original set of
expressions. The number in parenthesis in each element represents the term number that it represents. A rectangle is
a set of rows and set of columns of the KCM such that all the elements are ‘1’. The value of a rectangle is the
weighted sum of the number of operations saved by selecting the common subexpression or factor corresponding to
that rectangle. Selecting the optimal set of common subexpressions and factors is equivalent to finding a maximum
valued covering of the KCM, and is analogous to the minimum weighted rectangular covering problem described in
[20], which is NP-hard. We use a greedy iterative algorithm, described in Section 4.1, in which we pick the best prime
rectangle in each iteration. A prime rectangle is a rectangle that is not covered by any other rectangle, and thus has
more value than any rectangle it covers.
Figure 5. Kernel Cube Matrix (KCM) for the sin(x) example
Given a rectangle with the following parameters, we can calculate its value:
R = number of rows; M(Ri) = number of multiplications in row (co-kernel) i;
C = number of columns; M(Cj) = number of multiplications in column (kernel-cube) j.
Each element (i,j) in the rectangle represents a product term equal to the product of co-kernel i and kernel-cube j,
which has a total of M(Ri) + M(Cj) + 1 multiplications. The total number of multiplications represented by
the whole rectangle is therefore C·Σi M(Ri) + R·Σj M(Cj) + R·C. Each row in the rectangle has C - 1 additions,
for a total of R·(C - 1) additions. By selecting the rectangle, we extract a common factor with
Σj M(Cj) multiplications and C - 1 additions. This common factor is multiplied by each row, which leads to a further
Σi M(Ri) + R multiplications. The value of the rectangle can be described as the weighted sum of the savings in the
(Figure 5 matrix: one row per co-kernel, one column per kernel-cube; an entry 1(t) means the co-kernel/kernel-cube product is term t of the original expression)

           1      -S3x^2   S5x^4   -S7x^6   -S3    S5x^2   -S7x^4   S5     -S7x^2
  1  x     1(1)   1(2)     1(3)    1(4)
  2  x^3                                    1(2)   1(3)    1(4)
  3  x^5                                                            1(3)   1(4)
number of multiplications and additions, and is given in Equation I. The weighting factor ‘m’ can be selected by the
designer depending on the relative cost of a multiplication and an addition on the target architecture. For example,
multiplication takes about five times more cycle time than addition on an ARM processor [35]. For custom hardware
synthesis, multiplication has much higher power consumption compared to addition. In [36], the average energy
consumption of a multiplier is about 40 times that of an adder at 5V.
    Value = m · { (C - 1) · ( R + Σi M(Ri) ) + (R - 1) · Σj M(Cj) } + (R - 1) · (C - 1)      (I)
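The value computation is easy to check with a small helper. The function below is our own sketch of Equation I, taking the per-row and per-column multiplication counts and the weighting factor m:

```python
def rectangle_value(row_mults, col_mults, m=1.0):
    """Value of a KCM rectangle per Equation I.

    row_mults[i] = M(R_i), multiplications in the co-kernel of row i
    col_mults[j] = M(C_j), multiplications in the kernel-cube of column j
    m            = relative cost of a multiplication versus an addition
    """
    R, C = len(row_mults), len(col_mults)
    mult_savings = (C - 1) * (R + sum(row_mults)) + (R - 1) * sum(col_mults)
    add_savings = (R - 1) * (C - 1)
    return m * mult_savings + add_savings

# A hypothetical 2x2 rectangle with co-kernel costs [0, 2] and kernel-cube
# costs [1, 1] saves (1)*(2+2) + (1)*2 = 6 multiplications and 1 addition.
```

The greedy covering loop would call this for every candidate prime rectangle and keep the one with the highest value.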
3.6 Constructing Cube Literal Incidence Matrix (CIM)
The KCM allows the detection of multiple cube common subexpressions and factors. All possible cube
intersections can be identified by arranging the cubes in a matrix where there is a row for each cube and a column
for each literal in the set of expressions. The CIM for our sin(x) expression is shown in Figure 2.
Each cube intersection appears in the CIM as a rectangle. A rectangle in the CIM is a set of rows and columns
such that all the elements are non-zero. This CIM provides a canonical representation of the cubes of the expression,
which enables the detection of all single cube common subexpressions, as is illustrated by the following theorem.
Theorem 3: There is a single cube common subexpression in the set of polynomial expressions iff there is a
rectangle with more than one row in the Cube Intersection Matrix.
Proof: For the If case, assume that there is a rectangle with more than one row in the CIM. The elements of this
rectangle are non-zero by definition, and let C be the biggest common cube which is obtained from the minimum
values in each column of the rectangle. This cube C is the single cube common subexpression. For the Only If case,
assume that there is a common cube D in the set of expressions. Let {vd} be the set of variables that cube D is made
of. There will be non-zero values in the columns corresponding to the variables {vd} for each term that contains
cube D. As a result, a rectangle containing the columns corresponding to the variables {vd} will be formed.
The value of a rectangle is the number of multiplications saved by selecting the single term common
subexpression corresponding to that rectangle. The best set of common cube intersections is obtained by a maximum
valued covering of the CIM. We use a greedy iterative algorithm described in Section 4.2, where we select the best
prime rectangle in each iteration. The common cube C corresponding to the prime rectangle is obtained by finding
the minimum value in each column of the rectangle. The value of a rectangle with R rows can be calculated as
follows.
Let ∑C[i] be the sum of the integer powers in the extracted cube C. This cube saves ∑C[i] – 1 multiplications in
each row of the rectangle. The cube itself needs ∑C[i] – 1 multiplications to compute. Therefore the value of the
rectangle is given by the equation:
Value2 = (R -1)*( ∑C[i] – 1) (II)
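Equation (II) is equally direct to state in code (again a sketch of ours, with the cube represented as a variable-to-exponent map):

```python
def cube_rectangle_value(R, cube):
    """Value of a CIM rectangle per Equation (II).

    R    = number of rows covered by the rectangle
    cube = extracted common cube C as a dict {variable: exponent},
           so sum(cube.values()) is the sum of integer powers, sum C[i]
    """
    return (R - 1) * (sum(cube.values()) - 1)

# x^2 shared by three terms (rows {3,5,6} of the sin(x) CIM) saves
# (3 - 1) * (2 - 1) = 2 multiplications
print(cube_rectangle_value(3, {'x': 2}))  # → 2
```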
4. Factoring and eliminating common subexpressions
In this section we present two algorithms that detect kernel and cube intersections. We first extract kernel
intersections and eliminate all multiple term common subexpressions and factors. We then extract cube intersections
from the modified set of expressions and eliminate all single cube common subexpressions.
4.1 Extracting kernel intersections
The algorithm for finding kernel intersections is shown in Figure 6, and is analogous to the Distill procedure
[19] in logic synthesis. It is a greedy iterative algorithm in which the best prime rectangle is extracted in each
iteration. In the outer loop, kernels and co-kernels are extracted for the set of expressions {Pi} and the KCM is
formed from that. The outer loop exits if there is no favorable rectangle in the KCM. Each iteration in the inner loop
selects the most valuable rectangle if present, based on our value function in Equation I.
Figure 6. The algorithm for finding kernel intersections
FindKernelIntersections({Pi}, {Li}) {
    while (1) {
        D = FindKernels({Pi}, {Li});
        KCM = FormKernelCubeMatrix(D);
        {R} = set of new kernel intersections = φ;
        {V} = set of new variables = φ;
        if (no favorable rectangle) return;
        while (favorable rectangles exist) {
            {R} = {R} ∪ Best Prime Rectangle;
            {V} = {V} ∪ New Literal;
            Update KCM;
        }
        Rewrite {Pi} using {R};
        {Pi} = {Pi} ∪ {R};
        {Li} = {Li} ∪ {V};
    }
}
This rectangle is added to the set of expressions and a new literal is introduced to represent this rectangle. The
kernel cube matrix is then updated by removing those 1’s in the matrix that correspond to the terms covered by the
selected rectangle. For example, for the KCM for the sin(x) expression shown in Figure 5, consider the rectangle
that is formed by the row {2} and the columns {5,6,7}.
This is a prime rectangle having R = 1 row and C = 3 columns. The total number of multiplications in the rows is
∑ M(Ri) = 2, and the total number of multiplications in the columns is ∑ M(Cj) = 6. Using the value function in
Equation I, the value of this rectangle is 6*m (m ≥ 1). This is the most valuable rectangle and is selected. Since this rectangle
covers the terms 2,3 and 4, the elements in the matrix corresponding to these terms are deleted. No more favorable
(valuable) rectangles are extracted in this KCM, and now we have two expressions, after rewriting the original
expression for sin(x).
sin(x) = x(1) + (x3d1)(2)
d1 = (-S3)(3) + (S5x2)(4) – (S7x4)(5)
Extracting kernels and co-kernels as before, we have the KCM shown in Figure 7. Again the original expressions for
sin(x) and d1 are not included in the matrix to simplify representation. From this matrix we can extract 2 favorable
rectangles corresponding to d2 = (S5 – S7x2) and d3 = (1 + x2d1). No more rectangles are extracted in the subsequent
iterations and the set of expressions can now be written as shown in Figure 8.
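The selection step above can be made concrete with a small brute-force search over the Figure 5 KCM using Equation (I). This is our own illustration, not the paper's implementation (the paper selects prime rectangles greedily rather than enumerating all row subsets):

```python
from itertools import combinations

def best_rectangle(kcm, row_mults, col_mults, m=1):
    """Brute-force search for the most valuable rectangle in a 0/1 KCM.

    For each subset of rows, take the maximal set of columns in which all
    selected rows have a 1, and score it with the value function of Eq. (I).
    """
    best = (0, None, None)
    for k in range(1, len(kcm) + 1):
        for rows in combinations(range(len(kcm)), k):
            cols = tuple(j for j in range(len(kcm[0]))
                         if all(kcm[i][j] for i in rows))
            if not cols:
                continue
            R, C = len(rows), len(cols)
            rm = sum(row_mults[i] for i in rows)   # sum of M(Ri)
            cm = sum(col_mults[j] for j in cols)   # sum of M(Cj)
            v = m * ((C - 1) * (R + rm) + (R - 1) * cm) + (R - 1) * (C - 1)
            if v > best[0]:
                best = (v, rows, cols)
    return best

# KCM of Figure 5: rows are co-kernels x, x^3, x^5;
# columns are kernel cubes 1, -S3x2, S5x4, -S7x6, -S3, S5x2, -S7x4, S5, -S7x2
kcm = [
    [1, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
]
row_mults = [0, 2, 4]                    # multiplications in x, x^3, x^5
col_mults = [0, 2, 4, 6, 0, 2, 4, 0, 2]  # multiplications in each kernel cube
print(best_rectangle(kcm, row_mults, col_mults))  # → (6, (1,), (4, 5, 6))
```

With 0-indexed rows and columns, the result (6, (1,), (4, 5, 6)) corresponds to the paper's selection of row {2} and columns {5,6,7} with value 6·m.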
Figure 7. Kernel Cube Matrix (2nd iteration)
Figure 8. Expressions after kernel intersection

4.2 Extracting cube intersections

The Distill algorithm discussed in the previous section could find only multiple (two or more) term common
subexpressions. We need an algorithm that can find single term common subexpressions. We do this optimization
by means of a Cube Literal Incidence Matrix (CIM), which was discussed in Section 3.6. Single term common
subexpressions appear as rectangles in this matrix, where the rectangle entries can have any non-zero integers. The
optimization is performed by iteratively eliminating the rectangle having the most savings. For example, consider
the three terms F1 = a2b3c, F2 = a3b2 and F3 = b2c3. The CIM for these expressions is shown in Figure 9a.
Figure 7 (KCM, 2nd iteration):

            1       x2d1    S5      -S7x2
  1   x    1(1)     1(2)
  2   x2                    1(4)    1(5)

Figure 8 (expressions after kernel intersection):

  sin(x) = (xd3)(1)
  d3 = (1)(2) + (x2d1)(3)
  d2 = (S5)(4) - (x2S7)(5)
  d1 = (x2d2)(6) - (S3)(7)
A rectangle with rows corresponding to F1 and F2 and columns corresponding to the variables a and b yields the
common subexpression d1 = a2b2. Rewriting the CIM, we get the matrix shown in Figure 9b. Another rectangle with
rows corresponding to F1 and F3 and columns corresponding to the variables b and c is obtained, eliminating the
common subexpression d2 = b*c. After rewriting we get the CIM as shown in Figure 9c. No more favorable
rectangles can be found in this matrix, and we can stop here.
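The cube-extraction and rewriting steps of this example can be sketched in Python (a minimal illustration of ours; each term is a variable-to-exponent map):

```python
def extract_cube(cim, rows, cols):
    """Common cube of a CIM rectangle: the minimum exponent in each column (Theorem 3)."""
    return {c: min(cim[r][c] for r in rows) for c in cols}

def collapse(cim, cube):
    """Divide the extracted cube out of every term that contains it (the Collapse step)."""
    for row in cim:
        if all(row.get(v, 0) >= e for v, e in cube.items()):
            for v, e in cube.items():
                row[v] -= e

# CIM of Figure 9a: F1 = a^2 b^3 c, F2 = a^3 b^2, F3 = b^2 c^3
cim = [{'a': 2, 'b': 3, 'c': 1},
       {'a': 3, 'b': 2, 'c': 0},
       {'a': 0, 'b': 2, 'c': 3}]

d1 = extract_cube(cim, rows=[0, 1], cols=['a', 'b'])   # rectangle {F1, F2} x {a, b}
collapse(cim, d1)                                      # F1 -> bc * d1, F2 -> a * d1
d2 = extract_cube(cim, rows=[0, 2], cols=['b', 'c'])   # rectangle {F1, F3} x {b, c}
print(d1)  # → {'a': 2, 'b': 2}
print(d2)  # → {'b': 1, 'c': 1}
```

The two extracted cubes match d1 = a2b2 and d2 = bc from the example above.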
Figure 9. Extracting single term common subexpressions

  (a) F1 = a2b3c, F2 = a3b2, F3 = b2c3

            a   b   c
      F1    2   3   1
      F2    3   2   0
      F3    0   2   3

  (b) After extracting d1 = a2b2: F1 = bcd1, F2 = ad1, F3 = b2c3

            a   b   c   d1
      F1    0   1   1   1
      F2    1   0   0   1
      F3    0   2   3   0
      d1    2   2   0   0

  (c) After extracting d2 = bc: F1 = d1d2, F2 = ad1, F3 = bc2d2

            a   b   c   d1  d2
      F1    0   0   0   1   1
      F2    1   0   0   1   0
      F3    0   1   2   0   1
      d1    2   2   0   0   0
      d2    0   1   1   0   0

Figure 10. The algorithm for finding cube intersections

FindCubeIntersections({Pi}, {Li}) {
    while (1) {
        M = Cube Literal Incidence Matrix;
        {V} = set of new variables = φ;
        {C} = set of new cube intersections = φ;
        if (no favorable rectangle present) return;
        while (favorable rectangles exist) {
            B = Best Prime Rectangle;
            {C} = {C} ∪ cube corresponding to B;
            {V} = {V} ∪ New Literal;
            Collapse(M, B);
        }
        M = M ∪ {C};
        {Li} = {Li} ∪ {V};
    }
}

Collapse(M, B) {
    ∀ rows i of M that contain cube B
        ∀ columns j of B
            M[i][j] = M[i][j] - B[j];
}

Figure 11. Cube Literal Incidence Matrix for the expressions in Figure 8

  Term  +/-   x   d1  d2  d3  1   S3  S5  S7
   1     +    1   0   0   1   0   0   0   0
   2     +    0   0   0   0   1   0   0   0
   3     +    2   1   0   0   0   0   0   0
   4     +    0   0   0   0   0   0   1   0
   5     -    2   0   0   0   0   0   0   1
   6     +    2   0   1   0   0   0   0   0
   7     -    0   0   0   0   0   1   0   0

Figure 12. Final optimization of the sin(x) example

  d4 = x*x
  d2 = S5 - S7*d4
  d1 = d2*d4 - S3
  d3 = d1*d4 + 1
  sin(x) = x*d3

The algorithm for finding cube intersections is run at the end of the Distill algorithm and is shown in Figure 10.
In each iteration of the inner loop, the best prime rectangle is extracted using the value function in Equation (II). The
cube intersection corresponding to this rectangle is extracted from the minimum value in each column of the
rectangle. After each iteration of the inner loop, the CIM is updated by subtracting the extracted cube from all rows
in the CIM in which it is contained (Collapse procedure). Coming back to our sin(x) example, consider the CIM in
Figure 11 for the set of expressions in Figure 8. The prime rectangle consisting of the rows {3,5,6} and the column
{1} is extracted, which corresponds to the common subexpression C = x2. Using Equation (II), we can see that this
rectangle saves two multiplications. This is the most favorable rectangle and is chosen. No more cube intersections
are detected in the subsequent iterations. A literal d4 = x2 is added to the expressions which can now be written as in
Figure 12.
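As a numerical sanity check on the final factorization (our own sketch; S3, S5, S7 are the Taylor coefficients 1/3!, 1/5!, 1/7! of the seventh-order sin(x) approximation):

```python
import math

S3, S5, S7 = 1 / 6, 1 / 120, 1 / 5040  # Taylor coefficients of sin(x)

def sin_optimized(x):
    """Evaluate the optimized expressions of Figure 12: 5 multiplications, 3 additions."""
    d4 = x * x
    d2 = S5 - S7 * d4
    d1 = d2 * d4 - S3
    d3 = d1 * d4 + 1
    return x * d3

x = 0.5
direct = x - S3 * x**3 + S5 * x**5 - S7 * x**7  # the original four-term polynomial
assert abs(sin_optimized(x) - direct) < 1e-12   # same value as the unfactored form
assert abs(sin_optimized(x) - math.sin(x)) < 1e-6
```

Expanding x·(((S5 − S7·x²)·x² − S3)·x² + 1) recovers x − S3x³ + S5x⁵ − S7x⁷ exactly, so the factored form agrees with the original polynomial up to floating-point rounding.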
4.3 Complexity and quality of presented algorithms
The complexity of the kernel generation algorithm (Figure 3) can be defined as the maximum number of kernels that
can be generated from an expression. The total number of kernels generated from an expression depends on the
number of variables, the number of terms, the order of the expressions and also the distribution of the variables in
the original terms. Since kernels are generated by division by cubes (co-kernels), the set of co-kernels generated
have to be distinct. Given the number of variables m, and the maximum integer exponent of any variable in the
expression as n, we can find the worst case number of kernels generated for an expression as the number of different
co-kernels possible. The number of different co-kernels possible is (n + 1)m. For example if we have m = 2
variables, a and b, and the maximum integer power n = 2, then we have (2 + 1)2 = 9 different possible co-kernels
which are 1, b, b2, a, ab, ab2, a2, a2b, a2b2. This worst case happens only when there is a uniform distribution of the
variables in the different terms of the polynomial expression.
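The worst-case count of co-kernels can be checked with a short enumeration (our own sketch; for compactness b² is printed as 'bb'):

```python
from itertools import product

def possible_cokernels(variables, n):
    """Enumerate all (n + 1)^m candidate co-kernel cubes: each of the m variables
    takes an integer exponent between 0 and n; the all-zero exponent vector is
    the trivial cube 1."""
    cubes = []
    for exps in product(range(n + 1), repeat=len(variables)):
        cube = ''.join(v * e for v, e in zip(variables, exps)) or '1'
        cubes.append(cube)
    return cubes

cokernels = possible_cokernels(['a', 'b'], n=2)
print(len(cokernels))  # → 9, i.e. (2 + 1)^2
print(cokernels)       # → ['1', 'b', 'bb', 'a', 'ab', 'abb', 'aa', 'aab', 'aabb']
```

This matches the worked example in the text: for m = 2 variables and maximum power n = 2, there are 9 possible co-kernels 1, b, b², a, ab, ab², a², a²b, a²b².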
The complexity of the kernel extraction procedure depends on the organization of the Kernel Cube Matrix. In the
worst case, the Kernel Cube Matrix (KCM) can have a structure like the one shown for the square matrix in
Figure 13, where all elements except the ones on the diagonal are non-zero.
Figure 13. Worst case complexity for finding kernel intersections
  -   1   1   1   1
  1   -   1   1   1
  1   1   -   1   1
  1   1   1   -   1
  1   1   1   1   -
In such a scenario, the number of rectangles generated is exponential in the number of rows/columns in the matrix.
The number of kernels that are generated is equal to the number of rows in the matrix and the number of distinct
kernel cubes is equal to the number of columns. The algorithm for finding single cube intersections also has a worst
case exponential complexity, when the Cube Intersection Matrix (CIM) is of the form shown in Figure 13. The
number of rows of the CIM is equal to the number of terms and the number of columns is equal to the number of
literals in the set of expressions.
The algorithms that we presented are greedy heuristic algorithms. To the best of our knowledge, there has been
no previous work done for finding an optimal solution to the common subexpression elimination problem. An
optimal solution can be generated by a brute force exploration of the entire search space where all possible
subexpressions in different selection orderings are explored. Since this is impractical for even moderately sized
examples, we do not compare our solution with the optimal one. For our greedy heuristic, the average run time for
the examples that we investigated is only 0.45s.
5. Experimental results

The goal of our experiments was to investigate the usefulness of our technique in reducing the number of
operations in polynomial expressions occurring in some real applications. We found polynomial expressions to be
prevalent in signal processing [37] and in 3-D computer graphics [14]. We optimized the polynomials using
different methods, CSE, Horner form and our algebraic technique. We used Jouletrack simulator [38] to estimate the
reduction in latency and energy consumption for computing the polynomials on the StrongARMTM SA1100
microprocessor. We used the same set of randomly generated inputs for simulating the different programs. Table 1
shows the result of our experiments, where we compare the savings in the number of operations produced by our
method over Common Subexpression Elimination (CSE) and the Horner form. The first two examples are obtained
from source codes from signal processing applications [37] where the polynomials are obtained by approximating
the trigonometric functions by their Taylor series expansions. The next four examples are multivariate polynomials
obtained from 3-D computer graphics [29]. The results show that our optimizations reduce the number of
multiplications, on average by 34% over CSE and by 34.9% over Horner. The number of additions is the same in
all examples. This is because all the savings are produced by efficient factorization and elimination of single term
common subexpressions, which only reduce the number of multiplications. The average run time for our technique
for these examples was only 0.45s.
The results also show the results of simulation on the StrongARMTM SA1100 processor, where we observe an
average latency reduction of 26.7% over CSE and 26.1% over Horner scheme. We also observed an energy
reduction of 26.4% over CSE and 26.3% over the Horner scheme. The energy consumption measured by the simulator
[38] is only for the processor core.
We synthesized the polynomials optimized by CSE, Horner and our algebraic method to observe the
improvements for hardware implementation. We synthesized the polynomials using Synopsys Behavioral Compiler
TM and Synopsys Design Compiler TM using the 1.0 µ power2_sample.db technology library, with a clock period of
40 ns, operating at 5V. We used this library because it was the only one we had that was characterized for power
consumption. We used the Synopsys DesignWare TM library for the functional units. The adder and multiplier in this
library took one clock cycle and two clock cycles respectively at this clock period. We synthesized the designs with
both minimum hardware constraints and medium hardware constraints. We then interfaced the Synopsys Power
Compiler TM with the Verilog RTL simulator VCS TM to capture the total power consumption including switching,
short circuit and leakage power. All the polynomials were simulated with the same set of randomly generated inputs.
Table 2 shows the synthesis results when the polynomials were scheduled with minimum hardware constraints.
Typically the hardware allocated consisted of a single adder, a single multiplier, registers and control units. We
measured the reduction in area, energy and energy-delay product of the polynomials optimized using our technique
over the polynomials optimized using CSE (C) and Horner transform (H). The results show that there is not much
difference in area for the different implementations, but there is a significant reduction in total energy consumption
(average 30% over CSE and 33.4% over Horner), and energy delay product (average 47.5% over CSE and 50.9%
over Horner). Since our technique gave significantly reduced latency, we estimated the savings in energy
consumption for the same latency using voltage scaling.
We obtained the scaled down voltage for the new latency using the library parameters, and estimated the new
energy consumption using the quadratic relationship between the energy consumption and the supply voltage. This
voltage scaling was done by comparing the latencies obtained by our technique with that of CSE and Horner
separately.
The results (Column IV) show an energy reduction of 63.4% over CSE(C) and 63.2% reduction over Horner (H)
with voltage scaling. Table 3a shows the synthesis results for medium hardware constraints. Medium hardware
implementations trade off area for energy efficiency. Though they have larger area, they have a shorter schedule and
lower energy consumption compared to those produced from minimum hardware constraints. We constrained the
Synopsys Behavioral Compiler TM to allocate a maximum of four multipliers for each example. Table 3b shows the
number of adders (A) and multipliers (M) allocated for the different implementations. The scheduler will allocate
fewer than four multipliers if the same latency can be achieved with fewer resources.
Table 1. Comparing the number of additions (A) and multiplications (M) produced by different methods

                                             Unoptimized   CSE       Horner    Ours      Reduction on StrongARM SA1100 (%)
                                                                                         vs CSE            vs Horner
     Application             Function        A     M       A    M    A    M    A    M    Latency  Energy   Latency  Energy
  1  Fast Convolution        FFT             7    56       7    20   7    30   7    10    26.8     26.1     44.0     42.3
  2  Gaussian noise filter   FIR             6    34       6    23   6    20   6    13    30.5     26.4     16.5     16.1
  3  Graphics                quartic-spline  4    23       4    16   4    17   4    13     7.0     10.7     35.5     35.1
  4  Graphics                quintic-spline  5    34       5    22   5    23   5    16    23.4     24.8     23.3     26.2
  5  Graphics                chebyshev       8    32       8    18   8    18   8    11    33.4     31.3     27.5     25.4
  6  Graphics                cosine-wavelet  17   43       17   23   17   20   17   17    39.1     39.0      9.7     12.8
     Average                                 7.8  37       7.8  20.3 7.8  21.3 7.8  13.3  26.7     26.4     26.1     26.3
Table 2. Synthesis results with minimum hardware constraints

       Area (%)       Energy @5V (%)   Energy-Delay @5V (%)   Energy, scaled voltage (%)
       (I)            (II)             (III)                  (IV)
       C      H       C      H         C      H               C      H
  1    18.6   6.5     33.5   69.9      58.4   90.8            88.9   99.0
  2     7.5   0.1     13.6   25.6      20.4   39.4            24.6   49.5
  3     0.3  -4.2     21.6   29.3      39.0   48.8            52.2   64.6
  4    -7.5 -24.2     29.4   10.4      47.6   25.9            62.2   36.9
  5     5.6   2.5     37.0   28.7      57.1   46.1            74.3   59.8
  6     3.7   2.0     44.8   36.8      62.8   54.8            78.3   69.7
  Avg   4.7  -2.8     30.0   33.4      47.5   50.9            63.4   63.2
The results show a significant reduction in total energy consumption (average 24.6% over CSE(C) and 38.7%
over Horner (H) scheme), as well as energy delay product (average 26.7% over CSE and 53.5 % over Horner). We
obtain a reduction in energy-delay product for every example except the first one, where the energy-delay product
for the expression synthesized using CSE is 12.7% less than that produced by our method. This is because the
latency of the CSE-optimized expression, when four multipliers are used, is much shorter than that of the one
optimized using our algebraic method (which uses two multipliers). The delay for the Horner scheme is much worse
than both CSE and our technique, because the nested additions and multiplications of the Horner scheme typically
result in a longer critical path.
Table 3a. Synthesis results with medium hardware constraints
Table 3b. Multipliers (M) and adders (A) allotted with medium hardware constraints
6. Conclusions
In this paper we presented an algebraic method for reducing the number of operations in a set of polynomial
expressions of any degree and consisting of any number of variables, by performing algebraic factorization and
elimination of common subexpressions. The presented technique is based on the algebraic methods originally
developed for multi-level logic synthesis. Our technique is superior to the conventional optimization techniques for
polynomial expressions. Experimental results have shown significant reduction in latency and energy consumption.
By integrating these optimizations into conventional compiler and high level synthesis frameworks, we believe that
we can observe substantial performance and power consumption improvements for the whole application.
In the future we would like to work on extending our technique by optimizing latency and performing scheduling
so that we can work with different architectural constraints. Also we would like to study the impact of our
techniques on the stability of computing the polynomial expressions.
Table 3a:

       Area (%)     Energy @5V (%)   Energy-Delay @5V (%)
       C     H      C      H         C      H
  1    44.0  48.0    9.8   63.9     -12.7   81.0
  2    30.5   3.9   16.1   39.2       9.7   44.1
  3    14.8   1.0    9.7   29.6      20.3   58.7
  4     8.3   3.7   42.5   29.1      44.9   37.0
  5     8.9   9.0   28.2   29.5      39.5   40.6
  6     8.0   6.6   41.4   40.8      58.4   59.7
  Avg  19.0  12.0   24.6   38.7      26.7   53.5

Table 3b:

       CSE      Horner   Our Technique
       M   A    M   A    M   A
  1    4   2    4   1    2   1
  2    3   1    2   1    2   1
  3    4   1    4   1    4   1
  4    3   1    3   2    3   2
  5    4   1    4   2    4   2
  6    3   1    3   1    3   2
REFERENCES

[1] A. Hosangadi, F. Fallah, and R. Kastner, "Factoring and Eliminating Common Subexpressions in Polynomial Expressions," presented at the International Conference on Computer Aided Design (ICCAD), San Jose, CA, 2004.
[2] A. Hosangadi, F. Fallah, and R. Kastner, "Energy Efficient Hardware Synthesis of Polynomial Expressions," presented at the International Conference on VLSI Design, Kolkata, India, 2005.
[3] A. Hosangadi, F. Fallah, and R. Kastner, "Optimizing Polynomial Expressions by Factoring and Eliminating Common Subexpressions," presented at the International Workshop on Logic and Synthesis, Temecula, CA, 2004.
[4] S. S. Muchnick, Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, 1997.
[5] R. Allen and K. Kennedy, Optimizing Compilers for Modern Architectures. Morgan Kaufmann Publishers, 2002.
[6] M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and W. Ye, "Influence of compiler optimizations on system power," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, pp. 801-804, 2001.
[7] G. De Micheli, Synthesis and Optimization of Digital Circuits. McGraw-Hill, Inc., 1994.
[8] H. De Man, J. Rabaey, J. Vanhoof, G. Goossens, P. Six, and L. Claesen, "CATHEDRAL-II: a computer-aided synthesis system for digital signal processing VLSI systems," Computer-Aided Engineering Journal, vol. 5, pp. 55-66, 1988.
[9] A. Peymandoust and G. De Micheli, "Using Symbolic Algebra in Algorithmic Level DSP Synthesis," presented at the Design Automation Conference, 2001.
[10] "Windriver advanced compiler optimization techniques. A Technical White Paper."
[11] D. E. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms, 2nd ed. Addison-Wesley, 1981.
[12] C. Fike, Computer Evaluation of Mathematical Functions. Englewood Cliffs, NJ: Prentice-Hall, 1968.
[13] J. Villalba, G. Bandera, M. A. Gonzalez, J. Hormigo, and E. L. Zapata, "Polynomial evaluation on multimedia processors," presented at the IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2002.
[14] R. H. Bartels, J. C. Beatty, and B. A. Barsky, An Introduction to Splines for Use in Computer Graphics and Geometric Modeling. Morgan Kaufmann Publishers, Inc., 1987.
[15] A. Peymandoust, T. Simunic, and G. De Micheli, "Low Power Embedded Software Optimization Using Symbolic Algebra," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 2002.
[16] V. J. Mathews, "Adaptive polynomial filters," IEEE Signal Processing Magazine, vol. 8, pp. 10-26, 1991.
[17] "GNU C Library."
[18] A. S. Vincentelli, A. Wang, R. K. Brayton, and R. Rudell, "MIS: Multiple Level Logic Optimization System," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 1987.
[19] R. K. Brayton and C. T. McMullen, "The Decomposition and Factorization of Boolean Expressions," presented at the International Symposium on Circuits and Systems, 1982.
[20] R. K. Brayton, R. Rudell, A. S. Vincentelli, and A. Wang, "Multi-level Logic Optimization and the Rectangular Covering Problem," presented at the International Conference on Computer Aided Design, 1987.
[21] J. Rajski and J. Vasudevamurthy, "The testability-preserving concurrent decomposition and factorization of Boolean expressions," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 11, pp. 778-793, 1992.
[22] A. V. Aho and S. C. Johnson, "Optimal Code Generation for Expression Trees," Journal of the ACM, vol. 23, pp. 488-501, 1976.
[23] A. V. Aho, S. C. Johnson, and J. D. Ullman, "Code Generation for Expressions with Common Subexpressions," Journal of the ACM, vol. 24, pp. 146-160, 1977.
[24] R. Sethi and J. D. Ullman, "The Generation of Optimal Code for Arithmetic Expressions," Journal of the ACM, vol. 17, pp. 715-728, 1970.
[25] M. A. Breuer, "Generation of Optimal Code for Expressions via Factorization," Communications of the ACM, vol. 12, pp. 333-340, 1969.
[26] A. Nicolau and R. Potasman, "Incremental tree height reduction for high level synthesis," presented at the 28th ACM/IEEE Design Automation Conference, 1991.
[27] B. Landwehr and P. Marwedel, "A new optimization technique for improving resource exploitation and critical path minimization," presented at the Tenth International Symposium on System Synthesis, 1997.
[28] M. Potkonjak, S. Dey, Z. Iqbal, and A. C. Parker, "High performance embedded system optimization using algebraic and generalized retiming techniques," presented at the IEEE International Conference on Computer Design (ICCD), 1993.
[29] G. Nurnberger, J. W. Schmidt, and G. Walz, Multivariate Approximation and Splines. Springer Verlag, 1997.
[30] J.-M. Chang and M. Pedram, "Energy minimization using multiple supply voltages," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 5, pp. 436-443, 1997.
[31] E. Macii, M. Pedram, and F. Somenzi, "High-level power modeling, estimation, and optimization," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, pp. 1061-1079, 1998.
[32] A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W. Brodersen, "Optimizing power using transformations," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, pp. 12-31, 1995.
[33] M. A. Miranda, F. V. M. Catthoor, M. Janssen, and H. J. De Man, "High-level address optimization and synthesis techniques for data-transfer-intensive applications," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, pp. 677-686, 1998.
[34] R. Rudell, "Logic Synthesis for VLSI Design," PhD thesis, University of California, Berkeley, 1989.
[35] "ARM7TDMI Technical Reference Manual," 2003.
[36] V. Krishna, N. Ranganathan, and N. Vijayakrishnan, "An Energy Efficient Scheduling Scheme for Signal Processing Applications," presented at the International Conference on VLSI Design, 1999.
[37] P. M. Embree, C Algorithms for Real-Time DSP. Prentice Hall, 1995.
[38] A. Sinha and A. P. Chandrakasan, "JouleTrack: a Web based tool for software energy profiling," presented at the Design Automation Conference, 2001.
Response to Reviewer #1's comments

We thank the reviewer for reading our paper and providing valuable comments. We have tried our best to
address the problems in the paper pointed out by the reviewer.
The reviewer pointed out that the differences between the algebraic techniques for Boolean expressions and for
arithmetic expressions need to be clearly pointed out. The reviewer mentions that there are two restrictions used in
the model of Boolean expressions in Brayton’s 1982 paper (The Decomposition and Factorization of Boolean
Expressions). Though these restrictions are not explicitly mentioned in the paper, we believe that the reviewer was
referring to the restriction on Single Cube Containment (SCC) and the restriction on the quotient and divisor
during algebraic division. We modified the former restriction and removed the latter one. The revised version of the
paper contains a section (Section 3.3) that details these differences. The central theorem (in Brayton’s paper) still
holds and has been reformulated, and explained (proved) using three different theorems in Sections 3.4 and 3.6.
The reviewer also mentions that we need to talk about quality of the generated results. In the revised paper, we
have a Section 4.3 that talks about the algorithm complexity and quality. There is no known ‘optimal’ solution to the
common subexpression elimination problem, except for a brute force exploration of all possible subexpressions in
different possible sequences of selections. Since this is not practical, we do not compare our results with the
‘optimal’ one.
The reviewer also points out that we need to compare our work with the work done by mathematicians on the
Symbolic handling of polynomials. We are not aware of any work for the problem of factoring and eliminating
common subexpressions. There has been some work in High Level Synthesis using Symbolic Algebra methods
(references [6] and [15]), but they attack a different problem, that of mapping polynomials to library polynomials.
We talk about their work in the Related Work section (Page 4, last paragraph).
Response to Reviewer #2’s comments
We would like to thank the reviewer for reading our paper, providing valuable comments and suggesting
changes. We have made all the changes suggested by the reviewer.
The reviewer has some comments relating to the clarity of the paper. Section 3.3 (now Section 3.4), which
explains the kernel generation algorithm, has been rewritten. In the experimental results section, one of the examples
shows a negative improvement (-12.7%) in the energy-delay product. This was not explained in the earlier version
of the paper, but has now been explained in the revised version. The reason for the negative improvement for this
example is that although the number of operations is reduced using our methods, the length of the critical path is
increased; using more multipliers, the polynomial optimized with the CSE method has a much shorter delay, which
makes the energy-delay product smaller for that example.
The other typographical and stylistic changes suggested by the reviewer have been incorporated in the revised
version.
Response to Reviewer #3’s comments
We thank the reviewer for reading our paper and providing detailed comments and suggestions. We have
addressed each of the concerns individually. The reviewer gives a number of comments, which are addressed
below.
0) Including the reference to our paper in International Workshop on Logic and Synthesis
This has now been included.
1) Referencing other works on Algebraic transformations
In the revised paper, we have made references (in the Related Work section) to the work done on custom address
generation by M. A. Miranda et al. This paper describes a number of algebraic transformations for the synthesis of
custom address generators, but they do not present any specific techniques for algebraic factorization and common
subexpression elimination, which is the focus of our paper. We have also referenced the work done by B. Landwehr
et al. on critical path minimization using algebraic transformations. In this paper, the authors present a genetic
algorithm based framework, where all possible algebraic formulations are considered for the evaluation of a
Dataflow graph (DFG). But this paper does not really present any method for common subexpression elimination. In
our method, we implicitly explore the commutative, associative and distributive properties of arithmetic operations.
We do not make use of the properties of constant multiplications such as 2*X = X<<1, which is explored in their
paper. This kind of strength reduction can be performed as a post processing step after our algorithm.
2) Central theorem is not clear and not explained
In the revised version of the paper, we have reformulated the original theorem in Brayton’s paper into three
different theorems, each of which has been explained and proved. They are now in Sections 3.4 and 3.6. The
meaning of multiple term(cube) intersection has also been explained. The reviewer mentioned that there does not
seem to be a relation between co-kernel of an expression and common subexpression. There is actually no relation
between the two. A co-kernel and kernel pair represents a possible factorization opportunity. A kernel on the other
hand is a potential common subexpression.
3) Complexity of the presented algorithms is not mentioned
In the revised paper, we have a new Section 4.3 that talks about the complexity and the quality of the
presented algorithms. In the worst case, the kernel generation algorithm is exponential in the number of variables,
and the algorithm for finding kernel intersections is exponential in the number of kernels that are generated. This
worst case happens only when there is a uniform distribution of the variables in the terms of the expression, which
we have not found in practical polynomial expressions. The complexity of finding single cube intersections is also
exponential in the number of variables.
4) More details for the experimental workflow are needed
We did not perform any global optimizations. These optimizations were done only on polynomials in local
basic blocks.
5) Do the techniques optimize only a single expression?
Our technique optimizes multiple polynomial expressions, similar to the way multiple Boolean expressions
are optimized in Logic synthesis.
6) More detailed experiments to measure clock cycle improvement is needed
We used the ARM simulator Jouletrack for measuring the reduction in latency and energy consumption for
computing the polynomials. We simulated the programs with the same set of randomly generated inputs. The results
are included in the revised version of the paper.
7) Why does the filter have so many multiplications?
We approximated the trigonometric functions using Taylor series expansion to the seventh order for the first
two examples. We did this to use our algorithms on large polynomial expressions. In practice, the values of the
trigonometric functions can be obtained from a lookup table. The remaining four examples are polynomials used in
computer graphics, where we have not used any Taylor series approximations.
8) Correction in the result tables
The tables have now been corrected.