Optimizing Polynomial Expressions by Algebraic Factorization and Common Subexpression Elimination
Anup Hosangadi (University of California, Santa Barbara) [email protected]
Farzan Fallah (Fujitsu Labs of America, Inc.) [email protected]
Ryan Kastner (University of California, Santa Barbara) [email protected]
Abstract
Polynomial expressions are frequently encountered in many application domains, particularly in signal
processing and computer graphics. Conventional compiler techniques for redundancy elimination such as Common
Subexpression Elimination (CSE) are not suited for manipulating polynomial expressions, and designers often
resort to hand optimizing these expressions. This paper leverages the algebraic techniques originally developed for
multi-level logic synthesis to optimize polynomial expressions by factoring and eliminating common subexpressions.
We tested our algorithm on a set of benchmark polynomial expressions and observed savings of 26.7% in latency and
26.4% in energy consumption when computing these expressions on the StrongARM SA1100 processor core. When these
expressions were synthesized in custom hardware, we observed average energy savings over CSE of 63.4% under
minimum hardware constraints and 24.6% under medium hardware constraints.
1. Introduction
In recent years, the embedded systems market has seen a huge demand for high performance and low power
solutions to products such as cellular phones, network routers and video cameras. This demand has led to the
research and development of various high-level optimization techniques for both software [4-6] and hardware [7-9].
Modern optimizing compilers often produce code that is 20-30%, and sometimes two to three times, faster than that
produced by traditional compilers [10]. A number of optimizations are well known in the compiler design
community [4, 5]: Common Subexpression Elimination, scalar replacement of aggregates, value numbering, and
constant and copy propagation are applied both locally to basic blocks and globally across the whole Control
Flow Graph (CFG). Other significant optimizations that are not directly related to the optimizations presented in this
paper are loop and procedure optimizations, register allocation, and instruction scheduling.
[Footnote 1: Preliminary versions of this paper appeared in the International Conference on Computer Aided Design [1], the International Conference on VLSI Design [2], and the International Workshop on Logic and Synthesis [3].]
With the rapid
advancements in VLSI technology, it has become possible to build low cost Application Specific Integrated Circuits
(ASICs) to be included in embedded systems. The growing complexity of these ASICs has necessitated the
development of High-Level synthesis tools to automate the design of custom datapaths from a behavioral
description. These tools mainly perform controller synthesis, scheduling and resource allocation, in addition to
performing some limited transformations for algebraic restructuring and redundancy elimination.
Unfortunately, most of the work that has been done in these compiler tools has been done for general purpose
applications. These techniques do not do a good job of optimizing special functions such as those consisting of
polynomial expressions, which is the focus of this paper. Polynomial expressions are present in a wide variety of
applications domains since any continuous function can be approximated by a polynomial to the desired degree of
accuracy [11-13]. There are a number of scientific applications that use polynomial interpolation on data from physical
phenomena. These polynomials need to be computed frequently during simulation. Real time computer graphics
applications often use polynomial interpolations for surface and curve generation, texture mapping, etc. [14]. Many
DSP algorithms use trigonometric functions like sine and cosine, which can be approximated by their Taylor
series as long as the resulting inaccuracy can be tolerated [9, 15]. Furthermore, many adaptive signal processing
applications need to use polynomials to model the nonlinearities in high speed communication channels [16]. Most
often, system designers have to depend upon hand optimized library routines for implementing such functions,
which can be both time consuming and error prone. It is important to optimize polynomial expressions well, since
they generally involve many multiplications, which are expensive in resource-constrained embedded systems.
Conventional techniques for redundancy elimination such as Common Subexpression Elimination (CSE) and
Value Numbering are very limited in optimizing polynomials, mainly because they are unable to perform
factorization of these expressions. The Horner form is a popular evaluation scheme for polynomials in many
libraries including the GNU C library [17]. But this form gives good results mainly for single variable polynomial
expressions. This paper presents an optimization technique for polynomials aimed at reducing the number of
operations by algebraic factorization and common subexpression elimination. This technique can be integrated into
the front end of modern optimizing compilers and high level synthesis frameworks, for optimizing programs or
datapaths consisting of a number of polynomial expressions. As a motivating example for demonstrating the benefits
of our technique, consider the polynomial expression shown in Figure 1a, which is a multivariate polynomial used in
Computer Graphics for modeling surfaces and textures [14]. This polynomial originally has 23 multiplications.
Applying the iterative two-term CSE algorithm [4] to this expression gives an implementation with 16
multiplications (Figure 1b). Applying the Horner form gives an implementation with 17 multiplications (Figure 1c).
But applying the algebraic method that we develop in this paper gives an implementation with only 13
multiplications (Figure 1d). Such an optimization is impossible to perform using conventional optimization techniques.
Figure 1. Optimizations on Multivariate Quartic-Spline Polynomial
Algebraic techniques have been successfully applied in multi-level Logic Synthesis for reducing the number of
literals in a set of Boolean expressions [18-21]. These techniques are typically applied to a set of Boolean
expressions consisting of hundreds of variables. In this paper we show that the same techniques can be applied to
reduce the number of operations in a set of polynomial expressions. The rest of the paper is organized as follows.
Section 2 presents some related work on optimizing arithmetic expressions and Boolean expressions. In Section 3,
we present the transformation of polynomial expressions. In Section 4, we present the algorithms for factorization
and common subexpression elimination. Experimental results are presented in Section 5, where we evaluate the
benefits of our optimization for execution on an ARM processor and also for custom datapath synthesis. Finally we
conclude and talk about future work in Section 6.
2. Related Work
There has been a lot of early work on generation of code for arithmetic expressions [22-24]. In [22, 24], the
authors propose algorithms for minimizing the number of program steps and storage references in the evaluation of
arithmetic expressions, given a fixed number of registers. This work was later extended in [23] to include
expressions with common subexpressions. The above works performed optimizations by code generation
techniques, and did not really present an algorithm that would efficiently reduce the number of operations in a set of
[Figure 1]
(a) Unoptimized polynomial:
    P = zu^4 + 4avu^3 + 6bu^2v^2 + 4uv^3w + qv^4

(b) Optimization using CSE:
    d1 = u^2;  d2 = v^2;  d3 = uv
    P = d1^2·z + 4a·d1·d3 + 6b·d1·d2 + 4w·d2·d3 + q·d2^2

(c) Optimization using Horner transform:
    P = zu^4 + (4au^3 + (6bu^2 + (4uw + qv)·v)·v)·v

(d) Optimization using our algebraic technique:
    d1 = u^2;  d2 = 4·u
    P = u(uz + a·d2)·d1 + v^2(u(d2·w + vq) + 6b·d1)
arithmetic expressions. Some work was done to optimize code with arithmetic expressions by factorization of the
expressions [25]. This work developed a canonical form for representing the arithmetic expressions and an
algorithm for finding common subexpressions. The drawback of this algorithm is that it takes advantage of only the
associative and commutative properties of arithmetic operators and therefore can only optimize expressions
consisting of only one type of associative and/or commutative operator at a time. As a result, it cannot perform
factorization of expressions and generate complex common subexpressions consisting of additions and
multiplications.
There has been some work on optimization of arithmetic expressions for critical path minimization [26-28] by
utilizing the commutative, associative and distributive properties of arithmetic operators, and retiming of latches in
the circuit. These methods do not present advanced techniques for common subexpression elimination and
factorization, which is the contribution of this paper. We can explore different high speed implementations by
combining our optimizations with those originally developed for critical path minimization.
Horner form is a popular representation of polynomial expressions and is the default way of evaluating
polynomial approximations of trigonometric functions in many libraries including the GNU C library [17]. The
Horner form of evaluating a polynomial transforms the expression into a sequence of nested additions and
multiplications, which are suitable for sequential machine evaluation using multiply accumulate (MAC) instructions.
For example, a polynomial in the variable x,

    p(x) = a_0·x^n + a_1·x^(n-1) + a_2·x^(n-2) + ... + a_n,

is evaluated as

    p(x) = (...((a_0·x + a_1)·x + a_2)·x + ... + a_(n-1))·x + a_n
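The nested evaluation above is straightforward to express in code. The sketch below is our own illustration, not code from any of the cited libraries; it evaluates a polynomial from a coefficient list (highest degree first), using one multiplication and one addition per step, exactly the multiply-accumulate pattern the text describes:

```python
def horner(coeffs, x):
    """Evaluate a0*x^n + a1*x^(n-1) + ... + an by nested multiply-adds.

    coeffs lists a0..an from the highest-degree coefficient down; the loop
    performs exactly n multiplications and n additions (one MAC per step).
    """
    result = 0
    for a in coeffs:
        result = result * x + a
    return result

def naive(coeffs, x):
    """Term-by-term evaluation, for comparison; uses many more multiplications."""
    n = len(coeffs) - 1
    return sum(a * x ** (n - i) for i, a in enumerate(coeffs))

# horner([2, 0, 1, 5], 3) == naive([2, 0, 1, 5], 3) == 62
```

Both routines compute the same value; the Horner version simply reassociates the arithmetic so each coefficient costs a single MAC.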
The disadvantage of this technique is that it optimizes only a single polynomial expression at a time, and does not look
for common subexpressions among a set of polynomial expressions. Furthermore, it is not good at optimizing
multivariate polynomial expressions of the type commonly found in computer graphics applications [29].
Symbolic algebra has been shown to be a good technique for manipulating polynomials, and it has
applications in both high level synthesis [9] and low power embedded software optimization [15]. The techniques
discussed in these works make use of Gröbner bases in symbolic algebra to decompose a dataflow represented
by a polynomial expression to a set of library polynomials. These techniques have been applied in high level
datapath synthesis for producing minimum-component and minimum-latency implementations. The quality of
results from these techniques depends heavily on the set of library elements. For example, the decomposition of the
polynomial expression (a^2 - b^2) into (a+b)*(a-b) is possible only if there is a library element (a+b) or (a-b). It should
be noted that just having an adder in the library will not suffice for this decomposition. Though this problem can be
addressed by having a large number of library elements, the run time of the algorithm can become excessively large.
This method is useful for arithmetic decomposition of a datapath to library elements, but it is not really useful for
simplifying a set of polynomial expressions by factoring and eliminating common subexpressions.
Power optimization techniques for dataflow architectures have been explored in detail at the behavioral level
[30-32], though not particularly for arithmetic expressions. [31, 32] present good surveys of all the relevant work in
this area, where different techniques such as operation reduction and operation substitution, power aware scheduling
and resource allocation, multiple voltage, multiple frequency scheduling and bus encoding are explored. Reducing
the number of operations is a good technique for reducing dynamic power, since it is directly proportional to the
amount of switched capacitance. In [32], the authors suggest the Horner scheme for reducing the number of
operations. But the Horner scheme has many disadvantages as discussed earlier.
Arithmetic expressions have to be computed during address calculation for Data Transfer Intensive (DTI)
applications. In [33], different word level optimizing transformations for the synthesis of these expressions such as
expression splitting and clustering, induction variable analysis and common subexpression and factoring are
explored. The algebraic techniques that we present in this paper can be used for factoring and eliminating common
subexpressions for general polynomial expressions, and can be used as one of the transformations in the synthesis of
address generators.
Algebraic methods have been successfully applied to the problem of multi-level logic synthesis for minimizing
the number of literals in a set of Boolean expressions [18, 19, 21]. These techniques are typically applied to a set of
Boolean expressions consisting of thousands of literals and hundreds of variables. The optimization is achieved by
decomposition and factorization of the Boolean expressions. Rectangle covering methods have been used to perform
this optimization [20], but due to the high algorithmic complexity, faster heuristics like Fast-Extract (Fx) have also
been developed [21]. There are two main algorithms in the rectangle covering method: Distill and Condense. The
Distill algorithm is preceded by a kernelling algorithm, in which a subset of the algebraic divisors is generated. Using these
divisors, Distill performs multiple cube decomposition and factorization. This is followed by the Condense
algorithm, which performs single cube decomposition. Our optimization technique for polynomial expressions is
based on the Distill and Condense algorithms, which we generalized to handle polynomial expressions of any order.
3. Transformation of polynomial expressions
The goal of this section is to introduce our matrix representation of arithmetic expressions and their
transformations that allow us to perform our optimizations. The transformation is achieved by finding a subset of all
possible subexpressions and writing them in a matrix form. We also point out the key differences between the
algebraic methods for polynomials and those originally developed for multi-level logic synthesis.
3.1 Representation of polynomial expressions
Each polynomial expression is represented by an integer matrix, where there is one row for each product term
(cube), and one column for each variable/constant in the set of expressions. Each element (i,j) in the matrix is a non-negative
integer that represents the exponent of the variable j in the product term i. There is an additional field in each row of
the matrix for the sign (+/-) of the corresponding product term. For example, the expression for the Taylor’s series
approximation for sin(x) (as shown in Figure 4) is represented in the matrix form shown in Figure 2.
Figure 2. Matrix representation of polynomial expressions
3.2 Definitions
We use the following terminology to explain our technique. A literal is a variable or a constant (e.g. a, b, 2,
3.14…). A cube is a product of the variables each raised to a non-negative integer power. In addition, each cube has
a positive or a negative sign associated with it. Examples of cubes are +3a^2b and -2a^3b^2c. An SOP representation of a
polynomial is the sum of the cubes ( +3a^2b + (-2a^3b^2c) + ...). An SOP expression is said to be cube-free if there is
no cube that divides all the cubes of the SOP expression. For a polynomial P and a cube c, the expression P/c is a
kernel if it is cube-free and has at least two terms (cubes). For example, in the expression P = 3a^2b - 2a^3b^2c, the
expression P/(a^2b) = (3 - 2abc) is a kernel. The cube that is used to obtain a kernel is called a co-kernel. In the
above example the cube a^2b is a co-kernel. The literals, cubes, kernels and co-kernels are represented in matrix form
in our technique.
(Figure 2 matrix: one row per term of the sin(x) approximation, one column per literal)

    +/-   x   S3   S5   S7
     +    1    0    0    0
     -    3    1    0    0
     +    5    0    1    0
     -    7    0    0    1
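The integer-matrix form can be sketched directly in code. The encoding below is our own illustration of the representation just described: one row per product term, holding a sign and an exponent vector over the literals [x, S3, S5, S7]:

```python
# One row per product term: (sign, exponent vector over literals [x, S3, S5, S7]).
# This mirrors the matrix form for sin(x) ≈ x - S3*x^3 + S5*x^5 - S7*x^7.
SIN_MATRIX = [
    (+1, (1, 0, 0, 0)),   # +x
    (-1, (3, 1, 0, 0)),   # -S3*x^3
    (+1, (5, 0, 1, 0)),   # +S5*x^5
    (-1, (7, 0, 0, 1)),   # -S7*x^7
]

def evaluate(matrix, values):
    """Evaluate the encoded polynomial at the given literal values."""
    total = 0.0
    for sign, exps in matrix:
        term = float(sign)
        for v, e in zip(values, exps):
            term *= v ** e
        total += term
    return total
```

Evaluating at x = 0.5 with S3 = 1/3!, S5 = 1/5!, S7 = 1/7! reproduces the Taylor approximation of sin(0.5).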
3.3 Algebraic techniques for polynomial expressions and Boolean expressions in multi-level logic
synthesis
The techniques for polynomial expressions that we present in this paper are based on the algebraic methods
originally developed for Boolean expressions in multi-level logic synthesis [18-20]. We have generalized these
methods to handle polynomial expressions of any order and consisting of any number of variables. The changes
made are mainly because of the differences between Boolean and arithmetic operations. These differences are
detailed below:
1. Restriction on Single Cube Containment (SCC): In [19], the Boolean expressions are restricted to be a set of non-
redundant cubes, such that no cube properly contains another. An example of redundant cubes is in the Boolean
expression f = ab + b. Here the cube b can be written as b = ab + a’b. Therefore cube ‘b’ contains the other cube,
‘ab’. For polynomial expressions, this notion of cube containment does not exist, and we can have a polynomial P =
ab + b. Dividing this polynomial by b, P/b = (ab + b)/b = a + 1, which we treat as a valid expression. To handle this,
we treat ‘1’ as a distinct literal in our representation for polynomial expressions.
For simplifying our algorithm, we place a different restriction on the polynomial expressions. We define the
variable cube of a cube (term) of a polynomial expression as the cube obtained after ignoring the constant. For
example, the variable cube of the cube 3a^2b is a^2b. Our restriction is that no two terms of a polynomial expression
can have the same variable cube. If we have an expression having some terms with the same variable cubes, then we
collapse them into a single term by adding all those terms. For example, the expression P = a^2 + ab + ab + b^2 is
converted into the expression P = a^2 + 2ab + b^2. If we allow terms with the same variable cubes in the expressions,
then the kernels of these expressions will have repeated terms and/or multiple terms consisting only of constants,
both of which are undesirable in the construction of the Kernel Cube Matrix (KCM, described in Section 3.5). For
instance, in the above expression P = a^2 + ab + ab + b^2, we have the co-kernel and kernel pairs (a)(a + b + b) and
(b)(a + a + b), both of which have repeated terms. We cannot represent these kernels in the KCM.
This restriction prevents us from factorizing the expression P as (a + b)(a + b) (instead, P is factorized as a(a + 2b)
+ b^2), but it simplifies the optimization algorithm.
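The collapsing step just described amounts to grouping terms by their variable cube and summing coefficients. A minimal sketch, using our own encoding of a term as a coefficient plus an exponent tuple:

```python
from collections import defaultdict

def collapse(terms):
    """Merge terms that share a variable cube by summing their coefficients.

    A term is (coefficient, exponents), where exponents holds one integer
    per variable; terms whose coefficients cancel to zero are dropped.
    """
    acc = defaultdict(int)
    for coeff, exps in terms:
        acc[exps] += coeff
    return sorted((exps, c) for exps, c in acc.items() if c != 0)

# P = a^2 + ab + ab + b^2 over variables (a, b):
P = [(1, (2, 0)), (1, (1, 1)), (1, (1, 1)), (1, (0, 2))]
# collapse(P) -> [((0, 2), 1), ((1, 1), 2), ((2, 0), 1)], i.e. a^2 + 2ab + b^2
```

After collapsing, no two rows of the matrix representation share a variable cube, which is exactly the restriction the algorithm requires.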
2. Algebraic division and the kernel generation algorithm: For Boolean expressions, there is a restriction on the
division procedure for obtaining kernels. During division, the quotient and the divisor are required to have
orthogonal variable supports. For example, if there is a Boolean expression f, and we need to divide f by another
expression g (f/g), then the quotient expression h is defined [19] as “the largest set of cubes in f, such that h ⊥ g (h
is orthogonal to g, that is, the variable supports of h and g are disjoint), and hg ⊆ f.”
For example, if f = a(b + c) + d and g = a, then f/g = h = (b + c). We can see that h ⊥ g and hg = a(b + c) ⊆ f.
For polynomial expressions we remove this restriction for orthogonality, so that we can handle expressions of
any order. This is based on one key difference between the multiplication and the Boolean AND operation. For
example, the multiplication a*a = a^2, but the AND operation a·a = a. Therefore, we can have a polynomial F = a(ab
+ c) + d and a divisor G = a. The division F/G = H = (ab + c). We can see that H and G are not orthogonal since they
share the variable ‘a’. Since the kernel generation algorithm generates kernels by recursive division, we have
modified that algorithm to take this difference into account. This algorithm is detailed in Section 3.4 (Figure 3). In
[19, 34], a central theorem is presented on which the decomposition and factorization of Boolean expressions is
based. This theorem shows how all multiple-cube intersections and algebraic factorizations can be obtained from
the set of kernels. For our modified problem, this theorem still holds and multiple term common subexpressions and
algebraic factorizations can be detected by the set of kernels for the polynomial expressions. This is shown in detail
in Section 3.4.
3. Finding single cube intersections: In the decomposition and factorization methodology of Boolean expressions
presented in [19], multiple term common factors are extracted first and then single cube common factors are
extracted. For polynomial expressions, we follow the same order. We modify the algorithm for finding single cube
intersections, since we have higher order terms. All single cube common subexpressions can be detected by our
method, as described in detail in Section 3.5 and Section 4.2.
3.4 Generation of kernels and co-kernels
The kernels and co-kernels of polynomial expressions are important because of two reasons. Firstly, all possible
minimal algebraic factorizations of an expression can be obtained from the set of kernels and co-kernels of the
expression. An algebraic factorization of an expression P is the factorization P = C*F1 + F2, where C is a cube and
F1 and F2 are subexpressions consisting of sets of terms. This algebraic factorization is called minimal if the only
common cube among the set of terms in F1 is ‘1’.
Secondly, all multiple term (algebraic) common subexpressions can be detected by an intersection among the set
of kernels. These two properties are illustrated by the following two theorems.
Theorem 1: All minimal algebraic factorizations of a polynomial expression can be obtained from the set of kernels
and co-kernels of the expression.
Proof: Consider the minimal algebraic factorization of P = C*F1 + F2. By definition, the subexpression F1 is cube-
free, since the only cube that commonly divides all the terms of F1 is ‘1’. We have to prove that F1 is a part of a
kernel expression, with the corresponding co-kernel C. Let {ti} be the set of original terms of expression P, that
cover the terms in F1. Since F1 is obtained from {ti} by dividing it by C (F1 = {ti}/C), the common cube of the terms
{ti} is C. Let {tj} be the set of original terms of P that also have C as the common cube among them and do not
overlap with any of the terms in {ti}. Now consider the expression K1 = ( {ti} + {tj})/C = {ti}/C + {tj}/C = F1 + {t’j},
where {t’j} = {tj}/C. This expression K1 is cube-free since we know that F1 is cube-free. K1 covers all the terms of
the original expression P that contain cube C. Therefore by definition, K1 is a kernel of expression P with the cube
C as the corresponding co-kernel. Since our kernel generation procedure generates all kernels and co-kernels of the
polynomial expressions, the kernel K1 = F1 + {t’j} will be generated with the corresponding co-kernel C. Since F1 is
a part of this kernel expression, we proved the theorem.
Theorem 2: There is a multiple term common subexpression in the set of polynomial expressions iff there is a
multiple term intersection among the set of kernels belonging to the expressions.
Explanation: By a multiple term intersection, we mean an intersection between kernel expressions yielding two or
more terms (cubes). For example, when we intersect the kernels K1 = a + bc + d and K2 = bc + d + f, there is a
multiple term intersection = (bc + d).
Proof: The If case is easy to prove. Let K1 be the multiple term intersection. It means that there are multiple
instances of the subexpression K1 among the set of expressions. For the Only If case, assume that f is a multiple
term common subexpression among the set of expressions. If the expression f is cube-free, then f is a part of some
kernel expressions as proved in Theorem 1. Each instance of expression f will therefore be a part of some kernel
expression, and an intersection among the set of kernels will detect the common subexpression f. If f is not a cube-
free expression, then let C be the biggest cube that divides all terms of f, and let f ’ = f /C. The expression f ’ is now a
cube-free expression, and reasoning as above, each instance of f ’ will be detected by an intersection among the set
of kernels.
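A multiple term intersection like the (bc + d) example above reduces to a set intersection once kernels are encoded as sets of cubes. A sketch under our own encoding (a cube as a coefficient plus an exponent tuple):

```python
def multiple_term_intersection(k1, k2):
    """Cubes common to two kernels; a common subexpression needs two or more."""
    common = k1 & k2
    return common if len(common) >= 2 else set()

# K1 = a + bc + d and K2 = bc + d + f over variables (a, b, c, d, f);
# each cube is (coefficient, exponent tuple).
a  = (1, (1, 0, 0, 0, 0))
bc = (1, (0, 1, 1, 0, 0))
d  = (1, (0, 0, 0, 1, 0))
f  = (1, (0, 0, 0, 0, 1))
K1, K2 = {a, bc, d}, {bc, d, f}
# multiple_term_intersection(K1, K2) -> {bc, d}, i.e. the subexpression bc + d
```

Since exponent tuples are hashable, kernels can be stored as Python sets and intersected directly; a single shared cube is discarded, because a one-term overlap is not a multiple term common subexpression.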
The algorithm for extracting all kernels and co-kernels of a set of polynomial expressions is shown in Figure 3.
This algorithm is analogous to kernel generation algorithms in [19], and has been generalized to handle polynomial
expressions. The main difference is the way division is performed in our technique, where the values of the elements
in the divisor are subtracted from the suitable set of rows in the dividend. Another difference is that, in the
recursive algorithm, we continue to divide by the current variable until it no longer appears in at least two rows of the
matrix form of the expression.
Figure 3. Algorithms for kernel and co-kernel extraction
The main algorithm Kernels is recursive and is called for each expression in the set of polynomial expressions.
The arguments to this function are the literal index i, and the cube d, which is the co-kernel extracted in the previous
call to the algorithm. The variables in the set of expressions are ordered randomly (the ordering is inconsequential),
FindKernels({Pi}, {Li})
{
    {Pi} = set of polynomial expressions; {Li} = set of literals;
    {Di} = set of kernels and co-kernels = φ;
    for each expression Pi in {Pi}
        {Di} = {Di} ∪ Kernels(0, Pi, φ) ∪ {Pi, 1};
    return {Di};
}

Kernels(i, P, d)
{
    i = literal number; P = expression in SOP; d = cube;
    D = set of divisors = φ;
    for (j = i; j < |L|; j++)
    {
        if (Lj appears in more than 1 row)
        {
            Ft = Divide(P, Lj);
            C = largest cube dividing each cube of Ft;
            if (Lk ∉ C for all k < j)
            {
                F1 = Divide(Ft, C);      // kernel
                D1 = Merge(d, C, Lj);    // co-kernel
                D = D ∪ D1 ∪ F1;
                D = D ∪ Kernels(j, F1, D1);
            }
        }
    }
    return D;
}

Divide(P, d)
{
    P = expression; d = cube;
    Q = set of rows of P that contain cube d;
    for each row Ri of Q
        for each column j in row Ri
            Ri[j] = Ri[j] - d[j];
    return Q;
}

Merge(C1, C2, C3)
{
    C1, C2, C3 = cubes; M = cube;
    for (i = 0; i < number of literals; i++)
        M[i] = C1[i] + C2[i] + C3[i];
    return M;
}
and the variable index represents the position of the variable in this order. Before the first call to the Kernels
algorithm, the variable index is initialized to ‘0’ and the cube ‘d’ is initialized to φ. The recursive nature of the
algorithm extracts kernels and co-kernels within the kernel extracted and returns when there are no kernels present.
The algorithm Divide is used to divide an SOP expression by a cube d. It first collects those rows that contain
cube d (those rows R such that all elements R[i] ≥ d[i]). The cube d is then subtracted from these rows and these
rows form the quotient.
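This division can be expressed directly on the exponent-matrix encoding. The version below is our own illustration, ignoring signs and constant coefficients for brevity:

```python
def divide(rows, d):
    """Divide an SOP expression (rows of exponents) by the cube d.

    A row "contains" d when every exponent is at least d's; those rows,
    with d's exponents subtracted element-wise, form the quotient.
    """
    quotient = []
    for row in rows:
        if all(r >= e for r, e in zip(row, d)):
            quotient.append(tuple(r - e for r, e in zip(row, d)))
    return quotient

# The sin(x) matrix over literals (x, S3, S5, S7), divided by the cube x:
rows = [(1, 0, 0, 0), (3, 1, 0, 0), (5, 0, 1, 0), (7, 0, 0, 1)]
# divide(rows, (1, 0, 0, 0)) -> [(0,0,0,0), (2,1,0,0), (4,0,1,0), (6,0,0,1)]
```

Rows that do not contain the cube simply drop out of the quotient, which matches the containment test R[i] ≥ d[i] described above.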
The biggest cube dividing all the cubes of an SOP expression is the cube C with the greatest literal count (the literal
count of C is Σi C[i]) that is contained in each cube of the expression.
The algorithm Merge is used to find the product of the cubes d, C and Lj and is done by adding up the
corresponding elements of the cubes. In addition to the kernels generated from the recursive algorithm, the original
expression is also added as a kernel, with co-kernel ‘1’. Passing the index i to the Kernels algorithm, checking that
Lk ∉ C for all k < j, and iterating from literal index i to |L| (the total number of literals) in the for loop prevent the
same kernels from being generated again.
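The recursion of Figure 3 can be sketched end to end on the exponent-matrix encoding. This is our own simplified rendering (signs and coefficients omitted, literals indexed 0..n-1), not the authors' implementation:

```python
def contains(row, cube):
    return all(r >= c for r, c in zip(row, cube))

def divide(rows, d):
    return [tuple(r - c for r, c in zip(row, d)) for row in rows if contains(row, d)]

def common_cube(rows):
    """Biggest cube dividing every row: column-wise minimum of exponents."""
    return tuple(min(col) for col in zip(*rows))

def merge(*cubes):
    """Product of cubes: element-wise sum of exponents."""
    return tuple(sum(col) for col in zip(*cubes))

def kernels(i, rows, d, n, out):
    """Collect (co-kernel, kernel) pairs by recursive division, as in Figure 3."""
    for j in range(i, n):
        lj = tuple(1 if k == j else 0 for k in range(n))
        if sum(contains(row, lj) for row in rows) > 1:
            ft = divide(rows, lj)
            c = common_cube(ft)
            if all(c[k] == 0 for k in range(j)):     # avoid regenerating kernels
                f1 = divide(ft, c)                   # kernel
                d1 = merge(d, c, lj)                 # co-kernel
                out.append((d1, f1))
                kernels(j, f1, d1, n, out)
    return out

# sin(x) over literals (x, S3, S5, S7):
rows = [(1, 0, 0, 0), (3, 1, 0, 0), (5, 0, 1, 0), (7, 0, 0, 1)]
found = kernels(0, rows, (0, 0, 0, 0), 4, [])
# Co-kernels found: x, x^3, x^5; the original expression with co-kernel '1'
# is added separately by the top-level FindKernels.
```

Running this on the sin(x) matrix reproduces the co-kernel/kernel pairs worked out in the example below: [x], [x^3], and [x^5], each paired with the corresponding kernel.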
We use the example of the polynomial shown in Figure 4, to explain our technique. The numbers in the
subscripts for each term indicate the term number in the set of expressions. This polynomial is a Taylor series
approximation for sin(x). The literal set is {L} = {x, S3, S5, S7}. Dividing first by x gives Ft = (1 - S3·x^2 + S5·x^4 - S7·x^6).
There is no cube that divides it completely, so we record the first kernel F1 = Ft with co-kernel x. In the next call to the
Kernels algorithm, dividing F1 by x gives Ft = -S3·x + S5·x^3 - S7·x^5.
The biggest cube dividing this completely is C = x. Dividing Ft by C gives our next kernel F1 = (-S3 + S5·x^2 -
S7·x^4), and the co-kernel is x*x*x = x^3. In the next iteration we obtain the kernel (S5 - S7·x^2) with co-kernel x^5. Finally,
we also record the original expression for sin(x) with co-kernel ‘1’. The set of all co-kernels and kernels generated
is: [x](1 - S3·x^2 + S5·x^4 - S7·x^6); [x^3](-S3 + S5·x^2 - S7·x^4); [x^5](S5 - S7·x^2); [1](x - S3·x^3 + S5·x^5 - S7·x^7).
Figure 4. Evaluation of sin(x)
sin(x) = x(1) - S3·x^3(2) + S5·x^5(3) - S7·x^7(4)
S3 = 1/3!,  S5 = 1/5!,  S7 = 1/7!
3.5 Constructing kernel cube matrix (KCM)
All kernel intersections and multiple term factors can be identified by arranging the kernels and co-kernels in a
matrix form. There is a row in the matrix for each kernel (co-kernel) generated, and a column for each distinct cube
of a kernel generated. The KCM for our sin(x) expression is shown in Figure 5. The kernel corresponding to the
original expression (with co-kernel ‘1’) is not shown for ease of presentation. An element (i,j) of the KCM is ‘1’, if
the product of the co-kernel in row i and the kernel-cube in column j gives a product term in the original set of
expressions. The number in parenthesis in each element represents the term number that it represents. A rectangle is
a set of rows and set of columns of the KCM such that all the elements are ‘1’. The value of a rectangle is the
weighted sum of the number of operations saved by selecting the common subexpression or factor corresponding to
that rectangle. Selecting the optimal set of common subexpressions and factors is equivalent to finding a maximum
valued covering of the KCM, and is analogous to the minimum weighted rectangular covering problem described in
[20], which is NP-hard. We use a greedy iterative algorithm, described in Section 4.1, in which we pick the best prime
rectangle in each iteration. A prime rectangle is a rectangle that is not covered by any other rectangle, and thus has
more value than any rectangle it covers.
Figure 5. Kernel Cube Matrix (KCM) for the sin(x) example
Given a rectangle with the following parameters, we can calculate its value:
R = number of rows; M(Ri) = number of multiplications in row (co-kernel) i;
C = number of columns; M(Cj) = number of multiplications in column (kernel-cube) j.
Each element (i,j) in the rectangle represents a product term equal to the product of co-kernel i and kernel-cube j,
which has a total of M(Ri) + M(Cj) + 1 multiplications. The total number of multiplications represented by
the whole rectangle is therefore C·Σi M(Ri) + R·Σj M(Cj) + R·C. Each row in the rectangle has C - 1 additions,
for a total of R·(C - 1) additions. By selecting the rectangle, we extract a common factor with
Σj M(Cj) multiplications and C - 1 additions. This common factor is multiplied by each row, which leads to a further
Σi M(Ri) + R multiplications. The value of the rectangle can be described as the weighted sum of the savings in the
(Figure 5 matrix: one row per co-kernel, one column per kernel-cube; an entry 1(t) means the co-kernel/kernel-cube product is term t of the original expression)

           1      -S3x^2   S5x^4   -S7x^6   -S3    S5x^2   -S7x^4   S5     -S7x^2
  1  x     1(1)   1(2)     1(3)    1(4)
  2  x^3                                    1(2)   1(3)    1(4)
  3  x^5                                                            1(3)   1(4)
number of multiplications and additions, and is given in Equation I. The weighting factor ‘m’ can be selected by the
designer depending on the relative cost of a multiplication and an addition on the target architecture. For example,
multiplication takes about five times more cycle time than addition on an ARM processor [35]. For custom hardware
synthesis, multiplication has much higher power consumption compared to addition. In [36], the average energy
consumption of a multiplier is about 40 times that of an adder at 5V.
    Value = m · { (C - 1) · ( R + Σi M(Ri) ) + (R - 1) · Σj M(Cj) } + (R - 1) · (C - 1)      (I)
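The value computation is easy to check with a small helper. The function below is our own sketch of Equation I, taking the per-row and per-column multiplication counts and the weighting factor m:

```python
def rectangle_value(row_mults, col_mults, m=1.0):
    """Value of a KCM rectangle per Equation I.

    row_mults[i] = M(R_i), multiplications in the co-kernel of row i
    col_mults[j] = M(C_j), multiplications in the kernel-cube of column j
    m            = relative cost of a multiplication versus an addition
    """
    R, C = len(row_mults), len(col_mults)
    mult_savings = (C - 1) * (R + sum(row_mults)) + (R - 1) * sum(col_mults)
    add_savings = (R - 1) * (C - 1)
    return m * mult_savings + add_savings

# A hypothetical 2x2 rectangle with co-kernel costs [0, 2] and kernel-cube
# costs [1, 1] saves (1)*(2+2) + (1)*2 = 6 multiplications and 1 addition.
```

The greedy covering loop would call this for every candidate prime rectangle and keep the one with the highest value.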
3.6 Constructing Cube Literal Incidence Matrix (CIM)
The KCM allows the detection of multiple cube common subexpressions and factors. All possible cube
intersections can be identified by arranging the cubes in a matrix where there is a row for each cube and a column
for each literal in the set of expressions. The CIM for our sin(x) expression is shown in Figure 2.
Each cube intersection appears in the CIM as a rectangle. A rectangle in the CIM is a set of rows and columns
such that all the elements are non-zero. This CIM provides a canonical representation of the cubes of the expression,
which enables the detection of all single cube common subexpressions, as is illustrated by the following theorem.
Theorem 3: There is a single cube common subexpression in the set of polynomial expressions iff there is a
rectangle with more than one row in the Cube Intersection Matrix.
Proof: For the If case, assume that there is a rectangle with more than one row in the CIM. The elements of this
rectangle are non-zero by definition, and let C be the biggest common cube which is obtained from the minimum
values in each column of the rectangle. This cube C is the single cube common subexpression. For the Only If case,
assume that there is a common cube D in the set of expressions. Let {vd} be the set of variables that cube D is made
of. There will be non-zero values in the columns corresponding to the variables {vd} for each term that contains
cube D. As a result, a rectangle containing the columns corresponding to the variables {vd} will be formed.
The value of a rectangle is the number of multiplications saved by selecting the single term common
subexpression corresponding to that rectangle. The best set of common cube intersections is obtained by a maximum
valued covering of the CIM. We use a greedy iterative algorithm described in Section 4.2, where we select the best
prime rectangle in each iteration. The common cube C corresponding to the prime rectangle is obtained by finding
the minimum value in each column of the rectangle. The value of a rectangle with R rows can be calculated as
follows.
Let ∑C[i] be the sum of the integer powers in the extracted cube C. This cube saves ∑C[i] – 1 multiplications in
each row of the rectangle. The cube itself needs ∑C[i] – 1 multiplications to compute. Therefore the value of the
rectangle is given by the equation:
Value2 = (R -1)*( ∑C[i] – 1) (II)
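Equation (II) is equally direct to state in code (again a sketch of ours, with the cube represented as a variable-to-exponent map):

```python
def cube_rectangle_value(R, cube):
    """Value of a CIM rectangle per Equation (II).

    R    = number of rows covered by the rectangle
    cube = extracted common cube C as a dict {variable: exponent},
           so sum(cube.values()) is the sum of integer powers, sum C[i]
    """
    return (R - 1) * (sum(cube.values()) - 1)

# x^2 shared by three terms (rows {3,5,6} of the sin(x) CIM) saves
# (3 - 1) * (2 - 1) = 2 multiplications
print(cube_rectangle_value(3, {'x': 2}))  # → 2
```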
4. Factoring and eliminating common subexpressions
In this section we present two algorithms that detect kernel and cube intersections. We first extract kernel
intersections and eliminate all multiple term common subexpressions and factors. We then extract cube intersections
from the modified set of expressions and eliminate all single cube common subexpressions.
4.1 Extracting kernel intersections
The algorithm for finding kernel intersections is shown in Figure 6, and is analogous to the Distill procedure
[19] in logic synthesis. It is a greedy iterative algorithm in which the best prime rectangle is extracted in each
iteration. In the outer loop, kernels and co-kernels are extracted for the set of expressions {Pi} and the KCM is
formed from that. The outer loop exits if there is no favorable rectangle in the KCM. Each iteration in the inner loop
selects the most valuable rectangle if present, based on our value function in Equation I.
Figure 6. The algorithm for finding kernel intersections
FindKernelIntersections({Pi}, {Li}) {
    while (1) {
        D = FindKernels({Pi}, {Li});
        KCM = FormKernelCubeMatrix(D);
        {R} = set of new kernel intersections = φ;
        {V} = set of new variables = φ;
        if (no favorable rectangle) return;
        while (favorable rectangles exist) {
            {R} = {R} ∪ Best Prime Rectangle;
            {V} = {V} ∪ New Literal;
            Update KCM;
        }
        Rewrite {Pi} using {R};
        {Pi} = {Pi} ∪ {R};
        {Li} = {Li} ∪ {V};
    }
}
This rectangle is added to the set of expressions and a new literal is introduced to represent this rectangle. The
kernel cube matrix is then updated by removing those 1’s in the matrix that correspond to the terms covered by the
selected rectangle. For example, for the KCM for the sin(x) expression shown in Figure 5, consider the rectangle
that is formed by the row {2} and the columns {5,6,7}.
This is a prime rectangle having R = 1 row and C = 3 columns. The total number of multiplications in the rows is
∑ M(Ri) = 2, and the total number of multiplications in the columns is ∑ M(Cj) = 6. Using the value function in
Equation I, the value of this rectangle is 6*m (m ≥ 1). This is the most valuable rectangle and is selected. Since this rectangle
covers the terms 2,3 and 4, the elements in the matrix corresponding to these terms are deleted. No more favorable
(valuable) rectangles are extracted in this KCM, and now we have two expressions, after rewriting the original
expression for sin(x).
sin(x) = x(1) + (x3d1)(2)
d1 = (-S3)(3) + (S5x2)(4) – (S7x4)(5)
Extracting kernels and co-kernels as before, we have the KCM shown in Figure 7. Again the original expressions for
sin(x) and d1 are not included in the matrix to simplify representation. From this matrix we can extract 2 favorable
rectangles corresponding to d2 = (S5 – S7x2) and d3 = (1 + x2d1). No more rectangles are extracted in the subsequent
iterations and the set of expressions can now be written as shown in Figure 8.
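The selection step above can be made concrete with a small brute-force search over the Figure 5 KCM using Equation (I). This is our own illustration, not the paper's implementation (the paper selects prime rectangles greedily rather than enumerating all row subsets):

```python
from itertools import combinations

def best_rectangle(kcm, row_mults, col_mults, m=1):
    """Brute-force search for the most valuable rectangle in a 0/1 KCM.

    For each subset of rows, take the maximal set of columns in which all
    selected rows have a 1, and score it with the value function of Eq. (I).
    """
    best = (0, None, None)
    for k in range(1, len(kcm) + 1):
        for rows in combinations(range(len(kcm)), k):
            cols = tuple(j for j in range(len(kcm[0]))
                         if all(kcm[i][j] for i in rows))
            if not cols:
                continue
            R, C = len(rows), len(cols)
            rm = sum(row_mults[i] for i in rows)   # sum of M(Ri)
            cm = sum(col_mults[j] for j in cols)   # sum of M(Cj)
            v = m * ((C - 1) * (R + rm) + (R - 1) * cm) + (R - 1) * (C - 1)
            if v > best[0]:
                best = (v, rows, cols)
    return best

# KCM of Figure 5: rows are co-kernels x, x^3, x^5;
# columns are kernel cubes 1, -S3x2, S5x4, -S7x6, -S3, S5x2, -S7x4, S5, -S7x2
kcm = [
    [1, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
]
row_mults = [0, 2, 4]                    # multiplications in x, x^3, x^5
col_mults = [0, 2, 4, 6, 0, 2, 4, 0, 2]  # multiplications in each kernel cube
print(best_rectangle(kcm, row_mults, col_mults))  # → (6, (1,), (4, 5, 6))
```

With 0-indexed rows and columns, the result (6, (1,), (4, 5, 6)) corresponds to the paper's selection of row {2} and columns {5,6,7} with value 6·m.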
Figure 7. Kernel Cube Matrix (2nd iteration)
Figure 8. Expressions after kernel intersection

4.2 Extracting cube intersections

The Distill algorithm discussed in the previous section could find only multiple (two or more) term common
subexpressions. We need an algorithm that can find single term common subexpressions. We do this optimization
by means of a Cube Literal Incidence Matrix (CIM), which was discussed in Section 3.6. Single term common
subexpressions appear as rectangles in this matrix, where the rectangle entries can have any non-zero integers. The
optimization is performed by iteratively eliminating the rectangle having the most savings. For example, consider
the three terms F1 = a2b3c, F2 = a3b2 and F3 = b2c3. The CIM for these expressions is shown in Figure 9a.
Figure 7 (KCM, 2nd iteration):

            1       x2d1    S5      -S7x2
  1   x    1(1)     1(2)
  2   x2                    1(4)    1(5)

Figure 8 (expressions after kernel intersection):

  sin(x) = (xd3)(1)
  d3 = (1)(2) + (x2d1)(3)
  d2 = (S5)(4) - (x2S7)(5)
  d1 = (x2d2)(6) - (S3)(7)
A rectangle with rows corresponding to F1 and F2 and columns corresponding to the variables a and b yields the
common subexpression d1 = a2b2. Rewriting the CIM, we get the matrix shown in Figure 9b. Another rectangle with
rows corresponding to F1 and F3 and columns corresponding to the variables b and c is obtained, eliminating the
common subexpression d2 = b*c. After rewriting we get the CIM as shown in Figure 9c. No more favorable
rectangles can be found in this matrix, and we can stop here.
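The cube-extraction and rewriting steps of this example can be sketched in Python (a minimal illustration of ours; each term is a variable-to-exponent map):

```python
def extract_cube(cim, rows, cols):
    """Common cube of a CIM rectangle: the minimum exponent in each column (Theorem 3)."""
    return {c: min(cim[r][c] for r in rows) for c in cols}

def collapse(cim, cube):
    """Divide the extracted cube out of every term that contains it (the Collapse step)."""
    for row in cim:
        if all(row.get(v, 0) >= e for v, e in cube.items()):
            for v, e in cube.items():
                row[v] -= e

# CIM of Figure 9a: F1 = a^2 b^3 c, F2 = a^3 b^2, F3 = b^2 c^3
cim = [{'a': 2, 'b': 3, 'c': 1},
       {'a': 3, 'b': 2, 'c': 0},
       {'a': 0, 'b': 2, 'c': 3}]

d1 = extract_cube(cim, rows=[0, 1], cols=['a', 'b'])   # rectangle {F1, F2} x {a, b}
collapse(cim, d1)                                      # F1 -> bc * d1, F2 -> a * d1
d2 = extract_cube(cim, rows=[0, 2], cols=['b', 'c'])   # rectangle {F1, F3} x {b, c}
print(d1)  # → {'a': 2, 'b': 2}
print(d2)  # → {'b': 1, 'c': 1}
```

The two extracted cubes match d1 = a2b2 and d2 = bc from the example above.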
Figure 9. Extracting single term common subexpressions

  (a) F1 = a2b3c, F2 = a3b2, F3 = b2c3

            a   b   c
      F1    2   3   1
      F2    3   2   0
      F3    0   2   3

  (b) After extracting d1 = a2b2: F1 = bcd1, F2 = ad1, F3 = b2c3

            a   b   c   d1
      F1    0   1   1   1
      F2    1   0   0   1
      F3    0   2   3   0
      d1    2   2   0   0

  (c) After extracting d2 = bc: F1 = d1d2, F2 = ad1, F3 = bc2d2

            a   b   c   d1  d2
      F1    0   0   0   1   1
      F2    1   0   0   1   0
      F3    0   1   2   0   1
      d1    2   2   0   0   0
      d2    0   1   1   0   0

Figure 10. The algorithm for finding cube intersections

FindCubeIntersections({Pi}, {Li}) {
    while (1) {
        M = Cube Literal Incidence Matrix;
        {V} = set of new variables = φ;
        {C} = set of new cube intersections = φ;
        if (no favorable rectangle present) return;
        while (favorable rectangles exist) {
            B = Best Prime Rectangle;
            {C} = {C} ∪ cube corresponding to B;
            {V} = {V} ∪ New Literal;
            Collapse(M, B);
        }
        M = M ∪ {C};
        {Li} = {Li} ∪ {V};
    }
}

Collapse(M, B) {
    ∀ rows i of M that contain cube B
        ∀ columns j of B
            M[i][j] = M[i][j] - B[j];
}

Figure 11. Cube Literal Incidence Matrix for the expressions in Figure 8

  Term  +/-   x   d1  d2  d3  1   S3  S5  S7
   1     +    1   0   0   1   0   0   0   0
   2     +    0   0   0   0   1   0   0   0
   3     +    2   1   0   0   0   0   0   0
   4     +    0   0   0   0   0   0   1   0
   5     -    2   0   0   0   0   0   0   1
   6     +    2   0   1   0   0   0   0   0
   7     -    0   0   0   0   0   1   0   0

Figure 12. Final optimization of the sin(x) example

  d4 = x*x
  d2 = S5 - S7*d4
  d1 = d2*d4 - S3
  d3 = d1*d4 + 1
  sin(x) = x*d3

The algorithm for finding cube intersections is run at the end of the Distill algorithm and is shown in Figure 10.
In each iteration of the inner loop, the best prime rectangle is extracted using the value function in Equation (II). The
cube intersection corresponding to this rectangle is extracted from the minimum value in each column of the
rectangle. After each iteration of the inner loop, the CIM is updated by subtracting the extracted cube from all rows
in the CIM in which it is contained (Collapse procedure). Coming back to our sin(x) example, consider the CIM in
Figure 11 for the set of expressions in Figure 8. The prime rectangle consisting of the rows {3,5,6} and the column
{1} is extracted, which corresponds to the common subexpression C = x2. Using Equation (II), we can see that this
rectangle saves two multiplications. This is the most favorable rectangle and is chosen. No more cube intersections
are detected in the subsequent iterations. A literal d4 = x2 is added to the expressions which can now be written as in
Figure 12.
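As a numerical sanity check on the final factorization (our own sketch; S3, S5, S7 are the Taylor coefficients 1/3!, 1/5!, 1/7! of the seventh-order sin(x) approximation):

```python
import math

S3, S5, S7 = 1 / 6, 1 / 120, 1 / 5040  # Taylor coefficients of sin(x)

def sin_optimized(x):
    """Evaluate the optimized expressions of Figure 12: 5 multiplications, 3 additions."""
    d4 = x * x
    d2 = S5 - S7 * d4
    d1 = d2 * d4 - S3
    d3 = d1 * d4 + 1
    return x * d3

x = 0.5
direct = x - S3 * x**3 + S5 * x**5 - S7 * x**7  # the original four-term polynomial
assert abs(sin_optimized(x) - direct) < 1e-12   # same value as the unfactored form
assert abs(sin_optimized(x) - math.sin(x)) < 1e-6
```

Expanding x·(((S5 − S7·x²)·x² − S3)·x² + 1) recovers x − S3x³ + S5x⁵ − S7x⁷ exactly, so the factored form agrees with the original polynomial up to floating-point rounding.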
4.3 Complexity and quality of presented algorithms
The complexity of the kernel generation algorithm (Figure 3) can be defined as the maximum number of kernels that
can be generated from an expression. The total number of kernels generated from an expression depends on the
number of variables, the number of terms, the order of the expressions and also the distribution of the variables in
the original terms. Since kernels are generated by division by cubes (co-kernels), the set of co-kernels generated
have to be distinct. Given the number of variables m, and the maximum integer exponent of any variable in the
expression as n, we can find the worst case number of kernels generated for an expression as the number of different
co-kernels possible. The number of different co-kernels possible is (n + 1)m. For example if we have m = 2
variables, a and b, and the maximum integer power n = 2, then we have (2 + 1)2 = 9 different possible co-kernels
which are 1, b, b2, a, ab, ab2, a2, a2b, a2b2. This worst case happens only when there is a uniform distribution of the
variables in the different terms of the polynomial expression.
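The worst-case count of co-kernels can be checked with a short enumeration (our own sketch; for compactness b² is printed as 'bb'):

```python
from itertools import product

def possible_cokernels(variables, n):
    """Enumerate all (n + 1)^m candidate co-kernel cubes: each of the m variables
    takes an integer exponent between 0 and n; the all-zero exponent vector is
    the trivial cube 1."""
    cubes = []
    for exps in product(range(n + 1), repeat=len(variables)):
        cube = ''.join(v * e for v, e in zip(variables, exps)) or '1'
        cubes.append(cube)
    return cubes

cokernels = possible_cokernels(['a', 'b'], n=2)
print(len(cokernels))  # → 9, i.e. (2 + 1)^2
print(cokernels)       # → ['1', 'b', 'bb', 'a', 'ab', 'abb', 'aa', 'aab', 'aabb']
```

This matches the worked example in the text: for m = 2 variables and maximum power n = 2, there are 9 possible co-kernels 1, b, b², a, ab, ab², a², a²b, a²b².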
The complexity of the kernel extraction procedure depends on the organization of the Kernel Cube Matrix. In the
worst case, the Kernel Cube Matrix (KCM) can have a structure like the one shown for the square matrix in
Figure 13, where all elements except the ones on the diagonal are non-zero.
Figure 13. Worst case complexity for finding kernel intersections
  -   1   1   1   1
  1   -   1   1   1
  1   1   -   1   1
  1   1   1   -   1
  1   1   1   1   -
In such a scenario, the number of rectangles generated is exponential in the number of rows/columns in the matrix.
The number of kernels that are generated is equal to the number of rows in the matrix and the number of distinct
kernel cubes is equal to the number of columns. The algorithm for finding single cube intersections also has a worst
case exponential complexity, when the Cube Intersection Matrix (CIM) is of the form shown in Figure 13. The
number of rows of the CIM is equal to the number of terms and the number of columns is equal to the number of
literals in the set of expressions.
The algorithms that we presented are greedy heuristic algorithms. To the best of our knowledge, there has been
no previous work done for finding an optimal solution to the common subexpression elimination problem. An
optimal solution can be generated by a brute force exploration of the entire search space where all possible
subexpressions in different selection orderings are explored. Since this is impractical for even moderately sized
examples, we do not compare our solution with the optimal one. For our greedy heuristic, the average run time for
the examples that we investigated is only 0.45s.
5. Experimental results

The goal of our experiments was to investigate the usefulness of our technique in reducing the number of
operations in polynomial expressions occurring in some real applications. We found polynomial expressions to be
prevalent in signal processing [37] and in 3-D computer graphics [14]. We optimized the polynomials using
different methods, CSE, Horner form and our algebraic technique. We used Jouletrack simulator [38] to estimate the
reduction in latency and energy consumption for computing the polynomials on the StrongARMTM SA1100
microprocessor. We used the same set of randomly generated inputs for simulating the different programs. Table 1
shows the result of our experiments, where we compare the savings in the number of operations produced by our
method over Common Subexpression Elimination (CSE) and the Horner form. The first two examples are obtained
from source codes from signal processing applications [37] where the polynomials are obtained by approximating
the trigonometric functions by their Taylor series expansions. The next four examples are multivariate polynomials
obtained from 3-D computer graphics [29]. The results show that our optimizations reduce the number of
multiplications, on average by 34% over CSE and by 34.9% over Horner. The number of additions is the same in
all examples. This is because all the savings are produced by efficient factorization and elimination of single term
common subexpressions, which only reduce the number of multiplications. The average run time for our technique
for these examples was only 0.45s.
The results also show the results of simulation on the StrongARMTM SA1100 processor, where we observe an
average latency reduction of 26.7% over CSE and 26.1% over Horner scheme. We also observed an energy
reduction of 26.4% over CSE and 26.3% over the Horner scheme. The energy consumption measured by the simulator
[38] is only for the processor core.
We synthesized the polynomials optimized by CSE, Horner and our algebraic method to observe the
improvements for hardware implementation. We synthesized the polynomials using Synopsys Behavioral Compiler
TM and Synopsys Design Compiler TM using the 1.0 µ power2_sample.db technology library, with a clock period of
40 ns, operating at 5V. We used this library because it was the only one we had that was characterized for power
consumption. We used the Synopsys DesignWare TM library for the functional units. The adder and multiplier in this
library took one clock cycle and two clock cycles respectively at this clock period. We synthesized the designs with
both minimum hardware constraints and medium hardware constraints. We then interfaced the Synopsys Power
Compiler TM with the Verilog RTL simulator VCS TM to capture the total power consumption including switching,
short circuit and leakage power. All the polynomials were simulated with the same set of randomly generated inputs.
Table 2 shows the synthesis results when the polynomials were scheduled with minimum hardware constraints.
Typically the hardware allocated consisted of a single adder, a single multiplier, registers and control units. We
measured the reduction in area, energy and energy-delay product of the polynomials optimized using our technique
over the polynomials optimized using CSE (C) and Horner transform (H). The results show that there is not much
difference in area for the different implementations, but there is a significant reduction in total energy consumption
(average 30% over CSE and 33.4% over Horner), and energy delay product (average 47.5% over CSE and 50.9%
over Horner). Since our technique gave significantly reduced latency, we estimated the savings in energy
consumption for the same latency using voltage scaling.
We obtained the scaled down voltage for the new latency using the library parameters, and estimated the new
energy consumption using the quadratic relationship between the energy consumption and the supply voltage. This
voltage scaling was done by comparing the latencies obtained by our technique with that of CSE and Horner
separately.
The results (Column IV) show an energy reduction of 63.4% over CSE(C) and 63.2% reduction over Horner (H)
with voltage scaling. Table 3a shows the synthesis results for medium hardware constraints. Medium hardware
implementations trade off area for energy efficiency. Though they have larger area, they have a shorter schedule and
lower energy consumption compared to those produced from minimum hardware constraints. We constrained the
Synopsys Behavioral Compiler TM to allocate a maximum of four multipliers for each example. Table 3b shows the
number of adders (A) and multipliers (M) allocated for the different implementations. The scheduler will allocate
fewer than four multipliers if the same latency can be achieved with fewer resources.
Table 1. Comparing the number of additions (A) and multiplications (M) produced by different methods

                                             Unoptimized   CSE       Horner    Ours      Reduction on StrongARM SA1100 (%)
                                                                                         vs CSE            vs Horner
     Application             Function        A     M       A    M    A    M    A    M    Latency  Energy   Latency  Energy
  1  Fast Convolution        FFT             7    56       7    20   7    30   7    10    26.8     26.1     44.0     42.3
  2  Gaussian noise filter   FIR             6    34       6    23   6    20   6    13    30.5     26.4     16.5     16.1
  3  Graphics                quartic-spline  4    23       4    16   4    17   4    13     7.0     10.7     35.5     35.1
  4  Graphics                quintic-spline  5    34       5    22   5    23   5    16    23.4     24.8     23.3     26.2
  5  Graphics                chebyshev       8    32       8    18   8    18   8    11    33.4     31.3     27.5     25.4
  6  Graphics                cosine-wavelet  17   43       17   23   17   20   17   17    39.1     39.0      9.7     12.8
     Average                                 7.8  37       7.8  20.3 7.8  21.3 7.8  13.3  26.7     26.4     26.1     26.3
Table 2. Synthesis results with minimum hardware constraints

       Area (%)       Energy @5V (%)   Energy-Delay @5V (%)   Energy, scaled voltage (%)
       (I)            (II)             (III)                  (IV)
       C      H       C      H         C      H               C      H
  1    18.6   6.5     33.5   69.9      58.4   90.8            88.9   99.0
  2     7.5   0.1     13.6   25.6      20.4   39.4            24.6   49.5
  3     0.3  -4.2     21.6   29.3      39.0   48.8            52.2   64.6
  4    -7.5 -24.2     29.4   10.4      47.6   25.9            62.2   36.9
  5     5.6   2.5     37.0   28.7      57.1   46.1            74.3   59.8
  6     3.7   2.0     44.8   36.8      62.8   54.8            78.3   69.7
  Avg   4.7  -2.8     30.0   33.4      47.5   50.9            63.4   63.2
The results show a significant reduction in total energy consumption (average 24.6% over CSE(C) and 38.7%
over Horner (H) scheme), as well as energy delay product (average 26.7% over CSE and 53.5 % over Horner). We
obtain a reduction in energy-delay product for every example except the first one, where the energy-delay product
for the expression synthesized using CSE is 12.7% less than that produced by our method. This is because the
latency of the CSE-optimized expression, when four multipliers are used, is much shorter than that of the one
optimized using our algebraic method (which uses two multipliers). The delay for the Horner scheme is much worse
than both CSE and our technique, because the nested additions and multiplications of the Horner scheme typically
result in a longer critical path.
Table 3a. Synthesis results with medium hardware constraints
Table 3b. Multipliers (M) and adders (A) allotted with medium hardware constraints
6. Conclusions
In this paper we presented an algebraic method for reducing the number of operations in a set of polynomial
expressions of any degree and consisting of any number of variables, by performing algebraic factorization and
elimination of common subexpressions. The presented technique is based on the algebraic methods originally
developed for multi-level logic synthesis. Our technique is superior to the conventional optimization techniques for
polynomial expressions. Experimental results have shown significant reduction in latency and energy consumption.
By integrating these optimizations into conventional compiler and high level synthesis frameworks, we believe that
we can observe substantial performance and power consumption improvements for the whole application.
In the future we would like to work on extending our technique by optimizing latency and performing scheduling
so that we can work with different architectural constraints. Also we would like to study the impact of our
techniques on the stability of computing the polynomial expressions.
Table 3a:

       Area (%)     Energy @5V (%)   Energy-Delay @5V (%)
       C     H      C      H         C      H
  1    44.0  48.0    9.8   63.9     -12.7   81.0
  2    30.5   3.9   16.1   39.2       9.7   44.1
  3    14.8   1.0    9.7   29.6      20.3   58.7
  4     8.3   3.7   42.5   29.1      44.9   37.0
  5     8.9   9.0   28.2   29.5      39.5   40.6
  6     8.0   6.6   41.4   40.8      58.4   59.7
  Avg  19.0  12.0   24.6   38.7      26.7   53.5

Table 3b:

       CSE      Horner   Our Technique
       M   A    M   A    M   A
  1    4   2    4   1    2   1
  2    3   1    2   1    2   1
  3    4   1    4   1    4   1
  4    3   1    3   2    3   2
  5    4   1    4   2    4   2
  6    3   1    3   1    3   2
REFERENCES

[1] A. Hosangadi, F. Fallah, and R. Kastner, "Factoring and Eliminating Common Subexpressions in Polynomial Expressions," presented at the International Conference on Computer Aided Design (ICCAD), San Jose, CA, 2004.
[2] A. Hosangadi, F. Fallah, and R. Kastner, "Energy Efficient Hardware Synthesis of Polynomial Expressions," presented at the International Conference on VLSI Design, Kolkata, India, 2005.
[3] A. Hosangadi, F. Fallah, and R. Kastner, "Optimizing Polynomial Expressions by Factoring and Eliminating Common Subexpressions," presented at the International Workshop on Logic and Synthesis, Temecula, CA, 2004.
[4] S. S. Muchnick, Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, 1997.
[5] R. Allen and K. Kennedy, Optimizing Compilers for Modern Architectures. Morgan Kaufmann Publishers, 2002.
[6] M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and W. Ye, "Influence of compiler optimizations on system power," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, pp. 801-804, 2001.
[7] G. De Micheli, Synthesis and Optimization of Digital Circuits. McGraw-Hill, Inc., 1994.
[8] H. De Man, J. Rabaey, J. Vanhoof, G. Goossens, P. Six, and L. Claesen, "CATHEDRAL-II: a computer-aided synthesis system for digital signal processing VLSI systems," Computer-Aided Engineering Journal, vol. 5, pp. 55-66, 1988.
[9] A. Peymandoust and G. De Micheli, "Using Symbolic Algebra in Algorithmic Level DSP Synthesis," presented at the Design Automation Conference, 2001.
[10] "Windriver advanced compiler optimization techniques. A Technical White Paper."
[11] D. E. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms, 2nd ed. Addison-Wesley, 1981.
[12] C. Fike, Computer Evaluation of Mathematical Functions. Englewood Cliffs, NJ: Prentice-Hall, 1968.
[13] J. Villalba, G. Bandera, M. A. Gonzalez, J. Hormigo, and E. L. Zapata, "Polynomial evaluation on multimedia processors," presented at the IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2002.
[14] R. H. Bartels, J. C. Beatty, and B. A. Barsky, An Introduction to Splines for Use in Computer Graphics and Geometric Modeling. Morgan Kaufmann Publishers, Inc., 1987.
[15] A. Peymandoust, T. Simunic, and G. De Micheli, "Low Power Embedded Software Optimization Using Symbolic Algebra," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 2002.
[16] V. J. Mathews, "Adaptive polynomial filters," IEEE Signal Processing Magazine, vol. 8, pp. 10-26, 1991.
[17] "GNU C Library."
[18] A. S. Vincentelli, A. Wang, R. K. Brayton, and R. Rudell, "MIS: Multiple Level Logic Optimization System," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 1987.
[19] R. K. Brayton and C. T. McMullen, "The Decomposition and Factorization of Boolean Expressions," presented at the International Symposium on Circuits and Systems, 1982.
[20] R. K. Brayton, R. Rudell, A. S. Vincentelli, and A. Wang, "Multi-level Logic Optimization and the Rectangular Covering Problem," presented at the International Conference on Computer Aided Design, 1987.
[21] J. Rajski and J. Vasudevamurthy, "The testability-preserving concurrent decomposition and factorization of Boolean expressions," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 11, pp. 778-793, 1992.
[22] A. V. Aho and S. C. Johnson, "Optimal Code Generation for Expression Trees," Journal of the ACM, vol. 23, pp. 488-501, 1976.
[23] A. V. Aho, S. C. Johnson, and J. D. Ullman, "Code Generation for Expressions with Common Subexpressions," Journal of the ACM, vol. 24, pp. 146-160, 1977.
[24] R. Sethi and J. D. Ullman, "The Generation of Optimal Code for Arithmetic Expressions," Journal of the ACM, vol. 17, pp. 715-728, 1970.
[25] M. A. Breuer, "Generation of Optimal Code for Expressions via Factorization," Communications of the ACM, vol. 12, pp. 333-340, 1969.
[26] A. Nicolau and R. Potasman, "Incremental tree height reduction for high level synthesis," presented at the 28th ACM/IEEE Design Automation Conference, 1991.
[27] B. Landwehr and P. Marwedel, "A new optimization technique for improving resource exploitation and critical path minimization," presented at the Tenth International Symposium on System Synthesis, 1997.
[28] M. Potkonjak, S. Dey, Z. Iqbal, and A. C. Parker, "High performance embedded system optimization using algebraic and generalized retiming techniques," presented at the IEEE International Conference on Computer Design (ICCD), 1993.
[29] G. Nurnberger, J. W. Schmidt, and G. Walz, Multivariate Approximation and Splines. Springer Verlag, 1997.
[30] J.-M. Chang and M. Pedram, "Energy minimization using multiple supply voltages," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 5, pp. 436-443, 1997.
[31] E. Macii, M. Pedram, and F. Somenzi, "High-level power modeling, estimation, and optimization," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, pp. 1061-1079, 1998.
[32] A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W. Brodersen, "Optimizing power using transformations," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, pp. 12-31, 1995.
[33] M. A. Miranda, F. V. M. Catthoor, M. Janssen, and H. J. De Man, "High-level address optimization and synthesis techniques for data-transfer-intensive applications," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, pp. 677-686, 1998.
[34] R. Rudell, "Logic Synthesis for VLSI Design," PhD thesis, University of California, Berkeley, 1989.
[35] "ARM7TDMI Technical Reference Manual," 2003.
[36] V. Krishna, N. Ranganathan, and N. Vijayakrishnan, "An Energy Efficient Scheduling Scheme for Signal Processing Applications," presented at the International Conference on VLSI Design, 1999.
[37] P. M. Embree, C Algorithms for Real-Time DSP. Prentice Hall, 1995.
[38] A. Sinha and A. P. Chandrakasan, "JouleTrack: a Web based tool for software energy profiling," presented at the Design Automation Conference, 2001.
Response to Reviewer #1's comments

We thank the reviewer for reading our paper and providing valuable comments. We have tried our best to
address the problems in the paper pointed out by the reviewer.
The reviewer pointed out that the differences between the algebraic techniques for Boolean expressions and for
arithmetic expressions need to be clearly pointed out. The reviewer mentions that there are two restrictions used in
the model of Boolean expressions in Brayton’s 1982 paper (The Decomposition and Factorization of Boolean
Expressions). Though these restrictions are not explicitly mentioned in the paper, we believe that the reviewer was
referring to the restriction on Single Cube Containment (SCC) and the restriction on the quotient and divisor
during algebraic division. We modified the former restriction and removed the latter one. The revised version of the
paper contains a section (Section 3.3) that details these differences. The central theorem (in Brayton’s paper) still
holds and has been reformulated, and explained (proved) using three different theorems in Sections 3.4 and 3.6.
The reviewer also mentions that we need to talk about quality of the generated results. In the revised paper, we
have a Section 4.3 that talks about the algorithm complexity and quality. There is no known ‘optimal’ solution to the
common subexpression elimination problem, except for a brute force exploration of all possible subexpressions in
different possible sequences of selections. Since this is not practical, we do not compare our results with the
‘optimal’ one.
The reviewer also points out that we need to compare our work with the work done by mathematicians on the
Symbolic handling of polynomials. We are not aware of any work for the problem of factoring and eliminating
common subexpressions. There has been some work in High Level Synthesis using Symbolic Algebra methods
(references [6] and [15]), but they attack a different problem, that of mapping polynomials to library polynomials.
We talk about their work in the Related Work section (Page 4, last paragraph).
Response to Reviewer #2’s comments
We would like to thank the reviewer for reading our paper, providing valuable comments and suggesting
changes. We have made all the changes suggested by the reviewer.
The reviewer has some comments relating to the clarity of the paper. Section 3.3 (now Section 3.4), which
explains the kernel generation algorithm, has been rewritten. In the experimental results section, one of the examples
shows a negative improvement (-12.7%) in the energy-delay product. This was not explained in the earlier version
of the paper, but has now been explained in the revised version. The reason for the negative improvement for this
example is that although the number of operations is reduced using our methods, the length of the critical path is
increased; using more multipliers, the polynomial optimized with the CSE method has a much shorter delay, which
makes the energy-delay product smaller for that example.
The other typographical and stylistic changes suggested by the reviewer have been incorporated in the revised
version.
Response to Reviewer #3’s comments
We thank the reviewer for reading our paper and providing detailed comments and suggestions. We have
addressed each of the concerns individually. The reviewer gives a number of comments, which are addressed
below.
0) Including the reference to our paper in International Workshop on Logic and Synthesis
This has now been included.
1) Referencing other works on Algebraic transformations
In the revised paper, we have made references (in the Related Work section) to the work done on custom address
generation by M. A. Miranda et al. This paper describes a number of algebraic transformations for the synthesis of
custom address generators, but they do not present any specific techniques for algebraic factorization and common
subexpression elimination, which is the focus of our paper. We have also referenced the work done by B. Landwehr
et al. on critical path minimization using algebraic transformations. In this paper, the authors present a genetic
algorithm based framework, where all possible algebraic formulations are considered for the evaluation of a
Dataflow graph (DFG). But this paper does not really present any method for common subexpression elimination. In
our method, we implicitly explore the commutative, associative and distributive properties of arithmetic operations.
We do not make use of the properties of constant multiplications such as 2*X = X<<1, which is explored in their
paper. This kind of strength reduction can be performed as a post processing step after our algorithm.
2) Central theorem is not clear and not explained
In the revised version of the paper, we have reformulated the original theorem in Brayton’s paper into three
different theorems, each of which has been explained and proved. They are now in Sections 3.4 and 3.6. The
meaning of multiple term(cube) intersection has also been explained. The reviewer mentioned that there does not
seem to be a relation between co-kernel of an expression and common subexpression. There is actually no relation
between the two. A co-kernel and kernel pair represents a possible factorization opportunity. A kernel on the other
hand is a potential common subexpression.
3) Complexity of the presented algorithms is not mentioned
In the revised paper, we have a new Section 4.3 that talks about the complexity and the quality of the
presented algorithms. In the worst case, the kernel generation algorithm is exponential in the number of variables,
and the algorithm for finding kernel intersections is exponential in the number of kernels that are generated. This
worst case happens only when there is a uniform distribution of the variables in the terms of the expression, which
we have not found in practical polynomial expressions. The complexity of finding single cube intersections is also
exponential in the number of variables.
4) More details for the experimental workflow are needed
We did not perform any global optimizations. These optimizations were done only on polynomials in local
basic blocks.
5) Do the techniques optimize only a single expression?
Our technique optimizes multiple polynomial expressions, similar to the way multiple Boolean expressions
are optimized in Logic synthesis.
6) More detailed experiments to measure clock cycle improvement is needed
We used the ARM simulator Jouletrack for measuring the reduction in latency and energy consumption for
computing the polynomials. We simulated the programs with the same set of randomly generated inputs. The results
are included in the revised version of the paper.
7) Why does the filter have so many multiplications?
We approximated the trigonometric functions using Taylor series expansion to the seventh order for the first
two examples. We did this to use our algorithms on large polynomial expressions. In practice, the values of the
trigonometric functions can be obtained from a lookup table. The remaining four examples are polynomials used in
computer graphics, where we have not used any Taylor series approximations.
8) Correction in the result tables
The tables have now been corrected.