Fast Bilinear Algorithms for Convolution
Caleb Ju
CS598EVS
March 5, 2020
Convolution
The discrete convolution between vectors f ∈ R^r and g ∈ R^n is

y_k = ∑_i f_i g_{k−i}.
View this as a matrix–vector product between a matrix T and the vector f,

y_k = ∑_i g_{k−i} f_i = ∑_j T_{k,j} · f_j,  i.e.,  y = Tf.

What does the matrix T look like?
Denote it as T_⟨g,r⟩, which is a Toeplitz matrix, where T_⟨g,r⟩ ∈ R^{(n+r−1)×r}.
T_⟨g,r⟩ =
  ⎡ g_0                       ⎤
  ⎢   ⋮        g_0            ⎥
  ⎢ g_{n−1}     ⋮    ⋱        ⎥
  ⎢          g_{n−1}     g_0  ⎥
  ⎢                  ⋱    ⋮   ⎥
  ⎣                  g_{n−1}  ⎦

with zeros elsewhere: column j holds the entries of g shifted down by j.
Convolution and its Variants
Linear convolution is

y_k = ∑_{i = max(0, k−n+1)}^{min(k, r−1)} f_i g_{k−i}.

The bounds on i ensure that the indices never run past either end of f or g.
We also have cyclic convolution,

y_k = ∑_{i=0}^{r−1} f_i g_{(k−i) mod n}.

Can also derive correlation,

y_k = ∑_{i=0}^{r−1} f_i g_{k+i}.
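The three variants above can be checked directly in NumPy (a quick sketch, not part of the original slides; the cyclic and correlation sums are written out literally to mirror the formulas):

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])          # filter, r = 3
g = np.array([4.0, 5.0, 6.0, 7.0])     # signal, n = 4
n = len(g)

# Linear convolution: y_k = sum_i f_i g_{k-i}, output length n + r - 1.
y_lin = np.convolve(f, g)

# Cyclic convolution: the index into g wraps around modulo n.
y_cyc = np.array([sum(f[i] * g[(k - i) % n] for i in range(len(f)))
                  for k in range(n)])

# Correlation: y_k = sum_i f_i g_{k+i} (no index reversal).
y_corr = np.array([sum(f[i] * g[k + i] for i in range(len(f)))
                   for k in range(n - len(f) + 1)])

print(y_lin)    # [ 4. 13. 28. 34. 32. 21.]
print(y_cyc)    # [36. 34. 28. 34.]
print(y_corr)   # [32. 38.]
```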
Applications of Convolution
String matching (Clifford and Clifford, 2007)
Let the pattern be p ∈ Σ^m and the text be t ∈ Σ^n.

∑_{j=0}^{m−1} (p_j − t_{i+j})² = ∑_{j=0}^{m−1} (p_j² − 2 p_j t_{i+j} + t_{i+j}²),  ∀ 0 ≤ i ≤ n − m.
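Only the cross term −2 ∑_j p_j t_{i+j} couples the pattern and the text, and it is exactly a correlation, so all n − m + 1 distances come out of one convolution-style pass. A small sketch (the function name `match_distances` is hypothetical, not from Clifford and Clifford):

```python
import numpy as np

def match_distances(p, t):
    """d_i = sum_j (p_j - t_{i+j})^2 for every alignment i, via correlation."""
    p = np.asarray(p, dtype=float)
    t = np.asarray(t, dtype=float)
    m = len(p)
    pp = np.sum(p ** 2)                               # constant term
    cross = np.correlate(t, p, mode="valid")          # sum_j p_j t_{i+j}
    tt = np.correlate(t ** 2, np.ones(m), "valid")    # sliding sum of t^2
    return pp - 2 * cross + tt

t = np.array([1, 2, 3, 2, 1, 2, 3], dtype=float)
p = np.array([2, 3, 2], dtype=float)
print(match_distances(p, t))   # a zero marks an exact match of p in t
```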
Image Processing (Convolutional Neural Network)
Given K filters in tensor F of size r × r and N input images in tensor G of size n × n, seek to sum over all H channels,

y_{ikxy} = ∑_{c=1}^{H} ∑_{v=1}^{r} ∑_{u=1}^{r} f_{kcuv} · g_{i,c,x+u,y+v}.

Other applications: cosmological simulation, solutions to partial differential equations, signal processing, integer multiplication, …
Fast Algorithms for Computing Convolution
A direct computation has O(n²) cost.
Consider complex multiplication,

x × y = (a + bi) × (c + di) = (ac − bd) + (ad + bc)i
      = (ac − bd) + (ac + bd − (a − b)(c − d)) i.

Karatsuba's algorithm applies this recursively for O(n^{log₂ 3}) cost.
Convolution can also be solved by the discrete Fourier transform,

a ∗ b = IDFT(DFT(a) ⊙ DFT(b)).

Using the fast Fourier transform (FFT), we can compute linear convolution in O(n log n) time.
Other algorithms: Winograd's minimal filtering method, matrix multiplication, fast symmetric multiplication
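The DFT identity above is a one-liner with NumPy's FFT: zero-pad both inputs to the output length n + r − 1, multiply the transforms pointwise, and invert (a sketch; `fft_linear_conv` is an illustrative name):

```python
import numpy as np

def fft_linear_conv(a, b):
    """Linear convolution via a * b = IDFT(DFT(a) . DFT(b)),
    with both inputs zero-padded to length len(a) + len(b) - 1."""
    L = len(a) + len(b) - 1
    return np.fft.irfft(np.fft.rfft(a, L) * np.fft.rfft(b, L), L)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0, 7.0])
print(np.round(fft_linear_conv(a, b), 10))   # matches np.convolve(a, b)
```

Without the padding, pointwise multiplication of length-n DFTs computes cyclic (not linear) convolution.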
Derivation of Bilinear Algorithms
Recall a bilinear algorithm is

c = F^(C) ((F^(A)ᵀ a) ⊙ (F^(B)ᵀ b)),  i.e.,  c_i = ∑_j ∑_k t_{ijk} a_j b_k.
The discrete linear convolution of f and g is given by

y_k = ∑_{i = max(0, k−n+1)}^{min(k, r−1)} f_i · g_{k−i} = ∑_{i,j} t_{ijk} f_i g_j,

where the tensor T is defined by

t_{ijk} = 1 if i + j − k = 0, and 0 otherwise.
Convolution is Multiplication
How can we derive fast bilinear algorithms for convolution?
Define polynomials a(x) = a_0 + a_1 x + ⋯ + a_{r−1} x^{r−1} and
b(x) = b_0 + b_1 x + ⋯ + b_{n−1} x^{n−1}. Their product is

c(x) = a(x)b(x) = ∑_{k=0}^{r+n−2} ( ∑_{i = max(0, k−n+1)}^{min(k, r−1)} a_i · b_{k−i} ) x^k.

The coefficients of c(x) = c_0 + c_1 x + ⋯ + c_{r+n−2} x^{r+n−2} are determined by linear convolution.
Convolution as Multiplication
How can we compute c(x)? Suppose we know the values c(x_i) at nodes x_0, …, x_{R−1}, where R = deg c(x) + 1. Let the coefficients of c(x) be c. We can get c from

c(x_i) = ∑_{k=0}^{R−1} x_i^k c_k = V_{i,:} c,  where  V =
  ⎡ x_0^0      ⋯  x_0^{R−1}      ⎤
  ⎢    ⋮             ⋮           ⎥
  ⎣ x_{R−1}^0  ⋯  x_{R−1}^{R−1}  ⎦  ∈ C^{R×R}.
How can we compute c(x_i)? Recall c(x) = a(x)b(x). Therefore,

c(x_i) = a(x_i) b(x_i).

How can we compute a(x_i)? Let a be the coefficients of polynomial a(x) (and b those of b(x)). Then computing a(x_i) is an inner product,

a(x_i) = ∑_{k=0}^{r−1} x_i^k a_k = V̄_{i,:} a,  where V̄ is the first r columns of V.
Toom-Cook Algorithm
Toom-Cook
1. Evaluate α = V̄a and β = V̄b
2. Compute the products ν = α ⊙ β
3. Interpolate by solving the linear system Vc = ν

Can prescribe this three-step computation as the following bilinear algorithm,

c = V⁻¹_{(2n−1 × 2n−1)} (V_{(2n−1 × n)} a ⊙ V_{(2n−1 × n)} b),

where V is the Vandermonde matrix

V =
  ⎡ x_0^0      ⋯  x_0^{R−1}      ⎤
  ⎢    ⋮             ⋮           ⎥
  ⎣ x_{R−1}^0  ⋯  x_{R−1}^{R−1}  ⎦.
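The three steps map directly onto NumPy's Vandermonde helpers; a minimal sketch (integer nodes 0, 1, …, R−1 are one arbitrary choice of distinct points, and `toom_cook_conv` is an illustrative name):

```python
import numpy as np

def toom_cook_conv(a, b, nodes=None):
    """Linear convolution c = a * b by evaluation at R distinct nodes
    followed by interpolation (the Toom-Cook bilinear algorithm)."""
    R = len(a) + len(b) - 1
    if nodes is None:
        nodes = np.arange(R, dtype=float)     # any R distinct points work
    V = np.vander(nodes, R, increasing=True)  # full R x R Vandermonde
    Va = V[:, :len(a)]                        # first r columns (V-bar)
    Vb = V[:, :len(b)]                        # first n columns
    nu = (Va @ a) * (Vb @ b)                  # step 2: element-wise products
    return np.linalg.solve(V, nu)             # step 3: solve V c = nu

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0, 7.0])
print(np.round(toom_cook_conv(a, b), 8))      # matches np.convolve(a, b)
```

The growth of the entries x_i^k in V is exactly what makes this scheme ill-conditioned beyond small sizes, as discussed later.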
Discrete Fourier Transform
Figure: (a) Chebyshev nodes; (b) equispaced nodes on the unit circle
Discrete Fourier Transform
Let ω_{(n)} = exp(−2πi/n), the nth primitive root of unity. Set the nodes of V to [ω⁰_{(n)}, ω_{(n)}, …, ω^{R−1}_{(n)}]. Then V is the Fourier matrix (and V⁻¹ is the inverse Fourier matrix), leading to the bilinear algorithm

c = F⁻¹_{(2n−1 × 2n−1)} (F_{(2n−1 × n)} a ⊙ F_{(2n−1 × n)} b).
Alternative Bilinear Algorithms
The Toom-Cook method and the fast Fourier transform work well for small and large convolution problems, respectively.
I The Toom-Cook method is numerically inaccurate for convolutions of size greater than four
I The FFT has significant hidden constants
Now we examine alternative algorithms that offer trade-offs between computational efficiency and numerical accuracy.
Modular Polynomial Multiplication
Let's revisit convolution as a polynomial multiplication problem,

c(x) = a(x)b(x) = ∑_{k=0}^{2n−2} ( ∑_{i = max(0, k−n+1)}^{min(k, n−1)} a_i · b_{k−i} ) x^k.

What is the remainder of c(x) divided by a polynomial M where deg M > deg c(x)?

c(x) = r(x) ≡ c(x) (mod M).

What if we use a polynomial m where deg m ≤ deg c(x)?

c(x) ≠ r(x) ≡ c(x) (mod m).
Modular Polynomial Multiplication
Why use modular polynomial multiplication? Working modulo a polynomial decreases the size of the inputs.

c(x) ≡ a(x)b(x) ≡ (a(x) mod m)(b(x) mod m) (mod m).

However, this yields an answer that is only congruent to the actual product, i.e. not the solution we actually want.
Can we compute the polynomial multiplication using modular polynomial multiplication?
Yes, using the Chinese Remainder Theorem.
Chinese Remainder Theorem
Theorem
Let m^(1), …, m^(k) be coprime integers and M = ∏_{i=1}^{k} m^(i). Given remainders r^(1), …, r^(k) where 0 ≤ r^(i) < m^(i), the Chinese Remainder Theorem (CRT) asserts that there exists a unique integer x (modulo M) such that

x ≡ r^(i) (mod m^(i))  ∀ i ∈ [k].

Further, this mapping between an integer and its remainders is a ring isomorphism (structure preserving).
Example
Let m^(1) = 3, m^(2) = 4, and M = 12. Let x = 7 (mod M), with remainders

x ≡ r^(1) ≡ 1 (mod 3)  and  x ≡ r^(2) ≡ 3 (mod 4).
Chinese Remainder Theorem: Example
Let x ≡ 7 (mod 12). Seek to compute (7 × 4) (mod 12).
Figure: Ring Isomorphism
Chinese Remainder Theorem: Example
x ≡ r^(1) ≡ 1 (mod 3)  and  x ≡ r^(2) ≡ 3 (mod 4).
Chinese Remainder Theorem: Example
r′^(1) ≡ r^(1) × 4 ≡ 4 ≡ 1 (mod 3)  and  r′^(2) ≡ r^(2) × 4 ≡ 0 (mod 4).
Chinese Remainder Theorem: Example
y ≡ 28 ≡ 4 (mod 12) satisfies r′^(1) ≡ 1 (mod 3) and r′^(2) ≡ 0 (mod 4).
Modular Polynomial Multiplication
Akin to interpolation, modular polynomial multiplication can be computed via
I Compute the remainders of a(x) and b(x) for a series of coprime divisors m^(i)
I Multiply the corresponding remainders (can use normal polynomial multiplication)
I Map the remainders back to their (unique) polynomial
How do we recover the polynomial from its remainders? The Chinese Remainder Theorem also tells us how to do so.
Chinese Remainder Theorem (part 2)
Theorem
Recall the polynomial divisors m^(i) are coprime, M = ∏_i m^(i), and we have a set of remainders r^(i). To solve for x, we compute

x = ( ∑_{i=1}^{k} r^(i) M^(i) N^(i) ) mod M,

where M^(i) = M/m^(i), and N^(i) and n^(i) are arbitrary polynomials satisfying Bezout's identity,

M^(i) N^(i) + m^(i) n^(i) = 1.
Chinese Remainder Theorem (part 2): Example
Coprime polynomial divisors m^(i), where M = ∏_i m^(i) and M^(i) = M/m^(i). Let N^(i), n^(i) be such that M^(i) N^(i) + m^(i) n^(i) = 1 for all i. The solution is

x = ( ∑_{i=1}^{k} r^(i) M^(i) N^(i) ) mod M.

Compute the product y = (4 × 7) (mod 12).

Have M^(1) = 4, m^(1) = 3, M^(2) = 3, m^(2) = 4, and M = 12, with remainders r′^(1) ≡ 1 (mod 3) and r′^(2) ≡ 0 (mod 4).

See that 4N^(1) + 3n^(1) = 1 and 3N^(2) + 4n^(2) = 1 are satisfied with N^(1) = 1, n^(1) = −1, N^(2) = −1, and n^(2) = 1.

So we have

∑_i r′^(i) M^(i) N^(i) = 1(4)(1) + 0(3)(−1) = 4 ≡ 28 (mod 12).
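The same recovery runs in a few lines of Python, with the Bezout coefficients N^(i) supplied by the extended Euclidean algorithm (a sketch; `ext_gcd` and `crt` are illustrative helper names, not from the talk):

```python
def ext_gcd(a, b):
    """Return (g, u, v) with a*u + b*v = g = gcd(a, b) (Bezout's identity)."""
    if b == 0:
        return a, 1, 0
    g, u, v = ext_gcd(b, a % b)
    return g, v, u - (a // b) * v

def crt(remainders, moduli):
    """Recover x mod M from its residues via x = sum_i r_i M_i N_i (mod M)."""
    M = 1
    for m in moduli:
        M *= m
    x = 0
    for r, m in zip(remainders, moduli):
        Mi = M // m
        _, Ni, _ = ext_gcd(Mi, m)   # Mi*Ni + m*ni = 1
        x += r * Mi * Ni
    return x % M

# The slide's example: the residues of 28 = 7 * 4 modulo 3 and 4.
print(crt([1, 0], [3, 4]))   # -> 4, i.e. 28 mod 12
```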
Chinese Remainder Theorem (part 2)
x = ( ∑_{i=1}^{k} r^(i) M^(i) N^(i) ) mod M

Why does this work?
Since M^(i) N^(i) = 1 − m^(i) n^(i), then for a fixed i,

x = ∑_j r^(j) M^(j) N^(j) ≡ r^(i) (1 − m^(i) n^(i)) ≡ r^(i) (mod m^(i)),

since every term with j ≠ i contains the factor M^(j), which is divisible by m^(i).
The Chinese Remainder Theorem tells us there is a bijection between the remainders and the original polynomial. Therefore, any polynomial satisfying the remainder equivalences is equivalent to the original polynomial (modulo M)!
Modular Polynomial Multiplication
The Chinese Remainder Theorem required that M^(i) N^(i) + m^(i) n^(i) = 1 for all i. Do such N^(i), n^(i) even exist?

Theorem (Bezout's identity)
Let p and q be coprime polynomials (they do not share any roots); then there exist polynomials u and v such that pu + qv = 1.

Since M^(i) and m^(i) are coprime, there exist polynomials N^(i) and n^(i) such that

M^(i) N^(i) + m^(i) n^(i) = 1.
Winograd Convolution Algorithm
Let f ∈ R^r and g ∈ R^n be the vectors we seek to convolve. Recall that we first compute the remainders,

f ≡ r^(i)(f) (mod m^(i))  and  g ≡ r^(i)(g) (mod m^(i)).

Next, we compute the product of remainders using a convolution algorithm,

r^(i) = (r^(i)(f) ∗ r^(i)(g)) (mod m^(i)).

We use the Chinese Remainder Theorem to recover the unique solution,

y = ( ∑_i r^(i) ∗ M^(i) ∗ N^(i) ) (mod M),

where M^(i) = M/m^(i) and M^(i) N^(i) + m^(i) n^(i) = 1.
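A minimal end-to-end sketch of these three steps, using degree-one moduli m^(i)(x) = x − α_i so that "evaluate the remainder" is just polynomial evaluation (real Winograd kernels use higher-degree divisors; the variable names here are illustrative):

```python
import numpy as np

# Convolve f, g in R^2 via CRT with moduli x, x-1, x+1 (M = x^3 - x,
# deg M = n + r - 1 = 3). For a degree-1 modulus, p mod (x - alpha) = p(alpha).
alphas = [0.0, 1.0, -1.0]        # roots of m^(1), m^(2), m^(3)

f = np.array([1.0, 2.0])         # f(x) = 1 + 2x   (coefficients, low first)
g = np.array([3.0, 4.0])         # g(x) = 3 + 4x

# Step 1: remainders of f and g modulo each m^(i).
rf = [np.polyval(f[::-1], a) for a in alphas]
rg = [np.polyval(g[::-1], a) for a in alphas]

# Step 2: multiply the corresponding remainders.
r = [u * v for u, v in zip(rf, rg)]

# Step 3: CRT. M^(i) = M / m^(i); N^(i) = 1 / M^(i)(alpha_i) satisfies
# Bezout's identity modulo m^(i). deg(sum) < deg M, so no final reduction.
y = np.zeros(3)
for i, a in enumerate(alphas):
    Mi = np.poly([b for b in alphas if b != a])[::-1]   # coeffs, low first
    Ni = 1.0 / np.polyval(Mi[::-1], a)
    y += r[i] * Ni * Mi

print(np.round(y, 10))    # coefficients of f * g = 3 + 10x + 8x^2
```

With all divisors of degree one this collapses to Lagrange interpolation, i.e. Toom-Cook; the extra generality of Winograd comes from higher-degree m^(i).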
Toom-Cook vs. Winograd Convolution Algorithm
Toom-Cook
1. Evaluate at a set of unique integer points
2. Compute the element-wise multiplication (these are evaluated points of the product)
3. Interpolate to recover the product polynomial

Winograd Convolution Algorithm
1. Evaluate the remainder with the set of coprime polynomial divisors m^(i)
2. Compute the element-wise polynomial multiplication (via convolution)
3. Use the CRT to recover the product polynomial modulo M
Evaluate the Remainder of a Polynomial Division
Denote the coefficients of an arbitrary polynomial p as p, e.g. p = 3x² − 1 is represented as p = [−1 0 3]^T.
Let p and m be polynomials where deg(m) ≤ deg(p).

Modulo Operation
Lemma
Let r = p (mod m), with d = deg p. There exists a matrix X_⟨m,d⟩ such that r = X_⟨m,d⟩ p.
Evaluate the Remainder of a Polynomial Division
Lemma
Let r = p (mod m), with d = deg p. There exists a matrix X_⟨m,d⟩ such that r = X_⟨m,d⟩ p.

Proof.
We know p = mq + r for some polynomial q. Then,

T_⟨m,r⟩ q + r =
  ⎡ m_0                      ⎤
  ⎢   ⋮         ⋱            ⎥
  ⎢ m_{deg m−1}      m_0     ⎥  q + r  =  [U; L] q + [r; 0]  =  [p^(A); p^(B)],
  ⎢             ⋱     ⋮      ⎥
  ⎣             m_{deg m−1}  ⎦

splitting the Toeplitz matrix into a top block U and a bottom block L, and p into the corresponding blocks p^(A) and p^(B). Solving both systems, we get r = −U L⁻¹ p^(B) + p^(A).
Solve Bezout’s identity
Lemma
Write MN + mn = 1 as

A [N; n] = [1 0 ⋯ 0]^T,  where  A = [ T_⟨M, deg m−1⟩  T_⟨m, deg M−1⟩ ].

Proof.
Show that the matrix A is invertible.
Winograd Convolution Algorithm
Theorem (Winograd Convolution Algorithm)
Given coprime polynomials m^(1), m^(2) such that M = m^(1) m^(2) and deg M = n + r − 1, and bilinear algorithms (A^(i), B^(i), C^(i)) for a convolution of dimension deg m^(i) for i ∈ {1, 2}, then (A, B, C) is a convolution algorithm for vectors of dimension r and n, where

A = [ X^T_⟨m^(1), r−1⟩ A^(1) ,  X^T_⟨m^(2), r−1⟩ A^(2) ],
B = [ X^T_⟨m^(1), n−1⟩ B^(1) ,  X^T_⟨m^(2), n−1⟩ B^(2) ],  and
C = [ C̄^(1) ,  C̄^(2) ],

with C̄^(i) = X_⟨M, deg M + deg m^(i) − 2⟩ T_⟨e^(i), deg m^(i)⟩ X_⟨m^(i), 2 deg m^(i) − 1⟩ C^(i), and the polynomial e^(i) = M^(i) N^(i) mod M is defined from Bezout's identity.
Rank of Winograd Convolution Algorithm
Given f ∈ R^r and g ∈ R^n, the solution is y ∈ R^{r+n−1}. Therefore, select M to be a degree-(n + r − 1) polynomial.

Remark. The bilinear rank R of the Winograd convolution algorithm with polynomial divisors m^(1), …, m^(k) is

∑_{i=1}^{k} (2 deg m^(i) − 1).

Observation. Increasing the bilinear rank of the Winograd convolution with at least one divisor of degree greater than one improves the numerical accuracy of the convolution.
Nested and Multidimensional Convolution
Given F, G ∈ R^{n×n}, a 2D convolution is defined as

y_{xy} = ∑_{i=0}^{r} ∑_{j=0}^{r} f_{ij} g_{x+i, y+j} = ∑_i ∑_j f_{ij} g_{uv}  (with u = x+i, v = y+j).

Can nest the tensors,

y_{xy} = ∑_{i=0}^{r} ∑_{j=0}^{r} ∑_{u=0}^{n} ∑_{v=0}^{n} t^{(A)}_{ixu} t^{(B)}_{jyv} f_{ij} g_{uv}.

Equivalently, we have the following nested bilinear algorithm,

vec(Y) = (C ⊗ C) [ ((A ⊗ A)^T vec(F)) ⊙ ((B ⊗ B)^T vec(G)) ],

or, equivalently,

Y = C [ (A^T F A) ⊙ (B^T G B) ] C^T.
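The matrix form Y = C[(AᵀFA) ⊙ (BᵀGB)]Cᵀ can be checked numerically; here is a sketch using a real Vandermonde triplet (an evaluation-interpolation instance of a bilinear algorithm; the Fourier matrix would work the same way):

```python
import numpy as np

r, n = 2, 3
R = r + n - 1
nodes = np.arange(R, dtype=float)
V = np.vander(nodes, R, increasing=True)
A = V[:, :r].T           # encoding for F: A^T evaluates a degree-(r-1) poly
B = V[:, :n].T           # encoding for G
C = np.linalg.inv(V)     # decoding: interpolation

F = np.arange(1.0, r * r + 1).reshape(r, r)
G = np.arange(1.0, n * n + 1).reshape(n, n)

# Nested bilinear algorithm: Y = C[(A^T F A) . (B^T G B)] C^T.
Y = C @ ((A.T @ F @ A) * (B.T @ G @ B)) @ C.T

# Direct 2D linear convolution for comparison.
Y_ref = np.zeros((R, R))
for i in range(r):
    for j in range(r):
        Y_ref[i:i + n, j:j + n] += F[i, j] * G

print(np.allclose(Y, Y_ref))   # True
```

Each side of the sandwich applies the 1D algorithm along one dimension, which is exactly the (C ⊗ C) Kronecker form written above.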
Overlap Add
We can use multidimensional convolution to solve 1D convolution problems.
Let the recomposition matrix be

Q_(γ,η) =
  ⎡ I_{η−1}                   ⎤
  ⎢    1                      ⎥
  ⎢ I_{η−1}  I_{η−1}          ⎥
  ⎢             1             ⎥
  ⎢               ⋱           ⎥
  ⎢       I_{η−1}  I_{η−1}    ⎥
  ⎢                   1       ⎥
  ⎣                I_{η−1}    ⎦.

Lemma
Let Y = F ∗ G, where F, G ∈ R^{γ×η}. Then, if f = vec(F) and g = vec(G), we have f ∗ g = vec(Q_(γ,η) Y).
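The recombination Q_(γ,η) performs is the classic overlap-add idea: convolve blocks independently, then add the partial results at shifted offsets. A direct 1D sketch of that idea (not the paper's matrix formulation; `overlap_add_conv` is an illustrative name):

```python
import numpy as np

def overlap_add_conv(f, g, block=4):
    """Long linear convolution f * g by splitting f into blocks, convolving
    each block with g, and adding the shifted partial results."""
    out = np.zeros(len(f) + len(g) - 1)
    for start in range(0, len(f), block):
        piece = np.convolve(f[start:start + block], g)
        out[start:start + len(piece)] += piece     # overlapping tails add up
    return out

f = np.arange(1.0, 11.0)          # length-10 signal
g = np.array([1.0, -1.0, 2.0])
print(np.allclose(overlap_add_conv(f, g), np.convolve(f, g)))   # True
```

Correctness follows from linearity and shift-equivariance of convolution, which is exactly what the lemma's vec(Q_(γ,η) Y) identity encodes.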
Numerical Accuracy
Figure: 1D convolution error
Numerical Accuracy
Figure: 2D convolution error
Properties of Bilinear Algorithms
Matrix Interchange
I How can we build new algorithms with the same encoding/decoding matrices?
I Can we design new algorithms with the same complexity as similar bilinear algorithms?
Asymptotic Complexity
I The role of bilinear rank.
I How can we nest bilinear algorithms?
Lower Bounds
I What are lower bounds for bilinear algorithms?
Matrix Interchange
Recall the definitions of the discrete convolution and correlation algorithms,

y_k = ∑_{i=0}^{r−1} f_i g_{k−i}  and  y_k = ∑_{i=0}^{r−1} f_i g_{k+i}.

Theorem (Matrix Interchange)
Let the bilinear algorithm for the discrete convolution of f and g be defined as C((A^T f) ⊙ (B^T g)). The correlation algorithm with output size m = n is

B((A^T f) ⊙ (C^T g)).
Matrix Interchange
Let the bilinear algorithm for the discrete convolution of f and g be defined as C((A^T f) ⊙ (B^T g)). The correlation algorithm with output size m = n is

B((A^T f) ⊙ (C^T g)).

Proof.
The tensor T in y_k = ∑_{ij} t_{ijk} f_i g_j is 1 if and only if i + j − k = 0. Moreover, the tensor T^corr in y_k = ∑_{ij} t^corr_{ijk} f_i g_j is one if and only if i − j + k = 0.
We see that the roles of index j (belonging to encoding matrix B) and index k (belonging to decoding matrix C) are swapped.
Bilinear Rank
We will denote the bilinear algorithm

y_k = ∑_{l=0}^{R−1} c_{kl} ( ∑_{i=0}^{r−1} a_{il} f_i ) ( ∑_{j=0}^{n−1} b_{jl} g_j ),  i.e.,  y = C[(A^T f) ⊙ (B^T g)],

by the triplet (A, B, C). The variable R determines the number of element-wise multiplications.
Theorem (Correlation Rank Lower Bound (Winograd, 1980))
Given a filter of size r and output of size m, the minimum rank of a correlation algorithm is m + r − 1.

Corollary
Given a filter of size r and input of size n, the minimum rank of a linear convolution algorithm is n + r − 1.
Asymptotic Complexity
Like in matrix multiplication, we can recursively compute a larger convolution using a smaller one.
Given a convolution algorithm that divides the problem by size b and has bilinear rank R, the cost of the algorithm is

T(n) = R · T(n/b) + (c · b) · n/b = R · T(n/b) + c · n  ⟹  T(n) = O(n^{log_b R}).
Error Bounds
Convolution is an ill-posed problem.
Consider the cyclic convolution

[1, −1, 1, −1, …] ∗_cyclic [1, 1, 1, 1, …] = [0, 0, 0, 0, …].

Therefore, we will use absolute error rather than relative error.
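A quick numerical check of the example above (a sketch, written out with the cyclic-convolution sum directly): the output is exactly zero, so any perturbation of it has unbounded relative error.

```python
import numpy as np

f = np.array([1.0, -1.0, 1.0, -1.0])
g = np.ones(4)
n = len(g)

# Cyclic convolution y_k = sum_i f_i g_{(k-i) mod n}.
y = np.array([sum(f[i] * g[(k - i) % n] for i in range(n)) for k in range(n)])
print(y)   # [0. 0. 0. 0.]
```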
Error Bounds
Theorem (1D bilinear algorithm convolution error)
Given inputs f ∈ R^r and g ∈ R^n, the absolute error of the bilinear algorithm is

‖δy‖ ≤ 2 (‖C‖ · ‖A‖ · ‖B‖ · ‖f‖ · ‖g‖) ε + O(ε²),

where ‖·‖ is the 2-norm.

Corollary
A d-nested convolution with F ∈ R^{r×⋯×r} and G ∈ R^{n×⋯×n} has an error of

‖δY‖ ≤ 2 (‖C‖^d · ‖A‖^d · ‖B‖^d · ‖vec(F)‖ · ‖vec(G)‖) ε + O(ε²).
Error Bounds
Proof.
We can use the fact ‖Ax‖ ≤ ‖A‖ · ‖x‖ for the encoding and decoding steps. To bound the error from the element-wise product, we use the fact that

‖x ⊙ y‖² = ∑_i |x_i y_i|² ≤ ( ∑_i |x_i|² ) ( ∑_i |y_i|² ) = ‖x‖² · ‖y‖²,

which leads to ‖x ⊙ y‖ ≤ ‖x‖ · ‖y‖.
Error Mitigation
Theorem (Pan 2016)
For a Vandermonde matrix V whose largest-magnitude node is s, the condition number satisfies

κ(V) = Ω( s^{n−1} / √n ).

We need to find ways to either decrease κ(V) or use a different matrix.
Error Mitigation
Better node choice
Numerical accuracy of interpolation improves with better node choices:
I Chebyshev nodes
I Brute-force search
Can combine small convolution algorithms into larger convolution algorithms. Given matrices A, B where C = A ⊗ B, we have

κ(C) = κ(A) κ(B).

Instead of having ‖A‖ = Ω(n^n), we have ‖A‖ = Ω(n^{n^{1/d}}).
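The Kronecker identity κ(A ⊗ B) = κ(A) κ(B) follows because the singular values of A ⊗ B are all products σ_i(A) σ_j(B); it is easy to confirm numerically (a sketch with arbitrary random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 4))

# 2-norm condition number multiplies across a Kronecker product.
kAB = np.linalg.cond(np.kron(A, B))
print(np.isclose(kAB, np.linalg.cond(A) * np.linalg.cond(B)))   # True
```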
Numerical Accuracy
Figure: 1D convolution error
Numerical Accuracy
Figure: 2D convolution error
Arithmetic Complexity
Let nnz(A) be the number of nonzeros in A, a(A) the number of additions needed, and m(A) the number of multiplications. We have

a(A) ≤ nnz(A) − #row(A)  and  m(A) ≤ nnz(A).

Therefore, the overall cost of a convolution is

a(F) ≤ a(A) + a(B) + a(C)  and  m(F) ≤ m(A) + m(B) + m(C) + R.
Final Thoughts
Can also use bilinear algorithms to
I Find communication lower bounds
I Discover alternative bilinear algorithms
Concluding Thoughts
We have derived a family of fast bilinear algorithms.
We analyzed the error bounds and arithmetic costs for the different algorithms, especially bounded vs. unbounded algorithms.
Thanks!
Remaining Questions
I Communication lower bounds for nested convolution algorithms
I Error lower bounds with respect to the choice of nodes and polynomial divisors
I Do polynomial and interpolation-based algorithms cover the entire class of fast bilinear algorithms?
More information is covered in the paper:
Caleb Ju and Edgar Solomonik. Derivation and analysis of fast bilinear algorithms for convolution. arXiv:1910.13367 [math.NA], October 2019.