
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, VOL. 36, NO. 10, OCTOBER 1989 1281

Bit-Serial VLSI Implementation of Vector Quantizer for Real-Time Image Coding

Abstract - Vector quantization (VQ) has emerged as a viable approach for coding speech and image data in the last decade. Various vector quantizers have been designed and their performance evaluated during the last few years. Hardware realizations of VQ encoder systems, ranging from the use of “off-the-shelf” components to VLSI technology, have been designed mainly for real-time speech coding. Real-time image coding requires a much higher throughput rate than speech coding. In this paper, a practical high-throughput architecture and its implementation for real-time coding of TV quality signals are presented. The architecture is directed towards the implementation of multi-stage VQ, as our simulation results (presented in this paper) show that it is more suitable for real-time coding. However, the implementation is suitable for both single-stage and multi-stage VQ. The functional blocks of the VQ encoder system have been designed and implemented in VLSI technology. The VQ encoding scheme has an encoding delay of 25 clock cycles, independent of the codebook size.

I. INTRODUCTION

Digital representation of signals has created the need for efficient coding or data compression techniques that will reduce the storage and channel bandwidth associated with such signals. The efficiency of a data compression algorithm is measured by its data compression ability, the resulting distortion, and the implementational complexity, in particular that of the hardware implementation.

The implementation complexity of the coder becomes a major factor for coding data at low bit rates with an acceptable level of distortion. Careful consideration of the coding algorithms becomes essential when one is faced with real-time implementation of compression techniques for large-bandwidth signals such as video signals. Vector quantization (VQ) has emerged as a viable approach for coding speech and image data in recent years [1], [2]. Various review articles have been written on VQ applications to speech and image coding [1]-[4]. Low bit rate coding implies bit rates of the order of 1 bit/sample and 0.5 bits/pixel for speech and image data, respectively. Methods other than VQ, such as sub-band coding and transform coding, are capable of producing acceptable rates but have much higher complexity [2].

Manuscript received November 4, 1988; revised February 17, 1989. This work was supported in part by NASA Lewis Research Center, Cleveland, OH, under Grant NAG-582.

The authors are with the Signal Processing and Communications Group, Department of Electrical and Computer Engineering, University of Cincinnati, Cincinnati, OH 45221.

IEEE Log Number 8929990.

Though the VQ algorithm is equally applicable to speech and image coding, we concentrate on VQ, and in particular multi-stage VQ, and its implementation for low bit rate real-time image coding. Low bit rate image coding finds application in image transmission, such as broadcast television, remote sensing via satellite, aircraft, radar, sonar, teleconferencing, computer communications, and facsimile transmission, and in image storage, such as educational and business documents and medical images for patient monitoring systems.

The VQ encoder algorithm for real-time image transmission and storage applications can be implemented with present-day VLSI technology using high-throughput systolic architectures [5]. Most implementations of VQ systems, which are reviewed in detail in Section IV, are for real-time speech coding. They have, in general, a throughput rate that is inversely proportional to the codebook size [7]-[15]. This results in a substantial decrease in throughput rate for a large codebook. Moreover, the sampling frequency for image data is approximately three orders of magnitude higher than that for real-time speech coding. This requires a similar improvement in throughput rate for real-time image encoding systems.

We have earlier reported the mapping of the VQ algorithm for real-time encoding of TV quality images onto a two-dimensional systolic architecture [16] and some of the VLSI implementation details [17], [18]. There are many variations of VQ algorithms [19]-[27], and all of them contain an embedded VQ codebook search processor. We have chosen single-stage and multi-stage VQ implementation due to the issues described in Section III. Other techniques such as finite-state VQ or predictive VQ lead to more complex implementations.

The remainder of the paper is organized as follows. The process of VQ for image encoding, the distortion measures used, and the conversion of mean-squared error to an inner product for simplified VQ implementations are described in Section II. Multi-stage vector quantization and its advantages over single-stage VQ are presented in Section III, together with results of computer simulations with TV images. Previous implementations of VQ encoders for speech and image signals are described in Section IV. Section V describes the issues involved in real-time encoding of TV images using VQ, the mapping of the algorithm to a systolic architecture, the functions of the cells required, the implementation of the cells for VQ, and the operation of the VQ system using timing diagrams. The salient features of the implementation and directions for future work are given in the conclusions in Section VI.

0098-4094/89/1000-1281$01.00 © 1989 IEEE

Fig. 1. VQ system.

II. VECTOR QUANTIZATION (VQ)

VQ refers to a family of source coding methods that quantize the signal source in blocks or vectors. Various design techniques for VQ have emerged in the last decade. VQ can be memoryless or can have memory, as in vector predictive quantizers and finite-state vector quantizers. Several of these vector quantizers have been designed and their performance has been studied during the last few years [1]-[4].

VQ involves decomposing the input signal into vectors or blocks and quantizing each vector to the nearest neighbor vector in a pre-designed optimal codebook. The nearest neighbor vector for a given input vector is the codebook vector, or codevector, that best matches the input vector according to a given fidelity measure. Thereafter, the codevector's index or address in the codebook is used to identify the input vector. The process of codebook formation and the coding and decoding of a particular vector using VQ are illustrated in Fig. 1. The codebook design is normally performed off-line using the Linde, Buzo, and Gray (LBG) algorithm [6]. Recently, Kohonen's self-organizing feature maps [28] have also been applied to VQ for designing codebooks for both speech and image coding applications [28]-[30].

Representing the input vector as $X = [x_j]$ of dimension $m$, the codebook as $C = [c_{ij}]$, consisting of $N$ codevectors of size $m$, and $D_i$ as a suitably defined distortion error between the input vector and the $i$th codevector, the VQ encoder has to generate the address or index $k$ as

$$k = \text{subscript of minimum } D_i, \qquad i = 0, \ldots, N-1. \tag{1}$$

The index k is used as an identifier for the input vector X . A commonly used distortion measure is the mean squared error (MSE) because of its ease of computation compared to other fidelity measures. Also, the MSE can be converted into an inner product form so as to obtain a simpler systolic implementation as shown below [31].

The inner product has the form $D_i^{j+1} = D_i^{j} + x_j \cdot y_j$, where $x_j$, $y_j$, and $D_i^{j}$ form the inputs and $D_i^{j+1}$ the output. The MSE $D_i$ for a given input vector $[x_j]$ and codevector $[c_{ij}]$ is computed as

$$D_i = \sum_{j=0}^{m-1} (c_{ij} - x_j)^2, \qquad i = 0, \ldots, N-1. \tag{2}$$

Equation (2) can be written as

$$D_i = \sum_{j=0}^{m-1} x_j^2 - 2\sum_{j=0}^{m-1} c_{ij} x_j + \sum_{j=0}^{m-1} c_{ij}^2, \qquad i = 0, 1, \ldots, N-1. \tag{3}$$

It can be observed that the first term of equation (3) is a constant, independent of the codevectors, and hence will not have any effect on the selection of the codevector that gives the minimum distortion. Thus we can modify (3) to obtain an inner product representation for MSE as

$$\min_{0 \le i \le N-1} D_i = \min_{0 \le i \le N-1} \left[ \sum_{j=0}^{m-1} x_j z_{ij} + y_i \right] \tag{4}$$

where $z_{ij} = -2c_{ij}$ and $y_i = \sum_{j=0}^{m-1} c_{ij}^2$.
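The equivalence between the full-search rule of (2) and the inner product form of (4) can be checked numerically. The following is a minimal sketch in plain Python, not the hardware algorithm itself; the function names (`mse_index`, `precompute`, `inner_product_index`) and the small example codebook are illustrative assumptions and do not appear in the original work.

```python
def mse_index(x, codebook):
    """Index of the codevector minimizing the squared error of Eq. (2)."""
    dists = [sum((c - xj) ** 2 for c, xj in zip(cv, x)) for cv in codebook]
    return dists.index(min(dists))


def precompute(codebook):
    """Off-line constants of Eq. (4): z_ij = -2*c_ij and y_i = sum_j c_ij^2."""
    z = [[-2 * c for c in cv] for cv in codebook]
    y = [sum(c * c for c in cv) for cv in codebook]
    return z, y


def inner_product_index(x, z, y):
    """Index minimizing sum_j x_j*z_ij + y_i, the bracketed term of Eq. (4)."""
    scores = [sum(xj * zij for xj, zij in zip(x, zi)) + yi
              for zi, yi in zip(z, y)]
    return scores.index(min(scores))


# Hypothetical 3-entry codebook of dimension 2, for illustration only.
codebook = [[0, 0], [10, 10], [3, 4]]
z, y = precompute(codebook)
x = [4, 4]
print(mse_index(x, codebook), inner_product_index(x, z, y))
```

Both searches select the same index because the two scores differ only by the term $\sum_j x_j^2$, which is the same for every codevector.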

III. MULTI-STAGE VECTOR QUANTIZATION (MSVQ)

To encode pictures with an acceptable level of distortion, a single-stage VQ requires a fairly large codebook. Since MSVQ did not give favorable results for speech coding [19], it has generally been assumed that the same would hold for image coding. However, we have shown that good quality color image encoding can be achieved using MSVQ [26], [27]. Its performance is comparable to that of a single-stage VQ while using codebooks of moderate size. In MSVQ (Fig. 2), the original image is encoded by a first-stage VQ, and the difference between the original and the reconstructed image is then encoded by a second-stage VQ. The process is repeated until the desired picture quality is obtained. The decoder consists of look-up tables (LUT's) and adders to reconstruct the signal. Real-time implementation of MSVQ requires much less hardware than single-stage VQ because smaller codebooks can be used at each stage.
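The stage-by-stage residual encoding described above, and the look-up-and-add decoder, can be sketched as follows. This is an illustrative software model only; the function names and the tiny two-stage codebooks are assumptions made for the example and are not taken from the simulations reported here.

```python
def nearest(x, codebook):
    """Full-search nearest neighbor under squared error."""
    dists = [sum((c - xj) ** 2 for c, xj in zip(cv, x)) for cv in codebook]
    return dists.index(min(dists))


def msvq_encode(x, stages):
    """Encode x with one codebook per stage; returns one index per stage."""
    residual = list(x)
    indices = []
    for codebook in stages:
        k = nearest(residual, codebook)
        indices.append(k)
        # The next stage encodes what this stage failed to represent.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices


def msvq_decode(indices, stages):
    """Decoder: one table look-up per stage, outputs summed by adders."""
    out = [0] * len(stages[0][0])
    for k, codebook in zip(indices, stages):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Hypothetical two-stage codebooks of dimension 2.
stages = [
    [[0, 0], [100, 100], [200, 200]],
    [[-10, -10], [0, 0], [10, 10], [10, -10]],
]
idx = msvq_encode([112, 91], stages)
print(idx, msvq_decode(idx, stages))
```

Each stage searches only its own small codebook, which is the property that keeps the per-stage hardware small.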

We include some simulation results to show that good quality composite color image encoding can be achieved using MSVQ. The signal chosen is PCM samples of composite TV pictures in the NTSC format. The sampling frequency is chosen as 4 times the color subcarrier frequency, resulting in pictures of 512 x 768 pixels. With a vector size of 16 (4 x 4) pixels, the reconstructed pictures have an MSE of under 5/pixel, employing 2 to 3 stages with codebook sizes decreasing from 128 for the first stage to 16 in the third stage. This yields bit rates of 0.8 to 1 bits per pixel (bpp).¹ Increasing the codevector size to 64 (8 x 8) resulted in reconstructed pictures having a slightly higher MSE, but still under 10/pixel. The codebooks are generated with the basic LBG algorithm using frames containing images with different features. The initial codebooks are obtained by averaging a fixed number of sub-images for each codevector. The codebook generation took 3-4 h even with the use of an array processor; in general, the codebook generation process was terminated when the decrease in the MSE became very small. Depending on the complexity of the picture, 2 to 5 stages are used in MSVQ, resulting in bit rates of 0.4 to 0.5 bpp. The decrease in MSE/pixel with the increase in the number of stages is illustrated in Fig. 3. Also, even with an increased codevector size, MSVQ does not show the blocking effects evident in other block coding methods such as the discrete cosine transform. It should be noted that a bit rate of 0.5 bpp has not been achieved so far for encoding TV quality composite color images. MSVQ is also applicable in situations like low bit rate picture phone, where we can use a few stages and update the bits corresponding to certain stages at a low rate. Examples of the images used for MSVQ and the reconstructed images are shown in Figs. 4 and 5.² Both examples use a two-stage VQ with 4 x 4 sub-blocks, 128 codevectors at the first stage, and 64 codevectors at the second stage. It can be noted that the image in Fig. 5 has a lot of detail, which is preserved by MSVQ.

¹ bit rate = (log2 128 + log2 64)/(16 pixels) = 0.8125 bits/pixel.

RAMAMOORTHY et al.: VLSI IMPLEMENTATION FOR IMAGE CODING 1283

Fig. 2. Multi-stage VQ system.

Fig. 3. MSE improvement with number of stages.

Fig. 4. (a) Original digitized image (512 x 768 pixels, 8 bits/pixel). (b) Two-stage VQ coded image with 4 x 4 sub-blocks (0.8125 bits/pixel).

The results obtained so far indicate that MSVQ has excellent potential for real-time encoding of TV quality images at very low bit rates. Besides employing smaller codebooks at each stage, MSVQ also allows us to encode larger vectors,³ resulting in very low bit rates without significantly increasing the overall codebook size.
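The bit rates quoted in this section follow from summing the index lengths of all stages and dividing by the number of pixels per vector. A short sketch of that arithmetic is given below; the helper name `msvq_bit_rate` is an assumption introduced for illustration.

```python
import math


def msvq_bit_rate(codebook_sizes, vector_dim):
    """Bits per pixel: sum of log2(codebook size) over stages / pixels per vector."""
    return sum(math.log2(n) for n in codebook_sizes) / vector_dim


# Two-stage example from the text: 128 and 64 codevectors, 4 x 4 vectors.
print(msvq_bit_rate([128, 64], 16))  # 0.8125

# Adding a third stage of 16 codevectors with the same vector size.
print(msvq_bit_rate([128, 64, 16], 16))
```

The same function shows why larger vectors lower the rate: for a fixed set of codebook sizes, the bit rate is inversely proportional to the vector dimension.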

IV. REVIEW OF PREVIOUS IMPLEMENTATIONS

In this section, we describe various hardware realizations of VQ encoding systems reported for speech and image coding applications. The early implementations were designed with "off-the-shelf" components. Pipelined architectures using VLSI technology have emerged in recent years. Wherever possible, the codebook sizes, the bit rate, and the throughput rate are given for the implementations considered.

² Figs. 4 and 5 are shown in monochrome due to a time restriction in printing color.

³ The CPU time needed definitely increases as the size of the vectors is increased. However, it is a one-time process performed off-line, and the steps outlined above can be employed to reduce CPU time.


Fig. 5. (a) Original digitized image (512x768) pixels with detail, (8 bits/pixel). (b) Two-stage VQ coded image with 4 x 4 sub-blocks, (0.8125 bits/pixel).

A hardware realization of a real-time full-search vector quantizer for speech waveform coding was first reported in [7]. The total system is implemented with “off-the-shelf” LSTTL, CMOS, and nMOS components and is interfaced with a microprocessor. The encoding system is pipelined and involves preprocessing of the analog speech waveform and generation of the index of the best matched codevector using VQ. The single-codebook implementation uses 8-bit codebooks with an encoding rate of either 1 bit/sample at dimension 8 or 2 bits/sample at dimension 4. Calculation of the distortion measure (MSE) is carried out in 500 ns, and a full pass of the codebook is done in 1.024 ms. A design of an MSVQ system with LSTTL devices for speech waveform coding using two stages is reported in [13]. The encoding delay is 4 ms for vectors of dimensions 16 and 8 with compression to 1 and 2 bits/sample, respectively.

An LSI architecture for VQ using two different types of modules is reported in [8]. Full-search or tree-search encoding systems can be implemented with these modules, namely, the distortion processing module (DPM) and the array processing controller (APC). A VQ system with vector dimension m requires m DPM's and one APC. The DPM is capable of computing MSE distortion measures at a rate of 10 MHz. Computer simulations and a prototype implementation using “off-the-shelf” components have been carried out for an image coding application. A rate of 30,720 vectors/s with compression to 0.5 bpp is achieved with a picture resolution of 256 x 256 at 7.5 frames/s.

The first VLSI chip for real-time implementation of the VQ algorithm for speech coding is reported in [9], [10]. The heart of the system is an nMOS VLSI pattern matching chip (PMC). Various functions are pipelined in the PMC, and it computes the best matching index of the codevector sequentially. The throughput rate of the implementation for an exhaustive search is inversely proportional to the product of the codebook size and codevector dimension. The time required for computing the squaring and comparison operations is 0.33 µs/sample. Various applications such as vector pulse coded modulation (VPCM), adaptive vector predictive coding (AVPC), and rapid codebook design (RCDP) have been cited for the PMC. A second generation of VQ processors for real-time speech coding has been implemented using bit-level systolic arrays [32] in 2-µm nMOS technology [11]. A VQ system with m dimensions can be built with m inner product processors, which are bit-level systolic arrays having 234 full-adders each, and a bit-serial comparator processor. The encoding delay is calculated to be around 100N ns, where N is the codebook size. Recently, the basic linear systolic architecture implementation described in [11] has been extended to two-dimensional architectures [12] using the tradeoff between hardware and throughput rate of the system. A systolic architecture for pattern clustering using the MSE distortion measure is also described in [14].

A bit-serial VLSI vector quantizer was designed to test a new methodology of structured tiling [15]. A mean residual reflected vector quantizer can be implemented with a set of five chips: a front-end chip performing mean extraction and vector orientation, two codebook search chips for tree search, and two ROM chips for codebooks. The codebook search chip contains a bit-serial sequential pipeline of an adder, squarer, adder, and comparator for a single pass with a codevector. Most of the area of the codebook search chip is occupied by squarer circuits.

V. IMPLEMENTATION OF VQ ENCODER

The computational complexity of VQ is of the order of O(mN) on a SISD machine, as mN distortion measures have to be computed for a codebook size of N and vector size of m. Real-time VQ encoding requires a much higher and constant throughput rate, which is not achievable with a SISD machine or with general-purpose digital signal processor (DSP) chips. This has inhibited hardware realizations of the VQ algorithm for some time. However, as the VQ algorithm involves a repetitive process of subtraction, squaring, and addition for each codevector and data vector, it can be mapped onto a systolic architecture that can be implemented with considerable ease using VLSI technology.

A direct mapping of the VQ encoder system onto a word-level systolic architecture is illustrated in Fig. 6 [16]. Referring to Fig. 6, distortion measures are computed by the two-dimensional systolic array, which generates a distortion measure every clock cycle. This upper two-dimensional array takes elements of the input vectors $x_j$ and partially computed distortion measures (or null) as inputs, and generates elements of the delayed input vector and the distortion measures $D_i$ as outputs. The index or address of the best matched codevector is generated by the lower linear systolic array at the rate of one codevector index per cycle. It takes the distortion measures $D_i$ from the upper array, the maximum distortion measure that can be represented, $D_{max}$, and a corresponding dummy index $k_{max}$ as inputs, and generates the minimum distortion measure obtained in the codebook search, $D_{min}$, and its corresponding index $k_{min}$ as outputs. The architecture is modular and can be cascaded in both horizontal and vertical directions to accommodate larger codebooks and codevectors.

Fig. 6. VQ mapped onto a systolic architecture.

Fig. 7. (a) Cell A. (b) Cell B.

The processing element (PE), or cell A, of the two-dimensional array is illustrated in Fig. 7(a). An element of the codevector, $c_{ij}$, is pre-loaded in each cell A. Each cell A receives a partially computed MSE distortion $D_i^{j}$ from the upper cell and one sample of the input, $x_j$, as inputs, and generates the updated distortion measure $D_i^{j+1}$ and the delayed input $x_j(\Delta)$ as outputs after performing subtraction, squaring, and addition operations. Cell B contains a comparator, a multiplexer, and a load/increment counter. A comparison operation is performed on its two distortion inputs, and the minimum of the two is passed to the output. The corresponding index of the codevector that has given the minimum distortion is also passed to the next B cell. The computational flow in the word-level systolic architecture at various time instances is illustrated in Fig. 8, assuming a codebook size of 4 and a vector dimension of 3. Accumulation of distortion measures at various time instances for one codebook search is illustrated. For example, the partial distortion measures $D_0^0$, $D_0^1$, and $D_0^2 = D_0$ with the first codevector are obtained in the 1st, 2nd, and 3rd time instances. The codebook search for the first input vector $[x_0, x_1, x_2]$ is performed by computing the distortion measures $D_0$, $D_1$, $D_2$, and $D_3$ at the 3rd, 4th, 5th, and 6th time instances, respectively. The corresponding comparison operations are performed in the 4th-7th time instances. The index of the best matched vector is available every basic cycle starting from the 8th cycle.

Fig. 8. Operation of the VQ encoder (N = 4). Format: input (at the start of the basic cycle)/output (at the end of the basic cycle). The VQ address is available every basic cycle starting from the 8th cycle.

TABLE I
IMPLEMENTATIONAL CONSIDERATIONS OF VQ ENCODER FOR REAL-TIME IMAGE CODING

Codebook size: N. Vector dimension: m. Sampling time: T_s.
Cells needed = mN.
Bit rate (bpp) = (log2 N)/m.
Time (T) available for VQ encoding (µs) = m T_s, with F_s = 13.5 MHz and T_s = 74.074 ns.
Entries: cells needed and bit rate for various values of n (= log2 N) and m.
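The roles of cells A and B can be summarized with a small functional model. The sketch below is not cycle-accurate and is not the authors' design; the function names are hypothetical, and the `schedule` helper simply extrapolates the timing of the N = 4, m = 3 example (distortions ready at instances 3 to 6, comparisons at 4 to 7, index at 8) to other sizes, which is an assumption.

```python
def cell_a(c_ij, d_in, x_j):
    """Cell A: subtract, square, add. Returns (updated distortion, passed-on input)."""
    return d_in + (c_ij - x_j) ** 2, x_j


def cell_b(d_new, k_new, d_min, k_min):
    """Cell B: pass on the smaller distortion together with its index."""
    if d_new < d_min:
        return d_new, k_new
    return d_min, k_min


def systolic_search(x, codebook, d_max=float("inf"), k_dummy=-1):
    """Functional (untimed) model of the A-cell array feeding the B-cell chain."""
    d_min, k_min = d_max, k_dummy
    for i, codevector in enumerate(codebook):
        d = 0
        for c_ij, x_j in zip(codevector, x):
            d, _ = cell_a(c_ij, d, x_j)
        d_min, k_min = cell_b(d, i, d_min, k_min)
    return k_min, d_min


def schedule(m, n):
    """Time instances assumed from the Fig. 8 example: D_i is ready at m + i."""
    distortion_ready = [m + i for i in range(n)]
    compare_done = [t + 1 for t in distortion_ready]
    index_ready = compare_done[-1] + 1
    return distortion_ready, compare_done, index_ready


print(systolic_search([1, 2, 3], [[0, 0, 0], [1, 2, 4], [5, 5, 5], [1, 2, 3]]))
print(schedule(3, 4))
```

In the hardware the loops are unrolled in space: every cell A holds one $c_{ij}$, so a new input vector can enter before the previous search has finished.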

The time available to the VQ encoder for real-time image coding depends on the sampling frequency of the signal. Some of the implementation considerations for coding TV signals using VQ for various codebook sizes and codevector dimensions are listed in Table I. With a sampling frequency of 13.5 MHz, the time available for encoding an input vector of dimension 64 (8 x 8) is 4.736 µs, and it increases to 18.944 µs with a vector of dimension 256 (16 x 16). However, the number of cells needed increases as the vector dimension is increased. It can be noted that the time available for VQ encoding increases with the codevector size but is independent of the codebook size. So it is preferable that the architecture support a throughput rate which is independent of the codebook size.
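The relations of Table I (cells needed = mN, bit rate = (log2 N)/m, time available = m T_s) can be evaluated directly. The sketch below is illustrative; the function name is hypothetical, and the sampling period is taken as 74 ns on the assumption that the 4.736 µs and 18.944 µs figures quoted above were computed with the rounded value rather than with 74.074 ns.

```python
import math

T_S_NS = 74  # assumed rounded sampling period in ns (1/13.5 MHz = 74.074 ns)


def vq_requirements(codebook_size, vector_dim, ts_ns=T_S_NS):
    """Cells, bit rate, and encoding time per vector from the Table I relations."""
    return {
        "cells": vector_dim * codebook_size,
        "bpp": math.log2(codebook_size) / vector_dim,
        "time_us": vector_dim * ts_ns / 1000,
    }


for m in (16, 64, 256):
    print(m, vq_requirements(128, m))
```

The `time_us` entry depends only on the vector dimension, while `cells` grows with both m and N, which is the tradeoff discussed above.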

The two-dimensional systolic architecture consisting of A cells cascaded horizontally as well as vertically is subsequently referred to as the inner product processor (IPP), and the linear systolic array consisting of B cells cascaded horizontally is referred to as the comparator and address generator (CAG) processor. The functions of cells A and B are also modified to obtain much simpler implementations. As shown earlier in (4), cell A needs to compute only an inner product instead of subtraction, squaring, and addition operations. The issues involved in choosing particular IPP and CAG implementations are described in the subsequent sections.

A. Inner Product Processor (IPP)

A word-parallel two-dimensional systolic architecture is an ideal choice for an IPP operating at a very high throughput rate, computing distortion measures every clock cycle. However, parallel architectures have limitations such as the following.

Only a small number of PE's can be integrated on a chip. They require a large number of input/output pins. They require a large amount of interconnection area.

Considering the above limitations, a word-parallel systolic architecture is not an appropriate choice for our purpose. The second alternative, a bit-level systolic architecture, can also compute distortion measures every clock cycle, but the number of input/output pins and the hardware complexity of a single processing element or cell remain comparable to those of the word-parallel architecture. A third alternative is to use a two-dimensional (i.e., for the input as well as the distortion measure) bit-serial architecture. In MSVQ for image coding, a codebook with a large codevector size (m ≥ 64) is preferable since it gives a good compression ratio. The large vector size also provides ample time for encoding. For example, a codevector of size 64 (8 x 8) gives us 4.736 µs (at 13.5-MHz sampling), which is sufficient to encode one vector using a two-dimensional bit-serial systolic architecture having a number of PE's. With a bit-serial architecture, the hardware complexity of the PE is very low, with few input/output pins and a small interconnection area. It is also possible to pipeline to the lowest level with bit-serial architectures. As only a few bit operations are involved per clock cycle, the circuit can operate at a much higher clock rate.

A block diagram of four PE's operating in a bit-serial architecture is illustrated in Fig. 9(a), with inputs and outputs at various time instances. For simplicity, a codebook having only two two-dimensional vectors is illustrated. A detailed schematic of a 4-bit PE is given in Fig. 9(b) to illustrate the operation of each PE. The input and codevector elements are represented in 9-bit two's complement form in the actual implementation, and 24 bits are used for the distortion measure.⁴ The elements of each codevector (or scaled codevectors) are pre-stored inside each PE before the operation of VQ. The input vector $x_j$ enters the front end of the IPP bit-serially, least significant bit first. An rst pulse, which appears every 24 clock cycles, clears the latches used for pipelining. Each PE computes the inner product bit-serially and passes it to the adjacent vertical PE. Also, the delayed input bit is passed on to the adjacent horizontal PE in a systolic rhythm. Input bits at the front end are delayed in a triangular fashion to offset the data delay from one PE to the next vertical PE. An IPP chip consisting of 32 PE's with a provision for horizontal and vertical cascadability has been designed. This will enable us to use the chips for VQ implementation with arbitrary input vector length and codebook size.⁵

Fig. 9. (a) Inputs/outputs of 2 x 2 bit-serial IPP. (b) 4-bit processing element of IPP.

⁴ It is assumed that the original image is digitized at 8 bits per pixel. One more bit is needed to represent the inputs since the error signals can have negative values. A wordlength of 24 bits for the distortion measure is arrived at with the assumption that the maximum size of the codebook would be 256.

⁵ We are in the process of building a proof-of-concept VQ encoder using the chips fabricated.

Referring to Fig. 9(b), each PE outputs an inner product bit serially after a delay of one clock cycle. This serial mode of operation introduces a delay of 24 clock cycles to output a distortion measure of length 24 bits, not counting the vertical latency due to the PE chain. The clock rate is limited by the three carry-save adder chain (or (5,3) counter) in the last stage of the pipeline. A typical delay of 24 ns is estimated for the chain using a 3-µm scalable CMOS process from MOSIS. The pipeline functions the same way for positive and negative numbers until it encounters the sign bit of the input. At this stage, the sign bit is latched in by a separate clock. If the input is negative, the sign bit is used for inverting the codevector element bits in the PE before forming the inner product bit. The feedback in the first stage is required for sign extension. Each PE requires 24 clock cycles to form the inner product before a new input sample can be processed.
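The arithmetic performed by one PE can be illustrated with a functional model of LSB-first bit-serial multiplication in two's complement, in which the coefficient is added once per set input bit and subtracted at the sign-bit position. This is only a sketch of the arithmetic, not of the carry-save circuit of Fig. 9(b); the function names are hypothetical, and the coefficient width is not restricted here.

```python
def wrap_signed(value, width):
    """Interpret the low `width` bits of value as a two's complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def bit_serial_product(x, c, in_bits=9, out_bits=24):
    """Multiply stored coefficient c by input x fed one bit at a time, LSB first."""
    acc = 0
    for b in range(in_bits):
        bit = (x >> b) & 1          # bit b of the two's complement input
        if not bit:
            continue
        if b == in_bits - 1:
            acc -= c << b           # sign bit carries weight -2^(in_bits-1)
        else:
            acc += c << b
    return wrap_signed(acc, out_bits)


def bit_serial_inner_product(xs, zs, y=0, out_bits=24):
    """Running sum down a column of PE's: y + sum_j x_j * z_j, kept to 24 bits."""
    acc = y
    for x, z in zip(xs, zs):
        acc = wrap_signed(acc + bit_serial_product(x, z), out_bits)
    return acc


print(bit_serial_product(-37, 101))
print(bit_serial_inner_product([4, 4], [-6, -8], y=25))
```

Because the number of input bits is a parameter, the same model also reflects the remark that shorter word lengths need fewer clock cycles per sample.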

As the bit and sign-bit clocks are applied externally, the architecture is independent of the bit lengths of the input and codevector elements. Thus the IPP can be operated at lower bit lengths with a higher throughput rate, if required. A maximum input vector length of 256 (16 x 16) can be used with cascaded PE's producing a distortion measure. Higher vector dimensions can be accommodated by scaling the codevectors or by truncating the distortion measures coming out at the interface of cascaded IPP chips. The IPP or cascaded IPP's output the accumulated inner product, or distortion measure, at the input port of the CAG.

B. Comparator and Address Generator

The considerations for implementing the architecture of the CAG are different from those for the IPP. The IPP presents a distortion measure per cycle after a latency $t_l$ given by

$$t_l = T_r + T_{24} \tag{5}$$

where $T_r$ is the delay due to performing a running sum on the distortion measures using vertically cascaded PE's of the IPP, and $T_{24}$ is the delay due to accumulating 24 bits of the distortion measure. This requires that the comparison of two distortion measures to select the minimum of the two, and the generation of the corresponding codevector index, be completed in one clock cycle. The critical step which decides the speed of the CAG is the comparison of two 24-bit distortion measures in two's complement form. To perform a high-speed comparison operation, an internal parallel architecture is necessary for the CAG. However, the delay due to the bit-serial operation of the IPP can be effectively used to reduce the hardware complexity of the CAG. This CAG architecture will require only one comparator and one address counter for every 24 clock cycles of operation, whereas the word-level systolic architecture in Fig. 6 requires an individual cell having a comparator and address counter for each codevector.

The cascadable CAG architecture integrating 24 cells is illustrated in Fig. 10. It has bit-serial inputs for distortion measures from the IPP and parallel inputs for loading a distortion measure and the corresponding index from the previous CAG. A start-up count, corresponding to the index of the first codevector to be searched, is also loaded through the parallel inputs. The outputs are the minimum distortion measure for the search involved and the corresponding codevector index, in parallel. The first step of the CAG is to assemble the 24 bits of a distortion measure presented by the IPP into a parallel format. This is accomplished by serial-to-parallel converters, or shift registers, at the front end of the CAG. A sync pulse, which is a modified version of the rst pulse from the IPP, signals the accumulation of the last bit of a distortion measure. It also works as a load signal for loading a distortion measure and an index from the previous CAG, and for loading the start-up count value.

Fig. 10. Implementation of comparator and address generator.

The components of the CAG are a 24 x 24 register bit-array, a high-speed 24-bit two's complement comparator, a minimum index register (MIR), and a counter for address generation. The comparator is implemented with the carry-look-ahead technique. The newly acquired 24-bit distortion measure is divided into 4-bit groups, giving rise to six groups. The last group is of 3 bits only, as the sign bit (msb) is not used in the carry-look-ahead chain. The minimum distortion measure of the previous comparisons, or from the previous CAG, is available in register B. The six groups are used only to generate group carry-generate and propagate bits, as it is not necessary to compute carry and sum bits. Block carry-generate and propagate signals are used to compute the carry to the 24th bit. Based on the sign bits of register A and register B and the computed carry bit, the minimum of the two values is decided. If register A has the minimum value, it is transferred to register B and the counter passes its value to the MIR, overwriting the previous value. The parallel load counter will have the correct index of the codevector, as it keeps counting after being loaded with the start-up count. The index from the MIR is always available at the output of the CAG. It is the index of the best matched codevector whenever a codebook search is completed. The high-speed comparison can be performed in less than 32 ns in 3-µm CMOS. The total delay for address generation is 25 clock cycles after an input vector is presented. In other words, the encoding delay is 25 clock cycles for a codebook search, and it is independent of the codebook size.
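The comparison rule described above can be modeled in software as a sketch. The carry out of the lower 23 bits of A + NOT(B) + 1 is formed from group generate/propagate signals only, and the decision then uses that carry together with the two sign bits. The function names, the use of an OR-type propagate signal, and the initial register contents are assumptions made for illustration; this is a behavioral model, not the fabricated circuit.

```python
def group_gp(a_bits, nb_bits):
    """Group carry-generate and propagate for one group (bits given LSB first)."""
    G, P = 0, 1
    for a, nb in zip(a_bits, nb_bits):
        g, p = a & nb, a | nb
        G = g | (p & G)
        P = p & P
    return G, P


def a_less_than_b(a, b, width=24, group=4):
    """True if two's complement value a is smaller than b (both `width` bits)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    sign_a, sign_b = a >> (width - 1), b >> (width - 1)
    low = width - 1                                  # sign bit is left out of the chain
    a_bits = [(a >> i) & 1 for i in range(low)]
    nb_bits = [((~b) >> i) & 1 for i in range(low)]  # inverted bits of B
    carry = 1                                        # carry-in realizes the "+1"
    for start in range(0, low, group):               # six groups; the last has 3 bits
        G, P = group_gp(a_bits[start:start + group], nb_bits[start:start + group])
        carry = G | (P & carry)
    if sign_a != sign_b:
        return sign_a == 1                           # the negative operand is smaller
    return carry == 0                                # same sign: no carry means A < B


def cag_search(distortions, start_index=0, d_init=2**23 - 1, k_init=0):
    """Register B / MIR update over a stream of distortion measures."""
    reg_b, mir, counter = d_init, k_init, start_index
    for reg_a in distortions:
        if a_less_than_b(reg_a, reg_b):
            reg_b, mir = reg_a, counter
        counter += 1
    return mir, reg_b


print(cag_search([500, -20, 300, -20]))
```

Negative values appear in the example because, with the inner product form of (4), the quantity being minimized is no longer a non-negative squared error.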

C. System Timing

The VQ encoder system timing with the main signals is illustrated in Fig. 11. The system requires two clocks, a bit-clock and a sign-bit clock. The sign-bit clock is the ninth bit-clock as we have 8-bit positive image pixel data for the first stage of VQ. It also can be defined externally depending on the dynamic range of data without modify- ing the system. The rth signal which appears with the first bit of input vector clears all the pipelined latches of PE‘s


Fig. 11. System timing.

except the latches between PE's. Input latches of the PE take data when the bit-clock is high, and pipelined latches take data when the clock is low. The vertical latency due to propagation of the signals through the PE's, which is equal to m * Tclk, is assumed to be unity in Fig. 11 for easier illustration. The first bits of distortion measure D0 appear after the vertical latency (a delay of one clock cycle in the illustration). After the initial latency, the bit-streams of all distortion measures appear continuously every clock cycle. The sync signal loads the distortion measure into register B and the corresponding index into the MIR from the input. It also loads the parallel counter with the start-up count. The en i signals, with i varying from 0 to 23, are generated from sync by appropriately delaying it. During the high state of en i, the distortion measures are compared and, depending on the result, the contents of register B and the MIR are modified. The current lowest distortion measure and the corresponding index are always available in register B and the MIR, respectively. After a delay of 24 clock cycles, all the signals reappear in the same sequence.
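The generation of the en i signals from sync can be pictured as a 24-stage shift register clocked by the bit-clock. The Python sketch below is one possible model under stated assumptions: the function name `enable_signals` is hypothetical, and the choice that en i lags sync by i + 1 cycles is an assumption, since the text says only that sync is "appropriately" delayed.

```python
def enable_signals(sync, n_enables=24):
    """Return waves[i][t], the value of en i at clock cycle t.

    Models a shift register fed by sync. Assumed offset: en i is sync
    delayed by i + 1 cycles (the exact offset is not given in the text).
    """
    stages = [0] * n_enables
    waves = [[] for _ in range(n_enables)]
    for s in sync:
        for i in range(n_enables):
            waves[i].append(stages[i])   # stage outputs seen during this cycle
        stages = [s] + stages[:-1]       # clock edge: shift sync one stage on
    return waves


# A sync pulse every 24 cycles, as the timing repeats with that period.
sync = [1 if t % 24 == 0 else 0 for t in range(72)]
en = enable_signals(sync)
```

With this model exactly one enable is high in any cycle after the first sync pulse, so each of the 24 staggered distortion measures gets its own one-cycle comparison slot.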

D. Fabrication

The VQ encoder system was partitioned into small functional blocks, such as the 9-bit PE, high-speed comparator, register array, parallel load counter, and control for the CAG. Since we are only interested in demonstrating the basic concepts, and due to our limited resources, only the functional blocks have been fabricated, in a 3-μm double-metal, twin-tub scalable CMOS process through MOSIS. Two critical blocks that decide the throughput of the system, namely the 9-bit PE and the high-speed comparator, are presently being evaluated. The 9-bit PE has approximately 1000 devices and the comparator has 420 devices integrated. Both use the smallest standard die of 90 × 133 mil from MOSIS. The die micrographs of the 9-bit PE and the comparator are given in Fig. 12(a) and (b). We have designed an IPP with 32 PE's using the 3-μm CMOS process with a die size of 310 × 360 mil. With the same die size, as many as 4 of the CAGs described can be integrated. A single-stage VQ encoder for real-time image coding of TV signals with a codebook size of 128 and vector size of 64 requires 256

Fig. 12. (a) Die micrograph of 9-bit PE. (b) Die micrograph of comparator.

IPP's having 32 PE's each and 6 CAGs. Scaling down the design to 1.5 μm allows integration of a much larger number of PE's and gives reduced delays, making it possible to encode even smaller codevector sizes (< 64) using VQ and MSVQ implementations. The VQ encoder system with the described architecture is also well suited for wafer-scale interconnection and packaging technology [33].
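The chip counts quoted above can be reproduced with simple arithmetic, under two assumptions that are inferred from the figures rather than stated in the text: one PE is needed for every component of every codevector, and each CAG handles 24 distortion measures (suggested by its 24 × 24 register bit-array). The function name and parameter names below are hypothetical.

```python
import math


def chip_counts(codebook_size, vector_size, pes_per_ipp=32, measures_per_cag=24):
    """Estimate the number of IPP and CAG chips for a single-stage encoder.

    Assumptions (not stated in the text): one PE per codevector component,
    and one CAG per 24 codevectors.
    """
    total_pes = codebook_size * vector_size
    ipps = math.ceil(total_pes / pes_per_ipp)
    cags = math.ceil(codebook_size / measures_per_cag)
    return ipps, cags


ipps, cags = chip_counts(codebook_size=128, vector_size=64)
print(ipps, cags)  # 256 6
```

Under these assumptions the sketch gives 256 IPP's and 6 CAGs for a codebook size of 128 and a vector size of 64, in agreement with the figures in the text.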

VI. CONCLUSIONS

Vector quantization seems to be a suitable candidate for real-time image encoding at low bit rates. However, VQ requires a fairly large codebook to encode images with an acceptable level of distortion. Through simulations, we have shown that MSVQ requires only a moderate codebook size in each stage to encode images and performs comparably to a single-stage VQ with a large number of codevectors. Some computer simulations of MSVQ-coded NTSC composite images are presented to demonstrate that good images with an acceptable level of distortion can be obtained. A practical implementation of high-throughput architectures for real-time image coding of TV signals using MSVQ is attempted.

The implementation is based on systolic architectural concepts, and a two-dimensional bit-serial architecture is employed to allow as many PE's as possible in a single chip. The architecture is specifically directed towards MSVQ and features cascadability in both horizontal and vertical directions, as it will not be feasible to pack the needed PE's in a single chip. The architecture consists of two functional blocks, the IPP and the CAG. The IPP is designed with a two-dimensional bit-serial systolic architecture and the CAG is designed with an internal parallel architecture. The IPP outputs one bit of the distortion measure corresponding to each codevector in the codebook every clock cycle. Thus, it requires 24 clock cycles to produce a distortion measure. The delay from one distortion measure (24 bits) to the next one is only one clock cycle. The CAG has both bit-serial and bit-parallel inputs. Both processors are cascadable for various sizes of codebook and codevector. The total encoding delay when both processors are cascaded is 25 clock cycles and is independent of the codebook size.

All the functional blocks for the VQ encoder system are implemented in a 3-μm scalable CMOS process from MOSIS, in view of scaling down to 1.5 μm and lower in the future. With a conservative clock rate of 8-10 MHz, a real-time image encoder using VQ/MSVQ is possible with codevector sizes of 64 and higher. Scaling down the design allows us to integrate a larger number of processing elements, giving reduced delays to encode codevectors of even smaller size. Hardware simulations of the architecture have shown the feasibility of the VQ encoder system for TV quality image coding and the viability of an integrated approach using VLSI technology. It is also well suited for wafer-scale interconnection and packaging technology to build a complete VQ encoder system.

REFERENCES

[1] R. M. Gray, “Vector quantization,” IEEE ASSP Mag., vol. 1, pp. 4-29, Apr. 1984.
[2] A. Gersho and V. Cuperman, “Vector quantization: A pattern matching technique for speech coding,” IEEE Commun. Mag., pp. 15-21, Dec. 1983.
[3] J. Makhoul, S. Roucos, and H. Gish, “Vector quantization in speech coding,” Proc. IEEE, vol. 73, pp. 1551-1588, Nov. 1985.
[4] N. M. Nasrabadi and R. A. King, “Image coding using vector quantization: A review,” IEEE Trans. Commun., vol. 36, pp. 957-971, Aug. 1988.
[5] H. T. Kung, “Why systolic architectures?” Computer, vol. 15, no. 1, pp. 37-45, Jan. 1982.
[6] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design,” IEEE Trans. Commun., vol. COM-28, pp. 84-95, Jan. 1980.
[7] B. P. M. Tao, H. Abut, and R. M. Gray, “Hardware realization of waveform vector quantizers,” IEEE J. Select. Areas Commun., vol. SAC-2, pp. 343-352, Mar. 1984.
[8] H. Abut, B. P. Tao, and J. Smith, “Vector quantizer architectures for speech and image coding,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 756-759, Apr. 1987.
[9] G. Davidson, T. Stanhope, R. Arvind, and A. Gersho, “Real-time speech compression with a VLSI vector quantization processor,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 1437-1440, Apr. 1985.
[10] G. Davidson and A. Gersho, “Application of a VLSI vector quantization processor to real-time speech coding,” IEEE J. Select. Areas Commun., vol. SAC-4, Jan. 1986.
[11] P. Cappello, G. Davidson, A. Gersho, C. Koc, and V. Somayazulu, “A systolic vector quantization processor for real-time speech coding,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 2143-2146, Apr. 1986.
[12] G. A. Davidson, P. R. Cappello, and A. Gersho, “Systolic architectures for vector quantization,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 1651-1664, Oct. 1988.
[13] T. M. Liu and Z. Hu, “Hardware realization of multistage speech waveform vector quantizer,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 3095-3097, Apr. 1986.
[14] L. M. Ni and A. K. Jain, “A VLSI systolic architecture for pattern clustering,” IEEE Trans. Patt. Anal. Mach. Intell., vol. PAMI-7, pp. 80-89, Jan. 1985.
[15] B. E. Nelson and R. J. Read, “A bit-serial VLSI vector quantizer,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 2211-2214, Apr. 1986.
[16] P. A. Ramamoorthy and T. Tran, “A systolic architecture for real-time composite video image coding,” in Proc. IEEE Military Communications Conf., pp. 49.6.1-42.6.4, Oct. 1986.
[17] P. A. Ramamoorthy and B. Potu, “Bit-serial systolic chip set for real-time image coding,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 787-790, Apr. 1987.
[18] P. A. Ramamoorthy and B. Potu, “VLSI vector quantizer,” in Proc. 30th Midwest Symp. on Circuits and Systems, pp. 1009-1012, Aug. 1987.
[19] B. H. Juang and A. H. Gray Jr., “Multiple stage vector quantization for speech coding,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 597-600, Apr. 1982.
[20] V. Cuperman and A. Gersho, “Adaptive differential vector coding of speech,” in Conf. Rec., Globecom 82, pp. 1092-1096, Dec. 1982.
[21] J. Foster and R. M. Gray, “Finite-state vector quantization,” in Abstracts of the 1982 Int. Symp. on Information Theory, June 1982.
[22] R. M. Gray and Y. Linde, “Vector quantizers and predictive quantizers for Gauss-Markov sources,” IEEE Trans. Commun., vol. COM-30, pp. 381-389, Feb. 1982.
[23] B. Ramamurthi and A. Gersho, “Image coding using segmented codebooks,” in Proc. Int. Picture Coding Symp., Mar. 1983.
[24] A. Haoui and D. Messerschmidt, “Predictive vector quantization,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 1984.
[25] R. L. Baker and R. M. Gray, “Image compression using non-adaptive spatial vector quantization,” in Conf. Rec. Sixteenth Asilomar Conf. Circuits, Systems, and Computers, pp. 55-61, Oct. 1982.
[26] P. A. Ramamoorthy and T. Tran, “A hybrid coding involving ADM and vector quantization for digital image compression,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 153-146, Apr. 1986.
[27] P. A. Ramamoorthy and T. Tran, “A high quality image compression scheme for real-time applications,” in Proc. IEEE Int. Conf. on Communications, pp. 1893-1897, June 1986.
[28] T. Kohonen, Self Organization and Associative Memory. New York: Springer-Verlag, 1984.
[29] N. M. Nasrabadi and Y. Feng, “Vector quantization of images based upon the Kohonen self-organizing feature maps,” in Proc. IEEE Int. Conf. on Neural Networks, pp. I-101-I-108, June 1988.
[30] F. H. Wu and K. Ganesan, “An algorithm for robust vector quantization using a neural-net model,” (Poster) presented at IEEE Int. Conf. on Neural Networks, San Diego, CA, June 1988.
[31] T. Murakami, K. Asai, and A. Itoh, “Vector quantization of color images,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 133-136, Tokyo, Japan, 1986.
[32] J. V. McCanny and J. G. McWhirter, “Completely iterative, pipelined multiplier array suitable for VLSI,” Proc. Inst. Elect. Eng., vol. 129, no. 2, pp. 40-46, Apr. 1982.
[33] R. W. Johnson, J. M. Davidson, R. C. Jaeger, and D. V. Kerns, “Silicon hybrid wafer-scale package technology,” IEEE J. Solid-State Circuits, vol. SC-21, pp. 845-851, Oct. 1986.


P. A. Ramamoorthy (S’72-M’77) received the B.S. and M.S. degrees from the University of Madras, India, in 1971 and 1974, respectively, and the Ph.D. degree from the University of Calgary, Alberta, Canada, in 1977, all in electrical engineering.

Since 1982 he has been with the University of Cincinnati, where he is currently an Associate Professor of Electrical and Computer Engineering. Previously he was a faculty member at Wayne State University, Detroit, MI, and at Western New England College, Springfield, MA. He has published over 70 technical papers and is a contributing author for Two-Dimensional Signal Processing I, Topics in Applied Physics, Vol. 42, published by Springer-Verlag. His current research interests are in digital signal processing (DSP) and applications, VLSI architectures for DSP, optical computing, and neural networks for signal processing.


Brahmaji Potu (S’82) received the B.S. degree from Birla Institute of Technology and Science, Pilani, India, in 1981 and the M.S. degree from the Indian Institute of Technology, Madras, India, in 1983, both in electrical engineering. At present he is working toward the Ph.D. degree in the Department of Electrical and Computer Engineering, University of Cincinnati, Cincinnati.

From 1983 to 1985 he was an IC design engineer with Semiconductor Complex Ltd., India, working on CMOS gate array products, and with Gould AMI Semiconductors, Santa Clara, CA, on 16K CMOS static RAM and 64K EEPROM products. He has published over 10 technical papers. His research interests are in image coding, VLSI architectures for DSP, and neural networks.

Tien N. Tran (S’85) received the B.S. degree in electrical engineering from Pennsylvania State University, University Park, in 1980. He is currently working toward the Ph.D. degree at the University of Cincinnati, Cincinnati.

He has been a Research Assistant at the University of Cincinnati since 1985. His research interests include image coding, image processing, and digital signal processing.
