[ieee 1997 ieee workshop on speech coding for telecommunications proceedings. back to basics:...

A ROBUST LOW RATE VOICE CODEC FOR WIRELESS COMMUNICATIONS

K. Swaminathan, S. Nandkumar, U. Bhaskar, N. Kowalski, S. Patel, G. Zakaria, J. Li, V. Prasad

Hughes Network Systems, 11717 Exploration Lane

Germantown, MD 20876 email: [email protected]

ABSTRACT

The design, implementation and performance of a high quality low bit rate speech codec for wireless communication are presented. The codec is based on the CELP model. Generalized analysis-by-synthesis, algebraic fixed codebooks, and multi-stage LSF techniques are used, resulting in robustness to transmission errors and high quality across changing speech levels and background noise conditions. The bit allocations for the quantization of LSF, pitch and the excitation are chosen in a mode specific manner based on a robust mode classification scheme. A 4.8 kb/s version has been implemented and subjective tests show speech quality that is equivalent to or better than most cellular standard codecs. Performance is also consistent across speech levels and transmission errors.

analysis resulting in an algorithmic delay of 30 ms. The input speech frame is divided into 4 subframes for mode A and 3 subframes for modes B and C (see Table 1 for bit allocations). A discussion of the processing techniques used to enhance quality and robustness is presented below.
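The framing figures quoted for the 4.8 kb/s version (20 ms frame, 10 ms lookahead, 30 ms algorithmic delay) can be checked with a short calculation. The sketch below is illustrative only: the 8 kHz sampling rate and the rule for splitting a frame into three near-equal subframes are assumptions, not values stated in the paper.

```python
# Framing arithmetic for the 4.8 kb/s codec.
BIT_RATE_BPS = 4800     # from the paper
FRAME_MS = 20           # from the paper
LOOKAHEAD_MS = 10       # from the paper
SAMPLE_RATE_HZ = 8000   # ASSUMED: narrowband telephony sampling rate


def bits_per_frame(bit_rate_bps=BIT_RATE_BPS, frame_ms=FRAME_MS):
    """Number of bits available to encode one frame at a constant bit rate."""
    return bit_rate_bps * frame_ms // 1000


def algorithmic_delay_ms(frame_ms=FRAME_MS, lookahead_ms=LOOKAHEAD_MS):
    """Algorithmic delay is the frame length plus the LP analysis lookahead."""
    return frame_ms + lookahead_ms


def subframe_lengths(num_subframes, frame_ms=FRAME_MS,
                     sample_rate_hz=SAMPLE_RATE_HZ):
    """Split one frame into near-equal subframes (ASSUMED split rule).

    Any leftover samples are given to the last subframes, one sample each.
    """
    samples = sample_rate_hz * frame_ms // 1000
    base, extra = divmod(samples, num_subframes)
    return [base + (1 if i >= num_subframes - extra else 0)
            for i in range(num_subframes)]


if __name__ == "__main__":
    print("bits per frame:", bits_per_frame())
    print("algorithmic delay (ms):", algorithmic_delay_ms())
    print("mode A subframes:", subframe_lengths(4))
    print("mode B/C subframes:", subframe_lengths(3))
```

Under these assumptions every mode shares the same per-frame bit budget, so any bits saved on one parameter in a given mode can be moved to another parameter without changing the bit rate.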

Table 1: Bit allocation for 4.8 kb/s codec

Encoded Parameter | Mode A | Mode B | Mode C
Mode Information  |   2    |   2    |   1

1. INTRODUCTION

Applications in wireless communications require low rate speech codecs to operate at a high voice quality in the presence of channel errors, varying input signal levels, different handsets, and background noise. A low rate multi-mode CELP based coder is designed to address these requirements. The input speech frame is classified as being in one of three modes: a voiced stationary mode (mode A), a voiced nonstationary mode (mode B), and an unvoiced/noise/silence mode (mode C). For each frame, the available bits are allocated among the coder parameters so that the characteristics of the corresponding mode are exploited, while a constant bit rate is maintained. For example, spectral stationarity in mode A is exploited to reduce the number of bits required to encode spectral parameters (LSF). Another example is to exploit the lack of long-term periodicity in mode C to reduce the number of bits used to encode the adaptive codebook gain. In each case, the bits saved are used to encode the residual excitation. In addition, the generalized LP analysis-by-synthesis technique is used to reduce the number of bits needed to encode pitch information.

2. DESIGN OF A 4.8 KB/S CODEC

The proposed multi-mode CELP coder is illustrated in Figure 1. A 4.8 kb/s version of the multi-mode CELP coder operates on an input frame size of 20 ms with a lookahead of 10 ms for LP

2.1 Mode Classification

The classification of input speech frames into 3 modes (voiced stationary, voiced nonstationary, and unvoiced/silence/noise) is designed to be consistent across varying levels of speech and background noise. The pattern classification problem is solved using a fuzzy systems approach. The background noise level is updated every frame based on the probability of voice activity in that frame, and a set of thresholds is adapted to this noise estimate. The fuzzy system operates on the open-loop pitch deviation from the past frame, the cepstral distance from the past frame, and the energies of four equally divided frequency subbands, and assigns a weight to the input frame expressing its membership in each of the three modes. A voice activity indicator is also available as an output of the mode classification and is useful for background noise suppression or comfort noise generation during DTX operation.
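The paper does not give the membership functions or the noise update rule, so the following is only a toy sketch of how a fuzzy three-way decision and a voice-activity-weighted noise tracker might look. Every name, the min/max fuzzy operators, the smoothing constant `alpha`, and the assumption that all features are pre-normalized to [0, 1] (including a single `voicing` score standing in for the subband energy features) are hypothetical.

```python
def update_noise_level(noise, frame_energy, p_voice, alpha=0.1):
    """Track background noise, adapting faster when voice is unlikely.

    HYPOTHETICAL rule: the step toward the current frame energy is scaled
    by (1 - p_voice), so frames judged to be speech barely move the estimate.
    """
    return noise + alpha * (1.0 - p_voice) * (frame_energy - noise)


def classify_mode(pitch_dev, cepstral_dist, voicing):
    """Return (mode, memberships) for one frame.

    All inputs are ASSUMED to be normalized to [0, 1]:
      pitch_dev     - open-loop pitch deviation from the past frame
      cepstral_dist - cepstral distance from the past frame
      voicing       - hypothetical score derived from subband energies
    """
    # A frame is "stationary" only if both pitch and spectrum changed little.
    stationarity = min(1.0 - pitch_dev, 1.0 - cepstral_dist)
    memberships = {
        "A": min(voicing, stationarity),        # voiced stationary
        "B": min(voicing, 1.0 - stationarity),  # voiced nonstationary
        "C": 1.0 - voicing,                     # unvoiced/noise/silence
    }
    mode = max(memberships, key=memberships.get)
    return mode, memberships


if __name__ == "__main__":
    print(classify_mode(0.1, 0.1, 0.9))
    print(classify_mode(0.8, 0.2, 0.9))
    print(classify_mode(0.5, 0.5, 0.1))
```

Keeping the memberships alongside the hard decision mirrors the idea that the classifier assigns a weight per mode; a real system would adapt its thresholds to the tracked noise level, which this sketch leaves out.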

2.2 LSF Vector Quantization

The LSF parameters are quantized in a mode specific manner using the multi-stage VQ technique [1]. LSF error vectors are obtained by subtracting the long-term mean and performing first order backward prediction. The correlation coefficients for backward prediction are optimally estimated over a large training database (> 1 million LSF vectors). For highly stationary mode A frames, a 12 bit 2-stage VQ is used and for modes B and C, a 22 bit 4-stage VQ is used. High quality quantization is achieved at these rates. The savings in bits in mode A are passed on to the fixed codebook. The multi-stage codebooks are also more robust to transmission errors and lend themselves to selective error protection of most significant bits.

0-7803-4073-6/97/$10.00 © 1997 IEEE
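The mean removal, first order backward prediction, and multi-stage quantization described in this section can be sketched as below. This is a simplified illustration, not the codec's actual quantizer: it uses a greedy stage-by-stage search rather than the search procedures of [1], and the tiny two-dimensional codebooks, the mean, and the prediction coefficients in the usage example are made-up values.

```python
def _nearest(codebook, target):
    """Index of the codebook vector closest to target (squared error)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - t) ** 2
                                 for c, t in zip(codebook[i], target)))


def msvq_encode(lsf, mean, rho, prev, stages):
    """Quantize one LSF vector.

    prev   - previous frame's quantized, mean-removed LSF vector (state)
    rho    - per-coefficient backward prediction coefficients
    stages - list of codebooks, one per stage
    Returns (indices, new_state).
    """
    # Error vector: remove the long-term mean, then the backward prediction.
    target = [x - m - r * p for x, m, r, p in zip(lsf, mean, rho, prev)]
    quant = [0.0] * len(lsf)
    indices = []
    for codebook in stages:
        # Each stage quantizes what the earlier stages left over (greedy).
        residual = [t - q for t, q in zip(target, quant)]
        idx = _nearest(codebook, residual)
        indices.append(idx)
        quant = [q + c for q, c in zip(quant, codebook[idx])]
    new_state = [r * p + q for r, p, q in zip(rho, prev, quant)]
    return indices, new_state


def msvq_decode(indices, mean, rho, prev, stages):
    """Rebuild the LSF vector from stage indices. Returns (lsf, new_state)."""
    quant = [0.0] * len(mean)
    for idx, codebook in zip(indices, stages):
        quant = [q + c for q, c in zip(quant, codebook[idx])]
    new_state = [r * p + q for r, p, q in zip(rho, prev, quant)]
    lsf = [m + s for m, s in zip(mean, new_state)]
    return lsf, new_state


if __name__ == "__main__":
    mean, rho, prev = [1.0, 2.0], [0.5, 0.5], [0.2, -0.2]   # made-up values
    stages = [[[0.0, 0.0], [0.5, 0.5], [-0.5, -0.5]],
              [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]]
    idx, _ = msvq_encode([1.7, 2.3], mean, rho, prev, stages)
    print(idx, msvq_decode(idx, mean, rho, prev, stages)[0])
```

Because the decoder only sums one vector per stage, an error in a later-stage index perturbs the result less than an error in the first stage, which is one way to see why such codebooks suit selective protection of the most significant bits.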

2.3 Residual Excitation Modelling

The generalized analysis-by-synthesis technique is used to encode pitch information resulting in bit savings. Here, instead of transmitting the closed-loop fractional pitch, the LP residual is time shifted to match an interpolated pitch contour [2]. Then, the synthesized speech derived using the time shifted residual is used as the reference to choose the optimum fixed excitation sequence. Fixed codebooks are mode specific multi-pulse algebraic codebooks [3] with fast search procedures. The choice of an optimal fixed codebook vector is enhanced by the use of harmonic noise weighting, perceptually weighted synthesis, and adaptive pitch harmonic enhancement.
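To make the idea of a multi-pulse algebraic codebook concrete, the sketch below builds a sparse codevector from signed unit pulses placed on interleaved tracks and counts the bits needed to index it. The track layout (one pulse per track, track `t` covering samples `t`, `t + num_tracks`, ...), the 40-sample subframe, and the function names are assumptions chosen for illustration; the paper does not specify the mode specific codebook structures.

```python
def algebraic_codevector(length, num_tracks, positions, signs):
    """Build a codevector with one signed unit pulse per interleaved track.

    positions[t] is the pulse index within track t; signs[t] is +1 or -1.
    ASSUMED layout: track t covers samples t, t + num_tracks, ...
    """
    if len(positions) != num_tracks or len(signs) != num_tracks:
        raise ValueError("need one position and one sign per track")
    vector = [0] * length
    for track, (pos, sign) in enumerate(zip(positions, signs)):
        sample = track + pos * num_tracks
        if pos < 0 or sample >= length:
            raise ValueError("pulse position outside the subframe")
        if sign not in (1, -1):
            raise ValueError("sign must be +1 or -1")
        vector[sample] = sign
    return vector


def codebook_bits(length, num_tracks):
    """Bits to index the codebook: position bits plus one sign bit per pulse."""
    positions_per_track = length // num_tracks
    position_bits = (positions_per_track - 1).bit_length()
    return num_tracks * (position_bits + 1)


if __name__ == "__main__":
    vec = algebraic_codevector(40, 4, [0, 2, 9, 5], [1, -1, 1, -1])
    print([i for i, v in enumerate(vec) if v], codebook_bits(40, 4))
```

Since the codevector is defined by a handful of positions and signs rather than stored entries, the codebook needs no table storage and the search can be restricted track by track, which is what makes fast search procedures practical.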

3. REAL-TIME IMPLEMENTATION

A 4.8 kb/s version of the codec has been implemented in real-time using a floating point DSP (40 MHz Texas Instruments TMS320C30). The encoder uses about 81% of the computational capability of the DSP and the decoder uses about 14%. The total (encoder and decoder) memory requirement is about 30 Kwords. Thus, a full-duplex implementation is possible using a single, low cost DSP. Currently, a real-time implementation based on a single fixed point DSP (30 MHz AT&T DSP1610) is also under way.

4. SUBJECTIVE EVALUATION RESULTS

The 4.8 kb/s version of the codec has been subjectively evaluated under clear channel conditions, over a range of input levels. The MOS results for the 4.8 kb/s codec as well as a number of cellular standard codecs are presented in Table 2. The results indicate that the 4.8 kb/s codec is equivalent to or better than all the cellular codecs tested, with the exception of the IS-641 codec.

Table 2: Performance under Clear Channel Conditions

The codec performance was also evaluated under additive white Gaussian noise (AWGN) impaired channel conditions for a number of bit error rate (BER) conditions. QPSK modulation with a decision feedback demodulator and a 5 bit soft decision based convolutional FEC with a bit rate of 1.7 kb/s were used. The results of this test are presented in Table 3. Since this testing, additional improvements have been made, which will be presented at the workshop.
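As a rough aid to reading the raw BER figures, the sketch below computes the gross channel bits per frame and flips bits independently at a given error rate. It is a deliberately crude stand-in: it assumes the 1.7 kb/s FEC rate is overhead added on top of the 4.8 kb/s speech rate, and it models the channel as hard-decision independent bit flips rather than the QPSK, soft decision, and convolutional decoding chain used in the test.

```python
import random

SPEECH_RATE_BPS = 4800   # from the paper
FEC_RATE_BPS = 1700      # from the paper; ASSUMED to be added redundancy
FRAME_MS = 20            # from the paper


def gross_bits_per_frame():
    """Channel bits per frame if the FEC rate is pure overhead (ASSUMED)."""
    return (SPEECH_RATE_BPS + FEC_RATE_BPS) * FRAME_MS // 1000


def inject_bit_errors(bits, ber, rng):
    """Flip each bit independently with probability ber (toy channel)."""
    return [b ^ 1 if rng.random() < ber else b for b in bits]


if __name__ == "__main__":
    rng = random.Random(0)   # fixed seed so runs are repeatable
    sent = [0] * gross_bits_per_frame()
    received = inject_bit_errors(sent, 0.03, rng)
    print("gross bits per frame:", len(sent))
    print("bits in error:", sum(received))
```

At a 3% raw BER this toy channel corrupts a few bits in almost every frame, which gives a feel for how much work the FEC and the codec's own robustness measures have to do.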

Table 3: Performance under AWGN Channel Conditions

Raw BER            | MOS  | 95% Confidence Int.
0% (Clear channel) | 3.62 | 3.55 - 3.69
1.20%              |      | 3.23 - 3.37
1.66%              | 3.19 | 3.12 - 3.26
2.25%              | 3.00 | 2.94 - 3.07
3.0%               | 2.79 | 2.72 - 2.86

References

[1] W. P. LeBlanc, et al., “Efficient Search and Design Procedures for Robust Multi-Stage VQ of LPC Parameters for 4 kb/s Speech Coding,” IEEE Trans. on Speech and Audio Processing, vol. 1, no. 4, Oct. 1993, pp. 373 - 385.

[2] W. B. Kleijn, et al., “The RCELP Speech Coding Algorithm,” European Trans. on Telecomm., vol. 5, no. 5, Sept. 1994, pp. 573 - 582.

[3] C. Laflamme, et al., “16 kb/s Wideband Speech Coding Technique based on Algebraic CELP,” Proc. ICASSP 1991, pp. 13 - 16.

Figure 1: Block Diagram of the multi-mode CELP coder
