
This Tutorial is Downloaded from: https://sites.google.com/site/enggprojectece


An Introduction to Various Features of Speech Signal

Compiled by: Sivaranjan Goswami, Pursuing M. Tech. (2013-15 batch)

Dept. of ECE, Gauhati University, Guwahati, India. Contact: [email protected]

Speech is the most fundamental mode of communication among human beings, as well as many other creatures; in daily life we communicate primarily through speech. Over the last few decades, a large amount of research has been devoted to using speech to control electronic systems. Speech control has a number of advantages over manual control through panels and switches: speech can be transmitted easily over a telephone channel, which makes remote control of devices much simpler.

The audible range of the human ear is 20 Hz to 20 kHz, but the band that matters most for intelligible speech (the telephone band) is roughly 300 Hz to 3400 Hz. By the Nyquist theorem, the sampling rate must therefore be at least 6800 Hz; in telecommunication it is standardized at 8 kHz, which is why the analog-to-digital converters in mobile phones sample speech at 8 kHz. For multimedia applications the sampling rate is usually much higher: MP3 songs are typically sampled at 44100 Hz. This is why sound recorded with a mobile phone sounds poor compared to downloaded MP3 songs.

A. What is a Speech Signal?

Speech, like any sound, is an acoustic signal that travels through air or any other medium as alternating compression and expansion of its particles; it is therefore a pressure wave. A microphone is a transducer that converts this pressure wave into a voltage signal.

A detailed description of the human speech generation system is beyond the scope of this discussion, but a brief outline is needed in the context of feature extraction. The human speech production system is a complex mechanical system: air exhaled by the lungs is modulated first by the glottal folds and then by the tissues of the vocal tract such as the tongue, lips, jaw, and velum. In digital speech processing this process is represented by a discrete-time model, as shown in figure 1. The lungs and glottal folds correspond to the excitation generator block, while the vocal tract is modeled as a linear system, usually an all-pole (IIR) digital filter. The vocal tract parameters are the coefficients of this filter.

Figure 1: Block diagram of the speech generation model


Based on the type of the excitation signal, a speech signal can be classified into two major types:

1. Voiced Speech: Voiced sounds are produced by forcing air through the glottis, the opening between the vocal folds. The excitation is a quasi-stationary impulse train, that is, a signal whose frequency remains constant over a short interval, sometimes referred to as the stationarity period. Examples of voiced speech are the vowel sounds in cat, hear, and too.

2. Unvoiced Speech: Unvoiced sounds are generated by forming a constriction at some point along the vocal tract and forcing air through it to produce turbulence. The excitation is a random signal and can be modeled as white Gaussian noise. Examples are the consonant sounds in ship and key.

It can be said that the voiced component of a word is responsible for its tone and the overall shape of its waveform, whereas the unvoiced sections carry the actual meaning. The waveforms of the words CUP and DUCK, for example, look similar because their voiced parts are similar.

As shown in figure 1, the speech production mechanism is modeled as a cascade of an excitation generator and a digital filter. The excitation determines the type of speech, and the digital filter simulates the effect of the vocal-tract organs and tissues on the excitation. The parameters of the filter are known as the vocal tract parameters. The excitation is either an impulse train (for voiced speech) or random noise (for unvoiced speech). Thus figure 1 can be redrawn as figure 2.


Figure 2: Block diagram of speech generation model for Linear Predictive Analysis
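As a rough illustration of this source-filter model, the Python sketch below excites a small all-pole filter with either an impulse train (voiced) or white noise (unvoiced). The pitch value and the filter coefficients here are arbitrary placeholders chosen only for illustration, not the output of any real vocal-tract analysis.

    import numpy as np
    from scipy.signal import lfilter

    fs = 8000                       # sampling rate (Hz)
    f0 = 120                        # assumed pitch of the voiced excitation (Hz)
    n = np.arange(int(0.02 * fs))   # one 20 ms frame = 160 samples

    # Excitation generator: impulse train for voiced speech, noise for unvoiced
    voiced_excitation = (n % (fs // f0) == 0).astype(float)
    unvoiced_excitation = np.random.randn(len(n))

    # Hypothetical all-pole vocal-tract filter H(z) = 1 / A(z); the
    # coefficients are arbitrary placeholders, not from real analysis
    a = [1.0, -1.3, 0.8]
    voiced_frame = lfilter([1.0], a, voiced_excitation)
    unvoiced_frame = lfilter([1.0], a, unvoiced_excitation)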

B. Short-Time Analysis of Speech Signal

From the above discussion we have seen that the properties of a speech signal remain the same only for a short duration of time. Therefore, any kind of speech processing first requires segmentation of the speech signal into short frames. The duration over which a speech signal remains stationary varies from speaker to speaker, usually ranging from 15 to 25 milliseconds; it is common practice to take it as 20 milliseconds. If the speech signal is sampled at 8 kHz, this gives 160 samples per frame.
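As a minimal sketch under these assumptions (8 kHz sampling, 20 ms non-overlapping frames), a signal can be segmented as follows:

    import numpy as np

    def frame_signal(x, fs=8000, frame_ms=20):
        """Split a 1-D signal into non-overlapping frames of frame_ms ms."""
        n = int(fs * frame_ms / 1000)     # samples per frame: 160 at 8 kHz
        num_frames = len(x) // n          # any incomplete tail frame is dropped
        return x[:num_frames * n].reshape(num_frames, n)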

Sometimes, to overcome difficulties particular to a given problem, the speech frames are overlapped or multiplied by a window function. Such cases are not covered here; they are discussed in the tutorial "Short Term Spectral and Cepstral Analysis of Speech Signal".

C. Features of Speech Signal

So far we have had a brief introduction to the generation and types of speech signals. We now turn to feature extraction.

1. Zero-Crossing Rate: The zero-crossing rate is a measure of the frequency content of the signal over a short period. It is obtained by counting the number of times the sign of the signal changes within a frame; since a periodic signal crosses zero twice per period, dividing this count by two gives an estimate of the number of cycles.


Figure 3: Zero crossing

It can be seen that during one period, the signal crosses zero twice. Thus for any frame, the zero-crossing rate (ZCR) is given by:

$$\mathrm{ZCR} = \frac{\text{No. of sign changes in the frame}}{\text{Frame duration (sec)}}$$
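A direct implementation of this formula for one frame might look like this:

    import numpy as np

    def zero_crossing_rate(frame, fs=8000):
        """Number of sign changes per second within one frame."""
        signs = np.sign(frame)
        sign_changes = np.count_nonzero(signs[:-1] != signs[1:])
        return sign_changes / (len(frame) / fs)   # divide by frame duration (sec)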

2. Mean Square or Mean Magnitude Value:

These are mean values of the signal over a particular frame, ignoring the sign. The mean square value of the k-th frame is given by:

$$P(k) = \frac{1}{N}\sum_{n=0}^{N-1} x_k^2(n); \qquad k = 0, 1, 2, \ldots, L-1$$

Similarly, the mean magnitude is given by:

$$A(k) = \frac{1}{N}\sum_{n=0}^{N-1} \lvert x_k(n) \rvert; \qquad k = 0, 1, 2, \ldots, L-1$$

where x_k(n) is the n-th sample of the k-th frame, N is the number of samples per frame, and L is the total number of frames in the audio clip.

Both the mean square and the mean magnitude carry information about the short-time energy of the signal. If the signal is normalized to the range [-1, 1], then both the mean square value and the mean magnitude value lie in [0, 1]. The choice between the two usually comes down to which makes it easier to select a suitable threshold for a given operation; the mean square value sometimes makes threshold selection easier.

In textbooks these equations are usually written using a sliding window. This introductory tutorial avoids that notation, as the present form is easier to implement in a computer program.
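A sketch of both measures, assuming the L x N frame matrix produced by the framing function above:

    import numpy as np

    def mean_square(frames):
        """P(k) for every frame; `frames` is an L x N array as built above."""
        return np.mean(frames ** 2, axis=1)

    def mean_magnitude(frames):
        """A(k) for every frame."""
        return np.mean(np.abs(frames), axis=1)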

3. Voice Activity Detection


We do not speak continuously; speech contains many pauses and breaks. To perform any speech processing, it is necessary to distinguish between the presence and absence of speech in an audio clip. In the absence of background noise this is easy: when no speech is present, the mean magnitude or mean square value is very small, while a high value indicates the presence of speech. With background noise, voice activity detection becomes a challenging task, and a large body of literature has been published on it.
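For clean audio, a naive threshold detector along these lines is enough; the threshold value below is an assumption chosen only for illustration and would need tuning in practice.

    import numpy as np

    def detect_voice_activity(frames, threshold=0.01):
        """Flag frames whose mean magnitude exceeds a threshold as speech.

        The default threshold is a made-up value for clean audio normalized
        to [-1, 1]; this naive scheme breaks down in background noise,
        as noted above.
        """
        return np.mean(np.abs(frames), axis=1) > threshold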

4. Detection of Voiced and Unvoiced Speech

Voiced and unvoiced speech were introduced above. It is worth noting that most features of a speech signal are extracted from voiced speech, so separating voiced from unvoiced speech is another important task after voice activity detection.

Note that for voiced speech the mean square value is large and the zero-crossing rate is small, whereas for unvoiced speech the zero-crossing rate is large and the average magnitude is very small.

Detecting voiced and unvoiced speech is also challenging in the presence of background noise. Voiced speech is relatively easy to distinguish when the background noise is stationary, but unvoiced speech is difficult. A number of papers have been published in this area as well.
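As a sketch of the rule above, the classifier below combines the two measures for one active frame; both threshold values are illustrative assumptions, not calibrated constants.

    import numpy as np

    def classify_voicing(frame, fs=8000, energy_threshold=0.02, zcr_threshold=1000.0):
        """Label one active frame 'voiced' or 'unvoiced'.

        High energy together with a low zero-crossing rate suggests voiced
        speech; otherwise the frame is treated as unvoiced.
        """
        energy = np.mean(frame ** 2)
        signs = np.sign(frame)
        zcr = np.count_nonzero(signs[:-1] != signs[1:]) / (len(frame) / fs)
        if energy > energy_threshold and zcr < zcr_threshold:
            return 'voiced'
        return 'unvoiced'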

5. Pitch and Pitch Period Estimation

Pitch is the perceived fundamental frequency of a musical note or of voiced speech. It may not equal the actual fundamental frequency of the signal, but in much of the literature the terms pitch and fundamental frequency are used interchangeably. The pitch period is the fundamental period of voiced speech. Pitch estimation is a challenging problem.

Pitch is one of the most important parameters required for high-level speech processing such as speech recognition and speaker recognition. Everyone has a pitch range to which he or she is constrained by the physics of the larynx. For men the possible pitch range usually lies somewhere between 50 and 250 Hz, while for women it usually falls in the interval 120-500 Hz. Everyone also has a "habitual pitch level," a sort of preferred pitch that is used naturally on average. Pitch shifts up and down in speaking in response to stress, intonation, and emotion. Stress* refers to a change in fundamental frequency and loudness that signals a change in emphasis of a syllable, word, or phrase. Intonation* is associated with the pitch contour over time and performs several functions in a language, the most important being to signal grammatical structure; the marking of sentence, clause, and other boundaries is accomplished through intonation patterns.

Many pitch estimation techniques have been published in journals and conferences worldwide. A classical approach using cepstral analysis is discussed in the tutorial "Short Term Spectral and Cepstral Analysis of Speech Signal".
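As a simpler illustration (distinct from the cepstral method referred to above), the sketch below uses the classical autocorrelation approach: the lag of the strongest autocorrelation peak within the plausible pitch range is taken as the pitch period.

    import numpy as np

    def estimate_pitch(frame, fs=8000, f_min=50, f_max=500):
        """Crude autocorrelation pitch estimate for one voiced frame.

        Note that at 50 Hz one period is 160 samples, so pitch analysis
        frames are usually longer than 20 ms in practice.
        """
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        lag_min, lag_max = fs // f_max, fs // f_min
        lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
        return fs / lag   # pitch in Hz; the pitch period is lag / fs seconds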

*The terms stress and intonation are explained below under feature no. 7.


6. Phonemics and Phonetics of Speech Signal

Phonemes are the basic theoretical units of speech. Each phoneme can be considered a code consisting of a unique set of articulatory gestures. English has about 42 phonemes. Owing to many factors, including accent, gender, and, most importantly, coarticulatory effects, a given phoneme has a variety of acoustic manifestations in the course of flowing speech; any acoustic utterance that is clearly "supposed to be" a given ideal phoneme is labeled as that phoneme. The phonemes of a language therefore comprise a minimal theoretical set of units sufficient to convey all meaning in the language.

One common approach to speech recognition is to segment the speech and identify the phonemes from the phones (the sounds actually produced in speaking).

The study of the abstract units (phonemes) and their relationships in a language is called phonemics, while the study of the actual sounds of the language is called phonetics. More specifically, there are three branches of phonetics each of which approaches the subject somewhat differently:

(a) Articulatory phonetics is concerned with the manner in which speech sounds are produced by the articulators of the vocal system.

(b) Acoustic phonetics studies the sounds of speech through analysis of the acoustic waveform.

(c) Auditory phonetics studies the perceptual response to speech sounds as reflected in listener trials.

In speech recognition or speech-to-text systems, a corpus must be built containing the whole set of phonemes of a particular language and the corresponding letters or meanings. In languages like English the same phoneme may correspond to several letters, as there is no one-to-one correspondence between sounds and letters; in languages like Hindi or Assamese the mapping is somewhat simpler, but every language has its own challenges. When a test speech is input, the system segments it and identifies the corresponding phoneme in the corpus using a suitable algorithm such as Dynamic Time Warping (DTW). The preceding and succeeding phonemes help resolve ambiguity when a phoneme corresponds to more than one letter. Such systems are quite complex and beyond the scope of this basic tutorial.
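As a rough illustration of the alignment step only, here is a minimal DTW sketch for two one-dimensional feature sequences; a real recognizer would align sequences of multidimensional feature vectors (e.g. cepstral coefficients) and constrain the warping path.

    import numpy as np

    def dtw_distance(seq_a, seq_b):
        """Minimal Dynamic Time Warping distance between two 1-D sequences."""
        n, m = len(seq_a), len(seq_b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(seq_a[i - 1] - seq_b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j],       # step in seq_a only
                                     cost[i, j - 1],       # step in seq_b only
                                     cost[i - 1, j - 1])   # step in both
        return cost[n, m]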

7. Prosodic Features: Stress and Intonation of Speech

The tonal and rhythmic aspects of speech are generally called prosodic features. These features contribute significantly to the formal linguistic structure of a language. Because they extend over more than one phoneme, they are also known as suprasegmental features. Prosodic features are created by special manipulations of the speech production system during the normal sequence of phoneme production. These manipulations are categorized as either source factors or vocal-tract shaping factors: source factors are based on subtle changes in the speech breathing muscles and vocal folds, while vocal-tract shaping factors operate through movements of the upper articulators. The acoustic patterns of prosodic features are heard as systematic changes in the duration, intensity, fundamental frequency, and spectral patterns of the individual phonemes.

Stress and intonation are the most important prosodic features of speech. Stress refers to a change in fundamental frequency and loudness that signals a change in emphasis of a syllable, word, or phrase. Intonation is associated with the pitch contour over time and performs several functions in a language, the most important being to signal grammatical structure; the marking of sentence, clause, and other boundaries is accomplished through intonation patterns.


Stress is used to distinguish similar phonetic sequences or to highlight a syllable or word against a background of unstressed syllables. For example, consider the two phrases "That is insight" and "That is in sight." In the first phrase "in" is stressed and "sight" is unstressed, while the converse is true in the second phrase. Features like stress and intonation can be extracted using various pattern-recognition-based methods.

*Prosodic feature extraction and speech recognition require a very in-depth study of the subject. Here I have given only a hint of such features. Suggested book: "Discrete-Time Processing of Speech Signals" by John R. Deller, John H. L. Hansen, and John G. Proakis.

References:

[1] Lawrence R. Rabiner and Ronald W. Schafer, "Introduction to Digital Speech Processing", now Publishers Inc.

[2] John R. Deller, John H. L. Hansen, and John G. Proakis, "Discrete-Time Processing of Speech Signals", The Institute of Electrical and Electronics Engineers (IEEE), Inc., New York.

*Download links for both of these books are available at the website.