

Introduction 1 SGN-14006 / A.K.

SGN–14006 Audio and Speech Processing

Lectures, Fall 2015
Pasi Pertilä, Tampere University of Technology
(slides by Anssi Klapuri)

Course goals

• Learn the basics of audio signal processing
  – Basic operations and their underlying ideas and principles
  – Provide basic skills, although not all of the latest cutting-edge algorithms can be covered
• Learn the fundamentals of speech processing
  – Speech production and its computational modeling
  – Acoustic features to represent speech signals
  – Some applications: speech coding, synthesis
• Learn the basics of acoustics and human hearing
  – These form the foundation for technical applications

Lecture timeline (some changes may still take place)

• Sound, audio signals, acoustics
• Hearing
• Basic audio signal processing operations
  – AD/DA conversion, filters and filter banks, dynamic control, etc.
• Sound synthesis
• Audio coding
• Speech production anatomy, phonetics
• Linear prediction, MFCCs, and cepstrum
• Speech coding
• Speech synthesis

What is not covered by this course

• Speech recognition, audio content analysis, and acoustic pattern recognition
  → Course SGN-24006 "Analysis of Audio, Speech and Music Signals" (period 4)
• Analog audio
  – Electroacoustics, microphone and loudspeaker design
  → See the course "Akustiikan mittaukset" (Acoustic Measurements)
• Hardware implementations


Practical arrangements

• Course homepage: http://www.cs.tut.fi/~sgn14006
• Lectures
  – Mondays 12-14 in TB219
  – Thursdays 14-16 in TB222
  – Pasi Pertilä, pasi.pertila @ tut.fi
• Lecture slides will be available as PDF on the course page
  – The course is not based on any individual textbook. Lectures, lecture notes and exercises will be sufficient to take the exam.
  – Some recommended textbooks are mentioned at the end of this introduction
• Requirements: exam and project work
• 5 cr

Exercises

• Exercises start one week after the lectures (2.9.2015)
• Assistants: Shriram Nandakumar, Emre Cakir
• Contents: math and Matlab exercises related to the lectures
• Two alternative groups
  – Tuesday 10-12 in TC303 (updated!)
  – Friday 12-14 in TC303
  – Register for either group online at 14:00 today: www.tut.fi/pop
• Math problems are to be solved in advance; Matlab exercises are done during the exercise sessions
• Active completion of and participation in the exercises is credited with up to 3 points in the exam (equivalent to one mark)
• Project work will also be discussed at the exercises

Project work

• Implementing an audio signal processing algorithm in Matlab
  – In two-person groups
• Topic(s) will be introduced later during the lectures
• Requirements:
  – Choosing the topic
  – Implementing the algorithm
  – Final report by 28.10.
• More detailed instructions will appear on the course home page

Reference material

• Gold, Morgan, Ellis. "Speech and Audio Signal Processing," Wiley, 2011.
• Zölzer. "Digital Audio Signal Processing," Wiley & Sons, 2nd ed., 2008.
  – Including AD/DA conversion, dynamic control, equalization, filter banks
• Quatieri. "Discrete-Time Speech Signal Processing: Principles and Practice," Prentice Hall PTR, 2002.
• Rossing. "The Science of Sound," Addison-Wesley, 1990.
  – Acoustics, hearing
• Brandenburg, Kahrs. "Applications of Digital Signal Processing to Audio and Acoustics," Kluwer Academic Publishers, 1998.
  – Chapter on perceptual audio coding
• Pulkki, Karjalainen. "Communication Acoustics," Wiley, 2015.


Introduction to audio signals and their representation

Audio signals

• Audio = related to sound or hearing
• The word sound may mean
  1. a sensation perceived by the auditory system, or
  2. longitudinal pressure waves in a material medium (such as air) that may cause a hearing sensation
  – Because of the limits of human hearing, we usually consider the frequency range 20 Hz – 20 kHz and air as the medium (although hearing also works underwater, for example)
• Sound signal – audio signal
  – Numerical representation of sound
  – Sound pressure as a function of time, measured using a microphone for example
• Note: "audio signal" is often understood to mean a non-speech audio signal, although speech signals are audio too

Audio and speech processing

• Where is audio and speech processing needed?
• Examples:
  – Convert a musical piece into compressed mp3 format and store it on a hard disk for later playback (audio coding)
  – Encode a speech signal on a mobile phone before transmission
  – Add reverberation to a sound, correct the pitch of a singer (studio technology)
  – Enhance the quality of a speech signal (denoising, echo cancellation)
  – Compensate for loudspeaker non-idealities by digital equalization
• Typical digital signal processing system:
  1. Digitize a signal (sampling, quantization)
  2. Process in digital form (store, manipulate, etc.)
     – the digital representation enables a variety of algorithms
  3. Convert back to an analog signal
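The digitization step (sampling, then quantization) can be sketched as follows. This is a minimal illustration in Python rather than the course's Matlab, and the helper names `sample_sine` and `quantize`, the 440 Hz test tone, and the bit depths are all assumptions made for the example, not part of the lecture material.

```python
import math

def sample_sine(freq_hz, fs_hz, n_samples):
    """Sampling: evaluate a unit-amplitude sine at the instants n / fs."""
    return [math.sin(2 * math.pi * freq_hz * n / fs_hz) for n in range(n_samples)]

def quantize(x, n_bits):
    """Quantization: map values in [-1, 1] onto a uniform grid of 2**n_bits levels."""
    levels = 2 ** (n_bits - 1)                      # e.g. 32768 for 16 bits
    out = []
    for v in x:
        code = round(v * levels)                    # nearest integer code
        code = max(-levels, min(levels - 1, code))  # clip to the valid code range
        out.append(code / levels)                   # scale back for comparison
    return out

x = sample_sine(440.0, 8000.0, 80)   # 10 ms of a 440 Hz tone at 8 kHz (assumed values)
x8 = quantize(x, 8)
x16 = quantize(x, 16)

# Largest quantization error for each word length
err8 = max(abs(a - b) for a, b in zip(x, x8))
err16 = max(abs(a - b) for a, b in zip(x, x16))
print(err8, err16)
```

More bits give a finer grid and therefore a smaller worst-case error; the sampling rate, in contrast, determines which frequencies can be represented at all.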

Audio signal representations

• Different applications employ different representations
  – Time domain representation
  – Frequency domain representation
  – Time-frequency domain representation
• In this course we mainly consider music and speech
  – Music signals involve a wide variety of sounds; billions of people listen to music worldwide
  – Speech signals are an important special category of sound signals because of their role in communication


Time domain signal

• Air pressure as a function of time (zero level = normal air pressure) is a natural representation for audio
  – An analog signal is easy to record using a microphone and play back using a loudspeaker
• For music, typical sampling rates are 44.1 or 48 kHz
  – Allows for representing the frequency range of human hearing (approximately 20 Hz – 20 kHz)
• For speech
  – 8 kHz: narrowband
    • the conventional telephone rate (fricatives such as /s/ and /f/ are distorted)
  – 16 kHz: wideband
    • voice over IP, bandwidth extension
• Other rates are also widely used: 96, 32, 22.05 kHz etc.
• Most of the energy (and information) of natural sounds is at low frequencies (around 200 Hz – 5 kHz)

Time domain signal (1)

• An analog signal (solid line) can be represented with discrete samples (dots) without loss of information if the sampling frequency is greater than 2 × the highest frequency component in the signal
  – Recall the sampling theorem from introductory signal processing courses
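What goes wrong when the sampling condition is violated can be seen in a small sketch. The 8 kHz rate and the two test frequencies are assumptions chosen for illustration: a 5 kHz tone lies above half the sampling rate, and its samples turn out to be indistinguishable (up to sign) from those of a 3 kHz tone.

```python
import math

fs = 8000.0   # assumed sampling rate in Hz, so fs / 2 = 4 kHz

def sampled_sine(freq_hz, n_samples=16):
    """Samples of a unit-amplitude sine taken at rate fs."""
    return [math.sin(2 * math.pi * freq_hz * n / fs) for n in range(n_samples)]

ok = sampled_sine(3000.0)        # 3 kHz is below fs / 2: representable
aliased = sampled_sine(5000.0)   # 5 kHz is above fs / 2: aliases to fs - 5 kHz = 3 kHz

# The 5 kHz samples equal the negated 3 kHz samples, so their sum vanishes
max_diff = max(abs(a + b) for a, b in zip(ok, aliased))
print(max_diff)
```

After sampling, nothing distinguishes the two tones, which is why frequencies above half the sampling rate have to be filtered out before AD-conversion.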

Time domain signal (2)

• A large time scale illustrates the sound's amplitude envelope
• Example signal: one note from the oboe
  – Amplitude is zero before the sound starts
  – The oboe has continuous excitation, so the sound's amplitude envelope remains nearly constant throughout its duration

Time domain signal (3)

• Zoom-in of the same oboe signal at time t = 0.45 s
• A 90 ms frame illustrates the periodic waveform
  – Many sounds are periodic, for example most musical instrument sounds and vowels in speech
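Periodicity can also be measured numerically. The sketch below estimates the period of a synthetic periodic frame from the peak of its autocorrelation; this is one common approach and not necessarily the method used later in the course. The 200 Hz test tone, the frame length and the lag search range are assumptions made for the example.

```python
import math

def autocorrelation(x, max_lag):
    """r[lag] = sum of x[n] * x[n + lag], using the same number of terms for every lag."""
    n_terms = len(x) - max_lag
    return [sum(x[n] * x[n + lag] for n in range(n_terms)) for lag in range(max_lag + 1)]

fs = 8000.0   # assumed sampling rate in Hz
f0 = 200.0    # assumed fundamental frequency: period of 40 samples
# Periodic test frame: fundamental plus a weaker second harmonic
x = [math.sin(2 * math.pi * f0 * n / fs) + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / fs)
     for n in range(700)]

max_lag = 60  # longest period searched (about 133 Hz)
min_lag = 20  # shortest period searched (400 Hz)
r = autocorrelation(x, max_lag)

# The lag with the highest autocorrelation is the period estimate in samples
best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: r[lag])
f0_est = fs / best_lag
print(best_lag, f0_est)
```

A noise-like signal has no such pronounced peak, which is one way to tell periodic and non-periodic sounds apart.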


Frequency domain representation – spectrum

• Obtained by computing the discrete Fourier transform (for example) of the time-domain signal, usually in a short frame
• Many perceptually important properties are more clearly visible in the frequency domain
• A decibel scale for amplitude is useful from the viewpoint of human hearing and the dynamics of natural sounds
  – Due to Fechner's law (subjective sensation is proportional to the logarithm of the stimulus intensity)
• Phases are perceptually less important and are often omitted
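A minimal sketch of a dB-magnitude spectrum follows. It evaluates the DFT directly from its definition, which is fine for a short frame (in practice an FFT routine would be used), and discards the phases. The frame length, the 1 kHz test tone and the `floor` parameter that avoids log(0) are assumptions for illustration. No window function is applied, which works cleanly here only because the tone falls exactly on a DFT bin.

```python
import cmath
import math

def dft(x):
    """Discrete Fourier transform computed directly from the definition (O(N^2))."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def magnitude_db(X, floor=1e-12):
    """Magnitudes on a decibel scale; phases are discarded."""
    return [20 * math.log10(max(abs(v), floor)) for v in X]

fs = 8000.0   # assumed sampling rate in Hz
N = 64        # assumed frame length: bin spacing fs / N = 125 Hz
x = [math.sin(2 * math.pi * 1000.0 * n / fs) for n in range(N)]  # 1 kHz falls on bin 8

X = dft(x)
mag_db = magnitude_db(X[:N // 2 + 1])   # keep the bins from 0 Hz up to fs / 2

peak_bin = max(range(len(mag_db)), key=lambda k: mag_db[k])
peak_hz = peak_bin * fs / N
print(peak_bin, peak_hz, mag_db[peak_bin])
```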

Consider log-frequency and dB-magnitude

• Linear scale
  – usually hard to "see" anything
• Log-frequency
  – each octave is approximately equally important perceptually
• Log-magnitude
  – the perceived change from 50 dB to 60 dB is about the same as from 60 dB to 70 dB
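The arithmetic behind the two logarithmic scales can be checked with a few lines. The function names are hypothetical, and the decibel values are treated as intensity levels (10 · log10 of an intensity ratio): equal dB steps then correspond to equal intensity ratios, and equal musical intervals correspond to equal frequency ratios.

```python
import math

def db_to_intensity_ratio(db):
    """Intensity ratio corresponding to a level in dB (10 * log10 convention)."""
    return 10 ** (db / 10)

def octaves_between(f_low_hz, f_high_hz):
    """Number of octaves (frequency doublings) between two frequencies."""
    return math.log2(f_high_hz / f_low_hz)

# 50 -> 60 dB and 60 -> 70 dB are both a tenfold increase in intensity
r1 = db_to_intensity_ratio(60) / db_to_intensity_ratio(50)
r2 = db_to_intensity_ratio(70) / db_to_intensity_ratio(60)

# The nominal range of hearing, 20 Hz to 20 kHz, spans just under ten octaves
n_oct = octaves_between(20.0, 20000.0)
print(r1, r2, n_oct)
```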

Time-frequency representation – spectrogram

• Shows sound intensity as a function of time and frequency
• Obtained by blocking the signal into short analysis frames and computing their spectra
• For audio, the frame size is typically 10–100 ms: sound spectra are often nearly stationary at that time scale
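The framing procedure can be sketched as below: the signal is cut into overlapping frames, each frame is windowed, and its magnitude spectrum is computed. The Hann window, the 20 ms frame, the 50 % overlap and the two-tone test signal are assumptions chosen for the example; the slides do not prescribe them.

```python
import cmath
import math

def frame_signal(x, frame_len, hop):
    """Cut x into frames of frame_len samples, advancing hop samples each time."""
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]

def hann(N):
    """Periodic Hann window of length N."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0 .. N/2, computed directly from the definition."""
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
            for k in range(N // 2 + 1)]

def spectrogram(x, frame_len, hop):
    """One magnitude spectrum per windowed analysis frame."""
    w = hann(frame_len)
    return [magnitude_spectrum([v * wn for v, wn in zip(f, w)])
            for f in frame_signal(x, frame_len, hop)]

fs = 8000.0       # assumed sampling rate in Hz
frame_len = 160   # 20 ms frames, bin spacing fs / frame_len = 50 Hz
hop = 80          # 50 % overlap

# Test signal: 0.1 s of 500 Hz followed by 0.1 s of 2000 Hz
x = ([math.sin(2 * math.pi * 500.0 * n / fs) for n in range(800)]
     + [math.sin(2 * math.pi * 2000.0 * n / fs) for n in range(800)])

S = spectrogram(x, frame_len, hop)
# Frequency of the strongest bin in each frame
peaks_hz = [max(range(len(col)), key=lambda k: col[k]) * fs / frame_len for col in S]
print(len(S), peaks_hz[0], peaks_hz[-1])
```

The spectrum of the whole signal would show both tones but not their order; the frame-wise peaks recover when each one is present.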

Example audio signals: guitar

• Sound decays gradually after the onset
• Instantaneous excitation: the string is plucked at the onset
• Periodic sound (vibrating string, covered in the Acoustics lecture)


Example audio signal: snare drum

• Instantaneous excitation, exponentially decaying amplitude envelope
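An exponentially decaying envelope becomes a straight line on a decibel scale, which is one reason dB plots are convenient for percussive sounds. The sketch below checks this numerically; the time constant `tau` and the sampling rate are assumed values, not measurements of the snare drum in the slides.

```python
import math

def exp_envelope(tau_s, fs_hz, n_samples):
    """Amplitude envelope a(t) = exp(-t / tau), sampled at rate fs."""
    return [math.exp(-(k / fs_hz) / tau_s) for k in range(n_samples)]

fs = 8000.0   # assumed sampling rate in Hz
tau = 0.05    # assumed decay time constant in seconds

env = exp_envelope(tau, fs, 4000)           # 0.5 s of envelope
env_db = [20 * math.log10(v) for v in env]  # the same envelope in dB

# On the dB scale the envelope drops by a constant amount per time constant:
# 20 * log10(e), about 8.69 dB
k_tau = round(tau * fs)                     # number of samples in one time constant
drop_first = env_db[0] - env_db[k_tau]
drop_later = env_db[5 * k_tau] - env_db[6 * k_tau]
print(drop_first, drop_later)
```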

Example audio signals: snare drum (2)

• Zoom-in of the snare drum waveform
• The signal also contains non-periodic components

Example audio signals: snare drum (3)

• The spectrum is noise-like too: its structure is not as clear as that of the oboe's spectrum

Example audio signals: snare drum (4)

• Spectrogram


Polyphonic music (1)

• Polyphonic music consists of a mix of several sound sources (linear superposition)
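Linear superposition means the mixture is the sample-wise sum of the source signals, and because the DFT is linear, the (complex) spectrum of the mixture is the sum of the source spectra. A small sketch with two assumed sinusoidal "sources" standing in for instruments:

```python
import cmath
import math

def dft(x):
    """Discrete Fourier transform computed directly from the definition (O(N^2))."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

fs = 8000.0   # assumed sampling rate in Hz
N = 64        # assumed frame length: bin spacing 125 Hz

source_a = [math.sin(2 * math.pi * 500.0 * n / fs) for n in range(N)]          # bin 4
source_b = [0.5 * math.sin(2 * math.pi * 1500.0 * n / fs) for n in range(N)]   # bin 12
mix = [a + b for a, b in zip(source_a, source_b)]   # linear superposition

A, B, M = dft(source_a), dft(source_b), dft(mix)

# Spectrum of the mix equals the sum of the source spectra (up to rounding)
max_dev = max(abs(m - (a + b)) for m, a, b in zip(M, A, B))

# The two strongest bins of the mix are the bins of the two sources
mags = [abs(v) for v in M[:N // 2 + 1]]
top_two = sorted(sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2])
print(max_dev, top_two)
```

Note that this additivity holds for complex spectra; magnitude spectra add only approximately, when the sources do not overlap in frequency.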

Polyphonic music (2)

• The spectrogram reveals e.g. the rhythmic structure

Speech: time domain signal (1)

• One sentence ("He knew what taboos he was violating.")
• Speech can be viewed as a sequence of phonemes

Speech: time domain (2)

• Zooming in to different phonemes
  – Left: vowel "e" in "He" (voiced: periodic)
  – Right: "t" in "taboos" (unvoiced: "noisy")


Speech spectrogram

• Each phoneme has its characteristic spectral shape
• Transitions between phonemes are continuous rather than step-like

"NERDS MEET ARTISTS"

2015-2016 Joint Course Module of Signal Processing, School of Architecture and Civil Engineering

This course module invites students from signal processing, architecture and civil engineering.

GOAL: Help signal processing engineers understand the needs of urban design, and help architects and civil engineers understand the potential of modern ICT in the quantitative analysis of urban spaces. Camera and microphone systems provide automatic analysis for quantitative monitoring of urban spaces. The quantitative data is used to support the architectural and civil engineering design of future urban spaces.

COURSES (depends on your home department):
• ARK-53806 Sustainable Design Studio
• RAK-13106 Sustainable Development Studio
• SGN-81006 Signal Processing Innovation Project

PARTICIPATION: Enroll in one of the above courses and come to the Opening Session on August 25, 2015, 10:00-12:00, in RO104, where the overall description is given and the project groups are formed. The work will be supervised by researchers from the Department of Signal Processing, the School of Architecture and the Department of Civil Engineering.

FOR MORE INFORMATION:
• Harry Edelman (School of Architecture / Dept. of Civil Engineering)
• Joni Kämäräinen (Dept. of Signal Processing, video processing)
• Tuomas Virtanen (Dept. of Signal Processing, audio processing)


Invitation to Data Collection Campaign

• A project in the Department of Signal Processing needs speech data for research purposes.
• Your task is to read out simple English sentences from a script. Takes 25 minutes.
• Reward: a movie ticket.
• How to participate?
  – We need two persons per recording → come with a friend. If you are alone, we could try to pair you.
  – Sign up via [email protected]
  – The sessions take place on 24-28.8 during office hours, or at a different time upon agreement.

Participate in a study, get a movie ticket!