5 4 3 2 1 music analysis sinewave synthesis nonspeech lecture...
TRANSCRIPT
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 1
EE E6820: Speech & Audio Processing & Recognition
Lecture 7:Music analysis and synthesis
Music and nonspeech
Music synthesis techniques
Sinewave synthesis
Music analysis
Transcription
Dan Ellis <[email protected]>http://www.ee.columbia.edu/~dpwe/e6820/
1
2
3
4
5
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 2
Music & nonspeech
• What is ‘nonspeech’?
- according to research effort: a little music- in the world: most everything
attributes?
1
Originnatural man-made
Info
rmat
ion
cont
ent
low
high
wind &water
animalsounds
speech music
machines & enginescontact/
collision
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 3
Sound attributes
• Attributes suggest model parameters
• What do we notice about ‘general’ sound?
- psychophysics: pitch, loudness, ‘timbre’- bright/dull; sharp/soft; grating/soothing- sound is not ‘abstract’:
tendency is to describe by source-events
• Ecological perspective
- what matters about sound is ‘what happened’
→
our percepts express this more-or-less directly
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 4
Aside: Sound textures
• What do we hear in:
- a city street- a symphony orchestra
• How do we distinguish:
- waterfall- rainfall- applause- static
•
Levels
of ecological description...
time / s
freq
/ H
z
Applause04
0 1 2 3 40
1000
2000
3000
4000
5000
time / s
freq
/ H
z
Rain01
0 1 2 3 40
1000
2000
3000
4000
5000
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 5
Motivations for modeling
• Describe/classify
- cast sound into model because want to use the resulting parameters
• Store/transmit
- model implicitly exploits limited structure of signal
• Resynthesize/modify
- model separates out interesting parameters
Sound Model parameterspace
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 6
Analysis and synthesis
• Analysis is the converse of synthesis:
• Can exist apart:
- analysis for classification- synthesis of artificial sounds
• Often used together:
- encoding/decoding of compressed formats- resynthesis based on analyses-
analysis-by-synthesis
Sound
AnalysisSynthesis
Model / representation
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 7
Outline
Music and nonspeech
Music synthesis techniques
- Framework- Historical development
Sinewave synthesis
Music analysis
Transcription
elements?
1
2
3
4
5
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 8
Music synthesis techniques
• What is music?
- could be anything
→
flexible synthesis needed!
• Key elements of conventional music
- instruments
→
note-events (time, pitch, accent level)
→
melody, harmony, rhythm- patterns of repetition & variation
• Synthesis framework:
instruments: common framework for many notesscore: sequence of (time, pitch, level) note events
2
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 9
The nature of musical instrument notes
• Characterized by instrument (register), note, loudness (emphasis), articulation...
distinguish how?
Time
Fre
quen
cy
Piano
0
1000
2000
3000
4000
0 1 2 3 4 Time
Violin
0
1000
2000
3000
4000
0 1 2 3 4
Time
Fre
quen
cy
Clarinet
0
1000
2000
3000
4000
0 1 2 3 4 Time
Trumpet
0
1000
2000
3000
4000
0 1 2 3 4
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 10
Development of music synthesis
• Goals of music synthesis:
- generate realistic / pleasant new notes- control / explore timbre (quality)
• Earliest computer systems in 1960s(voice synthesis, algorithmic)
• Pure synthesis approaches:
- 1970s: Analog synths- 1980s: FM (Stanford/Yamaha)- 1990s: Physical modeling, hybrids
• Analysis-synthesis methods:
- sampling / wavetables- sinusoid modeling- harmonics + noise (+ transients)
others?
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 11
Analog synthesis
• The minimum to make an ‘interesting’ sound
• Elements:
- harmonics-rich oscillators- time-varying filters- time-varying envelope- modulation: low frequency + envelope-based
• Result:
- time-varying spectrum, independent pitch
Oscillator Filter
Envelope
PitchTrigger
Vibrato
t
t
f
+
Cutofffreq
GainSound
++
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 12
FM synthesis
• Fast freq. modulation
→
harmonic sidebands:
•
J
n
(
β
)
is a Bessel function:
→
Complex harmonic spectra by varying
β
ωct β ωmtsin+( )cos Jn β( ) ω0 nωm+( )cosn ∞–=
∞∑=
0 1 2 3 4 5 6 7 8 9-0.5
0
0.5
1J0
J1 J2 J3 J4Jn(β) ≈ 0 for β < n + 2
modulation index β
time / s
freq
/ H
z
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.80
1000
2000
3000
4000
whatuse?
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 13
Sampling synthesis
• Resynthesis from real notes
→
vary pitch, duration, level
• Pitch:
stretch (resample) waveform
• Duration: loop a ‘sustain’ section
• Level: cross-fade different examples
- need to ‘line up’ source samples
-0.2
-0.1
0
0.1
0.2
0 0.1 0.2 time
0 0.002 0.004 0.006 0.008 time / s-0.2
-0.1
0
0.1
0.2
0 0.002 0.004 0.006 0.008 time / s
596 Hz 894 Hz
-0.2
-0.1
0
0.1
0.2
-0.2
-0.1
0
0.1
0.2
-0.2
-0.1
0
0.1
0.2
0 0.1 0.2 0.3 0 0.1 0.2 0.3time / s time / s
0.204 0.2060.174 0.176
-0.2
-0.1
0
0.1
0.2
-0.2
-0.1
0
0.1
0.2
0 0.05 0.1 0.15 0 0.05 0.1 0.15time / s time / s
Soft Loud
veloc
mix
good& bad?
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 14
Outline
Music and nonspeech
Music synthesis techniques
Sinewave synthesis (detail)- Sinewave modeling- Sines + residual ...
Music analysis
Transcription
1
2
3
4
5
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 15
Sinewave synthesis
• If patterns of harmonics are what matter,why not generate them all explicitly:
- particularly powerful model for pitched signals
• Analysis (as with speech):- find peaks in STFT |S[ω,n]| & track
- or track fundamental ω0 (harmonics / autoco)
& sample STFT at k·ω0
→set of Ak[n] to duplicate tone:
• Synthesis via bank of oscillators
3
s n[ ] Ak n[ ] n k ω⋅ 0⋅( )cosk∑=
0 0.05 0.1 0.15 0.2 time / s time / s0
2000
4000
6000
8000
freq
/ H
z
freq / Hz
mag
00.1
0.2
05000
0
1
2
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 16
Steps to sinewave modeling - 1
• The underlying STFT:
What value for N (FFT length & window size )?What value for H (hop size: n0 = r·H, r = 0, 1, 2...)?
• STFT window length determines freq. resol’n:
• Choose N long enough to resolve harmonics→ 2-3x longest (lowest) fundamental period- e.g. 30-60 ms = 480-960 samples @ 16 kHz- choose H ≤ N/2
• N too long → lost time resolution- limits sinusoid amplitude rate of change
X k n0,[ ] x n n0+[ ] w n[ ] j2πkn
N-------------
–exp⋅ ⋅n 0=
N 1–
∑=
Xw ejω( ) X e
jω( ) W ejω( )*=
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 17
Steps to sinewave modeling - 2
• Choose candidate sinusoids at each time by picking peaks in each STFT frame:
• Quadratic fit for peak, lin. interp. for phase:
+ linear interp. of unwrapped phase
time / s
freq
/ H
zle
vel /
dB
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.180
2000
4000
6000
8000
0 1000 2000 3000 4000 5000 6000 7000 freq / Hz-60
-40
-20
0
20
400 600 800 freq / Hz
-20
-10
0
10
20
400 600 800 freq / Hz
-10
-5
0
leve
l / d
B
phas
e / r
ady
x
y = ax(x-b)
b/2
ab2/4
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 18
Steps to sinewave modeling - 3
• Which peaks to pick?Want ‘true’ sinusoids, not noise fluctuations- ‘prominence’ threshold above smoothed spec.
• Sinusoids exhibit stability...- of amplitude in time- of phase derivative in time→compare with adjacent time frames to test?
0 1000 2000 3000 4000 5000 6000 7000 freq / Hz-60
-40
-20
0
20
leve
l / d
B
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 19
Steps to sinewave modeling - 4
• ‘Grow’ tracks by appending newly-found peaks to existing tracks:
- ambiguous assignments possible
• Unclaimed new peak- ‘birth’ of new track- backtrack to find earliest trace?
• No continuation peak for existing track- ‘death’ of track- or: reduce peak threshold for hysteresis
timefr
eq
death
birth
existingtracks
newpeaks
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 20
Resynthesis of sinewave models
• After analysis, each track defines contours in frequency, amplitude fk[n], Ak[n] (+ phase?)
- use to drive a sinewave oscillators & sum up
• ‘Regularize’ to exactly harmonic fk[n] = k·f0[n]
0 0.05 0.1 0.15 0.2500
600
7000
1
2
3
0 0.05 0.1 0.15 0.2time / s
time / sfreq
/ Hz
leve
l
-3
-2
-1
0
1
2
3
Ak[n]·cos(2πfk[n]·t)
fk[n]
Ak[n]
n
0 0.05 0.1 0.15 0.20
2000
4000
6000
0 0.05 0.1 0.15 0.2550
600
650
700
time / s time / s
freq
/ H
z
freq
/ H
z
whatto do?
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 21
Modification in sinewave resynthesis
• Change duration by warping timebase- may want to keep onset unwarped
• Change pitch by scaling frequencies- either stretching or resampling envelope
• Change timbre by interpolating params
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.50
1000
2000
3000
4000
5000
time / s
freq
/ H
z
0 1000 2000 3000 40000
10
20
30
40
freq / Hz
leve
l / d
B
0 1000 2000 3000 40000
10
20
30
40
freq / Hz
leve
l / d
B
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 22
Sinusoids + residual
• Only ‘prominent peaks’ became tracks- remainder of spectral energy was noisy?→ model residual energy with noise!
• How to obtain ‘non-harmonic’ spectrum?- zero-out spectrum near extracted peaks?- or: resynthesize (exactly) & subtract waveforms
.. must preserve phase!
• Can model residual signal with LPC→flexible representation of noisy residual
es n[ ] s n[ ] Ak n[ ] 2πn f k n[ ]⋅( )cosk∑–=
0 1000 2000 3000 4000 5000 6000 7000 freq / Hz
mag
/ dB
-80
-60
-40
-20
0
20
original
sinusoids
residualLPC
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 23
Sinusoids + noise + transients
• Sound represented as sinusoids and noise:
Parameters are {Ak[n], fk[n]}, hn[n]
• Separate out abrupt transients in residual?
- more specific → more flexible
s n[ ] Ak n[ ] 2πn f k n[ ]⋅( )cosk∑ hn n[ ] b n[ ]*+=
Sinusoids Residual es n[ ]
time / s
freq
/ H
z
0 0.2 0.4 0.60
2000
4000
6000
8000
2000
4000
6000
8000
0 0.2 0.4 0.60
2000
4000
6000
8000
{Ak[n], fk[n]}
hn[n]
es n[ ] tk n[ ]k∑ hn n[ ] b n[ ]*+=
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 24
Outline
Music and nonspeech
Music synthesis techniques
Sinewave synthesis
Music analysis- Instrument identification- Pitch tracking
Transcription
1
2
3
4
5
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 25
Music analysis
• What might we want to get out of music?
• Instrument identification- different levels of specificity- ‘registers’ within instruments
• Score recovery- transcribe the note sequence- extract the ‘performance’
• Ensemble performance- ‘gestalts’: chords, tone colors
• Broader timescales- phrasing & musical structure- artist / genre clustering and classification
4
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 26
Instrument identification
• Research looks for perceptual ‘timbre space’
• Cues to instrument identification- onset (rise time), sustain (brightness)
• Hierarchy of instrument families- strings / reeds / brass- optimize features at each level
bright
dull
low fluxhi flux
low attack
hi attack
procedure?
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 27
Pitch tracking• Fundamental frequency ( → pitch)
is a key attribute of musical sounds→pitch tracking as a key technology
• Pitch tracking for speech- voice pitch & spectrum highly dynamic- speech is voiced and unvoiced ground truth?
• Applications- voice coders (excitation description)- harmonic modeling
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 28
Pitch tracking for music
• Pitch in music- pitch is more stable (although vibrato)- but: multiple pitches
• Applications- harmonic modeling- music transcription (→ storage, resynthesis)- source separation
• Approaches: “place” & “time”
Time
Fre
quen
cy
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 50
1000
2000
3000
4000
??
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 29
Meddis & Hewitt pitch model
• Autocorrelation (time) based pitch extraction- fundamental period → peak(s) in autocorrelation
• Compute separately in each frequency band& ‘summarize’ across (perceptual) channels
x t( ) x t T+( )≈ r xx T( ) x t( )x t T+( )∫= max≈→
0 20 40 60 80 100-0.2-0.1
00.10.2
0 20 40 60 80 100-10123
time / samples lag / samples
Waveform x[n] Autocorrelation rxx[l]
80
328
866
1924
4000CF / Hz Autocorrelogram
2.5 5.0 7.5 10.0 12.50
1
lag / ms
Ban
dpas
sfil
ters
Rec
tific
atio
n &
low
-pas
s fil
ter
Periodicity detection
Cross-channel sum
sound
Summary ACG
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 30
Tolonen & Karjalainen simplification• Multiple frequency channels can have different
pitches dominant...
• But equalizing (flattening) the spectrum works:
→ Summary AC as a function of time:
- ‘Enhancement’ = cancel subharmonics
Pre-whiteningsound
Highpass@ 1kHz
Rectify &low-pass
Lowpass@ 1kHz
Periodicitydetection
SACFenhance
ESACFPeriodicitydetection
+
time/s
50
100
200
400
1000f/Hz Periodogram for M/F voice mix
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8Summary autocorrelation at t=0.775 s
0.001 0.002 0.003 0.004 0.006 0.01 0.02 lag/s
200 Hz(0.005s)
125 Hz(0.008s)
lag vs.freq?
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 31
Post-processing of pitch tracks
• Remove outliers with median filtering
• Octave errors are common:- if x(t) ≈ x(t + T ) then x(t) ≈ x(t + 2T ) etc.
→ dynamic programming/HMM
• Validity- “is there a pitch at this time?”- voiced/unvoiced decision for speech
• Event detection- when does a pitch slide indicate a new note?
time
5-pt median
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 32
Music and nonspeech
Music synthesis techniques
Sinewave synthesis
Music analysis
Transcription- Bottom-up and top-down- Transcription from sinewave models
1
2
3
4
5
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 33
Transcription
• Basic idea: Recover the score
• Is it possible? Why is it hard?- music students do it
... but they are highly trained; know the rules
• Motivations- for study: what was played?- highly compressed representation (e.g. MIDI)- the ultimate restoration system...
5
Time
Fre
quen
cy
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 50
1000
2000
3000
4000
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 34
Transcription framework
• Recover discrete events to explain signal
- analysis-by-synthesis?
• Exhaustive search?- would be possible given exact note waveforms - .. or just a 2-dimensional ‘note’ template?
but superposition is not linear in |STFT| space
• Inference depends on all detected notes- is this evidence ‘available’ or ‘used’?- full solution is exponentially complex
Note events{tk, pk, ik}
ObservationsX[k,n]synthesis
?
notetemplate
2-Dconvolution
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 35
Bottom-up versus top-down
• Bottom-up: observ’n directly gives description- e.g. peaks in 2-D convolution- but: few domains are that ‘linear’
• Top-down: pursue & confirm hypotheses- e.g. analysis-by-resynthesis matching- but: need to limit search space
• Generally, need to do both:
- bottom-up guides & limits search- top-down resolves ambiguities in low-levelhow to transcribe?
Abstractconstraints
guidance
tests
Rawobservations
Data-drivenanalyses
Hypothesissearch
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 36
Transcription from sinewave models
• Form sinusoid model- as with synthesis, but signal
is more complex
• Break tracks- need to detect new ‘onset’
at single frequencies
• Group by onset & common harmonicity- find sets of tracks that start
around the same time- + stable harmonic pattern
• Pass on to constraint-based filtering...
time / s
freq
/ H
z
0 1 2 3 40
500
1000
1500
2000
2500
3000
bu/td? mistakes?
0 0.5 1 1.5 time / s0
0.020.040.06
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 37
Problems for transcription
• Music is practically worst case!- note events are often synchronized
→ defeats common onset- notes have harmonic relations (2:3 etc.)
→ collision/interference between harmonics- variety of instruments, techniques, ...
• Listeners are very sensitive to certain errors- .. and impervious to others
• Apply further constraints- like our ‘music student’- maybe even the whole score (Scheirer)!
E6820 SAPR - Music A & S - Dan Ellis 2001-03-06 - 38
Summary
• ‘Nonspeech audio’- i.e. sound in general- characteristics: ecological
• Music synthesis- control of pitch, duration, loudness, articulation- evolution of techniques- sinusoids + noise + transients
• Music analysis- different aspects: instruments, pitches,
performance- transcription complications:
representation, octaves, onsets, ...- rely on high-level structural constraints
and beyond?