Enabling Access to Sound Archives through Integration, Enrichment and Retrieval
WP4 – Sound Object Representation
12 Month Review Meeting
Project #033902
Introduction to Workpackage - Overview

Objectives:
  How to represent audio for the purposes of efficient querying.
  Segmentation of audio streams.
  Distinct objects may then be recognized using musical instrument identification and speaker identification techniques.
  Identification of higher level features
    Speech related - Gender, Emotion, Laughter and Language
    Music related - tempo, beat detection, rhythm…

Tasks:
  T 4.1 Audio stream segmentation - Speech/music separation…
  T 4.2 Source separation - Instrument Identification, Speaker Identification
  T 4.3 Sound object identification
  T4.5: Transcription
    Music transcription
    High level speech phonetics & characteristics
Deliverables and Milestones

Deliverables
  D4.1 Prototype segmentation, separation and speaker/instrument identification system (Month 14)
  D4.2 Prototype transcription system (Month 27)
  D4.3 Final report on sound object representations (Month 30)

Milestones and expected results
  M4.1 - Month 6: Speech/music separation methods implemented and tested
  M4.2 - Month 10: Initial results on identification of sound objects, prototype segmenter and separator
  M4.3 - Month 18: Identification of speech characteristics from segmented, separated audio streams
  M4.4 - Month 24: Transcription of monophonic music from segmented, separated audio streams
  M4.5 - Month 28: Testing and evaluation of complete system
Workpackage Progress - Speech Related
  Prototype for speaker segmentation is ready.
  Preliminary prototype for speaker identification (SID) is ready.
  Pre-processing module implemented for emotion detection (ED) and SID: energy-based Voice Activity Detector.
  ED, Laughter DLL is ready (NICE's API).
  Language identification (LID) algorithm evaluated on an English UK corpus, achieving over 85% accuracy.
    Trained on a testbed representing at least 10 (European) languages.
  Ongoing research on speaker identification (outlier detection and exclusion, how to deal with multiple speakers).
Contributions and Connections with Other Workpackages
This WP provides many inputs to other WPs and relies on few outputs from other WPs.
  WP2
    The sound objects extracted in WP4 populate the ontology devised in WP2
  WP3
    Sound object recognition used to enable enhanced retrieval
      Retrieval of speakers
      Retrieval of key speech and music features
  WP5
    Sound objects used both in archiving and as access tools
      Source separation
      Audio enhancement
Upcoming Work Plan Months 12-24 - Speech Related
  Speaker Identification
    Retrieval of speakers (for use in WP3)
    Research on outlier detection and exclusion
    Research on new scoring methods
    How to deal with multiple targets in speaker identification?
  ED, Laughter and Gender
    VAMP API
    Ongoing research on robust methods.
  LID
    Build robust model for English UK and implementation.
Demonstration: Speaker Identification
Demonstration: Speaker Segmentation
Music Transcription
  Reasonable detection accuracy in:
    Onset detection
    Tempo detection
    Key detection
    Monophonic pitch detection
  Unsolved or unexplored research areas:
    Ornamentation detection
    Time signature detection
    Segmentation:
      Bar line detection
      Music structure detection
Music Transcription: Ornamentation detection
[Figure: annotated example with ornament labels CUT, STRIKE and ROLL]

n  Onset  Offset  Segment  Pitch  SN Orn.  MN Orn.
1  6.235  6.42    note     B5     -        Roll
2  6.42   6.467   orn      C#6    cut      Roll
3  6.467  6.606   note     B5     cut      Roll
4  6.606  6.653   orn      A5     strike   Roll
5  6.653  6.873   note     B5     strike   Roll
6  6.873  7.07    note     D5     -        -
7  7.07   …       note     B5     -        -

[Block diagram, blocks and signals: Audio Signal, Onset Detection System (ODCF), Onset Candidates, Offsets Cancellation, Audio Segmentation, Segments, Segment Pitch Detection, Ornamentation Transcription, Ornaments]
Gainza, M. and E. Coyle. Automating Ornamentation Transcription. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '07)
Music Transcription: Time Signature Detection
Music is highly repetitive: chorus, phrases, bars…
The method utilises a multi-resolution audio similarity matrix to detect repetitive musical bars by building templates of time signature candidates
The method only depends on musical structure, and does not depend on the presence of percussive instruments or strong musical accents
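The idea of finding repetition in a similarity matrix can be illustrated with a much reduced sketch. It is not the multi-resolution, template-based method cited on this slide: it builds a single-resolution cosine similarity matrix over hypothetical per-beat feature vectors and picks the candidate lag whose diagonal is most self-similar. The function names, the toy one-hot features and the candidate lags are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(frames):
    """Square matrix whose entry [i][j] compares frame i with frame j."""
    return [[cosine(u, v) for v in frames] for u in frames]


def best_repetition_lag(frames, candidate_lags):
    """Return the candidate lag whose diagonal of the matrix has the highest mean."""
    asm = similarity_matrix(frames)
    n = len(frames)

    def diagonal_mean(lag):
        values = [asm[i][i + lag] for i in range(n - lag)]
        return sum(values) / len(values)

    return max(candidate_lags, key=diagonal_mean)


# Toy input: a three-beat pattern repeated four times (hypothetical features).
pattern = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frames = pattern * 4
lag = best_repetition_lag(frames, candidate_lags=[2, 3, 4])
print(lag)
```

In this toy case the pattern repeats every three beats, so the lag-3 diagonal is the only candidate with full similarity. Real audio would need features that are robust to variation between repeated bars.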
Gainza, M. and E. Coyle. Time Signature Detection by Using a Multi-Resolution Audio Similarity Matrix. In Audio Engineering Society 122nd Convention. 2007. Vienna.
Music Transcription: Bar line Segmentation
Detects the musical bar length and the anacrusis using the Audio Similarity Matrix (ASM)
Predicts and aligns the position of future bars by using an Onset Detector
[Block diagram, blocks and signals: Song, ASM, Bar length, Anacrusis, Onset detector, Bar line prediction, Bar line alignment, [p1, p2... pn], [b1, b2... bn]]
Gainza, M., D. Barry and E. Coyle. Automatic Bar Line Segmentation. In Audio Engineering Society 123rd Convention. 2007. New York.
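The predict-then-align step can be sketched as follows, assuming the bar length and anacrusis are already known and the onset list is non-empty. This is only an illustration of the general idea, not the published algorithm: predictions are evenly spaced, and each one is snapped to the nearest detected onset if that onset lies within a tolerance. The function names, the tolerance parameter and the example times are hypothetical.

```python
def predict_bar_lines(anacrusis, bar_length, duration):
    """Evenly spaced bar line times (seconds), the first one right after the anacrusis."""
    times = []
    k = 0
    while anacrusis + k * bar_length < duration:
        times.append(anacrusis + k * bar_length)
        k += 1
    return times


def align_to_onsets(predictions, onsets, tolerance):
    """Move each prediction to its nearest onset when that onset is within tolerance."""
    aligned = []
    for p in predictions:
        nearest = min(onsets, key=lambda o: abs(o - p))
        # Keep the raw prediction when no onset is close enough.
        aligned.append(nearest if abs(nearest - p) <= tolerance else p)
    return aligned


# Hypothetical values: 0.5 s anacrusis, 2 s bars, 7 s excerpt.
predicted = predict_bar_lines(anacrusis=0.5, bar_length=2.0, duration=7.0)
onsets = [0.48, 1.5, 2.56, 3.4, 4.9, 6.5]
bar_lines = align_to_onsets(predicted, onsets, tolerance=0.1)
print(predicted)
print(bar_lines)
```

Snapping to onsets lets the bar lines follow small timing deviations, while the tolerance keeps a prediction in place when the nearest onset is clearly unrelated.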
Music Transcription: Music Structure Segmentation
There are many mid-level representations: spectrogram, chromagram, MFCC…
Novel mid-level representation: the Azimugram, a time-azimuth representation of a stereo field
System based on the assumption that each section type (e.g. chorus) has a unique source location-intensity profile.
[Block diagram, blocks and signals: Song, ADDRESS, Azimugram (S A,T), PCA, N basis functions (B1,T), ICA, Orthogonality enforcement, Segments]
Barry, D., M. Gainza and E. Coyle. Music Structure Segmentation using the Azimugram in conjunction with Principal Component Analysis. In Audio Engineering Society 123rd Convention. 2007. New York.
[Figure: Audio Signal, Azimugram and resulting Segmentation, with sections labelled Intro, Verse, Chorus]
Upcoming Work Plan Months 12-24
Assess the robustness of the ornamentation detector for a variety of instruments
Dynamically adapt time signature and bar line detections to tempo variations
Assess the best mid-level representation for music segmentation
Combine the music structure and bar line segmentation systems, so that each segment is aligned to the bar lines
Incorporate knowledge of music structure (e.g.: 8 bars per section…)
Migrate all MATLAB applications to C++
ALL - Workpackage progress
Silence to silence segmentation – ALL
Start – stop segmentation
  Threshold algorithm – ALL use this, it is sufficient for speech
    wave energy under the threshold value is silence
  Multi-threshold
    there are different threshold values for different situations
  Trained HMM
    manually segmented sample for the training
Usage
  Preparation phase for the manual segmentation of the training corpus
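The single-threshold variant can be sketched as below. This is a minimal illustration rather than the partner's implementation: energy is taken as the mean squared amplitude of non-overlapping frames, and any frame below the threshold counts as silence. The frame length, threshold value and function names are assumptions.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def non_silent_segments(samples, frame_len, threshold):
    """Return (start, stop) sample indices of runs of frames at or above the threshold."""
    segments = []
    start = None
    for k, energy in enumerate(frame_energies(samples, frame_len)):
        if energy >= threshold and start is None:
            start = k * frame_len                      # a non-silent run begins
        elif energy < threshold and start is not None:
            segments.append((start, k * frame_len))    # run ends where silence begins
            start = None
    if start is not None:                              # signal ended inside a run
        n_frames = len(samples) // frame_len
        segments.append((start, n_frames * frame_len))
    return segments


# Toy signal: silence, a burst, silence, a louder burst (hypothetical values).
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8 + [0.8, -0.8] * 2
segments = non_silent_segments(signal, frame_len=4, threshold=0.01)
print(segments)
```

A multi-threshold version would choose the threshold per situation (for example per recording condition) before calling the same routine.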
ALL - Workpackage progress
Speech – non speech segmentation – ALL
  Trained HMM with Gaussian mixture distribution
    Trained for: Speech, Music, Singing, Whistle, …
    Using 26-dimensional MFCC feature vectors
Usage
  Speech – non-speech segmentation filters the input for the speech recognition
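How an HMM turns per-frame scores into segments can be sketched with a small Viterbi decoder. The sketch assumes the per-frame log-likelihoods are already available (in the system above they would come from Gaussian mixtures over MFCC vectors, which are not modelled here) and uses one shared "stay" probability for every class. The two-class setup, the probability value and the scores are hypothetical.

```python
import math


def viterbi(frame_loglik, stay_prob):
    """Most likely state path; frame_loglik[t][s] is log p(frame t | state s)."""
    n_states = len(frame_loglik[0])
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n_states - 1))
    score = list(frame_loglik[0])      # uniform initial prior (constant term dropped)
    back = []                          # back[t-1][s] = best predecessor of s at frame t
    for obs in frame_loglik[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(
                range(n_states),
                key=lambda p: score[p] + (log_stay if p == s else log_move),
            )
            trans = log_stay if best_prev == s else log_move
            new_score.append(score[best_prev] + trans + obs[s])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):    # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]


# State 0 = speech, state 1 = music (hypothetical log-likelihoods).
speech, glitch, music = (-1.0, -4.0), (-2.0, -1.5), (-4.0, -1.0)
frames = [speech] * 3 + [glitch] + [speech] * 2 + [music] * 4
path = viterbi(frames, stay_prob=0.9)
print(path)
```

The transition penalty is what makes this a segmenter rather than a frame-by-frame classifier: the single frame that weakly favours music is absorbed into the surrounding speech segment, while the sustained music run produces a class change.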