Enabling Access to Sound Archives through Integration, Enrichment and Retrieval WP4 – Sound Object Representation



TRANSCRIPT

Page 1: WP4 – Sound Object Representation

Enabling Access to Sound Archives through Integration, Enrichment and Retrieval

WP4 – Sound Object Representation

Page 2: WP4 – Sound Object Representation

12 Month Review Meeting

Project #033902

Introduction to Workpackage: Overview

Objectives:
- How to represent audio for the purposes of efficient querying.
- Segmentation of audio streams. Distinct objects may then be recognized using musical instrument identification and speaker identification techniques.
- Identification of higher-level features:
  Speech-related: gender, emotion, laughter and language
  Music-related: tempo, beat detection, rhythm…

Tasks:
- T4.1 Audio stream segmentation: speech/music separation…
- T4.2 Source separation: instrument identification, speaker identification
- T4.3 Sound object identification
- T4.5 Transcription: music transcription; high-level speech phonetics and characteristics

Page 3: WP4 – Sound Object Representation


Deliverables and Milestones

Deliverables:
- D4.1 Prototype segmentation, separation and speaker/instrument identification system (Month 14)
- D4.2 Prototype transcription system (Month 27)
- D4.3 Final report on sound object representations (Month 30)

Milestones and expected results:
- M4.1 – Month 6: Speech/music separation methods implemented and tested
- M4.2 – Month 10: Initial results on identification of sound objects; prototype segmenter and separator
- M4.3 – Month 18: Identification of speech characteristics from segmented, separated audio streams
- M4.4 – Month 24: Transcription of monophonic music from segmented, separated audio streams
- M4.5 – Month 28: Testing and evaluation of complete system

Page 4: WP4 – Sound Object Representation


Workpackage Progress – Speech Related

- Prototype for speaker segmentation is ready.
- Preliminary prototype for speaker identification (SID) is ready.
- Pre-processing module implemented for ED and SID: an energy-based voice activity detector.
- ED and laughter DLL is ready (NICE's API).
- Language identification (LID) algorithm evaluated on an English UK corpus, achieving over 85% accuracy; trained on a testbed representing at least 10 (European) languages.
- Ongoing research on speaker identification: outlier detection and exclusion, and how to deal with multiple speakers.
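The energy-based voice activity detector mentioned above can be illustrated with a minimal sketch. It is not the project's pre-processing module: the frame length, hop size and the relative threshold of 30 dB below the loudest frame are hypothetical values chosen for the example.

```python
import math

def frame_energies_db(samples, frame_len, hop):
    """Return the mean energy (in dB) of each analysis frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # A small floor avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(energy + 1e-12))
    return energies

def energy_vad(samples, frame_len=400, hop=160, threshold_db=-30.0):
    """Label each frame as speech (True) or non-speech (False).

    Hypothetical rule: a frame counts as speech when its energy lies within
    |threshold_db| dB of the loudest frame in the recording.
    """
    energies = frame_energies_db(samples, frame_len, hop)
    if not energies:
        return []
    peak = max(energies)
    return [e >= peak + threshold_db for e in energies]
```

At a 16 kHz sampling rate the defaults correspond to 25 ms frames with a 10 ms hop; measuring the threshold relative to the loudest frame keeps the decision independent of the overall recording level.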

Page 5: WP4 – Sound Object Representation


Contributions and Connections with Other Workpackages

This WP provides many inputs to other WPs and relies on few outputs from other WPs.

WP2: The sound objects extracted in WP4 populate the ontology devised in WP2.

WP3: Sound object recognition is used to enable enhanced retrieval:
- Retrieval of speakers
- Retrieval of key speech and music features

WP5: Sound objects are used both in archiving and as access tools:
- Source separation
- Audio enhancement

Page 6: WP4 – Sound Object Representation


Upcoming Work Plan Months 12-24 – Speech Related

Speaker identification:
- Retrieval of speakers (for use in WP3)
- Research on outlier detection and exclusion
- Research on new scoring methods
- How to deal with multiple targets in speaker identification?

ED, laughter and gender:
- VAMP API
- Ongoing research on robust methods

LID:
- Build a robust model for English UK and implement it

Page 7: WP4 – Sound Object Representation


Demonstration: Speaker Identification

Page 8: WP4 – Sound Object Representation


Demonstration: Speaker Segmentation

Page 9: WP4 – Sound Object Representation


Music Transcription

Reasonable detection accuracy has been achieved in:
- Onset detection
- Tempo detection
- Key detection
- Monophonic pitch detection

Unsolved or unexplored research areas:
- Ornamentation detection
- Time signature detection
- Segmentation: bar line detection, music structure detection
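Onset detection, listed above among the areas with reasonable accuracy, can be illustrated with a generic energy-based baseline: the half-wave rectified difference of frame log-energy, followed by simple peak picking. This sketch is not the ODCF onset detector used in this workpackage; the frame size, hop and threshold are assumptions made for the example.

```python
import math

def onset_detection_function(samples, frame_len=512, hop=256):
    """Half-wave rectified first difference of frame log-energy."""
    log_e = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        log_e.append(math.log(sum(x * x for x in frame) + 1e-10))
    # Only energy increases suggest a new note, so decreases are clipped to 0.
    return [0.0] + [max(0.0, b - a) for a, b in zip(log_e, log_e[1:])]

def pick_onsets(odf, hop, sample_rate, threshold=1.0):
    """Return onset times (s) at local maxima of the ODF above a threshold."""
    onsets = []
    for i in range(1, len(odf) - 1):
        if odf[i] > threshold and odf[i] >= odf[i - 1] and odf[i] > odf[i + 1]:
            # Report the start time of the frame where the energy jump occurs.
            onsets.append(i * hop / sample_rate)
    return onsets
```

The reported time is the start of the first frame that overlaps the new note, so its resolution is limited by the hop and frame size.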

Page 10: WP4 – Sound Object Representation


Music Transcription: Ornamentation detection

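[Figure: examples of the cut, strike and roll ornaments]

n | Onset | Offset | Segment | Pitch | SN Orn. | MN Orn.
1 | 6.235 | 6.42 | note | B5 | | Roll
2 | 6.42 | 6.467 | orn | C#6 | cut | Roll
3 | 6.467 | 6.606 | note | B5 | cut | Roll
4 | 6.606 | 6.653 | orn | A5 | strike | Roll
5 | 6.653 | 6.873 | note | B5 | strike | Roll
6 | 6.873 | 7.07 | note | D5 | |
7 | 7.07 | … | note | B5 | |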


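[Figure: block diagram of the ornamentation transcription system, with blocks Onset Detection System (ODCF), Offsets Cancellation, Audio Segmentation, Segment Pitch Detection and Ornamentation Transcription; signals: Audio Signal, Onset Candidates, Segments, Ornaments]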

Gainza, M. and Coyle, E. Automating Ornamentation Transcription. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '07), 2007.
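The last stage of the pipeline, turning pitched segments into ornament labels, can be sketched from the example on this slide. This is a hedged illustration, not the method of Gainza and Coyle: the duration threshold ORN_MAX_DUR is hypothetical, pitches are assumed to be MIDI note numbers, and the rules are inferred from the example rows (a short segment above the following note is labelled a cut, one below it a strike, and a roll is note, cut, note, strike, note on the same main pitch).

```python
from dataclasses import dataclass

ORN_MAX_DUR = 0.08  # hypothetical: segments shorter than this (s) count as ornaments

@dataclass
class Segment:
    onset: float   # segment start (s)
    offset: float  # segment end (s)
    pitch: int     # assumed MIDI note number, e.g. B5 = 83

def label_segments(segments):
    """Label segments 'note'/'orn' by duration and name single-note ornaments.

    An ornament above the following note is a cut, one below it a strike;
    as in the example, the label is also attached to that following note.
    """
    kinds = ["orn" if s.offset - s.onset < ORN_MAX_DUR else "note" for s in segments]
    single = [None] * len(segments)
    for i, kind in enumerate(kinds):
        if kind == "orn" and i + 1 < len(segments) and kinds[i + 1] == "note":
            name = "cut" if segments[i].pitch > segments[i + 1].pitch else "strike"
            single[i] = single[i + 1] = name
    return kinds, single

def find_rolls(segments, kinds, single):
    """Return start indices of note-cut-note-strike-note runs on one pitch (rolls)."""
    rolls = []
    for i in range(len(segments) - 4):
        if kinds[i:i + 5] != ["note", "orn", "note", "orn", "note"]:
            continue
        same_pitch = segments[i].pitch == segments[i + 2].pitch == segments[i + 4].pitch
        if same_pitch and single[i + 1] == "cut" and single[i + 3] == "strike":
            rolls.append(i)
    return rolls
```

Applied to the first six rows of the example (B5 = 83, C#6 = 85, A5 = 81, D5 = 74), these rules reproduce the Segment and SN Orn. columns and report a single roll starting at row 1.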

Page 11: WP4 – Sound Object Representation


Music Transcription: Time Signature Detection

Music is highly repetitive: chorus, phrases, bars…

The method utilises a multi-resolution audio similarity matrix to detect repetitive musical bars by building templates of time signature candidates

The method only depends on musical structure, and does not depend on the presence of percussive instruments or strong musical accents
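The idea that repeated bars appear as high-similarity diagonals can be illustrated with a simplified, single-resolution sketch. It assumes one feature vector per beat is already available (how those features are computed is left open), scores each candidate bar length k by the mean similarity on the k-th diagonal of the self-similarity matrix, and returns the shortest best-scoring k. The multi-resolution matrix and template construction of the published method are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(features):
    """Self-similarity matrix of per-beat feature vectors."""
    return [[cosine(a, b) for b in features] for a in features]

def bar_length_scores(ssm, candidates):
    """Mean similarity along the k-th diagonal for each candidate k (in beats).

    If the music repeats every k beats, beats k apart look alike, so the
    k-th diagonal of the matrix has a high average value.
    """
    n = len(ssm)
    scores = {}
    for k in candidates:
        diag = [ssm[i][i + k] for i in range(n - k)]
        scores[k] = sum(diag) / len(diag) if diag else 0.0
    return scores

def estimate_beats_per_bar(features, candidates=(2, 3, 4, 5, 6, 7)):
    scores = bar_length_scores(similarity_matrix(features), candidates)
    best = max(scores.values())
    # Multiples of the true bar length score equally well, so prefer the shortest.
    return min(k for k in candidates if scores[k] >= best - 1e-9)
```

The tie-break toward the shortest candidate matters: a piece in 3/4 repeats every 6 beats as well as every 3, so both diagonals score equally.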

Page 12: WP4 – Sound Object Representation


Music Transcription: Time Signature Detection

Gainza, M. and Coyle, E. Time Signature Detection by Using a Multi-Resolution Audio Similarity Matrix. In Audio Engineering Society 122nd Convention, Vienna, 2007.

Page 13: WP4 – Sound Object Representation


Music Transcription: Bar line Segmentation

Detects the musical bar length and the anacrusis using the audio similarity matrix (ASM).

Predicts and aligns the positions of future bars using an onset detector.

[Figure: block diagram with blocks ASM, Bar line prediction, Bar line alignment and Onset detector; signals: Song (input), Bar length, Anacrusis, [p1, p2, ..., pn], [b1, b2, ..., bn]]

Gainza, M., Barry, D. and Coyle, E. Automatic Bar Line Segmentation. In Audio Engineering Society 123rd Convention, New York, 2007.
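The prediction-and-alignment step can be sketched as follows, assuming the bar length and anacrusis have already been estimated (here in seconds, with the anacrusis taken as the time of the first bar line) and that a list of onset times is available from an onset detector. The snapping tolerance is a hypothetical parameter; this illustrates the idea rather than the published algorithm.

```python
def predict_and_align_bars(anacrusis, bar_length, onsets, end_time, tolerance=0.05):
    """Place bar lines by predicting one bar ahead and snapping to onsets.

    anacrusis:  time (s) of the first bar line, i.e. the end of the pick-up
    bar_length: estimated bar duration (s)
    onsets:     onset times (s) reported by an onset detector
    end_time:   stop predicting after this time (s)
    tolerance:  hypothetical maximum distance (s) for snapping to an onset
    """
    if bar_length <= 0:
        raise ValueError("bar_length must be positive")
    bars = []
    predicted = anacrusis
    while predicted <= end_time:
        nearest = min(onsets, key=lambda t: abs(t - predicted), default=None)
        if nearest is not None and abs(nearest - predicted) <= tolerance:
            aligned = nearest  # a detected onset is close: trust it
        else:
            aligned = predicted  # no onset nearby: keep the prediction
        bars.append(aligned)
        # Predict from the aligned position so gradual tempo drift is followed.
        predicted = aligned + bar_length
    return bars
```

Predicting from the aligned position rather than from a fixed grid lets small tempo changes accumulate into the bar positions instead of being discarded.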

Page 14: WP4 – Sound Object Representation


Music Transcription: Bar line Segmentation

[Figure: example annotated with the detected anacrusis and bar length]

Page 15: WP4 – Sound Object Representation


Music Transcription: Music Structure Segmentation

There are many mid-level representations: spectrogram, chromagram, MFCC…

Novel mid-level representation: the azimugram, a time-azimuth representation of a stereo field.

The system is based on the assumption that each section type (e.g. chorus) has a unique source location-intensity profile.

[Figure: block diagram with blocks ADDRESS, PCA, ICA and Orthogonality enforcement; signals: Song (input), Azimugram S_{A,T}, N basis functions B_{1,T}, Segments (output)]
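The assumption that each section type has its own location-intensity profile suggests a simple illustration: treat each time frame of the azimugram as a vector over azimuth, find the first principal component, and place section boundaries where the projection onto that component changes sign. This sketch assumes the azimugram has already been computed (the ADDRESS stage is not shown), keeps only the first component, and omits the ICA and orthogonality-enforcement stages listed on this slide, so it is a simplification rather than the system itself.

```python
import math

def first_principal_component(frames, iters=100):
    """Mean and leading covariance eigenvector of the frames (power iteration)."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(dim)]
    centred = [[f[j] - mean[j] for j in range(dim)] for f in frames]
    cov = [[sum(c[i] * c[j] for c in centred) / n for j in range(dim)]
           for i in range(dim)]
    # Start from the most deviant frame: it lies in the span of the data.
    vec = max(centred, key=lambda c: sum(x * x for x in c))
    for _ in range(iters):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no variance left to explain (e.g. identical frames)
            break
        vec = [x / norm for x in nxt]
    return mean, vec

def segment_by_pca(azimugram_frames):
    """Frame indices where the projection on the first PC changes sign."""
    mean, pc = first_principal_component(azimugram_frames)
    proj = [sum((f[j] - mean[j]) * pc[j] for j in range(len(pc)))
            for f in azimugram_frames]
    return [t for t in range(1, len(proj)) if (proj[t] > 0) != (proj[t - 1] > 0)]
```

A single component can only separate two kinds of profile; telling apart intro, verse and chorus would need more components, which is where the later stages on the slide come in.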

Page 16: WP4 – Sound Object Representation


Music Transcription: Music Structure Segmentation

Barry, D., Gainza, M. and Coyle, E. Music Structure Segmentation using the Azimugram in conjunction with Principal Component Analysis. In Audio Engineering Society 123rd Convention, New York, 2007.

[Figure: panels showing the audio signal, its azimugram and the resulting segmentation into sections labelled Intro, Verse and Chorus]

Page 17: WP4 – Sound Object Representation


Upcoming Work Plan Months 12-24

Assess the robustness of the ornamentation detector for a variety of instruments

Dynamically adapt time signature and bar line detection to tempo variations

Assess the best mid-level representation for music segmentation

Combine the music structure and bar line segmentation systems so that segment boundaries are aligned to bar lines

Incorporate knowledge of music structure (e.g. 8 bars per section…)

Migrate all MATLAB applications to C++

Page 18: WP4 – Sound Object Representation


ALL – Workpackage Progress

Silence-to-silence segmentation (ALL): start-stop segmentation
- Threshold algorithm: used by ALL, and sufficient for speech. Wave energy below the threshold value is treated as silence.
- Multi-threshold: different threshold values are used for different situations.
- Trained HMM: a manually segmented sample is used for training.

Usage: preparation phase for the manual segmentation of the training corpus.
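The threshold-based start-stop segmentation can be sketched as below. The slide does not say how the multi-threshold variant chooses its thresholds; this sketch assumes one possible reading, a pair of thresholds with hysteresis (a higher threshold opens a segment, a lower one closes it), plus a minimum silence length. All threshold values are hypothetical, and the input is assumed to be a per-frame energy sequence in dB.

```python
def start_stop_segments(energies_db, start_db=-35.0, stop_db=-45.0, min_silence=3):
    """Return (start, stop) frame-index pairs of non-silent regions.

    stop is exclusive. A segment opens when energy rises above start_db and
    closes once min_silence consecutive frames fall below stop_db
    (hypothetical hysteresis reading of "multi-threshold").
    """
    if stop_db > start_db:
        raise ValueError("stop_db must not exceed start_db")
    segments = []
    start = None
    quiet = 0  # length of the current run of frames below stop_db
    for i, e in enumerate(energies_db):
        if start is None:
            if e > start_db:
                start, quiet = i, 0
        elif e < stop_db:
            quiet += 1
            if quiet >= min_silence:
                # The segment ends where the silent run began.
                segments.append((start, i - quiet + 1))
                start = None
        else:
            quiet = 0
    if start is not None:
        # Close a segment still open at the end, dropping trailing silence.
        segments.append((start, len(energies_db) - quiet))
    return segments
```

Frames whose energy falls between the two thresholds keep the current state, so brief dips inside a word do not split it into separate segments.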

Page 19: WP4 – Sound Object Representation


ALL – Workpackage Progress

Speech/non-speech segmentation (ALL): trained HMM with Gaussian mixture distributions
- Trained for: speech, music, singing, whistle…
- Uses 26-dimensional MFCC feature vectors

Usage: speech/non-speech segmentation filters the input to speech recognition.
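Decoding with a trained HMM whose states emit Gaussian mixtures can be sketched with the Viterbi algorithm. This is an illustration under stated assumptions, not ALL's system: the GMMs use diagonal covariances, their parameters are assumed to be trained already, MFCC extraction is not shown, and the example uses two states (speech and non-speech) with 2-dimensional features instead of 26 to stay short.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def log_gmm(x, gmm):
    """gmm: list of (weight, mean, var) components; log-sum-exp over them."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def viterbi(frames, state_gmms, log_trans, log_init):
    """Most likely state sequence for a GMM-HMM.

    state_gmms: one GMM per state; log_trans[i][j] = log P(state j | state i).
    """
    n_states = len(state_gmms)
    score = [log_init[s] + log_gmm(frames[0], state_gmms[s]) for s in range(n_states)]
    back = []
    for x in frames[1:]:
        ptr, new = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            ptr.append(best_i)
            new.append(score[best_i] + log_trans[best_i][j] + log_gmm(x, state_gmms[j]))
        back.append(ptr)
        score = new
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

The self-transition probabilities act as a smoothing penalty: a state change is only accepted when the emission evidence outweighs the cost of switching, which suppresses isolated one-frame flips between speech and non-speech.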