
Page 1: 3spandh.dcs.shef.ac.uk/projects/hoarse/project_only/hoarseY2.doc  · Web viewPeriod covered by the report: 1/9/2003 to 31/8/2004 ... differences between chest and head register in

Periodic Progress Report

Research Training Network

Hearing, Organisation and Recognition of Speech in Europe

HOARSE

Contract N°: HPRN-CT-2002-00276

Commencement date of contract: 1/9/2002

Duration of contract (months): 48

Period covered by the report: 1/9/2003 to 31/8/2004

Coordinator:

Professor Phil Green

Department of Computer Science, University of Sheffield

Regent Court, Portobello St.,

Sheffield S1 4DP, UK

Phone: +44 114 222 1828; Fax: +44 114 222 1810; e-mail: [email protected]

HOARSE Partners

1. The University of Sheffield [USFD] (coordinator)
2. Ruhr-Universität Bochum [RUB]
3. DaimlerChrysler AG [DCAG]
4. Helsinki University of Technology [HUT]
5. Institut Dalle Molle d'Intelligence Artificielle Perceptive [IDIAP]
6. Liverpool University [UNILIV]
7. University of Patras [PATRAS]


Part A. Research Results

A.1 Scientific Highlights

At least one young researcher is now in post at each HOARSE lab. Here are the highlights of their work so far.

At Sheffield, doctoral researcher JANA EGGINK continues her work on automatic instrument recognition in polyphonic music under Task 1.5, Auditory scene analysis in music. A system for the recognition of the solo instrument in accompanied sonatas and concertos has been developed. Compared with the previous system, which used a missing feature approach for instrument recognition in music with only a low number of concurrent tones, the focus has shifted from the background towards the foreground. Instead of identifying regions dominated by interfering sound sources, only the harmonic series belonging to the dominant instrument is identified and used for recognition. Test material is taken from commercially available classical music CDs, without placing any restrictions on the background accompaniment. The recognition accuracies achieved are comparable to those of systems developed to deal with monophonic music only. In an additional step, knowledge about the solo instrument is used to extract the F0s of the main melodic line played by this instrument. Combining different knowledge sources in a probabilistic framework led to a significant improvement in F0 estimation when compared to a baseline system using only bottom-up processing.
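The foreground-oriented strategy can be sketched as follows: given a candidate F0, collect the spectral energy near its integer multiples and keep the candidate whose harmonic series best explains the frame. This is a minimal illustration, not the actual Sheffield system; the function names and the simple peak-picking score are invented for the sketch.

```python
import numpy as np

def harmonic_salience(spectrum, freqs, f0, n_harmonics=10, tol=0.03):
    """Sum spectral energy found near integer multiples of a candidate F0.

    spectrum: magnitude spectrum of one analysis frame
    freqs:    centre frequency of each spectrum bin (Hz)
    The harmonic series of the dominant (solo) instrument is assumed to
    carry more energy than the accompaniment near its partials.
    """
    score = 0.0
    for h in range(1, n_harmonics + 1):
        target = h * f0
        # accept bins within +/- tol (relative) of the ideal partial frequency
        near = np.abs(freqs - target) <= tol * target
        if near.any():
            score += spectrum[near].max()
    return score

def pick_f0(spectrum, freqs, candidates):
    """Return the candidate F0 whose harmonic series best explains the frame."""
    scores = [harmonic_salience(spectrum, freqs, f0) for f0 in candidates]
    return candidates[int(np.argmax(scores))]
```

A candidate at half the true F0 only matches every other partial, so scoring the full harmonic series favours the correct fundamental.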

At Bochum, doctoral researcher JUHA MERIMAA has concentrated on HOARSE Tasks 2.1 (Researching the precedence effect), 2.2 (Reliability of auditory cues in multi-source scenarios), and 2.3 (Perceptual models of room reverberation with application to speech recognition). A novel auditory modelling mechanism predicting localisation under precedence-effect, multi-source, and reverberant conditions has been proposed. The model is currently being investigated further by gathering new experimental data on the precedence effect. The perception of room reverberation has also been investigated in a study of spatial impression. The first part of this work included developing the experimental methods, as well as finding and training suitable test subjects for the listening experiments. The ongoing work concentrates on the effect of conflicting binaural cues on perception, and on grouping the cues into those related to sound sources and those related to acoustical environments. Furthermore, a novel method for multi-channel loudspeaker reproduction of room reverberation has been developed in collaboration with HUT, leading to several joint papers.

At DCAG, doctoral researcher JULIEN BOURGEOIS is working on Task 4.2, Informing speech recognition. This year he concentrated on the comparison between linear blind source separation (BSS) methods and minimum-variance (beamforming) techniques for separating driver and co-driver speech in cars. We observed experimentally that BSS performs poorly when microphones are placed on the roof, as close as possible to the mouth of each speaker. We examined this theoretically and showed that when the input signal-to-interference ratio (SIR) at the microphone is above a certain threshold, BSS is not able to bring any crosstalk reduction, whereas beamforming still improves the SIR. Another limitation of BSS methods is their slower convergence on so-called non-causal mixtures, which arise for example when the two speakers are in the same half-plane defined by the microphone positions. As a consequence, it is not advantageous to incorporate spatial prior information (available in cars) as hard constraints on the separation filters. This finding is confirmed in other experimental settings. Therefore, classical beamforming methods are preferable whenever reasonable speaker activity detection can be achieved. In Task 5.1, Speech recognition evaluation in multi-speaker conditions, DCAG made additional recordings using the commercial S-Klasse mirror beamformer and close-talk microphones. In further multi-speaker recognition evaluations on these recordings, BSS methods achieved a smaller word error rate reduction than beamforming.
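The comparison rests on standard array-processing quantities. Below is a minimal sketch of a delay-and-sum beamformer and an SIR measurement; these are generic textbook versions with invented function names, not DCAG's implementation.

```python
import numpy as np

def delay_and_sum(mics, delays_samples):
    """Steer a microphone array towards a source by time-aligning each
    channel (advancing channel m by its arrival delay), then averaging.

    mics: (n_mics, n_samples) array; delays_samples: integer delay per mic.
    """
    n_mics, n = mics.shape
    out = np.zeros(n)
    for m in range(n_mics):
        d = delays_samples[m]
        out[: n - d] += mics[m, d:]   # advance channel m by d samples
    return out / n_mics

def sir_db(target, interference):
    """Signal-to-interference ratio in dB from separated signal paths."""
    return 10 * np.log10(np.sum(target**2) / np.sum(interference**2))
```

On a toy two-microphone mixture the aligned target adds coherently while the interferer averages incoherently, giving roughly 3 dB of SIR improvement for two microphones; BSS, by contrast, must estimate the unmixing filters blindly.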

At HUT Helsinki, doctoral researcher EVA BJÖRKNER is working on HOARSE Tasks 3.1, Glottal excitation estimation, and 3.2, Voice production studies. She has studied physiological differences between chest and head register in the female singing voice by inverse filtering the oral airflow recorded for a sequence of /pae/ syllables sung at constant pitch and decreasing vocal loudness in each register by seven female musical theatre singers. Ten equidistantly spaced subglottal pressure (Ps) values were selected and the relationships between Ps and several parameters were examined. The normalised amplitude quotient (NAQ) was used to measure glottal adduction. The development and evaluation of inverse filtering have been studied using physiological modelling of voice production as well as high-speed digital imaging of vocal fold vibration. The experiment thus combines several Ps values with NAQ as a measure of glottal adduction.
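The NAQ itself is a simple time-domain measure (Alku et al., 2002): the peak-to-peak amplitude of the glottal flow divided by the product of the negative peak of the flow derivative and the period length. A sketch of the computation on one flow cycle, assuming the flow has already been obtained by inverse filtering:

```python
import numpy as np

def naq(flow, fs, f0):
    """Normalised amplitude quotient of one glottal flow cycle.

    NAQ = f_ac / (d_peak * T), where f_ac is the peak-to-peak flow
    amplitude, d_peak the magnitude of the negative peak of the flow
    derivative, and T = 1/f0 the period length. Smaller NAQ indicates
    more adducted (pressed) phonation.
    """
    f_ac = flow.max() - flow.min()
    derivative = np.diff(flow) * fs          # approximate d(flow)/dt
    d_peak = -derivative.min()               # negative peak magnitude
    return f_ac / (d_peak * (1.0 / f0))
```

For a sinusoidal test pulse 0.5*(1 - cos(2*pi*f0*t)) the measure evaluates to 1/pi, which makes a convenient sanity check for an implementation.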

At IDIAP, researcher VIKTORIA MAIER studied contextual and temporal information in speech and its use in ASR. The relevant HOARSE Task is 4.2, Informing speech recognition.

The classic experiment of Liberman et al. (1952) was re-designed and perceptual tests were run on a group of 37 listeners. Results were broadly consistent with Liberman et al. (1952). The implications for HMM-based speech recognition systems were discussed.

The relationship between the number of emitting states in a model and phoneme duration has also been analysed.
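One concrete link between state count and duration: a left-to-right HMM without skip transitions must occupy each emitting state for at least one frame, which bounds the shortest phoneme it can represent. A toy illustration; the 10 ms frame shift is an assumed typical value, not taken from the report:

```python
def min_duration_ms(n_emitting_states, frame_shift_ms=10.0):
    """A left-to-right HMM with no skip transitions must spend at least
    one frame in each emitting state, so it cannot model a phoneme
    shorter than n_emitting_states * frame_shift."""
    return n_emitting_states * frame_shift_ms

def expected_duration_ms(n_states, self_loop_prob, frame_shift_ms=10.0):
    """With self-loop probability a, each state is occupied for a
    geometrically distributed number of frames with mean 1/(1-a), so the
    expected modelled duration is n_states * frame_shift / (1-a)."""
    return n_states * frame_shift_ms / (1.0 - self_loop_prob)
```

With the usual 3 emitting states and a 10 ms shift, no phoneme shorter than 30 ms can be matched, which is one reason the state count interacts with observed phoneme durations.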

Viktoria Maier is about to leave the HOARSE network and continue her doctoral work at Sheffield, under different funding.

Also at IDIAP, doctoral researcher PETR SVOJANOVSKY is extending the TRAP-TANDEM model proposed by IDIAP. The main effort is towards universal classifiers of frequency-localized patterns, extending (Hermansky and Jain, Eurospeech 2003). Recently, an interesting and apparently effective method has emerged from Svojanovsky's work: training a classifier on a particular frequency band and applying it at all other frequencies. The HOARSE task involved here is 4.3, Advanced ASR algorithms.

Svojanovsky was also involved in ASR experiments with nonsense syllables. This database in principle allows evaluation of an automatic recognizer independently of any language-level constraints. This work falls under HOARSE Task 5.1, Speech recognition evaluation in multi-speaker conditions.

Doctoral researcher GUILLAUME LATHOUD is working under Task 5.2 (Signal and speech detection in sound mixtures) on overlaps between speakers. Previously proposed microphone array-based speaker segmentation methods were extended into a generic short-term segmentation/tracking framework [Lathoud et al. 04] that successfully copes with an unknown number of speakers and unknown speaker locations. An audiovisual database called AV16.3 is now accessible online [Lathoud et al. 04], including a variety of multi-speaker cases, 3D location annotation and some speech/silence segmentation annotation. Recent work focused on sector-based multiple source detection and localization [Lathoud et al. 04].
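Microphone-pair localization of the kind such frameworks build on is commonly based on GCC-PHAT time-delay estimation. The sketch below is the generic textbook method, not necessarily the exact estimator of [Lathoud et al. 04]:

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the time delay of arrival of x1 relative to x2 using the
    GCC-PHAT (phase transform) cross-correlation, a standard building
    block of microphone-array speaker localization."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12          # PHAT: keep phase, whiten magnitude
    cc = np.fft.irfft(cross, n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    # reorder so that negative lags precede positive lags
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                        # delay in seconds
```

The estimated pairwise delays are then mapped to directions or sectors; restricting `max_tau` to the physically possible range for the array geometry keeps spurious correlation peaks out.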

At Liverpool, the work of doctoral researcher ELVIRA PEREZ has concentrated on Task 1.3, Active/passive speech perception. After a year on a Fulbright fellowship she has now returned to Liverpool. Two sets of experiments were conducted to test whether listeners actively predict the temporal or spectral nature of masking sounds. The experiments evaluated speech intelligibility in two contexts:

regularly spaced versus randomly spaced noise bursts (to test temporal prediction);

a predictable versus unpredictable frequency-modulated sinewave that could be integrated into the speech percept or heard as a separate sound.

Both experiments confirm that our ability to segregate signals from maskers does not exploit (or rely on) regularity of the masker. A paper on this work is in preparation.

Also at Liverpool, post-doctoral researcher PATTI ADANK has worked on Task 1.4, Envelope information and binaural processing. Adank concentrated on the use of voice characteristics to help segregation of simultaneous speakers. Previous work has shown that listeners are able to segregate spatially disparate signals much better when they are spoken by different speakers (Darwin and Hukin, 2000). We hypothesised that a two-stage process may first segregate the signals on F0 and then bind components together using cues such as speaker location or voice characteristics (cf. Darwin et al., 2003). Important voice characteristics are local amplitude modulation (flutter) and random F0 variation (jitter). We tested whether jitter can be used as a primary or secondary segregation cue, because previous modelling work (Ellis, 1993) has shown that jitter can be extracted by computational models and used for grouping. In a first experiment, synthetic vowel pairs were synthesised with a range of jitter and F0 values. We show that while F0 differences lead to improved recognition, manipulation of F0 jitter does not, and we therefore conclude that jitter is not a primary grouping cue. In a second set of experiments, listeners were presented with sentences synthesised with pitch and jitter differences, to test whether jitter might aid stream formation. Again, our results show that the introduction of jitter does not aid the segregation of sentences. This leaves the intriguing question of how speaker-specific information aids stream formation. A technical report on this work is available.
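The jitter manipulation amounts to perturbing the length of each glottal cycle. A minimal way to generate such a stimulus skeleton, a bare pulse train rather than the vowels and sentences used in the actual experiments, might be:

```python
import numpy as np

def jittered_pulse_train(f0, jitter_pct, dur, fs, seed=0):
    """Glottal-style pulse train whose period is perturbed cycle by cycle.

    jitter_pct: random period perturbation (percent of the nominal
    period), i.e. the kind of cycle-to-cycle F0 irregularity manipulated
    in the experiments. With jitter_pct=0 the train is perfectly periodic.
    """
    rng = np.random.default_rng(seed)
    t0 = fs / f0                              # nominal period in samples
    out = np.zeros(int(dur * fs))
    pos = 0.0
    while pos < len(out):
        out[int(pos)] = 1.0
        period = t0 * (1.0 + (jitter_pct / 100.0) * rng.uniform(-1, 1))
        pos += period
    return out
```

Passing such a train through a vowel-shaped filter yields a stimulus whose F0 contour carries controlled jitter, independently of its mean pitch.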

Other work at Liverpool has addressed Task 2.2, Reliability of auditory cues in multi-source scenarios. A key question for systems that must integrate multiple cues is how to combine and weight the available cues. At Liverpool, a range of experiments examining combination rules for low-level auditory and visual motion signals was conducted. Three models for cue integration were formalised: independent decisions, probability summation (i.e. combination of independent local decisions) and linear summation (i.e. direct integration of the signals before decisions are made). Results show that human observers use probability summation for signals that are not ecologically plausible and linear summation for signals that are ecologically plausible. The work was presented at ICA2004, Kyoto. A paper on this topic has been accepted for publication.
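The latter two models have simple closed forms for two cues. For hypothetical detection probabilities p1, p2 and sensitivities d1, d2 (not values from the study), probability summation combines independent local decisions, while linear summation pools the signals before the decision, assuming equal weights and uncorrelated channel noise:

```python
import math

def probability_summation(p1, p2):
    """Detect if either independent local detector fires:
    P = 1 - (1 - p1)(1 - p2)."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)

def linear_summation_dprime(d1, d2):
    """Combine the signals before the decision: the means add linearly
    while the noise standard deviations add in quadrature, giving a
    combined sensitivity of (d1 + d2) / sqrt(2) for two equally
    weighted, uncorrelated channels."""
    return (d1 + d2) / math.sqrt(2.0)
```

For matched cues, linear summation predicts a sqrt(2) sensitivity gain, a larger benefit than probability summation, which is what makes the two rules experimentally distinguishable.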

Liverpool are collaborating with Bochum and Sheffield in Task 4.1, Informing speech recognition. Liverpool carried out initial studies aiming to use linear prediction of the energy in 32 channels of an auditory filterbank to predict noise spectra from past data. The results, based on the AURORA noises, show that short-term prediction should lead to much better noise estimates than measures such as the long-term average. The gains are larger for non-stationary noises than for stationary noises, since for the latter the long-term average is already a good predictor. The current aim is to record a database of typical environmental noises in order to evaluate the system on a reasonable sample of sounds. With help from Bochum, Liverpool built a set of in-ear microphones that can be used with a DAT recorder to record binaural environmental sounds, and the two labs are now collaborating with Sheffield to make the recordings.
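The short-term prediction idea can be sketched per filterbank channel: fit a linear predictor of the next frame's energy from the last few frames by least squares over past data. This is an illustrative reconstruction with invented function names, not Liverpool's code:

```python
import numpy as np

def fit_predictor(history, order=3):
    """Least-squares linear predictor of the next energy value from the
    previous `order` frames (one such predictor per filterbank channel).

    history: 1-D array of past per-channel energies.
    """
    rows = [history[i:i + order] for i in range(len(history) - order)]
    targets = history[order:]
    coeffs, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(targets),
                                 rcond=None)
    return coeffs

def predict_next(history, coeffs):
    """Apply the fitted predictor to the most recent frames."""
    order = len(coeffs)
    return float(np.dot(history[-order:], coeffs))
```

For a slowly varying noise channel such a predictor tracks the trajectory frame by frame, whereas the long-term average lags behind any change, which is consistent with the larger gains reported for non-stationary noises.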

The team at Patras is engaged on several HOARSE tasks. The post-doctoral researcher involved is JOHN WORLEY (previously at Bochum).

Task 2.3, Perceptual models of room reverberation with application to speech recognition: Work has been performed on the use of smoothed room response measurements. The tests have illustrated some novel aspects of response measurements when employed for real-time room acoustics compensation, and also the robustness of the method based on smoothed room responses. This work forms the starting point for the further tests described under Task 2.4.

Task 2.4, Speech enhancement for reverberant environments: John Worley has designed an experiment that tests the spatial quality and sound efficacy of a complex smoothed room response filter. The initial stage of the experiment has been completed; it involved building two graphical user interfaces to obtain subjective data on various aspects of spatial quality (source width, envelopment, and room size) and sound quality (phase clarity, spectral balance, loudness, and overall sound quality). The testing will reveal the factors that listeners consider important when assessing the reverberation characteristics of a room. Some work is also in progress on the use of beamforming arrays for speech enhancement and ASR tasks.


Pursuing Task 2.1, Researching the precedence effect, Worley travelled to Bochum to test subjects on the Franssen illusion in different sized rooms and with a range of onset transitions. He completed three experiments in Bochum, which show that the traditional illusion breaks down when performed in a large hall. The preliminary conclusion of this work is that for the precedence effect to operate, the secondary signal in the Franssen illusion must not become active until the listener has received the reflections within the room. Then, congruent with the 'plausibility hypothesis', the secondary signal is perceived as a reflection and the illusion operates.

A.2 Joint Publications and Patents

Publications

IDIAP and USFD

Andrew C. Morris, Viktoria Maier and Phil Green, "From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition", in Proc. International Conference on Spoken Language Processing (ICSLP), Jeju Island, Korea, 2004.

HUT and USFD

Palomäki, K., Brown, G., and Barker, J., "Techniques for handling convolutional distortion with 'missing data' automatic speech recognition", Speech Communication, vol. 43, no. 1-2, pp. 123-142, 2004.

Palomäki, K., Brown, G., and Wang, D., "A binaural processor for missing data speech recognition in the presence of noise and small-room reverberation", Speech Communication, 2004. In press.

HUT and Bochum

Merimaa, J. and Pulkki, V., "Perceptually-based processing of directional room responses for multichannel loudspeaker reproduction", Proc. IEEE WASPAA, New Paltz, NY, USA, 2003, pp. 51-54.

Pulkki, V., Merimaa, J. and Lokki, T., "Multi-channel reproduction of measured room responses", 18th International Congress on Acoustics, Kyoto, Japan, 2004, pp. II 1273-1276.

Pulkki, V., Merimaa, J. and Lokki, T., "Reproduction of reverberation with Spatial Impulse Response Rendering", AES 116th Convention, Berlin, Germany, 2004, Preprint 6057.

Merimaa, J. and Pulkki, V., "Spatial Impulse Response Rendering", 7th International Conference on Digital Audio Effects (DAFx'04), Naples, Italy, 2004. Invited paper.

Patents

Pulkki, V., Merimaa, J. and Lokki, T., "A method for reproducing natural or modified spatial impression in multichannel listening", international patent application, filed March 2004.

Part B - Comparison with the Joint Programme of Work (Annex I of the contract)

B.1 Research Objectives

The research objectives, as set down in Annex I of the contract, are still relevant and achievable. There are inevitable shifts in perspective and emphasis, to reflect scientific progress and the expertise and interests of the young researchers we have recruited.

B.2 Research Method

There were no additions to our methodological toolkit during the reporting period.

B.3 Work Plan

B3.1 Breakdown of tasks

We have made no changes to the task structure since year 1, though we recognise that the following table is looking somewhat dated.

B3.2 Schedule and milestones: Table 1

Note that here we are reporting on the work of the HOARSE teams, rather than the work of the young researchers alone.

Task 1.1 Neural Oscillators for Auditory Scene Analysis (USFD)
12-month milestone: Multiple F0s using harmonic cancellation. Initial implementation of binaural grouping.
24-month milestone: F0 tracking using continuity constraints.
Comments: Multiple F0 work published [Wu, Wang & Brown 03]. Work on auditory selective attention published in IEEE Transactions on Neural Networks.

Task 1.2 Modelling grouping integration by multisource decoding (USFD)
12-month milestone: Incorporation of noise estimation into oscillator-based grouping.
24-month milestone: Mask-level integration.
Comments: Multisource decoding theory journal article published in Speech Communication.

Task 1.3 Active/passive speech perception (Liverpool)
12-month milestone: Planning experiments.
24-month milestone: Experiments conducted.
Comments: Experiments conducted.

Task 1.4 Envelope information and binaural processing (Liverpool)
12-month milestone: Preliminary experiments.
24-month milestone: Experiments and analysis.
Comments: Experiments conducted.

Task 1.5 Auditory Scene Analysis in Music (USFD)
12-month milestone: F0 estimation.
24-month milestone: Development of a two-stage (lower and cognitive) precedence effect model.
Comments: Second system completed.

Task 2.1 Researching the precedence effect (RUB)
12-month milestone: Psychoacoustic experiments on the precedence effect in realistic scenarios.
24-month milestone: Development of a localisation model using an automatic weighting function for binaural cues.
Comments: Model completed [Faller & Merimaa 04]. Further psychoacoustical experiments being conducted. Some work at Patras, in conjunction with Bochum, on the relationship of the precedence effect and the Franssen illusion.

Task 2.2 Reliability of auditory cues in multi-source scenarios (RUB)
12-month milestone: The importance of single binaural cues in various multisource environments determined in psychoacoustic experiments.
24-month milestone: Extension to multiple sources and practical room conditions.
Comments: Completed [Braasch 03], [Braasch et al 03], [Braasch & Blauert 03]. Research at RUB extended to spatial impression and to separating binaural cues into source-related and room-related.

Task 2.3 Perceptual models of room reverberation with application to speech recognition (Patras)
12-month milestone: Integrated response/signal perceptual model for a single source in reverberant environments.
24-month milestone: Extension for multiple sources.
Comments: Significant part of the work completed.

Task 2.4 Speech enhancement for reverberant environments (Patras)
12-month milestone: Research into auto-directive arrays, controlled from the perceptual directivity module.
24-month milestone: Development of new parameterisation techniques for the voice source.
Comments: Some work completed (test interfaces ready), to be supplemented by subjective tests. Missing data techniques for handling reverb developed at Sheffield.

Task 3.1 Glottal excitation estimation (HUT)
12-month milestone: Research on combining new AR (autoregressive) models with inverse filtering.
24-month milestone: Inverse filtering experiments on intensity regulation of speech with soft and extremely loud voices.
Comments: On schedule.

Task 3.2 Voice production studies (HUT)
12-month milestone: Inverse filtering experiments on high-pitched voices.
24-month milestone: Research on the relationship between the main effects of the glottal flow (fundamental frequency, phonation type etc.) and brain functions using MEG.
Comments: On schedule.

Task 3.3 Voice production and cortical speech processing (HUT)
12-month milestone: Development of DSP algorithms for parameterisation of the voice source; getting familiar with MEG.
Comments: Ongoing.

Task 4.1 Developments in MultiSource Decoding (USFD)
12-month milestone: Probabilistic decoding constraints.
24-month milestone: Design of predictive noise estimation algorithms. Known BSS algorithms adopted as a common base for evaluation.
Comments: Probabilistic decoding implemented in current software. Adaptive noise estimation implemented in multisource models.

Task 4.2 Informing Speech Recognition (Liverpool)
12-month milestone: Design of predictive noise estimation algorithms. Known BSS algorithms adopted as a common base for evaluation.
24-month milestone: HMM2 & DBM adaptation.
Comments: Work at DCAG and IDIAP.

Task 4.3 Advanced ASR Algorithms (IDIAP)
12-month milestone: Multistream adaptation. Assessment report 1.
24-month milestone: Targets for assessment report 2.
Comments: Work reported on this task in Eurospeech 03 and IEEE ASRU 03.

Task 5.1 Speech recognition evaluation in multi-speaker conditions (DCAG)
12-month milestone: Database specification.
24-month milestone: Targets for assessment report 1.
Comments: First recognition test in multi-speaker environment using separation algorithms (BSS and beamforming).

Task 5.2 Signal and speech detection in sound mixtures (IDIAP)
12-month milestone: Analysis of auditory cues.
24-month milestone: ASR performance for simulated deteriorated speech tested.
Comments: Work reported in Ajmera et al 2003 and Lathoud et al 2003.

Task 5.3 Speech technology assessment by simulated acoustic environments (RUB)
12-month milestone: Simulation environment for hands-free communication developed.
Comments: Completed and integrated into the IKA telephone line simulation tool. ASR, speaker recognition, and speech synthesis assessment experiments carried out.

B3.3 Research effort in the reporting period: Table 2

Participant | Young researchers financed by the contract (person-months) | Researchers financed from other sources (person-months) | Researchers contributing to the project (individuals)

1. USFD | 12 | 48 | 1 YR + 5 others = 6
2. RUB | 15.5 | 30 | 2 YRs + 4 others = 6
3. DCAG | 12 | 24 | 1 YR + 2 others = 3
4. HUT | 12 | 12 | 1 YR + 2 others = 3
5. IDIAP | 23.5 | 24 | 3 YRs + 3 others = 6
6. LIVERPOOL | 12 | 24 | 2 YRs + 2 others = 4
7. PATRAS | 7 | 6 | 1 YR + 1 other = 2
Totals | 94 | 168 | 11 YRs + 19 others = 30


B.4 Organisation and Management

B4.1 Organisation and management

HOARSE is being managed in the way described in Annex 1 of the contract. The non-executive director is Dr. Jordan Cohen of VoiceSignal Inc., Boston, MA. Administration is being handled from USFD by Gillian Callaghan ([email protected]).

B4.2 Communication Strategy

Most communication within HOARSE is conducted electronically. The HOARSE web site is www.hoarsenet.org. Meeting records and so on are on password-protected pages on that site. The email address for the whole network is [email protected].

B4.3 Network Meetings

Our pattern is to hold a HOARSE workshop every 6 months. Most of the time is taken up by research updates: all young researchers make a presentation and we also have update talks from academics where appropriate. There is much discussion. The meeting begins with a report from the coordinator and ends with a session planning activities for the next 6 months. Prior to this session, the non-executive director has an opportunity to provide feedback on the progress of the network. The workshops are scheduled for 2 days and the steering committee meets at some point in this period. In the reporting period the following workshops were held:

3rd Workshop, hosted by IDIAP, 5-6 September 2003
4th Workshop, hosted by USFD, 20-21 February 2004

The non-executive director was present at both workshops. He is treated as an external expert for funding purposes. Dr. Stefan Launer of Phonak (a Swiss-based hearing aid company) attended the 3rd workshop as an external expert.

B4.4 Networking

HOARSE policy is that each young researcher should spend at least a week with each network partner.

Visits in the reporting period by members of the teams were as follows:

Viktoria Maier from IDIAP to Sheffield, February 04
Eva Björkner from HUT to Sheffield, March 04
John Worley from Patras to Bochum
Juha Merimaa from Bochum to Patras, May 04
Juha Merimaa from RUB to HUT, December 2003

The following visits are planned in the 3rd year:

Jana Eggink from Sheffield to HUT


Julien Bourgeois from DCAG to IDIAP
Guillaume Lathoud from IDIAP to DCAG

B.5 Training

B.5.1 Publicising Positions

HOARSE opportunities have been publicised by means of:

email lists such as those maintained by ELSNET (European Language and Speech Network), ISCA (International Speech Communication Association) and SALT (UK Speech and Language Technology);
the IHP network vacancies site;
the HOARSE web site.

Though we have not been overwhelmed with applications, there has been a steady stream of high-quality ones. We are not recruiting at the moment, though there may be some further opportunities later.

B5.2 Recruitment Progress: Table 3

Recruitment has gone well: all partners have YRs in place

Contract deliverable of Young Researchers to be financed by the contract (person-months): Pre-doc (a), Post-doc (b), Total (a+b). Young Researchers financed by the contract so far (person-months): Pre-doc (c), Post-doc (d), Total (c+d).

Participant | (a) | (b) | (a+b) | (c) | (d) | (c+d)
1. USFD | 18 | 18 | 36 | 24 | 0 | 24
2. RUB | 18 | 18 | 36 | 18 | 9 | 27
3. DCAG | 18 | 18 | 36 | 24 | 0 | 24
4. HUT | 18 | 18 | 36 | 18 | 0 | 18
5. IDIAP | 18 | 18 | 36 | 32 | 0 | 32
6. LIVERPOOL | 18 | 18 | 36 | 8 | 11 | 19
7. PATRAS | 18 | 18 | 36 | 0 | 7 | 7

B5.3 Integration

We feel we have created an informal atmosphere in which young researchers can readily integrate with more experienced researchers and with their peers. It is difficult to be precise about how we have done this, but the quality of the interactions at workshops and the value of the discussions have been high. A young post-doctoral researcher comments:

"Having a mix of Ph.D. and post-doctoral researchers, in addition to the more senior members of the group, gives a synergistic effect, since the Ph.D. students can learn from the direct contact with recent post-docs and the post-docs gain experience of advisement and discussion in an informal environment."

Our policy to encourage integration into the network is for each young researcher to have a supervisor in the host lab and an advisor in a different lab, usually but not necessarily in the network. These arrangements are as follows:

Table 4: Supervisors and Advisors

Young Researcher | Supervisor | Advisor
Jana Eggink | Guy Brown, USFD | Georg Meyer, Liverpool
Juha Merimaa | Jens Blauert, RUB | Matti Karjalainen, HUT
John Worley | Jon Mourjopoulos, Patras | Jens Blauert, Bochum
Julien Bourgeois | Udo Haiber, DCAG | Ian McCowan, IDIAP
Eva Björkner | Paavo Alku, HUT | Johan Sundberg, KTH, Sweden
Guillaume Lathoud | Herve Bourlard, IDIAP | Klaus Linhard, DCAG
Elvira Perez | Georg Meyer, Liverpool | Martin Cooke, USFD
Patti Adank | Georg Meyer, Liverpool | Guy Brown, USFD
Viktoria Maier | Hynek Hermansky, IDIAP | Martin Cooke, USFD
Petr Svojanovsky | Hynek Hermansky, IDIAP | Roger Moore, USFD

B5.4 Training Measures.

At the IDIAP workshop there was a training session on the ‘smart meeting room’ facility.

Many Universities provide complementary skills programmes for researchers, and HOARSE students are encouraged to take advantage of these. An example is the Research Training Programme at USFD, which Jana Eggink has completed. There is a similar programme at Liverpool. At RUB, John Worley has taken a German language course.

B5.5 Equal Opportunities

We have taken no special equal opportunities measures, but 50% of the young researchers HOARSE has recruited are female.

B5.6 Multidisciplinarity

In HOARSE, multidisciplinarity is so central to the research that young researchers receive training across discipline boundaries every day. We have recruited from a variety of backgrounds: mathematics, phonetics, linguistics and music for instance. Much of our work involves a combination of experimental work, perhaps with human listeners, and computational or mathematical modelling.


B5.7 Industrial Training

HOARSE has a full industrial partner in DCAG, and in future years we anticipate young researchers spending time there to learn about the priorities and strengths of industrial research.

B.6 Difficulties

We are delighted to have recruited so many high-quality young researchers. Our only problem has been in Patras, where negotiations with two candidates foundered on the difficulty of non-nationals enrolling for a PhD in Greece and the practical difficulties of travelling to Patras, which is relatively distant from mainland Europe.

Part C - Summary Reports by Young Researchers

Patti Adank's report has been sent to the Commission.

Publications by young researchers

In print:

Eggink, J. and Brown, G.J. (2004): Instrument recognition in accompanied sonatas and concertos. Proc. International Conference on Acoustics, Speech, and Signal Processing, ICASSP'04, pp. 217-220.

Eggink, J. and Brown, G.J. (2004): Extracting melody lines from complex audio. Proc. International Conference on Music Information Retrieval, ISMIR'04

Viktoria Maier and Hynek Hermansky, “Perception of synthetic consonant-vowel stimuli”, in Multimodal Interaction and Related Machine Learning Algorithms (MLMI), Martigny, Switzerland, 2004

G. Lathoud, I.A. McCowan, and J.M. Odobez. Unsupervised Location-Based Segmentation of Multi-Party Speech. Proceedings of the 2004 NIST Meeting Recognition Workshop (NIST-RT04).

J. Ajmera, G. Lathoud and I.A. McCowan. Clustering and Segmenting Speakers and their Locations in Meetings. Proceedings of the 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-04).

D. Zhang, D. Gatica-Perez, S. Bengio, I.A. McCowan and G. Lathoud. Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework. Proceedings of CVPR 2004.

D. Gatica-Perez, G. Lathoud, I.A. McCowan and J.M. Odobez. A Mixed-State I-Particle Filter for Multi-Camera Speaker Tracking. Proceedings of the 2003 IEEE Int. Conf. on Computer Vision Workshop on Multimedia Technologies for E-Learning and Collaboration (ICCV-WOMTEC), 2003.

G. Lathoud, I.A. McCowan, and D.C. Moore. Segmenting Multiple Concurrent Speakers Using Microphone Arrays. Proceedings of Eurospeech 2003.

D. Gatica-Perez, G. Lathoud, I.A. McCowan, J.M. Odobez and D.C. Moore. Audio-Visual Speaker Tracking with Importance Particle Filters. Proceedings of the 2003 IEEE International Conference on Image Processing (ICIP-2003).

G. Lathoud and I.A. McCowan. Location based speaker segmentation. Proceedings of the 2003 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-03).

I.A. McCowan, S. Bengio, D. Gatica-Perez, G. Lathoud, F. Monay, D.C. Moore, P. Wellner and H. Bourlard. Modeling human interactions in meetings. Proceedings of the 2003 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-03).

G. Lathoud, J.M. Odobez and D. Gatica-Perez. AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking. IDIAP Research Report 04-28, 2004.

D. Zhang, D. Gatica-Perez, S. Bengio, I.A. McCowan and G. Lathoud. Multimodal Group Action Clustering in Meetings. IDIAP Research Report RR 04-24, 2004.

I.A. McCowan, D. Gatica-Perez, S. Bengio, and G. Lathoud. Automatic Analysis of Multimodal Group Actions in Meetings. IDIAP Research Report 03-27, 2003.

Eva Björkner, Johan Sundberg, Tom Cleveland and Ed Stone: “Voice source characteristics in different registers in classically trained musical theatre singers”, Proc. ICA2004, Kyoto, Japan, April 4-10, 2004; accepted for publication in Journal of Voice.

Julien Bourgeois and Klaus Linhard. Frequency-Domain Multichannel Signal Enhancement: Minimum-Variance vs. Minimum Correlation. Eusipco 2004, Vienna.

Merimaa, J: Auditorily Motivated Analysis of Directional Room Responses, 1st ISCA Tutorial & Research Workshop on Auditory Quality of Systems, Akademie Mont-Cenis, Germany, 2003. Invited talk (no written paper).

Merimaa, J. & Hess, W: Training of Listeners for Evaluation of Spatial Attributes of Sound, AES 117th Convention, San Francisco, CA, USA, 2004.

In Press

Fousek, P., Svojanovsky, P., Grezl, F. and Hermansky, H.: New Nonsense Syllables Database - Analyses and Preliminary ASR Experiments. Proc. ICSLP 2004, Seoul, Korea.

G. Lathoud and I.A. McCowan. A Sector-Based Approach for Localization of Multiple Speakers with Microphone Arrays. To appear in Proceedings of the 2004 ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing (SAPA-2004).

G. Lathoud, J.M. Odobez and D. Gatica-Perez. AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking. To appear in proceedings of the 2004 MLMI Workshop, Bengio & Bourlard Eds., Springer-Verlag, 2004.

D. Zhang, D. Gatica-Perez, S. Bengio, I.A. McCowan and G. Lathoud. Multimodal Group Action Clustering in Meetings. Proceedings of the 2004 ACM International Conference on Multimedia, Workshop on Video Surveillance and Sensor Networks (ACM MM-VSSN), 2004.

I.A. McCowan, D. Gatica-Perez, S. Bengio, and G. Lathoud. Automatic Analysis of Multimodal Group Actions in Meetings. To appear in the IEEE Transactions on Speech and Audio Processing, 2005.

Faller, C. & Merimaa, J: Source localization in complex listening situations: Selection of binaural cues based on interaural coherence, J. Acoust. Soc. Am., vol. 116, no 5, Nov. 2004.