
The Sheffield Wargames Corpus - Day Two and Day Three
Yulan Liu1†, Charles Fox2, Madina Hasan1, Thomas Hain1‡

1 MINI, SpandH, The University of Sheffield, UK
2 The University of Leeds, UK

[email protected], [email protected]

Introduction
• Speech recognition on natural conversation in a natural environment is of considerable current interest.
• However, it is challenging in application, particularly with far-field recording, due to overlapping speech, reverberation, background noise, speaker motion and informal speech patterns.
• Most existing speech corpora lack informal natural speech with movement, and only limited data is available that contains high-quality near-field and far-field recordings from real interactions among participants.
• The first recording of the Sheffield Wargames Corpus (SWC1), based on a social scenario where native English speakers play a table-top game named Warhammer, collected 8.0h of natural speech data for research on speech recognition, speaker tracking and diarisation.
• The Day 2 and Day 3 recordings collect 16.6h of annotated data (SWC2, SWC3), with 6.1h being female speech. All three recordings make a 24.6h annotated database in total.
• A Kaldi recipe is provided for standalone training using datasets defined with all three recordings. An in-domain LM is built with blog data, wiki data and conversational meeting data. Baseline results are reported for both standalone training and adaptation.

[Figure: recording room layout, showing wall microphones WALL-01 to WALL-04, grid microphones GRID-01 to GRID-08, the game table (TBL1), a west axis camera (C1), a north PTZ axis camera (C2) and an east axis camera (C3)]

http://mini-vm20.dcs.shef.ac.uk/swc/SWC-home.html

SWC Statistics

                     SWC1   SWC2   SWC3   overall
#session               10      8      6      24
#game                   4      4      3      11
#annotated speaker      9     11      8      22
gender                  M      M    F&M     F&M
#unique mic            96     71     24     103
#shared mic             -      -      -      24
annotated speech     8.0h  10.5h   6.1h   24.6h
#speech utt.        14.0k  15.4k  10.2k   39.6k
duration per utt.    2.1s   2.5s   2.2s    2.2s
#word per utt.        6.6    7.9    5.5     6.8
vocabulary           4.4k   5.7k   2.9k    8.5k
video                   √      √      -
location tracking       √      √      √       √

• Statistics of SWC1, SWC2 and SWC3.
• Vocabulary of SWC3 is much smaller than SWC1 and SWC2.

[Figure: Speaker location distribution, XY view; X axis (meter) 0 to 5, Y axis (meter) 0.0 to 3.5; speakers mn0001, mn0007, mn0011, mn0013]

Task and Dataset
Each recording file (session) is split into three strips (A, B, C) with an equal amount of annotated speech.

task                set    strips          dur.   #utt.  #spk.
standalone-1 (SA1)  train  1, {2, 3}.A    13.5h  22.6k   22
                    dev    {2, 3}.B        5.5h   8.5k   18
                    eval   {2, 3}.C        5.6h   8.4k   18
standalone-2 (SA2)  train  1               8.0h  14.0k    9
                    dev    {2, 3}.A        5.5h   8.7k   18
                    eval   {2, 3}.B+C     11.1h  16.9k   18
adapt-1 (AD1)       dev    {1, 2, 3}.A+B  16.3h  26.2k   22
                    eval   {1, 2, 3}.C     8.2h  13.3k   22
adapt-2 (AD2)       dev    1               8.0h  14.0k    9
                    eval   2, 3           16.6h  25.6k   18
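The three-strip split can be sketched as a greedy partition of a session's time-ordered utterances into contiguous strips of roughly equal annotated speech; a minimal illustration (the helper name and toy utterance list are hypothetical, not taken from the released recipe):

```python
def split_into_strips(utterances, n_strips=3):
    """Split a session's (start, end) utterances into contiguous
    strips A, B, C of roughly equal annotated speech duration.
    Hypothetical sketch; the released data definition may differ."""
    total = sum(end - start for start, end in utterances)
    target = total / n_strips
    strips, current, acc = [], [], 0.0
    for utt in utterances:
        current.append(utt)
        acc += utt[1] - utt[0]
        # close a strip once it reaches its share; keep the last open
        if acc >= target and len(strips) < n_strips - 1:
            strips.append(current)
            current, acc = [], 0.0
    strips.append(current)
    return strips

# toy session: ten 2-second utterances, one every 3 seconds
utts = [(i * 3.0, i * 3.0 + 2.0) for i in range(10)]
a, b, c = split_into_strips(utts)
print([sum(e - s for s, e in strip) for strip in (a, b, c)])
```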

Baseline Systems

Microphone channels
• IHM: individual headset microphone
• SDM: single distant microphone
• MDM: multiple distant microphones
  – 8-channel weighted delay-and-sum beamforming using BeamformIt
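Weighted delay-and-sum beamforming time-aligns each distant channel to a reference and sums them with per-channel weights. A minimal single-segment sketch, assuming integer sample delays and fixed weights (both of which BeamformIt actually estimates from the data, e.g. via cross-correlation):

```python
import numpy as np

def delay_and_sum(channels, delays, weights):
    """Weighted delay-and-sum over distant microphone channels.
    channels: (n_ch, n_samples); delays: integer sample shifts that
    align each channel to the reference; weights: sum to 1.
    Simplified sketch of what BeamformIt does per segment."""
    out = np.zeros(channels.shape[1])
    for ch, d, w in zip(channels, delays, weights):
        out += w * np.roll(ch, d)  # shift channel into alignment
    return out

# toy example: the same 100-sample signal arriving with different lags
rng = np.random.default_rng(0)
sig = rng.standard_normal(100)
chans = np.stack([np.roll(sig, -d) for d in (0, 3, 5, 2)])
beam = delay_and_sum(chans, delays=[0, 3, 5, 2], weights=[0.25] * 4)
print(np.allclose(beam, sig))  # prints True: channels add coherently
```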

Standalone system (Kaldi recipe)
• HMM-GMM
  – LDA+MLLT
  – LDA+MLLT+SAT (best WER on eval set for IHM: 48.8%)
  – LDA+MLLT+SAT+MMI
• DNN-HMM hybrid structure
  – DNN-HMM
  – DNN-HMM+sMBR (best WER on eval set: IHM 42.0%, SDM 77.3%, MDM 74.9%)
  – DNN-HMM+fMLLR
  – DNN-HMM+fMLLR+sMBR
• Using in-domain LM

Adaptation system
• DNN-HMM-GMM, using bottleneck features only
  – DNN: fine-tuning
  – HMM-GMM: MAP adaptation with updated bottleneck features
• Using in-domain LM
• Overall WER on eval set: IHM 47.7%, SDM 78.2%, MDM 75.0%
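MAP adaptation interpolates each Gaussian mean between its out-of-domain prior and the statistics collected on in-domain data. A minimal sketch of the standard mean update (the relevance factor tau and the toy statistics are illustrative, not values from this work):

```python
import numpy as np

def map_adapt_means(prior_means, occ, acc, tau=10.0):
    """MAP mean update for GMM components.
    prior_means: (n_comp, dim) out-of-domain means
    occ:         (n_comp,)    soft occupation counts on adaptation data
    acc:         (n_comp, dim) first-order stats, sum of gamma * x
    Standard formula: mu = (tau * mu0 + acc) / (tau + occ)."""
    return (tau * prior_means + acc) / (tau + occ[:, None])

# toy check: a well-observed component moves toward the data mean,
# a rarely observed one stays near its prior
mu0 = np.array([[0.0], [0.0]])
occ = np.array([100.0, 1.0])
data_means = np.array([[5.0], [5.0]])
acc = occ[:, None] * data_means
print(map_adapt_means(mu0, occ, acc, tau=10.0).ravel())
```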

[Diagram: adaptation pipeline. An HMM-GMM and a DNN are trained on out-of-domain AMI data; the HMM-GMM initialises MAP adaptation and the DNN is adapted (fine-tuned) on the in-domain SWC corpora. The SWC LM is built from wargame blog data, the wargame wiki and conversational meeting data.]

Language Model

• Topic and vocabulary of SWC differ from the existing data.
• Game-related text data was harvested from four Warhammer blogs and Warhammer wikipedia pages.
• A 4-gram LM of 30k words is built by interpolating:

LM component             #words   vocabulary  weight
Conversational web data  165.9M     457.8k    0.65
Blog 1 (addict)           21.1k       3.3k    0.05
Blog 2 (atomic)          126.8k       7.9k    0.05
Blog 3 (cadia)            40.4k       3.9k    0.19
Blog 4 (cast)             71.2k       7.0k    0.06
wikipedia (warhammer)     26.2k       4.1k    0.003
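Linear interpolation combines the component LMs' conditional probabilities with the weights above; a minimal sketch over toy unigram tables (the component models and probabilities here are illustrative stand-ins for the full 4-gram LMs, not the released model):

```python
def interpolate(models, weights, word, history=()):
    """Linearly interpolated n-gram probability:
    p(w | h) = sum_i lambda_i * p_i(w | h).
    models: list of dicts mapping (history, word) -> probability,
    a toy stand-in for full 4-gram component LMs."""
    return sum(lam * m.get((history, word), 0.0)
               for m, lam in zip(models, weights))

# toy components: a game term is rare in web data, common in blog data
web = {((), "attack"): 0.01}
blog = {((), "attack"): 0.20}
p = interpolate([web, blog], [0.8, 0.2], "attack")
print(round(p, 4))  # 0.8*0.01 + 0.2*0.20 = 0.048
```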

Baseline Results

Standalone system

                      dev    eval   overall
                                    S     D     I    WER
IHM  LDA+MLLT        50.9   51.8   35.9   8.9   6.4  51.3
     +SAT            48.7   48.8   34.4   8.1   6.3  48.7
     +MMI            48.8   49.1   34.4   8.8   5.7  48.9
     DNN             44.4   44.3   30.5   9.7   4.1  44.4
     +sMBR           42.0   42.0   29.5   7.6   5.0  42.0
     +fMLLR          48.1   48.1   32.9  11.4   3.8  48.1
     +sMBR           44.9   44.8   31.2   9.8   3.8  44.9
SDM  DNN             78.9   80.5   53.9  21.4   4.4  79.7
     +sMBR           76.4   77.3   39.1  35.5   2.2  76.8
MDM  DNN             76.0   77.9   53.3  18.2   5.5  76.9
     +sMBR           73.8   74.9   36.0  36.0   2.4  74.3
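The overall WER columns decompose into substitutions (S), deletions (D) and insertions (I), with WER = (S + D + I) / N for N reference words. A minimal Levenshtein-alignment sketch of that breakdown (the function name is illustrative; scoring toolkits such as sclite implement the same idea):

```python
def wer_counts(ref, hyp):
    """Count substitutions, deletions, insertions via edit distance.
    Returns (S, D, I); WER = (S + D + I) / len(ref)."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, S, D, I) aligning ref[:i] with hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):           # only deletions
        c = dp[i - 1][0]
        dp[i][0] = (c[0] + 1, c[1], c[2] + 1, c[3])
    for j in range(1, m + 1):           # only insertions
        c = dp[0][j - 1]
        dp[0][j] = (c[0] + 1, c[1], c[2], c[3] + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                sub = dp[i - 1][j - 1]  # match, no cost
            else:
                c = dp[i - 1][j - 1]
                sub = (c[0] + 1, c[1] + 1, c[2], c[3])
            c = dp[i - 1][j]
            dele = (c[0] + 1, c[1], c[2] + 1, c[3])
            c = dp[i][j - 1]
            ins = (c[0] + 1, c[1], c[2], c[3] + 1)
            dp[i][j] = min(sub, dele, ins)  # cost ties broken by counts
    _, s, d, i_ = dp[n][m]
    return s, d, i_

ref = "the unit moves six inches".split()
hyp = "the unit move six".split()
s, d, i = wer_counts(ref, hyp)
print(s, d, i, (s + d + i) / len(ref))  # 1 1 0 0.4
```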

Adaptation system

      dev    eval           overall
      SWC1   SWC2   SWC3    S      D     I    WER
IHM   24.9   46.4   50.5   33.4    9.3   5.0  47.7
SDM   55.2   75.0   85.2   53.2   19.1   6.0  78.2
MDM   53.5   71.6   82.4   52.4   15.4   7.3  75.0

Conclusions
• New recordings extend SWC1 to a 24.6h annotated database with multi-media and multi-microphone recordings.
• Four datasets are suggested for standalone training and adaptation. A Kaldi recipe is prepared for standalone training.
• An in-domain 4-gram 30k LM is built and released.
• The best overall WER obtained is 42.0% for IHM, 76.8% for SDM and 74.3% for MDM, suggesting a high difficulty level of the SWC corpora for ASR. Beamforming reduces WER by 3-4% relative.
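The relative gain from beamforming compares MDM against SDM at matched systems; a quick arithmetic check of the figures above:

```python
def relative_reduction(baseline, improved):
    """Relative WER reduction, in percent."""
    return 100.0 * (baseline - improved) / baseline

# SDM vs. MDM, standalone DNN-HMM+sMBR (overall WER)
print(round(relative_reduction(76.8, 74.3), 1))  # 3.3
# SDM vs. MDM, standalone DNN-HMM without sMBR
print(round(relative_reduction(79.7, 76.9), 1))  # 3.5
# SDM vs. MDM, adaptation system
print(round(relative_reduction(78.2, 75.0), 1))  # 4.1
```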

Funded by EPSRC Natural Speech Technology Programme Grant EP/I031022/1
