mediaeval 2015 - percolatte: a multimodal person discovery system in tv broadcast for the medieval...

PERCOLATTE: A multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Meriem Bendris, Delphine Charlet, Gregory Senay, MinYoung Kim, Benoit Favre, Mickael Rouvier, Frederic Bechet, Géraldine Damnati

Sept. 14-15, 2015, Wurzen, Germany


Page 1: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

PERCOLATTE

A multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Meriem Bendris, Delphine Charlet, Gregory Senay, MinYoung Kim, Benoit Favre, Mickael Rouvier, Frederic Bechet, Geraldine Damnati

Sept. 14-15, 2015

Wurzen, Germany

Page 2: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Task

Context

Percol, Percolator and Percolatte!

Task

Talking faces identification in TV broadcast

1. Search engine

2. No biometric systems

3. Identification evidence

4. Provided baseline modules

Page 3: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Percolatte approach

Scene analysis features and restricted names propagation

1. Scene analysis features

Anchor name detection

Document chaptering: shot classification (Studio/Report)

Speaker role classification (Anchor/Reporter/Other)

2. Restricted names propagation

Prior knowledge about broadcast news structure

Page 8: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Text processing

Anchor name detector

OCR [a] on the first 2 minutes

List of names: metadata from the INA website (2004-2009)

Soft mapping: Levenshtein distance on last names

Recall = 93%

[a] https://github.com/meriembendris/ADNVideo
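The soft mapping between OCR output and the list of known names can be sketched as follows. This is only an illustration of Levenshtein matching on last names: the helper names, the `max_distance` threshold and the example names are assumptions, not taken from the PERCOLATTE system.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_anchor(ocr_text, known_names, max_distance=2):
    """Return the known name whose last name is closest to an OCR token.

    max_distance is a hypothetical tolerance; None is returned when no
    last name is within that distance of any token.
    """
    tokens = ocr_text.lower().split()
    best_name, best_distance = None, max_distance + 1
    for name in known_names:
        last_name = name.split()[-1].lower()
        for token in tokens:
            distance = levenshtein(token, last_name)
            if distance < best_distance:
                best_name, best_distance = name, distance
    return best_name
```

For example, `match_anchor("jean dupomt", ["Jean Dupont", "Marie Martin"])` tolerates the single OCR error in the last name.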

Page 9: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Audio processing

Speaker clustering [Barras et al., 2006]

BIC clustering + GMMs/CLR

Speaker role classification [Damnati and Charlet, 2011]

Anchor: the regular speaker with the largest total speaking time

Reporter/Other: GMM classification

Corpus: 38 broadcast news shows from 7 channels (Oct. 2008-Jan. 2009), 14.5 hours, 1400 speakers

Train: 24 shows / test: 14 shows

EER = 15%

Speaker identification

Propagate each name to the speaker turn that maximises temporal overlap, and to its speaker cluster within the same chapter
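The name propagation step can be sketched as below. The `Segment` type, the function names and the rule that a turn must lie entirely inside the chapter are assumptions made for this illustration; the slides do not specify these details.

```python
from collections import namedtuple

# Generic timed segment: a name occurrence, a speaker turn, or a chapter.
Segment = namedtuple("Segment", ["start", "end", "label"])


def overlap(a, b):
    """Duration (in seconds) shared by two segments, 0.0 if disjoint."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def propagate_name(name_segment, speaker_turns, chapter):
    """Return the speaker turns that receive the name.

    The name goes to the turn with maximal temporal overlap, then to every
    turn of the same speaker cluster located inside the chapter.
    """
    if not speaker_turns:
        return []
    best_turn = max(speaker_turns, key=lambda turn: overlap(name_segment, turn))
    if overlap(name_segment, best_turn) == 0.0:
        return []  # the name does not overlap any speech
    return [turn for turn in speaker_turns
            if turn.label == best_turn.label
            and turn.start >= chapter.start
            and turn.end <= chapter.end]
```

Here the `label` of a speaker turn is its speaker-cluster identifier, so restricting by chapter keeps a name from leaking to the same cluster in a different part of the show.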

Page 12: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Visual processing

Shot boundaries

Colour histogram peaks on sliding window

Shot boundaries mapping:

July 1st: overlapping shots over 0.5s

July 8th: overlapping coverage above 50%
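The two shot mapping criteria can be contrasted with a small sketch. Shots are represented as `(start, end)` tuples in seconds, and the coverage is computed relative to the reference shot; both choices are assumptions, since the slides do not say which shot the 50% refers to.

```python
def overlap_seconds(a, b):
    """Shared duration of two (start, end) intervals, 0.0 if disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def matches_absolute(detected, reference, min_overlap=0.5):
    """July 1st style criterion: the shots overlap for more than 0.5 s."""
    return overlap_seconds(detected, reference) > min_overlap


def matches_coverage(detected, reference, min_ratio=0.5):
    """July 8th style criterion: more than 50% of the reference is covered."""
    duration = reference[1] - reference[0]
    if duration <= 0:
        return False
    return overlap_seconds(detected, reference) / duration > min_ratio
```

With a detected shot `(0.0, 10.0)` and a reference shot `(9.4, 20.0)`, the absolute criterion accepts the pair because of a 0.6 s overlap, while the coverage criterion rejects it, which illustrates why a relative threshold is stricter for long shots.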

Shot similarities

Cosine-based distance on:

RGB histograms

HOG features on resized frames (128×64)

Image embeddings: feature vectors at the 3rd fully-connected layer of the AlexNet DNN [Krizhevsky et al., 2012] (1000-dimensional vectors)
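A cosine-based distance between two shot descriptors (RGB histogram, HOG vector or image embedding) can be written as below. This is a generic sketch in plain Python; returning 1.0 for a zero vector is a convention chosen here, not something stated in the slides.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumed convention: an empty descriptor matches nothing
    return 1.0 - dot / (norm_u * norm_v)
```

The distance depends only on the direction of the vectors, so two histograms that differ by a global scale factor are treated as identical.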

Shot clustering

Integer Linear Program clustering [Rouvier and Meignier, 2012].

No face-related processing (detection/identification) is used in our approach

Page 16: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Document chaptering

Shot annotations

8 videos, 4914 shots

4 labels: Studio, Report, Mixed, Other

Page 17: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Document chaptering

Shot classification

Train = 3688 shots / test = 1226 shots

Liblinear classifier

Accuracy = 99.43%

Chaptering

A chapter is a run of successive shots having the same label
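Grouping successive shots that share a label into chapters can be sketched with `itertools.groupby`. The function name and the `(label, first_shot, last_shot)` output format are illustrative choices.

```python
from itertools import groupby


def chapters_from_labels(shot_labels):
    """Merge runs of identical consecutive shot labels into chapters.

    Returns a list of (label, first_shot_index, last_shot_index) tuples.
    """
    chapters = []
    index = 0
    for label, run in groupby(shot_labels):
        length = len(list(run))
        chapters.append((label, index, index + length - 1))
        index += length
    return chapters
```

For the label sequence Studio, Studio, Report, Report, Report, Studio this yields three chapters: shots 0-1 (Studio), shots 2-4 (Report) and shot 5 (Studio).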

Page 19: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Talking faces identification

Secondary strategy

Speaker identification + rule-based speaker-face mapping

Name propagation

The speaker is visible when:

Name appears on screen

On studio shots

On report shots when the speaker role is not reporter

Scores

No scoring function was developed (score = 1)
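The rule-based speaker-face mapping can be sketched as a single predicate. Treating the three conditions as alternatives (any one is sufficient) is an assumption, as are the function name and the string values used for labels and roles.

```python
def speaker_is_visible(name_on_screen, shot_label, speaker_role):
    """Rule-based guess of whether the current speaker appears on screen.

    shot_label is one of "Studio", "Report", "Mixed", "Other";
    speaker_role is one of "Anchor", "Reporter", "Other".
    """
    if name_on_screen:
        return True
    if shot_label == "Studio":
        return True
    if shot_label == "Report" and speaker_role != "Reporter":
        return True
    return False
```

The last rule reflects the intuition that a reporter speaking over a report is usually a voice-over, whereas another speaker in a report is typically an interviewee shown on screen.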

Page 21: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Talking faces identification

Primary strategy

Shot clustering + chapter-restricted propagation

Name propagation

Direct propagation: names to overlapping shots

Within a chapter, to shot-clusters sharing the speaker-cluster

Anchor name:

Propagate anchor names to overlapping studio-shots and their shot-clusters

Propagate anchor names if speaker role is anchor

Scores

Initialize with OCR scores, then incrementally increase according to the origin of the name:

Direct propagation: OCR shot overlapping

Talking-face score > 0.8

Name pronounced around the shot (± 5s)
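The incremental scoring can be sketched as follows. The size of each increment is not given in the slides, so the `bonus` value below is purely hypothetical, as are the function and parameter names; only the 0.8 threshold and the 5 s margin come from the slide.

```python
def pronounced_near(shot_start, shot_end, mention_times, margin=5.0):
    """True if the name is pronounced within `margin` seconds of the shot."""
    return any(shot_start - margin <= t <= shot_end + margin
               for t in mention_times)


def shot_score(ocr_score, direct_overlap, talking_face_score,
               name_pronounced_nearby, bonus=0.1):
    """Start from the OCR score and add a bonus per supporting cue.

    bonus is a hypothetical increment, not a value from the slides.
    """
    score = ocr_score
    if direct_overlap:              # name came from an overlapping OCR shot
        score += bonus
    if talking_face_score > 0.8:    # threshold quoted on the slide
        score += bonus
    if name_pronounced_nearby:      # see pronounced_near
        score += bonus
    return score
```

A shot with OCR score 0.5 that is directly overlapped by the name and has a talking-face score of 0.9 would thus be ranked above a shot that only inherited the name through clustering.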

Page 23: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Talking faces identification

Submissions

Systems

Primary: primary strategy with DNN- and HOG-based shot clustering

Primary DNNOnly: primary strategy with DNN-based shot clustering

Primary RGBOnly: primary strategy with RGB-based shot clustering

Secondary: secondary strategy based on speaker identification and speaker-face rule-based mapping

Evidence

For each name, select the provided OPN shot that maximizes the OCR result score
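Evidence selection amounts to an arg-max per name, sketched below. The input format, a list of `(name, shot_id, ocr_score)` tuples, is an assumption made for the example.

```python
def select_evidence(ocr_hits):
    """Map each name to the shot with the highest OCR score for that name.

    ocr_hits: iterable of (name, shot_id, ocr_score) tuples.
    Ties keep the first shot encountered.
    """
    best = {}
    for name, shot_id, score in ocr_hits:
        if name not in best or score > best[name][1]:
            best[name] = (shot_id, score)
    return {name: shot_id for name, (shot_id, _) in best.items()}
```

Each hypothesised name therefore comes with exactly one evidence shot, the one where the OCR was most confident.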

Page 25: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Results

Metrics                         EwMAP   MAP     C
Baseline                        78.35   78.64   92.71
Secondary on July 1st           85.89   86.12   97.6
Secondary on July 8th           86.40   86.61   97.68
Primary DNNOnly on July 1st     81.41   81.67   97.63
Primary DNNOnly on July 8th     87.75   88.01   97.63
Primary RGBOnly on July 1st     81.02   81.28   97.63
Primary RGBOnly on July 8th     87.33   87.60   97.63
Primary on July 1st deadline    81.70   81.96   97.63
Primary on July 8th             88.19   88.45   97.63

Table: Performance of PERCOLATTE 2015 runs

Page 26: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Results

Metrics                               EwMAP   MAP     C
Baseline                              78.35   78.64   92.71
Primary                               88.19   88.45   97.63
Primary DNNOnly                       87.75   88.01   97.63
Primary HOGOnly                       88.04   88.30   97.63
Primary RGBOnly                       87.33   87.60   97.63
Primary without speaker restriction   88.49   88.75   97.63
Primary without anchor process        88.05   88.31   97.39

Table: Performance of PERCOLATTE 2015 runs

Page 27: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

Conclusion

Talking faces identification

Without face-related processing

Easy-to-establish features:

Shot classification

Speaker role classification

Minor prior knowledge about broadcast news:

Chapter restriction

List of journalists

About +10 points of MAP compared to the Baseline (88.45 vs. 78.64)

TV programs = ambiguous context but regular structure

Page 29: MediaEval 2015 - PERCOLATTE: a multimodal person discovery system in TV broadcast for the Medieval 2015 evaluation campaign

References

Barras, C., Zhu, X., Meignier, S., and Gauvain, J.-L. (2006). Multi-stage speaker diarization of broadcast news. IEEE Transactions on Audio, Speech and Language Processing.

Damnati, G. and Charlet, D. (2011). Multi-view approach for speaker turn role labeling in TV broadcast news shows. In INTERSPEECH, pages 1285–1288. ISCA.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C., Bottou, L., and Weinberger, K., editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.

Rouvier, M. and Meignier, S. (2012). A global optimization framework for speaker diarization. In Speaker Odyssey.