A Study on the Video Scene Retrieving System
DESCRIPTION
A wide variety of video data is now being generated, stored, and accessed thanks to advances in computer technology and the Internet. Finding a video, or a scene within a video, quickly in this data requires an efficient and effective technique. This study therefore proposes a video scene retrieval system based on speech recognition using an HMM (Hidden Markov Model). The proposed system is applied to scene retrieval experiments that evaluate the recognition rate for 457 short words. The experimental results show an average detection accuracy of 68%.
TRANSCRIPT
![Page 1: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/1.jpg)
A Study on the Video Scene Retrieving System
with a Speech Recognizer
2013. 5. 14
Yoshika OSAWA
Kohno Lab.
![Page 2: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/2.jpg)
Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
![Page 3: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/3.jpg)
1. Introduction
• A variety of video data are being generated, stored, and accessed with advances in the Internet.
• To find a video scene quickly in this data, an efficient technique is needed.
![Page 4: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/4.jpg)
1. Introduction
• Multimedia Annotations
o Nagao (2001)
![Page 5: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/5.jpg)
1. Introduction
• A Subtitling System for Broadcast Programs with a Speech Recognizer
o Ando et al. (2001)
![Page 6: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/6.jpg)
1. Introduction
• Extract the voice from the video.
• Advantages of voice:
o Easy to convert into text.
o Simple association.
Apply speech recognition to scene retrieval.
![Page 8: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/8.jpg)
2. Aim of Study
Implement a scene retrieving system, then verify its accuracy and check its operation.
Generate annotations automatically with speech recognition.
![Page 10: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/10.jpg)
3. Composition of System
[Flowchart] Start → Select a Video → Voice Divide Section → Speech Recognize Section → Input a Keyword → Scene Retrieve Section → Output the result → End
![Page 11: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/11.jpg)
i. Voice Divide Section
• Focus on the amplitude
o Use the signal only while its amplitude exceeds the threshold value.
o Reject segments that are too short, because they cannot be recognized.
o Derive the thresholds experimentally.
Thresholds by axis
Amplitude: 10 [%]
Time: 1000 [ms]
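The amplitude-based splitting described on this slide can be sketched as below. This is a minimal illustration, not the author's implementation. It assumes that the 10% amplitude threshold is relative to the loudest frame of the recording and that the 1000 ms value is the minimum segment length. The function name `split_voiced`, the 10 ms frame size, and the per-frame peak envelope are all choices made here for illustration.

```python
def split_voiced(samples, sample_rate, amp_ratio=0.10, min_ms=1000, frame_ms=10):
    """Return (start_ms, end_ms) pairs for stretches louder than the threshold.

    Assumptions (not stated in the slides): the threshold is amp_ratio times
    the loudest frame, and min_ms is the shortest segment that is kept.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    # Peak absolute amplitude per frame, used as a crude envelope.
    envelope = [max(abs(s) for s in samples[i:i + frame_len])
                for i in range(0, len(samples), frame_len)]
    peak = max(envelope, default=0)
    if peak == 0:
        return []
    threshold = amp_ratio * peak
    min_frames = -(-min_ms // frame_ms)  # ceiling division
    segments, start = [], None
    # The trailing 0 is a sentinel that closes a segment running to the end.
    for idx, level in enumerate(envelope + [0]):
        if level > threshold and start is None:
            start = idx
        elif level <= threshold and start is not None:
            if idx - start >= min_frames:  # reject segments that are too short
                segments.append((start * frame_ms, idx * frame_ms))
            start = None
    return segments


# Toy signal at 1 kHz: 0.5 s silence, 1.2 s tone, 0.3 s silence, 0.4 s tone, 0.2 s silence.
tone = [1000, -1000]
signal = [0] * 500 + tone * 600 + [0] * 300 + tone * 200 + [0] * 200
segments = split_voiced(signal, sample_rate=1000)
```

In the toy signal only the 1.2 s burst survives; the 0.4 s burst is above the amplitude threshold but shorter than the time threshold, so it is rejected.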
![Page 12: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/12.jpg)
ii. Speech Recognize Section
![Page 13: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/13.jpg)
(1) Pre-Processing Unit
• Digitization
o Sampling frequency: 16 kHz
o Quantization: 16 bit
• Noise Reduction
o Additive noise: subtract the component estimated from silent sections.
o Multiplicative noise: subtract on the log axis.
[Figure: Microphone characteristics of SM57]
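The two noise-reduction ideas can be illustrated with a short sketch. It assumes that the additive case means subtracting a noise magnitude spectrum estimated from silence, and that the multiplicative case means removing a fixed channel response by subtracting a time average in the log domain. Both readings are assumptions; the slides do not give the exact procedure. The function names and the toy numbers are hypothetical.

```python
import math


def subtract_noise(magnitudes, noise_estimate):
    """Additive noise: subtract a per-bin noise estimate, floored at zero."""
    return [max(m - n, 0.0) for m, n in zip(magnitudes, noise_estimate)]


def remove_channel(log_frames):
    """Multiplicative distortion: subtract the per-bin time average of the log spectrum."""
    count = len(log_frames)
    mean = [sum(column) / count for column in zip(*log_frames)]
    return [[v - m for v, m in zip(frame, mean)] for frame in log_frames]


def to_log(frames):
    return [[math.log(v) for v in frame] for frame in frames]


# Toy magnitude spectra: 2 frames x 3 frequency bins (made-up values).
clean = [[1.0, 2.0, 4.0], [2.0, 1.0, 8.0]]
gain = [0.5, 2.0, 3.0]  # hypothetical fixed microphone response per bin
observed = [[c * g for c, g in zip(frame, gain)] for frame in clean]

normalized_clean = remove_channel(to_log(clean))
normalized_observed = remove_channel(to_log(observed))
```

Because log(c·g) = log c + log g, the fixed gain becomes a constant offset per bin in the log domain and disappears when the average is subtracted, so both normalized results agree up to rounding.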
![Page 14: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/14.jpg)
(2) Feature Extraction Unit
Resonant frequencies (formants) are effective as feature values.
![Page 15: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/15.jpg)
(2) Feature Extraction Unit
• Resolution of human hearing
o Higher sensitivity at lower frequencies
• Filter that matches human hearing
o Mel-frequency
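A short sketch of the Mel-frequency idea follows. It uses one widely used definition of the mel scale, mel = 2595·log10(1 + f/700); the slides do not say which definition was used, so this formula is an assumption. The filter count of 24 is also a hypothetical choice; the 8000 Hz upper limit is the Nyquist frequency for 16 kHz sampling.

```python
import math


def hz_to_mel(freq_hz):
    """Convert Hz to mel with a commonly used formula (assumed, not from the slides)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min_hz, f_max_hz):
    """Center frequencies in Hz of filters spaced evenly on the mel axis."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


centers = mel_center_frequencies(24, 0.0, 8000.0)
```

The centers are packed closely at low frequencies and spread out at high frequencies, which mirrors the finer resolution of hearing at low frequencies.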
![Page 16: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/16.jpg)
(2) Feature Extraction Unit
• Inverse Fourier transform on the Mel-frequency axis
o New axis: Cepstrum
o Separates the voice pitch from the resonance frequencies
• MFCC (Mel-Frequency Cepstral Coefficients)
o Information on vowels
• ΔMFCC
o Information on consonants
• Feature vector
o (Average power, MFCC, ΔMFCC)
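The feature vector (average power, MFCC, ΔMFCC) can be assembled as in the sketch below. It assumes that log mel filter-bank energies are already available for each frame. It uses a DCT-II in place of the inverse Fourier transform, which is a common practical substitute and not necessarily what the author used. The regression-style delta over ±2 frames and the 12 coefficients are likewise assumptions.

```python
import math


def dct_ii(values, n_coeffs):
    """First n_coeffs DCT-II coefficients of one frame of log mel energies."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_coeffs)]


def deltas(frames, width=2):
    """Regression-based time derivative of each coefficient (edges are replicated)."""
    denom = 2 * sum(n * n for n in range(1, width + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, last)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def feature_vectors(log_mel_frames, powers, n_coeffs=12):
    """Concatenate [power] + MFCC + delta-MFCC for every frame."""
    mfcc = [dct_ii(frame, n_coeffs) for frame in log_mel_frames]
    delta = deltas(mfcc)
    return [[p] + m + dm for p, m, dm in zip(powers, mfcc, delta)]
```

With 12 coefficients each frame yields a 25-dimensional vector: one power value, 12 MFCCs, and 12 ΔMFCCs.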
![Page 17: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/17.jpg)
(3) Identification Unit
From Bayes' theorem
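The standard Bayes decision rule for speech recognition, which is presumably what this slide refers to, can be written as follows. The symbols are notation chosen here, not taken from the slides: W is a candidate word sequence and X is the observed sequence of feature vectors.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} P(X \mid W)\, P(W)
```

The denominator P(X) does not depend on W, so it can be dropped from the maximization. P(X | W) is the acoustic model and P(W) is the language model.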
![Page 18: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/18.jpg)
(3) Identification Unit
Speech waveform: Observable
Character information: Not directly observable
Estimate the character information from the waveform by using an HMM (Hidden Markov Model)
Maximum likelihood calculation: Viterbi algorithm
Machine learning: Baum-Welch algorithm
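As an illustration of the Viterbi step, here is a minimal sketch for a discrete HMM. The two states, the three observation symbols, and all probabilities are made-up toy values; a real recognizer such as Julius works with continuous feature vectors and far larger models.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[t][s] = (log probability of the best path ending in s at time t, previous state)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            score, arg = max((prev[r][0] + math.log(trans_p[r][s]), r) for r in states)
            column[s] = (score + math.log(emit_p[s][obs]), arg)
        best.append(column)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Hypothetical toy model: two phoneme-like states and three quantized feature symbols.
states = ["a", "k"]
start_p = {"a": 0.6, "k": 0.4}
trans_p = {"a": {"a": 0.7, "k": 0.3}, "k": {"a": 0.4, "k": 0.6}}
emit_p = {"a": {"low": 0.5, "mid": 0.4, "high": 0.1},
          "k": {"low": 0.1, "mid": 0.3, "high": 0.6}}

path = viterbi(["low", "mid", "high"], states, start_p, trans_p, emit_p)
```

Log probabilities are used so that long observation sequences do not underflow. The Baum-Welch algorithm, which estimates the model parameters from training data, is not shown here.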
![Page 19: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/19.jpg)
iii. Scene Retrieve Section
• Matching keyword and text
1. Input a keyword.
2. Match the keyword against the recognized text by string searching.
3. Extract the scenes in which the keyword was spoken.
4. Output a thumbnail.
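The four steps above can be sketched as a plain substring search over time-stamped recognition results. The `Segment` record, its field names, and the sample texts are hypothetical; the slides do not describe how the recognized text is stored. Thumbnail generation is left out, and the returned start times are the seek positions a thumbnail would be taken from.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognized voice segment (hypothetical record layout)."""
    start_ms: int
    end_ms: int
    text: str


def find_scenes(segments, keyword):
    """Return the start times of all segments whose text contains the keyword."""
    return [seg.start_ms for seg in segments if keyword in seg.text]


# Made-up recognition results for illustration only.
segments = [
    Segment(0, 2500, "good evening here is the news"),
    Segment(4000, 7200, "the weather will be clear tomorrow"),
    Segment(9000, 11000, "next the weather for the weekend"),
]
hits = find_scenes(segments, "weather")
```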
![Page 21: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/21.jpg)
4. Evaluation Experiment
1. Compare the result with the words I heard.
2. Calculate the recognition rate.
3. Evaluate it for each number of characters.
Sample data
Video: NHK news
Time: 3 minutes
Number: 30 videos
Words: 457 words
Engine: Julius
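The evaluation procedure can be sketched as follows. It assumes that a word counts as correctly recognized only when the recognizer output matches the heard word exactly; the slides do not state the matching criterion. The word pairs below are made-up placeholders, not the study's data.

```python
from collections import defaultdict


def recognition_rate_by_length(pairs):
    """pairs: iterable of (heard_word, recognized_word). Returns {length: rate}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for heard, recognized in pairs:
        length = len(heard)
        totals[length] += 1
        if heard == recognized:  # exact match is an assumed criterion
            hits[length] += 1
    return {length: hits[length] / totals[length] for length in sorted(totals)}


def overall_rate(pairs):
    """Fraction of all words that were recognized correctly."""
    pairs = list(pairs)
    return sum(heard == recognized for heard, recognized in pairs) / len(pairs)


# Placeholder data for illustration only.
pairs = [("a", "a"), ("ab", "ab"), ("ab", "ac"), ("abc", "abd")]
by_length = recognition_rate_by_length(pairs)
total = overall_rate(pairs)
```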
![Page 22: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/22.jpg)
4. Evaluation Experiment
Total average rate is 68%.
[Figure: bar chart of recognition rate (axis 0% to 80%) for 1 to 6 words; bar labels in extraction order: 67%, 73%, 69%, 46%, 45%, 40%]
![Page 23: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/23.jpg)
4. Evaluation Experiment
• Verify the correspondence between the keyword and the seek destination
o Select a thumbnail and play from that scene.
o Check whether the keyword was spoken.
![Page 24: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/24.jpg)
4. Evaluation Experiment
• The recognition rate decreases as the number of characters increases.
• The retrieved scene corresponds to the keyword.
• Recognition errors occur in weak consonant parts.
o Need improvement in the Voice Divide Section
o Must also improve the recognition accuracy
![Page 26: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/26.jpg)
5. Conclusion
• System for watching video efficiently
o Uses speech recognition
o Makes annotations automatically
• Future work
o Adopt the zero-crossing count in the Voice Divide Section.
o Incorporate the latest speech recognition technology.
o Incorporate image recognition.
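The zero-crossing count mentioned under future work can be sketched as below. How it would be combined with the amplitude threshold is not described in the slides; the function here only counts sign changes within one frame, and its name is a hypothetical choice.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples (zero is treated as positive)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


# A rapidly alternating frame crosses zero often; a slowly varying one rarely does.
noisy_count = zero_crossings([3, -2, 4, -1, 2, -3])
smooth_count = zero_crossings([1, 2, 3, 2, 1, -1])
```

Weak consonants tend to have low amplitude but many zero crossings, which is a common reason for pairing this count with an amplitude threshold when detecting voiced sections.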
![Page 27: A Study on the Video Scene Retrieving System](https://reader033.vdocuments.site/reader033/viewer/2022060201/559a69a81a28abdb348b477e/html5/thumbnails/27.jpg)
Thank you for your attention!