Eötvös Loránd University
Faculty of Arts
DOCTORAL DISSERTATION – THESIS
Doctoral School of Linguistics
Head: Professor Dr. Vilmos Bárdosi CSc
Applied Linguistics Doctoral Programme
Head: Professor Dr. Mária Gósy DSc
AUTOMATIC SPEAKER DIARIZATION IN HUNGARIAN SPONTANEOUS CONVERSATIONS
by
ANDRÁS BEKE
Supervisor
Professor Dr. Mária Gósy DSc
Budapest, 2014
1. INTRODUCTION
In the Conversation Analysis framework, conversation is structurally organized (Garfinkel 1967; Goffman 1983; Schegloff 1992; Sacks et al. 1974; Sacks 1992; Iványi 2001; Stokoe 2006). Based on this theory, if conversation is systematic, it can be modeled automatically. Therefore, the examination of conversation is very important not only for linguistics but for speech technology as well.
In human-machine communication, several processes have been modeled by applying speech technology, such as speech decoding (speech recognition), speech production (speech synthesis) and speaker identification based on the voice (speaker recognition) (Németh–Olaszy 2010). These processes are linked together in conversation, where the operations of speech decoding and speech production are circularly interleaved. This circulation is caused by speaker changes. Automatic detection of speaker changes
is therefore very important. Speaker diarization is the process of partitioning an input
audio stream into homogeneous segments according to the speaker's identity. It can
enhance the readability of an automatic speech transcription by structuring the audio
stream into speaker turns and, when used together with speaker recognition systems,
by providing the speaker’s true identity. It is used to answer the question "who spoke
when?" (Jin et al. 2004).

Speaker diarization is a combination of speaker segmentation and speaker clustering.
The first aims at finding speaker change points in an audio stream. The second aims at
grouping together speech segments on the basis of speaker characteristics (Jin et al.
2004; Kotti et al. 2008). To improve speaker diarization, voice activity detection and overlapping speech detection can be used as well.

In the literature, extensive research on speaker diarization has been described, but principally
for English (Tritschler–Gopinath 1999; Sivakumaran–Fortuna–Ariyaeeinia 2001; Lu–
Zhang 2002a; Cettolo–Vescovi 2003; Shih-Sian Cheng et al. 2010; Vescovi–Cettolo–
Rizzi 2003). However, for the Hungarian language, no work is known which
addresses speaker diarization. The aim of this PhD thesis is therefore to develop a speaker diarization system for the Hungarian language. The main focus is on creating algorithms for speaker diarization (speaker segmentation, speaker clustering, overlapping speech detection) and on implementing and enhancing some existing algorithms used in speaker diarization (voice activity detection, speaker recognition), focusing on Hungarian conversation. In addition, we introduce the main applications implementing speaker diarization on the one hand, and present the scientific areas connected to speaker diarization on the other.

The goal is to build a speaker diarization system that can automatically
mark speaker changes based on acoustic information in Hungarian spontaneous
conversation. The approach is mainly based on unsupervised methods.
2. ORGANIZATION OF THIS THESIS
The dissertation contains 11 chapters. The first chapter provides an introduction,
where the theoretical and practical background of the introduced scientific areas is
presented focusing on speaker diarization. Spontaneous speech will be described by
means of analyzing both the speech production and speech perception processes. In
this chapter we present the main theory of conversation analysis, discourse analysis
and speech accommodation. The building blocks of conversations are the turns and
their possible markers (the discourse markers) are also presented in the first chapter as
well. The last part of the first chapter addresses overlapping speech in spontaneous
conversations.
Chapter 2 reviews some approaches previously used in speaker diarization.
Chapter 3 describes the goal of the thesis, together with the research questions and hypotheses.
The material and subjects of the research, as well as the evaluation measure for speaker diarization (DER: Diarization Error Rate), are presented in Chapter 4.
The main algorithms of the speaker diarization system and the results are presented in Chapter 5, where the development of our two-pass speaker diarization system is explored. The preprocessing step relies on VAD (voice activity detection), for which modifications to the algorithm are proposed; Chapter 5 also describes the examination of speaker-specific acoustic features for speaker recognition, as well as the overlapping speech detection algorithm.
Finally, Chapter 6 concludes the thesis, and a summary of its contributions is given in Chapter 7. Chapter 8 provides an outlook on possible directions of future work in the field. The theses of this dissertation are presented in Chapter 9. Chapter 10 contains the references, and the last chapter (11) lists the abbreviations.
3. MATERIAL AND SUBJECTS
In this thesis, 100 spontaneous conversations (with a total duration of 55 hours) were
selected from the BEA database (Gósy 2012), recorded in a laboratory environment.
In each case, three persons were involved in the conversations. Two of them were
permanent (2 females, average age 32 years old). The third subject (interviewer) was
one out of the 43 male and 67 female (average age 35 years old) speakers.

The speech data were collected using single-channel recording at a 44 kHz sampling frequency with linear 16-bit encoding. For final processing, the recordings were resampled to 16 kHz.

The manual annotation includes the following levels:
(i) Silence: all silent parts longer than 100 ms were marked.
Naturally, silence resulting from articulation (such as VOT) was not marked even if it was longer than this threshold.
(ii) Speaker change point: in the continuous signal we manually marked the transition points where the speakers changed. Backchannels were not marked as turns; only actual speaker changes were.
(iii) Overlapping speech: regions of the audio file were marked where more than
one speaker spoke at the same time. If such a region was not longer than 50 ms, it was not marked, because such a short speech interval is regarded as not automatically detectable.

The methods employed in the particular tasks are described at the beginning of each chapter, before the presentation of the results.
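The duration thresholds of the annotation scheme above can be sketched as a small filtering step (a hypothetical helper, not the annotation tool actually used in the thesis; intervals are assumed to be (start, end) pairs in seconds):

```python
MIN_SILENCE = 0.100   # silences shorter than 100 ms are not marked
MIN_OVERLAP = 0.050   # overlaps of at most 50 ms are not marked

def keep_marked(intervals, min_dur):
    """Keep only the intervals long enough to be marked in the annotation."""
    return [(s, e) for s, e in intervals if (e - s) > min_dur]

silences = [(0.0, 0.08), (1.2, 1.5), (3.0, 3.09)]
overlaps = [(2.0, 2.04), (4.0, 4.3)]

marked_silences = keep_marked(silences, MIN_SILENCE)   # only (1.2, 1.5) survives
marked_overlaps = keep_marked(overlaps, MIN_OVERLAP)   # only (4.0, 4.3) survives
```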
4. RESULTS
4.1 Speech detection
Voice activity detection (also known as speech activity detection or speech
detection) plays an important role in speaker diarization because this algorithm can
improve the result of speaker diarization. The VAD is a technique used in speech
processing in which the presence or absence of human speech is detected.

As the aim of this thesis was not to develop a new VAD algorithm, we chose and modified a very simple and fast pause segmentation algorithm created by Giannakopoulos and implemented in MATLAB. To segment speech and non-speech
regions in a waveform, the short-term energy and spectral centroid features were
extracted and used to make a decision using an adaptive threshold. Our proposed method differs from the original algorithm in that it defines the two centers of the feature distributions (one for speech and the other for non-speech) using the k-means unsupervised learning method (Ying et al. 2011). The aim of this task was to test whether our modified VAD yields better results. For evaluation, the VAD detection error rate (DER) was used. To compare the results of the basic and the modified method, the Wilcoxon test complemented by Monte Carlo simulation was applied.
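The modified VAD can be sketched as follows (a minimal, illustrative reimplementation, not the MATLAB code of Giannakopoulos; for brevity only the short-term-energy branch of the decision is shown, whereas the system described above combines energy and spectral-centroid thresholds):

```python
def short_term_energy(frames):
    """Mean-square energy of each analysis frame."""
    return [sum(x * x for x in f) / len(f) for f in frames]

def two_means_threshold(values, iters=20):
    """1-D 2-means: estimate the centers of the speech and non-speech
    feature distributions and return their midpoint as the adaptive threshold."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if low:
            lo = sum(low) / len(low)
        if high:
            hi = sum(high) / len(high)
    return (lo + hi) / 2.0

def vad(frames):
    """Mark a frame as speech if its energy exceeds the adaptive threshold."""
    energies = short_term_energy(frames)
    thr = two_means_threshold(energies)
    return [e > thr for e in energies]
```

Because the threshold sits halfway between the two cluster centers, it adapts to each recording instead of being fixed in advance.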
The corpus contains 49 hours of speech and 6 hours of silence in total; 10.9% of the total duration of the corpus is silence. 5 hours of spontaneous conversation were used to develop our VAD algorithm (i.e., to define the optimal thresholds), and 39 hours of conversation were used to evaluate the system.
In the first experiment we focused on the impact of the window size over which the threshold is applied in the audio file. The window length ranged between 25 ms and 250 ms. At the same time we tested whether our modified algorithm outperforms the basic method. Using the VAD algorithm with a 250 ms long window gave 9.51% DER. Using the k-means clustering method in our VAD algorithm, the result was further improved, although the improvement was not significant.

We studied the effects of additive white noise on speech signals for voice activity
detection. Various levels of white noise were added to the speech signal so that the
signal-to-noise ratio (SNR) of the recordings decreased in 5 dB steps. The original SNR value of the recordings was 25 dB. The results showed that with increasing noise the DER value increased; at 10 dB SNR the DER value was 34.72%.

We proposed a VAD algorithm based on the k-means unsupervised learning method to
segment the audio signal into speech and non-speech segments. We demonstrated that it can be useful for robust speech detection in spontaneous conversations. The proposed VAD was characterized by 90.49% accuracy on good-quality recordings.

4.2 Overlapping speech detection
The ratio of overlapping speech is high in spontaneous conversations (Gráczi–Bata
2010). Beattie analyzed such conversations and showed that 13% of the speech was
overlapping in English conversations, but this ratio increased when more than two
participants were involved in the conversations (31% of the total duration of the conversations) (Beattie 1983, cited in Levelt 1989).
According to Çetin and Shriberg (2006), 10–13% of the total duration of conversations is overlapping speech. In the Hungarian literature, Markó (2006) reported that 6% of speech was overlapping in four-participant Hungarian conversations. Bata (2009) found this ratio to be 1.7–3% in two-participant Hungarian spontaneous speech.
Several studies showed that overlapping speech was responsible for most of the errors
in speaker diarization. According to Wooters and Huijbregts (2007), 17% of the DER is false rejection caused by overlapping speech in speaker diarization. Some studies have reported on the effects of overlap in meetings (e.g., Boakye et al. 2008a,b,c; Boakye 2008; Boakye et al. 2011; Trueba-Hornero 2008), but work on systems for identifying overlapped speech and mitigating its effects in speaker diarization is still scarce. As overlapped speech is now a major obstacle to improving the performance of speaker diarization systems, efforts in overlap detection are of increasing interest and importance.

In this chapter we present the measures by which we automatically classified the
overlapping speech segments in Hungarian spontaneous speech. To detect overlapping
speech, we used an ANN/SVM system (Artificial Neural Network and Support Vector
Machine). In the first step, four acoustic features were extracted from the speech: i) the FFT spectrum (SP); ii) mel-frequency cepstral coefficients (MFCC); iii) log mel-scale filterbank coefficients (MSFC); and iv) subband energies (SBE). In order to gain a better representation of the features, we used restricted Boltzmann machines (RBM). Restricted Boltzmann machines have been used successfully to extract features from raw data. They are especially useful when stacked to form a Deep Belief Network (DBN), which is able to extract feature hierarchies. The output of the DBN was the input to the Support Vector Machine (LS-SVM: least squares SVM with an RBF kernel function), which carried out the classification. We used three restricted Boltzmann machines in the DBN (H1, H2, H3).

The corpus contained 8056 time intervals where more than one speaker talked at the
same time. This means that the corpus contained 7 hours of overlapping speech,
representing 12% of the total duration of the corpus. To train the SVM, 5371 training
samples were used, and to test it 2386 testing samples were used.
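The DBN-to-SVM pipeline described above can be sketched as follows (an illustrative forward pass only: the RBM weights here are hand-set toy values rather than trained by contrastive divergence, and the LS-SVM back-end is replaced by a plain linear decision function for simplicity):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rbm_up(visible, weights, hidden_bias):
    """Deterministic upward pass of one RBM: hidden activation probabilities."""
    return [sigmoid(sum(w * v for w, v in zip(col, visible)) + b)
            for col, b in zip(weights, hidden_bias)]

def dbn_features(x, layers):
    """Propagate an input frame through the stacked RBMs (H1, H2, H3)."""
    for weights, bias in layers:
        x = rbm_up(x, weights, bias)
    return x

def linear_svm_decision(features, w, b):
    """Stand-in for the SVM back-end: sign of a linear decision function."""
    score = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1 if score > 0 else 0   # 1 = overlapping speech

# Toy single-RBM "stack" with hand-set weights, purely for illustration.
layers = [([[5.0, 0.0], [0.0, 5.0]], [-2.5, -2.5])]
pred_a = linear_svm_decision(dbn_features([1.0, 0.0], layers), [1.0, -1.0], 0.0)
pred_b = linear_svm_decision(dbn_features([0.0, 1.0], layers), [1.0, -1.0], 0.0)
```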
The results showed that, of the four acoustic features, MSFC gave the lowest EER (Equal Error Rate): in this case, the EER was 47.49% on average. The second-best feature was MFCC, with an average EER of 50.84%. Thus MSFC gave better results than MFCC, an improvement of 3.35% absolute. This improvement was statistically confirmed (Wilcoxon test: Z=-2.211; p=0.023).
We also studied the effect of the number of units in the third hidden layer of the DBN.
The best results were obtained using 500 neurons in the third layer of the DBN with the
MSFC acoustic features (Figure 1).
Figure 1
The EER value depending on the number of neurons in H3

The statistical analysis confirmed that the MSFC features work better than the other three features independently of the number of neurons in H3 (MSFC–MFCC: Z=-2.201; p=0.028; MSFC–SP: Z=-2.201; p=0.028; MSFC–SBE: Z=-2.201; p=0.028). In general, the results show that with 500 to 800 neurons the results are worse than
under 500 neurons.
We also analyzed the types of errors. First of all, most of the errors stemmed from the manual annotation. Beyond these, we found that backchannels and laughter caused many errors as well.

This chapter reported an investigation into the use of a Deep Belief Network and a
Support Vector Machine for the classification of overlapping speech in spontaneous
speech. The DBN/SVM approach gives overlap detection results comparable to the
ones published in the overlap detection literature.

4.3 Speaker recognition for speaker diarization
From a general point of view, speaker recognition algorithms are a very useful part of
many speech technology applications, for example: speaker indexing and rich
transcription, and automatic speech recognition. The results of speaker recognition can also be used in modules of speaker-based algorithms such as speaker diarization (Campbell 1997).
In the Hungarian literature, several studies have examined the possibility of speaker identification, but only a few works exist on automatic speaker recognition for Hungarian speech (Fék 1997).

The aim of this chapter was to examine which spectral region is speaker-specific,
and to build an automatic speaker recognition system based on the GMM and GMM-UBM methods. The system is built around the likelihood ratio test for verification, using simple but effective GMMs as likelihood functions and a universal background model (UBM) for normalizing the scores (Higgins et al. 1991; Rosenberg et al. 1992; Reynolds 1995; Matsui–Furui 1995; Reynolds 1997). Each speaker was modelled by GMMs with various numbers of mixtures. A 25 s long interval from each of the 80 speakers was used to train the model, and a 13 s interval to test it. For UBM training we selected 20 speakers (10 male, 10 female) from the corpus and modelled them with various numbers of mixtures (2–512 mixtures).
To classify the speakers, spectral
information is used. The spectral information is calculated over three different subbands, each represented by three MFCCs. The front-end processing is done in four different ways: i) MFC for the spectral full band; ii) MFC for the spectral subband between 1.5 and 2.5 kHz; iii) MFC for the spectral subband between 2.5 and 3.5 kHz; iv) MFC for the spectral subband between 3.5 and 4.5 kHz. To compute the subband MFCCs, we employed HCopy implemented in MATLAB: it bandlimits the signal between i) 1.5, ii) 2.5, iii) 3.5 kHz and i) 2.5, ii) 3.5, iii) 4.5 kHz, respectively, and distributes the four filterbank channels equally on the mel scale such that the lower cutoff of the first filter is at i) 1.5, ii) 2.5, iii) 3.5 kHz and the upper cutoff of the fourth filter is at i) 2.5, ii) 3.5, iii) 4.5 kHz. Three cepstral coefficients are then calculated from the four filterbank values using the Discrete Cosine Transform (DCT).
To measure the performance of the speaker recognition, the accuracy (true positive
rate) is calculated. The results showed that the best performance is yielded by using spectral sub-band
MFC between 2.5 and 3.5 kHz. This result was significantly better than using MFC(1,5–
2,5) (Z=-2,201; p=0,028) and using MFC(3,5–4,5) (Z=-2,201; p=0,028). However the
performance of MFC(2,5–3,5) is better than MFC(fullband), this difference was not
significant. These results are correlated by previous studies which have shown that the spectral
subband, 2500 Hz and 3500 Hz, carries speaker specific information (Furui 1986;
Parthasarathi et al. 2013). We examined the effect of number of mixtures in GMM-UBM system. The best
result was obtanied by using 256 mixtures GMM-UBM and MFC(2,5–3,5) acoustic
feature. In this case the accuracy was 79.76%. In this Chapter we studied various strategies to represent the speaker specific
acoustic features to improve the performance of the GMM-UBM speaker recognition
system. We proved experimentally that subband information from 2.5 kHz to 3.5 kHz
carried the speaker specific information in the spectrum. This result can be used for
improving the performance of speaker diarization.
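The DCT step of the subband front-end described above can be sketched as follows (a simplified illustration: the mel filterbank itself, i.e. the bandlimiting and the four equally spaced mel channels, is assumed to have been applied already, so the function starts from the four filterbank output energies):

```python
import math

def subband_cepstra(filterbank_energies, n_ceps=3):
    """DCT-II of the log filterbank energies; keep the first n_ceps
    coefficients. With four channels in, e.g., the 2.5-3.5 kHz band,
    this yields the three subband 'MFC' coefficients."""
    logs = [math.log(e) for e in filterbank_energies]
    n = len(logs)
    ceps = []
    for k in range(1, n_ceps + 1):   # skip c0 (overall log energy)
        c = sum(logs[j] * math.cos(math.pi * k * (j + 0.5) / n)
                for j in range(n))
        ceps.append(c)
    return ceps
```

A flat filterbank output gives all-zero cepstra, while any spectral tilt inside the subband shows up in the first coefficient.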
4.4 Speaker diarization
Previous speaker diarization systems have mostly been developed for radio broadcast material in
different languages, which can largely be considered as half-spontaneous as the show
participants are familiar with the topic in advance. The BEA database contains more spontaneity than such corpora because, during the recording, speech planning and speech production take place at the same time. Therefore, this work can also be considered pioneering, because speaker diarization for such conversations had not been developed so far.
Our corpus contained 7,827 turns from 100 conversations. One conversation contained 78 turns on average (std.: 41 turns; max.: 240 turns; min.: 11 turns). The male speakers produced 79 turns (std.: 45 turns), while the females produced 65
turns (std.: 37 turns), on average. There were no statistically significant differences in
the number of turns between males and females (one-way ANOVA). We examined the ratio of the total duration of the recordings and the time a given
speaker spoke in the recording. The total speaking time of the interviewees was 40.3% of the total duration. Males talked somewhat more (42% of the total duration) than females (37% of the total duration), but this difference was not significant (one-way ANOVA).
The total speaking time of the interviewer was 33.9% of the total duration. The
second participant talked the least, only 18.3% of the total duration (repeated
ANOVA: interviewer*second participant: F(2, 200)=39.833; p<0.001; interviewees*second participant: F(2, 200)=39.833; p<0.001) (Figure 2). These ratios indicate that during the conversations the roles were not balanced: the second participant often stayed in the background (there could be several reasons for this, for example the degree of familiarity between the participants).
Figure 2
The ratio of total duration of the recording and the speaker’s speaking time in the recording
We also calculated how many turns were produced by each speaker per minute. The interviewees produced 1.38 turns per minute on average, the interviewer 1.15 turns per minute, and the second participant 0.78 turns per minute.
Additionally, we examined whether the speaking time and the turns-per-minute rate were correlated. In the case of the interviewees, no tendency could be established; that is, we could not say that whoever talked more also took the floor more often. In the case of the interviewer, a moderate positive relationship between the speaking time and the turns per minute could be measured (Pearson correlation: r=0.424, p<0.001). The same trend was seen in the case of the second participant (Pearson correlation: r=0.441, p<0.001).
This suggests that the interviewees in the conversations did not need to seek the opportunity
for taking the floor, since the basic situation is that they are expected to speak. By contrast, the interviewer and the third participant of the conversation have to seek the opportunity to take the floor as frequently as they can in order to produce longer utterances.
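The two turn statistics used above, turns per minute and the Pearson correlation between speaking time and turn rate, can be computed as follows (a straightforward sketch with made-up numbers, not the thesis data):

```python
import math

def turns_per_minute(n_turns, duration_s):
    """Turn rate of one speaker over a recording of duration_s seconds."""
    return n_turns / (duration_s / 60.0)

def pearson_r(xs, ys):
    """Pearson correlation, e.g. between speaking times and turn rates."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```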
In this thesis we built an unsupervised, BIC-based (Bayesian Information Criterion) speaker segmentation algorithm. The BIC is probably the most extensively used segmentation and clustering metric, justified by its simplicity and effectiveness. It uses a likelihood criterion penalized by the model complexity (the number of free parameters in the model), and was introduced by Schwarz (1971, 1978) as a model selection criterion. For the task of speaker segmentation, the technique was first used by Chen
and Gopalakrishnan (Chen–Gopalakrishnan 1998), where a single full covariance
Gaussian was used for each of the models. Although not present in the original formulation, the λ parameter was introduced to adjust the effect of the penalty term on the comparison, which constitutes a hidden threshold on the BIC difference. Such a threshold needs to be tuned to the data, and therefore its correct setting has been the subject of constant study. Several researchers have proposed ways to automatically select λ
(Tritschler–Gopinath 1999; Delacourt–Wellekens 2000; Delacourt–Kryze–Wellekens
1999; Mori–Nakagawa 2001; Lopez–Ellis 2000; Vandecatseye et al. 2004).
False alarm compensation (FAC) was implemented to try to reduce the number
of incorrectly detected speaker changes. The symmetric Kullback-Leibler (KL2)
distance was measured between the data D1 and D2 centered around each proposed
change point, returned from the initial speaker segmentation (Siegler et al. 1997, Hung
et al. 2000; Adami et al. 2002). The KL2 distance is used in this thesis as a post-processing step with the purpose of reducing the number of false change points introduced by the BIC algorithm (Ida 2011).
The purpose of the clustering stage is to associate or cluster segments from the same speaker
together. The clustering ideally produces one cluster for each speaker in the audio
with all segments from a given speaker in a single cluster. The predominant approach
used in diarization systems is hierarchical, agglomerative clustering with a BIC based
stopping criterion (Chen–Gopalakrishnan 1998).
In our work the clusters are represented by GMM supervectors whose means are MAP-adapted (maximum a posteriori) from the UBM (Reynolds et al. 2000). The GMM-UBM consists of 256 mixture components. The distance between clusters was measured within the BIC clustering framework. To reduce the high dimensionality of the data, PCA (Principal Component Analysis) was used.
A system hypothesizes a set of speaker segments, each of which consists of a
(relative) speaker-id label such as ‘spkr1’, ‘spkr2’ or 'spkr3' and the corresponding
start and end times. This is then scored against reference ‘ground-truth’ speaker
segmentation which is generated using the rules given in (Fiscus et al. 2004). Since
the hypothesis speaker labels are relative, they must be matched appropriately to the
true speaker names in the reference. To accomplish this, a one-to-one mapping of the
reference speaker IDs to the hypothesis speaker IDs is performed so as to maximize
the total overlap of the reference and (corresponding) mapped hypothesis speakers.
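The one-to-one mapping described above can be sketched with a brute-force search over permutations (illustrative only: the NIST scoring tooling solves the same assignment problem efficiently, and for simplicity the sketch covers only the speaker-error bookkeeping, not the miss and false-alarm terms):

```python
from itertools import permutations

def best_mapping(overlap):
    """overlap[r][h]: seconds during which reference speaker r and hypothesis
    label h are attributed simultaneously. Returns the one-to-one mapping
    (a hypothesis label for each reference speaker) maximizing total overlap."""
    n = len(overlap)
    best, best_total = None, -1.0
    for perm in permutations(range(n)):
        total = sum(overlap[r][perm[r]] for r in range(n))
        if total > best_total:
            best, best_total = perm, total
    return best, best_total

# Toy example: reference speaker 0 is mostly labelled 'spkr2' by the
# system, and reference speaker 1 mostly 'spkr1'.
overlap = [[2.0, 10.0],
           [8.0, 1.0]]
mapping, matched = best_mapping(overlap)   # mapping == (1, 0), matched == 18.0
total_speech = 21.0                        # total attributed reference time
speaker_error = (total_speech - matched) / total_speech
```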
Speaker diarization performance is then expressed in terms of the miss (speaker in
reference but not in hypothesis), false alarm (speaker in hypothesis but not in
reference), and speaker-error (mapped reference speaker is not the same as the
hypothesized speaker) rates. The overall diarization error (DER) is the sum of these
three components. A complete description of the evaluation measure and scoring
software implementing it can be found at http://nist.gov/speech/tests/rt/rt2004/fall.
It should be noted that this measure is time-weighted, so the DER is primarily
driven by (relatively few) loquacious speakers and it is therefore more important to get
the main speakers complete and correct than to accurately find speakers who do not
speak much.
From the BEA database, 12 conversations were selected randomly for evaluation. The total duration of these conversations was 2.8 hours, containing 480 speaker changes.
As a baseline, we built a standard BIC-based speaker segmentation system to compare with the
proposed system. The standard BIC-based speaker segmentation used standard MFCCs, and the λ parameter value was 0. In this standard system we did not use any speech detection or overlap detection algorithm. Using the standard BIC-based speaker segmentation, the best DER (Diarization Error Rate) value was 39.43%.
To improve this result we used MFC for the spectral subband between 2.5 and 3.5
kHz and energy, along with deltas, in the standard BIC-based speaker segmentation. These features were mean and variance normalized. In Chapter 5 we demonstrated that using
MFC(2.5–3.5) features the results could be improved. The results showed that when the BIC-based segmentation included MFC(2.5–3.5), the proposed method achieved a DER reduction of about 0.87% absolute (from 39.43% to 38.56%), which is a statistically significant improvement (Wilcoxon test: Z=-2.824; p=0.005).
The performance of BIC-based speaker segmentation depends on the penalty factor
λ. We tested our speaker segmentation system using various values for λ (from 0 to 4).
The best DER was obtained with a penalty factor of 1; in this case the DER was 35.73%.
The use of a speech/non-speech detector in speaker diarization is important to ensure that the acoustic models for each of the clusters correctly represent the speech data and are not “contaminated” by non-speech information. Adding the speech/non-speech detector proposed for spontaneous speech not only improves the non-speech errors but also reduces the speaker error, due to a reduction in clustering errors. The results showed that when the BIC-based segmentation included VAD, the proposed method achieved a DER reduction of about 4.5% absolute (from 35.73% to 31.21%), which is a statistically significant improvement (Wilcoxon test: Z=-3.059; p<0.001).
State-of-the-art speaker diarization systems for meetings are now at a point where
overlapping speech contributes significantly to the errors made by the system.
However, little if any work had yet been done on detecting overlapping speech. We presented our initial work toward developing an overlap detection system for improved speaker diarization. We demonstrated a DER improvement of about 2.5% absolute over the baseline diarization system (from 31.21% to 28.71%), which is statistically significant (Wilcoxon test: Z=-3.06; p=0.002).
In this chapter we presented the structure and the performance of the baseline
BIC-based speaker diarization system and of our system. Chapter 5 showed that the spectral subband ranging from 2500 Hz to 3500 Hz carries speaker-specific information. We exploited this for speaker change detection (SCD) by representing the subband using three MFCCs. We confirmed experimentally that this spectral subband can play an important role not only in speaker recognition but in speaker diarization as well. We presented how the optimal value of λ in the BIC can be determined. We proved experimentally that implementing the VAD algorithm in the speaker diarization system improves the results. Experimental results revealed that adding our overlapping speech detection method to the speaker diarization system reduces the diarization error rate by almost 2.5%.
Overall, in this thesis the best result was yielded by the BIC-based method where
the penalty value was 1, the features were MFC(2.5–3.5), and the system contained the VAD and overlap detection algorithms as well. In this case the average DER value was 28.71%.
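The ΔBIC comparison at the core of the segmentation stage can be sketched in one dimension (an illustration only: the actual system models multidimensional MFC features with full-covariance Gaussians, so the penalty there counts the corresponding number of free parameters; the default λ=1 below mirrors the best penalty factor reported above):

```python
import math

def var(x):
    """Population variance, floored to avoid log(0) on constant segments."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x) or 1e-10

def delta_bic(x, i, lam=1.0):
    """ΔBIC for splitting the 1-D feature sequence x at index i.
    Positive values favour hypothesizing a speaker change at i."""
    n, n1, n2 = len(x), i, len(x) - i
    x1, x2 = x[:i], x[i:]
    gain = 0.5 * (n * math.log(var(x))
                  - n1 * math.log(var(x1))
                  - n2 * math.log(var(x2)))
    # d = 1: one mean + one variance parameter per Gaussian
    penalty = lam * 0.5 * (1 + 1) * math.log(n)
    return gain - penalty
```

Two clearly different segments yield a large positive ΔBIC, while splitting a homogeneous sequence yields a negative value (only the penalty remains), which is exactly the hidden threshold role of λ discussed above.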
5. CONCLUSIONS
This PhD thesis addressed the topic of speaker diarization for spontaneous
conversations. While answering the question “Who spoke when?”, the presented
speaker diarization system is able to process spontaneous speech and determine the
optimum output without any prior knowledge about the number of speakers or their
identities. Our speaker diarization system is based on unsupervised learning methods, so it can easily be adapted to other speech corpora.
The presented BIC-based system uses as its baseline the speaker diarization technology developed for broadcast news and adapts it to spontaneous speech by developing new algorithms and improving existing ones: it uses speaker-specific features and implements VAD and overlapping speech detection algorithms.
The results of this thesis consist of two main parts. In the first, we described three experiments: i) adapting and modifying voice activity detection; ii) developing the proposed overlap detection; iii) evaluating speaker-specific features. In the second part, we confirmed experimentally that adding the algorithms presented in the first part to the baseline BIC-based speaker diarization system improves the results.
In the topic of discourse modeling, speaker diarization could benefit from research
aiming at modeling the turn-taking between the speakers. Using information at a
higher level than simple acoustics, the transition probabilities between speakers could
be appropriately set to help the decoding.
6. REFERENCES
Bata Sarolta 2009. Beszélőváltások a beszédpartnerek személyes kapcsolatának függvényében. In: Beszédkutatás 2009. 107–120.
Beattie, G. W. 1982. Turn-taking and interruption in political interviews: Margaret Thatcher and Jim Callaghan compared and contrasted. Semiotica 39: 93–114.
Beke András 2008. Az alapfrekvencia-eloszlás modellezése a beszélőfelismeréshez. Alkalmazott Nyelvtudomány 2008/1–2: 121–132.
Boakye, K. 2008. Audio Segmentation for Meetings Speech Processing. Ph.D. dissertation, University of California at Berkeley.
Boakye, K. – Vinyals, O. – Friedland, G. 2008a. Two's a crowd: Improving speaker diarization by automatically identifying and excluding overlapped speech. In: Proceedings of Interspeech 2008. 32–35.
Boakye, K. – Trueba-Hornero, B. – Vinyals, O. – Friedland, G. 2008b. Overlapped speech detection for improved speaker diarization in multiparty meetings. In: Proceedings of ICASSP 2008. 4353–4356.
Boakye, K. – Trueba-Hornero, B. – Vinyals, O. – Friedland, G. 2008c. Overlapped speech detection for improved speaker diarization in multiparty meetings. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Las Vegas, Nevada. 4353–4356.
Boakye, K. – Vinyals, O. – Friedland, G. 2011. Improved overlapped speech handling for speaker diarization. In: Proceedings of INTERSPEECH 2011. Florence, Italy. 941–944.
Bőhm Tamás 2006. A glottalizáció szerepe a beszélő személy felismerésében. Beszédkutatás 2006. 197–207.
Campbell, J. P. 1997. Speaker recognition: A tutorial. Proceedings of the IEEE, Vol. 85, No. 9. 1437–1462.
Çetin, Ö. – Shriberg, E. 2006. Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: insights for automatic speech recognition. In: Proceedings of INTERSPEECH 2006. 293–296.
Cettolo, M. – Vescovi, M. 2003. Efficient audio segmentation algorithms based on the
BIC. In: Proceedings of IEEE International Conference on Acoustics, Speech and
Signal Processing.
Chen, S. S. – Gopalakrishnan, P. 1998. Clustering via the Bayesian information criterion with applications in speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Seattle, USA. 645–648.
Chang, Chih-Chung – Lin, Chih-Jen. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology. Retrieved: 5 June 2013. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Cho, Y. D. – Kondoz, A. 2001. Analysis and improvement of a statistical model-based voice activity detector. IEEE Signal Processing Letters, vol. 8, no. 10. 276–278.
Ellis, Daniel P. W. 2005. PLP and RASTA and MFCC, and inversion in Matlab. http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/
Delacourt, P. – Kryze, D. – Wellekens, C. J. 1999. Detection of speaker changes in an audio document. In: Proceedings of Eurospeech 1999. 1195–1198.
Delacourt, P. – Wellekens, C. J. 2000. DISTBIC: A speaker-based segmentation for audio data indexing. Speech Communication: Special Issue on Accessing Information in Spoken Audio 32: 111–126.
Fék Márk 1997. Beszélőfelismerés neurális hálózatokkal és vektorkvantálóval. OTDK konferencia. Szeged, 1997.
Fiscus, J. G. – Garofolo, J. S. – Le, A. – Martin, A. F. – Pallett, D. S. – Przybocki, M. A. – Sanders, G. 2004. Results of the fall 2004 STT and MDE evaluation. In: Proceedings of the Fall 2004 Rich Transcription Workshop (RT-04). Palisades, NY, Nov. 2004.
Furui, S. 1986. Research on individuality features in speech waves and automatic
speaker recognition techniques. Speech Communication, vol. 5, 183–197. Garfinkel, H. 1967. Studies in Ethnomethodology. Prentice Hall, Englewood Cliffs,
NJ. Giannakopoulos, T. 2009. Study and application of acoustic information for the
detection of harmful content, and fusion with visual information. Department of
Informatics and Telecommunications,University of Athens, Greece, PhD theis. Goffman, E. 1983. The Interaction Order. American Sociological Review 48: 1–17. Gósy Mária 2012. Multifunkcionális beszélt nyelvi adatbázis – BEA. In Prószéky
Gábor – Váradi Tamás (szerk.): Általános Nyelvészeti Tanulmányok XXIV.
Nyelvtechnológiai kutatások. Akadémiai Kiadó, Budapest, 329–349. Gráczi Tekla Etelka – Bata Sarolta 2010. Megszólalási formák és funkciók az
összeszokottság függvényében. In: Gecső Tamás – Sárdi Csilla (szerk.) Új
módszerek az alkalmazott nyelvészeti kutatásban. Kodolányi János Főiskola, Tinta
Könyvkiadó, Székesfehérvár, Budapest. 28–32. Higgins, A. L. – Bahler, L. – Porter, J. 1991. Speaker verification using randomized
phrase prompting. Digital Signal Processing 1/2: 89–106. Hung, J. – Wang, H. – Lee, L. 2000. Automatic metric based speech segmentation for
broadcast news via principal component analysis. In: Proceedings of the
International Conference on Speech and Language Processing, Beijing, China.
Ida, O. 2011. Indexing of Audio Databases: Event Log of Broadcast News. PhD thesis. Norwegian University of Science and Technology, Department of Electronics and Telecommunications.
Iványi Zsuzsanna 2001. A nyelvészeti konverzációelemzés. Magyar Nyelvőr 125: 74–93.
Jin, Q. – Laskowski, K. – Schultz, T. – Waibel, A. 2004. Speaker segmentation and clustering in meetings. In: Proceedings of the NIST 2004 Spring Rich Transcription Evaluation Workshop. Montreal, Canada. 112–117.
Markó Alexandra 2006. Beszélőváltások a társalgásban. http://fonetika.nytud.hu/letolt/ma_2.pdf (Retrieved: 1 October 2011.)
Matsui, T. – Furui, S. 1995. Likelihood normalization for speaker verification using a phoneme- and speaker-independent model. Speech Communication 17: 109–116.
Matza, A. – Bistritz, Y. 2011. Skew Gaussian mixture models for speaker recognition. In: Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011). 28–31.
Mohamed, A. – Hinton, G. – Penn, G. 2012. Understanding how deep belief networks perform acoustic modelling. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. 4273–4276.
Mori, K. – Nakagawa, S. 2001. Speaker change detection and speaker clustering using VQ distortion for broadcast news speech recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. Salt Lake City, USA. 413–416.
Németh Géza – Olaszy Gábor (szerk.) 2010. A magyar beszéd. Beszédkutatás, beszédtechnológia, beszédinformációs rendszerek (8–12. fejezet). Akadémiai Kiadó, Budapest.
Nikléczy Péter – Gósy Mária 2008. A személyazonosítás lehetősége a beszédanyag időtartamának függvényében. Beszédkutatás 2008. 172–181.
Nikléczy Péter 2003. A zönge periódusidejének funkciója a hangszínezetben. Beszédkutatás 2003. 101–113.
Parthasarathi, S. H. K. – Bourlard, H. – Gatica-Perez, D. 2013. Wordless Sounds: Robust Speaker Diarization Using Privacy-Preserving Audio Representations. IEEE Transactions on Audio, Speech & Language Processing 21(1): 83–96.
Reynolds, D. A. – Quatieri, T. F. – Dunn, R. 2000. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10/1–3: 19–41.
Reynolds, D. A. 1995. Speaker identification and verification using Gaussian mixture speaker models. Speech Communication 17: 91–108.
Reynolds, D. A. 1997. Comparison of background normalization methods for text-independent speaker verification. In: Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech). 963–966.
Rosenberg, A. E. – DeLong, J. – Lee, C.-H. – Juang, B.-H. – Soong, F. K. 1992. The use of cohort normalized scores for speaker verification. In: Proceedings of the International Conference on Spoken Language Processing. 599–602.
Sacks, H. – Schegloff, E. A. – Jefferson, G. 1974. A simplest systematics for the organization of turn-taking for conversation. Language 50: 696–735.
Sacks, H. 1992. Lectures on Conversation. Blackwell, Oxford.
Schegloff, E. 1992. Introduction. In: Sacks, H. Lectures on Conversation. Vol. 1. Blackwell, Oxford. 9–12.
Schwarz, G. 1971. A sequential student test. The Annals of Mathematical Statistics 42/3: 1003–1009.
Cheng, S.-S. – Wang, H.-M. – Fu, H.-C. 2010. BIC-Based Speaker Segmentation Using Divide-and-Conquer Strategies With Application to Speaker Diarization. IEEE Transactions on Audio, Speech, and Language Processing 18(1): 141–157.
Siegler, M. A. – Jain, U. – Raj, B. – Stern, R. M. 1997. Automatic segmentation, classification and clustering of broadcast news audio. In: Proceedings of the DARPA Speech Recognition Workshop. 97–99.
Sivakumaran, P. – Fortuna, J. – Ariyaeeinia, A. 2001. On the use of the Bayesian information criterion in multiple speaker detection. In: Proceedings of Eurospeech 2001. Scandinavia.
Stokoe, E. 2006. On ethnomethodology, feminism, and the analysis of categorial reference to gender in talk-in-interaction. Sociological Review 54: 467–494.
Tritschler, A. – Gopinath, R. 1999. Improved speaker segmentation and segments clustering using the Bayesian information criterion. In: Proceedings of Eurospeech 1999. 679–682.
Trueba-Hornero, B. 2008. Handling overlapped speech in speaker diarization. Master's thesis, Universitat Politècnica de Catalunya, May 2008.
Vandecatseye, A. – Martens, J.-P. 2003. A fast, accurate and stream-based speaker segmentation and clustering algorithm. In: Proceedings of Eurospeech 2003. Geneva, Switzerland. 941–944.
Vescovi, M. – Cettolo, M. – Rizzi, R. 2003. A DP algorithm for speaker change detection. In: Proceedings of Eurospeech 2003.
Wooters, C. – Huijbregts, M. 2007. The ICSI RT07s speaker diarization system. In: Proceedings of the Rich Transcription 2007 Meeting Recognition Evaluation Workshop. Baltimore, MD.
Ying, D. – Yan, Y. – Dang, J. – Soong, F. 2011. Voice Activity Detection Based On An Unsupervised Learning Framework. IEEE Transactions on Audio, Speech and Language Processing 19/8: 2624–2633.
7. PUBLICATIONS BY THE CANDIDATE ON THE TOPIC OF
THE THESIS
Beke A, Gósy M 2014. Phonetic analysis and automatic prediction of vowel duration
in Hungarian spontaneous speech. INTERNATIONAL JOURNAL OF
INTELLIGENT DECISION TECHNOLOGIES 10: 57–66.
Váradi V, Beke A 2013. Az artikulációs tempó variabilitása a felolvasásban.
BESZÉDKUTATÁS 21: 26–42.
Szaszák Gy, Beke A. 2013. Using phonological phrase segmentation to improve
automatic keyword spotting for the highly agglutinating Hungarian language. In:
14th Annual Conference of the International Speech Communication Association.
Lyon, France, 2013.08.25–2013.08.29.
Neuberger T, Beke A 2013. Automatic laughter detection in spontaneous speech using
GMM-SVM method. In: Habernal I, Matousek V (szerk.) Text, Speech, and
Dialogue: 16th International conference, TSD 2013, Pilsen, Czech Republic,
September 1–5, 2013. Proceedings. Berlin, Heidelberg: Springer Verlag, 2013.
113–120.
Gósy M, Bóna J, Beke A, Horváth V 2013. A kitöltött szünetek fonetikai sajátosságai
az életkor függvényében. BESZÉDKUTATÁS 21: 121–143.
Beke A, Szaszák Gy, Váradi V 2013. Automatic phrase segmentation and clustering in
spontaneous speech. In: IEEE 4th International Conference on Cognitive
Infocommunications, CogInfoCom 2013, December 2–5, 2013.
Szaszák Gy, Beke A 2012. Exploiting Prosody for Syntactic Analysis in Automatic
Speech Understanding. Journal of Language Modelling 0(1): 143–172.
Szaszák Gy, Beke A 2012. Statisztikai módszerek alkalmazása a szintaktikai szerkezet
és a beszédjel prozódiai szerkezetének feltérképezéséhez olvasott és spontán
beszédben. In: Gósy M (szerk.) Beszéd, adatbázis, kutatások. Budapest:
Akadémiai Kiadó, 2012. 236–250.
Szaszák Gy, Beke A 2012. Automatic prosodic and syntactic analysis from speech in
Cognitive Infocommunication. In: IEEE (szerk.) 3rd IEEE International
Conference on Cognitive Infocommunications. CogInfoCom 2012. Proceedings.
Kosice, Slovakia, 2012.12.02–2012.12.05.
Gósy M, Gyarmathy D, Horváth V, Gráczi TE, Beke A, Neuberger T, Nikléczy P
2012. BEA: Beszélt nyelvi adatbázis. In: Gósy M (szerk.) Beszéd, adatbázis,
kutatások. Budapest: Akadémiai Kiadó, 2012. 9–24.
Beke A, Szaszák Gy 2012. Unsupervised clustering of prosodic patterns in
spontaneous speech. In: Sojka P, Horák A, Kopeček I, Pala K (szerk.) Text, Speech
and Dialogue: 15th International Conference, TSD 2012, Brno, Czech Republic,
September 3–7, 2012. Proceedings. Berlin: Springer, 2012. 648–655.
Beke A, Gósy M, Horváth V 2012. Gyakorisági vizsgálatok spontán beszédben.
BESZÉDKUTATÁS 20: 260–277.
Beke A, Gósy M 2012. Characteristic and spectral features used in automatic
prediction of vowel duration in spontaneous speech. In: IEEE (szerk.) 3rd IEEE
International Conference on Cognitive Infocommunications. CogInfoCom 2012.
Proceedings. Kosice, Slovakia, 2012.12.02–2012.12.05.
Beke A 2012. Beszélőfelismerés kevert Gauss-modellekkel. In: Markó Alexandra
(szerk.) Beszédtudomány: Az anyanyelv-elsajátítástól a zöngekezdési időig.
Budapest: ELTE és MTA Nyelvtudományi Intézete, 2012. 335–352.
Beke A 2012. Beszéddetektálás spontán beszédben a beszélőváltás-detektáláshoz. In:
Váradi T (szerk.) VI. Alkalmazott Nyelvészeti Doktoranduszkonferencia:
Budapest, 2012. 02. 03. Budapest: MTA Nyelvtudományi Intézet, 2012. 14–23.
Beke A 2012. Az egyszerre beszélések automatikus osztályozása spontán magyar
társalgásokban. In: Bárdosi Vilmos (szerk.) Tanulmányok: Nyelvtudományi
Doktori Iskola, Budapest: ELTE BTK, 2012. 23–39.
Szaszák Gy, Nagy K, Beke A 2011. Analysing the correspondence between automatic
prosodic segmentation and syntactic structure. In: Piero Cosi, Renato De Mori,
Giuseppe Di Fabbrizio, Roberto Pieraccini (szerk.) Interspeech 2011, 12th Annual
Conference of the International Speech Communication Association. Florence,
Italy, 2011.08.27–2011.08.31.
Gósy M, Beke A, Horváth V 2011. Temporális variabilitás a spontán beszédben.
BESZÉDKUTATÁS 19: 5–30.
Beke A 2011. Szókezdetek automatikus osztályozása spontán beszédben. MAGYAR
NYELVŐR 135: 226–241.
Beke A, Szaszák Gy 2010. Szótagok automatikus osztályozása spontán beszédben
spektrális és prozódiai jellemzők alapján. In: Tanács Attila, Vincze Veronika
(szerk.) VII. Magyar Számítógépes Nyelvészeti Konferencia: MSZNY 2010,
Szeged, Szegedi Tudományegyetem, 2010. 236–249.
Beke A, Szaszák Gy 2010. Kísérlet a szintaktikai szerkezet részleges automatikus
feltérképezésére a prozódiai szerkezet alapján. In: Tanács Attila, Vincze Veronika
(szerk.) VII. Magyar Számítógépes Nyelvészeti Konferencia: MSZNY 2010,
Szeged, Szegedi Tudományegyetem, 2010. 178–190.
Beke A, Szaszák Gy 2010. Automatic recognition of schwa variants in spontaneous
Hungarian speech. ACTA LINGUISTICA HUNGARICA 57:(2–3) 329–353.
Beke A, Szaszák Gy 2009. A svávariációk automatikus felismerése magyar nyelvű
spontán beszédben. BESZÉDKUTATÁS 17: 148–169.
Beke A 2009. A beszélő hangtartományának vizsgálata: Néhány statisztikai jellemző
az alapfrekvencia-eloszlásról. In: Keszler Borbála, Tátrai Szilárd (szerk.)
Diskurzus a grammatikában – grammatika a diskurzusban. Budapest: Tinta
Könyvkiadó, 2009. 83–91.
Beke A 2008. Az alapfrekvencia-eloszlás modellezése a beszélőfelismeréshez.
ALKALMAZOTT NYELVTUDOMÁNY 8:(1–2) 121–133.
Beke A 2008. A felolvasás és a spontán beszéd alaphangszerkezeteinek vizsgálata.
BESZÉDKUTATÁS 16: 93–107.
8. TALKS GIVEN BY THE CANDIDATE ON THE TOPIC OF
THE THESIS
April 2007. A kérdés–válasz prozódiája számítógépes vizsgálattal. OTDK,
Székesfehérvár.
April 2007. A kérdés–válasz dallamszerkezetének fonetikai vizsgálata magyar nyelvű
társalgásokban. FÉLÚTON Konferencia, Budapest.
April 2008. A beszélő személy felismerése: az automatikus formánslekérdezés
eredményei. FÉLÚTON Konferencia, Budapest.
November 2008. A beszélő hangtartományának vizsgálata (néhány statisztikai
jellemző az alapfrekvencia-eloszlásról). Diskurzus a grammatikában –
grammatika a diskurzusban (Új nézőpontok a magyar nyelv leírásában 2.)
Konferencia, Budapest, 2008. nov. 11–12.
November 2009. A svávariációk automatikus felismerése magyar nyelvű spontán
beszédben. Beszédkutatás 2009, Budapest. (with: Szaszák György)
November 2010. Temporális variabilitás a spontán beszédben. Kultúra és nyelv,
kulturális nyelvészet – Új nézőpontok a magyar nyelv leírásában 3. ELTE BTK:
Budapest (with: Horváth Viktória and Gósy Mária).
December 2010. Szótagok automatikus osztályozása spontán beszédben spektrális és
prozódiai jellemzők alapján. VII. Magyar Számítógépes Nyelvészeti Konferencia
(with: Szaszák György)
May 2011. Figyi, ki beszél most? A beszélők automatikus osztályozása a spontán
társalgásokban. XIII. Balatonalmádi Pszicholingvisztikai Nyári Egyetem.
May 2011. A hezitációs jelenségek gépi osztályozása a spontán beszédben. XIII.
Balatonalmádi Pszicholingvisztikai Nyári Egyetem. (with: Horváth Viktória)
August 2011. Analysing the correspondence between automatic prosodic
segmentation and syntactic structure. Interspeech 2011, Florence, Italy (with:
Szaszák György)
October 2011. Gyakorisági mutatók a spontán beszédben. Beszédkutatás konferencia.
(with: Gósy Mária and Horváth Viktória)
October 2011. Kísérlet a szintaktikai szerkezet részleges automatikus feltérképezésére
a prozódiai szerkezet alapján. Beszédkutatás konferencia. (with: Szaszák György)
October 2011. Az ismétlések automatikus osztályozása a spontán beszédben.
Beszédkutatás konferencia. (with: Gyarmathy Dorottya)
November 2011. Beszélők szegmentálása és osztályozása társalgásban. A Magyar
Tudomány Ünnepe 2011 Beszédadatbázisok a kutatásban és az alkalmazásban.
MTA NYTUD. Budapest.
December 2011. A szintaktikai szerkezet automatikus feltérképezése a beszédjel
prozódiai elemzése alapján. VIII. Magyar Számítógépes Nyelvészeti Konferencia
(with: Szaszák György)
March 2012. Toward Exploring the Prosodic Structure of Spontaneous Speech by
Focusing on Automatic Modelling, IAST Workshop, Dublin (with: Szaszák
György)
April 2012. A beszélő személy gépi felismerése. Fonetikanap, ELTE BTK, Budapest.
September 2012. Unsupervised Clustering of Prosodic Patterns in Spontaneous
Speech. TSD, Brno, (with: Szaszák György)
September 2012. A szintaxis és a prozódia kapcsolata, BEA Workshop, MTA
Nyelvtudományi Intézet, Budapest (with: Szaszák György)
December 2012. Automatic prosodic and syntactic analysis from speech in Cognitive
Infocommunication, CogInfoCom conference, Košice (with: Szaszák György)
December 2012. Characteristic and spectral features used in automatic prediction of
vowel duration in spontaneous speech, CogInfoCom conference, Košice (with:
Gósy Mária)
March 2013. Automatic identification of discourse markers in spontaneous speech for
speaker diarization. SJUSK 2012. Copenhagen
March 2013. Automatic laughter detection in Hungarian spontaneous speech using
GMM/ANN hybrid method. SJUSK 2012. Copenhagen (with: Neuberger Tilda)
March 2013. Automatic classification of repeated words in Hungarian spontaneous
speech. ExAPP 2013. Copenhagen (with: Gyarmathy Dorottya)
August 2013. Using Phonological Phrase Segmentation to Improve Automatic
Keyword Spotting for the Highly Agglutinating Hungarian Language.
INTERSPEECH 2013, Lyon, France (with: Szaszák György)
December 2013. Temporal variability in spontaneous Hungarian speech. 6th Language
& Technology Conference: Human Language Technologies as a Challenge for
Computer Science and Linguistics (with: Gósy Mária, Horváth Viktória)
September 2013. Automatic Laughter Detection in Spontaneous Speech Using GMM–
SVM Method. Text, Speech, and Dialogue – 16th International Conference, TSD
2013, Pilsen, Czech Republic (with: Neuberger Tilda)
September 2013. A Logistic Regression Approach for the Improvement of Keyword
Spotting based on Phonological Phrasing. Text, Speech, and Dialogue – 16th
International Conference, TSD 2013, Pilsen, Czech Republic (with: Szaszák
György)