TRANSCRIPT
Predicting Voice Elicited Emotions
Nishant Pandey
Synopsis
• Problem statement and motivation
• Previous work and background
• System
• Intuition and overview
• Pre-processing of audio signals
• Building the feature space
• Finding patterns in unlabelled data and labelling of samples
• Regression results
• Deployed system
• Market research
Motivation
• Automate the screening process in service-based industries
• Hourly job workers (two-thirds of the U.S. labour force, or ~50 million job seekers every year)
Problem Statement
• Analyse voice and predict the listener emotions elicited by the paralinguistic elements of the voice.
Previous work
Current work focuses on predicting the emotions elicited by voice clips.
Two sets of goals, which include recognizing:
• the type of personality traits intrinsically possessed by the speaker, e.g. speaker trait and speaker state
• the types of emotions carried within the speech clip, e.g. acoustic affect (cheerful, trustworthy, deceitful, etc.)
Background – Emotion Taxonomy
The framework articulated by "FEELTRACE":
• Includes all the emotion responses we want to predict.
• Represents emotions by finite, quantifiable dimensions.
Features – Paralinguistic Features of Voice

Concept | Definition | Data Representation
Amplitude | measurement of the variations over time of the acoustic signal | quantified values of a sound wave's oscillation
Energy | acoustic signal energy | representation in decibels: 20*log10(abs(FFT))
Formants | the resonance frequencies of the vocal tract | maxima detected using Linear Prediction on audio windows with high tonal content
Perceived pitch | perceived fundamental frequency and harmonics | formants
Fundamental frequency | the reciprocal of the time duration of one glottal cycle (a strict definition of "pitch") | first formant
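The energy row's formula, 20*log10(abs(FFT)), can be sketched in Python. NumPy is assumed; the frame size and the 440 Hz test tone are illustrative choices, not values from the talk:

```python
import numpy as np

def frame_energy_db(frame: np.ndarray) -> np.ndarray:
    """Per-bin energy of one audio frame in decibels: 20*log10(|FFT|)."""
    spectrum = np.abs(np.fft.rfft(frame))
    # Floor at a tiny value to avoid log(0) on silent bins.
    return 20.0 * np.log10(np.maximum(spectrum, 1e-12))

# Example: one 1024-sample frame of a 440 Hz tone at a 16 kHz sample rate.
sr = 16000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 440 * t)
energy_db = frame_energy_db(frame)
print(energy_db.shape)  # one dB value per rfft frequency bin
```

The loudest bin should fall near 440 Hz, since bin k of a 1024-point FFT corresponds to k * sr / 1024 Hz.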
System – Intuition
Spectrogram of two job applicants responding to “Greet me as if I am a customer”
System – Overview
System – Pre-Processing of Audio Signals
Pre-processing tasks involve:
• Removing voice clips shorter than 2 seconds or containing noise
• Converting the audio signal to data in the time and frequency domains
• Short-time Fast Fourier Transform per frame
• Energy measures in the frequency domain per frame
• Linear prediction coefficients in the frequency domain per frame
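A minimal sketch of the per-frame pipeline above, assuming NumPy; the frame length, hop size, and Hann window are illustrative choices the talk does not specify:

```python
import numpy as np

def stft_frames(signal: np.ndarray, frame_len: int = 512, hop: int = 256):
    """Slice a mono signal into overlapping windowed frames, then take a
    short-time FFT and a per-frame energy measure in dB."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)                # FFT per frame
    energy_db = 20 * np.log10(np.abs(spectra) + 1e-12)   # energy per frame, dB
    return spectra, energy_db

sr = 16000
signal = np.random.default_rng(0).standard_normal(sr * 2)  # 2 s of noise
assert len(signal) / sr >= 2.0   # clips under 2 seconds would be dropped
spectra, energy_db = stft_frames(signal)
print(spectra.shape, energy_db.shape)
```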
System – Feature Space Construction
We experimented with feature construction based on the following dimensions and combinations:
• Signal measurements such as energy and amplitude
• Statistics such as min, max, mean, and standard deviation on signal measurements
• Measurement window in the time domain: different time sizes and the entire time window
• Measurement window in the frequency domain: all frequencies, optimal audible frequencies, and selected frequency ranges
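The statistics dimension above can be sketched as a small helper that collapses a per-frame measurement into clip-level features; the energy values here are hypothetical:

```python
import numpy as np

def summarize(per_frame: np.ndarray) -> np.ndarray:
    """Collapse a per-frame signal measurement (e.g. energy) into
    clip-level statistics: min, max, mean, standard deviation."""
    return np.array([per_frame.min(), per_frame.max(),
                     per_frame.mean(), per_frame.std()])

# Hypothetical per-frame energy values for one clip.
energy = np.array([3.0, 5.0, 4.0, 6.0])
features = summarize(energy)
print(features)  # [min, max, mean, std]
```

Applying the same summary over different time windows or restricted frequency ranges yields the combinations the slide describes.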
System – Labels and the Right Set of Features?
• Conventional approach – getting voice samples rated by experts
• Unsupervised learning – analysing features and their effectiveness

Process:
1. Unsupervised learning is used to find patterns in unlabelled data.
2. Training data sets are then constructed based on clustering results and manual labelling.
System – How do we get the labels? Contd.
Parameters
• Cost function:
  • Connectivity
  • Dunn index
  • Silhouette

Clustering results
• Technique: hierarchical clustering
• Number of clusters: 5
• Manual validation of clusters was also done
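A minimal sketch of the hierarchical clustering step, assuming SciPy. The talk cut the tree into 5 clusters on real voice features; here two well-separated synthetic feature vectors and 2 clusters keep the example small:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated synthetic groups standing in for clip feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 4)),
               rng.normal(5, 0.1, (10, 4))])

Z = linkage(X, method="ward")                     # agglomerative hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")   # cut tree into 2 clusters
print(sorted(set(labels.tolist())))  # two cluster ids
```

In practice the number of clusters would be chosen by validity indices like the Dunn index or silhouette, alongside the manual validation the slide mentions.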
System – Visualization of clusters
System – Modelling
Supervised learning algorithms:
• Logistic Regression
• Support Vector Machine
• Random Forest

Semi-supervised learning algorithm:
• KODAMA

Output:
• Binary outcome (positive or negative)
• Numerical scores
Case Study – Modelling
• Prediction – positive vs. negative response
• A positive response could be one or multiple perceptions of a "pleasant voice", "makes me feel good", "cares about me", "makes me feel comfortable", or "makes me feel engaged".
• System: V1 uses SVM, V2 uses Random Forest
• Interview prompt: "Greet me as if I am a customer"
System - Prediction Results
• Accuracy: 0.86
• 95% CI: (0.76, 0.92)
• P-value [Acc > NIR]: 5.76e-07
• Sensitivity: 0.81
• Specificity: 0.88
• Pos Pred Value: 0.81
• Neg Pred Value: 0.88
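These metrics all derive from the binary confusion matrix. The helper below shows the standard definitions; the counts are hypothetical values chosen only to roughly reproduce the reported numbers, not the study's actual confusion matrix:

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard confusion-matrix metrics for a binary classifier."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv":         tp / (tp + fp),   # positive predictive value
        "npv":         tn / (tn + fn),   # negative predictive value
    }

# Hypothetical counts (illustrative only).
m = binary_metrics(tp=29, fp=7, tn=51, fn=7)
print({k: round(v, 2) for k, v in m.items()})
```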
System - Prediction Results (KODAMA)
• KODAMA performs feature extraction from noisy and high-dimensional data.
• The output of KODAMA includes a dissimilarity matrix, from which we can perform clustering and classification.
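KODAMA itself is distributed as an R package; the step it feeds, clustering from a precomputed dissimilarity matrix, can be sketched in Python with SciPy. The matrix values below are illustrative, not KODAMA output:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# A small symmetric dissimilarity matrix standing in for KODAMA's output.
D = np.array([[0.0,  0.1,  0.9,  0.8],
              [0.1,  0.0,  0.85, 0.9],
              [0.9,  0.85, 0.0,  0.2],
              [0.8,  0.9,  0.2,  0.0]])

condensed = squareform(D)  # condensed distance vector scipy expects
labels = fcluster(linkage(condensed, method="average"),
                  t=2, criterion="maxclust")
print(labels.tolist())  # samples 1-2 group together, as do samples 3-4
```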
Deployed System
Market Research
• Demographics matter.
• Young listeners (18-29 years old) and listeners with income less than $29,000/year apply stricter criteria for what they perceive as engaging.
• No correlation between elicited emotion and age, ethnicity, or education level.
• Bias towards female voices.
Thanks
Time and Frequency Domain
• Time Domain: https://en.wikipedia.org/wiki/Time_domain#/media/File:Fourier_transform_time_and_frequency_domains_(small).gif
• Frequency Domain: https://en.wikipedia.org/wiki/Frequency_domain#/media/File:Fourier_transform_time_and_frequency_domains_(small).gif
Learnings – Difference in Voice Characteristics
• Results improve by 10% when a decision tree built on voice-characteristic features is layered on top of the Random Forest.
Prediction Results – SVM vs Random Forest