TRANSCRIPT
Predicting Voice Elicited Emotions
Nishant Pandey
Synopsis
• Problem statement and motivation
• Previous work and background
• System
• Intuition and overview
• Pre-processing of audio signals
• Building the feature space
• Finding patterns in unlabelled data and labelling of samples
• Regression results
• Deployed system
• Market research
Motivation
• Automate the screening process in service-based industries
• Hourly job workers (two-thirds of the U.S. labour force, or ~50 million job seekers every year)
Problem Statement
• Analyse voice and predict the listener emotions elicited by the paralinguistic elements of the voice.
Previous work
Current work focuses on predicting the emotions elicited by voice clips.
Two sets of goals, which include recognizing:
• the type of personality traits intrinsically possessed by the speaker, e.g. speaker trait and speaker state
• the types of emotions carried within the speech clip, e.g. acoustic affect (cheerful, trustworthy, deceitful, etc.)
Background – Emotion Taxonomy
The framework articulated by "FEELTRACE":
• Includes all the emotion responses we want to predict.
• Represents emotions by finite, quantifiable dimensions.
Features – Paralinguistic Features of Voice

Concept | Definition | Data Representation
Amplitude | measurement of the variations over time of the acoustic signal | quantified values of a sound wave's oscillation
Energy | acoustic signal energy | representation in decibels: 20*log10(abs(FFT))
Formants | the resonance frequencies of the vocal tract | maxima detected using Linear Prediction on audio windows with high tonal content
Perceived pitch | perceived fundamental frequency and harmonics | formants
Fundamental frequency | the reciprocal of the time duration of one glottal cycle (a strict definition of "pitch") | first formant
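The energy row's formula, 20*log10(abs(FFT)), can be sketched in Python. NumPy is assumed; the frame size and the 440 Hz test tone are illustrative choices, not values from the talk:

```python
import numpy as np

def frame_energy_db(frame: np.ndarray) -> np.ndarray:
    """Per-bin energy of one audio frame in decibels: 20*log10(|FFT|)."""
    spectrum = np.abs(np.fft.rfft(frame))
    # Floor at a tiny value to avoid log(0) on silent bins.
    return 20.0 * np.log10(np.maximum(spectrum, 1e-12))

# Example: one 1024-sample frame of a 440 Hz tone at a 16 kHz sample rate.
sr = 16000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 440 * t)
energy_db = frame_energy_db(frame)
print(energy_db.shape)  # one dB value per rfft frequency bin
```

The loudest bin should fall near 440 Hz, since bin k of a 1024-point FFT corresponds to k * sr / 1024 Hz.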
System – Intuition
Spectrogram of two job applicants responding to “Greet me as if I am a customer”
System – Overview
System – Pre-Processing of Audio Signals
Pre-processing tasks involve:
• Removing voice clips shorter than 2 seconds or containing noise
• Converting the audio signal to data in the time and frequency domains
• Short-time Fast Fourier Transform per frame
• Energy measures in the frequency domain per frame
• Linear prediction coefficients in the frequency domain per frame
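A minimal sketch of the per-frame pipeline above, assuming NumPy; the frame length, hop size, and Hann window are illustrative choices the talk does not specify:

```python
import numpy as np

def stft_frames(signal: np.ndarray, frame_len: int = 512, hop: int = 256):
    """Slice a mono signal into overlapping windowed frames, then take a
    short-time FFT and a per-frame energy measure in dB."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)                # FFT per frame
    energy_db = 20 * np.log10(np.abs(spectra) + 1e-12)   # energy per frame, dB
    return spectra, energy_db

sr = 16000
signal = np.random.default_rng(0).standard_normal(sr * 2)  # 2 s of noise
assert len(signal) / sr >= 2.0   # clips under 2 seconds would be dropped
spectra, energy_db = stft_frames(signal)
print(spectra.shape, energy_db.shape)
```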
System – Feature Space Construction
We experimented with feature construction based on the following dimensions and combinations:
• Signal measurements such as energy and amplitude
• Statistics such as min, max, mean, and standard deviation on signal measurements
• Measurement window in the time domain: different time sizes and the entire time window
• Measurement window in the frequency domain: all frequencies, optimal audible frequencies, and selected frequency ranges
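The statistics dimension above can be sketched as a small helper that collapses a per-frame measurement into clip-level features; the energy values here are hypothetical:

```python
import numpy as np

def summarize(per_frame: np.ndarray) -> np.ndarray:
    """Collapse a per-frame signal measurement (e.g. energy) into
    clip-level statistics: min, max, mean, standard deviation."""
    return np.array([per_frame.min(), per_frame.max(),
                     per_frame.mean(), per_frame.std()])

# Hypothetical per-frame energy values for one clip.
energy = np.array([3.0, 5.0, 4.0, 6.0])
features = summarize(energy)
print(features)  # [min, max, mean, std]
```

Applying the same summary over different time windows or restricted frequency ranges yields the combinations the slide describes.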
System – Labels and the Right Set of Features?
• Conventional approach – getting voice samples rated by experts
• Unsupervised learning – analysing features and their effectiveness

Process:
1. Unsupervised learning is used to find patterns in unlabelled data.
2. Training data sets are then constructed based on clustering results and manual labelling.
System – How do we get the labels? Contd.
Parameters
• Cost function:
  • Connectivity
  • Dunn index
  • Silhouette

Clustering results
• Technique: hierarchical clustering
• Number of clusters: 5
• Manual validation of clusters was also done
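A minimal sketch of the hierarchical clustering step, assuming SciPy. The talk cut the tree into 5 clusters on real voice features; here two well-separated synthetic feature vectors and 2 clusters keep the example small:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated synthetic groups standing in for clip feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 4)),
               rng.normal(5, 0.1, (10, 4))])

Z = linkage(X, method="ward")                     # agglomerative hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")   # cut tree into 2 clusters
print(sorted(set(labels.tolist())))  # two cluster ids
```

In practice the number of clusters would be chosen by validity indices like the Dunn index or silhouette, alongside the manual validation the slide mentions.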
System – Visualization of clusters
System – Modelling
Supervised learning algorithms:
• Logistic Regression
• Support Vector Machine
• Random Forest

Semi-supervised learning algorithm:
• KODAMA

Output:
• Binary outcome (positive or negative)
• Numerical scores
Case Study – Modelling
• Prediction – positive vs. negative response
• A positive response could be one or multiple perceptions of a "pleasant voice", "makes me feel good", "cares about me", "makes me feel comfortable", or "makes me feel engaged".
• System: V1 uses SVM, V2 uses Random Forest
• Interview prompt: "Greet me as if I am a customer"
System - Prediction Results
• Accuracy: 0.86
• 95% CI: (0.76, 0.92)
• P-value [Acc > NIR]: 5.76e-07
• Sensitivity: 0.81
• Specificity: 0.88
• Pos Pred Value: 0.81
• Neg Pred Value: 0.88
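These metrics all derive from the binary confusion matrix. The helper below shows the standard definitions; the counts are hypothetical values chosen only to roughly reproduce the reported numbers, not the study's actual confusion matrix:

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard confusion-matrix metrics for a binary classifier."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv":         tp / (tp + fp),   # positive predictive value
        "npv":         tn / (tn + fn),   # negative predictive value
    }

# Hypothetical counts (illustrative only).
m = binary_metrics(tp=29, fp=7, tn=51, fn=7)
print({k: round(v, 2) for k, v in m.items()})
```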
System - Prediction Results (KODAMA)
• KODAMA performs feature extraction from noisy and high-dimensional data.
• The output of KODAMA includes a dissimilarity matrix, from which we can perform clustering and classification.
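KODAMA itself is distributed as an R package; the step it feeds, clustering from a precomputed dissimilarity matrix, can be sketched in Python with SciPy. The matrix values below are illustrative, not KODAMA output:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# A small symmetric dissimilarity matrix standing in for KODAMA's output.
D = np.array([[0.0,  0.1,  0.9,  0.8],
              [0.1,  0.0,  0.85, 0.9],
              [0.9,  0.85, 0.0,  0.2],
              [0.8,  0.9,  0.2,  0.0]])

condensed = squareform(D)  # condensed distance vector scipy expects
labels = fcluster(linkage(condensed, method="average"),
                  t=2, criterion="maxclust")
print(labels.tolist())  # samples 1-2 group together, as do samples 3-4
```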
Deployed System
Market Research
• Demographics matter.
• Young listeners (18-29 years old) and listeners with income less than $29,000/year apply stricter criteria for what they perceive as engaging.
• No correlation between elicited emotion and age, ethnicity, or education level.
• Bias towards female voices.
Thanks
Time and Frequency Domain
• Time Domain: https://en.wikipedia.org/wiki/Time_domain#/media/File:Fourier_transform_time_and_frequency_domains_(small).gif
• Frequency Domain: https://en.wikipedia.org/wiki/Frequency_domain#/media/File:Fourier_transform_time_and_frequency_domains_(small).gif
Learnings – Difference in Voice Characteristics
• Results improve by 10% when a decision tree built on voice-characteristic features is layered on top of the Random Forest.
Prediction Results – SVM vs Random Forest