Date(s) - 01/07/2013
11:00 am - 12:00 pm
People use multiple cues to interpret each other's affective state, including facial expressions, voice characteristics, and what is being said. However, in automatic affect prediction, each modality is often studied in isolation.
Here I present our work on multimodal prediction of continuous dimensions of affect, which won the word-level sub-challenge of the 2012 AVEC challenge. First, I will briefly discuss lexical representations for affect prediction, based both on domain-independent dictionaries of affect norms for words and on standard domain-dependent representations. Then I will introduce state-of-the-art representations for audio based on regions of interest. In these, the goal is to tease apart the linguistically salient variation in the voice, used to realize lexical and utterance accent, from the paralinguistic variation related to affect. These representations lead to significant improvements on spontaneous speech labelled with continuous dimensions of affect, as well as on acted-emotion datasets with categorical emotion labels.
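The dictionary-based lexical representation can be sketched roughly as follows. This is an illustrative simplification, not the talk's actual feature set: the lexicon entries, the valence/arousal scales, and the averaging scheme below are hypothetical stand-ins for an affect-norms resource that assigns per-word scores.

```python
# Toy affect-norm lexicon: word -> (valence, arousal), both in [-1, 1].
# The entries are invented for illustration only.
AFFECT_NORMS = {
    "happy": (0.9, 0.6),
    "sad": (-0.8, -0.3),
    "calm": (0.4, -0.7),
    "angry": (-0.7, 0.8),
}

def utterance_features(words):
    """Average the per-word norms over the words found in the lexicon."""
    hits = [AFFECT_NORMS[w] for w in words if w in AFFECT_NORMS]
    if not hits:
        return (0.0, 0.0)  # neutral back-off for out-of-lexicon utterances
    n = len(hits)
    return (sum(v for v, _ in hits) / n,
            sum(a for _, a in hits) / n)

print(utterance_features(["i", "feel", "happy", "and", "calm"]))
```

Only "happy" and "calm" match the lexicon here, so the utterance is summarized by their mean valence and arousal; richer schemes could weight words by salience or part of speech.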
Finally, I will present our Bayesian framework for multimodal prediction of affect. It models the temporal dynamics of emotion via particle filtering and weights the individual modality predictors according to the accuracy of each modality. The method yields a substantial improvement in performance over the best single-modality predictor. In contrast, on this dataset, as on many others, standard methods for modality combination give only minimal gains over the best modality.
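The accuracy-informed weighting can be illustrated with a minimal sketch. This shows only the fusion step, not the full Bayesian framework (the particle-filter temporal model is omitted), and the prediction and accuracy values are hypothetical.

```python
# Accuracy-weighted fusion of per-modality predictions: each modality's
# prediction is weighted in proportion to its held-out accuracy.

def fuse(predictions, accuracies):
    """Weighted average of per-modality predictions, weights ∝ accuracy."""
    total = sum(accuracies)
    return sum(p * w for p, w in zip(predictions, accuracies)) / total

# Hypothetical arousal predictions from three modalities, with
# made-up validation accuracies (e.g., correlation with ground truth).
preds = [0.30, 0.10, 0.55]   # audio, lexical, video
accs  = [0.60, 0.20, 0.20]

print(fuse(preds, accs))
```

Because the audio predictor is weighted most heavily, the fused estimate stays close to its prediction; a simple unweighted average would be pulled toward the weaker modalities, which is one intuition for why naive combination schemes often gain little over the best single modality.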