The recognition of emotion and dialogue acts enrich conversational analysis and help to build natural dialogue systems. Emotion makes us understand feelings and dialogue acts reflect the intentions and performative functions in the utterances. However, most of the textual and multi-modal conversational emotion datasets contain only emotion labels but not dialogue acts. To address this problem, we propose to use a pool of various recurrent neural models trained on a dialogue act corpus, with or without context. These neural models annotate the emotion corpus with dialogue act labels and an ensemble annotator extracts the final dialogue act label. We annotated two popular multi-modal emotion datasets: IEMOCAP and MELD. We analysed the co-occurrence of emotion and dialogue act labels and discovered specific relations. For example, Accept/Agree dialogue acts often occur with the Joy emotion, Apology with Sadness, and Thanking with Joy. We make the Emotional Dialogue Act (EDA) corpus publicly available to the research community for further study and analysis.
However, a new algorithm from researchers at Stanford and Adobe has shown it's pretty damn good at video dialogue editing, something that requires artistry, skill and considerable time. For instance, many scenes start with a wide "establishing" shot so that the viewer knows where they are. You can also use leisurely or fast pacing, emphasize a certain character, intensify emotions or keep shot types (like wide or closeup) consistent. In an example shown (below), the team selected "start wide" to establish the scene, "avoid jump cuts" for a cinematic (non-YouTube) style, "emphasize character" ("Stacey") and use a faster-paced performance.
The strategies for interactive characters to select appropriate dialogues remain as an open issue in related research areas. In this paper we propose an approach based on reinforcement learning to learn the strategy of interrogation dialogue from one virtual agent toward another. The emotion variation of the suspect agent is modeled with a hazard function, and the detective agent must learn its interrogation strategies based on the emotion state of the suspect agent. The reinforcement learning reward schemes are evaluated to choose the proper reward in the dialogue. Our contribution is twofold. Firstly, we proposed a new framework of reinforcement learning to model dialogue strategies. Secondly, background knowledge and emotion states of agents are brought into the dialogue strategies. The resulted dialogue strategy in our experiment is sensitive in detecting lies from the suspect, and with it the interrogator may receive more correct answer.
While there have been significant advances in detecting emotions from speech and image recognition, emotion detection on text is still under-explored and remained as an active research field. This paper introduces a corpus for text-based emotion detection on multiparty dialogue as well as deep neural models that outperform the existing approaches for document classification. We first present a new corpus that provides annotation of seven emotions on consecutive utterances in dialogues extracted from the show, Friends. We then suggest four types of sequence-based convolutional neural network models with attention that leverage the sequence information encapsulated in dialogue. Our best model shows the accuracies of 37.9% and 54% for fine- and coarse-grained emotions, respectively. Given the difficulty of this task, this is promising.
As a generative model for building end-to-end dialogue systems, Hierarchical Recurrent Encoder-Decoder (HRED) consists of three layers of Gated Recurrent Unit (GRU), which from bottom to top are separately used as the word-level encoder, the sentence-level encoder, and the decoder. Despite performing well on dialogue corpora, HRED is computationally expensive to train due to its complexity. To improve the training efficiency of HRED, we propose a new model, which is named as Simplified HRED (SHRED), by making each layer of HRED except the top one simpler than its upper layer. On the one hand, we propose Scalar Gated Unit (SGU), which is a simplified variant of GRU, and use it as the sentence-level encoder. On the other hand, we use Fixed-size Ordinally-Forgetting Encoding (FOFE), which has no trainable parameter at all, as the word-level encoder. The experimental results show that compared with HRED under the same word embedding size and the same hidden state size for each layer, SHRED reduces the number of trainable parameters by 25\%--35\%, and the training time by more than 50\%, but still achieves slightly better performance.