A multimodal neural network predicts user sentiment from multimodal features such as text, audio, and visual data. Speech and language recognition technology is developing rapidly and has led to the emergence of novel speech dialog systems, such as Amazon Alexa and Siri. A significant milestone in the development of dialog artificial intelligence (AI) systems is the addition of emotional intelligence. A system that recognizes the user's emotional state, in addition to understanding language, could generate more empathetic responses, leading to a more immersive experience for the user. "Multimodal sentiment analysis" is a group of methods that constitutes the gold standard for AI dialog systems with sentiment detection.