Emotions are peculiar to humans and some social animals, such as apes, wolves, and crows. Emotion recognition is an important part of communication between people. The efficiency of human interaction depends on how well we can predict the behavior of the other person and, as a result, adjust or change our own behavior. Fear can indicate danger; satisfaction indicates that the conversation is going well.


Emotion recognition is not an easy task. The same emotion is expressed differently by different people. Still, most people have no trouble distinguishing basic emotions such as fear, anger, disgust, happiness, and surprise. The question then arises: can we teach a computer to recognize emotions? Today the answer is yes. Automatic emotion recognition is a field of study in AI. It is the process of identifying human emotion by leveraging techniques from multiple areas, such as signal processing, machine learning, computer vision, and natural language processing.


But before discussing automatic emotion recognition in detail, we need to ask whether we need it at all. As we already mentioned above, emotions are a powerful source of information. Various surveys suggest that verbal components convey about one-third of human communication, while nonverbal components convey the other two-thirds. Successful human-computer interaction therefore needs this channel of communication.


How Does Automatic Emotion Recognition Work?


Humans use different cues to recognize emotions: voice, facial expressions, gestures, and text. The same is true for computers. They use different inputs, such as speech, images, videos, text, and physiological sensors, to recognize emotions. We can also combine processing results from different input channels to obtain a more precise picture of the person's state; this is called multimodal emotion recognition. So, the first step is to collect a dataset (or several datasets, in the case of multimodal inputs) with examples of different emotions and label them. In general, there are six basic cross-cultural emotions corresponding to Ekman's model: anger, disgust, fear, happiness, sadness, and surprise. Quite often, data scientists add 12 compound emotions (combinations of basic emotions, e.g. angrily surprised, sadly disgusted), additional emotions (appall, hate, and awe), plus a neutral class.
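As a minimal illustration, the label set for such a dataset could be defined along these lines (the compound labels and the sample file name are just hypothetical examples, not a complete taxonomy):

```python
# Basic cross-cultural emotions from Ekman's model, plus a neutral class.
BASIC_EMOTIONS = [
    "anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral",
]

# A few compound labels built from pairs of basic emotions (illustrative only).
COMPOUND_EMOTIONS = ["angrily_surprised", "sadly_disgusted", "happily_surprised"]

# A labeled sample simply pairs a raw input (audio clip, image, text, ...)
# with one of the labels above.
labeled_sample = {"input": "recordings/clip_042.wav", "label": "anger"}
```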


Then we train a model to classify emotions from the labeled samples. Nowadays, the most efficient models are based on neural networks such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs). For multimodal emotion recognition, there is also a fusion block that combines the results from different channels and produces the final decision. The accuracy of such models varies from 60% to 95% depending on the dataset used, the neural network architecture, and the signal type(s).
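As a rough sketch of what such a classifier might look like, assuming the input has already been converted to a fixed-size 2D representation (e.g. a mel spectrogram or a cropped face image) and using PyTorch purely for illustration:

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Small CNN mapping a 1x64x64 input (spectrogram or face crop)
    to one of `num_classes` emotion labels."""
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.features(x)          # (batch, 32, 16, 16)
        x = x.flatten(start_dim=1)    # (batch, 32*16*16)
        return self.classifier(x)     # raw scores, one per emotion class

model = EmotionCNN(num_classes=7)
logits = model(torch.randn(8, 1, 64, 64))  # batch of 8 dummy inputs
predictions = logits.argmax(dim=1)         # predicted emotion index per sample
```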


Let’s have a look at different types of recognition.

  • Speech Emotion Recognition (SER). SER analyzes vocal behavior as a marker of affect, focusing on the nonverbal aspects of speech. For example, anger often produces changes in respiration and increased muscle tension. Consequently, the vocal folds and the shape of the vocal tract vibrate differently, and the acoustic characteristics of speech change. Speech signals are a good source for affective computing. They are economically favorable to use (we only need a microphone) compared to biological signals (e.g., an electrocardiogram) or facial recognition (which requires a camera). The range of applications is wide: emotional hearing aids for people with autism; detecting an angry caller at an automated call center and transferring them to a human; adjusting the presentation style of a computerized e-learning tutor when the student is bored. A typical first step is extracting acoustic features from the audio; see the feature-extraction sketch after this list.

  • Facial Emotion Recognition (FER). FER is based on visual information: images and video. However, FER can also be conducted using additional sensors. For example, Microsoft Kinect is used for 3D modeling; it has an infrared emitter and two cameras, so it can measure depth. Such a system addresses two main problems of image recognition: proper lighting and pose variation. Several FER applications are worth mentioning. Cars with AI can alert the driver when he or she is getting drowsy. FER can also be used in marketing research (e.g., observing a user's reaction while interacting with a brand or product). One more use of FER, as well as SER, is an ATM that does not dispense cash when the person withdrawing money appears scared. A basic face-detection-plus-classification pipeline is sketched after this list.

  • Multimodal Emotion Recognition. There are different types of multimodal emotion recognition. Some combine recognition from speech and text data; others use images and voice data. There are also more exotic systems that combine images, speech, and text with physiological sensors (e.g., electrocardiography, electroencephalography). Multimodal systems give a wider picture of a person's emotional state. For example, they can be used in healthcare to determine how patients feel about a treatment and how comfortable they are with it. A simple late-fusion sketch, combining predictions from two modalities, is shown below.
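For SER, a hedged sketch of acoustic feature extraction, assuming the librosa library and a placeholder audio file name (the specific features chosen here are illustrative, not a fixed recipe):

```python
import librosa
import numpy as np

# Load an audio file (filename is a placeholder) and resample to 16 kHz.
signal, sr = librosa.load("speech.wav", sr=16000)

# MFCCs are a common compact representation of the vocal tract shape.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

# Energy statistics also correlate with emotional arousal.
rms_energy = librosa.feature.rms(y=signal)

# Summarize frame-level features into one fixed-length vector per utterance,
# which can then be fed to a classifier such as the CNN/RNN described above.
feature_vector = np.concatenate(
    [mfcc.mean(axis=1), mfcc.std(axis=1), rms_energy.mean(axis=1)]
)
```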
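For FER, a minimal pipeline sketch using OpenCV's bundled Haar cascade for face detection (the image path is a placeholder, and the emotion classifier itself is left out, since it would be a trained model like the CNN sketched earlier):

```python
import cv2

# Face detector shipped with OpenCV (Haar cascade).
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

image = cv2.imread("driver.jpg")  # placeholder image path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    # Crop and resize the face region, then pass it to a trained
    # emotion classifier (e.g. the EmotionCNN sketched earlier).
    face_crop = cv2.resize(gray[y:y + h, x:x + w], (64, 64))
```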
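Finally, a minimal late-fusion sketch: each modality produces a probability distribution over the same emotion labels, and the fusion block here is just a weighted average (the probabilities and weights are made up for illustration; real systems often learn the fusion weights or use a dedicated neural network for this step):

```python
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

# Hypothetical per-class probabilities from two independent models.
speech_probs = np.array([0.50, 0.05, 0.10, 0.05, 0.10, 0.10, 0.10])
face_probs   = np.array([0.30, 0.05, 0.05, 0.10, 0.20, 0.10, 0.20])

# Late fusion: weighted average of the two modalities' outputs.
fused = 0.6 * speech_probs + 0.4 * face_probs
print("Fused prediction:", EMOTIONS[int(np.argmax(fused))])
```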