Emotion Recognition

Emotions are found in humans and in some social animals, such as apes, wolves, and crows. Emotion recognition is an important part of communication between people. The efficiency of human interaction depends on how well we can predict the behavior of the person we are interacting with and, as a result, adjust or change our own behavior. Fear can signal danger; satisfaction signals that the conversation is going well.


Emotion recognition is not an easy task, as the same emotion may be expressed differently by different people. That said, most people have no trouble distinguishing basic emotions such as fear, anger, disgust, happiness, or surprise, to list a few examples. The question that arises here is whether we can teach a computer to recognize emotions. Thanks to the advancements made in recent years, the answer is yes. Automatic emotion recognition is a field of study in AI. It is the process of identifying human emotion by leveraging techniques from multiple areas, such as signal processing, machine learning, computer vision, and natural language processing.


But before we discuss automatic emotion recognition in detail, it is important to explore why this technology is necessary at all. Well, as we already mentioned above, emotions are a powerful source of information. Various studies suggest that verbal components convey about one-third of human communication, while nonverbal components convey the other two-thirds. So successful human-computer interaction also needs access to this channel of communication.

How Does Automatic Emotion Recognition Work?

Humans use different components of interaction to recognize emotions: voice, facial expressions, gestures, and text. The same is true for computers. They use different inputs, such as speech, images, videos, text, and physiological sensors, to recognize emotions. We can also combine processing results from different input channels to obtain a more precise picture of a person's state; this is called multimodal emotion recognition. So, the first step is to collect a dataset (or datasets, in the case of multimodal inputs) with examples of different emotions and label them. In general, there are six basic cross-cultural emotions corresponding to Ekman's model: anger, disgust, fear, happiness, sadness, and surprise. Quite often, data scientists add 12 compound emotions (built from the basic ones, e.g., angrily surprised, sadly disgusted), a few additional emotions (such as appalled, hate, and awe), plus a neutral class.
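As a small illustration of the labeling step, here is a minimal sketch that lists Ekman's six basic emotions plus a neutral class and maps each label to an integer class index, as a classifier would expect. The compound labels and the file path are illustrative placeholders, not a fixed standard.

```python
# A minimal sketch of a label taxonomy for an emotion dataset,
# based on Ekman's six basic emotions plus a neutral class.
BASIC_EMOTIONS = [
    "anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral",
]

# A few compound labels built from pairs of basic emotions (examples only).
COMPOUND_EMOTIONS = ["angrily_surprised", "sadly_disgusted"]

# Map every label to an integer class index for training a classifier.
LABEL_TO_INDEX = {label: idx for idx, label in enumerate(BASIC_EMOTIONS + COMPOUND_EMOTIONS)}

# A labeled sample might then look like (file path, class index):
sample = ("recordings/clip_0001.wav", LABEL_TO_INDEX["anger"])
print(sample)
```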

Then we train a model to classify emotions from these samples. Nowadays, the most efficient models are based on neural networks such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs). For multimodal emotion recognition, there is also a fusion block that combines results from the different channels and produces a final decision. The accuracy of such models varies from 60% to 95% depending on the database being used, the architecture of the neural network, and the signal type(s).
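To make the training step more concrete, here is a minimal PyTorch sketch of a small CNN emotion classifier. It assumes 64x64 single-channel inputs (for example, face crops or log-mel spectrograms) and seven output classes; the layer sizes and the random stand-in data are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """A deliberately small CNN for illustration only."""
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.features(x)       # (batch, 32, 16, 16)
        x = x.flatten(1)           # (batch, 32 * 16 * 16)
        return self.classifier(x)  # unnormalized class scores (logits)

# One illustrative training step on random data standing in for a real dataset.
model = EmotionCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(8, 1, 64, 64)    # 8 fake 64x64 samples
targets = torch.randint(0, 7, (8,))   # 8 fake emotion labels
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.3f}")
```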

Let’s have a look at the different types of emotion recognition.

  • Speech Emotion Recognition (SER). SER analyzes vocal behavior as a marker of affect, focusing on the nonverbal aspects of speech. For example, anger often produces changes in respiration and an increase in muscle tension, which alter the vibration of the vocal folds and the shape of the vocal tract, and with them the acoustic characteristics of speech. Speech signals are a good source for affective computing. This technology is a more economical option (only a microphone is needed) than biological signals (which may require an electrocardiogram) or facial recognition (which requires a camera). Furthermore, it has a wide range of applications: emotional hearing aids for people with autism; detection of an angry caller at an automated call center so the call can be transferred to a human; and adjustment of the presentation style of a computerized e-learning tutor when the student is bored. See the feature-extraction sketch after this list.
  • Facial Emotion Recognition (FER). FER is based on visual information: images and video. It can also draw on additional sensors; for example, the Microsoft Kinect, with its infrared emitter and two cameras, can measure depth and build a 3D model of the face. Such a system alleviates two central problems of image-based recognition: variable lighting and head pose. There are several FER applications worth mentioning. A car with AI can alert the driver when he or she appears drowsy. FER can also be used for marketing research (e.g., observing a user’s reaction while interacting with a specific brand or product). One more use of FER, as well as SER, involves ATM security: if the person withdrawing money looks scared, the ATM will not dispense cash, as a safety measure. See the face-detection sketch after this list.
  • Multimodal Emotion Recognition. There are different types of multimodal emotion recognition. Some combine recognition from speech and text data; others combine images and voice data. There are also more exotic systems that combine images, speech, and text with physiological sensors (e.g., electrocardiography, electroencephalography). Multimodal systems give a wider picture of the human emotional state. For example, they can be used in healthcare to determine patients' feelings and their comfort level regarding treatment. See the fusion sketch after this list.
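As a concrete illustration of the SER feature-extraction step, here is a minimal sketch using the librosa library. The file name is a placeholder, and the chosen features (MFCCs and frame-level energy) are common choices rather than the only possible ones.

```python
import numpy as np
import librosa

# Load a speech clip (placeholder file name) at a 16 kHz sampling rate.
audio, sr = librosa.load("speech_sample.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)  # spectral envelope features
rms = librosa.feature.rms(y=audio)                      # frame-level energy

# Summarize the frame-level features into one fixed-length vector per clip,
# which could then be fed to a classifier such as the CNN sketched earlier.
feature_vector = np.concatenate([mfcc.mean(axis=1), rms.mean(axis=1)])
print(feature_vector.shape)  # (14,) with the settings above
```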
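For FER, here is a minimal face-detection sketch using a pretrained OpenCV Haar cascade. The image path is a placeholder; each detected face crop would then be passed to an emotion classifier such as the CNN sketched earlier.

```python
import cv2

# Pretrained frontal-face detector that ships with OpenCV.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

frame = cv2.imread("frame.jpg")                     # placeholder image
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    # Crop and resize each face to the input size an emotion classifier expects.
    face_crop = cv2.resize(gray[y:y + h, x:x + w], (64, 64))
    print("face crop ready for the classifier:", face_crop.shape)
```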
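Finally, for the multimodal case, here is a minimal sketch of late (decision-level) fusion, in which each channel produces class probabilities and a weighted average acts as the fusion block. The probabilities and weights below are made-up numbers for illustration.

```python
import numpy as np

labels = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

# Per-class probabilities from two independent channels (illustrative values).
speech_probs = np.array([0.50, 0.05, 0.10, 0.05, 0.10, 0.10, 0.10])  # from the SER model
face_probs   = np.array([0.30, 0.05, 0.05, 0.10, 0.30, 0.10, 0.10])  # from the FER model

# A simple weighted average as the fusion step (weights are assumptions).
fused = 0.6 * speech_probs + 0.4 * face_probs
print("fused decision:", labels[int(fused.argmax())])
```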

Interesting For You

Sentiment Analysis in NLP

Sentiment analysis has become a new trend in social media monitoring, brand monitoring, product analytics, and market research. Like most areas that use artificial intelligence, sentiment analysis (also known as opinion mining) is an interdisciplinary field spanning computer science, psychology, the social sciences, linguistics, and cognitive science. The goal of sentiment analysis is to identify and extract attitudes, emotions, and opinions from text. In other words, sentiment analysis is the mining of subjective impressions, rather than facts, from users’ tweets, reviews, posts, and so on.

Chatbots in NLP

Chatbots, or conversational agents, are so widespread that the average person is no longer surprised to encounter them in daily life. What is remarkable is how quickly chatbots are getting smarter, more responsive, and more useful. Sometimes you don’t even realize immediately that you are having a conversation with a robot. So, what is a chatbot? Simply put, it is a communication interface that can interpret users’ questions and respond to them, thereby simulating a conversation with a real person. This technology provides a low-friction, low-barrier way of accessing computational resources.

Computer Vision

Computer Vision (CV) is one of Artificial Intelligence’s cutting-edge topics. The goal of CV is to extract information from digital images or videos. This information may relate to camera position, object detection and recognition, or grouping and searching image content. In practice, extracting this information is a major challenge that requires a combination of programming, modeling, and mathematics to be completed successfully. Interest in Computer Vision began to emerge among scholars in the 1960s, when researchers worked on extracting 3D information from 2D images. While some progress was made, limited computing capacity and small, isolated research groups slowed the field’s development. The first commercial application of Computer Vision was an optical character recognition program, which emerged in 1974; it interpreted typed or handwritten text with the goal of helping the blind or visually impaired. Thanks to growing computing power and NVIDIA’s parallelizable GPUs, significant progress has since been achieved in deep learning and convolutional neural networks (CNNs).
