Computer Vision (CV) is a fast-moving field of Artificial Intelligence. Its goal is to extract information from digital images or videos. This information may include the camera position, the detection and recognition of objects, or the grouping and searching of image content. Extracting it is a major challenge in practice and requires a combination of programming, modeling, and mathematics.


Scholarly interest in computer vision arose in the 1960s. In those days, researchers worked on extracting 3D information from 2D images. However, limited computing capacity and small, isolated research groups slowed the development of the field. In 1974, the first commercial application of computer vision appeared: an optical character recognition program. It interpreted typed or handwritten text, which helped blind people gain access to written material. In the following decades, growing computing power and NVIDIA's parallel GPUs enabled significant progress in deep learning and convolutional neural networks (CNNs).


How does Computer Vision work?


A machine interprets an image or video frame as a grid of pixels. Each pixel is characterized by its position in the image and a value for its color (typically 8 bits per channel). So, to a computer, an image is simply an array of numbers. The machine first identifies groups of similar colors; then, using color gradients, it finds the edges of different objects. Next, algorithms search for lines that meet at an angle and enclose a part of the image with a single color. Finally, the algorithm must determine the texture of that region and guess the object in the image by comparing it with its database. It is like solving a puzzle: you search for similar colors, identify edges, lines, and textures, and finally piece all the parts of the image together.
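
The "array of numbers" view and the gradient step can be sketched in a few lines of plain Python. This is only a minimal illustration under simplifying assumptions: the image is a hand-written 4x5 grayscale grid, the gradient is a plain difference between horizontal neighbors, and the function names and the threshold of 50 are made up for this example. Practical systems use more robust filters and learned features.

```python
# A tiny grayscale "image": each number is one pixel's 8-bit brightness (0-255).
# The left part is dark and the right part is bright, so there is one vertical edge.
image = [
    [10, 10, 10, 200, 200],
    [10, 10, 10, 200, 200],
    [10, 10, 10, 200, 200],
    [10, 10, 10, 200, 200],
]


def horizontal_gradient(img):
    """Absolute brightness difference between each pixel and its right neighbor."""
    return [
        [abs(row[x + 1] - row[x]) for x in range(len(row) - 1)]
        for row in img
    ]


def edge_columns(img, threshold=50):
    """Column indices where the brightness jump exceeds the (illustrative) threshold."""
    grad = horizontal_gradient(img)
    return sorted({x for row in grad for x, g in enumerate(row) if g > threshold})


print(horizontal_gradient(image)[0])  # gradient of the first row
print(edge_columns(image))            # columns where an edge starts
```

The large jump between the third and fourth columns is what marks the edge; everywhere else the gradient is zero because neighboring pixels have the same value.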


Tasks in Computer Vision


Computer vision covers a wide array of tasks and the applications built on them. Here are a few:


Object classification (or object recognition). In this task, a model is trained to recognize one or several objects or object classes; the model can then assign the objects in new images to categories. For example, using convolutional neural networks, a model can be trained to automatically annotate diseases in chest X-rays. Another example, related to the character recognition mentioned above, is classifying handwritten digits. Success in such classification tasks depends heavily on a well-chosen set of training images. Some popular, well-studied image datasets are MNIST (handwritten digits), SVHN (digits from photos of house numbers), and CIFAR-10 and CIFAR-100, with 10 and 100 classes respectively.


Object identification. Object identification is a subtask of object recognition. While classification recognizes an object's basic-level category, like dog or human, identification recognizes its subordinate category, like German Shepherd or Elvis Presley. Typical examples are fingerprint identification in cyber security and fraud detection, identification of a specific vehicle, and bacteria identification in 3D microscopy data.


Object detection. This is also a subtask of object recognition. In object detection, the model is trained to locate and recognize one or several objects in a single image. Such models draw bounding boxes and label the objects inside them. For example, in face detection, the model must find all faces in a given photo and return their bounding boxes. Driverless cars also need to solve detection tasks constantly and instantly so they can react if a car or a person suddenly moves into their path. Vehicle detection, pedestrian counting, security systems: the list of object detection applications seems endless.
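
To make the idea of bounding boxes concrete, here is a minimal sketch of how two boxes can be compared. It uses intersection over union (IoU), a commonly used overlap measure that the text above does not itself prescribe. The box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for this example.

```python
def area(box):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return max(0, x_max - x_min) * max(0, y_max - y_min)


def iou(box_a, box_b):
    """Intersection over union: 1.0 for identical boxes, 0.0 for disjoint ones."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    intersection = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    union = area(box_a) + area(box_b) - intersection
    return intersection / union if union > 0 else 0.0


# A hypothetical predicted face box compared with a hand-labeled one.
predicted = (0, 0, 10, 10)
labeled = (5, 5, 15, 15)
print(iou(predicted, labeled))  # 25 / 175, roughly 0.14
```

A score like this is one way to judge how well a predicted box matches a labeled one, or whether two detections in an image refer to the same object.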


Semantic segmentation (or object segmentation). Segmentation is closely related to object detection, but instead of drawing boxes, the model divides the image into groups of pixels that are labeled and classified. Put simply, the model answers the question of which pixels belong to the object in the image. One of the best-known examples is the portrait mode in cell phone cameras: the model recognizes the pixels that belong to the object in the foreground and blurs the background. Semantic segmentation is used in biomedical image diagnosis, precision agriculture, and other fields.
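
The portrait-mode idea can be sketched as follows, assuming the segmentation model has already produced a mask with 1 for foreground pixels and 0 for background pixels. The mask here is written by hand, the image is a tiny grayscale grid, and the 3x3 box blur and the function names are choices made only for this illustration.

```python
def box_blur(img):
    """Replace each pixel by the integer mean of its 3x3 neighborhood (clipped at borders)."""
    height, width = len(img), len(img[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbors = [
                img[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(sum(neighbors) // len(neighbors))
        blurred.append(row)
    return blurred


def portrait_mode(img, mask):
    """Keep foreground pixels (mask == 1) sharp and blur everything else."""
    blurred = box_blur(img)
    return [
        [img[y][x] if mask[y][x] else blurred[y][x] for x in range(len(img[0]))]
        for y in range(len(img))
    ]


image = [[0, 0, 0], [0, 90, 0], [0, 0, 0]]
mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # hand-written stand-in for a model's output
result = portrait_mode(image, mask)
print(result)  # the center pixel stays 90; background pixels are averaged
```

The per-pixel mask is what distinguishes this from detection: the decision to blur or keep is made for every pixel individually rather than for a whole box.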


Motion analysis. Motion analysis deals with sequences of images or videos and includes several tasks. Object tracking is the process of following the movements of one or more specific objects in a video scene. The main problem in this task is to associate the target object across consecutive video frames.
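
The association problem can be illustrated with a very simple strategy: in each new frame, pick the detected box whose center is closest to the target's last known position. This is a sketch under strong assumptions (detections are already given as `(x_min, y_min, x_max, y_max)` boxes, the target moves only a little between frames, and the function names are invented here), not a description of how production trackers work.

```python
import math


def centroid(box):
    """Center point of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def track(initial_box, frames):
    """Follow one target through a list of frames.

    `frames` holds, per frame, the list of boxes detected in it.
    Returns the box chosen in each frame, or None where nothing was detected.
    """
    current = initial_box
    path = []
    for detections in frames:
        if not detections:
            path.append(None)  # target not seen; keep the last known position
            continue
        # Associate the target with the nearest detection in this frame.
        current = min(
            detections,
            key=lambda box: math.dist(centroid(box), centroid(current)),
        )
        path.append(current)
    return path


frames = [
    [(12, 10, 22, 20), (100, 100, 110, 110)],
    [],
    [(98, 99, 108, 109), (15, 11, 25, 21)],
]
print(track((10, 10, 20, 20), frames))
```

Nearest-center matching breaks down when objects cross paths or move quickly, which is one reason the association step is described as the main difficulty.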


Object tracking has a variety of uses: autonomous driving systems, video compression, augmented reality, and traffic control. Egomotion estimation is the task of determining the 3D rigid motion of the camera within its environment. This task arises in autonomous robot navigation, where robots need accurate, real-time estimates of their own position to support perception and control.