Image classification:

The image classification problem is this: given a set of images, each labeled with a single category, we are asked to predict these categories for a new set of test images and measure the accuracy of the predictions. This task poses many challenges, including viewpoint variation, scale variation, intra-class variation, image deformation, occlusion, lighting conditions, and background clutter.
How could we write an algorithm capable of classifying images into distinct categories? Computer vision researchers have developed a data-driven approach to this problem. Instead of trying to specify directly in code what each image category of interest looks like, they provide the computer with many examples of each image class and then develop learning algorithms that examine these examples and learn the visual appearance of each class.
In other words, they first accumulate a labeled training set of images and then feed it to the computer, which learns from the data. Given this, the complete image classification pipeline can be formalized as follows:

  • Our input is a training data set that consists of N images, each labeled with one of K different classes.
  • We then use this training set to train a classifier to learn what each class looks like.
  • Finally, we evaluate the quality of the classifier by asking it to predict labels for a new set of images it has never seen before.
  • We then compare the actual labels for these images with those predicted by the classifier.
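The pipeline above can be sketched end to end with a deliberately minimal classifier. This is an illustrative toy (random data, a nearest-centroid "classifier"), not a real image model; the point is the shape of the workflow: train on N labeled examples from K classes, predict on unseen images, then compare predicted labels against the true ones to get an accuracy.

```python
import numpy as np

# Toy setup: N training "images" with K classes, each flattened to D pixels.
rng = np.random.default_rng(0)
N, K, D = 100, 3, 64
X_train = rng.normal(size=(N, D))
y_train = rng.integers(0, K, size=N)

# "Train" a minimal classifier: store one mean vector (centroid) per class.
centroids = np.stack([X_train[y_train == k].mean(axis=0) for k in range(K)])

def predict(X):
    # Assign each image to the class of its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Evaluate on unseen test images: compare predicted labels to the true ones.
X_test = rng.normal(size=(20, D))
y_test = rng.integers(0, K, size=20)
accuracy = (predict(X_test) == y_test).mean()
```

A CNN replaces the centroid step with a learned model, but the surrounding pipeline (train, predict, compare) is the same.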

The most common algorithms used to solve image classification are convolutional neural networks (CNNs).

Convolutional neural networks are currently the most effective models for image classification. A CNN has two very different parts. As input, the image is provided as a matrix of pixels: a grayscale image has two dimensions, while a color image adds a third dimension of depth 3, representing the basic colors (red, green, blue). The first part of the CNN is the convolutional part, which acts as a feature extractor. The image is passed through successive filters, or convolution kernels, creating new images called convolution maps. Some intermediate layers reduce the resolution of the image through a local maximum operation (max pooling). Finally, the convolution maps are flattened and concatenated into a feature vector called the CNN code.
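The convolutional part described above can be sketched in plain NumPy. This is a hand-rolled toy, not a trained network: the two 3×3 filters are hypothetical edge detectors chosen for illustration, and the pipeline shows only the structure the text describes, namely convolution, a nonlinearity, max pooling, and flattening into the CNN code.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution of a grayscale image with one filter."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Downsample a feature map by taking the local maximum in each window."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

# Toy 8x8 grayscale image and two illustrative 3x3 filters.
image = np.arange(64, dtype=float).reshape(8, 8)
filters = [np.array([[1, 0, -1]] * 3, dtype=float),    # vertical edges
           np.array([[1, 0, -1]] * 3, dtype=float).T]  # horizontal edges

# Convolve, apply ReLU, pool, then flatten all maps into one feature vector:
maps = [max_pool(np.maximum(conv2d(image, f), 0)) for f in filters]
cnn_code = np.concatenate([m.ravel() for m in maps])
```

In a real CNN the filter values are not hand-picked; they are learned during training, and there are many layers of them.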
Creating a new convolutional neural network is expensive in terms of expertise, equipment, and the amount of annotated data required. First, the architecture of the network must be determined, i.e. the number of layers, their sizes, and the matrix operations that connect them. Training then involves optimizing the network's coefficients to minimize the output error. This training can take several weeks for the best CNNs, with many graphics processors working on hundreds of thousands of annotated images.
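The core of that training step, optimizing coefficients to minimize output error, can be illustrated on a much smaller scale. The sketch below trains a single linear layer with softmax output by gradient descent on random data; a real CNN optimizes millions of weights the same way, with the gradients computed by backpropagation through all the layers.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 200, 10, 3
X = rng.normal(size=(N, D))
y = rng.integers(0, K, size=N)
Y = np.eye(K)[y]                       # one-hot labels

W = np.zeros((D, K))                   # the "network coefficients"
lr = 0.1

def forward(W):
    # Softmax probabilities and mean cross-entropy loss (the output error).
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(N), y]).mean(), p

initial_loss, _ = forward(W)
for _ in range(200):                   # training iterations
    _, p = forward(W)
    grad = X.T @ (p - Y) / N           # gradient of the loss w.r.t. W
    W -= lr * grad                     # step toward lower error
final_loss, _ = forward(W)
```

The expensive part of real CNN training is not the loop itself but its scale: far more parameters, far more data, and many passes over the dataset.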

Today, most image classification techniques are trained on ImageNet, a data set of approximately 1.2 million high-resolution training images.
The winner of the first ImageNet competition, Alex Krizhevsky, revolutionized deep learning with AlexNet, a very deep convolutional neural network. Its architecture consists of 7 hidden layers, plus a few max-pooling layers. The first layers were convolutional, while the last two were fully connected. The activation functions were rectified linear units (ReLUs) in each hidden layer; these train much faster and are more expressive than logistic units. In addition, AlexNet uses competitive normalization to suppress a hidden unit's activity when neighbouring units have stronger activities, which allows for better handling of intensity variations.
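The two ingredients mentioned above are easy to write down directly. Below, `relu` and `logistic` show the activation functions being compared, and `local_response_norm` is a sketch of the competitive (local response) normalization idea: each unit's activity is divided by a term that grows with the squared activities of its neighbors across channels, so strong neighbors suppress it. The constants follow the AlexNet paper's defaults (k=2, n=5, alpha=1e-4, beta=0.75).

```python
import numpy as np

def relu(x):
    # Rectified linear unit: zero for negative inputs, identity otherwise.
    return np.maximum(x, 0.0)

def logistic(x):
    # Logistic (sigmoid) unit, for comparison.
    return 1.0 / (1.0 + np.exp(-x))

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize activities a of shape (channels, height, width) across
    a window of n neighboring channels."""
    c = a.shape[0]
    out = np.empty_like(a)
    for i in range(c):
        lo, hi = max(0, i - n // 2), min(c, i + n // 2 + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        out[i] = a[i] / denom
    return out

acts = np.abs(np.random.default_rng(2).normal(size=(8, 4, 4)))
normed = local_response_norm(acts)
```

Note that ReLU's gradient is 1 wherever the unit is active, which is a large part of why it trains faster than the logistic unit, whose gradient shrinks toward zero for large inputs.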

In terms of hardware requirements, Krizhevsky used a very efficient implementation of convolutional networks on 2 Nvidia GTX 580 GPUs (more than 1,000 small, fast cores). GPUs are very good at matrix multiplications and also have very high memory bandwidth. This made it possible to train the network in a week and to quickly combine the results of 10 patches at test time. A network can be spread over several cores if the states can be communicated quickly enough. As cores become cheaper and datasets get larger, large neural networks will improve faster than older computer-vision systems. Since AlexNet, many new models using CNNs as a base architecture have been developed and have performed very well on ImageNet.
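The "10 patches at test time" trick can be sketched as follows. Assuming the standard 10-crop scheme (the four corner crops and the center crop, plus their horizontal flips), each patch is run through the classifier and the predicted class probabilities are averaged. The `classify` function here is a hypothetical stand-in for a trained CNN, not a real model.

```python
import numpy as np

def ten_crops(image, crop):
    # Four corner crops + center crop, then their horizontal mirror images.
    h, w = image.shape[:2]
    corners = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    crops = [image[i:i+crop, j:j+crop] for i, j in corners]
    return crops + [c[:, ::-1] for c in crops]

def classify(patch):
    # Stand-in for a trained CNN: produces a 3-class probability vector.
    rng = np.random.default_rng(patch.size)
    logits = rng.normal(size=3) + patch.mean()
    e = np.exp(logits - logits.max())
    return e / e.sum()

image = np.random.default_rng(3).normal(size=(8, 8))
patches = ten_crops(image, crop=6)
avg_probs = np.mean([classify(p) for p in patches], axis=0)
```

Averaging over patches smooths out the sensitivity of the prediction to exactly where the object sits in the frame, which is cheap on a GPU because the 10 forward passes can be batched.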