Human-level performance. Human-level accuracy. Those are terms you hear a lot from companies developing artificial intelligence systems, whether it’s facial recognition, object detection, or question answering. And to their credit, recent years have seen many great products powered by AI algorithms, mostly thanks to advances in machine learning and deep learning.

But many of these comparisons only take into account the end result of testing the deep learning algorithms on limited data sets. This approach can create false expectations about AI systems and yield dangerous results when they are entrusted with critical tasks.

In a recent study, a group of researchers from various German organizations and universities highlighted the challenges of evaluating the performance of deep learning in processing visual data. In their paper, titled “The Notorious Difficulty of Comparing Human and Machine Perception,” the researchers point out the problems in current methods that compare deep neural networks and the human vision system.

In their research, the scientists conducted a series of experiments that dig beneath the surface of deep learning results and compare them to the workings of the human visual system. Their findings are a reminder that we must be cautious when comparing AI to humans, even if it shows equal or better performance on the same task.

The complexity of human and computer vision

In the seemingly endless quest to reconstruct human perception, a field that has become known as computer vision, deep learning has so far yielded the most favorable results. Convolutional neural networks (CNNs), an architecture often used in computer vision, are accomplishing tasks that were extremely difficult with traditional software.

However, comparing neural networks to human perception remains a challenge. And this is partly because we still have a lot to learn about the human vision system and the human brain in general. The complex workings of deep learning systems also compound the problem. Deep neural networks work in very complicated ways that often confound their own creators.

In recent years, a body of research has tried to evaluate the inner workings of neural networks and their robustness in handling real-world situations. “Despite a multitude of studies, comparing human and machine perception is not straightforward,” the German researchers write in their paper.

In their study, the scientists focused on three areas to gauge how humans and deep neural networks process visual data.

How do neural networks perceive contours?

The first test involves contour detection. In this experiment, both humans and AI participants must say whether an image contains a closed contour or not. The goal here is to understand whether deep learning algorithms can learn the concept of closed and open shapes, and whether they can detect them under various conditions.

contour example
Can you tell which one of the above images contains a closed shape?

“For humans, a closed contour flanked by many open contours perceptually stands out. In contrast, detecting closed contours might be difficult for DNNs as they would presumably require a long-range contour integration,” the researchers write.

For the experiment, the scientists used ResNet-50, a popular convolutional neural network developed by AI researchers at Microsoft. They used transfer learning to fine-tune the AI model on 14,000 images of closed and open contours.

They then tested the AI on various examples that resembled the training data and gradually shifted in other directions. The initial findings showed that a well-trained neural network seems to grasp the idea of a closed contour. Even though the network was trained on a dataset that only contained shapes with straight lines, it also performed well on curved lines.

“These results suggest that our model did, in fact, learn the concept of open and closed contours and that it performs a similar contour integration-like process as humans,” the scientists write.

resnet-50 successful contour detection
The ResNet neural network was able to detect various open- and closed-contour images, despite having been trained on straight-line examples only.

However, further investigation showed that other changes that didn’t affect human performance degraded the accuracy of the AI model’s results. For instance, changing the color and width of the lines caused a sudden drop in the accuracy of the deep learning model. The model also seemed to struggle with detecting shapes when they became larger than a certain size.

resnet-50 unsuccessful contour detection
The ResNet-50 neural network struggled when presented with images that contained lines of different colors and thickness, and with shapes that were larger than those in the training set.

The neural network was also very sensitive to adversarial perturbations, carefully crafted changes that are imperceptible to the human eye but cause disruption in the behavior of machine learning systems.

contour adversarial example
The image on the right has been modified with adversarial perturbations, noise that is imperceptible to humans. To the human eye, both images are identical. But to a neural network, they are different images.
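To make the idea of an adversarial perturbation concrete, here is a minimal sketch using the fast gradient sign method (FGSM) on a toy logistic-regression classifier. This is not the attack or the model used in the paper; the 100-pixel "image," the weights, the bias, and all function names are hypothetical and chosen only for illustration.

```python
import math


def predict_proba(weights, bias, pixels):
    """Probability of class 1 under a toy logistic-regression 'network'."""
    score = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def fgsm_perturb(weights, pixels, true_label, epsilon):
    """Move every pixel by at most epsilon in the direction that raises the loss.

    For logistic regression with cross-entropy loss, the gradient with respect
    to pixel i is (p - y) * w_i. Since 0 < p < 1, its sign is -sign(w_i) when
    the true label y is 1 and +sign(w_i) when y is 0.
    """
    direction = -1 if true_label == 1 else 1
    return [p + epsilon * direction * sign(w) for w, p in zip(weights, pixels)]


# Hypothetical 100-pixel "image" and hand-picked weights (not from the paper).
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]
bias = 0.4
image = [0.5] * 100

adversarial = fgsm_perturb(weights, image, true_label=1, epsilon=0.01)
clean_prob = predict_proba(weights, bias, image)
adversarial_prob = predict_proba(weights, bias, adversarial)
```

In this toy setup, no pixel changes by more than 0.01 on a 0-to-1 scale, yet the predicted probability falls from roughly 0.60 to roughly 0.35, because the tiny per-pixel shifts add up across all 100 inputs. Real image classifiers have far more inputs, which is one reason small perturbations can have large effects.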

To further investigate the decision-making process of the AI, the scientists used a bag-of-features network, a technique that tries to localize the bits of data that contribute to the decision of a deep learning model. The analysis showed that “there do exist local features such as an endpoint in conjunction with a short edge that can often give away the correct class label,” the researchers write. That suggests the network can solve much of the task from local cues, without the long-range contour integration that humans appear to use.
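The following sketch illustrates why a purely local cue such as an endpoint can be enough to separate open from closed contours. It assumes a hypothetical vector representation in which a contour is a list of line segments; the paper's network works on pixels, so this is an analogy under stated assumptions, not the researchers' analysis.

```python
from collections import Counter


def dangling_endpoints(segments):
    """Return the points that are touched by exactly one segment end.

    Each segment is a pair of (x, y) points. In a closed contour every
    point joins two segments, so no point has degree 1.
    """
    degree = Counter()
    for start, end in segments:
        degree[start] += 1
        degree[end] += 1
    return [point for point, count in degree.items() if count == 1]


def looks_closed(segments):
    """Classify a contour as closed using only the local endpoint cue."""
    return not dangling_endpoints(segments)


# Hypothetical shapes: a square, and the same square with one side removed.
square = [((0, 0), (4, 0)), ((4, 0), (4, 4)), ((4, 4), (0, 4)), ((0, 4), (0, 0))]
open_square = square[:-1]
```

Here `looks_closed(square)` is true and `looks_closed(open_square)` is false, even though the check never traces the outline as a whole. A classifier that relies on such a shortcut can score well without learning anything like a global concept of closedness.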

Can machine learning reason about images?

The second experiment tested the abilities of deep learning algorithms in abstract visual reasoning. The data used for the experiment is based on the Synthetic Visual Reasoning Test (SVRT), in which the AI must answer questions that require understanding of the relations between different shapes in the picture. The tests include same-different tasks (e.g., are two shapes in a picture identical?) and spatial tasks (e.g., is the smaller shape in the center of the larger shape?). A human observer would easily solve these problems.

SVRT example
The SVRT challenge requires the participating AI to solve same-different and spatial tasks.
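As a rough illustration of the two task families, the sketch below labels toy scenes in which each shape is a set of pixel coordinates. The encoding, the function names, and the tolerance value are assumptions made for this example and are not taken from the SVRT definition.

```python
def normalize(shape):
    """Translate a shape (a set of (x, y) pixels) so its bounding box starts at the origin."""
    min_x = min(x for x, _ in shape)
    min_y = min(y for _, y in shape)
    return frozenset((x - min_x, y - min_y) for x, y in shape)


def is_same(shape_a, shape_b):
    """Same-different task: are the two shapes identical up to translation?"""
    return normalize(shape_a) == normalize(shape_b)


def bbox_center(shape):
    """Center of the shape's bounding box."""
    xs = [x for x, _ in shape]
    ys = [y for _, y in shape]
    return (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2


def is_centered(inner, outer, tolerance=1.0):
    """Spatial task: does the inner shape sit at the center of the outer shape?"""
    inner_x, inner_y = bbox_center(inner)
    outer_x, outer_y = bbox_center(outer)
    return abs(inner_x - outer_x) <= tolerance and abs(inner_y - outer_y) <= tolerance


# Hypothetical shapes for the same-different task.
l_shape = {(0, 0), (0, 1), (0, 2), (1, 2)}
shifted_l = {(x + 5, y + 3) for x, y in l_shape}
bar = {(0, 0), (1, 0), (2, 0)}

# Hypothetical shapes for the spatial task: a 9x9 square outline and two small blobs.
outline = {(x, y) for x in range(9) for y in range(9) if x in (0, 8) or y in (0, 8)}
middle_blob = {(4, 4), (4, 5), (5, 4), (5, 5)}
corner_blob = {(1, 1), (2, 1)}
```

With these definitions, `is_same(l_shape, shifted_l)` and `is_centered(middle_blob, outline)` are true, while `is_same(l_shape, bar)` and `is_centered(corner_blob, outline)` are false. The rules are trivial to state explicitly; the challenge for a neural network is to infer them from labeled images alone.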

For their experiment, the researchers used ResNet-50 and tested how it performed with training datasets of different sizes. The results show that a pretrained model fine-tuned on 28,000 samples performs well on both same-different and spatial tasks. (Previous experiments trained a very small neural network on a million images.) The performance of the AI dropped as the researchers reduced the number of training examples, but it degraded faster on same-different tasks.

“Same-different tasks require more training samples than spatial reasoning tasks,” the researchers write, adding, “this cannot be taken as evidence for systematic differences between feed-forward neural networks and the human visual system.”

The researchers note that the human visual system is naturally pre-trained on large amounts of abstract visual reasoning tasks. This makes it unfair to test the deep learning model in a low-data regime, and it is almost impossible to draw solid conclusions about differences in the internal information processing of humans and AI.

“It might very well be that the human visual system trained from scratch on the two types of tasks would exhibit a similar difference in sample efficiency as a ResNet-50,” the researchers write.

Measuring the recognition gap of deep learning

The recognition gap is one of the most interesting tests of visual systems. Consider the following image. Can you tell what it is without scrolling further down?

cat fur close-up

Below is the zoomed-out view of the same image. There’s no question that it’s a cat. If I had shown you a close-up of another part of the image (perhaps the ear), you might have had a greater chance of predicting what was in the image. We humans need to see a certain amount of overall shapes and patterns to be able to recognize an object in an image. The more you zoom in, the more features you’re removing, and the harder it becomes to distinguish what is in the image.

cat dissect
Depending on the features they contain, close-up views of different sections of the cat image have different effects on our perception.

Deep learning systems also operate on features, but they work in subtler ways. Neural networks sometimes find minuscule features that are imperceptible to the human eye but remain detectable even when you zoom in very closely.

In their final experiment, the researchers tried to measure the recognition gap of deep neural networks by gradually zooming in on images until the accuracy of the AI model started to degrade considerably.

Previous experiments show a large difference between the image recognition gap in humans and deep neural networks. But in their paper, the researchers point out that most previous tests on neural network recognition gaps are based on human-selected image patches. These patches favor the human vision system.

When they tested their deep learning models on “machine-selected” patches, the researchers obtained results that showed a similar gap in humans and AI.
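One way to picture a machine-selected search is the simplified sketch below. It starts from the full image, repeatedly moves to whichever corner crop the classifier scores highest, and stops when no smaller crop is still recognized. The crop scheme (four corner crops at 80% of the side length), the 0.5 threshold, and the toy classifier are assumptions for illustration; the paper's procedure may differ in its details.

```python
def crop(image, box):
    """Cut a square patch out of a 2D list; box is (left, top, size)."""
    left, top, size = box
    return [row[left:left + size] for row in image[top:top + size]]


def corner_crops(box):
    """Four corner sub-crops, each 80% of the parent's side length (rounded down)."""
    left, top, size = box
    child = size * 4 // 5
    shift = size - child
    return [(left + dx, top + dy, child) for dy in (0, shift) for dx in (0, shift)]


def recognition_gap(image, classify, threshold=0.5):
    """Descend through crops chosen by the classifier itself.

    Returns (smallest recognized box, gap), where gap is the drop in
    probability from that box to its best-scoring sub-crop. Returns None
    if the full image is not recognized, and a gap of None if the search
    reaches a single pixel without losing recognition.
    """
    box = (0, 0, len(image))
    prob = classify(crop(image, box))
    if prob < threshold:
        return None
    while box[2] > 1:
        scored = [(classify(crop(image, child)), child) for child in corner_crops(box)]
        best_prob, best_box = max(scored, key=lambda item: item[0])
        if best_prob < threshold:
            return box, prob - best_prob
        box, prob = best_box, best_prob
    return box, None


def toy_classifier(patch):
    """Hypothetical classifier: confidence grows with the number of object pixels visible."""
    return min(1.0, sum(sum(row) for row in patch) / 16)


# Hypothetical 10x10 image with a 4x4 "object" in the top-left corner.
image = [[1 if row < 4 and col < 4 else 0 for col in range(10)] for row in range(10)]
result = recognition_gap(image, toy_classifier)
```

Because the classifier chooses which crop to follow at every step, the patches are selected on the machine's terms rather than a human's, which is the footing the researchers argue for.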

Recognition gap
The recognition gap test evaluates how zooming in on an image affects the precision of an AI’s

“These results highlight the importance of testing humans and machines on the exact same footing and of avoiding a human bias in the experiment design,” the researchers write. “All conditions, instructions and procedures should be as close as possible between humans and machines in order to ensure that all observed differences are due to inherently different decision strategies rather than differences in the testing procedure.”

Closing the gap between AI and human intelligence

As our AI systems become more complex, we will have to develop more complex methods to test them. Previous work in the field shows that many of the popular benchmarks used to measure the accuracy of computer vision systems are misleading. The work by the German researchers is one of many efforts that attempt to measure artificial intelligence and better quantify the differences between AI and human intelligence. And they draw conclusions that can provide directions for future AI research.

“The overarching challenge in comparison studies between humans and machines seems to be the strong internal human interpretation bias,” the researchers write. “Appropriate analysis tools and extensive cross checks – such as variations in the network architecture, alignment of experimental procedures, generalization tests, adversarial examples and tests with constrained networks – help rationalizing the interpretation of findings and put this internal bias into perspective. All in all, care has to be taken to not impose our human systematic bias when comparing human and machine perception.”

This article was originally published by Ben Dickson on TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the evil side of technology, the darker implications of new tech and what we need to look out for.

Published August 22, 2020 — 15:00 UTC