In the past decade, convolutional neural networks (CNNs) have been the backbone of computer vision applications. Traditionally, computer vision tasks have been handled with CNNs, which are designed to process data with a grid-like structure, such as an image. CNNs apply a series of filters to the input, extracting features such as edges, corners, and textures. Subsequent layers in the network combine these features into progressively more complex ones and ultimately produce predictions.
The CNN success story began around 2012 with the release of AlexNet and its very impressive performance in recognizing objects. Since then, a great deal of effort has gone into improving CNNs and applying them in many fields.
The dominance of CNNs has recently been rivaled by the introduction of the vision transformer (ViT) architecture. ViT has shown impressive results in object recognition, even outperforming the latest CNNs. Even so, the competition between CNNs and ViTs is still ongoing: depending on the task and dataset, one outperforms the other, and the results can flip when the testing setup changes.
ViT brings the power of transformers to computer vision by treating an image as a sequence of patches rather than a grid of pixels. These patches are processed with the same self-attention mechanism used in NLP transformers, allowing the model to weigh the importance of each patch based on its relationship to the other patches in the image.
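To make this concrete, here is a minimal PyTorch sketch of the ViT front end with illustrative sizes: an image is cut into fixed-size patches, each patch is linearly embedded, and the resulting sequence of tokens is passed through self-attention. It omits positional embeddings, the class token, and the rest of the Transformer block, so it is only a sketch of the idea, not the full architecture.

```python
import torch
import torch.nn as nn

# A 224x224 RGB image is split into 16x16 patches and embedded.
image = torch.randn(1, 3, 224, 224)           # (batch, channels, H, W)
patch_size, embed_dim = 16, 192

# A strided convolution is a common way to extract and embed patches at once.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
patches = patch_embed(image)                  # (1, 192, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)   # (1, 196, 192): 196 patch tokens

# Self-attention lets every patch token attend to every other patch.
attention = nn.MultiheadAttention(embed_dim, num_heads=3, batch_first=True)
out, weights = attention(tokens, tokens, tokens)
print(out.shape, weights.shape)               # (1, 196, 192) and (1, 196, 196)
```

The attention weight matrix has one row per patch, so each patch's representation is an explicit weighted combination of all the other patches in the image.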
One of the claimed advantages of ViT is that it is more efficient than CNNs because it does not require the computation of convolutional filters. This makes training easier and allows for larger models, which can improve performance. Another advantage is flexibility: since ViT treats the data as a sequence rather than a grid, it can handle inputs of varying size and aspect ratio without additional pre-processing. This is in contrast to CNNs, whose inputs are typically resized or cropped to fit a fixed input size.
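A short sketch of the flexibility claim, under the same illustrative setup as above: the patch embedding and attention modules accept different resolutions and simply produce token sequences of different lengths. (In a real ViT, the positional embeddings would still have to be interpolated for new sizes.)

```python
import torch
import torch.nn as nn

embed = nn.Conv2d(3, 192, kernel_size=16, stride=16)
attn = nn.MultiheadAttention(192, num_heads=3, batch_first=True)

# Two different resolutions, same modules, different sequence lengths.
for h, w in [(224, 224), (160, 320)]:
    tokens = embed(torch.randn(1, 3, h, w)).flatten(2).transpose(1, 2)
    out, _ = attn(tokens, tokens, tokens)
    print((h, w), "->", out.shape)   # (224,224) -> (1,196,192); (160,320) -> (1,200,192)
```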
Naturally, people wanted to understand the real advantages of ViTs over CNNs, and there have been many studies on this topic recently. However, those comparisons share a common problem in one form or another: they compare ViTs and CNNs using ImageNet accuracy as the metric, without considering that the ConvNets being compared may use slightly outdated design and training techniques.
So, how can we be sure we are making a fair comparison between ViTs and CNNs? We need to make sure that we are comparing only the architectural differences. The researchers in this paper define what the comparison should look like: “We believe that examining the differences that arise in the learned representations of Transformers and ConvNets in response to natural variations such as illumination, occlusions, object scale, object position, and others is important.”
This is the main idea behind the paper. But how can such a comparison be set up? Two main obstacles stood in the way. First, the Transformer and ConvNet architectures being compared were not matched in overall design and training techniques, so differences in performance could not be attributed to convolutions versus self-attention alone. Second, there was a dearth of datasets that include fine-grained natural variations of object scale, object position, scene lighting, and 3D occlusions, among others.
The first problem is solved by comparing the ConvNeXt CNN to the Swin Transformer architecture. The only essential difference between these networks is the use of convolutions versus self-attention.
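Both architectures are available in common model libraries, so a matched pairing is easy to reproduce. The sketch below uses the timm package; the model names and rough parameter counts reflect the "Tiny" variants and may differ slightly across library versions.

```python
import timm
import torch

# Similarly sized variants of the two families (set pretrained=True to
# download ImageNet weights; names depend on the installed timm version).
convnext = timm.create_model("convnext_tiny", pretrained=False)
swin = timm.create_model("swin_tiny_patch4_window7_224", pretrained=False)

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    print(convnext(x).shape, swin(x).shape)   # both (1, 1000) ImageNet logits

# Roughly matched capacity (~28M parameters each) is what makes the
# convolution-vs-attention comparison meaningful.
for name, model in [("ConvNeXt-T", convnext), ("Swin-T", swin)]:
    params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{name}: {params:.1f}M parameters")
```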
The main contribution of this paper is solving the second problem. The authors devise a way to test the architectures against such variations using simulation: they build a synthetic dataset, called the Natural Variation Object Dataset (NVD), which includes controlled modifications of the scene.
Counterfactual simulation is a way of reasoning about what might have happened, or what might happen, under different circumstances. It asks how the outcome of an event would differ if one or more of the factors that contributed to it were changed. In our context, it asks how the network's output changes if we alter the object position, scene lighting, 3D occlusion, and so on: will the network still predict the correct label for the object?
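In code, such a counterfactual test amounts to holding everything in the scene fixed except one factor, re-rendering, and checking whether the prediction survives the change. The following is a schematic sketch; the `render_scene` function, the factor names, and the factor values are hypothetical placeholders standing in for the paper's simulation pipeline.

```python
import torch

def counterfactual_robustness(model, render_scene, base_scene, factor, values, true_label):
    """Vary a single scene factor (e.g. object position, lighting angle,
    occlusion level) while keeping everything else fixed, and measure how
    often the model still predicts the correct class."""
    model.eval()
    correct = 0
    for value in values:
        scene = dict(base_scene, **{factor: value})   # counterfactual: only one factor changes
        image = render_scene(scene)                   # (1, 3, H, W) tensor from the simulator
        with torch.no_grad():
            pred = model(image).argmax(dim=1).item()
        correct += int(pred == true_label)
    return correct / len(values)

# Example call (render_scene, the scene description, and the factor grid are placeholders):
# robustness = counterfactual_robustness(convnext, render_scene,
#                                        base_scene={"object": "chair", "lighting": 0.5},
#                                        factor="occlusion",
#                                        values=[0.0, 0.2, 0.4, 0.6, 0.8],
#                                        true_label=559)
```

Running the same sweep for both models, factor by factor, yields per-variation robustness scores that can be compared directly between the two architectures.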
The results showed that ConvNeXt was consistently more robust than Swin to changes in object position and camera rotation. They also found that ConvNeXt tends to perform better than Swin at recognizing small objects. When it came to handling occlusions, however, the two architectures were nearly equivalent, with Swin slightly outperforming ConvNeXt under severe occlusion. At the same time, both architectures struggled with the natural variations in the test data, and it was observed that increasing the network size or the variety and quantity of training data improved robustness.
Check out the paper and the project. All credit for this research goes to the researchers on this project.
Ekrem Cetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Turkey. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and works as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.