How Image Classification Began – From Perceptrons to Early Visual Models
The automatic classification of images is one of the oldest areas of research in computer vision. The idea of assigning a semantic label to visual inputs emerged as early as the 1950s, when researchers first attempted to teach machines to recognize simple patterns. The Mark I Perceptron was one of the earliest models capable of processing visual inputs and classifying them.¹ Although these systems were extremely limited, they marked the beginning of a development that would later transform entire industries.
In the decades that followed, it became clear that simple neural networks were insufficient for capturing the diversity of visual information. Images contain spatial structures, textures, shapes, and complex interdependencies that linear decision boundaries cannot capture. The Neocognitron, which Kunihiko Fukushima developed in the late 1970s and published in 1980, was a first major step toward an architecture specifically designed for visual data. This model laid the foundation for hierarchical feature processing, which later became a central component of modern image classifiers.²
The Breakthrough in Deep Learning – The Era of Convolutional Neural Networks
The year 2012 marked a decisive turning point. With AlexNet, researchers demonstrated a deep, GPU-trained convolutional neural network (CNN) that achieved a dramatic improvement in accuracy in the ImageNet competition.³ Following this success, CNNs quickly became the dominant approach to image classification.
What makes CNNs unique is their ability to learn features directly from raw pixels, rather than having to define them manually. The lower layers of the network learn simple structures such as edges and textures, while higher layers represent more complex shapes and objects. This hierarchical learning has largely rendered hand-crafted features obsolete and, for the first time, enabled robust classification systems for large, heterogeneous image datasets.
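To make the idea of a lower-layer filter concrete, here is a minimal sketch of what a single convolutional filter computes. The kernel is a hand-set Sobel filter chosen purely for illustration; a trained CNN would learn its kernel weights from data. The helper name `conv2d_valid` and the toy image are assumptions, not part of any particular model.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map.

    Like most deep learning libraries, this computes cross-correlation:
    the kernel is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hand-set Sobel kernel: responds to dark-to-bright transitions from left to right.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# Toy 4x6 grayscale image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

response = conv2d_valid(image, SOBEL_X)
for row in response:
    print(row)
```

The response is zero in the uniform regions and non-zero only where the window straddles the vertical edge, which is the kind of low-level feature the first layers of a CNN tend to pick up.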
In the years that followed, architectures such as VGG, ResNet, and GoogLeNet emerged, further refining this fundamental principle. VGG demonstrated that depth and simplicity can lead to significant performance gains.⁴ ResNet introduced the concept of “residual connections,” which made it possible to train extremely deep networks stably.⁵ GoogLeNet demonstrated how networks can be designed more efficiently by processing different filter sizes in parallel.⁶ These innovations made image classification not only more accurate but also more versatile and scalable.
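The residual connection can be sketched in a few lines. This is a simplified illustration, not the ResNet implementation: the learned transformation F is stood in for by a single fully connected layer (in ResNet it is a small stack of convolutions), and all names and values are hypothetical.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(values, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def residual_block(x, weights, bias):
    """Return relu(x + F(x)); the input skips past F and is added back in."""
    fx = linear(x, weights, bias)  # stand-in for the learned transformation F
    return relu([xi + fi for xi, fi in zip(x, fx)])


x = [1.0, 2.0, 3.0]
zero_weights = [[0.0] * 3 for _ in range(3)]
zero_bias = [0.0] * 3

# With F(x) = 0 the block simply passes its (non-negative) input through.
passthrough = residual_block(x, zero_weights, zero_bias)
print(passthrough)
```

Because the block only has to learn a correction on top of the identity, a layer that contributes nothing does not degrade the signal, which is one intuition for why very deep stacks of such blocks remain trainable.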
Efficiency, Depth, and Optimization – Advances in Classical CNNs
As network sizes and data volumes grew, it became clear that depth was not the only factor that could improve classification performance. Researchers developed increasingly efficient architectures that reduced computational costs without sacrificing accuracy. The Inception architecture is an early example of how parallel branches with different filter sizes can achieve better results with fewer resources.⁶
At the same time, variants such as DenseNet and MobileNet emerged, each addressing different challenges—from efficient gradient propagation to optimization for mobile devices. Overall, this has led to a wide range of specialized CNN models suitable for a variety of applications: high-resolution classification in data centers, real-time inference on edge devices, and energy-efficient models for hardware with limited processing power.
This phase demonstrates just how adaptable CNNs were. For many years, they formed the backbone of nearly all practical image classification systems—in industry, medicine, autonomous robotics, and consumer applications.
The Arrival of Transformers in the Visual World
The Vision Transformer (ViT), introduced in 2020, brought a completely new approach to image classification. Instead of extracting features locally via convolutions, the Vision Transformer splits an image into small patches, treats them as a sequence of tokens, and relates all of them to one another using self-attention.⁷ This allows global relationships within the image to be modeled much more directly, without the need for local filter structures.
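The two core steps, cutting the image into patches and letting every patch attend to every other patch, can be sketched as follows. This is a deliberately reduced illustration: it uses a single attention head with identity query/key/value projections and omits the learned patch embedding, position embeddings, and class token of the actual ViT. The function names and the toy 4x4 image are assumptions.

```python
import math


def patchify(image, patch):
    """Split an HxW image (list of rows) into flattened, non-overlapping patches.

    Assumes H and W are multiples of `patch`.
    """
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[top + i][left + j]
                           for i in range(patch) for j in range(patch)])
    return tokens


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity Q/K/V projections."""
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        attn = softmax(scores)  # how strongly this patch attends to every patch
        weights.append(attn)
        outputs.append([sum(a * v[i] for a, v in zip(attn, tokens))
                        for i in range(d)])
    return outputs, weights


# Toy 4x4 "image" with values in [0, 1), split into four 2x2 patches.
image = [[(r * 4 + c) / 16 for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
outputs, weights = self_attention(tokens)
print(len(tokens), "tokens of dimension", len(tokens[0]))
```

Each row of `weights` spans all patches, including those on the opposite side of the image, which is what the text means by modeling global relationships directly.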
The success of this architecture demonstrated that image classification does not necessarily have to rely on convolution. The ability to establish relationships between distant regions of an image led to a new generation of models that were competitive or superior in many benchmarks. At the same time, hybrid approaches emerged that combine convolutions and self-attention to link local details with global context.
Recent research has further advanced Vision Transformer models by incorporating more efficient attention mechanisms, distance-sensitive structures, or modular combinations of CNN and Transformer components. This has resulted in systems that are both powerful and practical, and can be applied across a wide range of fields.
Why modern image classification is now virtually inconceivable without machine learning models
Image classification has evolved from simple experimental models to highly complex architectural systems that capture visual information across multiple levels. Traditional methods based on fixed rules or hand-engineered features played an important role for decades, but they are significantly limited compared to today’s machine learning approaches.
Modern deep learning models are capable of learning image representations that are structured both locally and globally. They do not require manually defined features, are robust against variations, and can be flexibly adapted to new data sources. These characteristics make them the standard for nearly all real-world image classification tasks—from industrial inspections and medical diagnostics to multimodal systems that combine visual and linguistic information.
Current Models: How Image Classification Has Evolved at the Cutting Edge
After convolutional neural networks had been the dominant approach for nearly a decade, research increasingly shifted from purely improving performance to two questions: how much further can accuracy actually be improved, and how efficient can a model be at the same time? During this phase, architectures emerged that reinterpreted classical convolutional networks while also benefiting from self-supervised learning and large-scale pretraining strategies.
A prominent example of this is ConvNeXt V2. The architecture builds on the idea of stylistically aligning modern convolutional neural networks (ConvNets) with Transformer designs, but takes it a step further: it combines architectural improvements with fully convolutional masked autoencoder pretraining and introduces a new normalization component called Global Response Normalization (GRN).⁸ In the accompanying paper, the authors report Top-1 accuracies on ImageNet that rival those of large vision transformers, achieved exclusively with publicly available training data—reaching up to approximately 88.9% Top-1 accuracy with the largest variant.
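As a rough sketch of what Global Response Normalization does, the following follows the three steps described in the ConvNeXt V2 paper: aggregate each channel globally, normalize the result across channels, and use it to recalibrate the input. It operates on plain nested lists rather than tensors, and the function name, the `eps` value, and the toy input are assumptions made for illustration.

```python
import math


def global_response_norm(x, gamma, beta, eps=1e-6):
    """Apply GRN to a feature map `x` given as nested lists with shape [H][W][C].

    `gamma` and `beta` are per-channel lists (learnable parameters in a real model).
    """
    h, w, c = len(x), len(x[0]), len(x[0][0])
    # 1) Global aggregation: L2 norm of each channel over all spatial positions.
    gx = [math.sqrt(sum(x[i][j][k] ** 2 for i in range(h) for j in range(w)))
          for k in range(c)]
    # 2) Divisive normalization: each channel's norm relative to the channel mean.
    mean_gx = sum(gx) / c
    nx = [g / (mean_gx + eps) for g in gx]
    # 3) Calibration: rescale the input, apply the affine terms, add a residual.
    return [[[gamma[k] * x[i][j][k] * nx[k] + beta[k] + x[i][j][k]
              for k in range(c)]
             for j in range(w)]
            for i in range(h)]


# Toy feature map with one spatial position and two channels.
x = [[[3.0, 1.0]]]

# With gamma = beta = 0 the layer is an identity mapping.
unchanged = global_response_norm(x, gamma=[0.0, 0.0], beta=[0.0, 0.0])

# With gamma = 1 the stronger channel is amplified more than the weaker one.
calibrated = global_response_norm(x, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(unchanged, calibrated)
```

The effect is a form of competition between channels: channels with an above-average global response are boosted, while weaker ones are boosted less.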
These models are not only interesting as “leaderboard models” but are already being used in practical applications: for example, in document classification, in the recognition of decorative patterns in architecture, or in specialized domains such as mushroom classification, where ConvNeXt V2 variants are reported to outperform other established architectures such as ResNet, Swin Transformer, or MobileViT.⁹ This makes it clear that, despite the success of Vision Transformers, modern convolutional networks are anything but obsolete; they continue to be developed in parallel.
Vision transformer families and large foundation encoders
At the same time, Vision Transformer-based models have evolved into an ecosystem of their own. Building on the original ViT, numerous variants have emerged in recent years, each with a different focus: improved data efficiency, more stable training, higher resolution, or robust self-supervision. This trend has become particularly evident in models such as EVA-02, which combine masked image modeling with strong pretraining schemes.¹⁰
EVA-02 utilizes an advanced Transformer architecture and is pre-trained using a CLIP-Vision encoder as a "teacher." The authors report that a variant with approximately 304 million parameters achieves 90.0% top-1 accuracy on ImageNet-1K—using only publicly available data.¹⁰ At the same time, EVA-02 variants demonstrate remarkable performance in zero-shot scenarios and are increasingly serving as a general visual representation for a wide variety of tasks.
A second line of work consists of so-called “vision foundation models” such as InternViT. These models are no longer developed solely for a single benchmark such as ImageNet, but rather as universal encoders that are reused in multitask or multimodal systems. InternViT models serve as the visual backbone of the InternVL family, and their quality is evaluated on tasks such as classical image classification and semantic segmentation.¹¹
Such foundation encoders are designed to cover a wide range of visual patterns: natural images, technical scenes, and domain-specific data. In this context, image classification is no longer the goal, but rather a key tool for evaluating the quality of representation.
Self-supervised learning and large pre-training datasets
A key driver behind current model generations is the shift in pretraining. Instead of relying exclusively on traditionally labeled datasets such as ImageNet, many studies now use masked image modeling or related self-supervised strategies. In these approaches, models learn to reconstruct masked image regions or maintain consistency across augmentations before being fine-tuned for specific classification tasks.
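The masking-and-reconstruction objective can be sketched as follows. This is only an illustration of the training signal, not of any specific model: the "prediction" is a stand-in that outputs zeros, the mask ratio of 0.6 is an arbitrary choice for the example, and all function names are hypothetical.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Return a sorted list of patch indices to hide from the model."""
    num_masked = round(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def masked_mse(target_patches, predicted_patches, masked_indices):
    """Mean squared error computed only over the masked patches."""
    total, count = 0.0, 0
    for idx in masked_indices:
        for t, p in zip(target_patches[idx], predicted_patches[idx]):
            total += (t - p) ** 2
            count += 1
    return total / count


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Ten toy patches of four pixels each; patch i is filled with the value i.
patches = [[float(i)] * 4 for i in range(10)]
masked = random_mask(len(patches), mask_ratio=0.6, rng=rng)

# Stand-in for a model output: a real model would predict the hidden patches
# from the visible ones; here every pixel is simply predicted as 0.
prediction = [[0.0] * 4 for _ in patches]

loss = masked_mse(patches, prediction, masked)
print("masked patches:", masked, "loss:", loss)
```

The loss is computed on the hidden patches only, so the model cannot score well by copying visible pixels; it has to infer the missing content from context, and no class labels are involved.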
ConvNeXt V2, for example, combines a fully convolutional masked autoencoder with the actual classification model and demonstrates that the representations obtained through such pretraining schemes directly translate into higher accuracy and robustness.⁸ EVA-02 follows a similar principle but uses a strong CLIP encoder as a teacher and reconstructs its feature space instead of the raw image pixels.¹⁰
At the same time, there is an unmistakable trend toward ever-larger and more diverse datasets. Many current models are pre-trained on ImageNet-22K or other collections containing several million images and then fine-tuned for specific tasks. This creates a distinction between expensive, one-time pre-training and relatively inexpensive, domain-specific fine-tuning steps.
Multimodal models and image classification as a building block
Another current trend is the integration of image classification into multimodal systems. Vision-language models such as CLIP have demonstrated that image and text representations can be anchored in a shared space, enabling classification to be performed—in some cases—without explicit training on the target classes, for example through zero-shot labeling using text prompts.¹² More recent work, such as InternVL, scales this approach further and combines large vision encoders with language models to create multimodal systems that treat image classification as a byproduct rather than a primary task.¹³
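Zero-shot labeling with text prompts can be sketched as a nearest-neighbor search in the shared embedding space. The three-dimensional embeddings below are made-up toy values; in a real system they would come from the trained image and text encoders. The function names and the temperature of 0.07 are assumptions made for this example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, prompt_embeddings, temperature=0.07):
    """Return (best_prompt, probabilities) based on cosine similarity to each prompt."""
    prompts = list(prompt_embeddings)
    logits = [cosine(image_embedding, prompt_embeddings[p]) / temperature
              for p in prompts]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = {p: e / total for p, e in zip(prompts, exps)}
    return max(probs, key=probs.get), probs


# Toy text embeddings, one per candidate class prompt (hypothetical values).
prompt_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Toy image embedding that lies closest to the "cat" prompt.
image_embedding = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_embedding, prompt_embeddings)
print(label)
```

Adding a new class only requires adding a new prompt and its text embedding; the image encoder itself is not retrained.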
Interestingly, there are now studies that explicitly examine how well such multimodal models still perform at “classical” image classification. A 2024 study analyzes various multimodal large language models on benchmarks such as ImageNet and ObjectNet as well as on fine-grained classification, concluding that image classification performance varies significantly depending on how strongly the visual encoder has been optimized for fundamental visual categories.¹⁴ This shows that image classification remains an important litmus test, even in a world where many models go far beyond pure classification tasks.
As of today: Accuracy, efficiency, and practical relevance
Current state-of-the-art models achieve Top-1 accuracies on ImageNet-1K of roughly 89% to 90%, depending on the training regimen, dataset size, and architecture. ConvNeXt V2 and EVA-02 represent two distinct but complementary approaches: on the one hand, highly optimized ConvNets with self-supervised pretraining; on the other, large vision transformers with masked image modeling and, in some cases, multimodal capabilities.⁸⁻¹⁰
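For readers unfamiliar with the metric, Top-1 accuracy is the fraction of images for which the highest-scoring class is the correct one. A minimal sketch, with made-up scores and a hypothetical helper name:

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Class scores for four images over three classes (toy values).
scores = [
    [0.1, 0.7, 0.2],  # highest score: class 1
    [0.5, 0.3, 0.2],  # highest score: class 0
    [0.2, 0.3, 0.5],  # highest score: class 2
    [0.4, 0.5, 0.1],  # highest score: class 1
]
labels = [1, 0, 1, 0]  # true classes

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```

In this toy example two of the four top predictions are correct, while the true class is always among the top two, so the Top-2 figure is higher than the Top-1 figure.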
At the same time, there is a growing ecosystem of smaller, more efficient models designed for edge scenarios or real-time applications, which deliberately sacrifice some accuracy in favor of latency, memory usage, and energy efficiency. These variants now appear regularly in scientific and industrial publications when it comes to integrating image classification into real-world systems—ranging from the medical field to document workflows and specialized visual inspections.¹⁵
Image classification has thus evolved from a purely benchmark-oriented topic into a building block in a wide range of systems: as a standalone task, as an evaluation criterion for foundation encoders, and as an integral component of multimodal models.
Unresolved Challenges and Future Developments in Image Classification
Despite the enormous progress made in recent years, image classification continues to face key challenges that shape both scientific research and practical applications. Many modern models now achieve levels of accuracy that would have been almost unimaginable a decade ago, yet the complexity of real-world conditions repeatedly demonstrates that simply improving performance on benchmarks addresses only part of the problem.
One of the key challenges remains robustness against domain shift. Models trained on large, curated datasets often encounter conditions in practical use that differ significantly from the training examples: new camera properties, changed lighting, unfamiliar perspectives, or entirely different image distributions. Even state-of-the-art Vision Transformers and large ConvNeXt V2 models show performance drops when they encounter data that does not belong to their original training domain. Research on this topic emphasizes that larger datasets alone are not sufficient; what is crucial is a model’s ability to generalize structural patterns and understand visual concepts independently of context.
Added to this is the issue of data efficiency. Many state-of-the-art models owe their performance to extensive pretraining processes involving millions of images, often supplemented by self-supervised learning. While this approach works well in research, the question arises in real-world settings as to how such methods can be applied in areas where only limited or highly specialized image data is available. Industrial systems, medical applications, and domain-specific image sources, in particular, require models that can be trained reliably with limited data. This challenge has led to a growing emphasis on research in few-shot learning, domain adaptation, and self-supervised pretraining.
Another area of concern is interpretability. As models grow in size, so does the difficulty of making their decisions transparent. While early CNNs were still relatively easy to analyze, today’s models—especially large Transformer-based architectures—are often barely interpretable. The literature shows that even visualizations of attention maps do not always provide a clear picture of which structures a model actually uses to predict a class. This has both safety-related and regulatory implications, particularly in areas where models make decisions regarding critical processes.
Finally, the issue of bias remains a central focus of research. Many models learn statistical correlations derived from their training distribution, rather than from the real, semantically relevant structure. Work such as EVA-02 or InternViT impressively demonstrates how much representations can be improved when models are pre-trained with better-curated or more diverse data.¹⁰⁻¹¹ Nevertheless, evaluation studies show that even large vision encoders favor certain image types, styles, or object variants because these occur more frequently in training. The trend toward multimodal models further intensifies this discussion, as image and language biases intertwine in these systems.
Overall, this is a field of research that is far from complete. Modern architectures such as ConvNeXt V2, EfficientViT, EVA-02, and large InternViT encoders represent significant advances, but they do not solve all the fundamental problems. The coming years will likely be shaped less by ever-larger models and more by the question of how visual intelligence can be designed to be stable, explainable, and data-efficient—and how image classification fits as a building block into an increasingly multimodal, context-sensitive AI ecosystem.
References
¹ See Rosenblatt – The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, 1958
² See Fukushima – Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position, 1980
³ See Krizhevsky, Sutskever & Hinton – ImageNet Classification with Deep Convolutional Neural Networks, 2012
⁴ See Simonyan & Zisserman – Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG), 2014
⁵ See He, Zhang, Ren & Sun – Deep Residual Learning for Image Recognition (ResNet), 2015
⁶ See Szegedy et al. – Going Deeper with Convolutions (Inception / GoogLeNet), 2015
⁷ See Dosovitskiy et al. – An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT), 2020
⁸ See Woo et al. – ConvNeXt V2: Co-Designing and Scaling ConvNets with Masked Autoencoders, 2023
⁹ See Zhang et al. / Li et al. – Studies on the practical use of ConvNeXt V2 (mushroom classification, documents), 2023
¹⁰ See Fang et al. – EVA-02: A Visual Representation for Neon Genesis, 2023
¹¹ See Cao et al. – InternViT: Scaling Vision Transformers for Universal Visual Representation, 2024
¹² See Radford et al. – CLIP: Learning Transferable Visual Models from Natural Language Supervision, 2021
¹³ See Chen et al. – InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks, 2023
¹⁴ See Liang et al. – Evaluation of Multimodal Large Language Models on Image Recognition Benchmarks, 2024
¹⁵ See Howard et al. / Chen et al. – MobileNet, MobileViT, and other efficient models for edge vision, 2017–2023