How feature detection has evolved—and why modern systems need to understand context
Optical feature detection—that is, the automatic identification of relevant features or components in images—is one of the core tasks of modern computer vision. Whether dealing with industrial parts, safety-critical elements, or mechanical components, a system must first locate the object and then determine its condition or properties. This two-step process—detection followed by classification—now forms the backbone of many automated inspection processes.
One example would be the detection and condition classification of a train car brake lever. Although the concept may seem unspectacular, the task involves a wide range of challenges: The lever may appear under different lighting conditions, be partially obscured, vary in color, material properties, or position, and its state is often discernible only by visual differences of a few pixels.
Such scenarios clearly illustrate why the development of object detection has become so important over the past two decades. Traditional methods quickly reached their limits in these situations, whereas modern deep learning-based models are increasingly capable of reliably locating objects even in complex scenes.
Early Approaches: Rules, Characteristics, and the Limitations of Handcrafted Methods
Before neural networks became practical for object recognition, many systems were based on fixed rules or hand-crafted features. A well-known example of this is the Haar cascade detectors introduced by Viola and Jones in 2001.¹ They worked surprisingly well as long as the target object had a defined shape and the image was captured under controlled conditions. However, as soon as the perspective changed or parts of the object were obscured, detection performance plummeted.
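To make the Haar-cascade idea concrete, the following is a minimal sketch of its basic building block: a two-rectangle Haar-like feature computed with an integral image. The function names are illustrative assumptions and are not taken from any library; a real detector evaluates many such features in a boosted cascade, which is omitted here.

```python
def integral_image(image):
    """Summed-area table with one extra row and column of zeros."""
    height, width = len(image), len(image[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle at (x, y) with size w x h, via four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_haar_feature(ii, x, y, w, h):
    """Left half minus right half: responds to vertical edges (w must be even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Toy patch: bright left half, dark right half.
patch = [[10, 10, 0, 0],
         [10, 10, 0, 0]]
ii = integral_image(patch)
response = two_rect_haar_feature(ii, 0, 0, 4, 2)
```

Because the feature is a fixed difference of rectangle sums, it is fast to evaluate, but it is tied to one particular arrangement of bright and dark regions, which is one way to see why such detectors are sensitive to changes in perspective.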
The situation was similar with methods such as HOG+SVM, which Dalal and Triggs introduced in 2005.² The idea was to describe the appearance of an object using histograms of gradient orientations and then to classify each image window with a support vector machine trained on these descriptors. This method represented a significant advance over purely rule-based approaches, but it too was sensitive to changes in perspective, variations in lighting, and complex backgrounds.
For a component like a brake lever, this meant that as long as the position, angle, and lighting were controlled, such systems worked acceptably. But as soon as real-world operating conditions came into play—dirt, shadows, material reflections, varying contrasts—they lost their robustness and, with it, their practical usefulness.
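The gradient-orientation idea behind HOG can be sketched in a few lines. The example below computes an unsigned orientation histogram for a single cell; it is a simplified illustration under stated assumptions (central differences, hard bin assignment, no block normalization, hypothetical function name), not the full descriptor described by Dalal and Triggs.

```python
import math


def cell_orientation_histogram(cell, bins=9):
    """Magnitude-weighted histogram of gradient orientations in [0, 180) degrees."""
    height, width = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    # Border pixels are skipped so that central differences are always defined.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The final modulo guards against an angle that rounds to exactly 180.
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# Toy cell with a vertical edge: intensity rises from left to right.
cell = [[0, 0, 10, 10] for _ in range(4)]
hist = cell_orientation_histogram(cell)
```

In this toy cell all gradient energy falls into the first bin. Rotating the component or changing the viewing angle moves that energy into other bins, so a classifier trained on one pose no longer sees the descriptor it expects.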
The Transition to Deep Learning: Region-Based Methods and the Dawn of Truly Robust Detection
The introduction of region-based neural networks marked a fundamental shift. Work such as R-CNN (Girshick et al., 2014)³ and, later, Fast R-CNN and Faster R-CNN⁴ combined the idea of object proposals with deep convolutional neural networks for the first time. This enabled models not only to learn features automatically but also to localize objects robustly in complex scenes. Instead of relying on hand-crafted rules, the networks learned directly from the training data what constitutes an object and how it differs from its surroundings.
For technical components such as brake levers, this represented a significant improvement. Even when the object was partially obscured or viewed from an oblique angle, the network was often able to identify it correctly. This principle—the combination of visual feature extraction and object-specific regions—remains the foundation of many industrial solutions to this day.
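Region-based detectors typically score many overlapping candidate boxes for the same object, so a post-processing step is needed to keep only the best one. A minimal sketch of the two standard ingredients, intersection over union (IoU) and greedy non-maximum suppression, is shown below. The function names and the threshold of 0.5 are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate candidates for one object and one separate candidate.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box is kept.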
YOLO, SSD, and the Era of Real-Time Detection
Shortly thereafter, models emerged that further simplified and accelerated detection. Using the “You Only Look Once” approach, Redmon et al. introduced a model in 2016 that predicts bounding boxes and class probabilities for the entire image in a single pass through one network.⁵ YOLO and its successors (including YOLOv3, YOLOv5, YOLOX, and newer community-developed versions) made real-time detection possible without drastically reducing accuracy. At the same time, architectures such as SSD (Single Shot MultiBox Detector) by Liu et al.⁶ emerged, which pursued similar concepts.
These models made it possible to reliably locate objects even in situations with significant background noise or changing perspectives. To this day, this remains a crucial advantage for the analysis of technical components. A brake lever must not only be identifiable—it must also appear in scenes that are not perfectly lit or structurally unambiguous. It is precisely in these situations that such models achieve remarkable results in practice.
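The single-network idea can be illustrated by how a grid-based detector encodes a box. The sketch below decodes one prediction using the parameterization of the original YOLO paper, in which the box center is given relative to a grid cell and the box size relative to the whole image. The function name and the example values are assumptions for illustration; later YOLO versions use different, anchor-based encodings.

```python
def decode_cell_prediction(row, col, pred, grid_size, image_w, image_h):
    """Convert a per-cell prediction (x_off, y_off, w_rel, h_rel) to pixel corners.

    x_off and y_off lie in [0, 1] relative to the cell at (row, col);
    w_rel and h_rel lie in [0, 1] relative to the full image.
    """
    x_off, y_off, w_rel, h_rel = pred
    center_x = (col + x_off) / grid_size * image_w
    center_y = (row + y_off) / grid_size * image_h
    box_w = w_rel * image_w
    box_h = h_rel * image_h
    return (center_x - box_w / 2, center_y - box_h / 2,
            center_x + box_w / 2, center_y + box_h / 2)


# Center cell of a 7 x 7 grid on a 448 x 448 image.
box = decode_cell_prediction(3, 3, (0.5, 0.5, 0.2, 0.4), 7, 448, 448)
```

Every cell produces such predictions in the same forward pass, which is what removes the separate proposal stage of region-based methods.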
Detection Using Modern Transformer Models: From Pixels to Semantic Structure
Since 2020, another major paradigm shift has taken hold. DETR (Carion et al., 2020) introduced a fundamentally different design that builds object detection around a Transformer encoder-decoder placed on top of a convolutional backbone.⁷ Instead of relying on anchor boxes, non-maximum suppression, or multi-stage methods, DETR formulates detection as a set prediction problem: the model emits a fixed-size set of predictions, and during training each ground-truth object is assigned to exactly one prediction by bipartite matching. The result is a system that requires fewer hand-designed heuristics and can hold up well in structurally complex scenes.
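The matching step at the heart of DETR can be sketched as a one-to-one assignment between predicted and ground-truth boxes. The example below is a deliberately simplified illustration: it uses only an L1 box distance as the cost and finds the optimum by brute force over permutations, whereas DETR combines class probability, L1, and generalized IoU terms and solves the assignment with the Hungarian algorithm. All names are hypothetical.

```python
from itertools import permutations


def l1_cost(pred, target):
    """L1 distance between two boxes with the same coordinate layout."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_predictions(preds, targets):
    """Return (pairs, total_cost) for the cheapest one-to-one assignment.

    pairs holds (prediction_index, target_index) tuples. Requires
    len(preds) >= len(targets); brute force is only viable for small sets.
    """
    best_pairs, best_cost = [], float("inf")
    for chosen in permutations(range(len(preds)), len(targets)):
        cost = sum(l1_cost(preds[p], targets[t]) for t, p in enumerate(chosen))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(chosen)]
    return best_pairs, best_cost


# Boxes as (center_x, center_y, width, height), normalized to [0, 1].
preds = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.9, 0.9, 0.1, 0.1)]
targets = [(0.9, 0.9, 0.1, 0.1), (0.5, 0.5, 0.2, 0.2)]
pairs, cost = match_predictions(preds, targets)
```

Predictions left without a partner (index 1 here) are trained to predict "no object", which is why duplicate detections are discouraged during training rather than filtered afterwards.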
In later versions—such as Deformable DETR, DN-DETR, or DINO—the models were further improved and accelerated, enabling them to achieve speeds relevant for practical applications. For tasks involving the detection of an object—such as a brake lever—under varying angles, partial occlusion, or complex material structures, the significant advantages of such models become clear.
Because self-attention relates every image region to every other, Transformer-based detectors can take into account not only the object itself but also the context in which it is located. This can improve detection performance, especially when visual features vary significantly from one image to the next.
Classification of State: From Simple Features to Deep Neural Networks
Once an object has been located, the next question arises: What is its condition? This step, classification within the region of interest (ROI), has historically followed its own path of development.
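The two-step structure, detection followed by classification within the ROI, can be sketched as a small pipeline. The detector and classifier below are toy stand-ins so that the example runs without a trained model; their names, the box format, and the state labels are assumptions for illustration.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


def crop_roi(image: Image, box: Box) -> Image:
    """Cut the region of interest out of the image, clamped to its borders."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(width, x1), min(height, y1)
    return [row[x0:x1] for row in image[y0:y1]]


def inspect(image: Image,
            detect: Callable[[Image], List[Box]],
            classify: Callable[[Image], str]) -> List[Tuple[Box, str]]:
    """Step 1: locate objects. Step 2: classify the state inside each ROI."""
    return [(box, classify(crop_roi(image, box))) for box in detect(image)]


def toy_detect(image: Image) -> List[Box]:
    # Stand-in for a trained detector: always reports one fixed box.
    return [(1, 1, 3, 3)]


def toy_classify(roi: Image) -> str:
    # Stand-in for a trained state classifier: thresholds mean brightness.
    pixels = [p for row in roi for p in row]
    return "engaged" if sum(pixels) / len(pixels) > 100 else "disengaged"


toy_image = [[0, 0, 0, 0],
             [0, 200, 200, 0],
             [0, 200, 200, 0],
             [0, 0, 0, 0]]
results = inspect(toy_image, toy_detect, toy_classify)
```

Keeping the two stages behind separate callables means either one can be replaced, for example a classical classifier by a CNN, without touching the other.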
Early approaches often relied on fixed features such as HOG, LBP, or geometric metrics, and then performed classification using SVMs or decision trees. While these methods worked quite reliably in controlled scenarios, they suffered from the same limitations as classical object localization.
The advent of CNNs fundamentally changed this. Models such as AlexNet (Krizhevsky et al., 2012)⁸ and, later, ResNet (He et al., 2015)⁹ demonstrated that features learned by neural networks capture complex visual structure far better than the hand-crafted alternatives of the time. For state classification, such as distinguishing between different brake lever positions, this represented a significant leap in quality.
More recent architectures, such as Vision Transformers (Dosovitskiy et al., 2020), take this approach a step further by modeling relationships between image patches through self-attention.¹⁰ This can make them well suited to picking up the subtle differences that are critical in technical condition classification, while remaining robust to irrelevant variation.
This development made it possible to distinguish between complex states, even when the differences are subtle and the environment varies greatly.
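The self-attention operation used by Vision Transformers can be sketched as scaled dot-product attention over a list of patch embeddings. The example is a simplified single-head version in which queries, keys, and values are the input tokens themselves; the learned projection matrices of a real model are omitted, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Each output token is a weighted average of all input tokens."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * value[i] for w, value in zip(weights, tokens))
                        for i in range(dim)])
    return outputs


# Two toy "patch embeddings".
attended = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Because every token attends to every other token, information from a distant patch can influence the representation of a local one, which is the property that lets such models relate a small visual difference to the rest of the component.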
Why technical components pose particular challenges for feature detection
When it comes to mechanical or safety-critical components, it quickly becomes clear that detection is not merely a matter of locating an object. The crucial step comes next: understanding its condition. A component may be properly aligned, misaligned, engaged, disengaged, or partially damaged—and these differences are often only very subtly discernible to the naked eye.
Traditional image processing methods naturally struggled with this because they relied on clearly defined contours, consistent contrast, and geometric rules. Yet it is precisely these characteristics that are rarely present in technical components in practical use. Metallic surfaces reflect light to varying degrees, material aging alters the structure, grease or dirt accumulates on individual areas, and small mechanical defects change local shapes. While the basic object remains recognizable, its condition manifests itself in details that are often only apparent in the overall picture.
In this context, it becomes particularly clear why deep learning models have increasingly replaced traditional methods in recent years. While a rule-based system attempts to interpret the exact shape of a component, a neural network searches for patterns that are typical across many examples, not only for the object itself but also for its possible states. As a result, the analysis depends less on lighting, color, or surface and more on structural properties.
When reflections, materials, and wear alter the visual structure
Mechanical components made of metal or composite materials often exhibit widely varying reflectance properties. Even slight changes in lighting can cause the relevant area to appear either overexposed or too dark. For traditional methods based on binarization or edge detection, this often results in the loss of important image information.
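The effect of lighting on binarization can be shown with a toy example. Below, the same intensity pattern is thresholded at two brightness levels, once with a fixed threshold and once relative to the mean. The pixel values and function names are made up for illustration.

```python
def binarize_fixed(pixels, threshold=128):
    """Classical global binarization with a fixed threshold."""
    return [1 if p > threshold else 0 for p in pixels]


def binarize_relative(pixels):
    """Binarization relative to the mean brightness of the pixels."""
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


# The same pattern under dim lighting and under lighting that is 100 levels brighter.
dim = [40, 45, 120, 125, 42]
bright = [p + 100 for p in dim]

fixed_dim = binarize_fixed(dim)        # every pixel falls below the threshold
fixed_bright = binarize_fixed(bright)  # every pixel lies above the threshold
relative_dim = binarize_relative(dim)
relative_bright = binarize_relative(bright)
```

With the fixed threshold the pattern disappears in both cases, once as all zeros and once as all ones. The mean-relative variant recovers it here, but only because the brightness shift is uniform; local reflections on a metallic surface do not shift all pixels equally, so this kind of correction does not carry over to them.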
Deep learning models, on the other hand, interpret these visual fluctuations in a completely different way. Instead of directly evaluating the absolute distribution of color or brightness, they learn statistical patterns that persist regardless of the variations that occur. This robustness has been documented in numerous studies on industrial quality control, such as research on CNN-based surface analysis, which yields stable results even in the face of significant changes in brightness.¹¹
The situation is similar when it comes to wear and damage. A component that has been in use for years may exhibit scratches, nicks, or irregularities. In traditional systems, this leads to false detections because the visual features no longer match the expected template. Neural networks, on the other hand, can often classify such changes without difficulty, provided they were exposed to a sufficient variety of examples during training. Research on robust feature representations, such as the work by Geirhos et al. on “shape bias” vs. “texture bias”, shows that ImageNet-trained CNNs tend to rely on texture by default and that training which increases their shape bias improves robustness when the surface varies significantly.¹²
Partial obstructions and complex shapes: When the object is only partially visible
A particularly difficult challenge arises when the object is not fully visible. Components may be partially obscured by other elements or located within a spatial structure that blocks significant parts of the object. Traditional methods fail in such cases because they require complete contours to make a reliable identification.
Deep learning models, on the other hand, can often compensate for missing areas. They do not rely on individual edges, but rather on the overall structure of visual features, so the visible parts of an object can still provide enough evidence for a detection. Transformer-based architectures, which model visual relationships over longer distances, further enhance this capability. Studies such as those on DINO¹³ or Deformable DETR¹⁴ suggest that such models can still deliver accurate detections even when 20–40% of the object is occluded.
For subsequent classification of the condition, this means that the algorithm can make decisions even when critical areas are partially obscured. A condition that manifests itself only as a minor geometric change remains detectable because the model has learned what the structure of such a component looks like in various forms—including minor deviations, orientation effects, and material changes.
Transition to multimodal models
With the emergence of multimodal models, the field is beginning to change once again. Models such as CLIP or PaLI use not only images but also textual or symbolic descriptions to make decisions. This allows them to generalize more effectively in some cases, particularly when certain conditions are rarely encountered in training data.
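How a CLIP-style model can classify a state it has rarely seen can be sketched as a comparison between one image embedding and several text embeddings. In the example below the embeddings are made-up three-dimensional vectors and the prompts are hypothetical; in a real system both would come from the model's image and text encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the prompt whose embedding is most similar to the image embedding."""
    return max(text_embeddings,
               key=lambda prompt: cosine_similarity(image_embedding,
                                                    text_embeddings[prompt]))


# Made-up embeddings; a real encoder would produce vectors with hundreds of dimensions.
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = {
    "a photo of an engaged brake lever": [1.0, 0.0, 0.0],
    "a photo of a disengaged brake lever": [0.0, 1.0, 0.0],
}
label = zero_shot_classify(image_embedding, text_embeddings)
```

Since the classes are defined by text prompts rather than by labeled training images, a new state can be added by writing a new prompt, which is one reason such models may generalize to rarely observed conditions.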
Although these models have so far been used only selectively in traditional industrial detection, early research shows just how great the potential is when visual information is combined with semantic structures.¹⁵
This gives rise to a new approach to feature detection: no longer just “object detected,” but “object understood.”
References
¹ See Viola & Jones – Rapid Object Detection using a Boosted Cascade of Simple Features, 2001
² See Dalal & Triggs – Histograms of Oriented Gradients for Human Detection, 2005
³ See Girshick et al. – Rich Feature Hierarchies for Accurate Object Detection (R-CNN), 2014
⁴ See Ren et al. – Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015
⁵ See Redmon et al. – You Only Look Once: Unified, Real-Time Object Detection, 2016
⁶ See Liu et al. – SSD: Single Shot MultiBox Detector, 2016
⁷ See Carion et al. – End-to-End Object Detection with Transformers (DETR), 2020
⁸ See Krizhevsky et al. – ImageNet Classification with Deep Convolutional Neural Networks, 2012
⁹ See He et al. – Deep Residual Learning for Image Recognition (ResNet), 2015
¹⁰ See Dosovitskiy et al. – An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020
¹¹ See Song et al. – Surface Defect Detection via CNNs, 2019
¹² See Geirhos et al. – ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, 2019
¹³ See Zhang et al. – DINO: DETR with Improved DeNoising Anchor Boxes, 2022
¹⁴ See Zhu et al. – Deformable DETR: Deformable Transformers for End-to-End Object Detection, 2021
¹⁵ See Radford et al. – Learning Transferable Visual Models From Natural Language Supervision (CLIP), 2021