OCR & Computer Vision · 10–15 minutes reading time

Traditional computer vision and modern ML methods compared – Why real-world scenes require new solutions

An analytical look at typical image processing problems and why machine learning approaches outperform traditional methods in many real-world situations. Or how to compare apples with tomatoes.

Published on December 3, 2025 - Automatically translated

Cover image: on the left, a green apple as a simple, clearly structured object; on the right, a red tomato as a more complex motif – illustrating the difference between classic image processing and modern machine learning recognition.

Authors


Philip Zetterberg
Software AI Engineer, TRENPEX


Contributors


Angie Zetterberg
Public relations, TRENPEX



Why classic computer vision quickly reaches its limits in real-world scenarios

For many decades, image processing systems relied on fixed rules, mathematical operators, and handcrafted features. These methods are precise, easy to explain, and extremely reliable in controlled environments. However, as soon as real-world conditions come into play—changing light, shadows, textures, rain, reflections, or varying colors—the behavior of these algorithms changes dramatically. A method that delivers perfect results under studio conditions can fail when confronted with nothing more than a few raindrops.

This discrepancy has little to do with the quality of traditional methods. Rather, it lies in their basic principle: an algorithm such as the Sobel operator or global binarization makes decisions based on deterministic thresholds. If an area in the image suddenly becomes darker or lighter, the entire histogram shifts, and the parameters that were previously ideal no longer work. This effect has long been known in scientific literature. Otsu already pointed out in his original work that global thresholding methods are sensitive to lighting fluctuations.¹
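To make the effect concrete, here is a minimal sketch (assuming OpenCV and NumPy are available; the image path is a placeholder) of how a simulated shadow shifts a global Otsu threshold:

```python
# Minimal sketch: a simulated shadow shifts the global Otsu threshold and
# breaks a binarization that was previously ideal.
import cv2
import numpy as np

img = cv2.imread("label.png", cv2.IMREAD_GRAYSCALE)  # placeholder input image

# Global Otsu threshold on the clean image
t_clean, bin_clean = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Simulate a soft shadow over the left half: the whole histogram shifts
shadowed = img.copy()
shadowed[:, : img.shape[1] // 2] = (shadowed[:, : img.shape[1] // 2] * 0.5).astype(np.uint8)

t_shadow, bin_shadow = cv2.threshold(shadowed, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

print(f"Otsu threshold without shadow: {t_clean:.0f}")
print(f"Otsu threshold with shadow:    {t_shadow:.0f}")
# Characters in the shadowed half typically fall below the new global
# threshold and disappear or merge in bin_shadow.
```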

Machine learning models—especially deep learning-based architectures—have pushed these limits back considerably. Instead of requiring every pixel to meet a fixed threshold, they learn patterns, shapes, and context in the image. This enables them to tolerate variations that would bring traditional methods to a standstill. The differences become particularly clear when looking at specific scenarios that occur repeatedly in real-world applications.

A raindrop as a systematic problem of classical CV

A single drop of water can pose an almost insurmountable hurdle for a traditional OCR system. The reason is simple: when water lies on a surface, it refracts light, alters local contrasts, and creates patterns that a rule-based system was never designed to handle.

Let's imagine a serial number on a metal housing, photographed outdoors. A drop of water lands directly on one of the digits. From a classic CV perspective, this creates an irregular brightness distribution: parts of the digit are overexposed, others are darkened, and the transition appears smooth instead of sharp.

An edge operator such as Sobel or Canny "interprets" these distorted gradients as additional structures, leading to oversegmentation or even completely lost characters. Works such as those by Marr & Hildreth² and later by Canny³ himself emphasize how sensitive edge detection methods are to such disturbances.
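A small illustration of this sensitivity, again only a sketch with OpenCV and a placeholder image, adds a raindrop-like highlight and counts the extra edges Canny reports:

```python
# Minimal sketch: a raindrop-like brightness blob adds spurious gradients,
# and Canny reports additional edges around it.
import cv2
import numpy as np

img = cv2.imread("serial_number.png", cv2.IMREAD_GRAYSCALE)  # placeholder input

edges_clean = cv2.Canny(img, 50, 150)

# Simulate a water drop: a soft, bright circular highlight over one digit
drop = np.zeros_like(img, dtype=np.float32)
cv2.circle(drop, (120, 60), 18, 80, -1)            # position and size are arbitrary
drop = cv2.GaussianBlur(drop, (31, 31), 0)
distorted = np.clip(img.astype(np.float32) + drop, 0, 255).astype(np.uint8)

edges_drop = cv2.Canny(distorted, 50, 150)

# The extra edge pixels around the drop appear as additional "structures"
print("edge pixels clean:    ", np.count_nonzero(edges_clean))
print("edge pixels with drop:", np.count_nonzero(edges_drop))
```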

A deep learning model, on the other hand, does not calculate its decision based on individual pixels or only local gradients, but rather on global patterns. The shape of the digit, its context, its relative size, and previously seen variations are interpreted together. A drop may distort the local structure, but the global shape remains recognizable to a trained network.

Models such as CRAFT or CRNN show in practice that such distortions have little influence on detection or recognition, provided that similar examples were taken into account during training. Robustness against such disturbances is one of the reasons why ML-based OCR methods are becoming increasingly indispensable in outdoor areas, machine installations, or quality control.
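As a brief usage sketch, the open-source EasyOCR library bundles exactly this combination of a CRAFT-based detector and a CRNN-style recognizer; the image path below is a placeholder:

```python
# Usage sketch with the open-source EasyOCR library (pip install easyocr),
# which pairs a CRAFT-based detector with a CRNN-style recognizer.
import easyocr

reader = easyocr.Reader(["en"], gpu=False)        # loads pretrained detection + recognition models
results = reader.readtext("housing_serial.jpg")   # placeholder image path

for bbox, text, confidence in results:
    print(f"{text!r}  (confidence {confidence:.2f})  at {bbox}")
```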

Shadows and variable lighting – a classic problem that ML solves systematically

Another area where traditional methods quickly become unstable is lighting. Even minor changes—the shadow of a finger, a cloud, a reflective surface—can alter the binary separation line so much that entire characters disappear or stick together.

The problem is well documented scientifically. Serra's work on mathematical morphology⁴ and the literature on adaptive binarization⁵ repeatedly describe lighting as a key disruptive factor. Adaptive methods such as Sauvola or Wolf do operate locally, but they remain dependent on the distribution of intensities within each local window.
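The difference between a single global threshold and a local Sauvola threshold can be sketched with scikit-image (the image path is a placeholder and the parameters are illustrative, not tuned):

```python
# Minimal sketch: local Sauvola binarization adapts to uneven lighting,
# but still depends on the intensity statistics of each window.
from skimage import io, filters

img = io.imread("shadowed_label.png", as_gray=True)  # placeholder image path

t_otsu = filters.threshold_otsu(img)                  # one global threshold
binary_global = img > t_otsu

t_sauvola = filters.threshold_sauvola(img, window_size=25, k=0.2)  # per-pixel thresholds
binary_local = img > t_sauvola

# binary_local usually preserves characters under a soft shadow far better
# than binary_global, yet both still operate purely on local intensities.
```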

Machine learning approaches this situation in a fundamentally different way. A model learns what text looks like under very different lighting conditions: bright, dark, reflective, shadowed, or partially obscured. The key point is that the model does not interpret the absolute value of a pixel, but rather its pattern. From this perspective, text is not "brighter than the background," but rather a visual object with shape, structure, and context.

Studies on robust text recognition models—such as the work on SynthText (Gupta et al., 2016)⁶ or later on TextFuseNet (2020)⁷—show that even complex scenes with highly variable lighting conditions can be processed reliably. The models abstract from the lighting and focus on the invariant properties of the text.
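In practice, this robustness is largely trained in. A minimal sketch of lighting augmentations, assuming a torchvision-based training pipeline, might look like this:

```python
# Minimal sketch: lighting augmentations used while training a text recognizer
# so the model sees bright, dark, and shadowed text during training.
import torchvision.transforms as T

train_transform = T.Compose([
    T.ColorJitter(brightness=0.6, contrast=0.5),              # simulate exposure changes
    T.RandomAdjustSharpness(sharpness_factor=0.5, p=0.3),     # occasional soft focus
    T.ToTensor(),
])
# Applied to PIL images in the training pipeline; the recognizer then learns
# the shape of the text rather than its absolute brightness.
```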

When colors and material properties break the rules

While traditional OCR often works with black-and-white images or at least assumes that text contrasts with the background, the reality is quite different: colored labels, laser-etched lettering, matte and glossy materials, transparent surfaces.

A classic example is laser-engraved text on metal. From certain angles, the text appears darker than the background, while from others it appears lighter. A fixed decision rule such as "text = dark area" simply does not work here. Similar problems arise on colored plastic housings, painted surfaces, or digitally printed labels, whose color coating reflects differently depending on the angle.

Machine learning—especially CNN- and Transformer-based models—interprets the structure of a character independently of its absolute color. Research in the field of scene text recognition (STR) has been showing for years how well models can compensate for such variations. Jaderberg et al. (2016)⁸ and Baek et al. (2019)⁹ emphasize that color and material variations have no structural influence on recognition performance as long as the model has seen enough examples.

This shifts the focus away from fixed rules toward data-driven robustness, which traditional methods are hardly able to replicate.
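One way this data-driven robustness is typically achieved is through color and polarity augmentation during training; a minimal torchvision sketch (parameters are illustrative):

```python
# Minimal sketch: color and polarity augmentations that make a recognizer
# indifferent to whether text is lighter or darker than its background,
# or which color the label has.
import torchvision.transforms as T

color_invariant_transform = T.Compose([
    T.ColorJitter(hue=0.3, saturation=0.8),  # vary label / ink colors
    T.RandomInvert(p=0.5),                   # engraved text: dark-on-light vs light-on-dark
    T.Grayscale(num_output_channels=3),      # optionally drop color information entirely
    T.ToTensor(),
])
```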

Why complex geometry and perspective overwhelm classical methods

As soon as text is no longer flat, cleanly printed, or aligned orthogonally to the camera, traditional methods encounter difficulties. The assumption that characters have certain geometric properties—straight lines, clear contours, defined widths—often does not apply in real scenes. Even slight perspective distortions can cause letters to appear compressed, stretched, or skewed. This affects both edge detection and segmentation, as these methods are based on gradients, the course of which changes dramatically under perspective.

A classic example is text printed on cylindrical or curved objects, such as bottles, pipes, or tool housings. For classic CV, this results in an irregular projection: lines that are parallel in reality appear curved; distances appear inconsistent; entire characters can overlap or be lost locally during binarization. Studies on document recognition from the early 2000s—such as the work of Liang, Li, and Doermann—impressively demonstrate how strongly perspective distortions influence OCR quality.¹⁰

Deep learning models, on the other hand, learn these variations directly from the data. Instead of assuming that a character has a specific geometric shape, the model learns which features are invariant to transformations. Even early CNN models such as LeNet were transformation-invariant to a limited extent. With modern architectures—such as Spatial Transformer Networks, introduced by Jaderberg et al. in 2015—a systematic way to correct distortions directly within the model emerged.¹¹ These layers enable a network to actively "unwarp" or geometrically normalize image areas before they are further processed.
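The following is a strongly simplified PyTorch sketch of the idea behind a Spatial Transformer: a small localization network predicts an affine transform that geometrically normalizes the input before recognition. It is an illustrative reduction, not a reference implementation:

```python
# Simplified Spatial Transformer sketch (illustrative reduction of the idea
# from Jaderberg et al., 2015).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # Localization network: regresses 6 affine parameters from the image
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(10, 32), nn.ReLU(),
            nn.Linear(32, 6),
        )
        # Initialize to the identity transform so training starts "unwarped"
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                   # per-image affine matrix
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)   # geometrically normalized image

# Usage: stn = SpatialTransformer(); rectified = stn(batch)  # batch shape (N, 1, H, W)
```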

In practice, this means that text can be reliably recognized even when it is printed on slanted surfaces, uneven materials, or busy backgrounds. The model does not recognize individual pixels or edges, but rather the underlying pattern of the character. Even if the outline of a number in the image appears distorted or stretched, its structure remains recognizable to a suitably trained neural network.

Distorted or damaged characters as a challenge for rule-based image processing

Another area in which traditional image processing quickly reaches its limits is damage. As soon as a character shows scratches, wear, or partial failure, a rule-based system loses the necessary visual cues. The decision is based on the assumption that certain lines or curves are present in the image. If these are missing due to wear and tear, the system either interprets the character incorrectly or fails to recognize it at all.

Consider, for example, laser-engraved serial numbers on metal, which become partially illegible after a few years due to friction or corrosion. A classic OCR system would attempt to fill in the missing pixels or enhance them with morphological operators. However, these operations also increase noise and create additional artifacts—a well-known problem that is repeatedly described in the literature on morphology and segmentation.⁴
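A short OpenCV sketch illustrates the trade-off: a small structuring element closes hairline gaps, while a larger one also fuses noise and neighboring strokes (the file name is a placeholder):

```python
# Minimal sketch: morphological closing can bridge small gaps in a worn
# character, but larger kernels also merge noise and neighboring strokes.
import cv2

binary = cv2.imread("worn_digit_binary.png", cv2.IMREAD_GRAYSCALE)  # placeholder

kernel_small = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
kernel_large = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9))

repaired = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel_small)  # closes hairline gaps
overdone = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel_large)  # also fuses noise and strokes
```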

ML models operate fundamentally differently here. They do not learn the perfect form of a character, but rather the concept of a character. Even if parts are missing, deep networks can recognize which character was originally intended based on global features. The ability to correctly interpret incomplete information has been documented in several studies on robust sequence models, such as the work of Cheng et al. (2017)¹² and later the benchmarks of Baek et al. (2019)⁹.

These models are not based on individual contours, but on structural properties such as relative proportions, typical transitions, or contextual dependencies between neighboring characters. This means that even damaged areas are supplemented by the overall image—a capability that classic CV can only achieve in exceptional cases with a great deal of manual effort.

Font diversity: A problem of rules, not data

Another issue that quickly overwhelms traditional CV is the wide variety of fonts, stroke widths, spacings, and styles. Each new font requires a wealth of adjustments in a rule-based OCR system: new templates, new thresholds, new segmentation rules.

Traditionally, attempts have been made to mitigate this problem with templates or feature descriptors. However, even features such as HOG or Zernike moments—as valuable as they are—cannot fully capture the enormous variability of modern typefaces. Any shift, rounding deviation, or stylistic variation can make a handcrafted feature ambiguous. Research on feature invariance therefore provided early indications that rule-based systems reach their limits as soon as the style of a character deviates too much from what is expected.¹⁴
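This ambiguity can be made tangible with a small scikit-image sketch that compares HOG descriptors of the same letter rendered in two different fonts (the image crops are placeholders):

```python
# Minimal sketch: HOG descriptors of the "same" letter in two different fonts
# can differ substantially, which is exactly what breaks template- and
# feature-rule matching.
import numpy as np
from skimage import io
from skimage.feature import hog
from skimage.transform import resize

def descriptor(path):
    img = resize(io.imread(path, as_gray=True), (64, 64))
    return hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

d_serif = descriptor("A_serif.png")    # placeholder crops of the letter "A"
d_script = descriptor("A_script.png")

cos_sim = np.dot(d_serif, d_script) / (np.linalg.norm(d_serif) * np.linalg.norm(d_script))
print(f"cosine similarity between the two 'A' descriptors: {cos_sim:.2f}")
```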

For ML models, however, diversity is not a disadvantage but an advantage—provided that the training data covers these variations. In scene text recognition, models are regularly trained on thousands of different fonts, using both real and synthetically generated datasets. The model does not learn the exact shape of an "A," but rather the structure that defines an "A" across many variations. This is precisely what makes modern systems so robust against fonts they have never seen before.

Work such as the SynthText pipeline by Gupta et al.⁶ and the benchmarks by Baek et al.⁹ show that models can generalize writing styles that would have to be explicitly modeled in classic OCR.
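A heavily simplified, SynthText-inspired sketch of such synthetic data generation, assuming Pillow and placeholder font paths and words, could look like this:

```python
# Heavily simplified sketch: render words in many fonts onto varied backgrounds
# to give a recognizer font diversity during training.
import random
from PIL import Image, ImageDraw, ImageFont

FONTS = ["fonts/roboto.ttf", "fonts/courier.ttf", "fonts/dejavu_serif.ttf"]  # placeholders
WORDS = ["A738-XK", "SERIAL", "Lot 42"]                                      # placeholders

def synth_sample():
    word = random.choice(WORDS)
    font = ImageFont.truetype(random.choice(FONTS), size=random.randint(24, 48))
    bg = random.randint(120, 255)
    img = Image.new("L", (256, 64), color=bg)
    draw = ImageDraw.Draw(img)
    draw.text((random.randint(0, 40), random.randint(0, 16)), word,
              fill=random.randint(0, 100), font=font)
    img = img.rotate(random.uniform(-5, 5), fillcolor=bg)  # slight skew
    return img, word  # image plus its ground-truth label

samples = [synth_sample() for _ in range(1000)]
```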

Why unstructured scenes and complex backgrounds favor ML-based systems

In recent years, there has been a growing number of applications in which text appears not on clearly defined surfaces but in completely unstructured environments. Traffic signs, displays, packaging, machine recordings, screens in production, or device displays—all of these scenes feature a wide variety of backgrounds, colors, materials, and structures. For classic methods that rely heavily on contrasts, fixed edges, or simple segmentation rules, such scenes pose a fundamental challenge.

The central problem is that the background often has the same pattern characteristics as the text itself. For example, highly reflective metal can produce gradients similar to those found at the edges of characters. Tree edges, cables, lines, and shadows can look like fragments of letters. This problem has been frequently described in the context of scene text detection, including in the work of Neumann & Matas (2012) and later in the benchmarks of ICDAR competitions. Classic methods must attempt to define rules that distinguish between text and background patterns—a task that becomes nearly impossible in unstructured scenes.

Deep learning models, on the other hand, no longer view the image as the sum of individual pixels, but as a whole context. A neural network recognizes which patterns in the image actually belong to text because it has learned what text typically looks like in a wide variety of situations. This ability has contributed significantly to modern models far outperforming classic approaches in benchmarks in recent years. The work of Liao et al. on DBNet¹⁷ and the later transformer-based OCR models¹⁸ show that even very busy backgrounds no longer necessarily mean a loss of information.
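As a hedged usage sketch, the open-source PaddleOCR package ships a DB-style detector in the spirit of Liao et al. by default; the call below follows its classic 2.x interface and uses a placeholder image path:

```python
# Usage sketch with the open-source PaddleOCR package, whose default text
# detector is a DB-style model (classic 2.x interface assumed).
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en")               # loads detection + recognition models
result = ocr.ocr("cluttered_scene.jpg")  # placeholder image path

for box, (text, confidence) in result[0]:  # one entry per detected text region
    print(f"{text!r} ({confidence:.2f}) at {box}")
```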

When motion blur or overlaps occur

Another area in which rule-based methods consistently reach their limits is motion blur. As soon as an image is blurred by movement—for example, due to fast camera pans, vibrating machines, or passing objects—classic feature extractors lose their basis. Edges become blurred, characters flow into one another, and the shape of numbers or letters is no longer clearly defined. Work from the 1990s already shows how sensitive classic CV processes are to such disturbances.¹⁶

Machine learning models, on the other hand, are surprisingly good at interpreting these distortions. The reason for this lies in the training: many data sets contain images with synthetic or real blurring. This teaches models what typical characters look like despite motion blur. Modern architectures, such as attention-based models, focus on relevant image areas even if the edges are not clearly defined. Work on STN (Spatial Transformer Networks)¹¹ and later in the transformer context shows how well models can compensate for geometric distortions.¹⁸
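Such blur is often added synthetically during training. A minimal OpenCV sketch of a horizontal motion-blur augmentation (the kernel length is illustrative):

```python
# Minimal sketch: synthetic horizontal motion blur, as commonly used to
# augment OCR training data for moving cameras or vibrating parts.
import cv2
import numpy as np

def motion_blur(img, length=15):
    kernel = np.zeros((length, length), dtype=np.float32)
    kernel[length // 2, :] = 1.0 / length   # horizontal blur streak
    return cv2.filter2D(img, -1, kernel)

img = cv2.imread("serial_number.jpg")       # placeholder image path
blurred = motion_blur(img, length=21)
```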

The situation is similar with overlaps and partial coverings. In many practical scenarios, characters are not completely visible: a scratch covers a spot, a sticker covers two letters, or an object partially covers the text. Classic CV has little chance here. Without complete contours, characters cannot be classified cleanly, and even morphological additions often create more artifacts than solutions.

ML-based systems, on the other hand, often learn the "idea" of a character and can reconstruct missing parts from the context. Studies on robust text recognition—such as the work of Cheng et al.¹², Baek et al.⁹, and later STR research—show that modern models can interpret even incomplete characters with a high degree of certainty. This ability is not based on explicit rules, but on statistical knowledge of how character shapes are typically structured.
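Occlusion, too, is commonly simulated during training. A minimal torchvision sketch using random erasing (parameters are illustrative):

```python
# Minimal sketch: random erasing simulates stickers, scratches, or partial
# occlusion so the model learns to infer characters from the remaining context.
import torchvision.transforms as T

occlusion_transform = T.Compose([
    T.ToTensor(),
    T.RandomErasing(p=0.5, scale=(0.02, 0.15), value="random"),  # mask a random patch
])
```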

A conclusion about two worlds of visual processing

The comparison between traditional computer vision and modern machine learning approaches does not reveal competition between "old" and "new," but rather a development with clear specializations. Classic CV methods are fast, transparent, and extremely reliable in controlled scenarios. They have formed the backbone of industrial image processing for decades and remain valuable when light, perspective, and material properties are stable and do not change.

However, as soon as real-world conditions come into play—weather, material wear, variable lighting, perspective, motion blur, busy backgrounds—these methods naturally reach their limits. Their strict adherence to fixed rules makes them sensitive to any form of deviation. Machine learning models, on the other hand, are based on statistical generalization. They recognize patterns even when they are partially obscured, distorted, or disrupted because they have learned from a large number of examples how text appears in very different situations.

Scientific developments in recent years clearly show that ML-based methods are not just an extension of traditional methods, but structurally overcome many of their core problems. From robust text detectors to sequence models to transformer-based full systems, developments have made OCR and text analysis in complex scenes practicable for the first time.

Today, two worlds exist side by side: the stability of traditional methods and the flexibility of modern ML models. The value lies not in considering one of these worlds superior, but in understanding which method is suitable for which conditions—and how both can work together to make a system reliable.

References

¹ Cf. Otsu – A threshold selection method from gray-level histograms, 1979

² Cf. Marr & Hildreth – Theory of edge detection, 1980

³ Cf. Canny – A computational approach to edge detection, 1986

⁴ Cf. Serra – Image Analysis and Mathematical Morphology, 1982

⁵ Cf. Sauvola & Pietikäinen – Adaptive Document Image Binarization, 2000

⁶ Cf. Gupta et al. – SynthText in the Wild: Generating Training Data for Text Recognition, 2016

⁷ Cf. Ye et al. – TextFuseNet: Scene Text Detection with Richer Fused Features, 2020

⁸ Cf. Jaderberg et al. – Deep Structured Output Learning for Unconstrained Text Recognition, 2016

⁹ Cf. Baek et al. – What is wrong with scene text recognition model comparisons?, 2019

¹⁰ Cf. Liang, Li & Doermann – Camera-based analysis of text and documents: a survey, 2005

¹¹ Cf. Jaderberg et al. – Spatial Transformer Networks, 2015

¹² Cf. Cheng et al. – Focusing Attention: Towards Accurate Text Recognition in Natural Images, 2017

¹³ Cf. Belongie et al. – Shape Matching and Object Recognition using Shape Contexts, 2002

¹⁴ Cf. Neumann & Matas – Real-Time Scene Text Localization and Recognition, 2012

¹⁵ Cf. ICDAR Robust Reading Competitions, 2011–2023

¹⁶ Cf. Trier, Jain & Taxt – Feature extraction methods for character recognition: A survey, 1996

¹⁷ Cf. Liao et al. – DBNet: Real-Time Scene Text Detection with Differentiable Binarization, 2020

¹⁸ Cf. Kim et al. – Donut: Document Understanding Transformer without OCR, 2022

Would you like to learn more about computer vision and modern ML methods?

Our team is happy to help—just contact us if you have any questions about computer vision and modern ML methods.

Get in touch