OCR & Computer Vision · 10 minutes reading time

How traditional OCR is used in computer vision

A clear and scientifically sound look at how classic OCR methods are structured and why they remain part of modern image processing systems to this day.

Published on June 7, 2025 - Automatically translated

A conventional sheet of paper with lines on it, the contents of which are captured and processed by a camera.

Authors


Philip Zetterberg
Software AI Engineer, TRENPEX


Contributors


Dr. Christian Schüller
Head of Software Development, TRENPEX


Angie Zetterberg
Public relations, TRENPEX



The role of traditional OCR in modern image processing systems

Many business processes today generate image data that contains text in some form: scanned documents, photos of type plates, labels in logistics, serial numbers on components, measured values on displays, or handwritten notes on forms. To use this information efficiently, processes are needed that recognize text from an image and convert it into a form that can be further processed. Optical character recognition—OCR for short—is the key technology for this.

While machine learning is often in the spotlight these days, many systems continue to use traditional computer vision methods. This is not because these methods are "better," but because they function reliably and stably under certain conditions: for example, when lighting is controlled, fonts are consistent, or hardware offers only limited computing power. Classic OCR is therefore not a counter-model to AI, but a building block within an entire technical toolbox.

Structure of traditional OCR

The basic processes that enable OCR have been well researched for decades. The literature describes an almost uniform structure for such systems¹ – a process that has proven itself in practice and is therefore still in use today.² The first step is to prepare the image, followed by separating the text from the background. Only when these two steps work properly can downstream processes such as segmentation or classification deliver reliable results.

Preprocessing and binarization are therefore central elements of classic OCR. They define how "clear" the image will be for all further analysis. And they determine whether a system will actually recognize characters later on—or whether it will fail due to noise, shadows, or weak contrast.
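
To make this structure tangible, the following skeleton sketches the classic pipeline in Python. The helper functions are hypothetical placeholders for the steps described in the sections below, not an existing API.

```python
def recognize_text(image):
    # Classic OCR pipeline skeleton; preprocess, binarize, segment and
    # classify are hypothetical placeholders for the steps explained below.
    prepared = preprocess(image)      # noise reduction, contrast, deskewing, lighting
    binary = binarize(prepared)       # global or adaptive thresholding
    characters = segment(binary)      # edges, contours, bounding boxes
    return [classify(c) for c in characters]   # template matching or features + classifier
```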

Preprocessing: Bringing images to a reliable state

In practice, no image is perfect. Noise, shadows, unfavorable lighting, blurred images, or skewed perspectives can make recognition difficult. Preprocessing therefore has the task of optimizing the image so that the relevant text structures are as unaltered as possible. This step is particularly important for classic, rule-based OCR methods, as these methods do not have learning mechanisms that automatically compensate for weaknesses.

A typical problem is image noise. It occurs particularly when images are taken under difficult lighting conditions or when the camera is of poor quality. Various filters are used to reduce this interference. The bilateral filter, described by Tomasi and Manduchi in 1998, is a frequently used tool for this purpose.³ It preserves edges while smoothing out noise—a feature that is particularly valuable for text images.
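
As a rough illustration, such a filter can be applied with OpenCV in a few lines; the file name, the filter diameter, and the two sigma values are only example settings that would need to be tuned to the actual image material.

```python
import cv2

# Load the input as a grayscale image (the file name is a placeholder).
gray = cv2.imread("label_photo.png", cv2.IMREAD_GRAYSCALE)

# Bilateral filtering smooths noise in flat regions while preserving the
# sharp transitions at character edges; d, sigmaColor and sigmaSpace are
# example values that depend on resolution and noise level.
denoised = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)
```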

Contrast also plays an important role. Faded prints or weak differences in brightness between text and background make it difficult to separate image areas later on. Methods such as adaptive histogram equalization—in particular the well-known CLAHE method—have proven effective in compensating for local contrast differences and making details stand out more clearly.⁴ This can have a direct impact on recognition quality, especially with scans and poor-quality photos.
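
A minimal CLAHE sketch with OpenCV might look like this; the clip limit and tile size are example values, not prescribed settings.

```python
import cv2

gray = cv2.imread("faded_print.png", cv2.IMREAD_GRAYSCALE)

# CLAHE equalizes the histogram per tile and limits how strongly the
# contrast may be amplified, which keeps noise from being boosted too much.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)
```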

Another classic issue is alignment. Even slight tilts can cause lines to be misrecognized or characters to be incorrectly segmented later on. To prevent this, many systems analyze the image structure, for example using the Hough transform or text line projections. Work such as that by Leedham et al.⁵ shows how effective such corrections can be, even with severely impaired documents.
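
The following sketch estimates the skew angle with the probabilistic Hough transform and rotates the image accordingly; the thresholds, the minimum line length, and the sign convention of the angle are assumptions that would have to be checked against real data.

```python
import cv2
import numpy as np

def estimate_skew_angle(binary):
    """Estimate the dominant text-line angle (in degrees) in a binary image."""
    lines = cv2.HoughLinesP(binary, 1, np.pi / 180, threshold=100,
                            minLineLength=binary.shape[1] // 3, maxLineGap=20)
    if lines is None:
        return 0.0
    angles = [np.degrees(np.arctan2(y2 - y1, x2 - x1))
              for x1, y1, x2, y2 in lines[:, 0]]
    angles = [a for a in angles if abs(a) < 45]   # keep near-horizontal segments
    return float(np.median(angles)) if angles else 0.0

def deskew(image, angle):
    """Rotate the image to compensate for the estimated skew."""
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(image, M, (w, h), borderMode=cv2.BORDER_REPLICATE)
```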

Lighting also plays a major role. In industrial environments or when taking mobile photos, it is rare for an image to be evenly lit. Shadows, reflections, or shiny surfaces can easily cause text to appear overexposed or too dark. Methods such as homomorphic filtering⁶ or top-hat transformation from mathematical morphology⁷ help to reduce the influence of lighting and reveal the actual structure of the text.
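
For illumination normalization, a morphological top-hat transform can be sketched as follows; the 31×31 structuring element is an example value that should roughly exceed the stroke width of the text.

```python
import cv2

gray = cv2.imread("uneven_lighting.png", cv2.IMREAD_GRAYSCALE)

# A large structuring element estimates the slowly varying background; the
# top-hat transform keeps only structures smaller than the kernel (bright
# text) and suppresses gradual illumination changes.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (31, 31))
normalized = cv2.morphologyEx(gray, cv2.MORPH_TOPHAT, kernel)
# For dark text on a light background, cv2.MORPH_BLACKHAT is the counterpart.
```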

This combination of noise reduction, contrast adjustment, geometric correction, and lighting normalization creates an image that provides a stable foundation for all subsequent steps.

Binarization: The fundamental decision in the image

Once the image has been stabilized, the next step is one that is of central importance for classic OCR systems: binarization. This involves converting the grayscale image into a black-and-white image. This may sound simple, but it has far-reaching consequences, as many of the subsequent processes—such as segmentation or shape analysis—work exclusively with binary images.

The best-known global method is Otsu's thresholding, which was published in 1979.⁸ It automatically calculates a threshold value that splits the image so that the between-class variance of foreground and background pixels is maximized. For images with homogeneous lighting, Otsu often delivers very good and reproducible results. In controlled environments—such as scans or defined industrial setups—such a method can work reliably.
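
In OpenCV, Otsu's method reduces to a single call; the code below is a minimal sketch with a placeholder file name.

```python
import cv2

gray = cv2.imread("document_scan.png", cv2.IMREAD_GRAYSCALE)

# With THRESH_OTSU the supplied threshold (0) is ignored; OpenCV returns
# the automatically determined value together with the binary image.
threshold, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```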

In many practical scenarios, however, lighting conditions are not uniform. Shadows, shiny surfaces, or local brightness fluctuations mean that a single threshold value is not sufficient. This is where adaptive methods come into play. Niblack,⁹ Sauvola, and later extensions such as those by Wolf and Jolion¹⁰ calculate the threshold value for each image area individually. This allows them to adapt to local conditions and extract text even when there are large differences in brightness.
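
OpenCV itself ships a mean- or Gaussian-weighted local threshold, which serves here as a stand-in for this family of methods; Niblack and Sauvola variants are available in extension modules such as opencv-contrib or scikit-image. The block size and the constant below are example values.

```python
import cv2

gray = cv2.imread("shadowed_photo.png", cv2.IMREAD_GRAYSCALE)

# The threshold is computed per pixel from a Gaussian-weighted 31x31
# neighborhood minus a small constant, so shadows and bright spots each
# get their own local threshold.
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 10)
```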

Binarization plays a decisive role in determining whether text can later be cleanly separated, segmented, and classified. Accordingly, it has been the subject of intensive research for many years. Competitions such as DIBCO (Document Image Binarization Contest) show that this area continues to be developed today—not least because binarization also plays a role in modern OCR pipelines as a preprocessing step.

How edge detection and segmentation pave the way for character recognition

Once an image has been sufficiently stabilized and converted into a binary representation, the question arises as to how individual, clearly defined characters can be extracted from this mass of pixels. For humans, this process is self-evident: we immediately recognize where a letter begins and where it ends. Computers, on the other hand, require their own methods to derive structures from the brightness gradients of the image. Two steps play a central role in this process: edge detection and segmentation.

Edge detection is used to highlight the essential contours of a character. While binarization merely separates the foreground from the background, edges provide information about shape, orientation, and transitions between structures. In technical terms, an edge is nothing more than a point of significant change in brightness. However, these points are crucial for computer vision because they provide clues about lines, curves, or closed shapes—precisely the characteristics that define a letter or number.

Methods developed between the 1960s and the 1980s still serve as the basis for many image processing pipelines today. The Sobel operator is one of the best-known examples. It responds to changes in brightness along the horizontal and vertical axes, thereby providing a clear picture of where structure is present in the image.¹¹ For many technical applications in which the lighting is constant and the objects have a defined shape, Sobel is often perfectly adequate.
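
A minimal Sobel sketch: the horizontal and vertical gradients are computed separately and combined into a gradient magnitude; the file name is a placeholder.

```python
import cv2
import numpy as np

gray = cv2.imread("type_plate.png", cv2.IMREAD_GRAYSCALE)

# First derivatives along x and y; CV_64F avoids clipping negative gradients.
gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)

# The gradient magnitude highlights character strokes regardless of direction.
magnitude = np.sqrt(gx ** 2 + gy ** 2)
```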

The Laplace operator is more sensitive, responding not to the first but to the second derivative of the brightness. In combination with smoothing—known as "Laplacian of Gaussian" (LoG)—even weak or thin contours become visible, which is important in some scenarios.¹² This approach became particularly interesting when Marr and Hildreth provided a theoretical basis in 1980 for how visual systems perceive edges. The idea that smoothing and differentiation belong together can still be found in many industrial OCR systems today.
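
A LoG sketch along these lines, with example values for the Gaussian kernel and its standard deviation:

```python
import cv2

gray = cv2.imread("thin_strokes.png", cv2.IMREAD_GRAYSCALE)

# Laplacian of Gaussian: smooth first so that the second derivative does
# not amplify noise, then apply the Laplace operator.
smoothed = cv2.GaussianBlur(gray, (5, 5), 1.5)
log = cv2.Laplacian(smoothed, cv2.CV_64F, ksize=3)
```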

The best-known algorithm in this field is the Canny edge detector. In 1986, John Canny published a method that is still considered the "optimal" edge detection method today.¹³ It combines smoothing, gradient analysis, and a process known as hysteresis thresholding, which produces stable and closed edges. This feature is particularly valuable for segmentation because it allows the system to recognize not only that a structure exists somewhere, but also whether this structure is actually contiguous—a prerequisite for later separation into individual characters.
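
In OpenCV, the Canny detector is likewise a single call; the two hysteresis thresholds below are example values that depend on contrast and noise.

```python
import cv2

gray = cv2.imread("component_photo.png", cv2.IMREAD_GRAYSCALE)

# Hysteresis thresholds: gradients above 150 are always edges, gradients
# between 50 and 150 are kept only if they connect to a strong edge.
edges = cv2.Canny(gray, 50, 150)
```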

Once the essential edges of an image have been made visible, the next question arises: How can these contours be turned into individual areas that can be analyzed separately? This step has long been described in research as segmentation. The image is broken down into units that are relevant for later classification: lines, words, and finally individual characters. The quality of this segmentation has a decisive influence on how well an OCR system ultimately works.

One of the basic methods for character segmentation is connected component analysis (CCA). This method treats the image as if it consisted of clusters of connected pixels. Each group of pixels that are directly connected to one another is treated as a separate component.¹⁴ In many cases, this component corresponds to a character or at least a logically related part. The approach is relatively simple, but can be implemented extremely efficiently. That is why CCA is still used today in numerous industrial applications, such as serial number recognition or clearly structured labels.
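
A connected component sketch with OpenCV, assuming a binarized input image; the size filter is an example heuristic for plausible character components.

```python
import cv2

binary = cv2.imread("binarized_serial.png", cv2.IMREAD_GRAYSCALE)

# Label all connected foreground regions; stats holds one row per label
# with (x, y, width, height, pixel area).
num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)

candidates = []
for i in range(1, num_labels):                 # label 0 is the background
    x, y, w, h, area = stats[i]
    if 20 < area < 5000:                       # example size filter for characters
        candidates.append((x, y, w, h))
```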

When structures are more complex—for example, in the case of merged characters or busy backgrounds—connectivity alone is often not sufficient. In such cases, contour recognition is frequently used. Methods such as the border-following algorithm developed by Suzuki and Abe in 1985 analyze closed lines and can thus capture the outer shape of a possible character very precisely.¹⁵ By combining contour recognition with simple morphological operations such as dilation or erosion, it is also possible to process cases in which the writing is severely distorted or partially damaged.

At the end of these steps, rectangular areas usually appear in the image – so-called bounding boxes. They mark the regions that are to be classified later. A bounding box is, in a sense, the "package" that the system passes on to a classification algorithm: a cleanly isolated image fragment that contains a single character or symbol element. The quality of these boxes has a direct impact on recognition accuracy. If a character is cut out incompletely or if the box still contains distracting background structures, subsequent classification processes may easily make incorrect assignments.
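
A sketch of this step with OpenCV, whose findContours function implements the Suzuki and Abe border-following algorithm (return signature as in OpenCV 4.x); the dilation kernel and the left-to-right sorting are simple stand-ins for a real reading-order heuristic.

```python
import cv2

binary = cv2.imread("binarized_label.png", cv2.IMREAD_GRAYSCALE)

# Optional dilation closes small gaps in broken strokes before contours
# are traced (the kernel size is an example value).
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
closed = cv2.dilate(binary, kernel)

# findContours traces the outer border of each foreground region.
contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)

# One bounding box per outer contour, sorted left to right so that the
# fragments can later be classified in reading order.
boxes = sorted((cv2.boundingRect(c) for c in contours), key=lambda b: b[0])
```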

Segmentation and edge detection together form the bridge between a raw image signal and actual character recognition. They ensure that what is later classified is not just any image section, but a well-defined, structured fragment. Only through these steps can a complex image become the basis for precise text recognition.

How traditional OCR classifies characters—and why shape characteristics play a key role in this process

Once an image has been segmented and broken down into individual characters, the crucial question arises: How does the system recognize which letter, digit, or symbol is contained in the cut-out area? Modern approaches use neural networks for this purpose, but before machine learning became widely used, this task was solved using traditional methods for many years. These methods are not based on probabilities, but on the geometric, structural, and statistical properties of the characters.

One approach that was used particularly early on is known as template matching. Here, the isolated character is compared directly with templates that serve as references. The idea behind this is simple: if two patterns are similar enough, they are likely to be the same character. This method is particularly useful in environments where there are only a few fonts or very clearly structured characters. Typical examples include embossed serial numbers, type plates, or labels with standardized symbols. As long as the shape, size, and layout are stable, template matching delivers very reliable results—and with comparatively little computing power.
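
A template matching sketch using normalized cross-correlation; the template file names and the two-entry template set are placeholders for a real reference set covering the expected font.

```python
import cv2

character = cv2.imread("segmented_char.png", cv2.IMREAD_GRAYSCALE)
templates = {
    "0": cv2.imread("template_0.png", cv2.IMREAD_GRAYSCALE),
    "1": cv2.imread("template_1.png", cv2.IMREAD_GRAYSCALE),
    # ... one reference image per character of the expected font
}

best_label, best_score = None, -1.0
for label, template in templates.items():
    # Normalized cross-correlation; a score of 1.0 is a perfect match.
    resized = cv2.resize(character, (template.shape[1], template.shape[0]))
    score = cv2.matchTemplate(resized, template, cv2.TM_CCOEFF_NORMED)[0][0]
    if score > best_score:
        best_label, best_score = label, score
```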

Over time, classification continued to evolve and was increasingly supplemented by feature extraction. Instead of comparing a character as a whole image, researchers examined which characteristics were particularly relevant for differentiation. This led to the development of descriptors such as the "Histogram of Oriented Gradients" (HOG), which Dalal and Triggs introduced in 2005.¹⁶ HOG describes the directions in which the edges in an image point and how pronounced they are. These features form a kind of "fingerprint" of a character: similar enough to recognize related variants and different enough to distinguish them from other letters or numbers.
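
A HOG sketch with OpenCV's HOGDescriptor; the 32×32 window, 16×16 blocks, 8×8 cells, and 9 orientation bins are common example settings, not prescribed values.

```python
import cv2

char_img = cv2.imread("segmented_char.png", cv2.IMREAD_GRAYSCALE)
char_img = cv2.resize(char_img, (32, 32))    # fixed input size for the descriptor

# winSize 32x32, blockSize 16x16, blockStride 8x8, cellSize 8x8, 9 orientation bins
hog = cv2.HOGDescriptor((32, 32), (16, 16), (8, 8), (8, 8), 9)
features = hog.compute(char_img)             # fixed-length "fingerprint" of the character
```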

Another milestone was feature methods such as SIFT and SURF, which are based on local structural points. SIFT (Scale-Invariant Feature Transform), first described by David Lowe, identifies particularly distinctive points in the image and describes their surroundings.¹⁷ SURF (Speeded Up Robust Features) simplifies this process and enables faster computation.¹⁸ Although these methods were originally developed for general object recognition, they have repeatedly been used in OCR practice—especially for symbols or non-standard fonts, where classic methods reach their limits.
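
A minimal SIFT sketch; SIFT_create is available in recent OpenCV versions (4.4 and later), and the file name is a placeholder.

```python
import cv2

img = cv2.imread("unusual_symbol.png", cv2.IMREAD_GRAYSCALE)

# Detect distinctive keypoints and compute a 128-dimensional descriptor
# for the neighborhood of each one.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
```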

In addition to these global and local characteristics, there are other mathematical descriptions of shapes, such as the analysis of closed curves. Fourier descriptors, Zernike moments, and the shape context approach developed by Belongie et al.¹⁹ demonstrate the diversity of research into shape description and the depth with which the topic has been explored over decades. The goal of these methods is always to describe a character in such a way that even small differences between symbols can be reliably recognized—regardless of whether they are printed, embossed, or slightly distorted.

In many classic OCR systems, this feature extraction is followed by a rule-based or statistical classifier. These can be simple distance measures, but also models such as support vector machines, which were widely used in the 1990s and 2000s. The combination of clearly defined features and a well-trained classifier forms a robust overall system that continues to function reliably in many controlled scenarios to this day.
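
A sketch of such a pipeline, combining HOG features with a support vector machine from scikit-learn; train_images, train_labels, and new_character are placeholders for an annotated character set and would have to come from your own data.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def hog_features(img):
    """Fixed-length HOG descriptor for a single character image."""
    img = cv2.resize(img, (32, 32))
    hog = cv2.HOGDescriptor((32, 32), (16, 16), (8, 8), (8, 8), 9)
    return hog.compute(img).ravel()

# train_images and train_labels stand for an annotated set of character images.
X_train = np.array([hog_features(img) for img in train_images])
classifier = SVC(kernel="rbf")        # the statistical classifier on top of the features
classifier.fit(X_train, train_labels)

# Classify a newly segmented character.
predicted_label = classifier.predict([hog_features(new_character)])[0]
```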

It is interesting to note that even modern, AI-based OCR systems continue to use parts of these traditional approaches. Preprocessing, segmentation, and geometric normalization are often implemented according to classic principles because they are deterministic and easy to control. While neural networks take care of pattern recognition, classic image processing ensures that the input data is in a consistent state.

The interaction of segmented characters, extracted features, and a classifier completes the classic OCR process. An original image—be it a scan, a photo, or a recording from a production facility—is ultimately transformed into structured, machine-readable information. This transition from visual signal to digital database is at the heart of OCR.

Assessment and outlook

Traditional computer vision methods are primarily used today in situations where conditions are stable, deterministic decisions are important, or computing power is limited. Many industrial systems, test stations, and scanner solutions continue to rely on these concepts—not for nostalgic reasons, but because they are simply sufficient and reliable for certain tasks.

At the same time, the rise of modern deep learning models has fundamentally changed the OCR landscape. ML-based systems offer significant advantages when dealing with complex fonts, unstructured backgrounds, or handwritten text. Instead of having to choose between traditional and ML-based methods, many companies now combine both approaches: traditional methods provide preprocessing and structure, while neural networks handle the actual recognition.

This combination shows that traditional OCR is no longer the sole focus, but remains an important component of modern image processing pipelines. It provides functions that continue to work very reliably in clearly defined environments and on which even modern methods can be based.

References

¹ See Govindan & Shivaprasad – Character recognition: A review, 1990

² See Trier, Jain & Taxt – Feature extraction methods for character recognition: A survey, 1996

³ See Tomasi & Manduchi – Bilateral Filtering for Gray and Color Images, 1998

⁴ See Pizer et al. – Adaptive Histogram Equalization and its Variations, 1987

⁵ See Leedham et al. – Separating Text and Background in Degraded Document Images, 2002

⁶ See Oppenheim et al. – Homomorphic Filtering, 1968

⁷ See Serra – Image Analysis and Mathematical Morphology, 1982

⁸ See Otsu – A threshold selection method from gray-level histograms, 1979

⁹ See Niblack – An introduction to digital image processing, 1986

¹⁰ See Wolf & Jolion – Extraction and Recognition of Artificial Text in Multimedia Documents, 2005

¹¹ See Sobel & Feldman – A 3x3 isotropic gradient operator for image processing, 1968

¹² See Marr & Hildreth – Theory of edge detection, 1980

¹³ See Canny – A computational approach to edge detection, 1986

¹⁴ See Jain & Zhong – Page segmentation using texture analysis, 1996

¹⁵ See Suzuki & Abe – Topological structural analysis of digitized binary images by border following, 1985

¹⁶ See Dalal & Triggs – Histograms of Oriented Gradients for Human Detection, 2005

¹⁷ See Lowe – Distinctive Image Features from Scale-Invariant Keypoints, 2004

¹⁸ See Bay et al. – SURF: Speeded Up Robust Features, 2006

¹⁹ See Belongie et al. – Shape Matching and Object Recognition using Shape Contexts, 2002

Would you like to learn more about OCR and computer vision?

Our team will be happy to help you—just contact us if you have any questions about OCR and computer vision.

Get in touch