Story Melange
  • Blog
  • Projects
  • Archive
  • About
  • Subscribe

On this page

  • 1 What happened so far
  • 2 Doclayout-YOLO
    • 2.1 First impression
  • 3 Text or not
    • 3.1 Evaluation
    • 3.2 Performance on CPU
  • 4 Foundation models with some logic often win

Is training your own classifier really worth it?

Computer Vision
Machine Learning
Python
I small update to the longer update. We use a YOLO-doclayout to decide if it is a text or image.
Author

Dominik Lindner

Published

October 2, 2025

1 What happened so far

I recently trained a text or page classifier, which helped me dramatically speed up the sorting of scans for further processing in an OCR pipeline.

Recently I have been using Doclayout-YOLO for further processing of the text pages and especially the pages with mixed layout.

Doclayout-YOLO has proofed quite reliable in detecting existing text with bounding boxes. In addition it is optimized for speed.

When I finished the first part of this classifier I had the idea, why not just count the number or size of text boxes on a page?

2 Doclayout-YOLO

Doclayout-YOLO was trained on top of YOLO10 with 300k synthetic documents. YOLO is a End-to-end object detection network, whereas Resnet is first a classification network. While object detection can be done with a resnet by outputting 4 numbers for each pixel. It requires further processing. This makes it slower than YOLO. YOLO splits the image into boxes and then detects classes within those boxes.

Code
import cv2
import matplotlib.pyplot as plt
from doclayout_yolo import YOLOv10

2.1 First impression

We first load the model to the GPU.

model = YOLOv10("data/models/doclayout_yolo.pt")
model.to("cuda:0");

and then examine one page

det_res = model.predict(
    "data/raw/text_page/IMG_0751.JPG",
    imgsz=1024,
    conf=0.2,
    device="cuda:0",
    verbose=False
)

annotated_frame = det_res[0].plot(pil=True, line_width=10, font_size=30)
img_rgb = cv2.cvtColor(annotated_frame, cv2.COLOR_BGR2RGB)

plt.figure(figsize=(10, 14))
plt.imshow(img_rgb)
plt.axis("off")
plt.show()

We can clearly see title, text and image boxes. The model ignores areas of white space. That the part outside of the book is recognized as image seems understandable, as the model has no concept of the entity book.

3 Text or not

The model returns the type of box. A text box is of the class 1.

Let’s first count text boxes


def count_text_boxes(model, filename, conf_threshold=0.2):
    det_res = model.predict(
        filename,
        imgsz=1024,
        conf=conf_threshold,
        verbose=False
    )
    detections = det_res[0].boxes

    text_boxes = [
        box for box in detections
        if int(box.cls) == 1 and float(box.conf) >= conf_threshold
    ]
    return len(text_boxes)
%%time
count_text_boxes(model, "data/raw/text_page/IMG_0751.JPG")
CPU times: user 144 ms, sys: 31 ms, total: 175 ms
Wall time: 172 ms
11

There are 11 text boxes. The result was obtained in 179ms. My self trained CNN needs 160ms.

We allow one text box for image pages

def is_text_page(model, filename, conf_threshold=0.2, box_threshold=1):
    text_box_count = count_text_boxes(model, filename, conf_threshold)
    if text_box_count > box_threshold:
        return True
    return False
is_text_page(model, "data/raw/text_page/IMG_0751.JPG")
True

3.1 Evaluation

We will evaluate this on image pages and on text pages.

import glob

total = 0
incorrect = 0
incorrect_files = []
for file in glob.glob("data/raw/image_page/*.JPG"):
    total += 1
    if is_text_page(model, file):
        incorrect += 1
        incorrect_files.append(file)

incorrect
1
count_text_boxes(model, incorrect_files[0])
5
1-incorrect/total
0.9969604863221885

99.6% is far higher than any score, I obtained during training. On the other hand 5 boxes is clearly an outlier which needs to be accepted with this simple approach.

Let’s check the text pages

import glob

total = 0
incorrect = 0
incorrect_files = []
for file in glob.glob("data/raw/text_page/*.JPG"):
    total += 1
    if not is_text_page(model, file):
        incorrect += 1
        incorrect_files.append(file)
incorrect
0

We have 100% correct identification.

3.2 Performance on CPU

One place where the OCR based model was a bottleneck was automated testing. The slow performance dramatically slowed down the tests.

In the actual usage, the slow performance was is not that important as we proceed with api calls later on, which are slow too.

model2 = YOLOv10("data/models/doclayout_yolo.pt")
model2.to("cpu");
%%time
is_text_page(model2, "data/raw/text_page/IMG_0751.JPG")
CPU times: user 5.06 s, sys: 368 ms, total: 5.43 s
Wall time: 1.02 s
True

On CPU the model is slower by factor 6.

4 Foundation models with some logic often win

In the way we are using Doclayout-YOLO it can be called a foundation model.

Instead of spending time an resources on training of our own model, it can be quicker to run some post-processing on the output of a foundational model.

Most counterintuitive we are using a regression-classification model to perform pure classification.

Like this post? Get espresso-shot tips and slow-pour insights straight to your inbox.

Comments

Join the discussion below.


© 2025 by Dr. Dominik Lindner
This website was created with Quarto


Impressum

Cookie Preferences


Code · Data · Curiosity
Espresso-strength insights on AI, systems, and learning.