Page Segmentation: The easy and the hard way

Python
Computer Vision
Machine Learning
Generative AI
An OCR scan of a whole page with a complex layout can be done in two ways: the easy, expensive one using an LLM, or the more sophisticated one, which is harder to develop but cheaper to run.
Author: Dominik Lindner
Published: October 24, 2025

1 Many ways lead to digitized documents

When I first started working on recipescanner, the biggest issue was scanning multi-column and multi-recipe pages. How do you group the output from the OCR scan in such a way that recipes are not mixed with each other?

The following table shows different workflow options, from image to fully parsed recipe.

Option | How it Works | Speed (per page) | Needs | Best When | Cost
1. OCR → LLM Parsing | Extract with OCR, LLM for classification | 3–5 s (small LLM), 15–30 s (vision-LLM) | GPU or strong CPU, a few GB RAM | Layouts are messy | Medium
2. Vision → Text (Donut / Pix2Struct) | End-to-end vision model does classification | 0.8–2 s on GPU | GPU / NNAPI required | Handwriting, warped images | High
3. OCR → Segmentation → Rules / Small Model | Extract with OCR, rules or decision tree for classification | 0.3 s on CPU | CPU only, ~400 MB RAM | Scans are clean and structured | Low

For pages with multiple recipes, the approach can be separated into two stages.

  1. Image to text blocks and recipe sections
  2. Classification of each recipe into ingredients, description, …

The division into smaller tasks allows us to use less complex methods. In this notebook we focus on task #1.
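
To make the two stages concrete, here is a minimal sketch of the interface between them. The container types and function names are illustrative assumptions, not code from the actual project:

from dataclasses import dataclass, field

@dataclass
class RecipeSection:
    # Stage 1 output: the text blocks belonging to one recipe candidate
    blocks: list[str] = field(default_factory=list)

@dataclass
class ParsedRecipe:
    # Stage 2 output: the classified parts of a single recipe
    title: str = ""
    ingredients: list[str] = field(default_factory=list)
    instructions: str = ""

def segment_page(image_path: str) -> list[RecipeSection]:
    """Stage 1: image -> text blocks grouped per recipe (the topic of this notebook)."""
    raise NotImplementedError

def classify_section(section: RecipeSection) -> ParsedRecipe:
    """Stage 2: one recipe section -> title, ingredients, instructions, ..."""
    raise NotImplementedError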

1.1 Using LLM API

Nowadays, one straightforward solution is to run an LLM on the OCR output and ask it to cluster the text. You can even set up a complete workflow: cluster the text, check it, extract ingredients and instructions, and then check again to ensure all content is used and no recipes are mixed up.
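
As a rough illustration, such a clustering step could look like the sketch below. The client usage follows the current openai Python package; the model name and prompt are assumptions, not what recipescanner uses:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def cluster_ocr_text(ocr_text: str) -> str:
    """Ask an LLM to group raw OCR text into separate recipes (sketch only)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Group the following OCR text into recipes. "
                        "Return one JSON object per recipe with 'title' "
                        "and 'body'. Do not mix recipes."},
            {"role": "user", "content": ocr_text},
        ],
    )
    return response.choices[0].message.content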

The downside of all this? It is expensive to run, especially if we introduce correctness checks. How much more expensive?

For reference, Google currently charges $1.5 per 1000 pages, whereas Mistral asks for $3 per 1000 pages for annotated output. Google’s new interface, which (like Mistral) relies on Vision Transformers, also charges $1.5 per 1000 pages. Layout parsing costs extra at $10 per 1000 pages.

Let’s say each of your users has about 2000 pages to convert. That comes to $23 per user, and that is without classification. A custom extractor sets you back another $20, so the full workflow is about $50 per user. Again, that is only for recipe extraction. Interaction with the recipe database will probably cost around $3 per month.
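
Spelling out that arithmetic with the prices quoted above (the jump from $43 to “about $50” is rounding plus overhead):

pages_per_user = 2000

ocr_cost = pages_per_user / 1000 * 1.5      # OCR: $1.5 per 1000 pages
layout_cost = pages_per_user / 1000 * 10.0  # layout parsing: $10 per 1000 pages
extractor_cost = 20.0                       # custom extractor, rough figure

print(ocr_cost + layout_cost)                   # 23.0 -> "$23 per user"
print(ocr_cost + layout_cost + extractor_cost)  # 43.0 -> "about $50" with overhead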

1.2 The new way: end-to-end conversion

I recently tested Docling, a transformer-based architecture that has created quite a buzz.

Here is my short evaluation. It failed on the document we are going to use throughout this notebook. The deterministic demo (temperature=0) got stuck in an inference loop. I tried increasing the temperature and adding other tweaks to help the model escape local optima, but this came at the cost of accuracy.

The model repeated titles as ingredients for other recipes and failed to stop properly.

Docling Result: one recipe title is not detected as such

The model also has issues with line breaks: “Die Pilze damit bin-den” became “Pilze damit ben den”, line breaks included.

On top of that, the speed was poor on a legacy GPU (GTX 1080), which ran at full load for the entire 22 seconds. Memory consumption grew steadily as more tokens were decoded, just barely fitting on the 8 GB GPU with a peak usage of 6.5 GB.

Reflecting on the architecture, I suspect it works better on obscure edge cases: complex formats with overlapping figures or deeply nested tables might be where it shines.

Strangely, even the demo mostly showcases simple formats.

1.3 The not so easy way

Layout Parsing is not new. In fact, it became popular about three years ago.

One such layout parser is the aptly named LayoutParser. It combines automatic analysis of OCR output with segmentation. Unfortunately, the segmentation is done with Detectron2. Active development seems to have stopped, and the library no longer works on my Python installation (Python 3.12). My research revealed Python 3.9 as the last working version.

A more recent model is DocLayout-YOLO. Built on the YOLO architecture, it is trained on a very large document dataset to predict bounding boxes and class labels through a combined loss function.

What it lacks, however, is an automatic connection to OCR. One can either crop and OCR the text boxes, or align them with the OCR output.

Here is an example:

# Standard library
import glob
import json
import os
import random
import warnings
from difflib import SequenceMatcher
from pathlib import Path

# Third-party
import cv2
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image, ImageDraw
# Local / custom
from doclayout_yolo import YOLOv10
from sklearn.metrics import accuracy_score

# Jupyter magic
%matplotlib inline

# Suppress all warnings (optional)
warnings.filterwarnings("ignore")
model = YOLOv10("data/models/doclayout_yolo.pt")
det_res = model.predict(
    "data/raw/20250922_135514.jpg",
    imgsz=1024,
    conf=0.2,
    device="cuda:0",
    verbose=False
)
annotated_frame = det_res[0].plot(pil=True, line_width=10, font_size=30)
img_rgb = cv2.cvtColor(annotated_frame, cv2.COLOR_BGR2RGB)

plt.figure(figsize=(10, 14))
plt.imshow(img_rgb)
plt.axis("off")
plt.title("Segmentation using Doclayout YOLO")
plt.show()

Despite knowing nothing about the text, the CNN could still discover the layout. Which makes sense: you don’t need to understand a language to break a book into paragraphs.

There are some errors in how the titles are handled. Bright red boxes highlight detected titles, and we can see that there are too many.

When we run the CUDA-enabled version, the total runtime is 164 ms (650 ms with a cold start), whereas on CPU it can take up to 1.6 s.

1.4 Using only domain knowledge

And then there’s the domain knowledge approach. That is how I started. If we know all the text on a page and its location, and assume the page contains recipes: can we identify where recipes end, and which text belongs to which recipe?

In this notebook, I will examine this approach and how it compares to DocLayout-YOLO.

2 Converting Data to Dataframes

We will perform a statistical analysis and therefore convert the data to pandas dataframes.

2.1 OCR

In my project, I currently rely on Google Cloud OCR. Therefore, we need to convert the API response to a dataframe. While doing so, we also add font size and word count information.
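
For orientation, here is an abridged example of the structure the function below expects. The field names mirror the Google Cloud Vision response; the coordinates are illustrative:

ocr_json = [
    {
        "blockType": "TEXT",
        "boundingBox": {"vertices": [
            {"x": 853, "y": 484}, {"x": 1510, "y": 484},
            {"x": 1510, "y": 1389}, {"x": 853, "y": 1389},
        ]},
        "paragraphs": [{
            "boundingBox": {"vertices": [
                {"x": 853, "y": 484}, {"x": 1510, "y": 484},
                {"x": 1510, "y": 540}, {"x": 853, "y": 540},
            ]},
            "words": [{
                "boundingBox": {"vertices": [
                    {"x": 853, "y": 484}, {"x": 900, "y": 484},
                    {"x": 900, "y": 540}, {"x": 853, "y": 540},
                ]},
                "symbols": [{"text": "D"}, {"text": "i"}, {"text": "e"}],
            }],
        }],
    },
]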

def ocr_json_to_df(ocr_json):
    rows = []

    for block in ocr_json:
        block_type = block["blockType"]
        verts = block["boundingBox"]["vertices"]
        x1, y1 = verts[0].get("x", 0), verts[0].get("y", 0)
        x2, y2 = verts[2].get("x", 0), verts[2].get("y", 0)

        # Reconstruct text from words and establish size
        block_words = []
        average_word_height_sum = 0
        word_counter = 0
        for para_idx, para in enumerate(block.get("paragraphs", [])):
            paragraph_words = []
            paragraph_average_word_height_sum = 0
            paragraph_word_counter = 0
            para_verts = para["boundingBox"]["vertices"]
            px1, py1 = para_verts[0].get("x", 0), para_verts[0].get("y", 0)
            px2, py2 = para_verts[2].get("x", 0), para_verts[2].get("y", 0)
            for word_idx, word in enumerate(para.get("words", [])):
                symbols = [s["text"] for s in word.get("symbols", [])]
                word_text = "".join(symbols)
                block_words.append(word_text)
                paragraph_words.append(word_text)

                # store word-level rows (optional)
                word_verts = word["boundingBox"]["vertices"]
                wx1, wy1 = word_verts[0].get("x", 0), word_verts[0].get("y", 0)
                wx3, wy3 = word_verts[3].get("x", 0), word_verts[3].get("y", 0)
                average_word_height_sum += (wy3 - wy1)
                word_counter += 1
                paragraph_average_word_height_sum += (wy3 - wy1)
                paragraph_word_counter += 1

            rows.append({
                "level": "paragraph",
                "x1": px1, "y1": py1, "x2": px2, "y2": py2,
                "font_size": paragraph_average_word_height_sum / paragraph_word_counter,
                "text": " ".join(paragraph_words),
                "block_type": block_type
            })

        rows.append({
            "level": "block",
            "x1": x1, "y1": y1, "x2": x2, "y2": y2,
            "font_size": average_word_height_sum / word_counter,
            "text": " ".join(block_words),
            "block_type": block_type
        })

    df = pd.DataFrame(rows)

    df["word_count"] = df["text"].str.split().str.len()

    return df
def read_json(image_id):
    json_path = Path("data/ocr/")

    with open(json_path / f"{image_id}.json", "r") as f:
        return json.load(f)
filename = "20250922_135514"
page = read_json(filename)
df_ocr = ocr_json_to_df(page)

Let’s split block and paragraph rows for further processing.

df_block = df_ocr[df_ocr["level"] == "block"]
df_para = df_ocr[df_ocr["level"] == "paragraph"]
df_block.head()
level x1 y1 x2 y2 font_size text block_type word_count
1 block 874 2331 911 2355 24.000000 64 TEXT 1
7 block 853 484 1510 1389 31.902913 Die Butter zerlassen , das Weißbrot von beiden... TEXT 103
10 block 878 1501 1404 1737 52.375000 Soupe à l'ail bonne femme Knoblauchsuppe nach ... TEXT 8
16 block 875 1815 1291 2081 32.761905 2 Stangen Porree ( Lauch ) 250 g enthäutete To... TEXT 21
19 block 877 2101 989 2176 26.500000 Salz Pfeffer TEXT 2

2.2 DocLayout-YOLO

DocLayout-YOLO returns boxes, labels, and confidence levels.

def yolo_to_df(result):
    boxes = result.boxes.xyxy.cpu().numpy()
    labels = result.boxes.cls.cpu().numpy().astype(int)
    scores = result.boxes.conf.cpu().numpy()
    names = result.names

    records = []
    for box, lbl, score in zip(boxes, labels, scores):
        x1, y1, x2, y2 = box.tolist()
        label = names[lbl]
        records.append({
            "x1": x1,
            "y1": y1,
            "x2": x2,
            "y2": y2,
            "label": label,
            "confidence": score,
            "is_title": (label.lower() == "title")
        })

    return pd.DataFrame(records)
det_res = model.predict(
    "data/raw/20250922_135514.jpg",
    imgsz=1024,
    conf=0.2,
    device="cuda:0",
    verbose=False
)
df_yolo = yolo_to_df(det_res[0]);df_yolo
x1 y1 x2 y2 label confidence is_title
0 879.157227 1498.183594 1404.824463 1631.679443 title 0.922120 True
1 879.624573 1643.227661 1357.394409 1753.128174 plain text 0.903516 False
2 3569.065430 2246.001465 3609.026123 2281.185547 abandon 0.831622 False
3 2968.354492 450.490784 3297.232910 592.068970 title 0.809443 True
4 870.937500 2327.354248 912.915344 2362.086182 abandon 0.795582 False
5 874.531006 1780.801758 1492.744629 2286.900391 table 0.727726 False
6 1527.393433 598.798950 2144.085449 1672.833618 plain text 0.705026 False
7 2974.093750 598.492249 3467.499023 654.774719 plain text 0.638011 False
8 877.592163 465.383392 1481.144409 1402.559326 plain text 0.626123 False
9 2980.972656 682.661133 3532.022705 1200.553101 table 0.596943 False
10 1526.767578 597.575073 2054.986328 686.416992 plain text 0.583433 False
11 2316.687988 450.310455 2927.670898 824.062439 plain text 0.513546 False
12 3022.026855 1725.583984 3593.486572 2192.724609 plain text 0.488287 False
13 2325.665283 856.174072 2937.416748 1041.672241 plain text 0.481687 False
14 2330.425537 1042.294434 2914.214844 1225.915161 plain text 0.477337 False
15 1532.011597 1245.381348 2047.088745 1337.381592 plain text 0.455887 False
16 2350.650879 1689.926758 2967.089355 1876.195068 plain text 0.453441 False
17 2337.659668 1229.601807 2960.235352 1686.113770 plain text 0.445066 False
18 1522.936157 457.546967 1769.362915 547.661194 title 0.433465 True
19 1525.181885 461.238281 1770.478882 548.516296 title 0.419111 True
20 1537.929932 1843.337891 2071.566162 1914.832031 title 0.392423 True
21 882.206665 469.045898 1474.692993 848.210815 plain text 0.391044 False
22 2355.457520 1878.174194 2988.262939 2259.762207 plain text 0.389824 False
23 882.791809 845.683289 1476.399170 982.723877 plain text 0.378071 False
24 3001.068604 1266.987061 3527.293213 1354.325928 plain text 0.370774 False
25 3006.092529 1268.646606 3583.754639 2200.445557 plain text 0.360401 False
26 1535.614624 1842.507202 2077.353516 1984.601807 title 0.337300 True
27 880.934509 1258.313110 1462.825928 1400.205078 plain text 0.305315 False
28 881.682800 1118.672363 1455.794189 1258.220947 plain text 0.294823 False
29 882.505371 982.659119 1479.449707 1116.903931 plain text 0.263816 False
30 1534.299072 1620.801636 2126.622803 1667.255005 plain text 0.257625 False
31 1533.716064 1338.074707 2135.751709 1658.443481 plain text 0.245955 False
32 1533.999634 2035.315918 2170.545654 2285.304443 plain text 0.232413 False
33 1537.556274 1929.555420 1743.389038 1981.646973 plain text 0.227065 False

In both cases, we can see that the title detection is not optimal. In the block-based detection, it will be difficult to separate the title “Soupe à l’ail bonne femme” from its subtitle “Knoblauchsuppe nach Hausfrauenart”.

For DocLayout-YOLO, there are too many titles. Ingredients were detected as titles because they sit at the top of a column and are printed in bold.

3 Title detection

3.1 Title detection using OCR

Let’s try to improve the title detection. We start with the OCR blocks.

df_block[["text","font_size"]]
text font_size
1 64 24.000000
7 Die Butter zerlassen , das Weißbrot von beiden... 31.902913
10 Soupe à l'ail bonne femme Knoblauchsuppe nach ... 52.375000
16 2 Stangen Porree ( Lauch ) 250 g enthäutete To... 32.761905
19 Salz Pfeffer 26.500000
21 einige runde , ausgestochene Toast- brotscheiben 34.000000
24 Speiseöl Parmesankäse 29.500000
27 Den Porree putzen , waschen , in Ringe schneid... 35.600000
29 zerdrücken . 25.500000
34 Das Öl erhitzen , das Gemüse mit den Knoblauch... 33.516854
36 Gigot de chevreuil 62.000000
38 Rehblatt 37.000000
41 800 g Rehblatt ( Schulter ) Salz , Pfeffer 50 ... 34.187500
43 2 Schalotten 26.500000
49 2 zerdrückte Wacholderbeeren 2 zerdrückte Knob... 32.347826
51 2 Teel . Weizenmehl 125 ml ( 1 ) Schlagsahne 33.600000
56 Das Rehblatt unter fließendem kal- tem Wasser ... 33.964072
58 Croûtes aux champignons Champignons in Pasteten 45.666667
60 500 g Champignons 50 g Butter 33.500000
62 Salz Pfeffer 25.500000
69 Cayennepfeffer 125 ml ( 1 ) Wasser 2 gestriche... 32.321429
73 Die Champignons putzen , waschen , vierteln . ... 32.377193
75 55 -1.000000
77 65 26.000000

Titles are at indices 10, 36, and 58. For indices 10 and 58, the title sits in the same block as its subtitle. Both have a larger average line height (font size) than the rest, by a factor of almost 1.5. With this in mind we create our title detector. Just to be sure, we also limit the number of words.

Luckily, we already extracted this information into our dataframe.

def detect_titles(df, font_factor=1.2, max_words=15):
    df = df[df["word_count"] > 0].copy()

    if len(df) == 1:
        return df
    # compute mean font size ignoring NaNs
    mean_font_size = df["font_size"].mean()

    df["is_title"] = (
            (df["font_size"] > font_factor * mean_font_size) &
            (df["word_count"] <= max_words) &
             (df["text"].str.len() >= 3)
    )
    return df[df["is_title"]]
df_titles_blocks = detect_titles(df_block)
df_titles_blocks[["text","is_title","font_size"]]
text is_title font_size
10 Soupe à l'ail bonne femme Knoblauchsuppe nach ... True 52.375000
36 Gigot de chevreuil True 62.000000
58 Croûtes aux champignons Champignons in Pasteten True 45.666667
df_titles_para = detect_titles(df_para)
df_titles_para[["text","is_title","font_size"]]
text is_title font_size
8 Soupe à l'ail bonne femme True 56.600000
9 Knoblauchsuppe nach Hausfrauenart True 45.333333
35 Gigot de chevreuil True 62.000000
57 Croûtes aux champignons Champignons in Pasteten True 45.666667

For blocks, the subtitle is detected together with the title in one block. For paragraphs, the subtitle has an almost equal font size to the title. And for the third title, separation is not possible even at the paragraph level.

Now it would be great to include this information back into the dataframe.

def add_titles_to_df(original, titles):
    original["is_title"] = False
    original.loc[titles.index, "is_title"] = titles["is_title"]
add_titles_to_df(df_block, df_titles_blocks)
df_block[["text","is_title","font_size"]]
text is_title font_size
1 64 False 24.000000
7 Die Butter zerlassen , das Weißbrot von beiden... False 31.902913
10 Soupe à l'ail bonne femme Knoblauchsuppe nach ... True 52.375000
16 2 Stangen Porree ( Lauch ) 250 g enthäutete To... False 32.761905
19 Salz Pfeffer False 26.500000
21 einige runde , ausgestochene Toast- brotscheiben False 34.000000
24 Speiseöl Parmesankäse False 29.500000
27 Den Porree putzen , waschen , in Ringe schneid... False 35.600000
29 zerdrücken . False 25.500000
34 Das Öl erhitzen , das Gemüse mit den Knoblauch... False 33.516854
36 Gigot de chevreuil True 62.000000
38 Rehblatt False 37.000000
41 800 g Rehblatt ( Schulter ) Salz , Pfeffer 50 ... False 34.187500
43 2 Schalotten False 26.500000
49 2 zerdrückte Wacholderbeeren 2 zerdrückte Knob... False 32.347826
51 2 Teel . Weizenmehl 125 ml ( 1 ) Schlagsahne False 33.600000
56 Das Rehblatt unter fließendem kal- tem Wasser ... False 33.964072
58 Croûtes aux champignons Champignons in Pasteten True 45.666667
60 500 g Champignons 50 g Butter False 33.500000
62 Salz Pfeffer False 25.500000
69 Cayennepfeffer 125 ml ( 1 ) Wasser 2 gestriche... False 32.321429
73 Die Champignons putzen , waschen , vierteln . ... False 32.377193
75 55 False -1.000000
77 65 False 26.000000

Let’s do the same for the paragraphs.

add_titles_to_df(df_para, df_titles_para)

3.2 Title detection using DocLayout-YOLO and OCR

Luckily, the layout detector already flags which blocks are titles. The issue: none of those boxes contain text. Most of them span multiple lines, which means our font-size approach does not work; we know neither how many lines nor how many words they contain.

This is where DocLayout-YOLO falls short. We could run every single block through OCR; with a local OCR engine, that could be as effective as full-page detection, maybe even better. But since we rely on Google OCR, that approach would be very expensive, as billing is per request.

Instead, we align the two sets of boxes and copy every OCR word that falls inside a DocLayout box into the corresponding dataframe row.

df_titles_yolo = df_yolo[df_yolo["is_title"]].copy()
df_titles_yolo["text"] = ""
df_titles_yolo["average_word_height_sum"] = 0
df_titles_yolo["word_count"] = 0
for block in page:
    for para_idx, para in enumerate(block.get("paragraphs", [])):

        for word_idx, word in enumerate(para.get("words", [])):
            symbols = [s["text"] for s in word.get("symbols", [])]
            word_text = "".join(symbols)
            word_verts = word["boundingBox"]["vertices"]
            wx1, wy1 = word_verts[0].get("x", 0), word_verts[0].get("y", 0)
            wx2, wy2 = word_verts[3].get("x", 0), word_verts[3].get("y", 0)
            wcx = (wx1 + wx2) / 2
            wcy = (wy1 + wy2) / 2
            word_height = abs(wy2 - wy1)

            # assign to title box if inside
            for idx, row in df_titles_yolo.iterrows():
                if (row["x1"] <= wcx <= row["x2"]) and (row["y1"] <= wcy <= row["y2"]):
                    # append word text
                    df_titles_yolo.at[idx, "text"] += " " + word_text
                    # accumulate height + count
                    df_titles_yolo.at[idx, "average_word_height_sum"] += word_height
                    df_titles_yolo.at[idx, "word_count"] += 1
                    # can only be in one title
                    break
df_titles_yolo["font_size"] = df_titles_yolo.average_word_height_sum / df_titles_yolo.word_count
df_titles_yolo[["text","is_title","font_size"]]
text is_title font_size
0 Soupe à l'ail bonne femme True 56.600000
3 Croûtes aux champignons True 47.333333
18 Speiseöl Parmesankäse True 29.500000
19 True NaN
20 Gigot de chevreuil True 62.000000
26 Rehblatt True 37.000000

As indicated earlier, we get different kinds of false positives. Only “Rehblatt” resembles the earlier case of subtitles in the paragraph-based detection. Luckily for us, this time font size should work well.

df_titles_yolo = detect_titles(df_titles_yolo, font_factor=1)
df_titles_yolo[["text","is_title","font_size"]]
text is_title font_size
0 Soupe à l'ail bonne femme True 56.600000
3 Croûtes aux champignons True 47.333333
20 Gigot de chevreuil True 62.000000

As expected, it worked even without a safety factor. We wrap this in a function.

def add_text_and_font_size_to_layout_df(df, ocr_page):
    df["text"] = ""
    df["average_word_height_sum"] = 0
    df["word_count"] = 0
    unassigned_words = []
    for block in ocr_page:
        for para_idx, para in enumerate(block.get("paragraphs", [])):

            for word_idx, word in enumerate(para.get("words", [])):
                symbols = [s["text"] for s in word.get("symbols", [])]
                word_text = "".join(symbols)
                verts = word["boundingBox"]["vertices"]
                wx1, wy1 = verts[0].get("x", 0), verts[0].get("y", 0)
                wx2, wy2 = verts[3].get("x", 0), verts[3].get("y", 0)
                wcx = (wx1 + wx2) / 2
                wcy = (wy1 + wy2) / 2
                word_height = abs(wy2 - wy1)

                assigned = False
                for idx, row in df.iterrows():
                    if (row["x1"] <= wcx <= row["x2"]) and (row["y1"] <= wcy <= row["y2"]):
                        # append word text
                        df.at[idx, "text"] += " " + word_text
                        # accumulate height + count
                        df.at[idx, "average_word_height_sum"] += word_height
                        df.at[idx, "word_count"] += 1
                        assigned = True
                        break

                if not assigned:
                    unassigned_words.append({
                        "text": word_text,
                        "x": wcx,
                        "y": wcy,
                        "height": word_height
                    })
    if len(unassigned_words) > 0:
        print("unassigned words:")
    df["font_size"] = df.average_word_height_sum / df.word_count
    return df

Note that I included a small debug hint, which triggers if there are any unassigned words.

Next, we feed this information back into the dataframe. Here is the complete pipeline for DocLayout-YOLO:

df_yolo_font_size = add_text_and_font_size_to_layout_df(df_yolo.copy(), page)
df_titles_yolo = detect_titles(df_yolo_font_size[df_yolo_font_size.is_title], font_factor=1)
add_titles_to_df(df_yolo_font_size, df_titles_yolo)
df_yolo_font_size[["text", "is_title","font_size"]]
unassigned words:
text is_title font_size
0 Soupe à l'ail bonne femme True 56.600000
1 Knoblauchsuppe nach Hausfrauenart False 45.333333
2 65 False 26.000000
3 Croûtes aux champignons True 47.333333
4 64 False 24.000000
5 2 Stangen Porree ( Lauch ) 250 g enthäutete T... False 32.586207
6 Den Porree putzen , waschen , in Ringe schnei... False 33.900826
7 Champignons in Pasteten False 44.000000
8 Die Butter zerlassen , das Weißbrot von beide... False 31.902913
9 500 g Champignons g Butter Salz Pfeffer Cayen... False 32.142857
10 False NaN
11 zerdrückte Wacholderbeeren zerdrückte Knoblau... False 32.419355
12 Sahne und Petersilie unterrühren . Die Champi... False 32.016667
13 Das Rehblatt unter fließendem kal- tem Wasser... False 35.842105
14 Den Speck in Streifen schneiden . Die Butter ... False 34.090909
15 False NaN
16 Fleisch von den Knochen Das gare lösen , in P... False 30.380952
17 Schalotten abziehen , vierteln , mit den Wach... False 32.963636
18 Speiseöl Parmesankäse False 29.500000
19 False NaN
20 Gigot de chevreuil True 62.000000
21 False NaN
22 Den Bratensatz mit etwas Wasser los- kochen u... False 35.877551
23 False NaN
24 Die Champignons putzen , waschen , vierteln . False 31.250000
25 Die Butter zerlassen , die Champi- gnons dari... False 33.043478
26 Rehblatt False 37.000000
27 False NaN
28 False NaN
29 False NaN
30 False NaN
31 False NaN
32 800 g Rehblatt ( Schulter ) Salz , Pfeffer 50... False 33.333333
33 False NaN

4 Recipe detection using only OCR

Now with the titles cleaned up, we know how many titles there are. The next step is distributing the rows of each dataframe to the recipes.

An important piece of domain knowledge, or prior, is that almost all recipe formats are organized in columns. Titles are usually placed somewhere in these columns, and a title marks the beginning of a recipe.

The main drawback: any other layout format cannot be processed.

We will do this approach in two steps:

  1. Split the text into columns based on the row’s bounding box
  2. Split columns into recipes based on title position.

4.1 Working on OCR block level

This was actually the hardest part. For the sake of brevity, I only provide the final result, not the full journey.

4.1.1 Column detection

The OCR pipeline has already discovered fragments of text that belong together, and organized them into paragraphs and blocks.

Our algorithm tries two approaches; the fit score used by the first is written out below the list.

  1. We assume there are no more than five columns, evenly distributed. (In practice column sizes are often unequal, typically when ingredients occupy a column of their own.) We search for the best fit, where fit is scored by the mean absolute distance between each box’s left edge and the nearest column center.

  2. As a fallback, we estimate the number of columns with a normalized, area-weighted histogram. A valid column must hold at least 50% of the text area of the largest column. That of course assumes recipes of roughly equal length; because this assumption is shaky, the first approach is preferred.
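In symbols, the first approach picks the number of columns n that minimizes the score below (this mirrors the code that follows; B is the set of detected boxes, x_b the left edge of box b, and W the page width):

$$\mathrm{score}(n) = \frac{1}{|B|} \sum_{b \in B} \min_{0 \le i < n} \left| x_b - c_i \right|, \qquad c_i = \left(i + \tfrac{1}{2}\right) \frac{W}{n}$$

The smallest-scoring n wins and is accepted if the score stays below the error threshold; otherwise we fall back to the histogram.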
def detect_columns(df, max_cols=5, error_threshold=20):
    page_width = df["x2"].max()
    best_n, best_score, best_assignments = 1, float("inf"), None

    for n_cols in range(1, max_cols + 1):
        col_width = page_width / n_cols
        col_centers = [(i + 0.5) * col_width for i in range(n_cols)]

        # Assign each block to nearest center
        assignments = []
        errors = []
        for x in df["x1"]:
            dists = [abs(x - c) for c in col_centers]
            col_idx = int(np.argmin(dists))
            assignments.append(col_idx)
            errors.append(min(dists))

        score = np.mean(errors)  # lower = better alignment
        if score < best_score:
            best_score = score
            best_n = n_cols
            best_assignments = assignments

    if best_score < error_threshold:
        df["col_id"] = best_assignments

        col_boxes = []
        for col_id, group in df.groupby("col_id"):
            col_boxes.append({
                "col_id": col_id,
                "col_x1": group["x1"].min(),
                "col_y1": group["y1"].min(),
                "col_x2": group["x2"].max(),
                "col_y2": group["y2"].max()
            })

        col_boxes = pd.DataFrame(col_boxes).sort_values("col_x1").reset_index(drop=True)
        # build the old-id -> new-id mapping before overwriting col_id with the sorted order
        id_map = {old: new for new, old in enumerate(col_boxes["col_id"])}
        col_boxes["col_id"] = range(len(col_boxes))
        df["col_id"] = df["col_id"].map(id_map)
        return col_boxes, df
    else:
        # Fallback to histogram based method
        df["area"] = (df["x2"] - df["x1"]) * (df["y2"] - df["y1"])
        hist, bin_edges = np.histogram(
            df["x1"],
            bins=10,
            weights=df["area"]
        )
        hist_norm = hist / np.max(hist)
        valid_bins = np.where(hist_norm > 0.5)[0]
        bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
        col_centers = bin_centers[valid_bins]

        df["col_id"] = df["x1"].apply(lambda x: np.argmin(np.abs(col_centers - x)))

        col_boxes = []
        for col_id, group in df.groupby("col_id"):
            col_boxes.append({
                "col_id": col_id,
                "col_x1": group["x1"].min(),
                "col_y1": group["y1"].min(),
                "col_x2": group["x2"].max(),
                "col_y2": group["y2"].max()
            })

        col_boxes = pd.DataFrame(col_boxes).sort_values("col_x1").reset_index(drop=True)
        # build the old-id -> new-id mapping before overwriting col_id with the sorted order
        id_map = {old: new for new, old in enumerate(col_boxes["col_id"])}
        col_boxes["col_id"] = range(len(col_boxes))
        df["col_id"] = df["col_id"].map(id_map)

        return col_boxes, df
cols, df = detect_columns(df_block.copy())
cols
col_id col_x1 col_y1 col_x2 col_y2
0 0 853 484 1510 2355
1 1 1526 467 2138 2271
2 2 2306 474 2993 2249
3 3 2973 455 3608 2274
df[["text","col_id"]]
text col_id
1 64 0
7 Die Butter zerlassen , das Weißbrot von beiden... 0
10 Soupe à l'ail bonne femme Knoblauchsuppe nach ... 0
16 2 Stangen Porree ( Lauch ) 250 g enthäutete To... 0
19 Salz Pfeffer 0
21 einige runde , ausgestochene Toast- brotscheiben 0
24 Speiseöl Parmesankäse 1
27 Den Porree putzen , waschen , in Ringe schneid... 1
29 zerdrücken . 1
34 Das Öl erhitzen , das Gemüse mit den Knoblauch... 1
36 Gigot de chevreuil 1
38 Rehblatt 1
41 800 g Rehblatt ( Schulter ) Salz , Pfeffer 50 ... 1
43 2 Schalotten 1
49 2 zerdrückte Wacholderbeeren 2 zerdrückte Knob... 2
51 2 Teel . Weizenmehl 125 ml ( 1 ) Schlagsahne 2
56 Das Rehblatt unter fließendem kal- tem Wasser ... 2
58 Croûtes aux champignons Champignons in Pasteten 3
60 500 g Champignons 50 g Butter 3
62 Salz Pfeffer 3
69 Cayennepfeffer 125 ml ( 1 ) Wasser 2 gestriche... 3
73 Die Champignons putzen , waschen , vierteln . ... 3
75 55 3
77 65 3

The number of columns is correct.

4.1.2 Recipe detection

We now proceed by grouping the blocks into recipes. This function is the work of many failed iterations.

Since we have established columns, we treat all text as if it were in one big column. Whenever a title appears, we start a new recipe.

I solved the subtitle issue by checking that the gap to the previous title is similar to the title font size. If so, it is a subtitle, not a new recipe.

def group_recipes(df, subtitle_factor=1.2):
    df = df.copy()
    recipe_id = -1
    all_recipes = []
    recipe_map = {}

    # Sort by column, then y
    cols = sorted(df["col_id"].unique())
    last_recipe_id = None
    current_recipe = {"title": None, "blocks": []}

    last_title_font = None

    for col in cols:
        col_blocks = df[df["col_id"] == col].sort_values("y1")
        last_title_bottom = -1
        for idx, row in col_blocks.iterrows():

            if row.is_title:
                # check if this title is actually a subtitle
                subtitle_gap = (last_title_font or row.font_size) * subtitle_factor

                is_subtitle = (
                        current_recipe["title"] is not None
                        and (row["y1"] - last_title_bottom) < subtitle_gap
                )

                if is_subtitle:
                    # merge into current recipe title
                    current_recipe["title"] += " " + row.text
                    current_recipe["blocks"].append(row.text)
                    recipe_map[idx] = last_recipe_id
                else:

                    if current_recipe["title"] is not None:
                        all_recipes.append(current_recipe)

                    # start new recipe
                    recipe_id += 1
                    current_recipe = {"title": row.text, "blocks": [row.text]}
                    recipe_map[idx] = recipe_id
                    last_recipe_id = recipe_id

                last_title_bottom = row["y2"]
                last_title_font = row.font_size

            else:

                if last_recipe_id is None:
                    recipe_map[idx] = -1  # orphan
                else:
                    recipe_map[idx] = last_recipe_id
                    current_recipe["blocks"].append(row.text)

    if current_recipe["title"] is not None:
        all_recipes.append(current_recipe)

    df["recipe_id"] = df.index.map(recipe_map).fillna(-1).astype(int)
    return df, all_recipes
df, recipes = group_recipes(df)
df[["text", "recipe_id"]]
text recipe_id
1 64 0
7 Die Butter zerlassen , das Weißbrot von beiden... -1
10 Soupe à l'ail bonne femme Knoblauchsuppe nach ... 0
16 2 Stangen Porree ( Lauch ) 250 g enthäutete To... 0
19 Salz Pfeffer 0
21 einige runde , ausgestochene Toast- brotscheiben 0
24 Speiseöl Parmesankäse 0
27 Den Porree putzen , waschen , in Ringe schneid... 0
29 zerdrücken . 0
34 Das Öl erhitzen , das Gemüse mit den Knoblauch... 0
36 Gigot de chevreuil 1
38 Rehblatt 1
41 800 g Rehblatt ( Schulter ) Salz , Pfeffer 50 ... 1
43 2 Schalotten 1
49 2 zerdrückte Wacholderbeeren 2 zerdrückte Knob... 1
51 2 Teel . Weizenmehl 125 ml ( 1 ) Schlagsahne 1
56 Das Rehblatt unter fließendem kal- tem Wasser ... 1
58 Croûtes aux champignons Champignons in Pasteten 2
60 500 g Champignons 50 g Butter 2
62 Salz Pfeffer 2
69 Cayennepfeffer 125 ml ( 1 ) Wasser 2 gestriche... 2
73 Die Champignons putzen , waschen , vierteln . ... 2
75 55 2
77 65 2

Nice, even the fragment of the previous recipe in the first column was treated correctly. Only the page number was wrongly attributed to the first recipe.

A picture says more than a thousand words.

def plot_recipes(df, image, save=False):
    colors = {}
    fig = plt.figure(figsize=(12, 16))
    plt.imshow(image)

    for _, row in df.iterrows():
        rid = row["recipe_id"]
        if rid not in colors:
            colors[rid] = [random.random(), random.random(), random.random()]
        color = colors[rid]

        rect = patches.Rectangle(
            (row["x1"], row["y1"]),
            row["x2"] - row["x1"],
            row["y2"] - row["y1"],
            linewidth=2,
            edgecolor=color,
            facecolor="none"
        )
        plt.gca().add_patch(rect)

        if row["is_title"]:
            plt.text(row["x1"], row["y1"] - 5, row["text"][:30],
                     color=color, fontsize=10, weight="bold")

    plt.axis("off")
    if save:
        fig.savefig("result.jpg")
    plt.show()
def read_image(image_id):
    image_path = Path("data/raw")
    filename = image_path / f"{image_id}.jpg"
    image = Image.open(filename)

    return image
image = read_image(filename)
plot_recipes(df, image, True)

As we can see, this approach also works quite well.

4.2 Working on OCR paragraph level

Next, we try OCR paragraphs. Thanks to the subtitle workaround, this also works for this recipe. However, it introduces another variable into the whole process, making it more brittle.

We can see that the first subtitle is printed in bold in the plot, because the dataframe still flags it as a title.

cols, df = detect_columns(df_para.copy())
df, recipes = group_recipes(df)
plot_recipes(df, image)

4.3 Generalization of OCR-based splitting

Let’s extend the algorithm to two other formats and see if it succeeds.

filename = "IMG_2077"
page = read_json(filename)
df = ocr_json_to_df(page)
df_block = df[df["level"] == "block"]
titles = detect_titles(df_block.copy())
add_titles_to_df(df_block, titles)
image = read_image(filename)
cols, df = detect_columns(df_block.copy())
df, recipes = group_recipes(df, 300)
plot_recipes(df, image)

filename = "IMG_2074"
page = read_json(filename)
df = ocr_json_to_df(page)
df_block = df[df["level"] == "block"]
titles = detect_titles(df_block.copy())
add_titles_to_df(df_block, titles)
image = read_image(filename)
cols, df = detect_columns(df_block.copy())
df, recipes = group_recipes(df, 300)
plot_recipes(df, image)

recipes
[{'title': 'Ein würziger Schweinebraten aus der Normandie RÔTI DE PORC AUX POMMES CARAMÉLISÉES',
  'blocks': ['Ein würziger Schweinebraten aus der Normandie RÔTI DE PORC',
   'AUX POMMES CARAMÉLISÉES',
   'SCHWEINEBRATEN MIT KARAMELLISIERTEN ÄPFELN',
   'Wenig Rosmarinnadeln , die Salbeiblätter , Arbeitsaufwand : 30 Minuten die Knoblauchzehen und die Fenchelsamen im Mörser zerstoßen . Salz und Pfeffer zuge- ben . - Die Äpfel schälen , entkernen und in Schnitze schneiden . - Den Zucker mit dem Zitronensaft in einer Bratpfanne erhitzen . Sobald er hellbraun wird , die Äpfel zuge- ben , gut wenden , die Butter in Flocken zu- geben , mit 3 bis 5 EL Wasser ablöschen und ca. 5 Minuten garen . Salzen und pfeffern.- 3 bis 4 Einschnitte im Fleisch anbringen . Die Öffnungen mit der Gewürzmischung fül- len . Das Fleisch zu einem Rollbraten schnü- 2 dl Apfelwein ren , salzen und pfeffern . - Die Rosmarin-',
   'Bratzeit : 2 Stunden Für 4 Personen 5 Zweige Rosmarin 2-3 Salbeiblätter 2 Knoblauchzehen 1 Prise Fenchelsamen Salz , Pfeffer',
   '500 g säuerliche Äpfel 50g Rohzucker 1 EL Zitronensaft 2 EL frische Butter 1 kg magerer Schweinehals 2 EL zimmerwarme Bratbutter',
   'zweige verteilt unter der Schnur anbringen . Das Fleisch mit der weichen Bratbutter bestreichen . In einer Bratkasserolle rundum an- braten . - Den Apfelwein zufügen und zugedeckt bei kleiner Hitze 2 Stunden garen . - Den Bratenfond mit 1 bis 2 EL Wasser aufkochen . 1/3 der Äpfel pürieren und mit dem Bratenjus gut mischen , abschme- cken . - Die restlichen Apfelschnitze rasch erwärmen und als Garni- tur zum tranchierten Braten servieren . Getränk : Rustikaler Rotwein , zum Beispiel aus der Provence Anmerkung : Diese ausgeprägten Zutaten passen auch gut zu Kalb- fleisch . Deshalb lässt sich nach demselben Rezept ebenso gut ein',
   'Kalbsbratenzubereiten .']}]

This last layout is one of my favourites in terms of complexity: a triple title and a deeply nested format. The heuristic title detection, with all its domain knowledge, captures all three titles together in the final extract. I’m curious how this performs on completely unseen layouts.

5 Recipe detection using DocLayout-YOLO

We already know title detection works better with the YOLO detector, but what about the columns and recipes?

Once again, we proceed in a two-step approach. First columns, then recipes.

5.1 Detecting columns

We use the previously defined function to find columns based on bounding box positions. At this stage, no text is required.

col_df, df_yolo_font_size = detect_columns(df_yolo_font_size)
df_yolo_font_size[["text","is_title","col_id"]]
text is_title col_id
0 Soupe à l'ail bonne femme True 0
1 Knoblauchsuppe nach Hausfrauenart False 0
2 65 False 3
3 Croûtes aux champignons True 3
4 64 False 0
5 2 Stangen Porree ( Lauch ) 250 g enthäutete T... False 0
6 Den Porree putzen , waschen , in Ringe schnei... False 1
7 Champignons in Pasteten False 3
8 Die Butter zerlassen , das Weißbrot von beide... False 0
9 500 g Champignons g Butter Salz Pfeffer Cayen... False 3
10 False 1
11 zerdrückte Wacholderbeeren zerdrückte Knoblau... False 2
12 Sahne und Petersilie unterrühren . Die Champi... False 3
13 Das Rehblatt unter fließendem kal- tem Wasser... False 2
14 Den Speck in Streifen schneiden . Die Butter ... False 2
15 False 1
16 Fleisch von den Knochen Das gare lösen , in P... False 2
17 Schalotten abziehen , vierteln , mit den Wach... False 2
18 Speiseöl Parmesankäse False 1
19 False 1
20 Gigot de chevreuil True 1
21 False 0
22 Den Bratensatz mit etwas Wasser los- kochen u... False 2
23 False 0
24 Die Champignons putzen , waschen , vierteln . False 3
25 Die Butter zerlassen , die Champi- gnons dari... False 3
26 Rehblatt False 1
27 False 0
28 False 0
29 False 0
30 False 1
31 False 1
32 800 g Rehblatt ( Schulter ) Salz , Pfeffer 50... False 1
33 False 1

Again we have four columns. As the DocLayout-YOLO detector counts differently, it is not obvious whether col_id is correct. We therefore plot the result.

image = read_image("20250922_135514")

fig, ax = plt.subplots(1, figsize=(12, 12))
ax.imshow(image)

# YOLO boxes
for _, row in df_yolo_font_size.iterrows():
    color = "red" if row["is_title"] else "blue"
    rect = patches.Rectangle(
        (row["x1"], row["y1"]),
        row["x2"] - row["x1"],
        row["y2"] - row["y1"],
        linewidth=2,
        edgecolor=color,
        facecolor="none"
    )
    ax.add_patch(rect)
    if row["is_title"]:
        ax.text(row["x1"], row["y1"] - 5, "TITLE", color="red", fontsize=10, weight="bold")

# Column boxes
if col_df is not None:
    for _, row in col_df.iterrows():
        rect = patches.Rectangle(
            (row["col_x1"], row["col_y1"]),
            row["col_x2"] - row["col_x1"],
            row["col_y2"] - row["col_y1"],
            linewidth=3,
            edgecolor="green",
            facecolor="none",
            linestyle="--"
        )
        ax.add_patch(rect)
        ax.text(row["col_x1"], row["col_y1"] - 5, "COLUMN", color="green", fontsize=10)

plt.axis("off")
plt.show()

First, the columns are correct.

There are also empty rows, which we need to filter out before grouping.

Then we can call our grouping function.

sections_df, recipes = group_recipes(df_yolo_font_size[df_yolo_font_size.text != ""]); recipes
[{'title': " Soupe à l'ail bonne femme",
  'blocks': [" Soupe à l'ail bonne femme",
   ' Knoblauchsuppe nach Hausfrauenart',
   ' 2 Stangen Porree ( Lauch ) 250 g enthäutete Tomaten 3-5 Knoblauchzehen 3 EBI . Speiseöl 2 große Kartoffeln 141 Fleischbrühe Salz Pfeffer einige runde , ausgestochene Toast- brotscheiben',
   ' 64',
   ' Speiseöl Parmesankäse',
   ' Den Porree putzen , waschen , in Ringe schneiden . Die Tomaten halbieren , die Stenge- lansätze herausschneiden , das Toma- tenfleisch in Würfel schneiden . Die Knoblauchzehen abziehen und zerdrücken . Das Öl erhitzen , das Gemüse mit den Knoblauchzehen darin andünsten . Die Kartoffeln schälen , waschen , in Scheiben schneiden , mit der Fleisch- brühe zu dem Gemüse geben , zum Kochen bringen , etwa 30 Minuten kochen lassen . Die Suppe mit Salz und Pfeffer abschmecken . Die Toastbrotscheiben mit dem Spei- seöl bestreichen , mit Parmesankäse bestreuen , in den auf 200-225 Grad ( Gas : Stufe 4-5 ) vorgeheizten Back- ofen schieben und 8-10 Minuten überbacken . Das Brot heiß zu der Suppe reichen .',
   ' Rehblatt']},
 {'title': ' Gigot de chevreuil',
  'blocks': [' Gigot de chevreuil',
   ' 800 g Rehblatt ( Schulter ) Salz , Pfeffer 50 g durchwachsener Speck 25 g Butter 2 Schalotten',
   ' zerdrückte Wacholderbeeren zerdrückte Knoblauchzehen 2-3 Thymianzweige 125 ml ( 1 ) Rotwein 250 ml ( 1 ) Wasser 10 g Butter 2 Teel . Weizenmehl 125 ml ( 1 ) Schlagsahne',
   ' Das Rehblatt unter fließendem kal- tem Wasser abspülen , trockentupfen , enthäuten und mit Salz und Pfeffer einreiben .',
   ' Den Speck in Streifen schneiden . Die Butter ( 25 g ) zerlassen , die Speckstreifen und das Fleisch darin anbraten .',
   ' Schalotten abziehen , vierteln , mit den Wacholderbeeren und den gewaschenen Thymianzweigen zu dem Fleisch geben . Den Rotwein und etwas von dem Wasser hinzugießen . Das Fleisch etwa 1 Stunde schmoren lassen , ab und zu wenden und mit dem Bratensatz begießen . Die ver- dampfte Flüssigkeit nach und nach durch Wasser ersetzen .',
   ' Fleisch von den Knochen Das gare lösen , in Portionsstücke schneiden , auf einer vorgewärmten Platte anrichten und warm stellen .',
   ' Den Bratensatz mit etwas Wasser los- kochen und durch ein Sieb gießen . Die Butter ( 10 g ) mit dem Weizen- mehl verrühren , zum Bratensatz geben , mit einem Schneebesen durchschlagen und aufkochen lassen . Die Sahne unterrühren . Die Sauce mit Salz und Pfeffer abschmecken .']},
 {'title': ' Croûtes aux champignons',
  'blocks': [' Croûtes aux champignons',
   ' Champignons in Pasteten',
   ' 500 g Champignons g Butter Salz Pfeffer Cayennepfeffer 125 ml ( 1 ) Wasser 2 gestrichene EBI . Speisestärke 3 EBI . Schlagsahne 2 EBI . gehackte Petersilie Zitronensaft 4 Blätterteigpasteten ( fertig gekauft )',
   ' Die Champignons putzen , waschen , vierteln .',
   ' Die Butter zerlassen , die Champi- gnons darin andünsten , mit Salz , Pfeffer und Cayennepfeffer würzen . Das Wasser hinzugießen , in etwa 10 Minuten gar dünsten lassen . Die Speisestärke mit 3 EBI . kaltem Wasser anrühren , die Pilze damit bin- den .',
   ' Sahne und Petersilie unterrühren . Die Champignons mit den Gewürzen und dem Zitronensaft abschmecken . Von den Pasteten Hülsen und Deckel auf ein Backblech legen und in den auf 200-225 Grad ( Gas : Stufe 4-5 ) vorgeheizten Backofen schieben und in etwa 5 Minuten erwärmen . Die Champignons in die Pasteten fül- len , die Deckel darauf setzen .',
   ' 65']}]

There is an issue with nested blocks: the title “Gigot de chevreuil” and its subtitle “Rehblatt”. Somehow “Rehblatt” ended up in recipe 0.

5.2 Generalization of DocLayout-YOLO + OCR

Again we check how the other layouts perform. This time we look at the dataframe.

filename = "IMG_2077"
page = read_json(filename)

det_res = model.predict(
    "data/raw/IMG_2077.jpg",
    imgsz=1024,
    conf=0.2,
    device="cuda:0",
    verbose=False
)
df_yolo = yolo_to_df(det_res[0])

df_yolo_font_size = add_text_and_font_size_to_layout_df(df_yolo.copy(), page)

df_titles_yolo = detect_titles(df_yolo_font_size[df_yolo_font_size.is_title].copy(), font_factor=1)
add_titles_to_df(df_yolo_font_size, df_titles_yolo)

col_df, df_yolo_font_size = detect_columns(df_yolo_font_size.copy())
sections_df, recipes = group_recipes(df_yolo_font_size[df_yolo_font_size.text != ""])
sections_df[["text","is_title","recipe_id"]]
text is_title recipe_id
0 Préchauffez le four à 220 ° C . Faites cuire ... False 0
1 Mélangez les fèves dans un saladier avec le f... False 0
2 Toastez les tranches de pain . Répartissez le... False 0
3 Salade de poulet , fèves , fenouil et concomb... True 0
4 L'estragon est utilisé en phytothérapie pour ... False 0
5 Pour 4 personnes Préparation : 10 min Cuisson... False 0
6 Pelez le concombre ( s'il n'est pas bio ) et ... False 0
7 Parfait pour le soir ! False 0
8 ✓ 1 cuil . à café de zestes de citron False 0
9 126 PLATS DETOX False 0
10 ✓2 cuil . à soupe d'estragon frais haché False 0
11 ✓ 8 tranches de pain de campagne aux graines ... False 0
12 ✓ 1 cuil . à soupe de vinaigre de vin rouge False 0
13 ✓ 2 gros blancs de poulet ( ou 4 petits ) False 0
14 ✓2 cuil . à café de jus de citron frais False 0
15 ✓ 150 g de fèves ( surgelées ) 1 petit concom... False 0
16 ✓ Huile d'olive ✓ Sel , poivre False 0
filename = "IMG_2074"
page = read_json(filename)

det_res = model.predict(
    "data/raw/IMG_2074.jpg",
    imgsz=1024,
    conf=0.2,
    device="cuda:0",
    verbose=False
)
df_yolo = yolo_to_df(det_res[0])

df_yolo_font_size = add_text_and_font_size_to_layout_df(df_yolo.copy(), page)

df_titles_yolo = detect_titles(df_yolo_font_size[df_yolo_font_size.is_title].copy(), font_factor=1)
add_titles_to_df(df_yolo_font_size, df_titles_yolo)

col_df, df_yolo_font_size = detect_columns(df_yolo_font_size.copy())
sections_df, recipes = group_recipes(df_yolo_font_size[df_yolo_font_size.text != ""])
sections_df[["text","is_title","recipe_id"]]
unassigned words:
text is_title recipe_id
0 Für 4 Personen Zweige Rosmarin 2-3 Salbeiblät... False 0
1 zweige verteilt unter der Schnur anbringen . ... False 0
2 RÔTI DE PORC POMMES CARAMÉLISÉES True 0
3 Bratzeit : 2 Stunden : 30 Minuten False 0
4 SCHWEINEBRATEN MIT KARAMELLISIERTEN ÄPFELN False 0
5 Rosmarinnadeln , die Salbeiblätter , die Knob... False 0
6 Ein würziger Schweinebraten aus der Normandie False -1
7 Getränk : Rustikaler Rotwein , zum Beispiel a... False 0

Here we run into the first issue: the triple subtitle leads to missing text in the final segmentation, shown as “-1” in the recipe_id column.

This is actually worse than the wrongly identified title in the pure OCR case.

6 Comparison of the pure OCR pipeline vs OCR + DocLayout-YOLO

We will compare the two approaches by wrapping them in functions. To make the comparison clearer, we’ll also use a different recipe this time.

def get_recipes_ocr_only_block(filename, level='block'):
    page = read_json(filename)
    df = ocr_json_to_df(page)
    df_block = df[df["level"] == level]

    titles = detect_titles(df_block.copy())
    add_titles_to_df(df_block, titles)

    cols, df = detect_columns(df_block.copy())
    df, recipes = group_recipes(df)
    return df, recipes


def get_recipes_yolo(filename):
    page = read_json(filename)

    det_res = model.predict(
        "data/raw/"+ filename +".jpg",
        imgsz=1024,
        conf=0.2,
        device="cuda:0",
        verbose=False
    )
    df_yolo = yolo_to_df(det_res[0])

    df_yolo_font_size = add_text_and_font_size_to_layout_df(df_yolo.copy(), page)

    df_titles_yolo = detect_titles(df_yolo_font_size[df_yolo_font_size.is_title].copy(), font_factor=1)
    add_titles_to_df(df_yolo_font_size, df_titles_yolo)

    col_df, df_yolo_font_size = detect_columns(df_yolo_font_size.copy())
    sections_df, recipes = group_recipes(df_yolo_font_size[df_yolo_font_size.text != ""])
    return sections_df, recipes

6.1 Runtime comparison

%%time
df_ocr, r_ocr = get_recipes_ocr_only_block("IMG_2073")
CPU times: user 22.5 ms, sys: 162 μs, total: 22.7 ms
Wall time: 22.1 ms
%%time

df_yolo, r_yolo = get_recipes_yolo("IMG_2073")
unassigned words:
CPU times: user 354 ms, sys: 31.9 ms, total: 386 ms
Wall time: 386 ms

The OCR-only approach is roughly 20x faster, as we do not need to touch the GPU. Without a GPU, the YOLO pipeline would take even longer.

6.2 Recipe output comparison

Let’s check whether we discovered the same recipes. For that, we realign the rows of the two approaches and attach the recipe ids of each to the OCR dataframe.

def attach_yolo_recipe_ids(df_ocr, df_yolo, pad=2, min_iou=0.1):
    df_ocr = df_ocr.copy()
    recipe_ids = []

    for _, block in df_ocr.iterrows():
        bx1, by1, bx2, by2 = block["x1"], block["y1"], block["x2"], block["y2"]

        assigned_id = -1
        best_iou = 0
        candidate_boxes = []

        for _, yrow in df_yolo.iterrows():
            yx1, yy1 = yrow["x1"] - pad, yrow["y1"] - pad
            yx2, yy2 = yrow["x2"] + pad, yrow["y2"] + pad

            # Check containment first
            if (bx1 >= yx1) and (by1 >= yy1) and (bx2 <= yx2) and (by2 <= yy2):
                candidate_boxes.append((yrow["recipe_id"], (yx2 - yx1) * (yy2 - yy1)))

            # IoU calculation
            inter_x1 = max(bx1, yx1)
            inter_y1 = max(by1, yy1)
            inter_x2 = min(bx2, yx2)
            inter_y2 = min(by2, yy2)
            inter_w = max(0, inter_x2 - inter_x1)
            inter_h = max(0, inter_y2 - inter_y1)
            inter_area = inter_w * inter_h

            block_area = (bx2 - bx1) * (by2 - by1)
            yolo_area = (yx2 - yx1) * (yy2 - yy1)
            union_area = block_area + yolo_area - inter_area

            iou = inter_area / union_area if union_area > 0 else 0

            if iou > best_iou:
                best_iou = iou
                assigned_id = yrow["recipe_id"]

        # Prefer containment rule
        if candidate_boxes:
            # pick smallest enclosing box (most specific title/region)
            assigned_id = min(candidate_boxes, key=lambda x: x[1])[0]
        elif best_iou < min_iou:
            # fallback centroid if IoU too small
            bcx, bcy = (bx1 + bx2) / 2, (by1 + by2) / 2
            for _, yrow in df_yolo.iterrows():
                if (yrow["x1"] <= bcx <= yrow["x2"]) and (yrow["y1"] <= bcy <= yrow["y2"]):
                    assigned_id = yrow["recipe_id"]
                    break

        recipe_ids.append(assigned_id)

    df_ocr["recipe_id_yolo"] = recipe_ids
    return df_ocr
combined = attach_yolo_recipe_ids(df_ocr, df_yolo); combined[["recipe_id","recipe_id_yolo"]]
recipe_id recipe_id_yolo
1 0 0
4 0 0
11 0 0
14 0 0
16 0 0
20 0 0
23 0 0
28 0 -1
32 0 -1
34 0 -1
39 0 -1
41 0 -1

Something has gone wrong.

Let’s look at the recipes:

r_ocr
[{'title': 'Merlu Koskera',
  'blocks': ['Merlu Koskera',
   'Pour 6 personnes : Temps de préparation : 30 minutes Temps de cuisson : 20 minutes',
   "Ingrédients : • 6 médaillons de merlu • 300 g d'asperges blanches en conserve • 500 g de petits pois en conserve • 1 poignée de palourdes • 1 poignée de moules ⚫ 3 œufs durs",
   "• 10 cl de vin blanc sec type Irouléguy • 1 cuillère à café de purée de piment d'Espelette • 4 gousses d'ail",
   '• Persil',
   "• 3 cuillères à soupe de farine • Sel de Guérande • Poudre de piment d'Espelette",
   "• Huile d'olive • 20 cl de fumet de poisson",
   "Faites un hachis d'ail et de persil . Réservez . Salez et farinez les médaillons de merlu . Réservez . Dans une cocotte , faites ouvrir les moules et les palourdes , conservez leur jus . Dans une sauteuse , faites revenir les médaillons de merlu dans l'huile d'olive mélangée à la purée",
   "de piment durant 2 minutes de chaque côté . Dans un plat en terre , déposez les médaillons de merlu . Réservez . Faites revenir pendant quelques minutes le hachis de persil et d'ail dans l'huile d'olive . Saupoudrez de farine , arrosez du jus des coquillages , du vin blanc et du fumet de poisson .",
   'Versez sur les médaillons .',
   'Ajoutez les moules , les palourdes , les asperges et les petits pois . Laissez mijoter à feu doux pendant 10 minutes . Ajoutez les œufs durs en quartier avant la fin de la cuisson . Saupoudrez de persil et de poudre de piment .',
   'Servez sans attendre .']}]
r_yolo
[{'title': ' Merlu Koskera',
  'blocks': [' Merlu Koskera',
   ' Pour 6 personnes : Temps de préparation : 30 minutes Temps de cuisson : 20 minutes',
   ' Ingrédients :',
   " 6 médaillons de merlu 300 g d'asperges blanches en conserve • 500 g de petits pois en conserve • 1 poignée de palourdes • 1 poignée de moules ⚫ 3 œufs durs • 10 cl de vin blanc sec type Irouléguy • 1 cuillère à café de purée de piment d'Espelette • 4 gousses d'ail • Persil • 3 cuillères à soupe de farine • Sel de Guérande • Poudre de piment d'Espelette • Huile d'olive • 20 cl de fumet de poisson"]}]

We are missing text in the DocLayout-YOLO recipe. When we look at the input dataframe, we see there are two columns.

df_yolo[["text","col_id","recipe_id"]]
text col_id recipe_id
0 Faites un hachis d'ail et de persil . Réserve... 0 -1
1 6 médaillons de merlu 300 g d'asperges blanch... 1 0
2 Pour 6 personnes : Temps de préparation : 30 ... 1 0
3 Merlu Koskera 1 0
4 Ingrédients : 1 0

But when we look at the image there is only one.

(Figure: the skewed page image)

A grain of salt: these shortcomings could also stem from my having tuned the pipeline towards the heuristic approach.

7 Quantitative analysis

To evaluate the performance, I defined some reference data with the recipe titles and the correct OCR-block-to-recipe mapping.

with open("data/ground-truth/ground-truth.json", "r", encoding="utf-8") as f:
    reference = json.load(f)

reference_map = {page["page_id"]: page for page in reference["data"]}
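
For reference, the ground-truth file is assumed to look roughly like this; the field names are inferred from how the code below uses them, and the values are illustrative:

example_ground_truth = {
    "data": [
        {
            "page_id": "20250922_135514",
            # one recipe id per OCR block, in the (col_id, y1) sort order
            # that evaluate_page applies below
            "reference_sections": [0, 0, 0, 0, 1, 1, 1, 2, 2],
            "recipes": [
                {"title": "Soupe à l'ail bonne femme"},
                {"title": "Gigot de chevreuil"},
                {"title": "Croûtes aux champignons"},
            ],
        }
    ]
}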
def evaluate_page(page_id, df_ocr, reference_map):
    ref = reference_map[page_id]

    df_ocr = df_ocr.sort_values(["col_id", "y1"]).reset_index(drop=True)
    df_ocr["recipe_id_ref"] = ref["reference_sections"]

    return df_ocr, ref
def compute_metrics(df):
    y_ref = df["recipe_id_ref"]
    y_ocr = df["recipe_id"]

    results = {}

    results["ocr_acc"] = accuracy_score(y_ref, y_ocr) if "recipe_id" in df else None

    return results
def title_accuracy_from_recipes(recipes, ref_titles, threshold=0.7):
    pred_titles = [r["title"].strip() for r in recipes if r.get("title")]
    print(pred_titles)
    ref_titles = [rt.strip() for rt in ref_titles]

    matches = []
    matched = 0

    for rt in ref_titles:
        scores = [(pt, SequenceMatcher(None, pt.lower(), rt.lower()).ratio()) for pt in pred_titles]
        if scores:
            best_pred, best_score = max(scores, key=lambda x: x[1])
            matches.append((rt, best_pred, best_score))
            if best_score >= threshold:
                matched += 1
        else:
            matches.append((rt, None, 0.0))

    accuracy = matched / len(ref_titles) if ref_titles else 0.0
    return accuracy, matches

all_metrics = []

for json_path in glob.glob("data/raw/*.jpg"):
    page_id = os.path.splitext(os.path.basename(json_path))[0]
    if page_id not in reference_map:
        continue

    df_ocr, recipes_ocr = get_recipes_ocr_only_block(page_id)

    df_eval, ref = evaluate_page(page_id, df_ocr,  reference_map)
    metrics = compute_metrics(df_eval)

    ref_titles = [r["title"] for r in ref["recipes"]]

    ocr_acc, ocr_matches = title_accuracy_from_recipes(recipes_ocr, ref_titles, threshold=0.7)

    metrics.update({
        # OCR metrics
        "ocr_title_accuracy": ocr_acc,
        "ocr_title_matches": ocr_matches,
        "ocr_num_pred_recipes": len(recipes_ocr),

        # Reference info
        "num_ref_recipes": len(ref_titles),
    })

    all_metrics.append((page_id, metrics))
['Merlu Koskera']
['Artichauts à la sauce vinaigrette Artischocken mit Vinaigrette ( Foto S. 63 )', 'Asperges ,, sauce mousseline❝ Spargel mit abgeschlagener Sauce', '2 EBI . steifgeschlagene Schlagsahne']
["Soupe à l'ail bonne femme Knoblauchsuppe nach Hausfrauenart", 'Gigot de chevreuil']
["Dinde pochée au lait d'amande , mange - tout et haricots", 'Info nutrition']
['Croquetas', 'Boudin noir sur Canapé']
['Lapin aux pruneaux']
["Soupe à l'ail bonne femme Knoblauchsuppe nach Hausfrauenart", 'Gigot de chevreuil', 'Croûtes aux champignons Champignons in Pasteten']
['Variantes', '92 PLATS FEEL GOOD']
['Salade de poulet , fèves , fenouil et concombre sur toasts']
['Ein würziger Schweinebraten aus der Normandie RÔTI DE PORC AUX POMMES CARAMÉLISÉES']
['Mulligatawny']
['Croûtes aux champignons Champignons in Pasteten']
df_metrics = pd.DataFrame(
    [{**{"page_id": pid}, **m} for pid, m in all_metrics]
)
df_metrics
page_id ocr_acc ocr_title_accuracy ocr_title_matches ocr_num_pred_recipes num_ref_recipes
0 IMG_2073 1.000000 1.000000 [(Merlu Koskera, Merlu Koskera, 1.0)] 1 1
1 20250922_135453 0.750000 0.000000 [(Artichauts à la sauce vinaigrette, Artichaut... 3 2
2 20250922_135507 1.000000 0.500000 [(Soupe à l'ail bonne femme, Soupe à l'ail bon... 2 2
3 IMG_2079 0.692308 1.000000 [(Dinde pochée au lait d'amande, mange - tout ... 2 1
4 20250922_213505 1.000000 1.000000 [(Croquetas, Croquetas, 1.0), (Boudin noir sur... 2 2
5 IMG_2078 1.000000 1.000000 [(Lapin aux pruneaux, Lapin aux pruneaux, 1.0)] 1 1
6 20250922_135514 0.916667 0.333333 [(Soupe à l'ail bonne femme, Soupe à l'ail bon... 3 3
7 IMG_2080 0.125000 0.000000 [(Gnocchi sans gluten à la patate douce et pes... 2 1
8 IMG_2077 1.000000 1.000000 [(Salade de poulet, fèves, fenouil et concombr... 1 1
9 IMG_2074 1.000000 0.000000 [(RÔTI DE PORC POMMES CARAMÉLISÉES, Ein würzig... 1 1
10 IMG_2076 1.000000 1.000000 [(Mulligatawny, Mulligatawny, 1.0)] 1 1
11 20250922_135510 1.000000 0.000000 [(Croûtes aux champignons, Croûtes aux champig... 1 1

In short, the algorithm is far from perfect.

Most errors result from incorrect titles. When the titles are wrong, too many recipes are created.

8 Outlook

A few possible improvements include:

  • stabilizing titles with DocLayout-YOLO
  • using book-dependent settings adjusted based on user feedback
  • allowing manual user overrides
  • training a classifier solely on OCR data, but with much larger datasets

We’ll see how this evolves once it’s integrated into the main app.
