Page Segmentation: The easy and the hard way

Python
Computer Vision
Machine Learning
Generative AI
An OCR scan of a whole page with a complex layout can be done in two ways: the easy, expensive one using an LLM, or the more sophisticated one, which is harder to develop but cheaper to run.
Author: Dominik Lindner
Published: October 24, 2025

1 Many ways lead to digitized documents

When I first started working on recipescanner, the biggest issue was scanning multi-column and multi-recipe pages. How do you group the output from the OCR scan in such a way that recipes are not mixed with each other?

The following table shows different workflow options, from image to fully parsed recipe.

Option | How it Works | Speed (per page) | Needs | Best When | Cost
1. OCR → LLM Parsing | Extract with OCR, LLM for classification | 3–5 s (small LLM), 15–30 s (vision-LLM) | GPU or strong CPU, a few GB RAM | Layouts are messy | Medium
2. Vision → Text (Donut / Pix2Struct) | End-to-end vision model does classification | 0.8–2 s on GPU | GPU / NNAPI required | Handwriting, warped images | High
3. OCR → Segmentation → Rules / Small Model | Extract with OCR, rules or decision tree for classification | 0.3 s on CPU | CPU only, ~400 MB RAM | Scans are clean and structured | Low

For pages with multiple recipes, the approach can be separated into two stages.

  1. Image to text blocks and recipe sections
  2. Classification of each recipe into ingredients, description, …

The division into smaller tasks allows us to use less complex methods. In this notebook we focus on task #1.
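
To make the two stages concrete, here is a minimal sketch of the interface between them. The container types and function names are illustrative assumptions, not code from the actual project:

from dataclasses import dataclass, field

@dataclass
class RecipeSection:
    # Stage 1 output: the text blocks belonging to one recipe candidate
    blocks: list[str] = field(default_factory=list)

@dataclass
class ParsedRecipe:
    # Stage 2 output: the classified parts of a single recipe
    title: str = ""
    ingredients: list[str] = field(default_factory=list)
    instructions: str = ""

def segment_page(image_path: str) -> list[RecipeSection]:
    """Stage 1: image -> text blocks grouped per recipe (the topic of this notebook)."""
    raise NotImplementedError

def classify_section(section: RecipeSection) -> ParsedRecipe:
    """Stage 2: one recipe section -> title, ingredients, instructions, ..."""
    raise NotImplementedError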

1.1 Using LLM API

Nowadays, one straightforward solution is to run an LLM on the OCR output and ask it to cluster the text. You can even set up a complete workflow: cluster the text, check it, extract ingredients and instructions, and then check again to ensure all content is used and no recipes are mixed up.
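
As a rough illustration, such a clustering step could look like the sketch below. The client usage follows the current openai Python package; the model name and prompt are assumptions, not what recipescanner uses:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def cluster_ocr_text(ocr_text: str) -> str:
    """Ask an LLM to group raw OCR text into separate recipes (sketch only)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Group the following OCR text into recipes. "
                        "Return one JSON object per recipe with 'title' "
                        "and 'body'. Do not mix recipes."},
            {"role": "user", "content": ocr_text},
        ],
    )
    return response.choices[0].message.content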

The downside of all this? It is expensive to run, especially if we introduce correctness checks. How much more expensive?

For reference, Google currently charges $1.5 per 1000 pages, whereas Mistral asks for $3 per 1000 pages for annotated output. Google’s new interface, which (like Mistral) relies on Vision Transformers, also charges $1.5 per 1000 pages. Layout parsing costs extra at $10 per 1000 pages.

Let’s say each of your users has about 2000 pages to convert. That comes to $23 per user, and that is without classification. A custom extractor sets you back another $20, so the full workflow is about $50 per user. Again, that is only for recipe extraction. Interaction with the recipe database will probably cost around $3 per month.
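
Spelling out that arithmetic with the prices quoted above (the jump from $43 to “about $50” is rounding plus overhead):

pages_per_user = 2000

ocr_cost = pages_per_user / 1000 * 1.5      # OCR: $1.5 per 1000 pages
layout_cost = pages_per_user / 1000 * 10.0  # layout parsing: $10 per 1000 pages
extractor_cost = 20.0                       # custom extractor, rough figure

print(ocr_cost + layout_cost)                   # 23.0 -> "$23 per user"
print(ocr_cost + layout_cost + extractor_cost)  # 43.0 -> "about $50" with overhead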

1.2 The new way: end-to-end conversion

I recently tested Docling, a transformer-based architecture that has created quite a buzz.

Here is my short evaluation. It failed on the document we are going to use throughout this notebook. The deterministic demo (temperature=0) got stuck in an inference loop. I tried increasing the temperature and adding other tweaks to help the model escape local optima, but this came at the cost of accuracy.

The model repeated titles as ingredients for other recipes and failed to stop properly.

Docling Result: one recipe title is not detected as such

The model also has issues with line breaks: “Die Pilze damit bin-den” became “Pilze damit ben den”, line breaks included.

On top of that, the speed was poor on a legacy GPU (GTX 1080), which ran at full load for the entire 22 seconds. Memory consumption grew steadily as more tokens were decoded, just barely fitting on the 8 GB GPU with a peak usage of 6.5 GB.

Reflecting on the architecture, I suspect it works better on obscure edge cases: complex formats with overlapping figures or deeply nested tables might be where it shines.

Strangely, even the demo mostly showcases simple formats.

1.3 The not so easy way

Layout Parsing is not new. In fact, it became popular about three years ago.

One such layout parser is the aptly named LayoutParser. It combines automatic analysis of OCR output with segmentation. Unfortunately, the segmentation is done with Detectron2. Active development seems to have stopped, and the library no longer works on my Python installation (Python 3.12). My research revealed Python 3.9 as the last working version.

A more recent model is DocLayout-YOLO. Built on the YOLO architecture, it is trained on a very large document dataset to predict bounding boxes and class labels through a combined loss function.

What it lacks, however, is an automatic connection to OCR. One can either crop and OCR the text boxes, or align them with the OCR output.

Here is an example:

# Standard library
import glob
import json
import os
import random
import warnings
from difflib import SequenceMatcher
from pathlib import Path

# Third-party
import cv2
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image, ImageDraw
# Local / custom
from doclayout_yolo import YOLOv10
from sklearn.metrics import accuracy_score

# Jupyter magic
%matplotlib inline

# Suppress all warnings (optional)
warnings.filterwarnings("ignore")
model = YOLOv10("data/models/doclayout_yolo.pt")
det_res = model.predict(
    "data/raw/20250922_135514.jpg",
    imgsz=1024,
    conf=0.2,
    device="cuda:0",
    verbose=False
)
annotated_frame = det_res[0].plot(pil=True, line_width=10, font_size=30)
img_rgb = cv2.cvtColor(annotated_frame, cv2.COLOR_BGR2RGB)

plt.figure(figsize=(10, 14))
plt.imshow(img_rgb)
plt.axis("off")
plt.title("Segmentation using Doclayout YOLO")
plt.show()

Despite knowing nothing about the text, the CNN could still discover the layout. Which makes sense: you don’t need to understand a language to break a book into paragraphs.

There are some errors in how the titles are handled. Bright red boxes highlight detected titles, and we can see that there are too many.

When we run the CUDA-enabled version, the total runtime is 164 ms (650 ms with a cold start), whereas on CPU it can take up to 1.6 s.

1.4 Using only domain knowledge

And then there’s the domain knowledge approach. That is how I started. If we know all the text on a page and its location, and assume the page contains recipes: can we identify where recipes end, and which text belongs to which recipe?

In this notebook, I will examine this approach and how it compares to DocLayout-YOLO.

2 Converting Data to Dataframes

We will perform a statistical analysis and therefore convert the data to pandas dataframes.

2.1 OCR

In my project, I currently rely on Google Cloud OCR. Therefore, we need to convert the API response to a dataframe. While doing so, we also add font size and word count information.
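
For orientation, here is an abridged example of the structure the function below expects. The field names mirror the Google Cloud Vision response; the coordinates are illustrative:

ocr_json = [
    {
        "blockType": "TEXT",
        "boundingBox": {"vertices": [
            {"x": 853, "y": 484}, {"x": 1510, "y": 484},
            {"x": 1510, "y": 1389}, {"x": 853, "y": 1389},
        ]},
        "paragraphs": [{
            "boundingBox": {"vertices": [
                {"x": 853, "y": 484}, {"x": 1510, "y": 484},
                {"x": 1510, "y": 540}, {"x": 853, "y": 540},
            ]},
            "words": [{
                "boundingBox": {"vertices": [
                    {"x": 853, "y": 484}, {"x": 900, "y": 484},
                    {"x": 900, "y": 540}, {"x": 853, "y": 540},
                ]},
                "symbols": [{"text": "D"}, {"text": "i"}, {"text": "e"}],
            }],
        }],
    },
]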

def ocr_json_to_df(ocr_json):
    rows = []

    for block in ocr_json:
        block_type = block["blockType"]
        verts = block["boundingBox"]["vertices"]
        x1, y1 = verts[0].get("x", 0), verts[0].get("y", 0)
        x2, y2 = verts[2].get("x", 0), verts[2].get("y", 0)

        # Reconstruct text from words and establish size
        block_words = []
        average_word_height_sum = 0
        word_counter = 0
        for para_idx, para in enumerate(block.get("paragraphs", [])):
            paragraph_words = []
            paragraph_average_word_height_sum = 0
            paragraph_word_counter = 0
            para_verts = para["boundingBox"]["vertices"]
            px1, py1 = para_verts[0].get("x", 0), para_verts[0].get("y", 0)
            px2, py2 = para_verts[2].get("x", 0), para_verts[2].get("y", 0)
            for word_idx, word in enumerate(para.get("words", [])):
                symbols = [s["text"] for s in word.get("symbols", [])]
                word_text = "".join(symbols)
                block_words.append(word_text)
                paragraph_words.append(word_text)

                # store word-level rows (optional)
                word_verts = word["boundingBox"]["vertices"]
                wx1, wy1 = word_verts[0].get("x", 0), word_verts[0].get("y", 0)
                wx3, wy3 = word_verts[3].get("x", 0), word_verts[3].get("y", 0)
                average_word_height_sum += (wy3 - wy1)
                word_counter += 1
                paragraph_average_word_height_sum += (wy3 - wy1)
                paragraph_word_counter += 1

            rows.append({
                "level": "paragraph",
                "x1": px1, "y1": py1, "x2": px2, "y2": py2,
                "font_size": paragraph_average_word_height_sum / paragraph_word_counter,
                "text": " ".join(paragraph_words),
                "block_type": block_type
            })

        rows.append({
            "level": "block",
            "x1": x1, "y1": y1, "x2": x2, "y2": y2,
            "font_size": average_word_height_sum / word_counter,
            "text": " ".join(block_words),
            "block_type": block_type
        })

    df = pd.DataFrame(rows)

    df["word_count"] = df["text"].str.split().str.len()

    return df
def read_json(image_id):
    json_path = Path("data/ocr/")

    with open(json_path / f"{image_id}.json", "r") as f:
        return json.load(f)
filename = "20250922_135514"
page = read_json(filename)
df_ocr = ocr_json_to_df(page)

Let’s split block and paragraph rows for further processing.

df_block = df_ocr[df_ocr["level"] == "block"]
df_para = df_ocr[df_ocr["level"] == "paragraph"]
df_block.head()
level x1 y1 x2 y2 font_size text block_type word_count
1 block 874 2331 911 2355 24.000000 64 TEXT 1
7 block 853 484 1510 1389 31.902913 Die Butter zerlassen , das Weißbrot von beiden... TEXT 103
10 block 878 1501 1404 1737 52.375000 Soupe à l'ail bonne femme Knoblauchsuppe nach ... TEXT 8
16 block 875 1815 1291 2081 32.761905 2 Stangen Porree ( Lauch ) 250 g enthäutete To... TEXT 21
19 block 877 2101 989 2176 26.500000 Salz Pfeffer TEXT 2

2.2 DocLayout-YOLO

DocLayout-YOLO returns boxes, labels, and confidence levels.

def yolo_to_df(result):
    boxes = result.boxes.xyxy.cpu().numpy()
    labels = result.boxes.cls.cpu().numpy().astype(int)
    scores = result.boxes.conf.cpu().numpy()
    names = result.names

    records = []
    for box, lbl, score in zip(boxes, labels, scores):
        x1, y1, x2, y2 = box.tolist()
        label = names[lbl]
        records.append({
            "x1": x1,
            "y1": y1,
            "x2": x2,
            "y2": y2,
            "label": label,
            "confidence": score,
            "is_title": (label.lower() == "title")
        })

    return pd.DataFrame(records)
det_res = model.predict(
    "data/raw/20250922_135514.jpg",
    imgsz=1024,
    conf=0.2,
    device="cuda:0",
    verbose=False
)
df_yolo = yolo_to_df(det_res[0]);df_yolo
x1 y1 x2 y2 label confidence is_title
0 879.157227 1498.183594 1404.824463 1631.679443 title 0.922120 True
1 879.624573 1643.227661 1357.394409 1753.128174 plain text 0.903516 False
2 3569.065430 2246.001465 3609.026123 2281.185547 abandon 0.831622 False
3 2968.354492 450.490784 3297.232910 592.068970 title 0.809443 True
4 870.937500 2327.354248 912.915344 2362.086182 abandon 0.795582 False
5 874.531006 1780.801758 1492.744629 2286.900391 table 0.727726 False
6 1527.393433 598.798950 2144.085449 1672.833618 plain text 0.705026 False
7 2974.093750 598.492249 3467.499023 654.774719 plain text 0.638011 False
8 877.592163 465.383392 1481.144409 1402.559326 plain text 0.626123 False
9 2980.972656 682.661133 3532.022705 1200.553101 table 0.596943 False
10 1526.767578 597.575073 2054.986328 686.416992 plain text 0.583433 False
11 2316.687988 450.310455 2927.670898 824.062439 plain text 0.513546 False
12 3022.026855 1725.583984 3593.486572 2192.724609 plain text 0.488287 False
13 2325.665283 856.174072 2937.416748 1041.672241 plain text 0.481687 False
14 2330.425537 1042.294434 2914.214844 1225.915161 plain text 0.477337 False
15 1532.011597 1245.381348 2047.088745 1337.381592 plain text 0.455887 False
16 2350.650879 1689.926758 2967.089355 1876.195068 plain text 0.453441 False
17 2337.659668 1229.601807 2960.235352 1686.113770 plain text 0.445066 False
18 1522.936157 457.546967 1769.362915 547.661194 title 0.433465 True
19 1525.181885 461.238281 1770.478882 548.516296 title 0.419111 True
20 1537.929932 1843.337891 2071.566162 1914.832031 title 0.392423 True
21 882.206665 469.045898 1474.692993 848.210815 plain text 0.391044 False
22 2355.457520 1878.174194 2988.262939 2259.762207 plain text 0.389824 False
23 882.791809 845.683289 1476.399170 982.723877 plain text 0.378071 False
24 3001.068604 1266.987061 3527.293213 1354.325928 plain text 0.370774 False
25 3006.092529 1268.646606 3583.754639 2200.445557 plain text 0.360401 False
26 1535.614624 1842.507202 2077.353516 1984.601807 title 0.337300 True
27 880.934509 1258.313110 1462.825928 1400.205078 plain text 0.305315 False
28 881.682800 1118.672363 1455.794189 1258.220947 plain text 0.294823 False
29 882.505371 982.659119 1479.449707 1116.903931 plain text 0.263816 False
30 1534.299072 1620.801636 2126.622803 1667.255005 plain text 0.257625 False
31 1533.716064 1338.074707 2135.751709 1658.443481 plain text 0.245955 False
32 1533.999634 2035.315918 2170.545654 2285.304443 plain text 0.232413 False
33 1537.556274 1929.555420 1743.389038 1981.646973 plain text 0.227065 False

In both cases, we can see that the title detection is not optimal. In the block-based detection, it will be difficult to separate the title “Soupe à l’ail bonne femme” from its subtitle “Knoblauchsuppe nach Hausfrauenart”.

For DocLayout-YOLO, there are too many titles. Ingredients were detected as titles because they sit at the top of a column and are printed in bold.

3 Title detection

3.1 Title detection using OCR

Let’s try to improve the title detection. We start with the OCR blocks.

df_block[["text","font_size"]]
text font_size
1 64 24.000000
7 Die Butter zerlassen , das Weißbrot von beiden... 31.902913
10 Soupe à l'ail bonne femme Knoblauchsuppe nach ... 52.375000
16 2 Stangen Porree ( Lauch ) 250 g enthäutete To... 32.761905
19 Salz Pfeffer 26.500000
21 einige runde , ausgestochene Toast- brotscheiben 34.000000
24 Speiseöl Parmesankäse 29.500000
27 Den Porree putzen , waschen , in Ringe schneid... 35.600000
29 zerdrücken . 25.500000
34 Das Öl erhitzen , das Gemüse mit den Knoblauch... 33.516854
36 Gigot de chevreuil 62.000000
38 Rehblatt 37.000000
41 800 g Rehblatt ( Schulter ) Salz , Pfeffer 50 ... 34.187500
43 2 Schalotten 26.500000
49 2 zerdrückte Wacholderbeeren 2 zerdrückte Knob... 32.347826
51 2 Teel . Weizenmehl 125 ml ( 1 ) Schlagsahne 33.600000
56 Das Rehblatt unter fließendem kal- tem Wasser ... 33.964072
58 Croûtes aux champignons Champignons in Pasteten 45.666667
60 500 g Champignons 50 g Butter 33.500000
62 Salz Pfeffer 25.500000
69 Cayennepfeffer 125 ml ( 1 ) Wasser 2 gestriche... 32.321429
73 Die Champignons putzen , waschen , vierteln . ... 32.377193
75 55 -1.000000
77 65 26.000000

Titles are at indices 10, 36, and 58. For indices 10 and 58, the title sits in the same block as its subtitle. Both have a larger average line height (font size) than the rest, by a factor of almost 1.5. With this in mind we create our title detector. Just to be sure, we also limit the number of words.

Luckily, we already extracted this information into our dataframe.

def detect_titles(df, font_factor=1.2, max_words=15):
    df = df[df["word_count"] > 0].copy()

    if len(df) == 1:
        return df
    # compute mean font size ignoring NaNs
    mean_font_size = df["font_size"].mean()

    df["is_title"] = (
            (df["font_size"] > font_factor * mean_font_size) &
            (df["word_count"] <= max_words) &
             (df["text"].str.len() >= 3)
    )
    return df[df["is_title"]]
df_titles_blocks = detect_titles(df_block)
df_titles_blocks[["text","is_title","font_size"]]
text is_title font_size
10 Soupe à l'ail bonne femme Knoblauchsuppe nach ... True 52.375000
36 Gigot de chevreuil True 62.000000
58 Croûtes aux champignons Champignons in Pasteten True 45.666667
df_titles_para = detect_titles(df_para)
df_titles_para[["text","is_title","font_size"]]
text is_title font_size
8 Soupe à l'ail bonne femme True 56.600000
9 Knoblauchsuppe nach Hausfrauenart True 45.333333
35 Gigot de chevreuil True 62.000000
57 Croûtes aux champignons Champignons in Pasteten True 45.666667

For blocks, the subtitle is detected together with the title in one block. For paragraphs, the subtitle has an almost equal font size to the title. And for the third title, separation is not possible even at the paragraph level.

Now it would be great to include this information back into the dataframe.

def add_titles_to_df(original, titles):
    original["is_title"] = False
    original.loc[titles.index, "is_title"] = titles["is_title"]
add_titles_to_df(df_block, df_titles_blocks)
df_block[["text","is_title","font_size"]]
text is_title font_size
1 64 False 24.000000
7 Die Butter zerlassen , das Weißbrot von beiden... False 31.902913
10 Soupe à l'ail bonne femme Knoblauchsuppe nach ... True 52.375000
16 2 Stangen Porree ( Lauch ) 250 g enthäutete To... False 32.761905
19 Salz Pfeffer False 26.500000
21 einige runde , ausgestochene Toast- brotscheiben False 34.000000
24 Speiseöl Parmesankäse False 29.500000
27 Den Porree putzen , waschen , in Ringe schneid... False 35.600000
29 zerdrücken . False 25.500000
34 Das Öl erhitzen , das Gemüse mit den Knoblauch... False 33.516854
36 Gigot de chevreuil True 62.000000
38 Rehblatt False 37.000000
41 800 g Rehblatt ( Schulter ) Salz , Pfeffer 50 ... False 34.187500
43 2 Schalotten False 26.500000
49 2 zerdrückte Wacholderbeeren 2 zerdrückte Knob... False 32.347826
51 2 Teel . Weizenmehl 125 ml ( 1 ) Schlagsahne False 33.600000
56 Das Rehblatt unter fließendem kal- tem Wasser ... False 33.964072
58 Croûtes aux champignons Champignons in Pasteten True 45.666667
60 500 g Champignons 50 g Butter False 33.500000
62 Salz Pfeffer False 25.500000
69 Cayennepfeffer 125 ml ( 1 ) Wasser 2 gestriche... False 32.321429
73 Die Champignons putzen , waschen , vierteln . ... False 32.377193
75 55 False -1.000000
77 65 False 26.000000

Let’s do the same for the paragraphs.

add_titles_to_df(df_para, df_titles_para)

3.2 Title detection using DocLayout-YOLO and OCR

Luckily, the layout detector already flags which blocks are titles. The issue: none of those boxes contain text. Most of them span multiple lines, which means our font-size approach does not work; we know neither how many lines nor how many words they contain.

This is where DocLayout-YOLO falls short. We could run every single block through OCR; with a local OCR engine, that could be as effective as full-page detection, maybe even better. But since we rely on Google OCR, that approach would be very expensive, as billing is per request.

Instead, we align the two sets of boxes and copy every OCR word that falls inside a DocLayout box into the corresponding dataframe row.

df_titles_yolo = df_yolo[df_yolo["is_title"]].copy()
df_titles_yolo["text"] = ""
df_titles_yolo["average_word_height_sum"] = 0
df_titles_yolo["word_count"] = 0
for block in page:
    for para_idx, para in enumerate(block.get("paragraphs", [])):

        for word_idx, word in enumerate(para.get("words", [])):
            symbols = [s["text"] for s in word.get("symbols", [])]
            word_text = "".join(symbols)
            word_verts = word["boundingBox"]["vertices"]
            wx1, wy1 = word_verts[0].get("x", 0), word_verts[0].get("y", 0)
            wx2, wy2 = word_verts[3].get("x", 0), word_verts[3].get("y", 0)
            wcx = (wx1 + wx2) / 2
            wcy = (wy1 + wy2) / 2
            word_height = abs(wy2 - wy1)

            # assign to title box if inside
            for idx, row in df_titles_yolo.iterrows():
                if (row["x1"] <= wcx <= row["x2"]) and (row["y1"] <= wcy <= row["y2"]):
                    # append word text
                    df_titles_yolo.at[idx, "text"] += " " + word_text
                    # accumulate height + count
                    df_titles_yolo.at[idx, "average_word_height_sum"] += word_height
                    df_titles_yolo.at[idx, "word_count"] += 1
                    # can only be in one title
                    break
df_titles_yolo["font_size"] = df_titles_yolo.average_word_height_sum / df_titles_yolo.word_count
df_titles_yolo[["text","is_title","font_size"]]
text is_title font_size
0 Soupe à l'ail bonne femme True 56.600000
3 Croûtes aux champignons True 47.333333
18 Speiseöl Parmesankäse True 29.500000
19 True NaN
20 Gigot de chevreuil True 62.000000
26 Rehblatt True 37.000000

As indicated earlier, we get different kinds of false positives. Only “Rehblatt” resembles the earlier case of subtitles in the paragraph-based detection. Luckily for us, this time font size should work well.

df_titles_yolo = detect_titles(df_titles_yolo, font_factor=1)
df_titles_yolo[["text","is_title","font_size"]]
text is_title font_size
0 Soupe à l'ail bonne femme True 56.600000
3 Croûtes aux champignons True 47.333333
20 Gigot de chevreuil True 62.000000

As expected, it worked even without a safety factor. We wrap this in a function.

def add_text_and_font_size_to_layout_df(df, ocr_page):
    df["text"] = ""
    df["average_word_height_sum"] = 0
    df["word_count"] = 0
    unassigned_words = []
    for block in ocr_page:
        for para_idx, para in enumerate(block.get("paragraphs", [])):

            for word_idx, word in enumerate(para.get("words", [])):
                symbols = [s["text"] for s in word.get("symbols", [])]
                word_text = "".join(symbols)
                verts = word["boundingBox"]["vertices"]
                wx1, wy1 = verts[0].get("x", 0), verts[0].get("y", 0)
                wx2, wy2 = verts[3].get("x", 0), verts[3].get("y", 0)
                wcx = (wx1 + wx2) / 2
                wcy = (wy1 + wy2) / 2
                word_height = abs(wy2 - wy1)

                assigned = False
                for idx, row in df.iterrows():
                    if (row["x1"] <= wcx <= row["x2"]) and (row["y1"] <= wcy <= row["y2"]):
                        # append word text
                        df.at[idx, "text"] += " " + word_text
                        # accumulate height + count
                        df.at[idx, "average_word_height_sum"] += word_height
                        df.at[idx, "word_count"] += 1
                        assigned = True
                        break

                if not assigned:
                    unassigned_words.append({
                        "text": word_text,
                        "x": wcx,
                        "y": wcy,
                        "height": word_height
                    })
    if len(unassigned_words) > 0:
        print("unassigned words:")
    df["font_size"] = df.average_word_height_sum / df.word_count
    return df

Note that I included a small debug hint, which triggers if there are any unassigned words.

Next, we feed this information back into the dataframe. Here is the complete pipeline for DocLayout-YOLO:

df_yolo_font_size = add_text_and_font_size_to_layout_df(df_yolo.copy(), page)
df_titles_yolo = detect_titles(df_yolo_font_size[df_yolo_font_size.is_title], font_factor=1)
add_titles_to_df(df_yolo_font_size, df_titles_yolo)
df_yolo_font_size[["text", "is_title","font_size"]]
unassigned words:
text is_title font_size
0 Soupe à l'ail bonne femme True 56.600000
1 Knoblauchsuppe nach Hausfrauenart False 45.333333
2 65 False 26.000000
3 Croûtes aux champignons True 47.333333
4 64 False 24.000000
5 2 Stangen Porree ( Lauch ) 250 g enthäutete T... False 32.586207
6 Den Porree putzen , waschen , in Ringe schnei... False 33.900826
7 Champignons in Pasteten False 44.000000
8 Die Butter zerlassen , das Weißbrot von beide... False 31.902913
9 500 g Champignons g Butter Salz Pfeffer Cayen... False 32.142857
10 False NaN
11 zerdrückte Wacholderbeeren zerdrückte Knoblau... False 32.419355
12 Sahne und Petersilie unterrühren . Die Champi... False 32.016667
13 Das Rehblatt unter fließendem kal- tem Wasser... False 35.842105
14 Den Speck in Streifen schneiden . Die Butter ... False 34.090909
15 False NaN
16 Fleisch von den Knochen Das gare lösen , in P... False 30.380952
17 Schalotten abziehen , vierteln , mit den Wach... False 32.963636
18 Speiseöl Parmesankäse False 29.500000
19 False NaN
20 Gigot de chevreuil True 62.000000
21 False NaN
22 Den Bratensatz mit etwas Wasser los- kochen u... False 35.877551
23 False NaN
24 Die Champignons putzen , waschen , vierteln . False 31.250000
25 Die Butter zerlassen , die Champi- gnons dari... False 33.043478
26 Rehblatt False 37.000000
27 False NaN
28 False NaN
29 False NaN
30 False NaN
31 False NaN
32 800 g Rehblatt ( Schulter ) Salz , Pfeffer 50... False 33.333333
33 False NaN

4 Recipe detection using only OCR

Now with the titles cleaned up, we know how many titles there are. The next step is distributing the rows of each dataframe to the recipes.

An important piece of domain knowledge, or prior, is that almost all recipe formats are organized in columns. Titles are usually placed somewhere in these columns, and a title marks the beginning of a recipe.

The main drawback: any other layout format cannot be processed.

We will do this approach in two steps:

  1. Split the text into columns based on the row’s bounding box
  2. Split columns into recipes based on title position.

4.1 Working on OCR block level

This was actually the hardest part. For the sake of brevity, I only provide the final result, not the full journey.

4.1.1 Column detection

The OCR pipeline has already discovered fragments of text that belong together, and organized them into paragraphs and blocks.

Our algorithm tries two approaches; the fit score used by the first is written out below the list.

  1. We assume there are no more than five columns, evenly distributed. (In practice column sizes are often unequal, typically when ingredients occupy a column of their own.) We search for the best fit, where fit is scored by the mean absolute distance between each box’s left edge and the nearest column center.

  2. As a fallback, we estimate the number of columns with a normalized, area-weighted histogram. A valid column must hold at least 50% of the text area of the largest column. That of course assumes recipes of roughly equal length; because this assumption is shaky, the first approach is preferred.
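In symbols, the first approach picks the number of columns n that minimizes the score below (this mirrors the code that follows; B is the set of detected boxes, x_b the left edge of box b, and W the page width):

$$\mathrm{score}(n) = \frac{1}{|B|} \sum_{b \in B} \min_{0 \le i < n} \left| x_b - c_i \right|, \qquad c_i = \left(i + \tfrac{1}{2}\right) \frac{W}{n}$$

The smallest-scoring n wins and is accepted if the score stays below the error threshold; otherwise we fall back to the histogram.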
def detect_columns(df, max_cols=5, error_threshold=20):
    page_width = df["x2"].max()
    best_n, best_score, best_assignments = 1, float("inf"), None

    for n_cols in range(1, max_cols + 1):
        col_width = page_width / n_cols
        col_centers = [(i + 0.5) * col_width for i in range(n_cols)]

        # Assign each block to nearest center
        assignments = []
        errors = []
        for x in df["x1"]:
            dists = [abs(x - c) for c in col_centers]
            col_idx = int(np.argmin(dists))
            assignments.append(col_idx)
            errors.append(min(dists))

        score = np.mean(errors)  # lower = better alignment
        if score < best_score:
            best_score = score
            best_n = n_cols
            best_assignments = assignments

    if best_score < error_threshold:
        df["col_id"] = best_assignments

        col_boxes = []
        for col_id, group in df.groupby("col_id"):
            col_boxes.append({
                "col_id": col_id,
                "col_x1": group["x1"].min(),
                "col_y1": group["y1"].min(),
                "col_x2": group["x2"].max(),
                "col_y2": group["y2"].max()
            })

        col_boxes = pd.DataFrame(col_boxes).sort_values("col_x1").reset_index(drop=True)
        # build the old-id -> new-id mapping before overwriting col_id with the sorted order
        id_map = {old: new for new, old in enumerate(col_boxes["col_id"])}
        col_boxes["col_id"] = range(len(col_boxes))
        df["col_id"] = df["col_id"].map(id_map)
        return col_boxes, df
    else:
        # Fallback to histogram based method
        df["area"] = (df["x2"] - df["x1"]) * (df["y2"] - df["y1"])
        hist, bin_edges = np.histogram(
            df["x1"],
            bins=10,
            weights=df["area"]
        )
        hist_norm = hist / np.max(hist)
        valid_bins = np.where(hist_norm > 0.5)[0]
        bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
        col_centers = bin_centers[valid_bins]

        df["col_id"] = df["x1"].apply(lambda x: np.argmin(np.abs(col_centers - x)))

        col_boxes = []
        for col_id, group in df.groupby("col_id"):
            col_boxes.append({
                "col_id": col_id,
                "col_x1": group["x1"].min(),
                "col_y1": group["y1"].min(),
                "col_x2": group["x2"].max(),
                "col_y2": group["y2"].max()
            })

        col_boxes = pd.DataFrame(col_boxes).sort_values("col_x1").reset_index(drop=True)
        # build the old-id -> new-id mapping before overwriting col_id with the sorted order
        id_map = {old: new for new, old in enumerate(col_boxes["col_id"])}
        col_boxes["col_id"] = range(len(col_boxes))
        df["col_id"] = df["col_id"].map(id_map)

        return col_boxes, df
cols, df = detect_columns(df_block.copy())
cols
col_id col_x1 col_y1 col_x2 col_y2
0 0 853 484 1510 2355
1 1 1526 467 2138 2271
2 2 2306 474 2993 2249
3 3 2973 455 3608 2274
df[["text","col_id"]]
text col_id
1 64 0
7 Die Butter zerlassen , das Weißbrot von beiden... 0
10 Soupe à l'ail bonne femme Knoblauchsuppe nach ... 0
16 2 Stangen Porree ( Lauch ) 250 g enthäutete To... 0
19 Salz Pfeffer 0
21 einige runde , ausgestochene Toast- brotscheiben 0
24 Speiseöl Parmesankäse 1
27 Den Porree putzen , waschen , in Ringe schneid... 1
29 zerdrücken . 1
34 Das Öl erhitzen , das Gemüse mit den Knoblauch... 1
36 Gigot de chevreuil 1
38 Rehblatt 1
41 800 g Rehblatt ( Schulter ) Salz , Pfeffer 50 ... 1
43 2 Schalotten 1
49 2 zerdrückte Wacholderbeeren 2 zerdrückte Knob... 2
51 2 Teel . Weizenmehl 125 ml ( 1 ) Schlagsahne 2
56 Das Rehblatt unter fließendem kal- tem Wasser ... 2
58 Croûtes aux champignons Champignons in Pasteten 3
60 500 g Champignons 50 g Butter 3
62 Salz Pfeffer 3
69 Cayennepfeffer 125 ml ( 1 ) Wasser 2 gestriche... 3
73 Die Champignons putzen , waschen , vierteln . ... 3
75 55 3
77 65 3

The number of columns is correct.

4.1.2 Recipe detection

We now proceed by grouping the blocks into recipes. This function is the work of many failed iterations.

Since we have established columns, we treat all text as if it were in one big column. Whenever a title appears, we start a new recipe.

I solved the subtitle issue by checking that the gap to the previous title is similar to the title font size. If so, it is a subtitle, not a new recipe.

def group_recipes(df, subtitle_factor=1.2):
    df = df.copy()
    recipe_id = -1
    all_recipes = []
    recipe_map = {}

    # Sort by column, then y
    cols = sorted(df["col_id"].unique())
    last_recipe_id = None
    current_recipe = {"title": None, "blocks": []}

    last_title_font = None

    for col in cols:
        col_blocks = df[df["col_id"] == col].sort_values("y1")
        last_title_bottom = -1
        for idx, row in col_blocks.iterrows():

            if row.is_title:
                # check if this title is actually a subtitle
                subtitle_gap = (last_title_font or row.font_size) * subtitle_factor

                is_subtitle = (
                        current_recipe["title"] is not None
                        and (row["y1"] - last_title_bottom) < subtitle_gap
                )

                if is_subtitle:
                    # merge into current recipe title
                    current_recipe["title"] += " " + row.text
                    current_recipe["blocks"].append(row.text)
                    recipe_map[idx] = last_recipe_id
                else:

                    if current_recipe["title"] is not None:
                        all_recipes.append(current_recipe)

                    # start new recipe
                    recipe_id += 1
                    current_recipe = {"title": row.text, "blocks": [row.text]}
                    recipe_map[idx] = recipe_id
                    last_recipe_id = recipe_id

                last_title_bottom = row["y2"]
                last_title_font = row.font_size

            else:

                if last_recipe_id is None:
                    recipe_map[idx] = -1  # orphan
                else:
                    recipe_map[idx] = last_recipe_id
                    current_recipe["blocks"].append(row.text)

    if current_recipe["title"] is not None:
        all_recipes.append(current_recipe)

    df["recipe_id"] = df.index.map(recipe_map).fillna(-1).astype(int)
    return df, all_recipes
df, recipes = group_recipes(df)
df[["text", "recipe_id"]]
text recipe_id
1 64 0
7 Die Butter zerlassen , das Weißbrot von beiden... -1
10 Soupe à l'ail bonne femme Knoblauchsuppe nach ... 0
16 2 Stangen Porree ( Lauch ) 250 g enthäutete To... 0
19 Salz Pfeffer 0
21 einige runde , ausgestochene Toast- brotscheiben 0
24 Speiseöl Parmesankäse 0
27 Den Porree putzen , waschen , in Ringe schneid... 0
29 zerdrücken . 0
34 Das Öl erhitzen , das Gemüse mit den Knoblauch... 0
36 Gigot de chevreuil 1
38 Rehblatt 1
41 800 g Rehblatt ( Schulter ) Salz , Pfeffer 50 ... 1
43 2 Schalotten 1
49 2 zerdrückte Wacholderbeeren 2 zerdrückte Knob... 1
51 2 Teel . Weizenmehl 125 ml ( 1 ) Schlagsahne 1
56 Das Rehblatt unter fließendem kal- tem Wasser ... 1
58 Croûtes aux champignons Champignons in Pasteten 2
60 500 g Champignons 50 g Butter 2
62 Salz Pfeffer 2
69 Cayennepfeffer 125 ml ( 1 ) Wasser 2 gestriche... 2
73 Die Champignons putzen , waschen , vierteln . ... 2
75 55 2
77 65 2

Nice, even the fragment of the previous recipe in the first column was treated correctly. Only the page number was wrongly attributed to the first recipe.

A picture says more than a thousand words.

def plot_recipes(df, image, save=False):
    colors = {}
    fig = plt.figure(figsize=(12, 16))
    plt.imshow(image)

    for _, row in df.iterrows():
        rid = row["recipe_id"]
        if rid not in colors:
            colors[rid] = [random.random(), random.random(), random.random()]
        color = colors[rid]

        rect = patches.Rectangle(
            (row["x1"], row["y1"]),
            row["x2"] - row["x1"],
            row["y2"] - row["y1"],
            linewidth=2,
            edgecolor=color,
            facecolor="none"
        )
        plt.gca().add_patch(rect)

        if row["is_title"]:
            plt.text(row["x1"], row["y1"] - 5, row["text"][:30],
                     color=color, fontsize=10, weight="bold")

    plt.axis("off")
    if save:
        fig.savefig("result.jpg")
    plt.show()
def read_image(image_id):
    image_path = Path("data/raw")
    filename = image_path / f"{image_id}.jpg"
    image = Image.open(filename)

    return image
image = read_image(filename)
plot_recipes(df, image, True)

As we can see, this approach also works quite well.

4.2 Working on OCR paragraph level

Next, we try OCR paragraphs. Thanks to the subtitle workaround, this also works for this recipe. However, it introduces another variable into the whole process, making it more brittle.

We can see that the first subtitle is printed in bold in the plot, because the dataframe still flags it as a title.

cols, df = detect_columns(df_para.copy())
df, recipes = group_recipes(df)
plot_recipes(df, image)

4.3 Generalization of OCR-based splitting

Let’s extend the algorithm to two other formats and see if it succeeds.

filename = "IMG_2077"
page = read_json(filename)
df = ocr_json_to_df(page)
df_block = df[df["level"] == "block"]
titles = detect_titles(df_block.copy())
add_titles_to_df(df_block, titles)
image = read_image(filename)
cols, df = detect_columns(df_block.copy())
df, recipes = group_recipes(df, 300)
plot_recipes(df, image)

filename = "IMG_2074"
page = read_json(filename)
df = ocr_json_to_df(page)
df_block = df[df["level"] == "block"]
titles = detect_titles(df_block.copy())
add_titles_to_df(df_block, titles)
image = read_image(filename)
cols, df = detect_columns(df_block.copy())
df, recipes = group_recipes(df, 300)
plot_recipes(df, image)

recipes
[{'title': 'Ein würziger Schweinebraten aus der Normandie RÔTI DE PORC AUX POMMES CARAMÉLISÉES',
  'blocks': ['Ein würziger Schweinebraten aus der Normandie RÔTI DE PORC',
   'AUX POMMES CARAMÉLISÉES',
   'SCHWEINEBRATEN MIT KARAMELLISIERTEN ÄPFELN',
   'Wenig Rosmarinnadeln , die Salbeiblätter , Arbeitsaufwand : 30 Minuten die Knoblauchzehen und die Fenchelsamen im Mörser zerstoßen . Salz und Pfeffer zuge- ben . - Die Äpfel schälen , entkernen und in Schnitze schneiden . - Den Zucker mit dem Zitronensaft in einer Bratpfanne erhitzen . Sobald er hellbraun wird , die Äpfel zuge- ben , gut wenden , die Butter in Flocken zu- geben , mit 3 bis 5 EL Wasser ablöschen und ca. 5 Minuten garen . Salzen und pfeffern.- 3 bis 4 Einschnitte im Fleisch anbringen . Die Öffnungen mit der Gewürzmischung fül- len . Das Fleisch zu einem Rollbraten schnü- 2 dl Apfelwein ren , salzen und pfeffern . - Die Rosmarin-',
   'Bratzeit : 2 Stunden Für 4 Personen 5 Zweige Rosmarin 2-3 Salbeiblätter 2 Knoblauchzehen 1 Prise Fenchelsamen Salz , Pfeffer',
   '500 g säuerliche Äpfel 50g Rohzucker 1 EL Zitronensaft 2 EL frische Butter 1 kg magerer Schweinehals 2 EL zimmerwarme Bratbutter',
   'zweige verteilt unter der Schnur anbringen . Das Fleisch mit der weichen Bratbutter bestreichen . In einer Bratkasserolle rundum an- braten . - Den Apfelwein zufügen und zugedeckt bei kleiner Hitze 2 Stunden garen . - Den Bratenfond mit 1 bis 2 EL Wasser aufkochen . 1/3 der Äpfel pürieren und mit dem Bratenjus gut mischen , abschme- cken . - Die restlichen Apfelschnitze rasch erwärmen und als Garni- tur zum tranchierten Braten servieren . Getränk : Rustikaler Rotwein , zum Beispiel aus der Provence Anmerkung : Diese ausgeprägten Zutaten passen auch gut zu Kalb- fleisch . Deshalb lässt sich nach demselben Rezept ebenso gut ein',
   'Kalbsbratenzubereiten .']}]

This last layout is one of my favourites in terms of complexity: a triple title and a deeply nested format. The heuristic title detection, with all its domain knowledge, captures all three titles together in the final extract. I’m curious how this performs on completely unseen layouts.

5 Recipe detection using DocLayout-YOLO

We already know title detection works better with the YOLO detector, but what about the columns and recipes?

Once again, we proceed in a two-step approach. First columns, then recipes.

5.1 Detecting columns

We use the previously defined function to find columns based on bounding box positions. At this stage, no text is required.

col_df, df_yolo_font_size = detect_columns(df_yolo_font_size)
df_yolo_font_size[["text","is_title","col_id"]]
text is_title col_id
0 Soupe à l'ail bonne femme True 0
1 Knoblauchsuppe nach Hausfrauenart False 0
2 65 False 3
3 Croûtes aux champignons True 3
4 64 False 0
5 2 Stangen Porree ( Lauch ) 250 g enthäutete T... False 0
6 Den Porree putzen , waschen , in Ringe schnei... False 1
7 Champignons in Pasteten False 3
8 Die Butter zerlassen , das Weißbrot von beide... False 0
9 500 g Champignons g Butter Salz Pfeffer Cayen... False 3
10 False 1
11 zerdrückte Wacholderbeeren zerdrückte Knoblau... False 2
12 Sahne und Petersilie unterrühren . Die Champi... False 3
13 Das Rehblatt unter fließendem kal- tem Wasser... False 2
14 Den Speck in Streifen schneiden . Die Butter ... False 2
15 False 1
16 Fleisch von den Knochen Das gare lösen , in P... False 2
17 Schalotten abziehen , vierteln , mit den Wach... False 2
18 Speiseöl Parmesankäse False 1
19 False 1
20 Gigot de chevreuil True 1
21 False 0
22 Den Bratensatz mit etwas Wasser los- kochen u... False 2
23 False 0
24 Die Champignons putzen , waschen , vierteln . False 3
25 Die Butter zerlassen , die Champi- gnons dari... False 3
26 Rehblatt False 1
27 False 0
28 False 0
29 False 0
30 False 1
31 False 1
32 800 g Rehblatt ( Schulter ) Salz , Pfeffer 50... False 1
33 False 1

Again we have four columns. As the DocLayout-YOLO detector counts differently, it is not obvious whether col_id is correct. We therefore plot the result.

image = read_image("20250922_135514")

fig, ax = plt.subplots(1, figsize=(12, 12))
ax.imshow(image)

# YOLO boxes
for _, row in df_yolo_font_size.iterrows():
    color = "red" if row["is_title"] else "blue"
    rect = patches.Rectangle(
        (row["x1"], row["y1"]),
        row["x2"] - row["x1"],
        row["y2"] - row["y1"],
        linewidth=2,
        edgecolor=color,
        facecolor="none"
    )
    ax.add_patch(rect)
    if row["is_title"]:
        ax.text(row["x1"], row["y1"] - 5, "TITLE", color="red", fontsize=10, weight="bold")

# Column boxes
if col_df is not None:
    for _, row in col_df.iterrows():
        rect = patches.Rectangle(
            (row["col_x1"], row["col_y1"]),
            row["col_x2"] - row["col_x1"],
            row["col_y2"] - row["col_y1"],
            linewidth=3,
            edgecolor="green",
            facecolor="none",
            linestyle="--"
        )
        ax.add_patch(rect)
        ax.text(row["col_x1"], row["col_y1"] - 5, "COLUMN", color="green", fontsize=10)

plt.axis("off")
plt.show()

First, the columns are correct.

There are also empty rows, which we need to filter out before grouping.

Then we can call our grouping function.

sections_df, recipes = group_recipes(df_yolo_font_size[df_yolo_font_size.text != ""]); recipes
[{'title': " Soupe à l'ail bonne femme",
  'blocks': [" Soupe à l'ail bonne femme",
   ' Knoblauchsuppe nach Hausfrauenart',
   ' 2 Stangen Porree ( Lauch ) 250 g enthäutete Tomaten 3-5 Knoblauchzehen 3 EBI . Speiseöl 2 große Kartoffeln 141 Fleischbrühe Salz Pfeffer einige runde , ausgestochene Toast- brotscheiben',
   ' 64',
   ' Speiseöl Parmesankäse',
   ' Den Porree putzen , waschen , in Ringe schneiden . Die Tomaten halbieren , die Stenge- lansätze herausschneiden , das Toma- tenfleisch in Würfel schneiden . Die Knoblauchzehen abziehen und zerdrücken . Das Öl erhitzen , das Gemüse mit den Knoblauchzehen darin andünsten . Die Kartoffeln schälen , waschen , in Scheiben schneiden , mit der Fleisch- brühe zu dem Gemüse geben , zum Kochen bringen , etwa 30 Minuten kochen lassen . Die Suppe mit Salz und Pfeffer abschmecken . Die Toastbrotscheiben mit dem Spei- seöl bestreichen , mit Parmesankäse bestreuen , in den auf 200-225 Grad ( Gas : Stufe 4-5 ) vorgeheizten Back- ofen schieben und 8-10 Minuten überbacken . Das Brot heiß zu der Suppe reichen .',
   ' Rehblatt']},
 {'title': ' Gigot de chevreuil',
  'blocks': [' Gigot de chevreuil',
   ' 800 g Rehblatt ( Schulter ) Salz , Pfeffer 50 g durchwachsener Speck 25 g Butter 2 Schalotten',
   ' zerdrückte Wacholderbeeren zerdrückte Knoblauchzehen 2-3 Thymianzweige 125 ml ( 1 ) Rotwein 250 ml ( 1 ) Wasser 10 g Butter 2 Teel . Weizenmehl 125 ml ( 1 ) Schlagsahne',
   ' Das Rehblatt unter fließendem kal- tem Wasser abspülen , trockentupfen , enthäuten und mit Salz und Pfeffer einreiben .',
   ' Den Speck in Streifen schneiden . Die Butter ( 25 g ) zerlassen , die Speckstreifen und das Fleisch darin anbraten .',
   ' Schalotten abziehen , vierteln , mit den Wacholderbeeren und den gewaschenen Thymianzweigen zu dem Fleisch geben . Den Rotwein und etwas von dem Wasser hinzugießen . Das Fleisch etwa 1 Stunde schmoren lassen , ab und zu wenden und mit dem Bratensatz begießen . Die ver- dampfte Flüssigkeit nach und nach durch Wasser ersetzen .',
   ' Fleisch von den Knochen Das gare lösen , in Portionsstücke schneiden , auf einer vorgewärmten Platte anrichten und warm stellen .',
   ' Den Bratensatz mit etwas Wasser los- kochen und durch ein Sieb gießen . Die Butter ( 10 g ) mit dem Weizen- mehl verrühren , zum Bratensatz geben , mit einem Schneebesen durchschlagen und aufkochen lassen . Die Sahne unterrühren . Die Sauce mit Salz und Pfeffer abschmecken .']},
 {'title': ' Croûtes aux champignons',
  'blocks': [' Croûtes aux champignons',
   ' Champignons in Pasteten',
   ' 500 g Champignons g Butter Salz Pfeffer Cayennepfeffer 125 ml ( 1 ) Wasser 2 gestrichene EBI . Speisestärke 3 EBI . Schlagsahne 2 EBI . gehackte Petersilie Zitronensaft 4 Blätterteigpasteten ( fertig gekauft )',
   ' Die Champignons putzen , waschen , vierteln .',
   ' Die Butter zerlassen , die Champi- gnons darin andünsten , mit Salz , Pfeffer und Cayennepfeffer würzen . Das Wasser hinzugießen , in etwa 10 Minuten gar dünsten lassen . Die Speisestärke mit 3 EBI . kaltem Wasser anrühren , die Pilze damit bin- den .',
   ' Sahne und Petersilie unterrühren . Die Champignons mit den Gewürzen und dem Zitronensaft abschmecken . Von den Pasteten Hülsen und Deckel auf ein Backblech legen und in den auf 200-225 Grad ( Gas : Stufe 4-5 ) vorgeheizten Backofen schieben und in etwa 5 Minuten erwärmen . Die Champignons in die Pasteten fül- len , die Deckel darauf setzen .',
   ' 65']}]

There is an issue with nested blocks: the title “Gigot de chevreuil” and its subtitle “Rehblatt”. Somehow “Rehblatt” ended up in recipe 0.

5.2 Generalization of DocLayout-YOLO + OCR

Again we check how the other layouts perform. This time we look at the dataframe.

filename = "IMG_2077"
page = read_json(filename)

det_res = model.predict(
    "data/raw/IMG_2077.jpg",
    imgsz=1024,
    conf=0.2,
    device="cuda:0",
    verbose=False
)
df_yolo = yolo_to_df(det_res[0])

df_yolo_font_size = add_text_and_font_size_to_layout_df(df_yolo.copy(), page)

df_titles_yolo = detect_titles(df_yolo_font_size[df_yolo_font_size.is_title].copy(), font_factor=1)
add_titles_to_df(df_yolo_font_size, df_titles_yolo)

col_df, df_yolo_font_size = detect_columns(df_yolo_font_size.copy())
sections_df, recipes = group_recipes(df_yolo_font_size[df_yolo_font_size.text != ""])
sections_df[["text","is_title","recipe_id"]]
text is_title recipe_id
0 Préchauffez le four à 220 ° C . Faites cuire ... False 0
1 Mélangez les fèves dans un saladier avec le f... False 0
2 Toastez les tranches de pain . Répartissez le... False 0
3 Salade de poulet , fèves , fenouil et concomb... True 0
4 L'estragon est utilisé en phytothérapie pour ... False 0
5 Pour 4 personnes Préparation : 10 min Cuisson... False 0
6 Pelez le concombre ( s'il n'est pas bio ) et ... False 0
7 Parfait pour le soir ! False 0
8 ✓ 1 cuil . à café de zestes de citron False 0
9 126 PLATS DETOX False 0
10 ✓2 cuil . à soupe d'estragon frais haché False 0
11 ✓ 8 tranches de pain de campagne aux graines ... False 0
12 ✓ 1 cuil . à soupe de vinaigre de vin rouge False 0
13 ✓ 2 gros blancs de poulet ( ou 4 petits ) False 0
14 ✓2 cuil . à café de jus de citron frais False 0
15 ✓ 150 g de fèves ( surgelées ) 1 petit concom... False 0
16 ✓ Huile d'olive ✓ Sel , poivre False 0
filename = "IMG_2074"
page = read_json(filename)

det_res = model.predict(
    "data/raw/IMG_2074.jpg",
    imgsz=1024,
    conf=0.2,
    device="cuda:0",
    verbose=False
)
df_yolo = yolo_to_df(det_res[0])

df_yolo_font_size = add_text_and_font_size_to_layout_df(df_yolo.copy(), page)

df_titles_yolo = detect_titles(df_yolo_font_size[df_yolo_font_size.is_title].copy(), font_factor=1)
add_titles_to_df(df_yolo_font_size, df_titles_yolo)

col_df, df_yolo_font_size = detect_columns(df_yolo_font_size.copy())
sections_df, recipes = group_recipes(df_yolo_font_size[df_yolo_font_size.text != ""])
sections_df[["text","is_title","recipe_id"]]
unassigned words:
text is_title recipe_id
0 Für 4 Personen Zweige Rosmarin 2-3 Salbeiblät... False 0
1 zweige verteilt unter der Schnur anbringen . ... False 0
2 RÔTI DE PORC POMMES CARAMÉLISÉES True 0
3 Bratzeit : 2 Stunden : 30 Minuten False 0
4 SCHWEINEBRATEN MIT KARAMELLISIERTEN ÄPFELN False 0
5 Rosmarinnadeln , die Salbeiblätter , die Knob... False 0
6 Ein würziger Schweinebraten aus der Normandie False -1
7 Getränk : Rustikaler Rotwein , zum Beispiel a... False 0

Here we run into the first issue: the triple subtitle leads to missing text in the final segmentation, shown as “-1” in the recipe_id column.

This is actually worse than the wrongly identified title in the pure OCR case.

6 Comparison of the pure OCR pipeline vs OCR + DocLayout-YOLO

We will compare the two approaches by wrapping them in functions. To make the comparison clearer, we’ll also use a different recipe this time.

def get_recipes_ocr_only_block(filename, level='block'):
    page = read_json(filename)
    df = ocr_json_to_df(page)
    df_block = df[df["level"] == level]

    titles = detect_titles(df_block.copy())
    add_titles_to_df(df_block, titles)

    cols, df = detect_columns(df_block.copy())
    df, recipes = group_recipes(df)
    return df, recipes


def get_recipes_yolo(filename):
    page = read_json(filename)

    det_res = model.predict(
        "data/raw/"+ filename +".jpg",
        imgsz=1024,
        conf=0.2,
        device="cuda:0",
        verbose=False
    )
    df_yolo = yolo_to_df(det_res[0])

    df_yolo_font_size = add_text_and_font_size_to_layout_df(df_yolo.copy(), page)

    df_titles_yolo = detect_titles(df_yolo_font_size[df_yolo_font_size.is_title].copy(), font_factor=1)
    add_titles_to_df(df_yolo_font_size, df_titles_yolo)

    col_df, df_yolo_font_size = detect_columns(df_yolo_font_size.copy())
    sections_df, recipes = group_recipes(df_yolo_font_size[df_yolo_font_size.text != ""])
    return sections_df, recipes

6.1 Runtime comparison

%%time
df_ocr, r_ocr = get_recipes_ocr_only_block("IMG_2073")
CPU times: user 22.5 ms, sys: 162 μs, total: 22.7 ms
Wall time: 22.1 ms
%%time

df_yolo, r_yolo = get_recipes_yolo("IMG_2073")
unassigned words:
CPU times: user 354 ms, sys: 31.9 ms, total: 386 ms
Wall time: 386 ms

The OCR-only approach is roughly 20x faster, as we do not need to touch the GPU. Without a GPU, the YOLO pipeline would take even longer.

6.2 Recipe output comparison

Let’s check whether we discovered the same recipes. For that, we realign the rows of the two approaches and attach the recipe ids of each to the OCR dataframe.

def attach_yolo_recipe_ids(df_ocr, df_yolo, pad=2, min_iou=0.1):
    df_ocr = df_ocr.copy()
    recipe_ids = []

    for _, block in df_ocr.iterrows():
        bx1, by1, bx2, by2 = block["x1"], block["y1"], block["x2"], block["y2"]

        assigned_id = -1
        best_iou = 0
        candidate_boxes = []

        for _, yrow in df_yolo.iterrows():
            yx1, yy1 = yrow["x1"] - pad, yrow["y1"] - pad
            yx2, yy2 = yrow["x2"] + pad, yrow["y2"] + pad

            # Check containment first
            if (bx1 >= yx1) and (by1 >= yy1) and (bx2 <= yx2) and (by2 <= yy2):
                candidate_boxes.append((yrow["recipe_id"], (yx2 - yx1) * (yy2 - yy1)))

            # IoU calculation
            inter_x1 = max(bx1, yx1)
            inter_y1 = max(by1, yy1)
            inter_x2 = min(bx2, yx2)
            inter_y2 = min(by2, yy2)
            inter_w = max(0, inter_x2 - inter_x1)
            inter_h = max(0, inter_y2 - inter_y1)
            inter_area = inter_w * inter_h

            block_area = (bx2 - bx1) * (by2 - by1)
            yolo_area = (yx2 - yx1) * (yy2 - yy1)
            union_area = block_area + yolo_area - inter_area

            iou = inter_area / union_area if union_area > 0 else 0

            if iou > best_iou:
                best_iou = iou
                assigned_id = yrow["recipe_id"]

        # Prefer containment rule
        if candidate_boxes:
            # pick smallest enclosing box (most specific title/region)
            assigned_id = min(candidate_boxes, key=lambda x: x[1])[0]
        elif best_iou < min_iou:
            # fallback centroid if IoU too small
            bcx, bcy = (bx1 + bx2) / 2, (by1 + by2) / 2
            for _, yrow in df_yolo.iterrows():
                if (yrow["x1"] <= bcx <= yrow["x2"]) and (yrow["y1"] <= bcy <= yrow["y2"]):
                    assigned_id = yrow["recipe_id"]
                    break

        recipe_ids.append(assigned_id)

    df_ocr["recipe_id_yolo"] = recipe_ids
    return df_ocr
combined = attach_yolo_recipe_ids(df_ocr, df_yolo); combined[["recipe_id","recipe_id_yolo"]]
recipe_id recipe_id_yolo
1 0 0
4 0 0
11 0 0
14 0 0
16 0 0
20 0 0
23 0 0
28 0 -1
32 0 -1
34 0 -1
39 0 -1
41 0 -1

Something has gone wrong.

Let’s look at the recipes:

r_ocr
[{'title': 'Merlu Koskera',
  'blocks': ['Merlu Koskera',
   'Pour 6 personnes : Temps de préparation : 30 minutes Temps de cuisson : 20 minutes',
   "Ingrédients : • 6 médaillons de merlu • 300 g d'asperges blanches en conserve • 500 g de petits pois en conserve • 1 poignée de palourdes • 1 poignée de moules ⚫ 3 œufs durs",
   "• 10 cl de vin blanc sec type Irouléguy • 1 cuillère à café de purée de piment d'Espelette • 4 gousses d'ail",
   '• Persil',
   "• 3 cuillères à soupe de farine • Sel de Guérande • Poudre de piment d'Espelette",
   "• Huile d'olive • 20 cl de fumet de poisson",
   "Faites un hachis d'ail et de persil . Réservez . Salez et farinez les médaillons de merlu . Réservez . Dans une cocotte , faites ouvrir les moules et les palourdes , conservez leur jus . Dans une sauteuse , faites revenir les médaillons de merlu dans l'huile d'olive mélangée à la purée",
   "de piment durant 2 minutes de chaque côté . Dans un plat en terre , déposez les médaillons de merlu . Réservez . Faites revenir pendant quelques minutes le hachis de persil et d'ail dans l'huile d'olive . Saupoudrez de farine , arrosez du jus des coquillages , du vin blanc et du fumet de poisson .",
   'Versez sur les médaillons .',
   'Ajoutez les moules , les palourdes , les asperges et les petits pois . Laissez mijoter à feu doux pendant 10 minutes . Ajoutez les œufs durs en quartier avant la fin de la cuisson . Saupoudrez de persil et de poudre de piment .',
   'Servez sans attendre .']}]
r_yolo
[{'title': ' Merlu Koskera',
  'blocks': [' Merlu Koskera',
   ' Pour 6 personnes : Temps de préparation : 30 minutes Temps de cuisson : 20 minutes',
   ' Ingrédients :',
   " 6 médaillons de merlu 300 g d'asperges blanches en conserve • 500 g de petits pois en conserve • 1 poignée de palourdes • 1 poignée de moules ⚫ 3 œufs durs • 10 cl de vin blanc sec type Irouléguy • 1 cuillère à café de purée de piment d'Espelette • 4 gousses d'ail • Persil • 3 cuillères à soupe de farine • Sel de Guérande • Poudre de piment d'Espelette • Huile d'olive • 20 cl de fumet de poisson"]}]

We are missing text in the DocLayout-YOLO recipe. When we look at the input dataframe, we see there are two columns.

df_yolo[["text","col_id","recipe_id"]]
text col_id recipe_id
0 Faites un hachis d'ail et de persil . Réserve... 0 -1
1 6 médaillons de merlu 300 g d'asperges blanch... 1 0
2 Pour 6 personnes : Temps de préparation : 30 ... 1 0
3 Merlu Koskera 1 0
4 Ingrédients : 1 0

But when we look at the image there is only one.

(Figure: the skewed page image)

A grain of salt: these shortcomings could also stem from my having tuned the pipeline towards the heuristic approach.

7 Quantitative analysis

To evaluate the performance, I defined some reference data with the recipe titles and the correct OCR-block-to-recipe mapping.

with open("data/ground-truth/ground-truth.json", "r", encoding="utf-8") as f:
    reference = json.load(f)

reference_map = {page["page_id"]: page for page in reference["data"]}
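
For reference, the ground-truth file is assumed to look roughly like this; the field names are inferred from how the code below uses them, and the values are illustrative:

example_ground_truth = {
    "data": [
        {
            "page_id": "20250922_135514",
            # one recipe id per OCR block, in the (col_id, y1) sort order
            # that evaluate_page applies below
            "reference_sections": [0, 0, 0, 0, 1, 1, 1, 2, 2],
            "recipes": [
                {"title": "Soupe à l'ail bonne femme"},
                {"title": "Gigot de chevreuil"},
                {"title": "Croûtes aux champignons"},
            ],
        }
    ]
}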
def evaluate_page(page_id, df_ocr, reference_map):
    ref = reference_map[page_id]

    df_ocr = df_ocr.sort_values(["col_id", "y1"]).reset_index(drop=True)
    df_ocr["recipe_id_ref"] = ref["reference_sections"]

    return df_ocr, ref
def compute_metrics(df):
    y_ref = df["recipe_id_ref"]
    y_ocr = df["recipe_id"]

    results = {}

    results["ocr_acc"] = accuracy_score(y_ref, y_ocr) if "recipe_id" in df else None

    return results
def title_accuracy_from_recipes(recipes, ref_titles, threshold=0.7):
    pred_titles = [r["title"].strip() for r in recipes if r.get("title")]
    print(pred_titles)
    ref_titles = [rt.strip() for rt in ref_titles]

    matches = []
    matched = 0

    for rt in ref_titles:
        scores = [(pt, SequenceMatcher(None, pt.lower(), rt.lower()).ratio()) for pt in pred_titles]
        if scores:
            best_pred, best_score = max(scores, key=lambda x: x[1])
            matches.append((rt, best_pred, best_score))
            if best_score >= threshold:
                matched += 1
        else:
            matches.append((rt, None, 0.0))

    accuracy = matched / len(ref_titles) if ref_titles else 0.0
    return accuracy, matches

all_metrics = []

for json_path in glob.glob("data/raw/*.jpg"):
    page_id = os.path.splitext(os.path.basename(json_path))[0]
    if page_id not in reference_map:
        continue

    df_ocr, recipes_ocr = get_recipes_ocr_only_block(page_id)

    df_eval, ref = evaluate_page(page_id, df_ocr,  reference_map)
    metrics = compute_metrics(df_eval)

    ref_titles = [r["title"] for r in ref["recipes"]]

    ocr_acc, ocr_matches = title_accuracy_from_recipes(recipes_ocr, ref_titles, threshold=0.7)

    metrics.update({
        # OCR metrics
        "ocr_title_accuracy": ocr_acc,
        "ocr_title_matches": ocr_matches,
        "ocr_num_pred_recipes": len(recipes_ocr),

        # Reference info
        "num_ref_recipes": len(ref_titles),
    })

    all_metrics.append((page_id, metrics))
['Merlu Koskera']
['Artichauts à la sauce vinaigrette Artischocken mit Vinaigrette ( Foto S. 63 )', 'Asperges ,, sauce mousseline❝ Spargel mit abgeschlagener Sauce', '2 EBI . steifgeschlagene Schlagsahne']
["Soupe à l'ail bonne femme Knoblauchsuppe nach Hausfrauenart", 'Gigot de chevreuil']
["Dinde pochée au lait d'amande , mange - tout et haricots", 'Info nutrition']
['Croquetas', 'Boudin noir sur Canapé']
['Lapin aux pruneaux']
["Soupe à l'ail bonne femme Knoblauchsuppe nach Hausfrauenart", 'Gigot de chevreuil', 'Croûtes aux champignons Champignons in Pasteten']
['Variantes', '92 PLATS FEEL GOOD']
['Salade de poulet , fèves , fenouil et concombre sur toasts']
['Ein würziger Schweinebraten aus der Normandie RÔTI DE PORC AUX POMMES CARAMÉLISÉES']
['Mulligatawny']
['Croûtes aux champignons Champignons in Pasteten']
df_metrics = pd.DataFrame(
    [{**{"page_id": pid}, **m} for pid, m in all_metrics]
)
df_metrics
page_id ocr_acc ocr_title_accuracy ocr_title_matches ocr_num_pred_recipes num_ref_recipes
0 IMG_2073 1.000000 1.000000 [(Merlu Koskera, Merlu Koskera, 1.0)] 1 1
1 20250922_135453 0.750000 0.000000 [(Artichauts à la sauce vinaigrette, Artichaut... 3 2
2 20250922_135507 1.000000 0.500000 [(Soupe à l'ail bonne femme, Soupe à l'ail bon... 2 2
3 IMG_2079 0.692308 1.000000 [(Dinde pochée au lait d'amande, mange - tout ... 2 1
4 20250922_213505 1.000000 1.000000 [(Croquetas, Croquetas, 1.0), (Boudin noir sur... 2 2
5 IMG_2078 1.000000 1.000000 [(Lapin aux pruneaux, Lapin aux pruneaux, 1.0)] 1 1
6 20250922_135514 0.916667 0.333333 [(Soupe à l'ail bonne femme, Soupe à l'ail bon... 3 3
7 IMG_2080 0.125000 0.000000 [(Gnocchi sans gluten à la patate douce et pes... 2 1
8 IMG_2077 1.000000 1.000000 [(Salade de poulet, fèves, fenouil et concombr... 1 1
9 IMG_2074 1.000000 0.000000 [(RÔTI DE PORC POMMES CARAMÉLISÉES, Ein würzig... 1 1
10 IMG_2076 1.000000 1.000000 [(Mulligatawny, Mulligatawny, 1.0)] 1 1
11 20250922_135510 1.000000 0.000000 [(Croûtes aux champignons, Croûtes aux champig... 1 1

In short, the algorithm is far from perfect.

Most errors result from incorrect titles. When the titles are wrong, too many recipes are created.

8 Outlook

A few possible improvements include:

  • stabilizing titles with DocLayout-YOLO
  • using book-dependent settings adjusted based on user feedback
  • allowing manual user overrides
  • training a classifier solely on OCR data, but with much larger datasets

We’ll see how this evolves once it’s integrated into the main app.
