Automatically generating thumbnails from pictures can lead to bad crops or to small details getting lost. Saliency maps allow us to focus on the dominant object in a picture. I use multimodal aesthetic scoring models to evaluate the crops.
Author
Dominik Lindner
Published
September 5, 2025
1 Why we need thumbnails
The recipescanner allows scanning books and creating recipes with thumbnails. These thumbnails should look nice and provide a good first impression of the meal.
There are three categories of recipes:
1. The picture that belongs to the recipe is identified.
2. The recipe does not have a picture.
3. We have a picture and several recipes, but we don’t know which recipe the picture belongs to.
In this notebook we will examine case 1 and case 2. Case 3 is part of the page segmentation task, which I’ll cover in another notebook.
2 What makes a good thumbnail
We have a picture of a recipe and want to create a good thumbnail from it. Simply resizing the image often produces thumbnails that lack detail.
A better, straightforward solution is to center-crop the picture to the size of the thumbnail.
The rationale: plates are usually centered in recipe photos.
But what if that’s not the case?
Does this method produce aesthetically pleasing thumbnails, and is there a way to improve in case of non-centered subjects?
The short answer: yes.
See the following picture
Improvements in Thumbnail generation
Read on to discover how we do this.
3 Straightforward solution: center crop of pictures
# Standard library
import os
import tempfile
from glob import glob
from pathlib import Path
from os.path import expanduser
from urllib.request import urlretrieve
import warnings

# Third-party
import cv2 as cv
import matplotlib.pyplot as plt
import numpy as np
import open_clip
import pandas as pd
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from PIL import Image, ImageOps
from gradio_client import Client, file, handle_file
from torchvision import transforms
from torchvision.transforms import InterpolationMode
import tqdm

# Local modules
from aesthetic_predictor_v2_5 import convert_v2_5_from_siglip

# Jupyter magic
%matplotlib inline

# Suppress all warnings (optional)
warnings.filterwarnings("ignore")

DATA_DIR = Path("data/covers")
THUMB_SIZE = 512
NROW = 12

thumb_transform = transforms.Compose([
    transforms.Resize(THUMB_SIZE, interpolation=InterpolationMode.LANCZOS),
    transforms.CenterCrop(THUMB_SIZE),
    transforms.ToTensor()
])

to_pil = transforms.ToPILImage()
to_tensor = transforms.ToTensor()


def show_image_grid(images, nrow=NROW, figsize=(14, 12)):
    grid = torchvision.utils.make_grid(images, nrow=nrow)
    plt.figure(figsize=figsize)
    plt.imshow(grid.permute(1, 2, 0))
    plt.axis("off")
    plt.show()
Let’s first load our sample data and apply center-cropping.
files = sorted(glob(str(DATA_DIR / "*.JPG")))
images = [thumb_transform(Image.open(f).convert("RGB")) for f in files]
show_image_grid(images)
As we can see, many pictures look quite good. However, in some images the dish gets cut off. We could certainly do better.
For a human it’s obvious that we should center the plate in the thumbnail. For a computer that is challenging, as plates come in different shapes, and sometimes there are no plates at all. In our specific case, the images are also upside down.
We’ll see later that this is still solvable. Before we get there, though, let’s first define what makes a picture look good.
For that, we need a metric.
4 How to measure beauty
Wouldn’t it be great if we could define a metric that tells us how good a picture is? How beautiful it looks?
In fact, there is a way to do this. We can use aesthetic predictor models. Let’s look at two such models: the LAION aesthetic predictor and Aesthetic Predictor V2.5.
4.1 LAION
LAION is the older of the two models.
It is “a linear estimator on top of CLIP to predict the aesthetic quality of pictures.”
But how does it work?
Contrastive Language-Image Pretraining (CLIP) is a multimodal model introduced in 2021. It’s based on
A text encoder, usually GPT-like
A vision encoder, a vision transformer
Both encoders map their inputs into a shared embedding space with a dimensionality of 768 (for the ViT-L/14 variant). CLIP was trained on image-caption pairs, using cosine similarity to pull matching text and image embeddings as close together as possible.
This allowed the model to identify the best caption for a given image, or vice-versa.
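To make this concrete, here is a minimal sketch of CLIP-style caption matching with the open_clip package imported above. The model variant, the image path some_dish.jpg, and the candidate captions are illustrative assumptions, not part of the original pipeline.

import torch
import open_clip
from PIL import Image

# Load a CLIP model; ViT-L/14 produces 768-dimensional embeddings
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
clip_tokenizer = open_clip.get_tokenizer("ViT-L-14")
clip_model.eval()

image = clip_preprocess(Image.open("some_dish.jpg")).unsqueeze(0)       # placeholder image path
texts = clip_tokenizer(["a plated dish", "an empty wooden table"])      # example candidate captions

with torch.no_grad():
    img_emb = clip_model.encode_image(image)
    txt_emb = clip_model.encode_text(texts)
    # L2-normalise so the dot product equals the cosine similarity
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    similarity = img_emb @ txt_emb.T    # highest value = best-matching caption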
LAION builds on top of the CLIP approach, but scales the training data up to billions of images, compared to CLIP’s 400 million.
On top of this embedding model, a linear regression model is trained using a much smaller dataset. The model is defined by
\[\text{score} = W \cdot \vec{emb} + b\]
Where W and b are the weights and bias of the linear regression model.
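As a sketch, the scoring helper used further down (score_with_laion) could be implemented as exactly such a linear layer on top of the CLIP image embeddings from the previous snippet. The weight file name below is a placeholder; the real weights come from the LAION aesthetic predictor release.

import numpy as np
import torch
import torch.nn as nn

# Linear head on the 768-dimensional ViT-L/14 image embedding: score = W * emb + b
linear_head = nn.Linear(768, 1)
state = torch.load("laion_aesthetic_linear_head.pth", map_location="cpu")  # placeholder file name
linear_head.load_state_dict(state)
linear_head.eval()

@torch.no_grad()
def score_with_laion(tensors):
    """Score a list of (C, H, W) image tensors in [0, 1] and return a numpy array of scores."""
    scores = []
    for t in tensors:
        x = clip_preprocess(to_pil(t)).unsqueeze(0)     # CLIP preprocessing from the sketch above
        emb = clip_model.encode_image(x)
        emb = emb / emb.norm(dim=-1, keepdim=True)      # the predictor expects normalised embeddings
        scores.append(linear_head(emb.float()).item())
    return np.array(scores, dtype=np.float32)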
4.2 Aesthetic Predictor V2.5
In AI, four years is a long time.
In 2023 Google introduced SigLIP, Sigmoid Loss for Language–Image Pretraining.
The original CLIP model from OpenAI uses a contrastive loss function. Core to this function is a softmax over all image-text pairs in a batch. Since not all pairs can be included at once, the softmax is approximated using a very large batch size, which in turn requires expensive compute hardware.
Another limitation of the CLIP/LAION approach was its underperformance across diverse domains.
SigLIP addresses both problems:
First, it uses a sigmoid loss function, so smaller batch sizes can be used (see the sketch below).
Second, it is trained on more data, which makes it more robust across diverse domains.
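The difference in the loss is easiest to see in code. The following is an illustrative sketch of a SigLIP-style pairwise sigmoid loss (not the actual training code); t and b stand in for the learnable temperature and bias from the paper.

import torch
import torch.nn.functional as F

def siglip_style_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """img_emb, txt_emb: (N, D) L2-normalised embeddings for N matching image-text pairs."""
    logits = img_emb @ txt_emb.T * t + b            # (N, N) pairwise similarities
    labels = 2 * torch.eye(img_emb.size(0)) - 1     # +1 on the diagonal (true pairs), -1 elsewhere
    # Each pair is an independent binary decision, so no batch-wide softmax is needed
    return -F.logsigmoid(labels * logits).sum() / img_emb.size(0)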
So let’s check the aesthetic scores for our images.
4.3 Calculating scores with Aesthetic Predictor 2.5
We will start with the newer model. Unfortunately, my GPU is too old and is no longer supported by the PyTorch version required for this model.
We can use the Hugging Face API or the CPU, though.
4.3.1 Using the Hugging Face API for Aesthetic Predictor 2.5
%%time
client = Client("discus0434/aesthetic-predictor-v2-5")

def predict_with_ae25api(img_tensor, client):
    img = to_pil(img_tensor.cpu())
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as t:
        img.save(t.name, format="PNG")
    return client.predict(image=file(t.name), api_name="/inference")

scores_api = [predict_with_ae25api(img, client) for img in images]
print(scores_api)
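Alternatively, we can run the model locally on the CPU using the convert_v2_5_from_siglip helper imported in the setup cell. The following timed cell is a sketch based on the aesthetic-predictor-v2-5 README; exact signatures may differ between package versions.

%%time
# CPU inference sketch for Aesthetic Predictor V2.5 (call signatures assumed from the package README)
ap25_model, ap25_preprocessor = convert_v2_5_from_siglip(
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
ap25_model = ap25_model.float().eval()  # stay on the CPU in float32

def predict_with_ae25_cpu(img_tensor):
    img = to_pil(img_tensor.cpu())
    pixel_values = ap25_preprocessor(images=img, return_tensors="pt").pixel_values
    with torch.inference_mode():
        return ap25_model(pixel_values).logits.squeeze().float().item()

scores_ap25 = [predict_with_ae25_cpu(img) for img in images]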
CPU times: user 7min 47s, sys: 47.3 s, total: 8min 34s
Wall time: 1min 27s
np.array(scores_ap25).mean()
np.float32(5.3517203)
At about 1.5 minutes of wall time, this CPU-based approach is faster than calling the API.
One possible use case is iterative improvement of the score through an algorithmic approach. In such a scenario, we should aim to process all 60 images within just a few seconds.
LAION has lower hardware requirements, so we’ll try it next.
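A sketch of the LAION scoring step and of the comparison between the two scorers, using the score_with_laion helper sketched earlier and scipy for the correlation statistics:

from scipy.stats import pearsonr, spearmanr

# Score the center-cropped images with the LAION head and compare with the SigLIP-based scores
scores_laion = score_with_laion(images)
pearson_corr, _ = pearsonr(scores_laion, scores_ap25)
spearman_corr, _ = spearmanr(scores_laion, scores_ap25)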
print(f"Pearson correlation is {pearson_corr}, Spearman correlation is {spearman_corr}")
Pearson correlation is 0.268462598323822, Spearman correlation is 0.22606279522089476
Values close to 0.2 indicate that there is little correlation between the CLIP-based and the SIGLIP-based evaluations. The picture confirms the numerical values, too.
At first, this might seem surprising. Should a good image not always look good?
Not necessarily. Beauty lies in the eye of the beholder. And in fact, those two models are different. The difference in the loss functions is not the only factor that changed.
The two models were trained on very different datasets.
LAION is primarily focused on photography.
SigLIP covers a much broader range of web images.
As a result, images with bright, unrealistic colours may score higher with SigLIP than with LAION.
5 Optimizing thumbnails with global methods
Due to the significant speed advantage on my machine, I will focus on LAION. Let’s create a baseline.
Non-crop improvements modify the entire image:
- Correct orientation
- Color correction
- Glare and noise reduction
Crop-based improvements try to locate a plate in the image and crop around it:
- Use saliency maps to highlight the most important object in the image
- Use segmentation models to find all contours
- Use bounding box object detection models like YOLO to detect the dish
- Perform optimizations on (x, y, zoom) by scoring multiple crops and treating the search for the perfect crop as an optimization problem
5.1 Correct orientation
This is the most obvious one.
images_correct_orientation = [thumb_transform(ImageOps.exif_transpose(Image.open(f)).convert("RGB")) for f in files]
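We can compare the baseline and the re-oriented thumbnails with the LAION helper sketched earlier:

scores_laion_oriented = score_with_laion(images_correct_orientation)
print(scores_laion.mean(), scores_laion_oriented.mean())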
As we can see, correct orientation leads to a better score.
5.2 Color correction and denoising
I experimented with applying color correction and denoising on some samples. However, in most cases this actually lowers the score.
One possible explanation is that the model was trained on untreated sRGB pictures. By altering the images too much, we risk creating out-of-domain inputs.
Nevertheless, we will still apply a mild correction to reduce glare, but only after cropping.
6 Optimizing Thumbnails with cropping using saliency maps
Saliency detection attempts to identify which parts of an image are the most visually important.
We have two lightweight options for generating saliency maps:
OpenCV-based fine-grained saliency map
U²-Net
We will not use the OpenCV method.
The OpenCV algorithm compares each pixel’s colour variation with its neighbours. However, the approach is outdated and quite slow, making it unsuitable for our purpose.
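For reference, this is roughly how the OpenCV detector would be used (it lives in the opencv-contrib package; image_np is a placeholder for any RGB numpy image):

# Requires opencv-contrib-python; image_np is a placeholder input
saliency = cv.saliency.StaticSaliencyFineGrained_create()
success, sal_map = saliency.computeSaliency(image_np)   # sal_map: float saliency values in [0, 1]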
6.1 U²-Net saliency map
U²-Net is a deep learning model for salient object detection that uses a nested U-shaped architecture with Residual U-blocks (RSUs) for efficient multi-scale feature extraction. It delivers high-accuracy segmentation and is widely used for background removal.
We aim to identify the main dish as the foreground. Instead of removing the background, we’ll simply use the saliency map to crop the dish region.
Next, let’s load the model and check that it loaded correctly. If so, we should see the definitions of the RSU blocks. I copied the model definition and weights from https://github.com/xuebinqin/U-2-Net/.
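A sketch of the loading code and of the run_u2_on_pil helper used later. The file names (u2net.py, u2net.pth) follow the layout of the U-2-Net repository, and transposed_image_np is assumed to be the EXIF-corrected example page as an RGB numpy array.

from u2net import U2NET  # model definition copied from the U-2-Net repository

u2_model = U2NET(3, 1)
u2_model.load_state_dict(torch.load("u2net.pth", map_location="cpu"))
u2_model.eval()
print(u2_model)  # printing the module lists the nested RSU blocks

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

@torch.no_grad()
def run_u2_on_pil(model, image_np, size):
    """Run U^2-Net at a given square input resolution and return a [0, 1] saliency map at full image size."""
    H, W = image_np.shape[:2]
    x = cv.resize(image_np, (size, size), interpolation=cv.INTER_AREA)
    x = torch.from_numpy(x).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD
    d0 = model(x)[0]                                    # the first output is the fused prediction
    sal = d0.squeeze().cpu().numpy()
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)
    return cv.resize(sal, (W, H), interpolation=cv.INTER_LINEAR)

# Saliency map and connected components for one example page (stored as S_fused for the cells below)
transposed_image_np = np.array(ImageOps.exif_transpose(Image.open(files[0])).convert("RGB"))
H, W = transposed_image_np.shape[:2]
S_fused = run_u2_on_pil(u2_model, transposed_image_np, 640)
S8 = np.clip(S_fused * 255, 0, 255).astype(np.uint8)
_, mask = cv.threshold(S8, 0, 255, cv.THRESH_BINARY + cv.THRESH_OTSU)
num, labels, stats, _ = cv.connectedComponentsWithStats(mask, 8)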
img_area = H * W
best_i, best_area = None, 0
for i in range(1, num):
    x, y, w, h, area = stats[i]
    if area > best_area and area >= 0.01 * img_area:
        best_i, best_area = i, area
x, y, w, h, _ = stats[best_i]
cx, cy = x + w / 2.0, y + h / 2.0
out = transposed_image_np.copy()
cv.rectangle(out, (x, y), (x + w, y + h), (0, 255, 0), 10)
plt.axis("off")
plt.imshow(out)
The green bounding box shows very nicely how we detected the plate. Let’s turn this into a function that produces a resized square crop while respecting the image boundaries.
def crop_on_saliency_map(saliency_map, image):
    H, W = image.shape[:2]  # derive dimensions from the image instead of relying on globals
    S8 = np.clip(saliency_map * (255 if saliency_map.max() <= 1.0 else 1.0), 0, 255).astype(np.uint8)
    _, mask = cv.threshold(S8, 0, 255, cv.THRESH_BINARY + cv.THRESH_OTSU)
    mask = cv.morphologyEx(mask, cv.MORPH_CLOSE, np.ones((7, 7), np.uint8))
    mask = cv.morphologyEx(mask, cv.MORPH_OPEN, np.ones((5, 5), np.uint8))
    num, labels, stats, _ = cv.connectedComponentsWithStats(mask, 8)

    img_area = H * W
    best_i, best_area = None, 0
    for i in range(1, num):
        x, y, w, h, area = stats[i]
        if area > best_area and area >= 0.01 * img_area:
            best_i, best_area = i, area

    x, y, w, h, _ = stats[best_i]
    cx, cy = x + w / 2.0, y + h / 2.0

    # make image square and shift if we are too close to the border
    pad = 0.0
    w2, h2 = w * (1 + 2 * pad), h * (1 + 2 * pad)
    side = max(w2, h2)
    side_px = min(int(round(side)), W, H)
    half = side_px / 2.0
    cx = float(np.clip(cx, half, W - half))
    cy = float(np.clip(cy, half, H - half))
    x0 = int(round(cx - half))
    y0 = int(round(cy - half))

    # guard against rounding pushing us out of bounds
    x0 = max(0, min(x0, W - side_px))
    y0 = max(0, min(y0, H - side_px))
    x1 = x0 + side_px
    y1 = y0 + side_px

    # crop and resize
    crop = image[y0:y1, x0:x1]
    interp = cv.INTER_AREA if 512 < max(crop.shape[:2]) else cv.INTER_LINEAR
    crop = cv.resize(crop, (512, 512), interpolation=interp)
    return crop
result = crop_on_saliency_map(S_fused, transposed_image_np)
plt.axis("off")
plt.imshow(result)
This is the correct crop of the plate in the picture. Next, we score the cropped image.
This score is higher than our baseline score. That means correct cropping has a real effect.
6.3 Improving even more
On some images there are too many fine details, and the network detects the whole page as the salient object. In that case we need to run the network at a different input resolution. We could run several resolutions and decide which is best after scoring, but a quicker approach is to examine the size of the main component: if it is too large, we increase the resolution.
Therefore, we define a function that checks whether a single component covers too much of the page, where “too much” means 80%.
def _looks_like_full_page(sal, area_frac_thresh=0.80, require_border_touch=True):
    """Heuristic: is the biggest component huge and touching the image border?"""
    S8 = (np.clip(sal * 255, 0, 255)).astype(np.uint8)
    _, mask = cv.threshold(S8, 0, 255, cv.THRESH_BINARY + cv.THRESH_OTSU)
    num, labels, stats, _ = cv.connectedComponentsWithStats(mask, 8)
    if num <= 1:
        return False
    # largest component (skip background 0)
    idx = 1 + np.argmax(stats[1:, cv.CC_STAT_AREA])
    x, y, w, h, area = stats[idx]
    H, W = sal.shape[:2]
    area_frac = area / float(H * W)
    touches = (x == 0) or (y == 0) or (x + w == W) or (y + h == H)
    return (area_frac >= area_frac_thresh) and (touches if require_border_touch else True)
With this function in place we can run a small optimization loop. It iterates over predefined scales and stops as soon as the result no longer looks like the full page. We start with smaller resolutions and increase the size whenever the result looks like the full page, which is the typical failure mode at small input sizes.
def run_u2_autoscale(model, image_np, sizes=(320, 480, 640, 896), device="cuda"):
    last_sal = None
    for i, s in enumerate(sizes):
        sal = run_u2_on_pil(model, image_np, s)
        last_sal = sal
        if not _looks_like_full_page(sal):
            return sal
    # If even the largest still looks like a page, fall back to a small multi-scale fuse (max)
    sal_big = last_sal
    sal_small = run_u2_on_pil(model, image_np, sizes[0])
    return np.maximum(sal_big, sal_small)
6.4 The full pipeline
Now with everything in place we can define a function that creates the fused saliency map, the bounding box, and finally crops.
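A sketch of that function and of the batch run that produces cropped_images below, assuming the helpers defined in the earlier cells:

def make_thumbnail(model, image_np):
    sal = run_u2_autoscale(model, image_np)        # multi-scale saliency map
    return crop_on_saliency_map(sal, image_np)     # square 512x512 crop around the salient region

images_np = [np.array(ImageOps.exif_transpose(Image.open(f)).convert("RGB")) for f in files]
cropped_images = [make_thumbnail(u2_model, img) for img in images_np]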
The image is a perfect crop, even though it is not obvious why the score decreased. In terms of results, it is exactly what we want. The same is true for almost all other images, as we can see below.
show_image_grid([to_tensor(img) for img in cropped_images])
6.5 Postprocessing
The pictures were taken with a mobile phone camera, which is the quickest way to digitize a book without expensive equipment. There is some glare from the glossy paper and from imperfect lighting conditions. Let’s try to improve that.
def reduce_glare(img):
    # Convert RGB → LAB (good for luminance adjustments)
    lab = cv.cvtColor(img, cv.COLOR_RGB2LAB)
    l, a, b = cv.split(lab)
    # Apply CLAHE on the L-channel; clipLimit 2.0 and an 8x8 tile grid produce moderately aggressive results
    clahe = cv.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    cl = clahe.apply(l)
    # Merge and convert back
    limg = cv.merge((cl, a, b))
    final = cv.cvtColor(limg, cv.COLOR_LAB2RGB)
    return final


def detect_glare_mask(img, thresh=230):
    # Detect glare: very bright areas in the V channel
    hsv = cv.cvtColor(img, cv.COLOR_RGB2HSV)
    h, s, v = cv.split(hsv)
    mask = (v >= thresh).astype(np.uint8) * 255
    # Optional: clean up mask
    kernel = np.ones((5, 5), np.uint8)
    mask = cv.morphologyEx(mask, cv.MORPH_CLOSE, kernel)  # fill small holes
    mask = cv.morphologyEx(mask, cv.MORPH_OPEN, kernel)   # remove tiny specks
    return mask


def inpaint_glare(img, thresh):
    mask = detect_glare_mask(img, thresh=thresh)
    img_bgr = cv.cvtColor(img, cv.COLOR_RGB2BGR)
    inpainted_bgr = cv.inpaint(img_bgr, mask, inpaintRadius=5, flags=cv.INPAINT_TELEA)
    inpainted_rgb = cv.cvtColor(inpainted_bgr, cv.COLOR_BGR2RGB)
    return inpainted_rgb
crop_glare_reduced = [reduce_glare(img) for img in cropped_images]
crop_glare_reduced_and_inpainted = [inpaint_glare(img, 240) for img in crop_glare_reduced]
show_image_grid([to_tensor(img) for img in crop_glare_reduced])
show_image_grid([to_tensor(img) for img in crop_glare_reduced_and_inpainted])
scores_laion_cropped_fixed_glare = score_with_laion([to_tensor(img) for img in crop_glare_reduced])scores_laion_cropped_fixed_glare.mean()
np.float32(5.9497223)
scores_laion_cropped_fixed_glare_inpaint = score_with_laion([to_tensor(img) for img in crop_glare_reduced_and_inpainted])scores_laion_cropped_fixed_glare_inpaint.mean()
np.float32(5.816769)
Subjectively, the pictures look better. The average score, however, is a little lower with histogram equalization and drops further with mask-based glare inpainting. Let’s check a single image.
The strong glare is successfully removed without introducing artifacts or excessive contrast. As we can see, glare reduction can deliver improvements. Let’s combine the best of all approaches.
The best-scoring picture is the one without post-processing. Personally, I like the histogram equalization most. The inpainting introduces strong artifacts in the non-glare parts, and there is too much contrast in the regions that were not affected by glare. With more work this could certainly be improved.
Finally, my impression is that the LAION score is not good for our use case of food photography. The scores are too close together.
7 Other methods
7.1 Segmentation
When I brainstormed ideas, I considered using segmentation models. One of the most advanced segmentation models is [Segment Anything](https://segment-anything.com/demo).
For a problematic image, the segmentation model gives the following result:
However, identifying the best crop would require significant post-processing. Assuming we are always looking for dishes, which is not necessarily the case, we could look for smooth, large shapes.
7.2 Object detection
Bounding box object detection algorithms can, in theory, locate the plates quite well. The main drawback is that I would need to train such a detector myself, which requires a lot of labeled data.
While there are backbones such as YOLO, we would still need several hundred labeled images.
This could be a viable refinement once a significant number of images has been processed.
7.3 Direct optimization
Another possible approach is a brute-force optimization method. We would run an optimization algorithm that uses LAION to score the images. Based on the gradients we would vary the crop zone.
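To make the idea concrete, here is an illustrative sketch of a coarse, score-driven grid search over (x, y, zoom), using the LAION helper sketched earlier; a gradient-based variant would additionally need a differentiable scoring path. This is not the pipeline used in this post.

def best_crop_by_score(image_np, zooms=(0.6, 0.8, 1.0), steps=4):
    """Brute-force search: try a grid of square crops and keep the one with the highest LAION score."""
    H, W = image_np.shape[:2]
    best_score, best_crop = -np.inf, None
    for z in zooms:
        side = int(min(H, W) * z)
        for y0 in np.linspace(0, H - side, steps).astype(int):
            for x0 in np.linspace(0, W - side, steps).astype(int):
                crop = cv.resize(image_np[y0:y0 + side, x0:x0 + side], (512, 512))
                score = score_with_laion([to_tensor(crop)])[0]
                if score > best_score:
                    best_score, best_crop = score, crop
    return best_crop, best_score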
However, from my experiments with saliency-based images, the aesthetic score is somewhat subjective and not always intuitive. To make this approach effective, the scoring function would likely need to be reworked. While there is certainly room for experimentation here, this method would require more research and fine-tuning.
8 Summary
What did we learn?
We found that it is possible to generate better thumbnails using slightly more intelligent techniques. I used a multi-scale saliency algorithm to identify the dominant object in each image. This led to an average score increase of 1.3%.
Additionally, glare reduction makes the pictures subjectively nicer, but it actually leads to a lower mean score.
This raises an interesting question: how should we score good-looking thumbnails? The LAION classifier can help slightly improve images, but in some cases, it actually prefers images with more glare.
What if we have no image at all? Then the only option is image generation.
However, this is computationally far more costly than the previous processing, and it does not seem to work that well.
Real photo
With this prompt: A bright, photorealistic cookbook-style photo of a freshly cooked Thai-style chicken stir-fry with cashews, beautifully plated on a white ceramic dish. The dish features tender, thinly sliced chicken thighs coated in a glossy, rich sauce made from oyster sauce, soy sauce, and fish sauce. Golden-brown roasted cashews scattered evenly, thin wedges of onion, vibrant green onion pieces, and delicate slices of red cayenne pepper for a pop of color. Served alongside a small bowl of perfectly steamed jasmine rice. The composition is clean and minimal, shot on a light wooden kitchen table with natural daylight. Soft, even lighting with gentle shadows, crisp textures, and realistic color tones. High-end food photography, cookbook aesthetic, ultra-HD.
The prompt leads to these quite unrealistic pictures from ChatGPT and Flux Schnell.
ChatGPT
Flux Schnell
Personally, I find those less appealing than real photos, even though Flux Schnell comes at a much lower price tag.
Slightly better is Stable Diffusion.
Stable Diffusion
Until the recipe has actually been cooked, this is the only way to get a picture.
It would be interesting to see how the saliency-map method works on real pictures of the cooked food.