Can AI fix your shopping list?
1 Meet the data

Most shopping or recipe apps can make a list — but few can aggregate it correctly. Ask them to combine “1 l milk”, “1 l lait”, and “1 l Milch”, and they’ll happily buy you one liter of each.
Can multilingual AI models understand that milk, lait, and Milch are the same thing? We’ll explore a full pipeline — from regex parsing to multilingual embeddings, fuzzy matching, and translation — and see how far it gets.
(Spoiler: not as far as you’d hope.)
1.1 Simple test cases
Let’s have a look at some test cases to make things clearer:
test_cases = [
# === Milk family ===
{
"input": ["1l milk", "1l lait", "1l Milch"],
"expected": ["3l milk"]
},
{
"input": ["1l whole milk", "1l lait entier", "1l Vollmilch"],
"expected": ["3l milk"]
},
{
"input": ["1l hot milk", "1l warme Milch", "1l lait chaud"],
"expected": ["3l milk"]
},
# === Sugar ===
{
"input": ["1000g sugar", "1kg Zucker", "1kg sucre", "1 EL Zucker"],
"expected": ["3kg sugar"]
},
{
"input": ["1kg brown sugar", "1kg sucre blanc", "1kg weißer Zucker"],
"expected": ["1kg brown sugar", "1kg white sugar"]
},
# === Eggs ===
{
"input": ["2 eggs", "1 œuf", "1 Ei"],
"expected": ["4 eggs"]
},
# === Bread ===
{
"input": ["1 baguette", "1 bread", "1 Brot"],
"expected": ["3 bread"]
},
# === Water & wine distinction ===
{
"input": ["1 bottle of water", "1 bouteille de vin", "1 Flasche Wasser"],
"expected": ["2 bottles of water", "1 bottle of wine"]
},
# === Branded/variant milk ===
{
"input": ["1l Oatly milk", "1l lait Oatly", "1l Oatly Milch"],
"expected": ["3l Oatly Milk"]
},
# === Rice ===
{
"input": ["½kg rice", "500g Reis", "500g riz"],
"expected": ["1.5kg rice"]
},
# === Onion ===
{
"input": ["1 onion", "1 oignon", "1 Zwiebel"],
"expected": ["3 onions"]
},
# === Bell pepper ===
{
"input": ["1 bell pepper", "1 capsicum", "1 poivron"],
"expected": ["3 bell peppers"]
},
# === Red pepper flakes ===
{
"input": ["1 red pepper flakes", "1 chili flakes", "1 piment concassé"],
"expected": ["3 red pepper flakes"]
},
# === Butter ===
{
"input": ["1 tbsp butter", "1 cuillère de beurre", "1 EL Butter"],
"expected": ["3 tbsp butter"]
},
# === Salt ===
{
"input": ["1 tsp salt", "1 TL Salz", "1 cuillère à café de sel"],
"expected": ["3 tsp salt"]
},
# === Tomato cans ===
{
"input": ["1 can crushed tomatoes", "1 boîte de tomates concassées", "1 Dose Tomatenstücke"],
"expected": ["3 cans crushed tomatoes"]
},
# === Yogurt ===
{
"input": ["1 cup yogurt", "1 tasse de yaourt", "1 Becher Joghurt"],
"expected": ["3 cups yogurt"]
},
# === Oil ===
{
"input": ["1 tbsp olive oil", "1 EL Olivenöl", "1 cuillère à soupe d'huile d'olive"],
"expected": ["3 tbsp olive oil"]
},
# === Bread variants shouldn’t merge with pastry ===
{
"input": ["1 croissant", "1 pain", "1 bread"],
"expected": ["2 bread", "1 croissant"]
},
# === Cheese ===
{
"input": ["100g cheese", "100g fromage", "100g Käse"],
"expected": ["300g cheese"]
},
# === Flour ===
{
"input": ["1kg flour", "1000g Mehl", "1kg farine"],
"expected": ["3kg flour"]
},
]
1.2 Long phrases test cases
In addition, we will also examine what happens with longer phrases:
test_cases_long_sentences = [
# === Quantities with comments ===
{
"input": ["1 large yellow onion, coarsely chopped", "1 oignon jaune haché", "1 große gelbe Zwiebel, gehackt"],
"expected": ["3 onions"]
},
{
"input": ["2 cups mango chunks, (2 large mangoes) (fresh or frozen)"],
"expected": ["2 cups mango chunks"]
},
{
"input": ["½ cup butter (softened)", "100g beurre (ramolli)", "100g Butter (weich)"],
"expected": ["1.5 sticks butter"]
},
{
"input": ["1 tbsp minced cilantro, leaves and stems", "1 EL gehackter Koriander", "1 cuillère à soupe de coriandre hachée"],
"expected": ["3 tbsp cilantro"]
},
# === Multi-part phrases ===
{
"input": ["1 bell pepper, cut in pieces", "1 poivron coupé en morceaux", "1 Paprika in Stücke geschnitten"],
"expected": ["3 bell peppers"]
},
{
"input": ["2 cloves garlic, finely chopped", "2 gousses d’ail hachées", "2 Knoblauchzehen, fein gehackt"],
"expected": ["6 cloves garlic"]
},
{
"input": ["1 stalk bell peppers, cut in pieces", "1 Stiel Paprika, geschnitten", "1 tige de poivron, coupée"],
"expected": ["3 bell peppers"]
},
# === Fractional & descriptive ===
{
"input": ["1½ tsp garam masala", "1 cuillère à café de garam masala", "1 TL Garam Masala"],
"expected": ["3 tsp garam masala"]
},
{
"input": ["a pinch of salt", "une pincée de sel", "eine Prise Salz"],
"expected": ["3 pinches salt"]
},
{
"input": ["a handful of nuts", "une poignée de noix", "eine Handvoll Nüsse"],
"expected": ["3 handfuls nuts"]
},
# === Compounds / alternatives ===
{
"input": ["2 cups milk or cream", "2 tasses de lait ou de crème", "2 Tassen Milch oder Sahne"],
"expected": ["6 cups milk or cream"]
},
{
"input": ["1 tbsp olive oil, plus extra for frying", "1 EL Olivenöl, zusätzlich zum Braten", "1 cuillère à soupe d’huile d’olive, plus pour la cuisson"],
"expected": ["3 tbsp olive oil"]
},
{
"input": ["1 cup chopped tomatoes (canned)", "1 boîte de tomates concassées", "1 Dose gehackte Tomaten"],
"expected": ["3 cups chopped tomatoes"]
},
# === Units expressed as nouns ===
{
"input": ["1 bottle of water", "1 bouteille d’eau", "1 Flasche Wasser"],
"expected": ["3 bottles of water"]
},
{
"input": ["1 can coconut milk", "1 boîte de lait de coco", "1 Dose Kokosmilch"],
"expected": ["3 cans coconut milk"]
},
# === Descriptive words shouldn’t break grouping ===
{
"input": ["1 large potato", "1 grosse pomme de terre", "1 große Kartoffel"],
"expected": ["3 potatoes"]
},
{
"input": ["2 small onions", "2 petits oignons", "2 kleine Zwiebeln"],
"expected": ["6 onions"]
},
]
2 Simple Approach
For a human, aggregating the items on the list is fairly easy: which items are similar, and how many are there?
For a computer, this is not so easy.
I came up with a simple pipeline (sketched in code after this list):
- We start by separating numbers from pure text. This lets us identify quantities and units; we’ll use regex for this.
- As there will be multiple languages with similar words, we need to detect similar concepts. This will be done via multilingual embeddings.
- Once we have grouped items via embeddings, we aggregate the quantities. We need to normalize quantities to do this.
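Roughly, the pieces will compose as below. This is only a sketch: every function named here (parse_items_list, cluster_same_concept_transitive, aggregate_by_concept, format_aggregate, and the embedding model model_emb) is built step by step in the following sections.
def aggregate_shopping_list(lines):
    # 1. Regex: split each line into (quantity, unit, ingredient name).
    parsed = parse_items_list(lines)
    # 2. Embeddings: group names that denote the same concept.
    names = [name for _, _, name in parsed]
    clusters = cluster_same_concept_transitive(names, model_emb)
    # 3. Normalize units and sum the quantities per concept.
    return format_aggregate(aggregate_by_concept(parsed, clusters))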
2.1 Using Regex: What is the unit of 4 Apples?
While developing a solution, I discovered that embedding models have issues with numeric content. The reason is that we are looking at very short sequences: numeric content leads to identical tokens for different items. If the numeric tokens make up more than a certain share of the sequence, items become similar purely because of the numeric component.
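A quick probe of this effect, assuming the same paraphrase-multilingual-MiniLM-L12-v2 model that we will load in section 2.2; the exact numbers vary, but the shared numeric prefix should pull the first pair closer together than the bare names:
from sentence_transformers import SentenceTransformer, util
probe = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
with_qty = probe.encode(["100 g sugar", "100 g flour"], normalize_embeddings=True)
bare = probe.encode(["sugar", "flour"], normalize_embeddings=True)
# Similarity with the shared "100 g" prefix vs. the bare ingredient names.
print(util.cos_sim(with_qty[0], with_qty[1]).item())
print(util.cos_sim(bare[0], bare[1]).item())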
We’ll use a simple regex approach. Quantities appear at the front of the ingredient line most of the time.
Units are predefined, as only a limited set of units exists in each language. We will limit ourselves to English, French, and German here. Let’s build a unit map:
# === Multilingual normalization map ===
UNIT_MAP = {
# === volume ===
"l": "l", "lt": "l",
"liter": "l", "liters": "l", # EN
"litre": "l", "litres": "l", # FR
"literen": "l", "liter": "l", "literes": "l", "litern": "l", # DE plural inflections
"ml": "ml", "millilitre": "ml", "millilitres": "ml",
"milliliter": "ml", "milliliters": "ml",
# === weight ===
"g": "g", "gram": "g", "grams": "g",
"gramme": "g", "grammes": "g", # FR
"gramm": "g", "gramme": "g", "grammen": "g", # DE
"kg": "kg", "kilogram": "kg", "kilograms": "kg",
"kilogramme": "kg", "kilogrammes": "kg", # FR
"kilogramm": "kg", "kilogramme": "kg", "kilogrammen": "kg", # DE
# === spoons & cups ===
"cup": "cup", "cups": "cup",
"tasse": "cup", "tasses": "cup", # FR
"becher": "cup", "tasse": "cup", # DE
"tbsp": "tbsp", "tablespoon": "tbsp", "tablespoons": "tbsp",
"cuillère": "tbsp", "cuillerée": "tbsp", "cuillères": "tbsp", # FR
"esslöffel": "tbsp", "el": "tbsp", # DE
"tsp": "tsp", "teaspoon": "tsp", "teaspoons": "tsp",
"tl": "tsp", "teelöffel": "tsp", # DE
"cuillère à café": "tsp", "cc": "tsp", # FR
# === qualitative small measures ===
"pinch": "pinch", "pinches": "pinch",
"pincée": "pinch", "pincées": "pinch",
"prise": "pinch", "prisen": "pinch",
"dash": "dash", "dashes": "dash",
"goutte": "drop", "gouttes": "drop",
"tropfen": "drop", "tropf": "drop",
}
For simplicity we assume that volumes can be converted to grams regardless of food type. In most cases this yields an upper bound, since water has a higher density than most foods (a cup of flour weighs roughly 120 g, but we count it as 240 g). As a result, we would at worst buy too much food.
VOLUME_EQUIVALENTS = {"cup": 240, "tbsp": 15, "tsp": 5, "ml": 1, "l": 1000}
WEIGHT_EQUIVALENTS = {"g": 1, "kg": 1000}
def normalize_unit(raw_unit: str) -> str:
raw_unit = raw_unit.lower().strip()
return UNIT_MAP.get(raw_unit, raw_unit)
def convert_to_base(qty: float, unit: str):
if unit in VOLUME_EQUIVALENTS:
return qty * VOLUME_EQUIVALENTS[unit], "ml"
if unit in WEIGHT_EQUIVALENTS:
return qty * WEIGHT_EQUIVALENTS[unit], "g"
return qty, unit
Next comes the actual regex function. It searches for numbers and units; the remainder is assumed to be the ingredient. An edge case is unitless quantities like “4 apples”, which we catch by comparing against the units in UNIT_MAP.
import re
from rapidfuzz.distance import JaroWinkler
def parse_item(text: str):
text = text.strip()
# Match multiple number + unit combos, e.g. "1l 200g", "2 Tassen Zucker"
matches = re.findall(r'(\d+(?:[.,]\d+)?|\½|\¼|\¾)\s*([a-zA-ZÀ-ÿ]+)', text)
remainder = re.sub(r'^((\d+(?:[.,]\d+)?|\½|\¼|\¾)\s*[a-zA-ZÀ-ÿ]+\s*)+', '', text).strip().lower()
if not matches:
return [(1.0, None, remainder)]
result = []
for qty_str, unit_str in matches:
qty = (
0.5 if qty_str == "½"
else 0.25 if qty_str == "¼"
else 0.75 if qty_str == "¾"
else float(qty_str.replace(',', '.'))
)
raw = unit_str.lower().strip()
norm_unit = normalize_unit(raw)
if norm_unit in UNIT_MAP.values():
result.append((qty, norm_unit, remainder))
else:
result.append((qty, None, f"{unit_str.lower()} {remainder}".strip()))
return result
parse_item('4 apples')
[(4.0, None, 'apples')]
parse_item("100g chocolate")
[(100.0, 'g', 'chocolate')]
Let’s do it for one test case:
test_cases[3]["input"]['1000g sugar', '1kg Zucker', '1kg sucre', '1 EL Zucker']
def parse_items_list(texts):
results = []
for text in texts:
parsed = parse_item(text)
results.extend(parsed)
return results
converted = parse_items_list(test_cases[3]["input"]); converted
[(1000.0, 'g', 'sugar'),
(1.0, 'kg', 'zucker'),
(1.0, 'kg', 'sucre'),
(1.0, 'tbsp', 'zucker')]
Soo sweet, it works!
2.2 Embeddings for clustering: putting together what belongs together
The next phase is the identification of similar concepts. For this we use an embedding model.
The hardest part is finding an embedding model that works well with our data. Ingredients can be multilingual, and their length ranges from one short word to several words.
For single words, a plain dictionary would do. But since we also have longer phrases, we need a sentence model.
One such model is paraphrase-multilingual-MiniLM-L12-v2.
from sentence_transformers import SentenceTransformer, util
model_emb = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
Similar embeddings are commonly identified using cosine similarity.
def is_same_concept(a: str, b: str, model, threshold=0.8) -> bool:
emb_a = model.encode(a, normalize_embeddings=True)
emb_b = model.encode(b, normalize_embeddings=True)
sim = util.cos_sim(emb_a, emb_b).item()
return sim >= threshold
This is a pairwise comparison, so we need to transform our list into pairs.
from itertools import combinations
names = [t[2] for t in converted]
pairs = list({tuple(sorted(p)) for p in combinations(names, 2)})
pairs
[('sucre', 'sugar'),
('zucker', 'zucker'),
('sucre', 'zucker'),
('sugar', 'zucker')]
for a, b in pairs:
print(is_same_concept(a, b, model_emb))
True
True
True
True
Voilà.
But what we are actually interested in is the number of similar concepts and which item belongs to which concept.
To do this we use a similarity matrix. And then search for clusters in the matrix.
def cluster_same_concept(names, model, threshold=0.8):
# Encode all names
embeddings = model.encode(names, normalize_embeddings=True)
n = len(names)
sim_matrix = util.cos_sim(embeddings, embeddings).numpy()
cluster_id = [-1] * n
current_id = 0
for i in range(n):
if cluster_id[i] != -1:
continue # already assigned
cluster_id[i] = current_id
for j in range(i + 1, n):
if sim_matrix[i, j] >= threshold:
cluster_id[j] = current_id
current_id += 1
# map each name to its assigned cluster id
return {names[i]: cluster_id[i] for i in range(n)}
Let’s look at a longer list.
names = ["sugar", "zucker", "sucre", "milk", "Milch", "lait", "bread"]
for n, cid in cluster_same_concept(names, model_emb, threshold=0.8).items():
print(f"{n} → {cid:2d}")sugar → 0
zucker → 0
sucre → 0
milk → 1
Milch → 2
lait → 1
bread → 3
Not too bad. But Milch was identified as its own category instead of being added to milk.
The issue is that single words provide little context, which makes the cosine similarity of their embeddings unreliable.
There are four things we could do: first, lower the threshold and risk grouping unrelated concepts; second, use a better clustering algorithm; third, contextualize; and fourth, use fuzzy matching for short words.
2.2.1 Approach 1: lowering the threshold
Let’s start with approach 1.
for n, cid in cluster_same_concept(names, model_emb, threshold=0.65).items():
print(f"{n} → {cid:2d}")sugar → 0
zucker → 0
sucre → 0
milk → 1
Milch → 1
lait → 1
bread → 2
That one worked, but I had to lower the threshold to 0.65, which makes the algorithm brittle.
2.2.2 Approach 2: transitive clustering
Inside our similarity matrix we only do pairwise clustering against the first member of each cluster. If milk and Milch are only partly similar, Milch is not added to milk’s cluster, even when Milch and lait as well as lait and milk are similar enough.
What we need is transitive clustering. One way to do this is connected-component clustering.
def cluster_same_concept_transitive(names, model, threshold=0.8):
embeddings = model.encode(names, normalize_embeddings=True)
sim_matrix = util.cos_sim(embeddings, embeddings).numpy()
n = len(names)
adjacency = sim_matrix >= threshold
visited = [False] * n
clusters = [-1] * n
cluster_id = 0
def dfs(i, cid):
visited[i] = True
clusters[i] = cid
for j in range(n):
if adjacency[i, j] and not visited[j]:
dfs(j, cid)
for i in range(n):
if not visited[i]:
dfs(i, cluster_id)
cluster_id += 1
return {names[i]: clusters[i] for i in range(n)}
for n, cid in cluster_same_concept_transitive(names, model_emb, threshold=0.71).items():
print(f"{n} → {cid:2d}")
sugar → 0
zucker → 0
sucre → 0
milk → 1
Milch → 1
lait → 1
bread → 2
Now, the threshold only needs to be lowered to 0.71.
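As an aside, the hand-rolled DFS is equivalent to taking the connected components of the thresholded similarity graph, which scipy can compute directly. A sketch, assuming scipy is available:
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cluster_same_concept_scipy(names, model, threshold=0.8):
    embeddings = model.encode(names, normalize_embeddings=True)
    sim_matrix = util.cos_sim(embeddings, embeddings).numpy()
    # Connected components of the adjacency graph are exactly the
    # transitive clusters the DFS above finds.
    adjacency = csr_matrix(sim_matrix >= threshold)
    _, labels = connected_components(adjacency, directed=False)
    return dict(zip(names, labels.tolist()))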
2.2.3 Approach 3: contextualization
The third approach, contextualization, is rooted in the way the embeddings were created. Multilingual embeddings are most often trained on sentences, not on dictionary-style word pairs like the ones humans learn for translation tasks. By wrapping every item in the same static phrase, we push all expressions into the same corner of the embedding space, making the similarity less sensitive to milk vs. Milch.
def contextualize(name):
return f" this is a food ingredient called {name}"def cluster_same_concept_transitive_context(names, model, threshold=0.8):
embeddings = model.encode([contextualize(n) for n in names], normalize_embeddings=True)
sim_matrix = util.cos_sim(embeddings, embeddings).numpy()
n = len(names)
adjacency = sim_matrix >= threshold
visited = [False] * n
clusters = [-1] * n
cluster_id = 0
def dfs(i, cid):
visited[i] = True
clusters[i] = cid
for j in range(n):
if adjacency[i, j] and not visited[j]:
dfs(j, cid)
for i in range(n):
if not visited[i]:
dfs(i, cluster_id)
cluster_id += 1
return {names[i]: clusters[i] for i in range(n)}
for n, cid in cluster_same_concept_transitive_context(names, model_emb, threshold=0.74).items():
print(f"{n} → {cid:2d}")
sugar → 0
zucker → 0
sucre → 0
milk → 1
Milch → 2
lait → 2
bread → 0
for n, cid in cluster_same_concept_transitive_context(names, model_emb, threshold=0.73).items():
print(f"{n} → {cid:2d}")sugar → 0
zucker → 0
sucre → 0
milk → 0
Milch → 0
lait → 0
bread → 0
Oops. That did not work as expected; we went from bad to worse.
Second example: don’t spoil the milk
Let’s try another data point.
converted= parse_items_list(test_cases[2]["input"])
names = [t[2] for t in converted]
names
['hot milk', 'warme milch', 'lait chaud']
for n, cid in cluster_same_concept(names, model_emb, threshold=0.96).items():
print(f"{n} → {cid:2d}")hot milk → 0
warme milch → 0
lait chaud → 0
for n, cid in cluster_same_concept_transitive(names, model_emb, threshold=0.98).items():
print(f"{n} → {cid:2d}")hot milk → 0
warme milch → 0
lait chaud → 0
for n, cid in cluster_same_concept_transitive_context(names, model_emb, threshold=0.70).items():
print(f"{n} → {cid:2d}")hot milk → 0
warme milch → 0
lait chaud → 0
2.2.4 Approach 4: dealing with one word ingredients via fuzzy matching
With multi-word phrases, even very high thresholds worked. One-word ingredients are trickier: they may yield only a single token, and embeddings struggle to capture the meaning of a single word and are not robust against spelling variations.
Fuzzy matching might work better. We compare strings directly and count the edits needed to turn one into the other. Levenshtein distance and Jaro–Winkler similarity are two such metrics. Without diving too deep: on short words with a shared prefix (milk & Milch), Jaro–Winkler tends to give better results.
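To get a feel for the difference, here is a quick comparison using rapidfuzz; the exact values depend on the library version, but Jaro–Winkler rewards the shared prefix:
from rapidfuzz.distance import JaroWinkler, Levenshtein

# Jaro–Winkler rewards the shared "mil" prefix of milk/Milch ...
print(JaroWinkler.similarity("milk", "milch"))             # ≈ 0.85
# ... while plain normalized Levenshtein similarity stays lower.
print(Levenshtein.normalized_similarity("milk", "milch"))  # 0.6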
import numpy as np
def cluster_same_concept_transitive_winkler(names, model, threshold=0.8, jaro_thresh=0.8):
embeddings = model.encode(names, normalize_embeddings=True)
sim_matrix = util.cos_sim(embeddings, embeddings).numpy()
n = len(names)
adjacency = np.zeros((n, n), dtype=bool)
for i in range(n):
for j in range(n):
if i == j:
adjacency[i, j] = True
continue
# Embedding similarity
if sim_matrix[i, j] >= threshold:
adjacency[i, j] = True
continue
# Jaro–Winkler fallback
jw = JaroWinkler.similarity(names[i].lower(), names[j].lower())
if jw >= jaro_thresh:
adjacency[i, j] = True
visited = [False] * n
clusters = [-1] * n
cluster_id = 0
def dfs(i, cid):
visited[i] = True
clusters[i] = cid
for j in range(n):
if adjacency[i, j] and not visited[j]:
dfs(j, cid)
for i in range(n):
if not visited[i]:
dfs(i, cluster_id)
cluster_id += 1
return {names[i]: clusters[i] for i in range(n)}names = ["sugar", "zucker", "sucre", "milk", "Milch", "lait", "bread"]
for n, cid in cluster_same_concept_transitive_winkler(names, model_emb, threshold=0.90, jaro_thresh=0.84).items():
print(f"{n} → {cid:2d}")sugar → 0
zucker → 0
sucre → 0
milk → 1
Milch → 1
lait → 1
bread → 2
That helped quite a lot: the embedding threshold can stay at 0.90, whereas before we had to go down to 0.71.
2.3 Normalization of quantities and merging
Now we have quantities, units, and which items belong together. We need two helpers to combine all this: one to pick a label for each cluster, and one to aggregate across units. The second helper uses the convert_to_base function we defined earlier, which normalizes everything to grams or milliliters.
from collections import defaultdict
def get_cluster_labels(clusters):
"""Invert {name -> cid} into {cid -> representative name}."""
groups = defaultdict(list)
for name, cid in clusters.items():
groups[cid].append(name)
return {cid: names[0] for cid, names in groups.items()}
VOLUME_UNITS = {"l", "ml", "cup", "tbsp", "tsp"}
WEIGHT_UNITS = {"kg", "g"}
def unify_units_1to1(qty, unit):
"""Normalize all weight units to grams and volume units to milliliters."""
if isinstance(unit, str):
unit_str = unit.strip().lower()
else:
unit_str = getattr(unit, "name", None) or getattr(unit, "unit", None) or str(unit)
unit_str = unit_str.strip().lower()
unit_str = UNIT_MAP.get(unit_str, unit_str)
# Handle weights
if unit_str in WEIGHT_UNITS:
qty, base = convert_to_base(qty, unit_str)
return qty, "g"
# Handle volumes
elif unit_str in VOLUME_UNITS:
qty, base = convert_to_base(qty, unit_str)
return qty, "ml"
# Unknown or nonstandard unit
else:
return qty, unit_str or ""Now comes the final aggregation logic. I added the final version here, which has an if block as we will later reuse this function, with a different input.
def aggregate_by_concept(items, clusters):
totals = defaultdict(float)
labels = get_cluster_labels(clusters)
for item in items:
if isinstance(item, dict):
qty = item.get("quantity", 1.0)
unit = item.get("unit", "")
name = item.get("name", "").lower()
else:
qty, unit, name = item
name = name.lower()
cid = clusters.get(name)
q, base = unify_units_1to1(qty, unit)
totals[(cid, base)] += q
return [(round(q, 3), u, labels.get(cid, str(cid))) for (cid, u), q in totals.items()]
converted = parse_items_list(test_cases[3]["input"])
names = [t[2] for t in converted]
clusters = cluster_same_concept_transitive(names, model_emb)
merged = aggregate_by_concept(converted, clusters); merged
[(3015.0, 'g', 'sugar')]
That is the result we want. Let’s polish it a little.
def format_aggregate(aggregated):
formatted = []
for qty, unit, cid in aggregated:
if unit == "ml" and qty >= 1000:
qty /= 1000
unit = "l"
elif unit == "g" and qty >= 1000:
qty /= 1000
unit = "kg"
elif unit == "none":
unit = ""
formatted.append(f"{qty:g} {unit} {cid}")
return formatted
format_aggregate(merged)
['3.015 kg sugar']
2.4 Test case evaluation
Now comes the moment of truth. We define our test function.
def convert(ing_list, model):
converted= parse_items_list(ing_list)
names = [t[2] for t in converted]
clusters = cluster_same_concept_transitive_winkler(names, model, jaro_thresh=0.65, threshold=0.60)
merged = aggregate_by_concept(converted, clusters)
out = format_aggregate(merged)
return out
for case in test_cases:
print(case["expected"], convert(case["input"], model_emb))
['3l milk'] ['3kg milk']
['3l milk'] ['3kg whole milk']
['3l milk'] ['3kg hot milk']
['3kg sugar'] ['3.015kg sugar']
['1kg brown sugar', '2kg white sugar'] ['3kg brown sugar']
['4 eggs'] ['3 eggs', '1 ei']
['3 bread'] ['2 baguette', '1 brot']
['2 bottles of water', '1 bottle of wine'] ['3 bottle of water']
['3l Oatly Milk'] ['3kg oatly milk']
['1.5kg rice'] ['1.5kg rice']
['3 onions'] ['2 onion', '1 zwiebel']
['3 bell peppers'] ['1 bell pepper', '1 capsicum', '1 poivron']
['3 red pepper flakes'] ['2 red pepper flakes', '1 piment concassé']
['3 tbsp butter'] ['45g butter']
['3 tsp salt'] ['25g salt']
['3 cans crushed tomatoes'] ['3 can crushed tomatoes']
['3 cups yogurt'] ['720g yogurt']
['3 tbsp olive oil'] ['45g olive oil']
['2 bread', '1 croissant'] ['1 croissant', '1 pain', '1 bread']
['300g cheese'] ['200g cheese', '100g käse']
['3kg flour'] ['2kg flour', '1kg mehl']
Again, I had to lower the thresholds quite dramatically to get some sensible outputs.
As we can see, this process is far from perfect. Luckily for us, we are not the first computer scientists to go food shopping.
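To make “far from perfect” measurable, a small scoring helper can count exact matches. A sketch (stricter than a human grader, since it only accepts exact strings after case and whitespace normalization):
def exact_match_rate(cases, convert_fn, model):
    hits = 0
    for case in cases:
        got = sorted(s.replace(" ", "").lower() for s in convert_fn(case["input"], model))
        want = sorted(s.replace(" ", "").lower() for s in case["expected"])
        hits += got == want
    return hits / len(cases)

# Usage: exact_match_rate(test_cases, convert, model_emb)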
3 Using third party libraries
3.1 Intro
There are many other projects which deal with shopping lists and ingredients. Ingredient Parser sticks out as it already uses a data-driven approach trained on ingredient data.
I was thinking about building this myself, but then discovered the project. Assembling a dataset is a lot of work; kudos to the author. We will use this parser.
Another approach, which I discovered in Mealie, is the use of an LLM with a prompt. In my opinion, that becomes quite costly. For a self-hosted instance with a few recipes it might be fine; for a hosted version it will turn out to be expensive. Mealie also uses a confidence logic similar to the threshold approach we used with embeddings and Jaro–Winkler.
3.2 Need for translations
3.2.1 Architecture
If we are going to rely on the ingredient parser, there is one further issue: the model was trained on English words, so any non-English input leads to worse results. We therefore translate everything to English first.
Since everything is then in English, we need to back-translate the output to the user’s language. This has the advantage that input languages can be mixed freely.
We will have the following pipeline:
flowchart TD
A[Raw input] --> B[Translate to English]
B --> C[Ingredient Parser]
C --> D[Embedding Model]
D --> E[Aggregation]
E --> F[Translate Output Back for UI]
3.2.2 Finding a good model
We need a multilanguage model. My intention is to use client-side translation to reduce server load. One option is the MarianMT-based Helsinki-NLP models. They are small (about 300 MB) and fast. The downside is that they require knowing the source language, so we use langdetect to detect it. Side note: that is actually how the on-device recipe Android app started back in 2020.
from transformers import MarianMTModel, MarianTokenizer
from langdetect import detect
MODELS = {
'fr': 'Helsinki-NLP/opus-mt-fr-en',
'de': 'Helsinki-NLP/opus-mt-de-en'
}
cache = {}
def translate_to_english_helsinki_with_langdetect(text):
lang = detect(text)
if lang not in MODELS:
return text, lang
if lang not in cache:
name = MODELS[lang]
cache[lang] = (
MarianTokenizer.from_pretrained(name),
MarianMTModel.from_pretrained(name)
)
tok, mod = cache[lang]
inputs = tok(text, return_tensors="pt")
outputs = mod.generate(**inputs)
return tok.decode(outputs[0], skip_special_tokens=True), lang
translated_all = []
for i, case in enumerate(test_cases, start=1):
print(f"\n=== Test case {i} ===")
translated, lang = zip(*(translate_to_english_helsinki_with_langdetect(x) for x in case["input"]))
translated_all.append({"input":translated})
for orig, trans, la in zip(case["input"], translated, lang):
print(f"{orig:<30} → {trans}; lang: {la}")
=== Test case 1 ===
1l milk → 1l milk; lang: et
1l lait → 1L milk; lang: fr
1l Milch → 1l Milch; lang: it
=== Test case 2 ===
1l whole milk → 1l whole milk; lang: en
1l lait entier → 1l whole milk; lang: fr
1l Vollmilch → 1l Vollmilch; lang: it
=== Test case 3 ===
1l hot milk → 1l hot milk; lang: et
1l warme Milch → 1l warme Milch; lang: en
1l lait chaud → 1l hot milk; lang: fr
=== Test case 4 ===
1000g sugar → 1000g sugar; lang: tl
1kg Zucker → 1kg of sugar; lang: de
1kg sucre → 1kg sucre; lang: en
1 EL Zucker → 1 tbsp sugar; lang: de
=== Test case 5 ===
1kg brown sugar → 1kg brown sugar; lang: en
1kg sucre blanc → 1kg sucre blanc; lang: en
1kg weißer Zucker → 1 kg of white sugar; lang: de
=== Test case 6 ===
2 eggs → 2 eggs; lang: no
1 œuf → 1 egg; lang: fr
1 Ei → 1 egg; lang: de
=== Test case 7 ===
1 baguette → 1 baguette; lang: no
1 bread → 1 bread; lang: pt
1 Brot → 1 Brot; lang: en
=== Test case 8 ===
1 bottle of water → 1 bottle of water; lang: en
1 bouteille de vin → 1 bottle of wine; lang: fr
1 Flasche Wasser → 1 bottle of water; lang: de
=== Test case 9 ===
1l Oatly milk → 1l Oatly milk; lang: hu
1l lait Oatly → 1l Oatly milk; lang: fr
1l Oatly Milch → 1l Oatly Milch; lang: en
=== Test case 10 ===
½kg rice → ½kg rice; lang: en
500g Reis → 500g rice; lang: de
500g riz → 500g riz; lang: hr
=== Test case 11 ===
1 onion → 1 onion; lang: en
1 oignon → 1 oignon; lang: it
1 Zwiebel → 1 onion; lang: de
=== Test case 12 ===
1 bell pepper → 1 bell pepper; lang: sv
1 capsicum → 1 capsicum; lang: ro
1 poivron → 1 poivron; lang: sl
=== Test case 13 ===
1 red pepper flakes → 1 red pepper flakes; lang: no
1 chili flakes → 1 chili flakes; lang: sw
1 piment concassé → 1 piment concassé; lang: ca
=== Test case 14 ===
1 tbsp butter → 1 tbsp butter; lang: no
1 cuillère de beurre → 1 spoon of butter; lang: fr
1 EL Butter → 1 tbsp butter; lang: de
=== Test case 15 ===
1 tsp salt → 1 tsp salt; lang: fi
1 TL Salz → 1 TL salt; lang: de
1 cuillère à café de sel → 1 teaspoon of salt; lang: fr
=== Test case 16 ===
1 can crushed tomatoes → 1 can crushed tomatoes; lang: en
1 boîte de tomates concassées → 1 can of crushed tomatoes; lang: fr
1 Dose Tomatenstücke → 1 can of tomato pieces; lang: de
=== Test case 17 ===
1 cup yogurt → 1 cup yogurt; lang: ro
1 tasse de yaourt → 1 cup of yogurt; lang: fr
1 Becher Joghurt → 1 cup of yoghurt; lang: de
=== Test case 18 ===
1 tbsp olive oil → 1 tbsp olive oil; lang: fi
1 EL Olivenöl → 1 EL Olivenöl; lang: sv
1 cuillère à soupe d'huile d'olive → 1 tablespoon of olive oil; lang: fr
=== Test case 19 ===
1 croissant → 1 increasing; lang: fr
1 pain → 1 pain; lang: fi
1 bread → 1 bread; lang: es
=== Test case 20 ===
100g cheese → 100g cheese; lang: nl
100g fromage → 100g fromage; lang: da
100g Käse → 100g Käse; lang: sv
=== Test case 21 ===
1kg flour → 1kg flour; lang: da
1000g Mehl → 1000g flour; lang: de
1kg farine → 1kg farine; lang: no
Langdetect fails to detect the language quite frequently, so we cannot translate. We should definitely normalize the numeric parts by splitting them from the letters.
def normalize_qty(text):
# add space between digits and letters (1l → 1 l)
text = re.sub(r"(\d)([a-zA-Z])", r"\1 \2", text)
return text.strip()
translated_all = []
for i, case in enumerate(test_cases, start=1):
print(f"\n=== Test case {i} ===")
translated, lang = zip(*(translate_to_english_helsinki_with_langdetect(normalize_qty(x)) for x in case["input"]))
translated_all.append({"input":translated})
for orig, trans, la in zip(case["input"], translated, lang):
print(f"{orig:<30} → {trans}; lang: {la}")
=== Test case 1 ===
1l milk → 1 l milk; lang: et
1l lait → 1 l milk; lang: fr
1l Milch → 1 l Milch; lang: it
=== Test case 2 ===
1l whole milk → 1 l whole milk; lang: en
1l lait entier → 1 l whole milk; lang: fr
1l Vollmilch → 1 l Vollmilch; lang: it
=== Test case 3 ===
1l hot milk → 1 l hot milk; lang: et
1l warme Milch → 1 l warme Milch; lang: en
1l lait chaud → 1 l hot milk; lang: fr
=== Test case 4 ===
1000g sugar → 1000 g sugar; lang: tl
1kg Zucker → 1 kg sugar; lang: de
1kg sucre → 1 kg sucre; lang: en
1 EL Zucker → 1 tbsp sugar; lang: de
=== Test case 5 ===
1kg brown sugar → 1 kg brown sugar; lang: en
1kg sucre blanc → 1 kg sucre blanc; lang: en
1kg weißer Zucker → 1 kg of white sugar; lang: de
=== Test case 6 ===
2 eggs → 2 eggs; lang: no
1 œuf → 1 egg; lang: fr
1 Ei → 1 egg; lang: de
=== Test case 7 ===
1 baguette → 1 baguette; lang: no
1 bread → 1 bread; lang: es
1 Brot → 1 Brot; lang: en
=== Test case 8 ===
1 bottle of water → 1 bottle of water; lang: en
1 bouteille de vin → 1 bottle of wine; lang: fr
1 Flasche Wasser → 1 bottle of water; lang: de
=== Test case 9 ===
1l Oatly milk → 1 l Oatly milk; lang: hu
1l lait Oatly → 1 l Oatly milk; lang: fr
1l Oatly Milch → 1 l Oatly Milch; lang: en
=== Test case 10 ===
½kg rice → ½kg rice; lang: en
500g Reis → 500 g rice; lang: de
500g riz → 500 g riz; lang: hr
=== Test case 11 ===
1 onion → 1 onion; lang: en
1 oignon → 1 oignon; lang: it
1 Zwiebel → 1 onion; lang: de
=== Test case 12 ===
1 bell pepper → 1 bell pepper; lang: sv
1 capsicum → 1 capsicum; lang: ro
1 poivron → 1 poivron; lang: sl
=== Test case 13 ===
1 red pepper flakes → 1 red pepper flakes; lang: no
1 chili flakes → 1 chili flakes; lang: sw
1 piment concassé → 1 piment concassé; lang: ca
=== Test case 14 ===
1 tbsp butter → 1 tbsp butter; lang: en
1 cuillère de beurre → 1 spoon of butter; lang: fr
1 EL Butter → 1 EL Butter; lang: no
=== Test case 15 ===
1 tsp salt → 1 tsp salt; lang: lv
1 TL Salz → 1 TL salt; lang: de
1 cuillère à café de sel → 1 teaspoon of salt; lang: fr
=== Test case 16 ===
1 can crushed tomatoes → 1 can crushed tomatoes; lang: en
1 boîte de tomates concassées → 1 can of crushed tomatoes; lang: fr
1 Dose Tomatenstücke → 1 can of tomato pieces; lang: de
=== Test case 17 ===
1 cup yogurt → 1 cup yogurt; lang: ro
1 tasse de yaourt → 1 cup of yogurt; lang: fr
1 Becher Joghurt → 1 cup of yoghurt; lang: de
=== Test case 18 ===
1 tbsp olive oil → 1 tbsp olive oil; lang: fi
1 EL Olivenöl → 1 EL Olivenöl; lang: sv
1 cuillère à soupe d'huile d'olive → 1 tablespoon of olive oil; lang: fr
=== Test case 19 ===
1 croissant → 1 increasing; lang: fr
1 pain → 1 pain; lang: fi
1 bread → 1 bread; lang: pt
=== Test case 20 ===
100g cheese → 100 g cheese; lang: nl
100g fromage → 100 g fromage; lang: da
100g Käse → 100 g Käse; lang: sv
=== Test case 21 ===
1kg flour → 1 kg flour; lang: da
1000g Mehl → 1000 g flour; lang: de
1kg farine → 1 kg farine; lang: no
Not much better.
Then there is a Helsinki-NLP all-rounder model, which covers many languages at the expense of accuracy.
import functools
MODEL_NAME = "Helsinki-NLP/opus-mt-mul-en"
# Lazy-load for reuse
@functools.lru_cache(maxsize=1)
def get_multilingual_translator():
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)
return tokenizer, model
def translate_to_english_helsinki(text: str) -> str:
text = text.strip()
if not text:
return text
tok, mod = get_multilingual_translator()
batch = tok([text], return_tensors="pt", truncation=True)
gen = mod.generate(**batch, max_new_tokens=64)
result = tok.decode(gen[0], skip_special_tokens=True)
return result.strip()
translated_all = []
for i, case in enumerate(test_cases, start=1):
print(f"\n=== Test case {i} ===")
translated = [translate_to_english_helsinki(x) for x in case["input"]]
translated_all.append({"input":translated})
for orig, trans in zip(case["input"], translated):
print(f"{orig:<30} → {trans}")
=== Test case 1 ===
1l milk → 1l milk
1l lait → 1 l
1l Milch → 1l Milk
=== Test case 2 ===
1l whole milk → 1l whole milk
1l lait entier → 1l leaves all
1l Vollmilch → 1l Full milk
=== Test case 3 ===
1l hot milk → 1l hot milk
1l warme Milch → 1l hot milk
1l lait chaud → 1 L to the left
=== Test case 4 ===
1000g sugar → 1000g sugar
1kg Zucker → 1kg Sugar
1kg sucre → 1kg sugar
1 EL Zucker → 1 EL Sugar
=== Test case 5 ===
1kg brown sugar → 1kg brown sugar
1kg sucre blanc → 1kg white sugar
1kg weißer Zucker → 1kg white sugar
=== Test case 6 ===
2 eggs → 2 eggs
1 œuf → 1 egg
1 Ei → 1 Yes
=== Test case 7 ===
1 baguette → 1 baguette
1 bread → 1 board
1 Brot → 1 Brot
=== Test case 8 ===
1 bottle of water → 1 bottle of water
1 bouteille de vin → 1 bottle of wine
1 Flasche Wasser → 1 bottle of water
=== Test case 9 ===
1l Oatly milk → 1l Oatly milk
1l lait Oatly → 1l Leave Oatly
1l Oatly Milch → 1l Oatly Milk
=== Test case 10 ===
½kg rice → 1⁄2kg of rice
500g Reis → 500g Reis
500g riz → 500g rice
=== Test case 11 ===
1 onion → 1 onion
1 oignon → 1 firenon
1 Zwiebel → 1 Double
=== Test case 12 ===
1 bell pepper → 1 bell pepper
1 capsicum → 1 capsicum
1 poivron → 1 piovirone
=== Test case 13 ===
1 red pepper flakes → 1 red pepper flakes
1 chili flakes → 1 Chilean flakes
1 piment concassé → 1 Pim Concazed
=== Test case 14 ===
1 tbsp butter → 1 tbsp Butter
1 cuillère de beurre → 1 glass of beer
1 EL Butter → 1 EL Butter
=== Test case 15 ===
1 tsp salt → 1 tsp salt
1 TL Salz → 1 TL Salz
1 cuillère à café de sel → 1 silver coffee cooker
=== Test case 16 ===
1 can crushed tomatoes → 1 can cross-sectional tomatoes
1 boîte de tomates concassées → 1 box of sliced tomatoes
1 Dose Tomatenstücke → 1 Dose of Tomatoes
=== Test case 17 ===
1 cup yogurt → 1 cup yogurt
1 tasse de yaourt → 1 yourt rate
1 Becher Joghurt → 1 Becher Joghurt
=== Test case 18 ===
1 tbsp olive oil → 1 tbsp of olive oil
1 EL Olivenöl → 1 EU Olive oil
1 cuillère à soupe d'huile d'olive → 1 olive oil soup cooler
=== Test case 19 ===
1 croissant → 1 significant place
1 pain → 1 pen
1 bread → 1 board
=== Test case 20 ===
100g cheese → 100g cheese
100g fromage → 100g of cheese
100g Käse → 100g Käse
=== Test case 21 ===
1kg flour → 1kg flower
1000g Mehl → 1000g Mehl
1kg farine → 1kg far
Mixed results. Sometimes it is better, but then it completely fails.
There are bigger models to try, such as Facebook’s M2M100 or NLLB (No Language Left Behind). Besides being too big for a browser-based application, those models share a common flaw: they also require a source language. And even the smallest distilled NLLB model weighs about 2.4 GB.
However, there is one thing we can do. Facebook’s fastText library works on subword embeddings (character n-grams), which should make its language identification robust even for single words. We first need to download the model weights.
Sadly, the library relies on an old NumPy API. I saw no other option than to patch the file directly.
from os import path
if not path.exists("lid.176.bin"):
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
import os, fasttext, inspect
# Locate FastText.py in your environment
ft_path = os.path.join(os.path.dirname(inspect.getfile(fasttext)), "FastText.py")
# Read the file
with open(ft_path, "r", encoding="utf-8") as f:
content = f.read()
# Check if it's already patched
if "np.asarray(probs)" not in content:
patched = content.replace("np.array(probs, copy=False)", "np.asarray(probs)")
with open(ft_path, "w", encoding="utf-8") as f:
f.write(patched)
print("✅ Patched FastText.py — replaced np.array(..., copy=False) with np.asarray(...).")
else:
print("✅ Already patched — no action needed.")
import sys, gc, importlib
mods = [m for m in sys.modules if m.startswith("fasttext")]
for m in mods:
del sys.modules[m]
gc.collect()
import fasttext
importlib.reload(fasttext)
print("fasttext reloaded with patched FastText.py")✅ Already patched — no action needed.
fasttext reloaded with patched FastText.py
import fasttext
# Load pretrained language identification model
model = fasttext.load_model("lid.176.bin")
def detect_lang(text: str):
labels, probs = model.predict(text)
return labels[0].replace("__label__", ""), float(probs[0])
print(detect_lang("lait")) # ('fr', 0.99)
print(detect_lang("Milch")) # ('de', 0.98)
print(detect_lang("egg")) # ('en', 0.99)
print(detect_lang("Zucker")) # ('de', 0.97)('fr', 0.9928548336029053)
('de', 0.8860894441604614)
('en', 0.5928217768669128)
('de', 0.486892431974411)
That worked quite nicely. We can use it to replace langdetect.
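One detail worth noting: the confidences vary a lot (0.99 for lait, 0.49 for Zucker). Since fastText returns a probability, we could fall back to a default language below a cutoff. A hypothetical helper, not used in the code below:
def detect_lang_with_fallback(text, ft_model=model, default="en", min_prob=0.5):
    labels, probs = ft_model.predict(text)
    lang, prob = labels[0].replace("__label__", ""), float(probs[0])
    # Low-confidence guesses are often wrong for one-word inputs,
    # so fall back to a configurable default language.
    return (lang, prob) if prob >= min_prob else (default, prob)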
def translate_to_english_helsinki_with_fastdetect(text, fastdetect_model=model):
labels, probs = fastdetect_model.predict(text)
lang = labels[0].replace("__label__", "")
if lang not in MODELS or lang == 'en':
return text, lang
if lang not in cache:
name = MODELS[lang]
cache[lang] = (
MarianTokenizer.from_pretrained(name),
MarianMTModel.from_pretrained(name)
)
tok, mod = cache[lang]
inputs = tok(text, return_tensors="pt")
outputs = mod.generate(**inputs)
return tok.decode(outputs[0], skip_special_tokens=True), lang
translated_all = []
for i, case in enumerate(test_cases, start=1):
print(f"\n=== Test case {i} ===")
translated, lang = zip(*(translate_to_english_helsinki_with_fastdetect(normalize_qty(x)) for x in case["input"]))
translated_all.append({"input":translated})
for orig, trans, la in zip(case["input"], translated, lang):
print(f"{orig:<30} → {trans}; lang: {la}")
=== Test case 1 ===
1l milk → 1 l milk; lang: en
1l lait → 1 l milk; lang: fr
1l Milch → 1 l milk; lang: de
=== Test case 2 ===
1l whole milk → 1 l whole milk; lang: en
1l lait entier → 1 l whole milk; lang: fr
1l Vollmilch → 1 l whole milk; lang: de
=== Test case 3 ===
1l hot milk → 1 l hot milk; lang: en
1l warme Milch → 1 l warm milk; lang: de
1l lait chaud → 1 l hot milk; lang: fr
=== Test case 4 ===
1000g sugar → 1000 g sugar; lang: en
1kg Zucker → 1 kg sugar; lang: de
1kg sucre → 1 kg sugar; lang: fr
1 EL Zucker → 1 tbsp sugar; lang: de
=== Test case 5 ===
1kg brown sugar → 1 kg brown sugar; lang: en
1kg sucre blanc → 1 kg white sugar; lang: fr
1kg weißer Zucker → 1 kg of white sugar; lang: de
=== Test case 6 ===
2 eggs → 2 eggs; lang: en
1 œuf → 1 egg; lang: fr
1 Ei → 1 Ei; lang: pt
=== Test case 7 ===
1 baguette → 1 wand; lang: fr
1 bread → 1 bread; lang: en
1 Brot → 1 bread; lang: de
=== Test case 8 ===
1 bottle of water → 1 bottle of water; lang: en
1 bouteille de vin → 1 bottle of wine; lang: fr
1 Flasche Wasser → 1 bottle of water; lang: de
=== Test case 9 ===
1l Oatly milk → 1 l Oatly milk; lang: en
1l lait Oatly → 1 l Oatly milk; lang: fr
1l Oatly Milch → 1 l Oatly Milch; lang: en
=== Test case 10 ===
½kg rice → ½kg rice; lang: en
500g Reis → 500 g Reis; lang: en
500g riz → 500 g riz; lang: es
=== Test case 11 ===
1 onion → 1 onion; lang: pl
1 oignon → 1 oignon; lang: id
1 Zwiebel → 1 onion; lang: de
=== Test case 12 ===
1 bell pepper → 1 bell pepper; lang: en
1 capsicum → 1 capsicum; lang: la
1 poivron → 1 poivron; lang: pt
=== Test case 13 ===
1 red pepper flakes → 1 red pepper flakes; lang: en
1 chili flakes → 1 chili flakes; lang: en
1 piment concassé → 1 crushed chilli; lang: fr
=== Test case 14 ===
1 tbsp butter → 1 tbsp butter; lang: en
1 cuillère de beurre → 1 spoon of butter; lang: fr
1 EL Butter → 1 EL Butter; lang: en
=== Test case 15 ===
1 tsp salt → 1 tsp salt; lang: en
1 TL Salz → 1 TL salt; lang: de
1 cuillère à café de sel → 1 teaspoon of salt; lang: fr
=== Test case 16 ===
1 can crushed tomatoes → 1 can crushed tomatoes; lang: en
1 boîte de tomates concassées → 1 can of crushed tomatoes; lang: fr
1 Dose Tomatenstücke → 1 can of tomato pieces; lang: de
=== Test case 17 ===
1 cup yogurt → 1 cup yogurt; lang: en
1 tasse de yaourt → 1 cup of yogurt; lang: fr
1 Becher Joghurt → 1 Becher Joghurt; lang: en
=== Test case 18 ===
1 tbsp olive oil → 1 tbsp olive oil; lang: en
1 EL Olivenöl → 1 EL Olivenöl; lang: eo
1 cuillère à soupe d'huile d'olive → 1 tablespoon of olive oil; lang: fr
=== Test case 19 ===
1 croissant → 1 increasing; lang: fr
1 pain → 1 pain; lang: en
1 bread → 1 bread; lang: en
=== Test case 20 ===
100g cheese → 100 g cheese; lang: en
100g fromage → 100 g fromage; lang: en
100g Käse → 100 g cheese; lang: de
=== Test case 21 ===
1kg flour → 1 kg flour; lang: en
1000g Mehl → 1000 g flour; lang: de
1kg farine → 1 kg farine; lang: en
That is a lot better, but we still get misclassifications. If we know which languages to expect, we can map misdetected ones to the closest expected language.
def translate_to_english_helsinki_with_fastdetect_v2(text, fastdetect_model=model):
labels, probs = fastdetect_model.predict(text)
lang = labels[0].replace("__label__", "")
if lang in ['pt', 'es','id']:
lang = 'fr'
if lang not in MODELS or lang == 'en':
return text, lang
if lang not in cache:
name = MODELS[lang]
cache[lang] = (
MarianTokenizer.from_pretrained(name),
MarianMTModel.from_pretrained(name)
)
tok, mod = cache[lang]
inputs = tok(text, return_tensors="pt")
outputs = mod.generate(**inputs)
return tok.decode(outputs[0], skip_special_tokens=True), lang
translated_all = []
for i, case in enumerate(test_cases, start=1):
print(f"\n=== Test case {i} ===")
translated, lang = zip(*(translate_to_english_helsinki_with_fastdetect_v2(normalize_qty(x)) for x in case["input"]))
translated_all.append({"input":translated})
for orig, trans, la in zip(case["input"], translated, lang):
print(f"{orig:<30} → {trans}; lang: {la}")
=== Test case 1 ===
1l milk → 1 l milk; lang: en
1l lait → 1 l milk; lang: fr
1l Milch → 1 l milk; lang: de
=== Test case 2 ===
1l whole milk → 1 l whole milk; lang: en
1l lait entier → 1 l whole milk; lang: fr
1l Vollmilch → 1 l whole milk; lang: de
=== Test case 3 ===
1l hot milk → 1 l hot milk; lang: en
1l warme Milch → 1 l warm milk; lang: de
1l lait chaud → 1 l hot milk; lang: fr
=== Test case 4 ===
1000g sugar → 1000 g sugar; lang: en
1kg Zucker → 1 kg sugar; lang: de
1kg sucre → 1 kg sugar; lang: fr
1 EL Zucker → 1 tbsp sugar; lang: de
=== Test case 5 ===
1kg brown sugar → 1 kg brown sugar; lang: en
1kg sucre blanc → 1 kg white sugar; lang: fr
1kg weißer Zucker → 1 kg of white sugar; lang: de
=== Test case 6 ===
2 eggs → 2 eggs; lang: en
1 œuf → 1 egg; lang: fr
1 Ei → 1 Ei; lang: fr
=== Test case 7 ===
1 baguette → 1 wand; lang: fr
1 bread → 1 bread; lang: en
1 Brot → 1 bread; lang: de
=== Test case 8 ===
1 bottle of water → 1 bottle of water; lang: en
1 bouteille de vin → 1 bottle of wine; lang: fr
1 Flasche Wasser → 1 bottle of water; lang: de
=== Test case 9 ===
1l Oatly milk → 1 l Oatly milk; lang: en
1l lait Oatly → 1 l Oatly milk; lang: fr
1l Oatly Milch → 1 l Oatly Milch; lang: en
=== Test case 10 ===
½kg rice → ½kg rice; lang: en
500g Reis → 500 g Reis; lang: en
500g riz → 500 g rice; lang: fr
=== Test case 11 ===
1 onion → 1 onion; lang: pl
1 oignon → 1 onion; lang: fr
1 Zwiebel → 1 onion; lang: de
=== Test case 12 ===
1 bell pepper → 1 bell pepper; lang: en
1 capsicum → 1 capsicum; lang: la
1 poivron → 1 pepper; lang: fr
=== Test case 13 ===
1 red pepper flakes → 1 red pepper flakes; lang: en
1 chili flakes → 1 chili flakes; lang: en
1 piment concassé → 1 crushed chilli; lang: fr
=== Test case 14 ===
1 tbsp butter → 1 tbsp butter; lang: en
1 cuillère de beurre → 1 spoon of butter; lang: fr
1 EL Butter → 1 EL Butter; lang: en
=== Test case 15 ===
1 tsp salt → 1 tsp salt; lang: en
1 TL Salz → 1 TL salt; lang: de
1 cuillère à café de sel → 1 teaspoon of salt; lang: fr
=== Test case 16 ===
1 can crushed tomatoes → 1 can crushed tomatoes; lang: en
1 boîte de tomates concassées → 1 can of crushed tomatoes; lang: fr
1 Dose Tomatenstücke → 1 can of tomato pieces; lang: de
=== Test case 17 ===
1 cup yogurt → 1 cup yogurt; lang: en
1 tasse de yaourt → 1 cup of yogurt; lang: fr
1 Becher Joghurt → 1 Becher Joghurt; lang: en
=== Test case 18 ===
1 tbsp olive oil → 1 tbsp olive oil; lang: en
1 EL Olivenöl → 1 EL Olivenöl; lang: eo
1 cuillère à soupe d'huile d'olive → 1 tablespoon of olive oil; lang: fr
=== Test case 19 ===
1 croissant → 1 increasing; lang: fr
1 pain → 1 pain; lang: en
1 bread → 1 bread; lang: en
=== Test case 20 ===
100g cheese → 100 g cheese; lang: en
100g fromage → 100 g fromage; lang: en
100g Käse → 100 g cheese; lang: de
=== Test case 21 ===
1kg flour → 1 kg flour; lang: en
1000g Mehl → 1000 g flour; lang: de
1kg farine → 1 kg farine; lang: en
That looks quite good, though not 100% perfect: 1 Ei is still misdetected, baguette becomes wand, croissant becomes increasing, and 1 EL Olivenöl stays untranslated.
3.3 Ingredient parsing
Now that everything is more or less in English, we can use the ingredient parser.
from ingredient_parser import parse_ingredient
def parse_ingredient_line(text: str):
"""
Parse an English ingredient line into structured (quantity, unit, name, note)
using the `ingredient-parser` package.
"""
parsed = parse_ingredient(text)
# Extract quantity
if parsed.amount and parsed.amount[0].quantity !="":
# Supports ranges, fractions, etc.
qty = float(parsed.amount[0].quantity)
unit = parsed.amount[0].unit or ""
else:
qty, unit = 1.0, ""
# Extract main food name
name = parsed.name[0].text if parsed.name else ""
# Optional note (preparation, comment, etc.)
note = ""
if parsed.comment:
note = parsed.comment.text
elif parsed.preparation:
note = parsed.preparation.text
elif parsed.size:
note = parsed.size.text
return {
"quantity": qty,
"unit": unit,
"name": name,
"note": note
}translated_all[3]["input"]('1000 g sugar', '1 kg sugar', '1 kg sugar', '1 tbsp sugar')
converted = [parse_ingredient_line(item) for item in translated_all[3]["input"]]; converted
[{'quantity': 1000.0, 'unit': <Unit('gram')>, 'name': 'sugar', 'note': ''},
{'quantity': 1.0, 'unit': <Unit('kilogram')>, 'name': 'sugar', 'note': ''},
{'quantity': 1.0, 'unit': <Unit('kilogram')>, 'name': 'sugar', 'note': ''},
{'quantity': 1.0, 'unit': <Unit('tablespoon')>, 'name': 'sugar', 'note': ''}]
names = [t["name"] for t in converted]; names['sugar', 'sugar', 'sugar', 'sugar']
3.4 Clustering, aggregation & output
The next steps work the same way as before:
clusters = cluster_same_concept_transitive_winkler(names, model_emb); clusters
{'sugar': 0}
merged = aggregate_by_concept(converted, clusters); merged
[(3015.0, 'g', 'sugar')]
out = format_aggregate(merged); out
['3.015kg sugar']
3.5 Backtranslation
The user will have their own native language; therefore, we need to translate back from English into that language. This time we know the target language, so we do not need language detection.
from transformers import MarianMTModel, MarianTokenizer
# Add the reverse models
BACK_MODELS = {
'fr': 'Helsinki-NLP/opus-mt-en-fr',
'de': 'Helsinki-NLP/opus-mt-en-de'
}
back_cache = {}
def back_translate_from_english(text, target_lang):
"""
Translate English text back into the target language ('fr' or 'de')
using OPUS-MT. Falls back to English if no model is defined.
"""
if target_lang not in BACK_MODELS or target_lang == "en":
return text
if target_lang not in back_cache:
model_name = BACK_MODELS[target_lang]
back_cache[target_lang] = (
MarianTokenizer.from_pretrained(model_name),
MarianMTModel.from_pretrained(model_name)
)
tok, mod = back_cache[target_lang]
inputs = tok(text, return_tensors="pt", truncation=True)
outputs = mod.generate(**inputs)
return tok.decode(outputs[0], skip_special_tokens=True)
back_translate_from_english(out, 'de')
'3,015 kg Zucker'
The model even translated the decimal separator from ‘.’ to ‘,’. However, we now need four models for three languages. Maybe a bigger model is not too bad after all.
3.6 Evaluation
def convert_ingredient_parser(ing_list, model):
translated_all=[]
translated = [a for a, _ in (translate_to_english_helsinki_with_fastdetect_v2(normalize_qty(x)) for x in ing_list)]
translated_all.append({"input":translated})
for row in translated_all:
converted = [parse_ingredient_line(item) for item in row["input"]]
names = [t["name"] for t in converted]
clusters = cluster_same_concept_transitive_winkler(names, model)
merged = aggregate_by_concept(converted, clusters)
out = format_aggregate(merged)
return out
for case in test_cases:
print(case["expected"], convert_ingredient_parser(case["input"], model_emb))
['3l milk'] ['3 l milk']
['3l milk'] ['3 l whole milk']
['3l milk'] ['3 l hot milk']
['3kg sugar'] ['3 kg sugar', '15 ml sugar']
['1kg brown sugar', '1kg white sugar'] ['3 kg brown sugar']
['4 eggs'] ['3 eggs', '1 None']
['3 bread'] ['1 wand', '2 bread']
['2 bottles of water', '1 bottle of wine'] ['2 bottle water', '1 bottle wine']
['3l Oatly Milk'] ['3 l None']
['1.5kg rice'] ['1 kg rice', '500 g None']
['3 onions'] ['3 onion']
['3 bell peppers'] ['2 bell pepper', '1 capsicum']
['3 red pepper flakes'] ['1 red pepper flakes', '2 chili flakes']
['3 tbsp butter'] ['15 ml butter', '1 butter', '1 None']
['3 tsp salt'] ['10 ml salt', '1 None']
['3 cans crushed tomatoes'] ['3 can crushed tomatoes']
['3 cups yogurt'] ['480 ml yogurt', '1 None']
['3 tbsp olive oil'] ['30 ml olive oil', '1 None']
['2 bread', '1 croissant'] ['1 increasing', '1 pain', '1 bread']
['300g cheese'] ['300 g cheese']
['3kg flour'] ['3 kg flour']
for case in test_cases_long_sentences:
print(case["expected"],convert_ingredient_parser(case["input"], model_emb))['3 onions'] ['3 yellow onion']
['2 cups mango chunks'] ['480 ml mango chunks']
['1.5 sticks butter'] ['120 ml butter', '200 g butter']
['3 tbsp cilantro'] ['15 ml cilantro', '1 None', '15 ml coriander']
['3 bell peppers'] ['3 bell pepper']
['6 cloves garlic'] ['6 cloves garlic']
['3 bell peppers'] ['2 stalk bell peppers', '1 tige de poivron']
['3 tsp garam masala'] ['12.5 ml garam masala', '1 None']
['3 pinches salt'] ['3 salt']
['3 handfuls nuts'] ['3 nuts']
['6 cups milk or cream'] ['1.44 l milk']
['3 tbsp olive oil'] ['45 ml olive oil']
['3 cups chopped tomatoes'] ['240 ml tomatoes', '2 can tomatoes']
['3 bottles of water'] ['3 bottle water']
['3 cans coconut milk'] ['2 can coconut milk', '1 box coconut milk']
['3 potatoes'] ['3 potato']
['6 onions'] ['4 onions', '2 petits oignons']
4 Lessons learned
After all this, the verdict is still “not convinced.”
The translation models sometimes work impressively well. But for short, multilingual phrases like single ingredients (riz), the context is sometimes too thin to yield meaningful embeddings.
The experiments reveal a big flaw in this use of NLP: much of our daily communication is implicit. Humans excel at resolving short snippets of information within the current context. An LLM would need to be given the same context explicitly, exploding token cost and making the approach economically unrealistic.
For production use, hybrid approaches will win: a structured ontology combined with model-based matching and maybe some translation. However, that is exactly what I wanted to avoid, because the user then needs to input data. For a new user it is not evident that they could speed up the process by teaching the computer instead of doing the work themselves. And if users are to benefit from shared data, strict processes for privacy protection become necessary.
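To illustrate what such a hybrid could look like: a toy sketch in which the ontology is a hypothetical hand-curated synonym table, with the embedding model from section 2.2 as a fallback for unknown items.
CANONICAL = {  # hypothetical hand-curated ontology, not a real dataset
    "milk": {"milk", "lait", "milch", "vollmilch", "whole milk"},
    "sugar": {"sugar", "sucre", "zucker"},
}

def lookup_concept(name, model, threshold=0.75):
    n = name.lower()
    for concept, synonyms in CANONICAL.items():
        if n in synonyms:
            return concept  # exact ontology hit: cheap and reliable
    # Fallback: embedding similarity against the canonical labels.
    labels = list(CANONICAL)
    emb = model.encode([n] + labels, normalize_embeddings=True)
    sims = util.cos_sim(emb[0], emb[1:])[0]
    best = int(sims.argmax())
    return labels[best] if float(sims[best]) >= threshold else n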