Duarte O.Carmo


From NutriBench to Taralli: How far can you take a prompt?


Taralli Screenshots

Benchmarking calorie prediction for Taralli

There's something very funny about the current Machine Learning and AI landscape. If you're in the field, you've probably heard about it. "Vibes", they call it. When someone wants to test something out, they conduct a "vibe test".

I call bullshit. How are you supposed to improve something if you can't even measure it? How do you know you didn't just make it worse?

In the past month, I've been working on Taralli. And I'd like to show you how to incrementally improve an LLM-based system.

But first, let's take a little side quest.

NutriBench: Evaluating LLMs on nutrition estimation

While browsing arXiv the other day, I stumbled upon NutriBench. This is exciting - I thought to myself - it looks like someone else is also looking into this problem. NutriBench is a research project launched by the University of California that aims to quantify exactly the problem Taralli tackles: "How good are Large Language Models at estimating the nutritional content of foods?"

The team created a dataset of ~12K meal descriptions from two data sources (WWEIA and FAO/WHO). Every row in the dataset is a meal description with the corresponding nutritional details (carbs, protein, fat, and energy). To answer the question, they tested 4 different prompting techniques and 12 different LLMs. The prompting techniques were the following:

If you are curious about the prompts, click here.

NutriBench results

You're probably wondering: okay, but how good are the models at this? The answer is a resounding MEH. The best performer is GPT-4o with a Chain-of-Thought prompt. It responds to ~99% of prompts (models can refuse to answer) and achieves an Acc@7.5g of 66.82%1. In other words, its carb predictions fall within ±7.5g of the actual value about two-thirds of the time.
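To make that number concrete, here's a small sketch (not NutriBench's actual code) of how accuracy at a ±7.5g carb tolerance can be computed:

def acc_at_tolerance(true_carbs: list[float], predicted_carbs: list[float], tol: float = 7.5) -> float:
    # Fraction of predictions landing within ±tol grams of the ground truth.
    hits = sum(abs(t - p) <= tol for t, p in zip(true_carbs, predicted_carbs))
    return hits / len(true_carbs)

acc_at_tolerance([45.0, 30.0, 60.0], [50.0, 29.0, 80.0])  # -> 0.67, two of three within ±7.5g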

What if we could use the NutriBench dataset to improve Taralli?

Improving Taralli's nutritional estimation

Here's the notebook with the entire process

The first thing, as always, is to create a good dataset. I created a new dataset of 107 examples of food descriptions and their corresponding total calories2. I used the second version of the NutriBench dataset on Hugging Face, and mixed it with some examples from a previous golden dataset.

Here are some examples from the dataset:

 [
  {
    "input": "For dinner, I'm having 25 grams of bread, 150 grams of chicken wings, and a 250-gram mixed vegetable salad.",
    "total_calories": 633.0,
    "food_groups": [
      "grain",
      "meat and alternatives",
      "vegetable"
    ],
    "source": "nutribench"
  },
  {
    "input": "I enjoyed 200 grams of tea with sugar along with 230 grams of coconut milk rice for breakfast.",
    "total_calories": 558.0,
    "food_groups": [
      "grain",
      "fruit"
    ],
    "source": "nutribench"
  },
  {
    "input": "stracciatella with confit grapes and 2 little focaccia pieces",
    "total_calories": 350.0,
    "food_groups": [
      "fruit",
      "dairy",
      "grain"
    ],
    "source": "golden_dataset"
  }
] 
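For reference, here's a rough sketch of how such a mix could be assembled with the Hugging Face datasets library. The dataset id and column names below are placeholders, not the exact ones I used - see the notebook for the real pipeline:

import json
import random

from datasets import load_dataset

# Placeholder dataset id and column names - check the notebook for the real ones.
nutribench = load_dataset("some-org/nutribench-v2", split="train")
sampled = nutribench.shuffle(seed=42).select(range(80))

examples = [
    {
        "input": row["meal_description"],
        "total_calories": row["energy_kcal"],
        "source": "nutribench",
    }
    for row in sampled
]

# Mix in hand-labelled examples from the previous golden dataset.
with open("golden_dataset.json") as f:
    examples += json.load(f)

random.shuffle(examples)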

Now I needed to design an evaluation metric. For NutriBench, they used accuracy of carb prediction at ±7.5 grams. In our case, I'm more interested in the accuracy of calorie predictions. One of the nice things about DSPy is that it forces you to write down your evaluation metric explicitly. Below is the evaluation metric I used. In short, we want the predicted calories to be within 10% of the ground-truth calories, which makes our metric Accuracy@10%.

def eval_metric(
    example, pred, trace=None, pred_name=None, pred_trace=None
) -> dspy.Prediction:
    # see notebook for more detailed metric (this is simplified)
    total_calories_example = example.total_calories
    total_calories_predicted = pred.total_calories
    within_threshold = abs(total_calories_example - total_calories_predicted) <= abs(
        0.1 * total_calories_example
    )

    if not within_threshold:
        score = 0
        feedback_text = f"INCORRECT: Your answer was not within 10% of the correct answer (yours: {total_calories_predicted}, correct: {total_calories_example})"
        return dspy.Prediction(score=score, feedback=feedback_text)

    feedback_text = f"CORRECT: Your answer was within 10% of the correct answer (yours: {total_calories_predicted}, correct: {total_calories_example})"
    score = 1
    return dspy.Prediction(score=score, feedback=feedback_text)
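With the metric in place, scoring a program over the dataset is a one-liner with DSPy's built-in evaluator. A sketch, where program and examples are placeholders for the classifier and the dataset rows from above:

import dspy

# Wrap each dataset row in a dspy.Example, marking `input` as the input field.
devset = [
    dspy.Example(input=row["input"], total_calories=row["total_calories"]).with_inputs("input")
    for row in examples
]

# dspy.Evaluate expects a plain score, so reduce the Prediction to its score field.
def accuracy_at_10(example, pred, trace=None):
    return eval_metric(example, pred).score

evaluate = dspy.Evaluate(devset=devset, metric=accuracy_at_10, num_threads=8, display_progress=True)
evaluate(program)  # prints the average Accuracy@10% for the given program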

Here are the different prompts I tested:

As for models, I chose some that were cheap and fast - some closed (Gemini) and others open (DeepSeek). I decided to test Gemini 2.5 Flash (what I'm currently using in production), DeepSeek v3.2 with thinking on and off, and Google's new Gemini 3 Flash.

Some thoughts on the results:

- The best performing model is Gemini 3 Flash with a set of 16 examples in the prompt. It achieves a score of around 60% - similar to NutriBench, although the problems are slightly different. Here's an example prediction.
- The GEPA optimization came up with a prompt of its own. When used with Gemini 2.5 Flash, that prompt performs respectably well. However, it fails to produce the correct response format with any other model. In other words, the prompt was overfit to the model it was trained on. As a result, you only see one GEPA score in the results graph.
- The few-shot approach is the most reliable one. It is model agnostic, performs well, and follows the style of the examples faithfully.
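As a rough illustration of the few-shot setup, here's how a 16-example prompt can be compiled in DSPy. I'm using the LabeledFewShot optimizer as an example; program and trainset are placeholders, and the exact setup (including the GEPA run) is in the notebook:

import dspy

# Model string assumed - swap in whichever LM you want to test.
dspy.configure(lm=dspy.LM("openrouter/google/gemini-3-flash-preview", temperature=0.0))

# LabeledFewShot simply attaches k labelled demos to the prompt - no bootstrapping involved.
optimizer = dspy.LabeledFewShot(k=16)
fewshot_program = optimizer.compile(program, trainset=trainset)

evaluate(fewshot_program)  # same Accuracy@10% harness as before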

So, I decided to update Taralli to use Gemini 3 Flash with the few-shot approach. This approach is ~15% more accurate than the old version, which was running on Gemini 2.5 Flash with the exact same optimizer. In the end, all I changed was a model string.

Here's a snippet taken from the API. Since I'm using OpenRouter, I can also define the second-best performing model as a backup, like so:

LM = dspy.LM(
    "openrouter/google/gemini-3-flash-preview:nitro",
    temperature=0.0,
    extra_body={
        "reasoning": {"enabled": False},
        "models": ["deepseek/deepseek-v3.2:nitro"],
    },
)

Running on the edge

Taralli experiment results

Taralli's new features showcased

I've updated Taralli to use Apple's new Liquid Glass design for iOS 26. That was pretty simple: it's a 5-file SwiftUI app weighing around 4.5 MB. You can now also set periodic reminders so that you don't forget to track.

One of the best new features is the ability to use Taralli completely offline, with an on-device LLM for nutritional analysis. The code snippet below transforms a DSPy-optimized program into OpenAI-compatible messages. I created an endpoint that takes a food_description and returns these messages for on-device inference. The iOS app calls this endpoint, receives the populated template, and uses it with the on-device model. As a backup, the template can be bundled directly in the app to skip the API call entirely. Long story short, Taralli now works on airplane (mode).

import typing as t
from functools import lru_cache

import dspy


@lru_cache(maxsize=1)
def get_classifier_template() -> t.List[dict[str, str]]:
    program = get_classifier()  # the DSPy-optimized classifier program
    adapter = dspy.ChatAdapter()
    openai_messages_format = {
        name: adapter.format(
            p.signature,  # type: ignore
            demos=p.demos,
            inputs={k: f"{{{k}}}" for k in p.signature.input_fields},  # type: ignore
        )
        for name, p in program.named_predictors()
    }["self"]
    # returns [{"role": "system", "content": "..."}, {"role": "user", "content": "... {food_description} ..."}]
    return openai_messages_format
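On the client side, the {food_description} placeholder in the returned messages just needs to be substituted before the template is handed to the on-device model. A minimal sketch in Python (the iOS app does the equivalent in Swift):

def fill_template(template: list[dict[str, str]], food_description: str) -> list[dict[str, str]]:
    # Replace the {food_description} placeholder left by get_classifier_template.
    return [
        {**message, "content": message["content"].replace("{food_description}", food_description)}
        for message in template
    ]

messages = fill_template(get_classifier_template(), "two eggs and a slice of toast")
# `messages` is now ready to send to any OpenAI-compatible or on-device chat model.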

Final summary

  1. For reference, this is on par with a human nutritionist with internet access
  2. I could have used more, but wanted to keep things fast

December 23, 2025