
Evals Are Pre-Commit Hooks

Stop pushing broken prompts to production and start treating your LLM evaluation suite as a mandatory gate in your local development loop.


You change a single word in your system prompt—swapping "concise" for "professional"—and suddenly your JSON output parser starts screaming because the model decided to wrap every response in a conversational "Certainly! Here is your data:" preamble. We've all been there. You spend forty minutes playing whack-a-mole with edge cases, only to realize that fixing one prompt broke three others you weren't even looking at.

Prompt engineering is still largely "vibes-based" development. We tweak, we refresh the UI, we say "looks good," and we ship. But if you wouldn't dream of pushing a Python update without running your test suite, why are you pushing prompt changes with no safety net at all?

The "Vibes" Trap

In traditional software, we have pre-commit hooks. They catch your trailing whitespace, your missing semicolons, and your broken unit tests before the code ever touches a remote branch. In LLM development, our "code" is natural language, and the model's interpretation of it is notoriously non-deterministic. That makes an automated gate more important, not less.

If you treat your evaluation suite as a "sometime" task—something you run manually once a week—you’ve already lost. Evals shouldn't be a post-mortem; they need to be the gatekeeper.

Making Evals Local and Fast

An eval doesn't have to be a massive, expensive 1,000-row benchmark. For a pre-commit hook, you want a "Smoke Test" suite—the 5 to 10 most brittle edge cases that usually break your app.

Here is a dead-simple way to structure a local eval using pytest and a few basic checks. We aren't doing anything fancy here; we're just checking if our "concise" prompt actually stays concise.

import json

import pytest
from my_app.llm_chain import generate_response

# A small set of inputs that have historically caused "hallucinations" or formatting issues
TEST_CASES = [
    ("What is the status of order #123?", "Order #123 is currently processing."),
    ("Tell me a joke about robots.", "Why did the robot go on vacation? To recharge its batteries."),
    ("Export this user data: {id: 1}", '{"id": 1}')
]

@pytest.mark.parametrize("user_input, expected_subset", TEST_CASES)
def test_prompt_consistency(user_input, expected_subset):
    response = generate_response(user_input)
    
    # Check for the "Vibe": Is the response short?
    assert len(response.split()) < 50, "Response is way too wordy!"
    
    # Check for the "Content": Does it contain what we need?
    assert expected_subset.lower() in response.lower(), f"Response missed critical info: {expected_subset}"

    # Check for the "Format": Does it actually parse as JSON when requested?
    if "{" in expected_subset:
        try:
            json.loads(response)
        except json.JSONDecodeError:
            pytest.fail("Output is not valid JSON")

Wiring it into the Git Loop

Running pytest manually is better than nothing, but we're humans; we're lazy and we forget things. We need the machine to enforce the rules. If you're using pre-commit, you can add a local hook to your .pre-commit-config.yaml.

The trick here is to only run these tests when your prompt files or LLM logic change. You don't want to spend $2.00 in API credits every time you fix a typo in your README.

- repo: local
  hooks:
    - id: llm-evals
      name: Run LLM Smoke Tests
      entry: pytest tests/llm_smoke_tests.py
      language: system  # run in your own environment, where pytest and your app live
      files: ^prompts/|^src/llm_logic/  # Only run if prompts or LLM logic change
      pass_filenames: false
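
With the config in place, run pre-commit install once so the hook is actually wired into git. From then on, any commit that touches prompts/ or src/llm_logic/ has to survive the smoke tests before it lands; everything else skips them and stays fast.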

The "LLM-as-a-Judge" Gotcha

Sometimes simple string matching isn't enough. You might need a "Judge" LLM to grade your "Worker" LLM. This is where things get meta—and a bit slower.

I usually recommend avoiding LLM-as-a-judge in a strictly *local* pre-commit hook because it adds 5–10 seconds of latency. However, if your prompt is highly subjective (e.g., "Make sure the tone is empathetic"), a quick call to gpt-4o-mini to provide a "Pass/Fail" is worth the wait.

Pro-tip: Use a cheaper, faster model for your judge than you use for your production output. If mini thinks your GPT-4o output is garbage, it probably is.
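
If you do go that route, the judge can sit right next to your string-matching tests. Below is a minimal sketch, assuming the official OpenAI Python SDK (openai>=1.0) and an OPENAI_API_KEY in your environment; the judge_tone helper, the rubric wording, and the sample input are illustrative placeholders, not a prescribed setup.

# A hedged sketch of an LLM-as-a-judge check. Assumes the openai package and
# an OPENAI_API_KEY in your environment; the names below are illustrative.
from openai import OpenAI

from my_app.llm_chain import generate_response

client = OpenAI()


def judge_tone(response_text: str) -> bool:
    """Ask a cheap judge model for a strict PASS/FAIL verdict on tone."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # cheaper and faster than the production model
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "You are grading a customer support reply for empathetic tone. "
                "Answer with exactly PASS or FAIL, nothing else.\n\n"
                f"Reply to grade:\n{response_text}"
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("PASS")


def test_tone_is_empathetic():
    response = generate_response("My order arrived broken and I'm really upset.")
    assert judge_tone(response), "Judge model flagged the tone as non-empathetic"

Because this test hits the network, keep the judge cases down to a handful and gate them behind the same files: filter as the cheaper smoke tests.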

Why This Matters

When you treat evals as pre-commit hooks, the psychological shift is massive. You stop being afraid of the "System Prompt" file. You can experiment with different phrasing or new model versions (looking at you, o1-preview) with the confidence that you aren't silently breaking the user experience.

If the hook fails, the commit fails. You fix the prompt, you verify the output, and *then* you push. No more "Fixing prompt again" commit messages. No more "Oops, the LLM started apologizing" bug reports.

Stop shipping vibes. Start shipping verified outputs.