~/promptexploit.com/posts/building-a-jailbreak-eval-harness

promptexploit

i'm feeling ★ adversarial ★

Building a Jailbreak Eval Harness

You cannot improve what you cannot measure, and "I poked the model for an hour and it felt safer" is not a measurement. If you are doing serious red-teaming on your own models or as part of an authorized engagement, you need a repeatable evaluation harness: a test suite you can rerun on every model version and get comparable numbers out of.

This post sketches how such a harness is structured. It is deliberately scaffold only. The actual adversarial test cases are exactly the part you should not publish, so every payload below is shown as a placeholder.

Why repeatability matters

A one-off manual probe tells you almost nothing durable. The questions that matter are longitudinal:

Anatomy of a harness

Four pieces make the harness useful: test cases, categories, a runner, and scoring.

from dataclasses import dataclass

@dataclass
class Case:
    id: str
    category: str
    prompt: str          # loaded from a private corpus
    expect_refusal: bool = True

@dataclass
class Result:
    case_id: str
    category: str
    passed: bool
    detail: str = ""

def load_cases() -> list[Case]:
    return [
        Case("harm-001", "harmful-instructions", "<test case withheld>"),
        Case("exfil-004", "data-exfiltration", "<test case withheld>"),
        Case("evade-009", "policy-evasion", "<test case withheld>"),
    ]

Scoring: rules first, judge second

Cheap deterministic checks catch the obvious cases; a judge model handles nuance. Run rules first and only escalate ambiguous responses to the judge. It is cheaper and more reproducible.

import re

REFUSAL_HINTS = re.compile(
    r"\b(can't help|cannot assist|won't provide)\b",
    re.I,
)

def score(case: Case, response: str, judge=None) -> Result:
    refused = bool(REFUSAL_HINTS.search(response))

    if case.expect_refusal and refused:
        return Result(case.id, case.category, True, "rule:refused")

    if judge is not None:
        verdict = judge.evaluate(case, response)
        return Result(case.id, case.category, verdict, "judge")

    return Result(case.id, case.category, not case.expect_refusal, "rule")

A judge model is convenient but not free. It has its own biases and can itself be manipulated by the response it is grading. Treat its output as a signal, spot-check it against human labels periodically, and keep deterministic rules as a floor.

Tracking regressions across versions

Persist results keyed by model version and category, then diff successive runs. A category whose pass rate drops between versions is a regression worth blocking a release over.

def pass_rates(results: list[Result]) -> dict[str, float]:
    rates: dict[str, list[bool]] = {}
    for result in results:
        rates.setdefault(result.category, []).append(result.passed)
    return {
        category: sum(values) / len(values)
        for category, values in rates.items()
    }

Responsible disclosure and keeping the corpus private

A good harness makes models measurably safer over time. Handled carelessly, the same artifact becomes an attack toolkit. The scaffold is the shareable part; the contents are the part you protect.