# How much is the harness worth? The year the number got measured

URL: https://www.thedeepfeed.ai/posts/2026-06-22-how-much-is-the-harness-worth-measuring-agent-scaffolds/
Category: Agents
Published: 2026-06-22
Author: the-deep-feed
Tags: harness, agents, coding-agents, benchmarks, swe-bench, terminal-bench, evaluation
Kind: deep

> Six weeks after agent harness engineering got named, the empirical question arrived: how much of a coding agent's score is the model, and how much is the scaffold around it. Four 2026 papers put numbers on it — and the numbers are larger than almost anyone guessed.

## TL;DR

- Agent harness engineering got *named* in spring 2026. The question that followed was empirical: of a coding agent's benchmark score, how much is the model and how much is the scaffold around it. By June, four papers had measured it.
- A controlled factorial study (Zhang et al.) found **harness-induced variance exceeded model-induced variance by 7.8x** on a SWE-bench subset, and that swapping the scaffold **reversed six of nine model rankings**. Model-paper advances usually move scores 2-4 points; harness swaps moved them 6-54.
- The same model under two scaffolds swung **34 points** on SWE-bench Verified Mini and up to **17 points** on Terminal-Bench. On Claw-SWE-Bench, one model went from **19.1% to 73.4%** — a 54-point gap — by changing nothing but the adapter.
- An auto-evolution paper (AHE) showed a fixed model climb **69.7% to 77.0%** on Terminal-Bench 2 by editing only the harness, beating the human-tuned Codex-CLI scaffold — and the gains transferred to other models.
- The contrarian finding sits inside the same data: Terminal-Bench's authors argue **model choice usually matters more than scaffold**, and UTBoost shows up to **41% of SWE-bench Lite** entries had broken tests. The number is real, large, and not yet universal. That tension is the story.

In April 2026 a discipline got a name. *Agent harness engineering* (the prompts, tools, memory, file conventions, verification loops, and orchestration wrapped around a language model) went from folklore to a term people put in job titles. We [covered the naming itself](/posts/2026-05-09-agent-harness-engineering-the-discipline/): two near-simultaneous coiners, seven convergent voices, and twelve coding agents that had independently built the same primitives. The piece ended on the line everyone in the field had started repeating in some form: the model is one input, and everything else is engineering.

That is a satisfying slogan. It is also, as stated, untestable. *Everything else is engineering* tells you the harness matters. It does not tell you how much. And "how much" is the only version of the question that changes what you do on Monday: whether you spend the next quarter waiting for a better model or rewriting your `AGENTS.md`, whether the leaderboard you just screenshotted means anything, whether the agent your company bought is good or merely well-wrapped.

For two years that number did not exist. You could feel the harness mattered the way you can feel a room is cold, but nobody had put a thermometer in it. Between April and June 2026, four research groups did. They disagree on the decimal places and on whether the effect is universal. They do not disagree on the order of magnitude, and the order of magnitude is the headline: in the regime where frontier models are roughly comparable, the scaffold around the model is frequently a *larger* determinant of measured performance than the model itself. This is the story of how that got measured, what the numbers actually say, and why the most careful paper in the set spends a full section telling you not to over-read it.

# The question nobody could answer

Start with why the number was missing for so long, because the reason is structural, not lazy.

A coding agent is not a model. It is a model placed inside a loop: read the task, decide on an action, run a tool, read the result, decide again, repeat until done or out of budget. The model supplies one thing: the next token given everything in context. Everything else in that sentence (what goes into context, which tools exist, how their output is formatted and truncated, when the loop stops, how failures are retried, whether a verifier checks the work) is the harness. When you run a benchmark, you do not measure the model. You measure the model *as driven by* a particular harness against a particular task in a particular environment. The score is a property of the whole system.
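A minimal sketch of that loop makes the division of labor concrete. Everything here is hypothetical: `build_context`, `run_agent`, the toy model and tool, and the `DONE` stop signal are illustrative names, not any real agent's API. The point is only to show how little of the loop the model owns.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (action, observation)


def build_context(task: str, history: List[Step], max_chars: int = 200) -> str:
    """Harness decision: what the model sees and how tool output is truncated."""
    lines = [f"TASK: {task}"]
    for action, observation in history:
        lines.append(f"$ {action}\n{observation[:max_chars]}")
    return "\n".join(lines)


def run_agent(
    task: str,
    model: Callable[[str], str],
    run_tool: Callable[[str], str],
    verify: Callable[[List[Step]], bool],
    max_steps: int = 50,
) -> Tuple[bool, int]:
    """Everything in this function except the model(...) call is harness."""
    history: List[Step] = []
    for _ in range(max_steps):                      # harness: step budget
        context = build_context(task, history)      # harness: context construction
        action = model(context)                     # model: next action given context
        if action == "DONE":                        # harness: stop condition
            break
        history.append((action, run_tool(action)))  # harness: tool execution
    return verify(history), len(history)            # harness: verifier


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_model(context: str) -> str:
    return "DONE" if "tests passed" in context else "run tests"


def toy_tool(action: str) -> str:
    return "tests passed"


def toy_verify(history: List[Step]) -> bool:
    return any("tests passed" in obs for _, obs in history)


solved, steps = run_agent("fix the bug", toy_model, toy_tool, toy_verify)
print(solved, steps)  # prints: True 1
```

Change `max_chars`, the stop condition, or the verifier and the "score" can change while `toy_model` stays byte-for-byte identical.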

This is the trap the most rigorous of the 2026 papers names directly, and the engineer writing as @0xTria put it in a post that crystallized the mood:

> AI coding benchmarks do not measure "the model."
>
> They measure the model, the tools, the system prompt, the verifier, the repo state, and whether the task leaked into training or git history.
>
> — [@0xTria](https://x.com/0xTria/status/2061375868425970044), Jun 13, 2026

Once you see it, you cannot unsee it, and it poisons every leaderboard you have ever cited. When a vendor posts "our model scores 70% on SWE-bench Verified," that 70% was produced by the vendor's own harness — their context construction, their tool set, their retry logic, all tuned specifically to make that model look good. When a competitor posts 68% from a different harness, the two-point gap is not a model comparison. It is a comparison of two model-harness *systems*, with the harness held constant for neither. The numbers were never apples to apples. They were never even apples.

![A labeled closed-loop control diagram on cream paper, black ink line-art with a single red accent. A central box labeled HARNESS (CONTROLLER) with four internal sub-blocks stacked inside it: CONTEXT CONSTRUCTION, TOOL INTERFACE, ORCHESTRATION, VERIFIER. An arrow loops out to a smaller box labeled LLM (STOCHASTIC POLICY) drawn in red, which returns an arrow labeled ACTION back into the loop; a second feedback arrow labeled OBSERVATION runs from an ENVIRONMENT box (a terminal glyph) back to the harness. A bracket around the whole cycle reads BENCHMARK SCORE = PROPERTY OF THE WHOLE LOOP. Small callout: THE MODEL IS ONE BLOCK.](/post-images/2026-06-22-how-much-is-the-harness-worth-measuring-agent-scaffolds/closed-loop.jpg)

That framing, with the harness as the *controller* of a closed-loop system and the model as the stochastic policy it governs, is the theoretical heart of the paper that did the most to turn the intuition into math.

# The Binding Constraint Thesis

The paper is titled, with a bluntness rare in academic writing, *Stop Comparing LLM Agents Without Disclosing the Harness*. It was posted to arXiv on May 7, 2026 by Yunbei Zhang and Janet Wang of Tulane, Yingqiang Ge of Rutgers, Weijie Xu, Jihun Hamm, and Chandan K. Reddy of Virginia Tech. It is a position paper, and its position is stated as a falsifiable claim it calls the **Binding Constraint Thesis**: for long-horizon tasks evaluated across models of comparable frontier capability, performance variance is governed more by harness configuration than by model choice.

The argument comes in three parts, and they escalate from theory to receipts.

First, the control-theoretic formalization. If you model the agent as a closed-loop dynamical system (controller plus policy plus environment), then the harness is not a passive wrapper. It is the controller that decides what state the policy ever sees, which actions are even available, and when the loop terminates. A small change to a controller in a feedback system can produce a larger change in system behavior than a change to the policy it drives, because the controller's effect compounds across every step of the loop. This is why, the authors argue, swapping a system prompt can move a score more than swapping the model: the prompt change reshapes every one of the fifty decisions the agent makes, while a stronger model improves each decision marginally. Compounding beats marginal.

Second, and this is where the paper stops being a thought experiment, the evidence from public leaderboards. The authors comb through existing results and find the fingerprints of the harness everywhere once you know to look. On SWE-bench Pro, six different models under the same SEAL harness span only **4.9 percentage points** (41.0% to 45.9%). But take the single best model in that group, Claude Opus 4.5, and run it under Claude Code instead of SEAL, and it jumps **9.5 points**, from 45.9% to 55.4%. The harness swap on one model produced almost twice the spread of switching between six different models. They document a subagent layer (WarpGrep) adding a couple of points and, in doing so, *flipping the ranking* between MiniMax 2.5 and Claude Opus 4.6. On a smaller subset, SWE-bench Verified Mini under the HAL harness, they record same-model swings of **34 points** for Claude Sonnet 4.5 (68% down to 34%) and **34 points** for GPT-5 Medium. Against this, they set the sobering baseline: the model advances that research papers trumpet as meaningful typically move scores **2 to 4 points**.

> They are harness substitutions, and they routinely dwarf the 2 to 4 percentage point shifts that papers report as meaningful model advances.
>
> — Zhang et al., §2.1, *Stop Comparing LLM Agents Without Disclosing the Harness*

Third, the controlled factorial, the part that produces the single most quotable number in the entire literature. The authors did not want to rely only on leaderboards they did not run, so they built a clean experiment: three models (GPT-5.4, Kimi K2.6, GLM-5.1) crossed with three harnesses of increasing sophistication (Minimal, Improved, Full), each cell run twice on a 100-task subset of SWE-bench Verified, with a fixed 50-step budget. It is a proper three-by-three factorial design, one that lets you decompose how much of the score variance comes from changing the model versus changing the harness.

The result: harness-induced variance exceeded model-induced variance by an aggregate factor of **7.8x**. Put plainly, in their experiment the choice of scaffold mattered nearly eight times as much as the choice of model. GLM-5.1 alone climbed from 52.5% under the minimal harness to 65.5% under the full one — a 13-point gain from scaffold work, on a frozen model. And the ranking instability was not occasional: **six of the nine** possible model-pair rankings reversed depending on which harness you used to compare them. The "best model" was not a property of the model. It was a property of the harness you happened to test it in.
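One plausible way to read a factorial grid like this is to compare how much the row averages (models) move against how much the column averages (harnesses) move. The sketch below does that on a made-up 3x3 grid. The numbers are hypothetical placeholders, not the paper's cells, and the ratio of main-effect variances is an assumed stand-in for whatever decomposition Zhang et al. actually used. It also counts ranking flips across every model pair and harness pair, which is one way to arrive at nine comparisons for three models and three harnesses; the paper's own counting rule may differ.

```python
from itertools import combinations
from statistics import mean, pvariance

# HYPOTHETICAL resolve rates (%): model -> harness -> score.
# Illustrative only; these are not the cells reported in the paper.
scores = {
    "model_a": {"minimal": 48.0, "improved": 55.0, "full": 62.0},
    "model_b": {"minimal": 56.0, "improved": 57.0, "full": 64.0},
    "model_c": {"minimal": 52.0, "improved": 62.0, "full": 66.0},
}
models = list(scores)
harnesses = list(scores[models[0]])

# Main effects: average over the other factor.
model_means = [mean(scores[m].values()) for m in models]
harness_means = [mean(scores[m][h] for m in models) for h in harnesses]

# Assumed metric: ratio of the variance of the two sets of main-effect means.
model_var = pvariance(model_means)
harness_var = pvariance(harness_means)
ratio = harness_var / model_var

# A "flip" is a model pair whose order under one harness differs under another.
flips = 0
comparisons = 0
for m1, m2 in combinations(models, 2):
    for h1, h2 in combinations(harnesses, 2):
        comparisons += 1
        if (scores[m1][h1] > scores[m2][h1]) != (scores[m1][h2] > scores[m2][h2]):
            flips += 1

print(f"harness/model variance ratio: {ratio:.2f}")
print(f"ranking flips: {flips} of {comparisons}")
```

The useful habit is the shape of the computation, not the toy output: a ratio like this only exists once both factors are varied deliberately in the same grid.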

![A labeled grouped bar comparison on cream paper, black ink line-art with one red accent. Left cluster titled MODEL SWAP shows three short bars of nearly equal height labeled 2-4 PT TYPICAL ADVANCE. Right cluster titled HARNESS SWAP shows bars rising in a staircase labeled with the tick values 9.5, 13, 34, 54, the tallest drawn in red and annotated SAME MODEL, DIFFERENT SCAFFOLD. A horizontal dashed baseline runs across both clusters labeled WHAT A NEW MODEL BUYS YOU. Bottom caption reads VARIANCE RATIO 7.8x.](/post-images/2026-06-22-how-much-is-the-harness-worth-measuring-agent-scaffolds/variance-ratio.jpg)

The 7.8x figure traveled fast, and it deserves the asterisk the authors themselves attach to it, which I will get to. But sit with the structural claim first, because it reframes two years of discourse. Every time the field said "the new model is a step change," some unknown fraction of that step was the harness the vendor shipped alongside it, tuned in the lab to flatter the weights. The Binding Constraint Thesis does not say models do not matter. It says that in the comparable-frontier regime, you cannot read a model's contribution off a benchmark unless the harness is held fixed and disclosed — and almost nobody discloses it.

# The reasoning underneath the number

It is worth slowing down on *why* the harness can outweigh the model, because the control-theory framing is doing real explanatory work and it is easy to wave past.

Consider what a single harness decision touches. Take context construction, the rule for what the agent sees at each step. A weak rule floods the context window with irrelevant file dumps; the model, however capable, now has to find signal in noise on every one of fifty steps, and its error rate at each step is higher than it would be with a clean context. A strong rule retrieves exactly the relevant code and keeps the window tight. The model is identical in both cases. But the *task* the model faces (reason over this context) is much harder in the first case, and it is harder fifty times in a row. The harness did not make the model smarter. It made the model's job easier, repeatedly, and the loop multiplied the difference.

Now compare that to a model upgrade. A better model raises the probability of a correct decision at each step, say from 0.90 to 0.93. Over a long horizon that compounds too, which is why models matter at all. But the harness change in the example above can take the per-step success probability from 0.75 to 0.92 by removing the noise the weak harness introduced. The ceiling on what harness work can buy you is set by how badly the current harness is sabotaging the model. When harnesses are immature, and in 2026 most are, that headroom is enormous, and it dwarfs the marginal gain from better weights. This is the mechanical reason the early numbers are so large, and it predicts the effect will shrink as harnesses mature and the easy sabotage gets engineered out. The 7.8x is partly a measurement of how much avoidable error current scaffolds still contain.
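The compounding can be made concrete with a deliberately crude toy model: assume a task needs a fixed number of critical decisions, each must succeed, and each succeeds independently with the same probability. None of that comes from the papers, and real agents recover from mistakes, so the absolute values below overstate the decay. The horizon of 10 is also an assumption, and the comparison is sensitive to it.

```python
def success_over_horizon(per_step: float, steps: int) -> float:
    """Toy model: every step must succeed, and steps are independent."""
    return per_step ** steps


STEPS = 10  # assumed number of critical decisions, not a figure from the article

# Per-step probabilities taken from the prose above.
model_before = success_over_horizon(0.90, STEPS)
model_after = success_over_horizon(0.93, STEPS)
harness_before = success_over_horizon(0.75, STEPS)
harness_after = success_over_horizon(0.92, STEPS)

model_gain = model_after - model_before
harness_gain = harness_after - harness_before

print(f"model upgrade: {model_before:.3f} -> {model_after:.3f} (+{model_gain:.3f})")
print(f"harness fix:   {harness_before:.3f} -> {harness_after:.3f} (+{harness_gain:.3f})")
```

Under these assumptions the harness fix buys the larger absolute gain because it starts from a much worse per-step baseline, which is the "headroom" argument in miniature.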

That prediction matters for how you read everything below. The harness premium is real, but some of it is the one-time gain of fixing obviously broken scaffolds, not a permanent law of nature. A field measuring its own immaturity will report dramatic numbers. The interesting question for next year is how fast the premium decays as the obvious mistakes get fixed — and whether it ever decays to zero, or settles at a permanent floor because the harness is, structurally, the controller and the controller always matters.

# From measuring the harness to evolving it

If the harness is worth that much, the next move is obvious: stop tuning it by hand. That is the leap the second major paper makes, and it is the one the practitioner crowd reacted to most strongly.

*Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses* was posted April 29, 2026 by a team led by Jiahang Lin and Shichun Liu at Fudan, with collaborators at Peking and Shanghai Qiji Zhifeng. Its premise is that automating harness improvement has been bottlenecked not by the intelligence of the agent doing the tuning, but by *observability* — the inability to tell which of a hundred edits actually helped. Their fix is to make every harness change a falsifiable contract.

The reception captured the why. Elvis Saravia, writing as @omarsar0 to an audience of AI engineers, flagged it with unusual urgency:

> // Agentic Harness Engineering //
>
> Pay attention to this one, AI devs.
>
> (bookmark it)
>
> Most coding-agent harnesses are still tuned by hand or brittle trial-and-error self-evolution.
>
> — [@omarsar0](https://x.com/omarsar0/status/2049492169887748365), May 1, 2026

The method has three layers, and the design is more interesting than the result. **Component observability** turns the harness into seven types of editable, revertible files (system prompt, tool descriptions, tool implementations, middleware, skills, sub-agent configs, and long-term memory), starting from a deliberately minimal bash-only seed harness. **Experience observability** compresses millions of tokens of trajectory logs into a navigable, drill-down corpus of failure evidence, so the tuning agent reasons over distilled signal rather than raw noise. **Decision observability** is the clever part: every edit must ship with a written prediction of what it will improve, and that prediction is checked against the next round's results. An edit is no longer a hopeful change. It is a hypothesis with a scoreboard.
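The "hypothesis with a scoreboard" idea can be sketched as a small record plus a scoring function. This is not AHE's implementation: the `HarnessEdit` fields, the precision and recall definitions, and the keep-or-revert rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class HarnessEdit:
    component: str                   # e.g. "tool_descriptions" (hypothetical label)
    description: str
    predicted_fixes: FrozenSet[str]  # task ids the edit claims it will fix


def score_edit(edit: HarnessEdit, before: Dict[str, bool], after: Dict[str, bool]) -> dict:
    """Check the edit's written prediction against the next round's results."""
    fixed = {t for t in after if after[t] and not before[t]}
    regressed = {t for t in after if before[t] and not after[t]}
    hits = fixed & edit.predicted_fixes
    return {
        # Assumed definitions: share of predictions that came true,
        # and share of real fixes that were predicted.
        "precision": len(hits) / len(edit.predicted_fixes) if edit.predicted_fixes else 0.0,
        "recall": len(hits) / len(fixed) if fixed else 0.0,
        "regressed": regressed,
        # Assumed policy: keep the edit only if it fixes more than it breaks.
        "keep": len(fixed) > len(regressed),
    }


edit = HarnessEdit(
    component="tool_descriptions",
    description="Document the --dry-run flag of the patch tool",
    predicted_fixes=frozenset({"t1", "t2"}),
)
before = {"t1": False, "t2": False, "t3": True, "t4": False}
after = {"t1": True, "t2": False, "t3": False, "t4": True}

result = score_edit(edit, before, after)
print(result)
```

In this toy round the edit is kept, but the scoreboard also records that only half of its predictions landed and that it quietly broke `t3`, which is exactly the kind of evidence a hand-tuned harness never writes down.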

Rohan Paul's summary of the mechanism is the cleanest one-sentence statement of why this matters:

> The big deal is that it turns harness tuning from guesswork into an auditable experiment, so the part of agent systems that quietly eats the most time and effort can now improve itself in a controlled and measurable way.
>
> — [@rohanpaul_ai](https://x.com/rohanpaul_ai/status/2049772749384798354), May 1, 2026

The results back the design. Holding the base model fixed (GPT-5.4) and letting the loop run ten rounds, pass@1 on Terminal-Bench 2 (89 tasks) climbed from **69.7% to 77.0%**, a 7.3-point gain bought entirely by editing the scaffold. That final harness beat the human-designed Codex-CLI harness at 71.9%, and beat the self-evolving baselines ACE (68.9%) and TF-GRPO (72.3%). It is the cleanest existence proof in the literature that harness work, done well, produces gains a model upgrade would be proud of — on a model that never changed.

Two details elevate this from a benchmark win to something more durable. First, transfer: the harness evolved on one model carried over to others, lifting deepseek-v4-flash by **10.1 points** (51.7% to 61.8%), qwen-3.6-plus by 6.3, and gemini-3.1-flash-lite by 5.1, while using **12% fewer tokens** than the seed harness on SWE-bench Verified. A good harness is not a model-specific hack; it encodes something about the task that generalizes. Second, the ablation revealed where the gains live, and it is a quiet indictment of how the competing systems tune. Long-term memory edits contributed +5.6 points, tool edits +3.3, and middleware +2.2, while the system-prompt edits actually *cost* 2.3 points, the only component that regressed. As the authors put it, "the harness components ACE and TF-GRPO never edit are exactly where the gain lives." The rival auto-tuners were polishing the prompt, the one lever that hurt, and ignoring the tools and memory where the real points were hiding.

![A labeled diverging bar chart on cream paper, Bloomberg-style ink line-art with one red accent. A horizontal zero baseline with four labeled bars showing each harness component's contribution in percentage points: LONG-TERM MEMORY +5.6 as the longest positive bar, TOOLS +3.3, MIDDLEWARE +2.2 extending right in charcoal, and SYSTEM PROMPT -2.3 extending left below zero in red and annotated ONLY COMPONENT THAT REGRESSED. A bracket over the three positive bars reads WHERE THE GAIN LIVES. A caption reads RIVAL TUNERS POLISHED THE PROMPT. Axis ticks run from -4 to +6.](/post-images/2026-06-22-how-much-is-the-harness-worth-measuring-agent-scaffolds/ablation-where-gain-lives.jpg)

![A labeled three-layer stack diagram on cream paper, black ink line-art with one red accent. Three horizontal tiers connected by vertical flow arrows, each tier annotated. Top tier COMPONENT OBSERVABILITY shows seven small labeled file icons in a row (PROMPT, TOOL DESC, TOOL IMPL, MIDDLEWARE, SKILL, SUBAGENT, MEMORY) with a small revert-arrow glyph. Middle tier EXPERIENCE OBSERVABILITY shows a large log-pile glyph funneling into a small distilled card labeled FAILURE EVIDENCE. Bottom tier DECISION OBSERVABILITY shows an EDIT box paired with a PREDICTION tag and a check/cross verdict drawn in red, labeled FALSIFIABLE CONTRACT. A side bracket reads EACH EDIT = A HYPOTHESIS WITH A SCOREBOARD. Small callout near memory: WHERE THE GAIN LIVES.](/post-images/2026-06-22-how-much-is-the-harness-worth-measuring-agent-scaffolds/three-layer-observability.jpg)

There is one humbling number in the AHE paper worth holding onto, because it tempers the triumph. The system's own self-attribution (its ability to correctly predict which edits would help) had a fix-precision of 33.7% and a fix-recall of 51.4%. Better than random by roughly five times, but far from reliable. And its ability to catch regressions was worse still: precision 11.8%, recall 11.1%, what the authors candidly call "regression blindness." The auto-evolving harness improves, but it is a poor judge of its own work, prone to missing the changes that quietly make things worse. Even the machine built to measure the harness cannot fully see what it is doing. That is a fitting emblem for the whole field.

# The dissent: Terminal-Bench says the model still wins

Every honest survey needs the paper that disagrees, and here it is — published by the very team whose benchmark the other papers lean on.

*Terminal-Bench*, posted January 17, 2026, is a large collaboration: Mike Merrill and Alexander Shaw as joint leads, Nicholas Carlini among the authors, with Alex Dimakis, Andy Konwinski of Laude, and Ludwig Schmidt of Stanford as senior authors, spanning 44 institutions. It is a serious, careful benchmark — 89 hard, realistic terminal tasks (distilled from 229 crowd-sourced by 93 contributors), each a Docker environment with hidden tests, graded on whether the agent leaves the container in the correct final state. It ran six agents across sixteen models for a total of 32,155 trials.

And its headline conclusion runs *against* the binding-constraint crowd:

> implying that model selection is usually more important than agent scaffold when optimizing for performance
>
> — Merrill, Shaw et al., *Terminal-Bench*

This is the strongest counterargument in the literature, and it should be taken at full weight rather than explained away. The Terminal-Bench authors looked at their 32,000 trials and concluded that, on net, *which model you pick* moved scores more than *which scaffold you wrap it in*. Their top result (GPT-5.2 under Codex CLI at 62.9%) sits clearly above the field, and the model identity, not the harness, is doing most of that work in their reading.

But the paper's own data contains the seed of the opposing view, and the tension is the most interesting thing in it. Look at the same-model, different-harness swings Terminal-Bench itself reports: GPT-5.2 moves about 9 points between Codex CLI (62.9%) and the neutral Terminus 2 scaffold (54.0%); Gemini 2.5 Pro moves roughly **17 points** (32.6% versus 15.7%) depending on harness; Claude Haiku 4.5 about 16. Those are not small numbers. They are larger than the gaps between adjacent models on the leaderboard. And there is a tell in the methodology: as Figure 1 notes, "the agent scaffold used to report each model was chosen to maximize performance." Every headline number was produced by the best harness the authors could find for that model. The leaderboard is a list of model-harness *systems*, each individually optimized — which is exactly the practice the Zhang paper says makes cross-model comparison invalid.

So the two camps are less opposed than they appear. Terminal-Bench's "model matters more" holds when you let each model use its best-tuned harness, because a frontier model with a great scaffold beats a weaker model with a great scaffold. Zhang's "harness matters more" holds when you fix the harness and vary it deliberately, because then the scaffold's contribution becomes visible and it is large. Both can be true. The disagreement is really about which experiment you run, and that is itself the lesson: the answer to "model or harness" depends entirely on what you hold constant — which means neither number means anything until you say which one you fixed.

# Two experiments, two answers, one confusion

The reconciliation between Terminal-Bench and the binding-constraint papers is worth making fully explicit, because it dissolves what looks like a contradiction into a precise methodological point — and that point is the most useful thing a builder can take from this whole literature.

There are exactly two honest ways to run a model-versus-harness experiment, and they answer different questions.

The first is the **locked-harness** design: pick one fixed scaffold, apply it identically to every model, and vary only the model. This is what a clean model comparison requires, and it is what DeepSWE's neutral mini-swe-agent harness approximates. Run this way, you are measuring the model's contribution with the harness held constant, and the answer to "does the model matter" is yes — a better model under the same scaffold scores higher. This is the regime in which Terminal-Bench's "model selection is usually more important than scaffold" is true, with the one caveat that matters: Terminal-Bench did *not* lock its harness. It let each model use its best-tuned scaffold, which is a third thing entirely.

The second is the **factorial** design: vary model and harness together in a grid and decompose the variance. This is what Zhang et al. ran, and it answers "which factor explains more of the spread I see across systems." Run this way, on immature harnesses with a real minimal-to-full gap, the harness term dominates — the 7.8x.

The confusion in the discourse comes from a third practice that is neither of these and is what every public leaderboard actually does: report each model under its own individually-optimized harness. This is the worst of both worlds. It is not a locked-harness comparison, because the harness changes with the model. It is not a factorial decomposition, because the harness is not a controlled variable. It is a hidden one, tuned to flatter each model separately. The resulting ranking conflates "better model" with "better-tuned harness for this model" and reports the sum as if it were the model's score. Terminal-Bench's own Figure 1 admits this directly: the scaffold behind each model's headline number "was chosen to maximize performance." That is the practice the Zhang paper exists to condemn, and Terminal-Bench, for all its rigor, participates in it while concluding the opposite.

![A labeled three-column comparison on cream paper, black ink line-art with one red accent. Three columns each headed by a small experiment glyph. Column 1 LOCKED HARNESS shows one fixed scaffold icon applied to three different model icons, captioned MEASURES THE MODEL. Column 2 FACTORIAL shows a 2x2 grid of model-by-harness cells with a decomposition bracket, captioned MEASURES BOTH. Column 3 (drawn in red, with a small warning glyph) LEADERBOARD shows each model paired with its own different best-tuned scaffold, captioned MEASURES NEITHER. A footer banner reads THE ANSWER DEPENDS ON WHICH ONE YOU RAN.](/post-images/2026-06-22-how-much-is-the-harness-worth-measuring-agent-scaffolds/two-experiments.jpg)

Once you hold this distinction in mind, the apparent war between the papers evaporates and a single coherent picture remains. Models matter: under a locked harness, the better model wins. Harnesses matter more than the field admitted, because under a factorial decomposition on today's scaffolds, they explain most of the variance. And public leaderboards measure a confounded blend of the two that supports neither clean claim. The right response is not to pick a side. It is to demand that any score declare which experiment produced it, which is precisely what the disclosure standard below is for.
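The difference between a locked-harness ranking and a leaderboard ranking is easy to see on a toy grid. The scores and the harness names below are invented for illustration; only the two reporting rules matter.

```python
from typing import Dict, List, Tuple

Grid = Dict[str, Dict[str, float]]

# HYPOTHETICAL scores (%): model -> harness -> score.
scores: Grid = {
    "model_a": {"neutral": 60.0, "vendor_a_cli": 66.0, "vendor_b_cli": 50.0},
    "model_b": {"neutral": 63.0, "vendor_a_cli": 52.0, "vendor_b_cli": 61.0},
}


def locked_harness_ranking(grid: Grid, harness: str) -> List[str]:
    """Same scaffold for every model: a ranking of models."""
    return sorted(grid, key=lambda m: grid[m][harness], reverse=True)


def leaderboard(grid: Grid) -> List[Tuple[str, str, float]]:
    """Each model under its own best scaffold: a ranking of model-harness pairs."""
    rows = []
    for model, by_harness in grid.items():
        best = max(by_harness, key=by_harness.get)
        rows.append((model, best, by_harness[best]))
    return sorted(rows, key=lambda row: row[2], reverse=True)


print(locked_harness_ranking(scores, "neutral"))
print(leaderboard(scores))
```

On this grid the neutral scaffold puts `model_b` first, while the best-of-each leaderboard puts `model_a` first. Neither ranking is wrong; they answer different questions, and only one of them is a statement about the model.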

# The leaderboard was measuring the wrong thing twice

If the harness contaminates benchmark scores, you might at least hope the *tasks* themselves were clean. They were not, and this is the finding that should most unsettle anyone who has ever cited a SWE-bench number.

UTBoost, an ACL 2025 paper by Yu, Zhu, He, and Kang, audited the test cases SWE-bench uses to grade agents and found them frequently inadequate — tests so weak that a wrong patch could pass them. After strengthening the test suites, corrections affected **40.9% of SWE-Bench Lite** entries and **24.4% of SWE-Bench Verified** entries. Nearly a quarter of the "verified" benchmark's gradings changed once the tests were made rigorous. Rankings shifted; Amazon-Q's score, for instance, moved when the broken tests were fixed. The implication is stark: it is not only the harness that was unmeasured. A large fraction of the underlying task gradings were wrong, which means the model signal everyone was reading off these leaderboards was noisy *before* the harness even entered the picture.
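What "a wrong patch could pass" looks like in miniature: the example below is invented for illustration and is not drawn from SWE-bench or from UTBoost's audit. The task is to remove duplicates while preserving order; the wrong patch sorts instead, and a thin test cannot tell the difference.

```python
from typing import Callable, List

Patch = Callable[[List[int]], List[int]]


def correct_patch(items: List[int]) -> List[int]:
    """Remove duplicates, keeping first-seen order."""
    return list(dict.fromkeys(items))


def wrong_patch(items: List[int]) -> List[int]:
    """Plausible but wrong: removes duplicates and silently reorders."""
    return sorted(set(items))


def weak_test(patch: Patch) -> bool:
    # Input is already sorted, so the ordering bug is invisible.
    return patch([1, 1, 2]) == [1, 2]


def strengthened_test(patch: Patch) -> bool:
    # Unsorted input exercises the order-preservation requirement.
    return patch([3, 1, 3, 2]) == [3, 1, 2]


for name, patch in [("correct", correct_patch), ("wrong", wrong_patch)]:
    print(name, weak_test(patch), strengthened_test(patch))
```

Under the weak test both patches are graded as solved, so the benchmark credits an agent for a fix that does not work. Strengthening the test changes the grade without the agent, the model, or the harness changing at all.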

This is the deeper version of @0xTria's point. The benchmark score is the model, times the harness, times the quality of the verifier that decides whether the agent succeeded — and in 2026 the field discovered that the second and third factors were both large, both variable, and both undisclosed. The number on the leaderboard was a product of three things, two of which nobody was reporting.

Daniel Vaughan, writing on June 16 in the post that gives this piece its title, pulled three of these threads together and stated the synthesis plainly. SWE-bench, he argued, is best understood as "a model-harness-evaluation-suite composite score where the harness contributes as much variance as the model." Surveying Claw-SWE-Bench, Harness-Bench, and UTBoost together, he found the harness effect on one benchmark (27.4 points of spread) "nearly as large as the model effect" (29.4 points) — and, in the most extreme single data point in the entire literature, a model moving from **19.1% to 73.4%** on Claw-SWE-Bench by changing only its adapter from minimal to full. A 54-point gap, attributable entirely to harness design, on identical weights.

![A labeled equation-style diagram on cream paper, black ink line-art with one red accent. Three multiplied boxes joined by multiplication signs reading SCORE = MODEL x HARNESS x VERIFIER. Below each box a small annotation: under MODEL "2-4 pt advances", under HARNESS "6-54 pt swings" drawn in red, under VERIFIER "24-41% of SWE-bench tests broken (UTBoost)". A bracket beneath the whole equation reads ONLY THE FIRST WAS EVER REPORTED. A small magnifying-glass glyph hovers over the HARNESS and VERIFIER boxes.](/post-images/2026-06-22-how-much-is-the-harness-worth-measuring-agent-scaffolds/score-equation.jpg)

# Why the number resists being pinned down

By now the pattern across the four papers is clear: ordinary scaffold swaps move scores 6 to 17 points, aggressive minimal-versus-full comparisons move them 34 to 54, and genuine model advances move them 2 to 4. The harness premium is real and large. So why won't the field just publish *the* number?

Because there isn't one, and the most careful paper in the set is the most insistent on this. Zhang and colleagues attach an explicit disclaimer to their headline finding: "We do not claim that the 7.80x ratio is universal." The ratio is a property of their specific models, their specific harnesses, their specific 100-task subset, and the specific gap between their minimal and full scaffolds. Widen the model capability gap and the model term grows. Mature the harnesses so the minimal one is less broken and the full one less heroic, and the harness term shrinks. Change the task distribution and everything moves. The 7.8x is a real measurement of a real system, not a constant of the universe — and the authors' refusal to oversell it is the most credible thing about the paper.

This is also why the practitioners closest to the metal keep reaching for a more honest unit of measurement. Philipp Schmid, dissecting the DeepSWE benchmark, noticed that its neutral evaluation harness gave every model "a single bash tool and the same SI [system instructions]. No vendor editing primitives," and that under those controlled conditions, "Claude Opus scored +10pp over Claude Code. Gemini 3.1 Pro scored +20pp over Gemini CLI." The vendor harnesses, in other words, were sometimes *worse* than a neutral one, by ten to twenty points. And the independent builder Sakasegawa, constructing a personal HarnessBench across 27 tasks, ran straight into the sample-size wall and was honest about it: "27 tasks are not enough to make strong success-rate ranking claims. To reliably detect a 10-point gap, we probably need roughly 160 to 315 tasks." He also argued that "wall time deserves to be a first-class metric," a reminder that success rate is not even the only axis the harness controls.
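The sample-size wall is a standard power calculation. The sketch below uses the textbook normal-approximation formula for comparing two independent success rates; it is not Sakasegawa's calculation, and the significance level, power, baseline rates, and the choice of a one-sided or two-sided test are all assumptions that move the answer.

```python
from math import ceil
from statistics import NormalDist


def tasks_per_arm(
    p1: float,
    p2: float,
    alpha: float = 0.05,
    power: float = 0.80,
    two_sided: bool = True,
) -> int:
    """Tasks needed per harness to detect p1 vs p2 (independent samples)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2 if two_sided else 1 - alpha)
    z_beta = z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# A 10-point gap at two assumed baselines, under both test conventions.
for p1, p2 in [(0.50, 0.60), (0.80, 0.90)]:
    for two_sided in (True, False):
        n = tasks_per_arm(p1, p2, two_sided=two_sided)
        print(f"{p1:.2f} vs {p2:.2f}, two_sided={two_sided}: {n} tasks per arm")
```

Every combination lands in the low hundreds, the same order of magnitude as the quoted range and far beyond 27. Running both harnesses on the same tasks and analyzing the pairs would generally need fewer, which is one more design choice a benchmark has to disclose.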

EnterpriseClawBench, surfaced by @HuggingPapers in June, made the same point from the enterprise direction:

> EnterpriseClawBench
>
> A benchmark for enterprise coding agents distilled from real workplace sessions. It evaluates complete harness-model systems on artifact delivery, cost, runtime, and skill transfer.
>
> Even the best configuration only reaches 0.663.
>
> — [@HuggingPapers](https://x.com/HuggingPapers/status/2069347401890955503), Jun 22, 2026

The frontier of harness-plus-model systems, measured on realistic work rather than curated tasks, is nowhere near solved — which means the harness premium is being measured on a problem where there is still enormous headroom, exactly the condition under which the premium is largest.

## How the field metabolized it

The papers supplied the numbers, but the more telling signal is how fast practitioners reorganized their mental models around them. This was not a slow academic absorption. Within days of each paper, the people building agents for a living were rewriting their own rules of thumb in public, and the direction of the revision was consistent: stop crediting the model for what the harness did.

The sharpest version came from Philipp Schmid, who used the release of the DeepSWE benchmark to make the controlled-comparison point concrete. DeepSWE's neutral evaluation harness gave every model the same minimal tooling, and the results embarrassed the vendor scaffolds:

> The evaluation harness (mini-swe-agent) gives every model a single bash tool and the same SI. No vendor editing primitives. [...] Mini-swe-agent claims to match or beat 1P harnesses on the same tasks. Claude Opus scored +10pp over Claude Code. Gemini 3.1 Pro scored +20pp over Gemini CLI.
>
> — [@_philschmid](https://x.com/_philschmid/status/2059564676569076021), Jun 5, 2026

Read that twice, because it inverts the marketing. Claude Code and Gemini CLI are the vendors' own flagship harnesses, built by the teams that built the models, and by these numbers a stripped-down neutral scaffold *beat both of them*, by ten and twenty points respectively, on the same weights. The harness the vendor ships is not automatically the best harness for the model; sometimes it is actively leaving double-digit points on the table. Schmid's instinct at the end of the thread is the tell of where the field's attention has moved: he wanted to see "how other harness x model combinations will do." The unit of analysis is no longer the model. It is the pair.

This is the deeper reason the discourse felt like a phase change rather than a finding. For two years the bookmark-worthy AI-engineering content was about models — which one, how big, how to prompt it. By mid-2026 the most-shared work was about the layer around the model, and the framing had flipped from "how do I pick the best model" to "how do I measure my scaffold." When @0xTria wrote that benchmarks "do not measure the model," and @rohanpaul_ai celebrated a method that turns "guesswork into an auditable experiment," they were not reporting two separate facts. They were describing the same migration: the field's center of gravity moving from the weights to the loop, and bringing its measurement instruments with it.

There is a contrarian undertone in the reaction worth surfacing, because not everyone treats the harness premium as good news. If a neutral harness can beat a vendor's flagship by twenty points, that is also an indictment — it means the vendors, with every advantage, are shipping scaffolds that handicap their own models. And if the premium is largely the one-time gain of fixing broken harnesses, then the people celebrating 54-point swings are, in a sense, celebrating how bad the starting point was. The optimistic read ("look how much headroom the harness gives us") and the pessimistic read ("look how much we were leaving on the floor") are the same number seen from two sides. Both are in the timeline. Both are correct.

## What disclosure would require

If the field accepts that a benchmark score is meaningless without the harness, the obvious response is to make people disclose it. Zhang and colleagues do not leave this as a wish. They specify the mechanism, and the specificity is what makes it worth reading rather than nodding past.

The first piece is a **Harness Card**, a structured disclosure organized by a seven-layer taxonomy they label ETCSOVG: Execution (the runtime substrate, sandboxing, step and task budgets), Tool (the tool list, schemas, and error contract), Context (window cap, compression and retrieval policy, persistent memory), Scheduling (the agent loop, retry and escalation rules), Observability (what artifacts and traces are logged), Verification (validation, self-checking, anomaly detection), and Governance. The authors are careful to frame it as "an attention checklist rather than a design constraint" — not a prescription for how to build a harness, just a demand that you say what yours does. The point is that two harnesses differing on any one of these seven layers are not interchangeable, and a score that does not specify them cannot be compared to another.
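
One way to picture a Harness Card is as a small structured record with one slot per layer. The sketch below is hypothetical: the seven layer names come from the taxonomy above, but the `HarnessCard` class, the free-text fields, the `differing_layers` helper, and the example values are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class HarnessCard:
    """Hypothetical record with one free-text slot per ETCSOVG layer."""
    execution: str      # runtime substrate, sandboxing, step and task budgets
    tool: str           # tool list, schemas, error contract
    context: str        # window cap, compression and retrieval policy, memory
    scheduling: str     # agent loop, retry and escalation rules
    observability: str  # artifacts and traces that get logged
    verification: str   # validation, self-checking, anomaly detection
    governance: str     # named in the taxonomy; contents left open here


def differing_layers(a: HarnessCard, b: HarnessCard) -> list[str]:
    """Names of the layers on which two cards disagree, in taxonomy order."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Invented example values, for illustration only.
minimal = HarnessCard(
    execution="container sandbox, 50-step budget",
    tool="single bash tool",
    context="no compression, no persistent memory",
    scheduling="single loop, no retries",
    observability="final patch only",
    verification="none",
    governance="unspecified",
)
full = replace(minimal, tool="bash plus file editor", verification="runs tests before submit")

print(differing_layers(minimal, full))
```

Two scores are only comparable as model comparisons when this list comes back empty; anything else is a comparison of systems.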

The second piece is the part with teeth: a **variance-decomposition protocol**. The minimum valid experiment, they argue, is a two-by-two model-by-harness grid, with task order, execution environment, eval script, API parameters, and stopping rules all held constant. And alongside the headline score you must report four things the field currently never reports — the harness variance per model, the model variance per harness, the aggregate ratio between them, and the count of model-pair ranking reversals across harnesses. A harness difference, they add, "counts as meaningful only if it changes at least one ETCSOVG layer," which rules out the trick of claiming a new harness while only retuning a temperature parameter.
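
The four reported quantities can be computed from a score grid in a few lines. The sketch below is one plausible reading rather than the paper's reference implementation: it assumes population variance, an aggregate ratio of mean variances, and counts a reversal whenever a model pair's order flips between any two harnesses. The function name `decompose` and the grid values are invented for illustration.

```python
from itertools import combinations
from statistics import mean, pvariance

# Hypothetical success rates (%), indexed scores[model][harness].
# These numbers are invented for illustration, not taken from any paper.
scores = {
    "model_a": {"minimal": 30.0, "full": 60.0},
    "model_b": {"minimal": 40.0, "full": 55.0},
}


def decompose(grid):
    """Return (harness variance per model, model variance per harness,
    aggregate ratio, number of model pairs whose ranking reverses)."""
    models = list(grid)
    harnesses = list(grid[models[0]])

    # Spread of one model's scores as the harness changes.
    harness_var = {m: pvariance([grid[m][h] for h in harnesses]) for m in models}
    # Spread across models with the harness held fixed.
    model_var = {h: pvariance([grid[m][h] for m in models]) for h in harnesses}

    # Assumed aggregation: ratio of the two mean variances.
    # Raises ZeroDivisionError if the models tie under every harness.
    ratio = mean(list(harness_var.values())) / mean(list(model_var.values()))

    # A pair reverses if the sign of its score gap differs between two harnesses.
    reversals = sum(
        any(
            (grid[a][h1] - grid[b][h1]) * (grid[a][h2] - grid[b][h2]) < 0
            for h1, h2 in combinations(harnesses, 2)
        )
        for a, b in combinations(models, 2)
    )
    return harness_var, model_var, ratio, reversals


harness_var, model_var, ratio, reversals = decompose(scores)
print(f"harness variance per model: {harness_var}")
print(f"model variance per harness: {model_var}")
print(f"aggregate ratio: {ratio:.2f}, ranking reversals: {reversals}")
```

On this made-up grid the harness term dominates and the one model pair flips order between scaffolds. A real grid would need more models, more harnesses, and repeated runs before the ratio means much.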

The third piece pushes past success rate entirely, into trajectory-level metrics: a Recovery Rate that measures how often an agent claws back from an error within k steps, a Context Retention measure for how much relevant state survives compression, and a Control Lag for how long the harness takes to react. These exist because success rate is a single bit at the end of a long process, and the harness's contribution shows up in the *shape* of the trajectory (how it fails, how it recovers, what it forgets) long before it shows up in the final score.
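
To make the trajectory-level idea concrete, here is a minimal sketch of a Recovery Rate under one assumed definition: a trajectory is a sequence of per-step error flags, an error episode starts at an error step not preceded by another error, and the episode counts as recovered if any of the next k steps is clean. The paper's exact definition may differ; the function name and the example trajectory are hypothetical.

```python
from typing import Optional, Sequence


def recovery_rate(errors: Sequence[bool], k: int) -> Optional[float]:
    """Fraction of error episodes followed by a clean step within k steps.

    errors[i] is True when step i ended in an error. Returns None when the
    trajectory contains no error episodes, since the rate is then undefined.
    """
    episodes = 0
    recovered = 0
    for i, is_error in enumerate(errors):
        # Only the first step of a run of consecutive errors opens an episode.
        if not is_error or (i > 0 and errors[i - 1]):
            continue
        episodes += 1
        window = errors[i + 1 : i + 1 + k]  # the next k steps, if they exist
        if any(not step_failed for step_failed in window):
            recovered += 1
    return recovered / episodes if episodes else None


# Invented trajectory: one quick recovery, then a three-step error streak.
trajectory = [False, True, False, False, True, True, True, False]
print(recovery_rate(trajectory, k=2))
print(recovery_rate(trajectory, k=3))
```

The same trajectory scores differently as k changes, which is why a reported Recovery Rate is only interpretable alongside its window size.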

![A labeled vertical checklist card on cream paper, black ink line-art with one red accent. A document outline titled HARNESS CARD with seven labeled rows stacked inside, each with a small checkbox glyph: EXECUTION, TOOL, CONTEXT, SCHEDULING, OBSERVABILITY, VERIFICATION, GOVERNANCE. To the right, a small separate panel titled VARIANCE PROTOCOL shows a 2x2 grid (two models by two harnesses) with four annotation tags: HV PER MODEL, MV PER HARNESS, RATIO, REVERSALS — the RATIO tag drawn in red. A bracket joining both panels reads WHAT A SCORE MUST DISCLOSE. Small caption: NONE OF THIS IS REPORTED TODAY.](/post-images/2026-06-22-how-much-is-the-harness-worth-measuring-agent-scaffolds/harness-card.jpg)

Whether anyone adopts it is the open question, and the incentive math is not encouraging. A Harness Card that honestly decomposes a vendor's flagship score into "model contribution: small, harness contribution: large" is an admission no marketing department will volunteer. Disclosure standards in this position tend to be driven by the people who lose from opacity — independent evaluators, enterprise buyers running their own bake-offs, academics who cannot reproduce a vendor number. The standard's value does not depend on vendors embracing it. It depends on giving everyone else a checklist for what to demand and a name for what is missing when it is withheld.

## What this changes

Pull it together and the practical consequences are sharper than the academic framing suggests.

For anyone reading a leaderboard, the rule is now simple and uncomfortable: a benchmark score without a disclosed harness is not a model comparison, and you should stop treating it as one. The Harness Card and variance-decomposition protocol from the Zhang paper are the field's first serious attempt at a disclosure standard. Vendors have little incentive to adopt it, since it can reveal that a score came mostly from the harness, but it gives independent evaluators a checklist for what to demand.

For anyone *building* with agents, the numbers invert the default instinct. The instinct is to wait for the next model. The data says that in most current setups, the larger near-term gain is sitting in your harness — your context rules, your tool definitions, your memory, the components the AHE ablation showed were worth 5.6 and 3.3 points while the prompt everyone obsesses over was worth negative 2.3. The cheapest performance available to most teams in 2026 is not a model upgrade. It is fixing the scaffold they already have, and most teams have not measured theirs at all.

And for the field, the deepest implication is the one the open-source community has been quietly demonstrating. If the harness is worth this much, then frontier capability is no longer purely a function of who has the best weights. A team with a merely-good model and an excellent, well-measured harness can beat a great model wrapped in a mediocre one — which is precisely the gap the AHE transfer results and the DeepSWE neutral-harness numbers document. One builder, writing as @iam_elias1, put the economic version of this bluntly while comparing Cognition's funded Devin to the open-source agents that cloned it:

> Cognition AI raised $175 million and spent two years building Devin. The community built OpenHands in the open, gave it away, and is now running it in production at a scale Devin cannot match at its price point. [...] The open-source community did not beat Devin at fundraising. They beat it at the part that actually matters.
>
> — [@iam_elias1](https://x.com/iam_elias1/status/2058180656660959258), Jun 4, 2026

That comparison only resolves the way it does if the value was never only in the model. A $175M lab and a volunteer project converge on similar real-world performance precisely because the expensive, fundable part (the weights, or access to someone else's) is no longer where the decisive difference lives. A large share of it sits in the harness: the tool design, the context discipline, the verification loop, the memory. Those are buildable in the open by anyone willing to measure carefully, which is exactly why the open-source agents could close the gap without closing the funding gap. The harness premium is not just an evaluation curiosity. It is the mechanism by which the moat around frontier agent performance turns out to be shallower than the capital flowing into it assumed.

## What to watch

The thesis in this piece is falsifiable, which is the only kind worth holding. Here is what would confirm or break it over the next few quarters.

| Signal | If it happens | What it would mean |
|---|---|---|
| Vendors adopt Harness Cards or equivalent disclosure | Leaderboards start reporting scaffold details and variance decompositions | The measurement problem is being taken seriously; cross-model comparison becomes possible |
| The harness premium shrinks in replications | Later factorial studies report ratios well below 7.8x | The early numbers measured immature scaffolds; the premium was partly a one-time fix, as the compounding argument predicts |
| The premium holds or grows on harder tasks | EnterpriseClawBench-style realistic benchmarks keep showing large harness spreads | The harness-as-controller effect is structural, not transitional, and matters most exactly where the work is hardest |
| Auto-evolution closes the regression-blindness gap | AHE-style systems get reliable at predicting which edits hurt | Harness tuning becomes a solved engineering loop rather than a craft; the premium gets competed away faster |
| A neutral standard harness gets wide adoption | The field converges on a Terminus-style testbed for model-only comparison | The two camps reconcile: model comparisons on the neutral harness, system comparisons disclosed separately |
| Benchmark test suites get audited at scale | UTBoost-style corrections become routine pre-publication | The verifier term shrinks; the remaining score variance splits more cleanly between model and harness |

The keystone is the first replication that holds the harness genuinely constant across a wide model-capability gap. If the harness premium survives that test, the Binding Constraint Thesis graduates from a provocative position paper to a law of agent evaluation. If it shrinks toward the model term as harnesses mature, then 2026 was simply the year the field measured its own adolescence — the brief window when the scaffolds were bad enough that fixing them was worth more than better weights.

Either way, the number exists now. For two years the field repeated that the harness mattered without being able to say how much. In the spring and summer of 2026 it finally put a thermometer in the room, and the reading was higher than the slogans implied. The model is one input. Everything else is engineering — and as of this year, we can finally measure roughly how much of the result that engineering is responsible for. The answer, for now, is: more than the model. With an asterisk, and a standing invitation to prove it wrong.

## Sources

- [Zhang, Wang, Ge, Xu, Hamm, Reddy — Stop Comparing LLM Agents Without Disclosing the Harness (arXiv 2605.23950, May 7 2026)](https://arxiv.org/abs/2605.23950)
- [Lin, Liu, Pan et al. — Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses (arXiv 2604.25850, Apr 29 2026)](https://arxiv.org/abs/2604.25850)
- [Merrill, Shaw, Carlini et al. — Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command-Line Interfaces (arXiv 2601.11868, Jan 17 2026)](https://arxiv.org/abs/2601.11868)
- [Daniel Vaughan — When the Harness Outweighs the Model (Jun 16 2026)](https://codex.danielvaughan.com/2026/06/16/harness-outweighs-model-claw-swe-bench-harness-bench-utboost-codex-cli-configuration-strategy/)
- [Zheng, Han, Li et al. — Claw-SWE-Bench (arXiv 2606.12344, Jun 10 2026)](https://arxiv.org/abs/2606.12344)
- [Yao, Tan, Liu et al. — Harness-Bench (arXiv 2605.27922, May 27 2026)](https://arxiv.org/abs/2605.27922)
- [Yu, Zhu, He, Kang — UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench (arXiv 2506.09289, ACL 2025)](https://arxiv.org/abs/2506.09289)
- [Sakasegawa — Building HarnessBench, a Benchmark for Coding Agent Harnesses](https://nyosegawa.com/en/posts/harness-bench/)
- [OpenAI — Harness engineering: leveraging Codex in an agent-first world](https://openai.com/index/harness-engineering/)
- [SWE-bench leaderboards](https://www.swebench.com/)
- [@omarsar0 on X — Agentic Harness Engineering, 'pay attention to this one'](https://x.com/omarsar0/status/2049492169887748365)
- [@_philschmid on X — DeepSWE harness notes: Opus +10pp over Claude Code](https://x.com/_philschmid/status/2059564676569076021)
- [@0xTria on X — 'AI coding benchmarks do not measure the model'](https://x.com/0xTria/status/2061375868425970044)
- [@rohanpaul_ai on X — harness tuning 'from guesswork into an auditable experiment'](https://x.com/rohanpaul_ai/status/2049772749384798354)
- [@HuggingPapers on X — EnterpriseClawBench, best config only 0.663](https://x.com/HuggingPapers/status/2069347401890955503)
