The first serious work of The Scurry Lab was not on the project list — [Why the Lab]. Designing a coordinated team of specialized agents made the structural problem unavoidable: without a rigorous way to define scope, memory, authority, and handoffs, agents had no way to coordinate. Harness engineering became the prerequisite for everything else.

Building it out, I found the broader practitioner community converging on the same problems from different directions — prompt engineering, context architecture, scaffolding, guardrails — each term naming something real, none of them naming the whole. That convergence is still underway. This post is the lab’s contribution to it.

The argument is that outer-layer AI work — everything that shapes model behavior without touching model weights — constitutes a coherent engineering discipline with its own design vocabulary, its own failure modes, and its own design principles. That discipline is harness engineering. And naming the whole, rather than its parts, has practical consequences.

The sections that follow build toward a general framework — five design axes applicable to any agent in any domain — by first establishing the foundational distinction and surveying what the existing practitioner literature has already worked out.


The Inner/Outer Distinction

AI systems can be shaped from two directions.

The inner layer works on the model itself — fine-tuning, training, RLHF, activation steering, mechanistic interpretability. When you work at the inner layer, the weights change.

The outer layer shapes model behavior from the outside — prompts, context, tool access, memory architecture, agent orchestration, routing logic, trust boundaries, guardrails. When you work at the outer layer, the weights don’t change. The environment the model operates in does.

This distinction is not new in practice. Every team deploying AI systems works at the outer layer constantly. What has been missing is the recognition that outer-layer work is an engineering discipline in its own right, with its own design vocabulary, failure modes, and design principles.

That discipline is harness engineering — a term that has emerged organically across the practitioner community, not coined here. Birgitta Böckeler, writing in the Fowler canon in April 2026, defines it precisely: “Agent = Model + Harness.” In her framing, the harness is everything in an AI agent except the model itself — a deliberately wide definition that she then narrows for the specific context of coding agents. Ryan Lopopolo’s account of building a million-line production codebase at OpenAI with zero manually-written code arrives at the same place from the practitioner direction: “building software still demands discipline, but the discipline shows up more in the scaffolding rather than the code.”

Both are describing outer-layer work. Neither attempts to theorize it as a complete discipline. The lab’s agent work required that theorization — and what emerged from it appears to have a place in the broader field.


Why “Context Engineering” Isn’t Enough

Andrej Karpathy’s framing of “context engineering” — the discipline of managing everything in the context window at inference time — is gaining traction. It names something important that “prompt engineering” undersells.

But context engineering is one mechanism within the outer layer, not the outer layer itself. Böckeler makes this explicit: “Context engineering provides us with the means to make guides and sensors available to the agent.”

Context engineering is subordinate to harness engineering, not coextensive with it. The harness includes context design, but it also includes tool access decisions, memory architecture, autonomy level, trust position, and observation surface — none of which reduces to context window management. Teams that think in terms of “prompt engineering” or “context engineering” have vocabulary for one mechanism but no vocabulary for the system those mechanisms compose.


What the Coding-Agent Literature Establishes

The most developed practitioner literature on harness engineering comes from the coding-agent space. Böckeler’s framework is the most systematic treatment encountered in the lab’s initial survey of the literature: she distinguishes guides (feedforward controls that steer the agent before it acts) from sensors (feedback controls that let the agent self-correct after acting), and identifies three regulation categories — maintainability, architecture fitness, and behavior.

Lopopolo’s account demonstrates several harness design principles in practice: progressive disclosure of context (“give Codex a map, not a 1,000-page instruction manual”), mechanically enforced architectural constraints, autonomy earned incrementally as feedback loops are encoded, and the observation principle stated plainly — “from the agent’s point of view, anything it can’t access in-context while running effectively doesn’t exist.”

It also has an explicit scope: “I want to take the liberty here of defining its meaning in the bounded context of using a coding agent.” Both accounts are domain instances of a more general framework, not the general framework itself.

It is worth noting where agent frameworks fit in this picture. LangGraph, CrewAI, AutoGen, OpenClaw — these are the tools through which most practitioners first encountered harness engineering, without necessarily having that name for it. Vivek Trivedy, writing from inside LangChain’s deepagents library, arrives at “Agent = Model + Harness” from exactly this direction — a practitioner account of harness components derived from implementation experience rather than design theory.

They are implementation tooling for the agent scaffold layer: each one makes a subset of harness decisions concrete and opinionated. LangGraph makes state graph execution legible. CrewAI makes role-based delegation accessible. OpenClaw makes tool execution against local systems tractable. None of them makes the underlying design axes explicit, and none of them is the design discipline itself. The framework executes your harness decisions; it does not make them for you. Two teams using the same framework can make radically different authority scope, memory model, and trust position decisions. The harness is in the decisions, not the tooling. This is why the field has excellent implementation options and fragmented design vocabulary simultaneously — the frameworks arrived before the discipline was named.
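
To make that concrete, here is a minimal, framework-agnostic sketch: two teams running the same generic agent loop with very different harness decisions. The field names and values are hypothetical, chosen only to show where the decisions live.

  # Hypothetical sketch: the agent loop is the same; the harness decisions are not.
  from dataclasses import dataclass

  @dataclass
  class HarnessConfig:
      allowed_tools: list[str]   # authority scope
      memory_backend: str        # memory model
      autonomy: str              # autonomy level
      overridden_by: str         # trust position
      log_every_action: bool     # observation surface

  def run_agent(task: str, config: HarnessConfig) -> None:
      # The "framework" part: a generic plan-act-observe loop, stubbed out here.
      print(f"{task!r} with tools={config.allowed_tools}, autonomy={config.autonomy}")

  # Team A: a tightly bounded assistant.
  team_a = HarnessConfig(["read_docs", "draft_reply"], "per-session only",
                         "human approves key decisions", "support lead", True)

  # Team B: a wide-reach delegate built on the same loop.
  team_b = HarnessConfig(["read_docs", "write_records", "send_email"], "persistent store",
                         "acts without review", "on-call engineer", True)

  run_agent("triage inbound tickets", team_a)
  run_agent("triage inbound tickets", team_b)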

The vocabulary they’ve developed maps cleanly onto the framework this post introduces and makes explicit below.


The General Framework

Harness engineering, as theorized here, operates across five design axes. Every agent in any domain — coding, research, healthcare, customer operations, autonomous systems — can be designed against all five.

Axis 1 — Authority Scope. What can this agent read, write, execute, or affect? Where are the hard limits? This axis defines the agent’s reach into the system. Böckeler’s regulation categories — what the harness is supposed to regulate — are the domain-specific content of this axis. Her maintainability, architecture fitness, and behavior categories answer “what specifically can this coding agent affect.” In a different domain, the categories differ; the axis is the same.

The design principle: scope authority to the minimum needed for the agent’s role. Wider authority requires explicit justification.
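
As a sketch of the principle, imagine every tool call passing through an allowlist check that fails closed; the action names are invented for illustration.

  # Hypothetical authority scope: the agent's reach is an explicit allowlist.
  ALLOWED_ACTIONS = {"read_ticket", "draft_reply"}   # the minimum needed for the role

  class AuthorityError(Exception):
      pass

  def execute(action: str, payload: dict) -> dict:
      if action not in ALLOWED_ACTIONS:
          # Widening authority means editing the allowlist: a reviewable
          # harness decision, not a runtime default.
          raise AuthorityError(f"{action} is outside this agent's authority scope")
      return {"action": action, "status": "executed", "payload": payload}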

Axis 2 — Memory Model. What does this agent know, how does knowledge persist, and what is it permitted to remember? Memory in agent systems operates at three levels: episodic (what happened in past interactions), semantic (what the agent knows about the world and the task), and procedural (how to do things — skills, tool definitions, workflow templates). Lopopolo’s “repository as system of record” is a semantic memory design decision. His observation that a monolithic AGENTS.md fails in predictable ways — context crowding, guidance rot, unverifiable freshness — is a memory architecture failure mode.

The design principle: memory is a trust surface. What an agent remembers shapes what it does.
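
One way to make the three levels concrete is to keep them as separate stores, so that what the agent is permitted to remember is a visible design decision rather than an accident of the context window. A hypothetical sketch, with field names invented here:

  # Hypothetical memory model: three explicit stores with distinct persistence rules.
  from dataclasses import dataclass, field

  @dataclass
  class AgentMemory:
      episodic: list[dict] = field(default_factory=list)       # what happened in past interactions
      semantic: dict[str, str] = field(default_factory=dict)   # what the agent knows about world and task
      procedural: dict[str, str] = field(default_factory=dict) # skills, tool definitions, workflow templates

      def remember_interaction(self, record: dict) -> None:
          # Memory is a trust surface: only records that pass review persist.
          if record.get("reviewed"):
              self.episodic.append(record)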

Axis 3 — Autonomy Level. Where on the spectrum from fully autonomous to human-directed does this agent sit, and is that placement intentional? Lopopolo’s account of incrementally increasing Codex’s autonomy over five months — as each feedback loop was encoded and validated — is the design principle stated in practice: autonomy should be earned through demonstrated reliability, not assumed. The spectrum runs from agents that act without human review, through human-monitors-can-intervene (human on the loop), through human-approves-key-decisions (human in the loop), to human-initiates-every-action.

The design principle: default to more human involvement, not less.
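
A sketch of that spectrum as an explicit, inspectable setting with an approval gate in front of consequential actions; the level names follow the spectrum above, everything else is hypothetical.

  # Hypothetical autonomy gate: placement on the spectrum is a setting, not an accident.
  from enum import Enum

  class Autonomy(Enum):
      HUMAN_INITIATES = 1      # human initiates every action
      HUMAN_IN_THE_LOOP = 2    # human approves key decisions
      HUMAN_ON_THE_LOOP = 3    # human monitors and can intervene
      FULLY_AUTONOMOUS = 4     # acts without human review

  def act(action: str, level: Autonomy, approved: bool = False) -> str:
      # Default to more human involvement: low levels require explicit approval.
      if level.value <= Autonomy.HUMAN_IN_THE_LOOP.value and not approved:
          return f"queued for human approval: {action}"
      return f"executed: {action}"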

Axes 1 and 3 interact as a causal risk model, not just as independent dimensions. Risk in agent systems is a function of action frequency multiplied by consequence magnitude — autonomy drives the frequency, authority drives the magnitude, so RISK = f(autonomy × authority). An agent with high authority and low autonomy (acting rarely but with significant reach) carries concentrated risk per interaction. An agent with low authority and high autonomy (acting frequently within a tightly bounded surface) carries contained risk. An agent with high authority and high autonomy is the most capable configuration and the most consequential failure mode — it is the destination of a trust progression, not a default starting position. The lab’s Authority × Autonomy Risk Matrix maps these interactions into four quadrant profiles — Guided Assistant, Bounded Specialist, High-Stakes Delegate, and Collaborative Partner — each with defined design implications and a trust progression path. The critical design principle: agents earn their way to higher-capability quadrants through demonstrated reliability within bounded conditions. Jumping directly to the high-authority, high-autonomy quadrant without traversing the path is the primary harness design failure mode. The matrix is published as an interactive lab artifact alongside this post.
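
A sketch of that risk model as a simple lookup. Note that the assignment of the four quadrant names to specific autonomy/authority combinations below is an assumption made for illustration, not a restatement of the published matrix.

  # Hypothetical quadrant lookup for RISK = f(autonomy x authority).
  # The name-to-quadrant assignment below is an illustrative assumption,
  # not the lab's published Authority x Autonomy Risk Matrix.
  QUADRANTS = {
      ("low", "low"):   "Guided Assistant",
      ("high", "low"):  "Bounded Specialist",    # high autonomy, low authority: contained risk
      ("low", "high"):  "High-Stakes Delegate",  # low autonomy, high authority: concentrated risk
      ("high", "high"): "Collaborative Partner", # destination of a trust progression, not a default
  }

  def quadrant(autonomy: str, authority: str) -> str:
      return QUADRANTS[(autonomy, authority)]

  print(quadrant("low", "high"))   # -> "High-Stakes Delegate"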

Axis 4 — Trust Position. Where does this agent sit in the hierarchy? Who tasks it, who can override it, who monitors it? This axis has no direct equivalent in the coding-agent literature. Both Böckeler and Lopopolo describe single-team, single-product deployments where trust relationships are implicit. For multi-agent systems — crews of specialized agents with defined roles, orchestration layers, and cross-agent dependencies — an agent with no explicit trust position is a coordination failure waiting to happen. In a multi-agent system, trust position defines which agents can task which, who validates whose output, and who escalates to humans.

The design principle: every agent has a clear chain of authority. The axis maps the agent’s relationships: tasked by, reports to, overridden by, monitored by. No agent is unmonitored. No agent self-tasks without bounds.
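
A sketch of trust position as explicitly named, validated relationships; the field values are hypothetical.

  # Hypothetical trust position: every relationship is named, and an agent
  # with a missing relationship fails validation before deployment.
  from dataclasses import dataclass

  @dataclass
  class TrustPosition:
      tasked_by: str
      reports_to: str
      overridden_by: str
      monitored_by: str

      def validate(self) -> None:
          for name, value in vars(self).items():
              if not value:
                  raise ValueError(f"trust position incomplete: {name} is unset")

  TrustPosition(tasked_by="orchestrator", reports_to="orchestrator",
                overridden_by="human operator", monitored_by="observation layer").validate()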

Axis 5 — Observation Surface. What is logged, what triggers escalation, and what is appropriately opaque? Böckeler’s sensors are the mechanism that implements this axis. The observation surface determines whether an agent’s behavior is legible to the system — whether anomalies are detectable, whether human review is possible, whether the system can demonstrate it is operating within defined bounds.

The design principle: logging is not surveillance — it is the mechanism by which AI systems demonstrate that they operate within defined bounds.
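
As a sketch, the observation surface can be as simple as a wrapper that logs every action and escalates on defined triggers; the trigger set here is invented for illustration.

  # Hypothetical observation surface: every action leaves a record,
  # and anomalies escalate rather than disappear.
  import logging

  logging.basicConfig(level=logging.INFO)
  log = logging.getLogger("agent.observation")

  ESCALATION_TRIGGERS = {"delete_record", "external_send"}   # illustrative

  def observed(action: str, detail: str) -> None:
      log.info("action=%s detail=%s", action, detail)
      if action in ESCALATION_TRIGGERS:
          log.warning("escalating %s for human review", action)

  observed("draft_reply", "ticket 4521")
  observed("external_send", "ticket 4521")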


The Harness Layer Architecture

The five axes apply at every layer of a deployed AI system, but they apply differently depending on where in the stack the harness is operating. Four layers can be distinguished:

  • Global harness profile — context files, constitutional bounds, system-wide norms; applies to all agents at all times
  • Agent harness profile — five-axis design standard; applies to one agent across all its tasks
  • Task harness profile — five-dimension design standard; applies to one task surface for one class of interaction (skills)
  • Infrastructure harness profile — transparent to agents; operates at the inference boundary; addresses system-level design questions the other layers cannot

Each layer inherits from the ones above it. Each addresses design questions the others cannot.
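
A sketch of that inheritance as successive overrides: global defaults, narrowed by the agent profile, narrowed again by the task profile. The keys and values are hypothetical, and the sketch does not enforce that overrides only narrow.

  # Hypothetical layer inheritance: each layer inherits the one above it
  # and overrides selectively. (Enforcing narrow-only overrides is omitted.)
  GLOBAL_PROFILE = {"pii_allowed": False, "max_autonomy": "human_on_the_loop"}
  AGENT_PROFILE  = {"max_autonomy": "human_in_the_loop", "tools": ["search", "summarize"]}
  TASK_PROFILE   = {"tools": ["summarize"]}

  def effective_profile(*layers: dict) -> dict:
      profile: dict = {}
      for layer in layers:       # applied top-down: global, then agent, then task
          profile.update(layer)
      return profile

  print(effective_profile(GLOBAL_PROFILE, AGENT_PROFILE, TASK_PROFILE))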


The Task-Level Layer: Skill Design

The coding-agent literature goes as deep as the agent level. There is a layer beneath it that the literature hasn’t yet addressed systematically: the individual task surface.

Skills — the task-specific configurations that shape model behavior for a defined class of interaction — are not just prompt templates. They are local harness artifacts. A skill designer is making harness decisions whether or not they have vocabulary for it.

Böckeler ends her framework piece with a set of open questions: “How do we keep a harness coherent as it grows, with guides and sensors in sync, not contradicting each other? How far can we trust agents to make sensible trade-offs when instructions and feedback signals point in different directions?”

These are task harness profile design problems. The answers require thinking about five dimensions at the task level:

  • Task surface — what capability does this skill open, and what behavior does it require at minimum?
  • Constraint geometry — which parts of this task need tight specification, and which should be left loose for model judgment?
  • Scope boundary — what does this skill assume is handled by the global harness, and what would break if that assumption failed?
  • Composability — how does this skill behave when it shares context with other skills or agent-level instructions?
  • Observability profile — what outputs does this skill produce that can be inspected, and what does anomalous behavior look like?

The composability dimension is particularly underexplored. A skill that produces correct output in isolation may produce degraded output in combination with other context — not because of prompt quality, but because of harness design failure at the task level. This is a different problem than anything the current literature addresses.
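
One simple way to make composability inspectable is to check skills pairwise for constraints that contradict each other when they share context. The skills and constraint keys below are invented for illustration.

  # Hypothetical composability check: skills that each work in isolation
  # can still set contradictory constraints when they share context.
  from itertools import combinations

  skills = {
      "summarizer":   {"tone": "neutral", "max_length": 200},
      "brand_voice":  {"tone": "enthusiastic"},
      "citation_add": {"max_length": 400},
  }

  def conflicts(skill_set: dict) -> list[tuple]:
      found = []
      for (name_a, a), (name_b, b) in combinations(skill_set.items(), 2):
          for key in a.keys() & b.keys():
              if a[key] != b[key]:
                  found.append((name_a, name_b, key, a[key], b[key]))
      return found

  print(conflicts(skills))   # flags the tone and max_length disagreements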

Not all of Böckeler’s open questions live at the task level. Some do — coherence between guides and sensors, constraint trade-offs when signals conflict — and the task harness profile design vocabulary addresses them directly. But her final question points somewhere else: “there’s real potential for tooling that helps configure, sync, and reason about controls as a system.” That is not a task-level question. It is an infrastructure question.


The Infrastructure Layer

The framework implies a fourth harness layer that the existing literature doesn’t name: a transparent infrastructure-level harness operating below the agent scaffold, invisible to calling agents. An agent invoking a model endpoint has no knowledge that a harness is active at that layer. From the agent’s perspective, it is calling a model. From the system’s perspective, the inference environment is being shaped before the response is returned.

This layer has a distinguishing property that separates it from the other three: it requires no agent modification, no skill update, no context file change. It operates at the infrastructure boundary between the calling agent and the underlying model. The harness is the infrastructure.

The infrastructure harness profile is the least defined of the four — deliberately so. Where the agent harness profile ends and the infrastructure harness profile begins is partly an empirical question. The boundary is noted, not resolved.

The Scurry Lab is actively investigating this layer. The current work involves a transparent proxy middleware architecture — a harness that intercepts standard model API calls, operates on them, and returns standard responses, with no agent modification required. The architectural questions it raises — what can be shaped at the inference boundary, what the harness can and cannot do without agent knowledge, how infrastructure-level design interacts with agent-level and task-level design — are open empirical questions, not settled ones. Findings will be published consistent with the lab’s research methodology — see [Why the Lab].
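
The architectural shape is easier to see in code than in prose. The following is a deliberately minimal sketch, not the lab's middleware, with a stubbed function standing in for a real model API: the caller sees a standard request/response interface while the harness shapes the call at the inference boundary.

  # Hypothetical transparent proxy: the caller believes it is calling a model;
  # the harness intercepts the request, shapes it, calls the model, and
  # returns a standard-looking response.

  def call_model(request: dict) -> dict:
      # Stand-in for a real model API call.
      return {"output": f"model response to: {request['input']}"}

  def harnessed_call(request: dict) -> dict:
      shaped = dict(request)
      shaped["input"] = "[policy preamble] " + shaped["input"]   # shaping at the inference boundary
      # Observation at the boundary: logged for the system, with no agent modification.
      print("boundary log:", shaped["input"][:48])
      return call_model(shaped)

  print(harnessed_call({"input": "summarize the incident report"}))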


Where This Leaves the Field

The coding-agent literature has established the term, demonstrated the practice, and developed useful domain-specific vocabulary. Böckeler’s guides/sensors distinction and Lopopolo’s account of earning autonomy incrementally are genuine contributions that belong in any harness engineering canon. Trivedy’s practitioner account from inside the LangChain ecosystem adds the implementation tooling perspective: harness components discovered in practice, named from the inside out.

What remains undone is the generalization. The five axes apply to coding agents, but they also apply to research agents, medical documentation systems, customer-facing AI, autonomous workflow systems, and any other deployment of AI at the outer layer. The inner/outer organizing principle is what allows that generalization — it identifies harness engineering as a discipline defined by its layer, not its domain.

The framework also surfaces design questions that weren’t previously visible as a category. The task harness profile makes composability a first-class design problem. The infrastructure harness profile makes inference-boundary shaping a design surface. These aren’t refinements to existing vocabulary — they are questions the vocabulary of parts couldn’t ask.

An open question the framework makes tractable but does not yet answer: what is happening at the model level when harness conditions change? When a well-designed harness measurably improves agent behavior, the question of whether that change is reflected in internal model representations — or whether it is purely behavioral compliance — is the next research frontier. That question sits at the boundary between the outer layer work this post describes and the inner layer work of mechanistic interpretability. The lab’s research program is sequenced toward it.

The Scurry Lab’s Harness Design Profile, Skill Design Profile, and Authority × Autonomy Risk Matrix are the lab’s working implementation of this framework — design standards for agents, task surfaces, and risk profiles respectively, used in the design and evaluation of every agent the lab builds. They are published as lab artifacts, not as finished theory, and will evolve as the lab’s agent work matures. The framework post you’re reading is the theoretical foundation those artifacts were always building toward.

The field has named the parts. The whole has a name now too.


References

Böckeler, B. (2026, April 2). Harness engineering for coding agent users. martinfowler.com.

Lopopolo, R. (2026). Harness engineering: Leveraging Codex in an agent-first world. openai.com.

Karpathy, A. (2025). “Context engineering” — term and framing circulating via X/Twitter and associated writing.

Trivedy, V. (2026). Harness engineering: Naming the outer layer. LangChain / deepagents.

Lab artifacts referenced:

Bilka, M. (2026). Harness Design Profile. The Scurry Lab internal design standard.

Bilka, M. (2026). Skill Design Profile. The Scurry Lab internal design standard.

Bilka, M. (2026). Authority × Autonomy Risk Matrix. The Scurry Lab interactive artifact.


Michael Bilka, PhD is the founder of The Scurry Lab, a human-AI teaming lab building in public. The lab’s thesis: that intentional, bounded, demonstrably positive human-AI collaboration is an engineering problem, not a philosophical one.

This article was agent-drafted and human-edited. The Scurry Lab publishes its methodology transparently — see [About the Lab] for how articles are produced.