harness

The open_atp.harness package is the agent concern composed by the AgentProver: for one agent CLI it owns the launch script, credential forwarding, asset staging, and token/cost parsing. The compute concern (where the command runs, with Lean+Mathlib) lives in the injected ComputeBackend.

See the per-harness prover pages under Provers for credential setup.

Base

class open_atp.harness.base.Harness(*, model: str = 'claude-opus-4-8', effort: str = 'high')[source]

Base class for an agent CLI harness.

Parameters:
modelstr

Model id the agent runs. Default "claude-opus-4-8" (Codex defaults to "gpt-5.5", Vibe to "magistral-medium-latest").

effortstr

Reasoning-effort level passed to harnesses that support it. Default "high".

Attributes:
commandstr

Bash command the backend runs to launch the agent.

agent_auth() AgentAuth[source]

Resolve this harness’s credentials into a ready-to-forward auth bundle.

Merges, in order: non-secret constants (_static_env()) and resolved required credentials (_required_env(), which raises if a needed key is absent). Mount dirs come from _home_dirs(). Subclasses needing best-effort host passthrough override this (see NuminaHarness).

Returns:
AgentAuth

Resolved env (name -> value) and (host_dir, dest_basename) mounts the prover forwards into the sandbox.

Examples

An explicit oauth_token is resolved into the forwarded env:

>>> from open_atp.harness import ClaudeCodeHarness
>>> harness = ClaudeCodeHarness(oauth_token="sk-ant-oat-fake")
>>> harness.agent_auth().env["CLAUDE_CODE_OAUTH_TOKEN"]
'sk-ant-oat-fake'
collect_logs(wd: Path, logs_dir: Path) None[source]

Move this harness’s rich log files out of wd into logs_dir.

The streamed event JSONL the prover captures from stdout is the agent’s transcript for every CLI harness, so the default does nothing. Harnesses that also drop a richer record inside the workdir (Vibe’s session log, ax-prover’s per-target logs) override this to relocate those files – so download_wd stays the proof project and download_logs carries the full record. Called after parse_result() (which may read those files for cost), so moving them is safe.

Parameters:
wdpathlib.Path

The agent working directory rich log files are moved out of.

logs_dirpathlib.Path

The run’s log directory the files are relocated into.

parse_result(lines: list[str]) HarnessRunResult[source]

Parse the agent’s streamed JSON lines into a HarnessRunResult.

Parameters:
lineslist[str]

The agent’s streamed stdout, one JSON object per line.

Returns:
HarnessRunResult

Token totals, cost, and stop metadata parsed from the stream.

stage_skills(wd: Path, skill_dirs: list[Path]) None[source]

Copy resolved skill source dirs into this harness’s skill location.

Each <name>/SKILL.md tree lands at wd/<skills_dest>/<dir-name>/ (an upstream tests/ fixture dir is dropped). A no-op for a harness that does not consume skills (skills_dest is None, e.g. ax-prover). The prover owns the list (AgentProver.skills, resolved by resolve_skill); the harness owns where it goes.

Parameters:
wdpathlib.Path

The agent working directory the skills are copied into (under skills_dest).

skill_dirslist[pathlib.Path]

Resolved skill source dirs (each a <name>/SKILL.md tree) to copy.

stage_wd(wd: Path) None[source]

Populate wd with the harness’s launch script.

Everything the harness itself owns – not the skills list (the prover stages it via stage_skills()) and not the prompt (the prover and task own it, written via write_prompt()). Subclasses that need more (Vibe’s VIBE_HOME, ax-prover’s per-target setup) override and call super().stage_wd.

Parameters:
wdpathlib.Path

The agent working directory to populate; must already exist.

Raises:
RuntimeError

If wd does not exist.

write_prompt(wd: Path, prompt: str) None[source]

Write the composed prompt where this harness’s launch script reads it.

The prompt’s content is owned by the prover (its prover prompt) and the task (the optional user prompt); the harness owns only the file location and the cat $PROMPT launch contract, so it provides the write mechanism.

Parameters:
wdpathlib.Path

The agent working directory the launch script reads the prompt from.

promptstr

The composed prompt text to write to PROMPT_FILE.

class open_atp.harness.base.HarnessRunResult(input_tokens: int = 0, output_tokens: int = 0, stop_reason: str | None = None, cost_usd: float | None = None, subtype: str | None = None, result_text: str | None = None)[source]

Token totals and cost parsed from an agent’s streamed output.

Attributes:
input_tokensint

Total input (prompt) tokens the run consumed.

output_tokensint

Total output (completion) tokens the run produced.

stop_reasonstr, optional

Why the agent stopped, when the stream reports it; None otherwise.

cost_usdfloat, optional

USD cost if the harness self-reports it (Claude Code, OpenCode); None when it must be estimated from token counts (Codex, ax-prover).

subtypestr, optional

Final type:"result" subtype (Claude Code: success / error_max_turns / error_during_execution). Used by NuminaProver’s round loop to decide continue-vs-stop when no END_REASON marker is present.

result_textstr, optional

The agent’s final result text (Claude Code’s result field), where the Numina coordinator prints its END_REASON:<reason> marker.

class open_atp.harness.base.AgentAuth(env: dict[str, str]=<factory>, mounts: list[tuple[~pathlib.Path, str]]=<factory>)[source]

Resolved credentials a harness hands the prover to wire into the sandbox.

Unlike a declarative spec, env here holds resolved name->**value** pairs – the harness has already read the host environment (and any explicit overrides) and validated that required credentials are present. The prover only forwards them; it never touches os.environ.

Attributes:
env:

Environment variables (name -> value) to forward into the sandbox.

mounts:

Host directories to expose under the sandbox’s $HOME, as (host_dir, dest_basename) pairs (e.g. (~/.codex, ".codex")).

Harnesses

Each harness adapts one agent CLI and is a Harness subclass (set as AgentProver’s harness).

class open_atp.harness.claude_code.ClaudeCodeHarness(*, model: str = 'claude-opus-4-8', effort: str = 'high', plugins: list[str] | None = None, oauth_token: str | None = None)[source]

Bases: Harness

Claude Code CLI, authenticated by a long-lived CLAUDE_CODE_OAUTH_TOKEN.

Claude Code is the only harness that loads plugins, so they live here rather than on the prover’s shared skills list.

Parameters:
modelstr

Model id the agent runs. Default "claude-opus-4-8".

effortstr

Reasoning-effort level. Default "high".

pluginslist[str], optional

Claude Code plugins to load, each a name (resolved from the vendored lean4-skills catalog) or a full path to a .claude-plugin/plugin.json tree. Default ["lean4"]; an empty list loads none.

oauth_tokenstr, optional

The CLAUDE_CODE_OAUTH_TOKEN (from claude setup-token) to forward into the sandbox. None (default) reads it from the host CLAUDE_CODE_OAUTH_TOKEN env var; resolution fails if neither is set.

Examples

>>> from open_atp.harness import ClaudeCodeHarness
>>> harness = ClaudeCodeHarness()
>>> harness.name
'claude_code'
>>> harness.plugins
['lean4']

With the credential supplied explicitly, agent_auth() resolves the full forwarded env without touching the host environment:

>>> harness = ClaudeCodeHarness(plugins=[], oauth_token="sk-ant-oat-fake")
>>> harness.agent_auth().env
{'IS_SANDBOX': '1', 'CLAUDE_CODE_OAUTH_TOKEN': 'sk-ant-oat-fake'}
stage_wd(wd: Path) None[source]

Populate wd with the harness’s launch script.

Everything the harness itself owns – not the skills list (the prover stages it via stage_skills()) and not the prompt (the prover and task own it, written via write_prompt()). Subclasses that need more (Vibe’s VIBE_HOME, ax-prover’s per-target setup) override and call super().stage_wd.

Parameters:
wdpathlib.Path

The agent working directory to populate; must already exist.

Raises:
RuntimeError

If wd does not exist.

class open_atp.harness.codex.CodexHarness(*, model: str = 'gpt-5.5', effort: str = 'high', auth_file: Path | None = None)[source]

Bases: Harness

Codex CLI, authenticated by a mounted auth.json credential.

Codex authenticates via ChatGPT/OpenAI, so it must run an OpenAI model; model defaults to gpt-5.5 rather than the Anthropic base default.

Parameters:
modelstr

Model id the agent runs; must be an OpenAI model. Default "gpt-5.5".

effortstr

Reasoning-effort level. Default "high".

auth_filePath, optional

The Codex auth.json to mount. None (default) uses ~/.codex/auth.json (from codex login); resolution fails if the file is absent.

Examples

>>> from open_atp.harness import CodexHarness
>>> harness = CodexHarness()
>>> harness.name
'codex'
>>> harness.model
'gpt-5.5'
class open_atp.harness.opencode.OpenCodeHarness(*, model: str = 'claude-opus-4-8', effort: str = 'high', provider: str | None = None, provider_api_key: str | None = None)[source]

Bases: Harness

OpenCode CLI, authenticated by a provider API key forwarded from the host.

Parameters:
modelstr

Model id the agent runs. Default "claude-opus-4-8".

effortstr

Reasoning-effort level. Default "high".

providerstr, optional

API provider name. None infers it from the model prefix (claude-* -> anthropic, gpt-* -> openai, …).

provider_api_keystr, optional

The selected provider’s API key, forwarded under its canonical env var (ANTHROPIC_API_KEY / OPENAI_API_KEY / …). None (default) reads that env var from the host; resolution fails if neither is set. The key is assumed to match provider (OpenAI and DeepSeek keys are indistinguishable, so no format check is done).

Examples

The provider is inferred from the model prefix when not given explicitly:

>>> from open_atp.harness import OpenCodeHarness
>>> harness = OpenCodeHarness(model="gpt-5.5")
>>> harness.name
'opencode'
>>> harness.provider
'openai'

With the provider key supplied explicitly, agent_auth() forwards it under the provider’s canonical env var without reading the host environment:

>>> harness = OpenCodeHarness(model="claude-opus-4-8", provider_api_key="sk-fake")
>>> harness.agent_auth().env
{'ANTHROPIC_API_KEY': 'sk-fake'}
stage_wd(wd: Path) None[source]

Populate wd with the harness’s launch script.

Everything the harness itself owns – not the skills list (the prover stages it via stage_skills()) and not the prompt (the prover and task own it, written via write_prompt()). Subclasses that need more (Vibe’s VIBE_HOME, ax-prover’s per-target setup) override and call super().stage_wd.

Parameters:
wdpathlib.Path

The agent working directory to populate; must already exist.

Raises:
RuntimeError

If wd does not exist.

class open_atp.harness.vibe.VibeHarness(*, model: str = 'magistral-medium-latest', effort: str = 'high', agent: str = 'lean-standin', max_turns: int | None = None, max_price: float | None = None, mistral_api_key: str | None = None)[source]

Bases: Harness

Mistral Vibe CLI driving its builtin lean agent (Leanstral) in a sandbox.

Vibe’s lean agent is Leanstral: vibe -p ... --agent lean pins the model to leanstral via the builtin agent profile (there is no --model flag). The bare lean profile is Labs-gated, so the lean-standin stand-in (vendored under assets/vibe/) runs the same Lean scaffold on a non-Labs model until Labs access is enabled. The selected profile is named by agent. Since vibe has no --model flag, the stand-in profile templates <<MODEL>> and the harness substitutes model into it at stage_wd() time – so the model is a knob (default Magistral) just like the other harnesses’ --model.

Two things differ from the other harnesses:

  • VIBE_HOME is workdir-local. vibe_agent.sh exports VIBE_HOME=$PWD/.vibe so vibe’s config (which un-gates the builtin lean agent), the vendored stand-in agent, and the per-session log all live under the workdir and sync back out with it.

  • Cost comes from the session log, not stdout. --output streaming carries only conversation messages – no token/cost totals. Those live in vibe’s per-session meta.json; parse_result() reads it from the synced-back log dir.

Parameters:
modelstr

Model templated into the stand-in profile (vibe has no --model flag). Ignored by the builtin lean agent, which pins its own (Leanstral). Default "magistral-medium-latest".

effortstr

Reasoning-effort level. Default "high".

agentstr, optional

Which vibe agent profile to drive: "lean" (real Leanstral) or a model-templated stand-in. Default "lean-standin".

max_turnsint, optional

vibe -p turn guard; None (default) leaves it unset.

max_pricefloat, optional

vibe -p price guard; None (default) leaves it unset.

mistral_api_keystr, optional

Mistral La Plateforme key forwarded as MISTRAL_API_KEY. None (default) reads it from the host env var; resolution fails if neither is set.

Examples

>>> from open_atp.harness import VibeHarness
>>> harness = VibeHarness()
>>> harness.name
'vibe'
>>> harness.agent
'lean-standin'

With the key supplied explicitly, agent_auth() forwards it as MISTRAL_API_KEY without reading the host environment:

>>> harness = VibeHarness(mistral_api_key="msk-fake")
>>> harness.agent_auth().env
{'MISTRAL_API_KEY': 'msk-fake'}
collect_logs(wd: Path, logs_dir: Path) None[source]

Move this harness’s rich log files out of wd into logs_dir.

The streamed event JSONL the prover captures from stdout is the agent’s transcript for every CLI harness, so the default does nothing. Harnesses that also drop a richer record inside the workdir (Vibe’s session log, ax-prover’s per-target logs) override this to relocate those files – so download_wd stays the proof project and download_logs carries the full record. Called after parse_result() (which may read those files for cost), so moving them is safe.

Parameters:
wdpathlib.Path

The agent working directory rich log files are moved out of.

logs_dirpathlib.Path

The run’s log directory the files are relocated into.

parse_result(lines: list[str]) HarnessRunResult[source]

Parse the agent’s streamed JSON lines into a HarnessRunResult.

Parameters:
lineslist[str]

The agent’s streamed stdout, one JSON object per line.

Returns:
HarnessRunResult

Token totals, cost, and stop metadata parsed from the stream.

stage_wd(wd: Path) None[source]

Populate wd with the harness’s launch script.

Everything the harness itself owns – not the skills list (the prover stages it via stage_skills()) and not the prompt (the prover and task own it, written via write_prompt()). Subclasses that need more (Vibe’s VIBE_HOME, ax-prover’s per-target setup) override and call super().stage_wd.

Parameters:
wdpathlib.Path

The agent working directory to populate; must already exist.

Raises:
RuntimeError

If wd does not exist.

class open_atp.harness.axprover.AxProverHarness(*, model: str = 'claude-opus-4-8', effort: str = 'high', max_iterations: int | None = None, provider_api_key: str | None = None)[source]

Bases: Harness

ax-prover-base (LangGraph Lean agent), driven by ax-prover prove in-sandbox.

ax-prover is a self-contained proving agent (its own proposer->builder->reviewer->memory loop) that edits the target .lean file in place. It slots in as a harness rather than a standalone prover because AgentProver.prove() already supplies everything around the edges (staging, snapshot/diff, sandbox run, key forwarding) and the shared Verifier – not ax-prover’s own reviewer – remains the source of truth for compile/sorry/axiom.

Two things differ from the CLI harnesses (it mirrors VibeHarness here):

  • Config lives in a workdir YAML, not flags. stage_wd() writes axprover.yaml selecting the model/effort/iterations; it layers on top of ax-prover’s bundled default.yaml (auto-prepended by the CLI), so it only needs to override the deltas.

  • Cost is not on stdout. ax-prover streams human-readable logs; token usage comes from its -o JSON (the per-target ax_output.<target>.json files the launch script writes), which carries input_tokens/output_tokens alongside {success, error, summary} as of the pinned fork commit (see AX_PROVER_REF in __main__.py). parse_result() sums those across every target and leaves cost_usd None so the prover converts tokens->USD via the fallback table, exactly like CodexHarness. (On an ax-prover build without those fields the tokens are simply absent and the run reports zero cost.)

Parameters:
modelstr

Model id the agent runs, mapped to ax-prover’s provider:model string. Default "claude-opus-4-8".

effortstr

Reasoning-effort level, mapped to each provider’s knob. Default "high".

max_iterationsint, optional

Cap on ax-prover’s proposer->builder->reviewer loop. None (default) keeps ax-prover’s own default (50).

provider_api_keystr, optional

The selected provider’s API key, forwarded under its canonical env var (ANTHROPIC_API_KEY / OPENAI_API_KEY / …). None (default) reads that env var from the host; resolution fails if neither is set.

Examples

>>> from open_atp.harness import AxProverHarness
>>> harness = AxProverHarness()
>>> harness.name
'axprover'
>>> harness.model
'claude-opus-4-8'

With the provider key supplied explicitly, agent_auth() forwards it under the provider’s canonical env var without reading the host environment:

>>> harness = AxProverHarness(provider_api_key="sk-fake")
>>> harness.agent_auth().env
{'ANTHROPIC_API_KEY': 'sk-fake'}
collect_logs(wd: Path, logs_dir: Path) None[source]

Move this harness’s rich log files out of wd into logs_dir.

The streamed event JSONL the prover captures from stdout is the agent’s transcript for every CLI harness, so the default does nothing. Harnesses that also drop a richer record inside the workdir (Vibe’s session log, ax-prover’s per-target logs) override this to relocate those files – so download_wd stays the proof project and download_logs carries the full record. Called after parse_result() (which may read those files for cost), so moving them is safe.

Parameters:
wdpathlib.Path

The agent working directory rich log files are moved out of.

logs_dirpathlib.Path

The run’s log directory the files are relocated into.

parse_result(lines: list[str]) HarnessRunResult[source]

Parse the agent’s streamed JSON lines into a HarnessRunResult.

Parameters:
lineslist[str]

The agent’s streamed stdout, one JSON object per line.

Returns:
HarnessRunResult

Token totals, cost, and stop metadata parsed from the stream.

stage_wd(wd: Path) None[source]

Populate wd with the harness’s launch script.

Everything the harness itself owns – not the skills list (the prover stages it via stage_skills()) and not the prompt (the prover and task own it, written via write_prompt()). Subclasses that need more (Vibe’s VIBE_HOME, ax-prover’s per-target setup) override and call super().stage_wd.

Parameters:
wdpathlib.Path

The agent working directory to populate; must already exist.

Raises:
RuntimeError

If wd does not exist.

Pricing

open_atp.harness.cost.compute_cost_usd(model: str, input_tokens: int, output_tokens: int) float | None[source]

Estimate the USD cost of a run from token counts.

Returns None when model is absent from COST_PER_MTOK.

open_atp.harness.cost.COST_PER_MTOK: dict[str, tuple[float, float]] = {'claude-fable-5': (10.0, 50.0), 'claude-haiku-4-5': (1.0, 5.0), 'claude-opus-4-6': (5.0, 25.0), 'claude-opus-4-7': (5.0, 25.0), 'claude-opus-4-8': (5.0, 25.0), 'claude-sonnet-4-5': (3.0, 15.0), 'claude-sonnet-4-6': (3.0, 15.0), 'deepseek-v4-flash': (0.14, 0.28), 'deepseek-v4-pro': (1.74, 3.48), 'gemini-3.1-pro-preview': (2.0, 12.0), 'gpt-4.1': (2.0, 8.0), 'gpt-4o': (2.5, 10.0), 'gpt-4o-mini': (0.15, 0.6), 'gpt-5.4': (2.5, 15.0), 'gpt-5.4-mini': (0.75, 4.5), 'gpt-5.4-nano': (0.2, 1.25), 'gpt-5.4-pro': (15.0, 120.0), 'gpt-5.5': (5.0, 30.0)}

Cost per million tokens, as (input, output).