harness¶
The open_atp.harness package is the agent concern composed by the
AgentProver: for one agent CLI it owns the
launch script, credential forwarding, asset staging, and token/cost parsing. The
compute concern (where the command runs, with Lean+Mathlib) lives in the injected
ComputeBackend.
See the per-harness prover pages under Provers for credential setup.
Base¶
- class open_atp.harness.base.Harness(*, model: str = 'claude-opus-4-8', effort: str = 'high')[source]¶
Base class for an agent CLI harness.
- Parameters:
- Attributes:
commandstrBash command the backend runs to launch the agent.
- agent_auth() AgentAuth[source]¶
Resolve this harness’s credentials into a ready-to-forward auth bundle.
Merges, in order: non-secret constants (
_static_env()) and resolved required credentials (_required_env(), which raises if a needed key is absent). Mount dirs come from_home_dirs(). Subclasses needing best-effort host passthrough override this (seeNuminaHarness).- Returns:
AgentAuthResolved env (name -> value) and
(host_dir, dest_basename)mounts the prover forwards into the sandbox.
Examples
An explicit
oauth_tokenis resolved into the forwarded env:>>> from open_atp.harness import ClaudeCodeHarness >>> harness = ClaudeCodeHarness(oauth_token="sk-ant-oat-fake") >>> harness.agent_auth().env["CLAUDE_CODE_OAUTH_TOKEN"] 'sk-ant-oat-fake'
- collect_logs(wd: Path, logs_dir: Path) None[source]¶
Move this harness’s rich log files out of
wdintologs_dir.The streamed event JSONL the prover captures from stdout is the agent’s transcript for every CLI harness, so the default does nothing. Harnesses that also drop a richer record inside the workdir (Vibe’s session log, ax-prover’s per-target logs) override this to relocate those files – so
download_wdstays the proof project anddownload_logscarries the full record. Called afterparse_result()(which may read those files for cost), so moving them is safe.- Parameters:
- wd
pathlib.Path The agent working directory rich log files are moved out of.
- logs_dir
pathlib.Path The run’s log directory the files are relocated into.
- wd
- parse_result(lines: list[str]) HarnessRunResult[source]¶
Parse the agent’s streamed JSON lines into a
HarnessRunResult.- Parameters:
- Returns:
HarnessRunResultToken totals, cost, and stop metadata parsed from the stream.
- stage_skills(wd: Path, skill_dirs: list[Path]) None[source]¶
Copy resolved skill source dirs into this harness’s skill location.
Each
<name>/SKILL.mdtree lands atwd/<skills_dest>/<dir-name>/(an upstreamtests/fixture dir is dropped). A no-op for a harness that does not consume skills (skills_dest is None, e.g. ax-prover). The prover owns the list (AgentProver.skills, resolved byresolve_skill); the harness owns where it goes.- Parameters:
- wd
pathlib.Path The agent working directory the skills are copied into (under
skills_dest).- skill_dirs
list[pathlib.Path] Resolved skill source dirs (each a
<name>/SKILL.mdtree) to copy.
- wd
- stage_wd(wd: Path) None[source]¶
Populate
wdwith the harness’s launch script.Everything the harness itself owns – not the skills list (the prover stages it via
stage_skills()) and not the prompt (the prover and task own it, written viawrite_prompt()). Subclasses that need more (Vibe’s VIBE_HOME, ax-prover’s per-target setup) override and callsuper().stage_wd.- Parameters:
- wd
pathlib.Path The agent working directory to populate; must already exist.
- wd
- Raises:
RuntimeErrorIf
wddoes not exist.
- write_prompt(wd: Path, prompt: str) None[source]¶
Write the composed prompt where this harness’s launch script reads it.
The prompt’s content is owned by the prover (its prover prompt) and the task (the optional user prompt); the harness owns only the file location and the
cat $PROMPTlaunch contract, so it provides the write mechanism.- Parameters:
- wd
pathlib.Path The agent working directory the launch script reads the prompt from.
- prompt
str The composed prompt text to write to
PROMPT_FILE.
- wd
- class open_atp.harness.base.HarnessRunResult(input_tokens: int = 0, output_tokens: int = 0, stop_reason: str | None = None, cost_usd: float | None = None, subtype: str | None = None, result_text: str | None = None)[source]¶
Token totals and cost parsed from an agent’s streamed output.
- Attributes:
- input_tokens
int Total input (prompt) tokens the run consumed.
- output_tokens
int Total output (completion) tokens the run produced.
- stop_reason
str, optional Why the agent stopped, when the stream reports it;
Noneotherwise.- cost_usd
float, optional USD cost if the harness self-reports it (Claude Code, OpenCode);
Nonewhen it must be estimated from token counts (Codex, ax-prover).- subtype
str, optional Final
type:"result"subtype (Claude Code:success/error_max_turns/error_during_execution). Used by NuminaProver’s round loop to decide continue-vs-stop when no END_REASON marker is present.- result_text
str, optional The agent’s final result text (Claude Code’s
resultfield), where the Numina coordinator prints itsEND_REASON:<reason>marker.
- input_tokens
- class open_atp.harness.base.AgentAuth(env: dict[str, str]=<factory>, mounts: list[tuple[~pathlib.Path, str]]=<factory>)[source]¶
Resolved credentials a harness hands the prover to wire into the sandbox.
Unlike a declarative spec,
envhere holds resolved name->**value** pairs – the harness has already read the host environment (and any explicit overrides) and validated that required credentials are present. The prover only forwards them; it never touchesos.environ.- Attributes:
- env:
Environment variables (name -> value) to forward into the sandbox.
- mounts:
Host directories to expose under the sandbox’s
$HOME, as(host_dir, dest_basename)pairs (e.g.(~/.codex, ".codex")).
Harnesses¶
Each harness adapts one agent CLI and is a
Harness subclass (set as
AgentProver’s harness).
- class open_atp.harness.claude_code.ClaudeCodeHarness(*, model: str = 'claude-opus-4-8', effort: str = 'high', plugins: list[str] | None = None, oauth_token: str | None = None)[source]¶
Bases:
HarnessClaude Code CLI, authenticated by a long-lived
CLAUDE_CODE_OAUTH_TOKEN.Claude Code is the only harness that loads plugins, so they live here rather than on the prover’s shared skills list.
- Parameters:
- model
str Model id the agent runs. Default
"claude-opus-4-8".- effort
str Reasoning-effort level. Default
"high".- plugins
list[str], optional Claude Code plugins to load, each a name (resolved from the vendored
lean4-skillscatalog) or a full path to a.claude-plugin/plugin.jsontree. Default["lean4"]; an empty list loads none.- oauth_token
str, optional The
CLAUDE_CODE_OAUTH_TOKEN(fromclaude setup-token) to forward into the sandbox.None(default) reads it from the hostCLAUDE_CODE_OAUTH_TOKENenv var; resolution fails if neither is set.
- model
Examples
>>> from open_atp.harness import ClaudeCodeHarness >>> harness = ClaudeCodeHarness() >>> harness.name 'claude_code' >>> harness.plugins ['lean4']
With the credential supplied explicitly,
agent_auth()resolves the full forwarded env without touching the host environment:>>> harness = ClaudeCodeHarness(plugins=[], oauth_token="sk-ant-oat-fake") >>> harness.agent_auth().env {'IS_SANDBOX': '1', 'CLAUDE_CODE_OAUTH_TOKEN': 'sk-ant-oat-fake'}
- stage_wd(wd: Path) None[source]¶
Populate
wdwith the harness’s launch script.Everything the harness itself owns – not the skills list (the prover stages it via
stage_skills()) and not the prompt (the prover and task own it, written viawrite_prompt()). Subclasses that need more (Vibe’s VIBE_HOME, ax-prover’s per-target setup) override and callsuper().stage_wd.- Parameters:
- wd
pathlib.Path The agent working directory to populate; must already exist.
- wd
- Raises:
RuntimeErrorIf
wddoes not exist.
- class open_atp.harness.codex.CodexHarness(*, model: str = 'gpt-5.5', effort: str = 'high', auth_file: Path | None = None)[source]¶
Bases:
HarnessCodex CLI, authenticated by a mounted
auth.jsoncredential.Codex authenticates via ChatGPT/OpenAI, so it must run an OpenAI model;
modeldefaults togpt-5.5rather than the Anthropic base default.- Parameters:
Examples
>>> from open_atp.harness import CodexHarness >>> harness = CodexHarness() >>> harness.name 'codex' >>> harness.model 'gpt-5.5'
- class open_atp.harness.opencode.OpenCodeHarness(*, model: str = 'claude-opus-4-8', effort: str = 'high', provider: str | None = None, provider_api_key: str | None = None)[source]¶
Bases:
HarnessOpenCode CLI, authenticated by a provider API key forwarded from the host.
- Parameters:
- model
str Model id the agent runs. Default
"claude-opus-4-8".- effort
str Reasoning-effort level. Default
"high".- provider
str, optional API provider name.
Noneinfers it from the model prefix (claude-*->anthropic,gpt-*->openai, …).- provider_api_key
str, optional The selected provider’s API key, forwarded under its canonical env var (
ANTHROPIC_API_KEY/OPENAI_API_KEY/ …).None(default) reads that env var from the host; resolution fails if neither is set. The key is assumed to matchprovider(OpenAI and DeepSeek keys are indistinguishable, so no format check is done).
- model
Examples
The provider is inferred from the model prefix when not given explicitly:
>>> from open_atp.harness import OpenCodeHarness >>> harness = OpenCodeHarness(model="gpt-5.5") >>> harness.name 'opencode' >>> harness.provider 'openai'
With the provider key supplied explicitly,
agent_auth()forwards it under the provider’s canonical env var without reading the host environment:>>> harness = OpenCodeHarness(model="claude-opus-4-8", provider_api_key="sk-fake") >>> harness.agent_auth().env {'ANTHROPIC_API_KEY': 'sk-fake'}
- stage_wd(wd: Path) None[source]¶
Populate
wdwith the harness’s launch script.Everything the harness itself owns – not the skills list (the prover stages it via
stage_skills()) and not the prompt (the prover and task own it, written viawrite_prompt()). Subclasses that need more (Vibe’s VIBE_HOME, ax-prover’s per-target setup) override and callsuper().stage_wd.- Parameters:
- wd
pathlib.Path The agent working directory to populate; must already exist.
- wd
- Raises:
RuntimeErrorIf
wddoes not exist.
- class open_atp.harness.vibe.VibeHarness(*, model: str = 'magistral-medium-latest', effort: str = 'high', agent: str = 'lean-standin', max_turns: int | None = None, max_price: float | None = None, mistral_api_key: str | None = None)[source]¶
Bases:
HarnessMistral Vibe CLI driving its builtin
leanagent (Leanstral) in a sandbox.Vibe’s
leanagent is Leanstral:vibe -p ... --agent leanpins the model toleanstralvia the builtin agent profile (there is no--modelflag). The bareleanprofile is Labs-gated, so thelean-standinstand-in (vendored underassets/vibe/) runs the same Lean scaffold on a non-Labs model until Labs access is enabled. The selected profile is named byagent. Since vibe has no--modelflag, the stand-in profile templates<<MODEL>>and the harness substitutesmodelinto it atstage_wd()time – so the model is a knob (default Magistral) just like the other harnesses’--model.Two things differ from the other harnesses:
VIBE_HOME is workdir-local.
vibe_agent.shexportsVIBE_HOME=$PWD/.vibeso vibe’s config (which un-gates the builtinleanagent), the vendored stand-in agent, and the per-session log all live under the workdir and sync back out with it.Cost comes from the session log, not stdout.
--output streamingcarries only conversation messages – no token/cost totals. Those live in vibe’s per-sessionmeta.json;parse_result()reads it from the synced-back log dir.
- Parameters:
- model
str Model templated into the stand-in profile (vibe has no
--modelflag). Ignored by the builtinleanagent, which pins its own (Leanstral). Default"magistral-medium-latest".- effort
str Reasoning-effort level. Default
"high".- agent
str, optional Which vibe agent profile to drive:
"lean"(real Leanstral) or a model-templated stand-in. Default"lean-standin".- max_turns
int, optional vibe -pturn guard;None(default) leaves it unset.- max_price
float, optional vibe -pprice guard;None(default) leaves it unset.- mistral_api_key
str, optional Mistral La Plateforme key forwarded as
MISTRAL_API_KEY.None(default) reads it from the host env var; resolution fails if neither is set.
- model
Examples
>>> from open_atp.harness import VibeHarness >>> harness = VibeHarness() >>> harness.name 'vibe' >>> harness.agent 'lean-standin'
With the key supplied explicitly,
agent_auth()forwards it asMISTRAL_API_KEYwithout reading the host environment:>>> harness = VibeHarness(mistral_api_key="msk-fake") >>> harness.agent_auth().env {'MISTRAL_API_KEY': 'msk-fake'}
- collect_logs(wd: Path, logs_dir: Path) None[source]¶
Move this harness’s rich log files out of
wdintologs_dir.The streamed event JSONL the prover captures from stdout is the agent’s transcript for every CLI harness, so the default does nothing. Harnesses that also drop a richer record inside the workdir (Vibe’s session log, ax-prover’s per-target logs) override this to relocate those files – so
download_wdstays the proof project anddownload_logscarries the full record. Called afterparse_result()(which may read those files for cost), so moving them is safe.- Parameters:
- wd
pathlib.Path The agent working directory rich log files are moved out of.
- logs_dir
pathlib.Path The run’s log directory the files are relocated into.
- wd
- parse_result(lines: list[str]) HarnessRunResult[source]¶
Parse the agent’s streamed JSON lines into a
HarnessRunResult.- Parameters:
- Returns:
HarnessRunResultToken totals, cost, and stop metadata parsed from the stream.
- stage_wd(wd: Path) None[source]¶
Populate
wdwith the harness’s launch script.Everything the harness itself owns – not the skills list (the prover stages it via
stage_skills()) and not the prompt (the prover and task own it, written viawrite_prompt()). Subclasses that need more (Vibe’s VIBE_HOME, ax-prover’s per-target setup) override and callsuper().stage_wd.- Parameters:
- wd
pathlib.Path The agent working directory to populate; must already exist.
- wd
- Raises:
RuntimeErrorIf
wddoes not exist.
- class open_atp.harness.axprover.AxProverHarness(*, model: str = 'claude-opus-4-8', effort: str = 'high', max_iterations: int | None = None, provider_api_key: str | None = None)[source]¶
Bases:
Harnessax-prover-base (LangGraph Lean agent), driven by
ax-prover provein-sandbox.ax-prover is a self-contained proving agent (its own proposer->builder->reviewer->memory loop) that edits the target
.leanfile in place. It slots in as a harness rather than a standalone prover becauseAgentProver.prove()already supplies everything around the edges (staging, snapshot/diff, sandbox run, key forwarding) and the sharedVerifier– not ax-prover’s own reviewer – remains the source of truth for compile/sorry/axiom.Two things differ from the CLI harnesses (it mirrors
VibeHarnesshere):Config lives in a workdir YAML, not flags.
stage_wd()writesaxprover.yamlselecting the model/effort/iterations; it layers on top of ax-prover’s bundleddefault.yaml(auto-prepended by the CLI), so it only needs to override the deltas.Cost is not on stdout. ax-prover streams human-readable logs; token usage comes from its
-oJSON (the per-targetax_output.<target>.jsonfiles the launch script writes), which carriesinput_tokens/output_tokensalongside{success, error, summary}as of the pinned fork commit (seeAX_PROVER_REFin__main__.py).parse_result()sums those across every target and leavescost_usdNoneso the prover converts tokens->USD via the fallback table, exactly likeCodexHarness. (On an ax-prover build without those fields the tokens are simply absent and the run reports zero cost.)
- Parameters:
- model
str Model id the agent runs, mapped to ax-prover’s
provider:modelstring. Default"claude-opus-4-8".- effort
str Reasoning-effort level, mapped to each provider’s knob. Default
"high".- max_iterations
int, optional Cap on ax-prover’s proposer->builder->reviewer loop.
None(default) keeps ax-prover’s own default (50).- provider_api_key
str, optional The selected provider’s API key, forwarded under its canonical env var (
ANTHROPIC_API_KEY/OPENAI_API_KEY/ …).None(default) reads that env var from the host; resolution fails if neither is set.
- model
Examples
>>> from open_atp.harness import AxProverHarness >>> harness = AxProverHarness() >>> harness.name 'axprover' >>> harness.model 'claude-opus-4-8'
With the provider key supplied explicitly,
agent_auth()forwards it under the provider’s canonical env var without reading the host environment:>>> harness = AxProverHarness(provider_api_key="sk-fake") >>> harness.agent_auth().env {'ANTHROPIC_API_KEY': 'sk-fake'}
- collect_logs(wd: Path, logs_dir: Path) None[source]¶
Move this harness’s rich log files out of
wdintologs_dir.The streamed event JSONL the prover captures from stdout is the agent’s transcript for every CLI harness, so the default does nothing. Harnesses that also drop a richer record inside the workdir (Vibe’s session log, ax-prover’s per-target logs) override this to relocate those files – so
download_wdstays the proof project anddownload_logscarries the full record. Called afterparse_result()(which may read those files for cost), so moving them is safe.- Parameters:
- wd
pathlib.Path The agent working directory rich log files are moved out of.
- logs_dir
pathlib.Path The run’s log directory the files are relocated into.
- wd
- parse_result(lines: list[str]) HarnessRunResult[source]¶
Parse the agent’s streamed JSON lines into a
HarnessRunResult.- Parameters:
- Returns:
HarnessRunResultToken totals, cost, and stop metadata parsed from the stream.
- stage_wd(wd: Path) None[source]¶
Populate
wdwith the harness’s launch script.Everything the harness itself owns – not the skills list (the prover stages it via
stage_skills()) and not the prompt (the prover and task own it, written viawrite_prompt()). Subclasses that need more (Vibe’s VIBE_HOME, ax-prover’s per-target setup) override and callsuper().stage_wd.- Parameters:
- wd
pathlib.Path The agent working directory to populate; must already exist.
- wd
- Raises:
RuntimeErrorIf
wddoes not exist.
Pricing¶
- open_atp.harness.cost.compute_cost_usd(model: str, input_tokens: int, output_tokens: int) float | None[source]¶
Estimate the USD cost of a run from token counts.
Returns
Nonewhenmodelis absent fromCOST_PER_MTOK.
- open_atp.harness.cost.COST_PER_MTOK: dict[str, tuple[float, float]] = {'claude-fable-5': (10.0, 50.0), 'claude-haiku-4-5': (1.0, 5.0), 'claude-opus-4-6': (5.0, 25.0), 'claude-opus-4-7': (5.0, 25.0), 'claude-opus-4-8': (5.0, 25.0), 'claude-sonnet-4-5': (3.0, 15.0), 'claude-sonnet-4-6': (3.0, 15.0), 'deepseek-v4-flash': (0.14, 0.28), 'deepseek-v4-pro': (1.74, 3.48), 'gemini-3.1-pro-preview': (2.0, 12.0), 'gpt-4.1': (2.0, 8.0), 'gpt-4o': (2.5, 10.0), 'gpt-4o-mini': (0.15, 0.6), 'gpt-5.4': (2.5, 15.0), 'gpt-5.4-mini': (0.75, 4.5), 'gpt-5.4-nano': (0.2, 1.25), 'gpt-5.4-pro': (15.0, 120.0), 'gpt-5.5': (5.0, 30.0)}¶
Cost per million tokens, as
(input, output).