provers

The concrete provers and the registry/factory over them. Each prover subclasses AutomatedProver and funnels its output through the shared Verifier. The agentic provers compose an agent harness (the agent concern) with a ComputeBackend (the compute concern).

Base

The base prover abstraction. An AutomatedProver is a candidate generator; the base class owns the shared lifecycle (the public prove: generate, then verify in the sandbox) so subclasses only implement _generate.

class open_atp.provers.base.AutomatedProver(*, backend: ComputeBackend, timeout_s: int = 1800)[source]

Generate candidate proofs, then verify them in a shared sandbox.

The sandbox image (its tag plus the Lean toolchain + Mathlib pins the shared verifier checks every project against) comes from backend – a prover inherits whatever image its backend runs.

Parameters:
backendComputeBackend

The one backend for this prover. Agentic provers reuse it (via a live session) for generation, then verify in that hot sandbox; Aristotle uses it only for the final check.

timeout_sint

Wall-clock budget for the generation run, in seconds. Default 1800.

prove(task: ProofTask, output_dir: Path | str) ProofResult[source]

Full lifecycle: reject-on-mismatch, generate, verify, write the result.

Parameters:
taskProofTask

The unit of work: the lake project to complete, the optional targets to focus on, and any user_prompt guidance.

output_dirpathlib.Path or str

Caller-chosen output directory, populated as output_dir/{wd,logs}/: wd is the completed lake project (the proof output) and logs is the run record (the agent stdout.txt/stderr.txt, result.json, and any harness-specific rich logs).

Returns:
ProofResult

The verification verdict and run metadata, pointing at the populated wd and logs_dir.

class open_atp.provers.base.ProofResult(prover: str, verification: VerificationReport | None, output_dir: Path, completed_files: dict[str, str]=<factory>, cost_usd: float | None = None, duration_s: float | None = None, metadata: dict[str, object]=<factory>, error: str | None = None)[source]

What a prover returns from AutomatedProver.prove().

The prover writes its artifacts into the caller-chosen output_dir, laid out as output_dir/{wd,logs}/: wd is the completed lake project (the proof output) and logs is the run record (the streamed agent stdout.txt, stderr.txt, result.json, and any harness-specific rich logs). This object just records where those live, plus the verification verdict and run metadata.

Attributes:
proverstr

Name of the prover that produced this result.

verificationVerificationReport or None

The shared verification of the completed project, or None when the run failed before a candidate could be verified (see error).

output_dirpathlib.Path

The run’s output directory. Holds the wd (proof project) and logs_dir (run record) subdirectories the prover populated.

completed_filesdict[str, str]

The completed .lean sources, keyed by file path relative to the project root.

cost_usdfloat, optional

Estimated USD cost of the run. None when the prover does not report cost.

duration_sfloat, optional

Wall-clock duration of the run, in seconds.

metadatadict[str, object]

Harness-specific run metadata (token counts, run summaries, …).

errorstr, optional

Set when the prover raised before producing a result (Docker down, API error, toolchain mismatch). verification is None and success is False.

wdpathlib.Path

The completed proof project (output_dir/wd).

logs_dirpathlib.Path

The run’s logs directory (output_dir/logs).

successbool

True iff verification exists and is verified.

to_dict() dict[str, object][source]

JSON-ready view: inline files, verification, cost, and artifact paths.

Provers

The concrete candidate generators.

AgentProver

class open_atp.provers.agent_prover.AgentProver(*, backend: ComputeBackend, harness: Harness | None = None, skills: list[str] | None = None, timeout_s: int = 1800)[source]

Bases: AutomatedProver

Generate proofs by driving an agent CLI harness in a compute backend.

Composes an agent harness (the agent concern) with a ComputeBackend (the compute concern): the harness edits the staged .lean files in place, then the shared Verifier does the final compile/sorry/axiom check. codex, opencode, axprover, and vibe are this prover on a different harness.

Parameters:
backendComputeBackend

The sandbox the agent runs in. Generation reuses it via a live session and verification runs in that same hot sandbox.

harnessHarness, optional

The harness to drive and its knobs: ClaudeCodeHarness (default), CodexHarness, OpenCodeHarness, VibeHarness, or AxProverHarness. Carries model/effort plus any harness-specific knobs. Plugins are Claude-only and live on ClaudeCodeHarness.

skillslist[str], optional

Skills to mount into the agent workdir, each a name (resolved from the vendored leanprover/skills catalog) or a full path to a SKILL.md tree. Default ["lean-proof"]; an empty list mounts none. Staged into every skill-supporting harness’s location; ignored by ax-prover.

timeout_sint

Wall-clock budget for the generation run, in seconds. Default 1800.

Attributes:
prover_promptstr

The prover’s own prompt handed to the agent, before any user prompt.

Examples

Construct the prover directly, wiring up a harness and a backend:

>>> from open_atp.backends.docker import DockerBackend
>>> from open_atp.harness import CodexHarness
>>> from open_atp.provers.agent_prover import AgentProver
>>> backend = DockerBackend()
>>> prover = AgentProver(harness=CodexHarness(effort="high"), backend=backend)
>>> prover.harness.model
'gpt-5.5'

Or build the same prover from the standard catalog by name, taking its baked-in defaults (see standard_prover()):

>>> from open_atp import standard_prover
>>> prover = standard_prover("agent:codex", backend=DockerBackend())
>>> prover.name, prover.harness.name
('agent', 'codex')

Complete a task’s sorrys with prove(), here on a bundled example (this runs the agent in Docker and bills it):

>>> import tempfile
>>> from open_atp.examples import EXAMPLE, example_task
>>> task = example_task(EXAMPLE.MUL_REORDER)
>>> result = prover.prove(task, tempfile.mkdtemp())
>>> result.success
True

AgentProver’s harness is a Harness — pick the CLI and its knobs by composing one (e.g. AgentProver(harness=VibeHarness(max_turns=8), backend=...)). The per-harness classes are documented under harness.

NuminaProver

class open_atp.provers.numina.NuminaProver(*, backend: ComputeBackend, skills: list[str] | None = None, max_rounds: int = 20, max_consecutive_limits: int = 2, helper_env_keys: tuple[str, ...] = ('GEMINI_API_KEY', 'OPENAI_API_KEY', 'LEAN_LEANDEX_API_KEY', 'ANTHROPIC_API_KEY'), guard_statements: bool = True, on_statement_change: Literal['error', 'warn'] = 'error', timeout_s: int = 1800, env: dict[str, str] | None = None)[source]

Bases: AgentProver

Run the Numina coordinator/subagent scaffold as an AgentProver.

A specialization of AgentProver wired to Numina’s vendored scaffold (coordinator prompt + skills + subagent prompts, staged by _stage_numina_assets()); generation and the shared Verifier work exactly as in the base agent prover.

The harness is fixed to an internal NuminaHarness (Claude Code with no plugins, since Numina ships its own scaffold) and is not configurable – Numina is claude-CLI driven, and _stage_numina_assets() mounts the vendored scaffold straight into the known .claude/ locations.

Parameters:
backendComputeBackend

The sandbox the agent runs in. Generation reuses it via a live session and verification runs in that same hot sandbox.

skillslist[str], optional

Extra named/path skills to mount alongside Numina’s vendored scaffold. Default empty – Numina’s coordinator skill is staged from vendor/numina/skills, not this list.

max_roundsint

Maximum number of coordinator rounds before the run stops. Default 20.

max_consecutive_limitsint

Reset (start a fresh session) after this many consecutive LIMIT rounds. Default 2.

helper_env_keystuple[str, …]

Helper-skill credentials forwarded into the sandbox when present in the host env; skills degrade/skip when their key is absent. Default _DEFAULT_HELPER_ENV_KEYS.

guard_statementsbool

Whether to snapshot the target theorems and reject runs that weaken or delete them. Default True.

on_statement_change{“error”, “warn”}

Behavior on a weakened/deleted target theorem: error stops the run and restores the originals; warn restores and continues. Default error (rejects; safe).

timeout_sint

Wall-clock budget for the generation run, in seconds. Default 1800.

envdict[str, str], optional

Extra literal environment variables forwarded into the agent sandbox (Numina pins its harness, so its env knobs live here). Default empty.

Attributes:
prover_promptstr

Numina’s coordinator scaffold + round protocol, before any user prompt.

Examples

Construct the prover directly:

>>> from open_atp.backends.docker import DockerBackend
>>> from open_atp.provers.numina import NuminaProver
>>> backend = DockerBackend()
>>> prover = NuminaProver(backend=backend)
>>> prover.max_rounds
20

Or build the same prover from the standard catalog by name, taking its baked-in defaults (see standard_prover()):

>>> from open_atp import standard_prover
>>> prover = standard_prover("numina", backend=DockerBackend())
>>> prover.name
'numina'

Complete a task’s sorrys with prove(), here on a bundled example (this runs the Numina scaffold in Docker and bills it):

>>> import tempfile
>>> from open_atp.examples import EXAMPLE, example_task
>>> task = example_task(EXAMPLE.INTER_UNION_DISTRIB)
>>> result = prover.prove(task, tempfile.mkdtemp())
>>> result.success
True

AristotleProver

class open_atp.provers.aristotle.AristotleProver(*, backend: ComputeBackend, api_key: str | None = None, allow_agent_questions: bool = False, max_connection_retries: int = 5, max_resume_attempts: int = 20, resume_backoff_seconds: float = 5.0, timeout_s: int = 1800)[source]

Bases: AutomatedProver

Prove by handing the whole project to Harmonic’s hosted Aristotle agent.

Generation happens over the network (submit the lake project, wait, download the result archive, unpack it over the workdir); the shared Verifier then runs the same local compile/sorry/axiom check. Generation is network-only, so the backend is used solely for that final check – unlike the agentic provers, there is no live session to reuse.

Parameters:
backendComputeBackend

The sandbox used only for the final verify; Aristotle generates over the network, so there is no live session to reuse.

api_keystr, optional

The Harmonic API key. None (default) reads it from the host ARISTOTLE_API_KEY env var.

allow_agent_questionsbool

Whether to let the hosted agent ask clarifying questions. Off by default: this is a headless API path and a prompt for stdin would hang the run.

max_connection_retriesint

Bounds per-call retries (list/refresh/download) when a connection drops. The hosted run lives server-side, so a dropped connection is recoverable: re-fetch rather than reporting the run failed. Default 5.

max_resume_attemptsint

Bounds how many times we re-attach to the event stream when it drops mid-run. Default 20.

resume_backoff_secondsfloat

Initial sleep between retries/resumes, doubling (capped) between tries. Default 5.0.

timeout_sint

Wall-clock budget for the generation run, in seconds. Default 1800.

Attributes:
prover_promptstr

The prover’s own prompt handed to Aristotle, before any user prompt.

Examples

Construct the prover directly (network-only generation, so the backend is just the verify backend):

>>> from open_atp.backends.docker import DockerBackend
>>> from open_atp.provers.aristotle import AristotleProver
>>> backend = DockerBackend()
>>> prover = AristotleProver(backend=backend)
>>> prover.name
'aristotle'

Or build the same prover from the standard catalog by name, taking its baked-in defaults (see standard_prover()):

>>> from open_atp import standard_prover
>>> prover = standard_prover("aristotle", backend=DockerBackend())
>>> prover.name
'aristotle'

Complete a task’s sorrys with prove(), here on a bundled example (this hits the hosted Aristotle API, needing ARISTOTLE_API_KEY, and runs Docker for the verify):

>>> import tempfile
>>> from open_atp.examples import EXAMPLE, example_task
>>> task = example_task(EXAMPLE.ABS_MUL_LT)
>>> result = prover.prove(task, tempfile.mkdtemp())
>>> result.success
True

Standard catalog

The standard catalog names each ready-to-run default prover and builds it against a compute backend. standard_prover() maps a catalog name to a constructed AutomatedProver, wiring in the shared image/toolchain from the backend; a caller then drives it directly via prove(), which returns a ProofResult with verification and cost. Agentic provers run generation in a live session over that backend and verify in the same hot sandbox. This is the top-level surface re-exported from open_atp itself.

Names are the agent:<cli> agent provers ("agent:claude", "agent:codex", "agent:opencode", "agent:vibe", "agent:axprover") and the standalone provers ("numina", "aristotle"); standard_provers() lists them all. Each builds its class’s baked-in defaults — to customize any knob, use build_prover() with a full config dict instead.

open_atp.config.standard_prover(name: str, *, backend: ComputeBackend) AutomatedProver[source]

Construct a standard default prover name against backend.

name is a STANDARD_PROVERS key – an agent:<cli> default ("agent:claude", "agent:codex", "agent:opencode", "agent:vibe", "agent:axprover") or a standalone prover ("numina", "aristotle"). The prover is built with its class’s baked-in defaults; to customize any knob (model, effort, skills, …), use build_prover() with a full config dict instead.

The sandbox image (and the toolchain + Mathlib pins projects are checked against) comes from backend, not a parameter here.

Examples

>>> from open_atp.backends.docker import DockerBackend
>>> prover = standard_prover("agent:claude", backend=DockerBackend())
>>> prover.harness.model
'claude-opus-4-8'
open_atp.config.standard_provers() list[str][source]

The names standard_prover() accepts: the STANDARD_PROVERS keys.

open_atp.config.STANDARD_PROVERS: dict[str, dict[str, object]] = {'agent:axprover': {'harness': {'type': 'axprover'}, 'type': 'agent'}, 'agent:claude': {'harness': {'type': 'claude_code'}, 'type': 'agent'}, 'agent:codex': {'harness': {'type': 'codex'}, 'type': 'agent'}, 'agent:opencode': {'harness': {'type': 'opencode'}, 'type': 'agent'}, 'agent:vibe': {'harness': {'type': 'vibe'}, 'type': 'agent'}, 'aristotle': {'type': 'aristotle'}, 'numina': {'type': 'numina'}}

The standard catalog: a friendly name -> the canonical prover spec for that ready-to-run default. The agent:<cli> entries are all the shared AgentProver on a different harness; numina and aristotle are the standalone provers. Build one with standard_prover().