
Plan: Refactor foc-devnet: agent-first, legible, footgun-free #139

Description

@rvagg

Why

foc-devnet is likely to be driven increasingly by AI agents, so each command's stdout should be treated as the primary interface. Humans arriving cold benefit from the same thing. Today that interface is poor: output is unstructured trace spam, failures are opaque (clean can die with a bare PermissionDenied on Docker's own root-owned files), and the init->build->start lifecycle isn't discoverable from --help. The code has also drifted into 12 startup steps spread across 83 files, which makes it slow to change. This issue tracks making the CLI clear and recoverable, and the codebase easier to move forward on.

Guiding principle: every command's output is agent context. Say what happened, name the paths that matter, state what phase we're in, point at the next command, and be honest about damage and edges.

User-facing changes

Recovery / self-service

  • clean self-heals the root-owned files Docker leaves behind instead of failing opaquely.
  • Lifecycle commands become retry-safe.
  • Poison is explained and cleared predictably.
  • New doctor command: checks Docker, host.docker.internal, basedir ownership, ports, disk, poison, and binaries, with --fix for the safe repairs.
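
One of the doctor checks listed above, basedir ownership, could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: the function name `find_root_owned` and its `root_uid` parameter are hypothetical, and it assumes a POSIX host where root has uid 0.

```python
import os


def find_root_owned(basedir: str, root_uid: int = 0) -> list[str]:
    """Return paths under basedir (relative to it) owned by root_uid.

    Hypothetical helper: a doctor-style check would report these paths
    and suggest a repair instead of failing with PermissionDenied.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(basedir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are checked themselves, not their targets
                if os.lstat(path).st_uid == root_uid:
                    hits.append(os.path.relpath(path, basedir))
            except OSError:
                # Unreadable entries are skipped here; a real check might report them
                continue
    return sorted(hits)
```

A caller could print a FAIL line naming the returned paths when the list is non-empty, and an ok line otherwise.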

Orientation

  • New foc-devnet system-context prints a compact primer (what the devnet is, components, protocol, footguns) so a cold agent gets competent before acting; --help points to it.

Legible output

  • status gains a phase line (built / running+run_id / poisoned) with source refs and the suggested next command.
  • --output json with a stable error kind and a stable exit-code contract, so agents and CI can branch deterministically.
  • Each run writes a run-summary.json handoff so the next agent resumes from one artifact.

Fewer lifecycle surprises

  • --help ordered init->build->start->..., with honest command and flag descriptions.
  • start refuses to clobber a running cluster (opt-in override). Breaking.
  • Portainer becomes opt-in and is torn down by stop.
  • Artifact-preserving reinit (keep proof params, binaries, builder cache).

Testing ergonomics

  • Scenarios can run against a local Synapse worktree without patching the harness.
  • fund-user / larger default deposit to unblock piece-heavy tests.

Documentation

  • Consolidate the split README / README_ADVANCED into one README, plus a self-contained SYSTEM-CONTEXT.md (the primer the command prints) and a checked-in AGENTS.md, each readable from a bare clone.

What it looks like

Illustrative target output, not current behavior.

--help (lifecycle-ordered, self-describing)

$ foc-devnet --help
Local Filecoin devnet for FOC integration testing.

Lifecycle: init -> build -> start -> stop/clean
Re-initializing requires `clean` first. State lives under
$FOC_DEVNET_BASEDIR (default ~/.foc-devnet).

Commands:
  init          Clone repos, generate keys, stage proof params, build Docker images
  build         Compile Lotus or Curio from the configured sources
  start         Bring up the cluster (5-10 min); writes devnet-info.json
  status        Show phase, run_id, sources, and the next command
  doctor        Check environment and known sharp edges; --fix repairs safe ones
  system-context  Print the primer: what the devnet is and how the pieces fit
  stop          Stop containers, keep state on disk
  clean         Remove run state (self-heals Docker-owned files; keeps config + images)
  version       Show build/version info
  config        Show resolved configuration
  requirements  Check/set up Docker prerequisites

Options:
  --output <human|json>   Output format (default: human)
  -h, --help

A cold agent recovers and reaches a working devnet

$ foc-devnet status
Phase: poisoned (a prior command failed mid-run)
  Basedir:  /mnt/nvme2/foc-devnet
  Recover:  foc-devnet clean   (clears poison, self-heals Docker-owned files)

$ foc-devnet doctor
  ok    Docker reachable, user in docker group
  ok    host.docker.internal present
  FAIL  Basedir has root-owned files: docker/volumes/cache/foc-builder
          fix -> foc-devnet doctor --fix   (or `clean`, which self-heals)
  ok    Ports 5700-5800 free
  ok    Free disk 812 GiB
  warn  Poison present -> `clean` clears it
  ok    Required binaries present in bin/

$ foc-devnet clean
Cleaned run state. Kept config.toml and cached images.
  Self-healed 3 root-owned paths under docker/volumes/cache.
  Cleared poison.
Next: foc-devnet start

$ foc-devnet start
  [step 7/12] deploy FOC contracts ... done (128s)
  ...
Phase: running (run_id=20260701T1042_TizzyMoo)
  devnet-info:  /mnt/nvme2/foc-devnet/state/latest/devnet-info.json
  run-summary:  /mnt/nvme2/foc-devnet/state/latest/run-summary.json
Next: python3 scenarios/run.py   or   foc-devnet status

The same, machine-readable (--output json)

Agents branch on kind and exit code instead of grepping prose.

$ foc-devnet start --output json    # transient chain error, exit 5
{"command":"start","status":"error","phase":"multicall3_deploy",
 "kind":"chain_actor_unavailable","error":"actor not found: validation failure",
 "advice":"f4 activation delay; retry in ~5s","logs":".../run/<id>/setup.log"}

$ foc-devnet start --output json    # cluster already up, exit 3
{"command":"start","status":"error","kind":"cluster_running",
 "run_id":"20260701T1042_TizzyMoo",
 "advice":"stop it, use a different FOC_DEVNET_BASEDIR, or pass --force"}
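
On the consuming side, an agent or CI script might branch on these payloads roughly as below. This is a sketch under assumptions: the two `kind` values come from the illustrative output above, while the function name `decide`, the returned action strings, and the choice of which kinds count as retryable are hypothetical.

```python
import json

# Assumption: kinds treated as transient and therefore worth retrying.
RETRYABLE_KINDS = {"chain_actor_unavailable"}


def decide(stdout: str, exit_code: int) -> str:
    """Map a `--output json` result to a coarse next action."""
    if exit_code == 0:
        return "proceed"
    payload = json.loads(stdout)
    kind = payload.get("kind")
    if kind in RETRYABLE_KINDS:
        return "retry"
    if kind == "cluster_running":
        return "stop_or_isolate"
    # Unknown kinds are escalated rather than guessed at.
    return "escalate"
```

The point of the stable `kind` field is that this dispatch never needs to inspect the free-text `error` or `advice` strings.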

start no longer silently clobbers a running cluster

$ foc-devnet start
Error: a cluster is already running (run_id=20260701T1042_TizzyMoo).
  Stop it:     foc-devnet stop
  Isolate:     FOC_DEVNET_BASEDIR=/mnt/nvme2/other foc-devnet start
  Override:    foc-devnet start --force   (tears down the running cluster)

Internal cleanup (not user-facing)

Collapse the step-module sprawl, delete dead code, and unify the sequential/parallel step DAG, which also fixes a live bug: the default start currently skips the prerequisites check that --parallel runs.
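
To illustrate what a unified step DAG could look like, here is a minimal sketch in which the sequential and parallel paths are derived from a single dependency table, so neither mode can skip a step the other runs. The step names and dependencies are placeholders, not the actual 12 steps, and Python's standard `graphlib` stands in for whatever the real implementation uses.

```python
from graphlib import TopologicalSorter

# Hypothetical steps: each key maps to the set of steps it depends on.
STEPS = {
    "prerequisites": set(),
    "start_lotus": {"prerequisites"},
    "start_curio": {"start_lotus"},
    "deploy_contracts": {"start_lotus"},
}


def sequential_order(steps: dict[str, set[str]]) -> list[str]:
    """One valid linear order; every step appears after its dependencies."""
    return list(TopologicalSorter(steps).static_order())


def parallel_batches(steps: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into batches whose members can run concurrently."""
    sorter = TopologicalSorter(steps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

Because both functions read the same table, a step such as the prerequisites check is either present in both modes or in neither.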
