feat(control-plane,web): persist + surface the runner failure detail (#137) #169

Merged
stephane-segning merged 2 commits into main from claude/persist-surface-failure-detail on Jun 23, 2026
@stephane-segning

Contributor

1. Summary

This PR changes:

  • Adds migration 0016_task_error_detail.sql: a nullable error_detail TEXT column on tasks.
  • Persists the runner's free-text status detail. set_task_status now takes detail: Option<&str> and writes it via COALESCE($4, error_detail). set_status passes update.detail through; it previously only logged the value and dropped it (the code literally said "not persisted yet"). Both reaper call sites are updated, and the reaper's terminal-fail path records a reason.
  • Exposes it: TaskRow.error_detail (selected by TASK_SELECT's t.*), so GET /tasks and GET /tasks/{id} return it.
  • Web run detail page surfaces error_detail as a calm inline status line ("Review did not post: <detail>") for failed runs and for succeeded-but-empty no-ops, so a silent no-op can be told apart from a genuinely clean review.
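
For reviewers who want the write rule in isolation, here is a minimal sketch of what `COALESCE($4, error_detail)` does to the column, modeled in plain Rust with no database. The function name `coalesce_detail` and its parameters are hypothetical and do not appear in `db.rs`; the real logic lives in the SQL issued by `set_task_status`.

```rust
/// Hypothetical pure-Rust model of `error_detail = COALESCE($4, error_detail)`.
///
/// `reported` stands in for the `$4` bind parameter (the runner's detail),
/// `stored` for the value already in the `error_detail` column.
/// A reported detail wins; a missing one leaves the stored value untouched.
fn coalesce_detail<'a>(reported: Option<&'a str>, stored: Option<&'a str>) -> Option<&'a str> {
    // `Option::or` returns `reported` when it is `Some`, otherwise `stored`,
    // which matches COALESCE over two arguments where NULL maps to `None`.
    reported.or(stored)
}
```

Under this model, a status report without a detail that arrives after a failure keeps the recorded reason, while a report carrying a new detail replaces it.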

It solves:

  • The persist + surface half of the silent-no-op problem (#137).


2. Intent

The intent of this PR is:

Stop discarding the runner's failure reason. A concurrent agent-runner PR makes the runner send a meaningful detail on failure; this PR persists that detail and surfaces it on the run detail page so an operator can tell why a review failed or posted nothing, instead of seeing a green run with no explanation. This is the persist + surface half of the fix for the silent-no-op problem (#137).


3. Scope

In Scope

  • services/control-plane: migration 0016, db.rs (set_task_status signature + TaskRow.error_detail), http/internal.rs (set_status threading detail), queue/reaper.rs (call-site updates).
  • apps/web: Task.error_detail type + run detail page rendering.

Out of Scope

  • webhook.rs idempotency logic and the agent-runner itself (separate concurrent PRs).
  • The succeeded/failed state machine — persisting + surfacing the reason is the core fix; the added "silent no-op" indicator on the web side is purely derived and additive.
  • Running the migration against prod / deployment / ai-helm.

4. Verification

I verified this change by:

  • Running automated tests
  • Running manual tests
  • Checking logs
  • Checking metrics
  • Testing error cases
  • Testing permissions/security behavior
  • Testing rollback or failure behavior, if relevant

A new DB test (set_task_status_persists_and_preserves_detail) covers the error case: a reported detail is persisted, and a later detail-less report does not erase the recorded reason (COALESCE).

Commands run:

cargo fmt -p control-plane -- --check
cargo clippy -p control-plane --all-targets -- -D warnings
cargo test -p control-plane   # DB tests against local pgvector
pnpm --filter web lint
pnpm --filter web build

Results:

$ cargo fmt -p control-plane -- --check
FMT_EXIT=0

$ cargo clippy -p control-plane --all-targets -- -D warnings
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 22.32s
CLIPPY_EXIT=0

$ cargo test -p control-plane
test db::tests::set_task_status_stamps_and_releases ... ok
test db::tests::set_task_status_persists_and_preserves_detail ... ok
test result: ok. 60 passed; 0 failed; 1 ignored; 0 measured; 0 filtered out

$ pnpm --filter web lint
$ biome check .
Checked 54 files in 24ms. No fixes applied.

$ pnpm --filter web build
├ ƒ /dashboard/runs/[id]                 2.97 kB         116 kB
✓ Compiled successfully

5. Screenshots / Evidence

  • Prod evidence: 98/144 (~68%) "succeeded" PR-review tasks over 14 days posted nothing.
  • Code evidence: the dropped-detail comment in http/internal.rs (StatusUpdate.detail — "not persisted yet") is now removed and the value is persisted.

6. Risk Assessment

Risk level:

  • Low
  • Medium
  • High

Potential risks:

  • Additive nullable column + a widened function signature; no state-machine change.
  • Migration is ADD COLUMN IF NOT EXISTS — idempotent and backward-compatible (older rows read NULL).

Mitigation:

  • COALESCE write semantics ensure a later report can't erase a recorded reason; covered by a test.
  • Web rendering guards on presence (error_detail truthy), so existing clean runs are unaffected.

7. AI Usage Declaration

AI was used for:

  • Understanding existing code
  • Generating code
  • Refactoring
  • Generating tests
  • Drafting documentation
  • Reviewing the diff
  • Not used

Human verification:

  • I understand every meaningful change in this PR
  • I checked generated code manually
  • I checked generated tests manually
  • I removed unsupported AI assumptions
  • I accept responsibility for this PR

8. Reviewer Focus

Please focus your review on:

  • Correctness
  • Architecture
  • Security
  • Performance
  • Tests
  • Maintainability
  • Product intent
  • Edge cases

#137)

The runner already reported a free-text `detail` on POST /internal/tasks/{id}/status,
but `set_status` only `info!`-logged it and dropped it ("not persisted yet"), and
`tasks` had no column to hold it. So a review that failed or posted nothing showed
green on the dashboard with no reason — a 14-day prod audit found 98 of 144 (~68%)
"succeeded" PR-review tasks had posted nothing, failures swallowed as success.

- Migration 0016: add nullable `error_detail TEXT` to `tasks`.
- `set_task_status` takes `detail: Option<&str>` and writes it via
  `COALESCE($4, error_detail)` so a later detail-less report can't erase a recorded
  reason; threaded through `set_status` (stop dropping it) and both reaper sites.
- `TaskRow.error_detail` (selected by `TASK_SELECT`'s `t.*`) so GET /tasks and
  GET /tasks/{id} return it.
- Web run detail: surface `error_detail` as a calm inline status line — "Review did
  not post: <detail>" — for failed runs and for succeeded-but-empty no-ops, so a
  silent no-op is tellable apart from a real clean review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@chatgpt-codex-connector


You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@github-actions


✅ AI Governance check passed

This PR declares AI usage, references a source of truth, and provides verification evidence. Thank you.

@gemini-code-assist (bot) left a comment

Contributor

Code Review

This pull request introduces persistence and frontend visualization for the runner's free-text status reason (error_detail), preventing silent failures or no-ops from being mistaken for successful reviews. However, a critical issue was identified in the database status update logic: when a task is retried and transitions back to running, the previous attempt's error_detail is not cleared. This can lead to successful retried runs being incorrectly flagged as silent no-ops on the dashboard. A suggestion has been provided to clear the error detail upon transitioning to the running state.

Comment thread services/control-plane/src/db.rs Outdated
@stephane-segning

Contributor Author

@lightbridge-assistant please review this

@stephane-segning

Contributor Author

@lightbridge-assistant please review this again

#169 review)

Gemini (high): COALESCE($4, error_detail) kept the prior attempt's failure
reason forever, so a retried run that now succeeds was still flagged as a silent
no-op on the dashboard. Clear error_detail on any 'running' transition (fresh
attempt = clean slate); preserve it on detail-less terminal reports. Added a
regression assertion.
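
The revised rule can be sketched as a pure function, again without a database. This is an illustration under stated assumptions, not the code in `db.rs`: the `Status` enum and the name `next_error_detail` are hypothetical, only three statuses are modeled, and the sketch assumes that a detail sent together with a `running` report is discarded along with the stored one.

```rust
/// Hypothetical subset of task statuses; the real state machine may have more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Running,
    Succeeded,
    Failed,
}

/// Hypothetical model of the value written to `error_detail` on a status update.
///
/// - `Running`: a fresh attempt starts from a clean slate, so the column is cleared.
/// - Terminal statuses: keep the COALESCE behavior, so a reported detail wins and
///   a detail-less report preserves whatever was stored.
fn next_error_detail(
    status: Status,
    reported: Option<&str>,
    stored: Option<&str>,
) -> Option<String> {
    match status {
        // Assumption: any detail accompanying a `running` report is dropped too.
        Status::Running => None,
        Status::Succeeded | Status::Failed => reported.or(stored).map(str::to_owned),
    }
}
```

In this model a failed attempt followed by a retry walks through `Failed` (reason stored), `Running` (reason cleared), and `Succeeded` with no detail (column stays empty), so the retried run is no longer flagged as a silent no-op.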

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@stephane-segning merged commit 649ec4c into main on Jun 23, 2026 (8 checks passed)
@stephane-segning deleted the claude/persist-surface-failure-detail branch on June 23, 2026 at 10:49