fix([cos-agent-output-write-amplification]): batch streamed agent output to end per-line state-file rewrites#612
Merged
Conversation
…put to end per-line state-file rewrites The CoS Runner emits agent:output per parsed line and the direct-CLI non-stream stdout/stderr handlers fired per chunk; each call ran a full loadState+saveState of the agent-state JSON. On a chatty agent that is dozens of whole-file rewrites per second — the exact write-amplification the CLAUDE.md streaming-producer convention warns about, which only the TUI spawner and the stream-json chunk paths were honoring. Add a shared createAgentOutputBatcher to cosAgents.js (250ms debounce, in-flight re-schedule, swallow+log on write failure) and route the two hot producers through it: subAgentSpawner's runner agent:output handler (per-agent batcher map, drained on completed/error/orphaned) and agentCliSpawning's stdout/stderr/parser-flush paths (per-spawn batcher, drained in the close + error handlers before finalize).
… log to changelog
…unner-batcher Map leak, drain-loop on flush - subAgentSpawner: flush the runner batcher before deleting its Map entry (a line racing in during the awaited flush lands in the same batcher, not an orphaned new one) and drop agent:output for agents no longer in runnerAgents — the runner registers the agent before spawning, so this only ignores post-completion strays that would otherwise leak a batcher. (claude + agy review) - cosAgents: flush() now drains in a while-loop so a push racing in during the second drain is cleared synchronously rather than stranded to the debounce timer. (agy review) - agentLifecycle: drop the now-dead appendAgentOutput import. (agy + claude)
…user terminate/kill before marking complete codex review flagged that the new ~250ms output batching opened a window where a user terminate/kill marks the agent complete before its pending output is drained (and, on the runner path, could leak the batcher Map entry if no later completion event arrives). - agentCliSpawning: expose a flushOutput() hook on the activeAgents entry. - agentManagement terminateAgent/killAgent (direct): await agent.flushOutput?.() before completeAgent. - agentManagement terminateRunnerAgent: drain + drop the runner batcher (flushRunnerOutputBatcher, dynamic-imported to avoid the static cycle) before completeAgent + runnerAgents.delete. - subAgentSpawner: export flushRunnerOutputBatcher. - Add source-contract tests asserting flush precedes completeAgent on all three paths.
050d033 to
4e76ba3
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The audit item asked us to verify that every producer calling
appendAgentOutputdirectly is human-pace, not a hot loop — and migrate any that aren't. It was not a no-op audit: two hot producers were still doing a full agent-stateloadState+saveStateper output line/chunk.subAgentSpawner.js): the runner emitsagent:outputper parsed line (cos-runner/index.js), and the handler did one full state write per event. This is the primary output path when the runner is available (production default).agentCliSpawning.js): the non-stream stdout/stderr branches wrote state per chunk; the stream-json/codex chunk branches batched per chunk but without the documented ~250ms debounce.Only the TUI spawner honored the CLAUDE.md "High-frequency state writes must batch" convention.
Change
createAgentOutputBatcher(agentId)tocosAgents.js— a 250ms-debounced batcher overappendAgentOutputLineswith in-flight re-schedule and swallow+log on write failure (so neither the timer nor a caller'sawait flush()can throw into a child-process/timer callback). Callersawait flush()in their finish/cleanup path so the final lines land before the completion event.agent:outputhandler through a per-agent batcher map, drained onagent:completed/agent:error/agents:orphaned.closeanderrorhandlers before finalize.No change to the live output tail (events still emit per line); only the state-write cadence changes — from dozens of whole-file rewrites/sec to a few.
Test plan
cosAgents.test.js— newcreateAgentOutputBatchersuite: coalesces N lines into onesaveState, no-op flush when empty, captures lines pushed mid-drain, swallows+logs a write failure (flush never rejects).agentCliSpawning.test.js— the two stream-containment tests updated for the batched flow: a failed batch flush on close is logged with a ❌ prefix and never leaks as an unhandled rejection (stdout + stderr).cosAgents(5),agentCliSpawning(9),agentTuiSpawning(19),subAgentSpawner(85).