tools/stress/device-reporter: add parser + analyzer by nikw9944 · Pull Request #3843 · malbeclabs/doublezero

nikw9944 · 2026-06-05T13:36:59Z

Stack: 3 of 4 — base #3842. Followed by [#stress-reporter-cli].

This PR is the data-analysis foundation for the new `device-reporter` binary. It contains no I/O surface (no CLI, no markdown writer, no script integration) — those land in #4 of the stack. Split this way so the analyzer's data model can be reviewed without rendering pulling focus.

Summary of Changes

`pkg/parser` — `LoadRun` aggregates one stress-run directory: `orchestrator-config.json`, `orchestrator-runlog.jsonl`, the captured agent log (state machine that builds `AgentCycle` records from `Received N lines/bytes` / `Committing config session` / `Configuration session finalized` markers), and the abort sentinel. Missing individual artifacts are tolerated; a missing run directory is a hard error.
`pkg/analyze` — `BuildSummary` turns one `parser.Run` into the rolled-up view callers render. Two non-trivial joins: activate↔applied pairing by `user_index` (for the per-user onchain→on-device gap) and agent-cycle↔applied-bucket pairing by shared finalize timestamp (for the per-cycle table). Three linearity fits: commit-duration vs config size, diff-check-duration vs config size, onchain→on-device-gap vs active-user count. `commitCycles` surfaces a join-mismatch warning when the positional cycle↔bucket join is unbalanced.
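The marker-driven state machine described for `pkg/parser` can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `AgentCycle` fields, the `parseCycles` name, and plain substring matching are all assumptions, and line numbers stand in for the timestamps a real log would carry.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// AgentCycle is a hypothetical, trimmed-down record; the PR's real type
// likely carries timestamps and sizes that are omitted here.
type AgentCycle struct {
	ReceivedLine  int // 1-based line number of the "Received" marker
	CommitLine    int // 0 when the cycle never reached a commit
	FinalizedLine int // 0 when the cycle never finalized
}

// parseCycles walks the log once. A "Received" marker opens a cycle, a
// finalize marker closes it, and a cycle still open at the next
// "Received" (or at end of input) is emitted as unfinished.
func parseCycles(log string) []AgentCycle {
	var cycles []AgentCycle
	var cur *AgentCycle
	sc := bufio.NewScanner(strings.NewReader(log))
	for n := 1; sc.Scan(); n++ {
		line := sc.Text()
		switch {
		case strings.Contains(line, "Received "):
			if cur != nil {
				cycles = append(cycles, *cur)
			}
			cur = &AgentCycle{ReceivedLine: n}
		case cur != nil && strings.Contains(line, "Committing config session"):
			cur.CommitLine = n
		case cur != nil && strings.Contains(line, "Configuration session finalized"):
			cur.FinalizedLine = n
			cycles = append(cycles, *cur)
			cur = nil
		}
	}
	if cur != nil {
		cycles = append(cycles, *cur)
	}
	return cycles
}

func main() {
	log := "Received 120 lines\nCommitting config session\n" +
		"Configuration session finalized\nReceived 80 lines\n"
	for _, c := range parseCycles(log) {
		fmt.Printf("%+v\n", c)
	}
}
```

Emitting unfinished cycles rather than dropping them is one way to keep mid-run logs usable; whether the PR does the same is not stated here.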

Diff Breakdown

| Category | Files | Lines (+/-) | Net |
|---|---|---|---|
| Core logic | 6 | +1130 / -0 | +1130 |
| Scaffolding | 2 | +122 / -0 | +122 |
| Tests | 4 | +439 / -0 | +439 |
| Total | 12 | +1691 / -0 | +1691 |

Purely additive: twelve new files, no edits to existing code. This is the densest PR in the stack, but it all sits in one mental space (pure data manipulation).

Key files

`tools/stress/device-reporter/pkg/analyze/summary.go` — top-level `BuildSummary`, the two joins, the three fits, the join-mismatch warning.
`tools/stress/device-reporter/pkg/parser/agentlog.go` — state machine that walks the orchestrator's captured agent log.
`tools/stress/device-reporter/pkg/parser/run.go` — `LoadRun` aggregator.
`tools/stress/device-reporter/pkg/analyze/fit.go` — `LinearLeastSquares` + percentile helpers shared by all three fits.
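For readers unfamiliar with the fit that all three linearity checks share, an ordinary-least-squares line fit can be sketched as below. The `fitLine` name, its return shape, and the R² convention for constant data are assumptions for illustration; the actual `LinearLeastSquares` signature in `fit.go` is not shown in this PR description.

```go
package main

import "fmt"

// fitLine returns the ordinary-least-squares slope, intercept and R^2 for
// y ~ slope*x + intercept. ok is false when the fit is undefined (fewer
// than two points, mismatched lengths, or all x values identical).
func fitLine(xs, ys []float64) (slope, intercept, r2 float64, ok bool) {
	if len(xs) != len(ys) || len(xs) < 2 {
		return 0, 0, 0, false
	}
	n := float64(len(xs))
	var sumX, sumY, sumXY, sumXX float64
	for i := range xs {
		sumX += xs[i]
		sumY += ys[i]
		sumXY += xs[i] * ys[i]
		sumXX += xs[i] * xs[i]
	}
	denom := n*sumXX - sumX*sumX
	if denom == 0 {
		return 0, 0, 0, false
	}
	slope = (n*sumXY - sumX*sumY) / denom
	intercept = (sumY - slope*sumX) / n

	meanY := sumY / n
	var ssRes, ssTot float64
	for i := range xs {
		d := ys[i] - (slope*xs[i] + intercept)
		ssRes += d * d
		t := ys[i] - meanY
		ssTot += t * t
	}
	r2 = 1 // a constant y is fit exactly by a flat line
	if ssTot > 0 {
		r2 = 1 - ssRes/ssTot
	}
	return slope, intercept, r2, true
}

func main() {
	// Synthetic, perfectly linear data: y = 2x + 1.
	xs := []float64{1, 2, 3, 4}
	ys := []float64{3, 5, 7, 9}
	slope, intercept, r2, ok := fitLine(xs, ys)
	fmt.Println(slope, intercept, r2, ok)
}
```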

Testing Verification

Unit tests cover: parser shapes for both cEOS and real-EOS agent logs; the activate↔applied pairing edge cases (missing applied, applied-without-activate, --no-agent runs with no applied at all); the per-cycle join across commit/abort/unfinished cycle outcomes; the join-mismatch warning; the diff-check duration computation; the onchain→on-device linearity fit against a perfectly-linear synthetic dataset.
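The activate↔applied edge cases listed above fall out naturally from a keyed join. The sketch below is a hypothetical reduction: the `Event` and `Pairing` types, the millisecond timestamps, and the assumption that each `user_index` appears at most once per side are illustrative choices, not the PR's data model.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a hypothetical runlog row reduced to what the join needs.
type Event struct {
	UserIndex int
	AtMs      int64 // event time in Unix milliseconds (assumed unit)
}

// Pairing holds the join result plus both kinds of unmatched rows.
type Pairing struct {
	GapMs                  map[int]int64 // user_index -> applied minus activate
	MissingApplied         []int         // activated, never applied
	AppliedWithoutActivate []int         // applied, never activated
}

// pairByUserIndex joins the two sides on user_index. It assumes each
// index appears at most once per side.
func pairByUserIndex(activates, applied []Event) Pairing {
	appliedAt := make(map[int]int64, len(applied))
	for _, e := range applied {
		appliedAt[e.UserIndex] = e.AtMs
	}
	p := Pairing{GapMs: map[int]int64{}}
	activated := make(map[int]bool, len(activates))
	for _, a := range activates {
		activated[a.UserIndex] = true
		if at, ok := appliedAt[a.UserIndex]; ok {
			p.GapMs[a.UserIndex] = at - a.AtMs
		} else {
			p.MissingApplied = append(p.MissingApplied, a.UserIndex)
		}
	}
	for idx := range appliedAt {
		if !activated[idx] {
			p.AppliedWithoutActivate = append(p.AppliedWithoutActivate, idx)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Ints(p.MissingApplied)
	sort.Ints(p.AppliedWithoutActivate)
	return p
}

func main() {
	activates := []Event{{0, 1000}, {1, 1100}, {2, 1200}}
	applied := []Event{{0, 1450}, {2, 1900}, {7, 2000}}
	fmt.Printf("%+v\n", pairByUserIndex(activates, applied))
}
```

A `--no-agent` run is the degenerate case: with an empty applied slice, every activated index lands in `MissingApplied` and the gap map stays empty.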

elitegreg

[bug] BuildSummary can label an in-progress/provision-only run as success.

File: tools/stress/device-reporter/pkg/analyze/summary.go
Current success condition treats this as success:
- submit == activate > 0
- deprovision_submit == deprovision_activate == 0
Repro: a runlog with only provision submit/activate rows returns Outcome = "success".
This is misleading for partial runs (especially since the reporter is documented as usable mid-run). Consider requiring deprovision completion for provisioned runs.

@elitegreg

The previous predicate counted a provision-only mid-run snapshot (submit == activate > 0, deprovision_submit == deprovision_activate == 0) as success: the (submit+deprovision_submit) > 0 guard was satisfied by the provision count alone. That's misleading for runs inspected before teardown completes, and the reporter is documented as usable mid-run.

Tighten the predicate to require:

* deprovision actually happened (deprovision_submit > 0), and
* either submit == 0 (the documented deprovision-only re-run case) or submit == deprovision_submit (every provisioned user was torn down, and mid-deprovision stays "unfinished").

Add table-driven tests covering all seven shapes the detector has to disambiguate: no events, provision-only (the bug), full cycle, deprovision-only re-run, mid-provision, mid-deprovision, abort sentinel.

Reported by @elitegreg on #3843.

nikw9944 · 2026-06-05T20:04:49Z

Good catch — fixed in c9f9f98.

You're right that the previous predicate's (submit + deprovision_submit) > 0 guard was satisfied by a provision-only mid-run snapshot, so a half-completed run got labelled "success."

Tightened the predicate to require:

deprovision actually happened (deprovision_submit > 0), and
either submit == 0 (the documented deprovision-only re-run case) or submit == deprovision_submit (every provisioned user was torn down — mid-deprovision stays "unfinished").

Added table-driven tests covering all seven shapes the detector has to disambiguate: no events, provision-only (your repro), full cycle, deprovision-only re-run, mid-provision, mid-deprovision, abort sentinel.
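As a rough sketch of the tightened predicate and the seven shapes, something like the following would behave as described. The `Counts` type, the `outcome` function, the `"aborted"` label, and the choice to report a run with no events as `"unfinished"` are all assumptions for illustration; only `"success"` and `"unfinished"` are named in this thread.

```go
package main

import "fmt"

// Counts is a hypothetical tally of runlog rows by event kind.
type Counts struct {
	Submit, Activate                       int
	DeprovisionSubmit, DeprovisionActivate int
}

// outcome applies the tightened predicate: every submit must have its
// activate, deprovision must have happened, and either nothing was
// provisioned in this run or everything provisioned was torn down.
func outcome(c Counts, aborted bool) string {
	if aborted {
		return "aborted" // label assumed; the source only names the sentinel
	}
	balanced := c.Submit == c.Activate &&
		c.DeprovisionSubmit == c.DeprovisionActivate
	tornDown := c.DeprovisionSubmit > 0 &&
		(c.Submit == 0 || c.Submit == c.DeprovisionSubmit)
	if balanced && tornDown {
		return "success"
	}
	return "unfinished"
}

func main() {
	cases := []struct {
		name    string
		c       Counts
		aborted bool
	}{
		{"no events", Counts{}, false},
		{"provision-only", Counts{10, 10, 0, 0}, false},
		{"full cycle", Counts{10, 10, 10, 10}, false},
		{"deprovision-only re-run", Counts{0, 0, 10, 10}, false},
		{"mid-provision", Counts{10, 6, 0, 0}, false},
		{"mid-deprovision", Counts{10, 10, 4, 4}, false},
		{"abort sentinel", Counts{10, 10, 10, 10}, true},
	}
	for _, tc := range cases {
		fmt.Printf("%-24s %s\n", tc.name, outcome(tc.c, tc.aborted))
	}
}
```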

#3844 has been rebased onto the new tip.

Three sets of changes that together let the stress harness drive a real Arista EOS DUT to its hard cap of 1024 user tunnels without dropping applied events, killing the agent mid-commit, or undercounting on the observer side.

* Agent log parser (orchestrator-side). Adds a per-section state machine that handles the real-EOS unified-diff shape ('interface TunnelN' as a context line with '+ <property>' lines below it) in addition to the existing cEOS '+interface TunnelN' shape. Emits two new activity events alongside the existing pre_commit_log / applied / commit beats: EventConfigReceived (from 'Received N bytes...') and EventCommitAborted (from 'session ... abort').
* Sweep timing. The quiescenceTracker grows a sticky pending-commit flag (set on EventConfigReceived, cleared on the matching terminal EventCommit / EventApplied / EventCommitAborted) so the post-deprovision wait outlasts the multi-second diff-check window between a config arrival and the next commit. A new waitForAppliedToCatchUp blocks the provision→deprovision boundary until the agent's applied count covers the orchestrator's provisioned-user count, so deprovision doesn't start removing users the device hasn't applied yet. Both waits honor explicit timeouts (--agent-quiescence-timeout-seconds and the new --apply-catch-up-timeout-seconds, default 300s). --no-agent runs auto-zero the catch-up timeout since the noop runner emits no applied events.
* Observer tunnel-counter. Switches the device_tunnel_gap sentinel's sample command from 'show gre tunnel static' (which only returns statically-keyed routing-fabric tunnels, never user tunnels on either platform tested) to 'show interfaces description' filtered on a 'USER-UCAST-' description prefix. The new filter tracks the controller's naming convention regardless of where in the Tunnel<N> id space user tunnels land, which makes it robust against both the legacy 500+ range and the gm/tunnel-id-start-1 fix that allocates from 1.

End-to-end on physical hardware: 524-user runs on chi-dn-dzd5 + chi-dn-dzd9 and 1023-user runs on both DUTs all completed with 100% on-device coverage and clean teardown.

Adds --apply-per-batch-catch-up (default off) that pauses after every provision batch until the agent's applied count covers the cumulative target submitted so far. Reuses the existing waitForAppliedToCatchUp function and honors the same --apply-catch-up-timeout-seconds per batch. Use case: production users arrive at human cadence rather than in 32/64-user bursts, so this flag matches that shape — useful for measuring per-user latency under steady-state load instead of peak throughput under burst load. Off by default to preserve the original 'stress the agent' shape the harness was built for.

@elitegreg

The wait used to fast-path on a tracker-state inference: if quiescenceTracker.lastEvent() reported no events, the wait skipped. That inference raced with the consumer goroutine that updates the tracker: an event could be sitting in the channel buffer (emitted by the agent runner but not yet picked up by consumeAgentEvents) while the main goroutine read tracker state and concluded 'no events.' The wait would then skip and the orchestrator would kill the SSH session mid-commit.

Reproduces flakily under go test -count=100 -run TestRun_QuiescenceBlocksOnPendingCommit.

Replace the inference with an explicit Config.NoAgent signal that main.go sets when --no-agent is on, alongside the existing ApplyCatchUpTimeout=0 override. The wait short-circuits on cfg.NoAgent (no race possible, since it's an operator-provided flag) and otherwise always enters the loop. The loop's existing exit condition (elapsed >= AgentQuietWindow && sinceLast >= AgentQuietWindow && !HasPendingCommit()) handles the 'no events yet' case correctly: lastEvent of zero yields a huge sinceLast, so the wait gates on elapsed-since-start, blocking for at least AgentQuietWindow before returning. A slow consumer goroutine gets the full quiet window to catch up.

Reported by @elitegreg on #3841.
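The exit condition quoted in this commit message can be isolated as a pure function, which makes the 'no events yet' reasoning easy to check. The sketch below is illustrative only: `quietEnough` and `sinceLastEvent` are hypothetical names, and the orchestrator's real loop is not shown in this thread. It relies on the documented behavior of `time.Time.Sub`, which returns the maximum `Duration` when the true difference would overflow.

```go
package main

import (
	"fmt"
	"time"
)

// quietEnough mirrors the quoted exit condition: the wait may end only
// once the window has elapsed since the wait started, the window has
// elapsed since the last agent event, and no commit is pending.
func quietEnough(elapsed, sinceLast, window time.Duration, pendingCommit bool) bool {
	return elapsed >= window && sinceLast >= window && !pendingCommit
}

// sinceLastEvent returns the time since the last event. A zero lastEvent
// (no events seen yet) is centuries in the past, so Sub saturates at the
// maximum Duration instead of overflowing.
func sinceLastEvent(now, lastEvent time.Time) time.Duration {
	return now.Sub(lastEvent)
}

func main() {
	window := 3 * time.Second
	start := time.Now()
	var lastEvent time.Time // zero value: the tracker has seen nothing

	// One second into the wait: sinceLast is huge, but elapsed gates the exit.
	now := start.Add(1 * time.Second)
	fmt.Println(quietEnough(now.Sub(start), sinceLastEvent(now, lastEvent), window, false))

	// After the full window with still no events, the wait may end.
	now = start.Add(3 * time.Second)
	fmt.Println(quietEnough(now.Sub(start), sinceLastEvent(now, lastEvent), window, false))
}
```

Keeping the condition free of tracker state is what removes the race: the only inputs are clock readings and the pending-commit flag.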

Five small ergonomics fixes for run-stress-physical.sh, all aimed at making first-run failures obvious instead of silent:

* Fail fast on missing EAPI_PASS via :? rather than the empty default that silently produces 401-Unauthorized on every observer sample.
* Default EAPI_USER to 'stress' (matched by a documented device-side 'username stress secret 0 stress') so the observer doesn't ride on the device's admin password.
* CONTROLLER_BINARY env override: point the script at a prebuilt controller from another branch's worktree instead of go-running from the local checkout. Useful when bisecting a controller patch.
* Per-run controller.log: move it into $RUN_DIR so each run's log survives, rather than being truncated at the next controller start.

The README documents the new stress-user prereq, the 'management api gnmi / provider eos-native' stanza that real EOS needs to expose the agent's 127.0.0.1:9543 listener (replacing an earlier eapilocal recipe that turned out to be a guess), the CONTROLLER_BINARY knob, and the EAPI_USER default change.

Add a matching --apply-per-batch-catch-up CLI flag to run-stress-physical.sh so the operator can opt into the per-batch catch-up wait without setting orchestrator flags directly. Off by default, matching the orchestrator's default behavior.

Foundation for the device-reporter binary, split off as its own PR so the analyzer's data model can be reviewed without the CLI surface and rendering pulling focus.

* parser package: LoadRun aggregates one stress-run directory: orchestrator-config.json, orchestrator-runlog.jsonl, the captured agent log (state machine that builds AgentCycle records from Received N lines/bytes / Committing / Configuration session finalized markers), and the abort sentinel. Missing individual artifacts leave the matching field nil; missing run directory is a hard error.
* analyze package: BuildSummary turns one parser.Run into the rolled-up view the formatter renders. The two non-trivial joins are activate↔applied pairing by user_index (for the onchain→on-device gap) and agent-cycle↔applied-bucket pairing by shared finalize timestamp (for the per-cycle table). Three linearity fits: commit-duration vs config size, diff-check-duration vs config size, onchain→on-device-gap vs active-user count.

Tests cover the parser shapes, the activate↔applied edge cases (missing applied, applied-without-activate, --no-agent run with no applied events at all), the per-cycle join's behavior across commit/abort/unfinished cycles, the join-mismatch warning, the diff-check duration computation, and the linearity fit against a perfectly-linear synthetic dataset.

This was referenced Jun 5, 2026

tools/stress/device-reporter: CLI + markdown writer + auto-summary #3844

Merged

tools/stress: post-run analysis tool + auto-emit summary at end of each run #3834

Closed

nikw9944 force-pushed the nikw9944/stress-physical-script-polish branch from 668146e to 9472dc6 Compare June 5, 2026 14:12

nikw9944 force-pushed the nikw9944/stress-reporter-analyze branch 2 times, most recently from 664b28b to ae77e99 Compare June 5, 2026 14:25

nikw9944 mentioned this pull request Jun 5, 2026

stress: surface device CPU + memory in the post-run report #3845

Closed

elitegreg approved these changes Jun 5, 2026

nikw9944 added the skip-changelog label Jun 5, 2026

nikw9944 force-pushed the nikw9944/stress-physical-script-polish branch from 0f78a02 to 20b6968 Compare June 5, 2026 19:53

nikw9944 force-pushed the nikw9944/stress-reporter-analyze branch from ae77e99 to cd135c8 Compare June 5, 2026 19:53

nikw9944 mentioned this pull request Jun 5, 2026

tools/stress: harden orchestrator + observer for physical EOS #3841

Merged

nikw9944 added 7 commits June 8, 2026 18:38

nikw9944 force-pushed the nikw9944/stress-physical-script-polish branch from 20b6968 to 6b663e0 Compare June 8, 2026 18:38

nikw9944 force-pushed the nikw9944/stress-reporter-analyze branch from c9f9f98 to cce53df Compare June 8, 2026 18:38

nikw9944 linked an issue Jun 8, 2026 that may be closed by this pull request

stress: device stress test per-run reporting tool #3835

Closed

Base automatically changed from nikw9944/stress-physical-script-polish to main June 10, 2026 20:14

nikw9944 enabled auto-merge (squash) June 10, 2026 20:15

nikw9944 merged commit b16328c into main Jun 10, 2026
37 of 38 checks passed

nikw9944 deleted the nikw9944/stress-reporter-analyze branch June 10, 2026 20:22

nikw9944 merged 7 commits into main from nikw9944/stress-reporter-analyze