tools/stress/device-reporter: add parser + analyzer#3843
Merged
Conversation
This was referenced Jun 5, 2026
668146e to
9472dc6
Compare
664b28b to
ae77e99
Compare
elitegreg
approved these changes
Jun 5, 2026
elitegreg
left a comment
Contributor
There was a problem hiding this comment.
[bug] BuildSummary can label an in-progress/provision-only run as success.
- File:
tools/stress/device-reporter/pkg/analyze/summary.go - Current success condition treats this as success:
submit == activate > 0deprovision_submit == deprovision_activate == 0
- Repro: a runlog with only provision submit/activate rows returns
Outcome = "success". - This is misleading for partial runs (especially since the reporter is documented as usable mid-run). Consider requiring deprovision completion for provisioned runs.
0f78a02 to
20b6968
Compare
ae77e99 to
cd135c8
Compare
nikw9944
added a commit
that referenced
this pull request
Jun 5, 2026
The previous predicate counted a provision-only mid-run snapshot
(submit == activate > 0, deprovision_submit == deprovision_activate == 0)
as success — the (submit+deprovision_submit) > 0 guard was
satisfied by the provision count alone. That's misleading for runs
inspected before teardown completes, and the reporter is documented
as usable mid-run.
Tighten the predicate to require:
* deprovision actually happened (deprovision_submit > 0), and
* either submit == 0 (the documented deprovision-only re-run case)
or submit == deprovision_submit (every provisioned user was torn
down — and mid-deprovision stays "unfinished").
Add table-driven tests covering all seven shapes the detector has
to disambiguate: no events, provision-only (the bug), full cycle,
deprovision-only re-run, mid-provision, mid-deprovision, abort
sentinel.
Reported by @elitegreg on #3843.
Contributor
Author
|
Good catch — fixed in c9f9f98. You're right that the previous predicate's Tightened the predicate to require:
Added table-driven tests covering all seven shapes the detector has to disambiguate: no events, provision-only (your repro), full cycle, deprovision-only re-run, mid-provision, mid-deprovision, abort sentinel. #3844 has been rebased onto the new tip. |
nikw9944
added a commit
that referenced
this pull request
Jun 7, 2026
The previous predicate counted a provision-only mid-run snapshot
(submit == activate > 0, deprovision_submit == deprovision_activate == 0)
as success — the (submit+deprovision_submit) > 0 guard was
satisfied by the provision count alone. That's misleading for runs
inspected before teardown completes, and the reporter is documented
as usable mid-run.
Tighten the predicate to require:
* deprovision actually happened (deprovision_submit > 0), and
* either submit == 0 (the documented deprovision-only re-run case)
or submit == deprovision_submit (every provisioned user was torn
down — and mid-deprovision stays "unfinished").
Add table-driven tests covering all seven shapes the detector has
to disambiguate: no events, provision-only (the bug), full cycle,
deprovision-only re-run, mid-provision, mid-deprovision, abort
sentinel.
Reported by @elitegreg on #3843.
Three sets of changes that together let the stress harness drive a real
Arista EOS DUT to its hard cap of 1024 user tunnels without dropping
applied events, killing the agent mid-commit, or undercounting on the
observer side.
* Agent log parser (orchestrator-side). Adds a per-section state
machine that handles the real-EOS unified-diff shape ('interface
TunnelN' as a context line with '+ <property>' lines below it) in
addition to the existing cEOS '+interface TunnelN' shape. Emits two
new activity events alongside the existing pre_commit_log / applied
/ commit beats: EventConfigReceived (from 'Received N bytes...')
and EventCommitAborted (from 'session ... abort').
* Sweep timing. The quiescenceTracker grows a sticky pending-commit
flag (set on EventConfigReceived, cleared on the matching terminal
EventCommit / EventApplied / EventCommitAborted) so the post-
deprovision wait outlasts the multi-second diff-check window
between a config arrival and the next commit. A new
waitForAppliedToCatchUp blocks the provision→deprovision boundary
until the agent's applied count covers the orchestrator's
provisioned-user count, so deprovision doesn't start removing users
the device hasn't applied yet. Both waits honor explicit timeouts
(--agent-quiescence-timeout-seconds and the new
--apply-catch-up-timeout-seconds, default 300s). --no-agent runs
auto-zero the catch-up timeout since the noop runner emits no
applied events.
* Observer tunnel-counter. Switches the device_tunnel_gap sentinel's
sample command from 'show gre tunnel static' (only returns
statically-keyed routing-fabric tunnels — never user tunnels on
either platform tested) to 'show interfaces description' filtered
on a 'USER-UCAST-' description prefix. The new filter tracks the
controller's naming convention regardless of where in the Tunnel<N>
id space user tunnels land — robust against both the legacy 500+
range and the gm/tunnel-id-start-1 fix that allocates from 1.
End-to-end on physical hardware: 524-user runs on chi-dn-dzd5 + chi-
dn-dzd9 and 1023-user runs on both DUTs all completed with 100% on-
device coverage and clean teardown.
Adds --apply-per-batch-catch-up (default off) that pauses after every provision batch until the agent's applied count covers the cumulative target submitted so far. Reuses the existing waitForAppliedToCatchUp function and honors the same --apply-catch-up-timeout-seconds per batch. Use case: production users arrive at human cadence rather than in 32/64-user bursts, so this flag matches that shape — useful for measuring per-user latency under steady-state load instead of peak throughput under burst load. Off by default to preserve the original 'stress the agent' shape the harness was built for.
The wait used to fast-path on a tracker-state inference: if quiescenceTracker.lastEvent() reported no events, the wait skipped. That inference raced with the consumer goroutine that updates the tracker — an event could be sitting in the channel buffer (emitted by the agent runner but not yet picked up by consumeAgentEvents) while the main goroutine read tracker state and concluded 'no events.' The wait would then skip and the orchestrator would kill the SSH session mid-commit. Reproduces flakily under go test -count=100 -run TestRun_QuiescenceBlocksOnPendingCommit Replace the inference with an explicit Config.NoAgent signal that main.go sets when --no-agent is on, alongside the existing ApplyCatchUpTimeout=0 override. The wait short-circuits on cfg.NoAgent (no race possible — it's an operator-provided flag) and otherwise always enters the loop. The loop's existing exit condition (elapsed >= AgentQuietWindow && sinceLast >= AgentQuietWindow && !HasPendingCommit()) handles the 'no events yet' case correctly: lastEvent of zero yields a huge sinceLast, so the wait gates on elapsed-since-start, blocking for at least AgentQuietWindow before returning. A slow consumer goroutine gets the full quiet window to catch up. Reported by @elitegreg on #3841.
Five small ergonomics fixes for run-stress-physical.sh, all aimed at making first-run failures obvious instead of silent: * Fail fast on missing EAPI_PASS via :? rather than the empty default that silently produces 401-Unauthorized on every observer sample. * Default EAPI_USER to 'stress' (matched by a documented device-side 'username stress secret 0 stress') so the observer doesn't ride on the device's admin password. * CONTROLLER_BINARY env override — point the script at a prebuilt controller from another branch's worktree instead of go-running from the local checkout. Useful when bisecting a controller patch. * Per-run controller.log: move it into $RUN_DIR so each run's log survives, rather than being truncated at the next controller start. The README documents the new stress-user prereq, the 'management api gnmi / provider eos-native' stanza that real EOS needs to expose the agent's 127.0.0.1:9543 listener (replacing an earlier eapilocal recipe that turned out to be a guess), the CONTROLLER_BINARY knob, and the EAPI_USER default change.
Add a matching --apply-per-batch-catch-up CLI flag to run-stress-physical.sh so the operator can opt into the per-batch catch-up wait without setting orchestrator flags directly. Off by default, matching the orchestrator's default behavior.
Foundation for the device-reporter binary, split off as its own PR so the analyzer's data model can be reviewed without the CLI surface and rendering pulling focus. * parser package — LoadRun aggregates one stress-run directory: orchestrator-config.json, orchestrator-runlog.jsonl, the captured agent log (state machine that builds AgentCycle records from Received N lines/bytes / Committing / Configuration session finalized markers), and the abort sentinel. Missing individual artifacts leave the matching field nil; missing run directory is a hard error. * analyze package — BuildSummary turns one parser.Run into the rolled-up view the formatter renders. The two non-trivial joins are activate↔applied pairing by user_index (for the onchain→on- device gap) and agent-cycle↔applied-bucket pairing by shared finalize timestamp (for the per-cycle table). Three linearity fits: commit-duration vs config size, diff-check-duration vs config size, onchain→on-device-gap vs active-user count. Tests cover the parser shapes, the activate↔applied edge cases (missing applied, applied-without-activate, --no-agent run with no applied events at all), the per-cycle join's behavior across commit/abort/unfinished cycles, the join-mismatch warning, the diff-check duration computation, and the linearity fit against a perfectly-linear synthetic dataset.
The previous predicate counted a provision-only mid-run snapshot
(submit == activate > 0, deprovision_submit == deprovision_activate == 0)
as success — the (submit+deprovision_submit) > 0 guard was
satisfied by the provision count alone. That's misleading for runs
inspected before teardown completes, and the reporter is documented
as usable mid-run.
Tighten the predicate to require:
* deprovision actually happened (deprovision_submit > 0), and
* either submit == 0 (the documented deprovision-only re-run case)
or submit == deprovision_submit (every provisioned user was torn
down — and mid-deprovision stays "unfinished").
Add table-driven tests covering all seven shapes the detector has
to disambiguate: no events, provision-only (the bug), full cycle,
deprovision-only re-run, mid-provision, mid-deprovision, abort
sentinel.
Reported by @elitegreg on #3843.
20b6968 to
6b663e0
Compare
c9f9f98 to
cce53df
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stack: 3 of 4 — base #3842. Followed by [#stress-reporter-cli].
This PR is the data-analysis foundation for the new `device-reporter` binary. It contains no I/O surface (no CLI, no markdown writer, no script integration) — those land in #4 of the stack. Split this way so the analyzer's data model can be reviewed without rendering pulling focus.
Summary of Changes
Diff Breakdown
Pure additive — twelve new files, no edits to existing code. Densest PR in the stack but all in one mental space (pure data manipulation).
Key files (click to expand)
Testing Verification