Skip to content

tools/stress/device-reporter: add parser + analyzer#3843

Merged
nikw9944 merged 7 commits into
mainfrom
nikw9944/stress-reporter-analyze
Jun 10, 2026
Merged

tools/stress/device-reporter: add parser + analyzer#3843
nikw9944 merged 7 commits into
mainfrom
nikw9944/stress-reporter-analyze

Conversation

@nikw9944

@nikw9944 nikw9944 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Stack: 3 of 4 — base #3842. Followed by [#stress-reporter-cli].

This PR is the data-analysis foundation for the new `device-reporter` binary. It contains no I/O surface (no CLI, no markdown writer, no script integration) — those land in #4 of the stack. Split this way so the analyzer's data model can be reviewed without rendering pulling focus.

Summary of Changes

  • `pkg/parser` — `LoadRun` aggregates one stress-run directory: `orchestrator-config.json`, `orchestrator-runlog.jsonl`, the captured agent log (state machine that builds `AgentCycle` records from `Received N lines/bytes` / `Committing config session` / `Configuration session finalized` markers), and the abort sentinel. Missing individual artifacts are tolerated; a missing run directory is a hard error.
  • `pkg/analyze` — `BuildSummary` turns one `parser.Run` into the rolled-up view callers render. Two non-trivial joins: activate↔applied pairing by `user_index` (for the per-user onchain→on-device gap) and agent-cycle↔applied-bucket pairing by shared finalize timestamp (for the per-cycle table). Three linearity fits: commit-duration vs config size, diff-check-duration vs config size, onchain→on-device-gap vs active-user count. `commitCycles` surfaces a join-mismatch warning when the positional cycle↔bucket join is unbalanced.

Diff Breakdown

Category Files Lines (+/-) Net
Core logic 6 +1130 / -0 +1130
Scaffolding 2 +122 / -0 +122
Tests 4 +439 / -0 +439
Total 12 +1691 / -0 +1691

Pure additive — twelve new files, no edits to existing code. Densest PR in the stack but all in one mental space (pure data manipulation).

Key files (click to expand)

Testing Verification

  • Unit tests cover: parser shapes for both cEOS and real-EOS agent logs; the activate↔applied pairing edge cases (missing applied, applied-without-activate, --no-agent runs with no applied at all); the per-cycle join across commit/abort/unfinished cycle outcomes; the join-mismatch warning; the diff-check duration computation; the onchain→on-device linearity fit against a perfectly-linear synthetic dataset.

@elitegreg elitegreg left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[bug] BuildSummary can label an in-progress/provision-only run as success.

  • File: tools/stress/device-reporter/pkg/analyze/summary.go
  • Current success condition treats this as success:
    • submit == activate > 0
    • deprovision_submit == deprovision_activate == 0
  • Repro: a runlog with only provision submit/activate rows returns Outcome = "success".
  • This is misleading for partial runs (especially since the reporter is documented as usable mid-run). Consider requiring deprovision completion for provisioned runs.

@nikw9944 nikw9944 force-pushed the nikw9944/stress-physical-script-polish branch from 0f78a02 to 20b6968 Compare June 5, 2026 19:53
@nikw9944 nikw9944 force-pushed the nikw9944/stress-reporter-analyze branch from ae77e99 to cd135c8 Compare June 5, 2026 19:53
nikw9944 added a commit that referenced this pull request Jun 5, 2026
The previous predicate counted a provision-only mid-run snapshot
(submit == activate > 0, deprovision_submit == deprovision_activate == 0)
as success — the (submit+deprovision_submit) > 0 guard was
satisfied by the provision count alone. That's misleading for runs
inspected before teardown completes, and the reporter is documented
as usable mid-run.

Tighten the predicate to require:
  * deprovision actually happened (deprovision_submit > 0), and
  * either submit == 0 (the documented deprovision-only re-run case)
    or submit == deprovision_submit (every provisioned user was torn
    down — and mid-deprovision stays "unfinished").

Add table-driven tests covering all seven shapes the detector has
to disambiguate: no events, provision-only (the bug), full cycle,
deprovision-only re-run, mid-provision, mid-deprovision, abort
sentinel.

Reported by @elitegreg on #3843.
@nikw9944

nikw9944 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Good catch — fixed in c9f9f98.

You're right that the previous predicate's (submit + deprovision_submit) > 0 guard was satisfied by a provision-only mid-run snapshot, so a half-completed run got labelled "success."

Tightened the predicate to require:

  • deprovision actually happened (deprovision_submit > 0), and
  • either submit == 0 (the documented deprovision-only re-run case) or submit == deprovision_submit (every provisioned user was torn down — mid-deprovision stays "unfinished").

Added table-driven tests covering all seven shapes the detector has to disambiguate: no events, provision-only (your repro), full cycle, deprovision-only re-run, mid-provision, mid-deprovision, abort sentinel.

#3844 has been rebased onto the new tip.

nikw9944 added a commit that referenced this pull request Jun 7, 2026
The previous predicate counted a provision-only mid-run snapshot
(submit == activate > 0, deprovision_submit == deprovision_activate == 0)
as success — the (submit+deprovision_submit) > 0 guard was
satisfied by the provision count alone. That's misleading for runs
inspected before teardown completes, and the reporter is documented
as usable mid-run.

Tighten the predicate to require:
  * deprovision actually happened (deprovision_submit > 0), and
  * either submit == 0 (the documented deprovision-only re-run case)
    or submit == deprovision_submit (every provisioned user was torn
    down — and mid-deprovision stays "unfinished").

Add table-driven tests covering all seven shapes the detector has
to disambiguate: no events, provision-only (the bug), full cycle,
deprovision-only re-run, mid-provision, mid-deprovision, abort
sentinel.

Reported by @elitegreg on #3843.
nikw9944 added 7 commits June 8, 2026 18:38
Three sets of changes that together let the stress harness drive a real
Arista EOS DUT to its hard cap of 1024 user tunnels without dropping
applied events, killing the agent mid-commit, or undercounting on the
observer side.

* Agent log parser (orchestrator-side). Adds a per-section state
  machine that handles the real-EOS unified-diff shape ('interface
  TunnelN' as a context line with '+   <property>' lines below it) in
  addition to the existing cEOS '+interface TunnelN' shape. Emits two
  new activity events alongside the existing pre_commit_log / applied
  / commit beats: EventConfigReceived (from 'Received N bytes...')
  and EventCommitAborted (from 'session ... abort').

* Sweep timing. The quiescenceTracker grows a sticky pending-commit
  flag (set on EventConfigReceived, cleared on the matching terminal
  EventCommit / EventApplied / EventCommitAborted) so the post-
  deprovision wait outlasts the multi-second diff-check window
  between a config arrival and the next commit. A new
  waitForAppliedToCatchUp blocks the provision→deprovision boundary
  until the agent's applied count covers the orchestrator's
  provisioned-user count, so deprovision doesn't start removing users
  the device hasn't applied yet. Both waits honor explicit timeouts
  (--agent-quiescence-timeout-seconds and the new
  --apply-catch-up-timeout-seconds, default 300s). --no-agent runs
  auto-zero the catch-up timeout since the noop runner emits no
  applied events.

* Observer tunnel-counter. Switches the device_tunnel_gap sentinel's
  sample command from 'show gre tunnel static' (only returns
  statically-keyed routing-fabric tunnels — never user tunnels on
  either platform tested) to 'show interfaces description' filtered
  on a 'USER-UCAST-' description prefix. The new filter tracks the
  controller's naming convention regardless of where in the Tunnel<N>
  id space user tunnels land — robust against both the legacy 500+
  range and the gm/tunnel-id-start-1 fix that allocates from 1.

End-to-end on physical hardware: 524-user runs on chi-dn-dzd5 + chi-
dn-dzd9 and 1023-user runs on both DUTs all completed with 100% on-
device coverage and clean teardown.
Adds --apply-per-batch-catch-up (default off) that pauses after every
provision batch until the agent's applied count covers the cumulative
target submitted so far. Reuses the existing waitForAppliedToCatchUp
function and honors the same --apply-catch-up-timeout-seconds per
batch.

Use case: production users arrive at human cadence rather than in
32/64-user bursts, so this flag matches that shape — useful for
measuring per-user latency under steady-state load instead of peak
throughput under burst load. Off by default to preserve the original
'stress the agent' shape the harness was built for.
The wait used to fast-path on a tracker-state inference: if
quiescenceTracker.lastEvent() reported no events, the wait skipped.
That inference raced with the consumer goroutine that updates the
tracker — an event could be sitting in the channel buffer (emitted
by the agent runner but not yet picked up by consumeAgentEvents)
while the main goroutine read tracker state and concluded 'no events.'
The wait would then skip and the orchestrator would kill the SSH
session mid-commit. Reproduces flakily under

  go test -count=100 -run TestRun_QuiescenceBlocksOnPendingCommit

Replace the inference with an explicit Config.NoAgent signal that
main.go sets when --no-agent is on, alongside the existing
ApplyCatchUpTimeout=0 override. The wait short-circuits on
cfg.NoAgent (no race possible — it's an operator-provided flag) and
otherwise always enters the loop. The loop's existing exit condition
(elapsed >= AgentQuietWindow && sinceLast >= AgentQuietWindow &&
!HasPendingCommit()) handles the 'no events yet' case correctly:
lastEvent of zero yields a huge sinceLast, so the wait gates on
elapsed-since-start, blocking for at least AgentQuietWindow before
returning. A slow consumer goroutine gets the full quiet window to
catch up.

Reported by @elitegreg on #3841.
Five small ergonomics fixes for run-stress-physical.sh, all aimed at
making first-run failures obvious instead of silent:

* Fail fast on missing EAPI_PASS via :? rather than the empty
  default that silently produces 401-Unauthorized on every observer
  sample.
* Default EAPI_USER to 'stress' (matched by a documented device-side
  'username stress secret 0 stress') so the observer doesn't ride on
  the device's admin password.
* CONTROLLER_BINARY env override — point the script at a prebuilt
  controller from another branch's worktree instead of go-running
  from the local checkout. Useful when bisecting a controller patch.
* Per-run controller.log: move it into $RUN_DIR so each run's log
  survives, rather than being truncated at the next controller
  start.

The README documents the new stress-user prereq, the 'management api
gnmi / provider eos-native' stanza that real EOS needs to expose the
agent's 127.0.0.1:9543 listener (replacing an earlier eapilocal
recipe that turned out to be a guess), the CONTROLLER_BINARY knob,
and the EAPI_USER default change.
Add a matching --apply-per-batch-catch-up CLI flag to
run-stress-physical.sh so the operator can opt into the per-batch
catch-up wait without setting orchestrator flags directly. Off by
default, matching the orchestrator's default behavior.
Foundation for the device-reporter binary, split off as its own PR
so the analyzer's data model can be reviewed without the CLI surface
and rendering pulling focus.

* parser package — LoadRun aggregates one stress-run directory:
  orchestrator-config.json, orchestrator-runlog.jsonl, the captured
  agent log (state machine that builds AgentCycle records from
  Received N lines/bytes / Committing / Configuration session
  finalized markers), and the abort sentinel. Missing individual
  artifacts leave the matching field nil; missing run directory is a
  hard error.

* analyze package — BuildSummary turns one parser.Run into the
  rolled-up view the formatter renders. The two non-trivial joins
  are activate↔applied pairing by user_index (for the onchain→on-
  device gap) and agent-cycle↔applied-bucket pairing by shared
  finalize timestamp (for the per-cycle table). Three linearity
  fits: commit-duration vs config size, diff-check-duration vs
  config size, onchain→on-device-gap vs active-user count.

Tests cover the parser shapes, the activate↔applied edge cases
(missing applied, applied-without-activate, --no-agent run with no
applied events at all), the per-cycle join's behavior across
commit/abort/unfinished cycles, the join-mismatch warning, the
diff-check duration computation, and the linearity fit against a
perfectly-linear synthetic dataset.
The previous predicate counted a provision-only mid-run snapshot
(submit == activate > 0, deprovision_submit == deprovision_activate == 0)
as success — the (submit+deprovision_submit) > 0 guard was
satisfied by the provision count alone. That's misleading for runs
inspected before teardown completes, and the reporter is documented
as usable mid-run.

Tighten the predicate to require:
  * deprovision actually happened (deprovision_submit > 0), and
  * either submit == 0 (the documented deprovision-only re-run case)
    or submit == deprovision_submit (every provisioned user was torn
    down — and mid-deprovision stays "unfinished").

Add table-driven tests covering all seven shapes the detector has
to disambiguate: no events, provision-only (the bug), full cycle,
deprovision-only re-run, mid-provision, mid-deprovision, abort
sentinel.

Reported by @elitegreg on #3843.
@nikw9944 nikw9944 force-pushed the nikw9944/stress-physical-script-polish branch from 20b6968 to 6b663e0 Compare June 8, 2026 18:38
@nikw9944 nikw9944 force-pushed the nikw9944/stress-reporter-analyze branch from c9f9f98 to cce53df Compare June 8, 2026 18:38
@nikw9944 nikw9944 linked an issue Jun 8, 2026 that may be closed by this pull request
Base automatically changed from nikw9944/stress-physical-script-polish to main June 10, 2026 20:14
@nikw9944 nikw9944 enabled auto-merge (squash) June 10, 2026 20:15
@nikw9944 nikw9944 merged commit b16328c into main Jun 10, 2026
37 of 38 checks passed
@nikw9944 nikw9944 deleted the nikw9944/stress-reporter-analyze branch June 10, 2026 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

stress: device stress test per-run reporting tool

2 participants