fix(telemetry): track preflight error for telemetry#1403
Conversation
|
Claude Security Review: no high-confidence findings. (run) |
Package TarballHow to installgh release download pr-1403-tarball --repo aws/agentcore-cli --pattern "*.tgz" --dir /tmp/pr-tarball
npm install -g /tmp/pr-tarball/aws-agentcore-0.15.0.tgz |
b8e3200 to
cbcc5ca
Compare
|
Claude Security Review: no high-confidence findings. (run) |
agentcore-cli-automation
left a comment
There was a problem hiding this comment.
Nice cleanup — consolidating all the setPhase('error') + isRunningRef.current = false paths through failPreflight() is a clear improvement, and the new telemetry on preflight failure plugs a real gap.
One thing worth discussing before merging (inline comment on cancelTeardown).
| const cancelTeardown = useCallback(() => { | ||
| setPhase('error'); | ||
| isRunningRef.current = false; | ||
| failPreflight(new Error('Teardown cancelled')); |
There was a problem hiding this comment.
cancelTeardown now routes through failPreflight(new Error('Teardown cancelled')), which means user-initiated cancellation at the teardown-confirm prompt will be emitted to telemetry as a deploy failure with error_name: 'UnknownError' and error_source: 'unknown'. Pre-PR this path was silent.
A couple of concerns:
- Signal pollution — every user who cancels a teardown will now show up in error metrics indistinguishably from real failures. Since
'UnknownError'/'unknown'are also where uncategorized real bugs land, this makes those buckets noisier. - It's not really a failure — the user made a deliberate choice to abort.
A few options:
- Skip telemetry emission entirely when the error is a user cancellation (e.g., have
failPreflighttake a flag, or havecancelTeardownreset phase without going through the error path). - Introduce a dedicated
UserCancelledError(subclass ofBaseError) and add'UserCancelledError'to theErrorNameenum insrc/cli/telemetry/schemas/common-shapes.ts, so cancellations are at least distinguishable in dashboards. - Add a separate "cancelled" exit_reason — bigger change, probably overkill for this PR.
Same consideration applies more broadly to the other failPreflight(new Error('...')) call sites (e.g., 'Some OAuth providers failed to set up', stack-status messages) — those are real failures, so they should be telemetered, but they'll all classify as UnknownError/unknown. Less urgent, but worth tracking if error-bucket signal matters.
There was a problem hiding this comment.
good call, this will be used else so we should model.
|
Claude Security Review: no high-confidence findings. (run) |
Description
Problem
The deploy TUI isn't emitting telemetry when the preflight steps fail. This is because the telemetry wrapper is only wrapping the post-preflight logic.
Solution
Ideally, we would wrap all logic in a telemetry wrapper, but the deploy flow leverages React hooks that are not linear. Specifically, when preflight steps finish it triggers a hook that invokes the main deployment. Because they are disjointed flows via events, we need another solution.
The solution proposed here is to store the error from the earlier failure, and bail fast and emit when we get into the core deploy flow. This allows us to emit the correct information, but loses the duration of the failure from the preflight. We could theoretically store this in another Ref and then emit directly, but this significantly increases complexity, since we now need a way to override duration on the metric and need to manage a timer in a react hook that may trigger multiple times.
I also feel fairly certain that we will revisit this deploy code for a significant refactor as part of future work since its current flow is difficult to understand and extend.
Testing
ran deploy w/o AWS credentials and saw the error pop up in telemetry.
Then error in file is AWSCredentialsError
Related Issue
Closes #
Documentation PR
N/A
Type of Change
Testing
How have you tested the change?
npm run test:unitandnpm run test:integnpm run typechecknpm run lintsrc/assets/, I rannpm run test:update-snapshotsand committed the updated snapshotsChecklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the
terms of your choice.