fix: don't surface suspend as a FAILED outcome #492

Merged
yaythomas merged 1 commit into main from fix/suspend-pending-outcome on Jun 29, 2026

@yaythomas yaythomas commented Jun 25, 2026

Contributor

Closes #491

Problem

In the experimental OpenTelemetry plugin, a durable execution that suspended inside a child context (for example a child context that called context.wait(...)) produced a false fault on the child-context span in X-Ray, with a recorded exception named TimedSuspendExecution. The execution itself behaved correctly; only the instrumentation labeling was wrong.

Root cause

The core SDK reported a suspend to plugins as an ErrorObject carrying the concrete exception name (TimedSuspendExecution). UserFunctionOutcome.from_error matched only the base class name (SuspendExecution), so the suspend fell through to FAILED instead of being recognized. The OTEL plugin faithfully rendered FAILED as a span error.

Change

  • state.py: the except SuspendExecution branch now just re-raises. It no longer notifies plugins of the suspend.
  • The OTEL plugin's existing on_invocation_end sweep already iterates remaining open operation spans and ends them cleanly with no status set. That sweep now also handles the suspended context span, which ends as a non-fault span with no recorded exception. This matches how the JS SDK's OTEL plugin handles the same case.
  • The OTEL plugin's on_user_function_end PENDING branch is removed (it can no longer be reached, since suspends no longer dispatch through the end hook).
  • UserFunctionOutcome is reduced to SUCCEEDED and FAILED. from_error is simplified to None → SUCCEEDED, else FAILED. The fragile name-match against SuspendExecution.__name__ is gone.
  • durable.attempt.number and durable.attempt.outcome are no longer emitted on CONTEXT spans. These per-attempt fields are meaningful for STEP (each retry is an attempt) but not for CONTEXT. Dropping them aligns the emitted attributes across SDKs. STEP spans are unchanged.
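
The first three bullets can be sketched together. This is an illustration under stated assumptions, not the code in state.py or the OTEL plugin: `FakeSpan`, `SketchPlugin`, the `on_user_function_start` hook, and the signature of `wrap_user_function` are all hypothetical. Only the hook names `on_user_function_end` and `on_invocation_end` and the re-raise behavior come from the description above.

```python
class SuspendExecution(Exception):
    """Stand-in for the SDK's base suspend signal."""


class TimedSuspendExecution(SuspendExecution):
    """Stand-in for the concrete subclass raised by a timed wait."""


class FakeSpan:
    """Hypothetical span: records whether it ended and with what status."""

    def __init__(self, name):
        self.name = name
        self.ended = False
        self.status = None  # None means "no status set"

    def end(self):
        self.ended = True


class SketchPlugin:
    """Hypothetical plugin that tracks open operation spans by id."""

    def __init__(self):
        self.open_spans = {}
        self.end_calls = []

    def on_user_function_start(self, op_id):
        self.open_spans[op_id] = FakeSpan(op_id)

    def on_user_function_end(self, op_id, error):
        self.end_calls.append((op_id, error))
        span = self.open_spans.pop(op_id)
        span.status = "OK" if error is None else "ERROR"
        span.end()

    def on_invocation_end(self):
        # Sweep: end whatever is still open, without setting a status.
        swept = list(self.open_spans.values())
        for span in swept:
            span.end()
        self.open_spans.clear()
        return swept


def wrap_user_function(plugin, op_id, func):
    plugin.on_user_function_start(op_id)
    try:
        result = func()
    except SuspendExecution:
        raise  # suspend is not a failure: re-raise without notifying plugins
    except Exception as exc:
        plugin.on_user_function_end(op_id, exc)
        raise
    plugin.on_user_function_end(op_id, None)
    return result
```

In this sketch a suspending function leaves its span open, and the later `on_invocation_end` call ends it with `status` still `None`, which corresponds to the non-fault span described above.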

No public plugin API change: UserFunctionOutcome retains the two values plugins already used in practice; the PluginExecutor end hook is unchanged. The from __future__ import annotations cleanup in plugin.py (previously a separate commit) is folded in.
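
For reference, the reduced enum described above could look roughly like the following. The member values and the exact `from_error` signature are assumptions; only the two member names and the `None` → SUCCEEDED, else FAILED rule come from this PR.

```python
from enum import Enum


class UserFunctionOutcome(Enum):
    SUCCEEDED = "SUCCEEDED"  # string values are an assumption
    FAILED = "FAILED"

    @classmethod
    def from_error(cls, error):
        # No name matching: any error object at all means FAILED.
        return cls.SUCCEEDED if error is None else cls.FAILED
```

Because suspends no longer reach this code path, the mapping does not need to know about suspend exceptions at all.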

Testing

  • New regression test_wrap_user_function_suspend_does_not_fire_end_hook asserts the end hook is never invoked when a wrapped function raises TimedSuspendExecution. On main, the end hook fires with outcome=FAILED, so this test would fail; with this change, captured stays empty.
  • New OTEL test test_context_span_omits_attempt_attributes asserts a CONTEXT span carries neither durable.attempt.number nor durable.attempt.outcome.
  • TestUserFunctionOutcomeValues pins the enum membership to {SUCCEEDED, FAILED} to catch accidental regressions.
  • Core SDK suite: 1275 passed. OTEL package suite: 49 passed. Typecheck clean (75 source files). hatch fmt --check clean across both packages.
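
The CONTEXT-span check can be illustrated with a small self-contained sketch. `attempt_attributes` is a hypothetical helper, not a function from the OTEL package, and the string operation types are an assumption; the two attribute keys are the ones named in this PR.

```python
def attempt_attributes(operation_type, attempt_number, outcome):
    """Return per-attempt span attributes, which apply to STEP spans only."""
    if operation_type != "STEP":
        # CONTEXT (and any other type) carries no per-attempt fields.
        return {}
    return {
        "durable.attempt.number": attempt_number,
        "durable.attempt.outcome": outcome,
    }


def test_context_span_omits_attempt_attributes():
    attrs = attempt_attributes("CONTEXT", 1, "SUCCEEDED")
    assert "durable.attempt.number" not in attrs
    assert "durable.attempt.outcome" not in attrs


def test_step_span_keeps_attempt_attributes():
    attrs = attempt_attributes("STEP", 2, "FAILED")
    assert attrs["durable.attempt.number"] == 2
    assert attrs["durable.attempt.outcome"] == "FAILED"
```

The two test functions follow the pytest naming convention and can also be called directly.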

Verified on a deployed durable function

The fix has been deployed to a test Lambda (us-west-2) with a child context that contains a wait. Across three invocations of one execution, in a fresh X-Ray trace, all operation spans (including the suspended child-context, its child-wait, and the continuation child-context on the resuming invocation) close cleanly with no fault and no recorded exception. A JS SDK demo function in the same account produces the equivalent non-fault span tree for cross-SDK confirmation.

@yaythomas yaythomas added the otel-plugin related to the otel-plugin package label Jun 25, 2026
@yaythomas yaythomas moved this from Backlog to In review in aws-durable-execution Jun 25, 2026
SilanHe previously approved these changes Jun 25, 2026
@github-project-automation github-project-automation Bot moved this from In review to Pending merge in aws-durable-execution Jun 25, 2026
@yaythomas yaythomas force-pushed the fix/suspend-pending-outcome branch from 57c9bd8 to 05dd505 Compare June 29, 2026 18:41
@yaythomas yaythomas changed the title fix: report suspends as PENDING to plugins fix: stop surfacing suspend as a FAILED outcome to plugins Jun 29, 2026

yaythomas commented Jun 29, 2026

Contributor Author

Updated this PR with a different fix approach after review.

The previous version explicitly signaled the suspend to plugins with a new UserFunctionOutcome.PENDING value and an on_user_function_suspend hook. This version instead aligns with the JS SDK's mechanism: don't fire the end hook on suspend at all, and let the OTEL plugin's existing on_invocation_end sweep close any remaining open operation spans cleanly.

This drops the public PENDING outcome, simplifies the plumbing, and matches JS exactly.

Same observable fix (no fault on the suspended child-context span). The new approach is verified end-to-end on a deployed test Lambda: all operation spans close non-fault, and the JS demo produces the equivalent span tree.

| Expectation | Result |
| --- | --- |
| Execution completes correctly | {"a":"a","b":"b","child":"c1-c2"} ✓ |
| All spans closed | 46/46, 0 unclosed ✓ |
| No fault on any plugin span (incl. suspended child-context) | NONE ✓ |
| CONTEXT spans have NO durable.attempt.number | NONE ✓ |
| CONTEXT spans have NO durable.attempt.outcome | NONE ✓ |
| STEP spans STILL have durable.attempt.number | step-a, step-b, child-step-1, child-step-2 ✓ |
| STEP spans STILL have durable.attempt.outcome | same four ✓ |

A user function that suspends (for example a child context that waits)
was reported to instrumentation plugins as a FAILED outcome, so the OTEL
plugin recorded the suspend as a span error and X-Ray surfaced a fault
on the child-context span. The execution itself behaved correctly; only
the instrumentation labeling was wrong.

Drop the plugin notification on suspend so that on_user_function_end no
longer fires for the suspending invocation. The OTEL plugin's existing
on_invocation_end sweep already ends all remaining operation spans
cleanly (no status, no fault).

Also drop durable.attempt.number and durable.attempt.outcome from
CONTEXT spans. These per-attempt fields are meaningful for STEP (each
retry is an attempt) but not for CONTEXT, and dropping them aligns the
emitted attributes across SDKs. STEP spans are unchanged.

No public plugin API change: UserFunctionOutcome retains the SUCCEEDED
and FAILED values plugins already used.

Closes #491
@yaythomas yaythomas force-pushed the fix/suspend-pending-outcome branch from 05dd505 to d0c2bdb Compare June 29, 2026 18:53
@yaythomas yaythomas changed the title fix: stop surfacing suspend as a FAILED outcome to plugins fix: don't surface suspend as a FAILED outcome Jun 29, 2026
@yaythomas yaythomas merged commit be30636 into main Jun 29, 2026
67 of 68 checks passed
@github-project-automation github-project-automation Bot moved this from Pending merge to Done in aws-durable-execution Jun 29, 2026
@yaythomas yaythomas deleted the fix/suspend-pending-outcome branch June 29, 2026 21:52

Labels

otel-plugin related to the otel-plugin package

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

OTEL plugin marks child-context spans as faults when they suspend (timed wait misclassified as FAILED)

3 participants