fix: don't surface suspend as a FAILED outcome#492
Merged
Conversation
SilanHe
previously approved these changes
Jun 25, 2026
57c9bd8 to
05dd505
Compare
Contributor
Author
|
Updated this PR with a different fix approach after review. The previous version explicitly signaled the suspend to plugins with a new This drops the public Same observable fix (no fault on the suspended child-context span). New approach is verified end-to-end on a deployed test Lambda, all operation spans close non-fault, JS demo produces the equivalent span tree.
|
A user function that suspends (for example a child context that waits) was reported to instrumentation plugins as a FAILED outcome, so the OTEL plugin recorded the suspend as a span error and X-Ray surfaced a fault on the child-context span. The execution itself behaved correctly; only the instrumentation labeling was wrong. Drop the plugin notification on suspend so that on_user_function_end no longer fires for the suspending invocation. The OTEL plugin's existing on_invocation_end sweep already ends all remaining operation spans cleanly (no status, no fault). Also drop durable.attempt.number and durable.attempt.outcome from CONTEXT spans. These per-attempt fields are meaningful for STEP (each retry is an attempt) but not for CONTEXT, and dropping them aligns the emitted attributes across SDKs. STEP spans are unchanged. No public plugin API change: UserFunctionOutcome retains the SUCCEEDED and FAILED values plugins already used. Closes #491
05dd505 to
d0c2bdb
Compare
SilanHe
approved these changes
Jun 29, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #491
Problem
In the experimental OpenTelemetry plugin, a durable execution that suspended inside a child context (for example a child context that called
context.wait(...)) produced a false fault on thechild-contextspan in X-Ray, with a recorded exception namedTimedSuspendExecution. The execution itself behaved correctly; only the instrumentation labeling was wrong.Root cause
The core SDK reported a suspend to plugins as an
ErrorObjectcarrying the concrete exception name (TimedSuspendExecution).UserFunctionOutcome.from_errormatched only the base class name (SuspendExecution), so the suspend fell through toFAILEDinstead of being recognized. The OTEL plugin faithfully renderedFAILEDas a span error.Change
state.py: theexcept SuspendExecutionbranch now just re-raises. It no longer notifies plugins of the suspend.on_invocation_endsweep already iterates remaining open operation spans and ends them cleanly with no status set. That sweep now also handles the suspended context span, which ends as a non-fault span with no recorded exception. This matches how the JS SDK's OTEL plugin handles the same case.on_user_function_endPENDING branch is removed (it can no longer be reached, since suspends no longer dispatch through the end hook).UserFunctionOutcomeis reduced toSUCCEEDEDandFAILED.from_erroris simplified toNone → SUCCEEDED, elseFAILED. The fragile name-match againstSuspendExecution.__name__is gone.durable.attempt.numberanddurable.attempt.outcomeare no longer emitted onCONTEXTspans. These per-attempt fields are meaningful forSTEP(each retry is an attempt) but not forCONTEXT. Dropping them aligns the emitted attributes across SDKs.STEPspans are unchanged.No public plugin API change:
UserFunctionOutcomeretains the two values plugins already used in practice; thePluginExecutorend hook is unchanged. Thefrom __future__ import annotationscleanup inplugin.py(previously a separate commit) is folded in.Testing
test_wrap_user_function_suspend_does_not_fire_end_hookasserts the end hook is never invoked when a wrapped function raisesTimedSuspendExecution. Onmain, the end hook fires withoutcome=FAILED, so this test would fail; with this change,capturedstays empty.test_context_span_omits_attempt_attributesasserts a CONTEXT span carries neitherdurable.attempt.numbernordurable.attempt.outcome.TestUserFunctionOutcomeValuespins the enum membership to{SUCCEEDED, FAILED}to catch accidental regressions.hatch fmt --checkclean across both packages.Verified on a deployed durable function
The fix has been deployed to a test Lambda (us-west-2) with a child context that contains a wait. Three invocations of one execution, fresh X-Ray trace: all operation spans (including the suspended
child-context, itschild-wait, and the continuationchild-contexton the resuming invocation) close cleanly with no fault and no recorded exception. JS-SDK demo function in the same account produces the equivalent non-fault span tree for cross-SDK confirmation.