fix: prevent scheduler timeout hangs #498
Merged
Tests ran much slower under Python 3.13 and 3.14 than under 3.11 and 3.12. This might be related to the following warnings.
wangyb-A
approved these changes
Jun 30, 2026
Summary
- Fix `test_with_retry_callback_fails_twice_then_succeeds` by making `wait_for_callback` wait until the callback-creating invocation has settled before returning the callback id.

Root Cause
The original flaky CI hang was caused by the scheduler timeout mechanism depending entirely on the scheduler event loop. `Event.wait(timeout=...)` scheduled `asyncio.wait_for(event.wait(), timeout)` onto that loop and then called `future.result()` with no outer timeout. If the loop was blocked by a scheduled callable, the inner `asyncio.wait_for` timeout could never run, so the test thread could wait forever.

The callback concurrency example made this worse because handler invocation was represented as an async callable even though it performed blocking work and had no await points. The scheduler therefore ran the handler on the event loop. Moving that blocking work off-loop exposed another important constraint: local invocations must remain serialized, otherwise overlapping replays for the same execution can leave old handler threads blocked in concurrent durable operations. The scheduler now uses a single worker thread for synchronous scheduled functions, which keeps the event loop available for timeouts while preserving one-at-a-time invocation semantics.
Python 3.13/3.14 Follow-Up
After the scheduler fix, Python 3.13 and 3.14 were still slow at the final example test. A completed CI run showed the remaining delay inside `test_with_retry_callback_fails_twice_then_succeeds`:

- `wait_for_condition` passed at 21:41:43, final `with_retry_callback` passed at 21:44:29, about 166s.
- `wait_for_condition` passed at 21:41:19, final `with_retry_callback` passed at 21:44:08, about 169s.

Focused local 3.13 debugging showed the root race. The test runner returned a callback id as soon as a `CallbackStarted` event appeared. The test then sent a callback failure immediately, while the original invocation could still be checkpointing the `wait_for_callback` submitter step. That callback response advanced the checkpoint token underneath the still-running invocation, causing `Invalid checkpoint token` / `BackgroundThreadError: Checkpoint creation failed` warnings. Python 3.13/3.14 scheduling made this race much more likely, which explains the extra warning volume and the long fallback/retry behavior.

The fix changes `wait_for_callback` to return only after history contains an `InvocationCompleted` event after the matching `CallbackStarted` event. The callback id is therefore exposed to tests only once the invocation that created it has settled, so external callback responses no longer race the submitter checkpoint. The with-retry callback test now also verifies that the first two callback operations are `FAILED` and the third is `SUCCEEDED`, guarding against silently passing via timeout fallback.

Changes
- `durable-scheduler` and use `asyncio.to_thread` for synchronous scheduled work.
- `loop.shutdown_default_executor()`, then stop the loop.
- Add an outer timeout to `future.result(...)` in `wait_for_event` so blocked event loops cannot hang the test thread forever.
- Replace `asyncio.iscoroutinefunction` usage with `inspect.iscoroutinefunction`.
- Use `contextlib.closing(...)` so ResourceWarnings do not accumulate on Python 3.13/3.14.
- Make `wait_for_callback` wait for the callback-creating invocation to complete before returning the callback id.

Evidence
Verification
- `hatch fmt --check`
- `hatch run types:check`
- `git diff --check`
- `/private/tmp/dex-py313-debug/bin/python -m pytest packages/aws-durable-execution-sdk-python-examples/test/with_retry/test_with_retry_callback.py::test_with_retry_callback_fails_twice_then_succeeds -q -o log_cli=true --log-cli-level=WARNING` - 1 passed in 6.36s with no warning-level callback race logs
- `hatch run test:all packages/aws-durable-execution-sdk-python-examples/test/with_retry/test_with_retry_callback.py::test_with_retry_callback_fails_twice_then_succeeds packages/aws-durable-execution-sdk-python-testing/tests/runner_test.py packages/aws-durable-execution-sdk-python-testing/tests/stores/sqlite_store_test.py -q` - Python 3.14, 116 passed
- `hatch run test:all packages/aws-durable-execution-sdk-python-testing/tests/executor_test.py packages/aws-durable-execution-sdk-python-testing/tests/scheduler_test.py -q` - Python 3.14, 144 passed
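For reviewers who want a concrete picture of the `wait_for_callback` ordering rule described under "Python 3.13/3.14 Follow-Up", here is a minimal sketch of the history check. `HistoryEvent`, its fields, and `settled_callback_id` are hypothetical names used for illustration only; the real history schema and runner code may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class HistoryEvent:
    """Hypothetical event record; the real history schema may differ."""

    event_type: str
    callback_id: Optional[str] = None


def settled_callback_id(history: Sequence[HistoryEvent]) -> Optional[str]:
    """Return the first callback id whose creating invocation has completed.

    Returns None while no InvocationCompleted event follows the first
    CallbackStarted event, so a polling caller keeps waiting.
    """
    for index, event in enumerate(history):
        if event.event_type == "CallbackStarted":
            settled = any(
                later.event_type == "InvocationCompleted"
                for later in history[index + 1:]
            )
            return event.callback_id if settled else None
    return None


# The invocation that created the callback is still running: keep waiting.
racing = [HistoryEvent("CallbackStarted", "cb-1")]
# The same history after that invocation has settled: safe to expose the id.
settled = racing + [HistoryEvent("InvocationCompleted")]

print(settled_callback_id(racing))   # None
print(settled_callback_id(settled))  # cb-1
```

An `InvocationCompleted` event that appears before `CallbackStarted` does not count, which is the ordering the fix relies on: only a completion recorded after the callback was created shows that the submitter checkpoint is no longer in flight.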