fix: prevent scheduler timeout hangs #498

Merged
zhongkechen merged 4 commits into main from codex/fix-scheduler-timeout-hang on Jun 30, 2026
@zhongkechen zhongkechen commented Jun 30, 2026

Contributor

Summary

  • Prevent local durable test waits from hanging indefinitely when the scheduler event loop is blocked.
  • Keep local scheduler invocations serialized while moving blocking scheduled callables off the event loop.
  • Close SQLite test-store connections deterministically to reduce Python 3.13/3.14 cleanup warnings.
  • Fix the remaining Python 3.13/3.14 slowness in test_with_retry_callback_fails_twice_then_succeeds by making wait_for_callback wait until the callback-creating invocation has settled before returning the callback id.
  • Add assertions proving the with-retry callback example used explicit callback failures for attempts 1 and 2, and success for attempt 3.

Root Cause

The original flaky CI hang was caused by the scheduler timeout mechanism depending entirely on the scheduler event loop. Event.wait(timeout=...) scheduled asyncio.wait_for(event.wait(), timeout) onto that loop and then called future.result() with no outer timeout. If the loop was blocked by a scheduled callable, the inner asyncio.wait_for timeout could never run, so the test thread could wait forever.
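The hang and the caller-side guard can be sketched in a few lines. This is a minimal illustration, assuming Python 3.10 or newer; `start_loop_thread`, the `grace` parameter, and this `wait_for_event` signature are hypothetical stand-ins rather than the SDK's actual code.

```python
import asyncio
import concurrent.futures
import threading
import time


def start_loop_thread() -> asyncio.AbstractEventLoop:
    """Run a fresh event loop on a daemon thread, like a background scheduler."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


def wait_for_event(
    loop: asyncio.AbstractEventLoop,
    event: asyncio.Event,
    timeout: float,
    grace: float = 0.5,
) -> bool:
    """Wait for `event` from a non-loop thread without blocking past timeout + grace."""
    # The inner timeout needs the loop to be responsive in order to fire.
    future = asyncio.run_coroutine_threadsafe(
        asyncio.wait_for(event.wait(), timeout), loop
    )
    try:
        # The outer timeout is enforced on the calling thread, so it still
        # fires when the loop is stuck inside a blocking callable.
        return future.result(timeout=timeout + grace)
    except (asyncio.TimeoutError, concurrent.futures.TimeoutError):
        future.cancel()
        return False
```

Blocking the loop with `loop.call_soon_threadsafe(time.sleep, 1.0)` and then calling `wait_for_event(loop, event, timeout=0.1, grace=0.1)` returns `False` after roughly 0.2 seconds. With a bare `future.result()`, the caller would instead stay blocked for as long as the loop is.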

The callback concurrency example made this worse because handler invocation was represented as an async callable even though it performed blocking work and had no await points. The scheduler therefore ran the handler on the event loop. Moving that blocking work off-loop exposed another important constraint: local invocations must remain serialized, otherwise overlapping replays for the same execution can leave old handler threads blocked in concurrent durable operations. The scheduler now uses a single worker thread for synchronous scheduled functions, which keeps the event loop available for timeouts while preserving one-at-a-time invocation semantics.
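The routing described above might look roughly like the following sketch. `run_scheduled`, `demo`, and `blocking_handler` are hypothetical names chosen for illustration; only the single-worker default executor, the `durable-scheduler` thread name, and the use of `asyncio.to_thread` come from this PR.

```python
import asyncio
import inspect
import threading
import time
from concurrent.futures import ThreadPoolExecutor


async def run_scheduled(func, *args):
    """Await coroutine functions directly; send plain callables to the default executor."""
    if inspect.iscoroutinefunction(func):
        return await func(*args)
    return await asyncio.to_thread(func, *args)


async def demo():
    loop = asyncio.get_running_loop()
    # One worker keeps synchronous invocations strictly one at a time.
    loop.set_default_executor(
        ThreadPoolExecutor(max_workers=1, thread_name_prefix="durable-scheduler")
    )
    seen = []
    ticks = 0

    def blocking_handler(tag):
        time.sleep(0.05)  # stands in for blocking handler work
        seen.append((tag, threading.current_thread().name))
        return tag

    async def heartbeat():
        # Keeps ticking only if the event loop itself is not blocked.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    hb = asyncio.create_task(heartbeat())
    results = await asyncio.gather(
        run_scheduled(blocking_handler, "a"),
        run_scheduled(blocking_handler, "b"),
    )
    hb.cancel()
    return results, seen, ticks
```

Running `asyncio.run(demo())` shows both handlers executing in submission order on the same `durable-scheduler` worker thread, while the heartbeat keeps advancing because the loop stays free to service timers.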

Python 3.13/3.14 Follow-Up

After the scheduler fix, Python 3.13 and 3.14 were still slow on the final example test. A completed CI run showed the remaining delay inside test_with_retry_callback_fails_twice_then_succeeds:

  • Python 3.13: wait_for_condition passed at 21:41:43, final with_retry_callback passed at 21:44:29, about 166s.
  • Python 3.14: wait_for_condition passed at 21:41:19, final with_retry_callback passed at 21:44:08, about 169s.
  • Python 3.11: the same gap was about 8s.

Focused local debugging on Python 3.13 exposed the underlying race. The test runner returned a callback id as soon as a CallbackStarted event appeared. The test then sent a callback failure immediately, while the original invocation could still be checkpointing the wait_for_callback submitter step. That callback response advanced the checkpoint token underneath the still-running invocation, causing Invalid checkpoint token / BackgroundThreadError: Checkpoint creation failed warnings. Python 3.13/3.14 scheduling made this race much more likely, which explains the extra warning volume and the long fallback/retry behavior.

The fix changes wait_for_callback to return only after history contains an InvocationCompleted event after the matching CallbackStarted event. That means the callback id is exposed to tests once the invocation that created it has settled, so external callback responses no longer race the submitter checkpoint. The with-retry callback test now also verifies the first two callback operations are FAILED and the third is SUCCEEDED, guarding against silently passing via timeout fallback.
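The readiness rule can be illustrated with a simplified history model. `HistoryEvent`, `settled_callback_id`, and `poll_for_settled_callback` are hypothetical names invented for this sketch; the real runner's event types and polling code may look quite different.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class HistoryEvent:
    """Hypothetical, simplified history record; the real event shape may differ."""
    event_type: str
    callback_id: Optional[str] = None


def settled_callback_id(history: Sequence[HistoryEvent]) -> Optional[str]:
    """Return a callback id only once an InvocationCompleted follows its CallbackStarted."""
    pending = None
    for event in history:
        if event.event_type == "CallbackStarted" and pending is None:
            pending = event.callback_id
        elif event.event_type == "InvocationCompleted" and pending is not None:
            return pending
    return None


def poll_for_settled_callback(
    fetch_history: Callable[[], Sequence[HistoryEvent]],
    timeout: float = 5.0,
    poll_interval: float = 0.05,
) -> str:
    """Poll history until the callback-creating invocation has settled."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        callback_id = settled_callback_id(fetch_history())
        if callback_id is not None:
            return callback_id
        time.sleep(poll_interval)
    raise TimeoutError("callback did not become ready in time")
```

A history that contains only CallbackStarted yields `None`, so a test cannot send a callback response while the submitter step may still be checkpointing. An InvocationCompleted that appears before the CallbackStarted does not count either.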

Changes

  • Convert executor handler invocation from an async callable to a synchronous callable so the scheduler routes it through its worker thread.
  • Give the scheduler a single-worker default executor named durable-scheduler and use asyncio.to_thread for synchronous scheduled work.
  • Change scheduler shutdown to await cancelled tasks, call loop.shutdown_default_executor(), then stop the loop.
  • Add a caller-thread timeout around future.result(...) in wait_for_event so blocked event loops cannot hang the test thread forever.
  • Replace deprecated asyncio.iscoroutinefunction usage with inspect.iscoroutinefunction.
  • Close direct SQLite connections in store tests with contextlib.closing(...) so ResourceWarnings do not accumulate on Python 3.13/3.14.
  • Make wait_for_callback wait for the callback-creating invocation to complete before returning the callback id.
  • Add regression coverage for blocked-loop timeouts, scheduler worker cleanup, SQLite connection cleanup, callback readiness polling, and with-retry callback attempt statuses.
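The shutdown ordering listed above (await cancelled tasks, shut down the default executor, then stop the loop) could be sketched as follows. `LoopThread` is a hypothetical wrapper written for this illustration and is not the scheduler class from the PR.

```python
import asyncio
import threading


class LoopThread:
    """Hypothetical wrapper around an event loop running on its own thread."""

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self.thread.start()

    async def _drain(self) -> None:
        current = asyncio.current_task()
        tasks = [t for t in asyncio.all_tasks() if t is not current]
        for task in tasks:
            task.cancel()
        # Await the cancelled tasks so their cleanup runs before the loop stops.
        await asyncio.gather(*tasks, return_exceptions=True)
        # Join the default executor's worker threads while the loop is still running.
        await self.loop.shutdown_default_executor()

    def stop(self, timeout: float = 5.0) -> None:
        asyncio.run_coroutine_threadsafe(self._drain(), self.loop).result(timeout)
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join(timeout)
        self.loop.close()
```

Stopping the loop only after the drain step means `finally` blocks in cancelled tasks get a chance to run and no executor worker thread outlives the scheduler.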

Evidence

Verification

  • hatch fmt --check
  • hatch run types:check
  • git diff --check
  • /private/tmp/dex-py313-debug/bin/python -m pytest packages/aws-durable-execution-sdk-python-examples/test/with_retry/test_with_retry_callback.py::test_with_retry_callback_fails_twice_then_succeeds -q -o log_cli=true --log-cli-level=WARNING - 1 passed in 6.36s with no warning-level callback race logs
  • hatch run test:all packages/aws-durable-execution-sdk-python-examples/test/with_retry/test_with_retry_callback.py::test_with_retry_callback_fails_twice_then_succeeds packages/aws-durable-execution-sdk-python-testing/tests/runner_test.py packages/aws-durable-execution-sdk-python-testing/tests/stores/sqlite_store_test.py -q - Python 3.14, 116 passed
  • hatch run test:all packages/aws-durable-execution-sdk-python-testing/tests/executor_test.py packages/aws-durable-execution-sdk-python-testing/tests/scheduler_test.py -q - Python 3.14, 144 passed

@zhongkechen zhongkechen force-pushed the codex/fix-scheduler-timeout-hang branch from 965e673 to 171f159 on June 30, 2026 18:55
@zhongkechen zhongkechen marked this pull request as draft June 30, 2026 19:05
zhongkechen commented Jun 30, 2026

Contributor Author

Tests ran much slower under Python 3.13 and 3.14 than under 3.11 and 3.12:

https://github.com/aws/aws-durable-execution-sdk-python/actions/runs/28468575210/job/84374754519?pr=498

This might be related to the following warnings.

packages/aws-durable-execution-sdk-python-testing/tests/web/e2e/routes_arn_encoding_int_test.py::test_get_durable_execution_decodes_slash_in_arn
  /home/runner/.local/share/hatch/env/virtual/aws-durable-execution-sdk-python-monorepo/f1Q9feq1/test/lib/python3.13/site-packages/botocore/client.py:594: ResourceWarning: unclosed database in <sqlite3.Connection object at 0x7fd1767c23e0>
    def _create_api_method(
  Enable tracemalloc to get traceback where the object was allocated.
  See https://docs.pytest.org/en/stable/how-to/capture-warnings.html#resource-warnings for more info.
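For context on this warning: `with sqlite3.connect(...)` on its own only commits or rolls back the transaction and leaves the connection open, so wrapping the connection in `contextlib.closing` is one way to close it deterministically. A minimal sketch follows, in which `count_rows` and the `executions` table are hypothetical and do not reflect the store's real schema.

```python
import sqlite3
from contextlib import closing


def count_rows(db_path: str) -> int:
    """Open, query, and deterministically close a SQLite connection."""
    # closing() calls conn.close() on exit, even if the body raises.
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS executions (id TEXT PRIMARY KEY)")
        conn.execute("INSERT OR IGNORE INTO executions VALUES ('demo')")
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM executions").fetchone()[0]
```

Because the connection is closed as soon as the block exits, nothing is left for the garbage collector to finalize later and report as an unclosed database.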

@zhongkechen zhongkechen force-pushed the codex/fix-scheduler-timeout-hang branch from 18c97aa to 15bbf3f on June 30, 2026 19:28
@zhongkechen zhongkechen force-pushed the codex/fix-scheduler-timeout-hang branch from 1d71a23 to 3c6c145 on June 30, 2026 22:24
@zhongkechen zhongkechen marked this pull request as ready for review June 30, 2026 22:30
@zhongkechen zhongkechen merged commit ddfcc9a into main Jun 30, 2026
71 checks passed
@zhongkechen zhongkechen deleted the codex/fix-scheduler-timeout-hang branch June 30, 2026 23:21