
test(evm): cap child_process.exec in lib.js to surface stalled commands #3526

Merged: wen-coding merged 7 commits into main from wen/exec_command_timeout on Jun 3, 2026

wen-coding commented May 31, 2026

Summary

execCommand in contracts/test/lib.js wraps child_process.exec with no timeout. The Autobahn EVM integration matrix has hit multiple 30‑minute job timeouts where the hardhat process went silent at a test boundary, with the orphan‑process listing at cleanup consistently showing a stalled docker exec ... seid ... child still alive. The job then consumed its entire wall‑clock budget instead of surfacing a useful error.

This caps execCommand with a configurable timeout (default 60s, SIGKILL). On expiry, the rejection includes the offending command so the next occurrence is a 60s actionable error instead of a 30‑minute mystery. Override via EXEC_TIMEOUT_MS for tests that legitimately need longer.
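
As a rough illustration of the change described above, a timeout-capped wrapper might look like the following. This is a minimal sketch, not the merged code from contracts/test/lib.js: the `timeoutMs` parameter, the resolve-with-stdout behavior, and the exact message wording are assumptions. Only the 60s default, the SIGKILL signal, and the inclusion of the command in the rejection come from the description; `timeout` and `killSignal` are standard `child_process.exec` options.

```javascript
const { exec } = require("child_process");

// 60s default as described in the PR; the constant name is hypothetical.
const DEFAULT_TIMEOUT_MS = 60_000;

function execCommand(command, timeoutMs = DEFAULT_TIMEOUT_MS) {
  return new Promise((resolve, reject) => {
    // Once `timeout` elapses, Node sends `killSignal` to the child process.
    exec(command, { timeout: timeoutMs, killSignal: "SIGKILL" }, (error, stdout) => {
      if (!error) {
        resolve(stdout);
        return;
      }
      // error.killed marks a kill issued by Node itself; requiring an unset
      // error.code keeps other failures from being reported as timeouts.
      if (error.killed && !error.code) {
        reject(new Error(`execCommand timed out after ${timeoutMs}ms: ${command}`));
        return;
      }
      // Any other failure is passed through unchanged.
      reject(error);
    });
  });
}
```

With a wrapper of this shape, a stalled command rejects after the cap with a message naming the command, while ordinary failures keep their original error.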

Observed hang pattern (7 occurrences in 3 days)

| Date (UTC) | Run | Affected Autobahn matrix job | Branch |
| --- | --- | --- | --- |
| 05‑28 23:48 | 26608835541 | EVM Interoperability | chore branch |
| 05‑29 00:06 | 26609440164 | EVM Interoperability | chore branch |
| 05‑29 06:00 | 26620669539 | EVM Interoperability | chore branch |
| 05‑29 14:19 | 26642258916 | EVM Interoperability | merge_queue #3506 |
| 05‑29 16:22 | 26648904362 | EVM Interoperability | main push |
| 05‑29 16:29 | 26648905384 | EVM Interoperability | merge_queue #3511 |
| 05‑31 01:00 | 26699345153 | EVM Module | PR #3525 |

Each ran for roughly 30 minutes, which is the job timeout. Investigation of one of the artifacts (26648905384) showed that the validators stayed healthy and kept producing blocks the whole time, but no test-sender activity reached them during the silence. In other words, the hardhat process was alive but never sent another RPC, which is consistent with a stalled child process in execCommand.

Trade-off

  • Healthy docker exec calls in the suite complete in ~100–200ms; 60s is comfortably above the worst observed normal case.
  • If a specific test legitimately needs longer (e.g. genesis bootstrap), EXEC_TIMEOUT_MS overrides per-invocation.
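
One way the EXEC_TIMEOUT_MS override mentioned above could be resolved is sketched below. The helper name `resolveTimeoutMs`, the fallback on invalid input, and reading the variable at call time are assumptions for illustration; the PR only states that the variable overrides the 60s default.

```javascript
// Default from the PR description (60s).
const DEFAULT_EXEC_TIMEOUT_MS = 60_000;

// Hypothetical helper: returns the timeout in milliseconds, preferring
// EXEC_TIMEOUT_MS when it holds a positive number and falling back to the
// default when it is unset, empty, non-numeric, or not positive.
function resolveTimeoutMs(env = process.env) {
  const parsed = Number(env.EXEC_TIMEOUT_MS);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_EXEC_TIMEOUT_MS;
}
```

Under these assumptions, a slow step such as genesis bootstrap could be run with, for example, `EXEC_TIMEOUT_MS=180000` set in the environment of the test invocation.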

Test plan

  • CI: full integration matrix passes (60s is well above normal).
  • The next time this stall pattern hits, the failing job will report a 60s execCommand timed out … <command> error instead of a 30‑min cancel — that error itself becomes the diagnostic.

🤖 Generated with Claude Code

The Autobahn EVM integration matrix has hit multiple 30-minute job
timeouts where hardhat went silent between test files with no error.
The orphan-process listing at cleanup consistently shows a docker exec
(running `seid tx evm send` or similar) still alive — `child_process.exec`
has no timeout by default, so a stalled CLI invocation eats the entire
job budget instead of surfacing as a clear error.

Add a default 60s timeout to execCommand with a SIGKILL kill signal.
On expiry, throw an error containing the offending command so the next
occurrence is a 60s actionable error instead of a 30-minute mystery.
Override via EXEC_TIMEOUT_MS for tests that legitimately need longer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor Bot commented May 31, 2026

PR Summary

Low Risk
Test-only change to integration helper timeouts; no production auth, chain, or runtime behavior.

Overview
Adds a 60s default wall-clock cap (override via EXEC_TIMEOUT_MS) on child_process.exec in execCommand, with SIGKILL on expiry, so stalled docker exec … seid … calls fail fast with a message that includes the command instead of hanging until the CI job hits ~30 minutes.

Timeout failures are detected only when Node reports error.killed and error.code is unset, so ERR_CHILD_PROCESS_STDIO_MAXBUFFER and external kills are not mislabeled as timeouts.
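
The gating described above can be sketched as a small predicate. The function name `isTimeoutKill` and the hand-built error objects are illustrative assumptions; the field combinations follow the cases described in this PR rather than captured output.

```javascript
// Hypothetical predicate: true only when Node killed the child itself
// (error.killed) and no error code was attached (so not a maxBuffer overflow).
function isTimeoutKill(error) {
  return Boolean(error && error.killed && !error.code);
}

// Hand-built error shapes mirroring the cases discussed in the PR.
const cases = {
  timeout: { killed: true, code: null, signal: "SIGKILL" },
  maxBuffer: { killed: true, code: "ERR_CHILD_PROCESS_STDIO_MAXBUFFER" },
  externalKill: { killed: false, code: null, signal: "SIGKILL" },
  nonZeroExit: { killed: false, code: 1, signal: null },
};

for (const [name, error] of Object.entries(cases)) {
  console.log(`${name}: ${isTimeoutKill(error)}`);
}
```

Only the `timeout` case evaluates to true; the buffer overflow, the external kill, and the ordinary non-zero exit all fall through to their original errors.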

Reviewed by Cursor Bugbot for commit 8ec7aa0.


github-actions Bot commented May 31, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Jun 2, 2026, 8:16 PM |

Comment thread contracts/test/lib.js

codecov Bot commented May 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.30%. Comparing base (1878cfa) to head (8ec7aa0).


@@            Coverage Diff             @@
##             main    #3526      +/-   ##
==========================================
- Coverage   59.10%   58.30%   -0.80%     
==========================================
  Files        2213     2140      -73     
  Lines      182814   174318    -8496     
==========================================
- Hits       108046   101633    -6413     
+ Misses      65038    63664    -1374     
+ Partials     9730     9021     -709     
| Flag | Coverage Δ |
| --- | --- |
| sei-db | 70.41% <ø> (ø) |
| sei-db-state-db | ? |

Flags with carried forward coverage won't be shown.
see 110 files with indirect coverage changes


Per cursor bugbot: error.signal is set on any signal-terminated process
(external OOM kill, runner cleanup, etc.), not just timeout kills.
error.killed is set only when Node itself killed the child, so it is the
more precise indicator. Narrowing the condition prevents non-timeout
signal deaths from being mis-reported as "execCommand timed out".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.


Reviewed by Cursor Bugbot for commit c1920c4.

Comment thread contracts/test/lib.js
wen-coding and others added 5 commits June 1, 2026 13:39
Per cursor bugbot: Node sets error.killed=true for two cases — the
timeout kill we want to attribute, and a maxBuffer overflow which
also has error.code = 'ERR_CHILD_PROCESS_STDIO_MAXBUFFER'. Gate the
timeout branch on !error.code so a buffer overflow falls through to
its original error instead of being mis-reported as a timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 200-char cap was an arbitrary defensive choice in the initial
commit, not a bot-driven change. Realistic docker-exec-seid invocations
in this codebase typically run 250-350 chars before any encoded
calldata, so the cap truncated exactly the args needed to identify
which command stalled — defeating the diagnostic the timeout exists to
provide. Test code, no persistent storage, no risk surface to a long
error string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wen-coding wen-coding requested review from amir-deris and masih June 2, 2026 20:39
@wen-coding wen-coding added this pull request to the merge queue Jun 3, 2026
Merged via the queue into main with commit 2151ba0 Jun 3, 2026
55 checks passed
@wen-coding wen-coding deleted the wen/exec_command_timeout branch June 3, 2026 01:47

3 participants