
[fix] Fix race condition in BlameCollector: skip hang dump when testhost hasn't launched yet #16065

Open
nohwnd wants to merge 2 commits into main from fix/issue-15588-hang-dump-race-condition-5bee74e4b58b61b6

nohwnd commented May 25, 2026

Summary

Fixes #15588: HangDumpChildProcesses very flaky.

🤖 This is an automated fix generated by the Issue Triage workflow.

Root Cause

BlameCollector uses an inactivity timer to collect a hang dump after a configurable timeout. The timer starts in Initialize and is kicked (reset) by each TestCaseStart event. _testHostProcessId is set by TestHostLaunchedHandler, which is called when the testhost process actually starts.

The race condition: If the inactivity timer fires before TestHostLaunchedHandler is called (e.g., the timer has a very short timeout and testhost is slow to start, or in tests that use TimeSpan.Zero), _testHostProcessId is still 0. The code then tries to dump and kill PID 0, which is the Windows System Idle Process or the Linux Swapper process. On Windows this produces an empty/corrupt dump; on Linux it may fail silently or do nothing useful.

The existing tests used TimeSpan.Zero for the inactivity timer timeout, which fires on a ThreadPool thread immediately but asynchronously — creating a genuine race condition between the timer callback and the TestHostLaunched event raise.

Fix

Added an early-return guard in CollectDumpAndAbortTesthost when _testHostProcessId == 0:

if (_testHostProcessId == 0)
{
    EqtTrace.Warning("BlameCollector.CollectDumpAndAbortTesthost: Testhost has not launched yet, skipping hang dump collection.");
    _logger.LogWarning(_context.SessionDataCollectionContext, Resources.TestHostNotLaunchedCannotCollectHangDump);
    return;
}

This is safe because:

  • If testhost hasn't launched before the inactivity timer fires, the test run is already in a bad state.
  • _inactivityTimerAlreadyFired prevents the timer from ever firing again, so no dump will be collected from this session — which is the correct behavior since we don't have a valid PID.
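The interaction between the guard and the one-shot flag can be sketched in isolation. This is a minimal illustration, not the vstest code: the class name, the counters, and the use of Interlocked.Exchange for the latch are assumptions made for the example, and the real collector logs a warning instead of counting.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the collector; names and counters are illustrative only.
public sealed class HangDumpGuardSketch
{
    private volatile int _testHostProcessId;   // 0 until the host reports its PID
    private int _timerAlreadyFired;            // 0 = not fired yet, 1 = fired

    public int DumpAttempts { get; private set; }
    public int SkippedAttempts { get; private set; }

    // Plays the role that TestHostLaunchedHandler has in the description above.
    public void OnTestHostLaunched(int processId) => _testHostProcessId = processId;

    // Plays the role of the inactivity-timer callback.
    public void OnInactivityTimerFired()
    {
        // One-shot latch: only the first callback does any work.
        if (Interlocked.Exchange(ref _timerAlreadyFired, 1) == 1)
        {
            return;
        }

        if (_testHostProcessId == 0)
        {
            // No valid PID yet: skip instead of targeting PID 0.
            SkippedAttempts++;
            return;
        }

        DumpAttempts++;
    }
}
```

In this sketch, a callback that runs before OnTestHostLaunched is counted as skipped, and a later callback is ignored by the latch, which matches the "no dump from this session" behavior described in the second bullet.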

Testing

  • Added new unit test InitializeWithDumpForHangShouldSkipDumpIfTestHostHasNotLaunchedYet that verifies the guard works correctly.
  • Updated 3 existing tests that used TimeSpan.Zero (which has the same race condition as the production bug) to use 50ms and properly raise TestHostLaunched before the timer fires, making them deterministic.
  • All 45 blame collector unit tests pass.
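The ordering that makes the updated tests deterministic (publish the PID first, then let a timer with a non-zero due time fire, then wait on a signal) can be sketched without any vstest types. Everything below is hypothetical scaffolding: DeterministicTimerSketch and RunOnce are not part of the test suite, and a TaskCompletionSource stands in for whatever signal the real tests use.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class DeterministicTimerSketch
{
    // Returns the PID seen by the timer callback, or -1 if the callback
    // did not run within waitLimit.
    public static int RunOnce(int processId, TimeSpan dueTime, TimeSpan waitLimit)
    {
        int pidKnownToCollector = 0;
        var callbackResult = new TaskCompletionSource<int>();

        // Step 1: publish the PID first (stands in for raising TestHostLaunched).
        Volatile.Write(ref pidKnownToCollector, processId);

        // Step 2: only then arm a one-shot timer with a non-zero due time.
        using var timer = new Timer(
            _ => callbackResult.TrySetResult(Volatile.Read(ref pidKnownToCollector)),
            null,
            dueTime,
            Timeout.InfiniteTimeSpan);

        // Step 3: wait for the callback's signal rather than sleeping for a guessed duration.
        return callbackResult.Task.Wait(waitLimit) ? callbackResult.Task.Result : -1;
    }
}
```

Because the PID is written before the timer is created, the callback cannot observe 0 here regardless of how quickly the thread pool schedules it; the due time only controls when the callback runs, not what it sees.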

Localization

Added TestHostNotLaunchedCannotCollectHangDump resource string to all 13 .xlf files with state="new" as per vstest conventions.

🔍 Triaged by Issue Repro Triage & Auto-Fix 🔍

When the inactivity timer fires before the testhost process has launched,
_testHostProcessId is 0 (default int). Previously this caused
ProcessDumpUtility.StartHangBasedProcessDump to attempt to dump PID 0
(the Idle process on Windows / Swapper on Linux), resulting in an empty
or incorrect dump file.

The fix adds an early-return guard in CollectDumpAndAbortTesthost:
if _testHostProcessId == 0, log a warning and skip the dump/kill.

Also updates three existing hang dump unit tests to properly simulate
the happy-path scenario (testhost launches before the timer fires) by:
- Using a 50 ms timeout instead of 0 ms so the TestHostLaunched event
  can be raised before the timer callback runs
- Raising TestHostLaunched with PID 1234 before the timer fires

Adds a new test that verifies StartHangBasedProcessDump is NOT called
when the timer fires before TestHostLaunched.

Fixes #15588

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 25, 2026 14:07

Copilot AI left a comment


Pull request overview

Fixes a race in BlameCollector hang-dump collection where the inactivity timer can fire before TestHostLaunched sets the testhost PID, causing the collector to attempt a dump/kill against PID 0. The PR adds a guard to skip hang-dump collection when the testhost hasn’t launched yet, and updates unit tests/localized resources accordingly.

Changes:

  • Add early-return in CollectDumpAndAbortTesthost when _testHostProcessId == 0, with warning logs.
  • Make hang-dump unit tests deterministic by avoiding TimeSpan.Zero in several tests and explicitly raising TestHostLaunched.
  • Add localized resource string TestHostNotLaunchedCannotCollectHangDump across .resx, designer, and all 13 .xlf files.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| test/Microsoft.TestPlatform.Extensions.BlameDataCollector.UnitTests/BlameCollectorTests.cs | Updates hang-dump timer tests and adds a regression test for “timer fires before testhost launched”. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/BlameCollector.cs | Skips hang-dump collection when testhost PID is not yet known (PID 0). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.resx | Adds new resource string for the “testhost not launched” warning. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.Designer.cs | Adds strongly-typed accessor for the new resource string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.cs.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.de.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.es.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.fr.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.it.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ja.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ko.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.pl.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.pt-BR.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ru.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.tr.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.zh-Hans.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.zh-Hant.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
Files not reviewed (1)
  • src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.Designer.cs: Language not supported

Comment on lines +302 to +304
// Do NOT raise TestHostLaunched — _testHostProcessId stays 0.
warningLogged.Wait(1000, TestContext.CancellationToken);


  _blameDataCollector.Initialize(
-     GetDumpConfigurationElement(false, false, true, 0),
+     GetDumpConfigurationElement(false, false, true, 50),

  _blameDataCollector.Initialize(
-     GetDumpConfigurationElement(false, false, true, 0),
+     GetDumpConfigurationElement(false, false, true, 50),

  _blameDataCollector.Initialize(
-     GetDumpConfigurationElement(false, false, true, 0),
+     GetDumpConfigurationElement(false, false, true, 50),

nohwnd left a comment


Review: Fix race condition in BlameCollector

Dimensions activated: Crash & Hang Dump Reliability · Error Reporting & Diagnostic Clarity · Parallel Execution & Scheduling Safety

Summary

The fix is correct and well-placed. Setting _inactivityTimerAlreadyFired = true before the guard ensures the timer is permanently defused regardless of the early-return path, and the test coverage faithfully exercises the new code path with TimeSpan.Zero.

One finding: _testHostProcessId is not volatile. The new guard makes this field load-bearing across a thread boundary for the first time — the timer thread now returns early based on reading 0. On x86/x64 this is unlikely to matter in practice, but volatile int is a one-liner fix that makes the cross-thread contract explicit and covers ARM64. See the inline comment for details.

Description alignment

PR description matches the diff exactly. Localization coverage (13 XLF files + resx + Resources.Designer.cs) is complete.


🧠 Reviewed by Expert Code Reviewer 🧠

}

// If testhost has not launched yet, we cannot dump or kill it.
if (_testHostProcessId == 0)

[Crash & Hang Dump Reliability / Parallel Execution & Scheduling Safety] _testHostProcessId is not volatile

_testHostProcessId is written on the data-collection events thread in TestHostLaunchedHandler and is now read here, on the timer-callback thread, where its value decides whether the dump is skipped. The C# memory model does not guarantee that a plain int write on one thread is visible as a non-stale read on another without a memory barrier.

Without volatile (or Interlocked), the JIT is allowed to cache the 0 in a register and satisfy this guard with a stale value, causing the guard to fire and silently skip the hang dump even when testhost has already launched. On x86/x64 (TSO) this is unlikely to manifest, but it is technically a data race and can bite on ARM64 CI agents.

The field was already read without synchronization at lines 261, 271, 319, and 627 (pre-existing), but this new guard is the first place where the value of 0 has a meaningful semantic effect (early return vs. proceeding). Marking the field volatile would make the cross-thread visibility guarantee explicit and eliminate the theoretical false-positive:

// before
private int _testHostProcessId;

// after
private volatile int _testHostProcessId;

volatile int is safe here — the field is only ever assigned once (in TestHostLaunchedHandler) and only read elsewhere, so no compare-and-swap semantics are needed.
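For comparison, the same single-writer publication can be expressed with System.Threading.Volatile on a plain field. This is an alternative technique, not what the PR does, and PidPublisherSketch with its members is a hypothetical name used only for this illustration.

```csharp
using System.Threading;

// Hypothetical helper; shows explicit Volatile.Read/Write instead of a volatile field.
public sealed class PidPublisherSketch
{
    private int _processId; // 0 means "not launched yet"

    // Writer side (the events thread in the discussion above).
    public void Publish(int processId) => Volatile.Write(ref _processId, processId);

    // Reader side (the timer-callback thread in the discussion above).
    public bool TryGetProcessId(out int processId)
    {
        processId = Volatile.Read(ref _processId);
        return processId != 0;
    }
}
```

The volatile keyword applies these semantics to every access of the field, while the Volatile class makes each cross-thread access explicit at the call site; for a field that is written once and read in a handful of places, either form expresses the same contract.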


Done — marked _testHostProcessId as volatile to make the cross-thread visibility guarantee explicit and eliminate the theoretical false-positive on weakly-ordered CPUs (ARM64).

🔧 Iterated by PR Iteration Agent 🔧

Ensures the hang-dump guard added in the parent commit sees a fresh
value on the timer-callback thread (ARM64 / weakly-ordered CPUs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

nohwnd commented May 25, 2026

Commit pushed: 16a2275

🔧 Iterated by PR Iteration Agent 🔧


Development

Successfully merging this pull request may close these issues.

HangDumpChildProcesses very flaky

2 participants