[fix] Fix race condition in BlameCollector: skip hang dump when testhost hasn't launched yet #16065
nohwnd wants to merge 2 commits
When the inactivity timer fires before the testhost process has launched, _testHostProcessId is 0 (default int). Previously this caused ProcessDumpUtility.StartHangBasedProcessDump to attempt to dump PID 0 (the Idle process on Windows / Swapper on Linux), resulting in an empty or incorrect dump file.

The fix adds an early-return guard in CollectDumpAndAbortTesthost: if _testHostProcessId == 0, log a warning and skip the dump/kill.

Also updates three existing hang dump unit tests to properly simulate the happy-path scenario (testhost launches before the timer fires) by:
- Using a 50 ms timeout instead of 0 ms so the TestHostLaunched event can be raised before the timer callback runs
- Raising TestHostLaunched with PID 1234 before the timer fires

Adds a new test that verifies StartHangBasedProcessDump is NOT called when the timer fires before TestHostLaunched.

Fixes #15588

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Fixes a race in BlameCollector hang-dump collection where the inactivity timer can fire before TestHostLaunched sets the testhost PID, causing the collector to attempt a dump/kill against PID 0. The PR adds a guard to skip hang-dump collection when the testhost hasn’t launched yet, and updates unit tests/localized resources accordingly.
Changes:
- Add early-return in `CollectDumpAndAbortTesthost` when `_testHostProcessId == 0`, with warning logs.
- Make hang-dump unit tests deterministic by avoiding `TimeSpan.Zero` in several tests and explicitly raising `TestHostLaunched`.
- Add localized resource string `TestHostNotLaunchedCannotCollectHangDump` across `.resx`, designer, and all 13 `.xlf` files.
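
The timing change in the tests can be illustrated with a small stand-alone sketch. It is not the vstest test code: `DeterministicTimerSketch`, `Run`, and its parameters are hypothetical names, and a plain `System.Threading.Timer` stands in for the collector's inactivity timer. It shows why a zero timeout races with the launch event, while a short non-zero timeout leaves room to record the PID first.

```csharp
using System;
using System.Threading;

// Hypothetical illustration; names are not taken from the vstest repository.
public static class DeterministicTimerSketch
{
    // Starts a one-shot timer and returns the PID its callback observed.
    public static int Run(TimeSpan timeout, bool raiseLaunchFirst)
    {
        int pid = 0;        // stands in for _testHostProcessId
        int observed = -1;  // value seen by the timer callback

        using var fired = new ManualResetEventSlim(false);
        using var timer = new Timer(_ =>
        {
            observed = Volatile.Read(ref pid);
            fired.Set();
        }, null, timeout, Timeout.InfiniteTimeSpan);

        if (raiseLaunchFirst)
        {
            // Simulates raising TestHostLaunched with PID 1234 right after
            // initialization. With TimeSpan.Zero the callback may already be
            // running here; with a non-zero timeout this write normally wins.
            Volatile.Write(ref pid, 1234);
        }

        // Wait for the callback so the result is not read too early.
        fired.Wait(TimeSpan.FromSeconds(5));
        return observed;
    }
}
```

Calling `Run(TimeSpan.FromMilliseconds(50), true)` mirrors the updated happy-path tests, and `Run(TimeSpan.Zero, false)` mirrors the new regression test in which the launch event is never raised.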
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/Microsoft.TestPlatform.Extensions.BlameDataCollector.UnitTests/BlameCollectorTests.cs | Updates hang-dump timer tests and adds a regression test for “timer fires before testhost launched”. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/BlameCollector.cs | Skips hang-dump collection when testhost PID is not yet known (PID 0). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.resx | Adds new resource string for the “testhost not launched” warning. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.Designer.cs | Adds strongly-typed accessor for the new resource string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.cs.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.de.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.es.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.fr.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.it.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ja.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ko.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.pl.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.pt-BR.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ru.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.tr.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.zh-Hans.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.zh-Hant.xlf | Adds TestHostNotLaunchedCannotCollectHangDump localization entry (state=new). |
Files not reviewed (1)
- src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.Designer.cs: Language not supported
// Do NOT raise TestHostLaunched — _testHostProcessId stays 0.
warningLogged.Wait(1000, TestContext.CancellationToken);

_blameDataCollector.Initialize(
-    GetDumpConfigurationElement(false, false, true, 0),
+    GetDumpConfigurationElement(false, false, true, 50),

_blameDataCollector.Initialize(
-    GetDumpConfigurationElement(false, false, true, 0),
+    GetDumpConfigurationElement(false, false, true, 50),

_blameDataCollector.Initialize(
-    GetDumpConfigurationElement(false, false, true, 0),
+    GetDumpConfigurationElement(false, false, true, 50),
nohwnd left a comment
Review: Fix race condition in BlameCollector
Dimensions activated: Crash & Hang Dump Reliability · Error Reporting & Diagnostic Clarity · Parallel Execution & Scheduling Safety
Summary
The fix is correct and well-placed. Setting _inactivityTimerAlreadyFired = true before the guard ensures the timer is permanently defused regardless of the early-return path, and the test coverage faithfully exercises the new code path with TimeSpan.Zero.
One finding: _testHostProcessId is not volatile. The new guard makes this field load-bearing across a thread boundary for the first time — the timer thread now returns early based on reading 0. On x86/x64 this is unlikely to matter in practice, but volatile int is a one-liner fix that makes the cross-thread contract explicit and covers ARM64. See the inline comment for details.
Description alignment
PR description matches the diff exactly. Localization coverage (13 XLF files + resx + Resources.Designer.cs) is complete.
🧠 Reviewed by Expert Code Reviewer
}

// If testhost has not launched yet, we cannot dump or kill it.
if (_testHostProcessId == 0)
[Crash & Hang Dump Reliability / Parallel Execution & Scheduling Safety] — _testHostProcessId is not volatile
_testHostProcessId is written on the data-collection events thread in TestHostLaunchedHandler and now load-bearing read here on the timer-callback thread. The C# memory model does not guarantee that a plain int write on one thread is visible as a non-stale read on another without a memory barrier.
Without volatile (or Interlocked), the JIT is allowed to cache the 0 in a register and satisfy this guard with a stale value, causing the guard to fire and silently skip the hang dump even when testhost has already launched. On x86/x64 (TSO) this is unlikely to manifest, but it is technically a data race and can bite on ARM64 CI agents.
The field was already read without synchronization at lines 261, 271, 319, and 627 (pre-existing), but this new guard is the first place where the value of 0 has a meaningful semantic effect (early return vs. proceeding). Marking the field volatile would make the cross-thread visibility guarantee explicit and eliminate the theoretical false-positive:
// before
private int _testHostProcessId;
// after
private volatile int _testHostProcessId;

`volatile int` is safe here: the field is only ever assigned once (in TestHostLaunchedHandler) and only read elsewhere, so no compare-and-swap semantics are needed.
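
A minimal sketch of the single-writer, cross-thread-reader pattern this comment relies on is shown below. `VolatilePidSketch`, `WaitForPid`, and the static field are hypothetical names for illustration, not code from the PR; the only point is that a `volatile` field is re-read on each loop iteration, so the reader cannot keep using a stale `0`.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration; not taken from BlameCollector.
public static class VolatilePidSketch
{
    // Written once by one thread, read by another. 'volatile' forbids the
    // JIT from hoisting the read out of the loop below.
    private static volatile int s_testHostProcessId;

    public static int WaitForPid()
    {
        s_testHostProcessId = 0;

        // Reader thread: spins until the PID becomes visible.
        Task<int> reader = Task.Run(() =>
        {
            var spin = new SpinWait();
            while (s_testHostProcessId == 0)
            {
                spin.SpinOnce();
            }
            return s_testHostProcessId;
        });

        // Writer thread (here: the caller) publishes the PID exactly once.
        s_testHostProcessId = 1234;
        return reader.Result;
    }
}
```

Because there is one writer and the value is only compared against `0`, a volatile read is enough here; `Interlocked` would matter only if several threads competed to update the field.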
Done — marked _testHostProcessId as volatile to make the cross-thread visibility guarantee explicit and eliminate the theoretical false-positive on weakly-ordered CPUs (ARM64).
🔧 Iterated by PR Iteration Agent 🔧
Ensures the hang-dump guard added in the parent commit sees a fresh value on the timer-callback thread (ARM64 / weakly-ordered CPUs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Commit pushed:
Summary
Fixes #15588: `HangDumpChildProcesses` very flaky.

Root Cause
`BlameCollector` uses an inactivity timer to collect a hang dump after a configurable timeout. The timer starts in `Initialize` and is kicked (reset) by each `TestCaseStart` event. `_testHostProcessId` is set by `TestHostLaunchedHandler`, which is called when the testhost process actually starts.

The race condition: if the inactivity timer fires before `TestHostLaunchedHandler` is called (e.g., the timer has a very short timeout and testhost is slow to start, or in tests that use `TimeSpan.Zero`), `_testHostProcessId` is still `0`. The code then tries to dump and kill PID 0, which is the Windows System Idle Process or the Linux Swapper process. On Windows this produces an empty/corrupt dump; on Linux it may fail silently or do nothing useful.

The existing tests used `TimeSpan.Zero` for the inactivity timer timeout, which fires on a ThreadPool thread immediately but asynchronously, creating a genuine race condition between the timer callback and the `TestHostLaunched` event raise.

Fix
Added an early-return guard in `CollectDumpAndAbortTesthost` that logs a warning and skips the dump/kill when `_testHostProcessId == 0`.

This is safe because `_inactivityTimerAlreadyFired` prevents the timer from ever firing again, so no dump will be collected from this session, which is the correct behavior since we don't have a valid PID.

Testing
- Added `InitializeWithDumpForHangShouldSkipDumpIfTestHostHasNotLaunchedYet`, which verifies the guard works correctly.
- Updated the three existing hang dump tests that used `TimeSpan.Zero` (which has the same race condition as the production bug) to use `50ms` and properly raise `TestHostLaunched` before the timer fires, making them deterministic.

Localization
Added the `TestHostNotLaunchedCannotCollectHangDump` resource string to all 13 `.xlf` files with `state="new"` as per vstest conventions.
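
The guard described under Fix might look roughly like the following sketch. It is a simplified, hypothetical stand-in rather than the vstest source: the class `HangDumpGuardSketch`, its counters, and `OnTestHostLaunched` are invented for illustration, and the re-entry check at the top of the method is an assumption. Only the names `_testHostProcessId`, `_inactivityTimerAlreadyFired`, and `CollectDumpAndAbortTesthost` come from the PR.

```csharp
using System;

// Hypothetical, simplified stand-in for BlameCollector.
public sealed class HangDumpGuardSketch
{
    private volatile int _testHostProcessId;     // 0 until the testhost launches
    private bool _inactivityTimerAlreadyFired;

    public int DumpsRequested { get; private set; }
    public int WarningsLogged { get; private set; }

    // Stand-in for TestHostLaunchedHandler: records the testhost PID.
    public void OnTestHostLaunched(int processId) => _testHostProcessId = processId;

    // Stand-in for the inactivity-timer callback.
    public void CollectDumpAndAbortTesthost()
    {
        // Assumption: a fired timer is never handled twice.
        if (_inactivityTimerAlreadyFired)
        {
            return;
        }

        // Set the flag before the guard so the early return below still
        // defuses the timer for the rest of the session.
        _inactivityTimerAlreadyFired = true;

        // If testhost has not launched yet, there is no valid PID to dump
        // or kill; log a warning and skip.
        if (_testHostProcessId == 0)
        {
            WarningsLogged++;
            return;
        }

        // The real collector would start the hang dump and kill the
        // testhost here; the sketch only counts the request.
        DumpsRequested++;
    }
}
```

With this shape, a callback that runs before launch records one warning and no dump request, and any later call is a no-op because the flag is already set.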