[fix] Fix race condition: skip hang dump when testhost hasn't launched yet#16064
[fix] Fix race condition: skip hang dump when testhost hasn't launched yet#16064nohwnd wants to merge 4 commits into
Conversation
When the inactivity timer fires before the testhost process has launched, _testHostProcessId is 0 (default int). Previously this caused ProcessDumpUtility.StartHangBasedProcessDump to attempt to dump PID 0 (the Idle process on Windows / Swapper on Linux), resulting in an empty or incorrect dump file. The fix adds an early-return guard in CollectDumpAndAbortTesthost: if _testHostProcessId == 0, log a warning and skip the dump/kill. Also updates three existing hang dump unit tests to properly simulate the happy-path scenario (testhost launches before the timer fires) by: - Using a 50 ms timeout instead of 0 ms so the TestHostLaunched event can be raised before the timer callback runs - Raising TestHostLaunched with PID 1234 before the timer fires Adds a new test that verifies StartHangBasedProcessDump is NOT called when the timer fires before TestHostLaunched. Fixes #15588 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR hardens the Blame Data Collector’s hang-dump path by preventing hang dump collection (and the subsequent kill attempt) when the testhost hasn’t launched yet and the tracked PID is still 0, avoiding accidental targeting of PID 0 (Idle/Swapper).
Changes:
- Add an early-return guard in
CollectDumpAndAbortTesthostwhen_testHostProcessId == 0, logging a warning and skipping dump/kill. - Add a new localized resource string for the “testhost not launched” warning and propagate it to all BlameDataCollector
.xlffiles. - Update/add unit tests to cover the “skip when not launched” scenario and to raise
TestHostLaunchedbefore hang timer fires in existing tests.
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| test/Microsoft.TestPlatform.Extensions.BlameDataCollector.UnitTests/BlameCollectorTests.cs | Updates hang-dump timer tests and adds a new test ensuring dump is skipped when PID is still 0. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/BlameCollector.cs | Adds PID==0 guard in hang-dump timeout callback to skip dump/kill and log a warning. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.resx | Adds a new resource string for the “testhost not launched” hang-dump warning. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.Designer.cs | Adds the strongly-typed resource accessor for the new warning string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.cs.xlf | Adds localization entry for the new warning string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.de.xlf | Adds localization entry for the new warning string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.es.xlf | Adds localization entry for the new warning string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.fr.xlf | Adds localization entry for the new warning string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.it.xlf | Adds localization entry for the new warning string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ja.xlf | Adds localization entry for the new warning string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ko.xlf | Adds localization entry for the new warning string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.pl.xlf | Adds localization entry for the new warning string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.pt-BR.xlf | Adds localization entry for the new warning string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ru.xlf | Adds localization entry for the new warning string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.tr.xlf | Adds localization entry for the new warning string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.zh-Hans.xlf | Adds localization entry for the new warning string. |
| src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.zh-Hant.xlf | Adds localization entry for the new warning string. |
Files not reviewed (1)
- src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.Designer.cs: Language not supported
| // Signal that fires once the timer callback completes (warning logged). | ||
| var warningLogged = new ManualResetEventSlim(); | ||
| _mockLogger | ||
| .Setup(x => x.LogWarning(It.IsAny<DataCollectionContext>(), It.IsAny<string>())) | ||
| .Callback(() => warningLogged.Set()); | ||
|
|
||
| _blameDataCollector.Initialize( | ||
| GetDumpConfigurationElement(false, false, true, 0), | ||
| _mockDataColectionEvents.Object, | ||
| _mockDataCollectionSink.Object, | ||
| _mockLogger.Object, | ||
| _context); | ||
|
|
||
| // Do NOT raise TestHostLaunched — _testHostProcessId stays 0. | ||
| warningLogged.Wait(1000, TestContext.CancellationToken); | ||
|
|
||
| // The hang dump must not have been attempted against PID 0. | ||
| _mockProcessDumpUtility.Verify( | ||
| x => x.StartHangBasedProcessDump(It.IsAny<int>(), It.IsAny<string>(), It.IsAny<bool>(), It.IsAny<string>(), It.IsAny<Action<string>>()), | ||
| Times.Never); |
nohwnd
left a comment
There was a problem hiding this comment.
Review Summary
The core fix is correct: adding a _testHostProcessId == 0 guard in CollectDumpAndAbortTesthost prevents a real bug where the hang-dump timer fires before the testhost process launches, causing PID 0 (the OS Idle/Swapper process) to be targeted for a dump and kill. The resource strings and all XLF files are properly updated.
One finding worth addressing:
The three updated "happy path" tests (InitializeWithDumpForHangShouldCaptureADumpOnTimeout and siblings) now carry a timing-dependent assumption: that the test thread raises TestHostLaunched within 50 ms of calling Initialize(). This creates a small but non-zero probability of flakiness on loaded CI agents. See the inline comment for details and a suggested mitigation.
No other issues found. The new InitializeWithDumpForHangShouldSkipDumpIfTestHostHasNotLaunchedYet test correctly validates the fix using a warning-callback sentinel, and the resource/XLF additions are complete and follow the repo conventions.
🧠 Reviewed by Expert Code Reviewer
🧠 Reviewed by Expert Code Reviewer 🧠
| _mockDataColectionEvents.Raise(x => x.TestHostLaunched += null, new TestHostLaunchedEventArgs(_dataCollectionContext, 1234)); | ||
|
|
||
| hangBasedDumpcollected.Wait(2000, TestContext.CancellationToken); | ||
| _mockProcessDumpUtility.Verify(x => x.StartHangBasedProcessDump(It.IsAny<int>(), It.IsAny<string>(), It.IsAny<bool>(), It.IsAny<string>(), It.IsAny<Action<string>>()), Times.Once); |
There was a problem hiding this comment.
[Threading] Potential test flakiness — timer may fire before TestHostLaunched is raised
The comment says "Simulate testhost launching before the timer fires", but there is no synchronization guarantee here. Initialize() starts a 50 ms timer on the threadpool, then control immediately returns to the test thread. If the test thread is preempted for >50 ms between lines 185 and 192 (e.g., on a loaded CI agent), the timer fires first, sees _testHostProcessId == 0, logs the warning and returns early — and StartHangBasedProcessDump is never called, causing the Times.Once assertion to fail.
The same race applies to the other two tests updated with the same pattern (InitializeWithDumpForHangShouldCaptureKillTestHostOnTimeoutEvenIfGetDumpFileFails and InitializeWithDumpForHangShouldCaptureKillTestHostOnTimeoutEvenIfAttachingToDumpFails).
A more robust pattern is to block the timer callback until after TestHostLaunched has been raised:
// Raise TestHostLaunched BEFORE starting the timer (or gate the timer via a semaphore)
// Option A: raise the event synchronously before Initialize sets up the timer
_mockDataColectionEvents.Raise(x => x.TestHostLaunched += null,
new TestHostLaunchedEventArgs(_dataCollectionContext, 1234));
_blameDataCollector.Initialize(
GetDumpConfigurationElement(false, false, true, 50), ...);Or, if the event handler must be wired via Initialize, increase the timeout to a value that gives the test thread ample time even on loaded CI (e.g. 500 ms), acknowledging this is still probabilistic.
There was a problem hiding this comment.
Good catch. Fixed in the latest commit by eliminating the real timer entirely from those 3 happy-path tests:
- Made
CollectDumpAndAbortTesthostprivate protected - Added
TriggerTimerCallback()toTestableBlameCollectorthat invokes the callback directly - The 3 tests now use
Mock<IInactivityTimer>(already used in other tests), raiseTestHostLaunched, then callTriggerTimerCallback()synchronously — no race possible - The "skip dump when PID=0" test keeps the real timer with
timeout=0since it specifically exercises the guard path that fires beforeTestHostLaunched
🔧 Iterated by PR Iteration Agent 🔧
Replace the 50ms real-timer pattern (race-prone on loaded CI agents) with a synchronous approach: - Make CollectDumpAndAbortTesthost private protected - Add TriggerTimerCallback() to TestableBlameCollector - Use Mock<IInactivityTimer> in the 3 happy-path hang-dump tests - Raise TestHostLaunched then call TriggerTimerCallback() directly The 'skip dump when PID=0' test keeps the real timer (timeout=0) since it specifically tests the guard path that fires before TestHostLaunched. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Commit pushed:
|
nohwnd
left a comment
There was a problem hiding this comment.
Review Summary
The core fix (guard against _testHostProcessId == 0 in CollectDumpAndAbortTesthost) is correct and addresses the real bug described in #15588. Resource strings and all XLF files are properly updated.
The previous flakiness concern raised in the prior review was addressed by introducing TriggerTimerCallback() in TestableBlameCollector — a sound approach. However, the access modifier used to expose the method is wrong and will cause a compile error.
One blocking finding:
CollectDumpAndAbortTesthost is declared private protected, which restricts access to derived classes in the same assembly only. TestableBlameCollector is in the test assembly (...UnitTests), a separate assembly from BlameCollector. This makes the TriggerTimerCallback() => CollectDumpAndAbortTesthost() call a CS0122 compile error. The fix is to use protected instead.
No other issues found.
🧠 Reviewed by Expert Code Reviewer 🧠
🧠 Reviewed by Expert Code Reviewer 🧠
| /// kills the testhost process. | ||
| /// </summary> | ||
| private void CollectDumpAndAbortTesthost() | ||
| private protected void CollectDumpAndAbortTesthost() |
There was a problem hiding this comment.
[Crash & Hang Dump Reliability] private protected prevents cross-assembly access — compile error
private protected restricts access to derived types within the same assembly only. TestableBlameCollector lives in Microsoft.TestPlatform.Extensions.BlameDataCollector.UnitTests (a different assembly), so TriggerTimerCallback() => CollectDumpAndAbortTesthost() will produce CS0122 at compile time.
Note: InternalsVisibleTo in Friends.cs only applies to internal members — it does not widen private protected.
The correct modifier is protected, which grants access to derived classes in any assembly:
protected void CollectDumpAndAbortTesthost()This is the minimum change needed for the test project to compile.
There was a problem hiding this comment.
Good catch — fixed. Changed private protected to protected so TestableBlameCollector in the test assembly can access CollectDumpAndAbortTesthost via inheritance.
🔧 Iterated by PR Iteration Agent 🔧
…tected TestableBlameCollector lives in a different assembly so private protected causes CS0122. InternalsVisibleTo only widens internal, not private protected. Change to protected to allow cross-assembly inheritance access. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Commit pushed:
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 20 out of 21 changed files in this pull request and generated 6 comments.
Files not reviewed (1)
- src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.Designer.cs: Language not supported
| /// Starts and waits for a new proc dump process to collect a single dump and then | ||
| /// kills the testhost process. | ||
| /// </summary> | ||
| private void CollectDumpAndAbortTesthost() | ||
| protected void CollectDumpAndAbortTesthost() | ||
| { |
| _blameDataCollector = new TestableBlameCollector( | ||
| _mockBlameReaderWriter.Object, | ||
| _mockProcessDumpUtility.Object, | ||
| null, | ||
| _mockFileHelper.Object, | ||
| _mockProcessHelper.Object); | ||
|
|
||
| // Signal that fires once the timer callback completes (warning logged). | ||
| var warningLogged = new ManualResetEventSlim(); | ||
| _mockLogger | ||
| .Setup(x => x.LogWarning(It.IsAny<DataCollectionContext>(), It.IsAny<string>())) | ||
| .Callback(() => warningLogged.Set()); | ||
|
|
||
| _blameDataCollector.Initialize( | ||
| GetDumpConfigurationElement(false, false, true, 0), | ||
| _mockDataColectionEvents.Object, | ||
| _mockDataCollectionSink.Object, | ||
| _mockLogger.Object, | ||
| _context); | ||
|
|
||
| // Do NOT raise TestHostLaunched — _testHostProcessId stays 0. | ||
| warningLogged.Wait(1000, TestContext.CancellationToken); | ||
|
|
||
| // The hang dump must not have been attempted against PID 0. | ||
| _mockProcessDumpUtility.Verify( | ||
| x => x.StartHangBasedProcessDump(It.IsAny<int>(), It.IsAny<string>(), It.IsAny<bool>(), It.IsAny<string>(), It.IsAny<Action<string>>()), | ||
| Times.Never); |
| @@ -1 +1,2 @@ | |||
| #nullable enable | |||
| Microsoft.TestPlatform.Extensions.BlameDataCollector.BlameCollector.CollectDumpAndAbortTesthost() -> void | |||
| @@ -1 +1,2 @@ | |||
| #nullable enable | |||
| Microsoft.TestPlatform.Extensions.BlameDataCollector.BlameCollector.CollectDumpAndAbortTesthost() -> void | |||
| @@ -1 +1,2 @@ | |||
| #nullable enable | |||
| Microsoft.TestPlatform.Extensions.BlameDataCollector.BlameCollector.CollectDumpAndAbortTesthost() -> void | |||
| @@ -1 +1,2 @@ | |||
| #nullable enable | |||
| Microsoft.TestPlatform.Extensions.BlameDataCollector.BlameCollector.CollectDumpAndAbortTesthost() -> void | |||
nohwnd
left a comment
There was a problem hiding this comment.
Review Summary
The core fix (guard _testHostProcessId == 0 in CollectDumpAndAbortTesthost) is correct and addresses the real bug from #15588. The previous private protected compile error is also resolved by using protected with the corresponding PublicAPI.Unshipped.txt entries. Resource strings and XLF files are properly updated.
One blocking correctness issue found:
_inactivityTimerAlreadyFired = true is set on line 216, before the new _testHostProcessId == 0 guard. ResetInactivityTimer() is gated on !_inactivityTimerAlreadyFired, so once the early return path executes, hang detection is permanently disabled for the session — even after testhost eventually launches. A legitimate hang of the (now-running) testhost will produce no dump.
The fix also disposes _inactivityTimer before the guard (line 240), so the timer cannot fire again regardless. Both issues need to be addressed together — see the inline comment for concrete options.
🧠 Reviewed by Expert Code Reviewer 🧠
🧠 Reviewed by Expert Code Reviewer 🧠
| { | ||
| EqtTrace.Warning("BlameCollector.CollectDumpAndAbortTesthost: Test host process has not launched yet. Skipping hang dump."); | ||
| _logger.LogWarning(_context.SessionDataCollectionContext, Resources.Resources.TestHostNotLaunchedCannotCollectHangDump); | ||
| return; |
There was a problem hiding this comment.
[Threading/Correctness] Early return permanently disables hang detection for the remainder of the session.
_inactivityTimerAlreadyFired = true is set on line 216, before this guard. ResetInactivityTimer() (line 675) skips resetting the timer when _inactivityTimerAlreadyFired is true. So once the timer fires before the testhost launches:
_inactivityTimerAlreadyFiredis permanentlytrue- The early return fires here
- From this point on,
ResetInactivityTimer()is a no-op, so no subsequent test activity resets the timer - If the testhost then launches and hangs legitimately, no hang dump is ever collected
The fix is to not mark the timer as already fired when we're doing the early return — move _inactivityTimerAlreadyFired = true to after the guard, or reset it before the early return:
// Option A: move the flag assignment after the guard
if (_testHostProcessId == 0)
{
EqtTrace.Warning("...");
_logger.LogWarning(...);
return;
}
_inactivityTimerAlreadyFired = true;// Option B: reset the flag before early return
if (_testHostProcessId == 0)
{
_inactivityTimerAlreadyFired = false; // allow future timer fires
EqtTrace.Warning("...");
_logger.LogWarning(...);
return;
}Note that _inactivityTimer?.Dispose() (line 240) has already been called before the guard, so Option A would also need the timer disposal moved after the guard, or the timer would need to be re-armed. Option B is simpler but needs to be paired with re-arming the timer (otherwise the timer won't fire again since it was disposed on line 240).
There was a problem hiding this comment.
Good catch. Fixed in the latest commit — moved both _inactivityTimerAlreadyFired = true and _inactivityTimer?.Dispose() to after the PID guard. When the timer fires before the testhost launches, we now return early without setting the flag or disposing the timer, so ResetInactivityTimer can re-arm it once the testhost eventually launches and hang detection stays active.
🔧 Iterated by PR Iteration Agent 🔧
🔧 Iterated by PR Iteration Agent 🔧
…ortTesthost When the inactivity timer fires before testhost launches (PID=0), the early return now occurs before setting _inactivityTimerAlreadyFired=true and before disposing the timer. This ensures ResetInactivityTimer can re-arm the timer once the testhost eventually launches, so hang detection remains active. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Commit pushed:
|
nohwnd
left a comment
There was a problem hiding this comment.
Review Summary
All three issues raised in prior reviews are now resolved in the latest commit:
private protected→protected✅ —CollectDumpAndAbortTesthostis nowprotected, allowingTestableBlameCollectorin the test assembly to override it._inactivityTimerAlreadyFired = trueordering ✅ — The flag is now set after the_testHostProcessId == 0guard, so an early return preserves the ability forResetInactivityTimerto re-arm the timer when the testhost eventually launches. The comment documents this intent clearly.- Test flakiness ✅ — Happy-path tests now use synchronous
TriggerTimerCallback()with a properly sequencedTestHostLaunchedraise. The "skip" test retains a real 0ms timer but uses aManualResetEventSlimwith a 1-second ceiling, which is sufficient for CI.
The fix is correct, well-tested, and the resource/XLF additions are complete.
🧠 Reviewed by Expert Code Reviewer 🧠
🧠 Reviewed by Expert Code Reviewer 🧠
🤖 This is an automated fix by the Issue Triage agent.
Fixes #15588
Root Cause
When the inactivity hang-dump timer fires before the testhost process has launched,
_testHostProcessIdis0(the defaultintvalue). The code inCollectDumpAndAbortTesthostwould then:StartHangBasedProcessDump(0, ...)— attempting to dump PID 0, which is the Idle process on Windows and Swapper on Linux.Process.GetProcessById(0)— getting the Idle/Swapper process and trying to kill it.This produces an empty or corrupt dump file and may cause errors. The comment in
DefaultEngineInvoker.cseven had a TODO noting this hazard:Fix
Added an early-return guard at the start of
CollectDumpAndAbortTesthost:If the testhost hasn't launched (PID still 0), we log a warning and skip dump/kill entirely.
Tests
InitializeWithDumpForHangShouldSkipDumpIfTestHostHasNotLaunchedYet: verifies thatStartHangBasedProcessDumpis not called when the timer fires beforeTestHostLaunched.timeout=0totimeout=50msand addedTestHostLaunchedevent raising before the timer fires, to properly test the happy path (testhost launched → hang timer fires → dump collected).