Skip to content

[fix] Fix race condition: skip hang dump when testhost hasn't launched yet#16064

Open
nohwnd wants to merge 4 commits into
mainfrom
fix/issue-15588-hang-dump-race-condition-566222668177d178
Open

[fix] Fix race condition: skip hang dump when testhost hasn't launched yet#16064
nohwnd wants to merge 4 commits into
mainfrom
fix/issue-15588-hang-dump-race-condition-566222668177d178

Conversation

@nohwnd
Copy link
Copy Markdown
Member

@nohwnd nohwnd commented May 25, 2026

🤖 This is an automated fix by the Issue Triage agent.

Fixes #15588

Root Cause

When the inactivity hang-dump timer fires before the testhost process has launched, _testHostProcessId is 0 (the default int value). The code in CollectDumpAndAbortTesthost would then:

  1. Call StartHangBasedProcessDump(0, ...) — attempting to dump PID 0, which is the Idle process on Windows and Swapper on Linux.
  2. Call Process.GetProcessById(0) — getting the Idle/Swapper process and trying to kill it.

This produces an empty or corrupt dump file and may cause errors. The comment in DefaultEngineInvoker.cs even had a TODO noting this hazard:

(todo/redacted): should there be a warning / error in this case, on windows and linux we are most likely not started by this PID 0

Fix

Added an early-return guard at the start of CollectDumpAndAbortTesthost:

if (_testHostProcessId == 0)
{
    EqtTrace.Warning("BlameCollector: Test host process has not launched yet. Skipping hang dump.");
    _logger.LogWarning(..., Resources.TestHostNotLaunchedCannotCollectHangDump);
    return;
}

If the testhost hasn't launched (PID still 0), we log a warning and skip dump/kill entirely.

Tests

  • New test InitializeWithDumpForHangShouldSkipDumpIfTestHostHasNotLaunchedYet: verifies that StartHangBasedProcessDump is not called when the timer fires before TestHostLaunched.
  • Updated 3 existing tests: changed from timeout=0 to timeout=50ms and added TestHostLaunched event raising before the timer fires, to properly test the happy path (testhost launched → hang timer fires → dump collected).

🔍 Triaged by Issue Repro Triage & Auto-Fix 🔍

When the inactivity timer fires before the testhost process has launched,
_testHostProcessId is 0 (default int). Previously this caused
ProcessDumpUtility.StartHangBasedProcessDump to attempt to dump PID 0
(the Idle process on Windows / Swapper on Linux), resulting in an empty
or incorrect dump file.

The fix adds an early-return guard in CollectDumpAndAbortTesthost:
if _testHostProcessId == 0, log a warning and skip the dump/kill.

Also updates three existing hang dump unit tests to properly simulate
the happy-path scenario (testhost launches before the timer fires) by:
- Using a 50 ms timeout instead of 0 ms so the TestHostLaunched event
  can be raised before the timer callback runs
- Raising TestHostLaunched with PID 1234 before the timer fires

Adds a new test that verifies StartHangBasedProcessDump is NOT called
when the timer fires before TestHostLaunched.

Fixes #15588

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 25, 2026 14:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the Blame Data Collector’s hang-dump path by preventing hang dump collection (and the subsequent kill attempt) when the testhost hasn’t launched yet and the tracked PID is still 0, avoiding accidental targeting of PID 0 (Idle/Swapper).

Changes:

  • Add an early-return guard in CollectDumpAndAbortTesthost when _testHostProcessId == 0, logging a warning and skipping dump/kill.
  • Add a new localized resource string for the “testhost not launched” warning and propagate it to all BlameDataCollector .xlf files.
  • Update/add unit tests to cover the “skip when not launched” scenario and to raise TestHostLaunched before hang timer fires in existing tests.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
test/Microsoft.TestPlatform.Extensions.BlameDataCollector.UnitTests/BlameCollectorTests.cs Updates hang-dump timer tests and adds a new test ensuring dump is skipped when PID is still 0.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/BlameCollector.cs Adds PID==0 guard in hang-dump timeout callback to skip dump/kill and log a warning.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.resx Adds a new resource string for the “testhost not launched” hang-dump warning.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.Designer.cs Adds the strongly-typed resource accessor for the new warning string.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.cs.xlf Adds localization entry for the new warning string.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.de.xlf Adds localization entry for the new warning string.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.es.xlf Adds localization entry for the new warning string.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.fr.xlf Adds localization entry for the new warning string.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.it.xlf Adds localization entry for the new warning string.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ja.xlf Adds localization entry for the new warning string.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ko.xlf Adds localization entry for the new warning string.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.pl.xlf Adds localization entry for the new warning string.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.pt-BR.xlf Adds localization entry for the new warning string.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.ru.xlf Adds localization entry for the new warning string.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.tr.xlf Adds localization entry for the new warning string.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.zh-Hans.xlf Adds localization entry for the new warning string.
src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/xlf/Resources.zh-Hant.xlf Adds localization entry for the new warning string.
Files not reviewed (1)
  • src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.Designer.cs: Language not supported

Comment on lines +289 to +308
// Signal that fires once the timer callback completes (warning logged).
var warningLogged = new ManualResetEventSlim();
_mockLogger
.Setup(x => x.LogWarning(It.IsAny<DataCollectionContext>(), It.IsAny<string>()))
.Callback(() => warningLogged.Set());

_blameDataCollector.Initialize(
GetDumpConfigurationElement(false, false, true, 0),
_mockDataColectionEvents.Object,
_mockDataCollectionSink.Object,
_mockLogger.Object,
_context);

// Do NOT raise TestHostLaunched — _testHostProcessId stays 0.
warningLogged.Wait(1000, TestContext.CancellationToken);

// The hang dump must not have been attempted against PID 0.
_mockProcessDumpUtility.Verify(
x => x.StartHangBasedProcessDump(It.IsAny<int>(), It.IsAny<string>(), It.IsAny<bool>(), It.IsAny<string>(), It.IsAny<Action<string>>()),
Times.Never);
Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

The core fix is correct: adding a _testHostProcessId == 0 guard in CollectDumpAndAbortTesthost prevents a real bug where the hang-dump timer fires before the testhost process launches, causing PID 0 (the OS Idle/Swapper process) to be targeted for a dump and kill. The resource strings and all XLF files are properly updated.

One finding worth addressing:

The three updated "happy path" tests (InitializeWithDumpForHangShouldCaptureADumpOnTimeout and siblings) now carry a timing-dependent assumption: that the test thread raises TestHostLaunched within 50 ms of calling Initialize(). This creates a small but non-zero probability of flakiness on loaded CI agents. See the inline comment for details and a suggested mitigation.

No other issues found. The new InitializeWithDumpForHangShouldSkipDumpIfTestHostHasNotLaunchedYet test correctly validates the fix using a warning-callback sentinel, and the resource/XLF additions are complete and follow the repo conventions.

🧠 Reviewed by Expert Code Reviewer

🧠 Reviewed by Expert Code Reviewer 🧠

_mockDataColectionEvents.Raise(x => x.TestHostLaunched += null, new TestHostLaunchedEventArgs(_dataCollectionContext, 1234));

hangBasedDumpcollected.Wait(2000, TestContext.CancellationToken);
_mockProcessDumpUtility.Verify(x => x.StartHangBasedProcessDump(It.IsAny<int>(), It.IsAny<string>(), It.IsAny<bool>(), It.IsAny<string>(), It.IsAny<Action<string>>()), Times.Once);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Threading] Potential test flakiness — timer may fire before TestHostLaunched is raised

The comment says "Simulate testhost launching before the timer fires", but there is no synchronization guarantee here. Initialize() starts a 50 ms timer on the threadpool, then control immediately returns to the test thread. If the test thread is preempted for >50 ms between lines 185 and 192 (e.g., on a loaded CI agent), the timer fires first, sees _testHostProcessId == 0, logs the warning and returns early — and StartHangBasedProcessDump is never called, causing the Times.Once assertion to fail.

The same race applies to the other two tests updated with the same pattern (InitializeWithDumpForHangShouldCaptureKillTestHostOnTimeoutEvenIfGetDumpFileFails and InitializeWithDumpForHangShouldCaptureKillTestHostOnTimeoutEvenIfAttachingToDumpFails).

A more robust pattern is to block the timer callback until after TestHostLaunched has been raised:

// Raise TestHostLaunched BEFORE starting the timer (or gate the timer via a semaphore)
// Option A: raise the event synchronously before Initialize sets up the timer
_mockDataColectionEvents.Raise(x => x.TestHostLaunched += null,
    new TestHostLaunchedEventArgs(_dataCollectionContext, 1234));

_blameDataCollector.Initialize(
    GetDumpConfigurationElement(false, false, true, 50), ...);

Or, if the event handler must be wired via Initialize, increase the timeout to a value that gives the test thread ample time even on loaded CI (e.g. 500 ms), acknowledging this is still probabilistic.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed in the latest commit by eliminating the real timer entirely from those 3 happy-path tests:

  • Made CollectDumpAndAbortTesthost private protected
  • Added TriggerTimerCallback() to TestableBlameCollector that invokes the callback directly
  • The 3 tests now use Mock<IInactivityTimer> (already used in other tests), raise TestHostLaunched, then call TriggerTimerCallback() synchronously — no race possible
  • The "skip dump when PID=0" test keeps the real timer with timeout=0 since it specifically exercises the guard path that fires before TestHostLaunched

🔧 Iterated by PR Iteration Agent 🔧

Replace the 50ms real-timer pattern (race-prone on loaded CI agents)
with a synchronous approach:
- Make CollectDumpAndAbortTesthost private protected
- Add TriggerTimerCallback() to TestableBlameCollector
- Use Mock<IInactivityTimer> in the 3 happy-path hang-dump tests
- Raise TestHostLaunched then call TriggerTimerCallback() directly

The 'skip dump when PID=0' test keeps the real timer (timeout=0) since
it specifically tests the guard path that fires before TestHostLaunched.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented May 25, 2026

Commit pushed: f4a7320

🔧 Iterated by PR Iteration Agent 🔧

Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

The core fix (guard against _testHostProcessId == 0 in CollectDumpAndAbortTesthost) is correct and addresses the real bug described in #15588. Resource strings and all XLF files are properly updated.

The previous flakiness concern raised in the prior review was addressed by introducing TriggerTimerCallback() in TestableBlameCollector — a sound approach. However, the access modifier used to expose the method is wrong and will cause a compile error.

One blocking finding:

CollectDumpAndAbortTesthost is declared private protected, which restricts access to derived classes in the same assembly only. TestableBlameCollector is in the test assembly (...UnitTests), a separate assembly from BlameCollector. This makes the TriggerTimerCallback() => CollectDumpAndAbortTesthost() call a CS0122 compile error. The fix is to use protected instead.

No other issues found.

🧠 Reviewed by Expert Code Reviewer 🧠

🧠 Reviewed by Expert Code Reviewer 🧠

/// kills the testhost process.
/// </summary>
private void CollectDumpAndAbortTesthost()
private protected void CollectDumpAndAbortTesthost()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Crash & Hang Dump Reliability] private protected prevents cross-assembly access — compile error

private protected restricts access to derived types within the same assembly only. TestableBlameCollector lives in Microsoft.TestPlatform.Extensions.BlameDataCollector.UnitTests (a different assembly), so TriggerTimerCallback() => CollectDumpAndAbortTesthost() will produce CS0122 at compile time.

Note: InternalsVisibleTo in Friends.cs only applies to internal members — it does not widen private protected.

The correct modifier is protected, which grants access to derived classes in any assembly:

protected void CollectDumpAndAbortTesthost()

This is the minimum change needed for the test project to compile.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed. Changed private protected to protected so TestableBlameCollector in the test assembly can access CollectDumpAndAbortTesthost via inheritance.

🔧 Iterated by PR Iteration Agent 🔧

…tected

TestableBlameCollector lives in a different assembly so private protected
causes CS0122. InternalsVisibleTo only widens internal, not private protected.
Change to protected to allow cross-assembly inheritance access.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented May 25, 2026

Commit pushed: 106ef82

🔧 Iterated by PR Iteration Agent 🔧

Copilot AI review requested due to automatic review settings May 25, 2026 14:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 21 changed files in this pull request and generated 6 comments.

Files not reviewed (1)
  • src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Resources/Resources.Designer.cs: Language not supported

Comment on lines 210 to 214
/// Starts and waits for a new proc dump process to collect a single dump and then
/// kills the testhost process.
/// </summary>
private void CollectDumpAndAbortTesthost()
protected void CollectDumpAndAbortTesthost()
{
Comment on lines +283 to +309
_blameDataCollector = new TestableBlameCollector(
_mockBlameReaderWriter.Object,
_mockProcessDumpUtility.Object,
null,
_mockFileHelper.Object,
_mockProcessHelper.Object);

// Signal that fires once the timer callback completes (warning logged).
var warningLogged = new ManualResetEventSlim();
_mockLogger
.Setup(x => x.LogWarning(It.IsAny<DataCollectionContext>(), It.IsAny<string>()))
.Callback(() => warningLogged.Set());

_blameDataCollector.Initialize(
GetDumpConfigurationElement(false, false, true, 0),
_mockDataColectionEvents.Object,
_mockDataCollectionSink.Object,
_mockLogger.Object,
_context);

// Do NOT raise TestHostLaunched — _testHostProcessId stays 0.
warningLogged.Wait(1000, TestContext.CancellationToken);

// The hang dump must not have been attempted against PID 0.
_mockProcessDumpUtility.Verify(
x => x.StartHangBasedProcessDump(It.IsAny<int>(), It.IsAny<string>(), It.IsAny<bool>(), It.IsAny<string>(), It.IsAny<Action<string>>()),
Times.Never);
@@ -1 +1,2 @@
#nullable enable
Microsoft.TestPlatform.Extensions.BlameDataCollector.BlameCollector.CollectDumpAndAbortTesthost() -> void
@@ -1 +1,2 @@
#nullable enable
Microsoft.TestPlatform.Extensions.BlameDataCollector.BlameCollector.CollectDumpAndAbortTesthost() -> void
@@ -1 +1,2 @@
#nullable enable
Microsoft.TestPlatform.Extensions.BlameDataCollector.BlameCollector.CollectDumpAndAbortTesthost() -> void
@@ -1 +1,2 @@
#nullable enable
Microsoft.TestPlatform.Extensions.BlameDataCollector.BlameCollector.CollectDumpAndAbortTesthost() -> void
Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

The core fix (guard _testHostProcessId == 0 in CollectDumpAndAbortTesthost) is correct and addresses the real bug from #15588. The previous private protected compile error is also resolved by using protected with the corresponding PublicAPI.Unshipped.txt entries. Resource strings and XLF files are properly updated.

One blocking correctness issue found:

_inactivityTimerAlreadyFired = true is set on line 216, before the new _testHostProcessId == 0 guard. ResetInactivityTimer() is gated on !_inactivityTimerAlreadyFired, so once the early return path executes, hang detection is permanently disabled for the session — even after testhost eventually launches. A legitimate hang of the (now-running) testhost will produce no dump.

The fix also disposes _inactivityTimer before the guard (line 240), so the timer cannot fire again regardless. Both issues need to be addressed together — see the inline comment for concrete options.

🧠 Reviewed by Expert Code Reviewer 🧠

🧠 Reviewed by Expert Code Reviewer 🧠

{
EqtTrace.Warning("BlameCollector.CollectDumpAndAbortTesthost: Test host process has not launched yet. Skipping hang dump.");
_logger.LogWarning(_context.SessionDataCollectionContext, Resources.Resources.TestHostNotLaunchedCannotCollectHangDump);
return;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Threading/Correctness] Early return permanently disables hang detection for the remainder of the session.

_inactivityTimerAlreadyFired = true is set on line 216, before this guard. ResetInactivityTimer() (line 675) skips resetting the timer when _inactivityTimerAlreadyFired is true. So once the timer fires before the testhost launches:

  1. _inactivityTimerAlreadyFired is permanently true
  2. The early return fires here
  3. From this point on, ResetInactivityTimer() is a no-op, so no subsequent test activity resets the timer
  4. If the testhost then launches and hangs legitimately, no hang dump is ever collected

The fix is to not mark the timer as already fired when we're doing the early return — move _inactivityTimerAlreadyFired = true to after the guard, or reset it before the early return:

// Option A: move the flag assignment after the guard
if (_testHostProcessId == 0)
{
    EqtTrace.Warning("...");
    _logger.LogWarning(...);
    return;
}

_inactivityTimerAlreadyFired = true;
// Option B: reset the flag before early return
if (_testHostProcessId == 0)
{
    _inactivityTimerAlreadyFired = false; // allow future timer fires
    EqtTrace.Warning("...");
    _logger.LogWarning(...);
    return;
}

Note that _inactivityTimer?.Dispose() (line 240) has already been called before the guard, so Option A would also need the timer disposal moved after the guard, or the timer would need to be re-armed. Option B is simpler but needs to be paired with re-arming the timer (otherwise the timer won't fire again since it was disposed on line 240).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed in the latest commit — moved both _inactivityTimerAlreadyFired = true and _inactivityTimer?.Dispose() to after the PID guard. When the timer fires before the testhost launches, we now return early without setting the flag or disposing the timer, so ResetInactivityTimer can re-arm it once the testhost eventually launches and hang detection stays active.

🔧 Iterated by PR Iteration Agent 🔧

🔧 Iterated by PR Iteration Agent 🔧

…ortTesthost

When the inactivity timer fires before testhost launches (PID=0), the early
return now occurs before setting _inactivityTimerAlreadyFired=true and before
disposing the timer. This ensures ResetInactivityTimer can re-arm the timer
once the testhost eventually launches, so hang detection remains active.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented May 25, 2026

Commit pushed: 9522bea

🔧 Iterated by PR Iteration Agent 🔧

Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

All three issues raised in prior reviews are now resolved in the latest commit:

  1. private protectedprotected ✅ — CollectDumpAndAbortTesthost is now protected, allowing TestableBlameCollector in the test assembly to override it.
  2. _inactivityTimerAlreadyFired = true ordering ✅ — The flag is now set after the _testHostProcessId == 0 guard, so an early return preserves the ability for ResetInactivityTimer to re-arm the timer when the testhost eventually launches. The comment documents this intent clearly.
  3. Test flakiness ✅ — Happy-path tests now use synchronous TriggerTimerCallback() with a properly sequenced TestHostLaunched raise. The "skip" test retains a real 0ms timer but uses a ManualResetEventSlim with a 1-second ceiling, which is sufficient for CI.

The fix is correct, well-tested, and the resource/XLF additions are complete.

🧠 Reviewed by Expert Code Reviewer 🧠

🧠 Reviewed by Expert Code Reviewer 🧠

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HangDumpChildProcesses very flaky

2 participants