Skip to content

[fix] Fix HTML logger exception on invalid XML chars in test display names#16051

Open
nohwnd wants to merge 2 commits into
mainfrom
fix/issue-10431-html-logger-special-chars-390ce20d95623f88
Open

[fix] Fix HTML logger exception on invalid XML chars in test display names#16051
nohwnd wants to merge 2 commits into
mainfrom
fix/issue-10431-html-logger-special-chars-390ce20d95623f88

Conversation

@nohwnd
Copy link
Copy Markdown
Member

@nohwnd nohwnd commented May 22, 2026

Fixes #10431

🤖 This is an automated fix generated by the Issue Triage & Auto-Fix workflow.

Root Cause

When a test's DisplayName (e.g. set via a [DataRow] attribute) contains XML 1.0 invalid control characters such as \x01\x08, \x0B, \x0C, \x0E\x1F, DataContractSerializer.WriteObject() in HtmlLogger.PopulateHtmlFile() throws an XmlException:

Html Logger Error : '', hexadecimal value 0x01, is an invalid character.

The HTML report is silently not generated even though tests continue to pass. The XML 1.0 spec forbids these code points entirely — they cannot appear even as character references.

Fix

Apply the same sanitization already used by TrxLogger's XmlPersistence.RemoveInvalidXmlChar(): replace XML-invalid characters with their Unicode escape representation (e.g. \u0001) before storing them in the HTML logger object model.

Added a private RemoveInvalidXmlChars(string?) helper to HtmlLogger and applied it to DisplayName, FullyQualifiedName, ErrorStackTrace, and ErrorMessage in TestResultHandler().

Test

Added unit test TestResultHandlerShouldSanitizeInvalidXmlCharsInDisplayName that verifies control characters (\x01, \x02, \x03) in DisplayName, ErrorMessage, and ErrorStackTrace are replaced with their \uXXXX escape sequences.

🔍 Triaged by Issue Repro Triage & Auto-Fix 🔍

When a test's DisplayName (e.g. from a DataRow attribute) contains XML 1.0
invalid control characters such as 0x01-0x08, 0x0B, 0x0C, 0x0E-0x1F,
DataContractSerializer throws XmlException and silently prevents the HTML
report from being generated.

Apply the same sanitization pattern already used by TrxLogger's XmlPersistence
to replace invalid XML characters with their Unicode escape representation
(e.g. \u0001) before they are stored in the HTML logger object model.

Fixes #10431

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 22, 2026 01:56
@nohwnd nohwnd added the bug label May 22, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a failure mode in the HTML logger where DataContractSerializer.WriteObject() can throw when test result strings contain XML 1.0–invalid control characters (e.g., from [DataRow] display names), preventing the HTML report from being generated.

Changes:

  • Sanitize DisplayName, FullyQualifiedName, ErrorMessage, and ErrorStackTrace in HtmlLogger.TestResultHandler() before storing them in the HTML logger object model.
  • Add RemoveInvalidXmlChars(string?) helper to escape invalid XML characters as \uXXXX.
  • Add a unit test validating escaping for control characters in result fields.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/Microsoft.TestPlatform.Extensions.HtmlLogger/HtmlLogger.cs Adds XML-invalid character sanitization for key serialized fields and introduces a helper used during result handling.
test/Microsoft.TestPlatform.Extensions.HtmlLogger.UnitTests/HtmlLoggerTests.cs Adds coverage to ensure control characters are escaped instead of causing serialization failures.

Comment on lines +475 to +479
// From xml spec (http://www.w3.org/TR/xml/#charsets) valid chars:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
const string invalidChar = @"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]";
return Regex.Replace(str, invalidChar, m => $@"\u{(ushort)m.Value[0]:x4}");
}
Comment on lines +459 to +467
/// <summary>
/// Removes characters that are invalid in XML 1.0 from a string.
/// </summary>
/// <remarks>
/// XML 1.0 valid characters: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD].
/// Control characters in the range #x00-#x08, #x0B, #x0C, #x0E-#x1F are not valid and
/// will cause <see cref="DataContractSerializer"/> to throw an <see cref="System.Xml.XmlException"/>.
/// Invalid characters are replaced with their Unicode escape representation.
/// </remarks>
}

[TestMethod]
public void TestResultHandlerShouldSanitizeInvalidXmlCharsInDisplayName()
Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Fix HTML logger exception on invalid XML chars

The fix correctly addresses the root cause — DataContractSerializer throwing on XML 1.0 invalid control characters in test display names. Two issues worth addressing before merge:

  1. Correctness (surrogate pairs) — The regex excludes the entire surrogate range \uD800–\uDFFF, which means valid supplementary Unicode characters (emoji, etc.) in test display names are silently corrupted to \uD83D\uDE00-style sequences. Since DataContractSerializer handles valid surrogate pairs correctly, this sanitization is overly aggressive. See inline comment at line 476.

  2. Performance (static regex) — The regex should be a static readonly compiled field rather than recreated per-call via the string overload. See inline comment at line 474.

The primary fix (replacing control chars \x00–\x1F minus the whitespace ones) is correct and the test coverage is good.

🧠 Reviewed by Expert Code Reviewer

🧠 Reviewed by Expert Code Reviewer 🧠

}

// From xml spec (http://www.w3.org/TR/xml/#charsets) valid chars:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Correctness] Surrogate pairs (supplementary Unicode characters) are incorrectly stripped

The regex [^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD] has a gap between \uD7FF and \uE000 that covers the entire surrogate range \uD800–\uDFFF. In .NET strings, supplementary Unicode characters (U+10000–U+10FFFF, including all emoji) are encoded as surrogate pairs — two char values both in that range.

DataContractSerializer / XmlWriter handle surrogate pairs correctly and serialize them as the corresponding supplementary Unicode code point, which is valid XML 1.0 ([#x10000-#x10FFFF]). So DataContractSerializer will not throw on a string like "Test(😀)".

The current code will corrupt such strings: "Test(😀)""Test(\uD83D\uDE00)" — two broken escape sequences instead of the emoji. A test whose [DataRow] argument contains an emoji will have its display name silently mangled in the HTML report.

Suggested fix — allow valid surrogate pairs to pass through by checking for a complete surrogate pair before replacing:

private static readonly Regex s_invalidXmlCharsRegex =
    new Regex(@"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]",
              RegexOptions.Compiled);

private static string? RemoveInvalidXmlChars(string? str)
{
    if (str is null)
        return null;

    // Allow valid surrogate pairs (they represent U+10000–U+10FFFF, which are valid XML chars).
    // Only lone surrogates are truly invalid.
    return s_invalidXmlCharsRegex.Replace(str, m =>
        m.Value.Length == 1 ? $@"\u{(ushort)m.Value[0]:x4}" : m.Value);
}

Alternatively, use char.IsHighSurrogate/char.IsLowSurrogate in a manual loop, which is clearer and avoids nested regex complexity.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — the surrogate range \uD800-\uDFFF is now excluded from the negated character class in the first alternative, so valid surrogate pairs pass through untouched. Two additional alternatives handle lone surrogates (unpaired high or low) using lookahead/lookbehind. Added a test that verifies "Test(😀)" is preserved as-is after sanitization.

🔧 Iterated by PR Iteration Agent 🔧

{
return null;
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Performance] Regex is re-instantiated on every call via the string overload

Regex.Replace(str, invalidChar, ...) with a const string pattern looks up (and possibly compiles) the regex on every invocation via .NET's internal cache (size 15). This method is called 4× per test result; for a run with thousands of tests this accumulates.

Since the project targets netstandard2.0 and net48, [GeneratedRegex] isn't available, but a static readonly field with RegexOptions.Compiled eliminates the cache lookup overhead:

private static readonly Regex s_invalidXmlCharsRegex = new Regex(
    @"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]",
    RegexOptions.Compiled);

Minor, but worth fixing given the hot-path nature of TestResultHandler.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — the regex is now a private static readonly Regex InvalidXmlCharsRegex with RegexOptions.Compiled, eliminating the per-call lookup overhead.

🔧 Iterated by PR Iteration Agent 🔧

…rs correctly

- Use a static readonly compiled Regex instead of re-creating on every call
- Exclude the surrogate range from the negated char class in the first
  alternative so valid surrogate pairs are not matched; add explicit
  lone-surrogate alternatives with lookahead/lookbehind to catch only
  invalid lone surrogates
- Add test verifying emoji (valid surrogate pair) passes through unchanged

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nohwnd
Copy link
Copy Markdown
Member Author

nohwnd commented May 22, 2026

Commit pushed: 82ef8e5

🔧 Iterated by PR Iteration Agent 🔧

Copy link
Copy Markdown
Member Author

@nohwnd nohwnd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both previous findings are resolved in this revision:

  • [Performance] Regex is now a private static readonly field with RegexOptions.Compiled — no per-call overhead.
  • [Correctness] The regex now correctly handles surrogate pairs: valid high+low surrogate pairs pass through unchanged; only lone surrogates are replaced with their \uXXXX escape. The three-alternative pattern is logically correct, including the edge cases of a high surrogate at end-of-string and a low surrogate at start-of-string.

Test coverage is solid — both the invalid control char case and the emoji (surrogate pair) preservation case are tested.

The fix is correct and complete. ✅

🧠 Reviewed by Expert Code Reviewer 🧠

🧠 Reviewed by Expert Code Reviewer 🧠

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

dotnet test html logger throws execption when using special characters in DataRow attributes.

2 participants