[fix] Fix HTML logger exception on invalid XML chars in test display names#16051
[fix] Fix HTML logger exception on invalid XML chars in test display names#16051nohwnd wants to merge 2 commits into
Conversation
When a test's DisplayName (e.g. from a DataRow attribute) contains XML 1.0 invalid control characters such as 0x01-0x08, 0x0B, 0x0C, 0x0E-0x1F, DataContractSerializer throws XmlException and silently prevents the HTML report from being generated. Apply the same sanitization pattern already used by TrxLogger's XmlPersistence to replace invalid XML characters with their Unicode escape representation (e.g. \u0001) before they are stored in the HTML logger object model. Fixes #10431 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes a failure mode in the HTML logger where DataContractSerializer.WriteObject() can throw when test result strings contain XML 1.0–invalid control characters (e.g., from [DataRow] display names), preventing the HTML report from being generated.
Changes:
- Sanitize
DisplayName,FullyQualifiedName,ErrorMessage, andErrorStackTraceinHtmlLogger.TestResultHandler()before storing them in the HTML logger object model. - Add
RemoveInvalidXmlChars(string?)helper to escape invalid XML characters as\uXXXX. - Add a unit test validating escaping for control characters in result fields.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/Microsoft.TestPlatform.Extensions.HtmlLogger/HtmlLogger.cs |
Adds XML-invalid character sanitization for key serialized fields and introduces a helper used during result handling. |
test/Microsoft.TestPlatform.Extensions.HtmlLogger.UnitTests/HtmlLoggerTests.cs |
Adds coverage to ensure control characters are escaped instead of causing serialization failures. |
| // From xml spec (http://www.w3.org/TR/xml/#charsets) valid chars: | ||
| // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] | ||
| const string invalidChar = @"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]"; | ||
| return Regex.Replace(str, invalidChar, m => $@"\u{(ushort)m.Value[0]:x4}"); | ||
| } |
| /// <summary> | ||
| /// Removes characters that are invalid in XML 1.0 from a string. | ||
| /// </summary> | ||
| /// <remarks> | ||
| /// XML 1.0 valid characters: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]. | ||
| /// Control characters in the range #x00-#x08, #x0B, #x0C, #x0E-#x1F are not valid and | ||
| /// will cause <see cref="DataContractSerializer"/> to throw an <see cref="System.Xml.XmlException"/>. | ||
| /// Invalid characters are replaced with their Unicode escape representation. | ||
| /// </remarks> |
| } | ||
|
|
||
| [TestMethod] | ||
| public void TestResultHandlerShouldSanitizeInvalidXmlCharsInDisplayName() |
nohwnd
left a comment
There was a problem hiding this comment.
Review: Fix HTML logger exception on invalid XML chars
The fix correctly addresses the root cause — DataContractSerializer throwing on XML 1.0 invalid control characters in test display names. Two issues worth addressing before merge:
-
Correctness (surrogate pairs) — The regex excludes the entire surrogate range
\uD800–\uDFFF, which means valid supplementary Unicode characters (emoji, etc.) in test display names are silently corrupted to\uD83D\uDE00-style sequences. SinceDataContractSerializerhandles valid surrogate pairs correctly, this sanitization is overly aggressive. See inline comment at line 476. -
Performance (static regex) — The regex should be a
static readonlycompiled field rather than recreated per-call via the string overload. See inline comment at line 474.
The primary fix (replacing control chars \x00–\x1F minus the whitespace ones) is correct and the test coverage is good.
🧠 Reviewed by Expert Code Reviewer
🧠 Reviewed by Expert Code Reviewer 🧠
| } | ||
|
|
||
| // From xml spec (http://www.w3.org/TR/xml/#charsets) valid chars: | ||
| // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] |
There was a problem hiding this comment.
[Correctness] Surrogate pairs (supplementary Unicode characters) are incorrectly stripped
The regex [^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD] has a gap between \uD7FF and \uE000 that covers the entire surrogate range \uD800–\uDFFF. In .NET strings, supplementary Unicode characters (U+10000–U+10FFFF, including all emoji) are encoded as surrogate pairs — two char values both in that range.
DataContractSerializer / XmlWriter handle surrogate pairs correctly and serialize them as the corresponding supplementary Unicode code point, which is valid XML 1.0 ([#x10000-#x10FFFF]). So DataContractSerializer will not throw on a string like "Test(😀)".
The current code will corrupt such strings: "Test(😀)" → "Test(\uD83D\uDE00)" — two broken escape sequences instead of the emoji. A test whose [DataRow] argument contains an emoji will have its display name silently mangled in the HTML report.
Suggested fix — allow valid surrogate pairs to pass through by checking for a complete surrogate pair before replacing:
private static readonly Regex s_invalidXmlCharsRegex =
new Regex(@"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]",
RegexOptions.Compiled);
private static string? RemoveInvalidXmlChars(string? str)
{
if (str is null)
return null;
// Allow valid surrogate pairs (they represent U+10000–U+10FFFF, which are valid XML chars).
// Only lone surrogates are truly invalid.
return s_invalidXmlCharsRegex.Replace(str, m =>
m.Value.Length == 1 ? $@"\u{(ushort)m.Value[0]:x4}" : m.Value);
}Alternatively, use char.IsHighSurrogate/char.IsLowSurrogate in a manual loop, which is clearer and avoids nested regex complexity.
There was a problem hiding this comment.
Fixed — the surrogate range \uD800-\uDFFF is now excluded from the negated character class in the first alternative, so valid surrogate pairs pass through untouched. Two additional alternatives handle lone surrogates (unpaired high or low) using lookahead/lookbehind. Added a test that verifies "Test(😀)" is preserved as-is after sanitization.
🔧 Iterated by PR Iteration Agent 🔧
| { | ||
| return null; | ||
| } | ||
|
|
There was a problem hiding this comment.
[Performance] Regex is re-instantiated on every call via the string overload
Regex.Replace(str, invalidChar, ...) with a const string pattern looks up (and possibly compiles) the regex on every invocation via .NET's internal cache (size 15). This method is called 4× per test result; for a run with thousands of tests this accumulates.
Since the project targets netstandard2.0 and net48, [GeneratedRegex] isn't available, but a static readonly field with RegexOptions.Compiled eliminates the cache lookup overhead:
private static readonly Regex s_invalidXmlCharsRegex = new Regex(
@"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]",
RegexOptions.Compiled);Minor, but worth fixing given the hot-path nature of TestResultHandler.
There was a problem hiding this comment.
Fixed — the regex is now a private static readonly Regex InvalidXmlCharsRegex with RegexOptions.Compiled, eliminating the per-call lookup overhead.
🔧 Iterated by PR Iteration Agent 🔧
…rs correctly - Use a static readonly compiled Regex instead of re-creating on every call - Exclude the surrogate range from the negated char class in the first alternative so valid surrogate pairs are not matched; add explicit lone-surrogate alternatives with lookahead/lookbehind to catch only invalid lone surrogates - Add test verifying emoji (valid surrogate pair) passes through unchanged Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Commit pushed:
|
nohwnd
left a comment
There was a problem hiding this comment.
Both previous findings are resolved in this revision:
- [Performance] Regex is now a
private static readonlyfield withRegexOptions.Compiled— no per-call overhead. - [Correctness] The regex now correctly handles surrogate pairs: valid high+low surrogate pairs pass through unchanged; only lone surrogates are replaced with their
\uXXXXescape. The three-alternative pattern is logically correct, including the edge cases of a high surrogate at end-of-string and a low surrogate at start-of-string.
Test coverage is solid — both the invalid control char case and the emoji (surrogate pair) preservation case are tested.
The fix is correct and complete. ✅
🧠 Reviewed by Expert Code Reviewer 🧠
🧠 Reviewed by Expert Code Reviewer 🧠
Fixes #10431
Root Cause
When a test's
DisplayName(e.g. set via a[DataRow]attribute) contains XML 1.0 invalid control characters such as\x01–\x08,\x0B,\x0C,\x0E–\x1F,DataContractSerializer.WriteObject()inHtmlLogger.PopulateHtmlFile()throws anXmlException:The HTML report is silently not generated even though tests continue to pass. The XML 1.0 spec forbids these code points entirely — they cannot appear even as character references.
Fix
Apply the same sanitization already used by
TrxLogger'sXmlPersistence.RemoveInvalidXmlChar(): replace XML-invalid characters with their Unicode escape representation (e.g.\u0001) before storing them in the HTML logger object model.Added a private
RemoveInvalidXmlChars(string?)helper toHtmlLoggerand applied it toDisplayName,FullyQualifiedName,ErrorStackTrace, andErrorMessageinTestResultHandler().Test
Added unit test
TestResultHandlerShouldSanitizeInvalidXmlCharsInDisplayNamethat verifies control characters (\x01,\x02,\x03) inDisplayName,ErrorMessage, andErrorStackTraceare replaced with their\uXXXXescape sequences.