mnaoumov.dev

Escaping Invalid XML Unicode characters

Hi folks

Recently I discovered a bug in NUnit.

Basically, the issue is caused by the fact that NUnit may create an XmlDocument containing Unicode characters that are not valid in XML.

To fix the issue, we need to either strip those characters or escape them.

According to the XML spec, the only valid XML characters are:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
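As a cross-check for the Regex work that follows, the production above can also be written as a plain predicate over code points. This is only a sketch: IsValidXmlCodePoint is a hypothetical helper name, not something taken from NUnit or the .NET libraries.

```csharp
using System;

// Hypothetical helper: true when the code point is listed in the XML spec ranges above.
static bool IsValidXmlCodePoint(int codePoint) =>
    codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
    || (codePoint >= 0x20 && codePoint <= 0xD7FF)
    || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
    || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);

Console.WriteLine(IsValidXmlCodePoint(0x41));    // True:  'A'
Console.WriteLine(IsValidXmlCodePoint(0x0));     // False: NUL
Console.WriteLine(IsValidXmlCodePoint(0xD800));  // False: surrogate code point
Console.WriteLine(IsValidXmlCodePoint(0xFFFE));  // False: outside #xE000-#xFFFD
Console.WriteLine(IsValidXmlCodePoint(0x1F600)); // True:  supplementary plane
```

A predicate like this works on code points, while a .NET string is a sequence of UTF-16 chars, which is why the Regex has to deal with surrogates explicitly.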

Let’s construct a Regex to replace invalid XML characters.

First Naive Approach

var invalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]");

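This won’t work, because \U00010000-\U0010ffff is represented as Unicode surrogate pairs, so inside the character class it is equivalent to \ud800\udc00-\udbff\udfff. The parser reads that as the single char \ud800, the range \udc00-\udbff and the single char \udfff. The range is in reverse order, so this forms an invalid Regex.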
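The failure is easy to reproduce. A minimal sketch, assuming the pattern is rejected with an ArgumentException, which is the exception the Regex constructor documents for a pattern parsing error (the exact message depends on the runtime version, so it is only printed here):

```csharp
using System;
using System.Text.RegularExpressions;

// The same pattern as in the first naive approach.
var naivePattern = "[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]";

bool rejected = false;
try
{
    // The constructor parses the pattern eagerly, so a bad range fails right here.
    _ = new Regex(naivePattern);
}
catch (ArgumentException ex)
{
    rejected = true;
    Console.WriteLine(ex.Message);
}

Console.WriteLine(rejected);
```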

All characters \U00010000-\U0010ffff (Supplementary Planes) can be described as a Regex:

var supplementaryPlanesRegex = new Regex("[\ud800-\udbff][\udc00-\udfff]");

According to the list of valid characters shown above, [#xD800-#xDFFF] are invalid XML characters. Taking into account Supplementary Planes, this means that we are interested in surrogate characters that don’t form a valid surrogate pair.

In my previous blogpost I described a Regex to match such characters.

var invalidCharactersRegex = new Regex("([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

Second Naive Approach

var invalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

This forms a valid Regex, but it won’t work correctly. It matches a string containing the valid Unicode code point “\U00010000”, which is equivalent to “\ud800\udc00”. The reason is that the first part of the Regex, the negated character class, does not list the surrogates, so it matches each surrogate char on its own. We need to prevent this by adding the surrogate range \ud800-\udfff to that class.

Third Approach

var invalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ud7ff\ud800-\udfff\ue000-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");
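The difference between the second and the third approach can be checked directly. A small sketch; the variable names are only illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

var secondApproach = new Regex("[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");
var thirdApproach = new Regex("[^\u0009\u000a\u000d\u0020-\ud7ff\ud800-\udfff\ue000-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

string validPair = "\U00010000";  // stored as the two chars "\ud800\udc00"
string loneSurrogate = "\ud800";  // a high surrogate with no low surrogate after it

Console.WriteLine(secondApproach.IsMatch(validPair));     // True: false positive from the negated class
Console.WriteLine(thirdApproach.IsMatch(validPair));      // False: the pair is left alone
Console.WriteLine(thirdApproach.IsMatch(loneSurrogate));  // True: caught by the lookahead alternative
```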

We can simplify it a bit by combining \u0020-\ud7ff\ud800-\udfff\ue000-\ufffd into the single range \u0020-\ufffd.

Final Approach

var invalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

I checked, and this Regex matches exactly the characters that the spec disallows.
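One way such a check could be done is sketched below. It is not necessarily the check that was actually run: it compares the Regex with the spec ranges for every single UTF-16 char, and then probes the four corner cases of the surrogate pair range. IsValidBmpChar is a hypothetical helper.

```csharp
using System;
using System.Text.RegularExpressions;

var invalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

// Hypothetical helper: spec ranges for one char on its own.
// A lone surrogate falls outside both ranges, so it is reported as invalid.
static bool IsValidBmpChar(char c) =>
    c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0xFFFD);

int mismatches = 0;

// Every one-char string: the Regex must flag it exactly when the spec disallows it.
for (int i = 0; i <= 0xFFFF; i++)
{
    char c = (char)i;
    bool flagged = invalidXmlCharactersRegex.IsMatch(c.ToString());
    if (flagged == IsValidBmpChar(c)) mismatches++;
}

// Corner cases of the supplementary planes: valid pairs must never be flagged.
string[] validPairs = { "\ud800\udc00", "\ud800\udfff", "\udbff\udc00", "\udbff\udfff" };
foreach (var pair in validPairs)
{
    if (invalidXmlCharactersRegex.IsMatch(pair)) mismatches++;
}

Console.WriteLine(mismatches);
```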

And here is the final version of the desired methods

// Requires: using System.Text.RegularExpressions;

// Matches invalid BMP characters and unpaired surrogates; built once and reused by both methods.
static readonly Regex InvalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

public static string StripInvalidXmlCharacters(string str)
{
    return InvalidXmlCharactersRegex.Replace(str, "");
}

public static string EscapeInvalidXmlCharacters(string str)
{
    // Every match is exactly one char, so match.Value[0] is the whole match.
    return InvalidXmlCharactersRegex.Replace(str, match => CharToUnicodeSequence(match.Value[0]));
}

static string CharToUnicodeSequence(char symbol)
{
    // Produces text such as \u0000: a backslash, 'u' and four lowercase hex digits.
    return string.Format("\\u{0}", ((int) symbol).ToString("x4"));
}
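To see what stripping and escaping produce, here is a short usage sketch. It inlines the same Regex and the same replacement logic so that it runs on its own; the input string and the variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

var invalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

// 'a', NUL, 'b', a lone high surrogate, 'c', then U+1F600 as a valid surrogate pair.
string input = "a\u0000b\ud800c\U0001F600";

// Same logic as StripInvalidXmlCharacters.
string stripped = invalidXmlCharactersRegex.Replace(input, "");

// Same logic as EscapeInvalidXmlCharacters.
string escaped = invalidXmlCharactersRegex.Replace(
    input, match => string.Format("\\u{0}", ((int)match.Value[0]).ToString("x4")));

Console.WriteLine(stripped == "abc\U0001F600");                  // True
Console.WriteLine(escaped == "a\\u0000b\\ud800c\U0001F600");     // True
```

The escaped form is ordinary text (a backslash followed by u and four hex digits), so it is valid in XML, but an XML parser will not turn it back into the original character.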

UPD: As I was asked in a comment, here is a positive Regex for valid XML characters.

My first, incorrect attempt was to simply negate invalidXmlCharactersRegex by replacing the negated character class [^…] with a positive one […], the negative lookahead (?!…) with a positive lookahead (?=…), and the negative lookbehind (?<!…) with a positive lookbehind (?<=…).

var validXmlCharactersRegex = new Regex("[\u0009\u000a\u000d\u0020-\ufffd]|([\ud800-\udbff](?=[\udc00-\udfff]))|((?<=[\ud800-\udbff])[\udc00-\udfff])");

But this is wrong, because \u0020-\ufffd includes the surrogate characters, so it gives a false positive on a string with a lone surrogate:

string badString = "\ud800";

Here is the correct version of the regex

var validXmlCharactersRegex = new Regex("[\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|([\ud800-\udbff](?=[\udc00-\udfff]))|((?<=[\ud800-\udbff])[\udc00-\udfff])");
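The two positive versions can be compared on the lone surrogate and on a valid pair. A minimal sketch with illustrative variable names:

```csharp
using System;
using System.Text.RegularExpressions;

var naivePositiveRegex = new Regex("[\u0009\u000a\u000d\u0020-\ufffd]|([\ud800-\udbff](?=[\udc00-\udfff]))|((?<=[\ud800-\udbff])[\udc00-\udfff])");
var correctPositiveRegex = new Regex("[\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|([\ud800-\udbff](?=[\udc00-\udfff]))|((?<=[\ud800-\udbff])[\udc00-\udfff])");

string badString = "\ud800";       // lone high surrogate, not valid in XML
string goodString = "\U00010000";  // valid surrogate pair

Console.WriteLine(naivePositiveRegex.IsMatch(badString));     // True: the false positive
Console.WriteLine(correctPositiveRegex.IsMatch(badString));   // False
Console.WriteLine(correctPositiveRegex.IsMatch(goodString));  // True
```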

UPD2: As I was asked in a comment, we can simplify the Regex if we don’t need each surrogate char as a separate match. A valid surrogate pair is then matched as a whole.

var validXmlCharactersRegex = new Regex("[\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|[\ud800-\udbff][\udc00-\udfff]");
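The practical difference between the lookaround version and the simplified one is the granularity of the matches. A small sketch, with made-up variable names, that counts the matches on a string containing one supplementary-plane character:

```csharp
using System;
using System.Text.RegularExpressions;

var perCharRegex = new Regex("[\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|([\ud800-\udbff](?=[\udc00-\udfff]))|((?<=[\ud800-\udbff])[\udc00-\udfff])");
var perPairRegex = new Regex("[\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|[\ud800-\udbff][\udc00-\udfff]");

string text = "a\U0001F600";  // 'a' followed by one surrogate pair (three chars in total)

int perCharCount = perCharRegex.Matches(text).Count;  // 'a', high surrogate, low surrogate
int perPairCount = perPairRegex.Matches(text).Count;  // 'a', whole pair

Console.WriteLine(perCharCount);  // 3
Console.WriteLine(perPairCount);  // 2
```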
