Hi folks
The more you work with Unicode, the more discoveries you make.
.NET System.Char represents a character as a UTF-16 code unit.
UTF-16 has a concept of surrogates:
Code units from U+D800 to U+DBFF - lead surrogate, aka first code unit, aka high surrogate
Code units from U+DC00 to U+DFFF - tail surrogate, aka second code unit, aka low surrogate
To form a valid Unicode code point, a lead surrogate must always be followed by a tail surrogate.
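As a quick illustration, `System.Char` provides helpers for classifying surrogates and combining a pair into a code point. A minimal sketch (the 😀 emoji, U+1F600, is stored as a surrogate pair in UTF-16):

```csharp
string s = "\ud83d\ude00"; // U+1F600, stored as two UTF-16 code units

Console.WriteLine(char.IsHighSurrogate(s[0])); // True: lead surrogate
Console.WriteLine(char.IsLowSurrogate(s[1]));  // True: tail surrogate

// Combine the pair back into a single code point.
int codePoint = char.ConvertToUtf32(s[0], s[1]);
Console.WriteLine(codePoint.ToString("X")); // 1F600
```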
However, .NET does not enforce this rule: you can create a string that is not valid from the UTF-16 point of view.
For example
string s = "a\ud800b";
Here \ud800 is a lead surrogate, but it is followed by the letter b, which is not a tail surrogate.
This is not a valid Unicode string, and that can cause some issues.
For example
s.Normalize();
fails with
System.ArgumentException: Invalid Unicode code point found at index 2.
Parameter name: strInput
If we write such a string to a file, some text editors may fail to open it.
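To my understanding, what actually lands in the file depends on the encoder's fallback settings: the default `Encoding.UTF8` silently replaces a lone surrogate with U+FFFD, while a strict `UTF8Encoding` throws instead. A sketch of both behaviors:

```csharp
using System;
using System.Text;

string s = "a\ud800b";

// Default encoder: the lone surrogate becomes U+FFFD (bytes EF BF BD).
byte[] bytes = Encoding.UTF8.GetBytes(s);
Console.WriteLine(BitConverter.ToString(bytes));

// Strict encoder: throwOnInvalidBytes: true makes encoding fail instead.
var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                              throwOnInvalidBytes: true);
try
{
    strict.GetBytes(s);
}
catch (EncoderFallbackException e)
{
    Console.WriteLine($"Encoding failed: {e.Message}");
}
```

So even before a text editor gets involved, the data is either silently corrupted or the write fails outright.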
So if we got a string from an unreliable source, we may want to strip the invalid characters.
There are several approaches, but I would like to suggest one based on Regex.
We will use a negative lookahead and a negative lookbehind: find lead surrogates that are not followed by a tail surrogate, and tail surrogates that are not preceded by a lead surrogate.
public static string StripInvalidUnicodeCharacters(string str)
{
    // First alternative: a lead surrogate not followed by a tail surrogate.
    // Second alternative: a tail surrogate not preceded by a lead surrogate.
    var invalidCharactersRegex = new Regex("([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");
    return invalidCharactersRegex.Replace(str, "");
}
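A quick usage sketch, reusing the broken string from above plus a lone tail surrogate (the expected results are my reading of the regex, assuming the method above):

```csharp
// "a" + lone lead surrogate + "b" + lone tail surrogate + "c"
string dirty = "a\ud800b\udc00c";
string clean = StripInvalidUnicodeCharacters(dirty);
Console.WriteLine(clean);             // "abc": both lone surrogates stripped
Console.WriteLine(clean.Normalize()); // no longer throws

// A valid surrogate pair is left intact.
string emoji = "a\ud83d\ude00b";      // "a" + U+1F600 + "b"
Console.WriteLine(StripInvalidUnicodeCharacters(emoji) == emoji); // True
```

Note that valid pairs survive because a lead surrogate followed by a tail surrogate fails the lookahead, and that tail surrogate in turn passes the lookbehind.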