xyzzy: Corrigendum #9 clarifies noncharacter usage in Unicode

2013-03-14

While waiting for the "missing" combining character forbidden to go with the new U+1F4A9, I'm slightly confused by Corrigendum #9. There is a small block of 32 non-characters in the BMP (plane 0), U+FDD0..U+FDEF, and each plane (0..16) ends with two non-characters, for an immutable (stability guaranteed) total of 66 = 32 + 2×17 non-character code points. It's good to know that converters from, say, UTF-8 to UTF-16LE are not forced to handle non-characters as errors. Surrogates are different: surrogate code points outside of UTF-16, or UTF-16 surrogates not appearing in a surrogate pair that addresses a code point outside of the BMP, are still errors.
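The arithmetic above can be checked with a minimal Python sketch. The helper name `is_noncharacter` is made up for this illustration, and the codec behaviour shown is that of CPython's built-in UTF-8 and UTF-16 codecs, not something Corrigendum #9 prescribes for every converter.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode non-characters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # The block of 32 in the BMP: U+FDD0..U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each plane: U+nFFFE and U+nFFFF.
    return (cp & 0xFFFE) == 0xFFFE


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 66

# A non-character (here U+FFFE) survives a UTF-8 to UTF-16LE conversion.
utf16le = b"\xef\xbf\xbe".decode("utf-8").encode("utf-16-le")
print(utf16le.hex())  # feff

# An encoded lone surrogate (here U+D800) is rejected as an error.
try:
    b"\xed\xa0\x80".decode("utf-8")
    surrogate_rejected = False
except UnicodeDecodeError:
    surrogate_rejected = True
print(surrogate_rejected)  # True
```

Note that U+FFFE encoded as UTF-16LE yields the bytes FE FF, exactly what a big-endian BOM looks like.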

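But one non-character, U+FFFE (arguably one plus sixteen, counting its counterparts in all planes), has an important purpose: it is not a BOM (byte order mark), also known as a signature. Because U+FFFE is guaranteed never to be a character, a byte-swapped U+FEFF can be recognized for what it is.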

UTF-16 texts starting with the bytes hex. FE FF are supposed to be UTF-16BE (big endian), while UTF-16 texts starting with the bytes hex. FF FE are supposed to be UTF-16LE (little endian). UTF-16 texts starting with the non-character U+FFFE instead of U+FEFF would be a major pile of poo.
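As a rough sketch of that convention, the Python snippet below sniffs the byte order from the first two bytes. The function name `sniff_utf16_byte_order` is hypothetical; `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are the standard library constants for the two signatures. It assumes the input is already known to be UTF-16 and does not try to tell it apart from other encodings.

```python
import codecs
from typing import Optional


def sniff_utf16_byte_order(data: bytes) -> Optional[str]:
    """Guess the byte order of UTF-16 data from its BOM, if there is one."""
    if data[:2] == codecs.BOM_UTF16_BE:  # bytes FE FF
        return "utf-16-be"
    if data[:2] == codecs.BOM_UTF16_LE:  # bytes FF FE
        return "utf-16-le"
    return None  # no signature, the byte order has to come from elsewhere


big = codecs.BOM_UTF16_BE + "poo".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "poo".encode("utf-16-le")
print(sniff_utf16_byte_order(big))     # utf-16-be
print(sniff_utf16_byte_order(little))  # utf-16-le

# Reading a big-endian BOM with the wrong byte order yields U+FFFE.
swapped = big[:2].decode("utf-16-le")
print(hex(ord(swapped)))  # 0xfffe
```

The last line is the point of the paragraph above: a decoder that sees U+FFFE where it expected U+FEFF knows it picked the wrong byte order, which only works as long as no text legitimately starts with U+FFFE.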
