
REXX, Full Frontal mpeg, IANA

2013-03-14

Corrigendum #9 clarifies noncharacter usage in Unicode

While waiting for the "missing" combining character "forbidden" to go with the new U+1F4A9, I'm slightly confused by Corrigendum #9. There is a small block of 32 noncharacters in the BMP (plane 0), U+FDD0..U+FDEF, and each of the 17 planes (0..16) ends with two noncharacters, U+xxFFFE and U+xxFFFF, for an immutable (stability guaranteed) total of 66 = 32 + 2×17 noncharacter code points. It's good to know that converters from, say, UTF-8 to UTF-16LE are not forced to treat noncharacters as errors. Surrogates are different: surrogate code points outside of UTF-16, and UTF-16 surrogates that are not part of a surrogate pair addressing a code point outside of the BMP, are still errors.
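
The count of 66 is easy to check. The following is only a sketch in Python; the helper names is_noncharacter and is_surrogate are made up for this illustration and are not taken from the corrigendum or from any library.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacter code points (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # The block of 32 noncharacters in the BMP: U+FDD0..U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each plane: U+xxFFFE and U+xxFFFF.
    return (cp & 0xFFFE) == 0xFFFE


def is_surrogate(cp: int) -> bool:
    """True for surrogate code points U+D800..U+DFFF (hypothetical helper)."""
    return 0xD800 <= cp <= 0xDFFF


# Walk all 17 planes and collect the noncharacters.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
in_bmp_block = [cp for cp in noncharacters if 0xFDD0 <= cp <= 0xFDEF]
plane_enders = [cp for cp in noncharacters if cp not in in_bmp_block]

print(len(in_bmp_block), len(plane_enders), len(noncharacters))
```

Python's own codecs happen to behave the way the corrigendum permits: "\ufffe".encode("utf-8") succeeds, while encoding a lone surrogate such as "\ud800" with the default strict error handler raises UnicodeEncodeError.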

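But one noncharacter, U+FFFE (arguably one plus sixteen, counting all planes), has an important purpose: it is not a BOM (byte order mark), also known as signature. U+FFFE is the byte-swapped form of the BOM U+FEFF, so a reader that finds U+FFFE at the start of a UTF-16 text knows it has assumed the wrong byte order.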

UTF-16 texts starting with the bytes hex. FE FF are supposed to be UTF-16BE (big endian), while UTF-16 texts starting with the bytes hex. FF FE are supposed to be UTF-16LE (little endian). In both cases the first character is the BOM U+FEFF. UTF-16 texts starting with the noncharacter U+FFFE instead of U+FEFF would be a major pile of poo.
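
Why that would be a pile of poo can be shown in a few lines. This is a minimal sketch in Python, assuming the input is already known to be UTF-16 (the bytes FF FE also begin a UTF-32LE signature); the function name sniff_utf16_byte_order is hypothetical, only the codecs constants are real.

```python
import codecs
from typing import Optional


def sniff_utf16_byte_order(data: bytes) -> Optional[str]:
    """Guess the byte order of UTF-16 data from its first two bytes.

    Hypothetical helper; assumes the data is UTF-16 and not UTF-32.
    """
    if data[:2] == codecs.BOM_UTF16_BE:  # bytes FE FF
        return "utf-16-be"
    if data[:2] == codecs.BOM_UTF16_LE:  # bytes FF FE
        return "utf-16-le"
    return None  # no signature, byte order has to come from elsewhere


# A proper little endian text starts with U+FEFF, stored as FF FE.
good = "\ufeffpoo".encode("utf-16-le")
good_guess = sniff_utf16_byte_order(good)

# A little endian text starting with the noncharacter U+FFFE is stored
# as FE FF, which is byte for byte the big endian signature.
bad = "\ufffepoo".encode("utf-16-le")
bad_guess = sniff_utf16_byte_order(bad)

print(good_guess, bad_guess)
```

The second text is little endian, but its first two bytes claim big endian, and every following character would be decoded with swapped bytes.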


xyzzy blog: Creative Commons Attribution-ShareAlike 4.0 License