Unicode characters (code points) range from 0 to 0x10FFFF (1,114,111), of which 97,655 are assigned in Unicode 4.1. The code points from 0 to 0xFFFF form the BMP (Basic Multilingual Plane). Outside it are, among others, Deseret, Byzantine Musical Symbols, and some CJK characters.

There are several Unicode transformation formats (encodings):

For UTF-8 and UTF-16, the start of a character is easy to detect (search backward a bounded number of bytes: at most 3 in UTF-8, at most 2 in UTF-16), so corruption stays localized. For UTF-16, the Byte Order Mark (BOM) U+FEFF indicates byte order (the byte-swapped value 0xFFFE is guaranteed never to be a valid character). Some programs write the UTF-8 encoding of the BOM (the bytes EF BB BF) to mark a file as UTF-8.
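
The two ideas above, backing up to the start of a character and recognizing a BOM, can be sketched in Python. This is a minimal illustration, not a complete decoder: the function names char_start and sniff_bom are made up for this example, the input is assumed to be well-formed, and UTF-32 (whose little-endian BOM begins with the same two bytes as the UTF-16 one) is deliberately ignored.

```python
import codecs
from typing import Optional


def char_start(buf: bytes, i: int) -> int:
    """Return the index of the first byte of the UTF-8 character containing buf[i].

    Hypothetical helper. UTF-8 continuation bytes look like 10xxxxxx, and a
    character has at most 3 of them, so we step back at most 3 bytes.
    """
    start = i
    while start > 0 and i - start < 3 and (buf[start] & 0xC0) == 0x80:
        start -= 1
    return start


def sniff_bom(buf: bytes) -> Optional[str]:
    """Guess the encoding from a leading BOM, or return None if there is none.

    Hypothetical helper; only UTF-8 and UTF-16 are considered.
    """
    if buf.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if buf.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    if buf.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    return None


# 'a' (1 byte), U+20AC EURO SIGN (3 bytes), U+10400 DESERET CAPITAL LONG I (4 bytes)
data = "a\u20ac\U00010400".encode("utf-8")
print(char_start(data, 3))  # 1: byte 3 is the last byte of the euro sign
print(char_start(data, 7))  # 4: byte 7 is the last byte of the Deseret letter
print(sniff_bom(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")))  # utf-16-be
```

The same backward search works for UTF-16 by checking whether the current 16-bit unit is a low surrogate (0xDC00 to 0xDFFF) and, if so, stepping back one unit.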