Fixes The Build: Unicode String
Author: Chris Goss
What is Unicode?
In the bad old days (circa 1987), most text was encoded using ASCII. You’re probably familiar with ASCII: it uses 7 bits to encode the characters that appear on a US keyboard (along with a few dozen non-printable control characters). It’s very handy for communicating in English, but it doesn’t fully support other languages that use the Latin alphabet, and doesn’t support languages with non-Latin alphabets at all. There were a variety of encodings (including Extended ASCII) that added additional characters with 8-bit encoding, but none of these encodings gained widespread popularity.
Unicode is an expansive multi-lingual character encoding standard that gained widespread acceptance in the 1990s. Unicode is backwards compatible with ASCII, but adds many more bits to be able to encode additional characters. Version 1.0.0 was released in 1991 and included 7,161 unique characters. Version 10.0 was released in 2017 and included 136,755 unique characters. Unicode can represent characters for nearly all of the world’s writing systems (including some historic languages), and can be expanded further. As internationally-mindful developers, we should strive to support Unicode for any text that is an input from or an output to the user or the user’s machine or profile.
Unicode defines several different encodings for its character set. The most popular encodings are UTF-8, UTF-16, UCS-2, and UTF-32. இ
UTF-32 uses 32 bits to encode each Unicode character. There is a one-to-one relationship between the Unicode code point and its storage. This is the simplest way to encode a Unicode character, but it is also the most space inefficient. It is not commonly used. ਊ
Back in the early 1990s, when Unicode was being developed, its founders believed that 16 bits would be sufficient to encode all of the world’s living languages. UCS-2 was a standard that was developed on that assumption. It is a one-to-one encoding of each Unicode code point to 2 bytes (16 bits). Under this assumption, UCS-2 is just as convenient to use as UTF-32, but with half the storage! Many early adopters of Unicode (including Microsoft Windows) used this encoding.
Unicode Version 2.0, released in 1996, determined that 16 bits was not sufficient to encode all Unicode characters (whoops!). Version 2.0 introduced the concept of multiple “planes” of Unicode, each of which had 16 bits to encode its characters. Unicode defines 17 planes total (0-16), though only planes 0-2 have a significant number of encoded characters as of Version 10.0. Plane 0, the Basic Multilingual Plane (often abbreviated BMP, but I’ll continue to use the full name for maximal pretentiousness), is the plane that includes characters for almost all modern languages. UCS-2 can only support Unicode characters in the Basic Multilingual Plane, so it is obsolete. It is still used by some software that hasn’t updated to UTF-16. That software will not be able to support emojis in text, which is tragic. 😩
UTF-16 uses a sequence of 16-bit code units to encode each Unicode code point. For Unicode characters in the Basic Multilingual Plane, UTF-16 will encode the Unicode code point in a single 16-bit code unit, just like UCS-2. Unicode characters in other planes are encoded using two 16-bit code units, called surrogate pairs. UTF-16 is a popular way to encode Unicode text.
To convert a Unicode code point beyond the Basic Multilingual Plane into a UTF-16 surrogate pair, 0x10000 is first subtracted from the code point, U, to create U’. U’ will never have more than 20 bits of information. The most significant 10 bits are stored in the first code unit: W1. The least significant 10 bits are stored in the second code unit: W2. Like so:
U’ = yyyyyyyyyyxxxxxxxxxx
W1 = 110110yyyyyyyyyy
W2 = 110111xxxxxxxxxx
To convert UTF-16 code unit(s) into a Unicode code point, one must first determine if the code unit is part of a surrogate pair. If the code unit is between 0xD800 and 0xDFFF, it is one half of a surrogate pair: 0xD800 – 0xDBFF is a leading (high) surrogate, W1, and 0xDC00 – 0xDFFF is a trailing (low) surrogate, W2. The pair is decoded into a code point using the reverse of the process described above. If the code unit is outside of that range, it is interpreted directly as a Unicode code point. ☃
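The surrogate arithmetic above is easier to trust once it is written down. Here is a minimal C++ sketch of both directions. The names encode_utf16 and decode_utf16 are made up for illustration (they are not from any library), and returning U+FFFD for an unpaired surrogate is just one reasonable choice for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode one Unicode code point as one or two UTF-16 code units.
// Hypothetical helper name, not a library function.
std::u16string encode_utf16(char32_t cp) {
    // Surrogate values themselves are not valid code points to encode.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));                    // BMP: stored directly
    } else {
        const char32_t u = cp - 0x10000;                             // U' fits in 20 bits
        out.push_back(static_cast<char16_t>(0xD800 | (u >> 10)));    // W1 = 110110 + top 10 bits
        out.push_back(static_cast<char16_t>(0xDC00 | (u & 0x3FF)));  // W2 = 110111 + low 10 bits
    }
    return out;
}

// Decode the code point that starts at index i and advance i past it.
// Assumption of this sketch: an unpaired surrogate decodes to U+FFFD.
char32_t decode_utf16(const std::u16string& s, std::size_t& i) {
    assert(i < s.size());
    const char16_t w1 = s[i++];
    if (w1 < 0xD800 || w1 > 0xDFFF) return w1;                       // not a surrogate
    if (w1 <= 0xDBFF && i < s.size()) {                              // leading surrogate
        const char16_t w2 = s[i];
        if (w2 >= 0xDC00 && w2 <= 0xDFFF) {                          // trailing surrogate
            ++i;
            return 0x10000 + ((static_cast<char32_t>(w1) - 0xD800) << 10)
                           + (static_cast<char32_t>(w2) - 0xDC00);
        }
    }
    return 0xFFFD;
}
```

For example, encode_utf16(0x1F629) (the weary-face emoji) gives the two code units 0xD83D 0xDE29, while anything in the Basic Multilingual Plane comes back as a single code unit.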
UTF-8 uses a sequence of 8-bit code units to encode each Unicode code point. One byte is used to encode ASCII characters. Two bytes are used to encode the next 1920 Unicode characters, which includes almost all Latin, Greek, Cyrillic, and Arabic languages. Three bytes are used to encode the remainder of the characters in the Basic Multilingual Plane. Four bytes are used to encode characters in other planes. To convert a Unicode code point into UTF-8, the following approach is used:
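|Byte 1|Byte 2|Byte 3|Byte 4|
|0xxxxxxx| | | |
|110xxxxx|10xxxxxx| | |
|1110xxxx|10xxxxxx|10xxxxxx| |
|11110xxx|10xxxxxx|10xxxxxx|10xxxxxx|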
The x’s represent the bits of information in the Unicode code point. The more significant bits are stored in earlier bytes. Decoding UTF-8 works by examining the first few bits in the UTF-8 byte. If the most significant bit is a 0, the byte represents an ASCII character. If the second most significant bit is 0, the byte is a non-leading byte of a multibyte encoding. If the third most significant bit is 0, it is the leading byte of a 2-byte encoding. If the fourth most significant bit is 0, it is the leading byte of a 3-byte encoding. If the fifth most significant bit is 0, it is the leading byte of a 4-byte encoding.
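To make the bit patterns concrete, here is a small C++ sketch of the encoding step and of the leading-bit check used when decoding. Both function names are hypothetical, and the encoder assumes it is handed a valid code point rather than validating it thoroughly.

```cpp
#include <cassert>
#include <string>

// Encode one code point as 1 to 4 UTF-8 bytes. Hypothetical helper name.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {                        // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Classify a byte by its leading bits. Returns the sequence length (1 to 4)
// for a leading byte, 0 for a continuation byte, and -1 for a byte that can
// never appear in UTF-8.
int utf8_sequence_length(unsigned char b) {
    if ((b & 0x80) == 0x00) return 1;       // 0xxxxxxx: ASCII
    if ((b & 0xC0) == 0x80) return 0;       // 10xxxxxx: non-leading byte
    if ((b & 0xE0) == 0xC0) return 2;       // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;       // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;       // 11110xxx
    return -1;
}
```

For example, the snowman (U+2603) lives in the Basic Multilingual Plane and encodes to the three bytes E2 98 83, while U+1F629 needs four: F0 9F 98 A9.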
One of the great features of UTF-8 encoding is that it is backwards-compatible and forwards-compatible with ASCII. Text with all ASCII characters is valid UTF-8. But, also, multi-byte encodings in UTF-8 will never be misinterpreted as ASCII characters. This is handy if you have code that looks for particular characters (like ‘/’). UTF-8 is also very space-efficient. There are some Unicode characters that are encoded using 3 bytes in UTF-8 but only 2 bytes in UTF-16. However, UTF-8 text is almost always more space-efficient than UTF-16 text due to the prevalence of ASCII characters even in non-Latin text (like space, punctuation, and numbers). Consequently, UTF-8 is the most popular way to encode Unicode text. ༂
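Here is a tiny C++ illustration of why looking for ‘/’ keeps working. The helper name last_path_component is hypothetical, and the example path is spelled out as UTF-8 byte escapes so it does not depend on the source file's encoding.

```cpp
#include <cassert>
#include <string>

// Return the part of a UTF-8 path after the last '/'. Hypothetical helper.
// A plain byte search is safe here: every byte of a multi-byte UTF-8
// sequence has its high bit set, so none of them can equal '/' (0x2F).
std::string last_path_component(const std::string& utf8_path) {
    const std::string::size_type slash = utf8_path.rfind('/');
    return slash == std::string::npos ? utf8_path : utf8_path.substr(slash + 1);
}

// Usage: "café/naïve.txt" with é = C3 A9 and ï = C3 AF.
void path_example() {
    const std::string path = "caf\xC3\xA9/na\xC3\xAF" "ve.txt";
    assert(last_path_component(path) == "na\xC3\xAF" "ve.txt");
}
```

The same code would be unsafe on UTF-16 data viewed as bytes, since the byte 0x2F can appear inside a code unit that has nothing to do with ‘/’.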
wchars in C
C90 introduced the wchar_t type to C. C95 expanded the wide character support significantly; almost every char and string helper has an associated wchar_t version. Like many C types, the size of wchar_t depends on the compiler. It can be as small as 8 bits and as large as 32 bits; it is 16 bits on Windows and 32 bits on most Unix-like systems. While wchar_t is often used to hold Unicode characters (usually as UCS-2 or UTF-16 when it is 16 bits wide), there is nothing in the language that enforces any particular encoding. There are a few functions (introduced in C95) to convert from wchar_t to char and vice versa, wcrtomb and mbrtowc for single characters and wcsrtombs and mbsrtowcs for strings, but the encodings for wchar_t and char are determined by the platform and locale rather than chosen per call, and might not be UTF-16 and UTF-8! C11 introduced the char16_t and char32_t types, which are at least 16 and 32 bits wide, respectively. C11 also introduced a few functions that convert a single character between encodings: mbrtoc16, mbrtoc32, c16rtomb, and c32rtomb, but these functions can’t operate on full strings. It’s all pretty inconvenient. 😡
Unicode in Windows
Starting in the mid-1990s, Microsoft began supporting Unicode strings in their Windows API. Windows exposes two parallel sets of APIs, one for ANSI strings and the other for Unicode strings. The ANSI versions use chars throughout, while the Unicode versions use wchars. For example, CreateFileA() and CreateFileW() are the ANSI and Unicode versions of the Windows function to create or open a file. If the file path might contain non-ASCII characters, you need to use the Unicode version. As an alternative to calling the ‘A’ or ‘W’ version explicitly, you can call CreateFile, which gets #defined to either CreateFileW (if UNICODE is defined) or CreateFileA (if not).
#ifdef UNICODE
#define CreateFile CreateFileW
#else
#define CreateFile CreateFileA
#endif
Originally, Windows interpreted wchars as UCS-2. Starting with Windows 2000, wchars are appropriately interpreted as UTF-16. Windows exposes a couple of functions that translate char strings to wchar strings and vice versa: MultiByteToWideChar() and WideCharToMultiByte(). Unlike the C functions in stdlib, these Windows functions do let you select the encoding type, so you can explicitly convert a string from UTF-8 to UTF-16 and vice versa. Hooray! 😃
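As a rough sketch of what that conversion can look like, the two wrappers below call each function twice: once to ask for the required length and once to do the work. The wrapper names utf8_to_utf16 and utf16_to_utf8 are hypothetical, returning an empty string on failure is an assumption of this sketch, and the Windows-specific code is guarded so it only compiles where windows.h exists.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>

// UTF-8 -> UTF-16, for passing strings to the 'W' functions.
std::wstring utf8_to_utf16(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    assert(utf8.size() <= static_cast<std::size_t>(INT_MAX));  // the API takes int lengths
    const int in_len = static_cast<int>(utf8.size());
    // First call: no output buffer, returns the number of UTF-16 code units needed.
    const int out_len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            utf8.data(), in_len, nullptr, 0);
    if (out_len <= 0) return std::wstring();                   // invalid UTF-8 or other error
    std::wstring utf16(static_cast<std::size_t>(out_len), L'\0');
    // Second call: convert into the correctly sized buffer.
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), in_len, &utf16[0], out_len);
    return utf16;
}

// UTF-16 -> UTF-8, for strings coming back from the 'W' functions.
std::string utf16_to_utf8(const std::wstring& utf16) {
    if (utf16.empty()) return std::string();
    assert(utf16.size() <= static_cast<std::size_t>(INT_MAX));
    const int in_len = static_cast<int>(utf16.size());
    // With CP_UTF8 the last two parameters must be null.
    const int out_len = WideCharToMultiByte(CP_UTF8, 0, utf16.data(), in_len,
                                            nullptr, 0, nullptr, nullptr);
    if (out_len <= 0) return std::string();
    std::string utf8(static_cast<std::size_t>(out_len), '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.data(), in_len,
                        &utf8[0], out_len, nullptr, nullptr);
    return utf8;
}
#endif  // _WIN32
```

Passing explicit lengths (rather than -1) means the terminating null is not counted, so the returned std::wstring and std::string have exactly the converted length.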
UTF-16 or UTF-8?
Suppose you’re working in a C++ codebase that runs primarily on Windows and you want to support Unicode file paths, user names, etc… To interface with the Unicode-friendly Windows API, you’re going to need to use wchars (encoded in UTF-16). One dilemma is whether you should use wchars in all of the intermediate code, or use char strings (encoded as UTF-8) in the intermediate code and convert to and from UTF-16 when interfacing with the Windows API. Here are some pros and cons of using UTF-8 encoded char strings in intermediate code:
Pros:
- UTF-8 is more space-efficient. Ѭ
- Using char storage and manipulation will likely require less refactoring. ㊬
Cons:
- Adds non-trivial encoding translation logic when interfacing with Windows API. 〠
- It will be less obvious if the change isn’t comprehensive; increases the risk that some strings are not Unicode. ߷
Potential pitfalls converting existing strings to UTF-8
Suppose you have an existing system that stores text as chars and treats them as ASCII characters, but you want to change it to support Unicode characters. The path of least resistance is to keep the char storage, but use UTF-8 encoding. You’ll need to change the entry and exit points for the text (I/O or other external API) to encode and decode using UTF-8. But, for the most part, you won’t have to touch the intermediate storage and string manipulation at all! There are a few exceptions, which might require specific attention:
- String length != number of characters. Any code that uses string length as a proxy for the number of characters in the string will be incorrect. Ⱙ
- to_lower() and to_upper() string manipulations will have no effect on non-ASCII characters. If you want this code to affect other Unicode characters, you’ll need to decode the UTF-8 string and perform some special logic. ⨗
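- Any > or < comparisons of characters. Each byte of a multi-byte UTF-8 encoding has its high bit set, so code testing if a char is > some ASCII literal will also be true for those bytes when char is unsigned. When char is signed (the default on most x86 compilers), those bytes are negative and < tests will match them instead. ۩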
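The pitfalls above can be sidestepped with a couple of small helpers. This is a minimal C++ sketch assuming the input is already well-formed UTF-8; the function names are hypothetical, and "character" here means a Unicode code point, which is still not always what a user perceives as one character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in well-formed UTF-8 by skipping continuation bytes
// (10xxxxxx); every other byte starts a new code point.
std::size_t utf8_code_point_count(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Lower-case only the ASCII letters and leave every byte >= 0x80 untouched.
// Comparing as unsigned char makes the range checks behave the same whether
// plain char is signed or unsigned on the target compiler.
std::string ascii_to_lower(std::string s) {
    for (char& c : s) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b >= 'A' && b <= 'Z') c = static_cast<char>(b - 'A' + 'a');
    }
    return s;
}

// Usage: "Zoë" as UTF-8 bytes (ë = C3 AB).
void pitfall_example() {
    const std::string zoe = "Zo\xC3\xAB";
    assert(zoe.size() == 4);                      // bytes
    assert(utf8_code_point_count(zoe) == 3);      // code points
    assert(ascii_to_lower(zoe) == "zo\xC3\xAB");  // the two bytes of ë are unchanged
}
```

Lower-casing non-ASCII letters would still require decoding the string and consulting Unicode case-mapping data, which this sketch deliberately does not attempt.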
Summary
- Be a good internationally-mindful developer and support Unicode text. ♗
- UTF-8 is a really nifty encoding that manages to be elegant and backwards compatible with ASCII at the same time. ᛥ
- Be pessimistic about using C standard library functions to manipulate Unicode strings. ጇ
- ‘W’ stands for Unicode in the Windows API. ჭ
- Use the phrase “Basic Multilingual Plane” as often as possible to appear pretentious. 㘜