Backward Compatibility of Character Set Encodings
I recently spent some time using Linux. Programs for Linux aren't perfect; sometimes they have bugs. One of my worries is the compatibility of character set encodings.
The distribution I installed was Red Hat Fedora Core 2. By default it uses UTF-8 encoding for all languages, which is good for internationalization. My worry about character set encoding compatibility is not whether UTF-8 supports Chinese, but whether Linux programs are fully compatible with UTF-8.
Let’s see an example. The following code illustrates a typical string comparison:
#include <string.h>

int contains_dot_dot(const char *s)
{
    return (strstr(s, "/..") != NULL) ? 1 : 0;
}
My concern is this: with a multi-byte character set encoding, a character may be composed of two bytes where the first byte is >= 0x80 but the second byte is below that. If the second byte happens to be '/', then strstr may report a match even though that '/' is not a character of its own, only the trailing byte of one.
In GBK, '/' is never used as a second byte, but '\' is. Recently I read an article about UTF-8 on Wikipedia. It assured me that UTF-8 never uses bytes in the range 0x00 to 0x7F inside multi-byte characters, which made me happy. GB2312 also has this property, while GBK does not.
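To make the GBK risk concrete, here is a small sketch. The byte pair 0x81 0x5C falls inside GBK's valid lead/trail byte ranges, and its trailing byte equals '\', so a byte-wise search reports a backslash that is not really in the text:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* A GBK-encoded string: one two-byte character whose trailing byte
       is 0x5C ('\'), followed by "dir". No real backslash appears. */
    const char s[] = "\x81\x5C" "dir";

    /* A byte-wise search still "finds" a backslash. */
    if (strchr(s, '\\') != NULL)
        printf("false match: '\\' found inside a multi-byte character\n");

    return 0;
}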
That makes the work simpler. Most code that doesn't need to operate on whole characters or on extended ASCII characters should become compatible automatically. Only the operations that must work on characters rather than bytes, e.g. truncating and displaying, need to be handled carefully (a truncation sketch follows below).
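For truncation, the UTF-8 property above is enough: continuation bytes always have the form 10xxxxxx, so a truncation point can simply back up until it no longer lands in the middle of a character. A minimal sketch (utf8_truncate is just an illustrative name, and the input is assumed to be valid, NUL-terminated UTF-8 of length len):

#include <stddef.h>

/* Truncate a UTF-8 string to at most max_bytes without cutting a
   character in half. Continuation bytes look like 10xxxxxx, so we
   back up past them until we reach a character boundary. */
size_t utf8_truncate(char *s, size_t len, size_t max_bytes)
{
    size_t n = (len < max_bytes) ? len : max_bytes;

    while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80)
        n--;

    s[n] = '\0';
    return n;
}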
There is also an issue with console display, where GBK and GB2312 have an advantage over UTF-8. In console mode there are no graphical calls like GetTextExtent, the function that returns the rendered size of text. On the console, a character occupies either one or two ASCII character cells; a character that occupies one cell is encoded in one byte in GBK and GB2312, and a character that occupies two cells is encoded in two bytes, so the byte length directly gives the display width. UTF-8 does not have this property, so one way to determine how wide a character will be on the console is to convert it to GBK or GB2312 first.
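Here is a rough sketch of that conversion idea using iconv(3); console_width_via_gbk is just an illustrative name, error handling is minimal, and it assumes the local iconv installation has a "GBK" converter:

#include <iconv.h>
#include <stddef.h>

/* Convert one UTF-8 character to GBK and use the byte count of the
   result as its width in console cells (1 byte = 1 cell, 2 bytes = 2
   cells). Returns -1 if the conversion is unavailable or fails. */
int console_width_via_gbk(const char *utf8_char, size_t len)
{
    char out[8];
    char *in = (char *)utf8_char;
    char *outp = out;
    size_t inleft = len, outleft = sizeof(out);
    int width = -1;

    iconv_t cd = iconv_open("GBK", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    if (iconv(cd, &in, &inleft, &outp, &outleft) != (size_t)-1)
        width = (int)(sizeof(out) - outleft);   /* 1 for ASCII, 2 for CJK */

    iconv_close(cd);
    return width;
}

On a modern system the standard wcwidth(3) function answers a similar question for a wide character directly, but the conversion trick shows why the byte-length property of GBK and GB2312 is so convenient.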
What about Unicode then? It seems to be recommended by both Apple and Microsoft.
Unicode can be stored directly in wchar_t as long as the text contains only BMP (UCS-2) characters. The BMP covers all characters in daily use and ranges from 0x0000 to 0xFFFF. Stored this way there is no variable-length encoding to deal with; the only thing to consider is whether the byte order is big endian or little endian. This is convenient for text editors and other character-manipulation programs. However, it wastes a lot of space on Western text, so Western programmers generally dislike it as a storage format, and Linux does not recommend it for storage either. Instead, Linux recommends UTF-8 for storage. When manipulating text in memory, converting the UTF-8 bytes to a UCS-2 representation is fine for text editing (a sketch follows below). Professional text editors such as VIM support UCS-4, which covers the full Unicode character set; each UCS-4 character takes 4 bytes.
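A small sketch of that "UTF-8 on disk, wide characters in memory" approach, assuming the program runs under a UTF-8 locale (on Linux with glibc, wchar_t is 4 bytes, i.e. UCS-4):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");   /* assumes a UTF-8 locale, e.g. zh_CN.UTF-8 */

    const char *utf8 = "Hello, 世界";
    wchar_t buf[64];

    /* mbstowcs() converts the UTF-8 bytes into one wchar_t per character,
       so indexing, length, and truncation become per-character operations. */
    size_t n = mbstowcs(buf, utf8, sizeof(buf) / sizeof(buf[0]));
    if (n == (size_t)-1) {
        fprintf(stderr, "conversion failed\n");
        return 1;
    }

    wprintf(L"%zu characters\n", n);   /* 9 characters, not 13 UTF-8 bytes */
    return 0;
}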