Partition Surprise 0.1: Miscellaneous technical references: locales usage

14.1 locales usage

This section explains how to use the internal UCS4 wide-string encoding, and how to use locales for printing debug messages.

Filenames in a filesystem can contain wide characters (e.g. 16-bit UCS2 in vfat and NTFS), so conversion must support these filenames. We decided to use the gconv library (part of glibc, reached through the iconv functions) to perform these conversions, because it supports all important encodings: the internal UCS2/4, general UTF8, and a lot of 8-bit charsets (ISO-8859-2, ..., including CP1250!).

We decided to use UCS4 encoding for internal purposes. UCS4 strings consist of characters 4 bytes long, stored in Big Endian format, and are terminated by a final zero character. Any character can be indexed directly, the string length is obvious (size of the array divided by 4, minus 1), and most importantly, this encoding can represent all possible characters (2^32 = 4G code values).
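The indexing and length rules above can be sketched in a few lines of C. This is only an illustration: the helper names ucs4be_char_at and ucs4be_strlen are made up for this example and are not part of the project sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read character number idx from a Big Endian
   UCS4 byte buffer. Every character occupies exactly 4 bytes. */
static uint32_t ucs4be_char_at(const unsigned char *s, size_t idx)
{
    const unsigned char *p;

    assert(s != NULL);
    p = s + 4 * idx;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: count characters up to, but not including,
   the terminating zero character. */
static size_t ucs4be_strlen(const unsigned char *s)
{
    size_t n = 0;

    while (ucs4be_char_at(s, n) != 0)
        n++;
    return n;
}
```

For a buffer that holds exactly one string, ucs4be_strlen gives the same value as the "size of the array divided by 4, minus 1" rule.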

If you want to convert multibyte strings (this is the general case: most encodings, such as iso-8859-2, use just 1 byte per character, but UTF8 uses a variable length, ...), you must first call iconv_open. This function takes 2 charset names, the target first and the source second, and returns a handle for performing this type of conversion:

iconv_t ih=iconv_open("UCS4","ISO-8859-2");

The handle should be closed by calling

iconv_close(ih);
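iconv_open can fail, for example when one of the charset names is unknown to the library. In that case it returns (iconv_t)-1 and sets errno. A minimal sketch of opening and closing a handle with that check follows; the function name conversion_available is only an illustration, not project code.

```c
#include <iconv.h>
#include <stdio.h>

/* Hypothetical helper: returns 0 if a conversion from charset `from`
   to charset `to` can be opened, -1 otherwise. */
static int conversion_available(const char *to, const char *from)
{
    iconv_t ih = iconv_open(to, from);   /* note the order: target first */

    if (ih == (iconv_t)-1) {
        perror("iconv_open");            /* typically EINVAL: unsupported pair */
        return -1;
    }
    iconv_close(ih);
    return 0;
}
```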

If you want to convert a string from one encoding to another, call the iconv function. This function is very general, so you MUST read the info libc documentation. It takes the conversion handle, 2 pointers to char* buffers (input and output) and 2 pointers to their remaining lengths. It tries to convert as many characters as possible. Unfortunately it has one bug/feature: if it finds an invalid character (e.g. an invalid multibyte sequence, or a character which is not supported in the target encoding), it stops and returns an error. The caller MUST take care of it: skip the character manually, replace it, ... You can look at the ewstomb.c source (my code for printing to the terminal using locales), where I substitute these characters with the `?' character.
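The skip-and-replace handling described above might look roughly like the following. This is a sketch, not the code from ewstomb.c: the function name convert_lossy and its parameters are assumptions made for this example. The replacement must already be in the target encoding (for UCS4 Big Endian, `?' is the 4 bytes 00 00 00 3F).

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert in[0..inlen) through handle ih into
   out[0..outlen). Whenever iconv stops at an invalid or unconvertible
   input, write repl (replen bytes) and skip one input byte.
   Returns the number of bytes written, or (size_t)-1 if out is too small. */
static size_t convert_lossy(iconv_t ih, const char *in, size_t inlen,
                            char *out, size_t outlen,
                            const char *repl, size_t replen)
{
    char *src = (char *)in;   /* iconv wants char **, it does not modify data */
    char *dst = out;
    size_t srcleft = inlen, dstleft = outlen;

    while (srcleft > 0) {
        if (iconv(ih, &src, &srcleft, &dst, &dstleft) != (size_t)-1)
            break;                          /* all input was converted */
        if (errno == E2BIG)
            return (size_t)-1;              /* output buffer is full */
        /* EILSEQ (invalid sequence) or EINVAL (truncated sequence) */
        if (dstleft < replen)
            return (size_t)-1;
        memcpy(dst, repl, replen);
        dst += replen;
        dstleft -= replen;
        src++;                              /* skip one offending byte */
        srcleft--;
        iconv(ih, NULL, NULL, NULL, NULL);  /* reset the conversion state */
    }
    return outlen - dstleft;
}
```

Because only one byte is skipped per error, a broken multi-byte input character can produce several replacement characters in a row.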

OK, imported filesystems read filenames in their proprietary encoding (be careful about endianness) and store them in UCS4 encoding. Exported filesystems get filenames in UCS4 and convert them into their proprietary encoding. The same iconv function is used in both directions, just with a different handle.
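One way to picture the "same function, another handle" idea is a pair of handles per filesystem, one for each direction. The struct and function names below (fs_charset, fs_charset_open, fs_charset_close) are invented for this sketch and do not come from the project.

```c
#include <iconv.h>

/* Hypothetical pair of conversion handles for one filesystem. */
struct fs_charset {
    iconv_t to_ucs4;     /* import: on-disk encoding -> UCS4 Big Endian */
    iconv_t from_ucs4;   /* export: UCS4 Big Endian -> on-disk encoding */
};

/* Open both directions; returns 0 on success, -1 on failure. */
static int fs_charset_open(struct fs_charset *cs, const char *disk_encoding)
{
    cs->to_ucs4 = iconv_open("UCS-4BE", disk_encoding);
    if (cs->to_ucs4 == (iconv_t)-1)
        return -1;
    cs->from_ucs4 = iconv_open(disk_encoding, "UCS-4BE");
    if (cs->from_ucs4 == (iconv_t)-1) {
        iconv_close(cs->to_ucs4);
        return -1;
    }
    return 0;
}

/* Close both handles of a successfully opened pair. */
static void fs_charset_close(struct fs_charset *cs)
{
    iconv_close(cs->to_ucs4);
    iconv_close(cs->from_ucs4);
}
```

Naming the byte order explicitly in the charset name (for example "UCS-2LE" for little-endian 16-bit names) is one way to keep the endianness problem out of your own code.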

If you want to print strings (for debugging purposes), you must convert them to the charset supported by the terminal. The locale functions are used for this. Use the ewstomb function (for user-allocated buffers) or ewstomb_static (it uses a 512-byte static buffer). These functions take a UCS4 (Big Endian) string, convert its endianness to the processor-specific one (unfortunately Little Endian on Intel) and convert it into the terminal charset.
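To show the two steps (byte order, then terminal charset), here is a rough stand-in for what such a function could do. It is not the real ewstomb: the name ucs4be_to_mb is hypothetical, and the sketch assumes that wchar_t is 32 bits wide and holds UCS4 values, as it does in glibc.

```c
#include <limits.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for ewstomb: convert a zero-terminated UCS4
   Big Endian string into the multibyte charset of the current LC_CTYPE
   locale. Characters the locale cannot represent become '?'.
   Output is truncated to fit and always NUL-terminated.
   Returns the number of bytes written, not counting the NUL. */
static size_t ucs4be_to_mb(const unsigned char *s, char *out, size_t outsize)
{
    mbstate_t st;
    char tmp[MB_LEN_MAX];
    size_t used = 0;

    if (outsize == 0)
        return 0;
    memset(&st, 0, sizeof st);
    for (;; s += 4) {
        /* Assembling the value with shifts yields host byte order
           on any processor. */
        uint32_t cp = ((uint32_t)s[0] << 24) | ((uint32_t)s[1] << 16) |
                      ((uint32_t)s[2] << 8)  |  (uint32_t)s[3];
        size_t n;

        if (cp == 0)
            break;
        n = wcrtomb(tmp, (wchar_t)cp, &st);   /* assumes 32-bit wchar_t */
        if (n == (size_t)-1) {                /* not representable */
            tmp[0] = '?';
            n = 1;
            memset(&st, 0, sizeof st);
        }
        if (used + n >= outsize)
            break;                            /* keep room for the NUL */
        memcpy(out + used, tmp, n);
        used += n;
    }
    out[used] = '\0';
    return used;
}
```

The result depends on the locale in effect, so the locale has to be set from the environment before the first call; otherwise only plain ASCII is likely to come out unchanged.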

For this to work, locale_init() MUST be called when the program starts: it reads the environment variables and sets the correct charset, ... It really works: LC_CTYPE=cs is set on my computer (so the iso-8859-2 charset is used), and when I list directories from my vfat partition (stored in UCS2 by Windows98), all characters are displayed correctly (including Czech!!!).
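The body of locale_init() is not shown in this section. With the standard C library, an initialization of this kind usually comes down to a setlocale call with an empty locale name, which makes the library read LC_ALL, LC_CTYPE and LANG from the environment. The function init_ctype_locale below is a hypothetical equivalent written for illustration; the real locale_init() may do more or something different.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical equivalent of locale_init(): select the character type
   locale from the environment and return the name of its charset. */
static const char *init_ctype_locale(void)
{
    if (setlocale(LC_CTYPE, "") == NULL) {
        /* The environment names a locale that is not installed. */
        fprintf(stderr, "warning: unsupported locale, using \"C\"\n");
        setlocale(LC_CTYPE, "C");
    }
    return nl_langinfo(CODESET);   /* e.g. "ISO-8859-2" for a Czech locale */
}
```

Printing the returned charset name once at startup is a cheap way to check that the environment is set the way you expect.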
