This section explains how to use UCS4 internal wide string encoding and locales for printing debug messages.
Filenames in a filesystem can contain wide characters (e.g. UCS2 in vfat or UCS4 in NTFS), so the conversion must support such filenames. We decided to use the gconv library (part of glibc) to perform these conversions, because it supports all the important encodings: the internal UCS2/UCS4, the general UTF8, and a lot of proprietary charsets such as ISO-8859-2 and even CP1250.
We decided to use UCS4 encoding for internal purposes. A UCS4 string consists of characters 4 bytes long, stored in big-endian format, and is followed by a terminating zero character. Every character can be indexed directly, the string length is obvious (size of the array divided by 4, minus 1), and most importantly this encoding has room for all possible characters (2^32 = 4G code values).
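As a small illustration of the layout described above, the following sketch stores and reads code points as 4 big-endian bytes and counts characters up to the terminating zero. The helper names ucs4_put, ucs4_get and ucs4_len are made up for this example and are not functions from the project.

```c
#include <stddef.h>
#include <stdint.h>

/* Store one code point as 4 bytes, most significant byte first. */
void ucs4_put(unsigned char *dst, uint32_t cp)
{
    dst[0] = (unsigned char)(cp >> 24);
    dst[1] = (unsigned char)(cp >> 16);
    dst[2] = (unsigned char)(cp >> 8);
    dst[3] = (unsigned char)cp;
}

/* Read the code point at character index i (byte offset 4 * i). */
uint32_t ucs4_get(const unsigned char *s, size_t i)
{
    const unsigned char *p = s + 4 * i;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Count the characters that precede the terminating 4-byte zero. */
size_t ucs4_len(const unsigned char *s)
{
    size_t n = 0;
    while (ucs4_get(s, n) != 0)
        n++;
    return n;
}
```

For a buffer that holds exactly one string, ucs4_len gives the same result as the "size of array divided by 4, minus 1" rule.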
If you want to convert multibyte strings (this is the general case: most
encodings, such as iso-8859-2, use just 1 byte per character, but UTF8 uses
a variable length), you must first call iconv_open. This function takes
2 charset names (the target charset first, then the source charset) and
returns a handle for performing this type of conversion:
iconv_t ih=iconv_open("UCS4","ISO-8859-2");
When it is no longer needed, the handle should be closed by calling
iconv_close(ih);
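iconv_open can fail, for example when one of the charset names is not known to the library, so it is worth checking the returned handle. A minimal sketch follows; the wrapper name open_to_ucs4 is hypothetical and only serves the example.

```c
#include <iconv.h>
#include <stdio.h>

/* Open a converter from `from_charset` into UCS4.
 * Returns (iconv_t)-1 on failure, exactly as iconv_open does. */
iconv_t open_to_ucs4(const char *from_charset)
{
    iconv_t ih = iconv_open("UCS4", from_charset);
    if (ih == (iconv_t)-1)
        perror("iconv_open");   /* errno is EINVAL for an unsupported pair */
    return ih;
}
```

A handle obtained this way is released with iconv_close(ih), which returns 0 on success.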
If you want to convert a string from one encoding to another, call the
iconv function. This function is very general, so you MUST read the
info libc documentation. It takes the conversion handle, 2 pointers to
char* buffers (input and output) and 2 pointers to their remaining
lengths in bytes. It tries to convert as many characters as possible.
Unfortunately it has one bug/feature: if it finds an invalid character
(e.g. an invalid multibyte sequence or a character which is not
supported in the target encoding), it returns an error. The caller MUST
take care of it: skip the character manually, replace it, and so on.
You can look at the ewstomb.c source (my code for printing to the
terminal using locales), where I substitute these characters with the
`?' character.
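The error handling described above can be sketched as a loop around iconv. This is not the code from ewstomb.c; it is an illustration under some assumptions: the function name to_ucs4_lossy is invented, the handle converts from a stateless multibyte charset into big-endian UCS4, and one offending input byte is skipped per error. For a conversion out of UCS4, the skip would have to be 4 bytes and the replacement a single `?' byte.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Convert `inlen` bytes at `in` into big-endian UCS4 at `out`.
 * Every input byte that iconv rejects is replaced by U+003F ('?').
 * Returns the number of bytes written, or (size_t)-1 if `out` is
 * too small. No terminating zero is appended. */
size_t to_ucs4_lossy(iconv_t ih, const char *in, size_t inlen,
                     char *out, size_t outsize)
{
    char *src = (char *)in;   /* iconv takes char **, but only reads the input */
    char *dst = out;
    size_t inleft = inlen, outleft = outsize;

    while (inleft > 0) {
        if (iconv(ih, &src, &inleft, &dst, &outleft) != (size_t)-1)
            break;                      /* all remaining input converted */
        if (errno == E2BIG || outleft < 4)
            return (size_t)-1;          /* output buffer exhausted */
        /* EILSEQ (invalid sequence) or EINVAL (truncated sequence):
         * emit '?' as 4 big-endian bytes and skip one input byte. */
        dst[0] = 0; dst[1] = 0; dst[2] = 0; dst[3] = '?';
        dst += 4; outleft -= 4;
        src += 1; inleft -= 1;
    }
    return outsize - outleft;
}
```

iconv advances both pointers and decrements both lengths as it goes, so after an error the loop resumes exactly at the offending byte.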
Imported filesystems read filenames in their proprietary encoding (be
careful about the endianness) and store the filenames in UCS4 encoding.
Exported filesystems get filenames in UCS4 and convert them into their
proprietary encoding. The same iconv function is used in both
directions, only with a different handle.
If you want to print the strings (for debugging purposes), you must
convert them to the charset supported by the terminal. Locale functions
are used for this. Use the ewstomb function (for user-allocated
buffers) or ewstomb_static (it uses a 512-byte static buffer). These
functions take a UCS4 (big-endian) string, convert its endianness to
the processor's native order (unfortunately little-endian on Intel) and
convert it into the terminal charset.
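The two steps (byte-order conversion, then conversion to the locale charset) might look roughly like the sketch below. It is not the real ewstomb: the name ucs4be_to_locale, its signature and its return convention are assumptions made for this example, and it relies on wchar_t holding UCS4 code points, which is the case in glibc.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a zero-terminated big-endian UCS4 string to the multibyte
 * charset of the current LC_CTYPE locale, writing '?' for characters
 * the locale cannot represent. Returns the number of bytes written
 * (without the final NUL), or (size_t)-1 if `out` is too small. */
size_t ucs4be_to_locale(const unsigned char *in, char *out, size_t outsize)
{
    mbstate_t st;
    char tmp[MB_LEN_MAX];
    size_t used = 0;

    memset(&st, 0, sizeof st);
    for (;; in += 4) {
        /* Assembling from bytes yields the value in native byte order. */
        uint32_t cp = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
                      ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
        if (cp == 0)
            break;
        size_t n = wcrtomb(tmp, (wchar_t)cp, &st);
        if (n == (size_t)-1) {          /* not representable here */
            tmp[0] = '?';
            n = 1;
            memset(&st, 0, sizeof st);  /* state is undefined after an error */
        }
        if (used + n + 1 > outsize)
            return (size_t)-1;
        memcpy(out + used, tmp, n);
        used += n;
    }
    if (used + 1 > outsize)
        return (size_t)-1;
    out[used] = '\0';
    return used;
}
```

In the default "C" locale only ASCII survives; after the locale has been initialised from the environment, characters of the terminal charset are passed through.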
For this to work, locale_init() MUST be called when the program
starts: it reads the environment variables and sets the correct
charset.
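The source of locale_init() is not shown here, but with the standard C library the same effect is usually achieved through setlocale. The following is only a guess at an equivalent; the name locale_init_sketch and its return value are assumptions of this example.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Adopt the character-type locale selected by the environment
 * (LC_ALL, LC_CTYPE or LANG, in that order of precedence).
 * Returns the name of the resulting charset, or NULL if the
 * requested locale is not installed. */
const char *locale_init_sketch(void)
{
    if (setlocale(LC_CTYPE, "") == NULL) {
        fprintf(stderr, "locale not supported, staying in \"C\"\n");
        return NULL;
    }
    return nl_langinfo(CODESET);   /* charset name of the active locale */
}
```

Until setlocale is called, the program runs in the "C" locale regardless of the environment, which is why the call has to happen before any string is printed.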
I can confirm that it really works once the locale is set: LC_CTYPE=cs
is set on my computer (so the iso-8859-2 charset is used), and when I
list directories from my vfat partition (stored in UCS2 by Windows98),
all characters are displayed correctly (including Czech!!!).