Hi, Mr. Sam,
I think GNU libiconv is a better choice than maintaining a
Unicode library yourself. Libiconv's maintainers are better placed
to keep track of the Unicode Consortium's work.
Actually, it is East Asian users, whose languages need large
character sets, who have a much more pressing need for Unicode support
than Western users, most of whose languages can be expressed in
256-character sets.
But at the same time, maintaining large character sets
such as Chinese (GB18030, BIG5-HKSCS) and Japanese (Shift-JIS) is
tiring work. The bodies that define these encodings, the
Chinese/Japanese/Korean governments and other organizations, keep
revising the encoding standards to follow the way
Chinese/Japanese/Korean people actually write.
You told me that GNU libiconv cannot provide the metadata that
Courier requires.
But I think there is a workaround with GNU libiconv:
Assume a byte string: [b1 b2 b3 b4 b5 ... bn] (terminated by a CR/LF)
1. Initialize GNU libiconv: iconv_open("UCS-4BE", "SOME ENCODING");
2. Try iconv() against: [b1]
If it succeeds, the current character is [b1]; skip [b1] and continue
from step 2 with the remaining bytes.
3. Try iconv() against: [b1 b2]
If it succeeds, the current character is [b1 b2]; skip [b1 b2] and
continue from step 2.
4. Try iconv() against: [b1 b2 b3]
If it succeeds, the current character is [b1 b2 b3]; skip [b1 b2 b3]
and continue from step 2.
5. Try iconv() against: [b1 b2 b3 b4]
If it succeeds, the current character is [b1 b2 b3 b4]; skip
[b1 b2 b3 b4] and continue from step 2.
6. Try iconv() against: [b1 b2 b3 b4 b5]
If it succeeds, the current character is [b1 b2 b3 b4 b5]; skip
[b1 b2 b3 b4 b5] and continue from step 2.
7. Try iconv() against: [b1 b2 b3 b4 b5 b6]
If it succeeds, the current character is [b1 b2 b3 b4 b5 b6]; skip
[b1 b2 b3 b4 b5 b6] and continue from step 2.
8. If every trial failed, output "?" as a dummy substitution, skip
[b1], and continue from step 2.
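The steps above could be sketched in C roughly as follows. This is only
a rough sketch, not tested against Courier: the names first_char_len,
split_chars and MAX_CHAR_BYTES are my own invention, and it assumes a
stateless source encoding (the conversion state is reset before every
trial) and an iconv() that takes a char ** input pointer.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

#define MAX_CHAR_BYTES 6  /* longest trial in the steps above */

/* Hypothetical helper: byte length of the first character in
 * p[0..avail), or 0 if no prefix of up to max_len bytes converts. */
static size_t first_char_len(iconv_t cd, const char *p, size_t avail,
                             size_t max_len)
{
    assert(max_len >= 1 && max_len <= MAX_CHAR_BYTES);
    for (size_t len = 1; len <= max_len && len <= avail; len++) {
        char out[16];
        char *in_ptr = (char *)p;  /* iconv() wants a non-const pointer */
        char *out_ptr = out;
        size_t in_left = len;
        size_t out_left = sizeof out;

        iconv(cd, NULL, NULL, NULL, NULL);  /* reset conversion state */
        size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        /* Accept the trial only if it did not fail, consumed all
         * len bytes, and produced some output. */
        if (rc != (size_t)-1 && in_left == 0 && out_left != sizeof out)
            return len;
    }
    iconv(cd, NULL, NULL, NULL, NULL);  /* leave the descriptor clean */
    return 0;
}

/* Walk buf and store each character's byte length in lens[]. A byte
 * that no trial accepts is recorded as length 1 and counted in *bad;
 * this stands in for the "?" substitution of step 8.
 * Returns the number of entries written to lens[]. */
static size_t split_chars(iconv_t cd, const char *buf, size_t n,
                          size_t max_len, size_t *lens, size_t cap,
                          size_t *bad)
{
    size_t count = 0, pos = 0;
    *bad = 0;
    while (pos < n && count < cap) {
        size_t len = first_char_len(cd, buf + pos, n - pos, max_len);
        if (len == 0) {  /* step 8: skip one byte */
            len = 1;
            (*bad)++;
        }
        lens[count++] = len;
        pos += len;
    }
    return count;
}
```

A caller would open the descriptor once with
iconv_open("UCS-4BE", "SOME ENCODING"), pass it to split_chars()
together with the line, and call iconv_close() afterwards. Trying the
shortest prefix first is what makes the result unambiguous: a longer
prefix is only tried after every shorter one has been rejected.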
Of course, the workaround above can be optimized, because each
encoding only allows certain character lengths.
Only the [b1] and [b1 b2] trials are needed for GB2312, GBK, BIG5,
BIG5-HKSCS and Shift-JIS.
EUC-JP requires [b1], [b1 b2] and, for JIS X 0212 characters
(0x8F prefix), [b1 b2 b3].
GB18030 requires [b1], [b1 b2] and [b1 b2 b3 b4].
UTF-8 requires [b1], [b1 b2], [b1 b2 b3], [b1 b2 b3 b4],
[b1 b2 b3 b4 b5] and [b1 b2 b3 b4 b5 b6].
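One possible way to express this optimization is a small lookup table
of trial lengths per encoding. The sketch below is only an
illustration: trial_lengths_for and the table are hypothetical names,
the encoding names are matched exactly as spelled here, and any
encoding not in the table falls back to all six trials. EUC-JP is given
a three-byte trial because its JIS X 0212 characters are encoded as
0x8F followed by two bytes.

```c
#include <stddef.h>
#include <string.h>

/* One row per encoding: which prefix lengths are worth trying. */
struct trial_entry {
    const char *encoding;
    size_t lens[6];
    size_t count;
};

static const struct trial_entry trial_table[] = {
    { "GB2312",     {1, 2},             2 },
    { "GBK",        {1, 2},             2 },
    { "BIG5",       {1, 2},             2 },
    { "BIG5-HKSCS", {1, 2},             2 },
    { "Shift-JIS",  {1, 2},             2 },
    { "EUC-JP",     {1, 2, 3},          3 },
    { "GB18030",    {1, 2, 4},          3 },
    { "UTF-8",      {1, 2, 3, 4, 5, 6}, 6 },
};

/* Fallback for encodings the table does not know about. */
static const size_t all_lens[] = {1, 2, 3, 4, 5, 6};

/* Return the trial lengths for an encoding and store how many there
 * are in *count. The comparison is case-sensitive and exact. */
static const size_t *trial_lengths_for(const char *encoding, size_t *count)
{
    for (size_t i = 0; i < sizeof trial_table / sizeof trial_table[0]; i++) {
        if (strcmp(trial_table[i].encoding, encoding) == 0) {
            *count = trial_table[i].count;
            return trial_table[i].lens;
        }
    }
    *count = sizeof all_lens / sizeof all_lens[0];
    return all_lens;
}
```

The trial loop would then iterate over the returned lengths instead of
counting from 1 to 6, so GB18030 text, for example, never pays for the
three-byte trial.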
------------------------------------------------------------------------
From Beijing, China