ma...@intron.ac writes:
Are these just the "meta data" you referred to?
Yes. The unicode library in Courier does not just convert stuff from one
character set to another. I also need to know some metadata about each
character set, such as what I listed below.
When, for example, encoding text in a given character set for a message's
header or body, I need to know whether the character set uses
shift-in/shift-out character sequences. If it does, base64 must be used to
encode that text in the headers. Even for character sets that don't use
shift-in/shift-out sequences, I still need to know the preferred encoding
method, in order to automatically select the best one when encoding message
content.
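A minimal sketch of that decision, written in Python rather than taken from
Courier's code. The function name encode_word and its two boolean parameters
are hypothetical stand-ins for the per-charset metadata; the output follows
the RFC 2047 encoded-word form, and the sketch ignores the 75-character
limit on encoded words.

```python
import base64


def encode_word(text, charset, uses_shift_sequences, prefer_base64=False):
    """Encode header text as one RFC 2047 encoded-word (illustrative only)."""
    raw = text.encode(charset)
    # Shift-in/shift-out charsets always get base64; otherwise the
    # charset's preferred method (an assumed metadata flag) decides.
    if uses_shift_sequences or prefer_base64:
        payload = base64.b64encode(raw).decode("ascii")
        return f"=?{charset}?B?{payload}?="
    parts = []
    for byte in raw:
        ch = chr(byte)
        if ch.isascii() and ch.isalnum():
            parts.append(ch)             # safe to send literally
        elif ch == " ":
            parts.append("_")            # Q encoding writes space as "_"
        else:
            parts.append(f"={byte:02X}")  # everything else as =XX
    return f"=?{charset}?Q?{''.join(parts)}?="


print(encode_word("caf\u00e9", "iso-8859-1", False))  # =?iso-8859-1?Q?caf=E9?=
print(encode_word("\u65e5\u672c", "iso-2022-jp", True))
```

The Q branch here is deliberately conservative: it escapes every byte that
is not an ASCII letter or digit, which is more than the standard requires.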
I remember that many years ago I sent a mail to whatever mailing list
address I dug up out of iconv's documentation. My mail was ignored.
Maintaining encoding conversion tables for oriental languages is hard
work. For example, your GB2312 table includes only 6763 Chinese
characters, but our MANDATORY new national standard, GB18030, covers
27484 Chinese characters! If we use only GB2312, we cannot even spell
our ex-prime minister's name (Rong-Ji Zhu), and we cannot print the
full text of most classical Chinese novels.
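The coverage gap can be checked with any library that ships both tables. The
sketch below uses Python's built-in codecs, not Courier's tables, and assumes
that the character alluded to above is U+9555, the middle character of the
name; can_encode is a hypothetical helper name.

```python
def can_encode(text, charset):
    """Return True if every character of text is representable in charset."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


rong = "\u9555"  # assumption: the character the message refers to
for charset in ("gb2312", "gb18030"):
    print(charset, can_encode(rong, charset))
```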
Apart from the metadata, it would be wiser to make use of the extensive
conversion tables provided by other professional libraries.
If you agree with me, others and I will help you with the oriental
languages. Western language encodings (ISO 8859-X, KOI-8, IBM/Microsoft)
are much simpler than CJK and easy to handle.
------------------------------------------------------------------------
From Beijing, China
Sam Varshavchik wrote:
Ysbeer writes:
Out of curiosity, have you ever considered using ICU for handling your
Unicode requirements?
I am not familiar with ICU's capabilities. The requirements are that, for
a given character set, I must know:
1) Whether the character set's lower 128 byte values consist of US-ASCII
2) Whether the character set is a direct mapping of unicode (UTF-8, UTF-7,
etc.)
3) Whether the character set uses multibyte characters
4) Whether the character set uses composite mapping with shift-in/shift-out
escape codes
5) Whether unrepresentable unicode characters may be ignored when converting
unicode to/from the character set
6) Whether quoted-printable or base64 is best for encoding text in the
character set in the message's headers or body.
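One way to picture the six properties above is as a set of flags attached to
each character set name. This is only a sketch in Python: the flag names, the
CHARSET_INFO table, and the values chosen for the three sample entries are
illustrative assumptions, not Courier's actual data or ICU's API.

```python
from enum import Flag, auto


class CharsetFlags(Flag):
    ASCII_LOWER_HALF = auto()   # 1) lower 128 byte values are US-ASCII
    UNICODE_TRANSFORM = auto()  # 2) direct mapping of unicode
    MULTIBYTE = auto()          # 3) uses multibyte characters
    SHIFT_SEQUENCES = auto()    # 4) shift-in/shift-out escape codes
    IGNORE_UNMAPPABLE = auto()  # 5) unrepresentable characters may be dropped
    PREFER_BASE64 = auto()      # 6) base64 preferred over quoted-printable


F = CharsetFlags

# Illustrative entries only; a real table would cover every charset.
CHARSET_INFO = {
    "ISO-8859-1": F.ASCII_LOWER_HALF,
    "UTF-8": F.ASCII_LOWER_HALF | F.UNICODE_TRANSFORM | F.MULTIBYTE,
    "ISO-2022-JP": F.MULTIBYTE | F.SHIFT_SEQUENCES | F.PREFER_BASE64,
}


def header_encoding(charset):
    """Pick a transfer encoding from the metadata flags."""
    flags = CHARSET_INFO[charset.upper()]
    if flags & (F.SHIFT_SEQUENCES | F.PREFER_BASE64):
        return "base64"
    return "quoted-printable"


for name in CHARSET_INFO:
    print(name, header_encoding(name))
```

None of these flags can be derived from a plain conversion table, which is
why the metadata has to be supplied alongside it.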