Are these just the "metadata" you referred to?
Maintaining encoding conversion tables for oriental languages is
hard work. For example, your GB2312 table includes only 6763
Chinese characters, but our MANDATORY new national standard, GB18030,
covers 27484 Chinese characters! With GB2312 alone we cannot even
spell our ex-prime minister's name (Rong-Ji Zhu), nor can we print
the full text of most Chinese classical novels.
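
To make the gap concrete, here is a minimal sketch using Python's bundled
codecs (an assumption for illustration only; these are not the tables under
discussion here). The helper name can_encode is hypothetical. It checks each
character of the name against three Chinese character sets:

```python
# Sketch: which characters of the name survive each Chinese character set?
# Uses Python's built-in codecs, not the conversion tables discussed above.

NAME = "朱镕基"  # Zhu Rongji, family name first


def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of `text` exists in `charset`."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for charset in ("gb2312", "gbk", "gb18030"):
        missing = [ch for ch in NAME if not can_encode(ch, charset)]
        print(charset, "missing:", missing if missing else "nothing")
```

With Python's codecs, the middle character should be reported as missing
from GB2312 only.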
Apart from the metadata, it is wiser to use the substantial conversion
tables provided by other professional libraries.
If you agree, I and others will help you with the oriental
languages. Western language encodings (ISO 8859-X, KOI-8, IBM/Microsoft)
are much simpler than CJK and easy to handle.
------------------------------------------------------------------------
From Beijing, China
Sam Varshavchik wrote:
Ysbeer writes:
Out of curiosity, have you ever considered using ICU for handling your
Unicode requirements?
I am not familiar with ICU's capabilities. The requirements are that, for
a given character set, I must know:
1) Whether the character set's lower 128 bytes consist of US-ASCII
2) Whether the character set is a direct mapping of Unicode (UTF-8, UTF-7, et al)
3) Whether the character set uses multibyte characters
4) Whether the character set uses composite mapping with shift-in/shift-out
escape codes
5) Whether unrepresentable Unicode characters may be ignored when converting
Unicode to/from the character set
6) Whether quoted-printable or base-64 is best for encoding the character
set in the message's headers or body.
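
The six requirements above amount to a small per-charset metadata record.
The following is only a sketch of what such a record might look like, written
in Python rather than in the mail server's own code. All names (CharsetInfo,
is_ascii_superset, shorter_transfer_encoding) are hypothetical, and the
size-based choice between quoted-printable and base-64 is an assumed
heuristic, not the rule Sam's code actually applies.

```python
# Sketch of per-charset metadata matching requirements 1-6 above.
# All names are hypothetical; the QP/base64 choice is an assumed heuristic.
import base64
import quopri
from dataclasses import dataclass


@dataclass
class CharsetInfo:
    name: str
    ascii_superset: bool      # 1) lower 128 bytes are US-ASCII
    unicode_transform: bool   # 2) direct mapping of Unicode (UTF-8, UTF-7, ...)
    multibyte: bool           # 3) uses multibyte characters
    shift_sequences: bool     # 4) shift-in/shift-out escape codes
    ignore_unmappable: bool   # 5) unrepresentable characters may be dropped


def is_ascii_superset(charset: str) -> bool:
    """Probe requirement 1: do bytes 0..127 decode to the same ASCII text?"""
    low = bytes(range(128))
    try:
        return low.decode(charset) == low.decode("ascii")
    except UnicodeDecodeError:
        return False


def shorter_transfer_encoding(text: str, charset: str) -> str:
    """Requirement 6, as a size heuristic: pick whichever output is shorter."""
    raw = text.encode(charset)
    qp_len = len(quopri.encodestring(raw))
    b64_len = len(base64.b64encode(raw))
    return "quoted-printable" if qp_len <= b64_len else "base64"


# A hand-filled example entry; the flag values are the author's to maintain.
LATIN1 = CharsetInfo(
    name="iso-8859-1",
    ascii_superset=is_ascii_superset("iso-8859-1"),
    unicode_transform=False,
    multibyte=False,
    shift_sequences=False,
    ignore_unmappable=True,  # assumed policy, for illustration only
)

if __name__ == "__main__":
    print(LATIN1)
    print(shorter_transfer_encoding("Hello", "iso-8859-1"))
    print(shorter_transfer_encoding("中文邮件", "gb18030"))
```

Only requirement 1 is probed automatically here. The other flags describe
how a character set is designed, so in this sketch they are filled in by hand.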