7 messages in net.sourceforge.lists.courier-users: Re: [courier-users] The Possibility t...
From              Sent On
ma...@intron.ac   May 21, 2006 11:37 pm
Sam Varshavchik   May 22, 2006 3:54 am
Ysbeer            May 25, 2006 2:52 pm
Sam Varshavchik   May 25, 2006 3:19 pm
ma...@intron.ac   May 27, 2006 12:24 am
Sam Varshavchik   May 27, 2006 6:56 am
ma...@intron.ac   May 27, 2006 7:34 am
Subject: Re: [courier-users] The Possibility to Substitute GNU Libiconv for Your Unicode Library
From: ma...@intron.ac (ma@intron.ac)
Date: May 27, 2006 7:34:19 am
List: net.sourceforge.lists.courier-users

Iconv is a simple, stream-oriented API that conforms to UNIX 98. Perhaps the subscribers of those mailing lists considered iconv too simple to need discussion. I know of only four manual pages about GNU libiconv:

/usr/local/man/man1/iconv.1.gz
/usr/local/man/man3/iconv.3.gz
/usr/local/man/man3/iconv_open.3.gz
/usr/local/man/man3/iconv_close.3.gz

Well, the metadata are Courier-specific and need to be kept and developed further. But the actual encoding conversion can be handled by GNU libiconv, can't it?
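To illustrate what "stream-oriented" conversion means in practice, here is a minimal sketch. It does not call libiconv; Python's standard incremental codecs stand in for the iconv_open / iconv / iconv_close pattern, and the name convert_stream is hypothetical.

```python
import codecs


def convert_stream(chunks, from_charset, to_charset="utf-8"):
    """Convert an iterable of byte chunks from one charset to another.

    The converter objects are created once, fed input piece by piece,
    and flushed at the end, much like an iconv conversion descriptor.
    """
    decoder = codecs.getincrementaldecoder(from_charset)()
    encoder = codecs.getincrementalencoder(to_charset)()
    for chunk in chunks:
        # An incomplete multibyte sequence at the end of a chunk is
        # buffered until the next chunk supplies the remaining bytes.
        yield encoder.encode(decoder.decode(chunk))
    # Flush; this raises UnicodeDecodeError if the input ended mid-character.
    yield encoder.encode(decoder.decode(b"", final=True), final=True)


data = "中文".encode("gb2312")             # 4 bytes, 2 per character
pieces = [data[:1], data[1:3], data[3:]]   # split inside both characters
result = b"".join(convert_stream(pieces, "gb2312"))
```

The caller never has to align chunk boundaries with character boundaries, which is the property that matters when converting a message as it is read.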

I would be willing to write some of the relevant code.

------------------------------------------------------------------------
From Beijing, China

Sam Varshavchik wrote:

ma@intron.ac writes:

Are these just "meta data" you referred to?

Yes. The Unicode library in Courier does not just convert stuff from one character set to another. I also need to know some metadata about each character set, such as what I listed below.

When, for example, encoding text in a given character set in a message's headers or body, I need to know whether the character set uses shift-in/shift-out character sequences; if so, base64 must be used for encoding that character set in the headers. Even for character sets that don't use shift-in/shift-out sequences, I still need to know the preferred encoding method, in order to automatically select the best one when encoding message content.
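A rough sketch of that decision for a single header word follows. This is not Courier's code: the function names are hypothetical, the size comparison used when no shift sequences are involved is only one plausible heuristic, and the 75-character limit on RFC 2047 encoded-words is ignored for brevity.

```python
import base64

# Bytes that may appear literally in an RFC 2047 "Q" encoded-word.
SAFE_Q = set(b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!*+-/")


def choose_header_encoding(raw, uses_shift_sequences):
    """Return "B" (base64) or "Q" (quoted-printable style) for raw bytes."""
    if uses_shift_sequences:
        # Rule from the discussion: shift-in/shift-out charsets need base64.
        return "B"
    # Assumed heuristic: pick whichever encoding yields the shorter payload.
    unsafe = sum(1 for b in raw if b not in SAFE_Q and b != 0x20)
    q_len = len(raw) + 2 * unsafe        # each unsafe byte becomes "=XX"
    b_len = 4 * ((len(raw) + 2) // 3)    # base64 output length with padding
    return "Q" if q_len <= b_len else "B"


def q_encode_byte(b):
    if b == 0x20:
        return "_"                       # space is written as underscore
    if b in SAFE_Q:
        return chr(b)
    return "=%02X" % b


def encode_word(text, charset, uses_shift_sequences=False):
    """Build one RFC 2047 encoded-word for text in the given charset."""
    raw = text.encode(charset)
    method = choose_header_encoding(raw, uses_shift_sequences)
    if method == "B":
        payload = base64.b64encode(raw).decode("ascii")
    else:
        payload = "".join(q_encode_byte(b) for b in raw)
    return "=?%s?%s?%s?=" % (charset, method, payload)
```

Note that the uses_shift_sequences flag has to be supplied by the caller; a pure conversion library does not report it, which is the point being made above.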

I remember that many years ago I sent a mail to whatever mailing list address I dug up out of iconv's documentation. My mail was ignored.

Maintaining encoding conversion tables for East Asian languages is hard work. For example, your GB2312 table includes only 6763 Chinese characters. But our MANDATORY new national standard, GB18030, covers 27484 Chinese characters! If we use only GB2312, we cannot even spell our former prime minister's name (Rong-Ji Zhu), and we cannot print the full text of most classical Chinese novels.
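The coverage gap can be checked with a few lines. This sketch uses Python's built-in gb2312 and gb18030 codecs rather than Courier's or libiconv's tables, and it assumes the name referred to is written 朱镕基; can_encode is a hypothetical helper.

```python
def can_encode(text, charset):
    """Return True if every character of text is representable in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


name = "朱镕基"  # assumed spelling of the name mentioned above

# Characters of the name that the GB2312 table cannot represent.
missing_in_gb2312 = [ch for ch in name if not can_encode(ch, "gb2312")]

# GB18030 maps every Unicode code point, so the whole name is encodable.
whole_name_in_gb18030 = can_encode(name, "gb18030")
```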

Apart from the metadata, it is wiser to make use of the actual conversion tables provided by other, specialized libraries.

If you agree with me, others and I will help you with the East Asian languages. Western-language encodings (ISO 8859-X, KOI-8, IBM/Microsoft) are much simpler than CJK and easy to handle.

------------------------------------------------------------------------
From Beijing, China

Sam Varshavchik wrote:

Ysbeer writes:

Out of curiosity, have you ever considered using ICU for handling your Unicode requirements?

I am not familiar with ICU's capabilities. The requirements are that, for a given character set, I must know:

1) Whether the character set's lower 128 bytes consist of US-ASCII

2) Whether the character set is a direct mapping of Unicode (UTF-8, UTF-7, et al.)

3) Whether the character set uses multibyte characters

4) Whether the character set uses composite mapping with shift-in/shift-out escape codes

5) Whether unrepresentable Unicode characters may be ignored when converting Unicode to/from the character set

6) Whether quoted-printable or base64 is best for encoding the character set in the message's headers or body.
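The six requirements could be kept as one small record per character set, separate from whatever library does the byte conversion. The sketch below is only an illustration: the class, field names, and the three sample entries are hypothetical, and the values for items 5 and 6 in particular are placeholders rather than Courier's actual tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharsetInfo:
    name: str
    ascii_compatible: bool        # 1) lower 128 bytes are US-ASCII
    unicode_transform: bool       # 2) direct mapping of Unicode
    multibyte: bool               # 3) uses multibyte characters
    uses_shift_sequences: bool    # 4) shift-in/shift-out escape codes
    may_drop_unmappable: bool     # 5) unrepresentable characters may be ignored
    preferred_encoding: str       # 6) "quoted-printable" or "base64"


# Illustrative entries only (placeholder values for items 5 and 6).
CHARSETS = {
    info.name: info
    for info in (
        CharsetInfo("iso-8859-1", True, False, False, False, False, "quoted-printable"),
        CharsetInfo("utf-8", True, True, True, False, False, "base64"),
        # Escape sequences change what 7-bit bytes mean, so item 1 is False here.
        CharsetInfo("iso-2022-jp", False, False, True, True, False, "base64"),
    )
}


def header_encoding_for(charset):
    """Pick the header encoding from the metadata record alone."""
    info = CHARSETS[charset.lower()]
    if info.uses_shift_sequences:
        return "base64"           # forced by item 4, regardless of preference
    return info.preferred_encoding
```

With a table like this, the conversion itself can be delegated to another library while the mail-specific decisions stay in one place.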