Subject: Hyphenation status
From: Damon Chaplin (dam@kendo.fsnet.co.uk)
Date: 11/24/2002 05:07:32 PM
List: org.gnome.gtk-i18n-list

Hi,

I've been working on code to do hyphenation, hopefully to add to Pango. My new code is faster than libhnj and groff and uses less memory.

Here's a rough comparison, using the US hyphenation patterns on an 850MHz P3:

            Speed in Words/Sec    Memory Use
---------------------------------------------------------
groff            310000              140K
libhnj           360000              200K
my code          630000               43K

The TeX code may be a wee bit more efficient, but it is complicated and I'm not sure about the license. (We may also have problems with the various licenses in the hyphenation pattern files at some point.)

My code is almost ready for Unicode as well. The main remaining issue is normalization. I need to:

a) Normalize the words and the hyphenation patterns so that matching works correctly (i.e. different forms still match), and

b) Convert the resulting hyphenation pattern back to the positions of the original characters, so we insert hyphens in the right place.

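g_utf8_normalize() is a problem because it is very slow and it gives me no way to do (b): it returns only the normalized string, with nothing that maps back to the original character positions.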
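To make (a) and (b) concrete, here is a rough sketch in Python (chosen only for brevity; it is not the code described in this message, and the names decompose_with_map and map_breaks_back are made up for illustration). It assumes canonical decomposition (NFD) is the form used for matching. Each character is decomposed on its own while remembering which original character it came from, so a break position found in the normalized word can be converted back.

```python
import unicodedata


def decompose_with_map(word):
    """Return (normalized, origin).

    normalized is the NFD form of word; origin[i] is the index in word
    of the character that produced normalized[i].
    """
    # Decompose one character at a time so the source index is known.
    pairs = []
    for i, ch in enumerate(word):
        for d in unicodedata.normalize("NFD", ch):
            pairs.append((d, i))

    # Canonical ordering: stable-sort each run of combining marks by
    # combining class, carrying the source index along with each mark.
    out, run = [], []
    for d, i in pairs:
        if unicodedata.combining(d) == 0:
            run.sort(key=lambda p: unicodedata.combining(p[0]))
            out.extend(run)
            run = []
            out.append((d, i))
        else:
            run.append((d, i))
    run.sort(key=lambda p: unicodedata.combining(p[0]))
    out.extend(run)

    normalized = "".join(d for d, _ in out)
    origin = [i for _, i in out]
    return normalized, origin


def map_breaks_back(origin, breaks):
    """Convert break positions (index of the character that follows the
    hyphen in the normalized word) to positions in the original word."""
    return [origin[k] for k in breaks]


# Precomposed and decomposed spellings normalize to the same string,
# but a break before "s" maps to a different original index in each.
pre, pre_origin = decompose_with_map("r\u00e9sum\u00e9")
dec, dec_origin = decompose_with_map("re\u0301sume\u0301")
print(pre == dec)                        # True
print(map_breaks_back(pre_origin, [3]))  # [2]
print(map_breaks_back(dec_origin, [3]))  # [3]
```

The mapping costs one integer per normalized code point, which is the price of being able to put the hyphen back in the right place in the caller's text.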

So I'm thinking of writing an optimized normalization function just for the code ranges that use hyphenation. (We can just ignore other characters as they won't make any difference.)

I think hyphenation is used for Latin, Greek and Cyrillic characters. Are there any others?
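As a rough illustration of the restricted-range idea, the Python sketch below builds a small decomposition table only for a few Latin, Greek and Cyrillic blocks and passes every other character through untouched. The block ranges and the names HYPHENATION_RANGES, build_decomposition_table and decompose_fast are assumptions made for the example, not a settled list. It also skips canonical reordering of combining marks, which a real version would still need for input that arrives already decomposed.

```python
import unicodedata

# Assumed inclusive code point ranges for scripts that hyphenate;
# illustrative only, not a complete or agreed list.
HYPHENATION_RANGES = [
    (0x00C0, 0x024F),  # Latin-1 letters, Latin Extended-A and -B
    (0x0370, 0x03FF),  # Greek and Coptic
    (0x0400, 0x04FF),  # Cyrillic
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x1F00, 0x1FFF),  # Greek Extended
]


def build_decomposition_table(ranges):
    """Map each character in the ranges that has a canonical
    decomposition to its decomposed form."""
    table = {}
    for lo, hi in ranges:
        for cp in range(lo, hi + 1):
            ch = chr(cp)
            nfd = unicodedata.normalize("NFD", ch)
            if nfd != ch:
                table[ch] = nfd
    return table


DECOMP = build_decomposition_table(HYPHENATION_RANGES)


def decompose_fast(word):
    """Decompose only characters found in the table; anything else
    (ASCII, Hangul, CJK, ...) is copied through unchanged."""
    return "".join(DECOMP.get(ch, ch) for ch in word)


# A Cyrillic letter in the table is decomposed; a Hangul syllable,
# which full NFD would split into jamo, is left alone.
print(decompose_fast("\u0439") == "\u0438\u0306")  # True
print(decompose_fast("\ud55c") == "\ud55c")        # True
```

The table is built once up front, so the per-word work is a single lookup per character.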

Anyone else have better ideas to handle normalization?

Damon