Subject:Hyphenation status
From:Damon Chaplin (dam@kendo.fsnet.co.uk)
Date:Nov 24, 2002 5:07:32 pm
List:org.gnome.gtk-i18n-list

Hi,

I've been working on code to do hyphenation, hopefully to add to Pango. My new code is faster than libhnj and groff and uses less memory.

Here's a rough comparison, using the US hyphenation patterns, and on an 850MHz P3:

                Speed in Words/Sec      Memory Use
---------------------------------------------------------
groff                310000                140K
libhnj               360000                200K
my code              630000                 43K

The TeX code may be a wee bit more efficient, but it is complicated and I'm not sure about the license. (We may also have problems with the various licenses in the hyphenation patterns files at some point.)

My code is almost ready for Unicode as well. The main remaining issue is normalization. I need to:

a) Normalize the words and the hyphenation patterns so that matching works correctly (i.e. different forms still match), and

b) Convert the resulting hyphenation pattern back to the positions of the original characters, so we insert hyphens in the right place.

g_utf8_normalize() is a problem because it is very slow, and since it only returns a new string it gives me no way to do (b).

So I'm thinking of writing an optimized normalization function just for the code ranges that use hyphenation. (We can just ignore other characters as they won't make any difference.)
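To make requirements (a) and (b) concrete, here is a minimal sketch of a normalization pass that also records where each output character came from. It is not the code described in this post. The names compose_with_map and compose_pair and the four-entry table are hypothetical, and it only composes a base letter with one following combining mark, which is far short of full Unicode normalization (no canonical reordering, no multiple marks).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical composition table: (base, combining mark) -> precomposed.
 * A real table would be generated from the Unicode data files. */
typedef struct { uint32_t base, mark, composed; } Composition;

static const Composition compositions[] = {
    { 0x0061, 0x0308, 0x00E4 },  /* a + combining diaeresis -> U+00E4 */
    { 0x0065, 0x0301, 0x00E9 },  /* e + combining acute     -> U+00E9 */
    { 0x006F, 0x0308, 0x00F6 },  /* o + combining diaeresis -> U+00F6 */
    { 0x0075, 0x0308, 0x00FC },  /* u + combining diaeresis -> U+00FC */
};

/* Return the precomposed form of base+mark, or 0 if the table has none. */
static uint32_t compose_pair(uint32_t base, uint32_t mark)
{
    size_t i;
    for (i = 0; i < sizeof compositions / sizeof compositions[0]; i++)
        if (compositions[i].base == base && compositions[i].mark == mark)
            return compositions[i].composed;
    return 0;
}

/* Compose `in` (n code points) into `out`, storing in map[k] the index
 * in `in` of the first code point that produced out[k].  Both `out` and
 * `map` must have room for n entries.  Returns the output length. */
size_t compose_with_map(const uint32_t *in, size_t n,
                        uint32_t *out, size_t *map)
{
    size_t i = 0, k = 0;

    assert(in != NULL && out != NULL && map != NULL);
    while (i < n) {
        uint32_t c = (i + 1 < n) ? compose_pair(in[i], in[i + 1]) : 0;

        map[k] = i;
        if (c != 0) {
            out[k++] = c;       /* two input code points became one */
            i += 2;
        } else {
            out[k++] = in[i];   /* copied through unchanged */
            i += 1;
        }
    }
    return k;
}
```

With a map like this, a hyphenation point found before out[k] in the normalized word corresponds to a point before in[map[k]] in the original word, which is the step that g_utf8_normalize() does not support.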

I think hyphenation is used for Latin, Greek and Cyrillic characters. Are there any others?
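As a rough illustration of restricting the work to those scripts, the check below tests a code point against the Unicode blocks for Latin, Greek and Cyrillic, plus the combining diacritical marks. The function name in_hyphenation_range and the choice of blocks are assumptions for the sketch; whether the list is complete is exactly the question asked above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t first, last; } Range;

/* Assumed set of blocks; a real implementation may need more. */
static const Range hyphenation_ranges[] = {
    { 0x0041, 0x005A },  /* Basic Latin uppercase letters */
    { 0x0061, 0x007A },  /* Basic Latin lowercase letters */
    { 0x00C0, 0x024F },  /* Latin-1 letters, Latin Extended-A and -B
                            (also covers U+00D7 and U+00F7, which are
                            signs rather than letters) */
    { 0x0300, 0x036F },  /* Combining Diacritical Marks */
    { 0x0370, 0x03FF },  /* Greek and Coptic */
    { 0x0400, 0x052F },  /* Cyrillic and Cyrillic Supplement */
    { 0x1E00, 0x1EFF },  /* Latin Extended Additional */
    { 0x1F00, 0x1FFF },  /* Greek Extended */
};

/* Return 1 if code point c falls in one of the ranges above, else 0. */
int in_hyphenation_range(uint32_t c)
{
    size_t i;

    assert(c <= 0x10FFFF);  /* must be a valid Unicode code point */
    for (i = 0; i < sizeof hyphenation_ranges / sizeof hyphenation_ranges[0]; i++)
        if (c >= hyphenation_ranges[i].first && c <= hyphenation_ranges[i].last)
            return 1;
    return 0;
}
```

Characters for which this returns 0 could be copied through untouched, so the normalization cost is only paid for text that can actually be hyphenated.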

Does anyone have better ideas for handling normalization?

Damon