Subject: Re: Hyphenation status
From: Owen Taylor (otay...@redhat.com)
Date: Nov 25, 2002 6:58:42 am
Damon Chaplin <dam...@kendo.fsnet.co.uk> writes:
> My code is almost ready for Unicode as well. The main remaining issue is normalization. I need to:
>
> a) Normalize the words and the hyphenation patterns so that matching works correctly (i.e. different forms still match), and
> b) Convert the resulting hyphenation pattern back to the positions of the original characters, so we insert hyphens in the right place.
>
> g_utf8_normalize() is a problem because it is very slow and I have no way to do (b).
>
> So I'm thinking of writing an optimized normalization function just for the code ranges that use hyphenation. (We can just ignore other characters as they won't make any difference.)
>
> I think hyphenation is used for Latin, Greek and Cyrillic characters. Are there any others?
Hebrew is hyphenated at least sometimes (InDesign apparently can do it.)
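For (b), one approach is to normalize character by character and record, for every character in the normalized output, the index of the original character it came from; hyphen positions found against the normalized text can then be mapped straight back. A rough Python sketch of the idea (illustrative only, not the glib code; it ignores combining-mark reordering across character boundaries, which a full implementation would have to handle):

```python
import unicodedata

def normalize_with_map(word):
    """Return (normalized string, index map), where index_map[j] is the
    index of the original character that produced normalized char j."""
    norm_chars = []
    index_map = []
    for i, ch in enumerate(word):
        # Decompose one character at a time so we know its origin.
        for nch in unicodedata.normalize("NFD", ch):
            norm_chars.append(nch)
            index_map.append(i)
    return "".join(norm_chars), index_map
```

For example, NFD expands the single character "ö" into "o" plus a combining diaeresis, so both output characters map back to the same original position, and a hyphen found after them lands in the right place in the original word.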
I don't see how an "optimized normalization function" is going to be significantly faster... maybe you save a few percent from smaller tables, but you aren't going to get 10x as fast or anything.
And quite a chunk of the Unicode normalization stuff _is_ for Latin/Greek. (Few languages are going to give you more normalization opportunities than Greek.)
IMO, all you are going to end up with is "my function that sort of does Unicode normalization, but not quite".
If you have ideas about how to write a fast normalization function, they should be applied to g_utf8_normalize().
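One generic speedup of that kind is a cheap pre-check: ASCII code points are unchanged by every Unicode normalization form, so a pure-ASCII string can be returned untouched without consulting any tables. A minimal Python sketch of the idea (the real g_utf8_normalize() is C; this is just the shape of the fast path):

```python
import unicodedata

def normalize_fast(s, form="NFD"):
    # ASCII is invariant under NFC/NFD/NFKC/NFKD, so a pure-ASCII
    # string needs no normalization work at all.
    if all(ord(c) < 0x80 for c in s):
        return s
    return unicodedata.normalize(form, s)
```

Since most words in typical Latin-script text are pure ASCII, this skips the expensive path in the common case without sacrificing correctness for the rest.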
(You can probably avoid the double pass through the string and the full-size intermediate wide-character buffer if you are willing to reallocate/copy the output buffer in some cases; the search function in find_decomposition can probably be sped up a bit.)
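The lookup in find_decomposition amounts to searching a table of (code point, decomposition) pairs sorted by code point, which a binary search handles in O(log n). A sketch under that assumption (the table entries here are illustrative; glib's real table is generated from the Unicode data files):

```python
from bisect import bisect_left

# Illustrative entries, sorted by code point.
DECOMP_TABLE = [
    (0x00C0, "A\u0300"),  # A-grave -> A + combining grave
    (0x00E9, "e\u0301"),  # e-acute -> e + combining acute
    (0x00F6, "o\u0308"),  # o-diaeresis -> o + combining diaeresis
]
CODEPOINTS = [cp for cp, _ in DECOMP_TABLE]

def find_decomposition_sketch(ch):
    """Return the decomposition string for code point ch, or None."""
    i = bisect_left(CODEPOINTS, ch)
    if i < len(CODEPOINTS) and CODEPOINTS[i] == ch:
        return DECOMP_TABLE[i][1]
    return None
```

Since most characters have no decomposition, making the miss case fast (a tight binary search, or a first-stage range check that rejects code points outside the table's bounds) matters as much as the hit case.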
If you need extended interfaces, we should plan on getting them into glib eventually. (The same need for reverse mappings comes up in adding normalization into the Pango shapers.)