Subject: Re: [Grammar checking] Using LanguageTool lexicons with Lightproof now possible
From: Olivier R.
Date: Dec 4, 2012 11:03:12 am

My connection dropped while I was posting. Here is the full post:

Hello everyone,

## Build indexable binary grammatically tagged dictionaries for Lightproof/Grammalecte ##

The most important limitation in building a grammar checker with Lightproof was the lack of grammatically tagged dictionaries. Most Hunspell dictionaries, which Lightproof can query via the LibreOffice UNO API, are not grammatically tagged and are of no help for retrieving morphological information about words.

LanguageTool does not have this problem, since it uses indexable binary dictionaries built from huge grammatically tagged lexicons with finite-state automaton (FSA) software written in C. Java has a dedicated library to read these binary files.

But we had nothing like this in Python. So I tried to understand how this C FSA software works, but as I am not a C expert, and as I was reluctant to depend on yet another piece of software, I finally decided to write my own FSA tool to build such indexable binary dictionaries.

Why build such dictionaries, you may ask? Because lexicons which contain words, lemmas and morphological tags are HUGE, up to several megabytes; they are not indexable as is, and making them indexable consumes much more memory. So the goal is to make them small, compressed, quick to load and parse, low on memory consumption, indexable, and readable without having to decompress them.

That’s what I did with Python 3.3.
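To give an idea of the core trick, here is a minimal Python sketch of the classic incremental construction of a minimal acyclic automaton (DAWG) from a sorted word list. This is a simplified illustration, not the code of my tool, which also stores stems and tags and serializes the automaton to a binary, indexable form:

    class Node:
        __slots__ = ("edges", "final")
        def __init__(self):
            self.edges = {}    # char -> Node
            self.final = False
        def key(self):
            # A node's identity for minimization: its finality and its
            # outgoing edges (children are already canonical at this point).
            return (self.final,
                    tuple(sorted((c, id(n)) for c, n in self.edges.items())))

    class Dawg:
        def __init__(self):
            self.root = Node()
            self.minimized = {}   # key -> canonical node
            self.unchecked = []   # (parent, char, child) not yet minimized
            self.previous = ""

        def insert(self, word):
            assert word >= self.previous, "words must be inserted in sorted order"
            # Length of the common prefix with the previously inserted word.
            common = 0
            for a, b in zip(word, self.previous):
                if a != b:
                    break
                common += 1
            self._minimize(common)
            node = self.unchecked[-1][2] if self.unchecked else self.root
            for char in word[common:]:
                nxt = Node()
                node.edges[char] = nxt
                self.unchecked.append((node, char, nxt))
                node = nxt
            node.final = True
            self.previous = word

        def finish(self):
            self._minimize(0)

        def _minimize(self, down_to):
            # Replace suffix nodes by equivalent nodes already seen,
            # deepest first, so that common suffixes are shared.
            while len(self.unchecked) > down_to:
                parent, char, child = self.unchecked.pop()
                k = child.key()
                if k in self.minimized:
                    parent.edges[char] = self.minimized[k]
                else:
                    self.minimized[k] = child

        def lookup(self, word):
            node = self.root
            for char in word:
                node = node.edges.get(char)
                if node is None:
                    return False
            return node.final

    d = Dawg()
    for w in sorted(["mangea", "mangeai", "mangeais", "mangeons", "manger"]):
        d.insert(w)
    d.finish()
    print(d.lookup("mangeais"), d.lookup("mange"))  # True False

Prefix and suffix sharing is what makes the automaton so much smaller than the raw word list; the remaining work (and most of the file size) is in attaching the lemma and tag data to each path and laying the nodes out compactly on disk.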

I took all the lexicons from LanguageTool and compressed them into indexable binary dictionaries readable with my own script. The resulting dictionaries are not as small as the ones made with the C FSA tool used by LT, but they are close enough, and there is still room for improvement. I'll work on this later.

Here are the results:

These dictionaries are about 5-30 % bigger than the LT ones (and sometimes, surprisingly, half the size), but in any case they are perfectly usable as is.

Consequences:
- it will be possible to use all existing LT lexicons with Lightproof;
- we will be able to make a stand-alone version of Lightproof/Grammalecte, as it won't be necessary to use Hunspell anymore;
- we will be able to write automated tests and prevent regressions when writing/modifying rules.

# Lexicons

Lexicons are simple text files listing all flexions (inflected forms), their stem and their morphological tags:
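For instance (entries and tags made up here for illustration; each LT lexicon uses its own tag set):

    mangeais    manger    V ind imp 1 s
    mangeons    manger    V ind pre 1 p
    pommes      pomme     N f p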

Each field is separated by a tab.

With the new tool, lexicons MUST be UTF-8 encoded to be properly converted.
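Reading such a file in Python 3 is then straightforward, as long as the encoding is declared. A quick sketch (the file name is made up):

    # Illustrative only: parse a tab-separated, UTF-8 encoded lexicon.
    with open("french.lexicon.txt", "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            flexion, lemma, tags = line.split("\t")
            # ... hand (flexion, lemma, tags) over to the dictionary builder ...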

# Want to test it?

The code is written in Python 3.3. License: MPL 2.

Two files:
- the builder reads all the files listed in "_lexicons.list.txt" and builds the binary dictionaries with a specific stemming command;
- the reader reads all files whose name is "[lang].bdic" and, if it finds a test file named "[lang].test.txt", writes the results found for each word to a new file.
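The reader's logic amounts to something like this sketch (the loader below is a stand-in that pretends the ".bdic" file is a plain word list, whereas the real script of course decodes the binary format; the output file name is also invented):

    import glob
    import os.path

    def load_bdic(path):
        # Stand-in for the real binary-dictionary loader (hypothetical):
        # pretend the file is a plain UTF-8 word list, one word per line.
        with open(path, "r", encoding="utf-8") as f:
            return set(line.strip() for line in f if line.strip())

    for bdic_path in glob.glob("*.bdic"):
        lang = os.path.splitext(os.path.basename(bdic_path))[0]
        test_path = lang + ".test.txt"
        if not os.path.isfile(test_path):
            continue
        words = load_bdic(bdic_path)
        out_path = lang + ".results.txt"  # invented output name
        with open(test_path, "r", encoding="utf-8") as fin, \
             open(out_path, "w", encoding="utf-8") as fout:
            for line in fin:
                word = line.strip()
                if word:
                    fout.write("%s\t%s\n" % (word, word in words))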

The builder, with the uncompressed LT lexicons encoded in UTF-8: [130 MB]


And let it run. Warning: building the dictionaries is slow, as the lexicons are huge. For most languages it takes 1 or 2 minutes each, but for German, Polish, Galician, Russian and Czech it takes 5 to 10 minutes each and consumes a huge amount of memory. Czech uses up to 6 GB! You have been warned. :)

The dictionary reader, with the binary dictionaries and test files: [11 MB]


Let it run. Count to 1 (or 2 if you have a slow computer), and it's already finished. :) It will have read all the binary dictionaries, read the test files, and written the results to other files.

I'll try to write a more complete web page about all this when I have the time. I still have to compress the dictionaries better, for those who might think they are not small enough.