Historical context of written Luxembourgish
As Luxembourgish was for a long time mostly a spoken language, its written form strives for phonological accuracy. One striking phonological phenomenon that is also reflected in written Luxembourgish is the deletion of the trailing “n” or “nn” in some contexts. This is called the “Eifeler Regel”.
Example: Ech hunn e Brudder. (“I have a brother”)
But: Ech hu keng Schwëster. (“I have no sister” as in “I don’t have a sister”)
In the second sentence the ending is dropped because it is not pronounced.
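The contextual part of the rule can be sketched in a few lines of Java. This is only an illustration based on the commonly stated form of the Eifeler Regel (the final n is kept before vowels and before n, d, t, z and h, as well as at the end of a sentence); the class and method names are hypothetical, and the sketch assumes the word is one to which the rule applies at all.

```java
class EifelerContextSketch {

    // Initial letters before which the final n is kept: vowels plus n, d, t, z, h.
    // The escapes are a-umlaut, e-diaeresis and e-acute. Simplified on purpose.
    private static final String KEEP_BEFORE = "aeiou\u00e4\u00eb\u00e9ndtzh";

    // True if a final n would be kept in front of nextWord.
    // A missing next word (end of sentence) also keeps the n.
    static boolean keepsFinalN(String nextWord) {
        if (nextWord == null || nextWord.isEmpty()) {
            return true;
        }
        char first = Character.toLowerCase(nextWord.charAt(0));
        return KEEP_BEFORE.indexOf(first) >= 0;
    }

    // Drops a trailing "nn" or "n" from word when the context calls for it.
    // Assumption: the caller already knows that the rule applies to this word.
    static String applyRule(String word, String nextWord) {
        if (keepsFinalN(nextWord)) {
            return word;
        }
        if (word.endsWith("nn")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.endsWith("n")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(applyRule("hunn", "e"));    // next word starts with a vowel
        System.out.println(applyRule("hunn", "keng")); // next word starts with k
    }
}
```

With the two example sentences above, the first call leaves “hunn” unchanged and the second yields “hu”. Deciding which words the rule applies to in the first place is the hard part, and this sketch does not attempt it.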
The big problem, obviously, is that applying the rule correctly requires a well-founded knowledge of spoken Luxembourgish. In reality, however, even native speakers still have enormous difficulty applying it properly.
First steps
I started developing a piece of software that applies the rule automatically in 2006 (see here). It proved to be very difficult, as Luxembourgish “borrows” words from other languages, which makes it impossible to specify a straightforward rule that always applies. My first implementation was done in Java (in the meantime I’ve also successfully ported it to PHP5).
The Cortina project and the Eifeler Regel
I was, however, not the first one to try to implement the “Eifeler Regel”. In 1998, the Cortina project (founded by the government to develop a proofreader for Luxembourgish) started developing a spelling checker that implemented the Eifeler Regel as an orthographic rule. The project was cancelled in 2002, leaving the software in a rather unusable state. According to information I got from one of the original developers, they annotated every word in the spell checking dictionary with information on whether or not it could drop the trailing n.
Example:
hunn | T
Prinzessin | F
T (true) means it can drop the final n, F (false) means it can’t. I don’t remember the exact notation they used, but this is the basic idea.
I found this approach rather inflexible, for several reasons. First of all, manually annotating a list of several thousand words with phonological information inevitably leads to a certain error rate. Secondly, implementing the Eifeler Regel as an orthographic rule (which it is not) limits the quality of the correction suggestions (in my own experience). Thirdly, whenever new words are added (e.g. a user adds words to their personal dictionary), those words have to be manually annotated with the correct information about whether or not the Eifeler Regel can be applied, so this implementation cannot handle unknown words.
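To make the third limitation concrete, here is a small Java sketch of an annotated dictionary. It is not Cortina’s actual format or code (the original notation is not known); the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class AnnotatedDictionarySketch {

    // word -> true if the final n may be dropped, false if it may not
    private final Map<String, Boolean> canDropN = new HashMap<>();

    // Every word has to be annotated by hand.
    void annotate(String word, boolean droppable) {
        canDropN.put(word, droppable);
    }

    // Empty result: the word was never annotated, so no decision is possible.
    Optional<Boolean> lookup(String word) {
        return Optional.ofNullable(canDropN.get(word));
    }

    public static void main(String[] args) {
        AnnotatedDictionarySketch dictionary = new AnnotatedDictionarySketch();
        dictionary.annotate("hunn", true);
        dictionary.annotate("Prinzessin", false);
        System.out.println(dictionary.lookup("hunn"));
        System.out.println(dictionary.lookup("wann")); // not annotated
    }
}
```

A word that is only in a user dictionary comes back as an empty result, which is exactly the case this design cannot handle without more manual work.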
My approach for implementing the Eifeler Regel
The approach I took is radically different. I wrote a script that takes two lists of words as input (one list for which the rule applies and one for which it does not) and returns a regular expression (in the case of the Eifeler Regel it is several hundred characters long). Now, if a word does not exist in either of the lists, the probability that the regular expression detects the correct application of the Eifeler Regel is still above 99.99% (tested with real-life samples). Whenever a false positive or a false negative has been detected, that word is added to the second or the first list, respectively. Then the regular expression is regenerated, thereby increasing the probability of a correct rule detection.
This is of course a heuristic approach. The obvious disadvantage is that false positives are not excluded by design, but the probability that they occur is still low enough to provide reliable results.
On the other hand, having a strict separation between orthographic correction and the Eifeler Regel leads to a higher quality in the orthographic correction. Finally, because of the flexible approach, the Eifeler Regel can also be applied to words that are not yet part of the official spell checking dictionary, but exist only in a user dictionary.
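The generator script itself is not shown here, so the following Java sketch is only one possible way to derive a pattern from two word lists, not the actual algorithm. It assumes that word endings are what distinguishes the two lists: for each word in the first list it looks for the shortest ending that no word in the second list shares, and joins those endings into one regular expression. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

class SuffixRegexSketch {

    // Shortest ending of word that no negative word shares;
    // falls back to the whole word if every shorter ending collides.
    static String distinguishingSuffix(String word, List<String> negatives) {
        for (int len = 1; len < word.length(); len++) {
            String suffix = word.substring(word.length() - len);
            boolean collides = false;
            for (String negative : negatives) {
                if (negative.toLowerCase(Locale.ROOT).endsWith(suffix)) {
                    collides = true;
                    break;
                }
            }
            if (!collides) {
                return suffix;
            }
        }
        return word;
    }

    // positives: words the rule applies to; negatives: words it does not apply to.
    static Pattern buildPattern(List<String> positives, List<String> negatives) {
        Set<String> suffixes = new LinkedHashSet<>();
        for (String positive : positives) {
            suffixes.add(distinguishingSuffix(positive.toLowerCase(Locale.ROOT), negatives));
        }
        if (suffixes.isEmpty()) {
            return Pattern.compile("(?!)"); // a pattern that never matches
        }
        List<String> quoted = new ArrayList<>();
        for (String suffix : suffixes) {
            quoted.add(Pattern.quote(suffix));
        }
        return Pattern.compile(".*(?:" + String.join("|", quoted) + ")");
    }

    static boolean ruleApplies(Pattern pattern, String word) {
        return pattern.matcher(word.toLowerCase(Locale.ROOT)).matches();
    }

    public static void main(String[] args) {
        List<String> positives = new ArrayList<>();
        positives.add("hunn");
        List<String> negatives = new ArrayList<>();
        negatives.add("Prinzessin");
        Pattern pattern = buildPattern(positives, negatives);
        System.out.println(pattern.pattern());
        System.out.println(ruleApplies(pattern, "wann")); // a word in neither list
    }
}
```

With only “hunn” and “Prinzessin” as input, the generated pattern accepts words ending in “nn”, so it also covers a word such as “wann” that is in neither list. A misclassified word would be appended to the appropriate list and `buildPattern` called again, which mirrors the feedback loop described above.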
Integration into OpenOffice.org
My first version for OpenOffice.org was rather badly integrated into the workflow, given that the correction dialog had to be opened in a separate window (Screenshot). Fortunately, OpenOffice.org 3.0.1, which was released a few days ago, features a new grammar checking API. I took the opportunity to implement a new version that can now underline problems with the Eifeler Regel during typing.
Mathias Bauer (Project Lead OpenOffice.org Writer) has this to say about the new API:
We wanted to make the API as simple as possible, especially we wanted to bring it into the working space of the target developers where strings, sentences and plain, simple paragraphs are used and not the very complex and hierarchical text structures and API of OOo (that of course are necessary for many other things).
And I have to say they did quite a good job. Using the OpenOffice.org API plug-in for NetBeans, I was able to quickly create a component that implements the com.sun.star.linguistic2.XProofreader interface, thus providing a basic proofreading service. Within less than three days I had a running implementation (as I said before, only the OOo API integration is new, the rest of the code comes from a previous implementation).
The only thing that needed fiddling was the “Linguistic.xcu” file, an XML file that basically tells OpenOffice.org which proofreaders are available at which location. However, I was able to solve the problems I ran into with the help of the participants of the lingucomponent mailing list (thanks!).
Technical overview
Whenever the proofreader needs to check the text (e.g. automatically while writing text or manually when opening the proofreading window), the doProofreading method of the XProofreader interface is called with a String containing the text of a single paragraph. I then take that String, split it into sentences and then split the sentences into words (I call them “tokens”). In the future I also plan to tag those tokens with grammatical information in order to implement grammatical rules as well.
Splitting a paragraph into sentences is not as straightforward as it sounds because not every dot (“.”) marks the end of a sentence. Just take this example: “Meng Internetadress ass www.beispill.com”. That’s just one sentence, even though it contains two dots. The same applies to dates (e.g. “31. Januar”) or abbreviations (e.g. “Prof. Dr. XXX”).
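The three cases above can be handled with a fairly small heuristic. The following Java sketch is not the splitter used in the extension; it assumes that a dot only ends a sentence when it is followed by whitespace (or the end of the paragraph) and when the word before it is neither a number nor a known abbreviation. The abbreviation list and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SentenceSplitterSketch {

    // Hypothetical, deliberately tiny abbreviation list.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Prof", "Dr"));

    static List<String> splitSentences(String paragraph) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < paragraph.length(); i++) {
            char c = paragraph.charAt(i);
            if (c != '.' && c != '!' && c != '?') {
                continue;
            }
            boolean atEnd = i == paragraph.length() - 1;
            if (!atEnd && !Character.isWhitespace(paragraph.charAt(i + 1))) {
                continue; // dot inside a token, e.g. www.beispill.com
            }
            if (c == '.' && !atEnd && isAbbreviationOrNumber(wordBefore(paragraph, start, i))) {
                continue; // e.g. "31. Januar" or "Prof. Dr."
            }
            sentences.add(paragraph.substring(start, i + 1).trim());
            start = i + 1;
        }
        String rest = paragraph.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest); // text without a closing punctuation mark
        }
        return sentences;
    }

    // The run of non-whitespace characters that ends right before position end.
    static String wordBefore(String text, int from, int end) {
        int begin = end;
        while (begin > from && !Character.isWhitespace(text.charAt(begin - 1))) {
            begin--;
        }
        return text.substring(begin, end);
    }

    static boolean isAbbreviationOrNumber(String word) {
        return ABBREVIATIONS.contains(word)
                || (!word.isEmpty() && word.chars().allMatch(Character::isDigit));
    }

    public static void main(String[] args) {
        String paragraph = "Meng Internetadress ass www.beispill.com. Ech hunn e Brudder.";
        for (String sentence : splitSentences(paragraph)) {
            System.out.println(sentence);
        }
    }
}
```

The heuristic has an obvious blind spot: a sentence that really does end in a number or an abbreviation is not split. A production splitter would need more context, for instance the capitalisation of the following word.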
Being able to work directly with Strings enormously simplifies the integration of proofreading tools. For my first OpenOffice.org extension I still had to iterate over the complex internal data structure of the Writer application. Goodbye to com.sun.star.text.XText, com.sun.star.container.XEnumeration and all those horrible constructs.
After splitting a sentence into tokens, I iterate over a list of rules and apply them to the sentence. Because I want to keep my implementation generic (in case I want to integrate it into another application), mistakes are handled in an internal data structure. Finally, the mistake list is converted to the data structure of OpenOffice.org, which handles displaying mistakes and correction suggestions.
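The rule loop and the application-independent mistake list could look roughly like the following Java sketch. None of these types come from the extension or from the OpenOffice.org API; `Token`, `Mistake`, `Rule` and the toy rule are hypothetical, and the final conversion to OpenOffice.org’s own structures is left out so that the example runs without the office libraries.

```java
import java.util.ArrayList;
import java.util.List;

class ProofreadingPipelineSketch {

    static final class Token {
        final String text;
        final int start; // offset of the token inside the sentence
        Token(String text, int start) {
            this.text = text;
            this.start = start;
        }
    }

    // Application-independent description of a problem and its fix.
    static final class Mistake {
        final int start;
        final int length;
        final String suggestion;
        Mistake(int start, int length, String suggestion) {
            this.start = start;
            this.length = length;
            this.suggestion = suggestion;
        }
    }

    interface Rule {
        List<Mistake> check(List<Token> tokens);
    }

    // Toy rule, hard-coded to the example sentence: "hunn" directly before "keng".
    static final Rule TOY_RULE = tokens -> {
        List<Mistake> found = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            Token current = tokens.get(i);
            if (current.text.equals("hunn") && tokens.get(i + 1).text.equals("keng")) {
                found.add(new Mistake(current.start, current.text.length(), "hu"));
            }
        }
        return found;
    };

    // Splits a sentence into runs of letters and digits, remembering offsets.
    static List<Token> tokenize(String sentence) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < sentence.length()) {
            if (!Character.isLetterOrDigit(sentence.charAt(i))) {
                i++;
                continue;
            }
            int begin = i;
            while (i < sentence.length() && Character.isLetterOrDigit(sentence.charAt(i))) {
                i++;
            }
            tokens.add(new Token(sentence.substring(begin, i), begin));
        }
        return tokens;
    }

    // Applies every rule to the tokens of one sentence and collects the mistakes.
    static List<Mistake> proofread(String sentence, List<Rule> rules) {
        List<Token> tokens = tokenize(sentence);
        List<Mistake> mistakes = new ArrayList<>();
        for (Rule rule : rules) {
            mistakes.addAll(rule.check(tokens));
        }
        return mistakes;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        rules.add(TOY_RULE);
        for (Mistake m : proofread("Ech hunn keng Zeit.", rules)) {
            System.out.println(m.start + "+" + m.length + " -> " + m.suggestion);
        }
    }
}
```

Keeping `Mistake` free of any office-specific types is what makes the rule code reusable: only a thin adapter has to know how the host application represents errors and suggestions.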
If I get the time, I will provide some more technical information over the coming weeks. Here’s a screenshot:
All in all, I am very happy with the final result. The binary can be downloaded from Spellchecker.lu. The release announcement can be found here (both pages are in Luxembourgish). I opted not to release the source code for now.


Hello,
Thank you for the great job you did.
As I frequently write in Luxembourgish, the spellchecker.lu extension is very useful.
But although Java 6 is installed on my Mac, every attempt to install the Eifeler Regel implementation fails.
Nevertheless, congrats on the job you did.
Phlëpp