Dealing with accented characters

If you’ve come here looking for the HOWTO, jump to the technical bit below.

We ran into a rather frustrating problem when we tried to put a search system on top of our newly imported dictionary database: how do you search for words which may or may not contain diacritics, when you can’t know in advance whether they do?

How we use diacritics

As a result of a decision made 50 years ago, we don’t use diacritics for Latin words unless we think it would be interesting or useful to do so at a given point. So any individual instance of a word may, or may not, include diacritics, and you won’t know until you look at it whether it does or not.

For example, take the word ‘excidere’. We don’t normally add diacritics to it, except in two headwords: we have two entries for words spelled ‘excidere’, and both have diacritics — 1 excĭdere and 2 excīdere.

In print, this isn’t a problem, or even a real inconsistency. Readers looking for excidere turn to p. 831 (excessivus–excipere), and if their Latin is very good, the diacritics help them to choose which homonym is most likely to provide a correct definition. If you don’t know what the diacritics signify, you can safely ignore them; you’d just have to look at both homonyms.

This sort of thing is consistent throughout the dictionary. Wherever we use diacritics, they are merely an aid to the reader — they serve no semantic purpose in our Latin text.

The problem

How are you supposed to know in advance whether the word or words you are searching for contain diacritics? Even assuming that your Latin is good enough to put the diacritics on, and that you’re computer-literate enough to know how to type them: if you don’t have the print dictionary in front of you, you’re stumped.

A user searching for our ‘excidere’ entry is going to get no results and wonder why we missed such an obvious word — they’re hardly likely to guess that the problem is technical rather than editorial.

This is obviously a major problem: how do you find an entry when you don’t know how to spell its headword?

We can’t strip the diacritics from 81,000 entries on the fly, as that would make things painfully slow. We can’t strip the diacritics from the source data because then the online dictionary would look too different from the printed one.

Adding a filter to eXist

The obvious solution is to remove all diacritics from the search index. That way, you can search for ‘excidere’ and get back both entries, which display with their diacritics intact. Other native XML databases, such as BaseX, let you set an option to do just this. In eXist, however, it isn’t so simple. In general we’ve been rather pleased with eXist, but this is a serious shortcoming (for us at least) in the system. The benefit of open-source technologies, of course, is that you can add functionality if you need to, so that’s what we’ve done: added a filter to the indexing system to remove diacritics from both Latin and polytonic Greek.

Given that this feature may be of use to others, I’ve written a step-by-step guide. Any feedback, as always, is more than welcome!

Tom Wrobel (thomas.wrobel@classics.ox.ac.uk)

The technical part

Edit: this does not work with eXist 2.0 or later (see the comments below). At the time the article was written, eXist was at version 1.4.

In order to strip diacritics, we have to provide a custom analyzer class to eXist’s Lucene indexer, and then compile eXist from source using the new analyzer. Sounds easy enough, right?

Thankfully, a lot of the work has already been done, thanks (heartfelt thanks) to Mike Sokolov’s ISOLatin1AccentAnalyzer (http://sourceforge.net/p/exist/patches/173/), which is (I assume) based on Lucene’s own ASCIIFoldingFilter. However, you may want to make your own selection of character replacements.
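For orientation, an accent-stripping analyzer under the Lucene 2.x API (the version eXist 1.4 shipped with) looks roughly like this. This is a minimal sketch rather than the actual code from the patch, and it assumes the companion ISOLatin1AccentFilter class from the zip sits in the same package:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class ISOLatin1AccentAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize, lower-case, then fold accents: the accent filter is
        // simply one more link in the token-filter chain.
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new ISOLatin1AccentFilter(result);
        return result;
    }
}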

For those wanting to see the code used here, please download the zip file at comment #3 on the Lucene Filter that Flattens Diacritics page.

Step 1. Work out which characters you will need to flatten

No one, to date, has written the definitive list of all the characters you might want to change, so you’ll have to find them yourself. Some good lists do exist, but none of the standard ones (such as the one built into Lucene’s ASCIIFoldingFilter) cover polytonic Greek. In addition, you will probably want to customize whatever list you start from.

Note: a better approach would be to normalize the text into NFD form and then strip out the combining characters; that way, we’re much less likely to need to revise the list. I will be looking into doing this at a later stage, but it requires a more fundamental revision of Mike Sokolov’s code, and I want to get the system stable for the time being.
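For the curious, here’s what that alternative looks like in plain Java (a sketch of the idea, not code we use anywhere): normalize to NFD with java.text.Normalizer, then delete everything in the Unicode combining-mark category.

import java.text.Normalizer;

public class NfdStripSketch {
    // Decompose to NFD, then drop all combining marks
    // (accents, breves, macrons, Greek breathings, etc.).
    public static String strip(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("excīdere")); // prints: excidere
        System.out.println(strip("ἀγάπη"));    // prints: αγαπη
    }
}

Conveniently, þ, ȝ, ð, and ƿ have no canonical decompositions, so they would survive this untouched.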

We started with a simple Perl script which we ran across our XML files to print a list of every character we use. This gave us a comprehensive list of the characters that might need to be replaced. If you’d like a look at the script, see the zip file on sourceforge.
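Our script was Perl, but the idea is simple enough in any language. A hypothetical Java equivalent (names invented for illustration) might look like this:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CharInventory {
    public static void main(String[] args) throws Exception {
        // Collect every .xml file under the directory given as the argument
        List<Path> xmlFiles;
        try (Stream<Path> paths = Files.walk(Paths.get(args[0]))) {
            xmlFiles = paths.filter(p -> p.toString().endsWith(".xml"))
                            .collect(Collectors.toList());
        }
        // Record every distinct code point used anywhere in the corpus
        TreeSet<Integer> seen = new TreeSet<>();
        for (Path p : xmlFiles) {
            new String(Files.readAllBytes(p), StandardCharsets.UTF_8)
                    .codePoints().forEach(seen::add);
        }
        // Print each code point with its hex value, one per line
        for (int cp : seen) {
            System.out.printf("U+%04X\t%s%n", cp, new String(Character.toChars(cp)));
        }
    }
}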

Step 2. Get hold of the eXist source code

We deployed eXist as a .war, so we had to download the standard eXist .jar. In addition, you’re probably going to want to test this outside of your production eXist site anyhow, so starting from a clean download in a sandbox is probably a good idea.

Step 3. Set up and test a sandboxed eXist

Go through the normal eXist setup procedures (http://exist-db.org/exist/quickstart.xml) for your newly downloaded version. Make sure everything’s working fine — you can always copy across the conf.xml and data directory from your existing WEB-INF folder.

Step 4. Write your analyzer

Mike Sokolov has already done the hard work on this. But there are a few changes you’ll need to make; otherwise, the code won’t compile — at least, it didn’t on my machine.

  1. In ISOLatin1AccentAnalyzer.java, change all mentions of LngAnalyzer to whatever matches the name of your .java file (in our case the file is called DMLBSAccentAnalyzer.java, and the main class is DMLBSAccentAnalyzer). If you don’t want to change the name of the file, then change LngAnalyzer to ISOLatin1AccentAnalyzer. You’ll need to edit the file in 3 places: the main public class declaration (line 24), and lines 36 and 41.
  2. In ISOLatin1AccentFilter.java, add cases for the characters you want to remove diacritics from – we added another 200 or so cases! (A sketch of the pattern follows this list.) We also made the decision not to change some characters: þ, ȝ, ð, and ƿ (thorn, yogh, eth, wynn). Again, if you want to see our version, have a look at the zip file on sourceforge.
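To give a flavour of what those case additions look like, here is a tiny standalone sketch of the pattern (the real filter in the zip covers a couple of hundred characters and is wired into Lucene’s token stream rather than working on plain strings):

public class AccentFoldSketch {
    public static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\u012B': // ī (i with macron)
                case '\u012D': // ĭ (i with breve)
                    out.append('i');
                    break;
                case '\u1F04': // ἄ (Greek alpha with psili and oxia)
                case '\u03AC': // ά (Greek alpha with tonos)
                    out.append('\u03B1'); // plain α
                    break;
                // þ, ȝ, ð, and ƿ get no case of their own, so they fall
                // through to the default and pass unchanged.
                default:
                    out.append(c);
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("excīdere")); // prints: excidere
    }
}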

Step 5. Add your analyzer to the eXist source tree, and recompile eXist

Instructions on building eXist from source can be found on the eXist website (http://exist-db.org/exist/building.xml).

  1. Put the two .java files into your new sandboxed eXist at $EXIST_HOME/extensions/indexes/lucene/src/org/exist/indexing/lucene
  2. Run ./build.sh clean to flush the precompiled .jar files
  3. Run ./build.sh all to recompile from scratch
  4. (optional) to create a .war, run ./build.sh dist-war

Step 6. Declare your analyzer in the index configuration file

Analyzers are declared in the collection.xconf file (see the Lucene indexing section on the eXist website for more info). Your analyzer needs to be declared and given an id before any index can refer to it.

The class is the package name at the top of your analyzer class file (in our case package org.exist.indexing.lucene) plus the name of your main analyzer class. We called ours DMLBSAccentAnalyzer. If you haven’t changed Mike Sokolov’s code, you’ll need to put org.exist.indexing.lucene.ISOLatin1AccentAnalyzer.

The id is a plain text field, and elements can declare that their indexes use that analyzer by means of an @analyzer attribute.

<lucene>
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
  <analyzer id="ascii" class="org.exist.indexing.lucene.DMLBSAccentAnalyzer"/>
  <text qname="l" boost="3.0" analyzer="ascii"/>
  <text qname="entry" analyzer="ascii"/>
  <text qname="v"/>
  <text qname="l" analyzer="ascii"/>
  ...
</lucene>

Step 7. Reindex

Once you’ve reimported your data (or copied across your existing data directory), all you have to do is reindex, and it’s time to test the rest of your code.

Step 8. Gotchas

I found that a couple of things broke as a result of this change. In particular, any query using the unaccented index failed for the (rare) times that accented text was being sent across (for instance, automatic lookup of previous and next entry headwords). We had to rewrite a few of our XQuery functions to use other indexes as a result. YMMV, of course!


6 comments on “Dealing with accented characters”

  1. I’m so glad you were able to get some mileage from the ISOLatin1AccentAnalyzer patch!

    • Tom Wrobel says:

      How on earth did I not see this before! Thank you so much for your code – it worked like a charm.

  2. Hi Tom,
    Thanks so much for the detailed tutorial on this topic. I’ve been trying your solution and also tried adding the standard ASCIIFoldingFilter but can’t get a successful build so far. My use case requires filtering of macron characters common in Māori (eg ‘ā’) for a full-text index and I think the ASCIIFoldingFilter does this. I suspect I am encountering a version mismatch problem – some discussion is in this thread I posted on the exist-open list: http://exist.2174344.n4.nabble.com/Filtering-accents-from-Lucene-index-in-eXist-2-0-td4659845.html

    Have you successfully added your analyzer to eXist 2.0+? Do you have any comments on which versions of eXist and Lucene should work together for this purpose?

    Cheers,
    Chris

    • Tom Wrobel says:

      Hi Chris, thanks for the reply!

      I’m afraid the analyzer only works with version 1.4 (and 1.3, I think). I’ll make this clear in the post. You are, indeed, encountering a version mismatch issue – eXist 2.0 uses an updated version of Lucene, and the analyzer won’t work with it. We’re still using eXist 1.4, partly for this reason.

      You’ll probably spend longer getting the analyzer to work than you will in finding a way to enable ASCIIFoldingFilter.

      If you’re not married to eXist, then BaseX has this behaviour turned on by default (I think, see http://docs.basex.org/wiki/Options#UPDINDEX).

      Sorry I can’t be of more help!

      blue skies

      Tom

  3. Hi Tom,

    Thanks for clarifying. Unless someone on the exist-open list comes to my rescue with the ASCIIFoldingFilter, it looks like BaseX could be a good option for me, so thanks for the suggestion.

    Cheers,
    Chris

  4. Tom Wrobel says:

    I’m no longer able to edit this post, but I’m informed that the links to Sourceforge no longer work. In case anyone is still interested in seeing the code, it’s mirrored at: https://github.com/tomwrobel/DMLBS
