You know how it is. A project ambles along for a few years gathering data until managing that data becomes difficult enough that someone thinks “it’s time for a database!”. Well, the DMLBS is nearly 50 years old now – it was probably time to get around to building one.
So, breaking news: the dictionary now has a database (for internal use only, for now).
Why didn’t we have one before?
Since completing the arduous process of converting the dictionary’s content into XML, all of our content has been stored as XML files.
Thousands of files in a Windows file system might not be sexy computing, but it works well for creating and editing: editors can write or edit the files using oXygen, and version control using SVN maintains a history of the changes. The idea of ‘a set of entries in an XML file’ corresponds well to the boxes of slips from which the entries are written (one file per box), and means that similar entries are written at the same time by the same person in the same file. At the same time, for the converted material (which doesn’t need to make its way through an editorial workflow), one file per entry makes sense too, and with more than 65,000 entries already, that’s a lot of files.
It’s also worth noting that we didn’t store our data electronically until a few years ago, and, more importantly, XML database technology wasn’t exactly mature when we started up in 1965.
Why build one now?
Having over 65,000 files in two different XML schemas is a bit of a pain when you want to do anything quickly. Internally, editors need to be able to search existing content – to provide examples of definitions, previous citation formats, etc. – and doing this using XSL over each file is a slow, memory-hungry process (finding an entry in the file system sometimes took over 2 minutes).
In addition, we’re thinking about the future. Once the dictionary is complete, we’re looking to offer it online, with all the added benefits online publication brings: wider accessibility, searching, easy navigation between entries, automatic bibliographic lookups, etc. So creating a database, and some trial web pages, was an ideal scoping exercise.
Which database to use?
The only real choice for something like this is an XML database. Relational databases (MySQl, Access) are very good at storing tabular data (i.e. anything that can be represented in an Excel file, for instance), but they aren’t very good at storing mixed content.
For example, our quotations can contain editorial comments, bibliographic references and cross references, mixed in with the quotation text – it’s hard to store something like “devota mente †paratam [v. paratum] peregrinar” (with its embedded editorial comment) in a relational database. On the other hand, it’s what XML databases thrive on.
The choice of XML database wasn’t a hard one either. It had to be:
- free to use, and also free to use for the foreseeable future (ruling out commercial databases like MarkLogic, even their free versions).
- something with a proven track record (baseX is a little immature)
- a project that was being actively developed – no point in selecting something that won’t be updated
- a database with a well-documented REST interface (making Sedna less attractive, although it’s definitely an option we might consider in future)
- a project which other people in the University have experience with (so we can find external developers if necessary)
- something which could potentially integrate with oXygen, which we use extensively both for authoring/editing, and for developing.
For all of these reasons, eXist was the obvious choice.
How did we do it?
We borrowed an old desktop box, stripped it down and installed Linux (Ubuntu 12.04) with Apache and Tomcat (it’s useful to be able to restart eXist without messing around with -jar files in the early stages). Installing eXist within Tomcat was easy enough if you’ve done that sort of thing before, although eXist’s own documentation wasn’t terribly helpful. If you haven’t, then spending a couple of hours with the Tomcat documentation – and where its files are stored – will be time well spent (see also this helpful guide to repackaging eXist for Tomcat).
Initially, we tried importing the data over WebDav, but that proved painfully slow (it took 5 days – which I only know because I had other things to do, or I would’ve just killed the process). Uploading via a shell script and the REST interface was much quicker, and all 81,000 entries were in the database in a few hours.
Configuring eXist, especially the indexes, hasn’t been easy. The documentation isn’t too bad, but the relatively small user base – compared with something like MySQl – makes finding solutions to specific problems difficult, especially where the documentation isn’t clear. This is, of course, a work in progress – I’ll write more on this once we’ve spent some time tweaking.
What do we do with it?
The first step was to reproduce our existing searches (written as XSL transforms) in XQuery. For transforming an XML document into another document, XSL is lovely. For data analysis, such as a search, it’s a pain – it’s hard to tell data manipulation from output transformation, for a start. XQuery was built to analyse XML data in this way, and it’s quicker to write and (for me) much easier to maintain.
What are the benefits?
Speed is the most obvious benefit. Searches that used to take minutes now return in seconds. It now takes only 1.5 seconds to find a simple entry (which used to take up to 2 minutes), and 4 seconds for a search like ‘find all definitions which include the word “church”’, which used to involve making a cup of tea (333 results, btw.). We can also ask more complex queries that simply weren’t feasible before (results to follow).
Moreover, they don’t take up resources on the user’s machine while they run – even with the longer searches, some of which can take 20 seconds or so, an editor doesn’t have to sit and watch the screen because oXygen and Saxon are sucking up all available RAM!
What advice would we have for anyone who wants to do something similar?
- Keep the task simple. We postponed some important issues by simply ignoring them  – don’t let them get in the way of trying something new, so long as you aren’t committing yourself to something you might later regret.
- Make the task useful. Doing something like this shouldn’t be an exercise in its own right, but a path to some distinct improvement or feature. Keeping the end goal in mind made a lot of decisions easier.
- Think very carefully about where you’re storing the authoritative form of your data. If you can store it as an XML file and only import a copy of that into your database (as opposed to storing the authoritative form in the database), you make your task a lot easier.
What are we going to do now?
For now, we’re testing the system by letting the editors use it and encouraging feedback. We’re actively exploring ways to develop an online version of the dictionary, but the current form of some of the data isn’t publication-ready. Much as we’d love to make the database as described here more widely available, that just isn’t feasible right now (there are also copyright issues that would need to be resolved). On the other hand, we’d be interested to talk to any academic projects that might like to use the data.
Thanks for reading all the way to the end! It’s a long post, but it’s been quite a long journey to get this far. We still have some way to go, so watch this space for further news. We’d also be very interested to hear from you if you’ve attempted something similar or are thinking of doing so, in particular your choice of database and web technologies.
Tom Wrobel (firstname.lastname@example.org)
XML and Web Developer, DMLBS
 We do add in a few tweaks, such as temporary <entry> id numbers, but it’s minimal. (back)
 Cans kicked down the road include: versioning (handled by SVN on the data before it goes into the database), data integrity and backups (it isn’t the authoritative version of the content, and can be rebuilt if the worst happens), access control (it’s internal use only, and only updates from the master files), schema-aware processing (the XML files being worked on will be well formed, but not necessarily valid), and editing workflow (editors have no way to update existing database content other than by the normal process of writing new entries). (back)
 I was remarkably unimpressed with XForms in general – does anyone have any success stories? (back)