Escape from Atamyrat

The wirdz site now includes an English Language Thesaurus based on Grady Ward's Moby Thesaurus.

This is a real monster. Whereas Roget's Thesaurus runs to approximately 1000 word categories, the Moby Thesaurus has over 30000. This is matched by a huge index list with over 2.5 million entries. Certainly this gives MySQL a reasonable workout.

In order to represent the thesaurus, there are two tables:

A table with an entry for each category - this includes the "root word" of the category plus the word list itself held in a text field as well as a record ID

A word list "index" table with an entry for each word/category combination
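As a rough illustration, the two tables could be sketched like this. The table and column names (category, word_index, root_word and so on) are assumptions rather than the site's actual schema, the sample words are made up, and Python's built-in sqlite3 module stands in for MySQL so that the sketch runs on its own.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (
    id        INTEGER PRIMARY KEY,  -- record ID
    root_word TEXT NOT NULL,        -- "root word" of the category
    word_list TEXT NOT NULL         -- the whole word list held as text
);
CREATE TABLE word_index (
    id          INTEGER PRIMARY KEY,
    word        TEXT NOT NULL,
    category_id INTEGER NOT NULL REFERENCES category(id)
);
CREATE INDEX idx_word ON word_index (word);
""")

def add_category(root_word, words):
    """Store one category plus one index row per word in its list."""
    cur = conn.execute(
        "INSERT INTO category (root_word, word_list) VALUES (?, ?)",
        (root_word, ",".join(words)),
    )
    conn.executemany(
        "INSERT INTO word_index (word, category_id) VALUES (?, ?)",
        [(w, cur.lastrowid) for w in words],
    )
    return cur.lastrowid

# Made-up sample data, not taken from the Moby Thesaurus.
add_category("cut", ["cut", "slice", "carve"])
add_category("reduce", ["reduce", "cut", "trim"])

# How many categories does each word appear in?
counts = dict(conn.execute(
    "SELECT word, COUNT(*) FROM word_index GROUP BY word"
))
```

The same GROUP BY count, run over the full index table, is the kind of query that would surface the words with the most entries.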

In the index table, the multiple-entry winner is the word "cut", which appears in 1120 different categories/word lists.

There are 253 words with 300 or more entries - this ain't nothin like Roget. This thesaurus ought to have come out of Texas except that the contact address for its originator Grady Ward is in California.

Getting tables of this size to perform so that the AJAX "search as you type" and the "next" and "previous" buttons worked speedily presented a slight challenge. The next and previous feature of the wirdz dictionary framework works on record ID. So tables need to be indexed so that both alphabetical searches and range limits on the record ID work efficiently and produce fast query plans. (Standard dictionaries can be held in a single alphabetically ordered table - this doesn't work for a thesaurus.)
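For the "search as you type" side, a minimal sketch of the kind of indexed prefix query involved might look like this. The word_index table, the idx_word index and the suggest function are hypothetical names, the rows are made up, and sqlite3 again stands in for MySQL. In MySQL, a LIKE pattern with no leading wildcard can use a B-tree index on the column.

```python
import sqlite3

# Hypothetical table and index names; sqlite3 stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word_index ("
    " id INTEGER PRIMARY KEY,"
    " word TEXT NOT NULL,"
    " category_id INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX idx_word ON word_index (word)")

# Made-up rows: "cut" appears in two categories.
conn.executemany(
    "INSERT INTO word_index (word, category_id) VALUES (?, ?)",
    [("cup", 1), ("cut", 1), ("cut", 2), ("cute", 3), ("cutlery", 4), ("dog", 5)],
)

def suggest(prefix, limit=5):
    """Return up to `limit` distinct words starting with `prefix`.

    The pattern has no leading wildcard, so an index on `word` can serve it.
    This sketch does not escape % or _ typed by the user.
    """
    rows = conn.execute(
        "SELECT DISTINCT word FROM word_index"
        " WHERE word LIKE ? ORDER BY word LIMIT ?",
        (prefix + "%", limit),
    )
    return [word for (word,) in rows]

suggestions = suggest("cut")
```

The LIMIT keeps each keystroke's response small, however many index rows a short prefix matches.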

The word list table was at first ordered according to the original creation, where the dictionary word lists were processed sequentially. This works fine for the original search (based on an index of the table column holding the individual words) but was no good for ID range constraints together with row return limits using the LIMIT clause on the SELECT statements (as noted above, even an exact word match can return more than 1000 rows). The table was therefore recreated with the entries in alphabetical order. This now works well. Full marks to MySQL.
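To see why the alphabetical rebuild helps, here is a small sketch of next/previous paging by record ID. The table name, the next_page and previous_page helpers and the sample rows are all assumptions, and sqlite3 stands in for MySQL. The idea is that once IDs rise in step with the words, an ID range plus LIMIT walks the list in alphabetical order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word_index ("
    " id INTEGER PRIMARY KEY,"
    " word TEXT NOT NULL,"
    " category_id INTEGER NOT NULL)"
)

# Made-up rows in the order the word lists were processed.
raw = [("cut", 1), ("slice", 1), ("reduce", 2),
       ("cut", 2), ("abridge", 3), ("cut", 3)]

# Recreate the table contents in alphabetical order, so that the
# record IDs (assigned in insertion order) rise with the word.
conn.executemany(
    "INSERT INTO word_index (word, category_id) VALUES (?, ?)",
    sorted(raw),
)

def next_page(after_id, page_size):
    """Rows whose ID follows `after_id`, in ascending ID order."""
    return conn.execute(
        "SELECT id, word, category_id FROM word_index"
        " WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()

def previous_page(before_id, page_size):
    """Rows whose ID precedes `before_id`, returned in ascending ID order."""
    rows = conn.execute(
        "SELECT id, word, category_id FROM word_index"
        " WHERE id < ? ORDER BY id DESC LIMIT ?",
        (before_id, page_size),
    ).fetchall()
    return rows[::-1]

first = next_page(0, 2)
second = next_page(first[-1][0], 2)
back = previous_page(second[0][0], 2)
```

Each page is found by a range scan on the primary key, so the cost does not grow with how far into the list the reader has paged.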

The remaining problem is making the thesaurus more usable with so many potential categories/word lists matching individual words. (Needless to say there is considerable overlap between categories - brevity certainly wasn't an objective when this thesaurus was created!) One option will be to add an expand/contract +/- option in the results area showing only the matched word and category root word for each list initially. Even limiting the number of lists currently shown at one time to five produces much more than a screen-full of information each time.

So Version 2 is on the way.

Sunday, May 21, 2006

Moby by name, Moby by nature
