Saturday, June 10, 2006

Roget's Thesaurus - the original

The Moby distribution includes a 1991 update of the 1911 public domain version of Roget's Thesaurus. It is much smaller than Moby, with just over 1000 categories and fewer than 90000 index entries (derived from the source text).

Unfortunately, the source data was a classic example of the gap between a human-edited document that appears to follow a standard format and a consistent text suitable for automated processing.

The source text was parsed and converted to relational database tables using an ad hoc parser and validator written in Visual Basic and running in MS Access. This worked well in practice, but hundreds of corrections to the source text were needed to remove inconsistencies, ambiguities and errors.
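To give a flavour of the parse-and-validate step, here is a minimal sketch in Python rather than the original Visual Basic. It is not the parser actually used. It assumes, purely for illustration, that each category starts on a line shaped like `#12. Name.`; the `category` table and the `load_categories` function are hypothetical names, and SQLite stands in for MS Access.

```python
import re
import sqlite3

# Assumed heading shape: "#<number>. <name>." (an assumption about the source text).
HEADING = re.compile(r"^#(\d+)\.\s*(.+?)\.?\s*$")


def load_categories(lines, conn):
    """Store category headings in a table and return a list of problems found."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS category "
        "(id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    problems = []
    expected = 1  # category numbers are assumed to run 1, 2, 3, ...
    for line_no, line in enumerate(lines, start=1):
        match = HEADING.match(line)
        if not match:
            continue  # not a heading line; a real parser would handle the entries here
        number, name = int(match.group(1)), match.group(2)
        if number != expected:
            problems.append(
                f"line {line_no}: expected category {expected}, found {number}"
            )
        try:
            conn.execute(
                "INSERT INTO category (id, name) VALUES (?, ?)", (number, name)
            )
        except sqlite3.IntegrityError:
            # The primary key rejects a category number that was already stored.
            problems.append(f"line {line_no}: duplicate category {number}")
        expected = number + 1
    return problems


if __name__ == "__main__":
    sample = ["#1. Existence.", "existence, being, entity", "#3. Substantiality."]
    for problem in load_categories(sample, sqlite3.connect(":memory:")):
        print(problem)
```

Reporting problems instead of stopping at the first one suits this kind of job: the list of complaints becomes the to-do list for hand corrections to the source text.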

There are no doubt numerous errors remaining.

The good thing about the Roget's text is that links between categories are marked in the text and, after all the corrections noted above, these were easily converted into HTML links. This made possible a richer AJAX interface than the one used previously: links can be followed and backtracked easily, all with the rapid response that comes from using AJAX as a front end to PHP and MySQL.
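The link conversion can be sketched roughly as follows, again in Python and not the code behind the site. It assumes cross-references are written as `&c.` followed by a category number (for example `&c. 494`); the actual markup in the source file may differ. The `category.php?id=` URL and the `link_cross_references` function are made up for the example.

```python
import html
import re

# Assumed cross-reference shape: "&c." followed by a category number.
XREF = re.compile(r"&c\.\s*(\d+)")


def link_cross_references(text, known_categories):
    """Return HTML-escaped text with cross-references turned into links."""
    parts = []
    pos = 0
    for match in XREF.finditer(text):
        # Escape the plain text between the previous match and this one.
        parts.append(html.escape(text[pos:match.start()]))
        number = int(match.group(1))
        label = html.escape(match.group(0))
        if number in known_categories:
            # Hypothetical URL scheme; the real site's URLs are not known.
            parts.append(f'<a href="category.php?id={number}">{label}</a>')
        else:
            # Leave a dangling reference as plain text rather than a dead link.
            parts.append(label)
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)


if __name__ == "__main__":
    print(link_cross_references("truth &c. 494; actual existence", {494}))
```

Only references to categories that actually exist become links, so any errors still lurking in the text show up as plain text instead of broken links.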

So now the wirdz site includes Roget's Thesaurus. The 1991 update adds a good number of more modern words. However, there's still lots of old and obsolete stuff, which is kinda interesting to see.
