Tuesday, May 23, 2006

Angla vortaro - that's English dictionary to you and me

A quick check against public domain material downloaded previously turned up The Online Plain Text English Dictionary (OPTED), produced by Ralph S. Sutherland from the Project Gutenberg version of the 1913 edition of Webster's Unabridged Dictionary, with an additional vocabulary supplement.

(This dictionary is no longer available at the original Australian National University site. So I guess Ralph has moved on. Good thing a copy was already to hand.)

OPTED came in a very simple and completely regular HTML format and was easy to process automatically and upload.
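
For a flavour of the upload step (a sketch only - the table layout and file name here are assumptions, not the actual wirdz setup), the processed entries could be bulk loaded along these lines:

    /* Hypothetical target table - the real wirdz schema isn't described here */
    create table opted (
      id         int unsigned not null auto_increment primary key,
      word       varchar(64) not null,
      definition text not null,
      index idx_word (word)
    );

    /* Bulk load from a tab separated file produced by the HTML pre-processing */
    load data local infile 'opted.tsv'
    into table opted
    fields terminated by '\t'
    (word, definition);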

So wirdz now has its very own English Dictionary. (Having Esperanto but not English was just a bit too absurda.)

Despite the supplement, the dictionary is still relatively lacking in recent (i.e. post-1911) vocabulary. The GNU CIDE dictionary, which combines the PD Webster's Dictionary and WordNet, seems rather better. This is also to hand but in a relatively complex marked-up format. In due course ... or for estonta ("future").

Esperanto dictionary? - this is frenezeco!

OK - an online English-Esperanto dictionary is certainly not going to set the world on fire. But what the heck! The itanda dictionary engine makes new dictionaries really easy to add.

The dictionary source comes from Project Gutenberg and was commendably easy to process and upload.

This is unlike the various English dictionaries based on the public domain versions of Webster's. But one is coming soon to wirdz.

(And "frenezeco" - means insanity or madness - a bit harsh really).

Monday, May 22, 2006

Moby II - the sequel

Well Moby I (the original version of the wirdz English Thesaurus) had a pretty short release life. With potentially hundreds of index matches (and corresponding word lists) for a single word, something had to be done. So after less than 24 hours, version 2 was released.

The new version has two search modes:

  • Root word matches only

  • All matches against the index

For the latter, the initial display shows only the matched word, the root word and a plus or minus sign used to show or hide the word list itself. This uses the display style property of the container divs for the word list and the two images as follows:

    // On click event for the plus sign
    expandDiv("divList"); 
    contractDiv("divPlus"); 
    expandDiv("divMinus");
    
    // On click event for the minus sign
    contractDiv("divList"); 
    contractDiv("divMinus"); 
    expandDiv("divPlus");
    
    // The JavaScript div expand and contract functions
    function expandDiv(divid) {
      if (document.layers) {
        // Netscape 4 fallback - Layer objects are shown via visibility
        document.layers[divid].visibility = "show";
      }
      else {
        document.getElementById(divid).style.display = "block";
      }
    }

    function contractDiv(divid) {
      if (document.layers) {
        document.layers[divid].visibility = "hide";
      }
      else {
        document.getElementById(divid).style.display = "none";
      }
    }

Standard stuff but still pretty nifty in practice. In this case, usability is significantly improved - the Moby Thesaurus is such a monster that some level of paging through word lists is pretty much unavoidable. The Roget's Thesaurus (1911 public domain edition with updates) is much smaller (circa 1000 categories/word groups). This may well be a later addition to the wirdz site.

Sunday, May 21, 2006

Moby by name, Moby by nature

The wirdz site now includes an English Language Thesaurus based on Grady Ward's Moby Thesaurus.

This is a real monster. Whereas the Roget's Thesaurus runs to approximately 1000 word categories, the Moby Thesaurus has over 30000. This is matched by a huge index list with over 2.5 million entries. Certainly this gives MySQL a reasonable work-out.

In order to represent the thesaurus, there are two tables:

  • A table with an entry for each category - this includes the "root word" of the category plus the word list itself held in a text field, as well as a record ID

  • A word list "index" table with an entry for each word/category combination

Regarding the latter, the multiple entry winner is the word "cut", which appears in 1120 different categories/word lists.
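
As a rough sketch (the post doesn't give the actual wirdz schema, so the names and types below are assumptions), the two tables might look like this:

    /* One row per category: root word plus the full word list */
    create table category (
      id        int unsigned not null auto_increment primary key,
      root_word varchar(64) not null,
      word_list text not null
    );

    /* One row per word/category combination */
    create table word_index (
      id          int unsigned not null auto_increment primary key,
      word        varchar(64) not null,
      category_id int unsigned not null,
      index idx_word (word)
    );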

There are 253 words with 300 or more entries - this ain't nothin like Roget. This thesaurus ought to have come out of Texas, except that the contact address for its originator Grady Ward is in California.

Getting good performance from tables of this size, so that the AJAX "search as you type" feature and the "next" and "previous" buttons stay speedy, presented a slight challenge. The next and previous feature of the wirdz dictionary framework works on record ID. So tables need to be indexed so that both alphabetical searches and range limits on the record ID work efficiently and produce fast query plans. (Standard dictionaries can be held in a single alphabetically ordered table - this doesn't work for a thesaurus.)

The word list table was at first ordered according to the original creation process, where the dictionary word lists were processed sequentially. This worked fine for the original search (based on an index on the table column holding the individual words) but was no good for ID range constraints combined with row return limits using the LIMIT clause on the SELECT statements (as noted above, even with an exact word match, 1120 rows could be returned). The table was therefore recreated with the entries in alphabetical order. This now works well. Full marks to MySQL.
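
Again for illustration (reusing the assumed word_index table from the sketch above), the "next" feature amounts to a query like:

    /* "Next" page: with rows stored in alphabetical order, an ID range
       plus a LIMIT clause picks up where the previous page left off */
    select id, word, category_id
    from word_index
    where word = 'cut'
      and id > 123456   /* record ID of the last row already shown */
    order by id
    limit 5;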

The remaining problem is making the thesaurus more usable with so many potential categories/word lists matching individual words. (Needless to say there is considerable overlap between categories - brevity certainly wasn't an objective when this thesaurus was created!) One option will be to add an expand/contract +/- option in the results area, initially showing only the matched word and category root word for each list. Even limiting the number of lists shown at one time to five produces much more than a screen-full of information.

So Version 2 is on the way.

Tuesday, May 16, 2006

Chinese dictionary launched

Version 1 of the wirdz English-Chinese dictionary has finally been launched. (See previous posts on Pinyin.)

The dictionary is based on CEDICT. Since CEDICT is a Chinese-English dictionary, the wirdz version has been created by parsing out the individual English equivalents for each Chinese entry and "inverting" the dictionary.
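
For illustration, a CEDICT line looks roughly like this (quoting the format from memory, so treat it as approximate):

    中國 中国 [zhong1 guo2] /China/Middle Kingdom/

Inverting it yields one English-headed entry per gloss - here, entries under both "China" and "Middle Kingdom", each pointing back to 中国 [zhong1 guo2].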

The next step (Version 2) will be to add back the other English "equivalents" to provide a better indication of which meaning applies when the English term has two or more homonyms (i.e. words spelt the same but with different meanings - you probably know this, but I had to double check to ensure that I'd got the meaning the right way round compared with synonym!).
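
One way the add-back could work in SQL (purely a sketch - the table and column names are invented, and it assumes the inverted table keeps the ID of the original CEDICT entry):

    /* Fetch the sibling English glosses sharing a Chinese entry with a gloss */
    select distinct g2.english
    from gloss g1
    join gloss g2 on g2.entry_id = g1.entry_id
    where g1.english = 'bank'
      and g2.english <> 'bank';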

The format of the CEDICT source data was of pretty high quality and consistent. The only area where it slipped up, from an automated processing point of view, was in the use of round brackets. These are used, for example, to indicate parts of speech (although not for most words), to indicate a proper name as (N), or simply in the ways brackets would be used in normal text. This made parsing out the different uses problematic.

A standard XML format for dictionaries would be nice. That's not to say there aren't XML "standards" for dictionaries - of course there are - for example the Kirrkirr standard - but I mean "standard" in the "if not universal then at least pretty common" sense.

Sunday, May 07, 2006

Pinyin tone numbers to tone marks in MySQL

The following SQL will convert from Chinese pinyin tone number format to pinyin with tone marks.

The query is based on The Fool's Workshop Pinyin to Unicode Converter. That converter uses a PHP script which takes tone number pinyin as input and outputs the HTML entities for the tone mark characters.

For the wirdz English-Chinese dictionary, it is better to do all the processing in advance and hold the pinyin in both forms directly in the dictionary table. This eliminates the need for on the fly conversion but more importantly allows the dictionary to be full text searched directly using tone mark pinyin.
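
In MySQL terms that amounts to something like the following (a sketch - the table and column names are assumptions, and full text indexing requires a MyISAM table in this vintage of MySQL):

    /* Index and search the pre-converted tone mark pinyin directly */
    alter table cedict add fulltext index ft_pinyin (pinyin_tones);

    select chinese, pinyin_tones, english
    from cedict
    where match(pinyin_tones) against ('zhōng guó');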

Here's the script:

    /* Pinyin tone numbers to pinyin tone marks */
    set names 'utf8';
    
    /* Convert tone numbers to intermediate representation */
    update TABLE set pinyin_tones = 
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace( 
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace( 
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace( 
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(lower(pinyin_numbers),  
    'ang1','//aq//ng'),
    'ang2','//aw//ng'),
    'ang3','//ae//ng'),
    'ang4','//ar//ng'),
    'eng1','//eq//ng'),
    'eng2','//ew//ng'),
    'eng3','//ee//ng'),
    'eng4','//er//ng'),
    'ing1','//iq//ng'),
    'ing2','//iw//ng'),
    'ing3','//ie//ng'),
    'ing4','//ir//ng'),
    'ong1','//oq//ng'),
    'ong2','//ow//ng'),
    'ong3','//oe//ng'),
    'ong4','//or//ng'),
    'an1','//aq//n'),
    'an2','//aw//n'),
    'an3','//ae//n'),
    'an4','//ar//n'),
    'en1','//eq//n'),
    'en2','//ew//n'),
    'en3','//ee//n'),
    'en4','//er//n'),
    'in1','//iq//n'),
    'in2','//iw//n'),
    'in3','//ie//n'),
    'in4','//ir//n'),
    'un1','//uq//n'),
    'un2','//uw//n'),
    'un3','//ue//n'),
    'un4','//ur//n'),
    'ao1','//aq//o'),
    'ao2','//aw//o'),
    'ao3','//ae//o'),
    'ao4','//ar//o'),
    'ou1','//oq//u'),
    'ou2','//ow//u'),
    'ou3','//oe//u'),
    'ou4','//or//u'),
    'ai1','//aq//i'),
    'ai2','//aw//i'),
    'ai3','//ae//i'),
    'ai4','//ar//i'),
    'ei1','//eq//i'),
    'ei2','//ew//i'),
    'ei3','//ee//i'),
    'ei4','//er//i'),
    'a1','//aq//'),
    'a2','//aw//'),
    'a3','//ae//'),
    'a4','//ar//'),
    'er2','//ew//r'),
    'er3','//ee//r'),
    'er4','//er//r'),
    'lyue','l//v//e'),
    'nyue','n//v//e'),
    'e1','//eq//'),
    'e2','//ew//'),
    'e3','//ee//'),
    'e4','//er//'),
    'o1','//oq//'),
    'o2','//ow//'),
    'o3','//oe//'),
    'o4','//or//'),
    'i1','//iq//'),
    'i2','//iw//'),
    'i3','//ie//'),
    'i4','//ir//'),
    'nyu3','n//ve//'),
    'lyu','l//v//'),
    'v1','//vq//'),
    'v2','//vw//'),
    'v3','//ve//'),
    'v4','//vr//'),
    'v0','//vs//'),
    'u1','//uq//'),
    'u2','//uw//'),
    'u3','//ue//'),
    'u4','//ur//');
    
    /* Convert to tone marks */
    update TABLE set pinyin_tones = 
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(                   
    replace(
    replace(pinyin_tones,  
    '//aq//','ā'), 
    '//aw//','á'), 
    '//ae//','ǎ'), 
    '//ar//','à'), 
    '//eq//','ē'), 
    '//ew//','é'), 
    '//ee//','ě'), 
    '//er//','è'), 
    '//iq//','ī'), 
    '//iw//','í'), 
    '//ie//','ǐ'), 
    '//ir//','ì'), 
    '//oq//','ō'), 
    '//ow//','ó'), 
    '//oe//','ǒ'), 
    '//or//','ò'), 
    '//uq//','ū'), 
    '//uw//','ú'), 
    '//ue//','ǔ'), 
    '//ur//','ù'), 
    '//vq//','ǖ'), 
    '//vw//','ǘ'), 
    '//ve//','ǚ'), 
    '//vr//','ǜ'), 
    '//vs//','ü'), 
    '//aaq//','Ā'), 
    '//aaw//','Á'), 
    '//aae//','Ǎ'), 
    '//aar//','À'), 
    '//eeq//','Ē'), 
    '//eew//','É'), 
    '//eer//','È');
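
By way of a worked example of the two passes: "ni3 hao3" becomes "n//ie// h//ae//o" after the first update and "nǐ hǎo" after the second.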
    

Tuesday, May 02, 2006

Scrabble with blank tiles

The Scrabble Solver engine has been modified to handle wild cards.

When one or more wild cards are detected in the search letters, the basic selection criteria to find potential words have been changed from:

    select *
    from WORDLIST
    where conv('<Map>', 2, 10) & letter_map = letter_map

to:

    select *
    from WORDLIST
    where bit_count(conv('<Map>', 2, 10) & letter_map)
             >= bit_count(letter_map) - <number of wild cards>

The bitmap test now allows up to <number of wild cards> distinct letters of the candidate word to be absent from the search letters. This then needs to be filtered to ensure that the number of wild cards and of each letter in the target word does not exceed the number in the search letter string.

The SQL is generated automatically by the PHP code, and the resulting where clause condition is quite large: it has to deal with each letter of the alphabet separately, and it must also ensure that the sum, across the whole alphabet, of the surplus of each letter in the candidate word over the number available in the search letters does not exceed the number of wild cards/blank tiles available.
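
The post doesn't show the generated clause, but the per-letter counting can be done with the classic length/replace trick. A minimal sketch (the word column is an assumption, and it supposes one wild card in hand and, for the letter 'a', two copies of 'a' among the search letters - only one letter's term is written out here):

    select word
    from WORDLIST
    where bit_count(conv('<Map>', 2, 10) & letter_map)
             >= bit_count(letter_map) - 1
      and greatest(0,
            (char_length(word) - char_length(replace(word, 'a', ''))) - 2)
          /* + the equivalent term repeated for 'b' through 'z' ... */
          <= 1;   /* total letter surplus must not exceed the wild cards */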

The initial count slows up a bit compared with the no-wildcard case but is still acceptably fast. Once the count has been made and the first set of words displayed, moving between sets using the next and previous links is pretty fast.

The search process is pretty robust - even entering a search letter string consisting entirely of wild cards doesn't throw it.

A combinatorial explosion has definitely been averted.