Chinese dictionary launched
Version 1 of the wirdz English-Chinese dictionary has finally been launched. (See previous posts on Pinyin.)
The dictionary is based on CEDICT. Since CEDICT is a Chinese-English dictionary, the wirdz version has been created by parsing out the individual English equivalents for each Chinese entry and "inverting" the dictionary, so that each English equivalent becomes a headword pointing back at its Chinese entries.
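A minimal sketch of the inversion idea, not the actual wirdz code. It assumes each source line looks like "headword [pinyin] /equivalent 1/equivalent 2/"; the real CEDICT file may differ in detail, and the two sample lines are hand-written for illustration rather than copied from the file. The names LINE_RE and invert are made up for this example.

```python
import re
from collections import defaultdict

# Assumed line layout: "<headword> [<pinyin>] /<equivalent 1>/<equivalent 2>/"
LINE_RE = re.compile(
    r"^(?P<head>.+?)\s+\[(?P<pinyin>[^\]]*)\]\s+/(?P<glosses>.*)/\s*$"
)


def invert(lines):
    """Map each English equivalent to the Chinese entries that list it."""
    index = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything that does not fit the assumed layout
        for gloss in match.group("glosses").split("/"):
            gloss = gloss.strip()
            if gloss:
                # Lower-case the key so "Bank" and "bank" land together.
                index[gloss.lower()].append(
                    (match.group("head"), match.group("pinyin"))
                )
    return dict(index)


# Hand-written sample lines in the assumed layout.
sample = [
    "银行 [yin2 hang2] /bank/",
    "河岸 [he2 an4] /bank/riverbank/",
]
english_to_chinese = invert(sample)
```

Looking up english_to_chinese["bank"] returns both Chinese entries with nothing to tell them apart, which is exactly the gap Version 2 is meant to close.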
The next step (Version 2) will be to add back the other English "equivalents" from the same Chinese entry, to give a better indication of which meaning applies when the English term is a homonym (i.e. one of two or more words spelt the same but with different meanings - you probably know this, but I had to double-check that I'd got the meaning the right way round compared with synonym!).
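One way the Version 2 idea might look, as a sketch under assumptions rather than a description of how wirdz will do it: each inverted entry keeps the sibling equivalents from its Chinese entry as disambiguation hints. The function name invert_with_context and the sample entries are hypothetical.

```python
from collections import defaultdict


def invert_with_context(entries):
    """Invert (headword, equivalents) pairs, keeping sibling equivalents as hints."""
    index = defaultdict(list)
    for headword, glosses in entries:
        for gloss in glosses:
            # The other equivalents of the same Chinese entry hint at
            # which sense of the English word is meant.
            others = [g for g in glosses if g != gloss]
            index[gloss].append((headword, others))
    return dict(index)


# Hand-written sample entries, not copied from CEDICT.
entries = [
    ("银行", ["bank"]),
    ("河岸", ["bank", "riverbank"]),
]
hinted = invert_with_context(entries)
```

Here the second "bank" entry carries the hint "riverbank", while the first has no hint at all, so entries with a single equivalent would still need some other way of being told apart.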
The source data for CEDICT is of pretty high quality and consistent in format. The only area where it slips up, from an automated-processing point of view, is in the use of round brackets. These are used, for example, to indicate parts of speech (although not for most words), to mark a proper name as (N), or just in the ways brackets would be used in normal text. This made parsing out the different uses problematic.
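To make the bracket problem concrete, here is a rough sketch of a classifier for the three uses described above. It is only an illustration: the set of part-of-speech tags is a guess (hypothetical, not taken from CEDICT), as is the assumption that a proper name is always marked with a capital (N).

```python
import re

# Matches one non-nested pair of round brackets and captures the contents.
PAREN_RE = re.compile(r"\(([^()]*)\)")

# Hypothetical tag set; the real data may use different abbreviations.
POS_TAGS = {"n", "v", "adj", "adv"}


def classify_brackets(gloss):
    """Return (contents, kind) for each bracketed group in an English equivalent."""
    found = []
    for match in PAREN_RE.finditer(gloss):
        inner = match.group(1).strip()
        if inner == "N":
            kind = "proper_name"  # assumed: capital N marks a proper name
        elif inner.lower().rstrip(".") in POS_TAGS:
            kind = "part_of_speech"
        else:
            kind = "plain_text"  # ordinary parenthetical remark
        found.append((inner, kind))
    return found
```

Even in this toy version the ambiguity shows: only the case of the letter separates the proper-name marker (N) from a noun tag (n), and anything not on the tag list falls through to plain text.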
A standard XML format for dictionaries would be nice. That's not to say there aren't XML "standards" for dictionaries - of course there are - for example the Kirrkirr standard - but I mean "standard" in the "if not universal then at least pretty common" sense.