Escape from Atamyrat

Tuesday, April 27, 2021

Wizzittle launched as a new single page multi-language pronunciation dictionary.

Wizzittle.com now provides a "single pane of glass" multi-language pronunciation dictionary.

The initial version supports the following languages:

Chinese, French, German, Italian, Japanese, Portuguese and Spanish.

That's a fair chunk of the global population covered if both first and second languages are considered - circa 4.5 billion people (including the 60 million or so Italian speakers thrown in for good measure).

Chinese and Japanese both have well-established Romanisations (pinyin and romaji), and in both cases the dictionary also supports Chinese/Japanese character searches.

Regional accent options are provided where appropriate. So, for example, there are 6 English accent options as well as 4 for French and 3 for German.

The dictionary is based on the wirdz dictionary engine. The wirdz dictionary site already includes a number of additional languages. Five billion and beyond is well in view.

Sunday, August 30, 2020

I C how to make things smaller and faster ...

The wirdz™ Scrabble® Solver has gone through a number of iterations:

1. The original solver "simply" used SQL against the MySQL database to find matching words including wild cards. It was quite a challenge to get this to work quickly enough to deliver results in an acceptable time but even so, the CPU load was quite noticeable. Certainly not scalable and the SQL was certainly not simple.

2. A Python in-memory solver. This was definitely faster, although it needed vectorised numpy operations to achieve acceptable performance, as Python's interpreted loops and run-time object type operations make any implementation of directly coded loops and checks pretty slow. This used XMLRPC as the request/response protocol.

3. An implementation in the Julia language, which uses a compile on demand/lazy compile approach. This was definitely faster but there appears to be no such thing as a small memory footprint Julia program due to the size of the run-time libraries including the compiler functions. The Julia program provided a REST interface.

4. Finally (?), the Scrabble Solver is written in C and operates as a fully in-memory process (once the word tables have been initially loaded from the DB). This is small and fast and now includes the option to use template patterns to reflect the existing board position. It also allows double and triple letter and word scores to be specified and taken into account when determining the highest scoring plays.

To provide the same functionality as the Julia program, the C version is about 60% longer (in lines). Not a lot when additional closing brackets ("}") are needed as well as more code intensive operations to allocate and free heap memory (malloc & free) and relatively clunky string operations.

The C version uses the PCRE regex library to do fast pattern matching and the GNU libmicrohttpd library to provide the REST server framework.

Needless to say it's both faster and much smaller 😊.
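The template and premium-square idea from item 4 can be illustrated with a minimal JavaScript sketch. This is not the C implementation: the function names, the template convention ("." for an empty square, a letter for a tile already on the board), the "?" blank marker and the premium codes are all assumptions made for the example. The letter values are the standard English Scrabble ones, and the 50 point bonus for playing all seven tiles is ignored.

```javascript
// Standard English Scrabble letter values.
const LETTER_VALUES = {
  a: 1, b: 3, c: 3, d: 2, e: 1, f: 4, g: 2, h: 4, i: 1, j: 8, k: 5, l: 1, m: 3,
  n: 1, o: 1, p: 3, q: 10, r: 1, s: 1, t: 1, u: 1, v: 4, w: 4, x: 8, y: 4, z: 10
};

// Score one candidate word (lowercase a-z) against a board template.
// template: same length as the word, "." = empty square, letter = existing tile.
// rack:     the player's tiles, with "?" standing for a blank (hypothetical convention).
// premiums: per-square codes "DL", "TL", "DW", "TW" or "" (hypothetical convention).
// Returns the score, or null if the word does not fit or cannot be formed.
function scorePlay(word, template, rack, premiums) {
  if (word.length !== template.length) return null;
  // The template doubles as a regular expression: "." matches any letter.
  if (!new RegExp("^" + template + "$").test(word)) return null;

  // Count the tiles available on the rack.
  const available = {};
  for (const tile of rack) available[tile] = (available[tile] || 0) + 1;

  let total = 0;
  let wordMultiplier = 1;
  let tilesPlaced = 0;
  for (let i = 0; i < word.length; i++) {
    const letter = word[i];
    let value = LETTER_VALUES[letter];
    if (template[i] !== ".") {
      // Tile already on the board: face value only, no premium.
      total += value;
      continue;
    }
    if (available[letter] > 0) {
      available[letter] -= 1;
    } else if (available["?"] > 0) {
      // Greedy use of a blank, which scores zero.
      available["?"] -= 1;
      value = 0;
    } else {
      return null; // The rack cannot supply this letter.
    }
    tilesPlaced += 1;
    const premium = premiums[i] || "";
    if (premium === "DL") value *= 2;
    if (premium === "TL") value *= 3;
    if (premium === "DW") wordMultiplier *= 2;
    if (premium === "TW") wordMultiplier *= 3;
    total += value;
  }
  return tilesPlaced > 0 ? total * wordMultiplier : null;
}

// Rank every word in a list that fits the template, highest score first.
function bestPlays(words, template, rack, premiums) {
  const plays = [];
  for (const word of words) {
    const score = scorePlay(word, template, rack, premiums);
    if (score !== null) plays.push({ word, score });
  }
  return plays.sort((a, b) => b.score - a.score);
}

// An "i" is already on the third square; double letter first, double word last.
console.log(bestPlays(["quiz", "quit", "quid"], "..i.", "quzt", ["DL", "", "", "DW"]));
```

With that rack, "quiz" scores 64 and "quit" 46, while "quid" is rejected because there is no "d" or blank available. The blank handling is deliberately simple and does not try to place the blank where it costs the fewest points.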

The original Scrabble Solver is still at its original url (wirdz Scrabble Solver). The new Scrabble Helper which includes pattern matching is at (wirdz Scrabble Helper).

Saturday, July 25, 2020

Extending Apache Authentication To Avoid Repeated Logins

When basic Apache authentication is used to control access to a web site, once authentication is completed, the server includes a session cookie in the HTTP header which is set to expire when the browser is closed.

This is unlike say, Google, where the default is for authentication to persist for a period.

The Apache session token includes the user name and password in clear text and so, while https may be used to protect the credentials in transit, it is clearly not a method which can be recommended for highly secure access. However, there are definitely use cases where it is suitable (for example to restrict access to content where higher levels of security are not called for and no data updates are being enabled). On the plus side, Apache authentication is quick and easy to implement.

However, having to re-authenticate each time can be pretty/very/extremely annoying.

This was the case for a custom dictionary site based on the wirdz dictionary engine being developed by JHC Technology Ltd.

The key property of the session cookie is the "expires" field. Even if this is updated by code, it is reset via the HTTP header each time new content is accessed. So basically, this can seem a bit like whack-a-mole.

But coming to the rescue is the document unload event, which in most cases will have the last word.

Outline code for using this is as follows:

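<script>
// Runs as the page is unloaded, so it gets the last say on the cookie's expiry.
window.addEventListener("unload", function(event) {
  var nDays = 14;
  // getCookie and setCookie are helper functions, not browser built-ins.
  var cookieValue = getCookie("session");
  var expires = new Date();
  expires.setTime(expires.getTime() + nDays * 24 * 3600 * 1000);
  // toUTCString replaces the deprecated toGMTString.
  var options = {expires : expires.toUTCString(), samesite : "Strict"};
  setCookie("session", cookieValue, options);
});
</script>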

This will (hopefully) stop the cookie being deleted for 14 days from the last use and allow the site to be accessed without tiresome logins each time. This code will be needed on each page.

You should note that getCookie and setCookie are NOT standard functions but there are plenty of bits of sample code to be found via a quick search engine search.
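For completeness, here is one possible minimal version of the two helpers. It is a sketch rather than the code used on the site: readCookie and buildCookie are hypothetical names for the string handling, which is kept separate from document.cookie so that it can be checked outside a browser. The cookie value is passed through unchanged so that the server sees exactly the token it issued.

```javascript
// Find a named cookie in a "k1=v1; k2=v2" string; returns null if absent.
function readCookie(cookieString, name) {
  for (const pair of cookieString.split(";")) {
    const trimmed = pair.trim();
    const eq = trimmed.indexOf("=");
    if (eq < 0) continue;
    if (trimmed.slice(0, eq) === name) {
      // Everything after the first "=" is the value, untouched.
      return trimmed.slice(eq + 1);
    }
  }
  return null;
}

// Build the string assigned to document.cookie from a name, value and options.
function buildCookie(name, value, options = {}) {
  let cookie = name + "=" + value;
  if (options.expires) cookie += "; expires=" + options.expires;
  if (options.path) cookie += "; path=" + options.path;
  if (options.samesite) cookie += "; samesite=" + options.samesite;
  return cookie;
}

// Browser-facing wrappers with the names used in the outline code.
function getCookie(name) {
  return readCookie(document.cookie, name);
}

function setCookie(name, value, options) {
  document.cookie = buildCookie(name, value, options);
}
```

Bear in mind that a cookie flagged HttpOnly by the server is not visible to JavaScript at all, in which case getCookie will return null and this approach cannot be used.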

Friday, May 08, 2020

Pinyin tone marks, mysql & accented characters

Most of the mysql UTF8 collations such as utf8_general_ci are case insensitive and also accent insensitive. This means that for sorting and comparisons, accented characters are treated as being the same as their unaccented equivalents, even though their internal representation is different and they are returned by queries or displayed in their original form.

For the wirdz Chinese-English Dictionary this turns out to be quite convenient. pinyin (which is a "Romanized" method of representing Chinese characters and pronunciation) has two variants: tone marks and tone numbers.

For example, the Chinese phrase 發表演講 which means to give a speech may be represented in pinyin as either fā biǎo yǎn jiǎng (tone marks) or fa1 biao3 yan3 jiang3 (tone numbers).

For dictionary look ups which simply need to locate a term (and are not directly concerned with pronunciation), mysql's accent insensitivity allows unaccented search terms to match directly against the accented version held on the database. There is no need for an additional column for use in searches, nor any need to pre-process search terms to remove accents. Since search terms can be entered by cut-and-paste, both accented and unaccented entries need to be covered.

So far so good, but there can also be the need to separate out the different accented versions. While mysql does have case sensitive collations such as utf8_bin which can be applied at column level, this all gets a bit messy and could require the same data in two separate columns with different collations. Collations in some circumstances can also be set within queries, but this is also a tad messy.

If there are requirements to distinguish between accented characters, the hex function is a fairly simple option.

So, for example, if c is a column in table T with an accent insensitive collation, and there are only two values of c, say "a" and "ā", then select distinct c from T will return only one row (which will contain only one of the two variants) whereas select distinct hex(c), c from T will return two rows and both values.
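The difference between the two queries can be mimicked outside the database. The JavaScript below is only an analogy, not mysql itself: foldAccents is a hypothetical stand-in for how an accent insensitive collation compares values, and utf8Hex produces the same kind of byte string that hex() returns for a UTF-8 column.

```javascript
// Rough stand-in for an accent and case insensitive comparison key:
// decompose accented letters, drop the combining marks, lower-case the rest.
function foldAccents(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase();
}

// Upper-case hex of the UTF-8 bytes of a string.
function utf8Hex(text) {
  return Array.from(new TextEncoder().encode(text), function (byte) {
    return byte.toString(16).padStart(2, "0").toUpperCase();
  }).join("");
}

// "distinct c": values that fold to the same key collapse into one row.
function distinctFolded(values) {
  const rows = new Map();
  for (const value of values) {
    const key = foldAccents(value);
    if (!rows.has(key)) rows.set(key, value);
  }
  return Array.from(rows.values());
}

// "distinct hex(c), c": the byte representation keeps the variants apart.
function distinctByHex(values) {
  const rows = new Map();
  for (const value of values) rows.set(utf8Hex(value), value);
  return Array.from(rows.entries());
}

console.log(distinctFolded(["a", "ā"])); // one row
console.log(distinctByHex(["a", "ā"]));  // two rows: 61 and C481
console.log(foldAccents("fā biǎo yǎn jiǎng")); // tone marks removed
```

Which of the two variants survives in the single-row case depends on the order of the input here, just as the variant returned by select distinct is not something to rely on.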

Thursday, May 07, 2020

There are a number of interesting old dictionaries which have been digitized such that they can be relatively easily processed and uploaded to populate the mysql database table structure used by the wirdz™ dictionary engine.

The latest addition to wirdz.com is the Sailor's Word Book, a 19th century dictionary of nautical terms created by one admiral and edited by another.

As always, the text was never intended to be easy for automated text processing, which in this case would have come over 100 years after the book was produced. The trick is to be able to identify where the main body of the dictionary starts, where it ends, when a new term starts, whether there is a part of speech included and where the definition starts. The Victorian typesetters and the folks who undertake the digitisation are in general both well disciplined, with strong attention to detail and consistency. So, despite there always being a few edge cases, the conversion, which in the case of content for wirdz.com uses variations on a core tool built in Python, can be relatively straightforward.

Wednesday, May 06, 2020

Chinese characters and escapes with AJAX and PHP

Although most of the time UTF-8 combined with the multi-byte settings in PHP seems to do the trick beneath the hood, there is always a corner case. Supporting Chinese character searches against the English-Chinese Dictionary, or Kanji/Kana searches against the Japanese-English Dictionary, means passing Chinese and Japanese characters via the AJAX call to the PHP back-end process. When this was tested, the characters arrived at the server process in an escaped form which could not easily be converted back to UTF-8 using any of the standard PHP functions.

For example, the Chinese character string 介質訪問控制層 arrived as:

%u4ECB%u8CEA%u8A2A%u554F%u63A7%u5236%u5C64

when processed through the AJAX "get" method.

After eliminating all of the standard functions, the following steps finally did the trick:

1. Replacing % with \ in the input string to get it into the form of a "standard" \uXXXX escaped Unicode string.

2. Converting the string to a string representing a single element JSON array, i.e. '["my utf8 string"]' (JSON needs double quotes around the string).

3. Using the json_decode function on the string.

4. Extracting the converted text string from the returned array.

Not the most obvious way to handle the problem but simple in the end! Because the string passed to the server by the AJAX call is subject to character filtering, the risk of a false positive getting misinterpreted by the json function is minimized.
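The four steps are easy to try out in JavaScript, where JSON.parse plays the part of PHP's json_decode. This is a sketch of the technique rather than the PHP code used on the site, and decodePercentU is a hypothetical name. It only rewrites %uXXXX sequences instead of every % sign, and like the original it assumes the input has already been filtered so that it contains no quotes or backslashes.

```javascript
// Decode a string containing %uXXXX escapes by borrowing the JSON parser.
function decodePercentU(escaped) {
  // Step 1: turn each %uXXXX into the JSON-style escape \uXXXX.
  const jsonEscaped = escaped.replace(/%u([0-9A-Fa-f]{4})/g, "\\u$1");
  // Step 2: wrap the result as the text of a single element JSON array.
  const wrapped = '["' + jsonEscaped + '"]';
  // Steps 3 and 4: parse the JSON and pull the decoded string back out.
  return JSON.parse(wrapped)[0];
}

console.log(decodePercentU("%u4ECB%u8CEA%u8A2A%u554F%u63A7%u5236"));
```

The %uXXXX form is what JavaScript's legacy escape() function produces for characters outside the Latin-1 range. If that is where the escaping came from (an assumption, as the post does not say), using encodeURIComponent in the browser instead would send ordinary percent-encoded UTF-8.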

Saturday, August 12, 2006

AJAX basics

The wirdz site uses AJAX to provide much of its functionality and ease of use. This has been taken furthest so far with the Wordnet 2.1 based English Dictionary plus and Roget's Thesaurus.

When this blog started, some technical material on AJAX was promised. So here's a basic intro to start with.

The key to AJAX is that it allows interaction and updated content to be obtained from the back end web/application server without needing a full page refresh. This is what allows the update-as-you-type feature of the itanda dictionary engine to work. It is also the approach behind rich feature applications such as Google's online spreadsheet which has been launched in beta status.

The orchestration of the AJAX interaction takes place between javascript running in the browser and whatever active page technology is being used at the back end - this can be PHP, ASP, Java servlets/server pages, Ruby on Rails ... your choice.

The first step on the client/browser side is to create an HTTP request object. Here's the code used on itanda. You'll find variants of this elsewhere on the web. This works with most browsers.

var objReq;

function initialize() {
  try {
    objReq = new ActiveXObject("Msxml2.XMLHTTP");
  }
  catch(e) {
    try {
      objReq = new ActiveXObject("Microsoft.XMLHTTP");
    }
    catch(e2) {
      objReq = null;
    }
  }
  if (!objReq && typeof XMLHttpRequest!="undefined") {
    objReq = new XMLHttpRequest();
  }
}

Interaction takes place in two stages (a) sending a request and (b) processing the return. Here's some sample script:


/* Send request */
if (objReq != null) {
  objReq.onreadystatechange = processReturn;
  var strRequest = "http://www.itanda.com/demo/ajaxDemo1.php";
  objReq.open("GET", strRequest, true);
  objReq.send(null);
}

/* Process return */
function processReturn() {
  /* readyState 4 = request complete, status 200 = HTTP OK */
  if (objReq.readyState == 4) {
    if (objReq.status == 200) {
      if (objReq.responseText != "") {
        /* Assumes the page has an element with id "output" */
        document.getElementById("output").innerHTML = objReq.responseText;
      }
    }
  }
}

In practice the original request will include some parameters and the return processing will be just a tad more sophisticated.
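As a small illustration of adding parameters, the request string can be built from a set of name/value pairs. This is a sketch only: buildRequestUrl and the parameter names term and lang are made up for the example and are not part of the itanda code.

```javascript
// Append URL-encoded query parameters to a base URL.
function buildRequestUrl(baseUrl, params) {
  const pairs = [];
  for (const name of Object.keys(params)) {
    // encodeURIComponent percent-encodes each value as UTF-8.
    pairs.push(encodeURIComponent(name) + "=" + encodeURIComponent(params[name]));
  }
  return pairs.length > 0 ? baseUrl + "?" + pairs.join("&") : baseUrl;
}

var strRequest = buildRequestUrl("http://www.itanda.com/demo/ajaxDemo1.php",
                                 { term: "hello world", lang: "en" });
console.log(strRequest);
```

The resulting string can then be passed to objReq.open("GET", strRequest, true) exactly as before.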

To complete the picture, here's the classic example played out at the back end in PHP:

<?php
echo "Greetings earthlings - we now call the shots" .
     " - forget bringing your leader" .
     " - he's got nothing worth saying!";
?>

And that's it folks ...

Wörterbuch Deutsch-Englisch (or German-English Dictionary)

The wirdz site now includes a German-English Dictionary with an English-German one to follow once a few gremlins in the source data are sorted out (well several hundred actually but mostly amenable to an automated fix).

Because of the growing number of dictionaries and related content, the left hand dictionary links menu is becoming unwieldy (schwerfällig or unhandlich according to the dictionary above).

Going forward, wirdz will include separate language variants, probably with an option selector at the top right of the screen. In the meantime, the German-English dictionary is not directly linked - so the reference above is its one chance to get spidered.