Simple word indexer

13 July 2021. Continued from the previous.

Hyper Estraier

2019

On the 12th of October 2019, in parallel with htdig, I found (pkg search hyperestraier) and then installed Hyperestraier. It worked very well, and made it easy to extract a word list from the database (estcmd words db, as per Fallabs, now redirecting to Dbmx). Such a list can be compared with an earlier one, to detect added and possibly misspelled words, and to see which ones have vanished due to recent corrections.
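The comparison of two word lists comes down to two set differences. Below is a minimal sketch in Python, assuming both lists have already been read into memory as plain words; the function name compare_word_lists and the sample words are made up for illustration, and nothing here depends on the exact output format of estcmd.

```python
def compare_word_lists(old_words, new_words):
    """Return (added, vanished): words only in the new list, words only in the old one."""
    old_set = set(old_words)
    new_set = set(new_words)
    added = sorted(new_set - old_set)      # new words: check these for misspellings
    vanished = sorted(old_set - new_set)   # words gone, for example after corrections
    return added, vanished


# Hypothetical sample data, not taken from the real index.
old = ["camoes", "correccao", "recieve"]
new = ["camoes", "correccao", "receive", "hyperestraier"]

added, vanished = compare_word_lists(old, new)
print("added:   ", added)
print("vanished:", vanished)
```

A misspelling that has been corrected shows up twice: the wrong form among the vanished words, the right form among the added ones.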

Languages from the 8859-1 cross (X), formed by Iceland to Albania and Finland to the Azores, were supported, but in a way I didn’t expect: Camões appeared in the index as camoes, and correcção and correcções became correccao and correccoes, with everything reduced to accentless ASCII. This reduction also took place (well, still takes place on my current website server, under FreeBSD 12.2) on search words entered, so you can find them both in full form and in reduced form, or in any mix, for example ‘correccões’ or ‘correcçoes’.
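A reduction of this kind can be imitated with Unicode decomposition. The following is only a sketch of the general technique, not the way Hyperestraier does it (I have not checked that): it lowercases, splits each letter into base letter plus combining marks, and drops the marks. The function name strip_accents is an invention for this example.

```python
import unicodedata


def strip_accents(word):
    """Lowercase a word and remove combining diacritical marks."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    # unicodedata.combining() is non-zero for combining marks such as tilde and cedilla.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ["Camões", "correcção", "correcções", "correccões", "correcçoes"]:
    print(word, "->", strip_accents(word))
```

Because the same function can be applied to both the indexed text and the search words, the full form, the reduced form and any mix of the two all end up as the same key. Letters that have no decomposition, like ø or ß, pass through unchanged in this sketch, so it does not always produce pure ASCII.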

However, for the Esperanto word ĉapitro, meaning chapter or section, which I then still coded as &ccirc;apitro, only the non-existent word ‘apitro’ appeared in the word list. Meanwhile I have converted my whole site from ISO 8859-1 to UTF-8, and ĉapitro IS correctly identified as a word, and found. Even before that conversion, ĥemiaj (chemical, a plural Esperanto adjective), in the underlying encoding ĥemiaj, with a separate diacritical, did work.
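One possible explanation for ‘apitro’, and this is a guess on my part rather than something verified in the Hyperestraier sources, is that an entity unknown to the indexer was simply dropped. The sketch below contrasts that with Python’s html.unescape, which follows the HTML5 entity list, where &ccirc; is defined.

```python
import html
import re

coded = "&ccirc;apitro"

# HTML5-aware decoding: the entity becomes the letter c with circumflex (U+0109).
decoded = html.unescape(coded)

# Assumed failure mode (hypothetical): an unrecognised entity is thrown away entirely.
stripped = re.sub(r"&[A-Za-z0-9]+;", "", coded)

print(decoded)
print(stripped)
```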

When skimming the word list, I noticed some Chinese or Japanese characters, and what may have been Arabic presentation forms, none of which I recognised as anything that I could have put on my site. So where did they come from?

For example, when looking for “Υιαγες ρισϲοσε”, risky travels, viages riscose, in pseudo-Greek which is really Interlingua, I was shown this context: “Υιαγες ρισϲοσε Andr頋uipers”.

Somehow the sequence of é (in Windows 1252 or ISO 8859-1), a space and a K (André Kuipers) was interpreted as a Chinese character. I found more such examples, and after diving deeper into it, down to the bit level, I could prove that it was caused by reading ISO 8859-1 as UTF-8, but interpreting its bit sequences in a way that was not what the designers of UTF-8 had in mind. Just one example: in the Dutch word “geïncasseerde”, the letters ïnc in Windows 1252 are: 1110-1111 (hex EF), 0110-1110 (hex 6E), 0110-0011 (hex 63).

Hyperestraier (version 1.4.13) concludes from the three 1-bits in 1110 that this must be the start of a three-byte UTF-8 sequence. But it cannot be, because then the second and third bytes would have to start with the bits 10. Here they start with 01, which means plain ASCII, even in UTF-8. Nevertheless, Hyperestraier takes the payload of the first byte (the bits after the first zero bit, 1111), and the last six bits of the other two bytes (10-1110 and 10-0011), and combines all that into 1111-10-1110-10-0011 or 1111-1011-1010-0011, FB-A3 in hexadecimal.

That is what the Unicode documentation calls “ARABIC LETTER RNOON MEDIAL FORM”, see UFB50.pdf. And that was what was displayed.
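The bit arithmetic can be replayed in a few lines of Python. This is a sketch of the behaviour as I observed it, not code taken from Hyperestraier: the function lax_three_byte_decode is my own reconstruction, and it deliberately omits the check that the second and third bytes start with the bits 10.

```python
import unicodedata


def lax_three_byte_decode(b0, b1, b2):
    """Combine three bytes as if they were a three-byte UTF-8 sequence,
    WITHOUT verifying that b1 and b2 are continuation bytes (10xxxxxx)."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


# The letters from "geïncasseerde", encoded as Windows 1252: EF 6E 63.
inc = "ïnc".encode("cp1252")
cp_inc = lax_three_byte_decode(*inc)
print(hex(cp_inc), unicodedata.name(chr(cp_inc)))

# The é, space and K from "André Kuipers": E9 20 4B.
ek = "é K".encode("cp1252")
cp_ek = lax_three_byte_decode(*ek)
print(hex(cp_ek), chr(cp_ek))

# A strict UTF-8 decoder refuses the same bytes.
try:
    inc.decode("utf-8")
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True
print("rejected by strict decoder:", strict_rejected)
```

The first case gives FBA3, the Arabic presentation form, and the second gives 980B, the Chinese character that showed up between “Andr” and “uipers” in the search result.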

I did look at the sources briefly, but didn’t manage to identify what went wrong where. I left it at that.

Hyperestraier was written by Mikio Hirabayashi, who is Japanese. Languages like Japanese and Chinese do not delimit words by spaces and punctuation, as do most languages written in alphabetic scripts like Latin, Cyrillic and Modern Greek. So it’s harder to know where words start and where they end. Hyperestraier uses a method called “N-gram analysis” to solve this. I do not quite understand it, but possibly this is connected with the bug described above.
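As far as I understand the general idea, an N-gram index does not try to find word boundaries at all, but stores every overlapping run of N characters. The sketch below shows only that generic idea, with a made-up function name char_ngrams; how Hyperestraier actually tokenises text, and which N it uses, is not something I have verified.

```python
def char_ngrams(text, n=2):
    """Return all overlapping character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Works without any knowledge of where words start or end.
print(char_ngrams("東京都", 2))
print(char_ngrams("chapter", 3))
```

A search phrase is cut up in the same way, and a document matches when it contains the resulting N-grams in sequence.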

2021

I am currently in the process of switching web hosting from FreeBSD to Ubuntu Server. Not because I dislike FreeBSD, on the contrary. I just want something different. Also replacing Apache by nginx, for the same non-reason. A more valid reason is that I no longer use Windows, but Linux Mint instead. Linux Mint is Ubuntu is Debian, they are all similar, so my new choice enables me to have everything structured and organised the same, on my personal laptop and on the web server.

Various sites report that Ubuntu 20.04 officially supports Hyperestraier. But that is not the case, or no longer. The tools apt and apt-get do not find it. So I installed the required zlib1g-dev, and downloaded, unpacked, configured and compiled the tarballs qdbm-1.8.78.tar.gz and hyperestraier-1.4.13.tar.gz myself.

There were suspicious compiler warnings, and already in the checking phase of the database library QDBM (also by Mikio Hirabayashi) a segmentation fault occurred. FreeBSD reportedly uses the same version, but there everything works fine. Strange, but typical of the kind of programming errors that tend to cause segmentation faults: whether they occur or not may depend on implementation details in operating system and compiler (FreeBSD uses clang, not gcc), and their absence doesn’t prove that the coding is correct.

Now let’s look into some details. In the next episode.