Simple word indexer (13)

19 September 2021. Continued from the previous.

Positioning to a character (1)

When a word is found using the index that has been built up, of course it isn’t enough to show just the word and the name of the file it occurs in. We also want to see context.

That should be easy. We know exactly where the word starts. What could be simpler than skipping a few tens of bytes back, and a few tens forward, and showing those bytes on either side of the highlighted search word?

Well, there’s a little problem there. Nowadays, a byte is not the same as a character anymore. A character in UTF-8 – and I expect that that is the only encoding to stick around, in the near and remote future and everything in between – can be one byte long, namely if it is plain ASCII. Otherwise, it can also be three bytes, roughly for India and China; or four bytes, for rather specialised, unusual and ancient characters; or two bytes for most everything else: accented Latin, and Greek, Russian, Hebrew, Arabic.

That is the short story; details are here. Two bytes are needed from scalar code point 0080 (8–11 bits to encode), three UTF-8 bytes from 0800 (12–16 bits), and four starting with Unicode scalar 10000 (17–20 bits, and a little beyond). That makes a total of 1114112 code points in theory (but some are reserved or have special purposes).
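These ranges can be recognised from the first byte of a sequence alone. A minimal sketch in C of that classification (utf8_seq_len is a name of my choosing, not something from the siworin code):

```c
/* Number of bytes in a UTF-8 sequence, judged from its first byte.
   Returns 0 for a continuation byte or an invalid lead byte. */
static int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;                     /* 0xxxxxxx: plain ASCII          */
    if ((b & 0xE0) == 0xC0)
        return 2;                     /* 110xxxxx: U+0080 .. U+07FF     */
    if ((b & 0xF0) == 0xE0)
        return 3;                     /* 1110xxxx: U+0800 .. U+FFFF     */
    if ((b & 0xF8) == 0xF0)
        return 4;                     /* 11110xxx: U+10000 .. U+10FFFF  */
    return 0;                         /* 10xxxxxx, or invalid lead byte */
}
```

So 0xc3, the first byte of an accented Latin letter, announces a two-byte sequence, while a byte such as 0xa1 can never begin a character at all.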

So if you just pick a byte position in a file, it could be the start of a character, but just as well the second, third or fourth byte of an incomplete sequence, which is therefore invalid and uninterpretable. Likewise, by ending the sample just somewhere, at some byte count, it could be that one or more required bytes of a valid sequence are missing. A real-life example, from my 2010 Portuguese article Terra Poderosa (Powerful Earth):

�gua pode fazer com que a água penetre no furo e expulse o petróleo.

This comes from a more complete “pressão da água pode fazer com que”, but only the second of the two bytes required to encode the letter ‘a with acute accent’ for Portuguese (Unicode: e1; UTF-8 encoding: c3 a1; all in hexadecimal) was read from the file.

The browser does not and cannot know what to do with hex a1 by itself, which is invalid and does not mean anything. So the browser instead displays the special Unicode character hex fffd, the so-called replacement character, �, described as: “used to replace an incoming character whose value is unknown or unrepresentable in Unicode”.
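Such stray bytes are easy to recognise: every UTF-8 continuation byte carries the fixed bit pattern 10xxxxxx in its top two bits. A one-line test (is_continuation is my name for it, not from the article):

```c
/* A UTF-8 continuation byte always has the bit pattern 10xxxxxx,
   so it can be detected without looking at any neighbouring bytes. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

The lone 0xa1 from the example above fails this way: it is a continuation byte with no lead byte before it, so no decoder can make sense of it.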

The proper way to solve this is to deal with characters, not bytes. Multibyte characters, wide characters. I wrote about that before. However, when I wrote a little test program to find out what happens when you fseek to an invalid character position, and read from there, I found there is a problem. A bug in the library. I will describe that in detail in the next episode.

And even before that, I had decided not to do it that way. Instead, I used a quick and not so dirty solution, which works with bytes and does not require characters.

KISS, keep it simple. It is implemented in the function ClearInvalid in siworin-displ.c. I could however also have used my own UTF tools, which can test whether a UTF-8 sequence is a valid character, but I didn’t want more dependencies. It can probably also be done using the multibyte functions in the C stdlib.

Update 19 September 2021: I kept it simple and made it even simpler: I now just clear any non-whitespace at the start and end of the buffer to be displayed. That removes any invalid UTF-8, and it ensures only whole words are shown.
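That even simpler approach could be sketched as follows (trim_to_whole_words is a hypothetical name; the real code is in siworin-displ.c): any partial word at either end of the buffer, which is also where any broken UTF-8 must sit, is overwritten with spaces.

```c
#include <ctype.h>
#include <string.h>

/* Blank out any non-whitespace at the start and end of the display
   buffer. A word (or UTF-8 sequence) cut off at either end disappears,
   so only whole words, and therefore only valid UTF-8, remain. */
static void trim_to_whole_words(char *buf)
{
    size_t i, len = strlen(buf);

    /* Clear leading non-whitespace: a word possibly cut off at the front. */
    for (i = 0; i < len && !isspace((unsigned char)buf[i]); i++)
        buf[i] = ' ';

    /* Clear trailing non-whitespace: a word possibly cut off at the end. */
    for (i = len; i > 0 && !isspace((unsigned char)buf[i - 1]); i--)
        buf[i - 1] = ' ';
}
```

Note that isspace never matches a UTF-8 continuation byte, so broken multibyte sequences at the edges are swept away along with the partial words they belong to.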


Next article