Simple word indexer (5)

Idea 2 July, text 18, 26 and 19 September 2021. Continued from the previous episode.

Algorithm

The algorithm follows from the information that is requested, and from the design criteria. What do we want to know? This: which words occur on the website, and where?

For this we need a list containing all the words. For each word, we need to know in which of the HTML files it occurs, and at what byte positions. The steps to create such a list are:

  1. Create a list of the HTML files to be examined. This can be done using find, possibly fine-tuned with a grep -v -f stoplist to exclude some files that are not of interest, or would cause clutter. (An example command follows this list.)

  2. Open all the files in that list, and peruse them looking for words. A word is a sequence of alphabetic UTF-8 characters. Dashes (-) and apostrophes (' or ’) may occur inside a word, though not at the start or the end, to allow for English words like ‘don’t’ and ‘isn’t’. Skip anything that is in an HTML tag, between < and >. Do include entities (later to be removed), so that the presence of optional hyphenation points, indicated by the entity &shy;, does not interrupt an alphabetic sequence that would otherwise be a single word. (A sketch of this step in C follows the list.)

    Write all such words to a file, one word per line, each followed by an indication of where that word can be found: the line number (zero-based) in the list of files from step 1, and a byte offset in the HTML file.

    (Changed 18 January 2023: HTML entities are not removed after all, except &shy;. Whether to also include digits in a ‘word’, at the start or only further on, is now configurable, with ‘no’ being the default.)

  3. Sort the list obtained from the previous step. Although the list contains words in UTF-8, for reasons explained in a later episode it is sorted using the C locale, that is, by the byte values of the individual bytes of the UTF-8 representation. (An example command follows the list.)

  4. If a word occurs in multiple files, and/or several times in any one file, there will be multiple occurrences (i.e., lines) in the list obtained so far. To save disk space, and for easier use of the information in a word search function, those lines are now combined, so that each unique word occurring anywhere in the HTML files occupies just one line in the combined file. That word is followed by a list of pairs, each pair consisting of the line number of the HTML file in the list from step 1, and the byte offset of the occurrence in that file. (A sketch in C follows the list.)

    The program that does the combining also creates a file containing fixed-length byte offsets, one per line, of the start of each line in the combined file. This comes in handy for the binary search in step 6.

    This step 4 also suppresses excessively long lists of occurrences of trivial words like the English ‘in’, ‘is’, ‘and’, ‘the’, etc., which would only take up space and aren’t interesting to search for on a website. Other languages have such words too. They are suppressed by imposing a hard limit on the number of occurrences.

  5. A separate step, to be performed any time after step 1, but at this point at the latest, also creates a file with byte offsets, this time for the list of HTML files, to simplify and speed up the search facilities. (A sketch follows the list.)

  6. Search facilities can be implemented which employ the four files created thus far. The algorithm is:

    1. Take one or more search words. For each of them, look up a matching line in the words file, using a binary search (sketched after this list). Then find HTML files where at least three of those words occur close enough together. (Minimum number of words and maximum distance are configurable.)
    2. Optionally, the matches for each of the words can also be shown separately.
    3. Or use egrep to find one or more matching keyword lines based on a regular expression.
    4. For each of the lines thus obtained, walk the list of locations, each encoded as a line index into the list of files and a byte offset in the HTML file. For each occurrence, fetch some context for display.
  7. Optionally, the words alone can be extracted from the list of words and locations, and compared against the words from the previous run, to see which words now occur but didn’t before, and vice versa; in other words, which words were added and which were deleted. (An example command follows the list.)
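
To illustrate step 1: a command along these lines produces the list of files; the names htmlfiles.list and stoplist are hypothetical:

    find . -name '*.htm*' | grep -v -f stoplist > htmlfiles.list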
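
The word separation of step 2 might look roughly like the following minimal sketch (the real implementation is siworin-wordsep.c). The sketch simplifies matters: it reads a single file from standard input, treats every non-ASCII byte as alphabetic instead of properly classifying UTF-8 characters, handles only the ASCII apostrophe, and skips the entity handling; the exact output format, with the three fields separated by spaces, is also an assumption.

    /* Sketch of step 2: emit "word fileindex offset" lines for one
       file read from stdin. */
    #include <stdio.h>
    #include <ctype.h>

    /* Treat ASCII letters and all non-ASCII bytes as word characters,
       a crude stand-in for "alphabetic UTF-8 character". */
    static int is_word_byte(int c)
    {
        return isalpha(c) || c >= 0x80;
    }

    static void emit(char *word, size_t len, int fileindex, long start)
    {
        /* trim dashes and apostrophes left at the end of the word */
        while (len > 0 && (word[len-1] == '-' || word[len-1] == '\''))
            len--;
        if (len > 0) {
            word[len] = '\0';
            printf("%s %d %ld\n", word, fileindex, start);
        }
    }

    int main(void)
    {
        char word[256];
        size_t len = 0;
        long off = 0, start = 0;
        int c, in_tag = 0;
        int fileindex = 0;  /* in real use: line number in the file list */

        while ((c = getchar()) != EOF) {
            if (c == '<') {
                in_tag = 1;
            } else if (c == '>') {
                in_tag = 0;
            } else if (!in_tag && (is_word_byte(c) ||
                       (len > 0 && (c == '-' || c == '\'')))) {
                if (len == 0)
                    start = off;        /* where the word began */
                if (len < sizeof word - 1)
                    word[len++] = (char)c;
            } else if (len > 0) {
                emit(word, len, fileindex, start);
                len = 0;
            }
            off++;
        }
        if (len > 0)                    /* word pending at end of file */
            emit(word, len, fileindex, start);
        return 0;
    }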
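
For the sort of step 3, the C locale can be forced regardless of the environment, for example (file names again hypothetical):

    LC_ALL=C sort words.unsorted > words.sorted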
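
The combining of step 4 could be sketched as follows (the real implementation is siworin-makelst.c, with siworin-files.c and siworin-config.c). The input format matches the step 2 sketch; the output file name siworin.offsets, the eight-digit offset records, and the way the hard limit is applied, by simply truncating overly long lists, are all assumptions.

    /* Sketch of step 4: combine sorted "word fileindex offset" lines
       from stdin into one line per unique word on stdout, writing
       fixed-length byte offsets of those lines to a second file. */
    #include <stdio.h>
    #include <string.h>

    #define MAXOCC 100  /* hard limit on occurrences (value assumed) */

    int main(void)
    {
        char line[1024], word[512], prev[512] = "";
        long fileidx, offset;
        long written = 0;   /* bytes written to stdout so far */
        int count = 0;
        FILE *offs = fopen("siworin.offsets", "w");

        if (offs == NULL) {
            perror("siworin.offsets");
            return 1;
        }
        while (fgets(line, sizeof line, stdin) != NULL) {
            if (sscanf(line, "%511s %ld %ld", word, &fileidx, &offset) != 3)
                continue;
            if (strcmp(word, prev) != 0) {      /* a new unique word */
                if (prev[0] != '\0')
                    written += printf("\n");    /* finish previous line */
                fprintf(offs, "%08ld\n", written);  /* offset of new line */
                written += printf("%s", word);
                strcpy(prev, word);
                count = 0;
            }
            if (++count <= MAXOCC)  /* crude form of the hard limit */
                written += printf(" %ld %ld", fileidx, offset);
        }
        if (prev[0] != '\0')
            printf("\n");
        fclose(offs);
        return 0;
    }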
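
Step 5 applies the same fixed-width offset technique to the file list itself (the real program is siworin-pathoffs.c; file names assumed):

    /* Sketch of step 5: write the byte offset of the start of each
       line in the list of HTML files, fixed-length, for direct
       seeking by line number. */
    #include <stdio.h>

    int main(void)
    {
        int c, at_line_start = 1;
        long off = 0;
        FILE *in = fopen("htmlfiles.list", "r");
        FILE *out = fopen("htmlfiles.offsets", "w");

        if (in == NULL || out == NULL)
            return 1;
        while ((c = getc(in)) != EOF) {
            if (at_line_start) {
                fprintf(out, "%08ld\n", off);  /* this line starts here */
                at_line_start = 0;
            }
            if (c == '\n')
                at_line_start = 1;
            off++;
        }
        fclose(out);
        return 0;
    }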
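
The binary search of step 6 can then work from the two files written in step 4. This sketch looks up a single word given as a command-line argument, assuming the combined list was saved as siworin.words; the real code lives in siworin-srchind.c and the other sources listed below, and does considerably more.

    /* Sketch of step 6: binary search for one word in the combined
       list, via the fixed-length offsets file. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define OFFREC 9  /* eight digits plus newline, as in the step 4 sketch */

    int main(int argc, char **argv)
    {
        FILE *offs = fopen("siworin.offsets", "r");
        FILE *words = fopen("siworin.words", "r");
        char rec[OFFREC + 1], line[4096];
        long lo = 0, hi, mid;
        size_t klen;
        int cmp;

        if (argc != 2 || offs == NULL || words == NULL)
            return 1;
        klen = strlen(argv[1]);
        fseek(offs, 0L, SEEK_END);
        hi = ftell(offs) / OFFREC - 1;  /* index of the last record */
        while (lo <= hi) {
            mid = (lo + hi) / 2;
            fseek(offs, mid * OFFREC, SEEK_SET);
            if (fread(rec, 1, OFFREC, offs) != OFFREC)
                return 1;
            rec[OFFREC] = '\0';
            fseek(words, atol(rec), SEEK_SET);  /* jump to that line */
            if (fgets(line, sizeof line, words) == NULL)
                return 1;
            cmp = strncmp(argv[1], line, klen);
            if (cmp == 0 && line[klen] != ' ')
                cmp = -1;  /* stored word is longer; keyword sorts first */
            if (cmp == 0) {
                fputs(line, stdout);  /* the word plus all its locations */
                return 0;
            }
            if (cmp < 0)
                hi = mid - 1;
            else
                lo = mid + 1;
        }
        fprintf(stderr, "%s: not found\n", argv[1]);
        return 1;
    }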
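
And step 7 can be as simple as stripping off the locations and comparing the result with that of the previous run, for example (file names hypothetical):

    sed 's/ .*//' siworin.words | diff words.previous -

Lines marked > in the output were added since the previous run; lines marked < were deleted.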

Although the algorithm was designed with HTML in mind, it probably also works with XHTML, with XML in general, and even (and this case I have tested) with MBX mailboxes such as those used by the e-mail program Eudora.

C sources and standard tools used to implement each step are:

  1. find, grep
  2. siworin-wordsep.c
  3. sort
  4. siworin-makelst.c, siworin-files.c, siworin-config.c
  5. siworin-pathoffs.c
  6. siworin-cgi.c, cgitools.c, siworin-srchind.c, siworin-match.c, bsearch, egrep, siworin-displ.c, siworin-files.c, siworin-config.c
  7. sed, diff

Now let’s have a look at the performance of this algorithm in a real-life situation.