Automatic site index (script – update)
Meanwhile
The script I described here (over a year ago! time flies when you’re having fun) has meanwhile changed a bit. Also, I now run it only once every 24 hours, at about a quarter to three at night.
Here’s the new version:
Script
cd /usr/local/www/apache24/htdocs/tools/index/
mv words wordswas
LC_COLLATE="en_US.ISO8859-1" ; export LC_COLLATE
find ../.. -type f -a \
  \( -name '*.htm' -o -name '*.html' -o -name '*.stm' -o -name '*.inc' \) |  # Note 1
  grep -v -f stoppath |
  ./wordsep |
  grep -v -x -f stoplist |
  sort > wordsraw                        # Note 2
# Addition 26 October 2012: for frequency list
# Not necessary to do this everyday!
#uniq -c wordsraw | sort -nr > wordfreq  # Note 3
uniq wordsraw > words
rm wordsraw
./genindex words
rm ../../index/index-?.htm
mv index*.htm ../../index
diff wordswas words
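The commented-out frequency-list step in the script can be tried in isolation. What follows is only a minimal sketch on made-up data: the temporary directory, the sample words, and the final step of picking the top two entries as stop-word candidates are my assumptions, not part of the original script.

```shell
#!/bin/sh
# Made-up, already sorted one-word-per-line input standing in for 'wordsraw'.
tmp=$(mktemp -d)
printf '%s\n' de de de en en script > "$tmp/wordsraw"

# uniq -c prefixes each distinct line with its number of occurrences;
# sort -nr then orders the lines numerically, highest count first.
uniq -c "$tmp/wordsraw" | sort -nr > "$tmp/wordfreq"

# Normalise the padding that uniq -c adds, so the result is easy to read.
freq=$(awk '{print $1, $2}' "$tmp/wordfreq")

# Hypothetical follow-up: take the two most frequent words as
# candidates for a stop list.
top=$(head -n 2 "$tmp/wordfreq" | awk '{print $2}')

printf 'frequency list:\n%s\n' "$freq"
printf 'candidates: %s\n' "$(echo $top)"

rm -r "$tmp"
```

Note that uniq -c only counts adjacent duplicates, which is why it has to read the sorted file.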
Notes
The former file ‘stoplist’ has been renamed to ‘stoppath’, because two lines further on, I now use a different file called ‘stoplist’. This new file contains 77 very common words (largely from Dutch and English, the languages I write most of my web articles in).
Ignoring such common words is of course quite a usual technique, but I hadn’t done it until recently (26 October 2012). The measure reduced the total number of words from the site (unsorted and including all duplicates) from 637710 to 386740!
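The stop-word filter from the script can be tried on a handful of words. The sketch below uses made-up sample data (the words, the two-entry stop list and the temporary directory are assumptions for illustration) and compares grep -v -f without a further option, with -w, and with -x as used in the script.

```shell
#!/bin/sh
# Made-up sample data; the real 'words' and 'stoplist' files are much larger.
tmp=$(mktemp -d)
printf '%s\n' de deal deactivate al al-dente script > "$tmp/words"
printf '%s\n' de al > "$tmp/stoplist"

# No extra option: every line that contains 'de' or 'al' anywhere is dropped.
plain=$(grep -v -f "$tmp/stoplist" "$tmp/words")

# -w: only whole-word matches count, but 'al-dente' is still dropped,
# because the hyphen ends the word 'al'.
whole_word=$(grep -v -w -f "$tmp/stoplist" "$tmp/words")

# -x: only whole-line matches count, so just the stop words themselves go.
whole_line=$(grep -v -x -f "$tmp/stoplist" "$tmp/words")

printf 'plain:      %s\n' "$(echo $plain)"
printf 'with -w:    %s\n' "$(echo $whole_word)"
printf 'with -x:    %s\n' "$(echo $whole_line)"

rm -r "$tmp"
```

With this sample, only the -x variant keeps every word that is not itself a stop word.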
The option -x for grep is necessary to prevent words like ‘deal’ and ‘deactivate’ from being filtered out because of the common Dutch and Portuguese word ‘de’. Only whole-line matches (between the one-word-per-line files) should be honoured for filtering out (option -v), and that is what -x does. (At least it does under FreeBSD 8.2 and 8.3; I don’t know about other Unix flavours.)
I first tried -w (‘match whole words only’), but that wasn’t good enough, because then things like Al-Cercthe no longer appeared in the index, because of the common Dutch word ‘al’ (meaning: yet, already).
The above-mentioned stoplist, with frequently occurring function words, I obtained by generating a frequency list from the file (wordsraw) that has all the words in it. I commented out that step later, because I don’t need a fresh frequency list every day.
I could have further optimised the script by doing that uniq wordsraw > words here, in an earlier step, without saving wordsraw to disk. Pipes are normally implemented using a 2 kilobyte buffer in core, so they are much more efficient than intermediate files. The combined script line (in the existing series of piped commands), instead of:
(sort > wordsraw; uniq wordsraw > words)
would then be:
sort | uniq > words
or better still:
sort -u > words
But because the script runs only once every 24 hours, I left it as it is.
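That the three variants give the same word list can be checked on a small sample. This is only a sketch under assumptions: the input words, the file names words_file, words_pipe and words_u, and the temporary directory are made up for the comparison.

```shell
#!/bin/sh
# Made-up unsorted input with duplicates, standing in for the output
# of the filtering steps in the script.
tmp=$(mktemp -d)
printf '%s\n' script de index de script al > "$tmp/input"

# Variant 1, as in the script: sort to an intermediate file, then uniq.
sort "$tmp/input" > "$tmp/wordsraw"
uniq "$tmp/wordsraw" > "$tmp/words_file"

# Variant 2: the same two commands joined by a pipe, no intermediate file.
sort "$tmp/input" | uniq > "$tmp/words_pipe"

# Variant 3: let sort drop the duplicates itself.
sort -u "$tmp/input" > "$tmp/words_u"

via_file=$(cat "$tmp/words_file")
via_pipe=$(cat "$tmp/words_pipe")
via_u=$(cat "$tmp/words_u")

# cmp -s is silent and only sets the exit status.
if cmp -s "$tmp/words_file" "$tmp/words_pipe" &&
   cmp -s "$tmp/words_file" "$tmp/words_u"; then
    echo "all three variants produce the same list"
fi

rm -r "$tmp"
```

One thing to keep in mind: the commented-out uniq -c step reads wordsraw, so the intermediate file would still be needed whenever the frequency list is regenerated.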