Rick wrote:
Hmmm, I wonder why it sometimes misses files. For example, it finds the German Rheinland-Pfalz history page, but not the English version:
The program merely looks for associated filenames to make the file pairings. It fails here because I didn't know about or anticipate filenames with language codes used as an infix:

/gene/reg/RHE-PFA/rhein-p-his.html    (E)
/gene/reg/RHE-PFA/rhein-p-d-his.html  (D)
                         ^^ -d in the middle of the filename

The E (English) file isn't found correctly because its name isn't obviously derivable (at least to the program) from the D filename. In fact, the E file isn't found at all because the D file doesn't link to it, and neither does the parent of the D file. (The crawler I wrote only parses one file from a language multiplet; in this case, the German-language parent was parsed, while the English one was not.)

I could make the crawler look for infixed language tags in the filenames. This gives it more chances for false associations, however, so I would prefer that the German file name be changed to a simpler

/gene/reg/RHE-PFA/rhein-p-his-d.html  (D)

Another case is (mea culpa)

/gene/reg/NSAC/schaumburg-lippe_adel.html      (D)
/gene/reg/NSAC/schaumburg-lippe_nobility.html  (E)

and a simple file rename (with accompanying link updates) will fix this too.

A worse problem, in my opinion, is presented by the sometimes poorly-chosen page titles. This can only be cured by careful attention by the page authors.

-- 
=Jim Eggert  EggertJ@LL.mit.edu
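
For illustration only, here is a minimal Python sketch of suffix-based filename pairing of the kind described above. It is not the actual crawler: the function names, the KNOWN_TAGS table, and the rule that an untagged file counts as English are all assumptions made for this example. It shows why a "-d" suffix pairs cleanly while a "-d" infix is missed.

```python
import posixpath
from collections import defaultdict

# Assumption: only the German tag "d" is known, and it is appended to the
# filename stem as "-d". Files with no tag are assumed to be English.
KNOWN_TAGS = ("d",)


def split_language(path):
    """Return (base_path, language) for one file path."""
    stem, ext = posixpath.splitext(path)
    for tag in KNOWN_TAGS:
        if stem.endswith("-" + tag):
            # Strip the "-d" suffix to recover the shared base name.
            return stem[: -len(tag) - 1] + ext, tag.upper()
    return path, "E"


def pair_by_suffix(paths):
    """Group paths into language multiplets keyed by their base path."""
    groups = defaultdict(dict)
    for path in paths:
        base, lang = split_language(path)
        groups[base][lang] = path
    return dict(groups)


paths = [
    "/gene/reg/RHE-PFA/rhein-p-his.html",
    "/gene/reg/RHE-PFA/rhein-p-his-d.html",   # suffix form: pairs correctly
    "/gene/reg/RHE-PFA/rhein-p-d-his.html",   # infix form: not recognized
]
groups = pair_by_suffix(paths)
```

With these inputs the suffixed German file is grouped with the English file, while the infixed name ends up alone in its own group and is mistaken for an untagged (English) page. Matching only a known tag at the end of the stem keeps false associations low, since a name such as rhein-p.html is not misread as carrying a language tag.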