Re: [webpages-l] sitemap_m.html

1 Sep 1999

Rick wrote:
> ...
> Hmmm, I wonder why it sometimes misses files.  For example, it finds
> the German Rheinland-Pfalz history page, but not the English
> version:

The program merely looks for associated filenames to make the file
pairings.  It fails here because I didn't know about or anticipate
filenames with language codes used as an infix:

  /gene/reg/RHE-PFA/rhein-p-his.html   (E)
  /gene/reg/RHE-PFA/rhein-p-d-his.html (D)
                           ^^ -d in the middle of the filename

The E (English) file isn't found correctly because its name isn't
obviously derivable (at least to the program) from the D filename.  In
fact, the E file isn't found at all because the D file doesn't link to
it, and neither does the parent of the D file.  (The crawler I wrote
only parses one file from a language multiplet; in this case, the
German-language parent was parsed, while the English one was not.)
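
A minimal sketch of this kind of name-based pairing may make the failure
concrete.  It is not the actual crawler code; it assumes a convention in
which the German file carries a trailing "-d" on its base name and the
English file carries no tag.  The function name english_name_for and the
constant GERMAN_SUFFIX are made up for illustration.

```python
import posixpath

# Assumed convention (not confirmed by the original program): a German page
# has a trailing "-d" on its base name, and the English page has no tag.
GERMAN_SUFFIX = "-d"


def english_name_for(german_path):
    """Return the expected English path for a suffix-tagged German path.

    Returns None when the base name does not end in the language tag,
    which is the case that defeats a purely suffix-based pairing.
    """
    directory, filename = posixpath.split(german_path)
    stem, ext = posixpath.splitext(filename)
    if not stem.endswith(GERMAN_SUFFIX):
        return None  # no trailing tag, so no partner name can be derived
    return posixpath.join(directory, stem[:-len(GERMAN_SUFFIX)] + ext)


# A suffix-tagged name pairs cleanly with its untagged partner.
print(english_name_for("/gene/reg/RHE-PFA/rhein-p-his-d.html"))
# The infixed name from the example above yields no partner at all.
print(english_name_for("/gene/reg/RHE-PFA/rhein-p-d-his.html"))
```

Under these assumptions the first call yields
/gene/reg/RHE-PFA/rhein-p-his.html and the second yields None, because
"rhein-p-d-his" does not end in the tag.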

I could make the crawler look for infixed language tags in the
filenames.  That would give it more chances for false associations,
however, so I would prefer that the German file be renamed to the
simpler
  /gene/reg/RHE-PFA/rhein-p-his-d.html (D)
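
The trade-off can be illustrated with a rough sketch of infix matching.
This is an assumption about how such a search might look, not the
crawler's real logic: it treats any hyphen-delimited "d" token after the
first word as a possible language tag.  The function name
infix_candidates and the path /example/plan-d-notes.html are
hypothetical.

```python
import posixpath


def infix_candidates(path, tag="d"):
    """List names formed by deleting one hyphen-delimited tag token.

    Every candidate is only a guess at the untagged partner name; the
    caller would still have to check whether such a file exists.
    """
    directory, filename = posixpath.split(path)
    stem, ext = posixpath.splitext(filename)
    tokens = stem.split("-")
    candidates = []
    for i, token in enumerate(tokens):
        # Skip position 0 so the name never loses its leading word.
        if i > 0 and token == tag:
            remaining = tokens[:i] + tokens[i + 1:]
            name = "-".join(remaining) + ext
            candidates.append(posixpath.join(directory, name))
    return candidates


# The infixed German name now maps to the English name.
print(infix_candidates("/gene/reg/RHE-PFA/rhein-p-d-his.html"))
# A hypothetical page whose "d" is not a language tag is matched as well.
print(infix_candidates("/example/plan-d-notes.html"))
```

The second call proposes /example/plan-notes.html as a partner even
though the "d" there need not be a language code, which is the kind of
false association the rename avoids.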

Another case is (mea culpa)
  /gene/reg/NSAC/schaumburg-lippe_adel.html     (D)
  /gene/reg/NSAC/schaumburg-lippe_nobility.html (E)
and a simple file rename (with accompanying link updates) will fix
this too.

A worse problem, in my opinion, is the sometimes poorly chosen page
titles.  That can be cured only by careful attention from the page
authors.

-- 
=Jim Eggert   EggertJ@LL.mit.edu