Re: [webpages-l] sitemap_m.html
Hmmm, I wonder why it sometimes misses files. For example, it finds the German Rheinland-Pfalz history page, but not the English version:

Deutsche Genealogie: Rheinland-Pfalz (D) / German Genealogy: Rheinland-Pfalz/Rhineland-Palatinate (E)
Deutsche Genealogie: Rheinland-Pfalz, die Geschichte
English history would presumably go here <<<<<<<
frenchzone.jpg
I'm not the type that would use this kind of resource to get to know a site, but I suppose there are some that are.

Rick

At 07:24 PM 8/31/99 -0400, Jim Eggert wrote:
I've uploaded an improved site map to http://www2.genealogy.net/gene/tmp/up/pages/sitemap_m.html
This might be a complete catalog of all the clickable links on our server. (I haven't checked.) This includes clickable images, pdf files, and text files.
The program does a better job with the hierarchy because it requires that subordinate members be in the same or subordinate directories. It also knows a lot about exceptions:
o Files that should not be parsed, like team.html
o Files that look like language-pairs, but aren't
o Files that shouldn't be included at all (so far none)
o Files that should be pinned in the hierarchy, independent of the crawling mechanism (so far none)
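For illustration only, the same-or-subordinate-directory rule described above might be sketched in Python roughly as follows. This is not the actual program; the function name is_subordinate and the parent path /gene/reg/index.html are assumptions made up for the example.

```python
import posixpath


def is_subordinate(parent_path, child_path):
    """Return True if child_path lies in the parent's directory or below it.

    Hypothetical helper: both arguments are server paths such as
    "/gene/reg/index.html" (an assumed example, not a known file).
    """
    parent_dir = posixpath.dirname(posixpath.normpath(parent_path))
    child_dir = posixpath.dirname(posixpath.normpath(child_path))
    # Compare with a trailing slash so "/gene/reg" does not
    # accidentally match a sibling such as "/gene/regional".
    parent_prefix = parent_dir.rstrip("/") + "/"
    return (child_dir.rstrip("/") + "/").startswith(parent_prefix)
```

A link from a page to another page in a directory above it, or off to the side, would then be listed as a link but not treated as a child in the hierarchy.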
Titles and hierarchy are still derived by crawling a local copy of the site. So if you don't like the titles, change them in the site!
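As a rough sketch of what "deriving titles by crawling" can mean, the following Python reads one page's <title> and its clickable links using the standard html.parser module. The class and function names are hypothetical, and the real crawler may well work differently.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collect the <title> text and all href targets of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("a", "area"):
            # <area> covers clickable image maps; anchors without
            # an href (named anchors) are skipped.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scan_page(html_text):
    """Return (title, links) for the given HTML source."""
    scanner = PageScanner()
    scanner.feed(html_text)
    scanner.close()
    return scanner.title.strip(), scanner.links
```

Whatever the page author put in <title> is what ends up in the map, which is why fixing a bad title means editing the page itself.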
Feedback is encouraged. So far I've had almost none.
-- =Jim Eggert EggertJ@LL.mit.edu
Rick wrote:
Hmmm, I wonder why it sometimes misses files. For example, it finds the German Rheinland-Pfalz history page, but not the English version:
The program merely looks for associated filenames to make the file pairings. It fails here because I didn't know about or anticipate filenames with language codes used as an infix:

    /gene/reg/RHE-PFA/rhein-p-his.html    (E)
    /gene/reg/RHE-PFA/rhein-p-d-his.html  (D)
                             ^^ -d in the middle of the filename

The E (English) file isn't found correctly because its name isn't obviously derivable (at least to the program) from the D filename. In fact, the E file isn't found at all because the D file doesn't link to it, and neither does the parent of the D file. (The crawler I wrote only parses one file from a language multiplet; in this case, the German-language parent was parsed, while the English one was not.)

I could make the crawler look for infixed language tags in the filenames. This gives it more chances for false associations, however, so I would prefer that the German file name be changed to a simpler

    /gene/reg/RHE-PFA/rhein-p-his-d.html  (D)

Another case is (mea culpa)

    /gene/reg/NSAC/schaumburg-lippe_adel.html      (D)
    /gene/reg/NSAC/schaumburg-lippe_nobility.html  (E)

and a simple file rename (with accompanying link updates) will fix this too.

A worse problem, in my opinion, is presented by the sometimes poorly-chosen page titles. This can only be cured by careful attention by the page authors.

-- =Jim Eggert EggertJ@LL.mit.edu
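To make the suffix-versus-infix trade-off concrete, here is a minimal Python sketch of filename-based pairing. It assumes the convention implied by the proposed rename (the German file is the English name plus "-d" before ".html"); the function english_candidates and its allow_infix switch are hypothetical and not taken from the actual crawler.

```python
import re

# Suffix convention (assumed): name.html (E) pairs with name-d.html (D).
SUFFIX_D = re.compile(r"^(?P<stem>.+)-d\.html$")
# Looser infix variant: accept "-d-" inside the name as a language tag.
INFIX_D = re.compile(r"^(?P<head>.+)-d-(?P<tail>.+)\.html$")


def english_candidates(german_name, allow_infix=False):
    """Guess English filenames that could pair with a German filename."""
    candidates = []
    match = SUFFIX_D.match(german_name)
    if match:
        candidates.append(match.group("stem") + ".html")
    if allow_infix:
        match = INFIX_D.match(german_name)
        if match:
            # Drop the "-d" and rejoin the two halves with one hyphen.
            candidates.append(
                f"{match.group('head')}-{match.group('tail')}.html"
            )
    return candidates
```

With allow_infix left off, rhein-p-d-his.html yields no candidate at all, which matches the miss reported above; with it on, the English name is found, but so would a pairing for any unrelated file that merely contains "-d-" (a made-up name like anna-d-meyer.html, say). Names such as schaumburg-lippe_adel.html match neither pattern, so only a rename helps there.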
participants (2)
- Jim Eggert
- Richard Heli