A lot of work is done by the Spider, so the job is split into sections to make it easier to rerun parts if problems occur or reconfiguration is required. The big section, which looks at each page, uses checkpoint restarting, so that if anything crashes it can carry on from just before the problem occurred.
This stage can often take 1-2 days to run, depending on network load and the number of links to hosts that no longer run a web server.
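As an illustration of checkpoint restarting, here is a minimal Python sketch. It is not the Spider's actual code: the function name `process_pages`, the `handle_page` callback and the JSON checkpoint file are all assumptions made for the example. Progress is saved only after a page has been handled successfully, so a rerun resumes at the page that was in progress when the crash happened.

```python
import json
import os


def process_pages(pages, handle_page, checkpoint_path):
    """Handle each page in order, resuming from a saved checkpoint if one exists.

    Returns the number of pages handled during this call.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]

    for index in range(done, len(pages)):
        handle_page(pages[index])  # may raise; the checkpoint is then left as it was
        # Record progress atomically, only after the page succeeded.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"done": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)

    return len(pages) - done
```

Calling `process_pages` again with the same page list and checkpoint file after a crash skips the pages already recorded as done.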
In order to speed things up, and for backup purposes, the Spider keeps a copy of every GENUKI page that it analyses. If a page has not changed since the last run, it uses the locally cached copy rather than fetching it again, which gives a significant performance advantage. Unfortunately, some web servers do not return the last modified date of pages, so we cannot take advantage of this for them. Pages that use server side includes are another case where the last modified date is not returned.
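The caching decision can be sketched roughly as follows. This is an assumed illustration in Python, not the Spider's own logic: the function name `needs_fetch` is hypothetical, and it simply compares the Last-Modified value stored with the cached copy against the one the server reports now. When the server reports no date, as with server side includes, the page has to be fetched again.

```python
from email.utils import parsedate_to_datetime


def needs_fetch(cached_last_modified, server_last_modified):
    """Return True if the page must be downloaded again.

    Both arguments are HTTP Last-Modified header strings, or None when unknown.
    """
    if cached_last_modified is None or server_last_modified is None:
        # No date to compare (e.g. the server omits Last-Modified): always fetch.
        return True
    try:
        cached = parsedate_to_datetime(cached_last_modified)
        current = parsedate_to_datetime(server_last_modified)
    except (TypeError, ValueError):
        # Unparseable date: be safe and fetch.
        return True
    return cached != current
```

Treating any difference in date as a change, rather than only a later date, is a deliberately cautious choice for this sketch.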
This part of the job also checks each GENUKI page to see if it contains our logo, and records all the pages that contain a link to our Nearby Places script. To help validate e-mail addresses, this section also checks the hostname part of every mailto: link. It looks in the DNS to see if there is an A or MX record for the hostname part of the e-mail address, and flags the address as an error if neither exists. Mailers use the A and MX records in the DNS to determine where to send e-mail.
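The mailto: check described above might look something like this in Python. It is only a sketch under stated assumptions: the names `MailtoCollector`, `mailto_hosts` and `bad_mail_hosts` are invented for the example, and the DNS lookup itself is passed in as a hypothetical `has_a_or_mx` function, because the Python standard library has no MX lookup of its own.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect the hostname part of every mailto: link in a page."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop the scheme and any ?subject=... part.
                addresses = value[len("mailto:"):].split("?", 1)[0]
                for address in addresses.split(","):
                    _local, sep, host = address.strip().rpartition("@")
                    if sep and host:
                        self.hosts.add(host.lower())


def mailto_hosts(html):
    collector = MailtoCollector()
    collector.feed(html)
    return collector.hosts


def bad_mail_hosts(html, has_a_or_mx):
    """Return the hostnames for which has_a_or_mx(host) is false, sorted.

    has_a_or_mx is a caller-supplied DNS check (an assumption of this sketch).
    """
    return sorted(host for host in mailto_hosts(html) if not has_a_or_mx(host))
```

Passing the lookup in as a function keeps the link extraction separate from the DNS query, so either part can be changed or tested on its own.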
The report program does not run automatically after the Spider, as it actually runs under a different userid. When the Spider runs over the weekend I check at intervals to see if it has finished, and try to do a single run of the reporter via a dialup terminal session. This means that the majority of the information can be downloaded at the weekend, when downloads are cheaper. The reporter takes about ten minutes to run, but after every Spider run I usually find some changes are needed to the config file, as some sections may have moved, and new items frequently appear under Unable to determine owner.
This tidy up phase is always done the next time that I am at the university, usually the next Monday.
The syntax check typically takes a couple of hours to run and is therefore not performed via a dialup session. If pages for a particular county are held at genuki.org.uk, it can be run at other times for that particular county, but again using the list of page names that the previous Report run found.
The syntax check is performed county by county using the list of county pages produced from the report on the full Spider run. It is done against disc copies of the pages, which for all pages held at Manchester is the live page. For everything else it is done against the disc copy of the page stored by the full Spider run. I can therefore rerun the syntax check for counties stored here if you ask me to, but it will only be run against the list of pages found in the full Spider run. N.B. The Spider report page uses server side includes to show you the number of HTML syntax errors, which in effect means they can change after the report page is originally written without the report page itself changing. It also means you may need to use Shift+Reload in your browser to make the latest updates visible.
The recheck uses the list of pages that contained link errors (not syntax errors) from the last run, and checks just them for link errors. It does not check any GENUKI pages linked from them, and will therefore not check any brand new pages. The time the run takes depends on the number of pages containing errors, and experience so far shows that it takes a day to run. Typically this is because on quite a few really bad links we have to wait for a timeout.
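The scope of a recheck can be illustrated with a short Python sketch. The data layout is an assumption, since the real badlinks file format is not described here: `badlinks` is taken to be a mapping from page name to the bad links found on it, while `page_links` and `link_is_ok` are hypothetical helpers supplied by the caller.

```python
def recheck(badlinks, page_links, link_is_ok):
    """Re-test only the pages that had link errors in the last run.

    badlinks:   dict mapping page name -> list of bad URLs from the last run
    page_links: function returning the links currently on a page
    link_is_ok: function returning True if a URL can be fetched
    """
    new_badlinks = {}
    for page in badlinks:  # pages without earlier errors are never visited
        still_bad = [url for url in page_links(page) if not link_is_ok(url)]
        if still_bad:
            new_badlinks[page] = still_bad
    return new_badlinks
```

Because the loop runs only over pages already in `badlinks`, a brand new page is not examined until the next full run.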
Once the Re-check run completes, the Reporter is run again, but with only the new badlinks file as changed data, so only this part of the report will be different. I will also usually run the Syntax checker again at this point, but of course the only changed data it has at this point is any page stored at genuki.org.uk and any of the other pages that were previously reported as containing link errors.
The only item on the report page that changes after a recheck run is the number of link errors; everything else refers to the last full run.