
The GENUKI Spider

We keep track of errors on the GENUKI pages by running a program called the GENUKI Spider, which analyses every page and tests all the links on it to look for errors. As each page is fetched and analysed, a few other checks are made at the same time.

The Spider does a lot of work, so the job is split into sections to make it easier to rerun parts when problems occur or reconfiguration is required. The big section, which looks at each page, uses checkpoint restarting, so that if anything crashes it can carry on from just before the problem occurred.
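Checkpoint restarting can be sketched as follows. This is only an illustration of the general technique, not the Spider's own code: the function name, the JSON checkpoint file and the `process_page` callback (which is assumed to return the links found on a page) are all invented for the example.

```python
import json
import os
from collections import deque


def crawl_with_checkpoint(start_urls, process_page, checkpoint_path, every=100):
    """Work through a queue of pages, saving progress so a crashed run can resume."""
    # Resume from an earlier checkpoint if one exists, otherwise start afresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            state = json.load(fh)
        pending = deque(state["pending"])
        done = set(state["done"])
    else:
        pending = deque(start_urls)
        done = set()

    def save():
        # Write to a temporary file and rename it, so that a crash during
        # the write cannot leave a half-written checkpoint behind.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump({"pending": list(pending), "done": sorted(done)}, fh)
        os.replace(tmp, checkpoint_path)

    processed = 0
    while pending:
        url = pending[0]          # stays queued until it has been processed
        if url not in done:
            for link in process_page(url):
                if link not in done and link not in pending:
                    pending.append(link)
            done.add(url)
        pending.popleft()
        processed += 1
        if processed % every == 0:
            save()
    save()
    return done
```

If `process_page` raises an exception, the last saved checkpoint still lists the failing page as pending, so running the function again with the same `checkpoint_path` picks up from the most recent save. Pages handled since that save are simply processed a second time.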

1. The Spider run

The first stage is the Spider run itself, which starts at www.genuki.org.uk, fetching and checking any pages whose URL contains genuki. The program has a configuration file, so we can be a little more flexible in determining how far it searches.
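The scope rule might look something like the sketch below. The `include` and `exclude` arguments are hypothetical stand-ins for whatever the configuration file really holds, and the restriction to http and https URLs is an assumption made for the example.

```python
from urllib.parse import urlsplit


def in_scope(url, include=("genuki",), exclude=()):
    """Return True if the page itself should be fetched and checked."""
    # Only web pages can be fetched and analysed (assumption for this sketch).
    if urlsplit(url).scheme not in ("http", "https"):
        return False
    lowered = url.lower()
    # Hypothetical exclusion patterns take priority over inclusion patterns.
    if any(pattern in lowered for pattern in exclude):
        return False
    return any(pattern in lowered for pattern in include)
```

A link that is out of scope would still be tested to see whether it works; it is only the fetching and analysis of the target page that the rule switches off.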

This stage can often take 1-2 days to run, depending on network load and the number of links to hosts that no longer run a web server.

In order to speed things up, and for backup purposes, the Spider keeps a copy of every GENUKI page that it analyses. If a page has not changed since the last run, it uses the locally cached copy rather than fetching it again, which gives a significant performance advantage. Unfortunately some web servers do not return the last modified date of pages, and so we cannot take advantage of this. Pages that use server side includes are another case where the last modified date is not returned.
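The decision about whether the cached copy can be used might be sketched like this. It assumes that the Last-Modified header value was saved alongside each cached page on the previous run; the function name and arguments are invented for the example.

```python
from email.utils import parsedate_to_datetime


def can_use_cached_copy(cached_last_modified, server_last_modified):
    """Decide whether the locally saved copy of a page is still current.

    Both arguments are Last-Modified header values (HTTP date strings),
    or None when the header was absent.
    """
    # No date from the server (for example a page built with server side
    # includes), or no date saved with the copy: the page must be refetched.
    if not cached_last_modified or not server_last_modified:
        return False
    try:
        cached = parsedate_to_datetime(cached_last_modified)
        current = parsedate_to_datetime(server_last_modified)
        return current <= cached
    except (TypeError, ValueError):
        # Unparseable or incomparable dates: play safe and refetch.
        return False
```

For example, `can_use_cached_copy("Mon, 01 Jan 2001 10:00:00 GMT", None)` is false, which is the case described above where the server gives no last modified date.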

This part of the job also checks each GENUKI page to see if it contains our logo, and records all the pages that contain a link to our Nearby Places script. In order to help validate e-mail addresses, it also checks the hostname part of every mailto: link. It looks in the DNS to see if there is an A or MX record for the hostname part of the e-mail address, and flags the address as an error if neither exists. Mailers use the A and MX records in the DNS to determine where to send e-mail.
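The mailto: check could be sketched as below. The function names are invented for the example. Python's standard library has no MX lookup, so the two lookups are passed in as functions: `has_a_record` is one possible A-record test (it goes through the system resolver, which may also consult a local hosts file), and an MX test would have to come from a separate DNS library.

```python
import socket
from urllib.parse import unquote, urlsplit


def mailto_hosts(href):
    """Return the hostname part of each address in a mailto: link."""
    parts = urlsplit(href)
    if parts.scheme.lower() != "mailto":
        return []
    hosts = []
    # A mailto: link may hold several comma-separated addresses.
    for address in unquote(parts.path).split(","):
        address = address.strip()
        if "@" in address:
            hosts.append(address.rsplit("@", 1)[1].lower())
    return hosts


def has_a_record(host):
    """True if the system resolver can find an IPv4 address for the host."""
    try:
        socket.getaddrinfo(host, None, socket.AF_INET)
        return True
    except socket.gaierror:
        return False


def check_mailto(href, has_a, has_mx):
    """Return the hostnames in a mailto: link with neither an A nor an MX record."""
    return [h for h in mailto_hosts(href) if not (has_a(h) or has_mx(h))]
```

Any hostname returned by `check_mailto` would be reported as an error against the page that contains the link.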

2. The Reporter

The report program is used to analyse the information that the Spider run has produced, and to format it into web pages that we can look at. A configuration file is used to match up all the varied URLs to the appropriate parts of GENUKI. The reporter also produces some information that is used by other checking programs.
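One simple way of matching URLs to parts of GENUKI is a table of URL prefixes in which the longest matching prefix wins. The sketch below assumes that approach; the real format of the configuration file is not described here, and the function name and sample entries are invented.

```python
UNKNOWN = "Unable to determine owner"


def find_owner(url, prefix_map):
    """Return the section whose URL prefix matches, preferring the longest prefix."""
    best = None
    for prefix, section in prefix_map.items():
        if url.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, section)
    return best[1] if best else UNKNOWN


# Hypothetical configuration entries, for illustration only.
sample_map = {
    "http://www.example.org/genuki/": "Section A",
    "http://www.example.org/genuki/LAN/": "Section B",
}
```

With this table, a page under `http://www.example.org/genuki/LAN/` is assigned to Section B, and a URL that matches no prefix falls through to Unable to determine owner, which is the signal that the configuration needs a new entry.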

The report program does not run automatically after the Spider, as it actually runs under a different userid. When the Spider runs over the weekend I check at intervals to see if it has finished, and try to do a single run of the reporter via a dialup terminal session. This means that the majority of the information can be downloaded at the weekend, when downloads are cheaper. The reporter takes about ten minutes to run, but after every Spider run I usually find some changes are needed to the config file, as some sections may have moved, and frequently new items appear under Unable to determine owner.

This tidy-up phase is always done the next time that I am at the university, usually the next Monday.

3. The Syntax Checker

A further check that is carried out on the pages is to run an HTML syntax checker against them. This is done against the disc copies of the pages that the Spider saves, or against the disc copies of the pages themselves for anything stored at genuki.org.uk. The list of files that must be checked for each county is the list of county pages that is produced by the Reporter.

The syntax check typically takes a couple of hours to run and is therefore not performed via a dialup session. If pages for a particular county are held at genuki.org.uk, it can be run at other times for that particular county, but again using the list of page names that the previous Report run found.

The syntax check is performed county by county using the list of county pages produced from the report on the full Spider run. It is done against disc copies of the pages: for all pages held at Manchester this is the live page, and for everything else it is the disc copy of the page stored by the full Spider run. I can therefore rerun the syntax check for counties stored here if you ask me to, but it will only be run against the list of pages found in the full Spider run. N.B. The Spider report page uses server side includes to show you the number of HTML syntax errors, which in effect means the numbers can change after the report page is originally written without the report page itself changing. It also means you may need to use Shift+Reload in your browser to make the latest updates visible.

4. Re-check runs

The main Spider is usually run at about monthly intervals as it uses a lot of resources, and takes quite a long time to run. This interval can appear to be quite long if you are fixing errors, and you need an up to date list of what is still outstanding. So I have now developed a re-check version of the Spider which I plan to run more frequently.

The recheck uses the list of pages that contained link errors (not syntax errors) from the last run, and checks just them for link errors. It does not check any GENUKI pages linked from them, and will therefore not check any brand new pages. The time the run takes depends on the number of pages containing errors, and experience so far shows that it takes a day to run. Typically this is because on quite a few really bad links we have to wait for a timeout.
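In outline, a re-check run behaves something like the sketch below. The function and argument names are invented for the example: `get_links` is assumed to return the links on a page, and `link_ok` to test a single link.

```python
def recheck_pages(pages_with_errors, get_links, link_ok):
    """Re-test the links on pages that had link errors in the last full run."""
    new_badlinks = {}
    verdicts = {}                 # each distinct link is tested only once
    for page in pages_with_errors:
        bad = []
        for link in get_links(page):
            if link not in verdicts:
                verdicts[link] = link_ok(link)
            if not verdicts[link]:
                bad.append(link)
        if bad:
            new_badlinks[page] = bad
    # Linked pages are never added to the work list, so pages that are
    # brand new since the last full run are not examined.
    return new_badlinks
```

Remembering the verdict for each link means that a dead host shared by several pages only costs one timeout, although a run with many distinct bad links will still be slow.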

Once the Re-check run completes, the Reporter is run again, but using only the new badlinks file as changed data, so only this part of the report will be different. I will also usually run the Syntax checker again at this point, but of course the only changed data it has is any page stored at genuki.org.uk and any of the other pages that were previously reported as containing link errors.

The only item on the report page that changes after a recheck run is the number of link errors; everything else refers to the last full run.