GENUKI Maintainers' Pages

Version 1.15

How the Spider Works

Analysis

The starting point for analysis is the database table that lists all the pages comprising the GENUKI web site: the GENUKI page list. For this purpose, the GENUKI web site consists of all pages held on the genuki.org.uk server, and all pages in GENUKI sections held on other servers and referenced from the genuki.org.uk server.
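
To make the sketches in the rest of this page concrete, the page list (and the other tables those sketches use) might be held in something like the following schema. The table names, column names, and the choice of sqlite are all assumptions made for illustration; the real database layout is not documented here.

    import sqlite3

    conn = sqlite3.connect("spider.db")
    conn.executescript("""
        -- Hypothetical schema: the real table and column names are
        -- not documented on this page.
        CREATE TABLE IF NOT EXISTS genuki_pages (
            url       TEXT PRIMARY KEY,  -- page address
            filename  TEXT,              -- local file or saved copy
            section   TEXT,              -- maintenance section
            mediatype TEXT,              -- html, images, css, js or other
            type      TEXT               -- html files only; currently unused
        );
        CREATE TABLE IF NOT EXISTS non_genuki_pages (
            url    TEXT PRIMARY KEY,
            status TEXT,                 -- e.g. ok, error:..., redirect:...
            added  TEXT                  -- date the entry was added
        );
        CREATE TABLE IF NOT EXISTS problems (
            url     TEXT,
            message TEXT
        );
        CREATE TABLE IF NOT EXISTS referenced_pages (
            url      TEXT PRIMARY KEY,
            filename TEXT
        );
    """)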

Analysis is done page by page using the GENUKI page list. It is only during this stage of analysis that the spider uses web protocols to access GENUKI pages. If there are any problems locating a GENUKI page at this point, e.g., if a page in the GENUKI page list cannot be found because it has been deleted, it will appear as a problem under the "Spider" heading of the Problems report.
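
In outline, the analysis loop might resemble the following sketch; the function names and error handling are assumptions, not the spider's actual code.

    import urllib.request
    import urllib.error

    def record_problem(conn, url, message):
        # Hypothetical helper: save the problem for the Problems report.
        conn.execute("INSERT INTO problems (url, message) VALUES (?, ?)",
                     (url, message))

    def check_links(conn, url, page):
        # Placeholder: link handling is sketched in the sections below.
        pass

    def analyse_all(conn):
        # Walk the GENUKI page list: the only stage at which GENUKI
        # pages are fetched over the web.
        for (url,) in conn.execute("SELECT url FROM genuki_pages"):
            try:
                with urllib.request.urlopen(url) as response:
                    page = response.read()
            except urllib.error.URLError as err:
                # e.g. the page has been deleted: reported under the
                # "Spider" heading of the Problems report.
                record_problem(conn, url, f"Spider: {err}")
                continue
            check_links(conn, url, page)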

As each page in the GENUKI page list is analysed, the spider detects and checks the links it contains. For speed, and to avoid hitting other websites too often, the spider minimises web access while doing so.

For links to GENUKI pages, the spider doesn't use web access protocols. Instead, it looks at local files for pages held at genuki.org.uk, and at the copy saved at genuki.org.uk for every GENUKI page hosted elsewhere (these copies are held for html checks but are also used as a backup). If such a file doesn't exist, a "404 Not Found" error is reported under the "Spider" heading of the Problems report; so a GENUKI page that no longer exists is reported only under that heading.
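
A link to a GENUKI page can therefore be checked with a file-system lookup rather than a web request. A minimal sketch, assuming a particular directory layout (the paths shown are invented for illustration):

    import os
    from urllib.parse import urlparse

    DOC_ROOT = "/var/www/genuki"             # pages held at genuki.org.uk
    COPY_ROOT = "/var/spool/genuki-copies"   # saved copies of other hosts

    def genuki_link_ok(url):
        parts = urlparse(url)
        if parts.netloc in ("genuki.org.uk", "www.genuki.org.uk"):
            path = os.path.join(DOC_ROOT, parts.path.lstrip("/"))
        else:
            # GENUKI sections hosted elsewhere: look at the copy saved
            # at genuki.org.uk (held for html checks and as a backup).
            path = os.path.join(COPY_ROOT, parts.netloc, parts.path.lstrip("/"))
        # A missing file becomes a "404 Not Found" problem under the
        # "Spider" heading of the Problems report.
        return os.path.exists(path)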

For links to non-GENUKI pages, the spider uses another database table: the non-GENUKI page list. Before attempting to locate a non-GENUKI page, the spider searches this list for a matching page address entry. If a link to the page has already been encountered, either on this run of the spider or a previous one, the spider uses the existing entry and does not use web access protocols to obtain the page again. If the link has not been encountered before, there will be no entry in the list; in that case, the spider adds an entry for the page, attempts to locate it using web access protocols, and records any consequent errors or redirects in the entry, together with the date the entry was added.
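
The effect is a simple look-before-fetch cache. A sketch, using the illustrative non_genuki_pages table from above:

    import urllib.request
    import urllib.error
    from datetime import date

    def check_non_genuki(conn, url):
        # Reuse the recorded result if this link has been seen on this
        # or a previous run: no web access in that case.
        row = conn.execute(
            "SELECT status FROM non_genuki_pages WHERE url = ?",
            (url,)).fetchone()
        if row is not None:
            return row[0]
        # First sighting: fetch the page and record the outcome,
        # together with the date the entry was added.
        try:
            with urllib.request.urlopen(url) as response:
                status = "ok" if response.url == url else "redirect:" + response.url
        except urllib.error.URLError as err:
            status = "error:%s" % err
        conn.execute(
            "INSERT INTO non_genuki_pages (url, status, added) VALUES (?, ?, ?)",
            (url, status, date.today().isoformat()))
        return status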

However, a link to a non-GENUKI page could start failing after the spider run that added its entry to the table, so the non-GENUKI list has to be purged regularly. Without purging, such links would continue to be reported as successful even though, in practice, they now fail. Entries in the non-GENUKI page list for successful links are purged after 5 days, and those for failed links after 3 days.
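
A purge along these lines could be written as follows (again using the illustrative table; ISO dates compare correctly as strings):

    from datetime import date, timedelta

    def purge_non_genuki(conn):
        today = date.today()
        # Successes are trusted for 5 days, failures for 3; after that
        # the entry goes, and the link is fetched afresh when next seen.
        ok_cutoff = (today - timedelta(days=5)).isoformat()
        fail_cutoff = (today - timedelta(days=3)).isoformat()
        conn.execute("DELETE FROM non_genuki_pages "
                     "WHERE status = 'ok' AND added < ?", (ok_cutoff,))
        conn.execute("DELETE FROM non_genuki_pages "
                     "WHERE status != 'ok' AND added < ?", (fail_cutoff,))
        conn.commit()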

There are some links to non-GENUKI pages that should never be checked, e.g., those to "validator.w3.org", which provides html syntax checking. Such links are exempted by creating a suitable entry in the non-GENUKI page list with a date well into the future. These entries are therefore never purged, the spider always treats them as successful, and the web site is never accessed.
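
Such an exemption might be created like this; the far-future date keeps the purge sketched above from ever removing the entry, so the lookup always finds a success and the site is never contacted.

    def never_check(conn, url):
        # A date well into the future means the entry is never purged,
        # so the spider always treats the link as successful.
        conn.execute(
            "INSERT OR REPLACE INTO non_genuki_pages (url, status, added) "
            "VALUES (?, 'ok', '9999-12-31')", (url,))
        conn.commit()

    # Example (illustrative url): html validation links are never followed.
    # never_check(conn, "https://validator.w3.org/check?uri=referer")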

Discovery

During the analysis phase, when the spider is checking for links to GENUKI pages, it notes the filename and address of each GENUKI page referenced, and saves these in a list in the database at the end of the spider run.
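
In sketch form, continuing the illustrative names used above:

    referenced = {}   # page address -> filename, built up during analysis

    def note_genuki_reference(url, filename):
        # Called for each link to a GENUKI page seen during analysis.
        referenced[url] = filename

    def save_referenced(conn):
        # Written out once, at the end of the spider run.
        conn.executemany(
            "INSERT OR REPLACE INTO referenced_pages (url, filename) "
            "VALUES (?, ?)", referenced.items())
        conn.commit()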

This list of GENUKI page names is used by the discovery process, which checks whether each page is already known by looking it up in the GENUKI page list, i.e., the list of pages that comprise the GENUKI web site. For each unknown page, i.e., each new page, it creates an entry in the GENUKI page list, chooses a section for it, and sets its mediatype and type. New pages are therefore only checked and analysed on the next run of the spider. The mediatype is set to one of html (excluding cgi), images, css, js, or other. Link and html checking is performed only on html pages; the other mediatypes are checked only for existence. The type is set only for html files and is currently unused.
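
The discovery step and the mediatype classification might look roughly like this; the extension-based rule and the choose_section stub are assumptions, since the actual rules are not documented here.

    import os

    def classify_mediatype(filename):
        # Hypothetical rule: classify by file extension; cgi output is
        # excluded from the html mediatype.
        ext = os.path.splitext(filename)[1].lower()
        if ext in (".html", ".htm", ".shtml"):
            return "html"      # gets full link and html checking
        if ext in (".gif", ".jpg", ".jpeg", ".png"):
            return "images"    # only checked for existence
        if ext == ".css":
            return "css"
        if ext == ".js":
            return "js"
        return "other"

    def choose_section(url):
        # Placeholder: how a section is chosen is not documented here.
        return "unknown"

    def discover(conn):
        # Pages referenced during analysis but absent from the GENUKI
        # page list are new: entered now, analysed on the next run.
        for url, filename in conn.execute(
                "SELECT url, filename FROM referenced_pages"):
            known = conn.execute(
                "SELECT 1 FROM genuki_pages WHERE url = ?", (url,)).fetchone()
            if known is None:
                conn.execute(
                    "INSERT INTO genuki_pages (url, filename, section, mediatype) "
                    "VALUES (?, ?, ?, ?)",
                    (url, filename, choose_section(url),
                     classify_mediatype(filename)))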

Directories

The final complication is handling directories for sections not hosted at genuki.org.uk. If a url ends in '/', the spider tries the usual suspects, index.html, index.php, etc., until it finds one that exists. For pages at genuki.org.uk it looks directly at the files, but for those expected elsewhere it has to try each candidate using web access protocols until one is fetched successfully.
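
For a section held elsewhere, that probing has to happen over the web. A sketch, with an invented list of candidate index names:

    import urllib.request
    import urllib.error

    INDEX_NAMES = ["index.html", "index.htm", "index.php", "index.shtml"]

    def resolve_directory(url):
        # url ends in '/': try the usual suspects in turn until one of
        # them is fetched successfully.
        for name in INDEX_NAMES:
            try:
                with urllib.request.urlopen(url + name):
                    return url + name
            except urllib.error.URLError:
                continue
        return None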

Where the trailing slash is missing, for sections not held at genuki.org.uk, the spider cannot tell whether the link is to a file or a directory. It fetches the url using web access protocols and then checks whether the base address of the returned page differs from the one requested. In the case of a directory, the returned base address is terminated with '/', so the spider can be sure it is dealing with a directory.
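
That check amounts to comparing the address the server finally returns with the one requested. A sketch:

    import urllib.request

    def is_directory(url):
        # url has no trailing '/'. Fetch it (redirects are followed) and
        # compare the returned base address with the one requested: for
        # a directory it differs and is terminated with '/'.
        with urllib.request.urlopen(url) as response:
            final = response.url
        return final != url and final.endswith("/")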

Reporting

Spider reports are generated when a maintainer requests them, by means of cgi scripts which read the database and create a web page containing the required data.
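
In outline, each report script has this shape; the sketch below uses Python and the illustrative problems table from above, though the real scripts' language and output format are not documented here.

    #!/usr/bin/env python3
    # Minimal shape of a report cgi script: read the database and
    # write a web page containing the requested data.
    import sqlite3

    def main():
        conn = sqlite3.connect("spider.db")
        rows = conn.execute("SELECT url, message FROM problems").fetchall()
        print("Content-Type: text/html\n")   # header plus blank line
        print("<html><body><h1>Problems</h1><ul>")
        for url, message in rows:
            print("<li>%s: %s</li>" % (url, message))
        print("</ul></body></html>")

    if __name__ == "__main__":
        main()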

Each of the cgi scripts used to generate the spider reports produces a report with its own format and content, as follows:

Overall Report

The overall report is generated by invoking the spider cgi script. The resulting web page is organised by the logical sections in which GENUKI pages are grouped for maintenance purposes.

Section Report

The section report is generated by invoking the section cgi script. It also includes the section problems report, covering all problems in the section.

Section Files Report

The section files report is generated by invoking the section_files cgi script. The resulting web page provides a list of the urls of the requested files.

Section Problems Report

The section problems report is generated by invoking the section_problems cgi script. The resulting web page provides a list of the urls of the problem files.