Clone of Checking of Internal and External Hypertext Links

Help and Guidance 2020: Draft: Modified Page: Version 1

Introduction

The GENUKI system contains many thousands of pages.  Most pages contain many hypertext links to other pages, both internally within the GENUKI system, and externally to remote sites.  A large part of the reputation and credibility of GENUKI is built on the assumption that these hypertext links will work when selected by the end user.

Over the years the GENUKI team have developed the concept of the monthly Spider Report which systematically checks all the hypertext links in the system, and reports all those links that are broken or redirected.

The move of GENUKI to a Content Management System (CMS) has given us the opportunity to improve our regular checking of internal and external links.  This page explains how the link checking process works, and how it is to be used by GENUKI maintainers.

Note: The link checker is designed for use on www.genuki.org.uk - it does not work correctly on dp.genuki.uk, the system used for conversions from our pre-Drupal implementation.

The fundamentals

There are several key components to the GENUKI link checking system:

  • Each time a page of information is saved the system extracts all the hypertext links from the content and stores these links on a master list of links.
  • The system will then automatically check all new and existing links in the system every 4 weeks.  NB.  In order to spread the load on the GENUKI web server, the checking of links will be done in small batches every hour, but any specific link will be rechecked every 4 weeks.
  • All link errors and redirected links are stored in the link checking database, thereby allowing analysis of these broken links at any time.
  • Each time a maintainer opens the edit view of a page in the system, any broken links it contains will be shown at the head of the edit view of the page.
  • Each time an edited page is saved, any new links added by a Maintainer will be submitted for checking at the next batch run.
  • Each maintainer can also view a complete list of broken links on all the pages that they are responsible for.
  • Each maintainer has the facility to tell the system to ignore selected links during the checking process, either on a temporary or permanent basis.
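
The first component above, extracting links when a page is saved, can be sketched roughly as follows.  This is only an illustration using Python's standard html.parser module, not the code used by the GENUKI CMS; the names LinkExtractor, extract_links and register_links, and the use of a simple set as the "master list", are assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a fragment of HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lower case
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return the links found in html, in page order, without duplicates."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return list(dict.fromkeys(parser.links))


def register_links(master_list, html):
    """Add the links of a saved page to master_list (a set).

    Returns only the links that were not already known, i.e. the ones
    that would need to be submitted for checking at the next batch run.
    """
    new_links = [link for link in extract_links(html) if link not in master_list]
    master_list.update(new_links)
    return new_links
```

Saving the same page twice without changes would register no new links the second time, which matches the idea that only newly added links are submitted for checking.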

Automatic checking of hypertext links

Within the GENUKI system, there is a batch job (CRON job) that runs every hour.  Each time the batch job runs, it selects up to 1000 new or existing links to be checked.  This is done by checking the GENUKI site or the remote sites and evaluating the HTTP response codes.  Any response other than a normal response is logged by the system for subsequent reporting.  There are various types of links analysed by this process:

  • links internal to the page being checked
  • links to another page within the GENUKI system
  • links to a remote site outside the GENUKI system
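
As a rough sketch of the batch selection and the three link types described above, the following Python shows one way the logic could be expressed.  It is not the GENUKI implementation: the function names, the dictionary of last-checked times, and the rule of checking new links first are all assumptions made for illustration.  Only http(s) links and relative paths are considered.

```python
from datetime import datetime, timedelta
from urllib.parse import urlparse

GENUKI_HOST = "www.genuki.org.uk"        # host named in the note above
BATCH_SIZE = 1000                        # "up to 1000" links per hourly run
RECHECK_INTERVAL = timedelta(weeks=4)    # each link is rechecked every 4 weeks


def classify_link(url):
    """Return 'same-page', 'genuki' or 'remote' for a link."""
    if url.startswith("#"):
        return "same-page"
    host = urlparse(url).netloc.lower()
    # A relative path has no host part, so it points within GENUKI
    if host == "" or host == GENUKI_HOST:
        return "genuki"
    return "remote"


def select_batch(last_checked, now):
    """Pick up to BATCH_SIZE links that are new or due for a re-check.

    last_checked maps each url to the datetime of its last check,
    or to None if the link has never been checked.
    """
    due = [
        url for url, checked in last_checked.items()
        if checked is None or now - checked >= RECHECK_INTERVAL
    ]
    # Assumed ordering: never-checked links first, then the oldest checks
    due.sort(key=lambda url: (last_checked[url] is not None,
                              last_checked[url] or now))
    return due[:BATCH_SIZE]
```

At up to 1000 links per hourly run, a 4-week cycle can cover at most 1000 x 24 x 28 = 672,000 links.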

Information available to a GENUKI Maintainer

Each GENUKI maintainer will, on editing a page that they are responsible for, see at the head of the page a list of its broken links.

There is also a "Broken Links" report under the Maintainer's "My account" section that covers all the pages that they are responsible for.  For each broken link, the system lists:

  • Section - This is the GENUKI section under which the page is maintained.  This is normally a country or county.
  • Node - This is the url of the page containing the broken link.  If the GENUKI alias has not been set, the system will display the node number.  The title of the page is also shown in brackets on the line below.
  • Field Name - This is the field name containing the broken link.
  • Broken Link - This is the url of the link that is broken.  The visible text "fragment" containing the broken url is also shown in brackets on the line below.
  • Code - This is the HTTP status code returned by the target system when the link was checked.  NB.  Currently status codes of 200, 206, 302, and 304 are the only valid codes.  All other codes are regarded as "broken".  See below for a list of the more common error codes, and how to deal with them.
  • Error - This is a description of the HTTP status code explained above.
  • Author - This is the name of the author of the page.  This is normally the GENUKI maintainer for the page.
  • Last Checked - This is the date and time that the link was last checked.
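
To illustrate how the Code and Error columns relate, here is a small Python sketch.  It takes the set of valid codes exactly as stated in the Code entry above, and uses the standard http.HTTPStatus enumeration to obtain a description.  The function names are hypothetical, and the fallback text "Unknown status" is an assumption; the real report may word its descriptions differently.

```python
from http import HTTPStatus

# Codes treated as valid in the "Code" entry above; anything else is broken
VALID_CODES = {200, 206, 302, 304}


def is_broken(code):
    """Return True if a status code would put the link on the report."""
    return code not in VALID_CODES


def describe(code):
    """Return a short description of a status code for the Error column."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        # 999 is not a standard HTTP code; the link checker uses it when
        # no valid remote site could be found
        if code == 999:
            return "Name or Service not known"
        return "Unknown status"
```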

The final column on the Broken Links report contains a number of useful "operations" for the Maintainer to use:

  • Edit node - this option allows the Maintainer to directly edit the node containing the broken link
  • Recheck - this option submits the chosen link for re-checking at the next batch run
  • Ignore link - this option allows the link to be transferred from the Broken Links report to the Ignored Links report (see section on "Ignored Links" below).
  • Redirect - this option allows the Maintainer to immediately change the existing link to the redirected link, without having to manually edit the node and manually correct the link.

Thus the Broken Links report, through its more comprehensive details about each error, and provision of "useful operations", is usually more convenient to use than the error reports given at the head of relevant individual node edit screens.

The current version of the Broken Links report simply lists all the errors for a given Maintainer.  Whilst the information given for any specific broken link should be sufficient for the Maintainer to locate and correct the broken link, it is recognised that in the early days of the conversion to the current GENUKI implementation some maintainers will have a large list of errors to resolve.  Therefore, the error report also has some extra facilities to enable Maintainers to focus their work on a specific area:

  • Filtering - this facility enables the Maintainer to filter the list on the key fields - section, node (i.e. url of page containing the broken link), field name, url of broken link, status code.  The filtering options are exposed by clicking on the "Filter Items" link at the top of the page.  Simply enter characters into one or more filter fields, and the system will retrieve any records containing those characters.  NB.  The fields are case-sensitive.
  • Simultaneous actions on multiple links - it is now possible to carry out the same action on multiple links simultaneously.  This is done by selecting the required rows by ticking the checkbox at the start of each line, and then choosing the required action by clicking on the "Update Options" links at the top of the page.  Allowed actions currently include moving items to the Ignored List, and re-submitting items for re-checking.  More information on these actions is given below.
  • A pager - this is a facility shown at the bottom of the page that splits a large report into multiple chunks, and then allows the Maintainer to step forwards or backwards through multiple pages (50 items at a time). Note: Use of this facility takes you back to the unfiltered list.
  • Recheck all links - this is a button at the top of the report that allows a Maintainer to submit all their current broken links for re-checking the next time the batch job is run (usually within the hour)
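
The filtering and paging behaviour described above can be pictured with the following Python sketch.  It is not the report's actual code: the row dictionaries, their key names, and the functions filter_rows and page are assumptions made for the example.  It shows the two points worth remembering, namely that filters match any record containing the typed characters and are case-sensitive, and that the pager works in chunks of 50.

```python
PAGE_SIZE = 50  # the pager steps through the report 50 items at a time


def filter_rows(rows, **filters):
    """Keep the rows in which every filter text occurs in the named field.

    Matching is a case-sensitive "contains" test, as on the report.
    """
    return [
        row for row in rows
        if all(text in str(row.get(field, "")) for field, text in filters.items())
    ]


def page(rows, number):
    """Return page `number` of the report, counting from 0."""
    start = number * PAGE_SIZE
    return rows[start:start + PAGE_SIZE]
```

For example, filter_rows(rows, section="Dev", code="404") would keep only rows whose section contains "Dev" and whose status code contains "404", whereas a filter of "dev" would not match a section called "Devon".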

During the conversion of a county from our pre-Drupal implementation to the current GENUKI implementation, it is likely that many manual corrections will be made to links in the run-up to the county being made "live".  However, please note that during the conversion phase the Broken Links report may not be entirely accurate, especially when reporting links internal to GENUKI.  For this reason, once a county has been implemented, it is recommended that the maintainer resubmits all remaining broken links for re-checking using the button at the top of the Broken Links report.  Otherwise these links would not normally be re-checked automatically until 4 weeks after implementation.  Once a county is live in the Genuki-2 system, and all remaining broken links have been re-checked, the Broken Links report can be relied upon to display accurate and up-to-date information.

Note: there is a further means of identifying broken links, the report "Genuki - Errors & Statistics".  This report can be accessed from the main menu in the blue navigation bar -> Maintenance -> Errors & Statistics. (Its format is similar to the Spider Report in our pre-Drupal implementation.) The main reason for providing this new report is to enable any maintainer to view a list of nodes and a list of broken links for any section/county within Genuki - this enables the central conversion team to assist other maintainers with broken links during the conversion process.

Correction of a broken link

There are many reasons why a link may show up on the Broken Links error report.  However, there are four basic scenarios:

  • Link is still correct - this scenario occurs when the link is still valid, but for some reason the remote system has responded with an error message (e.g. the remote system was undergoing maintenance).  Assuming that the remote system was only temporarily unavailable, the broken link will normally disappear from the error report the next time this link is checked in the 4-week cycle.  NB.  The Maintainer has the option of manually resubmitting the link for re-checking as soon as the remote system is available once again. NB.  Once a link has been resubmitted for re-checking, it will temporarily disappear from the error report until the re-checking process is complete. (It may then subsequently return to the error report if there is still a problem with the link).
  • Link is no longer valid - this scenario occurs when the page being requested has been removed or changed on the remote system for whatever reason.  In this case the GENUKI maintainer must either remove or correct the link manually.  Once the link on the GENUKI page has been corrected, the system will submit the link for re-checking in a forthcoming batch run.  NB.  It will temporarily disappear from the error report until the re-checking process is complete. (It may then subsequently return to the error report if the correction is unsuccessful).
  • Link has been moved - this scenario occurs when the page being requested has been permanently moved to a different part of the remote system, but the owner of the remote system has provided a "redirected" link.  In this case the GENUKI maintainer must change the link to the redirected link (assuming the Maintainer is happy with the redirected page).  The simplest way to do this is to select the "Redirect" operation on the Broken Links report; alternatively the link can be changed manually.  Again, once the link on the GENUKI page has been corrected, the system will submit the link for re-checking in a forthcoming batch run.  NB.  It will temporarily disappear from the error report until the re-checking process is complete. (It may then subsequently return to the error report if the redirection is unsuccessful).
  • Link will never be valid - this scenario occurs when the maintainer has entered a non-existent link, usually by mistake / typing error.  Again, once the link on the GENUKI page has been corrected, the system will submit the link for re-checking in a forthcoming batch run.  NB.  It will temporarily disappear from the error report until the re-checking process is complete (it may then subsequently return to the error report if the correction is unsuccessful).

HTTP Status Codes

Each time a link is checked the GENUKI system sends an automatic request to the target system/website (internal or external), and the target system automatically sends a response message and status code to GENUKI.  The most common codes are:

Success Codes

  • 200 OK - This is the standard response for successful HTTP requests.
  • 206 Partial Content - The remote server has delivered only part of the content.  This is also OK, as the GENUKI link checking process does not require a full page to be sent.
  • 304 Not Modified - The page has not changed since it was last requested.  This is also OK.

Error Codes

  • 301 Moved Permanently - The requested link has been moved permanently to a new location. All future requests should be directed to the new given URL.  The Maintainer must then decide whether to accept the suggested redirection (see "Correction of a broken link" above), or to remove the link from the text completely.
  • 302 Found - According to Wikipedia this code is used in different ways by different sites.  However, this code is another category of re-directs (temporary or permanent).  Therefore, as for status 301, the Maintainer must decide whether to accept the redirection, or to remove the link completely.
  • 400 Bad Request - The server cannot or will not process the request due to something that is perceived to be a client (i.e. GENUKI) error.  As the GENUKI link checking process is meant to be an automatic process run in the background, it is unlikely that the Maintainer can resolve this problem.  (Please contact Phil or Ken in this case).
  • 403 Forbidden - The request was a valid request, but the server is refusing to respond to it - one simply-fixed cause is that the page linked to has not had its "Publish" flag set on.  If the Maintainer cannot access the remote system by clicking on the link, then it is likely that the link will have to be changed manually.  However, if the link works by clicking on it, but the GENUKI link checking process generates an error, it is likely the remote site has been configured to deter automated requests.  The recommendation is to classify this link as a permanently ignored link (see section on "Ignored Links" below).
  • 404 Not Found - The requested resource could not be found but may be available somewhere else, or again at the requested location some time in the future.  The most likely solution is for the Maintainer to correct the link to one that works, when this is possible.
  • 500 Internal Server Error - A generic error message, given when an unexpected condition was encountered and no more specific message is suitable.  It is suggested that the Maintainer submits this for re-checking a day or two later to see if the remote server is available again.
  • 503 Service Unavailable - It is suggested that the Maintainer submits this for re-checking a day or two later to see if the remote server is available again.
  • 999 Name or Service not known - This error occurs when the link checking process is unable to find a valid remote site to communicate with.  This link will probably have to be corrected or removed manually by the Maintainer.  The error message will normally be shown as "Name or Service not known", but if the link checker determines a more specific error, then this specific error message will be shown instead.
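
The advice given for each code above can be summarised as a simple lookup table.  The Python below is only an aide-memoire, not part of the link checker; the name suggest_action and the wording of the fallback for unlisted codes are assumptions.

```python
SUCCESS_CODES = {200, 206, 304}   # the codes listed under "Success Codes"

SUGGESTED_ACTION = {
    301: "accept the redirection or remove the link",
    302: "accept the redirection or remove the link",
    400: "contact Phil or Ken",
    403: "change the link if it fails when clicked, otherwise ignore it permanently",
    404: "correct the link",
    500: "re-check in a day or two",
    503: "re-check in a day or two",
    999: "correct or remove the link",
}


def suggest_action(code):
    """Return the recommended action for a status code from the list above."""
    if code in SUCCESS_CODES:
        return "no action needed"
    # Codes not described on this page get a generic fallback (an assumption)
    return SUGGESTED_ACTION.get(code, "investigate manually")
```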

Ignored Links

We have now introduced the concept of "ignored links".  This is to enable a Maintainer to remove one or more links from their Broken Links report.  Ignored links can be classified as temporary or permanent:

  • Temporarily ignored links will continue to be re-checked every 4 weeks, and will still be regarded as broken links.  For example, this could be used when a target system is unavailable for a short period, but is likely to return in the near future.
  • Permanently ignored links will no longer be re-checked by the system every 4 weeks, and will no longer appear on any error report. For example, this could be used when a link is correctly available when clicked on by a user, but for some reason the target system returns an error (see status code 403 above) when checked by the GENUKI link checking process.

(The temporary v. permanent choice is made using the "Update Options" menu.)
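
The difference between the two kinds of ignored link can be expressed in a few lines of Python.  This is a sketch of the rules as described in this section, not the system's own code; the IgnoreState enumeration and the two function names are hypothetical.

```python
from enum import Enum


class IgnoreState(Enum):
    NONE = "none"              # link is not ignored
    TEMPORARY = "temporary"    # ignored, but still re-checked
    PERMANENT = "permanent"    # ignored and never re-checked


def is_rechecked(state):
    """Only permanently ignored links drop out of the 4-week cycle."""
    return state is not IgnoreState.PERMANENT


def on_broken_links_report(state, broken):
    """A broken link is listed on the Broken Links report unless ignored."""
    return broken and state is IgnoreState.NONE
```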