Notes for GENUKI maintainers

The purpose of this page is to provide those maintaining GENUKI pages with a better understanding of just what the search facility does, how it does it and why it does it that way.

At the time of writing (June 2003) the search database includes references to around 75,000 documents published on web servers relating to family history.

If you are using this facility for the first time you may find the additional notes for users on the verbose version of the search form helpful.

  1. The ht://Dig software
  2. The indexing robot

The ht://Dig software

The GENUKI search facility uses ht://Dig, which is freely available from http://htdig.sourceforge.net/. We are using version 3.1.6, the current release version.

The indexing robot

Which servers are indexed?

The 'digging' component (henceforth 'the robot') of ht://Dig is set up to index GENUKI and other UK family history pages weekly. The timing is under review, but the run generally occurs on Thursday or Friday night. The robot starts from the GENUKI home page, http://www.genuki.org.uk/, and follows links to find other pages. It is constrained by two basic rules. It will follow (and index) any link with 'genuki' in its address. It will also follow and index any link to one of a substantial number of additional sites. The list of additional sites is maintained partly by hand and partly extracted automatically from the GENUKI pages listing FHS web sites and county surname-list sites.
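The two rules can be pictured with a short Python sketch. This is only an illustration, not the robot's actual configuration: the function name `should_follow`, the example host in `ADDITIONAL_SITES`, and the case-insensitive match on 'genuki' are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the list of additional sites;
# the real list is not reproduced here.
ADDITIONAL_SITES = {"www.example-fhs.org.uk"}


def should_follow(url, additional_sites=ADDITIONAL_SITES):
    """Return True if a link passes either of the two basic rules."""
    # Rule 1: any link with 'genuki' in its address.
    # (Ignoring case is an assumption made for this sketch.)
    if "genuki" in url.lower():
        return True
    # Rule 2: any link to one of the additional sites.
    host = urlsplit(url).hostname or ""
    return host in additional_sites
```

Under these assumptions a link to http://www.genuki.org.uk/big/ passes the first rule, a link to the hypothetical host passes the second, and a link to any other server is not followed.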

Here is a simple text list of the web servers visited (and the page count from each) in the last run of the indexing robot. Note that not all of these servers will have been fully indexed; in many cases only part of the server has been indexed.

When time (and the appropriate automation) permits, the exclusion list will be documented here.

Which pages are indexed?

All GENUKI pages are indexed, as are all Family History Society pages, all county surname interest lists and numerous other sites appropriate to family history.

Not all files are indexed. Adobe Acrobat (.pdf) files may be indexed in the future but for the present only HTML (.htm or .html) and text (.txt) files are indexed. There are also some exclusions based on strings within the URL. The most notable exclusions are 'cgi' and '?'.
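As a rough illustration of these file-type and URL-string rules, here is a minimal Python sketch. The function name `is_indexable` is hypothetical, and treating a URL that ends in '/' as an HTML index page is an assumption. The robot itself may well decide by other means.

```python
from urllib.parse import urlsplit

# File types currently indexed, as listed above.
INDEXED_EXTENSIONS = (".htm", ".html", ".txt")
# The most notable excluded strings, as listed above.
EXCLUDED_STRINGS = ("cgi", "?")


def is_indexable(url):
    """Return True if a URL passes the file-type and string exclusions."""
    # Exclusions based on strings anywhere within the URL.
    if any(s in url for s in EXCLUDED_STRINGS):
        return False
    path = urlsplit(url).path
    # Assumption: a directory-style URL serves an HTML index page.
    if path == "" or path.endswith("/"):
        return True
    return path.lower().endswith(INDEXED_EXTENSIONS)
```

With this sketch, a .pdf file, anything under a cgi-bin directory, and any URL carrying a query string would all be skipped.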

Robot exclusion rules

If your server contains parts which you feel would be better left out of the search database then you can tell the robot to stay out of a specified part of your server hierarchy using the generally accepted robot exclusion rules. These rules assume you have full control of the server's document hierarchy, an assumption that will be false for many of you. These rules are detailed at http://www.robotstxt.org/. The GENUKI robot goes under the name of 'index-genuki' and so a robots.txt file in your server's document root directory containing:

          User-agent: index-genuki
          Disallow: /personal

would keep the robot out of your /personal directory.

Of course, you might prefer:

          User-agent: *
          Disallow: /personal

which would keep all (well behaved) robots out.
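If you want to check a robots.txt file before relying on it, one option is Python's standard urllib.robotparser module. The sketch below tests the first example above. The host www.example.org and the page names are placeholders, and Python's parser may not interpret every rule exactly as ht://Dig does.

```python
from urllib.robotparser import RobotFileParser

# The first robots.txt example from above.
rules = """\
User-agent: index-genuki
Disallow: /personal
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder URLs on a hypothetical server.
blocked = parser.can_fetch(
    "index-genuki", "http://www.example.org/personal/diary.html")
allowed = parser.can_fetch(
    "index-genuki", "http://www.example.org/index.html")
# A robot with a different name is not covered by this rule.
other = parser.can_fetch(
    "some-other-robot", "http://www.example.org/personal/diary.html")

print(blocked, allowed, other)
```

Here `blocked` is False while `allowed` and `other` are True, which shows that the first form only restrains the GENUKI robot; the 'User-agent: *' form is needed to keep other robots out as well.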

Bad Link report

Each weekly indexing run creates a report, which is available at http://www.genuki.org.uk/search/report/. This shows the number of pages indexed from each server and also shows their distribution in terms of 'mouse-clicks' from the GENUKI home page. The report also provides an elementary bad-link list for each server the robot indexed (or rather for each server it found a bad link on 8-). This list is not complete: it only reports files that the robot would have indexed (subject to its include and exclude rules) but failed to access.
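One way to picture the 'mouse-clicks' distribution is as a breadth-first count of how many links must be followed to reach each page from the home page. The Python sketch below runs over a tiny, made-up link graph. It assumes the report counts the minimum number of clicks, which may not be exactly how the real report is produced.

```python
from collections import Counter, deque


def click_distribution(links, start):
    """Count pages by their minimum number of clicks from the start page."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in depth:  # first visit gives the shortest route
                depth[target] = depth[page] + 1
                queue.append(target)
    return Counter(depth.values())


# Hypothetical miniature link graph, for illustration only.
links = {
    "home": ["eng", "sct"],
    "eng": ["eng/kent", "home"],
    "sct": ["eng/kent"],
    "eng/kent": [],
}
distribution = click_distribution(links, "home")
```

For this graph the result is one page at zero clicks (the home page itself), two pages at one click and one page at two clicks.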

This search facility uses the free ht://Dig software.
This installation is administered by Malcolm.Austen@weald.org.uk
© 2000-2003 GENUKI