Search GENUKI plus
The purpose of this page is to provide those maintaining GENUKI pages with a better understanding of just what the search facility does, how it does it and why it does it that way.
At the time of writing (June 2003) the search database includes references to around 75,000 documents published on web servers relating to family history.
If you are using this facility for the first time you may find the additional notes for users on the verbose version of the search form helpful.
The software used for the GENUKI search facility is ht://Dig, which is freely available from http://htdig.sourceforge.net/. We are using version 3.1.6 of the software, the current release version.
The 'digging' component (henceforth 'the robot') of ht://Dig is set up to index GENUKI and other UK family history pages weekly. The timing is under review, but the run generally takes place on Thursday or Friday night. The robot starts from the GENUKI home page http://www.genuki.org.uk/ and follows links to find other pages. The robot is constrained by two basic rules. It will follow (and index) any link with 'genuki' in its address. It will also follow and index any link to one of a substantial number of additional sites. This list of additional sites is partially maintained by hand and partially extracted automatically from the GENUKI pages listing FHS web sites and county surname-list sites.
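The two rules above can be sketched in a few lines of Python. This is only an illustration, not ht://Dig's own configuration: the function name `should_follow`, the sample host in `ADDITIONAL_SITES` and the case-insensitive match on 'genuki' are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the list of additional sites;
# the real list is maintained separately and is not reproduced here.
ADDITIONAL_SITES = {"www.example-fhs.org.uk"}


def should_follow(url: str) -> bool:
    """Return True if either of the robot's two basic rules admits the URL."""
    # Rule 1: any link with 'genuki' in its address.
    # (Matching case-insensitively is an assumption of this sketch.)
    if "genuki" in url.lower():
        return True
    # Rule 2: any link to one of the additional sites.
    host = urlparse(url).hostname or ""
    return host in ADDITIONAL_SITES
```

With this sketch, `should_follow("http://www.genuki.org.uk/big/")` is accepted by the first rule, while a link to a server that is neither a 'genuki' address nor on the list is not followed.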
Here is a simple text list of the web servers visited (and the page count from each) in the last run of the indexing robot. Note that these servers won't all have been fully indexed; in many cases only a part of the server has been indexed.
When time (and the appropriate automation) permits, the exclusion list will be documented here.
All GENUKI pages are indexed, as are all Family History Societies' pages, all county surname interest lists and numerous other sites appropriate to family history.
Not all files are indexed. Adobe Acrobat (.pdf) files may be indexed in the future but for the present only HTML (.htm or .html) and text (.txt) files are indexed. There are also some exclusions based on strings within the URL. The most notable exclusions are 'cgi' and '?'.
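As a rough illustration of these file-type and URL-string rules, the check might look like the Python sketch below. The function name `is_indexable` is hypothetical, only the two most notable exclusions are included because the full exclusion list is not documented here, and treating a directory URL (one ending in '/') as an HTML page is an assumption of the example.

```python
from urllib.parse import urlparse

# File types currently indexed: HTML and plain text.
INDEXED_EXTENSIONS = (".htm", ".html", ".txt")

# Only the two most notable exclusions; the full list is not documented here.
EXCLUDED_SUBSTRINGS = ("cgi", "?")


def is_indexable(url: str) -> bool:
    """Return True if the URL passes the file-type and exclusion rules."""
    # Exclusions are based on strings appearing anywhere within the URL.
    if any(fragment in url for fragment in EXCLUDED_SUBSTRINGS):
        return False
    path = urlparse(url).path
    # Assumption: a directory URL serves an HTML index page.
    if path == "" or path.endswith("/"):
        return True
    return path.lower().endswith(INDEXED_EXTENSIONS)
```

Under these assumptions a page such as `http://www.example.org/surnames.html` is indexed, while `http://www.example.org/guide.pdf` and `http://www.example.org/cgi-bin/search.html` are skipped.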
If your server contains parts which you feel would be better left out of the search database then you can tell the robot to stay out of a specified part of your server hierarchy using the generally accepted robot exclusion rules. These rules assume you have full control of the server's document hierarchy, an assumption that will be false for many of you. These rules are detailed at http://www.robotstxt.org/. The GENUKI robot goes under the name of 'index-genuki' and so a robots.txt file in your server's document root directory containing:
User-agent: index-genuki
Disallow: /personal
would keep the robot out of your /personal directory.
Of course, you might prefer:
User-agent: *
Disallow: /personal
which would keep all (well behaved) robots out.
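If you want to check what a robots.txt file will do before publishing it, Python's standard `urllib.robotparser` module can evaluate the rules locally. The sketch below tests the two examples above; the helper name `allowed`, the robot name 'some-other-robot' and the `www.example.org` URLs are placeholders, and the result reflects how Python's parser reads the rules, which may differ in detail from ht://Dig's own handling.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt examples from the text above.
GENUKI_ONLY = """\
User-agent: index-genuki
Disallow: /personal
"""

ALL_ROBOTS = """\
User-agent: *
Disallow: /personal
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "http://www.example.org/personal/diary.html"
    print(allowed(GENUKI_ONLY, "index-genuki", page))
    print(allowed(GENUKI_ONLY, "some-other-robot", page))
    print(allowed(ALL_ROBOTS, "some-other-robot", page))
```

Note that `Disallow: /personal` is a prefix match, so it covers everything whose path begins with `/personal`, not only the contents of that directory.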
Each weekly indexing run creates a report which is visible at http://www.genuki.org.uk/search/report/. This shows the number of pages indexed from each server and also shows their distribution in terms of 'mouse-clicks' from the GENUKI home page. The report also provides an elementary bad-link list for each server it indexed (or rather for each server it found a bad link on 8-). This list is not complete - it only reports files that it would have indexed (subject to its include and exclude rules) but failed to access.
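The 'mouse-clicks' figure is the smallest number of links that must be followed to reach a page from the home page. One way such a distribution could be computed is a breadth-first walk over the link graph, as in the Python sketch below. This is not taken from ht://Dig or from the report script; the function name `click_distribution` and the tiny `EXAMPLE_LINKS` graph are invented for the illustration.

```python
from collections import Counter, deque

# Hypothetical link graph: each page maps to the pages it links to.
EXAMPLE_LINKS = {
    "home": ["counties", "societies"],
    "counties": ["kent"],
    "societies": ["kent", "home"],
    "kent": [],
}


def click_distribution(links, start):
    """Count pages by their minimum number of clicks from `start`."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # The first visit in a breadth-first walk gives the fewest clicks.
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return Counter(depth.values())
```

For the example graph, the home page is at zero clicks, 'counties' and 'societies' are one click away, and 'kent' is two clicks away even though two different pages link to it.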
This search facility uses the free ht://Dig software.
This installation is administered by Malcolm.Austen@weald.org.uk
© 2000-2003 GENUKI