Recommended Books and Articles on Record Linking
- "News from the Cambridge Group for the History of Population
and Social Structure: Automatic record linking for family
recognition," Local Population Studies, vol. 40, pp.10-16,
A summary account of the work of the Centre. States that they can
now perform fully-automatic family reconstitution. "We have tested
the record-linking algorithms extensively against manual
reconstructions, and we have found that they provide comparable, or
superior results in a fraction of the time taken to link the
records by hand." Provides a brief but useful narrative
description of the techniques, but no references to publications
giving detailed accounts of the techniques.
- P. Adman, S.W. Baskerville and K.F. Beedham, "Computer-Assisted
Record Linkage: or How Best to Optimize Links Without Generating
Errors," History and Computing, vol. 4, no. 1, pp.2-15, 1994.
"Record linkage is arguably a more complex process than most of its
practitioners realize. Once a historian decides to settle for a
subset of the potential links that can be made between two sets of
nominal records s/he is already involved in the process of
sampling. Fine, if that is relevant to the enquiry at hand: but if
the objective is to achieve the maximum number of 'true' links,
then accurate judgement becomes paramount. The key assumption
underlying this paper is that 'judgement' is the function of the
historian, not the computer. It is our belief that no systematic
algorithm, however sophisticated, can perform this task as well as
can an experienced team of researchers equipped with an appropriate
set of software tools. In this paper we develop this line of
reasoning in the context of our work on early eighteenth-century
elections, and describe the functions of a software package known
as CARL - Computer-Assisted Record Linkage."
- R. Barker, "Reconstituting the Family," Genealogists' Magazine,
vol. 21, no. 9, 1985.
Based on a talk given at the Society of Genealogists. Gives a
fairly full account of manual techniques for performing family
reconstitution; in describing the work of the Cambridge Group for
the History of Population and Social Structure, mentions use of a
computer only in connection with the analysis of the results of
such a manual reconstitution. Then describes how genealogists could
use essentially the same manual system of coloured record slips to
analyze all the data they collect on surnames of interest, arguing:
"This systematic way of attacking genealogy leaves much less margin
for error, in my view. You cancel out many births by means of
infant mortality - and know you've done so. You can slot births in
because you recognise the pattern of birth interval. A five-year
gap which should contain a baby may be filled from another parish.
You can keep your finger on the variations of surname. There is,
however, one other important advantage. Genealogy is a very
self-indulgent form of research. It benefits, in the last instance,
no-one but yourself, because no-one sees it. . . Don't stop being
genealogists. Become *responsible* genealogists. Become family
historians. If you use the Cambridge system, then you are not only
helping yourself, you are making a useful contribution to
historical research. Write-up your findings; deposit your slips in
the Record Office for others to borrow. Let the Record Office
microfilm your reconstitutions. . ." (An editorial note refers to
the Local Population Studies Society, formed by the Cambridge
Group, as a source of further information.)
- G. Bouchard, "The Processing of Ambiguous Links in Computerised
Family Reconstructions," Historical Methods, vol. 19, no. 1,
Follows on from Bouchard (1980), and gives a detailed account of
the methods used. Starts by pointing out the (often minimized)
problem of deciding which records are candidates for matching,
stating that in fact this, and hence the whole matching procedure,
should depend on the characteristics of the data being matched.
States that in this case the data being used (Saguenay registers
for 1842-1951) was of sufficiently high quality to use
nonprobabilistic methods of record matching to achieve progressive
linking, starting with the most secure situations (involving
couples) and dealing later with clusters. "During the operations,
the accumulation of information in the family records allows more
and more difficult decisions to be made: thus each step is based on
the preceding one . . . from one step to another, the algorithms
for matching and decision making are modified according to the
nature and quantity of the information available in the family
records and the complexity of the situations to be dealt with."
- G. Bouchard, "Current Issues and New Prospects for Computerized
Record Linkage in the Province of Quebec," Historical Methods, vol.
25, no. 2, pp.67-73, 1992.
"Most of this article focuses on [work on a database of 665,000
baptism, marriage and death records for the period 1838-1971 in the
Saguenay region of Quebec] in order to single out a few basic
theoretical and methodological issues relating to automated record
linkage. Some major lessons must be learned from our own
accomplishments (and failures) and from the varying fates of the
various well-known, large-scale automated record-linkage projects
set up in the 1970s in the fields of history, demography and
genetics. One of these lessons has to do with the close
relationship between the nature and the quality of the data, the
research goals, and the linkage strategies. The diversity of the
pioneer projects has provided ample evidence of (1) the need to
devise linkage systems consistent with the purposes of the research
and the capacity of the data to support them; (2) the consequent
contradiction that may arise between some ideal accuracy and
efficiency levels; and (3) the difficult choice that has to be made
in such circumstances."
- G. Bouchard and C. Pouyez, "Name Variations and Computerised
Record Linkage," Historical Methods, vol. 13, pp.119-125, 1980.
An excellent account of some of the techniques developed, in
connection with a large-scale parish register and census data
linking project in Quebec, for matching data concerning couples
(rather than individuals). Five different types of name variation
are described: spelling variations, phonetic variations,
double-names, double first names, and alternate first names, with
details given on the techniques used for dealing with the first two of
these. The first was handled by a specially created scheme of
phonetic encoding for the French language, containing 64 rules -
the article points out that soundex is not really a phonetic
encoding scheme, but rather just a crude sorting device. The second
type was handled by a technique based mainly on Guth's scheme for
assessing the extent to which two words contain the same letters in
the same order. Another aspect of the scheme is that it builds up a
dictionary of equivalent names, using separate criteria for
isolated names, and names in the context of all the name data
available for a couple. On one test, involving 2,000 records, the
scheme succeeded in finding 98.5% of the possible links, whereas
only two thirds of the possible links could be found if one tested
simply for identical names.
- C. Bourlet and J.-L. Minel. "A Declarative System for Setting
Up a Prosopographical Database," in History and Computing, ed. P.
Denley and D. Hopkin, pp.186-191, Manchester Univ. Press, 1987.
Brief description of a 70,000 item database of data from 13th
century French tax registers, and the use of a Prolog-based scheme
for determining whether items refer to the same individual.
- J. Carvalho. "Expert Systems and Community Reconstruction
Studies," in History and Computing, II, ed. P. Denley, S. Fogelvik
and C. Harvey, pp.97-102, Manchester Univ. Press, 1989.
Brief description of a small-scale project that is to use a
database, and a set of Prolog modules for record linking,
life-story reconstruction, genealogical analysis and network
- P.G. Cook, "Is Your John Cooke my John Cooke: introducing the
'C-Vector' for finding common ancestors among databases,"
Genealogical Computing, vol. 10, no. 1, pp.37-38, 1990.
A heuristic for matching individuals in two lineage-linked
databases, based on so-called C-vectors. Each vector corresponds to
a separate individual, and lists in sequence the birth years of the
individual, his/her parents and his/her grandparents, and the
individual's sex, and name. (The scheme involves finding pairs of
C-vectors which match on three or more known dates.)
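The matching test described in this annotation can be sketched roughly as follows. This is an illustration only, not Cook's published code: the `CVector` class, the ordering of the birth years (individual, parents, grandparents), the use of `None` for an unknown year and the exact-equality comparison are all assumptions. Only the threshold of three matching known dates comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CVector:
    """Hypothetical C-vector for one individual."""
    name: str
    sex: str
    # Birth years of the individual, parents and grandparents in a fixed
    # order (an assumed layout); None marks an unknown year.
    birth_years: Sequence[Optional[int]]


def matching_dates(a: CVector, b: CVector) -> int:
    """Count positions where both vectors hold the same known birth year."""
    return sum(
        1
        for year_a, year_b in zip(a.birth_years, b.birth_years)
        if year_a is not None and year_b is not None and year_a == year_b
    )


def is_candidate_match(a: CVector, b: CVector, threshold: int = 3) -> bool:
    """Flag a pair of individuals that agree on `threshold` or more known dates."""
    return matching_dates(a, b) >= threshold


# Invented example data, for illustration only.
mine = CVector("John Cooke", "M", (1802, 1775, 1779, 1750, None, 1752, None))
yours = CVector("John Cook", "M", (1802, 1775, None, 1750, 1749, None, None))
other = CVector("John Cooke", "M", (1802, 1780, 1779, None, None, None, None))

print(is_candidate_match(mine, yours))  # agree on three known dates
print(is_candidate_match(mine, other))  # agree on only two
```

Note that the names play no part in this sketch; a fuller version would presumably also compare name and sex before accepting a pair.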
- P. Cooley, "Biographs for 19th Century Family Records,"
Computers in Genealogy, vol. 4, no. 3, pp.104-111, 1991.
Shows how a biograph ("a potted life story on one individual",
generated automatically from a PAF file) can be used together with
some simple heuristics, to predict census returns, survivors, and
electors, for checking against the actual lists.
- P. Cooley, "Generation and Usage of Machine-Readable files from
the GRO Indexes," Computers in Genealogy, vol. 4, no. 5,
Describes some techniques that have been developed for automatic
enhancement of the output obtained from using a scanner to scan St.
Catherine's House indexes, and describes the sort of test (making
use of names and dates) that could be used to check whether two
registrations refer to the same individual.
- C. Davey. Reconstructing Local Population History: the Hatfield
and Bobbingworth districts of Essex, 1550-1880, Cambridge Ph.D.,
- C. Davey and A.S. Jarvis, "Microcomputers for Microhistory: a
database approach to the reconstruction of small English
populations," History and Computing, vol. 2, no. 3, pp.187-193,
Describes a scheme, based on the use of a relational database, for
assisting local amateur historians to perform family
reconstitutions. Baptisms, Marriages and Burials are kept in three
base tables. "For every marriage, or implied marriage, all
associated baptisms, burials and remarriages are found. Links
between entities are made by matching the character strings that
represent names. In addition, certain logical, biological and
social conventions [Wrigley, 1966] are observed which limit the
number of possible links to only probable links." Possible links
are kept in separate tables. Uses Soundex, plus a list of special
cases, for matching names. The user is then provided on request
with all the relevant information on screen in order that he/she
can perform resolution of competing links manually.
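As a rough illustration of name matching by "Soundex, plus a list of special cases", the sketch below implements the standard American Soundex code and consults a small lookup table first. It is not taken from Davey and Jarvis's system: the `SPECIAL_CASES` table and its single entry, and the `names_match` helper, are hypothetical.

```python
# Letter-to-digit table for standard American Soundex.
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}

# Hypothetical list of special cases that Soundex alone would miss.
SPECIAL_CASES = {"NIGHT": "KNIGHT"}


def soundex(name: str) -> str:
    """Return the four-character Soundex code of a name ('' if no letters)."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (first + "".join(digits) + "000")[:4]


def names_match(a: str, b: str) -> bool:
    """Compare two names after applying the special-case table."""
    a = SPECIAL_CASES.get(a.strip().upper(), a.strip().upper())
    b = SPECIAL_CASES.get(b.strip().upper(), b.strip().upper())
    return soundex(a) != "" and soundex(a) == soundex(b)


print(soundex("Robert"), soundex("Rupert"))  # both R163
print(names_match("Night", "Knight"))        # matched via the special case
```

Because Soundex keeps the first letter unchanged, variants that differ in their initial letter can only be caught by a table of this kind.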
- G. De Brou and M. Olsen, "The Guth Algorithm and the Nominal
Record Linkage of Multi-Ethnic Populations," Historical Methods,
vol. 19, pp.20-24, 1986.
Discussion of experiments comparing the letter-by-letter
comparative Guth surname matching algorithm (Guth 1976) against
language-specific surname matching algorithms, such as Soundex, Henry
and FONEM. "Although relatively successful for the particular
projects for which they were designed, each language-specific
system suffers from a major drawback: Its effective application is
limited to the language for which it was originally created. . . .
[Guth's] algorithm does not depend on recognition of phonetic
similarity . . . Our study suggests that Guth is correct to claim
that her algorithm has an important advantage over other systems:
It is able to identify variant spellings through the position of
letters in names. Because of this, the Guth algorithm, unlike the
language-specific programs, is well-suited to the linking of a
multi-ethnic population. The algorithm, however, is not perfect. It
can produce incorrect matches between surnames that bear little
resemblance to one another. This problem becomes particularly acute
when comparing short names, where one or two incorrect vowels can
produce an incorrect match. One way of overcoming this would be to
include a new function that would assign a level of 'confidence' in
the link according to the number of letters that occur in the same
positions of the names being compared."
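The 'confidence' function suggested at the end of this quotation might look something like the sketch below. It does not reproduce Guth's algorithm, whose rules are not given here. The function name, the normalisation by the longer name's length and the example names are assumptions made purely for illustration.

```python
def positional_confidence(a: str, b: str) -> float:
    """Fraction of positions (relative to the longer name) holding the same letter."""
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b))


print(positional_confidence("Bouchard", "Bouchar"))  # 0.875
print(positional_confidence("Lee", "Loo"))           # low: one letter in three
```

On such a measure the short-name problem mentioned in the quotation shows up directly: a single differing vowel moves a three-letter name much further than it moves an eight-letter one.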
- G.J.A. Guth, "Surname Spellings and Computerized Record
Linkage," Historical Methods Newsletter, vol. 10, pp.10-19,
Provides a full description of a surname matching
("name-recognition") algorithm that takes account of letter
ordering, rather than being phonetic in character, and its use in a
project to link various sets of early 18th century Norwich records:
pollbooks, land tax assessments, window-tax assessments and
registers of freemen. Details are given of how linking was
performed first of all within individual sets of data, then taking
pairs of sets of data at a time. "The limited number of identifiers
available in the Norwich data did not justify establishing an
elaborate weighting scheme for making links automatically. However
some percent of the 'true links' were formed by the program,
enabling the historian to select the remaining links
- J. Hitchon, "Russell Soundex Code: a BBC Basic program,"
Computers in Genealogy, vol. 1, no. 5, pp.122-123, 1983.
- C.J. Jardine and A.D.J. Macfarlane. "Computer Input of
Historical Records for Multi-Source Record Linkage," in Proc. 7th
Int. Economic History Conf., pp.71-8, Edinburgh 2, 1978.
A fascinating account of the means by which completely transcribed
documents were marked up in order to make the individuals,
property, etc., and the relationships that they refer to evident,
ready for computer input.
- S. King, "Record Linkage in a Protoindustrial Community,"
History and Computing, vol. 4, no. 1, pp.27-33, 1994.
- A. Macfarlane, S. Harrison and C. Jardine. Reconstructing
Historical Communities, Cambridge, Cambridge Univ. Press, 1977, 222
p. ISBN 0 521 21796 2
This fascinating book describes the development and use of a
sophisticated manual technique for creating a set of inter-related
indexes to data obtained from a variety of documents in order to
perform a "total reconstitution" of a given small community. The
technique is illustrated mainly by reference to a fourteen-year-long
study of the history of the villages of Earls Colne in Essex,
and Kirkby Lonsdale in Cumbria, covering the period 1500-1750. The
twelve document types dealt with in detail are: Anglican parish
registers, manorial rentals, court baron land transfers,
frankpledge listings, court leet cases, church court cases, quarter
sessions cases, assize depositions, hearth tax records, wills,
probate inventories, and population listings. For each of these
document types, details of the indexing techniques used are
described. In the case of Earls Colne, whose population was about
1200, the parish registers alone resulted in about 30,000 index
cards, and the total for the twelve sets of records was nearly
140,000 cards. The analyses that the integrated set of indexes make
feasible are very impressive. Far more complete family
reconstructions can be produced than would be possible using just
the parish register data. (One example shows a family tree which
includes 25 marriages only four of which could be traced from
parish register entries alone.) Similarly, extremely detailed
historical accounts of the ownership of land and property are made
feasible. (For Earls Colne it is "possible to know who owned every
one of the approximately 650 separate parcels of property, land or
housing, at any point in the last four hundred years.") Moreover,
much can be learnt about the accuracy and the completeness of the
data contained in the various types of document, and indeed about
what some of the documents really mean, so that the authors claim
that one can learn far more about a typical English community
during the period covered, both on an individual basis, and as a
whole, than had hitherto been thought possible. The final chapter,
incidentally, contains a careful analysis of how the manual
techniques described could be aided by computer processing (albeit
written in terms of then-current large mainframe computers). The
task of doing a complete family reconstitution project manually for
a village of 1000 inhabitants over a period of 300 years, based
solely on parish registers, is estimated at 1,500 hours. A "total
reconstitution" of the type described of such a community is
estimated as needing 10 to 20 man-years if performed manually, or
about 3 to 4 man-years if aided by computer.
- F. Nault and B. Desjardins. "Computers and Historical
Demography: the reconstitution of the early Québec
population," in History and Computing, II, ed. P. Denley, S.
Fogelvik and C. Harvey, pp.143-148, Manchester Univ. Press,
Description of a very large family reconstruction project; it is
stated that the method used, though "often incorrect at the
individual level . . . yields results that are statistically
- H.B. Newcombe and J.M. Kennedy, "Record Linkage: making maximum
use of the discriminating power of identifying information," Comm.
ACM, vol. 5, pp.563-565, 1962.
- H. Rhodri Davies, "Automated Record Linkage of Census
Enumerator's Books and Registration Data: Obstacles, Challenges and
Solutions," History and Computing, vol. 4, no. 1, pp.16-26,
"In this paper the need for individual-level longitudinal data to
study nineteenth century fertility and migration is highlighted and
reasons why so little work has been carried out in this area using
record linkage techniques are given. After hopefully alleviating
the fears about data quality we discuss a computer package which
can be used to speed up the linkage process."
- R.S. Schofield, "The Standardisation of Names and the Automatic
Linking of Historical Records," Annales de démographie
- K. Schürer. "Historical Demography, Social Structure and
the Computer," in History and Computing, ed. P. Denley and D.
Hopkin, pp.33-45, Manchester Univ. Press, 1987.
A general overview of the work of the Cambridge Group on
computer-based record linking of parish register records, and of
successive census records. Mention is made of the use of Soundex,
and of standardized coding (e.g. of occupations), in establishing
links, and of the programs for finding logical inconsistencies in
the data. (35 refs.)
- K. Schürer, J. Oeppen and R. Schofield. "Theory and
Methodology: an example from historical demography," in History and
Computing, II, ed. P. Denley, S. Fogelvik and C. Harvey,
pp.130-142, Manchester Univ. Press, 1989.
Summary of the "Cambridge" approach to family reconstruction, which
is described as based on two principles: "First an agnosticism
about the number and identity of individuals whose life histories
are represented by the events recorded in the registers leads us to
form all possible links, and to take note of cases where records
cannot be linked together. Second, a belief that in situations of
uncertainty it is better to prefer links in which the information
content of the records gives one the greatest confidence leads us
to adopt a hierarchical, sequential approach to the resolution of
ambiguities." It argues that this approach to deleting competing
links is to be preferred to that by Skolnick, of searching "for the
combination of links with the lowest aggregate total confidence
score that would need to be deleted to resolve the network . . .
[because] there can be a very large number of possible combinations
of links that could be deleted."
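A hierarchical, sequential resolution of competing links can be sketched as a simple greedy pass: take the candidate links in descending order of confidence and delete any link that competes with one already accepted. The sketch is not the Cambridge Group's program. The record identifiers, the confidence scores and the rule that each record may take part in only one link of a given kind are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

Link = Tuple[str, str, float]  # (record_a, record_b, confidence)


def resolve_links(candidates: Iterable[Link]) -> List[Link]:
    """Keep the most confident links first; drop any link that competes with them."""
    accepted: List[Link] = []
    used_a, used_b = set(), set()
    for a, b, score in sorted(candidates, key=lambda link: link[2], reverse=True):
        if a in used_a or b in used_b:
            continue  # competes with a more confident link already accepted
        accepted.append((a, b, score))
        used_a.add(a)
        used_b.add(b)
    return accepted


# Invented candidate links between baptism and burial records.
candidates = [
    ("bap-1", "bur-1", 0.9),
    ("bap-1", "bur-2", 0.6),
    ("bap-2", "bur-1", 0.7),
    ("bap-2", "bur-2", 0.5),
]
resolved = resolve_links(candidates)
print(resolved)  # [('bap-1', 'bur-1', 0.9), ('bap-2', 'bur-2', 0.5)]
```

A single sorted pass of this kind avoids enumerating combinations of deletions, which is the practical objection raised in the quotation against searching for the cheapest set of links to delete.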
- M. Skolnick. "The Resolution of Ambiguities in Record Linkage,"
in Identifying People in the Past, ed. E. A. Wrigley, pp.102-127,
London, Edward Arnold, 1973.
The method described is based on the estimation of maximum
likelihoods, and applied to the linking of records from medieval
Italian parish registers, for purposes of genetic research. "The
method [uses] the decision making techniques developed in
artificial intelligence projects . . . The artificial intelligence
approach consists of building a family of related solutions,
developing a method of keeping the family of reasonable solutions
small, and selecting the best solution with minimum effort and
maximum accuracy. . . The frequencies of the nominal identifiers in
the records to be linked forms the basis of the likelihoods. If one
has the distribution of each forename and surname by parish, by
time period and by type of record, one can calculate the
probability that a link between two records will be made by chance.
Thus if a record has much missing data, and the only data which is
compatible consists of common names, the probability of a match,
made at random, being compatible is quite large. If there are many
identifiers, and some of them are rare names, the probability of a
compatible match occurring is quite low, and can be estimated. . .
Likelihoods are also computed from age error distributions which
are formed from records whose links are most certain. . . A
heuristic computer program, LINK, is being developed for the
resolution of ambiguities in record linking. It is being
constructed in a manner similar to the heuristic DENDRAL computer
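The idea that name frequencies determine how likely a compatible match is to arise by chance can be illustrated with the sketch below. It is not Skolnick's LINK program. It assumes that identifiers are independent, so that their relative frequencies can simply be multiplied, and it ignores the breakdown by parish, time period and type of record mentioned in the quotation. The function name and the frequency figures are invented.

```python
from typing import Mapping, Optional


def chance_compatibility(
    record: Mapping[str, Optional[str]],
    frequencies: Mapping[str, Mapping[str, float]],
) -> float:
    """Probability that a randomly drawn record agrees with every known identifier."""
    probability = 1.0
    for field, table in frequencies.items():
        value = record.get(field)
        if value is None:
            continue  # missing data cannot rule out any other record
        probability *= table[value]
    return probability


# Invented relative frequencies, for illustration only.
frequencies = {
    "forename": {"Giovanni": 0.2, "Ercole": 0.01},
    "surname": {"Rossi": 0.1, "Zanardi": 0.005},
}
sparse = {"forename": "Giovanni", "surname": None}
rich = {"forename": "Ercole", "surname": "Zanardi"}

print(chance_compatibility(sparse, frequencies))  # large: one common name only
print(chance_compatibility(rich, frequencies))    # small: two rare names
```

The two examples mirror the contrast drawn in the quotation: a record with missing data and a common name is compatible with many others by chance, whereas one with several rare identifiers is not.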
- Squire, "Expert Systems in Genealogy," Computers in Genealogy,
vol. 3, no. 8, 1990.
Briefly argues the potential utility of expert systems for various
- N.C. Stevenson. Genealogical Evidence; a guide to the standard
of proof relating to pedigrees, ancestry, heirship and family
history, Laguna Hills CA, Aegean Park Press, 1979, 233 p.
An excellent account by a lawyer and genealogist of the standards
of proof that should be sought in establishing an individual's
ancestry, whether for producing a documented genealogy, or for
legal purposes. Though aimed at an American readership, and dealing
mainly with American records and laws, it is well worthy of study
by genealogists in other countries, particularly the UK, given the
links (both of emigration, and of legal heritage) between Britain
and the USA. Quotes from many legal judgements, but nevertheless is
very readable. Provides careful and thought-provoking analyses of
the strengths and weaknesses, as genealogical evidence, of various
types of official records (vital, court, land and census records),
church and family bible records, newspaper files, monuments, etc.,
and published and private genealogies and genealogical directories
- a number of which are roundly criticised, Burke's Peerage in
particular. To quote from the introduction: " ... there are some
who believe that the rules of evidence in our legal system and in
effect in our court proceedings are too technical and not
completely practical for genealogical, historical and biographical
research. This belief is not valid." This is perhaps shown best by
the excellent chapter on "Rules of Evidence Applied to Genealogy".
The book is thus a great antidote to the motherhood statements
about the need for careful research found in many guides to
- E.R. Swart, "A Computer Simulation of the Ineradicable
Uncertainty in Genealogical Research," Family History, no. 118,
- M. Thaller. "Methods and Techniques of Historical Computation,"
in History and Computing, ed. P. Denley and D. Hopkin, pp.147-156,
Manchester Univ. Press, 1987.
A general discussion of desirable characteristics and capabilities
of database systems used for various types of "historical
computation", including record linkage.
- J.E. Vetter, J.R. Gonzalez and M.P. Gutman, "Computer-Assisted
Record Linkage Using a Relational Database System," History and
Computing, vol. 4, no. 1, pp.34-51, 1994.
"Our intent is not to present a theoretically or methodologically
correct approach to nominal record linkage, but to describe the
evolution of a semi-automated linkage technique particularly suited
to the needs and resources of a university demographic research
- J.D. Willigan and K.A. Lynch. Sources and Methods of Historical
Demography, New York, Academic Press, 1982.
A scholarly treatise on the whole subject of historical demography,
but containing only a brief section on techniques for family
reconstruction. This describes, as an example method, that of the
- I. Winchester, "The Linkage of Historical Records by Man and
Computer: techniques and problems," Journal of Interdisciplinary
History, vol. 1, pp.107-124, 1970.
- I. Winchester. "A Brief Survey of the Algorithmic, Mathematical
and Philosophical Literature relevant to historical record
linkage," in Identifying People in the Past, ed. E. A. Wrigley,
pp.128-154, London, Edward Arnold, 1973.
"This survey consists of a discussion of the chief problems of
record linkage which are relevant to historical data, followed by a
select bibliography [of 54 items]. The discussion takes the form of
a critical survey of literature about record linkage." An excellent
- E.A. Wrigley. "Family Reconstruction," in An Introduction to
English Historical Demography, ed. E.A. Wrigley, pp.96-159, 1966.
Contains a very detailed account of a manual technique, using
Family Reconstitution Forms (FRF), for organizing parish register
record linking pioneered by L. Henry. (The method to be used for
resolving ambiguities is described in less detail, and would appear
to be fairly subjective.)
- E.A. Wrigley, (Ed.). Identifying People in the Past, London,
Edward Arnold, 1973, 159 p.
From the introduction: "In recent years there has been a spate of
historical studies involving nominal record linkage on a scale
which requires the linkage rules to be set out formally and in
detail. Most of these studies have been based on parish registers
(or a comparable source of genealogical information), or on
nineteenth century census schedules. And at the same time there has
been a marked tendency to abandon manual methods in favour of
computers. The six chapters which follow represent an attempt to
describe some of the methods currently in use, and to discuss the
problems and opportunities of record linkage work."
- E.A. Wrigley and R.S. Schofield. "Nominal Record Linkage by
Computer and the Logic of Family Reconstruction," in Identifying
People in the Past, ed. E. A. Wrigley, pp.64-101, London, Edward
Detailed account of the record linkage techniques used by the
Cambridge group, including a section on Matchscoring and an
Appendix on Name Spelling. The section on demographic constraints
upon record linkage gives the following rules/heuristics:
"1. Age at death is never greater than 100 years, unless age
information in the burial record overrides this rule.
2. At the birth of a child the mother is never less than 15 or more
than 50 years old, nor the father less than 15 or more than 75.
3. No two successive birth events to the same mother occur in less
than 10 months and no three successive birth events in less than 22
4. The interval between the end of a marriage and the remarriage of
the surviving spouse must be less than 20 years.
5. First marriages (for both sexes) occur only when the bride or
the groom is above 15 years of age and less than 50, unless age
information in the marriage record overrides the rule.
6. All brides and grooms are less than 75 years of age at marriage
unless age information in the marriage record overrides the
7. Whenever an age is given in either one or both of the records
involved in a possible link the difference between the dates of
the two records must be compatible with the age information
8. Where occupations are stated in both records involved in a
possible link and they mismatch in a manner which is thought to be
incompatible even with the most extreme assumptions about lifetime
occupational mobility, no link is made (for example
9. In addition to the requirement that the names of the principal
on two records should agree before a link is made, there must also
be no disagreement about the names of any relatives named in both
records (for example, if the names of both parents are recorded at
the baptism of a child, and again when he is buried, they must not
disagree if a BAP-BUR link is to be made)."
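Several of the demographic constraints listed above translate directly into simple checks, sketched below. Only the rules whose limits are fully stated in the list are covered (rules 1, 2, 4, 5 and 6, and the two-birth part of rule 3). The function names, the use of ages in years as plain numbers, and the boolean flag standing in for "age information in the record overrides this rule" are assumptions, not part of Wrigley and Schofield's account.

```python
def plausible_age_at_death(age_years: float, age_stated_in_burial: bool = False) -> bool:
    """Rule 1: age at death never exceeds 100 unless the burial record states an age."""
    return age_stated_in_burial or age_years <= 100


def plausible_parent_ages(mother_age: float, father_age: float) -> bool:
    """Rule 2: mother aged 15 to 50 and father aged 15 to 75 at the birth of a child."""
    return 15 <= mother_age <= 50 and 15 <= father_age <= 75


def plausible_birth_interval(months_between_births: float) -> bool:
    """Rule 3, first clause: successive births are at least 10 months apart."""
    return months_between_births >= 10


def plausible_remarriage_interval(years_since_marriage_ended: float) -> bool:
    """Rule 4: remarriage follows the end of a marriage by less than 20 years."""
    return years_since_marriage_ended < 20


def plausible_first_marriage_age(age_years: float, age_stated: bool = False) -> bool:
    """Rule 5: first marriage above 15 and below 50 unless the record states an age."""
    return age_stated or 15 < age_years < 50


def plausible_marriage_age(age_years: float, age_stated: bool = False) -> bool:
    """Rule 6: brides and grooms are under 75 unless the record states an age."""
    return age_stated or age_years < 75


print(plausible_age_at_death(104))                             # False
print(plausible_age_at_death(104, age_stated_in_burial=True))  # True
print(plausible_parent_ages(mother_age=51, father_age=40))     # False
```

In a linking program, checks of this kind would be applied to each candidate link before any scoring, so that demographically impossible links never compete with plausible ones.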
- E.A. Wrigley and R.S. Schofield, (Ed.). The Population History
of England, Cambridge Univ. Press, 1989.
Note: The information provided by GENUKI must not be
used for commercial purposes, and all specific restrictions
concerning usage, copyright notices, etc., that are to be found on
individual information pages within GENUKI must be strictly adhered
to. Violation of these rules could gravely harm the cooperation
that GENUKI is obtaining from many information providers, and hence
threaten its whole future.
[ Last updated: 25 Oct 2005 - Brian Randell ]