Problems of distributed Genealogical Databases

With more genealogical data being made available for browsing on the World Wide Web, it is natural to try to avoid duplicating whole databases, or even individual records, and to use the hyperlink mechanism to cross-reference data held at other sites.

For example, two researchers hold similar family data. One, who has been researching Smith, has a marriage to a Brown. Another holds the database for Brown and also sees the marriage to Smith. On the Web, it is only necessary for each database to contain links to the appropriate place in the other. The result is a distributed database containing the union of the data in both. This is achieved without the two researchers combining their datasets into one that is held in duplicate at both sites.

The advantage of linking is that each researcher can continue to explore their own area, and amend and update the information under their domain of responsibility, without the permission or active cooperation of the others involved. Conversely, if there were only one database with several contributors, the usual configuration management and change control problems of a larger multi-author system would rear their heads.

The Challenges

Naming Conventions

To make a distributed system work, it is necessary to agree a common form of URL for accessing individual records in the different databases. Why is this necessary?

There are at least three different methods of making genealogical data available on the web. They are based on different philosophies and methods of organisation. More methods will develop as people experiment with the technology and as facilities become more advanced. We can therefore be confident that the way we do it now will not always be the way it is done. Further, the three methods have different ways of referring to individual records. We cannot assume that any one way is better than any other.

What does an individual reference look like? Here are some examples, extracted from GenWeb around the net:

<ul>
<li><a href="http://www.dcs.hull.ac.uk/cgi-bin/gedlkup/n=royal?royal1762">Daughter</a>
<li><a href="http://www.ifi.unit.no/~frode/demo/i1.html">Frode Kvam</a>
<li><a href="http://faui80.informatik.uni-erlangen.de/html/genpdb/Kaiser-Heinrich-V.html">/user80/ftp/pub/html/genpdb/Kaiser-Heinrich-V.html-8.1.1086</a>
<li><a href="http://www.dcs.hull.ac.uk/cgi-bin/gedlkup/n=royal?royal1404">Information on , Henry V, Holy Roman Emperor</a>
<li><a href="http://www.dcs.hull.ac.uk/cgi-bin/gedname/n=royal?Henry.V.*Holy.Roman.Emperor">Information on , Henry V, Holy Roman Emperor</a>
<li><a href="http://www.vest.sdata.no/cgi-bin/ll-gw/BASE=royal92/INDEX=I1?LookupInternal">I1: Victoria HANOVER (Queen of England) </a>
<li><a href="http://www.vest.sdata.no/cgi-bin/ll-gw/BASE=wathne/INDEX=I1?LookupInternal">I1: Lars NILSEN</a>
<li><a href="http://irpsbbs.ucsd.edu/gene/Wallace_Jones.html">Wallace Jones (1895)</a>
</ul>

Here we see that various methods of identifying individuals are used. Some specify the individual by file path and filename, where the path and name represent either the person's name or a person index. Others use a script to access the record, and again supply either the person's name or their index. Some also use a byte offset index.
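The difference between these styles can be seen by pulling two of the example URLs apart. The following is only a rough sketch: the function name is invented for illustration, and treating a "cgi-bin" path segment as the sign of a script-based lookup is an assumption that happens to fit the examples above, not a rule any site has agreed to.

```python
from urllib.parse import urlsplit


def describe_reference(url):
    """Split a record URL into host, path and query, and guess its style."""
    parts = urlsplit(url)
    # Assumption: a "cgi-bin" path segment marks a script-based lookup;
    # anything else is treated as a plain file on the server.
    style = "script" if "cgi-bin" in parts.path.split("/") else "file"
    return {
        "host": parts.netloc,
        "style": style,
        "path": parts.path,
        "query": parts.query,
    }


examples = [
    "http://www.dcs.hull.ac.uk/cgi-bin/gedlkup/n=royal?royal1762",
    "http://www.ifi.unit.no/~frode/demo/i1.html",
]
for url in examples:
    print(describe_reference(url))
```

Even in this small sample the record key turns up in different places (a query string in one case, a file name in the other), which is why a linking site cannot rewrite or repair another site's references mechanically.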

Lifetimes of references

One of the problems is the lifetime of the URLs. The provider of a set of genealogical data may feel free to add or edit records and to restructure the data in any way. This is their right as the data owner. The providers of such services probably regard any URL pointing into their data as purely transitory. Most of us (data providers) regard the "top-level" page for our data as the only fixed point of reference. Some data sets can be re-ordered, moved or changed at a whim, making any references to their internal points invalid.

The philosophy from the point of view of the data provider contrasts with the view from the data user. The data user will save URLs pointing at interesting pieces of data in their private Bookmark or Hotlist file for later use. The idea that the data will "go away" is alien to them.

When a data provider also becomes a data consumer, as happens when data is shared, access to the internals of a data set becomes a point of debate.

Herbert Stoyan, who has a database of a large number of the German Nobility, has proposed the following model naming scheme. This is what he uses for his data:

Url = <prelude>/<title>-<area/familyname>-<firstnames>.html
<title> = Kaiser|Koenig|Kurfuerst|Grossherzog|Herzog|Fuerst|Markgraf|Landgraf|Graf
(This could be enlarged into English titles)
<area/familyname> = Oesterreich|Preussen|...
(This is a list of all possibilities)
<firstnames> = <firstname>|<firstname>-<romannumber>|<firstname>-<firstnames>
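The grammar above can be read as a recipe for building a record URL. The sketch below is one possible reading of it, not Stoyan's own software: the function name, the example.org prelude and the sample person are all illustrative assumptions, and only the title list is taken from the scheme itself.

```python
# Titles listed in the proposed scheme.
TITLES = {
    "Kaiser", "Koenig", "Kurfuerst", "Grossherzog", "Herzog",
    "Fuerst", "Markgraf", "Landgraf", "Graf",
}


def stoyan_url(prelude, title, area, firstnames):
    """Build <prelude>/<title>-<area/familyname>-<firstnames>.html.

    firstnames is a list of first names, optionally ending in a Roman
    numeral; its items are joined with hyphens as the grammar describes.
    """
    if title not in TITLES:
        raise ValueError("title not covered by the scheme: " + title)
    if not firstnames:
        raise ValueError("at least one first name is required")
    return "{}/{}-{}-{}.html".format(prelude, title, area, "-".join(firstnames))


# Hypothetical prelude and person, for illustration only.
print(stoyan_url("http://example.org/genpdb", "Koenig", "Preussen", ["Friedrich", "II"]))
```

The function has to reject anyone without one of the listed titles, which is the limitation discussed next.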

This, of course, only works for people who hold a noble title of this form. It has been suggested that the scheme could be extended to the public at large by including a date, such as a birth date. On the surface this may seem a fine solution, but it runs aground on the same rocks that have sunk many other attempts to use name and date of birth as a universal identifier. Regular readers of comp.risks will be familiar with the issue. For the genealogist, things are even more complicated than they are for, say, the Social Security office, because not all the details about an individual in our database are known. For many entries in my database the date of birth is not known; worse, for many of them the complete name is not known either. The purpose of genealogical study is often to find out some of the missing details. Supporting access only to records that are completely researched seems, in my view, to defeat the purpose of using GenWeb to improve genealogical resource provision in the first place.

The only unique and unchanging way of naming individuals that I have discovered comes from societies with traditional oral genealogical histories, for example the Nordic and Icelandic sagas and the Gaelic or Celtic histories. In those traditions people have names like Erik Magnusson Haraldsson Ignoldsson or Ruaidri macToirrdellbaig moicConchobaig O'Brien or Tewdwr ap Gryffydd ap Gwynedd and so on. These strings of names can continue for as long as is necessary to link an individual to a unique ancestor. This may work for the male lines in these traditions, but it often breaks down for females in these patriarchal societies. Long strings of forenames used to make unambiguous name references are unwieldy and would not make a reasonable standard to adopt, even though they are often the de facto method of referring to people in many English-language genealogical and lineage works of reference (such as "The Complete Peerage").

Suggestions

Summary

To summarise the main points, we can say that:

Points to Consider

I would suggest that the providers of genealogical data on the web need to consider the following points:

If we accept that this is inevitable, we must also then consider the following corollaries:

Further Thoughts

It seems that record numbering cannot be considered to have an infinite lifetime, as database reorganisation is always a future, as yet unforeseen, possibility. If records should be accessed by name, do we then need to agree a specific syntax for that name so that every site uses the same one? No, probably not. All that is necessary is that each site is self-consistent.

The problem with using names as the only means of access is one of performance. The index number is, in the short term, a much more rapid method of accessing an individual record.

Perhaps what is needed is a standard enquiry that asks any genealogical system to translate a short-term name into a permanent one.
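As a rough illustration of such an enquiry, the sketch below keeps a fast short-term index alongside a site-consistent name and lets a caller translate one into the other. Everything in it is hypothetical: the class, its method names, the index values and the form of the permanent name are assumptions made for the example, not an existing GenWeb interface.

```python
class RecordStore:
    """Toy data set keyed by a volatile index and by a site-consistent name."""

    def __init__(self):
        self._records = {}  # short-term index -> (permanent name, record)
        self._names = {}    # permanent name -> current short-term index

    def add(self, index, name, record):
        self._records[index] = (name, record)
        self._names[name] = index

    def permanent_name(self, index):
        """The enquiry: translate a short-term index into a lasting name."""
        return self._records[index][0]

    def by_index(self, index):
        """Fast access, valid only until the next reorganisation."""
        return self._records[index][1]

    def by_name(self, name):
        """Slower in a real system, but survives renumbering."""
        return self._records[self._names[name]][1]

    def renumber(self, old_to_new):
        """Simulate a reorganisation that changes the indexes."""
        self._records = {old_to_new[i]: v for i, v in self._records.items()}
        self._names = {name: i for i, (name, _) in self._records.items()}


store = RecordStore()
store.add("I1", "Henry.V.Holy.Roman.Emperor", {"title": "Holy Roman Emperor"})
name = store.permanent_name("I1")  # ask once, while the index is still valid
store.renumber({"I1": "I207"})     # the data owner reorganises the data
print(store.by_name(name))         # the saved name still finds the record
```

A linking site would store the permanent name in its own pages and use the index only within a single session.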

Other Related Proposals

So far I have looked at the problem of what happens when a genealogical record pointer may change with time, but we also have the problem of site name or database name drift over time.

Birger Wathne considers that centralising access to genealogical records may be of benefit, and has made the following proposal:

Either let all GenWeb URL's be of the form
http://genweb.fixed.location/genweb-url/BASE=100234/INDEX=MyIndex-123
Where

genweb.fixed.location would be a hostname pointed to by a high-level DNS server, so we could be sure the name would be fixed even if the server has to move. I wonder if there is a DNS namespace for fixed services?

genweb-url is a gateway program that returns the 'real' URL to the data.

BASE= points to a unique database descriptor each provider gets when registering his base(s). This is used by the genweb-url program to build the real URL.

INDEX= is the private key into the database. The contents of this part should be left to the provider of the base.
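To make the proposal concrete, here is a minimal sketch of what the genweb-url gateway might do with such a request. The registry contents, the example.org provider address and the template format are assumptions made for illustration; the proposal itself only says that BASE selects a registered database descriptor and that INDEX is left to the provider.

```python
from urllib.parse import quote

# Hypothetical registry: BASE identifier -> provider's URL template.
REGISTRY = {
    "100234": "http://genealogy.example.org/cgi-bin/lookup?{index}",
}


def resolve(path):
    """Turn '/genweb-url/BASE=.../INDEX=...' into the provider's real URL.

    Returns None when the path is incomplete or the BASE is not registered.
    """
    fields = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
    template = REGISTRY.get(fields.get("BASE"))
    if template is None or "INDEX" not in fields:
        return None
    # INDEX is opaque to the gateway; it is passed through to the provider.
    return template.format(index=quote(fields["INDEX"]))


print(resolve("/genweb-url/BASE=100234/INDEX=MyIndex-123"))
```

In a running service the result would be returned to the browser as an HTTP redirect. The gateway only hides the host name, because the INDEX part is still whatever the provider chooses and can still change.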

The moving host or database problem can often be solved by mechanisms already built into web servers and associated software, without resorting to a centralised authoritarian scheme. If an individual database moves to another server, the name mapping and proxy serving capabilities of a modern HTTP server can be used to fulfil a request from the new location. If a site changes name altogether, the aliasing mechanisms in DNS can be used to leave dormant pointers from the old name to the new one.

The centralising scheme does not resolve the problem of naming and locating records; in fact it perpetuates it. It could be used to provide a centralised enquiry bureau for finding suitable data, but this was not part of the original proposal. A centralised genweb server is much like putting the Mormon IGI online.

Another problem with centralising authority for access to genealogical records is that it creates a bottleneck and a single point of failure. A non-centralised scheme has the advantages of resilience and a multiplicity of contributors. If one person gets tired, bored or overloaded, no one else's work needs to stop and wait.


Discussion Group

This matter is discussed on the email list genweb@ucsd.edu. To subscribe contact genweb-request@ucsd.edu. The archives of these discussions are held at http://demo.genweb.org/genweblist/genweblist.html
Brian Tompsett
Department of Computer Science

Hull, UK, HU6 7RX
B.C.Tompsett@dcs.hull.ac.uk