Introduction

The current experiment has several known problems and limitations in data capture, rendering, display and searching. Some of these are detailed below.

Multiple parents

It is quite common for a person to have multiple sets of parents, in "everyday" as well as royal genealogy. People can, for example, have different birth and adoptive parents. The child might need to be listed under both sets of parents, as they derive attributes from both. Similarly, when looking at the child's record, both sets of parents might need to be listed.

An example, in the Royal data set, is the Swedish royal house of Bernadotte. The founder of the Bernadotte line was French but was adopted into the Swedish royal family when he was elected heir to the throne. He would therefore need to be listed as a successor to earlier monarchs, but his true origins would still need recording.

We are not, obviously, limited to just two sets of parents. People can have biological parents as well as several sets of step, foster and adoptive parents. When this situation occurs with historical figures, as in the royal database, these facts are of historical consequence and need recording and displaying appropriately.
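As a rough illustration of what recording several sets of parents involves, the sketch below keeps one link per child and family, tagged with the kind of relationship, and indexes the links in both directions. The class names, the identifiers such as "I1" and "F2", and the relationship labels are assumptions made for this example only; they are not taken from GEDCOM or from any particular tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ParentLink:
    child: str   # hypothetical individual identifier, e.g. "I1"
    family: str  # hypothetical family identifier, e.g. "F2"
    kind: str    # e.g. "birth", "adoptive", "step", "foster"


class Pedigree:
    """Holds any number of parent families per child, indexed both ways."""

    def __init__(self):
        self._by_child = defaultdict(list)
        self._by_family = defaultdict(list)

    def add(self, child, family, kind):
        link = ParentLink(child, family, kind)
        self._by_child[child].append(link)
        self._by_family[family].append(link)

    def families_of(self, child):
        # Every set of parents for the child, in the order recorded.
        return [(l.family, l.kind) for l in self._by_child.get(child, [])]

    def children_of(self, family):
        # Every child listed under the family, whatever the relationship.
        return [(l.child, l.kind) for l in self._by_family.get(family, [])]


pedigree = Pedigree()
pedigree.add("I1", "F1", "birth")
pedigree.add("I1", "F2", "adoptive")
```

With this arrangement the child's record can list both families, and each family's record can list the child together with the nature of the relationship.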

The problems

The problem here is the great variance in the tools concerned. The raw data from which the GEDCOM files are generated is often collected using various pieces of genealogy-specific software. Some of these run on PCs, some on Macs and some on Unix. Some store the data in their own internal forms and generate GEDCOM by post-processing, and some use GEDCOM as their native storage format. Irrespective of the internal storage format, not all the tools are able to render the necessary information in GEDCOM format.

The inability of the tools either to capture the necessary details, or to render them in GEDCOM, leads directly to the poor quality of the data displayed in this experiment.

Wider character sets

As discussed in the previous section, we are limited by the data collection tools and their ability to render GEDCOM files. This argument also applies to the problem of character sets.

Not all data is conveniently rendered in 7-bit ASCII. The results are much more attractive and understandable if the native language representation of information is used. Further, avoiding the information loss that attends transliteration from the forms used in sources to an ASCII-only representation means that we must accept and display accented and native character sets.

The problem with accented and extended character sets is that there are several physical representations of them. A PC may use the PC character set, along with the restricted number of characters that it implies. A Unix user may have access to the ISO extended Latin-1 alphabet.

The HTML standard itself uses the ISO Latin-1 alphabet, but this is a lot more restrictive than the set specified in the GEDCOM 5.3 standard. The GEDCOM standard copes with the wide variety of possibilities, but leaves us with the problem that in standard HTML there is no way of communicating and displaying the characters above and beyond ISO Latin-1. This is quite a problem when just looking at plain accented characters. It is worse still when considering scripts such as Chinese, Japanese (Kanji, Katakana etc.), Arabic and similar non-Latin texts.

We must also not forget that Web pages can be displayed in a non-graphic environment such as a vt100 terminal! How should the data be rendered for these users, and how should the loss of information content be moderated?
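One possible way of moderating the loss for ASCII-only displays is sketched below. It strips accents by Unicode decomposition, replaces anything that still cannot be shown with "?", and reports whether the result differs from the original so that the page can warn the reader. This is only a sketch under the assumption that the names are already available as Unicode text; the function name ascii_fallback is invented for this example, and stripping accents is a crude substitute for proper transliteration.

```python
import unicodedata


def ascii_fallback(text):
    """Return (ascii_text, lossy) for display on an ASCII-only terminal."""
    # Split accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII becomes "?" rather than being dropped silently.
    ascii_text = stripped.encode("ascii", "replace").decode("ascii")
    return ascii_text, ascii_text != text
```

The second value of the pair lets the display layer mark a name as approximated, so the user at least knows that information has been lost.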

Very long names and titles

This note is just to remind us that we certainly have problems with people whose names are too long for the data capture tools. Their names will be truncated in the GEDCOM, and hence in the HTML form.

This could, perhaps, be solved by using a notation for standard abbreviations which can then be expanded on display in order to render the correct form of name or title.

If you want an example, then consider Frederick Ferdinand III, Duke of Schleswig-Holstein-Wolfenbuttel.
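The abbreviation idea could work roughly as follows. The sketch assumes a purely hypothetical notation in which a code in braces, such as {DK}, stands for a longer phrase, and a table of such codes is consulted at display time. Neither the notation nor the codes come from GEDCOM or from any existing tool.

```python
import re

# Hypothetical abbreviation table; the codes are invented for illustration.
ABBREVIATIONS = {
    "DK": "Duke of",
    "SHW": "Schleswig-Holstein-Wolfenbuttel",
}

_TOKEN = re.compile(r"\{([A-Z0-9]+)\}")


def expand(stored, table=ABBREVIATIONS):
    """Expand {CODE} tokens in a stored name; unknown codes are left as-is."""
    return _TOKEN.sub(lambda m: table.get(m.group(1), m.group(0)), stored)


short_form = "Frederick Ferdinand III, {DK} {SHW}"
full_form = expand(short_form)
```

Here the stored form is 35 characters long, while the expanded form restores the full name and title for display.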

Large numbers of spouses or children

Again, this is a problem related to the physical limits of the data collection tools. Consider the example of Emperor Charlemagne, who had more than ten concubines and spouses, with children by most of them, or Edward I, whose fecundity is indisputable!

If the tools that originate the GEDCOM cannot cope with the complexity, then the Web displayed version will be equally deficient.

Xrefs between people

Many of the notes for individuals contain comments that refer to other people who are also in the database. In a proper hypermedia system these references should be hot-links that can be followed. The latest GEDCOM 5.3 specification does allow for XREF entities, but not all tools know how to use them.

The GED to CGI postprocessing tool would need to process this field accordingly.
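A minimal sketch of such processing is given below. It assumes that a reference appears in the note text as a GEDCOM-style pointer between "@" signs, such as @I7@, and that each person's page is reachable under a URL prefix. The prefix "/cgi-bin/person?id=", the function name link_xrefs and the sample data are hypothetical and do not describe the experiment's actual tool.

```python
import html
import re

_XREF = re.compile(r"@([A-Za-z0-9_]+)@")


def link_xrefs(note, names, url_prefix="/cgi-bin/person?id="):
    """Turn @ID@ pointers in a note into HTML anchors.

    `names` maps an identifier to a display name. Pointers that are not
    in `names` are left untouched rather than linked to a missing page.
    """
    escaped = html.escape(note)  # make the free text safe for HTML first

    def to_anchor(match):
        xref = match.group(1)
        if xref not in names:
            return match.group(0)
        label = html.escape(names[xref])
        return f'<a href="{url_prefix}{xref}">{label}</a>'

    return _XREF.sub(to_anchor, escaped)


linked = link_xrefs("Adopted by @I7@ & crowned in 1818", {"I7": "Charles XIII"})
```

Leaving unknown pointers as plain text means that a note referring to someone outside the database degrades gracefully instead of producing a dead link.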

Multiple language rendering of base data

The multinational nature of large genealogical works gives rise to linguistic problems. The native language of the person in the database, the language of the data source and the language of the enquirer may all be different. In an ideal world it would be nice to be able to cope with this.

For example, Harold, a Norman from England, became Monarch in Lithuania. His name can be rendered in English, French, Lithuanian, Russian Cyrillic and so on. All these representations would be "correct". No one form should really have precedence over the others for storage keys or enquiry. Standardising on one form only would mean that this person could not be found by people from a language group who are unaware of how the name is rendered in the chosen form.
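One way to avoid giving any rendering precedence is to key the person on a neutral identifier and index every known rendering of the name against it. The sketch below does this with a simple in-memory index. The class name NameIndex, the identifier "I1" and the language tags are assumptions made for illustration, and the sample renderings are those of the founder of the Bernadotte line.

```python
from collections import defaultdict


def normalise(name):
    # Case-insensitive comparison with runs of whitespace collapsed.
    return " ".join(name.casefold().split())


class NameIndex:
    """Maps every rendering of a name to the same neutral person identifier."""

    def __init__(self):
        self._ids = defaultdict(set)

    def add(self, person_id, renderings):
        # `renderings` maps a language tag to the name in that language.
        for name in renderings.values():
            self._ids[normalise(name)].add(person_id)

    def find(self, query):
        return set(self._ids.get(normalise(query), ()))


index = NameIndex()
index.add("I1", {
    "en": "Charles XIV John",
    "sv": "Karl XIV Johan",
    "fr": "Charles XIV Jean",
    "ru": "Карл XIV Юхан",
})
```

An enquirer can then search with whichever form they know, and all forms lead to the same record.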

Multimedia

The current web system does allow multimedia information, such as digitised portraits, photographs, sound recordings and moving pictures, to be linked to a source record. Some data entry tools have forms for noting some links. Some of these tools support only some formats and not others. The HTTP tools similarly support only a limited variety of formats, both of the data and of the ways of linking.

The GEDCOM 5.3 standard does have entity specifications for multimedia forms, but very few tools know how to use them. We therefore have problems both of referring to the data and of presenting it in a standardised way.

A further point is the sheer difficulty of the data gathering task for multimedia as well as the volume of archive space it consumes.

Navigation

Once the dataset has been collected, navigation is a problem. The use of a web browser and HTML is an attempt to solve this for shared datasets, but with large and complex data sets the problem never goes away.

I have attempted to address this problem by creating several search modes or styles, such as search by date, title, forename, surname and so on, as well as a set of directed indexes pointing into threads in the data. There is a need to specify more powerful or more specific searches using techniques such as HTML Forms, but this places larger and larger loads on the database host. Relying more on remote searching, such as WAIS, would alleviate this problem.
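To give a feel for how several search modes can share one dataset without scanning it on every query, the sketch below builds one index per searchable field in a single pass, so that each search becomes a dictionary lookup. The record layout, with keys such as "id" and "surname", and the function names are assumptions for this example, not the structure used by the experiment.

```python
from collections import defaultdict


def build_indexes(people, fields=("surname", "forename", "title")):
    """Build one exact-match index per field from a list of person records."""
    indexes = {f: defaultdict(list) for f in fields}
    for person in people:
        for f in fields:
            value = person.get(f)
            if value:  # skip records where the field is missing or empty
                indexes[f][value.casefold()].append(person["id"])
    return indexes


def search(indexes, field, value):
    # Case-insensitive exact match on a single field.
    return list(indexes[field].get(value.casefold(), []))


# Illustrative records only.
people = [
    {"id": "I1", "forename": "Edward", "title": "King"},
    {"id": "I2", "forename": "Edward", "surname": "Smith"},
]
indexes = build_indexes(people)
```

Because the indexes are built once, the cost of each additional search mode falls mainly on the build step rather than on the host at query time.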

For further thoughts on Navigation see also my search Help text.

The experiment home page is GEDCOM to Web Experiment.


Brian Tompsett
Department of Computer Science
University of Hull
Hull, UK, HU6 7RX
B.C.Tompsett@dcs.hull.ac.uk