A Genealogy Data Model (Matrix Algebra) Specification

http://jytangledweb.org/genealogy/model/

John H. Yates

Last Update: Fri Apr 22 12:05 EDT 2011

Initial Version Released: Apr 19, 2011

I am planning to take any discussion of this to the rootsweb mailing list GENSOFT. I checked the archives and there has been almost no traffic there recently, but the group is still alive. So if you want to discuss this via a mailing list, subscribe to GENSOFT at rootsweb.com. I have subscribed so if anyone wants to post something there, I'll see it.

I think I have now succinctly defined the problem and its solution by focusing on the single thing that needs to be done to allow any vendor to achieve a solution. Or a startup, of course. See below. Not that it will be easy, but it could turn out not to be difficult. I guess we shall see.

The simple thing that genealogy program vendors could do immediately to move toward the future would be to generalize the number of variables that their programs input and store internally in their databases. A very simple step that could be done with little impact on their users. This is a required step toward the ability to represent data in a manner that can be transformed to other programs without data loss. But it is not necessary if vendors want to lock your data in to their program by ensuring data loss when moving the data to another program. Choose your side, vendors or users.

I present here the outline of a data model for genealogy data. I have presented my views on this over the last three or more years on various news groups and mailing lists. I used terminology that probably went over the head of almost all of my readers. Here I attempt to show what my words meant by making more explicit the model I was proposing.

An introductory knowledge of matrix algebra is needed, but that is not difficult, and many may be able to learn a bit of matrix algebra simply by reading my description here. Everything you need to know about matrix algebra (and more) for the discussion here is found at: http://en.wikipedia.org/wiki/Matrix_theory.

I welcome constructive criticism, and/or extensions of these concepts. Just email me by clicking on my name at the top of this page.

After some more pondering and some feedback on a couple of news groups some of my ideas have begun to gel.

I described the need for someone like Edgar Frank "Ted" Codd to do for genealogy what Codd did for relational databases. Upon reflection, he has. If the genealogical data representation followed his rules, including the five levels of data normalization, I believe the problem would be solved for data transfer. I believe the entire problem with genealogical data transfer between programs is wholly caused by not adhering to the five levels of data normalization. That is the root of the problem!

The treatment below shows what happens to a Name when it is not represented in a properly normalized form.

A complete set of variables is a required step toward normalization, and thus is required if it is going to be possible to exchange data between programs without loss. Nothing more and nothing less is required. The reason is that if you have a complete set of independent variables, then by necessity it must be mappable onto any other complete set of variables. The variables may have different names, but that is a trivial and unimportant difference.

The explicit Name example below shows what happens when you have an incomplete set of variables representing Name. When the input variable of Name is a linear combination of what should be independent variables, you don't have a complete set of variables. Thus it is not uniquely decomposable into a set of independent variables, and any way you slice it, you are going to have data loss (without additional hokey "rules", anyway, and that is not the elegant solution required).

So the whole business of genealogical data transfer can be solved if one group is tasked with the responsibility of deciding what the complete list of variables is (data types would help the programmers, but even that is not strictly necessary). Presentation layers would be required to present that data, but at its core, the problem would be solved.

That is it. No other design work is needed. I really think this should be the current focus of those trying to solve the problem of data transfer. All other decisions are peripheral to it (like what language to code it in, whether to use an XML scheme, etc.). Solve the complete list of data required, and vendors can code it in any way they choose. It would be smart to use something like XML data structures, but that is not at all necessary to be able to exchange data without loss. A program in any language can export all the variable data to some flat template (like GEDCOM), and a program in any language can be easily taught to import that flat template, even if internally it is a full relational database.

If program vendors adopt the complete list, it will then be mathematically possible to map that data set to any other vendor's data set. Any vendor can choose to not use some of the variables, and this might render their program incomplete, and possibly not able to store variables important to some users, and possibly use linear combinations of what should be independent variables (as in the Name example below), but their data would be uniquely transformed to the data of another program with a simple matrix transformation if they would agree to use a full set of independent variables specified by some worldwide genealogical group of authority. (I know, where are we going to find agreement with one of those!). But this succinctly states the problem and its solution.

Vendor adoption of a complete set of variables in their data model would allow them to achieve full data transfers without loss, and without changing anything that their end user sees, if they so wish. Of course they SHOULD adapt to inputting each of the full set of variables properly. If they want to rely on hokey ways to parse a name string to guess at its pieces, they are welcome to do that, but they will also have trouble populating the proper fields in the full data set to transport to other programs. The same goes for address fields: they should not be entered as one string of the form City, County, State, Country; each of those should be input as a separate field. I'm having a problem with one program that enters this type of data: I never know whether a value is going in the city or county field, especially when I don't have the city. Counting commas is prone to much error. Each field should have a separate box so the end user has full control over which variable is being specified where.
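To make the comma-counting problem concrete, here is a minimal Python sketch. The function name and the sample place values are my own illustration, not taken from any particular program; the "rule" shown is only an assumed example of positional parsing.

```python
def parse_place_by_commas(text):
    """Naive rule (assumed for illustration): treat the input as
    'City, County, State, Country' and fill the fields by position."""
    keys = ["City", "County", "State", "Country"]
    parts = [p.strip() for p in text.split(",")]
    return dict(zip(keys, parts))


# The user knows the county but not the city, and types three pieces.
guessed = parse_place_by_commas("Sangamon, Illinois, USA")
print(guessed)
# {'City': 'Sangamon', 'County': 'Illinois', 'State': 'USA'}

# With one input box per variable there is nothing to guess.
entered = {"City": "", "County": "Sangamon", "State": "Illinois", "Country": "USA"}
print(entered["County"])  # Sangamon
```

Every value in the single-string version lands one field too early, and the program has no way to notice.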

And it will be possible for the central group specifying what the complete variable list is, to add variables to it, as necessary. Not willy nilly, but by careful consideration and approval. If the internal data model used by a vendor is an elegant relational database model, they will have no trouble adding/modifying/deleting variables as needed. The specification of a complete genealogical data set will evolve with time.

I'll be working more on this web site and its concepts over the next weeks. I apologize for its crudeness right now.

It is my opinion that all the talk about GEDCOM and improving it will never make substantial progress until someone gets the data model correct. These notes are an attempt to shed light on a way that one can "get it right" and why that means that then the data transfer problem is solved.

Today, every genealogy program has its own version of a data model. Some are better than others, but it is my opinion that none are general enough for the genealogist's holy grail of being able to use GEDCOM to move data between programs without loss.

I used several programs to export a GEDCOM and see that they use one field for Name but split it into two variables by putting slashes around the last name. As: GivenName /Surname/ . It would be a better model to have used two variables to start with, and in light of the discussion above, three variables.
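As a rough illustration of what that slash convention gives an importing program, here is a small Python sketch. The function name is hypothetical and the handling is deliberately minimal (one slash-delimited surname, anything after the closing slash ignored); it is not a full GEDCOM parser.

```python
from typing import Tuple


def split_gedcom_name(value: str) -> Tuple[str, str]:
    """Split a name value of the form 'GivenName /Surname/' into two parts.

    Assumption for this sketch: at most one slash-delimited surname;
    text after the closing slash is ignored.
    """
    before, slash, rest = value.partition("/")
    if not slash:
        # No slashes at all: nothing marks which part is the surname.
        return value.strip(), ""
    surname, _, _ = rest.partition("/")
    return before.strip(), surname.strip()


given, surname = split_gedcom_name("Mary Ann Susan /Smith/")
print(given)    # Mary Ann Susan
print(surname)  # Smith
```

The slashes recover two variables from the one field, but nothing in the string says how "Mary Ann Susan" should be divided further.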

A look at the number of Name variables that are under the user's control at the data entry point for some programs shows: Legacy 2 (Given, Surname); TMG 2 (GivenName, Surname) [but also something called PreSurname, I don't know what that is used for yet, so they may have 3 if you wish]; FTM 1 (Name).

The Legacy input box for Name in an address field drops back to simply one variable for Name. I suggest that wherever Name input is found, it should use the same number of input variables consistently.

A genealogist cares little about what program they use; they do care that it is the database repository for their hard won data. I submit that a proper database model is all that a genealogist really wants or needs. The rest is just a presentation layer: reports, charts, and other output. People pick a genealogy program based not on its data model, sadly, but on what the presentation layer looks like. Ideally, the core data model should be sufficiently general that this makes no difference, but in practice it does make a significant difference and some programs have better data models than others. The best way to see this is by the fact that GEDCOM exports and imports suffer data loss. If the data model were complete, there would be no data loss.

I understand that program vendors today care little about exporting their data to other programs, but do care a lot about importing data from other programs into their programs. They should care about a complete data model, however, if they want to stay in the game as better vendors come along and do that. Ones that do will eventually eat the lunch of those that don't offer a complete data model at the core. And those that don't will die out over time.

Vendors should appreciate that there is still a lot of room for differentiation from other vendors in the presentation layer. Customers will still choose based on that. But a complete data model will allow a customer to completely, without loss of data, move to another program if they wish.

If all the genealogy programs used a complete (in a mathematical sense) data model, they would only be differentiated by their presentation layers.

Now for the model. This is a beginning attempt at writing it down and thinking some of it through. I hope to make it better over time, and maybe fill in treatments of the parts I will leave to others for now.

I welcome constructive criticism, and even help in developing the model further. I had hoped others would do this, but it doesn't look like that is happening.

Let's begin by analyzing the simple data representation of a name. It is simple because it shows the significance of a complete data representation of it, why current programs fail at that, and gives a mathematical representation that shows how one can take the complete data representation for it and with a simple transformation matrix produce the incomplete representation used by programs today. And it will become obvious why one cannot use a simple matrix transformation to take the incomplete representation back to the complete representation. That is the current state of GEDCOM data transformations. It also shows how easy it is to model a complete set of data that will allow for forward and backward data transformations. Said in another way, how to completely transform one program's data to another program's data, with no loss of data at all. Let's start with the output of a name, then construct it as two genealogy programs do, and then show how I would construct it in a complete basis set. And show the mathematics of how easy it is to transform things, but only from the complete set to subsets, not from subsets back to the complete set.

Name="Mary Ann Susan Smith"

Anticipating the matrix algebra below, let me represent that as:

Name=Mary + Ann + Susan + Smith

Where the "+"'s are not to appear in the output.

One genealogy program, let's say Program A, treats this as one datum. It uses programming tricks, um rules, to decide what is the GivenName and the Surname. Do you see a problem brewing? It may work for most cases, but what if the user wants reports to use her name as people called her, "Mary Ann"? Immediately problematic. And the guessing that the Surname is Smith because it is the last field in the input is also prone to problems beyond the control of the end user. These may be relatively rare, but why should we accept mediocre when perfection is a simple data model extension away?

Program B is a bit smarter, and uses two input fields for Name. GivenName and Surname. Then it constructs Name as:

GivenName=Mary + Ann + Susan
Surname=Smith
and thus
Name=GivenName + Surname
Name=Mary + Ann + Susan + Smith

That is better. But what about when the user wanted to have "Mary Ann" as the GivenName that appears in reports? Not so easy, is it?

So let me generalize once more and represent Name by three variables in the basis set. GivenName1, GivenName2, and Surname. Then the Name would be:

Name=GivenName1 + GivenName2 + Surname

Then we have control over the GivenName as:

GivenName1=Mary Ann
GivenName2=Susan
Surname=Smith
and
Name=GivenName1 + GivenName2 + Surname = Mary Ann + Susan + Smith
and in print output = Mary Ann Susan Smith
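A minimal Python sketch of this three-variable Name may help. The class and field names are my own illustrative choices, and joining with single spaces is an assumption about the print output.

```python
from dataclasses import dataclass


@dataclass
class Name:
    given_name1: str  # the name the person went by, e.g. "Mary Ann"
    given_name2: str  # remaining given names, e.g. "Susan"
    surname: str

    def full(self) -> str:
        """Join the non-empty parts with single spaces for print output."""
        parts = [self.given_name1, self.given_name2, self.surname]
        return " ".join(p for p in parts if p)


mary = Name("Mary Ann", "Susan", "Smith")
print(mary.full())       # Mary Ann Susan Smith
print(mary.given_name1)  # Mary Ann
```

Because "Mary Ann" is stored as its own variable, a report can use it directly without guessing where the given name ends.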

This is, of course, not completely general either. Perhaps each piece of the name should have its own basis set variable. But what would that gain for us? I submit that it gains us nothing important. I'm willing to be overruled. But I think it is complete enough to be used to the satisfaction of almost all genealogists.

Now, let's look at the matrix representations of the transformations between the complete set name, Name, and the name representation in Program A and Program B.

Program A, which represents the whole name in one variable is very easy. Program A represents the name in one dimension and the general basis represents it in 3. We can project out Program A's representation from the three dimensional representation as:

           [1 1 1] [GivenName1]   [GivenName1 + GivenName2 + Surname]
Name(A) =  [0 0 0] [GivenName2] = [                0                ]
           [0 0 0] [Surname   ]   [                0                ]

so on output, Name(A)=GivenName1 + GivenName2 + Surname = Mary Ann + Susan + Smith

We have reduced the three dimensional Name complete representation to a one dimensional representation by representing the complete name by a dimension 3 vector and multiplying it by a 3 x 3 transformation matrix. Only the first component of the result is nonzero, and it holds the whole name. This is the equivalent of a GEDCOM transformation from the data of a complete representation to the Program A representation. What does the transformation from the complete Name set to that of Program B look like? This:


          [GivenName1 + GivenName2]    [1 1 0] [GivenName1]
Name(B) = [Surname                ] =  [0 0 1] [GivenName2]
          [   0                   ]    [0 0 0] [Surname   ]

so on output, Name(B)=GivenName1 + GivenName2 + Surname = Mary Ann + Susan + Smith

For completeness of transformation matrices, the identity matrix (ones along the diagonal) is the transformation matrix that transforms the Name basis set to the output for the complete set. That is:

       [1 0 0] [GivenName1]
Name = [0 1 0] [GivenName2] = GivenName1 + GivenName2 + Surname = Mary Ann + Susan + Smith
       [0 0 1] [Surname   ]
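The three transformations can be tried out with a few lines of Python. This is only a sketch: the helper name `apply` is hypothetical, and the "+" of the notation above is modeled as joining strings with a space, with an all-zero row giving an empty string in place of 0.

```python
def apply(matrix, vector):
    """Multiply a 0/1 matrix by a vector of strings.

    "+" is modeled as joining with a space; an all-zero row
    yields the empty string (standing in for 0).
    """
    result = []
    for row in matrix:
        picked = [v for coeff, v in zip(row, vector) if coeff == 1 and v]
        result.append(" ".join(picked))
    return result


complete = ["Mary Ann", "Susan", "Smith"]  # GivenName1, GivenName2, Surname

T_A = [[1, 1, 1], [0, 0, 0], [0, 0, 0]]  # Program A: one Name variable
T_B = [[1, 1, 0], [0, 0, 1], [0, 0, 0]]  # Program B: GivenName, Surname
T_I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # identity: the complete set

print(apply(T_A, complete))  # ['Mary Ann Susan Smith', '', '']
print(apply(T_B, complete))  # ['Mary Ann Susan', 'Smith', '']
print(apply(T_I, complete))  # ['Mary Ann', 'Susan', 'Smith']
```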

All elementary matrix manipulations. All produce the correct full name. And the transformation exists to project out the Name from the complete set to that in Program A and Program B. The problem comes when we want to transform the name data from Program A or Program B to that of the full set.

In fact, it can't be done elegantly. It can only be done by additional programming assumptions. Like the last field is Surname and the first field is GivenName or GivenName1.

Not elegant. Nor very useful if you are concerned about simple matrix transformations that will easily allow data conversion from one program's set to another.
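In matrix terms, the transformation matrices for Program A and Program B have zero determinant, so no inverse matrix exists. The Python sketch below (helper names `apply` and `det3` are mine) checks this and shows two different complete names that collapse to the same Program B data.

```python
def apply(matrix, vector):
    """Multiply a 0/1 matrix by a vector of strings ("+" joins with a space)."""
    return [
        " ".join(v for coeff, v in zip(row, vector) if coeff == 1 and v)
        for row in matrix
    ]


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


T_A = [[1, 1, 1], [0, 0, 0], [0, 0, 0]]  # Program A
T_B = [[1, 1, 0], [0, 0, 1], [0, 0, 0]]  # Program B
T_I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # complete set

print(det3(T_A), det3(T_B), det3(T_I))  # 0 0 1

# Two different complete representations ...
first = ["Mary Ann", "Susan", "Smith"]
second = ["Mary", "Ann Susan", "Smith"]

# ... become indistinguishable once stored in Program B's variables.
print(apply(T_B, first) == apply(T_B, second))  # True
```

Once both inputs map to the same output, no matrix (and no rule that does not guess) can tell which one the user originally entered.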

This is why genealogists first must decide what data variables comprise a complete set before coding the program even begins. Otherwise you will never achieve the holy grail of a general data transfer without loss.

Those genealogy programs that use an incomplete set of variables to represent the data will never be able to transform that data uniquely into the data of another program. Those programs that do use a complete set of variables to represent the data, will be able to transform their data to any other program, where the target data set is either complete or incomplete.

This treatment also shows that this data model specification precedes even the decision of what computer language(s) to use. The data model is determined by what pieces of data that genealogists need, and a complete (or complete enough) representation of that. Whether it is coded in XML data structures or Mickey Mouse structures does not matter.

The next harder, but still simple, example of this would be representing an address field. A general basis set needs to be proposed and accepted by genealogists. A simple start at a complete basis set, for U.S. addresses, could look like this (let's ignore the Name field as we have dealt with that case already):

          [1 0 0 0 0 0 0] [StreetNo ]
          [0 1 0 0 0 0 0] [Street   ]
          [0 0 1 0 0 0 0] [City     ]
Address = [0 0 0 1 0 0 0] [County   ]
          [0 0 0 0 1 0 0] [State    ]
          [0 0 0 0 0 1 0] [ZipCode  ]
          [0 0 0 0 0 0 1] [ZipCode+4]

We could add Country, etc. That would be for a round table of smart genealogists to determine what a complete set would be. Non-US addresses would require some different fields. But the matrix representation and transformation of that should be clear.

Field formatting of variable types would be handled by the presentation layer code according to how the data should appear in that specific output. e.g. the address pieces in a source reference are "run on" in one line, but if you are addressing envelopes it would be in that format.

How would we map a full address to just City and State? Like this:


                      [0 0 0 0 0 0 0] [StreetNo ]   [  0  ]
                      [0 0 0 0 0 0 0] [Street   ]   [  0  ]
                      [0 0 1 0 0 0 0] [City     ]   [City ]
Address(City+State) = [0 0 0 0 0 0 0] [County   ] = [  0  ]
                      [0 0 0 0 1 0 0] [State    ]   [State]
                      [0 0 0 0 0 0 0] [ZipCode  ]   [  0  ]
                      [0 0 0 0 0 0 0] [ZipCode+4]   [  0  ]
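The same projection in a short Python sketch. The helper names are hypothetical and the sample address values are made up purely for illustration; an empty string stands in for the 0 entries of the result vector.

```python
FIELDS = ["StreetNo", "Street", "City", "County", "State", "ZipCode", "ZipCode+4"]


def selection_matrix(keep):
    """Build a 7 x 7 diagonal 0/1 matrix with ones only for fields in `keep`."""
    n = len(FIELDS)
    return [
        [1 if i == j and FIELDS[i] in keep else 0 for j in range(n)]
        for i in range(n)
    ]


def apply(matrix, vector):
    """Multiply a 0/1 matrix by a vector of strings ("+" joins with a space)."""
    return [
        " ".join(v for coeff, v in zip(row, vector) if coeff == 1 and v)
        for row in matrix
    ]


# Illustrative values only, in the order given by FIELDS.
address = ["123", "Main Street", "Springfield", "Sangamon", "Illinois", "62701", "0001"]

city_state = apply(selection_matrix({"City", "State"}), address)
print(city_state)
# ['', '', 'Springfield', '', 'Illinois', '', '']
```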

Now the full basis set of the Name and Address complete sets would be a vector of dimension 10, and transformations handled by 10 x 10 matrices. [3 Name variables + 7 Address variables = 10]. This is just the representation as above, not a completely general one yet for Address. I plan to study and present here more about the fields and dimensionality of Address fields soon. Check back as this page continues to develop. [Perhaps along the lines of: http://www.endswithsaurus.com/2009/07/lesson-in-address-storage.html ]. A good plan would be to continue to analyze what all the required fields are currently (can change in the future) for all types of genealogy data to be stored. So far this site deals only with Name and Address. More work to be done!

This can be further extended to handle a complete set of variables for source references. I have parameterized Elizabeth Shown Mills' QuickCheck Models here. Improving that to a better, but complete and unique set of variables would allow source field transfers to be made without loss from one computer program to another by a simple matrix multiplication. But one would need to agree on the unique set of variables. (this could be modified over time, it wouldn't be locked in forever, of course). And the Evidence Style formatting would be done by the program's presentation layer. That has nothing to do with storing the complete and unique set of data.

Now throw in a general representation of all the other required genealogy data variables, determined by a meeting of the best genealogical minds, and there will be a long list, but it will be quite finite. If there are 1000 variables, then transforming the data uniquely and without data loss can be done by simple (for a computer) matrix multiplications of vectors of dimension 1000 and matrices of dimension 1000 x 1000.

Today's list can be expanded tomorrow. That too can be handled by a simple matrix multiplication that would convert your 1000 vector to 1000 + N and the 1000 x 1000 matrices to (1000+N) x (1000+N). Not much sweat for a computer.
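A small Python sketch of such an extension, using the 3-variable Name set in place of 1000 variables. The function names are hypothetical, and the choice to pad matrices with an identity block (new variables pass through unchanged) is my assumption; a vendor that does not store the new variables would pad with zeros instead.

```python
def extend_vector(vector, n_new):
    """Append n_new empty slots for newly approved variables."""
    return list(vector) + [""] * n_new


def extend_matrix(matrix, n_new):
    """Pad an n x n matrix to (n + n_new) x (n + n_new).

    Assumption: new variables pass through unchanged, so the added
    diagonal block is an identity.
    """
    n = len(matrix)
    size = n + n_new
    extended = [list(row) + [0] * n_new for row in matrix]
    for k in range(n, size):
        extended.append([1 if j == k else 0 for j in range(size)])
    return extended


T_B = [[1, 1, 0], [0, 0, 1], [0, 0, 0]]  # Program B's Name transformation
bigger = extend_matrix(T_B, 1)

print(len(bigger), len(bigger[0]))  # 4 4
print(bigger[3])                    # [0, 0, 0, 1]
print(extend_vector(["Mary Ann", "Susan", "Smith"], 1))
# ['Mary Ann', 'Susan', 'Smith', '']
```

Existing data stays valid after the extension; the new slot is simply empty until someone fills it in.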

Completeness here, much like in mathematics, means that the basis set you decide to use will allow you to represent the complete set. To a genealogist, the complete set is the set of charts and reports, etc. that come from the presentation layer of the genealogy program. If that becomes limiting somehow, the gurus should add some variables, and specify the transformations so that program vendors can easily code to it.

The part that I don't understand (yet, anyway) is how to represent the linked list data. That is the data that points a person to their children, spouses, and parents. But surely this isn't hard. Because the current GEDCOM already does that pretty well. When you transfer data from one program to another, I don't think you find that parents, children, etc. are lost or messed up. So surely the linked list part can be built in this model, probably in the vectors and matrices, but I'm not certain if that is even required. I haven't programmed such linked lists since I took a computer course in Algol in college a long time ago. I don't see this as a show stopper. In fact, if I studied the GEDCOM format for a while, I'd probably discover the way or a way to represent that. But, again, that doesn't seem to be the difficult part in getting data to transfer without loss.
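One possible way to keep the links in the same matrix spirit is a person-by-person table of 0s and 1s. The Python sketch below is only a guess at how that might look, not a description of how GEDCOM or any program stores links; the people and the helper names are hypothetical.

```python
# Hypothetical persons, indexed by position.
people = ["Mary Ann Susan Smith", "John Smith", "Jane Doe"]

# parent_of[i][j] == 1 means people[i] is a parent of people[j].
parent_of = [
    [0, 0, 0],  # Mary has no children recorded
    [1, 0, 0],  # John is a parent of Mary
    [1, 0, 0],  # Jane is a parent of Mary
]


def children_of(i):
    """Read across row i to find the children of person i."""
    return [people[j] for j, flag in enumerate(parent_of[i]) if flag]


def parents_of(j):
    """Read down column j to find the parents of person j."""
    return [people[i] for i, row in enumerate(parent_of) if row[j]]


print(parents_of(0))   # ['John Smith', 'Jane Doe']
print(children_of(1))  # ['Mary Ann Susan Smith']
```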

Program data will need to be generalized over time. How many programs deal with same-sex marriages, transgender, and the problems they raise? A smart, central committee could develop such extensions. Maybe the RFC (Request For Comments) model that the computer network engineering (and more?) world has adopted would be a model to emulate. All the Internet protocols are fully specified in public RFCs. That is because the applications and infrastructure of the Internet MUST work together or we have no Internet. Today, genealogy vendors care little whether their program interacts with another vendor's program. But they would all like to provide a tool for you to convert your data to theirs. And not so much in the other direction.

The best way for users to demand change that I see is that some upstart computer code writer writes a general data model and writes a program that uses it for a decent presentation layer. As that layer begins to offer all the reports, charts, etc. that users want, then people will begin leaving the other vendors who suffer from inertia and won't or can't keep up with the times. This site is just a back of the envelope presentation of how I would approach this. YMMV (your mileage may vary).

It was pointed out to me that GENTECH has done work on a relational database genealogical data model. It seems dormant, but I'll be reading its documentation shortly.

A quick perusal of their documentation seems to show that they overlooked the simple analysis I provide in this web site. They proceed into relational database normalization concepts but seem to ignore the need for a complete set of variables in order to achieve full normalization (adequate normalization, at least) for the data under the rug.

I keep getting referred to the BetterGedcom effort. Here is my response.

I was there, briefly. They are capable engineers. I wish them well. But I think someone needs to step back and analyze the science of the problem before engineering the solution. I don't think they have done that sufficiently. My web site is a crude attempt to analyze the core of what the current programs do wrong, in a minimalist manner, identifying the smallest, simple step(s) that would get them on the right track. That, and maybe stir some minds to do this better than I am able. They are welcome to read my analysis.