http://jytangledweb.org/genealogy/model/
Last Update: Fri Apr 29 21:51 EDT 2011
Initial Version Released: Apr 19, 2011
The following analysis is a first attempt to describe a matrix representation applicable to genealogical data, covering internal database storage and retrieval as well as mark up for output.
The hope is that such an analysis may shed light on, and guide understanding toward a complete and extensible modular representation amenable to a simple and modular computer algorithm. At the very least, this exercise should be a useful thought experiment to perhaps guide or inform the writing of general codes.
The first step is to identify a complete set of genealogical variables, vᵢ. Let's represent this list as the vector Gcomplete. Let this be an N x 1 vector (rows x columns).
              | v1 |
              | v2 |
  Gcomplete = | v3 |     (1)
              | …  |
              | vN |
Every existing program has its own list of variables it uses, and thus its own N. For programs A, B, and C, let's call this GA, GB, and GC, where the dimensionality is NA, NB, and NC, respectively.
A starting representation of Gcomplete then will be the superset of GA, GB, and GC. This will be the union of the elements of GA, GB, and GC.
Gcomplete = GA ∪ GB ∪ GC (2)
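As a concrete illustration, the union in equation (2) is a plain set union. A minimal Python sketch, with made-up variable names (the real lists would come from the programs themselves):

```python
# Illustrative (made-up) variable sets for three hypothetical programs.
GA = {"GivenName1", "Surname", "City"}
GB = {"GivenName1", "GivenName2", "Surname"}
GC = {"Surname", "City", "State"}

# Equation (2): the superset is the union of the three sets.
Gcomplete = GA | GB | GC

# A program can receive an export without data loss only if it stores
# every variable of the superset.
def lossless_export_possible(G_target, G_complete):
    return G_complete <= G_target  # subset test

print(sorted(Gcomplete))
print(lossless_export_possible(GB, Gcomplete))  # False: GB is incomplete
```

The subset test is the code form of the completeness requirement discussed next: an incomplete program can always be projected into, but not exported to without loss.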
This union represents the superset of all variables used in the best practices of existing programs, which makes it a reasonable starting point. If, on careful study, some variables prove unimportant, or new variables prove necessary, make the appropriate adjustments. Note that the precise list of variables will evolve over time as variables are added, removed, or renamed. That is not a problem for this method: coding a program in the general way presented here makes such changes easy to accomplish, most likely unlike the codings of currently existing programs.
The important thing to note is that unless a program allows storage of all variables in the superset Gcomplete, its data cannot be transformed without data loss to the representation of another program.
However, the superset can always be projected into the subspace of any incomplete program's representation of G, though that projection will suffer data loss. So program authors will want to ensure that their G's are complete, at least according to the current state of the art, if they wish to export and import without data loss to and from other programs' data representations.
I believe there is work going on today to identify a complete set of variables (e.g. BetterGedcom), but this treatment attempts to show mathematically why a complete set is necessary.
Another shortcoming of current genealogy programs is that they often take shortcuts in their data representation. For example, some use a single variable for the input of Name, which is parsed internally into GivenName and Surname. This is a bad practice; the Name analysis done below shows why.
Relational databases are the natural tool to hold genealogical data, and current programs do use them. However, they do not seem to adhere closely enough to the rules of database normalization. These rules reduce redundancy in the stored data and help assure modularity and simplicity.
I contend that representing the Name variable with too few variables violates normalization, and Name is not the only place where current implementations fail normalization practices. This would not be important if it did not affect transport of data to other programs, but it does.
Let me note here that GEDCOM attempts to rectify part of the name problem by putting the Surname in slashes, as /Surname/, even though it uses only one variable for Name. As:
Name = Mary Ann /Smith/
If you need the independent variables, use them in the basis, don't try to fudge them in the code.
Another lesson that current program designers need to learn is the modular data structure from object oriented programming. Once you have defined a structure for a set of variables into an object, as Name, then use that object everywhere in the program.
It is commonly seen in programs that they use different fields when asking for a name of the person you are adding to the tree vs. the name of a person who is the owner of an artifact, the name of the author of a book, etc. Properly, all occurrences of Name should use the same data structure.
Each of the data types represents an object with a data structure, in the object-oriented programming sense. Transformation matrices that extract or project just that object from Gcomplete will be zero everywhere except for the ones that extract each variable describing that object. (The ones are not necessarily on the diagonal; off-diagonal ones reorder the required variables, and this can be used to control the output order of the variables.)
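The reordering point can be sketched in a few lines of Python (the names and the chosen order are illustrative):

```python
# A 0/1 selection matrix with off-diagonal ones both extracts and reorders.
G = ["Mary", "Ann", "Smith"]  # GivenName1, GivenName2, Surname

P = [
    [0, 0, 1],  # row 1 picks Surname
    [1, 0, 0],  # row 2 picks GivenName1
    [0, 1, 0],  # row 3 picks GivenName2
]

# Each row of a 0/1 selection matrix picks exactly one component of G.
reordered = [G[row.index(1)] for row in P]
print(reordered)  # ['Smith', 'Mary', 'Ann']
```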
One can also represent Gcomplete as the union of the G's for all of the objects as:
Gcomplete = GName ∪ GAddress ∪ G… ∪ GN     (3)

or

              | GName    |
  Gcomplete = | GAddress |     (4)
              | G…       |
              | GN       |
Is this vector and matrix algebra approach required? No. But I think it is a useful way to analyze it because it emphasizes the modularity, the transformations, and the concept of sets, complete and incomplete. And it indeed may be amenable to coding in this manner, resulting in a modular code that will be trivially extensible and modifiable by simply redefining projection operators and vector addition operators (see below) rather than trying to recode spaghetti code that was written ad hoc, with no overall model in mind. This analysis is a work in progress. Let's see what value comes out of it.
Now let us examine some of the data objects.
Now let's define GName with this nomenclature. The first thing to do with any object is to do a thorough investigation of best practices to determine the complete list of variables to use. I contend that the minimum number of variables to use is three. As:
          | GivenName1 |
  GName = | GivenName2 |     (5)
          | Surname    |

I don't think more variables are required, but if a consensus determines that more are needed, the list may be expanded. But then all programs will be required to handle all of the fields, so there should be a judicious choice of variables.
Let the vector operator PName project out only the Name variables from the complete Gcomplete. That is:
                              | 1 0 0 |               | GivenName1 |
  GName = PName · Gcomplete = | 0 1 0 | · Gcomplete = | GivenName2 |     (6)
                              | 0 0 1 |               | Surname    |

PName is shown only as a 3 x 3 matrix, but in actuality it will be N x N, where N is the number of variables in the complete set; only the non-zero portion is shown. After the full projection, the null elements may be removed from the resulting vector. For Name, this results in the 3 x 1 vector.
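A sketch of the projection in equation (6), assuming for brevity a five-variable Gcomplete with made-up values; the projection keeps only the Name rows, which amounts to dropping the nulls:

```python
# Illustrative Gcomplete with five variables (values are made up).
Gcomplete = ["Mary", "Ann", "Smith", "123", "Main St"]

# PName: one row per Name variable, a single 1 in the column to extract.
# Shown 3 x 5 here; in the text it is N x N with null rows removed afterward.
PName = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
]

def project(P, G):
    # Each row of a 0/1 selection matrix picks exactly one component of G.
    return [G[row.index(1)] for row in P]

GName = project(PName, Gcomplete)
print(GName)  # ['Mary', 'Ann', 'Smith']
```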
To take the matrix algebra analogy one more step, let's determine the value of the resulting Name variable by calculating the magnitude of the vector GName. That is:

  |GName| = √( GName · GName )     (7)

  |GName| = √( GivenName1² + GivenName2² + Surname² )     (8)

  |GName| = GivenName1 + GivenName2 + Surname = Mary Ann Smith     (9)

where the addition in equation (9) is taken to be string catenation, with separating spaces.
One should be able to appreciate now that programs that use a single variable for Name have an insufficient number of variables. This model, with three variables for Name, can be projected successfully into the set space where only one Name variable is used. But how would one project the single Name variable back into the space of three? Not by a unique mathematical transformation; one would have to write rules for parsing it. That is a giveaway that not enough variables are defined. For Name it isn't a severe problem, but for other, more complex data representations it can be a very serious problem that prevents data transfer without loss of data.
This sure looks like a lot of work and analysis for little return. Stay tuned!
Let's now define an Input operator, IName. This operator is analogous to the projection operator P, but instead of 1's in the matrix, the operation is to ask for the variable's input. The operator GetVar returns the variable it operates on. That is, GetVar(GivenName1) prompts the user for the variable GivenName1. In matrix form:
                  | GetVar   0      0    |   | GivenName1 |
  IName · GName = |   0    GetVar   0    | · | GivenName2 |     (10)
                  |   0      0    GetVar |   | Surname    |

                  | GetVar(GivenName1) |
                = | GetVar(GivenName2) |     (11)
                  | GetVar(Surname)    |

and the program prompts the user to input the specific, required variables for the object being defined. Here, it is a Name object. How it prompts the user is unimportant here; it can be with input boxes, or any way convenient in the context of the program. The important thing to note is that an operator defined in this manner will prompt precisely for the variables needed, and only those. That will be very useful later, when we consider Evidence Style historical source reference templates. And this analysis guides an elegant manner of structuring the computer code. And computers take to matrix operations like a duck takes to water.
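A hedged sketch of the I operator of equations (10) and (11) in Python: the diagonal of GetVar's becomes, in code, one prompt per variable of the object. The prompt function is injected so the sketch can run without a terminal; all names and answers here are assumptions:

```python
# Build an input operator that prompts for exactly the object's variables.
def make_input_operator(variable_names, ask=input):
    def operator():
        return [ask(name + "? ") for name in variable_names]
    return operator

# Simulated user answers instead of interactive input, for demonstration.
answers = {"GivenName1? ": "Mary", "GivenName2? ": "Ann", "Surname? ": "Smith"}
IName = make_input_operator(["GivenName1", "GivenName2", "Surname"],
                            ask=answers.get)
print(IName())  # ['Mary', 'Ann', 'Smith']
```

With `ask=input` (the default), the same operator would prompt a real user, which is the behavior the matrix of GetVar's describes.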
Let's define one last operator, trivial for this object, but it will be important for other objects, as below. That is the Render, or output operator, RName. This is defined as:
  RName · |GName| = Mary Ann Smith     (12)

That is, it takes the magnitude of the resulting string and processes it for output. In this case it simply prints the string: Mary Ann Smith. Since the output is a simple string, it just outputs the simple string. Later we will see that this operator has some mark up work to do. See below.
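Equations (7) through (9) and (12), read together, amount to: the "magnitude" of a string vector is the catenation of its components, and RName simply outputs it. A minimal sketch:

```python
GName = ["Mary", "Ann", "Smith"]

def magnitude(G):
    # String analogue of the vector norm in equations (7)-(9):
    # catenate the non-null components, separated by spaces.
    return " ".join(v for v in G if v)

def render_name(G):
    # Equation (12): Name needs no extra mark up, so rendering is trivial.
    return magnitude(G)

print(render_name(GName))  # Mary Ann Smith
```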
Now, let's move on to another object case that is slightly more complex, and apply the same treatment to it.
The next harder, but still simple, example of this would be representing an address field. A general basis set needs to be proposed and accepted by genealogists. A simple start, for U.S. addresses, would look something like (let's ignore the Name field as we have dealt with objects of that case already):
A complete data basis set could look like:
                                    | 1 0 0 0 0 0 0 |   | StreetNo  |
                                    | 0 1 0 0 0 0 0 |   | Street    |
                                    | 0 0 1 0 0 0 0 |   | City      |
  GAddress = PAddress · Gcomplete = | 0 0 0 1 0 0 0 | · | County    |     (13)
                                    | 0 0 0 0 1 0 0 |   | State     |
                                    | 0 0 0 0 0 1 0 |   | ZipCode   |
                                    | 0 0 0 0 0 0 1 |   | ZipCode+4 |

To input the required Address variables, use an I operator as in equations (10) and (11) above.
                        | GetVar   0      0      0      0      0      0    |
                        |   0    GetVar   0      0      0      0      0    |
                        |   0      0    GetVar   0      0      0      0    |
  IAddress · GAddress = |   0      0      0    GetVar   0      0      0    | · GAddress     (14)
                        |   0      0      0      0    GetVar   0      0    |
                        |   0      0      0      0      0    GetVar   0    |
                        |   0      0      0      0      0      0    GetVar |

                        | GetVar(StreetNo)  |
                        | GetVar(Street)    |
                        | GetVar(City)      |
                      = | GetVar(County)    |     (15)
                        | GetVar(State)     |
                        | GetVar(ZipCode)   |
                        | GetVar(ZipCode+4) |

And analogous to equation (12) we can write a render, or output, operator RAddress as:
  RAddress · |GAddress| = StreetNo Street City County State ZipCode ZipCode+4     (16)

Now is a good time to generalize the Render operator so that it is capable of marking up, or formatting, the output. This wasn't necessary for Name, but for Address we would like to add punctuation and format control. The matrix algebra approach can be used for this extension also.
Let the vector addition operator, +, represent the vector addition of character strings. Note that:
  5 + 3 ≠ 3 + 5    because    5 + 3 = 53  and  3 + 5 = 35

In mathematical terms, the string addition is catenation and is not commutative. This will be important here later.
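The non-commutativity is exactly what string concatenation does in, say, Python:

```python
# String "addition" is catenation, so order matters.
print("5" + "3")  # 53
print("3" + "5")  # 35
assert "5" + "3" != "3" + "5"

# Ordinary numeric addition, by contrast, commutes.
assert 5 + 3 == 3 + 5
```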
Let us redefine the Render operator to work as:
  RAddress · GAddress = AL,Address + GAddress + AR,Address     (17)

where AL and AR are Addition Vectors with the dimension of the particular object, NObject x 1. AL is the Addition Vector Left and AR is the Addition Vector Right. This allows us to insert special mark up characters, or codes, to the left and to the right of the variable fields prior to rendering the strings for output.
For a simple address output this could look like:
                        ||   | StreetNo  |   | |
                        ||   | Street    |   |, |
                        ||   | City      |   |, |
  RAddress · GAddress = || + | County    | + |, |     (18)
                        ||   | State     |   | |
                        ||   | ZipCode   |   |-|
                        ||   | ZipCode+4 |   ||

  = StreetNo Street, City, County, State ZipCode-ZipCode+4
In the AL and AR vectors above, it is important to notice the punctuation including spaces, or no spaces. For example, || means a null string, and | | means a single space character. This is important. In the other vectors and arrays spaces are unimportant. But these vectors control formatting, and every character is important. If we define <CR> as a token that represents a carriage return (or line feed), we can control an envelope output format for the address as:
                        ||   | StreetNo  |   | |
                        ||   | Street    |   |<CR>|
                        ||   | City      |   |<CR>|
  RAddress · GAddress = || + | County    | + |<CR>|     (19)
                        ||   | State     |   | |
                        ||   | ZipCode   |   |-|
                        ||   | ZipCode+4 |   ||

  = StreetNo Street
    City
    County
    State ZipCode-ZipCode+4

These simple examples do not require any formatting characters in the Left Addition Vector. We will need to use that capability later.
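Equations (17) through (19) can be sketched directly: the Addition Vectors are lists of strings catenated around each field. The address values below are invented for illustration, and "\n" plays the role of the <CR> token:

```python
GAddress = ["123", "Main St", "Springfield", "Clark", "Ohio", "45501", "1234"]

AL = [""] * 7  # no left mark up needed in these examples

def render(AL, G, AR):
    # Equation (17): catenate AL[i] + G[i] + AR[i] across the object.
    return "".join(l + v + r for l, v, r in zip(AL, G, AR))

# Equation (18): one-line address with punctuation.
AR_line = [" ", ", ", ", ", ", ", " ", "-", ""]
print(render(AL, GAddress, AR_line))
# 123 Main St, Springfield, Clark, Ohio 45501-1234

# Equation (19): envelope format, with "\n" in place of <CR>.
AR_envelope = [" ", "\n", "\n", "\n", " ", "-", ""]
print(render(AL, GAddress, AR_envelope))
```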
We could add Country, etc. That would be for a round table of smart genealogists to determine what a complete set would be. Non-US addresses would require some different fields. But the matrix representation and transformation of that should be clear.
Field formatting of variable types would be handled by the presentation layer code according to how the data should appear in that specific output. e.g. the address pieces in a source reference are "run on" in one line, but if you are addressing envelopes it would be in that format.
How would we map a full address to just City and State? Like this:
                         | 0 0 0 0 0 0 0 |   | StreetNo  |   |   0   |
                         | 0 0 0 0 0 0 0 |   | Street    |   |   0   |
                         | 0 0 1 0 0 0 0 |   | City      |   | City  |
  GAddress(City,State) = | 0 0 0 0 0 0 0 | · | County    | = |   0   |     (20)
                         | 0 0 0 0 1 0 0 |   | State     |   | State |
                         | 0 0 0 0 0 0 0 |   | ZipCode   |   |   0   |
                         | 0 0 0 0 0 0 0 |   | ZipCode+4 |   |   0   |
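The City/State projection of equation (20) in the same sketch style, with the zero rows left in place as nulls (address values invented):

```python
GAddress = ["123", "Main St", "Springfield", "Clark", "Ohio", "45501", "1234"]

# Only rows 3 and 5 (City and State, zero-indexed 2 and 4) carry a 1.
P_city_state = [[1 if (i == j and i in (2, 4)) else 0 for j in range(7)]
                for i in range(7)]

# All-zero rows yield null (empty) entries, as in equation (20).
projected = [GAddress[row.index(1)] if 1 in row else ""
             for row in P_city_state]
print(projected)  # ['', '', 'Springfield', '', 'Ohio', '', '']
```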
Now the full basis set of the Name and Address complete sets would be a vector of dimension 10, and transformations would be handled by 10 x 10 matrices [3 Name variables + 7 Address variables = 10]. This is just the representation as above, not yet a completely general one for Address. Perhaps a good starting basis set of variables would be along the lines of the twelve address variables presented by Ben Alabaster of Toronto, Ontario, Canada at: http://www.endswithsaurus.com/2009/07/lesson-in-address-storage.html. In case this site goes away, these are:

Street Number [Int]
Street Number Suffix [VarChar] - A~Z, 1/3, 1/2, 2/3, 3/4, etc.
Street Name [VarChar]
Street Type [VarChar] - Street, Road, Place, etc. (he has found 262 unique street types in the English-speaking world so far... and is still finding them)
Street Direction [VarChar] - N, NE, E, SE, S, SW, W, NW
Address Type [VarChar] - for example Apartment, Suite, Office, Floor, Building, etc.
Address Type Identifier [VarChar] - for instance the apartment number, suite, office or floor number, or building identifier
Minor Municipality (Village/Hamlet) [VarChar]
Major Municipality (Town/City) [VarChar]
Governing District (Province, State, County) [VarChar]
Postal Area (Postal Code/Zip/Postcode) [VarChar]
Country [VarChar]
That site even suggests the variable types to use for each in the relational database which holds all of this data. A good plan would be to continue to analyze what all the required fields are currently (can change in the future) for all types of genealogy data to be stored.
Now let's extend this model to a matrix algebra representation for coding templates for Evidence Style historical source references. We now have all the matrix algebra tools defined that we require.
Now let's use the above approach to present a treatment of the Evidence Style source reference object. As you now know, the first task is to identify a best effort complete set of variables for the object at hand.
A reasonable start will be the 577 variables that I found useful for completely representing the 170 QuickCheck Models (all three versions of each) in Evidence Explained. See http://jytangledweb.org/genealogy/evidencestyle/. Is this a complete set of variables to represent all source reference variables? No. Are they a complete set to represent all three types of the 170 QuickCheck Models? Yes.
Some of the variable names may appear to be redundant, but are named as such because it is necessary to distinguish between similar variable types whose content may differ for Full, Short, or List templates.
The next step will be to specify the operator matrices and addition vectors for each QuickCheck type and subtype. Since there are so many of them, a nomenclature for type and subtype needs to be devised. Nomenclature was not important in the Name and Address object type presentations, but is required here because of the increased complexity.
Let's represent the vector of the complete set of Evidence Style variables as E. It is a 577 x 1 vector.
The projection operators, P, will need subscripts to define which template style is being defined by P. As PType,Subtype,Style,Page. Type will be an abbreviation for the name of the chapter in Evidence Explained, Subtype will be a brief text description of the subtype being defined, Style will be one of F, S, or L (Full, Short, List), Page will be the page number in the book on which the defined QuickCheck Model appears.
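In code, the PType,Subtype,Style,Page nomenclature maps naturally onto a dictionary keyed by a tuple. The entries below are placeholders of my own invention, not actual QuickCheck definitions:

```python
# Hypothetical template registry keyed by (Type, Subtype, Style, Page).
# The variable lists stand in for the non-zero structure of each P.
templates = {
    ("AA", "Artifact, creator as lead element", "F", 93):
        ["AUTHOR", "TITLE", "FORMAT", "DATE"],
    ("AA", "Artifact, creator as lead element", "S", 93):
        ["AUTHOR", "TITLE"],
}

def P(type_, subtype, style, page):
    # Look up the variable list for one uniquely named template.
    return templates[(type_, subtype, style, page)]

print(P("AA", "Artifact, creator as lead element", "S", 93))
```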
This nomenclature will define a unique P for each of the 170 QuickCheck Model type templates found here into the above matrix algebra representation.
The QuickCheck nomenclature to be used is defined here.
As an aside, it would be welcome if the definers of these style templates would provide the nomenclature and also the required, unique, and necessarily complete variable list to describe all of their defined types. This list may grow or change with evolution, but the authors of the styles are in the best position to define the nomenclature. Meanwhile, I shall proceed with a convenient nomenclature of my invention.
The relevant chapter title abbreviations for Type are:
Archives and Artifacts (AA)
Business & Institutional Records (BIR)
Cemetery Records (Cem)
Census Records (Cen)
Church Records (Ch)
Local & State Records: Courts & Governance (LSRCG)
Local & State Records: Licenses, Registrations, Rolls, & Vital Records (LSRLR)
Local & State Records: Property & Probates (LSRPP)
National Government Records (NGR)
Publications: Books, CDs, Maps, Leaflets, & Videos (PB)
Publications: Legal Works & Government Documents (PL)
Publications: Periodicals, Broadcasts & Web Miscellanea (PP)
Now we need to define the subtypes for each of the 170 QuickCheck Models. Let's get that from the list developed at http://jytangledweb.org/genealogy/evidencestyle/. That is here.
[Here I will expound the matrix representation for Evidence Style source references: storage, retrieval, and mark up. (Coming soon!) I am considering putting the Evidence Style object into a Sage mathematical project. I will start by presenting a few specific examples, but hope to provide all the operator matrices and addition vectors to fully describe the Evidence Style references for the 170 QuickCheck Models in Evidence Explained: Citing History Sources from Artifacts to Cyberspace, Second Edition, by Elizabeth Shown Mills, Baltimore: Genealogical Publishing Co., 2009.]
In programming this in Sage I found that lists of strings, not vectors and matrices, are the best way to represent the data. The list elements are indexable just like vectors, and such a treatment allows one to not store any of the 0's in the vectors and arrays. In essence it is equivalent, but much more efficient to code. I will be expanding on that soon, but here is the untouched (I did fold the List and Full lines for this web page, in the output they are each one long string) Sage output from a general Sage mathematical treatment for E, P, AL, AR, with variable substitution, for the QuickCheck Model found on page 93 of Evidence Explained.
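The list-of-strings shortcut might look like this in plain Python (variable names, values, and mark up all invented for illustration): instead of an N x N matrix of mostly zeros, each template stores only the variable names it uses, plus its AL and AR strings:

```python
# Variable store E: only the variables this source actually has.
E = {
    "AUTHOR": 'Horst, "Aunt Ella," et al.',
    "TITLE": "Amish Friendship Sampler Album",
    "FORMAT": "Quilt",
}

# A template is the ordered variable names with left/right mark up strings;
# together these replace the P matrix and the AL/AR vectors, zeros omitted.
template = ["AUTHOR", "TITLE", "FORMAT"]
AL = ["", ' "', " "]
AR = ["", ',"', "."]

citation = "".join(l + E[name] + r for l, name, r in zip(AL, template, AR))
print(citation)
# Horst, "Aunt Ella," et al. "Amish Friendship Sampler Album," Quilt.
```

The list elements are indexable just like vector components, so the treatment is equivalent to the matrix form but far more compact to code.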
*****List,Full,Short*****
CITATIONNUMBER = JHY1
TEMPLATENUMBER = ESM93
TYPE = Archives & Artifacts
SUBTYPE = Archived Material: Artifact, Creator as lead element in Source List
*************************
Horst, "Aunt Ella," et al. "Amish Friendship Sampler Album." Quilt. ca. 1876<endash>1900. Michigan Quilt Project. Michigan State University Museum, East Lansing.
"Aunt Ella," et al Horst, "Amish Friendship Sampler Album," Quilt, ca. 1876<endash>1900; item 01.0011, Michigan Quilt Project; Michigan State University Museum, East Lansing, Michigan. The archival description identifies the quilt makers collectively as "Friends of Annie Risser Horst".
Horst et al., "Amish Friendship Sampler Album," Michigan Quilt Project.
*************************
There are a couple of formatting things that need fixing. First, the "et al." here is misplaced because of the automated way the two variables are used for the name. My first parameterization handled the name differently; when it was split into First and Last fields, as here, it lost a bit of control over the placement of things like "et al.". I'll be fixing this soon.
Also, there is a |".| that should be rewritten to be |."|. I handled this in a previous treatment by running a "fixit" or "rationalization" step on the final string to fix up punctuation. Such anomalies appear because the final string is pieced together from pieces, and the result, while containing the proper information, may not obey all grammatical rules. The human mind fixes this, a computer program can do it with a little "training". This was very successful in my earlier programming treatment of this (not released because it was in Fortran and Bourne Shell programming). I was able to reproduce precisely every one of the 170 QuickCheck Models, and even wrote the rendering program part to RTF so the results could be compared to the book. (precisely except for any typos in the book, of course). I have now looked in my old code and there were only two cases that required "fixing". They are:
# ". -> ."
# ", -> ,"
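The two rules above amount to simple string rewrites. A naive Python sketch (a guess at the idea, not the author's original Fortran/shell code):

```python
def fixit(s):
    # ". -> ."  move a trailing period inside the closing quote
    s = s.replace('".', '."')
    # ", -> ,"  move a trailing comma inside the closing quote
    s = s.replace('",', ',"')
    return s

print(fixit('"Amish Friendship Sampler Album".'))
# "Amish Friendship Sampler Album."
```

A blanket replace like this would be too aggressive in general text, but on template output, where the patterns arise only from piecing strings together, it is plausible that only these two cases occur, as the author reports.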
And from a look at the capitalization "fixes" required, I found these comments in my code (along with the Unix script code that detected and "fixed" each case):
### CAPTST
#echo "PUNC: no leading string, capitalize value"
#echo "PUNC: |. | precedes value, capitalize it"
#echo "PUNC: |.\" | precedes value, capitalize it"
#echo "PUNC: |, | precedes value, lowercase it"
#echo "PUNC: |,\" | precedes value, lowercase it"

(A \ is needed in Unix scripts to "protect" certain characters after it from being interpreted by the shell instead of being taken as literal characters. Here the " character needs protecting by \.)
Another syntactical grammar issue that arises from the automated use of the same variable in different syntaxes is that "Quilt" is properly capitalized after a |.| (actually a |." |) but improperly capitalized after a |,| (actually a |," |). This too can be fixed by an automated "fixit" check after the string is marked up but before it is output.
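The capitalization rules quoted above could likewise be sketched as post-processing. This version is deliberately naive (applied blindly it would, for instance, wrongly lowercase proper nouns after a comma) and is shown only to illustrate the "fixit" idea, not to reproduce the author's shell logic:

```python
import re

def fix_case(s):
    # Capitalize a value that follows |. | or |." |.
    s = re.sub(r'(\. |\." )([a-z])',
               lambda m: m.group(1) + m.group(2).upper(), s)
    # Lowercase a value that follows |, | or |," | (naive: harms proper nouns).
    s = re.sub(r'(, |," )([A-Z])',
               lambda m: m.group(1) + m.group(2).lower(), s)
    return s

print(fix_case('1900. quilt'))  # 1900. Quilt
```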
And <endash> is, I believe, the proper length of dash for a date range. This too will be resolved in the rendering step (to RTF, say) with precise code to draw the proper dash. Just using - is also an option, but I intend to show the generality of this approach and the control one has over the output if one wishes to use it.
This is just a taste of the Sage treatment as I continue to develop this model to handle all 170 QuickCheck Models. I will be explaining this in more detail as it develops.
Even though the current treatment is crude so far, I have put the current full Sage input and output on this web page. And here is the Unix script that runs the input.
[This will give my third parameterization of Evidence Style historical sources. This one is in a matrix algebra specification, which should turn out to be the most useful of the three presentations for writing a computer code to use them. See http://jytangledweb.org/genealogy/evidencestyle/ for the first two.]
I have not yet added confidence level or note fields to sources. I am thinking that every source citation should also carry at least [CONFIDENCE LEVEL] and [NOTES] variables. Other objects do not need a confidence level, in my opinion, but we do need to be able (though not required, of course) to assign a confidence level every time we assign a source to a fact. If a fact has multiple citations of varying confidence levels, some artificial-intelligence-style confidence could potentially be assigned to the fact itself. So maybe every object could have a confidence variable inferred from its sources rather than input directly by the user. (Just food for thought at this point.)
This analysis based on mathematical set theory and matrix algebra forms a modular basis for coding genealogical data input and output. It does not obviate the need to store the data in a relational database nor the presentation layer of genealogy programs consisting of reports and charts. It does present a data model that will allow export to another program's data without loss, but only if the other program also represents a complete set of data, as agreed upon by best practices at any given time.
Analyzing data in this manner should also suggest modular and efficient coding for mining the data for statistics, trends, etc., where new facts can be discovered in an artificial intelligence manner.