http://jytangledweb.org/genealogy/evidence_analysis_flow_chart/
Last Update: Thu Jun 06 18:51 EDT 2013
Initial Version: June 2013
This work examines the simple model "Evidence Analysis: A Research Process Map" presented on the inside cover of the book Evidence Explained by Elizabeth Shown Mills, and shows one practical way to begin designing automated evidence analysis programming tools to assist genealogists. Such tools should begin appearing in next generation genealogy programs.
This work fits in with the "how can we program that?" goals of my previous work on computer-ready templates for Evidence Style genealogical sources, and particularly my work on how to solve the GEDCOM problem (by having a central, world-wide body specify all genealogical variables currently known to be important).
This work represents thoughts on how to automate the Genealogical Proof Standard (GPS) by computer. The work here is based on that map and the GPS (see the references), extending them into an algorithmic approach that shows a way to add a tool to existing genealogy programs that assists professional genealogists in evidence analysis and proofs.
To begin a genealogical evidence analysis, first form a Hypothesis (such as: a particular John Doe was born on 15 Aug 1890).
Initialize the Level of Confidence variables to zero.
Examine all of the Sources in your genealogical database that contain Information relevant to this Hypothesis. Classify each Source as Original or Derivative, and set a Source Confidence Level accordingly: ConfS = OrigBoost or DerivBoost.
For each Source, extract each piece of Information relevant to the Hypothesis. Classify each piece of Information as Primary or Secondary, and set an Information Confidence Level accordingly: ConfI = PrimBoost or SecBoost.
For each piece of Information, extract each piece of Evidence relevant to the Hypothesis. Classify each piece of Evidence as Direct or Indirect, and set an Evidence Confidence Level accordingly: ConfE = DirBoost or IndirBoost.
Compute the Level of Confidence of the I,J,K-th piece of Evidence (the K-th piece of Evidence from the J-th piece of Information in the I-th Source) as:
Conf(I,J,K) = ConfS + ConfI + ConfE
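As an illustration of how these three loops and the confidence sum might look in code, here is a minimal Python sketch. The object attributes (.kind, .information, .evidence) and the boost constants are hypothetical placeholders; the numeric boost values are the illustrative ones listed further below (Original/Primary/Direct = 1.0, Derivative/Secondary/Indirect = 0.5).

# Hypothetical boost values; see the Source/Information/Evidence list below.
ORIG_BOOST, DERIV_BOOST = 1.0, 0.5   # Original vs. Derivative source
PRIM_BOOST, SEC_BOOST = 1.0, 0.5     # Primary vs. Secondary information
DIR_BOOST, INDIR_BOOST = 1.0, 0.5    # Direct vs. Indirect evidence

def score_hypothesis(sources):
    """Return Conf(I,J,K) for every piece of Evidence relevant to one Hypothesis.

    `sources` is assumed to be a list of Source objects, each with a .kind
    ('original' or 'derivative') and an .information list; each Information
    item has a .kind ('primary' or 'secondary') and an .evidence list; each
    Evidence item has a .kind ('direct' or 'indirect').
    """
    confidences = {}
    for i, source in enumerate(sources):
        conf_s = ORIG_BOOST if source.kind == "original" else DERIV_BOOST
        for j, info in enumerate(source.information):
            conf_i = PRIM_BOOST if info.kind == "primary" else SEC_BOOST
            for k, evidence in enumerate(info.evidence):
                conf_e = DIR_BOOST if evidence.kind == "direct" else INDIR_BOOST
                # Conf(I,J,K) = ConfS + ConfI + ConfE
                confidences[(i, j, k)] = conf_s + conf_i + conf_e
    return confidences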
Analysis consists of examining each piece of Evidence and documenting why it supports the Hypothesis or not. Each piece of Evidence has (optionally) a calculated Level of Confidence, and the collection of Evidence has (optionally) a standard deviation. You also have the ability to override this and assign a Level of Confidence that you are comfortable with.
Proof comes as a conclusion from examining the Evidence in the Analysis. Is there supporting evidence? Is it strong or weak? Is there a lot of supporting evidence of high confidence? Is there evidence not supporting the Hypothesis? How strong is it? This is where the Art of Genealogical Evidence Analysis is required. You need to document why you have reached your conclusions. This flow chart only shows how all of this is gathered for your analysis, and how your analysis itself is captured.
In the case of very simple Hypotheses, and tests against known values, it is possible that the program can automate the Evidence Analysis and conclude a Hypothesis is Proved (e.g. John Doe's birth date is 15 Jun 1880). You can, of course, choose not to allow the program to do this, and you can at any time revisit each Proof and change things. This is simply a tool that aids analysis and conclusions, recording the analysis.
■ Proven □ Disproven □ Defer ☑ Perhaps?
Hypotheses take the form of: Item | Action (Conditional?) | Object.
This list would be a pop-up list of objects in the Hypothesis statement, like:
Object | Qualifier | Operator | Value
Birth Date | John Doe [1770-1810] | Equals | 10 Dec 1810
Note that in a genealogy program, when the focus is on a given person, the qualifier can be implicit, but for algorithm completeness here, it must be specified.
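As a hypothetical sketch, such a Hypothesis statement could be represented with a small record type mirroring the Object | Qualifier | Operator | Value pop-up; the field names are illustrative only.

from dataclasses import dataclass

@dataclass
class Hypothesis:
    """One Hypothesis row: Object | Qualifier | Operator | Value."""
    obj: str        # e.g. "Birth Date"
    qualifier: str  # e.g. "John Doe [1770-1810]"; may be implicit in a program
    operator: str   # e.g. "Equals"
    value: str      # e.g. "10 Dec 1810"

example = Hypothesis("Birth Date", "John Doe [1770-1810]", "Equals", "10 Dec 1810")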
This ends the nesting of three loops. For each Hypothesis there are NumSources relevant Sources, each of which has NumInfo pieces of Information to examine, each of which has NumEvidence pieces of Evidence to examine.
The flow chart continues:
Here the end user asks for reports summarizing each Hypothesis, one summary page per Hypothesis. The end user also gets to specify a level of confidence, AConf, that they feel is appropriate for their needs. See the Examples below. [Text boxes capture the reasoning.] This is where verbal reasoning is done and documented. The category types are folded in intellectually, along with the optional numerical "assist" outlined here.
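As a hypothetical sketch of that numerical "assist", a summary report might simply compare whatever overall confidence has been recorded for each Hypothesis against the user's chosen AConf; the tuple layout and field names here are illustrative assumptions, not a proposed format.

def summary_report(hypotheses, a_conf):
    """Print one summary block per Hypothesis.

    `hypotheses` is assumed to be a list of (statement, confidence, reasoning)
    tuples, where `reasoning` holds the captured text-box content.
    """
    for statement, confidence, reasoning in hypotheses:
        flag = "meets" if confidence >= a_conf else "is below"
        print(f"Hypothesis: {statement}")
        print(f"  Confidence {confidence:.2f} {flag} AConf = {a_conf:.2f}")
        print(f"  Reasoning: {reasoning}")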
Source:
Original = 1.0
Derivative = 0.5
Information:
Primary = 1.0
Secondary = 0.5
Evidence:
Direct = 1.0
Indirect = 0.5
One way to estimate the uncertainty in the level of confidence of the evidence for a Hypothesis is to use the corrected sample standard deviation, s, expressed as:

s = sqrt( (1 / (N - 1)) * sum_{i=1..N} (x_i - x̄)² )

where N = the number of pieces of relevant evidence, x_i = the level of confidence of the i-th piece of evidence, and x̄ = the mean of the x_i. s only has meaning if N > 1.
These three categories give us a "feel" for the quality of a particular piece of evidence. But a piece of evidence can either support the hypothesis or not support it.
If a piece of evidence does not support the hypothesis, simply make its Level of Confidence negative. This will cause x̄ to be negative with a small s if all of the evidence is negative, or cause x̄ to have a large s if there is conflicting evidence.
Positive = + (supports) Negative = - (doesn't support)
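A sketch of this calculation in Python, assuming the signed confidence values have already been gathered into a list (the function name is hypothetical):

import math

def mean_and_spread(confidences):
    """Return (x_bar, s) for a list of signed confidence values.

    Supporting evidence is entered as a positive number, non-supporting
    evidence as a negative number. s is the corrected sample standard
    deviation and only has meaning when N > 1 (None is returned otherwise).
    """
    n = len(confidences)
    x_bar = sum(confidences) / n
    s = None
    if n > 1:
        s = math.sqrt(sum((x - x_bar) ** 2 for x in confidences) / (n - 1))
    return x_bar, s

# All supporting: high mean, small s.  Conflicting: mean near zero, large s.
print(mean_and_spread([3.0, 3.0, 2.5]))
print(mean_and_spread([3.0, -3.0, 2.5]))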
Here the end user gets to assert, with a captured text box of the reasoning behind it, whether they feel the Hypothesis is proven or not. Indeterminate is also an option; it can be revisited later.
■ Proven □ Disproven □ Defer ☑ Perhaps
The flow chart uses a few variables. Here are their descriptions:
NConf=5   NConf=10   Percent      Fractional
                     Confidence   Confidence   Description
-------   --------   ----------   ----------   -----------
   0          0           0          0.00      uncertain
              1          10          0.10
   1          2          20          0.20
              3          30          0.30
   2          4          40          0.40
              5          50          0.50
   3          6          60          0.60
              7          70          0.70
   4          8          80          0.80
              9          90          0.90
   5         10         100          1.00      certain
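If a program stores the user's choice on either scale, converting a level to the internal fractional confidence is a one-liner; this helper is a hypothetical sketch, not part of any existing program.

def level_to_fraction(level, nconf=10):
    """Convert a level on an NConf-point scale to a 0.00-1.00 fraction.

    E.g. level 4 on the NConf=5 scale -> 0.80; level 7 on NConf=10 -> 0.70.
    """
    if not 0 <= level <= nconf:
        raise ValueError("level must be between 0 and NConf")
    return level / nconf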
The hope is that experimenting with "boost" levels for the various properties will help the end user and eventually lead to some amount of automated processing for analysis of proof.
The following types of reports are samples of useful ones. Many others are possible.
Not only are new reports possible, but after significant experience some analysis and proof tools can be developed, such as:
Analyze/Prove | {Birth Dates, Death Dates} | for { All; Selected Individuals}
My assertion is that with a program such as this, and with experience, it will be possible to generate automated reports of proven Hypotheses. How about running a report for Proofs of the birth dates of all persons in your data? The more meticulous you have been about providing sources for each fact for each person, and analyzing that fact, even when facts disagree, the better the program can decide whether a fact is proven. If you have 5 Original, Primary, and Direct documents that all say the same birth date, it seems to me that the program can make the decision that it is proven. If there is conflicting evidence, then the program will need your assistance. But the process here will present all the evidence for you in one place so you can make your decisions and record them.
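To make that decision rule concrete, here is a hedged sketch; the threshold of five agreeing Original/Primary/Direct items, the field layout, and the returned labels are illustrative assumptions, not a proposed standard.

def auto_assess(evidence_items, min_agreeing=5):
    """Suggest 'Proven', 'Disproven', or 'Defer' for one fact (e.g. a birth date).

    `evidence_items` is assumed to be a list of (value, conf) pairs, where conf
    is the signed Conf(I,J,K) for that piece of evidence; with the example
    boosts above, conf == 3.0 means Original + Primary + Direct.
    """
    supporting = [(v, c) for v, c in evidence_items if c > 0]
    conflicting = [(v, c) for v, c in evidence_items if c < 0]

    if supporting and conflicting:
        return "Defer"       # conflicting evidence needs the genealogist
    if conflicting and not supporting:
        return "Disproven"   # only non-supporting evidence was found
    strong = [v for v, c in supporting if c >= 3.0]
    if len(strong) >= min_agreeing and len(set(strong)) == 1:
        return "Proven"      # e.g. five Original/Primary/Direct records agree
    return "Defer"           # not enough high-confidence agreement yet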
Just as now reports can be run that flag people that have no birth date specified, you could run a report flagging people whose birth dates are not considered proven. The types of these automated reports can become more sophisticated as experience is gained.
If someone questions why you believe someone was born on a certain day, you can run a report that shows the Hypothesis, the relevant sources, their classification, your notes and level of confidence assigned (if desired), and the statement of your conclusion. Right from your favorite genealogy program.
ESM, p. 19.
Certainly    5   high degree of certainty
Probably     4
Possible     3
Likely       2
Apparently   1
Perhaps      0   some piece of evidence exists, but not trusted
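If a program wanted to offer these words instead of raw numbers, one hypothetical mapping onto the NConf=5 scale above would be a simple lookup table:

ESM_TERMS = {            # after ESM, p. 19, mapped onto the NConf=5 scale
    "Certainly": 5,      # high degree of certainty
    "Probably": 4,
    "Possible": 3,
    "Likely": 2,
    "Apparently": 1,
    "Perhaps": 0,        # some piece of evidence exists, but not trusted
}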
Hypothesis Examples:
Note that the particular John Doe is determined in the context of the person selected from the genealogy program data. It is left unspecified here because the context should be clear.
What I have added is what I hope is a practical way to implement and automate the parts of the process that are conducive to that, to show how this can fit in as a tool within existing genealogy programs, and to grow awareness that next generation genealogy programs should add these features, or similar ones.
Experience will inform as to whether such an automated tool is accurate enough, and in which cases, to meet someone's needs. You need not use the tool, nor do you need to take its suggestions.
If computers can do facial recognition, and they do, they can begin doing some aspects of genealogical data analysis.
This type of analysis will be in next generation genealogical programs. It is not too early to think about how to implement such tools.
I believe that this flow chart even without automation would be a helpful aid to research analysis.
The automation here will allow genealogists to easily record each and every Hypothesis and path to Proof that they undertake, right in the genealogy program that holds all the data used.
Although practicing genealogists say that they apply the GPS and this analysis to all data and hypotheses, I believe that many hypotheses are considered so drop-dead certain that no formal application is required, and that due to time constraints, many are left undone.
The preponderance of evidence proving the simplest cases will allow you to spend your time more effectively on those hypotheses that need attention, and the method will actually identify those that need your attention via reports of unproven things.
Pro - Gathers all the relevant evidence in one place for you to analyze, and gives supporting cues on the quality of the evidence.
Here are the sources exhibited that are used in the above examples.
With each one is provided the Evidence Style source citation. Constructive feedback from professional genealogists on improving these citations is welcome.
I would suggest standards for all of this, but several years ago I proposed a simple solution to the GEDCOM issue. See my recommendation HERE. Basically it analyzes the single property that the solution must have, and advocates a method to use that property to define it: a call for some genealogical standards body to convene a committee and determine precisely what variables are required to do professional genealogy. Name, classify, and describe them. Then convince genealogy program vendors to use those variables. Let start-up companies that see the wisdom of this way begin to threaten the established companies until they adopt the variables because of popular demand. The committee should meet on a periodic basis (yearly?) to review the list of variables and make minor tweaks as the field evolves. But only for good reasons, not whims. [This is sort of like how chemists decided to make the Periodic Chart of the Elements. It is very rigid, but as new elements are discovered, they do get added to the chart.]
It is like translation: corresponding words must exist in the various languages if a translation is to be successful. Failing that, a translator expands the dictionary to accommodate the new word.
To date, no genealogy program has taken me up on that solution, nor has any standards group in genealogy taken a leadership position in support of the solution. I find that sad. Let me know if I am wrong. Better yet, just have them send me their current working full genealogical variable list (and ideally the relational database data types), which will be my acid test of whether they have indeed even begun to solve the GEDCOM problem.
DEFINITION: I define solving the GEDCOM problem to be establishing a practical solution that, if implemented in computer programs, will allow EXPORT and IMPORT from a genealogy program with no loss of genealogical data.
Any group working on solving the GEDCOM problem that can't send me this list has not yet made any progress toward a solution.
Anyway, I would be happy to devote some time to the development of standards for Evidence Analysis, but only if I see movement toward the solution of the GEDCOM problem.