http://jytangledweb.org/genealogy/evidence_analysis_flow_chart/
Last Update: Thu Jun 06 18:51 EDT 2013
Initial Version: June 2013
This work examines the simple model "Evidence Analysis: A Research Process Map" presented on the inside cover of the book Evidence Explained by Elizabeth Shown Mills, and shows one practical way to begin designing automated evidence analysis programming tools to assist genealogists. Such tools should begin appearing in next generation genealogy programs.
This work fits in with the "how can we program that?" goals of my previous work on computer-ready templates for Evidence Style genealogical sources, and particularly my work on how to solve the GEDCOM problem (by having a central, world-wide body specify all genealogical variables currently known to be important).
This work represents thoughts on how to automate the Genealogical Proof Standard (GPS) by computer. The work here is based on that map and the GPS (see the references), extending them into an algorithmic approach that shows a way to add a tool to existing genealogy programs that assists professional genealogists in evidence analysis and proofs.
To begin a genealogical evidence analysis, first form a Hypothesis (such as: a particular John Doe was born on 15 Aug 1890).
Initialize the Level of Confidence variables to zero.
Examine all of the Sources in your genealogical database that contain Information relevant to this Hypothesis. Classify each Source as Original or Derivative, and set a Source Confidence Level accordingly: ConfS = OrigBoost or DerivBoost.
For each Source, extract each piece of Information relevant to the Hypothesis. Classify each piece of Information as Primary or Secondary, and set an Information Confidence Level accordingly: ConfI = PrimBoost or SecBoost.
For each piece of Information, extract each piece of Evidence relevant to the Hypothesis. Classify each piece of Evidence as Direct or Indirect, and set an Evidence Confidence Level accordingly: ConfE = DirBoost or IndirBoost.
Compute the Level of Confidence of the I,J,K-th piece of Evidence (the K-th piece of Evidence from the J-th piece of Information in the I-th Source) as:
Conf(I,J,K) = ConfS + ConfI + ConfE
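As an illustration of how these three loops and the confidence sum might look in code, here is a minimal Python sketch. The object attributes (.kind, .information, .evidence) and the boost constants are hypothetical placeholders; the numeric boost values are the illustrative ones listed further below (Original/Primary/Direct = 1.0, Derivative/Secondary/Indirect = 0.5).

# Hypothetical boost values; see the Source/Information/Evidence list below.
ORIG_BOOST, DERIV_BOOST = 1.0, 0.5   # Original vs. Derivative source
PRIM_BOOST, SEC_BOOST = 1.0, 0.5     # Primary vs. Secondary information
DIR_BOOST, INDIR_BOOST = 1.0, 0.5    # Direct vs. Indirect evidence

def score_hypothesis(sources):
    """Return Conf(I,J,K) for every piece of Evidence relevant to one Hypothesis.

    `sources` is assumed to be a list of Source objects, each with a .kind
    ('original' or 'derivative') and an .information list; each Information
    item has a .kind ('primary' or 'secondary') and an .evidence list; each
    Evidence item has a .kind ('direct' or 'indirect').
    """
    confidences = {}
    for i, source in enumerate(sources):
        conf_s = ORIG_BOOST if source.kind == "original" else DERIV_BOOST
        for j, info in enumerate(source.information):
            conf_i = PRIM_BOOST if info.kind == "primary" else SEC_BOOST
            for k, evidence in enumerate(info.evidence):
                conf_e = DIR_BOOST if evidence.kind == "direct" else INDIR_BOOST
                # Conf(I,J,K) = ConfS + ConfI + ConfE
                confidences[(i, j, k)] = conf_s + conf_i + conf_e
    return confidences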
Analysis consists of examining each piece of Evidence and documenting why it supports the Hypothesis or not. Each piece of Evidence has (optionally) a calculated Level of Confidence, and the collection of Evidence has (optionally) a standard deviation. You also have the ability to override this and assign a Level of Confidence that you are comfortable with.
Proof comes as a conclusion from examining the Evidence in the Analysis. Is there supporting evidence? Is it strong or weak? Is there a lot of supporting evidence of high confidence? Is there evidence not supporting the Hypothesis? How strong is it? This is where the Art of Genealogical Evidence Analysis is required. You need to document why you have reached your conclusions. This flow chart only shows how all of this is gathered for your analysis, and how your analysis itself is captured.
In the case of very simple Hypotheses, and tests against known values, it is possible that the program can automate the Evidence Analysis and conclude a Hypothesis is Proved (e.g. John Doe's birth date is 15 Jun 1880). You can, of course, choose not to allow the program to do this, and you can at any time revisit each Proof and change things. This is simply a tool that aids analysis and conclusions, recording the analysis.
■ Proven □ Disproven □ Defer ☑ Perhaps?
Hypotheses take the form of: Item | Action (Conditional?) | Object.
This list would be a pop-up list of objects in the Hypothesis statement, like:
Object | Qualifier | Operator | Value
Birth Date | John Doe [1770-1810] | Equals | 10 Dec 1810
Note that in a genealogy program, when the focus is on a given person, the qualifier can be implicit, but for algorithm completeness here, it must be specified.
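As a hypothetical sketch, such a Hypothesis statement could be represented with a small record type mirroring the Object | Qualifier | Operator | Value pop-up; the field names are illustrative only.

from dataclasses import dataclass

@dataclass
class Hypothesis:
    """One Hypothesis row: Object | Qualifier | Operator | Value."""
    obj: str        # e.g. "Birth Date"
    qualifier: str  # e.g. "John Doe [1770-1810]"; may be implicit in a program
    operator: str   # e.g. "Equals"
    value: str      # e.g. "10 Dec 1810"

example = Hypothesis("Birth Date", "John Doe [1770-1810]", "Equals", "10 Dec 1810")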
This ends the nesting of three loops. For each Hypothesis there are NumSources relevant Sources, each of which has NumInfo pieces of Information to examine, each of which has NumEvidence pieces of Evidence to examine.
The flow chart continues:
Here the end user asks for reports summarizing each Hypothesis, one summary page per Hypothesis. The end user also gets to specify a level of confidence, AConf, that they feel is appropriate for their needs. See the Examples below. [Text boxes capture the reasoning.] This is where verbal reasoning is done and documented. The category types are folded in intellectually, along with the optional numerical "assist" outlined here.
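As a hypothetical sketch of that numerical "assist", a summary report might simply compare whatever overall confidence has been recorded for each Hypothesis against the user's chosen AConf; the tuple layout and field names here are illustrative assumptions, not a proposed format.

def summary_report(hypotheses, a_conf):
    """Print one summary block per Hypothesis.

    `hypotheses` is assumed to be a list of (statement, confidence, reasoning)
    tuples, where `reasoning` holds the captured text-box content.
    """
    for statement, confidence, reasoning in hypotheses:
        flag = "meets" if confidence >= a_conf else "is below"
        print(f"Hypothesis: {statement}")
        print(f"  Confidence {confidence:.2f} {flag} AConf = {a_conf:.2f}")
        print(f"  Reasoning: {reasoning}")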
Source:
Original = 1.0
Derivative = 0.5
Information:
Primary = 1.0
Secondary = 0.5
Evidence:
Direct = 1.0
Indirect = 0.5
One way to estimate the uncertainty in the level of confidence of the evidence for a Hypothesis is to use the corrected sample standard deviation, s, expressed as:

s = sqrt( (1 / (N - 1)) * sum_{i=1..N} (x_i - x̄)² )

where N = the number of pieces of relevant evidence, x_i = the level of confidence of the i-th piece of evidence, and x̄ = the mean of the x_i. s only has meaning if N > 1.
These three categories give us a "feel" for the quality of a particular piece of evidence. But a piece of evidence can either support the hypothesis or not support it.
If a piece of evidence does not support the hypothesis, simply make its Level of Confidence negative. This will cause x̄ to be negative with a small s if all of the evidence is negative, or cause x̄ to have a large s if there is conflicting evidence.
Positive = + (supports) Negative = - (doesn't support)
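A sketch of this calculation in Python, assuming the signed confidence values have already been gathered into a list (the function name is hypothetical):

import math

def mean_and_spread(confidences):
    """Return (x_bar, s) for a list of signed confidence values.

    Supporting evidence is entered as a positive number, non-supporting
    evidence as a negative number. s is the corrected sample standard
    deviation and only has meaning when N > 1 (None is returned otherwise).
    """
    n = len(confidences)
    x_bar = sum(confidences) / n
    s = None
    if n > 1:
        s = math.sqrt(sum((x - x_bar) ** 2 for x in confidences) / (n - 1))
    return x_bar, s

# All supporting: high mean, small s.  Conflicting: mean near zero, large s.
print(mean_and_spread([3.0, 3.0, 2.5]))
print(mean_and_spread([3.0, -3.0, 2.5]))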
Here the end user gets to assert, with a captured text box of the reasoning behind it, whether they feel the Hypothesis is proven or not. Indeterminate is also an option; it can be revisited later.
■ Proven □ Disproven □ Defer ☑ Perhaps
The flow chart uses a few variables. Here are their descriptions:
NConf=5   NConf=10   Percent      Fractional
                     Confidence   Confidence   Description
-------   --------   ----------   ----------   -----------
   0          0           0          0.00      uncertain
              1          10          0.10
   1          2          20          0.20
              3          30          0.30
   2          4          40          0.40
              5          50          0.50
   3          6          60          0.60
              7          70          0.70
   4          8          80          0.80
              9          90          0.90
   5         10         100          1.00      certain
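If a program stores the user's choice on either scale, converting a level to the internal fractional confidence is a one-liner; this helper is a hypothetical sketch, not part of any existing program.

def level_to_fraction(level, nconf=10):
    """Convert a level on an NConf-point scale to a 0.00-1.00 fraction.

    E.g. level 4 on the NConf=5 scale -> 0.80; level 7 on NConf=10 -> 0.70.
    """
    if not 0 <= level <= nconf:
        raise ValueError("level must be between 0 and NConf")
    return level / nconf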
The hope is that experimenting with "boost" levels for the various properties will help the end user and eventually lead to some amount of automated processing for analysis of proof.
The following types of reports are samples of useful ones. Many others are possible.
Not only are new reports possible, but after significant experience some analysis and proof tools can be developed, such as:
Analyze/Prove | {Birth Dates, Death Dates} | for { All; Selected Individuals}
My assertion is that with a program such as this, and with experience, it will be possible to generate automated reports of proven Hypotheses. How about running a report for Proofs of the birth dates of all persons in your data? The more meticulous you have been about providing sources for each fact for each person, and analyzing that fact, even when facts disagree, the better the program can decide whether a fact is proven. If you have 5 Original, Primary, and Direct documents that all say the same birth date, it seems to me that the program can make the decision that it is proven. If there is conflicting evidence, then the program will need your assistance. But the process here will present all the evidence for you in one place so you can make your decisions and record them.
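To make that decision rule concrete, here is a hedged sketch; the threshold of five agreeing Original/Primary/Direct items, the field layout, and the returned labels are illustrative assumptions, not a proposed standard.

def auto_assess(evidence_items, min_agreeing=5):
    """Suggest 'Proven', 'Disproven', or 'Defer' for one fact (e.g. a birth date).

    `evidence_items` is assumed to be a list of (value, conf) pairs, where conf
    is the signed Conf(I,J,K) for that piece of evidence; with the example
    boosts above, conf == 3.0 means Original + Primary + Direct.
    """
    supporting = [(v, c) for v, c in evidence_items if c > 0]
    conflicting = [(v, c) for v, c in evidence_items if c < 0]

    if supporting and conflicting:
        return "Defer"       # conflicting evidence needs the genealogist
    if conflicting and not supporting:
        return "Disproven"   # only non-supporting evidence was found
    strong = [v for v, c in supporting if c >= 3.0]
    if len(strong) >= min_agreeing and len(set(strong)) == 1:
        return "Proven"      # e.g. five Original/Primary/Direct records agree
    return "Defer"           # not enough high-confidence agreement yet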
Just as now reports can be run that flag people that have no birth date specified, you could run a report flagging people whose birth dates are not considered proven. The types of these automated reports can become more sophisticated as experience is gained.
If someone questions why you believe someone was born on a certain day, you can run a report that shows the Hypothesis, the relevant sources, their classification, your notes and level of confidence assigned (if desired), and the statement of your conclusion. Right from your favorite genealogy program.
ESM, p. 19.
Certainly    5   high degree of certainty
Probably     4
Possible     3
Likely       2
Apparently   1
Perhaps      0   some piece of evidence exists, but not trusted
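If a program wanted to offer these words instead of raw numbers, one hypothetical mapping onto the NConf=5 scale above would be a simple lookup table:

ESM_TERMS = {            # after ESM, p. 19, mapped onto the NConf=5 scale
    "Certainly": 5,      # high degree of certainty
    "Probably": 4,
    "Possible": 3,
    "Likely": 2,
    "Apparently": 1,
    "Perhaps": 0,        # some piece of evidence exists, but not trusted
}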
Hypothesis Examples:
Note that the particular John Doe is determined in the context of the person selected from the genealogy program data. It is left unspecified here because the context should be clear.
What I have added is what I hope is a practical way to implement and automate the parts of the process that are conducive to that, to show how this can fit in as a tool within existing genealogy programs, and to grow awareness that next generation genealogy programs should add these features, or similar ones.
Experience will inform as to whether such an automated tool is accurate enough, and in which cases, to meet someone's needs. You need not use the tool, nor do you need to take its suggestions.
If computers can do facial recognition, and they do, they can begin doing some aspects of genealogical data analysis.
This type of analysis will be in next generation genealogical programs. It is not too early to think about how to implement such tools.
I believe that this flow chart even without automation would be a helpful aid to research analysis.
The automation here will allow genealogists to easily record each and every Hypothesis and path to Proof that they undertake, right in the genealogy program that holds all the data used.
Although practicing genealogists say that they apply the GPS and this analysis to all data and hypotheses, I believe that many hypotheses are considered so drop-dead certain that no formal application is required, and that due to time constraints, many are left undone.
The preponderance of evidence proving the simplest cases will allow you to spend your time more effectively on those hypotheses that need attention, and the method will actually identify those that need your attention via reports of unproven things.
Pro - Gathers all the relevant evidence in one place for you to analyze, and gives supporting cues on the quality of the evidence.
Here are the sources exhibited that are used in the above examples.
With each one is provided the Evidence Style source citation. Constructive feedback from professional genealogists on improving these citations is welcome.
I would suggest standards for all of this, but several years ago I proposed a simple solution to the GEDCOM issue. See my recommendation HERE. Basically it analyzes the single property that the solution must have, and advocates a method to use that property to define it: a call for some genealogical standards body to convene a committee and determine precisely what variables are required to do professional genealogy. Name, classify, and describe them. Then convince genealogy program vendors to use those variables. Let start-up companies that see the wisdom of this way begin to threaten the established companies until they adopt the variables because of popular demand. The committee should meet on a periodic basis (yearly?) to review the list of variables and make minor tweaks as the field evolves. But only for good reasons, not whims. [This is sort of like how chemists decided to make the Periodic Chart of the Elements. It is very rigid, but as new elements are discovered, they do get added to the chart.]
It is like translation: corresponding words must exist in the various languages if a translation is to be successful. Failing that, a translator expands the dictionary to accommodate the new word.
To date, no genealogy program has taken me up on that solution, nor has any standards group in genealogy taken a leadership position in support of the solution. I find that sad. Let me know if I am wrong. Better yet, just have them send me their current working full genealogical variable list (and ideally the relational database data types), which will be my acid test of whether they have indeed even begun to solve the GEDCOM problem.
DEFINITION: I define solving the GEDCOM problem to be establishing a practical solution that, if implemented in computer programs, will allow EXPORT and IMPORT from a genealogy program with no loss of genealogical data.
Any group working on solving the GEDCOM problem that can't send me this list has not yet made any progress toward a solution.
Anyway, I would be happy to devote some time to the development of standards for Evidence Analysis, but only if I see movement toward the solution of the GEDCOM problem.