‘Proximity’ Genealogical Database Project Update
Update: I have much more to report about the project, but I have to choose between writing about it and developing it, and right now the development work is taking precedence. Look for a new post soon, I hope. If this work sounds like something you might want to contribute to financially in a small way, please let me know that you are interested.
The Proximity project is coming along. I have successfully loaded a large family tree from a GEDCOM file and a very large set of pedigrees downloaded from Ancestry DNA using DNAGedcom.com. Name, place, and date parsers are working reasonably well, although there is considerable room for improvement. I have been doing a great deal of refactoring as I discover, bit by bit, how to usefully apply a relational database system to solving difficult genealogical problems.
Identifying optimal primary keys has been somewhat challenging. This is a research tool, not a transactional or warehouse application. I have to think a lot about what it is that I am trying to do.
The present database addresses the problem of finding common ancestors who lived during the same time period in proximate geographical locations. Future iterations will look for name similarities and for DNA (SNP) matching proximities. Output currently goes only to a SQL query window, but will eventually be used to produce either reports or GEDCOM files that can be used to visualize not only traditional family trees but also “dotted line” relationships representing time/geography and genetic distance proximities.
The database contains two models: a hierarchical “tree” model based upon the GEDCOM file format, for easy importing, and a linear “pedigree” model for high-speed querying. The tree model generally requires a recursive CTE in order to traverse it, which is slow if done repeatedly. The pedigree model represents lineages in a single, large, well-indexed table that is loaded just once via a recursive CTE and is extremely fast to query. Clustered and non-clustered indexes support querying lists of ancestors or descendants, and a filtered index supports locating specific descendants in a single lookup. This latter feature allows ‘proximities’ of interest to be tagged with the names of present-day DNA cousins that are descendants, while incurring minimal overhead.
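To make the two-model idea concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not the project's actual schema: the table, view, and column names (person, parent_link, pedigree, generations) and the sample people are invented for illustration, and SQLite stands in for whatever database engine the project really uses. The tree is flattened once with a recursive CTE, after which ancestor and descendant lists come from plain indexed lookups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical "tree" model: each person points at their parents.
CREATE TABLE person (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    father_id INTEGER REFERENCES person(id),
    mother_id INTEGER REFERENCES person(id)
);
INSERT INTO person VALUES
    (1, 'Great-grandfather', NULL, NULL),
    (2, 'Grandfather',       1,    NULL),
    (3, 'Grandmother',       NULL, NULL),
    (4, 'Father',            2,    3),
    (5, 'Mother',            NULL, NULL),
    (6, 'Child',             4,    5);

-- One row per child-to-parent edge, so the CTE needs only one recursive step.
CREATE VIEW parent_link AS
    SELECT id AS child_id, father_id AS parent_id
    FROM person WHERE father_id IS NOT NULL
    UNION ALL
    SELECT id, mother_id FROM person WHERE mother_id IS NOT NULL;

-- Hypothetical "pedigree" model: one row per (descendant, ancestor) path.
-- No uniqueness constraint, because the same ancestor can be reached
-- through more than one line of descent.
CREATE TABLE pedigree (
    descendant_id INTEGER NOT NULL,
    ancestor_id   INTEGER NOT NULL,
    generations   INTEGER NOT NULL
);

-- Load the pedigree table once by walking the tree recursively.
WITH RECURSIVE lineage(descendant_id, ancestor_id, generations) AS (
    SELECT child_id, parent_id, 1 FROM parent_link
    UNION ALL
    SELECT l.descendant_id, pl.parent_id, l.generations + 1
    FROM lineage AS l
    JOIN parent_link AS pl ON pl.child_id = l.ancestor_id
)
INSERT INTO pedigree
SELECT descendant_id, ancestor_id, generations FROM lineage;

-- Indexes for both query directions.
CREATE INDEX pedigree_by_descendant ON pedigree (descendant_id, ancestor_id);
CREATE INDEX pedigree_by_ancestor   ON pedigree (ancestor_id, descendant_id);
""")


def ancestors_of(db, person_id):
    """Return (ancestor_id, generations) pairs with no recursion at query time."""
    return db.execute(
        "SELECT ancestor_id, generations FROM pedigree "
        "WHERE descendant_id = ? ORDER BY generations, ancestor_id",
        (person_id,),
    ).fetchall()


def descendants_of(db, person_id):
    """Return (descendant_id, generations) pairs with no recursion at query time."""
    return db.execute(
        "SELECT descendant_id, generations FROM pedigree "
        "WHERE ancestor_id = ? ORDER BY generations, descendant_id",
        (person_id,),
    ).fetchall()


print(ancestors_of(conn, 6))
print(descendants_of(conn, 1))
```

The trade-off is the usual one for a flattened hierarchy: the pedigree table holds many more rows than the tree, but each lookup becomes a single index range scan instead of a repeated recursive traversal.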
The main issue, as expected, has been the performance of geospatial distance computations. Now that the basic data models are working, I am developing a place-to-place distance cache that will accumulate distance calculation results so that a given calculation need only be performed once. Another optimization recognizes “static” matches, where name comparisons alone are all that is required to detect a match, avoiding having to do a geospatial computation at all.
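The caching idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: the class name DistanceCache, the haversine great-circle formula, the in-memory dictionary, and the approximate coordinates are all choices made for this example, and a “static” match is interpreted here simply as two place names that are identical after trimming whitespace and ignoring case.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def normalize(place):
    """Assumed normalization: trim whitespace and ignore case."""
    return place.strip().casefold()


class DistanceCache:
    """Hypothetical place-to-place cache: each pair is computed at most once."""

    def __init__(self, coords):
        # coords maps a place name to an approximate (latitude, longitude).
        self.coords = {normalize(name): pos for name, pos in coords.items()}
        self._cache = {}
        self.computations = 0  # counts real geospatial calculations

    def distance_km(self, place_a, place_b):
        a, b = normalize(place_a), normalize(place_b)
        if a == b:
            # "Static" match: the names alone settle it, so skip the geometry.
            return 0.0
        key = tuple(sorted((a, b)))  # distance is symmetric, so order is ignored
        if key not in self._cache:
            self.computations += 1
            self._cache[key] = haversine_km(*self.coords[a], *self.coords[b])
        return self._cache[key]


cache = DistanceCache({
    "Boston, Massachusetts": (42.3601, -71.0589),
    "Salem, Massachusetts": (42.5195, -70.8967),
})
print(cache.distance_km("Boston, Massachusetts", "Salem, Massachusetts"))
print(cache.distance_km("Salem, Massachusetts", "Boston, Massachusetts"))
print(cache.computations)
```

In a database setting the dictionary would more likely be a table keyed on the ordered pair of place identifiers, so that results persist between sessions.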