I am developing a proof of concept for a genealogical database intended to support queries over several kinds of proximity:
- Generational relationships (traditional family tree)
- Proximity of events in time and space (geolocation)
- Genetic proximity (DNA matching via SNPs)
The intent is to integrate relationship information from these three sources. Even when generational (family tree) data is unknown or unreliable, as is frequently the case, relationships inferred from event dates and locations and from SNP matching data can then still be represented in the “tree” and visualized in the same way as a traditional tree.
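One way to picture this integration is as a set of person-to-person links, each tagged with the source it came from. The following is only a rough sketch of that idea, not the actual schema: the `Link` class, the source labels, and the 0-to-1 `weight` score are all assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One inferred relationship between two people (hypothetical structure)."""
    person_a: str
    person_b: str
    source: str    # assumed labels: "tree", "event", or "dna"
    weight: float  # assumed proximity score between 0 and 1


def neighbors(links, person):
    """Return {other_person: {source: weight}} for every link touching `person`."""
    result = defaultdict(dict)
    for link in links:
        if link.person_a == person:
            result[link.person_b][link.source] = link.weight
        elif link.person_b == person:
            result[link.person_a][link.source] = link.weight
    return dict(result)


# Made-up sample data: I1 and I2 are linked by the tree, while I1 and I3
# are linked only by DNA and by a shared event, with no tree data at all.
links = [
    Link("I1", "I2", "tree", 1.0),
    Link("I1", "I3", "dna", 0.25),
    Link("I3", "I1", "event", 0.6),
]
```

Here `neighbors(links, "I1")` would list I3 alongside I2 even though no family tree record connects them, which is the case the integrated "tree" is meant to cover.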
After experimenting further with the Family Tree Maker (FTM) report output formats and encountering a number of issues with identifying primary keys, I decided to read the GEDCOM files directly. So far, that is proving to be much simpler. I am using the .NET GedComReader library to do the hard work. It returns the structured GEDCOM data as an XML document containing a series of nodes representing the major entities, which I find easier to work with than flat text file records.
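To show why direct GEDCOM reading helps with the primary key problem, here is a minimal sketch of turning GEDCOM lines into nested records. It is not the GedComReader API, which I am not reproducing here. It is a stand-in written in Python that relies only on the standard GEDCOM line layout of `LEVEL [@XREF@] TAG [VALUE]`, and it assumes a well-formed file in which levels never skip. The function name `parse_gedcom` and the sample data are made up.

```python
import re

# GEDCOM line layout: LEVEL, optional @XREF@ identifier, TAG, optional VALUE.
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\S+)(?:\s(.*))?$")


def parse_gedcom(text):
    """Return the level-0 records as nested dicts (assumes well-formed input)."""
    records = []
    stack = []  # stack[i] holds the most recent node seen at level i
    for raw in text.splitlines():
        match = LINE_RE.match(raw)
        if match is None:
            continue  # skip blank or malformed lines
        level = int(match.group(1))
        node = {
            "xref": match.group(2),
            "tag": match.group(3),
            "value": match.group(4) or "",
            "children": [],
        }
        del stack[level:]  # drop nodes at this level or deeper
        if level == 0:
            records.append(node)
        elif stack:
            stack[-1]["children"].append(node)
        stack.append(node)
    return records


SAMPLE = """0 HEAD
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 1 JAN 1900
2 PLAC Boston, Suffolk, Massachusetts, USA
0 TRLR"""

# The @XREF@ identifier on each INDI record can serve as a primary key.
individuals = {r["xref"]: r for r in parse_gedcom(SAMPLE) if r["tag"] == "INDI"}
```

Each individual arrives with its own identifier and with its events nested beneath it, which is the same general shape as the node-per-entity XML described above.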
I am still using FTM for place name normalization. It is quite good for that.
For the initial proof of concept, I am using Family Tree Maker (FTM) to generate, via a Place Report, a CSV file containing places, people, and events. An SSIS package will extract, transform, and load (ETL) this information into a SQL Server database for further analysis. To improve reliability, FTM will be used to merge duplicate people and to normalize place and other data prior to loading.
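The load step can be sketched in a few lines, with heavy caveats. The example below uses Python's `csv` module and an in-memory SQLite database purely as stand-ins for SSIS and SQL Server. The column names (`Place`, `Person`, `Event`, `Date`), the two-table layout, and the sample rows are assumptions, not the actual Place Report format.

```python
import csv
import io
import sqlite3


def load_place_report(csv_text, conn):
    """Load place/person/event rows into two tables; return the rows loaded."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS place (
            id INTEGER PRIMARY KEY,
            name TEXT UNIQUE
        );
        CREATE TABLE IF NOT EXISTS event (
            id INTEGER PRIMARY KEY,
            person TEXT,
            kind TEXT,
            event_date TEXT,
            place_id INTEGER REFERENCES place(id)
        );
    """)
    count = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["Place"].strip()
        # Each distinct place name is stored once and referenced by id.
        conn.execute("INSERT OR IGNORE INTO place (name) VALUES (?)", (name,))
        place_id = conn.execute(
            "SELECT id FROM place WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO event (person, kind, event_date, place_id) "
            "VALUES (?, ?, ?, ?)",
            (row["Person"].strip(), row["Event"].strip(),
             row["Date"].strip(), place_id),
        )
        count += 1
    conn.commit()
    return count


# Made-up rows in the assumed column layout.
SAMPLE = """Place,Person,Event,Date
"Boston, Suffolk, Massachusetts, USA",John Smith,Birth,1 JAN 1900
"Boston, Suffolk, Massachusetts, USA",Mary Jones,Marriage,12 JUN 1925
"Albany, Albany, New York, USA",John Smith,Death,3 MAR 1970
"""

conn = sqlite3.connect(":memory:")
loaded = load_place_report(SAMPLE, conn)
```

Note that the sketch identifies people by name only, which is exactly where normalizing and merging duplicates before loading pays off: two spellings of the same person or place would otherwise load as separate entities.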
Once ETL is working, it should be possible to use US shapefile data to translate place names into polygons for geospatial analysis. I have done this kind of thing before, and I realize that it is going to take some doing.
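As a small illustration of the kind of geospatial test that polygons make possible, here is a sketch of a point-in-polygon check using the standard ray-casting technique. It does not read shapefiles. The polygon is a made-up rectangle, the place name is hypothetical, and coordinates are treated as flat (longitude, latitude) pairs, which is a simplification. In SQL Server, the built-in spatial types would normally do this work instead.

```python
def point_in_polygon(lon, lat, ring):
    """Ray casting: count edge crossings of a ray running east from the point."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical lookup from a normalized place name to its boundary ring,
# given as (longitude, latitude) vertices.
PLACES = {
    "Example County, Example State, USA": [
        (-72.0, 41.0), (-70.0, 41.0), (-70.0, 43.0), (-72.0, 43.0),
    ],
}


def event_is_in_place(lon, lat, place_name):
    """Return True if the coordinates fall inside the named place's polygon."""
    return point_in_polygon(lon, lat, PLACES[place_name])
```

With polygons attached to normalized place names, two events can be compared by whether they fall within the same or neighboring boundaries rather than by matching place name strings.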
SNP data loading presents its own set of challenges, which I intend to address later on.
The initial platforms are a local SQL Server instance and a parallel Azure database, the latter so that I can gain experience in that world.
A longer-term goal is to be able to load GEDCOM files directly, using one of the available libraries, and to be able to represent the results graphically. One possibility would be to use an existing family tree web app as the front end, one that supports plug-ins that connect to the Azure (or other cloud-based) database.