GedcomReader library fails while reading FTM 2017 GEDCOM export
While testing Family Tree Maker 2017 (FTM), I happened to notice that GEDCOM imports for my Proximity Database project had ceased to work. At first I suspected FTM, but the output was fine and it was a complete mystery why the imports were failing.
I use GedcomReader to read GEDCOM files and pre-parse them into XML for further ETL parsing. It had worked flawlessly up to this point. After several hours of stepping through the code (and first figuring out how to debug a GAC assembly, since I am using SSIS and GedcomReader needs to be in the GAC for my script task to reference it), I realized that:
- GedcomReader supports an optional Byte Order Mark (BOM) at the beginning of the input file.
- FTM 2017 added a BOM to GEDCOM files that it writes.
- Most text editors completely hide BOMs, as if they were not there, with the notable exception of the Visual Studio hex editor.
- GedcomReader is missing a line of code in its byte-to-char decoding logic. Without that line, it inserts NUL characters into the character buffer whenever a BOM is present and the GEDCOM is of significant length (> 4093 bytes).
- GedcomReader treats the first NUL that it encounters as end-of-file.
Oops.
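Since most editors hide the BOM, a quick programmatic check is handy. The following is a minimal sketch, assuming the BOM in question is the 3-byte UTF-8 one (EF BB BF); the class and method names are my own and are not part of GedcomReader.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper (not part of GedcomReader): reports whether data starts with a UTF-8 BOM.
public static class BomCheck
{
    // Returns true when the data begins with the UTF-8 preamble (EF BB BF).
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble();
        return data.Length >= preamble.Length
            && data.Take(preamble.Length).SequenceEqual(preamble);
    }

    // Reads only the first three bytes of the file, so it is cheap even for very large GEDCOMs.
    public static bool FileStartsWithUtf8Bom(string path)
    {
        byte[] head = new byte[3];
        using (FileStream stream = File.OpenRead(path))
        {
            int read = stream.Read(head, 0, head.Length);
            return StartsWithUtf8Bom(head.Take(read).ToArray());
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: BomCheck <path-to-gedcom>");
            return;
        }
        Console.WriteLine(FileStartsWithUtf8Bom(args[0]) ? "UTF-8 BOM present" : "no UTF-8 BOM");
    }
}
```

Running this against an export from the old and the new FTM version should show whether the BOM is the only difference in the first few bytes.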
It was an easy fix: change (in StreamScannerReader.cs)
    // Skip any byte order mark.
    if (_pos == 0 && _byteBuffer.Take(_preamble.Length).SequenceEqual(_preamble))
    {
        Array.Copy(_byteBuffer, _preamble.Length, _byteBuffer, 0, byteCount - _preamble.Length);
        byteCount -= _preamble.Length;
    }
to
    // Skip any byte order mark.
    if (_pos == 0 && _byteBuffer.Take(_preamble.Length).SequenceEqual(_preamble))
    {
        Array.Copy(_byteBuffer, _preamble.Length, _byteBuffer, 0, byteCount - _preamble.Length);
        bytesToRead -= _preamble.Length;
        byteCount -= _preamble.Length;
    }
I am still testing the results, but at least my very large GEDCOMs are no longer being reduced to 4K stubs.
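For regression testing, it helps to have a small reference decoder to compare against. The sketch below is not GedcomReader's code. It is a simplified, hypothetical chunked decoder that assumes UTF-8 input and a 4096-byte buffer (which would be consistent with the 4093-byte threshold and a 3-byte BOM, though I have not confirmed the buffer size). Rather than shifting the buffer with Array.Copy, it passes an offset to the decoder, so there is only one byte count to keep in sync.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical reference decoder (not GedcomReader's implementation).
public static class ChunkedDecodeSketch
{
    // Decodes a UTF-8 stream in fixed-size chunks, dropping a leading BOM if present.
    // Simplification: assumes the first Read call returns at least the full preamble.
    public static string DecodeSkippingBom(Stream input, int bufferSize = 4096)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble();
        byte[] byteBuffer = new byte[bufferSize];
        char[] charBuffer = new char[Encoding.UTF8.GetMaxCharCount(bufferSize)];
        // A stateful decoder handles multi-byte sequences split across chunks.
        Decoder decoder = new UTF8Encoding(false).GetDecoder();
        var result = new StringBuilder();
        bool firstChunk = true;

        int byteCount;
        while ((byteCount = input.Read(byteBuffer, 0, byteBuffer.Length)) > 0)
        {
            int offset = 0;
            if (firstChunk
                && byteCount >= preamble.Length
                && byteBuffer.Take(preamble.Length).SequenceEqual(preamble))
            {
                // Skip the BOM and shrink the count of valid bytes by the same amount.
                offset = preamble.Length;
                byteCount -= preamble.Length;
            }
            firstChunk = false;

            // Decode only the valid bytes, then append only the chars actually produced.
            int charCount = decoder.GetChars(byteBuffer, offset, byteCount, charBuffer, 0);
            result.Append(charBuffer, 0, charCount);
        }
        return result.ToString();
    }

    public static void Main()
    {
        // Synthetic input well past one buffer: BOM followed by 14,000 bytes of filler lines.
        string body = string.Concat(Enumerable.Repeat("1 NOTE filler\n", 1000));
        byte[] data = Encoding.UTF8.GetPreamble().Concat(Encoding.ASCII.GetBytes(body)).ToArray();

        string text = DecodeSkippingBom(new MemoryStream(data));
        Console.WriteLine("decoded length: " + text.Length);
        Console.WriteLine("first NUL at index: " + text.IndexOf('\0'));
    }
}
```

If the patched reader and a reference like this agree on length and neither output contains a NUL, that is a reasonable sign the truncation is gone.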