New DelimitedTextCleaner library available on GitHub
I have released a small, free, single-class C# library that can be used with SSIS or in other environments to parse CSV and similar delimited text formats (such as tab- or pipe-delimited text), with some tolerance for errors in the use of double-quote characters. The repo can be found on GitHub.
Incorrectly formatted CSV files are not uncommon, and being able to read them anyway can be useful. The DelimitedTextCleaner class supports error tolerance in two ways:
- It attempts to preserve incorrectly escaped double quotes and to recognize the field delimiter in fields that contain them, even when the error would normally cause the delimiter to be consumed as text.
- Since the source code is available, it can be modified to parse specific known patterns of data corruption.
The design uses a simple state machine to walk through a delimited text string and parse it into fields. The results can be returned either as a “clean” delimited text string with escapes or as an array of unescaped field values.
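To make the state-machine idea concrete, here is a minimal sketch of a quote-tolerant parser. It is not the DelimitedTextCleaner source or its API: the class name `TolerantSplitSketch`, the methods `ParseLine` and `ToCleanLine`, the four states, and the recovery rule (a quote inside a quoted field that is not followed by another quote or by the delimiter is kept as literal text) are all assumptions chosen for illustration. The library's own rules may well differ.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative sketch only; names and rules here are hypothetical,
// not taken from the DelimitedTextCleaner library.
public static class TolerantSplitSketch
{
    private enum State { FieldStart, Unquoted, Quoted, QuoteInQuoted }

    // Splits one line into unescaped field values.
    public static string[] ParseLine(string line, char delimiter = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        var state = State.FieldStart;

        foreach (char c in line)
        {
            switch (state)
            {
                case State.FieldStart:
                    if (c == '"') state = State.Quoted;                  // opening quote
                    else if (c == delimiter) fields.Add(string.Empty);   // empty field
                    else { current.Append(c); state = State.Unquoted; }
                    break;

                case State.Unquoted:
                    if (c == delimiter)
                    {
                        fields.Add(current.ToString());
                        current.Clear();
                        state = State.FieldStart;
                    }
                    else current.Append(c); // a stray quote here is kept as text
                    break;

                case State.Quoted:
                    if (c == '"') state = State.QuoteInQuoted; // closing, escaped, or stray?
                    else current.Append(c);                    // delimiters are text here
                    break;

                case State.QuoteInQuoted:
                    if (c == '"')
                    {
                        current.Append('"');   // correctly escaped "" pair
                        state = State.Quoted;
                    }
                    else if (c == delimiter)
                    {
                        fields.Add(current.ToString()); // the quote closed the field
                        current.Clear();
                        state = State.FieldStart;
                    }
                    else
                    {
                        // Assumed recovery rule: treat the quote as literal text.
                        current.Append('"').Append(c);
                        state = State.Quoted;
                    }
                    break;
            }
        }

        // End of input closes the last field, even if its quote was never closed.
        fields.Add(current.ToString());
        return fields.ToArray();
    }

    // Re-emits fields as a delimited line, quoting and escaping where needed.
    public static string ToCleanLine(IEnumerable<string> fields, char delimiter = ',')
    {
        var parts = new List<string>();
        foreach (string f in fields)
        {
            bool needsQuotes = f.IndexOf(delimiter) >= 0 || f.IndexOf('"') >= 0
                            || f.IndexOf('\n') >= 0 || f.IndexOf('\r') >= 0;
            parts.Add(needsQuotes ? "\"" + f.Replace("\"", "\"\"") + "\"" : f);
        }
        return string.Join(delimiter.ToString(), parts);
    }
}
```

Given the malformed input `1,"5" pipe, steel",ok`, `ParseLine` in this sketch returns three fields (`1`, `5" pipe, steel`, `ok`), and `ToCleanLine` re-emits them as `1,"5"" pipe, steel",ok`. The rule is deliberately simple: a stray quote that happens to sit directly before a delimiter is indistinguishable from a closing quote here, so handling that case would need extra lookahead or data-specific rules.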
Further details can be found in the GitHub repo, in the doc file, and in the code itself.