This last week I wrote two SSIS packages (SQL Server 2005) that read data that was originally published using Excel .xls format and populated a staging table. The data is only published once a year, with perhaps one or two corrections following, but I thought I would save a little bit of trouble and use the Excel data source to read all of the worksheets that I needed from one source. After all, the import wizard uses it when it builds a package for you.
Oops, big mistake.
I am not sure now, a few days later, that I can even remember everything that happened, but here are the highlights.
- The Excel source has no “skip headers” option, and my source included four extra rows after the column headings and before the start of the actual data, so I wrote a script transformation to skip those rows. It’s easy to do: count rows using a static VB.NET variable, add a second synchronous output, place both outputs in the same exclusion group, and (using script code) use the auto-generated “direct to” methods to send each row to the appropriate output. I called one “Normal” and one “Skip,” and I didn’t connect the Skip output to anything. Worked great.
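For anyone who wants to do the same thing, the script body looks roughly like this (the output names and row count are from my setup; the DirectRowTo* methods are auto-generated from whatever you name the outputs):

```vb
' Script component (SSIS 2005): route the first four rows to an
' unconnected "Skip" output and everything else to "Normal".
' Both outputs are synchronous and share one exclusion group, so the
' DirectRowToNormal/DirectRowToSkip methods are generated for you.
Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' A Static local persists across calls for the life of the component.
        Static rowCount As Integer = 0
        rowCount += 1
        If rowCount <= 4 Then
            Row.DirectRowToSkip()     ' Skip output is connected to nothing
        Else
            Row.DirectRowToNormal()
        End If
    End Sub
End Class
```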
- The Excel source decided that all text fields were Unicode (DT_WSTR), and would not convert them to DT_STR. That required me to include an explicit Unicode-to-string conversion in the data flow for each text field.
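In a Derived Column or Data Conversion transformation, the cast looks something like this (the column name, length, and 1252 code page are just illustrative):

```
(DT_STR, 10, 1252)ZipCode
```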
- This is where it got ugly. The Excel source makes its own mind up about the data type of each column. Two of the columns contained ZIP codes. The first data row contained a 5-digit ZIP in one column and a 9-digit ZIP in the other. Based upon I-don’t-know-what, the Excel source decided that it would process only 5-digit ZIPs from the first column and only 9-digit ZIPs from the other, returning NULL for everything else, and without any data loss warning or error! There were even more (similar) problems with other columns, but that was enough to convince me not to use this component.
The solution, then, was to use the flat-file source, which does allow you to skip rows at the beginning, and to configure the type of each column. It also fails by default on conversion errors, although you can change the behavior for each column using the Advanced Editor. It won’t treat data as Unicode by default.
Since the original data was in .XLS files, I used Excel to convert each worksheet to tab-delimited text. Even though it is tab delimited, Excel still encloses text fields within double quotes if the field contains special characters, so I had to specify double quote as the text qualifier in the flat file source. As long as I was having to fool with it anyway, I sorted the data before saving it, and then I set the “is sorted” and sort key index properties in the flat file source using the Advanced Editor. This allowed me to do a Merge Join of the various sources into a single SQL Server table without having to include a Sort in the flow.
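Roughly, these are the relevant settings (property names as I recall them from the connection manager and the source's Advanced Editor):

```
Flat File Connection Manager:
  Format              = Delimited (tab)
  Text qualifier      = "
  Header rows to skip = <n>

Flat File Source → Advanced Editor → Input and Output Properties:
  Flat File Source Output:            IsSorted = True
  Flat File Source Output → Columns:  SortKeyPosition = 1, 2, ...
```

Setting IsSorted and the sort key positions is just a promise to the engine that the data really is sorted; nothing checks it for you, so a Merge Join downstream will quietly misbehave if the file isn't actually in that order.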
It all worked very nicely after that. The one complaint I have is that the IS designer is horribly slow when you make changes to a Merge Join component that has a large number (~200 or more) of input columns. I had to go find other things to do while it crunched for I don't know how many minutes on a 2.4 GHz Core 2 Duo system. Unfortunately, the Excel data contains approximately 600 columns of data spread across three worksheets. Perhaps sometime in the future I can split that part of the database into several target tables.
In retrospect, while the import wizard does build packages using the Excel source, you can't edit those packages in the designer, and that should have told me something. Originally, I was going to create the packages using the wizard and then tweak them to do what I wanted, but the only way I could edit them (if they contained Excel input) was as XML. When I tried to save, the designer would complain that Unicode doesn't convert to string. Some day I may sit down and try to figure out what is really going on. In the meantime, if anyone out there knows, please enlighten us.
It is rumored that with SSIS 2005 there is a tweak that can be applied (using the disconnected editor) to work around this problem. I haven’t found an equivalent fix for SSIS 2008. That is probably a good thing because what I have learned now is that the requisite driver for the Excel source does not exist for x64 platforms and that is what I use now.
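I can't confirm this is the tweak in question, but the workaround most often cited for the Excel/Jet provider's type guessing is the IMEX extended property, sometimes combined with a registry change (details from memory; verify before relying on them):

```
; Connection string extended properties: IMEX=1 tells the driver to
; treat columns containing mixed data as text instead of guessing a type.
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\data\source.xls;
Extended Properties="Excel 8.0;HDR=YES;IMEX=1"

; Registry: how many rows the driver samples when guessing a column's
; type (default 8; 0 reportedly means sample up to 16,384 rows).
HKLM\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel\TypeGuessRows
```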
The issue with all strings being presented as Unicode also occurs when importing a Jet database. It's not entirely unreasonable, just inconvenient sometimes. A useful convention is to add a suffix such as _U to each such field name. You can then reuse the original column name later on when converting to varchar with a Derived Column or Data Conversion transformation.
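As a sketch, a Data Conversion transformation following that convention might look like this (the column names, lengths, and code page are made up for illustration):

```
Input Column   Output Alias   Data Type         Length   Code Page
City_U         City           string [DT_STR]   50       1252
State_U        State          string [DT_STR]   2        1252
```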
The lack of x64 versions of certain components such as the Excel source is easily resolved by running the package in 32-bit mode. The same issue came up the other day when I was using the Access source in the designer, and I simply changed the project-level debugging option that forces 32-bit execution.
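For reference, the two places I know of to force 32-bit execution (paths and names are from a SQL Server 2005 x64 installation; adjust for your version):

```
' In BIDS: Project → Properties → Debugging → Run64BitRuntime = False
' (affects designer/debug runs only)

' For command-line or scheduled runs on x64, invoke the 32-bit DTExec:
"C:\Program Files (x86)\Microsoft SQL Server\90\DTS\Binn\DTExec.exe" /File package.dtsx
```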
Finally, someone else is writing about this. His scripting approach to reading Excel using OpenXML looks promising, especially since I just spent so much of the last two weeks developing or modifying SSIS import flows for Excel data sources. It might even be worth my while to write an SSIS data source component to automate it even more and offer it free on CodePlex. It’s a thought, anyway.
- Save the individual worksheets as delimited text and read them as such, not as Excel (my old solution, as straightforward as any other idea I came across)
- Find a job where you don’t have to import data from government-produced Excel files with crazy mixes of data in the columns (my most recent solution).