Adam Machanic : Uniquely Identifying XML Nodes with DENSE_RANK
I came across this post today from 2009 that solves a problem I have been having with SQL Server XML shredding, and I thought I would pass it along in case anybody else has managed to miss it for as long as I have.
When working with XML in SQL Server, you might want to uniquely identify one node against another. But due to the flexibility with which XML can be defined, this is not always directly possible. SQL Server’s own XML structures are guilty of having this
Source: Adam Machanic : Uniquely Identifying XML Nodes with DENSE_RANK
The problem it solves is how to assign sequence numbers to XML nodes returned by the nodes() function. There isn’t an obvious way to do it, but there is a very simple way once you know the trick, involving nothing more than a DENSE_RANK() OVER (ORDER BY <node>), where <node> is the node name from the nodes() function.
I have been using things like “ROW NUMBER() OVER ( PARTITION BY <something> ORDER BY (SELECT NULL) )” to generate row numbers further downstream, but I’m never quite certain that the numbering will be correct. Using a row number derived from DENSE_RANK(), using this trick, in place of (SELECT NULL) has a much better feel to it, because the ordering is actually based on something — the node order. Very nice!
Here is an example from my Proximity database code, which loads peoples’ names (from a family tree) to a PersonName table from the XML generated by GedcomReader, keeping track of the order in which they occur in the XML, since the first instance is the “preferred” instance. Since names may be duplicated for a given person, I group out duplicates, retaining the earliest node number for each group.
The actual sequence numbers (AliasId) are generated in the INSERT, partitioned by person so that AliasId = 1 is always associated with the preferred name for a given person. When querying the data I can then select just the preferred names, or up to N aliases, or all aliases for each person, depending on how long I am willing to let the query run. (There’s actually a lot more to it, since I break the names into words, code them using SOUNDEX, and then look for similar names of people living in nearby places, at about the same point in history.)
WITH X_CTE AS ( SELECT PersonId = pid.AsInt , nn.GivenName , nn.Surname , nn.Suffix , NodeSeq = DENSE_RANK() OVER (ORDER BY Person) FROM #Gedcom CROSS APPLY ged.nodes('/GED/INDI/NAME') G(Person) CROSS APPLY utl.ExtractDigits(Person.value('../@ID[1]', 'nvarchar(50)')) pid CROSS APPLY prx.NormalizeName(Person.value('.[1]/text()[1]', 'varchar(200)')) nn ), Group_CTE as --Remove duplicates and empties ( SELECT PersonId , GivenName , Surname , Suffix , NodeSeq = MIN(NodeSeq) --Take the first node sequence number FROM X_CTE GROUP BY PersonId , GivenName , Surname , Suffix ) INSERT prx.PersonName ( ImportId , PersonId , AliasId , GivenName , Surname , Suffix ) SELECT ImportId = @ImportId , PersonId , AliasId = ROW_NUMBER() OVER (PARTITION BY PersonId ORDER BY NodeSeq ) , GivenName , Surname , Suffix FROM Group_CTE;