Last month I wrote a post titled Broken Left Join which covered an easy to make mistake that I see fairly regularly when reviewing code. This month I’m covering UPDATE statements that run into trouble because of JOINs.
Updates with joins are not covered by the ANSI standard. To see the problems that can occur with joins in update statements, let’s setup some data to run a test.
IF OBJECT_ID(N'dbo.Child', N'U') IS NOT NULL DROP TABLE dbo.Child; GO IF OBJECT_ID(N'dbo.Parent', N'U') IS NOT NULL DROP TABLE dbo.Parent; GO CREATE TABLE dbo.Parent ( ParentID INT NOT NULL IDENTITY(1,1), Name VARCHAR(100) NOT NULL, FavoriteChild VARCHAR(100) NULL ); GO ALTER TABLE dbo.Parent ADD CONSTRAINT PK_ParentID PRIMARY KEY (ParentID); GO CREATE TABLE dbo.Child ( ChildID INT NOT NULL IDENTITY(1,1), ParentID INT NOT NULL, Name VARCHAR(100), DateOfBirth DATE ); GO ALTER TABLE dbo.Child ADD CONSTRAINT PK_ChildID PRIMARY KEY (ChildID); GO ALTER TABLE dbo.Child ADD CONSTRAINT FK_Parent_ParentID FOREIGN KEY (ParentID) REFERENCES dbo.Parent(ParentID); GO CREATE NONCLUSTERED INDEX IX_ParentID ON dbo.Child(ParentID, Name); GO INSERT INTO dbo.Parent (Name) VALUES ('Homer'),('Anakin'),('Peter'); INSERT INTO dbo.Child (ParentID, Name, DateOfBirth) VALUES (1,'Bart', '19900121'),(1,'Lisa', '19920503'),(1,'Maggie', '19950717'), (2,'Luke', '19700101'),(2,'Leia', '19720518'), (3,'Meg', '20001023'),(3,'Chris', '19980912'),(3,'Stewie', '20061217'); GO
I’ve created two tables named Parent and Child with a foreign key in the Child table that references that ParentID in the Parent table. Selecting from these tables produces the data below.
Next I’ll execute an UPDATE to set each parents favorite child. I’ll do this by joining to the child table in the update statement.
When updating with a join it is only possible to update values in one of the tables. If you try to update more than one, you will see the following:
The statement below runs fine as it’s only updating values in one of the tables.
UPDATE p SET p.FavoriteChild = c.Name FROM Parent p JOIN Child c ON p.ParentID = c.ParentID;
Now which child do you think will end up being the favorite for each parent?
To give you a clue, I’ll show you the execution plan.
What’s that aggregate operator doing there? We’re not aggregating any values!
Let’s take a closer look at the Defined Values property in the properties panel for that stream aggregate.
That’s right; SQL Server will just pick any child. It will not update the parent row for each child. When the MERGE statement runs into this problem, it raises an error and rolls back the UPDATE but our query will silently pick any value and move on. We don’t want to update data like this, we want our query to dictate the logic explicitly. This will require some changes to our query.
I prefer to write these types of updates with the use of a CTE. This allows us to easily highlight the SELECT in the CTE and execute it to see which rows will be updated (@variables can cause problems here though). Adding a COUNT can help to identify the problem in the previous query.
Highlight and execute the code inside the CTE below to check for duplicates. If this returns 0 rows then you are good to remove the COUNT, GROUP BY and HAVING then add the name columns before executing the whole statement.
WITH cte_UpdateJoin AS ( SELECT p.ParentID, COUNT(*) AS UpdateCheck FROM Parent p JOIN Child c ON p.ParentID = c.ParentID GROUP BY p.ParentID HAVING COUNT(*) > 1 ) UPDATE cte_UpdateJoin SET FavoriteChild = Name;
If the code in the CTE returned rows (like ours did) then we have a problem.
If we have multiple child rows for a parent row, we need to filter all but one row or aggregate these child rows so that we only have one value to apply in the UPDATE. In this example, I’m going to make a sweeping generalisation that all parents prefer their first born. While this isn’t always true it does give us some logic that we can apply to our UPDATE statement.
WITH cte_UpdateJoin AS ( SELECT p.FavoriteChild, c.Name, DateOfBirth, MIN(DateOfBirth) OVER (PARTITION BY p.ParentID) AS MinDOB FROM Parent p JOIN Child c ON p.ParentID = c.ParentID ) UPDATE cte_UpdateJoin SET FavoriteChild = Name WHERE DateOfBirth = MinDOB;
The main advantage of using a CTE here is that we can use window functions in the CTE. These are not allowed in the SET clause. The window function in the CTE returns the earliest date of birth of all children for each parent. My WHERE clause then makes sure that I only use the name of the first born child.
This is a contrived example to illustrate the problem (twins would cause this to break) but I hope this highlights something you need to check when writing or reviewing UPDATE statements with JOINs.