It has taken a month to fuss over my new blog, but I finally made my first SQL entry. Since I’m excited about the upcoming SQL Pass conference, I thought I would show a fictitious problem about employees and their interests in SQL Conferences.
Problem: You are given two tables. The first table contains employees. The second table contains all the SQL Conferences each employee has been interested in along with the date they showed interest in the conference and whether or not they are still interested. You are asked to find the last SQL Conference that was added for each employee. Only conferences the employees are still interested in should be included, and only one conference per employee should be listed. The returned data should be ordered by the employee’s last name and first name.
The first solution that came to mind, was to use the MAX function on the InterestAddDate field to find the last added interest. There are two issues with this approach though.
1. In order to get the activity field returned, the Interest table has to be joined a second time on the MAX(InerestAddDate).
2. Multiple rows will be returned if the employee had an interest in two SQL conferences on the same date. While this could be a valid result set, in this case only one activity should be returned.
WITH CTE_InterestsByMax AS ( SELECT EmployeeID ,MAX(InterestAddDate) AS LastInterestAdDate FROM dbo.Interest AS i WHERE isActive = 1 GROUP BY EmployeeID ) SELECT e.FirstName + ‘ ‘ + e.LastName AS EmployeeName ,i.InterestAddDate ,i.Activity FROM CTE_InterestsByMax AS im JOIN dbo.Interest AS i ON im.LastInterestAdDate = i.InterestAddDate AND im.EmployeeID = i.EmployeeID JOIN dbo.Employee AS e ON im.EmployeeID = e.EmployeeID ORDER BY e.LastName ,e.FirstName |
Solution: To address these two issues, I used a Common Table Express (CTE) and the ROW_NUMBER function. This function will number each row with a unique sequential number based on the OVER clause. Inside the OVER clause, I will order the data by the InterestAddDate field in descending order. Since I want to find the last SQL Conference of interest for each employee, I’m going to add the PARTITION statement on the EmployeeID field to the OVER clause. This will cause the ROW_NUMBER function to start over for each EmployeeID. Since I’m not using an aggregate function, I can return all the data from the Interest table that I need.
In the next part of the query , I join the CTE to the Employee table and add a WHERE clause. Since I ordered each partition in descending order, I know that the first row of each partition will have a rowindex of 1. I can now filter my data by rowindex = 1.
WITH CTE_InterestsByRow_Number AS ( SELECT i.EmployeeID ,i.InterestAddDate ,i.Activity ,ROW_NUMBER() OVER (PARTITION BY i.EmployeeID ORDER BY i.InterestAddDate DESC) AS RowIndex FROM dbo.Interest AS i WHERE i.IsActive = 1 ) SELECT e.FirstName + ‘ ‘ + e.LastName AS EmployeeName ,i.InterestAddDate ,i.Activity FROM CTE_InterestsByRow_Number AS i JOIN dbo.Employee AS e ON i.EmployeeID = e.EmployeeID WHERE rowindex = 1 ORDER BY e.LastName ,e.FirstName |
When I looked at the logical reads for these two separate queries, the query using the MAX function had twice as many logical reads as the query with the ROW_NUMBER function. When I looked at the Execution Plan for both queries, I found the query using the MAX function had a higher Query Cost relative to the batch. My first run with the data, I used 20 Employees and 40 Interests. For the second run, I used 1000 employees and 4000 interests. I found that the Query Cost for the query using the MAX function increased with the larger datasets.