Tag Archive for PARTITION

Going Beyond The INSERT Statement

The Seventh Mission

In this installment of SQLCoOp, we are writing about CRUD. No, this is not about the stuff you scrape off the bottom of your shoes after hiking. This is Create, Read, Update, and Delete. I’ll be focusing on create, which is done with the INSERT Statement, but I’m going to go beyond the basic INSERT statement.

MCJJ_ALL

Multiple Rows

Let’s start with a new feature of the basic INSERTVALUES statement that we all know and love. In SQL Server 2012, Microsoft improved the INSERTVALUES statement, by allowing you to include multiple lists of values in the same query. This allows you to insert multiple rows of data with one INSERT statement. I find this syntax to to be a wonderful feature when I’m creating small sets of data using Excel.

Here is an example of what the syntax looks like. The values for each row are enclosed in parenthesis and separated by commas. In this example, I’m entering student scores for tests the students took at the end of September. After this one query runs, there will be 12 rows in the table.


CREATE TABLE StudentScore
(
     StudentScoresID int IDENTITY PRIMARY KEY
     ,StudentID int
     ,TestingRoomID int
     ,Score tinyint
     ,TestDateTime datetime
);

 

INSERT INTO StudentScore
     (StudentID,TestingRoomID, Score, TestDateTime)
VALUES
     (35, 1, 110, '2014-09-30 17:00:00.00')
     ,(36, 1, 87, '2014-09-30 17:00:00.00')
     ,(42, 1, 94, '2014-09-30 17:00:00.00')
     ,(10, 1, 99, '2014-09-29 12:00:00.00')
     ,(35, 10, 90, '2014-09-29 17:00:00.00')
     ,(36, 10, 100, '2014-09-29 17:00:00.00')
     ,(42, 10, 105, '2014-09-29 17:00:00.00')
     ,(10, 10, 99, '2014-09-29 17:00:00.00')
     ,(35, 5, 115, '2014-09-28 17:00:00.00')
     ,(36, 5, 70, '2014-09-28 17:00:00.00')
     ,(42, 5, 90, '2014-09-28 17:00:00.00')
     ,(10, 5, 67, '2014-09-28 17:00:00.00');
Common Table Expressions (CTEs)

I love CTEs, sometimes a little too much. The reason I think they are so great is two fold. First, they can make complicated queries easier to read and maintain. Second, they can perform complicated functionality such as recursion without breaking a query up into multiple statements. An added benefit, is they can be used with all CRUD statements, including the INSERT statement.

Let’s take a look at an example.

I have a group of students that were given the same test multiple times in different classrooms on different dates. I want to only insert the row representing  the highest score a student had. I also want to make sure that I keep the first time they received this score and any other data that came with that record.

Below I’ve highlighted the rows that I want to insert into the new table.

SQLCoOp7_Image1

The first step is building the SELECT statement for the CTE that numbers the scores for each student. This is done by using the ROW_NUMBER function in conjunction with the OVER clause. The OVER clause will have two parts. The first part will use the PARTION BY statement. This is used to restart the numbering for each partition. In this case we will partition the data by StudentID. The second part will use the ORDER BY statement. This tells us how to order the data so that it will consistently come out in the same row number. Since we want the highest score first, we will order the data by Score in a descending order. Then we will order the duplicates by TestDateTime so that the first date the score was achieved will appear first in the order.

SELECT
     ss.StudentScoresID
     ,ss.StudentID
     ,ss.TestingRoomID
     ,ss.Score
     ,ss.TestDateTime
     ,ROW_NUMBER() OVER (PARTITION BY ss.StudentID
               ORDER BY ss.Score DESC, ss.TestDateTime) AS RowID
FROM
     dbo.StudentScore AS ss

This SQL statement above will be used in our CTE. We’ll then use all the rows that have a RowID of 1 for the INSERT. The final INSERT statement will look like this.

CREATE TABLE dbo.StudentScoreHigh
(
     StudentScoresID int PRIMARY KEY
     ,StudentID int
     ,TestingRoomID int
     ,Score tinyint
     ,TestDateTime datetime
);

 

WITH cte_Scores
AS
(
     SELECT
          ss.StudentScoresID
          ,ss.StudentID
          ,ss.TestingRoomID
          ,ss.Score
          ,ss.TestDateTime
          ,ROW_NUMBER() OVER (PARTITION BY ss.StudentID
               ORDER BY ss.Score DESC, ss.TestDateTime) AS RowID
     FROM
          dbo.StudentScore AS ss
)
INSERT INTO dbo.StudentScoreHigh
(
     StudentScoresID
     ,StudentID
     ,TestingRoomID
     ,Score
     ,TestDateTime
)
SELECT
     StudentScoresID
     ,StudentID
     ,TestingRoomID
     ,Score
     ,TestDateTime
FROM
     cte_Scores
WHERE
     RowID = 1;
MERGE

A different way to approach this same problem, would be to use the MERGE statement. The MERGE statement, is like CRUD on steroids. It will look at all the data you specify that is coming in and compare to all the existing data. Then you can determine what will happen. Do you want to insert, update, delete, or ignore the data based on what does and does not match.

For this example, we’ll assume we are constantly updating the StudentScoreHigh table with the students highest score. We’ll compare what is already in the StudentScoreHigh table (the target) with what is in the StudentScore table (the Source) each night. If there is a record already in the table and the student achieved a higher score today, than the existing record will be updated. If the student does not exist in the table, then the student will be inserted into the table. Note: for this example, we’ll assume that the student could only take the test once in a day.

The MERGE statement will look like this.


MERGE dbo.StudentScoreHigh AS target
USING
(
     SELECT
          StudentScoresID
          ,StudentID
          ,TestingRoomID
          ,Score
          ,TestDateTime
     FROM
          dbo.StudentScore AS ss
     WHERE
          TestDateTime >= DATEADD(d, -1, GETDATE())
) AS Source ON Target.StudentID = Source.StudentID
WHEN MATCHED AND Source.Score > Target.Score THEN
UPDATE SET
     StudentScoresID = Source.StudentScoresID
     ,StudentID = Source.StudentID
     ,TestingRoomID = Source.TestingRoomID
     ,Score = Source.Score
     ,TestDateTime = Source.TestDateTime
WHEN NOT MATCHED THEN
INSERT
(
     StudentScoresID
     ,StudentID
     ,TestingRoomID
     ,Score
     ,TestDateTime
)
VALUES
(
     Source.StudentScoresID
     ,Source.StudentID
     ,Source.TestingRoomID
     ,Source.Score
     ,Source.TestDateTime
);
BONUS

If you have the latest Red Gate’s SQL Prompt installed (ver 6.4), then you will have an even easier time writing your favorite INSERT statement. They added a new feature that will highlight the field of the value you are modifying and vice versa.  Below you can see that my cursor is on line 61 and line 54 is highlighted. If I were to put my cursor on line 54, then line 61 would be highlighted.

 

SQLCoOp7_Image2

Don’t Stop Yet

Don’t forget to check out these blog posts by the rest of the SQL CoOp team on the subject of CRUD:

To follow our quest for SQL knowledge through this collaborative project, follow the #SQLCoOp tag on Twitter.

See you next time!!

On a SQL Collaboration Quest

Four SQL professionals gathered from the four corners of the world to share their SQL knowledge with each other and with their readers: Mickey Stuewe from California, USA, Chris Yates from Kentucky, USA, Julie Koesmarno from Canberra, Australia, and Jeffrey Verheul from Rotterdam, The Netherlands. They invite you to join them on their quest as they ask each other questions and seek out the answers in this collaborative blog series. Along the way, they will also include other SQL professionals to join in the collaboration.

Original Post: On a SQL Collaboration Quest

The ROW_NUMBER Function As An Alternate To The MAX Function

It has taken a month to fuss over my new blog, but I finally made my first SQL entry. Since I’m excited about the upcoming SQL Pass conference, I thought I would show a fictitious problem about employees and their interests in SQL Conferences.

Problem: You are given two tables. The first table contains employees. The second table contains all the SQL Conferences each employee has been interested in along with the date they showed interest in the conference and whether or not they are still interested. You are asked to find the last SQL Conference that was added for each employee. Only conferences the employees are still interested in should be included, and only one conference per employee should be listed. The returned data should be ordered by the employee’s last name and first name.

Employee and Interest Data Model

The first solution that came to mind, was to use the MAX function on the InterestAddDate field to find the last added interest. There are two issues with this approach though.

1. In order to get the activity field returned, the Interest table has to be joined a second time on the MAX(InerestAddDate).
2. Multiple rows will be returned if the employee had an interest in two SQL conferences on the same date. While this could be a valid result set, in this case only one activity should be returned.


WITH CTE_InterestsByMax
AS
(
      SELECT
         EmployeeID
         ,MAX(InterestAddDate) AS LastInterestAdDate
      FROM
            dbo.Interest AS i
      WHERE
            isActive = 1
      GROUP BY
            EmployeeID
)
SELECT
      e.FirstName + ‘ ‘ + e.LastName AS EmployeeName
      ,i.InterestAddDate
      ,i.Activity
FROM
      CTE_InterestsByMax AS im
      JOIN dbo.Interest AS i ON im.LastInterestAdDate = i.InterestAddDate
                                                AND im.EmployeeID = i.EmployeeID
      JOIN dbo.Employee AS e ON im.EmployeeID = e.EmployeeID
ORDER BY
      e.LastName
      ,e.FirstName

Solution: To address these two issues, I used a Common Table Express (CTE)  and the ROW_NUMBER function. This function will number each row with a unique sequential number based on the OVER clause. Inside the OVER clause, I will order the data by the InterestAddDate field in descending order. Since I want to find the last SQL Conference of interest for each employee, I’m going to add the PARTITION statement on the EmployeeID field to the OVER clause. This will cause the ROW_NUMBER function to start over for each EmployeeID. Since I’m not using an aggregate function, I can return all the data from the Interest table that I need.

In the next part of the query , I join the CTE to the Employee table and add a WHERE clause. Since I ordered each partition in descending order, I know that the first row of each partition will have a rowindex of 1. I can now filter my data by rowindex = 1.

WITH CTE_InterestsByRow_Number
AS
(
      SELECT
         i.EmployeeID
         ,i.InterestAddDate
         ,i.Activity
         ,ROW_NUMBER() OVER (PARTITION BY i.EmployeeID ORDER BY i.InterestAddDate DESC) AS RowIndex
      FROM
         dbo.Interest AS i
      WHERE
         i.IsActive = 1
 )
SELECT
      e.FirstName + ‘ ‘ + e.LastName AS EmployeeName
      ,i.InterestAddDate
      ,i.Activity
FROM
      CTE_InterestsByRow_Number AS i
      JOIN dbo.Employee AS e ON i.EmployeeID = e.EmployeeID
WHERE
      rowindex = 1
ORDER BY
      e.LastName
      ,e.FirstName

When I looked at the logical reads for these two separate queries, the query using the MAX function had twice as many logical reads as the query with the ROW_NUMBER function. When I looked at the Execution Plan for both queries, I found the query using the MAX function had a higher Query Cost relative to the batch. My first run with the data, I used 20 Employees and 40 Interests. For the second run, I used 1000 employees and 4000 interests. I found that the Query Cost for the query using the MAX function increased with the larger datasets.

Execution Plan

%d bloggers like this: