Archive for SQL Functions

T-SQL Tuesday #51 - Don’t Crap Out While Betting On Table Functions

My good friend Jason Brimhall (b|t) is hosting this month’s T-SQL Tuesday blog party. The party was started by Adam Machanic (b|t) in December of 2009. As a complement to the upcoming debut of the Las Vegas SQL Saturday, Jason has taken up a betting theme. He wants to know our stories of when we bet it all on a risky solution and won or lost.

Instead of telling you about the past, I want to help you win big at the table today. I really don’t want you to crap out while betting on the wrong table functions.

Snake Eyes

There are two types of table functions: Multi-line Table Functions and In-Line Table Functions. There is a huge difference between the two of them.

Multi-line table functions sound great. You write as much code as you need in them and they will return all the data in a table variable. This is where the weighted dice roll snake eyes every single time. You see, the statistics for a table variable always, always say there is only one row in the table being returned. It doesn’t matter if there are a hundred, a thousand, or a million rows. The statistics will say one. That means the optimizer has a good chance of losing when it picks the execution plan for that query.

Let’s Take a Look at the Bets

For my example, I have a simple query that returns 43 rows out of a Tally table. Notice that the optimizer estimates 43 rows will be returned, which is great, because that is exactly on the money!
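The original query appears only as a screenshot, so here is a minimal sketch of the idea, assuming a dbo.Tally table with a single N column (the table and variable names are my assumptions, not the original code):

DECLARE @Start INT = 1;
DECLARE @End INT = 43;

-- Base query: filters the Tally table directly, so the optimizer
-- can use the table's statistics to estimate the row count.
SELECT N
FROM dbo.Tally
WHERE N BETWEEN @Start AND @End;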

[Image: TSQLTuesday51_1 - execution plan for the base query, estimating 43 rows]

If we put that same query inside of a multi-line table function, we get an estimated number of rows of 1 (snake eyes!).
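For reference, a sketch of what such a multi-line (multi-statement) table function could look like, wrapping the same assumed Tally query (the function name and parameters are mine, not from the original screenshots):

CREATE FUNCTION dbo.fn_TallyRange_MultiLine (@Start INT, @End INT)
RETURNS @Result TABLE (N INT NOT NULL)
AS
BEGIN
    -- The rows are spooled into the @Result table variable, and table
    -- variable statistics report a fixed one-row estimate no matter
    -- how many rows are actually returned.
    INSERT INTO @Result (N)
    SELECT N
    FROM dbo.Tally
    WHERE N BETWEEN @Start AND @End;

    RETURN;
END;
GO

-- Calling it looks like a normal table reference:
SELECT N FROM dbo.fn_TallyRange_MultiLine(1, 43);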

 

[Image: TSQLTuesday51_2 - the same query wrapped in a multi-line table function]

[Image: TSQLTuesday51_3 - execution plan showing an estimated number of rows of 1]

Double Down

An in-line table function will return the same result set, but there are some limitations on its construction. The entire query within the in-line table function must be written as a single statement.

Note: You can get very creative with Common Table Expressions (CTE) if need be.

There are two benefits to using an in-line table function. One is that the Estimated Number of Rows will be accurate (or as accurate as the statistics on the table), and two, the “inside” of the in-line table function is not masked in the Execution Plan. It is plopped right into the middle of the calling query. (Yes, “plopped” is a technical term.)
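Here is the same assumed Tally query as an in-line table function sketch (again, the names are my assumptions):

CREATE FUNCTION dbo.fn_TallyRange_InLine (@Start INT, @End INT)
RETURNS TABLE
AS
RETURN
(
    -- A single SELECT statement: the optimizer expands this definition
    -- directly into the calling query's plan, so the real table
    -- statistics drive the row estimates.
    SELECT N
    FROM dbo.Tally
    WHERE N BETWEEN @Start AND @End
);
GO

SELECT N FROM dbo.fn_TallyRange_InLine(1, 43);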

 

[Image: TSQLTuesday51_4 - the same query wrapped in an in-line table function]

[Image: TSQLTuesday51_5 - execution plan with accurate row estimates]

Last Call

So remember to double down on in-line table functions, and don’t crap out on the snake eyes of the multi-line table function.

Thanks for all the fish

Thanks go out to Jason Brimhall for hosting this month’s T-SQL Tuesday blog party. Please visit his website at http://jasonbrimhall.info/, or better yet, come to Las Vegas for SQL Saturday and thank him in person.

Using Set Theory Instead of ISNULL To Filter Data

I have written dozens upon dozens of reports for my users. I love the challenge of finding the data and wrangling it into a report in a way that doesn’t take a hundred years to run when they decide to return a year’s worth of data. One of the common requests that I get is to provide a parameter that will allow them to choose a single client or all clients. At first blush this is a very simple request and can be accomplished by using the ISNULL function.

[Image: ISNULL - ISNULL Code]
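Because the code is shown only as an image, here is a minimal sketch of the pattern, assuming a dbo.Client table with ClientID and ClientName columns (both names are my assumptions):

DECLARE @ClientID INT = NULL;  -- NULL means "return all clients"

SELECT
    ClientID
    ,ClientName
FROM
    dbo.Client
WHERE
    -- Non-sargable pattern: the predicate depends on ISNULL(@ClientID, ClientID),
    -- so the optimizer cannot seek on ClientID and scans the index instead.
    ClientID = ISNULL(@ClientID, ClientID);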

Unfortunately, there are performance implications to using the ISNULL function. The ISNULL function is non-sargable. This means that when the function is used as part of a predicate, it can’t utilize the necessary index in an optimal way to filter the data. In this example, the Execution Plan shows that a full index scan was used to return one row. This equated to 80 Logical Reads for my dataset, and it is the same number of Logical Reads whether one row is returned or all rows are returned. Eighty Logical Reads may not be that big a deal, but what about on a table that has hundreds of thousands of rows?

[Image: ISNULL - ISNULL Execution Plan]
Execution Plan

[Image: ISNULL - ISNULL Reads 1]
Statistics on the Table I/O tab in SQL Sentry Plan Explorer Pro

 

An Alternate Universe

There is an alternate way of accomplishing this same request, and it uses Set Theory to solve the problem. I can accomplish the same request by separating the ISNULL logic into two different SQL statements and using UNION ALL to combine the two result sets into one. Even though I’m using two different SQL statements, only one will return data during the execution of the logic.

[Image: ISNULL - Union All Code]
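Here is a sketch of that UNION ALL rewrite against the same assumed dbo.Client table:

DECLARE @ClientID INT = NULL;

SELECT
    ClientID
    ,ClientName
FROM
    dbo.Client
WHERE
    ClientID = @ClientID   -- sargable: seeks when @ClientID has a value

UNION ALL

SELECT
    ClientID
    ,ClientName
FROM
    dbo.Client
WHERE
    @ClientID IS NULL;     -- returns the whole table only when @ClientID is NULL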

The first SELECT statement returns rows when the @ClientID variable has a value other than NULL. The second SELECT statement returns rows when @ClientID is NULL. By using UNION ALL instead of UNION, we forgo the work of checking the combined results for duplicate rows.

The Execution Plan now handles both cases, and the Logical Reads change based on the value of @ClientID. There are still 80 Logical Reads when @ClientID is NULL, because the whole table is being returned, but the Logical Reads drop to two when @ClientID is set to a single client number.

 

[Image: ISNULL - Union All Execution Plan 1]
Execution Plan when @ClientID = 7890

[Image: ISNULL - Union All Reads 1]
Statistics when @ClientID = 7890

[Image: ISNULL - Union All Execution Plan 2]
Execution Plan when @ClientID IS NULL

[Image: ISNULL - Union All Reads 2]
Statistics when @ClientID IS NULL

 

In this example I was able to use UNION ALL to filter on one or all clients, but it isn’t always the best solution. If the table being filtered fits on one page, then a table scan will occur regardless of how the predicate is set up. In that case, the ISNULL version is easier to maintain, so it would be the better solution.

Conclusion

It is always best to take a look at the Execution Plan of a SQL statement to see if there are any Index Scans on filtered data. If there are, you should look for another, set-based approach to solve the problem.

The ROW_NUMBER Function As An Alternate To The MAX Function

It has taken a month to fuss over my new blog, but I finally made my first SQL entry. Since I’m excited about the upcoming SQL Pass conference, I thought I would show a fictitious problem about employees and their interests in SQL Conferences.

Problem: You are given two tables. The first table contains employees. The second table contains all the SQL Conferences each employee has been interested in along with the date they showed interest in the conference and whether or not they are still interested. You are asked to find the last SQL Conference that was added for each employee. Only conferences the employees are still interested in should be included, and only one conference per employee should be listed. The returned data should be ordered by the employee’s last name and first name.

[Image: Employee and Interest data model]
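The data model appears only as an image; the queries below imply tables along these lines (the column types and the InterestID key are my assumptions):

CREATE TABLE dbo.Employee
(
    EmployeeID INT NOT NULL PRIMARY KEY
    ,FirstName VARCHAR(50) NOT NULL
    ,LastName  VARCHAR(50) NOT NULL
);

CREATE TABLE dbo.Interest
(
    InterestID       INT IDENTITY(1,1) NOT NULL PRIMARY KEY  -- hypothetical key
    ,EmployeeID      INT NOT NULL REFERENCES dbo.Employee (EmployeeID)
    ,Activity        VARCHAR(100) NOT NULL  -- the SQL Conference of interest
    ,InterestAddDate DATE NOT NULL          -- when the interest was added
    ,IsActive        BIT NOT NULL           -- whether they are still interested
);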

The first solution that came to mind was to use the MAX function on the InterestAddDate field to find the last added interest. There are two issues with this approach, though.

1. In order to get the Activity field returned, the Interest table has to be joined a second time on the MAX(InterestAddDate).
2. Multiple rows will be returned if the employee had an interest in two SQL conferences on the same date. While this could be a valid result set, in this case only one activity should be returned.


WITH CTE_InterestsByMax
AS
(
      SELECT
         EmployeeID
         ,MAX(InterestAddDate) AS LastInterestAddDate
      FROM
            dbo.Interest AS i
      WHERE
            IsActive = 1
      GROUP BY
            EmployeeID
)
SELECT
      e.FirstName + ' ' + e.LastName AS EmployeeName
      ,i.InterestAddDate
      ,i.Activity
FROM
      CTE_InterestsByMax AS im
      JOIN dbo.Interest AS i ON im.LastInterestAddDate = i.InterestAddDate
                                AND im.EmployeeID = i.EmployeeID
      JOIN dbo.Employee AS e ON im.EmployeeID = e.EmployeeID
ORDER BY
      e.LastName
      ,e.FirstName

Solution: To address these two issues, I used a Common Table Expression (CTE) and the ROW_NUMBER function. This function will number each row with a unique sequential number based on the OVER clause. Inside the OVER clause, I will order the data by the InterestAddDate field in descending order. Since I want to find the last SQL Conference of interest for each employee, I’m going to add PARTITION BY EmployeeID to the OVER clause. This will cause the ROW_NUMBER function to start over for each EmployeeID. Since I’m not using an aggregate function, I can return all the data from the Interest table that I need.

In the next part of the query, I join the CTE to the Employee table and add a WHERE clause. Since I ordered each partition in descending order, I know that the first row of each partition will have a RowIndex of 1. I can now filter my data by RowIndex = 1.

WITH CTE_InterestsByRow_Number
AS
(
      SELECT
         i.EmployeeID
         ,i.InterestAddDate
         ,i.Activity
         ,ROW_NUMBER() OVER (PARTITION BY i.EmployeeID ORDER BY i.InterestAddDate DESC) AS RowIndex
      FROM
         dbo.Interest AS i
      WHERE
         i.IsActive = 1
)
SELECT
      e.FirstName + ' ' + e.LastName AS EmployeeName
      ,i.InterestAddDate
      ,i.Activity
FROM
      CTE_InterestsByRow_Number AS i
      JOIN dbo.Employee AS e ON i.EmployeeID = e.EmployeeID
WHERE
      i.RowIndex = 1
ORDER BY
      e.LastName
      ,e.FirstName

When I looked at the logical reads for these two queries, the query using the MAX function had twice as many logical reads as the query using the ROW_NUMBER function. When I looked at the Execution Plan for both queries, I found the query using the MAX function had a higher Query Cost relative to the batch. For my first run with the data, I used 20 employees and 40 interests. For the second run, I used 1000 employees and 4000 interests. I found that the Query Cost for the query using the MAX function increased with the larger dataset.

[Image: Execution Plan]
