Sunday, November 11, 2007

Taking a look at CROSS APPLY

Taking a look at CROSS APPLY



think the easiest way to think of CROSS APPLY is that it is like doing a CROSS JOIN with a correlated sub-query instead of a derived table. Let's see if I can explain that .... A derived table is "self-contained", in that all tables and columns in the parent SELECT are not accessible (though variables and parameters can be referenced). For example, consider:


select A.*, b.X
from A
cross join (select B.X from B where B.Val=A.Val) b


That is not legal because A.Val is out of scope within the derived table; this is because the derived table is evaluated independently of the other tables in the SELECT. To limit the rows in table B so that B.Val = A.Val, we must do that outside of the derived table via a join or in the criteria:


select A.*, b.X
from A
cross join (select * from B) b
where A.Val = b.Val


(Of course, the above is equivalent to doing an INNER JOIN to the derived table, or just joining to the table B.)

Also, keep in mind that the scope-of-derived-tables rule isn't just for CROSS JOINS, it's for all JOINS -- CROSS, INNER, OUTER and even UNION; they all use "self-contained" derived tables.

This is in contrast to a correlated sub-query, where the parent SELECT is in scope for the sub-query; the sub-query is evaluated for each row in the query, so the other tables and columns in the SELECT are all available:


select A.*, (select B.X from B where B.Val=A.Val) as X
from A


(Note: I am ignoring for now the fact that returning multiple rows in a sub-query will return an error.)

This is an easy way to think of the difference between CROSS JOIN and CROSS APPLY. CROSS JOIN, as we saw, joins to a derived table; however, CROSS APPLY, despite looking like a JOIN, actually is applying a correlated sub-query. This imposes both the advantages of a correlated sub-query but also the performance implications.

So, we can simply rewrite our first example using CROSS APPLY like this:


select A.*, b.X
from A
cross apply (select B.X from B where B.Val=A.Val) b


Since we are performing an APPLY and not a JOIN, A.Val is in scope and it works just fine.

Table Valued User Defined Functions

Note that the same rules apply when using Table-Valued User-Defined Functions:


select A.*, B.X
from A
cross join dbo.UDF(A.Val) B


is not legal; once again, A.Val is not in scope for the user-defined function. The best we can do before SQL 2005 was to use a correlated sub-query:


select A.*, (select X from dbo.UDF(A.Val)) X
from A


However, that is not logically equivalent; the UDF cannot return more than 1 row or it will result in an error, and wouldn't make logical sense anyway if it did.

Starting with SQL 2005, we can now use CROSS APPLY and it will work fine:


select A.*, b.X
from A
cross apply dbo.UDF(A.Val) b


So, that is one way to think of the difference between a JOIN and an APPLY; a JOIN combines two separate result sets, but APPLY is more of a loop that evaluates one result set over and over for each row in another. This means that, in general, APPLY will be less efficient than a JOIN, just as, in general, correlated sub-queries are less efficient than derived tables. (The optimizer, however, is generally quite good at optimized correlated sub-queries when possible.)

So, why use CROSS APPLY instead of a correlated sub-query? What's the advantage? Actually, quite a lot -- it is much more powerful!

CROSS APPLY can return multiple rows

Unlike correlate sub-queries, CROSS APPLY works with multiple rows. This allows us to do things like "joining" a table to a function that parses a CSV column in that table into multiple rows:


select A.ID, b.Val
from A
cross apply dbo.ParseCSV(A.CSV) b


When the ParseCSV() function returns multiple rows, it simply acts as if we have joined Table A to the function's return table, duplicating the rows in Table A for each row in the joined table. This is not possible with a correlated sub-query, and will result in an error. This is a quick and easy way to parse a table of data into multiple rows in an efficient "set-based" manner when the algorithm requires a complex User-Defined Function. (The CSV is not a great example, but other parsing routines are not as easily accomplished via a JOIN to a numbers table.)

CROSS APPLY can return multiple columns

Again, in a correlated sub-query, we can only return a single value. If we write a SQL statement that returns a running sum, we can use a correlated sub-query like this:


select o.*,
(select sum(Amount) from Order o
where p.OrderDate <= o.OrderDate) as RunningSum
from Order o


However, what if we'd like to return an additional running sum of Orders based on some other criteria (e.g., for orders with the same "OrderCode")? We'd need another correlated sub-query, greatly reducing the efficiency of our SELECT:


select o.*,
(select sum(Amount) from Order o
where p.OrderDate <= o.OrderDate) as RunningSum,
(select sum(Amount) from Order o
where p.OrderCode = o.OrderCode and p.OrderDate <= o.OrderDate) as SameCode
from Order o


However, we can easily re-write that using a single CROSS APPLY:


select o.*, rs.RunningSum, rs.SameCode
from Order o
cross apply
(
select
sum(Amount) as RunningSum,
sum(case when p.OrderCode = o.OrderCode then Amount else 0 end) as SameCode
from Order P
where P.OrderDate <= O.OrderDate
) rs


So, we get the benefit of return multiple columns like a derived table, and we also get the ability to reference outer values in our SELECT to use in criteria and CASE statements. Very, very powerful.

CROSS APPLY also allows us to quickly get columns from the "previous" row in a table quite easily:


select o.*, prev.*
from Order o
cross apply
(
select top 1 *
from Order P where P.OrderDate < O.OrderDate
order by OrderDate DESC
) prev


We can express the running total and counts as CROSS JOINS or INNER JOINS by reworking a correlated sub-query, but using TOP and referencing the outer SELECT is not easily done in that manner.

Note that the above CROSS APPLY will not return any orders without a previous order; we would use OUTER APPLY to ensure that all orders are returned even if no previous orders exist:


select o.*, prev.*
from Order o
outer apply
(
select top 1 *
from Order P where P.OrderDate < O.OrderDate
order by OrderDate DESC
) prev


We can also CROSS APPLY to a Table-Valued User Defined Function that returns exactly one row, but with multiple columns, to return separate pieces of data from a single function call. For example, we can parse an email address into separate username and domain columns. A few months back I wrote a SQL Team article that discusses that concept.

Summary

CROSS and OUTER APPLY are very powerful and can be very useful, but we must be careful to use them only when necessary; I am still testing the possibilities, but in general an APPLY will tend to be not as efficient as a JOIN. In addition, APPLY is mostly demonstrated by applying table-valued user-defined functions, but it can be used with in-line SELECT statements as well.