Set-based alternative to loop in SQL Server

Question

I know that there are several posts about how BAD it is to loop in a SQL Server stored procedure, but I haven't quite found what I am trying to do. We are using data connectivity that links directly into Excel.

I have seen some posts where a few people have said they could convert most loops to a standard query. But for the life of me I am having trouble with this one.

I need all custIDs that have an order immediately preceded by an event of type 38 or 40, but only if there is no other order between the event and the order from the first query.

So there are 3 parts. I first query for all orders (orders table) based on a time frame into a temporary table.

Select odate, custId into temp1 from orders where odate>'5/1/12'

Then I could use the temp table to inner join on the secondary table to get a customer event (LogEvent table) that may have occurred some time in the past prior to the current order.

Select eventdate, temp1.custID into temp2 from LogEvent inner join temp1 on 
temp1.custID=LogEvent.custID where EventType in (38,40) and temp1.odate>eventdate
order by eventdate desc

The problem here is that this query returns every matching event row for each customer from the first query, whereas I only want the latest event for each customer. This is where, on the client side, I would loop to get only one event instead of all the old ones. But since the whole query has to run inside Excel, I can't really loop client side.

The third step could then use the results from the second query to check whether the event occurred between the most current order and any previous order. I only want the data where the event precedes the order and no other orders are in between.

Select ordernum, shopcart.custID from shopcart right outer join temp2 on 
shopcart.custID=temp2.custID where shopcart.odate >= temp2.eventdate and
ordernum is null

Is there a way to simplify this and make it set-based so that it runs in SQL Server, instead of some kind of loop that I would perform at the client?

I am using 2005 and 2008. We are starting to migrate over to 2008 but haven't finished, so I need to solve this for 2005 as well.
– CaptainBli, Commented May 22, 2012 at 18:34
Is that May 1st or January 5th? Please use safe, unambiguous formats for date literals, e.g. '20120501'... SQL Server will never misinterpret that, nor will your users, co-workers or readers here.
– Aaron Bertrand, Commented May 22, 2012 at 18:36
Whether it is January or May really doesn't matter. The date is not relevant to the query as it will be dynamically inserted. But thank you for the note about being concise.
– CaptainBli, Commented May 22, 2012 at 18:39
It's not clear what you want. Is this guess right? Given a date @D, return [custID] for every customer with 1) an order [ordernum] on [odate] > @D, 2) a most recent event 38 or 40 on [eventdate] < [odate], and 3) no order before [odate] and after [eventdate]. I don't see how Gordon's query below fulfills requirement #3. Among things that are not clear from your description: A) are the date columns pure dates, and if a customer placed two orders on the same date do you want any results for events 38 or 40 before that date? B) What if [eventdate] precedes an order placed before @D?
– Steve Kass, Commented May 22, 2012 at 19:18
The actual return value would simply be the custIDs that have an order after @D with an event of type 38 or 40 right before it. So I exclude the additional orders that may follow the same event.
– CaptainBli, Commented May 22, 2012 at 19:32

Gordon Linoff · Accepted Answer · 2012-05-22 18:46:30Z

2

This is a great example of switching to set-based notation.

First, I combined all three of your queries into a single query. In general, having a single query lets the query optimizer do what it does best -- determine execution paths. It also prevents accidental serialization of queries on a multithreaded/multiprocessor machine.

The key is row_number() for ordering the events so the most recent has a value of 1. You'll see this in the final WHERE clause.

select ordernum, shopcart.custID
from (Select eventdate, temp1.custID,
             row_number() over (partition by temp1.CustID order by EventDate desc) as seqnum
      from LogEvent inner join
           (Select odate, custId
            from order
            where odate>'5/1/12'
           ) temp1 
           on temp1.custID=LogEvent.custID
      where EventType in (38,40) and temp1.odate>eventdate -- no ORDER BY here: SQL Server rejects it in a derived table without TOP
     ) temp2 left outer join
     ShopCart
     on shopcart.custID=temp2.custID
 where seqnum = 1 and shopcart.odate >= temp2.eventdate and ordernum is null

I kept your naming conventions, even though I think "from order" should generate a syntax error. Even if it doesn't, it is bad practice to name tables and columns with reserved SQL words.
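Steve Kass's comment on the question asks how the "no other order in between" requirement is enforced. One way to express it is a nested NOT EXISTS, sketched below. This is not the query from the answer above but an illustration under stated assumptions: that `orders` holds every order with `custID` and `odate` columns, that `@D` (a hypothetical parameter) stands in for the dynamically inserted cutoff date, and that orders and events with identical timestamps do not need special handling.

```sql
-- Hypothetical parameter standing in for the dynamically inserted cutoff date.
-- DECLARE and SET are kept separate so the script also runs on SQL Server 2005.
DECLARE @D datetime;
SET @D = '20120501';  -- unambiguous yyyymmdd literal

SELECT DISTINCT o.custID
FROM orders AS o
WHERE o.odate > @D
  AND EXISTS (
        SELECT 1
        FROM LogEvent AS e
        WHERE e.custID = o.custID
          AND e.EventType IN (38, 40)
          AND e.eventdate < o.odate
          -- Assumption: "right before" means no other order by the same
          -- customer falls strictly between the event and this order.
          AND NOT EXISTS (
                SELECT 1
                FROM orders AS o2
                WHERE o2.custID = o.custID
                  AND o2.odate > e.eventdate
                  AND o2.odate < o.odate
              )
      );
```

Because any order that falls between the most recent qualifying event and the order also falls after every older event, testing for the existence of one such event is equivalent to testing only the latest one, so no row numbering is needed in this form. The strict inequalities leave the same-date case raised in the comments undecided.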

answered May 22, 2012 at 18:46


4 Comments

CaptainBli Over a year ago

Yeah sorry it should be orders not order.

CaptainBli Over a year ago

The row_number() will iterate through all of the Events for the given customers. Is that faster than looping and requesting a single Event row for each customer? There could be tens of thousands of Event rows for the group of customers queried. I am just trying to understand this better.

Gordon Linoff Over a year ago

Yes! "Looping" inside the database is insanely faster than looping using a cursor. Cursors have a lot of overhead, bringing each value back and forth from the database. Cursors run serially, databases in parallel, and so on.

Justin Pihony Over a year ago

Just as a side note, there can be a difference between a cursor and looping. You can set up a special WHILE loop, which cuts down on much of the cursor overhead. However, using built-in functions will still win out.

Justin Pihony · Accepted Answer · 2012-05-22 19:06:09Z

0

If you are using SQL Server 2005 or later, you can use the ROW_NUMBER function, as in the example below.

;WITH myCTE AS
( 
SELECT
    eventdate, temp1.custID, 
    ROW_NUMBER() OVER (PARTITION BY temp1.custID ORDER BY eventdate desc) AS CustomerRanking 
FROM LogEvent 
JOIN temp1 
    ON temp1.custID=LogEvent.custID 
WHERE EventType IN (38,40) AND temp1.odate>eventdate
)
SELECT * into temp2 from myCTE WHERE CustomerRanking = 1;

This gets you the most recent event for each customer without a loop.

Also, you could use RANK; however, that will create duplicates for ties, whereas ROW_NUMBER guarantees no duplicate numbers within a partition.
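The difference on ties can be seen with a few made-up rows. The sketch below uses a hypothetical table variable `@events` (not part of the original schema) in which customer 1 has two events with the same eventdate; it sticks to syntax that SQL Server 2005 accepts.

```sql
-- Made-up sample rows: customer 1 has two events sharing the latest eventdate.
DECLARE @events TABLE (custID int, eventdate datetime, EventType int);

INSERT INTO @events (custID, eventdate, EventType)
SELECT 1, '20120401', 38 UNION ALL
SELECT 1, '20120401', 40 UNION ALL
SELECT 1, '20120315', 38 UNION ALL
SELECT 2, '20120410', 40;

-- Number the events per customer, newest first, with both functions.
SELECT custID, eventdate, EventType,
       ROW_NUMBER() OVER (PARTITION BY custID ORDER BY eventdate DESC) AS rn,
       RANK()       OVER (PARTITION BY custID ORDER BY eventdate DESC) AS rnk
FROM @events;
```

For customer 1, both tied rows get rnk = 1, so filtering on rnk = 1 returns two rows, while rn = 1 returns exactly one. Which of the tied rows receives rn = 1 is arbitrary unless a tiebreaker column is added to the ORDER BY.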

edited May 22, 2012 at 19:06

answered May 22, 2012 at 18:35


4 Comments

CaptainBli Over a year ago

I like this idea, but it will have to iterate through the entire Event table for all of the customers in my initial query. This has the potential to be tens of thousands of records, whereas a loop could give me a single record for each order.

Justin Pihony Over a year ago

@CaptainBli You need to think in set-based logic, not procedural logic. This will narrow your query down in sets. I have modified my answer to filter on CustomerRanking = 1 in the first query using a CTE. The optimizer should take care of the rest. Looping when unnecessary is always a bad idea in SQL, as you already mentioned :)

Steve Kass Over a year ago

@CaptainBli: You won't know if this query is inefficient until you look at its query plan. With supporting indexes, the query might use Segment and Top operators to avoid processing every row of LogEvent, because of the clause CustomerRanking = 1. (Note from my comment to your question that regardless of this, I think the query is wrong. My comment here is only in response to your remark that it "will have to iterate through the entire Event table.")

Justin Pihony Over a year ago

Yes, I meant to mention that in my answer, but the entire query can be tightened up for sure. As Steve Kass mentioned already, declare your code and check optimization afterwards. Don't overthink the optimizer; that is how ORMs work, and most of the time you don't need to tweak anything :)
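The "supporting indexes" mentioned in the comments above could look something like the following. This is only a sketch: the index names are hypothetical, the column types and existing indexes on these tables are unknown, and whether the optimizer actually uses them has to be confirmed in the query plan.

```sql
-- Hypothetical index: lets the latest event per customer be found without
-- scanning every LogEvent row; EventType is included to cover the filter.
CREATE NONCLUSTERED INDEX IX_LogEvent_custID_eventdate
    ON LogEvent (custID, eventdate DESC)
    INCLUDE (EventType);

-- Hypothetical index: supports per-customer lookups of orders by date.
CREATE NONCLUSTERED INDEX IX_orders_custID_odate
    ON orders (custID, odate);
```

Both statements use syntax available in SQL Server 2005 and 2008.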
