1

I am trying to optimize the following T-SQL query:

SELECT Person.*
FROM Person
WHERE ZipCode LIKE '123%'
AND City = 'Washington'
AND NumberOfHomes in (1, 2, 3)
AND
(
    EXISTS
    (
        SELECT * FROM House
        WHERE Person.ID = House.PersonID
        AND House.Type = 'TOWNHOUSE'
        AND House.Size = 'Medium'
    )
    OR
    EXISTS
    (
        SELECT * FROM Color
        WHERE Person.ID = Color.PersonID
        AND Color.Foreground IN ('Green', 'Blue', 'Purple')
    )
)

I'd greatly appreciate any response in optimizing the query.

In particular, is there a way to convert the query into a more efficient query using only a single SELECT statement without any of the inner SELECT statements?

Thanks!

2
  • Can't say much without the actual execution plan for your query. One minor tip: for EXISTS you don't need to return all rows or columns; just return TOP 1 1 from your query, e.g. EXISTS (SELECT TOP 1 1 FROM House...). Commented Sep 10, 2014 at 15:00
  • 8
    @user2321864 It doesn't matter what you put there. SQL Server doesn't care, it knows it is just looking for 1 row and then it can short circuit, and it knows it doesn't return any data. Want proof it doesn't matter? Replace * with 1/0. Commented Sep 10, 2014 at 15:18
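
The 1/0 claim in the comment above can be checked with a small sketch. It assumes nothing beyond a SQL Server connection; sys.objects is used only as a convenient catalog view that always contains rows.

```sql
-- The select list of an EXISTS subquery is not evaluated,
-- so the division by zero below should never actually run.
IF EXISTS (SELECT 1/0 FROM sys.objects)
    PRINT 'EXISTS returned true without a divide-by-zero error';
```

If the select list were evaluated, this batch would fail instead of printing the message.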

4 Answers

4

This is the query:

SELECT p.* 
FROM Person p
WHERE p.ZipCode LIKE '123%'  AND p.City = 'Washington' AND p.NumberOfHomes in (1, 2, 3) AND
      (EXISTS (SELECT *
               FROM House h
               WHERE p.ID = h.PersonID AND h.Type = 'TOWNHOUSE' AND h.Size = 'Medium'
             ) OR 
       EXISTS (SELECT *
               FROM Color c
               WHERE p.ID = c.PersonID AND c.Foreground IN ('Green', 'Blue', 'Purple')
              )
      );

Without rewriting the query, you can optimize this with indexes. I would recommend:

Person(City, ZipCode, NumberOfHomes, Id);
House(PersonId, Type, Size);
Color(PersonID, Foreground)
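
As a rough sketch, the three indexes above could be created as follows. The index names and the dbo schema are assumptions made for illustration; only the table and column names come from the question.

```sql
-- Index names and the dbo schema are illustrative assumptions.
-- Equality column (City) first, then the range/IN columns.
CREATE NONCLUSTERED INDEX IX_Person_City_ZipCode_NumberOfHomes
    ON dbo.Person (City, ZipCode, NumberOfHomes, ID);

-- Supports the correlated lookup by PersonID plus the Type/Size filter.
CREATE NONCLUSTERED INDEX IX_House_PersonID_Type_Size
    ON dbo.House (PersonID, Type, Size);

-- Supports the correlated lookup by PersonID plus the Foreground filter.
CREATE NONCLUSTERED INDEX IX_Color_PersonID_Foreground
    ON dbo.Color (PersonID, Foreground);
```

Because the query selects p.*, the Person index is unlikely to cover it, so the plan may still need lookups into the base table for the remaining columns. If ID is already the clustered key, listing it in the index is harmless but redundant.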

Question, though. Are you sure that the ids in the House and Color tables really match back to Person.Id? Normally, they would have a column called something like PersonId.


3 Comments

I'm curious, perhaps you know the answer. Does EXISTS (SELECT * and EXISTS (SELECT 1 perform any different?
@wdosanjos . . . No, the compiler changes both of these to the same code. Usually, I would use select 1, but I left the select * because that is how the OP phrased it.
@wdosanjos NO. Please see my comment above.
0

Please try this:

SELECT p.*
FROM Person p
WHERE Substring(Ltrim(Rtrim(p.ZipCode)),1,3) = '123' AND p.City = 'Washington' AND
(p.NumberOfHomes=1 or p.NumberOfHomes=2 or p.NumberOfHomes=3)
AND
(
EXISTS
(
    SELECT 1 FROM House h
    WHERE p.ID = h.PersonID
    AND h.Type = 'TOWNHOUSE'
    AND h.Size = 'Medium'
)
OR
EXISTS
(
    SELECT 1 FROM Color c
    WHERE p.ID = c.PersonID
    AND (c.Foreground ='Green' or c.Foreground='Blue' or  c.Foreground='Purple')
)
);

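Alternatively, you can use joins instead of EXISTS. Note that this version can return the same Person row more than once if several House or Color rows match: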

SELECT 
    p.*
FROM Person p
Left join House h
    On (p.Id=h.PersonID)
Left join Color c
    On (p.id=c.PersonID)
WHERE Substring(Ltrim(Rtrim(p.ZipCode)),1,3) = '123' AND p.City = 'Washington' AND
(p.NumberOfHomes=1 or p.NumberOfHomes=2 or p.NumberOfHomes=3) AND
-- House match OR Color match, as in the original EXISTS ... OR EXISTS
((h.Type = 'TOWNHOUSE' AND h.Size = 'Medium') OR
 (c.Foreground = 'Green' or c.Foreground = 'Blue' or c.Foreground = 'Purple'));

4 Comments

Why the -1? This is abuse.
Hi, the queries are not any particularly better than any of the other queries. They all show similar timing information with the following settings: SET STATISTICS TIME ON & SET STATISTICS IO ON. Is there a way to get more accurate timing information to compare with other queries?
Same remark as for Sam Yi: converting WHERE EXISTS() to LEFT OUTER JOINs will POTENTIALLY return the same record doubled, tripled, etc... if there are multiple matches with the Color or House tables data. I'd also be surprised that SubString(1,3) will be faster than LIKE 'xyz%'. Rolling out the IN (,,) into OR's is done by the query optimizer anyway, personally I prefer the readability of the IN (,,) construction.
You are right about changing EXISTS() to LEFT JOIN. I am sure SUBSTRING(p.ZipCode,1,3) or LEFT(p.ZipCode,3) will work better than LIKE '123%'; also, OR works better than IN.
0

A left join followed by a null check can be quicker than existence checks, depending on your data. Also, if NumberOfHomes is an integer, BETWEEN 1 AND 3 is equivalent to IN (1, 2, 3).

SELECT p.*
FROM Person p
LEFT JOIN House h
    ON p.ID = h.PersonID
    AND h.Type = 'TOWNHOUSE'
    AND h.Size = 'Medium'
LEFT JOIN Color c
    ON p.ID = c.PersonID
    AND c.Foreground IN ('Green', 'Blue', 'Purple')
WHERE p.ZipCode LIKE '123%'
  AND p.City = 'Washington'
  AND p.NumberOfHomes BETWEEN 1 AND 3
  AND (h.PersonID is not null or c.PersonID is not null)

Or you can try something like this:

select t.* 
from (
    select personid from house
    where type = 'townhouse' and size = 'medium'
    union
    select personid from color
    where foreground in ('green','blue','purple')
) pid
cross apply (
    select *
    from person p
    where p.id = pid.personid
      and p.zipcode like '123%'
      and p.city = 'washington'
      and p.numberofhomes between 1 and 3
    ) t
where t.id is not null

It's really difficult to optimize these blind. Depending on the distribution of your data, the above query may give you better results.
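
One caveat is worth checking before adopting the join version: unlike EXISTS, a LEFT JOIN produces one output row per matching child row. The sketch below uses table variables with made-up sample data (the Name column and the values are hypothetical) to show the difference.

```sql
-- Hypothetical sample data: one person with two matching Color rows.
DECLARE @Person TABLE (ID int PRIMARY KEY, Name varchar(50));
DECLARE @Color  TABLE (PersonID int, Foreground varchar(20));

INSERT INTO @Person (ID, Name) VALUES (1, 'Alice');
INSERT INTO @Color (PersonID, Foreground) VALUES (1, 'Green'), (1, 'Blue');

-- LEFT JOIN version: returns the person twice, once per matching Color row.
SELECT p.*
FROM @Person p
LEFT JOIN @Color c
    ON p.ID = c.PersonID
    AND c.Foreground IN ('Green', 'Blue', 'Purple')
WHERE c.PersonID IS NOT NULL;

-- EXISTS version: returns the person once, however many rows match.
SELECT p.*
FROM @Person p
WHERE EXISTS (SELECT 1
              FROM @Color c
              WHERE c.PersonID = p.ID
                AND c.Foreground IN ('Green', 'Blue', 'Purple'));
```

Adding DISTINCT to the join version removes the repeats, at the cost of an extra sort or hash step, so it is worth comparing plans before assuming the join form is faster.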

6 Comments

This is not an equivalent query. And I see no explanation about why it should be more efficient.
Sorry... I got pulled away from the desk.
Hi, the queries are not any particularly better than any of the other queries. They all show similar timing information with the following settings: SET STATISTICS TIME ON & SET STATISTICS IO ON. Is there a way to get more accurate timing information to compare with other queries?
Those queries will POTENTIALLY return the same record from p.* doubled, tripled, etc... if there are multiple matches with the Color or House tables data. Also, I'm not sure why EXISTS() has such a bad reputation; in my experience it performs just as well as, and in some cases better than, a LEFT OUTER JOIN approach.
@Peter if you prefer a GUI to compare the behavior and performance of queries on MSSQL I personally like SQL Sentry Plan Explorer a lot. Basically it's just the same information you'd get from the query-plan and the profiler in an (IMHO) easier to grasp presentation. (PS: I'm in no way affiliated with SqlSentry =)
-1

Optimizing a query and reducing it to a single SELECT statement are often separate concerns, because the query optimizer (SQL Server) will usually take your SQL statement and run it in whatever way it considers most efficient.

That said, yes, there are several ways to combine your statements into one SQL statement. Here is an example. It keeps the Person table as the base and gets matches from the House or Color tables that meet your criteria.

SELECT DISTINCT Person.*   -- DISTINCT: the joins can repeat a person once per matching row
FROM Person
Left Outer Join House ON Person.ID = House.PersonID
Left Outer Join Color ON Person.ID = Color.PersonID
WHERE ZipCode LIKE '123%'
    AND City = 'Washington'
    AND Person.NumberofHomes in (1, 2, 3)
    -- Parenthesize the OR so the Person filters apply to both branches
    AND (
        (
            House.Type = 'TOWNHOUSE'
            AND House.Size = 'Medium'
        )
        OR Color.Foreground IN ('Green', 'Blue', 'Purple')
    )

I would recommend that you reconsider your model. For example, having PersonID in Color is very suspect, as is having NumberOfHomes (which could possibly be calculated, for example, from a count on the House table for that person's ID). There are some other questionable normalization choices as well. This is not part of your question, but I thought you might want to consider it.

6 Comments

The data model is correct but the names are made up, so it may appear a bit strange. Also, the queries are not any particularly better than any of the other queries. They all show similar timing information with the following settings: SET STATISTICS TIME ON & SET STATISTICS IO ON. Is there a way to get more accurate timing information to compare with other queries?
Regarding the data model: I understand. Not knowing the context or the user's experience with the problem, I wanted to make the point that good modeling is key.
As I mentioned, optimization and having a single query statement are not always the same thing, since the query optimizer will more often than not take queries and execute them as it sees fit (based on an execution plan).
IO and time are certainly important in optimization; however, most DBAs would look at the execution plan as the best indicator of optimization. Your best bet for comparing queries is to run them through the estimated query plan and look for items such as table scans (usually considered bad and to be avoided). This will not only let you compare different SQL scripts and see whether they are optimized, but also let you check that your SQL scripts are using indexes appropriately.
Incidentally, I ran your original query and the query above against unindexed tables, and the query optimizer executed both the same way (all table scans, of course). I then put in place the indexes I thought appropriate for these tables and ended up with 3 index seeks. That's a nice improvement IMO.
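
For the repeated question about getting more useful numbers than timings alone, a comparison session along the lines described in these comments might look like the sketch below. It assumes a test server and the table names from the question; the query shown is the original one, and each candidate rewrite would be run the same way.

```sql
-- Assumes a test server; the tables are the ones from the question.
SET STATISTICS IO ON;    -- logical/physical reads per table
SET STATISTICS TIME ON;  -- CPU and elapsed time
SET STATISTICS XML ON;   -- returns the actual execution plan with the results

SELECT p.*
FROM Person p
WHERE p.ZipCode LIKE '123%'
  AND p.City = 'Washington'
  AND p.NumberOfHomes IN (1, 2, 3)
  AND (EXISTS (SELECT 1
               FROM House h
               WHERE h.PersonID = p.ID
                 AND h.Type = 'TOWNHOUSE'
                 AND h.Size = 'Medium')
       OR EXISTS (SELECT 1
                  FROM Color c
                  WHERE c.PersonID = p.ID
                    AND c.Foreground IN ('Green', 'Blue', 'Purple')));

SET STATISTICS XML OFF;
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

Logical reads tend to be more repeatable between runs than elapsed time, which varies with caching and server load, so they are usually the better number to compare alongside the plan shape (seeks versus scans).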
