UNION ALL Versus Sub-Queried JOIN In MS SQL 2008

A recent to shift from MS SQL 2000 to MS SQL 2008 was required, and assuming everything would go swimmingly, with greater speed increases, and improved development with the integration of Intellisense to the SQL management tools I thought all would be well.  That is until users starting hitting a page which ran a stored procedure.  Upon further investigation it was found that the queries within the stored procedure were causing the server to max out for at least five minutes at a time, hindering from all usage of any web pages.  Odd, seeing as these stored procedures were run on a daily basis on the old server, and a little fore-sight was taken into account for the compile-time of each stored procedure as they were migrated over to the new server.  Unfortunately this wasn’t the case.  Whatever happened in the development of SQL in this eight-year period changed the way queries work.  Consider an example where a join encompasses a nested select, cumbersome sure, but capable, even with a table composed of over half a million entries.  Unfortunately this will take a long, long time.  Once run once the query will run just time again, perhaps because the data is placed higher in the SQL stack.

Moving forwards, this was tested over and over again only to find the same results.  Using CTE’s, views, or temporary tables had no effect on the query either.  Colleagues then attempted alternative approaches to the same query where eventually we reached the use of union.  To our surprise the query worked perfectly, and executed in quite literally 5 seconds, compared to the 5 minutes in previous attempts.

SELECT
    Column1,
    Column2
FROM
    MyTable
WHERE
    Column2 LIKE '%exp%' OR
    Column1 IN (SELECT Column1 FROM MyOtherTable)

Compared with:

SELECT
    Column1,
    Column2
FROM
    MyTable
WHERE
    Column2 LIKE '%exp%'
UNION ALL
SELECT
    Column1,
    Column2
FROM
    MyTable
WHERE
    Column1 IN (SELECT Column1 FROM MyOtherTable)

The latter example will run much, much faster.  Logically they are the same, and produce the same results, but in fact the difference in performance is substantial.  The only problem is that the better performing query is a much disliked one also.

About Kieran

A twenty-four year old software developer from Guernsey, Channel Islands.
This entry was posted in Pages. Bookmark the permalink.

5 Responses to UNION ALL Versus Sub-Queried JOIN In MS SQL 2008

  1. Sinbad says:

    I still have to register to post a comment ;)

    What did the execution plan say? That should show you why the performance is so different.

    My guess is a screw-up in the 2008 query optimiser so that it doesn’t realise it can parallelise and cache the results of the inner query separately. The inner query is essentially entirely separate from outer one, and is uncorrelated (ie nothing in the inner query references the outer query) so not only can the query be executed entirely separately, it’s results can be cached and reused separately too unless MyOtherTable changes. The query optimiser seems to be broken if it doesn’t realise that, and only picks it up when you make it explicit (which a UNION does, since the SQL statements are explicitly independent).

    Don’t ya just love being on the cutting edge? ;)

  2. Kieran says:

    I thought I’d changed the options so you don’t have to register in order to comment. I’ll have a look in a moment.

    The execution plan said that the nested select would be faster than using UNION ALL. Still with loads of different attempts the nested SELECT kept going slow, specifically with the OR statement. We tested the query by taking the OR statement out with both WHERE clauses independantly and the query ran fine. It took us quite a while to track down, but this had definitely never been an issue in SQL 2000. We’re now having to change some more frequently used queries to UNION ALL to stop this temporarily until we figure out a more robust solution.

  3. Sinbad says:

    Do you know if your database stats / analysis are up to date? What could be happening is that if the DB stats aren’t right and the query optimiser doesn’t know the size / statistic distribution of data in the tables, it’s making the wrong decision about performance (for example, on a small table, reading all the rows and filtering in memory is faster than doing an indexed read). In Oracle stats collection was something you had to consciously think about & organise, but it’s supposed to be mostly automatic in SQL Server. Nevertheless, it still has to be done so it might be worth checking the stats collection is functioning correctly.

  4. Damien Guard says:

    This does sound odd and remember that the results from the second query may not be the same if you have a row that matches both selections you will now get it twice – DISTINCT is your friend for that :)

    [)amien

  5. Kieran says:

    The rows themselves are distinct in terms of their clauses, you can only get one or the other in this case. Although you’re right, this should be enforced anyway if we’re only looking for distinct data, but distinct itself has a performance hit, especially when we’re doing queries against hundreds of thousands of rows.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>