A recent to shift from MS SQL 2000 to MS SQL 2008 was required, and assuming everything would go swimmingly, with greater speed increases, and improved development with the integration of Intellisense to the SQL management tools I thought all would be well. That is until users starting hitting a page which ran a stored procedure. Upon further investigation it was found that the queries within the stored procedure were causing the server to max out for at least five minutes at a time, hindering from all usage of any web pages. Odd, seeing as these stored procedures were run on a daily basis on the old server, and a little fore-sight was taken into account for the compile-time of each stored procedure as they were migrated over to the new server. Unfortunately this wasn’t the case. Whatever happened in the development of SQL in this eight-year period changed the way queries work. Consider an example where a join encompasses a nested select, cumbersome sure, but capable, even with a table composed of over half a million entries. Unfortunately this will take a long, long time. Once run once the query will run just time again, perhaps because the data is placed higher in the SQL stack.
Moving forwards, this was tested over and over again only to find the same results. Using CTE’s, views, or temporary tables had no effect on the query either. Colleagues then attempted alternative approaches to the same query where eventually we reached the use of union. To our surprise the query worked perfectly, and executed in quite literally 5 seconds, compared to the 5 minutes in previous attempts.
SELECT
Column1,
Column2
FROM
MyTable
WHERE
Column2 LIKE '%exp%' OR
Column1 IN (SELECT Column1 FROM MyOtherTable)
Compared with:
SELECT
Column1,
Column2
FROM
MyTable
WHERE
Column2 LIKE '%exp%'
UNION ALL
SELECT
Column1,
Column2
FROM
MyTable
WHERE
Column1 IN (SELECT Column1 FROM MyOtherTable)
The latter example will run much, much faster. Logically they are the same, and produce the same results, but in fact the difference in performance is substantial. The only problem is that the better performing query is a much disliked one also.
I still have to register to post a comment
What did the execution plan say? That should show you why the performance is so different.
My guess is a screw-up in the 2008 query optimiser so that it doesn’t realise it can parallelise and cache the results of the inner query separately. The inner query is essentially entirely separate from outer one, and is uncorrelated (ie nothing in the inner query references the outer query) so not only can the query be executed entirely separately, it’s results can be cached and reused separately too unless MyOtherTable changes. The query optimiser seems to be broken if it doesn’t realise that, and only picks it up when you make it explicit (which a UNION does, since the SQL statements are explicitly independent).
Don’t ya just love being on the cutting edge?
I thought I’d changed the options so you don’t have to register in order to comment. I’ll have a look in a moment.
The execution plan said that the nested select would be faster than using UNION ALL. Still with loads of different attempts the nested SELECT kept going slow, specifically with the OR statement. We tested the query by taking the OR statement out with both WHERE clauses independantly and the query ran fine. It took us quite a while to track down, but this had definitely never been an issue in SQL 2000. We’re now having to change some more frequently used queries to UNION ALL to stop this temporarily until we figure out a more robust solution.
Do you know if your database stats / analysis are up to date? What could be happening is that if the DB stats aren’t right and the query optimiser doesn’t know the size / statistic distribution of data in the tables, it’s making the wrong decision about performance (for example, on a small table, reading all the rows and filtering in memory is faster than doing an indexed read). In Oracle stats collection was something you had to consciously think about & organise, but it’s supposed to be mostly automatic in SQL Server. Nevertheless, it still has to be done so it might be worth checking the stats collection is functioning correctly.
This does sound odd and remember that the results from the second query may not be the same if you have a row that matches both selections you will now get it twice – DISTINCT is your friend for that
[)amien
The rows themselves are distinct in terms of their clauses, you can only get one or the other in this case. Although you’re right, this should be enforced anyway if we’re only looking for distinct data, but distinct itself has a performance hit, especially when we’re doing queries against hundreds of thousands of rows.