There are many reasons you might need a number sequence in SQL. I’m going to show you how to efficiently generate such a sequence in a set-based manner, all in memory, and without using any loops. In future articles I’ll get into some uses for such a sequence, but for now, let’s open up SSMS and just roll with it.
A Loopy Solution
If you are a typical imperative programmer, you’re probably used to thinking of number sequences in terms of loops. Say that we wanted to generate a sequence of 100,000 numbers. A naive first attempt in T-SQL using a loop might look like:
If you run Script 1 (try using the F5 key) you’ll wait several seconds to get a results set with 100,000 rows. On my machine, generating the results set took around 9 seconds. Not horrible. But using a different approach, we can do much better than that.
Introducing Common Table Expressions
Now let’s try generating a number sequence using a different approach. I believe this idea was first developed by Itzik Ben-Gan – at least he commonly gets credit when I’ve seen this mentioned elsewhere. What we’re going to do is cross join some CTEs (Common Table Expressions) to quickly and efficiently create a combinatorial explosion of numbers.
This begins simply enough – we’ll start this out by using a CTE to create a set with only one column and two members.
And here’s our results set.
How is this useful? Stick with me. We’re about to find out.
It turns out that we can use multiple CTEs in one query, so let’s add a few.
Perhaps you’ve noticed that adding more CTEs is a matter of comma-separating them, and that one CTE can refer to any previously-defined CTE. When you run Script 3, you’ll very quickly see the answer – 65,536. That’s a substantial number, but let’s see what happens when we take this just one step further.
A Combinatorial Explosion
It turns out that, taking this just a tiny bit further, this is about to start growing really, really quickly. Let’s give it a go.
And BOOM! Combinatorial explosion? Well, yes. And, no. More like a minor implosion.
What you’ll see after waiting about 100 seconds for this script to finish is the following:
So what happened? Well, we had the combinatorial explosion we were hoping for. We generated 4 billion rows right there in t5. Unfortunately for us, the SQL COUNT function returns an integer, while we’ve suddenly generated a BIGINT’s worth of rows.
The documentation for the COUNT() function says “For return values greater than 2^31-1, COUNT produces an error.” We’ve just generated 2^32 worth of rows. (We’ve also confirmed that not all of Microsoft’s documentation is wrong, contrary to rumors.) Are we out of luck here? Not if we have SQL Server 2008 or newer.
You Can Count On Me
SELECT COUNT_BIG('x') AS [Count] FROM t5
That’s better. Let’s hit it again (F5, that is.) Once again, we wait around 100 seconds and find that we’ve now generated 4,294,967,296 rows. That ought to be enough for most purposes. Wait a minute. You remember earlier when I said this was going to be fast? 100 seconds?
Query Optimizer To The Rescue
Our first approach at solving this problem resulted in it taking around 9 seconds to generate a sequence of 100,000 numbers. Lately we’ve been talking about generating 4 billion numbers, but we’ve also been talking about something that’s taking 100 seconds to finish. I don’t know about you, but as far as I’m concerned, ain’t nobody got time for that. Fortunately the SQL Query Optimizer is about to step in to give us a hand. Let’s change line 38 again:
SELECT TOP 100000 * FROM t5
Running this updated script, everything finishes in around 1 second (in fact, as we’ll soon see, much more quickly than that.) The query optimizer was able to see that we were never going to use the entire 4 billion rows, and somehow, it magically short-cuts the query. This saved us a lot of time. But now if you look at the results set, you might notice we just have a big bunch of ones. In fact, we now have 100,000 rows repeating the number 1:
Not much of a sequence, is it? Seems kind of worthless. So let’s fix that.
Number Sequence In SQL
Here’s the finished code, as far as I’m going to take it in this post:
Let’s not talk about any of these changes yet. Just hit F5. Do it. In about a second you’ll have a results set. The first row starts with 1, and the second row is a 2. Grab the thumb and drag it to the bottom of the Results window. On row 100000 you’ll see the number 100,000. There’s your number sequence in SQL. You’re welcome.
You Better Explain Yourself
OK, you probably noticed a few changes, along with a couple of new lines at the top and the bottom of Script 5. One important change is on lines 37 and 38, where I am now cross-joining t4 to t3, instead of to t4. The effect of this change is to only generate around 16.8 million rows, instead of 4 billion. This change is optional, since we’ve already shown that the query optimizer is going to limit the number of rows from this query to the number we actually need. I’ve decided that 16.8 million is a big enough sequence for my purposes. Feel free to cross join t4 to t4 as we did earlier, if you think you’ll ever need a sequence that large.
Now for the new stuff. On line 1, the statement SET STATISTICS IO ON causes SQL Server to display “information regarding the amount of disk activity generated by Transact-SQL statements.” In other words, this allows us to see how much pressure our query places on disk resources on the SQL Server.
On Line 2, SET STATISTICS TIME ON displays “the number of milliseconds required to parse, compile, and execute each statement.” So this is going to tell us something about how our query uses resources such as the CPU on the SQL Server. We’ll get back to the purpose behind all this in a moment.
If you’re really paying close attention, you may have noticed there is now a semicolon at the beginning of line 4. This is because CTEs really want to be the first thing in a SQL batch – the parser in SSMS will complain otherwise. The semicolon is not really starting the CTE – it is ending anything that happened prior to declaring the CTE. Such as the SET STATISTICS statements.
The big change in this script is the new CTE that starts on line 40, which I called Numbers. Remember that big bunch of ones we had before? Now we’re getting to the point of this exercise. On line 42 we use all those rows of ones to generate row numbers, and on line 46, those row numbers become our sequence. The ROW_NUMBER function requires an ORDER BY clause. Since the only thing we have to sort is a bunch of ones, we don’t actually care about what order they are in when the row numbers are generated; all we really want are the row numbers. As Ben-Gan points out in his original article, the ORDER BY (SELECT NULL) phrase tells the optimizer that it doesn’t really need to perform a sort before generating the row numbers. Another optimization.
Is It Fast, Mister?
At the top of this article I promised that we could do better than to use a loop to generate a number sequence in SQL. More recently I mentioned that the latest script finished in around a second, compared to the 9 second performance that we got from our original loop-based script. But let’s see how we did, more specifically. Remember those SET STATISTICS statements on lines 1 and 2? Lets look on the Messages tab and see how we did.
If you study that picture, the first thing that will probably grab your eye is “100000 rows affected” – of course this is what we’d expect. Then notice the execution time. We only used 15 milliseconds of CPU time to generate 100,000 rows! The elapsed time includes the time it took to stream the output to the client, and that’s around half a second. That beats the pants off our original loopy script!
What you probably didn’t notice is what you don’t see. The SET STATISTICS IO ON statement did not generate any output whatsoever. Normally this statement will cause messages that look similar to “Scan count 0, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.” Since we don’t have any messages that look like this, it means our query didn’t even touch the disk subsystem. Fantastic.
What if we decided to go for a million rows? On my machine, that took 157 milliseconds of CPU, and an elapsed time of 9,569 milliseconds. Nice and linear. The query finished on the client right around the 9 second mark. Coincidentally, this is almost exactly how long it took originally to generate 100 thousand rows using loops, so this method performs about 10 times better. Nice!
Wrapping It Up
Now what did we learn? We learned that we may get better performance if we avoid looping constructs in T-SQL. We learned how to use CTEs – in fact we learned how to use multiple CTEs in one query. And we learned how to use a couple of the SET STATISTICS statements and to read and interpret the output from those.
We also learned that we can generate large sequences of numbers rather quickly using SQL Server. Perhaps this doesn’t seem very useful to you. Standing on its own, perhaps it is not all that useful. But in my next post I’ll show you how you can put such a sequence to work – such as to generate a range of dates on the fly.