David M Williams
Wednesday, 18 February 2009 16:55
IT Industry -
Market
Page 2 of 3
All of us who have sat through database theory know that the SQL standard doesn’t impose any ordering on the information contained within a relational database. You use the ORDER BY clause in a SELECT statement to get the order you want.
As with most things, what happens in practice is somewhat different. Databases do sort data within a table in a set, predictable, pattern. If this wasn’t the case you’d get a random arrangement each time you executed SELECT * FROM
TABLE.
Depending on the database system used, this order may follow the primary key or it may be something quite different – specifically, the order of the clustered index.
Unfortunately, in 90% (ok, some number) of cases the two are one and the same. The primary key order is the clustered index order.
So just what is a clustered index?
Simply put, one – and only one – of the indexes on a table can be clustered. This means the data is actually physically stored in order according to the clustered index.
If your primary key is a series of numbers that start at one and increment by one – and the clustered index is on the primary key – then the rows in the database will be stored on disk in this same order.
If you delete a row in the middle, and then later manually insert a new row but specify the primary key is the value you previously deleted then this new row will be physically stored on disk in the place where the old row was.
Here’s where the problem comes in. Most database system will automatically index the primary key. And, being the first index created, they’ll also make it the clustered index. Obviously only one index can be clustered (after all, the database can’t be stored on disk in two different orders unless you double up on the data!) so unless the database designer thinks to change this that’s the way it will always stay.
Is the primary key necessarily the best choice for clustered index? My made up figure of 90% may not be the right number but I’ve seen more databases that have left things as they are than databases that don’t.
Think about it: imagine you have a huge database with customer orders. This spans millions of rows. The primary key is an order number. Another field is client number.
Now, when you want to list orders what’s the most common filtering you’d apply? Chances are you don’t care about listing orders in order of their number without reference to anything else. The greater likelihood is your queries will predominantly list orders by customer number.
Every request for orders by customer is suffering a performance hit. If the clustered index was the customer number – which isn’t the primary key – they’d run faster.
I can illustrate with a program I’m looking at right now. This is a market leading payroll program. It’s used by 30% of companies within one market alone. I don’t want to name them or embarrass them. Mind you, I selected their products over others so I do endorse it – but it could perform better!
Over the page I’ll show you, with evidence from SQL Server.