There are two aspects of how MySQL/InnoDB use a table’s primary key on the file system that have important ramifications for the key’s design:
- The primary key determines the order in which the data is physically stored in the main data file, aka “the clustered index”. Another way of saying this: the main data file is a B-tree index that directly contains all of a table’s columns, and the key on this B-tree is the primary key. This can result in one less I/O operation (compared to MyISAM) on many queries.
- The primary key is what is used to associate all of a table’s indexes with the main data file. So the primary key is replicated in every row of every index.
How to design your InnoDB primary keys:
- Do explicitly define a primary key, and make it numeric. If no primary key is specified, or a non-numeric column is used as a primary key, InnoDB will used its own internal auto-incremented row ID to link between the main data table and the other indexes. This row ID is 6 bytes. The biggest numeric type in MySQL, BIGINT, is 8 bytes, and the next smallest type, INT, is 4 bytes. So not only is this “hidden” row ID an entire extra redundant column, it is also bigger than the data type that would be used in the vast majority of cases where a primary key is explicitly specified. Furthermore, when InnoDB falls back to using this internal row ID, the rows in the main data file are ordered by insertion order instead of your primary key, rendering InnoDB’s clustering behavior useless, and adding an entire extra I/O operation on some queries. This is a big loss.
- As alluded to above, the primary key should be sequential. Because the structure of the data on the disk is dictated by the primary key, it should be sequential, to avoid extremely expensive I/O on writes (when the B-Tree has to be reorganized), and to take advantage of the native clustering.
- The primary key should be as small as possible. Because the primary key is replicated in every entry of every index, designing the primary key to be as small as possible impacts the efficiency of all the indexes.
Using these 3 simple guidelines on a well-designed schema, InnoDB can perform at its full potential.
Above I mentioned that the primary key is stored in every row of every index. This includes the entire contents of a multi-column primary key. In a future article I will discuss scenarios/designs that can take advantage of this behavior.
Great post – you have just saved me from using a 32 character GUID as the primary key.
However is this still relevent for the current GA versions of MySQL, or have things moved on in the last few years?
Thanks, Richard
I would also be interested in your response to RIchard’s question above – basically “Is this still the case?”
I would imagine that innodb still has the same design in these regards. It would be a pretty big deal if that had changed.
However– I’m not 100% sure.
I was trying to find evidence for what you said in point 1 above about having a non-numeric PK, but http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html seems to disagree – it says it only uses the automatic field if there is no PK and no suitable unique indexes, which suggests that a long non-numeric PK (e.g. BINARY(16)) or even a multi-column PK (e.g. 3 INTs for 18 bytes) would work without it clustering on the automatic field. I couldn’t find evidence for your point in the manual – where did you find it out? Perhaps the source code? (I realise the documentation often skips over the deeper mechanics, and documents how it should work/should be used rather than the exact method that it does work internally.) It seems to be mentioned on a lot of blogs, but I cannot find it anywhere “official”.
I understand the issues with using a long non-numeric PK, such as larger indexes, more disk reads, potential page splits if it’s non-monotonically-increasing, etc. My interest is whether it will cluster on a large non-numeric PK or not, and whether this long PK is used in the secondary indexes or whether the automatic row id is, instead (as this would mean that queries run on just the indexed column and the PK would use the index and then look up the physical data as well, which is obviously slower than just using the index and being done with it).
Sorry for the long comment!
Long intelligent comments are encouraged– keep em coming!
A non-sequential primary key (which i imagine will be the case for all non-numeric keys, unless you have a hex key or something similar), will result in the main data store having to be shuffled around for _every_ insert, because the main data store is sorted by order of its primary key.
You can make a BINARY(16) that would be (mostly) increasing by making it time based. e.g. get the current 8 byte high-resolution unix timestamp (this is at least millisecond accurate) and convert it to binary, and then tack on another 8 bytes of random binary data. This will be increasing except sometimes when two inserts take place during the same millisecond. It’s not sequential, but it is monotonically increasing (for the vast majority of inserts), and I think that’s what’s important? (Also, those inserts which are not increasing will be extremely recent, so will still be in memory and thus the shuffling should be very minor.)
Of course, I assume that MySQL can order by binary columns and does so sensibly 11 > 10 > 01 > 00.
But what is your motivation behind such a scheme? Why not just use an auto-incrementing integer?
Theres many Pros:
- Simple sharding – client generates a UUID and then sends request to the correct DB, future requests are sent to the relevant database based on predefined rules
- AUTO_INCREMENT causes locks on the tables whilst it generates the next entry, this is avoided by having the client generate the ID (not a problem normally but at extremely high load it can be)
- Reverse engineering of data: you can no longer tell how many (e.g.) users the system has by looking at the latest user ID. This isn’t just obfuscated – the information simply not there.
- When sharding, no need to keep track of what IDs have been used before. No single point of failure.
- IDs are unique across tables and not just within them. (Not sure this is particularly helpful, but it doesn’t hurt!)
- Security: e.g. user IDs are unpredictable – it would take forever to guess another user id even if one is already known (even if the format is understood the likelyhood of generating an identical ID is negilgable).
- Less human readable
- Other reasons I’ve forgotten
Cons:
)
- Storage space increases
- Index sizes increase
- Marginally slower DB performance
- Less human readable (both a pro and a con
- Other reasons I’ve forgotten
The “UUID” style format I’ll be using does very precise time information first (100nanosecond resolution) and then random data afterwards, so it will be (mostly) incremental and is unlikely to ever have any conflicts.
Okay, cool, sounds like you’ve got quite the specialized problem space, and you are on top of it
And if the column is numeric, and almost always sequential, then you are set.
Actually it might be worth it to see if you can somehow guarantee that it is sequential, because a re-org for a big system like it seems you are dealing with, could be quite expensive, even if it only happens once a day.
I guess you can achieve this with a timestamp column in the table, representing the timestamp prefix of the primary key. Or by making a ‘timestamp server” — a meta table running on a single box, keeping track of all the IDs across all shards.
Anyway, I’m sure you can think of your own ways to do this– but at least look into (or experiment with) how expensive a non-sequential insert would be.
Hmm actually, now that I think of it, even if you did have a non-sequential insert, it would only be out-of-order with the most recent 1 row (if they were submitted at the same millisecond) — and I’m guessing that re-org would be quite cheap.
But maybe not? Like i said, it’s worth doing a quick experiment (and reporting your findings here
)
John