As is the case for many successful web applications, our databases have grown considerably over the years. While there was a time that a single moderately-sized PostgreSQL server could happily store all of our transactional data, those days are long gone. Which is, of course, not a bad thing. As they say, scaling to meet the demands of your success is a rather nice problem to have. But it does tend to turn up a few thorny technical challenges. One such problem that we encountered was finding a way to create unique identifiers in a sharded environment.
Our general approach to scalability is a fairly standard one in the SaaS world — partition services where appropriate, and then scale out rather than up. As applied to databases, that generally means spreading large datasets over a number of discrete shards. And so it happened that our fondly-remembered single database instance became four logical shards, spread out over eight physical servers (configured as four primary/standby pairs).
We have quite a few places in our application where we use database sequences for identifiers, and in most of these cases it’s important that the IDs remain unique across all shards. We routinely migrate data from one shard to another in order to balance load, something which would become a lot more complicated if it was possible for IDs from different shards to conflict with each other. So we needed a way to generate IDs that came, at least conceptually, from a single global pool of numbers.
We ruled out a couple of our first ideas before settling on one that we liked. The first rejected solution was to put the sequences in a separate service, and to have each shard request IDs from the service on demand. While simple, this method had two significant drawbacks: it would hurt performance, as we would be introducing a network round-trip for every ID generated; and it could affect reliability, since the ID service would be a single point of failure for those areas of the application.
The second idea was to use a different data type for our IDs that would more easily allow them to be made unique. For example, we could prefix the IDs with an identifier for the shard on which they were created — s1_2 for the first shard, s2_2 for the second shard, and so on. Or we could use something like a UUID. Unfortunately that would require some significant changes to our table definitions, and we have a lot of tables. Additionally, many of our IDs are published in the form of URLs, and UUIDs in particular do not lend themselves well to short (or attractive) URLs.
This led us to an idea that we liked much better. If we allocate discrete ranges of numbers to each shard, the shards can autonomously generate sequential numbers as they need them, so long as they stay inside the limits of those ranges. Since we can’t always predict how many IDs each shard might need, we’d need to be able to add additional ranges over time, but by picking large enough sizes for each range we could ensure that this would only happen occasionally. We can maintain a list of shard/sequence/range mappings in a table, and then periodically check each shard to make sure we always stay a few blocks ahead.
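To make the rest of the post concrete, here's a minimal sketch of what that mapping table might look like. The column names are the ones the functions below reference; everything else here (the schema setup, the shard names, the block sizes) is illustrative rather than our actual configuration:

create schema repl;

-- one row per block: which shard owns which slice of the number line
create table repl.sequence_block_assignment (
    seq_name text   not null,  -- name of the sequence
    host     text   not null,  -- shard the block is assigned to
    seq_min  bigint not null,  -- first value in the block
    seq_max  bigint not null,  -- last value in the block
    primary key (seq_name, seq_min)
);

-- each shard also knows its own name, so it can find its own blocks
create table repl.settings (
    host_name text not null
);

insert into repl.sequence_block_assignment values
    ('fancy_table_fancy_id_seq', 'shard1',       1, 1000000),
    ('fancy_table_fancy_id_seq', 'shard2', 1000001, 2000000),
    ('fancy_table_fancy_id_seq', 'shard1', 2000001, 3000000);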
(We currently check the status of each sequence block daily, but we’ve sized the blocks such that each one typically lasts on the order of months, and always keep at least one unused block preallocated for each sequence ahead of time. This provides a long window of time in which everything continues to work if there is a problem with the sequence allocation scripts).
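The allocation script itself isn't shown in this post, but the gist of the daily check is simple: count the blocks a sequence hasn't started using yet, and when the count falls below the threshold, append another one. A sketch, assuming the illustrative table above and a block size of one million:

-- count the blocks this sequence has not yet started using
select count(*)
  from repl.sequence_block_assignment
 where seq_name = 'fancy_table_fancy_id_seq'
   and seq_min > (select last_value from fancy_table_fancy_id_seq);

-- if too few unused blocks remain, append another one
insert into repl.sequence_block_assignment (seq_name, host, seq_min, seq_max)
select 'fancy_table_fancy_id_seq', 'shard1', max(seq_max) + 1, max(seq_max) + 1000000
  from repl.sequence_block_assignment
 where seq_name = 'fancy_table_fancy_id_seq';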
The only challenge, then, was to figure out how to introduce such a scheme onto an already busy database, without disruption or downtime. And this is where the trickery comes in. Or, if you ask our DBA, the evil.
Looking at the table definitions in our database, most tables that use a sequence for row identifiers look something like this:
(localhost/~) # \d fancy_table
                                  Table "public.fancy_table"
         Column         |  Type   |                          Modifiers
------------------------+---------+-------------------------------------------------------------
 fancy_id               | integer | not null default nextval('fancy_table_fancy_id_seq'::text)
 username               | text    | not null
 favorite_toast_topping | text    | not null
Indexes:
    "fancy_table_pkey" PRIMARY KEY, btree (fancy_id)
Note the cast in the default value for fancy_id: the sequence name is passed to nextval with ::text, which casts the argument to be a text value. However, PostgreSQL actually defines the nextval function in terms of the regclass type:

(localhost/~) # \df nextval
                               List of functions
   Schema   |  Name   | Result data type | Argument data types |  Type
------------+---------+------------------+---------------------+--------
 pg_catalog | nextval | bigint           | regclass            | normal
As PostgreSQL supports function overloading, this gives us a way to sneak some additional steps into the sequence generation process. If we create a function nextval(seq_name text), it will be invoked in place of the system-supplied one whenever nextval is called with a text argument. (You may now be able to see why our DBAs thought this was evil.) In most cases, we'll want our special nextval to simply delegate to the built-in implementation. However, when a sequence reaches the end of its allocated range, we can adjust the sequence on the fly using the next available range.
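In other words, which implementation runs is decided purely by the argument's type, which is exactly why the ::text cast in the column default above matters. A quick illustration, using the example sequence from earlier:

-- matches our overload, because the argument is text
select nextval('fancy_table_fancy_id_seq'::text);

-- bypasses it, going straight to the built-in regclass version
select nextval('fancy_table_fancy_id_seq'::regclass);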
Here’s what our nextval function looks like:

create function nextval(p_seq_name text) returns bigint as $$
begin
    return nextval(p_seq_name::regclass);
exception
    when undefined_table then
        -- first use on this shard: create the sequence from its
        -- first allocated block
        execute(repl.get_sequence_block(p_seq_name, false));
        return nextval(p_seq_name::regclass);
    when object_not_in_prerequisite_state then
        -- the sequence has exhausted its current block: move it on
        -- to the next one
        execute(repl.get_sequence_block(p_seq_name, true));
        return nextval(p_seq_name::regclass);
end;
$$ language 'plpgsql';
The object_not_in_prerequisite_state exception is what we’ll get when a sequence has reached the end of its allowed range. We also catch undefined_table so that we can create a new sequence on its first use. This makes it easier to introduce new shards, as we can create them without having to figure out the sequence ranges for them ahead of time.
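Putting it together, here's how a rollover might look on shard 1, assuming the illustrative block assignments from earlier (the returned values are invented for the example):

-- shard 1 is at the end of its first block (1..1000000)
select nextval('fancy_table_fancy_id_seq'::text);  -- returns 1000000

-- the built-in nextval now raises object_not_in_prerequisite_state;
-- our wrapper catches it, jumps to shard 1's next block, and retries
select nextval('fancy_table_fancy_id_seq'::text);  -- returns 2000001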
And here’s the get_sequence_block function that it uses:

create function repl.get_sequence_block(p_seq_name text, sequence_exists boolean) returns text as $$
declare
    next_attempted_value bigint;
    next_min bigint;
    next_max bigint;
    result text;
begin
    -- use a lock on the table to make sure we don't have two sessions
    -- trying to redefine the sequence at the same time
    perform * from repl.sequence_block_assignment for update;

    if sequence_exists then
        execute 'select last_value + 1 from ' || p_seq_name into next_attempted_value;
    else
        next_attempted_value := 0;
    end if;

    select seq_min, seq_max into next_min, next_max
      from repl.sequence_block_assignment sba, repl.settings s
     where sba.seq_name = p_seq_name
       and sba.host = s.host_name
       and sba.seq_max >= next_attempted_value
     order by seq_min
     limit 1;

    if next_min is null then
        raise exception 'No more sequence blocks available';
    end if;

    -- it is possible that we are in the middle of a sequence, due to
    -- having our max reset, so we must ensure that we don't go backwards
    next_min := greatest(next_min, next_attempted_value);

    if sequence_exists then
        result := 'alter sequence ' || p_seq_name || ' maxvalue ' || next_max::text ||
                  ' restart with ' || next_min::text;
    else
        result := 'create sequence ' || p_seq_name || ' minvalue ' || next_min::text ||
                  ' maxvalue ' || next_max::text;
    end if;

    return result;
end;
$$ language 'plpgsql';
There are a couple of fiddly details, but in essence all that function does is look up the next available range assigned to this sequence (which we store in the sequence_block_assignment table), and then either create or adjust the sequence as necessary.
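Note that get_sequence_block returns the DDL as a string rather than running it itself, which leaves the wrapper to run it via execute. Continuing the hypothetical example from above, the rollover on shard 1 would have it return something like:

select repl.get_sequence_block('fancy_table_fancy_id_seq', true);
-- alter sequence fancy_table_fancy_id_seq maxvalue 3000000 restart with 2000001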
As for communicating the block assignments to each shard, the changes are made centrally and then replicated out to each shard. We already had some replication in place for a small number of tables that contain global data, and so it was easy for us to include this information in the same way.
We’ve been running this scheme in production for a while now, and for the most part it has served us well. In hindsight, I’m inclined to agree that the nextval function should be made a little more explicit, and it is probably something that we will do.
Still, I doubt the final solution will feel as sneaky as this one did.