Sharding PostgreSQL sequences through trickery and deceit
(Just don't tell our DBAs)
As is the case for many successful web applications, our databases have grown considerably over the years. While there was a time that a single moderately-sized PostgreSQL server could happily store all of our transactional data, those days are long gone. Which is, of course, not a bad thing. As they say, scaling to meet the demands of your success is a rather nice problem to have. But it does tend to turn up a few thorny technical challenges. One such problem that we encountered was finding a way to create unique identifiers in a sharded environment.
Our general approach to scalability is a fairly standard one in the SaaS world — partition services where appropriate, and then scale out rather than up. As applied to databases, that generally means spreading large datasets over a number of discrete shards. And so it happened that our fondly-remembered single database instance became four logical shards, spread out over eight physical servers (configured as four primary/standby pairs).
We have quite a few places in our application where we use database sequences for identifiers, and in most of these cases it’s important that the IDs remain unique across all shards. We routinely migrate data from one shard to another in order to balance load, something which would become a lot more complicated if it was possible for IDs from different shards to conflict with each other. So we needed a way to generate IDs that came, at least conceptually, from a single global pool of numbers.
We ruled out of a couple of our first ideas before settling on one that we liked. The first rejected solution was to put the sequences in a separate service, and to have each shard request IDs from the service on demand. While simple, this method had two significant drawbacks: it would hurt performance, as we would be introducing a network round-trip for every ID generated; and it could affect reliability since the ID service would be a single point of failure for those areas of the application.
The second idea was to use a different data type for our IDs that would more easily allow them to be made unique. For example, we could prefix the IDs with an identifier for the shard on which they were created — s1_1, s1_2 for the first shard; s2_1, s2_2 for the second shard; and so on. Or we could use something like a UUID. Unfortunately that would require some significant changes to our table definitions, and we have a lot of tables. Additionally, many of our IDs are published in the form of URLs, and UUIDs in particular do not lend themselves well to short (or attractive) URLs.
This led us to an idea that we liked much better. If we allocate discrete ranges of numbers to each shard, the shards can autonomously generate sequential numbers as they need them, so long as they stay inside the limits of those ranges. Since we can’t always predict how many IDs each shard might need, we’d need to be able to add additional ranges over time, but by picking large enough sizes for each range we could ensure that this would only happen occasionally. We can maintain a list of shard/sequence/range mappings in a table, and then periodically check each shard to make sure we always stay a few blocks ahead.
(We currently check the status of each sequence block daily, but we’ve sized the blocks such that each one typically lasts on the order of months, and always keep at least one unused block preallocated for each sequence ahead of time. This provides a long window of time in which everything continues to work if there is a problem with the sequence allocation scripts).
The only challenge, then, was to figure out how to introduce such a scheme onto an already busy database, without disruption or downtime. And this is where the trickery comes in. Or, if you ask our DBA, the evil.
Looking at the table definitions in our database, most tables that use a sequence for row identifiers look something like this[1]:
(localhost/~) # \d fancy_table
Table "public.fancy_table"
Column | Type | Modifiers
------------------------+---------+----------------------------------------------------------------
fancy_id | integer | not null default nextval('fancy_table_fancy_id_seq'::text)
username | text | not null
favorite_toast_topping | text | not null
Indexes:
"fancy_table_pkey" PRIMARY KEY, btree (fancy_id)
Note the ::text, which casts the argument to be a text value.
However, PostgreSQL actually defines the nextval function in terms of the regclass type[2]:
(localhost/~) # \df nextval
List of functions
Schema | Name | Result data type | Argument data types | Type
------------+---------+------------------+---------------------+--------
pg_catalog | nextval | bigint | regclass | normal
As PostgreSQL supports function overloading, this gives us a way to sneak some additional steps into the sequence generation process. By creating a function nextval(seq_name text), our function will be invoked in place of the system-supplied one. (You may now be able to see why our DBAs thought this was evil.) In most cases, we’ll want our special nextval to simply delegate to the built-in implementation. However, when a sequence reaches the end of its allocated range, we can adjust the sequence using the next available range on the fly.
Here’s what our nextval function looks like:
create function nextval(p_seq_name text) returns bigint as $$ begin return nextval(p_seq_name::regclass); exception when undefined_table then execute(repl.get_sequence_block(p_seq_name, false)); return nextval(p_seq_name::regclass); when object_not_in_prerequisite_state then execute(repl.get_sequence_block(p_seq_name, true)); return nextval(p_seq_name::regclass); end; $$ language 'plpgsql';
That object_not_in_prerequisite_state exception is what we’ll get when a sequence has reached the end of its allowed range. We also catch undefined_table so that we can create a new sequence on its first use. This makes it easier to introduce new shards, as we can create them without having to figure out the sequence ranges for them ahead of time.
And here’s the get_sequence_block function that it uses:
create function repl.get_sequence_block(p_seq_name text,
sequence_exists boolean)
returns text
as
$$
declare
next_attempted_value bigint;
next_min bigint;
next_max bigint;
result text;
begin
-- use a lock on the table to make sure we don't have two sessions
-- trying to redefine the sequence at the same time
perform * from repl.sequence_block_assignment for update;
if sequence_exists then
execute 'select last_value + 1 from ' || p_seq_name into next_attempted_value;
else
next_attempted_value := 0;
end if;
select
seq_min, seq_max into next_min, next_max
from repl.sequence_block_assignment sba, repl.settings s
where
sba.seq_name = p_seq_name
and sba.host = s.host_name
and sba.seq_max >= next_attempted_value
order by seq_min
limit 1;
if next_min is null then
raise exception 'No more sequence blocks available';
end if;
-- it is possible that we are in the middle of a sequence, due to
-- having our max reset, so we must ensure that we don't go backwards
next_min := greatest(next_min, next_attempted_value);
if sequence_exists then
result := 'alter sequence ' || p_seq_name ||
' maxvalue ' || next_max::text || ' restart with ' || next_min::text;
else
result := 'create sequence ' || p_seq_name ||
' minvalue ' || next_min::text || ' maxvalue ' || next_max::text;
end if;
return result;
end;
$$ language 'plpgsql';
There’s a couple of fiddly details, but in essence all that function does is look up the next available range assigned to this sequence (which we store in the sequence_block_assignment table), and then either create or adjust the sequence as necessary.
As for communicating the block assignments to each shard, the changes are made centrally and then replicated out to each shard. We already had some replication in place for a small number of tables that contain global data, and so it was easy for us to include this information in the same way.
Looking Back
We’ve been running this scheme in production for a while now, and for the most part it has served us well. In hindsight, I’m inclined to agree that the nextval function should be made a little more explicit, and it is probably something that we will do.
Still, I doubt the final solution will feel as sneaky as this one did.
[1]I believe this has changed in later versions of PostgreSQL; nowadays the default clause explicitly casts the argument to regclass.
[2]The regclass type is a PostgreSQL alias for an object ID.
2 comments
Leave a comment

Rakesh Waghela commented:
Nice article and quite useful blog post !
Did you encounter any other problem in mean time ? Would be great if you share the further details.
Kevin McConnell commented:
Hi Rakesh! Thanks for your comment. So far the mechanism I described has been working pretty well for us. We have some areas of our code where we run through sequences really fast (like for message IDs; we send a *lot* of mail nowadays!) and apart from one occasion where we made a small mistake setting up some tables, we haven’t encountered any problems.
Since I wrote that article we’ve been working on redesigning a lot of our database schema, and our new version will have a feature similar to this one. I’m hoping we can make the code a little simpler next time though — these things are always easier the second time around :). I’ll be sure to report any interesting developments here when we do!
Have a great day,
Kevin