EmmaTech

  • Emma Home
  • Emma Blog
  • Job Openings
  • RSS

Sharding PostgreSQL sequences through trickery and deceit

(Just don't tell our DBAs)

Kevin McConnell 17 Mar 2011 postgres 2 Comments

As is the case for many suc­cess­ful web appli­ca­tions, our data­bases have grown con­sid­er­ably over the years. While there was a time that a sin­gle moderately-sized PostgreSQL server could hap­pily store all of our trans­ac­tional data, those days are long gone. Which is, of course, not a bad thing. As they say, scal­ing to meet the demands of your suc­cess is a rather nice prob­lem to have. But it does tend to turn up a few thorny tech­ni­cal chal­lenges. One such prob­lem that we encoun­tered was find­ing a way to cre­ate unique iden­ti­fiers in a sharded environment.

Our gen­eral approach to scal­a­bil­ity is a fairly stan­dard one in the SaaS world — par­ti­tion ser­vices where appro­pri­ate, and then scale out rather than up. As applied to data­bases, that gen­er­ally means spread­ing large datasets over a num­ber of dis­crete shards. And so it hap­pened that our fondly-remembered sin­gle data­base instance became four log­i­cal shards, spread out over eight phys­i­cal servers (con­fig­ured as four primary/standby pairs).

We have quite a few places in our appli­ca­tion where we use data­base sequences for iden­ti­fiers, and in most of these cases it’s impor­tant that the IDs remain unique across all shards. We rou­tinely migrate data from one shard to another in order to bal­ance load, some­thing which would become a lot more com­pli­cated if it was pos­si­ble for IDs from dif­fer­ent shards to con­flict with each other. So we needed a way to gen­er­ate IDs that came, at least con­cep­tu­ally, from a sin­gle global pool of numbers.

We ruled out of a cou­ple of our first ideas before set­tling on one that we liked. The first rejected solu­tion was to put the sequences in a sep­a­rate ser­vice, and to have each shard request IDs from the ser­vice on demand. While sim­ple, this method had two sig­nif­i­cant draw­backs: it would hurt per­for­mance, as we would be intro­duc­ing a net­work round-trip for every ID gen­er­ated; and it could affect reli­a­bil­ity since the ID ser­vice would be a sin­gle point of fail­ure for those areas of the application.

The sec­ond idea was to use a dif­fer­ent data type for our IDs that would more eas­ily allow them to be made unique. For exam­ple, we could pre­fix the IDs with an iden­ti­fier for the shard on which they were cre­ated — s1_1, s1_2 for the first shard; s2_1, s2_2 for the sec­ond shard; and so on. Or we could use some­thing like a UUID. Unfortunately that would require some sig­nif­i­cant changes to our table def­i­n­i­tions, and we have a lot of tables. Additionally, many of our IDs are pub­lished in the form of URLs, and UUIDs in par­tic­u­lar do not lend them­selves well to short (or attrac­tive) URLs.

This led us to an idea that we liked much bet­ter. If we allo­cate dis­crete ranges of num­bers to each shard, the shards can autonomously gen­er­ate sequen­tial num­bers as they need them, so long as they stay inside the lim­its of those ranges. Since we can’t always pre­dict how many IDs each shard might need, we’d need to be able to add addi­tional ranges over time, but by pick­ing large enough sizes for each range we could ensure that this would only hap­pen occa­sion­ally. We can main­tain a list of shard/sequence/range map­pings in a table, and then peri­od­i­cally check each shard to make sure we always stay a few blocks ahead.

(We cur­rently check the sta­tus of each sequence block daily, but we’ve sized the blocks such that each one typ­i­cally lasts on the order of months, and always keep at least one unused block pre­al­lo­cated for each sequence ahead of time. This pro­vides a long win­dow of time in which every­thing con­tin­ues to work if there is a prob­lem with the sequence allo­ca­tion scripts).

The only chal­lenge, then, was to fig­ure out how to intro­duce such a scheme onto an already busy data­base, with­out dis­rup­tion or down­time. And this is where the trick­ery comes in. Or, if you ask our DBA, the evil.

Looking at the table def­i­n­i­tions in our data­base, most tables that use a sequence for row iden­ti­fiers look some­thing like this[1]:

(localhost/~) # \d fancy_table
                                    Table "public.fancy_table"
         Column         |  Type   |                           Modifiers
------------------------+---------+----------------------------------------------------------------
 fancy_id               | integer | not null default nextval('fancy_table_fancy_id_seq'::text)
 username               | text    | not null
 favorite_toast_topping | text    | not null
Indexes:
    "fancy_table_pkey" PRIMARY KEY, btree (fancy_id)

Note the ::text, which casts the argu­ment to be a text value.

However, PostgreSQL actu­ally defines the nextval func­tion in terms of the regclass type[2]:

(localhost/~) # \df nextval
                           List of functions
   Schema   |  Name   | Result data type | Argument data types |  Type
------------+---------+------------------+---------------------+--------
 pg_catalog | nextval | bigint           | regclass            | normal

As PostgreSQL sup­ports func­tion over­load­ing, this gives us a way to sneak some addi­tional steps into the sequence gen­er­a­tion process. By cre­at­ing a func­tion nextval(seq_name text), our func­tion will be invoked in place of the system-supplied one. (You may now be able to see why our DBAs thought this was evil.) In most cases, we’ll want our spe­cial nextval to sim­ply del­e­gate to the built-in imple­men­ta­tion. However, when a sequence reaches the end of its allo­cated range, we can adjust the sequence using the next avail­able range on the fly.

Here’s what our nextval func­tion looks like:

create function nextval(p_seq_name text) returns bigint
as
$$
begin
  return nextval(p_seq_name::regclass);
exception
when undefined_table then
  execute(repl.get_sequence_block(p_seq_name, false));
  return nextval(p_seq_name::regclass);
when object_not_in_prerequisite_state then
  execute(repl.get_sequence_block(p_seq_name, true));
  return nextval(p_seq_name::regclass);
end;
$$ language 'plpgsql';

That object_not_in_prerequisite_state excep­tion is what we’ll get when a sequence has reached the end of its allowed range. We also catch undefined_table so that we can cre­ate a new sequence on its first use. This makes it eas­ier to intro­duce new shards, as we can cre­ate them with­out hav­ing to fig­ure out the sequence ranges for them ahead of time.

And here’s the get_sequence_block func­tion that it uses:

create function repl.get_sequence_block(p_seq_name text,
    sequence_exists boolean)
returns text
as
$$
declare
  next_attempted_value bigint;
  next_min bigint;
  next_max bigint;
  result text;
begin
  -- use a lock on the table to make sure we don't have two sessions
  -- trying to redefine the sequence at the same time
  perform * from repl.sequence_block_assignment for update;

  if sequence_exists then
    execute 'select last_value + 1 from ' || p_seq_name into next_attempted_value;
  else
    next_attempted_value := 0;
  end if;

  select
    seq_min, seq_max into next_min, next_max
  from repl.sequence_block_assignment sba, repl.settings s
  where
    sba.seq_name = p_seq_name
    and sba.host = s.host_name
    and sba.seq_max >= next_attempted_value
  order by seq_min
  limit 1;

  if next_min is null then
    raise exception 'No more sequence blocks available';
  end if;

  -- it is possible that we are in the middle of a sequence, due to
  -- having our max reset, so we must ensure that we don't go backwards
  next_min := greatest(next_min, next_attempted_value);

  if sequence_exists then
    result := 'alter sequence ' || p_seq_name ||
      ' maxvalue ' || next_max::text || ' restart with ' || next_min::text;
  else
    result := 'create sequence ' || p_seq_name ||
      ' minvalue ' || next_min::text || ' maxvalue ' || next_max::text;
  end if;

  return result;
end;
$$ language 'plpgsql';

There’s a cou­ple of fid­dly details, but in essence all that func­tion does is look up the next avail­able range assigned to this sequence (which we store in the sequence_block_assignment table), and then either cre­ate or adjust the sequence as necessary.

As for com­mu­ni­cat­ing the block assign­ments to each shard, the changes are made cen­trally and then repli­cated out to each shard. We already had some repli­ca­tion in place for a small num­ber of tables that con­tain global data, and so it was easy for us to include this infor­ma­tion in the same way.

Looking Back

We’ve been run­ning this scheme in pro­duc­tion for a while now, and for the most part it has served us well. In hind­sight, I’m inclined to agree that the nextval func­tion should be made a lit­tle more explicit, and it is prob­a­bly some­thing that we will do.

Still, I doubt the final solu­tion will feel as sneaky as this one did.


[1]I believe this has changed in later ver­sions of PostgreSQL; nowa­days the default clause explic­itly casts the argu­ment to regclass.
[2]The regclass type is a PostgreSQL alias for an object ID.

2 comments

Rakesh Waghela commented:

2011-08-06, 05:02

Nice arti­cle and quite use­ful blog post !
Did you encounter any other prob­lem in mean time ? Would be great if you share the fur­ther details.

Kevin McConnell

Kevin McConnell commented:

2011-08-06, 12:04

Hi Rakesh! Thanks for your com­ment. So far the mech­a­nism I described has been work­ing pretty well for us. We have some areas of our code where we run through sequences really fast (like for mes­sage IDs; we send a *lot* of mail nowa­days!) and apart from one occa­sion where we made a small mis­take set­ting up some tables, we haven’t encoun­tered any problems.

Since I wrote that arti­cle we’ve been work­ing on redesign­ing a lot of our data­base schema, and our new ver­sion will have a fea­ture sim­i­lar to this one. I’m hop­ing we can make the code a lit­tle sim­pler next time though — these things are always eas­ier the sec­ond time around :). I’ll be sure to report any inter­est­ing devel­op­ments here when we do!

Have a great day,
Kevin

Leave a comment

Click here to cancel reply.

About Kevin McConnell

Kevin McConnell

Kevin serves as our Director of Engineering. Hailing originally from Edinburgh, Kevin has injected the office with words like "dodgy" and "wee." As in a wee spot of tea to go with his toast snack. Did we mention Kevin's devotion to toast? Just say the word and he lights up.

Read more from Kevin

Emma Tech on Twitter

    Follow Emma Tech »
    Help wanted

    • Popular Tags

      Python12 api7 UX5 conferences4 postgres4 workflow4 time4 javascript3 PHP3 jQuery3 tools3 editors2 travel2 server maintenance2 Git2 maintenance windows2 Haml1 Frank1 Ruby1 CSS1 PyCon1 office1 Sass1 downtime1 post‑mortems1 cgit1 books1 Trac1 collaboration1 community1 Twitter1 Facebook1 OAuth1 coding1 cool sites1 Redis1 github1 objects programming refactoring1 integration1 salesforce1 usability testing1 Social Posting1 music1 productivity1 bugs1 TextExpander1 san francisco1 Convore1 Vim1 releases1 legacy data1 HTML1 reading1 Django1 PgCon1 testing1 TDD1

    Emma is a member of the Email Sender & Provider Coalition and the Messaging Anti-Abuse Working Group.

    Copyright © 2003 - 2012 Emma.
    All rights reserved.

    • Get Emma's Newsletter
    • Visit the Emma Blog
    • @emmaemailtech on Twitter
    • @emmaemail on Twitter
    • Emma on Facebook

    Emma's email marketing makes communicating simple and stylish.
    Inquire now for more details.