Written by dap
on August 25, 2024

Challenges deploying PostgreSQL (9.2) for high availability

NOTE: This post was published in 2024 about experiences in 2017-2019 scaling up a system built largely in 2012-2013. This was all based on PostgreSQL 9.2 - 9.6. No doubt much has changed in PostgreSQL since then. I'm not sure this post is useful for anybody using PostgreSQL today. But it's a handy reference for when folks ask me about my experiences with PostgreSQL.

In 2012, Joyent started working on Manta, a highly-available cloud object store. Our early choice to use PostgreSQL in the heart of the system cast a long shadow. We learned about operating PostgreSQL at scale, and particularly in a high-write-volume, high-availability use case. This post covers some of our choices, the behavior of PostgreSQL, and what I learned.

I’ve included a lot of detail so that this post can stand alone for people who aren’t already familiar with PostgreSQL internals or its patterns for high availability. To keep it manageable, I’ve divided the post into sections:

How (and why) we deployed HA PostgreSQL
Background: how writes work in PostgreSQL
How replication went wrong for us
Summary and final thoughts

Feel free to skip around or jump right to the summary.

1. How (and why) we deployed HA PostgreSQL

Manta design goals

Like a filesystem, an object store maps user-controlled names (like “/dap/public/kartlytics.tgz”) to large blobs of data. What makes object stores simpler than filesystems (and much more appealing in the face of CAP) is that they generally don’t allow writes to the middle of an object. Objects are effectively immutable: if you want to change the contents of “/dap/public/kartlytics.tgz”, you end up creating a new object that replaces the original one. With this model, it’s natural to decompose the contents of the object store into metadata and data: the metadata maps the user-visible namespace (like “/dap/public/kartlytics.tgz”) to specific blobs of the actual user data that’s stored on specific servers.

Back in 2012, we started with several constraints on Manta’s design:

Data integrity. This is job #1. Never lose anybody’s data – ever.
High availability. As a cloud service at the foundation of customers’ applications, it’s critical that the system remain online as much as possible. To quantify this a little better: we wanted to build a system where no failure of a single hardware component, server, rack, or even a physical datacenter space could affect overall service aside from requests currently in-flight on that component.
Horizontal scalability. Again, being a cloud service, customers expect to be able to store arbitrary amounts of data in Manta. If we became limited by the physical capacity of any component, we had to be able to scale it out by deploying more instances on more hardware.
Strong consistency. This is probably the most controversial because when we started, the biggest, most popular object store (Amazon’s S3) wasn’t read-after-write consistent in most cases. We understood this to be a pain point for many S3 users. Besides that, we knew that we (or customers) could build a more-available (less strongly consistent) service atop a strongly-consistent Manta, but the reverse is not possible.

ZFS was an easy choice for data integrity. It remains one of the only filesystems that deals with arbitrary data corruption by HBAs and disks (provided not all copies are affected, of course). We chose to store user data as flat files on ZFS filesystems because that enabled us to build in-place compute as a first-class Manta operation. We discuss both of these aspects in more detail in our ACM Queue article.

For metadata, which needs to be queried in (slightly) more sophisticated ways, we decided early on to use PostgreSQL primarily because of the need for strong consistency and its reputation for data integrity and performance. If it wasn’t going to work for us, we figured we’d probably have to build our own component (remember, this was 2012), but that would have to be informed by whatever reason caused PostgreSQL not to work for us.

In terms of our constraints, we expected that PostgreSQL would give us data integrity out of the box (particularly when deployed atop ZFS). Sharding is well-understood for horizontal scalability, and we combined this with consistent hashing to enable us to grow the number of shards in the future. By sharding based on directory names and limiting atomic API operations to single-directory operations, we got strong consistency.

For a long time, the biggest challenge we had with PostgreSQL was the remaining constraint: high availability.

Highly available PostgreSQL

Chapter 26 in the PostgreSQL documentation covers high availability.¹ At the most basic level: if you intend a service to survive fatal failure of a PostgreSQL database (say, a server reboot or network partition) in less time than it takes to recover that database (e.g., the time required to reboot or restore connectivity), you need a second, up-to-date copy of the database somewhere else that you can bring into service when the first database fails.²

Terminology note: people use different terms to describe these database copies, including “master”, “primary”, or “upstream” for the first database and “slave”, “secondary”, “standby”, or “downstream” for the second one. The PostgreSQL documentation uses various combinations of these terms. I’m going to use “primary” and “secondary”.

For our secondary database to be useful for high availability, we needed two properties:

Failover time must be bounded. That is, we needed to be able to bring the secondary database into service as a primary within seconds, no matter how big or busy the database was.
Failover must not lose any data. Being a foundational service for end users’ applications, no amount of data loss is acceptable. We couldn’t drop a few seconds’ worth of writes on a failover event. Imagine if S3 occasionally lost keys that you had stored into it because of some internal failure event that you otherwise didn’t even know had happened.

More precisely, we can say that if a write operation is acknowledged to a client on the primary, and a takeover happens immediately after that, then within a bounded amount of time, another client must be able to read the data that was just written. Again, by “bounded time” I mean that the time required is just a few seconds regardless of how big the database is or how busy it was before the takeover.

Like many PostgreSQL users, especially in 2012, we opted to use streaming WAL replication. This involved setting up the secondary database to receive the log of the primary’s write activity and replay the log. Further, we used synchronous replication, which causes write transactions on the primary to block until the secondary has received and acknowledged the write as well. (Much more on this later.) In comparison to the other approaches discussed in the PostgreSQL docs, this had a couple of nice properties:

To set this up, you needed only to set up your two copies of the database the same way and then configure PostgreSQL to replicate.
Unlike the shared disk option, no special hardware was required, and the system could survive failure of any component (provided the databases are deployed with no shared components).
Unlike statement-based replication middleware, PostgreSQL itself ensured that the secondary contains exactly the same content as the primary.
Unlike the various multi-master modes, there was no need for conflict resolution, automatic or otherwise, because there cannot be any conflicting changes.

Now, replication takes time, so synchronous replication adds latency to each write request. When we talked to people about synchronous replication, they often thought we were crazy because it was just too slow. But we were only replicating across reliable, high-throughput, low-latency network connections, so the round-trip network latency wasn’t that high. It felt like a reasonable price to pay for the durability, consistency, and availability characteristics we were going for.

In order to survive a failure, we needed two components to be online before the failure: the primary and the secondary. If we had a second failure while the primary was offline, even if the primary quickly came back online, we could not bring it back into service because it would be missing writes that were written to the secondary while the primary was down. To address this, we deployed three databases instead of two, using synchronous replication from the primary to the secondary (which we called the “sync peer”, or just “the sync”) and asynchronous replication to the third peer (often called the “async peer”, or just “the async”). On failover, the async peer became a sync. As a result, all writes to the database were durable on two different servers. We built a system called Manatee for monitoring the health of each peer and managing the various roles in the cluster (i.e., initiating takeovers). There are many other such systems now, but there weren’t at the time.

With these pieces in place, we were monitoring peers and taking over automatically in bounded time without data loss. Right? If this were a blog post from 2015 about how we built this system, this would be the end of the story. Informed by several more years’ worth of experience, the real story turns out to be quite a bit more complicated!

One of my favorite quotes is that “you do not understand how a system works until you understand how it fails”. At this point, you know how our H/A PostgreSQL system worked at a high level. Let’s get into the details a bit so that we can then see how it failed.

2. Background: how writes work in PostgreSQL

Reminder: This was all written for PostgreSQL 9.2 - 9.6. I haven't kept up with PostgreSQL and things may have changed since then.

Recall what we said above:

Requirement: if a write operation is acknowledged to a client on the primary, and a takeover happens immediately after that, then within a bounded amount of time, another client must be able to read the data that was just written.

To understand if that’s true of the system we described, we need to dig deeper and understand the behavior first of normal writes in a single PostgreSQL database and then of those writes when using synchronous replication.

Writes in a single PostgreSQL database

Chapter 30 in the PostgreSQL documentation talks about PostgreSQL’s write-ahead-log (WAL), giving this overview:

Briefly, WAL's central concept is that changes to data files (where tables and indexes reside) must be written only after those changes have been logged, that is, after log records describing the changes have been flushed to permanent storage. If we follow this procedure, we do not need to flush data pages to disk on every transaction commit, because we know that in the event of a crash we will be able to recover the database using the log: any changes that have not been applied to the data pages can be redone from the log records. (This is roll-forward recovery, also known as REDO.)

I think of it this way: the state of a PostgreSQL database consists of:

an on-disk representation of the database contents (the on-disk table and indexes)
an in-memory representation of the database contents (sometimes called the in-memory heap, and generally represented by the region referred to as shared buffers)
a write-ahead-log (WAL)

I draw it like this:

When you issue a write to PostgreSQL, a few things happen:

The operation is written to the WAL, and those WAL records are flushed to stable storage.
The in-memory database state is updated so that subsequent queries will see the results of the write (depending on their transaction mode).
Some time later, the in-memory changes are written to the on-disk state of the database.

If the database crashes before the WAL records for a particular operation have been flushed to disk, then the operation never happened. (The client would not have seen an acknowledgment for the operation.) If the database crashes after the WAL records are written, but before the database files are updated, then the operation did happen. How does that work? When PostgreSQL comes back up, it can reconstruct in-memory state from the on-disk state, and that will provide correct results to clients provided that PostgreSQL re-applies all of the writes from the WAL before accepting any requests from clients. The process of replaying writes from the WAL after startup is called recovery – and indeed, PostgreSQL startup blocks until the recovery process completes.

The process by which PostgreSQL flushes outstanding changes to the database’s data files is usually done by an operation called a checkpoint. Section 30.4 explains:

Checkpoints are points in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with all information written before that checkpoint. At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the log file. (The change records were previously flushed to the WAL files.) In the event of a crash, the crash recovery procedure looks at the latest checkpoint record to determine the point in the log (known as the redo record) from which it should start the REDO operation. Any changes made to data files before that point are guaranteed to be already on disk.

In a high-write-volume system, at any given time, there will be some write operations in each of the above phases. We can summarize this concisely in terms of how it looks in the WAL:

The documentation goes on to describe the impact of several checkpoint-related parameters:

The server's checkpointer process automatically performs a checkpoint every so often. A checkpoint is begun every `checkpoint_timeout` seconds, or if `max_wal_size` is about to be exceeded, whichever comes first.
…

Reducing checkpoint_timeout and/or max_wal_size causes checkpoints to occur more often. This allows faster after-crash recovery, since less work will need to be redone. However, one must balance this against the increased cost of flushing dirty data pages more often. If full_page_writes is set (as is the default), there is another factor to consider. To ensure data page consistency, the first modification of a data page after each checkpoint results in logging the entire page content. In that case, a smaller checkpoint interval increases the volume of output to the WAL log, partially negating the goal of using a smaller interval, and in any case causing more disk I/O.

As this explains, checkpoint frequency has a huge impact on the server: doing checkpoints too frequently can result in doing a lot of I/O – potentially much more than if checkpoints happen less frequently. (In the extreme case, consider a single page in the database that changes once per second. If you checkpoint once per second, you’ll rewrite it 60 times in a minute. If you checkpoint once per minute, you’ll rewrite it just once.) On the other hand, checkpointing too infrequently can result in very long startup times after a restart, since the server has to replay all of the WAL that wasn’t checkpointed when the server went down.

It’s generally up to the operator to configure PostgreSQL to checkpoint at an appropriate frequency, and there are many other considerations to be aware of.

But there’s only one concern that we care about here in our discussion about availability: the longer the time since the last checkpoint, the more data will need to be replayed if the server crashes at that point. Since this represents a debt that must be repaid if the server restarts, we’ve come to call this checkpoint lag. (I’m not sure if the community uses a different term for this.)

Writes in a two-peer PostgreSQL cluster using synchronous replication

Now that we have a model for understanding writes in a single database, how does replication affect writes? Section 26.2.8.1 explains this for the specific case of synchronous replication:

After a commit record has been written to disk on the primary, the WAL record is then sent to the standby. The standby sends reply messages each time a new batch of WAL data is written to disk, unless `wal_receiver_status_interval` is set to zero on the standby.

To be more precise, see the documentation for the synchronous_commit tunable (slightly reformatted here for clarity). If synchronous replication is enabled:

this parameter also controls whether or not transaction commits will wait for their WAL records to be replicated to the standby server(s).
When set to on, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and flushed it to disk. This ensures the transaction will not be lost unless both the primary and all synchronous standbys suffer corruption of their database storage.

When set to remote_apply, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and applied it, so that it has become visible to queries on the standby(s).

When set to remote_write, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and written it out to their operating system. This setting is sufficient to ensure data preservation even if a standby instance of PostgreSQL were to crash, but not if the standby suffers an operating-system-level crash, since the data has not necessarily reached stable storage on the standby.

Now we can better explain what happens to a write operation with synchronous replication:

The operation commits on the primary, generating a WAL “commit” record.
All WAL records for this transaction, including the commit record, are transmitted to the secondary.
The secondary writes and flushes these records to disk.
The WAL records are applied to the in-memory database on the secondary.
Some time later, the changes are applied to the on-disk data files.

This last step is analogous to the checkpointing behavior we discussed in the single-node case, though section 30.4 explains that these are called restartpoints on non-primaries:

In archive recovery or standby mode, the server periodically performs restartpoints, which are similar to checkpoints in normal operation: the server forces all its state to disk, updates the pg\_control file to indicate that the already-processed WAL data need not be scanned again, and then recycles any old log segment files in the pg\_xlog directory.

3. How replication went wrong for us

Replay lag

Critically, the synchronous_commit parameter determines at what point in the above five-stage process the primary can consider a write having completed. Manta was built on PostgreSQL 9.2.4, where the value on was the strongest form available, but that still only waited until step (3) above. What happened to transactions that have made it past step (3), but not step (4)? In that case, their WAL records were written to disk on the secondary, but the changes hadn’t been applied to the database on the sync. Under a steady write workload like we had, we could assume that some transactions would be in this state at any given time.

Could we measure this? Fortunately, because the WAL is a single stream of activity and it’s processed in order, it was sufficient to look at position in the WAL stream, which PostgreSQL exposes via the pg_stat_replication view. We referred to the delta between what’s been received and what’s been replayed (applied to the database) as replay lag.

What did replay lag mean for Manta? A few things:

If you were to query the secondary at any point, you would find its state does not reflect writes that haven’t completed step (4). It was often suggested that Manta could allow read requests to be directed to secondary databases to reduce load on the primaries, but this would have resulted in frequent violation of Manta’s read-after-write consistency because of replay lag.
More critically, if the secondary took over as primary, it needed to finish replaying all of the un-replayed WAL records before it could even accept reads. Otherwise, it would be a violation of PostgreSQL’s own guarantees.

This second point is huge: the minimum possible takeover time depended on the amount of replay lag at the point of takeover. Further, there’s no bound on how much data can be stuck in this intermediate state. Out of the box, this violated our goal of bounded-time takeovers.

Imagine that the secondary runs into a problem that causes it to replay database updates much more slowly than normal. Even with synchronous replication, as long as the peer is able to receive those records at the same rate that they’re being sent, performance on the primary is unchanged – but the replay lag starts growing. This could go on without issue as long as nothing else went wrong. But when something ever did go wrong, we were in trouble. We found ourselves in situations where we had over 18 hours’ worth of replay lag, then the primary failed and we tried to do a takeover – but PostgreSQL on the secondary was going to be busy replaying WAL segments for hours before it could start up! Takeover that takes 18 hours might be fine for disaster recovery, but it’s not high availability.

Later versions of PostgreSQL provide a synchronous_commit = remote_apply option to provide the necessary backpressure by not allowing writes to complete on the primary until they’ve been applied on the secondary (not just received or written to disk). This wasn’t an option in the version of PostgreSQL on which we built Manta (9.2.4). We later did work to support upgrading production database to 9.6, but using remote_apply was a non-starter for reasons described below.

Causes of replay lag

Replay lag became a major problem for our large, high-write-volume Manta deployments. A few causes are worth looking at in detail: high write volume, mismatched filesystem record sizes, and the fact that WAL recovery is single-threaded.

High write volume, lots of vacuuming

Our Manta workload involved storing lots of relatively small objects. Plus, the storage tier was very easy to scale out. So the workload quickly became bound on how quickly it could write to the metadata tier (the PostgreSQL databases). Further, the metadata was being frequently updated, which led to a lot of dead tuples, which led to a need for frequent vacuuming. This in turn generated lots more write activity.

Mismatched filesystem record size

Based on advice from the community, we had tuned the ZFS recordsize to 16 KiB from its much larger default of 128 KiB. The idea behind using a 16 KiB recordsize was that PostgreSQL’s native block size is 8 KiB, and using 16 KiB blocks causes ZFS to fetch two PostgreSQL pages at a time, which is helpful when PostgreSQL scans through a relation. It’s conceivable that this could help some read-heavy workloads or workloads that fit comfortably in DRAM. But for our write-heavy workload that didn’t come close to fitting in DRAM, this setting was truly disastrous. Can you tell why?

Regardless of the filesystem recordsize, PostgreSQL writes to the filesystem in 8 KiB blocks. When the filesystem uses a matching 8 KiB recordsize and PostgreSQL updates a block, ZFS just allocates a new block to be written out with the next transaction group. The actual write() operation can return immediately in many cases without waiting for any I/O. But with a 16 KiB recordsize, when PostgreSQL updates a random 8 KiB block, that’s one half of a 16 KiB filesystem block. In order for the filesystem to construct the new version of that block, it needs the contents of the other half – the half that PostgreSQL isn’t modifying. If that block isn’t already in memory, it has to be read from disk. This is known as a read-modify-write operation.

With an 8 KiB recordsize, we would have had a streaming write workload, which even spinning disks could handle pretty quickly. With a 16 KiB recordsize, we had a mixed workload: streaming writes plus random reads. This is significantly more expensive even on SSDs, since it requires up to twice the I/O. write() operations may block on disk reads – slowing them down by literally orders of magnitude.

It’s hard to communicate just how bad this problem was. Even if 99% of writes do not encounter the read-modify-write issue, because random reads are so much slower, Amdahl reminds us that we’ll still spend virtually 100% of our time blocked on the random reads.

We ultimately resolved this problem – rather painfully – by manually taking every database peer out of service and rewriting the database files to use an 8 KiB recordsize.

Single-threaded WAL recovery

Executing queries often requires disk I/O, especially when the database’s size exceeds what can be cached in DRAM. Disks are much slower than DRAM, so this increases query latency considerably. But disks can do several operations concurrently, which means that throughput of a random-read workload typically increases with more concurrency (up to a point), particularly for spinning disks. So it’s common to configure a PostgreSQL database to accept tens or hundreds of concurrent client connections, with each connection served by a separate single-threaded process that initiates whatever filesystem I/O it needs. This enables good throughput even when individual I/O operations might be slow.

With replication involved, the secondary essentially needs to do all the same work that the primary does. (While the random reads required on the secondary don’t line up 1-to-1 with those on the primary, it’s reasonable to imagine that they do because we’d expect the proportions to match up under steady-state.) But while the primary does all that disk I/O from tens or hundreds of threads, on the secondary, there is only one single-threaded process that reads the WAL log and applies those changes to the database (doing all of the I/O necessary to do that). So the secondary has to do in sequence the same amount of read I/O that the primary is doing with a concurrency of a few hundred. It’s somewhat surprising that a secondary can ever keep up!

Cascading failure

In our case, these problems (the heavy write volume, the recordsize mismatch, and the single-threaded replay lag) compounded badly: the single-threaded WAL recovery process was spending most of its time blocked on random reads in order to carry out writes (because of the read-modify-write behavior induced by the recordsize mismatch).

We resolved the limited-concurrency problem by implementing pg_prefaulter. This daemon ran on a downstream peer, monitored the current WAL replay location, read whatever WAL files PostgreSQL was going to be needing soon, parsed the WAL records, determined which filesystem pages PostgreSQL was going to need soon, and then read those into memory – from multiple threads. This effectively put a multi-threaded prefaulter into the PostgreSQL replication pipeline.

Note that because of the way pg_prefaulter worked, it could only improve replay performance when replay lag was already non-zero. If replication was caught up (and remained caught up), there was nothing to prefault. With pg_prefaulter we’d usually see the system oscillate between accumulating small amounts of replay lag and driving it back to zero.

Having explained all this, we can now get to a subtle but critical point about performance. Recall that earlier, we said that modern versions of PostgreSQL provide a synchronous_commit = remote_apply option, which blocks writes on the primary until the sync has actually applied the write. This option eliminates replay lag. However, this option slows writes to the primary to the rate at which the single thread on the sync can replay WAL transactions. By contrast, using synchronous_commit = remote_write and pg_prefaulter, we had synchronous replication, plus the performance benefits of multi-threaded I/O, while still keeping replay lag low. Unfortunately, there’s nothing about this setup to guarantee that replay lag remains low; it’s something we needed to constantly monitor. (In practice, the combination of the recordsize change and prefaulter eliminated replay lag in our environments without the need for regular operator intervention.)

Looking back at the times people told us that we were crazy to use synchronous replication because it was so slow, I wonder how much that reputation came from the fact that synchronous replication requires an extra network round-trip per transaction (which is intrinsic, but not actually that expensive) vs. the fact that the PostgreSQL implementation effectively limits throughput to the speed of a single-threaded I/O-bound process (which was not something we found documented, nor something I ever heard anybody talk about at that point).³

The re-emergence of checkpoint lag

Earlier, we explained checkpoints and how PostgreSQL start-up time could be greatly affected by checkpoint frequency. We came to call this “checkpoint lag”, since, like replay lag, it reflected an amount of debt that needed to be repaid for the system to function after a restart.⁴ We discovered this to be a major contributor to unbounded takeover time in a couple of surprising ways.

Recall that checkpoint lag affects restart time for a single server because the server has to replay all the non-checkpointed WAL entries. This affects both primaries and secondaries independently. We were generally restarting the secondary as part of takeover (trigger files were not a thing when we initially built this system). We ran into cases where the secondary was restarted as part of a takeover and we found ourselves – yet again – with an enormous amount of replay lag that had to be applied before the system could come back up. That was even though there was no replay lag prior to the takeover! The problem was that the system hadn’t been checkpointing frequently enough.

Why not? It turned out that the page size / recordsize mismatch problem affected the checkpointer in the same way it had affected the WAL application process. Writes that were logically 8 KiB were turned into 16 KiB filesystem writes, triggering read-modify-writes. In this way, a thread whose sole job was to write dirty data to disk spent 100% of its time blocked on reads. And like WAL application, it was single-threaded, so it quickly fell behind. When we started hitting this, checkpoints that had previously taken 10-15 seconds started to take hours.

Things got worse from there. An important byproduct of the checkpoint process is that when dirty buffers were written out, those buffers could be used for new work (e.g., replaying the next WAL entry). Generally this would ensure that there was a pool of non-dirty buffers available for whatever needed them. But when the checkpointer fell behind, PostgreSQL would eventually run out of non-dirty buffers altogether. When that happens, any time anything needed a buffer, it had to use a dirty one and write out the dirty data before doing anything else. Of course, now that thing became blocked on the same read-modify-write problem. This problem looked like a very sharp cliff: though the checkpointer may have started falling behind hours ago, things would be humming along until until the pool of dirty buffers became empty. At that point, WAL application suddenly slowed to a crawl. And there was nothing pg_prefaulter could do about this problem.

So in this way, excessive checkpoint lag actually created significant replay lag. The resolution for us was fixing the recordsize mismatch as described above, but again, there’s nothing structurally in PostgreSQL that provided backpressure to prevent these things from becoming problems. It was just a matter of keeping the different pieces independently moving quickly enough.

Monitoring various types of lag

Given that both replay lag and checkpoint lag exhibited few symptoms under normal operation but became catastrophic at failover time, it was critical to monitor them closely. How does one do that? Section 26.2.5.2 explains:

An important health indicator of streaming replication is the amount of WAL records generated in the primary, but not yet applied in the standby. You can calculate this lag by comparing the current WAL write location on the primary with the last WAL location received by the standby. They can be retrieved using `pg_current_xlog_location` on the primary and the `pg_last_xlog_receive_location` on the standby, respectively (see Table 9-78 and Table 9-79 for details). The last WAL receive location in the standby is also displayed in the process status of the WAL receiver process, displayed using the ps command (see Section 28.1 for details).
You can retrieve a list of WAL sender processes via the pg_stat_replication view. Large differences between pg_current_xlog_location and sent_location field might indicate that the master server is under heavy load, while differences between sent_location and pg_last_xlog_receive_location on the standby might indicate network delay, or that the standby is under heavy load.

First of all, this is not quite correct. The most important health indicator is indeed the amount of WAL generated on the primary but not yet applied on the secondary. However, the documentation goes on to measure something else that we call send lag (or receive lag): it’s the amount of data generated on the primary but not yet received on the secondary. For those using asynchronous replication, this too is a problem. It’s just as bad as the other lags – we just didn’t tend to see it much because we use synchronous replication for the first hop.

The documentation doesn’t explain how to monitor the actual replay lag. The answer is to follow similar instructions to what it says, but look at the replay_location field instead of sent_location and use pg_last_xlog_replay_location() instead of pg_last_xlog_receive_location().

Checkpoint lag was more difficult to measure. There were a couple of ways that I found:

Most directly, we could look at the time of the latest checkpoint. The only way I found to do this is using the pg_controldata command directly on the database. I’m not sure there was a way to query the database for this information. It was also not very clear what the field reported by pg_controldata means, particularly on a secondary server, but it appears to be the timestamp when the last checkpoint started on the primary that was applied as a restartpoint on the secondary (which is probably what you want, since that’s what would have to be replayed on the secondary if it had to restart).
We could use pg_buffercache to monitor the amount of dirty buffers, which I understood to be the amount of data that would need to be checkpointed. This is somewhat indirect, and scanning this information appeared to take exclusive locks, so I’m not sure that this would be a great approach.
We could enable logging of checkpoints and restartpoints and monitor the logs.

There’s another subtlety worth mentioning: the difference in WAL positions was only a proxy for what really affects takeover time, which is the amount of time required to replay WAL positions. In our experience, WAL replay rates varied over time depending on workload. (For example, a vacuum may generate WAL records that cause many more random reads than usual.) I don’t know any way to really deal with this problem – there’s no good way to estimate the cost of replaying WAL records ahead of time. But it’s important to be aware of.

Another problem: brief interruptions cause huge impact

Our availability also suffered due to another major issue we found with streaming replication. Section 26.2.2 explains (emphasis added):

In standby mode, the server continuously applies WAL received from the master server. The standby server can read WAL from a WAL archive (see `restore_command`) or directly from the master over a TCP connection (streaming replication). The standby server will also attempt to restore any WAL found in the standby cluster's pg\_xlog directory. That typically happens after a server restart, when the standby replays again WAL that was streamed from the master before the restart, but you can also manually copy files to pg\_xlog at any time to have them replayed.
At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_xlog directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in archive or pg_xlog. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore the file from the archive again. This loop of retries from the archive, pg_xlog, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file.

To summarize that: if PostgreSQL restarted on the secondary, or if the replication connection terminated for any reason (including transient TCP failures or an otherwise quick crash-and-restart on the primary), the secondary will finish replaying all of the files that it has locally before even attempting to reconnect to the primary to resume replication.

Suppose you have a cluster that seems to be behaving okay, but there’s some non-trivial replay lag – say it’s an hour behind. We know that if we were to failover at this point, you’d be in for an hour of downtime. But you’d also be in for an hour of downtime if:

the primary restarted (even if it came back up immediately), or
the secondary restarted (even if it came back up immediately), or
the TCP connection connection used for replication simply broke for any reason (even if it was restored immediately)

Any of these causes the secondary to finish replaying all outstanding WAL before connecting back to the primary. But without a replication connection, with synchronous replication, the primary is down for writes while this happens. (With async replication, you’d be accumulating send lag while this happens.) It’s pretty surprising that the cluster was behaving just fine but restarting one peer causes this hour of downtime.

4. Summary and final thoughts

PostgreSQL started with what seemed like a pretty reasonable abstraction: shipping the write-ahead-log to a second copy of the database to keep up-to-date so that you could failover quickly. We found that if you did this naively, you could easily accumulate various types of lag:

With async replication, send lag was the difference between what was committed on the primary and what was actually received on the secondary. This represented data that would be lost on a takeover.
With all forms of streaming replication, replay lag was the difference between what’s been committed on the primary and what was actually been applied to the in-memory database on the secondary. This represented work that would have to be completed before a takeover could complete.
Even without replication, checkpoint lag was the difference between what was applied to the in-memory database and what was written to the database files. This represented work that would have to be re-done if the database restarted. This applied to both primaries and secondaries, and depending on how you implement takeover, this could also represent work that has to be completed before a takeover can complete.

With synchronous replication, not only would a takeover with any meaningful amount of replay lag or checkpoint lag result in extended downtime, but any kind of transient blip on the secondary or the network (with a meaningful amount of replay lag) could cause extended downtime, too.

Several challenges made these problems particularly painful for us:

We had a high write volume workload. This is bad enough, but it necessitated a lot of vacuuming, which in turn triggered more write volume.
We had a 24/7/365 workload, which left no time for any of these processes that fell behind to catch up.
We were using version 9.2.4 through most of this.
We configured the filesystem record size very badly. This was the initial trigger for many of these other problems.
We implemented automated failover not with trigger files, but by reconfiguring PostgreSQL and restarting it. This made checkpoint lag on secondaries far worse than it needed to be.

Some of these were hard to avoid. When we built the system, trigger files were not an option. And updating PostgreSQL across major versions required downtime, especially when using streaming replication (since the protocol is not stable across major versions).

In the end, the manifestation of all these problems was that we had tons of outages that boiled down to:

An actual failure triggered a takeover, or
A transient blip like a restart of either primary or secondary or a network connection that failed transiently

Followed by hours or days of downtime because:

there had been a lot of of replay lag, or
there had been a lot of checkpoint lag

and that lag was generally caused by some combination of:

a very heavy write workload,
a vacuum operation that further amped up the write workload, and/or
the mismatched database/filesystem recordsize,

After all this, we were surprised by several things about PostgreSQL:

How much the availability of the system depended implicitly on the performance of just two single-threaded PostgreSQL processes: when these processes slowed down, replay lag soared and we became one glitch away from a major outage.
How quietly things flew off the rails: an arbitrary amount of replay lag could accumulate without any log messages or any effect on query performance. (One could – and should – monitor replay lag, but the fact that the documentation itself was confused about how to do it reflects how easy it was to get this wrong.)
How suddenly and catastrophically it all came crashing down: when a failover finally did happen, the system went from up-and-running at full speed to down-for-writes for hours or even days.

In retrospect, I’d argue that the default synchronous_commit behavior could best be called synchronous for durability, not availability.

It was a rough few years as we scaled up Manta to database sizes and write volumes that were orders of magnitude larger than we’d done before. We ran into a lot of issues – some of our own making, and some appearing to result from deficiencies in PostgreSQL. Hours waiting for database rebuilds and lag debt repayments have afforded me plenty of time to think about the choices we made.

The best thing about our experience with PostgreSQL? I said at the beginning that we chose PostgreSQL first of all for its reputation for data integrity. To my knowledge, it never lost production data for us. (There were a handful of terrifying cases where a database appeared to be lost because of a problem during WAL application, but these have been few and far between, and we were generally able to recover.)

The worst thing about our experience with PostgreSQL? The number of ways it accumulated “debt”, where everything appeared fine until a balloon payment came due and things were suddenly not fine. We had known for a while that plain old autovacuum was one example of this. We saw this again in 2015, when a transaction wraparound autovacuum triggered a 10-hour outage of our public Manta service. After four years in production, we learned about the several ways this can happen with replay lag and checkpoint lag. In all of these cases, we would have gladly accepted more latency to deal with these problems synchronously with steadier, more predictable performance.

PostgreSQL was (and presumably remains) a solid piece of software. But it’s also from an era that expected a lot of hand-on attention from experienced human operators to keep things running smoothly. We didn’t appreciate the degree to which operating a reliable PostgreSQL database (at least of this vintage) at scale required hand-on attention. Some folks will say this is obvious; others will bristle at that characterization. People will point out that they’ve run databases without trouble for years, despite not paying attention to them. But my point is those people were getting lucky to some degree because there was nothing structural in PostgreSQL to ensure that the WAL replay process was keeping up with incoming logs, or that the checkpointer was keeping up with dirty data, or that vacuum was keeping up with dead tuples. You’re expected to know about these things ahead of time and configure them carefully, keeping in mind that proper configuration is both workload-dependent and system-dependent. Then you must monitor them carefully, too, so that you can intervene early. I’m sure this works, but it feels both operationally expensive and brittle. In the end, we pushed the database beyond its limits, and when that happened, the failure modes were non-linear and disastrous. When I went to investigate CockroachDB in 2020, I was excited to see it built with modern techniques like circuit breakers so that when the system is overloaded, it seeks to fail quickly, transparently, and predictably. (That said, we haven’t yet pushed CockroachDB nearly as hard at Oxide as we did PostgreSQL in those years.)

Every system has problems. Every system struggles when pushed to the edge of its operational abilities. In fact, in every major system I’ve worked on, when we pushed some well-tested component to new limits, we found new, serious issues that had been lurking for years. Perhaps the most important lesson, which I’m embarassed to have had to learn more than once, is how critical it is to push components to their breaking points before the production environment does that for you. All systems fail under enough load; the best we can hope for is that they fail crisply and raise clear alarms before things become critical.

Footnote

¹ In this post, I’ve linked to the PostgreSQL 9.6 documentation because that was current when we were doing the work that led to this post.

² The only approach covered in the documentation that does not require a second copy is “shared disk failover”. But that only allows you to survive very limited types of failure affecting only one server – certainly not a failure of an entire rack or the shared storage subsystem (and the complexity of those subsystems does not always result in improved availability anyway).

³ There was a somewhat surreal moment for me when I attended a talk about replication from a community expert at a San Francisco PostgreSQL conference. By this point, I’d spent 1-2 years almost completely occupied with these replication availability issues. So I asked about the single-threaded WAL applier during the Q&A. The speaker, not realizing I was from Joyent, referred me to the work we had done and mentioned pg_prefaulter by name.

⁴ Actually, because these things took us by surprise, we had come to call replay lag “secret lag”. We found checkpoint lag on the secondary later and came to call that “double secret lag”.

← → Top