PostgreSQL Replication Lag in Managed Databases: A Practical Guide
Learn how PostgreSQL replication lag happens, how to read the right metrics, and how to design safer read routing, alerts, and failover runbooks for managed databases.
Replication lag is one of the easiest PostgreSQL metrics to misread. A replica can be connected, healthy, and still seconds or minutes behind the primary. That delay may be harmless for analytics, but it can break user flows that expect a read-after-write result, leave recent writes missing from a replica that is being considered for failover, or make a dashboard look wrong during a busy write burst.
The useful question is not whether lag is always bad. It is whether the application depends on fresh reads, how much lag the workload can tolerate, and whether the bottleneck is WAL generation, network transfer, disk flush, or replay on the standby. Managed PostgreSQL removes much of the server maintenance, but it does not remove the need to design reads, alerts, and failover expectations around replication behavior.
What replication lag actually measures
PostgreSQL physical streaming replication ships write-ahead log records from a primary server to one or more standby servers. The standby receives WAL, writes it, flushes it to durable storage, and replays it into its own data files. Lag can happen at any of those stages. A single “replica lag” number is useful for alerting, but it can hide which stage is slow.
PostgreSQL exposes this pipeline in pg_stat_replication on the primary. The sent_lsn, write_lsn, flush_lsn, and replay_lsn columns describe how far a standby has progressed through the WAL stream. The same view also exposes write_lag, flush_lag, and replay_lag intervals for recent WAL, but those values should be read as recent communication delay rather than a complete application freshness contract. For byte-based checks, teams often compare the current WAL location on the primary with the standby's replay location.
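The byte-based comparison can be sketched in a few lines. This is a minimal illustration, not a monitoring agent: the helper names lsn_to_int and lag_bytes are made up for this example, and the SQL string assumes PostgreSQL 10 or later, where pg_current_wal_lsn() and pg_wal_lsn_diff() are available.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '0/3000060' to an absolute byte position.

    An LSN is printed as two hexadecimal numbers: the high 32 bits and
    the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby has not yet reached (never negative)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(standby_lsn))


# The same comparison done server-side, run on the primary.
REPLAY_LAG_SQL = """
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;
"""

if __name__ == "__main__":
    # The standby has replayed up to 0/3000000; the primary is 96 bytes ahead.
    print(lag_bytes("0/3000060", "0/3000000"))
```

Doing the subtraction in SQL is usually simpler. The Python version is mainly useful when LSNs arrive as strings from two different connections.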
On a standby, functions such as pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn() show how much WAL has been received and replayed. PostgreSQL's warm standby documentation calls out the practical interpretation: a large gap between the primary's current WAL position and sent_lsn can mean primary pressure, a gap between sent and received locations can point to network or standby load, and a gap between received and replayed locations means WAL is arriving faster than the standby can apply it.
Why lag appears in managed PostgreSQL
Managed PostgreSQL usually gives teams replicas without making them run base backups, WAL archiving, restart scripts, and monitoring agents by hand. The underlying mechanics still matter. Replication consumes CPU, disk I/O, network bandwidth, and storage on both sides. A replica that is perfect during normal traffic can fall behind when a migration rewrites a large table, a batch job updates millions of rows, or an incident causes many retries.
Read traffic can also create a tradeoff. Hot standby replicas are attractive for reporting and read scaling, but long-running queries on the standby can conflict with WAL replay. PostgreSQL has to choose between keeping the query alive and applying WAL promptly. Settings such as max_standby_streaming_delay define how long a standby may wait before canceling conflicting queries, while hot_standby_feedback can reduce query cancels by telling the primary about standby snapshots. The cost is that cleanup on the primary can be delayed, which may contribute to table bloat in some workloads.
Replication slots solve a different problem. They prevent the primary from removing WAL that a standby still needs, which makes catch-up safer after a disconnect. They also create an operational risk: if a replica stays disconnected or cannot consume WAL fast enough, retained WAL can grow until storage becomes a problem. PostgreSQL provides max_slot_wal_keep_size to limit how much WAL a slot can retain. The limit protects primary storage at the replica's expense: a slot that falls further behind than the limit can lose the WAL it needs, and the standby may then have to be rebuilt. The right setting depends on the recovery objective and storage budget.
| Lag stage | What it usually means | Common cause | Operational response |
|---|---|---|---|
| Primary current WAL is far ahead of sent_lsn | The sender cannot ship as fast as WAL is produced | Write burst, primary CPU or I/O pressure, saturated WAL sender | Reduce write amplification, schedule bulk jobs, inspect primary load |
| sent_lsn is far ahead of standby receive location | WAL is leaving the primary but not arriving quickly enough | Network issue, standby overload, interrupted receiver | Check connectivity, standby resources, and WAL receiver status |
| Received WAL is far ahead of replayed WAL | The standby has data but cannot apply it fast enough | Slow replay I/O, conflicting standby queries, large transaction replay | Identify replay blockers and long-running standby queries |
| Replication slot retained WAL grows | The primary is keeping WAL for a lagging consumer | Disconnected replica or stalled consumer | Alert on retained bytes and fix or drop abandoned slots carefully |
| Lag jumps during DDL or bulk updates | The standby must replay a large WAL burst | Table rewrites, index builds, high-volume backfills | Run changes in smaller batches or during lower traffic windows |
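The first three rows of the table can be turned into a small classifier. The sketch below is one possible reading, not a standard tool: WalPositions, stage_gaps, and slowest_stage are hypothetical names, and the 16 MiB threshold is an arbitrary assumption (roughly one default-sized WAL segment) that should be tuned per workload.

```python
from dataclasses import dataclass
from typing import Dict, Optional


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass
class WalPositions:
    current: str   # pg_current_wal_lsn() on the primary
    sent: str      # sent_lsn from pg_stat_replication
    received: str  # pg_last_wal_receive_lsn() on the standby
    replayed: str  # replay_lsn, or pg_last_wal_replay_lsn() on the standby


def stage_gaps(p: WalPositions) -> Dict[str, int]:
    """Byte gap at each stage of the replication pipeline."""
    cur, sent, recv, rep = (
        lsn_to_int(x) for x in (p.current, p.sent, p.received, p.replayed)
    )
    return {
        "primary_send": cur - sent,  # primary produces WAL faster than it ships
        "transfer": sent - recv,     # WAL sent but not yet received
        "replay": recv - rep,        # WAL received but not yet applied
    }


def slowest_stage(p: WalPositions,
                  threshold_bytes: int = 16 * 1024 * 1024) -> Optional[str]:
    """Name the stage with the largest gap, or None if all gaps are small."""
    stage, gap = max(stage_gaps(p).items(), key=lambda kv: kv[1])
    return stage if gap >= threshold_bytes else None
```

Positions sampled from two servers are never captured at exactly the same instant, so small or slightly negative gaps are noise. Only a gap that stays large across several samples points at a stage.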
Freshness is an application design choice
Replica lag becomes a product bug when the application silently sends freshness-sensitive reads to a stale standby. A common example is account creation: a request writes a user row on the primary, then the next page reads from a replica and says the account does not exist. The database is working as designed; the routing policy is wrong for the flow.
Treat reads as having different consistency needs. User-facing read-after-write paths should usually read from the primary for a short window after a write, use a session flag that pins the user to primary reads, or wait until the replica has replayed at least the WAL position returned by the write. Background reports, dashboards, search indexing, and internal analytics can often tolerate a documented delay. The important part is to make the policy explicit rather than letting every repository method choose a connection by accident.
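One way to make the policy explicit is a small router that every read goes through. The sketch below covers only the "primary for a short window after a write" option. ReadRouter, its method names, and the five-second window are assumptions for illustration, not part of any library.

```python
import time
from typing import Callable, Dict


class ReadRouter:
    """Pins a session to the primary for a short window after it writes."""

    def __init__(self, pin_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._pin_seconds = pin_seconds
        self._clock = clock  # injectable so the policy can be tested
        self._last_write: Dict[str, float] = {}

    def record_write(self, session_id: str) -> None:
        """Call after a write for this session commits on the primary."""
        self._last_write[session_id] = self._clock()

    def target_for_read(self, session_id: str,
                        freshness_sensitive: bool = True) -> str:
        """Return 'primary' or 'replica' for the next read."""
        if not freshness_sensitive:
            # Reports and dashboards accept a documented delay.
            return "replica"
        last = self._last_write.get(session_id)
        if last is not None and self._clock() - last < self._pin_seconds:
            return "primary"
        return "replica"


if __name__ == "__main__":
    router = ReadRouter(pin_seconds=5.0)
    router.record_write("session-42")
    print(router.target_for_read("session-42"))         # inside the window
    print(router.target_for_read("session-42", False))  # tolerant read
```

A fixed window only helps while replica lag stays shorter than the window. If lag can exceed it, the window should be driven by measured lag, or the read should wait for the replica to reach the write's WAL position.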
For SaaS products, tenant boundaries add another wrinkle. A large tenant can generate enough WAL to make smaller tenants see stale replica reads even though they did nothing unusual. If the product promises near-real-time dashboards, lag budgets should be set around the user experience, not only the average database metric. A replica that is ten seconds behind may be acceptable for weekly reporting and unacceptable for a checkout confirmation.
Alert on impact, not only on a single threshold
A good lag alert has context. Alerting only on “replica lag greater than 30 seconds” can create noise during planned maintenance and miss risk when storage is filling because an inactive slot is retaining WAL. Use at least two kinds of signals: freshness signals that describe how stale a replica is, and capacity signals that describe whether retained WAL or replay pressure can harm the primary.
For an application replica, alert when lag exceeds the freshness budget for long enough to affect users. If a dashboard can tolerate two minutes of delay, a short spike to thirty seconds is not an incident. If login, billing, or support tooling reads from a replica, the acceptable budget may be much lower or the routing should avoid replicas entirely. Pair the time-based alert with an error-budget style question: which flows can return wrong or confusing results when this replica is stale?
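The "for long enough" condition can be expressed as a check over recent samples. This is a sketch under simple assumptions: samples arrive in time order as (timestamp, lag) pairs in seconds, and the function name sustained_breach is hypothetical.

```python
from typing import Iterable, Optional, Tuple


def sustained_breach(samples: Iterable[Tuple[float, float]],
                     budget_seconds: float,
                     hold_seconds: float) -> bool:
    """Return True if lag stayed above the budget for at least hold_seconds.

    samples: (timestamp_seconds, lag_seconds) pairs in ascending time order.
    A single sample back under the budget resets the breach window.
    """
    breach_start: Optional[float] = None
    for ts, lag in samples:
        if lag > budget_seconds:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None
    return False


if __name__ == "__main__":
    # Dashboard replica with a two-minute budget: a brief spike is ignored.
    spike = [(0, 5), (30, 30), (60, 5)]
    print(sustained_breach(spike, budget_seconds=120, hold_seconds=60))
```

Most alerting systems have a built-in "for" duration that does the same thing. The point of the sketch is that the budget and the hold time are two separate parameters, each set per replica role.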
For slots and WAL retention, alert on retained bytes and available storage. A disconnected standby can turn a replication problem into a primary storage problem because WAL cannot be recycled while it is still needed by the slot. This is one reason managed providers care about replica health even when the application does not currently use the replica for reads.
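A capacity check along these lines might look like the following. The query uses the pg_replication_slots view and pg_wal_lsn_diff(). The SlotRow type, the slots_at_risk function, and the 25% threshold are assumptions chosen for illustration.

```python
from typing import Iterable, List, NamedTuple, Optional

# Run on the primary; retained_bytes is NULL when a slot has no restart_lsn.
SLOT_RETENTION_SQL = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
"""


class SlotRow(NamedTuple):
    slot_name: str
    active: bool
    retained_bytes: Optional[int]


def slots_at_risk(rows: Iterable[SlotRow], free_bytes: int,
                  max_fraction: float = 0.25) -> List[str]:
    """Describe slots whose retained WAL exceeds a share of free storage."""
    limit = free_bytes * max_fraction
    flagged = []
    for row in rows:
        if row.retained_bytes is None:
            continue  # slot has never reserved WAL
        if row.retained_bytes > limit:
            state = "active" if row.active else "inactive"
            flagged.append(
                f"{row.slot_name} ({state}): {row.retained_bytes} bytes retained"
            )
    return flagged
```

Comparing retained bytes with free storage, rather than with a fixed number, keeps the alert meaningful as the volume fills for unrelated reasons.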
Practical runbook for a lag spike
Start by classifying the lag stage. On the primary, inspect pg_stat_replication and compare sent, write, flush, and replay positions. On the standby, compare receive and replay locations if you have access. If the primary is generating WAL faster than it can send, look for bulk writes, table rewrites, excessive retries, or an index operation. If WAL is received but not replayed, look for long standby queries and replay conflicts.
Next, protect the user experience. Temporarily route freshness-sensitive reads to the primary, pause nonessential reporting queries on the standby, or slow a batch job that is producing the WAL burst. Avoid promoting a lagging replica casually. Promotion turns the standby's replayed state into the new writable timeline, so any writes the old primary committed beyond that point will be missing from the new primary. In a failover plan, the question is not just “is the replica alive?” but “how far behind is the data that would become primary?”
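The "how far behind" question can be made concrete before anyone promotes anything. The sketch below assumes a per-replica lag in bytes is already available (for example from pg_wal_lsn_diff on the primary while it was still reachable). The function name and the loss budget are hypothetical.

```python
from typing import Dict, Tuple


def pick_promotion_candidate(replay_lag_bytes: Dict[str, int],
                             max_loss_bytes: int) -> Tuple[str, int, bool]:
    """Pick the least-lagged replica and check it against a loss budget.

    Returns (replica_name, bytes_behind, within_budget). The caller decides
    what to do when within_budget is False; this function never promotes.
    """
    if not replay_lag_bytes:
        raise ValueError("no replicas reported a lag value")
    name, behind = min(replay_lag_bytes.items(), key=lambda kv: kv[1])
    return name, behind, behind <= max_loss_bytes


if __name__ == "__main__":
    lags = {"replica_a": 5_000_000, "replica_b": 120_000}
    print(pick_promotion_candidate(lags, max_loss_bytes=1_000_000))
```

Bytes of WAL are only a proxy for lost transactions, so a runbook would normally pair this number with the last known replay timestamp before making the call.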
After the spike, fix the source rather than only raising thresholds. Large backfills often need batching, smaller transactions, and progress checkpoints. Reporting queries may need timeouts or a separate analytics path. If standby query conflicts are frequent, review whether hot_standby_feedback is worth the bloat tradeoff for that replica. If slots retain too much WAL, confirm that each slot maps to a real consumer and that abandoned slots are removed through an intentional runbook.
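Batching with a lag check between batches is one way to keep a backfill from producing a WAL burst. In this sketch, apply_batch and lag_bytes are caller-supplied callables (hypothetical here): the first would run one small transaction for an id range, and the second would return current replica lag in bytes.

```python
import time
from typing import Callable, Iterator, Tuple


def id_batches(min_id: int, max_id: int,
               batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) id ranges covering min_id..max_id."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1


def run_backfill(min_id: int, max_id: int, batch_size: int,
                 apply_batch: Callable[[int, int], None],
                 lag_bytes: Callable[[], int],
                 max_lag_bytes: int,
                 wait: Callable[[float], None] = time.sleep) -> int:
    """Apply batches in order, pausing while replica lag is over the limit.

    Returns the last id processed, which can be stored as a checkpoint so
    an interrupted run resumes from there instead of starting over.
    """
    last_done = min_id - 1
    for start, end in id_batches(min_id, max_id, batch_size):
        while lag_bytes() > max_lag_bytes:
            wait(1.0)  # let the standby catch up before writing more
        apply_batch(start, end)  # e.g. one UPDATE ... WHERE id BETWEEN start AND end
        last_done = end
    return last_done
```

Each batch should commit on its own. A single transaction wrapped around the whole loop would bring back the large replay burst that batching is meant to avoid.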
Managed PostgreSQL recommendations
In a managed environment, the provider should handle the mechanics of creating replicas, keeping WAL plumbing healthy, and exposing metrics. The application team still owns read routing, transaction size, schema-change timing, and incident decisions. Keep those responsibilities separate. Do not assume that because a database is managed, every read replica is safe for every read.
For small teams using ArmorDB, PgBouncer and managed backups solve different layers of the problem. PgBouncer helps control connection pressure, while backups protect recovery workflows. Replicas help with availability and read scaling, but they are not a substitute for tested backups or a guarantee of synchronous freshness. If connection pressure is part of the same incident, the PostgreSQL connection pool sizing guide is a useful companion. If the replica is part of a recovery strategy, pair this with the backup and restore strategy guide.
A healthy default is simple: send read-after-write traffic to the primary, reserve replicas for tolerant reads, alert on both freshness and retained WAL, and rehearse failover with realistic lag. That approach keeps replicas useful without turning eventual consistency into a surprise for users.
Takeaway
PostgreSQL replication lag is not one problem. It can be a write burst, a network gap, a slow standby, a conflicting read query, or a stalled replication slot. The best response starts by locating the stage, then matching the fix to the application impact.
Use replicas deliberately. They are excellent for read scaling, reporting, and availability planning when the freshness contract is clear. They are dangerous when an application silently treats a standby as if it were the primary. In managed PostgreSQL, the operational burden is lower, but the consistency decisions still belong in the application architecture and runbook.
Sources and further reading
- PostgreSQL documentation: monitoring statistics and pg_stat_replication
- PostgreSQL documentation: log-shipping standby servers and streaming replication
- PostgreSQL documentation: replication configuration
- PostgreSQL documentation: hot standby conflicts
- PostgreSQL documentation: WAL and recovery information functions
Topic: Deep Dives
Updated: Jun 29, 2026
Read time: 8 min read
ArmorDB Engineering writes about PostgreSQL operations, security, and infrastructure decisions for teams building production apps on ArmorDB.