Monitor PostgreSQL Replication Lag with pg_stat_replication
Monitor PostgreSQL Replication Lag with pg_stat_replication
A standby lagging more than a few seconds is rarely a transient hiccup — it is a query backlog on the standby, a network ceiling between the nodes, a misconfigured hot_standby_feedback, or a single-threaded apply storm. pg_stat_replication joined to pg_wal_lsn_diff() exposes each replica's byte lag and time lag in one row, on the primary, no SSH required.
Purpose and Overview
PostgreSQL streaming replication is asynchronous by default. The primary commits a transaction, writes the WAL record locally, the WAL sender forwards it to the standby, the standby writes it to its own WAL, flushes the WAL to disk, and finally replays it into the data files. Each stage runs on a separate thread of execution, and each stage has its own latency. Under load they diverge: write may be near-instant, flush a few milliseconds, and replay seconds or minutes behind on a busy standby.
pg_stat_replication exposes one row per connected standby, with the LSNs (log sequence numbers) tracking each stage — sent_lsn, write_lsn, flush_lsn, replay_lsn — plus the wall-clock intervals from primary commit until each acknowledgment. Subtracting the standby's replay_lsn from the primary's pg_current_wal_lsn() gives byte lag in WAL bytes; the pg_wal_lsn_diff function makes the subtraction safe across LSN boundaries.
Lag is operationally significant for three reasons: read-after-write queries against a lagging standby return stale data; synchronous-commit replicas that lag block primary commits; and failover to a lagging standby loses recent transactions. Measuring all three from a single query on the primary is what this view is for.
Sample Code
1SELECT
2 pid,
3 application_name,
4 client_addr,
5 state,
6 sync_state,
7 pg_wal_lsn_diff(
8 pg_current_wal_lsn(),
9 replay_lsn
10 ) AS replay_lag_bytes,
11 write_lag,
12 flush_lag,
13 replay_lag,
14 backend_start,
15 NOW() - backend_start AS connection_age
16FROM pg_stat_replication
17ORDER BY replay_lag_bytes DESC NULLS LAST;
Notes: Requires PostgreSQL 10+ for the write_lag / flush_lag / replay_lag interval columns; the LSN columns and pg_wal_lsn_diff work on 9.4+ (replace _wal_ with _xlog_ on pre-10 servers). Run on the primary — pg_stat_replication is empty on a standby. Requires pg_monitor or superuser to see all rows.
Code Breakdown
The query is a per-standby snapshot. Each column either identifies the standby, classifies its replication mode, or quantifies a specific stage of lag.
pg_stat_replication on the Primary
The view is populated by WAL-sender processes on the primary — one row per active standby connection. On a standby the view is empty, which is also a useful signal: a primary with no rows here has no standbys connected, regardless of what pg_hba.conf and primary_conninfo suggest is configured.
pg_wal_lsn_diff for Byte Lag
pg_wal_lsn_diff(a, b) returns the byte distance between two LSNs as a numeric. Positive when a is ahead of b. The standard byte-lag formula is pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) — bytes of WAL produced on the primary that the standby has not yet applied.
state and sync_state Columns
state is the WAL sender's status: startup during initial handshake, catchup while a recently-disconnected standby is streaming the backlog, streaming during normal operation, backup during a pg_basebackup run. A streaming row is the healthy steady state; a catchup row is temporary and expected after reconnection or for a brand-new replica.
sync_state is the synchronous-replication classification: async (no commit gating), sync (commits gate on this standby's flush), potential (next in line if a sync replica disconnects), quorum (one of a quorum-N group).
The Three Interval-Typed Lag Columns
write_lag, flush_lag, and replay_lag are wall-clock intervals between the primary's commit and the corresponding standby acknowledgment. Under steady load, expect write_lag < flush_lag < replay_lag; under unusual load patterns (synchronous_commit storms, parallel-bulk loads), the ordering can invert briefly.
Key Replication LSN Stages
sent_lsn (Primary Has Sent)
The LSN the WAL sender has handed off to the network socket. If sent_lsn lags pg_current_wal_lsn() significantly, the WAL sender on the primary is the bottleneck — usually CPU-bound encryption or compression, or the primary's network output buffer is saturating.
write_lsn (Standby Has Received and Written)
The LSN the standby has written to its own WAL buffers. Lag between sent_lsn and write_lsn indicates network latency or standby write-side saturation.
flush_lsn (Standby Has fsync'd)
The LSN durably persisted on the standby's disk. Synchronous replication with synchronous_commit = on blocks the primary commit until this LSN advances past the transaction's commit LSN. Standby disk I/O is the typical bottleneck here.
replay_lsn (Standby Has Applied)
The LSN actually visible to queries on the standby. Replay is single-threaded per database, so a heavy write workload on the primary can outpace replay even when the standby is idle of client queries. This is the LSN that matters for "is the read consistent" questions.
Practical Applications
Detecting a Standby Falling Behind in catchup State
A row with state = 'catchup' and replay_lag_bytes shrinking over successive query runs is a healthy reconnect. A row with state = 'catchup' and lag growing means the primary is outpacing what the standby can apply — escalate via larger wal_buffers, faster standby I/O, or synchronous_commit = off on the primary side until the standby catches up.
Synchronous Replication Health Checks
For a sync_state = 'sync' standby, monitor flush_lag rather than replay_lag — flush is what gates primary commits, and a high flush_lag with low replay_lag means the standby is acknowledging fsyncs but falling behind in apply. Both matter, but they signal different problems.
Logical Replication Subscriber Lag Verification
Logical replication subscribers also appear in pg_stat_replication with application_name matching the subscription name. Pair with the standby-side pg_stat_subscription and pg_stat_subscription_stats (PostgreSQL 15+) for apply errors. Divergence between the two views — primary says lag is bounded, subscriber reports apply errors — points at conflicts the apply worker is silently retrying.
1-- Run on the STANDBY
2SELECT
3 pg_is_in_recovery() AS in_recovery,
4 pg_last_wal_receive_lsn() AS last_received,
5 pg_last_wal_replay_lsn() AS last_replayed,
6 pg_last_xact_replay_timestamp() AS last_xact_replayed,
7 NOW() - pg_last_xact_replay_timestamp() AS time_lag;
Sizing WAL Retention with Replication Slots
If a standby disconnects long enough for the primary to recycle WAL it needs, the standby cannot resume from streaming and must be re-bootstrapped from a fresh base backup. pg_replication_slots lets the primary reserve WAL for a specific standby; wal_keep_size (PostgreSQL 13+) reserves unconditionally. Pair pg_stat_replication with pg_replication_slots to confirm both ends agree on what's being retained.
Version Compatibility
pg_stat_replication itself has existed since streaming replication landed in PostgreSQL 9.0, but the modern column set materialized over several releases. The write_lag, flush_lag, and replay_lag interval columns were added in PostgreSQL 10. The LSN columns and pg_wal_lsn_diff function work on PostgreSQL 9.4+; on 9.4 through 9.6 the function and the columns are named with xlog instead of wal (e.g. pg_current_xlog_location(), pg_xlog_location_diff).
pg_replication_slots was added in 9.4 alongside the physical replication slot feature; logical replication slots followed in 9.4 as well. wal_keep_size (replacing the integer wal_keep_segments) is PostgreSQL 13+ and accepts a human-readable size like '1GB'.
hot_standby_feedback interacts with replication: enabling it makes the standby send its oldest snapshot back to the primary so the primary holds off on vacuuming tuples the standby still needs. Without it, long-running queries on the standby can be cancelled by replay; with it, those queries can cause bloat on the primary. The trade-off is workload-specific.
Best Practices
- Always query on the primary —
pg_stat_replicationon a standby returns no rows; only the primary owns the WAL-sender process records - Pair byte lag with time lag — bytes scale with WAL volume, time scales with apply speed; both matter
- Use replication slots for critical standbys — guarantees the primary retains the WAL the standby needs, at the cost of unbounded
pg_walgrowth if the standby disconnects forever - Alert on growing lag, not threshold crossings — a standby that has been 500 MB behind for an hour is more concerning than a brief 2 GB spike during a bulk load
- Monitor
pg_stat_subscriptionon logical subscribers — apply errors there are invisible topg_stat_replicationon the publisher - Audit
sync_stateperiodically — a config drift that flips asyncstandby toasyncis silent unless you watch for it
References
- PostgreSQL: pg_stat_replication — official reference for the per-standby replication statistics view
- PostgreSQL: WAL Functions — reference for
pg_current_wal_lsn,pg_wal_lsn_diff, and related LSN functions - PostgreSQL: Streaming Replication — conceptual guide to streaming replication architecture
- PostgreSQL: Replication Slots — reference for physical and logical replication slots
- pganalyze: Reducing replication lag with maintenance_io_concurrency in Postgres 15 — operational tactic for shrinking replay lag via standby-side I/O prefetch
- HariSekhon/SQL-scripts: postgres_recovery.sql — companion script exposing the standby-side LSN functions used in the time-lag query above
Posts in this series
- How Many Connections Can Your PostgreSQL Database Handle?
- PostgreSQL Backend Connections via pg_stat_database
- Identifying Blocking PostgreSQL Queries using pg_stat_activity
- List PostgreSQL Databases by Size with Access Check
- Assess PostgreSQL Database Sizes Quickly and Easily
- Unveiling Your PostgreSQL Server - A Diagnostic Powerhouse
- Keep Your PostgreSQL Database Clean, Identify Idle Connections
- Query the PostgreSQL Configuration
- PostgreSQL Recovery Monitoring: Essential SQL Insights
- Restart All PostgreSQL Sequences with ALTER SEQUENCE
- Monitor Running Queries in PostgreSQL using pg_stat_activity
- Monitor PostgreSQL Active Sessions with pg_stat_activity
- PostgreSQL Error Handling Settings via pg_settings
- PostgreSQL File Location Settings Query via pg_settings
- PostgreSQL Lock Management Settings via pg_settings
- PostgreSQL Logging Configuration Query via pg_settings
- Monitor PostgreSQL Memory Settings with pg_settings
- PostgreSQL Table Row Count Estimates with SQL
- List PostgreSQL Tables by Size with SQL
- PostgreSQL WAL Settings Query Guide
- PostgreSQL SSL Settings Query Guide
- PostgreSQL Statistics Settings Query Guide
- PostgreSQL Resource Settings Query Guide
- PostgreSQL Replication Settings Query Guide
- PostgreSQL Query Planning Settings Query Guide
- PostgreSQL Preset Options Settings Query Guide
- PostgreSQL Miscellaneous Settings Query Guide
- Count PostgreSQL Sessions by State with SQL
- Kill Idle PostgreSQL Sessions with SQL
- Grant SELECT on All Tables in PostgreSQL
- Identify Insert-Only Tables in PostgreSQL
- Detect Soft Delete Patterns in PostgreSQL
- List PostgreSQL Object Comments with SQL
- List Foreign Key Constraints in PostgreSQL
- List PostgreSQL Enum Types and Their Values with SQL
- List All Views in a PostgreSQL Database with SQL
- Find PostgreSQL Tables Without a Primary Key
- List PostgreSQL Partitioned Tables with SQL
- List All Schemas in Your PostgreSQL Database
- PostgreSQL Database Statistics with pg_stat_database
- List PostgreSQL Roles and Their Privileges
- Scrubbing Email PII in PostgreSQL for GDPR Compliance
- List Installed Extensions in PostgreSQL
- List Collations in Your PostgreSQL Database
- PostgreSQL Replica Identity for Logical Replication
- Monitor PostgreSQL Vacuum Progress with pg_stat_progress_vacuum
- Monitor PostgreSQL Wait Events Using pg_stat_activity
- Monitor PostgreSQL Replication Lag with pg_stat_replication
- List PostgreSQL Wait Events with the pg_wait_events View