Tune PostgreSQL Background Writer with pg_stat_bgwriter

Tune PostgreSQL Background Writer with pg_stat_bgwriter

Where pg_stat_statements does query-level aggregation and pgBadger does after-the-fact log analysis, pg_stat_bgwriter operates at the buffer-cache layer — answering whether the bgwriter and checkpointer are keeping shared_buffers clean enough that foreground queries never have to pay for a write inline. On a healthy OLTP cluster the answer is almost always yes; on a struggling one the counters say which knob to turn first.

Purpose and Overview

When a PostgreSQL backend needs a clean buffer in shared_buffers and none is available, something has to flush a dirty page back to disk before the backend can continue. Three subsystems share that workload. The checkpointer flushes every dirty buffer at checkpoint time, spreading the writes over checkpoint_completion_target × checkpoint_timeout to avoid I/O spikes. The background writer ("bgwriter") runs continuously, sweeping a small window of the LRU list each bgwriter_delay and writing out dirty buffers to keep a runway of clean buffers available. And when neither of those has kept up, the user-facing backend writes the buffer itself — inline, on the critical path of the query.

The third case is the expensive one. A high share of buffers_backend relative to buffers_checkpoint and buffers_clean means foreground queries are doing the bgwriter's job, paying latency for I/O that the background processes should have absorbed.

pg_stat_bgwriter exposes the counters that distinguish the three writers, plus checkpoint timing and request counts — the data you need to size bgwriter_lru_maxpages, bgwriter_delay, checkpoint_timeout, and max_wal_size against your actual workload rather than against generic defaults.

Sample Code

 1-- PostgreSQL 9.2 through 16
 2SELECT
 3    checkpoints_timed,
 4    checkpoints_req,
 5    checkpoint_write_time,
 6    checkpoint_sync_time,
 7    buffers_checkpoint,
 8    buffers_clean,
 9    buffers_backend,
10    buffers_backend_fsync,
11    buffers_alloc,
12    maxwritten_clean,
13    ROUND(
14        100.0 * buffers_backend /
15        GREATEST(
16            buffers_checkpoint + buffers_clean + buffers_backend,
17            1
18        ),
19        2
20    )                                AS pct_backend_flushed,
21    stats_reset
22FROM pg_stat_bgwriter;

Notes: Available on PostgreSQL versions prior to 17. PostgreSQL 17 split this view in two: pg_stat_bgwriter now holds only the bgwriter counters, and the checkpointer counters moved to pg_stat_checkpointer. On 17+ the query above will fail on the checkpoint columns; see the Version Compatibility section for the split variant.

Code Breakdown

The query is a single-row SELECT against cumulative counters. The interesting work is in interpreting the ratios, not the columns themselves.

The Three Buffer Writers

buffers_checkpoint, buffers_clean, and buffers_backend are the count of dirty buffers each subsystem has flushed since the last counter reset. Together they decompose total write activity into "scheduled" (checkpointer), "background" (bgwriter), and "synchronous" (backend) shares. The ratio between them — not the absolute values — is the tuning signal.

Checkpoint Counters

checkpoints_timed counts checkpoints triggered by checkpoint_timeout — the healthy majority. checkpoints_req counts checkpoints triggered by hitting max_wal_size. A high checkpoints_req share means WAL is filling faster than checkpoints are spreading their I/O; the response is to raise max_wal_size. checkpoint_write_time and checkpoint_sync_time are total milliseconds spent in the write and sync phases respectively.

bgwriter Counters

buffers_clean is the bgwriter's flush count. maxwritten_clean is the number of times the bgwriter stopped a single sweep early because it hit bgwriter_lru_maxpages — the per-round cap. A non-zero (and growing) maxwritten_clean indicates the bgwriter is at its per-round ceiling and would benefit from raising bgwriter_lru_maxpages.

Backend Counters

buffers_backend is the foreground-flushed count — the one to minimize. buffers_backend_fsync is the count of backends that had to fsync themselves because the fsync request queue was full; this should be zero in steady state. A non-zero value is a saturated-I/O signal.

The pct_backend_flushed Derived Column

pct_backend_flushed is buffers_backend / (buffers_checkpoint + buffers_clean + buffers_backend) expressed as a percentage, with a GREATEST(..., 1) guard against division by zero on a freshly-reset cluster. Values above 15-20% on a write-heavy OLTP cluster are the classic "tune bgwriter" trigger.

Key Buffer-Flushing Counters

buffers_checkpoint and Checkpoint I/O Spreading

The checkpointer flushes every dirty buffer at checkpoint time. checkpoint_completion_target (default 0.9) controls how much of checkpoint_timeout the writes are spread over — higher values smooth the I/O, lower values complete faster but cause spikier writes.

buffers_clean and the bgwriter Sweep

The bgwriter wakes every bgwriter_delay (default 200ms), looks at the LRU list, and writes up to bgwriter_lru_maxpages (default 100) dirty buffers per pass. Raising either knob increases the bgwriter's throughput; lowering them throttles it.

buffers_backend and Foreground I/O

Every flush counted here is one a query had to do itself. A small share is normal; a large share is the unmistakable signal that the bgwriter is undersized or the checkpointer is too aggressively spread.

buffers_backend_fsync and Saturated fsync Queues

This counter increments only when the fsync request queue to the checkpointer is full and a backend has to fsync its own write. Persistently non-zero values point at I/O subsystem saturation or undersized wal_buffers.

maxwritten_clean as a bgwriter Ceiling Signal

A maxwritten_clean count rising in step with buffers_clean says the bgwriter is consistently hitting its per-round cap. Raising bgwriter_lru_maxpages lets it write more per sweep and typically lowers pct_backend_flushed as a result.

Practical Applications

Tuning bgwriter_lru_maxpages and bgwriter_delay

If pct_backend_flushed exceeds 15% and maxwritten_clean is growing, the bgwriter is the bottleneck. Common production values: bgwriter_lru_maxpages = 500 (up from 100), bgwriter_delay = 50ms (down from 200ms). Apply one change at a time, reset the counters with SELECT pg_stat_reset_shared('bgwriter'), and re-measure over a representative window.

Diagnosing Checkpoint Storms from checkpoints_req

checkpoints_req substantially exceeding checkpoints_timed is the storm signal: WAL is filling faster than the checkpoint spread accommodates, every checkpoint compresses into a short I/O burst, and buffers_backend rises as foreground queries pick up the slack between bursts. Raising max_wal_size (and accepting longer crash recovery as the cost) restores spread-checkpoint behavior.

Sizing max_wal_size on Write-Heavy OLTP

Compute the cluster's WAL generation rate by sampling pg_current_wal_lsn() over a 5-minute window during peak load. max_wal_size should be at least 1× to 2× the peak rate times checkpoint_timeout — large enough that checkpoints_req stays near zero.

Resetting bgwriter Counters for A/B Tuning

pg_stat_reset_shared('bgwriter') (and 'checkpointer' on 17+) zeroes the counters without touching live activity. The A/B pattern: capture current ratios, change one setting, reset counters, wait one workload cycle, re-capture, compare. The counters in this view are the only inexpensive way to A/B-test bgwriter tuning short of pgbench.

 1-- Average rate since last reset
 2SELECT
 3    ROUND(buffers_checkpoint /
 4          GREATEST(EXTRACT(EPOCH FROM NOW() - stats_reset), 1),
 5          2)                         AS checkpoint_buf_per_sec,
 6    ROUND(buffers_clean /
 7          GREATEST(EXTRACT(EPOCH FROM NOW() - stats_reset), 1),
 8          2)                         AS bgwriter_buf_per_sec,
 9    ROUND(buffers_backend /
10          GREATEST(EXTRACT(EPOCH FROM NOW() - stats_reset), 1),
11          2)                         AS backend_buf_per_sec
12FROM pg_stat_bgwriter;

Version Compatibility

pg_stat_bgwriter has existed since PostgreSQL 8.3, with columns added across versions. buffers_backend_fsync arrived in 9.1; checkpoint_write_time and checkpoint_sync_time in 9.2.

PostgreSQL 17 introduced the major split. The view now contains only bgwriter-specific counters (buffers_clean, maxwritten_clean, buffers_alloc, stats_reset). The checkpointer counters moved to a new view, pg_stat_checkpointer, with renamed columns: num_timed (was checkpoints_timed), num_requested (was checkpoints_req), write_time, sync_time, buffers_written (was buffers_checkpoint). Migrate any dashboard or alert reading pg_stat_bgwriter.checkpoints_* before upgrading to 17.

 1-- PostgreSQL 17+ split variant
 2SELECT
 3    b.buffers_clean,
 4    b.maxwritten_clean,
 5    b.buffers_alloc,
 6    c.num_timed                      AS checkpoints_timed,
 7    c.num_requested                  AS checkpoints_req,
 8    c.write_time                     AS checkpoint_write_time,
 9    c.sync_time                      AS checkpoint_sync_time,
10    c.buffers_written                AS buffers_checkpoint
11FROM pg_stat_bgwriter   b
12CROSS JOIN pg_stat_checkpointer c;

pg_stat_reset_shared accepts 'bgwriter' on all supported versions and additionally 'checkpointer' on 17+ — reset them independently to A/B-test bgwriter tuning without losing checkpointer baselines.

Best Practices

  • Watch pct_backend_flushed as the primary tuning indicator — below 5% is excellent, 5-15% is fine, above 15% on write-heavy OLTP says tune the bgwriter
  • Raise max_wal_size when checkpoints_req exceeds checkpoints_timed — accept longer crash recovery as the cost of spread checkpoint I/O
  • Change one setting at a time and reset counters betweenbgwriter_lru_maxpages, bgwriter_delay, max_wal_size, checkpoint_completion_target all interact
  • Plan the PostgreSQL 17 dashboard migration before the upgradepg_stat_bgwriter.checkpoints_* no longer exists on 17
  • Treat buffers_backend_fsync > 0 as an I/O alarm — it means the fsync queue saturated, which is rarely the right operating mode
  • Sample twice and subtract for instantaneous rates — the raw counters are cumulative since reset; deltas are what dashboards should plot

References

Posts in this series