Detect Soft Delete Patterns in PostgreSQL


Soft deletes are a common application pattern. Instead of removing a row with a DELETE statement, the application marks it as deleted by setting a column — typically deleted_at — to a non-null timestamp. The row stays in the table forever. The application simply filters it out with a WHERE deleted_at IS NULL clause.
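As a sketch (the users table and column values here are illustrative), the pattern replaces a hard delete with an update:

```sql
-- Hard delete: the row is gone and becomes a dead tuple.
DELETE FROM users WHERE id = 42;

-- Soft delete: the row remains, marked with a timestamp.
UPDATE users SET deleted_at = now() WHERE id = 42;

-- Every application query must then exclude marked rows.
SELECT * FROM users WHERE deleted_at IS NULL;
```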

This pattern is useful for audit trails, undo functionality, and referential integrity. It is also a significant source of problems for PostgreSQL DBAs who do not know it is in use.

When rows are deleted with DELETE, PostgreSQL marks them as dead tuples. Autovacuum reclaims that space. With soft deletes, those rows are never actually deleted. They accumulate silently. Tables grow large. Indexes balloon. Query performance degrades. And unless you know the pattern is there, the root cause is hard to diagnose.

The script covered in this article scans information_schema.columns to find every table in the database that contains a column named deleted_at. That is a simple but effective heuristic for identifying which tables use soft deletes.

SQL Script

SELECT
    table_schema,
    table_name,
    column_name,
    data_type
FROM
    information_schema.columns
WHERE
    column_name = 'deleted_at';

Notes: Run as any role with access to information_schema.columns. The query scans all schemas visible to the connected user. Adjust the WHERE clause to add other common soft-delete column names such as is_deleted or deleted.

Code Breakdown

SELECT Clause

The query returns four columns: table_schema, table_name, column_name, and data_type. The schema is included because the same table name can exist in multiple schemas. The data_type tells you whether deleted_at is a timestamp with time zone, timestamp without time zone, boolean, or something else. The data type reveals how the application implements the pattern.

FROM information_schema.columns

information_schema.columns is a standard SQL catalog view. It contains one row for every column in every table and view that the current role has access to. Materialized views are a PostgreSQL extension and do not appear in information_schema; to cover them, query the pg_catalog tables instead. The view spans all schemas unless you add a table_schema filter.
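For example, to exclude PostgreSQL's own system schemas (which never use the pattern), add a schema filter; the schema names here are the built-in ones:

```sql
SELECT table_schema, table_name, column_name, data_type
FROM information_schema.columns
WHERE column_name = 'deleted_at'
  AND table_schema NOT IN ('pg_catalog', 'information_schema');
```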

WHERE column_name = 'deleted_at'

This is the heuristic filter. deleted_at is the most common name for a soft-delete timestamp column, popularised by Rails ActiveRecord's paranoia gem and many other ORM-level soft-delete libraries. The comparison is an exact, case-sensitive string match, but because PostgreSQL folds unquoted identifiers to lowercase, lowercase deleted_at matches the overwhelming majority of real-world schemas.

To cast a wider net, extend the filter:

WHERE column_name IN ('deleted_at', 'is_deleted', 'deleted', 'removed_at', 'archived_at')
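If a codebase uses quoted mixed-case identifiers (e.g. "DeletedAt"), a case-insensitive pattern match casts a wider net still. This variant is a loose heuristic and can return false positives such as deleted_by:

```sql
WHERE lower(column_name) LIKE '%delete%'
   OR lower(column_name) LIKE '%archiv%'
```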

Key Points

  • The query identifies tables that likely use soft deletes by looking for a deleted_at column.
  • Soft-delete tables accumulate rows permanently. Dead-tuple pressure from DELETE is absent, but the tables grow without bound.
  • Large soft-delete tables cause index bloat, slower sequential scans, and inflated row estimates that mislead the query planner.
  • Autovacuum has no dead tuples to clean from soft-delete tables, so its default thresholds often mean it visits these tables infrequently even as they grow large.
  • Finding these tables is the first step. The next step is understanding how large they have become and what fraction of rows are logically deleted.

Insights and Explanations

PostgreSQL's table statistics count every row, live or logically deleted. When 80% of a table's rows have deleted_at IS NOT NULL, a sequential scan for the remaining 20% still reads every heap page, so the query pays roughly five times the I/O that the live data alone would require. Up-to-date statistics let the planner estimate the selectivity of the IS NULL filter, but nothing makes the dead weight cheaper to scan, and the cost comparison between sequential and index scans shifts accordingly.

Autovacuum triggers on dead tuple count and table age. Soft-delete tables generate no dead tuples from the marking operation itself — the old row version from an UPDATE does create a dead tuple, but autovacuum cleans that normally. The problem is that the logically-deleted rows themselves are never removed. Autovacuum works correctly at the mechanical level but cannot solve the accumulation problem.

Every index on a soft-delete table includes entries for logically-deleted rows. A B-tree index on user_id for a users table where half the users are soft-deleted carries entries for all users. Index scans touch more pages, cache efficiency drops, and the index consumes more disk space. Creating a partial index on (user_id) WHERE deleted_at IS NULL is the standard mitigation, but only helps if the application consistently uses that filter.
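A partial-index sketch for a hypothetical users table (index and column names are illustrative):

```sql
-- Indexes only live rows; logically deleted rows add no entries.
CREATE INDEX CONCURRENTLY users_user_id_live_idx
    ON users (user_id)
    WHERE deleted_at IS NULL;
```

The planner will consider this index only for queries whose WHERE clause provably implies deleted_at IS NULL, which is why the application must apply the filter consistently.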

Once you have a list of soft-delete tables, follow up with these queries for each one:

-- Row count split by deletion status
SELECT
    deleted_at IS NOT NULL AS is_deleted,
    count(*)
FROM your_schema.your_table
GROUP BY 1;

-- Table and index size
SELECT
    pg_size_pretty(pg_total_relation_size('your_schema.your_table')) AS total_size,
    pg_size_pretty(pg_relation_size('your_schema.your_table')) AS table_size,
    pg_size_pretty(pg_indexes_size('your_schema.your_table')) AS index_size;

Additional Considerations

  • Extending the heuristic: deleted_at is a convention, not a standard. Some codebases use is_deleted (boolean), removed_at, or archived_at. A thorough audit should query for all likely names.
  • Remediation options: Add partial indexes with WHERE deleted_at IS NULL, schedule periodic hard deletes beyond a retention window, or partition large tables by time so old deleted data can be dropped by partition detach.
  • Involve the application team: Soft deletes are an application contract. Physical deletion of logically-deleted rows requires agreement on a retention policy before any DBA action.
  • Behavioural signal: Tables where n_tup_del in pg_stat_user_tables is zero but the table is large and growing are another signal that rows are never physically deleted.
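The last bullet can be checked directly. This query (a sketch; the size threshold is an arbitrary assumption) lists large tables that have never seen a physical DELETE:

```sql
SELECT relname,
       n_live_tup,
       n_tup_del,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_tup_del = 0
  AND n_live_tup > 1000000   -- arbitrary size threshold; tune for your database
ORDER BY n_live_tup DESC;
```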
