Scrubbing Email PII in PostgreSQL for GDPR Compliance

Scrubbing Email PII from Your PostgreSQL Database

When you copy a production database to a development or test environment, you must remove or anonymize personally identifiable information (PII). Email addresses are among the most common PII fields that need to be scrubbed. Leaving real email addresses in non-production databases creates compliance risk under GDPR, CCPA, and similar regulations.

PostgreSQL makes it straightforward to build a reusable scrubbing solution using a PL/pgSQL function for single-row replacement and a stored procedure for batch processing large tables.

Sample Code

The solution has two parts: a function that generates a safe replacement email, and a procedure that applies it across a table in batches.

Part 1: The scrub_email function

 1CREATE OR REPLACE FUNCTION scrub_email(email_address varchar(255))
 2RETURNS varchar(255) AS $$
 3BEGIN
 4  RETURN
 5    CONCAT(
 6      substr(md5(random()::text), 0, greatest(length(split_part(email_address, '@', 1)) + 1, 6)),
 7      '@',
 8      split_part(email_address, '@', 2)
 9    );
10END;
11$$ LANGUAGE plpgsql;

Part 2: The scrub_email_batch procedure

 1CREATE OR REPLACE PROCEDURE scrub_email_batch(_tbl regclass, _col character varying(255))
 2LANGUAGE plpgsql
 3AS $$
 4DECLARE
 5  batch_size int := 50;
 6  orig_email character varying(255);
 7  min_id bigint;
 8  max_id bigint;
 9BEGIN
10  EXECUTE format('SELECT max(id), min(id) FROM %s', _tbl) INTO max_id, min_id;
11  RAISE INFO 'table=% column=% max_id=% min_id=%', _tbl, _col, max_id, min_id;
12
13  FOR j IN min_id..max_id BY batch_size LOOP
14    FOR k IN j..j + batch_size LOOP
15      EXECUTE format('SELECT %s FROM %s WHERE id = %s', _col, _tbl, k) INTO orig_email;
16      CONTINUE WHEN orig_email IS NULL;
17
18      EXECUTE format('UPDATE %s SET %s = %s WHERE id = %s',
19        _tbl, _col, quote_literal('xyz@example.com'), k);
20    END LOOP;
21
22    RAISE INFO 'committing batch from % to % at %', j, j + batch_size, now();
23    COMMIT;
24  END LOOP;
25END;
26$$;
27
28-- CALL scrub_email_batch('users', 'email');

Notes: Requires PostgreSQL 11 or later for stored procedures (CREATE OR REPLACE PROCEDURE). The scrub_email function works on any version that supports PL/pgSQL. The batch procedure commits in increments to avoid long-running transactions on large tables.

Code Breakdown

scrub_email function:

  • split_part(email_address, '@', 1) — Extracts the local part of the email (everything before @).
  • split_part(email_address, '@', 2) — Extracts the domain part (everything after @). The domain is preserved so the result looks realistic.
  • md5(random()::text) — Generates a random 32-character hex string. Using random() means each call produces a different result.
  • greatest(length(...) + 1, 6) — Ensures the replacement local part is at least 5 characters to avoid collisions with very short email addresses.
  • substr(..., 0, N) — Trims the MD5 string to match the original local-part length.
  • CONCAT(...) — Assembles the fake email: <random_local>@<original_domain>.

scrub_email_batch procedure:

  • _tbl regclass — Accepts a table name as a type-safe reference. PostgreSQL resolves it to an OID and validates the table exists.
  • _col character varying(255) — The column name to scrub, used with format() to build dynamic SQL.
  • batch_size int := 50 — Processes 50 rows per commit. Adjust based on row size and transaction overhead.
  • EXECUTE format(...) — Uses dynamic SQL to work with any table and column combination.
  • quote_literal('xyz@example.com') — Safely quotes the replacement value to prevent SQL injection in dynamic SQL.
  • COMMIT inside the loop — Commits after each batch. This is only valid inside a stored procedure, not a function.
  • CONTINUE WHEN orig_email IS NULL — Skips rows where the email is already null, avoiding unnecessary updates.

Key Points

  • The function preserves the email domain but replaces the local part with random characters. This is useful when the domain is not sensitive (e.g., @company.com).
  • The batch procedure replaces every email with xyz@example.com. Modify it to call scrub_email() instead if you want unique replacements per row.
  • Committing inside the batch loop keeps individual transactions small, which matters on tables with millions of rows.

Insights and Explanations

Why batch processing matters:

A single UPDATE users SET email = 'xyz@example.com' on a 10-million-row table creates one massive transaction. This holds locks for its entire duration, fills the WAL, and creates autovacuum pressure after it commits. Batch commits reduce all of these side effects.

Using the function vs the procedure:

Use scrub_email() directly when you want unique-looking fake emails per row:

1UPDATE users SET email = scrub_email(email);

Use scrub_email_batch() when the table is large and you need controlled commits with progress logging:

1CALL scrub_email_batch('users', 'email');

Domain preservation:

Keeping the original domain (@company.com) while replacing the local part makes test data look realistic without leaking real addresses. If the domain is also sensitive, replace it by modifying the CONCAT call to use a static domain such as example.com.

Additional Considerations

  • Permissions: Only a superuser or table owner can update the target table. The procedure runs with the caller's permissions by default.
  • Sequences and IDs: The batch procedure assumes the table has a sequential integer id column. If your table uses UUIDs or non-sequential IDs, the loop logic needs adjustment.
  • Irreversible operation: Scrubbing permanently modifies data. Always run this on a copy of production, never on production itself.

References

Posts in this series