Scrubbing Email PII in PostgreSQL for GDPR Compliance
Scrubbing Email PII from Your PostgreSQL Database
When you copy a production database to a development or test environment, you must remove or anonymize personally identifiable information (PII). Email addresses are among the most common PII fields that need to be scrubbed. Leaving real email addresses in non-production databases creates compliance risk under GDPR, CCPA, and similar regulations.
PostgreSQL makes it straightforward to build a reusable scrubbing solution using a PL/pgSQL function for single-row replacement and a stored procedure for batch processing large tables.
Sample Code
The solution has two parts: a function that generates a safe replacement email, and a procedure that applies it across a table in batches.
Part 1: The scrub_email function
1CREATE OR REPLACE FUNCTION scrub_email(email_address varchar(255))
2RETURNS varchar(255) AS $$
3BEGIN
4 RETURN
5 CONCAT(
6 substr(md5(random()::text), 0, greatest(length(split_part(email_address, '@', 1)) + 1, 6)),
7 '@',
8 split_part(email_address, '@', 2)
9 );
10END;
11$$ LANGUAGE plpgsql;
Part 2: The scrub_email_batch procedure
1CREATE OR REPLACE PROCEDURE scrub_email_batch(_tbl regclass, _col character varying(255))
2LANGUAGE plpgsql
3AS $$
4DECLARE
5 batch_size int := 50;
6 orig_email character varying(255);
7 min_id bigint;
8 max_id bigint;
9BEGIN
10 EXECUTE format('SELECT max(id), min(id) FROM %s', _tbl) INTO max_id, min_id;
11 RAISE INFO 'table=% column=% max_id=% min_id=%', _tbl, _col, max_id, min_id;
12
13 FOR j IN min_id..max_id BY batch_size LOOP
14 FOR k IN j..j + batch_size LOOP
15 EXECUTE format('SELECT %s FROM %s WHERE id = %s', _col, _tbl, k) INTO orig_email;
16 CONTINUE WHEN orig_email IS NULL;
17
18 EXECUTE format('UPDATE %s SET %s = %s WHERE id = %s',
19 _tbl, _col, quote_literal('xyz@example.com'), k);
20 END LOOP;
21
22 RAISE INFO 'committing batch from % to % at %', j, j + batch_size, now();
23 COMMIT;
24 END LOOP;
25END;
26$$;
27
28-- CALL scrub_email_batch('users', 'email');
Notes: Requires PostgreSQL 11 or later for stored procedures (CREATE OR REPLACE PROCEDURE). The scrub_email function works on any version that supports PL/pgSQL. The batch procedure commits in increments to avoid long-running transactions on large tables.
Code Breakdown
scrub_email function:
split_part(email_address, '@', 1)— Extracts the local part of the email (everything before@).split_part(email_address, '@', 2)— Extracts the domain part (everything after@). The domain is preserved so the result looks realistic.md5(random()::text)— Generates a random 32-character hex string. Usingrandom()means each call produces a different result.greatest(length(...) + 1, 6)— Ensures the replacement local part is at least 5 characters to avoid collisions with very short email addresses.substr(..., 0, N)— Trims the MD5 string to match the original local-part length.CONCAT(...)— Assembles the fake email:<random_local>@<original_domain>.
scrub_email_batch procedure:
_tbl regclass— Accepts a table name as a type-safe reference. PostgreSQL resolves it to an OID and validates the table exists._col character varying(255)— The column name to scrub, used withformat()to build dynamic SQL.batch_size int := 50— Processes 50 rows per commit. Adjust based on row size and transaction overhead.EXECUTE format(...)— Uses dynamic SQL to work with any table and column combination.quote_literal('xyz@example.com')— Safely quotes the replacement value to prevent SQL injection in dynamic SQL.COMMITinside the loop — Commits after each batch. This is only valid inside a stored procedure, not a function.CONTINUE WHEN orig_email IS NULL— Skips rows where the email is already null, avoiding unnecessary updates.
Key Points
- The function preserves the email domain but replaces the local part with random characters. This is useful when the domain is not sensitive (e.g.,
@company.com). - The batch procedure replaces every email with
xyz@example.com. Modify it to callscrub_email()instead if you want unique replacements per row. - Committing inside the batch loop keeps individual transactions small, which matters on tables with millions of rows.
Insights and Explanations
Why batch processing matters:
A single UPDATE users SET email = 'xyz@example.com' on a 10-million-row table creates one massive transaction. This holds locks for its entire duration, fills the WAL, and creates autovacuum pressure after it commits. Batch commits reduce all of these side effects.
Using the function vs the procedure:
Use scrub_email() directly when you want unique-looking fake emails per row:
1UPDATE users SET email = scrub_email(email);
Use scrub_email_batch() when the table is large and you need controlled commits with progress logging:
1CALL scrub_email_batch('users', 'email');
Domain preservation:
Keeping the original domain (@company.com) while replacing the local part makes test data look realistic without leaking real addresses. If the domain is also sensitive, replace it by modifying the CONCAT call to use a static domain such as example.com.
Additional Considerations
- Permissions: Only a superuser or table owner can update the target table. The procedure runs with the caller's permissions by default.
- Sequences and IDs: The batch procedure assumes the table has a sequential integer
idcolumn. If your table uses UUIDs or non-sequential IDs, the loop logic needs adjustment. - Irreversible operation: Scrubbing permanently modifies data. Always run this on a copy of production, never on production itself.
References
Posts in this series
- How Many Connections Can Your PostgreSQL Database Handle?
- PostgreSQL Backend Connections via pg_stat_database
- Identifying Blocking PostgreSQL Queries using pg_stat_activity
- List PostgreSQL Databases by Size with Access Check
- Assess PostgreSQL Database Sizes Quickly and Easily
- Unveiling Your PostgreSQL Server - A Diagnostic Powerhouse
- Keep Your PostgreSQL Database Clean, Identify Idle Connections
- Query the PostgreSQL Configuration
- PostgreSQL Recovery Monitoring: Essential SQL Insights
- Restart All PostgreSQL Sequences with ALTER SEQUENCE
- Monitor Running Queries in PostgreSQL using pg_stat_activity
- Monitor PostgreSQL Active Sessions with pg_stat_activity
- PostgreSQL Error Handling Settings via pg_settings
- PostgreSQL File Location Settings Query via pg_settings
- PostgreSQL Lock Management Settings via pg_settings
- PostgreSQL Logging Configuration Query via pg_settings
- Monitor PostgreSQL Memory Settings with pg_settings
- PostgreSQL Table Row Count Estimates with SQL
- List PostgreSQL Tables by Size with SQL
- PostgreSQL WAL Settings Query Guide
- PostgreSQL SSL Settings Query Guide
- PostgreSQL Statistics Settings Query Guide
- PostgreSQL Resource Settings Query Guide
- PostgreSQL Replication Settings Query Guide
- PostgreSQL Query Planning Settings Query Guide
- PostgreSQL Preset Options Settings Query Guide
- PostgreSQL Miscellaneous Settings Query Guide
- Count PostgreSQL Sessions by State with SQL
- Kill Idle PostgreSQL Sessions with SQL
- Grant SELECT on All Tables in PostgreSQL
- Identify Insert-Only Tables in PostgreSQL
- Detect Soft Delete Patterns in PostgreSQL
- List PostgreSQL Object Comments with SQL
- List Foreign Key Constraints in PostgreSQL
- List PostgreSQL Enum Types and Their Values with SQL
- List All Views in a PostgreSQL Database with SQL
- Find PostgreSQL Tables Without a Primary Key
- List PostgreSQL Partitioned Tables with SQL
- List All Schemas in Your PostgreSQL Database
- PostgreSQL Database Statistics with pg_stat_database
- Scrubbing Email PII in PostgreSQL for GDPR Compliance