Identify Underutilized Columns in PostgreSQL for Schema Optimization
Identifying Underutilized Columns in Your PostgreSQL Database
Keeping your PostgreSQL database clean and efficient is crucial for optimal performance. One way to achieve this is by identifying and potentially removing "useless" columns – those that hold minimal or redundant data. This article explores a PostgreSQL code snippet designed to unearth such columns, helping you optimize your database schema.
Sample Code from Command Line
1SELECT
2 nspname,
3 relname,
4 attname,
5 typname,
6 (stanullfrac*100)::int AS null_percent,
7 case
8 when stadistinct >= 0
9 then stadistinct
10 else
11 abs(stadistinct)*reltuples
12 end AS "distinct",
13 case 1
14 when stakind1
15 then stavalues1
16 when stakind2
17 then stavalues2
18 end AS "values"
19FROM
20 pg_class c
21JOIN
22 pg_namespace ns
23ON
24 (ns.oid=relnamespace)
25JOIN
26 pg_attribute
27ON
28 (c.oid=attrelid)
29JOIN
30 pg_type t
31ON
32 (t.oid=atttypid)
33JOIN
34 pg_statistic
35ON
36 (c.oid=starelid AND staattnum=attnum)
37WHERE
38 nspname NOT LIKE E'pg\\_%'
39 AND
40 nspname != 'information_schema'
41 AND
42 relkind = 'r'
43 AND
44 NOT attisdropped
45 AND
46 attstattarget != 0
47 AND
48 reltuples >= 100
49 AND
50 stadistinct BETWEEN 0 AND 1
51ORDER BY
52 nspname,
53 relname,
54 attname;
Notes: Finds useless PostgreSQL columns containing only a single value eg. all NULLs or a redundant non-distinguishing field. Only returns columns with over 100 rows. Tested on PostgreSQL 8.4, 9.x, 10.x, 11.x, 12.x, 13.0
Identifying Underutilized Columns in Your PostgreSQL Database
Keeping your PostgreSQL database clean and efficient is crucial for optimal performance. One way to achieve this is by identifying and potentially removing "useless" columns – those that hold minimal or redundant data. This article explores a PostgreSQL code snippet designed to unearth such columns, helping you optimize your database schema.
Code Breakdown:
The provided code utilizes system catalog tables in PostgreSQL to analyze table structures and column statistics. Here's a breakdown of the key components:
- Data Retrieval:
pg_class (c)
: Retrieves information about tables (relations)pg_namespace (ns)
: Captures table schema namespg_attribute
: Provides details about table columns (attributes)pg_type (t)
: Defines data types of columnspg_statistic (s)
: Stores statistical information about columns
- Joins:
- The code joins these tables to establish relationships between tables, columns, data types, and statistics.
- Filtering:
- The
WHERE
clause filters the results to exclude system schemas (e.g.,pg_catalog
), temporary tables, dropped columns, and tables with less than 100 rows. - It further focuses on columns with a
stadistinct
value between 0 and 1, indicating potentially single-valued columns.
- The
- Calculations and Outputs:
- The code calculates null value percentage (
null_percent
) and the number of distinct values (distinct
). It also extracts the actual value(s) (values
) if only one or two distinct values exist. - The final
SELECT
clause retrieves schema name (nspname
), table name (relname
), column name (attname
), data type (typname
), null value percentage, number of distinct values, and the actual value(s) (if applicable). - The results are ordered by schema, table, and column name for easier analysis.
- The code calculates null value percentage (
Key Points and Insights:
- This code identifies columns with potentially redundant data, such as those containing only null values or a single unique value across all rows (e.g., a redundant flag set to the same value for every record).
- A minimum threshold of 100 rows is used to avoid analyzing very small tables where statistics might be less reliable.
- The script provides insights into null value prevalence and data diversity within columns.
Explanation of Important Parts:
stadistinct
: This column from thepg_statistic
table represents the estimated number of distinct values in a column. Values between 0 and 1 suggest a single value or very few distinct values.stanullfrac
: This column indicates the estimated fraction of null values in a column. A highnull_percent
might also point to a potentially unused column.- The case statement handles scenarios where
stadistinct
can be negative. It calculates the absolute value multiplied by the total number of rows (reltuples
) to approximate the number of distinct values in such cases.
Conclusion:
By leveraging this code and understanding its output, you can effectively identify potentially useless columns in your PostgreSQL database. Remember to analyze the context and business logic before removing any columns, as seemingly redundant data might hold hidden importance. This code serves as a valuable tool to streamline your database schema and potentially improve query performance.
References:
Finding useless columns: https://wiki.postgresql.org/wiki/Finding_useless_columns
PostgreSQL system catalogs: https://www.postgresql.org/docs/current/catalogs.html
pg_class
system catalog: https://www.postgresql.org/docs/current/catalog-pg-class.html
pg_namespace
system catalog: https://www.postgresql.org/docs/current/catalog-pg-namespace.html
pg_attribute
system catalog: https://www.postgresql.org/docs/current/catalog-pg-attribute.html
pg_type
system catalog: https://www.postgresql.org/docs/current/catalog-pg-type.html
pg_statistic
system catalog: https://www.postgresql.org/docs/current/catalog-pg-statistic.html
Source:
https://github.com/HariSekhon/SQL-scripts/blob/master/postgres_columns_useless.sql