Identify Underutilized Columns in PostgreSQL for Schema Optimization

Identifying Underutilized Columns in Your PostgreSQL Database

Keeping your PostgreSQL database clean and efficient is crucial for optimal performance. One way to achieve this is by identifying and potentially removing "useless" columns – those that hold minimal or redundant data. This article explores a PostgreSQL code snippet designed to unearth such columns, helping you optimize your database schema.

Sample Code from Command Line

 1SELECT
 2  nspname,
 3  relname,
 4  attname,
 5  typname,
 6  (stanullfrac*100)::int AS null_percent,
 7  case
 8    when stadistinct >= 0
 9      then stadistinct
10    else
11      abs(stadistinct)*reltuples
12  end AS "distinct",
13  case 1
14    when stakind1
15      then stavalues1
16    when stakind2
17      then stavalues2
18  end AS "values"
19FROM
20  pg_class c
21JOIN
22  pg_namespace ns
23ON
24  (ns.oid=relnamespace)
25JOIN
26  pg_attribute
27ON
28  (c.oid=attrelid)
29JOIN
30  pg_type t
31ON
32  (t.oid=atttypid)
33JOIN
34  pg_statistic
35ON
36  (c.oid=starelid AND staattnum=attnum)
37WHERE
38  nspname NOT LIKE E'pg\\_%'
39    AND
40  nspname != 'information_schema'
41    AND
42  relkind = 'r'
43    AND
44  NOT attisdropped
45    AND
46  attstattarget != 0
47    AND
48  reltuples >= 100
49    AND
50  stadistinct BETWEEN 0 AND 1
51ORDER BY
52  nspname,
53  relname,
54  attname;

Notes: Finds useless PostgreSQL columns containing only a single value eg. all NULLs or a redundant non-distinguishing field. Only returns columns with over 100 rows. Tested on PostgreSQL 8.4, 9.x, 10.x, 11.x, 12.x, 13.0

Identifying Underutilized Columns in Your PostgreSQL Database

Keeping your PostgreSQL database clean and efficient is crucial for optimal performance. One way to achieve this is by identifying and potentially removing "useless" columns – those that hold minimal or redundant data. This article explores a PostgreSQL code snippet designed to unearth such columns, helping you optimize your database schema.

Code Breakdown:

The provided code utilizes system catalog tables in PostgreSQL to analyze table structures and column statistics. Here's a breakdown of the key components:

  • Data Retrieval:
    • pg_class (c): Retrieves information about tables (relations)
    • pg_namespace (ns): Captures table schema names
    • pg_attribute: Provides details about table columns (attributes)
    • pg_type (t): Defines data types of columns
    • pg_statistic (s): Stores statistical information about columns
  • Joins:
    • The code joins these tables to establish relationships between tables, columns, data types, and statistics.
  • Filtering:
    • The WHERE clause filters the results to exclude system schemas (e.g., pg_catalog), temporary tables, dropped columns, and tables with less than 100 rows.
    • It further focuses on columns with a stadistinct value between 0 and 1, indicating potentially single-valued columns.
  • Calculations and Outputs:
    • The code calculates null value percentage (null_percent) and the number of distinct values (distinct). It also extracts the actual value(s) (values) if only one or two distinct values exist.
    • The final SELECT clause retrieves schema name (nspname), table name (relname), column name (attname), data type (typname), null value percentage, number of distinct values, and the actual value(s) (if applicable).
    • The results are ordered by schema, table, and column name for easier analysis.

Key Points and Insights:

  • This code identifies columns with potentially redundant data, such as those containing only null values or a single unique value across all rows (e.g., a redundant flag set to the same value for every record).
  • A minimum threshold of 100 rows is used to avoid analyzing very small tables where statistics might be less reliable.
  • The script provides insights into null value prevalence and data diversity within columns.

Explanation of Important Parts:

  • stadistinct: This column from the pg_statistic table represents the estimated number of distinct values in a column. Values between 0 and 1 suggest a single value or very few distinct values.
  • stanullfrac: This column indicates the estimated fraction of null values in a column. A high null_percent might also point to a potentially unused column.
  • The case statement handles scenarios where stadistinct can be negative. It calculates the absolute value multiplied by the total number of rows (reltuples) to approximate the number of distinct values in such cases.

Conclusion:

By leveraging this code and understanding its output, you can effectively identify potentially useless columns in your PostgreSQL database. Remember to analyze the context and business logic before removing any columns, as seemingly redundant data might hold hidden importance. This code serves as a valuable tool to streamline your database schema and potentially improve query performance.

References:

Finding useless columns: https://wiki.postgresql.org/wiki/Finding_useless_columns

PostgreSQL system catalogs: https://www.postgresql.org/docs/current/catalogs.html

pg_class system catalog: https://www.postgresql.org/docs/current/catalog-pg-class.html

pg_namespace system catalog: https://www.postgresql.org/docs/current/catalog-pg-namespace.html

pg_attribute system catalog: https://www.postgresql.org/docs/current/catalog-pg-attribute.html

pg_type system catalog: https://www.postgresql.org/docs/current/catalog-pg-type.html

pg_statistic system catalog: https://www.postgresql.org/docs/current/catalog-pg-statistic.html

Source:

https://github.com/HariSekhon/SQL-scripts/blob/master/postgres_columns_useless.sql