Collect statistics via ANALYZE
This topic explains the basics of database statistics collection in Greengage DB (based on Greenplum).
Greengage DB uses table metadata and statistics to optimize query execution. To execute a query, the executor performs its steps in the order defined by the planner. These steps include scans, filters, joins, motions, and so on. Every step has its cost, which represents the estimated use of CPU, memory, disk I/O, and network resources. A cost is an abstract value: it’s a number that reflects a relative amount of resources required to complete a step. To achieve the best query performance, the cost-based query optimizer estimates possible ways to execute a query, aiming to minimize its total cost.
To choose the best execution plan, the optimizer needs accurate statistics about the stored data, including row counts, distinct values, null fractions, and data distribution patterns. For example, if a query joins two tables and needs to redistribute one of them across all segments, the optimizer chooses the smaller table to reduce network overhead.
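You can inspect the planner's cost estimates with the EXPLAIN command. A minimal sketch, assuming orders and customers tables joined on hypothetical customer_id and id columns:

```sql
-- Print the chosen plan; each step shows estimated cost
-- (startup..total), row count, and average row width.
EXPLAIN
SELECT o.total, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
```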
Statistics collection has its own cost: analyzing large volumes of data can cause a significant load on the cluster. To reduce this overhead, Greengage DB uses sampling during statistics collection. Instead of analyzing every row in large tables, it gathers statistics based on a representative subset of the data. For partitioned tables, the sample includes rows from all partitions.
Default sampling settings produce sufficiently accurate statistics for most cases. You can increase the sample size or statistics granularity for specific columns or tables to improve accuracy. Note that this increases the overhead of statistics collection and does not always lead to better query performance. It is recommended to test changes to the statistics configuration to find out if they actually improve workload efficiency.
In Greengage DB, statistics are collected in parallel across segments, but coordination is managed centrally.
The master instance gathers data samples and table metadata from each segment and calculates overall statistics based on them.
The results are stored on the master in the system catalog: the pg_statistic table and the pg_stats view.
When to update statistics
Newly created tables do not have any statistics until they are collected using one of the available methods.
You can check for missing statistics using the gp_toolkit.gp_stats_missing view:
SELECT * FROM gp_toolkit.gp_stats_missing;
The output lists tables with missing statistics:
 smischema | smitable  | smisize | smicols | smirecs
-----------+-----------+---------+---------+---------
 public    | orders    | f       | 4       | 0
 public    | customers | f       | 4       | 0
(2 rows)
After statistics have been collected for a table, they remain unchanged until the next collection operation. However, as table data changes over time, the existing statistics may become outdated and no longer reflect the actual data distribution. When this happens, query execution plans based on such statistics become less efficient. To avoid this, it’s important to keep statistics up to date.
Statistics update is usually needed in the following cases:
- After loading a significant amount of data into a table.
- After creating indexes.
- After DML operations, such as INSERT, UPDATE, or DELETE, that affect large amounts of stored data.
Note that not every bulk modification requires a statistics update. If new data follows the same distribution pattern as existing data — such as normally distributed numeric values — then the existing statistics may remain accurate enough. In such cases, analyzing the table again might not lead to any benefit.
There are indirect indicators that may be used to detect outdated statistics:
- A large difference between reltuples in the pg_catalog.pg_class system table (the estimated number of rows) and the actual number of rows — the SELECT COUNT(*) result. Note that counting rows in large tables can be resource-intensive.
- Old timestamps in the last_analyze and last_autoanalyze columns of the pg_catalog.pg_stat_all_tables system view. However, if the table is rarely modified, outdated statistics may have little impact on performance.
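You can check these indicators with simple queries. A minimal sketch, assuming a table named orders (the table name is illustrative):

```sql
-- Compare the planner's row estimate with the actual row count;
-- a large difference suggests outdated statistics.
SELECT reltuples FROM pg_catalog.pg_class WHERE relname = 'orders';
SELECT COUNT(*) FROM orders;

-- Check when statistics were last collected for the table.
SELECT last_analyze, last_autoanalyze
FROM pg_catalog.pg_stat_all_tables
WHERE relname = 'orders';
```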
Ways to collect statistics
Greengage DB provides several tools and mechanisms for collecting and managing statistics:
- The ANALYZE SQL command.
- The analyzedb utility.
- Automatic statistics collection.
- The VACUUM ANALYZE SQL command, which removes expired rows and then collects statistics on the cleaned-up table. Learn more in the Remove expired table rows via VACUUM topic.
ANALYZE command
The ANALYZE SQL command collects statistics about the contents of tables and columns. You can use ANALYZE in several ways, depending on the scope of statistics collection:
- Analyze all tables available to the user:

  ANALYZE;

  This call collects statistics for every table in the current database. It can take a long time on large databases and is usually not necessary unless you're preparing the entire system for query optimization, for example, after a major data load or a migration.

- Analyze a particular table:

  ANALYZE orders;

  This updates statistics for all columns of the specified table. Use this after inserting, updating, or deleting a significant amount of data in the table.

- Analyze particular columns of a table:

  ANALYZE orders (total, order_date);

  This updates statistics only for the specified columns. It's useful when you know that only certain columns have changed significantly, especially if those columns are frequently used in WHERE, JOIN, or GROUP BY clauses.
To display progress and output messages of statistics collection steps, add the VERBOSE keyword:
ANALYZE VERBOSE;
When analyzing partitioned tables, ANALYZE behavior depends on whether the root partition or individual partitions are targeted. Details are covered in the Analyze partitioned tables section.
analyzedb utility
analyzedb is a command-line utility that triggers statistics collection in Greengage DB. It provides more flexibility and automation options than the SQL ANALYZE command, especially for large databases or scripted maintenance tasks.
To analyze all tables in a database, run analyzedb on the master host, passing the database name in the -d option:
$ analyzedb -d books_store
The command displays the list of tables and partitions to be analyzed.
Type y and press Enter to confirm statistics collection. Use the -a option to skip the confirmation:
$ analyzedb -d books_store -a
Unlike manual ANALYZE, analyzedb can analyze multiple tables in parallel and selectively update statistics only when necessary, which speeds up statistics collection. To define the number of tables to analyze in parallel, use the -p option. Available values are 1 to 10:
$ analyzedb -d books_store -p 10
For partitioned tables, this value also defines the number of partitions analyzed in parallel.
To automatically skip tables with up-to-date statistics, analyzedb saves state information for append-optimized tables in the db_analyze subdirectory of the master data directory. On subsequent runs, the utility uses this information to check whether a table or partition has changed; if no changes are found, that table or partition does not need to be analyzed again. By default, analyzedb skips analysis when possible.
To force a full analysis regardless of current statistics, use the --full option:
$ analyzedb -d books_store --full
Heap tables are always analyzed regardless of changes.
analyzedb provides options for selecting objects for statistics collection:

- -s — analyze all tables in a schema:

  $ analyzedb -d books_store -s sales

- -t — analyze a specific table:

  $ analyzedb -d books_store -t sales.orders

- -i — analyze specific columns of a table:

  $ analyzedb -d books_store -t sales.orders -i order_date,total

- -x — exclude columns from analysis:

  $ analyzedb -d books_store -t sales.orders -x comment
The -f (--file) option lets you define the list of tables and columns to analyze in a file. An analyzedb configuration file contains the names of tables to analyze, each on a separate line. You can optionally specify a list of columns to analyze or exclude from analysis using the -i and -x options:

<schema>.<table>
<schema>.<table> -i <column1>,<column2>,...
<schema>.<table> -x <column1>,<column2>,...
For example:

public.customers
sales.orders -i order_date,total
public.books -x year
To update statistics as specified in the file:
$ analyzedb -d books_store -f <analyze_file_name>
where <analyze_file_name> is the name of the analyzedb configuration file.
Automatic statistics collection
Greengage DB can automatically update statistics after certain events. Using this mechanism, the query planner gets up-to-date information about stored data without manual intervention. You can adjust automatic statistics collection using the configuration parameters described in this section.
The gp_autostats_mode parameter controls when automatic statistics collection is triggered. Its possible values are the following:
- NONE — disables automatic statistics collection.
- ON_NO_STAT (default) — collects statistics for tables without statistics after data loading operations, namely:
  - CREATE TABLE AS SELECT
  - INSERT
  - COPY
- ON_CHANGE — collects statistics after the following DML operations if they affect a number of rows exceeding the threshold set by the gp_autostats_on_change_threshold parameter:
  - CREATE TABLE AS SELECT
  - UPDATE
  - DELETE
  - INSERT
  - COPY
The gp_autostats_on_change_threshold parameter specifies the number of changed rows that triggers automatic statistics collection in the ON_CHANGE mode. The default value is 2147483647 (the maximum 32-bit signed integer), which effectively disables automatic statistics collection. Thus, when switching to the ON_CHANGE mode, define a meaningful threshold for your workload.
For example, to enable automatic statistics collection when more than 1000 rows are changed:
$ gpconfig -c gp_autostats_mode -v ON_CHANGE
$ gpconfig -c gp_autostats_on_change_threshold -v 1000
$ gpstop -u
To turn off automatic statistics collection:
$ gpconfig -c gp_autostats_mode -v NONE
$ gpstop -u
By default, statistics collection starts automatically only if the triggering operation is executed by the table owner.
The gp_autostats_allow_nonowner parameter extends automatic statistics collection to data modifications performed by non-owners:
$ gpconfig -c gp_autostats_allow_nonowner -v on
$ gpstop -u
The gp_autostats_mode parameter applies only to operations executed outside user-defined functions or procedures. Automatic statistics collection inside functions is managed separately using the gp_autostats_mode_in_functions parameter. It accepts the same values as gp_autostats_mode: NONE, ON_NO_STAT, and ON_CHANGE.
$ gpconfig -c gp_autostats_mode_in_functions -v ON_CHANGE
$ gpstop -u
To enable logging of automatic statistics collection events, turn on the log_autostats parameter:
$ gpconfig -c log_autostats -v on
$ gpstop -u
This helps with monitoring and diagnosing whether and when auto-analysis is triggered.
Analyze partitioned tables
Partitioned tables require special handling during statistics collection. GPORCA, the default optimizer in Greengage DB, uses statistics collected on the root partition (the highest-level parent table) to plan queries. In contrast, the Postgres planner (the legacy planner) uses statistics only from individual leaf partitions, the lowest-level tables in the hierarchy.
By default, an ANALYZE call on a partitioned table collects statistics on all leaf partitions and the root partition. This produces a complete set of statistics that allows GPORCA to generate efficient execution plans. If you use only the Postgres planner (the optimizer parameter is off), you can speed up statistics collection by turning it off for the root partition. To do this, set the optimizer_analyze_root_partition parameter to off:
$ gpconfig -c optimizer -v off
$ gpconfig -c optimizer_analyze_root_partition -v off
$ gpstop -u
With analyzedb, use the --skip_orca_root_stats option to turn off root partition statistics collection:
$ analyzedb -d books_store --skip_orca_root_stats
When optimizer_analyze_root_partition is off, you can collect root partition statistics by running ANALYZE ROOTPARTITION on a partitioned table:
ANALYZE ROOTPARTITION orders;
Alternatively, you can run ANALYZE directly on the root partition. With default settings, when both optimizer and optimizer_analyze_root_partition are on, the ROOTPARTITION keyword is not required.
To optimize statistics collection on partitioned tables, you can run ANALYZE only on changed partitions. In turn, the analyzedb utility automatically checks for changes and skips unchanged partitions; see the analyzedb utility section for details.
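For example, to analyze a single changed leaf partition, run ANALYZE on it directly. The partition name below is hypothetical; actual leaf partition names depend on how the table was partitioned:

```sql
-- Analyze only one leaf partition of the orders table;
-- orders_1_prt_2 is an illustrative partition name.
ANALYZE orders_1_prt_2;
```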
Statistics target
The statistics target determines how detailed the collected statistics are. It defines the maximum number of entries stored in the list of most common values (MCVs) and in the histogram for each column. Greengage DB uses this information to estimate data distribution and choose efficient query execution plans.
By default, the statistics target is the same for all tables and columns. It is defined by the default_statistics_target configuration parameter. The default value is 100, and the maximum is 10000.
To increase the default target globally:
$ gpconfig -c default_statistics_target -v 200
$ gpstop -u
Increasing the value allows the optimizer to better handle columns with irregular data distributions, which can improve plan accuracy. However, this also increases the cost of statistics collection and may consume more catalog space.
You can override the default target for particular columns when more detailed statistics are needed:
ALTER TABLE orders ALTER COLUMN total SET STATISTICS 300;
Two values of the statistics target are handled in special ways:

- -1 — falls back to the default set by default_statistics_target.
- 0 — disables statistics collection for the column (the planner ignores it).
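Both special values are set with the same ALTER TABLE ... SET STATISTICS syntax shown above; for example:

```sql
-- Fall back to the default_statistics_target value for the column
ALTER TABLE orders ALTER COLUMN total SET STATISTICS -1;

-- Stop collecting statistics for the column
ALTER TABLE orders ALTER COLUMN comment SET STATISTICS 0;
```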
Use a per-column override only when you know a specific column significantly impacts query performance and has a non-uniform distribution.
Collected statistics
This section lists statistical metrics collected in Greengage DB and where they are stored.
Table size: pg_class table
Greengage DB uses the pg_catalog.pg_class system table to estimate table sizes during query planning. Its reltuples column stores an estimate of the number of rows in each table. This estimate is updated during ANALYZE and VACUUM operations. To check the estimated row count for a table, query by the table name in the relname column:
SELECT reltuples FROM pg_class WHERE relname = 'orders';
The value may differ from the actual row count returned by SELECT COUNT(*). If the difference is significant, it indicates that the statistics are outdated and should be refreshed. See When to update statistics for more details.
Column-level statistics: pg_statistic table and pg_stats view
Greengage DB stores detailed column statistics in:
- pg_catalog.pg_statistic — an internal system table used by the planner.
- pg_catalog.pg_stats — a system view that presents statistics in a readable format.

The pg_statistic table stores more technical detail, while the pg_stats view is better suited for querying and inspection.
Each row in pg_statistic and pg_stats describes statistics for a single column of a user table. This includes the following key statistical metrics.
| Metric | pg_stats column | pg_statistic column | Description |
|---|---|---|---|
| Number of distinct values | n_distinct | stadistinct | Helps estimate join and filter selectivity |
| Null fraction | null_frac | stanullfrac | Fraction of column values that are NULL |
| Most common values (MCVs) | most_common_vals | stavaluesN (MCV slot) | An array of the most frequent values; their frequencies are stored in the matching stanumbersN column. The maximum array size is defined by the statistics target |
| Histograms | histogram_bounds | stavaluesN (histogram slot) | Used to estimate data distribution across value ranges |
| Correlation | correlation | stanumbersN (correlation slot) | Measures how well values are sorted, used to optimize index scans |

In pg_statistic, these metrics are stored in generic "slot" columns (stakindN, stavaluesN, stanumbersN), where the stakindN value identifies which kind of statistic each slot holds.
The pg_stats view allows querying column statistics using table and column names:
SELECT null_frac, most_common_vals FROM pg_stats
WHERE tablename = 'orders' AND attname = 'comment';
Result:
 null_frac |      most_common_vals
-----------+-----------------------------
       0.7 | {NULL, 'no comment', 'ok'}
(1 row)
In turn, pg_statistic uses object IDs and attribute numbers, making it better suited for internal use:
SELECT stanullfrac FROM pg_statistic WHERE starelid = 16384 AND staattnum = 4;
Result:
 stanullfrac
-------------
         0.7
(1 row)
This form is typically used by internal tools and not recommended for manual inspection unless you’re debugging or developing system utilities.