Collect statistics via ANALYZE
This topic explains the basics of database statistics collection in Greengage DB (based on Greenplum).
Greengage DB uses table metadata and statistics to optimize query execution. To execute a query, the executor performs its steps in the order defined by the planner. These steps include scans, filters, joins, motions, and so on. Every step has its cost, which represents the estimated use of CPU, memory, disk I/O, and network resources. A cost is an abstract value: it’s a number that reflects a relative amount of resources required to complete a step. To achieve the best query performance, the cost-based query optimizer estimates possible ways to execute a query, aiming to minimize its total cost.
To choose the best execution plan, the optimizer needs accurate statistics about the stored data, including row counts, distinct values, null fractions, and data distribution patterns. For example, if a query joins two tables and needs to redistribute one of them across all segments, the optimizer chooses the smaller table to reduce network overhead.
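You can inspect the planner's cost estimates with the EXPLAIN command. A minimal sketch, assuming orders and customers tables joined on hypothetical customer_id and id columns:

```sql
-- Print the chosen plan; each step shows estimated cost
-- (startup..total), row count, and average row width.
EXPLAIN
SELECT o.total, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
```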
Statistics collection has its own cost: analyzing large volumes of data can cause a significant load on the cluster. To reduce this overhead, Greengage DB uses sampling during statistics collection. Instead of analyzing every row in large tables, it gathers statistics based on a representative subset of the data. For partitioned tables, the sample includes rows from all partitions.
Default sampling settings produce sufficiently accurate statistics for most cases. You can increase the sample size or statistics granularity for specific columns or tables to improve accuracy. Note that this increases the overhead of statistics collection and does not always lead to better query performance. It is recommended to test changes to the statistics configuration to find out if they actually improve workload efficiency.
In Greengage DB, statistics are collected in parallel across segments, but coordination is managed centrally.
The master instance gathers data samples and table metadata from each segment and calculates overall statistics based on them.
The results are stored on the master in the system catalog: the pg_statistic table and the pg_stats view.
When to update statistics
Newly created tables do not have any statistics until they are collected using one of the available methods.
You can check for missing statistics using the gp_toolkit.gp_stats_missing view:
SELECT * FROM gp_toolkit.gp_stats_missing;
The output lists tables with missing statistics:
 smischema | smitable  | smisize | smicols | smirecs
-----------+-----------+---------+---------+---------
 public    | orders    | f       | 4       | 0
 public    | customers | f       | 4       | 0
(2 rows)
After statistics have been collected for a table, they remain unchanged until the next collection operation. However, as table data changes over time, the existing statistics may become outdated and no longer reflect the actual data distribution. When this happens, query execution plans based on such statistics become less efficient. To avoid this, it’s important to keep statistics up to date.
Statistics update is usually needed in the following cases:
- After loading a significant amount of data into a table.
- After creating indexes.
- After DML operations, such as INSERT, UPDATE, or DELETE, that affect large amounts of stored data.
Note that not every bulk modification requires a statistics update. If new data follows the same distribution pattern as existing data — such as normally distributed numeric values — then the existing statistics may remain accurate enough. In such cases, analyzing the table again might not lead to any benefit.
There are indirect indicators that may be used to detect outdated statistics:
- A large difference between reltuples in the pg_catalog.pg_class system table (the estimated number of rows) and the actual number of rows — the SELECT COUNT(*) result. Note that counting rows in large tables can be resource-intensive.
- Old timestamps in the last_analyze and last_autoanalyze columns of the pg_catalog.pg_stat_all_tables system view. However, if the table is rarely modified, outdated statistics may have little impact on performance.
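You can check these indicators with simple queries. A minimal sketch, assuming a table named orders (the table name is illustrative):

```sql
-- Compare the planner's row estimate with the actual row count;
-- a large difference suggests outdated statistics.
SELECT reltuples FROM pg_catalog.pg_class WHERE relname = 'orders';
SELECT COUNT(*) FROM orders;

-- Check when statistics were last collected for the table.
SELECT last_analyze, last_autoanalyze
FROM pg_catalog.pg_stat_all_tables
WHERE relname = 'orders';
```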
Ways to collect statistics
Greengage DB provides several tools and mechanisms for collecting and managing statistics:
- The ANALYZE SQL command.
- The analyzedb utility.
- Automatic statistics collection.
- The VACUUM ANALYZE SQL command, which removes expired rows and then collects statistics on the cleaned-up table. Learn more in the Remove expired table rows via VACUUM topic.
ANALYZE command
The ANALYZE SQL command collects statistics about the contents of tables and columns. You can use ANALYZE in several ways, depending on the scope of statistics collection:
- Analyze all tables available to the user:

  ANALYZE;

  This call collects statistics for every table in the current database. It can take a long time on large databases and is usually not necessary unless you're preparing the entire system for query optimization, for example, after a major data load or a migration.

- Analyze a particular table:

  ANALYZE orders;

  This updates statistics for all columns of the specified table. Use this after inserting, updating, or deleting a significant amount of data in the table.

- Analyze particular columns of a table:

  ANALYZE orders (total, order_date);

  This updates statistics only for the specified columns. It's useful when you know that only certain columns have changed significantly, especially if those columns are frequently used in WHERE, JOIN, or GROUP BY clauses.
To display progress and output messages of statistics collection steps, add the VERBOSE keyword:
ANALYZE VERBOSE;
When analyzing partitioned tables, ANALYZE behavior depends on whether the root partition or individual partitions are targeted. Details are covered in the Analyze partitioned tables section.
analyzedb utility
analyzedb is a command-line utility that triggers statistics collection in Greengage DB. It provides more flexibility and automation options than the SQL ANALYZE command, especially for large databases or scripted maintenance tasks.
To analyze all tables in a database, run analyzedb on the master host, passing the database name in the -d option:
$ analyzedb -d books_store
The command displays the list of tables and partitions to be analyzed.
Type y and press Enter to confirm statistics collection. Use the -a option to skip the confirmation:
$ analyzedb -d books_store -a
Unlike manual ANALYZE, analyzedb can analyze multiple tables in parallel and selectively update statistics only when necessary, which speeds up statistics collection. To define the number of tables to analyze in parallel, use the -p option. Available values are 1 to 10:
$ analyzedb -d books_store -p 10
For partitioned tables, this value also defines the number of partitions analyzed in parallel.
To automatically skip tables with up-to-date statistics, analyzedb saves state information for append-optimized tables in the db_analyze subdirectory of the master data directory. On subsequent runs, the utility uses this information to check whether a table or partition has changed; if no changes are found, that table or partition does not need to be analyzed again. By default, analyzedb skips analysis when possible.
To force a full analysis regardless of current statistics, use the --full option:
$ analyzedb -d books_store --full
Heap tables are always analyzed regardless of changes.
analyzedb provides options for selecting objects for statistics collection:

- -s — analyze all tables in a schema:

  $ analyzedb -d books_store -s sales

- -t — analyze a specific table:

  $ analyzedb -d books_store -t sales.orders

- -i — analyze specific columns of a table:

  $ analyzedb -d books_store -t sales.orders -i order_date,total

- -x — exclude columns from analysis:

  $ analyzedb -d books_store -t sales.orders -x comment
The -f (--file) option lets you define the list of tables and columns to analyze in a file. An analyzedb configuration file contains the names of tables to analyze, each on a separate line. You can optionally specify a list of columns to analyze or exclude from analysis using the -i and -x options:

<schema>.<table>
<schema>.<table> -i <column1>,<column2>,...
<schema>.<table> -x <column1>,<column2>,...
For example:

public.customers
sales.orders -i order_date,total
public.books -x year
To update statistics as specified in the file:
$ analyzedb -d books_store -f <analyze_file_name>
where <analyze_file_name> is the name of the analyzedb configuration file.
Automatic statistics collection
Greengage DB can automatically update statistics after certain events. Using this mechanism, the query planner gets up-to-date information about stored data without manual intervention. You can adjust automatic statistics collection using the configuration parameters described in this section.
The gp_autostats_mode parameter controls when automatic statistics collection is triggered. Its possible values are the following:
- NONE — disables automatic statistics collection.
- ON_NO_STAT (default) — collects statistics for tables without statistics after data loading operations, namely:
  - CREATE TABLE AS SELECT
  - INSERT
  - COPY
- ON_CHANGE — collects statistics after the following DML operations if they affect a number of rows exceeding the threshold set by the gp_autostats_on_change_threshold parameter:
  - CREATE TABLE AS SELECT
  - UPDATE
  - DELETE
  - INSERT
  - COPY
The gp_autostats_on_change_threshold parameter specifies the number of changed rows that triggers automatic statistics collection in the ON_CHANGE mode. The default value is 2147483647 (the maximum 32-bit signed integer), which effectively disables automatic statistics collection. Thus, when switching to the ON_CHANGE mode, define a meaningful threshold for your workload.
For example, to enable automatic statistics collection when more than 1000 rows are changed:
$ gpconfig -c gp_autostats_mode -v ON_CHANGE
$ gpconfig -c gp_autostats_on_change_threshold -v 1000
$ gpstop -u
To turn off automatic statistics collection:
$ gpconfig -c gp_autostats_mode -v NONE
$ gpstop -u
By default, statistics collection starts automatically only if the triggering operation is executed by the table owner.
The gp_autostats_allow_nonowner parameter extends automatic statistics collection to data modifications performed by non-owners:
$ gpconfig -c gp_autostats_allow_nonowner -v on
$ gpstop -u
The gp_autostats_mode parameter applies only to operations executed outside user-defined functions or procedures. Automatic statistics collection inside functions is managed separately using the gp_autostats_mode_in_functions parameter. It accepts the same values as gp_autostats_mode: NONE, ON_NO_STAT, and ON_CHANGE.
$ gpconfig -c gp_autostats_mode_in_functions -v ON_CHANGE
$ gpstop -u
To enable logging of automatic statistics collection events, turn on the log_autostats parameter:
$ gpconfig -c log_autostats -v on
$ gpstop -u
This helps with monitoring and diagnosing whether and when auto-analysis is triggered.
Analyze partitioned tables
Partitioned tables require special handling during statistics collection. GPORCA, the default optimizer in Greengage DB, uses statistics collected on the root partition (the highest-level parent table) to plan queries. In contrast, the Postgres planner (the legacy planner) uses statistics only from individual leaf partitions, the lowest-level tables in the hierarchy.
By default, an ANALYZE call on a partitioned table collects statistics on all leaf partitions and the root partition. This produces a complete set of statistics that allows GPORCA to generate efficient execution plans. If you use only the Postgres planner (the optimizer parameter is off), you can speed up statistics collection by turning it off for the root partition. To do this, set the optimizer_analyze_root_partition parameter to off:
$ gpconfig -c optimizer -v off
$ gpconfig -c optimizer_analyze_root_partition -v off
$ gpstop -u
With analyzedb, use the --skip_orca_root_stats option to turn off root partition statistics collection:
$ analyzedb -d books_store --skip_orca_root_stats
When optimizer_analyze_root_partition is off, you can collect root partition statistics by running ANALYZE ROOTPARTITION on a partitioned table:
ANALYZE ROOTPARTITION orders;
Alternatively, you can run ANALYZE directly on the root partition. With default settings, when both optimizer and optimizer_analyze_root_partition are on, the ROOTPARTITION keyword is not required.
To optimize statistics collection on partitioned tables, you can run ANALYZE only on changed partitions. In turn, the analyzedb utility automatically checks for changes and skips unchanged partitions; see the analyzedb utility section for details.
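For example, to analyze a single changed leaf partition, run ANALYZE on it directly. The partition name below is hypothetical; actual leaf partition names depend on how the table was partitioned:

```sql
-- Analyze only one leaf partition of the orders table;
-- orders_1_prt_2 is an illustrative partition name.
ANALYZE orders_1_prt_2;
```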
Statistics target
The statistics target determines how detailed the collected statistics are. It defines the maximum number of entries stored in the list of most common values (MCVs) and in the histogram for each column. Greengage DB uses this information to estimate data distribution and choose efficient query execution plans.
By default, the statistics target is the same for all tables and columns. It is defined by the default_statistics_target configuration parameter. The default value is 100, and the maximum is 10000.
To increase the default target globally:
$ gpconfig -c default_statistics_target -v 200
$ gpstop -u
Increasing the value allows the optimizer to better handle columns with irregular data distributions, which can improve plan accuracy. However, this also increases the cost of statistics collection and may consume more catalog space.
You can override the default target for particular columns when more detailed statistics are needed:
ALTER TABLE orders ALTER COLUMN total SET STATISTICS 300;
Two values of the statistics target are handled in special ways:

- -1 — falls back to the default set by default_statistics_target.
- 0 — disables statistics collection for the column (the planner ignores it).
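Both special values are set with the same ALTER TABLE ... SET STATISTICS syntax shown above; for example:

```sql
-- Fall back to the default_statistics_target value for the column
ALTER TABLE orders ALTER COLUMN total SET STATISTICS -1;

-- Stop collecting statistics for the column
ALTER TABLE orders ALTER COLUMN comment SET STATISTICS 0;
```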
Use a per-column override only when you know a specific column significantly impacts query performance and has a non-uniform distribution.
Collected statistics
This section lists statistical metrics collected in Greengage DB and where they are stored.
Table size: pg_class table
Greengage DB uses the pg_catalog.pg_class system table to estimate table sizes during query planning. Its reltuples column stores an estimate of the number of rows in each table. This estimate is updated during ANALYZE and VACUUM operations. To check the estimated row count for a table, query by the table name in the relname column:
SELECT reltuples FROM pg_class WHERE relname = 'orders';
The value may differ from the actual row count returned by SELECT COUNT(*). If the difference is significant, it indicates that the statistics are outdated and should be refreshed. See When to update statistics for more details.
Column-level statistics: pg_statistic table and pg_stats view
Greengage DB stores detailed column statistics in:
- pg_catalog.pg_statistic — an internal system table used by the planner.
- pg_catalog.pg_stats — a system view that presents statistics in a readable format.

The pg_statistic table stores more technical detail, while the pg_stats view is better suited for querying and inspection.
Each row in pg_statistic and pg_stats describes statistics for a single column of a user table. This includes the following key statistical metrics.
| Metric | pg_stats column | pg_statistic column | Description |
|---|---|---|---|
| Number of distinct values | n_distinct | stadistinct | Helps estimate join and filter selectivity |
| Null fraction | null_frac | stanullfrac | Fraction of column values that are NULL |
| Most common values (MCVs) | most_common_vals | stavaluesN (MCV slot) | An array of the most frequent values; their frequencies are stored in the matching stanumbersN column. The maximum array size is defined by the statistics target |
| Histograms | histogram_bounds | stavaluesN (histogram slot) | Used to estimate data distribution across value ranges |
| Correlation | correlation | stanumbersN (correlation slot) | Measures how well values are sorted, used to optimize index scans |

In pg_statistic, these metrics are stored in generic "slot" columns (stakindN, stavaluesN, stanumbersN), where the stakindN value identifies which kind of statistic each slot holds.
The pg_stats view allows querying column statistics using table and column names:
SELECT null_frac, most_common_vals FROM pg_stats
WHERE tablename = 'orders' AND attname = 'comment';
Result:
 null_frac |      most_common_vals
-----------+-----------------------------
       0.7 | {NULL, 'no comment', 'ok'}
(1 row)
In turn, pg_statistic uses object IDs and attribute numbers, making it better suited for internal use:
SELECT stanullfrac FROM pg_statistic WHERE starelid = 16384 AND staattnum = 4;
Result:
 stanullfrac
-------------
         0.7
(1 row)
This form is typically used by internal tools and not recommended for manual inspection unless you’re debugging or developing system utilities.