Remove expired table rows via VACUUM
This topic explains how to remove expired table rows in Greengage DB (based on Greenplum) using the VACUUM operation.
Overview
Greengage DB uses PostgreSQL's Multi-Version Concurrency Control (MVCC) model to manage concurrent access to heap tables. In this model, updates and deletes are logical operations: they don't physically modify or remove rows but mark them as expired, also known as "dead" rows. An update creates a new version of the row with the modified values, which is appended to the table just like an insert. By default, expired rows are invisible to users: they are not returned in query results and do not affect statistics. They are retained internally for purposes such as conflict resolution and transaction rollback. Indexes are affected in the same way: expired index entries remain stored but are ignored during queries. For more information about MVCC, see the corresponding chapter in the PostgreSQL documentation: Concurrency control.
Because expired rows remain in the database, table size continues to grow over time, even if the visible row count stays the same. This also leads to query performance degradation due to increased disk I/O.
Over time, a table may occupy significantly more disk space than is required to store its current data. This unnecessary storage growth is known as bloat. Bloat affects all types of tables — user and system, heap and append-optimized (AO) — as well as indexes.
Recover disk space
The VACUUM operation reclaims the space occupied by expired rows, making it available for reuse and reducing bloat in tables. After VACUUM, free space becomes available in the existing data pages and is reused for inserts and updates. New pages are only appended when there is no reusable space in the existing pages.
Append-optimized (AO) tables handle updates and deletes differently. See the Vacuum append-optimized tables section below for details.
System catalog tables also accumulate bloat over time. For information on vacuuming them, see the Remove system catalog bloat section.
Prevent transaction ID wraparound failures
Another purpose of VACUUM is preventing transaction ID wraparound failures.
In MVCC, every transaction has a 32-bit identifier called a transaction ID (XID), which allows for around four billion values. XIDs are used cyclically: when the transaction counter overflows, it wraps around to zero. After wraparound, new transactions receive smaller XIDs than earlier ones. To handle this correctly, Greengage DB, like PostgreSQL, uses special XID comparison rules, under which smaller XID values can be "newer" than larger ones.
For more details about transaction ID wraparound and comparison, see the corresponding chapter in the PostgreSQL documentation: Routine vacuuming.
VACUUM prevents issues with XID wraparound by freezing old rows. XID comparison rules no longer apply to a row once it is frozen. Instead, a frozen row is considered older than (and hence visible to) any transaction that starts after the freeze, regardless of XID values.
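To see how far a table is from the freeze threshold, you can query the age of its oldest unfrozen XID via the standard pg_class catalog. This is a minimal sketch; orders is a hypothetical table name:

```sql
-- Age of the oldest unfrozen transaction ID in the table.
-- The next VACUUM freezes rows older than vacuum_freeze_min_age.
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relname = 'orders';
```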
Check bloat in tables
Bloat is measured as a ratio between the numbers of actual and expected pages a table uses, where:

- actual is the number of pages a table currently occupies.
- expected is the minimum number of pages required to store the table data without expired rows.
This ratio helps estimate how much space could be reclaimed through a VACUUM operation.
Both numbers are calculated during ANALYZE operations and stored in internal statistics. To get accurate bloat estimates, make sure table statistics are up to date.
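For example, to refresh statistics for a single table before checking its bloat (orders is a sample table name):

```sql
-- Recompute planner statistics, including relpages and reltuples
ANALYZE orders;
```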
To find out the current number of pages used by a table (actual), query the pg_class system catalog table:
SELECT relname, relpages, reltuples FROM pg_class WHERE relname='orders';
Example result:
 relname | relpages | reltuples
---------+----------+-----------
 orders  |       25 |     1e+03
(1 row)
When updates or deletes are performed on the table, the number of pages (relpages) grows over time, even if the number of rows (reltuples) stays the same or decreases.
To check both expected and actual numbers of pages in heap tables, Greengage DB provides two helper views: gp_toolkit.gp_bloat_diag and gp_toolkit.gp_bloat_expected_pages.
gp_toolkit.gp_bloat_diag reports only tables that are suspected to have bloat:
SELECT * FROM gp_toolkit.gp_bloat_diag;
Example output:
 bdirelid | bdinspname | bdirelname | bdirelpages | bdiexppages |              bdidiag
----------+------------+------------+-------------+-------------+------------------------------------
    17018 | public     | orders     |          25 |           4 | moderate amount of bloat suspected
(1 row)
Table bloat is reflected by the following columns:

- bdirelpages — the actual number of pages (same as relpages in pg_class).
- bdiexppages — the expected number of pages.
Tables appear in gp_toolkit.gp_bloat_diag only if their actual-to-expected page ratio is 3 or greater.
The bdidiag column provides a diagnostic message:

- moderate amount of bloat suspected — actual pages exceed expected pages by more than three times.
- significant amount of bloat suspected — actual pages exceed expected pages by more than ten times.
You can use this view to find bloated tables by name:
SELECT * FROM gp_toolkit.gp_bloat_diag WHERE bdirelname = 'orders';
Or to list tables with significant bloat:
SELECT * FROM gp_toolkit.gp_bloat_diag WHERE bdidiag LIKE '%significant%';
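You can also compute the actual-to-expected page ratio directly from the view's columns to rank suspected tables by severity. This query is a sketch based only on the columns described above:

```sql
-- Rank suspected tables by bloat severity
SELECT bdinspname AS schema_name,
       bdirelname AS table_name,
       bdirelpages,
       bdiexppages,
       round(bdirelpages::numeric / NULLIF(bdiexppages, 0), 1) AS bloat_ratio
FROM gp_toolkit.gp_bloat_diag
ORDER BY bloat_ratio DESC;
```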
For information about bloat in all heap tables, including those not present in gp_bloat_diag, use the gp_toolkit.gp_bloat_expected_pages view.
To check a specific table, filter by its object ID (oid):
SELECT * FROM gp_toolkit.gp_bloat_expected_pages WHERE btdrelid = 'orders'::regclass;
Example result:
 btdrelid | btdrelpages | btdexppages
----------+-------------+-------------
    17018 |          25 |           4
(1 row)
For detecting bloat in append-optimized tables, use the gp_toolkit.__gp_aovisimap_compaction_info(oid) function.
To learn more, see the Vacuum append-optimized tables section.
Ways to do VACUUM
Greengage DB provides the following ways to do VACUUM:

- The VACUUM SQL command and its form VACUUM FULL.
- The vacuumdb utility.
VACUUM command
The VACUUM command reclaims the space occupied by expired rows by marking it as free. It does not return disk space to the operating system. Instead, future INSERT and UPDATE operations reuse the cleared space within the same table pages, overwriting obsolete data.
An exception applies to expired rows at the physical end of a table. If VACUUM can acquire an access exclusive lock on the table and the tail pages are empty, it truncates them, returning that space to the operating system.
This operation is essential for maintaining performance and managing disk usage in systems where frequent updates or deletes happen.
Additionally, VACUUM freezes old current rows based on the vacuum_freeze_min_age parameter. Freezing is a special state that makes rows accessible to all future transactions even after XID wraparound. VACUUM determines which rows to freeze using their age — the difference between a row's xmin and the newest XID at the current moment. All rows older than vacuum_freeze_min_age are frozen during the operation.
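A quick way to relate the parameter to actual row ages is to compare the current threshold with age(xmin) of individual rows. This sketch assumes a hypothetical orders table:

```sql
-- Current freeze threshold: VACUUM freezes rows older than this
SHOW vacuum_freeze_min_age;

-- Row ages: the difference between the current transaction
-- counter and each row's xmin
SELECT age(xmin) AS row_age, count(*)
FROM orders
GROUP BY 1
ORDER BY 1;
```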
VACUUM is a non-blocking operation: it can run in parallel with most other database queries, including SELECT, INSERT, and UPDATE. However, it cannot run concurrently with some DDL operations, such as ALTER TABLE.
Running VACUUM with no arguments processes all tables in the current database:
VACUUM;
To vacuum a specific table, add its name:
VACUUM orders;
This is often faster and less resource-intensive than vacuuming the entire database, especially if you’re targeting a known bloated table.
VACUUM supports several options to adjust its behavior:

- FULL — rewrites the table without expired rows to a new file and returns the space it previously occupied to the operating system. This is a heavier operation requiring exclusive access to the table. See the VACUUM FULL command section for details.
- VERBOSE — prints detailed progress information.
- ANALYZE — triggers statistics collection on the target tables after vacuuming. When vacuuming a specific table, you can list one or more columns to collect statistics for:

  VACUUM ANALYZE orders (order_id);

  This can be helpful if only certain columns are used in WHERE clauses or joins and need precise statistics. If no columns are specified, Greengage DB analyzes all columns of the table.
- FREEZE — performs VACUUM as if it were run with the vacuum_freeze_min_age parameter set to 0. This means that all current rows that the operation processes are frozen: their xmin values logically appear in the past for all future transactions, regardless of their XIDs.
These options use the parenthesized syntax and can be specified in any order. This syntax allows combining multiple options in a single command:
VACUUM (VERBOSE, ANALYZE) orders;
VACUUM FULL command
VACUUM FULL is a special form of VACUUM that physically rewrites the table to remove expired rows. Unlike standard VACUUM, which only reclaims space internally, it shrinks the table's physical size and returns unused space to the operating system. As a result, the table's actual page count decreases to the minimum possible (the expected value).
VACUUM FULL also reorganizes the table data so that rows are stored next to each other without internal fragmentation. This makes future reads more efficient.
To perform a full VACUUM on all database tables:
VACUUM FULL;
Or run it on a specific table:
VACUUM FULL orders;
Because it rewrites the entire table, VACUUM FULL reclaims more space than a standard VACUUM, leaving no free space between rows. It requires an access exclusive lock on the table, meaning that no other queries can access the table during the operation. Since the table is rewritten, the operation can be I/O intensive and requires additional disk space roughly equal to the size of the table during processing.
Because VACUUM FULL blocks all access to the table and temporarily consumes more disk space, it should only be run during scheduled maintenance windows. On large or heavily used tables, it can cause significant service disruption. Running standard VACUUM operations regularly helps keep table bloat under control and reduces the need for VACUUM FULL.
For recommendations on scheduling vacuum operations, see the When to do VACUUM section.
As an alternative to VACUUM FULL, you can use the CREATE TABLE AS statement to copy the data into a new table and drop the existing one.
This manual method provides more control over locking and disk space usage.
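A minimal sketch of this approach, assuming a hypothetical orders table. Note that indexes, constraints, privileges, and the distribution policy are not copied by CREATE TABLE AS and must be recreated manually, so verify the full table definition before using this in production:

```sql
BEGIN;
-- Copy only the visible rows into a compact new table
CREATE TABLE orders_compact AS SELECT * FROM orders;
-- Swap the tables; the old data files are removed on DROP
DROP TABLE orders;
ALTER TABLE orders_compact RENAME TO orders;
COMMIT;
```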
vacuumdb utility
vacuumdb is a command-line utility that runs the VACUUM SQL command. It automates vacuuming tasks, making it suitable for use in maintenance scripts and scheduled jobs. Use vacuumdb to perform vacuuming without manually connecting to the database via psql.
To vacuum an entire database:
$ vacuumdb books_store
The full form includes the -d (--database) option name:
$ vacuumdb -d books_store
Use -a (--all) to vacuum all databases in a Greengage DB cluster:
$ vacuumdb --all
To vacuum a specific table, specify its name in the -t (--table) option:
$ vacuumdb --table 'orders' books_store
To vacuum and collect statistics for a table:
$ vacuumdb --analyze --table 'orders' books_store
The following options define aspects of VACUUM execution:

- --full (or -f) — performs VACUUM FULL, which rewrites the table and releases unused space to the operating system. Use it only during maintenance windows due to its blocking nature and resource usage.
- --verbose (or -v) — prints progress details.
- --analyze (or -z) — collects statistics after vacuuming, improving planner decisions.
- --analyze-only (or -Z) — runs ANALYZE without performing any vacuuming. This is functionally the same as using the analyzedb utility. When analyzing a specific table, you can optionally provide one or more columns for targeted statistics collection:

  $ vacuumdb --analyze --verbose --table 'orders(order_id)' books_store

- --freeze (or -F) — runs VACUUM FREEZE.
When to do VACUUM
Unlike PostgreSQL, Greengage DB does not support automatic vacuuming (autovacuum) for user databases. All vacuum operations must be performed manually or automated using vacuumdb.
Running regular VACUUM is essential to keep disk usage under control and maintain good query performance. It prevents table bloat from growing and helps avoid the need for disruptive maintenance operations in the future.
A common best practice is to run vacuum tasks nightly.
For tables with frequent UPDATE or DELETE operations, consider vacuuming them more often, up to several times a day depending on the workload and table size.
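One way to automate nightly vacuuming is a cron entry that runs vacuumdb during a low-use window. The schedule, database name, and log path below are assumptions to adapt to your environment:

```shell
# Hypothetical crontab entry: vacuum and refresh statistics
# for the books_store database every night at 02:00
0 2 * * * vacuumdb --analyze books_store >> /var/log/vacuumdb.log 2>&1
```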
Because VACUUM increases disk I/O and CPU load, it should be run in low-use periods to minimize performance impact.
After performing large data modifications, it's a good practice to run VACUUM ANALYZE on affected tables:
VACUUM ANALYZE orders;
This not only reclaims expired space but also refreshes planner statistics for better execution plans.
For heavily bloated tables, it's more effective to vacuum them before running ANALYZE.
This ensures statistics are based on the cleaned-up data layout, resulting in more accurate row count estimates and better execution plans.
If regular vacuuming is neglected, table and index bloat can grow substantially.
This results in increased disk usage, slower queries due to excessive page reads, and eventually a need for VACUUM FULL. Since VACUUM FULL takes an exclusive lock and uses more system resources, it requires a dedicated maintenance window.
If your workload regularly deletes all rows from a table, consider using TRUNCATE instead of DELETE. TRUNCATE immediately deletes the data files, bypassing MVCC tracking. This eliminates the need for VACUUM and reduces system overhead.
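For example, for a staging table that is fully reloaded on each run (staging_orders is a hypothetical name):

```sql
-- DELETE marks every row as expired and leaves the data files in place:
-- DELETE FROM staging_orders;

-- TRUNCATE removes the data files immediately, so no VACUUM is needed:
TRUNCATE staging_orders;
```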
Vacuum append-optimized tables
Append-optimized tables use a concurrency model different from MVCC. Unlike heap tables, AO tables are designed primarily for append-only workloads, where updates and deletes are rare or completely avoided. However, if updates or deletes are performed on AO tables, their bloat grows over time.
If your AO table experiences frequent updates or deletes, consider converting it to a heap table instead. Heap storage is optimized for row modifications and avoids the overhead of full-table rewrites. For details, see the How to choose a table storage type section.
When VACUUM is run on an AO table, it physically rewrites the table without expired rows and reclaims space. This process is known as AO compaction, and it requires enough free disk space to store a full copy of the table during the rewrite. Because this operation uses significant disk space, Greengage DB automatically skips compaction for AO tables that do not exceed a defined bloat threshold. This threshold is defined by the gp_appendonly_compaction_threshold configuration parameter, which reflects the percentage of expired rows. By default, it is 10.
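To inspect or change the threshold cluster-wide, you can use gpconfig and reload the configuration. The value 20 below is an arbitrary example:

```shell
# Show the current setting across the cluster
$ gpconfig -s gp_appendonly_compaction_threshold
# Raise the threshold to 20 percent and reload the configuration
$ gpconfig -c gp_appendonly_compaction_threshold -v 20
$ gpstop -u
```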
When compacting AO tables, Greengage DB checks this threshold for each segment separately. Thus, a VACUUM command called on a table can execute on some segments and skip others.
VACUUM FULL on AO tables ignores the threshold and always compacts the entire table.
To check bloat in an AO table, use the gp_toolkit.__gp_aovisimap_compaction_info(oid) function. It outputs information about expired rows in the data files of the AO table segments:
SELECT * FROM gp_toolkit.__gp_aovisimap_compaction_info('customers'::regclass);
Example result:
NOTICE:  gp_appendonly_compaction_threshold = 10
 content | datafile | compaction_possible | hidden_tupcount | total_tupcount | percent_hidden
---------+----------+---------------------+-----------------+----------------+----------------
       1 |        1 | t                   |            2000 |           3000 |          66.67
       3 |        1 | t                   |            2000 |           3000 |          66.67
       0 |        1 | t                   |            2000 |           3000 |          66.67
       2 |        1 | t                   |            2000 |           3000 |          66.67
(4 rows)
| Field | Description |
|---|---|
| content | Segment ID |
| datafile | ID of the data file in the segment |
| compaction_possible | Whether the data file can be compacted (t or f) |
| hidden_tupcount | The number of expired rows in a data file |
| total_tupcount | The total number of rows in a data file, including expired ones |
| percent_hidden | The percentage of expired rows in a data file |
After running VACUUM ANALYZE on the table, the function outputs the following:
NOTICE:  gp_appendonly_compaction_threshold = 10
 content | datafile | compaction_possible | hidden_tupcount | total_tupcount | percent_hidden
---------+----------+---------------------+-----------------+----------------+----------------
       0 |        1 | f                   |               0 |              0 |           0.00
       0 |        2 | f                   |               0 |           1000 |           0.00
       3 |        1 | f                   |               0 |              0 |           0.00
       3 |        2 | f                   |               0 |           1000 |           0.00
       2 |        1 | f                   |               0 |              0 |           0.00
       2 |        2 | f                   |               0 |           1000 |           0.00
       1 |        1 | f                   |               0 |              0 |           0.00
       1 |        2 | f                   |               0 |           1000 |           0.00
(8 rows)
The f value in compaction_possible shows that no data file can be compacted.
To disable AO compaction entirely, turn off the gp_appendonly_compaction parameter:
$ gpconfig -c gp_appendonly_compaction -v off
$ gpstop -u
Remove index bloat
VACUUM and VACUUM FULL operations don't remove bloat in indexes; they process only tables.
Just like tables, indexes can accumulate bloat due to frequent updates or deletes. This happens when index entries become obsolete but are not physically removed until the index is rebuilt.
The only way to remove index bloat is to recreate the index with the REINDEX command:
- Recreate all indexes on a table:

  REINDEX TABLE orders;

- Recreate a specific index:

  REINDEX INDEX total_idx;

  This option is preferable when only one or two indexes are bloated because it consumes fewer resources.
After running REINDEX, index statistics such as reltuples and relpages in pg_class are reset to zero. To get actual statistics, run ANALYZE on the table after reindexing:
ANALYZE orders;
Remove system catalog bloat
The system catalog stores metadata about all database objects: tables, indexes, columns, functions, and others. Like user tables, catalog tables and their indexes accumulate bloat over time. Schema changes and DDL operations update and delete system table rows, keeping their expired versions.
In active databases with frequent changes of schemas and database objects, system catalog bloat can lead to:
- Slower execution of system queries (planning, introspection, joins).
- Increased memory usage by the query planner.
- Increased disk space usage by the system catalog.
The general recommendations for preventing catalog bloat are the same as for user tables: run VACUUM nightly and after large update or delete operations.
If database objects rarely change, you may vacuum the system catalog less often. Once a week is the minimum recommended frequency for this case. Additionally, reindexing is required to remove bloat from system catalog indexes.
Typical steps of system catalog maintenance are the following:

- Rebuild system catalog indexes using the REINDEX SYSTEM command or the reindexdb utility with the -s (--system) option.
- Vacuum system catalog tables using VACUUM (periodic maintenance) or VACUUM FULL (if the catalog is heavily bloated).

  IMPORTANT: Plan a maintenance window with downtime for VACUUM FULL. Regular catalog vacuuming can be done during low-use periods.

- Refresh statistics for catalog tables and indexes using the ANALYZE command or the analyzedb utility.
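The steps above can be sketched as a shell sequence. Here, books_store and pg_attribute are example names; in practice you would vacuum and analyze all catalog tables, for instance by iterating over the tables in pg_catalog:

```shell
# 1. Rebuild system catalog indexes
$ reindexdb --system books_store
# 2. Vacuum a catalog table (repeat for each table in pg_catalog)
$ psql books_store -c "VACUUM pg_catalog.pg_attribute;"
# 3. Refresh catalog statistics
$ psql books_store -c "ANALYZE pg_catalog.pg_attribute;"
```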