
Overview of gpbackup and gprestore

Pavel Semyonov

gpbackup and gprestore are Greengage DB command-line utilities for centralized database backup and restore. They are distributed separately from the core DBMS, use their own versioning, and are compatible across Greengage DB versions. To learn how to install the utilities and get started, see Installation of gpbackup and gprestore.

gpbackup creates logical backups of a database, while gprestore restores those backups to the same or another Greengage DB cluster. Together, they provide a unified, reliable framework for protecting data and enabling disaster recovery in distributed environments.

The utilities implement a centralized parallel backup and restore mechanism. Operations are invoked on the master host and then coordinated across all segment hosts to achieve high performance and scalability. Both utilities also offer extensive options to customize backup aspects such as the backup scope, storage location, and parallel level. In addition, the plugin mechanism allows integration with external storage systems such as cloud or network storage, extending the flexibility of backup management.

Backup and restore process

The gpbackup and gprestore utilities allow administrators to create distributed backups of Greengage DB databases and restore from them with a single command-line call on the master host. They start and coordinate parallel execution of backup and restore tasks across all segment hosts, which ensures scalability and efficient resource utilization.

Backup process

A gpbackup call with a database name and a backup path initiates a backup in the specified location:

$ gpbackup --dbname marketplace --backup-dir /home/gpadmin/backups

When invoked, gpbackup connects to that database and collects metadata of its objects and distribution information. The master then dispatches parallel worker sessions to each segment host. Each segment writes its portion of table data to backup files on its local filesystem. This distributed execution model allows backups to scale with the number of segments and reduces the load on the master.

During backup, each table being processed is locked with an ACCESS SHARE lock. This blocks operations that require an ACCESS EXCLUSIVE lock (such as ALTER TABLE or DROP TABLE) but allows concurrent reads and writes, so database activity can continue during the backup.

NOTE

If a table is already locked with an ACCESS EXCLUSIVE lock, gpbackup waits until the lock is released before proceeding.

Internally, gpbackup uses the COPY command with the ON SEGMENT clause to extract table data. Each segment executes a COPY ... TO command to write data in CSV format to local files in parallel.

Restore process

The gprestore utility restores a database from a backup. It accepts the backup timestamp, which uniquely identifies the backup, and the backup location:

$ gprestore --timestamp 20250908074826 --backup-dir /home/gpadmin/backups
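The 14-digit timestamp encodes the backup's start time as YYYYMMDDHHMMSS. A minimal sketch of decoding it (standard library only, not part of the utilities):

```python
from datetime import datetime

def parse_backup_timestamp(ts: str) -> datetime:
    """Decode a gpbackup timestamp of the form YYYYMMDDHHMMSS."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

started = parse_backup_timestamp("20250908074826")
print(started.isoformat())  # 2025-09-08T07:48:26
```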

gprestore recreates the database schema from the backup metadata and instructs each segment to load its portion of the backup data back into tables using COPY ... FROM with the ON SEGMENT clause. Like the backup, the restore process is parallel and distributed.

During restore, Greengage DB ensures that data distribution policies are respected. You can restore a backup to the same cluster, preserving the original distribution, or to a cluster with a different number of segments by using the --resize-cluster option. In the latter case, gprestore automatically redistributes data based on the table definitions in the backup, so a database can be restored even if the cluster has been resized since the backup was taken.

SSH connections

Both utilities rely on SSH to establish connections between the master and segment hosts. Each segment operation is performed over a separate SSH session. The total number of open connections approximately equals the number of segments in the cluster, which can be high for large clusters. Additionally, increasing the parallel level of the operation with the --jobs option multiplies this number by the specified value. To avoid connection bottlenecks or timeouts, it is recommended to increase the SSH daemon configuration parameters MaxStartups and MaxSessions on the master host.
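As a rough estimate (a sketch based on the description above, not gpbackup's internal accounting), the number of concurrent SSH sessions scales with the segment count multiplied by the --jobs value:

```python
def estimated_ssh_sessions(segments: int, jobs: int = 1) -> int:
    """Rough upper bound on concurrent SSH sessions opened by gpbackup/gprestore:
    approximately one session per segment, multiplied by the parallel level."""
    return segments * jobs

# A 64-segment cluster backed up with --jobs 4 may open on the order of
# 256 SSH sessions, so MaxStartups/MaxSessions should comfortably exceed that.
print(estimated_ssh_sessions(64, 4))  # 256
```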

Database objects included in backups

gpbackup creates logical backups that include a database's data, metadata (DDL), and other database objects. A backup can also optionally include global (cluster-wide) objects such as roles and tablespaces.

By default, a backup includes all objects that belong to the specified database. Utility options allow you to control the inclusion of global objects, statistics, or system objects. Complete lists of options are available in Overview of the gpbackup syntax and Overview of the gprestore syntax.

Database objects

The following database-level objects are included in a backup:

  • Tables

  • Schemas

  • Procedural language extensions

  • Sequences

  • Comments

  • Session-level configuration parameter settings (GUCs)

  • Indexes

  • Ownership and privileges

  • Writable and readable external tables (DDL only)

  • Functions

  • Aggregates

  • Casts

  • Types

  • Views

  • Materialized views (DDL only)

  • Protocols

  • Triggers

  • Rules

  • Domains

  • Operators, operator families, and operator classes

  • Conversions

  • Extensions

  • Text search parsers, dictionaries, templates, and configurations

  • Table statistics (if the --with-stats option is specified)

NOTE
  • DDL only

    Some objects are backed up in DDL-only form. For these objects — such as external tables and materialized views — gpbackup saves only the object definition (the SQL CREATE statement), but not the underlying data. When restored, gprestore recreates their structure without repopulating data sources.

  • Triggers

    Trigger definitions, if present, are also backed up and restored, but Greengage DB does not execute triggers.

Global objects

gpbackup includes the following cluster-wide objects in backups by default:

  • Tablespaces

  • Database-wide configuration parameter settings (GUCs)

  • Resource group definitions

  • Resource queue definitions

  • Roles

  • GRANT assignments of roles to databases

The --without-globals option excludes these objects from the backup.

By default, gprestore does not restore global objects, preserving the current state of the target cluster’s global objects. When run with the --with-globals option, gprestore restores global objects from the backup before database-level objects to ensure dependency consistency.

Excluded system schemas

The following system schemas are not included in backups:

  • gp_toolkit

  • information_schema

  • pg_aoseg

  • pg_bitmapindex

  • pg_catalog

  • pg_toast*

  • pg_temp*

Objects within these schemas are maintained internally by Greengage DB and are recreated automatically when the database is restored. If you modify objects in these schemas (for example, in gp_toolkit), such changes will be lost after restore.

Backup layout

A database backup created with gpbackup consists of multiple files that together represent the logical structure and data of a database. The exact set of files depends on the backup settings, specifically, on which objects are included and how the backup layout is configured.

Backups include two main categories of files:

  • Metadata files on the master host.

  • Data files on segment hosts.

Metadata files on master host

Metadata files describe database structure, configuration, and backup operation details. They are always created on the master host.

The following files are created, where <YYYYMMDDHHMMSS> is the backup timestamp:

gpbackup_<YYYYMMDDHHMMSS>_metadata.sql

Contains the DDL statements for database objects and settings included in the backup. This file is omitted in data-only backups, which are intended for restoring data into an existing database schema.

gpbackup_<YYYYMMDDHHMMSS>_toc.yaml

The backup table of contents (TOC). Lists all tables included in the backup with their OIDs and sizes. Used internally by gprestore to control the restore order.

gpbackup_<YYYYMMDDHHMMSS>_report

The backup operation report. Contains information about the backup command, its parameters, and execution status.

gpbackup_<YYYYMMDDHHMMSS>_config.yaml

The backup configuration file. Lists parameters and environment details used during the backup operation.

gpbackup_<YYYYMMDDHHMMSS>_statistics.sql

Serialized database statistics. Created only when the backup is run with the --with-stats option. Restoring statistics can help the query planner maintain optimal performance after recovery.

gpbackup_history.db

A local SQLite database used internally by gpbackup and gprestore to track completed backups and their timestamps.
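All per-backup metadata files share the backup timestamp, so the expected set for a given backup can be derived from it. A small sketch (the file roles are taken from the descriptions above; the helper itself is hypothetical):

```python
def metadata_files(timestamp: str, with_stats: bool = False) -> list[str]:
    """Build the expected master-host metadata file names for a backup timestamp."""
    files = [
        f"gpbackup_{timestamp}_metadata.sql",   # DDL statements
        f"gpbackup_{timestamp}_toc.yaml",       # table of contents
        f"gpbackup_{timestamp}_report",         # operation report
        f"gpbackup_{timestamp}_config.yaml",    # backup configuration
    ]
    if with_stats:
        files.append(f"gpbackup_{timestamp}_statistics.sql")  # --with-stats only
    return files

print(metadata_files("20250908074826", with_stats=True))
```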

Data files on segment hosts

Each segment host writes table data into its own set of data files. Their naming convention is as follows:

gpbackup_<content-id>_<YYYYMMDDHHMMSS>_<OID>.gz

where:

  • <content-id> is the segment content identifier.

  • <YYYYMMDDHHMMSS> is the backup timestamp.

  • <OID> is the table (or partition) identifier.

Data file layout depends on backup settings such as:

  • Partition granularity (--leaf-partition-data)

    When enabled, a separate data file is created for each partition of a partitioned table.

  • Single-file layout (--single-data-file)

    Produces one data file per segment that contains data for all tables stored on that segment. This simplifies file management but may increase restore time for selective restores.

  • Compression (--compression-type and --compression-level)

    If compression is enabled, data files are written in compressed form. The default compression type is gzip; alternatively, zstd can be specified. When compression is disabled, files have no extension and contain plain text CSV data.
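Putting the naming convention and the compression settings together, a hedged sketch of how a segment data file name is formed (the .zst extension for zstd is an assumption; the .gz and no-extension cases follow the description above):

```python
def data_file_name(content_id: int, timestamp: str, oid: int, compression="gzip") -> str:
    """Construct a segment data file name: gpbackup_<content-id>_<timestamp>_<OID>,
    with an extension that depends on the compression type (None = uncompressed)."""
    ext = {"gzip": ".gz", "zstd": ".zst", None: ""}[compression]
    return f"gpbackup_{content_id}_{timestamp}_{oid}{ext}"

print(data_file_name(0, "20250908074826", 16384))        # gpbackup_0_20250908074826_16384.gz
print(data_file_name(1, "20250908074826", 16384, None))  # gpbackup_1_20250908074826_16384
```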

Backup scope

In addition to full database backups, gpbackup and gprestore support partial backups, allowing you to include only specific objects. This provides flexibility for backing up critical data subsets, reducing backup time and storage usage, or migrating selected parts of a database.

You can configure gpbackup to back up:

  • an entire database (the default);

  • specified schemas;

  • specified tables;

  • all schemas except the specified ones;

  • all tables except the specified ones.

When working with partitioned tables, backup granularity extends to individual partitions. You can choose to include or exclude specific partitions when creating or restoring a backup. This is particularly useful for managing large historical data sets, where only active partitions need to be backed up regularly.

gprestore provides the same selection capabilities, allowing you to restore only the desired schemas, tables, or partitions from a backup set. This symmetry between backup and restore helps streamline partial data recovery and testing workflows.

For detailed usage examples, see Partial backups.

Incremental backups

Incremental backups capture only the tables that have changed since a previous backup. This approach significantly reduces backup duration and storage requirements, especially for large databases with mostly static data. Incremental backups are well suited for environments that require frequent recovery points or regular short backup cycles without the overhead of full backups.

A full backup and the incremental backups taken after it form a backup set. With a complete set, you can restore the database to the state captured by any backup in the set.

IMPORTANT

A backup set cannot be used for restore after the data model changes or the cluster is resized. In that case, take a fresh full backup to start a new incremental backup set.
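The dependency between a full backup and its incrementals can be sketched as follows (a simplified model of the backup-set idea, not gpbackup's internal logic): restoring to a given point requires the full backup plus every incremental taken up to and including the target.

```python
def restore_chain(full_ts: str, incremental_ts: list[str], target_ts: str) -> list[str]:
    """Return the backups needed to restore to target_ts: the full backup
    followed by each incremental taken at or before the target timestamp."""
    return [full_ts] + [ts for ts in sorted(incremental_ts) if full_ts < ts <= target_ts]

full = "20250901000000"
incrementals = ["20250902000000", "20250903000000", "20250904000000"]
print(restore_chain(full, incrementals, "20250903000000"))
# ['20250901000000', '20250902000000', '20250903000000']
```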

For more information about how incremental backups work and how to configure them, see Incremental backups.

Extend backup functionality with storage plugins

gpbackup and gprestore support storage plugins to extend backup and restore operations beyond the local filesystem. Plugins allow backups to be written to and restored from a variety of external storage systems, including cloud and network storage.

For example, the S3 storage plugin integrates gpbackup and gprestore with S3-compatible storage services, enabling backups to Amazon S3 or another S3-compatible storage. Using storage plugins simplifies offsite backup management, centralizes storage, and supports enterprise retention policies.

See Use S3 storage plugin for a plugin usage example and configuration details.