Backup and restore

Pavel Semyonov

This topic provides an overview of the ways to back up and restore data stored in a Greengage DB (based on Greenplum) cluster.

Overview

Backing up and restoring data is essential DBMS functionality. It ensures that data can be recovered in the event of hardware failures, software issues, or accidental data deletion. Additionally, database backups enable operational scenarios such as:

  • preserving historical data;

  • point-in-time recovery or snapshot-based environment recreation;

  • providing a safe rollback strategy before major upgrades or schema changes.

Backups are also used for data migration, for example:

  • between two Greengage DB clusters: from a staging environment to production, or moving databases to new hardware;

  • between different systems: loading data from Greengage DB to another PostgreSQL-compatible system or vice versa.

Logical and physical backups

This topic covers the tools that Greengage DB provides for creating logical backups. Unlike physical backups, which are direct copies of the underlying data files, logical backups abstract the physical storage format and represent the database in a portable form. This usually means a set of SQL statements and metadata files that can recreate the same database structure and data on restore.

Although both kinds of backups are essentially snapshots of a database, their differences define distinct usage scenarios:

  • Physical backups are faster to create and restore, since they are essentially copies of database files. However, this ties them to similar hardware and system environments. As a result, they are typically used for fast point-in-time recovery within the same environment after failures.

  • Logical backups are more flexible and portable. They can be restored on other clusters or even on different DBMSs. However, creating and restoring logical backups usually takes more time.

For physical backups of Greengage DB databases, a modified version of the pgBackRest utility for PostgreSQL is available. Its source code and usage instructions are available in the pgbackrest repository. For more information on using it for Greengage DB cluster backup, see the following articles on the Greengage DB blog: Greengage DB backup and recovery and Greengage DB backup and recovery. Part 2.

Logical backup approaches

Greengage DB supports two main approaches for creating and restoring logical backups:

  • Parallel backups, using the gpbackup and gprestore utilities. Each segment dumps its own portion of the data set, so the workload is evenly distributed across the cluster and processed in parallel. This significantly reduces backup and restore time.

  • Non-parallel backups, using PostgreSQL’s pg_dump or pg_dumpall utilities. All data is streamed through the master instance and written to its file system or standard output. This approach is simpler and more portable but places a heavier load on the master.

Below is a high-level comparison of these two approaches.

Parallel

  • Tools: gpbackup, gprestore

  • Cluster load: moderate, evenly distributed across segments

  • Backup structure: distributed across cluster hosts

  • Comments: recommended for most cases, including regular backups; faster; supports incremental backups, compression, and external storage (plugin-based)

Non-parallel

  • Tools: pg_dump, pg_dumpall

  • Cluster load: high resource and network load on the master; may require large local storage

  • Backup structure: single file

  • Comments: highly portable; suitable for migrations between systems; slower in most cases

The sections below describe both approaches in more detail and provide guidance on choosing between them.

Parallel backup with gpbackup and gprestore

Greengage DB provides the gpbackup and gprestore utilities for creating and restoring logical backups of stored databases. These utilities are distributed separately from the Greengage DB server core, so they can be installed and updated independently. To learn how to install and use gpbackup and gprestore, refer to the dedicated gpbackup and gprestore documentation.

gpbackup and gprestore operate in a distributed manner and take full advantage of the Greengage DB architecture. When creating a backup, the master coordinates the process, while most of the workload is performed on segment instances. Each segment captures its portion of the data set and writes its own backup files to the local file system. As a result, a complete database backup consists of multiple files on all cluster hosts: metadata on the master host and data on each segment host. Such parallel processing minimizes backup time and results in an even load distribution across the cluster.
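
For illustration, here is roughly what a backup set might look like on the master host, assuming the default backup location under the master data directory (the file names below follow gpbackup’s typical naming; the exact set of files depends on the backup options):

$ ls $MASTER_DATA_DIRECTORY/backups/20251006/20251006063113/
gpbackup_20251006063113_config.yaml
gpbackup_20251006063113_metadata.sql
gpbackup_20251006063113_report
gpbackup_20251006063113_toc.yaml

Each segment host stores its own data files for the same backup timestamp under its segment data directories.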

Because of their performance characteristics and operational flexibility, gpbackup and gprestore are the recommended approach for most backup and restore scenarios. Their advantages are especially visible on large databases and clusters with many segments.

The utilities support a broad set of configuration options that allow administrators to define the scope, structure, and location of a backup. Examples include selecting specific schemas or tables, including or excluding database metadata, writing to local or external storage, and enabling incremental behavior. The utilities also support storage plugins that allow saving backups directly to S3-compatible object storage or other external storage systems.

Below are several examples of typical gpbackup and gprestore use cases. They are shown for illustration; full command references and detailed procedures are provided in the gpbackup and gprestore documentation.

Back up an entire database with default backup settings:

$ gpbackup --dbname marketplace

Back up only a selected schema:

$ gpbackup --dbname marketplace --include-schema sales

Create an incremental backup based on a previous full backup:

$ gpbackup --dbname marketplace --leaf-partition-data --incremental

Restore a previously created backup into a new database:

$ gprestore --timestamp 20251006063113 --create-db
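
To restore the same backup into a database with a different name, gprestore provides the --redirect-db option (the target database name below is illustrative):

$ gprestore --timestamp 20251006063113 --create-db --redirect-db marketplace_copy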

Send backup data to an external S3-compatible storage using the S3 storage plugin:

$ gpbackup --dbname marketplace --plugin-config /home/gpadmin/s3-config.yaml
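
The plugin configuration file describes the plugin executable and the target storage. Below is a rough sketch of such a file, with the keys following the gpbackup S3 plugin’s YAML format and all values given as placeholders:

executablepath: $GPHOME/bin/gpbackup_s3_plugin
options:
  endpoint: https://s3.example.com
  region: us-east-1
  aws_access_key_id: <access key ID>
  aws_secret_access_key: <secret access key>
  bucket: gg-backups
  folder: marketplace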

Non-parallel backup with pg_dump or pg_dumpall

Greengage DB is shipped with modified versions of the standard PostgreSQL logical backup tools: pg_dump and pg_dumpall. These versions are adapted to work with Greengage DB’s distributed data model and can access data across the entire cluster. However, they do not perform any distributed processing and do not involve segment instances in the backup process. The difference between the two utilities is that pg_dump backs up a single database, while pg_dumpall backs up all databases in the cluster. To learn more about pg_dump and pg_dumpall, see the reference pages pg_dump and pg_dumpall in the PostgreSQL documentation.

Because these utilities originate from PostgreSQL, they do not leverage Greengage DB’s distributed architecture. They retrieve all data from segments through the master instance and write it to a target file or standard output. No segment-side parallelism is used, and segment instances do not produce or store backup files. As a result, creating a backup with pg_dump typically takes substantially longer than with gpbackup. Additionally, the master must have enough local disk space to store the full backup.

The pg_dump and pg_dumpall utilities do not block most running workloads, except for DDL operations that require an ACCESS EXCLUSIVE lock.

The utilities support various output formats. The default format is a single SQL script, which can be restored using the psql console. Such single-file backups are easier to manage than distributed backups created with gpbackup. They are also more portable: when created with the --no-gp-syntax option, they are compatible with PostgreSQL and most PostgreSQL-derived systems.
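
For example, a plain SQL dump can be replayed into a database with psql (the file and database names here are illustrative):

$ psql --dbname marketplace -f marketplace.sql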

Other output formats include directory, tar, and custom archive. Although more complex, these formats provide additional capabilities, such as compression or parallel restore. To restore databases from such backups, the PostgreSQL pg_restore utility is required.
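
For instance, assuming the bundled pg_dump supports the standard directory format, a backup can be written as a directory and later restored in parallel with pg_restore (paths and database names are illustrative):

$ pg_dump -Fd -f /backups/marketplace_dir marketplace
$ pg_restore -j 4 --dbname=restored_db /backups/marketplace_dir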

Below are several examples of pg_dump and pg_dumpall usage.

Dump a single database to an SQL script file:

$ pg_dump marketplace > marketplace.sql

Dump only a specific schema:

$ pg_dump --schema=sales marketplace > sales.sql

Dump a specific table:

$ pg_dump --table=sales.orders marketplace > orders.sql

Dump all databases in the cluster using pg_dumpall:

$ pg_dumpall > full_cluster.sql

Create a custom archive backup:

$ pg_dump -Fc --dbname=marketplace > marketplace.dump

Restore a database from a custom archive backup with pg_restore using 8 parallel jobs:

$ pg_restore -j 8 --dbname=restored_db marketplace.dump

Choosing a backup approach

The choice between parallel (gpbackup/gprestore) and non-parallel (pg_dump/pg_dumpall) backup approaches depends on performance requirements, compatibility needs, and operational constraints.

Parallel backups are generally faster and more scalable because data is read and written directly by the segment instances. No large data transfers pass through the master instance, which minimizes bottlenecks and makes this mode well-suited for large analytical databases. Parallel backups are the recommended mode for most routine backup and restore operations. Use parallel backup with gpbackup/gprestore when:

  • you create regular backups intended for in-place restore;

  • you back up large databases or clusters with many segments;

  • you need fast and predictable backup and restore performance;

  • you want incremental backups or backups to remote storage such as AWS S3;

  • you need an even load distribution across segments.

Non-parallel backups, while significantly slower, provide better portability and produce output fully compatible with standard PostgreSQL. This makes them a better choice for certain migration scenarios, such as exporting a database for loading data into a different DBMS or producing a complete, self-contained SQL dump. Use non-parallel pg_dump/pg_dumpall when:

  • you need a portable SQL dump that can be loaded into standard PostgreSQL (use the --no-gp-syntax option, as shown in the example after this list);

  • you perform cross-version or cross-platform migrations;

  • the backup is small enough to be transferred to the master instance and written to its file system;

  • the simplicity of a single SQL output file is important.
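
For example, to produce a dump without Greengage-specific syntax that can be loaded into standard PostgreSQL:

$ pg_dump --no-gp-syntax marketplace > marketplace_portable.sql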

In practice, organizations often use both methods: gpbackup and gprestore for operational backups and restores, and pg_dump/pg_dumpall for migrations, exports, or tasks where cross-system compatibility is needed.