Recover a failed master

Pavel Semyonov

This section explains how to recover a Greengage DB (based on Greenplum) cluster from a primary master failure.

The primary (active) master instance in a Greengage DB cluster acts as the single access point for client connections, so its failure interrupts service. Master mirroring lets Greengage DB recover quickly from a master failure: a master mirror, called the standby master, takes over.

In normal operation, the standby master does not accept requests or perform query processing. Instead, it continuously receives and applies changes from the active master by streaming write-ahead log (WAL) records. It maintains a synchronized copy of the system catalog and other metadata. A standby master failure does not interrupt cluster operations. The active master continues working and logs changes that occur while the standby is down. Once the standby is restored, it automatically synchronizes with the current state in the background.

When the primary master fails, the cluster stops serving client queries and appears down, even though the segment instances may continue running on their respective hosts. To restore cluster availability, you must activate the standby master. Upon activation, the cluster resumes operations from the state of the last successfully committed transaction before the failure.

Activate standby master

After a primary master failure, the cluster becomes unavailable. If it’s impossible to bring the failed master back online, you need to activate the standby master to resume operations.

To activate the standby master, use the gpactivatestandby utility:

  1. Log in to the standby master host as gpadmin.

  2. Run gpactivatestandby passing the path to the standby master data directory in the -d option:

    $ gpactivatestandby -d /data1/master/gpseg-1
    NOTE

    gpactivatestandby requires the PGPORT environment variable to be set. Set it before running the utility:

    $ export PGPORT=5432

    Greengage DB prepares the activation procedure and outputs the standby master details:

    [INFO]:------------------------------------------------------
    [INFO]:-Standby data directory    = /data1/master/gpseg-1
    [INFO]:-Standby port              = 5432
    [INFO]:-Standby running           = yes
    [INFO]:-Force standby activation  = no
    [INFO]:------------------------------------------------------
  3. Enter y and press Enter to confirm the standby master activation:

    Do you want to continue with standby master activation? Yy|Nn (default=N):
    NOTE

    To automatically confirm the standby master activation, use the -a option:

    $ gpactivatestandby -d /data1/master/gpseg-1 -a

    After successful activation, the following lines are shown:

    [INFO]:-The activation of the standby master has completed successfully.
    [INFO]:-smdw is now the new primary master.
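The prerequisites named in the steps above (PGPORT is set, and the -d option points at an existing data directory) can be checked before the utility is run. The following is a minimal sketch, not part of Greengage DB: the function name preflight_activation is hypothetical, and it assumes the check runs on the standby master host in the same shell session that will run gpactivatestandby.

```shell
# Hypothetical helper (not shipped with Greengage DB): verifies the
# prerequisites that the activation steps mention before the utility is run.
preflight_activation() {
    local data_dir="$1"

    # gpactivatestandby requires PGPORT to be set.
    if [ -z "${PGPORT:-}" ]; then
        echo "PGPORT is not set" >&2
        return 1
    fi

    # The -d option must point at the standby master data directory.
    if [ ! -d "$data_dir" ]; then
        echo "data directory not found: $data_dir" >&2
        return 1
    fi

    echo "ok: PGPORT=$PGPORT, data directory=$data_dir"
}
```

One possible use is to chain the check with the activation command: preflight_activation /data1/master/gpseg-1 && gpactivatestandby -d /data1/master/gpseg-1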

The cluster is now operational with a new active master, the former standby. To resume working with the cluster, reconfigure clients to connect to the new master. Internal Greengage DB communication switches to the new master automatically.

The cluster now has an active master but no standby master. To verify this, run gpstate with the -f option:

$ gpstate -f

The output shows the absence of the standby master:

[INFO]:-Standby master instance not configured
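For scripted checks, the gpstate -f output can be classified by the messages shown in this topic. The sketch below is an assumption-laden illustration: standby_configured is a hypothetical function name, and it relies only on the two message texts quoted in this document ("Standby master instance not configured" and "Standby master details"). Any other output is reported as unknown rather than guessed.

```shell
# Hypothetical helper: reads `gpstate -f` output on stdin and reports
# whether a standby master is configured.
standby_configured() {
    local output
    output=$(cat)
    case "$output" in
        *"Standby master instance not configured"*) echo "no" ;;
        *"Standby master details"*)                 echo "yes" ;;
        *)                                          echo "unknown" ;;
    esac
}
```

It could be used as gpstate -f | standby_configured, which prints no right after a failover.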

To return the cluster to a fault-tolerant state, do one of the following:

  • Set up a new standby master as described in Enable cluster mirroring.

  • Restore the original active-standby master pair as described below.

CAUTION

Do not restart the original master instance after standby master activation. This can lead to data corruption and cluster inconsistency.

Restore original master-standby configuration

If you’ve resolved the issue that caused the original master to fail, you can revert to the original active–standby master configuration after failover.

Example hostnames

This section uses the example hostnames from the Initialize DBMS topic:

  • mdw — original primary master host.

  • smdw — original standby master host (current primary).

Run all utilities as the gpadmin user.

To restore the primary and standby masters to their original hosts:

  1. Initialize a standby master on the original primary master host mdw.

  2. Activate it, making it the primary master again.

  3. Reinitialize a standby master on its original host smdw.
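The three phases above can be outlined as a dry-run plan that only prints commands and runs nothing. This is a sketch under stated assumptions: print_failback_plan is a hypothetical function, the [host] prefix is just a label for where each command would be run, and the backup directory name follows the backup_ prefix used in this topic's examples.

```shell
# Hypothetical dry-run helper: prints the failback commands in order.
# It does not execute anything on the cluster.
print_failback_plan() {
    local primary_host="$1"   # original primary master host, e.g. mdw
    local standby_host="$2"   # original standby master host, e.g. smdw
    local data_dir="$3"       # master data directory
    local backup_dir
    backup_dir="$(dirname "$data_dir")/backup_$(basename "$data_dir")"

    cat <<EOF
# 1. Initialize a standby master on $primary_host
[$primary_host] mv $data_dir $backup_dir
[$standby_host] gpinitstandby -s $primary_host
[$standby_host] gpstate -f
# 2. Activate it, making $primary_host the primary master again
[$standby_host] gpstop
[$primary_host] gpactivatestandby -d $data_dir -f
[$primary_host] gpstate -f
# 3. Reinitialize a standby master on $standby_host
[$standby_host] mv $data_dir $backup_dir
[$primary_host] gpinitstandby -s $standby_host
[$primary_host] gpstate -f
EOF
}
```

For the example hosts, the call would be print_failback_plan mdw smdw /data1/master/gpseg-1. Printing the plan first makes it easier to review which host each command belongs to before anything is changed.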

The following sections describe these steps in detail.

Create standby master on original master host

To initialize a standby master on the original master host mdw:

  1. On mdw, rename or move the existing master data directory to save it as a backup:

    $ mv /data1/master/gpseg-1 /data1/master/backup_gpseg-1
  2. On smdw, initialize a new standby master specifying mdw as the target host:

    $ gpinitstandby -s mdw

    The output should end with the following line:

    [INFO]:-Successfully created standby master on mdw
  3. Check the standby master by running gpstate -f on smdw:

    $ gpstate -f

    The output shows the standby master details and state:

    [INFO]:-Standby master details
    [INFO]:-----------------------
    [INFO]:-   Standby address          = mdw
    [INFO]:-   Standby data directory   = /data1/master/gpseg-1
    [INFO]:-   Standby port             = 5432
    [INFO]:-   Standby PID              = 2063
    [INFO]:-   Standby status           = Standby host passive
    [INFO]:--------------------------------------------------------------
    [INFO]:--pg_stat_replication
    [INFO]:--------------------------------------------------------------
    [INFO]:--WAL Sender State: streaming
    [INFO]:--Sync state: sync
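The check in the last step can be automated by looking for the two replication lines shown in the output above. The following sketch assumes the gpstate -f output keeps the "WAL Sender State" and "Sync state" labels exactly as printed here; standby_in_sync is a hypothetical function name.

```shell
# Hypothetical helper: reads `gpstate -f` output on stdin and succeeds only
# if the WAL sender is streaming and the standby is in the sync state.
standby_in_sync() {
    local output
    output=$(cat)
    printf '%s\n' "$output" | grep -Eq 'WAL Sender State: streaming[[:space:]]*$' &&
        printf '%s\n' "$output" | grep -Eq 'Sync state: sync[[:space:]]*$'
}
```

A possible use is gpstate -f | standby_in_sync && echo "standby is synchronized". The function returns a non-zero status when either line is missing or shows a different value, so it also fails when no standby is configured.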

Activate original master

IMPORTANT

If the cluster is running, shut it down before activating the standby master.

To return the original master to its primary role:

  1. Stop the cluster by running gpstop on smdw:

    $ gpstop
  2. On mdw, activate the newly initialized standby with the -f option:

    $ gpactivatestandby -d $MASTER_DATA_DIRECTORY -f
    IMPORTANT

    The -f option forces activation if the standby master is not running. Use this option only when you are sure that its state is consistent with the primary master.

    The output confirms that the primary master now runs on the original master host:

    [INFO]:-The activation of the standby master has completed successfully.
    [INFO]:-mdw is now the new primary master.
  3. Verify that no standby master is configured by running gpstate -f on the new primary master host mdw:

    $ gpstate -f

    The output includes the line:

    [INFO]:-Standby master instance not configured
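When the activation output from step 2 is saved to a file, the host that became the primary master can be extracted from its final message and compared with the expected one. This is a minimal sketch that assumes the message text "is now the new primary master." shown above; new_primary_host is a hypothetical function name.

```shell
# Hypothetical helper: reads gpactivatestandby output on stdin and prints
# the host named in the "... is now the new primary master." message.
# Prints nothing if the message is absent.
new_primary_host() {
    sed -n 's/^.*\]:-\(.*\) is now the new primary master\.$/\1/p'
}
```

For example, [ "$(new_primary_host < activate.log)" = "mdw" ] succeeds only if mdw was reported as the new primary master. The file name activate.log is an assumption used for illustration.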

Initialize standby master in original location

To fully restore the original fault-tolerant topology, recreate the standby master on its original host smdw:

  1. On smdw, rename or move the existing master data directory to save it as a backup:

    $ mv /data1/master/gpseg-1 /data1/master/backup_gpseg-1
  2. On mdw, add a standby master specifying smdw as the target host:

    $ gpinitstandby -s smdw

    The output shows the result:

    [INFO]:-Successfully created standby master on smdw
  3. Check the master mirroring state by running gpstate -f on mdw:

    $ gpstate -f

    The output shows that the primary and standby masters are running on their original hosts and are synchronized:

    [INFO]:-Standby master details
    [INFO]:-----------------------
    [INFO]:-   Standby address          = smdw
    [INFO]:-   Standby data directory   = /data1/master/gpseg-1
    [INFO]:-   Standby port             = 5432
    [INFO]:-   Standby PID              = 1462
    [INFO]:-   Standby status           = Standby host passive
    [INFO]:--------------------------------------------------------------
    [INFO]:--pg_stat_replication
    [INFO]:--------------------------------------------------------------
    [INFO]:--WAL Sender State: streaming
    [INFO]:--Sync state: sync
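Both standby initialization procedures in this topic start by moving the old master data directory to a backup location with mv. If a backup directory with the same name already exists (for example, from an earlier failback), mv places the data directory inside it instead of renaming it. The sketch below guards against that case. It is not part of Greengage DB: backup_master_dir is a hypothetical function, and the backup_ prefix simply follows the examples above.

```shell
# Hypothetical helper: renames a master data directory to backup_<name>
# in the same parent directory, refusing to touch an existing backup.
backup_master_dir() {
    local data_dir="${1%/}"   # strip a trailing slash, if any
    local backup_dir
    backup_dir="$(dirname "$data_dir")/backup_$(basename "$data_dir")"

    if [ ! -d "$data_dir" ]; then
        echo "nothing to back up: $data_dir" >&2
        return 1
    fi
    if [ -e "$backup_dir" ]; then
        echo "backup already exists: $backup_dir" >&2
        return 1
    fi

    # Print the backup path on success so callers can record it.
    mv "$data_dir" "$backup_dir" && echo "$backup_dir"
}
```

With the example path, backup_master_dir /data1/master/gpseg-1 would rename the directory to /data1/master/backup_gpseg-1 and print the new path.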