In this article, we want to discuss with the community what it means to fully implement the ability to decrease the number of segments and/or hosts in Greenplum.
Our thesis is that a solution fully implements the ability to shrink or expand a cluster if and only if the cluster ends up balanced (or stays balanced) after the shrink or expansion operation.
What makes this article stand out from the previous ones (e.g. the great article by Roman Eskin about separating Orca into an extension in Greengage) is that, at the time of writing, the work is still in full swing. We are currently outlining the general boundaries of the solution and its main theses and requirements, which will inevitably change to some degree as we move on.
What is so difficult about reducing the number of segments or hosts? As the saying goes, breaking is not building.
It’s well known that segments in Greenplum are numbered by content (a segment number that is the same for a primary-mirror pair). In Fig. 1, the content value is a postfix after the seg identifier, the colors represent primary and mirror segments, and the mirroring type is group.
When a cluster is expanded with gpexpand, new segments are added to the cluster (content and dbid values are incremented and assigned to the new segments). In the interactive mode, gpexpand won’t allow breaking the available mirroring schemes and imposes requirements on the number of new hosts. To keep the mirroring strategy, gpexpand asks for two or more hosts (framed with a dotted line in Fig. 2).

There’s much more freedom if a user passes an expansion configuration file to gpexpand (the -i option), but then everything is on the user’s conscience.
If the utility were more advanced in terms of keeping the mirroring configuration, it would be possible to achieve a balanced expansion with just one host. It would require moving, for example, the [seg6, seg7, seg8] group from the sdw1 host to the new sdw4 host, while the newly introduced [seg9, seg10, seg11] group would be placed on sdw1.
This is a limiting factor for using gpexpand in the interactive mode. However, when the configuration file is assembled manually, the uniformity of the distribution can be altered (in the worst case, broken), for example, with a utility such as gpmovemirrors.
To redistribute data across an altered number of segments, the redistribution policy has to take the new segment count into account.
The hash computed for a row (more precisely, computed for the distribution keys of a table) is stored in the hash field of the CdbHash structure alongside the number of segments, numsegs:
typedef struct CdbHash
{
    uint32 hash;              /* The result hash value */
    int numsegs;              /* number of segments in Greenplum Database used for partitioning */
    CdbHashReduce reducealg;  /* the algorithm used for reducing to buckets */
    bool is_legacy_hash;      /* true if the legacy (modulo-based) hashing is in use */
    int natts;                /* number of attributes in the distribution key */
    FmgrInfo *hashfuncs;      /* hash functions for the distribution key attributes */
} CdbHash;
To get the final value that defines the source/target segment for a row, the hash is reduced to the required range by the cdbhashreduce function, which is responsible for computing the number of the target segment. Starting with version 6, this computation is based on the jump consistent hash algorithm.
This algorithm requires a contiguous sequence of "bucket" identifiers (in terms of the general algorithm description); in our case, the buckets are the segments with content values in [0, num_segments). As a consequence, the main limitation of this algorithm is that buckets can be added or removed only starting from the highest-numbered segments. Dropping a segment somewhere in the middle of a cluster is impossible in that paradigm.
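For reference, below is a sketch of the canonical jump consistent hash routine (Lamping and Veach, 2014), which is essentially what the reduction in cdbhashreduce boils down to in this mode; the function and type spellings here are ours and not copied from the Greenplum sources. The key property: when the bucket count shrinks from N to N-1, only the keys that lived in bucket N-1 move, and the surviving buckets keep all of their keys. That is precisely why only the highest-numbered segments can be dropped.

#include <stdint.h>

/*
 * Jump consistent hash: maps a 64-bit key to a bucket in [0, num_buckets).
 * Buckets can only be meaningfully added or removed at the end of the range;
 * keys assigned to the surviving buckets never move.
 */
static int32_t
jump_consistent_hash(uint64_t key, int32_t num_buckets)
{
    int64_t b = -1;
    int64_t j = 0;

    while (j < num_buckets)
    {
        b = j;
        key = key * 2862933555777941757ULL + 1;
        j = (int64_t) ((b + 1) * ((double) (1LL << 31) / (double) ((key >> 33) + 1)));
    }
    return (int32_t) b;
}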
All would be well, but these highest-numbered segments may reside on any hosts. What can that lead to? An unbalanced cluster (Fig. 4).
As a result, 3 primary segments remain on sdw1, 3 mirror segments on sdw3, and two groups of 3 segments each on sdw2. Doesn’t look very balanced, does it? The sdw3 host does nothing but receive WAL writes from sdw2.
In that case, the DBA would have to manually balance such a cluster. I doubt those who manage large clusters would be happy about it.
Our idea is to generalize the task of increasing or reducing the number of segments in such a way that the main process becomes rebalancing the target segment set across the required number of hosts. This opens up the following opportunities:
Reducing the number of segments without changing the number of hosts, for example, if an administrator decided to allocate part of the hardware resources to some other tasks, but changing the number of hosts wasn’t required.
Reducing both the number of segments and hosts. For example, to free up some hosts and allocate them to another cluster.
Rebalancing in the same configuration in terms of the number of segments and hosts, i.e. balancing a configuration that for some reason wasn’t balanced, taking into account the mirroring strategy.
Making the user’s life easier by giving them one utility that solves the complex task of scaling a cluster, in particular replacing gpexpand.
Those users who for some reason would want to keep using the current utilities can treat gpexpand and its "antipode" ggshrink as wrappers on top of the general gprebalance solution. Technically, they would use gprebalance.
There are also a couple of important possible improvements.
A possible way to optimize the table redistribution step under "jump consistent redistribution" when reducing the number of segments is to reject the CREATE TABLE AS SELECT approach and consider an INSERT SELECT implementation of the data movement from the source segments. Indeed, since the highest-numbered segments are the ones being decommissioned (which means the rows on the remaining segments stay where they are), we can get a presumably significant time advantage by redistributing only the data of the decommissioned segments instead of fully recreating the table. I think that as the work on this subtask moves on, we will share the results of some comparative performance tests with you.
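To make the difference concrete, here is a rough SQL sketch of the two approaches. The table name, distribution key, and the new segment count of 6 are made up for illustration, and the INSERT-based variant is purely conceptual: in reality the insert has to be performed under the new hash policy that no longer includes the decommissioned segments, which is exactly the non-trivial part of the implementation.

-- Full-rewrite redistribution (the CREATE TABLE AS SELECT style): every row is
-- rehashed and rewritten, even the rows whose target segment does not change.
CREATE TABLE sales_tmp AS SELECT * FROM sales DISTRIBUTED BY (id);
DROP TABLE sales;
ALTER TABLE sales_tmp RENAME TO sales;

-- Sketch of the INSERT SELECT alternative for shrinking: with jump consistent
-- hashing, rows already stored on the remaining segments keep their placement,
-- so only rows residing on the decommissioned (highest-numbered) segments need
-- to be moved. Here 6 is the assumed new segment count.
INSERT INTO sales SELECT * FROM sales WHERE gp_segment_id >= 6;
DELETE FROM sales WHERE gp_segment_id >= 6;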
The second point we wanted to highlight is that during the segment movement, a replica is turned off almost immediately and might be unavailable for a long time. It’s not a huge problem for a single segment recovery or movement, but, for the task of a full and potentially large-scale cluster rebalancing, it might be critical. We are also working in that direction.
As a conclusion, I’d like to list several theses that can be considered the main requirements for the utility that we call gprebalance for now:
The utility should provide the ability to balance a cluster.
The group and spread mirroring configurations should be supported.
Ideally, the utility should support balancing the mount points for the data directories. Otherwise, a user may be surprised to find that all segments with a specific mount point have ended up on the same node.
The utility should support rebalancing for a given list of hosts.
The utility should support modeling all actions "in memory" without really moving the data (dry run).
The utility should have the ability to present a segment movement plan as a list of movement actions between nodes.
The utility should support rolling back the changes done to a cluster.
The utility should support executing the movement operations either within a given time interval, or within a given duration.
Should the need arise to switch the primary role to the mirror and back, the utility (on its own or through some other utilities) should be able to do it automatically; alternatively, it should pause and ask the user for explicit confirmation that it is allowed to switch roles.
The utility should minimize the mirror’s downtime during segment movement operations.
The utility should support displaying the information about the rebalancing progress.
The utility should handle possible segment failover and node failure situations.