In this article, we want to discuss with the community what it means to fully implement the ability to decrease the number of segments and/or hosts in Greenplum.
Our thesis is that a solution fully implements the ability to shrink or expand a cluster if and only if the cluster ends up balanced (or stays balanced) after the shrink or expansion operation.
What makes this article stand out from the previous ones (e.g. the great article by Roman Eskin about separating Orca into an extension in Greengage) is that, at the time of writing, the work is still in full swing. We are currently outlining the general boundaries of the solution and its main theses and requirements, which will inevitably change to some degree as we move on.
What is so difficult about reducing the number of segments or hosts? As the saying goes, breaking is not building.
It’s well known that segments in Greenplum are numbered by content (a segment number that is the same for a primary-mirror pair). In Fig. 1, the content value is a postfix after the seg identifier, the colors represent primary and mirror segments, and the mirroring type is group.
When a cluster is expanded with gpexpand, new segments are added to the cluster (content and dbid values are incremented and assigned to the new segments). In the interactive mode, gpexpand won’t allow breaking the available mirroring schemes and imposes requirements on the number of new hosts. To keep the mirroring strategy, gpexpand asks for two or more hosts (framed with a dotted line in Fig. 2).

There’s much more freedom if a user passes an expansion configuration file to gpexpand (the -i option), but then everything is on the user’s conscience.
If the utility were more advanced in terms of keeping the mirroring configuration, it would be possible to achieve a balanced expansion with just one host. It would require moving, for example, the [seg6, seg7, seg8] group from the sdw1 host to the new sdw4 host, while the newly introduced [seg9, seg10, seg11] group would be placed on sdw1.
This is a limiting factor for using gpexpand in the interactive mode. However, when the configuration file is assembled manually, the uniformity of the distribution can be altered (in the worst case, broken), for example, with a utility such as gpmovemirrors.
To redistribute data across an altered number of segments, the redistribution policy has to take the new segment count into account.
The hash computed for a row (more precisely, computed for the distribution keys of a table) is stored in the hash field of the CdbHash structure alongside the number of segments, numsegs:
typedef struct CdbHash
{
    uint32 hash;              /* The result hash value */
    int numsegs;              /* number of segments in Greenplum Database used for partitioning */
    CdbHashReduce reducealg;  /* the algorithm used for reducing to buckets */
    bool is_legacy_hash;      /* true if the legacy (modulo-based) hashing is in use */
    int natts;                /* number of attributes in the distribution key */
    FmgrInfo *hashfuncs;      /* hash functions for the distribution key attributes */
} CdbHash;
To get the final value that defines the source/target segment for a row, the hash is reduced to the required range by the cdbhashreduce function, which is responsible for computing the number of the target segment. Starting with version 6, this computation is based on the jump consistent hash algorithm.
This algorithm requires a contiguous sequence of "bucket" identifiers (in terms of the general algorithm description); in our case, the buckets are the segments with content values in [0, num_segments). As a consequence, the main limitation of this algorithm is that buckets can be added or removed only starting from the highest-numbered segments. Dropping a segment somewhere in the middle of a cluster is impossible in that paradigm.
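For reference, below is a sketch of the canonical jump consistent hash routine (Lamping and Veach, 2014), which is essentially what the reduction in cdbhashreduce boils down to in this mode; the function and type spellings here are ours and not copied from the Greenplum sources. The key property: when the bucket count shrinks from N to N-1, only the keys that lived in bucket N-1 move, and the surviving buckets keep all of their keys. That is precisely why only the highest-numbered segments can be dropped.

#include <stdint.h>

/*
 * Jump consistent hash: maps a 64-bit key to a bucket in [0, num_buckets).
 * Buckets can only be meaningfully added or removed at the end of the range;
 * keys assigned to the surviving buckets never move.
 */
static int32_t
jump_consistent_hash(uint64_t key, int32_t num_buckets)
{
    int64_t b = -1;
    int64_t j = 0;

    while (j < num_buckets)
    {
        b = j;
        key = key * 2862933555777941757ULL + 1;
        j = (int64_t) ((b + 1) * ((double) (1LL << 31) / (double) ((key >> 33) + 1)));
    }
    return (int32_t) b;
}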
All would be well, but these highest-numbered segments may reside on any hosts. What can that lead to? An unbalanced cluster (Fig. 4).
As a result, 3 primary segments remain on sdw1, 3 mirror segments on sdw3, and two groups of 3 segments each on sdw2. Doesn’t look very balanced, does it? The sdw3 host does nothing but receive WAL writes from sdw2.
In that case, the DBA would have to manually balance such a cluster. I doubt those who manage large clusters would be happy about it.
Our idea is to generalize the task of increasing or reducing the number of segments in such a way that the main process becomes rebalancing the target segment set across the required number of hosts. This opens up the following opportunities:
Reducing the number of segments without changing the number of hosts, for example, if an administrator decided to allocate part of the hardware resources to some other tasks, but changing the number of hosts wasn’t required.
Reducing both the number of segments and hosts. For example, to free up some hosts and allocate them to another cluster.
Rebalancing in the same configuration in terms of the number of segments and hosts, i.e. balancing a configuration that for some reason wasn’t balanced, taking into account the mirroring strategy.
Making the user’s life easier by giving them one utility that solves the complex task of scaling a cluster, in particular replacing gpexpand.
Those users who for some reason would want to keep using the current utilities can treat gpexpand and its "antipode" ggshrink as wrappers on top of the general gprebalance solution. Technically, they would use gprebalance.
There are also a couple of important possible improvements.
A possible way to optimize the table redistribution step under "jump consistent redistribution" when reducing the number of segments is to reject the CREATE TABLE AS SELECT approach and consider an INSERT SELECT implementation of the data movement from the source segments. Indeed, since the highest-numbered segments are the ones being decommissioned (which means the rows on the remaining segments stay where they are), we can get a presumably significant time advantage by redistributing only the data of the decommissioned segments instead of fully recreating the table. I think that as the work on this subtask moves on, we will share the results of some comparative performance tests with you.
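To make the difference concrete, here is a rough SQL sketch of the two approaches. The table name, distribution key, and the new segment count of 6 are made up for illustration, and the INSERT-based variant is purely conceptual: in reality the insert has to be performed under the new hash policy that no longer includes the decommissioned segments, which is exactly the non-trivial part of the implementation.

-- Full-rewrite redistribution (the CREATE TABLE AS SELECT style): every row is
-- rehashed and rewritten, even the rows whose target segment does not change.
CREATE TABLE sales_tmp AS SELECT * FROM sales DISTRIBUTED BY (id);
DROP TABLE sales;
ALTER TABLE sales_tmp RENAME TO sales;

-- Sketch of the INSERT SELECT alternative for shrinking: with jump consistent
-- hashing, rows already stored on the remaining segments keep their placement,
-- so only rows residing on the decommissioned (highest-numbered) segments need
-- to be moved. Here 6 is the assumed new segment count.
INSERT INTO sales SELECT * FROM sales WHERE gp_segment_id >= 6;
DELETE FROM sales WHERE gp_segment_id >= 6;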
The second point we wanted to highlight is that during the segment movement, a replica is turned off almost immediately and might be unavailable for a long time. It’s not a huge problem for a single segment recovery or movement, but, for the task of a full and potentially large-scale cluster rebalancing, it might be critical. We are also working in that direction.
As a conclusion, I’d like to list several theses that can be considered the main requirements for the utility that we call gprebalance for now:
The utility should provide the ability to balance a cluster.
The group and spread mirroring configurations should be supported.
Ideally, the utility should support balancing the mount points for the data directories. Otherwise, a user may be surprised to find that all segments with a specific mount point have ended up on the same node.
The utility should support rebalancing for a given list of hosts.
The utility should support modeling all actions "in memory" without really moving the data (dry run).
The utility should have the ability to present a segment movement plan as a list of movement actions between nodes.
The utility should support rolling back the changes done to a cluster.
The utility should support executing the movement operations either within a given time interval, or within a given duration.
Should the need arise to switch the primary role to the mirror and back, the utility (on its own or through some other utilities) should be able to do it automatically; alternatively, it should pause and ask the user for explicit confirmation that it is allowed to switch roles.
The utility should minimize the mirror’s downtime during segment movement operations.
The utility should support displaying the information about the rebalancing progress.
The utility should handle possible segment failover and node failure situations.