Distribution
In Greengage DB, tables are distributed across cluster segments to enable parallel processing and improve query performance. When creating or altering a table, you can specify one of the available distribution policies to define how table rows are distributed across segments.
For optimal performance, both data and workloads should be distributed evenly across segments. This ensures that all segments start and finish their tasks simultaneously. The more evenly the data is distributed, the better the overall cluster performance.
Overview
The diagram below illustrates how rows from the customers and orders tables might be distributed across four segments. Both tables are distributed using the customer ID as the distribution key. As a result, operations such as joins, sorts, and aggregations can be performed locally on each segment, reducing data movement and improving query performance.
┌──────────────── Segment 0 ─────────────┐  ┌──────────────── Segment 1 ─────────────┐
│ customers (id, customer_name)          │  │ customers (id, customer_name)          │
│ 1   │ 'Andrew Fuller'                  │  │ 11  │ 'Robert King'                    │
│ 2   │ 'Michael Suyama'                 │  │ 12  │ 'Laura Callahan'                 │
│ ...                                    │  │ ...                                    │
│ orders (id, customer_id, product_name) │  │ orders (id, customer_id, product_name) │
│ 101 │ 1  │ 'Laptop'                    │  │ 303 │ 11 │ 'Keyboard'                  │
│ 106 │ 1  │ 'Smartphone'                │  │ 305 │ 11 │ 'Headphones'                │
│ 110 │ 2  │ 'Tablet'                    │  │ 311 │ 12 │ 'Smartwatch'                │
│ 117 │ 2  │ 'Monitor'                   │  │ 315 │ 12 │ 'Laptop'                    │
│ ...                                    │  │ ...                                    │
└────────────────────────────────────────┘  └────────────────────────────────────────┘
┌──────────────── Segment 2 ─────────────┐  ┌──────────────── Segment 3 ─────────────┐
│ customers (id, customer_name)          │  │ customers (id, customer_name)          │
│ 21  │ 'Nancy Davolio'                  │  │ 31  │ 'Steven Buchanan'                │
│ 22  │ 'Janet Leverling'                │  │ 32  │ 'Margaret Peacock'               │
│ ...                                    │  │ ...                                    │
│ orders (id, customer_id, product_name) │  │ orders (id, customer_id, product_name) │
│ 502 │ 21 │ 'Smartphone'                │  │ 701 │ 31 │ 'Smartphone'                │
│ 504 │ 21 │ 'Speakers'                  │  │ 705 │ 31 │ 'Laptop'                    │
│ 508 │ 22 │ 'Printer'                   │  │ 712 │ 32 │ 'Keyboard'                  │
│ 511 │ 22 │ 'Tablet'                    │  │ 717 │ 32 │ 'Smartwatch'                │
│ ...                                    │  │ ...                                    │
└────────────────────────────────────────┘  └────────────────────────────────────────┘
Distribution policies
To specify the distribution policy, use the DISTRIBUTED clause in the CREATE TABLE command:
CREATE TABLE <table_name> (
<column_definitions>
)
[
-- Hash distribution
DISTRIBUTED BY (<column_name> [<opclass>] [, ... ] )
-- Random distribution
| DISTRIBUTED RANDOMLY
-- Replicated distribution
| DISTRIBUTED REPLICATED
];
- Hash distribution
  Distributes data by computing a hash of one or more columns defined as the distribution key. For optimal data balance, use the table’s primary key or a unique column (or set of columns) as the distribution key. If there is no suitable column, consider using random distribution, which distributes rows across segments randomly.
  NOTE: You can specify a custom hash function for hash distribution using the opclass argument, which defines a hash operator class. A custom hash operator class lets you use a different comparison method instead of the default equality operator (=) in hash-based operations. This might be useful for supporting co-located joins on custom operators, such as case-insensitive comparisons or domain-specific equality definitions. See the CREATE OPERATOR CLASS page in the PostgreSQL documentation to learn how to define a new operator class.
- Random distribution
  Distributes data across segments randomly. Because segment selection is random, data distribution may become uneven over time. Recommended for tables without unique columns, especially when the table is expected to grow significantly.
- Replicated distribution
  Stores a complete copy of the table on every cluster segment. Suitable for small tables, such as reference tables. Using replicated distribution can help eliminate data motion during joins.
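The opclass argument mentioned in the note above might be used as sketched below. This is a hypothetical example: it assumes a case-insensitive hash operator class named text_ci_hash_ops has already been defined with CREATE OPERATOR CLASS ... USING hash; the class name and table are illustrative, not part of the product.

```sql
-- Hypothetical: text_ci_hash_ops is a user-defined hash operator class
-- that treats values differing only in case as equal, so rows that match
-- under case-insensitive comparison land on the same segment.
CREATE TABLE users
(
    user_id INTEGER,
    email   TEXT
)
DISTRIBUTED BY (email text_ci_hash_ops);
```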
Always explicitly define the distribution policy when creating a table. If the DISTRIBUTED clause is omitted, Greengage DB assigns a hash distribution policy to the table using either the PRIMARY KEY (if defined) or the first column of the table as the distribution key. This behavior is controlled by the gp_create_table_random_default_distribution configuration parameter, which is set to OFF by default.
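As an illustration of this default behavior, the hypothetical products table below is created without a DISTRIBUTED clause, so its primary key is expected to become the distribution key (assuming gp_create_table_random_default_distribution is OFF):

```sql
-- No DISTRIBUTED clause: with the default configuration,
-- the PRIMARY KEY (product_id) is used as the distribution key.
CREATE TABLE products
(
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT
);
-- \d products should then report: Distributed by: (product_id)
```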
Recommendations on choosing a distribution policy
Consider the points below when deciding on a table distribution policy.
Even data distribution
To ensure optimal query performance, data should be evenly distributed across all segments in the cluster. Uneven distribution or data skew can lead to performance degradation as segments with more data perform more work during query execution. Choose a distribution key that is unique for each record, such as the primary key.
Avoid using the following types of non-uniformly distributed columns as distribution keys:
- Dates and timestamps, which often concentrate around specific time periods.
- Columns that contain a lot of NULL values, which can concentrate data on fewer segments.
- Columns with custom data types, which may have unpredictable distribution characteristics.
Avoid using columns with potentially large data sizes (such as TEXT) as distribution keys. Hashing these columns can be computationally expensive, which may degrade performance during data insertion and join operations.
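As a sketch of this recommendation, a hypothetical articles table can carry a compact surrogate key for distribution instead of hashing its large text column:

```sql
CREATE TABLE articles
(
    article_id BIGINT,  -- compact surrogate key: cheap to hash
    body       TEXT     -- large column: avoid using it as the distribution key
)
DISTRIBUTED BY (article_id);
```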
Local and distributed operations
Local operations are more efficient than distributed ones. Query performance improves when joins, sorts, and aggregations are executed locally on each segment. Cluster-wide operations require redistributing tuples across segments, which introduces overhead. When tables use the same distribution key, joins and sorts on that column can be performed locally. In contrast, tables with a random distribution policy cannot benefit from local join operations.
Even query processing
For the best performance, all segments should handle an equal share of the query workload.
Query workload can be skewed if a table’s data distribution policy and the query predicates are not well matched. For example, suppose the sales table is distributed by product_id. If a query filters on a specific product_id in the WHERE clause to analyze sales performance for that product, the query workload is concentrated on a single segment. Columns that are frequently used in WHERE clauses are often good candidates for partition key columns.
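The sales example above might be sketched as follows (the table definition and the product ID value are hypothetical):

```sql
CREATE TABLE sales
(
    sale_id    BIGINT,
    product_id INTEGER,
    amount     DECIMAL(8, 2)
)
DISTRIBUTED BY (product_id);

-- All rows for one product reside on a single segment,
-- so the work for this query is concentrated there:
SELECT SUM(amount)
FROM sales
WHERE product_id = 42;
```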
Specify a distribution policy
Hash distribution
The query below shows how to create a new append-optimized customers table whose rows are hash-distributed by the customer_id column:
CREATE TABLE customers
(
customer_id INTEGER,
customer_name VARCHAR(25)
)
WITH (appendoptimized = true)
DISTRIBUTED BY (customer_id);
INSERT INTO customers (customer_id, customer_name)
SELECT customer_number AS customer_id,
'customer' || customer_number AS name
FROM generate_series(1, 1000) AS customer_number;
The next query creates the orders table with a customer_id column, which is also used as the distribution key. In this case, joining the customers and orders tables on this column can be performed locally:
CREATE TABLE orders
(
order_id INTEGER,
customer_id INTEGER,
amount DECIMAL(6, 2)
)
WITH (appendoptimized = true)
DISTRIBUTED BY (customer_id);
INSERT INTO orders (order_id, customer_id, amount)
SELECT order_number AS order_id,
FLOOR(RANDOM() * 1000 + 1)::INTEGER AS customer_id,
ROUND((100 + RANDOM() * 1000)::NUMERIC, 2) AS amount
FROM generate_series(1, 100000) AS order_number;
Random distribution
The query below shows how to create the events table whose rows are distributed randomly:
CREATE TABLE events
(
event_id SERIAL,
timestamp TIMESTAMP NOT NULL,
user_id INTEGER NOT NULL,
event_type VARCHAR(255) NOT NULL
)
WITH (appendoptimized = true)
DISTRIBUTED RANDOMLY;
INSERT INTO events (timestamp, user_id, event_type)
SELECT NOW() - (INTERVAL '7 days' * random()) AS timestamp,
(random() * 10)::INTEGER + 1 AS user_id,
(ARRAY ['click', 'launch', 'purchase', 'logout'])[floor(random() * 4 + 1)] AS event_type
FROM generate_series(1, 4000);
Replicated distribution
The following query demonstrates how to create the pickup_points table replicated across all segments of the cluster:
CREATE TABLE pickup_points
(
point_id INTEGER,
address TEXT,
phone_number TEXT
)
WITH (appendoptimized = true)
DISTRIBUTED REPLICATED;
INSERT INTO pickup_points (point_id, address, phone_number)
SELECT point_id_seq,
'Address ' || point_id_seq,
'123-456-' || LPAD(point_id_seq::TEXT, 4, '0')
FROM generate_series(1, 100) AS point_id_seq;
View distribution information
View table distribution settings
To view distribution options of the specified table, use the \d meta-command:
\d orders
In the output below, the Distributed by section shows that the orders table is distributed by the customer_id column:
Append-Only Table "public.orders"
   Column    |     Type     | Modifiers
-------------+--------------+-----------
 order_id    | integer      |
 customer_id | integer      |
 amount      | numeric(6,2) |
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: t
Distributed by: (customer_id)
Check data distribution
To verify that data is distributed uniformly after both initial and incremental loads, you can use one of several methods, for example:
- The get_ao_distribution() function shows the distribution of rows in an append-optimized table:

  SELECT get_ao_distribution('orders');
The result shows how many rows are on each segment:
 get_ao_distribution
---------------------
 (1,24828)
 (0,25823)
 (3,24064)
 (2,25285)
- The gp_toolkit.gp_skew_coefficients view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment:

  SELECT skcrelname, skccoeff
  FROM gp_toolkit.gp_skew_coefficients;
The result might look like this:
  skcrelname   |        skccoeff
---------------+-------------------------
 customers     | 2.865891368027290280000
 orders        | 2.979247332241260000000
 events        | 2.222611077089286890000
 pickup_points | 0.000000000000000000000
The lower the value, the better. Higher values indicate greater data skew.
- The gp_toolkit.gp_skew_idle_fractions view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, which is an indicator of computational skew:

  SELECT sifrelname, siffraction
  FROM gp_toolkit.gp_skew_idle_fractions;
The result might look like this:
  sifrelname   |      siffraction
---------------+------------------------
 customers     | 0.02723735408560311284
 orders        | 0.03187081284126553847
 events        | 0.02439024390243902439
 pickup_points | 0.00000000000000000000
For example, a value of 0.02 indicates 2% skew. Tables that have more than 10% skew should have their distribution policies evaluated.
- The pg_catalog.gp_segment_configuration table contains information about the cluster segments. The gp_segment_id system column of a user table corresponds to the content column of the gp_segment_configuration table and can be used to get a row count for each segment. The following query shows the number of rows per segment as well as the percentage difference between the maximum and minimum row counts:

  SELECT 'orders' AS "Table name",
         MAX(row_count) AS "Max seg rows",
         MIN(row_count) AS "Min seg rows",
         (MAX(row_count) - MIN(row_count)) * 100.0 / MAX(row_count) AS "Percentage difference between Max & Min"
  FROM (SELECT COUNT(*) AS row_count, gp_segment_id
        FROM orders
        GROUP BY gp_segment_id) AS seg_row_counts;
The result might look like this:
 Table name | Max seg rows | Min seg rows | Percentage difference between Max & Min
------------+--------------+--------------+-----------------------------------------
 orders     |        25781 |        24056 |                      6.6909739730809511
NOTE: A query that references the gp_segment_id system column on a replicated table fails because Greengage DB does not allow queries to reference system columns of replicated tables.
Change a distribution policy
You can change the distribution policy using the ALTER TABLE command. To specify a new distribution policy, use the SET DISTRIBUTED clause:
ALTER TABLE pickup_points
SET DISTRIBUTED BY (point_id);
You can also redistribute table data using the current distribution policy. To do this, set the reorganize parameter to true in the SET WITH clause:
ALTER TABLE orders
SET WITH (reorganize = true);
This might be useful to correct data skew or when new segments have been added to the cluster.
Changing a table’s distribution redistributes its data across the segments. This operation can be resource-intensive. Run these commands during a maintenance window for large tables to avoid service downtime.
Example: Incorrect distribution choice
This section illustrates how an inappropriate distribution choice can negatively impact the performance of an analytical query that involves a join.
Recommended distribution
This query joins the customers and orders tables on the common distribution key customer_id, aggregates the total purchase amount per customer, and returns the top 5 customers by total purchases:
SELECT c.customer_id,
c.customer_name AS customer_name,
SUM(o.amount) AS total_purchases
FROM customers c
JOIN orders o USING (customer_id)
GROUP BY c.customer_id, c.customer_name
ORDER BY total_purchases DESC
LIMIT 5;
EXPLAIN provides a detailed breakdown of the query execution plan:
EXPLAIN (COSTS OFF)
SELECT c.customer_id,
c.customer_name AS customer_name,
SUM(o.amount) AS total_purchases
FROM customers c
JOIN orders o USING (customer_id)
GROUP BY c.customer_id, c.customer_name
ORDER BY total_purchases DESC
LIMIT 5;
The command output indicates that no data motion is required for this query, except for Gather Motion. This is the typical operation in which segments return their results to the master:
                                 QUERY PLAN
----------------------------------------------------------------------------
 Limit
   ->  Gather Motion 4:1  (slice1; segments: 4)
         Merge Key: (sum(o.amount))
         ->  Limit
               ->  Sort
                     Sort Key: (sum(o.amount))
                     ->  HashAggregate
                           Group Key: c.customer_id, c.customer_name
                           ->  Hash Join
                                 Hash Cond: (o.customer_id = c.customer_id)
                                 ->  Seq Scan on orders o
                                 ->  Hash
                                       ->  Seq Scan on customers c
 Optimizer: Pivotal Optimizer (GPORCA)
(14 rows)
Incorrect distribution
Change the distribution key of the orders table from customer_id to order_id:
ALTER TABLE orders
SET DISTRIBUTED BY (order_id);
Run EXPLAIN again:
EXPLAIN (COSTS OFF)
SELECT c.customer_id,
c.customer_name AS customer_name,
SUM(o.amount) AS total_purchases
FROM customers c
JOIN orders o USING (customer_id)
GROUP BY c.customer_id, c.customer_name
ORDER BY total_purchases DESC
LIMIT 5;
In comparison with the previous example, two additional motion operations are involved:
- Redistribute Motion moves tuples between the segments to complete the join. This is required because the customers table is distributed by customer_id, while the orders table is distributed by order_id.
- Broadcast Motion sends a copy of the table to all segments, as they do not contain the necessary data for the join. While this is not a concern for small tables, it can lead to performance issues when joining large tables.
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Limit
   ->  Gather Motion 4:1  (slice3; segments: 4)
         Merge Key: (pg_catalog.sum((sum(o.amount))))
         ->  Limit
               ->  Sort
                     Sort Key: (pg_catalog.sum((sum(o.amount))))
                     ->  HashAggregate
                           Group Key: c.customer_id, c.customer_name
                           ->  Redistribute Motion 4:4  (slice2; segments: 4)
                                 Hash Key: c.customer_id, c.customer_name
                                 ->  Result
                                       ->  HashAggregate
                                             Group Key: c.customer_id, c.customer_name
                                             ->  Hash Join
                                                   Hash Cond: (o.customer_id = c.customer_id)
                                                   ->  Seq Scan on orders o
                                                   ->  Hash
                                                         ->  Broadcast Motion 4:4  (slice1; segments: 4)
                                                               ->  Seq Scan on customers c
 Optimizer: Pivotal Optimizer (GPORCA)
(20 rows)
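To return to the co-located layout used at the start of this example, you can switch the orders table back to the customer_id distribution key:

```sql
ALTER TABLE orders
SET DISTRIBUTED BY (customer_id);
```

Because the new key differs from the current one, this command redistributes the table's rows across the segments, so for large tables it is best run during a maintenance window.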