
Use gpload

Anton Monakov

The Greengage DB gpload utility loads data into database tables via the Greenplum Parallel File Server (gpfdist). gpload operates based on a load specification defined in a YAML-formatted control file. It invokes gpfdist, creates an external table definition based on the source data defined in the control file, and runs an INSERT, UPDATE, or MERGE operation to load the source data into the target table in the database.

The gpload operation, including any additional SQL commands specified in the SQL section of the YAML control file, is performed as a single transaction to prevent inconsistent data when performing simultaneous load operations on a target table.

NOTE

gpfdist and gpload are compatible only with the Greengage DB major version in which they are shipped (for example, 6.x).

Requirements

The client machine where gpload is run must meet the following requirements:

  • Python 2.6.2 or later, with pygresql (the Python interface to PostgreSQL) and pyyaml libraries installed.

    Note that Python and the required libraries are included with the Greengage DB server installation. If you run gpload on the machine where Greengage DB is installed, you do not need a separate Python installation. gpload for Windows supports only Python 2.5 (available from python.org).

  • The gpfdist parallel file distribution utility installed. The utility is located in the $GPHOME/bin directory of the Greengage DB master host and each segment host. To install it on another host, copy it to the desired location on that host and add this location to PATH.

  • Network access to and from all hosts in your Greengage DB cluster (master and segments).

  • Network access to and from the hosts where the external data to be loaded resides (ETL servers).
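A quick way to check the Python-side requirements is to probe whether the needed modules are importable. The snippet below is a minimal sketch; the module names are assumptions (pyyaml imports as `yaml`, and pygresql has historically been importable as `pg` or `pgdb`):

```python
import importlib

def has_module(name):
    """Return True if the named module can be imported."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

# pyyaml imports as 'yaml'; pygresql has historically imported as 'pg' or 'pgdb'
for mod in ("yaml", "pg", "pgdb"):
    print(mod, "available" if has_module(mod) else "MISSING")
```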

Syntax overview

The gpload command’s general syntax is provided below. See the gpload command reference for the complete options description.

gpload -f <control_file>
       [ -l <log_file> ]
       [ -h <hostname> ]
       [ -p <port> ]
       [ -U <username> ]
       [ -d <database> ]
       [ -W ]
       [ --gpfdist_timeout <seconds> ]
       [ --no_auto_trans ]
       [ --max_retries <retry_times> ]
       [ [ -v | -V ] [ -q ] ]
       [ -D ]

gpload -?

gpload --version

Control file

The control file orchestrates the loading process. It specifies the Greengage DB connection information, gpfdist configuration, external table options, and the source data format.

The file uses the YAML 1.1 format with its own schema for defining the steps of a Greengage DB load operation and must be a valid YAML document. The gpload program processes the control file in the specified order and uses indentation (spaces) to determine the document hierarchy and the relationships of the sections to one another.

NOTE

The use of whitespace in a YAML document is significant. Whitespace should not be used for formatting purposes, and tabs should not be used at all.
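The tab restriction is enforced by the YAML parser itself, so a control file that uses tabs for indentation fails before the load even starts. A minimal illustration with pyyaml (the parser gpload requires):

```python
import yaml  # pyyaml

good = "GPLOAD:\n  INPUT: []\n"   # two spaces of indentation
bad = "GPLOAD:\n\tINPUT: []\n"    # a tab instead of spaces

print(yaml.safe_load(good))  # {'GPLOAD': {'INPUT': []}}

try:
    yaml.safe_load(bad)
    tab_rejected = False
except yaml.YAMLError:
    tab_rejected = True
print("tab rejected:", tab_rejected)  # tab rejected: True
```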

The structure of a load control file looks as follows. See Control file format for the complete description of keys.

---
VERSION: 1.0.0.1
DATABASE: <db_name>
USER: <db_username>
HOST: <master_hostname>
PORT: <master_port>
GPLOAD:
  INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - <hostname_or_ip>
        PORT: <http_port>
      | PORT_RANGE: [ <start_port_range>, <end_port_range> ]
        FILE:
          - </path/to/input_file>
        SSL: true | false
        CERTIFICATES_PATH: </path/to/certificates>
    - FULLY_QUALIFIED_DOMAIN_NAME: true | false
    - COLUMNS:
        - <field_name>: <data_type>
    - TRANSFORM: '<transformation>'
    - TRANSFORM_CONFIG: '<configuration-file-path>'
    - MAX_LINE_LENGTH: <integer>
    - FORMAT: text | csv
    - DELIMITER: '<delimiter_character>'
    - ESCAPE: '<escape_character>' | 'OFF'
    - NEWLINE: 'LF' | 'CR' | 'CRLF'
    - NULL_AS: '<null_string>'
    - FILL_MISSING_FIELDS: true | false
    - FORCE_NOT_NULL: <column_name> [, ...]
    - QUOTE: '<csv_quote_character>'
    - HEADER: true | false
    - ENCODING: <database_encoding>
    - ERROR_LIMIT: <integer>
    - LOG_ERRORS: true | false
  EXTERNAL:
    - SCHEMA: <schema> | '%'
  OUTPUT:
    - TABLE: <schema.table_name>
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
        - <target_column_name>
    - UPDATE_COLUMNS:
        - <target_column_name>
    - UPDATE_CONDITION: '<boolean_condition>'
    - MAPPING:
        <target_column_name>: <source_column_name> | '<expression>'
  PRELOAD:
    - TRUNCATE: true | false
    - REUSE_TABLES: true | false
    - STAGING_TABLE: <external_table_name>
    - FAST_MATCH: true | false
  SQL:
    - BEFORE: "<sql_command>"
    - AFTER: "<sql_command>"
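Because gpload fails on malformed YAML, it can be useful to round-trip a control file through pyyaml before running the load. The sketch below only checks that the document parses and that the top-level sections used on this page are present; gpload itself performs stricter validation of keys and values:

```python
import yaml  # pyyaml, which gpload requires

control = """\
VERSION: 1.0.0.1
DATABASE: customers
USER: gpadmin
HOST: mdw
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        FILE:
          - /tmp/orders.csv
    - FORMAT: csv
  OUTPUT:
    - TABLE: orders
    - MODE: insert
"""

doc = yaml.safe_load(control)  # raises yaml.YAMLError on malformed input
assert "GPLOAD" in doc, "missing GPLOAD section"
print(sorted(doc["GPLOAD"]))  # ['INPUT', 'OUTPUT']
```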
NOTE

The MERGE and UPDATE operations are not supported if the target table or column name is a reserved keyword, contains capital letters, or is enclosed in double quotes (" "). If a database object name was created using a double-quoted (delimited) identifier, you must specify the delimited name within single quotes (') in the gpload control file.

Consider the following table:

CREATE TABLE "MyTable" ("MyColumn" text);

In this case, the load control file must refer to this table and column names as follows:

...
  - COLUMNS:
    - '"MyColumn"': text
OUTPUT:
  - TABLE: public.'"MyTable"'
...

Run gpload

  1. Create a YAML-formatted load control file.

  2. Run the gpload utility using the -f option to pass in the created load control file, for example:

    $ gpload -f example_load.yml

Examples

These examples illustrate loading data into a database table from an external file by using the INSERT, UPDATE, and MERGE methods.

Prerequisites

To try out the practical examples listed in this section, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create a customers test database, connect to it, and create two tables, orders and audit, as follows:

CREATE DATABASE customers;
\c customers

CREATE TABLE orders (
    id INTEGER,
    name VARCHAR,
    price NUMERIC
);

CREATE TABLE audit (
    event VARCHAR,
    timestamp TIMESTAMP
);

INSERT method

This example demonstrates loading data into a database table by using the INSERT method.

  1. Empty the orders table and populate it with data as follows:

    TRUNCATE TABLE orders;
    
    INSERT INTO orders (id, name, price)
    VALUES
        (1,'Laptop',999.99),
        (2,'Smartphone',499.99),
        (3,'Tablet',299.99);
  2. Empty the auxiliary audit table:

    TRUNCATE TABLE audit;
  3. In the /tmp directory on the master host, create the orders.csv file containing the source data to load:

    $ cat > /tmp/orders.csv <<EOF
    id,name,price
    4,Monitor,599.99
    5,Keyboard,99.99
    EOF
  4. On the master host, create the load.yml load control file:

    $ cat > load.yml <<EOF
    ---
    VERSION: 1.0.0.1
    DATABASE: customers
    USER: gpadmin
    HOST: mdw
    PORT: 5432
    GPLOAD:
      INPUT:
        - SOURCE:
            PORT: 8081
            FILE:
              - /tmp/orders.csv
        - COLUMNS:
            - id: INTEGER
            - name: VARCHAR
            - price: NUMERIC
        - FORMAT: CSV
        - HEADER: true
      OUTPUT:
        - TABLE: orders
        - MODE: INSERT
      SQL:
        - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
        - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
    EOF
  5. Run the gpload utility, passing in the created load control file. This will insert new rows into the orders table. The audit table will contain timestamps marking the loading start and end.

    $ gpload -f load.yml

    The output should look as follows:

    INFO|gpload session started
    INFO|setting schema 'public' for table 'orders'
    INFO|started gpfdist -p 8081 -P 9000 -f "/tmp/orders.csv" -t 30
    INFO|running time: 0.10 seconds
    INFO|rows Inserted          = 2
    INFO|rows Updated           = 0
    INFO|data formatting errors = 0
    INFO|gpload succeeded
  6. Query the orders table:

    SELECT * FROM orders;

    The output should look as follows, with rows 4 and 5 inserted into the table:

     id |    name    | price
    ----+------------+--------
      1 | Laptop     | 999.99
      2 | Smartphone | 499.99
      3 | Tablet     | 299.99
      4 | Monitor    | 599.99
      5 | Keyboard   |  99.99
  7. Query the audit table:

    SELECT * FROM audit;

    The output should contain the timestamps marking the loading start and end:

     event |         timestamp
    -------+----------------------------
     start | 2025-05-25 12:45:47.931476
     end   | 2025-05-25 12:45:50.134581
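In the control file above, FORMAT: CSV and HEADER: true tell gpload to parse the source as comma-separated values and to take the column names from the first line. The same interpretation in plain Python (an illustration of the input handling, not gpload internals):

```python
import csv
import io

# The contents of /tmp/orders.csv from the example above
data = "id,name,price\n4,Monitor,599.99\n5,Keyboard,99.99\n"

# HEADER: true -> the first line supplies the column names
reader = csv.DictReader(io.StringIO(data))
rows = [(int(r["id"]), r["name"], float(r["price"])) for r in reader]
print(rows)  # [(4, 'Monitor', 599.99), (5, 'Keyboard', 99.99)]
```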

UPDATE method

This example demonstrates loading data into a database table by using the UPDATE method.

  1. Empty the orders table and populate it with data as follows:

    TRUNCATE TABLE orders;
    
    INSERT INTO orders (id, name, price)
    VALUES
        (1,'Laptop',999.99),
        (2,'Smartphone',499.99),
        (3,'Tablet',299.99),
        (4,'Monitor',599.99),
        (5,'Keyboard', 99.99);
  2. Empty the auxiliary audit table:

    TRUNCATE TABLE audit;
  3. In the /tmp directory on the master host, create the orders.csv file containing the source data to load. Note that these rows already exist in the orders table but carry different values in the price column.

    $ cat > /tmp/orders.csv <<EOF
    id,name,price
    1,Laptop,1299.99
    4,Monitor,899.99
    5,Keyboard,59.99
    EOF
  4. On the master host, create a load.yml load control file:

    $ cat > load.yml <<EOF
    ---
    VERSION: 1.0.0.1
    DATABASE: customers
    USER: gpadmin
    HOST: mdw
    PORT: 5432
    GPLOAD:
      INPUT:
        - SOURCE:
            FILE:
              - /tmp/orders.csv
        - COLUMNS:
            - id: integer
            - name: text
            - price: numeric
        - FORMAT: CSV
        - HEADER: true
      OUTPUT:
        - TABLE: orders
        - MODE: UPDATE
        - MATCH_COLUMNS:
          - id
        - UPDATE_COLUMNS:
          - price
      SQL:
        - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
        - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
    EOF
  5. Run the gpload utility, passing in the created load control file. This will update the values in the price column for the rows where the id column values in the orders table match those in the orders.csv file. The audit table will contain timestamps marking the loading start and end.

    $ gpload -f load.yml

    The output should look as follows:

    INFO|gpload session started
    INFO|setting schema 'public' for table 'orders'
    INFO|started gpfdist -p 8000 -P 9000 -f "/tmp/orders.csv" -t 30
    INFO|running time: 0.13 seconds
    INFO|rows Inserted          = 0
    INFO|rows Updated           = 3
    INFO|data formatting errors = 0
    INFO|gpload succeeded
  6. Query the orders table:

    SELECT * FROM orders;

    The output should look as follows, with rows 1, 4, and 5 updated:

     id |    name    |  price
    ----+------------+---------
      1 | Laptop     | 1299.99
      2 | Smartphone |  499.99
      3 | Tablet     |  299.99
      4 | Monitor    |  899.99
      5 | Keyboard   |   59.99
  7. Query the audit table:

    SELECT * FROM audit;

    The output should contain the timestamps marking the loading start and end:

     event |         timestamp
    -------+----------------------------
     start | 2025-05-25 16:24:47.931476
     end   | 2025-05-25 16:24:50.134581

MERGE method

This example demonstrates loading data into a database table by using the MERGE method.

  1. Empty the orders table and populate it with data as follows:

    TRUNCATE TABLE orders;
    
    INSERT INTO orders (id, name, price)
    VALUES
        (1,'Laptop',999.99),
        (2,'Smartphone',499.99),
        (3,'Tablet',299.99),
        (4,'Monitor',599.99),
        (5,'Keyboard', 99.99);
  2. Empty the auxiliary audit table:

    TRUNCATE TABLE audit;
  3. In the /tmp directory on the master host, create the orders.csv file containing the source data to load. Rows 2, 4, and 5 already exist in the orders table but carry different price values; rows 6 and 7 are new.

    $ cat > /tmp/orders.csv <<EOF
    id,name,price
    1,Laptop,999.99
    2,Smartphone,799.99
    3,Tablet,299.99
    4,Monitor,849.99
    5,Keyboard,59.99
    6,Webcam,29.99
    7,Speakers,129.99
    EOF
  4. On the master host, create a load.yml load control file:

    $ cat > load.yml <<EOF
    ---
    VERSION: 1.0.0.1
    DATABASE: customers
    USER: gpadmin
    HOST: mdw
    PORT: 5432
    GPLOAD:
      INPUT:
        - SOURCE:
            FILE:
              - /tmp/orders.csv
        - COLUMNS:
            - id: integer
            - name: text
            - price: numeric
        - FORMAT: CSV
        - HEADER: true
      OUTPUT:
        - TABLE: orders
        - MODE: MERGE
        - MATCH_COLUMNS:
          - id
        - UPDATE_COLUMNS:
          - price
      SQL:
        - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
        - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
    EOF
  5. Run the gpload utility, passing in the created load control file. This will update the values in the price column for the rows where the id column values in the orders table match those in the orders.csv file and insert new rows. The audit table will contain timestamps marking the loading start and end.

    $ gpload -f load.yml

    The output should look as follows:

    INFO|gpload session started
    INFO|setting schema 'public' for table 'orders'
    INFO|started gpfdist -p 8000 -P 9000 -f "/tmp/orders.csv" -t 30
    INFO|running time: 0.14 seconds
    INFO|rows Inserted          = 2
    INFO|rows Updated           = 5
    INFO|data formatting errors = 0
    INFO|gpload succeeded
  6. Query the orders table:

    SELECT * FROM orders;

    The output should look as follows, with rows 2, 4, and 5 updated and rows 6 and 7 inserted:

     id |    name    | price
    ----+------------+--------
      1 | Laptop     | 999.99
      2 | Smartphone | 799.99
      3 | Tablet     | 299.99
      4 | Monitor    | 849.99
      5 | Keyboard   |  59.99
      6 | Webcam     |  29.99
      7 | Speakers   | 129.99
  7. Query the audit table:

    SELECT * FROM audit;

    The output should contain the timestamps marking the loading start and end:

     event |         timestamp
    -------+----------------------------
     start | 2025-05-25 15:24:47.931476
     end   | 2025-05-25 15:24:50.134581
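The merge semantics used above can be sketched in plain Python: rows are matched on MATCH_COLUMNS, matched rows have their UPDATE_COLUMNS values replaced, and unmatched rows are inserted. This is illustrative only; gpload performs the equivalent logic inside the database in a single transaction:

```python
# orders before the load: id -> price
target = {1: 999.99, 2: 499.99, 3: 299.99, 4: 599.99, 5: 99.99}
# rows from orders.csv: (id, price)
source = [(1, 999.99), (2, 799.99), (3, 299.99), (4, 849.99),
          (5, 59.99), (6, 29.99), (7, 129.99)]

inserted = updated = 0
for order_id, price in source:
    if order_id in target:        # MATCH_COLUMNS: id
        target[order_id] = price  # UPDATE_COLUMNS: price
        updated += 1
    else:                         # no match -> insert the new row
        target[order_id] = price
        inserted += 1

print("inserted =", inserted, "updated =", updated)  # inserted = 2 updated = 5
```

This matches the log output in step 5: rows Inserted = 2 and rows Updated = 5, since every matched row counts as updated even when its price is unchanged.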