
Use gpload

Anton Monakov

The Greengage DB gpload utility loads data into database tables via the Greenplum Parallel File Server (gpfdist). gpload operates based on a load specification defined in a YAML-formatted control file. It invokes gpfdist, creates an external table definition based on the source data defined in the control file, and runs an INSERT, UPDATE, or MERGE operation to load the source data into the target table in the database.

The gpload operation, including any additional SQL commands specified in the SQL section of the YAML control file, is performed as a single transaction to prevent inconsistent data when performing simultaneous load operations on a target table.

NOTE

gpfdist and gpload are compatible only with the Greengage DB major version in which they are shipped (for example, 6.x).

Requirements

The client machine where gpload is run must meet the following requirements:

  • Python 2.6.2 or later, with pygresql (the Python interface to PostgreSQL) and pyyaml libraries installed.

    Note that Python and the required libraries are included with the Greengage DB server installation. If you run gpload on the machine where Greengage DB is installed, you do not need a separate Python installation. gpload for Windows supports only Python 2.5 (available from python.org).

  • The gpfdist parallel file distribution utility installed. The utility is located in the $GPHOME/bin directory of the Greengage DB master host and each segment host. To install it on another host, copy it to the desired location on that host and add this location to PATH.

  • Network access to and from all hosts in your Greengage DB cluster (master and segments).

  • Network access to and from the hosts where the external data to be loaded resides (ETL servers).
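A quick way to check the Python-side requirements is to probe whether the needed modules are importable. The snippet below is a minimal sketch; the module names are assumptions (pyyaml imports as `yaml`, and pygresql has historically been importable as `pg` or `pgdb`):

```python
import importlib

def has_module(name):
    """Return True if the named module can be imported."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

# pyyaml imports as 'yaml'; pygresql has historically imported as 'pg' or 'pgdb'
for mod in ("yaml", "pg", "pgdb"):
    print(mod, "available" if has_module(mod) else "MISSING")
```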

Syntax overview

The gpload command’s general syntax is provided below. See the gpload command reference for the complete options description.

gpload -f <control_file>
       [ -l <log_file> ]
       [ -h <hostname> ]
       [ -p <port> ]
       [ -U <username> ]
       [ -d <database> ]
       [ -W ]
       [ --gpfdist_timeout <seconds> ]
       [ --no_auto_trans ]
       [ --max_retries <retry_times> ]
       [ [ -v | -V ] [ -q ] ]
       [ -D ]

gpload -?

gpload --version

Control file

The control file orchestrates the loading process. It specifies the Greengage DB connection information, gpfdist configuration, external table options, and the source data format.

The file uses the YAML 1.1 format with its own schema for defining the steps of a Greengage DB load operation and must be a valid YAML document. The gpload program processes the control file in the specified order and uses indentation (spaces) to determine the document hierarchy and the relationships of the sections to one another.

NOTE

The use of whitespace in a YAML document is significant. Whitespace should not be used for formatting purposes, and tabs should not be used at all.
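The tab restriction is enforced by the YAML parser itself, so a control file that uses tabs for indentation fails before the load even starts. A minimal illustration with pyyaml (the parser gpload requires):

```python
import yaml  # pyyaml

good = "GPLOAD:\n  INPUT: []\n"   # two spaces of indentation
bad = "GPLOAD:\n\tINPUT: []\n"    # a tab instead of spaces

print(yaml.safe_load(good))  # {'GPLOAD': {'INPUT': []}}

try:
    yaml.safe_load(bad)
    tab_rejected = False
except yaml.YAMLError:
    tab_rejected = True
print("tab rejected:", tab_rejected)  # tab rejected: True
```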

The structure of a load control file looks as follows. See Control file format for the complete description of keys.

---
VERSION: 1.0.0.1
DATABASE: <db_name>
USER: <db_username>
HOST: <master_hostname>
PORT: <master_port>
GPLOAD:
  INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - <hostname_or_ip>
        PORT: <http_port>
      | PORT_RANGE: [ <start_port_range>, <end_port_range> ]
        FILE:
          - </path/to/input_file>
        SSL: true | false
        CERTIFICATES_PATH: </path/to/certificates>
    - FULLY_QUALIFIED_DOMAIN_NAME: true | false
    - COLUMNS:
        - <field_name>: <data_type>
    - TRANSFORM: '<transformation>'
    - TRANSFORM_CONFIG: '<configuration-file-path>'
    - MAX_LINE_LENGTH: <integer>
    - FORMAT: text | csv
    - DELIMITER: '<delimiter_character>'
    - ESCAPE: '<escape_character>' | 'OFF'
    - NEWLINE: 'LF' | 'CR' | 'CRLF'
    - NULL_AS: '<null_string>'
    - FILL_MISSING_FIELDS: true | false
    - FORCE_NOT_NULL: <column_name> [, ...]
    - QUOTE: '<csv_quote_character>'
    - HEADER: true | false
    - ENCODING: <database_encoding>
    - ERROR_LIMIT: <integer>
    - LOG_ERRORS: true | false
  EXTERNAL:
    - SCHEMA: <schema> | '%'
  OUTPUT:
    - TABLE: <schema.table_name>
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
        - <target_column_name>
    - UPDATE_COLUMNS:
        - <target_column_name>
    - UPDATE_CONDITION: '<boolean_condition>'
    - MAPPING:
        <target_column_name>: <source_column_name> | '<expression>'
  PRELOAD:
    - TRUNCATE: true | false
    - REUSE_TABLES: true | false
    - STAGING_TABLE: <external_table_name>
    - FAST_MATCH: true | false
  SQL:
    - BEFORE: "<sql_command>"
    - AFTER: "<sql_command>"
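Because gpload fails on malformed YAML, it can be useful to round-trip a control file through pyyaml before running the load. The sketch below only checks that the document parses and that the top-level sections used on this page are present; gpload itself performs stricter validation of keys and values:

```python
import yaml  # pyyaml, which gpload requires

control = """\
VERSION: 1.0.0.1
DATABASE: customers
USER: gpadmin
HOST: mdw
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        FILE:
          - /tmp/orders.csv
    - FORMAT: csv
  OUTPUT:
    - TABLE: orders
    - MODE: insert
"""

doc = yaml.safe_load(control)  # raises yaml.YAMLError on malformed input
assert "GPLOAD" in doc, "missing GPLOAD section"
print(sorted(doc["GPLOAD"]))  # ['INPUT', 'OUTPUT']
```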
NOTE

The MERGE and UPDATE operations are not supported if the target table or column name is a reserved keyword, contains capital letters, or is enclosed in double quotes (" "). If a database object name was created using a double-quoted (delimited) identifier, you must specify the delimited name within single quotes (') in the gpload control file.

Consider the following table:

CREATE TABLE "MyTable" ("MyColumn" text);

In this case, the load control file must refer to this table and column names as follows:

...
  - COLUMNS:
    - '"MyColumn"': text
OUTPUT:
  - TABLE: public.'"MyTable"'
...

Run gpload

  1. Create a YAML-formatted load control file.

  2. Run the gpload utility using the -f option to pass in the created load control file, for example:

    $ gpload -f example_load.yml

Examples

These examples illustrate loading data into a database table from an external file by using the INSERT, UPDATE, and MERGE methods.

Prerequisites

To try out the practical examples listed in this section, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create a customers test database, connect to it, and create two tables, orders and audit, as follows:

CREATE DATABASE customers;
\c customers

CREATE TABLE orders (
    id INTEGER,
    name VARCHAR,
    price NUMERIC
);

CREATE TABLE audit (
    event VARCHAR,
    timestamp TIMESTAMP
);

INSERT method

This example demonstrates loading data into a database table by using the INSERT method.

  1. Empty the orders table and populate it with data as follows:

    TRUNCATE TABLE orders;
    
    INSERT INTO orders (id, name, price)
    VALUES
        (1,'Laptop',999.99),
        (2,'Smartphone',499.99),
        (3,'Tablet',299.99);
  2. Empty the auxiliary audit table:

    TRUNCATE TABLE audit;
  3. In the /tmp directory on the master host, create the orders.csv file containing the source data to load:

    $ cat > /tmp/orders.csv <<EOF
    id,name,price
    4,Monitor,599.99
    5,Keyboard,99.99
    EOF
  4. On the master host, create the load.yml load control file:

    $ cat > load.yml <<EOF
    ---
    VERSION: 1.0.0.1
    DATABASE: customers
    USER: gpadmin
    HOST: mdw
    PORT: 5432
    GPLOAD:
      INPUT:
        - SOURCE:
            PORT: 8081
            FILE:
              - /tmp/orders.csv
        - COLUMNS:
            - id: INTEGER
            - name: VARCHAR
            - price: NUMERIC
        - FORMAT: CSV
        - HEADER: true
      OUTPUT:
        - TABLE: orders
        - MODE: INSERT
      SQL:
        - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
        - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
    EOF
  5. Run the gpload utility, passing in the created load control file. This will insert new rows into the orders table. The audit table will contain timestamps marking the loading start and end.

    $ gpload -f load.yml

    The output should look as follows:

    INFO|gpload session started
    INFO|setting schema 'public' for table 'orders'
    INFO|started gpfdist -p 8081 -P 9000 -f "/tmp/orders.csv" -t 30
    INFO|running time: 0.10 seconds
    INFO|rows Inserted          = 2
    INFO|rows Updated           = 0
    INFO|data formatting errors = 0
    INFO|gpload succeeded
  6. Query the orders table:

    SELECT * FROM orders;

    The output should look as follows, with rows 4 and 5 inserted into the table:

     id |    name    | price
    ----+------------+--------
      1 | Laptop     | 999.99
      2 | Smartphone | 499.99
      3 | Tablet     | 299.99
      4 | Monitor    | 599.99
      5 | Keyboard   |  99.99
  7. Query the audit table:

    SELECT * FROM audit;

    The output should contain the timestamps marking the loading start and end:

     event |         timestamp
    -------+----------------------------
     start | 2025-05-25 12:45:47.931476
     end   | 2025-05-25 12:45:50.134581
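In the control file above, FORMAT: CSV and HEADER: true tell gpload to parse the source as comma-separated values and to take the column names from the first line. The same interpretation in plain Python (an illustration of the input handling, not gpload internals):

```python
import csv
import io

# The contents of /tmp/orders.csv from the example above
data = "id,name,price\n4,Monitor,599.99\n5,Keyboard,99.99\n"

# HEADER: true -> the first line supplies the column names
reader = csv.DictReader(io.StringIO(data))
rows = [(int(r["id"]), r["name"], float(r["price"])) for r in reader]
print(rows)  # [(4, 'Monitor', 599.99), (5, 'Keyboard', 99.99)]
```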

UPDATE method

This example demonstrates loading data into a database table by using the UPDATE method.

  1. Empty the orders table and populate it with data as follows:

    TRUNCATE TABLE orders;
    
    INSERT INTO orders (id, name, price)
    VALUES
        (1,'Laptop',999.99),
        (2,'Smartphone',499.99),
        (3,'Tablet',299.99),
        (4,'Monitor',599.99),
        (5,'Keyboard', 99.99);
  2. Empty the auxiliary audit table:

    TRUNCATE TABLE audit;
  3. In the /tmp directory on the master host, create the orders.csv file containing the source data to load. Note that these rows already exist in the orders table but carry different values in the price column.

    $ cat > /tmp/orders.csv <<EOF
    id,name,price
    1,Laptop,1299.99
    4,Monitor,899.99
    5,Keyboard,59.99
    EOF
  4. On the master host, create a load.yml load control file:

    $ cat > load.yml <<EOF
    ---
    VERSION: 1.0.0.1
    DATABASE: customers
    USER: gpadmin
    HOST: mdw
    PORT: 5432
    GPLOAD:
      INPUT:
        - SOURCE:
            FILE:
              - /tmp/orders.csv
        - COLUMNS:
            - id: integer
            - name: text
            - price: numeric
        - FORMAT: CSV
        - HEADER: true
      OUTPUT:
        - TABLE: orders
        - MODE: UPDATE
        - MATCH_COLUMNS:
          - id
        - UPDATE_COLUMNS:
          - price
      SQL:
        - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
        - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
    EOF
  5. Run the gpload utility, passing in the created load control file. This will update the values in the price column for the rows where the id column values in the orders table match those in the orders.csv file. The audit table will contain timestamps marking the loading start and end.

    $ gpload -f load.yml

    The output should look as follows:

    INFO|gpload session started
    INFO|setting schema 'public' for table 'orders'
    INFO|started gpfdist -p 8000 -P 9000 -f "/tmp/orders.csv" -t 30
    INFO|running time: 0.13 seconds
    INFO|rows Inserted          = 0
    INFO|rows Updated           = 3
    INFO|data formatting errors = 0
    INFO|gpload succeeded
  6. Query the orders table:

    SELECT * FROM orders;

    The output should look as follows, with rows 1, 4, and 5 updated:

     id |    name    |  price
    ----+------------+---------
      1 | Laptop     | 1299.99
      2 | Smartphone |  499.99
      3 | Tablet     |  299.99
      4 | Monitor    |  899.99
      5 | Keyboard   |   59.99
  7. Query the audit table:

    SELECT * FROM audit;

    The output should contain the timestamps marking the loading start and end:

     event |         timestamp
    -------+----------------------------
     start | 2025-05-25 16:24:47.931476
     end   | 2025-05-25 16:24:50.134581

MERGE method

This example demonstrates loading data into a database table by using the MERGE method.

  1. Empty the orders table and populate it with data as follows:

    TRUNCATE TABLE orders;
    
    INSERT INTO orders (id, name, price)
    VALUES
        (1,'Laptop',999.99),
        (2,'Smartphone',499.99),
        (3,'Tablet',299.99),
        (4,'Monitor',599.99),
        (5,'Keyboard', 99.99);
  2. Empty the auxiliary audit table:

    TRUNCATE TABLE audit;
  3. In the /tmp directory on the master host, create the orders.csv file containing the source data to load. Rows 2, 4, and 5 already exist in the orders table but carry different price values; rows 6 and 7 are new.

    $ cat > /tmp/orders.csv <<EOF
    id,name,price
    1,Laptop,999.99
    2,Smartphone,799.99
    3,Tablet,299.99
    4,Monitor,849.99
    5,Keyboard,59.99
    6,Webcam,29.99
    7,Speakers,129.99
    EOF
  4. On the master host, create a load.yml load control file:

    $ cat > load.yml <<EOF
    ---
    VERSION: 1.0.0.1
    DATABASE: customers
    USER: gpadmin
    HOST: mdw
    PORT: 5432
    GPLOAD:
      INPUT:
        - SOURCE:
            FILE:
              - /tmp/orders.csv
        - COLUMNS:
            - id: integer
            - name: text
            - price: numeric
        - FORMAT: CSV
        - HEADER: true
      OUTPUT:
        - TABLE: orders
        - MODE: MERGE
        - MATCH_COLUMNS:
          - id
        - UPDATE_COLUMNS:
          - price
      SQL:
        - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
        - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
    EOF
  5. Run the gpload utility, passing in the created load control file. This will update the values in the price column for the rows where the id column values in the orders table match those in the orders.csv file and insert new rows. The audit table will contain timestamps marking the loading start and end.

    $ gpload -f load.yml

    The output should look as follows:

    INFO|gpload session started
    INFO|setting schema 'public' for table 'orders'
    INFO|started gpfdist -p 8000 -P 9000 -f "/tmp/orders.csv" -t 30
    INFO|running time: 0.14 seconds
    INFO|rows Inserted          = 2
    INFO|rows Updated           = 5
    INFO|data formatting errors = 0
    INFO|gpload succeeded
  6. Query the orders table:

    SELECT * FROM orders;

    The output should look as follows, with rows 2, 4, and 5 updated and rows 6 and 7 inserted:

     id |    name    | price
    ----+------------+--------
      1 | Laptop     | 999.99
      2 | Smartphone | 799.99
      3 | Tablet     | 299.99
      4 | Monitor    | 849.99
      5 | Keyboard   |  59.99
      6 | Webcam     |  29.99
      7 | Speakers   | 129.99
  7. Query the audit table:

    SELECT * FROM audit;

    The output should contain the timestamps marking the loading start and end:

     event |         timestamp
    -------+----------------------------
     start | 2025-05-25 15:24:47.931476
     end   | 2025-05-25 15:24:50.134581
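The merge semantics used above can be sketched in plain Python: rows are matched on MATCH_COLUMNS, matched rows have their UPDATE_COLUMNS values replaced, and unmatched rows are inserted. This is illustrative only; gpload performs the equivalent logic inside the database in a single transaction:

```python
# orders before the load: id -> price
target = {1: 999.99, 2: 499.99, 3: 299.99, 4: 599.99, 5: 99.99}
# rows from orders.csv: (id, price)
source = [(1, 999.99), (2, 799.99), (3, 299.99), (4, 849.99),
          (5, 59.99), (6, 29.99), (7, 129.99)]

inserted = updated = 0
for order_id, price in source:
    if order_id in target:        # MATCH_COLUMNS: id
        target[order_id] = price  # UPDATE_COLUMNS: price
        updated += 1
    else:                         # no match -> insert the new row
        target[order_id] = price
        inserted += 1

print("inserted =", inserted, "updated =", updated)  # inserted = 2 updated = 5
```

This matches the log output in step 5: rows Inserted = 2 and rows Updated = 5, since every matched row counts as updated even when its price is unchanged.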