
Use gpload

Anton Monakov

The Greengage DB gpload utility facilitates loading data into database tables via the Greenplum Parallel File Server (gpfdist). gpload operates based on a load specification defined in a YAML-formatted control file.

It invokes gpfdist, creates an external table definition based on the source data defined in the control file, and runs an INSERT, UPDATE, or MERGE operation to load the source data into the target table in the database.

The gpload operation, including any additional SQL commands specified in the SQL section of the YAML control file, is performed as a single transaction to prevent inconsistent data when performing simultaneous load operations on a target table.

NOTE

gpfdist and gpload are compatible only with the Greengage DB major version in which they are shipped (for example, 6.x).

Requirements

The client machine where gpload is run must meet the following requirements:

  • Python 2.6.2 or later, with pygresql (the Python interface to PostgreSQL) and pyyaml libraries installed.

    Note that Python and the required libraries are included with the Greengage DB server installation. If you run gpload on the machine where Greengage DB is installed, you do not need a separate Python installation. gpload for Windows supports only Python 2.5 (available from python.org).

  • The gpfdist parallel file distribution utility installed. The utility is located in the $GPHOME/bin directory of the Greengage DB master host and each segment host. To install it on another host, copy it to the desired location on that host and add this location to PATH.

  • Network access to and from all hosts in your Greengage DB cluster (master and segments).

  • Network access to and from the hosts where the external data to be loaded resides (ETL servers).

Syntax

The gpload command’s general syntax is as follows:

gpload -f <control_file> [-l <log_file>] [-h <hostname>] [-p <port>]
[-U <username>] [-d <database>] [-W] [--gpfdist_timeout <seconds>]
[--no_auto_trans] [[-v | -V] [-q]] [-D]

gpload -?

gpload --version
Option Description

-f <control_file>

A YAML file that contains the load specification details. Learn more in Control file

-d <database>

The database to load the data into.

If not specified, the database name is taken from the load control file, the PGDATABASE environment variable, or the current system user name

-h <hostname>

Specifies the host name of the machine where the Greengage DB master database server is running.

If not specified, reads from the load control file, the PGHOST environment variable, or defaults to localhost

-p <port>

Specifies the TCP port on which the Greengage DB master database server is listening for connections.

If not specified, reads from the load control file, the PGPORT environment variable, or defaults to 5432

-U <username>

Specifies the database role name to connect as.

If not specified, reads from the load control file, the PGUSER environment variable, or defaults to the current system user name

-W

Forces a password prompt.

If not specified, reads the password from the PGPASSWORD environment variable, from a password file specified by PGPASSFILE, or from ~/.pgpass. If none of these are set, gpload prompts for a password even if -W is not supplied

--gpfdist_timeout <seconds>

Sets the gpfdist response timeout, from 0 to 30 seconds.

You might need to increase this value when operating on high-traffic networks

--no_auto_trans

Disables processing the load operation as a single transaction. Specify this option only when performing a single load operation on the target table.

By default, gpload processes each load operation as a single transaction to prevent inconsistent data when performing simultaneous operations on a target table

-l <log_file>

Specifies the log file location, ~/gpAdminLogs/gpload_YYYYMMDD by default.

Learn more about logging in Greengage DB in Logging

-D

Checks for error conditions, but does not execute the actual data load

-q

Runs in quiet mode.

The command output is not displayed on the screen, but is still written to the log file. The option cannot be used if -v or -V is provided

-v

Shows verbose output of the load steps as they are executed. The option cannot be used if -q is provided

-V

Shows very verbose output: the load steps as they are executed as well as all other messages generated by the utility. The option cannot be used if -q is provided

-?

Shows help, then exits

--version

Shows the gpload version, then exits

Control file

The control file is used for orchestrating the loading process. It specifies the Greengage DB connection information, gpfdist configuration information, external table options, and source data format.

The file uses the YAML 1.1 format with its own schema for defining the various steps of a Greengage DB load operation and must be a valid YAML document. The gpload program processes the control file document in order and uses indentation (spaces) to determine the document hierarchy and the relationships of the sections to one another.

NOTE

The use of whitespace in a YAML document is significant. Whitespace should not be used for formatting purposes, and tabs should not be used at all.
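
Because tabs are easy to introduce accidentally, it can help to check a control file for them before running gpload. The following is a minimal sketch; the file name and contents are illustrative:

```shell
# Write a throwaway control file, then verify it contains no tab characters,
# which are invalid in YAML control files. Paths are illustrative.
cat > /tmp/check_load.yml <<'EOF'
---
VERSION: 1.0.0.1
GPLOAD:
  INPUT:
    - SOURCE:
        FILE:
          - /tmp/orders.csv
  OUTPUT:
    - TABLE: orders
EOF

if grep -q "$(printf '\t')" /tmp/check_load.yml; then
  echo "tabs found: replace them with spaces"
else
  echo "no tabs: indentation uses spaces only"
fi
```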

The structure of a load control file looks as follows. See Control file keys reference for the complete description of keys.

---
VERSION: 1.0.0.1
DATABASE: <db_name>
USER: <db_username>
HOST: <master_hostname>
PORT: <master_port>
GPLOAD:
  INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - <hostname_or_ip>
        PORT: <http_port>
      | PORT_RANGE: [ <start_port_range>, <end_port_range> ]
        FILE:
          - </path/to/input_file>
        SSL: true | false
        CERTIFICATES_PATH: </path/to/certificates>
    - FULLY_QUALIFIED_DOMAIN_NAME: true | false
    - COLUMNS:
        - <field_name>: <data_type>
    - TRANSFORM: '<transformation>'
    - TRANSFORM_CONFIG: '<configuration-file-path>'
    - MAX_LINE_LENGTH: <integer>
    - FORMAT: text | csv
    - DELIMITER: '<delimiter_character>'
    - ESCAPE: '<escape_character>' | 'OFF'
    - NEWLINE: 'LF' | 'CR' | 'CRLF'
    - NULL_AS: '<null_string>'
    - FILL_MISSING_FIELDS: true | false
    - FORCE_NOT_NULL: <column_name> [, ...]
    - QUOTE: '<csv_quote_character>'
    - HEADER: true | false
    - ENCODING: <database_encoding>
    - ERROR_LIMIT: <integer>
    - LOG_ERRORS: true | false
  EXTERNAL:
    - SCHEMA: <schema> | '%'
  OUTPUT:
    - TABLE: <schema.table_name>
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
        - <target_column_name>
    - UPDATE_COLUMNS:
        - <target_column_name>
    - UPDATE_CONDITION: '<boolean_condition>'
    - MAPPING:
        <target_column_name>: <source_column_name> | '<expression>'
  PRELOAD:
    - TRUNCATE: true | false
    - REUSE_TABLES: true | false
    - STAGING_TABLE: <external_table_name>
    - FAST_MATCH: true | false
  SQL:
    - BEFORE: "<sql_command>"
    - AFTER: "<sql_command>"
NOTE

The MERGE and UPDATE operations are not supported if the target table or column name is a reserved keyword, has capital letters, or is enclosed in double quotes (" "). If the database object names were created using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes (') in the gpload control file.

Consider the following table:

CREATE TABLE "MyTable" ("MyColumn" text);

In this case, the load control file must refer to this table and column names as follows:

...
  - COLUMNS:
    - '"MyColumn"': text
OUTPUT:
  - TABLE: public.'"MyTable"'
...
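
For reference, a complete minimal control file for loading into "MyTable" might look as follows (connection values and the file path are illustrative). INSERT mode is used because MERGE and UPDATE do not support such delimited names:

```yaml
---
VERSION: 1.0.0.1
DATABASE: customers
USER: gpadmin
HOST: mdw
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        FILE:
          - /tmp/mytable.csv
    - COLUMNS:
        - '"MyColumn"': text
    - FORMAT: CSV
  OUTPUT:
    - TABLE: public.'"MyTable"'
    - MODE: INSERT
```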

Run gpload

  1. Create a YAML-formatted load control file.

  2. Run the gpload utility using the -f option to pass in the created load control file, for example:

    $ gpload -f example_load.yml

Examples

These examples illustrate loading data into a database table from an external file by using the INSERT, UPDATE, and MERGE methods.

Prerequisites

To try out the practical examples listed in this section, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create a customers test database, connect to it, and create two tables, orders and audit, as follows:

CREATE DATABASE customers;
\c customers

CREATE TABLE orders (
    id INTEGER,
    name VARCHAR,
    price NUMERIC
);

CREATE TABLE audit (
    event VARCHAR,
    timestamp TIMESTAMP
);

INSERT method

This example demonstrates loading data into a database table by using the INSERT method.

  1. Empty the orders table and populate it with data as follows:

    TRUNCATE TABLE orders;
    
    INSERT INTO orders (id, name, price)
    VALUES
        (1,'Laptop',999.99),
        (2,'Smartphone',499.99),
        (3,'Tablet',299.99);
  2. Empty the auxiliary audit table:

    TRUNCATE TABLE audit;
  3. In the /tmp directory on the master host, create the orders.csv file containing the source data to load:

    $ cat > orders.csv <<EOF
    id,name,price
    4,Monitor,599.99
    5,Keyboard,99.99
    EOF
  4. On the master host, create the load.yml load control file:

    $ cat > load.yml <<EOF
    ---
    VERSION: 1.0.0.1
    DATABASE: customers
    USER: gpadmin
    HOST: mdw
    PORT: 5432
    GPLOAD:
      INPUT:
        - SOURCE:
            PORT: 8081
            FILE:
              - /tmp/orders.csv
        - COLUMNS:
            - id: INTEGER
            - name: VARCHAR
            - price: NUMERIC
        - FORMAT: CSV
        - HEADER: true
      OUTPUT:
        - TABLE: orders
        - MODE: INSERT
      SQL:
        - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
        - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
    EOF
  5. Run the gpload utility passing in the created load control file. This will insert new rows into the orders table. The audit table will contain timestamps marking the loading start and end.

    $ gpload -f load.yml

    The output should look as follows:

    INFO|gpload session started
    INFO|setting schema 'public' for table 'orders'
    INFO|started gpfdist -p 8081 -P 8082 -f "/tmp/orders.csv" -t 30
    INFO|running time: 0.10 seconds
    INFO|rows Inserted          = 2
    INFO|rows Updated           = 0
    INFO|data formatting errors = 0
    INFO|gpload succeeded
  6. Query the orders table:

    SELECT * FROM orders;

    The output should look as follows, with rows 4 and 5 inserted into the table:

     id |    name    | price
    ----+------------+--------
      1 | Laptop     | 999.99
      2 | Smartphone | 499.99
      3 | Tablet     | 299.99
      4 | Monitor    | 599.99
      5 | Keyboard   |  99.99
  7. Query the audit table:

    SELECT * FROM audit;

    The output should contain the timestamps marking the loading start and end:

     event |         timestamp
    -------+----------------------------
     start | 2025-05-25 12:45:47.931476
     end   | 2025-05-25 12:45:50.134581

UPDATE method

This example demonstrates loading data into a database table by using the UPDATE method.

  1. Empty the orders table and populate it with data as follows:

    TRUNCATE TABLE orders;
    
    INSERT INTO orders (id, name, price)
    VALUES
        (1,'Laptop',999.99),
        (2,'Smartphone',499.99),
        (3,'Tablet',299.99),
        (4,'Monitor',599.99),
        (5,'Keyboard', 99.99);
  2. Empty the auxiliary audit table:

    TRUNCATE TABLE audit;
  3. In the /tmp directory on the master host, create the orders.csv file containing the source data to load. Notice that the file contains rows that already exist in the orders table but with different values in the price column.

    $ cat > orders.csv <<EOF
    id,name,price
    1,Laptop,1299.99
    4,Monitor,899.99
    5,Keyboard,59.99
    EOF
  4. On the master host, create a load.yml load control file:

    $ cat > load.yml <<EOF
    ---
    VERSION: 1.0.0.1
    DATABASE: customers
    USER: gpadmin
    HOST: mdw
    PORT: 5432
    GPLOAD:
      INPUT:
        - SOURCE:
            FILE:
              - /tmp/orders.csv
        - COLUMNS:
            - id: integer
            - name: text
            - price: numeric
        - FORMAT: CSV
        - HEADER: true
      OUTPUT:
        - TABLE: orders
        - MODE: UPDATE
        - MATCH_COLUMNS:
          - id
        - UPDATE_COLUMNS:
          - price
      SQL:
        - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
        - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
    EOF
  5. Run the gpload utility, passing in the created load control file. This will update the values in the price column for the rows where the id column values in the orders table match those in the orders.csv file. The audit table will contain timestamps marking the loading start and end.

    $ gpload -f load.yml

    The output should look as follows:

    INFO|gpload session started
    INFO|setting schema 'public' for table 'orders'
    INFO|started gpfdist -p 8000 -P 9000 -f "/tmp/orders.csv" -t 30
    INFO|running time: 0.13 seconds
    INFO|rows Inserted          = 0
    INFO|rows Updated           = 3
    INFO|data formatting errors = 0
    INFO|gpload succeeded
  6. Query the orders table:

    SELECT * FROM orders;

    The output should look as follows, with rows 1, 4, and 5 updated:

     id |    name    |  price
    ----+------------+---------
      1 | Laptop     | 1299.99
      2 | Smartphone |  499.99
      3 | Tablet     |  299.99
      4 | Monitor    |  899.99
      5 | Keyboard   |   59.99
  7. Query the audit table:

    SELECT * FROM audit;

    The output should contain the timestamps marking the loading start and end:

     event |         timestamp
    -------+----------------------------
     start | 2025-05-25 16:24:47.931476
     end   | 2025-05-25 16:24:50.134581

MERGE method

This example demonstrates loading data into a database table by using the MERGE method.

  1. Empty the orders table and populate it with data as follows:

    TRUNCATE TABLE orders;
    
    INSERT INTO orders (id, name, price)
    VALUES
        (1,'Laptop',999.99),
        (2,'Smartphone',499.99),
        (3,'Tablet',299.99),
        (4,'Monitor',599.99),
        (5,'Keyboard', 99.99);
  2. Empty the auxiliary audit table:

    TRUNCATE TABLE audit;
  3. In the /tmp directory on the master host, create the orders.csv file containing the source data to load. Notice that the file contains rows that already exist in the orders table but with different values in the price column (2, 4, and 5). Rows 6 and 7 are new rows.

    $ cat > orders.csv <<EOF
    id,name,price
    1,Laptop,999.99
    2,Smartphone,799.99
    3,Tablet,299.99
    4,Monitor,849.99
    5,Keyboard,59.99
    6,Webcam,29.99
    7,Speakers,129.99
    EOF
  4. On the master host, create a load.yml load control file:

    $ cat > load.yml <<EOF
    ---
    VERSION: 1.0.0.1
    DATABASE: customers
    USER: gpadmin
    HOST: mdw
    PORT: 5432
    GPLOAD:
      INPUT:
        - SOURCE:
            FILE:
              - /tmp/orders.csv
        - COLUMNS:
            - id: integer
            - name: text
            - price: numeric
        - FORMAT: CSV
        - HEADER: true
      OUTPUT:
        - TABLE: orders
        - MODE: MERGE
        - MATCH_COLUMNS:
          - id
        - UPDATE_COLUMNS:
          - price
      SQL:
        - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
        - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
    EOF
  5. Run the gpload utility, passing in the created load control file. This will update the values in the price column for the rows where the id column values in the orders table match those in the orders.csv file and insert new rows. The audit table will contain timestamps marking the loading start and end.

    $ gpload -f load.yml

    The output should look as follows:

    INFO|gpload session started
    INFO|setting schema 'public' for table 'orders'
    INFO|started gpfdist -p 8000 -P 9000 -f "/tmp/orders.csv" -t 30
    INFO|running time: 0.14 seconds
    INFO|rows Inserted          = 2
    INFO|rows Updated           = 5
    INFO|data formatting errors = 0
    INFO|gpload succeeded
  6. Query the orders table:

    SELECT * FROM orders;

    The output should look as follows, with rows 2, 4, and 5 updated and rows 6 and 7 inserted:

     id |    name    | price
    ----+------------+--------
      1 | Laptop     | 999.99
      2 | Smartphone | 799.99
      3 | Tablet     | 299.99
      4 | Monitor    | 849.99
      5 | Keyboard   |  59.99
      6 | Webcam     |  29.99
      7 | Speakers   | 129.99
  7. Query the audit table:

    SELECT * FROM audit;

    The output should contain the timestamps marking the loading start and end:

     event |         timestamp
    -------+----------------------------
     start | 2025-05-25 15:24:47.931476
     end   | 2025-05-25 15:24:50.134581

Control file keys reference

Key Description Required

VERSION

The version of the gpload control file schema. The current version is 1.0.0.1

Yes

DATABASE <db_name>

The database to connect to.

If not specified, defaults to the PGDATABASE value (if set) or the current system user name. You can also specify the database on the gpload command line using the -d option

No

USER <db_username>

Specifies the database role to use for connecting.

If not specified, defaults to the PGUSER value (if set) or the current system user name. You can also specify the database role on the gpload command line using the -U option. If the user running gpload is not a Greengage DB superuser, then the appropriate rights must be granted to the user for the load to be processed

No

HOST <master_hostname>

Specifies the Greengage DB master host name.

If not specified, defaults to the PGHOST value (if set) or localhost. You can also specify the master host name on the command line using the -h option

No

PORT <master_port>

Specifies the master port on which the Greengage DB master database server is listening for connections.

If not specified, defaults to the PGPORT value (if set) or 5432. You can also specify the master port on the gpload command line using the -p option

No

GPLOAD

Encapsulates the load specification. A specification must have an INPUT and an OUTPUT section defined

Yes

INPUT

Defines the location and the format of the input data to be loaded.

Yes

SOURCE

Defines the location of a source file.

An INPUT section can have more than one SOURCE block defined. Each SOURCE block corresponds to one instance of the gpfdist utility that is started on a local machine. Each SOURCE block must include a FILE specification

Yes

LOCAL_HOSTNAME <hostname_or_ip>

Specifies the host name or the IP address of the local machine on which gpload is running.

If the machine is configured with multiple network interface cards (NICs), you can specify the host name or the IP of each individual NIC to allow network traffic to use all NICs simultaneously. By default, only the local machine’s primary host name or IP is used

No

PORT <http_port>

Specifies the port number that the gpfdist utility should use.

You can also use PORT_RANGE to have an available port selected from the specified range. If both PORT and PORT_RANGE are defined, then PORT takes precedence. If neither PORT nor PORT_RANGE are defined, an available port between 8000 and 9000 is selected by default. If multiple host names are declared in LOCAL_HOSTNAME, this port number is used for all hosts. This configuration is recommended if you want to use all NICs to load the same file or set of files in a given directory location

No

PORT_RANGE <start_port_range>, <end_port_range>

Specifies the range of port numbers from which gpload can choose an available port for this instance of the gpfdist utility

No

FILE </path/to/input_file>

Specifies the location of a file or directory on the local file system or a named pipe that contains the data to be loaded.

You can declare more than one file if the same data format is used in all the specified files. To denote multiple files to read, use the wildcard characters such as * or []. The .gz and .bz2 files are uncompressed automatically when reading data. The paths specified are assumed to be relative to the directory from which gpload is run (alternatively, you can provide an absolute path)

Yes
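
For instance, a SOURCE block that reads several files of the same format, mixing a wildcard pattern and a compressed file, might look like this (paths are illustrative):

```yaml
- SOURCE:
    FILE:
      - /data/sales_2024_*.csv
      - /data/archive/sales_2023.csv.gz
```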

SSL true | false

Specifies usage of SSL encryption.

If SSL is set to true, gpload starts the gpfdist utility with the --ssl option and uses the GPFDISTS protocol.

See Configure SSL for details

No

CERTIFICATES_PATH </path/to/certificates>

Specifies the certificates storage location.

Required when SSL is true; cannot be specified when SSL is false or unspecified.

The location specified in CERTIFICATES_PATH must contain the following files:

  • server.crt — the server certificate file.

  • server.key — the server private key file.

  • root.crt — the trusted certificate authorities.

The root directory (/) cannot be specified as CERTIFICATES_PATH

Yes, if SSL is true
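
A SOURCE block that serves the source file over the gpfdists protocol with SSL enabled might look as follows (the certificate directory, which is illustrative here, must contain server.crt, server.key, and root.crt):

```yaml
- SOURCE:
    FILE:
      - /data/orders.csv
    SSL: true
    CERTIFICATES_PATH: /home/gpadmin/certs
```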

FULLY_QUALIFIED_DOMAIN_NAME true | false

Specifies whether gpload resolves hostnames to the fully qualified domain name (FQDN) or the local hostname.

If set to true, names are resolved to FQDN; otherwise, resolution is to the local hostname. The default is false. A fully qualified domain name might be required in some situations. For example, if the Greengage DB system is in a different domain than an ETL application that is being accessed by gpload

No

COLUMNS <field_name>: <data_type>

Specifies the schema of the source data files in the format of field_name:data_type.

The DELIMITER character in the source file is what separates two data value fields (columns). A row is determined by a line feed character (0x0a). If the input columns (COLUMNS) are not specified, then the schema of the output table (TABLE) is implied, meaning that the source data must have the same column order, number of columns, and data format as the target table. The default source-to-target mapping is based on a match of column names as defined in this section and the column names in the target table (TABLE).

This default mapping can be overridden using the MAPPING section

No

TRANSFORM '<transformation>'

Specifies the name of the input transformation passed to gpload.

See Load data with gpload for details

No

TRANSFORM_CONFIG '<configuration-file-path>'

Specifies the location of the transformation configuration file.

Yes, if TRANSFORM is specified

MAX_LINE_LENGTH <integer>

Sets the maximum allowed data row length in bytes, 32768 by default.

Should be used when user data includes very wide rows (or when the line too long error message occurs). Should not be used otherwise as it increases resource allocation. Valid range is 32 KB to 256 MB. On Windows systems, the upper limit is 1 MB

No

FORMAT text | csv

Specifies the format of the source data files: either plain text (TEXT) or comma-separated values (CSV) format. Defaults to TEXT if not specified.

Learn more about formatting source data in Format external data

No

DELIMITER '<delimiter_character>'

Designates a single ASCII character to act as a column delimiter. The delimiter must reside between any two data value fields but not at the beginning or the end of a row.

You can also specify a non-printable ASCII character or a non-printable Unicode character, such as \x1B or \u001B. The escape string syntax, E'<character-code>', is also supported for non-printable characters. The ASCII or Unicode character must be enclosed in single quotes, for example, E'\x1B' or E'\u001B'.

For text files, the default column delimiter is the horizontal TAB character (0x09).

For CSV files, the default column delimiter is the comma character (,).

See Format columns for details

No

ESCAPE '<escape_character>' | 'OFF'

Designates a single character used as an escape character.

For text files, the default escape character is a backslash (\). You can deactivate escaping by providing the OFF value.

For CSV files, the default escape character is a double quote (").

See Escape characters for details

No

NEWLINE 'LF' | 'CR' | 'CRLF'

Designates a character used as a newline character, which can be LF (Line feed, 0x0A), CR (Carriage return, 0x0D), or CR followed by LF (CRLF, 0x0D 0x0A).

If not specified, the newline character detected at the end of the first line of data is used.

See Format rows for details

No

NULL_AS '<null_string>'

Designates a string representing a null value, which indicates an unknown piece of data in a column or field.

For text files, the default string is \N.

For CSV files, the default string is an empty value with no quotation marks.

See Represent NULL values for details

No

FILL_MISSING_FIELDS true | false

Sets missing trailing field values at the end of a line or row to NULL. If not set, an error is reported in such cases. Blank rows, fields with a NOT NULL constraint, and trailing delimiters on a line will still report an error

No

FORCE_NOT_NULL ( <column_name> [, …​] )

Treats values in each specified column as if they were quoted. Since the default null string is an empty unquoted string, this causes missing values to be evaluated as zero-length strings

No

QUOTE '<csv_quote_character>'

Specifies the quotation character, double quote (") by default

No

HEADER true | false

Designates whether the data file contains a header row.

If using multiple data source files, all of them must have a header row. The default is to assume that the input files do not have a header row

No

ENCODING <database_encoding>

Defines the character set encoding of the source data.

If not specified, the default client encoding is used.

If you change the ENCODING value in an existing gpload control file, you must manually drop any external tables that were created using the previous ENCODING configuration. gpload does not drop and recreate external tables to use the new ENCODING if REUSE_TABLES is set to true.

See Character encoding for details

No

ERROR_LIMIT <integer>

Enables single-row error isolation mode for this load operation.

When enabled, input rows having format errors are discarded until the error limit count is reached on any Greengage DB segment instance during input processing. If the error limit is not reached, all valid rows are loaded and any invalid rows are either discarded or captured as part of the error log information. By default, the load operation is cancelled on the first error.

Note that single-row error isolation only applies to data rows with format errors, for example, extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Constraint errors, such as primary key violations, still cause the load operation to be cancelled

No

LOG_ERRORS true | false

Defines whether to store loading errors in the log if ERROR_LIMIT is declared.

The value can be true or false (default). If the value is true, rows with formatting errors are logged internally when running in single-row error isolation mode.

You can examine formatting errors by using the Greengage DB’s gp_read_error_log('<table_name>') built-in SQL function. If formatting errors are detected when loading data, gpload generates a warning message with the name of the table that contains the error information.

If LOG_ERRORS is set to true, REUSE_TABLES must also be set to true to retain the formatting errors in Greengage DB error logs. Otherwise, if REUSE_TABLES is set to false or is not specified, the error information is deleted after the gpload operation.

Only summary information about formatting errors is returned. You can delete the formatting errors from the error logs by using the gp_truncate_error_log() Greengage DB function.

No
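
As a sketch, an input configuration that tolerates up to 25 malformed rows per segment and retains their error details for later inspection might look like this (REUSE_TABLES must be set to true to keep the error log after the load):

```yaml
INPUT:
  # ... SOURCE, COLUMNS, and FORMAT keys ...
  - ERROR_LIMIT: 25
  - LOG_ERRORS: true
PRELOAD:
  - REUSE_TABLES: true
```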

EXTERNAL

Defines the schema of the external table database objects created by gpload.

By default, the Greengage DB’s search_path is used

No

SCHEMA <schema> | '%'

The name of the external table schema. If the schema does not exist, an error is returned. If the percent character (%) is specified, the schema of the table specified by TABLE in the OUTPUT section is used.

If the table name does not specify a schema, the default schema is used

Yes, if EXTERNAL is declared

OUTPUT

Defines the target table and the data column values to be loaded into the database

Yes

TABLE <schema.table_name>

The name of the target table to load into

Yes

MODE insert | update | merge

Defines the data loading mode. Defaults to INSERT if not specified.

The following load modes are available:

  • INSERT — loads data into the target table by running SELECT * on the external table created by gpload and INSERT on the target table.

  • UPDATE — updates the columns of the target table specified in UPDATE_COLUMNS based on the following conditions:

    • The rows have the values in columns specified in MATCH_COLUMNS equal to those in the source data.

    • The optional boolean condition specified in UPDATE_CONDITION evaluates to true.

  • MERGE — inserts new rows and updates the columns of the target table specified in UPDATE_COLUMNS based on the following conditions:

    • The rows have the values in columns specified in MATCH_COLUMNS equal to those in the source data.

    • The optional boolean condition specified in UPDATE_CONDITION evaluates to true.

    When the MATCH_COLUMNS value in the source data does not have a corresponding matching value in the existing data of the target table, a new row is identified. In these cases, the entire row from the source file is inserted. If there are multiple new MATCH_COLUMNS values that are the same, only one new row for that value is inserted. Use UPDATE_CONDITION to filter out the rows to discard

No

MATCH_COLUMNS <target_column_name>

Specifies the columns to use as the join condition for the update. The values in the specified target columns must be equal to those of the corresponding source data columns in order for the row to be updated in the target table

Yes, if MODE is UPDATE or MERGE

UPDATE_COLUMNS <target_column_name>

Specifies the columns to update for the rows that meet the MATCH_COLUMNS criteria and the optional UPDATE_CONDITION

Yes, if MODE is UPDATE or MERGE

UPDATE_CONDITION '<boolean_condition>'

Specifies a boolean condition (similar to those commonly declared in a WHERE clause) that must be met in order for a row of the target table to be updated

No
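
For example, an OUTPUT section that updates the price of matched rows only when the target row's current price exceeds 100 might look as follows (table and column names are illustrative):

```yaml
OUTPUT:
  - TABLE: orders
  - MODE: UPDATE
  - MATCH_COLUMNS:
      - id
  - UPDATE_COLUMNS:
      - price
  - UPDATE_CONDITION: 'price > 100'
```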

MAPPING <source_column_name> | '<expression>'

Overrides the default source-to-target column mapping.

The default source-to-target mapping is based on a match of column names as defined in the source COLUMNS section and the column names of the target table specified in TABLE.

A mapping is specified as either <target_column_name>: <source_column_name> or <target_column_name>: '<expression>'. <expression> is any expression that you would specify in a SELECT clause, such as a constant value, a column reference, an operator invocation, a function call, and so on

No
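
A MAPPING section that combines a direct column reference with expressions might look like this (all column names here are illustrative):

```yaml
- MAPPING:
    name: source_name
    price: 'source_price * 1.2'
    loaded_at: 'current_timestamp'
```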

PRELOAD

Specifies the operations to run prior to the load operation, using the TRUNCATE, REUSE_TABLES, STAGING_TABLE, and FAST_MATCH keys

No

TRUNCATE true | false

If set to true, gpload removes all rows in the target table prior to loading; the default value is false

No

REUSE_TABLES true | false

If set to true, gpload does not drop the external table objects and staging table objects it creates. These objects are reused for future load operations based on the same load specification, which improves performance of trickle loads (ongoing small loads to the same target table).

If LOG_ERRORS is set to true, REUSE_TABLES must be set to true to retain the formatting errors in Greengage DB error logs. If REUSE_TABLES is set to false, formatting error information is deleted after the gpload operation.

No

STAGING_TABLE <external_table_name>

Specifies the name of the temporary external table that is created during a gpload operation. The external table is used by gpfdist. REUSE_TABLES must also be set to true. Otherwise, if REUSE_TABLES is false or is not specified, STAGING_TABLE is ignored.

By default, gpload creates a temporary external table with a randomly generated name. If external_table_name contains a period (.), gpload returns an error. If the table exists, the utility reuses it. The value of SCHEMA in the EXTERNAL section serves as the schema for the <external_table_name> table. The utility returns an error if the existing table schema does not match the OUTPUT table schema.

If the SCHEMA value is %, the schema for the <external_table_name> table is the same as the schema of the target table, that is, the table specified in the TABLE key in the OUTPUT section.

If SCHEMA is not set, gpload searches for the table using the schemas in the database search_path. If the table is not found, the <external_table_name> table is created in the default PUBLIC schema. gpload creates the staging table using the distribution keys of the target table as the distribution keys for the staging table.

If the target table uses random distribution (DISTRIBUTED RANDOMLY), gpload uses MATCH_COLUMNS as the staging table’s distribution keys

No

FAST_MATCH true | false

If set to true, gpload only searches the database for matching external table objects when reusing external tables. If REUSE_TABLES is set to false or not specified and FAST_MATCH is set to true, gpload returns a warning message

No
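
A PRELOAD section tuned for repeated trickle loads, which truncates the target table first and reuses staging objects across runs, might look as follows (the staging table name is illustrative):

```yaml
PRELOAD:
  - TRUNCATE: true
  - REUSE_TABLES: true
  - STAGING_TABLE: staging_orders
  - FAST_MATCH: true
```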

SQL

Defines optional arbitrary SQL commands to run before (BEFORE) and after (AFTER) the load operation.

The commands must be listed in the order of desired execution. You can use this section, for example, to configure the load status logging as described in Examples

No

BEFORE "<sql_command>"

An SQL command to run before the load operation starts. Enclose commands in quotes

No

AFTER "<sql_command>"

An SQL command to run after the load operation completes. Enclose commands in quotes

No