Hello, I’m DocuDroid!
Submitting feedback
Thank you for rating our AI Search!
We would be grateful if you could share your thoughts so we can improve our AI Search for you and other readers.
GitHub

gpload

Runs a data load job defined in a YAML-formatted control file. See Use gpload for more details and usage examples.

Synopsis

gpload -f <control_file>
       [ -l <log_file> ]
       [ -h <hostname> ]
       [ -p <port> ]
       [ -U <username> ]
       [ -d <database> ]
       [ -W ]
       [ --gpfdist_timeout <seconds> ]
       [ --no_auto_trans ]
       [ --max_retries <retry_times> ]
       [ [ -v | -V ] [ -q ] ]
       [ -D ]

gpload -?

gpload --version

Requirements

The client machine where gpload is run must have the following:

  • The gpfdist parallel file distribution program installed and in your PATH. This program is located in $GPHOME/bin of your Greengage DB server installation.

  • Network access to and from all hosts in your Greengage DB cluster (master and segments).

  • Network access to and from the hosts where the data to be loaded resides (ETL servers).

Description

gpload is a data loading utility that acts as an interface to the Greengage DB external table parallel loading feature. Using a load specification defined in a YAML formatted control file, gpload runs a load by invoking the Greengage DB parallel file server (gpfdist), creating an external table definition based on the source data defined, and running an INSERT, UPDATE or MERGE operation to load the source data into the target table in the database.

NOTE

gpfdist is compatible only with the Greengage DB major version in which it is shipped. For example, a gpfdist utility that is installed with Greengage DB 6.x cannot be used with Greengage DB 7.x.

NOTE

MERGE and UPDATE operations are not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.

The operations, including any SQL commands specified in the SQL collection of the YAML control file (see Control file format), are performed as a single transaction to prevent inconsistent data when performing multiple, simultaneous load operations on a target table.

Options

-f <control_file>

(Required) A YAML file that contains the load specification details. See Control file format.

--gpfdist_timeout <seconds>

Set the timeout for the gpfdist parallel file distribution program to send a response. Enter a value from 0 to 30 seconds (entering 0 deactivates timeouts). Note that you might need to increase this value when operating on high-traffic networks.

-l <log_file>

Specify where to write the log file. Defaults to ~/gpAdminLogs/gpload_YYYYMMDD. For more information about the log file, see Log file format.

--no_auto_trans

Specify --no_auto_trans to deactivate processing the load operation as a single transaction if you are performing a single load operation on the target table.

By default, gpload processes each load operation as a single transaction to prevent inconsistent data when performing multiple, simultaneous operations on a target table.

-q

Run in quiet mode. Command output is not displayed on the screen, but is still written to the log file.

The option cannot be used if -v or -V is provided.

-D

Check for error conditions, but do not run the load.

-v

Show verbose output of the load steps as they are executed.

The option cannot be used if -q is provided.

-V

Show very verbose output: the load steps as they are executed as well as all other messages generated by the utility.

The option cannot be used if -q is provided.

-?

Display help.

--version

Show the version of this utility, then exit.

Connection options

-d <database>

The database to load into. If not specified, reads from the load control file, the environment variable PGDATABASE or defaults to the current system user name.

-h <hostname>

The host name of the machine on which the Greengage DB master is running. If not specified, reads from the load control file, the environment variable PGHOST or defaults to localhost.

-p <port>

The TCP port on which the Greengage DB master is listening for connections. If not specified, reads from the load control file, the environment variable PGPORT or defaults to 5432.

--max_retries <retry_times>

Specify the maximum number of times gpload attempts to connect to Greengage DB after a connection timeout. The default value is 0, do not attempt to connect after a connection timeout. A negative integer, such as -1, specifies an unlimited number of attempts.

-U <username>

The database role name to connect as. If not specified, reads from the load control file, the environment variable PGUSER or defaults to the current system user name.

-W

Force a password prompt. If not specified, reads the password from the environment variable PGPASSWORD or from a password file specified by PGPASSFILE or in ~/.pgpass. If these are not set, then gpload will prompt for a password even if -W is not supplied.

Control file format

The gpload control file uses the YAML 1.1 document format and then implements its own schema for defining the various steps of a Greengage DB load operation. The control file must be a valid YAML document.

The gpload program processes the control file document in order and uses indentation (spaces) to determine the document hierarchy and the relationships of the sections to one another. The use of white space is significant. White space should not be used for formatting purposes, and tabs should not be used at all.

The basic structure of a load control file is:

---
VERSION: 1.0.0.1
DATABASE: <db_name>
USER: <db_username>
HOST: <master_hostname>
PORT: <master_port>
GPLOAD:
  INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - <hostname_or_ip>
        PORT: <http_port>
      | PORT_RANGE: [ <start_port_range>, <end_port_range> ]
        FILE:
          - </path/to/input_file>
        SSL: true | false
        CERTIFICATES_PATH: </path/to/certificates>
    - FULLY_QUALIFIED_DOMAIN_NAME: true | false
    - COLUMNS:
        - <field_name>: <data_type>
    - TRANSFORM: '<transformation>'
    - TRANSFORM_CONFIG: '<configuration-file-path>'
    - MAX_LINE_LENGTH: <integer>
    - FORMAT: text | csv
    - DELIMITER: '<delimiter_character>'
    - ESCAPE: '<escape_character>' | 'OFF'
    - NEWLINE: 'LF' | 'CR' | 'CRLF'
    - NULL_AS: '<null_string>'
    - FILL_MISSING_FIELDS: true | false
    - FORCE_NOT_NULL: <column_name> [, ...]
    - QUOTE: '<csv_quote_character>'
    - HEADER: true | false
    - ENCODING: <database_encoding>
    - ERROR_LIMIT: <integer>
    - LOG_ERRORS: true | false
  EXTERNAL:
    - SCHEMA: <schema> | '%'
  OUTPUT:
    - TABLE: <schema.table_name>
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
        - <target_column_name>
    - UPDATE_COLUMNS:
        - <target_column_name>
    - UPDATE_CONDITION: '<boolean_condition>'
    - MAPPING:
        <target_column_name>: <source_column_name> | '<expression>'
  PRELOAD:
    - TRUNCATE: true | false
    - REUSE_TABLES: true | false
    - STAGING_TABLE: <external_table_name>
    - FAST_MATCH: true | false
  SQL:
    - BEFORE: "<sql_command>"
    - AFTER: "<sql_command>"
Key Description Required

VERSION

The version of the gpload control file schema. The current version is 1.0.0.1

Yes

DATABASE <db_name>

The database to connect to.

If not specified, defaults to PGDATABASE value (if set) or the current system user name. You can also specify the database on the gpload command line using the -d option

No

USER <db_username>

Specifies the database role to use for connecting.

If not specified, defaults to the current user or PGUSER value (if set). You can also specify the database role on the gpload command line using the -U option. If the user running gpload is not a Greengage DB superuser, then the appropriate rights must be granted to the user for the load to be processed

No

HOST <master_hostname>

Specifies Greengage DB master host name.

If not specified, defaults to localhost or PGHOST (if set). You can also specify the master host name on the command line using the -h option

No

PORT <master_port>

Specifies the master port on which the Greengage DB master is listening for connections.

If not specified, defaults to 5432 or PGPORT value if set. You can also specify the master port on the gpload command line using the -p option

No

GPLOAD

Encapsulates the load specification. A specification must have an INPUT and an OUTPUT section defined

Yes

INPUT

Defines the location and the format of the input data to be loaded

Yes

SOURCE

Defines the location of a source file.

An INPUT section can have more than one SOURCE block defined. Each SOURCE block corresponds to one instance of the gpfdist utility that is started on a local machine. Each SOURCE block must include a FILE specification

Yes

LOCAL_HOSTNAME <hostname_or_ip>

Specifies the host name or the IP address of the local machine on which gpload is running.

If the machine is configured with multiple network interface cards (NICs), you can specify the host name or the IP of each individual NIC to allow network traffic to use all NICs simultaneously. By default, only the local machine’s primary host name or IP is used

No

PORT <http_port>

Specifies the port number that the gpfdist utility should use.

You can also use PORT_RANGE to have an available port selected from the specified range. If both PORT and PORT_RANGE are defined, then PORT takes precedence. If neither PORT nor PORT_RANGE are defined, an available port between 8000 and 9000 is selected by default. If multiple host names are declared in LOCAL_HOSTNAME, this port number is used for all hosts. This configuration is recommended if you want to use all NICs to load the same file or set of files in a given directory location

No

PORT_RANGE <start_port_range>, <end_port_range>

Specifies the range of port numbers from which gpload can choose an available port for this instance of the gpfdist utility

No

FILE </path/to/input_file>

Specifies the location of a file or directory on the local file system or a named pipe that contains the data to be loaded.

You can declare more than one file if the same data format is used in all the specified files. To denote multiple files to read, use the wildcard characters such as * or []. The .gz and .bz2 files are uncompressed automatically when reading data. The paths specified are assumed to be relative to the directory from which gpload is run (alternatively, you can provide an absolute path)

Yes

SSL true | false

Specifies usage of SSL encryption.

If SSL is set to true, gpload starts the gpfdist utility with the --ssl option and uses the GPFDISTS protocol.

See Configure SSL for details

No

CERTIFICATES_PATH </path/to/certificates>

Specifies the certificates storage location.

Required when SSL is true; cannot be specified when SSL is false or unspecified.

The location specified in CERTIFICATES_PATH must contain the following files:

  • server.crt — the server certificate file.

  • server.key — the server private key file.

  • root.crt — the trusted certificate authorities.

The root directory (/) cannot be specified as CERTIFICATES_PATH

Yes, if SSL is true

FULLY_QUALIFIED_DOMAIN_NAME true | false

Specifies whether gpload resolves hostnames to the fully qualified domain name (FQDN) or the local hostname.

If set to true, names are resolved to FQDN; otherwise, resolution is to the local hostname. The default is false. A fully qualified domain name might be required in some situations. For example, if the Greengage DB system is in a different domain than an ETL application that is being accessed by gpload

No

COLUMNS <field_name>: <data_type>

Specifies the schema of the source data files in the format of field_name:data_type.

The DELIMITER character in the source file is what separates two data value fields (columns). A row is determined by a line feed character (0x0a). If the input columns (COLUMNS) are not specified, then the schema of the output table (TABLE) is implied, meaning that the source data must have the same column order, number of columns, and data format as the target table. The default source-to-target mapping is based on a match of column names as defined in this section and the column names in the target table (TABLE).

This default mapping can be overridden using the MAPPING section

No

TRANSFORM '<transformation>'

Specifies the name of the input transformation passed to gpload.

See Load data with gpload for details

No

TRANSFORM_CONFIG '<configuration-file-path>'

Specifies the location of the transformation configuration file.

Yes, if TRANSFORM is specified

MAX_LINE_LENGTH <integer>

Sets the maximum allowed data row length in bytes, 32768 by default.

Should be used when user data includes very wide rows (or when the line too long error message occurs). Should not be used otherwise as it increases resource allocation. Valid range is 32 KB to 256 MB. On Windows systems, the upper limit is 1 MB

No

FORMAT text | csv

Specifies the format of the source data files: either plain text (TEXT) or comma-separated values (CSV) format. Defaults to TEXT if not specified.

Learn more about formatting source data in Format external data

No

DELIMITER '<delimiter_character>'

Designates a single ASCII character to act as a column delimiter. The delimiter must reside between any two data value fields but not at the beginning or the end of a row.

You can also specify a non-printable ASCII character or a non-printable Unicode character, such as \x1B or \u001B. The escape string syntax, E'<character-code>', is also supported for non-printable characters. The ASCII or Unicode character must be enclosed in single quotes, for example, E'\x1B' or E'\u001B'.

For text files, the default column delimiter is the horizontal TAB character (0x09).

For CSV files, the default column delimiter is the comma character (,).

See Format columns for details

No

ESCAPE '<escape_character>' | 'OFF'

Designates a single character used as an escape character.

For text files, the default escape character is a backslash (\). You can deactivate escaping by providing the OFF value.

For CSV files, the default escape character is a double quote (").

See Escape characters for details

No

NEWLINE 'LF' | 'CR' | 'CRLF'

Designates a character used as a newline character, which can be LF (Line feed, 0x0A), CR (Carriage return, 0x0D), or CR followed by LF (CRLF, 0x0D 0x0A).

If not specified, the newline character detected at the end of the first line of data is used.

See Format rows for details

No

NULL_AS '<null_string>'

Designates a string representing a null value, which indicates an unknown piece of data in a column or field.

For text files, the default string is \N.

For CSV files, the default string is an empty value with no quotation marks.

See Represent NULL values for details

No

FILL_MISSING_FIELDS true | false

Sets missing trailing field values at the end of a line or row to NULL. If not set, an error is reported in such cases. Blank rows, fields with a NOT NULL constraint, and trailing delimiters on a line will still report an error

No

FORCE_NOT_NULL ( <column_name> [, …​] )

Treats values in each specified column as if they were quoted. Since the default null string is an empty unquoted string, this causes missing values to be evaluated as zero-length strings

No

QUOTE '<csv_quote_character>'

Specifies the quotation character, double quote (") by default

No

HEADER true | false

Designates whether the data file contains a header row.

If using multiple data source files, all of them must have a header row. The default is to assume that the input files do not have a header row

No

ENCODING <database_encoding>

Defines the character set encoding of the source data.

If not specified, the default client encoding is used.

If you change the ENCODING value in an existing gpload control file, you must manually drop any external tables that were created using the previous ENCODING configuration. gpload does not drop and recreate external tables to use the new ENCODING if REUSE_TABLES is set to true.

See Character encoding for details

No

ERROR_LIMIT <integer>

Enables single-row error isolation mode for this load operation.

When enabled, input rows having format errors are discarded until the error limit count is reached on any Greengage DB segment instance during input processing. If the error limit is not reached, all valid rows are loaded and any invalid rows are either discarded or captured as part of the error log information. By default, the load operation is cancelled on the first error.

Note that single-row error isolation only applies to data rows with format errors, for example, extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Constraint errors, such as primary key violations, still cause the load operation to be cancelled

No

LOG_ERRORS true | false

Defines whether to store loading errors in the log if ERROR_LIMIT is declared.

The value can be true or false (default). If the value is true, rows with formatting errors are logged internally when running in single-row error isolation mode.

You can examine formatting errors by using the Greengage DB’s gp_read_error_log('<table_name>') built-in SQL function. If formatting errors are detected when loading data, gpload generates a warning message with the name of the table that contains the error information.

If LOG_ERRORS is set to true, REUSE_TABLES must also be set to true to retain the formatting errors in Greengage DB error logs. Otherwise, if REUSE_TABLES is set to false or is not specified, the error information is deleted after the gpload operation.

Only summary information about formatting errors is returned. You can delete the formatting errors from the error logs by using the gp_truncate_error_log() Greengage DB function.

No

EXTERNAL

Defines the schema of the external table database objects created by gpload.

By default, the Greengage DB’s search_path is used

No

SCHEMA <schema> | '%'

The name of the external table schema. If the schema does not exist, an error is returned. If the percent character (%) is specified, the schema of the table specified by TABLE in the OUTPUT section is used.

If the table name does not specify a schema, the default schema is used

Yes, if EXTERNAL is declared

OUTPUT

Defines the target table and the data column values to be loaded into the database

Yes

TABLE <schema.table_name>

The name of the target table to load into

Yes

MODE insert | update | merge

Defines the data loading mode. Defaults to INSERT if not specified.

The following load modes are available:

  • INSERT — loads data into the target table by running SELECT * on the external table created by gpload and INSERT on the target table.

  • UPDATE — updates the columns of the target table specified in UPDATE_COLUMNS based on the following conditions:

    • The rows have the values in columns specified in MATCH_COLUMNS equal to those in the source data.

    • The optional boolean condition specified in UPDATE_CONDITION evaluates to true.

  • MERGE — inserts new rows and updates the columns of the target table specified in UPDATE_COLUMNS based on the following conditions:

    • The rows have the values in columns specified in MATCH_COLUMNS equal to those in the source data.

    • The optional boolean condition specified in UPDATE_CONDITION evaluates to true.

    When the MATCH_COLUMNS value in the source data does not have a corresponding matching value in the existing data of the target table, a new row is identified. In these cases, the entire row from the source file is inserted. If there are multiple new MATCH_COLUMNS values that are the same, only one new row for that value is inserted. Use UPDATE_CONDITION to filter out the rows to discard

No

MATCH_COLUMNS <target_column_name>

Specifies the columns to use as the join condition for the update. The values in the specified target columns must be equal to those of the corresponding source data columns in order for the row to be updated in the target table

Yes, if MODE is UPDATE or MERGE

UPDATE_COLUMNS <target_column_name>

Specifies the columns to update for the rows that meet the MATCH_COLUMNS criteria and the optional UPDATE_CONDITION

Yes, if MODE is UPDATE or MERGE

UPDATE_CONDITION '<boolean_condition>'

Specifies a boolean condition (similar to those commonly declared in a WHERE clause) that must be met in order for a row of the target table to be updated

No

MAPPING <source_column_name> | '<expression>'

Overrides the default source-to-target column mapping.

The default source-to-target mapping is based on a match of column names as defined in the source COLUMNS section and the column names of the target table specified in TABLE.

A mapping is specified as either <target_column_name>: <source_column_name> or <target_column_name>: '<expression>'. <expression> is any expression that you would specify in a SELECT clause, such as a constant value, a column reference, an operator invocation, a function call, and so on

No

PRELOAD

Specifies the operations to run prior to the load operation. Currently, the only supported preload operation is TRUNCATE

No

TRUNCATE true | false

If set to true, gpload removes all rows in the target table prior to loading; the default value is false

No

REUSE_TABLES true | false

If set to true, gpload does not drop the external table objects and staging table objects it creates. These objects are reused for future load operations based on the same load specification, which improves performance of trickle loads (ongoing small loads to the same target table).

If LOG_ERRORS is set to true, REUSE_TABLES must be set to true to retain the formatting errors in Greengage DB error logs. If REUSE_TABLES is set to false, formatting error information is deleted after the gpload operation.

No

STAGING_TABLE <external_table_name>

Specifies the name of the temporary external table that is created during a gpload operation The external table is used by gpfdist. REUSE_TABLES must also be set to true. Otherwise, if REUSE_TABLES is false or is not specified, STAGING_TABLE is ignored.

By default, gpload creates a temporary external table with a randomly generated name. If external_table_name contains a period (.), gpload returns an error. If the table exists, the utility reuses it. The value of SCHEMA in the EXTERNAL section serves as the schema for the <external_table_name> table. The utility returns an error if the existing table schema does not match the OUTPUT table schema.

If the SCHEMA value is %, the schema for the <external_table_name> table is the same as the schema of the target table, that is the table specified in the TABLE key in the OUTPUT section.

If SCHEMA is not set, gpload searches for the table using the schemas in the database search_path. If the table is not found, the <external_table_name> table is created in the default PUBLIC schema. gpload creates the staging table using the distribution keys of the target table as the distribution keys for the staging table.

If the target table uses random distribution (DISTRIBUTED RANDOMLY), gpload uses MATCH_COLUMNS as the staging table’s distribution keys

No

FAST_MATCH true | false

If set to true, gpload only searches the database for matching external table objects when reusing external tables. If REUSE_TABLES is set to false or not specified and FAST_MATCH is set to true, gpload returns a warning message

No

SQL

Defines optional arbitrary SQL commands to run before (BEFORE) and after (AFTER) the load operation.

The commands must be listed in the order of desired execution. You can use this section, for example, to configure the load status logging as described in Examples

No

BEFORE "<sql_command>"

An SQL command to run before the load operation starts. Enclose commands in quotes

No

AFTER "<sql_command>"

An SQL command to run after the load operation completes. Enclose commands in quotes

No

Log file format

Log files output by gpload have the following format:

<timestamp>|<level>|<message>

Where timestamp takes the form: YYYY-MM-DD HH:MM:SS, level is one of DEBUG, LOG, INFO, ERROR, and message is a normal text message.

Some INFO messages that may be of interest in the log files are (where # corresponds to the actual number of seconds, units of data, or failed rows):

INFO|running time: <#.##> seconds
INFO|transferred <#.#> kB of <#.#> kB.
INFO|gpload succeeded
INFO|gpload succeeded with warnings
INFO|gpload failed
INFO|1 bad row
INFO|<#> bad rows

Notes

If your database object names were created using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the gpload control file. For example, if you create a table as follows:

CREATE TABLE "MyTable" ("MyColumn" text);

Your YAML-formatted gpload control file would refer to the above table and column names as follows:

- COLUMNS:
   - '"MyColumn"': text
OUTPUT:
   - TABLE: public.'"MyTable"'

See also