Use gpload
The Greengage DB gpload utility facilitates loading data into database tables via the Greenplum Parallel File Server (gpfdist). gpload operates based on a load specification defined in a YAML-formatted control file. It invokes gpfdist, creates an external table definition based on the defined source data, and runs an INSERT, UPDATE, or MERGE operation to load the source data into the target table in the database.
The gpload operation, including any additional SQL commands specified in the SQL section of the YAML control file, is performed as a single transaction to prevent inconsistent data when performing simultaneous load operations on a target table.
gpfdist and gpload are compatible only with the Greengage DB major version in which they are shipped (for example, 6.x).
Syntax
The gpload
command’s general syntax is as follows:
gpload -f <control_file> [-l <log_file>] [-h <hostname>] [-p <port>]
[-U <username>] [-d <database>] [-W] [--gpfdist_timeout <seconds>]
[--no_auto_trans] [[-v | -V] [-q]] [-D]
gpload -?
gpload --version
Option | Description |
---|---|
-f <control_file> | A YAML file that contains the load specification details. Learn more in Control file |
-d <database> | The database to load the data into. If not specified, the database name is taken from the load control file, the $PGDATABASE environment variable, or defaults to the current system user name |
-h <hostname> | Specifies the host name of the machine where the Greengage DB master database server is running. If not specified, the value is read from the load control file, the $PGHOST environment variable, or defaults to localhost |
-p <port> | Specifies the TCP port on which the Greengage DB master database server is listening for connections. If not specified, the value is read from the load control file, the $PGPORT environment variable, or defaults to 5432 |
-U <username> | Specifies the database role name to connect as. If not specified, the value is read from the load control file, the $PGUSER environment variable, or defaults to the current system user name |
-W | Forces a password prompt. If not specified, the password is read from the $PGPASSWORD environment variable |
--gpfdist_timeout <seconds> | Sets the timeout, in seconds, for the gpfdist parallel file distribution program to send a response. You might need to increase this value when operating on high-traffic networks |
--no_auto_trans | Deactivates processing the load operation as a single transaction in case a single load operation is performed on the target table. By default, gpload processes each load operation as a single transaction |
-l <log_file> | Specifies the log file location, ~/gpAdminLogs/gpload_YYYYMMDD by default. Learn more about logging in Greengage DB in Logging |
-D | Checks for error conditions, but does not execute the actual data load |
-q | Runs in quiet mode. The command output is not displayed on the screen, but is still written to the log file |
-v | Shows verbose output of the load steps as they are executed. The option cannot be used together with -V |
-V | Shows very verbose output: the load steps as they are executed as well as all other messages generated by the utility. The option cannot be used together with -v |
-? (show help) | Shows help, then exits |
--version | Shows the version of this utility, then exits |
Control file
The control file is used for orchestrating the loading process. It specifies the Greengage DB connection information, gpfdist configuration information, external table options, and the source data format.
The file uses the YAML 1.1 format with its own schema for defining the various steps of a Greengage DB load operation and must be a valid YAML document. The gpload program processes the control file document in order and uses indentation (spaces) to determine the document hierarchy and the relationships of the sections to one another.
The use of whitespace in a YAML document is significant: whitespace should not be used for formatting purposes, and tabs should not be used at all.
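Because tabs and stray whitespace make a control file invalid, it can be useful to lint a file for tab characters before handing it to gpload. The sketch below is a minimal, hypothetical pre-check in Python (it is not part of gpload itself); it only flags tab characters and reports their line numbers:

```python
def lint_control_file(text: str) -> list[str]:
    """Flag whitespace problems that gpload's YAML parser rejects."""
    issues = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            # YAML forbids tabs for indentation; control files must use spaces
            issues.append(f"line {lineno}: tab character found; use spaces")
    return issues

example = "GPLOAD:\n  INPUT:\n\t- SOURCE:\n"
print(lint_control_file(example))  # → ['line 3: tab character found; use spaces']
```

A check like this catches the most common control-file mistake before gpload reports a less specific YAML parsing error.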
The structure of a load control file looks as follows. See Control file keys reference for the complete description of keys.
---
VERSION: 1.0.0.1
DATABASE: <db_name>
USER: <db_username>
HOST: <master_hostname>
PORT: <master_port>
GPLOAD:
  INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - <hostname_or_ip>
        PORT: <http_port>
      | PORT_RANGE: [ <start_port_range>, <end_port_range> ]
        FILE:
          - </path/to/input_file>
        SSL: true | false
        CERTIFICATES_PATH: </path/to/certificates>
    - FULLY_QUALIFIED_DOMAIN_NAME: true | false
    - COLUMNS:
        - <field_name>: <data_type>
    - TRANSFORM: '<transformation>'
    - TRANSFORM_CONFIG: '<configuration-file-path>'
    - MAX_LINE_LENGTH: <integer>
    - FORMAT: text | csv
    - DELIMITER: '<delimiter_character>'
    - ESCAPE: '<escape_character>' | 'OFF'
    - NEWLINE: 'LF' | 'CR' | 'CRLF'
    - NULL_AS: '<null_string>'
    - FILL_MISSING_FIELDS: true | false
    - FORCE_NOT_NULL: <column_name> [, ...]
    - QUOTE: '<csv_quote_character>'
    - HEADER: true | false
    - ENCODING: <database_encoding>
    - ERROR_LIMIT: <integer>
    - LOG_ERRORS: true | false
  EXTERNAL:
    - SCHEMA: <schema> | '%'
  OUTPUT:
    - TABLE: <schema.table_name>
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
        - <target_column_name>
    - UPDATE_COLUMNS:
        - <target_column_name>
    - UPDATE_CONDITION: '<boolean_condition>'
    - MAPPING:
        <target_column_name>: <source_column_name> | '<expression>'
  PRELOAD:
    - TRUNCATE: true | false
    - REUSE_TABLES: true | false
    - STAGING_TABLE: <external_table_name>
    - FAST_MATCH: true | false
  SQL:
    - BEFORE: "<sql_command>"
    - AFTER: "<sql_command>"
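Several keys in the specification depend on one another (for example, the update and merge modes require MATCH_COLUMNS and UPDATE_COLUMNS, as described in Control file keys reference). As an illustrative sketch, a few of these interdependencies can be checked on an already-parsed control file; the function below is hypothetical and operates on plain Python dicts and lists mirroring the YAML layout above:

```python
def check_spec(spec: dict) -> list[str]:
    """Check a few gpload control-file key interdependencies."""
    problems = []
    gpload = spec.get("GPLOAD", {})
    # INPUT and OUTPUT are mandatory sections of GPLOAD
    for section in ("INPUT", "OUTPUT"):
        if section not in gpload:
            problems.append(f"GPLOAD is missing the {section} section")

    def flatten(items):
        # gpload YAML uses a list of single-key mappings; merge them into one dict
        merged = {}
        for item in items or []:
            if isinstance(item, dict):
                merged.update(item)
        return merged

    output = flatten(gpload.get("OUTPUT"))
    mode = str(output.get("MODE", "insert")).lower()
    if mode in ("update", "merge"):
        # both keys are required for the update and merge modes
        for key in ("MATCH_COLUMNS", "UPDATE_COLUMNS"):
            if key not in output:
                problems.append(f"MODE {mode} requires {key}")
    return problems

spec = {"GPLOAD": {"INPUT": [], "OUTPUT": [{"TABLE": "orders"}, {"MODE": "update"}]}}
print(check_spec(spec))
# → ['MODE update requires MATCH_COLUMNS', 'MODE update requires UPDATE_COLUMNS']
```

Running such a check before invoking gpload surfaces specification errors without starting a gpfdist instance; gpload's own -D (dry run) option performs a more complete validation.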
The MERGE and UPDATE operations are not supported if the target table or column name is a reserved keyword, has capital letters, or is enclosed in double quotes (" ").
If the database object names were created using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes (') in the gpload control file.
Consider the following table:
CREATE TABLE "MyTable" ("MyColumn" text);
In this case, the load control file must refer to these table and column names as follows:
...
- COLUMNS:
    - '"MyColumn"': text
OUTPUT:
  - TABLE: public.'"MyTable"'
...
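The single-quote wrapping shown above can be mechanized. The helper below is a hypothetical illustration (not a gpload API) of how a delimited identifier must be written inside a control file so that the literal double quotes survive YAML parsing:

```python
def delimited_identifier(name: str) -> str:
    """Wrap a delimited identifier in double quotes; the caller then
    places the result inside single quotes in the control file YAML,
    producing e.g. '"MyTable"'."""
    return '"' + name + '"'

# As it would appear in the control file: - TABLE: public.'"MyTable"'
print("public.'" + delimited_identifier("MyTable") + "'")  # → public.'"MyTable"'
```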
Run gpload
- Create a YAML-formatted load control file.
- Run the gpload utility using the -f option to pass in the created load control file, for example:
$ gpload -f example_load.yml
Examples
These examples illustrate loading data into a database table from an external file by using the INSERT, UPDATE, and MERGE methods.
INSERT method
This example demonstrates loading data into a database table by using the INSERT method.
- Empty the orders table and populate it with data as follows:
TRUNCATE TABLE orders;
INSERT INTO orders (id, name, price) VALUES
    (1,'Laptop',999.99),
    (2,'Smartphone',499.99),
    (3,'Tablet',299.99);
- Empty the auxiliary audit table:
TRUNCATE TABLE audit;
- In the /tmp directory on the master host, create the orders.csv file containing the source data to load:
$ cat > orders.csv <<EOF
id,name,price
4,Monitor,599.99
5,Keyboard,99.99
EOF
- On the master host, create the load.yml load control file:
$ cat > load.yml <<EOF
---
VERSION: 1.0.0.1
DATABASE: customers
USER: gpadmin
HOST: mdw
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        PORT: 8081
        FILE:
          - /tmp/orders.csv
    - COLUMNS:
        - id: INTEGER
        - name: VARCHAR
        - price: NUMERIC
    - FORMAT: CSV
    - HEADER: true
  OUTPUT:
    - TABLE: orders
    - MODE: INSERT
  SQL:
    - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
    - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
EOF
- Run the gpload utility, passing in the created load control file. This inserts new rows into the orders table; the audit table will contain timestamps marking the loading start and end.
$ gpload -f load.yml
The output should look as follows:
INFO|gpload session started
INFO|setting schema 'public' for table 'orders'
INFO|started gpfdist -p 8000 -P 9000 -f "/tmp/orders.csv" -t 30
INFO|running time: 0.10 seconds
INFO|rows Inserted = 2
INFO|rows Updated = 0
INFO|data formatting errors = 0
INFO|gpload succeeded
- Query the orders table:
SELECT * FROM orders;
The output should look as follows, with rows 4 and 5 inserted into the table:
 id |    name    | price
----+------------+--------
  1 | Laptop     | 999.99
  2 | Smartphone | 499.99
  3 | Tablet     | 299.99
  4 | Monitor    | 599.99
  5 | Keyboard   |  99.99
- Query the audit table:
SELECT * FROM audit;
The output should contain the timestamps marking the loading start and end:
 event |         timestamp
-------+----------------------------
 start | 2025-05-25 12:45:47.931476
 end   | 2025-05-25 12:45:50.134581
UPDATE method
This example demonstrates loading data into a database table by using the UPDATE method.
- Empty the orders table and populate it with data as follows:
TRUNCATE TABLE orders;
INSERT INTO orders (id, name, price) VALUES
    (1,'Laptop',999.99),
    (2,'Smartphone',499.99),
    (3,'Tablet',299.99),
    (4,'Monitor',599.99),
    (5,'Keyboard',99.99);
- Empty the auxiliary audit table:
TRUNCATE TABLE audit;
- In the /tmp directory on the master host, create the orders.csv file containing the source data to load. Notice that the data contains rows that already exist in the orders table but whose values in the price column are different.
$ cat > orders.csv <<EOF
id,name,price
1,Laptop,1299.99
4,Monitor,899.99
5,Keyboard,59.99
EOF
- On the master host, create a load.yml load control file:
$ cat > load.yml <<EOF
---
VERSION: 1.0.0.1
DATABASE: customers
USER: gpadmin
HOST: mdw
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        FILE:
          - /tmp/orders.csv
    - COLUMNS:
        - id: integer
        - name: text
        - price: numeric
    - FORMAT: CSV
    - HEADER: true
  OUTPUT:
    - TABLE: orders
    - MODE: UPDATE
    - MATCH_COLUMNS:
        - id
    - UPDATE_COLUMNS:
        - price
  SQL:
    - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
    - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
EOF
- Run the gpload utility, passing in the created load control file. This updates the values in the price column for the rows where the id column values in the orders table match those in the orders.csv file. The audit table will contain timestamps marking the loading start and end.
$ gpload -f load.yml
The output should look as follows:
INFO|gpload session started
INFO|setting schema 'public' for table 'orders'
INFO|started gpfdist -p 8000 -P 9000 -f "/tmp/orders.csv" -t 30
INFO|running time: 0.13 seconds
INFO|rows Inserted = 0
INFO|rows Updated = 3
INFO|data formatting errors = 0
INFO|gpload succeeded
- Query the orders table:
SELECT * FROM orders;
The output should look as follows, with rows 1, 4, and 5 updated:
 id |    name    |  price
----+------------+---------
  1 | Laptop     | 1299.99
  2 | Smartphone |  499.99
  3 | Tablet     |  299.99
  4 | Monitor    |  899.99
  5 | Keyboard   |   59.99
- Query the audit table:
SELECT * FROM audit;
The output should contain the timestamps marking the loading start and end:
 event |         timestamp
-------+----------------------------
 start | 2025-05-25 16:24:47.931476
 end   | 2025-05-25 16:24:50.134581
MERGE method
This example demonstrates loading data into a database table by using the MERGE method.
- Empty the orders table and populate it with data as follows:
TRUNCATE TABLE orders;
INSERT INTO orders (id, name, price) VALUES
    (1,'Laptop',999.99),
    (2,'Smartphone',499.99),
    (3,'Tablet',299.99),
    (4,'Monitor',599.99),
    (5,'Keyboard',99.99);
- Empty the auxiliary audit table:
TRUNCATE TABLE audit;
- In the /tmp directory on the master host, create the orders.csv file containing the source data to load. Notice that the data contains rows that already exist in the orders table but whose values in the price column are different (2, 4, and 5). Rows 6 and 7 are new rows.
$ cat > orders.csv <<EOF
id,name,price
1,Laptop,999.99
2,Smartphone,799.99
3,Tablet,299.99
4,Monitor,849.99
5,Keyboard,59.99
6,Webcam,29.99
7,Speakers,129.99
EOF
- On the master host, create a load.yml load control file:
$ cat > load.yml <<EOF
---
VERSION: 1.0.0.1
DATABASE: customers
USER: gpadmin
HOST: mdw
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        FILE:
          - /tmp/orders.csv
    - COLUMNS:
        - id: integer
        - name: text
        - price: numeric
    - FORMAT: CSV
    - HEADER: true
  OUTPUT:
    - TABLE: orders
    - MODE: MERGE
    - MATCH_COLUMNS:
        - id
    - UPDATE_COLUMNS:
        - price
  SQL:
    - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
    - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
EOF
- Run the gpload utility, passing in the created load control file. This updates the values in the price column for the rows where the id column values in the orders table match those in the orders.csv file, and inserts new rows. The audit table will contain timestamps marking the loading start and end.
$ gpload -f load.yml
The output should look as follows:
INFO|gpload session started
INFO|setting schema 'public' for table 'orders'
INFO|started gpfdist -p 8000 -P 9000 -f "/tmp/orders.csv" -t 30
INFO|running time: 0.14 seconds
INFO|rows Inserted = 2
INFO|rows Updated = 5
INFO|data formatting errors = 0
INFO|gpload succeeded
- Query the orders table:
SELECT * FROM orders;
The output should look as follows, with rows 2, 4, and 5 updated and rows 6 and 7 inserted:
 id |    name    | price
----+------------+--------
  1 | Laptop     | 999.99
  2 | Smartphone | 799.99
  3 | Tablet     | 299.99
  4 | Monitor    | 849.99
  5 | Keyboard   |  59.99
  6 | Webcam     |  29.99
  7 | Speakers   | 129.99
- Query the audit table:
SELECT * FROM audit;
The output should contain the timestamps marking the loading start and end:
 event |         timestamp
-------+----------------------------
 start | 2025-05-25 15:24:47.931476
 end   | 2025-05-25 15:24:50.134581
Control file keys reference
Key | Description | Required |
---|---|---|
VERSION | The version of the gpload control file schema, for example: 1.0.0.1 | Yes |
DATABASE <db_name> | The database to connect to. If not specified, defaults to the value of the $PGDATABASE environment variable or the current system user name. You can also specify the database on the command line using the -d option | No |
USER <db_username> | Specifies the database role to use for connecting. If not specified, defaults to the current user or the value of the $PGUSER environment variable. You can also specify the user on the command line using the -U option | No |
HOST <master_hostname> | Specifies the Greengage DB master host name. If not specified, defaults to localhost or the value of the $PGHOST environment variable. You can also specify the host on the command line using the -h option | No |
PORT <master_port> | Specifies the port on which the Greengage DB master database server is listening for connections. If not specified, defaults to 5432 or the value of the $PGPORT environment variable. You can also specify the port on the command line using the -p option | No |
GPLOAD | Encapsulates the load specification. A GPLOAD specification must have an INPUT and an OUTPUT section defined | Yes |
INPUT | Defines the location and the format of the input data to be loaded | Yes |
SOURCE | Defines the location of a source file. An INPUT section can have more than one SOURCE block defined | Yes |
LOCAL_HOSTNAME <hostname_or_ip> | Specifies the host name or the IP address of the local machine on which gpload is running. If the machine is configured with multiple network interface cards (NICs), you can specify the host name or the IP of each individual NIC to allow network traffic to use all NICs simultaneously. By default, only the local machine's primary host name or IP is used | No |
PORT <http_port> | Specifies the port number that the gpfdist file distribution program should use. You can also supply a range of port numbers with PORT_RANGE instead. If neither PORT nor PORT_RANGE is defined, a port is selected from the range 8000-9000 | No |
PORT_RANGE <start_port_range>, <end_port_range> | Specifies the range of port numbers from which gpfdist can choose an available port for this instance | No |
FILE </path/to/input_file> | Specifies the location of a file or directory on the local file system or a named pipe that contains the data to be loaded. You can declare more than one file if the same data format is used in all the specified files. To denote multiple files to read, use wildcard characters such as * | Yes |
SSL true | false | Specifies usage of SSL encryption. If set to true, gpload starts the gpfdist server with SSL enabled and uses the gpfdists protocol. See Configure SSL for details | No |
CERTIFICATES_PATH </path/to/certificates> | Specifies the certificates storage location. Required when SSL is set to true; cannot be specified when SSL is false. The location specified in CERTIFICATES_PATH must contain the required certificate files. The root directory (/) cannot be specified as CERTIFICATES_PATH | Yes, if SSL is true |
FULLY_QUALIFIED_DOMAIN_NAME true | false | Specifies whether gpload resolves host names to fully qualified domain names (FQDN) or local host names. If set to true, host names are resolved to FQDN; if set to false, local host names are used. The default is false | No |
COLUMNS <field_name>: <data_type> | Specifies the schema of the source data files in the format of <field_name>: <data_type>. If COLUMNS is not specified, the schema of the target TABLE is implied: the source data must then have the same column order, number of columns, and compatible data types as the target table. The default source-to-target mapping is based on a match of column names. This default mapping can be overridden using the MAPPING section | No |
TRANSFORM '<transformation>' | Specifies the name of the input transformation passed to gpload. See Load data with gpload for details | No |
TRANSFORM_CONFIG '<configuration-file-path>' | Specifies the location of the transformation configuration file. See Transformation configuration file for details | Yes, if TRANSFORM is specified |
MAX_LINE_LENGTH <integer> | Sets the maximum allowed data row length in bytes. Should be used when user data includes very wide rows (or when the "line too long" error message occurs). Should not be used otherwise, as it increases resource allocation | No |
FORMAT text | csv | Specifies the format of the source data files: either plain text (text) or comma-separated values (csv). Defaults to text. Learn more about formatting source data in Format external data | No |
DELIMITER '<delimiter_character>' | Designates a single ASCII character to act as a column delimiter. The delimiter must reside between any two data value fields but not at the beginning or the end of a row. You can also specify a non-printable ASCII character or a non-printable Unicode character, such as "\x1B" or "\u001B". For text files, the default column delimiter is the horizontal TAB character (0x09). For CSV files, the default column delimiter is the comma character (,). See Format columns for details | No |
ESCAPE '<escape_character>' | 'OFF' | Designates a single character used as an escape character. For text files, the default escape character is a backslash (\). For CSV files, the default escape character is a double quote ("). Specify 'OFF' to deactivate escaping in text-formatted files. See Escape characters for details | No |
NEWLINE 'LF' | 'CR' | 'CRLF' | Designates the character sequence used as a newline: LF (line feed, 0x0A), CR (carriage return, 0x0D), or CRLF (carriage return plus line feed, 0x0D 0x0A). If not specified, the newline character detected at the end of the first line of data is used. See Format rows for details | No |
NULL_AS '<null_string>' | Designates a string representing a null value, which indicates an unknown piece of data in a column or field. For text files, the default string is \N. For CSV files, the default string is an empty value with no quotation marks. See Represent NULL values for details | No |
FILL_MISSING_FIELDS true | false | If set to true, sets missing trailing field values at the end of a line or row to NULL instead of reporting an error. The default is false | No |
FORCE_NOT_NULL <column_name> [, ...] | Treats values in each specified column as if they were quoted. Since the default null string is an empty unquoted string, this causes missing values to be evaluated as zero-length strings | No |
QUOTE '<csv_quote_character>' | Specifies the quotation character, double quote (") by default. Applies to the csv format only | No |
HEADER true | false | Designates whether the data file contains a header row. If using multiple data source files, all of them must have a header row. The default is to assume that the input files do not have a header row | No |
ENCODING <database_encoding> | Defines the character set encoding of the source data. If not specified, the default client encoding is used. If you change the ENCODING value in an existing gpload control file, you must manually drop any external tables that were created using the previous ENCODING configuration. See Character encoding for details | No |
ERROR_LIMIT <integer> | Enables single-row error isolation mode for this load operation. When enabled, input rows having format errors are discarded until the error limit count is reached on any Greengage DB segment instance during input processing. If the error limit is not reached, all valid rows are loaded and any invalid rows are either discarded or captured as part of the error log information. By default, the load operation is cancelled on the first error. Note that single-row error isolation only applies to data rows with format errors, for example, extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Constraint errors, such as primary key violations, still cause the load operation to be cancelled | No |
LOG_ERRORS true | false | Defines whether to store rows with formatting errors in the error log if ERROR_LIMIT is declared. The value can be true or false; the default is false. You can examine formatting errors by using the Greengage DB's gp_read_error_log() built-in SQL function. If set to false, only summary information about formatting errors is returned. You can delete the formatting errors from the error logs by using the gp_truncate_error_log() function | No |
EXTERNAL | Defines the schema of the external table database objects created by gpload. By default, the Greengage DB's search_path is used | No |
SCHEMA <schema> | '%' | The name of the external table schema. If the schema does not exist, an error is returned. If the percent character (%) is specified, the schema of the table specified in the OUTPUT section's TABLE key is used. If the table name does not specify a schema, the default schema is used | Yes, if EXTERNAL is declared |
OUTPUT | Defines the target table and the data column values to be loaded into the database | Yes |
TABLE <schema.table_name> | The name of the target table to load into | Yes |
MODE insert | update | merge | Defines the data loading mode. Defaults to insert. The following load modes are available: insert, which loads data into the target table; update, which updates the UPDATE_COLUMNS values of the target table rows whose MATCH_COLUMNS values match those of the source data rows; and merge, which inserts source rows that have no MATCH_COLUMNS match in the target table and updates the rows that do | No |
MATCH_COLUMNS <target_column_name> | Specifies the columns to use as the join condition for the update. The values in the specified target columns must be equal to those of the corresponding source data columns in order for the row to be updated in the target table | Yes, if MODE is update or merge |
UPDATE_COLUMNS <target_column_name> | Specifies the columns to update for the rows that meet the MATCH_COLUMNS criteria and the optional UPDATE_CONDITION | Yes, if MODE is update or merge |
UPDATE_CONDITION '<boolean_condition>' | Specifies a boolean condition (similar to those commonly declared in a WHERE clause) that must be met in order for a row in the target table to be updated, or inserted in the case of a merge | No |
MAPPING <target_column_name>: <source_column_name> | '<expression>' | Overrides the default source-to-target column mapping. The default source-to-target mapping is based on a match of column names as defined in the source COLUMNS section and the target TABLE. A mapping is specified as either <target_column_name>: <source_column_name> or <target_column_name>: '<expression>', where the expression is any expression that you would use in the SELECT list of a query | No |
PRELOAD | Specifies the operations to run prior to the load operation. Currently, the only supported preload operation is TRUNCATE | No |
TRUNCATE true | false | If set to true, gpload truncates the target table before loading it. The default is false | No |
REUSE_TABLES true | false | If set to true, gpload does not drop the external table objects and staging table objects it creates, and reuses them for subsequent load operations that use the same load specification, which improves the performance of repeated loads. If set to false (the default), these objects are dropped after the load operation completes | No |
STAGING_TABLE <external_table_name> | Specifies the name of the temporary external table that is created during a load operation; used together with REUSE_TABLES: true. By default, gpload creates a temporary external table with a randomly generated name. If the specified name contains a period (.), an error is returned: the staging table name cannot be schema-qualified. If REUSE_TABLES is set to false or not declared, STAGING_TABLE is ignored. If the specified table already exists, it is reused | No |
FAST_MATCH true | false | If set to true, gpload only searches the database for matching external table objects to reuse, without checking the external table column names and column types. Requires REUSE_TABLES: true; the default is false | No |
SQL | Defines optional arbitrary SQL commands to run before (BEFORE) and after (AFTER) the load operation. The commands must be listed in the order of desired execution. You can use this section, for example, to configure the load status logging as described in Examples | No |
BEFORE "<sql_command>" | An SQL command to run before the load operation starts. Enclose commands in quotes | No |
AFTER "<sql_command>" | An SQL command to run after the load operation completes. Enclose commands in quotes | No |