Use gpload
The Greengage DB gpload utility facilitates loading data into database tables via the Greenplum Parallel File Server (gpfdist).
gpload operates based on a load specification defined in a YAML-formatted control file.
It invokes gpfdist, creates an external table definition based on the source data defined in the control file, and runs an INSERT, UPDATE, or MERGE operation to load the source data into the target table in the database.
The gpload operation, including any additional SQL commands specified in the SQL section of the YAML control file, is performed as a single transaction. This prevents inconsistent data when simultaneous load operations are performed on a target table.
gpfdist and gpload are compatible only with the Greengage DB major version in which they are shipped (for example, 6.x).
Syntax overview
The gpload command’s general syntax is provided below.
See the gpload command reference for the complete options description.
gpload -f <control_file>
[ -l <log_file> ]
[ -h <hostname> ]
[ -p <port> ]
[ -U <username> ]
[ -d <database> ]
[ -W ]
[ --gpfdist_timeout <seconds> ]
[ --no_auto_trans ]
[ --max_retries <retry_times> ]
[ [ -v | -V ] [ -q ] ]
[ -D ]
gpload -?
gpload --version
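As an illustration of how these options combine, the following Python snippet assembles a typical invocation. The helper function is hypothetical (it is not part of gpload); the flags it emits are the ones listed above.

```python
# Hypothetical helper: assemble a gpload command line from common options.
# The flags (-f, -l, -h, -p, -U, -d) match the syntax overview above.
def build_gpload_cmd(control_file, log_file=None, host=None, port=None,
                     user=None, database=None):
    cmd = ["gpload", "-f", control_file]
    if log_file:
        cmd += ["-l", log_file]
    if host:
        cmd += ["-h", host]
    if port:
        cmd += ["-p", str(port)]
    if user:
        cmd += ["-U", user]
    if database:
        cmd += ["-d", database]
    return cmd

# Example: run a load against the master host mdw as gpadmin.
print(" ".join(build_gpload_cmd("load.yml", log_file="gpload.log",
                                host="mdw", port=5432,
                                user="gpadmin", database="customers")))
# → gpload -f load.yml -l gpload.log -h mdw -p 5432 -U gpadmin -d customers
```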
Control file
The control file is used for orchestrating the loading process.
It specifies the Greengage DB connection information, gpfdist configuration information, external table options, and source data format.
The file uses the YAML 1.1 format with its own schema for defining the various steps of a Greengage DB load operation and must be a valid YAML document.
The gpload program processes the control file document in order and uses indentation (spaces) to determine the document hierarchy and the relationships of the sections to one another.
The use of whitespace in a YAML document is significant. Whitespace should not be used for formatting purposes, and tabs should not be used at all.
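Because a stray tab silently breaks the document hierarchy, it can be worth scanning a control file before running gpload. A minimal sketch using only the Python standard library (the function name is made up):

```python
def find_tabs(control_file_text):
    """Return (line_number, line) pairs for lines that contain tabs,
    which must not appear in a gpload YAML control file."""
    return [(n, line) for n, line in
            enumerate(control_file_text.splitlines(), start=1)
            if "\t" in line]

# The third line below is indented with a tab instead of spaces.
text = "GPLOAD:\n   INPUT:\n\t- SOURCE:\n"
for n, line in find_tabs(text):
    print(f"line {n}: tab found")
# → line 3: tab found
```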
The structure of a load control file looks as follows. See Control file format for the complete description of keys.
---
VERSION: 1.0.0.1
DATABASE: <db_name>
USER: <db_username>
HOST: <master_hostname>
PORT: <master_port>
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - <hostname_or_ip>
         PORT: <http_port>
       | PORT_RANGE: [ <start_port_range>, <end_port_range> ]
         FILE:
           - </path/to/input_file>
         SSL: true | false
         CERTIFICATES_PATH: </path/to/certificates>
    - FULLY_QUALIFIED_DOMAIN_NAME: true | false
    - COLUMNS:
        - <field_name>: <data_type>
    - TRANSFORM: '<transformation>'
    - TRANSFORM_CONFIG: '<configuration-file-path>'
    - MAX_LINE_LENGTH: <integer>
    - FORMAT: text | csv
    - DELIMITER: '<delimiter_character>'
    - ESCAPE: '<escape_character>' | 'OFF'
    - NEWLINE: 'LF' | 'CR' | 'CRLF'
    - NULL_AS: '<null_string>'
    - FILL_MISSING_FIELDS: true | false
    - FORCE_NOT_NULL: <column_name> [, ...]
    - QUOTE: '<csv_quote_character>'
    - HEADER: true | false
    - ENCODING: <database_encoding>
    - ERROR_LIMIT: <integer>
    - LOG_ERRORS: true | false
   EXTERNAL:
    - SCHEMA: <schema> | '%'
   OUTPUT:
    - TABLE: <schema.table_name>
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
        - <target_column_name>
    - UPDATE_COLUMNS:
        - <target_column_name>
    - UPDATE_CONDITION: '<boolean_condition>'
    - MAPPING:
        <target_column_name>: <source_column_name> | '<expression>'
   PRELOAD:
    - TRUNCATE: true | false
    - REUSE_TABLES: true | false
    - STAGING_TABLE: <external_table_name>
    - FAST_MATCH: true | false
   SQL:
    - BEFORE: "<sql_command>"
    - AFTER: "<sql_command>"
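Most loads use only a small subset of these keys. As an illustration only (the database name, host, and paths are placeholders, and the rendering function is not part of gpload), a minimal INSERT-mode control file can be generated from a template like this:

```python
# Render a minimal gpload control file for an INSERT load.
# All values are placeholders; only the key layout follows the
# structure above. Indentation uses spaces, as YAML requires.
def render_control_file(database, user, host, port,
                        source_file, table, columns):
    column_lines = "\n".join(f"        - {name}: {dtype}"
                             for name, dtype in columns)
    return f"""---
VERSION: 1.0.0.1
DATABASE: {database}
USER: {user}
HOST: {host}
PORT: {port}
GPLOAD:
   INPUT:
    - SOURCE:
         FILE:
           - {source_file}
    - COLUMNS:
{column_lines}
    - FORMAT: csv
   OUTPUT:
    - TABLE: {table}
    - MODE: insert
"""

print(render_control_file("customers", "gpadmin", "mdw", 5432,
                          "/tmp/orders.csv", "orders",
                          [("id", "integer"), ("price", "numeric")]))
```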
The MERGE and UPDATE operations are not supported if the target table or column name is a reserved keyword, contains capital letters, or is enclosed in double quotes (" ").
If the database object names were created using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes (') in the gpload control file.
Consider the following table:
CREATE TABLE "MyTable" ("MyColumn" text);
In this case, the load control file must refer to these table and column names as follows:
...
    - COLUMNS:
        - '"MyColumn"': text
   OUTPUT:
    - TABLE: public.'"MyTable"'
...
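The quoting rule can be expressed as a small helper (hypothetical, for illustration): a name that is not a plain lower-case identifier must be wrapped in double quotes and then in single quotes for the control file.

```python
import re

def control_file_name(name):
    """Quote a table or column name for a gpload control file.
    Plain lower-case identifiers pass through unchanged; delimited
    identifiers (mixed case etc.) become '"Name"'. Reserved-keyword
    detection is omitted for brevity."""
    if re.fullmatch(r"[a-z_][a-z0-9_]*", name):
        return name
    return "'\"" + name + "\"'"

print(control_file_name("orders"))   # → orders
print(control_file_name("MyTable"))  # → '"MyTable"'
```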
Run gpload
1. Create a YAML-formatted load control file.

2. Run the gpload utility using the -f option to pass in the created load control file, for example:

   $ gpload -f example_load.yml
Examples
These examples illustrate loading data into a database table from an external file by using the INSERT, UPDATE, and MERGE methods.
INSERT method
This example demonstrates loading data into a database table by using the INSERT method.
1. Empty the orders table and populate it with data as follows:

   TRUNCATE TABLE orders;
   INSERT INTO orders (id, name, price)
   VALUES
       (1,'Laptop',999.99),
       (2,'Smartphone',499.99),
       (3,'Tablet',299.99);

2. Empty the auxiliary audit table:

   TRUNCATE TABLE audit;

3. In the /tmp directory on the master host, create the orders.csv file containing the source data to load:

   $ cat > orders.csv <<EOF
   id,name,price
   4,Monitor,599.99
   5,Keyboard,99.99
   EOF

4. On the master host, create the load.yml load control file:

   $ cat > load.yml <<EOF
   ---
   VERSION: 1.0.0.1
   DATABASE: customers
   USER: gpadmin
   HOST: mdw
   PORT: 5432
   GPLOAD:
      INPUT:
       - SOURCE:
            PORT: 8081
            FILE:
              - /tmp/orders.csv
       - COLUMNS:
           - id: INTEGER
           - name: VARCHAR
           - price: NUMERIC
       - FORMAT: CSV
       - HEADER: true
      OUTPUT:
       - TABLE: orders
       - MODE: INSERT
      SQL:
       - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
       - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
   EOF

5. Run the gpload utility, passing in the created load control file. This inserts new rows into the orders table. The audit table will contain timestamps marking the loading start and end.

   $ gpload -f load.yml

   The output should look as follows:

   INFO|gpload session started
   INFO|setting schema 'public' for table 'orders'
   INFO|started gpfdist -p 8000 -P 9000 -f "/tmp/orders.csv" -t 30
   INFO|running time: 0.10 seconds
   INFO|rows Inserted = 2
   INFO|rows Updated = 0
   INFO|data formatting errors = 0
   INFO|gpload succeeded

6. Query the orders table:

   SELECT * FROM orders;

   The output should look as follows, with rows 4 and 5 inserted into the table:

    id |    name    | price
   ----+------------+--------
     1 | Laptop     | 999.99
     2 | Smartphone | 499.99
     3 | Tablet     | 299.99
     4 | Monitor    | 599.99
     5 | Keyboard   |  99.99

7. Query the audit table:

   SELECT * FROM audit;

   The output should contain the timestamps marking the loading start and end:

    event |         timestamp
   -------+----------------------------
    start | 2025-05-25 12:45:47.931476
    end   | 2025-05-25 12:45:50.134581
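When gpload runs are scripted, the summary lines in its output are a convenient place to verify the result. A parsing sketch, assuming the INFO|-prefixed log format shown above (the function itself is not part of gpload):

```python
import re

def parse_gpload_summary(log_text):
    """Extract row counts from gpload's closing INFO lines."""
    stats = {}
    for key, pattern in (("inserted", r"rows Inserted\s*=\s*(\d+)"),
                         ("updated", r"rows Updated\s*=\s*(\d+)"),
                         ("errors", r"data formatting errors\s*=\s*(\d+)")):
        match = re.search(pattern, log_text)
        if match:
            stats[key] = int(match.group(1))
    return stats

log = """INFO|rows Inserted = 2
INFO|rows Updated = 0
INFO|data formatting errors = 0
INFO|gpload succeeded"""
print(parse_gpload_summary(log))
# → {'inserted': 2, 'updated': 0, 'errors': 0}
```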
UPDATE method
This example demonstrates loading data into a database table by using the UPDATE method.
1. Empty the orders table and populate it with data as follows:

   TRUNCATE TABLE orders;
   INSERT INTO orders (id, name, price)
   VALUES
       (1,'Laptop',999.99),
       (2,'Smartphone',499.99),
       (3,'Tablet',299.99),
       (4,'Monitor',599.99),
       (5,'Keyboard',99.99);

2. Empty the auxiliary audit table:

   TRUNCATE TABLE audit;

3. In the /tmp directory on the master host, create the orders.csv file containing the source data to load. Notice that the data contains rows that already exist in the orders table but with different values in the price column:

   $ cat > orders.csv <<EOF
   id,name,price
   1,Laptop,1299.99
   4,Monitor,899.99
   5,Keyboard,59.99
   EOF

4. On the master host, create the load.yml load control file:

   $ cat > load.yml <<EOF
   ---
   VERSION: 1.0.0.1
   DATABASE: customers
   USER: gpadmin
   HOST: mdw
   PORT: 5432
   GPLOAD:
      INPUT:
       - SOURCE:
            FILE:
              - /tmp/orders.csv
       - COLUMNS:
           - id: integer
           - name: text
           - price: numeric
       - FORMAT: CSV
       - HEADER: true
      OUTPUT:
       - TABLE: orders
       - MODE: UPDATE
       - MATCH_COLUMNS:
           - id
       - UPDATE_COLUMNS:
           - price
      SQL:
       - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
       - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
   EOF

5. Run the gpload utility, passing in the created load control file. This updates the values in the price column for the rows whose id values in the orders table match those in the orders.csv file. The audit table will contain timestamps marking the loading start and end.

   $ gpload -f load.yml

   The output should look as follows:

   INFO|gpload session started
   INFO|setting schema 'public' for table 'orders'
   INFO|started gpfdist -p 8000 -P 9000 -f "/tmp/orders.csv" -t 30
   INFO|running time: 0.13 seconds
   INFO|rows Inserted = 0
   INFO|rows Updated = 3
   INFO|data formatting errors = 0
   INFO|gpload succeeded

6. Query the orders table:

   SELECT * FROM orders;

   The output should look as follows, with rows 1, 4, and 5 updated:

    id |    name    |  price
   ----+------------+---------
     1 | Laptop     | 1299.99
     2 | Smartphone |  499.99
     3 | Tablet     |  299.99
     4 | Monitor    |  899.99
     5 | Keyboard   |   59.99

7. Query the audit table:

   SELECT * FROM audit;

   The output should contain the timestamps marking the loading start and end:

    event |         timestamp
   -------+----------------------------
    start | 2025-05-25 16:24:47.931476
    end   | 2025-05-25 16:24:50.134581
MERGE method
This example demonstrates loading data into a database table by using the MERGE method.
1. Empty the orders table and populate it with data as follows:

   TRUNCATE TABLE orders;
   INSERT INTO orders (id, name, price)
   VALUES
       (1,'Laptop',999.99),
       (2,'Smartphone',499.99),
       (3,'Tablet',299.99),
       (4,'Monitor',599.99),
       (5,'Keyboard',99.99);

2. Empty the auxiliary audit table:

   TRUNCATE TABLE audit;

3. In the /tmp directory on the master host, create the orders.csv file containing the source data to load. Notice that the data contains rows that already exist in the orders table but with different values in the price column (rows 2, 4, and 5). Rows 6 and 7 are new rows:

   $ cat > orders.csv <<EOF
   id,name,price
   1,Laptop,999.99
   2,Smartphone,799.99
   3,Tablet,299.99
   4,Monitor,849.99
   5,Keyboard,59.99
   6,Webcam,29.99
   7,Speakers,129.99
   EOF

4. On the master host, create the load.yml load control file:

   $ cat > load.yml <<EOF
   ---
   VERSION: 1.0.0.1
   DATABASE: customers
   USER: gpadmin
   HOST: mdw
   PORT: 5432
   GPLOAD:
      INPUT:
       - SOURCE:
            FILE:
              - /tmp/orders.csv
       - COLUMNS:
           - id: integer
           - name: text
           - price: numeric
       - FORMAT: CSV
       - HEADER: true
      OUTPUT:
       - TABLE: orders
       - MODE: MERGE
       - MATCH_COLUMNS:
           - id
       - UPDATE_COLUMNS:
           - price
      SQL:
       - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
       - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
   EOF

5. Run the gpload utility, passing in the created load control file. This updates the values in the price column for the rows whose id values in the orders table match those in the orders.csv file, and inserts the new rows. The audit table will contain timestamps marking the loading start and end.

   $ gpload -f load.yml

   The output should look as follows:

   INFO|gpload session started
   INFO|setting schema 'public' for table 'orders'
   INFO|started gpfdist -p 8000 -P 9000 -f "/tmp/orders.csv" -t 30
   INFO|running time: 0.14 seconds
   INFO|rows Inserted = 2
   INFO|rows Updated = 5
   INFO|data formatting errors = 0
   INFO|gpload succeeded

6. Query the orders table:

   SELECT * FROM orders;

   The output should look as follows, with rows 2, 4, and 5 updated and rows 6 and 7 inserted:

    id |    name    | price
   ----+------------+--------
     1 | Laptop     | 999.99
     2 | Smartphone | 799.99
     3 | Tablet     | 299.99
     4 | Monitor    | 849.99
     5 | Keyboard   |  59.99
     6 | Webcam     |  29.99
     7 | Speakers   | 129.99

7. Query the audit table:

   SELECT * FROM audit;

   The output should contain the timestamps marking the loading start and end:

    event |         timestamp
   -------+----------------------------
    start | 2025-05-25 15:24:47.931476
    end   | 2025-05-25 15:24:50.134581