Use gpfdist
The Greenplum Parallel File Server (gpfdist) is the Greengage DB’s parallel file distribution utility.
When external data is served by gpfdist, all Greengage DB segments can read or write external table data in parallel.
For readable external tables, the gpfdist server reads data records from files in a specified directory, packs them into a block, and sends the block in a response to a Greengage DB segment’s request.
The segments unpack received rows and distribute them according to the external table’s distribution policy.
For writable external tables, segments send blocks of rows in a request to gpfdist, and gpfdist writes them to the external file.
The gpfdist server can also be configured with a YAML-formatted file to transform external data files between a supported text format and another format such as XML or JSON.
To access external data served by gpfdist, use the GPFDIST/GPFDISTS protocol in the LOCATION clause of the CREATE EXTERNAL TABLE command.
Learn more in Create external tables with GPFDIST / GPFDISTS.
Syntax overview
The gpfdist command’s general syntax is as follows:
gpfdist [-d <directory>] [-p <http_port>] |-P <last_http_port>
[-l <log_file>] [-t <timeout>] [-S] [-w <time>]
[-v | -V] [-m <max_length>] [--ssl <certificate_path>]
[-k <seconds>] [--compress] [--multi_thread <number_of_threads>]
[-I <input_transformation_name>] [-O <output_transformation_name>]
[-c <config.yaml>]
gpfdist -? | --help
gpfdist --version
| Option | Description |
|---|---|
-d <directory> |
The directory from which If not specified, defaults to the directory where |
-p <http_port> |
The HTTP port on which The default port is |
-P <last_http_port> |
The ending port number in a range of HTTP port numbers ( |
-l <log_file> |
The full path or log file name to log standard output messages |
-t <timeout> |
The time allowed for Greengage DB to establish a connection with a The default timeout is 5 seconds |
-S |
Opens the file for synchronous I/O with the |
-w <time> |
The number of seconds before Greengage DB closes a target file, from 0 (no timeout) to 7200 seconds (2 hours). For a Greengage DB host with multiple segments, there might be a delay between segments when writing data from different segments to the file. This option is used to ensure all the data is written to the file before the file is closed |
-v |
Shows verbose output of the utility operation (the progress and status messages) |
-V |
Shows very verbose output of the utility operation (all output messages generated by |
-m <max_length> |
The maximum allowed data row length in bytes, The valid range is from 32 KB to 256 MB. On Windows systems, the upper limit is 1 MB |
--ssl <certificate_path> |
Adds SSL encryption to data transferred with If the option is provided, the data from this file server can only be loaded with the GPFDISTS protocol.
The root directory (/) cannot be specified as |
-k <seconds> |
The number of seconds that The default timeout is 300 seconds |
--compress |
Enables compression during data transfer.
When specified, |
--multi_threads <num_threads> |
Sets the maximum number of threads, up to 256, that |
-I <input_transformation_name> |
Sets one of the input transformations defined in the transformations configuration file ( Learn more in Load data with gpfdist |
-O <output_transformation_name> |
Sets one of the output transformations defined in the transformations configuration file ( Learn more in Load data with gpfdist |
-c <config.yaml> |
Specifies the YAML configuration file that defines the transformations applied when loading or unloading data.
A transformation configuration file can describe multiple transformations; you can set the default input or output transformation by using the Learn more in Transformation configuration file |
-? |
Displays the help information about |
--version |
Displays the |
Start gpfdist
The gpfdist utility is located in the $GPHOME/bin directory of the Greengage DB master host and on each segment host.
To install gpfdist on another host, copy it to the desired location on that host and add this location to PATH.
To avoid performance impact on query execution, gpfdist usually runs on ETL machines rather than the hosts where Greengage DB is installed.
For multiple gpfdist instances on the same ETL host, use a different base directory and port for each instance.
gpfdist and gpload are compatible only with the Greengage DB major version in which they are shipped (for example, 6.x).
Configure SSL
To create external tables using the GPFDISTS protocol, you need to configure SSL communication between Greengage DB and the file server.
-
Generate a server certificate for the file server. You can follow the steps described in Generate certificates. The certificate can be stored in an arbitrary location on your file sever; this location is specified in the
--ssloption when starting thegpfdistutility. -
Generate a client certificate as described in Generate a client certificate. Store the following client certificates in the gpfdists subdirectory inside the data directory of each segment instance (for example, /data1/primary/gpseg0/gpfdists) on each segment host:
-
client.crt — the client certificate file;
-
client.key — the client private key file;
-
root.crt — the trusted certificate authorities file.
-
The server and client private keys (server.key and client.key) must not be protected with a passphrase.
Note that gpfdists implements SSL security in a client/server scheme with the following attributes and limitations:
-
Client certificates are required.
-
Multilingual certificates are not supported.
-
A certificate revocation list (CRL) is not supported.
-
A minimum TLS version of 1.2 is required.
-
SSL renegotiation is supported.
The verify_gpfdists_cert server configuration parameter controls whether SSL certificate authentication is enabled when Greengage DB communicates with the gpfdist utility.
You can set this parameter to false to deactivate authentication when testing the communication between Greengage DB and gpfdist.
In this case, the following SSL exceptions are ignored:
-
The self-signed SSL certificate that is used by
gpfdistis not trusted by Greengage DB. -
The host name contained in the SSL certificate does not match the host name that is running
gpfdist.
Deactivating SSL certificate authentication exposes a security risk by not validating the gpfdist SSL certificate.
Examples
The following example commands illustrate several common gpfdist launch scenarios.
-
Starting
gpfdistin background (&), listening on the default port8080, and serving files from the current directory:$ gpfdist & -
Starting
gpfdistin background (&), listening on port8801, and serving files from the /home/gpadmin/external_files directory:$ gpfdist -p 8801 -d /home/gpadmin/external_files & -
Starting
gpfdistin background (&), listening on the default port8080, serving files from the current directory, and redirecting output and errors to the gpfdist_log file:$ gpfdist -l /home/gpadmin/gpfdist_log & -
Starting
gpfdistin SSL mode with certificates stored in /home/gpadmin/certs, listening on port8811, and serving files from the /home/gpadmin/external_files directory:$ gpfdist --ssl /home/gpadmin/certs -p 8811 -d /home/gpadmin/external_files -
Starting
gpfdistwith multithreaded data transfer using four threads, listening on the default port8080, and serving files from the current directory:$ gpfdist --multi_thread 4
The commands' output should look similar to the following:
INFO Before opening listening sockets - following listening sockets are available: INFO IPV6 socket: [::]:8811 INFO IPV4 socket: 0.0.0.0:8811 INFO Trying to open listening socket: INFO IPV6 socket: [::]:8811 INFO Opening listening socket succeeded INFO Trying to open listening socket: INFO IPV4 socket: 0.0.0.0:8811 INFO Opening listening socket succeeded Serving HTTP on port 8811, directory /home/gpadmin/external_files
INFO Before opening listening sockets - following listening sockets are available: INFO IPV6 socket: [::]:8811 INFO IPV4 socket: 0.0.0.0:8811 INFO Trying to open listening socket: INFO IPV6 socket: [::]:8811 INFO Opening listening socket succeeded INFO Trying to open listening socket: INFO IPV4 socket: 0.0.0.0:8811 INFO Opening listening socket succeeded Serving HTTPS on port 8811, directory /home/gpadmin/external_files
Stop gpfdist
To stop gpfdist when it is running in the background:
-
Locate the
gpfdistprocess ID:$ ps -ef | grep gpfdistThe output should look as follows:
gpadmin 12687 1 0 Apr07 ? gpfdist --ssl /home/gpadmin/certs -p 8811 -d /tmp gpadmin 41509 41470 0 13:00 pts/3 gpfdist gpadmin 41521 41510 0 13:01 pts/3 grep --color=auto gpfdist
-
Stop the corresponding process. In this example, the process ID is
41509:$ kill 41509
Create external tables with GPFDIST / GPFDISTS
To create an external table using the GPFDIST/GPFDISTS protocol, provide the LOCATION and FORMAT clauses inside the CREATE EXTERNAL TABLE command as follows:
CREATE [[READABLE] | WRITABLE] EXTERNAL TABLE <table_name> ( ... )
LOCATION ('gpfdist://<host>[:<port>]/<path>/<file>')
FORMAT 'TEXT' | 'CSV' | 'CUSTOM';
CREATE [[READABLE] | WRITABLE] EXTERNAL TABLE <table_name> ( ... )
LOCATION ('gpfdists://<host>[:<port>]/<path>/<file>')
FORMAT 'TEXT' | 'CSV' | 'CUSTOM';
In the LOCATION clause, provide a URI specifying the target files location: the file host name (<host>), port (<port>), and the path to the file relative to the directory where gpfdist is started (<path>/<file>).
You can provide multiple URIs if multiple gpfdist data sources are used.
To denote multiple files to read, use the wildcard characters such as * or [].
The gpfdist utility automatically uncompresses the .gz, .bz2, and .zst files when reading data and compresses them when writing data.
Compression is not supported when running gpfdist on Windows platforms.
In the FORMAT clause, define how your data is formatted.
Examples
These examples illustrate creating external tables via the GPFDIST / GPFDISTS protocols.
Readable table with GPFDIST
This example creates the customers readable table, whose data is loaded from two files residing in the /tmp directory on the fs file server.
-
In the /tmp directory on the fs host, create two source data files:
-
The customers.csv.gz compressed via
gzip:#!/bin/bash cat > customers.csv <<EOF id,first_name,last_name,email,address 1,John,Doe,john.doe@example.com,123 Elm Street 2,Jane,Smith,jane.smith@example.com,456 Oak Street EOF gzip customers.csv -
The regular customers.csv file:
#!/bin/bash cat > /tmp/customers.csv <<EOF id,first_name,last_name,email,address 3,Bob,Brown,bob.brown@example.com,789 Pine Street 4,Rob,Stuart,rob.stuart@example.com,119 Willow Street EOF
-
-
Start
gpfdistin the /tmp directory of the fs host:$ gpfdist -p 8811 -d /tmp/ & -
On the master host, run the
CREATE EXTERNAL TABLEcommand. In theLOCATIONclause, provide the host name of the file server (fs) and use the*wildcard to denote two files to read. In theFORMATclause, setCSVas the data format and specify that the files contain a header row (HEADER):CREATE EXTERNAL TABLE customers ( id INTEGER, first_name VARCHAR(50), last_name VARCHAR(50), email VARCHAR(100), address VARCHAR(255) ) LOCATION ('gpfdist://fs:8811/customers.csv*') FORMAT 'CSV' (HEADER); -
Query the created external table:
SELECT * FROM customers;The output should look as follows:
id | first_name | last_name | email | address ----+------------+-----------+------------------------+------------------- 1 | John | Doe | john.doe@example.com | 123 Elm Street 2 | Jane | Smith | jane.smith@example.com | 456 Oak Street 3 | Bob | Brown | bob.brown@example.com | 789 Pine Street 4 | Rob | Stuart | rob.stuart@example.com | 119 Willow Street
Writable table with GPFDISTS
This example creates the orders writable table, whose data is unloaded to the orders.csv file.
The file will be created automatically in the /tmp directory on the fs file server.
-
Start
gpfdiston the fs server from the /tmp directory, passing--ssl ~/certsto enable communication over GPFDISTS:$ gpfdist --ssl ~/certs -p 8811 -d /tmp/ & -
On the master host, run the
CREATE WRITABLE EXTERNAL TABLEcommand. In theLOCATIONclause, provide the host name of the file server (fs). In theFORMATclause, setCSVas the data format for the created orders.csv file:CREATE WRITABLE EXTERNAL TABLE orders ( id INTEGER, total NUMERIC(10,2), status VARCHAR(50) ) LOCATION ('gpfdists://fs:8811/orders.csv') FORMAT 'CSV'; -
Populate the created table with sample data:
INSERT INTO orders (customer_id, total, status) VALUES (1, 999.99, 'Pending'), (2, 499.99, 'Shipped'), (3, 299.99, 'Delivered'); -
Review the created orders.csv file:
$ cat /tmp/orders.csvThe file’s content should look as follows:
1,999.99,Pending 2,499.99,Shipped 3,299.99,Delivered
Performance tips
You can run gpfdist instances on multiple hosts as well as run multiple gpfdist instances on each host.
This lets you deploy gpfdist servers strategically, utilize all the available network bandwidth and Greengage DB’s parallelism, and thus achieve fast data load and unload rates.
-
Allow network traffic to use all ETL host network interfaces simultaneously. Run one
gpfdistinstance for each interface on the ETL host and use the host name of each network interface card (NIC) in theLOCATIONclause of theCREATE EXTERNAL TABLEcommand. -
Divide external table data equally among multiple
gpfdistinstances on the ETL host. For example, on an ETL system with two NICs, run twogpfdistinstances (one on each NIC) to optimize data load performance and divide the external table data files evenly between the twogpfdistservers. -
Use pipes (
|) to separate formatted text in source data files. Greengage DB encloses comma-separated text strings in single or double quotes.gpfdist, in turn, has to remove the quotes to parse the strings. Using pipes as field separators avoids the extra step and improves performance. -
If compression is enabled (that is, the
compressoption is provided),gpfdisttransmits a larger amount of data while maintaining low network usage. Compression, though, is cpu-intensive and might potentially impose additional load on the cluster. -
If multithreaded execution is enabled (that is, the
multi_threadsoption is provided), the overall time required for compression may decrease. This increases transmission speed while maintaining low network load. -
Use the
gp_external_max_segsserver configuration parameter in the postgresql.conf file of your master instance to control the number of segment instances that can access a singlegpfdistinstance simultaneously (64 by default). You can appoint several segments to process external data files and use the remaining ones to perform other database processing tasks.