Use gpfdist
The Greenplum Parallel File Server (gpfdist
) is the Greengage DB’s parallel file distribution utility.
When external data is served by gpfdist
, all Greengage DB segments can read or write external table data in parallel.
For readable external tables, the gpfdist
server reads data records from files in a specified directory, packs them into a block, and sends the block in a response to a Greengage DB segment’s request.
The segments unpack received rows and distribute them according to the external table’s distribution policy.
For writable external tables, segments send blocks of rows in a request to gpfdist
, and gpfdist
writes them to the external file.
The gpfdist
server can also be configured with a YAML-formatted file to transform external data files between a supported text format and another format such as XML or JSON.
To access external data served by gpfdist
, use the GPFDIST/GPFDISTS protocol in the LOCATION
clause of the CREATE EXTERNAL TABLE
command.
Learn more in Create external tables with GPFDIST / GPFDISTS.
Syntax overview
The gpfdist
command’s general syntax is as follows:
gpfdist [-d <directory>] [-p <http_port>] |-P <last_http_port>
[-l <log_file>] [-t <timeout>] [-S] [-w <time>]
[-v | -V] [-m <max_length>] [--ssl <certificate_path>]
[-k <seconds>] [--compress] [--multi_thread <number_of_threads>]
[-I <input_transformation_name>] [-O <output_transformation_name>]
[-c <config.yaml>]
gpfdist -? | --help
gpfdist --version
Option | Description |
---|---|
-d <directory> |
The directory from which If not specified, defaults to the directory where |
-p <http_port> |
The HTTP port on which The default port is |
-P <last_http_port> |
The ending port number in a range of HTTP port numbers ( |
-l <log_file> |
The full path or log file name to log standard output messages |
-t <timeout> |
The time allowed for Greengage DB to establish a connection with a The default timeout is 5 seconds |
-S |
Opens the file for synchronous I/O with the |
-w <time> |
The number of seconds before Greengage DB closes a target file, from 0 (no timeout) to 7200 seconds (2 hours). For a Greengage DB host with multiple segments, there might be a delay between segments when writing data from different segments to the file. This option is used to ensure all the data is written to the file before the file is closed |
-v |
Shows verbose output of the utility operation (the progress and status messages) |
-V |
Shows very verbose output of the utility operation (all output messages generated by |
-m <max_length> |
The maximum allowed data row length in bytes, The valid range is from 32 KB to 256 MB. On Windows systems, the upper limit is 1 MB |
--ssl <certificate_path> |
Adds SSL encryption to data transferred with If the option is provided, the data from this file server can only be loaded with the GPFDISTS protocol.
The root directory (/) cannot be specified as |
-k <seconds> |
The number of seconds that The default timeout is 300 seconds |
--compress |
Enables compression during data transfer.
When specified, |
--multi_threads <num_threads> |
Sets the maximum number of threads, up to 256, that |
-I <input_transformation_name> |
Sets one of the input transformations defined in the transformations configuration file ( Learn more in Load data with gpfdist |
-O <output_transformation_name> |
Sets one of the output transformations defined in the transformations configuration file ( Learn more in Load data with gpfdist |
-c <config.yaml> |
Specifies the YAML configuration file that defines the transformations applied when loading or unloading data.
A transformation configuration file can describe multiple transformations; you can set the default input or output transformation by using the Learn more in Transformation configuration file |
-? |
Displays the help information about |
--version |
Displays the |
Start gpfdist
The gpfdist
utility is located in the $GPHOME/bin directory of the Greengage DB master host and on each segment host.
To install gpfdist
on another host, copy it to the desired location on that host and add this location to PATH
.
To avoid performance impact on query execution, gpfdist
usually runs on ETL machines rather than the hosts where Greengage DB is installed.
For multiple gpfdist
instances on the same ETL host, use a different base directory and port for each instance.
gpfdist
and gpload
are compatible only with the Greengage DB major version in which they are shipped (for example, 6.x).
Configure SSL
To create external tables using the GPFDISTS protocol, you need to configure SSL communication between Greengage DB and the file server.
-
Generate a server certificate for the file server. You can follow the steps described in Generate certificates. The certificate can be stored in an arbitrary location on your file sever; this location is specified in the
--ssl
option when starting thegpfdist
utility. -
Generate a client certificate as described in Generate a client certificate. Store the following client certificates in the gpfdists subdirectory inside the data directory of each segment instance (for example, /data1/primary/gpseg0/gpfdists) on each segment host:
-
client.crt — the client certificate file;
-
client.key — the client private key file;
-
root.crt — the trusted certificate authorities file.
-
The server and client private keys (server.key and client.key) must not be protected with a passphrase.
Note that gpfdists
implements SSL security in a client/server scheme with the following attributes and limitations:
-
Client certificates are required.
-
Multilingual certificates are not supported.
-
A certificate revocation list (CRL) is not supported.
-
A minimum TLS version of 1.2 is required.
-
SSL renegotiation is supported.
The verify_gpfdists_cert
server configuration parameter controls whether SSL certificate authentication is enabled when Greengage DB communicates with the gpfdist
utility.
You can set this parameter to false
to deactivate authentication when testing the communication between Greengage DB and gpfdist
.
In this case, the following SSL exceptions are ignored:
-
The self-signed SSL certificate that is used by
gpfdist
is not trusted by Greengage DB. -
The host name contained in the SSL certificate does not match the host name that is running
gpfdist
.
Deactivating SSL certificate authentication exposes a security risk by not validating the gpfdist
SSL certificate.
Examples
The following example commands illustrate several common gpfdist
launch scenarios.
-
Starting
gpfdist
in background (&
), listening on the default port8080
, and serving files from the current directory:$ gpfdist &
-
Starting
gpfdist
in background (&
), listening on port8801
, and serving files from the /home/gpadmin/external_files directory:$ gpfdist -p 8801 -d /home/gpadmin/external_files &
-
Starting
gpfdist
in background (&
), listening on the default port8080
, serving files from the current directory, and redirecting output and errors to the gpfdist_log file:$ gpfdist -l /home/gpadmin/gpfdist_log &
-
Starting
gpfdist
in SSL mode with certificates stored in /home/gpadmin/certs, listening on port8811
, and serving files from the /home/gpadmin/external_files directory:$ gpfdist --ssl /home/gpadmin/certs -p 8811 -d /home/gpadmin/external_files
-
Starting
gpfdist
with multithreaded data transfer using four threads, listening on the default port8080
, and serving files from the current directory:$ gpfdist --multi_thread 4
The commands' output should look similar to the following:
INFO Before opening listening sockets - following listening sockets are available: INFO IPV6 socket: [::]:8811 INFO IPV4 socket: 0.0.0.0:8811 INFO Trying to open listening socket: INFO IPV6 socket: [::]:8811 INFO Opening listening socket succeeded INFO Trying to open listening socket: INFO IPV4 socket: 0.0.0.0:8811 INFO Opening listening socket succeeded Serving HTTP on port 8811, directory /home/gpadmin/external_files
INFO Before opening listening sockets - following listening sockets are available: INFO IPV6 socket: [::]:8811 INFO IPV4 socket: 0.0.0.0:8811 INFO Trying to open listening socket: INFO IPV6 socket: [::]:8811 INFO Opening listening socket succeeded INFO Trying to open listening socket: INFO IPV4 socket: 0.0.0.0:8811 INFO Opening listening socket succeeded Serving HTTPS on port 8811, directory /home/gpadmin/external_files
Stop gpfdist
To stop gpfdist
when it is running in the background:
-
Locate the
gpfdist
process ID:$ ps -ef | grep gpfdist
The output should look as follows:
gpadmin 12687 1 0 Apr07 ? gpfdist --ssl /home/gpadmin/certs -p 8811 -d /tmp gpadmin 41509 41470 0 13:00 pts/3 gpfdist gpadmin 41521 41510 0 13:01 pts/3 grep --color=auto gpfdist
-
Stop the corresponding process. In this example, the process ID is
41509
:$ kill 41509
Create external tables with GPFDIST / GPFDISTS
To create an external table using the GPFDIST/GPFDISTS protocol, provide the LOCATION
and FORMAT
clauses inside the CREATE EXTERNAL TABLE command as follows:
CREATE [[READABLE] | WRITABLE] EXTERNAL TABLE <table_name> ( ... )
LOCATION ('gpfdist://<host>[:<port>]/<path>/<file>')
FORMAT 'TEXT' | 'CSV' | 'CUSTOM';
CREATE [[READABLE] | WRITABLE] EXTERNAL TABLE <table_name> ( ... )
LOCATION ('gpfdists://<host>[:<port>]/<path>/<file>')
FORMAT 'TEXT' | 'CSV' | 'CUSTOM';
In the LOCATION
clause, provide a URI specifying the target files location: the file host name (<host>
), port (<port>
), and the path to the file relative to the directory where gpfdist
is started (<path>
/<file>
).
You can provide multiple URIs if multiple gpfdist
data sources are used.
To denote multiple files to read, use the wildcard characters such as *
or []
.
The gpfdist
utility automatically uncompresses the .gz, .bz2, and .zst files when reading data and compresses them when writing data.
Compression is not supported when running gpfdist
on Windows platforms.
In the FORMAT
clause, define how your data is formatted.
Examples
These examples illustrate creating external tables via the GPFDIST / GPFDISTS protocols.
Readable table with GPFDIST
This example creates the customers
readable table, whose data is loaded from two files residing in the /tmp directory on the fs file server.
-
In the /tmp directory on the fs host, create two source data files:
-
The customers.csv.gz compressed via
gzip
:#!/bin/bash cat > customers.csv <<EOF id,first_name,last_name,email,address 1,John,Doe,john.doe@example.com,123 Elm Street 2,Jane,Smith,jane.smith@example.com,456 Oak Street EOF gzip customers.csv
-
The regular customers.csv file:
#!/bin/bash cat > /tmp/customers.csv <<EOF id,first_name,last_name,email,address 3,Bob,Brown,bob.brown@example.com,789 Pine Street 4,Rob,Stuart,rob.stuart@example.com,119 Willow Street EOF
-
-
Start
gpfdist
in the /tmp directory of the fs host:$ gpfdist -p 8811 -d /tmp/ &
-
On the master host, run the
CREATE EXTERNAL TABLE
command. In theLOCATION
clause, provide the host name of the file server (fs
) and use the*
wildcard to denote two files to read. In theFORMAT
clause, setCSV
as the data format and specify that the files contain a header row (HEADER
):CREATE EXTERNAL TABLE customers ( id INTEGER, first_name VARCHAR(50), last_name VARCHAR(50), email VARCHAR(100), address VARCHAR(255) ) LOCATION ('gpfdist://fs:8811/customers.csv*') FORMAT 'CSV' (HEADER);
-
Query the created external table:
SELECT * FROM customers;
The output should look as follows:
id | first_name | last_name | email | address ----+------------+-----------+------------------------+------------------- 1 | John | Doe | john.doe@example.com | 123 Elm Street 2 | Jane | Smith | jane.smith@example.com | 456 Oak Street 3 | Bob | Brown | bob.brown@example.com | 789 Pine Street 4 | Rob | Stuart | rob.stuart@example.com | 119 Willow Street
Writable table with GPFDISTS
This example creates the orders
writable table, whose data is unloaded to the orders.csv file.
The file will be created automatically in the /tmp directory on the fs file server.
-
Start
gpfdist
on the fs server from the /tmp directory, passing--ssl ~/certs
to enable communication over GPFDISTS:$ gpfdist --ssl ~/certs -p 8811 -d /tmp/ &
-
On the master host, run the
CREATE WRITABLE EXTERNAL TABLE
command. In theLOCATION
clause, provide the host name of the file server (fs
). In theFORMAT
clause, setCSV
as the data format for the created orders.csv file:CREATE WRITABLE EXTERNAL TABLE orders ( id INTEGER, total NUMERIC(10,2), status VARCHAR(50) ) LOCATION ('gpfdists://fs:8811/orders.csv') FORMAT 'CSV';
-
Populate the created table with sample data:
INSERT INTO orders (customer_id, total, status) VALUES (1, 999.99, 'Pending'), (2, 499.99, 'Shipped'), (3, 299.99, 'Delivered');
-
Review the created orders.csv file:
$ cat /tmp/orders.csv
The file’s content should look as follows:
1,999.99,Pending 2,499.99,Shipped 3,299.99,Delivered
Performance tips
You can run gpfdist
instances on multiple hosts as well as run multiple gpfdist
instances on each host.
This lets you deploy gpfdist
servers strategically, utilize all the available network bandwidth and Greengage DB’s parallelism, and thus achieve fast data load and unload rates.
-
Allow network traffic to use all ETL host network interfaces simultaneously. Run one
gpfdist
instance for each interface on the ETL host and use the host name of each network interface card (NIC) in theLOCATION
clause of theCREATE EXTERNAL TABLE
command. -
Divide external table data equally among multiple
gpfdist
instances on the ETL host. For example, on an ETL system with two NICs, run twogpfdist
instances (one on each NIC) to optimize data load performance and divide the external table data files evenly between the twogpfdist
servers. -
Use pipes (
|
) to separate formatted text in source data files. Greengage DB encloses comma-separated text strings in single or double quotes.gpfdist
, in turn, has to remove the quotes to parse the strings. Using pipes as field separators avoids the extra step and improves performance. -
If compression is enabled (that is, the
compress
option is provided),gpfdist
transmits a larger amount of data while maintaining low network usage. Compression, though, is cpu-intensive and might potentially impose additional load on the cluster. -
If multithreaded execution is enabled (that is, the
multi_threads
option is provided), the overall time required for compression may decrease. This increases transmission speed while maintaining low network load. -
Use the
gp_external_max_segs
server configuration parameter in the postgresql.conf file of your master instance to control the number of segment instances that can access a singlegpfdist
instance simultaneously (64 by default). You can appoint several segments to process external data files and use the remaining ones to perform other database processing tasks.