gpfdist
Serves data files to or writes data files out from Greengage DB segments. See Use gpfdist for more details and usage examples.
Synopsis
gpfdist [ -d <directory> ]
[ -p <http_port> ]
[ -P <last_http_port> ]
[ -l <log_file> ]
[ -t <timeout> ]
[ -k <clean_up_timeout> ]
[ -S ]
[ -w <time> ]
[ -v | -V ]
[ -s ]
[ -m <max_length> ]
[ --ssl <certificate_path> ]
[ --compress ]
[ --multi_thread <num_threads> ]
[ -I <input_transformation_name> ]
[ -O <output_transformation_name> ]
[ -c <config.yaml> ]
gpfdist -? | --help
gpfdist --version
Description
gpfdist is a Greengage DB parallel file distribution program.
It is used by readable external tables and gpload to serve external table files to all Greengage DB segments in parallel.
It is used by writable external tables to accept output streams from Greengage DB segments in parallel and write them out to a file.
gpfdist and gpload are compatible only with the Greengage DB major version in which they are shipped.
For example, a gpfdist utility that is installed with Greengage DB 6.x cannot be used with Greengage DB 7.x.
In order for gpfdist to be used by an external table, the LOCATION clause of the external table definition must specify the external table data using the gpfdist:// protocol.
If the --ssl option is specified to enable SSL security, create the external table with the gpfdists:// protocol.
gpfdist reads from and writes to external tables with maximum parallelism across the segments, which improves performance and simplifies the administration of external tables.
For readable external tables, gpfdist parses and serves data files evenly to all the segment instances in the Greengage DB system when users SELECT from the external table.
For writable external tables, gpfdist accepts parallel output streams from the segments when users INSERT into the external table, and writes to an output file.
When gpfdist reads data and encounters a data formatting error, the error message includes a row number indicating the location of the formatting error.
gpfdist attempts to capture the row that contains the error.
However, gpfdist might not capture the exact row for some formatting errors.
For readable external tables, if the load files are compressed using gzip, bzip2, or zstd (that is, they have a .gz, .bz2, or .zst file extension), gpfdist decompresses the data on the fly while loading it.
For writable external tables, gpfdist compresses the data using gzip if the target file has a .gz extension, bzip2 if the target file has a .bz2 extension, or zstd if the target file has a .zst extension.
Compression is not supported for readable and writable external tables when the gpfdist utility runs on Windows platforms.
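The extension rule above can be summarized as a small lookup. This is only an illustrative sketch of the mapping described here, not gpfdist's implementation; the function name codec_for_file and the sample file names are hypothetical.

```shell
# Illustrative only: map a file name to the codec that gpfdist is
# described as using for that extension. codec_for_file is a
# hypothetical helper, not part of gpfdist.
codec_for_file() {
  case "$1" in
    *.gz)  echo gzip ;;
    *.bz2) echo bzip2 ;;
    *.zst) echo zstd ;;
    *)     echo none ;;   # any other extension: no compression
  esac
}

codec_for_file sales_2024.csv.gz   # prints: gzip
codec_for_file sales_2024.csv      # prints: none
```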
When reading or writing data with the gpfdist or gpfdists protocol, Greengage DB includes X-GP-PROTO in the HTTP request header to indicate that the request is from Greengage DB.
The utility rejects HTTP requests that do not include X-GP-PROTO in the request header.
In most cases, run gpfdist on your ETL hosts rather than on the hosts where Greengage DB is installed.
To install gpfdist on another host, copy the utility over to that host and add gpfdist to your PATH.
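A minimal sketch of that setup on an ETL host follows. The install directory, the data directory, the port, and the log path are all assumed values for illustration; only options listed in the Synopsis are used.

```shell
# Assumed locations; adjust for your host.
GPFDIST_BIN_DIR=/opt/gpfdist/bin      # where the gpfdist binary was copied
DATA_DIR=/var/load_files              # directory gpfdist serves files from
LOG_FILE=/tmp/gpfdist.log             # where gpfdist writes its log

# Make gpfdist resolvable by name.
PATH="$GPFDIST_BIN_DIR:$PATH"
export PATH

if command -v gpfdist >/dev/null 2>&1; then
  gpfdist --version
  # Serve DATA_DIR on port 8081 in the background.
  gpfdist -d "$DATA_DIR" -p 8081 -l "$LOG_FILE" &
else
  echo "gpfdist not found on PATH" >&2
fi
```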
When using IPv6, always enclose the numeric IP address in brackets.
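As an illustration, the helper below assembles a location URL and adds the brackets when the host is a numeric IPv6 address. The function name gpfdist_location and the sample host names are hypothetical; the scheme://host:port/path layout is the usual form of a gpfdist LOCATION value.

```shell
# Hypothetical helper: build a LOCATION URL for an external table.
# Usage: gpfdist_location <scheme> <host> <port> <file>
gpfdist_location() {
  scheme=$1
  host=$2
  port=$3
  file=$4
  case "$host" in
    \[*) ;;                    # already bracketed, leave as-is
    *:*) host="[$host]" ;;     # a colon means a numeric IPv6 address
  esac
  printf '%s://%s:%s/%s\n' "$scheme" "$host" "$port" "$file"
}

gpfdist_location gpfdist  etl1.example.com 8081 sales.csv
# prints: gpfdist://etl1.example.com:8081/sales.csv
gpfdist_location gpfdists 2001:db8::10 8081 sales.csv
# prints: gpfdists://[2001:db8::10]:8081/sales.csv
```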
Options
- -d <directory>
  The directory from which gpfdist will serve files for readable external tables or create output files for writable external tables. If not specified, defaults to the directory where gpfdist is started.
- -p <http_port>
  The HTTP port on which gpfdist will serve files. Defaults to 8080.
- -P <last_http_port>
  The last port number in a range of HTTP port numbers (http_port to last_http_port, inclusive) on which gpfdist will attempt to serve files. gpfdist serves the files on the first port number in the range to which it successfully binds.
- -l <log_file>
  The fully qualified path and log file name where standard output messages are to be logged.
- -t <timeout>
  The time allowed for Greengage DB to establish a connection to a gpfdist process. Default is 5 seconds. Allowed values are 2 to 7200 seconds (2 hours). May need to be increased on systems with a lot of network traffic.
- -k <clean_up_timeout>
  The number of seconds that gpfdist waits before cleaning up the session when there are no POST requests from the segments. Default is 300. Allowed values are 300 to 86400. You may increase its value when experiencing heavy network traffic.
- -m <max_length>
  The maximum allowed data row length in bytes. Default is 32768. Should be used when user data includes very wide rows (or when the line too long error message occurs). Should not be used otherwise as it increases resource allocation. Valid range is 32 KB to 256 MB. The upper limit is 1 MB on Windows systems.
  NOTE: Memory issues might occur if you specify a large maximum row length and run a large number of concurrent gpfdist connections. For example, setting this value to the maximum of 256 MB with 96 concurrent gpfdist processes requires approximately 24 GB of memory ((96 + 1) x 256 MB).
- -s
  Enable simplified logging. When this option is specified, only messages with WARN level and higher are written to the gpfdist log file. INFO level messages are not written to the log file. If this option is not specified, all gpfdist messages are written to the log file.
  You can specify this option to reduce the information written to the log file.
- -S
  Open the file for synchronous I/O with the O_SYNC flag. Any writes to the resulting file descriptor block gpfdist until the data is physically written to the underlying hardware.
- -w <time>
  Set the number of seconds that Greengage DB delays before closing a target file such as a named pipe. The default value is 0, no delay. The maximum value is 7200 seconds (2 hours).
  For a Greengage DB with multiple segments, there might be a delay between segments when writing data from different segments to the file. You can specify a time to wait before Greengage DB closes the file to ensure all the data is written to the file.
- --ssl <certificate_path>
  Add SSL encryption to data transferred with gpfdist. After running gpfdist with the --ssl certificate_path option, the only way to load data from this file server is with the gpfdists:// protocol.
  The location specified in certificate_path must contain the following files:
  - The server certificate file (server.crt).
  - The server private key file (server.key).
  - The trusted certificate authorities (root.crt).
  The root directory (/) cannot be specified as certificate_path.
  For details on creating external tables with GPFDISTS, see Create external tables with GPFDIST / GPFDISTS.
- --compress
  Enable compression during data transfer. When specified, gpfdist utilizes the Zstandard (zstd) compression algorithm. This option is not available on Windows platforms.
- --multi_thread <num_threads>
  Set the maximum number of threads that gpfdist uses during data transfer, parallelizing the operation. When specified, gpfdist automatically compresses the data (also parallelized) before transferring. gpfdist supports a maximum of 256 threads. This option is not available on Windows platforms.
- -I <input_transformation_name>
  Set one of the input transformations defined in the transformations configuration file (-c) as default. The transformation is applied to all read files. Not used if transformation is set at the table level via #transform. No transformation is applied by default.
  Learn more in Load data with gpfdist.
- -O <output_transformation_name>
  Set one of the output transformations defined in the transformations configuration file (-c) as default. The transformation is applied to all written files. Not used if transformation is set at the table level via #transform. No transformation is applied by default.
  Learn more in Load data with gpfdist.
- -c <config.yaml>
  Specify rules that gpfdist uses to select a transform to apply when loading or extracting data. The gpfdist configuration file is a YAML 1.1 document.
  A transformation configuration file can describe multiple transformations; you can set the default input or output transformation by using the -I and -O gpfdist options, respectively. The -c option is not available on Windows platforms.
  Learn more in Transformation configuration file.
- -v
  Shows verbose output of the utility operation (the progress and status messages).
- -V
  Shows very verbose output of the utility operation (all output messages generated by gpfdist).
- -?
  Display help.
- --version
  Display the version of this utility.
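The memory note under -m can be checked with shell arithmetic. This sketch assumes the sizing rule quoted there, (connections + 1) x maximum row length; the function name is hypothetical, and the result is a rough estimate rather than a measurement.

```shell
# Hypothetical helper: estimate gpfdist buffer memory in MB, using the
# (connections + 1) x max_length rule from the -m note.
estimate_gpfdist_mem_mb() {
  connections=$1
  max_length_mb=$2
  echo $(( (connections + 1) * max_length_mb ))
}

estimate_gpfdist_mem_mb 96 256   # prints 24832 (MB), about 24 GB

# -m takes a value in bytes, so 256 MB is passed as:
echo $(( 256 * 1024 * 1024 ))    # prints 268435456
```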
Notes
The server configuration parameter verify_gpfdists_cert controls whether SSL certificate authentication is enabled when Greengage DB communicates with the gpfdist utility to either read data from or write data to an external data source.
You can set the parameter value to false to deactivate authentication when testing the communication between the Greengage DB external table and the gpfdist utility that is serving the external data.
If the value is false, these SSL exceptions are ignored:
- The self-signed SSL certificate that is used by gpfdist is not trusted by Greengage DB.
- The host name contained in the SSL certificate does not match the name of the host that is running gpfdist.
Deactivating SSL certificate authentication exposes a security risk by not validating the gpfdists SSL certificate.
You can set the server configuration parameter gpfdist_retry_timeout to control the time that Greengage DB waits before returning an error when a gpfdist server does not respond while Greengage DB is attempting to write data to gpfdist.
The default is 300 seconds.
If the gpfdist utility hangs with no read or write activity occurring, you can generate a core dump the next time a hang occurs to help debug the issue.
Set the environment variable GPFDIST_WATCHDOG_TIMER to the number of seconds of no activity to wait before gpfdist is forced to exit.
When the environment variable is set and gpfdist hangs, the utility is stopped after the specified number of seconds, creates a core dump, and sends relevant information to the log file.
This example sets the environment variable on a Linux system so that gpfdist exits after 300 seconds of no activity:
$ export GPFDIST_WATCHDOG_TIMER=300
When you enable compression, gpfdist can transfer a larger amount of data while keeping network usage low.
Note that compression can be time-consuming and may reduce transmission speed.
With multithreaded execution, the overall time required for compression may decrease, which allows faster data transmission while network usage stays low.
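The two ways of turning compression on can be sketched as follows, using only options from the Synopsis. The directory, port, and thread count are assumed values, and the commands are printed rather than run so that the snippet also works on a host without gpfdist.

```shell
# Assumed values for illustration.
DATA_DIR=/var/load_files
PORT=8081
THREADS=8

# zstd compression during transfer.
compress_cmd="gpfdist -d $DATA_DIR -p $PORT --compress"

# Multithreaded transfer; per the option description, compression is
# then applied automatically, so --compress is not added here.
multi_thread_cmd="gpfdist -d $DATA_DIR -p $PORT --multi_thread $THREADS"

echo "$compress_cmd"
echo "$multi_thread_cmd"
```

Comparing load times with each variant on representative data is the most reliable way to decide which one suits a given network and CPU budget.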