Hello, I’m DocuDroid!
Submitting feedback
Thank you for rating our AI Search!
We would be grateful if you could share your thoughts so we can improve our AI Search for you and other readers.
GitHub

Use gpfdist

Anton Monakov

The Greenplum Parallel File Server (gpfdist) is the Greengage DB’s parallel file distribution utility. When external data is served by gpfdist, all Greengage DB segments can read or write external table data in parallel.

For readable external tables, the gpfdist server reads data records from files in a specified directory, packs them into a block, and sends the block in a response to a Greengage DB segment’s request. The segments unpack received rows and distribute them according to the external table’s distribution policy. For writable external tables, segments send blocks of rows in a request to gpfdist, and gpfdist writes them to the external file.

The gpfdist server can also be configured with a YAML-formatted file to transform external data files between a supported text format and another format such as XML or JSON.

To access external data served by gpfdist, use the GPFDIST/GPFDISTS protocol in the LOCATION clause of the CREATE EXTERNAL TABLE command. Learn more in Create external tables with GPFDIST / GPFDISTS.

Syntax overview

The gpfdist command’s general syntax is as follows:

gpfdist [-d <directory>] [-p <http_port>] |-P <last_http_port>
    [-l <log_file>] [-t <timeout>] [-S] [-w <time>]
    [-v | -V] [-m <max_length>] [--ssl <certificate_path>]
    [-k <seconds>] [--compress] [--multi_thread <number_of_threads>]
    [-I <input_transformation_name>] [-O <output_transformation_name>]
    [-c <config.yaml>]

gpfdist -? | --help

gpfdist --version
Option Description

-d <directory>

The directory from which gpfdist serves files for readable external tables or creates output files for writable external tables.

If not specified, defaults to the directory where gpfdist is started

-p <http_port>

The HTTP port on which gpfdist serves files.

The default port is 8080

-P <last_http_port>

The ending port number in a range of HTTP port numbers (http_port to last_http_port) on which gpfdist attempts to serve files. gpfdist serves files on the first port number in the range to which it successfully binds

-l <log_file>

The full path or log file name to log standard output messages

-t <timeout>

The time allowed for Greengage DB to establish a connection with a gpfdist process, from 2 to 7200 seconds (2 hours). You may increase this value when experiencing heavy network traffic.

The default timeout is 5 seconds

-S

Opens the file for synchronous I/O with the O_SYNC flag. Any writes to the resulting file descriptor block gpfdist until the data is physically written to the underlying hardware

-w <time>

The number of seconds before Greengage DB closes a target file, from 0 (no timeout) to 7200 seconds (2 hours).

For a Greengage DB host with multiple segments, there might be a delay between segments when writing data from different segments to the file. This option is used to ensure all the data is written to the file before the file is closed

-v

Shows verbose output of the utility operation (the progress and status messages)

-V

Shows very verbose output of the utility operation (all output messages generated by gpfdist)

-m <max_length>

The maximum allowed data row length in bytes, 32768 by default. Should be used when the data includes very wide rows (or when a line too long error message occurs). Should not be used otherwise as it increases resource allocation.

The valid range is from 32 KB to 256 MB. On Windows systems, the upper limit is 1 MB

--ssl <certificate_path>

Adds SSL encryption to data transferred with gpfdist.

If the option is provided, the data from this file server can only be loaded with the GPFDISTS protocol. The root directory (/) cannot be specified as certificate_path. For more details on creating external tables with GPFDISTS, see Create external tables with GPFDIST / GPFDISTS

-k <seconds>

The number of seconds that gpfdist waits before cleaning up the session when there are no POST requests from the segments, from 300 to 86400 seconds (24 hours). You may increase this value when experiencing heavy network traffic

The default timeout is 300 seconds

--compress

Enables compression during data transfer. When specified, gpfdist utilizes the Zstandard (zstd) compression algorithm. This option is not available on Windows platforms

--multi_threads <num_threads>

Sets the maximum number of threads, up to 256, that gpfdist uses during data transfer, parallelizing the operation. When specified, gpfdist automatically compresses the data before transferring. This option is not available on Windows platforms

-I <input_transformation_name>

Sets one of the input transformations defined in the transformations configuration file (-c) as default. The transformation is applied to all read files. Not used if transformation is set at the table level via #transform. No transformation is applied by default.

Learn more in Load data with gpfdist

-O <output_transformation_name>

Sets one of the output transformations defined in the transformations configuration file (-c) as default. The transformation is applied to all written files. Not used if transformation is set at the table level via #transform. No transformation is applied by default.

Learn more in Load data with gpfdist

-c <config.yaml>

Specifies the YAML configuration file that defines the transformations applied when loading or unloading data. A transformation configuration file can describe multiple transformations; you can set the default input or output transformation by using the -I and -O gpfdist options, respectively. The -c option is not available on Windows platforms.

Learn more in Transformation configuration file

-?

Displays the help information about gpfdist

--version

Displays the gpfdist version

Start gpfdist

The gpfdist utility is located in the $GPHOME/bin directory of the Greengage DB master host and on each segment host. To install gpfdist on another host, copy it to the desired location on that host and add this location to PATH. To avoid performance impact on query execution, gpfdist usually runs on ETL machines rather than the hosts where Greengage DB is installed. For multiple gpfdist instances on the same ETL host, use a different base directory and port for each instance.

NOTE

gpfdist and gpload are compatible only with the Greengage DB major version in which they are shipped (for example, 6.x).

Configure SSL

To create external tables using the GPFDISTS protocol, you need to configure SSL communication between Greengage DB and the file server.

  1. Generate a server certificate for the file server. You can follow the steps described in Generate certificates. The certificate can be stored in an arbitrary location on your file sever; this location is specified in the --ssl option when starting the gpfdist utility.

  2. Generate a client certificate as described in Generate a client certificate. Store the following client certificates in the gpfdists subdirectory inside the data directory of each segment instance (for example, /data1/primary/gpseg0/gpfdists) on each segment host:

    • client.crt — the client certificate file;

    • client.key — the client private key file;

    • root.crt — the trusted certificate authorities file.

IMPORTANT

The server and client private keys (server.key and client.key) must not be protected with a passphrase.

Note that gpfdists implements SSL security in a client/server scheme with the following attributes and limitations:

  • Client certificates are required.

  • Multilingual certificates are not supported.

  • A certificate revocation list (CRL) is not supported.

  • A minimum TLS version of 1.2 is required.

  • SSL renegotiation is supported.

The verify_gpfdists_cert server configuration parameter controls whether SSL certificate authentication is enabled when Greengage DB communicates with the gpfdist utility. You can set this parameter to false to deactivate authentication when testing the communication between Greengage DB and gpfdist. In this case, the following SSL exceptions are ignored:

  • The self-signed SSL certificate that is used by gpfdist is not trusted by Greengage DB.

  • The host name contained in the SSL certificate does not match the host name that is running gpfdist.

IMPORTANT

Deactivating SSL certificate authentication exposes a security risk by not validating the gpfdist SSL certificate.

Examples

The following example commands illustrate several common gpfdist launch scenarios.

  • Starting gpfdist in background (&), listening on the default port 8080, and serving files from the current directory:

    $ gpfdist &
  • Starting gpfdist in background (&), listening on port 8801, and serving files from the /home/gpadmin/external_files directory:

    $ gpfdist -p 8801 -d /home/gpadmin/external_files &
  • Starting gpfdist in background (&), listening on the default port 8080, serving files from the current directory, and redirecting output and errors to the gpfdist_log file:

    $ gpfdist -l /home/gpadmin/gpfdist_log &
  • Starting gpfdist in SSL mode with certificates stored in /home/gpadmin/certs, listening on port 8811, and serving files from the /home/gpadmin/external_files directory:

    $ gpfdist --ssl /home/gpadmin/certs -p 8811 -d /home/gpadmin/external_files
  • Starting gpfdist with multithreaded data transfer using four threads, listening on the default port 8080, and serving files from the current directory:

    $ gpfdist --multi_thread 4

The commands' output should look similar to the following:

INFO Before opening listening sockets - following listening sockets are available:
INFO IPV6 socket: [::]:8811
INFO IPV4 socket: 0.0.0.0:8811
INFO Trying to open listening socket:
INFO IPV6 socket: [::]:8811
INFO Opening listening socket succeeded
INFO Trying to open listening socket:
INFO IPV4 socket: 0.0.0.0:8811
INFO Opening listening socket succeeded
Serving HTTP on port 8811, directory /home/gpadmin/external_files
INFO Before opening listening sockets - following listening sockets are available:
INFO IPV6 socket: [::]:8811
INFO IPV4 socket: 0.0.0.0:8811
INFO Trying to open listening socket:
INFO IPV6 socket: [::]:8811
INFO Opening listening socket succeeded
INFO Trying to open listening socket:
INFO IPV4 socket: 0.0.0.0:8811
INFO Opening listening socket succeeded
Serving HTTPS on port 8811, directory /home/gpadmin/external_files

Stop gpfdist

To stop gpfdist when it is running in the background:

  1. Locate the gpfdist process ID:

    $ ps -ef | grep gpfdist

    The output should look as follows:

    gpadmin    12687       1  0 Apr07 ?        gpfdist --ssl /home/gpadmin/certs -p 8811 -d /tmp
    gpadmin    41509   41470  0 13:00 pts/3    gpfdist
    gpadmin    41521   41510  0 13:01 pts/3    grep --color=auto gpfdist
  2. Stop the corresponding process. In this example, the process ID is 41509:

    $ kill 41509

Create external tables with GPFDIST / GPFDISTS

To create an external table using the GPFDIST/GPFDISTS protocol, provide the LOCATION and FORMAT clauses inside the CREATE EXTERNAL TABLE command as follows:

CREATE [[READABLE] | WRITABLE] EXTERNAL TABLE <table_name> ( ... )
    LOCATION ('gpfdist://<host>[:<port>]/<path>/<file>')
    FORMAT 'TEXT' | 'CSV' | 'CUSTOM';
CREATE [[READABLE] | WRITABLE] EXTERNAL TABLE <table_name> ( ... )
    LOCATION ('gpfdists://<host>[:<port>]/<path>/<file>')
    FORMAT 'TEXT' | 'CSV' | 'CUSTOM';

In the LOCATION clause, provide a URI specifying the target files location: the file host name (<host>), port (<port>), and the path to the file relative to the directory where gpfdist is started (<path>/<file>). You can provide multiple URIs if multiple gpfdist data sources are used. To denote multiple files to read, use the wildcard characters such as * or []. The gpfdist utility automatically uncompresses the .gz, .bz2, and .zst files when reading data and compresses them when writing data.

NOTE

Compression is not supported when running gpfdist on Windows platforms.

In the FORMAT clause, define how your data is formatted.

Examples

These examples illustrate creating external tables via the GPFDIST / GPFDISTS protocols.

Prerequisites

To try out the practical examples listed in this section, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the customers test database, and connect to it:

DROP DATABASE IF EXISTS customers;
CREATE DATABASE customers;
\c customers

Readable table with GPFDIST

This example creates the customers readable table, whose data is loaded from two files residing in the /tmp directory on the fs file server.

  1. In the /tmp directory on the fs host, create two source data files:

    • The customers.csv.gz compressed via gzip:

      #!/bin/bash
      
      cat > customers.csv <<EOF
      id,first_name,last_name,email,address
      1,John,Doe,john.doe@example.com,123 Elm Street
      2,Jane,Smith,jane.smith@example.com,456 Oak Street
      EOF
      
      gzip customers.csv
    • The regular customers.csv file:

      #!/bin/bash
      
      cat > /tmp/customers.csv <<EOF
      id,first_name,last_name,email,address
      3,Bob,Brown,bob.brown@example.com,789 Pine Street
      4,Rob,Stuart,rob.stuart@example.com,119 Willow Street
      EOF
  2. Start gpfdist in the /tmp directory of the fs host:

    $ gpfdist -p 8811 -d /tmp/ &
  3. On the master host, run the CREATE EXTERNAL TABLE command. In the LOCATION clause, provide the host name of the file server (fs) and use the * wildcard to denote two files to read. In the FORMAT clause, set CSV as the data format and specify that the files contain a header row (HEADER):

    CREATE EXTERNAL TABLE customers (
        id INTEGER,
        first_name VARCHAR(50),
        last_name VARCHAR(50),
        email VARCHAR(100),
        address VARCHAR(255)
    )
        LOCATION ('gpfdist://fs:8811/customers.csv*')
        FORMAT 'CSV' (HEADER);
  4. Query the created external table:

    SELECT * FROM customers;

    The output should look as follows:

     id | first_name | last_name |         email          |      address
    ----+------------+-----------+------------------------+-------------------
      1 | John       | Doe       | john.doe@example.com   | 123 Elm Street
      2 | Jane       | Smith     | jane.smith@example.com | 456 Oak Street
      3 | Bob        | Brown     | bob.brown@example.com  | 789 Pine Street
      4 | Rob        | Stuart    | rob.stuart@example.com | 119 Willow Street

Writable table with GPFDISTS

This example creates the orders writable table, whose data is unloaded to the orders.csv file. The file will be created automatically in the /tmp directory on the fs file server.

  1. Start gpfdist on the fs server from the /tmp directory, passing --ssl ~/certs to enable communication over GPFDISTS:

    $ gpfdist --ssl ~/certs -p 8811 -d /tmp/ &
  2. On the master host, run the CREATE WRITABLE EXTERNAL TABLE command. In the LOCATION clause, provide the host name of the file server (fs). In the FORMAT clause, set CSV as the data format for the created orders.csv file:

    CREATE WRITABLE EXTERNAL TABLE orders (
        id INTEGER,
        total NUMERIC(10,2),
        status VARCHAR(50)
    )
        LOCATION ('gpfdists://fs:8811/orders.csv')
        FORMAT 'CSV';
  3. Populate the created table with sample data:

    INSERT INTO orders (customer_id, total, status)
    VALUES (1, 999.99, 'Pending'),
           (2, 499.99, 'Shipped'),
           (3, 299.99, 'Delivered');
  4. Review the created orders.csv file:

    $ cat /tmp/orders.csv

    The file’s content should look as follows:

    1,999.99,Pending
    2,499.99,Shipped
    3,299.99,Delivered

Performance tips

You can run gpfdist instances on multiple hosts as well as run multiple gpfdist instances on each host. This lets you deploy gpfdist servers strategically, utilize all the available network bandwidth and Greengage DB’s parallelism, and thus achieve fast data load and unload rates.

  • Allow network traffic to use all ETL host network interfaces simultaneously. Run one gpfdist instance for each interface on the ETL host and use the host name of each network interface card (NIC) in the LOCATION clause of the CREATE EXTERNAL TABLE command.

  • Divide external table data equally among multiple gpfdist instances on the ETL host. For example, on an ETL system with two NICs, run two gpfdist instances (one on each NIC) to optimize data load performance and divide the external table data files evenly between the two gpfdist servers.

  • Use pipes (|) to separate formatted text in source data files. Greengage DB encloses comma-separated text strings in single or double quotes. gpfdist, in turn, has to remove the quotes to parse the strings. Using pipes as field separators avoids the extra step and improves performance.

  • If compression is enabled (that is, the compress option is provided), gpfdist transmits a larger amount of data while maintaining low network usage. Compression, though, is cpu-intensive and might potentially impose additional load on the cluster.

  • If multithreaded execution is enabled (that is, the multi_threads option is provided), the overall time required for compression may decrease. This increases transmission speed while maintaining low network load.

  • Use the gp_external_max_segs server configuration parameter in the postgresql.conf file of your master instance to control the number of segment instances that can access a single gpfdist instance simultaneously (64 by default). You can appoint several segments to process external data files and use the remaining ones to perform other database processing tasks.