Hello, I’m DocuDroid!
Submitting feedback
Thank you for rating our AI Search!
We would be grateful if you could share your thoughts so we can improve our AI Search for you and other readers.
GitHub

Use gpfdist

Anton Monakov

The Greenplum Parallel File Server (gpfdist) is the Greengage DB’s parallel file distribution utility. When external data is served by gpfdist, all Greengage DB segments can read or write external table data in parallel.

For readable external tables, the gpfdist server reads data records from files in a specified directory, packs them into a block, and sends the block in a response to a Greengage DB segment’s request. The segments unpack received rows and distribute them according to the external table’s distribution policy. For writable external tables, segments send blocks of rows in a request to gpfdist, and gpfdist writes them to the external file.

The gpfdist server can also be configured with a YAML-formatted file to transform external data files between a supported text format and another format such as XML or JSON. See Transform external data for details.

To access external data served by gpfdist, use the GPFDIST/GPFDISTS protocol in the LOCATION clause of the CREATE EXTERNAL TABLE command. Learn more in Create external tables with GPFDIST / GPFDISTS.

Syntax overview

The gpfdist command’s general syntax is provided below. See the gpfdist command reference for the complete options description.

gpfdist [ -d <directory> ]
        [ -p <http_port> ]
        [ -P <last_http_port> ]
        [ -l <log_file> ]
        [ -t <timeout> ]
        [ -k <clean_up_timeout> ]
        [ -S ]
        [ -w <time> ]
        [ -v | -V ]
        [ -s ]
        [ -m <max_length> ]
        [ --ssl <certificate_path> ]
        [ --compress ]
        [ --multi_thread <number_of_threads> ]
        [ -I <input_transformation_name> ]
        [ -O <output_transformation_name> ]
        [ -c <config.yaml> ]

gpfdist -? | --help

gpfdist --version

Start gpfdist

The gpfdist utility is located in the $GPHOME/bin directory of the Greengage DB master host and on each segment host. To install gpfdist on another host, copy it to the desired location on that host and add this location to PATH. To avoid performance impact on query execution, gpfdist usually runs on ETL machines rather than the hosts where Greengage DB is installed. For multiple gpfdist instances on the same ETL host, use a different base directory and port for each instance.

NOTE

gpfdist and gpload are compatible only with the Greengage DB major version in which they are shipped (for example, 6.x).

Configure SSL

To create external tables using the GPFDISTS protocol, you need to configure SSL communication between Greengage DB and the file server.

  1. Generate a server certificate for the file server. You can follow the steps described in Generate certificates. The certificate can be stored in an arbitrary location on your file sever; this location is specified in the --ssl option when starting the gpfdist utility.

  2. Generate a client certificate as described in Generate a client certificate. Store the following client certificates in the gpfdists subdirectory inside the data directory of each segment instance (for example, /data1/primary/gpseg0/gpfdists) on each segment host:

    • client.crt — the client certificate file;

    • client.key — the client private key file;

    • root.crt — the trusted certificate authorities file.

IMPORTANT

The server and client private keys (server.key and client.key) must not be protected with a passphrase.

Note that gpfdists implements SSL security in a client/server scheme with the following attributes and limitations:

  • Client certificates are required.

  • Multilingual certificates are not supported.

  • A certificate revocation list (CRL) is not supported.

  • A minimum TLS version of 1.2 is required.

  • SSL renegotiation is supported.

The verify_gpfdists_cert server configuration parameter controls whether SSL certificate authentication is enabled when Greengage DB communicates with the gpfdist utility. You can set this parameter to false to deactivate authentication when testing the communication between Greengage DB and gpfdist. In this case, the following SSL exceptions are ignored:

  • The self-signed SSL certificate that is used by gpfdist is not trusted by Greengage DB.

  • The host name contained in the SSL certificate does not match the host name that is running gpfdist.

IMPORTANT

Deactivating SSL certificate authentication exposes a security risk by not validating the gpfdist SSL certificate.

Examples

The following example commands illustrate several common gpfdist launch scenarios.

  • Starting gpfdist in background (&), listening on the default port 8080, and serving files from the current directory:

    $ gpfdist &
  • Starting gpfdist in background (&), listening on port 8801, and serving files from the /home/gpadmin/external_files directory:

    $ gpfdist -p 8801 -d /home/gpadmin/external_files &
  • Starting gpfdist in background (&), listening on the default port 8080, serving files from the current directory, and redirecting output and errors to the gpfdist_log file:

    $ gpfdist -l /home/gpadmin/gpfdist_log &
  • Starting gpfdist in SSL mode with certificates stored in /home/gpadmin/certs, listening on port 8811, and serving files from the /home/gpadmin/external_files directory:

    $ gpfdist --ssl /home/gpadmin/certs -p 8811 -d /home/gpadmin/external_files
  • Starting gpfdist with multithreaded data transfer using four threads, listening on the default port 8080, and serving files from the current directory:

    $ gpfdist --multi_thread 4

The commands' output should look similar to the following:

INFO Before opening listening sockets - following listening sockets are available:
INFO IPV6 socket: [::]:8811
INFO IPV4 socket: 0.0.0.0:8811
INFO Trying to open listening socket:
INFO IPV6 socket: [::]:8811
INFO Opening listening socket succeeded
INFO Trying to open listening socket:
INFO IPV4 socket: 0.0.0.0:8811
INFO Opening listening socket succeeded
Serving HTTP on port 8811, directory /home/gpadmin/external_files
INFO Before opening listening sockets - following listening sockets are available:
INFO IPV6 socket: [::]:8811
INFO IPV4 socket: 0.0.0.0:8811
INFO Trying to open listening socket:
INFO IPV6 socket: [::]:8811
INFO Opening listening socket succeeded
INFO Trying to open listening socket:
INFO IPV4 socket: 0.0.0.0:8811
INFO Opening listening socket succeeded
Serving HTTPS on port 8811, directory /home/gpadmin/external_files

Stop gpfdist

To stop gpfdist when it is running in the background:

  1. Locate the gpfdist process ID:

    $ ps -ef | grep gpfdist

    The output should look as follows:

    gpadmin    12687       1  0 Apr07 ?        gpfdist --ssl /home/gpadmin/certs -p 8811 -d /tmp
    gpadmin    41509   41470  0 13:00 pts/3    gpfdist
    gpadmin    41521   41510  0 13:01 pts/3    grep --color=auto gpfdist
  2. Stop the corresponding process. In this example, the process ID is 41509:

    $ kill 41509

Create external tables with GPFDIST / GPFDISTS

To create an external table using the GPFDIST/GPFDISTS protocol, provide the LOCATION and FORMAT clauses inside the CREATE EXTERNAL TABLE command as follows:

CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name> ( ... )
    LOCATION ('gpfdist://<host>[:<port>]/<path>/<file>')
    FORMAT 'TEXT' | 'CSV' | 'CUSTOM';
CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name> ( ... )
    LOCATION ('gpfdists://<host>[:<port>]/<path>/<file>')
    FORMAT 'TEXT' | 'CSV' | 'CUSTOM';

In the LOCATION clause, provide a URI specifying the target files location: the file host name (<host>), port (<port>), and the path to the file relative to the directory where gpfdist is started (<path>/<file>). You can provide multiple URIs if multiple gpfdist data sources are used. To denote multiple files to read, use the wildcard characters such as * or []. The gpfdist utility automatically uncompresses the .gz, .bz2, and .zst files when reading data and compresses them when writing data.

NOTE

Compression is not supported when running gpfdist on Windows platforms.

In the FORMAT clause, define how your data is formatted.

Examples

These examples illustrate creating external tables via the GPFDIST / GPFDISTS protocols.

Prerequisites

To try out the practical examples listed in this section, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the customers test database, and connect to it:

DROP DATABASE IF EXISTS customers;
CREATE DATABASE customers;
\c customers

Readable table with GPFDIST

This example creates the customers readable table, whose data is loaded from two files residing in the /tmp directory on the fs file server.

  1. In the /tmp directory on the fs host, create two source data files:

    • The customers.csv.gz compressed via gzip:

      #!/bin/bash
      
      cat > customers.csv <<EOF
      id,first_name,last_name,email,address
      1,John,Doe,john.doe@example.com,123 Elm Street
      2,Jane,Smith,jane.smith@example.com,456 Oak Street
      EOF
      
      gzip customers.csv
    • The regular customers.csv file:

      #!/bin/bash
      
      cat > /tmp/customers.csv <<EOF
      id,first_name,last_name,email,address
      3,Bob,Brown,bob.brown@example.com,789 Pine Street
      4,Rob,Stuart,rob.stuart@example.com,119 Willow Street
      EOF
  2. Start gpfdist in the /tmp directory of the fs host:

    $ gpfdist -p 8811 -d /tmp/ &
  3. On the master host, run the CREATE EXTERNAL TABLE command. In the LOCATION clause, provide the host name of the file server (fs) and use the * wildcard to denote two files to read. In the FORMAT clause, set CSV as the data format and specify that the files contain a header row (HEADER):

    CREATE EXTERNAL TABLE customers (
        id INTEGER,
        first_name VARCHAR(50),
        last_name VARCHAR(50),
        email VARCHAR(100),
        address VARCHAR(255)
    )
        LOCATION ('gpfdist://fs:8811/customers.csv*')
        FORMAT 'CSV' (HEADER);
  4. Query the created external table:

    SELECT * FROM customers;

    The output should look as follows:

     id | first_name | last_name |         email          |      address
    ----+------------+-----------+------------------------+-------------------
      1 | John       | Doe       | john.doe@example.com   | 123 Elm Street
      2 | Jane       | Smith     | jane.smith@example.com | 456 Oak Street
      3 | Bob        | Brown     | bob.brown@example.com  | 789 Pine Street
      4 | Rob        | Stuart    | rob.stuart@example.com | 119 Willow Street

Writable table with GPFDISTS

This example creates the orders writable table, whose data is unloaded to the orders.csv file. The file will be created automatically in the /tmp directory on the fs file server.

  1. Start gpfdist on the fs server from the /tmp directory, passing --ssl ~/certs to enable communication over GPFDISTS:

    $ gpfdist --ssl ~/certs -p 8811 -d /tmp/ &
  2. On the master host, run the CREATE WRITABLE EXTERNAL TABLE command. In the LOCATION clause, provide the host name of the file server (fs). In the FORMAT clause, set CSV as the data format for the created orders.csv file:

    CREATE WRITABLE EXTERNAL TABLE orders (
        id INTEGER,
        total NUMERIC(10,2),
        status VARCHAR(50)
    )
        LOCATION ('gpfdists://fs:8811/orders.csv')
        FORMAT 'CSV';
  3. Populate the created table with sample data:

    INSERT INTO orders (id, total, status)
    VALUES (1, 999.99, 'Pending'),
           (2, 499.99, 'Shipped'),
           (3, 299.99, 'Delivered');
  4. Review the created orders.csv file:

    $ cat /tmp/orders.csv

    The file’s content should look as follows:

    1,999.99,Pending
    2,499.99,Shipped
    3,299.99,Delivered

Performance tips

You can run gpfdist instances on multiple hosts as well as run multiple gpfdist instances on each host. This lets you deploy gpfdist servers strategically, utilize all the available network bandwidth and Greengage DB’s parallelism, and thus achieve fast data load and unload rates.

  • Allow network traffic to use all ETL host network interfaces simultaneously. Run one gpfdist instance for each interface on the ETL host and use the host name of each network interface card (NIC) in the LOCATION clause of the CREATE EXTERNAL TABLE command.

  • Divide external table data equally among multiple gpfdist instances on the ETL host. For example, on an ETL system with two NICs, run two gpfdist instances (one on each NIC) to optimize data load performance and divide the external table data files evenly between the two gpfdist servers.

  • Use pipes (|) to separate formatted text in source data files. Greengage DB encloses comma-separated text strings in single or double quotes. gpfdist, in turn, has to remove the quotes to parse the strings. Using pipes as field separators avoids the extra step and improves performance.

  • If compression is enabled (that is, the compress option is provided), gpfdist transmits a larger amount of data while maintaining low network usage. Compression, though, is cpu-intensive and might potentially impose additional load on the cluster.

  • If multithreaded execution is enabled (that is, the multi_threads option is provided), the overall time required for compression may decrease. This increases transmission speed while maintaining low network load.

  • Use the gp_external_max_segs server configuration parameter in the postgresql.conf file of your master instance to control the number of segment instances that can access a single gpfdist instance simultaneously (64 by default). You can appoint several segments to process external data files and use the remaining ones to perform other database processing tasks.