Hello, I’m DocuDroid!
Submitting feedback
Thank you for rating our AI Search!
We would be grateful if you could share your thoughts so we can improve our AI Search for you and other readers.
GitHub

Use PXF object store connectors to read and write text data between Greengage DB and S3

Anton Monakov

With PXF object store connectors, you can read and write data residing in an object store such as Azure Blob Storage, Azure Data Lake, Google Cloud Storage, Amazon S3, MinIO, and other S3-compatible object stores.

This topic describes how to configure and use the connectors for reading and writing text data residing in an object store by using external tables and provides practical examples.

Create an external table using the PXF protocol

To create a Greengage DB external table to read or write data in S3, use the following general syntax:

CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    LOCATION ('pxf://<path_to_data>?PROFILE=<profile_name>[&<custom_option>=<value>[...]]')
    FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>') |
        'CUSTOM' (FORMATTER='pxfdelimited_import' <format_option>[=|<space>][E]'<value>')
    [DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
Keyword Value

<table_name>

The name of the table to create

<column_name>

The name of the column to create

<data_type>

The data type of the column

LIKE <other_table>

Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy

<path_to_data>

The path to the directory or file in the object store. When the <server_name> configuration includes a pxf.fs.basePath property setting, the <path_to_data> value is considered to be relative to the base path specified. Otherwise, the path is considered absolute. The value must not include the dollar sign ($) character

PROFILE=<profile_name>

The profile is specified as the <objstore>:<format> pair, where <objstore> is the object store prefix and <format> is the postfix identifying the data format.

The following <objstore> prefixes and the corresponding object stores are supported:

  • wasbs — Azure Blob Storage

  • adl — Azure Data Lake

  • gs — Google Cloud Storage

  • s3 — Amazon S3, MinIO, and other S3-compatible object stores

The following <format> postfixes and the corresponding data formats are supported:

  • text — read or write delimited plain text data.

  • csv — read or write comma-separated value (CSV) data as well as read data containing multibyte / multi-character delimiters. Learn more in Read data with multibyte or multi-character delimiters.

  • text:multi — read plain text data with delimited single- or multiline records that include embedded (quoted) linefeed characters.

FORMAT <value>

Defines the data format, which can be one of the following:

Note that the HEADER option, which commonly designates whether the data file contains a header row, is not supported for external tables using PXF. If a text file includes header lines, use the SKIP_HEADER_COUNT custom option to specify the number of lines to skip at the beginning of each file

delimiter

The delimiter character in the data. For the CSV format, the default <delim_value> is the comma character (,). You can use an E'' escape string constant, for example delimiter=E'\t'

DISTRIBUTED BY

When loading data from a Greengage DB table into a writable external table, consider specifying the same distribution policy or column name on both tables. This will avoid extra motion of data between segments on the load operation. Learn more about table distribution in Distribution

<custom_option>

One of the custom options provided in the LOCATION string as described below

SERVER=<server_name>

The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used

IGNORE_MISSING_PATH=<boolean>

The action to take when <path_to_data> is missing or invalid. If set to false (default), an error is returned. If set to true, PXF ignores missing path errors and returns an empty fragment

SKIP_HEADER_COUNT=<numlines>

The number of header lines to skip in the beginning of each HDFS file before reading the data. The default value is 0, no lines are skipped

COMPRESSION_CODEC

The compression codec to use when writing data: default, bzip2, gzip, or uncompressed (no compression). If not provided (or uncompressed is provided), no data compression is performed

NEWLINE

Designates a character used as a newline character. If CR or CRLF is provided for the NEWLINE custom formatter option, then NEWLINE must also be set to the same value in the LOCATION clause

NOTE

If you are accessing an S3 object store:

  • You can provide S3 credentials via the accesskey and secretkey custom options in the LOCATION clause of the CREATE EXTERNAL TABLE command.

  • When retrieving CSV-format data, you can use PXF with the Amazon S3 Select service.

Read data with multibyte or multi-character delimiters

For the data containing a multibyte delimiter or a delimiter with multiple characters, only <objstore>:csv (where <objstore> is the object store prefix) PXF profiles are supported. In the FORMAT clause, you must specify the custom formatter (the pxfdelimited_import formatter function) and the formatter options under <format_option>. For a practical example of reading such data, see Read data with multibyte delimiters.

The general syntax for creating a readable external table for such data is as follows.

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    LOCATION ('pxf://<path_to_data>?PROFILE=<objstore>:csv'[&<custom_option>=<value>[...]]')
    FORMAT 'CUSTOM' (FORMATTER='pxfdelimited_import' <format_option>[=|<space>][E]'<value>');
Formatter option Description Default value

DELIMITER=<delim_string>

Required. The single-byte or multibyte delimiter string that separates data columns. The specified string may be up to 32 bytes in length and may not contain quote or escape characters. The delimiter is specified either directly or as its byte representation.

The byte representation of the delimiter is specified in the E'<value>' format. Since some characters have different byte representations depending on a specific encoding, the database encoding is used. For example, if the database encoding is UTF8, the file encoding is LATIN1, and the delimiter is the currency symbol (¤), you need to specify its UTF8 byte representation — \xC2\xA4

None

QUOTE=<char>

The single one-byte ASCII quotation character for all columns

None

ESCAPE=<char>

The single one-byte ASCII character used to escape special characters (for example, the DELIMITER, QUOTE, or NEWLINE value, or the ESCAPE value itself)

None, or the QUOTE value if that is set

NEWLINE

The character used as a newline character, which can be LF (Line feed, 0x0A), CR (Carriage return, 0x0D), or CR followed by LF (CRLF, 0x0D 0x0A).

Note that if CR or CRLF is provided, the same value must be specified as the NEWLINE custom option in the LOCATION clause of the CREATE EXTERNAL TABLE command

LF

When reading data with a multibyte or multi-character delimiter, PXF behavior depends on the quote and escape character settings.

QUOTE specified ESCAPE specified PXF behavior

No

No

PXF reads the data as is

Yes

Yes

PXF reads the data between quote characters as is and unescapes only the quote and escape characters

Yes

No (ESCAPE 'OFF')

PXF reads the data between quote characters as is

No

Yes

PXF reads the data as is and unescapes only the delimiter, newline, and escape itself

NOTE

If QUOTE is specified, all data columns must be quoted. Otherwise, if not specified, all data columns must be unquoted. No whitespace or unwanted characters are allowed between the quote value and the delimiter value as well as between the quote value and the newline value.

Examples

Prerequisites

To try out the practical examples, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the customers test database and connect to it:

DROP DATABASE IF EXISTS customers;
CREATE DATABASE customers;
\c customers

To be able to create an external table using the PXF protocol, enable the PXF extension in the database as described in Register PXF in a database in PXF documentation:

CREATE EXTENSION pxf;

Configure the PXF S3 connector

To have PXF connect to an object store, you need to create the corresponding server configuration and then synchronize it to the Greengage DB cluster:

  1. Log in to the Greengage DB master host as gpadmin.

  2. Go to the $PXF_BASE/servers directory and create an S3 server configuration directory named s3. Depending on your object storage, copy the required server configuration file from $PXF_HOME/templates to $PXF_BASE/servers/s3. The example uses the s3-site.xml file based on the minio-site.xml template.

    $ mkdir $PXF_BASE/servers/s3
    $ cd $PXF_BASE/servers/s3
    $ cp $PXF_HOME/templates/s3-site.xml .

    In the configuration file, provide the relevant object store connection details:

    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
        <property>
            <name>fs.s3a.endpoint</name>
            <value>storage.example.com</value>
        </property>
        <property>
            <name>fs.s3a.access.key</name>
            <value>${ACCESS_KEY}</value>
        </property>
        <property>
            <name>fs.s3a.secret.key</name>
            <value>${SECRET_KEY}</value>
        </property>
        <property>
            <name>fs.s3a.fast.upload</name>
            <value>true</value>
        </property>
        <property>
            <name>fs.s3a.path.style.access</name>
            <value>true</value>
        </property>
    </configuration>
  3. Synchronize the server configuration to the Greengage DB cluster hosts:

    $ pxf cluster sync

Read plain text and CSV data

  1. In the customers bucket on the S3 host, create the orders.txt file with the following content:

    id,name,price
    1,Laptop,999.99
    2,Smartphone,499.99
    3,Tablet,299.99
    4,Monitor,599.99
    5,Keyboard,99.99
  2. On the Greengage DB master host, create a readable external table that references the orders.txt file. In the LOCATION clause, specify the PXF s3:text profile and the server configuration. Use the SKIP_HEADER_COUNT option to indicate that the file contains a header row. In the FORMAT clause, set TEXT as the data format and specify the comma character (,) as a delimiter:

    CREATE EXTERNAL TABLE orders_s3 (
            id INTEGER,
            name VARCHAR,
            price NUMERIC    
        )
        
        LOCATION ('pxf://customers/orders.txt?PROFILE=s3:text&SERVER=s3&SKIP_HEADER_COUNT=1')
        FORMAT 'TEXT' (delimiter=E',');    
  3. Query the created external table:

    SELECT * FROM orders_s3;

    The output should look as follows:

     id |    name    | price
    ----+------------+--------
      1 | Laptop     | 999.99
      2 | Smartphone | 499.99
      3 | Tablet     | 299.99
      4 | Monitor    | 599.99
      5 | Keyboard   |  99.99
    (5 rows)
  4. Create another readable external table that references the same orders.txt file. In the LOCATION clause, specify the PXF s3:csv profile and the server configuration. Use the SKIP_HEADER_COUNT option to indicate that the file contains a header row. In the FORMAT clause, set CSV as the data format:

    CREATE EXTERNAL TABLE orders_s3_csv (
            id INTEGER,
            name VARCHAR,
            price NUMERIC    
        )
    
        LOCATION ('pxf://customers/orders.txt?PROFILE=s3:csv&SERVER=s3&SKIP_HEADER_COUNT=1')
        FORMAT 'CSV';    
  5. Query the created external table:

    SELECT * FROM orders_s3_csv;

    The output should look as follows:

     id |    name    | price
    ----+------------+--------
      1 | Laptop     | 999.99
      2 | Smartphone | 499.99
      3 | Tablet     | 299.99
      4 | Monitor    | 599.99
      5 | Keyboard   |  99.99
    (5 rows)

Read multiline text data

  1. In the customers bucket on the S3 host, create the customers_multiline.txt file with the following content:

    1:John:Doe:john.doe@example.com:"123 Elm Street
    New York"
    2:Jane:Smith:jane.smith@example.com:"456 Oak Street
    Chicago"
    3:Bob:Brown:bob.brown@example.com:"789 Pine Street
    Los Angeles"
    4:Rob:Stuart:rob.stuart@example.com:"119 Willow Street
    New Orleans"
  2. On the Greengage DB master host, create a readable external table that references the customers_multiline.txt file. In the LOCATION clause, specify the PXF s3:text:multi profile and the server configuration. In the FORMAT clause, set CSV as the data format and specify the colon character (:) as a delimiter:

    CREATE EXTERNAL TABLE customers_s3_multiline (
            id INTEGER,
            first_name TEXT,
            last_name TEXT,
            email TEXT,
            address TEXT    
        )
    
        LOCATION ('pxf://customers/customers_multiline.txt?PROFILE=s3:text:multi&SERVER=s3')
        FORMAT 'CSV' (delimiter=':');    
  3. Query the created external table:

    SELECT * FROM customers_s3_multiline;

    The output should look as follows:

     id | first_name | last_name |         email          |      address
    ----+------------+-----------+------------------------+-------------------
      1 | John       | Doe       | john.doe@example.com   | 123 Elm Street   +
        |            |           |                        | New York
      2 | Jane       | Smith     | jane.smith@example.com | 456 Oak Street   +
        |            |           |                        | Chicago
      3 | Bob        | Brown     | bob.brown@example.com  | 789 Pine Street  +
        |            |           |                        | Los Angeles
      4 | Rob        | Stuart    | rob.stuart@example.com | 119 Willow Street+
        |            |           |                        | New Orleans
    (4 rows)

Read data with multibyte delimiters

  1. In the customers bucket on the S3 host, create the customers_multibyte.txt file having the following content:

    "John Doe"⟼"john.doe@example.com"⟼"Note: \"Urgent\""
    "Jane Smith"⟼"jane.smith@example.com"⟼""
    "Bob Brown"⟼"bob.brown@example.com"⟼""
    "Rob Stuart"⟼"rob.stuart@example.com"⟼"Note: \"Delayed\""
  2. On the Greengage DB master host, create a readable external table that references the customers_multibyte.txt file. In the LOCATION clause, specify the PXF s3:csv profile and the server configuration. In the FORMAT clause, specify pxfdelimited_import, which is the built-in custom formatter function for reading delimited text data. Specify the right arrow () as a delimiter character, a double quote (") as a quote character, and a backslash (\) as an escape character:

    CREATE EXTERNAL TABLE customers_s3_multibyte (
            name TEXT,
            email TEXT,
            notes TEXT    
        )
    
        LOCATION ('pxf://customers/customers_multibyte.txt?PROFILE=s3:csv&SERVER=s3')
        FORMAT 'CUSTOM' (FORMATTER='pxfdelimited_import', DELIMITER=E'⟼', QUOTE='"', ESCAPE='\');    
  3. Query the created external table:

    SELECT * FROM customers_s3_multibyte;

    The output should look as follows:

        name    |         email          |      notes
    ------------+------------------------+-----------------
     John Doe   | john.doe@example.com   | Note: "Urgent"
     Jane Smith | jane.smith@example.com |
     Bob Brown  | bob.brown@example.com  |
     Rob Stuart | rob.stuart@example.com | Note: "Delayed"
    (4 rows)

Write text data

  1. On the Greengage DB master host, create a writable external table that writes data to the customers bucket on the S3 host. In the LOCATION clause, specify the PXF s3:text profile and the server configuration and set the compression codec to gzip. In the FORMAT clause, specify TEXT as the data format and set the comma character (,) as a delimiter:

    CREATE WRITABLE EXTERNAL TABLE customers_w_s3 (
            name TEXT,
            email TEXT,
            address TEXT
        )
    
        LOCATION ('pxf://customers?PROFILE=s3:text&SERVER=s3&COMPRESSION_CODEC=gzip')
        FORMAT 'TEXT' (delimiter=E',');    
  2. Populate the created table with sample data:

    INSERT INTO customers_w_s3 (name, email, address) 
    VALUES ('Alice Smith','alice.smith@example.com','42 Maple Avenue, Anytown'),
           ('Bob Johnson','bob.johnson@example.com','10 Oak Lane, Springfield'),
           ('Carol Williams','carol.williams@example.com','7 Pine Court, Hill Valley'),
           ('David Miller','david.miller@example.com','3 Birch Road, Pleasantville');
  3. Verify the contents of the customers bucket. The set of created files should look similar to the following:

    /customers/190-0000000021_1.gz
    /customers/190-0000000021_2.gz
  4. On the Greengage DB master host, create a readable external table. In the LOCATION clause, use the customers prefix to reference all created files and specify the PXF s3:text profile and the server configuration. In the FORMAT clause, set TEXT as the data format and specify the comma character (,) as a delimiter:

    CREATE EXTERNAL TABLE customers_s3_v (
            name TEXT,
            email TEXT,
            address TEXT
        )
    
        LOCATION ('pxf://customers?PROFILE=s3:text&SERVER=s3')
        FORMAT 'TEXT' (delimiter=E',');    
  5. Query the created external table:

    SELECT * FROM customers_s3_v;

    The output should look as follows:

          name      |           email            |           address
    ----------------+----------------------------+-----------------------------
     Bob Johnson    | bob.johnson@example.com    | 10 Oak Lane, Springfield
     David Miller   | david.miller@example.com   | 3 Birch Road, Pleasantville
     Alice Smith    | alice.smith@example.com    | 42 Maple Avenue, Anytown
     Carol Williams | carol.williams@example.com | 7 Pine Court, Hill Valley
    (4 rows)