Use PXF object store connectors to read and write text data between Greengage DB and S3
With PXF object store connectors, you can read and write data residing in an object store such as Azure Blob Storage, Azure Data Lake, Google Cloud Storage, Amazon S3, MinIO, and other S3-compatible object stores.
This topic describes how to configure and use the connectors for reading and writing text data residing in an object store by using external tables and provides practical examples.
Create an external table using the PXF protocol
To create a Greengage DB external table to read or write data in S3, use the following general syntax:
CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path_to_data>?PROFILE=<profile_name>[&<custom_option>=<value>[...]]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>') |
'CUSTOM' (FORMATTER='pxfdelimited_import' <format_option>[=|<space>][E]'<value>')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
| Keyword | Value |
|---|---|
<table_name> |
The name of the table to create |
<column_name> |
The name of the column to create |
<data_type> |
The data type of the column |
LIKE <other_table> |
Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy |
<path_to_data> |
The path to the directory or file in the object store.
When the |
PROFILE=<profile_name> |
The profile is specified as the The following
The following
|
FORMAT <value> |
Defines the data format, which can be one of the following:
Note that the |
delimiter |
The delimiter character in the data.
For the |
DISTRIBUTED BY |
When loading data from a Greengage DB table into a writable external table, consider specifying the same distribution policy or column name on both tables. This will avoid extra motion of data between segments on the load operation. Learn more about table distribution in Distribution |
<custom_option> |
One of the custom options provided in the |
SERVER=<server_name> |
The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used |
IGNORE_MISSING_PATH=<boolean> |
The action to take when |
SKIP_HEADER_COUNT=<numlines> |
The number of header lines to skip in the beginning of each HDFS file before reading the data.
The default value is |
COMPRESSION_CODEC |
The compression codec to use when writing data: |
NEWLINE |
Designates a character used as a newline character.
If |
If you are accessing an S3 object store:
-
You can provide S3 credentials via the
accesskeyandsecretkeycustom options in theLOCATIONclause of theCREATE EXTERNAL TABLEcommand. -
When retrieving CSV-format data, you can use PXF with the Amazon S3 Select service.
Read data with multibyte or multi-character delimiters
For the data containing a multibyte delimiter or a delimiter with multiple characters, only <objstore>:csv (where <objstore> is the object store prefix) PXF profiles are supported.
In the FORMAT clause, you must specify the custom formatter (the pxfdelimited_import formatter function) and the formatter options under <format_option>.
For a practical example of reading such data, see Read data with multibyte delimiters.
The general syntax for creating a readable external table for such data is as follows.
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path_to_data>?PROFILE=<objstore>:csv'[&<custom_option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfdelimited_import' <format_option>[=|<space>][E]'<value>');
| Formatter option | Description | Default value |
|---|---|---|
DELIMITER=<delim_string> |
Required. The single-byte or multibyte delimiter string that separates data columns. The specified string may be up to 32 bytes in length and may not contain quote or escape characters. The delimiter is specified either directly or as its byte representation. The byte representation of the delimiter is specified in the |
None |
QUOTE=<char> |
The single one-byte ASCII quotation character for all columns |
None |
ESCAPE=<char> |
The single one-byte ASCII character used to escape special characters (for example, the |
None, or the |
NEWLINE |
The character used as a newline character, which can be Note that if |
LF |
When reading data with a multibyte or multi-character delimiter, PXF behavior depends on the quote and escape character settings.
| QUOTE specified | ESCAPE specified | PXF behavior |
|---|---|---|
No |
No |
PXF reads the data as is |
Yes |
Yes |
PXF reads the data between quote characters as is and unescapes only the quote and escape characters |
Yes |
No ( |
PXF reads the data between quote characters as is |
No |
Yes |
PXF reads the data as is and unescapes only the delimiter, newline, and escape itself |
If QUOTE is specified, all data columns must be quoted.
Otherwise, if not specified, all data columns must be unquoted.
No whitespace or unwanted characters are allowed between the quote value and the delimiter value as well as between the quote value and the newline value.
Examples
Configure the PXF S3 connector
To have PXF connect to an object store, you need to create the corresponding server configuration and then synchronize it to the Greengage DB cluster:
-
Log in to the Greengage DB master host as
gpadmin. -
Go to the $PXF_BASE/servers directory and create an S3 server configuration directory named s3. Depending on your object storage, copy the required server configuration file from $PXF_HOME/templates to $PXF_BASE/servers/s3. The example uses the s3-site.xml file based on the minio-site.xml template.
$ mkdir $PXF_BASE/servers/s3 $ cd $PXF_BASE/servers/s3 $ cp $PXF_HOME/templates/s3-site.xml .In the configuration file, provide the relevant object store connection details:
<?xml version="1.0" encoding="UTF-8"?> <configuration> <property> <name>fs.s3a.endpoint</name> <value>storage.example.com</value> </property> <property> <name>fs.s3a.access.key</name> <value>${ACCESS_KEY}</value> </property> <property> <name>fs.s3a.secret.key</name> <value>${SECRET_KEY}</value> </property> <property> <name>fs.s3a.fast.upload</name> <value>true</value> </property> <property> <name>fs.s3a.path.style.access</name> <value>true</value> </property> </configuration> -
Synchronize the server configuration to the Greengage DB cluster hosts:
$ pxf cluster sync
Read plain text and CSV data
-
In the customers bucket on the S3 host, create the orders.txt file with the following content:
id,name,price 1,Laptop,999.99 2,Smartphone,499.99 3,Tablet,299.99 4,Monitor,599.99 5,Keyboard,99.99
-
On the Greengage DB master host, create a readable external table that references the orders.txt file. In the
LOCATIONclause, specify the PXFs3:textprofile and the server configuration. Use theSKIP_HEADER_COUNToption to indicate that the file contains a header row. In theFORMATclause, setTEXTas the data format and specify the comma character (,) as a delimiter:CREATE EXTERNAL TABLE orders_s3 ( id INTEGER, name VARCHAR, price NUMERIC ) LOCATION ('pxf://customers/orders.txt?PROFILE=s3:text&SERVER=s3&SKIP_HEADER_COUNT=1') FORMAT 'TEXT' (delimiter=E','); -
Query the created external table:
SELECT * FROM orders_s3;The output should look as follows:
id | name | price ----+------------+-------- 1 | Laptop | 999.99 2 | Smartphone | 499.99 3 | Tablet | 299.99 4 | Monitor | 599.99 5 | Keyboard | 99.99 (5 rows)
-
Create another readable external table that references the same orders.txt file. In the
LOCATIONclause, specify the PXFs3:csvprofile and the server configuration. Use theSKIP_HEADER_COUNToption to indicate that the file contains a header row. In theFORMATclause, setCSVas the data format:CREATE EXTERNAL TABLE orders_s3_csv ( id INTEGER, name VARCHAR, price NUMERIC ) LOCATION ('pxf://customers/orders.txt?PROFILE=s3:csv&SERVER=s3&SKIP_HEADER_COUNT=1') FORMAT 'CSV'; -
Query the created external table:
SELECT * FROM orders_s3_csv;The output should look as follows:
id | name | price ----+------------+-------- 1 | Laptop | 999.99 2 | Smartphone | 499.99 3 | Tablet | 299.99 4 | Monitor | 599.99 5 | Keyboard | 99.99 (5 rows)
Read multiline text data
-
In the customers bucket on the S3 host, create the customers_multiline.txt file with the following content:
1:John:Doe:john.doe@example.com:"123 Elm Street New York" 2:Jane:Smith:jane.smith@example.com:"456 Oak Street Chicago" 3:Bob:Brown:bob.brown@example.com:"789 Pine Street Los Angeles" 4:Rob:Stuart:rob.stuart@example.com:"119 Willow Street New Orleans"
-
On the Greengage DB master host, create a readable external table that references the customers_multiline.txt file. In the
LOCATIONclause, specify the PXFs3:text:multiprofile and the server configuration. In theFORMATclause, setCSVas the data format and specify the colon character (:) as a delimiter:CREATE EXTERNAL TABLE customers_s3_multiline ( id INTEGER, first_name TEXT, last_name TEXT, email TEXT, address TEXT ) LOCATION ('pxf://customers/customers_multiline.txt?PROFILE=s3:text:multi&SERVER=s3') FORMAT 'CSV' (delimiter=':'); -
Query the created external table:
SELECT * FROM customers_s3_multiline;The output should look as follows:
id | first_name | last_name | email | address ----+------------+-----------+------------------------+------------------- 1 | John | Doe | john.doe@example.com | 123 Elm Street + | | | | New York 2 | Jane | Smith | jane.smith@example.com | 456 Oak Street + | | | | Chicago 3 | Bob | Brown | bob.brown@example.com | 789 Pine Street + | | | | Los Angeles 4 | Rob | Stuart | rob.stuart@example.com | 119 Willow Street+ | | | | New Orleans (4 rows)
Read data with multibyte delimiters
-
In the customers bucket on the S3 host, create the customers_multibyte.txt file having the following content:
"John Doe"⟼"john.doe@example.com"⟼"Note: \"Urgent\"" "Jane Smith"⟼"jane.smith@example.com"⟼"" "Bob Brown"⟼"bob.brown@example.com"⟼"" "Rob Stuart"⟼"rob.stuart@example.com"⟼"Note: \"Delayed\""
-
On the Greengage DB master host, create a readable external table that references the customers_multibyte.txt file. In the
LOCATIONclause, specify the PXFs3:csvprofile and the server configuration. In theFORMATclause, specifypxfdelimited_import, which is the built-in custom formatter function for reading delimited text data. Specify the right arrow (⟼) as a delimiter character, a double quote (") as a quote character, and a backslash (\) as an escape character:CREATE EXTERNAL TABLE customers_s3_multibyte ( name TEXT, email TEXT, notes TEXT ) LOCATION ('pxf://customers/customers_multibyte.txt?PROFILE=s3:csv&SERVER=s3') FORMAT 'CUSTOM' (FORMATTER='pxfdelimited_import', DELIMITER=E'⟼', QUOTE='"', ESCAPE='\'); -
Query the created external table:
SELECT * FROM customers_s3_multibyte;The output should look as follows:
name | email | notes ------------+------------------------+----------------- John Doe | john.doe@example.com | Note: "Urgent" Jane Smith | jane.smith@example.com | Bob Brown | bob.brown@example.com | Rob Stuart | rob.stuart@example.com | Note: "Delayed" (4 rows)
Write text data
-
On the Greengage DB master host, create a writable external table that writes data to the customers bucket on the S3 host. In the
LOCATIONclause, specify the PXFs3:textprofile and the server configuration and set the compression codec togzip. In theFORMATclause, specifyTEXTas the data format and set the comma character (,) as a delimiter:CREATE WRITABLE EXTERNAL TABLE customers_w_s3 ( name TEXT, email TEXT, address TEXT ) LOCATION ('pxf://customers?PROFILE=s3:text&SERVER=s3&COMPRESSION_CODEC=gzip') FORMAT 'TEXT' (delimiter=E','); -
Populate the created table with sample data:
INSERT INTO customers_w_s3 (name, email, address) VALUES ('Alice Smith','alice.smith@example.com','42 Maple Avenue, Anytown'), ('Bob Johnson','bob.johnson@example.com','10 Oak Lane, Springfield'), ('Carol Williams','carol.williams@example.com','7 Pine Court, Hill Valley'), ('David Miller','david.miller@example.com','3 Birch Road, Pleasantville'); -
Verify the contents of the customers bucket. The set of created files should look similar to the following:
/customers/190-0000000021_1.gz /customers/190-0000000021_2.gz
-
On the Greengage DB master host, create a readable external table. In the
LOCATIONclause, use thecustomersprefix to reference all created files and specify the PXFs3:textprofile and the server configuration. In theFORMATclause, setTEXTas the data format and specify the comma character (,) as a delimiter:CREATE EXTERNAL TABLE customers_s3_v ( name TEXT, email TEXT, address TEXT ) LOCATION ('pxf://customers?PROFILE=s3:text&SERVER=s3') FORMAT 'TEXT' (delimiter=E','); -
Query the created external table:
SELECT * FROM customers_s3_v;The output should look as follows:
name | email | address ----------------+----------------------------+----------------------------- Bob Johnson | bob.johnson@example.com | 10 Oak Lane, Springfield David Miller | david.miller@example.com | 3 Birch Road, Pleasantville Alice Smith | alice.smith@example.com | 42 Maple Avenue, Anytown Carol Williams | carol.williams@example.com | 7 Pine Court, Hill Valley (4 rows)