
Use PXF HDFS connector to read and write text data between Greengage DB and HDFS

Anton Monakov

These examples demonstrate how to configure and use the PXF HDFS connector for reading and writing HDFS text data by using external tables.

Prerequisites

To try out the practical examples, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the customers test database and connect to it:

DROP DATABASE IF EXISTS customers;
CREATE DATABASE customers;
\c customers

To be able to create an external table using the PXF protocol, enable the PXF extension in the database as described in Register PXF in a database in PXF documentation:

CREATE EXTENSION pxf;
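
To confirm that the extension is registered in the current database, you can query the system catalog (psql's \dx meta-command shows the same information):

```sql
-- Returns one row if the pxf extension is enabled in this database
SELECT extname, extversion FROM pg_extension WHERE extname = 'pxf';
```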

Configure the PXF HDFS connector

To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:

  1. Log in to the Greengage DB master host as gpadmin.

  2. Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:

    $ mkdir $PXF_BASE/servers/hadoop
    $ cd $PXF_BASE/servers/hadoop
  3. Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:

    $ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .
  4. Synchronize the PXF server configuration to the Greengage DB cluster:

    $ pxf cluster sync

Create an external table using the PXF protocol

To create a Greengage DB external table that references a text file or directory on HDFS, use the following general syntax:

CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )

    LOCATION ('pxf://<path-to-hdfs>?PROFILE=<profile_name>[&<custom_option>[...]]')
    FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>')
    [DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
Keyword Description

<path-to-hdfs>

The path to the directory or file in the HDFS data store. When the <server_name> configuration includes a pxf.fs.basePath property setting, the <path-to-hdfs> value is considered relative to the specified base path; otherwise, the path is considered absolute. The value must not include the dollar sign ($) character.

PXF supports reading and writing text files compressed with the default, bzip2, and gzip codecs. For readable external tables, decompression is performed automatically. For writable external tables, you can specify the compression codec and type using the COMPRESSION_CODEC and COMPRESSION_TYPE custom options described below.

PROFILE=<profile_name>

Use the hdfs:text profile when reading delimited plain text, hdfs:csv when reading CSV data where each row is a single record, and hdfs:text:multi when reading multiline text data.

FORMAT <value>

The data format: TEXT when referencing plain text delimited data, or CSV when referencing comma-separated value data.

delimiter

The delimiter character in the data. For the CSV format, the default <delim_value> is a comma (,). You can use an E'' escape string constant, for example delimiter=E'\t'.

Note that the HEADER option, which commonly designates whether the data file contains a header row, is not supported for external tables that use PXF. If a text file includes header lines, use the SKIP_HEADER_COUNT custom option to specify the number of lines to skip at the beginning of each file.

DISTRIBUTED BY

When loading data from a Greengage DB table into a writable external table, consider specifying the same distribution policy or column name on both tables. This avoids extra data motion between segments during the load operation. Learn more about table distribution in Distribution.

<custom_option>

One of the custom options provided in the LOCATION string, as described below.

SERVER=<server_name>

The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used.

IGNORE_MISSING_PATH=<boolean>

The action to take when <path-to-hdfs> is missing or invalid. If set to false (the default), an error is returned. If set to true, PXF ignores missing-path errors and returns an empty fragment.

SKIP_HEADER_COUNT=<numlines>

The number of header lines to skip at the beginning of each HDFS file before reading the data. The default value is 0 (no lines are skipped).

COMPRESSION_CODEC

The compression codec to use when writing data: default, bzip2, gzip, or uncompressed (no compression). If not provided, or if uncompressed is provided, no data compression is performed.

COMPRESSION_TYPE

The compression type to use when writing data: record (RECORD, the default) or block (BLOCK).
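
To illustrate the pxf.fs.basePath behavior described above: assuming a server configuration named hadoop sets pxf.fs.basePath to /tmp in its pxf-site.xml (a hypothetical setup, not part of the examples below), the LOCATION path is resolved relative to that base path:

```sql
-- Assumes pxf.fs.basePath=/tmp in the hadoop server configuration;
-- this LOCATION then resolves to /tmp/pxf_examples/customers.txt.gz on HDFS
CREATE EXTERNAL TABLE customers_rel (
    id INTEGER,
    first_name TEXT,
    last_name TEXT,
    email TEXT,
    address TEXT
)
    LOCATION ('pxf://pxf_examples/customers.txt.gz?PROFILE=hdfs:text&SERVER=hadoop&SKIP_HEADER_COUNT=1')
    FORMAT 'TEXT' (delimiter=E',');
```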

Read regular text data from HDFS

  1. In the /tmp directory on the HDFS host, create a delimited plain text data file named customers.txt with the following content:

    id,first_name,last_name,email,address
    1,John,Doe,john.doe@example.com,123 Elm Street
    2,Jane,Smith,jane.smith@example.com,456 Oak Street
    3,Bob,Brown,bob.brown@example.com,789 Pine Street
    4,Rob,Stuart,rob.stuart@example.com,119 Willow Street
  2. Compress the created file:

    $ gzip /tmp/customers.txt
  3. Create the pxf_examples HDFS directory for storing PXF example data files and add the customers.txt.gz file to HDFS:

    $ hdfs dfs -mkdir -p /tmp/pxf_examples
    $ hdfs dfs -put /tmp/customers.txt.gz /tmp/pxf_examples/
  4. Optionally, review the contents of the customers.txt.gz file stored in HDFS:

    $ hdfs dfs -text /tmp/pxf_examples/customers.txt.gz
  5. On the Greengage DB master host, create a readable external table using the PXF protocol that references the customers.txt.gz file. In the LOCATION clause, specify the PXF hdfs:text profile and the server configuration, and use the SKIP_HEADER_COUNT option to skip the header row. In the FORMAT clause, set TEXT as the data format and the comma character (,) as the delimiter (delimiter):

    CREATE EXTERNAL TABLE customers_plain (
        id INTEGER,
        first_name VARCHAR(50),
        last_name VARCHAR(50),
        email VARCHAR(100),
        address VARCHAR(255)
    )
        LOCATION ('pxf://tmp/pxf_examples/customers.txt.gz?PROFILE=hdfs:text&SERVER=hadoop&SKIP_HEADER_COUNT=1')
        FORMAT 'TEXT' (delimiter=E',');
  6. Query the external table:

    SELECT * FROM customers_plain;

    The output should look as follows:

     id | first_name | last_name |         email          |      address
    ----+------------+-----------+------------------------+-------------------
      1 | John       | Doe       | john.doe@example.com   | 123 Elm Street
      2 | Jane       | Smith     | jane.smith@example.com | 456 Oak Street
      3 | Bob        | Brown     | bob.brown@example.com  | 789 Pine Street
      4 | Rob        | Stuart    | rob.stuart@example.com | 119 Willow Street
    (4 rows)
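Because customers.txt.gz is comma-delimited with one record per row, the same file could also be read with the hdfs:csv profile mentioned above. This is a sketch under the same assumptions as the previous example (the hadoop server configuration and the file uploaded to /tmp/pxf_examples); the CSV format uses a comma delimiter by default:

```sql
-- Alternative to customers_plain: the hdfs:csv profile with the CSV format
CREATE EXTERNAL TABLE customers_csv (
    id INTEGER,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(100),
    address VARCHAR(255)
)
    LOCATION ('pxf://tmp/pxf_examples/customers.txt.gz?PROFILE=hdfs:csv&SERVER=hadoop&SKIP_HEADER_COUNT=1')
    FORMAT 'CSV';
```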

Read multiline text data from HDFS

  1. In the /tmp directory on the HDFS host, create a delimited plain text data file named customers_multiline.txt with the following content. Notice the use of the colon (:) to separate the fields, and the quotes around the address field: this field includes an embedded line feed separating the street address from the city:

    id:first_name:last_name:email:address
    1:John:Doe:john.doe@example.com:"123 Elm Street
    New York"
    2:Jane:Smith:jane.smith@example.com:"456 Oak Street
    Chicago"
    3:Bob:Brown:bob.brown@example.com:"789 Pine Street
    Los Angeles"
    4:Rob:Stuart:rob.stuart@example.com:"119 Willow Street
    New Orleans"
  2. Copy the text file to HDFS:

    $ hdfs dfs -put /tmp/customers_multiline.txt /tmp/pxf_examples/
  3. On the Greengage DB master host, create a readable external table using the PXF protocol that references the customers_multiline.txt file. In the LOCATION clause, specify the PXF hdfs:text:multi profile and the server configuration, and use the SKIP_HEADER_COUNT option to skip the header row. In the FORMAT clause, set CSV as the data format and the colon character (:) as the delimiter (delimiter):

    CREATE EXTERNAL TABLE customers_multiline (
        id INTEGER,
        first_name TEXT,
        last_name TEXT,
        email TEXT,
        address TEXT
    )
        LOCATION ('pxf://tmp/pxf_examples/customers_multiline.txt?PROFILE=hdfs:text:multi&SERVER=hadoop&SKIP_HEADER_COUNT=1')
        FORMAT 'CSV' (delimiter ':');
  4. Query the created external table:

    SELECT * FROM customers_multiline;

    The output should look as follows:

     id | first_name | last_name |         email          |      address
    ----+------------+-----------+------------------------+-------------------
      1 | John       | Doe       | john.doe@example.com   | 123 Elm Street
        |            |           |                        | New York
      2 | Jane       | Smith     | jane.smith@example.com | 456 Oak Street
        |            |           |                        | Chicago
      3 | Bob        | Brown     | bob.brown@example.com  | 789 Pine Street
        |            |           |                        | Los Angeles
      4 | Rob        | Stuart    | rob.stuart@example.com | 119 Willow Street
        |            |           |                        | New Orleans
    (4 rows)

Write text data to HDFS

  1. On the Greengage DB master host, create a writable external table that writes data to the /tmp/pxf_examples/customers HDFS directory. In the LOCATION clause, specify the PXF hdfs:text profile and the server configuration and set the compression codec to gzip. In the FORMAT clause, specify TEXT as the data format and set the comma character (,) as a delimiter (delimiter):

    CREATE WRITABLE EXTERNAL TABLE customers_w (
        id INTEGER,
        first_name TEXT,
        last_name TEXT,
        email TEXT,
        address TEXT
    )
        LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:text&SERVER=hadoop&COMPRESSION_CODEC=gzip')
        FORMAT 'TEXT' (delimiter ',');
  2. Insert the contents of the customers_plain readable external table created earlier into the customers_w table:

    INSERT INTO customers_w SELECT * FROM customers_plain;
  3. Manually insert several values into the customers_w table:

    INSERT INTO customers_w (
         id,
         first_name,
         last_name,
         email,
         address
     ) 
    VALUES (5,'Alice','Johnson','alice.johnson@example.com','10 Oak Avenue'),
           (6,'Charlie','Williams','charlie.williams@example.com','42 Maple Drive');
  4. On the HDFS host, view the contents of the files created in the customers directory:

    $ hdfs dfs -text /tmp/pxf_examples/customers/*

    The output should look as follows:

    1,John,Doe,john.doe@example.com,123 Elm Street
    2,Jane,Smith,jane.smith@example.com,456 Oak Street
    3,Bob,Brown,bob.brown@example.com,789 Pine Street
    4,Rob,Stuart,rob.stuart@example.com,119 Willow Street
    5,Alice,Johnson,alice.johnson@example.com,10 Oak Avenue
    6,Charlie,Williams,charlie.williams@example.com,42 Maple Drive
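Note that a writable external table cannot be queried directly. To read the written files back from within the database, you can create a readable external table over the same HDFS directory; this is a sketch assuming the same hadoop server configuration (PXF decompresses the gzip files automatically on read):

```sql
-- Readable counterpart of customers_w over the same HDFS directory
CREATE EXTERNAL TABLE customers_r (
    id INTEGER,
    first_name TEXT,
    last_name TEXT,
    email TEXT,
    address TEXT
)
    LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:text&SERVER=hadoop')
    FORMAT 'TEXT' (delimiter ',');

SELECT * FROM customers_r ORDER BY id;
```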