Use PXF HDFS connector to read multiline text from HDFS into a single table row in Greengage DB

Anton Monakov

Contents

Create an external table using the PXF protocol
Examples

The PXF HDFS connector lets you read one or more single- and multiline text files in HDFS, loading each file into a single table row and column. This may be useful for reading multiple files into the same Greengage DB external table, for example when individual JSON files each contain a separate record. Only text and JSON files can be loaded in this manner, including files with embedded line feeds.

This topic describes how to configure and use the PXF HDFS connector for reading multiline text data stored in HDFS by using external tables and provides practical examples.

To create a Greengage DB external table to read text or JSON files on HDFS, use the following general syntax:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> TEXT|JSON | LIKE <other_table> )

    LOCATION ('pxf://<path_to_data>?PROFILE=hdfs:text:multi&FILE_AS_ROW=true[&<custom_option>=<value>[...]]')
    FORMAT 'CSV';

Keyword Value

Keyword	Value
<table_name>	The name of the table to create
<column_name> TEXT\|JSON	The column to read the data file into. The column type must be set to either `TEXT` or `JSON` depending on the data file type
LIKE <other_table>	Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy
<path_to_data>	The path to the directory or file in the HDFS data store. When the `<server_name>` configuration includes a `pxf.fs.basePath` property setting, the `<path_to_data>` value is considered to be relative to the base path specified. Otherwise, the path is considered absolute. The value must not include the dollar sign (`$`) character. When reading multiple files, ensure that all of them are of the same type (text or JSON). When reading multiple JSON files, ensure that each file is a complete record and contains the same record type
PROFILE=hdfs:text:multi	The `hdfs:text:multi` profile is used to read multiline text data in HDFS
FILE_AS_ROW=true	The required option that instructs PXF to read each file into a single table row. When this option is specified, no additional custom or formatting options are supported
FORMAT	To read multiline text data in HDFS, `CSV` must be specified
<custom_option>	One of the custom options provided in the `LOCATION` string as described below
SERVER=<server_name>	The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used
IGNORE_MISSING_PATH	The action to take when `<path_to_data>` is missing or invalid. If set to `false` (default), an error is returned. If set to `true`, PXF ignores missing path errors and returns an empty fragment

<table_name>

The name of the table to create

<column_name> TEXT|JSON

The column to read the data file into. The column type must be set to either TEXT or JSON depending on the data file type

LIKE <other_table>

Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy

<path_to_data>

The path to the directory or file in the HDFS data store. When the <server_name> configuration includes a pxf.fs.basePath property setting, the <path_to_data> value is considered to be relative to the base path specified. Otherwise, the path is considered absolute. The value must not include the dollar sign ($) character.

When reading multiple files, ensure that all of them are of the same type (text or JSON). When reading multiple JSON files, ensure that each file is a complete record and contains the same record type

PROFILE=hdfs:text:multi

The hdfs:text:multi profile is used to read multiline text data in HDFS

FILE_AS_ROW=true

The required option that instructs PXF to read each file into a single table row. When this option is specified, no additional custom or formatting options are supported

FORMAT

To read multiline text data in HDFS, CSV must be specified

<custom_option>

One of the custom options provided in the LOCATION string as described below

SERVER=<server_name>

The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used

IGNORE_MISSING_PATH

The action to take when <path_to_data> is missing or invalid. If set to false (default), an error is returned. If set to true, PXF ignores missing path errors and returns an empty fragment

These examples demonstrate how to configure and use the PXF HDFS connector for reading multiline text and JSON data in HDFS by using external tables.

Prerequisites

To try out the practical examples, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the customers test database and connect to it:

DROP DATABASE IF EXISTS customers;
CREATE DATABASE customers;
\c customers

To be able to create an external table using the PXF protocol, enable the PXF extension in the database as described in Register PXF in a database in PXF documentation:

CREATE EXTENSION pxf;

To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:

Log in to the Greengage DB master host as gpadmin.
Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:
```
$ mkdir $PXF_BASE/servers/hadoop
$ cd $PXF_BASE/servers/hadoop
```

Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:

$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
$ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
$ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
$ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .

Synchronize the server configuration to the Greengage DB cluster hosts:
```
$ pxf cluster sync
```

Read multiline text files

In the /tmp directory on the HDFS host, create three plain text data files named customers_1.txt, customers_2.txt, and customers_3.txt having the following content:

1,John,Doe,john.doe@example.com,123 Elm Street

2,Jane,Smith,jane.smith@example.com,456 Oak Street
3,Bob,Brown,bob.brown@example.com,789 Pine Street
4,Rob,Stuart,rob.stuart@example.com,119 Willow Street

5,Alice,Johnson,alice.johnson@example.com,"101 Maple Avenue
Unit 2B
Second Floor"
6,Charlie,Davis,charlie.davis@example.com,"123 Elm Road
Suite 300
Business Park"
7,Eve,Wilson,eve.wilson@example.com,"PO Box 555
Anytown, USA
Zip Code: 12345"

Create the /tmp/pxf_examples HDFS directory for storing PXF example data files and add the created files to HDFS:

$ hdfs dfs -mkdir -p /tmp/pxf_examples
$ hdfs dfs -put /tmp/customers_1.txt /tmp/customers_2.txt /tmp/customers_3.txt /tmp/pxf_examples/

On the Greengage DB master host, create a readable external table that references the text files. The table has the single customers_data column of type TEXT. In the LOCATION clause, specify the PXF hdfs:text:multi profile and the server configuration. Use the FILE_AS_ROW=true option to enable loading files into single table rows. In the FORMAT clause, set CSV as the data format:
```
CREATE EXTERNAL TABLE customers_text (
    customers_data TEXT
    )
    LOCATION ('pxf://tmp/pxf_examples/?PROFILE=hdfs:text:multi&SERVER=hadoop&FILE_AS_ROW=true')
    FORMAT 'CSV';
```

Query the created external table:

SELECT * FROM customers_text;

The output should look as follows:

                       customers_data
-------------------------------------------------------------
 1,John,Doe,john.doe@example.com,123 Elm Street
 5,Alice,Johnson,alice.johnson@example.com,"101 Maple Avenue+
 Unit 2B                                                    +
 Second Floor"                                              +
 6,Charlie,Davis,charlie.davis@example.com,"123 Elm Road    +
 Suite 300                                                  +
 Business Park"                                             +
 7,Eve,Wilson,eve.wilson@example.com,"PO Box 555            +
 Anytown, USA                                               +
 Zip Code: 12345"
 2,Jane,Smith,jane.smith@example.com,456 Oak Street         +
 3,Bob,Brown,bob.brown@example.com,789 Pine Street          +
 4,Rob,Stuart,rob.stuart@example.com,119 Willow Street
(3 rows)

Enable psql expanded display mode and query the customers_text table again:

$ psql -d customers -x -c 'SELECT * FROM customers_text;'

The output should look as follows:

-[ RECORD 1 ]--+------------------------------------------------------------
customers_data | 1,John,Doe,john.doe@example.com,123 Elm Street
-[ RECORD 2 ]--+------------------------------------------------------------
customers_data | 2,Jane,Smith,jane.smith@example.com,456 Oak Street
               | 3,Bob,Brown,bob.brown@example.com,789 Pine Street
               | 4,Rob,Stuart,rob.stuart@example.com,119 Willow Street
-[ RECORD 3 ]--+------------------------------------------------------------
customers_data | 5,Alice,Johnson,alice.johnson@example.com,"101 Maple Avenue
               | Unit 2B
               | Second Floor"
               | 6,Charlie,Davis,charlie.davis@example.com,"123 Elm Road
               | Suite 300
               | Business Park"
               | 7,Eve,Wilson,eve.wilson@example.com,"PO Box 555
               | Anytown, USA
               | Zip Code: 12345"

In the /tmp directory on the HDFS host, create two JSON data files named customers_1.json and customers_2.json having the following content:

{
  "customers": [
    {
      "id": 101,
      "name": "Alice Smith",
      "ordered_items": [
        "laptop",
        "monitor"
      ]
    },
    {
      "id": 102,
      "name": "Bob Johnson",
      "ordered_items": [
        "keyboard",
        "mouse",
        "pad"
      ]
    }
  ]
}

{
  "customers": [
    {
      "id": 103,
      "name": "Charlie Brown",
      "ordered_items": [
        "headphones"
      ]
    }
  ]
}

If not created yet, create the /tmp/pxf_examples HDFS directory for storing PXF example data files and add the JSON files to HDFS:
```
$ hdfs dfs -mkdir -p /tmp/pxf_examples
$ hdfs dfs -put /tmp/customers_1.json /tmp/customers_2.json /tmp/pxf_examples/
```
On the Greengage DB master host, create a readable external table using the PXF protocol that references the JSON files. The table has the single customers_data column of type JSON. In the LOCATION clause, specify the PXF hdfs:text:multi profile and the server configuration. Use the FILE_AS_ROW=true option to enable loading files into single table rows. In the FORMAT clause, set CSV as the data format:
```
CREATE EXTERNAL TABLE customers_json (
    customers_data JSON
    )
    LOCATION ('pxf://tmp/pxf_examples/?PROFILE=hdfs:text:multi&SERVER=hadoop&FILE_AS_ROW=true')
    FORMAT 'CSV';
```

Query the created external table:

SELECT * FROM customers_json;

The output should look as follows:

         customers_data
--------------------------------
 {                             +
   "customers": [              +
     {                         +
       "id": 103,              +
       "name": "Charlie Brown",+
       "ordered_items": [      +
         "headphones"          +
       ]                       +
     }                         +
   ]                           +
 }
 {                             +
   "customers": [              +
     {                         +
       "id": 101,              +
       "name": "Alice Smith",  +
       "ordered_items": [      +
         "laptop",             +
         "monitor"             +
       ]                       +
     },                        +
     {                         +
       "id": 102,              +
       "name": "Bob Johnson",  +
       "ordered_items": [      +
         "keyboard",           +
         "mouse",              +
         "pad"                 +
       ]                       +
     }                         +
   ]                           +
 }
(2 rows)

Enable psql expanded display mode and query the customers_json table again:

$ psql -d customers -x -c 'SELECT * FROM customers_json;'

The output should look as follows:

-[ RECORD 1 ]--+-------------------------------
customers_data | {
               |   "customers": [
               |     {
               |       "id": 103,
               |       "name": "Charlie Brown",
               |       "ordered_items": [
               |         "headphones"
               |       ]
               |     }
               |   ]
               | }
-[ RECORD 2 ]--+-------------------------------
customers_data | {
               |   "customers": [
               |     {
               |       "id": 101,
               |       "name": "Alice Smith",
               |       "ordered_items": [
               |         "laptop",
               |         "monitor"
               |       ]
               |     },
               |     {
               |       "id": 102,
               |       "name": "Bob Johnson",
               |       "ordered_items": [
               |         "keyboard",
               |         "mouse",
               |         "pad"
               |       ]
               |     }
               |   ]
               | }

Use the json_array_elements() Greengage DB function to extract the elements of the customers node and view them as individual records:

SELECT json_array_elements(customers_data -> 'customers') AS customer FROM customers_json;

The output should look as follows:

-[ RECORD 1 ]----------------------------
customer | {
         |       "id": 101,
         |       "name": "Alice Smith",
         |       "ordered_items": [
         |         "laptop",
         |         "monitor"
         |       ]
         |     }
-[ RECORD 2 ]----------------------------
customer | {
         |       "id": 102,
         |       "name": "Bob Johnson",
         |       "ordered_items": [
         |         "keyboard",
         |         "mouse",
         |         "pad"
         |       ]
         |     }
-[ RECORD 3 ]----------------------------
customer | {
         |       "id": 103,
         |       "name": "Charlie Brown",
         |       "ordered_items": [
         |         "headphones"
         |       ]
         |     }

Use the -> operator (Get JSON array element) to retrieve the first item in the ordered_items array of each customer node:

SELECT json_array_elements(customers_data -> 'customers') -> 'ordered_items' -> 0 AS first_item FROM customers_json;

The output should look as follows:

-[ RECORD 1 ]------------
first_item | "headphones"
-[ RECORD 2 ]------------
first_item | "laptop"
-[ RECORD 3 ]------------
first_item | "keyboard"

SequenceFile

Fixed-width text

Use PXF HDFS connector to read multiline text from HDFS into a single table row in Greengage DB

Create an external table using the PXF protocol

Examples

Configure the PXF HDFS connector

Read multiline text files

Read JSON files