
Use PXF HDFS connector to read and write JSON data from HDFS to Greengage DB

Anton Monakov

JSON is a text-based data-interchange format. JSON data is a collection of objects, each of which is an unordered set of name/value pairs. A value can be a string, a number, a boolean, null, a nested object, or an array.

JSON data is typically stored in a file with a .json or .jsonl (JSON Lines) suffix. For detailed information on JSON syntax, see Introducing JSON. For detailed information on JSON Lines syntax, see JSON Lines.
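To make the difference between the two layouts concrete, the following snippet (plain Python, independent of PXF) serializes the same records both ways:

```python
import json

records = [{"id": 1, "color": "red"}, {"id": 2, "color": "yellow"}]

# JSON Lines (.jsonl): one complete JSON object per line
jsonl_text = "\n".join(json.dumps(r) for r in records)

# Plain JSON (.json): a single document that may span many lines
json_text = json.dumps({"records": records}, indent=2)

print(jsonl_text)
print(json_text)
```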

This topic describes how to configure and use the PXF HDFS connector for reading and writing JSON data stored in HDFS by using external tables and provides practical examples.

TIP

For detailed information on how to work with the JSON data type directly in Greengage DB, see Work with JSON data.

JSON data access modes

PXF supports two JSON data access modes.

The single-object-per-row mode (default) expects one full JSON record per row, where each row represents a database tuple. In this mode, JSON files read by PXF may have any or no suffix. When writing, PXF creates the file with a .jsonl suffix.

A single-object-per-row JSON data file looks similar to the following:

{"id":1,"color":"red"}
{"id":2,"color":"yellow"}
{"id":3,"color":"green"}

The single-object-per-file mode operates on one JSON object per file, and the JSON records may span multiple lines. A JSON file can also contain a single named root-level JSON object whose value is an array of JSON objects. When reading, the array may contain objects with arbitrary complexity and nesting, and PXF forms database tuples from objects that have a property named the same as the specified IDENTIFIER value (described below). When writing, each JSON object in the array represents a database tuple. JSON files of this type have the .json suffix.

IMPORTANT

Ensure that there are no blank lines in your JSON files.

A sample single-object-per-file JSON data file looks as follows. The root-level records object is an array of three objects (tuples):

{
  "records": [
    {
      "id": 1,
      "color": "red"
    },
    {
      "id": 2,
      "color": "yellow"
    },
    {
      "id": 3,
      "color": "green"
    }
  ]
}

When writing JSON data, PXF creates a .jsonl file by default. When the ROOT attribute (described below) is specified, PXF creates a .json file.

Data type mapping

To represent JSON data in Greengage DB, map data values that use a primitive data type to Greengage DB columns of the same type. JSON also supports complex data types: objects and arrays.

Read mapping

PXF uses the following data type mapping when reading JSON data. You can use N-level projection to map members of nested objects and arrays to primitive data types.

JSON data type PXF / Greengage DB data type

boolean

BOOLEAN

number

One of the following: BIGINT, FLOAT8, INTEGER, NUMERIC, REAL, SMALLINT

string

TEXT

string (base64-encoded value)

BYTEA

string (date, time, timestamp, timestamptz in a text format recognized by Greengage DB)

One of the following: DATE, TIME, TIMESTAMP, TIMESTAMPTZ. PXF returns an error if Greengage DB cannot convert the date or time string to the target type

Array (one dimension) of type boolean[]

BOOLEAN[]

Array (one dimension) of type number[]

One of the following: BIGINT[], FLOAT8[], INTEGER[], NUMERIC[], REAL[], SMALLINT[]

Array (one dimension) of type string[] (base64-encoded value)

BYTEA[]

Array (one dimension) of type string[] (date, time, timestamp, timestamptz in a text format recognized by Greengage DB)

DATE[], TIME[], TIMESTAMP[], TIMESTAMPTZ[]

Array (one dimension) of type string[]

TEXT[]

Array of other types

TEXT[]

Object

Use the dot (.) notation to specify each level of projection (nesting) to a member of a primitive or Array type
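Per the mapping above, a JSON string destined for a BYTEA column carries its binary payload as a base64-encoded string. The sketch below illustrates that convention from the client side; it is a plain-Python illustration, not part of PXF:

```python
import base64
import json

# Binary payload to be stored in a BYTEA column
payload = b"\x00\x01\x02binary-data"

# In the JSON file, the bytes appear as a base64-encoded string field
record = {"id": 1, "blob": base64.b64encode(payload).decode("ascii")}
line = json.dumps(record)

# Decoding the string field recovers the original bytes
recovered = base64.b64decode(json.loads(line)["blob"])
print(recovered == payload)
```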

Write mapping

PXF supports writing primitive types and single-dimension arrays of primitive types. Other complex types are written to JSON as strings. PXF uses the following data type mapping when writing JSON data.

PXF / Greengage DB data type JSON data type

BIGINT, FLOAT8, INTEGER, NUMERIC, REAL, SMALLINT

number

BOOLEAN

boolean

BPCHAR, TEXT, VARCHAR

string

BYTEA

string (base64-encoded value)

DATE, TIME, TIMESTAMP, TIMESTAMPTZ

string

BOOLEAN[]

boolean[]

BIGINT[], FLOAT8[], INT[], NUMERIC[], REAL[], SMALLINT[]

number[]

BYTEA[]

string[] (base64-encoded value)

DATE[], TIME[], TIMESTAMP[], TIMESTAMPTZ[]

string[]

Use column projection

Consider the following example JSON data file, where user is an object composed of fields named id and location:

{
  "created_at": "Mon Sep 30 04:04:53 +0000 2013",
  "id_str": "384529256681725952",
  "user": {
    "id": 31424214,
    "location": "COLUMBUS"
  },
  "coordinates": {
    "type": "Point",
    "values": [
      13,
      99
    ]
  }
}

To specify the nested fields in the user object as the Greengage DB external table columns, use the dot (.) projection:

user.id
user.location

coordinates is an object composed of a text field named type and an array of integers named values. To read all the elements of the values array in a single column, define the corresponding Greengage DB external table column as type INT[]:

"coordinates.values" INT[]

PXF also supports accessing specific elements of a JSON array using the [n] syntax in the table definition. Note that when you use this method to identify specific array elements, PXF provides only those values to Greengage DB, not the whole array:

"coordinates.values[0]" INT

When writing JSON data, PXF supports only scalar values or one-dimensional arrays of Greengage DB data types and does not support column projection.
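The dot and [n] projection rules above can be mimicked locally with a small helper. The `project` function below is an illustrative sketch of the resolution logic, not PXF's actual parser:

```python
import json

def project(obj, path):
    """Resolve a PXF-style projection such as 'user.id' or
    'coordinates.values[0]' against a parsed JSON object.
    (Illustrative helper only -- not part of PXF.)"""
    for part in path.split("."):
        if part.endswith("]"):                    # array element, e.g. values[0]
            name, idx = part[:-1].split("[")
            obj = obj[name][int(idx)]
        else:                                     # nested object member
            obj = obj[part]
    return obj

doc = json.loads("""
{"user": {"id": 31424214, "location": "COLUMBUS"},
 "coordinates": {"type": "Point", "values": [13, 99]}}
""")

print(project(doc, "user.location"))          # COLUMBUS
print(project(doc, "coordinates.values[0]"))  # 13
```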

Create an external table using the PXF protocol

To create a Greengage DB external table to read or write JSON data in HDFS, use the following general syntax.

CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    LOCATION ('pxf://<path-to-hdfs>?PROFILE=hdfs:json[&<custom-option>=<value>[...]]')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export')
    [DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];

Keyword Value

<path-to-hdfs>

The path to the directory or file in the HDFS data store. When the <server_name> configuration includes a pxf.fs.basePath property setting, the <path-to-hdfs> value is considered to be relative to the base path specified. Otherwise, the path is considered absolute. The value must not include the dollar sign ($) character

PROFILE=hdfs:json

The hdfs:json profile is used to read or write JSON data in HDFS

FORMAT 'CUSTOM'

The custom format with the built-in custom formatter functions for read (pxfwritable_import) and write (pxfwritable_export) operations is used to work with JSON data in HDFS

DISTRIBUTED BY

When loading data from a Greengage DB table into a writable external table, consider specifying the same distribution policy or column name on both tables. This will avoid extra motion of data between segments on the load operation. Learn more about table distribution in Distribution

<custom-option>

One of the custom options provided in the LOCATION string as described below

SERVER=<server_name>

The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used

IDENTIFIER=<value>

Specified only when accessing JSON data consisting of multiline records. <value> identifies the name of the field whose parent JSON object is to be returned as an individual tuple.

When a nested object also includes a field with the same name as the one specified as IDENTIFIER, PXF might return incorrect results. You can work around this edge case by compressing the JSON file and having PXF read the compressed file

SPLIT_BY_FILE=<boolean>

Defines how to split the data specified in <path-to-hdfs>. The default value is false: PXF creates multiple splits for each file and processes them in parallel. When set to true, PXF creates and processes a single split per file

IGNORE_MISSING_PATH=<boolean>

The action to take when <path-to-hdfs> is missing or invalid. If set to false (default), an error is returned. If set to true, PXF ignores missing path errors and returns an empty fragment

ROOT=<value>

When writing to a single JSON object, identifies the name of the root-level object attribute

COMPRESSION_CODEC=<codec>

The compression codec to use when writing data: default, bzip2, gzip, or uncompressed. If not provided (or uncompressed is provided), no data compression is performed.

If a compression codec is specified, the following naming convention applies to written files: <basename>.<json_file_type>.<compression_extension>, for example customers.jsonl.gz

Examples

These examples demonstrate how to configure and use the PXF HDFS connector for reading and writing JSON data in HDFS by using external tables.

Prerequisites

To try out the practical examples, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the customers test database and connect to it:

DROP DATABASE IF EXISTS customers;
CREATE DATABASE customers;
\c customers

To be able to create an external table using the PXF protocol, enable the PXF extension in the database as described in Register PXF in a database in PXF documentation:

CREATE EXTENSION pxf;

Configure the PXF HDFS connector

To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:

  1. Log in to the Greengage DB master host as gpadmin.

  2. Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:

    $ mkdir $PXF_BASE/servers/hadoop
    $ cd $PXF_BASE/servers/hadoop
  3. Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:

    $ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .
  4. Synchronize the PXF server configuration to the Greengage DB cluster:

    $ pxf cluster sync

Read a JSON file with single-line records

  1. In the /tmp directory on the HDFS host, create a JSON Lines file named customers.jsonl having the following content:

    {"customer": {"id": 1,"name": "John Doe"},"ordered_items":["laptop", "monitor"]}
    {"customer": {"id": 2,"name": "Jane Smith"},"ordered_items":["keyboard", "mouse", "pad"]}
    {"customer": {"id": 3,"name": "Bob Brown"},"ordered_items":["headphones"]}
    {"customer": {"id": 4,"name": "Alice Green"},"ordered_items":["webcam", "microphone"]}
  2. Create the /tmp/pxf_examples HDFS directory for storing PXF example data files and add the customers.jsonl file to HDFS:

    $ hdfs dfs -mkdir -p /tmp/pxf_examples
    $ hdfs dfs -put /tmp/customers.jsonl /tmp/pxf_examples/
  3. On the Greengage DB master host, create an external table that references the customers.jsonl file. When defining columns, use the dot notation (.) to access the nested fields in the customer object. Map the ordered_items text array to a Greengage DB text-type column (TEXT). In the LOCATION clause, specify the PXF hdfs:json profile and the server configuration. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:

    CREATE EXTERNAL TABLE customers_r_singleline_1 (
            "customer.id" INT,
            "customer.name" TEXT,
            "ordered_items" TEXT
        )
        LOCATION ('pxf://tmp/pxf_examples/customers.jsonl?PROFILE=hdfs:json&SERVER=hadoop')
        FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
  4. Query the created external table:

    SELECT * FROM customers_r_singleline_1;

    The output should look as follows:

     customer.id | customer.name |       ordered_items
    -------------+---------------+----------------------------
               1 | John Doe      | ["laptop","monitor"]
               2 | Jane Smith    | ["keyboard","mouse","pad"]
               3 | Bob Brown     | ["headphones"]
               4 | Alice Green   | ["webcam","microphone"]
    (4 rows)

    The ordered_items JSON array is retrieved into a single text-type column. To convert it into a native Greengage DB array type, you can run the following query that uses the json_array_elements_text() function:

    SELECT "customer.id",
           "customer.name",
           ARRAY(SELECT json_array_elements_text(ordered_items::json))::text[] AS items
    FROM customers_r_singleline_1;

    The output should look as follows:

     customer.id | customer.name |        items
    -------------+---------------+----------------------
               1 | John Doe      | {laptop,monitor}
               2 | Jane Smith    | {keyboard,mouse,pad}
               3 | Bob Brown     | {headphones}
               4 | Alice Green   | {webcam,microphone}
    (4 rows)
  5. On the Greengage DB master host, create another external table that references the customers.jsonl file. Map the ordered_items text array to a Greengage DB text array column (TEXT[]):

    CREATE EXTERNAL TABLE customers_r_singleline_2 (
            "customer.id" INT,
            "customer.name" TEXT,
            "ordered_items" TEXT[]
        )
        LOCATION ('pxf://tmp/pxf_examples/customers.jsonl?PROFILE=hdfs:json&SERVER=hadoop')
        FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
  6. Query the created external table using the array subscript number in square brackets to retrieve the first item of the ordered_items array:

    SELECT "customer.id",
           "customer.name",
           "ordered_items"[1]
    FROM customers_r_singleline_2;

    The output should look as follows:

     customer.id | customer.name | ordered_items
    -------------+---------------+---------------
               1 | John Doe      | laptop
               2 | Jane Smith    | keyboard
               3 | Bob Brown     | headphones
               4 | Alice Green   | webcam
    (4 rows)
  7. On the Greengage DB master host, create another external table that references the customers.jsonl file. Use the array subscript number in square brackets to retrieve the second item of the ordered_items array and map it to a Greengage DB text-type column:

    CREATE EXTERNAL TABLE customers_r_singleline_3 (
            "customer.id" INT,
            "customer.name" TEXT,
            "ordered_items[1]" TEXT
        )
        LOCATION ('pxf://tmp/pxf_examples/customers.jsonl?PROFILE=hdfs:json&SERVER=hadoop')
        FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
  8. Query the created external table:

    SELECT * FROM customers_r_singleline_3;

    The output should look as follows:

     customer.id | customer.name | ordered_items[1]
    -------------+---------------+------------------
               1 | John Doe      | monitor
               2 | Jane Smith    | mouse
               3 | Bob Brown     |
               4 | Alice Green   | microphone
    (4 rows)
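Since PXF rejects files with blank or malformed lines, it can save a round trip to check a JSON Lines file locally before putting it into HDFS. A minimal check, shown here against inline sample lines mirroring customers.jsonl (plain Python, outside of PXF):

```python
import json

# Sample lines mirroring the customers.jsonl file from step 1
sample = '\n'.join([
    '{"customer": {"id": 1,"name": "John Doe"},"ordered_items":["laptop", "monitor"]}',
    '{"customer": {"id": 2,"name": "Jane Smith"},"ordered_items":["keyboard", "mouse", "pad"]}',
])

for n, line in enumerate(sample.splitlines(), start=1):
    if not line.strip():
        raise ValueError(f"blank line {n}: PXF requires no blank lines")
    json.loads(line)  # raises ValueError on malformed JSON

print("all lines valid")
```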

Read a JSON file with multiline records

  1. In the /tmp directory on the HDFS host, create a JSON file named customers.json having the following content:

    {
      "customers": [
        {
          "id": 101,
          "name": "Alice Smith",
          "ordered_items": [
            "laptop",
            "monitor"
          ]
        },
        {
          "id": 102,
          "name": "Bob Johnson",
          "ordered_items": [
            "keyboard",
            "mouse",
            "pad"
          ]
        },
        {
          "id": 103,
          "name": "Charlie Brown",
          "ordered_items": [
            "headphones"
          ]
        }
      ]
    }
  2. If not created yet, create the /tmp/pxf_examples HDFS directory for storing PXF example data files and add the customers.json file to HDFS:

    $ hdfs dfs -mkdir -p /tmp/pxf_examples
    $ hdfs dfs -put /tmp/customers.json /tmp/pxf_examples/
  3. On the Greengage DB master host, create an external table that references the customers.json file. Map the ordered_items text array to a Greengage DB text array column (TEXT[]). In the LOCATION clause, specify the PXF hdfs:json profile, the server configuration, and id as the IDENTIFIER value. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:

    CREATE EXTERNAL TABLE customers_r_multiline (
            id INT,
            name TEXT,
            ordered_items TEXT[]
        )
        LOCATION ('pxf://tmp/pxf_examples/customers.json?PROFILE=hdfs:json&SERVER=hadoop&IDENTIFIER=id')
        FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
  4. Query the created external table:

    SELECT * FROM customers_r_multiline;

    The output should look as follows:

     id  |     name      |    ordered_items
    -----+---------------+----------------------
     101 | Alice Smith   | {laptop,monitor}
     102 | Bob Johnson   | {keyboard,mouse,pad}
     103 | Charlie Brown | {headphones}
    (3 rows)
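Conceptually, IDENTIFIER=id tells PXF to form one tuple per JSON object that contains a property named id. The sketch below mimics that selection on a parsed document; it is a simplified local illustration, not PXF's actual streaming parser:

```python
import json

doc = json.loads("""
{"customers": [
  {"id": 101, "name": "Alice Smith", "ordered_items": ["laptop", "monitor"]},
  {"id": 102, "name": "Bob Johnson", "ordered_items": ["keyboard", "mouse", "pad"]}
]}
""")

def tuples_with(doc, identifier):
    """Collect every JSON object that has a property named `identifier`
    (illustrative stand-in for PXF's IDENTIFIER-based tuple selection)."""
    found = []
    def walk(node):
        if isinstance(node, dict):
            if identifier in node:
                found.append(node)
            for v in node.values():
                walk(v)
        elif isinstance(node, list):
            for v in node:
                walk(v)
    walk(doc)
    return found

rows = tuples_with(doc, "id")
print(len(rows))  # 2
```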

Write a JSON file with single-line records

  1. On the Greengage DB master host, create a writable external table that stores data in the /tmp/pxf_examples/customers HDFS directory. In the LOCATION clause, specify the PXF hdfs:json profile and the server configuration. In the FORMAT clause, specify pxfwritable_export, which is the built-in custom formatter function for write operations:

    CREATE WRITABLE EXTERNAL TABLE customers_w_singleline (
            id INT,
            name TEXT,
            ordered_items TEXT[]
        )
        LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:json&SERVER=hadoop')
        FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
  2. Insert some data into the customers_w_singleline table:

    INSERT INTO customers_w_singleline 
    VALUES (1, 'John Doe', ARRAY['laptop', 'monitor']),
           (2, 'Jane Smith', ARRAY['keyboard', 'mouse', 'pad']),
           (3, 'Bob Brown', ARRAY['headphones']),
           (4, 'Alice Green', ARRAY['webcam', 'microphone']);
  3. View the contents of the files created in HDFS:

    $ hdfs dfs -cat /tmp/pxf_examples/customers/*

    The output should look similar to the following:

    {"id":3,"name":"Bob Brown","ordered_items":["headphones"]}
    {"id":4,"name":"Alice Green","ordered_items":["webcam","microphone"]}
    {"id":1,"name":"John Doe","ordered_items":["laptop","monitor"]}
    {"id":2,"name":"Jane Smith","ordered_items":["keyboard","mouse","pad"]}
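The rows can appear in any order because each Greengage DB segment writes its own file in parallel. A quick order-independent check that the written lines round-trip (plain Python, applied here to the sample output above):

```python
import json

# Sample output as shown above; line order varies between runs
output = '''{"id":3,"name":"Bob Brown","ordered_items":["headphones"]}
{"id":4,"name":"Alice Green","ordered_items":["webcam","microphone"]}
{"id":1,"name":"John Doe","ordered_items":["laptop","monitor"]}
{"id":2,"name":"Jane Smith","ordered_items":["keyboard","mouse","pad"]}'''

# Parse every line and sort by id to compare regardless of write order
rows = sorted((json.loads(l) for l in output.splitlines()), key=lambda r: r["id"])
print([r["name"] for r in rows])
```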

Write a JSON file with multiline records

  1. On the Greengage DB master host, create a writable external table that stores data in the /tmp/pxf_examples/customers HDFS directory. If you completed the previous example, remove its output files from this directory first so the results do not mix. In the LOCATION clause, specify the PXF hdfs:json profile, the server configuration, and customers as the ROOT value. In the FORMAT clause, specify pxfwritable_export, which is the built-in custom formatter function for write operations:

    CREATE WRITABLE EXTERNAL TABLE customers_w_multiline (
            id INT,
            name TEXT,
            ordered_items TEXT[]
        )
        LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:json&SERVER=hadoop&ROOT=customers')
        FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
  2. Insert some data into the customers_w_multiline table:

    INSERT INTO customers_w_multiline 
    VALUES (1, 'John Doe', ARRAY['laptop', 'monitor']),
           (2, 'Jane Smith', ARRAY['keyboard', 'mouse', 'pad']),
           (3, 'Bob Brown', ARRAY['headphones']),
           (4, 'Alice Green', ARRAY['webcam', 'microphone']);
  3. View the contents of the files created in HDFS:

    $ hdfs dfs -cat /tmp/pxf_examples/customers/*

    The output should look similar to the following. Each Greengage DB segment writes its own .json file, so the concatenated output contains a separate customers root object per file:

    {"customers":[
    {"id":3,"name":"Bob Brown","ordered_items":["headphones"]}
    ]}
    {"customers":[
    {"id":1,"name":"John Doe","ordered_items":["laptop","monitor"]}
    ]}
    {"customers":[
    {"id":2,"name":"Jane Smith","ordered_items":["keyboard","mouse","pad"]}
    ,{"id":4,"name":"Alice Green","ordered_items":["webcam","microphone"]}
    ]}