Use the PXF HDFS connector to read and write JSON data in HDFS from Greengage DB
JSON is a text-based data-interchange format. JSON data is a collection of objects, each of which is a set of unordered name/value pairs. A value can be a string, a number, a boolean, null, a nested object, or an array.
JSON data is typically stored in a file with a .json or .jsonl (JSON Lines) suffix. For detailed information on JSON syntax, see Introducing JSON. For detailed information on JSON Lines syntax, see JSON Lines.
This topic describes how to configure and use the PXF HDFS connector for reading and writing JSON data stored in HDFS by using external tables and provides practical examples.
For detailed information on how to work with the JSON data type directly in Greengage DB, see Work with JSON data.
JSON data access modes
PXF supports two JSON data access modes.
The single-object-per-row mode (default) expects one complete JSON record per line, where each record represents a database tuple. In this mode, JSON files read by PXF may have any suffix or none. When writing, PXF creates files with the .jsonl suffix.
A single-object-per-row JSON data file looks similar to the following:

```json
{"id":1,"color":"red"}
{"id":2,"color":"yellow"}
{"id":3,"color":"green"}
```
The single-object-per-file mode operates on one JSON object per file, and the JSON records may span multiple lines.
A JSON file can also contain a single named root-level JSON object whose value is an array of JSON objects.
When reading, the array may contain objects with arbitrary complexity and nesting, and PXF forms database tuples from objects that have a property named the same as the specified IDENTIFIER value (described below).
When writing, each JSON object in the array represents a database tuple.
JSON files of this type have the .json suffix.
Ensure that there are no blank lines in your JSON files.
A sample single-object-per-file JSON data file looks as follows. The root-level `records` object is an array of three objects (tuples):

```json
{
  "records": [
    {
      "id": 1,
      "color": "red"
    },
    {
      "id": 2,
      "color": "yellow"
    },
    {
      "id": 3,
      "color": "green"
    }
  ]
}
```
When writing JSON data, PXF creates a .jsonl file by default.
When the ROOT attribute (described below) is specified, PXF creates a .json file.
Data type mapping
To represent JSON data in Greengage DB, map data values that use a primitive data type to Greengage DB columns of the same type. JSON also supports complex data types: objects and arrays.
Read mapping
PXF uses the following data type mapping when reading JSON data. You can use N-level projection to map members of nested objects and arrays to primitive data types.
| JSON data type | PXF / Greengage DB data type |
|---|---|
| boolean | `BOOLEAN` |
| number | One of the following: `BIGINT`, `FLOAT8`, `INTEGER`, `NUMERIC`, `REAL`, `SMALLINT` |
| string | `TEXT` |
| string (base64-encoded value) | `BYTEA` |
| string (date, time, or timestamp in a text format that Greengage DB understands) | One of the following: `DATE`, `TIME`, `TIMESTAMP`, `TIMESTAMPTZ` |
| Array (one dimension) of type boolean | `BOOLEAN[]` |
| Array (one dimension) of type number | One of the following: `BIGINT[]`, `FLOAT8[]`, `INTEGER[]`, `NUMERIC[]`, `REAL[]`, `SMALLINT[]` |
| Array (one dimension) of type string (base64-encoded value) | `BYTEA[]` |
| Array (one dimension) of type string (date, time, or timestamp) | One of the following: `DATE[]`, `TIME[]`, `TIMESTAMP[]`, `TIMESTAMPTZ[]` |
| Array (one dimension) of type string | `TEXT[]` |
| Array of other types | `TEXT[]` |
| Object | Use the dot (`.`) notation to specify each level of projection (nesting) to a member of a primitive data type |
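As an illustration of the read mappings, a readable external table covering several of the rows above might be defined as follows. This is a sketch: the table name, column names, and the `/tmp/pxf_examples/sample.jsonl` path are hypothetical, not part of the examples later in this topic.

```sql
-- Hypothetical readable external table exercising several read mappings:
-- JSON boolean -> BOOLEAN, number -> NUMERIC, string -> TEXT,
-- one-dimensional number array -> INT[].
CREATE EXTERNAL TABLE sample_types_r
(
    is_active BOOLEAN,  -- JSON boolean
    price     NUMERIC,  -- JSON number
    label     TEXT,     -- JSON string
    scores    INT[]     -- JSON array (one dimension) of numbers
)
LOCATION ('pxf://tmp/pxf_examples/sample.jsonl?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```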
Write mapping
PXF supports writing primitive types and single-dimension arrays of primitive types. Other complex types are written to JSON as strings. PXF uses the following data type mapping when writing JSON data.
| PXF / Greengage DB data type | JSON data type |
|---|---|
| `BIGINT`, `FLOAT8`, `INTEGER`, `NUMERIC`, `REAL`, `SMALLINT` | number |
| `BOOLEAN` | boolean |
| `BPCHAR`, `TEXT`, `VARCHAR` | string |
| `BYTEA` | string (base64-encoded value) |
| `DATE`, `TIME`, `TIMESTAMP`, `TIMESTAMPTZ` | string |
| `BOOLEAN[]` | boolean[] |
| `BIGINT[]`, `FLOAT8[]`, `INTEGER[]`, `NUMERIC[]`, `REAL[]`, `SMALLINT[]` | number[] |
| `BYTEA[]` | string[] (base64-encoded values) |
| `DATE[]`, `TIME[]`, `TIMESTAMP[]`, `TIMESTAMPTZ[]` | string[] |
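As a sketch of the write mappings, a writable external table might look as follows. The table name and the `/tmp/pxf_examples/sample_w` path are illustrative:

```sql
-- Hypothetical writable external table: BYTEA is written as a
-- base64-encoded JSON string, TIMESTAMP as a JSON string,
-- and INT[] as a JSON number array.
CREATE WRITABLE EXTERNAL TABLE sample_types_w
(
    payload    BYTEA,
    created_at TIMESTAMP,
    scores     INT[]
)
LOCATION ('pxf://tmp/pxf_examples/sample_w?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```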
Use column projection
Consider the following example JSON data file, where user is an object composed of fields named id and location:
```json
{
  "created_at": "Mon Sep 30 04:04:53 +0000 2013",
  "id_str": "384529256681725952",
  "user": {
    "id": 31424214,
    "location": "COLUMBUS"
  },
  "coordinates": {
    "type": "Point",
    "values": [
      13,
      99
    ]
  }
}
```
To specify the nested fields in the `user` object as the Greengage DB external table columns, use the dot (`.`) projection:

```
user.id
user.location
```

`coordinates` is an object composed of a text field named `type` and an array of integers named `values`. To read all the elements of the `values` array in a single column, define the corresponding Greengage DB external table column as type `INT[]`:

```sql
"coordinates.values" INT[]
```

PXF also supports accessing specific elements of a JSON array using the `[n]` syntax in the table definition. Note that when you use this method to identify specific array elements, PXF provides only those values to Greengage DB, not the whole array:

```sql
"coordinates.values[0]" INT
```
When writing JSON data, PXF supports only scalar values or one-dimensional arrays of Greengage DB data types and does not support column projection.
Create an external table using the PXF protocol
To create a Greengage DB external table to read or write JSON data in HDFS, use the following general syntax:

```sql
CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs>?PROFILE=hdfs:json[&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
```
| Keyword | Value |
|---|---|
| `<path‑to‑hdfs>` | The path to the directory or file in the HDFS data store. When the server configuration includes a `pxf.fs.basePath` property setting, PXF considers `<path‑to‑hdfs>` to be relative to the specified base path |
| `PROFILE=hdfs:json` | The `PROFILE` keyword must specify `hdfs:json` |
| `FORMAT 'CUSTOM'` | The custom format with the built-in custom formatter functions for read (`pxfwritable_import`) or write (`pxfwritable_export`) operations |
| `DISTRIBUTED BY` | When loading data from a Greengage DB table into a writable external table, consider specifying the same distribution policy or column name on both tables. This avoids extra motion of data between segments on the load operation. Learn more about table distribution in Distribution |
| `<custom‑option>` | One of the custom options listed below |
| `SERVER=<server_name>` | The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used |
| `IDENTIFIER=<value>` | Specified only when accessing JSON data comprised of multiline records; names the object field that PXF uses to detect the objects to return as tuples. When a nested object also includes a field with the same name as the one specified as `IDENTIFIER`, PXF may return unexpected results; consider renaming one of the fields |
| `SPLIT_BY_FILE=<boolean>` | Defines how to split the data specified in `<path‑to‑hdfs>`: when `false` (default), PXF creates multiple splits per file; when `true`, PXF creates a single split per file |
| `IGNORE_MISSING_PATH=<boolean>` | The action to take when `<path‑to‑hdfs>` is missing or invalid: when `false` (default), PXF returns an error; when `true`, PXF ignores the missing path and returns an empty fragment |
| `ROOT=<value>` | When writing to a single JSON object, identifies the name of the root-level object attribute |
| `COMPRESSION_CODEC` | The compression codec to use when writing data. If a compression codec is specified, PXF appends the corresponding codec suffix to the names of the written files (for example, `.jsonl.gz` when writing `gzip`-compressed single-object-per-row data) |
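For example, a table definition might combine several custom options in the `LOCATION` string. This is a sketch: the table name, path, and server name are illustrative, and the option values shown are the non-default ones described above:

```sql
-- One split per file; an empty result instead of an error if the path is missing.
CREATE EXTERNAL TABLE colors_r
(
    id    INT,
    color TEXT
)
LOCATION ('pxf://tmp/pxf_examples/colors?PROFILE=hdfs:json&SERVER=hadoop&SPLIT_BY_FILE=true&IGNORE_MISSING_PATH=true')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```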
Examples
These examples demonstrate how to configure and use the PXF HDFS connector for reading and writing JSON data in HDFS by using external tables.
Configure the PXF HDFS connector
To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:

- Log in to the Greengage DB master host as `gpadmin`.

- Go to the `$PXF_BASE/servers` directory and create a Hadoop server configuration directory named `hadoop`:

  ```shell
  $ mkdir $PXF_BASE/servers/hadoop
  $ cd $PXF_BASE/servers/hadoop
  ```

- Copy the `core-site.xml`, `hdfs-site.xml`, `mapred-site.xml`, and `yarn-site.xml` Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:

  ```shell
  $ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
  $ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
  $ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
  $ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .
  ```

- Synchronize the PXF server configuration to the Greengage DB cluster:

  ```shell
  $ pxf cluster sync
  ```
Read a JSON file with single-line records
- In the `/tmp` directory on the HDFS host, create a JSON Lines file named `customers.jsonl` having the following content:

  ```json
  {"customer": {"id": 1,"name": "John Doe"},"ordered_items":["laptop", "monitor"]}
  {"customer": {"id": 2,"name": "Jane Smith"},"ordered_items":["keyboard", "mouse", "pad"]}
  {"customer": {"id": 3,"name": "Bob Brown"},"ordered_items":["headphones"]}
  {"customer": {"id": 4,"name": "Alice Green"},"ordered_items":["webcam", "microphone"]}
  ```

- Create the `/tmp/pxf_examples` HDFS directory for storing PXF example data files and add the `customers.jsonl` file to HDFS:

  ```shell
  $ hdfs dfs -mkdir -p /tmp/pxf_examples
  $ hdfs dfs -put /tmp/customers.jsonl /tmp/pxf_examples/
  ```

- On the Greengage DB master host, create an external table that references the `customers.jsonl` file. When defining columns, use the dot notation (`.`) to access the nested fields in the `customer` object. Map the `ordered_items` text array to a Greengage DB text-type column (`TEXT`). In the `LOCATION` clause, specify the PXF `hdfs:json` profile and the server configuration. In the `FORMAT` clause, specify `pxfwritable_import`, which is the built-in custom formatter function for read operations:

  ```sql
  CREATE EXTERNAL TABLE customers_r_singleline_1
  (
      "customer.id" INT,
      "customer.name" TEXT,
      "ordered_items" TEXT
  )
  LOCATION('pxf://tmp/pxf_examples/customers.jsonl?PROFILE=hdfs:json&SERVER=hadoop')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
  ```

- Query the created external table:

  ```sql
  SELECT * FROM customers_r_singleline_1;
  ```

  The output should look as follows:

  ```
   customer.id | customer.name |       ordered_items
  -------------+---------------+----------------------------
             1 | John Doe      | ["laptop","monitor"]
             2 | Jane Smith    | ["keyboard","mouse","pad"]
             3 | Bob Brown     | ["headphones"]
             4 | Alice Green   | ["webcam","microphone"]
  (4 rows)
  ```

  The `ordered_items` JSON array is retrieved into a single text-type column. To convert it into a native Greengage DB array type, you can run the following query that uses the `json_array_elements_text()` function:

  ```sql
  SELECT "customer.id", "customer.name",
         ARRAY(SELECT json_array_elements_text(ordered_items::json))::text[] AS items
  FROM customers_r_singleline_1;
  ```

  The output should look as follows:

  ```
   customer.id | customer.name |        items
  -------------+---------------+----------------------
             1 | John Doe      | {laptop,monitor}
             2 | Jane Smith    | {keyboard,mouse,pad}
             3 | Bob Brown     | {headphones}
             4 | Alice Green   | {webcam,microphone}
  (4 rows)
  ```

- On the Greengage DB master host, create another external table that references the `customers.jsonl` file. Map the `ordered_items` text array to a Greengage DB text array column (`TEXT[]`):

  ```sql
  CREATE EXTERNAL TABLE customers_r_singleline_2
  (
      "customer.id" INT,
      "customer.name" TEXT,
      "ordered_items" TEXT[]
  )
  LOCATION('pxf://tmp/pxf_examples/customers.jsonl?PROFILE=hdfs:json&SERVER=hadoop')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
  ```

- Query the created external table using the array subscript number in square brackets to retrieve the first item of the `ordered_items` array:

  ```sql
  SELECT "customer.id", "customer.name", "ordered_items"[1]
  FROM customers_r_singleline_2;
  ```

  The output should look as follows:

  ```
   customer.id | customer.name | ordered_items
  -------------+---------------+---------------
             1 | John Doe      | laptop
             2 | Jane Smith    | keyboard
             3 | Bob Brown     | headphones
             4 | Alice Green   | webcam
  (4 rows)
  ```

- On the Greengage DB master host, create another external table that references the `customers.jsonl` file. Use the array subscript number in square brackets to retrieve the second item of the `ordered_items` array and map it to a Greengage DB text-type column:

  ```sql
  CREATE EXTERNAL TABLE customers_r_singleline_3
  (
      "customer.id" INT,
      "customer.name" TEXT,
      "ordered_items[1]" TEXT
  )
  LOCATION('pxf://tmp/pxf_examples/customers.jsonl?PROFILE=hdfs:json&SERVER=hadoop')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
  ```

- Query the created external table:

  ```sql
  SELECT * FROM customers_r_singleline_3;
  ```

  The output should look as follows:

  ```
   customer.id | customer.name | ordered_items[1]
  -------------+---------------+------------------
             1 | John Doe      | monitor
             2 | Jane Smith    | mouse
             3 | Bob Brown     |
             4 | Alice Green   | microphone
  (4 rows)
  ```
Read a JSON file with multiline records
- In the `/tmp` directory on the HDFS host, create a JSON file named `customers.json` having the following content:

  ```json
  {
    "customers": [
      {
        "id": 101,
        "name": "Alice Smith",
        "ordered_items": [
          "laptop",
          "monitor"
        ]
      },
      {
        "id": 102,
        "name": "Bob Johnson",
        "ordered_items": [
          "keyboard",
          "mouse",
          "pad"
        ]
      },
      {
        "id": 103,
        "name": "Charlie Brown",
        "ordered_items": [
          "headphones"
        ]
      }
    ]
  }
  ```

- If not created yet, create the `/tmp/pxf_examples` HDFS directory for storing PXF example data files and add the `customers.json` file to HDFS:

  ```shell
  $ hdfs dfs -mkdir -p /tmp/pxf_examples
  $ hdfs dfs -put /tmp/customers.json /tmp/pxf_examples/
  ```

- On the Greengage DB master host, create an external table that references the `customers.json` file. Map the `ordered_items` text array to a Greengage DB text array column (`TEXT[]`). In the `LOCATION` clause, specify the PXF `hdfs:json` profile, the server configuration, and `id` as the `IDENTIFIER` value. In the `FORMAT` clause, specify `pxfwritable_import`, which is the built-in custom formatter function for read operations:

  ```sql
  CREATE EXTERNAL TABLE customers_r_multiline
  (
      id INT,
      name TEXT,
      ordered_items TEXT[]
  )
  LOCATION ('pxf://tmp/pxf_examples/customers.json?PROFILE=hdfs:json&SERVER=hadoop&IDENTIFIER=id')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
  ```

- Query the created external table:

  ```sql
  SELECT * FROM customers_r_multiline;
  ```

  The output should look as follows:

  ```
   id  |     name      |    ordered_items
  -----+---------------+----------------------
   101 | Alice Smith   | {laptop,monitor}
   102 | Bob Johnson   | {keyboard,mouse,pad}
   103 | Charlie Brown | {headphones}
  (3 rows)
  ```
Write a JSON file with single-line records
- On the Greengage DB master host, create the writable external table that stores data into the `/tmp/pxf_examples/customers` HDFS directory. In the `LOCATION` clause, specify the PXF `hdfs:json` profile and the server configuration. In the `FORMAT` clause, specify `pxfwritable_export`, which is the built-in custom formatter function for write operations:

  ```sql
  CREATE WRITABLE EXTERNAL TABLE customers_w_singleline
  (
      id INT,
      name TEXT,
      ordered_items TEXT[]
  )
  LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:json&SERVER=hadoop')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
  ```

- Insert some data into the `customers_w_singleline` table:

  ```sql
  INSERT INTO customers_w_singleline VALUES
  (1, 'John Doe', ARRAY['laptop', 'monitor']),
  (2, 'Jane Smith', ARRAY['keyboard', 'mouse', 'pad']),
  (3, 'Bob Brown', ARRAY['headphones']),
  (4, 'Alice Green', ARRAY['webcam', 'microphone']);
  ```

- View the contents of the files created in HDFS:

  ```shell
  $ hdfs dfs -cat /tmp/pxf_examples/customers/*
  ```

  The output should look similar to the following:

  ```
  {"id":3,"name":"Bob Brown","ordered_items":["headphones"]}
  {"id":4,"name":"Alice Green","ordered_items":["webcam","microphone"]}
  {"id":1,"name":"John Doe","ordered_items":["laptop","monitor"]}
  {"id":2,"name":"Jane Smith","ordered_items":["keyboard","mouse","pad"]}
  ```
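To verify the written data from Greengage DB itself, you could define a readable external table over the same HDFS directory. This is a sketch based on the tables in this example; the table name `customers_r_check` is illustrative:

```sql
CREATE EXTERNAL TABLE customers_r_check
(
    id INT,
    name TEXT,
    ordered_items TEXT[]
)
LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:json&SERVER=hadoop')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

SELECT * FROM customers_r_check ORDER BY id;
```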
Write a JSON file with multiline records
- On the Greengage DB master host, create the writable external table that stores data into the `/tmp/pxf_examples/customers` HDFS directory. In the `LOCATION` clause, specify the PXF `hdfs:json` profile, the server configuration, and `customers` as the `ROOT` value. In the `FORMAT` clause, specify `pxfwritable_export`, which is the built-in custom formatter function for write operations:

  ```sql
  CREATE WRITABLE EXTERNAL TABLE customers_w_multiline
  (
      id INT,
      name TEXT,
      ordered_items TEXT[]
  )
  LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:json&SERVER=hadoop&ROOT=customers')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
  ```

- Insert some data into the `customers_w_multiline` table:

  ```sql
  INSERT INTO customers_w_multiline VALUES
  (1, 'John Doe', ARRAY['laptop', 'monitor']),
  (2, 'Jane Smith', ARRAY['keyboard', 'mouse', 'pad']),
  (3, 'Bob Brown', ARRAY['headphones']),
  (4, 'Alice Green', ARRAY['webcam', 'microphone']);
  ```

- View the contents of the files created in HDFS:

  ```shell
  $ hdfs dfs -cat /tmp/pxf_examples/customers/*
  ```

  The output should look similar to the following:

  ```
  {"customers":[
  {"id":3,"name":"Bob Brown","ordered_items":["headphones"]}
  ]}
  {"customers":[
  {"id":1,"name":"John Doe","ordered_items":["laptop","monitor"]}
  ]}
  {"customers":[
  {"id":2,"name":"Jane Smith","ordered_items":["keyboard","mouse","pad"]}
  ,{"id":4,"name":"Alice Green","ordered_items":["webcam","microphone"]}
  ]}
  ```