Use PXF HDFS connector to read multiline text from HDFS into a single table row in Greengage DB
The PXF HDFS connector lets you read one or more single- and multiline text files in HDFS, loading each file into a single table row and column. This may be useful for reading multiple files into the same Greengage DB external table, for example when individual JSON files each contain a separate record. Only text and JSON files can be loaded in this manner, including files with embedded line feeds.
This topic describes how to configure and use the PXF HDFS connector for reading multiline text data stored in HDFS by using external tables and provides practical examples.
Create an external table using the PXF protocol
To create a Greengage DB external table to read text or JSON files on HDFS, use the following general syntax:
CREATE EXTERNAL TABLE <table_name>
( <column_name> TEXT|JSON | LIKE <other_table> )
LOCATION ('pxf://<path_to_data>?PROFILE=hdfs:text:multi&FILE_AS_ROW=true[&<custom_option>=<value>[...]]')
FORMAT 'CSV';
| Keyword | Value |
|---|---|
<table_name> |
The name of the table to create |
<column_name> TEXT|JSON |
The column to read the data file into.
The column type must be set to either |
LIKE <other_table> |
Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy |
<path_to_data> |
The path to the directory or file in the HDFS data store.
When the When reading multiple files, ensure that all of them are of the same type (text or JSON). When reading multiple JSON files, ensure that each file is a complete record and contains the same record type |
PROFILE=hdfs:text:multi |
The |
FILE_AS_ROW=true |
The required option that instructs PXF to read each file into a single table row. When this option is specified, no additional custom or formatting options are supported |
FORMAT |
To read multiline text data in HDFS, |
<custom_option> |
One of the custom options provided in the |
SERVER=<server_name> |
The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used |
IGNORE_MISSING_PATH |
The action to take when |
Examples
These examples demonstrate how to configure and use the PXF HDFS connector for reading multiline text and JSON data in HDFS by using external tables.
Configure the PXF HDFS connector
To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:
-
Log in to the Greengage DB master host as
gpadmin. -
Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:
$ mkdir $PXF_BASE/servers/hadoop $ cd $PXF_BASE/servers/hadoop -
Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:
$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml . -
Synchronize the server configuration to the Greengage DB cluster hosts:
$ pxf cluster sync
Read multiline text files
-
In the /tmp directory on the HDFS host, create three plain text data files named customers_1.txt, customers_2.txt, and customers_3.txt having the following content:
1,John,Doe,john.doe@example.com,123 Elm Street
2,Jane,Smith,jane.smith@example.com,456 Oak Street 3,Bob,Brown,bob.brown@example.com,789 Pine Street 4,Rob,Stuart,rob.stuart@example.com,119 Willow Street
5,Alice,Johnson,alice.johnson@example.com,"101 Maple Avenue Unit 2B Second Floor" 6,Charlie,Davis,charlie.davis@example.com,"123 Elm Road Suite 300 Business Park" 7,Eve,Wilson,eve.wilson@example.com,"PO Box 555 Anytown, USA Zip Code: 12345"
-
Create the /tmp/pxf_examples HDFS directory for storing PXF example data files and add the created files to HDFS:
$ hdfs dfs -mkdir -p /tmp/pxf_examples $ hdfs dfs -put /tmp/customers_1.txt /tmp/customers_2.txt /tmp/customers_3.txt /tmp/pxf_examples/ -
On the Greengage DB master host, create a readable external table that references the text files. The table has the single
customers_datacolumn of typeTEXT. In theLOCATIONclause, specify the PXFhdfs:text:multiprofile and the server configuration. Use theFILE_AS_ROW=trueoption to enable loading files into single table rows. In theFORMATclause, setCSVas the data format:CREATE EXTERNAL TABLE customers_text ( customers_data TEXT ) LOCATION ('pxf://tmp/pxf_examples/?PROFILE=hdfs:text:multi&SERVER=hadoop&FILE_AS_ROW=true') FORMAT 'CSV'; -
Query the created external table:
SELECT * FROM customers_text;The output should look as follows:
customers_data ------------------------------------------------------------- 1,John,Doe,john.doe@example.com,123 Elm Street 5,Alice,Johnson,alice.johnson@example.com,"101 Maple Avenue+ Unit 2B + Second Floor" + 6,Charlie,Davis,charlie.davis@example.com,"123 Elm Road + Suite 300 + Business Park" + 7,Eve,Wilson,eve.wilson@example.com,"PO Box 555 + Anytown, USA + Zip Code: 12345" 2,Jane,Smith,jane.smith@example.com,456 Oak Street + 3,Bob,Brown,bob.brown@example.com,789 Pine Street + 4,Rob,Stuart,rob.stuart@example.com,119 Willow Street (3 rows)
-
Enable
psqlexpanded display mode and query thecustomers_texttable again:$ psql -d customers -x -c 'SELECT * FROM customers_text;'The output should look as follows:
-[ RECORD 1 ]--+------------------------------------------------------------ customers_data | 1,John,Doe,john.doe@example.com,123 Elm Street -[ RECORD 2 ]--+------------------------------------------------------------ customers_data | 2,Jane,Smith,jane.smith@example.com,456 Oak Street | 3,Bob,Brown,bob.brown@example.com,789 Pine Street | 4,Rob,Stuart,rob.stuart@example.com,119 Willow Street -[ RECORD 3 ]--+------------------------------------------------------------ customers_data | 5,Alice,Johnson,alice.johnson@example.com,"101 Maple Avenue | Unit 2B | Second Floor" | 6,Charlie,Davis,charlie.davis@example.com,"123 Elm Road | Suite 300 | Business Park" | 7,Eve,Wilson,eve.wilson@example.com,"PO Box 555 | Anytown, USA | Zip Code: 12345"
Read JSON files
-
In the /tmp directory on the HDFS host, create two JSON data files named customers_1.json and customers_2.json having the following content:
{ "customers": [ { "id": 101, "name": "Alice Smith", "ordered_items": [ "laptop", "monitor" ] }, { "id": 102, "name": "Bob Johnson", "ordered_items": [ "keyboard", "mouse", "pad" ] } ] }{ "customers": [ { "id": 103, "name": "Charlie Brown", "ordered_items": [ "headphones" ] } ] } -
If not created yet, create the /tmp/pxf_examples HDFS directory for storing PXF example data files and add the JSON files to HDFS:
$ hdfs dfs -mkdir -p /tmp/pxf_examples $ hdfs dfs -put /tmp/customers_1.json /tmp/customers_2.json /tmp/pxf_examples/ -
On the Greengage DB master host, create a readable external table using the PXF protocol that references the JSON files. The table has the single
customers_datacolumn of typeJSON. In theLOCATIONclause, specify the PXFhdfs:text:multiprofile and the server configuration. Use theFILE_AS_ROW=trueoption to enable loading files into single table rows. In theFORMATclause, setCSVas the data format:CREATE EXTERNAL TABLE customers_json ( customers_data JSON ) LOCATION ('pxf://tmp/pxf_examples/?PROFILE=hdfs:text:multi&SERVER=hadoop&FILE_AS_ROW=true') FORMAT 'CSV'; -
Query the created external table:
SELECT * FROM customers_json;The output should look as follows:
customers_data -------------------------------- { + "customers": [ + { + "id": 103, + "name": "Charlie Brown",+ "ordered_items": [ + "headphones" + ] + } + ] + } { + "customers": [ + { + "id": 101, + "name": "Alice Smith", + "ordered_items": [ + "laptop", + "monitor" + ] + }, + { + "id": 102, + "name": "Bob Johnson", + "ordered_items": [ + "keyboard", + "mouse", + "pad" + ] + } + ] + } (2 rows) -
Enable
psqlexpanded display mode and query thecustomers_jsontable again:$ psql -d customers -x -c 'SELECT * FROM customers_json;'The output should look as follows:
-[ RECORD 1 ]--+------------------------------- customers_data | { | "customers": [ | { | "id": 103, | "name": "Charlie Brown", | "ordered_items": [ | "headphones" | ] | } | ] | } -[ RECORD 2 ]--+------------------------------- customers_data | { | "customers": [ | { | "id": 101, | "name": "Alice Smith", | "ordered_items": [ | "laptop", | "monitor" | ] | }, | { | "id": 102, | "name": "Bob Johnson", | "ordered_items": [ | "keyboard", | "mouse", | "pad" | ] | } | ] | } -
Use the json_array_elements() Greengage DB function to extract the elements of the
customersnode and view them as individual records:SELECT json_array_elements(customers_data -> 'customers') AS customer FROM customers_json;The output should look as follows:
-[ RECORD 1 ]---------------------------- customer | { | "id": 101, | "name": "Alice Smith", | "ordered_items": [ | "laptop", | "monitor" | ] | } -[ RECORD 2 ]---------------------------- customer | { | "id": 102, | "name": "Bob Johnson", | "ordered_items": [ | "keyboard", | "mouse", | "pad" | ] | } -[ RECORD 3 ]---------------------------- customer | { | "id": 103, | "name": "Charlie Brown", | "ordered_items": [ | "headphones" | ] | } -
Use the
->operator (Get JSON array element) to retrieve the first item in theordered_itemsarray of eachcustomernode:SELECT json_array_elements(customers_data -> 'customers') -> 'ordered_items' -> 0 AS first_item FROM customers_json;The output should look as follows:
-[ RECORD 1 ]------------ first_item | "headphones" -[ RECORD 2 ]------------ first_item | "laptop" -[ RECORD 3 ]------------ first_item | "keyboard"