Use the PXF HDFS connector to read and write text data between Greengage DB and HDFS
These examples demonstrate how to configure and use the PXF HDFS connector for reading and writing HDFS text data by using external tables.
Configure the PXF HDFS connector
To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in the PXF documentation and then synchronize it to the Greengage DB cluster:

- Log in to the Greengage DB master host as gpadmin.

- Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:

      $ mkdir $PXF_BASE/servers/hadoop
      $ cd $PXF_BASE/servers/hadoop

- Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:

      $ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
      $ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
      $ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
      $ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .

- Synchronize the PXF server configuration to the Greengage DB cluster:

      $ pxf cluster sync
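After copying the Hadoop configuration files, it can be useful to confirm that they point at the intended cluster. The following sketch is illustrative only and not part of PXF: it parses the fs.defaultFS property (the HDFS NameNode URI) out of a core-site.xml file; the example path and value are assumptions.

```python
# Illustrative check, not part of PXF: read the fs.defaultFS property
# (the HDFS NameNode URI) from a copied Hadoop core-site.xml file.
import xml.etree.ElementTree as ET

def read_default_fs(core_site_path: str) -> str:
    """Return the fs.defaultFS value from a Hadoop core-site.xml file."""
    root = ET.parse(core_site_path).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name") == "fs.defaultFS":
            return prop.findtext("value")
    raise KeyError(f"fs.defaultFS not found in {core_site_path}")

# Hypothetical usage, assuming $PXF_BASE is /usr/local/pxf:
# read_default_fs("/usr/local/pxf/servers/hadoop/core-site.xml")
```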
Create an external table using the PXF protocol
To create a Greengage DB external table that references a text file or directory on HDFS, use the following general syntax:
CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs>?PROFILE=<profile_name>[&<custom_option>[...]]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
| Keyword | Value |
|---|---|
| <path-to-hdfs> | The path to the directory or file in the HDFS data store. PXF supports reading and writing text files compressed with the default, bzip2, and gzip codecs. For readable external tables, decompression is performed automatically. For writable external tables, you can specify the compression codec and type by using the COMPRESSION_CODEC and COMPRESSION_TYPE custom options |
| PROFILE=<profile_name> | Use the hdfs:text profile for plain delimited text data and the hdfs:text:multi profile for text data with records that span multiple lines |
| FORMAT <value> | The data format, which can be TEXT (plain delimited text) or CSV (comma-separated values) |
| delimiter | The delimiter character in the data. For FORMAT 'CSV', the default delimiter is a comma (,). Note that the delimiter must be a single one-byte character; prefix the value with E when it is an escape sequence, for example delimiter=E'\t' |
| DISTRIBUTED BY | When loading data from a Greengage DB table into a writable external table, consider specifying the same distribution policy or column name on both tables. This avoids extra motion of data between segments on the load operation. Learn more about table distribution in Distribution |
| <custom_option> | One of the custom options listed in the rows below |
| SERVER=<server_name> | The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used |
| IGNORE_MISSING_PATH=<boolean> | The action to take when <path-to-hdfs> is missing or invalid. The default value is false; in this case, PXF returns an error. When set to true, PXF ignores missing path errors and returns an empty fragment |
| SKIP_HEADER_COUNT=<numlines> | The number of header lines to skip in the beginning of each HDFS file before reading the data. Applies to readable external tables only. The default value is 0 |
| COMPRESSION_CODEC | The compression codec to use when writing data: default, bzip2, gzip, or uncompressed |
| COMPRESSION_TYPE | The compression type to use when writing data: record (RECORD, the default) or block (BLOCK) |
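As a sketch of how the parts of the LOCATION string fit together, the following hypothetical Python helper (not part of PXF) composes a pxf:// URI from the HDFS path, a profile, and any custom options; note that the HDFS path appears in the URI without its leading slash.

```python
# Hypothetical helper: compose a PXF LOCATION URI from its parts.
def pxf_location(path: str, profile: str, **options: str) -> str:
    pairs = [("PROFILE", profile)] + list(options.items())
    query = "&".join(f"{key}={value}" for key, value in pairs)
    # PXF URIs reference the HDFS path without its leading slash.
    return f"pxf://{path.lstrip('/')}?{query}"

print(pxf_location("/tmp/pxf_examples/customers.txt.gz", "hdfs:text",
                   SERVER="hadoop", SKIP_HEADER_COUNT="1"))
```

The result matches the URI used in the readable-table example in this document.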
Read regular text data from HDFS
- In the /tmp directory on the HDFS host, create a delimited plain text data file named customers.txt with the following content:

      id,first_name,last_name,email,address
      1,John,Doe,john.doe@example.com,123 Elm Street
      2,Jane,Smith,jane.smith@example.com,456 Oak Street
      3,Bob,Brown,bob.brown@example.com,789 Pine Street
      4,Rob,Stuart,rob.stuart@example.com,119 Willow Street
- Compress the created file:

      $ gzip /tmp/customers.txt

- Create the pxf_examples HDFS directory for storing PXF example data files and add the customers.txt.gz file to HDFS:

      $ hdfs dfs -mkdir -p /tmp/pxf_examples
      $ hdfs dfs -put /tmp/customers.txt.gz /tmp/pxf_examples/
- Optionally, review the contents of the customers.txt.gz file stored in HDFS:

      $ hdfs dfs -text /tmp/pxf_examples/customers.txt.gz

- On the Greengage DB master host, create a readable external table using the PXF protocol that references the customers.txt.gz file. In the LOCATION clause, specify the PXF hdfs:text profile and the server configuration. Use the SKIP_HEADER_COUNT option to indicate that the file contains a header row. In the FORMAT clause, set TEXT as the data format and the comma character (,) as the delimiter:

      CREATE EXTERNAL TABLE customers_plain
      (
          id INTEGER,
          first_name VARCHAR(50),
          last_name VARCHAR(50),
          email VARCHAR(100),
          address VARCHAR(255)
      )
      LOCATION ('pxf://tmp/pxf_examples/customers.txt.gz?PROFILE=hdfs:text&SERVER=hadoop&SKIP_HEADER_COUNT=1')
      FORMAT 'TEXT' (delimiter=E',');
- Query the external table:

      SELECT * FROM customers_plain;

  The output should look as follows:

       id | first_name | last_name |         email          |      address
      ----+------------+-----------+------------------------+-------------------
        1 | John       | Doe       | john.doe@example.com   | 123 Elm Street
        2 | Jane       | Smith     | jane.smith@example.com | 456 Oak Street
        3 | Bob        | Brown     | bob.brown@example.com  | 789 Pine Street
        4 | Rob        | Stuart    | rob.stuart@example.com | 119 Willow Street
      (4 rows)
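Conceptually, what PXF does for this table can be reproduced locally: decompress the gzip stream, drop the first line per SKIP_HEADER_COUNT=1, and split each remaining record on the delimiter. A minimal Python sketch (illustrative only, no HDFS involved):

```python
import gzip
import io

# Build a small gzip-compressed sample like customers.txt.gz.
data = (b"id,first_name,last_name,email,address\n"
        b"1,John,Doe,john.doe@example.com,123 Elm Street\n")
gz_bytes = gzip.compress(data)

# Decompress, skip the header line, split records on the comma delimiter.
with gzip.open(io.BytesIO(gz_bytes), "rt") as f:
    rows = [line.split(",") for line in f.read().splitlines()[1:]]

print(rows)  # one record with five fields
```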
Read multiline text data from HDFS
- In the /tmp directory on the HDFS host, create a delimited plain text data file named customers_multiline.txt with the following content. Notice the use of the colon (:) to separate the fields. Also notice the quotes around the address field: this field includes an embedded line feed separating the street address from the city:

      id:first_name:last_name:email:address
      1:John:Doe:john.doe@example.com:"123 Elm Street
      New York"
      2:Jane:Smith:jane.smith@example.com:"456 Oak Street
      Chicago"
      3:Bob:Brown:bob.brown@example.com:"789 Pine Street
      Los Angeles"
      4:Rob:Stuart:rob.stuart@example.com:"119 Willow Street
      New Orleans"
- Copy the text file to HDFS:

      $ hdfs dfs -put /tmp/customers_multiline.txt /tmp/pxf_examples/

- On the Greengage DB master host, create a readable external table using the PXF protocol that references the customers_multiline.txt file. In the LOCATION clause, specify the PXF hdfs:text:multi profile and the server configuration. Use the SKIP_HEADER_COUNT option to indicate that the file contains a header row. In the FORMAT clause, set CSV as the data format and the colon character (:) as the delimiter:

      CREATE EXTERNAL TABLE customers_multiline
      (
          id INTEGER,
          first_name TEXT,
          last_name TEXT,
          email TEXT,
          address TEXT
      )
      LOCATION ('pxf://tmp/pxf_examples/customers_multiline.txt?PROFILE=hdfs:text:multi&SERVER=hadoop&SKIP_HEADER_COUNT=1')
      FORMAT 'CSV' (delimiter ':');
- Query the created external table:

      SELECT * FROM customers_multiline;

  The output should look as follows:

       id | first_name | last_name |         email          |      address
      ----+------------+-----------+------------------------+-------------------
        1 | John       | Doe       | john.doe@example.com   | 123 Elm Street
          |            |           |                        | New York
        2 | Jane       | Smith     | jane.smith@example.com | 456 Oak Street
          |            |           |                        | Chicago
        3 | Bob        | Brown     | bob.brown@example.com  | 789 Pine Street
          |            |           |                        | Los Angeles
        4 | Rob        | Stuart    | rob.stuart@example.com | 119 Willow Street
          |            |           |                        | New Orleans
      (4 rows)
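The reason the hdfs:text:multi profile is paired with FORMAT 'CSV' here: a plain line-oriented reader would split each record at the embedded line feed, whereas a CSV parser treats the quoted address field as a single value. The same behavior can be observed locally with Python's csv module (illustrative sketch):

```python
import csv
import io

# Two lines of the sample file; the quoted address spans a line feed.
raw = ('id:first_name:last_name:email:address\n'
       '1:John:Doe:john.doe@example.com:"123 Elm Street\nNew York"\n')

# The CSV parser honors the quotes, so the record becomes one row of five fields.
records = list(csv.reader(io.StringIO(raw), delimiter=":"))[1:]
print(records[0][4])  # the address value keeps its embedded line feed
```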
Write text data to HDFS
- On the Greengage DB master host, create a writable external table that writes data to the /tmp/pxf_examples/customers HDFS directory. In the LOCATION clause, specify the PXF hdfs:text profile and the server configuration, and set the compression codec to gzip. In the FORMAT clause, specify TEXT as the data format and the comma character (,) as the delimiter:

      CREATE WRITABLE EXTERNAL TABLE customers_w
      (
          id INTEGER,
          first_name TEXT,
          last_name TEXT,
          email TEXT,
          address TEXT
      )
      LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:text&SERVER=hadoop&COMPRESSION_CODEC=gzip')
      FORMAT 'TEXT' (delimiter ',');
- Insert the contents of the earlier created customers_plain readable table into the customers_w table:

      INSERT INTO customers_w SELECT * FROM customers_plain;
- Manually insert several values into the customers_w table:

      INSERT INTO customers_w ( id, first_name, last_name, email, address )
      VALUES
      (5,'Alice','Johnson','alice.johnson@example.com','10 Oak Avenue'),
      (6,'Charlie','Williams','charlie.williams@example.com','42 Maple Drive');
- On the HDFS host, view the contents of the files created in the customers directory:

      $ hdfs dfs -text /tmp/pxf_examples/customers/*

  The output should look as follows:

      1,John,Doe,john.doe@example.com,123 Elm Street
      2,Jane,Smith,jane.smith@example.com,456 Oak Street
      3,Bob,Brown,bob.brown@example.com,789 Pine Street
      4,Rob,Stuart,rob.stuart@example.com,119 Willow Street
      5,Alice,Johnson,alice.johnson@example.com,10 Oak Avenue
      6,Charlie,Williams,charlie.williams@example.com,42 Maple Drive
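Each Greengage DB segment typically writes its own file into the customers directory, so with COMPRESSION_CODEC=gzip the directory holds several independent gzip streams, and hdfs dfs -text decompresses each of them in turn; for this reason the row order across files is not guaranteed. A local Python sketch of that layout (illustrative only):

```python
import gzip

# Two "segment files", each an independent gzip stream, as a writable
# external table with COMPRESSION_CODEC=gzip would produce.
segment_files = [
    gzip.compress(b"1,John,Doe,john.doe@example.com,123 Elm Street\n"),
    gzip.compress(b"2,Jane,Smith,jane.smith@example.com,456 Oak Street\n"),
]

# Reading the directory means decompressing every file and concatenating rows.
rows = b"".join(gzip.decompress(f) for f in segment_files).decode().splitlines()
print(rows)
```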