Use PXF HDFS connector to read and write fixed-width text between HDFS and Greengage DB
This topic describes how to configure and use the PXF HDFS connector for reading and writing fixed-width text data stored in HDFS by using external tables and provides practical examples.
To learn more about the fixed-width data formatter options in Greengage DB, see Format fixed-width data.
Create an external table using the PXF protocol
To create a Greengage DB external table to read or write fixed-width text data in HDFS, use the following general syntax:
CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path_to_data>?PROFILE=hdfs:fixedwidth[&<custom_option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='fixedwidth_in | fixedwidth_out',
<field_name>='<width>' [, ...]
[, line_delim[=|<space>][E]'<delim_value>'])
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
| Keyword | Value |
|---|---|
<table_name> |
The name of the table to create |
<column_name> |
The name of the column to create |
<data_type> |
The data type of the column |
LIKE <other_table> |
Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy |
<path_to_data> |
The path to the directory or file in the HDFS data store.
When the |
PROFILE=hdfs:fixedwidth |
The |
FORMAT 'CUSTOM' |
The custom format with the built-in custom formatter functions for read ( |
<field_name>='<width>' |
The name and the width of the field in characters.
Fields must be listed in their physical order.
The field names must match the columns listed in the When reading data, if the field value is less than the |
line_delim |
The line delimiter character in the data, |
DISTRIBUTED BY |
When loading data from a Greengage DB table into a writable external table, consider specifying the same distribution policy or column name on both tables. This will avoid extra motion of data between segments on the load operation. Learn more about table distribution in Distribution |
<custom_option> |
One of the custom options provided in the |
SERVER=<server_name> |
The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used |
NEWLINE |
When the |
COMPRESSION_CODEC |
The compression codec to use when writing data: |
IGNORE_MISSING_PATH |
The action to take when |
Examples
These examples demonstrate how to configure and use the PXF HDFS connector for reading and writing fixed-width text data in HDFS by using external tables.
Configure the PXF HDFS connector
To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:
-
Log in to the Greengage DB master host as
gpadmin. -
Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:
$ mkdir $PXF_BASE/servers/hadoop $ cd $PXF_BASE/servers/hadoop -
Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:
$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml . -
Synchronize the server configuration to the Greengage DB cluster hosts:
$ pxf cluster sync
Create a readable external table
-
In the /tmp directory on the HDFS host, create a file named customers.txt. In the sample file, the first field is 3 characters long, the second — 15 characters long, the third — 25 characters long, and the last — 20 characters long. Notice the fields are right-padded with spaces to the required length:
1 John Doe john.doe@example.com 123 Elm Street 2 Jane Smith jane.smith@example.com 456 Oak Street 3 Bob Brown bob.brown@example.com 789 Pine Street 4 Rob Stuart rob.stuart@example.com 119 Willow Street -
Create the /tmp/pxf_examples HDFS directory for storing PXF example data files and add the customers.txt file to HDFS:
$ hdfs dfs -mkdir -p /tmp/pxf_examples $ hdfs dfs -put /tmp/customers.txt /tmp/pxf_examples/ -
On the Greengage DB master host, create an external table that references the customers.txt file. In the
LOCATIONclause, specify the PXFhdfs:fixedwidthprofile and the server configuration. In theFORMATclause, specify thefixedwidth_inbuilt-in formatter function for reading data and enumerate the data fields with their respective lengths:CREATE EXTERNAL TABLE customers_r ( id INTEGER, name TEXT, email TEXT, address TEXT ) LOCATION ('pxf://tmp/pxf_examples/customers.txt?PROFILE=hdfs:fixedwidth&SERVER=hadoop') FORMAT 'CUSTOM' ( FORMATTER='fixedwidth_in', id='3', name='15', email='25', address='20'); -
Query the created external table:
SELECT * FROM customers_r;The output should look as follows:
id | name | email | address ----+------------+------------------------+------------------- 1 | John Doe | john.doe@example.com | 123 Elm Street 2 | Jane Smith | jane.smith@example.com | 456 Oak Street 3 | Bob Brown | bob.brown@example.com | 789 Pine Street 4 | Rob Stuart | rob.stuart@example.com | 119 Willow Street (4 rows)
Create a writable external table
-
On the Greengage DB master host, create the writable external table that stores data into the /tmp/pxf_examples/customers HDFS directory. In the
LOCATIONclause, specify the PXFhdfs:fixedwidthprofile and the server configuration. In theFORMATclause, specify thefixedwidth_outbuilt-in formatter function for writing data and enumerate the data fields with their respective lengths:CREATE WRITABLE EXTERNAL TABLE customers_w ( id INTEGER, name TEXT, email TEXT, address TEXT ) LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:fixedwidth&SERVER=hadoop') FORMAT 'CUSTOM' ( FORMATTER='fixedwidth_out', id='3', name='15', email='25', address='20'); -
Insert some data into the
customers_wtable:INSERT INTO customers_w ( id, name, email, address ) VALUES (5, 'Alice Price', 'alice.price@example.com', '234 Maple Avenue'), (6, 'David Lee', 'david.lee@example.com', '567 Birch Lane'), (7, 'Emily Wilson', 'emily.wilson@example.com', '890 Cedar Court'), (8, 'Kevin Garcia', 'kevin.garcia@example.com', '123 Spruce Drive'); -
On the HDFS host, view the contents of the files created in the /tmp/pxf_examples/customers directory:
$ hdfs dfs -cat /tmp/pxf_examples/customers/*The output should look as follows:
8 Kevin Garcia kevin.garcia@example.com 123 Spruce Drive 5 Alice Price alice.price@example.com 234 Maple Avenue 6 David Lee david.lee@example.com 567 Birch Lane 7 Emily Wilson emily.wilson@example.com 890 Cedar Court