Use PXF HDFS connector to read and write SequenceFile data between Greengage DB and HDFS
The SequenceFile format is a common data transfer format between MapReduce jobs. Files of this type consist of binary key/value pairs.
This topic describes how to configure and use the PXF HDFS connector for reading and writing SequenceFile data stored in HDFS by using external tables and provides practical examples.
Create an external table using the PXF protocol
To create a Greengage DB external table to read or write SequenceFile data in HDFS, use the following general syntax:
CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path_to_data>?PROFILE=hdfs:SequenceFile&DATA_SCHEMA=<data_schema>[&<custom_option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import' | 'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
| Keyword | Value |
|---|---|
<table_name> |
The name of the table to create |
<column_name> |
The name of the column to create |
<data_type> |
The data type of the column |
LIKE <other_table> |
Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy |
<path_to_data> |
The path to the directory or file in the HDFS data store.
When the |
PROFILE=hdfs:SequenceFile |
The |
DATA_SCHEMA=<data_schema> |
The name of the serialization/deserialization class, which provides read and write methods for each data type referenced in the data schema. The JAR file in which this class resides must be in the PXF classpath |
FORMAT 'CUSTOM' |
The custom format with the built-in custom formatter functions for read ( |
DISTRIBUTED BY |
When loading data from a Greengage DB table into a writable external table, consider specifying the same distribution policy or column name on both tables. This will avoid extra motion of data between segments on the load operation. Learn more about table distribution in Distribution |
<custom_option> |
One of the custom options provided in the |
SERVER=<server_name> |
The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used |
IGNORE_MISSING_PATH |
The action to take when |
COMPRESSION_TYPE |
The compression type to employ when writing data: |
COMPRESSION_CODEC |
The compression codec to use when writing data: |
Read the record key
When a Greengage DB external table references a SequenceFile or another data format that stores rows in a key/value format, you can access the key values in Greengage DB queries by using the recordkey keyword as a field name.
The field type of recordkey must correspond to the key type.
You can define recordkey to be any of the following Hadoop types:
-
BooleanWritable -
ByteWritable -
DoubleWritable -
FloatWritable -
IntWritable -
LongWritable -
Text
If no record key is defined for a row, Greengage DB returns the ID of the segment that processed the row.
Examples
These examples demonstrate how to configure and use the PXF HDFS connector for reading and writing SequenceFile data in HDFS by using external tables.
Configure the PXF HDFS connector
To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:
-
Log in to the Greengage DB master host as
gpadmin. -
Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:
$ mkdir $PXF_BASE/servers/hadoop $ cd $PXF_BASE/servers/hadoop -
Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:
$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml . -
Synchronize the server configuration to the Greengage DB cluster hosts:
$ pxf cluster sync
Create a writable external table
-
In the /tmp directory on the HDFS host, prepare the directory structure for the serialization/deserialization Java class and create the CustomerWritable.java file:
$ mkdir -p /tmp/pxf_seqfile_example/com/example/pxf/hdfs/writable/dataschema $ cd /tmp/pxf_seqfile_example/com/example/pxf/hdfs/writable/dataschema $ vi CustomerWritable.java -
Add the following content to the CustomerWritable.java file:
package com.example.pxf.hdfs.writable.dataschema; import org.apache.hadoop.io.*; import java.io.DataInput; import java.io.DataOutput; import java.io.IOException; import java.lang.reflect.Field; public class CustomerWritable implements Writable { public int int1; public String st1, st2, st3; public CustomerWritable() { int1 = 0; st1 = ""; st2 = ""; st3 = ""; } int getInt1() { return int1; } String getSt1() { return st1; } String getSt2() { return st2; } String getSt3() { return st3; } @Override public void write(DataOutput out) throws IOException { IntWritable intw = new IntWritable(); intw.set(int1); intw.write(out); Text txt = new Text(); txt.set(st1); txt.write(out); txt.set(st2); txt.write(out); txt.set(st3); txt.write(out); } @Override public void readFields(DataInput in) throws IOException { IntWritable intw = new IntWritable(); intw.readFields(in); int1 = intw.get(); Text txt = new Text(); txt.readFields(in); st1 = txt.toString(); txt.readFields(in); st2 = txt.toString(); txt.readFields(in); st3 = txt.toString(); } } -
Save and close the file.
-
Compile a
CustomerWritableclass and package the resulting CustomerWritable.class file into a JAR file. Provide a classpath that includes the hadoop-common.jar file for your Hadoop system and copy the resulting customerwritable.jar file to the /tmp directory:$ javac -classpath /usr/lib/hadoop/hadoop-common.jar CustomerWritable.java $ cd /tmp/pxf_seqfile_example/ $ jar cf customerwritable.jar com $ cp customerwritable.jar /tmp/ -
Copy the customerwritable.jar file to the Greengage DB master host, for example:
$ scp /tmp/customerwritable.jar gpadmin@mdw:/home/gpadmin -
On the Greengage DB master host, copy the customerwritable.jar JAR file to the PXF runtime configuration directory ($PXF_BASE):
$ cp /home/gpadmin/customerwritable.jar $PXF_BASE/lib/customerwritable.jar -
Synchronize the PXF configuration to the Greengage DB cluster and then restart PXF on all hosts:
$ pxf cluster sync $ pxf cluster restart -
On the Greengage DB master host, create the writable external table that stores data into the /tmp/pxf_examples/customers HDFS directory. In the
LOCATIONclause, specify the PXFhdfs:SequenceFileprofile and the server configuration. Use theDATA_SCHEMAoption to specify the fully qualified name of theCustomerWritableserialization/deserialization Java class. In theFORMATclause, specifypxfwritable_export, which is the built-in custom formatter function for write operations:CREATE WRITABLE EXTERNAL TABLE customers_w ( id INT, name TEXT, email TEXT, address TEXT ) LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:SequenceFile&DATA_SCHEMA=com.example.pxf.hdfs.writable.dataschema.CustomerWritable&SERVER=hadoop') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export'); -
Insert some data into the
customers_wtable:INSERT INTO customers_w (id, name, email, address) VALUES (1,'John Doe','john.doe@example.com','123 Elm Street'), (2,'Jane Smith','jane.smith@example.com','456 Oak Street'), (3,'Bob Brown','bob.brown@example.com','789 Pine Street'), (4,'Rob Stuart','rob.stuart@example.com','119 Willow Street'); -
On the HDFS host, view the contents of the created /tmp/pxf_examples/customers HDFS directory:
$ hdfs dfs -ls /tmp/pxf_examples/customersThe output should look similar to the following:
Found 2 items -rw-r--r-- 3 gpadmin hadoop 340 2026-03-06 15:27 /tmp/pxf_examples/customers/119-0000000004_1 -rw-r--r-- 3 gpadmin hadoop 210 2026-03-06 15:27 /tmp/pxf_examples/customers/119-0000000004_2
Create a readable external table
-
On the Greengage DB master host, create an external table that reads files from the /tmp/pxf_examples/customers HDFS directory. In the
LOCATIONclause, specify the PXFhdfs:SequenceFileprofile and the server configuration. Use theDATA_SCHEMAoption to specify the fully qualified name of theCustomerWritableserialization/deserialization Java class. In theFORMATclause, specifypxfwritable_import, which is the built-in custom formatter function for read operations:CREATE EXTERNAL TABLE customers_r ( id INT, name TEXT, email TEXT, address TEXT ) LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:SequenceFile&DATA_SCHEMA=com.example.pxf.hdfs.writable.dataschema.CustomerWritable&SERVER=hadoop') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); -
Query the created external table:
SELECT * FROM customers_r;The output should look as follows:
id | name | email | address ----+------------+------------------------+------------------- 2 | Jane Smith | jane.smith@example.com | 456 Oak Street 3 | Bob Brown | bob.brown@example.com | 789 Pine Street 1 | John Doe | john.doe@example.com | 123 Elm Street 4 | Rob Stuart | rob.stuart@example.com | 119 Willow Street (4 rows)
Use record keys
-
On the Greengage DB master host, create an external table similar to the one created earlier but having the extra
recordkeyfirst column of typeBIGINT. In theLOCATIONclause, specify the PXFhdfs:SequenceFileprofile and the server configuration. Use theDATA_SCHEMAoption to specify the fully qualified name of theCustomerWritableserialization/deserialization Java class. In theFORMATclause, specifypxfwritable_import, which is the built-in custom formatter function for read operations:CREATE EXTERNAL TABLE customers_rkey ( recordkey BIGINT, id INT, name TEXT, email TEXT, address TEXT ) LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:SequenceFile&DATA_SCHEMA=com.example.pxf.hdfs.writable.dataschema.CustomerWritable&SERVER=hadoop') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); -
Query the created external table:
SELECT * FROM customers_rkey;The output should similar to the following. Since no record key was defined when inserting the rows into the writable table, the
recordkeycolumn identifies the ID of the segment on which the row data was processed:recordkey | id | name | email | address -----------+----+------------+------------------------+------------------- 2 | 4 | Rob Stuart | rob.stuart@example.com | 119 Willow Street 1 | 1 | John Doe | john.doe@example.com | 123 Elm Street 1 | 2 | Jane Smith | jane.smith@example.com | 456 Oak Street 1 | 3 | Bob Brown | bob.brown@example.com | 789 Pine Street (4 rows)