Use PXF HDFS connector to read and write SequenceFile data between Greengage DB and HDFS

Anton Monakov

Contents

Create an external table using the PXF protocol
- Read the record key
Examples

The SequenceFile format is a common data transfer format between MapReduce jobs. Files of this type consist of binary key/value pairs.

This topic describes how to configure and use the PXF HDFS connector for reading and writing SequenceFile data stored in HDFS by using external tables and provides practical examples.

To create a Greengage DB external table to read or write SequenceFile data in HDFS, use the following general syntax:

CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    LOCATION ('pxf://<path_to_data>?PROFILE=hdfs:SequenceFile&DATA_SCHEMA=<data_schema>[&<custom_option>=<value>[...]]')

    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import' | 'pxfwritable_export')
    [DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];

Keyword Value

Keyword	Value
<table_name>	The name of the table to create
<column_name>	The name of the column to create
<data_type>	The data type of the column
LIKE <other_table>	Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy
<path_to_data>	The path to the directory or file in the HDFS data store. When the `<server_name>` configuration includes a `pxf.fs.basePath` property setting, the `<path_to_data>` value is considered to be relative to the base path specified. Otherwise, the path is considered absolute. The value must not include the dollar sign (`$`) character
PROFILE=hdfs:SequenceFile	The `hdfs:SequenceFile` profile is used to read or write SequenceFile-format data in HDFS
DATA_SCHEMA=<data_schema>	The name of the serialization/deserialization class, which provides read and write methods for each data type referenced in the data schema. The JAR file in which this class resides must be in the PXF classpath
FORMAT 'CUSTOM'	The custom format with the built-in custom formatter functions for read (`pxfwritable_import`) and write (`pxfwritable_export`) operations is used to work with SequenceFile-format data in HDFS
DISTRIBUTED BY	When loading data from a Greengage DB table into a writable external table, consider specifying the same distribution policy or column name on both tables. This will avoid extra motion of data between segments on the load operation. Learn more about table distribution in Distribution
<custom_option>	One of the custom options provided in the `LOCATION` string as described below
SERVER=<server_name>	The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used
IGNORE_MISSING_PATH	The action to take when `<path_to_data>` is missing or invalid. If set to `false` (default), an error is returned. If set to `true`, PXF ignores missing path errors and returns an empty fragment
COMPRESSION_TYPE	The compression type to employ when writing data: `RECORD` (the default) or `BLOCK`
COMPRESSION_CODEC	The compression codec to use when writing data: `default`, `bzip2`, `gzip`, or `uncompressed` (no compression). If not provided (or `uncompressed` is provided), no data compression is performed

<table_name>

The name of the table to create

<column_name>

The name of the column to create

<data_type>

The data type of the column

LIKE <other_table>

Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy

<path_to_data>

The path to the directory or file in the HDFS data store. When the <server_name> configuration includes a pxf.fs.basePath property setting, the <path_to_data> value is considered to be relative to the base path specified. Otherwise, the path is considered absolute. The value must not include the dollar sign ($) character

PROFILE=hdfs:SequenceFile

The hdfs:SequenceFile profile is used to read or write SequenceFile-format data in HDFS

DATA_SCHEMA=<data_schema>

The name of the serialization/deserialization class, which provides read and write methods for each data type referenced in the data schema. The JAR file in which this class resides must be in the PXF classpath

FORMAT 'CUSTOM'

The custom format with the built-in custom formatter functions for read (pxfwritable_import) and write (pxfwritable_export) operations is used to work with SequenceFile-format data in HDFS

DISTRIBUTED BY

When loading data from a Greengage DB table into a writable external table, consider specifying the same distribution policy or column name on both tables. This will avoid extra motion of data between segments on the load operation. Learn more about table distribution in Distribution

<custom_option>

One of the custom options provided in the LOCATION string as described below

SERVER=<server_name>

The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used

IGNORE_MISSING_PATH

The action to take when <path_to_data> is missing or invalid. If set to false (default), an error is returned. If set to true, PXF ignores missing path errors and returns an empty fragment

COMPRESSION_TYPE

The compression type to employ when writing data: RECORD (the default) or BLOCK

COMPRESSION_CODEC

The compression codec to use when writing data: default, bzip2, gzip, or uncompressed (no compression). If not provided (or uncompressed is provided), no data compression is performed

When a Greengage DB external table references a SequenceFile or another data format that stores rows in a key/value format, you can access the key values in Greengage DB queries by using the recordkey keyword as a field name.

The field type of recordkey must correspond to the key type. You can define recordkey to be any of the following Hadoop types:

BooleanWritable
ByteWritable
DoubleWritable
FloatWritable
IntWritable
LongWritable
Text

If no record key is defined for a row, Greengage DB returns the ID of the segment that processed the row.

These examples demonstrate how to configure and use the PXF HDFS connector for reading and writing SequenceFile data in HDFS by using external tables.

Prerequisites

To try out the practical examples, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the customers test database and connect to it:

DROP DATABASE IF EXISTS customers;
CREATE DATABASE customers;
\c customers

To be able to create an external table using the PXF protocol, enable the PXF extension in the database as described in Register PXF in a database in PXF documentation:

CREATE EXTENSION pxf;

To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:

Log in to the Greengage DB master host as gpadmin.
Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:
```
$ mkdir $PXF_BASE/servers/hadoop
$ cd $PXF_BASE/servers/hadoop
```

Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:

$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
$ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
$ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
$ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .

Synchronize the server configuration to the Greengage DB cluster hosts:
```
$ pxf cluster sync
```

In the /tmp directory on the HDFS host, prepare the directory structure for the serialization/deserialization Java class and create the CustomerWritable.java file:

$ mkdir -p /tmp/pxf_seqfile_example/com/example/pxf/hdfs/writable/dataschema
$ cd /tmp/pxf_seqfile_example/com/example/pxf/hdfs/writable/dataschema
$ vi CustomerWritable.java

Add the following content to the CustomerWritable.java file:

package com.example.pxf.hdfs.writable.dataschema;

import org.apache.hadoop.io.*;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.lang.reflect.Field;

public class CustomerWritable implements Writable {

    public int int1;
    public String st1, st2, st3;

    public CustomerWritable() {
        int1 = 0;
        st1 = "";
        st2 = "";
        st3 = "";
    }

    int getInt1() {
        return int1;
    }

    String getSt1() {
        return st1;
    }

    String getSt2() {
        return st2;
    }

    String getSt3() {
        return st3;
    }

    @Override
    public void write(DataOutput out) throws IOException {

        IntWritable intw = new IntWritable();
        intw.set(int1);
        intw.write(out);

        Text txt = new Text();
        txt.set(st1);
        txt.write(out);
        txt.set(st2);
        txt.write(out);
        txt.set(st3);
        txt.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {

        IntWritable intw = new IntWritable();
        intw.readFields(in);
        int1 = intw.get();

        Text txt = new Text();
        txt.readFields(in);
        st1 = txt.toString();
        txt.readFields(in);
        st2 = txt.toString();
        txt.readFields(in);
        st3 = txt.toString();
    }
}

Save and close the file.
Compile a CustomerWritable class and package the resulting CustomerWritable.class file into a JAR file. Provide a classpath that includes the hadoop-common.jar file for your Hadoop system and copy the resulting customerwritable.jar file to the /tmp directory:
```
$ javac -classpath /usr/lib/hadoop/hadoop-common.jar CustomerWritable.java
$ cd /tmp/pxf_seqfile_example/
$ jar cf customerwritable.jar com
$ cp customerwritable.jar /tmp/
```
Copy the customerwritable.jar file to the Greengage DB master host, for example:
```
$ scp /tmp/customerwritable.jar gpadmin@mdw:/home/gpadmin
```
On the Greengage DB master host, copy the customerwritable.jar JAR file to the PXF runtime configuration directory ($PXF_BASE):
```
$ cp /home/gpadmin/customerwritable.jar $PXF_BASE/lib/customerwritable.jar
```
Synchronize the PXF configuration to the Greengage DB cluster and then restart PXF on all hosts:
```
$ pxf cluster sync
$ pxf cluster restart
```
On the Greengage DB master host, create the writable external table that stores data into the /tmp/pxf_examples/customers HDFS directory. In the LOCATION clause, specify the PXF hdfs:SequenceFile profile and the server configuration. Use the DATA_SCHEMA option to specify the fully qualified name of the CustomerWritable serialization/deserialization Java class. In the FORMAT clause, specify pxfwritable_export, which is the built-in custom formatter function for write operations:
```
CREATE WRITABLE EXTERNAL TABLE customers_w (
    id INT,
    name TEXT, 
    email TEXT, 
    address TEXT    
    )
    
    LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:SequenceFile&DATA_SCHEMA=com.example.pxf.hdfs.writable.dataschema.CustomerWritable&SERVER=hadoop')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```

Insert some data into the customers_w table:

INSERT INTO customers_w (id, name, email, address)
VALUES (1,'John Doe','john.doe@example.com','123 Elm Street'),
       (2,'Jane Smith','jane.smith@example.com','456 Oak Street'),
       (3,'Bob Brown','bob.brown@example.com','789 Pine Street'),
       (4,'Rob Stuart','rob.stuart@example.com','119 Willow Street');

On the HDFS host, view the contents of the created /tmp/pxf_examples/customers HDFS directory:

$ hdfs dfs -ls /tmp/pxf_examples/customers

The output should look similar to the following:

Found 2 items
-rw-r--r--   3 gpadmin hadoop        340 2026-03-06 15:27 /tmp/pxf_examples/customers/119-0000000004_1
-rw-r--r--   3 gpadmin hadoop        210 2026-03-06 15:27 /tmp/pxf_examples/customers/119-0000000004_2

On the Greengage DB master host, create an external table that reads files from the /tmp/pxf_examples/customers HDFS directory. In the LOCATION clause, specify the PXF hdfs:SequenceFile profile and the server configuration. Use the DATA_SCHEMA option to specify the fully qualified name of the CustomerWritable serialization/deserialization Java class. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:
```
CREATE EXTERNAL TABLE customers_r (
    id INT,
    name TEXT,
    email TEXT,
    address TEXT
    )

    LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:SequenceFile&DATA_SCHEMA=com.example.pxf.hdfs.writable.dataschema.CustomerWritable&SERVER=hadoop')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

Query the created external table:

SELECT * FROM customers_r;

The output should look as follows:

 id |    name    |         email          |      address
----+------------+------------------------+-------------------
  2 | Jane Smith | jane.smith@example.com | 456 Oak Street
  3 | Bob Brown  | bob.brown@example.com  | 789 Pine Street
  1 | John Doe   | john.doe@example.com   | 123 Elm Street
  4 | Rob Stuart | rob.stuart@example.com | 119 Willow Street
(4 rows)

On the Greengage DB master host, create an external table similar to the one created earlier but having the extra recordkey first column of type BIGINT. In the LOCATION clause, specify the PXF hdfs:SequenceFile profile and the server configuration. Use the DATA_SCHEMA option to specify the fully qualified name of the CustomerWritable serialization/deserialization Java class. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:
```
CREATE EXTERNAL TABLE customers_rkey (
    recordkey BIGINT,
    id INT,
    name TEXT,
    email TEXT,
    address TEXT
    )

    LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:SequenceFile&DATA_SCHEMA=com.example.pxf.hdfs.writable.dataschema.CustomerWritable&SERVER=hadoop')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

Query the created external table:

SELECT * FROM customers_rkey;

The output should similar to the following. Since no record key was defined when inserting the rows into the writable table, the recordkey column identifies the ID of the segment on which the row data was processed:

 recordkey | id |    name    |         email          |      address
-----------+----+------------+------------------------+-------------------
         2 |  4 | Rob Stuart | rob.stuart@example.com | 119 Willow Street
         1 |  1 | John Doe   | john.doe@example.com   | 123 Elm Street
         1 |  2 | Jane Smith | jane.smith@example.com | 456 Oak Street
         1 |  3 | Bob Brown  | bob.brown@example.com  | 789 Pine Street
(4 rows)

ORC

Multiline text

Use PXF HDFS connector to read and write SequenceFile data between Greengage DB and HDFS

Create an external table using the PXF protocol

Read the record key

Examples

Configure the PXF HDFS connector

Create a writable external table

Create a readable external table

Use record keys