Use PXF HDFS connector to read and write Parquet-format data between Greengage DB and HDFS

Anton Monakov

Contents

Data type mapping
- Read mapping
- Write mapping
Create an external table using the PXF protocol
Numeric data overflow conditions
Examples

Apache Parquet is an open-source column-oriented data storage format used in the Apache Hadoop ecosystem. A Parquet data file contains a compact binary representation of the data and includes an embedded schema. The schema defines the structure of the data and is composed of the primitive and complex types listed in Data type mapping.

This topic describes how to configure and use the PXF HDFS connector for reading and writing Parquet data stored in HDFS by using external tables and provides practical examples.

To read and write Parquet primitive data types in Greengage DB, map Parquet data values to Greengage DB columns of the same type.

Parquet supports a small set of primitive data types and uses metadata annotations that specify how to interpret a primitive type and extend it into a logical type. For example, Parquet stores both the INTEGER and the DATE types as the INT32 primitive type, while an annotation identifies the original type as DATE.

PXF uses the following data type mapping when reading Parquet data.

Parquet physical type Parquet logical type PXF / Greengage DB data type

Parquet physical type	Parquet logical type	PXF / Greengage DB data type
boolean	—	BOOLEAN
`binary` (`byte_array`)	—	BYTEA
`binary` (`byte_array`)	Date	DATE
`binary` (`byte_array`)	Timestamp_millis	TIMESTAMP
`binary` (`byte_array`)	UTF8	TEXT
double	—	FLOAT8
fixed_len_byte_array	Decimal	NUMERIC
float	—	REAL
int32	int_8	SMALLINT
int32	Date	DATE
int32	Decimal	NUMERIC
int32	—	INTEGER
int64	Decimal	NUMERIC
int64	—	BIGINT
int96	—	TIMESTAMP

boolean

—

BOOLEAN

binary (byte_array)

—

BYTEA

binary (byte_array)

Date

DATE

binary (byte_array)

Timestamp_millis

TIMESTAMP

binary (byte_array)

UTF8

TEXT

double

—

FLOAT8

fixed_len_byte_array

Decimal

NUMERIC

float

—

REAL

int32

int_8

SMALLINT

int32

Date

DATE

int32

Decimal

NUMERIC

int32

—

INTEGER

int64

Decimal

NUMERIC

int64

—

BIGINT

int96

—

TIMESTAMP

NOTE

PXF supports filter predicate pushdown on all parquet data types listed above, except the fixed_len_byte_array and int96 types.

PXF can read a Parquet LIST nested type when it represents a one-dimensional array of the following Parquet types.

Parquet data type PXF / Greengage DB data type

Parquet data type	PXF / Greengage DB data type
List of `<boolean>`	BOOLEAN[]
List of `<binary>`	BYTEA[]
List of `<binary>` (`Date`)	DATE[]
List of `<binary>` (`Timestamp_millis`)	TIMESTAMP[]
List of `<binary>` (`UTF8`)	TEXT[]
List of `<double>`	FLOAT8[]
List of `<fixed_len_byte_array>` (`Decimal`)	NUMERIC[]
List of `<float>`	REAL[]
List of `<int32>` (`int_8`)	SMALLINT[]
List of `<int32>` (`Date`)	DATE[]
List of `<int32>` (`Decimal`)	NUMERIC[]
List of `<int32>`	INTEGER[]
List of `<int64>` (`Decimal`)	NUMERIC[]
List of `<int64>`	BIGINT[]
List of `<int96>`	TIMESTAMP[]

List of <boolean>

BOOLEAN[]

List of <binary>

BYTEA[]

List of <binary> (Date)

DATE[]

List of <binary> (Timestamp_millis)

TIMESTAMP[]

List of <binary> (UTF8)

TEXT[]

List of <double>

FLOAT8[]

List of <fixed_len_byte_array> (Decimal)

NUMERIC[]

List of <float>

REAL[]

List of <int32> (int_8)

SMALLINT[]

List of <int32> (Date)

DATE[]

List of <int32> (Decimal)

NUMERIC[]

List of <int32>

INTEGER[]

List of <int64> (Decimal)

NUMERIC[]

List of <int64>

BIGINT[]

List of <int96>

TIMESTAMP[]

PXF uses the following data type mapping when writing Parquet data.

PXF / Greengage DB data type Parquet physical type Parquet logical type

PXF / Greengage DB data type	Parquet physical type	Parquet logical type
BIGINT	int64	–
BOOLEAN	boolean	–
BPCHAR (1)	`binary` (`byte_array`)	UTF8
BYTEA	`binary` (`byte_array`)	–
DATE	int32	Date
FLOAT8	double	–
INTEGER	int32	–
`NUMERIC` / `DECIMAL`	fixed_len_byte_array	Decimal
REAL	float	–
SMALLINT	int32	int_8
TEXT	`binary` (`byte_array`)	UTF8
TIMESTAMP (2)	int96	–
TIMESTAMPTZ (3)	int96	–
VARCHAR	`binary` (`byte_array`)	UTF8
Others	Unsupported	–

BIGINT

int64

–

BOOLEAN

boolean

–

BPCHAR (1)

binary (byte_array)

UTF8

BYTEA

binary (byte_array)

–

DATE

int32

Date

FLOAT8

double

–

INTEGER

int32

–

NUMERIC / DECIMAL

fixed_len_byte_array

Decimal

REAL

float

–

SMALLINT

int32

int_8

TEXT

binary (byte_array)

UTF8

TIMESTAMP (2)

int96

–

TIMESTAMPTZ (3)

int96

–

VARCHAR

binary (byte_array)

UTF8

Others

Unsupported

–

PXF can write a one-dimensional LIST array of the following Parquet data types.

PXF / Greengage DB data type Parquet data type

PXF / Greengage DB data type	Parquet data type
BIGINT[]	List of `<int64>`
BOOLEAN[]	List of `<boolean>`
BPCHAR[] (1)	List of `<binary>` (`UTF8`)
BYTEA[]	List of `<binary>`
DATE[]	List of `<int32>` (`Date`)
FLOAT8[]	List of `<double>`
INTEGER[]	List of `<int32>`
NUMERIC[] / DECIMAL[]	List of `<fixed_len_byte_array>` (`Decimal`)
REAL[]	List of `<float>`
SMALLINT[]	List of `<int32>` (`int_8`)
TEXT[]	List of `<binary>` (`UTF8`)
TIMESTAMP[] (2)	List of `<int96>`
TIMESTAMPTZ[] (3)	List of `<int96>`
VARCHAR[]	List of `<binary>` (`UTF8`)
Others	Unsupported

BIGINT[]

List of <int64>

BOOLEAN[]

List of <boolean>

BPCHAR[] (1)

List of <binary> (UTF8)

BYTEA[]

List of <binary>

DATE[]

List of <int32> (Date)

FLOAT8[]

List of <double>

INTEGER[]

List of <int32>

NUMERIC[] / DECIMAL[]

List of <fixed_len_byte_array> (Decimal)

REAL[]

List of <float>

SMALLINT[]

List of <int32> (int_8)

TEXT[]

List of <binary> (UTF8)

TIMESTAMP[] (2)

List of <int96>

TIMESTAMPTZ[] (3)

List of <int96>

VARCHAR[]

List of <binary> (UTF8)

Others

Unsupported

Parquet does not save the field length, so a BPCHAR value written to Parquet will be a text of undefined length.
PXF localizes the TIMESTAMP type to the current system time zone and converts it to universal time (UTC) before finally converting to int96.
PXF converts the TIMESTAMPTZ type to a UTC timestamp and then converts to int96. The time zone information is lost during this conversion.

To create a Greengage DB external table to read or write Parquet-format data in HDFS, use the following general syntax:

CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )

    LOCATION ('pxf://<path_to_data>?PROFILE=hdfs:parquet[&<custom_option>=<value>[...]]')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export')
    [DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];

Keyword Value

Keyword	Value
<table_name>	The name of the table to create
<column_name>	The name of the column to create
<data_type>	The data type of the column
LIKE <other_table>	Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy
<path_to_data>	The path to the directory or file in the HDFS data store. When the `<server_name>` configuration includes a `pxf.fs.basePath` property setting, the `<path_to_data>` value is considered to be relative to the base path specified. Otherwise, the path is considered absolute. The value must not include the dollar sign (`$`) character
PROFILE	The `hdfs:parquet` profile is used to read or write Parquet-format data in HDFS
FORMAT 'CUSTOM'	The custom format with the built-in custom formatter functions for read (`pxfwritable_import`) and write (`pxfwritable_export`) operations is used to work with Parquet-format data in HDFS
DISTRIBUTED BY	When loading data from a Greengage DB table into a writable external table, consider specifying the same distribution policy or column name on both tables. This will avoid extra motion of data between segments on the load operation. Learn more about table distribution in Distribution
<custom_option>	One of the custom options provided in the `LOCATION` string as described below
SERVER=<server_name>	The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used
IGNORE_MISSING_PATH	The action to take when `<path_to_data>` is missing or invalid. If set to `false` (default), an error is returned. If set to `true`, PXF ignores missing path errors and returns an empty fragment
COMPRESSION_CODEC	The compression codec to use when writing data: `snappy`, `gzip`, `lzo`, or `uncompressed` (no compression). You must explicitly specify `uncompressed` if you do not want PXF to compress the data. Otherwise, PXF compresses the data using Snappy compression
ROWGROUP_SIZE	The size (in bytes) of the row group, which provides a logical partitioning of the data into rows. The default row group size is 8 * 1024 * 1024 bytes
PAGE_SIZE	The size (in bytes) of a page, which divides row groups in a column into column chunks. The default page size is 1 * 1024 * 1024 bytes
ENABLE_DICTIONARY	Specifies whether to enable dictionary encoding. The default value is `true`; dictionary encoding is enabled when writing Parquet files
DICTIONARY_PAGE_SIZE	When dictionary encoding is enabled, defines a single dictionary page per column, per row group. `DICTIONARY_PAGE_SIZE` is similar to `PAGE_SIZE`, but is specified for the dictionary. The default dictionary page size is 1 * 1024 * 1024 bytes
PARQUET_VERSION	The Parquet version; the supported values are `v1` (default) and `v2`
SCHEMA	The absolute path to the Parquet schema file on the Greengage DB host or on HDFS. If not specified, PXF creates the schema based on the external table definition

<table_name>

The name of the table to create

<column_name>

The name of the column to create

<data_type>

The data type of the column

LIKE <other_table>

Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy

<path_to_data>

The path to the directory or file in the HDFS data store. When the <server_name> configuration includes a pxf.fs.basePath property setting, the <path_to_data> value is considered to be relative to the base path specified. Otherwise, the path is considered absolute. The value must not include the dollar sign ($) character

PROFILE

The hdfs:parquet profile is used to read or write Parquet-format data in HDFS

FORMAT 'CUSTOM'

The custom format with the built-in custom formatter functions for read (pxfwritable_import) and write (pxfwritable_export) operations is used to work with Parquet-format data in HDFS

DISTRIBUTED BY

When loading data from a Greengage DB table into a writable external table, consider specifying the same distribution policy or column name on both tables. This will avoid extra motion of data between segments on the load operation. Learn more about table distribution in Distribution

<custom_option>

One of the custom options provided in the LOCATION string as described below

SERVER=<server_name>

The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used

IGNORE_MISSING_PATH

The action to take when <path_to_data> is missing or invalid. If set to false (default), an error is returned. If set to true, PXF ignores missing path errors and returns an empty fragment

COMPRESSION_CODEC

The compression codec to use when writing data: snappy, gzip, lzo, or uncompressed (no compression).

You must explicitly specify uncompressed if you do not want PXF to compress the data. Otherwise, PXF compresses the data using Snappy compression

ROWGROUP_SIZE

The size (in bytes) of the row group, which provides a logical partitioning of the data into rows. The default row group size is 8 * 1024 * 1024 bytes

PAGE_SIZE

The size (in bytes) of a page, which divides row groups in a column into column chunks. The default page size is 1 * 1024 * 1024 bytes

ENABLE_DICTIONARY

Specifies whether to enable dictionary encoding. The default value is true; dictionary encoding is enabled when writing Parquet files

DICTIONARY_PAGE_SIZE

When dictionary encoding is enabled, defines a single dictionary page per column, per row group. DICTIONARY_PAGE_SIZE is similar to PAGE_SIZE, but is specified for the dictionary. The default dictionary page size is 1 * 1024 * 1024 bytes

PARQUET_VERSION

The Parquet version; the supported values are v1 (default) and v2

SCHEMA

The absolute path to the Parquet schema file on the Greengage DB host or on HDFS. If not specified, PXF creates the schema based on the external table definition

PXF uses the HiveDecimal class to write numeric Parquet data, which limits both the precision and the scale of a numeric type to a maximum of 38. When you define a NUMERIC column in an external table without specifying a precision or scale, PXF internally maps the column to DECIMAL(38, 18).

PXF handles the following precision overflow conditions:

A NUMERIC column is defined in the external table, and the total digit count of a value exceeds the maximum supported precision of 38, for example, 1234567890123456789012345678901234567890.12345, which has a total digit count of 45.
A NUMERIC(<precision>) column is defined with a <precision> value greater than 38, for example NUMERIC(55).
A NUMERIC column is defined in the external table, and the integer digit count of a value is greater than 20 (38-18), for example, 123456789012345678901234567890.12345, which has an integer digit count of 30.

If you define a NUMERIC(<precision>, <scale>) column and the integer digit count of a value is greater than <precision>-<scale>, PXF returns an error. For example, you define a NUMERIC(20,4) column and the value is 12345678901234567.12, whose integer digit count of 17 is greater than 20-4=16.

PXF can perform one of the following actions when detecting a numeric data overflow: round the value (the default), return an error, or ignore the overflow. The pxf.parquet.write.decimal.overflow property in the pxf-site.xml server configuration file specifies the action to take.

Value PXF action

Value	PXF action
round	The default behavior. When PXF encounters an overflow, it attempts to round the value to meet both precision and scale requirements before writing and reports an error if rounding fails. This may potentially leave an incomplete dataset in the external system
error	PXF reports an error when it encounters an overflow, and the transaction fails
ignore	PXF logs a warning and attempts to round the value to meet both precision and scale requirements; otherwise a `NULL` value is written

round

The default behavior. When PXF encounters an overflow, it attempts to round the value to meet both precision and scale requirements before writing and reports an error if rounding fails. This may potentially leave an incomplete dataset in the external system

error

PXF reports an error when it encounters an overflow, and the transaction fails

ignore

PXF logs a warning and attempts to round the value to meet both precision and scale requirements; otherwise a NULL value is written

NOTE

Learn about configuring a PXF server in Configure a PXF server in PXF documentation.

These examples demonstrate how to configure and use the PXF HDFS connector for reading and writing Parquet data in HDFS by using external tables. Since Parquet data is binary and thus not human-readable, a writable external table is used for creating data files, which are then read into Greengage DB via a readable table.

Prerequisites

To try out the practical examples, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the customers test database and connect to it:

DROP DATABASE IF EXISTS customers;
CREATE DATABASE customers;
\c customers

To be able to create an external table using the PXF protocol, enable the PXF extension in the database as described in Register PXF in a database in PXF documentation:

CREATE EXTENSION pxf;

To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:

Log in to the Greengage DB master host as gpadmin.
Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:
```
$ mkdir $PXF_BASE/servers/hadoop
$ cd $PXF_BASE/servers/hadoop
```

Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:

$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
$ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
$ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
$ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .

Synchronize the server configuration to the Greengage DB cluster hosts:
```
$ pxf cluster sync
```

On the Greengage DB master host, create the writable external table that stores data into the /tmp/pxf_examples/customers HDFS directory. In the LOCATION clause, specify the PXF hdfs:parquet profile and the server configuration. In the FORMAT clause, specify pxfwritable_export, which is the built-in custom formatter function for write operations:
```
CREATE WRITABLE EXTERNAL TABLE customers_w (
        id INT,
        name TEXT, 
        ordered_items TEXT[]    
    )
    
    LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:parquet&SERVER=hadoop')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```

Populate the customers_w table with sample data:

INSERT INTO customers_w 
VALUES (1, 'John Doe', ARRAY['laptop', 'monitor']),
       (2, 'Jane Smith', ARRAY['keyboard', 'mouse', 'pad']),
       (3, 'Bob Brown', ARRAY['headphones']),
       (4, 'Alice Green', ARRAY['webcam', 'microphone']);

On the HDFS host, view the contents of the created /tmp/pxf_examples/customers HDFS directory:

$ hdfs dfs -ls /tmp/pxf_examples/customers

The output should look similar to the following:

Found 4 items
-rw-r--r--   3 gpadmin hadoop        799 2026-02-05 14:38 /tmp/pxf_examples/customers/31-0000000034_0.snappy.parquet
-rw-r--r--   3 gpadmin hadoop        769 2026-02-05 14:38 /tmp/pxf_examples/customers/31-0000000034_1.snappy.parquet
-rw-r--r--   3 gpadmin hadoop        807 2026-02-05 14:38 /tmp/pxf_examples/customers/31-0000000034_2.snappy.parquet
-rw-r--r--   3 gpadmin hadoop        789 2026-02-05 14:38 /tmp/pxf_examples/customers/31-0000000034_3.snappy.parquet

On the Greengage DB master host, create an external table that reads files from the /tmp/pxf_examples/customers HDFS directory. In the LOCATION clause, specify the PXF hdfs:parquet profile and the server configuration. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:
```
CREATE EXTERNAL TABLE customers_r (
        id INT,
        name TEXT, 
        ordered_items TEXT[]
    )
    
    LOCATION('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:parquet&SERVER=hadoop')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

Query the created external table:

SELECT * FROM customers_r;

The output should look as follows:

 id |    name     |    ordered_items
----+-------------+----------------------
  1 | John Doe    | {laptop,monitor}
  3 | Bob Brown   | {headphones}
  4 | Alice Green | {webcam,microphone}
  2 | Jane Smith  | {keyboard,mouse,pad}
(4 rows)

Avro

ORC

Use PXF HDFS connector to read and write Parquet-format data between Greengage DB and HDFS

Data type mapping

Read mapping

Write mapping

Create an external table using the PXF protocol

Numeric data overflow conditions

Examples

Configure the PXF HDFS connector

Create a writable external table

Create a readable external table