Configure a PXF server

Anton Monakov

Reading data from or writing data to an external data store is performed via a PXF connector. To access an external data store, you must provide the server location, client access credentials, and other data store-specific properties.

A PXF server is a named connector-specific configuration that provides access to a specific external data store. For example, a PXF JDBC Connector server definition may include settings for the JDBC driver class name, URL, username, password, and other connection-specific and session-specific properties.

A server definition must be created for each accessed external data store. For example, to access two Hadoop clusters, you need to create a PXF Hadoop server configuration for each cluster. To access an Oracle and a MySQL database, you need to create one or more PXF JDBC server configurations for each database.

A server definition may include default settings for user access credentials and other properties for the external data store. You can allow Greengage DB users to access the external data store using the default settings, or you can configure access and other properties on a per-user basis. This lets you configure different Greengage DB users with different external data store access credentials in a single PXF server definition.

Server template files

PXF provides a template configuration file for each connector, which contains the typical set of properties that have to be configured to use the connector. The following server template configuration files are stored in the $PXF_HOME/templates/ directory:

  • adl-site.xml — provides configuration for Azure Data Lake Storage using the ADL connector.

  • core-site.xml — provides core Hadoop configuration settings, often used as a base for other connectors accessing HDFS.

  • gs-site.xml — provides configuration for Google Cloud Storage (GCS).

  • hbase-site.xml — provides configuration for Apache HBase.

  • hdfs-site.xml — defines connection parameters for accessing Hadoop Distributed File System (HDFS) clusters.

  • hive-site.xml — provides configuration for Hive Metastore.

  • jdbc-site.xml — lets you configure JDBC connection to relational databases.

  • mapred-site.xml — provides configuration for MapReduce jobs.

  • minio-site.xml — provides configuration for MinIO object storage, an S3-compatible storage.

  • pxf-site.xml — contains PXF-specific configuration settings, such as logging levels and server properties.

  • s3-site.xml — provides configuration for Amazon Simple Storage Service (S3).

  • wasbs-site.xml — provides configuration for Azure Blob Storage using the WASB connector.

  • yarn-site.xml — provides configuration for Apache YARN, used by some PXF connectors for resource management.

NOTE

The template files for the Hadoop connectors are not intended to be modified and used for configuration, as they only provide an example of the information needed. Instead of modifying the Hadoop templates, you will copy several Hadoop *-site.xml files from the Hadoop cluster to your PXF Hadoop server configuration.
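
As an illustrative sketch only (the server name hdp and the Hadoop host and paths are hypothetical), populating a Hadoop server configuration from an existing cluster could look like this:

```shell
# Create a directory for the new PXF server configuration
mkdir -p $PXF_BASE/servers/hdp

# Copy the live *-site.xml files from a Hadoop cluster node,
# rather than editing the templates in $PXF_HOME/templates/
scp hadoop-admin@namenode.example.com:/etc/hadoop/conf/core-site.xml \
    $PXF_BASE/servers/hdp/
scp hadoop-admin@namenode.example.com:/etc/hadoop/conf/hdfs-site.xml \
    $PXF_BASE/servers/hdp/
```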

Configure a server

To configure a PXF connector to an external data store, you need to add a named PXF server configuration for the connector. The configuration information resides in one or more <connector>-site.xml files in the $PXF_BASE/servers/<server_name>/ directory. The name of a directory in $PXF_BASE/servers/ should correspond to the configured PXF server name.

PXF also uses a special server named default. During PXF installation, an empty $PXF_BASE/servers/default/ directory is created, which identifies the default PXF server configuration. You can configure and assign the default PXF server to any external data source, for example, a Hadoop cluster, or a MySQL database that you frequently access. PXF automatically uses the default server configuration if you omit the SERVER=<server_name> option in the CREATE EXTERNAL TABLE command LOCATION clause.

Configuring a server involves the following general steps:

  1. Determine if you are configuring the default PXF server or creating a new server configuration.

  2. Create the $PXF_BASE/servers/<server_name> directory.

  3. Copy the template or other configuration files to the new server directory.

  4. Fill in the appropriate values for the properties in the template file.

  5. Add any additional configuration properties and values required for your environment.

  6. Configure one or more users for the server configuration as described in Configure a PXF user.

  7. Synchronize the server and user configuration to the Greengage DB cluster by running the pxf cluster sync command. See Overview of the pxf cluster commands for the full description of the available pxf cluster commands.
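
The steps above can be sketched as a shell session; the server name jdbc-pg and the choice of the JDBC template are illustrative:

```shell
# 2. Create the server directory
mkdir -p $PXF_BASE/servers/jdbc-pg

# 3. Copy the connector template into it
cp $PXF_HOME/templates/jdbc-site.xml $PXF_BASE/servers/jdbc-pg/

# 4-5. Fill in property values for your environment
vi $PXF_BASE/servers/jdbc-pg/jdbc-site.xml

# 7. Synchronize the configuration to all Greengage DB cluster hosts
pxf cluster sync
```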

After you configure a PXF server, it becomes available to the Greengage DB users who need access to the corresponding data store. A user only needs to provide the server name when creating an external table that accesses the external data store. PXF obtains the external data source location and access credentials from the server and the user configuration files residing in the server configuration directory. Learn more about using the created configuration in Use a server configuration in external tables.

Configuration property precedence

A PXF server configuration may include default settings for user access credentials and other properties for accessing an external data store. Some PXF connectors, for example S3 or JDBC, let you directly specify certain server properties via custom options in the LOCATION clause of the CREATE EXTERNAL TABLE command. A <greengage_user_name>-user.xml file specifies property settings for an external data store that are specific to a Greengage DB user.

For a Greengage DB user, the following precedence rules (highest to lowest) apply to obtain the configuration property settings:

  1. A property that is specified in the <server_name>/<greengage_user_name>-user.xml file overrides any setting of the property elsewhere.

  2. A property that is specified via custom options in the LOCATION clause of the CREATE EXTERNAL TABLE command overrides any setting of this property in a PXF server configuration.

  3. The properties specified in the PXF server definition define the default property values.

These precedence rules let you create a single external table that can be accessed by multiple Greengage DB users, each with their own unique external data store user credentials.
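
For illustration, with hypothetical file contents and names, a JDBC server definition might set a shared default login in jdbc-site.xml, while alice-user.xml overrides it for the Greengage DB user alice:

```xml
<!-- $PXF_BASE/servers/pgsrv/jdbc-site.xml: default for all users of this server -->
<property>
    <name>jdbc.user</name>
    <value>shared_reader</value>
</property>

<!-- $PXF_BASE/servers/pgsrv/alice-user.xml: overrides the default for alice -->
<property>
    <name>jdbc.user</name>
    <value>alice_db_account</value>
</property>
```

A value supplied as a custom option in the LOCATION clause would rank between these two levels: above the server default, below the per-user file.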

pxf-site.xml configuration file

PXF includes a template file named pxf-site.xml for PXF-specific configuration parameters. You can use the pxf-site.xml template file to configure the following:

  • Kerberos or user impersonation settings for server configurations.

  • A base directory for file access.

  • The action that is performed when PXF detects an overflow condition while writing numeric ORC or Parquet data.

NOTE

The Kerberos and user impersonation settings in the pxf-site.xml file may apply only to Hadoop and JDBC server configurations; they do not apply to file system or object storage server configurations.

You need to configure properties in the pxf-site.xml file for a PXF server when any of the following conditions is met:

  • The remote Hadoop system utilizes Kerberos authentication.

  • You want to activate or deactivate user impersonation on the remote Hadoop or external database system.

  • You want to activate or deactivate Kerberos constrained delegation for a Hadoop PXF server.

  • You access a network file system with the server configuration.

  • You access a remote Hadoop or object storage file system with the server configuration and want to allow a user to access only a specific directory and subdirectories.
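
In these cases, the usual starting point is to copy the pxf-site.xml template into the server configuration directory and edit it there; the server name my-server below is illustrative:

```shell
cp $PXF_HOME/templates/pxf-site.xml $PXF_BASE/servers/my-server/
vi $PXF_BASE/servers/my-server/pxf-site.xml
```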

The following properties are included in the pxf-site.xml file.

  • pxf.service.kerberos.principal — the Kerberos principal name. Default: gpadmin/_HOST@EXAMPLE.COM.

  • pxf.service.kerberos.keytab — the file system path to the Kerberos keytab file. Default: $PXF_BASE/keytabs/pxf.service.keytab.

  • pxf.service.kerberos.constrained-delegation — activates or deactivates Kerberos constrained delegation. This property applies only to Hadoop PXF server configurations; it does not apply to JDBC PXF servers. Default: false.

  • pxf.service.kerberos.ticket-renew-thread-enabled — indicates whether PXF should use a background thread that refreshes a Kerberos ticket-granting ticket (TGT) when its remaining lifespan reaches the renewal window. Default: false.

  • pxf.service.kerberos.ticket-renew-window — the minimum elapsed fraction of a ticket lifespan after which PXF attempts to renew or refresh a Kerberos ticket. The value ranges from 0 (PXF generates a new ticket for all requests) to 1 (PXF renews only after the full ticket lifespan). Default: 0.8 (80%).

  • pxf.service.kerberos.ticket-renew-max-backoff-seconds — the maximum amount of time in seconds between consecutive Kerberos TGT renewal attempts. Default: 300.

  • pxf.service.kerberos.ticket-renew-timeout-seconds — the maximum amount of time in seconds that Kerberos TGT renewal may take. If multiple renewal attempts fail to obtain a new ticket within this interval, the renewal attempts stop, the TGT renewal thread exits, and the query fails. Default: 3600.

  • pxf.service.user.impersonation — activates or deactivates user impersonation when connecting to the remote system. Default: if the property is not set, true (activated) for PXF Hadoop servers and false (deactivated) for JDBC servers.

  • pxf.service.user.name — the login user for the remote system. If the property is not set, the default value is the operating system user that starts the PXF process, typically gpadmin. If the property is set, the way PXF applies it depends on the user impersonation setting (pxf.service.user.impersonation) and, if you are accessing Hadoop, on whether the accessed cluster is secured with Kerberos. Default: none; the property is commented out by default.

  • pxf.fs.basePath — identifies the base path or share point on the remote file system. The property must be set when the server configuration is used with a file:* profile; when set, it applies to any profile that accesses a file, including *:text, *:parquet, or *:json. The property is optional for a PXF server that accesses a file in Hadoop or in an object storage. You can use this property to restrict access to files in a specific remote directory: when the property is set, PXF considers the file path specified in the LOCATION clause of the CREATE EXTERNAL TABLE command to be relative to this base path and constructs the full remote path accordingly. Default: none; the property is commented out by default.

  • pxf.ppd.hive — specifies whether predicate pushdown is enabled for queries on external tables that use the hive, hive:rc, or hive:orc profiles. You can override this setting on a per-table basis by specifying the PPD=<boolean> option in the LOCATION clause when creating the external table. Default: true (predicate pushdown is enabled).

  • pxf.sasl.connection.retries — specifies the maximum number of times PXF retries a SASL connection request after a refused connection returns a GSS initiate failed error. Default: 5.

  • pxf.orc.write.decimal.overflow — specifies how PXF handles numeric data that exceeds the maximum precision of 38 and overflows when writing to an ORC file. The accepted values are round, error, and ignore. Default: round.

  • pxf.parquet.write.decimal.overflow — specifies how PXF handles numeric data that exceeds the maximum precision of 38 and overflows when writing to a Parquet file. The accepted values are round, error, and ignore. Default: round.

  • pxf.hdfs.fragmenter.sort-by-date — specifies whether PXF sorts the list of files received from HDFS-compatible systems by modification date instead of file name. It does not apply to JDBC or HBase PXF servers. Default: false (files are sorted by file name).
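
As an illustrative sketch only (the base path and values below are hypothetical examples, not defaults), a pxf-site.xml that restricts file access to one remote directory and disables user impersonation could look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- Hypothetical base directory: LOCATION paths become relative to it -->
    <property>
        <name>pxf.fs.basePath</name>
        <value>/data/exports</value>
    </property>
    <!-- Access the remote system as the PXF process user,
         not as the querying Greengage DB user -->
    <property>
        <name>pxf.service.user.impersonation</name>
        <value>false</value>
    </property>
</configuration>
```

With this configuration, a path such as reports/file.csv in a LOCATION clause would resolve to /data/exports/reports/file.csv on the remote system.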

Configure a PXF user

You can configure access to an external data store on a per-server, per-user basis.

NOTE

PXF per-server, per-user configuration provides the most benefit for JDBC servers.

You configure external data store user access credentials and properties for a specific Greengage DB user by providing a <greengage_user_name>-user.xml user configuration file in the PXF server configuration directory, $PXF_BASE/servers/<server_name>/. You can configure zero, one, or more users in a PXF server configuration.

The properties that you specify in a user configuration file are connector-specific. You can specify any configuration property supported by the PXF connector server in a <greengage_user_name>-user.xml configuration file.

For each PXF user that you want to configure, do the following:

  1. Identify the name of the Greengage DB user.

  2. Identify the PXF server definition for which you want to configure user access.

  3. Identify the name and value of each property that you want to configure for the user.

  4. Create or edit the $PXF_BASE/servers/<server_name>/<greengage_user_name>-user.xml file and add the outer configuration block:

    <configuration>
    
    </configuration>
  5. Add each property/value pair for each required property within the configuration block.

  6. If you are adding the user configuration to a previously configured PXF server definition, synchronize the user configuration to the Greengage DB cluster.

NOTE

A Greengage DB user who queries or writes to an external table accesses the external data store with the credentials configured for the PXF server user. If no user-specific credentials are configured for the PXF server, the Greengage DB user accesses the external data store with the default credentials configured for the PXF server.

Example

Suppose you have configured access to a PostgreSQL database in the PXF JDBC server configuration named pgsrv1. To allow the Greengage DB user named bill to access this database as the PostgreSQL user named pguser1 with password password, create the user configuration file $PXF_BASE/servers/pgsrv1/bill-user.xml with the following properties:

<configuration>
    <property>
        <name>jdbc.user</name>
        <value>pguser1</value>
    </property>
    <property>
        <name>jdbc.password</name>
        <value>password</value>
    </property>
</configuration>

To configure a specific search path and a larger read fetch size for bill, add the following properties to the bill-user.xml user configuration file:

<property>
    <name>jdbc.session.property.search_path</name>
    <value>bill_schema</value>
</property>
<property>
    <name>jdbc.statement.fetchSize</name>
    <value>2000</value>
</property>

Use a server configuration in external tables

To access an external data store, specify the server name using the SERVER=<server_name> option in the LOCATION clause of the CREATE EXTERNAL TABLE command. The provided server name identifies the directory from which PXF obtains the configuration and credentials to access the external data store. If no SERVER=<server_name> option is provided, PXF automatically uses the default server configuration.
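
For instance, the following hypothetical definitions (table, column, and path names are illustrative) differ only in which server configuration PXF reads:

```sql
-- Uses the named server configured in $PXF_BASE/servers/example-s3/
CREATE EXTERNAL TABLE sales_named (id INTEGER, amount NUMERIC)
LOCATION ('pxf://sales/data.csv?PROFILE=s3:csv&SERVER=example-s3')
FORMAT 'CSV';

-- Omitting SERVER makes PXF fall back to $PXF_BASE/servers/default/
CREATE EXTERNAL TABLE sales_default (id INTEGER, amount NUMERIC)
LOCATION ('pxf://sales/data.csv?PROFILE=s3:csv')
FORMAT 'CSV';
```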

Example

Prerequisites

To try out the practical example, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the customers test database and connect to it:

DROP DATABASE IF EXISTS customers;
CREATE DATABASE customers;
\c customers

To be able to create an external table using the PXF protocol, enable the PXF extension in the database as described in Register PXF in a database:

CREATE EXTENSION pxf;

This example uses the orders.csv file residing in the customers bucket on the storage.example.com S3 host and having the following content:

id,name,price
1,Laptop,999.99
2,Smartphone,499.99
3,Tablet,299.99
4,Monitor,599.99
5,Keyboard,99.99
  1. On the master host, create the server configuration for accessing the S3 object storage. The created s3-site.xml file is based on the minio-site.xml template. It is stored in the $PXF_BASE/servers/example-s3/ directory and has the following content:

    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
        <property>
            <name>fs.s3a.endpoint</name>
            <value>storage.example.com</value>
        </property>
        <property>
            <name>fs.s3a.access.key</name>
            <value>ACCESS_KEY</value>
        </property>
        <property>
            <name>fs.s3a.secret.key</name>
            <value>SECRET_KEY</value>
        </property>
        <property>
            <name>fs.s3a.fast.upload</name>
            <value>true</value>
        </property>
        <property>
            <name>fs.s3a.path.style.access</name>
            <value>true</value>
        </property>
    </configuration>
  2. Synchronize the server configuration to the Greengage DB cluster hosts:

    $ pxf cluster sync

    The output should look similar to the following:

    Syncing PXF configuration files from coordinator host to standby coordinator host and 2 segment hosts...
    PXF configs synced successfully on 3 out of 3 hosts
  3. Run the CREATE EXTERNAL TABLE command. In the LOCATION clause, provide the orders.csv path and file name and set the following options:

    • PROFILE=s3:csv — the PXF profile used for reading delimited single-line CSV data.

    • SERVER=example-s3 — the name of the server configuration created earlier.

    • SKIP_HEADER_COUNT=1 — the number of header rows to skip.

    In the FORMAT clause, set CSV as the data format:

    CREATE EXTERNAL TABLE orders_s3(
        id INTEGER,
        name VARCHAR,
        price NUMERIC    
    )
    LOCATION ('pxf://customers/orders.csv?PROFILE=s3:csv&SERVER=example-s3&SKIP_HEADER_COUNT=1')
    FORMAT 'CSV';
  4. Query the created external table:

    SELECT * FROM orders_s3;

    The output should look as follows:

     id |    name    | price
    ----+------------+--------
      1 | Laptop     | 999.99
      2 | Smartphone | 499.99
      3 | Tablet     | 299.99
      4 | Monitor    | 599.99
      5 | Keyboard   |  99.99