Configure a PXF server
Reading data from or writing data to an external data store is performed via a PXF connector. To access an external data store, you must provide the server location, client access credentials, and other data store-specific properties.
A PXF server is a named connector-specific configuration that provides access to a specific external data store. For example, a PXF JDBC Connector server definition may include settings for the JDBC driver class name, URL, username, password, and other connection-specific and session-specific properties.
A server definition must be created for each accessed external data store. For example, to access two Hadoop clusters, you need to create a PXF Hadoop server configuration for each cluster. To access an Oracle and a MySQL database, you need to create one or more PXF JDBC server configurations for each database.
A server definition may include default settings for user access credentials and other properties for the external data store. You can allow Greengage DB users to access the external data store using the default settings, or you can configure access and other properties on a per-user basis. This lets you configure different Greengage DB users with different external data store access credentials in a single PXF server definition.
Server template files
PXF provides a template configuration file for each connector, which contains the typical set of properties that have to be configured to use the connector. The following server template configuration files are stored in the $PXF_HOME/templates/ directory:
- adl-site.xml — provides configuration for Azure Data Lake Storage using the ADL connector.
- core-site.xml — provides core Hadoop configuration settings, often used as a base for other connectors accessing HDFS.
- gs-site.xml — provides configuration for Google Cloud Storage (GCS).
- hbase-site.xml — provides configuration for Apache HBase.
- hdfs-site.xml — defines connection parameters for accessing Hadoop Distributed File System (HDFS) clusters.
- hive-site.xml — provides configuration for Hive Metastore.
- jdbc-site.xml — lets you configure JDBC connections to relational databases.
- mapred-site.xml — provides configuration for MapReduce jobs.
- minio-site.xml — provides configuration for MinIO object storage, an S3-compatible storage.
- pxf-site.xml — contains PXF-specific configuration settings, such as logging levels and server properties.
- s3-site.xml — provides configuration for Amazon Simple Storage Service (S3).
- wasbs-site.xml — provides configuration for Azure Blob Storage using the WASB connector.
- yarn-site.xml — provides configuration for Apache YARN, used by some PXF connectors for resource management.
The template files for the Hadoop connectors are not intended to be modified and used for configuration, as they only provide an example of the information needed. Instead of modifying the Hadoop templates, you will copy several Hadoop *-site.xml files from the Hadoop cluster to your PXF Hadoop server configuration.
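That copy workflow can be sketched as follows. This is an illustrative, hedged example: the server name hdp1, the Hadoop host name, and the source path are assumptions, and a temporary PXF_BASE path is used so the sketch is self-contained.

```shell
# Illustrative PXF runtime directory; in a real installation this is
# your actual $PXF_BASE path.
PXF_BASE=/tmp/pxf-demo-hadoop

# Create a dedicated PXF server directory for the Hadoop cluster.
mkdir -p "$PXF_BASE/servers/hdp1"

# Copy the live Hadoop configuration files from a cluster host instead
# of editing the templates (host name and source path are illustrative,
# so the command is shown commented out):
# scp hadoop-nn01:/etc/hadoop/conf/core-site.xml \
#     hadoop-nn01:/etc/hadoop/conf/hdfs-site.xml \
#     hadoop-nn01:/etc/hadoop/conf/mapred-site.xml \
#     "$PXF_BASE/servers/hdp1/"
```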
Configure a server
To configure a PXF connector to an external data store, you need to add a named PXF server configuration for the connector. The configuration information resides in one or more <connector>-site.xml files in the $PXF_BASE/servers/<server_name>/ directory. The name of a directory in $PXF_BASE/servers/ should correspond to the configured PXF server name.
PXF also uses a special server named default.
During PXF installation, an empty $PXF_BASE/servers/default/ directory is created, which identifies the default PXF server configuration.
You can configure and assign the default PXF server to any external data source, for example, a Hadoop cluster, or a MySQL database that you frequently access.
PXF automatically uses the default server configuration if you omit the SERVER=<server_name> option in the CREATE EXTERNAL TABLE command LOCATION clause.
Configuring a server involves the following general steps:
- Determine whether you are configuring the default PXF server or creating a new server configuration.
- Create the $PXF_BASE/servers/<server_name> directory.
- Copy the template or other configuration files to the new server directory.
- Fill in the appropriate values for the properties in the template file.
- Add any additional configuration properties and values required for your environment.
- Configure one or more users for the server configuration as described in Configure a PXF user.
- Synchronize the server and user configuration to the Greengage DB cluster by running the pxf cluster sync command. See Overview of the pxf cluster commands for the full description of the available pxf cluster commands.
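The steps above can be sketched for a hypothetical JDBC server named pgsrv1. The server name, driver class, and connection URL are illustrative assumptions, and a temporary PXF_BASE path is used so the sketch is self-contained.

```shell
# Illustrative PXF runtime directory; in a real installation this is
# your actual $PXF_BASE path.
PXF_BASE=/tmp/pxf-demo-jdbc

# Create the server directory under $PXF_BASE/servers/.
mkdir -p "$PXF_BASE/servers/pgsrv1"

# Normally you would copy $PXF_HOME/templates/jdbc-site.xml and edit it;
# here the file is written inline so the sketch is self-contained.
cat > "$PXF_BASE/servers/pgsrv1/jdbc-site.xml" <<'EOF'
<configuration>
    <property>
        <name>jdbc.driver</name>
        <value>org.postgresql.Driver</value>
    </property>
    <property>
        <name>jdbc.url</name>
        <value>jdbc:postgresql://pghost:5432/pgdatadb</value>
    </property>
</configuration>
EOF

# Finally, synchronize to all cluster hosts (requires a live PXF
# installation, so it is shown commented out):
# pxf cluster sync
```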
After you configure a PXF server, it becomes available to the Greengage DB users who need access to the corresponding data store. A user only needs to provide the server name when creating an external table that accesses the external data store. PXF obtains the external data source location and access credentials from the server and the user configuration files residing in the server configuration directory. Learn more about using the created configuration in Use a server configuration in external tables.
pxf-site.xml configuration file
PXF includes a template file named pxf-site.xml for PXF-specific configuration parameters. You can use the pxf-site.xml template file to configure the following:
- Kerberos or user impersonation settings for server configurations.
- A base directory for file access.
- The action that is performed when PXF detects an overflow condition while writing numeric ORC or Parquet data.
The Kerberos and user impersonation settings in the pxf-site.xml file apply only to Hadoop and JDBC server configurations; they do not apply to file system or object storage server configurations.
You need to configure properties in the pxf-site.xml file for a PXF server when any of the following conditions is met:
- The remote Hadoop system utilizes Kerberos authentication.
- You want to activate or deactivate user impersonation on the remote Hadoop or external database system.
- You want to activate or deactivate Kerberos constrained delegation for a Hadoop PXF server.
- You access a network file system with the server configuration.
- You access a remote Hadoop or object storage file system with the server configuration and want to allow a user to access only a specific directory and its subdirectories.
The following properties are included in the pxf-site.xml file.
| Property | Description | Default Value |
|---|---|---|
| pxf.service.kerberos.principal | The Kerberos principal name | gpadmin/_HOST@EXAMPLE.COM |
| pxf.service.kerberos.keytab | The file system path to the Kerberos keytab file | $PXF_BASE/keytabs/pxf.service.keytab |
| pxf.service.kerberos.constrained-delegation | Activates or deactivates Kerberos constrained delegation. This property applies only to Hadoop PXF server configurations; it does not apply to JDBC PXF servers | false |
| pxf.service.kerberos.ticket-renew-thread-enabled | Indicates whether PXF should use a background thread that refreshes a Kerberos ticket-granting ticket (TGT) when its remaining lifespan reaches the renewal window | false |
| pxf.service.kerberos.ticket-renew-window | The minimum elapsed lifespan after which PXF attempts to renew or refresh a Kerberos ticket, expressed as a fraction of the total ticket lifespan ranging from 0 to 1 | 0.8 (80%) |
| pxf.service.kerberos.ticket-renew-max-backoff-seconds | The maximum amount of time in seconds between consecutive Kerberos TGT renewal attempts | 300 |
| pxf.service.kerberos.ticket-renew-timeout-seconds | The maximum amount of time in seconds that Kerberos TGT renewal might take. If multiple renewal attempts fail to obtain a new ticket within this interval, the renewal attempts stop, the TGT renewal thread exits, and the query fails | 3600 |
| pxf.service.user.impersonation | Activates or deactivates user impersonation when connecting to the remote system | true (user impersonation is enabled) if the property is not set |
| pxf.service.user.name | The login user for the remote system. If the property is not set, the default value is the operating system user that starts the PXF process, typically gpadmin | None; the property is commented out by default |
| pxf.fs.basePath | Identifies the base path or share point on the remote file system. The property must be set when the server configuration is used with a profile that accesses a network file system. You can use this property to restrict access to files in a specific remote directory. When the property is set, PXF considers the file path specified in the CREATE EXTERNAL TABLE LOCATION clause to be relative to this base path | None; the property is commented out by default |
| pxf.ppd.hive | Specifies whether predicate pushdown is enabled for queries on external tables that use the Hive profiles | true |
| pxf.sasl.connection.retries | Specifies the maximum number of times for retrying a SASL connection request after a refused connection returns a GSS initiate failed error | 5 |
| pxf.orc.write.decimal.overflow | Specifies how PXF handles numeric data that exceeds the maximum precision of 38 and overflows when writing to an ORC file. The following values are accepted: error, round, and ignore | round |
| pxf.parquet.write.decimal.overflow | Specifies how PXF handles numeric data that exceeds the maximum precision of 38 and overflows when writing to a Parquet file. The following values are accepted: error, round, and ignore | round |
| pxf.hdfs.fragmenter.sort-by-date | Specifies whether PXF sorts the list of files received from HDFS-compatible systems by modification date instead of file name. It does not apply to JDBC or HBase PXF servers | false |
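As an example of these settings in use, a minimal pxf-site.xml for a Hadoop server that disables user impersonation and restricts file access to one directory tree might look as follows. This is an illustrative sketch; the base path value is an assumption.

```xml
<configuration>
    <!-- Connect to the remote system as the PXF service user
         instead of impersonating the Greengage DB user. -->
    <property>
        <name>pxf.service.user.impersonation</name>
        <value>false</value>
    </property>
    <!-- Illustrative base path; LOCATION paths in external tables
         are resolved relative to this directory. -->
    <property>
        <name>pxf.fs.basePath</name>
        <value>/data/pxf_external</value>
    </property>
</configuration>
```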
Configure a PXF user
You can configure access to an external data store on a per-server, per-user basis.
PXF per-server, per-user configuration provides the most benefit for JDBC servers.
You configure external data store user access credentials and properties for a specific Greengage DB user by providing a <greengage_user_name>-user.xml user configuration file in the PXF server configuration directory, $PXF_BASE/servers/<server_name>/. You can configure zero, one, or more users in a PXF server configuration.
The properties that you specify in a user configuration file are connector-specific. You can specify any configuration property supported by the PXF connector server in a <greengage_user_name>-user.xml configuration file.
For each PXF user that you want to configure, do the following:
- Identify the name of the Greengage DB user.
- Identify the PXF server definition for which you want to configure user access.
- Identify the name and value of each property that you want to configure for the user.
- Create or edit the $PXF_BASE/servers/<server_name>/<greengage_user_name>-user.xml file and add the outer configuration block:

  <configuration>
  </configuration>

- Add a property/value pair for each required property within the configuration block.
- If you are adding the user configuration to a previously configured PXF server definition, synchronize the user configuration to the Greengage DB cluster.
A Greengage DB user who queries or writes to an external table accesses the external data store with the credentials configured for that PXF server user. If no user-specific credentials are configured, the Greengage DB user accesses the external data store with the default credentials configured for the PXF server.
Example
Suppose you have configured access to a PostgreSQL database in the PXF JDBC server configuration named pgsrv1.
To allow the Greengage DB user named bill to access this database as the PostgreSQL user named pguser1 with password password, create the user configuration file $PXF_BASE/servers/pgsrv1/bill-user.xml with the following properties:
<configuration>
<property>
<name>jdbc.user</name>
<value>pguser1</value>
</property>
<property>
<name>jdbc.password</name>
<value>password</value>
</property>
</configuration>
To configure a specific search path and a larger read fetch size for bill, add the following properties to the bill-user.xml user configuration file:
<property>
<name>jdbc.session.property.search_path</name>
<value>bill_schema</value>
</property>
<property>
<name>jdbc.statement.fetchSize</name>
<value>2000</value>
</property>
Use a server configuration in external tables
To access an external data store, specify the server name using the SERVER=<server_name> option in the LOCATION clause of the CREATE EXTERNAL TABLE command.
The provided server name identifies the directory from which PXF obtains the configuration and credentials to access the external data store.
If no SERVER=<server_name> option is provided, PXF automatically uses the default server configuration.
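For instance, a table that relies on the default server can simply omit the SERVER option. The table name and data path below are illustrative assumptions, and the example presumes the default server has been configured for HDFS access.

```sql
-- No SERVER option in the LOCATION URI, so PXF uses the configuration
-- in $PXF_BASE/servers/default/ (path and table name are hypothetical).
CREATE EXTERNAL TABLE sales_default (
    id     INTEGER,
    amount NUMERIC
)
LOCATION ('pxf://data/sales/sales.csv?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER ',');
```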
Example
This example uses the orders.csv file residing in the customers bucket on the storage.example.com S3 host and having the following content:
id,name,price
1,Laptop,999.99
2,Smartphone,499.99
3,Tablet,299.99
4,Monitor,599.99
5,Keyboard,99.99
- On the master host, create the server configuration for accessing the S3 object storage. The created s3-site.xml file is based on the minio-site.xml template. It is stored in the $PXF_BASE/servers/example-s3/ directory and has the following content:

  <?xml version="1.0" encoding="UTF-8"?>
  <configuration>
      <property>
          <name>fs.s3a.endpoint</name>
          <value>storage.example.com</value>
      </property>
      <property>
          <name>fs.s3a.access.key</name>
          <value>ACCESS_KEY</value>
      </property>
      <property>
          <name>fs.s3a.secret.key</name>
          <value>SECRET_KEY</value>
      </property>
      <property>
          <name>fs.s3a.fast.upload</name>
          <value>true</value>
      </property>
      <property>
          <name>fs.s3a.path.style.access</name>
          <value>true</value>
      </property>
  </configuration>

- Synchronize the server configuration to the Greengage DB cluster hosts:

  $ pxf cluster sync

  The output should look similar to the following:

  Syncing PXF configuration files from coordinator host to standby coordinator host and 2 segment hosts...
  PXF configs synced successfully on 3 out of 3 hosts
- Run the CREATE EXTERNAL TABLE command. In the LOCATION clause, provide the orders.csv path and file name and set the following options:

  - PROFILE=s3:csv — the PXF profile used for reading delimited single-line CSV data.
  - SERVER=example-s3 — the name of the server configuration created earlier.
  - SKIP_HEADER_COUNT=1 — the number of header rows to skip.

  In the FORMAT clause, set CSV as the data format:

  CREATE EXTERNAL TABLE orders_s3 (
      id INTEGER,
      name VARCHAR,
      price NUMERIC
  )
  LOCATION ('pxf://customers/orders.csv?PROFILE=s3:csv&SERVER=example-s3&SKIP_HEADER_COUNT=1')
  FORMAT 'CSV';

- Query the created external table:

  SELECT * FROM orders_s3;

  The output should look as follows:

   id |    name    | price
  ----+------------+--------
    1 | Laptop     | 999.99
    2 | Smartphone | 499.99
    3 | Tablet     | 299.99
    4 | Monitor    | 599.99
    5 | Keyboard   |  99.99