Hello, I’m DocuDroid!
Submitting feedback
Thank you for rating our AI Search!
We would be grateful if you could share your thoughts so we can improve our AI Search for you and other readers.
GitHub

Configure PXF JDBC connectors

Anton Monakov

The PXF JDBC connector lets you access an external SQL database, such as MySQL, Oracle, Microsoft SQL Server, DB2, PostgreSQL, Apache Hive, or Apache Ignite. To access data in an external SQL database with the PXF JDBC connector, you need to specify the JDBC driver class name, database URL, and client credentials.

NOTE

If you want to use a custom JDBC driver, install the JDBC driver JAR file for the external SQL database in the $PXF_BASE/lib directory on each Greengage DB host. The JDBC driver JAR files must be compatible with your JRE version.

In the Greengage DB documentation, you can view the practical examples of configuring and using the PXF JDBC connector:

JDBC server configuration

To configure the PXF JDBC connector, you need to add at least one named PXF server configuration for the connector as described in Configure a PXF server. You can also configure one or more statically-defined queries to run against the remote SQL database as described in JDBC named queries.

  1. Log in to the Greengage DB master host as gpadmin.

  2. Choose a name for the JDBC server and create the $PXF_BASE/servers/<server_name> directory. For example, use the following command to create a JDBC server configuration named pg_user1_testdb:

    $ mkdir $PXF_BASE/servers/pg_user1_testdb
  3. Copy the PXF JDBC server template file to the server configuration directory, for example:

    $ cp $PXF_HOME/templates/jdbc-site.xml $PXF_BASE/servers/pg_user1_testdb/
  4. Open the template server configuration file in a text editor and provide appropriate property values for your environment. For example, if you are configuring access to a PostgreSQL database named testdb on a PostgreSQL instance running on the host named pgserverhost for the user named user1, configure the properties as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
        <property>
            <name>jdbc.driver</name>
            <value>org.postgresql.Driver</value>
        </property>
        <property>
            <name>jdbc.url</name>
            <value>jdbc:postgresql://pgserverhost:5432/testdb</value>
        </property>
        <property>
            <name>jdbc.user</name>
            <value>user1</value>
        </property>
        <property>
            <name>jdbc.password</name>
            <value>changeme</value>
        </property>
    </configuration>
  5. Save and close the file.

  6. Synchronize the PXF JDBC server configuration with the Greengage DB cluster:

    $ pxf cluster sync

JDBC server configuration file

PXF provides a template configuration file for the JDBC connector. The template is located in $PXF_HOME/templates/jdbc-site.xml and identifies the properties that you can configure to establish a connection to the external SQL database. The template also includes optional properties that can be set before running the query or the insert commands in the external database session.

NOTE

You can override the JDBC server configuration by directly specifying certain JDBC properties via custom options in the LOCATION clause of the CREATE EXTERNAL TABLE command.

The following are the required properties in the jdbc-site.xml server template file.

Property Description Value

jdbc.driver

Class name of the JDBC driver

The JDBC driver Java class name, for example org.postgresql.Driver

jdbc.url

The URL that the JDBC driver uses to connect to the database

The database connection URL (database-specific), for example jdbc:postgresql://<phost>:<pport>/<pdatabase>

jdbc.user

The database user name

The user name for connecting to the database

jdbc.password

The password for jdbc.user

The password for connecting to the database

NOTE

When you configure a PXF JDBC server, you specify the external database user credentials to PXF in clear text in a configuration file. Alternatively, you can use the pxf encrypt command to encrypt the password and then provide the resulting encrypted string as the jdbc.password value, for example:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
...
    <property>
        <name>jdbc.password</name>
        <value>aes256:7BhhI+10ut+sdf34kj5D/a98d754a8add7cf8ba40540752a0c3af227c6023</value>
    </property>
</configuration>

The specified password will be automatically decrypted when connecting to the JDBC server.

Connection-level properties

To set additional JDBC connection-level properties, add jdbc.connection.property.<CPROP_NAME> properties to jdbc-site.xml. PXF passes these properties to the JDBC driver when establishing the connection to the external SQL database. Replace <CPROP_NAME> with the connection property name and specify its value. Ensure that the JDBC driver for the external SQL database supports any connection-level property that you specify.

Property Description Value

jdbc.connection.property.<CPROP_NAME>

The name of a property (<CPROP_NAME>) to pass to the JDBC driver when PXF establishes the connection to the external SQL database

The value of the <CPROP_NAME> property

For example, to set the createDatabaseIfNotExist connection property on a JDBC connection to a PostgreSQL database, include the following property block in jdbc-site.xml:

<property>
    <name>jdbc.connection.property.createDatabaseIfNotExist</name>
    <value>true</value>
</property>

Connection transaction isolation property

The SQL standard defines four transaction isolation levels. The level that you specify for a given connection to an external SQL database determines how and when the changes made by one transaction run on the connection are visible to another.

The PXF JDBC connector exposes an optional server configuration property named jdbc.connection.transactionIsolation that lets you specify the transaction isolation level. PXF sets the isolation level just after establishing the connection to the external SQL database. Different SQL databases support different transaction isolation levels. Ensure that the external database supports the level that you specify.

The JDBC connector supports the following jdbc.connection.transactionIsolation property values.

SQL level PXF property value

READ UNCOMMITTED

READ_UNCOMMITTED

READ COMMITTED

READ_COMMITTED

REPEATABLE READ

REPEATABLE_READ

SERIALIZABLE

SERIALIZABLE

For example, to set the transaction isolation level to READ UNCOMMITTED, add the following property block to the jdbc-site.xml file:

<property>
    <name>jdbc.connection.transactionIsolation</name>
    <value>READ_UNCOMMITTED</value>
</property>

Statement-level properties

The PXF JDBC connector runs a query or an INSERT command on an external SQL database table in a statement. The connector exposes properties that let you configure certain aspects of the statement before the command is run in the external database. Ensure that the JDBC driver for the external database supports any statement-level property that you specify. PXF uses the default value for any statement-level property that is not configured explicitly.

The connector supports the following statement-level properties.

Property Description Value

jdbc.statement.batchSize

The number of rows to write to the external database table in a batch

The number of rows. The default write batch size is 100

jdbc.statement.fetchSize

The number of rows to fetch/buffer when reading from the external database table

The number of rows. The default read fetch size for MySQL is -2147483648. The default read fetch size for all other databases is 1000

jdbc.statement.writeSize

Number of rows to be batched prior to writing to the external database table

The default write batch size is 100

jdbc.statement.queryTimeout

The amount of time (in seconds) the JDBC driver waits for a statement to run. This timeout applies to statements created for both read and write operations

The timeout duration in seconds. The default wait time is unlimited

For example, to set the read fetch size to 5000, add the following property block to jdbc-site.xml:

<property>
    <name>jdbc.statement.fetchSize</name>
    <value>5000</value>
</property>

Prepared statements

By default, the PXF JDBC connector reads from an external data source using a JDBC Statement.

The PXF jdbc.read.prepared-statement property defines whether the connector should use PreparedStatement. If the JDBC driver that you are using requires the use of a PreparedStatement, set the property to true.

Property Description Default value

jdbc.read.prepared-statement

Defines whether to use a PreparedStatement instead of a Statement when reading from the external data source

false

Session-level properties

To set session-level properties, add the jdbc.session.property.<SPROP_NAME> property to jdbc-site.xml. PXF sets these properties in the external database before running a query.

Replace <SPROP_NAME> with the session property name and specify its value.

Property Description Value

jdbc.session.property.<SPROP_NAME>

The name of a session property (<SPROP_NAME>) to set before PXF runs the query

The value of the <SPROP_NAME> property. Note that the PXF JDBC connector passes both the session property names and property values to the external SQL database exactly as specified in the jdbc-site.xml server configuration file. To limit the potential threat of SQL injection, the connector rejects any property name or value that contains the ;, \n, \b, or \0 characters

For example, to set the search_path parameter before running a query in a PostgreSQL database, add the following property block to jdbc-site.xml:

<property>
    <name>jdbc.session.property.search_path</name>
    <value>public</value>
</property>

Other properties

Other properties supported by the PXF JDBC connector are as follows.

Property Description Default value

jdbc.date.wideRange

Defines whether special parsing of dates when the year contains more than four alphanumeric characters is enabled. When set to true, PXF uses extended classes to parse dates and recognizes years that specify BC or AD

false

JDBC connection pooling

The PXF JDBC connector uses JDBC connection pooling implemented by HikariCP. When a user queries or writes to an external table, the connector establishes a connection pool for the associated server configuration the first time that it encounters a unique combination of jdbc.url, jdbc.user, jdbc.password, connection property, and pool property settings. The connector reuses connections in the pool subject to certain connection and timeout settings.

One or more connection pools may exist for a given server configuration, and user access to different external tables specifying the same server may share a connection pool.

NOTE

If JDBC user impersonation is activated in a server configuration, the JDBC connector creates a separate connection pool for each Greengage DB user that accesses any external table specifying that server configuration.

The jdbc.pool.enabled property regulates JDBC connection pooling for a server configuration. Connection pooling is activated by default. To deactivate it, set the property to false:

<property>
    <name>jdbc.pool.enabled</name>
    <value>false</value>
</property>

If JDBC connection pooling is deactivated for a server configuration, PXF does not reuse JDBC connections for that server. PXF creates a connection to the remote database for every partition of a query and closes the connection when the query for that partition completes.

PXF exposes connection pooling properties that you can configure in a JDBC server definition. These properties are named with the jdbc.pool.property prefix and apply to each PXF JVM.

The JDBC connector automatically sets the following connection pool properties and default values.

Property Description Default value

jdbc.pool.property.maximumPoolSize

The maximum number of connections to the database backend

15

jdbc.pool.property.connectionTimeout

The maximum amount of time, in milliseconds, to wait for a connection from the pool

30000

jdbc.pool.property.idleTimeout

The maximum amount of time, in milliseconds, after which an inactive connection is considered idle

30000

jdbc.pool.property.minimumIdle

The minimum number of idle connections maintained in the connection pool

0

You can set other HikariCP-specific connection pooling properties for a server configuration by specifying jdbc.pool.property.<HIKARICP_PROP_NAME> and the desired value in the jdbc-site.xml file. Note that the JDBC connector passes along any property that you specify with a jdbc.connection.property prefix when it requests a connection from the JDBC DriverManager.

Tune the maximum connection pool size

To not exceed the maximum number of connections allowed by the target database and at the same time ensure that the JDBC connections are shared evenly between PXF JVMs, determine the maximum value of maximumPoolSize based on the size of the Greengage DB cluster as follows:

maximum number of connections allowed by remote db / number of segment hosts

For example, if a Greengage DB cluster has 16 segment hosts and the target database allows 160 concurrent connections, maximumPoolSize is calculated as follows:

160 / 16 = 10

In practice, you may set maximumPoolSize to a lower value, since the number of concurrent connections per JDBC query depends on the number of partitions scanned by the query. When a query scans no partitions, a single PXF JVM serves the query. If a query scans 12 partitions, PXF establishes 12 concurrent JDBC connections to the remote database. Ideally, these connections are distributed equally among the PXF JVMs, but that is not guaranteed.

JDBC user impersonation

The PXF JDBC connector uses the jdbc.user setting or the information in jdbc.url to determine the identity of the user to connect to the external data store. When PXF JDBC user impersonation is deactivated (the default), the behavior of the JDBC connector depends on the external data store. For example, if you are using the JDBC connector to access Hive, the connector uses the settings of certain Hive authentication and impersonation properties to determine the user. It may be required to provide a jdbc.user setting or add properties to the jdbc.url setting in the server jdbc-site.xml file.

When you activate PXF JDBC user impersonation, the PXF JDBC connector accesses the external data store on behalf of a Greengage DB end user that accesses the PXF external table. PXF overrides the value of a jdbc.user property setting defined in either jdbc-site.xml or <greengage_db_user_name>-user.xml, or specified in the external table DDL, with the Greengage DB user name. For user impersonation to work effectively when the external data store requires passwords to authenticate connecting users, you must specify the jdbc.password setting for each user that can be impersonated in that user’s <greengage_db_user_name>-user.xml property override file. See Configure a PXF user for more information about per-server user configuration.

PXF JDBC user impersonation is regulated by the pxf.service.user.impersonation property in the jdbc-site.xml configuration file. To turn PXF user impersonation on or off for a JDBC server configuration:

  1. Log in to the Greengage DB master host as gpadmin.

  2. Identify the name of the PXF JDBC server configuration to update.

  3. Navigate to the server configuration directory. For example, if the server is named mysqldb:

    $ cd $PXF_BASE/servers/mysqldb
  4. Open the jdbc-site.xml file in a text editor, add or uncomment the user impersonation property (pxf.service.user.impersonation), and set it to true:

    <property>
        <name>pxf.service.user.impersonation</name>
        <value>true</value>
    </property>
  5. Save and close the jdbc-site.xml file.

  6. Synchronize the PXF JDBC server configuration with the Greengage DB cluster:

    $ pxf cluster sync

Session authorization

Certain SQL databases, such as PostgreSQL or DB2, allow a privileged user to change the effective database user that runs commands in a session. You can use this feature if, for example, you connect to a remote database as a proxy user and want to switch session authorization after establishing the database connection.

In supported databases, you can configure a session property to switch the effective user. For example, in DB2, you use the SET SESSION_USER <username> command to switch the effective DB2 user. If you configure the DB2 session_user variable via a PXF session-level property (jdbc.session.property.<SPROP_NAME>) in the jdbc-site.xml file, PXF runs this command for you.

For example, to switch the effective DB2 user to the user named bill, modify the jdbc-site.xml file as follows:

<property>
    <name>jdbc.session.property.session_user</name>
    <value>bill</value>
</property>

After establishing the database connection, PXF implicitly runs the following command to set the session_user DB2 session variable to the configured value (bill):

SET SESSION_USER = bill

PXF recognizes a synthetic property value, ${pxf.session.user}, that identifies the Greengage DB user name. You may choose to use this value when you configure a property that requires a value that changes based on the Greengage DB user running the session.

You can use ${pxf.session.user} when you authenticate to the remote SQL database with Kerberos, the primary component of the Kerberos principal identifies the Greengage DB user name, and you want to run queries in the remote database using this effective user name. For example, if you are accessing DB2, in the jdbc-site.xml specify the Kerberos securityMechanism and KerberosServerPrincipal and then set the session_user variable as follows:

<property>
    <name>jdbc.session.property.session_user</name>
    <value>${pxf.session.user}</value>
</property>

With this configuration, PXF sets the DB2 session_user variable to the current Greengage DB user name, and runs subsequent operations on the DB2 table as that user.

Session authorization for connection pooling

When PXF performs session authorization on your behalf and JDBC connection pooling is activated (the default), you can set the jdbc.pool.qualifier property. This property instructs PXF to include the property value in the criteria that it uses to create and reuse connection pools. In practice, this property is set not to a fixed value, but rather to a value that changes based on the user/session/transaction, and so on. When set to ${pxf.session.user}, PXF includes the Greengage DB user name in the criteria that it uses to create and reuse connection pools. By default, no qualifier is set.

To use this feature, add or uncomment the following property block in jdbc-site.xml to prompt PXF to include the Greengage DB user name in connection pool creation/reuse criteria:

<property>
    <name>jdbc.pool.qualifier</name>
    <value>${pxf.session.user}</value>
</property>

Fragment distribution logic

To handle fragment distribution across segments when reading a partitioned table in PXF JDBC, you can use several policies, which are specified as the FRAGMENT_DISTRIBUTION_POLICY custom option in the LOCATION clause of the CREATE EXTERNAL TABLE command, for example:

LOCATION (
    'pxf://orders?PROFILE=Jdbc&Server=pg&FRAGMENT_DISTRIBUTION_POLICY=ACTIVE-SEGMENT&ACTIVE_SEGMENT_COUNT=2'
)

The following policies are available:

  • ROUND-ROBIN — (default) fragments are sequentially assigned to Greengage DB segments in a cyclical (round-robin) order.

  • ACTIVE-SEGMENT — fragments are assigned to the fixed number of segments defined by the ACTIVE_SEGMENT_COUNT option thus limiting parallelism.

  • IMPROVED-ROUND-ROBIN — the policy involving the following logic:

    • If the fragment count is less than the total number of segments, the fragments are distributed between segments evenly.

    • If the fragment count is N times the number of total segments, all fragments are distributed between all segments N times.

    • If the fragment count is greater than the total number of segments, fragments are distributed between all segments N times (with N being the ratio between the number of fragments and the number of segments), and the remaining fragments are evenly distributed between segments.

  • RANDOM — fragments are distributed randomly between segments.

JDBC named queries

A PXF named query is a static user-defined query that PXF runs in the remote SQL database.

A PXF JDBC named query is defined in a text file. You can then reference the query in a Greengage DB external table definition. PXF runs the query each time a SELECT command is invoked on a Greengage DB external table.

To create a named query, add the query statement to a text file that has the following naming format: <query_name>.sql. You can define one or more named queries for a JDBC server configuration, with each query residing in a separate text file.

The query text file must be located in the PXF JDBC server configuration directory from which it will be accessed. To make the query available to more than one JDBC server configuration, copy the corresponding text file to the configuration directory for each JDBC server.

The query text file must contain a single query that you want to run in the remote SQL database. The query must be constructed in accordance with the syntax supported by the database. For example, if a MySQL database has a customers table and an orders table, you can include the following SQL statement in a query text file:

SELECT c.name, c.city, sum(o.amount) AS total, o.month
    FROM customers c JOIN orders o ON c.id = o.customer_id
    WHERE c.state = 'CO'
    GROUP BY c.name, c.city, o.month

You may optionally provide the ending semicolon (;) for the SQL statement.

A named query is referenced by specifying the query file name without the extension. For example, if you define a query in a file named report.sql, then the name of that query is report. When creating an external table, query:<query_name> is specified rather than the name of a remote SQL database table. For example, if the query is defined in the $PXF_BASE/servers/mydb/report.sql file, the LOCATION clause of the CREATE EXTERNAL TABLE command would look as follows:

LOCATION ('pxf://query:report?PROFILE=jdbc&SERVER=mydb ...')