Configure PXF Hadoop connectors
PXF is compatible with generic Apache Hadoop distributions and comes with HDFS, Hive, and HBase connectors preinstalled. You can use these connectors to access data in various formats from these Hadoop distributions.
Before working with Hadoop data using PXF, ensure that the following prerequisites are met:
- You have installed and configured PXF, and PXF is running on each Greengage DB host (you can verify this as shown after this list).
- You have configured the PXF Hadoop connectors that you plan to use. Before configuring the PXF Hadoop connectors, ensure that you can copy files from hosts in your Hadoop cluster to the Greengage DB master host.
- If user impersonation is enabled (the default), you have granted read or write permission to the required HDFS files and directories to each Greengage DB role name that will access them. If user impersonation is not enabled, this permission must be granted to the `gpadmin` user.
- Time is synchronized between the Greengage DB hosts and the external Hadoop systems.
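To check the first prerequisite, the PXF CLI provides a cluster-wide status command:

```shell
# Displays the status of the PXF service instance on each Greengage DB host
$ pxf cluster status
```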
Data formats and profiles
The PXF Hadoop connectors provide built-in profiles to support the following data formats:
- Text
- CSV
- Avro
- JSON
- ORC
- Parquet
- RCFile
- SequenceFile
- AvroSequenceFile
The following profiles are exposed for reading and writing data in the supported formats.
| Data source | Data format | Profile name | Deprecated profile name | Supported operations |
|---|---|---|---|---|
| HDFS | Delimited single-line text values | `hdfs:text` | N/A | Read, Write |
| HDFS | Single-line comma-separated text values | `hdfs:csv` | N/A | Read, Write |
| HDFS | Fixed-width single-line text | `hdfs:fixedwidth` | N/A | Read, Write |
| HDFS | Delimited text with quoted linefeeds | `hdfs:text:multi` | N/A | Read |
| HDFS | Avro | `hdfs:avro` | N/A | Read, Write |
| HDFS | JSON | `hdfs:json` | N/A | Read |
| HDFS | ORC | `hdfs:orc` | N/A | Read, Write |
| HDFS | Parquet | `hdfs:parquet` | N/A | Read, Write |
| HDFS | AvroSequenceFile | `hdfs:AvroSequenceFile` | N/A | Read, Write |
| HDFS | SequenceFile | `hdfs:SequenceFile` | N/A | Read, Write |
| Hive | Stored as TextFile | `hive`, `hive:text` | `Hive`, `HiveText` | Read |
| Hive | Stored as SequenceFile | `hive` | `Hive` | Read |
| Hive | Stored as RCFile | `hive`, `hive:rc` | `Hive`, `HiveRC` | Read |
| Hive | Stored as ORC | `hive`, `hive:orc` | `Hive`, `HiveORC`, `HiveVectorizedORC` | Read |
| Hive | Stored as Parquet | `hive` | `Hive` | Read |
| Hive | Stored as Avro | `hive` | `Hive` | Read |
| HBase | Any | `hbase` | `HBase` | Read |
Choose the profile
PXF provides several profiles to access text and Parquet data on Hadoop. When determining which profile to choose, consider the following:
- Choose the `hive` profile when any of the following conditions are met (see the example after this list):
  - The data resides in a Hive table, and you do not know the underlying file type of the table.
  - The data resides in a partitioned Hive table.
- Choose the `hdfs:text` or `hdfs:csv` profile when the file is text and you know its location in the HDFS file system.
- When accessing ORC-format data:
  - Choose the `hdfs:orc` profile when the file is ORC, you know its location in the HDFS file system, and the file is not managed by Hive or you do not want to use the Hive Metastore.
  - Choose the `hive:orc` profile when the table is ORC and is managed by Hive, and the data is partitioned or includes complex types.
- Choose the `hdfs:parquet` profile when the file is Parquet, you know its location in the HDFS file system, and you want to take advantage of extended filter pushdown support for additional data types and operators.
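As a minimal sketch of the first case, the following external table reads a Hive-managed table through the `hive` profile; the Hive table name `default.sales_part` and its columns are hypothetical:

```sql
CREATE EXTERNAL TABLE sales_part_pxf(
    item_name VARCHAR,
    item_qty INTEGER,
    region VARCHAR
)
LOCATION ('pxf://default.sales_part?SERVER=hadoop&PROFILE=hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```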
Specify the profile
The profile name is provided when specifying the PXF protocol in a CREATE EXTERNAL TABLE command to create an external table that references a Hadoop file or directory, HBase table, or Hive table.
For example, the following command creates an `orders` external table that uses the `hadoop` server and specifies the `hdfs:csv` profile to access the HDFS file `/tmp/orders.csv`:

```sql
CREATE EXTERNAL TABLE orders(
    id INTEGER,
    name VARCHAR,
    price NUMERIC
)
LOCATION ('pxf://tmp/orders.csv?SERVER=hadoop&PROFILE=hdfs:csv')
FORMAT 'CSV';
```
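Profiles that support writing work the same way through CREATE WRITABLE EXTERNAL TABLE. A hedged sketch, assuming a hypothetical HDFS output directory `/tmp/orders_out`, that writes Parquet data via the `hdfs:parquet` profile:

```sql
CREATE WRITABLE EXTERNAL TABLE orders_export(
    id INTEGER,
    name VARCHAR,
    price NUMERIC
)
LOCATION ('pxf://tmp/orders_out?SERVER=hadoop&PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');

-- Usage: export rows from the readable table into HDFS as Parquet files
INSERT INTO orders_export SELECT * FROM orders;
```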
Configure Hadoop connectors
- Log in to the Greengage DB master host as `gpadmin`.
- Decide whether to use the default PXF server or configure a new one. In this example, a new Hadoop server configuration named `hadoop` is created, with its configuration files located in the `$PXF_BASE/servers/hadoop` directory. Create the server configuration directory and switch to it:

  ```shell
  $ mkdir $PXF_BASE/servers/hadoop
  $ cd $PXF_BASE/servers/hadoop
  ```

- PXF requires information from the Hadoop configuration files. Copy the `core-site.xml`, `hdfs-site.xml`, `mapred-site.xml`, and `yarn-site.xml` Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host. Your file paths may differ based on the Hadoop distribution in use. For example, these commands use `scp` to copy the files:

  ```shell
  $ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
  $ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
  $ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
  $ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .
  ```

- If you plan to use the PXF Hive connector to access Hive table data, similarly copy the Hive configuration to the Greengage DB master host, for example:

  ```shell
  $ scp hiveuser@hivehost:/etc/hive/conf/hive-site.xml .
  ```

- If you plan to use the PXF HBase connector to access HBase table data, similarly copy the HBase configuration to the Greengage DB master host, for example:

  ```shell
  $ scp hbaseuser@hbasehost:/etc/hbase/conf/hbase-site.xml .
  ```

- Synchronize the PXF configuration to the Greengage DB cluster:

  ```shell
  $ pxf cluster sync
  ```

- By default, PXF tries to access HDFS, Hive, and HBase using the identity of the Greengage DB user account that logs in to Greengage DB. To support this functionality, you must configure proxy settings for Hadoop, Hive, and HBase. See Configure Hadoop users, impersonation, and proxying for details.
- Grant read permission to the HDFS files and directories that will be accessed via external tables in Greengage DB. If user impersonation is enabled (the default), grant this permission to each Greengage DB role name that will use external tables that reference the HDFS files. If user impersonation is not enabled, grant this permission to the `gpadmin` user. A sketch of this step follows the list.
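For the last step, assuming a hypothetical HDFS data directory `/data/pxf_examples` and an HDFS group `analysts` whose members correspond to the Greengage DB roles that need access, the standard `hdfs dfs` commands might look like this:

```shell
# Run as an HDFS superuser; the group and path are examples only
$ hdfs dfs -chown -R hdfs:analysts /data/pxf_examples
$ hdfs dfs -chmod -R 750 /data/pxf_examples
```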
Configure Hadoop users, impersonation, and proxying
PXF accesses Hadoop services on behalf of Greengage DB end users, and it uses only the login identity of the user for that access.
This means that if a user logs in to Greengage DB as the user `jane` and then runs the `SET ROLE` or `SET SESSION AUTHORIZATION` command to assume a different user identity, all PXF requests still use the `jane` identity to access Hadoop services, as the following example shows.
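A short illustration, assuming roles `jane` and `analyst` already exist and `orders` is the PXF external table defined earlier:

```sql
-- Session begins as the login user jane
SET ROLE analyst;

-- The current role is now analyst, but PXF still presents the
-- login identity jane to Hadoop when reading the external table
SELECT * FROM orders;
```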
Impersonation is a way to present a Greengage DB end user identity to a remote system. In PXF, this is achieved by configuring a Hadoop proxy user. When the Hadoop service is secured with Kerberos, you can configure impersonation using Kerberos constrained delegation.
When user impersonation is activated (the default), PXF accesses non-secured Hadoop services using the identity of the Greengage DB user account that logs in to Greengage DB and performs an operation that uses a PXF connector. You must explicitly configure each Hadoop data source (HDFS, Hive, HBase) to allow PXF to act as a proxy for impersonating specific Hadoop users or groups.
When user impersonation is deactivated, PXF runs all Hadoop service requests as the PXF process owner (usually `gpadmin`) or the Hadoop user identity that you specify. This behavior provides no means to control access to Hadoop services for different Greengage DB users. It also requires that this single user have access to all files and directories in HDFS, and to all tables in Hive and HBase, that are referenced in PXF external table definitions.
The Hadoop user and PXF user impersonation settings for a server are configured in the `pxf-site.xml` server configuration file. See pxf-site.xml configuration file for a detailed description of the configuration properties contained in this file.
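For orientation, a minimal sketch of the relevant part of `pxf-site.xml`, with both properties shown at typical values (the template shipped with PXF contains additional properties):

```xml
<configuration>
    <!-- Run Hadoop requests as the Greengage DB end user (the default) -->
    <property>
        <name>pxf.service.user.impersonation</name>
        <value>true</value>
    </property>
    <!-- Service/proxy user; if unset, defaults to the operating
         system user that started PXF (usually gpadmin) -->
    <property>
        <name>pxf.service.user.name</name>
        <value>gpadmin</value>
    </property>
</configuration>
```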
Configuration scenarios
User, user impersonation, and proxy configuration for Hadoop depend on how you use PXF to access Hadoop, and whether the Hadoop cluster is secured with Kerberos.
The following scenarios describe the use cases and configuration required when you use PXF to access non-secured Hadoop.
Access Hadoop as the Greengage DB user proxied by gpadmin
This is the default configuration for PXF.
The gpadmin user proxies Greengage DB queries on behalf of Greengage DB users.
The effective user in Hadoop is the Greengage DB user that runs the query.
The following table identifies the pxf.service.user.impersonation and pxf.service.user.name settings and the required PXF and Hadoop configuration.
| Impersonation | Service user | PXF configuration | Hadoop configuration |
|---|---|---|---|
| true | gpadmin | None, this is the default configuration | Set the `gpadmin` user as the Hadoop proxy user, as described in Configure Hadoop proxying |
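For this default scenario, the only Hadoop-side change is the proxy-user configuration in `core-site.xml`; a sketch with hypothetical host and group values (see Configure Hadoop proxying for details):

```xml
<property>
    <name>hadoop.proxyuser.gpadmin.hosts</name>
    <value>pxfhost1,pxfhost2,pxfhost3</value>
</property>
<property>
    <name>hadoop.proxyuser.gpadmin.groups</name>
    <value>group1,group2</value>
</property>
```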
Access Hadoop as the Greengage DB user proxied by a custom user
In this configuration, PXF accesses Hadoop as the Greengage DB user proxied by a <custom> user.
A query initiated by a Greengage DB user appears on the Hadoop side as originating from the Greengage DB user.
This configuration might be desirable when Hadoop is already configured with a proxy user, or when you want to proxy Greengage DB queries via a user different from gpadmin.
The following table identifies the pxf.service.user.impersonation and pxf.service.user.name settings and the required PXF and Hadoop configuration.
| Impersonation | Service user | PXF configuration | Hadoop configuration |
|---|---|---|---|
| true | `<custom>` | Configure the Hadoop user to the `<custom>` user name, as described in Configure Hadoop users | Set the `<custom>` user as the Hadoop proxy user, as described in Configure Hadoop proxying |
Access Hadoop as the gpadmin user
In this configuration, PXF accesses Hadoop as the gpadmin user.
A query initiated by any Greengage DB user appears on the Hadoop side as originating from the gpadmin user.
The following table identifies the pxf.service.user.impersonation and pxf.service.user.name settings and the required PXF and Hadoop configuration.
| Impersonation | Service user | PXF configuration | Hadoop configuration |
|---|---|---|---|
| false | gpadmin | Turn off user impersonation, as described in Configure PXF user impersonation | None required |
Access Hadoop as a custom user
In this configuration, PXF accesses Hadoop as a <custom> user.
A query initiated by any Greengage DB user appears on the Hadoop side as originating from the <custom> user.
The following table identifies the pxf.service.user.impersonation and pxf.service.user.name settings, and the PXF and Hadoop configuration.
| Impersonation | Service user | PXF configuration | Hadoop configuration |
|---|---|---|---|
| false | `<custom>` | Turn off user impersonation, as described in Configure PXF user impersonation, and configure a Hadoop user for the `<custom>` user name, as described in Configure Hadoop users | None required |
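In `pxf-site.xml` terms, both impersonation-off scenarios reduce to the following sketch; the user name `hdfsuser1` is only an example, and omitting `pxf.service.user.name` makes PXF run requests as the PXF process owner:

```xml
<property>
    <name>pxf.service.user.impersonation</name>
    <value>false</value>
</property>
<property>
    <name>pxf.service.user.name</name>
    <value>hdfsuser1</value>
</property>
```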
Configure Hadoop users
By default, PXF accesses Hadoop using the identity of a Greengage DB user. You can configure PXF to access Hadoop as a different user on a per-server basis.
- Log in to the master host as `gpadmin`.
- Identify the name of the Hadoop PXF server configuration that you want to update.
- Navigate to the server configuration directory (`$PXF_BASE/servers`). For example, if the server is named `hadoop`:

  ```shell
  $ cd $PXF_BASE/servers/hadoop
  ```

- If the server configuration does not yet include a `pxf-site.xml` file, copy the template file to the directory, for example:

  ```shell
  $ cp $PXF_HOME/templates/pxf-site.xml .
  ```

- Open `pxf-site.xml` in an editor and configure the Hadoop user name:

  ```shell
  $ vi pxf-site.xml
  ```

  - When user impersonation is deactivated, this name identifies the Hadoop user identity that PXF uses to access the Hadoop system.
  - When user impersonation is activated for a non-secure Hadoop cluster, this name identifies the PXF proxy Hadoop user.

  For example, to access Hadoop as the `hdfsuser1` user, uncomment the property and set it as follows:

  ```xml
  <property>
      <name>pxf.service.user.name</name>
      <value>hdfsuser1</value>
  </property>
  ```

  The `hdfsuser1` Hadoop user must exist in the Hadoop cluster.

- Save and close the `pxf-site.xml` file.
- Use the `pxf cluster sync` command to synchronize the PXF Hadoop server configuration to your Greengage DB cluster:

  ```shell
  $ pxf cluster sync
  ```
Configure PXF user impersonation
PXF user impersonation is activated by default for Hadoop servers. You can configure PXF user impersonation on a per-server basis. To turn PXF user impersonation on or off for a Hadoop server configuration:
- Navigate to the server configuration directory. For example, if the server is named `hadoop`:

  ```shell
  $ cd $PXF_BASE/servers/hadoop
  ```

- If the server configuration does not yet include a `pxf-site.xml` file, copy the template file to this directory, for example:

  ```shell
  $ cp $PXF_HOME/templates/pxf-site.xml .
  ```

- Open `pxf-site.xml` in an editor and update the user impersonation property setting:

  ```shell
  $ vi pxf-site.xml
  ```

  - To turn user impersonation off, set the `pxf.service.user.impersonation` property to `false`:

    ```xml
    <property>
        <name>pxf.service.user.impersonation</name>
        <value>false</value>
    </property>
    ```

  - To turn user impersonation on, set the `pxf.service.user.impersonation` property to `true`:

    ```xml
    <property>
        <name>pxf.service.user.impersonation</name>
        <value>true</value>
    </property>
    ```

- If user impersonation is activated and Kerberos constrained delegation is deactivated (the default), configure Hadoop proxying as described in Configure Hadoop proxying. If you plan to use Hive or HBase, you also need to configure impersonation for them as described in Hive user impersonation and HBase user impersonation.
- Save and close the `pxf-site.xml` file.
- Use the `pxf cluster sync` command to synchronize the PXF Hadoop server configuration to your Greengage DB cluster:

  ```shell
  $ pxf cluster sync
  ```
Hive user impersonation
The PXF Hive connector uses the Hive MetaStore to determine the HDFS locations of Hive tables and then accesses the underlying HDFS files directly. No specific impersonation configuration is required for Hive because the Hadoop proxy configuration in core-site.xml also applies to Hive tables.
HBase user impersonation
In order for user impersonation to work with HBase, you must activate the AccessController coprocessor in the HBase configuration and restart the cluster. See Server-side Configuration for Simple User Access Operation in the Apache HBase Reference Guide for the required hbase-site.xml configuration settings.
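As a hedged sketch (verify against the HBase Reference Guide for your HBase version), the `hbase-site.xml` settings typically look like this:

```xml
<property>
    <name>hbase.security.authorization</name>
    <value>true</value>
</property>
<property>
    <name>hbase.coprocessor.master.classes</name>
    <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
    <name>hbase.coprocessor.region.classes</name>
    <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
```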
Configure Hadoop proxying
When PXF user impersonation is activated for a Hadoop server configuration and Kerberos constrained delegation is deactivated (the default), you must configure Hadoop to permit PXF to proxy Greengage DB users.
This configuration involves setting certain hadoop.proxyuser.* properties.
- Log in to the Hadoop cluster and open the `core-site.xml` configuration file in an editor:

  ```shell
  $ vi core-site.xml
  ```

- Set the `hadoop.proxyuser.<username>.hosts` property to specify the comma-separated list of PXF host names from which proxy requests are permitted. Substitute `<username>` with the PXF proxy Hadoop user name, which is the value of `pxf.service.user.name` configured as described in Configure Hadoop users. If you are using Kerberos authentication to Hadoop, the proxy user identity is the primary component of the Kerberos principal. If `pxf.service.user.name` is not configured explicitly, the proxy user is the operating system user that started PXF. For example, if the PXF proxy user is named `hdfsuser2`:

  ```xml
  <property>
      <name>hadoop.proxyuser.hdfsuser2.hosts</name>
      <value>pxfhost1,pxfhost2,pxfhost3</value>
  </property>
  ```

- Set the `hadoop.proxyuser.<username>.groups` property to specify the list of HDFS groups that PXF as Hadoop user `<username>` can impersonate. The list should contain only those groups that require access to HDFS data from PXF, for example:

  ```xml
  <property>
      <name>hadoop.proxyuser.hdfsuser2.groups</name>
      <value>group1,group2</value>
  </property>
  ```

- Restart Hadoop for the changes in `core-site.xml` to take effect.
- Copy the updated `core-site.xml` file to the PXF Hadoop server configuration directory `$PXF_BASE/servers/<server_name>` on the Greengage DB master host and synchronize the configuration to the Greengage DB cluster:

  ```shell
  $ pxf cluster sync
  ```
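Once Hadoop is restarted and the configuration is synchronized, a quick sanity check is to query a PXF external table as a regular Greengage DB role; `orders` here is the hypothetical table from the earlier example:

```sql
-- The HDFS read runs as the querying role,
-- proxied by the configured Hadoop proxy user
SELECT COUNT(*) FROM orders;
```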
Update Hadoop configuration
If you update the Hadoop, Hive, or HBase configuration on a running PXF service, you must copy the updated configuration files to the `$PXF_BASE/servers/<server_name>` directory and re-sync the PXF configuration to your Greengage DB cluster, for example:
```shell
$ cd $PXF_BASE/servers/<server_name>
$ scp hiveuser@hivehost:/etc/hive/conf/hive-site.xml .
$ pxf cluster sync
```