Use PXF HBase connector to read table data from HBase to Greengage DB
Apache HBase is a distributed versioned non-relational database on Hadoop.
HBase relies on the following main concepts:
-
An HBase table is a multidimensional map comprised of one or more columns and rows of data.
-
An HBase column includes two components: a column family and a column qualifier, which are delimited by a colon character (
:). You specify the complete set of column families when you create an HBase table. -
An HBase row consists of a row key and one or more column values. A row key is a unique identifier for the table row.
-
An HBase cell comprises a row (column family, column qualifier, column value) and a timestamp. The column value and timestamp in a given cell represent a version of the value.
For more information about HBase, refer to the Apache HBase Reference Guide.
Create an external table using the PXF protocol
To create a Greengage DB external table that reads data from an HBase table, use the following general syntax:
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<hbase_table_name>?PROFILE=hbase[&SERVER=<server_name>]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
| Keyword | Value |
|---|---|
<table_name> |
The name of the table to create |
<column_name> |
The name of the column to create |
<data_type> |
The data type of the column |
LIKE <other_table> |
Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy |
<hbase_table_name> |
The name of the HBase table the data is read from |
PROFILE=hbase |
The |
SERVER=<server_name> |
The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used |
FORMAT 'CUSTOM' |
The custom format with the built-in custom formatter function for read operation ( |
Data type mapping
HBase is byte-based; it stores all data types as an array of bytes. To represent HBase data in Greengage DB, you need to select the Greengage DB column data type that matches the underlying content of the HBase column qualifier values. In a Greengage DB external table definition, you can reference all column qualifiers of an HBase table or a subset of column qualifiers. Note that PXF does not support complex HBase objects.
PXF supports direct or indirect mapping between a Greengage DB table column and an HBase table column qualifier. See Example for a practical example of creating a Greengage DB external table using both mapping methods.
Direct column mapping
Direct mapping of Greengage DB external table column names to HBase qualifiers is specified via quoted column-family-qualified HBase qualifier names (<column family>:<column qualifier>).
When reading the table data, the PXF HBase connector passes these column names to HBase as is.
Indirect column mapping via lookup table
Indirect mapping of Greengage DB external table column names to HBase qualifiers is specified via a lookup table created in HBase.
This lookup table maps a <column family>:<column qualifier> pair to an arbitrary column name alias specified when creating the Greengage DB external table.
Indirect mapping lets you create a shorter, character-based alias for the HBase <column family>:<column qualifier> pair and is a more robust approach due to the following:
-
While HBase qualifier names can be very long, Greengage DB imposes a 63-character limit on the column name size.
-
While HBase qualifier names can include binary or non-printable characters, Greengage DB column names are limited to standard text characters.
The HBase PXF lookup table must be named pxflookup and must contain a single column family named mapping.
The table is populated as follows:
-
The row key specifies the HBase table name.
-
The
mappingcolumn family qualifier identifies the Greengage DB column name, and the value identifies the HBase<column family>:<column qualifier>pair for which you are creating the alias.
Row key
The HBase table row key is a unique identifier for the table row.
PXF lets you access it from a Greengage DB external table query by using the reserved column named recordkey.
The recordkey column name instructs PXF to return the HBase table record key for each row.
You can then use it in a WHERE clause to filter the HBase table on a range of row key values.
To enable filter pushdown on the recordkey column, define it with the TEXT type.
See Example for a practical example of retrieving a row key in a Greengage DB external table.
Example
This example demonstrates how to configure and use the PXF HDFS connector for reading data from an HBase table to a Greengage DB external table.
Configure the PXF HBase connector
To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:
-
Log in to the Greengage DB master host as
gpadmin. -
Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:
$ mkdir $PXF_BASE/servers/hadoop $ cd $PXF_BASE/servers/hadoop -
Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:
$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml . -
Copy the hbase-site.xml HBase configuration file from the NameNode host of the Hadoop cluster to the current host:
$ scp hdfsuser@namenode:/etc/hbase/conf/hbase-site.xml . -
For the PXF HBase connector to support filter pushdown, copy the $PXF_HOME/share/pxf-hbase-*.jar file to each machine of the HBase cluster. Make sure that the copy location is in the HBase classpath. For example:
$ scp $PXF_HOME/share/pxf-hbase-lib-6.15.0-SNAPSHOT.jar hdfsuser@namenode:/usr/lib/hbase/lib -
Synchronize the PXF configuration to the Greengage DB cluster and then restart PXF on all hosts:
$ pxf cluster sync $ pxf cluster restart
Create a readable external table
-
On the HBase host, launch the HBase shell:
$ hbase shell -
Run the command that creates the HBase table and populates it with sample data. The HBase table named
customersis created in the default namespace and has thenameandcontactcolumn families:create 'customers', 'name', 'contact' put 'customers', '1', 'name:first_name', 'John' put 'customers', '1', 'name:last_name', 'Doe' put 'customers', '1', 'contact:email', 'john.doe@example.com' put 'customers', '1', 'contact:address', '123 Elm Street' put 'customers', '2', 'name:first_name', 'Jane' put 'customers', '2', 'name:last_name', 'Smith' put 'customers', '2', 'contact:email', 'jane.smith@example.com' put 'customers', '2', 'contact:address', '456 Oak Street' put 'customers', '3', 'name:first_name', 'Bob' put 'customers', '3', 'name:last_name', 'Brown' put 'customers', '3', 'contact:email', 'bob.brown@example.com' put 'customers', '3', 'contact:address', '789 Pine Street' put 'customers', '4', 'name:first_name', 'Rob' put 'customers', '4', 'name:last_name', 'Stuart' put 'customers', '4', 'contact:email', 'rob.stuart@example.com' put 'customers', '4', 'contact:address', '119 Willow Street' -
Optionally, query the
customerstable to view the inserted data:scan 'customers'The output should look as follows:
ROW COLUMN+CELL 1 column=contact:address, timestamp=2026-03-13T09:50:53.882, value=123 Elm Street 1 column=contact:email, timestamp=2026-03-13T09:50:53.855, value=john.doe@example.com 1 column=name:first_name, timestamp=2026-03-13T09:50:53.826, value=John 1 column=name:last_name, timestamp=2026-03-13T09:50:53.838, value=Doe 2 column=contact:address, timestamp=2026-03-13T09:50:53.928, value=456 Oak Street 2 column=contact:email, timestamp=2026-03-13T09:50:53.918, value=jane.smith@example.com 2 column=name:first_name, timestamp=2026-03-13T09:50:53.895, value=Jane 2 column=name:last_name, timestamp=2026-03-13T09:50:53.908, value=Smith 3 column=contact:address, timestamp=2026-03-13T09:50:53.984, value=789 Pine Street 3 column=contact:email, timestamp=2026-03-13T09:50:53.960, value=bob.brown@example.com 3 column=name:first_name, timestamp=2026-03-13T09:50:53.940, value=Bob 3 column=name:last_name, timestamp=2026-03-13T09:50:53.950, value=Brown 4 column=contact:address, timestamp=2026-03-13T09:50:54.048, value=119 Willow Street 4 column=contact:email, timestamp=2026-03-13T09:50:54.024, value=rob.stuart@example.com 4 column=name:first_name, timestamp=2026-03-13T09:50:54.002, value=Rob 4 column=name:last_name, timestamp=2026-03-13T09:50:54.013, value=Stuart 4 row(s)
-
On the Greengage DB master host, create an external table that references the
customersHBase table using direct column mapping. Use the quoted column-family-qualified HBase qualifier names (<column family>:<column qualifier>) to enumerate table columns. In theLOCATIONclause, specify the PXFhbaseprofile and the server configuration. In theFORMATclause, specifypxfwritable_import, which is the built-in custom formatter function for read operations:CREATE EXTERNAL TABLE customers_direct ( "name:first_name" TEXT, "name:last_name" TEXT, "contact:address" TEXT, "contact:email" TEXT ) LOCATION ('pxf://customers?PROFILE=hbase&SERVER=hadoop') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); -
Query the created external table:
SELECT * FROM customers_direct;The output should look as follows:
name:first_name | name:last_name | contact:address | contact:email -----------------+----------------+-------------------+------------------------ John | Doe | 123 Elm Street | john.doe@example.com Jane | Smith | 456 Oak Street | jane.smith@example.com Bob | Brown | 789 Pine Street | bob.brown@example.com Rob | Stuart | 119 Willow Street | rob.stuart@example.com (4 rows)
-
On the HBase host, use the HBase shell to create a PXF lookup table and populate it with data. The lookup entries map column aliases to the HBase column-family-qualified qualifier names (
<column family>:<column qualifier>):create 'pxflookup', 'mapping' put 'pxflookup', 'customers', 'mapping:first_name', 'name:first_name' put 'pxflookup', 'customers', 'mapping:last_name', 'name:last_name' put 'pxflookup', 'customers', 'mapping:address', 'contact:address' put 'pxflookup', 'customers', 'mapping:email', 'contact:email' -
Optionally, query the
pxflookuptable to view the inserted mapping data:scan 'pxflookup'The output should look as follows:
ROW COLUMN+CELL customers column=mapping:address, timestamp=2026-03-16T10:10:15.674, value=contact:address customers column=mapping:email, timestamp=2026-03-16T10:10:18.028, value=contact:email customers column=mapping:first_name, timestamp=2026-03-16T10:10:15.592, value=name:first_name customers column=mapping:last_name, timestamp=2026-03-16T10:10:15.641, value=name:last_name 1 row(s)
-
On the Greengage DB master host, create an external table that references the
customersHBase table using indirect column mapping. Use the column aliases to enumerate table columns. In theLOCATIONclause, specify the PXFhbaseprofile and the server configuration. In theFORMATclause, specifypxfwritable_import, which is the built-in custom formatter function for read operations:CREATE EXTERNAL TABLE customers_lookup ( first_name TEXT, last_name TEXT, address TEXT, email TEXT ) LOCATION ('pxf://customers?PROFILE=hbase&SERVER=hadoop') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); -
Query the created external table:
SELECT * FROM customers_lookup;The output should look as follows:
first_name | last_name | address | email ------------+-----------+-------------------+------------------------ John | Doe | 123 Elm Street | john.doe@example.com Jane | Smith | 456 Oak Street | jane.smith@example.com Bob | Brown | 789 Pine Street | bob.brown@example.com Rob | Stuart | 119 Willow Street | rob.stuart@example.com (4 rows)
-
Add a
recordkeycolumn of typeTEXTto thecustomers_lookuptable to retrieve row keys:ALTER EXTERNAL TABLE customers_lookup ADD COLUMN recordkey TEXT; -
Query the external table:
SELECT * FROM customers_lookup;The output should look as follows:
first_name | last_name | address | email | recordkey ------------+-----------+-------------------+------------------------+----------- John | Doe | 123 Elm Street | john.doe@example.com | 1 Jane | Smith | 456 Oak Street | jane.smith@example.com | 2 Bob | Brown | 789 Pine Street | bob.brown@example.com | 3 Rob | Stuart | 119 Willow Street | rob.stuart@example.com | 4 (4 rows)