Hello, I’m DocuDroid!
Submitting feedback
Thank you for rating our AI Search!
We would be grateful if you could share your thoughts so we can improve our AI Search for you and other readers.
GitHub

Use PXF HBase connector to read table data from HBase to Greengage DB

Anton Monakov

Apache HBase is a distributed versioned non-relational database on Hadoop.

HBase relies on the following main concepts:

  • An HBase table is a multidimensional map comprised of one or more columns and rows of data.

  • An HBase column includes two components: a column family and a column qualifier, which are delimited by a colon character (:). You specify the complete set of column families when you create an HBase table.

  • An HBase row consists of a row key and one or more column values. A row key is a unique identifier for the table row.

  • An HBase cell comprises a row (column family, column qualifier, column value) and a timestamp. The column value and timestamp in a given cell represent a version of the value.

For more information about HBase, refer to the Apache HBase Reference Guide.

Create an external table using the PXF protocol

To create a Greengage DB external table that reads data from an HBase table, use the following general syntax:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )

    LOCATION ('pxf://<hbase_table_name>?PROFILE=hbase[&SERVER=<server_name>]')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
Keyword Value

<table_name>

The name of the table to create

<column_name>

The name of the column to create

<data_type>

The data type of the column

LIKE <other_table>

Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy

<hbase_table_name>

The name of the HBase table the data is read from

PROFILE=hbase

The hbase profile is used to read data from an HBase table

SERVER=<server_name>

The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used

FORMAT 'CUSTOM'

The custom format with the built-in custom formatter function for read operation (pxfwritable_import) is used to read data from an HBase table

Data type mapping

HBase is byte-based; it stores all data types as an array of bytes. To represent HBase data in Greengage DB, you need to select the Greengage DB column data type that matches the underlying content of the HBase column qualifier values. In a Greengage DB external table definition, you can reference all column qualifiers of an HBase table or a subset of column qualifiers. Note that PXF does not support complex HBase objects.

PXF supports direct or indirect mapping between a Greengage DB table column and an HBase table column qualifier. See Example for a practical example of creating a Greengage DB external table using both mapping methods.

Direct column mapping

Direct mapping of Greengage DB external table column names to HBase qualifiers is specified via quoted column-family-qualified HBase qualifier names (<column family>:<column qualifier>). When reading the table data, the PXF HBase connector passes these column names to HBase as is.

Indirect column mapping via lookup table

Indirect mapping of Greengage DB external table column names to HBase qualifiers is specified via a lookup table created in HBase. This lookup table maps a <column family>:<column qualifier> pair to an arbitrary column name alias specified when creating the Greengage DB external table.

Indirect mapping lets you create a shorter, character-based alias for the HBase <column family>:<column qualifier> pair and is a more robust approach due to the following:

  • While HBase qualifier names can be very long, Greengage DB imposes a 63-character limit on the column name size.

  • While HBase qualifier names can include binary or non-printable characters, Greengage DB column names are limited to standard text characters.

The HBase PXF lookup table must be named pxflookup and must contain a single column family named mapping. The table is populated as follows:

  • The row key specifies the HBase table name.

  • The mapping column family qualifier identifies the Greengage DB column name, and the value identifies the HBase <column family>:<column qualifier> pair for which you are creating the alias.

Row key

The HBase table row key is a unique identifier for the table row. PXF lets you access it from a Greengage DB external table query by using the reserved column named recordkey. The recordkey column name instructs PXF to return the HBase table record key for each row. You can then use it in a WHERE clause to filter the HBase table on a range of row key values. To enable filter pushdown on the recordkey column, define it with the TEXT type.

See Example for a practical example of retrieving a row key in a Greengage DB external table.

Example

This example demonstrates how to configure and use the PXF HDFS connector for reading data from an HBase table to a Greengage DB external table.

Prerequisites

To try out the practical examples, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the customers test database and connect to it:

DROP DATABASE IF EXISTS customers;
CREATE DATABASE customers;
\c customers

To be able to create an external table using the PXF protocol, enable the PXF extension in the database as described in Register PXF in a database in PXF documentation:

CREATE EXTENSION pxf;

Configure the PXF HBase connector

To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:

  1. Log in to the Greengage DB master host as gpadmin.

  2. Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:

    $ mkdir $PXF_BASE/servers/hadoop
    $ cd $PXF_BASE/servers/hadoop
  3. Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:

    $ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .
  4. Copy the hbase-site.xml HBase configuration file from the NameNode host of the Hadoop cluster to the current host:

    $ scp hdfsuser@namenode:/etc/hbase/conf/hbase-site.xml .
  5. For the PXF HBase connector to support filter pushdown, copy the $PXF_HOME/share/pxf-hbase-*.jar file to each machine of the HBase cluster. Make sure that the copy location is in the HBase classpath. For example:

    $ scp $PXF_HOME/share/pxf-hbase-lib-6.15.0-SNAPSHOT.jar hdfsuser@namenode:/usr/lib/hbase/lib
  6. Synchronize the PXF configuration to the Greengage DB cluster and then restart PXF on all hosts:

    $ pxf cluster sync
    $ pxf cluster restart

Create a readable external table

  1. On the HBase host, launch the HBase shell:

    $ hbase shell
  2. Run the command that creates the HBase table and populates it with sample data. The HBase table named customers is created in the default namespace and has the name and contact column families:

    create 'customers', 'name', 'contact'
    put 'customers', '1', 'name:first_name', 'John'
    put 'customers', '1', 'name:last_name', 'Doe'
    put 'customers', '1', 'contact:email', 'john.doe@example.com'
    put 'customers', '1', 'contact:address', '123 Elm Street'
    put 'customers', '2', 'name:first_name', 'Jane'
    put 'customers', '2', 'name:last_name', 'Smith'
    put 'customers', '2', 'contact:email', 'jane.smith@example.com'
    put 'customers', '2', 'contact:address', '456 Oak Street'
    put 'customers', '3', 'name:first_name', 'Bob'
    put 'customers', '3', 'name:last_name', 'Brown'
    put 'customers', '3', 'contact:email', 'bob.brown@example.com'
    put 'customers', '3', 'contact:address', '789 Pine Street'
    put 'customers', '4', 'name:first_name', 'Rob'
    put 'customers', '4', 'name:last_name', 'Stuart'
    put 'customers', '4', 'contact:email', 'rob.stuart@example.com'
    put 'customers', '4', 'contact:address', '119 Willow Street'
  3. Optionally, query the customers table to view the inserted data:

    scan 'customers'

    The output should look as follows:

    ROW                                    COLUMN+CELL
     1         column=contact:address, timestamp=2026-03-13T09:50:53.882, value=123 Elm Street
     1         column=contact:email, timestamp=2026-03-13T09:50:53.855, value=john.doe@example.com
     1         column=name:first_name, timestamp=2026-03-13T09:50:53.826, value=John
     1         column=name:last_name, timestamp=2026-03-13T09:50:53.838, value=Doe
     2         column=contact:address, timestamp=2026-03-13T09:50:53.928, value=456 Oak Street
     2         column=contact:email, timestamp=2026-03-13T09:50:53.918, value=jane.smith@example.com
     2         column=name:first_name, timestamp=2026-03-13T09:50:53.895, value=Jane
     2         column=name:last_name, timestamp=2026-03-13T09:50:53.908, value=Smith
     3         column=contact:address, timestamp=2026-03-13T09:50:53.984, value=789 Pine Street
     3         column=contact:email, timestamp=2026-03-13T09:50:53.960, value=bob.brown@example.com
     3         column=name:first_name, timestamp=2026-03-13T09:50:53.940, value=Bob
     3         column=name:last_name, timestamp=2026-03-13T09:50:53.950, value=Brown
     4         column=contact:address, timestamp=2026-03-13T09:50:54.048, value=119 Willow Street
     4         column=contact:email, timestamp=2026-03-13T09:50:54.024, value=rob.stuart@example.com
     4         column=name:first_name, timestamp=2026-03-13T09:50:54.002, value=Rob
     4         column=name:last_name, timestamp=2026-03-13T09:50:54.013, value=Stuart
    4 row(s)
  4. On the Greengage DB master host, create an external table that references the customers HBase table using direct column mapping. Use the quoted column-family-qualified HBase qualifier names (<column family>:<column qualifier>) to enumerate table columns. In the LOCATION clause, specify the PXF hbase profile and the server configuration. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:

    CREATE EXTERNAL TABLE customers_direct (
        "name:first_name" TEXT,
        "name:last_name" TEXT,
        "contact:address" TEXT,
        "contact:email" TEXT
        )
        
        LOCATION ('pxf://customers?PROFILE=hbase&SERVER=hadoop')
        FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
  5. Query the created external table:

    SELECT * FROM customers_direct;

    The output should look as follows:

     name:first_name | name:last_name |  contact:address  |     contact:email
    -----------------+----------------+-------------------+------------------------
     John            | Doe            | 123 Elm Street    | john.doe@example.com
     Jane            | Smith          | 456 Oak Street    | jane.smith@example.com
     Bob             | Brown          | 789 Pine Street   | bob.brown@example.com
     Rob             | Stuart         | 119 Willow Street | rob.stuart@example.com
    (4 rows)
  6. On the HBase host, use the HBase shell to create a PXF lookup table and populate it with data. The lookup entries map column aliases to the HBase column-family-qualified qualifier names (<column family>:<column qualifier>):

    create 'pxflookup', 'mapping'
    put 'pxflookup', 'customers', 'mapping:first_name', 'name:first_name'
    put 'pxflookup', 'customers', 'mapping:last_name', 'name:last_name'
    put 'pxflookup', 'customers', 'mapping:address', 'contact:address'
    put 'pxflookup', 'customers', 'mapping:email', 'contact:email'
  7. Optionally, query the pxflookup table to view the inserted mapping data:

    scan 'pxflookup'

    The output should look as follows:

    ROW                                    COLUMN+CELL
     customers         column=mapping:address, timestamp=2026-03-16T10:10:15.674, value=contact:address
     customers         column=mapping:email, timestamp=2026-03-16T10:10:18.028, value=contact:email
     customers         column=mapping:first_name, timestamp=2026-03-16T10:10:15.592, value=name:first_name
     customers         column=mapping:last_name, timestamp=2026-03-16T10:10:15.641, value=name:last_name
    1 row(s)
  8. On the Greengage DB master host, create an external table that references the customers HBase table using indirect column mapping. Use the column aliases to enumerate table columns. In the LOCATION clause, specify the PXF hbase profile and the server configuration. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:

    CREATE EXTERNAL TABLE customers_lookup (
        first_name TEXT,
        last_name TEXT,
        address TEXT,
        email TEXT
        )
        
        LOCATION ('pxf://customers?PROFILE=hbase&SERVER=hadoop')
        FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
  9. Query the created external table:

    SELECT * FROM customers_lookup;

    The output should look as follows:

     first_name | last_name |      address      |         email
    ------------+-----------+-------------------+------------------------
     John       | Doe       | 123 Elm Street    | john.doe@example.com
     Jane       | Smith     | 456 Oak Street    | jane.smith@example.com
     Bob        | Brown     | 789 Pine Street   | bob.brown@example.com
     Rob        | Stuart    | 119 Willow Street | rob.stuart@example.com
    (4 rows)
  10. Add a recordkey column of type TEXT to the customers_lookup table to retrieve row keys:

    ALTER EXTERNAL TABLE customers_lookup 
        ADD COLUMN recordkey TEXT;
  11. Query the external table:

    SELECT * FROM customers_lookup;

    The output should look as follows:

     first_name | last_name |      address      |         email          | recordkey
    ------------+-----------+-------------------+------------------------+-----------
     John       | Doe       | 123 Elm Street    | john.doe@example.com   |         1
     Jane       | Smith     | 456 Oak Street    | jane.smith@example.com |         2
     Bob        | Brown     | 789 Pine Street   | bob.brown@example.com  |         3
     Rob        | Stuart    | 119 Willow Street | rob.stuart@example.com |         4
    (4 rows)