Use PXF Hive connector to read table data from Hive to Greengage DB

Anton Monakov

Contents

Hive data formats
Data type mapping
Partition pruning
Create an external table using the PXF protocol
Examples

Apache Hive is a distributed data warehousing infrastructure. Hive facilitates managing large data sets supporting multiple data formats, including comma-separated values (CSV) TextFile, RCFile, ORC, and Parquet.

This topic describes how to configure and use the PXF Hive connector for reading Hive table data by using external tables and provides practical examples.

NOTE

When accessing Hive 3, the PXF Hive connector supports using the hive[:*] profiles described below to access Hive 3 external tables only. The connector does not support using the hive[:*] profiles to access Hive 3 managed (CRUD and insert-only transactional, and temporary) tables. Use the PXF JDBC Connector to access Hive 3 managed tables instead.

The PXF Hive connector provides the hive, hive:text, hive:rc, and hive:orc profiles to support different file formats.

File format Profile Description

File format	Profile	Description
TextFile	hive, hive:text	Flat file with data in comma-, tab-, or space-separated value format or JSON notation. See example in the Use hive:text profile section
RCFile	hive, hive:rc	Columnar data consisting of binary key/value pairs; high row compression rate. See example in the Use hive:rc profile section
ORC	hive, hive:orc	Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. The `hive:orc` profile: Reads a single row of data at a time. Supports column projection. Supports complex types (array, map, struct, and union). PXF serializes each of these complex types to text. If the `VECTORIZE=true` option is specified when creating an external table, Hive vectorized query execution is enabled. The `hive:orc` profile in this case: Reads up to 1024 rows of data at once. Supports column projection. Does not support complex types or the `timestamp` data type. See examples in the Use hive:orc profile and Read complex data types sections.
Parquet	hive	Compressed columnar data format. See example in the Access Parquet-format Hive tables section
Avro	hive	Serialization framework with a binary data format. See example in the Access Avro-format Hive tables section
SequenceFile	hive	Flat file consisting of binary key/value pairs. See example in the Access SequenceFile-format Hive tables section
Any supported file storage format	hive	The optimal `hive[:*]` profile for the underlying file format type is selected. You can use this profile when the underlying Hive table is composed of multiple partitions stored in files of differing formats. See example in the Access partitioned heterogeneous data section

TextFile

hive, hive:text

Flat file with data in comma-, tab-, or space-separated value format or JSON notation. See example in the Use hive:text profile section

RCFile

hive, hive:rc

Columnar data consisting of binary key/value pairs; high row compression rate. See example in the Use hive:rc profile section

ORC

hive, hive:orc

Optimized row columnar data with stripe, footer, and postscript sections; reduces data size.

The hive:orc profile:

Reads a single row of data at a time.
Supports column projection.
Supports complex types (array, map, struct, and union). PXF serializes each of these complex types to text.

If the VECTORIZE=true option is specified when creating an external table, Hive vectorized query execution is enabled. The hive:orc profile in this case:

Reads up to 1024 rows of data at once.
Supports column projection.
Does not support complex types or the timestamp data type.

See examples in the Use hive:orc profile and Read complex data types sections.

Parquet

hive

Compressed columnar data format. See example in the Access Parquet-format Hive tables section

Avro

hive

Serialization framework with a binary data format. See example in the Access Avro-format Hive tables section

SequenceFile

hive

Flat file consisting of binary key/value pairs. See example in the Access SequenceFile-format Hive tables section

Any supported file storage format

hive

The optimal hive[:*] profile for the underlying file format type is selected. You can use this profile when the underlying Hive table is composed of multiple partitions stored in files of differing formats. See example in the Access partitioned heterogeneous data section

NOTE

PXF uses column projection to increase query performance when using the hive, hive:rc, or hive:orc profiles.

The PXF Hive connector supports primitive and complex data types. Complex Hive data types, including array, struct, map, and union, are mapped to text. You can create Greengage DB user-defined functions (UDF) or application code to extract subcomponents of complex data types.

Primitive types use the following data type mapping.

Hive data type	Greengage DB data type
boolean	BOOL
int	INT4
smallint	INT2
tinyint	INT2
bigint	INT8
float	FLOAT4
double	FLOAT8
string	TEXT
binary	BYTEA
timestamp (1)	TIMESTAMP

Hive data type

Greengage DB data type

boolean

BOOL

int

INT4

smallint

INT2

tinyint

INT2

bigint

INT8

float

FLOAT4

double

FLOAT8

string

TEXT

binary

BYTEA

timestamp (1)

TIMESTAMP

The hive:orc profile does not support the timestamp data type when you specify vectorized query execution (VECTORIZE=true).

The PXF Hive connector supports Hive partition pruning (filtering) and the Hive partition directory structure. This enables partition exclusion on selected HDFS files comprising a Hive table, which reduces network traffic and I/O load. To use partition filtering, run a query on an external table using a WHERE clause that refers to a specific partition column in a partitioned Hive table. See examples in the Access partitioned heterogeneous data and Access partitioned homogenous data sections.

Partition filtering in PXF is supported for Hive string and integral types:

The relational operators =, <, <=, >, >=, and <> are supported on string types.
The relational operators = and <> are supported on integral types. To use partition filtering with Hive integral types, you need to update the Hive configuration as described in the Configure the PXF Hive connector section.
The AND and OR logical operators are supported when used with the relational operators mentioned above.
The LIKE string operator is not supported.

To use PXF partition filtering pushdown, make sure the names of the Hive and the PXF partition field match. Otherwise, pushdown is ignored and filtering is performed on the Greengage DB side.

NOTE

The PXF Hive connector filters only on partition columns, not on other table attributes. PXF filter pushdown is enabled by default and is supported only for the data types and operators listed above. Learn more about filter pushdown in Filter pushdown in PXF documentation.

When dynamic partitioning is enabled in Hive, a partitioned table may store data in a default partition. Hive creates a default partition when the value of a partitioning column does not match the defined type of the column (for example, when a NULL value is used for any partitioning column).

In this case, query results may differ between Hive and PXF queries. In Hive, any query that includes a filter on a partition column excludes any data that is stored in the table’s default partition. Similar to Hive, PXF represents table’s partitioning columns as the columns that are appended to the end of the table. However, any column value in a default partition is translated to a NULL value. This means that a Greengage DB query that includes an IS NULL filter on a partitioning column can return different results than the same Hive query. See Access Hive default partition for a practical example.

Hive tables can be backed by one or more files with each file having a unique layout or schema, so the column names in a Greengage DB external table definition must match the Hive table column names. This direct name mapping lets you:

Create an external table with columns in a different order than the Hive table.
Create an external table containing only a subset of the Hive table columns.
Read a Hive table whose backing files have a different number of columns.

To create a Greengage DB external table to read Hive table data, use the following general syntax:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )

    LOCATION ('pxf://<hive_db_name>.<hive_table_name>?PROFILE=<profile_name>[&<custom_option>=<value>[...]]')
    FORMAT 'CUSTOM|TEXT' (FORMATTER='pxfwritable_import' | delimiter='<delim>');

Keyword Value

Keyword	Value
<table_name>	The name of the table to create
<column_name>	The name of the column to create
<data_type>	The data type of the column
LIKE <other_table>	Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy
<hive_db_name>	The name of the Hive database. If omitted, defaults to the Hive database named `default`
<hive_table_name>	The name of the Hive table
PROFILE=<profile_name>	One of the following profiles is used to read Hive table data: `hive`, `hive:text`, `hive:rc`, or `hive:orc`
<custom_option>	One of the custom options provided in the `LOCATION` string as described below
SERVER=<server_name>	The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used
PPD=<boolean>	Activates predicate pushdown for all queries on this table. Applies only to the `hive`, `hive:orc`, and `hive:rc` profiles and overrides a `pxf.ppd.hive` property setting in the server configuration
VECTORIZE=<boolean>	For the `hive:orc` profile specifies whether PXF uses vectorized query execution when accessing the underlying ORC files. The default value is `false`: vectorized query execution is not used
FORMAT (`hive` and `hive:orc` profiles)	The custom format with the built-in custom formatter function for read operation (`pxfwritable_import`) is used to read Hive table data
FORMAT (`hive:text` and `hive:rc` profiles)	The `TEXT` format must be specified, and a single ASCII character must be set as the field delimiter (`delimiter`). You can use an E'' escape string constant, for example `delimiter=E','`

<table_name>

The name of the table to create

<column_name>

The name of the column to create

<data_type>

The data type of the column

LIKE <other_table>

Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy

<hive_db_name>

The name of the Hive database. If omitted, defaults to the Hive database named default

<hive_table_name>

The name of the Hive table

PROFILE=<profile_name>

One of the following profiles is used to read Hive table data: hive, hive:text, hive:rc, or hive:orc

<custom_option>

One of the custom options provided in the LOCATION string as described below

SERVER=<server_name>

The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used

PPD=<boolean>

Activates predicate pushdown for all queries on this table. Applies only to the hive, hive:orc, and hive:rc profiles and overrides a pxf.ppd.hive property setting in the server configuration

VECTORIZE=<boolean>

For the hive:orc profile specifies whether PXF uses vectorized query execution when accessing the underlying ORC files. The default value is false: vectorized query execution is not used

FORMAT (hive and hive:orc profiles)

The custom format with the built-in custom formatter function for read operation (pxfwritable_import) is used to read Hive table data

FORMAT (hive:text and hive:rc profiles)

The TEXT format must be specified, and a single ASCII character must be set as the field delimiter (delimiter). You can use an E'' escape string constant, for example delimiter=E','

These examples demonstrate how to configure and use the PXF Hive connector for reading Hive table data by using external tables.

Prerequisites

To try out the practical examples, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the customers test database and connect to it:

DROP DATABASE IF EXISTS customers;
CREATE DATABASE customers;
\c customers

To be able to create an external table using the PXF protocol, enable the PXF extension in the database as described in Register PXF in a database in PXF documentation:

CREATE EXTENSION pxf;

To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:

Log in to the Greengage DB master host as gpadmin.
Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:
```
$ mkdir $PXF_BASE/servers/hadoop
$ cd $PXF_BASE/servers/hadoop
```

Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:

$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
$ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
$ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
$ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .

If you plan to use PXF filter pushdown with Hive integral types, ensure that the hive.metastore.integral.jdo.pushdown configuration parameter exists and is set to true in the hive-site.xml Hive configuration file. Then copy the hive-site.xml file from the NameNode host of the Hadoop cluster to the current host:
```
$ scp hdfsuser@namenode:/etc/hbase/conf/hive-site.xml .
```
Synchronize the server configuration to the Greengage DB cluster hosts:
```
$ pxf cluster sync
```

In the /tmp directory on the HiveServer host, create the customers.txt text file with comma-separated data fields content:
```
John Doe,24,152.12
Jane Smith,66,456.72
Bob Brown,12,51.19
Rob Stuart,42,34.81
```
On the HiveClient host, launch the Hive command line:
```
$ hive
```
Create the customers Hive table to store the sample dataset. The FIELDS TERMINATED BY clause sets the comma character (,) as the field delimiter within a data record (line). The STORED AS TEXTFILE clause instructs Hive to create the table in the TEXTFILE (the default) format:
```
CREATE TABLE customers (
    name STRING,
    orders INT, 
    sales DOUBLE
    )
    
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;
```

Load the customers.txt sample data file into the created customers table:

LOAD DATA LOCAL INPATH '/tmp/customers.txt'
INTO TABLE customers;

Optionally, query the customers table:

SELECT * FROM customers;

The output should look as follows:

+-----------------+-------------------+------------------+
| customers.name  | customers.orders  | customers.sales  |
+-----------------+-------------------+------------------+
| John Doe        | 24                | 152.12           |
| Jane Smith      | 66                | 456.72           |
| Bob Brown       | 12                | 51.19            |
| Rob Stuart      | 42                | 34.81            |
+-----------------+-------------------+------------------+
4 rows selected (0.079 seconds)

You can view a Hive table’s location in HDFS by using the DESCRIBE EXTENDED command:
```
DESCRIBE EXTENDED customers;
```
In the command output, the table HDFS location is listed in the location entry:
```
...
location:hdfs://hadoop/apps/hive/warehouse/customers
...
```

Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section.

On the Greengage DB master host, create an external table that references the customers Hive table. In the LOCATION clause, specify the PXF hive profile and the server configuration. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:
```
CREATE EXTERNAL TABLE customers_hive (
    name TEXT,
    orders INT, 
    sales FLOAT
    )

    LOCATION ('pxf://default.customers?PROFILE=hive&SERVER=hadoop')
    FORMAT 'custom' (FORMATTER='pxfwritable_import');
```

Query the created external table:

SELECT * FROM customers_hive;

The output should look as follows:

    name    | orders | sales
------------+--------+--------
 John Doe   |     24 | 152.12
 Jane Smith |     66 | 456.72
 Bob Brown  |     12 |  51.19
 Rob Stuart |     42 |  34.81
(4 rows)

Use hive:text profile

Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section.

On the Greengage DB master host, create an external table that references the customers Hive table. In the LOCATION clause, specify the PXF hive:text profile and the server configuration. In the FORMAT clause, specify TEXT as the data format and set the comma character (,) as a delimiter (delimiter):
```
CREATE EXTERNAL TABLE customers_text (
    name TEXT,
    orders INT, 
    sales FLOAT
    )

    LOCATION ('pxf://default.customers?PROFILE=hive:text&SERVER=hadoop')
    FORMAT 'TEXT' (delimiter=E',');
```

Query the created external table:

SELECT * FROM customers_text;

The output should look as follows:

    name    | orders | sales
------------+--------+--------
 John Doe   |     24 | 152.12
 Jane Smith |     66 | 456.72
 Bob Brown  |     12 |  51.19
 Rob Stuart |     42 |  34.81
(4 rows)

Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section.

On the HiveClient host, launch the Hive command line:
```
$ hive
```

Create a Hive table stored in RCFile format:

CREATE TABLE customers_rc (
    name STRING,
    orders INT, 
    sales DOUBLE
    )
    
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS RCFILE;

Copy the data from the customers table into customers_rc to store the dataset in the RCFile format:
```
INSERT INTO TABLE customers_rc SELECT * FROM customers;
```

Query the customers_rc Hive table to verify that the data was loaded correctly:

SELECT * FROM customers_rc;

The output should look as follows:

+--------------------+----------------------+---------------------+
| customers_rc.name  | customers_rc.orders  | customers_rc.sales  |
+--------------------+----------------------+---------------------+
| John Doe           | 24                   | 152.12              |
| Jane Smith         | 66                   | 456.72              |
| Bob Brown          | 12                   | 51.19               |
| Rob Stuart         | 42                   | 34.81               |
+--------------------+----------------------+---------------------+
4 rows selected (0.1 seconds)

On the Greengage DB master host, create an external table that references the customers_rc Hive table. In the LOCATION clause, specify the PXF hive:rc profile and the server configuration. In the FORMAT clause, specify TEXT as the data format and set the comma character (,) as a delimiter (delimiter):
```
CREATE EXTERNAL TABLE customers_rc (
    name TEXT,
    orders INT, 
    sales FLOAT
    )

    LOCATION ('pxf://default.customers_rc?PROFILE=hive:rc&SERVER=hadoop')
    FORMAT 'TEXT' (delimiter=E',');
```

Query the created external table:

SELECT * FROM customers_rc;

The output should look as follows:

    name    | orders | sales
------------+--------+--------
 John Doe   |     24 | 152.12
 Jane Smith |     66 | 456.72
 Bob Brown  |     12 |  51.19
 Rob Stuart |     42 |  34.81
(4 rows)

Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section.

On the HiveClient host, launch the Hive command line:
```
$ hive
```

Create a Hive table stored in the ORC file format:

CREATE TABLE customers_orc (
    name STRING,
    orders INT, 
    sales DOUBLE
    )
    
    STORED AS ORC;

Copy the data from the customers table into customers_orc to store the dataset in the ORC format:
```
INSERT INTO TABLE customers_orc SELECT * FROM customers;
```

Optionally, query the customers_orc table to verify that the data was loaded successfully:

SELECT * FROM customers_orc;

The output should look as follows:

+---------------------+-----------------------+----------------------+
| customers_orc.name  | customers_orc.orders  | customers_orc.sales  |
+---------------------+-----------------------+----------------------+
| John Doe            | 24                    | 152.12               |
| Jane Smith          | 66                    | 456.72               |
| Bob Brown           | 12                    | 51.19                |
| Rob Stuart          | 42                    | 34.81                |
+---------------------+-----------------------+----------------------+
4 rows selected (0.095 seconds)

On the Greengage DB master host, create an external table that references the customers_orc Hive table. In the LOCATION clause, specify the PXF hive:orc profile and the server configuration. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:
```
CREATE EXTERNAL TABLE customers_orc (
    name TEXT,
    orders INT, 
    sales FLOAT
    )

    LOCATION ('pxf://default.customers_orc?PROFILE=hive:orc&SERVER=hadoop')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');    
```

Query the created external table:

SELECT * FROM customers_orc;

The output should look as follows:

    name    | orders | sales
------------+--------+--------
 John Doe   |     24 | 152.12
 Jane Smith |     66 | 456.72
 Bob Brown  |     12 |  51.19
 Rob Stuart |     42 |  34.81
(4 rows)

Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section.

On the HiveClient host, launch the Hive command line:
```
$ hive
```

Create a Hive table stored in Parquet format:

CREATE TABLE customers_parquet (
    name STRING,
    orders INT, 
    sales DOUBLE
    )
    
    STORED AS PARQUET;

Copy the data from the customers table into customers_parquet to store the dataset in the Parquet format:
```
INSERT INTO TABLE customers_parquet SELECT * FROM customers;
```

Optionally, query the customers_parquet table:

SELECT * FROM customers_parquet;

The output should look as follows:

+-------------------------+---------------------------+--------------------------+
| customers_parquet.name  | customers_parquet.orders  | customers_parquet.sales  |
+-------------------------+---------------------------+--------------------------+
| John Doe                | 24                        | 152.12                   |
| Jane Smith              | 66                        | 456.72                   |
| Bob Brown               | 12                        | 51.19                    |
| Rob Stuart              | 42                        | 34.81                    |
+-------------------------+---------------------------+--------------------------+
4 rows selected (0.1 seconds)

On the Greengage DB master host, create an external table that references the customers_parquet Hive table. In the LOCATION clause, specify the PXF hive profile and the server configuration. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:
```
CREATE EXTERNAL TABLE customers_parquet (
    name TEXT,
    orders INT, 
    sales FLOAT
    )

    LOCATION ('pxf://default.customers_parquet?PROFILE=hive&SERVER=hadoop')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');    
```

Query the created external table:

SELECT * FROM customers_parquet;

The output should look as follows:

    name    | orders | sales
------------+--------+--------
 John Doe   |     24 | 152.12
 Jane Smith |     66 | 456.72
 Bob Brown  |     12 |  51.19
 Rob Stuart |     42 |  34.81
(4 rows)

Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section.

On the HiveClient host, launch the Hive command line:
```
$ hive
```

Create a Hive table stored in Avro format:

CREATE TABLE customers_avro (
    name STRING,
    orders INT, 
    sales DOUBLE
    )
    
    STORED AS AVRO;

Copy the data from the customers table into customers_avro to store the dataset in the Avro format:
```
INSERT INTO TABLE customers_avro SELECT * FROM customers;
```

Optionally, query the customers_avro table:

SELECT * FROM customers_avro;

The output should look as follows:

+----------------------+------------------------+-----------------------+
| customers_avro.name  | customers_avro.orders  | customers_avro.sales  |
+----------------------+------------------------+-----------------------+
| John Doe             | 24                     | 152.12                |
| Jane Smith           | 66                     | 456.72                |
| Bob Brown            | 12                     | 51.19                 |
| Rob Stuart           | 42                     | 34.81                 |
+----------------------+------------------------+-----------------------+
4 rows selected (0.198 seconds)

On the Greengage DB master host, create an external table that references the customers_avro Hive table. In the LOCATION clause, specify the PXF hive profile and the server configuration. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:
```
CREATE EXTERNAL TABLE customers_avro (
    name TEXT,
    orders INT, 
    sales FLOAT
    )

    LOCATION ('pxf://default.customers_avro?PROFILE=hive&SERVER=hadoop')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');    
```

Query the created external table:

SELECT * FROM customers_avro;

The output should look as follows:

    name    | orders | sales
------------+--------+--------
 John Doe   |     24 | 152.12
 Jane Smith |     66 | 456.72
 Bob Brown  |     12 |  51.19
 Rob Stuart |     42 |  34.81
(4 rows)

Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section.

On the HiveClient host, launch the Hive command line:
```
$ hive
```

Create a Hive table stored in SequenceFile format:

CREATE TABLE customers_seqfile (
    name STRING,
    orders INT, 
    sales DOUBLE
    )
    
    STORED AS SEQUENCEFILE;

Copy the data from the customers table into customers_seqfile to store the dataset in the SequenceFile format:
```
INSERT INTO TABLE customers_seqfile SELECT * FROM customers;
```

Optionally, query the customers_seqfile table:

SELECT * FROM customers_seqfile;

The output should look as follows:

+-------------------------+---------------------------+--------------------------+
| customers_seqfile.name  | customers_seqfile.orders  | customers_seqfile.sales  |
+-------------------------+---------------------------+--------------------------+
| John Doe                | 24                        | 152.12                   |
| Jane Smith              | 66                        | 456.72                   |
| Bob Brown               | 12                        | 51.19                    |
| Rob Stuart              | 42                        | 34.81                    |
+-------------------------+---------------------------+--------------------------+
4 rows selected (0.086 seconds)

On the Greengage DB master host, create an external table that references the customers_seqfile Hive table. In the LOCATION clause, specify the PXF hive profile and the server configuration. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:
```
CREATE EXTERNAL TABLE customers_seqfile (
    name TEXT,
    orders INT, 
    sales FLOAT
    )

    LOCATION ('pxf://default.customers_seqfile?PROFILE=hive&SERVER=hadoop')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');    
```

Query the created external table:

SELECT * FROM customers_seqfile;

The output should look as follows:

    name    | orders | sales
------------+--------+--------
 John Doe   |     24 | 152.12
 Jane Smith |     66 | 456.72
 Bob Brown  |     12 |  51.19
 Rob Stuart |     42 |  34.81
(4 rows)

In the /tmp directory on the Hive host, create the customers_complex.txt text file. When specifying an array field in a Hive table, you must identify the terminator character for each item in the collection as well as each map key. The customers_complex.txt file uses a comma character (,) to separate field values, the percent symbol (%) to separate collection items, and a colon (:) to terminate map key values:
```
John Doe,1%2%3,tier:gold%subscription:active%referrals:6
Jane Smith,4%5%6,tier:silver%subscription:active
Bob Brown,7%8%9,tier:bronze%subscription:inactive%referrals:4
Rob Stuart,0%1%2,tier:gold%subscription:active%referrals:2
```
On the HiveClient host, launch the Hive command line:
```
$ hive
```
Create the customers_c Hive table to store the sample dataset. The STORED AS TEXTFILE clause instructs Hive to create the table in the TEXTFILE (the default) format. The FIELDS TERMINATED BY clause sets the comma character (,) as the field delimiter within a data record (line). The COLLECTION ITEMS TERMINATED BY clause sets the percent sign (%) as the collection items (array item, map key/value pair) terminator. The MAP KEYS TERMINATED BY clause sets the colon character (:) as the terminator for map keys:
```
CREATE TABLE customers_c (
    name STRING,
    metrics ARRAY<int>, 
    status MAP<string, string>
    )
    
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '%'
    MAP KEYS TERMINATED BY ':'
    STORED AS TEXTFILE;
```

Load the customers_complex.txt sample data file into the created customers_c table:

LOAD DATA LOCAL INPATH '/tmp/customers_complex.txt'
INTO TABLE customers_c;

Create a Hive table stored in the ORC format:

CREATE TABLE customers_c_orc (
    name STRING,
    metrics ARRAY<int>, 
    status MAP<string, string>
    )
    
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '%'
    MAP KEYS TERMINATED BY ':'
    STORED AS ORC;

Copy the data from the customers_c table into customers_c_orc to store the dataset in the ORC format:
```
INSERT INTO TABLE customers_c_orc SELECT * FROM customers_c;
```

Optionally, query the customers_c_orc Hive table:

SELECT * FROM customers_c_orc;

The output should look as follows

+-----------------------+--------------------------+-------------------------------------------------------------+
| customers_c_orc.name  | customers_c_orc.metrics  |                   customers_c_orc.status                    |
+-----------------------+--------------------------+-------------------------------------------------------------+
| John Doe              | [1,2,3]                  | {"tier":"gold","subscription":"active","referrals":"6"}     |
| Jane Smith            | [4,5,6]                  | {"tier":"silver","subscription":"active"}                   |
| Bob Brown             | [7,8,9]                  | {"tier":"bronze","subscription":"inactive","referrals":"4"} |
| Rob Stuart            | [0,1,2]                  | {"tier":"gold","subscription":"active","referrals":"2"}     |
+-----------------------+--------------------------+-------------------------------------------------------------+
4 rows selected (0.082 seconds)

On the Greengage DB master host, create an external table that references the customers_c_orc Hive table. In the LOCATION clause, specify the PXF hive:orc profile and the server configuration. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:
```
CREATE EXTERNAL TABLE customers_c_orc (
    name TEXT,
    metrics TEXT, 
    status TEXT
    )

    LOCATION ('pxf://default.customers_c_orc?PROFILE=hive:orc&SERVER=hadoop')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

Query the created external table:

SELECT * FROM customers_c_orc;

The output should look as follows, with metrics and status serialized as text strings:

    name    | metrics |                           status
------------+---------+-------------------------------------------------------------
 John Doe   | [1,2,3] | {"tier":"gold","subscription":"active","referrals":"6"}
 Jane Smith | [4,5,6] | {"tier":"silver","subscription":"active"}
 Bob Brown  | [7,8,9] | {"tier":"bronze","subscription":"inactive","referrals":"4"}
 Rob Stuart | [0,1,2] | {"tier":"gold","subscription":"active","referrals":"2"}
(4 rows)

On the HiveClient host, launch the Hive command line:
```
$ hive
```

Create a Hive table named customers_part with two partition columns, state and city:

CREATE TABLE customers_part (
    name STRING,
    sales DOUBLE
    )

    PARTITIONED BY (state string, city string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Load the sample data into the created table, partitioning it by state and city:

INSERT INTO TABLE customers_part
    PARTITION(state = 'California', city = 'Fresno')
    VALUES ('John Doe', 150.00);

INSERT INTO TABLE customers_part
    PARTITION(state = 'California', city = 'Sacramento')
    VALUES ('Jane Smith', 175.50);

INSERT INTO TABLE customers_part
    PARTITION(state = 'Nevada', city = 'Reno')
    VALUES ('Bob Brown', 200.75);

INSERT INTO TABLE customers_part
    PARTITION(state = 'Nevada', city = 'Las Vegas')
    VALUES ('Rob Stuart', 180.20);

Query the customers_part table:

SELECT * FROM customers_part;

The output should look as follows, with partition fields added at the end of each record:

+----------------------+-----------------------+-----------------------+----------------------+
| customers_part.name  | customers_part.sales  | customers_part.state  | customers_part.city  |
+----------------------+-----------------------+-----------------------+----------------------+
| John Doe             | 150.0                 | California            | Fresno               |
| Jane Smith           | 175.5                 | California            | Sacramento           |
| Rob Stuart           | 180.2                 | Nevada                | Las Vegas            |
| Bob Brown            | 200.75                | Nevada                | Reno                 |
+----------------------+-----------------------+-----------------------+----------------------+
4 rows selected (0.127 seconds)

Examine the Hive HDFS directory structure for the customers_part table:

$ hdfs dfs -ls -C -R /apps/hive/warehouse/customers_part

The output should look as follows:

/apps/hive/warehouse/customers_part/state=California
/apps/hive/warehouse/customers_part/state=California/city=Fresno
/apps/hive/warehouse/customers_part/state=California/city=Fresno/000000_0
/apps/hive/warehouse/customers_part/state=California/city=Sacramento
/apps/hive/warehouse/customers_part/state=California/city=Sacramento/000000_0
/apps/hive/warehouse/customers_part/state=Nevada
/apps/hive/warehouse/customers_part/state=Nevada/city=Las Vegas
/apps/hive/warehouse/customers_part/state=Nevada/city=Las Vegas/000000_0
/apps/hive/warehouse/customers_part/state=Nevada/city=Reno
/apps/hive/warehouse/customers_part/state=Nevada/city=Reno/000000_0

On the Greengage DB master host, create an external table that references the customers_part Hive table. To take advantage of partition filter pushdown, define the state and city columns corresponding to the Hive partition fields at the end of the columns list. In the LOCATION clause, specify the PXF hive profile and the server configuration. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:
```
CREATE EXTERNAL TABLE customers_part (
    name TEXT,
    sales FLOAT,
    state TEXT, 
    city TEXT
    )

    LOCATION ('pxf://default.customers_part?PROFILE=hive&SERVER=hadoop')
    FORMAT 'custom' (FORMATTER='pxfwritable_import');
```

Query the created external table:

SELECT * FROM customers_part;

The output should look as follows:

    name    | sales  |   state    |    city
------------+--------+------------+------------
 Jane Smith |  175.5 | California | Sacramento
 Rob Stuart |  180.2 | Nevada     | Las Vegas
 John Doe   |    150 | California | Fresno
 Bob Brown  | 200.75 | Nevada     | Reno
(4 rows)

Query the customers_part external table to return the records where the state value is Nevada and the sales value is greater than 170. The filter on sales is not pushed down, since it is not a partition column. It is performed on the Greengage DB side after all the data in the Nevada partition is transferred for processing:
```
SELECT * FROM customers_part where state = 'Nevada' and sales > 170;
```
The output should look as follows:
```
    name    | sales  | state  |   city
------------+--------+--------+-----------
 Rob Stuart |  180.2 | Nevada | Las Vegas
 Bob Brown  | 200.75 | Nevada | Reno
(2 rows)
```
Query the customers_part external table to return the records where the state value is California. The filter on state is pushed down, since it is a partition column. It is performed on the Hive side before the data in the California partition is transferred for processing:
```
SELECT * FROM customers_part where state = 'California';
```
The output should look as follows:
```
    name    | sales |   state    |    city
------------+-------+------------+------------
 John Doe   |   150 | California | Fresno
 Jane Smith | 175.5 | California | Sacramento
(2 rows)
```

Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section and its RCfile copy as described in the Use hive:rc profile section.

On the HiveClient host, launch the Hive command line:
```
$ hive
```

Create a Hive external table that will comprise the HDFS data files associated with the customers and customers_rc Hive tables. The table is partitioned by the year string field:

CREATE EXTERNAL TABLE customers_part_m (
    name STRING,
    orders INT,
    sales DOUBLE
    )
    
    PARTITIONED BY (year string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Run the DESCRIBE EXTENDED commands on the customers and customers_rc tables to view the HDFS file location for each table:
```
DESCRIBE EXTENDED customers;
DESCRIBE EXTENDED customers_rc;
```
In the command output, the table HDFS location is listed in the location entry:
```
...
location:hdfs://hadoop/apps/hive/warehouse/customers
location:hdfs://hadoop/apps/hive/warehouse/customers_rc
...
```

In the customers_part_m HDFS table, create partitions referring to the file locations associated with the customers and customers_rc tables:

ALTER TABLE customers_part_m ADD PARTITION (year = '2025') LOCATION 'hdfs://hadoop/apps/hive/warehouse/customers';
ALTER TABLE customers_part_m ADD PARTITION (year = '2026') LOCATION 'hdfs://hadoop/apps/hive/warehouse/customers_rc';

The customers table has no storage format specified, so the default TEXTFILE is used. For the customers_rc table, explicitly specify the file format associated with the partition:
```
ALTER TABLE customers_part_m PARTITION (year='2026') SET FILEFORMAT RCFILE;
```

View the created partitions:

SHOW PARTITIONS customers_part_m;

The output should look as follows:

+------------+
| partition  |
+------------+
| year=2025  |
| year=2026  |
+------------+
2 rows selected (0.044 seconds)

Query the created external Hive table:

SELECT * FROM customers_part_m;

The output should look as follows:

+------------------------+--------------------------+-------------------------+------------------------+
| customers_part_m.name  | customers_part_m.orders  | customers_part_m.sales  | customers_part_m.year  |
+------------------------+--------------------------+-------------------------+------------------------+
| John Doe               | 24                       | 152.12                  | 2025                   |
| Jane Smith             | 66                       | 456.72                  | 2025                   |
| Bob Brown              | 12                       | 51.19                   | 2025                   |
| Rob Stuart             | 42                       | 34.81                   | 2025                   |
| John Doe               | 24                       | 152.12                  | 2026                   |
| Jane Smith             | 66                       | 456.72                  | 2026                   |
| Bob Brown              | 12                       | 51.19                   | 2026                   |
| Rob Stuart             | 42                       | 34.81                   | 2026                   |
+------------------------+--------------------------+-------------------------+------------------------+
8 rows selected (0.135 seconds)

On the Greengage DB master host, create an external table that references the customers_part_m Hive table. In the LOCATION clause, specify the PXF hive profile and the server configuration. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:
```
CREATE EXTERNAL TABLE customers_part_m (
    name TEXT,
    orders INT,
    sales FLOAT,
    year TEXT
    )

    LOCATION ('pxf://default.customers_part_m?PROFILE=hive&SERVER=hadoop')
    FORMAT 'custom' (FORMATTER='pxfwritable_import');
```

Query the created external table:

SELECT * FROM customers_part_m;

The output should look as follows:

    name    | orders | sales  | year
------------+--------+--------+------
 John Doe   |     24 | 152.12 | 2026
 Jane Smith |     66 | 456.72 | 2026
 Bob Brown  |     12 |  51.19 | 2026
 Rob Stuart |     42 |  34.81 | 2026
 John Doe   |     24 | 152.12 | 2025
 Jane Smith |     66 | 456.72 | 2025
 Bob Brown  |     12 |  51.19 | 2025
 Rob Stuart |     42 |  34.81 | 2025
(8 rows)

Query the external table to return the sum of the sales column values for the year value of 2025:
```
SELECT sum(sales) FROM customers_part_m where year = '2025';
```
The output should look as follows:
```
  sum
--------
 694.84
(1 row)
```

On the HiveClient host, launch the Hive command line:
```
$ hive
```

Create a Hive table named customers_part_d partitioned by the year string field:

CREATE TABLE customers_part_d (
    name STRING,
    orders INT, 
    sales DOUBLE
    )

    PARTITIONED BY (year string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Load the sample data into the created table. Notice that the PARTITION clause is omitted and no partitioning column value is provided for the third row:

INSERT INTO TABLE customers_part_d
    PARTITION(year = 2025)
    VALUES ('John Doe', 30, 150.00);

INSERT INTO TABLE customers_part_d
    PARTITION(year = 2025)
    VALUES ('Jane Smith', 28, 175.50);

INSERT INTO TABLE customers_part_d
    VALUES ('Bob Brown', 42, 200.75, NULL);

INSERT INTO TABLE customers_part_d
    PARTITION(year = 2025)
    VALUES ('Rob Stuart', 35, 180.20);

Query the customers_part_d table:

SELECT * FROM customers_part_d;

The output should look as follows. Notice that __HIVE_DEFAULT_PARTITION__ is returned for the NULL value in the year partitioning column:

+------------------------+--------------------------+-------------------------+-----------------------------+
| customers_part_d.name  | customers_part_d.orders  | customers_part_d.sales  |    customers_part_d.year    |
+------------------------+--------------------------+-------------------------+-----------------------------+
| John Doe               | 30                       | 150.0                   | 2025                        |
| Jane Smith             | 28                       | 175.5                   | 2025                        |
| Rob Stuart             | 35                       | 180.2                   | 2025                        |
| Bob Brown              | 42                       | 200.75                  | __HIVE_DEFAULT_PARTITION__  |
+------------------------+--------------------------+-------------------------+-----------------------------+
4 rows selected (0.112 seconds)

Query the customers_part_d table filtering on the year partition column:

SELECT * FROM customers_part_d WHERE year IS NULL;

The output should contain no rows, since the data stored in the table’s default partition is excluded:

+------------------------+--------------------------+-------------------------+------------------------+
| customers_part_d.name  | customers_part_d.orders  | customers_part_d.sales  | customers_part_d.year  |
+------------------------+--------------------------+-------------------------+------------------------+
+------------------------+--------------------------+-------------------------+------------------------+
No rows selected (0.165 seconds)

On the Greengage DB master host, create an external table that references the customers_part_d Hive table. In the LOCATION clause, specify the PXF hive profile and the server configuration. In the FORMAT clause, specify pxfwritable_import, which is the built-in custom formatter function for read operations:
```
CREATE EXTERNAL TABLE customers_part_d (
    name TEXT,
    orders INT, 
    sales FLOAT,
    year TEXT 
    )

    LOCATION ('pxf://default.customers_part_d?PROFILE=hive&SERVER=hadoop')
    FORMAT 'custom' (FORMATTER='pxfwritable_import');
```

Query the created external table:

SELECT * FROM customers_part_d;

The output should look as follows:

    name    | orders | sales  | year
------------+--------+--------+------
 Jane Smith |     28 |  175.5 | 2025
 Bob Brown  |     42 | 200.75 |
 Rob Stuart |     35 |  180.2 | 2025
 John Doe   |     30 |    150 | 2025
(4 rows)

Query the created external table filtering on the year partition column.
```
SELECT * FROM customers_part_d WHERE year IS NULL;
```
Since all default partition values are translated into actual NULL values, the query returns a single row:
```
   name    | orders | sales  | year
-----------+--------+--------+------
 Bob Brown |     42 | 200.75 |
(1 row)
```

HBase

Text

Use PXF Hive connector to read table data from Hive to Greengage DB

Hive data formats

Data type mapping

Partition pruning

Create an external table using the PXF protocol

Examples

Configure the PXF Hive connector

Create a sample dataset

Use hive profile

Use hive:text profile

Use hive:rc profile

Use hive:orc profile

Access Parquet-format Hive tables

Access Avro-format Hive tables

Access SequenceFile-format Hive tables

Read complex data types

Access partitioned homogenous data

Access partitioned heterogeneous data

Access Hive default partition