Use PXF Hive connector to read table data from Hive to Greengage DB
- Hive data formats
- Data type mapping
- Partition pruning
- Create an external table using the PXF protocol
- Examples
- Configure the PXF Hive connector
- Create a sample dataset
- Use hive profile
- Use hive:text profile
- Use hive:rc profile
- Use hive:orc profile
- Access Parquet-format Hive tables
- Access Avro-format Hive tables
- Access SequenceFile-format Hive tables
- Read complex data types
- Access partitioned homogenous data
- Access partitioned heterogeneous data
- Access Hive default partition
Apache Hive is a distributed data warehousing infrastructure. Hive facilitates managing large data sets supporting multiple data formats, including comma-separated values (CSV) TextFile, RCFile, ORC, and Parquet.
This topic describes how to configure and use the PXF Hive connector for reading Hive table data by using external tables and provides practical examples.
When accessing Hive 3, the PXF Hive connector supports using the hive[:*] profiles described below to access Hive 3 external tables only.
The connector does not support using the hive[:*] profiles to access Hive 3 managed (CRUD and insert-only transactional, and temporary) tables.
Use the PXF JDBC Connector to access Hive 3 managed tables instead.
Hive data formats
The PXF Hive connector provides the hive, hive:text, hive:rc, and hive:orc profiles to support different file formats.
| File format | Profile | Description |
|---|---|---|
TextFile |
hive, hive:text |
Flat file with data in comma-, tab-, or space-separated value format or JSON notation. See example in the Use hive:text profile section |
RCFile |
hive, hive:rc |
Columnar data consisting of binary key/value pairs; high row compression rate. See example in the Use hive:rc profile section |
ORC |
hive, hive:orc |
Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. The
If the
See examples in the Use hive:orc profile and Read complex data types sections. |
Parquet |
hive |
Compressed columnar data format. See example in the Access Parquet-format Hive tables section |
Avro |
hive |
Serialization framework with a binary data format. See example in the Access Avro-format Hive tables section |
SequenceFile |
hive |
Flat file consisting of binary key/value pairs. See example in the Access SequenceFile-format Hive tables section |
Any supported file storage format |
hive |
The optimal |
PXF uses column projection to increase query performance when using the hive, hive:rc, or hive:orc profiles.
Data type mapping
The PXF Hive connector supports primitive and complex data types. Complex Hive data types, including array, struct, map, and union, are mapped to text. You can create Greengage DB user-defined functions (UDF) or application code to extract subcomponents of complex data types.
Primitive types use the following data type mapping.
| Hive data type | Greengage DB data type |
|---|---|
boolean |
BOOL |
int |
INT4 |
smallint |
INT2 |
tinyint |
INT2 |
bigint |
INT8 |
float |
FLOAT4 |
double |
FLOAT8 |
string |
TEXT |
binary |
BYTEA |
timestamp (1) |
TIMESTAMP |
Partition pruning
The PXF Hive connector supports Hive partition pruning (filtering) and the Hive partition directory structure.
This enables partition exclusion on selected HDFS files comprising a Hive table, which reduces network traffic and I/O load.
To use partition filtering, run a query on an external table using a WHERE clause that refers to a specific partition column in a partitioned Hive table.
See examples in the Access partitioned heterogeneous data and Access partitioned homogenous data sections.
Partition filtering in PXF is supported for Hive string and integral types:
-
The relational operators
=,<,<=,>,>=, and<>are supported on string types. -
The relational operators
=and<>are supported on integral types. To use partition filtering with Hive integral types, you need to update the Hive configuration as described in the Configure the PXF Hive connector section. -
The
ANDandORlogical operators are supported when used with the relational operators mentioned above. -
The
LIKEstring operator is not supported.
To use PXF partition filtering pushdown, make sure the names of the Hive and the PXF partition field match. Otherwise, pushdown is ignored and filtering is performed on the Greengage DB side.
The PXF Hive connector filters only on partition columns, not on other table attributes. PXF filter pushdown is enabled by default and is supported only for the data types and operators listed above. Learn more about filter pushdown in Filter pushdown in PXF documentation.
When dynamic partitioning is enabled in Hive, a partitioned table may store data in a default partition. Hive creates a default partition when the value of a partitioning column does not match the defined type of the column (for example, when a NULL value is used for any partitioning column).
In this case, query results may differ between Hive and PXF queries.
In Hive, any query that includes a filter on a partition column excludes any data that is stored in the table’s default partition.
Similar to Hive, PXF represents table’s partitioning columns as the columns that are appended to the end of the table.
However, any column value in a default partition is translated to a NULL value.
This means that a Greengage DB query that includes an IS NULL filter on a partitioning column can return different results than the same Hive query.
See Access Hive default partition for a practical example.
Create an external table using the PXF protocol
Hive tables can be backed by one or more files with each file having a unique layout or schema, so the column names in a Greengage DB external table definition must match the Hive table column names. This direct name mapping lets you:
-
Create an external table with columns in a different order than the Hive table.
-
Create an external table containing only a subset of the Hive table columns.
-
Read a Hive table whose backing files have a different number of columns.
To create a Greengage DB external table to read Hive table data, use the following general syntax:
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<hive_db_name>.<hive_table_name>?PROFILE=<profile_name>[&<custom_option>=<value>[...]]')
FORMAT 'CUSTOM|TEXT' (FORMATTER='pxfwritable_import' | delimiter='<delim>');
| Keyword | Value |
|---|---|
<table_name> |
The name of the table to create |
<column_name> |
The name of the column to create |
<data_type> |
The data type of the column |
LIKE <other_table> |
Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy |
<hive_db_name> |
The name of the Hive database.
If omitted, defaults to the Hive database named |
<hive_table_name> |
The name of the Hive table |
PROFILE=<profile_name> |
One of the following profiles is used to read Hive table data: |
<custom_option> |
One of the custom options provided in the |
SERVER=<server_name> |
The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used |
PPD=<boolean> |
Activates predicate pushdown for all queries on this table.
Applies only to the |
VECTORIZE=<boolean> |
For the |
FORMAT ( |
The custom format with the built-in custom formatter function for read operation ( |
FORMAT ( |
The |
Examples
These examples demonstrate how to configure and use the PXF Hive connector for reading Hive table data by using external tables.
Configure the PXF Hive connector
To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:
-
Log in to the Greengage DB master host as
gpadmin. -
Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:
$ mkdir $PXF_BASE/servers/hadoop $ cd $PXF_BASE/servers/hadoop -
Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:
$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml . $ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml . -
If you plan to use PXF filter pushdown with Hive integral types, ensure that the
hive.metastore.integral.jdo.pushdownconfiguration parameter exists and is set totruein the hive-site.xml Hive configuration file. Then copy the hive-site.xml file from the NameNode host of the Hadoop cluster to the current host:$ scp hdfsuser@namenode:/etc/hbase/conf/hive-site.xml . -
Synchronize the server configuration to the Greengage DB cluster hosts:
$ pxf cluster sync
Create a sample dataset
-
In the /tmp directory on the HiveServer host, create the customers.txt text file with comma-separated data fields content:
John Doe,24,152.12 Jane Smith,66,456.72 Bob Brown,12,51.19 Rob Stuart,42,34.81
-
On the HiveClient host, launch the Hive command line:
$ hive -
Create the
customersHive table to store the sample dataset. TheFIELDS TERMINATED BYclause sets the comma character (,) as the field delimiter within a data record (line). TheSTORED AS TEXTFILEclause instructs Hive to create the table in theTEXTFILE(the default) format:CREATE TABLE customers ( name STRING, orders INT, sales DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; -
Load the customers.txt sample data file into the created
customerstable:LOAD DATA LOCAL INPATH '/tmp/customers.txt' INTO TABLE customers; -
Optionally, query the
customerstable:SELECT * FROM customers;The output should look as follows:
+-----------------+-------------------+------------------+ | customers.name | customers.orders | customers.sales | +-----------------+-------------------+------------------+ | John Doe | 24 | 152.12 | | Jane Smith | 66 | 456.72 | | Bob Brown | 12 | 51.19 | | Rob Stuart | 42 | 34.81 | +-----------------+-------------------+------------------+ 4 rows selected (0.079 seconds)
-
You can view a Hive table’s location in HDFS by using the
DESCRIBE EXTENDEDcommand:DESCRIBE EXTENDED customers;In the command output, the table HDFS location is listed in the
locationentry:... location:hdfs://hadoop/apps/hive/warehouse/customers ...
Use hive profile
Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section.
-
On the Greengage DB master host, create an external table that references the
customersHive table. In theLOCATIONclause, specify the PXFhiveprofile and the server configuration. In theFORMATclause, specifypxfwritable_import, which is the built-in custom formatter function for read operations:CREATE EXTERNAL TABLE customers_hive ( name TEXT, orders INT, sales FLOAT ) LOCATION ('pxf://default.customers?PROFILE=hive&SERVER=hadoop') FORMAT 'custom' (FORMATTER='pxfwritable_import'); -
Query the created external table:
SELECT * FROM customers_hive;The output should look as follows:
name | orders | sales ------------+--------+-------- John Doe | 24 | 152.12 Jane Smith | 66 | 456.72 Bob Brown | 12 | 51.19 Rob Stuart | 42 | 34.81 (4 rows)
Use hive:text profile
Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section.
-
On the Greengage DB master host, create an external table that references the
customersHive table. In theLOCATIONclause, specify the PXFhive:textprofile and the server configuration. In theFORMATclause, specifyTEXTas the data format and set the comma character (,) as a delimiter (delimiter):CREATE EXTERNAL TABLE customers_text ( name TEXT, orders INT, sales FLOAT ) LOCATION ('pxf://default.customers?PROFILE=hive:text&SERVER=hadoop') FORMAT 'TEXT' (delimiter=E','); -
Query the created external table:
SELECT * FROM customers_text;The output should look as follows:
name | orders | sales ------------+--------+-------- John Doe | 24 | 152.12 Jane Smith | 66 | 456.72 Bob Brown | 12 | 51.19 Rob Stuart | 42 | 34.81 (4 rows)
Use hive:rc profile
Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section.
-
On the HiveClient host, launch the Hive command line:
$ hive -
Create a Hive table stored in RCFile format:
CREATE TABLE customers_rc ( name STRING, orders INT, sales DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS RCFILE; -
Copy the data from the
customerstable intocustomers_rcto store the dataset in the RCFile format:INSERT INTO TABLE customers_rc SELECT * FROM customers; -
Query the
customers_rcHive table to verify that the data was loaded correctly:SELECT * FROM customers_rc;The output should look as follows:
+--------------------+----------------------+---------------------+ | customers_rc.name | customers_rc.orders | customers_rc.sales | +--------------------+----------------------+---------------------+ | John Doe | 24 | 152.12 | | Jane Smith | 66 | 456.72 | | Bob Brown | 12 | 51.19 | | Rob Stuart | 42 | 34.81 | +--------------------+----------------------+---------------------+ 4 rows selected (0.1 seconds)
-
On the Greengage DB master host, create an external table that references the
customers_rcHive table. In theLOCATIONclause, specify the PXFhive:rcprofile and the server configuration. In theFORMATclause, specifyTEXTas the data format and set the comma character (,) as a delimiter (delimiter):CREATE EXTERNAL TABLE customers_rc ( name TEXT, orders INT, sales FLOAT ) LOCATION ('pxf://default.customers_rc?PROFILE=hive:rc&SERVER=hadoop') FORMAT 'TEXT' (delimiter=E','); -
Query the created external table:
SELECT * FROM customers_rc;The output should look as follows:
name | orders | sales ------------+--------+-------- John Doe | 24 | 152.12 Jane Smith | 66 | 456.72 Bob Brown | 12 | 51.19 Rob Stuart | 42 | 34.81 (4 rows)
Use hive:orc profile
Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section.
-
On the HiveClient host, launch the Hive command line:
$ hive -
Create a Hive table stored in the ORC file format:
CREATE TABLE customers_orc ( name STRING, orders INT, sales DOUBLE ) STORED AS ORC; -
Copy the data from the
customerstable intocustomers_orcto store the dataset in the ORC format:INSERT INTO TABLE customers_orc SELECT * FROM customers; -
Optionally, query the
customers_orctable to verify that the data was loaded successfully:SELECT * FROM customers_orc;The output should look as follows:
+---------------------+-----------------------+----------------------+ | customers_orc.name | customers_orc.orders | customers_orc.sales | +---------------------+-----------------------+----------------------+ | John Doe | 24 | 152.12 | | Jane Smith | 66 | 456.72 | | Bob Brown | 12 | 51.19 | | Rob Stuart | 42 | 34.81 | +---------------------+-----------------------+----------------------+ 4 rows selected (0.095 seconds)
-
On the Greengage DB master host, create an external table that references the
customers_orcHive table. In theLOCATIONclause, specify the PXFhive:orcprofile and the server configuration. In theFORMATclause, specifypxfwritable_import, which is the built-in custom formatter function for read operations:CREATE EXTERNAL TABLE customers_orc ( name TEXT, orders INT, sales FLOAT ) LOCATION ('pxf://default.customers_orc?PROFILE=hive:orc&SERVER=hadoop') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); -
Query the created external table:
SELECT * FROM customers_orc;The output should look as follows:
name | orders | sales ------------+--------+-------- John Doe | 24 | 152.12 Jane Smith | 66 | 456.72 Bob Brown | 12 | 51.19 Rob Stuart | 42 | 34.81 (4 rows)
Access Parquet-format Hive tables
Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section.
-
On the HiveClient host, launch the Hive command line:
$ hive -
Create a Hive table stored in Parquet format:
CREATE TABLE customers_parquet ( name STRING, orders INT, sales DOUBLE ) STORED AS PARQUET; -
Copy the data from the
customerstable intocustomers_parquetto store the dataset in the Parquet format:INSERT INTO TABLE customers_parquet SELECT * FROM customers; -
Optionally, query the
customers_parquettable:SELECT * FROM customers_parquet;The output should look as follows:
+-------------------------+---------------------------+--------------------------+ | customers_parquet.name | customers_parquet.orders | customers_parquet.sales | +-------------------------+---------------------------+--------------------------+ | John Doe | 24 | 152.12 | | Jane Smith | 66 | 456.72 | | Bob Brown | 12 | 51.19 | | Rob Stuart | 42 | 34.81 | +-------------------------+---------------------------+--------------------------+ 4 rows selected (0.1 seconds)
-
On the Greengage DB master host, create an external table that references the
customers_parquetHive table. In theLOCATIONclause, specify the PXFhiveprofile and the server configuration. In theFORMATclause, specifypxfwritable_import, which is the built-in custom formatter function for read operations:CREATE EXTERNAL TABLE customers_parquet ( name TEXT, orders INT, sales FLOAT ) LOCATION ('pxf://default.customers_parquet?PROFILE=hive&SERVER=hadoop') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); -
Query the created external table:
SELECT * FROM customers_parquet;The output should look as follows:
name | orders | sales ------------+--------+-------- John Doe | 24 | 152.12 Jane Smith | 66 | 456.72 Bob Brown | 12 | 51.19 Rob Stuart | 42 | 34.81 (4 rows)
Access Avro-format Hive tables
Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section.
-
On the HiveClient host, launch the Hive command line:
$ hive -
Create a Hive table stored in Avro format:
CREATE TABLE customers_avro ( name STRING, orders INT, sales DOUBLE ) STORED AS AVRO; -
Copy the data from the
customerstable intocustomers_avroto store the dataset in the Avro format:INSERT INTO TABLE customers_avro SELECT * FROM customers; -
Optionally, query the
customers_avrotable:SELECT * FROM customers_avro;The output should look as follows:
+----------------------+------------------------+-----------------------+ | customers_avro.name | customers_avro.orders | customers_avro.sales | +----------------------+------------------------+-----------------------+ | John Doe | 24 | 152.12 | | Jane Smith | 66 | 456.72 | | Bob Brown | 12 | 51.19 | | Rob Stuart | 42 | 34.81 | +----------------------+------------------------+-----------------------+ 4 rows selected (0.198 seconds)
-
On the Greengage DB master host, create an external table that references the
customers_avroHive table. In theLOCATIONclause, specify the PXFhiveprofile and the server configuration. In theFORMATclause, specifypxfwritable_import, which is the built-in custom formatter function for read operations:CREATE EXTERNAL TABLE customers_avro ( name TEXT, orders INT, sales FLOAT ) LOCATION ('pxf://default.customers_avro?PROFILE=hive&SERVER=hadoop') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); -
Query the created external table:
SELECT * FROM customers_avro;The output should look as follows:
name | orders | sales ------------+--------+-------- John Doe | 24 | 152.12 Jane Smith | 66 | 456.72 Bob Brown | 12 | 51.19 Rob Stuart | 42 | 34.81 (4 rows)
Access SequenceFile-format Hive tables
Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section.
-
On the HiveClient host, launch the Hive command line:
$ hive -
Create a Hive table stored in SequenceFile format:
CREATE TABLE customers_seqfile ( name STRING, orders INT, sales DOUBLE ) STORED AS SEQUENCEFILE; -
Copy the data from the
customerstable intocustomers_seqfileto store the dataset in the SequenceFile format:INSERT INTO TABLE customers_seqfile SELECT * FROM customers; -
Optionally, query the
customers_seqfiletable:SELECT * FROM customers_seqfile;The output should look as follows:
+-------------------------+---------------------------+--------------------------+ | customers_seqfile.name | customers_seqfile.orders | customers_seqfile.sales | +-------------------------+---------------------------+--------------------------+ | John Doe | 24 | 152.12 | | Jane Smith | 66 | 456.72 | | Bob Brown | 12 | 51.19 | | Rob Stuart | 42 | 34.81 | +-------------------------+---------------------------+--------------------------+ 4 rows selected (0.086 seconds)
-
On the Greengage DB master host, create an external table that references the
customers_seqfileHive table. In theLOCATIONclause, specify the PXFhiveprofile and the server configuration. In theFORMATclause, specifypxfwritable_import, which is the built-in custom formatter function for read operations:CREATE EXTERNAL TABLE customers_seqfile ( name TEXT, orders INT, sales FLOAT ) LOCATION ('pxf://default.customers_seqfile?PROFILE=hive&SERVER=hadoop') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); -
Query the created external table:
SELECT * FROM customers_seqfile;The output should look as follows:
name | orders | sales ------------+--------+-------- John Doe | 24 | 152.12 Jane Smith | 66 | 456.72 Bob Brown | 12 | 51.19 Rob Stuart | 42 | 34.81 (4 rows)
Read complex data types
-
In the /tmp directory on the Hive host, create the customers_complex.txt text file. When specifying an array field in a Hive table, you must identify the terminator character for each item in the collection as well as each map key. The customers_complex.txt file uses a comma character (
,) to separate field values, the percent symbol (%) to separate collection items, and a colon (:) to terminate map key values:John Doe,1%2%3,tier:gold%subscription:active%referrals:6 Jane Smith,4%5%6,tier:silver%subscription:active Bob Brown,7%8%9,tier:bronze%subscription:inactive%referrals:4 Rob Stuart,0%1%2,tier:gold%subscription:active%referrals:2
-
On the HiveClient host, launch the Hive command line:
$ hive -
Create the
customers_cHive table to store the sample dataset. TheSTORED AS TEXTFILEclause instructs Hive to create the table in theTEXTFILE(the default) format. TheFIELDS TERMINATED BYclause sets the comma character (,) as the field delimiter within a data record (line). TheCOLLECTION ITEMS TERMINATED BYclause sets the percent sign (%) as the collection items (array item, map key/value pair) terminator. TheMAP KEYS TERMINATED BYclause sets the colon character (:) as the terminator for map keys:CREATE TABLE customers_c ( name STRING, metrics ARRAY<int>, status MAP<string, string> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '%' MAP KEYS TERMINATED BY ':' STORED AS TEXTFILE; -
Load the customers_complex.txt sample data file into the created
customers_ctable:LOAD DATA LOCAL INPATH '/tmp/customers_complex.txt' INTO TABLE customers_c; -
Create a Hive table stored in the ORC format:
CREATE TABLE customers_c_orc ( name STRING, metrics ARRAY<int>, status MAP<string, string> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '%' MAP KEYS TERMINATED BY ':' STORED AS ORC; -
Copy the data from the
customers_ctable intocustomers_c_orcto store the dataset in the ORC format:INSERT INTO TABLE customers_c_orc SELECT * FROM customers_c; -
Optionally, query the
customers_c_orcHive table:SELECT * FROM customers_c_orc;The output should look as follows
+-----------------------+--------------------------+-------------------------------------------------------------+ | customers_c_orc.name | customers_c_orc.metrics | customers_c_orc.status | +-----------------------+--------------------------+-------------------------------------------------------------+ | John Doe | [1,2,3] | {"tier":"gold","subscription":"active","referrals":"6"} | | Jane Smith | [4,5,6] | {"tier":"silver","subscription":"active"} | | Bob Brown | [7,8,9] | {"tier":"bronze","subscription":"inactive","referrals":"4"} | | Rob Stuart | [0,1,2] | {"tier":"gold","subscription":"active","referrals":"2"} | +-----------------------+--------------------------+-------------------------------------------------------------+ 4 rows selected (0.082 seconds) -
On the Greengage DB master host, create an external table that references the
customers_c_orcHive table. In theLOCATIONclause, specify the PXFhive:orcprofile and the server configuration. In theFORMATclause, specifypxfwritable_import, which is the built-in custom formatter function for read operations:CREATE EXTERNAL TABLE customers_c_orc ( name TEXT, metrics TEXT, status TEXT ) LOCATION ('pxf://default.customers_c_orc?PROFILE=hive:orc&SERVER=hadoop') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'); -
Query the created external table:
SELECT * FROM customers_c_orc;The output should look as follows, with
metricsandstatusserialized as text strings:name | metrics | status ------------+---------+------------------------------------------------------------- John Doe | [1,2,3] | {"tier":"gold","subscription":"active","referrals":"6"} Jane Smith | [4,5,6] | {"tier":"silver","subscription":"active"} Bob Brown | [7,8,9] | {"tier":"bronze","subscription":"inactive","referrals":"4"} Rob Stuart | [0,1,2] | {"tier":"gold","subscription":"active","referrals":"2"} (4 rows)
Access partitioned homogenous data
-
On the HiveClient host, launch the Hive command line:
$ hive -
Create a Hive table named
customers_partwith two partition columns,stateandcity:CREATE TABLE customers_part ( name STRING, sales DOUBLE ) PARTITIONED BY (state string, city string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; -
Load the sample data into the created table, partitioning it by
stateandcity:INSERT INTO TABLE customers_part PARTITION(state = 'California', city = 'Fresno') VALUES ('John Doe', 150.00); INSERT INTO TABLE customers_part PARTITION(state = 'California', city = 'Sacramento') VALUES ('Jane Smith', 175.50); INSERT INTO TABLE customers_part PARTITION(state = 'Nevada', city = 'Reno') VALUES ('Bob Brown', 200.75); INSERT INTO TABLE customers_part PARTITION(state = 'Nevada', city = 'Las Vegas') VALUES ('Rob Stuart', 180.20); -
Query the
customers_parttable:SELECT * FROM customers_part;The output should look as follows, with partition fields added at the end of each record:
+----------------------+-----------------------+-----------------------+----------------------+ | customers_part.name | customers_part.sales | customers_part.state | customers_part.city | +----------------------+-----------------------+-----------------------+----------------------+ | John Doe | 150.0 | California | Fresno | | Jane Smith | 175.5 | California | Sacramento | | Rob Stuart | 180.2 | Nevada | Las Vegas | | Bob Brown | 200.75 | Nevada | Reno | +----------------------+-----------------------+-----------------------+----------------------+ 4 rows selected (0.127 seconds)
-
Examine the Hive HDFS directory structure for the
customers_parttable:$ hdfs dfs -ls -C -R /apps/hive/warehouse/customers_partThe output should look as follows:
/apps/hive/warehouse/customers_part/state=California /apps/hive/warehouse/customers_part/state=California/city=Fresno /apps/hive/warehouse/customers_part/state=California/city=Fresno/000000_0 /apps/hive/warehouse/customers_part/state=California/city=Sacramento /apps/hive/warehouse/customers_part/state=California/city=Sacramento/000000_0 /apps/hive/warehouse/customers_part/state=Nevada /apps/hive/warehouse/customers_part/state=Nevada/city=Las Vegas /apps/hive/warehouse/customers_part/state=Nevada/city=Las Vegas/000000_0 /apps/hive/warehouse/customers_part/state=Nevada/city=Reno /apps/hive/warehouse/customers_part/state=Nevada/city=Reno/000000_0
-
On the Greengage DB master host, create an external table that references the
customers_partHive table. To take advantage of partition filter pushdown, define thestateandcitycolumns corresponding to the Hive partition fields at the end of the columns list. In theLOCATIONclause, specify the PXFhiveprofile and the server configuration. In theFORMATclause, specifypxfwritable_import, which is the built-in custom formatter function for read operations:CREATE EXTERNAL TABLE customers_part ( name TEXT, sales FLOAT, state TEXT, city TEXT ) LOCATION ('pxf://default.customers_part?PROFILE=hive&SERVER=hadoop') FORMAT 'custom' (FORMATTER='pxfwritable_import'); -
Query the created external table:
SELECT * FROM customers_part;The output should look as follows:
name | sales | state | city ------------+--------+------------+------------ Jane Smith | 175.5 | California | Sacramento Rob Stuart | 180.2 | Nevada | Las Vegas John Doe | 150 | California | Fresno Bob Brown | 200.75 | Nevada | Reno (4 rows)
-
Query the
customers_partexternal table to return the records where thestatevalue isNevadaand thesalesvalue is greater than 170. The filter onsalesis not pushed down, since it is not a partition column. It is performed on the Greengage DB side after all the data in theNevadapartition is transferred for processing:SELECT * FROM customers_part where state = 'Nevada' and sales > 170;The output should look as follows:
name | sales | state | city ------------+--------+--------+----------- Rob Stuart | 180.2 | Nevada | Las Vegas Bob Brown | 200.75 | Nevada | Reno (2 rows)
-
Query the
customers_partexternal table to return the records where thestatevalue isCalifornia. The filter onstateis pushed down, since it is a partition column. It is performed on the Hive side before the data in theCaliforniapartition is transferred for processing:SELECT * FROM customers_part where state = 'California';The output should look as follows:
name | sales | state | city ------------+-------+------------+------------ John Doe | 150 | California | Fresno Jane Smith | 175.5 | California | Sacramento (2 rows)
Access partitioned heterogeneous data
Before you begin, make sure you have created the sample dataset as described in the Create a sample dataset section and its RCfile copy as described in the Use hive:rc profile section.
-
On the HiveClient host, launch the Hive command line:
$ hive -
Create a Hive external table that will comprise the HDFS data files associated with the
customersandcustomers_rcHive tables. The table is partitioned by theyearstring field:CREATE EXTERNAL TABLE customers_part_m ( name STRING, orders INT, sales DOUBLE ) PARTITIONED BY (year string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; -
Run the
DESCRIBE EXTENDEDcommands on thecustomersandcustomers_rctables to view the HDFS file location for each table:DESCRIBE EXTENDED customers; DESCRIBE EXTENDED customers_rc;In the command output, the table HDFS location is listed in the
locationentry:... location:hdfs://hadoop/apps/hive/warehouse/customers location:hdfs://hadoop/apps/hive/warehouse/customers_rc ...
-
In the
customers_part_mHDFS table, create partitions referring to the file locations associated with thecustomersandcustomers_rctables:ALTER TABLE customers_part_m ADD PARTITION (year = '2025') LOCATION 'hdfs://hadoop/apps/hive/warehouse/customers'; ALTER TABLE customers_part_m ADD PARTITION (year = '2026') LOCATION 'hdfs://hadoop/apps/hive/warehouse/customers_rc'; -
The
customerstable has no storage format specified, so the defaultTEXTFILEis used. For thecustomers_rctable, explicitly specify the file format associated with the partition:ALTER TABLE customers_part_m PARTITION (year='2026') SET FILEFORMAT RCFILE; -
View the created partitions:
SHOW PARTITIONS customers_part_m;The output should look as follows:
+------------+ | partition | +------------+ | year=2025 | | year=2026 | +------------+ 2 rows selected (0.044 seconds)
-
Query the created external Hive table:
SELECT * FROM customers_part_m;The output should look as follows:
+------------------------+--------------------------+-------------------------+------------------------+ | customers_part_m.name | customers_part_m.orders | customers_part_m.sales | customers_part_m.year | +------------------------+--------------------------+-------------------------+------------------------+ | John Doe | 24 | 152.12 | 2025 | | Jane Smith | 66 | 456.72 | 2025 | | Bob Brown | 12 | 51.19 | 2025 | | Rob Stuart | 42 | 34.81 | 2025 | | John Doe | 24 | 152.12 | 2026 | | Jane Smith | 66 | 456.72 | 2026 | | Bob Brown | 12 | 51.19 | 2026 | | Rob Stuart | 42 | 34.81 | 2026 | +------------------------+--------------------------+-------------------------+------------------------+ 8 rows selected (0.135 seconds)
-
On the Greengage DB master host, create an external table that references the
customers_part_mHive table. In theLOCATIONclause, specify the PXFhiveprofile and the server configuration. In theFORMATclause, specifypxfwritable_import, which is the built-in custom formatter function for read operations:CREATE EXTERNAL TABLE customers_part_m ( name TEXT, orders INT, sales FLOAT, year TEXT ) LOCATION ('pxf://default.customers_part_m?PROFILE=hive&SERVER=hadoop') FORMAT 'custom' (FORMATTER='pxfwritable_import'); -
Query the created external table:
SELECT * FROM customers_part_m;The output should look as follows:
name | orders | sales | year ------------+--------+--------+------ John Doe | 24 | 152.12 | 2026 Jane Smith | 66 | 456.72 | 2026 Bob Brown | 12 | 51.19 | 2026 Rob Stuart | 42 | 34.81 | 2026 John Doe | 24 | 152.12 | 2025 Jane Smith | 66 | 456.72 | 2025 Bob Brown | 12 | 51.19 | 2025 Rob Stuart | 42 | 34.81 | 2025 (8 rows)
-
Query the external table to return the sum of the
salescolumn values for theyearvalue of2025:SELECT sum(sales) FROM customers_part_m where year = '2025';The output should look as follows:
sum -------- 694.84 (1 row)
Access Hive default partition
-
On the HiveClient host, launch the Hive command line:
$ hive -
Create a Hive table named
customers_part_dpartitioned by theyearstring field:CREATE TABLE customers_part_d ( name STRING, orders INT, sales DOUBLE ) PARTITIONED BY (year string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; -
Load the sample data into the created table. Notice that the
PARTITIONclause is omitted and no partitioning column value is provided for the third row:INSERT INTO TABLE customers_part_d PARTITION(year = 2025) VALUES ('John Doe', 30, 150.00); INSERT INTO TABLE customers_part_d PARTITION(year = 2025) VALUES ('Jane Smith', 28, 175.50); INSERT INTO TABLE customers_part_d VALUES ('Bob Brown', 42, 200.75, NULL); INSERT INTO TABLE customers_part_d PARTITION(year = 2025) VALUES ('Rob Stuart', 35, 180.20); -
Query the
customers_part_dtable:SELECT * FROM customers_part_d;The output should look as follows. Notice that
__HIVE_DEFAULT_PARTITION__is returned for the NULL value in theyearpartitioning column:+------------------------+--------------------------+-------------------------+-----------------------------+ | customers_part_d.name | customers_part_d.orders | customers_part_d.sales | customers_part_d.year | +------------------------+--------------------------+-------------------------+-----------------------------+ | John Doe | 30 | 150.0 | 2025 | | Jane Smith | 28 | 175.5 | 2025 | | Rob Stuart | 35 | 180.2 | 2025 | | Bob Brown | 42 | 200.75 | __HIVE_DEFAULT_PARTITION__ | +------------------------+--------------------------+-------------------------+-----------------------------+ 4 rows selected (0.112 seconds)
-
Query the
customers_part_dtable filtering on theyearpartition column:SELECT * FROM customers_part_d WHERE year IS NULL;The output should contain no rows, since the data stored in the table’s default partition is excluded:
+------------------------+--------------------------+-------------------------+------------------------+ | customers_part_d.name | customers_part_d.orders | customers_part_d.sales | customers_part_d.year | +------------------------+--------------------------+-------------------------+------------------------+ +------------------------+--------------------------+-------------------------+------------------------+ No rows selected (0.165 seconds)
-
On the Greengage DB master host, create an external table that references the
customers_part_dHive table. In theLOCATIONclause, specify the PXFhiveprofile and the server configuration. In theFORMATclause, specifypxfwritable_import, which is the built-in custom formatter function for read operations:CREATE EXTERNAL TABLE customers_part_d ( name TEXT, orders INT, sales FLOAT, year TEXT ) LOCATION ('pxf://default.customers_part_d?PROFILE=hive&SERVER=hadoop') FORMAT 'custom' (FORMATTER='pxfwritable_import'); -
Query the created external table:
SELECT * FROM customers_part_d;The output should look as follows:
name | orders | sales | year ------------+--------+--------+------ Jane Smith | 28 | 175.5 | 2025 Bob Brown | 42 | 200.75 | Rob Stuart | 35 | 180.2 | 2025 John Doe | 30 | 150 | 2025 (4 rows)
-
Query the created external table filtering on the
yearpartition column.SELECT * FROM customers_part_d WHERE year IS NULL;Since all default partition values are translated into actual NULL values, the query returns a single row:
name | orders | sales | year -----------+--------+--------+------ Bob Brown | 42 | 200.75 | (1 row)