Hello, I’m DocuDroid!
Submitting feedback
Thank you for rating our AI Search!
We would be grateful if you could share your thoughts so we can improve our AI Search for you and other readers.
GitHub

Use PXF HDFS connector to read and write fixed-width text between HDFS and Greengage DB

Anton Monakov

This topic describes how to configure and use the PXF HDFS connector for reading and writing fixed-width text data stored in HDFS by using external tables and provides practical examples.

To learn more about the fixed-width data formatter options in Greengage DB, see Format fixed-width data.

Create an external table using the PXF protocol

To create a Greengage DB external table to read or write fixed-width text data in HDFS, use the following general syntax:

CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )

    LOCATION ('pxf://<path_to_data>?PROFILE=hdfs:fixedwidth[&<custom_option>=<value>[...]]')
    FORMAT 'CUSTOM' (FORMATTER='fixedwidth_in | fixedwidth_out',
        <field_name>='<width>' [, ...]
        [, line_delim[=|<space>][E]'<delim_value>'])
    [DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
Keyword Value

<table_name>

The name of the table to create

<column_name>

The name of the column to create

<data_type>

The data type of the column

LIKE <other_table>

Specifies a table from which the new external table automatically copies all column names, data types, and distribution policy

<path_to_data>

The path to the directory or file in the HDFS data store. When the <server_name> configuration includes a pxf.fs.basePath property setting, the <path_to_data> value is considered to be relative to the base path specified. Otherwise, the path is considered absolute. The value must not include the dollar sign ($) character

PROFILE=hdfs:fixedwidth

The hdfs:fixedwidth profile is used to read or write fixed-width text data in HDFS

FORMAT 'CUSTOM'

The custom format with the built-in custom formatter functions for read (fixedwidth_in) and write (fixedwidth_out) operations is used to work with fixed-width text data in HDFS

<field_name>='<width>'

The name and the width of the field in characters. Fields must be listed in their physical order. The field names must match the columns listed in the CREATE EXTERNAL TABLE command.

When reading data, if the field value is less than the <width> value, Greengage DB expects the field to be right-padded with spaces to that size. When writing data, if the field value is less than the <width> value, Greengage DB right-pads the field with spaces up to the <width> value

line_delim

The line delimiter character in the data, \n (LF) by default. If the option is provided and contains \r (CR), \r\n (CRLF), or a set of custom escape characters, you must also specify the NEWLINE option and set its value to CR, CRLF or the set of bytecode characters, respectively

DISTRIBUTED BY

When loading data from a Greengage DB table into a writable external table, consider specifying the same distribution policy or column name on both tables. This will avoid extra motion of data between segments on the load operation. Learn more about table distribution in Distribution

<custom_option>

One of the custom options provided in the LOCATION string as described below

SERVER=<server_name>

The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used

NEWLINE

When the line_delim formatter option contains \r, \r\n, or a set of custom escape characters, you must set NEWLINE to CR, CRLF, or the set of bytecode characters, respectively

COMPRESSION_CODEC

The compression codec to use when writing data: default, bzip2, gzip, or uncompressed (no compression). If not provided (or uncompressed is provided), no data compression is performed

IGNORE_MISSING_PATH

The action to take when <path_to_data> is missing or invalid. If set to false (default), an error is returned. If set to true, PXF ignores missing path errors and returns an empty fragment

Examples

These examples demonstrate how to configure and use the PXF HDFS connector for reading and writing fixed-width text data in HDFS by using external tables.

Prerequisites

To try out the practical examples, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the customers test database and connect to it:

DROP DATABASE IF EXISTS customers;
CREATE DATABASE customers;
\c customers

To be able to create an external table using the PXF protocol, enable the PXF extension in the database as described in Register PXF in a database in PXF documentation:

CREATE EXTENSION pxf;

Configure the PXF HDFS connector

To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:

  1. Log in to the Greengage DB master host as gpadmin.

  2. Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:

    $ mkdir $PXF_BASE/servers/hadoop
    $ cd $PXF_BASE/servers/hadoop
  3. Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:

    $ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .
  4. Synchronize the server configuration to the Greengage DB cluster hosts:

    $ pxf cluster sync

Create a readable external table

  1. In the /tmp directory on the HDFS host, create a file named customers.txt. In the sample file, the first field is 3 characters long, the second — 15 characters long, the third — 25 characters long, and the last — 20 characters long. Notice the fields are right-padded with spaces to the required length:

    1  John Doe       john.doe@example.com     123 Elm Street
    2  Jane Smith     jane.smith@example.com   456 Oak Street
    3  Bob Brown      bob.brown@example.com    789 Pine Street
    4  Rob Stuart     rob.stuart@example.com   119 Willow Street
  2. Create the /tmp/pxf_examples HDFS directory for storing PXF example data files and add the customers.txt file to HDFS:

    $ hdfs dfs -mkdir -p /tmp/pxf_examples
    $ hdfs dfs -put /tmp/customers.txt /tmp/pxf_examples/
  3. On the Greengage DB master host, create an external table that references the customers.txt file. In the LOCATION clause, specify the PXF hdfs:fixedwidth profile and the server configuration. In the FORMAT clause, specify the fixedwidth_in built-in formatter function for reading data and enumerate the data fields with their respective lengths:

    CREATE EXTERNAL TABLE customers_r (
        id INTEGER,
        name TEXT,
        email TEXT,
        address TEXT
        )
        
        LOCATION ('pxf://tmp/pxf_examples/customers.txt?PROFILE=hdfs:fixedwidth&SERVER=hadoop')
        FORMAT 'CUSTOM' (
            FORMATTER='fixedwidth_in', 
            id='3', 
            name='15', 
            email='25',
            address='20');
  4. Query the created external table:

    SELECT * FROM customers_r;

    The output should look as follows:

     id |    name    |         email          |      address
    ----+------------+------------------------+-------------------
      1 | John Doe   | john.doe@example.com   | 123 Elm Street
      2 | Jane Smith | jane.smith@example.com | 456 Oak Street
      3 | Bob Brown  | bob.brown@example.com  | 789 Pine Street
      4 | Rob Stuart | rob.stuart@example.com | 119 Willow Street
    (4 rows)

Create a writable external table

  1. On the Greengage DB master host, create the writable external table that stores data into the /tmp/pxf_examples/customers HDFS directory. In the LOCATION clause, specify the PXF hdfs:fixedwidth profile and the server configuration. In the FORMAT clause, specify the fixedwidth_out built-in formatter function for writing data and enumerate the data fields with their respective lengths:

    CREATE WRITABLE EXTERNAL TABLE customers_w (
        id INTEGER,
        name TEXT,
        email TEXT,
        address TEXT
        )
        
        LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:fixedwidth&SERVER=hadoop')
        FORMAT 'CUSTOM' (
                FORMATTER='fixedwidth_out', 
                id='3', 
                name='15', 
                email='25',
                address='20');
  2. Insert some data into the customers_w table:

    INSERT INTO customers_w (
         id,
         name,
         email,
         address
         ) 
    VALUES (5, 'Alice Price', 'alice.price@example.com', '234 Maple Avenue'),
           (6, 'David Lee', 'david.lee@example.com', '567 Birch Lane'),
           (7, 'Emily Wilson', 'emily.wilson@example.com', '890 Cedar Court'),
           (8, 'Kevin Garcia', 'kevin.garcia@example.com', '123 Spruce Drive');
  3. On the HDFS host, view the contents of the files created in the /tmp/pxf_examples/customers directory:

    $ hdfs dfs -cat /tmp/pxf_examples/customers/*

    The output should look as follows:

    8  Kevin Garcia   kevin.garcia@example.com 123 Spruce Drive
    5  Alice Price    alice.price@example.com  234 Maple Avenue
    6  David Lee      david.lee@example.com    567 Birch Lane
    7  Emily Wilson   emily.wilson@example.com 890 Cedar Court