
Use PXF HDFS connector to read and write text data between Greengage DB and HDFS

Anton Monakov

These examples demonstrate how to configure and use the PXF HDFS connector for reading and writing HDFS text data by using external tables.

Prerequisites

To try out the practical examples, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the customers test database and connect to it:

DROP DATABASE IF EXISTS customers;
CREATE DATABASE customers;
\c customers

To be able to create an external table using the PXF protocol, enable the PXF extension in the database as described in Register PXF in a database in PXF documentation:

CREATE EXTENSION pxf;
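
To confirm that the extension is registered in the current database, you can query the system catalog (psql's \dx meta-command shows the same information):

```sql
-- Returns one row if the pxf extension is enabled in this database
SELECT extname, extversion FROM pg_extension WHERE extname = 'pxf';
```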

Configure the PXF HDFS connector

To have PXF connect to HDFS, you need to create a Hadoop server configuration as described in Configure PXF Hadoop connectors in PXF documentation and then synchronize it to the Greengage DB cluster:

  1. Log in to the Greengage DB master host as gpadmin.

  2. Go to the $PXF_BASE/servers directory and create a Hadoop server configuration directory named hadoop:

    $ mkdir $PXF_BASE/servers/hadoop
    $ cd $PXF_BASE/servers/hadoop
  3. Copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml Hadoop configuration files from the NameNode host of the Hadoop cluster to the current host:

    $ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
    $ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .
  4. Synchronize the PXF server configuration to the Greengage DB cluster:

    $ pxf cluster sync

Create an external table using the PXF protocol

To create a Greengage DB external table that references a text file or directory on HDFS, use the following general syntax:

CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )

    LOCATION ('pxf://<path-to-hdfs>?PROFILE=<profile_name>[&<custom_option>[...]]')
    FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>')
    [DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
Keyword Description

<path-to-hdfs>

The path to the directory or file in the HDFS data store. When the <server_name> configuration includes a pxf.fs.basePath property setting, the <path-to-hdfs> value is considered relative to the specified base path; otherwise, the path is considered absolute. The value must not include the dollar sign ($) character.

PXF supports reading and writing text files compressed with the default, bzip2, and gzip codecs. For readable external tables, decompression is performed automatically. For writable external tables, you can specify the compression codec and type using the COMPRESSION_CODEC and COMPRESSION_TYPE custom options described below.

PROFILE=<profile_name>

Use the hdfs:text profile when reading delimited plain text, hdfs:csv when reading CSV data where each row is a single record, and hdfs:text:multi when reading multiline text data.

FORMAT <value>

The data format: TEXT when referencing plain text delimited data, or CSV when referencing comma-separated value data.

delimiter

The delimiter character in the data. For the CSV format, the default <delim_value> is a comma (,). You can use an E'' escape string constant, for example delimiter=E'\t'.

Note that the HEADER option, which commonly designates whether the data file contains a header row, is not supported for external tables that use PXF. If a text file includes header lines, use the SKIP_HEADER_COUNT custom option to specify the number of lines to skip at the beginning of each file.

DISTRIBUTED BY

When loading data from a Greengage DB table into a writable external table, consider specifying the same distribution policy or column name on both tables. This avoids extra data motion between segments during the load operation. Learn more about table distribution in Distribution.

<custom_option>

One of the custom options provided in the LOCATION string, as described below.

SERVER=<server_name>

The named server configuration that PXF uses to access the data. If not specified, the default PXF server is used.

IGNORE_MISSING_PATH=<boolean>

The action to take when <path-to-hdfs> is missing or invalid. If set to false (the default), an error is returned. If set to true, PXF ignores missing-path errors and returns an empty fragment.

SKIP_HEADER_COUNT=<numlines>

The number of header lines to skip at the beginning of each HDFS file before reading the data. The default value is 0 (no lines are skipped).

COMPRESSION_CODEC

The compression codec to use when writing data: default, bzip2, gzip, or uncompressed (no compression). If not provided, or if uncompressed is provided, no data compression is performed.

COMPRESSION_TYPE

The compression type to use when writing data: record (RECORD, the default) or block (BLOCK).
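
To illustrate the pxf.fs.basePath behavior described above: assuming a server configuration named hadoop sets pxf.fs.basePath to /tmp in its pxf-site.xml (a hypothetical setup, not part of the examples below), the LOCATION path is resolved relative to that base path:

```sql
-- Assumes pxf.fs.basePath=/tmp in the hadoop server configuration;
-- this LOCATION then resolves to /tmp/pxf_examples/customers.txt.gz on HDFS
CREATE EXTERNAL TABLE customers_rel (
    id INTEGER,
    first_name TEXT,
    last_name TEXT,
    email TEXT,
    address TEXT
)
    LOCATION ('pxf://pxf_examples/customers.txt.gz?PROFILE=hdfs:text&SERVER=hadoop&SKIP_HEADER_COUNT=1')
    FORMAT 'TEXT' (delimiter=E',');
```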

Read regular text data from HDFS

  1. In the /tmp directory on the HDFS host, create a delimited plain text data file named customers.txt with the following content:

    id,first_name,last_name,email,address
    1,John,Doe,john.doe@example.com,123 Elm Street
    2,Jane,Smith,jane.smith@example.com,456 Oak Street
    3,Bob,Brown,bob.brown@example.com,789 Pine Street
    4,Rob,Stuart,rob.stuart@example.com,119 Willow Street
  2. Compress the created file:

    $ gzip /tmp/customers.txt
  3. Create the pxf_examples HDFS directory for storing PXF example data files and add the customers.txt.gz file to HDFS:

    $ hdfs dfs -mkdir -p /tmp/pxf_examples
    $ hdfs dfs -put /tmp/customers.txt.gz /tmp/pxf_examples/
  4. Optionally, review the contents of the customers.txt.gz file stored in HDFS:

    $ hdfs dfs -text /tmp/pxf_examples/customers.txt.gz
  5. On the Greengage DB master host, create a readable external table using the PXF protocol that references the customers.txt.gz file. In the LOCATION clause, specify the PXF hdfs:text profile and the server configuration, and use the SKIP_HEADER_COUNT option to skip the header row. In the FORMAT clause, set TEXT as the data format and the comma character (,) as the delimiter (delimiter):

    CREATE EXTERNAL TABLE customers_plain (
        id INTEGER,
        first_name VARCHAR(50),
        last_name VARCHAR(50),
        email VARCHAR(100),
        address VARCHAR(255)
    )
        LOCATION ('pxf://tmp/pxf_examples/customers.txt.gz?PROFILE=hdfs:text&SERVER=hadoop&SKIP_HEADER_COUNT=1')
        FORMAT 'TEXT' (delimiter=E',');
  6. Query the external table:

    SELECT * FROM customers_plain;

    The output should look as follows:

     id | first_name | last_name |         email          |      address
    ----+------------+-----------+------------------------+-------------------
      1 | John       | Doe       | john.doe@example.com   | 123 Elm Street
      2 | Jane       | Smith     | jane.smith@example.com | 456 Oak Street
      3 | Bob        | Brown     | bob.brown@example.com  | 789 Pine Street
      4 | Rob        | Stuart    | rob.stuart@example.com | 119 Willow Street
    (4 rows)
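Because customers.txt.gz is comma-delimited with one record per row, the same file could also be read with the hdfs:csv profile mentioned above. This is a sketch under the same assumptions as the previous example (the hadoop server configuration and the file uploaded to /tmp/pxf_examples); the CSV format uses a comma delimiter by default:

```sql
-- Alternative to customers_plain: the hdfs:csv profile with the CSV format
CREATE EXTERNAL TABLE customers_csv (
    id INTEGER,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(100),
    address VARCHAR(255)
)
    LOCATION ('pxf://tmp/pxf_examples/customers.txt.gz?PROFILE=hdfs:csv&SERVER=hadoop&SKIP_HEADER_COUNT=1')
    FORMAT 'CSV';
```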

Read multiline text data from HDFS

  1. In the /tmp directory on the HDFS host, create a delimited plain text data file named customers_multiline.txt with the following content. Notice the use of the colon (:) to separate the fields, and the quotes around the address field: this field includes an embedded line feed separating the street address from the city:

    id:first_name:last_name:email:address
    1:John:Doe:john.doe@example.com:"123 Elm Street
    New York"
    2:Jane:Smith:jane.smith@example.com:"456 Oak Street
    Chicago"
    3:Bob:Brown:bob.brown@example.com:"789 Pine Street
    Los Angeles"
    4:Rob:Stuart:rob.stuart@example.com:"119 Willow Street
    New Orleans"
  2. Copy the text file to HDFS:

    $ hdfs dfs -put /tmp/customers_multiline.txt /tmp/pxf_examples/
  3. On the Greengage DB master host, create a readable external table using the PXF protocol that references the customers_multiline.txt file. In the LOCATION clause, specify the PXF hdfs:text:multi profile and the server configuration, and use the SKIP_HEADER_COUNT option to skip the header row. In the FORMAT clause, set CSV as the data format and the colon character (:) as the delimiter (delimiter):

    CREATE EXTERNAL TABLE customers_multiline (
        id INTEGER,
        first_name TEXT,
        last_name TEXT,
        email TEXT,
        address TEXT
    )
        LOCATION ('pxf://tmp/pxf_examples/customers_multiline.txt?PROFILE=hdfs:text:multi&SERVER=hadoop&SKIP_HEADER_COUNT=1')
        FORMAT 'CSV' (delimiter ':');
  4. Query the created external table:

    SELECT * FROM customers_multiline;

    The output should look as follows:

     id | first_name | last_name |         email          |      address
    ----+------------+-----------+------------------------+-------------------
      1 | John       | Doe       | john.doe@example.com   | 123 Elm Street
        |            |           |                        | New York
      2 | Jane       | Smith     | jane.smith@example.com | 456 Oak Street
        |            |           |                        | Chicago
      3 | Bob        | Brown     | bob.brown@example.com  | 789 Pine Street
        |            |           |                        | Los Angeles
      4 | Rob        | Stuart    | rob.stuart@example.com | 119 Willow Street
        |            |           |                        | New Orleans
    (4 rows)

Write text data to HDFS

  1. On the Greengage DB master host, create a writable external table that writes data to the /tmp/pxf_examples/customers HDFS directory. In the LOCATION clause, specify the PXF hdfs:text profile and the server configuration and set the compression codec to gzip. In the FORMAT clause, specify TEXT as the data format and set the comma character (,) as a delimiter (delimiter):

    CREATE WRITABLE EXTERNAL TABLE customers_w (
        id INTEGER,
        first_name TEXT,
        last_name TEXT,
        email TEXT,
        address TEXT
    )
        LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:text&SERVER=hadoop&COMPRESSION_CODEC=gzip')
        FORMAT 'TEXT' (delimiter ',');
  2. Insert the contents of the customers_plain readable external table created earlier into the customers_w table:

    INSERT INTO customers_w SELECT * FROM customers_plain;
  3. Manually insert several values into the customers_w table:

    INSERT INTO customers_w (
         id,
         first_name,
         last_name,
         email,
         address
     ) 
    VALUES (5,'Alice','Johnson','alice.johnson@example.com','10 Oak Avenue'),
           (6,'Charlie','Williams','charlie.williams@example.com','42 Maple Drive');
  4. On the HDFS host, view the contents of the files created in the customers directory:

    $ hdfs dfs -text /tmp/pxf_examples/customers/*

    The output should look as follows:

    1,John,Doe,john.doe@example.com,123 Elm Street
    2,Jane,Smith,jane.smith@example.com,456 Oak Street
    3,Bob,Brown,bob.brown@example.com,789 Pine Street
    4,Rob,Stuart,rob.stuart@example.com,119 Willow Street
    5,Alice,Johnson,alice.johnson@example.com,10 Oak Avenue
    6,Charlie,Williams,charlie.williams@example.com,42 Maple Drive
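Note that a writable external table cannot be queried directly. To read the written files back from within the database, you can create a readable external table over the same HDFS directory; this is a sketch assuming the same hadoop server configuration (PXF decompresses the gzip files automatically on read):

```sql
-- Readable counterpart of customers_w over the same HDFS directory
CREATE EXTERNAL TABLE customers_r (
    id INTEGER,
    first_name TEXT,
    last_name TEXT,
    email TEXT,
    address TEXT
)
    LOCATION ('pxf://tmp/pxf_examples/customers?PROFILE=hdfs:text&SERVER=hadoop')
    FORMAT 'TEXT' (delimiter ',');

SELECT * FROM customers_r ORDER BY id;
```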