Transform external data

Anton Monakov

Contents

Transformation configuration file
Transfer the data
- Load data with gpfdist
- Load data with gpload
Examples
- Load data
- Unload data

With data transformations, you can read or write external data files stored in formats that Greengage DB doesn’t support natively. An input transformation reads a file in a foreign data format and outputs rows to gpfdist in the format specified in the external table’s FORMAT clause (TEXT or CSV). An output transformation receives rows from gpfdist in text format and converts them to a foreign data format. Data transformation is supported when working with the gpfdist and gpload utilities. One of the use cases requiring transformation is accessing data stored in external structured files (such as XML or JSON) from within the database. See Examples for detailed guides on loading and unloading data in XML format.

The general workflow for setting up a transformation project looks as follows:

Determine the transformation schema: first establish the project’s objective (for example, indexing, analysis, or data combination). Then, examine the source files, their structure, and elements' names. Finally, select which elements to import.
Write a transformation that specifies what to extract from the data. You can use any authoring environment and language appropriate for your project. For example, an XML transformation can be performed via the technologies such as XSLT, Java, or Python, based on the goals and scope of the project. A transformation is provided as an executable command that gpfdist calls with the name of the source data file.
Write the gpfdist transformation configuration file. The file is passed to gpfdist on the command line and contains the rules that gpfdist uses to select a transformation to apply when loading or extracting data.
Transfer the data by using either gpfdist or gpload, which automates certain loading tasks. Learn more in Transfer the data.

The gpfdist transformation configuration file defines the transformation parameters and uses the YAML 1.1 document format. The configuration file is processed in specified order; indentation (spaces) determines the document hierarchy and relationships between the sections.

NOTE

The use of whitespace in a YAML document is significant. Whitespace should not be used for formatting purposes, and tabs should not be used at all.

The basic configuration file structure looks as follows:

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  transformation_name1:
    TYPE:      input | output
    COMMAND:   command
    SAFE:      posix-regex
  transformation_name2:
    TYPE:      input | output
    COMMAND:   command
...

Keys
Key	Description	Required
VERSION	The version of the `gpfdist` configuration file schema. The current version is `1.0.0.1`	Yes
TRANSFORMATIONS	Begins the transformation specification section. A configuration file must have at least one transformation defined. Upon receiving a transformation request, `gpfdist` scans this section for an entry with the matching transformation name	Yes
transformation_name1, transformation_name2, …	The name of the transformation, which is then used when configuring data transfer	Yes
TYPE	The transformation direction, which can be `input` or `output`: `input` — the standard output of the transformation process is treated as a stream of records to load into Greengage DB. `output` — the standard input of the transformation process is treated as a stream of records from Greengage DB to transform and write to the appropriate output.	Yes
COMMAND	The command that `gpfdist` runs to perform the transformation. For input transformations, the command is expected to open the source data files and produce the text, where each line is a single row to load into Greengage DB. The input transformation determines whether the entire content should be converted to one row or to multiple rows. For output transformations, the command is expected to open and write to the target files as appropriate. The output transformation determines the final placement of the converted output.	Yes
SAFE	A POSIX regular expression that the paths to data files must match to be passed to the transformation. Specify `SAFE` in case there is a concern about injection or improper interpretation of paths passed to the command. By default, no restrictions on paths apply	No

To load the data into or out of the database, use one of the following approaches:

gpfdist supports both input and output transformations but exposes the details otherwise automated by gpload.
gpload supports only input transformations but may be easier to implement.

Transforming with gpfdist is performed as follows:

Start gpfdist and use the -c option to provide the transformation configuration file:
```
$ gpfdist -c config.yaml
```
Create an external table. In the CREATE EXTERNAL TABLE command’s LOCATION clause, specify the name of the host gpfdist runs on (<hostname>), the port it listens on (<port>), and the source data file (<filename>). Use the #transform parameter to specify the transformation name (<transformation_name>). If multiple gpfdist servers are used, make sure to list them all and provide the transformation name for each of them.
```
CREATE READABLE EXTERNAL TABLE <external_table_name>
    LOCATION ('gpfdist://<hostname>:<port>/<filename>#transform=<transformation_name>'
               [, ...]
             )
```
Use the INSERT command to insert the loaded data into the target table (<table_name>):
```
INSERT INTO <table_name> SELECT * FROM <external_table_name>;
```

TIP

You can start gpfdist with the -I or -O options that specify the default input or output transformation, respectively, for example:

$ gpfdist -c config.yaml -I <input_transformation_name>

The selected transformation is applied to all affected files. Using the #transform parameter on the table level overrides the -I or -O options.

The gpload utility uses the gpfdist file server and a YAML-formatted control file to orchestrate the data loading operation and automate the following tasks:

Creating a readable external table in the database.
Starting gpfdist instances with the configuration file that specifies the transformation.
Running the INSERT INTO <table_name> SELECT FROM <external_table_name> command to load the data.
After the data is loaded, removing the external table definition.

gpload only supports input transformations. To configure the transformation, you need to specify the TRANSFORM and TRANSFORM_CONFIG values in the INPUT section of the gpload control file:

---
VERSION: 1.0.0.1
GPLOAD:
  INPUT:
    - TRANSFORM_CONFIG: <gpfdist_configuration_file>
    - TRANSFORM: <transformation_name>
    - SOURCE:
        FILE: <filename>

where:

TRANSFORM_CONFIG specifies the name of the gpfdist transformation configuration file.
TRANSFORM specifies the name of the transformation that is provided in the gpfdist configuration file.

NOTE

The transformation name must appear in two places: in the TRANSFORM setting of the gpload control file and in the TRANSFORMATIONS section of the gpfdist configuration file (the file specified in the TRANSFORM_CONFIG section).

Having defined the control file (for example, load.yml), pass it to gpload by using the -f option as follows:

$ gpload -f load.yml

These examples demonstrate loading XML data into the database table, and unloading it from the database table. Transformation between the XML and the text format is implemented in XSLT and Python.

Prerequisites

To try out the practical examples listed in this section, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the customers test database, and connect to it:

DROP DATABASE IF EXISTS customers;
CREATE DATABASE customers;
\c customers

Then create the customers table:

CREATE TABLE customers
(
    id INTEGER,
    name VARCHAR(50),
    email VARCHAR(100),
    address VARCHAR(255)
);

To be able to run XSLT-based transformations, install the xsltproc package:

Ubuntu

$ sudo apt-get update && sudo apt-get install xsltproc

CentOS

$ sudo yum update && sudo yum install libxslt

This example demonstrates transforming the source data with XSLT or Python and loading it into the database table via gpfdist or gpload.

In the /tmp directory on the master host, create the customers.xml file containing the source data to load:

<customers>
    <customer id="1">
        <first_name>John</first_name>
        <last_name>Doe</last_name>
        <email>john.doe@example.com</email>
        <address>123 Elm Street</address>
    </customer>
    <customer id="2">
        <first_name>Jane</first_name>
        <last_name>Smith</last_name>
        <email>jane.smith@example.com</email>
        <address>456 Oak Street</address>
    </customer>
    <customer id="3">
        <first_name>Bob</first_name>
        <last_name>Brown</last_name>
        <email>bob.brown@example.com</email>
        <address>789 Pine Street</address>
    </customer>
    <customer id="4">
        <first_name>Rob</first_name>
        <last_name>Stuart</last_name>
        <email>rob.stuart@example.com</email>
        <address>119 Willow Street</address>
    </customer>
</customers>

In the /tmp directory on the master host, create the transform_customers_in.xslt or the transform_customers_in.py file containing the transformation logic. The XML document is transformed into a CSV file by iterating through each customer element within the customers root, extracting the ID, full name (first and last names concatenated), email, and address:

XSLT

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
    <!-- Header Row -->
    <xsl:text>id,name,email,address</xsl:text>
    <xsl:text>&#xa;</xsl:text>
    <xsl:for-each select="//customers/customer">
        <!-- Data Row -->
        <xsl:value-of select="@id"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="concat(first_name/text(), ' ', last_name/text())"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="email/text()"/>
        <xsl:text>,</xsl:text>
        <xsl:value-of select="address/text()"/>
        <xsl:text>&#xa;</xsl:text>
    </xsl:for-each>
</xsl:template>
</xsl:stylesheet>

Python

import xml.etree.ElementTree as ET
import csv
import sys

def xml_to_csv(xml_data):

    root = ET.fromstring(xml_data)

    header = ['id', 'name', 'email', 'address']

    writer = csv.writer(sys.stdout)

    writer.writerow(header)

    for customer in root.findall('customer'):
        customer_id = customer.get('id')
        first_name = customer.find('first_name').text
        last_name = customer.find('last_name').text
        name = "{0} {1}".format(first_name, last_name)
        email = customer.find('email').text
        address = customer.find('address').text

        writer.writerow([customer_id, name, email, address])

if __name__ == "__main__":
    xml_data = sys.stdin.read()
    xml_to_csv(xml_data)

Create the gpfdist transformation configuration file (config_in.yaml). In the TRANSFORMATIONS block, provide the name of the transformation (transform_customers) and the transformation type (input). Notice that the COMMAND setting uses the respective command with a %filename% placeholder that starts an XSLT- or Python-based transformation. When gpfdist runs the transform_customers transformation, it invokes the command with /bin/bash and replaces the %filename% placeholder with the path to the input file to transform:
XSLT
--- VERSION: 1.0.0.1 TRANSFORMATIONS: transform_customers: TYPE: input COMMAND: /usr/bin/xsltproc /tmp/transform_customers_in.xslt %filename%
Python
--- VERSION: 1.0.0.1 TRANSFORMATIONS: transform_customers: TYPE: input COMMAND: /usr/bin/env bash -c "(/usr/bin/python /tmp/transform_customers_in.py < $0)" %filename%

Load the data. You can use either gpfdist (to perform all the loading steps manually) or gpload (to automate the steps via the control file):

gpfdist

In the /tmp directory on the master host, start gpfdist and provide the config_in.yaml configuration file:
```
$ gpfdist -c config_in.yaml -d /tmp -p 8888 &
```

Create the customers_r readable external table. In the LOCATION clause, use the #transform parameter and specify the transformation name (transform_customers):

CREATE EXTERNAL TABLE customers_r (LIKE customers)
    LOCATION ('gpfdist://mdw:8888/customers.xml#transform=transform_customers')
    FORMAT 'CSV' (HEADER);

Insert the data into the target customers table:

INSERT INTO customers SELECT * FROM customers_r;

gpload

Create the gpload control file (load.yaml). For the TRANSFORM_CONFIG key, specify the location of the created gpfdist configuration file (config_in.yaml); for the TRANSFORM key, specify the name of the transformation (transform_customers):

---
VERSION: 1.0.0.1
DATABASE: customers
USER: gpadmin
HOST: mdw
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        FILE:
          - /tmp/customers.xml
    - COLUMNS:
      - id: integer
      - name: text
      - email: text
      - address: text
    - TRANSFORM_CONFIG: /tmp/config_in.yaml
    - TRANSFORM: transform_customers
    - FORMAT: CSV
    - HEADER: true
  OUTPUT:
    - TABLE: customers
    - MODE: INSERT

Run gpload and use the -f option to provide the location of the created control file:
```
$ gpload -f /tmp/load.yaml
```

Query the customers table:

TABLE customers;

The output should look as follows:

 id |    name    |         email          |      address
----+------------+------------------------+-------------------
  1 | John Doe   | john.doe@example.com   | 123 Elm Street
  2 | Jane Smith | jane.smith@example.com | 456 Oak Street
  3 | Bob Brown  | bob.brown@example.com  | 789 Pine Street
  4 | Rob Stuart | rob.stuart@example.com | 119 Willow Street

This example demonstrates unloading the data from the database table via gpfdist and transforming it to XML with Python.

In the /tmp directory on the master host, create the transform_customers_out.py file containing the transformation logic. The CSV file is transformed into an XML document by reading each row, extracting data such as customer ID, name (split into first name and last name), email, and address. The corresponding <customer> elements are created within the root <customers> element.

import csv
import xml.etree.ElementTree as ET
import sys

def csv_to_xml(csv_data):

    root = ET.Element('customers')

    reader = csv.reader(csv_data.splitlines(), delimiter='|')
    header = next(reader)

    for row in reader:
        if len(row) != 4:
            print "Skipping invalid row: %s" % row
            continue

        customer_id = row[0]
        name = row[1]
        email = row[2]
        address = row[3]

        customer = ET.SubElement(root, 'customer', {'id': customer_id})

        ET.SubElement(customer, 'first_name').text = name.split(' ')[0]
        ET.SubElement(customer, 'last_name').text = name.split(' ')[-1]
        ET.SubElement(customer, 'email').text = email
        ET.SubElement(customer, 'address').text = address

    return ET.tostring(root, encoding='utf-8')

if __name__ == "__main__":
    csv_data = sys.stdin.read()
    try:
        xml_output = csv_to_xml(csv_data)
        print xml_output
    except ValueError as e:
        print "Error: %s" % e
        sys.exit(1)

Create the gpfdist transformation configuration file (config_out.yaml). In the TRANSFORMATIONS block, provide the name of the transformation (transform_customers) and the transformation type (output). Notice that the COMMAND setting uses the respective command with a %filename% placeholder that starts a Python-based transformation. When gpfdist runs the transform_customers transformation, it invokes the command with /bin/bash and replaces the %filename% placeholder with the path to the output file:
```
---
VERSION: 1.0.0.1
TRANSFORMATIONS:
    transform_customers:
        TYPE: output
        COMMAND: /usr/bin/env bash -c "(/usr/bin/python /tmp/transform_customers_out.py | cat - >> $0)" %filename%
```
Load the data with gpfdist. In the /tmp directory on the master host, start gpfdist and provide the config_out.yaml configuration file:
```
$ gpfdist -c config_out.yaml -d /tmp -p 8888 &
```
Create the customers_w writable external table. In the LOCATION clause, use the #transform parameter and specify the transformation name (transform_customers). In the FORMAT clause, provide the output format (TEXT) and specify that the pipe symbol (|) is used as a delimiter (DELIMITER):
```
CREATE WRITABLE EXTERNAL TABLE customers_w (LIKE customers)
    LOCATION ('gpfdist://mdw:8888/customers_out.xml#transform=transform_customers')
    FORMAT 'TEXT' (DELIMITER '|');
```

Populate the customers_w table with sample data:

INSERT INTO customers_w (id, name, email, address)
VALUES (1,'John Doe','john.doe@example.com','123 Elm Street'),
       (2,'Jane Smith','jane.smith@example.com','456 Oak Street'),
       (3,'Bob Brown','bob.brown@example.com','789 Pine Street'),
       (4,'Rob Stuart','rob.stuart@example.com','119 Willow Street');

Review the created customers_out.xml file:

$ cat /tmp/customers_out.xml

The file’s content should look as follows:

<customers>
    <customer id="1">
        <first_name>John</first_name>
        <last_name>Doe</last_name>
        <email>john.doe@example.com</email>
        <address>123 Elm Street</address>
    </customer>
    <customer id="2">
        <first_name>Jane</first_name>
        <last_name>Smith</last_name>
        <email>jane.smith@example.com</email>
        <address>456 Oak Street</address>
    </customer>
    <customer id="3">
        <first_name>Bob</first_name>
        <last_name>Brown</last_name>
        <email>bob.brown@example.com</email>
        <address>789 Pine Street</address>
    </customer>
    <customer id="4">
        <first_name>Rob</first_name>
        <last_name>Stuart</last_name>
        <email>rob.stuart@example.com</email>
        <address>119 Willow Street</address>
    </customer>
</customers>

Format external data

Transform external data

Transformation configuration file

Transfer the data

Load data with gpfdist

Load data with gpload

Examples

Load data

Unload data