Transform external data
With data transformations, you can read or write external data files stored in formats that Greengage DB doesn’t support natively.
An input transformation reads a file in a foreign data format and outputs rows to gpfdist
in the format specified in the external table’s FORMAT
clause (TEXT
or CSV
).
An output transformation receives rows from gpfdist
in text format and converts them to a foreign data format.
Data transformation is supported when working with the gpfdist and gpload utilities.
One of the use cases requiring transformation is accessing data stored in external structured files (such as XML or JSON) from within the database.
See Examples for detailed guides on loading and unloading data in XML format.
The general workflow for setting up a transformation project looks as follows:
-
Determine the transformation schema: first establish the project’s objective (for example, indexing, analysis, or data combination). Then, examine the source files, their structure, and elements' names. Finally, select which elements to import.
-
Write a transformation that specifies what to extract from the data. You can use any authoring environment and language appropriate for your project. For example, an XML transformation can be performed via the technologies such as XSLT, Java, or Python, based on the goals and scope of the project. A transformation is provided as an executable command that
gpfdist
calls with the name of the source data file. -
Write the
gpfdist
transformation configuration file. The file is passed togpfdist
on the command line and contains the rules thatgpfdist
uses to select a transformation to apply when loading or extracting data. -
Transfer the data by using either
gpfdist
orgpload
, which automates certain loading tasks. Learn more in Transfer the data.
Transformation configuration file
The gpfdist
transformation configuration file defines the transformation parameters and uses the YAML 1.1 document format.
The configuration file is processed in specified order; indentation (spaces) determines the document hierarchy and relationships between the sections.
The use of whitespace in a YAML document is significant. Whitespace should not be used for formatting purposes, and tabs should not be used at all.
The basic configuration file structure looks as follows:
---
VERSION: 1.0.0.1
TRANSFORMATIONS:
transformation_name1:
TYPE: input | output
COMMAND: command
SAFE: posix-regex
transformation_name2:
TYPE: input | output
COMMAND: command
...
Key | Description | Required |
---|---|---|
VERSION |
The version of the |
Yes |
TRANSFORMATIONS |
Begins the transformation specification section.
A configuration file must have at least one transformation defined.
Upon receiving a transformation request, |
Yes |
transformation_name1, transformation_name2, … |
The name of the transformation, which is then used when configuring data transfer |
Yes |
TYPE |
The transformation direction, which can be
|
Yes |
COMMAND |
The command that
|
Yes |
SAFE |
A POSIX regular expression that the paths to data files must match to be passed to the transformation.
Specify |
No |
Transfer the data
To load the data into or out of the database, use one of the following approaches:
Load data with gpfdist
Transforming with gpfdist
is performed as follows:
-
Start
gpfdist
and use the-c
option to provide the transformation configuration file:$ gpfdist -c config.yaml
-
Create an external table. In the
CREATE EXTERNAL TABLE
command’sLOCATION
clause, specify the name of the hostgpfdist
runs on (<hostname>
), the port it listens on (<port>
), and the source data file (<filename>
). Use the#transform
parameter to specify the transformation name (<transformation_name>
). If multiplegpfdist
servers are used, make sure to list them all and provide the transformation name for each of them.CREATE READABLE EXTERNAL TABLE <external_table_name> LOCATION ('gpfdist://<hostname>:<port>/<filename>#transform=<transformation_name>' [, ...] )
-
Use the
INSERT
command to insert the loaded data into the target table (<table_name>
):INSERT INTO <table_name> SELECT * FROM <external_table_name>;
You can start gpfdist
with the -I
or -O
options that specify the default input or output transformation, respectively, for example:
$ gpfdist -c config.yaml -I <input_transformation_name>
The selected transformation is applied to all affected files.
Using the #transform
parameter on the table level overrides the -I
or -O
options.
Load data with gpload
The gpload
utility uses the gpfdist
file server and a YAML-formatted control file to orchestrate the data loading operation and automate the following tasks:
-
Creating a readable external table in the database.
-
Starting
gpfdist
instances with the configuration file that specifies the transformation. -
Running the
INSERT INTO <table_name> SELECT FROM <external_table_name>
command to load the data. -
After the data is loaded, removing the external table definition.
gpload
only supports input transformations.
To configure the transformation, you need to specify the TRANSFORM
and TRANSFORM_CONFIG
values in the INPUT
section of the gpload
control file:
---
VERSION: 1.0.0.1
GPLOAD:
INPUT:
- TRANSFORM_CONFIG: <gpfdist_configuration_file>
- TRANSFORM: <transformation_name>
- SOURCE:
FILE: <filename>
where:
-
TRANSFORM_CONFIG
specifies the name of thegpfdist
transformation configuration file. -
TRANSFORM
specifies the name of the transformation that is provided in thegpfdist
configuration file.
The transformation name must appear in two places: in the TRANSFORM
setting of the gpload
control file and in the TRANSFORMATIONS
section of the gpfdist
configuration file (the file specified in the TRANSFORM_CONFIG
section).
Having defined the control file (for example, load.yml), pass it to gpload
by using the -f
option as follows:
$ gpload -f load.yml
Examples
These examples demonstrate loading XML data into the database table, and unloading it from the database table. Transformation between the XML and the text format is implemented in XSLT and Python.
Load data
This example demonstrates transforming the source data with XSLT or Python and loading it into the database table via gpfdist
or gpload
.
-
In the /tmp directory on the master host, create the customers.xml file containing the source data to load:
<customers> <customer id="1"> <first_name>John</first_name> <last_name>Doe</last_name> <email>john.doe@example.com</email> <address>123 Elm Street</address> </customer> <customer id="2"> <first_name>Jane</first_name> <last_name>Smith</last_name> <email>jane.smith@example.com</email> <address>456 Oak Street</address> </customer> <customer id="3"> <first_name>Bob</first_name> <last_name>Brown</last_name> <email>bob.brown@example.com</email> <address>789 Pine Street</address> </customer> <customer id="4"> <first_name>Rob</first_name> <last_name>Stuart</last_name> <email>rob.stuart@example.com</email> <address>119 Willow Street</address> </customer> </customers>
-
In the /tmp directory on the master host, create the transform_customers_in.xslt or the transform_customers_in.py file containing the transformation logic. The XML document is transformed into a CSV file by iterating through each
customer
element within thecustomers
root, extracting the ID, full name (first and last names concatenated), email, and address:<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="text"/> <xsl:template match="/"> <!-- Header Row --> <xsl:text>id,name,email,address</xsl:text> <xsl:text>
</xsl:text> <xsl:for-each select="//customers/customer"> <!-- Data Row --> <xsl:value-of select="@id"/> <xsl:text>,</xsl:text> <xsl:value-of select="concat(first_name/text(), ' ', last_name/text())"/> <xsl:text>,</xsl:text> <xsl:value-of select="email/text()"/> <xsl:text>,</xsl:text> <xsl:value-of select="address/text()"/> <xsl:text>
</xsl:text> </xsl:for-each> </xsl:template> </xsl:stylesheet>
import xml.etree.ElementTree as ET import csv import sys def xml_to_csv(xml_data): root = ET.fromstring(xml_data) header = ['id', 'name', 'email', 'address'] writer = csv.writer(sys.stdout) writer.writerow(header) for customer in root.findall('customer'): customer_id = customer.get('id') first_name = customer.find('first_name').text last_name = customer.find('last_name').text name = "{0} {1}".format(first_name, last_name) email = customer.find('email').text address = customer.find('address').text writer.writerow([customer_id, name, email, address]) if __name__ == "__main__": xml_data = sys.stdin.read() xml_to_csv(xml_data)
-
Create the
gpfdist
transformation configuration file (config_in.yaml). In theTRANSFORMATIONS
block, provide the name of the transformation (transform_customers
) and the transformation type (input
). Notice that theCOMMAND
setting uses the respective command with a%filename%
placeholder that starts an XSLT- or Python-based transformation. Whengpfdist
runs thetransform_customers
transformation, it invokes the command with/bin/bash
and replaces the%filename%
placeholder with the path to the input file to transform:--- VERSION: 1.0.0.1 TRANSFORMATIONS: transform_customers: TYPE: input COMMAND: /usr/bin/xsltproc /tmp/transform_customers_in.xslt %filename%
--- VERSION: 1.0.0.1 TRANSFORMATIONS: transform_customers: TYPE: input COMMAND: /usr/bin/env bash -c "(/usr/bin/python /tmp/transform_customers_in.py < $0)" %filename%
-
Load the data. You can use either
gpfdist
(to perform all the loading steps manually) orgpload
(to automate the steps via the control file):-
In the /tmp directory on the master host, start
gpfdist
and provide the config_in.yaml configuration file:$ gpfdist -c config_in.yaml -d /tmp -p 8888 &
-
Create the
customers_r
readable external table. In theLOCATION
clause, use the#transform
parameter and specify the transformation name (transform_customers
):CREATE EXTERNAL TABLE customers_r (LIKE customers) LOCATION ('gpfdist://mdw:8888/customers.xml#transform=transform_customers') FORMAT 'CSV' (HEADER);
-
Insert the data into the target
customers
table:INSERT INTO customers SELECT * FROM customers_r;
-
Create the
gpload
control file (load.yaml). For theTRANSFORM_CONFIG
key, specify the location of the createdgpfdist
configuration file (config_in.yaml); for theTRANSFORM
key, specify the name of the transformation (transform_customers
):--- VERSION: 1.0.0.1 DATABASE: customers USER: gpadmin HOST: mdw PORT: 5432 GPLOAD: INPUT: - SOURCE: FILE: - /tmp/customers.xml - COLUMNS: - id: integer - name: text - email: text - address: text - TRANSFORM_CONFIG: /tmp/config_in.yaml - TRANSFORM: transform_customers - FORMAT: CSV - HEADER: true OUTPUT: - TABLE: customers - MODE: INSERT
-
Run
gpload
and use the-f
option to provide the location of the created control file:$ gpload -f /tmp/load.yaml
-
-
Query the
customers
table:TABLE customers;
The output should look as follows:
id | name | email | address ----+------------+------------------------+------------------- 1 | John Doe | john.doe@example.com | 123 Elm Street 2 | Jane Smith | jane.smith@example.com | 456 Oak Street 3 | Bob Brown | bob.brown@example.com | 789 Pine Street 4 | Rob Stuart | rob.stuart@example.com | 119 Willow Street
Unload data
This example demonstrates unloading the data from the database table via gpfdist
and transforming it to XML with Python.
-
In the /tmp directory on the master host, create the transform_customers_out.py file containing the transformation logic. The CSV file is transformed into an XML document by reading each row, extracting data such as customer ID, name (split into first name and last name), email, and address. The corresponding
<customer>
elements are created within the root<customers>
element.import csv import xml.etree.ElementTree as ET import sys def csv_to_xml(csv_data): root = ET.Element('customers') reader = csv.reader(csv_data.splitlines(), delimiter='|') header = next(reader) for row in reader: if len(row) != 4: print "Skipping invalid row: %s" % row continue customer_id = row[0] name = row[1] email = row[2] address = row[3] customer = ET.SubElement(root, 'customer', {'id': customer_id}) ET.SubElement(customer, 'first_name').text = name.split(' ')[0] ET.SubElement(customer, 'last_name').text = name.split(' ')[-1] ET.SubElement(customer, 'email').text = email ET.SubElement(customer, 'address').text = address return ET.tostring(root, encoding='utf-8') if __name__ == "__main__": csv_data = sys.stdin.read() try: xml_output = csv_to_xml(csv_data) print xml_output except ValueError as e: print "Error: %s" % e sys.exit(1)
-
Create the
gpfdist
transformation configuration file (config_out.yaml). In theTRANSFORMATIONS
block, provide the name of the transformation (transform_customers
) and the transformation type (output
). Notice that theCOMMAND
setting uses the respective command with a%filename%
placeholder that starts a Python-based transformation. Whengpfdist
runs thetransform_customers
transformation, it invokes the command with/bin/bash
and replaces the%filename%
placeholder with the path to the output file:--- VERSION: 1.0.0.1 TRANSFORMATIONS: transform_customers: TYPE: output COMMAND: /usr/bin/env bash -c "(/usr/bin/python /tmp/transform_customers_out.py | cat - >> $0)" %filename%
-
Load the data with
gpfdist
. In the /tmp directory on the master host, startgpfdist
and provide the config_out.yaml configuration file:$ gpfdist -c config_out.yaml -d /tmp -p 8888 &
-
Create the
customers_w
writable external table. In theLOCATION
clause, use the#transform
parameter and specify the transformation name (transform_customers
). In theFORMAT
clause, provide the output format (TEXT
) and specify that the pipe symbol (|
) is used as a delimiter (DELIMITER
):CREATE WRITABLE EXTERNAL TABLE customers_w (LIKE customers) LOCATION ('gpfdist://mdw:8888/customers_out.xml#transform=transform_customers') FORMAT 'TEXT' (DELIMITER '|');
-
Populate the
customers_w
table with sample data:INSERT INTO customers_w (id, name, email, address) VALUES (1,'John Doe','john.doe@example.com','123 Elm Street'), (2,'Jane Smith','jane.smith@example.com','456 Oak Street'), (3,'Bob Brown','bob.brown@example.com','789 Pine Street'), (4,'Rob Stuart','rob.stuart@example.com','119 Willow Street');
-
Review the created customers_out.xml file:
$ cat /tmp/customers_out.xml
The file’s content should look as follows:
<customers> <customer id="1"> <first_name>John</first_name> <last_name>Doe</last_name> <email>john.doe@example.com</email> <address>123 Elm Street</address> </customer> <customer id="2"> <first_name>Jane</first_name> <last_name>Smith</last_name> <email>jane.smith@example.com</email> <address>456 Oak Street</address> </customer> <customer id="3"> <first_name>Bob</first_name> <last_name>Brown</last_name> <email>bob.brown@example.com</email> <address>789 Pine Street</address> </customer> <customer id="4"> <first_name>Rob</first_name> <last_name>Stuart</last_name> <email>rob.stuart@example.com</email> <address>119 Willow Street</address> </customer> </customers>