Transform external data
With data transformations, you can read or write external data files stored in formats that Greengage DB doesn’t support natively.
An input transformation reads a file in a foreign data format and outputs rows to gpfdist in the format specified in the external table’s FORMAT clause (TEXT or CSV).
An output transformation receives rows from gpfdist in text format and converts them to a foreign data format.
Data transformation is supported when working with the gpfdist and gpload utilities.
One of the use cases requiring transformation is accessing data stored in external structured files (such as XML or JSON) from within the database.
See Examples for detailed guides on loading and unloading data in XML format.
The general workflow for setting up a transformation project looks as follows:
-
Determine the transformation schema: first establish the project’s objective (for example, indexing, analysis, or data combination). Then, examine the source files, their structure, and elements' names. Finally, select which elements to import.
-
Write a transformation that specifies what to extract from the data. You can use any authoring environment and language appropriate for your project. For example, an XML transformation can be performed via the technologies such as XSLT, Java, or Python, based on the goals and scope of the project. A transformation is provided as an executable command that
gpfdistcalls with the name of the source data file. -
Write the
gpfdisttransformation configuration file. The file is passed togpfdiston the command line and contains the rules thatgpfdistuses to select a transformation to apply when loading or extracting data. -
Transfer the data by using either
gpfdistorgpload, which automates certain loading tasks. Learn more in Transfer the data.
Transformation configuration file
The gpfdist transformation configuration file defines the transformation parameters and uses the YAML 1.1 document format.
The configuration file is processed in specified order; indentation (spaces) determines the document hierarchy and relationships between the sections.
The use of whitespace in a YAML document is significant. Whitespace should not be used for formatting purposes, and tabs should not be used at all.
The basic configuration file structure looks as follows:
---
VERSION: 1.0.0.1
TRANSFORMATIONS:
transformation_name1:
TYPE: input | output
COMMAND: command
SAFE: posix-regex
transformation_name2:
TYPE: input | output
COMMAND: command
...
| Key | Description | Required |
|---|---|---|
VERSION |
The version of the |
Yes |
TRANSFORMATIONS |
Begins the transformation specification section.
A configuration file must have at least one transformation defined.
Upon receiving a transformation request, |
Yes |
transformation_name1, transformation_name2, … |
The name of the transformation, which is then used when configuring data transfer |
Yes |
TYPE |
The transformation direction, which can be
|
Yes |
COMMAND |
The command that
|
Yes |
SAFE |
A POSIX regular expression that the paths to data files must match to be passed to the transformation.
Specify |
No |
Transfer the data
To load the data into or out of the database, use one of the following approaches:
Load data with gpfdist
Transforming with gpfdist is performed as follows:
-
Start
gpfdistand use the-coption to provide the transformation configuration file:$ gpfdist -c config.yaml -
Create an external table. In the
CREATE EXTERNAL TABLEcommand’sLOCATIONclause, specify the name of the hostgpfdistruns on (<hostname>), the port it listens on (<port>), and the source data file (<filename>). Use the#transformparameter to specify the transformation name (<transformation_name>). If multiplegpfdistservers are used, make sure to list them all and provide the transformation name for each of them.CREATE READABLE EXTERNAL TABLE <external_table_name> LOCATION ('gpfdist://<hostname>:<port>/<filename>#transform=<transformation_name>' [, ...] ) -
Use the
INSERTcommand to insert the loaded data into the target table (<table_name>):INSERT INTO <table_name> SELECT * FROM <external_table_name>;
You can start gpfdist with the -I or -O options that specify the default input or output transformation, respectively, for example:
$ gpfdist -c config.yaml -I <input_transformation_name>
The selected transformation is applied to all affected files.
Using the #transform parameter on the table level overrides the -I or -O options.
Load data with gpload
The gpload utility uses the gpfdist file server and a YAML-formatted control file to orchestrate the data loading operation and automate the following tasks:
-
Creating a readable external table in the database.
-
Starting
gpfdistinstances with the configuration file that specifies the transformation. -
Running the
INSERT INTO <table_name> SELECT FROM <external_table_name>command to load the data. -
After the data is loaded, removing the external table definition.
gpload only supports input transformations.
To configure the transformation, you need to specify the TRANSFORM and TRANSFORM_CONFIG values in the INPUT section of the gpload control file:
---
VERSION: 1.0.0.1
GPLOAD:
INPUT:
- TRANSFORM_CONFIG: <gpfdist_configuration_file>
- TRANSFORM: <transformation_name>
- SOURCE:
FILE: <filename>
where:
-
TRANSFORM_CONFIGspecifies the name of thegpfdisttransformation configuration file. -
TRANSFORMspecifies the name of the transformation that is provided in thegpfdistconfiguration file.
The transformation name must appear in two places: in the TRANSFORM setting of the gpload control file and in the TRANSFORMATIONS section of the gpfdist configuration file (the file specified in the TRANSFORM_CONFIG section).
Having defined the control file (for example, load.yml), pass it to gpload by using the -f option as follows:
$ gpload -f load.yml
Examples
These examples demonstrate loading XML data into the database table, and unloading it from the database table. Transformation between the XML and the text format is implemented in XSLT and Python.
Load data
This example demonstrates transforming the source data with XSLT or Python and loading it into the database table via gpfdist or gpload.
-
In the /tmp directory on the master host, create the customers.xml file containing the source data to load:
<customers> <customer id="1"> <first_name>John</first_name> <last_name>Doe</last_name> <email>john.doe@example.com</email> <address>123 Elm Street</address> </customer> <customer id="2"> <first_name>Jane</first_name> <last_name>Smith</last_name> <email>jane.smith@example.com</email> <address>456 Oak Street</address> </customer> <customer id="3"> <first_name>Bob</first_name> <last_name>Brown</last_name> <email>bob.brown@example.com</email> <address>789 Pine Street</address> </customer> <customer id="4"> <first_name>Rob</first_name> <last_name>Stuart</last_name> <email>rob.stuart@example.com</email> <address>119 Willow Street</address> </customer> </customers> -
In the /tmp directory on the master host, create the transform_customers_in.xslt or the transform_customers_in.py file containing the transformation logic. The XML document is transformed into a CSV file by iterating through each
customerelement within thecustomersroot, extracting the ID, full name (first and last names concatenated), email, and address:<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="text"/> <xsl:template match="/"> <!-- Header Row --> <xsl:text>id,name,email,address</xsl:text> <xsl:text>
</xsl:text> <xsl:for-each select="//customers/customer"> <!-- Data Row --> <xsl:value-of select="@id"/> <xsl:text>,</xsl:text> <xsl:value-of select="concat(first_name/text(), ' ', last_name/text())"/> <xsl:text>,</xsl:text> <xsl:value-of select="email/text()"/> <xsl:text>,</xsl:text> <xsl:value-of select="address/text()"/> <xsl:text>
</xsl:text> </xsl:for-each> </xsl:template> </xsl:stylesheet>import xml.etree.ElementTree as ET import csv import sys def xml_to_csv(xml_data): root = ET.fromstring(xml_data) header = ['id', 'name', 'email', 'address'] writer = csv.writer(sys.stdout) writer.writerow(header) for customer in root.findall('customer'): customer_id = customer.get('id') first_name = customer.find('first_name').text last_name = customer.find('last_name').text name = "{0} {1}".format(first_name, last_name) email = customer.find('email').text address = customer.find('address').text writer.writerow([customer_id, name, email, address]) if __name__ == "__main__": xml_data = sys.stdin.read() xml_to_csv(xml_data) -
Create the
gpfdisttransformation configuration file (config_in.yaml). In theTRANSFORMATIONSblock, provide the name of the transformation (transform_customers) and the transformation type (input). Notice that theCOMMANDsetting uses the respective command with a%filename%placeholder that starts an XSLT- or Python-based transformation. Whengpfdistruns thetransform_customerstransformation, it invokes the command with/bin/bashand replaces the%filename%placeholder with the path to the input file to transform:--- VERSION: 1.0.0.1 TRANSFORMATIONS: transform_customers: TYPE: input COMMAND: /usr/bin/xsltproc /tmp/transform_customers_in.xslt %filename%--- VERSION: 1.0.0.1 TRANSFORMATIONS: transform_customers: TYPE: input COMMAND: /usr/bin/env bash -c "(/usr/bin/python /tmp/transform_customers_in.py < $0)" %filename% -
Load the data. You can use either
gpfdist(to perform all the loading steps manually) orgpload(to automate the steps via the control file):-
In the /tmp directory on the master host, start
gpfdistand provide the config_in.yaml configuration file:$ gpfdist -c config_in.yaml -d /tmp -p 8888 & -
Create the
customers_rreadable external table. In theLOCATIONclause, use the#transformparameter and specify the transformation name (transform_customers):CREATE EXTERNAL TABLE customers_r (LIKE customers) LOCATION ('gpfdist://mdw:8888/customers.xml#transform=transform_customers') FORMAT 'CSV' (HEADER); -
Insert the data into the target
customerstable:INSERT INTO customers SELECT * FROM customers_r;
-
Create the
gploadcontrol file (load.yaml). For theTRANSFORM_CONFIGkey, specify the location of the createdgpfdistconfiguration file (config_in.yaml); for theTRANSFORMkey, specify the name of the transformation (transform_customers):--- VERSION: 1.0.0.1 DATABASE: customers USER: gpadmin HOST: mdw PORT: 5432 GPLOAD: INPUT: - SOURCE: FILE: - /tmp/customers.xml - COLUMNS: - id: integer - name: text - email: text - address: text - TRANSFORM_CONFIG: /tmp/config_in.yaml - TRANSFORM: transform_customers - FORMAT: CSV - HEADER: true OUTPUT: - TABLE: customers - MODE: INSERT -
Run
gploadand use the-foption to provide the location of the created control file:$ gpload -f /tmp/load.yaml
-
-
Query the
customerstable:TABLE customers;The output should look as follows:
id | name | email | address ----+------------+------------------------+------------------- 1 | John Doe | john.doe@example.com | 123 Elm Street 2 | Jane Smith | jane.smith@example.com | 456 Oak Street 3 | Bob Brown | bob.brown@example.com | 789 Pine Street 4 | Rob Stuart | rob.stuart@example.com | 119 Willow Street
Unload data
This example demonstrates unloading the data from the database table via gpfdist and transforming it to XML with Python.
-
In the /tmp directory on the master host, create the transform_customers_out.py file containing the transformation logic. The CSV file is transformed into an XML document by reading each row, extracting data such as customer ID, name (split into first name and last name), email, and address. The corresponding
<customer>elements are created within the root<customers>element.import csv import xml.etree.ElementTree as ET import sys def csv_to_xml(csv_data): root = ET.Element('customers') reader = csv.reader(csv_data.splitlines(), delimiter='|') header = next(reader) for row in reader: if len(row) != 4: print "Skipping invalid row: %s" % row continue customer_id = row[0] name = row[1] email = row[2] address = row[3] customer = ET.SubElement(root, 'customer', {'id': customer_id}) ET.SubElement(customer, 'first_name').text = name.split(' ')[0] ET.SubElement(customer, 'last_name').text = name.split(' ')[-1] ET.SubElement(customer, 'email').text = email ET.SubElement(customer, 'address').text = address return ET.tostring(root, encoding='utf-8') if __name__ == "__main__": csv_data = sys.stdin.read() try: xml_output = csv_to_xml(csv_data) print xml_output except ValueError as e: print "Error: %s" % e sys.exit(1) -
Create the
gpfdisttransformation configuration file (config_out.yaml). In theTRANSFORMATIONSblock, provide the name of the transformation (transform_customers) and the transformation type (output). Notice that theCOMMANDsetting uses the respective command with a%filename%placeholder that starts a Python-based transformation. Whengpfdistruns thetransform_customerstransformation, it invokes the command with/bin/bashand replaces the%filename%placeholder with the path to the output file:--- VERSION: 1.0.0.1 TRANSFORMATIONS: transform_customers: TYPE: output COMMAND: /usr/bin/env bash -c "(/usr/bin/python /tmp/transform_customers_out.py | cat - >> $0)" %filename% -
Load the data with
gpfdist. In the /tmp directory on the master host, startgpfdistand provide the config_out.yaml configuration file:$ gpfdist -c config_out.yaml -d /tmp -p 8888 & -
Create the
customers_wwritable external table. In theLOCATIONclause, use the#transformparameter and specify the transformation name (transform_customers). In theFORMATclause, provide the output format (TEXT) and specify that the pipe symbol (|) is used as a delimiter (DELIMITER):CREATE WRITABLE EXTERNAL TABLE customers_w (LIKE customers) LOCATION ('gpfdist://mdw:8888/customers_out.xml#transform=transform_customers') FORMAT 'TEXT' (DELIMITER '|'); -
Populate the
customers_wtable with sample data:INSERT INTO customers_w (id, name, email, address) VALUES (1,'John Doe','john.doe@example.com','123 Elm Street'), (2,'Jane Smith','jane.smith@example.com','456 Oak Street'), (3,'Bob Brown','bob.brown@example.com','789 Pine Street'), (4,'Rob Stuart','rob.stuart@example.com','119 Willow Street'); -
Review the created customers_out.xml file:
$ cat /tmp/customers_out.xmlThe file’s content should look as follows:
<customers> <customer id="1"> <first_name>John</first_name> <last_name>Doe</last_name> <email>john.doe@example.com</email> <address>123 Elm Street</address> </customer> <customer id="2"> <first_name>Jane</first_name> <last_name>Smith</last_name> <email>jane.smith@example.com</email> <address>456 Oak Street</address> </customer> <customer id="3"> <first_name>Bob</first_name> <last_name>Brown</last_name> <email>bob.brown@example.com</email> <address>789 Pine Street</address> </customer> <customer id="4"> <first_name>Rob</first_name> <last_name>Stuart</last_name> <email>rob.stuart@example.com</email> <address>119 Willow Street</address> </customer> </customers>