Hello, I’m DocuDroid!
Submitting feedback
Thank you for rating our AI Search!
We would be grateful if you could share your thoughts so we can improve our AI Search for you and other readers.
GitHub

Format external data

Anton Monakov

When loading or unloading data via the COPY command, an external table, or the gpload utility, you can specify how your data is formatted.

Greengage DB supports the delimited text (TEXT), comma-separated values (CSV), and binary formats. You can use other formats by defining a custom format or a custom protocol.

Prerequisites

To try out the practical examples listed in this topic, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql. Then create the suppliers test database, and connect to it.

DROP DATABASE IF EXISTS suppliers;
CREATE DATABASE suppliers;
\c suppliers

Then create the suppliers table:

CREATE TABLE suppliers
(
    id INTEGER,
    name VARCHAR,
    city VARCHAR,
    country VARCHAR
)
    DISTRIBUTED RANDOMLY;

Define data format

You define how the data is formatted by specifying various data formatting clauses.

Clause Description Applicable to

HEADER

Designates whether the data file contains a header row.

If using multiple data source files, all off them must have a header row. The default is to assume that the input files do not have a header row

External table, COPY, gpload

NEWLINE

Designates a character used as a newline character, which can be LF (Line feed, 0x0A), CR (Carriage return, 0x0D), or CR followed by LF (CRLF, 0x0D 0x0A).

If not specified, the newline character detected at the end of the first line of data is used.

See Format rows for details

External table, COPY, gpload

DELIMITER

Designates a single ASCII character to act as a column delimiter. The delimiter must reside between any two data value fields but not at the beginning or the end of a row.

You can also specify a non-printable ASCII character or a non-printable Unicode character, such as \x1B or \u001B. The escape string syntax, E'<character-code>', is also supported for non-printable characters. The ASCII or Unicode character must be enclosed in single quotes, for example, E'\x1B' or E'\u001B'.

The default column delimiter is the horizontal TAB character (0x09).

See Format columns for details

External table, COPY, gpload

NULL

Designates a string representing a null value, which indicates an unknown piece of data in a column or field.

The default string is \N.

When using with the gpload utility, specify the clause as NULL_AS.

See Represent NULL values for details

External table, COPY, gpload

ESCAPE

Designates a single character used as an escape character.

The default escape character is a backslash (\). You can deactivate escaping by providing the OFF value.

See Escape characters for details

External table, COPY, gpload

FILL MISSING FIELDS

Sets missing trailing field values at the end of a line or row to NULL. If not set, an error is reported in such cases. Blank rows, fields with a NOT NULL constraint, and trailing delimiters on a line will still report an error

External table, COPY

ENCODING

Defines the character set encoding of the source data.

If not specified, the default client encoding is used.

See Character encoding for details

External table, COPY, gpload

Clause Description Applicable to

HEADER

Designates whether the data file contains a header row.

If using multiple data source files, all off them must have a header row. The default is to assume that the input files do not have a header row

External table, COPY, gpload

NEWLINE

Designates a character used as a newline character, which can be LF (Line feed, 0x0A), CR (Carriage return, 0x0D), or CR followed by LF (CRLF, 0x0D 0x0A).

If not specified, the newline character detected at the end of the first line of data is used.

See Format rows for details

External table, COPY, gpload

QUOTE

Specifies the quotation character, double quote (") by default

External table, COPY, gpload

DELIMITER

Designates a single ASCII character to act as a column delimiter. The delimiter must reside between any two data value fields but not at the beginning or the end of a row.

You can also specify a non-printable ASCII character or a non-printable Unicode character, such as \x1B or \u001B. The escape string syntax, E'<character-code>', is also supported for non-printable characters. The ASCII or Unicode character must be enclosed in single quotes, for example, E'\x1B' or E'\u001B'.

The default column delimiter is the comma character (,).

See Format columns for details

External table, COPY, gpload

NULL

Designates a string representing a null value, which indicates an unknown piece of data in a column or field.

The default string is an empty value with no quotation marks.

When using with the gpload utility, specify the clause as NULL_AS.

See Represent NULL values for details

External table, COPY, gpload

FORCE NOT NULL

Treats values in each specified column as if they were quoted. Since the default null string is an empty unquoted string, this causes missing values to be evaluated as zero-length strings.

When using with the COPY command or the gpload utility, specify the clause as FORCE_NOT_NULL

External table, COPY, gpload

FORCE_NULL

Matches the specified columns' values against the null string. If a match is found, sets the value to NULL even if the value is quoted. In the default case where the null string is empty, this converts a quoted empty string to NULL

COPY

FORCE QUOTE

Enforces quoting for all non-NULL values in each specified column.

If an asterisk (*) is specified, non-NULL values are quoted in all columns. NULL output is never quoted.

When using with the COPY command, specify the clause as FORCE_QUOTE

External table, COPY

ESCAPE

Designates a single character used as an escape character.

The default escape character is a double quote (").

See Escape characters for details

External table, COPY, gpload

FILL MISSING FIELDS

Sets missing trailing field values at the end of a line or row to NULL. If not set, an error is reported in such cases. Blank rows, fields with a NOT NULL constraint, and trailing delimiters on a line will still report an error

External table, COPY

ENCODING

Defines the character set encoding of the source data.

If not specified, the default client encoding is used.

See Character encoding for details

External table, COPY, gpload

In the CREATE EXTERNAL TABLE command, COPY command, or the gpload utility’s control file, use the formatting-related clauses as follows:

CREATE [[READABLE] | WRITABLE] EXTERNAL [WEB] TABLE <table_name>
    LOCATION ('<protocol>://<location>) | EXECUTE '<command>'
    FORMAT 'TEXT'
           [( [HEADER]
              [DELIMITER [AS] '<delimiter>' | 'OFF']
              [NULL [AS] '<null string>']
              [ESCAPE [AS] '<escape>' | 'OFF']
              [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
              [FILL MISSING FIELDS] )]
        | 'CSV'
            [( [HEADER]
            [QUOTE [AS] '<quote>']
            [DELIMITER [AS] '<delimiter>']
            [NULL [AS] '<null string>']
            [FORCE NOT NULL <column> [, ...]]
            [ESCAPE [AS] '<escape>']
            [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
            [FILL MISSING FIELDS] )]
        | 'CUSTOM' (Formatter=<<formatter_specifications>>)
COPY <table_name> [(<column_name> [, ...])]
     TO | FROM {'<filename>' | PROGRAM '<command>' | {STDIN | STDOUT}}
     [ [ WITH ]
    (
        FORMAT TEXT | CSV | BINARY,
        DELIMITER '<delimiter_character>',
        NULL '<null string>',
        HEADER [ <boolean> ],
        QUOTE '<quote_character>',
        NEWLINE '<newline_character>',
        ESCAPE '<escape_character>',
        FORCE_QUOTE { ( <column_name> [, ...] ) \| * },
        FORCE_NOT_NULL ( <column_name> [, ...] ),
        FORCE_NULL ( <column_name> [, ...] ),
        ENCODING '<encoding_name>'
        FILL MISSING FIELDS
    ) ]
...
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - <hostname_or_ip>
         PORT: <http_port>
         FILE:
           - </path/to/input_file>
    - COLUMNS:
           - <field_name>: <data_type>
    - FORMAT: text | csv
    - DELIMITER: '<delimiter_character>'
    - ESCAPE: '<escape_character>' | 'OFF'
    - NEWLINE: 'LF' | 'CR' | 'CRLF'
    - NULL_AS: '<null_string>'
    - FILL_MISSING_FIELDS: true | false
    - FORCE_NOT_NULL: <column_name> [, ...]
    - QUOTE: '<csv_quote_character>'
    - HEADER: true | false
    - ENCODING: <database_encoding>
...

Format rows

Greengage DB expects rows of data to be separated by the LF character (Line feed, 0x0A), CR (Carriage return, 0x0D), or CR followed by LF (CR+LF, 0x0D 0x0A). You can specify the row separator by using the NEWLINE clause when defining a data format. If not specified, the first newline character encountered at the first row of data is used.

LF is the standard newline representation in UNIX, Linux, and macOS; Windows uses CRLF.

Format columns

Columns of data have to be separated with the delimiter character. The default column or field delimiter is the horizontal TAB character (0x09) for text files and the comma character (0x2C) for CSV files. You can declare a single-character delimiter using the DELIMITER clause when defining a data format. The delimiter character must appear between any two data value fields, not at the beginning or the end of a row.

Example

In this example, the COPY command is used for copying the suppliers.txt file’s content into the suppliers table. The file uses the pipe character (|) as a delimiter.

  1. In the /tmp directory on the master host, create the suppliers.txt file:

    #!/bin/bash
    
    cat > /tmp/suppliers.txt <<EOF
    id|name|city|country
    1|Sunrise Corp|Los Angeles|USA
    2|Blue Sky Co|Toronto|Canada
    3|Green Leaf LLC|Berlin|Germany
    4|Red Rock Group|Sydney|Australia
    5|Golden Gate Solutions|Tokyo|Japan
    EOF
  2. Empty the suppliers table:

    TRUNCATE TABLE suppliers;
  3. Run the COPY FROM command. In the WITH clause, set TEXT as the data format, specify that the file contains a header row (HEADER), and set the pipe character (|) as a delimiter (DELIMITER).

    COPY suppliers FROM '/tmp/suppliers.txt'
    WITH (FORMAT TEXT, HEADER, DELIMITER '|');
  4. Query the suppliers table:

    SELECT * FROM suppliers;

    The output should look as follows:

     id |         name          |    city     |  country
    ----+-----------------------+-------------+-----------
      1 | Sunrise Corp          | Los Angeles | USA
      2 | Blue Sky Co           | Toronto     | Canada
      3 | Green Leaf LLC        | Berlin      | Germany
      4 | Red Rock Group        | Sydney      | Australia
      5 | Golden Gate Solutions | Tokyo       | Japan

Represent NULL values

NULL represents an unknown piece of data in a field. You can declare a string to represent null values by using the NULL clause when defining a data format. You can use an arbitrary string that should match a NULL value, for example, an empty string if you do not want to distinguish nulls from empty strings.

The default string is \N for text files, and an empty value with no quotation marks for CSV files.

Example

In this example, the COPY command is used for copying the suppliers.txt file’s content into the suppliers table. Some data values in the file are missing and represented by the N/A string.

  1. In the /tmp directory on the master host, create the suppliers.txt file:

    #!/bin/bash
    
    cat > /tmp/suppliers.txt <<EOF
    id,name,city,country
    1,Sunrise Corp,Los Angeles,USA
    2,Blue Sky Co,N/A,N/A
    3,Green Leaf LLC,Berlin,Germany
    4,Red Rock Group,Sydney,N/A
    5,Golden Gate Solutions,N/A,Japan
    EOF
  2. Empty the suppliers table:

    TRUNCATE TABLE suppliers;
  3. Run the COPY FROM command. In the WITH clause, set TEXT as the data format, specify that the file contains a header row (HEADER), and set N/A as the null-string (NULL).

    COPY suppliers FROM '/tmp/suppliers.txt'
    WITH (FORMAT TEXT, DELIMITER ',', HEADER, NULL 'N/A');
  4. Query the suppliers table:

    SELECT * FROM suppliers;

    The output should look as follows:

     id |         name          |    city     | country
    ----+-----------------------+-------------+---------
      1 | Sunrise Corp          | Los Angeles | USA
      2 | Blue Sky Co           |             |
      3 | Green Leaf LLC        | Berlin      | Germany
      4 | Red Rock Group        | Sydney      |
      5 | Golden Gate Solutions |             | Japan

Escape characters

The characters designated as a delimiter (DELIMITER) and a newline (NEWLINE) are reserved in Greengage DB. If your data contains any of these characters, you must escape them so that they are treated as data and not as a field separator or a new row. If the escape character appears in the data, use it to escape itself.

You can declare an escape character by using the ESCAPE clause when defining a data format.

Escape characters in text-formatted files

By default, the escape character is a backslash (\) for text-formatted files.

You can use the escape character to escape octal and hexadecimal sequences. The escaped value is converted to the equivalent character when loaded into Greengage DB. For example, to load the ampersand character (&), use the escape character to escape its equivalent hexadecimal (\x26) or octal (\046) representation.

If the input data (for example, web log data) contains many backslash characters, you can deactivate escaping by providing the OFF value for the ESCAPE clause, for example:

COPY suppliers FROM '/tmp/suppliers.txt'
WITH (FORMAT TEXT, HEADER, ESCAPE 'OFF');

Example

In this example, the COPY command is used for copying the suppliers.txt file’s content into the suppliers table. The file uses the comma character (,) as a delimiter.

  1. In the /tmp directory on the master host, create the suppliers.txt file. Notice that the comma character that is part of the data is escaped with the default escape character (\). The backslash character that is itself part of the data is also escaped:

    #!/bin/bash
    
    cat > /tmp/suppliers.txt <<'EOF'
    id,name,city,country
    1,Acme Inc.,New York,USA
    2,Comma\, Inc.,London,GB
    3,Backslash \\ Ltd.,Paris,France
    EOF
  2. Empty the suppliers table:

    TRUNCATE TABLE suppliers;
  3. Run the COPY FROM command. In the WITH clause, set TEXT as the data format, specify that the file contains a header row (HEADER), and set the comma character (,) as a delimiter (DELIMITER):

    COPY suppliers FROM '/tmp/suppliers.txt'
    WITH (FORMAT TEXT, HEADER, DELIMITER ',');
  4. Query the suppliers table:

    SELECT * FROM suppliers;

    The output should look as follows:

     id |       name       |   city   | country
    ----+------------------+----------+---------
      1 | Acme Inc.        | New York | USA
      2 | Comma, Inc.      | London   | GB
      3 | Backslash \ Ltd. | Paris    | France

Escape characters in CSV-formatted files

By default, the escape character for CSV-formatted files is the same as the QUOTE value (double quote (") by default). In CSV files, all characters are significant. If a quoted value is surrounded by white space or any characters other than DELIMITER, then it includes those characters. This can cause errors if you import data from a system that pads CSV lines with white space to some fixed width. In this case, preprocess the CSV file to remove the trailing white space before importing the data.

Example

In this example, the COPY command is used for copying the suppliers.csv file’s content into the suppliers table.

  1. In the /tmp directory on the master host, create the suppliers.csv file. Notice that the data value containing a comma character is enclosed in double quotes. The double quotes that are part of the data are escaped with the default escape character (") even though the field value itself is enclosed in double quotes. To preserve leading and trailing whitespace characters that are contained in the data, the corresponding field is also enclosed in double quotes:

    #!/bin/bash
    
    cat > /tmp/suppliers.csv <<EOF
    id,name,city,country
    1,Acme Inc.,New York,USA
    2,"Comma, Inc.",London,GB
    3,"""Quotes"" Ltd.",Paris,France
    4,"  Space  GmbH",Berlin,Germany
    EOF
  2. Empty the suppliers table:

    TRUNCATE TABLE suppliers;
  3. Run the COPY FROM command. In the WITH clause, set CSV as the data format and specify that the file contains a header row (HEADER):

    COPY suppliers FROM '/tmp/suppliers.csv'
    WITH (FORMAT CSV, HEADER);
  4. Query the suppliers table:

    SELECT * FROM suppliers;

    The output should look as follows:

     id |     name      |   city   | country
    ----+---------------+----------+---------
      1 | Acme Inc.     | New York | USA
      2 | Comma, Inc.   | London   | GB
      3 | "Quotes" Ltd. | Paris    | France
      4 |   Space  GmbH | Berlin   | Germany

Character encoding

Character encoding systems pair each character from a character set with something else, such as a sequence of numbers or octets, to facilitate data transmission and storage. Greengage DB supports a variety of character sets. You can define the server-side character set during database initialization; UTF-8 is used by default.

Clients can use all supported character sets, but not all of them are supported for use within the server as a server-side encoding. When loading or inserting data into Greengage DB, the data is converted from the specified client encoding into the server encoding. When sending data back to the client, the data is converted from the server character encoding into the specified client encoding.

Data files must use the character encoding recognized by Greengage DB; otherwise, they encounter errors when loading into Greengage DB.

TIP

On Windows platforms, run the dos2unix system command on data files to remove any Windows-only characters before loading into Greengage DB.

NOTE

If you change the ENCODING value in an existing gpload control file, you must manually drop any external tables that were created using the previous ENCODING configuration. gpload does not drop and recreate external tables to use the new ENCODING if REUSE_TABLES is set to true.

The client-side character encoding can be changed for a session by setting the client_encoding server configuration parameter as follows:

SET client_encoding TO 'latin1';

To view the current client-side character encoding, run the following command:

SHOW client_encoding;

The output should look as follows:

client_encoding
-----------------
 UTF8

To revert the client-side character encoding back to the default value, use the RESET command:

RESET client_encoding;