Format external data
When loading or unloading data via the COPY command, an external table, or the gpload utility, you can specify how your data is formatted.
Greengage DB supports the delimited text (TEXT
), comma-separated values (CSV
), and binary formats.
You can use other formats by defining a custom format or a custom protocol.
Define data format
You define how the data is formatted by specifying various data formatting clauses.
Clause | Description | Applicable to |
---|---|---|
HEADER |
Designates whether the data file contains a header row. If using multiple data source files, all off them must have a header row. The default is to assume that the input files do not have a header row |
External table, COPY, gpload |
NEWLINE |
Designates a character used as a newline character, which can be If not specified, the newline character detected at the end of the first line of data is used. See Format rows for details |
External table, COPY, gpload |
DELIMITER |
Designates a single ASCII character to act as a column delimiter. The delimiter must reside between any two data value fields but not at the beginning or the end of a row. You can also specify a non-printable ASCII character or a non-printable Unicode character, such as The default column delimiter is the horizontal TAB character ( See Format columns for details |
External table, COPY, gpload |
NULL |
Designates a string representing a null value, which indicates an unknown piece of data in a column or field. The default string is When using with the See Represent NULL values for details |
External table, COPY, gpload |
ESCAPE |
Designates a single character used as an escape character. The default escape character is a backslash ( See Escape characters for details |
External table, COPY, gpload |
FILL MISSING FIELDS |
Sets missing trailing field values at the end of a line or row to |
External table, COPY |
ENCODING |
Defines the character set encoding of the source data. If not specified, the default client encoding is used. See Character encoding for details |
External table, COPY, gpload |
Clause | Description | Applicable to |
---|---|---|
HEADER |
Designates whether the data file contains a header row. If using multiple data source files, all off them must have a header row. The default is to assume that the input files do not have a header row |
External table, COPY, gpload |
NEWLINE |
Designates a character used as a newline character, which can be If not specified, the newline character detected at the end of the first line of data is used. See Format rows for details |
External table, COPY, gpload |
QUOTE |
Specifies the quotation character, double quote ( |
External table, COPY, gpload |
DELIMITER |
Designates a single ASCII character to act as a column delimiter. The delimiter must reside between any two data value fields but not at the beginning or the end of a row. You can also specify a non-printable ASCII character or a non-printable Unicode character, such as The default column delimiter is the comma character ( See Format columns for details |
External table, COPY, gpload |
NULL |
Designates a string representing a null value, which indicates an unknown piece of data in a column or field. The default string is an empty value with no quotation marks. When using with the See Represent NULL values for details |
External table, COPY, gpload |
FORCE NOT NULL |
Treats values in each specified column as if they were quoted. Since the default null string is an empty unquoted string, this causes missing values to be evaluated as zero-length strings. When using with the |
External table, COPY, gpload |
FORCE_NULL |
Matches the specified columns' values against the null string.
If a match is found, sets the value to |
COPY |
FORCE QUOTE |
Enforces quoting for all non- If an asterisk ( When using with the |
External table, COPY |
ESCAPE |
Designates a single character used as an escape character. The default escape character is a double quote ( See Escape characters for details |
External table, COPY, gpload |
FILL MISSING FIELDS |
Sets missing trailing field values at the end of a line or row to |
External table, COPY |
ENCODING |
Defines the character set encoding of the source data. If not specified, the default client encoding is used. See Character encoding for details |
External table, COPY, gpload |
In the CREATE EXTERNAL TABLE command, COPY command, or the gpload
utility’s control file, use the formatting-related clauses as follows:
CREATE [[READABLE] | WRITABLE] EXTERNAL [WEB] TABLE <table_name>
LOCATION ('<protocol>://<location>) | EXECUTE '<command>'
FORMAT 'TEXT'
[( [HEADER]
[DELIMITER [AS] '<delimiter>' | 'OFF']
[NULL [AS] '<null string>']
[ESCAPE [AS] '<escape>' | 'OFF']
[NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
[FILL MISSING FIELDS] )]
| 'CSV'
[( [HEADER]
[QUOTE [AS] '<quote>']
[DELIMITER [AS] '<delimiter>']
[NULL [AS] '<null string>']
[FORCE NOT NULL <column> [, ...]]
[ESCAPE [AS] '<escape>']
[NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
[FILL MISSING FIELDS] )]
| 'CUSTOM' (Formatter=<<formatter_specifications>>)
COPY <table_name> [(<column_name> [, ...])]
TO | FROM {'<filename>' | PROGRAM '<command>' | {STDIN | STDOUT}}
[ [ WITH ]
(
FORMAT TEXT | CSV | BINARY,
DELIMITER '<delimiter_character>',
NULL '<null string>',
HEADER [ <boolean> ],
QUOTE '<quote_character>',
NEWLINE '<newline_character>',
ESCAPE '<escape_character>',
FORCE_QUOTE { ( <column_name> [, ...] ) \| * },
FORCE_NOT_NULL ( <column_name> [, ...] ),
FORCE_NULL ( <column_name> [, ...] ),
ENCODING '<encoding_name>'
FILL MISSING FIELDS
) ]
...
GPLOAD:
INPUT:
- SOURCE:
LOCAL_HOSTNAME:
- <hostname_or_ip>
PORT: <http_port>
FILE:
- </path/to/input_file>
- COLUMNS:
- <field_name>: <data_type>
- FORMAT: text | csv
- DELIMITER: '<delimiter_character>'
- ESCAPE: '<escape_character>' | 'OFF'
- NEWLINE: 'LF' | 'CR' | 'CRLF'
- NULL_AS: '<null_string>'
- FILL_MISSING_FIELDS: true | false
- FORCE_NOT_NULL: <column_name> [, ...]
- QUOTE: '<csv_quote_character>'
- HEADER: true | false
- ENCODING: <database_encoding>
...
Format rows
Greengage DB expects rows of data to be separated by the LF
character (Line feed, 0x0A
), CR
(Carriage return, 0x0D
), or CR
followed by LF
(CR+LF
, 0x0D 0x0A
).
You can specify the row separator by using the NEWLINE
clause when defining a data format.
If not specified, the first newline character encountered at the first row of data is used.
LF
is the standard newline representation in UNIX, Linux, and macOS; Windows uses CRLF
.
Format columns
Columns of data have to be separated with the delimiter character.
The default column or field delimiter is the horizontal TAB character (0x09
) for text files and the comma character (0x2C
) for CSV files.
You can declare a single-character delimiter using the DELIMITER
clause when defining a data format.
The delimiter character must appear between any two data value fields, not at the beginning or the end of a row.
Example
In this example, the COPY
command is used for copying the suppliers.txt file’s content into the suppliers
table.
The file uses the pipe character (|
) as a delimiter.
-
In the /tmp directory on the master host, create the suppliers.txt file:
#!/bin/bash cat > /tmp/suppliers.txt <<EOF id|name|city|country 1|Sunrise Corp|Los Angeles|USA 2|Blue Sky Co|Toronto|Canada 3|Green Leaf LLC|Berlin|Germany 4|Red Rock Group|Sydney|Australia 5|Golden Gate Solutions|Tokyo|Japan EOF
-
Empty the
suppliers
table:TRUNCATE TABLE suppliers;
-
Run the
COPY FROM
command. In theWITH
clause, setTEXT
as the data format, specify that the file contains a header row (HEADER
), and set the pipe character (|
) as a delimiter (DELIMITER
).COPY suppliers FROM '/tmp/suppliers.txt' WITH (FORMAT TEXT, HEADER, DELIMITER '|');
-
Query the
suppliers
table:SELECT * FROM suppliers;
The output should look as follows:
id | name | city | country ----+-----------------------+-------------+----------- 1 | Sunrise Corp | Los Angeles | USA 2 | Blue Sky Co | Toronto | Canada 3 | Green Leaf LLC | Berlin | Germany 4 | Red Rock Group | Sydney | Australia 5 | Golden Gate Solutions | Tokyo | Japan
Represent NULL values
NULL
represents an unknown piece of data in a field.
You can declare a string to represent null values by using the NULL
clause when defining a data format.
You can use an arbitrary string that should match a NULL
value, for example, an empty string if you do not want to distinguish nulls from empty strings.
The default string is \N
for text files, and an empty value with no quotation marks for CSV files.
Example
In this example, the COPY
command is used for copying the suppliers.txt file’s content into the suppliers
table.
Some data values in the file are missing and represented by the N/A
string.
-
In the /tmp directory on the master host, create the suppliers.txt file:
#!/bin/bash cat > /tmp/suppliers.txt <<EOF id,name,city,country 1,Sunrise Corp,Los Angeles,USA 2,Blue Sky Co,N/A,N/A 3,Green Leaf LLC,Berlin,Germany 4,Red Rock Group,Sydney,N/A 5,Golden Gate Solutions,N/A,Japan EOF
-
Empty the
suppliers
table:TRUNCATE TABLE suppliers;
-
Run the
COPY FROM
command. In theWITH
clause, setTEXT
as the data format, specify that the file contains a header row (HEADER
), and setN/A
as the null-string (NULL
).COPY suppliers FROM '/tmp/suppliers.txt' WITH (FORMAT TEXT, DELIMITER ',', HEADER, NULL 'N/A');
-
Query the
suppliers
table:SELECT * FROM suppliers;
The output should look as follows:
id | name | city | country ----+-----------------------+-------------+--------- 1 | Sunrise Corp | Los Angeles | USA 2 | Blue Sky Co | | 3 | Green Leaf LLC | Berlin | Germany 4 | Red Rock Group | Sydney | 5 | Golden Gate Solutions | | Japan
Escape characters
The characters designated as a delimiter (DELIMITER
) and a newline (NEWLINE
) are reserved in Greengage DB.
If your data contains any of these characters, you must escape them so that they are treated as data and not as a field separator or a new row.
If the escape character appears in the data, use it to escape itself.
You can declare an escape character by using the ESCAPE
clause when defining a data format.
Escape characters in text-formatted files
By default, the escape character is a backslash (\
) for text-formatted files.
You can use the escape character to escape octal and hexadecimal sequences.
The escaped value is converted to the equivalent character when loaded into Greengage DB.
For example, to load the ampersand character (&
), use the escape character to escape its equivalent hexadecimal (\x26
) or octal (\046
) representation.
If the input data (for example, web log data) contains many backslash characters, you can deactivate escaping by providing the OFF
value for the ESCAPE
clause, for example:
COPY suppliers FROM '/tmp/suppliers.txt'
WITH (FORMAT TEXT, HEADER, ESCAPE 'OFF');
Example
In this example, the COPY
command is used for copying the suppliers.txt file’s content into the suppliers
table.
The file uses the comma character (,
) as a delimiter.
-
In the /tmp directory on the master host, create the suppliers.txt file. Notice that the comma character that is part of the data is escaped with the default escape character (
\
). The backslash character that is itself part of the data is also escaped:#!/bin/bash cat > /tmp/suppliers.txt <<'EOF' id,name,city,country 1,Acme Inc.,New York,USA 2,Comma\, Inc.,London,GB 3,Backslash \\ Ltd.,Paris,France EOF
-
Empty the
suppliers
table:TRUNCATE TABLE suppliers;
-
Run the
COPY FROM
command. In theWITH
clause, setTEXT
as the data format, specify that the file contains a header row (HEADER
), and set the comma character (,
) as a delimiter (DELIMITER
):COPY suppliers FROM '/tmp/suppliers.txt' WITH (FORMAT TEXT, HEADER, DELIMITER ',');
-
Query the
suppliers
table:SELECT * FROM suppliers;
The output should look as follows:
id | name | city | country ----+------------------+----------+--------- 1 | Acme Inc. | New York | USA 2 | Comma, Inc. | London | GB 3 | Backslash \ Ltd. | Paris | France
Escape characters in CSV-formatted files
By default, the escape character for CSV-formatted files is the same as the QUOTE
value (double quote ("
) by default).
In CSV files, all characters are significant.
If a quoted value is surrounded by white space or any characters other than DELIMITER
, then it includes those characters.
This can cause errors if you import data from a system that pads CSV lines with white space to some fixed width.
In this case, preprocess the CSV file to remove the trailing white space before importing the data.
Example
In this example, the COPY
command is used for copying the suppliers.csv file’s content into the suppliers
table.
-
In the /tmp directory on the master host, create the suppliers.csv file. Notice that the data value containing a comma character is enclosed in double quotes. The double quotes that are part of the data are escaped with the default escape character (
"
) even though the field value itself is enclosed in double quotes. To preserve leading and trailing whitespace characters that are contained in the data, the corresponding field is also enclosed in double quotes:#!/bin/bash cat > /tmp/suppliers.csv <<EOF id,name,city,country 1,Acme Inc.,New York,USA 2,"Comma, Inc.",London,GB 3,"""Quotes"" Ltd.",Paris,France 4," Space GmbH",Berlin,Germany EOF
-
Empty the
suppliers
table:TRUNCATE TABLE suppliers;
-
Run the
COPY FROM
command. In theWITH
clause, setCSV
as the data format and specify that the file contains a header row (HEADER
):COPY suppliers FROM '/tmp/suppliers.csv' WITH (FORMAT CSV, HEADER);
-
Query the
suppliers
table:SELECT * FROM suppliers;
The output should look as follows:
id | name | city | country ----+---------------+----------+--------- 1 | Acme Inc. | New York | USA 2 | Comma, Inc. | London | GB 3 | "Quotes" Ltd. | Paris | France 4 | Space GmbH | Berlin | Germany
Character encoding
Character encoding systems pair each character from a character set with something else, such as a sequence of numbers or octets, to facilitate data transmission and storage. Greengage DB supports a variety of character sets. You can define the server-side character set during database initialization; UTF-8 is used by default.
Clients can use all supported character sets, but not all of them are supported for use within the server as a server-side encoding. When loading or inserting data into Greengage DB, the data is converted from the specified client encoding into the server encoding. When sending data back to the client, the data is converted from the server character encoding into the specified client encoding.
Data files must use the character encoding recognized by Greengage DB; otherwise, they encounter errors when loading into Greengage DB.
On Windows platforms, run the dos2unix
system command on data files to remove any Windows-only characters before loading into Greengage DB.
If you change the ENCODING
value in an existing gpload control file, you must manually drop any external tables that were created using the previous ENCODING
configuration.
gpload
does not drop and recreate external tables to use the new ENCODING
if REUSE_TABLES
is set to true
.
The client-side character encoding can be changed for a session by setting the client_encoding
server configuration parameter as follows:
SET client_encoding TO 'latin1';
To view the current client-side character encoding, run the following command:
SHOW client_encoding;
The output should look as follows:
client_encoding ----------------- UTF8
To revert the client-side character encoding back to the default value, use the RESET
command:
RESET client_encoding;