
External tables overview

Anton Monakov

External tables can be used like regular Greengage DB (based on Greenplum) tables, but their data is stored outside the database. They let you access the data as if it were stored locally and are commonly used for loading data into and unloading it out of a database.

You can use external tables in SQL commands, for example, run SELECT on readable external tables, join them with regular Greengage DB tables via JOIN, or run INSERT or COPY FROM on writable external tables.
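For instance, a readable external table can be queried and joined like any other table, and a writable one accepts INSERT. A minimal sketch (the table names ext_sales, stores, sales, and ext_sales_out are hypothetical):

```
-- Read from a readable external table and join it with a regular table
SELECT s.store_name, e.amount
FROM ext_sales e
JOIN stores s ON s.store_id = e.store_id;

-- Unload query results through a writable external table
INSERT INTO ext_sales_out
SELECT store_id, amount FROM sales WHERE amount > 1000;
```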

Prerequisites

To try out the practical examples listed in this topic, connect to the Greengage DB master host as gpadmin using psql as described in Connect to Greengage DB via psql and create a test database as described in Create a new database.

Types of external tables

External tables can be either readable or writable:

  • Readable external tables are used for loading data from external data sources and allow only SELECT operations.

  • Writable external tables are used for unloading data to external data sources and allow only INSERT operations.

Depending on the data source used, external tables can be either file-based or web-based:

  • File-based external tables access static flat files. Such tables are rescannable, which means that the data is static while the query runs.

  • Web-based external tables access dynamic data sources, either on a web server with the HTTP protocol or by running commands or scripts. They are not rescannable, which means that the data can change while the query runs. For more details on web tables, see Web tables.

Create an external table

To create an external table definition, use the CREATE EXTERNAL TABLE command:

CREATE [[READABLE] | WRITABLE] EXTERNAL [WEB] TABLE <table_name>
    (<column_name> <data_type> [, ...] | LIKE <other_table>)
    LOCATION ('<protocol>://<location>' [, ...])
    | EXECUTE '<command>'
    FORMAT 'TEXT' | 'CSV' | 'CUSTOM';
Main clauses
Clause Description

READABLE | WRITABLE

Defines the type of the external table, which can be either readable or writable. If the clause is omitted, the table is automatically created as readable

WEB

Defines the created table as web-based. If the clause is omitted, the table is created as file-based

LOCATION | EXECUTE

LOCATION defines one of the supported access protocols and the actual location of data files.

EXECUTE can be provided for readable external web tables or writable external tables. It specifies an operating system (OS) command or a script that runs on segment instances

FORMAT

Defines the data format, which can be TEXT, CSV, or CUSTOM.

Learn more about formatting source data in Format external data

Arguments
Argument Description

<table_name>

The name of the table to create

<column_name>

The name of a column to create. Note that unlike regular tables, external tables do not have column constraints or default values

<data_type>

The data type of the column

LIKE <other_table>

Specifies a table from which the external table copies all column names, data types, and the distribution policy. If the original table specifies any column constraints or default column values, those are not copied over to the external table definition

<protocol>

The access protocol used for connecting to a data source

<location>

The location of a data file

<command>

The command or script to execute

Supported protocols

To access external table data sources, use one of the supported protocols. Note that you cannot mix protocols in a single CREATE EXTERNAL TABLE statement.

Internal protocols are implemented internally in Greengage DB and don’t require additional configuration. They are always present and cannot be unregistered.

  • FILE directly accesses data files on segment hosts.

  • GPFDIST/GPFDISTS accesses files served on file hosts via the gpfdist utility.

  • HTTP accesses files on web servers.

Opt-in protocols let you connect to a data source that cannot be accessed with the FILE, GPFDIST, or HTTP protocols. You must enable and configure them in Greengage DB before they can be used.

  • PXF accesses object storage systems, external Hadoop systems (HDFS, Hive, HBase), and SQL databases using the Greenplum Platform Extension Framework (PXF).

  • S3 accesses files in Amazon S3-compatible buckets.

  • CUSTOM accesses data via a user-defined mechanism, for example, by working with custom data formats or by connecting through a custom protocol.

FILE

The FILE protocol lets you read from one or more files located on each Greengage DB segment host. Writing to files is not supported; therefore, tables created with the FILE protocol can only be readable.

To create an external table using the FILE protocol, provide the LOCATION and FORMAT clauses inside the CREATE EXTERNAL TABLE command as follows:

CREATE EXTERNAL TABLE <table_name>
    (<column_name> <data_type> [, ...] | LIKE <other_table>)
    LOCATION ('file://<segment_host1>[:<port>]/<path>/<file>',
              'file://<segment_host2>[:<port>]/<path>/<file>')
    FORMAT 'TEXT' | 'CSV' | 'CUSTOM';

In the LOCATION clause, provide a URI specifying the file location: the segment host name (<segment_host>), port (<port>), and path to the file (<path>/<file>). To denote multiple files to read, use wildcard characters such as * or []. When reading data, .gz, .bz2, and .zst files are uncompressed automatically.

  • Each referred file must reside on a segment host in a location accessible by gpadmin. To make sure the URIs match the registered segment host names and ports, query the gp_segment_configuration system table.

    SELECT * FROM gp_segment_configuration;

    The output should look as follows:

     dbid | content | role | preferred_role | mode | status | port  | hostname | address |        datadir
    ------+---------+------+----------------+------+--------+-------+----------+---------+-----------------------
        1 |      -1 | p    | p              | n    | u      |  5432 | mdw      | mdw     | /data1/master/gpseg-1
        2 |       0 | p    | p              | s    | u      | 10000 | sdw1     | sdw1    | /data1/primary/gpseg0
        6 |       0 | m    | m              | s    | u      | 10500 | sdw2     | sdw2    | /data1/mirror/gpseg0
        3 |       1 | p    | p              | s    | u      | 10001 | sdw1     | sdw1    | /data1/primary/gpseg1
        7 |       1 | m    | m              | s    | u      | 10501 | sdw2     | sdw2    | /data1/mirror/gpseg1
        4 |       2 | p    | p              | s    | u      | 10000 | sdw2     | sdw2    | /data1/primary/gpseg2
        8 |       2 | m    | m              | s    | u      | 10500 | sdw1     | sdw1    | /data1/mirror/gpseg2
        5 |       3 | p    | p              | s    | u      | 10001 | sdw2     | sdw2    | /data1/primary/gpseg3
        9 |       3 | m    | m              | s    | u      | 10501 | sdw1     | sdw1    | /data1/mirror/gpseg3
       10 |      -1 | m    | m              | s    | u      |  5432 | smdw     | smdw    | /data1/master/gpseg-1
  • You can provide multiple URIs, with their number corresponding to the number of segment instances that work in parallel to access the external table. The number of external files per segment host cannot exceed the number of primary segment instances on that host. To view how many external files are permitted, query the pg_max_external_files system view.

    SELECT * FROM pg_max_external_files;

    As the output shows, in this example cluster two external files are permitted per segment host:

     hostname | maxfiles
    ----------+----------
     sdw1     |        2
     sdw2     |        2
    TIP

    For maximum parallelism when loading data, divide the data into as many equally sized files as you have primary segments. This ensures that all segments participate in the load.
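As a sketch of this tip, a large data file can be pre-split into equally sized, line-aligned pieces with the standard split utility before placing one piece on each segment host (the file names and the four-primary count below are illustrative):

```shell
# Clean up any leftovers from a previous run
rm -f /tmp/orders_all.csv /tmp/orders-*.csv

# Generate a sample data file with 1000 rows (illustrative data)
seq 1 1000 | awk '{print $1",item-"$1}' > /tmp/orders_all.csv

# Split it into 4 roughly equal, line-aligned pieces -- one per
# primary segment in a hypothetical cluster with 4 primaries
split -n l/4 -d --additional-suffix=.csv /tmp/orders_all.csv /tmp/orders-

# Verify the result: four pieces, still 1000 rows in total
ls /tmp/orders-*.csv
cat /tmp/orders-*.csv | wc -l
```

The pieces can then be copied to the segment hosts and referenced in the LOCATION clause with a wildcard such as file://sdw1/tmp/orders-*.csv.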

In the FORMAT clause, define how your data is formatted. Learn more in Format external data.

This example uses two CSV files, customers-1.csv and customers-2.csv, having the following content:

customers-1.csv:

id,first_name,last_name,email,phone_number,address
1,John,Doe,john.doe@example.com,123-456-7890,123 Elm Street
2,Jane,Smith,jane.smith@example.com,987-654-3210,456 Oak Street

customers-2.csv:

id,first_name,last_name,email,phone_number,address
3,Bob,Brown,bob.brown@example.com,867-5309-1234,789 Pine Street
4,Rob,Stuart,rob.stuart@example.com,817-4309-6731,119 Willow Street

The files reside in the /tmp directory on the sdw1 and sdw2 segment hosts.

  1. On the master host, run the CREATE EXTERNAL TABLE command. In the FORMAT clause, set CSV as the data format and specify that the files contain a header row (HEADER):

    CREATE EXTERNAL TABLE customers (
            id INTEGER,
            first_name VARCHAR(50),
            last_name VARCHAR(50),
            email VARCHAR(100),
            phone_number VARCHAR(15),
            address VARCHAR(255)
        )
        LOCATION ('file://sdw1:10000/tmp/customers-1.csv',
                'file://sdw2:10000/tmp/customers-2.csv'
        )
        FORMAT 'CSV' (HEADER);
  2. Query the created external table:

    SELECT * FROM customers;

    The output should look as follows:

     id | first_name | last_name |         email          | phone_number  |     address
    ----+------------+-----------+------------------------+---------------+------------------
      1 | John       | Doe       | john.doe@example.com   | 123-456-7890  | 123 Elm Street
      2 | Jane       | Smith     | jane.smith@example.com | 987-654-3210  | 456 Oak Street
      3 | Bob        | Brown     | bob.brown@example.com  | 867-5309-1234 | 789 Pine Street
      4 | Rob        | Stuart    | rob.stuart@example.com | 817-4309-6731 | 119 Willow Street

GPFDIST/GPFDISTS

The GPFDIST protocol lets you access external data served by the Greengage DB gpfdist (Greenplum Parallel File Server) utility. The GPFDISTS protocol additionally lets you use SSL-encrypted communication between Greengage DB and the file server.

See Use gpfdist for more details and usage examples.
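As a minimal sketch (the host name, port, and directory below are assumptions), you start gpfdist on a file host and reference it in the LOCATION clause:

```
-- On the file host, serve the /var/load_files directory on port 8081 first:
--   $ gpfdist -d /var/load_files -p 8081 &
-- Then, on the master host:
CREATE EXTERNAL TABLE ext_expenses (name TEXT, amount NUMERIC)
    LOCATION ('gpfdist://filehost:8081/expenses/*.csv')
    FORMAT 'CSV';
```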

PXF

The Greenplum Platform Extension Framework (PXF) protocol lets you access data residing in object storage systems (Azure, Google Cloud Storage, MinIO, S3), external Hadoop systems (HDFS, Hive, HBase), and SQL databases. You can also write text, binary, and Parquet-format data with the PXF protocol.

The PXF protocol is packaged as a Greengage DB extension. Before using it, you must explicitly initialize and start it, enable it in each database in which you plan to create PXF external tables, and grant permissions on the PXF protocol to the users.
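Once PXF is initialized and started, enabling it in a database typically follows the sketch below (the bucket path, profile, and user name are assumptions for illustration):

```
-- Enable the PXF extension in the current database
CREATE EXTENSION pxf;

-- Allow a non-superuser to create readable PXF external tables
GRANT SELECT ON PROTOCOL pxf TO analyst;

-- Read a CSV file from S3 via PXF (path and profile are illustrative)
CREATE EXTERNAL TABLE pxf_sales (id INTEGER, amount NUMERIC)
    LOCATION ('pxf://my-bucket/sales/?PROFILE=s3:csv&SERVER=default')
    FORMAT 'CSV';
```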

S3

The S3 protocol lets you access files located in S3 buckets. You can define readable or writable external tables.

Before using the S3 protocol, you must register it in Greengage DB and create a configuration file.

  1. Connect to the database in which you want to use the S3 protocol:

    $ psql <database>
  2. Create the read and write functions for the S3 protocol library:

    CREATE OR REPLACE FUNCTION write_to_s3() RETURNS integer AS
    '$libdir/gps3ext.so', 's3_export' LANGUAGE C STABLE;
    CREATE OR REPLACE FUNCTION read_from_s3() RETURNS integer AS
    '$libdir/gps3ext.so', 's3_import' LANGUAGE C STABLE;
  3. Declare the S3 protocol and specify the created read and write functions.

    • To allow only the Greengage DB superuser (gpadmin) to access the protocol:

      CREATE PROTOCOL s3 (writefunc = write_to_s3, readfunc = read_from_s3);
    • To allow non-superusers to use the S3 protocol, create it as a trusted protocol and grant access to the users by running the GRANT command. For more details on access management in Greengage DB, see Roles and privileges.

      CREATE TRUSTED PROTOCOL s3 (writefunc = write_to_s3, readfunc = read_from_s3);
      GRANT ALL ON PROTOCOL s3 TO user1, user2;
Next, create the configuration file:

  1. Run the gpcheckcloud utility to generate a template configuration file:

    $ gpcheckcloud -t > ./s3.conf
  2. If necessary, edit the generated file:

    $ vi s3.conf

    Provide the necessary values:

    [default]
    secret = "<secret_access_key>"
    accessid = "<access_key_id>"
    threadnum = 4
    chunksize = 67108864
    low_speed_limit = 10240
    low_speed_time = 60
    encryption = true
    version = 2
    proxy = ""
    autocompress = true
    verifycert = true
    server_side_encryption = ""
    # gpcheckcloud config
    gpcheckcloud_newline = "\n"
    IMPORTANT

    Be aware that the expected values might vary depending on your service provider’s requirements. For example, if version is set to 1, the LOCATION clause of the CREATE EXTERNAL TABLE command supports Amazon S3 URLs and does not contain the region parameter. If version is set to 2, the LOCATION clause supports S3-compatible services and must include the region parameter.

  3. Check your connection by using the gpcheckcloud utility. Notice that in this case, since version is set to 2 in the configuration file, the region parameter is provided explicitly:

    $ gpcheckcloud -c "s3://example.company.com/<bucket_name>/[<S3_prefix>] \
        config=/path/to/s3.conf \
        region=<S3_data_source_region>"

    The output should look as follows:

    ...
    File: data//info.txt, Size: 378
    File: data//orders.csv, Size: 1080
    File: data//test.txt, Size: 550
    
    Your configuration works well.
  4. Store the s3.conf configuration file in a desired location.

    By default, the file is read from the s3 directory inside the data directory of each segment instance (for example, /data1/primary/gpseg0/s3). You can simplify the configuration by using the same file system location on each segment host (for example, /tmp/s3). Alternatively, you can run an HTTP/HTTPS server to serve up the configuration file to all segment hosts.

To create an external table using the S3 protocol, provide the LOCATION and FORMAT clauses inside the CREATE EXTERNAL TABLE command as follows:

CREATE EXTERNAL TABLE <table_name>
    (<column_name> <data_type> [, ...] | LIKE <other_table>)
    LOCATION ('s3://<S3_endpoint>[:<port>]/<bucket_name>/[<S3_prefix>]
        [region=<S3_region>]
        [config=<config_file_location> | config_server=<url>]
        [section=<section_name>]')
    FORMAT 'TEXT' | 'CSV' | 'CUSTOM';

In the LOCATION clause, provide the S3 endpoint (<S3_endpoint>), port (<port>), and the bucket name (<bucket_name>). Each Greengage DB segment host must have access to the specified S3 location.

  • Use <S3_prefix> to select files for readable tables or as a file name prefix for files created via writable tables.

  • Provide the values for your S3 region (<S3_region>), configuration file location (<config_file_location>) or configuration web server URL (<url>), and the configuration file’s section used (<section_name>) according to your S3 configuration.

When reading data, the S3 protocol automatically uncompresses the .gz and .deflate files. When writing data, .gz is appended to the file name if autocompress is not specified or is explicitly set to true in the S3 configuration file.

In the FORMAT clause, define how your data is formatted. The chosen format (TEXT or CSV) defines the extension (.txt or .csv) of the created file. Learn more in Format external data.

This example uses two files, orders-1.csv and orders-2.csv, residing in the customers bucket on the storage.example.com S3 host and having the following content:

orders-1.csv:

id,name,price
1,Laptop,999.99
2,Smartphone,499.99
3,Tablet,299.99

orders-2.csv:

id,name,price
4,Monitor,599.99
5,Keyboard,99.99
  1. On the master host, run the CREATE EXTERNAL TABLE command. In the LOCATION clause, instead of providing a file name, use the orders file prefix, which will select both files. In the FORMAT clause, set CSV as the data format and specify that the file contains a header row (HEADER):

    CREATE EXTERNAL TABLE orders (
            id INTEGER,
            name VARCHAR,
            price NUMERIC
        )
        LOCATION ('s3://storage.example.com/customers/orders
            region=ru-central1
            config=/home/gpadmin/s3.conf'
        )
        FORMAT 'CSV' (HEADER);
  2. Query the created external table:

    SELECT * FROM orders;

    The output should look as follows:

     id |    name    | price
    ----+------------+--------
      1 | Laptop     | 999.99
      2 | Smartphone | 499.99
      3 | Tablet     | 299.99
      4 | Monitor    | 599.99
      5 | Keyboard   |  99.99

This example creates the orders writable table, whose data is unloaded to a CSV file. The file will be created automatically in the customers bucket on the storage.example.com S3 host.

  1. On the master host, run the CREATE WRITABLE EXTERNAL TABLE command. In the LOCATION clause, instead of providing a file name, use the orders file prefix, which will be prepended to the created file’s name. In the FORMAT clause, set CSV as the data format:

    CREATE WRITABLE EXTERNAL TABLE orders (
            id INTEGER,
            name VARCHAR,
            price NUMERIC
        )
        LOCATION ('s3://storage.example.com/customers/orders
            region=ru-central1
            config=/home/gpadmin/s3.conf'
        )
        FORMAT 'CSV';
  2. Populate the created table with sample data:

    INSERT INTO orders (id, name, price)
    VALUES (1,'Laptop',999.99),
           (2,'Smartphone',499.99),
           (3,'Tablet',299.99);
  3. Uncompress and review the orders1d6c93761.csv.gz file created in the S3 bucket:

    $ gzip -dk orders1d6c93761.csv.gz && cat orders1d6c93761.csv

    The file’s content should look as follows:

    1,Laptop,999.99
    2,Smartphone,499.99
    3,Tablet,299.99

CUSTOM

A CUSTOM protocol lets you connect Greengage DB to a data source that cannot be accessed with other protocols. For example, you could directly stream data from Greengage DB to another system without the need for an intermediate utility such as gpfdist.

To create a custom protocol, you need to implement a set of C functions with specified interfaces, declare these functions in Greengage DB, enable the protocol in the database, and grant users permissions on the created protocol.
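Assuming the C read and write handlers are already compiled into a shared library (the library name myprotocol.so, the function names, and the user below are hypothetical), the declaration side follows the same pattern as the S3 protocol registration:

```
-- Declare the handler functions exported by the custom library
CREATE FUNCTION myprot_read() RETURNS integer
    AS '$libdir/myprotocol.so', 'myprot_import' LANGUAGE C STABLE;
CREATE FUNCTION myprot_write() RETURNS integer
    AS '$libdir/myprotocol.so', 'myprot_export' LANGUAGE C STABLE;

-- Register the protocol as trusted and grant access to users
CREATE TRUSTED PROTOCOL myprot (readfunc = myprot_read, writefunc = myprot_write);
GRANT ALL ON PROTOCOL myprot TO analyst;
```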

Web tables

External web tables allow Greengage DB to treat dynamic data sources like regular database tables. Because web table data can change as a query runs, such tables are not rescannable.

  • URL-based web tables access data from a web server over the HTTP protocol. Such tables can only be readable.

  • Command-based web tables receive data by running commands and scripts or send data to operating system (OS) commands and scripts. Such tables can be readable or writable.

To create a web table definition, use the CREATE EXTERNAL WEB TABLE command. Note that you cannot mix command-based and URL-based definitions in a single CREATE EXTERNAL WEB TABLE statement.

URL-based tables

URL-based web tables access data from a web server over the HTTP protocol.

To create an external table using the HTTP protocol, provide the LOCATION and FORMAT clauses inside the CREATE EXTERNAL WEB TABLE command as follows:

CREATE EXTERNAL WEB TABLE <table_name>
    (<column_name> <data_type> [, ...] | LIKE <other_table>)
    LOCATION ('http://<webhost>[:<port>]/<path>/<file>')
    FORMAT 'TEXT' | 'CSV' | 'CUSTOM';

In the LOCATION clause, provide the URI specifying the target file location: the web server name (<webhost>), port (<port>), and the path to the file (<path>/<file>). The specified web server must be accessible by all Greengage DB segment hosts. You can provide multiple URIs, with their number corresponding to the number of segment instances that work in parallel to access the web server. To denote multiple files to read, use wildcard characters such as * or [].

In the FORMAT clause, define how your data is formatted. Learn more in Format external data.

This example uses the customers.csv file served from the web.storage.com web server and having the following content:

id,first_name,last_name,email
1,Bob,Brown,bob.brown@example.com
2,Rob,Stuart,rob.stuart@example.com
  1. On the master host, run the CREATE EXTERNAL WEB TABLE command. In the FORMAT clause, set CSV as the data format and specify that the file contains a header row (HEADER):

    CREATE EXTERNAL WEB TABLE customers (
            id INTEGER,
            first_name VARCHAR(50),
            last_name VARCHAR(50),
            email VARCHAR(100)
        )
        LOCATION ('http://web.storage.com:8000/customers.csv')
        FORMAT 'CSV' (HEADER);
  2. Query the created external table:

    SELECT * FROM customers;

    The output should look as follows:

     id | first_name | last_name |         email
    ----+------------+-----------+------------------------
      1 | Bob        | Brown     | bob.brown@example.com
      2 | Rob        | Stuart    | rob.stuart@example.com

Command-based tables

Command-based external web tables read or write data by running an OS command or a script on segment instances.

To create a command-based external table, provide the LOCATION and FORMAT clauses inside the CREATE EXTERNAL WEB TABLE command as follows:

CREATE [[READABLE] | WRITABLE] EXTERNAL WEB TABLE <table_name>
    (<column_name> <data_type> [, ...] | LIKE <other_table>)
    EXECUTE '<command>'
    FORMAT 'TEXT' | 'CSV' | 'CUSTOM';

In the EXECUTE clause, specify the OS command or the location of the script to run (<command>). Scripts must be executable by the gpadmin user and reside in the same location on the master or segment hosts. By default, commands and scripts run on each primary segment instance on all segment hosts. Using the optional ON clause, you can specify which segment instances will run it.
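For instance, a hedged sketch of the ON clause (the table name and log path are illustrative) that restricts execution to a single run on the master:

```
CREATE EXTERNAL WEB TABLE master_log (line TEXT)
    EXECUTE 'cat /var/log/app.log'
    ON MASTER
    FORMAT 'TEXT';
```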

In the FORMAT clause, define how your data is formatted. Learn more in Format external data.

When specifying a command, you can use environment variables such as GP_SEGMENT_ID to identify requests made by a specific segment instance.

NOTE

The specified command or script is run from the database and cannot access environment variables from .bashrc or .profile. You must set environment variables in the EXECUTE clause, for example:

CREATE [[READABLE] | WRITABLE] EXTERNAL WEB TABLE <table_name> ( ... )
    EXECUTE 'PATH=/home/gpadmin/programs; export PATH; script.sh'
    FORMAT 'TEXT';

This example uses the loader.sh script, which outputs a random number and has the following content:

#!/bin/bash

# Output random number
echo $((1 + $RANDOM % 10))

The script is located in the /home/gpadmin/programs directory on each of the two segment hosts and runs on every primary segment instance (four in this example cluster).

  1. On the master host, run the CREATE EXTERNAL WEB TABLE command pointing to loader.sh. In the FORMAT clause, set TEXT as the data format:

    CREATE EXTERNAL WEB TABLE log (number INTEGER)
        EXECUTE 'PATH=/home/gpadmin/programs; export PATH; loader.sh'
        FORMAT 'TEXT';
  2. Query the created external table:

    SELECT * FROM log;

    The output should look as follows:

     number
    --------
          3
          6
          6
         10

This example uses the cat command, which unloads the data to output_<segment_id>.csv files stored in the /tmp directory. The command runs on every primary segment instance on the two segment hosts.

  1. On the master host, run the CREATE WRITABLE EXTERNAL WEB TABLE command. In the EXECUTE clause, provide the cat command. Use the GP_SEGMENT_ID environment variable to append the ID of the segment running the command to the output file name. In the FORMAT clause, set CSV as the data format:

    CREATE WRITABLE EXTERNAL WEB TABLE orders (
            id INTEGER,
            name VARCHAR,
            price NUMERIC
        )
        EXECUTE 'cat > /tmp/output_$GP_SEGMENT_ID.csv'
        FORMAT 'CSV';
  2. Populate the created table with sample data:

    INSERT INTO orders (id, name, price)
    VALUES (1,'Laptop',999.99),
           (2,'Smartphone',499.99),
           (3,'Tablet',299.99),
           (4,'Monitor',599.99),
           (5,'Keyboard',99.99);
  3. The cat command runs on each primary segment instance on all segment hosts and creates output_*.csv files in the /tmp directory on each segment host. Review the created files:

    $ cat /tmp/output_*.csv

    The files' content should look as follows:

    1,Laptop,999.99
    2,Smartphone,499.99
    3,Tablet,299.99
    4,Monitor,599.99
    5,Keyboard,99.99