External tables overview
External tables can be used like regular Greengage DB (based on Greenplum) tables, but their data is stored outside the database. They let you access the data as if it were stored locally and are commonly used for loading data into and unloading it out of a database.
You can use external tables in SQL commands: for example, run SELECT on readable external tables, join them with regular Greengage DB tables via JOIN, or run INSERT or COPY FROM on writable external tables.
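For instance, once a readable external table is defined, it behaves like any other table in a query. The sketch below assumes a readable external table ext_sales and a regular table stores sharing a store_id column (both names are hypothetical):

```sql
-- Join a readable external table with a regular table
SELECT s.store_name, SUM(e.amount) AS total
FROM ext_sales e
JOIN stores s ON s.store_id = e.store_id
GROUP BY s.store_name;
```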
Types of external tables
External tables can be either readable or writable:
- Readable external tables are used for loading data from external data sources and allow only SELECT operations.
- Writable external tables are used for unloading data to external data sources and allow only INSERT operations.
Depending on the data source used, external tables can be either file-based or web-based:
- File-based external tables access static flat files. Such tables are rescannable, which means that the data is static while the query runs.
- Web-based external tables access dynamic data sources, either on a web server with the HTTP protocol or by running commands or scripts. They are not rescannable, which means that the data can change while the query runs. For more details on web tables, see Web tables.
Create an external table
To create an external table definition, use the CREATE EXTERNAL TABLE
command:
CREATE [[READABLE] | WRITABLE] EXTERNAL [WEB] TABLE <table_name>
(<column_name> <data_type> [, ...] | LIKE <other_table>)
LOCATION ('<protocol>://<location>')
| EXECUTE '<command>'
FORMAT 'TEXT' | 'CSV' | 'CUSTOM'
Clause | Description |
---|---|
READABLE \| WRITABLE | Defines the type of the external table, which can be either readable or writable. If the clause is omitted, the table is automatically created as readable |
WEB | Defines the created table as web-based. If the clause is omitted, the table is created as file-based |
LOCATION \| EXECUTE | Specifies the data source: LOCATION provides one or more URIs of the external data files, while EXECUTE provides a command or script to run (web tables only) |
FORMAT | Defines the data format, which can be TEXT, CSV, or CUSTOM. Learn more about formatting source data in Format external data |
Argument | Description |
---|---|
<table_name> | The name of the table to create |
<column_name> | The name of a column to create. Note that unlike regular tables, external tables do not have column constraints or default values |
<data_type> | The data type of the column |
LIKE <other_table> | Specifies a table from which the external table copies all column names, data types, and distribution policy. If the original table specifies any column constraints or default column values, those are not copied over to the external table definition |
<protocol> | The access protocol used for connecting to a data source |
<location> | The location of a data file |
<command> | The command or script to execute |
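For example, the LIKE clause lets you reuse an existing table's column definitions (a sketch, assuming a regular table named customers already exists and a gpfdist server runs on filehost):

```sql
-- Copy column names and data types from customers;
-- column constraints and defaults, if any, are not carried over
CREATE EXTERNAL TABLE customers_ext (LIKE customers)
LOCATION ('gpfdist://filehost:8081/customers.csv')
FORMAT 'CSV';
```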
Supported protocols
To access external table data sources, use one of the supported protocols. Note that you cannot mix protocols within a single CREATE EXTERNAL TABLE statement.
Internal protocols are implemented internally in Greengage DB and don’t require additional configuration. They are always present and cannot be unregistered.
- FILE directly accesses data files on segment hosts.
- GPFDIST/GPFDISTS accesses files served on file hosts via the gpfdist utility.
- HTTP accesses files on web servers.
Opt-in protocols let you connect to a data source that cannot be accessed with the FILE, GPFDIST, or HTTP protocols. You must enable and configure them in Greengage DB before they can be used.
- PXF accesses object storage systems, external Hadoop systems (HDFS, Hive, HBase), and SQL databases using the Greenplum Platform Extension Framework (PXF).
- S3 accesses files in Amazon S3-compatible buckets.
- CUSTOM accesses data via a user-defined mechanism, for example, by working with custom data formats or by connecting through a custom protocol.
FILE
The FILE protocol lets you read from one or more files located on each Greengage DB segment host. Writing to files is not supported; therefore, tables created with the FILE protocol can only be readable.
To create an external table using the FILE protocol, provide the LOCATION and FORMAT clauses inside the CREATE EXTERNAL TABLE command as follows:
CREATE EXTERNAL TABLE <table_name>
(<column_name> <data_type> [, ...] | LIKE <other_table>)
LOCATION ('file://<segment_host1>[:<port>]/<path>/<file>',
'file://<segment_host2>[:<port>]/<path>/<file>')
FORMAT 'TEXT' | 'CSV' | 'CUSTOM';
In the LOCATION clause, provide a URI specifying the file location: the segment host name (<segment_host>), port (<port>), and path to the file (<path>/<file>).
To denote multiple files to read, use wildcard characters such as * or [].
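For instance, a single URI with a wildcard can match several files on a segment host (a sketch with hypothetical hosts and paths):

```sql
-- Read every sales_*.csv file found on each of two segment hosts
CREATE EXTERNAL TABLE daily_sales (id INTEGER, amount NUMERIC)
LOCATION ('file://sdw1/data/sales_*.csv',
          'file://sdw2/data/sales_*.csv')
FORMAT 'CSV';
```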
When reading data, .gz, .bz2, and .zst files are uncompressed automatically.
- Each referenced file must reside on a segment host in a location accessible by gpadmin. To make sure the URIs match the registered segment host names and ports, query the gp_segment_configuration system table:

  SELECT * FROM gp_segment_configuration;

  The output should look as follows:

   dbid | content | role | preferred_role | mode | status | port  | hostname | address | datadir
  ------+---------+------+----------------+------+--------+-------+----------+---------+-----------------------
      1 |      -1 | p    | p              | n    | u      |  5432 | mdw      | mdw     | /data1/master/gpseg-1
      2 |       0 | p    | p              | s    | u      | 10000 | sdw1     | sdw1    | /data1/primary/gpseg0
      6 |       0 | m    | m              | s    | u      | 10500 | sdw2     | sdw2    | /data1/mirror/gpseg0
      3 |       1 | p    | p              | s    | u      | 10001 | sdw1     | sdw1    | /data1/primary/gpseg1
      7 |       1 | m    | m              | s    | u      | 10501 | sdw2     | sdw2    | /data1/mirror/gpseg1
      4 |       2 | p    | p              | s    | u      | 10000 | sdw2     | sdw2    | /data1/primary/gpseg2
      8 |       2 | m    | m              | s    | u      | 10500 | sdw1     | sdw1    | /data1/mirror/gpseg2
      5 |       3 | p    | p              | s    | u      | 10001 | sdw2     | sdw2    | /data1/primary/gpseg3
      9 |       3 | m    | m              | s    | u      | 10501 | sdw1     | sdw1    | /data1/mirror/gpseg3
     10 |      -1 | m    | m              | s    | u      |  5432 | smdw     | smdw    | /data1/master/gpseg-1
- You can provide multiple URIs, with their number corresponding to the number of segment instances that work in parallel to access the external table. The number of external files per segment host cannot exceed the number of primary segment instances on that host. To view how many external files are permitted, query the pg_max_external_files system view:

  SELECT * FROM pg_max_external_files;

  As the output shows, in this example cluster two external files are permitted per segment host:

   hostname | maxfiles
  ----------+----------
   sdw1     |        2
   sdw2     |        2
TIP: For maximum parallelism when loading data, divide the data into as many equally sized files as you have primary segments. This ensures that all segments participate in the load.
In the FORMAT clause, define how your data is formatted.
Learn more in Format external data.
This example uses two CSV files, customers-1.csv and customers-2.csv, having the following content:
id,first_name,last_name,email,phone_number,address
1,John,Doe,john.doe@example.com,123-456-7890,123 Elm Street
2,Jane,Smith,jane.smith@example.com,987-654-3210,456 Oak Street

id,first_name,last_name,email,phone_number,address
3,Bob,Brown,bob.brown@example.com,867-5309-1234,789 Pine Street
4,Rob,Stuart,rob.stuart@example.com,817-4309-6731,119 Willow Street
The files reside in the /tmp directory on the sdw1 and sdw2 segment hosts.
- On the master host, run the CREATE EXTERNAL TABLE command. In the FORMAT clause, set CSV as the data format and specify that the files contain a header row (HEADER):

  CREATE EXTERNAL TABLE customers (
      id INTEGER,
      first_name VARCHAR(50),
      last_name VARCHAR(50),
      email VARCHAR(100),
      phone_number VARCHAR(15),
      address VARCHAR(255)
  )
  LOCATION ('file://sdw1:10000/tmp/customers-1.csv',
            'file://sdw2:10000/tmp/customers-2.csv')
  FORMAT 'CSV' (HEADER);
- Query the created external table:

  SELECT * FROM customers;
The output should look as follows:
 id | first_name | last_name | email                  | phone_number  | address
----+------------+-----------+------------------------+---------------+-------------------
  1 | John       | Doe       | john.doe@example.com   | 123-456-7890  | 123 Elm Street
  2 | Jane       | Smith     | jane.smith@example.com | 987-654-3210  | 456 Oak Street
  3 | Bob        | Brown     | bob.brown@example.com  | 867-5309-1234 | 789 Pine Street
  4 | Rob        | Stuart    | rob.stuart@example.com | 817-4309-6731 | 119 Willow Street
GPFDIST/GPFDISTS
The GPFDIST protocol lets you access external data served by the Greengage DB gpfdist (Greenplum Parallel File Server) utility.
The GPFDISTS protocol additionally lets you use SSL-encrypted communication between Greengage DB and the file server.
See Use gpfdist for more details and usage examples.
PXF
The Greenplum Platform Extension Framework (PXF) protocol lets you access data residing in object storage systems (Azure, Google Cloud Storage, Minio, S3), external Hadoop systems (HDFS, Hive, HBase), and SQL databases. You can also write text, binary, and parquet-format data with the PXF protocol.
The PXF protocol is packaged as a Greengage DB extension. Before using it, you must explicitly initialize and start it, enable it in each database in which you plan to create PXF external tables, and grant permissions on the PXF protocol to the users.
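A typical enablement sequence might look as follows; the exact steps depend on your PXF installation, and the role name below is hypothetical:

```sql
-- Enable the PXF extension in the current database (run as a superuser)
CREATE EXTENSION pxf;

-- Let a non-superuser role read from and write to PXF external tables
GRANT SELECT ON PROTOCOL pxf TO analyst;
GRANT INSERT ON PROTOCOL pxf TO analyst;
```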
S3
The S3 protocol lets you access files located in S3 buckets. You can define readable or writable external tables.
Before using the S3 protocol, you must register it in Greengage DB and create a configuration file.
- Connect to the database in which you want to use the S3 protocol:

  $ psql <database>
- Create the read and write functions for the S3 protocol library:

  CREATE OR REPLACE FUNCTION write_to_s3() RETURNS integer
  AS '$libdir/gps3ext.so', 's3_export'
  LANGUAGE C STABLE;

  CREATE OR REPLACE FUNCTION read_from_s3() RETURNS integer
  AS '$libdir/gps3ext.so', 's3_import'
  LANGUAGE C STABLE;
- Declare the S3 protocol and specify the created read and write functions.
  - To allow only the Greengage DB superuser (gpadmin) to access the protocol:

    CREATE PROTOCOL s3 (writefunc = write_to_s3, readfunc = read_from_s3);

  - To allow non-superusers to use the S3 protocol, create it as a trusted protocol and grant access to the users by running the GRANT command. For more details on access management in Greengage DB, see Roles and privileges.

    CREATE TRUSTED PROTOCOL s3 (writefunc = write_to_s3, readfunc = read_from_s3);
    GRANT ALL ON PROTOCOL s3 TO user1, user2;

- Run the gpcheckcloud utility to generate a template configuration file:

  $ gpcheckcloud -t > ./s3.conf
- If necessary, edit the generated file:

  $ vi s3.conf

  Provide the necessary values:

  [default]
  secret = "<secret_access_key>"
  accessid = "<access_key_id>"
  threadnum = 4
  chunksize = 67108864
  low_speed_limit = 10240
  low_speed_time = 60
  encryption = true
  version = 2
  proxy = ""
  autocompress = true
  verifycert = true
  server_side_encryption = ""
  # gpcheckcloud config
  gpcheckcloud_newline = "\n"
  IMPORTANT: Be aware that the expected values might vary depending on your service provider's requirements. For example, if version is set to 1, the LOCATION clause of the CREATE EXTERNAL TABLE command supports Amazon S3 URLs and does not contain the region parameter. If version is set to 2, the LOCATION clause supports S3-compatible services and must include the region parameter.

- Check your connection by using the gpcheckcloud utility. Notice that in this case, since version is set to 2 in the configuration file, the region parameter is provided explicitly:

  $ gpcheckcloud -c "s3://example.company.com/<bucket_name>/[<S3_prefix>] \
      config=/path/to/s3.conf \
      region=<S3_data_source_region>"
The output should look as follows:
  ...
  File: data//info.txt, Size: 378
  File: data//orders.csv, Size: 1080
  File: data//test.txt, Size: 550
  Your configuration works well.
- Store the s3.conf configuration file in the desired location.
By default, the file is read from the s3 directory inside the data directory of each segment instance (for example, /data1/primary/gpseg0/s3). You can simplify the configuration by using the same file system location on each segment host (for example, /tmp/s3). Alternatively, you can run an HTTP/HTTPS server to serve up the configuration file to all segment hosts.
To create an external table using the S3 protocol, provide the LOCATION and FORMAT clauses inside the CREATE EXTERNAL TABLE command as follows:
CREATE EXTERNAL TABLE <table_name>
(<column_name> <data_type> [, ...] | LIKE <other_table>)
LOCATION ('s3://<S3_endpoint>[:<port>]/<bucket_name>/[<S3_prefix>]
[region=<S3_region>]
[config=<config_file_location> | config_server=<url>]
[section=<section_name>]')
FORMAT 'TEXT' | 'CSV' | 'CUSTOM';
In the LOCATION clause, provide the S3 endpoint (<S3_endpoint>), port (<port>), and the bucket name (<bucket_name>).
Each Greengage DB segment host must have access to the specified S3 location.
- Use <S3_prefix> to select files for readable tables or as a file name prefix for files created via writable tables.
- Provide the values for your S3 region (<S3_region>), configuration file location (<config_file_location>) or configuration web server URL (<url>), and the configuration file section used (<section_name>) according to your S3 configuration.
When reading data, the S3 protocol automatically uncompresses the .gz and .deflate files.
When writing data, .gz is appended to the file name if autocompress is not specified or is explicitly set to true in the S3 configuration file.
In the FORMAT clause, define how your data is formatted. The chosen format (TEXT or CSV) defines the extension (.txt or .csv) of the created file.
Learn more in Format external data.
This example uses two files, orders-1.csv and orders-2.csv, residing in the customers bucket on the storage.example.com S3 host and having the following content:
id,name,price
1,Laptop,999.99
2,Smartphone,499.99
3,Tablet,299.99

id,name,price
4,Monitor,599.99
5,Keyboard,99.99
- On the master host, run the CREATE EXTERNAL TABLE command. In the LOCATION clause, instead of providing a file name, use the orders file prefix, which will select both files. In the FORMAT clause, set CSV as the data format and specify that the files contain a header row (HEADER):

  CREATE EXTERNAL TABLE orders (
      id INTEGER,
      name VARCHAR,
      price NUMERIC
  )
  LOCATION ('s3://storage.example.com/customers/orders region=ru-central1 config=/home/gpadmin/s3.conf')
  FORMAT 'CSV' (HEADER);
- Query the created external table:

  SELECT * FROM orders;
The output should look as follows:
 id | name       | price
----+------------+--------
  1 | Laptop     | 999.99
  2 | Smartphone | 499.99
  3 | Tablet     | 299.99
  4 | Monitor    | 599.99
  5 | Keyboard   |  99.99
This example creates the orders writable table, whose data is unloaded to a CSV file. The file will be created automatically in the customers bucket on the storage.example.com S3 host.
- On the master host, run the CREATE WRITABLE EXTERNAL TABLE command. In the LOCATION clause, instead of providing a file name, use the orders file prefix, which will be prepended to the created file's name. In the FORMAT clause, set CSV as the data format:

  CREATE WRITABLE EXTERNAL TABLE orders (
      id INTEGER,
      name VARCHAR,
      price NUMERIC
  )
  LOCATION ('s3://storage.example.com/customers/orders region=ru-central1 config=/home/gpadmin/s3.conf')
  FORMAT 'CSV';
- Populate the created table with sample data:

  INSERT INTO orders (id, name, price)
  VALUES
      (1,'Laptop',999.99),
      (2,'Smartphone',499.99),
      (3,'Tablet',299.99);
- Uncompress and review the orders1d6c93761.csv.gz file created in the S3 bucket:

  $ gzip -dk orders1d6c93761.csv.gz && cat orders1d6c93761.csv
The file’s content should look as follows:
1,Laptop,999.99
2,Smartphone,499.99
3,Tablet,299.99
CUSTOM
A CUSTOM protocol lets you connect Greengage DB to a data source that cannot be accessed with other protocols. For example, you could directly stream data from Greengage DB to another system without the need for an intermediate utility such as gpfdist.
To create a custom protocol, you need to implement a set of C functions with specified interfaces, declare these functions in Greengage DB, enable the protocol in the database, and grant users permissions on the created protocol.
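The declaration step mirrors the S3 protocol registration shown above. A sketch with hypothetical function, library, and protocol names:

```sql
-- Declare the C functions implemented in a custom shared library
CREATE FUNCTION myprot_read() RETURNS integer
AS '$libdir/myprot.so', 'myprot_import' LANGUAGE C STABLE;
CREATE FUNCTION myprot_write() RETURNS integer
AS '$libdir/myprot.so', 'myprot_export' LANGUAGE C STABLE;

-- Register the protocol as trusted and grant access to users
CREATE TRUSTED PROTOCOL myprot (readfunc = myprot_read, writefunc = myprot_write);
GRANT ALL ON PROTOCOL myprot TO user1;
```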
Web tables
External web tables allow Greengage DB to treat dynamic data sources like regular database tables. Because web table data can change as a query runs, such tables are not rescannable.
- URL-based web tables access data from a web server over the HTTP protocol. Such tables can only be readable.
- Command-based web tables receive data by running commands and scripts or send data to operating system (OS) commands and scripts. Such tables can be readable or writable.
To create a web table definition, use the CREATE EXTERNAL WEB TABLE command. Note that you cannot mix command-based and URL-based definitions in a single CREATE EXTERNAL WEB TABLE statement.
URL-based tables
URL-based web tables access data from a web server over the HTTP protocol.
To create an external table using the HTTP protocol, provide the LOCATION and FORMAT clauses inside the CREATE EXTERNAL WEB TABLE command as follows:
CREATE EXTERNAL WEB TABLE <table_name>
(<column_name> <data_type> [, ...] | LIKE <other_table>)
LOCATION ('http://<webhost>[:<port>]/<path>/<file>')
FORMAT 'TEXT' | 'CSV' | 'CUSTOM';
In the LOCATION clause, provide the URI specifying the target file location: the web server name (<webhost>), port (<port>), and the path to the file (<path>/<file>).
The specified web server must be accessible by all Greengage DB segment hosts.
You can provide multiple URIs, with their number corresponding to the number of segment instances that work in parallel to access the web server.
To denote multiple files to read, use wildcard characters such as * or [].
In the FORMAT clause, define how your data is formatted.
Learn more in Format external data.
This example uses the customers.csv file served from the web.storage.com web server and having the following content:
id,first_name,last_name,email
1,Bob,Brown,bob.brown@example.com
2,Rob,Stuart,rob.stuart@example.com
- On the master host, run the CREATE EXTERNAL WEB TABLE command. In the FORMAT clause, set CSV as the data format and specify that the file contains a header row (HEADER):

  CREATE EXTERNAL WEB TABLE customers (
      id INTEGER,
      first_name VARCHAR(50),
      last_name VARCHAR(50),
      email VARCHAR(100)
  )
  LOCATION ('http://web.storage.com:8000/customers.csv')
  FORMAT 'CSV' (HEADER);
- Query the created external table:

  SELECT * FROM customers;
The output should look as follows:

 id | first_name | last_name | email
----+------------+-----------+------------------------
  1 | Bob        | Brown     | bob.brown@example.com
  2 | Rob        | Stuart    | rob.stuart@example.com
Command-based tables
Command-based external web tables read or write data by running an OS command or a script on segment instances.
To create a command-based external table, provide the EXECUTE and FORMAT clauses inside the CREATE EXTERNAL WEB TABLE command as follows:
CREATE [[READABLE] | WRITABLE] EXTERNAL WEB TABLE <table_name>
(<column_name> <data_type> [, ...] | LIKE <other_table>)
EXECUTE '<command>'
FORMAT 'TEXT' | 'CSV' | 'CUSTOM';
In the EXECUTE clause, specify the OS command or the location of the script to run (<command>). Scripts must be executable by the gpadmin user and reside in the same location on the master or segment hosts.
By default, commands and scripts run on each primary segment instance on all segment hosts.
Using the optional ON clause, you can specify which segment instances will run it.
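For example, the ON clause can restrict where the command runs (a sketch; see the CREATE EXTERNAL TABLE reference for the full ON syntax):

```sql
-- Run the command on the master instance only
CREATE EXTERNAL WEB TABLE master_uptime (line TEXT)
EXECUTE 'uptime' ON MASTER
FORMAT 'TEXT';
```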
In the FORMAT clause, define how your data is formatted.
Learn more in Format external data.
When specifying a command, you can use environment variables such as GP_SEGMENT_ID to identify requests made by a specific segment instance. The specified command or script is run from the database and cannot access environment variables from .bashrc or .profile. You must set environment variables in the EXECUTE clause, for example:
CREATE [[READABLE] | WRITABLE] EXTERNAL WEB TABLE <table_name> ( ... )
EXECUTE 'PATH=/home/gpadmin/programs; export PATH; script.sh'
FORMAT 'TEXT';
This example uses the loader.sh script, which outputs a random number and has the following content:
#!/bin/bash
# Output random number
echo $((1 + $RANDOM % 10))
The script is located in the /home/gpadmin/programs directory on each of the two primary segment instances on two segment hosts.
- On the master host, run the CREATE EXTERNAL WEB TABLE command pointing to loader.sh. In the FORMAT clause, set TEXT as the data format:

  CREATE EXTERNAL WEB TABLE log (number INTEGER)
  EXECUTE 'PATH=/home/gpadmin/programs; export PATH; loader.sh'
  FORMAT 'TEXT';
- Query the created external table:

  SELECT * FROM log;
The output should look as follows:
 number
--------
      3
      6
      6
     10
This example uses the cat command, which unloads the data to output_*.csv files stored in the /tmp directory. The command will run on two primary segment instances on two segment hosts.
- On the master host, run the CREATE WRITABLE EXTERNAL WEB TABLE command. In the EXECUTE clause, provide the cat command. Use the GP_SEGMENT_ID environment variable to append the ID of the segment running the command to the output file name. In the FORMAT clause, set CSV as the data format:

  CREATE WRITABLE EXTERNAL WEB TABLE orders (
      id INTEGER,
      name VARCHAR,
      price NUMERIC
  )
  EXECUTE 'cat > /tmp/output_$GP_SEGMENT_ID.csv'
  FORMAT 'CSV';
- Populate the created table with sample data:

  INSERT INTO orders (id, name, price)
  VALUES
      (1,'Laptop',999.99),
      (2,'Smartphone',499.99),
      (3,'Tablet',299.99),
      (4,'Monitor',599.99),
      (5,'Keyboard',99.99);
- The cat command runs on each primary segment instance on all segment hosts and creates output_*.csv files in the /tmp directory on each segment host. Review the created files:

  $ cat /tmp/output_*.csv
The files' content should look as follows:
1,Laptop,999.99
2,Smartphone,499.99
3,Tablet,299.99
4,Monitor,599.99
5,Keyboard,99.99