Some of the challenges that the Greengage/Arenadata DB development team faces arise from incident analysis, which is initially handled by our product support team. Although a task may at first seem like a simple bug that can be fixed quickly, we always try to solve the issue in the most comprehensive way, addressing a wider range of potential problems.
Incorrect behavior of the DBMS core in this case fell into the category of errors that were fixed easily and did not involve many complexities, and we solved this problem relatively quickly…
That’s how I would like to start this article, but (spoiler!) the reality was the exact opposite. A couple of times we were close to abandoning our planned approach, taking the path of least resistance and solving the problem in a way similar to how the upstream of the Greenplum project did at that time.
The problem can be illustrated by an example provided by the support team in the ticket.
During Greenplum DBMS operation, orphaned files may appear on segments.
The corresponding case to reproduce:
On the master, open a transaction, create a table, and insert data into it:
testdb=# BEGIN;
BEGIN
testdb=# CREATE TABLE t1(id bigserial, s char(4000)) WITH (appendonly = true) DISTRIBUTED BY (ID);
CREATE TABLE
testdb=# INSERT INTO t1(s) SELECT v::text FROM generate_series(1,100000) v;
INSERT 0 100000
On the segment server, you can see data files of the new table in the base directory:
$ ls /data1/primary/gpseg1/base/541094/ | sort -n | tail
Emulate kernel panic on the segment server:
$ echo c | tee /proc/sysrq-trigger
When you try to commit a transaction, an error occurs:
testdb=# COMMIT;
ERROR: gang was lost due to cluster reconfiguration (cdbgang_async.c:94)
Restore Linux server operation.
Run gprecoverseg.
Run gprecoverseg -r.
Script execution result:
After gprecoverseg execution, the database server consistency is restored, and the t1 table is not found in the system catalog.
Data files of the t1 table remain on the segment server. These files do not correspond to any table in the catalog and are not returned by the search query:
testdb=# SELECT relname, relfilenode FROM pg_class WHERE relfilenode = $1;
The creation of these files is related to failover events on the cluster.
In Arenadata DB 6.x versions, it is impossible to completely exclude the occurrence of such files, and therefore, in reality, users may encounter them with consequences: excessive utilization of data directories (the main data directory and/or tablespaces). In practice, technical support engineers have already dealt with several similar cases when they searched for and deleted orphaned files on Arenadata DB clusters together with users.
It is clear from this description that in case of a backend process crash associated with a transaction that created a new table, table data files may remain on the file system. These files are referred to as "orphaned" in this context.
After a process crash, such files will not be associated with any tables in the database. Depending on the load profile and the frequency of crashes, such files may increase in number over time, wasting valuable storage space. Manually identifying them can be time-consuming, and deleting them carries the risk of deleting something that is needed.
The problem is not specific to Greengage/Greenplum. Those interested can read, for example, this long discussion. It seems to me that PostgreSQL treats the possibility of leaving "garbage" behind after interrupted operations without much concern. In particular, this conclusion can be drawn from the comments in the source code of the function that moves a database from one tablespace to another when executing the ALTER DATABASE SET TABLESPACE statement:
/*
* ALTER DATABASE SET TABLESPACE
*/
static void
movedb(const char *dbname, const char *tblspcname)
{
...
/*
* Force synchronous commit, thus minimizing the window between
* copying the database files and committal of the transaction. If we
* crash before committing, we'll leave an orphaned set of files on
* disk, which is not fatal but not good either.
*/
...
}
Perhaps for an OLTP DBMS, where the set of tables can be fairly static, this problem is not so critical. However, for our major Greengage/Greenplum users, the situation is different. Multiple segments on one server and mass inserts within ETL processes can result in a large number of orphaned files in case of an unexpected shutdown of processes. Such shutdowns can be caused by hardware failures, operational errors (for example, running out of free disk space), or software errors: anything that can lead to processes crashing.
The DBMS core should be able to handle such situations during recovery.
The problem remains relevant in the latest version of PostgreSQL. Let’s reproduce one of the scenarios in PostgreSQL 17.
postgres=# SELECT version();
version
-------------------------------------------------------------------------------------------------------
PostgreSQL 17.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
(1 row)
postgres=# BEGIN;
BEGIN
postgres=*# CREATE TABLE t1(col1 INT); INSERT INTO t1 SELECT generate_series(1,1000000);
CREATE TABLE
INSERT 0 1000000
postgres=*# SELECT pg_relation_filepath('t1');
pg_relation_filepath
----------------------
base/5/32778
(1 row)
postgres=*# \! stat /data1/postgres/base/5/32778
File: /data1/postgres/base/5/32778
Size: 36249600 Blocks: 70800 IO Block: 4096 regular file
Device: 10302h/66306d Inode: 15860721 Links: 1
Access: (0600/-rw-------) Uid: ( 1000/ andrey) Gid: ( 1000/ andrey)
Access: 2025-05-29 11:14:04.671474274 +0300
Modify: 2025-05-29 11:17:29.490127947 +0300
Change: 2025-05-29 11:17:29.490127947 +0300
Birth: 2025-05-29 11:14:04.671474274 +0300
postgres=*# \! kill -9 $(head -1 /data1/postgres/postmaster.pid)
$ pg_ctl -D /data1/postgres/ -l /usr/local/pgsql/postgres.log start
$ psql postgres
postgres=# \dt
Did not find any relations.
postgres=# \! stat /data1/postgres/base/5/32778
File: /data1/postgres/base/5/32778
Size: 36249600 Blocks: 70808 IO Block: 4096 regular file
Device: 10302h/66306d Inode: 15860721 Links: 1
Access: (0600/-rw-------) Uid: ( 1000/ andrey) Gid: ( 1000/ andrey)
Access: 2025-05-29 11:14:04.671474274 +0300
Modify: 2025-05-29 11:18:18.195104666 +0300
Change: 2025-05-29 11:18:18.195104666 +0300
Birth: 2025-05-29 11:14:04.671474274 +0300
To understand the reasons for the occurrence of orphaned files, it is necessary to consider the process of table creation in terms of records saved in the WAL log. You may be aware that PostgreSQL implements a REDO log. Processing the CREATE TABLE heap(col1 INT) statement will result in the following records being inserted into the WAL log on the master (further on, all examples will be discussed within the context of Greengage/Greenplum):
..., tx: 720, lsn: 0/0C000078, ..., desc: file create: base/12812/16384
..., tx: 720, lsn: 0/0C0000B0, ..., desc: insert: rel 1663/12812/12546; tid 2/82
...
..., tx: 720, lsn: 0/0C03D5A0, ..., desc: distributed commit ... gid = 1746706495-0000000005, gxid = 5
..., tx: 720, lsn: 0/0C03D618, ..., desc: distributed forget gid = 1746706495-0000000005, gxid = 5
and on the segment:
..., tx: 715, lsn: 0/0C000078, ..., desc: file create: base/12812/16384
..., tx: 715, lsn: 0/0C0000B0, ..., desc: insert: rel 1663/12812/12546; tid 2/82
...
..., tx: 715, lsn: 0/0C03D438, ..., desc: prepare
..., tx: 0, lsn: 0/0C03D850, ..., desc: commit prepared 715: ... gid = 1746706495-0000000005 gxid = 5
I have omitted a part of the WAL log for the sake of brevity (and also removed a lot of insert records from the output).
For the scenario we are considering, the following WAL records are of particular interest:
the file create record (in both logs);
the distributed commit and distributed forget records (master log);
the prepare and commit prepared records (segment log).
The file create record corresponds to the creation of a table data file. The prepare and commit prepared records correspond to the commit request (vote) and commit (completion) stages of the 2PC protocol from the segment's point of view. On the master, the successful completion of a distributed transaction is indicated by the distributed forget record, which follows the distributed commit record.
In order for an orphaned file to remain on the file system, it is enough to "kill" the backend process that is involved in the transaction (distributed or local).
To illustrate the presence of an orphaned file, we will use the arenadata_toolkit extension and its arenadata_toolkit.__db_files_current_unmapped view. This view displays a list of files from the base directory of the current database (where the extension is installed) that do not correspond to any of the tables in the pg_class system catalog (the relfilenode field). However, in order to get this list for an uncommitted transaction, the view should be queried from a different transaction than the one creating these files. Otherwise, the transaction will receive a snapshot of pg_class where the new files are already associated with the created table and are not technically considered orphaned.
So in the example below, there are two processes, which I have labeled as P1 and P2:
P1:
postgres=# BEGIN;
BEGIN
postgres=# CREATE TABLE orphaned(col1 INT) WITH (appendoptimized=false) DISTRIBUTED BY (col1);
CREATE TABLE
P2:
postgres=# CREATE EXTENSION arenadata_toolkit;
CREATE EXTENSION
postgres=# SELECT content, file FROM arenadata_toolkit.__db_files_current_unmapped;
content | file
---------+--------------------------------------------------------------------------------------------
0 | /data1/primary/gpseg0/base/12812/16393
(1 row)
postgres=# SELECT pid, query FROM gp_dist_random('pg_stat_activity') WHERE query LIKE 'CREATE TABLE orphaned%';
pid | query
-------+---------------------------------------------------------------------------------
17532 | CREATE TABLE orphaned(col1 INT) WITH (appendoptimized=false) DISTRIBUTED BY (col1);
(1 row)
postgres=# \! kill -9 17532
P1:
postgres=# COMMIT;
ERROR: Error on receive from seg0 127.0.0.1:6002 pid=17532: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
postgres=# \dt
No relations found.
P2:
postgres=# SELECT content, file FROM arenadata_toolkit.__db_files_current_unmapped;
content | file
---------+--------------------------------------------------------------------------------------------
0 | /data1/primary/gpseg0/base/12812/16384
(1 row)
As we can see, after the transaction was interrupted by the crash of the pid=17532 backend process on the seg0 segment and the attempt to execute COMMIT failed, an orphaned table file remains on the file system, while information about the table is no longer present in the pg_class system catalog.
We partially discussed the implementation of deleting table files when transactions are completed by COMMIT and ABORT in the Arenadata DB table changes tracking article. The key point is that deletion of table files happens at COMMIT of a transaction if the user requested a table deletion (for example, explicitly via DROP TABLE), and at ABORT when rolling back a transaction that created a table (for example, via CREATE TABLE).
If such a WAL record (commit prepared or abort) is missing, a table file will remain on the file system after the WAL log is replayed. In general, such a file cannot be deleted using standard DBMS tools, and manual deletion carries the risk of deleting something necessary. A trivial example was given above: while a transaction is still executing, its effects are not yet visible to other transactions, and deleting such files would lead to a complete loss of the table data.
The issue is known and has been discussed many times within the PostgreSQL community. For example, this discussion raised the possibility of implementing UNDO logs, which, in theory, could solve the problem. Unfortunately, the discussion has stalled for now, and the question of how hard it would be to integrate UNDO logs into the PostgreSQL core remains unanswered. Hopefully, the community will return to this topic someday.
Greenplum, when it was still an open-source project, went its own way and implemented the gp_check_functions extension. This extension provides users with several functions and views that allow listing orphaned files, moving orphaned files to a user-specified directory, or solving the inverse problem of displaying missing files: files that, according to the data in pg_class, are expected to be in the server's base directory but for some reason are not there. The main disadvantage of this approach, in my opinion, is that it shifts the responsibility for removing "garbage" from the DBMS onto the user.
We also found a technical problem with this extension related to file renaming, which is implemented on top of the rename Linux API function. This function has a limitation: renaming only works within one file system. Otherwise, the function returns the EXDEV error:
EXDEV oldpath and newpath are not on the same mounted filesystem. (Linux permits a filesystem to be mounted at multiple points, but rename() does not work across different mount points, even if the same filesystem is mounted on both.)
Thus, if tablespaces are located on different file systems (which is likely the case in most production clusters that use tablespaces with mount points for disk shelves, etc.), then when a user specifies such a directory as the destination, the call will fail. If the list of orphaned files contains files from different tablespaces, some of the files may be renamed while others may not, as the function fails somewhere in the middle.
Since we also support this extension, we plan to fix this problem by extending the contract with new functions. Unfortunately, for the old contract the problem will remain relevant when tablespaces residing on different file systems are combined. At the moment, there is no mechanism for separating the renaming (moving between directories) of orphaned files by tablespace, but we are working on this.
At the same time, to be objective, the gp_check_functions extension complements our approach at the initial stage. This will become clear from the description of our proposed solution below. For now, suffice it to say that our solution will not delete orphaned files that already exist in the base directory, but it will prevent new ones from being produced in the future.
Also, an undoubted advantage of this extension is the ability to search for missing files. That is an alarm bell in itself: the catalog expects table data files on disk that are not there? Most likely, this is a very bad situation.
Our proposed idea is based on the following three simple theses:
If a transaction that created a table is committed or rolled back, the DBMS core decides what to do with the table data files: in the case of abort, the files are deleted; in the case of commit prepared, the files remain. Thus, the current behavior stays unchanged.
While a transaction is in the TRANSACTION_STATUS_IN_PROGRESS state, it can abort abnormally, leaving behind orphaned table data files. The core itself can treat such files as candidates to check whether they were orphaned by a transaction that did not finish with commit prepared or abort; such files then become candidates for deletion.
The decision to delete a file can be made at recovery startup. This point requires special consideration, and we will discuss it further. This is the main change in the current behavior of the DBMS core (in addition to the implementation of tracking the files to be considered orphaned).
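As a minimal sketch of the decision rule implied by these theses (the function name is invented; TransactionIdDidCommit/TransactionIdDidAbort are the usual transam checks; the extra nuance around prepared transactions appears later in the article):
#include "postgres.h"
#include "access/transam.h"   /* TransactionIdDidCommit(), TransactionIdDidAbort() */

/* Sketch only: should a tracked file be dropped during recovery? */
static bool
ShouldDropTrackedFile(TransactionId xid)
{
    if (TransactionIdDidCommit(xid))
        return false;   /* commit (prepared) was replayed: the table is real */
    if (TransactionIdDidAbort(xid))
        return false;   /* the abort record already removed the file the regular way */

    /* still TRANSACTION_STATUS_IN_PROGRESS: nobody will ever finish it */
    return true;
}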
One of the most important elements of ensuring data consistency in a database is the process of restoring the server after a failure (although this process is also the entry point when starting the server after a correct shutdown). The entry point of this process is the StartupXLOG function. Almost two thousand lines of source code of this function are responsible for performing the following basic tasks:
read the pg_control file (data from this file will be required, in particular, for the next two steps to obtain the server state at the time of shutdown, the checkpoint position, and the timeline);
understand the state of the database at the time it was shut down (whether it was a normal or unexpected server shutdown);
understand the recovery mode (start after a correct shutdown, recovery after a failure, point-in-time recovery, standby start, etc.);
read a checkpoint (the checkpoint record) and identify the corresponding redo point, the position in the WAL log from which to start redoing the WAL records;
read the backup_label file if it exists (in this case, the checkpoint position is determined from this file, not from pg_control);
redo WAL records (if they need to be replayed starting from the redo point);
switch timelines if necessary;
stop redoing WAL records when the target point is reached or all existing records are replayed (end of the WAL log);
promote the mirror if necessary;
process prepared transactions in a special way according to the 2PC protocol (more on this later);
and, upon completion, start the server in a consistent state, if that state has been reached.
For standby (including before its possible promotion), this list of actions is slightly shorter, but that does not change the overall picture.
In a simplified way, this process can be represented as follows.
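The same flow in a rough code-shaped outline (this is not the actual StartupXLOG code; every helper name below is invented and stands for a whole block of the real, roughly two-thousand-line function):
/* Rough outline of the recovery entry point; all helpers are placeholders. */
static void
startup_outline(void)
{
    read_pg_control();                /* server state, checkpoint location, timeline */
    determine_shutdown_state();       /* clean shutdown vs crash */
    determine_recovery_mode();        /* crash recovery, PITR, standby start, ... */

    read_checkpoint_and_redo_point(); /* backup_label, if present, wins over pg_control */

    while (have_wal_record() && !reached_recovery_target())
        redo_next_record();           /* rmgr-specific replay starting from the redo point */

    switch_timeline_if_needed();
    promote_if_requested();
    recover_prepared_transactions();  /* 2PC handling, discussed later */
    mark_database_consistent();       /* start serving if a consistent state was reached */
}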
If we consider possible places in the code where we could embed the deletion of orphaned files, the following option comes to mind.
It looks as follows (the moment of deleting orphaned files is indicated by a red rectangle, the image shows a part of our initial scheme).
This is where we started. Generally, everything worked as expected, but the first flaw we discovered showed the incompleteness of this scheme. For example, what happens when a mirror is restored by the gprecoverseg utility after a crash and starts in its previous role?
P1:
postgres=# BEGIN; CREATE TABLE heap(col1 INT) DISTRIBUTED BY (col1);
BEGIN
CREATE TABLE
P2:
postgres=# SELECT content, file FROM arenadata_toolkit.__db_files_current_unmapped;
content | file
---------+--------------------------------------------------------------------------------------------
0 | /data1/primary/gpseg0/base/12812/16392
(1 row)
postgres=# \! kill -9 407529 408718
P1:
postgres=# COMMIT;
ERROR: Error on receive from seg0 127.0.0.1:6002 pid=407834: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
postgres=# \dt
No relations found.
P2:
postgres=# SELECT content, file FROM arenadata_toolkit.__db_files_current_unmapped;
content | file
---------+------
(0 rows)
postgres=# \! ls /data1/primary/gpseg0/base/12812/16392
ls: cannot access '/data1/primary/gpseg0/base/12812/16392': No such file or directory
postgres=# \! ls /data1/mirror/gpseg0/base/12812/16392
/data1/mirror/gpseg0/base/12812/16392
postgres=# SELECT dbid, content, role, mode, status FROM gp_segment_configuration;
dbid | content | role | mode | status
------+---------+------+------+--------
1 | -1 | p | n | u
4 | -1 | m | s | u
2 | 0 | p | n | u
3 | 0 | m | n | d
(4 rows)
postgres=# \! gprecoverseg
postgres=# SELECT dbid, content, role, mode, status FROM gp_segment_configuration;
dbid | content | role | mode | status
------+---------+------+------+--------
1 | -1 | p | n | u
4 | -1 | m | s | u
2 | 0 | p | s | u
3 | 0 | m | s | u
(4 rows)
postgres=# \! ls /data1/mirror/gpseg0/base/12812/16392
/data1/mirror/gpseg0/base/12812/16392
It can be seen that after the crash of the backend process the primary segment, unlike its mirror, deleted the files of the table that had been created by the failed transaction.
Indeed, according to the algorithm of the StartupXLOG function:
It is determined that the standby is being restored after a crash (after a normal mirror shutdown, its state would be DB_SHUTDOWNED_IN_RECOVERY).
For the gprecoverseg recovery case, the checkpoint is read from the pg_control file, along with the redo point specified there.
Replaying of WAL records starts, but since this is a standby, after this process is completed (the "marker" of its end is the XLOG_CHECKPOINT_SHUTDOWN record), the standby remains in the loop of waiting for and redoing WAL records.
Deletion of orphaned files on the standby does not occur, because the first version of the algorithm assumed deletion just before the startup process ends and returns control to postmaster for further initialization/promotion (or complete shutdown). While continuously redoing records streamed from the primary, which is where a standby spends its time, this never happens.
You might ask: how did the files get deleted on the primary, and why did the startup process run again together with the StartupXLOG function, even though, judging by the comment, this function is supposed to run only once?
It all comes down to how the backend process was crashed: SIGKILL was sent to two processes, the mirror's postmaster and the backend serving the transaction on the primary segment.
The primary log shows that this leads to the termination of all its processes and re-initialization through the startup process. This, in turn, leads to the same StartupXLOG, but now in the "not properly shut down" state, and, according to our implementation, to the deletion of the orphaned files of transaction 718:
"LOG","00000","server process (PID 408718) was terminated by signal 9" "LOG","00000","terminating any other active server processes",,,,,,,0,,"postmaster.c",3812, "LOG","00000","all server processes terminated; reinitializing",,,,,,,0,,"postmaster.c",4369, "LOG","00000","database system was interrupted; last known up at 2025-05-19 18:15:12 MSK",,,,,,,0,,"xlog.c",6628, "LOG","00000","database system was not properly shut down; automatic recovery in progress",,,,,,,0,,"xlog.c",7079, "LOG","00000","redo starts at 0/C1A5350",,,,,,,0,,"xlog.c",7345, "LOG","00000","checkpoint starting: end-of-recovery immediate",,,,,,,0,,"xlog.c",8911, "LOG","00000","Prepare to drop node (1663: 12812: 16392) for xid: 718",,,,,,,0,,"storage_pending_deletes_redo.c",279, "LOG","00000","Pending delete rels were dropped (count: 1; xid: 718).",,,,,,,0,,"storage_pending_deletes_redo.c",331, "LOG","00000","database system is ready",,,,,,,0,,"xlog.c",8171,
A more complex but similar scenario is when the former primary, demoted to a mirror after the failure, starts after being restored by the gprecoverseg utility.
In both cases, the result is the same: orphaned files remain in the base directory, and they remain there until the standby is promoted or shut down normally.
Neither outcome satisfied us, so we extended the scheme to the following version.
As you can see, the difference is in the added block for deleting orphaned files in the WAL record processing loop, when the XLOG_CHECKPOINT_SHUTDOWN or XLOG_END_OF_RECOVERY records are processed. The presence of these records in the WAL log means that there are no active transactions in the system at that moment, and therefore all uncommitted transactions will remain uncommitted. Now, during the execution of gprecoverseg, when the corresponding WAL records are received, orphaned files are checked and deleted.
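A sketch of where this check could sit, assuming it is called for every replayed record after the regular rmgr redo; the record-type constants are the real ones, while DropOrphanedFiles() is a stand-in for our routine that walks the tracked xid:relfilenode pairs:
#include "postgres.h"
#include "access/rmgr.h"          /* RM_XLOG_ID, RmgrId */
#include "catalog/pg_control.h"   /* XLOG_CHECKPOINT_SHUTDOWN, XLOG_END_OF_RECOVERY */

/* Stand-in for the routine that unlinks files of transactions that will never finish. */
extern void DropOrphanedFiles(void);

/* Sketch: called from the redo loop for each replayed record. */
static void
maybe_drop_orphans(RmgrId rmid, uint8 info)
{
    /*
     * Both record types are written only when no transaction is active, so
     * everything still tracked as "in progress" at this point is garbage.
     */
    if (rmid == RM_XLOG_ID &&
        (info == XLOG_CHECKPOINT_SHUTDOWN || info == XLOG_END_OF_RECOVERY))
        DropOrphanedFiles();
}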
As a result, as far as the decision on file deletion is concerned, we concluded that this scheme covers all scenarios. At least, so far we have not found any evidence to the contrary.
The remaining nuances described below complement and refine this scheme. The decision points for deletion remain the same.
In the previous section, we looked at where in the code of the StartupXLOG function we built in the deletion of orphaned files. A lot of surprises still awaited us, but before we continue, we need to talk about how we save the list of relfilenodes (file identifiers) that is required for deleting orphaned files. This part also revealed a lot of nuances.
As mentioned above, the creation of a table data file corresponds to the file create entry in the WAL log (here I have provided the full output of pg_xlog_dump):
rmgr: Storage len (rec/tot): 20/ 52, tx: 720, lsn: 0/0C000078, prev 0/0C000050, bkp: 0000, desc: file create: base/12812/16384
A key element of the process of creating a table data file is the RelationCreateStorage function. In addition to asking the storage manager (smgr) to physically create the table files, it creates a new instance of the PendingRelDelete structure:
typedef struct PendingRelDelete
{
RelFileNodePendingDelete relnode; /* relation that may need to be deleted */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
Instances of these structures describe files that should be deleted when a transaction is committed or rolled back.
They are stored as a linked list within each of the backends in the local memory of the process (that is, they are not visible to anyone other than the current backend!). The list is represented by a static pointer to the head of the pendingDeletes list.
The instance created by RelationCreateStorage is added to the linked list (via the next pointer to the next element). The atCommit field indicates the action to be performed on the file when the transaction ends: if the file must be deleted on commit, atCommit is set to true; if on rollback, it is set to false. The file itself is identified by an instance of the RelFileNodePendingDelete structure:
typedef struct RelFileNode
{
Oid spcNode; /* tablespace */
Oid dbNode; /* database */
Oid relNode; /* relation */
} RelFileNode;
typedef struct RelFileNodePendingDelete
{
RelFileNode node;
bool isTempRelation;
char relstorage;
} RelFileNodePendingDelete;
relNode is the identifier stored in the relfilenode field of the pg_class table.
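For reference, the registration side follows the familiar pattern from catalog/storage.c; a condensed sketch (the real RelationCreateStorage also asks smgr to create the file, and the patched version additionally links the entry into the shared list discussed below):
/* Head of the backend-local list; lives for the lifetime of the backend. */
static PendingRelDelete *pendingDeletes = NULL;

/* Sketch: register a file created in this transaction for "delete on abort". */
static void
register_pending_delete_on_abort(RelFileNodePendingDelete relnode)
{
    PendingRelDelete *pending;

    /* must survive until the end of the transaction, hence TopMemoryContext */
    pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext,
                                                      sizeof(PendingRelDelete));
    pending->relnode = relnode;
    pending->atCommit = false;    /* drop the file only if the transaction aborts */
    pending->nestLevel = GetCurrentTransactionNestLevel();

    /* prepend to the backend-local linked list */
    pending->next = pendingDeletes;
    pendingDeletes = pending;
}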
When the backend reaches the stage of committing or rolling back a transaction, one of the operations performed is deleting the files recorded in the pendingDeletes list. The smgrDoPendingDeletes function is responsible for this; deep inside it there is a call to physically delete (unlink) the files associated with the table.
This is how regular file deletion works. As you can see, this process depends entirely on the pendingDeletes list.
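A condensed sketch of that sweep over pendingDeletes (the real smgrDoPendingDeletes batches the unlinks via smgr and handles more cases, for example temporary relations):
/* Stand-in for the physical unlink performed through smgr in the real code. */
extern void unlink_relation_files(RelFileNodePendingDelete *relnode);

/* Sketch: called at top-level or subtransaction commit/abort. */
static void
sweep_pending_deletes(bool isCommit)
{
    int nestLevel = GetCurrentTransactionNestLevel();
    PendingRelDelete *pending;
    PendingRelDelete *prev = NULL;
    PendingRelDelete *next;

    for (pending = pendingDeletes; pending != NULL; pending = next)
    {
        next = pending->next;
        if (pending->nestLevel < nestLevel)
        {
            prev = pending;                    /* belongs to an outer level: keep it */
            continue;
        }

        /* unlink the entry from the list first */
        if (prev)
            prev->next = next;
        else
            pendingDeletes = next;

        /* delete at commit (DROP TABLE) or at abort (rolled-back CREATE TABLE) */
        if (pending->atCommit == isCommit)
            unlink_relation_files(&pending->relnode);

        pfree(pending);
    }
}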
For our task, it is important to track which transaction created a file with a specific relfilenode identifier; we will see later why this is important. We decided to also store the identifiers (RelFileNodePendingDelete and TransactionId) as a linked list, because, given a pointer to a specific node, this data structure allows deleting an item from the list in O(1). At the same time, we wanted to do this without changing the existing functionality and decided to implement tracking of orphaned files on top of the existing scheme with the local pendingDeletes list.
typedef struct PendingRelDelete
{
... /* all existing fields */
dsa_pointer shmemPtr; /* ptr to shared pending delete list node */
} PendingRelDelete;
By extending the PendingRelDelete structure with a new field (shmemPtr), a pointer to the corresponding node in the list of orphaned files, we solved several problems at once:
It conceptually links pendingDeletes with the list of files that will be considered orphaned after a crash.
It simplifies removing orphaned files from that list, since we already have a pointer to the desired node (shmemPtr). Thus, committing or rolling back a transaction, which removes items from the monitored local pendingDeletes list, immediately removes the corresponding nodes from the list of potentially orphaned files.
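To show what shmemPtr buys us, here is a minimal sketch of unlinking the shared counterpart of an entry in O(1). The node layout, the per-backend dsa_area, list head and lock are invented for illustration, and the dsa_* calls are the API of newer PostgreSQL releases; the actual Greengage code organizes this differently.
#include "postgres.h"
#include "utils/dsa.h"
#include "storage/lwlock.h"

/* Illustrative layout of a node in the per-backend shared list. */
typedef struct SharedPendingNode
{
    TransactionId            xid;       /* transaction that created the file */
    RelFileNodePendingDelete relnode;   /* the file itself */
    dsa_pointer              prev;      /* doubly linked so that unlink is O(1) */
    dsa_pointer              next;
} SharedPendingNode;

/* Sketch: remove the shared counterpart of a backend-local PendingRelDelete entry. */
static void
shared_pending_delete_unlink(dsa_area *area, dsa_pointer *head,
                             LWLock *lock, dsa_pointer node_ptr)
{
    SharedPendingNode *node = (SharedPendingNode *) dsa_get_address(area, node_ptr);

    /* only the owning backend and the checkpointer ever touch this list */
    LWLockAcquire(lock, LW_EXCLUSIVE);

    if (DsaPointerIsValid(node->prev))
        ((SharedPendingNode *) dsa_get_address(area, node->prev))->next = node->next;
    else
        *head = node->next;
    if (DsaPointerIsValid(node->next))
        ((SharedPendingNode *) dsa_get_address(area, node->next))->prev = node->prev;

    LWLockRelease(lock);
    dsa_free(area, node_ptr);
}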
Having implemented the basic deletion scheme and the list-saving mechanism described above, and encouraged by the intermediate success, we ran the first tests and saw the following problem: if a transaction does not complete before the next checkpoint, and the file create record ends up "to the left" of the redo point on the timeline, the orphaned files are not deleted in case of a failure.
It is obvious that:
recovery starts from the redo point;
the file create record is not replayed during recovery (as it is located before the redo point);
the identifiers are not added to the list of potentially orphaned files;
as a result, the recovery process does not realize that some files will never be linked to a table and should be deleted.
Therefore, in order for file tracking to cross the "Rubicon" of the checkpoint and the redo record, we had to solve the problem of persisting this list.
The first candidate for storing the list was the checkpoint record itself. However, we rejected this idea in order to preserve compatibility in critical code paths related to checkpoint record processing. As a result, we decided to add a new type of WAL record called XLOG_PENDING_DELETE (shown as orphaned relfilenodes to delete in the output of pg_xlog_dump).
This record is a "container" of identifiers of files to delete (RelFileNodePendingDelete relnode), each associated with the transaction (the xid field) that created the file; the entries are PendingRelXactDelete instances stored in the array field of PendingRelXactDeleteArray:
typedef struct PendingRelXactDelete
{
RelFileNodePendingDelete relnode;
TransactionId xid;
} PendingRelXactDelete;
typedef struct PendingRelXactDeleteArray
{
Size count;
PendingRelXactDelete array[FLEXIBLE_ARRAY_MEMBER];
} PendingRelXactDeleteArray;
Two questions immediately arise: who and when should add this record to the WAL log?
Since the checkpointer process is responsible for creating checkpoints, it seemed logical to integrate into the function that creates online checkpoints. So the answer to the question "who" is the checkpointer process.
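A sketch of how the checkpointer could assemble and write such a record using the structures above. The resource-manager id, the info flag and collect_pending_deletes() (the walk over the backends' shared lists) are placeholders, and the XLogBeginInsert/XLogRegisterData calls are the WAL-insertion API of newer PostgreSQL releases; a 9.4-based core would build an XLogRecData chain instead.
#include "postgres.h"
#include "access/xloginsert.h"

/* Placeholders: the patch defines its own rmgr id, info flag and list walk. */
extern const RmgrId RM_PENDING_DELETE_ID;
#define XLOG_PENDING_DELETE 0x00
extern PendingRelXactDelete *collect_pending_deletes(Size *count);

/* Sketch: called by the checkpointer near the redo point of an online checkpoint. */
static XLogRecPtr
log_pending_deletes(void)
{
    Size        count;
    PendingRelXactDelete *entries = collect_pending_deletes(&count);
    Size        size = offsetof(PendingRelXactDeleteArray, array) +
                       count * sizeof(PendingRelXactDelete);
    PendingRelXactDeleteArray *rec = palloc(size);
    XLogRecPtr  lsn;

    rec->count = count;
    memcpy(rec->array, entries, count * sizeof(PendingRelXactDelete));

    XLogBeginInsert();
    XLogRegisterData((char *) rec, (int) size);
    lsn = XLogInsert(RM_PENDING_DELETE_ID, XLOG_PENDING_DELETE);

    pfree(rec);
    pfree(entries);
    return lsn;
}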
There are some difficulties with answering the question "when". The list of identifiers of files that may become orphaned due to a transaction abort is needed as "close" to the redo point as possible, so that when the recovery process starts, the list can be obtained immediately and the status of its entries can then be updated. The update is obviously needed so that the core can decide on its own which files to delete, since commit prepared or abort records can appear after the redo point. Thus, the update means removing the entries of those transactions that completed normally.
Without locking WAL insertion, it is impossible to guarantee that the new record will be the first one after the redo point, that is, that the redo point will point exactly to this record. We did not want to make backends wait on a lock unless it was absolutely necessary. As will be seen later, this creates some difficulties, but we considered locking a bad solution.
From this it follows that the answer to the question "when?" implies either some synchronization between the checkpointer reading the list and the backends modifying it concurrently, or accepting that the information in the list of orphaned files may already be outdated at the time of reading. This requires a separate explanation.
Backend processes that handle transactions live their own lives. In the context of creating and deleting table files, this means working with the pendingDeletes list in their local memory. In order for each backend's list of identifiers to be available to the checkpointer process, the list has to be shared. To create shared data structures, the PostgreSQL architecture provides a shared memory area.
However, if we made this list common to all backends, we would obviously need synchronization primitives (in our case, LWLock). All work with the list (adding or deleting nodes) would have to be protected by an exclusive lock. This approach has one significant drawback: all transactions that create or drop tables would compete for this lock. We rejected this solution and let each backend keep its own list. The checkpointer process needs read-only access, while the backend process needs write access. A lock is still necessary, since while the checkpointer is walking the list, the backend cannot change its contents; however, this is a local lock within a "backend-checkpointer" pair, not "all backends-checkpointer".
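A sketch of the read side under this scheme (the per-backend descriptor, the way it is discovered, and the callback are assumptions for illustration; only the locking idea follows the text above):
/* Illustrative per-backend descriptor of the shared list. */
typedef struct BackendPendingList
{
    LWLock     *lock;   /* exclusive for the owning backend, shared for checkpointer */
    dsa_pointer head;   /* first SharedPendingNode, or InvalidDsaPointer */
} BackendPendingList;

/* Sketch: the checkpointer copies one backend's entries while holding only that backend's lock. */
static void
collect_from_backend(dsa_area *area, BackendPendingList *bl,
                     void (*emit) (TransactionId xid, RelFileNodePendingDelete *relnode))
{
    dsa_pointer ptr;

    LWLockAcquire(bl->lock, LW_SHARED);   /* blocks only this one backend's writes */
    for (ptr = bl->head; DsaPointerIsValid(ptr);)
    {
        SharedPendingNode *node = (SharedPendingNode *) dsa_get_address(area, ptr);

        emit(node->xid, &node->relnode);
        ptr = node->next;
    }
    LWLockRelease(bl->lock);
}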
We also considered, prototyped, and performance-tested a lock-free list. A similar data structure is described in the article Lock-Free Linked Lists Using Compare-and-Swap, and a variant of its improved implementation in the article A more Pragmatic Implementation of the Lock-free, Ordered, Linked List. Based on the results of load testing, we concluded that such a data structure is redundant in our case. It showed itself well when modeling concurrent access by a significant number of backends with a large number of elements in the list (hundreds of thousands and more), but such a load is not expected under the intended operating conditions. Therefore, we settled on a compromise: a lock on the list within a backend. The results of the lock-free list study are interesting in themselves, and implementations based on such lists may find application in other, more heavily loaded parts of the DBMS core. We hope to share these results in one of our future articles.
It is important to understand the following implementation details:
the list of file identifiers stored in the XLOG_PENDING_DELETE record is prepared by the checkpointer process;
at the same time, other backends can add their commit prepared or abort records to the WAL log;
therefore, we must not rely on the order in which records appear in the WAL log.
For example, a situation is possible where COMMIT PREPARED is executed for a transaction that created a table and its record is added to the WAL log, and only then does checkpointer insert its record into the WAL log according to its knowledge of the list of orphaned files. Indeed, while checkpointer was traversing the lists of active backends, the transaction could have been committed or rolled back. Thus, when replaying WAL records during recovery, it is necessary to check whether the transaction that created a file has already been committed; if it has, the file must not be deleted.
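A sketch of the redo handler for XLOG_PENDING_DELETE with this check in place (track_orphan_candidate() is a made-up helper that adds an entry to the recovery-time tracking list consulted later, at XLOG_CHECKPOINT_SHUTDOWN / XLOG_END_OF_RECOVERY):
#include "postgres.h"
#include "access/transam.h"   /* TransactionIdDidCommit(), TransactionIdDidAbort() */

/* Placeholder: remember a candidate for the final decision at the end of replay. */
extern void track_orphan_candidate(TransactionId xid, RelFileNodePendingDelete *relnode);

/* Sketch: replay of an XLOG_PENDING_DELETE record. */
static void
pending_delete_redo(PendingRelXactDeleteArray *rec)
{
    Size i;

    for (i = 0; i < rec->count; i++)
    {
        PendingRelXactDelete *e = &rec->array[i];

        /*
         * The record may have been written after the creating transaction had
         * already put its commit (or abort) record into WAL, so the snapshot
         * is allowed to be stale: skip entries whose creator has finished.
         */
        if (TransactionIdDidCommit(e->xid) || TransactionIdDidAbort(e->xid))
            continue;

        track_orphan_candidate(e->xid, &e->relnode);
    }
}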
This solution allowed us not to be tied to the order of writing to the WAL log. Any synchronization would entail the need to serialize the record to the WAL log, which would inevitably create competition between the backends.
Hence the need to store the xid:relfilenode pair mentioned earlier: the transaction status can be looked up by its identifier. Additionally, it should be noted that a transaction can have nested transactions with their own identifiers (which can also be rolled back on their own via ROLLBACK TO SAVEPOINT). When a transaction is committed, COMMIT is performed for the top-level transaction; there are no separate commit events for nested transactions. In one of the first versions of the patch, we ran into a situation where, after committing a transaction and subsequently recovering from a failure, the DBMS happily deleted the files of tables that had been created and committed within nested transactions. This problem was solved by deleting the entire "tree" of nested transaction identifiers from the list.
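Since a (prepared) commit record carries the list of committed subtransaction ids, the fix boils down to removing the whole family from the tracking list; a sketch (untrack_xid() is a made-up helper that removes every entry with the given xid):
/* Placeholder: drop all tracked entries created by the given transaction id. */
extern void untrack_xid(TransactionId xid);

/* Sketch: on replay of a commit record, stop tracking the whole transaction tree. */
static void
untrack_committed_tree(TransactionId xid, TransactionId *subxids, int nsubxids)
{
    int i;

    untrack_xid(xid);                 /* top-level transaction */
    for (i = 0; i < nsubxids; i++)
        untrack_xid(subxids[i]);      /* tables created inside committed subtransactions */
}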
When we conducted a large number of tests and explored different scenarios and sequences of events/WAL records, my colleague, Vasily Ivanov (aka Stolb27), found a scenario where the necessary files were mistakenly deleted. This scenario initially stumped us. After so many tests and processing of boundary conditions (transaction status, transaction visibility horizon), we still had a non-trivial issue. At that moment, I almost started thinking about options with a PostgreSQL extension or some other external (in relation to the core) solution.
The scenario from the point of view of the master WAL log:
..., tx: 732, lsn: 0/0C19B1B8, ..., desc: file create: base/12812/57345
..., tx: 0, lsn: 0/0C1C4298, ..., desc: orphaned relfilenodes to delete: 1
..., tx: 0, lsn: 0/0C1C42D8, ..., desc: checkpoint: redo 0/C1C4298; ...; online, ...
..., tx: 732, lsn: 0/0C1C4358, ..., desc: distributed commit ... gid = 1747898739-0000000005, gxid = 5
Segment WAL log:
..., tx: 727, lsn: 0/0C19C158, ..., desc: file create: base/12812/49154
..., tx: 727, lsn: 0/0C1C5000, ..., desc: prepare
..., tx: 0, lsn: 0/0C1C5418, ..., desc: orphaned relfilenodes to delete: 1
..., tx: 0, lsn: 0/0C1C5458, ..., desc: checkpoint: redo 0/C1C5000; ...; online, ...
1. Master server initiates a distributed transaction that will create a table data file (line 1 in the master and segment logs, respectively).
2. Segment executes prepare (the DTX_PROTOCOL_COMMAND_PREPARE command in terms of the distributed transaction manager) and sends a response to the master (line 2 in the segment log).
3. At this point, checkpointer wakes up and creates a checkpoint (lines 3 and 4 in the master and segment logs).
4. checkpointer stores the list of xid:relfilenode pairs in the orphaned relfilenodes to delete record, since the transaction has not yet committed and these files are formally considered possible candidates for deletion during recovery (lines 2 and 3 in the master and segment logs).
5. Master receives the responses from the segments and inserts the distributed commit record into the WAL log (line 4 in the master log).
6. Master updates the commit log for the transaction, marking it as successfully completed.
7. Next, the master should ask the segments to commit the prepared transaction (the DTX_PROTOCOL_COMMAND_COMMIT_PREPARED command), but it does not have time to do this and crashes.
8. At the same time, one or several segments crash, thereby initiating the recovery process at the next start.
9. User starts the cluster.
10. The startup process on the master and on the segment starts replaying the WAL log from the redo points: lsn: 0/C1C4298 on the master (the orphaned relfilenodes to delete record) and lsn: 0/C1C5000 on the segment (the prepare record).
11. According to the initial algorithm, before the recovery process completes, the startup process starts deleting orphaned files, and since there is no commit prepared record in the segment WAL log (the master did not have time to send it), the files of this transaction are considered orphaned and are deleted from disk.
12. Master finds an interrupted distributed transaction and starts the distributed transaction recovery process, within which it sends the commit prepared request to the segments, thereby completing the distributed transaction according to the 2PC protocol.
Master log:
"LOG","00000","Crash recovery broadcast of the distributed transaction 'Commit Prepared' broadcast succeeded for gid = 1747898739-0000000005.",,,,,,,0,,"cdbdtxrecovery.c",98,
Segment WAL log:
..., tx: 0, lsn: 0/0C1C5578, ..., desc: commit prepared 727: ... gid = 3953992400-0000025774 gid = 1747898739-0000000005 gxid = 5
Thus, at step 11 the segment deleted the files of a table that was in fact successfully created, since its transaction was eventually committed at the final recovery step!
The result is extremely sad, but logical:
postgres=# SELECT * FROM heap;
ERROR: could not open file "base/12812/49154": No such file or directory (seg0 slice1 127.0.0.1:6002 pid=57716)
Why were the same files not deleted on the master? The difference is that, although the master's recovery also starts from the orphaned relfilenodes to delete WAL record (lsn: 0/0C1C4298), when the distributed commit record (lsn: 0/0C1C4358) is processed, the identifiers of the files created within the transaction are removed from the list of files to be deleted.
It was completely unclear what to do in this case. Formally, at the moment the recovery process on the segment was completed, the file was considered orphaned (there was no commit prepared record, and the transaction status was TRANSACTION_STATUS_IN_PROGRESS). How do we single out the transactions that can still be committed during the recovery of distributed transactions?
Before we could answer this question, we needed to take a closer look at the implementation of the 2PC protocol and especially the implementation of error handling.
The 2PC protocol solves the following problem: the protocol ensures that all participating database servers receive and perform the same action (either commit or roll back a transaction) regardless of local or network failure. If any database server fails to commit its part of the transaction, all database servers participating in the transaction are prevented from committing their work.
As you may be aware, the mechanism of prepared transactions is implemented to support external (in relation to the PostgreSQL core) transaction managers.
Implicitly, we have already seen elements of the 2PC protocol implementation in Greengage/Greenplum. The basis of the protocol implementation is as follows:
Master sends a request to the segments to prepare a transaction commit (the PREPARE TRANSACTION statement). It makes sense to note here that after executing this command, the transaction is no longer associated with the current session and can be committed (or, conversely, rolled back) from any other session. This will be needed for the next phase of the 2PC protocol and for error handling in particular.
If all segments successfully execute this request and the master receives a response from all of them, it asks the segments to commit the previously prepared transaction (the COMMIT PREPARED statement). There is no way back after this step: the master should ensure that all segments execute this request.
If all segments successfully execute COMMIT PREPARED, the distributed transaction is considered committed, and this fact is recorded in the master WAL log.
The difficulties, as always, begin when we consider error handling. What is important for us now is the master failure scenario in which the segments have responded that they are ready to commit the transaction (the prepare record) and the master has recorded its intention to commit the distributed transaction in its WAL log with the distributed commit record.
When restoring the master, this scenario is processed as follows:
Master finds the distributed commit record (the distributed transaction manager has restored the context, so the master knows about such a transaction).
Master sends the segments a request expressing the intention to commit this transaction. Physically, a separate background worker (the dtx recovery process) is responsible for this step.
Upon receiving such a request, the segments add commit prepared to their WAL logs.
Upon receiving the responses, the master finally commits the distributed transaction by inserting the forget record into its WAL log.
Information about a prepared transaction is recorded in a special 2PC file, and its content is also saved in a WAL record of the XLOG_XACT_PREPARE type. The Greengage/Greenplum-specific part (relative to PostgreSQL) is that when processing a WAL record of this type, the handler reads its content and saves it in a special hash table (crashRecoverPostCheckpointPreparedTransactions_map_ht), keyed by the (top-level) transaction identifier, with the content of the 2PC file as the value. Elements also get into this hash table when a checkpoint record from the WAL log is processed, if the checkpoint contains such entries (that is, at the time of its creation there were distributed transactions that may require restoring their context if recovery is started). When a transaction is committed or rolled back, its element is removed from the hash table. Upon completion of the recovery process, one of the last steps is to process the prepared transactions remaining in this table.
Thus, the presence of a transaction identifier (and its subtransactions, if any) in the hash table means that files related to that transaction should not be deleted, since such a prepared transaction may be committed during the processing of error situations!
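The final decision rule therefore adds one more check on top of the transaction status; a sketch in which is_in_recovered_prepared_xacts() stands for a lookup in crashRecoverPostCheckpointPreparedTransactions_map_ht and unlink_relation_files() is the same placeholder as before:
/* Placeholders for the tracking list and the hash-table lookup described above. */
typedef struct OrphanCandidate
{
    TransactionId            xid;
    RelFileNodePendingDelete relnode;
    struct OrphanCandidate  *next;
} OrphanCandidate;

extern bool is_in_recovered_prepared_xacts(TransactionId xid);
extern void unlink_relation_files(RelFileNodePendingDelete *relnode);

/* Sketch: run when XLOG_CHECKPOINT_SHUTDOWN / XLOG_END_OF_RECOVERY is replayed. */
static void
drop_orphaned_files_final(OrphanCandidate *candidates)
{
    OrphanCandidate *c;

    for (c = candidates; c != NULL; c = c->next)
    {
        if (TransactionIdDidCommit(c->xid) || TransactionIdDidAbort(c->xid))
            continue;   /* the regular commit/abort path already handled the file */

        if (is_in_recovered_prepared_xacts(c->xid))
            continue;   /* dtx recovery may still send COMMIT PREPARED for it */

        unlink_relation_files(&c->relnode);   /* truly orphaned: remove from disk */
    }
}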
The final scheme for deleting orphaned files began to look like this, and this implementation was released.
It took us almost a year to develop this feature. In addition to the implementation itself — careful and thoughtful development in one of the most critical parts of the core (recovery from failures, creation and deletion of table data files) — a significant amount of time was also spent on analytical work. This involved identifying and testing potential failure scenarios.
Yes, from the perspective of solving the problems of ordinary users, this feature does not seem to be very useful, but its purpose lies elsewhere — in helping DBMS administrators solve their problems. And we know that it is in demand.