race condition when writing pg_control

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

race condition when writing pg_control

Bossart, Nathan
Hi hackers,

I believe I've discovered a race condition between the startup and
checkpointer processes that can cause a CRC mismatch in the pg_control
file.  If a cluster crashes at the right time, the following error
appears when you attempt to restart it:

        FATAL:  incorrect checksum in control file

This appears to be caused by some code paths in xlog_redo() that
update ControlFile without taking the ControlFileLock.  The attached
patch seems to be sufficient to prevent the CRC mismatch in the
control file, but perhaps this is a symptom of a bigger problem with
concurrent modifications of ControlFile->checkPointCopy.nextFullXid.

Nathan


v1-0001-Prevent-race-condition-when-writing-pg_control.patch (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: race condition when writing pg_control

Thomas Munro-5
On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan <[hidden email]> wrote:

> I believe I've discovered a race condition between the startup and
> checkpointer processes that can cause a CRC mismatch in the pg_control
> file.  If a cluster crashes at the right time, the following error
> appears when you attempt to restart it:
>
>         FATAL:  incorrect checksum in control file
>
> This appears to be caused by some code paths in xlog_redo() that
> update ControlFile without taking the ControlFileLock.  The attached
> patch seems to be sufficient to prevent the CRC mismatch in the
> control file, but perhaps this is a symptom of a bigger problem with
> concurrent modifications of ControlFile->checkPointCopy.nextFullXid.

This does indeed look pretty dodgy.  CreateRestartPoint() running in
the checkpointer does UpdateControlFile() to compute a checksum and
write it out, but xlog_redo() processing
XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without
interlocking.  It looks like the ancestors of that line were there
since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran
UpdateControLFile() directly in the startup process (immediately after
that update), so no interlocking problem.  Then in cdd46c76548 (2009),
RecoveryRestartPoint() was split up so that CreateRestartPoint() ran
in another process.


Reply | Threaded
Open this post in threaded view
|

Re: race condition when writing pg_control

Thomas Munro-5
On Tue, May 5, 2020 at 9:51 AM Thomas Munro <[hidden email]> wrote:

> On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan <[hidden email]> wrote:
> > I believe I've discovered a race condition between the startup and
> > checkpointer processes that can cause a CRC mismatch in the pg_control
> > file.  If a cluster crashes at the right time, the following error
> > appears when you attempt to restart it:
> >
> >         FATAL:  incorrect checksum in control file
> >
> > This appears to be caused by some code paths in xlog_redo() that
> > update ControlFile without taking the ControlFileLock.  The attached
> > patch seems to be sufficient to prevent the CRC mismatch in the
> > control file, but perhaps this is a symptom of a bigger problem with
> > concurrent modifications of ControlFile->checkPointCopy.nextFullXid.
>
> This does indeed look pretty dodgy.  CreateRestartPoint() running in
> the checkpointer does UpdateControlFile() to compute a checksum and
> write it out, but xlog_redo() processing
> XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without
> interlocking.  It looks like the ancestors of that line were there
> since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran
> UpdateControLFile() directly in the startup process (immediately after
> that update), so no interlocking problem.  Then in cdd46c76548 (2009),
> RecoveryRestartPoint() was split up so that CreateRestartPoint() ran
> in another process.
Here's a version with a commit message added.  I'll push this to all
releases in a day or two if there are no objections.

0001-Fix-race-condition-that-could-corrupt-pg_control.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: race condition when writing pg_control

Fujii Masao-4


On 2020/05/22 13:51, Thomas Munro wrote:

> On Tue, May 5, 2020 at 9:51 AM Thomas Munro <[hidden email]> wrote:
>> On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan <[hidden email]> wrote:
>>> I believe I've discovered a race condition between the startup and
>>> checkpointer processes that can cause a CRC mismatch in the pg_control
>>> file.  If a cluster crashes at the right time, the following error
>>> appears when you attempt to restart it:
>>>
>>>          FATAL:  incorrect checksum in control file
>>>
>>> This appears to be caused by some code paths in xlog_redo() that
>>> update ControlFile without taking the ControlFileLock.  The attached
>>> patch seems to be sufficient to prevent the CRC mismatch in the
>>> control file, but perhaps this is a symptom of a bigger problem with
>>> concurrent modifications of ControlFile->checkPointCopy.nextFullXid.
>>
>> This does indeed look pretty dodgy.  CreateRestartPoint() running in
>> the checkpointer does UpdateControlFile() to compute a checksum and
>> write it out, but xlog_redo() processing
>> XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without
>> interlocking.  It looks like the ancestors of that line were there
>> since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran
>> UpdateControLFile() directly in the startup process (immediately after
>> that update), so no interlocking problem.  Then in cdd46c76548 (2009),
>> RecoveryRestartPoint() was split up so that CreateRestartPoint() ran
>> in another process.
>
> Here's a version with a commit message added.  I'll push this to all
> releases in a day or two if there are no objections.

+1 to push the patch.

Per my quick check, XLogReportParameters() seems to have the similar issue,
i.e., it updates the control file without taking ControlFileLock.
Maybe we should fix this at the same time?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: race condition when writing pg_control

Michael Paquier-2
On Sat, May 23, 2020 at 01:00:17AM +0900, Fujii Masao wrote:
> Per my quick check, XLogReportParameters() seems to have the similar issue,
> i.e., it updates the control file without taking ControlFileLock.
> Maybe we should fix this at the same time?

Yeah.  It also checks the control file values, implying that we should
have LW_SHARED taken at least at the beginning, but this lock cannot
be upgraded we need LW_EXCLUSIVE the whole time.  I am wondering if we
should check with an assert if ControlFileLock is taken when going
through UpdateControlFile().  We have one code path at the beginning
of redo where we don't need a lock close to the backup_label file
checks, but we could just pass down a boolean flag to the routine to
handle that case.  Another good thing in having an assert is that any
new caller of UpdateControlFile() would need to think about the need
of a lock.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: race condition when writing pg_control

Bossart, Nathan
In reply to this post by Thomas Munro-5
On 5/21/20, 9:52 PM, "Thomas Munro" <[hidden email]> wrote:
> Here's a version with a commit message added.  I'll push this to all
> releases in a day or two if there are no objections.

Looks good to me.  Thanks!

Nathan

Reply | Threaded
Open this post in threaded view
|

Re: race condition when writing pg_control

Bossart, Nathan
In reply to this post by Michael Paquier-2
On 5/22/20, 10:40 PM, "Michael Paquier" <[hidden email]> wrote:

> On Sat, May 23, 2020 at 01:00:17AM +0900, Fujii Masao wrote:
>> Per my quick check, XLogReportParameters() seems to have the similar issue,
>> i.e., it updates the control file without taking ControlFileLock.
>> Maybe we should fix this at the same time?
>
> Yeah.  It also checks the control file values, implying that we should
> have LW_SHARED taken at least at the beginning, but this lock cannot
> be upgraded we need LW_EXCLUSIVE the whole time.  I am wondering if we
> should check with an assert if ControlFileLock is taken when going
> through UpdateControlFile().  We have one code path at the beginning
> of redo where we don't need a lock close to the backup_label file
> checks, but we could just pass down a boolean flag to the routine to
> handle that case.  Another good thing in having an assert is that any
> new caller of UpdateControlFile() would need to think about the need
> of a lock.
While an assertion in UpdateControlFile() would not have helped us
catch the problem I initially reported, it does seem worthwhile to add
it.  I have attached a patch that adds this assertion and also
attempts to fix XLogReportParameters().  Since there is only one place
where we feel it is safe to call UpdateControlFile() without a lock, I
just changed it to take the lock.  I don't think this adds any sort of
significant contention risk, and IMO it is a bit cleaner than the
boolean flag.

For the XLogReportParameters() fix, I simply added an exclusive lock
acquisition for the portion that updates the values in shared memory
and calls UpdateControlFile().  IIUC the first part of this function
that accesses several ControlFile values should be safe, as none of
them can be updated after server start.

Nathan


v1-0001-Assert-that-ControlFileLock-is-held-exclusively-i.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: race condition when writing pg_control

Michael Paquier-2
On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote:
> While an assertion in UpdateControlFile() would not have helped us
> catch the problem I initially reported, it does seem worthwhile to add
> it.  I have attached a patch that adds this assertion and also
> attempts to fix XLogReportParameters().  Since there is only one place
> where we feel it is safe to call UpdateControlFile() without a lock, I
> just changed it to take the lock.  I don't think this adds any sort of
> significant contention risk, and IMO it is a bit cleaner than the
> boolean flag.

Let's see what Fujii-san and Thomas think about that.  I'd rather
avoid taking a lock here because we don't need it and because it makes
things IMO confusing with the beginning of StartupXLOG() where a lot
of the fields are read, even if we go without this extra assertion.

> For the XLogReportParameters() fix, I simply added an exclusive lock
> acquisition for the portion that updates the values in shared memory
> and calls UpdateControlFile().  IIUC the first part of this function
> that accesses several ControlFile values should be safe, as none of
> them can be updated after server start.

They can get updated when replaying a XLOG_PARAMETER_CHANGE record.
But you are right as all of this happens in the startup process, so
your patch looks right to me here.
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: race condition when writing pg_control

Fujii Masao-4


On 2020/05/27 16:10, Michael Paquier wrote:

> On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote:
>> While an assertion in UpdateControlFile() would not have helped us
>> catch the problem I initially reported, it does seem worthwhile to add
>> it.  I have attached a patch that adds this assertion and also
>> attempts to fix XLogReportParameters().  Since there is only one place
>> where we feel it is safe to call UpdateControlFile() without a lock, I
>> just changed it to take the lock.  I don't think this adds any sort of
>> significant contention risk, and IMO it is a bit cleaner than the
>> boolean flag.
>
> Let's see what Fujii-san and Thomas think about that.  I'd rather
> avoid taking a lock here because we don't need it and because it makes
> things IMO confusing with the beginning of StartupXLOG() where a lot
> of the fields are read, even if we go without this extra assertion.

I have no strong opinion about this, but I tend to agree with Michael here.
 
>> For the XLogReportParameters() fix, I simply added an exclusive lock
>> acquisition for the portion that updates the values in shared memory
>> and calls UpdateControlFile().  IIUC the first part of this function
>> that accesses several ControlFile values should be safe, as none of
>> them can be updated after server start.
>
> They can get updated when replaying a XLOG_PARAMETER_CHANGE record.
> But you are right as all of this happens in the startup process, so
> your patch looks right to me here.

LGTM.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


Reply | Threaded
Open this post in threaded view
|

Re: race condition when writing pg_control

Bossart, Nathan
On 5/29/20, 12:24 AM, "Fujii Masao" <[hidden email]> wrote:

> On 2020/05/27 16:10, Michael Paquier wrote:
>> On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote:
>>> While an assertion in UpdateControlFile() would not have helped us
>>> catch the problem I initially reported, it does seem worthwhile to add
>>> it.  I have attached a patch that adds this assertion and also
>>> attempts to fix XLogReportParameters().  Since there is only one place
>>> where we feel it is safe to call UpdateControlFile() without a lock, I
>>> just changed it to take the lock.  I don't think this adds any sort of
>>> significant contention risk, and IMO it is a bit cleaner than the
>>> boolean flag.
>>
>> Let's see what Fujii-san and Thomas think about that.  I'd rather
>> avoid taking a lock here because we don't need it and because it makes
>> things IMO confusing with the beginning of StartupXLOG() where a lot
>> of the fields are read, even if we go without this extra assertion.
>
> I have no strong opinion about this, but I tend to agree with Michael here.
>
>>> For the XLogReportParameters() fix, I simply added an exclusive lock
>>> acquisition for the portion that updates the values in shared memory
>>> and calls UpdateControlFile().  IIUC the first part of this function
>>> that accesses several ControlFile values should be safe, as none of
>>> them can be updated after server start.
>>
>> They can get updated when replaying a XLOG_PARAMETER_CHANGE record.
>> But you are right as all of this happens in the startup process, so
>> your patch looks right to me here.
>
> LGTM.
Thanks for the feedback.  I've attached a new set of patches.

Nathan


0003-Assert-that-ControlFileLock-is-held-within-UpdateCon.patch (8K) Download Attachment
0002-Acquire-ControlFileLock-within-XLogReportParameters.patch (1K) Download Attachment
0001-Fix-race-condition-that-could-corrupt-pg_control.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: race condition when writing pg_control

Michael Paquier-2
On Sun, May 31, 2020 at 09:11:35PM +0000, Bossart, Nathan wrote:
> Thanks for the feedback.  I've attached a new set of patches.

Thanks for splitting the set.  0001 and 0002 are the minimum set for
back-patching, and it would be better to merge them together.  0003 is
debatable and not an actual bug fix, so I would refrain from doing a
backpatch.  It does not seem that there is a strong consensus in favor
of 0003 either.

Thomas, are you planning to look at this patch set?
--
Michael

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: race condition when writing pg_control

Thomas Munro-5
On Tue, Jun 2, 2020 at 5:24 PM Michael Paquier <[hidden email]> wrote:

> On Sun, May 31, 2020 at 09:11:35PM +0000, Bossart, Nathan wrote:
> > Thanks for the feedback.  I've attached a new set of patches.
>
> Thanks for splitting the set.  0001 and 0002 are the minimum set for
> back-patching, and it would be better to merge them together.  0003 is
> debatable and not an actual bug fix, so I would refrain from doing a
> backpatch.  It does not seem that there is a strong consensus in favor
> of 0003 either.
>
> Thomas, are you planning to look at this patch set?

Sorry for my radio silence, I got tangled up with a couple of
conferences.  I'm planning to look at 0001 and 0002 shortly.


Reply | Threaded
Open this post in threaded view
|

Re: race condition when writing pg_control

Michael Paquier-2
On Wed, Jun 03, 2020 at 10:56:13AM +1200, Thomas Munro wrote:
> Sorry for my radio silence, I got tangled up with a couple of
> conferences.  I'm planning to look at 0001 and 0002 shortly.

Thanks!
--
Michael

signature.asc (849 bytes) Download Attachment