ENT-14140: psql_wrapper.sh: retry psql commands on transient failures #3165

Merged: larsewi merged 1 commit into cfengine:master from larsewi:fr-race on May 31, 2026

@larsewi larsewi commented May 26, 2026

Observed a race condition in CI where bundle agent superhub_schema interacts with postgres shortly after service restart.

03:12:04 systemd: Stopping CFEngine Enterprise PostgreSQL Database...
03:12:04 systemd: Started CFEngine Enterprise PostgreSQL Database.
03:12:04 cf-agent: Executing ... psql_wrapper.sh cfdb select superhub_schema(...)
03:12:05 cf-agent: returned code '2' defined as promise failed

Fixed by gating superhub_schema, ensure_feeders, and imported_data on a persistent class set by the cf-postgres restart.

Ticket: ENT-14140

Backported to:

@larsewi larsewi added the "cherry-pick?" label (Fixes which may need to be cherry-picked to LTS branches) May 26, 2026

@craigcomstock craigcomstock left a comment

This doesn't feel quite right to me. It seems we need a sequence of actions rather than a class/gate: the restart has to finish, and then superhub_schema() should run. With this solution, superhub_schema() would run at the next agent interval, which is OK but maybe not ideal. Instead of gating on a recent restart, could we gate on PostgreSQL being up and ready, in the hope that superhub_schema() could run in the same agent run?
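
For illustration, gating on "up and ready" could be sketched as a small polling loop. This is only a sketch under assumptions, not code from this PR: `wait_until_ready` is a hypothetical helper name, and the attempt count and delay are arbitrary. The probe is passed in as a command, so PostgreSQL's `pg_isready` utility (which exits 0 once the server accepts connections) could serve as the probe.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): poll a readiness probe until it
# succeeds or the attempts run out.
# Usage: wait_until_ready MAX_ATTEMPTS DELAY_SECONDS PROBE [ARGS...]
wait_until_ready() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Run the probe quietly; exit status 0 means "ready".
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Sleep between attempts, but not after the last one.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1    # the probe never succeeded
}

# Example (not run here): wait up to roughly 10 seconds for a local server.
# wait_until_ready 10 1 pg_isready -q && echo "postgres is ready"
```

A ready server costs one probe call and no sleep, so the same agent run can continue straight to the schema step.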

Comment thread cfe_internal/enterprise/federation/federation.cf Outdated

larsewi commented May 26, 2026

> 2 if the connection to the server went bad and the session was not interactive

@craigcomstock, @nickanderson what if we have the psql_wrapper.sh retry in case of return code 2 ?

@nickanderson

> > 2 if the connection to the server went bad and the session was not interactive
>
> @craigcomstock, @nickanderson what if we have the psql_wrapper.sh retry in case of return code 2 ?

yeah I think that would be better.

larsewi commented May 27, 2026

> > > 2 if the connection to the server went bad and the session was not interactive
> >
> > @craigcomstock, @nickanderson what if we have the psql_wrapper.sh retry in case of return code 2 ?
>
> yeah I think that would be better.

The only thing @nickanderson is that it will cause the agent to hang while it bootstraps. Or should it perhaps run these commands in the background?

Retry psql command on transient failures. E.g., when postgres is being
restarted due to config change.

Ticket: ENT-14140
Changelog: psql commands are now retried on transient errors in federated reporting
Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
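
The retry idea in this commit message can be sketched roughly as follows. This is a minimal illustration under assumptions, not the merged psql_wrapper.sh: the function name `retry_on_connection_failure` and the variables `RETRY_MAX` and `RETRY_DELAY` are hypothetical, and their default values are arbitrary. It retries only on exit code 2, which psql uses when the connection to the server went bad in a non-interactive session, and returns any other status immediately.

```shell
#!/bin/sh
# Hypothetical sketch (not the merged script): retry a command only while it
# exits with code 2; any other exit status is returned at once.
# RETRY_MAX and RETRY_DELAY are illustrative names and defaults.
RETRY_MAX=${RETRY_MAX:-5}
RETRY_DELAY=${RETRY_DELAY:-2}

retry_on_connection_failure() {
    attempt=1
    while :; do
        "$@"
        status=$?
        # Success or a non-transient error: stop immediately, with no delay.
        if [ "$status" -ne 2 ]; then
            return "$status"
        fi
        # Transient failure, but the retry budget is used up.
        if [ "$attempt" -ge "$RETRY_MAX" ]; then
            echo "giving up after $attempt attempts" >&2
            return "$status"
        fi
        echo "attempt $attempt failed with code 2, retrying in ${RETRY_DELAY}s" >&2
        sleep "$RETRY_DELAY"
        attempt=$((attempt + 1))
    done
}

# Example (not run here):
# retry_on_connection_failure psql cfdb -c "select 1"
```

With these assumed defaults, the added wait is bounded by (RETRY_MAX - 1) × RETRY_DELAY seconds plus the run time of each attempt, and a command that succeeds on the first try incurs no delay.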
@larsewi larsewi changed the title from "ENT-14140: federation.cf: gate postgres interaction on recent service restart" to "ENT-14140: psql_wrapper.sh: retry psql commands on transient failures" May 27, 2026

nickanderson commented May 27, 2026

> The only thing @nickanderson is that it will cause the agent to hang while it bootstraps. Or should it perhaps run these commands in the background?

Hang permanently, or just be slow? Permanent hang needs to be avoided. Just reading the code there it looks like it might hang for up to 30 seconds while re-trying. Also, this hang would be limited to the hub bootstrapping to itself, is that right?

larsewi commented May 28, 2026

> > The only thing @nickanderson is that it will cause the agent to hang while it bootstraps. Or should it perhaps run these commands in the background?
>
> Hang permanently, or just be slow? Permanent hang needs to be avoided. Just reading the code there it looks like it might hang for up to 30 seconds while re-trying. Also, this hang would be limited to the hub bootstrapping to itself, is that right?

No, every agent run on the superhub is affected. Feeders are not, as far as I can see (due to am_superhub::). But there is no delay on successful runs; the delay should only occur when postgres is not responding.

@craigcomstock craigcomstock left a comment

better. yes.

larsewi commented May 29, 2026

@cf-bottom Jenkins please :)

@cf-bottom

@larsewi larsewi merged commit c8e25d1 into cfengine:master May 31, 2026
38 checks passed
@larsewi larsewi removed the "cherry-pick?" label (Fixes which may need to be cherry-picked to LTS branches) May 31, 2026
@larsewi larsewi deleted the fr-race branch May 31, 2026 17:03

4 participants