ENT-14140: psql_wrapper.sh: retry psql commands on transient failures by larsewi · Pull Request #3165 · cfengine/masterfiles

larsewi · 2026-05-26T11:55:17Z

Observed a race condition in CI where bundle agent superhub_schema interacts with postgres shortly after service restart.

03:12:04 systemd: Stopping CFEngine Enterprise PostgreSQL Database...
03:12:04 systemd: Started CFEngine Enterprise PostgreSQL Database.
03:12:04 cf-agent: Executing ... psql_wrapper.sh cfdb select superhub_schema(...)
03:12:05 cf-agent: returned code '2' defined as promise failed

Fixed by gating superhub_schema, ensure_feeders, and imported_data on a persistent class set by the cf-postgres restart.

Ticket: ENT-14140

Backported to:

craigcomstock

This doesn't feel quite right to me. It seems we need a sequence of actions rather than a class/gate situation: the restart has to finish, and then superhub_schema() runs. With this solution, superhub_schema() would run at the next agent interval, which is OK but maybe not ideal. Instead of gating on a recent restart, could we gate on PostgreSQL being up and ready, in the hope that superhub_schema() can run in the same agent run?
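For illustration only, gating on "up and ready" could look roughly like the polling helper below. This is a sketch, not code from this PR: the function name wait_until_ready and the timeout value are hypothetical, and the commented example assumes the standard PostgreSQL pg_isready utility is on PATH.

```shell
#!/bin/sh
# Hypothetical sketch: poll a readiness probe until it succeeds or a
# timeout (in seconds) expires. Names and values are illustrative only.

wait_until_ready() {
    timeout="$1"
    shift

    waited=0
    # Run the probe command; loop for as long as it reports "not ready".
    while ! "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1    # still not ready when the time budget ran out
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example (assumes pg_isready is installed and the database is named cfdb):
#   wait_until_ready 30 pg_isready -q -d cfdb && echo "postgres is ready"
```

The probe is passed in as a command so the same helper works with any readiness check; when the probe succeeds on the first try, no sleep happens at all.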

larsewi · 2026-05-26T14:04:10Z

From the psql documentation on exit status: "2 if the connection to the server went bad and the session was not interactive"

https://www.postgresql.org/docs/current/app-psql.html

@craigcomstock, @nickanderson what if we have psql_wrapper.sh retry when psql returns code 2?
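A minimal sketch of what retrying on return code 2 could look like, assuming a bounded number of attempts with a fixed pause between them. This is not the merged psql_wrapper.sh: the function name retry_on_exit_2 and its parameters are hypothetical, chosen only to illustrate the idea.

```shell
#!/bin/sh
# Hypothetical sketch: re-run a command while it exits with status 2,
# which psql documents as "the connection to the server went bad".
# Function name, attempt count, and delay are illustrative assumptions.

retry_on_exit_2() {
    max_attempts="$1"
    delay="$2"
    shift 2

    attempt=1
    while :; do
        rc=0
        "$@" || rc=$?

        # Success or a non-transient error (any status other than 2) is final.
        if [ "$rc" -ne 2 ]; then
            return "$rc"
        fi
        # Give up once the attempt budget is spent.
        if [ "$attempt" -ge "$max_attempts" ]; then
            return "$rc"
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (assumes psql is on PATH and a database named cfdb exists):
#   retry_on_exit_2 5 1 psql -d cfdb -c 'SELECT 1'
```

Because only status 2 triggers another attempt, a successful command returns immediately and other failures are reported right away; the worst-case wait is bounded by the attempt count and the delay.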

nickanderson · 2026-05-26T16:57:48Z

yeah I think that would be better.

larsewi · 2026-05-27T08:02:54Z

The only thing, @nickanderson, is that it will cause the agent to hang while it bootstraps. Or should it perhaps run these commands in the background?

Retry psql command on transient failures. E.g., when postgres is being restarted due to config change.

Ticket: ENT-14140
Changelog: psql commands are now retried on transient errors in federated reporting
Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>

nickanderson · 2026-05-27T20:29:35Z

Hang permanently, or just be slow? A permanent hang needs to be avoided. From reading the code, it looks like it might hang for up to 30 seconds while retrying. Also, this hang would be limited to the hub bootstrapping to itself, is that right?

larsewi · 2026-05-28T08:23:32Z

No, every agent run on the superhub is affected. Feeders are not, as far as I can see (due to am_superhub::). But there is no delay on successful runs; the delay should only occur when postgres is not responding.

craigcomstock

better. yes.

larsewi · 2026-05-29T13:52:40Z

@cf-bottom Jenkins please :)

cf-bottom · 2026-05-29T13:58:01Z

Alright, I triggered a build:

Jenkins: https://ci.cfengine.com/job/pr-pipeline/13880/

Packages: http://buildcache.cfengine.com/packages/testing-pr/jenkins-pr-pipeline-13880/

larsewi requested review from craigcomstock and nickanderson May 26, 2026 11:55

larsewi added the "cherry-pick?" label (Fixes which may need to be cherry-picked to LTS branches) May 26, 2026

craigcomstock reviewed May 26, 2026

nickanderson reviewed May 26, 2026

Comment thread cfe_internal/enterprise/federation/federation.cf Outdated

larsewi force-pushed the fr-race branch from ba31e72 to ac352ff Compare May 27, 2026 10:38

larsewi changed the title from "ENT-14140: federation.cf: gate postgres interaction on recent service restart" to "ENT-14140: psql_wrapper.sh: retry psql commands on transient failures" May 27, 2026

larsewi requested review from craigcomstock and nickanderson May 27, 2026 10:40

craigcomstock approved these changes May 29, 2026

larsewi merged commit c8e25d1 into cfengine:master May 31, 2026
38 checks passed

This was referenced May 31, 2026

ENT-14140: psql_wrapper.sh: retry psql commands on transient failures (3.27.x) #3167

Open

ENT-14140: psql_wrapper.sh: retry psql commands on transient failures (3.24.x) #3168

Open

larsewi removed the "cherry-pick?" label (Fixes which may need to be cherry-picked to LTS branches) May 31, 2026

larsewi deleted the fr-race branch May 31, 2026 17:03

larsewi merged 1 commit into cfengine:master from larsewi:fr-race

4 participants
