🐛(ZENKO-5288) wrap kafka-server-start.sh to propagate broker exit code #2428

Open
DarkIsDude wants to merge 1 commit into development/2.15 from feature/ZENKO-5288/kafka-broker-exit-code

Conversation

@DarkIsDude
Contributor

@DarkIsDude DarkIsDude commented Jun 3, 2026

Summary

  • Rename kafka-server-start.sh to kafka-server-start-real.sh at image build time
  • Add a new kafka-server-start.sh wrapper that captures Kafka's exit code and writes it to /var/run/kafka-exit/code before returning
  • This removes the need for the kafka-scripts-patcher init container in the operator (see scality/zenko-operator#620)

Context

Koperator's entrypoint calls kafka-server-start.sh then unconditionally runs rm /var/run/wait/do-not-exit-yet (exit 0), masking any non-zero exit code from the broker. By baking the wrapper into the image, we capture the exit code before that masking occurs, allowing the exit-code-propagator sidecar in the operator to set the pod phase to Failed when Kafka crashes.
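
The masking described above can be reproduced in isolation. This is a minimal sketch, not Koperator's actual entrypoint: `fake_broker` and `entrypoint_like` are hypothetical stand-ins, and the marker is a temporary file rather than /var/run/wait/do-not-exit-yet.

```shell
#!/bin/bash
# Hypothetical stand-in for a broker that crashes with exit code 3.
fake_broker() {
  return 3
}

# Mimics the pattern described above: run the broker, then unconditionally
# remove a marker file. The function's status is that of its last command (rm).
entrypoint_like() {
  local marker="$1"
  fake_broker
  rm -f "$marker"
}

marker="$(mktemp)"
if entrypoint_like "$marker"; then
  masked_status=0
else
  masked_status=$?
fi
echo "status seen by the caller: $masked_status"
```

The caller sees 0 even though the broker stand-in returned 3, which is why the exit code has to be recorded before the cleanup step runs.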

Test plan

  • Build the Kafka image and verify kafka-server-start-real.sh and the new kafka-server-start.sh wrapper are present
  • Confirm that a crashing broker writes a non-zero exit code to /var/run/kafka-exit/code
  • Verify the operator's exit-code-propagator sidecar exits with the correct code and pod phase is Failed
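
The second check in the test plan could be scripted roughly as follows. This is only a sketch: `exit_code_is_failure` is a hypothetical helper, and the demo reads a temporary file instead of /var/run/kafka-exit/code.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if the file holds a non-zero integer.
# Returns 2 when the file is missing or malformed, 1 when it holds 0.
exit_code_is_failure() {
  local file="$1" code
  [ -f "$file" ] || return 2          # file missing: the wrapper never wrote it
  code="$(cat "$file")"
  case "$code" in
    ''|*[!0-9]*) return 2 ;;          # empty or not a plain integer
  esac
  [ "$code" -ne 0 ]                   # 0 means the broker exited cleanly
}

# Demo against a temporary file instead of the real path.
demo_file="$(mktemp)"
printf '%d' 137 > "$demo_file"
if exit_code_is_failure "$demo_file"; then
  echo "broker failure recorded"
fi
```

In a real check, the file would be read from the running pod rather than from a temporary file.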

I also opened an upstream fix: adobe/koperator#260. IMO we should ship our fix now and remove it later. Without this fix, it's hard for the CS team to debug issues.

Issue: ZENKO-5288

@DarkIsDude DarkIsDude self-assigned this Jun 3, 2026
@bert-e
Contributor

bert-e commented Jun 3, 2026

Hello darkisdude,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Contributor

bert-e commented Jun 3, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

@DarkIsDude DarkIsDude requested review from a team, benzekrimaha and delthas June 3, 2026 14:19
@DarkIsDude DarkIsDude marked this pull request as ready for review June 3, 2026 14:19
Comment thread solution/kafka/Dockerfile
Comment on lines +45 to +52
{ \
echo '#!/bin/bash'; \
echo '"$(dirname "$0")/kafka-server-start-real.sh" "$@"'; \
echo 'KAFKA_EXIT=$?'; \
echo 'printf "%d" "$KAFKA_EXIT" > /var/run/kafka-exit/code'; \
echo 'exit "$KAFKA_EXIT"'; \
} > ${KAFKA_HOME}/bin/kafka-server-start.sh && \
chmod +x ${KAFKA_HOME}/bin/kafka-server-start.sh

Instead of generating the script inline, it would be best to add it as a file in the folder and use the ADD command.
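
For illustration, here is roughly what the wrapper could look like as a standalone file, exercised against a stub. This is a sketch under assumptions: the heredocs only materialize the files so the demo can run anywhere, the stub replaces the real kafka-server-start-real.sh, and `KAFKA_EXIT_FILE` is a hypothetical override that is not part of this PR.

```shell
#!/bin/bash
# Demo only: materialize the wrapper and a stub broker script in a temp dir.
workdir="$(mktemp -d)"

# Stub standing in for the renamed kafka-server-start-real.sh; it "crashes" with 42.
cat > "$workdir/kafka-server-start-real.sh" <<'EOF'
#!/bin/bash
exit 42
EOF

# The wrapper as it could be checked in next to the Dockerfile.
# KAFKA_EXIT_FILE is a hypothetical override so the sketch runs outside the image.
cat > "$workdir/kafka-server-start.sh" <<'EOF'
#!/bin/bash
exit_file="${KAFKA_EXIT_FILE:-/var/run/kafka-exit/code}"
"$(dirname "$0")/kafka-server-start-real.sh" "$@"
KAFKA_EXIT=$?
printf '%d' "$KAFKA_EXIT" > "$exit_file"
exit "$KAFKA_EXIT"
EOF
chmod +x "$workdir"/*.sh

# Run the wrapper and capture both its status and the recorded code.
wrapper_status=0
KAFKA_EXIT_FILE="$workdir/code" "$workdir/kafka-server-start.sh" || wrapper_status=$?
recorded="$(cat "$workdir/code")"
echo "wrapper exited with $wrapper_status, recorded $recorded"
```

Keeping the wrapper as a real file also makes it possible to lint and test it outside the image build.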

Comment thread solution/kafka/Dockerfile

RUN chmod a+x ${KAFKA_HOME}/bin/*.sh

RUN mv ${KAFKA_HOME}/bin/kafka-server-start.sh ${KAFKA_HOME}/bin/kafka-server-start-real.sh && \

I think it would be better to keep the existing file and just create another one next to it.
We could then either use our own script as the ENTRYPOINT, or change the CMD to our own?


"Koperator's entrypoint calls kafka-server-start.sh"
It is not clear what this means precisely: is Koperator setting a custom ENTRYPOINT, overriding the COMMAND, or something else?

Comment thread solution/kafka/Dockerfile
echo '#!/bin/bash'; \
echo '"$(dirname "$0")/kafka-server-start-real.sh" "$@"'; \
echo 'KAFKA_EXIT=$?'; \
echo 'printf "%d" "$KAFKA_EXIT" > /var/run/kafka-exit/code'; \

nit: instead of a wrapper, we may be able to add a trap command at the beginning of the script to store the exit code in /var/run/kafka-exit/code before returning
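
A rough sketch of the trap idea, under assumptions: the script body is a stand-in rather than the real kafka-server-start.sh, and `EXIT_FILE` is a hypothetical override used only so the demo can run outside the image. One caveat: an EXIT trap never fires if the script replaces itself with `exec`, so this only works when the broker runs as a child process. It would be worth checking how the Kafka start script launches the JVM before relying on it.

```shell
#!/bin/bash
# Demo only: write a small trap-based script to a temp file and run it.
script="$(mktemp)"
cat > "$script" <<'EOF'
#!/bin/bash
# Record whatever status this script exits with.
# EXIT_FILE is a hypothetical override for the demo.
exit_file="${EXIT_FILE:-/var/run/kafka-exit/code}"
trap 'printf "%d" "$?" > "$exit_file"' EXIT
# Stand-in for the broker, run as a child process (not via exec).
sh -c 'exit 7'
EOF

# Run the script and capture both its status and the recorded code.
trap_status=0
EXIT_FILE="$script.code" bash "$script" || trap_status=$?
recorded="$(cat "$script.code")"
echo "script exited with $trap_status, recorded $recorded"
```

Because the trap does not call `exit` itself, the script still returns the broker stand-in's status after recording it.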

@francoisferrand
Contributor

Koperator's entrypoint calls kafka-server-start.sh then unconditionally runs rm /var/run/wait/do-not-exit-yet (exit 0), masking any non-zero exit code from the broker. By baking the wrapper into the image, we capture the exit code before that masking occurs, allowing the exit-code-propagator sidecar in the operator to set the pod phase to Failed when Kafka crashes.

Did you find out how it got there? It seems really weird that Koperator went to the trouble of adding the sidecar and setting do-not-exit-yet, but never managed to actually "collect" the exit code...
