fix(dev): make 'make dev-up' work — readable temp compose files + HAKeeper bootstrap#24974
Conversation
A snap-confined Docker daemon cannot read files under /tmp, so 'make dev-up' failed with 'open /tmp/docker-compose.default-limits.<pid>.yml: no such file or directory' before any container started. Write the generated default-limits and extra-mounts compose fragments in the project directory (under $HOME, which the daemon can access) using relative paths so docker compose keeps it as the project directory. Also fixes a latent cleanup bug: the two per-block 'trap ... EXIT' handlers clobbered each other (the second overwrote the first), leaking the default-limits file whenever extra mounts were used. Both temp files are now tracked in one list and removed by a single trap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The generated log.toml had no [logservice] section, so when each node was started separately with '-cfg log.toml' nothing ever bootstrapped HAKeeper and the log store panicked with 'shard not bootstrapped' on first start. (The launch stacks avoid this only because '-launch' builds the bootstrap config programmatically; no checked-in toml sets bootstrap-cluster. This was masked until now because compose previously died reading the /tmp limits file before any container started.) generate-config.sh now emits a [logservice] section that gives mo-log a UUID and bootstraps a single-replica HAKeeper (1 log shard, 1 tn shard, 1 replica, init member 131072:<uuid>), matching pkg/embed/cluster.go and logservice.DefaultConfig. Addresses advertise the mo-log hostname while listening on 0.0.0.0, and gossip-allow-self-as-seed is set because Fill() (unlike DefaultConfig) does not default it — required for a lone seed node. The toml files themselves are gitignored (generated), so the fix lives in the generator. Verified from a clean state: make dev-clean && make dev-up brings up all 5 containers, mo-log reports healthy with no panic, and create/insert/select round-trips through the proxy. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
XuPeng-SH
left a comment
There was a problem hiding this comment.
The stale-config regeneration hole is fixed on the latest head, but I found one new correctness issue in start.sh.\n\netc/docker-multi-cn-local-disk/start.sh:212-216 installs an EXIT trap whose body is just [ ${#TMP_COMPOSE_FILES[@]} -gt 0 ] && rm -f .... When no temp compose fragments were created, that test returns status 1, so the trap itself exits non-zero. Under set -e, that turns some otherwise successful start.sh runs into exit code 1. Reproducing the trap logic directly shows the problem: bash -lc 'set -e; TMP_COMPOSE_FILES=(); cleanup_tmp_compose(){ [ ${#TMP_COMPOSE_FILES[@]} -gt 0 ] && rm -f "${TMP_COMPOSE_FILES[@]}"; }; trap cleanup_tmp_compose EXIT; : '; echo EXIT:$? returns EXIT:1.\n\nPlease make the trap handler always succeed (for example, guard the rm with an if block and/or end the function with return 0).
|
Suggested fix for the new EXIT-trap issue in cleanup_tmp_compose() {
if [ ${#TMP_COMPOSE_FILES[@]} -gt 0 ]; then
rm -f "${TMP_COMPOSE_FILES[@]}"
fi
return 0
}The important part is that the trap handler must always exit successfully when there is nothing to clean up; otherwise some successful |
Addresses review feedback: the trap body was a single
`[ ${#TMP_COMPOSE_FILES[@]} -gt 0 ] && rm -f ...` expression, which
returns status 1 when no temp compose fragments were created. As an EXIT
trap under `set -e`, that non-zero status turned otherwise-successful
start.sh runs into exit code 1.
Guard the rm with an if-block and end the handler with `return 0` so the
trap always exits successfully.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@XuPeng-SH Fixed in 85c6450.
cleanup_tmp_compose() {
if [ ${#TMP_COMPOSE_FILES[@]} -gt 0 ]; then
rm -f "${TMP_COMPOSE_FILES[@]}"
fi
return 0
}Verified with your repro — |
XuPeng-SH
left a comment
There was a problem hiding this comment.
Re-checked the latest head. The stale-config rollout hole is now fixed by regenerating old log.toml files that lack [logservice], and the new cleanup_tmp_compose() trap fix correctly preserves exit code 0 when no temp fragments were created while still cleaning up when they are. I do not see a remaining correctness issue that should keep this blocked.
What
Two fixes that let the
etc/docker-multi-cn-local-diskstack (make dev-up) actually start a working multi-CN cluster. The first failure masked the second, so both are needed.1. Temp compose fragments written to
/tmp(start.sh)start.shgenerated the default-limits and extra-mounts compose fragments under/tmp. A snap-confined Docker daemon cannot read/tmp, sodocker composefailed before any container started:Now the fragments are written in the project directory (under
$HOME, which the daemon can access) using relative paths, so docker compose keeps it as the project directory. Also fixes a latent cleanup bug: the two per-blocktrap ... EXIThandlers clobbered each other (the second overwrote the first), leaking the default-limits file whenever extra mounts were used — both temp files are now tracked in one list and removed by a single trap.2. HAKeeper never bootstrapped (
generate-config.sh)The generated
log.tomlhad no[logservice]section. Each node is started separately with-cfg log.toml(not-launch), so nothing ever bootstrapped HAKeeper and the log store panicked on first start:The launch stacks avoid this only because
-launchbuilds the bootstrap config programmatically; no checked-in toml setsbootstrap-cluster. (This was invisible until fix #1, since compose always died at the/tmpstep before a container ran.)generate-config.shnow emits a[logservice]section that givesmo-loga UUID and bootstraps a single-replica HAKeeper (1 log shard, 1 tn shard, 1 replica, init member131072:<uuid>), matchingpkg/embed/cluster.goandlogservice.DefaultConfig. Addresses advertise themo-loghostname while listening on0.0.0.0, andgossip-allow-self-as-seedis set becauseFill()(unlikeDefaultConfig) does not default it — required for a lone seed node. The*.tomlfiles are gitignored (generated), so the fix lives in the generator.Verification
From a clean state (
make dev-clean && make dev-up), validated through the generated config path (tomls removed → regenerated bygenerate-config.sh→ fresh bring-up):mo-logreports healthy, zeroshard not bootstrappedpanicscreate / insert / select sum()round-trips through the proxy on port 6001🤖 Generated with Claude Code