Skip to content

fix(dev): make 'make dev-up' work — readable temp compose files + HAKeeper bootstrap#24974

Merged
fengttt merged 7 commits into
matrixorigin:mainfrom
fengttt:fix/dev-up-multi-cn
Jun 21, 2026
Merged

fix(dev): make 'make dev-up' work — readable temp compose files + HAKeeper bootstrap#24974
fengttt merged 7 commits into
matrixorigin:mainfrom
fengttt:fix/dev-up-multi-cn

Conversation

@fengttt

@fengttt fengttt commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

What

Two fixes that let the etc/docker-multi-cn-local-disk stack (make dev-up) actually start a working multi-CN cluster. The first failure masked the second, so both are needed.

1. Temp compose fragments written to /tmp (start.sh)

start.sh generated the default-limits and extra-mounts compose fragments under /tmp. A snap-confined Docker daemon cannot read /tmp, so docker compose failed before any container started:

open /tmp/docker-compose.default-limits.<pid>.yml: no such file or directory

Now the fragments are written in the project directory (under $HOME, which the daemon can access) using relative paths, so docker compose keeps it as the project directory. Also fixes a latent cleanup bug: the two per-block trap ... EXIT handlers clobbered each other (the second overwrote the first), leaking the default-limits file whenever extra mounts were used — both temp files are now tracked in one list and removed by a single trap.

2. HAKeeper never bootstrapped (generate-config.sh)

The generated log.toml had no [logservice] section. Each node is started separately with -cfg log.toml (not -launch), so nothing ever bootstrapped HAKeeper and the log store panicked on first start:

panic: shard not bootstrapped

The launch stacks avoid this only because -launch builds the bootstrap config programmatically; no checked-in toml sets bootstrap-cluster. (This was invisible until fix #1, since compose always died at the /tmp step before a container ran.)

generate-config.sh now emits a [logservice] section that gives mo-log a UUID and bootstraps a single-replica HAKeeper (1 log shard, 1 tn shard, 1 replica, init member 131072:<uuid>), matching pkg/embed/cluster.go and logservice.DefaultConfig. Addresses advertise the mo-log hostname while listening on 0.0.0.0, and gossip-allow-self-as-seed is set because Fill() (unlike DefaultConfig) does not default it — required for a lone seed node. The *.toml files are gitignored (generated), so the fix lives in the generator.

Verification

From a clean state (make dev-clean && make dev-up), validated through the generated config path (tomls removed → regenerated by generate-config.sh → fresh bring-up):

  • all 5 containers up, mo-log reports healthy, zero shard not bootstrapped panics
  • create / insert / select sum() round-trips through the proxy on port 6001

🤖 Generated with Claude Code

fengttt and others added 2 commits June 13, 2026 11:03
A snap-confined Docker daemon cannot read files under /tmp, so 'make dev-up'
failed with 'open /tmp/docker-compose.default-limits.<pid>.yml: no such file or
directory' before any container started. Write the generated default-limits and
extra-mounts compose fragments in the project directory (under $HOME, which the
daemon can access) using relative paths so docker compose keeps it as the
project directory.

Also fixes a latent cleanup bug: the two per-block 'trap ... EXIT' handlers
clobbered each other (the second overwrote the first), leaking the
default-limits file whenever extra mounts were used. Both temp files are now
tracked in one list and removed by a single trap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The generated log.toml had no [logservice] section, so when each node was
started separately with '-cfg log.toml' nothing ever bootstrapped HAKeeper and
the log store panicked with 'shard not bootstrapped' on first start. (The launch
stacks avoid this only because '-launch' builds the bootstrap config
programmatically; no checked-in toml sets bootstrap-cluster. This was masked
until now because compose previously died reading the /tmp limits file before
any container started.)

generate-config.sh now emits a [logservice] section that gives mo-log a UUID and
bootstraps a single-replica HAKeeper (1 log shard, 1 tn shard, 1 replica, init
member 131072:<uuid>), matching pkg/embed/cluster.go and logservice.DefaultConfig.
Addresses advertise the mo-log hostname while listening on 0.0.0.0, and
gossip-allow-self-as-seed is set because Fill() (unlike DefaultConfig) does not
default it — required for a lone seed node. The toml files themselves are
gitignored (generated), so the fix lives in the generator.

Verified from a clean state: make dev-clean && make dev-up brings up all 5
containers, mo-log reports healthy with no panic, and create/insert/select
round-trips through the proxy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fengttt fengttt requested a review from XuPeng-SH as a code owner June 13, 2026 18:12
@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@XuPeng-SH XuPeng-SH left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stale-config regeneration hole is fixed on the latest head, but I found one new correctness issue in start.sh.\n\netc/docker-multi-cn-local-disk/start.sh:212-216 installs an EXIT trap whose body is just [ ${#TMP_COMPOSE_FILES[@]} -gt 0 ] && rm -f .... When no temp compose fragments were created, that test returns status 1, so the trap itself exits non-zero. Under set -e, that turns some otherwise successful start.sh runs into exit code 1. Reproducing the trap logic directly shows the problem: bash -lc 'set -e; TMP_COMPOSE_FILES=(); cleanup_tmp_compose(){ [ ${#TMP_COMPOSE_FILES[@]} -gt 0 ] && rm -f "${TMP_COMPOSE_FILES[@]}"; }; trap cleanup_tmp_compose EXIT; : '; echo EXIT:$? returns EXIT:1.\n\nPlease make the trap handler always succeed (for example, guard the rm with an if block and/or end the function with return 0).

@XuPeng-SH

Copy link
Copy Markdown
Contributor

Suggested fix for the new EXIT-trap issue in start.sh:

cleanup_tmp_compose() {
    if [ ${#TMP_COMPOSE_FILES[@]} -gt 0 ]; then
        rm -f "${TMP_COMPOSE_FILES[@]}"
    fi
    return 0
}

The important part is that the trap handler must always exit successfully when there is nothing to clean up; otherwise some successful start.sh runs still end with exit code 1 under set -e.

Addresses review feedback: the trap body was a single
`[ ${#TMP_COMPOSE_FILES[@]} -gt 0 ] && rm -f ...` expression, which
returns status 1 when no temp compose fragments were created. As an EXIT
trap under `set -e`, that non-zero status turned otherwise-successful
start.sh runs into exit code 1.

Guard the rm with an if-block and end the handler with `return 0` so the
trap always exits successfully.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fengttt

fengttt commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

@XuPeng-SH Fixed in 85c6450.

cleanup_tmp_compose now guards the rm with an if block and ends with return 0, so the EXIT trap always exits successfully even when no temp compose fragments were created:

cleanup_tmp_compose() {
    if [ ${#TMP_COMPOSE_FILES[@]} -gt 0 ]; then
        rm -f "${TMP_COMPOSE_FILES[@]}"
    fi
    return 0
}

Verified with your repro — trap ... EXIT; : now yields EXIT:0 instead of EXIT:1 under set -e.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread etc/docker-multi-cn-local-disk/start.sh

@XuPeng-SH XuPeng-SH left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-checked the latest head. The stale-config rollout hole is now fixed by regenerating old log.toml files that lack [logservice], and the new cleanup_tmp_compose() trap fix correctly preserves exit code 0 when no temp fragments were created while still cleaning up when they are. I do not see a remaining correctness issue that should keep this blocked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Denotes a PR that changes [10,99] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants