Docs: machine-specific cluster tree + freshness pass #739

Open
cailmdaley wants to merge 30 commits into develop from docs/rework
cailmdaley commented May 31, 2026


Update: this PR now also carries the README front door and the basic_execution.md MPI run docs, relocated from #737 so that all user-facing docs live here. The candide cluster walkthrough lives in clusters.md (not duplicated in container.md).


Audited every narrative docs page against the current code. The install / container / testing / API pages were already fresh (the conda→uv/container work kept them current); staleness concentrated in cluster docs and a handful of content errors. This PR fixes both.

Machine-specific cluster tree

Cluster guidance was scattered and half-invisible: candide lived only inside container.md (and only on the #737 branch), canfar was split across orphaned pages, and none of the canfar/candide pages were in the sidebar at all.

New single clusters.md under a "Running on a cluster" toctree caption:

  • The pattern: the shared truths. The container is the unit of execution, bind-mount your clone at the same path, keep SIFs/cache off a quota-limited $HOME.
  • candide (SLURM): sbatch, the candide_{smp,mpi}.sh scripts, the quota-safe pull → submit, partitions, the MPI/PMIx note.
  • CANFAR: the current model (canfar_submit_job / canfar_monitor console scripts), with the deep production walkthrough kept in pipeline_canfar.md (linked, and now in the toctree).
  • ccin2p3: honest stub (not yet containerized).
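
The shared pattern can be sketched as a small helper that assembles an `apptainer exec` command line. This is an illustration under stated assumptions, not the contents of the candide scripts: the function names, the example paths, and the example run command are hypothetical placeholders. Only `apptainer exec`, its `--bind src:dest` option, and the `APPTAINER_CACHEDIR` environment variable come from Apptainer itself.

```python
import os
import shlex
from pathlib import PurePosixPath


def build_container_command(sif, clone, command):
    """Return the argv that runs `command` inside the image `sif`.

    The clone is bind-mounted at the same absolute path inside the
    container, so paths written in configs resolve identically on both sides.
    """
    clone = PurePosixPath(clone)
    if not clone.is_absolute():
        raise ValueError("clone must be an absolute path")
    return [
        "apptainer", "exec",
        "--bind", f"{clone}:{clone}",  # same path outside and inside
        str(sif),
        *command,
    ]


def cache_environment(scratch):
    """Return a copy of the environment with the Apptainer cache on scratch.

    Keeps pulled layers off a quota-limited $HOME.
    """
    env = dict(os.environ)
    env["APPTAINER_CACHEDIR"] = str(PurePosixPath(scratch) / "apptainer_cache")
    return env


if __name__ == "__main__":
    argv = build_container_command(
        sif="/scratch/me/shapepipe.sif",   # hypothetical image location
        clone="/scratch/me/shapepipe",     # hypothetical clone location
        command=["shapepipe_run", "-c", "config.ini"],  # hypothetical run
    )
    print(shlex.join(argv))
```

Passing the result to `subprocess.run(argv, env=cache_environment("/scratch/me"))` would launch the job; on a SLURM machine the same command line would typically sit inside the batch script instead.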

Deleted obsolete pages

canfar.md (old curl-VM submission, superseded by canfar_submit_job), pipeline_v2.0.md (personal paths, a missing script), work_flow_v2.0.md (an unrealized planning wishlist) — all three orphaned. The v2.0 wishlist was preserved in the team's felt store before deletion.
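
"Orphaned" here means a page that no toctree references, so it never appears in the sidebar. A minimal sketch of how such pages could be detected, assuming MyST-style fenced `{toctree}` directives in a single index file; the function names are illustrative and this is not the audit procedure used for this PR.

```python
FENCE = "`" * 3  # MyST directive fence, built here to avoid nesting fences


def toctree_entries(index_text):
    """Collect page names listed inside MyST toctree directives."""
    entries = set()
    inside = False
    for raw in index_text.splitlines():
        line = raw.strip()
        if line.startswith(FENCE + "{toctree}"):
            inside = True
        elif line.startswith(FENCE):
            inside = False  # closing fence, or some other directive
        elif inside and line and not line.startswith(":"):
            # Skip blank lines and ":option:" lines; keep page references.
            entries.add(line.removesuffix(".md"))
    return entries


def find_orphans(index_text, pages):
    """Return the pages that no toctree in `index_text` references."""
    listed = toctree_entries(index_text)
    return {page for page in pages if page.removesuffix(".md") not in listed}
```

A real docs tree can spread toctrees over several files, so a complete check would union the entries from every page rather than reading one index.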

Content fixes

  • dependencies.md: rewritten against pyproject.toml. Reframed around the abstract-minimums + uv.lock SSOT (was "pinned per release"); ngmix now points at the aguinot/ngmix@stable_version fork (was esheldon upstream); dropped the phantom CDSclient; added the missing CANFAR/data stack (vos, skaha, canfar, cs_util, astroquery, reproject, h5py, numba).
  • post_processing.md: dropped the removed rho-statistics step and the dead prepare_tiles_for_final command; added a legacy banner pointing at sp_validation.
  • random_cat.md: added a legacy banner; fixed the module name random_runner → random_cat_runner.
  • pipeline_canfar.md: flagged the matched-star / coverage-mask helpers that moved to sp_validation.
  • basic_execution.md: replaced the conda-era "activate the environment" framing with the container reality. MPI sections deferred pending the keep/drop decision on #737 ("Fix MPI on candide (OpenMPI 5 image + latent code bug); containerize & SLURM-ify candide scripts").
  • Cosmetics: configuration.md (conifg → config, NUMBERING_LIST → NUMBER_LIST), contributing.md (Pleas → Please), module_develop.md (src/shapepipe/modules).

Verification

Local sphinx-book-theme build succeeds. The one new warning the tree introduced (a clusters.md heading anchor) is fixed; remaining warnings are all pre-existing (the autosummary API page needs the installed package; the multiple-toctree notice fires on every page).

Relationship to the other docs PRs

— Claude on behalf of Cail

cailmdaley requested review from martinkilbinger and sfarrens and removed request for sfarrens May 31, 2026 20:14
cailmdaley added a commit that referenced this pull request May 31, 2026
Three fibers from this session's docs work:
- docs-versioning: the versioned-site + switcher design (#738) and the
  recurring unexercised-path bit-rot pattern.
- docs-cluster-tree: the machine-specific clusters.md decision (#739) and why
  a single page beat a thin standalone general page.
- v2-run-plan: the v2.0 run wishlist rescued from the deleted
  work_flow_v2.0.md docs page before removal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cailmdaley added a commit that referenced this pull request May 31, 2026
The README front door, the container.md 'Running on a cluster' section, and the
basic_execution.md MPI docs are relocated to #739, which owns the full docs
story (cluster docs now live in a dedicated clusters.md, so keeping the
walkthrough here too would duplicate it). This PR keeps only the code/infra and
the CLAUDE.md build-loop note that the container changes here introduce.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cailmdaley and others added 3 commits May 31, 2026 23:41
Audited every narrative docs page against the current code. The install /
container / testing / API pages were already fresh; the staleness concentrated
in cluster docs and a few content errors. This rework:

**Machine-specific cluster tree.** Cluster guidance was scattered and half
of it invisible (candide lived only inside container.md on a feature branch;
canfar was split across orphaned pages; none of canfar/candide were in the
sidebar). Add a single `clusters.md` under a new "Running on a cluster" toctree
caption: the shared pattern (container = unit of execution, bind-mount, keep
SIFs off a quota-limited $HOME), then per-machine sections for candide (SLURM,
the candide_{smp,mpi}.sh scripts, the quota-safe pull, MPI/PMIx) and CANFAR
(the current canfar_submit_job / canfar_monitor console scripts), with ccin2p3
stubbed. The deep CANFAR production walkthrough stays in pipeline_canfar.md,
linked, and is now in the toctree too.

**Delete obsolete pages.** canfar.md (the old curl-VM submission model,
superseded by canfar_submit_job), pipeline_v2.0.md (personal paths, a missing
script), and work_flow_v2.0.md (an unrealized planning wishlist) — all three
orphaned from the toctree. The v2.0 wishlist is preserved in the team's felt
store rather than lost.

**Fix content errors.**
- dependencies.md: rewritten against pyproject.toml. Reframed around the
  abstract-minimums + uv.lock SSOT (was "pinned per release"); ngmix now points
  at the aguinot/ngmix@stable_version fork (was esheldon upstream); dropped the
  phantom CDSclient; added the missing CANFAR/data stack (vos, skaha, canfar,
  cs_util, astroquery, reproject, h5py, numba).
- post_processing.md: dropped the removed rho-statistics step and the dead
  prepare_tiles_for_final command; added a legacy banner pointing at sp_validation.
- random_cat.md: legacy banner; fixed module name random_runner -> random_cat_runner.
- pipeline_canfar.md: flagged the matched-star / coverage-mask helpers that
  moved to sp_validation (merge_psf_cat.py, download_headers, …).
- basic_execution.md: replaced the conda-era "activate the environment" framing
  with the container reality. (MPI sections deferred pending the #737 decision.)
- configuration.md (conifg->config, NUMBERING_LIST->NUMBER_LIST),
  contributing.md (Pleas->Please), module_develop.md (src/shapepipe/modules).

Verified with a local sphinx-book-theme build: succeeds; the only new warning
the tree introduced (a clusters.md heading anchor) is fixed. Remaining warnings
are all pre-existing (the autosummary API page needs the installed package;
multiple-toctree notices on every page).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…itHub

The explicit MyST target showed as raw '(candide-slurm)=' in GitHub's blob
view (where PR links point readers). Use a plain-text in-page reference; the
candide section is still reachable via the sidebar and GitHub's own heading
anchor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Unify all user-facing docs in this PR (relocated from #737, which is now pure
code/infra):
- README front door (Quickstart + Documentation signpost). The signpost now
  has a dedicated 'Running on a cluster' entry pointing at clusters.html, and
  the container-workflow entry no longer claims to carry the cluster example
  (that lives in clusters.md).
- basic_execution.md MPI section: the hybrid-Apptainer run pattern and the
  OpenMPI-5 PMIx note, kept alongside the conda-framing fix.
- container.md gains a one-line pointer to clusters.md.

This removes the container.md/clusters.md duplication at the source rather than
reconciling it after merge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cailmdaley and others added 20 commits June 1, 2026 12:07
…staged review

Autonomous prep for the interactive review. Built the new module against real
ngmix 2.4.0 on candide and ran do_ngmix_metacal: shear recovery unbiased
(m=+2e-4). Centroid fix harmless but necessity not reproduced. CI never ran
(fork PR). Draft GitHub comments + report.html staged for the call.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lished); suite still pending merge-develop

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…now running on #741

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rd code-review + test work

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…etion landed; bug characterized as old-path m~-2.8e-2

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…port + get_guess removal); part 2 (methodology) deferred

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…11 inline + summary); review delivered, Martin to merge

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re-reviewed against current #741 head: no code changed since the part-2
review, so all 11 findings stand. Martin closed fork PR #740 and
consolidated onto #741 (canonical, green, mergeable); engaged only to ack
the RNG fix. Triaged 11 findings (5 cut-and-dry / 5 decisions / 1 resume);
weight-norm (949) + *_psfo (1045) flagged as the only two merge-gates.
report.html rewritten as next-steps triage; summary comment posted to
#741 (issuecomment-4626968551); prs-in-flight indexed with #740-closed
disposition. Report-only round; fiber closed for Cail's review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ied r50/T bug

Martin's morning pass (06-05): greenlit 254 (remove resume) + 766
(configurable stamp size); 737 any->all intentional; 949 -> issue #604;
opened r50/T naming, poked Lucy. Ran his explicit check: pars[4]=T=2sigma^2
confirmed, galaxy r50 stores T (area), PSF r50 stores sigma -- neither is
the half-light radius 1.1774*sigma. *_psfo (1045) now the lone unanswered
merge-gate. report.html + outcome + history refreshed. Analysis only, no
code pushed. Held interactive for Cail.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
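
The size relations quoted in the check above follow from the Gaussian profile: with T = 2σ², the radius enclosing half the flux of a round 2D Gaussian is sqrt(2 ln 2) σ ≈ 1.1774 σ. A minimal sketch of the conversions, assuming a round Gaussian; the function names are illustrative and are not taken from any of the packages discussed here.

```python
import math

# sqrt(2 ln 2): ratio of half-light radius to sigma for a round 2D Gaussian.
R50_PER_SIGMA = math.sqrt(2.0 * math.log(2.0))


def sigma_from_T(T):
    """Invert T = 2 * sigma**2 for a round Gaussian."""
    if T < 0:
        raise ValueError("T must be non-negative")
    return math.sqrt(T / 2.0)


def r50_from_sigma(sigma):
    """Half-light radius of a round Gaussian with width sigma."""
    return R50_PER_SIGMA * sigma


def r50_from_T(T):
    """Half-light radius from the area-like size T."""
    return r50_from_sigma(sigma_from_T(T))
```

T has units of area while sigma and the half-light radius are lengths, so a column that stores T under an `r50` name differs from the true half-light radius by more than a constant factor.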
…ecision-ready reports)

Workflow (6 investigators + 2 synth) on the two hard problems:
- WEIGHTS (#604+949): two coupled regressions in prepare_ngmix_weights
  (dead get_noise -> whole-stamp sigma_mad; lost binarization -> double-
  counts real ivar). Empirically confirmed (truth ivar 1e6 -> recovered
  8.8e11). Rec: split — minimal v1-restore in #741 + SExtractor
  BACKGROUND_RMS baseline as separate PR (closes #604).
- SIZE (r50/T): galaxy r50=T(area), PSF r50psf=sigma, neither=1.1774sigma;
  UNIONS-3500 paper reports r_h as primary. Rec: transform-at-source +
  cs_util converters; bonus — sp_validation T_to_fwhm dimensionally wrong.

Adds weights-report.md, size-report.md, deep-dive-report.html.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- shapepipe/ngmix-weights-ivar (Codex): fix the two prepare_ngmix_weights
  regressions (minimal v1-restore + red->green test in #741) and the
  SExtractor BACKGROUND_RMS inverse-variance baseline as a separate PR
  (closes #604). Points at weights-report.md.
- shapepipe/ngmix-size-columns: honest r50 at the ngmix source + cs_util
  converter web + fix the sp_validation T_to_fwhm leakage bug. Points at
  size-report.md.

Both installed as drafts pending Cail's dispatch decision.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…741

- Pushed cleanup bd60dc8 to origin/ngmix_v2.0 (dd4f656..bd60dc8).
- Enabled both oneshot shuttles: ngmix-weights-ivar (codex),
  ngmix-size-columns (claude-opus). Workers prepare branches/PRs+reports;
  merge stays Cail's. *_psfo gate + runner decorators tracked for the
  eventual #741 reply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…r review

shapepipe fix/ngmix-size-columns, cs_util feat/size-conversions (fork),
sp_validation fix/psf-leakage-fwhm. Also corrects size-report.md's
error-prop claim (old r50_err_PSFo was a factor-2 over-estimate).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
All three branches independently re-reviewed and re-tested; no code
defects. Report corrected: sp_validation CI is green-but-vacuous on
the cs_util.size import (suite never imports galaxy.py, image carries
released cs-util 0.1.9), so merge order is discipline-enforced.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
cailmdaley added a commit that referenced this pull request Jun 10, 2026
The README front door, the container.md 'Running on a cluster' section, and the
basic_execution.md MPI docs are relocated to #739, which owns the full docs
story (cluster docs now live in a dedicated clusters.md, so keeping the
walkthrough here too would duplicate it). This PR keeps only the code/infra and
the CLAUDE.md build-loop note that the container changes here introduce.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cailmdaley and others added 6 commits June 11, 2026 01:43
Backfill ULID ids across 19 fibers; close docs-versioning,
smoke-test-read-only, docker-uv-revert (superseded by #733);
refresh shapepipe.md active-threads list to current PRs (#737–741);
add np-str0-numpy2 fiber; minor outcome/status normalizations.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…newer)

# Conflicts:
#	.felt/docker-uv-revert/docker-uv-revert.md
#	.felt/fabian-coord-bug/fabian-coord-bug.md
#	.felt/ngmix-update/ngmix-update.md
#	.felt/prs-in-flight/prs-in-flight.md
#	.felt/shapepipe.md
#	.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md