Docs: machine-specific cluster tree + freshness pass #739
Open
cailmdaley wants to merge 30 commits into
cailmdaley
added a commit
that referenced
this pull request
May 31, 2026
Three fibers from this session's docs work:
- docs-versioning: the versioned-site + switcher design (#738) and the recurring unexercised-path bit-rot pattern.
- docs-cluster-tree: the machine-specific clusters.md decision (#739) and why a single page beat a thin standalone general page.
- v2-run-plan: the v2.0 run wishlist rescued from the deleted work_flow_v2.0.md docs page before removal.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cailmdaley
added a commit
that referenced
this pull request
May 31, 2026
The README front door, the container.md 'Running on a cluster' section, and the basic_execution.md MPI docs are relocated to #739, which owns the full docs story (cluster docs now live in a dedicated clusters.md, so keeping the walkthrough here too would duplicate it). This PR keeps only the code/infra and the CLAUDE.md build-loop note that the container changes here introduce. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audited every narrative docs page against the current code. The install /
container / testing / API pages were already fresh; the staleness concentrated
in cluster docs and a few content errors. This rework:
**Machine-specific cluster tree.** Cluster guidance was scattered and half
of it invisible (candide lived only inside container.md on a feature branch;
canfar was split across orphaned pages; none of canfar/candide were in the
sidebar). Add a single `clusters.md` under a new "Running on a cluster" toctree
caption: the shared pattern (container = unit of execution, bind-mount, keep
SIFs off a quota-limited $HOME), then per-machine sections for candide (SLURM,
the candide_{smp,mpi}.sh scripts, the quota-safe pull, MPI/PMIx) and CANFAR
(the current canfar_submit_job / canfar_monitor console scripts), with ccin2p3
stubbed. The deep CANFAR production walkthrough stays in pipeline_canfar.md,
linked, and is now in the toctree too.
**Delete obsolete pages.** canfar.md (the old curl-VM submission model,
superseded by canfar_submit_job), pipeline_v2.0.md (personal paths, a missing
script), and work_flow_v2.0.md (an unrealized planning wishlist) — all three
orphaned from the toctree. The v2.0 wishlist is preserved in the team's felt
store rather than lost.
**Fix content errors.**
- dependencies.md: rewritten against pyproject.toml. Reframed around the
abstract-minimums + uv.lock SSOT (was "pinned per release"); ngmix now points
at the aguinot/ngmix@stable_version fork (was esheldon upstream); dropped the
phantom CDSclient; added the missing CANFAR/data stack (vos, skaha, canfar,
cs_util, astroquery, reproject, h5py, numba).
- post_processing.md: dropped the removed rho-statistics step and the dead
prepare_tiles_for_final command; added a legacy banner pointing at sp_validation.
- random_cat.md: legacy banner; fixed module name random_runner -> random_cat_runner.
- pipeline_canfar.md: flagged the matched-star / coverage-mask helpers that
moved to sp_validation (merge_psf_cat.py, download_headers, …).
- basic_execution.md: replaced the conda-era "activate the environment" framing
with the container reality. (MPI sections deferred pending the #737 decision.)
- configuration.md (conifg->config, NUMBERING_LIST->NUMBER_LIST),
contributing.md (Pleas->Please), module_develop.md (src/shapepipe/modules).
Verified with a local sphinx-book-theme build: succeeds; the only new warning
the tree introduced (a clusters.md heading anchor) is fixed. Remaining warnings
are all pre-existing (the autosummary API page needs the installed package;
multiple-toctree notices on every page).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
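The shared pattern described in this commit (container as the unit of execution, bind-mounted working directories, SIF images kept off a quota-limited $HOME) can be sketched as a small command builder. This is an illustrative sketch only, not code from clusters.md or the candide scripts: the function names, the example paths, and the command passed to the container are hypothetical placeholders. The only external facts it relies on are that `apptainer exec` accepts repeated `--bind host:container` options followed by the image and the command.

```python
import os
from pathlib import Path


def is_under(path, parent):
    """Return True if `path` is `parent` itself or lives somewhere inside it."""
    path = Path(path).resolve()
    parent = Path(parent).resolve()
    return path == parent or parent in path.parents


def build_apptainer_cmd(sif, binds, command, home=None):
    """Build an `apptainer exec` argument list for one containerized step.

    sif     -- path to the SIF image (rejected if it lives under $HOME)
    binds   -- mapping of host directory -> mount point inside the container
    command -- the command to run inside the container, as a list of strings
    home    -- home directory to check against (defaults to $HOME)
    """
    home = home or os.environ.get("HOME", str(Path.home()))
    if is_under(sif, home):
        raise ValueError(f"{sif} is under {home}; keep SIFs off a quota-limited $HOME")
    cmd = ["apptainer", "exec"]
    for host_dir, container_dir in binds.items():
        cmd += ["--bind", f"{host_dir}:{container_dir}"]
    return cmd + [str(sif)] + list(command)


if __name__ == "__main__":
    # Hypothetical paths and command, for illustration only.
    print(build_apptainer_cmd(
        "/scratch/images/shapepipe.sif",
        {"/scratch/run": "/work"},
        ["shapepipe_run", "-c", "config.ini"],
        home="/home/user",
    ))
```

The resulting list can be handed to `subprocess.run` or joined into a line of a batch script; refusing an image under $HOME up front is one way to make the quota rule hard to violate by accident.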
…itHub The explicit MyST target showed as raw '(candide-slurm)=' in GitHub's blob view (where PR links point readers). Use a plain-text in-page reference; the candide section is still reachable via the sidebar and GitHub's own heading anchor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Unify all user-facing docs in this PR (relocated from #737, which is now pure code/infra):
- README front door (Quickstart + Documentation signpost). The signpost now has a dedicated 'Running on a cluster' entry pointing at clusters.html, and the container-workflow entry no longer claims to carry the cluster example (that lives in clusters.md).
- basic_execution.md MPI section: the hybrid-Apptainer run pattern and the OpenMPI-5 PMIx note, kept alongside the conda-framing fix.
- container.md gains a one-line pointer to clusters.md.
This removes the container.md/clusters.md duplication at the source rather than reconciling it after merge.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…staged review Autonomous prep for the interactive review. Built the new module against real ngmix 2.4.0 on candide and ran do_ngmix_metacal: shear recovery unbiased (m=+2e-4). Centroid fix harmless but necessity not reproduced. CI never ran (fork PR). Draft GitHub comments + report.html staged for the call. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lished); suite still pending merge-develop Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…now running on #741 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rd code-review + test work Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…etion landed; bug characterized as old-path m~-2.8e-2 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…port + get_guess removal); part 2 (methodology) deferred Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…11 inline + summary); review delivered, Martin to merge Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re-reviewed against current #741 head: no code changed since the part-2 review, so all 11 findings stand. Martin closed fork PR #740 and consolidated onto #741 (canonical, green, mergeable); engaged only to ack the RNG fix. Triaged 11 findings (5 cut-and-dry / 5 decisions / 1 resume); weight-norm (949) + *_psfo (1045) flagged as the only two merge-gates. report.html rewritten as next-steps triage; summary comment posted to #741 (issuecomment-4626968551); prs-in-flight indexed with #740-closed disposition. Report-only round; fiber closed for Cail's review. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ied r50/T bug Martin's morning pass (06-05): greenlit 254 (remove resume) + 766 (configurable stamp size); 737 any->all intentional; 949 -> issue #604; opened r50/T naming, poked Lucy. Ran his explicit check: pars[4]=T=2sigma^2 confirmed, galaxy r50 stores T (area), PSF r50 stores sigma -- neither is the half-light radius 1.1774*sigma. *_psfo (1045) now the lone unanswered merge-gate. report.html + outcome + history refreshed. Analysis only, no code pushed. Held interactive for Cail. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
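The size relations quoted in this commit can be written out for a round Gaussian. The sketch below is illustrative and is not the cs_util or shapepipe API; the function names are hypothetical. It uses T = 2σ² as stated above, and the half-light radius follows from the enclosed flux of a 2D Gaussian, 1 − exp(−r²/2σ²) = 1/2, which gives r50 = √(2 ln 2)·σ ≈ 1.1774σ.

```python
import math

# Half-light radius per unit sigma for a round 2D Gaussian: sqrt(2 ln 2).
R50_PER_SIGMA = math.sqrt(2.0 * math.log(2.0))


def sigma_from_T(T):
    """Invert T = 2 * sigma**2 (round Gaussian) to get sigma."""
    return math.sqrt(T / 2.0)


def r50_from_sigma(sigma):
    """Half-light radius of a round Gaussian with the given sigma."""
    return R50_PER_SIGMA * sigma


def r50_from_T(T):
    """Half-light radius of a round Gaussian with area-like size T."""
    return r50_from_sigma(sigma_from_T(T))


if __name__ == "__main__":
    T = 2.0  # corresponds to sigma = 1
    print(sigma_from_T(T), r50_from_T(T))
```

Under these assumptions a column that stores T and a column that stores σ differ both in units (area versus length) and by a nonlinear transform, so neither can stand in for the half-light radius without a conversion of this kind.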
…ecision-ready reports)
Workflow (6 investigators + 2 synth) on the two hard problems:
- WEIGHTS (#604+949): two coupled regressions in prepare_ngmix_weights (dead get_noise -> whole-stamp sigma_mad; lost binarization -> double-counts real ivar). Empirically confirmed (truth ivar 1e6 -> recovered 8.8e11). Rec: split into a minimal v1-restore in #741 + SExtractor BACKGROUND_RMS baseline as separate PR (closes #604).
- SIZE (r50/T): galaxy r50=T(area), PSF r50psf=sigma, neither=1.1774sigma; UNIONS-3500 paper reports r_h as primary. Rec: transform-at-source + cs_util converters; bonus: sp_validation T_to_fwhm dimensionally wrong.
Adds weights-report.md, size-report.md, deep-dive-report.html.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- shapepipe/ngmix-weights-ivar (Codex): fix the two prepare_ngmix_weights regressions (minimal v1-restore + red->green test in #741) and the SExtractor BACKGROUND_RMS inverse-variance baseline as a separate PR (closes #604). Points at weights-report.md.
- shapepipe/ngmix-size-columns: honest r50 at the ngmix source + cs_util converter web + fix the sp_validation T_to_fwhm leakage bug. Points at size-report.md.
Both installed as drafts pending Cail's dispatch decision.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…741
- Pushed cleanup bd60dc8 to origin/ngmix_v2.0 (dd4f656..bd60dc8).
- Enabled both oneshot shuttles: ngmix-weights-ivar (codex), ngmix-size-columns (claude-opus). Workers prepare branches/PRs+reports; merge stays Cail's. *_psfo gate + runner decorators tracked for the eventual #741 reply.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…r review shapepipe fix/ngmix-size-columns, cs_util feat/size-conversions (fork), sp_validation fix/psf-leakage-fwhm. Also corrects size-report.md's error-prop claim (old r50_err_PSFo was a factor-2 over-estimate). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
All three branches independently re-reviewed and re-tested; no code defects. Report corrected: sp_validation CI is green-but-vacuous on the cs_util.size import (suite never imports galaxy.py, image carries released cs-util 0.1.9), so merge order is discipline-enforced. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
cailmdaley
added a commit
that referenced
this pull request
Jun 10, 2026
The README front door, the container.md 'Running on a cluster' section, and the basic_execution.md MPI docs are relocated to #739, which owns the full docs story (cluster docs now live in a dedicated clusters.md, so keeping the walkthrough here too would duplicate it). This PR keeps only the code/infra and the CLAUDE.md build-loop note that the container changes here introduce. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Backfill ULID ids across 19 fibers; close docs-versioning, smoke-test-read-only, docker-uv-revert (superseded by #733); refresh shapepipe.md active-threads list to current PRs (#737–741); add np-str0-numpy2 fiber; minor outcome/status normalizations. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ness shell, science threads for Martin
…newer)
# Conflicts:
# .felt/docker-uv-revert/docker-uv-revert.md
# .felt/fabian-coord-bug/fabian-coord-bug.md
# .felt/ngmix-update/ngmix-update.md
# .felt/prs-in-flight/prs-in-flight.md
# .felt/shapepipe.md
# .felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md
Audited every narrative docs page against the current code. The install / container / testing / API pages were already fresh (the conda→uv/container work kept them current); staleness concentrated in cluster docs and a handful of content errors. This PR fixes both.
Machine-specific cluster tree
Cluster guidance was scattered and half-invisible: candide lived only inside `container.md` (and only on the #737 branch), canfar was split across orphaned pages, and none of the canfar/candide pages were in the sidebar at all.
New single `clusters.md` under a "Running on a cluster" toctree caption:
- Shared pattern: container = unit of execution, bind-mount, keep SIFs off a quota-limited `$HOME`.
- candide: SLURM `sbatch`, the `candide_{smp,mpi}.sh` scripts, the quota-safe pull → submit, partitions, the MPI/PMIx note.
- CANFAR (the current `canfar_submit_job` / `canfar_monitor` console scripts), with the deep production walkthrough kept in `pipeline_canfar.md` (linked, and now in the toctree).
Deleted obsolete pages
`canfar.md` (old `curl`-VM submission, superseded by `canfar_submit_job`), `pipeline_v2.0.md` (personal paths, a missing script), and `work_flow_v2.0.md` (an unrealized planning wishlist): all three were orphaned. The v2.0 wishlist was preserved in the team's felt store before deletion.
Content fixes
- `dependencies.md`: rewritten against `pyproject.toml`. Reframed around the abstract-minimums + `uv.lock` SSOT (was "pinned per release"); `ngmix` now points at the `aguinot/ngmix@stable_version` fork (was esheldon upstream); dropped the phantom `CDSclient`; added the missing CANFAR/data stack (`vos`, `skaha`, `canfar`, `cs_util`, `astroquery`, `reproject`, `h5py`, `numba`).
- `post_processing.md`: dropped the removed rho-statistics step and the dead `prepare_tiles_for_final` command; legacy banner → sp_validation.
- `random_cat.md`: legacy banner; fixed `random_runner` → `random_cat_runner`.
- `pipeline_canfar.md`: flagged the matched-star / coverage-mask helpers that moved to sp_validation.
- `basic_execution.md`: replaced the conda-era "activate the environment" framing with the container reality. MPI sections deferred pending the keep/drop decision on #737 ("Fix MPI on candide (OpenMPI 5 image + latent code bug); containerize & SLURM-ify candide scripts").
- `configuration.md` (`conifg`→`config`, `NUMBERING_LIST`→`NUMBER_LIST`), `contributing.md` (`Pleas`→`Please`), `module_develop.md` (`src/shapepipe/modules`).
Verification
Local `sphinx-book-theme` build succeeds. The one new warning the tree introduced (a `clusters.md` heading anchor) is fixed; remaining warnings are all pre-existing (the autosummary API page needs the installed package; the multiple-toctree notice fires on every page).
Relationship to the other docs PRs
master.— Claude on behalf of Cail