
Ngmix v2.0 (CI mirror of #740)#741

Open
cailmdaley wants to merge 84 commits into develop from ngmix_v2.0

Conversation

@cailmdaley

Contributor

What this is

A same-repo mirror of #740 (@martinkilbinger's "Ngmix v2.0"), pushed to a branch on CosmoStat/shapepipe so that CI actually runs. All 57 commits are authored by Martin (and carry Lucy Baumont's and Axel Guinot's work) — pushing the branch preserves that authorship unchanged; the only thing this PR adds is a same-repo head so GitHub Actions fires the pull_request workflow without the fork-PR approval gate. #740 received no CI runs at all for this reason.

Substance is identical to #740 — see that PR for the full description. In short: upgrade ngmix to 2.4.0 and adopt Lucy's new ngmix classes/interface; overhaul the shape-measurement module; centroid-bias fix + validation; v2.0 production-run plumbing.

Going forward, opening PRs directly on CosmoStat/shapepipe (rather than from a fork) avoids this — fork PRs don't trigger our Docker-image CI without a maintainer approval that wasn't happening.

Closes/supersedes #740 once CI is green (leaving that call to Martin).

Review

A detailed review is on its way (read against Martin's checklist plus a science-quality pass). Headline from exercising the new fitter against real ngmix 2.4.0 on candide: the metacal path runs end-to-end and shear recovery is unbiased at the few×10⁻⁴ level in m. Full notes to follow.

— Claude on behalf of Cail

Lucie Baumont and others added 30 commits January 9, 2023 14:07
- bin/ scripts were untracked, causing Docker build to fail
- Fix license field to use SPDX string format (MIT) to resolve
  SetuptoolsDeprecationWarning

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cailmdaley added a commit that referenced this pull request Jun 10, 2026
…port + get_guess removal); part 2 (methodology) deferred

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cailmdaley added a commit that referenced this pull request Jun 10, 2026
…11 inline + summary); review delivered, Martin to merge

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cailmdaley added a commit that referenced this pull request Jun 10, 2026
Re-reviewed against current #741 head: no code changed since the part-2
review, so all 11 findings stand. Martin closed fork PR #740 and
consolidated onto #741 (canonical, green, mergeable); engaged only to ack
the RNG fix. Triaged 11 findings (5 cut-and-dry / 5 decisions / 1 resume);
weight-norm (949) + *_psfo (1045) flagged as the only two merge-gates.
report.html rewritten as next-steps triage; summary comment posted to
#741 (issuecomment-4626968551); prs-in-flight indexed with #740-closed
disposition. Report-only round; fiber closed for Cail's review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cailmdaley added a commit that referenced this pull request Jun 10, 2026
…ecision-ready reports)

Workflow (6 investigators + 2 synth) on the two hard problems:
- WEIGHTS (#604+949): two coupled regressions in prepare_ngmix_weights
  (dead get_noise -> whole-stamp sigma_mad; lost binarization -> double-
  counts real ivar). Empirically confirmed (truth ivar 1e6 -> recovered
  8.8e11). Rec: split — minimal v1-restore in #741 + SExtractor
  BACKGROUND_RMS baseline as separate PR (closes #604).
- SIZE (r50/T): galaxy r50=T(area), PSF r50psf=sigma, neither=1.1774sigma;
  UNIONS-3500 paper reports r_h as primary. Rec: transform-at-source +
  cs_util converters; bonus — sp_validation T_to_fwhm dimensionally wrong.

Adds weights-report.md, size-report.md, deep-dive-report.html.
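
The size mismatch in the SIZE item can be made concrete for a round Gaussian, where ngmix's T equals 2·sigma² (an area-like quantity) and the half-light radius is sqrt(2 ln 2)·sigma ≈ 1.1774·sigma. The sketch below is only an illustration under that round-Gaussian assumption; the function names are hypothetical and are not the cs_util converters proposed in the report.

```python
import math

# Hypothetical helper names for illustration; not the cs_util converters.
# All relations assume a round (circular) Gaussian profile.
R50_PER_SIGMA = math.sqrt(2.0 * math.log(2.0))  # ~1.1774


def sigma_from_T(T):
    """Gaussian sigma from ngmix-style T = 2 * sigma**2."""
    return math.sqrt(T / 2.0)


def r50_from_sigma(sigma):
    """Half-light radius of a round Gaussian."""
    return R50_PER_SIGMA * sigma


def r50_from_T(T):
    """Half-light radius from T, going through sigma."""
    return r50_from_sigma(sigma_from_T(T))


def fwhm_from_T(T):
    """FWHM = 2 * sqrt(2 ln 2) * sigma, which has units of length, not area."""
    return 2.0 * R50_PER_SIGMA * sigma_from_T(T)
```

Going through sigma keeps every returned quantity a length, which is the dimensional check the T_to_fwhm remark is about.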

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cailmdaley added a commit that referenced this pull request Jun 10, 2026
- shapepipe/ngmix-weights-ivar (Codex): fix the two prepare_ngmix_weights
  regressions (minimal v1-restore + red->green test in #741) and the
  SExtractor BACKGROUND_RMS inverse-variance baseline as a separate PR
  (closes #604). Points at weights-report.md.
- shapepipe/ngmix-size-columns: honest r50 at the ngmix source + cs_util
  converter web + fix the sp_validation T_to_fwhm leakage bug. Points at
  size-report.md.

Both installed as drafts pending Cail's dispatch decision.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cailmdaley added a commit that referenced this pull request Jun 10, 2026
…741

- Pushed cleanup bd60dc8 to origin/ngmix_v2.0 (dd4f656..bd60dc8).
- Enabled both oneshot shuttles: ngmix-weights-ivar (codex),
  ngmix-size-columns (claude-opus). Workers prepare branches/PRs+reports;
  merge stays Cail's. *_psfo gate + runner decorators tracked for the
  eventual #741 reply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
sigma_mad(gal) == 0 on a constant stamp made the scalar fallback compute
mask * inf, which is NaN wherever the mask is 0 (a fully-masked constant
stamp emitted an all-NaN weight map). Guard on sig_noise > 0 and return
all-zero weights instead; the downstream wsum == 0 epoch cut already
handles the zero-weight case. Pre-existing v1/v2 edge, not introduced by
the #604 work.
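
A minimal sketch of the failure and the guard described above, assuming a mask convention of 1 = good pixel and 0 = bad pixel. The function name is hypothetical and this is not the actual prepare_ngmix_weights code.

```python
import numpy as np


def scalar_fallback_weights(mask, sig_noise):
    """Scalar inverse-variance weights, guarded against sig_noise <= 0.

    Hypothetical stand-in for the scalar fallback; mask is 1 for good
    pixels and 0 for bad ones (an assumption of this sketch).
    """
    mask = np.asarray(mask, dtype=float)
    if not sig_noise > 0:  # also catches NaN
        return np.zeros_like(mask)
    return mask / sig_noise**2


mask = np.array([[1.0, 0.0], [0.0, 1.0]])

# Unguarded path: 1 / 0**2 is inf, and 0 * inf is NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    unguarded = mask * (1.0 / np.float64(0.0) ** 2)

# Guarded path: an all-zero weight map, left to the wsum == 0 cut.
guarded = scalar_fallback_weights(mask, 0.0)
```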
Close the #604 coverage gap flagged in review: the per-pixel RMS branch
was pinned only by a 3x3 hand-computed matrix and the rescale unit test.
make_data already accepts a per-pixel noise map (document it), so inject
heteroscedastic truth and assert the Observation weight equals
1/(Fscale*rms)^2 exactly through rescale_epoch_fluxes ->
prepare_ngmix_weights -> make_ngmix_observation, with Megapipe-masked
and flagged pixels zeroed. Also add the degenerate constant-stamp guard
test (np.errstate raise: no divide/invalid warnings, all-zero weights).
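
For orientation, the per-pixel relation the integration test pins, weight = 1/(Fscale·rms)² with masked and flagged pixels zeroed, might look like the sketch below. The function name, argument names and boolean `good` map are assumptions of this example, not the branch's implementation.

```python
import numpy as np


def per_pixel_ivar_weights(rms, fscale, good):
    """Inverse-variance weights from a per-pixel RMS map.

    Hypothetical helper: `good` is a boolean map that is False for
    masked or flagged pixels, which receive zero weight.
    """
    rms = np.asarray(rms, dtype=float)
    good = np.asarray(good, dtype=bool)
    sigma = fscale * rms
    weight = np.zeros_like(sigma)
    # Only invert where the noise level is usable.
    ok = good & np.isfinite(sigma) & (sigma > 0)
    weight[ok] = 1.0 / sigma[ok] ** 2
    return weight


rms = np.array([[1.0, 2.0], [4.0, 0.5]])
good = np.array([[True, True], [False, True]])
w = per_pixel_ivar_weights(rms, fscale=2.0, good=good)
```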
scripts/python/fitting.py was removed by the v2.0 dead-code cleanup
(bd60dc8), leaving a silently-skipping stale entry. Drop it, and turn
the missing-file skip into a hard failure so the list keeps reflecting
reality.
When the option is set, the RMS sqlite must exist for every tile
(fail-fast FileNotFoundError, no per-tile fallback); the scalar
sigma_mad fallback engages only when the option is absent from the
config entirely.
cailmdaley and others added 5 commits June 11, 2026 02:06
…columns

# Conflicts:
#	src/shapepipe/tests/test_ngmix.py
fix(ngmix): emit true half-light radii in r50 columns; dedupe PSF size columns
The ngmix resume path was deleted in bd60dc8 (Martin: 'a hack to
resume interrupted runs ... can be removed now'); this template entry
was the last reference wired to ngmix_runner. The mask/get_images
CHECK_EXISTING_DIR entries elsewhere are live features and stay.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The WCS moved from the named LOG_WCS config option into the positional
input list (input_file_list[-1] when MAKE_POST_PROCESS is True), but the
@module_runner metadata and the package docs still described the old
contract. Declare log_exp_headers/.sqlite from merge_headers_runner in
the decorator (matching the ngmix_runner convention) and replace the
stale LOG_WCS docs with the positional contract.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ngmix_runner takes the merged WCS header log positionally
(input_file_list[6]) and never reads LOG_WCS. Document the positional
contract, add merge_headers_runner to the parent modules, and document
the real SAVE_BATCH option in its place.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@cailmdaley

Contributor Author

Review-round summary. Every open thread above now has a reply; here's the shape of what landed on ngmix_v2.0 since the review, and what still awaits your judgment.

On this branch:

  • Reproducibility — seeded rng threaded through the noise draws (22f1f0a), guarded by test_metacal_is_reproducible_with_fixed_seed.
  • Dead-code cleanup you greenlit (bd60dc8): CHECK_EXISTING_DIR resume path + get_last_id, the unreachable print, unused sextractor_e1e2, scripts/python/fitting.py, and the 51*51 literal (now v_flag_tmp.size). Dangling config entry pruned in 16789eb.
  • Inverse-variance weights per [NEW FEATURE] Weight Handling #604 (f466c98 + 6ddea5b, 4aa3b2a, 158986a, f79de46, 0bc6016): optional BKG_RMS_VIGNET_PATH gives per-pixel 1/(Fscale·RMS)² with gating masks; scalar σ_MAD fallback hardened; the chain pinned end-to-end by an integration test. The double-weighting flagged in review is gone.
  • Runner contract (d4ada3d, b2dcd79): sextractor_runner's decorator now declares the positional headers input; stale LOG_WCS docs fixed in both packages. No behavior change in any shipped config (all 12 override FILE_PATTERN).
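
As an illustration of the reproducibility bullet, threading one seeded generator through every noise draw is what makes two runs with the same seed bit-identical. The helper names below are hypothetical, and the real fitter may construct and pass its rng differently.

```python
import numpy as np


def draw_noise(shape, sigma, rng):
    """Draw Gaussian noise from the generator that was passed in."""
    return rng.normal(scale=sigma, size=shape)


def noisy_stamp(image, sigma, seed):
    """Add noise using a generator seeded once and threaded through."""
    rng = np.random.RandomState(seed)
    return image + draw_noise(image.shape, sigma, rng)


image = np.zeros((4, 4))
a = noisy_stamp(image, 1.0, seed=42)
b = noisy_stamp(image, 1.0, seed=42)
c = noisy_stamp(image, 1.0, seed=43)
```

Here `a` and `b` match exactly while `c` differs, which is the property a fixed-seed reproducibility test asserts.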

Companion PRs (the r50/T thread):

Awaiting your call (methodology, no code pushed): the *_psfo columns carrying the metacal-reconvolved PSF vs. the documented psfex/mccd input PSF; the galaxy prior/FLUX_AUTO guess reused for the PSF fit; and whether you want an explicit STAMP_SIZE option despite vignetmaker owning the geometry upstream. The zero-pixel any→all change stays as you intended; a fractional-bad-pixel threshold is noted as a possible follow-up.

— Claude on behalf of Cail

cailmdaley and others added 7 commits June 11, 2026 03:06
merge_headers writes TILE_ID as the first key of the tile-level
log_exp_headers<tile>.sqlite, and make_post_process derived n_hdu from
the first key's value — len(tile_id_string) instead of the CCD count —
so every epoch on the unscanned CCDs was silently dropped from
N_EPOCH/EPOCH_* (and hence from ME vignets and shape measurement).
Regression test builds the tile-mode sqlite via merge_headers and
asserts an object on the last CCD keeps all its epochs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
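
The TILE_ID mix-up can be reproduced with a plain dict standing in for the sqlite log. The layout below (one metadata key followed by exposure entries holding one item per CCD) and the helper name are assumptions for illustration only.

```python
def count_ccds(headers):
    """Size the CCD scan from a real exposure entry, skipping metadata.

    Hypothetical stand-in for the fixed logic in make_post_process.
    """
    exposure_keys = [key for key in headers if key != "TILE_ID"]
    return len(headers[exposure_keys[0]])


# Assumed layout: TILE_ID is written first, then one entry per exposure.
headers = {
    "TILE_ID": "51",
    "exposure_a": ["wcs_ccd0", "wcs_ccd1", "wcs_ccd2"],
}

key_list = list(headers)
buggy_n_hdu = len(headers[key_list[0]])  # length of the string "51"
fixed_n_hdu = count_ccds(headers)        # number of CCDs
```

With the buggy count of 2, a loop over `range(n_hdu)` never reaches CCD index 2, so its epochs are dropped without any error.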
ngmix 2.x run_fitter returns flags != 0 on failure instead of raising,
and the failed result carries none of the measurement keys (g, g_cov, T,
T_err, flux, flux_err, s2n); compile_results indexed them directly, so a
single failed object crashed the whole tile with a KeyError at save time.
Failed types are now recorded as NaN with their flags preserved.
With ignore_failed_psf=True a failed PSF epoch stays in obsdict carrying
only flags/pars, so reading result['T'] KeyError-dropped the whole object
even when the shear fit succeeded on the surviving epochs. The average
now skips flags != 0 epochs (all-failed still hits the wsum == 0 guard),
and n_epoch_model counts surviving epochs instead of submitted ones.
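
A rough sketch of the averaging behaviour described above. The per-epoch `weight` field and the weighted-mean form are assumptions of this example, since the commit message does not spell out the weighting.

```python
def average_psf_T(epochs):
    """Weighted mean of PSF T over epochs whose PSF fit succeeded.

    Hypothetical sketch: each epoch is a dict with 'flags' and 'weight',
    plus 'T' only when the fit succeeded (flags == 0).
    """
    wsum = 0.0
    t_sum = 0.0
    n_epoch_model = 0
    for epoch in epochs:
        if epoch["flags"] != 0:
            continue  # a failed epoch carries no 'T'; skip it
        wsum += epoch["weight"]
        t_sum += epoch["weight"] * epoch["T"]
        n_epoch_model += 1
    if wsum == 0:
        return None, n_epoch_model  # all-failed case
    return t_sum / wsum, n_epoch_model


epochs = [
    {"flags": 0, "weight": 1.0, "T": 0.2},
    {"flags": 512, "weight": 1.0},  # failed PSF fit, no 'T' key
    {"flags": 0, "weight": 3.0, "T": 0.4},
]
T_mean, n_epoch_model = average_psf_T(epochs)
```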
The rewrite hard-coded res['mcal_flags'] = 0, so the NGMIX_MCAL_FLAGS
column written by make_cat was constant-zero and any mcal_flags == 0
quality cut passed every object, failed fits included. Restores the v1
contract: mcal_flags = bitwise OR of all per-type fit flags, so failed
objects (now NaN-recorded rather than crashing) carry nonzero flags.
The cfis_simu configs still used the removed LOG_WCS/ME_LOG_WCS options,
so the runners' positional reads (ngmix input_file_list[6], mccd_interp/
vignetmaker [1], sextractor [-1] with MAKE_POST_PROCESS) would IndexError
or grab the wrong file. Migrated them to the merge_headers_runner input
pattern used by example/cfis; sextractor exposure runs gain an explicit
FILE_EXT so the 3-entry pattern override no longer mismatches the
runner's 4-entry default. Also renamed stale run_sp_exp_Mh references in
the cfis templates to run_sp_tile_Mh_exp, the name config_tile_Mh_exp.ini
actually produces.
The column now stores ngmix 2.x's nfev (solver function-evaluation
count, ~tens-hundreds, -1 on some failures), not the v1 1-5 retry
count; the old name misrepresented the value. No downstream consumer
reads it: make_cat's _save_ngmix_data never touches it, and the only
ntry matches in sp_validation are base64 image blobs in notebook
outputs.
copyfile was orphaned by the resume-path removal; Tile_cat's size/e/
theta attributes were read from the catalog but never consumed anywhere
in src/ or scripts/. get_noise stays: scripts/jupyter/
test_centroid_shift.py imports and calls it.
@cailmdaley

Contributor Author

Review — part 3 (fresh pass)

Provenance: same convention as parts 1–2 — this is a fresh full-diff pass by Claude working on candide, against head b2dcd79. Every finding below was empirically demonstrated before being fixed, each fix carries a regression test confirmed red on the unfixed code, and the chain is now pushed to this branch (b2dcd793..05f1584e). I haven't hand-edited it line by line, so please read with that lens.

Blockers (2)

Both live in the v2.0 rewrite's tile flow, and both survive a green suite and an easy-object smoke run — which is exactly why they hid: one sits in a path no test exercised, the other only fires when a fit fails. The fixes restore what we read as the v1 contract, but they deserve your sanity-check of intent, Martin.

  1. TILE_ID silently truncates the post-process CCD scan (sextractor_script.py, make_post_process — fixed in fcd117f). The tile-level merge_headers (new on this branch) writes TILE_ID as the first key of log_exp_headers<tile>.sqlite; make_post_process derived n_hdu = len(f_wcs[key_list[0]]) — i.e. the length of the tile ID string instead of the CCD count. Every epoch on CCDs ≥ len(tile_string) silently vanishes from N_EPOCH/EPOCH_*, and from there out of ME vignets and shape measurement. Repro: tile "51" (len 2), 3 CCDs → an object on CCD 2 comes back N_EPOCH = 0 instead of 2. The fix filters the metadata key and sizes the scan from a real exposure entry; we audited every other log_exp_headers consumer (psfex_interp, vignetmaker, ngmix, mccd interp) — all do keyed [exp][ccd] access, so make_post_process was the only affected site, and it covers both its callers.

  2. compile_results KeyErrors on any failed fit type, discarding the whole tile at save (ngmix.py — fixed in 450f3c1). Re-verified against real ngmix 2.4.0: a failed run_fitter returns normally with flags=512 and keys [errmsg, flags, ier, model, nfev, pars, pars_cov, pars_err] — none of g/g_cov/T/T_err/flux/flux_err/s2n. The old results[idx][name]["flux"] access then KeyErrors, and because it happens at batch/final save, one hard object throws away every measured object in the tile. v1 never saw this because ngmix 1.x raised on failure, which the per-object try/except caught. Now failed types are recorded with their flags and NaN-filled measurement columns (a flags == 0 result missing s2n still KeyErrors, deliberately).
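
To make blocker 2 concrete, here is a minimal sketch of flag-aware result reading. The function name and the restriction to scalar keys are simplifications for illustration; it is not the compile_results code on the branch.

```python
import math

# Scalar measurement keys only; vector keys such as g are left out
# of this sketch for brevity.
SCALAR_KEYS = ("T", "T_err", "flux", "flux_err", "s2n")


def read_measurements(result):
    """Return the scalar measurements, NaN-filled when the fit failed.

    Hypothetical helper. A flags == 0 result that lacks a key still
    raises KeyError, matching the deliberate behaviour described above.
    """
    if result["flags"] != 0:
        return {key: math.nan for key in SCALAR_KEYS}
    return {key: result[key] for key in SCALAR_KEYS}


# A failed fit carries flags and fit bookkeeping but no measurement keys.
failed = {"flags": 512, "nfev": -1, "pars": None}
good = {"flags": 0, "T": 0.3, "T_err": 0.01,
        "flux": 10.0, "flux_err": 0.5, "s2n": 20.0}

failed_row = read_measurements(failed)
good_row = read_measurements(good)
```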

Should-fixes (pushed; please confirm intent)

  • mcal_flags was hardcoded to 0 (7a1b3b1). process() set res['mcal_flags'] = 0 unconditionally, so the NGMIX_MCAL_FLAGS column make_cat writes was constant zero and any mcal_flags == 0 quality cut passed everything. Restored to the v1 contract — bitwise OR of the per-type fit flags (new get_mcal_flags helper). Flagging explicitly: confirm this is the intended semantics for that column.
  • Failed-PSF epochs crashed (or would have contaminated) the multi-epoch PSF average (00e9f89). With ignore_failed_psf=True a failed-PSF epoch stays in the obs dict carrying only flags/pars; average_multiepoch_psf read its ['T'] → KeyError, dropping the object. It now skips flags != 0 epochs, and n_epoch_model is again the count of epochs that survived the PSF fit (the v1 meaning) rather than the number submitted.
  • Config drift, tutorial-facing (3eb6a66). The four example/cfis_simu configs driven by job_sp_simu.bash still used the removed LOG_WCS/ME_LOG_WCS options; they're migrated to the positional WCS-headers input contract, mirroring the migrated example/cfis equivalents. Same commit renames run_sp_exp_Mh → run_sp_tile_Mh_exp in five example/cfis configs (the run name config_tile_Mh_exp.ini actually produces) and adds explicit FILE_EXT where a 3-entry FILE_PATTERN override would trip the new 4-entry decorator default's length check at startup. Related sweep worth doing in this PR: example/cfis/config_exp_psfex.ini has the same latent FILE_PATTERN/FILE_EXT length mismatch.
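
For the mcal_flags bullet, the "bitwise OR of the per-type fit flags" contract can be sketched as follows. The function name is hypothetical (the branch's helper is get_mcal_flags, whose signature is not shown here), and the flag value 1 is an arbitrary illustrative bit.

```python
from functools import reduce
from operator import or_


def combine_fit_flags(results_by_type):
    """OR together the fit flags of every metacal type.

    Hypothetical sketch; returns 0 only if every type has flags == 0.
    """
    return reduce(or_, (res["flags"] for res in results_by_type.values()), 0)


results = {
    "noshear": {"flags": 0},
    "1p": {"flags": 512},
    "1m": {"flags": 0},
    "2p": {"flags": 1},  # illustrative bit, not a documented ngmix flag
    "2m": {"flags": 0},
}
mcal_flags = combine_fit_flags(results)  # 512 | 1 == 513
```

Because OR preserves every set bit, a `mcal_flags == 0` cut rejects an object as soon as any single type failed.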

Noted, not changed

  • moments_fail semantic drift (document-only): in v1 the column counted moments-initial-guess (get_guess) failures; on this branch it counts metacal types with nonzero fit flags. Same name, different meaning — flagged so downstream consumers know.
  • ntry_fit → nfev_fit (436bcc8): the value written is now ngmix-2.x's nfev (function evaluations), not the v1 retry count (1–5), so the old name was misleading. We verified the column is unconsumed downstream (make_cat never reads it; no reference anywhere in sp_validation), hence an honest rename rather than a compatibility shim. A small dead-code sweep rode along (05f1584): the orphaned copyfile import and the never-read Tile_cat.size/.e/.theta; get_noise was kept because it has a live caller in scripts/jupyter/test_centroid_shift.py.

Branch state, empirically

With the chain applied, the container suite is 270 passed / 1 failed. The one failure is the known test_metacal_is_reproducible_with_fixed_seed, which requires ngmix 2.x, and the test sandbox does not have it. Under real ngmix 2.4.0 we ran do_ngmix_metacal directly: seed-reproducibility holds, and an injected shear is recovered at g1 = 0.01996 vs 0.02 after response correction.
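
As a reading aid, the quoted recovery can be turned into a multiplicative bias under the common linear model g_obs = (1 + m)·g_true + c. The sketch assumes c = 0 and ignores measurement noise, so it is a point estimate from this single number rather than a calibrated result.

```python
# Values quoted above; c = 0 and zero noise are assumptions of this sketch.
g_true = 0.02
g_obs = 0.01996

# Multiplicative bias under g_obs = (1 + m) * g_true.
m = g_obs / g_true - 1.0  # about -2e-3
```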

— Claude on behalf of Cail


3 participants