resample: reject out-of-range integer nodata sentinels#2671
Conversation
Dedupe duplicate module rows (last-write-wins by last_inspected) and collapse multi-line notes to single physical lines. The notes had embedded newlines, which the merge=union .gitattributes strategy splits record-by-record, corrupting the file into a 156-column phantom row on parallel-agent appends. One line per record keeps union merges safe.
An out-of-range integer nodata sentinel used to wrap silently on the dtype cast in _resolve_nodata (e.g. 999 -> 231 for uint8), masking the wrong cells. Validate that the sentinel round-trips exactly into the integer dtype and raise a clear ValueError otherwise. Boolean inputs are already rejected upstream by _validate_raster, so the nonzero-collapse case in the finding cannot reach _resolve_nodata via the public resample() API; no separate bool handling was added.
brendancol
left a comment
There was a problem hiding this comment.
PR Review: resample: reject out-of-range integer nodata sentinels
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
xrspatial/resample.py:1236-1237: For a Python int sentinel larger than the dtype's C-long range (e.g.nodata=2**70on an int64 raster),np.asarray(nodata).astype(agg.dtype)raisesOverflowError("Python int too large to convert to C long") before the round-trip check runs. The rest of the nodata validation raisesValueError, so this is an inconsistent error type for the same class of bad input. Wrap the cast so an out-of-range int surfaces as the sameValueError("out of range for integer dtype") instead of a rawOverflowError.
Nits (optional improvements)
None.
What looks good
- The check lives in
_resolve_nodata, which runs before backend dispatch, so the validation covers numpy, dask, cupy, and dask+cupy without per-backend duplication. - The round-trip comparison (
cast != int(nodata)) is the right test: it catches both unsigned wrap (-1 to 255) and signed overflow (200 to -56), and the boundary value (uint8 255) still passes. - Tests parametrize numpy and dask+numpy, cover the _FillValue attr path, and assert the in-range boundary case still masks correctly rather than only checking that bad input raises.
- The PR correctly identifies that boolean input cannot reach this helper (rejected upstream by
_validate_raster) and avoids adding dead code.
Note on the diff
The PR diff shows a change to .claude/sweep-test-coverage-state.csv. That is base divergence, not part of this change. The branch was cut before main updated that file. Merging main into the branch will clear it.
Checklist
- Algorithm matches the intended fix: yes
- All implemented backends produce consistent results: yes (validation is pre-dispatch)
- NaN handling correct: yes (float NaN sentinel path unchanged)
- Edge cases covered by tests: mostly; the beyond-int64 overflow case is not covered
- Dask chunk boundaries handled correctly: yes (boundary masking test passes on dask)
- No premature materialization: yes
- Benchmark exists or not needed: not needed (validation-only change)
- README feature matrix updated: not applicable (no new function)
- Docstrings present and accurate: yes (Raises section updated)
… nodata (#2660) Range-check the integer sentinel before the numpy cast so a Python int past the C-long range surfaces as the same out-of-range ValueError as any other unrepresentable sentinel, instead of a raw OverflowError.
brendancol
left a comment
There was a problem hiding this comment.
Follow-up review (after 7097fb0)
The one suggestion from the first pass is resolved.
xrspatial/resample.py:1236-1247: The integer sentinel is now range-checked againstnp.iinfo(agg.dtype)before the numpy cast. A Python int beyond the C-long range (e.g.2**70) now raises the same out-of-rangeValueErroras any other unrepresentable sentinel rather than a rawOverflowError. A regression test (test_sentinel_beyond_int64_raises_valueerror) covers it.
Status
- Blockers: none.
- Suggestions: 1 raised, 1 fixed.
- Nits: none.
No new findings. test_resample.py passes (26 nodata tests, 198 total) and resample.py is flake8-clean.
One pre-existing note, not introduced here and out of scope for this PR: a np.float32 sentinel with a fractional part bypasses the Python-float fractional check at line 1228 and would truncate silently. np.float64 is covered because it subclasses Python float. Worth a separate look if float32 attr sentinels are a real input path.
|
Closing this as superseded by #2676, which already merged the same One thing this PR got right that the merged version did not: in the per-source (non-chunked) VRT read path, #2676 forwards |
|
Reopening — my earlier close was a mistake. I confused this PR (resample nodata validation, branch |
# Conflicts: # xrspatial/resample.py
…b#2671) * Repair sweep-test-coverage state CSV Dedupe duplicate module rows (last-write-wins by last_inspected) and collapse multi-line notes to single physical lines. The notes had embedded newlines, which the merge=union .gitattributes strategy splits record-by-record, corrupting the file into a 156-column phantom row on parallel-agent appends. One line per record keeps union merges safe. * resample: reject out-of-range integer nodata sentinels (xarray-contrib#2660) An out-of-range integer nodata sentinel used to wrap silently on the dtype cast in _resolve_nodata (e.g. 999 -> 231 for uint8), masking the wrong cells. Validate that the sentinel round-trips exactly into the integer dtype and raise a clear ValueError otherwise. Boolean inputs are already rejected upstream by _validate_raster, so the nonzero-collapse case in the finding cannot reach _resolve_nodata via the public resample() API; no separate bool handling was added. * Address review: raise ValueError (not OverflowError) for beyond-int64 nodata (xarray-contrib#2660) Range-check the integer sentinel before the numpy cast so a Python int past the C-long range surfaces as the same out-of-range ValueError as any other unrepresentable sentinel, instead of a raw OverflowError.
Closes #2660.
What changed
_resolve_nodatainxrspatial/resample.pynow checks that an integernodatasentinel round-trips exactly intoagg.dtypebefore trusting the cast. An out-of-range value (e.g.999foruint8, which wraps to231) raises aValueErrorinstead of silently masking the wrong cells.resample()docstringRaisessection to cover the new case.Note on boolean inputs
The original finding also flagged boolean inputs collapsing nonzero sentinels to
True.resample()rejects boolean input upstream in_validate_raster("must have a real numeric dtype"), so that path cannot reach_resolve_nodatathrough the public API. I did not add bool handling to_resolve_nodatasince it would be dead code for the only caller.Backend coverage
The validation lives in the shared
_resolve_nodatahelper, which runs before backend dispatch, so it applies to all backends. Tests parametrize numpy and dask+numpy; the boundary-value masking test confirms the dask path keeps data chunked and masks correctly.Test plan
TestNodataOutOfRangeclass: uint8-above-max, uint8-negative, int8-above-max, sentinel-via-_FillValue, and an in-range boundary (uint8 255) that must still mask.test_resample.pypasses (197 tests).