fix: register dimensions without coordinates as columns#216
Conversation
…stems#203) A Dataset can have "dimensions without coordinates" (e.g. an image's channel/height/width axes). These are absent from ds.coords, so _parse_schema dropped them from the SQL schema and SELECT * returned too few columns. Even if they were added, slicing a block with isel synthesizes their index relative to the block (restarting at 0 per partition), which would corrupt values on chunked reads. Materialise a default integer index coordinate for every coordinate-less dimension once, at the reader entry points (read_xarray_table and XarrayRecordBatchReader), via a new ensure_default_indexes() helper. This turns them into ordinary dimension coordinates so the existing schema, record-batch, filter-pushdown and round-trip machinery handles them correctly and they carry their absolute position through chunked reads. Adds regression tests for the schema, both record-batch builders (chunked along a coordinate-less dim), and the from_dataset query path.
alxmrs
left a comment
There was a problem hiding this comment.
Looking good so far. Some notes. Thanks for the quick fix.
|
|
||
| @pytest.fixture | ||
| def coordless_dims_ds(self): | ||
| """Mirror the fashion-mnist layout from issue #203: a dimension |
There was a problem hiding this comment.
Let's not refer to the GH issues in code unless there is a TODO.
There was a problem hiding this comment.
Yeah truee.. will take care of this
|
|
||
|
|
||
| def _coordless_ds(): | ||
| """A 2-D dataset whose dimensions have NO coordinates (issue #203).""" |
There was a problem hiding this comment.
Let's omit the reference to the GH issue.
| coords={"sample": ("sample", np.arange(n_sample, dtype="int64"))}, | ||
| ).chunk({"sample": 1}) | ||
|
|
||
| def test_coordless_dims_appear_as_columns(self, coordless_dims_ds): |
| block* (restarting at 0 in every partition). Materialising an explicit | ||
| ``arange`` index up front turns them into ordinary dimension coordinates, so | ||
| they appear as columns and carry their absolute position through chunked | ||
| reads (issue #203). Datasets whose dimensions already have coordinates are |
There was a problem hiding this comment.
Let's omit the gh issue ref.
| Only *dimension coordinates* become dimension columns, so a dimension | ||
| without a coordinate would be dropped. Callers must run the Dataset through | ||
| :func:`ensure_default_indexes` first (the readers do) so every dimension has | ||
| a coordinate and appears as a column (issue #203). |
There was a problem hiding this comment.
This is a good contract to document. Though, let's omit the issue ref.
| each block dict just before it's converted to Arrow. This | ||
| allows tests to track when iteration actually occurs. | ||
| """ | ||
| ds = ensure_default_indexes(ds) |
There was a problem hiding this comment.
I think putting this in read_xarray below is better than the constructor.
| assert batch.num_rows == expected_rows | ||
|
|
||
|
|
||
| def _coordless_ds(): |
There was a problem hiding this comment.
I think we can omit the tests in this file altogether -- the public contract is covered by the sql tests file. Wdyt?
We don't mark it as such, but I see this new utility as a private function.
Privatise the index helper (_ensure_default_indexes), move its call from the XarrayRecordBatchReader constructor to read_xarray, drop the df-level tests (the public contract is covered by the sql tests), and remove the GitHub issue references from code and test docstrings.
alxmrs
left a comment
There was a problem hiding this comment.
LGTM! Thanks for the fix!
Dimensions without coordinates (e.g. an image's channel/height/width axes) were dropped from the SQL schema, so
SELECT *returned too few columns (2 instead of 5 in the #203 repro). This materialises a default integer index for each coordinate-less dimension at the reader boundary, so they show up as columns and keep their absolute position through chunked reads. Adds regression tests. Closes #203.Before / after (issue repro)
Before (main), 2 columns:
After (this PR), 5 columns, values verified against xarray:
Full non-integration test suite passes locally.