feat: add SupportsSetRange protocol and store implementations#3907

Open
d-v-b wants to merge 18 commits into
zarr-developers:mainfrom
d-v-b:feat/byte-range-setter

@d-v-b
Contributor

@d-v-b d-v-b commented Apr 15, 2026

Adds a protocol for stores that support synchronously and asynchronously writing bytes into a range of the target object. Only MemoryStore and LocalStore implement this.

this behavior is necessary to enable an in-place writing mode for shards, e.g. where a single subchunk is written without re-writing the entire shard.
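
A rough sketch of what such an opt-in protocol could look like. The method names set_range and set_range_sync come from this PR; the (key, value, start) signature, the use of plain bytes in place of zarr's Buffer, and the ToyMemoryStore class are assumptions made for illustration, not the code in the diff.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsSetRange(Protocol):
    """Assumed shape of the protocol: write `value` into `key` starting at `start`."""

    async def set_range(self, key: str, value: bytes, start: int) -> None: ...

    def set_range_sync(self, key: str, value: bytes, start: int) -> None: ...


class ToyMemoryStore:
    """Hypothetical in-memory store, used only to exercise the protocol."""

    def __init__(self) -> None:
        self._data: dict[str, bytearray] = {}

    def set_sync(self, key: str, value: bytes) -> None:
        self._data[key] = bytearray(value)

    def get_sync(self, key: str) -> bytes:
        return bytes(self._data[key])

    def set_range_sync(self, key: str, value: bytes, start: int) -> None:
        buf = self._data[key]  # the key must already exist
        end = start + len(value)
        if start < 0 or end > len(buf):
            raise ValueError("range falls outside the existing value")
        buf[start:end] = value  # overwrite in place, length unchanged

    async def set_range(self, key: str, value: bytes, start: int) -> None:
        self.set_range_sync(key, value, start)


store = ToyMemoryStore()
store.set_sync("shard", b"\x00" * 8)
store.set_range_sync("shard", b"ab", 2)
asyncio.run(store.set_range("shard", b"cd", 6))
result = store.get_sync("shard")
```

Here only the two touched ranges change; the rest of the value is left as it was.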

Add SupportsSetRange protocol for stores that support writing to a byte
range within an existing value (set_range/set_range_sync). Implement
in MemoryStore and LocalStore, both explicitly subclassing the protocol.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the needs release notes Automatically applied to PRs which haven't added release notes label Apr 15, 2026
@codecov

codecov Bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.57%. Comparing base (6ce787d) to head (6aa4e6d).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3907      +/-   ##
==========================================
+ Coverage   93.55%   93.57%   +0.01%     
==========================================
  Files          88       88              
  Lines       11896    11930      +34     
==========================================
+ Hits        11129    11163      +34     
  Misses        767      767              
Files with missing lines Coverage Δ
src/zarr/abc/store.py 96.47% <100.00%> (+0.04%) ⬆️
src/zarr/storage/_local.py 97.42% <100.00%> (+0.17%) ⬆️
src/zarr/storage/_memory.py 96.95% <100.00%> (+0.21%) ⬆️

d-v-b and others added 2 commits April 15, 2026 11:03
Tests cover isinstance check, async set_range, sync set_range_sync,
and edge case (writing at end of value).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Apr 15, 2026
@d-v-b d-v-b requested a review from maxrjones April 21, 2026 19:12
@maxrjones
Member

Should this get the same design pivot that started in #3925 (comment), regarding protocols vs. ABC methods?

Do you remember where you were previously pointed to using protocols over methods despite the weight of our existing store API? It might be helpful to quickly jot down our decisions here (use methods for now, plan for a better protocol-based store API in the future) in either https://zarr.readthedocs.io/en/stable/contributing/ or a CLAUDE/AGENTS.md for future reference.

@d-v-b
Contributor Author

d-v-b commented May 15, 2026

byte-range writes are an optional behavior that only a handful of "niche" stores support (local and memory). There's not really a sensible fallback or default implementation (unlike get_ranges). So it makes sense for stores to opt in rather than opt out.

And if we made this a method on the Store abc, callers would need to check for NotImplementedError to figure out if the store really supports it, and the method would clutter the signatures of most stores that will never support it (cloud storage).
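
To make the opt-in argument concrete, here is a small sketch contrasting the two designs. All three store classes are hypothetical, and the set_range_sync signature is an assumption. With a protocol, a store that never defines the method is simply not an instance. With a base-class method that raises NotImplementedError, the method is always present, so an attribute-based check cannot tell the caller anything and the only way to find out is to call it.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsSetRange(Protocol):
    # Assumed signature; only the method name comes from the PR.
    def set_range_sync(self, key: str, value: bytes, start: int) -> None: ...


class OptInStore:
    """Hypothetical store that really implements range writes."""

    def set_range_sync(self, key: str, value: bytes, start: int) -> None:
        pass  # the actual write is omitted in this sketch


class AbcStyleStore:
    """Hypothetical store that inherited a method it cannot support."""

    def set_range_sync(self, key: str, value: bytes, start: int) -> None:
        raise NotImplementedError


class PlainStore:
    """Hypothetical store with no range-write method at all."""


def supports_range_writes(store: object) -> bool:
    # runtime_checkable protocols only check that the method exists.
    return isinstance(store, SupportsSetRange)


checks = {
    "opt_in": supports_range_writes(OptInStore()),
    "abc_style": supports_range_writes(AbcStyleStore()),
    "plain": supports_range_writes(PlainStore()),
}
```

The "abc_style" entry is True even though the call would fail, which is the ambiguity the opt-in design avoids.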

I don't think we can categorically say "no" to adding functionality to stores or codecs via protocols. There's already a precedent for defining extra functionality with semi-structural mixins: see class ArrayBytesCodecPartialEncodeMixin. Arguably this should have been a protocol from the start.

@maxrjones
Member

I'd prefer someone whose work is more oriented towards local/HPC filesystems review this PR if they're available and willing (@LDeakin and @ilan-gold come to mind).

I'm not fully prepared to discuss tradeoffs, but the lack of a concurrency/atomicity contract in the docstring raised a few questions for me:

  1. Is parallel set_range to disjoint ranges of the same key supposed to be safe? The motivating sharded-write use case suggests yes, but LocalStore doesn't seem to have locking.
  2. Is set_range racing against set defined?
  3. Should crash-mid-write atomicity be a protocol requirement?

@d-v-b
Contributor Author

d-v-b commented May 15, 2026

> Is parallel set_range to disjoint ranges of the same key supposed to be safe? The motivating sharded-write use case suggests yes, but LocalStore doesn't seem to have locking.

yes, in the two target stores (local and memory), disjoint range writes should be safe. overlapping range writes will have order-dependent behavior.
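
A small experiment along these lines, as a sketch only: several threads each open the same file and write one fixed-size slot at a non-overlapping offset. The put_range helper mirrors the seek-then-write approach in the diff but takes plain bytes; the slot size, the file layout, and the absence of a shard index are assumptions made to keep the example short. It says nothing about overlapping ranges or about other storage backends.

```python
import tempfile
import threading
from pathlib import Path

CHUNK_NBYTES = 4  # assumed fixed encoded size of each subchunk
N_CHUNKS = 8


def put_range(path: Path, value: bytes, start: int) -> None:
    """Write bytes at an offset within an existing file (no locking)."""
    with path.open("r+b") as f:
        f.seek(start)
        f.write(value)


def write_chunk(path: Path, index: int) -> None:
    # Each worker touches only its own slot, so the ranges are disjoint.
    payload = bytes([index + 1]) * CHUNK_NBYTES
    put_range(path, payload, index * CHUNK_NBYTES)


with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard"
    shard.write_bytes(bytes(CHUNK_NBYTES * N_CHUNKS))  # preallocate with zeros
    threads = [
        threading.Thread(target=write_chunk, args=(shard, i)) for i in range(N_CHUNKS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    result = shard.read_bytes()

expected = b"".join(bytes([i + 1]) * CHUNK_NBYTES for i in range(N_CHUNKS))
```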

> Is set_range racing against set defined?

set + set_range is a race condition, but so are concurrent sets.

> Should crash-mid-write atomicity be a protocol requirement?

probably. zarr-python 2.x used a write to a temporary file followed by a rename for atomicity. we don't do that now, but we should!
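
For reference, the general temp-file-plus-rename pattern can be sketched as follows. This illustrates the technique using only the standard library; it is not a copy of what zarr-python 2.x did, and atomic_write is a hypothetical helper name. Note that it replaces a whole value, so it does not by itself make a partial write atomic.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> None:
    """Write `data` to a sibling temp file, then rename it over `path`."""
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), prefix=path.name + ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # readers see the old or the new file, never a mix
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "chunk"
    atomic_write(target, b"first")
    atomic_write(target, b"second")
    final = target.read_bytes()
    leftovers = sorted(p.name for p in Path(tmp).iterdir())
```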

It's worth keeping in mind that there is just 1 intended caller of this method, and only under very special circumstances: the sharding codec when the inner chunks have deterministic compressed sizes. I don't know when this method would be called outside that context.

@d-v-b d-v-b requested a review from mkitti May 29, 2026 11:38
Contributor

@mkitti mkitti left a comment

I think we may need to try some kind of file or thread locking. At the very least, we should ensure that the partial writes are atomic.

Comment on lines +81 to +85
def _put_range(path: Path, value: Buffer, start: int) -> None:
    """Write bytes at a specific offset within an existing file."""
    with path.open("r+b") as f:
        f.seek(start)
        f.write(value.as_numpy_array().tobytes())
Contributor

Do we need a lock here? Or maybe use pwrite when possible and this shim otherwise?

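import threading

file_lock = threading.Lock()

def pwrite_cross_platform(file_object, data, offset):
    # Emulate pwrite: write at `offset`, then restore the previous position.
    with file_lock:
        current_pos = file_object.tell()
        try:
            file_object.seek(offset)
            file_object.write(data)
        finally:
            file_object.seek(current_pos)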
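
One way the "pwrite when possible, shim otherwise" idea could be spelled out, as a sketch: os.pwrite exists only on Unix-like platforms, so the helper checks for it and otherwise falls back to a locked seek/write on the file descriptor. The write_at name is hypothetical, and the lock only coordinates threads within a single process.

```python
import os
import tempfile
import threading

_fallback_lock = threading.Lock()


def write_at(fd: int, data: bytes, offset: int) -> int:
    """Write `data` at `offset` without moving the descriptor's file position."""
    if hasattr(os, "pwrite"):
        # Positioned write; the file offset of `fd` is left untouched.
        return os.pwrite(fd, data, offset)
    with _fallback_lock:  # protects the seek/write pair within this process only
        saved = os.lseek(fd, 0, os.SEEK_CUR)
        try:
            os.lseek(fd, offset, os.SEEK_SET)
            return os.write(fd, data)
        finally:
            os.lseek(fd, saved, os.SEEK_SET)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "blob")
    with open(path, "wb") as f:
        f.write(b"0123456789")
    # O_BINARY only exists on Windows; elsewhere the flag is 0.
    fd = os.open(path, os.O_RDWR | getattr(os, "O_BINARY", 0))
    try:
        written = write_at(fd, b"AB", 4)
        position_after = os.lseek(fd, 0, os.SEEK_CUR)
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        contents = f.read()
```

Both branches return the number of bytes written, which a caller would need to check, since a low-level write is allowed to be short.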

Contributor Author

if we introduce locking we need to be careful about the guarantees we make. thread-based locks will only block races in the same python process, which means dask can still set up races. and file-based locks are only reliable on a subset of storage backends.

Contributor

The bare minimum is to document the need for some kind of concurrency control when doing this.

Contributor Author

sounds good, I can add that. the only anticipated consumer of this API will be the sharding codec when targeting uncompressed data in memory or local storage, so the burden of coordination will be on that call site.

@mkitti
Contributor

mkitti commented May 29, 2026

Thinking about this a little more, I'm not sure if we should expose all of this as public API.

For the public API, what we could do in lieu of file locking is use a context manager that does the following:

  1. On entering the context, moves, copies, or creates a hard link of the current file being edited to a temporary "lock" file. This is similar to the atomic file write mechanisms that are currently implemented.
  2. Allows for partial writes to the file of interest within the context
  3. On exit, flushes the partial writes to disk and then "moves" the current file back to the original location.
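
The three steps above could be sketched roughly as follows. This is one possible variant, not a proposed implementation: it copies the file (rather than moving or hard-linking it), applies the partial writes to the copy, and renames the copy over the original on exit. The exclusive_edit name is hypothetical, and nothing here stops a second writer from entering the same context concurrently.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def exclusive_edit(path: Path):
    """Yield a file object for partial writes; publish the result on clean exit."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), prefix=path.name + ".edit.")
    os.close(fd)
    try:
        shutil.copyfile(path, tmp_name)  # step 1: work on a private copy
        with open(tmp_name, "r+b") as f:
            yield f  # step 2: the caller seeks and writes as needed
            f.flush()
            os.fsync(f.fileno())  # step 3a: flush the partial writes to disk
        os.replace(tmp_name, path)  # step 3b: move the edited copy into place
    except BaseException:
        # On any failure the original file is left exactly as it was.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "shard"
    target.write_bytes(b"0123456789")
    with exclusive_edit(target) as f:
        f.seek(2)
        f.write(b"XY")
        f.seek(8)
        f.write(b"Z")
        during = target.read_bytes()  # readers still see the old contents
    after = target.read_bytes()
```

The cost of this variant is a full copy of the file per context, which is why batching many partial writes inside one context matters.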

What we have now here increasingly looks to me more like low-level helper functions than something that should be exposed to external client software directly. Also, this encourages clients to aggregate partial chunk updates which would increase the potential overhead of opening and closing the file many times.

Under parallel execution, we should encourage a single-writer multiple-reader pattern (SWMR), a concept that I'm borrowing from HDF5:

In this case, there should be a single writer that holds the context and lock for doing partial writes to the file. Other parallel units then should communicate with the single writer if they have data that needs to be written. This allows the single writer to coordinate concurrency and ensure consistent atomic partial writes. In some cases, the single writer might even be able to coalesce writes into a single operation. This might involve some buffering of the writes in memory before actually flushing these to disk. If we are smart about memory paging, we could write in units of "super-chunks" that provide more granularity than an entire shard file but that are coarser than individual chunks. At the moment, the I/O bottleneck is often not sheer write speed but rather the number of system calls and the required context switching.
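
A toy version of the single-writer idea, under heavy simplifying assumptions: producers push (offset, data) pairs onto a queue, and one writer thread buffers everything until it sees a sentinel, merges exactly adjacent ranges, and then writes each merged run once. The names coalesce and writer_loop are hypothetical, overlapping ranges are not handled, and a real writer would flush in bounded batches rather than buffering all requests.

```python
import queue
import tempfile
import threading
from pathlib import Path


def coalesce(writes):
    """Merge (offset, data) pairs that are exactly adjacent once sorted by offset."""
    merged = []
    for offset, data in sorted(writes):
        if merged and merged[-1][0] + len(merged[-1][1]) == offset:
            merged[-1] = (merged[-1][0], merged[-1][1] + data)
        else:
            merged.append((offset, data))
    return merged


def writer_loop(path: Path, inbox: queue.Queue, stats: dict) -> None:
    pending = []
    while True:
        item = inbox.get()
        if item is None:  # sentinel: no more writes will arrive
            break
        pending.append(item)
    batches = coalesce(pending)
    with path.open("r+b") as f:  # the file is opened once, by the only writer
        for offset, data in batches:
            f.seek(offset)
            f.write(data)
    stats["n_writes"] = len(batches)


with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard"
    shard.write_bytes(bytes(16))
    inbox: queue.Queue = queue.Queue()
    stats: dict = {}
    writer = threading.Thread(target=writer_loop, args=(shard, inbox, stats))
    writer.start()
    producers = [
        threading.Thread(target=inbox.put, args=((i * 4, bytes([65 + i]) * 4),))
        for i in range(4)
    ]
    for p in producers:
        p.start()
    for p in producers:
        p.join()
    inbox.put(None)
    writer.join()
    result = shard.read_bytes()
```

In this run the four 4-byte requests are contiguous, so they collapse into a single write call regardless of the order in which the producers submitted them.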

@d-v-b
Contributor Author

d-v-b commented May 29, 2026

@mkitti thanks, that's insightful. I'm perfectly happy making this private API for now. But I would caution that what you propose sounds like a major (albeit helpful) change to the way stores work today, whereas just adding this method here as-is is closer to the minimal required functionality to support atomic subchunk writes inside a shard.

We should probably agree on scope. IMO, in the short term it would be useful to expose range writes via an opt-in mechanism. In the medium-to-long term, we should figure out an elegant and safe way to express this functionality.

@mkitti
Contributor

mkitti commented May 29, 2026

What my suggestion is actually trying to do is make this operation look more like the prior store operations. I started to have to worry about locks and concurrency because with these partial writes it became unclear if the key-value pair had a single owner or multiple owners.

The prior store API operations were either pure functions (get_*) or operated on the entire value associated with a key. These partial write operations break that because they manipulate part of the value, allowing the possibility of multiple partial updates which may have to be deconflicted because it is not clear if the partial updates are coordinated.

My suggestion above is more consistent with the prior API because it forces a macro operation on the entire value. It just implements that operation as a series of micro operations within the macro operation. The important aspect here is that the ownership of the whole value must be clear.

Once we establish atomic ownership of the whole key-value, I think there is now a clear path to extend this operation to other stores. These partial updates could be implemented as whole value updates. The partial write operation is just an optimization available to certain stores.
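
As a sketch of that last point, a store with only whole-value get and set could apply a batch of partial updates as a single read-modify-write. WholeValueStore and update_ranges are hypothetical names, and the sketch assumes the caller already holds whatever ownership of the key the surrounding design provides.

```python
class WholeValueStore:
    """Hypothetical store that only supports whole-value reads and writes."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}
        self.set_calls = 0

    def get(self, key: str) -> bytes:
        return self._data[key]

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = bytes(value)
        self.set_calls += 1


def update_ranges(store: WholeValueStore, key: str, writes) -> None:
    """Apply (start, data) updates in memory, then write the whole value once."""
    buf = bytearray(store.get(key))
    for start, data in writes:
        end = start + len(data)
        if start < 0 or end > len(buf):
            raise ValueError("range falls outside the existing value")
        buf[start:end] = data
    store.set(key, bytes(buf))  # one macro operation on the entire value


store = WholeValueStore()
store.set("k", b"........")
update_ranges(store, "k", [(0, b"ab"), (6, b"yz")])
result = store.get("k")
```

A store that does support range writes could run the same batch in place; the interface seen by the caller would not need to change.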

@d-v-b
Contributor Author

d-v-b commented May 29, 2026

> I started to have to worry about locks and concurrency because with these partial writes it became unclear if the key-value pair had a single owner or multiple owners.

I am not sure thinking about things in terms of ownership helps us much here. When Zarr writes to files / objects, there's always the possibility of a third party accessing the same file / object at the same time. In fact, that's kind of the whole point of the feature in this PR: we want multiple loosely coordinated writers to write subchunks to the same shard. Those writers might be separate threads, or even separate processes. This is only safe under special conditions, but under those conditions it would be a very useful feature.

So in general a single store instance doesn't own the objects it describes. That's unrealistic given the open nature of general storage backends, and also the specific goal of this PR. It might be more realistic to focus on failure modes: e.g., when can a store operation leave a file / object in an undefined state, and what can we do about that?

@mkitti
Contributor

mkitti commented May 29, 2026

Prior to this pull request though, I don't think you would expect a value to have originated partly from one write and partly from another write, leaving the combined value in a perhaps inconsistent state. It would be one or the other. There indeed may still be a race condition, but at least the chunk would be internally consistent and decodable.

The owner could partition that ownership out to concurrent processes by issuing licenses to certain byte ranges. That could occur by perhaps providing a buffer or a dict-like store structure to write into.

The array that TensorStore's virtual_chunked provides to user functions is one example of this:
https://google.github.io/tensorstore/python/api/tensorstore.virtual_chunked.html#

Another example is how ImarisWriter works.

@d-v-b
Contributor Author

d-v-b commented May 30, 2026

> Prior to this pull request though, I don't think you would expect a value to have originated partly from one write and partly from another write, leaving the combined value in a perhaps inconsistent state. It would be one or the other. There indeed may still be a race condition, but at least the chunk would be internally consistent and decodable.
>
> The owner could partition that ownership out to concurrent processes by issuing licenses to certain byte ranges. That could occur by perhaps providing a buffer or a dict-like store structure to write into.
>
> The array that TensorStore's virtual_chunked provides to user functions is one example of this: https://google.github.io/tensorstore/python/api/tensorstore.virtual_chunked.html#
>
> Another example is how ImarisWriter works.

Agreed, I think our sharding write path is missing a key : value model for the contents of a shard. If the keys are strings / bytes and the values are buffers and / or arrays, we would need a transformation on that key : value model that resolves each key to a byte range (with space for the index). We have something roughly like this today, but IMO the abstraction is not complete. Partitioning these keys (byte ranges) among workers serves as the substrate for a coordination / allocation mechanism.
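
A minimal sketch of that mapping, assuming fixed-size encoded chunks and an index stored at the end of the shard. The layout and partition helpers, the key strings, and the byte counts are all hypothetical and chosen only to show the idea.

```python
def layout(chunk_keys, chunk_nbytes, index_nbytes):
    """Map each key to a (start, stop) byte range, reserving space for the index."""
    ranges = {
        key: (i * chunk_nbytes, (i + 1) * chunk_nbytes)
        for i, key in enumerate(chunk_keys)
    }
    total_nbytes = len(chunk_keys) * chunk_nbytes + index_nbytes
    return ranges, total_nbytes


def partition(keys, n_workers):
    """Hand out keys round-robin; each key's byte range then has a single writer."""
    return [keys[i::n_workers] for i in range(n_workers)]


keys = ["0.0", "0.1", "1.0", "1.1"]
ranges, total_nbytes = layout(keys, chunk_nbytes=16, index_nbytes=64)
assignments = partition(keys, n_workers=2)
```

Because every key resolves to a distinct range and every key goes to exactly one worker, the workers' writes are disjoint by construction.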

But I think this is out of scope for this PR. I think we need to get to that final state incrementally, and IMO the first step is simply defining a low-level routine stores can use for writing a byte range into an object, which is the goal of this PR.

If that makes sense, then:

  • what needs to change in this PR?
  • what should the follow-up PRs contain?
