feat: add SupportsSetRange protocol and store implementations#3907
feat: add SupportsSetRange protocol and store implementations#3907d-v-b wants to merge 18 commits into
Conversation
Add SupportsSetRange protocol for stores that support writing to a byte range within an existing value (set_range/set_range_sync). Implement in MemoryStore and LocalStore, both explicitly subclassing the protocol. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3907 +/- ##
==========================================
+ Coverage 93.55% 93.57% +0.01%
==========================================
Files 88 88
Lines 11896 11930 +34
==========================================
+ Hits 11129 11163 +34
Misses 767 767
🚀 New features to boost your workflow:
|
Tests cover isinstance check, async set_range, sync set_range_sync, and edge case (writing at end of value). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r-python into feat/byte-range-setter
|
should this get the same design pivot as #3925 (comment) started in #3925, regarding protocols vs. abc methods? Do you remember where you were previously pointed to using protocols over methods despite the weight of our existing store API? It might be helpful to quickly jot down our decisions here (use methods for now, plan for a better protocol-based store API in the future) in either https://zarr.readthedocs.io/en/stable/contributing/ or a CLAUDE/AGENTS.md for future reference. |
|
byte-range writes are an optional behavior that only a handful of "niche" stores support (local and memory). There's not really a sensible fallback or default implementation, (unlike And if we made this a method on the I don't think we can categorically say "no" to adding functionality to stores or codecs via protocols. There's already a precedent for defining extra functionality with semi-structural mixins: see zarr-python/src/zarr/abc/codec.py Line 256 in 7e58df0 |
|
I'd prefer someone whose work is more oriented towards local/HPC filesystems review this PR if they're available and willing (@LDeakin and @ilan-gold come to mind). I'm not fully prepared to discuss tradeoffs, but lack of a concurrency/atomicity contract in the docstring raised a few questions for me:
|
yes, in the two target stores (local and memory), disjoint range writes should be safe. overlapping range writes will have order-dependent behavior.
set + set_range is a race condition, but so are concurrent sets.
probably, zarr-python 2.x used a write to a temporary file + a rename for atomicity. we don't do that now, but we should! It's worth keeping in mind that there is just 1 intended caller of this method, and only under very special circumstances: the sharding codec when the inner chunks have deterministic compressed sizes. I don't know when this method would be called outside that context. |
mkitti
left a comment
There was a problem hiding this comment.
I think we may need to try some kind of file or thread locking. At the very least, we should ensure that the partial writes are atomic.
| def _put_range(path: Path, value: Buffer, start: int) -> None: | ||
| """Write bytes at a specific offset within an existing file.""" | ||
| with path.open("r+b") as f: | ||
| f.seek(start) | ||
| f.write(value.as_numpy_array().tobytes()) |
There was a problem hiding this comment.
Do we need a lock here? Or maybe use pwrite when possible and this shim otherwise?
import threading
file_lock = threading.Lock()
def pwrite_cross_platform(file_object, data, offset):
with file_lock:
file_object.seek(offset)
file_object.write(data)
finally:
file_object.seek(current_pos)
There was a problem hiding this comment.
if we introduce locking we need to be careful about the guarantees we make. thread-based locks will only block races in the same python process, which means dask can still set up races. and file-based locks are only reliable on a subset of storage backends.
There was a problem hiding this comment.
The bare minimum is document the need for some kind of concurrency control when doing this.
There was a problem hiding this comment.
sounds good, I can add that. the only anticipated consumer of this API will be the sharding codec when targeting uncompressed data in memory or local storage, so the burden of coordination will be on that call site.
|
Thinking about this a little more, I'm not sure if we should expose all of this as public API. For the public API, what we could do in lieu of file locking is use a context manager that does the following:
What we have now here increasingly looks to me like low-level helper functions than something that should not be exposed to external client software directly. Also, this encourages clients to aggregate partial chunk updates which would increase the potential overhead of opening and closing the file many times. Under parallel executation situation, we should encourage a single-writer multiple-reader pattern (SWMR), a concept that I'm borrowing from HDF5:
In this case, there should be a single writer that holds the context and lock for doing partial writes to the file. Other parallel units then should communicate with the single writer if they have data that needs to be written. This allows the single writer to coordinate concurrency and ensure consistent atomic partial writes. In some cases, the single writer might even be able to coalesce writes into a single operation. This might involve some buffering of the writes in memory before actually flushing these to disk. If we are smart about memory paging, we could write in units of "super-chunks" that provide more granularity than an entire shard file but that are coarser than individual chunks. At the moment, the I/O bottleneck is often not shear write speed but rather the number of system calls and the required context switching. |
|
@mkitti thanks, that's insightful. I'm perfectly happy making this private API for now. But I would caution that what you propose sounds like a major (albeit helpful) change to the way stores work today, whereas just adding this method here as-is is closer to the minimal required functionality to support atomic subchunk writes inside a shard. We should probably agree on scope. IMO, in the short term it would be useful to expose range writes via an opt-in mechanism. In the medium-to-long term, we should figure out an elegant and safe way to express this functionality. |
|
The problem my suggestion is trying to fix is actually to make this operation look more like the prior store operations. I started to have to worry about locks and concurrency because with these partial writes it became unclear if the key-value pair had a single owner or multiple owners. The prior store API were either pure functions ( My suggestion above is more consistent with the prior API because it forces a macro operation on the entire value. It just implements that operation as a series of micro operations within the macro operation. The important aspect here is that the ownership of the whole value must be clear. Once we establish atomic ownership of the whole key-value, I think there is now a clear path to extend this operation to other stores. These partial updates could be implemented as whole value updates. The partial write operation is just an optimization available to certain stores. |
I am not sure thinking about things in terms of ownership helps us much here. When Zarr writes to files / objects, there's always the possibility of a third party accessing the same file / object at the same time. In fact, that's kind of the whole point of the feature in this PR: we want multiple loosely coordinated writers to write subchunks to the same shard. Those writers might be separate threads, or even separate processes. This is only safe under special conditions, but under those conditions it would be a very useful feature. So in general a single store instance doesn't own the objects it describes. That's unrealistic given the open nature of general storage backends, and also the specific goal of this PR. It might be more realistic to focus on failure modes: e.g., when can a store operation leave a file / object in an undefined state, and what can we do about that? |
|
Prior to this pull request though, I don't think you would expect that a value to have orignated partly from one write and partly from another write leaving the combined value in perhaps in an inconsistent state. It would be one or the other. There indeed may still be a race condition, but at least the chunk would be internally consistent and decodable. The owner could partition that ownership out to concurrent processes by issuing licenses to certain byte ranges. That could occur by perhaps providing a buffer or a dict-like store structure to write into. The array provided by TensorStore's virtual chunked to user functions is one example of this: Another example is how ImarisWriter works. |
Agreed, I think our sharding write path is missing a key : value model for the contents of a shard. If they keys are strings / bytes and the values are buffers and / or arrays, we would need a transformation on that key : valure that resolves each key to a byte range (with space for the index). We have something roughly like this today, but IMO the abstraction is not complete. Partitioning these keys (byte ranges) among workers serves as the substrate for a coordination / allocation mechanism. But I think this is out of scope for this PR. I think we need to get to that final state incrementally, and IMO the first step is simply defining a low-level routine stores can use for writing a byte range into an object, which is the goal of this PR. If that makes sense, then:
|
Adds a protocol for stores that support synchronously and asynchronously writing a bytes into a range in the target object. only
MemoryStoreandLocalStoreimplement this.this behavior is necessary to enable an in-place writing mode for shards, e.g. where a single subchunk is written without re-writing the entire shard.