feat(format): partitioned bulk ingest API#4317
Conversation
|
Do you want to target this against the spec-1.2.0 branch instead? |
| /// write; that happens at Commit (commit it) or Abort (drop it). | ||
| /// | ||
| /// \since ADBC API revision 1.2.0 | ||
| struct AdbcIngestReceipt { |
There was a problem hiding this comment.
It would be good to explain the receipt/handle explicitly here instead of leaving it implicit from the below definitions.
Additionally I think the comments could generally be cleaned up.
| /// Abort is best-effort. If cleanup is incomplete, the driver | ||
| /// returns a warning status and orphaned storage may remain; it is |
There was a problem hiding this comment.
Perhaps this would leverage ConnectionSetWarningHandler?
There was a problem hiding this comment.
I meant some sort of an "okish" adbc status, but warning handler sounds like a more appropriate solution. unless you think it's not necessary at all.. just documenting it as best-effort could be enough as well.
There was a problem hiding this comment.
I guess the question is, in what sorts of scenarios would it fail? And would it be possible/sensible to retry? Should we try to provide some structured information on what wasn't cleaned up so the user/an operator can try to do it manually? (In which case maybe a warning handler isn't sufficient?)
| /// the driver's responsibility to provide housekeeping (e.g. TTL, | ||
| /// background GC, or documented manual cleanup). Callers may also |
There was a problem hiding this comment.
This sounds more like the semantics of the backend system
CurtHagenlocher
left a comment
There was a problem hiding this comment.
Some initial thoughts.
| /// Abort is best-effort. If cleanup is incomplete, the driver | ||
| /// returns a warning status and orphaned storage may remain; it is |
There was a problem hiding this comment.
Perhaps this would leverage ConnectionSetWarningHandler?
| /// call Abort if the coordinator crashed and was restarted without | ||
| /// the original receipts. |
There was a problem hiding this comment.
Is it worth distinguishing the case of "Abort an unknown handle" from the case of "something went wrong during Abort" with a specific error code for the former?
95d027c to
310e138
Compare
|
@lidavidm should I leave only the spec changes in the PR against spec branch? or postgres impl as well? |
|
Impl is fine as well, but the branch seems to have picked up a lot of extra commits |
Adds the spec document and adbc.h declarations for partitioned bulk ingest — the write-side mirror of ExecutePartitions/ReadPartition.
310e138 to
c264c82
Compare
|
some of the core cpp adbc manager codebase differs between main and spec-1.2.0. so I left only spec changes here. kept the impl in another branch on my repo. |
|
How do you think this API would map to https://www.databricks.com/blog/ingesting-milky-way-petabyte-scale-zerobus-ingest? |
|
@lidavidm I took a look at the sdk. I think the answer depends on whether we're looking at some sort of a bounded batch scenario or pure continuous streaming. Batch: The main problem here is that the API currently targets a scenario where writers are used to stage the data and then a coordinator "commits" during a Complete call. For iceberg/delta, that's a natural abstraction, for databases you can work around it by staging to a temp table and doing a hopefully inexpensive swap at the end. Having said that, there's no reason a driver can't implement it w/o a staging table dance and simply start writing to a target table in which case a Complete call is kind of a no-op, the data has already been materialized. maybe we should also have an option to allow a client to configure which mode they want the driver to work in? wdyt? So, a ZeroBus driver would implement this the same way as any other driver, either stage writes somewhere (temp Delta path, staging topic, etc.), then atomically promote them on Complete (not sure how easy that is in dbx unity) or start writing to the target table directly and Complete becomes something of a no-op. Same staging-and-swap pattern as an RDBMS driver, just with different internals. Streaming/continuous: The API doesn't prevent it — a driver could treat the handle as a long-lived session token. The main value the API provides would be the handle that's produced after a centralized setup (validate target, schema, permissions once). Complete semantics will probably be different in that case. you can either ignore it or treat it as some way to store acked receipts from writers, some sort of a state management. There's also a question whether that state needs to be exposed to the client as well and whether the exposed state should be opaque or something more concrete (for example, offsets). In short, I deliberately avoided going down the streaming rabbithole in this PR, but that can be changed of course. |
| size_t, struct ArrowArrayStream*, | ||
| struct AdbcSerializableHandle*, | ||
| struct AdbcError*); | ||
| AdbcStatusCode (*ConnectionCompleteIngestPartitions)(struct AdbcConnection*, |
There was a problem hiding this comment.
@lidavidm I think we need to distinguish between 3 possible groups of status codes of a Complete call:
- success: ADBC_STATUS_OK
- retryable: Complete failed, but a client should retry a complete call w/o restaging the data or regenerating receipts. The example is delta/iceberg concurrency conflict error.
- terminal: something else went wrong. start from scratch.
I couldn't really map retryable status to any existing status code. I'm thinking of adding either ADBC_STATUS_CONFLICT or ADBC_STATUS_RETRY. wdyt?
There was a problem hiding this comment.
Maybe just an out parameter to indicate whether it was success-or-retryable? A new status code would be a big change
There was a problem hiding this comment.
let me push back a little. out param feels clean for C interface, but replicating the same for language apis will be a patchwork of different solutions, they would have to either go into exceptions or have additional output in the signature depending on the language. btw, what makes a new status code a big change, would something break? are there code paths that rely on exhaustive checks?
Another alternative is to put some additional metadata inside an Error message. That's probably abuse, but also could work.
There was a problem hiding this comment.
I hear you, but my worry is that a new error code could be in principle returned from any API function, which would break existing users.
If we say this code is only used for this particular API function, then I think it makes more sense for an out parameter than an error code (which has a non-local impact).
Also language bindings could use different strategies, e.g. Rust would return an Result<PartitionedIngestStatus> or something and not an out parameter, and deal with the messiness at the FFI layer. This is already the case for various APIs (e.g. Java treats bulk ingest itself differently).
demo PR for a new partitioned ingest API.