feat(format): partitioned bulk ingest API by tokoko · Pull Request #4317 · apache/arrow-adbc

tokoko · 2026-05-17T07:13:13Z

Demo PR for a new partitioned ingest API.

lidavidm · 2026-05-20T22:50:08Z

Do you want to target this against the spec-1.2.0 branch instead?

lidavidm · 2026-05-20T22:53:56Z

+/// write; that happens at Commit (commit it) or Abort (drop it).
+///
+/// \since ADBC API revision 1.2.0
+struct AdbcIngestReceipt {


It would be good to explain the receipt/handle explicitly here instead of leaving it implicit from the below definitions.

Additionally I think the comments could generally be cleaned up.

lidavidm · 2026-05-20T22:54:45Z

+/// Abort is best-effort.  If cleanup is incomplete, the driver
+/// returns a warning status and orphaned storage may remain; it is


What is a "warning status"?

Perhaps this would leverage ConnectionSetWarningHandler?

I meant some sort of "okish" ADBC status, but a warning handler sounds like a more appropriate solution, unless you think it's not necessary at all. Just documenting it as best-effort could be enough as well.

I guess the question is, in what sorts of scenarios would it fail? And would it be possible/sensible to retry? Should we try to provide some structured information on what wasn't cleaned up so the user/an operator can try to do it manually? (In which case maybe a warning handler isn't sufficient?)

lidavidm · 2026-05-20T22:55:06Z

+/// the driver's responsibility to provide housekeeping (e.g. TTL,
+/// background GC, or documented manual cleanup).  Callers may also


This sounds more like the semantics of the backend system

CurtHagenlocher

Some initial thoughts.

CurtHagenlocher · 2026-05-21T19:09:46Z

+/// Abort is best-effort.  If cleanup is incomplete, the driver
+/// returns a warning status and orphaned storage may remain; it is


Perhaps this would leverage ConnectionSetWarningHandler?

CurtHagenlocher · 2026-05-21T19:14:26Z

+/// call Abort if the coordinator crashed and was restarted without
+/// the original receipts.


Is it worth distinguishing the case of "Abort an unknown handle" from the case of "something went wrong during Abort" with a specific error code for the former?
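A minimal sketch of the distinction being asked about, assuming a driver that keeps a table of active ingest handles. Every name here (`SketchStatus`, `SketchIngest`, `sketch_abort`) is hypothetical and exists only for illustration; none of it is part of adbc.h or of this PR.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes for this sketch only; not the values in adbc.h. */
typedef enum {
  SKETCH_OK = 0,
  SKETCH_UNKNOWN_HANDLE = 1, /* handle was never issued, or already finalized */
  SKETCH_CLEANUP_FAILED = 2  /* handle is known, but staged data was not removed */
} SketchStatus;

/* Hypothetical driver-side record of one in-flight partitioned ingest. */
typedef struct {
  const char* handle; /* serialized ingest handle */
  int cleanup_works;  /* simulates whether backend cleanup succeeds */
  int active;         /* 1 while the ingest is neither committed nor aborted */
} SketchIngest;

/* Abort that reports "unknown handle" separately from "cleanup went wrong". */
SketchStatus sketch_abort(SketchIngest* table, size_t n, const char* handle) {
  for (size_t i = 0; i < n; i++) {
    if (table[i].active && strcmp(table[i].handle, handle) == 0) {
      if (!table[i].cleanup_works) {
        /* The ingest stays active so the caller can retry the abort. */
        return SKETCH_CLEANUP_FAILED;
      }
      table[i].active = 0;
      return SKETCH_OK;
    }
  }
  return SKETCH_UNKNOWN_HANDLE;
}
```

With distinct codes, a restarted coordinator could treat an unknown handle as "nothing left to do", while a cleanup failure remains something to retry or escalate.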

tokoko · 2026-05-29T04:17:28Z

@lidavidm should I leave only the spec changes in the PR against the spec branch, or include the postgres impl as well?

lidavidm · 2026-05-29T07:32:06Z

Impl is fine as well, but the branch seems to have picked up a lot of extra commits

tokoko · 2026-06-02T20:48:24Z

Some of the core C++ ADBC driver manager code differs between main and spec-1.2.0, so I left only the spec changes here and kept the impl in another branch on my repo.

lidavidm · 2026-06-16T21:44:39Z

How do you think this API would map to https://www.databricks.com/blog/ingesting-milky-way-petabyte-scale-zerobus-ingest?

tokoko · 2026-06-17T05:55:56Z

@lidavidm I took a look at the sdk. I think the answer depends on whether we're looking at some sort of a bounded batch scenario or pure continuous streaming.

Batch: The main problem here is that the API currently targets a scenario where writers stage the data and a coordinator then "commits" during a Complete call. For Iceberg/Delta that's a natural abstraction; for databases you can work around it by staging to a temp table and doing a (hopefully inexpensive) swap at the end. Having said that, there's no reason a driver can't implement it without the staging-table dance and simply write to the target table directly, in which case the Complete call is essentially a no-op because the data has already been materialized. Maybe we should also have an option that lets a client configure which mode the driver works in? wdyt?

So, a ZeroBus driver would implement this the same way as any other driver, either stage writes somewhere (temp Delta path, staging topic, etc.), then atomically promote them on Complete (not sure how easy that is in dbx unity) or start writing to the target table directly and Complete becomes something of a no-op. Same staging-and-swap pattern as an RDBMS driver, just with different internals.
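The two modes described above can be sketched roughly as follows. This is a toy model under stated assumptions, not driver code: `SketchMode`, `SketchTarget`, and both functions are hypothetical names, and row counters stand in for real storage.

```c
#include <stddef.h>

/* Hypothetical ingest modes; not an option defined by this PR. */
typedef enum { SKETCH_MODE_STAGED, SKETCH_MODE_DIRECT } SketchMode;

/* Toy target table: counts rows instead of storing them. */
typedef struct {
  SketchMode mode;
  size_t staged_rows;  /* written by writers, not yet visible to readers */
  size_t visible_rows; /* visible in the target table */
} SketchTarget;

/* A writer task delivers one partition. */
void sketch_write_partition(SketchTarget* t, size_t rows) {
  if (t->mode == SKETCH_MODE_STAGED) {
    t->staged_rows += rows; /* e.g. temp table or staging path */
  } else {
    t->visible_rows += rows; /* written straight into the target */
  }
}

/* The coordinator finalizes the ingest. In direct mode nothing is staged,
 * so this promotes zero rows and is effectively a no-op. */
void sketch_complete(SketchTarget* t) {
  t->visible_rows += t->staged_rows;
  t->staged_rows = 0;
}
```

In staged mode nothing becomes visible until `sketch_complete` runs; in direct mode each partition is visible as soon as it is written, which is the trade-off a mode option would let the client pick.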

Streaming/continuous: The API doesn't prevent it; a driver could treat the handle as a long-lived session token. The main value the API provides would be the handle that's produced after a centralized setup (validate target, schema, permissions once). Complete semantics will probably be different in that case: you can either ignore it or treat it as a way to store acked receipts from writers, i.e. some sort of state management. There's also the question of whether that state needs to be exposed to the client and, if so, whether it should be opaque or something more concrete (for example, offsets). In short, I deliberately avoided going down the streaming rabbit hole in this PR, but that can of course be changed.

tokoko · 2026-06-17T20:45:20Z

+                                                   size_t, struct ArrowArrayStream*,
+                                                   struct AdbcSerializableHandle*,
+                                                   struct AdbcError*);
+  AdbcStatusCode (*ConnectionCompleteIngestPartitions)(struct AdbcConnection*,


@lidavidm I think we need to distinguish between 3 possible groups of status codes of a Complete call:

success: ADBC_STATUS_OK

retryable: Complete failed, but a client should retry a complete call w/o restaging the data or regenerating receipts. The example is delta/iceberg concurrency conflict error.

terminal: something else went wrong. start from scratch.

I couldn't really map retryable status to any existing status code. I'm thinking of adding either ADBC_STATUS_CONFLICT or ADBC_STATUS_RETRY. wdyt?

Maybe just an out parameter to indicate whether it was success-or-retryable? A new status code would be a big change

Let me push back a little. An out param feels clean for the C interface, but replicating it in the language APIs will produce a patchwork of different solutions: depending on the language, it would have to go either into exceptions or into an additional output in the signature. Btw, what makes a new status code a big change? Would something break? Are there code paths that rely on exhaustive checks?

Another alternative is to put some additional metadata inside an Error message. That's probably abuse, but also could work.

I hear you, but my worry is that a new error code could be in principle returned from any API function, which would break existing users.

If we say this code is only used for this particular API function, then I think it makes more sense for an out parameter than an error code (which has a non-local impact).

Also, language bindings could use different strategies, e.g. Rust would return a Result<PartitionedIngestStatus> or something similar rather than an out parameter, and deal with the messiness at the FFI layer. This is already the case for various APIs (e.g. Java treats bulk ingest itself differently).
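One possible reading of the out-parameter idea, sketched against a simulated backend. Everything here is an assumption made for illustration: `SkStatus`, `SkBackend`, `sk_complete`, and `sk_complete_with_retry` are hypothetical names, and the real signature of the Complete call is whatever the spec ends up defining.

```c
/* Hypothetical two-value status for this sketch; not the ADBC status codes. */
typedef enum { SK_OK = 0, SK_ERROR = 1 } SkStatus;

/* Simulated backend that reports a commit conflict a fixed number of times. */
typedef struct {
  int conflicts_remaining; /* attempts that will still hit a conflict */
  int committed;           /* 1 once the ingest has been finalized */
} SkBackend;

/* Hypothetical Complete. On failure, *retryable tells the caller whether the
 * same receipts may be resubmitted without restaging any data. */
SkStatus sk_complete(SkBackend* b, int* retryable) {
  *retryable = 0;
  if (b->committed) {
    return SK_ERROR; /* terminal: already finalized */
  }
  if (b->conflicts_remaining > 0) {
    b->conflicts_remaining--;
    *retryable = 1; /* e.g. a Delta/Iceberg concurrency conflict */
    return SK_ERROR;
  }
  b->committed = 1;
  return SK_OK;
}

/* Coordinator-side loop: retry only while the failure is marked retryable.
 * Returns the number of attempts used on success, or -1 on failure. */
int sk_complete_with_retry(SkBackend* b, int max_attempts) {
  for (int attempt = 1; attempt <= max_attempts; attempt++) {
    int retryable = 0;
    if (sk_complete(b, &retryable) == SK_OK) {
      return attempt;
    }
    if (!retryable) {
      return -1; /* terminal failure: start from scratch */
    }
  }
  return -1; /* still conflicting after max_attempts */
}
```

The set of status codes stays unchanged in this shape, so callers of other API functions would not see a new code; a language binding is free to fold the flag into its own result or exception type.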

lidavidm added this to the ADBC API Specification 1.2.0 milestone May 20, 2026

lidavidm reviewed May 20, 2026

CurtHagenlocher reviewed May 21, 2026

tokoko force-pushed the partitioned-ingest branch from 95d027c to 310e138 Compare May 29, 2026 04:11

tokoko changed the base branch from main to spec-1.2.0 May 29, 2026 04:11

feat(format): partitioned bulk ingest spec and C API surface

c264c82

Adds the spec document and adbc.h declarations for partitioned bulk ingest — the write-side mirror of ExecutePartitions/ReadPartition.

tokoko force-pushed the partitioned-ingest branch from 310e138 to c264c82 Compare June 2, 2026 20:41

tokoko changed the title ~~Partitioned ingest API~~ feat(format): partitioned bulk ingest API Jun 2, 2026

switch to AdbcSerializableHandle

3cf4587

tokoko marked this pull request as ready for review June 15, 2026 07:49

tokoko mentioned this pull request Jun 15, 2026

[FEATURE REQUEST] Add ADBC (Arrow Database Connectivity) Data Source apache/spark#54603

Open

address pr comments

16a9c4b

tokoko requested a review from zeroshade as a code owner June 15, 2026 19:42
