Skip to content

feat(format): partitioned bulk ingest API#4317

Open
tokoko wants to merge 3 commits into
apache:spec-1.2.0from
tokoko:partitioned-ingest
Open

feat(format): partitioned bulk ingest API#4317
tokoko wants to merge 3 commits into
apache:spec-1.2.0from
tokoko:partitioned-ingest

Conversation

@tokoko

@tokoko tokoko commented May 17, 2026

Copy link
Copy Markdown
Contributor

demo PR for a new partitioned ingest API.

@lidavidm lidavidm added this to the ADBC API Specification 1.2.0 milestone May 20, 2026
@lidavidm

Copy link
Copy Markdown
Member

Do you want to target this against the spec-1.2.0 branch instead?

Comment thread c/include/arrow-adbc/adbc.h Outdated
Comment thread c/include/arrow-adbc/adbc.h Outdated
/// write; that happens at Commit (commit it) or Abort (drop it).
///
/// \since ADBC API revision 1.2.0
struct AdbcIngestReceipt {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to explain the receipt/handle explicitly here instead of leaving it implicit from the below definitions.

Additionally I think the comments could generally be cleaned up.

Comment thread c/include/arrow-adbc/adbc.h Outdated
Comment on lines +2198 to +2199
/// Abort is best-effort. If cleanup is incomplete, the driver
/// returns a warning status and orphaned storage may remain; it is

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a "warning status"?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this would leverage ConnectionSetWarningHandler?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant some sort of an "okish" adbc status, but warning handler sounds like a more appropriate solution. unless you think it's not necessary at all.. just documenting it as best-effort could be enough as well.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the question is, in what sorts of scenarios would it fail? And would it be possible/sensible to retry? Should we try to provide some structured information on what wasn't cleaned up so the user/an operator can try to do it manually? (In which case maybe a warning handler isn't sufficient?)

Comment thread c/include/arrow-adbc/adbc.h Outdated
Comment on lines +2200 to +2201
/// the driver's responsibility to provide housekeeping (e.g. TTL,
/// background GC, or documented manual cleanup). Callers may also

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds more like the semantics of the backend system

@CurtHagenlocher CurtHagenlocher left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial thoughts.

Comment thread c/include/arrow-adbc/adbc.h Outdated
Comment thread c/include/arrow-adbc/adbc.h Outdated
Comment on lines +2198 to +2199
/// Abort is best-effort. If cleanup is incomplete, the driver
/// returns a warning status and orphaned storage may remain; it is

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this would leverage ConnectionSetWarningHandler?

Comment thread c/include/arrow-adbc/adbc.h Outdated
Comment on lines +2202 to +2203
/// call Abort if the coordinator crashed and was restarted without
/// the original receipts.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth distinguishing the case of "Abort an unknown handle" from the case of "something went wrong during Abort" with a specific error code for the former?

Comment thread c/include/arrow-adbc/adbc.h Outdated
Comment thread c/include/arrow-adbc/adbc.h
@tokoko tokoko force-pushed the partitioned-ingest branch from 95d027c to 310e138 Compare May 29, 2026 04:11
@tokoko tokoko changed the base branch from main to spec-1.2.0 May 29, 2026 04:11
@tokoko

tokoko commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

@lidavidm should I leave only the spec changes in the PR against spec branch? or postgres impl as well?

@lidavidm

Copy link
Copy Markdown
Member

Impl is fine as well, but the branch seems to have picked up a lot of extra commits

Adds the spec document and adbc.h declarations for partitioned bulk
ingest — the write-side mirror of ExecutePartitions/ReadPartition.
@tokoko tokoko force-pushed the partitioned-ingest branch from 310e138 to c264c82 Compare June 2, 2026 20:41
@tokoko tokoko changed the title Partitioned ingest API feat(format): partitioned bulk ingest API Jun 2, 2026
@tokoko

tokoko commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

some of the core cpp adbc manager codebase differs between main and spec-1.2.0. so I left only spec changes here. kept the impl in another branch on my repo.

@tokoko tokoko requested a review from zeroshade as a code owner June 15, 2026 19:42
@lidavidm

Copy link
Copy Markdown
Member

How do you think this API would map to https://www.databricks.com/blog/ingesting-milky-way-petabyte-scale-zerobus-ingest?

@tokoko

tokoko commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@lidavidm I took a look at the sdk. I think the answer depends on whether we're looking at some sort of a bounded batch scenario or pure continuous streaming.

Batch: The main problem here is that the API currently targets a scenario where writers are used to stage the data and then a coordinator "commits" during a Complete call. For iceberg/delta, that's a natural abstraction, for databases you can work around it by staging to a temp table and doing a hopefully inexpensive swap at the end. Having said that, there's no reason a driver can't implement it w/o a staging table dance and simply start writing to a target table in which case a Complete call is kind of a no-op, the data has already been materialized. maybe we should also have an option to allow a client to configure which mode they want the driver to work in? wdyt?

So, a ZeroBus driver would implement this the same way as any other driver, either stage writes somewhere (temp Delta path, staging topic, etc.), then atomically promote them on Complete (not sure how easy that is in dbx unity) or start writing to the target table directly and Complete becomes something of a no-op. Same staging-and-swap pattern as an RDBMS driver, just with different internals.

Streaming/continuous: The API doesn't prevent it — a driver could treat the handle as a long-lived session token. The main value the API provides would be the handle that's produced after a centralized setup (validate target, schema, permissions once). Complete semantics will probably be different in that case. you can either ignore it or treat it as some way to store acked receipts from writers, some sort of a state management. There's also a question whether that state needs to be exposed to the client as well and whether the exposed state should be opaque or something more concrete (for example, offsets). In short, I deliberately avoided going down the streaming rabbithole in this PR, but that can be changed of course.

size_t, struct ArrowArrayStream*,
struct AdbcSerializableHandle*,
struct AdbcError*);
AdbcStatusCode (*ConnectionCompleteIngestPartitions)(struct AdbcConnection*,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm I think we need to distinguish between 3 possible groups of status codes of a Complete call:

  • success: ADBC_STATUS_OK
  • retryable: Complete failed, but a client should retry a complete call w/o restaging the data or regenerating receipts. The example is delta/iceberg concurrency conflict error.
  • terminal: something else went wrong. start from scratch.

I couldn't really map retryable status to any existing status code. I'm thinking of adding either ADBC_STATUS_CONFLICT or ADBC_STATUS_RETRY. wdyt?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just an out parameter to indicate whether it was success-or-retryable? A new status code would be a big change

@tokoko tokoko Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me push back a little. out param feels clean for C interface, but replicating the same for language apis will be a patchwork of different solutions, they would have to either go into exceptions or have additional output in the signature depending on the language. btw, what makes a new status code a big change, would something break? are there code paths that rely on exhaustive checks?

Another alternative is to put some additional metadata inside an Error message. That's probably abuse, but also could work.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear you, but my worry is that a new error code could be in principle returned from any API function, which would break existing users.

If we say this code is only used for this particular API function, then I think it makes more sense for an out parameter than an error code (which has a non-local impact).

Also language bindings could use different strategies, e.g. Rust would return an Result<PartitionedIngestStatus> or something and not an out parameter, and deal with the messiness at the FFI layer. This is already the case for various APIs (e.g. Java treats bulk ingest itself differently).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants