Add optional native Lance scan support #4633

Draft
wirybeaver wants to merge 6 commits into apache:main from wirybeaver:xuanyili/lance

Which issue does this PR close?

Closes #4632.

Rationale for this change

Comet already has a native table-scan path for Iceberg. Lance tables are currently planned and read through Lance Spark. This prototype keeps Lance Spark as the Spark planning contract, then lets an optional Comet contrib reader detect Lance V2 scans, extract a stable descriptor from Lance Spark, and execute the assigned Lance fragments through native Rust Lance APIs.

The Lance Spark side of the descriptor contract is proposed in lance-format/lance-spark#624.

What changes are included in this PR?

  • Adds an opt-in contrib-lance Maven profile and Rust contrib-lance feature.
  • Adds a small reflection-only Lance bridge in Comet core so default builds do not depend on Lance Spark.
  • Adds spark.comet.scan.lanceNative.enabled, disabled by default.
  • Extends scan planning to detect Lance BatchScanExec plans and delegate to contrib-lance when present and enabled.
  • Adds typed native proto support with lance_scan = 118 and split-mode payloads.
  • Adds Scala contrib serialization/execution classes for Lance native scans.
  • Adds Rust native LanceScanExec using the Rust Lance API for dataset open, fragment selection, projection, filter SQL, limit/offset, batch size, and record batch streaming.
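The reflection-only bridge and the opt-in flag from the list above can be illustrated with a minimal Java sketch. It is not the code in this PR: the class `LanceBridgeSketch`, its methods, and the placeholder contrib class name are hypothetical, and the configuration is modeled as a plain `Map` rather than Spark's conf API. Only the key `spark.comet.scan.lanceNative.enabled` and its disabled-by-default behavior come from the description.

```java
import java.util.Map;

/** Hypothetical sketch of a reflection-only, opt-in gate; not Comet's actual bridge. */
final class LanceBridgeSketch {
    // Config key taken from the PR description; it is disabled by default.
    static final String ENABLED_KEY = "spark.comet.scan.lanceNative.enabled";

    // Placeholder name (assumption): the PR text does not name the contrib entry point.
    static final String CONTRIB_CLASS = "org.example.contrib.lance.LanceScanSupport";

    private LanceBridgeSketch() {}

    /** Returns true if the named class can be loaded, without a compile-time dependency on it. */
    static boolean isClassPresent(String className) {
        try {
            // initialize=false probes for presence without running static initializers.
            Class.forName(className, false, LanceBridgeSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Delegates to the native path only when the flag is set AND the contrib class is on the classpath. */
    static boolean shouldUseNativeLanceScan(Map<String, String> conf, String contribClass) {
        boolean enabled = Boolean.parseBoolean(conf.getOrDefault(ENABLED_KEY, "false"));
        return enabled && isClassPresent(contribClass);
    }
}
```

In this sketch a default build, where the contrib class is absent, falls through to the existing scan path even if the flag is set, and a build with the contrib module still does nothing until the flag is enabled.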

This is intentionally a draft prototype. The minimal v1 scope covers ordinary Lance table reads only. Index/search reads, namespace-backed credential refresh, metadata/version columns, aggregation pushdown, and production CI coverage are left for future phases.

Known blocker before this can be merge-ready: packaged Comet currently contains org.apache.arrow.c classes rewritten against Comet's shaded Arrow allocator, while Lance Spark expects the normal Arrow C Data ABI. A packaged Spark smoke test with both jars on the classpath exposes this conflict. We need an explicit Arrow C Data packaging/classloader strategy for Comet + Lance Spark before merging a production-ready native Lance reader.
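When investigating a classpath conflict like this one, it can help to check which jar actually supplies a given class at runtime. The Java sketch below uses only standard JDK reflection calls; the class `ClassOriginSketch` and its method are hypothetical helpers, and `org.apache.arrow.c.Data` in the usage comment is just one example of a class from the org.apache.arrow.c package, not something the PR singles out.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic helper: reports where a class was loaded from. */
final class ClassOriginSketch {
    private ClassOriginSketch() {}

    /** Returns the jar or directory URL that supplied the class, or a marker string. */
    static String originOf(String className) {
        final Class<?> cls;
        try {
            // initialize=false: locate the class without running its static initializers.
            cls = Class.forName(className, false, ClassOriginSketch.class.getClassLoader());
        } catch (ClassNotFoundException | LinkageError e) {
            return "<not on classpath>";
        }
        // Classes from the bootstrap loader (e.g. java.lang.String) have no code source.
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "<bootstrap or unknown>";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) {
        // Example usage (assumed class name): run with both jars on the classpath
        // to see whether the Comet jar or the Arrow C Data jar wins.
        System.out.println(originOf("org.apache.arrow.c.Data"));
    }
}
```

If the printed location is the packaged Comet jar, the shaded variant of the class is shadowing the one Lance Spark expects, which matches the conflict described above.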

How are these changes tested?

Passed:

  • ~/.cargo/bin/cargo check -p datafusion-comet --no-default-features
  • ~/.cargo/bin/cargo check -p datafusion-comet --no-default-features --features contrib-lance
  • ./mvnw test -Dtest=none -Dsuites="org.apache.comet.rules.CometScanRuleSuite" -Pspark-4.1,contrib-lance -Dscalastyle.skip=true
  • ./mvnw package -DskipTests -Pspark-4.1,contrib-lance -Dscalastyle.skip=true

Smoke test attempted:

  • source ~/uvenv/common/bin/activate && python /home/user/draft/comet_lance_native_smoke.py

The smoke test writes and reads a local Lance dataset, but packaged Comet + Lance Spark currently fails at runtime with the Arrow C Data ABI/classpath conflict described above. The draft PR keeps that blocker visible for design review instead of hiding it behind unit-only coverage.
