feat: opt concat into codegen dispatch for non-UTF8_BINARY collations #4640

Open
andygrove wants to merge 1 commit into apache:main from andygrove:feat/codegen-dispatch-concat


@andygrove (Member)

Which issue does this PR close?

Part of #4596 (the concat candidate, the last one on the list).

Rationale for this change

CometConcat reports Incompatible when any child uses a non-default (non-UTF8_BINARY) collation. Spark 4.0+ widens concat to accept collated strings and preserves the collation in the result type, but the native concat UDF always produces UTF8_BINARY and so loses it. With allowIncompatible unset, that causes the whole projection to fall back to Spark. Concat has a real Spark doGenCode and string input/output types, so it is eligible for the CodegenDispatchFallback path: the Incompatible collated case is routed through the JVM codegen dispatcher (Spark's own doGenCode running inside the Comet pipeline), so it stays native and matches Spark.
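The routing described above can be pictured with a small toy model. This is a sketch in Python, not Comet's Scala code: `SupportLevel`, `concat_support_level`, `plan_concat`, and the returned labels are hypothetical names chosen for illustration, and the model only covers the case where allowIncompatible is unset.

```python
from enum import Enum

# Hypothetical stand-ins for the three support levels named in the PR text.
class SupportLevel(Enum):
    COMPATIBLE = "compatible"
    INCOMPATIBLE = "incompatible"
    UNSUPPORTED = "unsupported"

DEFAULT_COLLATION = "UTF8_BINARY"

def concat_support_level(children):
    """Classify a concat call.

    `children` is a list of (type_name, collation) pairs; collation is None
    for non-string types. This mirrors the prose only, not Comet's real API.
    """
    # Binary/array children: the native concat does not handle them at all.
    if any(type_name != "string" for type_name, _ in children):
        return SupportLevel.UNSUPPORTED
    # Any collated string child: native concat would drop the collation.
    if any(collation != DEFAULT_COLLATION for _, collation in children):
        return SupportLevel.INCOMPATIBLE
    return SupportLevel.COMPATIBLE

def plan_concat(children, codegen_dispatch=True):
    """Pick an execution path, assuming allowIncompatible is unset."""
    level = concat_support_level(children)
    if level is SupportLevel.COMPATIBLE:
        return "native"
    if level is SupportLevel.INCOMPATIBLE and codegen_dispatch:
        # Assumed behavior after this PR: Spark's doGenCode runs inside
        # the Comet pipeline instead of the projection falling back.
        return "codegen-dispatch"
    return "spark-fallback"

if __name__ == "__main__":
    collated = [("string", "UTF8_LCASE"), ("string", "UTF8_BINARY")]
    print(plan_concat(collated, codegen_dispatch=False))
    print(plan_concat(collated, codegen_dispatch=True))
```

Under these assumptions, the `codegen_dispatch=False` call models the behavior before this change (the collated case falls back to Spark) and the `codegen_dispatch=True` call models the behavior after it. Non-string children fall back in both cases.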

What changes are included in this PR?

  • CometConcat mixes in CodegenDispatchFallback. The Unsupported non-string-input case (binary/array children) is unchanged and still falls back, and default-collation concat is unaffected (still Compatible, native).

How are these changes tested?

The existing string/collation.sql (Spark 4.0+) previously asserted expect_fallback(concat does not support non-UTF8_BINARY collations) for collated concat. Those two assertions are replaced with query, so they now assert native execution matching Spark for both a UTF8_LCASE and a UNICODE_CI collated concat. The file was run with CometSqlFileTestSuite and passes.

CometConcat reports Incompatible when a child uses a non-default collation, because
the native concat UDF produces UTF8_BINARY and loses the collation. Mixing in
CodegenDispatchFallback routes that case through the JVM codegen dispatcher (Spark's
own doGenCode) so collated concat runs natively and matches Spark instead of falling
back. The Unsupported non-string-input case (binary/array children) is unchanged.

Part of apache#4596.
@andygrove added this to the 0.17.0 milestone on Jun 12, 2026