feat: opt concat into codegen dispatch for non-UTF8_BINARY collations by andygrove · Pull Request #4640 · apache/datafusion-comet

andygrove · 2026-06-12T13:43:07Z

Which issue does this PR close?

Part of #4596 (the concat candidate, the last one on the list).

Rationale for this change

CometConcat reports Incompatible when any child uses a non-default (non-UTF8_BINARY) collation. Spark 4.0+ widens concat to accept collated strings and preserves the collation in the result type, but the native concat UDF always produces UTF8_BINARY and loses it. With allowIncompatible unset, that makes the whole projection fall back to Spark. Concat has a real Spark doGenCode and string input/output types, so it is eligible for the CodegenDispatchFallback path. This PR routes the Incompatible collated case through the JVM codegen dispatcher (Spark's own doGenCode inside the Comet pipeline) so it stays native and matches Spark.

What changes are included in this PR?

CometConcat mixes in CodegenDispatchFallback. The Unsupported non-string-input case (binary/array children) is unchanged and still falls back, and default-collation concat is unaffected (still Compatible, native).
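
To make the three outcomes concrete, here is a toy model of the routing decision described above. It is only a sketch in Python, not Comet's Scala code: the names `Support`, `Route`, `support_level`, and `route` are hypothetical, children are modelled as plain `(type_name, collation)` pairs, and only the case where allowIncompatible is unset is covered.

```python
from enum import Enum

DEFAULT_COLLATION = "UTF8_BINARY"


class Support(Enum):
    COMPATIBLE = "Compatible"
    INCOMPATIBLE = "Incompatible"
    UNSUPPORTED = "Unsupported"


class Route(Enum):
    NATIVE_UDF = "native concat UDF"
    CODEGEN_DISPATCH = "JVM codegen dispatcher inside the Comet pipeline"
    SPARK_FALLBACK = "fall back to Spark"


def support_level(children):
    """Classify a concat call from its children.

    children: list of (type_name, collation) pairs (hypothetical shape);
    collation is None for non-string children such as binary or array.
    """
    # Non-string inputs (binary/array children) are Unsupported.
    if any(type_name != "string" for type_name, _ in children):
        return Support.UNSUPPORTED
    # Any non-default collation makes the native UDF Incompatible,
    # because it would produce UTF8_BINARY and lose the collation.
    if any(collation != DEFAULT_COLLATION for _, collation in children):
        return Support.INCOMPATIBLE
    return Support.COMPATIBLE


def route(children, codegen_dispatch_fallback):
    """Pick an execution route, assuming allowIncompatible is unset."""
    level = support_level(children)
    if level is Support.COMPATIBLE:
        return Route.NATIVE_UDF
    if level is Support.INCOMPATIBLE and codegen_dispatch_fallback:
        # With the mixin, the collated case goes through Spark's own doGenCode.
        return Route.CODEGEN_DISPATCH
    # Unsupported inputs, or Incompatible without the mixin, fall back.
    return Route.SPARK_FALLBACK


collated = [("string", "UTF8_LCASE"), ("string", "UTF8_LCASE")]
print(route(collated, codegen_dispatch_fallback=False).value)
print(route(collated, codegen_dispatch_fallback=True).value)
```

In this model, turning on `codegen_dispatch_fallback` changes only the Incompatible row. Default-collation and non-string inputs take the same route either way, which matches the scope stated above.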

How are these changes tested?

The existing string/collation.sql (Spark 4.0+) already asserted expect_fallback(concat does not support non-UTF8_BINARY collations) for collated concat. Those two assertions are replaced with query, so they now assert native execution matching Spark for both a UTF8_LCASE and a UNICODE_CI collated concat. The file passes when run with CometSqlFileTestSuite.

CometConcat reports Incompatible when a child uses a non-default collation, because the native concat UDF produces UTF8_BINARY and loses the collation. Mixing in CodegenDispatchFallback routes that case through the JVM codegen dispatcher (Spark's own doGenCode) so collated concat runs natively and matches Spark instead of falling back. The Unsupported non-string-input case (binary/array children) is unchanged. Part of apache#4596.

andygrove added this to the 0.17.0 milestone Jun 12, 2026

andygrove wants to merge 1 commit into apache:main from andygrove:feat/codegen-dispatch-concat
