feat: opt concat into codegen dispatch for non-UTF8_BINARY collations#4640
Open
andygrove wants to merge 1 commit into
Open
feat: opt concat into codegen dispatch for non-UTF8_BINARY collations#4640andygrove wants to merge 1 commit into
andygrove wants to merge 1 commit into
Conversation
CometConcat reports Incompatible when a child uses a non-default collation, because the native concat UDF produces UTF8_BINARY and loses the collation. Mixing in CodegenDispatchFallback routes that case through the JVM codegen dispatcher (Spark's own doGenCode) so collated concat runs natively and matches Spark instead of falling back. The Unsupported non-string-input case (binary/array children) is unchanged. Part of apache#4596.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Part of #4596 (the
concatcandidate, the last one on the list).Rationale for this change
CometConcatreportsIncompatiblewhen any child uses a non-default (non-UTF8_BINARY) collation: Spark 4.0+ widensconcatto accept collated strings and preserves the collation in the result type, but the native concat UDF always produces UTF8_BINARY and loses it. WithallowIncompatibleunset that falls the whole projection back to Spark.Concathas a real SparkdoGenCodeand string input/output types, so it is eligible for theCodegenDispatchFallbackpath: route theIncompatiblecollated case through the JVM codegen dispatcher (Spark's owndoGenCodeinside the Comet pipeline) so it stays native and matches Spark.What changes are included in this PR?
CometConcatmixes inCodegenDispatchFallback. TheUnsupportednon-string-input case (binary/array children) is unchanged and still falls back, and default-collation concat is unaffected (stillCompatible, native).How are these changes tested?
The existing
string/collation.sql(Spark 4.0+) already assertedexpect_fallback(concat does not support non-UTF8_BINARY collations)for collatedconcat. Those two assertions are replaced withquery, so they now assert native execution matching Spark for both aUTF8_LCASEand aUNICODE_CIcollatedconcat. Run withCometSqlFileTestSuiteand passing.