fix: route decode through codegen dispatcher to honor Spark 4.0 legacy flags by andygrove · Pull Request #4639 · apache/datafusion-comet

andygrove · 2026-06-12T13:40:40Z

Which issue does this PR close?

Closes #4465.

Rationale for this change

Comet currently lowers decode(bin, charset) to a TRY-mode binary→string cast, which produces the wrong result for invalid byte sequences in every supported Spark version:

Spark version / mode	Spark behavior	Comet behavior (pre-fix)
3.x	Substitutes Unicode replacement char	NULL / undefined
4.0, `legacyErrorAction = true`	Same as 3.x	NULL / undefined
4.0 default (`legacyErrorAction = false`)	Raises `MALFORMED_CHARACTER_CODING`	NULL / undefined

The Spark 4.0 shim destructured the StaticInvoke arguments and discarded both legacyCharsets / legacyErrorAction flags, so neither the substitute nor the strict-error path was reachable.

What changes are included in this PR?

Spark 4.0+ shim (Spark4xCometExprShim) now routes the StringDecode.decode StaticInvoke through CometScalaUDF.emitJvmCodegenDispatch instead of the buggy TRY-cast helper. The dispatcher runs Spark's own doGenCode inside the Comet pipeline, so the captured legacyCharsets / legacyErrorAction literals are honored exactly.
Spark 3.x CometStringDecode collapses to extends CometCodegenDispatch[StringDecode], so the 3.x replacement-character behavior also matches Spark exactly. The per-expression spark.comet.expression.StringDecode.enabled flag still works because the framework checks it before calling convert.
Removes the now-unused CommonStringExprs.stringDecode helper and the CommonStringExprs mixin from the 3.x / 4.x shims.
Updates decode.sql to drop obsolete expect_fallback lines (the dispatcher handles non-UTF-8 charsets too) and adds two new fixtures:
- decode_invalid_utf8.sql — replacement-character behavior on Spark 3.x default and Spark 4.0+ with both legacy flags enabled.
- decode_invalid_utf8_strict.sql — Spark 4.0+ default strict mode with expect_error(MALFORMED_CHARACTER_CODING), including a sentinel valid-input query so the assertion does not pass vacuously through an operator-level fallback.

How are these changes tested?

mvn compile clean on Spark 3.4 / 3.5 / 4.0 / 4.1 (spotless + scalastyle clean).
CometSqlFileTestSuite decode passes on Spark 3.4, 3.5, and 4.0; the strict-mode fixture is skipped on 3.x via MinSparkVersion: 4.0 and runs on 4.0.
Existing CometStringDecodeSuite (Spark 3.x only) still passes, confirming the per-expression enable flag continues to work after the refactor.

…yCharsets / legacyErrorAction Comet's existing lowering of `decode(bin, charset)` to a TRY-mode binary→string cast produced wrong results on invalid byte sequences across all supported Spark versions. The Spark 4.0 shim destructured the StaticInvoke arguments and discarded both legacy-mode flags before falling back into the same helper, so neither the substitute nor the strict-error behavior was reachable. Replace the native lowering with the codegen dispatcher: Spark's own doGenCode runs inside the Comet pipeline and honors both flags exactly, matching Spark for valid input, replacement-character substitution, and the Spark 4.0 default strict-mode error. Adds invalid-UTF-8 SQL fixtures for the legacy and strict modes. Closes apache#4465

- Register decode in the existing staticInvokeExpressions map (statics.scala) instead of carving a special case into Spark4xCometExprShim, deleting the bespoke 17-line StaticInvoke arm and its imports. The map already routes aesEncrypt/aesDecrypt through CometStaticInvokeCodegenDispatch; decode joins the same pattern. - Trim the CometStringDecode scaladoc from a 7-line history of the bug to a 2-line summary plus a apache#4465 link. - Drop the stale reference to a non-existent decode_invalid_utf8_legacy.sql in the decode.sql header. - Fix incorrect UTF-8 byte semantics in decode_invalid_utf8.sql comments (0xFF is not a continuation byte; 0xE9 is a valid 3-byte lead byte that is just missing its continuation bytes) and drop the redundant case-only-different `'UTF-8'` query.

mbutrovich

Nice! Thanks @andygrove!

andygrove added this to the 0.17.0 milestone Jun 12, 2026

mbutrovich approved these changes Jun 12, 2026

View reviewed changes

Merge branch 'main' into fix-decode-legacy-charsets-4465

14ce2fc

andygrove merged commit 7b279f8 into apache:main Jun 12, 2026
69 checks passed

andygrove deleted the fix-decode-legacy-charsets-4465 branch June 12, 2026 23:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: route decode through codegen dispatcher to honor Spark 4.0 legacy flags#4639

fix: route decode through codegen dispatcher to honor Spark 4.0 legacy flags#4639
andygrove merged 3 commits into
apache:mainfrom
andygrove:fix-decode-legacy-charsets-4465

andygrove commented Jun 12, 2026

Uh oh!

mbutrovich left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

andygrove commented Jun 12, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

mbutrovich left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants