Skip to content

fix: route decode through codegen dispatcher to honor Spark 4.0 legacy flags#4639

Merged
andygrove merged 3 commits into
apache:mainfrom
andygrove:fix-decode-legacy-charsets-4465
Jun 12, 2026
Merged

fix: route decode through codegen dispatcher to honor Spark 4.0 legacy flags#4639
andygrove merged 3 commits into
apache:mainfrom
andygrove:fix-decode-legacy-charsets-4465

Conversation

@andygrove

Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #4465.

Rationale for this change

Comet currently lowers decode(bin, charset) to a TRY-mode binary→string cast, which produces the wrong result for invalid byte sequences in every supported Spark version:

Spark version / mode Spark behavior Comet behavior (pre-fix)
3.x Substitutes Unicode replacement char NULL / undefined
4.0, legacyErrorAction = true Same as 3.x NULL / undefined
4.0 default (legacyErrorAction = false) Raises MALFORMED_CHARACTER_CODING NULL / undefined

The Spark 4.0 shim destructured the StaticInvoke arguments and discarded both legacyCharsets / legacyErrorAction flags, so neither the substitute nor the strict-error path was reachable.

What changes are included in this PR?

  • Spark 4.0+ shim (Spark4xCometExprShim) now routes the StringDecode.decode StaticInvoke through CometScalaUDF.emitJvmCodegenDispatch instead of the buggy TRY-cast helper. The dispatcher runs Spark's own doGenCode inside the Comet pipeline, so the captured legacyCharsets / legacyErrorAction literals are honored exactly.
  • Spark 3.x CometStringDecode collapses to extends CometCodegenDispatch[StringDecode], so the 3.x replacement-character behavior also matches Spark exactly. The per-expression spark.comet.expression.StringDecode.enabled flag still works because the framework checks it before calling convert.
  • Removes the now-unused CommonStringExprs.stringDecode helper and the CommonStringExprs mixin from the 3.x / 4.x shims.
  • Updates decode.sql to drop obsolete expect_fallback lines (the dispatcher handles non-UTF-8 charsets too) and adds two new fixtures:
    • decode_invalid_utf8.sql — replacement-character behavior on Spark 3.x default and Spark 4.0+ with both legacy flags enabled.
    • decode_invalid_utf8_strict.sql — Spark 4.0+ default strict mode with expect_error(MALFORMED_CHARACTER_CODING), including a sentinel valid-input query so the assertion does not pass vacuously through an operator-level fallback.

How are these changes tested?

  • mvn compile clean on Spark 3.4 / 3.5 / 4.0 / 4.1 (spotless + scalastyle clean).
  • CometSqlFileTestSuite decode passes on Spark 3.4, 3.5, and 4.0; the strict-mode fixture is skipped on 3.x via MinSparkVersion: 4.0 and runs on 4.0.
  • Existing CometStringDecodeSuite (Spark 3.x only) still passes, confirming the per-expression enable flag continues to work after the refactor.

…yCharsets / legacyErrorAction

Comet's existing lowering of `decode(bin, charset)` to a TRY-mode binary→string cast produced
wrong results on invalid byte sequences across all supported Spark versions. The Spark 4.0
shim destructured the StaticInvoke arguments and discarded both legacy-mode flags before
falling back into the same helper, so neither the substitute nor the strict-error behavior
was reachable. Replace the native lowering with the codegen dispatcher: Spark's own
doGenCode runs inside the Comet pipeline and honors both flags exactly, matching Spark for
valid input, replacement-character substitution, and the Spark 4.0 default strict-mode
error. Adds invalid-UTF-8 SQL fixtures for the legacy and strict modes.

Closes apache#4465
@andygrove andygrove added this to the 0.17.0 milestone Jun 12, 2026
- Register decode in the existing staticInvokeExpressions map (statics.scala)
  instead of carving a special case into Spark4xCometExprShim, deleting the
  bespoke 17-line StaticInvoke arm and its imports. The map already routes
  aesEncrypt/aesDecrypt through CometStaticInvokeCodegenDispatch; decode
  joins the same pattern.
- Trim the CometStringDecode scaladoc from a 7-line history of the bug to a
  2-line summary plus a apache#4465 link.
- Drop the stale reference to a non-existent decode_invalid_utf8_legacy.sql
  in the decode.sql header.
- Fix incorrect UTF-8 byte semantics in decode_invalid_utf8.sql comments
  (0xFF is not a continuation byte; 0xE9 is a valid 3-byte lead byte that
  is just missing its continuation bytes) and drop the redundant
  case-only-different `'UTF-8'` query.

@mbutrovich mbutrovich left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks @andygrove!

@andygrove andygrove merged commit 7b279f8 into apache:main Jun 12, 2026
69 checks passed
@andygrove andygrove deleted the fix-decode-legacy-charsets-4465 branch June 12, 2026 23:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] decode ignores Spark 4.0 legacyCharsets and legacyErrorAction flags

2 participants