Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 27 additions & 0 deletions docs/source/user-guide/latest/compatibility/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,3 +30,30 @@ This guide documents areas where Comet's behavior is known to differ from Spark.
- **Expressions**: per-expression compatibility notes, including cast.
- **JSON**: choosing between the native and Spark-compatible engines for JSON expressions.
- **Spark versions**: version-specific known issues and limitations.

## Native and codegen-dispatch implementations

Some Spark expressions have two implementations in Comet:

- A **codegen-dispatch** implementation that runs Spark's own generated code for the
expression inside Comet's native pipeline (via the Arrow-direct codegen dispatcher). This
produces byte-exact Spark results at the cost of one JNI round-trip per batch. It is gated
globally by `spark.comet.exec.scalaUDF.codegen.enabled` (enabled by default); when the
dispatcher is disabled, these expressions fall back to Spark.
- A **native** (Rust / DataFusion) implementation that is faster, with no JNI overhead, but
has known semantic differences from Spark for some inputs or patterns.

Because the codegen-dispatch path matches Spark exactly, Comet uses it by **default**. The
faster native path is **opt-in per expression** via that expression's
`spark.comet.expression.<ExprClassName>.allowIncompatible=true` flag, which declares that you
accept its differences from Spark. There is no global opt-in. When the native path is enabled
but a specific input or pattern has no native implementation, Comet routes that case back
through the codegen dispatcher rather than running something incompatible.

This is the model behind the [regular expression](regex.md) and [JSON](json.md) families,
which document their per-expression configs and the specific differences to expect.

This is distinct from expressions that have **no** codegen-dispatch path: there, the
incompatible cases fall back to Spark by default, and `allowIncompatible=true` runs the native
(incompatible) path instead. `cast` is the main example; see the
[expression reference](../expressions.md) for which expressions have incompatible cases.
10 changes: 7 additions & 3 deletions docs/source/user-guide/latest/expressions.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,10 +26,14 @@ transparently falls back to Spark for that part of the plan; results are unaffec

Expressions marked ✅ Supported are enabled by default and produce Spark-compatible results.

Some ✅ Supported expressions have specific incompatible cases that fall back to Spark by
default. Those cases must be opted into per expression with
Some ✅ Supported expressions have specific incompatible cases that are not run by default.
Those cases must be opted into per expression with
`spark.comet.expression.EXPRNAME.allowIncompatible=true` (where `EXPRNAME` is the Spark
expression class name, for example `Cast`). There is no global opt-in.
expression class name, for example `Cast`). There is no global opt-in. By default such a case
either falls back to Spark (for example `cast`) or, when the expression has a Spark-compatible
codegen-dispatch implementation, runs through that instead (for example the regex and JSON
families). See [Native and codegen-dispatch implementations](compatibility/index.md#native-and-codegen-dispatch-implementations)
for how Comet chooses.

Most expressions can also be disabled with `spark.comet.expression.EXPRNAME.enabled=false`, where
`EXPRNAME` is the Spark expression class name (for example `Length` or `StartsWith`). See the
Expand Down