Describe the bug
Several Comet array/map expressions crash natively when the input comes through CometLocalTableScanExec with a non-nullable child field (array element, or map key/value). An in-memory Seq[Int] / Map[Int, Int] column encodes with containsNull=false / valueContainsNull=false, and the local scan carries that non-null child into the DataFusion kernel, which disagrees with Comet's planned output type. Reproduced locally (Spark 4.0) on two kernels:
- spark_array_slice (slice): Assertion failed: result_data_type == *expected_type: Function 'spark_array_slice' returned value of type 'List(non-null Int32)' while the following type was promised at planning time and expected: 'List(Int32)'.
- ListArray/Struct build (map_entries): InvalidArgumentError("ListArray expected data type Struct("key": non-null Int32, "value": Int32) got Struct("key": non-null Int32, "value": non-null Int32) for "item"") (panics in datafusion-functions-nested map_entries.rs:126)
A third signature appeared in CI (Spark 4.1) on ArrayInsert (array prepend, SPARK-41233) and is very likely the same root cause, though it did not reproduce locally with a scalar array_insert on Spark 4.0: Type mismatch in ArrayInsert: array type is List(Field { data_type: Int32 }) but item type is List(Field { data_type: Int32, nullable: true }).
It does NOT reproduce over a native Parquet scan (which normalizes children to nullable) or with a SQL array(...)/map(...) literal (literal children are nullable), so it is specific to the local scan path.
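The non-nullable child types described above can be checked directly from the DataFrame schema. The following is a minimal sketch, assuming a plain local SparkSession with spark-sql on the classpath; the object name `LocalSchemaSketch` and the helper `inferredTypes` are hypothetical and not part of Comet or Spark.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, DataType, IntegerType, MapType}

// Hypothetical helper object, for illustration only.
object LocalSchemaSketch {

  // Returns the data types Spark infers for in-memory Seq[Int] and Map[Int, Int] columns.
  def inferredTypes(spark: SparkSession): (DataType, DataType) = {
    import spark.implicits._
    val arrays = Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("x")
    val maps = Seq(Map(1 -> 100, 2 -> 200)).toDF("m")
    (arrays.schema("x").dataType, maps.schema("m").dataType)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("local-schema-sketch")
      .getOrCreate()
    try {
      val (arrayType, mapType) = inferredTypes(spark)
      // Scala Int is a primitive, so the encoder marks array elements
      // and map values as non-nullable.
      assert(arrayType == ArrayType(IntegerType, containsNull = false))
      assert(mapType == MapType(IntegerType, IntegerType, valueContainsNull = false))
      println(arrayType)
      println(mapType)
    } finally {
      spark.stop()
    }
  }
}
```

This only confirms the Spark-side schema; it does not exercise Comet or the native kernels.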
Steps to reproduce
ConvertToLocalRelation must be disabled; otherwise the optimizer folds the expression over the LocalRelation at plan time and nothing executes natively. Add to a suite extending CometTestBase:
import testImplicits._

private def withLocalTableScanNoFold(f: => Unit): Unit = {
  withSQLConf(
    CometConf.COMET_EXEC_LOCAL_TABLE_SCAN_ENABLED.key -> "true",
    "spark.sql.optimizer.excludedRules" ->
      "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation") {
    f
  }
}

test("slice on non-null element array") {
  withLocalTableScanNoFold {
    val df = Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("x")
    checkSparkAnswerAndOperator(df.selectExpr("slice(x, 2, 2)"))
  }
}

test("map_entries on non-null value map") {
  withLocalTableScanNoFold {
    val df = Seq(Map(1 -> 100, 2 -> 200)).toDF("m")
    checkSparkAnswerAndOperator(df.selectExpr("map_entries(m)"))
  }
}
Failure (slice):
org.apache.comet.CometNativeException: Assertion failed: result_data_type == *expected_type: Function 'spark_array_slice' returned value of type 'List(non-null Int32)' while the following type was promised at planning time and expected: 'List(Int32)'.
at org.apache.comet.Native.executePlan(Native Method)
at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$2(CometExecIterator.scala:155)
Failure (map_entries):
org.apache.comet.CometNativeException: called `Result::unwrap()` on an `Err` value: InvalidArgumentError("ListArray expected data type Struct(\"key\": non-null Int32, \"value\": Int32) got Struct(\"key\": non-null Int32, \"value\": non-null Int32) for \"item\"")
at datafusion_functions_nested::map_entries::map_entries_inner (map_entries.rs:126)
at <arrow_array::array::list_array::GenericListArray<i32>>::new (list_array.rs:272)
Expected behavior
Same results as Spark; no native crash. The local-scan child-field nullability should be reconciled with the kernel's expected type (by normalizing one side), consistent with how the Parquet scan path already works.
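One way to picture "normalize one side" is a recursive pass that widens every nested child to nullable, similar in spirit to what the Parquet scan path is described as doing. The sketch below is an illustration on Spark's public `DataType` classes only; the object `NullabilitySketch` and the function `widenChildren` are hypothetical names, and the actual fix may belong on the native side or in the scan's Arrow schema instead.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper object, for illustration only; not Comet's implementation.
object NullabilitySketch {

  // Recursively marks array elements, map values, and struct fields as nullable.
  // Map keys have no nullability flag in Spark's MapType, so only the key's
  // own nested children are widened.
  def widenChildren(dt: DataType): DataType = dt match {
    case ArrayType(elementType, _) =>
      ArrayType(widenChildren(elementType), containsNull = true)
    case MapType(keyType, valueType, _) =>
      MapType(widenChildren(keyType), widenChildren(valueType), valueContainsNull = true)
    case StructType(fields) =>
      StructType(fields.map { f =>
        f.copy(dataType = widenChildren(f.dataType), nullable = true)
      })
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val localArray = ArrayType(IntegerType, containsNull = false)
    val localMap = MapType(IntegerType, IntegerType, valueContainsNull = false)
    assert(widenChildren(localArray) == ArrayType(IntegerType, containsNull = true))
    assert(widenChildren(localMap) ==
      MapType(IntegerType, IntegerType, valueContainsNull = true))
  }
}
```

Widening is the safe direction here: treating a non-nullable child as nullable never misrepresents the data, whereas narrowing a nullable child would.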
Additional context
Specific to CometLocalTableScanExec, so it directly affects enabling the local scan by default (#4393). Upstream tests: DataFrameFunctionsSuite "slice function", "array_insert functions", "SPARK-41233: array prepend", "map_entries", "map with arrays".