bug: array/map kernels crash on non-null child field via CometLocalTableScanExec #4789

Description

@mbutrovich

Describe the bug

Several Comet array/map expressions crash natively when the input comes through CometLocalTableScanExec with a non-nullable child field (array element, or map key/value). An in-memory Seq[Int] / Map[Int, Int] column encodes with containsNull=false / valueContainsNull=false, and the local scan carries that non-null child into the DataFusion kernel, which disagrees with Comet's planned output type. Reproduced locally (Spark 4.0) on two kernels:

  • spark_array_slice (slice): Assertion failed: result_data_type == *expected_type: Function 'spark_array_slice' returned value of type 'List(non-null Int32)' while the following type was promised at planning time and expected: 'List(Int32)'.
  • ListArray/Struct build (map_entries): InvalidArgumentError("ListArray expected data type Struct("key": non-null Int32, "value": Int32) got Struct("key": non-null Int32, "value": non-null Int32) for "item"") (panics in datafusion-functions-nested map_entries.rs:126)

A third signature appeared in CI (Spark 4.1) on ArrayInsert (array prepend, SPARK-41233): Type mismatch in ArrayInsert: array type is List(Field { data_type: Int32 }) but item type is List(Field { data_type: Int32, nullable: true }). It is very likely the same root cause, though it did not reproduce locally with a scalar array_insert on Spark 4.0.

It does NOT reproduce over a native Parquet scan (which normalizes children to nullable) or with a SQL array(...)/map(...) literal (literal children are nullable), so it is specific to the local scan path.
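
For reference, the child nullability described above can be inspected on the Spark side without Comet. This is a minimal sketch, assuming a plain local SparkSession; the object name ChildNullabilityCheck is hypothetical and not part of Comet or its test suite. It only prints and asserts the flags that Spark's encoders derive for Seq[Int] and Map[Int, Int] columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, MapType}

// Hypothetical helper object, for illustration only.
object ChildNullabilityCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("child-nullability-check")
      .getOrCreate()
    import spark.implicits._

    // Same in-memory inputs as the reproduction in this issue.
    val arr = Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("x")
    val map = Seq(Map(1 -> 100, 2 -> 200)).toDF("m")

    val arrType = arr.schema("x").dataType.asInstanceOf[ArrayType]
    val mapType = map.schema("m").dataType.asInstanceOf[MapType]

    // Primitive Int elements/values cannot be null, so the encoder
    // marks the child fields as non-nullable.
    println(s"array containsNull    = ${arrType.containsNull}")
    println(s"map valueContainsNull = ${mapType.valueContainsNull}")
    assert(!arrType.containsNull)
    assert(!mapType.valueContainsNull)

    spark.stop()
  }
}
```

These are the flags that the local scan carries into the native kernels.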

Steps to reproduce

ConvertToLocalRelation must be disabled; otherwise the optimizer folds the expression over the LocalRelation at plan time and nothing executes natively. Add to a suite extending CometTestBase:

import testImplicits._

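// Enables the Comet local table scan and excludes ConvertToLocalRelation so the
// expression is evaluated natively instead of being folded at plan time.
private def withLocalTableScanNoFold(f: => Unit): Unit = {
  withSQLConf(
    CometConf.COMET_EXEC_LOCAL_TABLE_SCAN_ENABLED.key -> "true",
    "spark.sql.optimizer.excludedRules" ->
      "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation") {
    f
  }
}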

test("slice on non-null element array") {
  withLocalTableScanNoFold {
    val df = Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("x")
    checkSparkAnswerAndOperator(df.selectExpr("slice(x, 2, 2)"))
  }
}

test("map_entries on non-null value map") {
  withLocalTableScanNoFold {
    val df = Seq(Map(1 -> 100, 2 -> 200)).toDF("m")
    checkSparkAnswerAndOperator(df.selectExpr("map_entries(m)"))
  }
}

Failure (slice):

org.apache.comet.CometNativeException: Assertion failed: result_data_type == *expected_type: Function 'spark_array_slice' returned value of type 'List(non-null Int32)' while the following type was promised at planning time and expected: 'List(Int32)'.
    at org.apache.comet.Native.executePlan(Native Method)
    at org.apache.comet.CometExecIterator.$anonfun$getNextBatch$2(CometExecIterator.scala:155)

Failure (map_entries):

org.apache.comet.CometNativeException: called `Result::unwrap()` on an `Err` value: InvalidArgumentError("ListArray expected data type Struct(\"key\": non-null Int32, \"value\": Int32) got Struct(\"key\": non-null Int32, \"value\": non-null Int32) for \"item\"")
    at datafusion_functions_nested::map_entries::map_entries_inner (map_entries.rs:126)
    at <arrow_array::array::list_array::GenericListArray<i32>>::new (list_array.rs:272)

Expected behavior

Same results as Spark; no native crash. The local-scan child-field nullability should be reconciled with the kernel's expected type (by normalizing one side), consistent with how the Parquet scan path already works.
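
One way to picture "normalize one side" is to relax every array element and map value to nullable before the type reaches the kernel. The sketch below is an assumption about what such a normalization could look like on Spark types, not Comet's actual fix; the object NullableChildren and its relax method are hypothetical names. Map keys are left alone because Spark's MapType has no key-nullability flag, and struct fields keep their own nullable flag.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, for illustration only.
object NullableChildren {
  // Recursively mark array elements and map values as nullable.
  def relax(dt: DataType): DataType = dt match {
    case ArrayType(elementType, _) =>
      ArrayType(relax(elementType), containsNull = true)
    case MapType(keyType, valueType, _) =>
      MapType(relax(keyType), relax(valueType), valueContainsNull = true)
    case StructType(fields) =>
      // Recurse into nested types; the field's own nullability is unchanged.
      StructType(fields.map(f => f.copy(dataType = relax(f.dataType))))
    case other => other
  }
}
```

Under this sketch, ArrayType(IntegerType, containsNull = false) becomes ArrayType(IntegerType, containsNull = true), which matches the List(Int32) type that the slice kernel was promised at planning time. Whether the normalization belongs in the local scan or in the planned output type is left open here.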

Additional context

Specific to CometLocalTableScanExec, so this directly gates enabling it by default (#4393). Upstream tests: DataFrameFunctionsSuite "slice function", "array_insert functions", "SPARK-41233: array prepend", "map_entries", "map with arrays".
