
bug: flatten drops nulls and returns wrong results when a sub-array is null #4788

Description

@mbutrovich

Describe the bug

flatten on an array<array<T>> column returns wrong results and silently drops nulls when a row's outer array contains a null sub-array. Spark returns null for any such row; Comet returns a non-null, misaligned array. This is silent data corruption, not a crash.

Steps to reproduce

Add to a suite extending CometTestBase:

test("flatten with null sub-array") {
  val data = Seq(
    Tuple1(Seq(Seq(1, 2, 3), Seq(4, 5))),
    Tuple1(Seq[Seq[Int]](Seq(1), null)),   // Spark: flatten -> null
    Tuple1(Seq[Seq[Int]](null, null)))     // Spark: flatten -> null
  withParquetTable(data, "t") {
    checkSparkAnswerAndOperator("SELECT flatten(_1) FROM t")
  }
}

== Results ==
!== Spark Answer - 3 ==           == Comet Answer - 3 ==
 struct<flatten(_1):array<int>>   struct<flatten(_1):array<int>>
![List(1, 2, 3, 4, 5)]            [List()]
![null]                           [List(1)]
![null]                           [List(1, 2, 3, 4, 5)]

Expected behavior

Match Spark: a null sub-array makes flatten return null for that row.
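
For reference, the expected null semantics can be sketched in plain Scala with no Spark or Comet dependency. This is only an illustration of the rule stated above, not Spark's or Comet's implementation: `FlattenNullSemantics` and `flattenLikeSpark` are hypothetical names, and `None` stands in for SQL NULL.

```scala
object FlattenNullSemantics {
  // Hypothetical helper (not a Spark or Comet API).
  // Returns None (standing in for SQL NULL) when the outer array is null
  // or when any sub-array is null; otherwise concatenates the sub-arrays.
  def flattenLikeSpark[T](outer: Seq[Seq[T]]): Option[Seq[T]] =
    if (outer == null || outer.exists(_ == null)) None
    else Some(outer.flatten)

  def main(args: Array[String]): Unit = {
    // Same three rows as the repro test.
    val rows = Seq(
      Seq(Seq(1, 2, 3), Seq(4, 5)),
      Seq[Seq[Int]](Seq(1), null),
      Seq[Seq[Int]](null, null))

    rows.map(r => flattenLikeSpark(r)).foreach(println)
    // Some(List(1, 2, 3, 4, 5))
    // None
    // None
  }
}
```

The three printed values line up with the "Spark Answer" column in the results above: one flattened array followed by two nulls.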

Additional context

Found while enabling CometLocalTableScanExec by default (#4393), but reproduces over a plain Parquet scan. Upstream test: DataFrameFunctionsSuite "flatten function".
