Skip to content

DataFusion: date_part over a Dictionary(Timestamp) returns a type that violates its declared Int32 (crashes GROUP BY/JOIN) #221

Description

@alxmrs

Tracking issue for an upstream DataFusion bug (to file against apache/datafusion). The second of two blockers for dictionary-encoding coordinate columns (#217), and the one that breaks core climate SQL (diurnal climatology, anomaly).

Summary

date_part('hour', t) where t is Dictionary(Int32, Timestamp) preserves the dictionary encoding and returns a Dictionary(Int32, Int32), but the expression's declared logical type is plain Int32. The physical/logical mismatch surfaces as soon as the column is materialized:

Arrow error: Invalid argument error: column types must match schema types,
expected Int32 but found Dictionary(Int32, Int32) at column index 0

A GROUP BY / JOIN on the same expression is worse — it panics a worker thread:

thread 'tokio-rt-worker' panicked at arrow-array-58.3.0/src/cast.rs:849: primitive array

(a failed downcast of the dictionary array to a primitive array). GROUP BY date_part('hour', time) is exactly how you write a diurnal climatology, so this is not an edge case for the domain.

Minimal repro

Pure datafusion + pyarrow, no xarray-sql, no network: repros/datafusion/date_part_dictionary_panic.py (branch claude/datafusion-upstream-repros-fs1bqv; paste-in Rust test in the README).

# t is Dictionary(Int32, Timestamp)
ctx.sql("SELECT date_part('hour', t) AS h FROM x")             # -> expected Int32 but found Dictionary(Int32, Int32)
ctx.sql("SELECT date_part('hour', t), COUNT(*) FROM x GROUP BY 1")  # -> panics ("primitive array")

Expected

date_part (and temporal scalar functions generally) over a dictionary-encoded input should return a plain Int32 (decode the dictionary), or declare the Dictionary(Int32, Int32) it actually produces — logical and physical types must agree, and a downstream consumer must never panic on the result.

Environment

datafusion 54.0.0 / pyarrow 23.0.0 (arrow-rs 58.3.0).

Impact on xarray-sql

Blocks dictionary-encoding the time coordinate (#217). Because the time axis is where climate SQL applies date_part/date_trunc/extract most, encoding it breaks climatology and anomaly queries (benchmarks 02_climatology and 04_anomaly). This is the bug that made #217 unworkable without an upstream fix.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions