Tracking issue for an upstream DataFusion bug (to file against apache/datafusion). The second of two blockers for dictionary-encoding coordinate columns (#217), and the one that breaks core climate SQL (diurnal climatology, anomaly).
Summary
date_part('hour', t) where t is Dictionary(Int32, Timestamp) preserves the dictionary encoding and returns a Dictionary(Int32, Int32), but the expression's declared logical type is plain Int32. The physical/logical mismatch surfaces as soon as the column is materialized:
Arrow error: Invalid argument error: column types must match schema types,
expected Int32 but found Dictionary(Int32, Int32) at column index 0
A GROUP BY / JOIN on the same expression is worse — it panics a worker thread:
thread 'tokio-rt-worker' panicked at arrow-array-58.3.0/src/cast.rs:849: primitive array
(a failed downcast of the dictionary array to a primitive array). GROUP BY date_part('hour', time) is exactly how you write a diurnal climatology, so this is not an edge case for the domain.
Minimal repro
Pure datafusion + pyarrow, no xarray-sql, no network: repros/datafusion/date_part_dictionary_panic.py (branch claude/datafusion-upstream-repros-fs1bqv; paste-in Rust test in the README).
# t is Dictionary(Int32, Timestamp)
ctx.sql("SELECT date_part('hour', t) AS h FROM x") # -> expected Int32 but found Dictionary(Int32, Int32)
ctx.sql("SELECT date_part('hour', t), COUNT(*) FROM x GROUP BY 1") # -> panics ("primitive array")
Expected
date_part (and temporal scalar functions generally) over a dictionary-encoded input should return a plain Int32 (decode the dictionary), or declare the Dictionary(Int32, Int32) it actually produces — logical and physical types must agree, and a downstream consumer must never panic on the result.
Environment
datafusion 54.0.0 / pyarrow 23.0.0 (arrow-rs 58.3.0).
Impact on xarray-sql
Blocks dictionary-encoding the time coordinate (#217). Because the time axis is where climate SQL applies date_part/date_trunc/extract most, encoding it breaks climatology and anomaly queries (benchmarks 02_climatology and 04_anomaly). This is the bug that made #217 unworkable without an upstream fix.
Tracking issue for an upstream DataFusion bug (to file against
apache/datafusion). The second of two blockers for dictionary-encoding coordinate columns (#217), and the one that breaks core climate SQL (diurnal climatology, anomaly).Summary
date_part('hour', t)wheretisDictionary(Int32, Timestamp)preserves the dictionary encoding and returns aDictionary(Int32, Int32), but the expression's declared logical type is plainInt32. The physical/logical mismatch surfaces as soon as the column is materialized:A
GROUP BY/JOINon the same expression is worse — it panics a worker thread:(a failed downcast of the dictionary array to a primitive array).
GROUP BY date_part('hour', time)is exactly how you write a diurnal climatology, so this is not an edge case for the domain.Minimal repro
Pure
datafusion+pyarrow, no xarray-sql, no network:repros/datafusion/date_part_dictionary_panic.py(branchclaude/datafusion-upstream-repros-fs1bqv; paste-in Rust test in the README).Expected
date_part(and temporal scalar functions generally) over a dictionary-encoded input should return a plainInt32(decode the dictionary), or declare theDictionary(Int32, Int32)it actually produces — logical and physical types must agree, and a downstream consumer must never panic on the result.Environment
datafusion 54.0.0 / pyarrow 23.0.0 (arrow-rs 58.3.0).
Impact on xarray-sql
Blocks dictionary-encoding the time coordinate (#217). Because the time axis is where climate SQL applies
date_part/date_trunc/extractmost, encoding it breaks climatology and anomaly queries (benchmarks02_climatologyand04_anomaly). This is the bug that made #217 unworkable without an upstream fix.