Skip to content

feat: migrate common/data/columnar module#24

Open
lxy-9602 wants to merge 3 commits into
apache:mainfrom
lxy-9602:add-columnar-data
Open

feat: migrate common/data/columnar module#24
lxy-9602 wants to merge 3 commits into
apache:mainfrom
lxy-9602:add-columnar-data

Conversation

@lxy-9602
Copy link
Copy Markdown
Contributor

Purpose

No Linked issue.

Migrate common/data/columnar/ module. This module provides columnar data access layer built on top of Apache Arrow arrays, including:

  • ColumnarArray: Access vector column data for array types
  • ColumnarMap: Access vector column data for map types
  • ColumnarRow: Row view over arrow::Array with per-row field access
  • ColumnarRowRef: Lightweight row view sharing batch-level context to reduce per-row overhead
  • ColumnarUtils: Utility functions for reading Arrow arrays (including dictionary-encoded arrays)
  • ColumnarBatchContext: Shared batch context struct for ColumnarRowRef

Also migrates the dependency DateTimeUtils (timestamp conversion utilities).

Tests

  • columnar_utils_test.cpp — Tests GetView/GetBytes for plain and dictionary-encoded string arrays
  • columnar_array_test.cpp — Tests all primitive types, complex/nested types (struct, list, map, decimal, timestamp), and null handling
  • columnar_row_test.cpp — Tests ColumnarRow and ColumnarRowRef for simple types, complex/nested types, dictionary types, null handling, data lifecycle, and binary access

API and Format

Documentation

Generative AI tooling

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. The non-owning ColumnarArray constructor is clearer now, but I found two remaining correctness blockers in nested row lifetime and decimal conversion.

std::shared_ptr<InternalRow> ColumnarRow::GetRow(int32_t pos, int32_t num_fields) const {
auto struct_array = arrow::internal::checked_cast<const arrow::StructArray*>(array_vec_[pos]);
assert(struct_array);
return std::make_shared<ColumnarRow>(struct_array->fields(), pool_, row_id_);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This nested row still drops the ownership chain. The returned ColumnarRow stores raw child Array pointers, but it is constructed without a holder. If the parent ColumnarRow is the only object owning the top-level StructArray, a nested row returned from GetRow can outlive the parent and then its raw child pointers dangle. The current DataLifeCycle test only uses result_row while the parent row is alive, so it does not catch this. Please keep an owning holder here, for example by passing a shared_ptr/sliced StructArray or by returning ColumnarRowRef with a ColumnarBatchContext that owns struct_array->fields().

assert(array);
arrow::Decimal128 decimal(array->GetValue(offset_ + pos));
return Decimal(precision, scale,
static_cast<Decimal::int128_t>(decimal.high_bits()) << 64 | decimal.low_bits());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid reconstructing the signed int128 value by left-shifting a signed high word. For negative Decimal128 values, decimal.high_bits() is negative, and left-shifting a negative signed integer is undefined behavior in C++. This same expression is also used in ColumnarRow and ColumnarRowRef. Please combine the bits through an unsigned 128-bit value first, then cast to Decimal::int128_t, and add a negative decimal test.

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did another full pass and found one additional correctness issue beyond the two existing blockers.

auto value_type_id = dict_type->value_type()->id();
auto index_type_id = dict_type->index_type()->id();
int64_t dict_index = -1;
if (index_type_id == arrow::Type::type::INT32) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only decodes dictionary indices for INT32 and INT64, but Arrow dictionary indices are not limited to these two widths (for example int8/int16 dictionaries are valid and commonly used for small dictionaries). For those arrays dict_index stays -1; in release builds the assert is compiled out and the code then calls dictionary->GetView(-1), which can read before the offsets buffer or crash. Please handle all supported integer index types (or return a proper error instead of relying on assert).

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. The decimal conversion and dictionary index-width issues look fixed now, but the nested row lifetime blocker is still present in the latest commit. In src/paimon/common/data/columnar/columnar_row.cpp, ColumnarRow::GetRow still returns a nested ColumnarRow built from struct_array->fields() without an owning holder. The new comment documents the caller-side lifetime requirement, but it does not fix the unsafe case introduced by the owning ColumnarRow constructor: if the parent row is the only object keeping the top-level StructArray alive, a nested row returned here can outlive that parent and then its raw child-array pointers dangle. The current TestDataLifeCycle still accesses result_row only while row is alive, so it does not exercise the failure. Please keep the ownership chain for escaped nested rows, e.g. return a row/ref that owns the child arrays via a shared context, or otherwise holds a shared owner/sliced struct array.

@leaves12138 leaves12138 dismissed their stale review May 28, 2026 08:53

Dismissed after clarification: ColumnarRow::GetRow is intentionally non-owning as a hot-path performance optimization, and the caller-side lifetime requirement is now documented. This is not a blocker under that API contract.

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying the intended ownership contract. Given that ColumnarRow::GetRow is intentionally a non-owning hot-path view and the caller is required to keep the parent data alive, the remaining lifetime point should not block this PR. The previous decimal conversion and dictionary index-width issues are fixed in the latest commit. Looks good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants