GH-46179: [Python] Bump index level once if pandas df already contains __index_level_i__ column #46884

Open
AlenkaF wants to merge 2 commits into apache:main from AlenkaF:gh-46179-duplicates-index-levels

Member

@AlenkaF AlenkaF commented Jun 23, 2025

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions

⚠️ GitHub issue #46179 has been automatically assigned in GitHub to PR creator.

@AlenkaF AlenkaF changed the title from "GH-46179: Bump index level once if pandas df already contains __index_level_i__ column" to "GH-46179: [Python] Bump index level once if pandas df already contains __index_level_i__ column" Jun 23, 2025
@AlenkaF AlenkaF force-pushed the gh-46179-duplicates-index-levels branch from c915159 to 3ee4599 Compare May 25, 2026 14:34
Copilot AI review requested due to automatic review settings May 25, 2026 14:34
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses GH-46179 in PyArrow’s pandas conversion by avoiding duplicate Arrow field names when a pandas DataFrame already contains __index_level_i__ columns, ensuring generated index columns use a non-conflicting name.

Changes:

  • Update generated index column naming to pick the next available __index_level_{j}__ name if the default collides with existing columns.
  • Ensure uniqueness across both DataFrame columns and previously generated index columns when multiple index levels are serialized.
  • Add regression tests for single-index and MultiIndex cases where __index_level_0__ already exists as a DataFrame column.
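
The naming rule described in the list above can be sketched in plain Python. This is only an illustration and not the code in `pandas_compat.py`: the helper names `index_level_name` and `assign_index_names` are hypothetical, and the sketch assumes that every generated name is reserved before the next index level is named.

```python
def index_level_name(i, taken):
    """Return the first free '__index_level_{j}__' name with j >= i.

    Hypothetical helper: `taken` is any container of names already in use.
    """
    j = i
    while f"__index_level_{j:d}__" in taken:
        j += 1
    return f"__index_level_{j:d}__"


def assign_index_names(n_levels, column_names):
    """Name `n_levels` unnamed index levels without colliding with columns.

    Hypothetical helper: names already handed out are added to `taken`,
    so later levels cannot reuse them either.
    """
    taken = set(column_names)
    names = []
    for i in range(n_levels):
        name = index_level_name(i, taken)
        taken.add(name)
        names.append(name)
    return names


# A DataFrame that already has a column called '__index_level_0__'.
single = assign_index_names(1, ["a", "__index_level_0__"])
multi = assign_index_names(2, ["a", "__index_level_0__"])
print(single)  # ['__index_level_1__']
print(multi)   # ['__index_level_1__', '__index_level_2__']
```

Without a conflicting column the sketch falls back to the usual `__index_level_0__`, `__index_level_1__`, ... sequence.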

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| python/pyarrow/pandas_compat.py | Adjusts index-level name generation to avoid collisions with existing column names and previously assigned index column names. |
| python/pyarrow/tests/test_pandas.py | Updates existing metadata assertion and adds new regression tests validating the bumped index column names. |

Comment on lines +381 to +384
j = i
while f'__index_level_{j:d}__' in column_names:
    j += 1
return f'__index_level_{j:d}__'
Member Author

Isn't schema-based conversion already buggy without this change when it comes to the index levels? It probably silently ignores the duplicated level 0 currently?

Member Author

OK, getting used to this :) Copilot can't answer. Well, I think the suggested change could be a follow-up if we see that it is needed, but I do not think it is in the scope of this PR.

@github-actions github-actions Bot added the "awaiting committer review" label and removed the "awaiting review" label May 26, 2026
@AlenkaF AlenkaF marked this pull request as ready for review May 26, 2026 09:28
@AlenkaF AlenkaF requested review from raulcd and rok as code owners May 26, 2026 09:28
@AlenkaF
Member Author

AlenkaF commented May 26, 2026

@jorisvandenbossche what do you think of the proposed change in this PR?
