Skip to content

fix(cli): pin UTF-8 encoding on init-options and .extensionignore I/O#2686

Open
Quratulain-bilal wants to merge 1 commit into
github:mainfrom
Quratulain-bilal:fix/cli-explicit-utf8-encoding
Open

fix(cli): pin UTF-8 encoding on init-options and .extensionignore I/O#2686
Quratulain-bilal wants to merge 1 commit into
github:mainfrom
Quratulain-bilal:fix/cli-explicit-utf8-encoding

Conversation

@Quratulain-bilal
Copy link
Copy Markdown
Contributor

What

Path.read_text() / Path.write_text() default to the system locale codec, which is cp1252 / gb2312 / cp932 on Windows.
Two user-facing file paths in spec-kit were calling them without an explicit encoding= argument:

  • src/specify_cli/__init__.py:400, 412save_init_options / load_init_options for
    .specify/init-options.json. A peer machine with a different default locale (or a UTF-8 Unix CI runner reading a file
    written on a cp1252 Windows host) cannot decode the file, raising UnicodeDecodeError. UnicodeDecodeError is a
    subclass of ValueError — not OSError / json.JSONDecodeError — so the existing fall-back except tuple in
    load_init_options also misses it and the error propagates raw to the CLI.

  • src/specify_cli/extensions.py:764.extensionignore pattern reader. The very next line already normalises
    backslashes "so Windows-authored files work", proving the codebase expects Windows authors to write this file.
    Multibyte UTF-8 patterns (Chinese filenames, accented directory names) silently mojibake when the host locale is not
    UTF-8, so the patterns fail to match and unintended files are shipped with the extension.

Reproducer

# On a non-UTF-8 Windows host (e.g. cp1252):
from specify_cli import save_init_options, load_init_options
save_init_options(Path("."), {"project_name": "café"})
# File written as cp1252.

# On a UTF-8 host (or another Windows host with a different default codec):
load_init_options(Path("."))
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9...
# — propagates raw because UnicodeDecodeError is a ValueError, not an OSError.

Why this matters

  • The sibling integration-catalog reader at src/specify_cli/integrations/catalog.py:150,156,193,202,374 already pins
    encoding="utf-8" everywhere. PR fix(powershell): ensure UTF-8 templates are written without BOM #2280 (f684305) fixed the symmetric PowerShell-template BOM bug. The two paths in
    this PR were the remaining drifted ones.
  • init-options.json is meant to be a portable record of how a project was scaffolded — a peer cloning the repo on a
    different OS / locale must be able to read it. Today they can't if the original author's project name (or any future
    field) contains non-ASCII.
  • .extensionignore already explicitly supports Windows authors (see line 766). UTF-8 patterns are part of that same
    contract.

The change

  • src/specify_cli/__init__.py — pin encoding="utf-8" on both write_text and read_text; extend the existing
    except tuple in load_init_options to include UnicodeDecodeError so a peer file written in a non-UTF-8 codec falls
    back to {} per the existing contract instead of crashing.
  • src/specify_cli/extensions.py — pin encoding="utf-8" on the .extensionignore reader.

Tests

  • tests/test_presets.py::TestInitOptions — parametrized non-ASCII round-trip (CJK / Latin-1 / Greek / emoji) plus a
    0xe9-byte corrupted-file fallback test.
  • tests/test_extensions.py::TestExtensionIgnore — Japanese (ドキュメント/) and Latin-1 (café/) ignore patterns
    correctly exclude their directories during install.
$ python -m pytest tests/test_presets.py::TestInitOptions tests/test_extensions.py::TestExtensionIgnore -q
22 passed in 4.56s
$ python -m ruff check src/
All checks passed!

Scope

Intentionally narrow: no behaviour change for ASCII content (UTF-8 is a superset). Only non-ASCII content that previously
round-tripped accidentally (when host locale happened to be UTF-8) or silently mojibaked (when it wasn't) now
round-trips reliably across all hosts.

``Path.read_text`` / ``Path.write_text`` default to the system locale
codec, which is cp1252 / gb2312 / cp932 on Windows. Two user-facing
file paths in spec-kit were calling them without an explicit
``encoding=`` argument:

  - ``src/specify_cli/__init__.py:400,412`` —
    ``save_init_options`` / ``load_init_options`` for
    ``.specify/init-options.json``. A peer machine with a different
    default locale (or a UTF-8 Unix CI runner reading a file written on
    a cp1252 Windows host) cannot decode the file, raising
    ``UnicodeDecodeError``. ``UnicodeDecodeError`` is a subclass of
    ``ValueError`` — not ``OSError`` / ``json.JSONDecodeError`` — so
    the existing fall-back ``except`` tuple in ``load_init_options``
    also misses it and the error propagates raw to the CLI.

  - ``src/specify_cli/extensions.py:764`` — ``.extensionignore``
    pattern reader. The very next line already normalises
    backslashes "so Windows-authored files work", proving the codebase
    expects Windows authors to write this file. Multibyte UTF-8
    patterns (Chinese filenames, accented directory names) silently
    mojibake when the host locale is not UTF-8, so the patterns fail
    to match and unintended files are shipped with the extension.

The sibling integration-catalog reader at
``src/specify_cli/integrations/catalog.py:150,156,193,202,374``
already pins ``encoding="utf-8"`` everywhere. PR github#2280 fixed the
symmetric PowerShell-template BOM bug. This change brings the two
remaining drifted paths in line with that precedent.

Regression tests:

  - ``tests/test_presets.py::TestInitOptions`` — parametrized non-ASCII
    round-trip (CJK, Latin-1, Greek, emoji) plus a corrupted-file case
    that asserts the existing "fall back to {}" contract still holds
    when a peer file contains bytes invalid as UTF-8.
  - ``tests/test_extensions.py::TestExtensionIgnore`` — Japanese
    (``ドキュメント/``) and Latin-1 (``café/``) ignore patterns
    correctly exclude their directories during install.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Pins UTF-8 explicitly for a couple of user-authored/portable files to avoid Windows locale-dependent decoding/encoding behavior, and adds regression tests around non-ASCII content.

Changes:

  • Use encoding="utf-8" for .specify/init-options.json read/write and catch UnicodeDecodeError in the loader fallback path.
  • Use encoding="utf-8" when reading .extensionignore so multibyte patterns round-trip across platforms/locales.
  • Add tests covering UTF-8 patterns in .extensionignore and fallback behavior for invalid UTF-8 init-options.
Show a summary per file
File Description
src/specify_cli/__init__.py Pins UTF-8 for init-options I/O; extends error handling to include decode failures.
src/specify_cli/extensions.py Pins UTF-8 when reading .extensionignore patterns.
tests/test_presets.py Adds init-options tests for non-ASCII values and corrupted-byte fallback.
tests/test_extensions.py Adds .extensionignore tests ensuring non-ASCII patterns exclude intended paths.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 4/4 changed files
  • Comments generated: 3

Comment on lines +400 to +408
# Pin UTF-8 explicitly: ``Path.write_text`` defaults to the system
# locale codec, which is cp1252 / gb2312 / cp932 on Windows. A
# locale-encoded write succeeds locally but produces a file a peer
# machine (different locale) or Unix CI cannot decode. The sibling
# integration-catalog writer in ``integrations/catalog.py`` already
# pins ``encoding="utf-8"`` for the same reason.
dest.write_text(
json.dumps(options, indent=2, sort_keys=True), encoding="utf-8"
)
Comment thread tests/test_presets.py
Comment on lines +2272 to +2292
@pytest.mark.parametrize(
"value",
["名前-プロジェクト", "café-résumé", "Ωmega-Δelta", "🚀-launch"],
)
def test_save_load_round_trip_preserves_non_ascii(self, project_dir, value):
"""Non-ASCII values round-trip via explicit UTF-8 encoding.

``Path.write_text`` / ``Path.read_text`` default to the system
locale codec on Windows (cp1252 / gb2312 / cp932). Without
``encoding="utf-8"`` pinned on both ends, a project name like
``café`` written on a UTF-8 host becomes garbled or unreadable on
a cp1252 host (and vice versa). Pin UTF-8 explicitly so init
options round-trip across machines and CI.
"""
from specify_cli import save_init_options, load_init_options

save_init_options(project_dir, {"ai": "claude", "project_name": value})

loaded = load_init_options(project_dir)
assert loaded["project_name"] == value

Comment on lines +764 to +770
# Pin UTF-8 explicitly: ``Path.read_text`` defaults to the system
# locale codec on Windows (cp1252 / gb2312 / cp932), which silently
# corrupts multibyte patterns when the file is shared across
# machines with different locales. The next line already
# normalises backslashes "so Windows-authored files work" — the
# codebase already expects Windows authors to write this file.
lines: List[str] = ignore_file.read_text(encoding="utf-8").splitlines()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants