feat: Unicode LIKE/upper()/lower() via statically-linked ICU#31

Merged: arv merged 5 commits into main from arv/unicode-like-icu (Jun 1, 2026)
@arv arv commented Jun 1, 2026

Why

SQLite's built-in LIKE/upper()/lower() only case-fold ASCII. So any case-insensitive comparison of non-English text is wrong — e.g. zqlite's ILIKE (lower(a) LIKE lower(b)) didn't match Ä/ä, Cyrillic, Greek, etc. The companion mono PR (rocicorp/mono#6095) relies on a Unicode lower() from this package to make ILIKE correct.
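To make the gap concrete, here is a minimal sketch in plain JavaScript. It is not SQLite's code: `asciiLower` and `equalsFolded` are hypothetical helpers that emulate ASCII-only folding and compare it with the Unicode-aware `String.prototype.toLowerCase()`.

```javascript
// Emulates ASCII-only folding: only A-Z map to a-z, every other character is untouched.
function asciiLower(s) {
  return s.replace(/[A-Z]/g, (c) => String.fromCharCode(c.charCodeAt(0) + 32));
}

// Unicode-aware folding, as provided by the JavaScript runtime.
const unicodeLower = (s) => s.toLowerCase();

// Case-insensitive equality under a given folding function.
function equalsFolded(a, b, fold) {
  return fold(a) === fold(b);
}

console.log(equalsFolded('ABC', 'abc', asciiLower));        // plain ASCII folds fine
console.log(equalsFolded('Ä', 'ä', asciiLower));            // ASCII-only folding misses this pair
console.log(equalsFolded('Ä', 'ä', unicodeLower));          // Unicode folding matches it
console.log(equalsFolded('ПРИВЕТ', 'привет', unicodeLower)); // Cyrillic behaves the same way
```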

What

Enable SQLite's bundled ICU extension (SQLITE_ENABLE_ICU). The extension source is already in the committed amalgamation (guarded by #ifdef SQLITE_ENABLE_ICU) and sqlite3IcuInit is in SQLite's built-in auto-init array — so defining the macro auto-registers a Unicode-aware LIKE/upper()/lower()/REGEXP on every connection. No C/C++ glue, just build config.

Static linking (the important part)

ICU is linked statically so the prebuilt binaries stay self-contained. This is required, not optional:

  • zero-cache ships on node:22-alpine and installs this package via prebuild-install (fetches a prebuilt .node; it does not compile). The runtime image has no ICU.
  • Today's prebuilds are self-contained (SQLite is statically compiled in).
  • A dynamic NEEDED libicui18n.so.<ver> would fail to load on that Alpine image, and would couple every binary to one ICU soname (ICU bumps it each release). This is why better-sqlite3 upstream doesn't enable ICU.

Static linking embeds ICU into the binary instead. Verified locally on macOS: otool -L shows no libicu dependency, and Ä LIKE ä, Cyrillic, Greek, and lower('Ä') → 'ä' all work.
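The self-containment check can be scripted. The sketch below is an illustration, not part of this PR: `findDynamicIcuDeps` is a hypothetical helper that scans the text printed by `otool -L` (macOS) or `ldd` (Linux) for `libicu*` shared libraries, and the sample outputs are made up for the example.

```javascript
// Extract ICU shared libraries from `otool -L` (macOS) or `ldd` (Linux) output.
// An empty result means the binary has no dynamic libicu dependency.
function findDynamicIcuDeps(toolOutput) {
  const icuLib = /libicu[a-z0-9]+(\.\d+)*\.(so|dylib)/;
  return toolOutput
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => icuLib.test(line))
    .map((line) => line.split(/\s+/)[0]); // first token is the library name or path
}

// Hypothetical sample outputs, for illustration only.
const selfContained = [
  'addon.node:',
  '\t/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1700.255.0)',
  '\t/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1345.100.2)',
].join('\n');

const dynamicallyLinked = [
  '\tlibicui18n.so.72 => /usr/lib/x86_64-linux-gnu/libicui18n.so.72 (0x00007f0000000000)',
  '\tlibicuuc.so.72 => /usr/lib/x86_64-linux-gnu/libicuuc.so.72 (0x00007f0000100000)',
  '\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0000200000)',
].join('\n');

console.log(findDynamicIcuDeps(selfContained));
console.log(findDynamicIcuDeps(dynamicallyLinked));
```

In a real check the input would come from running the tool against the built `.node` file, for example via `child_process.execFileSync`.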

New helper deps/icu.js discovers ICU (pkg-config → Homebrew → system paths; ICU_ROOT override) and emits the static-archive paths plus the C++/system libs ICU's archives need. CI installs the static packages:

| Platform | Package | Ships .a? |
|----------|---------|-----------|
| Debian (linux-x64/arm prebuilds) | libicu-dev | yes: `/usr/lib/<triplet>/libicu*.a` |
| Alpine (prod) | icu-dev + icu-static | yes |
| macOS | Homebrew icu4c | yes |
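The discovery order described above could look roughly like the following. This is a sketch under assumptions, not the contents of deps/icu.js: the function names, the exact probe commands, the treatment of `ICU_ROOT` as an exclusive override, and the `/usr/local` and `/usr` prefixes are all illustrative. The header check mirrors the idea that a location only counts if `unicode/utypes.h` exists there.

```javascript
const fs = require('node:fs');
const path = require('node:path');
const { execFileSync } = require('node:child_process');

// Run a command and return trimmed stdout, or null if it is missing or fails.
function tryExec(cmd, args) {
  try {
    return execFileSync(cmd, args, { encoding: 'utf8', stdio: ['ignore', 'pipe', 'ignore'] }).trim();
  } catch {
    return null;
  }
}

// A candidate is usable only if the ICU headers are actually present.
function hasHeaders(includeDir) {
  return Boolean(includeDir) && fs.existsSync(path.join(includeDir, 'unicode', 'utypes.h'));
}

const fromPrefix = (source, prefix) => ({
  source,
  include: path.join(prefix, 'include'),
  lib: path.join(prefix, 'lib'),
});

// Candidate locations in priority order (assumed): ICU_ROOT, pkg-config, Homebrew, system paths.
function candidates(env) {
  if (env.ICU_ROOT) return [fromPrefix('ICU_ROOT', env.ICU_ROOT)];
  const list = [];
  const include = tryExec('pkg-config', ['--variable=includedir', 'icu-i18n']);
  const lib = tryExec('pkg-config', ['--variable=libdir', 'icu-i18n']);
  if (include && lib) list.push({ source: 'pkg-config', include, lib });
  const brewPrefix = tryExec('brew', ['--prefix', 'icu4c']);
  if (brewPrefix) list.push(fromPrefix('homebrew', brewPrefix));
  for (const prefix of ['/usr/local', '/usr']) list.push(fromPrefix('system', prefix));
  return list;
}

// Return the first candidate with real headers, or fail with a clear message.
function locateIcu(env = process.env) {
  const found = candidates(env).find((c) => hasHeaders(c.include));
  if (!found) {
    throw new Error('ICU headers (unicode/utypes.h) not found; install the ICU dev package or set ICU_ROOT');
  }
  return found;
}

// Static archives in link order: i18n depends on uc, which depends on data.
function staticArchives(libDir) {
  return ['libicui18n.a', 'libicuuc.a', 'libicudata.a'].map((name) => path.join(libDir, name));
}

// Example (not run here): console.log(staticArchives(locateIcu().lib).join(' '));
```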

Windows: deferred (ASCII-only LIKE for now)

ICU is gated on OS != "win" in the gyp files, so Windows builds skip it. Static ICU on Windows means building it from source via vcpkg, which is impractically slow to do per-arch (x64/ia32/arm64) on every PR and release. Production runs on Alpine and dev runs on macOS/Linux, so Windows keeps SQLite's ASCII LIKE until this is followed up (likely vcpkg static-CRT triplets + binary caching). The mono PR degrades gracefully on Windows (ASCII ILIKE, no regression).

Cost

ICU's case/data tables make each prebuilt binary ~30 MB larger (≈3 MB → ≈35 MB). Each user only downloads the one prebuild for their platform.

Testing

  • npm run build-release clean on macOS (arm64); binary is self-contained (otool -L); Unicode LIKE/lower() verified via the built zero_sqlite3 shell.
  • Static .a availability confirmed for Debian libicu-dev and Alpine icu-static.
  • CI (this PR) exercises the Linux + macOS builds across the matrix.

🤖 Generated with Claude Code

SQLite's built-in LIKE/upper()/lower() only case-fold ASCII, so non-English
text compared case-insensitively (e.g. zqlite's ILIKE) behaved incorrectly.

Enable SQLite's bundled ICU extension (SQLITE_ENABLE_ICU), which is already
present in the amalgamation guarded by the macro and auto-registers a
Unicode-aware LIKE/upper()/lower()/REGEXP on every connection.

ICU is linked STATICALLY so the prebuilt binaries stay self-contained:
zero-cache installs them via prebuild-install onto runtime images (e.g.
Alpine) that have no ICU, where a dynamic `NEEDED libicu*.so.<ver>` would
fail to load (and would couple each binary to one ICU soname). The new
deps/icu.js discovers ICU (pkg-config / Homebrew / system paths) and emits
the static archive paths plus the C++/system libs they require. CI installs
the static ICU packages (libicu-dev on Debian, icu-dev+icu-static on Alpine,
icu4c via Homebrew on macOS).

Windows is intentionally excluded for now: static ICU there means building it
from source (vcpkg), which is impractically slow in CI. Windows keeps SQLite's
ASCII-only LIKE until that is addressed; production (zero-cache) runs on
Alpine, and macOS/Linux dev builds get full Unicode.

Cost: ICU's data table makes each binary ~30MB larger.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ICU is gated `OS != "win"` in the gyp files, so icu.js is never invoked on
Windows. Remove the unused Windows branches (.lib archive names, advapi32,
the isWin handling) and note the macOS/Linux-only scope.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR enables SQLite’s bundled ICU extension on non-Windows builds so LIKE, lower(), upper(), and REGEXP become Unicode-aware, while keeping prebuilt binaries self-contained via static ICU linking.

Changes:

  • Enable SQLITE_ENABLE_ICU for non-Windows builds in the SQLite static library build.
  • Add deps/icu.js to discover ICU headers and static archives for node-gyp builds.
  • Update build/link settings and CI dependencies to compile/link ICU on macOS and Linux (Debian/Alpine).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
|------|-------------|
| deps/sqlite3.gyp | Defines SQLITE_ENABLE_ICU (non-Windows) and adds ICU include path for compiling the SQLite amalgamation with the ICU extension enabled. |
| deps/icu.js | Adds ICU discovery for include dir + static archive linker inputs (with platform-specific fallback behavior). |
| binding.gyp | Links ICU libs into the final .node (and the zero_sqlite3 shell where built) on non-Windows. |
| .github/workflows/build.yml | Installs ICU development/static packages needed for builds across macOS/Linux CI and release prebuild jobs. |

arv and others added 2 commits June 1, 2026 16:31
…inking; add Unicode tests

Addresses review feedback on #31:

- locate(): the pkg-config path now requires the ICU headers
  (<includedir>/unicode/utypes.h) to actually exist before trusting the .pc,
  and the system-path discovery picks an include dir that has them. The
  `include` mode fails with a clear message instead of emitting an empty/bogus
  path that would later blow up with a confusing missing-header error.

- libsOutput(): no longer silently falls back to dynamic linking when a static
  archive is missing. Prebuilt binaries must stay self-contained (zero-cache
  runs them on ICU-less images like Alpine), so the build now aborts with an
  actionable message. The dynamic fallback is opt-in via ICU_ALLOW_DYNAMIC=1
  for local development.

- Add test/52.icu.js asserting Unicode behavior on non-Windows (lower('Ä')='ä',
  upper('ß')='SS', 'Ä' LIKE 'ä'=1) and ASCII-only behavior on Windows, guarding
  against future regressions from SQLite updates or build-flag changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Debian/Ubuntu ship libicu*.a built without -fPIC, so they cannot be linked
into a shared object (the .node) — the build failed with "recompile with
-fPIC". Static linking only works where the archives are PIC: macOS (Homebrew)
and Alpine (musl). So:

  * macOS + Alpine  -> static link (self-contained, as before).
  * glibc Linux     -> dynamic link against the distro .so; those consumers
                       must have ICU installed at runtime.

Production (zero-cache on Alpine) keeps a self-contained, statically-linked
binary. ICU_ALLOW_DYNAMIC=1 still forces dynamic everywhere for local dev.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
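
The link policy these commits describe can be summarized as a small decision function. This is a sketch, not the code in deps/icu.js: `chooseIcuLinkMode` and its option names are hypothetical, and how the build detects musl is left as an input because the source does not say.

```javascript
// Decide how ICU should be linked for a given build target.
// Returns 'none' (ICU disabled), 'static', or 'dynamic'; throws if a static link
// is required but the archives are missing.
function chooseIcuLinkMode({ platform, isMusl = false, hasStaticArchives = false, allowDynamic = false }) {
  // ICU is gated off on Windows entirely.
  if (platform === 'win32') return 'none';

  // ICU_ALLOW_DYNAMIC=1 forces dynamic linking everywhere (local development).
  if (allowDynamic) return 'dynamic';

  // Static linking only works where the archives are PIC: macOS (Homebrew) and Alpine (musl).
  const wantsStatic = platform === 'darwin' || (platform === 'linux' && isMusl);
  if (!wantsStatic) return 'dynamic'; // glibc Linux links against the distro .so

  // No silent fallback: a missing archive aborts the build with an actionable message.
  if (!hasStaticArchives) {
    throw new Error('Static ICU archives not found; install them or set ICU_ALLOW_DYNAMIC=1 for local dev');
  }
  return 'static';
}

console.log(chooseIcuLinkMode({ platform: 'linux', isMusl: true, hasStaticArchives: true }));
console.log(chooseIcuLinkMode({ platform: 'linux', isMusl: false, hasStaticArchives: true }));
```

Keeping the failure explicit means a prebuild for an ICU-less image cannot quietly pick up a shared-library dependency.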

@tantaman tantaman left a comment

just need that 1 comment in download.sh

Per review feedback: explain that ICU is defined conditionally (non-Windows)
in deps/sqlite3.gyp rather than in this unconditional, all-platform DEFINES
list.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@arv arv merged commit a0f1435 into main Jun 1, 2026
29 checks passed
@arv arv deleted the arv/unicode-like-icu branch June 1, 2026 15:42
thomasmol pushed a commit to thomasmol/rocicorp-mono that referenced this pull request Jun 2, 2026
…orp#6095)

## Problem

The SQLite replica (`zqlite`) diverged from Postgres on `LIKE`/`ILIKE`,
so query results could differ between the server-side replica, Postgres,
and the in-memory IVM matcher:

- **Postgres `LIKE` is case-sensitive**, but SQLite's `LIKE` operator is
case-*insensitive* by default → plain `LIKE` matched too much.
- **`ILIKE` was rewritten to `LIKE`**, which only case-folds **ASCII** →
non-English `ILIKE` was effectively case-sensitive (e.g. `Ä`/`ä`,
Cyrillic, Greek didn't match).
- **No `ESCAPE` was emitted**, but Postgres and the in-memory IVM
matcher (`zql/src/builder/like.ts`) both treat backslash as the default
escape character.

## Fix

Mirror the in-memory IVM matcher so all three backends (Postgres, IVM,
SQLite replica) agree:

- Enable `PRAGMA case_sensitive_like = ON` on every connection (`db.ts`)
so the bare `LIKE` operator is case-sensitive — matching Postgres
`LIKE`.
- Compile `ILIKE`/`NOT ILIKE` as `lower(a) LIKE lower(b)`, using the
Unicode-aware `lower()` that `@rocicorp/zero-sqlite3` provides via ICU —
matching the `toLowerCase()` the IVM matcher uses.
- Emit an explicit `ESCAPE '\'` for all `LIKE`/`ILIKE` operators.
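
For reference, the semantics all three backends are meant to share can be sketched as a small matcher. This is an illustration only and not the code in `zql/src/builder/like.ts`: `likeToRegExp`, `like`, and `ilike` are hypothetical names, and a trailing backslash is treated as a literal here for simplicity.

```javascript
// Escape a single character so it is matched literally inside a RegExp.
function escapeRegExp(ch) {
  return ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Convert a LIKE pattern into an anchored RegExp:
// `%` matches any run of characters, `_` matches exactly one character,
// and a backslash makes the following character literal.
function likeToRegExp(pattern) {
  let source = '';
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern[i];
    if (ch === '\\' && i + 1 < pattern.length) {
      source += escapeRegExp(pattern[++i]);
    } else if (ch === '%') {
      source += '.*';
    } else if (ch === '_') {
      source += '.';
    } else {
      source += escapeRegExp(ch);
    }
  }
  return new RegExp(`^${source}$`, 'su');
}

// Case-sensitive LIKE.
function like(value, pattern) {
  return likeToRegExp(pattern).test(value);
}

// Case-insensitive ILIKE: lowercase both sides, then apply LIKE.
function ilike(value, pattern) {
  return likeToRegExp(pattern.toLowerCase()).test(value.toLowerCase());
}

console.log(like('alice', 'A%'), ilike('alice', 'A%'));
console.log(like('100%', '100\\%'), like('1000', '100\\%'));
console.log(ilike('ÄPFEL', 'äp%'));
```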

The generated SQL now looks like:

| Op | SQL |
|----|-----|
| `LIKE` | `"name" LIKE ? ESCAPE '\'` |
| `NOT LIKE` | `"name" NOT LIKE ? ESCAPE '\'` |
| `ILIKE` | `lower("name") LIKE lower(?) ESCAPE '\'` |
| `NOT ILIKE` | `lower("name") NOT LIKE lower(?) ESCAPE '\'` |
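
A compiler step producing these strings might look like the sketch below. It is an assumption-laden illustration rather than the query builder in this PR: `compileLike` and `quoteIdent` are hypothetical names, and the `?` placeholder stands for the bound pattern.

```javascript
// Quote an identifier for SQLite, doubling any embedded double quotes.
function quoteIdent(name) {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Compile one of the four pattern operators into SQLite SQL with an explicit ESCAPE.
function compileLike(op, column) {
  const col = quoteIdent(column);
  const escapeClause = "ESCAPE '\\'"; // renders as: ESCAPE '\'
  switch (op) {
    case 'LIKE':
      return `${col} LIKE ? ${escapeClause}`;
    case 'NOT LIKE':
      return `${col} NOT LIKE ? ${escapeClause}`;
    case 'ILIKE':
      // Case-insensitivity comes from lower() on both sides, not from LIKE itself.
      return `lower(${col}) LIKE lower(?) ${escapeClause}`;
    case 'NOT ILIKE':
      return `lower(${col}) NOT LIKE lower(?) ${escapeClause}`;
    default:
      throw new Error(`Unsupported operator: ${op}`);
  }
}

for (const op of ['LIKE', 'NOT LIKE', 'ILIKE', 'NOT ILIKE']) {
  console.log(op.padEnd(9), '=>', compileLike(op, 'name'));
}
```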

## Dependency note

Full Unicode case-insensitivity for `ILIKE` requires the **ICU-enabled
build of `@rocicorp/zero-sqlite3`** (see companion PR
rocicorp/zero-sqlite3#31, which compiles SQLite with
`SQLITE_ENABLE_ICU`). Until that release is picked up:

- `LIKE` case-sensitivity and `ESCAPE '\'` take effect **immediately**
(core SQLite).
- `ILIKE` remains ASCII-folded via `lower()` — i.e. **no regression**
vs. today, and it upgrades to full Unicode automatically once the ICU
build lands.

## Safety of the global pragma

`case_sensitive_like = ON` is connection-scoped. The only internal
SQLite `LIKE`s are the lowercase introspection patterns in
`lite-tables.ts` (`'sqlite_%'`, `'_zero.%'`), which match lowercase
identifiers and stay correct under case-sensitivity. All other internal
`LIKE` usage is in Postgres queries, which the pragma doesn't affect.

## Tests

- New unit tests in `query-builder.test.ts` pin the generated SQL for
all four operators.
- Existing `table-source.test.ts` / `query.test.ts` / `db.test.ts` pass
(45 tests).
- `tsc` and `oxfmt` clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>