feat: Unicode LIKE/upper()/lower() via statically-linked ICU #31
Merged
SQLite's built-in LIKE/upper()/lower() only case-fold ASCII, so non-English text compared case-insensitively (e.g. zqlite's ILIKE) behaved incorrectly.

Enable SQLite's bundled ICU extension (SQLITE_ENABLE_ICU), which is already present in the amalgamation, guarded by the macro, and auto-registers a Unicode-aware LIKE/upper()/lower()/REGEXP on every connection.

ICU is linked STATICALLY so the prebuilt binaries stay self-contained: zero-cache installs them via prebuild-install onto runtime images (e.g. Alpine) that have no ICU, where a dynamic `NEEDED libicu*.so.<ver>` would fail to load (and would couple each binary to one ICU soname).

The new deps/icu.js discovers ICU (pkg-config / Homebrew / system paths) and emits the static archive paths plus the C++/system libs they require. CI installs the static ICU packages (libicu-dev on Debian, icu-dev+icu-static on Alpine, icu4c via Homebrew on macOS).

Windows is intentionally excluded for now: static ICU there means building it from source (vcpkg), which is impractically slow in CI. Windows keeps SQLite's ASCII-only LIKE until that is addressed; production (zero-cache) runs on Alpine, and macOS/Linux dev builds get full Unicode.

Cost: ICU's data table makes each binary ~30MB larger.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ICU is gated `OS != "win"` in the gyp files, so icu.js is never invoked on Windows. Remove the unused Windows branches (.lib archive names, advapi32, the isWin handling) and note the macOS/Linux-only scope. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
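For readers who want a concrete picture of the static-archive discovery described in these commits, here is a minimal sketch. It is not the actual `deps/icu.js`: the function name `findStaticIcu`, the injectable `exists` predicate, and the example directories are assumptions for illustration, and the pkg-config step is left out. It only shows the idea of walking candidate library directories in priority order and accepting the first one that holds all three core ICU archives.

```javascript
// Hypothetical sketch; not the code in deps/icu.js.
const fs = require('node:fs');
const path = require('node:path');

// ICU's core libraries, listed in static link order:
// i18n depends on uc (common), which depends on data.
const ICU_ARCHIVES = ['libicui18n.a', 'libicuuc.a', 'libicudata.a'];

/**
 * Returns the static archive paths from the first candidate directory that
 * contains all of them, or null when no directory qualifies.
 * `exists` is injectable so the search can be exercised without a real ICU.
 */
function findStaticIcu(libDirs, exists = fs.existsSync) {
  for (const dir of libDirs) {
    const archives = ICU_ARCHIVES.map((name) => path.join(dir, name));
    if (archives.every((file) => exists(file))) return archives;
  }
  return null;
}

// Example candidate order (assumed paths): explicit override first,
// then Homebrew's keg-only icu4c, then a generic system directory.
const candidates = [
  process.env.ICU_ROOT && path.join(process.env.ICU_ROOT, 'lib'),
  '/opt/homebrew/opt/icu4c/lib',
  '/usr/lib',
].filter(Boolean);

console.log(findStaticIcu(candidates) ?? 'no static ICU archives found');
```

Requiring all three archives in one directory avoids mixing files from two different ICU installs, which would otherwise fail later at link time with a less obvious error.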
Pull request overview
This PR enables SQLite’s bundled ICU extension on non-Windows builds so LIKE, lower(), upper(), and REGEXP become Unicode-aware, while keeping prebuilt binaries self-contained via static ICU linking.
Changes:
- Enable `SQLITE_ENABLE_ICU` for non-Windows builds in the SQLite static library build.
- Add `deps/icu.js` to discover ICU headers and static archives for node-gyp builds.
- Update build/link settings and CI dependencies to compile/link ICU on macOS and Linux (Debian/Alpine).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| deps/sqlite3.gyp | Defines SQLITE_ENABLE_ICU (non-Windows) and adds ICU include path for compiling the SQLite amalgamation with the ICU extension enabled. |
| deps/icu.js | Adds ICU discovery for include dir + static archive linker inputs (with platform-specific fallback behavior). |
| binding.gyp | Links ICU libs into the final .node (and the zero_sqlite3 shell where built) on non-Windows. |
| .github/workflows/build.yml | Installs ICU development/static packages needed for builds across macOS/Linux CI and release prebuild jobs. |
Contributor
Author
…inking; add Unicode tests

Addresses review feedback on #31:

- locate(): the pkg-config path now requires the ICU headers (<includedir>/unicode/utypes.h) to actually exist before trusting the .pc, and the system-path discovery picks an include dir that has them. The `include` mode fails with a clear message instead of emitting an empty/bogus path that would later blow up with a confusing missing-header error.
- libsOutput(): no longer silently falls back to dynamic linking when a static archive is missing. Prebuilt binaries must stay self-contained (zero-cache runs them on ICU-less images like Alpine), so the build now aborts with an actionable message. The dynamic fallback is opt-in via ICU_ALLOW_DYNAMIC=1 for local development.
- Add test/52.icu.js asserting Unicode behavior on non-Windows (lower('Ä')='ä', upper('ß')='SS', 'Ä' LIKE 'ä'=1) and ASCII-only behavior on Windows, guarding against future regressions from SQLite updates or build-flag changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
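The difference the new test guards against can be illustrated without the native module. The sketch below uses plain JavaScript strings only; `asciiLower` is a hypothetical helper that mimics SQLite's built-in ASCII-only folding, while `toLowerCase()`/`toUpperCase()` stand in for the Unicode-aware behavior expected from the ICU build. It does not call SQLite and is not the content of `test/52.icu.js`.

```javascript
// Hypothetical illustration; this does not exercise SQLite itself.

// Mimics SQLite's built-in lower(): only A-Z are folded, everything else
// passes through unchanged.
function asciiLower(s) {
  return s.replace(/[A-Z]/g, (c) => String.fromCharCode(c.charCodeAt(0) + 32));
}

// Case-insensitive equality under each folding rule.
const equalAsciiFold = (a, b) => asciiLower(a) === asciiLower(b);
const equalUnicodeFold = (a, b) => a.toLowerCase() === b.toLowerCase();

console.log(asciiLower('ÄB'));            // 'Äb'  (Ä is left alone)
console.log('ÄB'.toLowerCase());          // 'äb'
console.log('ß'.toUpperCase());           // 'SS'
console.log(equalAsciiFold('Ä', 'ä'));    // false (the pre-ICU behavior)
console.log(equalUnicodeFold('Ä', 'ä'));  // true
```

The last two lines correspond to the `'Ä' LIKE 'ä'` expectation: the comparison fails under ASCII-only folding and succeeds under Unicode folding.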
Debian/Ubuntu ship libicu*.a built without -fPIC, so they cannot be linked
into a shared object (the .node) — the build failed with "recompile with
-fPIC". Static linking only works where the archives are PIC: macOS (Homebrew)
and Alpine (musl). So:
* macOS + Alpine -> static link (self-contained, as before).
* glibc Linux -> dynamic link against the distro .so; those consumers
must have ICU installed at runtime.
Production (zero-cache on Alpine) keeps a self-contained, statically-linked
binary. ICU_ALLOW_DYNAMIC=1 still forces dynamic everywhere for local dev.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
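The per-platform rule from this commit can be summarized as a small decision function. This is a sketch under assumptions, not code from the repository: the name `chooseIcuLinkMode`, the `libc` argument (how musl vs. glibc is detected is left to the caller), and the `'none'` return value for Windows are all made up for illustration.

```javascript
// Hypothetical sketch of the link-mode rule described in the commit message.
// platform: a process.platform value; libc: 'musl' or 'glibc' on Linux.
function chooseIcuLinkMode({ platform, libc, env = {} }) {
  // ICU is gated off on Windows entirely (ASCII-only LIKE there).
  if (platform === 'win32') return 'none';
  // Local-development override: force dynamic linking everywhere.
  if (env.ICU_ALLOW_DYNAMIC === '1') return 'dynamic';
  // Homebrew archives are PIC, so they can go into the .node shared object.
  if (platform === 'darwin') return 'static';
  // Alpine (musl) archives are PIC as well.
  if (platform === 'linux' && libc === 'musl') return 'static';
  // Debian/Ubuntu archives are built without -fPIC: use the distro .so.
  return 'dynamic';
}

console.log(chooseIcuLinkMode({ platform: 'linux', libc: 'musl' }));  // 'static'
console.log(chooseIcuLinkMode({ platform: 'linux', libc: 'glibc' })); // 'dynamic'
```

Checking the override before the platform branches matches the statement that `ICU_ALLOW_DYNAMIC=1` forces dynamic linking everywhere ICU is enabled.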
tantaman
reviewed
Jun 1, 2026
tantaman
approved these changes
Jun 1, 2026
Contributor
tantaman
left a comment
just need that 1 comment in download.sh
Per review feedback: explain that ICU is defined conditionally (non-Windows) in deps/sqlite3.gyp rather than in this unconditional, all-platform DEFINES list. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
thomasmol
pushed a commit
to thomasmol/rocicorp-mono
that referenced
this pull request
Jun 2, 2026
…orp#6095)

## Problem

The SQLite replica (`zqlite`) diverged from Postgres on `LIKE`/`ILIKE`, so query results could differ between the server-side replica, Postgres, and the in-memory IVM matcher:

- **Postgres `LIKE` is case-sensitive**, but SQLite's `LIKE` operator is case-*insensitive* by default → plain `LIKE` matched too much.
- **`ILIKE` was rewritten to `LIKE`**, which only case-folds **ASCII** → non-English `ILIKE` was effectively case-sensitive (e.g. `Ä`/`ä`, Cyrillic, Greek didn't match).
- **No `ESCAPE` was emitted**, but Postgres and the in-memory IVM matcher (`zql/src/builder/like.ts`) both treat backslash as the default escape character.

## Fix

Mirror the in-memory IVM matcher so all three backends (Postgres, IVM, SQLite replica) agree:

- Enable `PRAGMA case_sensitive_like = ON` on every connection (`db.ts`) so the bare `LIKE` operator is case-sensitive, matching Postgres `LIKE`.
- Compile `ILIKE`/`NOT ILIKE` as `lower(a) LIKE lower(b)`, using the Unicode-aware `lower()` that `@rocicorp/zero-sqlite3` provides via ICU, matching the `toLowerCase()` the IVM matcher uses.
- Emit an explicit `ESCAPE '\'` for all `LIKE`/`ILIKE` operators.

The generated SQL now looks like:

| Op | SQL |
|----|-----|
| `LIKE` | `"name" LIKE ? ESCAPE '\'` |
| `NOT LIKE` | `"name" NOT LIKE ? ESCAPE '\'` |
| `ILIKE` | `lower("name") LIKE lower(?) ESCAPE '\'` |
| `NOT ILIKE` | `lower("name") NOT LIKE lower(?) ESCAPE '\'` |

## Dependency note

Full Unicode case-insensitivity for `ILIKE` requires the **ICU-enabled build of `@rocicorp/zero-sqlite3`** (see companion PR rocicorp/zero-sqlite3#31, which compiles SQLite with `SQLITE_ENABLE_ICU`). Until that release is picked up:

- `LIKE` case-sensitivity and `ESCAPE '\'` take effect **immediately** (core SQLite).
- `ILIKE` remains ASCII-folded via `lower()`, i.e. **no regression** vs. today, and it upgrades to full Unicode automatically once the ICU build lands.

## Safety of the global pragma

`case_sensitive_like = ON` is connection-scoped. The only internal SQLite `LIKE`s are the lowercase introspection patterns in `lite-tables.ts` (`'sqlite_%'`, `'_zero.%'`), which match lowercase identifiers and stay correct under case-sensitivity. All other internal `LIKE` usage is in Postgres queries, which the pragma doesn't affect.

## Tests

- New unit tests in `query-builder.test.ts` pin the generated SQL for all four operators.
- Existing `table-source.test.ts` / `query.test.ts` / `db.test.ts` pass (45 tests).
- `tsc` and `oxfmt` clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
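The operator-to-SQL mapping described in this commit message can be sketched as a small function. This is an illustration under assumptions, not the mono repository's query builder: `compileLike` and its identifier quoting are hypothetical, and only the four output shapes are taken from the commit message.

```javascript
// Hypothetical sketch; the real query builder in rocicorp/mono may differ.
function compileLike(op, column) {
  // Quote the identifier, doubling embedded quotes (standard SQL quoting).
  const col = `"${column.replace(/"/g, '""')}"`;
  // A single backslash as the escape character, as Postgres does by default.
  const escape = "ESCAPE '\\'";
  switch (op) {
    case 'LIKE':
      return `${col} LIKE ? ${escape}`;
    case 'NOT LIKE':
      return `${col} NOT LIKE ? ${escape}`;
    case 'ILIKE':
      // Case-insensitive match: fold both sides with lower().
      return `lower(${col}) LIKE lower(?) ${escape}`;
    case 'NOT ILIKE':
      return `lower(${col}) NOT LIKE lower(?) ${escape}`;
    default:
      throw new Error(`unsupported operator: ${op}`);
  }
}

for (const op of ['LIKE', 'NOT LIKE', 'ILIKE', 'NOT ILIKE']) {
  console.log(compileLike(op, 'name'));
}
```

Because `ILIKE` folds both operands with `lower()`, it only gives the intended result when the bare `LIKE` operator is case-sensitive, which is why the commit pairs this rewrite with `PRAGMA case_sensitive_like = ON`.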
## Why

SQLite's built-in `LIKE`/`upper()`/`lower()` only case-fold ASCII, so any case-insensitive comparison of non-English text is wrong: e.g. `zqlite`'s `ILIKE` (`lower(a) LIKE lower(b)`) didn't match `Ä`/`ä`, Cyrillic, Greek, etc. The companion mono PR (rocicorp/mono#6095) relies on a Unicode `lower()` from this package to make `ILIKE` correct.

## What

Enable SQLite's bundled ICU extension (`SQLITE_ENABLE_ICU`). The extension source is already in the committed amalgamation (guarded by `#ifdef SQLITE_ENABLE_ICU`) and `sqlite3IcuInit` is in SQLite's built-in auto-init array, so defining the macro auto-registers a Unicode-aware `LIKE`/`upper()`/`lower()`/`REGEXP` on every connection. No C/C++ glue, just build config.

## Static linking (the important part)

ICU is linked statically so the prebuilt binaries stay self-contained. This is required, not optional:

- `zero-cache` ships on `node:22-alpine` and installs this package via `prebuild-install` (fetches a prebuilt `.node`; it does not compile). The runtime image has no ICU.
- `NEEDED libicui18n.so.<ver>` would fail to load on that Alpine image, and would couple every binary to one ICU soname (ICU bumps it each release). This is why better-sqlite3 upstream doesn't enable ICU.

Static linking embeds ICU into the binary instead. Verified locally on macOS: `otool -L` shows no `libicu` dependency, and `Ä LIKE ä`, Cyrillic, Greek, and `lower('Ä') → 'ä'` all work.

New helper `deps/icu.js` discovers ICU (pkg-config → Homebrew → system paths; `ICU_ROOT` override) and emits the static-archive paths plus the C++/system libs ICU's archives need. CI installs the static packages:

- Debian: `libicu-dev` (`/usr/lib/<triplet>/libicu*.a`)
- Alpine: `icu-dev` + `icu-static`
- macOS (Homebrew): `icu4c`

## Windows: deferred (ASCII-only LIKE for now)

ICU is gated `OS != "win"`. Static ICU on Windows means building it from source via vcpkg, which is impractically slow to do per-arch (x64/ia32/arm64) on every PR and release. Production runs on Alpine and dev runs on macOS/Linux, so Windows keeps SQLite's ASCII `LIKE` until this is followed up (likely vcpkg static-CRT triplets + binary caching). The mono PR degrades gracefully on Windows (ASCII `ILIKE`, no regression).

## Cost

ICU's case/data tables make each prebuilt binary ~30 MB larger (≈3 MB → ≈35 MB). Each user only downloads the one prebuild for their platform.

## Testing

- `npm run build-release` clean on macOS (arm64); binary is self-contained (`otool -L`); Unicode `LIKE`/`lower()` verified via the built `zero_sqlite3` shell.
- `.a` availability confirmed for Debian `libicu-dev` and Alpine `icu-static`.

🤖 Generated with Claude Code