feat: LFM2.5 text-embedding & ColBERT (MLX/XNNPACK) with prompts and multi-vector output#1269
Draft
NorbertKlockiewicz wants to merge 13 commits into
…xSim
Add the LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M models, served from
HuggingFace (MLX on iOS, XNNPACK on Android / iOS simulator).
Text embeddings are unified into one runner and one hook: the native
TextEmbeddings model returns the raw [numTokens, embeddingDim] matrix
(numTokens === 1 for pooled models, the full sequence for multi-vector /
late-interaction models like ColBERT), plus the input token ids. The TS
layer reduces it — toVector() for the single-vector case, getTokenVectors()
and maxSim() for late interaction.
Models trained with asymmetric query/document prompts (LFM uses query:/
document:, ColBERT uses [Q] /[D] ) carry a "prompts" config; forward then
requires a role argument ('query' | 'document') that auto-prepends the
prompt. The role is type-enforced: required for prompted models, forbidden
for plain ones.
Also: tokenizer post_processor is now applied for text embeddings so the
BOS special token is added (CLS-pooled models depend on it), and the
text-to-image Encoder reads the new EmbeddingResult.
Example app gains a semantic-search screen and a ColBERT late-interaction
search screen demonstrating MaxSim.
Authored with Claude.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
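The reduction described in this commit message can be sketched as follows. This is a minimal illustration, not the code from this PR: the `EmbeddingResult` field layout, the function signatures, the optional skip set, and the use of a plain dot product (which assumes the token vectors are already L2-normalized) are all assumptions made for the example.

```typescript
// Hypothetical shape based on the fields named in this PR; the real type may differ.
interface EmbeddingResult {
  vectors: Float32Array; // assumed row-major [numTokens, embeddingDim]
  numTokens: number;
  embeddingDim: number;
  tokenIds: number[];
}

// Single-vector case: take row 0 of the matrix (a view, not a copy).
function toVector(r: EmbeddingResult): Float32Array {
  return r.vectors.subarray(0, r.embeddingDim);
}

// Multi-vector case: one view per token, optionally dropping skip-listed token ids.
function getTokenVectors(
  r: EmbeddingResult,
  skip: ReadonlySet<number> = new Set<number>(),
): Float32Array[] {
  const rows: Float32Array[] = [];
  for (let i = 0; i < r.numTokens; i++) {
    if (skip.has(r.tokenIds[i])) continue;
    rows.push(r.vectors.subarray(i * r.embeddingDim, (i + 1) * r.embeddingDim));
  }
  return rows;
}

// MaxSim: for each query token, take the best dot product over all document
// tokens, then sum those maxima. Assumes vectors are L2-normalized.
function maxSim(query: Float32Array[], doc: Float32Array[]): number {
  if (doc.length === 0) return 0;
  let score = 0;
  for (const q of query) {
    let best = -Infinity;
    for (const d of doc) {
      let dot = 0;
      for (let k = 0; k < q.length; k++) dot += q[k] * d[k];
      if (dot > best) best = dot;
    }
    score += best;
  }
  return score;
}
```

Because `getTokenVectors` returns views into the same buffer, the rows are only valid for as long as the underlying `vectors` array is kept alive and unmodified.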
b1f5bdd to 50e80e1
- Migrate the segment-anything (SAM) screen to toVector(forward()): its CLIP-text path broke when forward started returning EmbeddingResult.
- Update the C++ TextEmbeddings integration test for the EmbeddingResult return type (was still using the old OwningArrayBuffer pointer API).
- Guard the per-token invariant: throw InvalidModelOutput if output rows != input token count (pooled numTokens==1 exempt), so skiplist masking can't silently misalign if a graph pads/truncates.
- Dedup encode()/encodeWithSpecialTokens() into a shared encodeImpl.
- Drop the redundant Float32Array copy at the JSI boundary; document the getTokenVectors view lifetime; remove dead BaseEmbeddings::postprocess.
Authored with Claude.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
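The per-token invariant from this commit lives in the native C++ code; the sketch below only illustrates the rule in TypeScript. The function name `checkRowCount` and the error message format are hypothetical, and a plain `Error` stands in for the native `InvalidModelOutput` error.

```typescript
// Hypothetical helper illustrating the invariant described above.
// outputRows: number of rows in the model's output matrix.
// inputTokenCount: number of token ids fed to the model.
function checkRowCount(outputRows: number, inputTokenCount: number): void {
  // Pooled models emit exactly one row regardless of input length, so they are exempt.
  const pooled = outputRows === 1;
  if (!pooled && outputRows !== inputTokenCount) {
    // Stand-in for the native InvalidModelOutput error.
    throw new Error(
      `InvalidModelOutput: got ${outputRows} rows for ${inputTokenCount} input tokens`,
    );
  }
}
```

Without such a check, a graph that pads or truncates its output would shift rows relative to `tokenIds`, and a skip list keyed on token ids would mask the wrong vectors.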
forward(text) returns a single pooled Float32Array again for standard models, restoring the original API, so MiniLM/MPNet/CLIP/SAM consumers need no migration. The reduction (row 0 of the native [numTokens, embeddingDim] matrix) happens in the TS module, not at the call site.
Multi-vector (late-interaction) models opt in via a `multiVector: true` config flag; for those, forward returns the full per-token EmbeddingResult so MaxSim/skiplist work. Return type is discriminated by the flag, and the role argument by `prompts` (required when prompted, none when not).
Authored with Claude.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
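One way to get a return type discriminated by `multiVector` and a role argument discriminated by `prompts` is with conditional types. The sketch below is an assumption about how this could look, not the definitions from this PR: the `ModelConfig`, `Prompts`, and `ForwardArgs` shapes, the `withPrompt` helper, and the exact prompt strings (including trailing spaces) are hypothetical.

```typescript
type Role = 'query' | 'document';

interface Prompts {
  query: string;
  document: string;
}

// Hypothetical config shape with the two optional fields relevant here.
interface ModelConfig {
  prompts?: Prompts;
  multiVector?: boolean;
}

// Hypothetical result shape based on the fields named in this PR.
interface EmbeddingResult {
  vectors: Float32Array;
  numTokens: number;
  embeddingDim: number;
  tokenIds: number[];
}

// Return type follows the multiVector flag.
type ForwardReturn<C extends ModelConfig> =
  C extends { multiVector: true } ? EmbeddingResult : Float32Array;

// The role argument exists only when the config carries prompts.
type ForwardArgs<C extends ModelConfig> =
  C extends { prompts: Prompts } ? [text: string, role: Role] : [text: string];

type ForwardFn<C extends ModelConfig> =
  (...args: ForwardArgs<C>) => Promise<ForwardReturn<C>>;

// Runtime counterpart: prepend the role prompt when the model has one.
function withPrompt(config: ModelConfig, text: string, role?: Role): string {
  if (!config.prompts) return text;
  if (role === undefined) throw new Error('role is required for prompted models');
  return config.prompts[role] + text;
}

// Example configs (prompt strings are assumptions).
const colbertConfig = {
  prompts: { query: '[Q] ', document: '[D] ' },
  multiVector: true,
} as const;
const plainConfig = { multiVector: false } as const;

// Resolves to (text: string, role: Role) => Promise<EmbeddingResult>
type ColbertForward = ForwardFn<typeof colbertConfig>;
// Resolves to (text: string) => Promise<Float32Array>
type PlainForward = ForwardFn<typeof plainConfig>;
```

With types along these lines, calling a prompted model without a role, or a plain model with one, fails at compile time, while the runtime check in `withPrompt` covers untyped callers.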
Description
Adds two LFM2.5 retrieval models from Liquid AI and the API needed to use them, through the existing `useTextEmbeddings` hook (one native runner, one hook, no new public surface beyond optional model-config fields):
- LFM2.5-Embedding-350M: […] `query:`/`document:` prompts.
- LFM2.5-ColBERT-350M: […] `Linear(1024→128)` per token). Trained with `[Q]`/`[D]` prompts.

Both run on MLX on iOS (physical device) and XNNPACK on Android, quantized (MLX int4, XNNPACK 8da4w).

To support them without breaking the existing API, the model config grew three optional fields and `forward` became config-driven:
- `prompts`: when present, `forward` requires a `role` (`'query' | 'document'`) and auto-prepends the matching prompt.
- `multiVector`: when `true`, `forward` returns a per-token `EmbeddingResult` (`vectors`, `numTokens`, `embeddingDim`, `tokenIds`); otherwise it returns a single pooled `Float32Array` as before.
- `skipListIds`: punctuation token ids the consumer excludes from MaxSim scoring.

The library auto-applies the role prompts (the matching `query:`/`[Q]` prefix is prepended in `forward`), but late-interaction scoring (MaxSim) stays the consumer's concern: it runs wherever the vectors are stored. The example app demonstrates one way to score (its own local `maxSim`), and the ColBERT demo is folded into the unified text-embeddings screen, picking the scorer from the model's config.

Native side: `TextEmbeddings::generate` returns the raw `[numTokens, embeddingDim]` matrix as an `EmbeddingResult`; the TS layer reduces it. The empty `BaseEmbeddings` base class was removed (`TextEmbeddings` now extends `BaseModel` directly), and output-shape validation was extracted into `TextEmbeddings::buildResult`.

Review order: start with the TS types (`types/textEmbeddings.ts`, where `ForwardFn`/`ForwardReturn` are discriminated on the model config), then the module/hook (`TextEmbeddingsModule.ts`, `useTextEmbeddings.ts`), then the native `TextEmbeddings.cpp`/`Types.h`, then the registry/URLs and the example screen.

Introduces a breaking change?
`forward` stays non-breaking: pooled models still return `Float32Array`. The new return type and `role` requirement only apply to models that opt in via config.

Type of change
Tested on
Testing instructions
[…] `text-embeddings` example app.

C++ unit tests: `TextEmbeddingsTests` (incl. new `EmbeddingResult` metadata / `tokenIds` assertions) compiles and links under the Android NDK toolchain. The suite is cross-compiled, so it is not executed on the host in this setup.

Related issues
Checklist
Additional notes
MLX requires a physical iOS device: the MLX delegate does not run on the simulator (use XNNPACK there). The two models are hosted on the Software Mansion Hugging Face org; docs are updated for both `next` and the `0.9.x` versioned set.