[RNE Rewrite] feat: add tokenizer pipeline (#1248) by msluszniak · Pull Request #1274 · software-mansion/react-native-executorch

msluszniak · 2026-06-22T12:47:00Z

Description

Adds the tokenizer pipeline (issue #1248) using the new worklet-based architecture, with functional parity to the current TokenizerModule.

A new nlp extension exposes a loadTokenizer JSI primitive (top-level on __rnexecutorch_jsi__, like loadModel) returning a Tokenizer host object backed by tokenizers::HFTokenizer. On top of it sits a createTokenizer(config, runtime?) async factory (async + *Worklet variants + dispose) and a useTokenizer hook. Methods: encode, decode, getVocabSize, idToToken, tokenToId — same semantics as today (special tokens follow the tokenizer.json post_processor). The *Worklet variants let an upcoming text-embeddings task tokenize → build tensors → run forward within a single worklet.
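As a rough illustration of the five-method surface described above, here is a minimal sketch. The `TokenizerLike` interface and the `ToyTokenizer` class are hypothetical stand-ins with assumed signatures; the PR's actual Tokenizer is a native host object backed by tokenizers::HFTokenizer, and its types may differ.

```typescript
// Hypothetical shape of the five methods named in the description.
// Signatures are assumptions, not copied from the PR.
interface TokenizerLike {
  encode(text: string): number[];
  decode(ids: number[]): string;
  getVocabSize(): number;
  idToToken(id: number): string;
  tokenToId(token: string): number;
}

// Toy whitespace tokenizer over a fixed vocabulary, for illustration only.
// It does not implement WordPiece or any post_processor behaviour.
class ToyTokenizer implements TokenizerLike {
  private readonly vocab: string[];

  constructor(vocab: string[]) {
    this.vocab = vocab;
  }

  encode(text: string): number[] {
    return text
      .toLowerCase()
      .split(/\s+/)
      .filter((word) => word.length > 0)
      .map((word) => this.tokenToId(word));
  }

  decode(ids: number[]): string {
    return ids.map((id) => this.idToToken(id)).join(' ');
  }

  getVocabSize(): number {
    return this.vocab.length;
  }

  idToToken(id: number): string {
    const token = this.vocab[id];
    if (token === undefined) {
      throw new Error(`idToToken: unknown id ${id}`);
    }
    return token;
  }

  tokenToId(token: string): number {
    const id = this.vocab.indexOf(token);
    if (id < 0) {
      throw new Error(`tokenToId: unknown token ${token}`);
    }
    return id;
  }
}

const toy = new ToyTokenizer(['[UNK]', 'hello', 'world']);
const toyIds = toy.encode('Hello world');
const toyText = toy.decode(toyIds);
```

The toy class only shows how the five methods relate to each other (encode and decode as inverses over known tokens, idToToken and tokenToId as a lookup pair); it says nothing about how the native host object is implemented.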

C++: cpp/extensions/nlp/{tokenizer,install}.{h,cpp}, wired into RnExecutorch.cpp.
TS: src/extensions/nlp/{ops,tasks}/tokenizer.ts, src/hooks/useTokenizer.ts, exports in index.ts, example models.tokenizer.ALL_MINILM_L6_V2.
Build: tokenizer header-search paths added to android/CMakeLists.txt and the podspec — pytorch/tokenizers/include plus the bundled libs its public headers pull in (nlohmann/json, re2, and re2's abseil dep). Symbols link from the prebuilt libexecutorch. Documented in third-party/README.md.
Demo: a dedicated apps/nlp example app with a Tokenizer screen that drives the full pipeline on device.

Introduces a breaking change?

Yes
No

Type of change

Bug fix (change which fixes an issue)
New feature (change which adds functionality)
Documentation update (improves or adds clarity to existing documentation)
Other (chores, tests, code style improvements etc.)

Tested on

iOS
Android

Testing instructions

CI is TypeScript-only here (native isn't compiled in CI); yarn typecheck, root yarn lint, and yarn prepare (bob build) all pass.

To exercise the native tokenizer end-to-end via the demo app:

Provision the ExecuTorch third-party artifacts into packages/react-native-executorch/third-party/ (see third-party/README.md). The existing PoC bundle already ships the llm/tokenizers extension (headers + symbols), so no rebuild is needed.
yarn && cd apps/nlp && yarn ios — the iOS simulator works since the tokenizer is pure CPU (no GPU/Metal).
Open the Tokenizer screen. It loads all-MiniLM-L6-v2 and auto-runs encode / decode / getVocabSize / idToToken / tokenToId, asserting: encode("Hello world") = [7592, 2088], decode round-trips to "hello world", getVocabSize() = 30522, and tokenToId(idToToken(id)) is the identity. All assertions should read PASS (also logged as [TokenizerTest]).
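The checks listed in the last step could be sketched roughly as follows. The `runTokenizerChecks` helper, the `TokenizerLike` interface, and the canned `stub` are hypothetical; the demo screen's actual code is not reproduced here, and the real methods may be asynchronous. The expected values are the ones quoted above.

```typescript
// Hypothetical synchronous view of the tokenizer methods under test.
interface TokenizerLike {
  encode(text: string): number[];
  decode(ids: number[]): string;
  getVocabSize(): number;
  idToToken(id: number): string;
  tokenToId(token: string): number;
}

interface CheckResult {
  name: string;
  pass: boolean;
}

// Element-wise comparison of two id arrays.
function sameIds(a: number[], b: number[]): boolean {
  return a.length === b.length && a.every((value, i) => value === b[i]);
}

// Runs the checks described in the testing instructions.
function runTokenizerChecks(t: TokenizerLike): CheckResult[] {
  const ids = t.encode('Hello world');
  return [
    { name: 'encode', pass: sameIds(ids, [7592, 2088]) },
    { name: 'decode', pass: t.decode(ids) === 'hello world' },
    { name: 'getVocabSize', pass: t.getVocabSize() === 30522 },
    {
      name: 'idToToken/tokenToId round trip',
      pass: ids.every((id) => t.tokenToId(t.idToToken(id)) === id),
    },
  ];
}

// Canned stand-in so the sketch runs without the native module.
const canned: Array<[number, string]> = [
  [7592, 'hello'],
  [2088, 'world'],
];

const stub: TokenizerLike = {
  encode: () => [7592, 2088],
  decode: (ids) => ids.map((id) => stub.idToToken(id)).join(' '),
  getVocabSize: () => 30522,
  idToToken: (id) => {
    const hit = canned.filter(([i]) => i === id)[0];
    return hit ? hit[1] : '';
  },
  tokenToId: (token) => {
    const hit = canned.filter(([, value]) => value === token)[0];
    return hit ? hit[0] : -1;
  },
};

const results = runTokenizerChecks(stub);
```

Passing a real tokenizer instead of `stub` would turn this into an on-device smoke test, assuming its methods match the interface above.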

The screen genuinely drives the new code — useTokenizer → createTokenizer → the loadTokenizer JSI primitive → the native TokenizerHostObject / HFTokenizer — so a green run validates the whole pipeline, not just the types. Verified locally on iOS (iPhone 16 Pro Max, Xcode 26.5): all assertions pass.

The apps/nlp app is intentionally minimal and exists only to prove this pipeline; it can be dropped after approval if you'd prefer not to keep a demo app in-tree.

Screenshots

Related issues

#1248, part of #1208

Checklist

I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have updated the documentation accordingly
My changes generate no new warnings

Additional notes

The C++ mirrors the current TokenizerModule, which is backed by pytorch/tokenizers (tokenizers::HFTokenizer) from the ExecuTorch llm/tokenizers extension. This PR consumes the same headers (the third-party bundle under extension/llm/tokenizers) and prebuilt symbols (from libexecutorch); it does not use the tokenizers-cpp submodule. Tokenizer download currently uses the temporary react-native-fs-based useResourceDownload introduced in #1264 (to be replaced by the ResourceFetcher in #1253).

barhanc

I only went over the library implementation. Tomorrow I will take a look at the example app and test it.

barhanc · 2026-06-22T19:32:14Z

+    # pytorch/tokenizers headers (and the third-party libs they pull in:
+    # nlohmann/json, re2 and its abseil dependency) ship inside the ExecuTorch
+    # llm extension bundle


I don't think this comment is necessary.

barhanc · 2026-06-22T19:35:36Z

+                throw jsi::JSError(rt, "decode: Failed to decode tokens: error " +
+                                           std::to_string(static_cast<int32_t>(result.error())));


Same as in encode.

barhanc · 2026-06-22T19:38:51Z

+                throw jsi::JSError(rt, "idToToken: Failed to convert id to token: error " +
+                                           std::to_string(static_cast<int32_t>(result.error())));


Same as in encode / decode.

barhanc · 2026-06-22T19:39:13Z

+                throw jsi::JSError(rt, "tokenToId: Failed to convert token to id: error " +
+                                           std::to_string(static_cast<int32_t>(result.error())));


Same as in encode / decode.

barhanc · 2026-06-22T19:43:53Z

+ * All methods are synchronous and worklet-callable, mirroring the {@link Model}
+ * and {@link Tensor} primitives. For app-level usage prefer the asynchronous


The comment about mirroring Model/Tensor seems out of place for a user-facing API.

barhanc · 2026-06-22T20:02:50Z

+    return tokenizer.tokenToId(token);
+  };
+
+  const dispose = () => tokenizer.dispose();


This should be the first function defined in constructor, so it is easy to see that there are no native memory leaks.

barhanc · 2026-06-22T20:10:51Z

ops/ isn't a very good name for this directory. It's imo too generic. For CV it works because it contains operations for common CV transforms. Here in nlp we will have directories like llm/ for LLM related stuff, privacy-filter and so on, so this should imo either be in root nlp/ as a standalone tokenizer.ts or nested in nlp/tokenizer/tokenizer.ts.

barhanc · 2026-06-22T20:13:37Z

+    encodeWorklet,
+    decodeWorklet,
+    getVocabSizeWorklet,
+    idToTokenWorklet,
+    tokenToIdWorklet,


I don't really see the point of exposing both the async and worklet functions. See my file comment, but if we want to leave this as a task, then imo we should just return encode, decode as async functions and the rest as sync.

barhanc · 2026-06-22T20:16:04Z

+            auto result = self->tokenizer_->encode(text, kNumAddedBosTokens, kNumAddedEosTokens);
+            if (!result.ok()) {
+                throw jsi::JSError(rt, "encode: Failed to encode input: error " +
+                                           std::to_string(static_cast<int32_t>(result.error())));


Please use the error string, i.e. std::string errorMsg = executorch::runtime::to_string(result.error());, in the JS exception message instead of the error code.

barhanc · 2026-06-22T20:35:42Z

The problem I see with this file is that there is really no orchestration logic here, no task per se; it's just a boilerplate wrapper over ops/tokenizer.ts. Now, if we want to be 100% functionally backward-compatible with our old API and must expose a useTokenizer hook, then imo this file should basically be something like this (because this is all we need for the hook):

export async function createTokenizer(config: TokenizerConfig, runtime?: WorkletRuntime) {
  const { tokenizerPath } = config;
  const tokenizer = await wrapAsync(loadTokenizer, runtime)(tokenizerPath);
  const dispose = () => tokenizer.dispose();
  return {
    encode: wrapAsync(tokenizer.encode, runtime),
    decode: wrapAsync(tokenizer.decode, runtime),
    getVocabSize: tokenizer.getVocabSize,
    idToToken: tokenizer.idToToken,
    tokenToId: tokenizer.tokenToId,
    dispose,
  };
}

But I'm questioning whether this is even needed at all. ops/tokenizer.ts is the abstraction we need for implementing things like embeddings, privacy-filter, etc. (the pattern is similar to the LLMRunner use in the LLMChat task in the PoC). Additionally, tasks should be whole pipelines that typical users can just take and use end-to-end. It's hard to imagine why a typical user who only uses the hooks API would want a bare tokenizer without anything else, and power users who create their own pipelines still have ops/tokenizer.ts, which is not internal-only.

I'm fine with both approaches but if we decide to keep the tokenization as a task, then let's only do it as this very minimal wrapper above, and in end-to-end tasks like embeddings / privacy-filter let's use the ops/tokenizer.ts abstraction. Also small nit, the task should be called 'tokenization'.

barhanc · 2026-06-22T21:11:07Z

+
+namespace rnexecutorch::extensions::nlp {
+void install(facebook::jsi::Runtime &rt, facebook::jsi::Object &module) {
+    tokenizer::install_loadTokenizer(rt, module);


These should be installed under nlp submodule (same pattern as CV)

void install(facebook::jsi::Runtime &rt, facebook::jsi::Object &module) {
    jsi::Object nlpModule = jsi::Object(rt);
    // ...
    module.setProperty(rt, "nlp", nlpModule);
}

barhanc · 2026-06-22T21:11:37Z

+ */
+export function loadTokenizer(tokenizerPath: string): Tokenizer {
+  'worklet';
+  return rnexecutorchJsi.loadTokenizer(tokenizerPath) as Tokenizer;


The loadTokenizer method should be under rnexecutorchJsi.nlp.
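On the TypeScript side, the suggested nesting might look roughly like the sketch below. The interface names and the in-memory `rnexecutorchJsi` stand-in are hypothetical and exist only so the example runs on its own; in the library the object would be the installed JSI global, and the returned value would be the native Tokenizer host object.

```typescript
// Hypothetical subset of the Tokenizer host object.
interface Tokenizer {
  getVocabSize(): number;
  dispose(): void;
}

// Hypothetical typing of an nlp submodule on the JSI object.
interface NlpModule {
  loadTokenizer(tokenizerPath: string): Tokenizer;
}

interface RnExecutorchJsi {
  nlp: NlpModule;
}

// Records which paths were requested, so the stand-in is observable.
const loadedPaths: string[] = [];

// In-memory stand-in for the installed JSI object (assumption, not the real global).
const rnexecutorchJsi: RnExecutorchJsi = {
  nlp: {
    loadTokenizer: (tokenizerPath) => {
      loadedPaths.push(tokenizerPath);
      return { getVocabSize: () => 0, dispose: () => {} };
    },
  },
};

// Wrapper resolving the primitive under the nlp submodule instead of the top level.
function loadTokenizer(tokenizerPath: string): Tokenizer {
  'worklet';
  return rnexecutorchJsi.nlp.loadTokenizer(tokenizerPath);
}

const loaded = loadTokenizer('/tmp/tokenizer.json');
```

Grouping the primitive under `nlp` keeps the top-level JSI object small and matches the per-domain layout the reviewer refers to for CV.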

msluszniak marked this pull request as draft June 22, 2026 12:49

msluszniak self-assigned this Jun 22, 2026

msluszniak added the feature PRs that implement a new feature label Jun 22, 2026

msluszniak linked an issue Jun 22, 2026 that may be closed by this pull request

[RNE Rewrite] Add tokenizer pipeline implementation #1248

Open

msluszniak force-pushed the @ms/issue1248-tokenizer branch 4 times, most recently from c5817d8 to f426882 Compare June 22, 2026 13:30

msluszniak added the refactoring label Jun 22, 2026

msluszniak force-pushed the @ms/issue1248-tokenizer branch from f426882 to 66dfb9d Compare June 22, 2026 17:10

feat(nlp): add tokenizer pipeline (worklet host object) [#1248]

df228ca

msluszniak force-pushed the @ms/issue1248-tokenizer branch from 66dfb9d to d394a7e Compare June 22, 2026 18:47

feat(nlp): add tokenizer demo to a dedicated nlp example app

81d7ab1

msluszniak force-pushed the @ms/issue1248-tokenizer branch from d394a7e to 81d7ab1 Compare June 22, 2026 18:56

msluszniak marked this pull request as ready for review June 22, 2026 19:14

msluszniak requested a review from barhanc June 22, 2026 19:14

barhanc requested changes Jun 22, 2026

barhanc reviewed Jun 22, 2026
