[RNE Rewrite] feat: add tokenizer pipeline (#1248) #1274

Open
msluszniak wants to merge 2 commits into rne-rewrite from @ms/issue1248-tokenizer

@msluszniak msluszniak commented Jun 22, 2026

Member

Description

Adds the tokenizer pipeline (issue #1248) using the new worklet-based architecture, with functional parity to the current TokenizerModule.

A new nlp extension exposes a loadTokenizer JSI primitive (top-level on __rnexecutorch_jsi__, like loadModel) returning a Tokenizer host object backed by tokenizers::HFTokenizer. On top of it sits a createTokenizer(config, runtime?) async factory (async + *Worklet variants + dispose) and a useTokenizer hook. Methods: encode, decode, getVocabSize, idToToken, tokenToId — same semantics as today (special tokens follow the tokenizer.json post_processor). The *Worklet variants let an upcoming text-embeddings task tokenize → build tensors → run forward within a single worklet.
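
As a rough illustration of the surface described above, the sketch below models the five methods plus dispose as a TypeScript interface and backs it with a toy in-memory vocabulary. The names `TokenizerLike` and `createToyTokenizer` are hypothetical and exist only for this example; the real object is a JSI host object backed by HFTokenizer, and its exact types may differ.

```typescript
// Hypothetical shape of the tokenizer surface listed in the description.
// The real host object may type these methods differently.
interface TokenizerLike {
  encode(text: string): number[];
  decode(ids: number[]): string;
  getVocabSize(): number;
  idToToken(id: number): string;
  tokenToId(token: string): number;
  dispose(): void;
}

// Toy stand-in: lowercases, splits on whitespace, and looks tokens up in a
// fixed vocabulary. This is NOT how HFTokenizer works; it only makes the
// interface above runnable.
function createToyTokenizer(vocab: string[]): TokenizerLike {
  let disposed = false;
  const index = new Map<string, number>();
  vocab.forEach((token, id) => index.set(token, id));

  const ensureAlive = (): void => {
    if (disposed) throw new Error('tokenizer already disposed');
  };
  const tokenToId = (token: string): number => {
    ensureAlive();
    const id = index.get(token);
    if (id === undefined) throw new Error(`unknown token: ${token}`);
    return id;
  };
  const idToToken = (id: number): string => {
    ensureAlive();
    const token = vocab[id];
    if (token === undefined) throw new Error(`unknown id: ${id}`);
    return token;
  };

  return {
    encode: (text) =>
      text
        .toLowerCase()
        .split(/\s+/)
        .filter((piece) => piece.length > 0)
        .map((piece) => tokenToId(piece)),
    decode: (ids) => ids.map((id) => idToToken(id)).join(' '),
    getVocabSize: () => {
      ensureAlive();
      return vocab.length;
    },
    idToToken,
    tokenToId,
    dispose: () => {
      disposed = true;
    },
  };
}
```

Any method that touches the vocabulary throws once dispose has been called, which mirrors the usual expectation for an object that owns native resources.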

  • C++: cpp/extensions/nlp/{tokenizer,install}.{h,cpp}, wired into RnExecutorch.cpp.
  • TS: src/extensions/nlp/{ops,tasks}/tokenizer.ts, src/hooks/useTokenizer.ts, exports in index.ts, example models.tokenizer.ALL_MINILM_L6_V2.
  • Build: tokenizer header-search paths added to android/CMakeLists.txt and the podspec — pytorch/tokenizers/include plus the bundled libs its public headers pull in (nlohmann/json, re2, and re2's abseil dep). Symbols link from the prebuilt libexecutorch. Documented in third-party/README.md.
  • Demo: a dedicated apps/nlp example app with a Tokenizer screen that drives the full pipeline on device.

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

CI is TypeScript-only here (native isn't compiled in CI); yarn typecheck, root yarn lint, and yarn prepare (bob build) all pass.

To exercise the native tokenizer end-to-end via the demo app:

  1. Provision the ExecuTorch third-party artifacts into packages/react-native-executorch/third-party/ (see third-party/README.md). The existing PoC bundle already ships the llm/tokenizers extension (headers + symbols), so no rebuild is needed.
  2. yarn && cd apps/nlp && yarn ios — the iOS simulator works since the tokenizer is pure CPU (no GPU/Metal).
  3. Open the Tokenizer screen. It loads all-MiniLM-L6-v2 and auto-runs encode / decode / getVocabSize / idToToken / tokenToId, asserting that encode("Hello world") = [7592, 2088], decode round-trips to "hello world", getVocabSize() = 30522, and tokenToId(idToToken(id)) is the identity. Every assertion should read PASS (also logged as [TokenizerTest]).
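
The checks in step 3 could be expressed roughly as follows. This is a sketch, not the demo screen's actual code: `runTokenizerChecks`, `TokenizerUnderTest` and `CheckResult` are hypothetical names, the methods are assumed to be synchronous for brevity, and the expected values are the ones quoted above for all-MiniLM-L6-v2.

```typescript
// Minimal synchronous surface assumed for the checks (hypothetical typing).
interface TokenizerUnderTest {
  encode(text: string): number[];
  decode(ids: number[]): string;
  getVocabSize(): number;
  idToToken(id: number): string;
  tokenToId(token: string): number;
}

interface CheckResult {
  name: string;
  pass: boolean;
}

// Runs the checks listed in the testing instructions and reports each one
// separately, so a UI could render PASS / FAIL per row.
function runTokenizerChecks(tok: TokenizerUnderTest): CheckResult[] {
  const ids = tok.encode('Hello world');
  const encodeOk = ids.length === 2 && ids[0] === 7592 && ids[1] === 2088;
  const roundTripOk = ids.every((id) => tok.tokenToId(tok.idToToken(id)) === id);
  return [
    { name: 'encode', pass: encodeOk },
    { name: 'decode', pass: tok.decode(ids) === 'hello world' },
    { name: 'getVocabSize', pass: tok.getVocabSize() === 30522 },
    { name: 'idToToken/tokenToId identity', pass: roundTripOk },
  ];
}
```

Returning one result per check, rather than throwing on the first failure, keeps the remaining rows informative when a single expectation breaks.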

The screen genuinely drives the new code (useTokenizer → createTokenizer → the loadTokenizer JSI primitive → the native TokenizerHostObject / HFTokenizer), so a green run validates the whole pipeline, not just the types. Verified locally on iOS (iPhone 16 Pro Max, Xcode 26.5): all assertions pass.

The apps/nlp app is intentionally minimal and exists only to prove this pipeline; it can be dropped after approval if you'd prefer not to keep a demo app in-tree.

Screenshots

Related issues

#1248, part of #1208

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

The C++ mirrors the current TokenizerModule, which is backed by pytorch/tokenizers (tokenizers::HFTokenizer) from the ExecuTorch llm/tokenizers extension. This PR consumes the same headers (the third-party bundle under extension/llm/tokenizers) and prebuilt symbols (from libexecutorch); it does not use the tokenizers-cpp submodule. Tokenizer download currently uses the temporary react-native-fs-based useResourceDownload introduced in #1264 (to be replaced by the ResourceFetcher in #1253).

@msluszniak msluszniak marked this pull request as draft June 22, 2026 12:49
@msluszniak msluszniak self-assigned this Jun 22, 2026
@msluszniak msluszniak added the feature PRs that implement a new feature label Jun 22, 2026
@msluszniak msluszniak linked an issue Jun 22, 2026 that may be closed by this pull request
@msluszniak msluszniak force-pushed the @ms/issue1248-tokenizer branch 4 times, most recently from c5817d8 to f426882 Compare June 22, 2026 13:30
@msluszniak msluszniak force-pushed the @ms/issue1248-tokenizer branch from f426882 to 66dfb9d Compare June 22, 2026 17:10
@msluszniak msluszniak force-pushed the @ms/issue1248-tokenizer branch from 66dfb9d to d394a7e Compare June 22, 2026 18:47
@msluszniak msluszniak force-pushed the @ms/issue1248-tokenizer branch from d394a7e to 81d7ab1 Compare June 22, 2026 18:56
@msluszniak msluszniak marked this pull request as ready for review June 22, 2026 19:14
@msluszniak msluszniak requested a review from barhanc June 22, 2026 19:14

@barhanc barhanc left a comment

Contributor

I only went over the library implementation. Tomorrow I will take a look at the example app and test it.

Comment on lines +40 to +42
# pytorch/tokenizers headers (and the third-party libs they pull in:
# nlohmann/json, re2 and its abseil dependency) ship inside the ExecuTorch
# llm extension bundle

I don't think this comment is necessary.

Comment on lines +113 to +114
throw jsi::JSError(rt, "decode: Failed to decode tokens: error " +
std::to_string(static_cast<int32_t>(result.error())));

Same as in encode.

Comment on lines +166 to +167
throw jsi::JSError(rt, "idToToken: Failed to convert id to token: error " +
std::to_string(static_cast<int32_t>(result.error())));

Same as in encode / decode.

Comment on lines +198 to +199
throw jsi::JSError(rt, "tokenToId: Failed to convert token to id: error " +
std::to_string(static_cast<int32_t>(result.error())));

Same as in encode / decode.

Comment on lines +9 to +10
* All methods are synchronous and worklet-callable, mirroring the {@link Model}
* and {@link Tensor} primitives. For app-level usage prefer the asynchronous

The comment about mirroring Model/Tensor seems out of place for user facing API.

return tokenizer.tokenToId(token);
};

const dispose = () => tokenizer.dispose();

This should be the first function defined in the constructor, so that it is easy to see there are no native memory leaks.

@barhanc barhanc Jun 22, 2026

ops/ isn't a very good name for this directory. It's imo too generic. For CV it works because it contains operations for common CV transforms. Here in nlp we will have directories like llm/ for LLM related stuff, privacy-filter and so on, so this should imo either be in root nlp/ as a standalone tokenizer.ts or nested in nlp/tokenizer/tokenizer.ts.

Comment on lines +67 to +71
encodeWorklet,
decodeWorklet,
getVocabSizeWorklet,
idToTokenWorklet,
tokenToIdWorklet,

I don't really see the point of exposing both the async and worklet functions. See my file comment, but if we want to leave this as a task, then imo we should just return encode, decode as async functions and the rest as sync.

auto result = self->tokenizer_->encode(text, kNumAddedBosTokens, kNumAddedEosTokens);
if (!result.ok()) {
throw jsi::JSError(rt, "encode: Failed to encode input: error " +
std::to_string(static_cast<int32_t>(result.error())));

Please use std::string errorMsg = executorch::runtime::to_string(result.error()); in the JS exception string instead of the raw error code.

@barhanc barhanc Jun 22, 2026

The problem I see with this file is that there is really no orchestration logic here, no task per se; it's just a boilerplate wrapper over ops/tokenizer.ts. Now, if we want to be 100% functionally backward-compatible with our old API and must expose a useTokenizer hook, then imo this file should basically be something like this (because this is all we need for the hook):

export async function createTokenizer(config: TokenizerConfig, runtime?: WorkletRuntime) {
  const { tokenizerPath } = config;
  const tokenizer = await wrapAsync(loadTokenizer, runtime)(tokenizerPath);
  const dispose = () => tokenizer.dispose();
  return {
    encode: wrapAsync(tokenizer.encode, runtime),
    decode: wrapAsync(tokenizer.decode, runtime),
    getVocabSize: tokenizer.getVocabSize,
    idToToken: tokenizer.idToToken,
    tokenToId: tokenizer.tokenToId,
    dispose,
  };
}

But I'm questioning whether this is even needed at all. The ops/tokenizer.ts is the abstraction we need for implementing stuff like embeddings, privacy-filter, etc. (the pattern is similar to the LLMRunner use in the LLMChat task in PoC). Additionally, the tasks should be whole pipelines that typical users can just take and do something end-to-end with. It's hard to imagine why a typical user who only uses the hooks API would want a bare tokenizer without anything else, and for power users who create their own pipelines there is still ops/tokenizer.ts, which is not internal-only.

I'm fine with both approaches but if we decide to keep the tokenization as a task, then let's only do it as this very minimal wrapper above, and in end-to-end tasks like embeddings / privacy-filter let's use the ops/tokenizer.ts abstraction. Also small nit, the task should be called 'tokenization'.
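
To make the shape of that minimal wrapper concrete, here is a self-contained sketch that runs outside React Native. The implementation of wrapAsync is not shown in this thread, so `deferAsync` below is a hypothetical stand-in that only defers a synchronous call to a microtask and ignores the worklet runtime; `SyncTokenizer`, `loadToyTokenizer` and `createTokenizerSketch` are likewise invented for the example.

```typescript
// Synchronous tokenizer surface assumed for this sketch (hypothetical).
interface SyncTokenizer {
  encode(text: string): number[];
  decode(ids: number[]): string;
  getVocabSize(): number;
  dispose(): void;
}

// Hypothetical stand-in for wrapAsync: defers a synchronous call to a
// microtask and turns a thrown error into a rejected promise. The helper
// used in the snippet above also takes a worklet runtime; that is omitted.
function deferAsync<A extends unknown[], R>(
  fn: (...args: A) => R,
): (...args: A) => Promise<R> {
  return (...args: A) => Promise.resolve().then(() => fn(...args));
}

// Toy loader over a fixed vocabulary; methods are closures, so they can be
// passed around unbound. Unknown tokens map to -1 and unknown ids to [UNK].
function loadToyTokenizer(vocab: string[]): SyncTokenizer {
  let alive = true;
  const check = (): void => {
    if (!alive) throw new Error('tokenizer disposed');
  };
  return {
    encode: (text) => {
      check();
      return text.split(' ').map((token) => vocab.indexOf(token));
    },
    decode: (ids) => {
      check();
      return ids.map((id) => vocab[id] ?? '[UNK]').join(' ');
    },
    getVocabSize: () => {
      check();
      return vocab.length;
    },
    dispose: () => {
      alive = false;
    },
  };
}

async function createTokenizerSketch(vocab: string[]) {
  const tokenizer = await deferAsync(loadToyTokenizer)(vocab);
  // dispose is declared first so the cleanup path is easy to spot.
  const dispose = () => tokenizer.dispose();
  return {
    dispose,
    encode: deferAsync(tokenizer.encode),
    decode: deferAsync(tokenizer.decode),
    getVocabSize: tokenizer.getVocabSize,
  };
}
```

In this arrangement only encode and decode return promises, while getVocabSize stays synchronous, which matches the split proposed in the comments above.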


namespace rnexecutorch::extensions::nlp {
void install(facebook::jsi::Runtime &rt, facebook::jsi::Object &module) {
tokenizer::install_loadTokenizer(rt, module);

@barhanc barhanc Jun 22, 2026

These should be installed under nlp submodule (same pattern as CV)

void install(facebook::jsi::Runtime &rt, facebook::jsi::Object &module) {
    jsi::Object nlpModule = jsi::Object(rt);
    // ...
    module.setProperty(rt, "nlp", nlpModule);
}

*/
export function loadTokenizer(tokenizerPath: string): Tokenizer {
'worklet';
return rnexecutorchJsi.loadTokenizer(tokenizerPath) as Tokenizer;

The loadTokenizer method should be under rnexecutorchJsi.nlp.
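
Under that suggestion, the TypeScript binding would read the primitive from an nlp sub-object rather than from the top level. A hedged sketch follows, with the JSI global replaced by a plain object so it can run outside React Native; `JsiRoot`, `NlpNamespace`, `TokenizerHandle`, `makeLoadTokenizer` and `fakeJsi` are hypothetical, and the real typings may differ.

```typescript
// Hypothetical, trimmed-down typings for the namespaced layout.
interface TokenizerHandle {
  getVocabSize(): number;
  dispose(): void;
}

interface NlpNamespace {
  loadTokenizer(tokenizerPath: string): TokenizerHandle;
}

interface JsiRoot {
  nlp: NlpNamespace;
}

// Builds a loadTokenizer function that goes through jsi.nlp. In the real
// module the inner function would also carry the 'worklet' directive.
function makeLoadTokenizer(jsi: JsiRoot) {
  return function loadTokenizer(tokenizerPath: string): TokenizerHandle {
    return jsi.nlp.loadTokenizer(tokenizerPath);
  };
}

// Fake root object used only to exercise the lookup path; the vocabulary
// size is derived from the path length purely so the call is observable.
const fakeJsi: JsiRoot = {
  nlp: {
    loadTokenizer: (tokenizerPath) => ({
      getVocabSize: () => tokenizerPath.length,
      dispose: () => {},
    }),
  },
};
```

Injecting the root object keeps the lookup testable without the native module being installed.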


Labels

feature (PRs that implement a new feature), refactoring

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RNE Rewrite] Add tokenizer pipeline implementation

2 participants