[RNE Rewrite] feat: add tokenizer pipeline (#1248)#1274
Conversation
c5817d8 to
f426882
Compare
f426882 to
66dfb9d
Compare
66dfb9d to
d394a7e
Compare
d394a7e to
81d7ab1
Compare
barhanc
left a comment
There was a problem hiding this comment.
I only went over the library implementation. Tomorrow I will take a look at the example app and test it.
| # pytorch/tokenizers headers (and the third-party libs they pull in: | ||
| # nlohmann/json, re2 and its abseil dependency) ship inside the ExecuTorch | ||
| # llm extension bundle |
There was a problem hiding this comment.
I don't think this comment is necessary.
| throw jsi::JSError(rt, "decode: Failed to decode tokens: error " + | ||
| std::to_string(static_cast<int32_t>(result.error()))); |
| throw jsi::JSError(rt, "idToToken: Failed to convert id to token: error " + | ||
| std::to_string(static_cast<int32_t>(result.error()))); |
There was a problem hiding this comment.
Same as in encode / decode.
| throw jsi::JSError(rt, "tokenToId: Failed to convert token to id: error " + | ||
| std::to_string(static_cast<int32_t>(result.error()))); |
There was a problem hiding this comment.
Same as in encode / decode.
| * All methods are synchronous and worklet-callable, mirroring the {@link Model} | ||
| * and {@link Tensor} primitives. For app-level usage prefer the asynchronous |
There was a problem hiding this comment.
The comment about mirroring Model/Tensor seems out of place for user facing API.
| return tokenizer.tokenToId(token); | ||
| }; | ||
|
|
||
| const dispose = () => tokenizer.dispose(); |
There was a problem hiding this comment.
This should be the first function defined in constructor, so it is easy to see that there are no native memory leaks.
There was a problem hiding this comment.
ops/ isn't a very good name for this directory. It's imo too generic. For CV it works because it contains operations for common CV transforms. Here in nlp we will have directories like llm/ for LLM related stuff, privacy-filter and so on, so this should imo either be in root nlp/ as a standalone tokenizer.ts or nested in nlp/tokenizer/tokenizer.ts.
| encodeWorklet, | ||
| decodeWorklet, | ||
| getVocabSizeWorklet, | ||
| idToTokenWorklet, | ||
| tokenToIdWorklet, |
There was a problem hiding this comment.
I don't really see the point of exposing both the async and worklet functions. See my file comment, but if we want to leave this as a task, then imo we should just return encode, decode as async functions and the rest as sync.
| auto result = self->tokenizer_->encode(text, kNumAddedBosTokens, kNumAddedEosTokens); | ||
| if (!result.ok()) { | ||
| throw jsi::JSError(rt, "encode: Failed to encode input: error " + | ||
| std::to_string(static_cast<int32_t>(result.error()))); |
There was a problem hiding this comment.
Please use std::string errorMsg = executorch::runtime::to_string(result.error()); in JS exception string instead of error code.
There was a problem hiding this comment.
The problem I see with this file is that there is really no orchestration logic here, no task per se---it's just a boilerplate wrapper over ops/tokenizer.ts. Now, if we want to be 100% functionally backward-compatible with our old API and must expose a hook useTokenizer then imo this file should just be basically something like this (because this is all we need for hook)
export async function createTokenizer(config: TokenizerConfig, runtime?: WorkletRuntime) {
const { tokenizerPath } = config;
const tokenizer = await wrapAsync(loadTokenizer, runtime)(tokenizerPath);
const dispose = () => tokenizer.dispose();
return {
encode: wrapAsync(tokenizer.encode, runtime),
decode: wrapAsync(tokenizer.decode, runtime),
getVocabSize: tokenizer.getVocabSize,
idToToken: tokenizer.idToToken,
tokenToId: tokenizer.tokenToId,
dispose,
};
}But I'm questioning if this is even needed at all. The ops/tokenizer.ts is the abstraction we need for implementing stuff like embeddings, privacy-filter, etc. (the pattern is similar to the LLMRunner use in the LLMChat task in PoC). Additionally, the tasks should be like whole pipelines that typical users can just take and do something end-to-end with. It's hard to imagine why would a typical user who only uses the hooks API want a bare tokenizer without anything else, and for power users who create their own pipelines there is still ops/tokenizer.ts, it is not internal-only.
I'm fine with both approaches but if we decide to keep the tokenization as a task, then let's only do it as this very minimal wrapper above, and in end-to-end tasks like embeddings / privacy-filter let's use the ops/tokenizer.ts abstraction. Also small nit, the task should be called 'tokenization'.
|
|
||
| namespace rnexecutorch::extensions::nlp { | ||
| void install(facebook::jsi::Runtime &rt, facebook::jsi::Object &module) { | ||
| tokenizer::install_loadTokenizer(rt, module); |
There was a problem hiding this comment.
These should be installed under nlp submodule (same pattern as CV)
void install(facebook::jsi::Runtime &rt, facebook::jsi::Object &module) {
jsi::Object nlpModule = jsi::Object(rt);
// ...
module.setProperty(rt, "nlp", nlpModule);
}| */ | ||
| export function loadTokenizer(tokenizerPath: string): Tokenizer { | ||
| 'worklet'; | ||
| return rnexecutorchJsi.loadTokenizer(tokenizerPath) as Tokenizer; |
There was a problem hiding this comment.
The loadTokenizer method should be under rnexecutorchJsi.nlp.
Description
Adds the tokenizer pipeline (issue #1248) using the new worklet-based architecture, with functional parity to the current
TokenizerModule.A new
nlpextension exposes aloadTokenizerJSI primitive (top-level on__rnexecutorch_jsi__, likeloadModel) returning aTokenizerhost object backed bytokenizers::HFTokenizer. On top of it sits acreateTokenizer(config, runtime?)async factory (async +*Workletvariants +dispose) and auseTokenizerhook. Methods:encode,decode,getVocabSize,idToToken,tokenToId— same semantics as today (special tokens follow thetokenizer.jsonpost_processor). The*Workletvariants let an upcoming text-embeddings task tokenize → build tensors → run forward within a single worklet.cpp/extensions/nlp/{tokenizer,install}.{h,cpp}, wired intoRnExecutorch.cpp.src/extensions/nlp/{ops,tasks}/tokenizer.ts,src/hooks/useTokenizer.ts, exports inindex.ts, examplemodels.tokenizer.ALL_MINILM_L6_V2.android/CMakeLists.txtand the podspec —pytorch/tokenizers/includeplus the bundled libs its public headers pull in (nlohmann/json,re2, and re2'sabseildep). Symbols link from the prebuiltlibexecutorch. Documented inthird-party/README.md.apps/nlpexample app with a Tokenizer screen that drives the full pipeline on device.Introduces a breaking change?
Type of change
Tested on
Testing instructions
CI is TypeScript-only here (native isn't compiled in CI);
yarn typecheck, rootyarn lint, andyarn prepare(bob build) all pass.To exercise the native tokenizer end-to-end via the demo app:
third-partyartifacts intopackages/react-native-executorch/third-party/(seethird-party/README.md). The existing PoC bundle already ships the llm/tokenizers extension (headers + symbols), so no rebuild is needed.yarn && cd apps/nlp && yarn ios— the iOS simulator works since the tokenizer is pure CPU (no GPU/Metal).all-MiniLM-L6-v2and auto-runsencode/decode/getVocabSize/idToToken/tokenToId, asserting:encode("Hello world")=[7592, 2088],decoderound-trips to"hello world",getVocabSize()=30522, andtokenToId(idToToken(id))is the identity. All three assertions should read PASS (also logged as[TokenizerTest]).The screen genuinely drives the new code —
useTokenizer→createTokenizer→ theloadTokenizerJSI primitive → the nativeTokenizerHostObject/HFTokenizer— so a green run validates the whole pipeline, not just the types. Verified locally on iOS (iPhone 16 Pro Max, Xcode 26.5): all assertions pass.The
apps/nlpapp is intentionally minimal and exists only to prove this pipeline; it can be dropped after approval if you'd prefer not to keep a demo app in-tree.Screenshots
Related issues
#1248, part of #1208
Checklist
Additional notes
The C++ mirrors the current
TokenizerModule, which is backed bypytorch/tokenizers(tokenizers::HFTokenizer) from the ExecuTorch llm/tokenizers extension. This PR consumes the same headers (the third-party bundle underextension/llm/tokenizers) and prebuilt symbols (fromlibexecutorch); it does not use thetokenizers-cppsubmodule. Tokenizer download currently uses the temporaryreact-native-fs-baseduseResourceDownloadintroduced in #1264 (to be replaced by the ResourceFetcher in #1253).