Add public SentencePieceTokenizer factory methods for Unigram from vocab list and tokenizer.json #7625
Copilot wants to merge 5 commits into
…erJson APIs Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
ericstj left a comment
Thanks for adding this — the in-memory Create(vocab, ...) overload is clean and the ID-preservation (JSON vocab index = token id) is the right call. I implemented this same JSON-only Unigram capability recently against real Hugging Face models, and hit two correctness issues that this PR's single test model happens to mask. Details inline; summary here.
Bugs (untested by the current suite)
- BOS/EOS positional fallback corrupts real pieces. FindSpecialTokenId(pieces, "<s>", 1) / ("</s>", 2) fall back to positions 1/2 when the vocab has no piece literally named <s> / </s>. Many HF Unigram tokenizers don't use those names, e.g. minishlab/potion-multilingual-128M (bge-m3 family) has unk_id=1, vocab[0]="[PAD]", [1]="[UNK]", [2]=",". There eosId→2 marks "," as Control and drops it from the Viterbi trie (it can never be emitted), and bosId→1 collides with unkId and clobbers the unknown entry. This is structural (independent of addBos/addEos).
- Normalizer steps beyond Precompiled are silently dropped. ExtractPrecompiledCharsMap extracts only the charsmap and discards sibling normalizers. Real Unigram models often have a richer chain (potion/bge-m3: Sequence[Precompiled, Replace(punctuation spacing), Replace("\\s+"->" "), Strip]), which SentencePieceNormalizer cannot reproduce, so CreateFromTokenizerJson silently yields different tokens than HF. Since the charsmap must run before those Replace steps, they can't just be reordered into SP; at minimum this should throw on unrecognized normalizer types rather than silently ignore them.
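The first bug above comes down to how a missing special token is reported. A minimal sketch, assuming a hypothetical helper named SpecialTokenLookup.FindByName (not the PR's FindSpecialTokenId) and a made-up three-piece vocab shaped like the potion example: the lookup is by name only and returns -1 when the piece is absent, so no ordinary piece is ever reclassified by position.

```csharp
using System;
using System.Collections.Generic;

// Made-up vocab shaped like the potion example: no "<s>" or "</s>" pieces.
// The scores are placeholders, not values from any real model.
var pieces = new List<(string Piece, float Score)>
{
    ("[PAD]", 0f), ("[UNK]", 0f), (",", -3.5f),
};

int eosId = SpecialTokenLookup.FindByName(pieces, "</s>");
Console.WriteLine(eosId < 0
    ? "no EOS piece; ',' stays an ordinary piece"
    : $"EOS id = {eosId}");

static class SpecialTokenLookup
{
    // Hypothetical helper: index of the piece named `name`, or -1 when the vocab
    // has no such piece. It never falls back to a fixed position.
    public static int FindByName(IReadOnlyList<(string Piece, float Score)> pieces, string name)
    {
        for (int i = 0; i < pieces.Count; i++)
        {
            if (string.Equals(pieces[i].Piece, name, StringComparison.Ordinal))
            {
                return i;
            }
        }

        return -1;
    }
}
```

With a result of -1 the caller can decide explicitly what "no EOS" means, instead of position 2 being treated as a control token.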
Why the test stays green
Paraphrase-multilingual-MiniLM-L12-v2 names its specials (<s>=0, </s>=2) so the fallback never fires, and its normalizer is a single Precompiled, so the dropped-sibling path is never hit. Recommend adding fixtures that (a) place specials at non-conventional positions / omit <s>/</s>, and (b) use a Sequence normalizer with Replace/Strip, asserting against HF reference ids.
Minor
- removeExtraWhitespaces is hard-coded true in both factories rather than derived from the JSON.
- added_tokens from the JSON aren't auto-wired; correctness depends on the caller passing specialTokens. Worth documenting or reading them.
(Posting as comments only — not an approval or change request.)
|
@copilot please address feedback |
…end_scheme handling Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
|
@copilot address feedback |
…tion, and add tests Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #7625 +/- ##
==========================================
+ Coverage 69.59% 69.65% +0.05%
==========================================
Files 1484 1484
Lines 273606 274510 +904
Branches 27949 28065 +116
==========================================
+ Hits 190410 191198 +788
- Misses 75832 75916 +84
- Partials 7364 7396 +32
|
Pull request overview
This PR extends SentencePieceTokenizer to support Hugging Face JSON-only Unigram tokenizers by adding new public factory APIs that can construct a Unigram tokenizer from either an in-memory vocab list or a tokenizer.json stream, avoiding the current requirement for a SentencePiece .model protobuf.
Changes:
- Add SentencePieceTokenizer.Create(IEnumerable<(string Piece, float Score)> vocab, ...) for constructing a Unigram tokenizer directly from a vocab list.
- Add SentencePieceTokenizer.CreateFromTokenizerJson(Stream tokenizerJsonStream, ...) for parsing HF tokenizer.json (Unigram) including vocab, unk_id, precompiled charsmap, and Metaspace settings.
- Add internal constructors/refactoring to build a SentencePieceUnigramModel from vocab pieces and config values, plus new tests covering these creation paths.
Show a summary per file
| File | Description |
|---|---|
| test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs | Adds unit tests for vocab-based and tokenizer.json-based Unigram construction and behavior parity checks. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs | Adds Unigram model constructors that build vocab/trie from (piece, score) inputs and detect special tokens by name. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs | Adds public factories for vocab and tokenizer.json, plus JSON parsing helpers for normalizer/pre-tokenizer extraction. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs | Adds a new base-model constructor taking explicit config/token IDs instead of ModelProto. |
Copilot's findings
- Files reviewed: 4/4 changed files
- Comments generated: 4
|
The new JSON parser never reads the byte_fallback flag; the base constructor call hardcodes byteFallback: false:
```csharp
: base(addBos, addEos,
       ...,
       addDummyPrefix, escapeWhiteSpaces, treatWhitespaceAsSuffix, byteFallback: false,
       precompiledCharsmap, removeExtraWhitespaces, specialTokens)
```
Hugging Face Unigram tokenizers serialize this as a top-level property of the model object:
```json
"model": {
  "type": "Unigram",
  "unk_id": 0,
  "byte_fallback": true,
  "vocab": [ ... ]
}
```
For any model that enables byte fallback, hardcoding byteFallback: false ignores that setting.
Suggested fix: read the flag in the JSON parser:
```csharp
bool byteFallback = modelElement.TryGetProperty("byte_fallback", out JsonElement bf) && bf.GetBoolean();
```
defaulting to false when the property is absent.
|
The parser determines special tokens in two ways: it auto-detects only the literal strings
A typical
```json
"added_tokens": [
  { "id": 0, "content": "<s>", "special": true },
  { "id": 1, "content": "<pad>", "special": true },
  { "id": 2, "content": "</s>", "special": true },
  { "id": 3, "content": "<unk>", "special": true },
  { "id": 250001, "content": "<mask>", "special": true }
]
```
Two consequences:
Suggested direction: read
|
The normalizer walk handles only Precompiled and Sequence; any other type reaches the final branch:
```csharp
else
{
    throw new NotSupportedException($"Normalizer type '{type}' is not supported. Only 'Precompiled' and 'Sequence' normalizers are supported.");
}
```
Because the same check applies to the children of a Sequence, a chain like the following is rejected:
```json
"normalizer": {
  "type": "Sequence",
  "normalizers": [
    { "type": "Nmt" },
    { "type": "Precompiled", "precompiled_charsmap": "..." },
    { "type": "Replace", "pattern": { "Regex": " {2,}" }, "content": " " }
  ]
}
```
With this input the method throws on the first unsupported step (Nmt). The correct behavior here is to not throw: when walking a Sequence, unmodeled sibling steps should be tolerated.
Should we have a tracking issue to support the normalizers that we currently do not support?
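A tolerant walk of the kind discussed here can be sketched with System.Text.Json. This is an illustration only: NormalizerWalk.FindPrecompiledCharsmap is a hypothetical name, not the PR's ExtractPrecompiledCharsMap, the "AAAA" charsmap is a placeholder, and the raw string literal assumes C# 11 or later.

```csharp
using System;
using System.Text.Json;

// Placeholder normalizer section; "AAAA" stands in for a real base64 charsmap.
string json = """
{
  "type": "Sequence",
  "normalizers": [
    { "type": "Nmt" },
    { "type": "Precompiled", "precompiled_charsmap": "AAAA" },
    { "type": "Replace", "pattern": { "Regex": " {2,}" }, "content": " " }
  ]
}
""";

using JsonDocument doc = JsonDocument.Parse(json);
string? charsmap = NormalizerWalk.FindPrecompiledCharsmap(doc.RootElement);
Console.WriteLine(charsmap ?? "(no Precompiled step found)");

static class NormalizerWalk
{
    // Hypothetical helper: returns the first precompiled_charsmap found, descending
    // into Sequence children and skipping every other normalizer type.
    public static string? FindPrecompiledCharsmap(JsonElement normalizer)
    {
        if (normalizer.ValueKind != JsonValueKind.Object ||
            !normalizer.TryGetProperty("type", out JsonElement typeElement))
        {
            return null;
        }

        string? type = typeElement.GetString();
        if (type == "Precompiled")
        {
            return normalizer.TryGetProperty("precompiled_charsmap", out JsonElement map)
                ? map.GetString()
                : null;
        }

        if (type == "Sequence" &&
            normalizer.TryGetProperty("normalizers", out JsonElement children) &&
            children.ValueKind == JsonValueKind.Array)
        {
            foreach (JsonElement child in children.EnumerateArray())
            {
                string? found = FindPrecompiledCharsmap(child);
                if (found is not null)
                {
                    return found;
                }
            }
        }

        // Nmt, Replace, Strip and any other type are skipped rather than rejected.
        return null;
    }
}
```

Skipping siblings keeps such models loadable, at the cost raised earlier in the review: any step that is skipped and not reproduced elsewhere can make the output differ from Hugging Face.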
|
Hugging Face For models where extra whitespace should be preserved, this changes normalization and therefore the resulting tokens. Two reasonable options:
|
|
```json
"post_processor": {
  "type": "TemplateProcessing",
  "single": [
    { "SpecialToken": { "id": "<s>", "type_id": 0 } },
    { "Sequence": { "id": "A", "type_id": 0 } },
    { "SpecialToken": { "id": "</s>", "type_id": 0 } }
  ],
  "special_tokens": {
    "<s>": { "id": "<s>", "ids": [0], "tokens": ["<s>"] },
    "</s>": { "id": "</s>", "ids": [2], "tokens": ["</s>"] }
  }
}
```
This template is the source of truth for which special tokens wrap the input, in what order, and with which IDs. By ignoring it, the loader cannot tell whether a model wraps with BOS/EOS by default, and for models whose BOS/EOS are not literally
Suggested direction: when present, read
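To make the template's role concrete, here is a rough sketch of turning the "single" entry of a TemplateProcessing post-processor into ordered prefix and suffix ID lists. TemplateSketch.ReadSingle is a hypothetical helper, not the PR's implementation; it resolves IDs only through special_tokens, handles only TemplateProcessing, and assumes C# 11 for the raw string literal.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

string json = """
{
  "type": "TemplateProcessing",
  "single": [
    { "SpecialToken": { "id": "<s>", "type_id": 0 } },
    { "Sequence": { "id": "A", "type_id": 0 } },
    { "SpecialToken": { "id": "</s>", "type_id": 0 } }
  ],
  "special_tokens": {
    "<s>": { "id": "<s>", "ids": [0], "tokens": ["<s>"] },
    "</s>": { "id": "</s>", "ids": [2], "tokens": ["</s>"] }
  }
}
""";

using JsonDocument doc = JsonDocument.Parse(json);
var (prefix, suffix) = TemplateSketch.ReadSingle(doc.RootElement);
Console.WriteLine($"prefix=[{string.Join(",", prefix)}] suffix=[{string.Join(",", suffix)}]");

static class TemplateSketch
{
    // Hypothetical helper: special tokens before the Sequence item become the prefix,
    // those after it become the suffix, each in template order.
    public static (List<int> Prefix, List<int> Suffix) ReadSingle(JsonElement processor)
    {
        JsonElement specials = processor.GetProperty("special_tokens");
        var prefix = new List<int>();
        var suffix = new List<int>();
        bool seenSequence = false;

        foreach (JsonElement item in processor.GetProperty("single").EnumerateArray())
        {
            if (item.TryGetProperty("Sequence", out JsonElement _))
            {
                if (seenSequence)
                {
                    // A second Sequence would leave special tokens mid-sequence.
                    throw new NotSupportedException("More than one Sequence item in 'single'.");
                }

                seenSequence = true;
                continue;
            }

            string name = item.GetProperty("SpecialToken").GetProperty("id").GetString()!;
            List<int> target = seenSequence ? suffix : prefix;
            foreach (JsonElement id in specials.GetProperty(name).GetProperty("ids").EnumerateArray())
            {
                target.Add(id.GetInt32());
            }
        }

        return (prefix, suffix);
    }
}
```

Keeping the two lists ordered is what lets a template with several leading or trailing special tokens be represented without special-casing BOS and EOS.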
|
@tarekgh I did some investigation: post_processor is more involved than it looks, but we need it for a complete story here. Of 57 popular Hugging Face models that use SentencePiece, 51 use post_processor. 37 of those are simple, with at most one prefix token and one suffix token, and 14 wrap with multiple tokens. I'm spiking out a proper
…ost-processor support
Support Hugging Face tokenizer.json post-processor templates that wrap a sequence with
multiple special tokens (XLNet/mBART/NLLB styles) by generalizing the single BOS/EOS
handling into ordered prefix/suffix special-token lists, applied consistently across the
encode/count/index paths.
CreateFromTokenizerJson now:
- parses post_processor (TemplateProcessing, RobertaProcessing, BertProcessing, Sequence)
to resolve the prefix/suffix wrapping, resolving ids via special_tokens -> added_tokens -> vocab.
- reads model.byte_fallback and exposes byteFallback on Create.
- treats added_tokens (special) as the authoritative special-token source.
- tolerates unmodeled sibling steps inside a Sequence normalizer.
- deduces remove_extra_whitespaces from the normalizer Strip/Replace(" {2,}") steps instead
of hardcoding it, matching the Hugging Face fast-tokenizer runtime.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Design: https://gist.github.com/ericstj/648348b68b28622b90071dfdae6fe932
What changed
Generalized SentencePiece's single BOS/EOS handling into ordered prefix/suffix special-token lists, applied consistently across the encode/count/index paths. This lets
The JSON loader now:
Addressing @tarekgh's feedback
All six comments are resolved:
Performance
No regression: the prefix/suffix-list refactor is allocation-neutral on every existing path (byte-for-byte identical allocations vs the pre-change baseline across encode/count/index × affix × size). Timing deltas are within run-to-run noise on a dev laptop. https://gist.github.com/ericstj/c2c80feddab1a9cd122f34de07b9df28#file-perf-report-md
Out of scope (follow-ups)
Pair-sequence encoding +
```csharp
ByteCodeToIdOffset = _vocab.TryGetValue("<0x00>", out int id) ? id : MaxByteId;
OneByteUtf8EncodingMaxId = ByteCodeToIdOffset + 0x7F;
MaxIdByteFallbackId = ByteCodeToIdOffset + 0xFF;
```
|
|
```csharp
/// <c>remove_extra_whitespaces</c> has no direct representation in <c>tokenizer.json</c> and is assumed to be
/// <see langword="true"/>. Pair-sequence templates and per-token <c>type_id</c>s are not applied. Templates that
/// place a special token in the middle of the sequence are rejected with <see cref="NotSupportedException"/>.
```
SentencePieceTokenizer only exposed Create(Stream) requiring a SentencePiece protobuf (.model), making it impossible to load Hugging Face JSON-only Unigram tokenizers that have no .model file.
New public APIs
From in-memory vocab:
From tokenizer.json:
CreateFromTokenizerJson reads model.vocab, model.unk_id, extracts precompiled_charsmap from a Precompiled or Sequence normalizer, and reads Metaspace pre-tokenizer settings (add_prefix_space, replacement, prepend_scheme). It validates model.type == "Unigram".
Internal changes
- SentencePieceBaseModel: new constructor taking individual config parameters instead of ModelProto
- SentencePieceUnigramModel: new constructors building vocab from IReadOnlyList<(string, float)>; BOS/EOS/PAD IDs auto-detected by piece name (<s>, </s>, <pad>) with SentencePiece-conventional positional fallbacks
Note on token IDs
HF tokenizer.json typically uses a different special-token ordering than the SentencePiece protobuf (e.g. <s>=0, <pad>=1, </s>=2, <unk>=3 vs. <unk>=0, <s>=1, </s>=2). Piece strings produced are identical; numeric IDs will differ by the vocab offset introduced by the extra special tokens.
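The ID shift can be seen with a toy example. The sketch below uses the two special-token orderings quoted above and two made-up regular pieces ("▁hello", "▁world"); the vocab contents are illustrative and not taken from any real model.

```csharp
using System;

// Special-token orderings from the note; the two regular pieces are invented.
string[] hfOrder = { "<s>", "<pad>", "</s>", "<unk>", "▁hello", "▁world" };
string[] spOrder = { "<unk>", "<s>", "</s>", "▁hello", "▁world" };

foreach (string piece in new[] { "▁hello", "▁world" })
{
    int hfId = Array.IndexOf(hfOrder, piece);   // id = position in tokenizer.json vocab
    int spId = Array.IndexOf(spOrder, piece);   // id = position in the .model vocab
    Console.WriteLine($"{piece}: tokenizer.json id {hfId}, .model id {spId}, offset {hfId - spId}");
}
```

Here the one extra special token in the JSON ordering shifts every regular piece by 1, so comparisons between the two loaders are safer on piece strings than on raw IDs.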