Add public SentencePieceTokenizer factory methods for Unigram from vocab list and tokenizer.json by Copilot · Pull Request #7625 · dotnet/machinelearning

Copilot · 2026-06-10T23:15:45Z

SentencePieceTokenizer only exposed Create(Stream) requiring a SentencePiece protobuf (.model), making it impossible to load Hugging Face JSON-only Unigram tokenizers that have no .model file.

New public APIs

From in-memory vocab:

SentencePieceTokenizer.Create(
    IEnumerable<(string Piece, float Score)> vocab,
    int unkId,
    bool addBeginningOfSentence = true,
    bool addEndOfSentence = false,
    ReadOnlySpan<byte> precompiledCharsMap = default,
    bool addDummyPrefix = true,
    bool escapeWhiteSpaces = true,
    bool treatWhitespaceAsSuffix = false,
    IReadOnlyDictionary<string, int>? specialTokens = null)

From tokenizer.json:

SentencePieceTokenizer.CreateFromTokenizerJson(
    Stream tokenizerJsonStream,
    bool addBeginningOfSentence = true,
    bool addEndOfSentence = false,
    IReadOnlyDictionary<string, int>? specialTokens = null)

CreateFromTokenizerJson reads model.vocab and model.unk_id, extracts precompiled_charsmap from a Precompiled or Sequence normalizer, and reads the Metaspace pre-tokenizer settings (add_prefix_space, replacement, prepend_scheme). It validates that model.type == "Unigram".
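As a rough illustration of which tokenizer.json fields are involved, here is a minimal Python sketch (not the PR's C# code). The function name read_unigram_fields and the returned dictionary keys are hypothetical, and the sketch assumes the Metaspace pre-tokenizer sits at the top level of pre_tokenizer rather than inside a Sequence.

```python
import json
from typing import Any, Dict


def read_unigram_fields(tokenizer_json: str) -> Dict[str, Any]:
    """Collect the tokenizer.json fields that the description says the loader reads."""
    root = json.loads(tokenizer_json)
    model = root["model"]
    if model.get("type") != "Unigram":
        raise ValueError(f"expected a Unigram model, got {model.get('type')!r}")

    # model.vocab is a list of [piece, score] pairs; the list index is the token id.
    vocab = [(piece, float(score)) for piece, score in model["vocab"]]

    # Assumption: Metaspace is the top-level pre-tokenizer. Absent keys stay None.
    pre = root.get("pre_tokenizer") or {}
    metaspace = pre if pre.get("type") == "Metaspace" else {}

    return {
        "vocab": vocab,
        "unk_id": model.get("unk_id"),
        "replacement": metaspace.get("replacement"),
        "prepend_scheme": metaspace.get("prepend_scheme"),
        "add_prefix_space": metaspace.get("add_prefix_space"),
    }
```

Because the vocab list index doubles as the token id, the sketch keeps the list order untouched.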

Internal changes

SentencePieceBaseModel: new constructor taking individual config parameters instead of ModelProto
SentencePieceUnigramModel: new constructors building vocab from IReadOnlyList<(string, float)>; BOS/EOS/PAD IDs auto-detected by piece name (<s>, </s>, <pad>) with SentencePiece-conventional positional fallbacks

Note on token IDs

HF tokenizer.json typically uses a different special-token ordering than the SentencePiece protobuf (e.g. <s>=0, <pad>=1, </s>=2, <unk>=3 vs. <unk>=0, <s>=1, </s>=2). Piece strings produced are identical; numeric IDs will differ by the vocab offset introduced by the extra special tokens.


ericstj

Thanks for adding this — the in-memory Create(vocab, ...) overload is clean and the ID-preservation (JSON vocab index = token id) is the right call. I implemented this same JSON-only Unigram capability recently against real Hugging Face models, and hit two correctness issues that this PR's single test model happens to mask. Details inline; summary here.

Bugs (untested by the current suite)

BOS/EOS positional fallback corrupts real pieces. FindSpecialTokenId(pieces, "<s>", 1) / ("</s>", 2) fall back to positions 1/2 when the vocab has no piece literally named <s>/</s>. Many HF Unigram tokenizers don't use those names — e.g. minishlab/potion-multilingual-128M (bge-m3 family) has unk_id=1, vocab [0]="[PAD]", [1]="[UNK]", [2]=",". There eosId→2 marks "," as Control and drops it from the Viterbi trie (it can never be emitted), and bosId→1 collides with unkId and clobbers the unknown entry. This is structural (independent of addBos/addEos).
Normalizer steps beyond Precompiled are silently dropped. ExtractPrecompiledCharsMap extracts only the charsmap and discards sibling normalizers. Real Unigram models often have a richer chain (potion/bge-m3: Sequence[Precompiled, Replace(punctuation spacing), Replace("\\s+"->" "), Strip]), which SentencePieceNormalizer cannot reproduce — so CreateFromTokenizerJson silently yields different tokens than HF. Since the charsmap must run before those Replace steps, they can't just be reordered into SP; at minimum this should throw on unrecognized normalizer types rather than silently ignore them.
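The positional-fallback problem can be reproduced with a few lines of Python. This is a sketch only: find_with_positional_fallback is a hypothetical stand-in for the reviewed FindSpecialTokenId behavior, and find_by_name_only is one possible alternative, not necessarily the fix adopted in the PR. The three-piece vocab is the potion-multilingual-128M prefix quoted above.

```python
from typing import List, Optional


def find_with_positional_fallback(pieces: List[str], name: str, fallback: int) -> int:
    """Hypothetical mirror of the reviewed lookup: guess a position when the name is absent."""
    return pieces.index(name) if name in pieces else fallback


def find_by_name_only(pieces: List[str], name: str) -> Optional[int]:
    """Alternative: report 'no such token' (None) instead of guessing a position."""
    return pieces.index(name) if name in pieces else None


# Vocab prefix from the review comment: [0]="[PAD]", [1]="[UNK]", [2]=","
pieces = ["[PAD]", "[UNK]", ","]
unk_id = 1

bos_id = find_with_positional_fallback(pieces, "<s>", 1)   # lands on the unknown token
eos_id = find_with_positional_fallback(pieces, "</s>", 2)  # lands on the "," piece
```

With the name-only lookup, both BOS and EOS resolve to None for this vocab, so no real piece is reclassified.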

Why the test stays green

Paraphrase-multilingual-MiniLM-L12-v2 names its specials (<s>=0, </s>=2) so the fallback never fires, and its normalizer is a single Precompiled, so the dropped-sibling path is never hit. Recommend adding fixtures that (a) place specials at non-conventional positions / omit <s>/</s>, and (b) use a Sequence normalizer with Replace/Strip, asserting against HF reference ids.

Minor

removeExtraWhitespaces is hard-coded true in both factories rather than derived from the JSON.
added_tokens from the JSON aren't auto-wired; correctness depends on the caller passing specialTokens. Worth documenting or reading them.

(Posting as comments only — not an approval or change request.)

ericstj · 2026-06-11T00:41:19Z

@copilot please address feedback


ericstj · 2026-06-11T01:10:02Z

@copilot address feedback


codecov · 2026-06-11T02:54:38Z

Codecov Report

❌ Patch coverage is 84.07367% with 147 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.65%. Comparing base (548d4d0) to head (057cd46).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs	69.43%	70 Missing and 33 partials ⚠️
...t.ML.Tokenizers/Model/SentencePieceUnigramModel.cs	76.59%	37 Missing and 7 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7625      +/-   ##
==========================================
+ Coverage   69.59%   69.65%   +0.05%     
==========================================
  Files        1484     1484              
  Lines      273606   274510     +904     
  Branches    27949    28065     +116     
==========================================
+ Hits       190410   191198     +788     
- Misses      75832    75916      +84     
- Partials     7364     7396      +32

Flag	Coverage Δ
Debug	`69.65% <84.07%> (+0.05%)`	⬆️
production	`63.88% <73.93%> (+0.03%)`	⬆️
test	`89.68% <100.00%> (+0.05%)`	⬆️


Files with missing lines	Coverage Δ
...soft.ML.Tokenizers/Model/SentencePieceBaseModel.cs	`83.14% <100.00%> (+2.70%)`	⬆️
test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs	`96.06% <100.00%> (+3.10%)`	⬆️
...t.ML.Tokenizers/Model/SentencePieceUnigramModel.cs	`64.80% <76.59%> (+4.41%)`	⬆️
...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs	`74.10% <69.43%> (-18.40%)`	⬇️

... and 5 files with indirect coverage changes


Copilot

Pull request overview

This PR extends SentencePieceTokenizer to support Hugging Face JSON-only Unigram tokenizers by adding new public factory APIs that can construct a Unigram tokenizer from either an in-memory vocab list or a tokenizer.json stream, avoiding the current requirement for a SentencePiece .model protobuf.

Changes:

Add SentencePieceTokenizer.Create(IEnumerable<(string Piece, float Score)> vocab, ...) for constructing a Unigram tokenizer directly from a vocab list.
Add SentencePieceTokenizer.CreateFromTokenizerJson(Stream tokenizerJsonStream, ...) for parsing HF tokenizer.json (Unigram) including vocab, unk_id, precompiled charsmap, and Metaspace settings.
Add internal constructors/refactoring to build a SentencePieceUnigramModel from vocab pieces and config values, plus new tests covering these creation paths.

Show a summary per file

File	Description
test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs	Adds unit tests for vocab-based and tokenizer.json-based Unigram construction and behavior parity checks.
src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs	Adds Unigram model constructors that build vocab/trie from `(piece, score)` inputs and detect special tokens by name.
src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs	Adds public factories for vocab and tokenizer.json, plus JSON parsing helpers for normalizer/pre-tokenizer extraction.
src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs	Adds a new base-model constructor taking explicit config/token IDs instead of `ModelProto`.

Copilot's findings

Files reviewed: 4/4 changed files
Comments generated: 4

tarekgh · 2026-06-11T18:52:47Z

CreateFromTokenizerJson ignores model.byte_fallback

The new JSON parser never reads the byte_fallback flag from the model section. In the protobuf path, SentencePieceBaseModel sets ByteFallback = modelProto.TrainerSpec.ByteFallback (SentencePieceBaseModel.cs:36), but the new Unigram constructor passes byteFallback: false as a hardcoded literal:

: base(addBos, addEos,
       ...,
       addDummyPrefix, escapeWhiteSpaces, treatWhitespaceAsSuffix, byteFallback: false,
       precompiledCharsmap, removeExtraWhitespaces, specialTokens)

Hugging Face Unigram tokenizers serialize this as a top-level model.byte_fallback boolean, for example:

"model": {
  "type": "Unigram",
  "unk_id": 0,
  "byte_fallback": true,
  "vocab": [ ... ]
}

For any model that enables byte fallback, hardcoding false means unknown characters will be emitted as the unknown token instead of being decomposed into <0x00>..<0xFF> byte pieces, which changes both the produced token IDs and the decoded output.

Suggested fix: read the flag in CreateFromTokenizerJson and thread it through to the model constructor:

bool byteFallback = modelElement.TryGetProperty("byte_fallback", out JsonElement bf) && bf.GetBoolean();

defaulting to false when the property is absent, then pass byteFallback instead of the literal.
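To make the behavioral difference concrete, the following Python sketch (not the library's implementation) shows what happens to a character that is missing from the vocab under each setting. The function name encode_unknown_char and the "<unk>" piece name are assumptions for illustration.

```python
from typing import Dict, List


def encode_unknown_char(ch: str, model: Dict, unk_piece: str = "<unk>") -> List[str]:
    """Return the pieces emitted for a character that is not in the vocab."""
    # Same default as the suggested fix: an absent property means no byte fallback.
    byte_fallback = bool(model.get("byte_fallback", False))
    if not byte_fallback:
        return [unk_piece]
    # With byte fallback, each UTF-8 byte becomes its own <0xNN> piece.
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]
```

For example, the euro sign is three UTF-8 bytes (E2 82 AC), so it becomes three byte pieces when the flag is on and a single unknown token when it is off.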

tarekgh · 2026-06-11T18:56:52Z

CreateFromTokenizerJson ignores the top-level added_tokens array

The parser determines special tokens in two ways: it auto-detects only the literal strings <s>, </s>, and <pad> via FindSpecialTokenId, and otherwise relies on the caller passing a specialTokens dictionary by hand. The added_tokens array in tokenizer.json, which is the authoritative source for special tokens and their IDs, is never read.

A typical tokenizer.json declares them like this:

"added_tokens": [
  { "id": 0, "content": "<s>",    "special": true },
  { "id": 1, "content": "<pad>",  "special": true },
  { "id": 2, "content": "</s>",   "special": true },
  { "id": 3, "content": "<unk>",  "special": true },
  { "id": 250001, "content": "<mask>", "special": true }
]

Two consequences:

Models that do not use the exact <s>/</s>/<pad> strings (for example [CLS], [SEP], [MASK], [PAD], [UNK], or <bos>) get none of their special tokens recognized. These are exactly the JSON-only Unigram layouts this PR targets (the issue mentions [PAD]=0, [UNK]=1). Such tokens stay classified as Normal pieces, so they are not treated as special during encoding and considerSpecialTokens has nothing to act on.
Callers are forced to re-supply information that already exists in the file, and to keep those IDs in sync manually.

Suggested direction: read added_tokens (content plus id, optionally filtered by "special": true) and use it to populate the special-token map and to reclassify those pieces as Control, falling back to the explicit specialTokens parameter only as an override. This also makes BOS/EOS/PAD detection robust instead of depending on hardcoded literal names.
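A minimal Python sketch of that direction, assuming the added_tokens layout shown above. The function name special_tokens_from_added and the overrides parameter (standing in for the caller-supplied specialTokens) are hypothetical.

```python
from typing import Dict, Optional


def special_tokens_from_added(root: Dict, overrides: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Build a content -> id map from added_tokens entries marked special."""
    specials = {
        entry["content"]: entry["id"]
        for entry in root.get("added_tokens", [])
        if entry.get("special", False)
    }
    # Caller-supplied entries win, so they act as an override rather than the only source.
    if overrides:
        specials.update(overrides)
    return specials
```

The ids in this map are also the vocab positions that would need to be reclassified as Control pieces.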

tarekgh · 2026-06-11T19:05:52Z

ExtractPrecompiledCharsMap throws on legitimate normalizer steps inside a Sequence

The normalizer walk handles only Precompiled and Sequence; every other type hits the else branch and throws:

else
{
    throw new NotSupportedException($"Normalizer type '{type}' is not supported. Only 'Precompiled' and 'Sequence' normalizers are supported.");
}

Because the Sequence branch calls ExtractPrecompiledCharsMap on each inner normalizer, any sibling step that is not Precompiled triggers the throw. Real Unigram tokenizers commonly nest the precompiled map alongside other steps, for example:

"normalizer": {
  "type": "Sequence",
  "normalizers": [
    { "type": "Nmt" },
    { "type": "Precompiled", "precompiled_charsmap": "..." },
    { "type": "Replace", "pattern": { "Regex": " {2,}" }, "content": " " }
  ]
}

With this input the method throws on the Nmt (or Replace) entry and the whole load fails, even though the precompiled_charsmap it needs is present and extractable. This rejects many valid tokenizer.json files.

The correct behavior here is to not throw: when walking a Sequence, skip inner steps that are not Precompiled rather than aborting, and only extract the precompiled map.

Should we open a tracking issue for the normalizers that we do not support yet?
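The skip-instead-of-throw walk could look roughly like the Python sketch below. It is an illustration under assumptions, not the PR's C# code: the function name is borrowed from the method under review, and the optional skipped list (which records ignored step types, for example to feed a tracking issue or a warning) is hypothetical.

```python
from typing import Dict, List, Optional


def extract_precompiled_charsmap(normalizer: Optional[Dict],
                                 skipped: Optional[List[str]] = None) -> Optional[str]:
    """Return the first precompiled_charsmap found, skipping unmodeled steps."""
    if not normalizer:
        return None
    kind = normalizer.get("type")
    if kind == "Precompiled":
        return normalizer.get("precompiled_charsmap")
    if kind == "Sequence":
        found = None
        for inner in normalizer.get("normalizers", []):
            result = extract_precompiled_charsmap(inner, skipped)
            if found is None and result is not None:
                found = result
        return found
    # Any other step (Nmt, Replace, Strip, ...) is recorded and skipped, not rejected.
    if skipped is not None:
        skipped.append(kind)
    return None
```

Recording the skipped types keeps the load tolerant while still making the dropped steps visible to the caller.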

tarekgh · 2026-06-11T19:10:25Z

removeExtraWhitespaces is hardcoded to true

CreateFromTokenizerJson always passes removeExtraWhitespaces: true to the model constructor. The protobuf path instead reads it from the model (SentencePieceBaseModel.cs:56, modelProto.NormalizerSpec.RemoveExtraWhitespaces), so the two entry points can disagree for the same underlying model.

Hugging Face tokenizer.json has no single field that maps directly to SentencePiece's remove_extra_whitespaces; the behavior is usually expressed through normalizer steps (for example a Strip or a Replace collapsing runs of spaces) and through the Metaspace pre-tokenizer. So unconditionally forcing true is an assumption rather than a value taken from the file.

For models where extra whitespace should be preserved, this changes normalization and therefore the resulting tokens. Two reasonable options:

Keep true as the default but document it explicitly as an assumption of this loader, and consider exposing it as a parameter so callers can override.
Infer it from the normalizer steps present in the JSON (for example, presence/absence of a whitespace-collapsing Replace/Strip).
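The second option might be sketched as follows in Python. The exact inference rule is an assumption: here any Strip step, or any Replace whose regex pattern is " {2,}", counts as evidence that extra whitespace is removed. Both function names are hypothetical.

```python
from typing import Dict, List, Optional


def flatten_steps(normalizer: Optional[Dict]) -> List[Dict]:
    """Expand nested Sequence normalizers into a flat list of steps."""
    if not normalizer:
        return []
    if normalizer.get("type") == "Sequence":
        steps: List[Dict] = []
        for inner in normalizer.get("normalizers", []):
            steps.extend(flatten_steps(inner))
        return steps
    return [normalizer]


def infer_remove_extra_whitespaces(normalizer: Optional[Dict]) -> bool:
    """Assumed heuristic: a Strip or a space-collapsing Replace implies True."""
    for step in flatten_steps(normalizer):
        if step.get("type") == "Strip":
            return True
        if step.get("type") == "Replace":
            pattern = step.get("pattern") or {}
            if pattern.get("Regex") == " {2,}":
                return True
    # No whitespace-handling step found: leave extra whitespace untouched.
    return False
```

A stricter variant could require both steps to be present; which rule matches Hugging Face's runtime behavior would need to be checked against reference ids.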

tarekgh · 2026-06-11T20:00:32Z

[API] Public way to construct SentencePieceTokenizer (Unigram) from tokenizer.json / in-memory pieces+scores #7624 (comment)

tarekgh · 2026-06-11T21:08:51Z

post_processor (TemplateProcessing) is not consulted for BOS/EOS behavior

CreateFromTokenizerJson decides BOS/EOS purely from the addBeginningOfSentence/addEndOfSentence parameters and locates the tokens by guessing the names <s>/</s>. The tokenizer.json already declares this authoritatively in post_processor:

"post_processor": {
  "type": "TemplateProcessing",
  "single": [
    { "SpecialToken": { "id": "<s>",  "type_id": 0 } },
    { "Sequence":     { "id": "A",    "type_id": 0 } },
    { "SpecialToken": { "id": "</s>", "type_id": 0 } }
  ],
  "special_tokens": {
    "<s>":  { "id": "<s>",  "ids": [0], "tokens": ["<s>"] },
    "</s>": { "id": "</s>", "ids": [2], "tokens": ["</s>"] }
  }
}

This template is the source of truth for which special tokens wrap the input, in what order, and with which IDs. By ignoring it, the loader cannot tell whether a model wraps with BOS/EOS by default, and for models whose BOS/EOS are not literally <s>/</s> (or that use only one of them) the name guessing will pick the wrong token or none at all.

Suggested direction: when present, read post_processor.special_tokens (and the single template) to resolve the BOS/EOS token strings and IDs, using the addBeginningOfSentence/addEndOfSentence parameters as the on/off override rather than as the sole source.
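As a sketch of that direction in Python (not the PR's C# code), the single template can be split into the special tokens that come before and after the Sequence placeholder. The function name resolve_affixes is hypothetical, only TemplateProcessing is handled, and ids are taken from post_processor.special_tokens alone.

```python
from typing import Dict, List, Optional, Tuple

Affix = List[Tuple[str, int]]


def resolve_affixes(post_processor: Optional[Dict]) -> Tuple[Affix, Affix]:
    """Split the 'single' template into (prefix, suffix) lists of (token, id)."""
    if not post_processor or post_processor.get("type") != "TemplateProcessing":
        return [], []
    specials = post_processor.get("special_tokens", {})
    prefix: Affix = []
    suffix: Affix = []
    seen_sequence = False
    for item in post_processor.get("single", []):
        if "Sequence" in item:
            # Everything after the input placeholder belongs to the suffix.
            seen_sequence = True
            continue
        name = item["SpecialToken"]["id"]
        entry = specials[name]
        target = suffix if seen_sequence else prefix
        target.extend(zip(entry["tokens"], entry["ids"]))
    return prefix, suffix
```

The addBeginningOfSentence/addEndOfSentence parameters would then decide whether the resolved prefix and suffix are applied at all.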

ericstj · 2026-06-12T15:24:09Z

@tarekgh So I did some investigation and post_processor is actually a bit more involved, but we need it in order to have a complete story here.

I pulled 57 popular SentencePiece-based models from Hugging Face: 51 use post_processor. Of those, 37 are simple, with at most one prefix token and one suffix token, but 14 wrap the sequence with multiple tokens.

I'm spiking out a proper post_processor implementation, but would need your review and guidance before doing major changes. As such I'll move this back to draft.

…ost-processor support

Support Hugging Face tokenizer.json post-processor templates that wrap a sequence with multiple special tokens (XLNet/mBART/NLLB styles) by generalizing the single BOS/EOS handling into ordered prefix/suffix special-token lists, applied consistently across the encode/count/index paths.

CreateFromTokenizerJson now:
- parses post_processor (TemplateProcessing, RobertaProcessing, BertProcessing, Sequence) to resolve the prefix/suffix wrapping, resolving ids via special_tokens -> added_tokens -> vocab.
- reads model.byte_fallback and exposes byteFallback on Create.
- treats added_tokens (special) as the authoritative special-token source.
- tolerates unmodeled sibling steps inside a Sequence normalizer.
- deduces remove_extra_whitespaces from the normalizer Strip/Replace(" {2,}") steps instead of hardcoding it, matching the Hugging Face fast-tokenizer runtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ericstj · 2026-06-12T19:25:41Z

Design: https://gist.github.com/ericstj/648348b68b28622b90071dfdae6fe932

What changed

Generalized SentencePiece's single BOS/EOS handling into ordered prefix/suffix special-token lists, applied consistently across the encode/count/index paths. This lets CreateFromTokenizerJson represent Hugging Face post_processor templates that wrap a sequence with multiple special tokens (XLNet </s>+<cls>, mBART/NLLB styles), which the prior single-token model couldn't express. From a survey of 57 resolved HF Unigram models, ~25% need more than a single BOS/EOS.

The JSON loader now:

Parses post_processor (TemplateProcessing, RobertaProcessing, BertProcessing, Sequence), resolving ids via special_tokens → added_tokens → vocab.
Reads model.byte_fallback (exposed on Create) instead of hardcoding false.
Treats added_tokens (special) as the authoritative special-token source.
Tolerates unmodeled sibling steps inside a Sequence normalizer.
Deduces remove_extra_whitespaces from the normalizer's Strip / Replace(" {2,}" → ▁) steps (defaulting false), matching HF's fast-tokenizer runtime, rather than hardcoding true.

Addressing @tarekgh's feedback

All six comments are resolved: byte_fallback, added_tokens, the Sequence-normalizer throw, post_processor / TemplateProcessing BOS/EOS, and removeExtraWhitespaces (now inferred from the normalizer steps per option 2 — HF encodes it as Strip + Replace(" {2,}")).

Performance

No regression — the prefix/suffix-list refactor is allocation-neutral on every existing path (byte-for-byte identical allocations vs the pre-change baseline across encode/count/index × affix × size). Timing deltas are within run-to-run noise on a dev laptop. https://gist.github.com/ericstj/c2c80feddab1a9cd122f34de07b9df28#file-perf-report-md

Out of scope (follow-ups)

Pair-sequence encoding + type_ids; dynamic language tokens (mBART/NLLB); a general Tokenizer-level post-processor abstraction; ByteLevel trim_offsets.

Copilot

Copilot's findings

Files reviewed: 4/4 changed files
Comments generated: 2

+            ByteCodeToIdOffset = _vocab.TryGetValue("<0x00>", out int id) ? id : MaxByteId;
+            OneByteUtf8EncodingMaxId = ByteCodeToIdOffset + 0x7F;
+            MaxIdByteFallbackId = ByteCodeToIdOffset + 0xFF;
+


+        /// <c>remove_extra_whitespaces</c> has no direct representation in <c>tokenizer.json</c> and is assumed to be
+        /// <see langword="true"/>. Pair-sequence templates and per-token <c>type_id</c>s are not applied. Templates that
+        /// place a special token in the middle of the sequence are rejected with <see cref="NotSupportedException"/>.


Initial plan

ff40b11

Copilot AI assigned Copilot and ericstj Jun 10, 2026

Copilot started work on behalf of ericstj June 10, 2026 23:15 View session

Copilot AI linked an issue Jun 10, 2026 that may be closed by this pull request

[API] Public way to construct SentencePieceTokenizer (Unigram) from tokenizer.json / in-memory pieces+scores #7624

Open

Add public SentencePieceTokenizer.Create(vocab) and CreateFromTokeniz…

862be32

…erJson APIs Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add public way to construct SentencePieceTokenizer from tokenizer.json~~ Add public SentencePieceTokenizer factory methods for Unigram from vocab list and tokenizer.json Jun 10, 2026

Copilot finished work on behalf of ericstj June 10, 2026 23:42

Copilot AI requested a review from ericstj June 10, 2026 23:42

ericstj reviewed Jun 11, 2026


Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs Outdated

Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs

Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs Outdated

Copilot started work on behalf of ericstj June 11, 2026 00:41 View session

Fix BOS/EOS positional fallback, normalizer type validation, and prep…

112ee2b

…end_scheme handling Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>

Copilot finished work on behalf of ericstj June 11, 2026 00:50

Copilot AI requested a review from ericstj June 11, 2026 00:50

ericstj reviewed Jun 11, 2026


Copilot started work on behalf of ericstj June 11, 2026 01:10 View session

Fix null normalizer guard, recursive Sequence support, BOS/EOS valida…

7e32c05

…tion, and add tests Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>

Copilot finished work on behalf of ericstj June 11, 2026 01:23

Copilot AI requested a review from ericstj June 11, 2026 01:23

ericstj marked this pull request as ready for review June 11, 2026 03:27

Copilot AI review requested due to automatic review settings June 11, 2026 03:27

Copilot started reviewing on behalf of ericstj June 11, 2026 03:27 View session

ericstj requested a review from tarekgh June 11, 2026 03:27

Copilot AI reviewed Jun 11, 2026


build-analysis Bot mentioned this pull request Jun 11, 2026

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

3 tasks

github-actions Bot mentioned this pull request Jun 11, 2026

🏥 Repo Health Dashboard #7612

Open

ericstj marked this pull request as draft June 12, 2026 15:24

ericstj requested a review from Copilot June 12, 2026 19:26

Copilot started reviewing on behalf of ericstj June 12, 2026 19:26 View session

Copilot AI reviewed Jun 12, 2026

