Store spans instead of lexeme strings on tokens by StreamDemon · Pull Request #76 · StreamDemon/sploosh

StreamDemon · 2026-07-02T11:07:49Z

Summary

Token no longer stores an owned lexeme: String — it is now just { kind, span }, and the new Token::text(source) slices the original buffer on demand. Lexing is now allocation-free per token.
The parser threads source: &'src str through (Parser<'src>) and derives text only where it is actually consumed: identifiers, path segments, literals, the vec head check, at_ident_text, and the extern target.
Unary/binary operator text now comes from the token kind (which fully determines it) instead of the stored lexeme — this also sets up the operator-enum sub-PR that follows.
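
The token shape described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the `Span` field names, the `TokenKind` variants, and the exact `text` signature are guesses based on the description.

```rust
/// Half-open byte range into the source buffer (field names are assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// A few illustrative kinds; the real enum is larger.
#[derive(Debug, Clone, PartialEq)]
pub enum TokenKind {
    Ident,
    IntLiteral,
    Plus,
}

/// No owned `lexeme: String` any more, only a kind and a span.
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}

impl Token {
    /// Slice the token's text out of the buffer it was lexed from.
    /// Panics if the span is out of bounds or not on char boundaries.
    pub fn text<'src>(&self, source: &'src str) -> &'src str {
        &source[self.span.start..self.span.end]
    }
}
```

The returned slice borrows from `source`, not from the token, so it stays valid for as long as the source buffer does.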

Second sub-PR of the enhancement wave (PR #71 roadmap: "Token.lexeme: String → span-based slicing or interning"). Span slicing was chosen over interning: the token stream never outlives the source buffer in the bootstrap pipeline, so a borrow is strictly simpler and interning buys nothing yet. No behavior change.

Related Issue

None (PR #71 roadmap item).

Spec Sections Affected

None — implementation quality only.

Checklist

Code follows the Sploosh design principles (one way to do it, explicit over implicit, etc.)
Documentation updated in relevant docs/ pages — N/A, no behavior change
Tests added or updated
All build targets still compile (if applicable)
Spec-only PR (skip Build Targets section if checked)

Build Targets Tested

cargo fmt --all -- --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace all green locally (49 tests).

Test Plan

Entire existing suite (11 lexer + 37 parser + corpus) passes unchanged — token text is observable only through parse results, which are asserted structurally throughout.
every_numeric_suffix_lexes now exercises the new Token::text API directly (text + span asserted per suffix).
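
A test in that style might look like the sketch below. `lex_number` is a hypothetical stand-in for the real lexer, and the suffixes in the usage note are placeholders, not the actual Sploosh suffix list.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

#[derive(Debug, Clone, PartialEq)]
pub enum TokenKind {
    Number,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}

impl Token {
    pub fn text<'src>(&self, source: &'src str) -> &'src str {
        &source[self.span.start..self.span.end]
    }
}

/// Hypothetical mini-lexer: scans one numeric literal (ASCII digits plus an
/// optional alphanumeric suffix) from the start of `source`.
pub fn lex_number(source: &str) -> Option<Token> {
    let digits = source.bytes().take_while(|b| b.is_ascii_digit()).count();
    if digits == 0 {
        return None;
    }
    let suffix = source[digits..]
        .bytes()
        .take_while(|b| b.is_ascii_alphanumeric())
        .count();
    Some(Token {
        kind: TokenKind::Number,
        span: Span { start: 0, end: digits + suffix },
    })
}

/// True when every suffix lexes to a single token whose text and span
/// both cover the whole literal.
pub fn every_suffix_lexes(suffixes: &[&str]) -> bool {
    suffixes.iter().all(|suffix| {
        let source = format!("42{}", suffix);
        match lex_number(&source) {
            Some(tok) => {
                tok.text(&source) == source.as_str()
                    && tok.span == Span { start: 0, end: source.len() }
            }
            None => false,
        }
    })
}
```

Calling `every_suffix_lexes(&["i32", "u8", "f64"])` checks text and span per suffix, which is the shape of assertion the test plan describes.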

Summary by cubic

Switched tokens to span-based text slicing with Token::text(source) and threaded source through the parser to read text only when needed. No behavior change; this removes per-token allocations and reduces memory use.

Refactors
- Token now stores { kind, span }; added Token::text(&source).
- Parser takes source (Parser<'src>) and reads text for identifiers, literals, path segments, vec head checks, and extern targets.
- Unary/binary operator text is derived from TokenKind instead of stored lexemes.
- Tests updated to assert via Token::text.

Migration
- Replace token.lexeme with token.text(source).
- Pass the source buffer when constructing the parser.
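
The two migration steps could look roughly like this at a call site. The `Parser::new` signature, the field names, and `parse_ident` are assumptions made for illustration; only `Parser<'src>` and `Token::text(source)` come from the PR description.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

#[derive(Debug, Clone, PartialEq)]
pub enum TokenKind {
    Ident,
    Eof,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}

impl Token {
    pub fn text<'src>(&self, source: &'src str) -> &'src str {
        &source[self.span.start..self.span.end]
    }
}

/// The parser now borrows the source buffer for its whole lifetime.
pub struct Parser<'src> {
    source: &'src str,
    tokens: Vec<Token>,
    pos: usize,
}

impl<'src> Parser<'src> {
    /// Hypothetical constructor: the source is passed alongside the tokens.
    pub fn new(source: &'src str, tokens: Vec<Token>) -> Self {
        Parser { source, tokens, pos: 0 }
    }

    /// Consume one identifier and return owned text for the AST.
    pub fn parse_ident(&mut self) -> Option<String> {
        let tok = self.tokens.get(self.pos)?;
        if tok.kind != TokenKind::Ident {
            return None;
        }
        // Before this change: `tok.lexeme.clone()`.
        let name = tok.text(self.source).to_string();
        self.pos += 1;
        Some(name)
    }
}
```

An allocation still happens where the AST needs an owned `String`, but only for tokens that are actually consumed as text, not for every token the lexer emits.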

Written for commit 2974ffc. Summary will update on new commits.

Every token carried an owned copy of its source text, so lexing a file allocated one String per token even though the source buffer already holds the same bytes. Token is now just a kind and a span; the new `Token::text(source)` slices the original buffer on demand. The parser threads the source string through and derives text only at the few places that need it (identifiers, literals, the `vec` head, the extern target). Unary and binary operator text now comes from the token kind rather than the lexeme, since the kind already determines it.
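
Deriving operator text from the kind could be sketched as below. The variant names and function names are assumptions; the operator spellings are the ones named in the review diagram (`+`, `-`, `|>` for binary; `!`, `-`, `*`, `&` for unary).

```rust
/// Illustrative subset of token kinds (variant names are assumed).
#[derive(Debug, Clone, PartialEq)]
pub enum TokenKind {
    Plus,
    Minus,
    Star,
    Bang,
    Amp,
    PipeGt,
    Ident,
}

/// Text of a binary operator, determined entirely by the token kind.
pub fn binary_op_text(kind: &TokenKind) -> Option<&'static str> {
    match kind {
        TokenKind::Plus => Some("+"),
        TokenKind::Minus => Some("-"),
        TokenKind::PipeGt => Some("|>"),
        _ => None,
    }
}

/// Text of a unary operator; `-` is valid in both positions.
pub fn unary_op_text(kind: &TokenKind) -> Option<&'static str> {
    match kind {
        TokenKind::Bang => Some("!"),
        TokenKind::Minus => Some("-"),
        TokenKind::Star => Some("*"),
        TokenKind::Amp => Some("&"),
        _ => None,
    }
}
```

Because the strings are `&'static str`, neither function needs the source buffer or a span.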

cubic-dev-ai

No issues found across 2 files

Confidence score: 5/5

Automated review surfaced no issues in the provided summaries.
No files require special attention.

Architecture diagram

sequenceDiagram
    participant Source as Source buffer (&str)
    participant Lexer as Lexer
    participant TokenStream as Token stream
    participant Parser as Parser<'src>
    participant ASTNode as AST nodes
    
    Note over Source,ASTNode: Span-based token text flow (no allocation per token)
    
    Source->>Lexer: lex(source)
    Lexer->>Lexer: scan characters
    Lexer->>Lexer: push(kind, start, end)
    Lexer->>TokenStream: Token{kind, span}
    Note over TokenStream,Parser: Each token stores only kind + byte span
    
    Source->>Parser: parse_program(source)
    Parser->>TokenStream: consume tokens
    
    alt Parsing an identifier/token
        Parser->>Source: text(&token) slices source by span
        Source-->>Parser: &'src str slice
        Parser->>ASTNode: store slice as String (only for Ident, Path, Literal)
    else Parsing a binary operator (e.g., +, -, |>)
        Parser->>Parser: derive op string from TokenKind (no source slice needed)
        Parser->>ASTNode: store derived operator string
    else Parsing a unary operator (e.g., !, -, *, &)
        Parser->>Parser: match TokenKind to static string
        Parser->>ASTNode: store derived operator string
    else Checking for "vec" keyword
        Parser->>Source: text(&token) slices source
        Source-->>Parser: &'src str
        Parser->>Parser: compare to "vec"
    else Checking ident text (at_ident_text / eat_ident_text)
        Parser->>Source: token.text(self.source)
        Source-->>Parser: &'src str
        Parser->>Parser: compare to target text
    else Parsing extern block target
        Parser->>Source: text(&token).to_string()
        Source-->>Parser: String
        Parser->>ASTNode: store target string
    end
    
    Note over Parser,ASTNode: Token text derived only where AST nodes need it
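
The lexer half of the diagram (scan characters, then push kind plus start and end) might be sketched like this. It is a toy lexer written for illustration; the token kinds and the `lex` signature are assumptions, not the Sploosh lexer.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

#[derive(Debug, Clone, PartialEq)]
pub enum TokenKind {
    Ident,
    Number,
    Plus,
    Unknown,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}

impl Token {
    pub fn text<'src>(&self, source: &'src str) -> &'src str {
        &source[self.span.start..self.span.end]
    }
}

/// Toy lexer: records byte offsets only, so no String is built per token.
pub fn lex(source: &str) -> Vec<Token> {
    let bytes = source.as_bytes();
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < bytes.len() {
        let start = pos;
        let kind = match bytes[pos] {
            b if b.is_ascii_whitespace() => {
                pos += 1;
                continue;
            }
            b'+' => {
                pos += 1;
                TokenKind::Plus
            }
            b if b.is_ascii_digit() => {
                while pos < bytes.len() && bytes[pos].is_ascii_digit() {
                    pos += 1;
                }
                TokenKind::Number
            }
            b if b.is_ascii_alphabetic() || b == b'_' => {
                while pos < bytes.len()
                    && (bytes[pos].is_ascii_alphanumeric() || bytes[pos] == b'_')
                {
                    pos += 1;
                }
                TokenKind::Ident
            }
            _ => {
                // Advance by one whole char so spans stay on char boundaries.
                pos += source[pos..].chars().next().map_or(1, char::len_utf8);
                TokenKind::Unknown
            }
        };
        // The push(kind, start, end) step from the diagram.
        tokens.push(Token { kind, span: Span { start, end: pos } });
    }
    tokens
}
```

The only allocation left is the token vector itself; text is recovered later with `token.text(source)` at the points where the parser needs it.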

Requires human review: Refactor of core token and parser data structures; risk of subtle bugs despite no behavior change.

cubic-dev-ai Bot reviewed Jul 2, 2026


StreamDemon merged commit 3e65f7f into feature/parser-enhancements-base Jul 2, 2026
2 checks passed

StreamDemon deleted the feature/parser-enh-token-spans branch July 2, 2026 11:16

This was referenced Jul 2, 2026

Make Token and TokenKind Copy, drop parse-loop clones #77

Merged

Parser enhancements wave: integration base #71

Merged
