Store spans instead of lexeme strings on tokens #76

Merged
StreamDemon merged 1 commit into feature/parser-enhancements-base from feature/parser-enh-token-spans on Jul 2, 2026

@StreamDemon StreamDemon commented Jul 2, 2026

Summary

  • Token no longer stores an owned lexeme: String — it is now just { kind, span }, and the new Token::text(source) slices the original buffer on demand. Lexing is now allocation-free per token.
  • The parser threads source: &'src str through (Parser<'src>) and derives text only where it is actually consumed: identifiers, path segments, literals, the vec head check, at_ident_text, and the extern target.
  • Unary/binary operator text now comes from the token kind (which fully determines it) instead of the stored lexeme — this also sets up the operator-enum sub-PR that follows.
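As a rough illustration of the first bullet, here is a minimal sketch of a span-only token. The `Span` field names, the `TokenKind` variants, and the exact signature of `text` are assumptions for the example; the PR only states that `Token` is `{ kind, span }` and that `Token::text(source)` slices the original buffer.

```rust
/// Half-open byte range into the source buffer.
/// (Field names are hypothetical; the PR does not show them.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// A tiny stand-in for the real token kinds (hypothetical variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Ident,
    IntLiteral,
    Plus,
}

/// A token is just a kind and a span, so it owns no heap data and is `Copy`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}

impl Token {
    /// Borrow this token's text from the buffer it was lexed from.
    /// The returned slice lives as long as `source`, not as long as the token.
    pub fn text<'src>(&self, source: &'src str) -> &'src str {
        &source[self.span.start..self.span.end]
    }
}
```

With the source `"foo + 42"`, a token spanning bytes 0..3 would yield `"foo"` and one spanning 6..8 would yield `"42"`, without either token allocating.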

Second sub-PR of the enhancement wave (PR #71 roadmap: "Token.lexeme: String → span-based slicing or interning"). Span slicing was chosen over interning: the token stream never outlives the source buffer in the bootstrap pipeline, so a borrow is strictly simpler and interning buys nothing yet. No behavior change.

Related Issue

None (PR #71 roadmap item).

Spec Sections Affected

None — implementation quality only.

Checklist

  • Code follows the Sploosh design principles (one way to do it, explicit over implicit, etc.)
  • Documentation updated in relevant docs/ pages — N/A, no behavior change
  • Tests added or updated
  • All build targets still compile (if applicable)
  • Spec-only PR (skip Build Targets section if checked)

Build Targets Tested

  • cargo fmt --all -- --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace all green locally (49 tests).

Test Plan

  • Entire existing suite (11 lexer + 37 parser + corpus) passes unchanged — token text is observable only through parse results, which are asserted structurally throughout.
  • every_numeric_suffix_lexes now exercises the new Token::text API directly (text + span asserted per suffix).

Summary by cubic

Switched tokens to span-based text slicing with Token::text(source) and threaded source through the parser to read text only when needed. No behavior change; this removes per-token allocations and reduces memory use.

  • Refactors

    • Token now stores { kind, span }; added Token::text(&source).
    • Parser takes source (Parser<'src>) and reads text for identifiers, literals, path segments, vec head checks, and extern targets.
    • Unary/binary operator text is derived from TokenKind instead of stored lexemes.
    • Tests updated to assert via Token::text.
  • Migration

    • Replace token.lexeme with token.text(source).
    • Pass the source buffer when constructing the parser.
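The migration steps above might look roughly like the following. This is a sketch under assumptions, not the PR's code: the `Parser` fields, the `new` constructor, and `parse_ident` are hypothetical, and only `Parser<'src>`, `Token::text(source)`, and `at_ident_text` are named in the PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Ident,
    Eof,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}

impl Token {
    pub fn text<'src>(&self, source: &'src str) -> &'src str {
        &source[self.span.start..self.span.end]
    }
}

/// The parser borrows the source for its whole lifetime
/// (field layout is an assumption).
pub struct Parser<'src> {
    source: &'src str,
    tokens: Vec<Token>,
    pos: usize,
}

impl<'src> Parser<'src> {
    /// Hypothetical constructor: the source buffer is passed alongside the tokens.
    pub fn new(source: &'src str, tokens: Vec<Token>) -> Self {
        Parser { source, tokens, pos: 0 }
    }

    /// True if the current token is an identifier spelled exactly `text`.
    /// Before: `tok.lexeme == text`. After: `tok.text(self.source) == text`.
    pub fn at_ident_text(&self, text: &str) -> bool {
        match self.tokens.get(self.pos) {
            Some(tok) => tok.kind == TokenKind::Ident && tok.text(self.source) == text,
            None => false,
        }
    }

    /// Consume an identifier and copy its text into an owned `String`,
    /// which is the only point where an allocation happens.
    pub fn parse_ident(&mut self) -> Option<String> {
        let tok = *self.tokens.get(self.pos)?;
        if tok.kind != TokenKind::Ident {
            return None;
        }
        self.pos += 1;
        Some(tok.text(self.source).to_string())
    }
}
```

The design point is that comparisons such as the `vec` head check never allocate; a `String` is created only when a name is actually stored in the tree.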

Written for commit 2974ffc. Summary will update on new commits.

Every token carried an owned copy of its source text, so lexing a file
allocated one String per token even though the source buffer already
holds the same bytes. Token is now just a kind and a span; the new
`Token::text(source)` slices the original buffer on demand.

The parser threads the source string through and derives text only at
the few places that need it (identifiers, literals, the `vec` head, the
extern target). Unary and binary operator text now comes from the token
kind rather than the lexeme, since the kind already determines it.
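To illustrate the last point, operator text can be recovered from the kind alone with a plain `match` that returns static strings. The variant and function names below are hypothetical; the operator spellings are the ones mentioned in the review diagram (`+`, `-`, `|>` for binary; `!`, `-`, `*`, `&` for unary).

```rust
/// Hypothetical subset of token kinds; the real enum is not shown in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Plus,
    Minus,
    Star,
    Bang,
    Amp,
    PipeGt,
    Ident,
}

/// Text of a binary operator, or `None` if `kind` is not one.
/// No source slice is needed: the kind fully determines the spelling.
pub fn binary_op_text(kind: TokenKind) -> Option<&'static str> {
    match kind {
        TokenKind::Plus => Some("+"),
        TokenKind::Minus => Some("-"),
        TokenKind::Star => Some("*"),
        TokenKind::PipeGt => Some("|>"),
        _ => None,
    }
}

/// Text of a unary (prefix) operator, or `None` if `kind` is not one.
pub fn unary_op_text(kind: TokenKind) -> Option<&'static str> {
    match kind {
        TokenKind::Bang => Some("!"),
        TokenKind::Minus => Some("-"),
        TokenKind::Star => Some("*"),
        TokenKind::Amp => Some("&"),
        _ => None,
    }
}
```

Because identifiers can be spelled any number of ways, they fall through to `None` here and still need the source slice.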

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 2 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant Source as Source buffer (&str)
    participant Lexer as Lexer
    participant TokenStream as Token stream
    participant Parser as Parser<'src>
    participant ASTNode as AST nodes
    
    Note over Source,ASTNode: Span-based token text flow (no allocation per token)
    
    Source->>Lexer: lex(source)
    Lexer->>Lexer: scan characters
    Lexer->>Lexer: push(kind, start, end)
    Lexer->>TokenStream: Token{kind, span}
    Note over TokenStream,Parser: Each token stores only kind + byte span
    
    Source->>Parser: parse_program(source)
    Parser->>TokenStream: consume tokens
    
    alt Parsing an identifier/token
        Parser->>Source: text(&token) slices source by span
        Source-->>Parser: &'src str slice
        Parser->>ASTNode: store slice as String (only for Ident, Path, Literal)
    else Parsing a binary operator (e.g., +, -, |>)
        Parser->>Parser: derive op string from TokenKind (no source slice needed)
        Parser->>ASTNode: store derived operator string
    else Parsing a unary operator (e.g., !, -, *, &)
        Parser->>Parser: match TokenKind to static string
        Parser->>ASTNode: store derived operator string
    else Checking for "vec" keyword
        Parser->>Source: text(&token) slices source
        Source-->>Parser: &'src str
        Parser->>Parser: compare to "vec"
    else Checking ident text (at_ident_text / eat_ident_text)
        Parser->>Source: token.text(self.source)
        Source-->>Parser: &'src str
        Parser->>Parser: compare to target text
    else Parsing extern block target
        Parser->>Source: text(&token).to_string()
        Source-->>Parser: String
        Parser->>ASTNode: store target string
    end
    
    Note over Parser,ASTNode: Token text derived only where AST nodes need it

Requires human review: Refactor of core token and parser data structures; risk of subtle bugs despite no behavior change.

@StreamDemon StreamDemon merged commit 3e65f7f into feature/parser-enhancements-base Jul 2, 2026
2 checks passed
@StreamDemon StreamDemon deleted the feature/parser-enh-token-spans branch July 2, 2026 11:16