Parser enhancements wave: integration base#71
Conversation
c04bde5 to
c7bb5a3
Compare
Empty seed commit for the enhancement-wave base PR; the roadmap lives in the PR description. Rebased onto main after the parser correctness wave (PR #69) merged.
c7bb5a3 to
1a01d83
Compare
`numeric_suffix` and `validate_numeric_body` each carried their own copy of the 13-entry suffix table, so any future suffix change had to be made twice or the two paths would drift apart. Hoist the table into a shared `NUMERIC_SUFFIXES` const and return the matched `&'static str` instead of allocating a fresh `String` on every suffix scan. A new test iterates the shared table and lexes `1<suffix>` for every entry, so additions to the list are covered automatically.
There was a problem hiding this comment.
No issues found across 1 file
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
participant L as Lexer
participant Src as Source Text
participant NS as NUMERIC_SUFFIXES const
participant Tok as Token
Note over L,Tok: Numeric literal scanning with shared suffix const
L->>Src: rest = &source[pos..]
L->>L: scan numeric digits & separators
L->>NS: iterate over suffixes for match
NS-->>L: suffix (e.g., "i32") as &'static str
alt suffix found
L->>Src: advance pos by suffix.len()
L->>Tok: create token with &'static suffix (no String alloc)
else no suffix
L->>Tok: create token without suffix
end
Note over L: Later, validate_numeric_body uses same const
L->>NS: find_map(f: strip suffix from body)
NS-->>L: suffix to strip (if present)
L->>L: validate separators in remainder
Note over L,Tok: NEW: test covers all suffixes
Test->>L: lex("1<i32>")
L->>NS: iterate suffixes
NS-->>L: "i32"
L-->>Test: Token(IntLit, "1i32")
Test->>Test: assert kind and lexeme
Auto-approved: Refactors numeric suffix handling into a shared const and eliminates per-scan allocations; adds a test covering all suffixes.
Re-trigger cubic
Every token carried an owned copy of its source text, so lexing a file allocated one String per token even though the source buffer already holds the same bytes. Token is now just a kind and a span; the new `Token::text(source)` slices the original buffer on demand. The parser threads the source string through and derives text only at the few places that need it (identifiers, literals, the `vec` head, the extern target). Unary and binary operator text now comes from the token kind rather than the lexeme, since the kind already determines it.
There was a problem hiding this comment.
0 issues found across 2 files (changes from recent commits).
Requires human review: Refactors token representation in lexer and parser to use span-based text slicing; moderate risk due to core data structure change.
Re-trigger cubic
With the lexeme gone, a token is a payload-free kind plus a span — both trivially copyable. Deriving Copy lets every parse loop pass and return tokens by value, so the `.clone()` calls sprinkled through `at`, `eat`, `expect`, `bump`, `peek_kind`, and the recovery helpers all disappear. The PR #71 roadmap sketched this slot as "`at`/`eat`/`expect` take `&TokenKind`"; deriving Copy reaches the same goal (no clones in parse loops) with by-value call sites instead of reference threading, which only became possible after the span-slicing change landed.
There was a problem hiding this comment.
0 issues found across 2 files (changes from recent commits).
Requires human review: This PR refactors the lexer to store tokens as span-based slices and the parser to use them, along with other internal improvements. Although no behavior changes are intended, such refactors have a broad impact across core data structures and code paths, making human review necessary.
Re-trigger cubic
`ExprKind::Unary`/`Binary` stored their operator as an owned String,
which allocated per node and let any string masquerade as an operator.
Dedicated enums make illegal operators unrepresentable, shrink the
nodes, and give match exhaustiveness checking to every consumer.
`=` never reaches the AST (it builds `ExprKind::Assign`), so the parser
classifies infix tokens with a private `Infix { Assign, Op(BinaryOp) }`
wrapper instead of widening the public enum with a variant no AST node
can carry. `as_str()`/`Display` on both enums recover the source
spelling for diagnostics and the future pretty-printer (#67).
There was a problem hiding this comment.
0 issues found across 2 files (changes from recent commits).
Requires human review: Introduces structural changes in AST, lexer, and parser (enums, span-based tokens, Copy); risk of subtle breakage across multiple crates.
Re-trigger cubic
`@mailbox(capacity: 2048)` and `@supervisor(strategy: "one_for_one")`
lost their arguments at parse time — the parser skipped everything
inside the parens — and attributes on actor handlers were dropped
outright. Nothing downstream (semantic analysis, diagnostics) could
ever see them.
`Attribute` now carries `args: Vec<AttrArg>` plus a span covering `@`
through the closing paren, with `AttrArg` mirroring the §16 grammar
(`attr_arg = IDENT [ ":" expr | "=" expr | "(" expr ")" ] | expr`).
Only the `IDENT ":"` form needs lookahead; the `=` and call forms are
valid expressions, so they parse as expressions and canonicalize to the
most specific attr shape afterwards.
Actor handlers become `Handler { attrs, function }`, since a handler is
a `fn_def` and §16 puts attrs on `fn_def` itself. The now-unused
`skip_balanced_after_open` helper is removed.
The crates/AGENTS.md rule is "add corpus tests for every grammar shape accepted", but extern blocks, onchain modules, use trees, casts, the literal zoo, async/.await, `?` outside pipes, attribute arguments, and struct-literal shapes had no fixtures. Six new fixtures close those gaps. The harness now discovers `tests/corpus/*.sp` instead of maintaining a hard-coded list, so a new fixture cannot be silently skipped; an is-empty guard catches a moved or emptied corpus directory.
Summary
mainwith a true merge commit. Same stacked-PR pattern as the correctness wave (PR Parser correctness wave: integration base #69) and the v0.5.10 cleanup (PR Spec v0.5.10: Cleanup batch (integration) #40).Wave complete — all six roadmap items shipped:
Binary/Unaryop: String→ dedicated operator enums (Replace operator strings with UnaryOp and BinaryOp enums #79)Token.lexeme: String→ span-based slicing (Token { kind, span }+Token::text(source); slicing chosen over interning — the token stream never outlives the source buffer in the bootstrap pipeline) (Store spans instead of lexeme strings on tokens #76)Attribute.args/.spanper §16attr_args; actor handlers are nowHandler { attrs, function }) (Preserve attribute arguments and actor-handler attrs in the AST #80)NUMERIC_SUFFIXESconst,&'static strreturn) (Dedupe the lexer numeric-suffix list #75)Re-sliced:at/eat/expecttake&TokenKindToken/TokenKindderiveCopy, which removes every parse-loop.clone()with by-value call sites instead of reference threading — same goal, simpler result; only possible once Store spans instead of lexeme strings on tokens #76 removed the owned lexeme (Make Token and TokenKind Copy, drop parse-loop clones #77)extern/onchain, casts + literal zoo, async/.await, modules/use, attributes, struct-literal shapes) and auto-discovery oftests/corpus/*.spin the harness (Expand the corpus to cover every accepted grammar shape #81)Ride-along: issue #78 (spurious "expected identifier" for non-ident-headed type args like
Vec<&str>) was found while designing the attribute-argument parser, verified with a probe, and filed for a separate fix — it is a behavior change, so it deliberately does not ride this wave.Related Issue
None (review-driven; companion to PR #69).
Spec Sections Affected
None — implementation quality only; no grammar or behavior changes.
Checklist
docs/pages — N/A, no behavior changeBuild Targets Tested
cargo fmt --all -- --check,cargo clippy --workspace --all-targets -- -D warnings,cargo test --workspacegreen on the assembled base (54 tests, 13 corpus fixtures). Every sub-PR was individually CI-green and cubic-reviewed before merging.Test Plan
Token::textspan assertions (Store spans instead of lexeme strings on tokens #76), structural operator assertions (Replace operator strings with UnaryOp and BinaryOp enums #79), five attribute-shape/span tests (Preserve attribute arguments and actor-handler attrs in the AST #80), and 13 auto-discovered corpus fixtures (Expand the corpus to cover every accepted grammar shape #81).