Drop support for SemanticDB #891

Draft
jupblb wants to merge 21 commits into main from michal/drop-semanticdb-4
Conversation

@jupblb
Member

@jupblb jupblb commented May 29, 2026

No description provided.

jupblb added 21 commits May 29, 2026 15:35
…ission

First two milestones of dropping the intermediate SemanticDB step in favour
of direct SCIP shard output from the Java compiler plugin.

Adds, with no behaviour change in the default config:

  semanticdb-javac:
    - ScipSymbols: helper that maps SemanticDB symbol strings to SCIP
      symbol strings. Globals get the '. . . . ' placeholder prefix that
      the aggregator later rewrites into 'scip-java maven g a v ...'.
      Locals are normalised to the canonical 'local N' form.
    - ScipShardWriter: write-or-merge helper for *.scip shards that
      deduplicates documents/symbols/occurrences across compiler rounds.
    - ScipShardFromSemanticdb: intermediate translator that converts the
      in-memory Semanticdb.TextDocument into a single-document Scip.Index
      shard. To be replaced by a direct-from-AST ScipVisitor in Milestone 3.
    - SemanticdbJavacOptions: new -emit-scip:on|off flag (default off).
    - SemanticdbTaskListener: when -emit-scip:on is set, also writes a
      *.scip shard under META-INF/scip/ alongside the existing *.semanticdb
      file, reusing the already-built TextDocument.
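The ScipSymbols mapping described above can be pictured roughly as follows. This is only a sketch: the class name `ScipSymbolsSketch` is hypothetical, and it assumes SemanticDB locals are spelled `local` followed by digits. The real `ScipSymbols` helper in this PR may handle more cases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not the actual ScipSymbols class from this PR. */
final class ScipSymbolsSketch {
  /** Placeholder prefix quoted in the commit message; the aggregator rewrites it later. */
  static final String PLACEHOLDER_PREFIX = ". . . . ";

  // Assumption: SemanticDB local symbols look like "local0", "local1", ...
  private static final Pattern LOCAL = Pattern.compile("local(\\d+)");

  static String fromSemanticdbSymbol(String symbol) {
    Matcher m = LOCAL.matcher(symbol);
    if (m.matches()) {
      // Locals are normalised to the canonical "local N" form.
      return "local " + m.group(1);
    }
    // Globals keep their descriptor text and gain the placeholder prefix.
    return PLACEHOLDER_PREFIX + symbol;
  }

  public static void main(String[] args) {
    System.out.println(fromSemanticdbSymbol("local3"));
    System.out.println(fromSemanticdbSymbol("com/example/Foo#bar()."));
  }
}
```

Under these assumptions, `local3` becomes `local 3`, and a global such as `com/example/Foo#bar().` is emitted with the `. . . . ` prefix in front of it.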

  scip-semanticdb:
    - ScipShardWalker: recursively collects *.scip shards under the
      configured targetroots, mirroring SemanticdbWalker.
    - SymbolRewriter: rewrites placeholder global symbols into the final
      'scip-java maven ...' form using PackageTable. Locals and already
      rewritten symbols pass through unchanged.
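A minimal sketch of the rewrite step, under stated assumptions: the map below is a stand-in for PackageTable, the package text `example-lib 1.0.0` is invented for illustration, and `. .` as the rendering of an unknown package is a guess rather than the actual Package.EMPTY output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not the actual SymbolRewriter from this PR. */
final class SymbolRewriterSketch {
  static final String PLACEHOLDER_PREFIX = ". . . . ";
  static final String FINAL_PREFIX = "scip-java maven ";
  // Assumption: how an unknown package is rendered; the real fallback may differ.
  static final String EMPTY_PACKAGE = ". .";

  // Stand-in for PackageTable: descriptor prefix -> package text.
  private final Map<String, String> packageTable = new LinkedHashMap<>();

  void register(String descriptorPrefix, String packageText) {
    packageTable.put(descriptorPrefix, packageText);
  }

  String rewrite(String symbol) {
    if (!symbol.startsWith(PLACEHOLDER_PREFIX)) {
      // Locals and already rewritten symbols pass through unchanged.
      return symbol;
    }
    String descriptor = symbol.substring(PLACEHOLDER_PREFIX.length());
    String pkg = EMPTY_PACKAGE;
    for (Map.Entry<String, String> entry : packageTable.entrySet()) {
      if (descriptor.startsWith(entry.getKey())) {
        pkg = entry.getValue(); // first registered prefix that matches wins
        break;
      }
    }
    return FINAL_PREFIX + pkg + " " + descriptor;
  }
}
```

Checking the prefix first is what makes the rewrite idempotent: running it twice over the same symbol leaves the second pass with nothing to do.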

  build.sbt:
    - javacPlugin now depends on scipProto so the plugin can emit Scip.*
      protobuf messages directly.
    - Discard top-level Bazel BUILD files from fat-jar merge so the new
      scipProto resources don't collide with semanticdb-java.

  tests/unit:
    - ScipSymbolsSuite: unit tests for ScipSymbols and SymbolRewriter,
      including the local/global discrimination and Package.EMPTY fallback.
    - ScipShardEmissionSuite: end-to-end test that drives javac with the
      semanticdb plugin and -emit-scip:on, then parses the produced
      Scip.Index shard to assert the document layout and that every
      emitted symbol either uses the placeholder prefix or is a 'local N'.

All 29 unit tests pass.

Milestone 3 of the SemanticDB->SCIP migration: replace the bridge that
went through ScipShardFromSemanticdb with a direct AST walk that
produces Scip.Document values.

  - ScipVisitor: fork of SemanticdbVisitor with identical traversal
    semantics. Emits Scip.Occurrence, Scip.SymbolInformation, and
    Scip.Relationship directly. Symbols still come from the existing
    GlobalSymbolsCache/LocalSymbolsCache and are translated to the
    placeholder SCIP form via ScipSymbols.fromSemanticdbSymbol at the
    emission boundary. Skips signatures and annotations for now -
    ScipSignatureFormatter in Milestone 4 will add signature_documentation.

  - SemanticdbTaskListener: when -emit-scip:on is set, runs ScipVisitor
    directly instead of converting from Semanticdb.TextDocument. This is a
    second AST walk during the transition; SemanticdbVisitor remains the
    sole producer of legacy *.semanticdb files until Milestone 8.

  - ScipShardFromSemanticdb: deleted; no longer needed now that ScipVisitor
    produces the same shard format natively.

All 29 unit tests pass, including the end-to-end ScipShardEmissionSuite
that exercises the new ScipVisitor through real javac invocations.

Milestone 4: emit SCIP signature_documentation directly from the compiler
plugin, eliminating the need to format signatures from a SemanticDB
intermediate representation.

  - ScipSignatureFormatter: walks javac Element/TypeMirror and produces
    a readable Java declaration string. Supports classes, interfaces,
    annotations, enums, methods, constructors, fields, parameters,
    locals, enum constants, and type parameters with bounds. The internal
    TypePrinter handles declared types, type arguments, arrays,
    primitives, type variables, wildcards, intersections, and void.
    Suppresses implicit 'extends Object' and 'java.lang.Object' supertypes.

  - ScipVisitor: when a definition is emitted, the formatter is invoked
    and (when the result is non-empty) the signature is attached to
    SymbolInformation.signature_documentation with language 'Java' and
    the current source's relative path.

  - ScipShardEmissionSuite: extended end-to-end checks. Verifies the
    shard contains at least one signature_documentation block, that the
    Foo class symbol's signature contains 'class Foo', and that the bar()
    method symbol's signature contains 'int bar('.

All 29 unit tests pass.

Milestone 5: parallel aggregator that walks *.scip shards produced by
ScipVisitor and emits a final scip-java-scheme index.scip. The existing
SemanticDB-based ScipSemanticdb.run() is untouched.

  - ScipShardAggregator:
      * walks for *.scip shards (and *.jar files containing them) via
        ScipShardWalker
      * parses each shard into a Scip.Index
      * rewrites placeholder global symbols ('. . . . ' prefix) into the
        final 'scip-java maven g a v ...' form via SymbolRewriter
      * deduplicates documents by relative_path, merging occurrences and
        symbol-info entries from annotation-processor rounds
      * computes inverse 'is_implementation && is_reference' relationships
        across the whole project, gated on options.emitInverseRelationships
      * emits one Metadata Index plus one Index per merged Document via
        ScipWriter
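The "deduplicate documents by relative_path" step could look roughly like the sketch below. It uses plain Java stand-ins (`Doc`, with occurrences as strings) instead of the generated `Scip.Document` type, so the names and shapes are illustrative only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of merging per-round shards; not the actual ScipShardAggregator. */
final class DocumentMergeSketch {
  /** Stand-in for Scip.Document; occurrences are simplified to strings. */
  record Doc(String relativePath, List<String> occurrences) {}

  static List<Doc> mergeByRelativePath(List<Doc> shards) {
    // LinkedHashMap/LinkedHashSet keep first-seen order so the output is deterministic.
    Map<String, Set<String>> merged = new LinkedHashMap<>();
    for (Doc doc : shards) {
      merged
          .computeIfAbsent(doc.relativePath(), key -> new LinkedHashSet<>())
          .addAll(doc.occurrences());
    }
    List<Doc> result = new ArrayList<>();
    for (Map.Entry<String, Set<String>> entry : merged.entrySet()) {
      result.add(new Doc(entry.getKey(), new ArrayList<>(entry.getValue())));
    }
    return result;
  }
}
```

Two shards for the same source file, as might be produced by separate annotation-processor rounds, collapse into one document whose occurrences are the union of both.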

  - ScipAggregationSuite: end-to-end test that compiles a Java source with
    -emit-scip:on, runs ScipShardAggregator over the produced shards, and
    asserts the final index has metadata with the scip-java tool name and
    that every emitted symbol/occurrence is either local or starts with
    'scip-java maven '.

All 30 unit tests pass.

Milestone 6: surface the direct-SCIP path through the existing
index-semanticdb command and through the Maven / ScipBuildTool paths so
end-to-end indexing can use ScipShardAggregator. Default behaviour is
unchanged.

  - IndexSemanticdbCommand: new --use-scip-shards flag. When set, the
    command runs ScipShardAggregator (walking META-INF/scip/*.scip) instead
    of ScipSemanticdb (walking META-INF/semanticdb/*.semanticdb).

  - SemanticdbOptionBuilder: reads -Dsemanticdb.emit-scip and appends
    '-emit-scip:on' to the injected -Xplugin:semanticdb argument so the
    custom javac wrapper emits SCIP shards.

  - Embedded.customJavac: new optional emitScip parameter; when true,
    propagates -Dsemanticdb.emit-scip=true into the launched javac
    wrapper.

  - MavenBuildTool: forwards index.indexSemanticdb.useScipShards to the
    customJavac wrapper.

  - ScipBuildTool: when useScipShards is on, appends '-emit-scip:on'
    to the directly-constructed -Xplugin:semanticdb arguments used by
    the in-process javac compilation.

Not yet wired (deferred):
  - SemanticdbGradlePlugin propagation
  - BazelBuildTool / scip_java.bzl
  - Kotlin guard for projects that mix Java+Kotlin sources

All 30 unit tests pass.

…hots

Drives the minimized snapshot suite through the new SCIP-direct path
(via --use-scip-shards) and reconciles the resulting output so it can be
locked in as the canonical scheme.

  semanticdb-javac:
    - ScipVisitor: lowercase Document.language to 'java' (matching the
      historical ScipSemanticdb output) and add (range, symbol, roles)
      dedup of occurrences, preferring the variant that carries an
      enclosing_range. Multiple ANALYZE rounds otherwise emit a second
      definition occurrence without enclosing_range that survived the
      structural-equality dedup in ScipShardWriter.
    - ScipVisitor: treat ENUM the same as CLASS/INTERFACE/ANNOTATION_TYPE
      in supportsReferenceRelationship so parent relationships don't get
      a spurious is_reference flag.
    - ScipShardWriter: switch occurrence merge to the looser
      (range, symbol, roles) key, preferring entries with enclosing_range.
    - SemanticdbTaskListener: delete the stale .scip shard alongside the
      .semanticdb file on ENTER so re-runs don't accumulate occurrences
      across builds.
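The looser (range, symbol, roles) dedup with enclosing_range preference can be sketched as below. `Occ` and `Key` are hypothetical stand-ins for `Scip.Occurrence`; an empty list models a missing enclosing_range.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the occurrence dedup rule; not the PR's actual code. */
final class OccurrenceDedupSketch {
  /** Stand-in for Scip.Occurrence; an empty enclosingRange means "not set". */
  record Occ(List<Integer> range, String symbol, int roles, List<Integer> enclosingRange) {
    boolean hasEnclosingRange() {
      return !enclosingRange.isEmpty();
    }
  }

  /** Dedup key that deliberately ignores enclosingRange. */
  record Key(List<Integer> range, String symbol, int roles) {}

  static List<Occ> dedup(List<Occ> occurrences) {
    Map<Key, Occ> byKey = new LinkedHashMap<>();
    for (Occ occ : occurrences) {
      Key key = new Key(occ.range(), occ.symbol(), occ.roles());
      Occ existing = byKey.get(key);
      // Keep the first entry unless a later duplicate adds an enclosing range.
      if (existing == null || (!existing.hasEnclosingRange() && occ.hasEnclosingRange())) {
        byKey.put(key, occ);
      }
    }
    return new ArrayList<>(byKey.values());
  }
}
```

Structural equality alone would treat the two variants as different occurrences, which is why the key leaves enclosing_range out.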

  scip-semanticdb:
    - ScipShardAggregator: mergeInto now uses the same (range, symbol,
      roles) dedup with enclosing_range preference, and merges duplicate
      symbol relationships across shards.

  build.sbt:
    - Add -emit-scip:on to the minimized javac plugin invocation so the
      tests/minimized targetroot always contains shard files.

  tests/snapshots:
    - MinimizedSnapshotScipGenerator now passes --use-scip-shards to
      drive the snapshot suite through ScipShardAggregator.
    - Regenerate all 23 minimized snapshots under the new 'scip-java'
      symbol scheme.

  tests/unit:
    - ScipShardEmissionSuite: update assertions to expect the lowercase
      'java' language string.

Full snapshot suite passes (102 tests). Unit suite passes (30 tests).

After M3-M7 the per-source SCIP shard format is stable and the
ScipShardAggregator produces equivalent output to the legacy
SemanticDB->SCIP path. This commit promotes the cheap compiler-side
half of the dual-emission to be on by default so that:

  - any javac plugin invocation (sbt, Maven, Bazel, ad-hoc) writes a
    *.scip shard under META-INF/scip/ alongside the *.semanticdb file
    without needing an explicit -emit-scip:on flag;
  - users (or build tools) that want to consume the new path only need
    to flip the CLI switch (--use-scip-shards) once the indexer runs;
  - legacy callers that only read *.semanticdb files are unaffected.

The CLI default for index-semanticdb's --use-scip-shards remains false
because the broader ecosystem (notably the Kotlin compiler and the
existing snapshot/build tool integrations) still produces only
*.semanticdb. That flip is deferred to a follow-up PR.

  semanticdb-javac:
    - SemanticdbJavacOptions.emitScip defaults to true. -emit-scip:off
      is now the explicit opt-out and is documented as the legacy path.

  scip-java:
    - SnapshotCommand: skip per-source shards (those without a
      metadata.project_root) so 'scip-java snapshot' continues to render
      only the top-level aggregator output. Per-source shards have no
      project_root and would otherwise crash with 'missing scheme'
      when their relative paths are resolved into a URI.

  build.sbt:
    - Drop the now-redundant -emit-scip:on flag from the minimized
      project; the plugin default already emits shards.

  tests/unit:
    - ScipShardEmissionSuite: invert the off-path test so it explicitly
      passes -emit-scip:off; the previous test relied on the old
      default of false.

Full snapshot suite (102 tests) and unit suite (30 tests) green.

Post-PR1 cleanup of dead code, redundant flag plumbing, and duplication.
No behavioral changes; snapshot suite (102 passed) and unit suite (28 passed)
remain green.

Dead code removed:
- ScipShardAggregator: drop unused documentsFromShards{,Collected}
  and their Stream/Collectors imports.
- ScipSymbols: drop unused isPlaceholderGlobal/descriptorPath; only
  fromSemanticdbSymbol + PLACEHOLDER_PREFIX are needed in production.
- ScipSymbolsSuite: drop the tests for the removed helpers.

Redundant -emit-scip:on plumbing removed:
With compiler-side default emitScip=true (M8), the CLI/build-tool
machinery that conditionally toggled the flag is purely cosmetic.
- Embedded.customJavac: drop emitScip param + emitScipProp system
  property prefix.
- MavenBuildTool: stop passing emitScip = useScipShards.
- ScipBuildTool: stop appending -emit-scip:on to the -Xplugin string.
- SemanticdbOptionBuilder: drop EMIT_SCIP system-property handling and
  the corresponding xpluginOption branch.
- SemanticdbJavacOptions still parses -emit-scip:on / -emit-scip:off as
  the compiler-side opt-out.
- IndexSemanticdbCommand help text no longer implies the shards require
  an extra compiler flag.

Internal duplication removed:
- New ScipOccurrences package-private helper centralizes the
  (symbol, range, roles) dedup key and the 'prefer enclosing_range'
  merge rule that ScipVisitor and ScipShardWriter both used.
- ScipShardWriter.mergeSymbol now uses LinkedHashMap for relationships
  so output ordering is deterministic.

Small ScipVisitor cleanups:
- Drop dependency on Semanticdb Property bitmask; compute isStatic /
  isAbstract directly from Modifier set.
- Make 'source' final and initialized via a static sourceText helper.
- Merge identical switch arms for ENUM/CLASS/INTERFACE/ANNOTATION_TYPE
  in emitSymbolInformation.
- Refresh stale class-level javadoc; signature docs are now produced
  via ScipSignatureFormatter.

Add the scaffolding required for the Kotlin compiler plug-in to emit
SCIP shards directly, mirroring the Java side from PR1. This commit is
passive: the new types are not wired into the analyzer yet, so behavior
is unchanged.

- semanticdb-kotlinc now depends on scipProto so it can reference the
  generated SCIP protobuf types.
- ScipSymbols: placeholder symbol formatter that produces the same
  '. . . . <path>' globals and canonical 'local N' locals the aggregator
  already understands.
- ScipOccurrences: deduplicates occurrences by (symbol, range, roles),
  preferring entries that carry an enclosing_range.
- ScipShardWriter: writes a per-source-file Scip.Index shard with
  overwrite semantics, matching ScipShardWriter on the Java side.
- ScipTextDocumentBuilder: assembles a Scip.Document for one source
  using the above helpers.

Wire the Kotlin compiler plug-in so a single analyzer pass populates
both the existing SemanticdbTextDocumentBuilder and the new
ScipTextDocumentBuilder. The PostAnalysisExtension now writes:

  - META-INF/semanticdb/<path>.semanticdb (unchanged)
  - META-INF/scip/<path>.scip             (new)

Behavior of consumers that still read .semanticdb is preserved; the
companion CLI change to actually consume the .scip shards lands in K4.
Legacy SemanticDB emission is intentionally kept for now and will be
removed in a later cleanup PR.

Two small robustness fixes uncovered while validating PR2 end-to-end:

- ScipWriter.build(): create parent directories before moving the
  temporary aggregated output into place so callers may target paths
  whose enclosing directory does not yet exist (e.g. target/scip-index/).
- ScipShardWalker: restrict the walk to files under META-INF/scip/ so
  an aggregated index.scip co-located inside a targetroot is not
  re-ingested as a shard on subsequent runs.
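Both fixes can be sketched with java.nio alone. The class and method names below are hypothetical, and the sketch assumes shards live directly under `<targetroot>/META-INF/scip/`; it ignores the *.jar case the real walker also handles.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of the two robustness fixes; not the PR's actual code. */
final class ShardFilesSketch {
  /** Moves a finished temporary index into place, creating missing parent directories first. */
  static void moveIntoPlace(Path tmp, Path target) throws IOException {
    Path parent = target.toAbsolutePath().getParent();
    if (parent != null) {
      Files.createDirectories(parent); // no-op when the directory already exists
    }
    Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
  }

  /** Collects *.scip files under META-INF/scip/ only, so other .scip files are never picked up. */
  static List<Path> findShards(Path targetroot) throws IOException {
    Path shardRoot = targetroot.resolve("META-INF").resolve("scip");
    if (!Files.isDirectory(shardRoot)) {
      return List.of();
    }
    try (Stream<Path> paths = Files.walk(shardRoot)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(path -> path.getFileName().toString().endsWith(".scip"))
          .sorted()
          .collect(Collectors.toList());
    }
  }
}
```

Because the walk starts at `META-INF/scip/` rather than at the targetroot, an aggregated `index.scip` written elsewhere inside the targetroot is skipped on the next run.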

Now that both javac and kotlinc emit .scip shards under
META-INF/scip/, switch the CLI default to read from those shards and
update build wiring + a unit fixture that asserted the old scheme.

- IndexSemanticdbCommand: --use-scip-shards defaults to true; the help
  text reflects that javac and kotlinc both ship shards.
- build.sbt (kotlincSnapshots task): pass --use-scip-shards and write
  the aggregated index.scip into target/scip-index/ so the next
  invocation does not walk over its own previous output.
- SnapshotCommandSuite: expected symbol scheme is now
  'scip-java maven ...' instead of 'semanticdb maven ...'.

Outputs reflect the new direct-from-SCIP scheme:

- symbols are emitted under 'scip-java maven ...' instead of the legacy
  'semanticdb maven ...' scheme,
- Kotlin symbol info now carries SCIP-native fields such as
  'signature_documentation kotlin ...' and 'kind ...'.

Regenerated with:

  sbt 'snapshots/Test/runMain tests.SaveSnapshots'

Introduce small in-package value types so the direct-to-SCIP visitor no
longer reaches into the SemanticDB-generated protobuf classes:

* ScipRange: holds (startLine, startCharacter, endLine, endCharacter)
  with an asScipRange() helper that produces the compact 3/4-int form
  SCIP expects.
* ScipRole: minimal {DEFINITION, REFERENCE, SYNTHETIC_DEFINITION}
  mirror of Semanticdb.SymbolOccurrence.Role.
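For readers unfamiliar with the compact range encoding: SCIP stores a range as three integers when it starts and ends on the same line, and four otherwise. A minimal sketch of a value type like ScipRange follows; the class name and field layout are assumed for illustration, not copied from the PR.

```java
import java.util.List;

/** Hypothetical sketch of a range value type; not the PR's actual ScipRange. */
final class ScipRangeSketch {
  final int startLine;
  final int startCharacter;
  final int endLine;
  final int endCharacter;

  ScipRangeSketch(int startLine, int startCharacter, int endLine, int endCharacter) {
    this.startLine = startLine;
    this.startCharacter = startCharacter;
    this.endLine = endLine;
    this.endCharacter = endCharacter;
  }

  /** Returns [startLine, startChar, endChar] for single-line ranges, else all four values. */
  List<Integer> asScipRange() {
    if (startLine == endLine) {
      return List.of(startLine, startCharacter, endCharacter);
    }
    return List.of(startLine, startCharacter, endLine, endCharacter);
  }
}
```

For example, a range covering columns 4 to 9 of line 2 encodes as `[2, 4, 9]`, while a range from line 2 to line 5 keeps all four values.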

Update ScipVisitor to use the new types end-to-end. Pure refactor: no
behavior change, no snapshot churn. Sets up D2..D5 where the legacy
SemanticDB types/modules will be deleted.

Drop the legacy SemanticDB code path from the Java compiler plug-in:

* SemanticdbTaskListener: stop building Semanticdb.TextDocument and
  writing META-INF/semanticdb/*.semanticdb; ScipVisitor is now the only
  emitter, producing META-INF/scip/*.scip directly. The shard path is
  computed without going through a SemanticDB intermediate.
* Delete unused legacy emitter sources:
    - SemanticdbVisitor.java
    - SemanticdbTypeVisitor.java
    - SemanticdbSignatures.java
    - SemanticdbTrees.java
* SemanticdbJavacOptions: drop the emitScip field. Keep -emit-scip:on
  and -emit-scip:off as deprecated no-ops so cached compiler options
  keep working without erroring.
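Keeping a removed flag as a deprecated no-op usually means the parser still recognises it but changes no state. A sketch under that assumption is shown below; the class is hypothetical, and it treats every other argument as unknown purely to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of accepting deprecated flags; not the actual SemanticdbJavacOptions. */
final class DeprecatedFlagSketch {
  final List<String> ignoredDeprecated = new ArrayList<>();
  final List<String> errors = new ArrayList<>();

  static DeprecatedFlagSketch parse(List<String> args) {
    DeprecatedFlagSketch result = new DeprecatedFlagSketch();
    for (String arg : args) {
      if (arg.equals("-emit-scip:on") || arg.equals("-emit-scip:off")) {
        // Still accepted so cached compiler options do not fail, but it has no effect.
        result.ignoredDeprecated.add(arg);
      } else {
        // Simplification for this sketch: everything else is reported as unknown.
        result.errors.add("unknown flag: " + arg);
      }
    }
    return result;
  }
}
```

Recording the ignored flags, rather than silently dropping them, leaves room to emit a deprecation notice later without changing the parsing logic.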

Migrate the test infrastructure to consume the SCIP shard output:

* CompileResult: replace textDocuments(Semanticdb.TextDocuments) with
  documents(List[Scip.Document]) plus a documentsFromShard helper.
* TestCompiler: read META-INF/scip/<rel>.scip back from disk after
  javac runs and surface the documents through CompileResult.
* OverridesSuite: assert on SymbolInformation.relationships
  (is_implementation=true) instead of Semanticdb.getOverriddenSymbolsList.
* TargetedSuite: compare positions against Scip.Occurrence.range and
  strip the placeholder prefix when comparing symbols.
* GeneratedConstructorSuite: switch the stub signature from
  Semanticdb.TextDocument to Scip.Document.
* JavacClassesDirectorySuite: verify the shard lands at
  META-INF/scip/.../Example.java.scip.
* ScipShardEmissionSuite: assert that the legacy .semanticdb file is
  NOT produced; replace the -emit-scip:off shard-suppression test with
  one that verifies the deprecated flag is still accepted as a no-op.
* BaseBuildToolSuite: rename semanticdbPattern/semanticdbFiles to
  scipShardPattern/scipShards and match META-INF/scip/**.scip so the
  Gradle/Maven build tool suites continue to count the right files.
* Delete tests/snapshots/src/main/scala/tests/SemanticdbFile.scala
  (no remaining callers).

Validation: sbt unit/test (28 passing), sbt snapshots/test
(102 passing).

Drop the legacy SemanticDB code path from the Kotlin compiler plug-in:

* ScipRole: new local enum mirroring the DEFINITION/REFERENCE subset
  of Semanticdb.SymbolOccurrence.Role.
* SemanticdbVisitor: drop the documentBuilder field and
  build()/Semanticdb.TextDocument helper; the visitor now only feeds
  ScipTextDocumentBuilder and uses ScipRole at every emit site.
* ScipTextDocumentBuilder: switch role parameter from
  Semanticdb.SymbolOccurrence.Role to ScipRole.
* PostAnalysisExtension: remove the SemanticDB write path and the
  (Semanticdb.TextDocument) -> Unit callback; the extension now only
  walks the visitors and writes META-INF/scip/<rel>.scip shards.
* AnalyzerRegistrar: remove the SemanticDB callback parameter.

Delete the legacy implementation source:
  * SemanticdbTextDocumentBuilder.kt

Delete the legacy Kotlin test suites that asserted on Semanticdb
protobuf output:
  * src/test/kotlin/.../test/AnalyzerTest.kt (1528 lines)
  * src/test/kotlin/.../test/SemanticdbSymbolsTest.kt (726 lines)
  * src/test/kotlin/.../test/Utils.kt (203 lines)

The Kotlin compiler plug-in behavior remains covered end-to-end by the
existing snapshot suites (semanticdb-kotlinc/minimized fixtures + the
exposed-core library snapshot regenerated in PR2 K5).

Validation: sbt unit/test (28 passing), sbt snapshots/test
(102 passing).

The aggregator now consumes SCIP shards only. The legacy
SemanticDB-based reader/aggregator is removed.

Wiring:

* IndexSemanticdbCommand: remove the --use-scip-shards flag and the
  ScipSemanticdb.run() else branch; always call ScipShardAggregator.
* BazelBuildTool: switch the Bazel main entry to ScipShardAggregator.
* MinimizedSnapshotScipGenerator: drop the --use-scip-shards argument
  (the default switched in PR2 K4 and the flag is being removed now).
* build.sbt (kotlincSnapshots): drop --use-scip-shards from the
  index-semanticdb invocation.

Delete the legacy SemanticDB-consuming aggregator sources, none of
which have any remaining callers:

  * ScipSemanticdb.java
  * SemanticdbWalker.java
  * SemanticdbTreeVisitor.java
  * ScipTextDocument.java
  * SignatureFormatter.java
  * SignatureFormatterException.java
  * SymbolOccurrences.java
  * Symtab.java
  * RangeComparator.java

Validation: sbt unit/test (28 passing), sbt snapshots/test
(102 passing).

- Remove generated SemanticDB protobuf module:
  - semanticdb-java/src/main/protobuf/semanticdb.proto
  - semanticdb-java/src/main/protobuf/BUILD
  - semanticdb-kotlinc/src/main/proto/.../semanticdb.proto
- Delete unused SemanticDB builder helpers:
  - semanticdb-java/.../SemanticdbBuilders.java
  - semanticdb-kotlinc/.../SemanticdbBuilders.kt
- Keep a minimal semanticdb-java module that only ships
  SemanticdbSymbols.java (a pure-Java symbol helper still consumed by
  semanticdb-javac and scip-semanticdb), without protobuf generation.
- Update sbt and Bazel build files accordingly.

The Java/Kotlin compiler plugins now emit per-file SCIP shards directly
and the 'index-semanticdb' command aggregates those shards into a single
SCIP index. Update user-facing strings and docs to describe the actual
behavior instead of the now-removed SemanticDB->SCIP conversion step.

Keep compatibility names (Xplugin:semanticdb, index-semanticdb CLI,
semanticdb-targetroot directory, semanticdb-javac module/package) so
existing build integrations keep working.