feat: expose extended vector indexing options on createVectorIndex #2505

Open
erichare wants to merge 14 commits into main from feat/vector-index-options-2487

@erichare erichare commented Jun 15, 2026

Contributor

What this does

Adds a vectorIndexing field under definition.options of createVectorIndex so you can tune the SAI vector index beyond metric / sourceModel.

vectorIndexing accepts one of two forms, distinguished by JSON type:

  • String — a profile name from the in-code VectorIndexProfiles registry (e.g. "small-high-recall") that expands into a set of SAI options.
  • Object — raw Cassandra SAI options in snake_case, restricted to an allow-list: maximum_node_connections, construction_beam_width, neighborhood_overflow, alpha, enable_hierarchy. Unknown keys, reserved keys (similarity_function, source_model), and non-scalar values are rejected.

A profile and raw options are mutually exclusive. metric / sourceModel stay as their own fields and aren't allowed inside the options object.
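
The rules above can be sketched as a small resolver. This is a minimal illustration in Python, not the PR's Java code: the allow-list, the reserved keys, and the profile name come from the description, while `resolve_vector_indexing` and the values inside `PROFILES` are hypothetical placeholders.

```python
# Allow-list and reserved keys as named in the PR description.
ALLOWED_KEYS = {
    "maximum_node_connections",
    "construction_beam_width",
    "neighborhood_overflow",
    "alpha",
    "enable_hierarchy",
}
RESERVED_KEYS = {"similarity_function", "source_model"}

# Hypothetical registry: the profile name appears in the PR,
# the option values are placeholders chosen for illustration.
PROFILES = {
    "small-high-recall": {
        "maximum_node_connections": 32,
        "construction_beam_width": 200,
    },
}


def resolve_vector_indexing(value: object) -> dict:
    """Return the SAI options implied by a vectorIndexing value."""
    if isinstance(value, str):
        # String form: expand a named profile.
        if value not in PROFILES:
            raise ValueError(f"unknown vector indexing profile: {value!r}")
        return dict(PROFILES[value])
    if not isinstance(value, dict):
        raise ValueError("vectorIndexing must be a string or an object")
    # Object form: every key must be allow-listed and every value scalar.
    for key, option in value.items():
        if key in RESERVED_KEYS:
            raise ValueError(f"reserved key, use metric / sourceModel: {key}")
        if key not in ALLOWED_KEYS:
            raise ValueError(f"unknown option: {key}")
        if option is None or isinstance(option, (dict, list)):
            raise ValueError(f"non-scalar value for option: {key}")
    return dict(value)
```

Because the two forms are told apart by JSON type, a single request cannot carry both a profile and raw options, which is how the sketch reflects the mutual exclusivity.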

The profile name is not persisted. On read-back (listIndexes) the index's tuning options are matched against the known profiles and echoed as the profile name when they match exactly, otherwise as the raw options. Detection is a stopgap and will likely be replaced before prod (see the #2508 discussion).

{
  "createVectorIndex": {
    "name": "idx",
    "definition": {
      "column": "v",
      "options": {
        "metric": "cosine",
        "sourceModel": "openai-v3-small",
        "vectorIndexing": "small-high-recall"
      }
    }
  }
}

Or with raw options instead:

"vectorIndexing": { "maximum_node_connections": 32, "alpha": 1.2 }

Issues

Testing

  • Unit: deserialization (string/object/invalid), profile expansion, allow-list / reserved-key / non-scalar rejection, profile detection on read-back, apply→describe round-trip.
  • Integration (CreateTableIndexIntegrationTest): API-validation cases, backend-agnostic (DSE + HCD).
  • The tuning options need a cluster that allows custom SAI HNSW params; on clusters that don't, the DB error is surfaced.

Follow-ups

  • Apply vectorIndexing to createCollection.
  • Externalize profiles to config and tune the mappings.
  • Revisit profile detection vs a durable store before prod.

Checklist

  • Changes manually tested
  • Automated tests added/updated
  • Documentation added/updated (OpenAPI @Schema descriptions)
  • CLA Signed: DataStax CLA

erichare added 2 commits June 15, 2026 16:12
…2487)

Add an `indexingOptions` field to the createVectorIndex command's
`definition.options`. It accepts either:

- a String naming a predefined profile (expanded by the in-code
  VectorIndexProfiles registry into a set of SAI options), or
- an Object of raw Cassandra SAI indexing options, passed through
  verbatim using Cassandra's snake_case names (forward-compatible).

Anything else is rejected. The existing `metric` / `sourceModel` fields
are unchanged and remain the dedicated way to set similarity_function /
source_model; those keys are rejected inside the raw options object.

Implemented by mirroring the existing ApiTextIndex.analyzer JsonNode
pattern. Adds two SchemaException codes
(UNKNOWN_VECTOR_INDEXING_PROFILE, INVALID_VECTOR_INDEXING_OPTIONS) with
errors.yaml templates. listIndexes renders the resolved options back
under indexingOptions (excluding the structural and dedicated-field
keys).

Note: the new tuning options require the target backend to allow custom
SAI HNSW parameters; per the "pass-through" design, the API forwards the
options and surfaces the database error on backends that disallow them.
…ew cleanups

Address review feedback on #2487:

- Reject raw indexingOptions keys class_name/target (set automatically by
  the API) with INVALID_VECTOR_INDEXING_OPTIONS, symmetric with how
  renderIndexingOptions filters them on read. Adds unit + IT coverage.
- @Schema description for indexingOptions: use concatenated string literals
  (matching metric/sourceModel) and drop type=OBJECT to match the analyzer
  precedent for String-or-Object fields.
- Mark applyIndexingOptions/renderIndexingOptions @VisibleForTesting.
- Drop @DisplayName from the new unit tests to match repo convention.
- Remove unused CQLAnnIndex constants (neighborhood_overflow, alpha,
  enable_hierarchy); keep the two used by profiles.
- Use bare assertTableCommand in the new IT cases.
- errors.yaml: revert unrelated whitespace churn on UNKNOWN_VECTOR_METRIC.
@erichare erichare marked this pull request as ready for review June 16, 2026 15:34
@erichare erichare requested a review from a team as a code owner June 16, 2026 15:34
@erichare erichare marked this pull request as draft June 16, 2026 18:53
@erichare erichare linked an issue Jun 16, 2026 that may be closed by this pull request
Per #2509 (amorton), rename the new createVectorIndex option from
`indexingOptions` to `vectorIndexing` before it ships. The field is still
unreleased (added in #2487 / #2505), so this is a clean wire-name change
with no backwards-compatibility concern.

- VectorConstants.VectorColumn: INDEXING_OPTIONS -> VECTOR_INDEXING
  ("vectorIndexing"); this single constant drives the JSON key.
- Rename the record component JsonNode indexingOptions -> vectorIndexing
  so the Java field matches the wire name (as metric / sourceModel do).
- Update all user-visible text: the three INVALID_VECTOR_INDEXING_OPTIONS
  messages, the errors.yaml bodies (+ retitle "Vector indexing options
  are invalid"), and javadoc references.
- Update IT request bodies and assertion strings.

Internal identifiers that describe behavior rather than the wire field
are intentionally unchanged: the error codes
(INVALID_VECTOR_INDEXING_OPTIONS, UNKNOWN_VECTOR_INDEXING_PROFILE) and the
applyIndexingOptions / renderIndexingOptions helpers.

Verified: ./mvnw clean test -Dtest=ApiVectorIndexTest,VectorIndexProfilesTest
passes (18/18); fmt:check clean.
@erichare erichare marked this pull request as ready for review June 16, 2026 19:14
erichare added 2 commits June 16, 2026 16:54
…ted SAI options

Reshape the createVectorIndex `vectorIndexing` field from a polymorphic
String|Object into a structured object `{ profile, options }`:
- `profile` expands via VectorIndexProfiles; explicit `options` override it
- `options` keys are validated against an allow-list (maximum_node_connections,
  construction_beam_width, neighborhood_overflow, alpha, enable_hierarchy);
  reserved metric/sourceModel keys, unknown keys, and non-scalar values are rejected
- describeIndexingOptions filters to the allow-list and echoes `{ options }`
  (echoing the profile name back is a follow-up)

Persist the chosen profile name + the options it expanded to in the table
extensions (VECTOR_INDEX_PROFILES), clobber-safe across all extension writers.

Also: rename renderIndexingOptions -> describeIndexingOptions, trim comments,
normalize numeric option values to plain (non-scientific) strings, and expand
unit coverage (request deserialization, apply, describe, round-trip).
… on no-op create

Address review findings on the vectorIndexing profile persistence:
- Store the options actually applied to the index (profile expansion plus
  explicit overrides) in VECTOR_INDEX_PROFILES, not the base profile, so the
  snapshot matches the live index. Adds ApiVectorIndex.appliedTuningOptions()
  backed by a shared tuningOptions() allow-list filter also used by
  describeIndexingOptions.
- Skip the extension write when the index already exists, so a
  CREATE ... IF NOT EXISTS no-op no longer rewrites or removes a live index's
  stored profile.

Removing the profile entry on dropIndex is tracked as a follow-up (the drop
path is keyspace-scoped and needs a separate cleanup task).
@erichare erichare force-pushed the feat/vector-index-options-2487 branch from e8a9e7b to 495a459 Compare June 22, 2026 19:34
erichare added 3 commits June 22, 2026 13:12
…sting it

Per discussion on #2508: rather than storing the profile name in table
extensions (and cleaning it up on drop), reconstruct it on read-back by
matching the index's tuning options against the known profiles. Echo the
profile name when they match exactly, otherwise the raw options.

Removes the extension-storage path: VectorIndexProfileDefinition, the
VECTOR_INDEX_PROFILES extension, the create-side extension write, and the
dropIndex profile cleanup (DropVectorIndexProfileDBTask, removeIndexProfile).
The request-side API (vectorIndexing field, validation, profile expansion on
create) is unchanged.

Detection is a stopgap and will likely be replaced before prod.
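
The read-back matching described in this commit could look roughly like the following. It is a sketch in Python rather than the PR's Java code, and it rests on assumptions: `describe_vector_indexing` is a hypothetical name, the profile values are placeholders, option values are assumed to come back from the database as strings, and returning `None` when no tuning options are set is a guess at the behaviour.

```python
from typing import Optional, Union

# Tuning keys from the allow-list in the PR description.
TUNING_KEYS = frozenset({
    "maximum_node_connections",
    "construction_beam_width",
    "neighborhood_overflow",
    "alpha",
    "enable_hierarchy",
})

# Hypothetical registry; values are placeholders, stored as strings
# on the assumption that index options are read back as strings.
PROFILES = {
    "small-high-recall": {
        "maximum_node_connections": "32",
        "construction_beam_width": "200",
    },
}


def describe_vector_indexing(index_options: dict) -> Optional[Union[str, dict]]:
    """Echo a profile name on an exact match, otherwise the raw tuning options."""
    # Drop structural and dedicated-field keys (class_name, target,
    # similarity_function, source_model) by keeping only tuning keys.
    tuning = {k: v for k, v in index_options.items() if k in TUNING_KEYS}
    if not tuning:
        return None  # assumption: nothing to echo when no tuning was applied
    for name, expansion in PROFILES.items():
        if tuning == expansion:  # exact match: same keys and same values
            return name
    return tuning
```

One consequence visible in the sketch: raw options that happen to equal a profile's expansion are echoed as that profile's name, since nothing records which form the caller used.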
- Reject non-numeric/non-boolean vectorIndexing option values so a quote
  cannot break out of the CQL WITH OPTIONS literal (the driver renders
  option values unescaped); every allowed option is numeric or boolean.
- Remove VectorIndexUnknownOptionProbeIntegrationTest: an always-green
  stdout probe that asserted nothing and reflected into a private base
  method, and was failing CI on a connection-init error during setup.
- Declare the vectorIndexing @Schema as oneOf {String, Map} so OpenAPI
  reflects the real string-or-object wire contract.
- Add capability-gated create + listIndexes round-trip ITs (profile-name
  and raw-options echo); they skip via assumption when the backend lacks
  SAI_HNSW_ALLOW_CUSTOM_PARAMETERS.
- Use JsonUtil.nodeTypeAsString in the deserializer error, drop the
  inaccurate 'null token' Javadoc, and link the profile stopgaps to #2508.
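
As an illustration of the first bullet, a value renderer that only accepts numbers and booleans might look like this. It is a Python sketch under stated assumptions, not the PR's Java code: `render_option_value` is a hypothetical name, and the plain (non-scientific) number formatting follows the normalization mentioned in an earlier commit.

```python
import math
from decimal import Decimal


def render_option_value(value: object) -> str:
    """Render an option value as text that is safe inside a CQL options literal."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float) and math.isfinite(value):
        # Decimal(str(x)) keeps the short decimal form; the "f" format
        # spec expands any exponent into plain digits.
        return format(Decimal(str(value)), "f")
    # Strings (and everything else) are rejected, so no quote character
    # can ever reach the rendered statement.
    raise ValueError(f"option value must be numeric or boolean: {value!r}")
```

Rejecting strings outright, rather than escaping them, works here only because every allow-listed option is numeric or boolean.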
@erichare erichare requested a review from amorton June 22, 2026 21:51
@erichare

Contributor Author

@amorton I've updated the PR so that it no longer uses table extensions to persist the profile name, and no longer accepts both a profile and options; they're mutually exclusive, as per the design. Just opened it up for review!

…ejection

Per review feedback: the round-trip ITs skipped on any create error, so a
regression in deserialization, profile expansion, or option rendering would
show as skipped instead of failed. Skip now fires only when the single
response error names SAI_HNSW_ALLOW_CUSTOM_PARAMETERS; any other (or no)
error is asserted via wasSuccessful().

Also switch the raw-options case from 'alpha' (rejected as 'not understood by
StorageAttachedIndex', so never runnable even with the flag) to
maximum_node_connections + construction_beam_width, which the backend
recognizes and gates behind the flag, making the round-trip meaningful on a
flag-enabled cluster.
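
The narrowed skip rule can be summarised as a three-way decision. The sketch below is Python rather than the PR's Java test code, and `classify_create_result` is a hypothetical helper; only the flag name and the "single error naming the flag" condition come from the commit message.

```python
from typing import Sequence

CAPABILITY_FLAG = "SAI_HNSW_ALLOW_CUSTOM_PARAMETERS"


def classify_create_result(errors: Sequence[str]) -> str:
    """Decide whether a round-trip test should skip, pass, or fail."""
    # Skip only when the sole error is the capability gate.
    if len(errors) == 1 and CAPABILITY_FLAG in errors[0]:
        return "skip"
    # No error means the create succeeded; anything else is a real failure.
    return "pass" if not errors else "fail"
```

With this rule, a deserialization or rendering regression produces a different error message and lands in "fail" instead of being hidden as a skip.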
@stargate stargate deleted a comment from github-actions Bot Jun 22, 2026
@github-actions

Contributor

📈 Unit Test Coverage Delta vs Main Branch

Metric Value
Main Branch 52.97%
This PR 53.15%
Delta 🟢 +0.18%
✅ Coverage improved!

@github-actions

Contributor

Unit Test Coverage Report

Overall Project 53.15% -0.03% 🍏
Files changed 93.2% 🍏

File Coverage
VectorIndexProfiles.java 100% 🍏
VectorConstants.java 100% 🍏
SchemaException.java 100% 🍏
VectorIndexingDescDeserializer.java 100% 🍏
VectorIndexDefinitionDesc.java 64.71% -17.65% 🍏
ApiVectorIndex.java 52.6% -2.95% 🍏

@github-actions

Contributor

📈 Integration Test Coverage Delta vs Main Branch (dse69-it)

Metric Value
Main Branch 71.43%
This PR 71.44%
Delta 🟢 +0.01%
✅ Coverage improved!

@github-actions

Contributor

Integration Test Coverage Report (dse69-it)

Overall Project 71.44% -0.12% 🍏
Files changed 70.52% 🍏

File Coverage
VectorConstants.java 100% 🍏
SchemaException.java 100% 🍏
VectorIndexDefinitionDesc.java 82.35% -17.65% 🍏
VectorIndexingDescDeserializer.java 80.43% -19.57% 🍏
ApiVectorIndex.java 72.86% -12.24% 🍏
VectorIndexProfiles.java 52.83% -47.17%

@github-actions

Contributor

📈 Integration Test Coverage Delta vs Main Branch (hcd-it)

Metric Value
Main Branch 72.75%
This PR 72.79%
Delta 🟢 +0.04%
✅ Coverage improved!

@github-actions

Contributor

Integration Test Coverage Report (hcd-it)

Overall Project 72.79% -0.07% 🍏
Files changed 82.77% 🍏

File Coverage
VectorIndexDefinitionDesc.java 100% 🍏
VectorConstants.java 100% 🍏
SchemaException.java 100% 🍏
VectorIndexProfiles.java 92.45% -7.55% 🍏
VectorIndexingDescDeserializer.java 80.43% -19.57% 🍏
ApiVectorIndex.java 76.23% -8.86% 🍏

Development

Successfully merging this pull request may close these issues.

Add vector_indexing to CreateVectorIndex command
Add vector_indexing to CreateCollection command