RFC - Beta Filter For Disk Search#1101
Conversation
Adds an RFC proposing a `SearchPlan` enum to replace the disk search API's
`(vector_filter, is_flat_search)` parameter pair, and introducing beta-biased
graph search as a new capability. The hierarchical `SearchPlan { FlatScan,
Graph(GraphMode) }` shape makes invalid combinations unrepresentable and
provides a single extension point for future graph algorithms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reframe Motivation around supporting BetaFilter on the disk path and designing an extension point for future filter algorithms. - Move the full design doc content under the Proposal section. - Set Benchmark Results, Future Work, References to N/A initially, then add one Future Work item: align disk and in-memory BetaFilter semantics (disk applies bias + post-filter; in-memory only biases). - Drop the Multihop worked example and its back-references. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds an RFC proposing a new SearchPlan/GraphMode API for disk search to support beta-biased filtering, replace (vector_filter, is_flat_search) with a single typed parameter, and create an extensible hook for future graph search variants.
Changes:
- Introduces a hierarchical
SearchPlan { FlatScan, Graph(GraphMode) }model for disk search configuration. - Defines
Predicate,GraphModevariants (includingBetaFilter), and migration examples from the current API. - Documents intended internal plumbing changes (strategy projection,
pq_distancesbeta application, and post-filtering behavior).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1101 +/- ##
==========================================
+ Coverage 89.46% 90.53% +1.06%
==========================================
Files 482 482
Lines 91082 91082
==========================================
+ Hits 81491 82461 +970
+ Misses 9591 8621 -970
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
suri-kumkaran
left a comment
There was a problem hiding this comment.
Thanks for working through this — I’m aligned with the design. I do feel we can trim some of the redundant or overly explanatory parts of the RFC and leave a few of the implementation details for the implementation PRs.
- Add lifetime parameter to Predicate/SearchPlan/GraphMode so callers can pass closures borrowing non-'static data, matching the existing VectorFilter<'a, Data> flexibility. - Make GraphMode::beta_filter fallible (returns Result<Self, BetaError>) so invalid beta from JSON/CLI surfaces as a validation error, not a process crash. - Migrate the benchmark JSON schemas in the same PR (no external consumers; no grace period needed). - Restructure for outside-in narrative: search() signature → projection function → DiskSearchStrategy struct change. - Trim implementation-detail sections (pq_distances code block, data-flow diagram, plumbing table, testing list) that belong in the implementation PR. - Rename "The five cases" to "Supported configurations". - Add a Future Work item: align disk vs. in-memory BetaFilter semantics (disk applies bias + post-filter; in-memory only biases). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
suri-kumkaran
left a comment
There was a problem hiding this comment.
This is getting close!
- Introduce FilterMode<'a> sum type at the strategy layer: replaces the (Option<&Predicate>, Option<f32>) tuple with one enum (None | Filter | BetaFilter), making (None, Some(β)) unrepresentable and extending better for future variants (Multihop, LabelBeta, ...). - Fix the internal/external ID mismatch in RerankAndFilter::post_process: predicate is keyed on ExternalId per DataProvider contract; flat_search is correct; RerankAndFilter and the new pq_distances both add to_external_id conversion. Today's identity mapping hides the bug. - Add Non-Goal: newtyping DiskProvider::ExternalId is deferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
suri-kumkaran
left a comment
There was a problem hiding this comment.
Aligned on the design. ID convention is a clean fix and the FilterMode<'a> sum type makes the invariants compiler-enforced.
On Predicate<'a> vs Predicate ('static-only): fine either way — lifetime preserves borrow flexibility, 'static-only is simpler. Author's call.
Approving once the inline comments are resolved.
| Current state: | ||
|
|
||
| - **Input predicate is keyed on `ExternalId`** by `DataProvider` contract. | ||
| - **`Index::flat_search`** is correct — calls `to_external_id` before invoking the predicate ([diskann/src/graph/index.rs](../diskann/src/graph/index.rs)). |
There was a problem hiding this comment.
There's no flat_search in diskann/src/graph/index.rs (grep confirms — only disk_provider.rs has one). I think this might be a conflation with diskann::flat::FlatIndex::knn_search in diskann/src/flat/index.rs (the module from RFC 00983-flat-search) — that's a separate, parallel abstraction that the disk path doesn't delegate to.
The disk's own flat_search (disk_provider.rs:912) is a standalone implementation, and it invokes the predicate with positional/internal IDs directly:
let mut iter = (0..provider.num_points as u32).filter(vector_filter);No to_external_id call anywhere in the disk flat_search path (confirmed: grep for to_external_id in disk_provider.rs returns only the trait definition and one unit test). The 0..num_points idiom looks generic but is the positional/internal space by construction — there's no abstraction between integer counting and positional storage. Identity translation masks this today; under any future non-identity provider, flat-scan and graph paths would diverge silently.
So all three sites (flat_search, pq_distances, RerankAndFilter::post_process) are currently incorrect — this RFC needs to fix all three.
Suggested edits:
- Remove the "Sites already correct" bullet (line 269).
- Add
flat_searchto "Sites updated in the implementation PR" (line 262):flat_search(the disk-local one indisk_provider.rs) — callto_external_idon each id in(0..num_points)before invoking the predicate.
There was a problem hiding this comment.
Both BetaFilter and MultihopSearch take QueryLabelProvider<DP::InternalId> — their predicates run in the internal-id space. The only to_internal_id / to_external_id translations in index.rs happen at the insert/delete API boundary, not on the search path.
We can keep Predicate keyed on internal-id here too.
There was a problem hiding this comment.
Checking latest code, flat_search has been moved to disk_provider.rs and is using internal_id as input param. The "ID convension" part will be removed and Predicate will be keyed on internal_id.
There was a problem hiding this comment.
The motivation for external-keyed predicates was broader than the flat_search audit — that was a symptom, not the core argument. External-keyed is still the right contract because:
- Matches the
DataProviderabstraction; doesn't leak the positional internal representation. - Matches in-memory
BetaFilter(keyed onDP::ExternalIdviaQueryLabelProvider) — closes one corner of the divergence the Future Work item flags. - Future-proof: non-identity translation (sparse IDs, deletion compaction,
u64→u32) stays disk-internal; callers don't break silently.
Practically: how do external clients construct a predicate over internal IDs? They only see external IDs — from search outputs, ingest, the DataProvider contract. Asking them to maintain a bitmap over a space they can't observe pushes implementation details out as caller responsibility, and silently couples every consumer to today's identity-translation invariant.
The fix is small and inlines to a no-op under identity translation. Worth keeping in the RFC.
There was a problem hiding this comment.
DiskProvider doesn't really expose an Internal/External contract today. Both types are pinned to u32, VectorIdType = u32 is a hard where bound on the impl (disk_provider.rs:95-103), I'd suggest dropping the ID-convention change from this RFC.
impl<Data> DataProvider for DiskProvider<Data>
where Data: GraphDataType<VectorIdType = u32>,
{
type InternalId = u32;
type ExternalId = u32;
fn to_internal_id(&self, _: &DefaultContext, gid: &u32) -> Result<u32, _> { Ok(*gid) }
fn to_external_id(&self, _: &DefaultContext, id: u32) -> Result<u32, _> { Ok(id) }
}For Practical, external clients implements to_internal_id / to_external_id on their provider, so they define the translation and own both sides. The question isn't whether they can observe internal id — it's which side of the boundary is the natural place to do the translation.
For the 1:N domain model (which is what our team has: external = doc_id, internal = vector_id, one doc-id → many vector-ids), "caller side" is a feasible option.
- DataProvider's ExternalId ↔ InternalId is 1:1 by trait signature; a 1:N domain doesn't fit through it.
- So we keep the doc_id ↔ {vector_id} map outside the library, for example in a filtered scenario:
- Filtered doc-ids → vector-id bitmap (outside DiskANN).
- DiskANN filter-search returns vector-ids.
- Vector-ids → doc-ids (outside DiskANN).
There was a problem hiding this comment.
Two separate things getting mixed up — the outer mapping (domain ↔ vector IDs) and the inner consistency (which ID space DiskANN uses internally).
Outer mapping (1:N): agreed, this lives outside the library. DataProvider::ExternalId ↔ InternalId is 1:1 by signature, so doc-id ↔ {vector-id} has to be maintained caller-side. The flow you describe (filtered doc-ids → vector-id bitmap → DiskANN search → vector-ids → doc-ids) is the only way it can work. No disagreement there.
Inner consistency (the RFC's change): by the time DiskANN receives the predicate, it's already in vector-id space (= external ID). The question is whether DiskANN invokes that predicate with ExternalId or InternalId. Today three sites — RerankAndFilter::post_process, pq_distances, flat_search — pass internal directly. This is a latent bug: it produces wrong results the moment DiskProvider ever uses non-identity translation (e.g., for deletion compaction or sparse IDs added in place to the existing provider — no fork required).
Re: "DiskProvider doesn't expose the distinction today" — true, and that's exactly why the bug is invisible. The trait exists precisely so future in-place changes don't break consumers. Fixing now is essentially free (translation inlines to a no-op under identity). Leaving it means the next person extending DiskProvider has to remember to also audit every predicate site, with no compile-time help.
Want to keep the ID-convention section in this RFC — it pre-existing-bug we surface and fix here. Fix lands in the impl PR.
- Switch the documented ID convention back to internal IDs, matching the implementation: all three invocation sites (flat_search, pq_distances, RerankAndFilter) pass internal IDs uniformly; no to_external_id conversion at the predicate boundary. Identity mapping (InternalId == ExternalId == u32) makes this transparent to callers either way. - Remove the §ID convention section, the Non-Goal about newtyping ExternalId, and the related note in the Future Work item -- all obsolete after the contract flip. - Trim the "Project at the boundary" bullet in §Key design decisions: it dove into FilterMode internals before the reader had context, and the rationale already lives in §search_strategy projection. Replace with a shorter bullet phrased as "variant matching is delegated to search_strategy()" -- avoids the "accessor" terminology clash with the repo's existing Accessor trait. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@microsoft-github-policy-service agree company="Microsoft" |
Summary
Adds beta-biased filtering on the disk search path by introducing a
SearchPlanenum that replaces the existing(vector_filter, is_flat_search)parameter pair onsearcher.search(). The new enum is hierarchical (SearchPlan { FlatScan, Graph(GraphMode) }), makes invalid combinations unrepresentable, and gives future filter algorithms a single extension point without further growing the publicsearch()signature.Motivation
BetaFilterstrategy; the disk side has no equivalent.beta. Bolting it on as a thirdOption<f32>parameter creates meaningless combinations (is_flat_search=true+beta=Some(_)) the type system can't reject.MultihopSearchexists indiskannbut the disk API has no way to select it).What's in the RFC
SearchPlan { FlatScan { filter: Option<Predicate> }, Graph(GraphMode) }+GraphMode { Unfiltered, PostFilter(p), BetaFilter { predicate, beta } }.Predicate = Box<dyn Fn(u32) -> bool + Send + Sync>— closure-based.GraphModematch site insearch_strategy()that projects(predicate, beta); downstreamDiskAccessorandRerankAndFilterread those fields directly — adding a newGraphModevariant touches one place.Caller migration
(vector_filter, is_flat_search)plan(None, false)SearchPlan::graph()(None, true)SearchPlan::flat()(Some(p), false)SearchPlan::graph_with(GraphMode::post_filter(p))(Some(p), true)SearchPlan::flat_filtered(p)SearchPlan::graph_with(GraphMode::beta_filter(p, β))Known follow-ups (deferred)
GraphMode::BetaFilterapplies bias + hard post-filter; in-memoryBetaFilteronly biases. Future work to align.See
rfcs/01101-disk-beta-filter.mdfor the full design.