feat(RFC): implement candidate retrieval checkpoints D1 and D2#944

Open
Abhijeet2409 wants to merge 3 commits into
OWASP:mainfrom
Abhijeet2409:feature/candidate-retrieval-d1-d2

@Abhijeet2409

Contributor

Summary

This PR implements Workstream D Checkpoints D1 and D2 from the OWASP Cheat Sheets to CRE Mapping pipeline RFC.

Added

1. candidate_cre_defs.py

This file adds the CandidateCRE dataclass representing a single CRE candidate retrieved for a CheatsheetRecord.

The RFC does not explicitly define the CandidateCRE contract, so the structure was designed around the requirements of downstream re-ranking.

Fields

  • cre_id — CRE's external id (e.g. "623-550")
  • name — CRE's name
  • description — text to be used for LLM reasoning
  • score — raw cosine similarity score

description is allowed to be empty as it is optional in the existing CRE DB model. In cases where no description is provided, Workstream E's LLM can fall back to name for reasoning.

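Basic validation and normalization are implemented in __post_init__: cre_id and name must be non-empty strings, cre_id must match the NNN-NNN pattern, description must be a string, score must be a float in [0.0, 1.0], and all string fields are stripped.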
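
For readers without the diff open, here is a minimal sketch of what such a dataclass could look like. It is not the code in candidate_cre_defs.py: the NNN-NNN pattern and the [0.0, 1.0] score range are taken from the review walkthrough in this thread, while the strip-before-validate order and the use of fullmatch are assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Assumed ID shape (NNN-NNN), inferred from the "623-550" example.
_CRE_ID_PATTERN = re.compile(r"\d{3}-\d{3}")


@dataclass
class CandidateCRE:
    """One CRE candidate retrieved for a CheatsheetRecord (illustrative sketch)."""

    cre_id: str
    name: str
    description: str  # may be empty; optional in the CRE DB model
    score: float

    def __post_init__(self) -> None:
        # Normalize first so whitespace-padded values validate cleanly.
        for field_name in ("cre_id", "name", "description"):
            value = getattr(self, field_name)
            if not isinstance(value, str):
                raise TypeError(f"{field_name} must be a string")
            setattr(self, field_name, value.strip())

        if not self.name:
            raise ValueError("name must not be empty")
        # fullmatch rejects trailing characters such as "623-5501".
        if not _CRE_ID_PATTERN.fullmatch(self.cre_id):
            raise ValueError(f"cre_id must look like NNN-NNN, got {self.cre_id!r}")

        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.score, bool) or not isinstance(self.score, (int, float)):
            raise TypeError("score must be a number")
        self.score = float(self.score)
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be in [0.0, 1.0]")
```

With this shape, CandidateCRE(" 623-550 ", "Example name", "", 0.5) is accepted and stored with cre_id "623-550", while an ID with extra characters raises ValueError.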


2. cheatsheet_cre_retriever.py

This file adds:

retrieve_candidate_cres(record, cache, ph, top_k)

The retrieval implementation follows the existing cosine similarity approach already used in prompt_client.py.

This function requires two additional runtime dependencies as parameters:

  • cache — Node_collection instance used to fetch precomputed CRE embeddings
  • ph — PromptHandler instance used to generate query embeddings at runtime

Retrieval Pipeline

  1. Builds query text by combining the CheatsheetRecord summary and headings.
  2. Generates an embedding for the query text.
  3. Computes cosine similarity against all stored CRE embeddings.
  4. Returns the top-k highest scoring CandidateCRE matches.

This implementation provides the deterministic baseline retrieval layer described in the RFC.
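
The four steps above can be sketched in plain Python as follows. This is an illustration only, not the PR's implementation: retrieve_top_k, the embed callable, and the cre_embeddings dict are hypothetical stand-ins for the PromptHandler and Node_collection dependencies, and the way summary and headings are joined is an assumption.

```python
import heapq
import math
from typing import Callable, Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    summary: str,
    headings: Sequence[str],
    embed: Callable[[str], Sequence[float]],  # hypothetical embedding function
    cre_embeddings: Dict[str, Sequence[float]],  # hypothetical id -> vector store
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    # Step 1: build the query text from the summary and headings.
    query_text = "\n".join([summary, *headings]).strip()
    # Step 2: embed the query text.
    query_vec = embed(query_text)
    # Step 3: score every stored CRE embedding against the query.
    scored = [
        (cre_id, cosine_similarity(query_vec, vec))
        for cre_id, vec in cre_embeddings.items()
    ]
    # Step 4: keep the k best matches, highest score first.
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])
```

Given the same embeddings and the same embedding function, the ranking is fully reproducible, which is the sense in which this layer is a deterministic baseline.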


Notes

The retrieval pipeline uses sparse matrices, following the existing implementation. This may not be optimal for dense Gemini embedding vectors and could be revisited in future optimization work.

Additional fallback behavior, error handling, and tests will be implemented in subsequent checkpoints.

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Contributor

Summary by CodeRabbit

  • New Features

    • Added CRE candidate matching based on cheatsheet content, returning the most relevant CREs with similarity scores.
    • Introduced structured candidate details including ID, name, description, and match score.
  • Bug Fixes

    • Added validation to ensure candidate IDs, names, and scores are well-formed, helping prevent invalid match data from appearing in results.

Walkthrough

Two new files are added: a CandidateCRE dataclass with field validation in candidate_cre_defs.py, and a retrieve_candidate_cres function in cheatsheet_cre_retriever.py that uses cosine similarity between a CheatsheetRecord embedding and stored CRE embeddings to return the top-k matching CandidateCRE instances.

Changes

Cheatsheet-to-CRE Embedding Retrieval

Layer / File(s) | Summary
CandidateCRE dataclass and validation (application/defs/candidate_cre_defs.py) | Defines the CandidateCRE dataclass with cre_id, name, description, and score fields. __post_init__ validates that cre_id and name are non-empty strings, cre_id matches the NNN-NNN regex, description is a string, and score is a float in [0.0, 1.0]; all string fields are stripped.
Embedding similarity retrieval pipeline (application/utils/external_project_parsers/parsers/cheatsheet_cre_retriever.py) | Implements retrieve_candidate_cres, which builds a query string from CheatsheetRecord.summary and .headings, generates an embedding via PromptHandler.get_text_embeddings, fetches all stored CRE embeddings from Node_collection, computes cosine similarity via CSR matrices, selects the top-k results, looks up each CRE object, and returns a list of CandidateCRE instances with similarity scores.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title accurately summarizes the main change: implementing candidate retrieval checkpoints D1 and D2.
Description check | ✅ Passed | The description clearly matches the changeset and explains both the new dataclass and retrieval function.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@application/defs/candidate_cre_defs.py`:
- Around line 25-40: Normalize and validate cre_id in CandidateCRE before any
pattern check: strip self.cre_id first, then replace the current re.match
validation in CandidateCRE with a full-string match so only exact NNN-NNN values
pass. Keep the existing score checks intact, and make sure the
trimming/validation order in the CandidateCRE initializer is adjusted so
whitespace-padded IDs are accepted while trailing characters are rejected.

In
`@application/utils/external_project_parsers/parsers/cheatsheet_cre_retriever.py`:
- Around line 50-66: Normalize the cosine similarity output before constructing
CandidateCRE in cheatsheet_cre_retriever.py, because cosine_similarity can
return negative values and CandidateCRE only accepts scores in the 0 to 1 range.
Update the retrieval flow around similarities/top_k_indices/CandidateCRE so the
score passed into CandidateCRE is clamped or remapped to a non-negative value,
and keep the existing ranking logic based on the raw similarity values if
needed.
- Around line 37-64: `cheatsheet_cre_retriever` should guard against empty or
stale embedding cache results before building candidates. In the retrieval flow
that builds `internal_ids`, `cre_matrix`, and `similarities`, add an early
return when `cache.get_embeddings_by_doc_type(...)` yields no vectors, and skip
any `internal_id` whose `cache.get_cre_by_db_id` lookup returns `None` before
constructing `CandidateCRE`. Keep the checks close to the existing candidate
construction loop so the function degrades gracefully instead of crashing.
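
The three findings above could be addressed along these lines. This is a hedged sketch, not a patch against the PR: normalize_cre_id, clamp_score, build_candidates, and the lookup callable are hypothetical names, candidates are shown as plain dicts rather than CandidateCRE instances, and clamping is only one of the two options the review mentions (the other is remapping).

```python
import re
from typing import Callable, List, Optional, Sequence, Tuple

CRE_ID_RE = re.compile(r"\d{3}-\d{3}")


def normalize_cre_id(raw: str) -> str:
    """Strip first, then require the whole string to be NNN-NNN."""
    cre_id = raw.strip()
    if not CRE_ID_RE.fullmatch(cre_id):
        raise ValueError(f"invalid cre_id: {raw!r}")
    return cre_id


def clamp_score(similarity: float) -> float:
    """Clamp a raw cosine similarity from [-1, 1] into [0, 1]."""
    return max(0.0, min(1.0, similarity))


def build_candidates(
    ranked: Sequence[Tuple[str, float]],  # (internal_id, raw similarity), best first
    lookup: Callable[[str], Optional[dict]],  # hypothetical stand-in for the cache lookup
) -> List[dict]:
    # Empty embedding cache: nothing to rank, so return early.
    if not ranked:
        return []
    candidates = []
    for internal_id, similarity in ranked:
        cre = lookup(internal_id)
        # Stale embedding whose CRE no longer exists: skip instead of crashing.
        if cre is None:
            continue
        candidates.append(
            {
                "cre_id": normalize_cre_id(cre["external_id"]),
                "name": cre["name"],
                # Ranking already used the raw value; only the stored score is clamped.
                "score": clamp_score(similarity),
            }
        )
    return candidates
```

Ranking on the raw similarity and clamping only the stored score keeps the result order unchanged while satisfying the [0.0, 1.0] constraint.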

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 8ac23b7a-3348-4025-af8b-601748662275

📥 Commits

Reviewing files that changed from the base of the PR and between 13d2f04 and 9232bbb.

📒 Files selected for processing (2)
  • application/defs/candidate_cre_defs.py
  • application/utils/external_project_parsers/parsers/cheatsheet_cre_retriever.py
