feat(RFC): implement candidate retrieval checkpoints D1 and D2 by Abhijeet2409 · Pull Request #944 · OWASP/OpenCRE

Abhijeet2409 · 2026-06-24T07:17:24Z

Summary

This PR implements Workstream D Checkpoints D1 and D2 from the OWASP Cheat Sheets to CRE Mapping pipeline RFC.

Added

1. `candidate_cre_defs.py`

This file adds the CandidateCRE dataclass representing a single CRE candidate retrieved for a CheatsheetRecord.

The RFC does not explicitly define the CandidateCRE contract, so the structure was designed around the requirements of downstream re-ranking.

Fields

cre_id — CRE's external id (e.g. "623-550")
name — CRE's name
description — text to be used for LLM reasoning
score — raw cosine similarity score

description may be empty because it is optional in the existing CRE DB model. When no description is provided, Workstream E's LLM can fall back to name for reasoning.

Basic validation and normalization have been implemented.
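For readers who want a concrete picture, here is a minimal sketch of what such a dataclass could look like. It is not the code from this PR: the field names come from the list above, while the `NNN-NNN` ID shape, the `[0.0, 1.0]` score range, and the choice of `ValueError` are assumptions modeled on the review walkthrough further down.

```python
import re
from dataclasses import dataclass

# Assumed ID shape: three digits, a hyphen, three digits (e.g. "623-550").
_CRE_ID_PATTERN = re.compile(r"\d{3}-\d{3}")


@dataclass
class CandidateCRE:
    """Illustrative stand-in for the CandidateCRE contract described above."""

    cre_id: str
    name: str
    description: str = ""
    score: float = 0.0

    def __post_init__(self) -> None:
        # Type-check the string fields, then strip them so later checks
        # operate on the normalized values.
        for field_name in ("cre_id", "name", "description"):
            value = getattr(self, field_name)
            if not isinstance(value, str):
                raise ValueError(f"{field_name} must be a string")
            setattr(self, field_name, value.strip())

        # cre_id and name must be non-empty; description may stay empty.
        if not self.cre_id or not self.name:
            raise ValueError("cre_id and name must be non-empty")

        # fullmatch rejects trailing characters such as "623-550abc".
        if _CRE_ID_PATTERN.fullmatch(self.cre_id) is None:
            raise ValueError(f"cre_id {self.cre_id!r} is not in NNN-NNN form")

        # Reject bools explicitly, since bool is a subclass of int.
        if isinstance(self.score, bool) or not isinstance(self.score, (int, float)):
            raise ValueError("score must be a number")
        self.score = float(self.score)
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be within [0.0, 1.0]")
```

Stripping before validating means a whitespace-padded ID such as `" 623-550 "` is accepted and stored in trimmed form, while malformed IDs fail early at construction time.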

2. `cheatsheet_cre_retriever.py`

This file adds:

retrieve_candidate_cres(record, cache, ph, top_k)

The retrieval implementation follows the existing cosine similarity approach already used in prompt_client.py.

This function requires two additional runtime dependencies as parameters:

cache — Node_collection instance used to fetch precomputed CRE embeddings
ph — PromptHandler instance used to generate query embeddings at runtime

Retrieval Pipeline

Builds query text by combining the CheatsheetRecord summary and headings.
Generates an embedding for the query text.
Computes cosine similarity against all stored CRE embeddings.
Returns the top-k highest scoring CandidateCRE matches.
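The steps above can be sketched in plain Python as follows. This is a simplified illustration, not the PR's implementation: it takes the query embedding as given instead of calling PromptHandler, uses dense lists rather than the sparse matrices mentioned in the Notes, and the helper names (`build_query_text`, `cosine_similarity`, `top_k_candidates`) and the example IDs are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def build_query_text(summary: str, headings: Sequence[str]) -> str:
    """Join the summary and headings into one query string, skipping blanks."""
    parts = [summary.strip()] + [h.strip() for h in headings]
    return "\n".join(p for p in parts if p)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_candidates(
    query_vec: Sequence[float],
    cre_embeddings: Dict[str, Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every stored embedding and return the top_k best (id, score) pairs."""
    scored = [
        (cre_id, cosine_similarity(query_vec, vec))
        for cre_id, vec in cre_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data: made-up two-dimensional embeddings keyed by made-up CRE ids.
embeddings = {
    "111-111": [1.0, 0.0],
    "222-222": [0.0, 1.0],
    "333-333": [1.0, 1.0],
}
query = build_query_text("Session handling", ["Cookies", "", "Timeouts"])
best = top_k_candidates([1.0, 0.0], embeddings, top_k=2)
```

Because every stored embedding is scored on each call, the cost grows linearly with the number of CREs, which is usually acceptable for a baseline retrieval layer.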

This implementation provides the deterministic baseline retrieval layer described in the RFC.

Notes

The retrieval pipeline uses sparse matrices, following the existing implementation. This may not be optimal for dense Gemini embedding vectors and could be revisited in future optimization work.

Additional fallback behavior, error handling, and tests will be implemented in subsequent checkpoints.

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

coderabbitai · 2026-06-24T07:17:38Z

Summary by CodeRabbit

New Features
- Added CRE candidate matching based on cheatsheet content, returning the most relevant CREs with similarity scores.
- Introduced structured candidate details including ID, name, description, and match score.
Bug Fixes
- Added validation to ensure candidate IDs, names, and scores are well-formed, helping prevent invalid match data from appearing in results.

Walkthrough

Two new files are added: a CandidateCRE dataclass with field validation in candidate_cre_defs.py, and a retrieve_candidate_cres function in cheatsheet_cre_retriever.py that uses cosine similarity between a CheatsheetRecord embedding and stored CRE embeddings to return the top-k matching CandidateCRE instances.

Changes

Cheatsheet-to-CRE Embedding Retrieval

| Layer / File(s) | Summary |
| --- | --- |
| CandidateCRE dataclass and validation `application/defs/candidate_cre_defs.py` | Defines the `CandidateCRE` dataclass with `cre_id`, `name`, `description`, and `score` fields. `__post_init__` validates that `cre_id` and `name` are non-empty strings, `cre_id` matches the `NNN-NNN` regex, `description` is a string, and `score` is a float in `[0.0, 1.0]`; all string fields are stripped. |
| Embedding similarity retrieval pipeline `application/utils/external_project_parsers/parsers/cheatsheet_cre_retriever.py` | Implements `retrieve_candidate_cres`, which builds a query string from `CheatsheetRecord.summary` and `.headings`, generates an embedding via `PromptHandler.get_text_embeddings`, fetches all stored CRE embeddings from `Node_collection`, computes cosine similarity via CSR matrices, selects the top-k results, looks up each CRE object, and returns a list of `CandidateCRE` instances with similarity scores. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: implementing candidate retrieval checkpoints D1 and D2. |
| Description check | ✅ Passed | The description clearly matches the changeset and explains both the new dataclass and retrieval function. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@application/defs/candidate_cre_defs.py`:
- Around line 25-40: Normalize and validate cre_id in CandidateCRE before any
pattern check: strip self.cre_id first, then replace the current re.match
validation in CandidateCRE with a full-string match so only exact NNN-NNN values
pass. Keep the existing score checks intact, and make sure the
trimming/validation order in the CandidateCRE initializer is adjusted so
whitespace-padded IDs are accepted while trailing characters are rejected.
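To illustrate why the order and the matching function both matter, here is a small sketch. The pattern and the helper name `is_valid_cre_id` are assumptions for illustration and are not copied from the file under review.

```python
import re

# Assumed NNN-NNN pattern; the actual regex in the PR may differ.
CRE_ID_PATTERN = re.compile(r"\d{3}-\d{3}")


def is_valid_cre_id(raw: str) -> bool:
    """Strip first, then require the whole string to match the pattern."""
    return CRE_ID_PATTERN.fullmatch(raw.strip()) is not None


# re.match only anchors at the start, so trailing junk slips through.
prefix_only = CRE_ID_PATTERN.match("623-550abc") is not None

# fullmatch on the stripped value accepts padding and rejects trailing junk.
padded_ok = is_valid_cre_id("  623-550  ")
trailing_rejected = not is_valid_cre_id("623-550abc")
```

With `re.match` and no end anchor, `"623-550abc"` passes because only the start of the string is checked. Validating before stripping would instead reject `"  623-550  "`, even though its trimmed form is well-formed.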

In
`@application/utils/external_project_parsers/parsers/cheatsheet_cre_retriever.py`:
- Around line 50-66: Normalize the cosine similarity output before constructing
CandidateCRE in cheatsheet_cre_retriever.py, because cosine_similarity can
return negative values and CandidateCRE only accepts scores in the 0 to 1 range.
Update the retrieval flow around similarities/top_k_indices/CandidateCRE so the
score passed into CandidateCRE is clamped or remapped to a non-negative value,
and keep the existing ranking logic based on the raw similarity values if
needed.
- Around line 37-64: `cheatsheet_cre_retriever` should guard against empty or
stale embedding cache results before building candidates. In the retrieval flow
that builds `internal_ids`, `cre_matrix`, and `similarities`, add an early
return when `cache.get_embeddings_by_doc_type(...)` yields no vectors, and skip
any `internal_id` whose `cache.get_cre_by_db_id` lookup returns `None` before
constructing `CandidateCRE`. Keep the checks close to the existing candidate
construction loop so the function degrades gracefully instead of crashing.
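The two retriever findings can be sketched together. This is a hedged illustration rather than a patch: `build_candidates`, `clamp_score`, and the `lookup` callable are hypothetical stand-ins for the retrieval flow and for `cache.get_cre_by_db_id`, and results are plain `(id, score)` tuples instead of `CandidateCRE` objects.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def clamp_score(raw: float) -> float:
    """Clamp a cosine similarity into [0.0, 1.0]; negatives become 0.0."""
    return max(0.0, min(1.0, raw))


def build_candidates(
    internal_ids: Sequence[str],
    similarities: Sequence[float],
    lookup: Callable[[str], Optional[str]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Rank by raw similarity, skip stale ids, and emit clamped scores."""
    # Guard: an empty embedding cache yields no candidates instead of crashing.
    if not internal_ids:
        return []

    # Rank on the raw values so clamping cannot change the ordering.
    ranked = sorted(
        range(len(internal_ids)),
        key=lambda i: similarities[i],
        reverse=True,
    )

    results: List[Tuple[str, float]] = []
    for i in ranked:
        if len(results) >= top_k:
            break
        cre_id = lookup(internal_ids[i])
        if cre_id is None:
            # Stale cache entry: the embedding exists but the CRE does not.
            continue
        results.append((cre_id, clamp_score(similarities[i])))
    return results


# Toy data: "c" is a stale entry whose lookup returns None.
fake_db = {"a": "111-111", "b": "222-222", "c": None}
candidates = build_candidates(["a", "b", "c"], [0.9, -0.2, 0.5], fake_db.get, top_k=2)
```

This sketch keeps walking the ranked list until it has `top_k` valid entries, so a stale ID does not shrink the result. Selecting the top-k first and then dropping stale entries is the simpler alternative, at the cost of sometimes returning fewer than `top_k` candidates. The upper clamp also absorbs floating-point results that land slightly above 1.0.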


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 8ac23b7a-3348-4025-af8b-601748662275

📥 Commits

Reviewing files that changed from the base of the PR and between 13d2f04 and 9232bbb.

📒 Files selected for processing (2)

application/defs/candidate_cre_defs.py
application/utils/external_project_parsers/parsers/cheatsheet_cre_retriever.py

Abhijeet2409 added 2 commits June 23, 2026 16:08

feat: add candidate retrieval checkpoints D1 & D2

feefa3d

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

fix: address lint test issues

5d098e8

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

Merge branch 'main' into feature/candidate-retrieval-d1-d2

9232bbb

coderabbitai Bot reviewed Jun 24, 2026

Abhijeet2409 wants to merge 3 commits into OWASP:main from Abhijeet2409:feature/candidate-retrieval-d1-d2
