feat(RFC): implement candidate retrieval checkpoints D1 and D2 #944
Abhijeet2409 wants to merge 3 commits into
Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>
Walkthrough
Two new files are added.
Changes: Cheatsheet-to-CRE Embedding Retrieval
Estimated code review effort: 2 (Simple), ~10 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@application/defs/candidate_cre_defs.py`:
- Around line 25-40: Normalize and validate cre_id in CandidateCRE before any
pattern check: strip self.cre_id first, then replace the current re.match
validation in CandidateCRE with a full-string match so only exact NNN-NNN values
pass. Keep the existing score checks intact, and make sure the
trimming/validation order in the CandidateCRE initializer is adjusted so
whitespace-padded IDs are accepted while trailing characters are rejected.
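
A minimal sketch of the ordering this comment asks for: strip first, then match the whole string. The field names come from the PR description; the `__post_init__` layout and the exact regular expression are assumptions, since the actual `candidate_cre_defs.py` is not shown here.

```python
import re
from dataclasses import dataclass

# Assumed ID shape, taken from the review comment: exactly NNN-NNN.
CRE_ID_PATTERN = re.compile(r"\d{3}-\d{3}")


@dataclass
class CandidateCRE:
    cre_id: str
    name: str
    description: str
    score: float

    def __post_init__(self) -> None:
        # Strip first so whitespace-padded IDs are normalized before validation.
        self.cre_id = self.cre_id.strip()
        # fullmatch rejects trailing characters that a prefix-only re.match accepts.
        if not CRE_ID_PATTERN.fullmatch(self.cre_id):
            raise ValueError(f"invalid cre_id: {self.cre_id!r}")
        # Score check kept as described in the review: only 0 to 1 is accepted.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score out of range: {self.score!r}")
```

With this ordering, `" 623-550 "` is stored as `"623-550"`, while `"623-550x"` raises `ValueError`.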
In
`@application/utils/external_project_parsers/parsers/cheatsheet_cre_retriever.py`:
- Around line 50-66: Normalize the cosine similarity output before constructing
CandidateCRE in cheatsheet_cre_retriever.py, because cosine_similarity can
return negative values and CandidateCRE only accepts scores in the 0 to 1 range.
Update the retrieval flow around similarities/top_k_indices/CandidateCRE so the
score passed into CandidateCRE is clamped or remapped to a non-negative value,
and keep the existing ranking logic based on the raw similarity values if
needed.
- Around line 37-64: `cheatsheet_cre_retriever` should guard against empty or
stale embedding cache results before building candidates. In the retrieval flow
that builds `internal_ids`, `cre_matrix`, and `similarities`, add an early
return when `cache.get_embeddings_by_doc_type(...)` yields no vectors, and skip
any `internal_id` whose `cache.get_cre_by_db_id` lookup returns `None` before
constructing `CandidateCRE`. Keep the checks close to the existing candidate
construction loop so the function degrades gracefully instead of crashing.
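
The two findings for the retriever can be sketched together. This is an illustration only: `select_candidates`, `clamp_score`, and the `lookup` callable are hypothetical stand-ins (the real code calls `cache.get_cre_by_db_id`), and similarities are assumed to be already computed.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def clamp_score(similarity: float) -> float:
    """Clamp a cosine similarity from [-1, 1] into the [0, 1] score range."""
    return max(0.0, min(1.0, similarity))


def select_candidates(
    internal_ids: Sequence[str],
    similarities: Sequence[float],
    lookup: Callable[[str], Optional[dict]],  # stand-in for cache.get_cre_by_db_id
    top_k: int = 5,
) -> List[Tuple[dict, float]]:
    # Guard 1: an empty embedding cache yields no candidates instead of a crash.
    if not internal_ids or not similarities:
        return []
    # Ranking still uses the raw similarity values.
    ranked = sorted(
        zip(internal_ids, similarities), key=lambda pair: pair[1], reverse=True
    )
    results: List[Tuple[dict, float]] = []
    for internal_id, similarity in ranked:
        if len(results) >= top_k:
            break
        cre = lookup(internal_id)
        # Guard 2: skip stale cache entries whose CRE lookup returns None.
        if cre is None:
            continue
        # Only the score handed onward is clamped.
        results.append((cre, clamp_score(similarity)))
    return results
```

One design choice to note: this variant keeps walking the ranking until it has `top_k` valid entries, so a stale entry does not shrink the result. Taking the top `top_k` indices first and then dropping stale ones would be equally consistent with the comment.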
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 8ac23b7a-3348-4025-af8b-601748662275
📒 Files selected for processing (2)
- application/defs/candidate_cre_defs.py
- application/utils/external_project_parsers/parsers/cheatsheet_cre_retriever.py
Summary
This PR implements Workstream D Checkpoints D1 and D2 from the OWASP Cheat Sheets to CRE Mapping pipeline RFC.
Added
1. `candidate_cre_defs.py`

This file adds the `CandidateCRE` dataclass representing a single CRE candidate retrieved for a `CheatsheetRecord`.

The RFC did not explicitly define the `CandidateCRE` contract, so the structure was designed around downstream re-ranking requirements.

Fields
- `cre_id`: CRE's external id (e.g. `"623-550"`)
- `name`: CRE's name
- `description`: text to be used for LLM reasoning
- `score`: raw cosine similarity score

`description` is allowed to be empty as it is optional in the existing CRE DB model. In cases where no description is provided, Workstream E's LLM can fall back to `name` for reasoning.

Basic validation and normalization have been implemented.
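
The fallback from `description` to `name` could look like the following on the consuming side. `reasoning_text` is a hypothetical helper for illustration; nothing in this PR defines how Workstream E actually selects its input text.

```python
def reasoning_text(name: str, description: str) -> str:
    """Return the text to reason over, preferring a non-empty description."""
    text = description.strip()
    # An empty or whitespace-only description falls back to the CRE name.
    return text if text else name
```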
2. `cheatsheet_cre_retriever.py`

This file adds the retrieval function.

The retrieval implementation follows the existing cosine similarity approach already used in `prompt_client.py`.

This function requires two additional runtime dependencies as parameters:
- `cache`: `Node_collection` instance used to fetch precomputed CRE embeddings
- `ph`: `PromptHandler` instance used to generate query embeddings at runtime

Retrieval Pipeline

The pipeline builds its query from the `CheatsheetRecord` summary and headings and returns `CandidateCRE` matches.

This implementation provides the deterministic baseline retrieval layer described in the RFC.
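
A standard-library sketch of the ranking step, for readers unfamiliar with cosine-similarity retrieval. It assumes the embeddings are already computed (in the PR, `ph` produces the query embedding and `cache` supplies the CRE embeddings). All function names here are hypothetical, joining summary and headings with newlines is an assumption, and the IDs in the usage note are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple


def build_query_text(summary: str, headings: Sequence[str]) -> str:
    """Join the summary and the non-empty headings into one query string."""
    parts = [summary.strip()] + [heading.strip() for heading in headings]
    return "\n".join(part for part in parts if part)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_similarity(
    query_vector: Sequence[float],
    cre_vectors: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every CRE vector against the query and keep the k best."""
    scored = [
        (cre_id, cosine_similarity(query_vector, vector))
        for cre_id, vector in cre_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For example, with toy vectors `{"623-550": [1.0, 0.0], "111-222": [0.0, 1.0], "333-444": [1.0, 1.0]}` and the query `[1.0, 0.0]`, the top two entries are `"623-550"` followed by `"333-444"`.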
Notes
The retrieval pipeline uses sparse matrices, following the existing implementation. This may not be optimal for dense Gemini embedding vectors and could be revisited in future optimization work.
Additional fallback behavior, error handling, and tests will be implemented in subsequent checkpoints.