Embedding-based false positive filtering from developer feedback #27

@haasonsaas

Problem

DiffScope's convention learner uses Wilson score confidence intervals, which is statistically more rigorous than what competitors do. However, Greptile's embedding-based approach to false positive filtering is empirically the most effective technique published in the space: it took their comment address rate from 19% to 55%+.

How Greptile Does It (from their "Make LLMs Shut Up" blog)

What failed:

  • Prompt engineering / few-shot: Model "inferred superficial characteristics" rather than learning meaningful patterns. Backfired.
  • LLM-as-judge: A secondary LLM rating comments 1-10 was "nearly random in its judgment of its own output"

What works:

  1. Store embeddings of all past review comments, tagged with developer 👍/👎 feedback
  2. For each new comment the LLM wants to post:
    • Compute cosine similarity against the feedback database
    • Block if similar to 3+ distinct downvoted comments
    • Pass if similar to 3+ upvoted comments
    • Pass ambiguous cases (not enough signal)
  3. Result: address rate went from 19% to 55%+
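The steps above can be sketched in memory, without a vector database. This is a minimal illustration, not Greptile's actual implementation: the names `PastComment`, `Verdict`, and `verdict` are hypothetical, and the 0.85 cutoff used in the usage note is the default proposed later in this issue, not a number from Greptile's post.

```rust
/// Outcome of checking a candidate comment against past feedback.
/// (Hypothetical type; `Endorsed` and `Ambiguous` both result in posting.)
#[derive(Debug, PartialEq)]
enum Verdict {
    Block,
    Endorsed,
    Ambiguous,
}

/// One stored review comment: its embedding and the developer's vote.
struct PastComment {
    embedding: Vec<f32>,
    upvoted: bool,
}

/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Count similar upvoted and downvoted comments, then apply the 3+ rule.
fn verdict(candidate: &[f32], history: &[PastComment], cutoff: f32, min_votes: usize) -> Verdict {
    let (mut up, mut down) = (0usize, 0usize);
    for past in history {
        if cosine_similarity(candidate, &past.embedding) >= cutoff {
            if past.upvoted {
                up += 1;
            } else {
                down += 1;
            }
        }
    }
    // Assumption: downvotes win when both counts reach the threshold.
    if down >= min_votes {
        Verdict::Block
    } else if up >= min_votes {
        Verdict::Endorsed
    } else {
        Verdict::Ambiguous
    }
}
```

Calling `verdict(&candidate, &history, 0.85, 3)` returns `Verdict::Block` only when at least three stored downvoted comments are within the cutoff; everything else is posted.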

Key insight: "Nits are subjective — definitions and standards vary from team to team." This must be learned per-team, not universally.

Proposed Solution

Enhance the existing feedback system (FeedbackStore) with embedding-based similarity:

Data Model

CREATE EXTENSION IF NOT EXISTS vector;  -- pgvector: provides the vector type and ivfflat index
CREATE TABLE review_feedback (
    id SERIAL PRIMARY KEY,
    repo TEXT NOT NULL,
    comment_text TEXT NOT NULL,
    comment_embedding vector(1536),  -- dimension must match the embedding model
    category TEXT,  -- logic, style, security, etc.
    file_pattern TEXT,  -- e.g., "*.rs", "src/api/**"
    feedback TEXT NOT NULL CHECK (feedback IN ('accepted', 'rejected')),
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX ON review_feedback USING ivfflat (comment_embedding vector_cosine_ops);
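One way the lookup against this table might be expressed is sketched below. The constant `FIND_SIMILAR_SQL` and the helper `to_pgvector_literal` are hypothetical names, and the parameter order (`$1` embedding, `$2` repo, `$3` similarity cutoff, `$4` row limit) is an assumption. The sketch relies on pgvector's `<=>` operator returning cosine distance, so similarity is `1 - distance`; the `repo` filter keeps feedback scoped per team.

```rust
/// Hypothetical query: nearest stored comments for one repo, above a
/// similarity cutoff. `<=>` is pgvector's cosine-distance operator.
const FIND_SIMILAR_SQL: &str = "\
SELECT id, feedback, 1 - (comment_embedding <=> $1::vector) AS similarity
FROM review_feedback
WHERE repo = $2
  AND 1 - (comment_embedding <=> $1::vector) >= $3
ORDER BY comment_embedding <=> $1::vector
LIMIT $4";

/// Render an embedding in pgvector's text input format, e.g. "[0.1,0.2,0.3]",
/// so it can be bound as a text parameter and cast with `::vector`.
fn to_pgvector_literal(embedding: &[f32]) -> String {
    let parts: Vec<String> = embedding.iter().map(|v| v.to_string()).collect();
    format!("[{}]", parts.join(","))
}
```

Keeping `ORDER BY ... LIMIT` on the distance expression is what lets the ivfflat index be used; the similarity filter then trims the candidates the index returns.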

Filtering Logic

async fn should_post_comment(
    comment: &Comment,
    feedback_db: &FeedbackDb,
    threshold: usize,        // default 3
    similarity_cutoff: f32,  // default 0.85
) -> Result<bool, Box<dyn std::error::Error + Send + Sync>> {
    // `?` needs a fallible return type, so embedding and lookup errors propagate.
    let embedding = embed(comment.text()).await?;
    let similar = feedback_db.find_similar(embedding, similarity_cutoff).await?;

    let rejected = similar.iter().filter(|f| f.feedback == "rejected").count();
    let accepted = similar.iter().filter(|f| f.feedback == "accepted").count();

    // Rejections are checked first, so they win if both counts reach the threshold.
    if rejected >= threshold { return Ok(false); }  // block
    if accepted >= threshold { return Ok(true); }   // pass
    Ok(true)  // ambiguous → pass (err on side of posting)
}

Feedback Collection

  • diffscope feedback accept <comment-id> — existing CLI, add embedding storage
  • diffscope feedback reject <comment-id> — existing CLI, add embedding storage
  • GitHub reactions (👍/👎) on posted PR comments → auto-collect via webhook
  • Resolved/unresolved thread status → signal for accepted/rejected
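The collection sources above could be normalized into a single accepted/rejected label before the embedding is stored. The following is a rough sketch under stated assumptions: `Signal`, `Feedback`, and `to_feedback` are hypothetical names, tied reaction counts are treated as no signal, and an unresolved thread only counts as a rejection once the PR is closed (the `pr_closed` field is an assumption, since an unresolved thread on an open PR may simply be unanswered).

```rust
use std::cmp::Ordering;

/// Label stored in the `feedback` column.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Feedback {
    Accepted,
    Rejected,
}

impl Feedback {
    /// Text value matching the 'accepted' / 'rejected' column convention.
    fn as_str(self) -> &'static str {
        match self {
            Feedback::Accepted => "accepted",
            Feedback::Rejected => "rejected",
        }
    }
}

/// Raw signals from the CLI, GitHub reactions, and thread status.
enum Signal {
    CliAccept,
    CliReject,
    Reaction { thumbs_up: u32, thumbs_down: u32 },
    Thread { resolved: bool, pr_closed: bool },
}

/// Map a raw signal to a label, or `None` when the signal is inconclusive.
fn to_feedback(signal: &Signal) -> Option<Feedback> {
    match *signal {
        Signal::CliAccept => Some(Feedback::Accepted),
        Signal::CliReject => Some(Feedback::Rejected),
        Signal::Reaction { thumbs_up, thumbs_down } => match thumbs_up.cmp(&thumbs_down) {
            Ordering::Greater => Some(Feedback::Accepted),
            Ordering::Less => Some(Feedback::Rejected),
            Ordering::Equal => None, // assumption: ties carry no signal
        },
        Signal::Thread { resolved: true, .. } => Some(Feedback::Accepted),
        // Assumption: unresolved only means "rejected" once the PR is closed.
        Signal::Thread { resolved: false, pr_closed: true } => Some(Feedback::Rejected),
        Signal::Thread { resolved: false, pr_closed: false } => None,
    }
}
```

Returning `Option` keeps inconclusive signals out of the feedback table, so they cannot push a comment toward the block threshold by accident.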

Relationship to Existing Convention Learner

  • The Wilson score convention learner operates on exact pattern matches (rule_id, file pattern, category)
  • Embedding-based filtering operates on semantic similarity of the comment text
  • Both should run: Wilson score for structured rules, embeddings for fuzzy/subjective nits
  • The embedding filter runs first (a cheap vector lookup), and the Wilson score check then augments its decision
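A possible shape for running both checks in sequence is sketched below. It is not DiffScope's existing convention learner: `wilson_interval` and `should_post` are hypothetical names, the 95% z-value of 1.96 is an assumption, and so is the suppression rule (drop a structured rule only when the upper bound of its acceptance rate falls below `min_accept_rate`).

```rust
/// Wilson score interval (lower, upper) for `accepted` successes out of
/// `total` trials at the given z-value. With no data, returns (0.0, 1.0).
fn wilson_interval(accepted: u32, total: u32, z: f64) -> (f64, f64) {
    if total == 0 {
        return (0.0, 1.0);
    }
    let n = total as f64;
    let p = accepted as f64 / n;
    let z2 = z * z;
    let centre = p + z2 / (2.0 * n);
    let margin = z * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt();
    let denom = 1.0 + z2 / n;
    // Clamp to absorb floating-point rounding at the edges.
    (
        ((centre - margin) / denom).clamp(0.0, 1.0),
        ((centre + margin) / denom).clamp(0.0, 1.0),
    )
}

/// Embedding verdict first, then the structured-rule check.
/// `rule_stats` is (accepted, total) for the matching rule, if any.
fn should_post(embedding_blocked: bool, rule_stats: Option<(u32, u32)>, min_accept_rate: f64) -> bool {
    if embedding_blocked {
        return false;
    }
    match rule_stats {
        // Suppress only when we are confident the acceptance rate is low.
        Some((accepted, total)) => wilson_interval(accepted, total, 1.96).1 >= min_accept_rate,
        None => true,
    }
}
```

With a `min_accept_rate` of 0.3, a rule rejected 10 times out of 10 is suppressed (upper bound about 0.28), while one rejected 3 times out of 3 still posts (upper bound about 0.56), because the interval is too wide to be confident.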

Expected Impact

Greptile's published numbers: 19% → 55%+ address rate. Even half that improvement would be significant for DiffScope's signal-to-noise ratio.

Priority

High — direct attack on the #1 churn driver (review fatigue from noisy comments).
