feat(parser): ingest transcripts and dedupe by session id#164

Open
hora7ce wants to merge 1 commit into microsoft:main from hora7ce:feat/issue-87-review-plan-tests
hora7ce commented Jun 28, 2026

Contributor

Description

This PR adds transcript ingestion to VS Code parsing and deduplicates sessions that appear in both transcripts and chat sessions, matching them by normalized sessionId.

Changes included:

  • Added transcript parsing support and request construction from transcript events
  • Added tool extraction and dedup helpers for transcript tool-call metadata
  • Added merge logic to dedupe transcript sessions with existing chat sessions
  • Scoped dedup indexing to workspace-local parsing context in sync and async paths
  • Added tests for single-turn/multi-turn transcript parsing, invalid/empty transcript handling, tool dedup, transcript-chat dedup, and transcript-only fallback ingestion
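
For reviewers who want a concrete picture of the merge step, here is a minimal sketch of deduplicating by normalized sessionId. It is not the code in this PR: `SessionLike`, `normalizeSessionId`, and `mergeSessions` are hypothetical names, the normalization rule (trim plus lowercase) is an assumption, and so is the enrichment rule used when both sources contain the same session.

```typescript
// Hypothetical session shape; the PR's real SessionData type is not shown here.
interface SessionLike {
  sessionId: string;
  source: "chat" | "transcript";
  requests: string[];
}

// Assumed normalization: trim whitespace and lowercase the id.
function normalizeSessionId(id: string): string {
  return id.trim().toLowerCase();
}

// Chat sessions are indexed first so they stay primary. A transcript either
// enriches a matching chat session or is added as a fallback entry.
function mergeSessions(
  chatSessions: SessionLike[],
  transcriptSessions: SessionLike[],
): SessionLike[] {
  // The index is created per call, mirroring a workspace-local dedup scope.
  const byId = new Map<string, SessionLike>();

  for (const chat of chatSessions) {
    byId.set(normalizeSessionId(chat.sessionId), {
      ...chat,
      requests: [...chat.requests],
    });
  }

  for (const transcript of transcriptSessions) {
    const key = normalizeSessionId(transcript.sessionId);
    const existing = byId.get(key);
    if (existing === undefined) {
      // Transcript-only session: ingest it as a fallback.
      byId.set(key, { ...transcript, requests: [...transcript.requests] });
      continue;
    }
    // Assumed enrichment rule: only fill in requests the chat session lacks.
    if (existing.requests.length === 0) {
      existing.requests = [...transcript.requests];
    }
  }

  return Array.from(byId.values());
}

const merged = mergeSessions(
  [{ sessionId: "ABC-1", source: "chat", requests: ["hello"] }],
  [
    { sessionId: " abc-1 ", source: "transcript", requests: ["hello"] },
    { sessionId: "xyz-2", source: "transcript", requests: ["only in transcript"] },
  ],
);
console.log(merged.map((s) => `${s.sessionId}:${s.source}`));
```

Building the `Map` inside the function, rather than at module level, is one way to keep the dedup index from leaking across workspaces.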

Related Issues

Relates to #87
Supersedes prior context in #64

Checklist

  • npm run check passes (typecheck + lint + spellcheck + knip + tests)
  • Changes are covered by tests (if applicable)
  • Documentation updated (if applicable)

hora7ce commented Jun 28, 2026

Contributor Author

Reviewer context against issue requirements (#87, #64):

This change is intentionally aligned to the investigation direction in #87 and the acceptance criteria in #64.

Requirement coverage:

  • Dual-source ingestion without duplicate sessions:
  • Canonical source + enrichment/recovery behavior:
    • Existing chatSessions parsing remains primary.
    • transcripts/*.jsonl files are ingested as secondary input: each is merged when a chat session with the same sessionId exists, or added as a fallback when the session appears only in a transcript.
  • Transcript event-stream parsing:
    • session.start / user.message / assistant.message and tool execution events are mapped into SessionData requests.
    • Tool names are deduplicated per turn.
  • Graceful handling of bad input:
    • Empty/malformed transcript files return null and are skipped.
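
To illustrate the event-stream mapping described above, here is a rough sketch of parsing a JSONL transcript into turns with per-turn tool dedup. The event type names `session.start`, `user.message`, and `assistant.message` come from this comment; everything else is assumed for illustration, including the `tool.execution_start` event name, the `sessionId`/`text`/`toolName` fields, the `parseTranscript` function itself, and the choice to return null as soon as any line fails to parse.

```typescript
// Hypothetical event and turn shapes; field names are assumptions.
interface TranscriptEvent {
  type?: string;
  sessionId?: string;
  text?: string;
  toolName?: string;
}

interface Turn {
  userText: string;
  assistantText: string;
  tools: string[];
}

interface ParsedTranscript {
  sessionId: string;
  turns: Turn[];
}

function parseTranscript(jsonl: string): ParsedTranscript | null {
  const lines = jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (lines.length === 0) return null; // empty input is skipped

  let sessionId: string | null = null;
  let current: Turn | null = null;
  const turns: Turn[] = [];

  for (const line of lines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      return null; // assumed policy: one malformed line rejects the file
    }
    if (typeof parsed !== "object" || parsed === null) return null;
    const event = parsed as TranscriptEvent;

    switch (event.type) {
      case "session.start":
        sessionId = event.sessionId ?? null;
        break;
      case "user.message":
        // Each user message opens a new turn.
        current = { userText: event.text ?? "", assistantText: "", tools: [] };
        turns.push(current);
        break;
      case "assistant.message":
        if (current !== null) current.assistantText += event.text ?? "";
        break;
      case "tool.execution_start": // assumed event name
        // Record each tool name at most once per turn.
        if (current !== null && event.toolName !== undefined &&
            current.tools.indexOf(event.toolName) === -1) {
          current.tools.push(event.toolName);
        }
        break;
    }
  }

  if (sessionId === null || turns.length === 0) return null;
  return { sessionId, turns };
}

const sample = [
  '{"type":"session.start","sessionId":"s-1"}',
  '{"type":"user.message","text":"list files"}',
  '{"type":"tool.execution_start","toolName":"ls"}',
  '{"type":"tool.execution_start","toolName":"ls"}',
  '{"type":"assistant.message","text":"done"}',
].join("\n");

const parsedSample = parseTranscript(sample);
console.log(parsedSample?.turns[0]?.tools); // [ 'ls' ]
```

Returning null instead of throwing lets the caller skip a bad file and continue with the rest of the workspace.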

Test coverage added for the above:

  • parseTranscriptFile:
    • single-turn parse
    • multi-turn parse
    • empty/corrupt input => null
    • per-turn tool dedup
  • processWorkspaceEntry dedup flow:
    • chat + transcript with same sessionId emits one merged session
    • transcript-only session ingestion works

Validation status:

  • npm run check passed locally on this branch (typecheck + lint + spellcheck + knip + tests)

Notes for reviewers:

- Parse transcript files into turns and tool metadata

- Merge transcript sessions with existing chatSessions by normalized sessionId

- Add coverage for transcript parsing, dedup, and transcript-only ingestion

- Keep workspace parsing behavior aligned with stripped response text fields

Refs microsoft#87
hora7ce force-pushed the feat/issue-87-review-plan-tests branch from c436780 to 7fd40b9 on June 28, 2026 12:07