TranscriptIntel PII Evaluation POC

A reusable proof of concept for the TranscriptIntel platform. It compares PII and investigation-sensitive data detection engines on two tracks: labelled benchmark evaluation and unlabelled transcript screening.

The primary scenario is digital-forensics and criminal-investigation transcript analysis. Pharmaceutical HCP interview screening is included as a secondary regulated-domain scenario to prove that the engine and taxonomy structure reuse cleanly across verticals.

Quick Start

Prerequisites: task, uv, Python 3.12 or 3.13, and ruff (global preferred; .venv/bin/ruff is the fallback).

task setup          # create .venv and install runtime dependencies
task check          # run ruff + pytest
task analyze -- data/pharma_hcp_interview_sample.vtt --taxonomy pharma --engine presidio

The first command creates a local .venv managed by uv. The third runs the fastest local engine against a bundled pharma transcript. From here, pick an engine (presidio, nvidia-gliner, privacy-filter, gliner2) and a taxonomy (core, digital-forensics, pharma) and move on to the command recipes below.

How It Works

flowchart LR
  A["Input files (.vtt, .txt)"] --> B["Parser"]
  B --> C["Engine adapter"]
  T["Taxonomy JSON"] --> C
  C --> D["Post-process findings"]
  T --> D
  D --> E["Tables / JSON / JSONL"]
  D --> F["Redacted transcript"]
  D --> G["Review annotations"]
  G --> H["Benchmark JSONL"]
  H --> I["Metrics comparison"]
  E --> J["PII-safe summary"]
  I --> J

Each engine adapter returns a list of spans. A shared post-processing layer applies canonical labels, taxonomy severity, optional overlap deduplication, and output formatting. That boundary is what makes the POC reusable across engines and domains: swap the adapter or the taxonomy, and the rest of the pipeline is unchanged.
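As a rough illustration of that shared layer, the sketch below maps engine-specific labels to canonical ones and attaches a severity group. The function name `post_process`, the argument shapes, and the severity values are assumptions made for this example, not the project's actual API.

```python
from typing import Any, Optional


def post_process(
    spans: list[dict[str, Any]],
    label_map: dict[str, str],
    severity: dict[str, str],
) -> list[dict[str, Any]]:
    """Rename engine labels to canonical labels and attach a severity group."""
    findings = []
    for span in spans:
        # Labels without a mapping are passed through unchanged (an assumption).
        canonical = label_map.get(span["type"], span["type"])
        level: Optional[str] = severity.get(canonical)
        findings.append({**span, "type": canonical, "severity": level})
    return findings


# Hypothetical adapter output using fixed engine labels.
raw_spans = [
    {"type": "private_person", "start": 8, "end": 18},
    {"type": "private_email", "start": 22, "end": 38},
]
findings = post_process(
    raw_spans,
    label_map={"private_person": "PERSON", "private_email": "EMAIL_ADDRESS"},
    severity={"PERSON": "medium", "EMAIL_ADDRESS": "high"},
)
```

Because the function only sees spans and two dictionaries, the same code path can serve any adapter or taxonomy.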

Two modes run through the same adapter layer:

  • Benchmark evaluation compares engine predictions against annotated JSONL ground truth. Produces precision, recall, F1, and per-entity TP/FP/FN.
  • Transcript screening analyses raw .vtt or .txt files without ground truth. Produces findings, counts, severity distribution, optional redaction, and review-ready annotation JSONL.
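For readers who want the benchmark metrics spelled out, here is a minimal sketch of how precision, recall, and F1 follow from TP/FP/FN counts, both per entity and micro-averaged. The helper name `prf` and the counts are invented for illustration; the project's own metrics code may be organised differently.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented per-entity (TP, FP, FN) counts.
counts = {"PERSON": (8, 2, 0), "EMAIL_ADDRESS": (3, 0, 1)}

per_entity = {label: prf(*c) for label, c in counts.items()}

# Micro-averaging sums the counts across entities before computing the ratios.
total_tp, total_fp, total_fn = (sum(column) for column in zip(*counts.values()))
micro = prf(total_tp, total_fp, total_fn)
```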

An optional plain-text summary step can interpret metrics or finding counts through an OpenAI-compatible model. Summary payloads carry only counts and metrics — never raw transcript text or detected values (see Data Safety).

Engines

| Engine | Deployment | Configurable labels | Local download | When to pick it |
| --- | --- | --- | --- | --- |
| presidio | local (CPU) | via supported recognizers | none | fastest, highest recall, over-detects; suited to review-heavy workflows |
| nvidia-gliner | hosted (OpenAI-compatible endpoint) | yes, arbitrary labels | none | broad configurable labels without local model weights |
| privacy-filter | local (Torch) | no (fixed OPF labels, mapped to canonical) | ~2.6 GB | high-precision local detection, heaviest cold start |
| gliner2 | local (Torch) | yes, arbitrary labels | ~810 MB | configurable labels without a hosted endpoint |

List engines at runtime:

uv run --python .venv/bin/python pii-eval list-engines

Privacy Filter

OpenAI Privacy Filter's redact() API returns detected spans with labels and character offsets. The adapter consumes those spans directly and does not anonymise. Its fixed OPF labels (e.g. private_person, private_email, account_number, secret) are mapped to the canonical label space so metrics are comparable with other engines.

task privacy-filter:install
task privacy-filter:check
task privacy-filter:warmup   # pre-download the checkpoint
task privacy-filter:clear    # remove the local cache

The first real run downloads ~2.6 GB into ~/.opf/privacy_filter.

GLiNER2

Fastino GLiNER2 is a local, labels-configurable NER model. Same shape as nvidia-gliner but running through torch/transformers on your own hardware.

task gliner2:install
task gliner2:check
task gliner2:warmup          # pre-download the checkpoint
task gliner2:clear           # remove the local cache

Default model is fastino/gliner2-base-v1 (~810 MB on disk as the full repo). Override via GLINER2_MODEL in .env or a constructor argument.

Nvidia GLiNER (hosted)

Hosted OpenAI-compatible endpoint. Configure OPENAI_BASE_URL and OPENAI_API_KEY in .env, optionally NVIDIA_PII_MODEL. Validate the endpoint before running evaluations:

task nvidia:model-check
MODEL=mistralai/mistral-large-3-675b-instruct-2512 task nvidia:model-check
TIMEOUT_SECONDS=120 MODEL=some/model task nvidia:model-check

Taxonomies

Taxonomies live in taxonomies/*.json. Each taxonomy declares:

  • canonical_entities — project-level labels used for reporting and metrics.
  • severity — optional severity group per canonical entity.
  • engine_labels — labels requested from engines that accept configurable labels (nvidia-gliner, gliner2).
  • label_mappings — engine-specific output label → canonical label.

List and inspect:

uv run --python .venv/bin/python pii-eval list-taxonomies

Bundled taxonomies:

  • core — general synthetic benchmark labels; used by the bundled data/samples.jsonl benchmark.
  • digital-forensics — examiner interviews, device examinations, chat analysis, OS artifacts, account/device identifiers, sanctions and embargo references, export-control review.
  • pharma — HCP interview and pharmaceutical compliance screening, including GDPR special categories.

Engine behaviour differs: nvidia-gliner and gliner2 receive taxonomy labels directly; presidio receives only the subset of labels its recognisers support; privacy-filter uses fixed output labels and relies on mapping.

When a taxonomy is selected for benchmark evaluation, ground-truth labels outside the taxonomy's canonical_entities are excluded from scoring. This keeps domain-specific evaluations focused on the chosen review scope.
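The sketch below shows one plausible shape for a taxonomy file and the scoring filter described above. The JSON layout (in particular nesting label_mappings by engine name), the severity values, and the helper `in_scope` are assumptions for illustration; inspect the files in taxonomies/ for the real schema.

```python
import json

# Hypothetical taxonomy content; the real schema may differ.
TAXONOMY_JSON = """
{
  "canonical_entities": ["PERSON", "EMAIL_ADDRESS"],
  "severity": {"PERSON": "medium", "EMAIL_ADDRESS": "high"},
  "engine_labels": ["person", "email address"],
  "label_mappings": {
    "gliner2": {"person": "PERSON", "email address": "EMAIL_ADDRESS"}
  }
}
"""


def in_scope(entities: list[dict], taxonomy: dict) -> list[dict]:
    """Keep only ground-truth entities whose type is a canonical entity."""
    allowed = set(taxonomy["canonical_entities"])
    return [entity for entity in entities if entity["type"] in allowed]


taxonomy = json.loads(TAXONOMY_JSON)
ground_truth = [
    {"type": "PERSON", "start": 8, "end": 18},
    {"type": "CREDIT_CARD", "start": 40, "end": 59},  # outside this taxonomy
]
scored = in_scope(ground_truth, taxonomy)
```

With this filter in place, an engine is not penalised for missing labels that the chosen review scope never asked for.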

Digital Forensics Scenario

The digital-forensics taxonomy is designed for reviewing investigative transcripts, examiner interviews, forensic notes, and chat-analysis discussions. It flags ordinary PII alongside investigation-sensitive identifiers and context relevant to criminal investigations, chain-of-custody review, and export-control or sanctions analysis.

It covers direct identifiers (names, emails, phone numbers, locations, dates, URLs, IPs); account and communication identifiers (usernames, chat handles, social media handles, cloud accounts, WhatsApp-style participant identifiers); device and operating-system identifiers (hostnames, device IDs, MAC addresses, Windows SIDs, virtual-machine names, OS artifacts); forensic artifacts (filesystem paths, command-line references, registry/event log references, hashes, evidence IDs, case IDs, transaction IDs, credentials, secret keys); and export-control and sanctions context (controlled technologies, dual-use items, export licences, sanctioned countries, embargo references).

This is an engineering configuration for detection and investigative review. It is not legal advice and should be validated against the applicable jurisdiction, warrant scope, investigative policy, sanctions programme, and export-control classification process before production use.

Pharma Scenario

The pharma taxonomy targets HCP interview transcripts. It screens personal data and sensitive data relevant to pharmaceutical research and compliance review.

It covers direct identifiers (names, emails, phone numbers, locations, dates, URLs, IPs); HCP identifiers (medical licence numbers, professional IDs, job titles, healthcare organisations); study and patient references (subject IDs, patient IDs, clinical trial IDs, account-like identifiers); GDPR special-category areas (health data, genetic data, biometric data, racial or ethnic origin, religion or belief, political opinion, trade union membership, sex life or sexual orientation); and pharma context (diagnoses, medications, treatments, adverse events).

Reference basis:

  • European Data Protection Board guidance on personal data and sensitive data.
  • Regulation (EU) 2024/1689, the EU AI Act.
  • European Medicines Agency clinical-data publication and anonymisation practices.

This is an engineering configuration for detection and review. It is not legal advice and should be validated by privacy, legal, and pharmacovigilance stakeholders before production use.

Commands

All commands are exposed through both pii-eval <subcommand> and task <name>. The task wrapper handles the .venv interpreter; examples below use task.

Screen one transcript

task analyze -- data/pharma_hcp_interview_sample.vtt --taxonomy pharma --engine nvidia-gliner

Useful flags on top of the base invocation:

  • --hide-values — omit detected text from the output table.
  • --dedupe {none,prefer-longest,prefer-highest-score} — collapse overlapping spans before printing. Default is none.
  • --format json / --output findings.json — machine-readable output.
  • --redact-output redacted.vtt — write a redacted transcript that preserves VTT cue timestamps. Redaction always collapses overlaps with prefer-longest, regardless of --dedupe, so the widest span wins and no character of a flagged region is left unmasked.
  • --no-cache — disable the per-run disk cache.
  • --no-progress — silence stderr progress logs.
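To make the --dedupe strategies concrete, here is a minimal greedy sketch: rank the spans by the chosen criterion, then keep each span only if it does not overlap one already kept. The function names, the tie-breaking rule, and the greedy approach itself are assumptions; the project's implementation may resolve overlaps differently.

```python
def _by_length(span: dict) -> tuple:
    # Longest span first; earlier start breaks ties.
    return (-(span["end"] - span["start"]), span["start"])


def _by_score(span: dict) -> tuple:
    # Highest score first; spans without a score rank last.
    return (-span.get("score", 0.0), span["start"])


RANKERS = {"prefer-longest": _by_length, "prefer-highest-score": _by_score}


def dedupe(spans: list[dict], strategy: str = "none") -> list[dict]:
    """Collapse overlapping spans; offsets are half-open [start, end)."""
    if strategy == "none":
        return list(spans)
    kept: list[dict] = []
    for span in sorted(spans, key=RANKERS[strategy]):
        overlaps = any(span["start"] < k["end"] and k["start"] < span["end"] for k in kept)
        if not overlaps:
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])


spans = [
    {"type": "PERSON", "start": 8, "end": 18, "score": 0.60},
    {"type": "PERSON", "start": 8, "end": 12, "score": 0.90},
    {"type": "EMAIL_ADDRESS", "start": 22, "end": 38, "score": 0.99},
]
longest = dedupe(spans, "prefer-longest")
best_scored = dedupe(spans, "prefer-highest-score")
```

In this example prefer-longest keeps the 8 to 18 span, while prefer-highest-score keeps the shorter but more confident 8 to 12 span.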

Compare engines on one transcript

task analyze-compare -- data/pharma_hcp_interview_sample.vtt \
  --taxonomy pharma \
  --engine presidio --engine nvidia-gliner --engine gliner2

Add --summary to produce a plain-text interpretation (requires SUMMARY_MODEL_NAME; see Data Safety for what is and is not sent to the model). Add --run-dir runs/hcp-demo (or --run-dir auto for a timestamped folder) to write metadata.json, taxonomy.json, finding-counts.json, and summary.txt into a self-contained artefact.

Batch analyse a directory

task analyze-batch -- data --taxonomy pharma --engine presidio
task analyze-batch -- data --taxonomy pharma --engine presidio --format jsonl --output batch-results.jsonl --no-progress

Review loop: annotation → benchmark JSONL

  1. Export finding-level annotations for human review:

    task analyze -- data/pharma_hcp_interview_sample.vtt \
      --taxonomy pharma --engine nvidia-gliner \
      --annotation-output annotations/pharma-review.jsonl --hide-values

    Each row has status (default pending), type, severity, offsets, timestamp, score, and an optional value. Reviewers flip status to approved or rejected, correct spans/types, and add comments.

  2. Convert reviewed annotations into a benchmark sample:

    task annotations-to-jsonl -- annotations/pharma-review.jsonl \
      --source-text data/pharma_hcp_interview_sample.vtt \
      --output data/pharma-reviewed.jsonl \
      --sample-id pharma-hcp-reviewed

    approved and pending rows are included by default. Use --approved-only to exclude pending once review is complete.

  3. The resulting JSONL drops into task run / task compare as labelled ground truth.
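The conversion in step 2 can be pictured roughly as follows. This is a sketch under assumptions: the row field names (`start`, `end`) and the function `annotations_to_sample` are hypothetical, and the source is treated as plain text, whereas the real command also handles VTT input.

```python
import json


def annotations_to_sample(
    rows: list[dict],
    source_text: str,
    sample_id: str,
    approved_only: bool = False,
) -> dict:
    """Build one benchmark sample from reviewed annotation rows."""
    allowed = {"approved"} if approved_only else {"approved", "pending"}
    entities = [
        {
            "type": row["type"],
            "start": row["start"],
            "end": row["end"],
            # Re-derive the value from the source so hidden values are not needed.
            "value": source_text[row["start"]:row["end"]],
        }
        for row in rows
        if row["status"] in allowed
    ]
    return {"id": sample_id, "text": source_text, "entities": entities}


text = "Contact John Smith at john@example.com."
rows = [
    {"status": "approved", "type": "PERSON", "start": 8, "end": 18},
    {"status": "pending", "type": "EMAIL_ADDRESS", "start": 22, "end": 38},
    {"status": "rejected", "type": "LOCATION", "start": 0, "end": 7},
]
sample = annotations_to_sample(rows, text, "pharma-hcp-reviewed")
strict = annotations_to_sample(rows, text, "pharma-hcp-reviewed", approved_only=True)
jsonl_line = json.dumps(sample)
```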

Benchmark evaluation against JSONL

task run                                               # presidio on data/
task compare -- --engine presidio --engine nvidia-gliner --engine gliner2 --taxonomy core --match-mode partial

--match-mode partial uses IoU-based matching (default threshold 0.5, adjustable with --iou-threshold); --match-mode exact requires start+end+type to match. --score-threshold N is forwarded to any engine that advertises one (Presidio's native threshold; GLiNER2 and nvidia-gliner's threshold kwarg). Privacy Filter rejects a non-zero threshold since it has no score concept.
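A minimal sketch of the two match modes, assuming half-open character offsets. Requiring the type to agree in partial mode and treating an IoU equal to the threshold as a match are both assumptions; the helper names are illustrative only.

```python
def iou(a: dict, b: dict) -> float:
    """Intersection over union of two half-open character spans."""
    intersection = max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))
    union = (a["end"] - a["start"]) + (b["end"] - b["start"]) - intersection
    return intersection / union if union else 0.0


def is_match(pred: dict, truth: dict, mode: str = "partial", iou_threshold: float = 0.5) -> bool:
    if pred["type"] != truth["type"]:
        return False
    if mode == "exact":
        return pred["start"] == truth["start"] and pred["end"] == truth["end"]
    return iou(pred, truth) >= iou_threshold


truth = {"type": "PERSON", "start": 8, "end": 18}       # "John Smith"
first_name = {"type": "PERSON", "start": 8, "end": 12}  # "John"
near_miss = {"type": "PERSON", "start": 10, "end": 18}  # "hn Smith"
```

Here `first_name` overlaps the ground truth with an IoU of 0.4 and is rejected at the default threshold, while `near_miss` reaches 0.8 and is accepted in partial mode but not in exact mode.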

Add --summary for an LLM interpretation and --run-dir / --output for artefacts, same as analyze-compare.

Threshold sweep

task sweep

Runs the chosen engine at ten thresholds from 0.0 to 0.9 and prints a compact comparison table. Useful for finding the precision-recall knee.
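The idea behind the sweep can be sketched offline as below. Note the simplification: this example filters a fixed list of scored predictions at each threshold, whereas the real task re-runs the engine with each threshold. The prediction data and the function `sweep` are invented for illustration.

```python
# Ten thresholds: 0.0, 0.1, ..., 0.9.
THRESHOLDS = [round(i / 10, 1) for i in range(10)]

# Invented predictions as (score, is_correct) pairs.
predictions = [(0.95, True), (0.80, True), (0.55, False), (0.40, True), (0.15, False)]
total_truth = 3  # number of ground-truth spans


def sweep(preds: list[tuple[float, bool]], truth_count: int) -> list[tuple[float, float, float]]:
    """Return (threshold, precision, recall) rows for each threshold."""
    rows = []
    for threshold in THRESHOLDS:
        kept = [correct for score, correct in preds if score >= threshold]
        tp = sum(kept)
        precision = tp / len(kept) if kept else 0.0
        recall = tp / truth_count if truth_count else 0.0
        rows.append((threshold, precision, recall))
    return rows


rows = sweep(predictions, total_truth)
for threshold, precision, recall in rows:
    print(f"{threshold:.1f}  P={precision:.2f}  R={recall:.2f}")
```

Reading down the table, precision generally rises and recall falls as the threshold increases; the knee is where further precision gains start to cost recall sharply.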

Workflow examples

Two runnable workflow scripts live in examples/:

task cli:forensics           # digital-forensics transcript through Presidio
task cli:review-workflow     # analyze → export annotations → build benchmark JSONL
task cli:all                 # run every script

Scripts 01 to 05 are low-level Presidio references (detection, anonymisation, custom recognisers, batch processing, threshold tuning). Scripts 06 and 07 are the TranscriptIntel workflow examples.

Caching and progress logs

Analysis commands use a disk cache by default for engine results — primarily useful for hosted engines and large local models.

task analyze-compare -- data/pharma_hcp_interview_sample.vtt --taxonomy pharma --engine nvidia-gliner --no-cache
task cache-clear

Progress logs go to stderr and include engine names, taxonomy, sample IDs, counts, and phase changes. They never print raw text or detected values. Disable with --no-progress.

Data Safety

Summary generation is the only place this POC sends data to an external LLM. Two kinds of payload exist, both PII-scrubbed before rendering:

  • Metrics summary (benchmark runs): precision, recall, F1, runtime, samples per second, characters per second, total predictions, total ground truth, TP/FP/FN totals, and per-entity prediction counts. The summary model never sees raw sample text, detected values, false-positive examples, or false-negative examples.
  • Finding-count summary (screening runs): engine names, total counts, per-entity counts, severity counts, and an optional source filename (basename only, not the full path). Never transcript text, never detected values.

Prompt templates in prompts/*.mako format instructions around the sanitised JSON payload — they do not and must not decide whether raw transcript text or detected values may be sent to an LLM.
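As an illustration of the finding-count payload, the sketch below aggregates findings into counts and reduces the source path to its basename. The key names and the function `finding_count_payload` are assumptions, not the project's actual payload schema.

```python
import json
from collections import Counter
from pathlib import PurePosixPath
from typing import Optional


def finding_count_payload(engine: str, findings: list[dict], source_path: Optional[str] = None) -> dict:
    """Summarise findings as counts only; values and offsets are never copied."""
    payload = {
        "engine": engine,
        "total": len(findings),
        "per_entity": dict(Counter(f["type"] for f in findings)),
        "per_severity": dict(Counter(f.get("severity") or "unknown" for f in findings)),
    }
    if source_path is not None:
        payload["source"] = PurePosixPath(source_path).name  # basename only
    return payload


findings = [
    {"type": "PERSON", "severity": "medium", "value": "John Smith"},
    {"type": "PERSON", "severity": "medium", "value": "Jane Doe"},
    {"type": "EMAIL_ADDRESS", "severity": "high", "value": "john@example.com"},
]
payload = finding_count_payload("presidio", findings, "/srv/cases/pharma_hcp_interview_sample.vtt")
rendered = json.dumps(payload)
```

Building the payload from an explicit allow-list of fields, rather than deleting sensitive fields from a copy of the findings, means a newly added finding field cannot leak by default.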

Bundled Samples

Annotated benchmark data uses JSONL:

{"id":"example-001","text":"Contact John Smith at john@example.com.","entities":[{"type":"PERSON","start":8,"end":18,"value":"John Smith"},{"type":"EMAIL_ADDRESS","start":22,"end":38,"value":"john@example.com"}]}

Precision, recall, and F1 require ground-truth spans. Raw VTT transcript screening reports findings and counts but cannot compute metrics unless the transcript is first converted into annotated JSONL via the review loop.
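Because metrics depend on correct ground-truth offsets, it can help to sanity-check each JSONL record before benchmarking. The sketch below verifies that every entity's value equals the slice of text at its offsets, using the example record from this section. The helper `offset_errors` is hypothetical and not part of the CLI.

```python
import json

LINE = (
    '{"id":"example-001","text":"Contact John Smith at john@example.com.",'
    '"entities":[{"type":"PERSON","start":8,"end":18,"value":"John Smith"},'
    '{"type":"EMAIL_ADDRESS","start":22,"end":38,"value":"john@example.com"}]}'
)


def offset_errors(sample: dict) -> list[str]:
    """Report entities whose value does not match text[start:end]."""
    errors = []
    for entity in sample["entities"]:
        actual = sample["text"][entity["start"]:entity["end"]]
        if "value" in entity and actual != entity["value"]:
            errors.append(
                f'{sample["id"]}: {entity["type"]} at {entity["start"]}:{entity["end"]} '
                f'is {actual!r}, expected {entity["value"]!r}'
            )
    return errors


sample = json.loads(LINE)
errors = offset_errors(sample)
```

The check treats offsets as half-open, so `end` is one past the last character of the span.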

Benchmark data: data/samples.jsonl — 43 synthetic samples in the core taxonomy.

Digital-forensics VTT samples:

  • digital_forensics_windows_multipass_trace_hiding_sample.vtt — examiner review of a Windows laptop with Ubuntu Multipass installed; user/account identifiers, filesystem paths, OS artifacts, VM traces, network identifiers, hashes, and a credential artefact.
  • digital_forensics_whatsapp_embargo_technology_sample.vtt — WhatsApp analysis involving people, handles, companies, payment identifiers, controlled technologies, dual-use items, export licence references, sanctioned destination context, and US embargo discussion.

Pharma VTT samples:

  • pharma_hcp_interview_sample.vtt — balanced HCP interview with direct identifiers, study references, patient references, special-category details, and reimbursement data.
  • pharma_hcp_adverse_event_case_sample.vtt — pharmacovigilance-style adverse-event discussion with HCP contact details, patient identifiers, pregnancy, genetic data, ethnicity, IP address, and reimbursement data.
  • pharma_hcp_rare_disease_indirect_identifiers_sample.vtt — rare-disease pathway interview focused on indirect identifiers (small populations, village references, school schedule, genetic variant, caregiver context).
  • pharma_hcp_false_positive_control_sample.vtt — low-PII control transcript with pharma business vocabulary (account tier, target profile, batch, protocol deck, route), designed to expose false positives.

Current benchmark artefact

The authoritative benchmark result lives in runs/benchmark-all-engines-core-20260424/metrics.json, produced by:

task compare -- --engine presidio --engine nvidia-gliner --engine privacy-filter \
  --taxonomy core --match-mode partial \
  --run-dir runs/benchmark-all-engines-core-20260424 \
  --output runs/benchmark-all-engines-core-20260424/metrics.json --no-progress

Dataset: 43 samples, 1,951 characters. Matching: partial, IoU threshold 0.5.

The recorded JSON predates the Privacy Filter label-space normalisation and the gliner2 adapter; re-run the command above to refresh. Do not copy numbers into this README — read them from the artefact so the markdown cannot drift from the measurement.

Qualitative shape on this small synthetic benchmark: presidio is the fastest with very high recall but over-detects; nvidia-gliner tends to produce the best Micro-F1 but is slower because every sample goes through a hosted endpoint; privacy-filter has the highest precision and low false positives but lower recall.

Deployment

Resource usage depends on Python version, platform, package resolver, model cache state, transcript size, and whether hosted endpoints are used. The values below are practical sizing notes, not hard limits.

Disk footprint:

| Component | Local model download | Notes |
| --- | --- | --- |
| presidio | none | Uses local Presidio/spaCy dependencies; no multi-GB detector checkpoint. |
| nvidia-gliner | none | Hosted endpoint; only Python client dependencies locally. |
| privacy-filter | ~2.6 GB at ~/.opf/privacy_filter | model.safetensors; download on first load or via task privacy-filter:warmup. |
| gliner2 | ~810 MB at ~/.cache/huggingface/hub | fastino/gliner2-base-v1 full repo (measured 24 April 2026). Large variant is larger. |
| .venv with Privacy Filter installed | ~1.0 GB | Includes Torch and transitive runtime dependencies. |

Peak RSS (observed with task compare -- --taxonomy core --match-mode partial on 43 synthetic samples):

| Engine | Peak RSS | Sizing guidance |
| --- | --- | --- |
| presidio | ~0.9 GiB | CPU-bound, short synchronous jobs. Allow ≥1 GiB, preferably 2 GiB. |
| nvidia-gliner | ~0.3 GiB | Low local memory; latency bound by hosted endpoint. |
| privacy-filter | ~3.6 GiB | Torch model dominates. Allow ≥4 GiB, preferably 6–8 GiB. |
| gliner2 | not yet profiled | Torch + mid-size transformer. Confirm with task compare on your own workload before sizing. |

Railway:

  • presidio and nvidia-gliner fit smaller Railway services comfortably.
  • Do not re-download the Privacy Filter checkpoint on every deploy. Use a persistent volume, or pre-bake the checkpoint into an image if image size and deploy time are acceptable.
  • Size Privacy Filter containers at ≥4 GiB RAM, preferably 6–8 GiB once web-server overhead, concurrency, and transcript size are included.
  • Treat ephemeral filesystems as cache only — if ~/.opf/privacy_filter does not persist, the next cold start re-downloads ~2.6 GB.

Kubernetes (starting point for a single Privacy Filter worker):

resources:
  requests:
    cpu: "1"
    memory: "4Gi"
  limits:
    cpu: "2"
    memory: "8Gi"

For Presidio-only or hosted nvidia-gliner workers, start smaller (e.g. 512Mi–1Gi request, 1–2Gi limit) and adjust from observed pod metrics. For Privacy Filter specifically:

  • Mount a persistent volume at ~/.opf, or override HOME/cache paths, so the checkpoint survives restarts.
  • Use an init container or startup job running task privacy-filter:warmup to pre-populate the model before traffic is routed.
  • Keep concurrency low until memory is measured on production-size transcripts.
  • Separate hosted-engine workers from local-model workers if you want different autoscaling profiles.

Extending

Add a taxonomy: create taxonomies/<name>.json; define canonical_entities; optionally add severity, engine_labels, and label_mappings; run with --taxonomy <name>.

Add an engine: create an adapter in evaluation/engines/; implement analyze(text) -> list[dict] returning type, start, end, and optional score; set SCORE_THRESHOLD_PARAM (the kwarg name for confidence thresholding, or None); implement map_ground_truth() if the engine uses a different taxonomy; register in evaluation/engines/__init__.py; add a focused unit test.
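A toy adapter following the contract described above might look like the sketch below. The class name, the `name` attribute, and the regex are invented for illustration; registration in evaluation/engines/__init__.py, any base class the project expects, and map_ground_truth() are deliberately left out because their exact form is not shown here.

```python
import re


class RegexEmailEngine:
    """Hypothetical adapter that detects email addresses with a regex."""

    name = "regex-email"
    # This toy engine has no confidence threshold to forward.
    SCORE_THRESHOLD_PARAM = None

    _EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

    def analyze(self, text: str) -> list[dict]:
        """Return spans with type, start, end, and an optional score."""
        return [
            {"type": "EMAIL_ADDRESS", "start": m.start(), "end": m.end(), "score": 1.0}
            for m in self._EMAIL.finditer(text)
        ]


engine = RegexEmailEngine()
spans = engine.analyze("Contact John Smith at john@example.com.")
```

A focused unit test for such an adapter would assert on the returned offsets, since every downstream step (deduplication, redaction, matching) depends on them.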

Add an input format: parsers live in evaluation/analyze.py (parse_input_file, parse_vtt, parse_plain_text). Add a branch that returns ParsedDocument(text, segments) — the rest of the pipeline is format-agnostic.
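As a sketch of what a new parser branch could return, the example below parses a made-up tab-separated transcript format (timestamp, tab, utterance) into a stand-in `ParsedDocument`. The dataclass here only mirrors the `ParsedDocument(text, segments)` shape named above; the real class and the fields each segment carries may differ.

```python
from dataclasses import dataclass


@dataclass
class ParsedDocument:
    """Stand-in for the project's class; real fields may differ."""

    text: str
    segments: list[dict]


def parse_tsv_transcript(raw: str) -> ParsedDocument:
    """Parse 'timestamp<TAB>utterance' lines into text plus offset-aware segments."""
    parts: list[str] = []
    segments: list[dict] = []
    cursor = 0
    for line in raw.splitlines():
        if not line.strip():
            continue
        timestamp, _, utterance = line.partition("\t")
        # Record where this utterance sits in the joined text.
        segments.append({"timestamp": timestamp, "start": cursor, "end": cursor + len(utterance)})
        parts.append(utterance)
        cursor += len(utterance) + 1  # +1 for the newline that joins utterances
    return ParsedDocument(text="\n".join(parts), segments=segments)


doc = parse_tsv_transcript("00:00:01\tHello there\n00:00:04\tCall John Smith\n")
```

Keeping per-segment offsets into the joined text is what would let a finding's character span be traced back to a timestamp.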

Embed in another app: review annotations are JSONL rather than a separate database specifically so an embedding app can render the rows in its own review UI, update statuses and spans, then call annotations-to-jsonl or reuse evaluation.annotations directly.

Reference

Environment variables (create a .env in the project root):

OPENAI_BASE_URL=https://your-compatible-endpoint/v1
OPENAI_API_KEY=...
SUMMARY_MODEL_NAME=mistralai/mistral-large-3-675b-instruct-2512

# Optional
NVIDIA_PII_MODEL=nvidia/gliner-pii
GLINER2_MODEL=fastino/gliner2-base-v1
SUMMARY_TIMEOUT_SECONDS=120

Prompt templates (Mako, rendered against sanitised JSON payloads):

  • prompts/metrics_summary_system.mako / prompts/metrics_summary_user.mako
  • prompts/finding_summary_system.mako / prompts/finding_summary_user.mako

Repository layout:

  • data/ — sample benchmark data and sample transcript files.
  • prompts/ — Mako templates for LLM summary instructions.
  • taxonomies/ — reusable taxonomy definitions.
  • evaluation/engines/ — engine adapters.
  • evaluation/ — CLI, parsing, metrics, reports, summaries.
  • examples/ — Presidio API demos and TranscriptIntel workflow examples.
  • tests/ — regression tests.

Useful tasks:

task setup
task check
task run
task compare -- --engine presidio --engine nvidia-gliner --engine gliner2 --taxonomy core --match-mode partial
task analyze -- data/digital_forensics_windows_multipass_trace_hiding_sample.vtt --taxonomy digital-forensics --engine nvidia-gliner --dedupe prefer-longest
task analyze-compare -- data/pharma_hcp_interview_sample.vtt --taxonomy pharma --engine presidio --engine nvidia-gliner --summary
task annotations-to-jsonl -- annotations/pharma-review.jsonl --source-text data/pharma_hcp_interview_sample.vtt --output data/pharma-reviewed.jsonl
task cli:forensics
task cli:review-workflow
task privacy-filter:install
task gliner2:install
task nvidia:model-check
task cache-clear
