A reusable proof of concept for the TranscriptIntel platform. It compares PII and investigation-sensitive data detection engines on two tracks: labelled benchmark evaluation and unlabelled transcript screening.
The primary scenario is digital-forensics and criminal-investigation transcript analysis. Pharmaceutical HCP interview screening is included as a secondary regulated-domain scenario to prove that the engine and taxonomy structure reuse cleanly across verticals.
Prerequisites: task, uv, Python 3.12 or 3.13, and
ruff (global preferred; .venv/bin/ruff is the fallback).
```bash
task setup    # create .venv and install runtime dependencies
task check    # run ruff + pytest
task analyze -- data/pharma_hcp_interview_sample.vtt --taxonomy pharma --engine presidio
```

The first command creates a local .venv managed by uv. The third runs the
fastest local engine against a bundled pharma transcript. From here, pick an
engine (presidio, nvidia-gliner, privacy-filter, gliner2) and a
taxonomy (core, digital-forensics, pharma) and move on to the command
recipes below.
```mermaid
flowchart LR
    A["Input files (.vtt, .txt)"] --> B["Parser"]
    B --> C["Engine adapter"]
    T["Taxonomy JSON"] --> C
    C --> D["Post-process findings"]
    T --> D
    D --> E["Tables / JSON / JSONL"]
    D --> F["Redacted transcript"]
    D --> G["Review annotations"]
    G --> H["Benchmark JSONL"]
    H --> I["Metrics comparison"]
    E --> J["PII-safe summary"]
    I --> J
```
Each engine adapter returns a list of spans. A shared post-processing layer applies canonical labels, taxonomy severity, optional overlap deduplication, and output formatting. That boundary is what makes the POC reusable across engines and domains — swap the adapter or swap the taxonomy, the rest is unchanged.
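The shared post-processing step can be pictured with a short sketch. This is an illustration only, not the project's code: the function name `post_process`, the engine key `toy-engine`, the dictionary keys on each span, and the sample taxonomy values are all assumptions made for the example.

```python
from typing import Any

# Hypothetical taxonomy fragment; the real files live in taxonomies/*.json.
TAXONOMY: dict[str, Any] = {
    "canonical_entities": ["PERSON", "EMAIL_ADDRESS"],
    "severity": {"PERSON": "medium", "EMAIL_ADDRESS": "high"},
    "label_mappings": {
        "toy-engine": {"private_person": "PERSON", "private_email": "EMAIL_ADDRESS"},
    },
}


def post_process(spans: list[dict], engine: str, taxonomy: dict[str, Any]) -> list[dict]:
    """Map engine labels to canonical labels and attach taxonomy severity."""
    mapping = taxonomy.get("label_mappings", {}).get(engine, {})
    severity = taxonomy.get("severity", {})
    findings = []
    for span in spans:
        # Assumption: labels without a mapping pass through unchanged.
        canonical = mapping.get(span["type"], span["type"])
        findings.append({**span, "type": canonical, "severity": severity.get(canonical)})
    return findings


raw_spans = [
    {"type": "private_person", "start": 8, "end": 18, "score": 0.91},
    {"type": "private_email", "start": 22, "end": 38, "score": 0.88},
    {"type": "secret", "start": 40, "end": 44, "score": 0.50},
]
findings = post_process(raw_spans, "toy-engine", TAXONOMY)
```

Because the function only sees spans and a taxonomy, any adapter that emits `type`, `start`, and `end` can feed it, which is the reuse property the paragraph above describes.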
Two modes run through the same adapter layer:
- Benchmark evaluation compares engine predictions against annotated JSONL ground truth. Produces precision, recall, F1, and per-entity TP/FP/FN.
- Transcript screening analyses raw .vtt or .txt files without ground truth. Produces findings, counts, severity distribution, optional redaction, and review-ready annotation JSONL.
An optional plain-text summary step can interpret metrics or finding counts through an OpenAI-compatible model. Summary payloads carry only counts and metrics — never raw transcript text or detected values (see Data Safety).
| Engine | Deployment | Configurable labels | Local download | When to pick it |
|---|---|---|---|---|
| `presidio` | local (CPU) | via supported recognizers | none | fastest, highest recall, over-detects; suits review-heavy workflows |
| `nvidia-gliner` | hosted (OpenAI-compatible endpoint) | yes, arbitrary labels | none | broad configurable labels without local model weight |
| `privacy-filter` | local (Torch) | no (fixed OPF labels, mapped to canonical) | ~2.6 GB | high-precision local detection, heaviest cold start |
| `gliner2` | local (Torch) | yes, arbitrary labels | ~810 MB | configurable labels without a hosted endpoint |
List engines at runtime:
```bash
uv run --python .venv/bin/python pii-eval list-engines
```

OpenAI Privacy Filter's redact() API returns detected spans with labels and
character offsets. The adapter consumes those spans directly; it does not anonymise.
Its fixed OPF labels (e.g. private_person, private_email, account_number,
secret) are mapped to the canonical space so metrics are comparable with
other engines.
```bash
task privacy-filter:install
task privacy-filter:check
task privacy-filter:warmup   # pre-download the checkpoint
task privacy-filter:clear    # remove the local cache
```

The first real run downloads ~2.6 GB into ~/.opf/privacy_filter.
Fastino GLiNER2 is a local NER model with configurable labels. It has the
same shape as nvidia-gliner but runs through torch/transformers on your own
hardware.
```bash
task gliner2:install
task gliner2:check
task gliner2:warmup   # pre-download the checkpoint
task gliner2:clear    # remove the local cache
```

The default model is fastino/gliner2-base-v1 (~810 MB on disk as the full repo).
Override via GLINER2_MODEL in .env or a constructor argument.
Hosted OpenAI-compatible endpoint. Configure OPENAI_BASE_URL and
OPENAI_API_KEY in .env, optionally NVIDIA_PII_MODEL. Validate the
endpoint before running evaluations:
```bash
task nvidia:model-check
MODEL=mistralai/mistral-large-3-675b-instruct-2512 task nvidia:model-check
TIMEOUT_SECONDS=120 MODEL=some/model task nvidia:model-check
```

Taxonomies live in taxonomies/*.json. Each taxonomy declares:
- `canonical_entities`: project-level labels used for reporting and metrics.
- `severity`: optional severity group per canonical entity.
- `engine_labels`: labels requested from engines that accept configurable labels (`nvidia-gliner`, `gliner2`).
- `label_mappings`: engine-specific output label → canonical label.
List and inspect:
```bash
uv run --python .venv/bin/python pii-eval list-taxonomies
```

Bundled taxonomies:
- `core`: general synthetic benchmark labels; used by the bundled `data/samples.jsonl` benchmark.
- `digital-forensics`: examiner interviews, device examinations, chat analysis, OS artifacts, account/device identifiers, sanctions and embargo references, export-control review.
- `pharma`: HCP interview and pharmaceutical compliance screening, including GDPR special categories.
Engine behaviour differs: nvidia-gliner and gliner2 receive taxonomy
labels directly; presidio receives only the subset of labels its recognisers
support; privacy-filter uses fixed output labels and relies on mapping.
When a taxonomy is selected for benchmark evaluation, ground-truth labels
outside the taxonomy's canonical_entities are excluded from scoring. This
keeps domain-specific evaluations focused on the chosen review scope.
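The scoping rule matters because an out-of-scope ground-truth span would otherwise be counted as a false negative for every engine. A minimal sketch of the filter, assuming ground-truth entities are dictionaries with a `type` key as in the benchmark JSONL (the function name `filter_ground_truth` and the `LICENSE_PLATE` label are hypothetical):

```python
def filter_ground_truth(entities: list[dict], canonical_entities: list[str]) -> list[dict]:
    """Keep only ground-truth spans whose type is inside the taxonomy's scope."""
    allowed = set(canonical_entities)
    return [entity for entity in entities if entity["type"] in allowed]


ground_truth = [
    {"type": "PERSON", "start": 8, "end": 18},
    {"type": "LICENSE_PLATE", "start": 30, "end": 37},  # not in the chosen taxonomy
]
scored = filter_ground_truth(ground_truth, ["PERSON", "EMAIL_ADDRESS"])
```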
The digital-forensics taxonomy is designed for reviewing investigative
transcripts, examiner interviews, forensic notes, and chat-analysis
discussions. It flags ordinary PII alongside investigation-sensitive
identifiers and context relevant to criminal investigations,
chain-of-custody review, and export-control or sanctions analysis.
It covers direct identifiers (names, emails, phone numbers, locations, dates, URLs, IPs); account and communication identifiers (usernames, chat handles, social media handles, cloud accounts, WhatsApp-style participant identifiers); device and operating-system identifiers (hostnames, device IDs, MAC addresses, Windows SIDs, virtual-machine names, OS artifacts); forensic artifacts (filesystem paths, command-line references, registry/event log references, hashes, evidence IDs, case IDs, transaction IDs, credentials, secret keys); and export-control and sanctions context (controlled technologies, dual-use items, export licences, sanctioned countries, embargo references).
This is an engineering configuration for detection and investigative review. It is not legal advice and should be validated against the applicable jurisdiction, warrant scope, investigative policy, sanctions programme, and export-control classification process before production use.
The pharma taxonomy targets HCP interview transcripts. It screens personal
data and sensitive data relevant to pharmaceutical research and compliance
review.
It covers direct identifiers (names, emails, phone numbers, locations, dates, URLs, IPs); HCP identifiers (medical licence numbers, professional IDs, job titles, healthcare organisations); study and patient references (subject IDs, patient IDs, clinical trial IDs, account-like identifiers); GDPR special-category areas (health data, genetic data, biometric data, racial or ethnic origin, religion or belief, political opinion, trade union membership, sex life or sexual orientation); and pharma context (diagnoses, medications, treatments, adverse events).
Reference basis:
- European Data Protection Board guidance on personal data and sensitive data.
- Regulation (EU) 2024/1689, the EU AI Act.
- European Medicines Agency clinical-data publication and anonymisation practices.
This is an engineering configuration for detection and review. It is not legal advice and should be validated by privacy, legal, and pharmacovigilance stakeholders before production use.
All commands are exposed through both pii-eval <subcommand> and
task <name>. The task wrapper handles the .venv interpreter; examples
below use task.
```bash
task analyze -- data/pharma_hcp_interview_sample.vtt --taxonomy pharma --engine nvidia-gliner
```

Useful flags on top of the base invocation:
- `--hide-values`: omit detected text from the output table.
- `--dedupe {none,prefer-longest,prefer-highest-score}`: collapse overlapping spans before printing. Default is `none`.
- `--format json` / `--output findings.json`: machine-readable output.
- `--redact-output redacted.vtt`: write a redacted transcript that preserves VTT cue timestamps. Redaction always collapses overlaps with `prefer-longest`, regardless of `--dedupe`, so the widest span wins and no character of a flagged region is left unmasked.
- `--no-cache`: disable the per-run disk cache.
- `--no-progress`: silence stderr progress logs.
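To make the prefer-longest behaviour concrete, here is a minimal sketch of overlap deduplication followed by masking. It is not the project's implementation: the function names, the greedy strategy, and the `[TYPE]` placeholder format are assumptions, and VTT cue handling is left out.

```python
def dedupe_prefer_longest(spans: list[dict]) -> list[dict]:
    """Greedy overlap removal: visit spans longest first, keep the non-overlapping ones."""
    kept: list[dict] = []
    # Longest first; ties are broken by the earlier start so the result is deterministic.
    for span in sorted(spans, key=lambda s: (-(s["end"] - s["start"]), s["start"])):
        if all(span["end"] <= k["start"] or span["start"] >= k["end"] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])


def redact(text: str, spans: list[dict]) -> str:
    """Replace each kept span with a [TYPE] placeholder, working right to left."""
    for span in reversed(dedupe_prefer_longest(spans)):
        text = text[: span["start"]] + f"[{span['type']}]" + text[span["end"] :]
    return text


text = "Contact John Smith at john@example.com."
spans = [
    {"type": "PERSON", "start": 8, "end": 18},
    {"type": "PERSON", "start": 8, "end": 12},  # nested inside the longer span
    {"type": "EMAIL_ADDRESS", "start": 22, "end": 38},
]
redacted = redact(text, spans)
```

Replacing from right to left keeps the earlier offsets valid while the string changes length.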
```bash
task analyze-compare -- data/pharma_hcp_interview_sample.vtt \
  --taxonomy pharma \
  --engine presidio --engine nvidia-gliner --engine gliner2
```

Add --summary to produce a plain-text interpretation (requires
SUMMARY_MODEL_NAME; see Data Safety for what is and is not
sent to the model). Add --run-dir runs/hcp-demo (or --run-dir auto for a
timestamped folder) to write metadata.json, taxonomy.json,
finding-counts.json, and summary.txt into a self-contained artefact.
```bash
task analyze-batch -- data --taxonomy pharma --engine presidio
task analyze-batch -- data --taxonomy pharma --engine presidio --format jsonl --output batch-results.jsonl --no-progress
```

- Export finding-level annotations for human review:

  ```bash
  task analyze -- data/pharma_hcp_interview_sample.vtt \
    --taxonomy pharma --engine nvidia-gliner \
    --annotation-output annotations/pharma-review.jsonl --hide-values
  ```

  Each row has `status` (default `pending`), `type`, `severity`, offsets, timestamp, score, and an optional value. Reviewers flip `status` to `approved` or `rejected`, correct spans/types, and add comments.

- Convert reviewed annotations into a benchmark sample:

  ```bash
  task annotations-to-jsonl -- annotations/pharma-review.jsonl \
    --source-text data/pharma_hcp_interview_sample.vtt \
    --output data/pharma-reviewed.jsonl \
    --sample-id pharma-hcp-reviewed
  ```

  `approved` and `pending` rows are included by default. Use `--approved-only` to exclude pending rows once review is complete.

- The resulting JSONL drops into `task run` / `task compare` as labelled ground truth.
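The conversion step in the review loop can be sketched as follows. This is a hedged illustration rather than the behaviour of `annotations-to-jsonl` itself: the row keys `start` and `end` and the function name `annotations_to_sample` are assumptions, and the real command may carry additional fields.

```python
import json


def annotations_to_sample(rows: list[dict], source_text: str, sample_id: str,
                          approved_only: bool = False) -> dict:
    """Build one benchmark sample from reviewed annotation rows."""
    accepted = {"approved"} if approved_only else {"approved", "pending"}
    entities = [
        {
            "type": row["type"],
            "start": row["start"],
            "end": row["end"],
            # Re-derive the value from the source text so offsets and value cannot disagree.
            "value": source_text[row["start"]:row["end"]],
        }
        for row in rows
        if row["status"] in accepted
    ]
    return {"id": sample_id, "text": source_text, "entities": entities}


text = "Contact John Smith at john@example.com."
rows = [
    {"status": "approved", "type": "PERSON", "start": 8, "end": 18},
    {"status": "pending", "type": "EMAIL_ADDRESS", "start": 22, "end": 38},
    {"status": "rejected", "type": "LOCATION", "start": 0, "end": 7},
]
sample = annotations_to_sample(rows, text, "example-001", approved_only=True)
jsonl_line = json.dumps(sample)  # one line of the benchmark JSONL file
```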
```bash
task run    # presidio on data/
task compare -- --engine presidio --engine nvidia-gliner --engine gliner2 --taxonomy core --match-mode partial
```

--match-mode partial uses IoU-based matching (default threshold 0.5,
adjustable with --iou-threshold); --match-mode exact requires start+end+type
to match. --score-threshold N is forwarded to any engine that advertises one
(Presidio's native threshold; GLiNER2 and nvidia-gliner's threshold kwarg).
Privacy Filter rejects a non-zero threshold since it has no score concept.
Add --summary for an LLM interpretation and --run-dir / --output for
artefacts, same as analyze-compare.
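The two match modes can be sketched in a few lines. The helper names are hypothetical, and two details are assumptions rather than documented behaviour: that partial mode also requires the entity type to agree, and that a span pair matches when IoU is greater than or equal to the threshold.

```python
def span_iou(a: dict, b: dict) -> float:
    """Intersection over union of two character spans with half-open offsets."""
    intersection = max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))
    union = (a["end"] - a["start"]) + (b["end"] - b["start"]) - intersection
    return intersection / union if union else 0.0


def is_match(pred: dict, truth: dict, mode: str = "partial",
             iou_threshold: float = 0.5) -> bool:
    """Decide whether a prediction counts as a true positive for a ground-truth span."""
    if pred["type"] != truth["type"]:
        return False
    if mode == "exact":
        return pred["start"] == truth["start"] and pred["end"] == truth["end"]
    return span_iou(pred, truth) >= iou_threshold


truth = {"type": "PERSON", "start": 8, "end": 18}    # "John Smith"
first_name = {"type": "PERSON", "start": 8, "end": 12}  # "John", IoU 4/10
most_of_it = {"type": "PERSON", "start": 8, "end": 16}  # "John Smi", IoU 8/10
```

With the default threshold, `most_of_it` matches in partial mode and `first_name` does not; neither matches in exact mode.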
```bash
task sweep
```

Runs the chosen engine at ten thresholds from 0.0 to 0.9 and prints a compact comparison table. Useful for finding the precision-recall knee.
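As a rough picture of what a threshold sweep computes, the sketch below filters scored predictions at each threshold and derives precision, recall, and F1. It uses exact span matching for brevity and invented sample data; the function names are hypothetical and the real command's table layout is not reproduced.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from counts, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def sweep(predictions: list[dict], truth_keys: set[tuple], thresholds: list[float]) -> list[tuple]:
    """Return one (threshold, precision, recall, f1) row per threshold."""
    rows = []
    for threshold in thresholds:
        kept = {(p["type"], p["start"], p["end"])
                for p in predictions if p["score"] >= threshold}
        tp = len(kept & truth_keys)
        rows.append((threshold, *prf(tp, len(kept) - tp, len(truth_keys) - tp)))
    return rows


thresholds = [i / 10 for i in range(10)]  # 0.0, 0.1, ..., 0.9
predictions = [
    {"type": "PERSON", "start": 8, "end": 18, "score": 0.95},
    {"type": "EMAIL_ADDRESS", "start": 22, "end": 38, "score": 0.60},
    {"type": "PERSON", "start": 0, "end": 7, "score": 0.30},  # false positive
]
truth_keys = {("PERSON", 8, 18), ("EMAIL_ADDRESS", 22, 38)}
table = sweep(predictions, truth_keys, thresholds)
```

In this toy data precision rises once the low-scoring false positive is filtered out, and recall drops once the threshold exceeds the email prediction's score; the knee sits between those two points.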
Two runnable workflow scripts live in examples/:
```bash
task cli:forensics         # digital-forensics transcript through Presidio
task cli:review-workflow   # analyze → export annotations → build benchmark JSONL
task cli:all               # run every script
```

Scripts 01–05 are low-level Presidio reference (detection, anonymisation,
custom recognisers, batch processing, threshold tuning). Scripts 06–07
are the TranscriptIntel workflow examples.
Analysis commands use a disk cache by default for engine results — primarily useful for hosted engines and large local models.
```bash
task analyze-compare -- data/pharma_hcp_interview_sample.vtt --taxonomy pharma --engine nvidia-gliner --no-cache
task cache-clear
```

Progress logs go to stderr and include engine names, taxonomy, sample IDs,
counts, and phase changes. They never print raw text or detected values.
Disable with --no-progress.
Summary generation is the only place this POC sends data to an external LLM. Two kinds of payload exist, both PII-scrubbed before rendering:
- Metrics summary (benchmark runs): precision, recall, F1, runtime, samples per second, characters per second, total predictions, total ground truth, TP/FP/FN totals, and per-entity prediction counts. The summary model never sees raw sample text, detected values, false-positive examples, or false-negative examples.
- Finding-count summary (screening runs): engine names, total counts, per-entity counts, severity counts, and an optional source filename (basename only, not the full path). Never transcript text, never detected values.
Prompt templates in prompts/*.mako format instructions around the sanitised
JSON payload — they do not and must not decide whether raw transcript text
or detected values may be sent to an LLM.
Annotated benchmark data uses JSONL:
```json
{"id":"example-001","text":"Contact John Smith at john@example.com.","entities":[{"type":"PERSON","start":8,"end":18,"value":"John Smith"},{"type":"EMAIL_ADDRESS","start":22,"end":38,"value":"john@example.com"}]}
```

Precision, recall, and F1 require ground-truth spans. Raw VTT transcript screening reports findings and counts but cannot compute metrics unless the transcript is first converted into annotated JSONL via the review loop.
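Since metrics depend on character offsets, it can help to check that each entity's `start` and `end` really slice out its `value` before a file is used as ground truth. A minimal checker, assuming half-open offsets as in the sample line above (the function name `validate_sample` is hypothetical):

```python
import json


def validate_sample(line: str) -> list[str]:
    """Return a list of problems; an empty list means offsets and values agree."""
    sample = json.loads(line)
    text, problems = sample["text"], []
    for entity in sample["entities"]:
        actual = text[entity["start"]:entity["end"]]
        if "value" in entity and actual != entity["value"]:
            problems.append(
                f"{sample['id']}: {entity['type']} expects {entity['value']!r}, "
                f"offsets give {actual!r}"
            )
    return problems


line = (
    '{"id":"example-001","text":"Contact John Smith at john@example.com.",'
    '"entities":[{"type":"PERSON","start":8,"end":18,"value":"John Smith"},'
    '{"type":"EMAIL_ADDRESS","start":22,"end":38,"value":"john@example.com"}]}'
)
problems = validate_sample(line)
```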
Benchmark data: data/samples.jsonl — 43 synthetic samples in the core
taxonomy.
Digital-forensics VTT samples:
- `digital_forensics_windows_multipass_trace_hiding_sample.vtt`: examiner review of a Windows laptop with Ubuntu Multipass installed; user/account identifiers, filesystem paths, OS artifacts, VM traces, network identifiers, hashes, and a credential artefact.
- `digital_forensics_whatsapp_embargo_technology_sample.vtt`: WhatsApp analysis involving people, handles, companies, payment identifiers, controlled technologies, dual-use items, export licence references, sanctioned destination context, and US embargo discussion.
Pharma VTT samples:
- `pharma_hcp_interview_sample.vtt`: balanced HCP interview with direct identifiers, study references, patient references, special-category details, and reimbursement data.
- `pharma_hcp_adverse_event_case_sample.vtt`: pharmacovigilance-style adverse-event discussion with HCP contact details, patient identifiers, pregnancy, genetic data, ethnicity, IP address, and reimbursement data.
- `pharma_hcp_rare_disease_indirect_identifiers_sample.vtt`: rare-disease pathway interview focused on indirect identifiers (small populations, village references, school schedule, genetic variant, caregiver context).
- `pharma_hcp_false_positive_control_sample.vtt`: low-PII control transcript with pharma business vocabulary (account tier, target profile, batch, protocol deck, route), designed to expose false positives.
The authoritative benchmark result lives in
runs/benchmark-all-engines-core-20260424/metrics.json, produced by:
```bash
task compare -- --engine presidio --engine nvidia-gliner --engine privacy-filter \
  --taxonomy core --match-mode partial \
  --run-dir runs/benchmark-all-engines-core-20260424 \
  --output runs/benchmark-all-engines-core-20260424/metrics.json --no-progress
```

Dataset: 43 samples, 1,951 characters. Matching: partial, IoU threshold 0.5.
The recorded JSON predates the Privacy Filter label-space normalisation and the gliner2 adapter; re-run the command above to refresh. Do not copy numbers into this README — read them from the artefact so the markdown cannot drift from the measurement.
Qualitative shape on this small synthetic benchmark: presidio is the
fastest with very high recall but over-detects; nvidia-gliner tends to
produce the best Micro-F1 but is slower because every sample goes through a
hosted endpoint; privacy-filter has the highest precision and low false
positives but lower recall.
Resource usage depends on Python version, platform, package resolver, model cache state, transcript size, and whether hosted endpoints are used. The values below are practical sizing notes, not hard limits.
Disk footprint:
| Component | Local model download | Notes |
|---|---|---|
| `presidio` | none | Uses local Presidio/spaCy dependencies; no multi-GB detector checkpoint. |
| `nvidia-gliner` | none | Hosted endpoint; only Python client dependencies locally. |
| `privacy-filter` | ~2.6 GB at `~/.opf/privacy_filter` | `model.safetensors`; download on first load or via `task privacy-filter:warmup`. |
| `gliner2` | ~810 MB at `~/.cache/huggingface/hub` | `fastino/gliner2-base-v1` full repo (measured 24 April 2026). Large variant is larger. |
| `.venv` with Privacy Filter installed | ~1.0 GB | Includes Torch and transitive runtime dependencies. |
Peak RSS (observed with task compare -- --taxonomy core --match-mode partial on 43 synthetic samples):
| Engine | Peak RSS | Sizing guidance |
|---|---|---|
| `presidio` | ~0.9 GiB | CPU-bound, short synchronous jobs. Allow ≥1 GiB, preferably 2 GiB. |
| `nvidia-gliner` | ~0.3 GiB | Low local memory; latency bound by hosted endpoint. |
| `privacy-filter` | ~3.6 GiB | Torch model dominates. Allow ≥4 GiB, preferably 6–8 GiB. |
| `gliner2` | not yet profiled | Torch + mid-size transformer. Confirm with `task compare` on your own workload before sizing. |
Railway:
- `presidio` and `nvidia-gliner` fit smaller Railway services comfortably.
- Do not re-download the Privacy Filter checkpoint on every deploy. Use a persistent volume, or pre-bake the checkpoint into an image if image size and deploy time are acceptable.
- Size Privacy Filter containers at ≥4 GiB RAM, preferably 6–8 GiB once web-server overhead, concurrency, and transcript size are included.
- Treat ephemeral filesystems as cache only: if `~/.opf/privacy_filter` does not persist, the next cold start re-downloads ~2.6 GB.
Kubernetes (starting point for a single Privacy Filter worker):
```yaml
resources:
  requests:
    cpu: "1"
    memory: "4Gi"
  limits:
    cpu: "2"
    memory: "8Gi"
```

For Presidio-only or hosted nvidia-gliner workers, start smaller (e.g.
512Mi–1Gi request, 1–2Gi limit) and adjust from observed pod metrics.
For Privacy Filter specifically: mount a persistent volume at ~/.opf or
override HOME/cache paths so the checkpoint survives restarts; use an init
container or startup job running task privacy-filter:warmup to pre-populate
the model before traffic is routed; keep concurrency low until memory is
measured on production-size transcripts; separate hosted-engine workers from
local-model workers if you want different autoscaling profiles.
Add a taxonomy: create taxonomies/<name>.json; define canonical_entities;
optionally add severity, engine_labels, and label_mappings; run with
--taxonomy <name>.
Add an engine: create an adapter in evaluation/engines/; implement
analyze(text) -> list[dict] returning type, start, end, and optional
score; set SCORE_THRESHOLD_PARAM (the kwarg name for confidence
thresholding, or None); implement map_ground_truth() if the engine uses a
different taxonomy; register in evaluation/engines/__init__.py; add a
focused unit test.
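A toy adapter shows the shape of that contract. Everything beyond `analyze` and `SCORE_THRESHOLD_PARAM` is an assumption here: the class name, the `name` attribute, and the regex are invented for illustration, and the base class and registration call used by evaluation/engines/ are deliberately not shown because the source does not describe them.

```python
import re


class RegexEmailEngine:
    """Toy adapter: detects email addresses with a regular expression."""

    name = "regex-email"          # hypothetical registry key
    SCORE_THRESHOLD_PARAM = None  # this engine has no confidence score to threshold

    _EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def analyze(self, text: str) -> list[dict]:
        """Return spans with canonical type and half-open character offsets."""
        return [
            {"type": "EMAIL_ADDRESS", "start": match.start(), "end": match.end()}
            for match in self._EMAIL.finditer(text)
        ]


spans = RegexEmailEngine().analyze("Contact John Smith at john@example.com.")
```

A focused unit test for such an adapter would assert the exact offsets on a short fixed string, since the post-processing and metrics layers rely on them.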
Add an input format: parsers live in evaluation/analyze.py
(parse_input_file, parse_vtt, parse_plain_text). Add a branch that
returns ParsedDocument(text, segments) — the rest of the pipeline is
format-agnostic.
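As an illustration of a new parser branch, the sketch below handles SubRip (.srt) input. The `ParsedDocument` dataclass is a stand-in defined for this example; the real class in evaluation/analyze.py may have different field types, and the segment keys (`start`, `end`, `timestamp`) are assumptions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ParsedDocument:  # stand-in for the project's class; field shapes are assumed
    text: str
    segments: list[dict] = field(default_factory=list)


_TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(raw: str) -> ParsedDocument:
    """Join cue text into one string and record each cue's character offsets."""
    parts: list[str] = []
    segments: list[dict] = []
    cursor = 0
    for block in re.split(r"\n\s*\n", raw.strip()):
        lines = block.splitlines()
        for index, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                break
        else:
            continue  # no timing line, so this block is not a cue
        cue_text = " ".join(part.strip() for part in lines[index + 1:])
        segments.append({"start": cursor, "end": cursor + len(cue_text),
                         "timestamp": match.group(1)})
        parts.append(cue_text)
        cursor += len(cue_text) + 1  # +1 for the newline that joins cues
    return ParsedDocument(text="\n".join(parts), segments=segments)


raw = (
    "1\n00:00:01,000 --> 00:00:03,000\nContact John Smith.\n\n"
    "2\n00:00:04,000 --> 00:00:06,000\nEmail john@example.com.\n"
)
doc = parse_srt(raw)
```

Keeping per-cue offsets into the joined text is what lets later stages map a detected span back to its timestamp without knowing the input format.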
Embed in another app: review annotations are JSONL rather than a separate
database specifically so an embedding app can render the rows in its own
review UI, update statuses and spans, then call annotations-to-jsonl or
reuse evaluation.annotations directly.
Environment variables (create a .env in the project root):
```bash
OPENAI_BASE_URL=https://your-compatible-endpoint/v1
OPENAI_API_KEY=...
SUMMARY_MODEL_NAME=mistralai/mistral-large-3-675b-instruct-2512
# Optional
NVIDIA_PII_MODEL=nvidia/gliner-pii
GLINER2_MODEL=fastino/gliner2-base-v1
SUMMARY_TIMEOUT_SECONDS=120
```

Prompt templates (Mako, rendered against sanitised JSON payloads):
- `prompts/metrics_summary_system.mako` / `prompts/metrics_summary_user.mako`
- `prompts/finding_summary_system.mako` / `prompts/finding_summary_user.mako`
Repository layout:
- `data/`: sample benchmark data and sample transcript files.
- `prompts/`: Mako templates for LLM summary instructions.
- `taxonomies/`: reusable taxonomy definitions.
- `evaluation/engines/`: engine adapters.
- `evaluation/`: CLI, parsing, metrics, reports, summaries.
- `examples/`: Presidio API demos and TranscriptIntel workflow examples.
- `tests/`: regression tests.
Useful tasks:
```bash
task setup
task check
task run
task compare -- --engine presidio --engine nvidia-gliner --engine gliner2 --taxonomy core --match-mode partial
task analyze -- data/digital_forensics_windows_multipass_trace_hiding_sample.vtt --taxonomy digital-forensics --engine nvidia-gliner --dedupe prefer-longest
task analyze-compare -- data/pharma_hcp_interview_sample.vtt --taxonomy pharma --engine presidio --engine nvidia-gliner --summary
task annotations-to-jsonl -- annotations/pharma-review.jsonl --source-text data/pharma_hcp_interview_sample.vtt --output data/pharma-reviewed.jsonl
task cli:forensics
task cli:review-workflow
task privacy-filter:install
task gliner2:install
task nvidia:model-check
task cache-clear
```