Subgroup-stratified, calibration-aware fairness auditing for machine-learning models — grounded in peer-reviewed methods.
📖 Documentation: https://rajveer-code.github.io/fairscope/
fairscope packages statistical machinery that mainstream fairness toolkits do not expose as
first-class, subgroup-stratified functions, and adds one novel protocol on top:
- DeLong confidence intervals for per-subgroup AUC (fast midrank algorithm).
- Per-subgroup Expected/Maximum Calibration Error with reliability diagrams.
- Significance testing of subgroup performance gaps (paired/unpaired DeLong, stratified bootstrap) with Bonferroni / Benjamini–Hochberg correction.
- A subgroup-stratified interface to standard recalibration — temperature scaling (Guo et al. 2017) and isotonic regression (Zadrozny & Elkan 2002), with pre/post-ECE.
- A novel five-axis Cross-Platform Fairness Evaluation (CPFE) protocol.
- One-call domain audits:
healthcare,lending,federated.
Only the CPFE protocol is presented as novel. Every other function ports a documented method and cites its source; the recalibration methods are standard, and the contribution there is the per-subgroup interface and pre/post-ECE reporting.
Status — v0.3.0. All five modules (
core,healthcare,nlp/CPFE,federated,lending) are implemented, tested, and released. 100% line coverage on the statistical core; CI green across Python 3.9–3.12. Seedocs/DESIGN.mdfor methods and design.
pip install fairscopeReleases are uploaded to PyPI by the maintainer; if a version isn't available there yet, install from source or from the release assets:
git clone https://github.com/Rajveer-code/fairscope
cd fairscope
pip install -e ".[dev]"
pytestThe base install is light (NumPy, SciPy, scikit-learn, pandas, matplotlib). Optional extras:
fairscope[nlp] (torch, transformers, captum), fairscope[lending] (econml),
fairscope[shap], fairscope[docs].
from fairscope.healthcare import HealthcareFairnessAudit
# y_true : binary outcomes
# y_score: the model's positive-class probabilities
# age_group: a protected attribute, aligned row-for-row
report = HealthcareFairnessAudit.from_scores(
y_true, y_score, {"age_group": age_group}
).run()
print(report.summary()) # per-subgroup AUC (DeLong CI), ECE, Brier, F1; flags the largest gap
report.to_dataframe() # tidy per-subgroup table
report.plot_auc_forest() # forest plot of per-subgroup AUC with DeLong intervalsEvery domain is also reachable through one dispatcher,
FairnessAudit(model, domain=...), with domain in {"healthcare", "nlp", "federated", "lending"}. A runnable end-to-end example on a committed synthetic fixture is in the
getting-started guide and in
notebooks/.
| Module | Purpose | Status |
|---|---|---|
core/ |
DeLong CI, bootstrap-AUC test, ECE/MCE + reliability, multiple-testing correction, subgroup metrics | ✅ shipped |
healthcare/ |
one-call clinical fairness audit + report (tables, forest & reliability plots, PDF, optional SHAP) | ✅ shipped |
nlp/ |
CPFE five-axis cross-platform protocol (centerpiece) + Captum attribution stability | ✅ shipped |
federated/ |
per-node DeLong + cross-node disparity + per-node recalibration | ✅ shipped |
lending/ |
annual approval-gap + subgroup CATE (Causal Forest DML) | ✅ shipped |
Plotting (forest plots, reliability diagrams) currently lives in the domain reports.
lending's CATE estimation needs the optional fairscope[lending] extra (econml). The
federated module audits per-node predictions only — it performs no training and provides no
privacy guarantee.
fairscope is complementary to AIF360 and Fairlearn, not a replacement: those toolkits do bias
mitigation; fairscope does uncertainty-aware measurement. The table below was verified by
inspecting the installed public APIs of AIF360 0.6.1 and Fairlearn 0.14.0 (checked
2026-06; re-confirm if versions change).
| Capability | AIF360 | Fairlearn | fairscope |
|---|---|---|---|
| Per-subgroup AUC confidence interval (DeLong) | no | no* | yes |
| Per-subgroup Expected Calibration Error | no | no | yes |
| Subgroup significance test + multiple-comparison correction | no | no | yes |
| Subgroup-stratified recalibration (temperature / isotonic) | partial† | no | yes |
| Cross-platform five-axis protocol (CPFE) | no | no | yes (novel) |
| Per-node / federated audit | no | no | yes |
| Bias-mitigation algorithms | yes | yes | out of scope |
* Fairlearn's MetricFrame computes per-subgroup AUC point estimates (e.g.
roc_auc_score_group_min), but provides no analytic (DeLong) confidence interval.
† AIF360 ships CalibratedEqOddsPostprocessing (calibration-aware equalized-odds
postprocessing), not a general per-subgroup temperature/isotonic recalibration interface.
Closest related work — meval (Sutariya & Petersen, 2025,
arXiv:2512.17409): a statistical toolbox for stratified,
fine-grained model-performance analysis that also provides subgroup metric uncertainty and
multiple-comparison corrections (with a medical-imaging focus). fairscope overlaps with it on
uncertainty + significance; what fairscope adds is the specific DeLong AUC intervals, the
per-subgroup calibration and recalibration interface, the five-axis cross-platform CPFE
protocol, and one-call domain audits (healthcare / lending / federated).
- Test-driven, with regression tests anchored to authoritative reference values where they
exist (DeLong's worked example;
statsmodelsmultiple-testing routines). - 100% line coverage on the statistical core; CI runs pytest + coverage, ruff, and black
across Python 3.9–3.12, and executes the replication notebooks via
nbmake. - Full type hints, NumPy-style docstrings with runnable examples, and explicit input validation (an AUC on a single-class subgroup raises rather than returning a meaningless value).
- Committed fixtures are small, synthetic, and labelled as such; no datasets or trained models are bundled.
fairscope ports methods from the author's peer-reviewed and under-review papers; it invents no
new mathematics. Each function cites its source. Full venues and identifiers are in
CITATION.cff.
- Diabetes risk prediction with external validation + fairness analysis (XGBoost, NHANES→BRFSS) — IEEE CIPHER, 2026.
- A five-axis Cross-Platform Fairness Evaluation for mental-health NLP — under review.
- Privacy-preserving federated learning for diabetes risk across heterogeneous nodes — under review.
- Heterogeneous racial effects in mortgage approval (Causal Forest Double Machine Learning, HMDA) — under review.
- Racial disparities in mortgage lending (RDD / DiD / decomposition, HMDA) — under review.
If you use fairscope, please cite it via CITATION.cff.
MIT © 2026 Rajveer Singh Pall · ORCID 0009-0001-6762-6134