ORACLE is research software for oral and craniofacial single-cell analysis. The current release provides an oral-adapted scGPT query workflow for generating cell embeddings, predicting hierarchical cell type labels, and estimating candidate unknown-state scores.
The ORACLE source code is distributed through this repository. The cap5000 ORACLE model artifacts required for query embedding and annotation are distributed separately through the ORACLE GitHub Release assets. Large training atlases, raw sequencing data, processed private datasets, and manuscript analysis outputs are not included.
Status: pre-release research software. ORACLE is not a clinical or diagnostic tool, and outputs should be interpreted with appropriate biological validation.
- Query embedding using the oral-adapted scGPT encoder distributed as a release asset.
- Reference-based level 1 to level 4 cell type annotation.
- Candidate unknown-state scoring for cells that may not match the current reference label space.
- tumor and non_tumor label policies for controlling the prediction label space.
- Output embedding key: X_oracle.
Use Python 3.10-3.13. The full query workflow requires the scientific Python and scGPT runtime stack.
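Before installing, the supported interpreter range stated above can be checked from Python. This is a minimal sketch; the helper name `is_supported_python` is hypothetical and is not part of the ORACLE package.

```python
import sys

# Supported range stated above: Python 3.10 through 3.13 (inclusive).
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 13)


def is_supported_python(version=None):
    """Return True if (major, minor) falls inside the supported range.

    Hypothetical helper for illustration; not part of the ORACLE package.
    """
    if version is None:
        version = sys.version_info[:2]
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:2])
    status = "supported" if is_supported_python() else "outside the supported range"
    print(f"Python {current} is {status}")
```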
Clone the repository and install the package:
git clone https://github.com/Teichlab/ORACLE.git
cd ORACLE
python -m pip install -e .
Download the cap5000 model artifacts from the ORACLE GitHub Release:
oracle download-model
By default this installs the model into:
~/.cache/oracle/models/cap5000
If you manually download the release asset, extract it to a local directory and either set:
export ORACLE_CAP5000_DIR=/path/to/cap5000
or pass --model-dir /path/to/cap5000/encoder --artifact /path/to/cap5000/classifier to oracle run-query.
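To see which model directory a script would end up using, the lookup described above can be sketched in Python. This is an illustration under stated assumptions, not ORACLE's own resolution code: the helper `resolve_cap5000_paths` is hypothetical, the precedence of `ORACLE_CAP5000_DIR` over the default cache location is assumed, and the `encoder` and `classifier` subdirectory names are taken from the example paths above.

```python
import os
from pathlib import Path

# Default install location stated above for `oracle download-model`.
DEFAULT_MODEL_ROOT = Path.home() / ".cache" / "oracle" / "models" / "cap5000"


def resolve_cap5000_paths(environ=None):
    """Return (root, encoder_dir, classifier_dir) for the cap5000 artifacts.

    Hypothetical helper. Assumes ORACLE_CAP5000_DIR, when set, takes
    precedence over the default cache directory, and that the artifacts
    live in `encoder` and `classifier` subdirectories.
    """
    if environ is None:
        environ = os.environ
    override = environ.get("ORACLE_CAP5000_DIR")
    root = Path(override).expanduser() if override else DEFAULT_MODEL_ROOT
    return root, root / "encoder", root / "classifier"


if __name__ == "__main__":
    root, encoder_dir, classifier_dir = resolve_cap5000_paths()
    for label, path in [("root", root), ("encoder", encoder_dir), ("classifier", classifier_dir)]:
        # Report whether each expected directory is present on disk.
        print(f"{label}: {path} (exists: {path.is_dir()})")
```

The two subdirectory paths correspond to the values that would be passed to --model-dir and --artifact when calling oracle run-query explicitly.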
Development test dependencies can be installed with:
python -m pip install -e ".[dev]"
pytest
Check the installed version from Python:
import oracle
print(oracle.__version__)
Inspect the installed model:
oracle --version
oracle info
Run ORACLE on a query AnnData file:
oracle run-query \
--h5ad query.h5ad \
--outdir oracle_query_output \
--label-policy tumor
Use --label-policy non_tumor to mask tumor-source epithelial labels during prediction while retaining immune, stromal, endothelial, neural, mural, and muscle labels. Both policies use the same query embedding workflow and produce the same X_oracle embedding for a given input.
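One way to picture a label policy is as a mask applied to the label space before the top prediction is chosen. The sketch below is illustrative only: the label names, the `TUMOR_SOURCE_LABELS` set, and the `apply_label_policy` helper are all hypothetical, and ORACLE's actual masking logic is not described here.

```python
# Hypothetical label names for illustration; the real ORACLE label set may differ.
TUMOR_SOURCE_LABELS = {"tumor_epithelial"}


def apply_label_policy(scores, policy):
    """Pick the top-scoring label for one cell after applying a label policy.

    scores: mapping of label name -> classifier score.
    policy: "tumor" keeps every label; "non_tumor" drops tumor-source labels.
    """
    if policy == "tumor":
        allowed = dict(scores)
    elif policy == "non_tumor":
        allowed = {
            label: score
            for label, score in scores.items()
            if label not in TUMOR_SOURCE_LABELS
        }
    else:
        raise ValueError(f"unknown label policy: {policy!r}")
    if not allowed:
        raise ValueError("no labels remain after masking")
    # The highest remaining score determines the predicted label.
    return max(allowed, key=allowed.get)


cell_scores = {"tumor_epithelial": 0.6, "fibroblast": 0.3, "t_cell": 0.1}
print(apply_label_policy(cell_scores, "tumor"))      # tumor_epithelial
print(apply_label_policy(cell_scores, "non_tumor"))  # fibroblast
```

In this toy example only the candidate labels change between policies; the scores themselves are untouched, which mirrors the point above that the embedding is identical under both policies.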
Typical outputs include:
- query_oracle_embedded.h5ad: query object with adata.obsm["X_oracle"].
- query_oracle_annotated.h5ad: query object with ORACLE embedding, predicted labels, and unknown scores.
- query_oracle_embedded.stats.json: lightweight embedding summary.
Predicted annotation columns follow the current ORACLE schema:
- oral_scgpt_pred_level1
- oral_scgpt_pred_level2
- oral_scgpt_pred_level3
- oral_scgpt_pred_level4
- oral_scgpt_unknown_score_level1
- oral_scgpt_unknown_score_level2
- oral_scgpt_unknown_score_level3
- oral_scgpt_unknown_score_level4
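Because the column names follow a regular pattern across levels 1 to 4, a downstream script can generate them and check that an annotated object contains all of them. The helpers below are a hypothetical sketch and are not part of the ORACLE package.

```python
# Annotation levels covered by the current ORACLE schema.
LEVELS = (1, 2, 3, 4)


def expected_annotation_columns():
    """Build the eight column names listed above, predictions first."""
    predicted = [f"oral_scgpt_pred_level{level}" for level in LEVELS]
    unknown = [f"oral_scgpt_unknown_score_level{level}" for level in LEVELS]
    return predicted + unknown


def missing_annotation_columns(columns):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in expected_annotation_columns() if name not in present]


# Example with a partial set of columns: only the level 1 prediction is present.
example_columns = ["cell_id", "oral_scgpt_pred_level1"]
print(missing_annotation_columns(example_columns))
```

With an annotated file loaded through `anndata.read_h5ad`, passing `adata.obs.columns` to `missing_annotation_columns` should return an empty list if the output follows this schema.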
- oracle/: importable Python package.
- oracle/resources/: lightweight package resource namespace. Large cap5000 model artifacts are distributed through GitHub Release assets rather than committed directly to the repository.
- tests/: minimal smoke tests that do not require private datasets.
The repository intentionally excludes raw data, private h5ad files, large reference atlases, manuscript analysis folders, generated figures, and large model files. The cap5000 model checkpoint and classifier artifacts are distributed as versioned GitHub Release assets because they are required for ORACLE query embedding and annotation.
Before reusing ORACLE outputs in a publication or downstream biological analysis, users should confirm that the input data are processed appropriately and that predicted rare or unknown states are supported by independent marker, sample, or experimental evidence.
ORACLE is distributed under the Apache License 2.0. Unless otherwise noted, this license applies to the source code and ORACLE cap5000 model artifacts distributed with the corresponding GitHub Release.
Citation information will be added when the associated manuscript or formal software release is available.
Weimin Lin