ORACLE

ORACLE is research software for oral and craniofacial single-cell analysis. The current release provides an oral-adapted scGPT query workflow for generating cell embeddings, predicting hierarchical cell type labels, and estimating candidate unknown-state scores.

The ORACLE source code is distributed through this repository. The cap5000 ORACLE model artifacts required for query embedding and annotation are distributed separately through the ORACLE GitHub Release assets. Large training atlases, raw sequencing data, processed private datasets, and manuscript analysis outputs are not included.

Status: pre-release research software. ORACLE is not a clinical or diagnostic tool, and outputs should be interpreted with appropriate biological validation.

Features

  • Query embedding using the oral-adapted scGPT encoder distributed as a release asset.
  • Reference-based level 1 to level 4 cell type annotation.
  • Candidate unknown-state scoring for cells that may not match the current reference label space.
  • tumor and non_tumor label policies for controlling the prediction label space.
  • Output embedding key: X_oracle.

Installation

Use Python 3.10-3.13. The full query workflow requires the scientific Python and scGPT runtime stack.

Clone the repository and install the package:

git clone https://github.com/Teichlab/ORACLE.git
cd ORACLE
python -m pip install -e .

Download the cap5000 model artifacts from the ORACLE GitHub Release:

oracle download-model

By default this installs the model into:

~/.cache/oracle/models/cap5000

If you manually download the release asset, extract it to a local directory and either set:

export ORACLE_CAP5000_DIR=/path/to/cap5000

or pass --model-dir /path/to/cap5000/encoder --artifact /path/to/cap5000/classifier to oracle run-query.
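The lookup described above can be sketched in a few lines of Python. This is an illustration only, not ORACLE's own resolution logic: the helper name `resolve_cap5000_dir` is hypothetical, the assumption that the environment variable takes precedence over the default cache path is ours, and the `encoder` and `classifier` subdirectory names are inferred from the example paths above.

```python
import os
from pathlib import Path

# Default install location stated in this README.
DEFAULT_MODEL_DIR = Path.home() / ".cache" / "oracle" / "models" / "cap5000"


def resolve_cap5000_dir(env=None):
    """Hypothetical helper: pick the cap5000 directory to inspect.

    Assumption: ORACLE_CAP5000_DIR, when set, overrides the default path.
    """
    env = os.environ if env is None else env
    override = env.get("ORACLE_CAP5000_DIR")
    if override:
        return Path(override).expanduser()
    return DEFAULT_MODEL_DIR


model_dir = resolve_cap5000_dir()
# Subdirectory names are inferred from the --model-dir / --artifact examples.
for sub in ("encoder", "classifier"):
    status = "found" if (model_dir / sub).is_dir() else "missing"
    print(f"{model_dir / sub}: {status}")
```

Running this before `oracle run-query` is a quick way to confirm that a manual download was extracted where you expect.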

Install the development test dependencies and run the test suite with:

python -m pip install -e ".[dev]"
pytest

Quick Start

Check that the package imports and print its version:

import oracle

print(oracle.__version__)

Inspect the installed model:

oracle --version
oracle info

Run ORACLE on a query AnnData file:

oracle run-query \
  --h5ad query.h5ad \
  --outdir oracle_query_output \
  --label-policy tumor

Use --label-policy non_tumor to mask tumor-source epithelial labels during prediction while retaining immune, stromal, endothelial, neural, mural, and muscle labels. Both policies use the same query embedding workflow and produce the same X_oracle embedding for a given input.
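To run both label policies on the same input from a script, the command line above can be assembled in Python. This is a minimal sketch: `build_run_query_command` is a hypothetical helper, the per-policy output directory names are our choice, and only the flags shown in this README are used.

```python
import shlex

LABEL_POLICIES = ("tumor", "non_tumor")


def build_run_query_command(h5ad, outdir, label_policy="tumor"):
    """Hypothetical helper: build the `oracle run-query` argument list."""
    if label_policy not in LABEL_POLICIES:
        raise ValueError(f"label_policy must be one of {LABEL_POLICIES}")
    return [
        "oracle", "run-query",
        "--h5ad", str(h5ad),
        "--outdir", str(outdir),
        "--label-policy", label_policy,
    ]


for policy in LABEL_POLICIES:
    # Separate output directories keep the two runs from overwriting each other.
    cmd = build_run_query_command("query.h5ad", f"oracle_query_output_{policy}", policy)
    print(shlex.join(cmd))
```

The sketch only prints the commands. To execute them, pass each list to `subprocess.run(cmd, check=True)` in an environment where the `oracle` command is installed.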

Typical outputs include:

  • query_oracle_embedded.h5ad: query object with adata.obsm["X_oracle"].
  • query_oracle_annotated.h5ad: query object with ORACLE embedding, predicted labels, and unknown scores.
  • query_oracle_embedded.stats.json: lightweight embedding summary.
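A short check like the following can confirm that a run wrote the files listed above. It is a sketch under the assumption that all three files land directly in the `--outdir` directory; `missing_outputs` is a hypothetical helper, and because the list is described as typical, a missing file is not necessarily an error.

```python
from pathlib import Path

# File names taken from the list of typical outputs in this README.
TYPICAL_OUTPUTS = (
    "query_oracle_embedded.h5ad",
    "query_oracle_annotated.h5ad",
    "query_oracle_embedded.stats.json",
)


def missing_outputs(outdir):
    """Hypothetical helper: return typical output files not found in outdir."""
    outdir = Path(outdir)
    return [name for name in TYPICAL_OUTPUTS if not (outdir / name).is_file()]


missing = missing_outputs("oracle_query_output")
print("missing:", missing if missing else "none")
```

Once the files are present, the embedded object can be opened with `anndata.read_h5ad("oracle_query_output/query_oracle_embedded.h5ad")`, and the embedding is then available as `adata.obsm["X_oracle"]`.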

Predicted annotation columns follow the current ORACLE schema:

  • oral_scgpt_pred_level1
  • oral_scgpt_pred_level2
  • oral_scgpt_pred_level3
  • oral_scgpt_pred_level4
  • oral_scgpt_unknown_score_level1
  • oral_scgpt_unknown_score_level2
  • oral_scgpt_unknown_score_level3
  • oral_scgpt_unknown_score_level4
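The column names above follow a regular pattern, so they can be generated rather than typed out. The sketch below also shows one possible way to flag candidate unknown cells. It rests on assumptions that this README does not state: that a higher unknown score means a cell is less likely to match the reference label space, and that 0.5 is a sensible cutoff. The helper names, the cutoff, and the toy rows are all hypothetical.

```python
LEVELS = (1, 2, 3, 4)


def prediction_column(level):
    """Column holding the predicted label at the given level."""
    return f"oral_scgpt_pred_level{level}"


def unknown_score_column(level):
    """Column holding the unknown-state score at the given level."""
    return f"oral_scgpt_unknown_score_level{level}"


def flag_candidate_unknown(rows, level, threshold):
    """Hypothetical helper: True where the unknown score reaches the threshold.

    Assumption: higher scores indicate a poorer match to the reference.
    `rows` is an iterable of dicts, e.g. adata.obs.to_dict("records").
    """
    column = unknown_score_column(level)
    return [float(row[column]) >= threshold for row in rows]


# Made-up rows for illustration only; these are not real ORACLE outputs.
toy_rows = [
    {prediction_column(1): "label_A", unknown_score_column(1): 0.1},
    {prediction_column(1): "label_B", unknown_score_column(1): 0.9},
]
print([prediction_column(level) for level in LEVELS])
print(flag_candidate_unknown(toy_rows, level=1, threshold=0.5))
```

Any cutoff should be chosen per dataset, and flagged cells should be checked against independent marker or sample evidence before they are interpreted as novel states.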

Repository Contents

  • oracle/: importable Python package.
  • oracle/resources/: lightweight package resource namespace. Large cap5000 model artifacts are distributed through GitHub Release assets rather than committed directly to the repository.
  • tests/: minimal smoke tests that do not require private datasets.

Data and Model Resource Policy

The repository intentionally excludes raw data, private h5ad files, large reference atlases, manuscript analysis folders, generated figures, and large model files. The cap5000 model checkpoint and classifier artifacts are distributed as versioned GitHub Release assets because they are required for ORACLE query embedding and annotation.

Before reusing ORACLE outputs in a publication or downstream biological analysis, users should confirm that the input data are processed appropriately and that predicted rare or unknown states are supported by independent marker, sample, or experimental evidence.

License

ORACLE is distributed under the Apache License 2.0. Unless otherwise noted, this license applies to both the source code and the ORACLE cap5000 model artifacts distributed with the corresponding GitHub Release.

Citation

Citation information will be added when the associated manuscript or formal software release is available.

Contact

Weimin Lin

About

Oral Reference Atlas for Cell Learning and Embedding (ORACLE)
