SCALLOPS (Scalable Library for Optical Pooled Screens) is a Python package for analyzing Optical Pooled Screens (OPS) at scale. It provides tools for processing, analyzing, and interpreting large, high-throughput OPS datasets, and uses Dask for distributed computation.
The full documentation, API reference, and tutorials can be found at: http://scallops.readthedocs.io
scallops/
├── .github/ # CI/CD workflows
├── docs/ # Documentation source (Sphinx)
├── scallops/ # Main Python package source
│ ├── cli/ # CLI entry points
│ ├── core/ # Core processing logic
│ └── utils/ # Utilities
├── wdl/ # WDL pipeline definitions
├── Dockerfile # Docker image definition
├── pyproject.toml # Build metadata
├── requirements.txt # Main dependencies
├── setup.py # Installation script
└── README.md # Project overview
SCALLOPS requires Python 3.11 or newer.
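To confirm that an interpreter meets this requirement before creating an environment, a small check like the one below can help. It is a minimal sketch that uses only the standard library; `python_is_supported` and `MIN_PYTHON` are hypothetical names chosen for illustration and are not part of SCALLOPS.

```python
import sys

# Minimum version taken from the requirement stated above.
MIN_PYTHON = (3, 11)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum.

    Hypothetical helper for illustration; not part of the SCALLOPS API.
    """
    if version_info is None:
        version_info = sys.version_info
    # Compare only major and minor, e.g. (3, 12) >= (3, 11).
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if python_is_supported():
        print(f"Python {found} is supported")
    else:
        print(f"Python {found} is too old; 3.11 or newer is required")
```

Running the file with the interpreter you plan to use prints whether that interpreter is new enough.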
We recommend using uv for high-performance Python environment management. You will need uv installed on your system. Installation instructions can be found here: https://docs.astral.sh/uv/
To set up a virtual environment:
# Create a virtual environment with a specific Python version
uv venv --python 3.12
# Activate the environment
# On macOS/Linux:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate
The easiest way to install the stable version is via pip (or uv pip):
uv pip install scallops
SCALLOPS is available as a containerized image via the GitHub Container Registry (GHCR). This is the best option for ensuring environment consistency.
# Pull the latest image
docker pull ghcr.io/genentech/scallops:latest
# Run the CLI directly
docker run --rm ghcr.io/genentech/scallops:latest scallops --help
If you wish to contribute to the codebase or need the latest unreleased changes:
- Clone the repository:
git clone https://github.com/Genentech/scallops.git
cd scallops
- Install in editable mode:
uv pip install -r requirements.txt -e .
- High-Throughput Data Processing: Designed to manage massive datasets typical of OPS experiments across multiple scales.
- Scalability and Performance: Optimized for both local and cloud-based distributed environments using Dask.
- Modular Workflows: Includes customizable WDL workflows for cloud platforms like Terra or Cromwell.
- Efficient Data Handling: Advanced memory management and lazy evaluation to minimize resource usage.
- Command-Line Interface (CLI): Automates batch processing for seamless pipeline integration.
- Customizable Outputs: Generates versatile data visualizations and summary statistics.
- Notebook Examples: Practical Jupyter notebooks are included to guide users through real-world workflows.
- Rich API: A comprehensive API that allows for the creation of fully customized biological data pipelines.
- Large-Scale Screening: Handling the immense data loads of genome-wide OPS projects.
- Biological Discovery: Identifying and quantifying biological perturbations from high-throughput imaging.
We welcome all forms of contributions, including bug reports, documentation improvements, and feature enhancements.
