Merged
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -1,9 +1,11 @@
# Changelog

## Version 0.10.0
## Version 0.10.0 - 0.10.1

- Added methods to write to RDS/RData files.
- Supports atomic types, generic dictionaries/lists, and **BiocPy objects**.
- Read `symbols` registered in RDS objects.
- Fixed an issue with S4 classes not properly saved as RDS files.

## Version 0.9.0 - 0.9.1

142 changes: 91 additions & 51 deletions README.md
@@ -4,101 +4,141 @@

# rds2py

Parse and save Python objects as **RDS or RData** files. `rds2py` supports various base classes from R, and Bioconductor's `SummarizedExperiment` and `SingleCellExperiment` S4 classes. **_For more details, check out [rds2cpp library](https://github.com/LTLA/rds2cpp)._**
`rds2py` allows you to read and write R's native **RDS** and **RData** files directly in Python. Beyond standard R types, it provides integration with the [BiocPy](https://github.com/biocpy) ecosystem, allowing you to easily roundtrip complex S4 data structures like `SummarizedExperiment`, `SingleCellExperiment`, and `GenomicRanges`. **_For more details, check out [rds2cpp library](https://github.com/LTLA/rds2cpp)._**

## Installation

The package is published to [PyPI](https://pypi.org/project/rds2py/).

```shell
pip install rds2py
```

To enable automatic conversion to Bioconductor/BiocPy classes, make sure to install the optional dependencies:

# or install optional dependencies
```shell
pip install rds2py[optional]
```

By default, the package does not install the dependencies needed to convert Python representations to BiocPy classes. Please consider installing all optional dependencies.

## Usage
## Quickstart

> [!NOTE]
>
> If you do not have an RDS object handy, feel free to download one from [single-cell-test-files](https://github.com/jkanche/random-test-files/releases).
### 1. Reading RDS and RData files

Reading an RDS or RData file is as simple as a single function call. `rds2py` automatically detects and maps known R/Bioconductor classes to their Python equivalents:

```python
from rds2py import read_rds, read_rda
r_obj = read_rds("path/to/file.rds") # or read_rda("path/to/file.rda")

# Read an RDS file (returns a Python/BiocPy object or dict)
data = read_rds("path/to/file.rds")

# Read objects from an RData workspace file (returns a dictionary of objects)
workspace = read_rda("path/to/workspace.rda")
```

The returned `r_obj` is either an instance of the appropriate Python class, if a parser is already implemented, or a dictionary containing the data from the RDS file.
If `rds2py` encounters an S4 class or complex R structure it doesn't have a parser registered for, it falls back to returning a dictionary so you don't lose any data.
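
Because the return value can be either a parsed object or the fallback dictionary, calling code may want to branch on which one it received. The following is a minimal sketch, assuming the fallback is a plain `dict` that carries a `class_name` key; `summarize_result` is a hypothetical helper and not part of `rds2py`:

```python
def summarize_result(obj):
    """Describe what a reader returned (hypothetical helper, not an rds2py API)."""
    if isinstance(obj, dict):
        # Assumption: an unparsed R object arrives as a dict with a "class_name" key.
        class_name = obj.get("class_name", "unknown")
        return f"unparsed dictionary (R class: {class_name})"
    return f"parsed as {type(obj).__name__}"


# Stand-ins for values a reader might return
fallback = {"type": "S4", "class_name": "SomeUnregisteredClass", "attributes": {}}
parsed = [1, 2, 3]

print(summarize_result(fallback))
print(summarize_result(parsed))
```

A `dict` can also be the legitimate result for a plain R list, so treat this check as a heuristic rather than a definitive test.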

### Save RDS/RData files
### 2. Saving to RDS and RData files

You can also construct RDS or RData files from Python objects. `rds2py` supports writing atomic types, generic dictionaries/lists, and **BiocPy objects**.
You can serialize Python objects back to RDS or RData formats. This includes NumPy arrays, SciPy sparse matrices, standard dictionaries/lists, and BiocPy objects:

```python
from rds2py import write_rds, write_rda
import numpy as np

# Write atomic types
write_rds(np.array([1, 2, 3], dtype=np.int32), "path/to/file.rds")

# Write complex objects
from rds2py import write_rds, write_rda
from genomicranges import GenomicRanges
from iranges import IRanges

gr = GenomicRanges(
    seqnames=["chr1", "chr2"],
    ranges=IRanges(start=[1, 2], width=[10, 20]),
    strand=["+", "-"]
)
write_rds(gr, "path/to/granges.rds")
# 1. Write an atomic NumPy array
write_rds(np.array([10, 20, 30], dtype=np.int32), "array.rds")

# 2. Write a complex Bioconductor GenomicRanges object
gr = GenomicRanges(seqnames=["chr1", "chr2"], ranges=IRanges(start=[1, 100], width=[10, 50]), strand=["+", "-"])
write_rds(gr, "genomic_ranges.rds")

# 3. Write multiple Python objects into a single RData workspace
objects = {"my_array": np.array([1.1, 2.2, 3.3]), "my_granges": gr}
write_rda(objects, "workspace.rda")
```

### Write-your-own-reader
### 3. Custom Extensions

Reading RDS or RData files as dictionary representations allows users to write their own custom readers into appropriate Python representations.
If you have custom S4 representations or class mapping needs, you can parse the raw RDS structure into Python dictionary representations using `parse_rds`/`parse_rda` and apply your custom deserializers:

```python
from rds2py import parse_rds, parse_rda
from rds2py import parse_rds
from rds2py.read_granges import read_genomic_ranges

# 1. Parse into a raw dictionary representation of the RDS tree
raw_dict = parse_rds("path/to/file.rds")
print(raw_dict.keys()) # ['type', 'class_name', 'attributes', 'data', ...]

robject = parse_rds("path/to/file.rds") # or use parse_rda for rdata files
print(robject)
# 2. Build or invoke custom parser logic
if raw_dict.get("class_name") == "GRanges":
    gr = read_genomic_ranges(raw_dict)
    print(gr)
```

If you know this RDS file contains a `GenomicRanges` object, you can use the built-in reader or write your own reader to convert this dictionary.
For writing custom objects, you can register your classes to `rds2py`'s serialization registry using the `save_rds` singledispatch generic:

```python
from rds2py.read_granges import read_genomic_ranges
from rds2py.generics import save_rds


gr = read_genomic_ranges(robject)
print(gr)
class MyCustomClass:
    def __init__(self, value):
        self.value = value


@save_rds.register(MyCustomClass)
def _serialize_custom(x: MyCustomClass, path=None):
    # Construct the raw RDS dictionary representation expected by rds2cpp
    converted = {
        "type": "integer",
        "data": [x.value],
        "attributes": {"class": {"type": "string", "data": ["MyCustomRClass"]}},
    }

    # Optionally save if path is provided, otherwise return representation
    if path is not None:
        from rds2py.lib_rds_parser import write_rds as write_rds_native

        write_rds_native(converted, path)
    return converted
```


## Type Conversion Reference

| R Type | Python/NumPy Type |
| ---------- | ------------------------------------ |
| numeric | numpy.ndarray (float64) |
| integer | numpy.ndarray (int32) |
| character | list of str |
| logical | numpy.ndarray (bool) |
| factor | list |
| data.frame | BiocFrame |
| matrix | numpy.ndarray or scipy.sparse matrix |
| dgCMatrix | scipy.sparse.csc_matrix |
| dgRMatrix | scipy.sparse.csr_matrix |

`rds2py` also integrates with the BiocPy ecosystem for these Bioconductor classes:
- SummarizedExperiment
- RangedSummarizedExperiment
- SingleCellExperiment
- GenomicRanges
- MultiAssayExperiment
The table below describes how core R types are mapped to Python/NumPy/SciPy counterparts:

| R Type / Class | Python / NumPy / SciPy Counterpart |
| :--- | :--- |
| **numeric** | `numpy.ndarray` (`float64`) |
| **integer** | `numpy.ndarray` (`int32`) |
| **logical** | `numpy.ndarray` (`bool`) |
| **character** | `list` of `str` |
| **factor** | `list` / representation levels |
| **matrix (dense)** | `numpy.ndarray` |
| **dgCMatrix** (Column-sparse) | `scipy.sparse.csc_matrix` |
| **dgRMatrix** (Row-sparse) | `scipy.sparse.csr_matrix` |
| **data.frame** / **DFrame** | `biocframe.BiocFrame` |

### Supported Bioconductor Classes
When `rds2py[optional]` is installed, the package fully translates R/S4 classes to their BiocPy equivalents:
- **GenomicRanges** / **GRanges** <-> `genomicranges.GenomicRanges`
- **GenomicRangesList** / **GRangesList** <-> `genomicranges.CompressedGenomicRangesList`
- **SummarizedExperiment** <-> `summarizedexperiment.SummarizedExperiment`
- **RangedSummarizedExperiment** <-> `summarizedexperiment.RangedSummarizedExperiment`
- **SingleCellExperiment** <-> `singlecellexperiment.SingleCellExperiment`
- **MultiAssayExperiment** <-> `multiassayexperiment.MultiAssayExperiment`

---

## Developer Notes

This project uses pybind11 to provide bindings to the rds2cpp library. Please make sure a C++ compiler is installed on your system.
- `rds2py` uses `pybind11` to bind the core C++ `rds2cpp` library. Compiling from source requires a compatible C++ compiler.
- Tests can be run via `tox` or directly using `pytest`.

<!-- pyscaffold-notes -->

138 changes: 138 additions & 0 deletions docs/custom_serialization.md
@@ -0,0 +1,138 @@
# Custom Serialization and Deserialization Guide

This guide shows you how to extend `rds2py` to support custom Python classes. By implementing custom readers and writers, you can serialize your custom Python representations directly into native R RDS/RData structures, and read them back seamlessly.

`rds2py` achieves this two-way extensibility using:
1. Python's `functools.singledispatch` mechanism for writing/serialization (`save_rds`).
2. A global class mapping registry for reading/deserialization (`read_rds`).
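
Both mechanisms can be illustrated independently of `rds2py`. The sketch below is a toy model under stated assumptions: `to_repr`, `READERS`, and `from_repr` are hypothetical names chosen for illustration, and the real package's internals may differ.

```python
from functools import singledispatch


# Writing side: one generic function, with one implementation per Python type.
@singledispatch
def to_repr(x):
    raise NotImplementedError(f"no serializer registered for {type(x).__name__}")


@to_repr.register(int)
def _(x):
    return {"type": "integer", "data": [x]}


@to_repr.register(str)
def _(x):
    return {"type": "string", "data": [x]}


# Reading side: a plain dict that maps an R class name to a reader function.
READERS = {}


def from_repr(rep):
    reader = READERS.get(rep.get("class_name"))
    # Fall back to the raw dictionary when no reader is registered.
    return reader(rep) if reader is not None else rep


READERS["Wrapped"] = lambda rep: rep["data"][0]
```

The asymmetry is deliberate in this toy model: dispatch on write is keyed by the Python type of the object, while dispatch on read is keyed by a class name stored in the file, since no Python type exists yet at that point.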

---

## 1. Custom Serialization (Python -> RDS)

To serialize a custom Python class, you register it with the `save_rds` generic dispatcher. Your custom function needs to take your object and convert it into a structured dictionary that matches R's internal representation format.

### The Structured RDS Representation Format
R objects are represented in Python as nested dictionaries containing the following keys:
- `"type"`: The R type descriptor (e.g., `"S4"`, `"vector"`, `"integer"`, `"double"`, `"string"`, `"logical"`, or `"null"`).
- `"class_name"`: The target R class name (e.g., `"MyCustomRClass"`).
- `"package_name"`: *(Optional, for S4 classes)* The name of the R package where the class is defined.
- `"attributes"`: A dictionary representing R attributes or S4 slots. Each slot value must also be a structured representation dictionary.
- `"data"`: The flat list or array of values for vector/atomic types.
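
To make the nesting concrete, here is a small sketch of one such dictionary together with a recursive sanity check. It only checks the keys listed above; `check_representation` is a hypothetical helper, and the exact layout that the native writer accepts has not been verified here.

```python
def check_representation(rep, path="root"):
    """Recursively collect structural problems in a nested representation."""
    if not isinstance(rep, dict):
        return [f"{path}: expected a dict, got {type(rep).__name__}"]
    problems = []
    if "type" not in rep:
        problems.append(f"{path}: missing 'type'")
    # Every attribute or slot value must itself be a representation dictionary.
    for name, child in rep.get("attributes", {}).items():
        problems.extend(check_representation(child, f"{path}.{name}"))
    return problems


# An S4 object with two slots, each holding an atomic vector
example = {
    "type": "S4",
    "class_name": "MyCustomRClass",
    "package_name": "MyRPackage",
    "attributes": {
        "featureName": {"type": "string", "data": ["expression_level"]},
        "featureValues": {"type": "integer", "data": [10, 20, 30]},
    },
}

print(check_representation(example))
```

Running a check like this before handing the dictionary to a writer can surface a slot that was left as a bare Python value instead of a nested representation.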

### Example: Implementing a Custom Serializer

Let's say we have a custom Python class named `MyFeature`:

```python
class MyFeature:
    def __init__(self, name: str, values: list):
        self.name = name
        self.values = values
```

To serialize `MyFeature` as a native R S4 class called `"MyCustomRClass"` from package `"MyRPackage"`, we register it using `@save_rds.register`:

```python
from typing import Optional
from rds2py import save_rds


@save_rds.register(MyFeature)
def _save_rds_myfeature(x: MyFeature, path: Optional[str] = None):
    # Native C++ writer call
    from rds2py.lib_rds_parser import write_rds as write_rds_native

    # 1. Structure the Python object into the expected R dictionary format
    converted = {
        "type": "S4",
        "class_name": "MyCustomRClass",
        "package_name": "MyRPackage",
        "attributes": {
            # Recursively call save_rds to serialize internal elements
            "featureName": save_rds(x.name),
            "featureValues": save_rds(x.values),
        },
    }

    # 2. If a save path is specified, write directly using the native writer
    if path is not None:
        write_rds_native(converted, path)

    return converted
```

---

## 2. Custom Deserialization (RDS -> Python)

To read custom S4 objects back into Python classes via `read_rds`, you need to:
1. Write a deserialization function that constructs your Python class from the raw parsed dictionary.
2. Register your deserializer function in `rds2py`'s global class mapping registry.

### Example: Implementing the Reader

```python
from rds2py.generics import _dispatcher
from rds2py.rdsutils import get_class


def read_my_custom_class(robject: dict, **kwargs) -> MyFeature:
    # 1. Verify the incoming R class name
    cls_name = get_class(robject)
    if cls_name != "MyCustomRClass":
        raise ValueError(f"Expected class 'MyCustomRClass', but received '{cls_name}'")

    # 2. Extract and parse the slots recursively
    # We call the internal _dispatcher helper to parse child structures
    feature_name = _dispatcher(robject["attributes"]["featureName"], **kwargs)
    feature_values = _dispatcher(robject["attributes"]["featureValues"], **kwargs)

    # 3. Instantiate and return your custom Python class
    return MyFeature(name=feature_name, values=list(feature_values))
```

### Registering the Reader
Map your class name to the reader function in the global class registry (`REGISTRY` from `rds2py.generics`):

```python
from rds2py.generics import REGISTRY

# Register our custom deserializer in the global map
REGISTRY["MyCustomRClass"] = read_my_custom_class
```

---

## 3. Full Roundtrip

Here is how custom serialization and deserialization fit together in a full roundtrip:

```python
import tempfile
import os
from rds2py import write_rds, read_rds

# 1. Create a custom instance
feature = MyFeature(name="expression_level", values=[10, 20, 30])

# 2. Serialize to a temporary RDS file
with tempfile.NamedTemporaryFile(suffix=".rds", delete=False) as tmp:
    path = tmp.name

try:
    # Write custom class to RDS format
    write_rds(feature, path)

    # Read the RDS file back into Python
    recreated = read_rds(path)

    # 3. Verify that the roundtrip correctly recreated the custom class
    assert isinstance(recreated, MyFeature)
    assert recreated.name == "expression_level"
    assert recreated.values == [10, 20, 30]
    print("Roundtrip validation successful!")
finally:
    if os.path.exists(path):
        os.unlink(path)
```
19 changes: 13 additions & 6 deletions docs/index.md
@@ -1,24 +1,31 @@
# rds2py
# rds2py: R Serialization Formats in Python

Parse, extract and create Python representations for datasets stored in RDS files. It supports Bioconductor's `SummarizedExperiment` and `SingleCellExperiment` objects. This is possible because of [Aaron's rds2cpp library](https://github.com/LTLA/rds2cpp).
`rds2py` is designed to parse, extract, and write R data formats (RDS and RData) directly in Python. It provides native, out-of-the-box integration with the [BiocPy](https://github.com/biocpy) ecosystem, allowing seamless roundtripping of complex S4 datasets like `SummarizedExperiment`, `SingleCellExperiment`, and `GenomicRanges`.

The package uses memory views (except for strings) so that we can access the same memory from C++ space in Python (through Cython of course). This is especially useful for large datasets so we don't make copies of data.
This library is built on top of [Aaron Lun's rds2cpp library](https://github.com/LTLA/rds2cpp).

## Install
## Installation

The package is published to [PyPI](https://pypi.org/project/rds2py/).
`rds2py` is available on [PyPI](https://pypi.org/project/rds2py/):

```shell
pip install rds2py
```

## Contents
To enable full conversion support for Bioconductor/BiocPy classes, consider installing the optional dependencies:

```shell
pip install rds2py[optional]
```

## Table of Contents

```{toctree}
:maxdepth: 2

Overview <readme>
Tutorial <tutorial>
Custom Serialization Guide <custom_serialization>
Contributions & Help <contributing>
License <license>
Authors <authors>