doclingr

Document intelligence for R — turn messy PDFs, Office files and HTML into AI-ready, structured data.

Project Status: Active | License: MIT | Backend: Python via reticulate | Powered by Docling

doclingr is an R interface to Docling, an open-source document-understanding library. It brings layout-aware PDF/DOCX/PPTX/HTML parsing, table extraction, OCR and RAG-ready chunking to R, exposing them through a small, tidy-friendly API built on reticulate.

R already has pdftools, tabulizer, officer, readtext and friends, but no single "document intelligence for RAG" package. doclingr aims to fill that gap: take a document, understand its layout, extract its tables, preserve its structure, chunk it, and hand it back ready for search and embeddings.

How it works

doclingr pipeline: document -> convert -> extract -> chunk -> embed

Installation

# install.packages("pak")
pak::pak("StrategicProjects/doclingr")

doclingr talks to the Docling Python package via reticulate. Install the backend once:

library(doclingr)
install_docling()      # creates an "r-docling" Python environment
# restart R
docling_available()    # TRUE
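Scripts that run non-interactively may want to check the backend before converting anything. The following is a minimal sketch that relies only on the `docling_available()` function shown above; `backend_status()` and its return strings are hypothetical names used for illustration, not part of doclingr.

```r
# Hypothetical helper (not part of doclingr): report whether the package
# and its Python backend look usable, without raising an error.
backend_status <- function() {
  if (!requireNamespace("doclingr", quietly = TRUE)) {
    return("package-missing")
  }
  # docling_available() is shown above as returning TRUE once the backend
  # is installed; any error is treated here as "not available".
  ok <- tryCatch(
    isTRUE(doclingr::docling_available()),
    error = function(e) FALSE
  )
  if (ok) "ready" else "backend-missing"
}

status <- backend_status()
if (status != "ready") {
  message("doclingr backend not usable: ", status)
}
```

Returning a status string instead of stopping lets the caller decide whether to skip document conversion or abort the whole script.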

Usage

library(doclingr)

doc <- docling_convert("https://arxiv.org/pdf/2408.09869")

# Export
as_markdown(doc)       # layout-aware Markdown
as_json(doc)           # structured DoclingDocument as an R list

# Pages and tables
docling_n_pages(doc)
tables <- docling_tables(doc)   # list of tibbles
tables[[1]]

# Figures -> tibble (captions, pages, optional saved images)
doc <- docling_convert("paper.pdf", images = TRUE)
docling_figures(doc, image_dir = "figures")

# RAG-ready chunks -> tibble
chunks <- docling_chunk(doc, max_tokens = 512)
chunks$text[1]

# Match your embedding model's tokenizer for accurate budgets
chunks <- docling_chunk(doc, tokenizer = "BAAI/bge-small-en-v1.5", max_tokens = 512)
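To make the idea of a token budget concrete, here is a toy sketch in base R. It is not how Docling's chunker works: whitespace-separated words stand in for tokens, and `count_tokens()` and `pack_by_budget()` are hypothetical helpers. It only shows how `max_tokens` caps the size of each chunk.

```r
# Toy illustration only: whitespace-separated words stand in for tokens.
# This is NOT Docling's chunker; it only shows what a token budget does.
count_tokens <- function(x) lengths(strsplit(trimws(x), "\\s+"))

pack_by_budget <- function(sentences, max_tokens) {
  chunks <- character(0)
  current <- character(0)
  used <- 0L
  for (s in sentences) {
    n <- count_tokens(s)
    # Start a new chunk when adding this sentence would exceed the budget.
    # A single sentence longer than the budget still becomes its own chunk.
    if (length(current) > 0 && used + n > max_tokens) {
      chunks <- c(chunks, paste(current, collapse = " "))
      current <- character(0)
      used <- 0L
    }
    current <- c(current, s)
    used <- used + n
  }
  if (length(current) > 0) {
    chunks <- c(chunks, paste(current, collapse = " "))
  }
  chunks
}

sentences <- c(
  "Docling parses the layout.",
  "Tables become data frames.",
  "Chunks feed the embedding model."
)
chunks <- pack_by_budget(sentences, max_tokens = 8)
count_tokens(chunks)
```

A real embedding model counts subword tokens, not words, so the same text can have a different length under each tokenizer. That is why a budget measured with a mismatched tokenizer can overshoot or undershoot the model's actual limit.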

From chunks to embeddings

doclingr stays provider-agnostic: bring any embedding function (an API call, a local model via reticulate, ...) and docling_embed() handles batching and tidy assembly into an embedding list-column.

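embed_openai <- function(txt) {
  # Replace with your call to an embeddings API.
  # Must return a numeric matrix with one row per element of `txt`.
  stop("embed_openai() is a placeholder; supply your own implementation")
}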

doc |>
  docling_chunk(max_tokens = 512) |>
  docling_embed(embed_openai, batch_size = 64)
#> # adds `embedding` (list-column) and `n_dim` columns
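The contract described above (an embedding function that returns a matrix with one row per text) can be tried out without any provider. The sketch below uses only base R. `embed_toy()` and `embed_in_batches()` are hypothetical stand-ins, and the batching logic is an assumption about what such a helper might do, since the internals of `docling_embed()` are not shown here.

```r
# Hypothetical stand-in for a real embedding call: returns a numeric matrix
# with one row per input text, the shape described above.
embed_toy <- function(txt) {
  cbind(
    n_char  = nchar(txt),
    n_words = lengths(strsplit(txt, "\\s+")),
    n_upper = lengths(regmatches(txt, gregexpr("[A-Z]", txt)))
  )
}

# Sketch of batching: call the embedder on slices of at most `batch_size`
# texts, stack the results, then store one vector per text in a list-column.
embed_in_batches <- function(texts, embed_fn, batch_size = 64) {
  batches <- split(texts, ceiling(seq_along(texts) / batch_size))
  mats <- lapply(batches, embed_fn)
  emb <- do.call(rbind, unname(mats))
  stopifnot(nrow(emb) == length(texts))  # contract: one row per text
  data.frame(
    text = texts,
    embedding = I(lapply(seq_len(nrow(emb)), function(i) unname(emb[i, ]))),
    n_dim = ncol(emb),
    stringsAsFactors = FALSE
  )
}

out <- embed_in_batches(
  c("Alpha beta", "gamma", "Delta Epsilon zeta"),
  embed_toy,
  batch_size = 2
)
out$embedding[[1]]
```

Swapping `embed_toy()` for a function that calls a real model leaves the rest unchanged, provided it honours the same one-row-per-text shape.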

Why Docling, why reticulate?

Docling's quality comes from deep-learning models: layout analysis, the TableFormer table-structure model, and OCR. These have no R-native equivalent, so doclingr binds the maintained Python implementation instead of reimplementing it, the same strategy used by tensorflow, keras and spacyr. Upstream improvements therefore reach R without extra porting work, and doclingr can focus on an idiomatic, tidy R surface.

Status

Actively developed and heading to CRAN. The public API is settling but may still change before 1.0. Contributions and issues are welcome at https://github.com/StrategicProjects/doclingr.

License

MIT © doclingr authors
