paged-attention

Here are 16 public repositories matching this topic...

xlite-dev / Awesome-LLM-Inference

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

mla vllm llm-inference awesome-llm flash-attention tensorrt-llm paged-attention deepseek flash-attention-3 deepseek-v3 minimax-01 deepseek-r1 flash-mla qwen3

Updated Apr 20, 2026
Python

lumia431 / photon_infer

A High-Performance LLM Inference Engine with vLLM-Style Continuous Batching

modern-cpp inference-engine ai-infra vllm llm-inference paged-attention continuous-batching

Updated Jan 2, 2026
C++

VARUN3WARE / Paged-Attention

Implementation of PagedAttention from the vLLM paper, an attention algorithm that treats the KV cache like virtual memory. Eliminates memory fragmentation, increases batch sizes, and dramatically improves LLM serving throughput.

memory-optimization kv-cache llm-inference paged-attention transformer-optimization

Updated Dec 3, 2025
Python

gyunggyung / Agent.cpp

High-performance On-Device MoA (Mixture of Agents) Engine in C++. Optimized for CPU inference with RadixCache & PagedAttention. (Tiny-MoA Native)

c cpp moa on-device-ai llm llamacpp llama-cpp ggml paged-attention cpu-optimization mixture-of-agents radix-attention

Updated Jan 25, 2026
C++

developertogo / velo-core

A production-grade, native Rust speculative inference engine for Apple Silicon with Metal GPU acceleration and paged attention.

metal gpu-acceleration systems-programming apple-silicon openai-api tensor-parallelism llm-inference speculative-decoding paged-attention continuous-batching prefix-caching disaggregated-serving

Updated May 9, 2026
Rust

nshkrdotcom / vllm

vLLM - High-throughput, memory-efficient LLM inference engine with PagedAttention, continuous batching, CUDA/HIP optimization, quantization (GPTQ/AWQ/INT4/INT8/FP8), tensor/pipeline parallelism, OpenAI-compatible API, multi-GPU/TPU/Neuron support, prefix caching, and multi-LoRA capabilities

Updated Apr 23, 2026
Elixir

achi9629 / llm-inference-engine

A from-scratch LLM inference engine built in PyTorch with custom GPT-2 transformers, KV cache, paged KV cache, continuous batching, and A100 benchmarks

nlp deep-learning transformers autoregressive inference-engine model-serving fastapi gpt2 kv-cache llm llm-serving llm-inference paged-attention mistral-7b continuous-batching paged-kv-cache

Updated May 8, 2026
Python

AICL-Lab / hetero-paged-infer

High-Performance LLM Inference Engine with PagedAttention & Continuous Batching in Rust

rust machine-learning high-performance inference transformer gpu-computing production-ready systems-programming inference-engine serving kv-cache llm vllm llm-inference paged-attention continuous-batching

Updated May 25, 2026
Rust

yogik84 / Tiny-MoA

🤖 Enhance task management with Tiny MoA, a GPU-free multi-agent system that plans, reasons, and collaborates efficiently in real time.

c lightweight cpp falcon agents moa uv on-device-ai llm llamacpp llama-cpp ggml paged-attention cpu-optimization tool-calling lfm2 radix-attention

Updated May 26, 2026
Python

jrw96 / kv-cache-sim

Discrete-event simulator for LLM inference serving — PagedAttention memory management and continuous batching

simulation inference kv-cache llm paged-attention

Updated Apr 3, 2026
Python

giulio98 / langchain-pced

LangChain integration for Parallel Context-of-Experts Decoding (PCED)

transformers rag langchain paged-attention constrastive-decoding pced

Updated Feb 13, 2026
Python

MrAMS / llaisys

AI Infra: Qwen2 inference implemented with hand-written operators, with Paged Attention support

ai-infra paged-attention qwen2

Updated Sep 28, 2025
C++

StanByriukov02 / hwatom-kv-shim

Intra-2MiB CUDA leaf packing (cuMem). GATE12 iron: workload_id=t1_leaf_physics_v1 + legacy t1-eval-20260522. 42% VRAM liberation @70% budget. Evaluation-Only.

cuda inference memory-allocator gpu-memory nvidia-gpu kv-cache vllm llm-inference h100 paged-attention

Updated May 25, 2026
C

sridivya9398 / InferenceNexus

A high-performance visual exploration platform for understanding LLM Inference, vLLM optimization, RAG architectures, and GPU warm startup concepts

react typescript inference gpu-optimization rag vllm genai paged-attention warm-startup

Updated May 14, 2026
TypeScript

iFurySt / nanoLLMServe

🌱 A tiny, readable LLM serving engine with vLLM/SGLang-style features.

Updated May 22, 2026
Python

DONGRYEOLLEE1 / attn-kv-bench

A compact benchmark lab for measuring TTFT, throughput, and KV-memory gains from GQA, KV cache, and paged KV management.

benchmark transformer mps mla gqa kv-cache paged-attention

Updated Apr 16, 2026
Python