A library for making RepE control vectors
Updated Sep 24, 2025 - Jupyter Notebook
[ICLR 2025] General-purpose activation steering library
A resource repository for representation engineering in large language models
Steering vectors for transformer language models in PyTorch / Hugging Face
[🏆 CHI26 Best Paper] CoBRA: Reproducible control of LLM agent behavior via classic social science experiments
KV Cache Steering for Inducing Reasoning in Small Language Models
Lightweight representation engineering dataflow operations for agent developers.
Activation steering and trait monitoring for HuggingFace transformers
[🔥 ICLR 2026] Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots
Official code for "Activation Steering for Accent Adaptation in Speech Foundation Models" (Interspeech 2026). Parameter-free accent adaptation via mean-shift steering vectors — no weight updates, consistent WER reductions across 8 accents.
CRSM (Continuous Reasoning State Model): An asynchronous "System 2" architecture that implements Hierarchical State Sovereignty within a Mamba backbone. Unlike traditional search wrappers, CRSM uses Forward-Projected Planning and Sparse-Gated Injection to steer latent manifolds in real-time, decoupling strategic reasoning from token generation.
Pre-generation tool-call gating via linear probes on LLM hidden states. F1 ≈ 0.91–0.94 on BFCL v4, 14–22× faster than full generation. Cross-architecture transfer across Llama / Qwen / Phi / Mistral (3B–7B) with ≥96% retention.
Phase-aware LLM activation steering and linear probing. A memory-efficient, practical implementation of Representation Engineering (RepE) for safety research.
Early baby steps towards a long-term vision regarding Mamba-2's state interpretability.
Representation Rerouting for Agentic Safety: Defending LLM Agents against Prompt Injection via Circuit Breakers and Triplet Loss.
Mechanistic interpretability experiments on political control circuits, refusal behavior, concept steering, and late-decoder interactions in open LLMs.
Code, vectors, and figures for the paper 'Emotion and authorization steering both move cheat; trained-probe suppression doesn't undo it: a mechanistic study in Gemma-2-2B'
Functional emotional architecture for LLMs — 42 systems, 1994 tests, 27 psychological theories. Emergent emotions via 7 ANIMA pillars: predictive processing, global workspace, autobiographical memory, ontogenic development, motivational drives, emotional discovery, computational phenomenology.
LatentBiopsy: Geometric Anomaly Detection for LLM Residual Streams.