2026.01 · Inference

KV-Cache Compression with Zero Quality Degradation at 4× Reduction

Grouped quantization with per-head scale factors preserves generation quality while cutting KV-cache memory by 75%.

KV-cache memory dominates inference cost for long-context workloads. Standard INT8 quantization stops at 2× compression relative to FP16, and pushing standard quantization beyond that introduces measurable quality degradation. We developed a grouped quantization scheme that achieves a 4× reduction with no quality loss visible on our evals.

Approach

Keys and values are quantized in groups of 64 tokens, and each attention head gets its own scale and zero-point parameters for every group. A lightweight dequantization kernel is fused with the attention computation, so a full-precision copy of the cache is never materialized in GPU memory.
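To make the scheme concrete, here is a minimal NumPy sketch of one way it could look. It is not our production kernel. Several details are assumptions for illustration: the 4-bit code width (the text above only states the 4× target), the choice to store the zero-point as a floating-point offset equal to the group minimum, and the function names quantize_grouped, dequantize_group, and attend_quantized. The attention loop dequantizes one 64-token group at a time and combines groups with a running softmax, which mimics in spirit how a fused kernel can avoid building the full-precision cache.

```python
import numpy as np

GROUP = 64           # tokens per quantization group (from the text)
BITS = 4             # assumed code width; not stated in the post
QMAX = 2**BITS - 1

def quantize_grouped(x, group=GROUP):
    """Quantize x of shape (heads, tokens, head_dim) per head and per token group.

    Returns uint8 codes plus one scale and one zero offset per (head, group).
    Codes are kept one per byte for clarity; a real kernel would pack them.
    """
    h, t, d = x.shape
    assert t % group == 0, "sketch assumes the token count is a multiple of group"
    g = x.reshape(h, t // group, group * d)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / QMAX
    codes = np.clip(np.round((g - lo) / scale), 0, QMAX).astype(np.uint8)
    return codes.reshape(h, t, d), scale[..., 0], lo[..., 0]

def dequantize_group(codes, scale, zero, head, gi, group=GROUP):
    """Recover one (group, head_dim) tile in float32 for a single head."""
    s = slice(gi * group, (gi + 1) * group)
    return codes[head, s].astype(np.float32) * scale[head, gi] + zero[head, gi]

def attend_quantized(q, k_q, v_q, group=GROUP):
    """Single-query attention over a quantized cache.

    q has shape (heads, head_dim); k_q and v_q are outputs of quantize_grouped.
    Only one group of keys and values is held in full precision at a time.
    """
    (kc, ks, kz), (vc, vs, vz) = k_q, v_q
    h, t, d = kc.shape
    out = np.zeros((h, d), dtype=np.float32)
    for head in range(h):
        m, l, acc = -np.inf, 0.0, np.zeros(d, dtype=np.float32)
        for gi in range(t // group):
            k = dequantize_group(kc, ks, kz, head, gi, group)
            v = dequantize_group(vc, vs, vz, head, gi, group)
            s = k @ q[head] / np.sqrt(d)      # scores for this group
            m_new = max(m, s.max())           # running max for stability
            p = np.exp(s - m_new)
            corr = np.exp(m - m_new)          # rescale earlier partial sums
            l = l * corr + p.sum()
            acc = acc * corr + p @ v
            m = m_new
        out[head] = acc / l
    return out
```

With this asymmetric min/max scheme, the round-trip error of any element is at most half a quantization step of its group, which is why a separate scale per head and group matters: a head with a narrow value range is not forced onto the coarse grid of a head with a wide one.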

Benchmarks

KV-cache memory: −75% (4× compression)
Long-context perplexity (32K tokens): +0.02 vs. FP16 baseline
Needle-in-haystack (128K): 100% retrieval at 4× compression
Attention kernel overhead: +3.2% latency vs. FP16 cache
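
As a rough sanity check on the memory figure, the arithmetic below estimates KV-cache bytes per token for an FP16 cache and for a grouped low-bit cache, including the per-group scale and zero-point. The model shape (32 layers, 8 KV heads, head dimension 128), the 4-bit code width, and the FP16 storage of the two parameters (32 bits per head and group) are hypothetical values chosen for illustration; they do not describe a specific deployed model.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bits, group=None, meta_bits=0):
    """Bytes of KV cache per token, counting both keys and values."""
    per_head_bits = head_dim * bits          # one head's key (or value) vector
    if group:
        per_head_bits += meta_bits / group   # scale + zero-point, amortized per token
    return 2 * layers * kv_heads * per_head_bits / 8

# Hypothetical model shape; all numbers here are assumptions.
fp16 = kv_bytes_per_token(32, 8, 128, bits=16)
q4 = kv_bytes_per_token(32, 8, 128, bits=4, group=64, meta_bits=32)
ratio = fp16 / q4

print(f"FP16: {fp16:.0f} B/token, grouped 4-bit: {q4:.0f} B/token, ratio {ratio:.3f}x")
```

Under these assumptions the per-group parameters add about 0.1% to the compressed cache, so the effective ratio stays within rounding of 4×. Smaller groups or smaller head dimensions would raise that overhead.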

Deployment

The compression kernel is deployed in our inference stack for all requests whose context length exceeds 8K tokens. Partners running self-hosted inference can integrate it through our open inference adapter layer.