KV-cache memory dominates inference cost for long-context workloads. Standard INT8 quantization yields 2× compression over FP16, and pushing standard quantization beyond that point introduces measurable quality degradation. We developed a grouped quantization scheme that achieves a 4× reduction with no quality loss visible on our evaluations.
Approach
Keys and values are quantized in groups of 64 tokens, and each group carries its own scale and zero-point for every attention head. A lightweight dequantization kernel is fused into the attention computation, so a full-precision copy of the cache is never materialized in GPU memory.
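The scheme described above can be illustrated with a small NumPy sketch. It is not the production kernel: it runs on the CPU, stores one code per byte instead of packing them, and assumes 4-bit codes (suggested by the 4× figure against FP16 but not stated here) with the zero-point kept as a floating-point offset. The function names `quantize_groups`, `dequantize_groups`, and `attention_over_quantized` are hypothetical. The attention routine dequantizes one 64-token group at a time and combines the partial results with a running softmax, which is one way to avoid holding a full-precision copy of the cache.

```python
import numpy as np

GROUP_TOKENS = 64  # group size from the text
BITS = 4           # assumption: 4-bit codes (not stated in the text)


def quantize_groups(x, group_tokens=GROUP_TOKENS, bits=BITS):
    """Quantize x of shape [heads, tokens, head_dim].

    Each (head, token-group) pair gets its own scale and zero-point.
    Returns codes [heads, groups, group_tokens * head_dim] plus
    scale and zero, each of shape [heads, groups, 1].
    """
    heads, tokens, dim = x.shape
    assert tokens % group_tokens == 0, "sketch assumes whole groups only"
    g = x.astype(np.float32).reshape(heads, tokens // group_tokens, -1)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    qmax = (1 << bits) - 1
    scale = np.maximum(hi - lo, 1e-8) / qmax  # guard against constant groups
    codes = np.clip(np.rint((g - lo) / scale), 0, qmax).astype(np.uint8)
    return codes, scale, lo  # lo acts as a floating-point zero-point


def dequantize_groups(codes, scale, zero, shape):
    """Reconstruct the full tensor (used here only for checking)."""
    return (codes.astype(np.float32) * scale + zero).reshape(shape)


def _dequant_group(codes, scale, zero, dim):
    # One group of one head: flat codes -> [group_tokens, head_dim]
    return (codes.astype(np.float32) * scale + zero).reshape(-1, dim)


def attention_over_quantized(q, k_codes, k_scale, k_zero,
                             v_codes, v_scale, v_zero):
    """Attention for one query vector q [head_dim] and one head.

    Only one dequantized group exists at a time; a running maximum
    and denominator keep the softmax numerically exact.
    """
    dim = q.shape[0]
    running_max = -np.inf
    denom = 0.0
    acc = np.zeros(dim, dtype=np.float32)
    for g in range(k_codes.shape[0]):
        k = _dequant_group(k_codes[g], k_scale[g], k_zero[g], dim)
        v = _dequant_group(v_codes[g], v_scale[g], v_zero[g], dim)
        logits = k @ q / np.sqrt(dim)
        new_max = max(running_max, logits.max())
        weights = np.exp(logits - new_max)
        correction = np.exp(running_max - new_max)  # 0.0 on the first group
        denom = denom * correction + weights.sum()
        acc = acc * correction + weights @ v
        running_max = new_max
    return acc / denom
```

Because values are rounded to the nearest code, the reconstruction error of any element in this sketch is bounded by half the scale of its group, so a group with a narrow value range is represented more precisely than one with outliers.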
Benchmarks
- KV-cache memory: −75% (4× compression)
- Long-context perplexity (32K tokens): +0.02 vs. FP16 baseline
- Needle-in-haystack (128K): 100% retrieval at 4× compression
- Attention kernel overhead: +3.2% latency vs. FP16 cache
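As a rough check on the memory figure, the arithmetic can be sketched as follows. The model shape (32 layers, 8 KV heads, head dimension 128) is hypothetical, and the 4-bit code width and the 4 bytes of metadata per group (an FP16 scale plus an FP16 zero-point) are assumptions rather than details given here. The helper `kv_cache_bytes` is an illustrative name.

```python
from math import ceil


def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits,
                   group_tokens=None, meta_bytes_per_group=0):
    """Approximate KV-cache size in bytes for a single sequence."""
    # Factor 2 covers keys and values.
    elements = 2 * layers * kv_heads * head_dim * tokens
    payload = elements * bits / 8
    if group_tokens is None:
        return payload
    # One scale/zero-point pair per (layer, K or V, head, token group).
    groups = 2 * layers * kv_heads * ceil(tokens / group_tokens)
    return payload + groups * meta_bytes_per_group


# Hypothetical model shape; 32K tokens matches the perplexity benchmark.
shape = dict(layers=32, kv_heads=8, head_dim=128, tokens=32 * 1024)

fp16 = kv_cache_bytes(bits=16, **shape)
grouped = kv_cache_bytes(bits=4, group_tokens=64,
                         meta_bytes_per_group=4, **shape)

print(f"FP16 cache:    {fp16 / 2**30:.3f} GiB")
print(f"Grouped cache: {grouped / 2**30:.3f} GiB")
print(f"Compression:   {fp16 / grouped:.3f}x")
```

Under these assumptions the per-group metadata is small next to the payload, so the ratio lands just under 4×, consistent with the −75% figure above.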
Deployment
The compression kernel is deployed in our inference stack for all requests whose context length exceeds 8K tokens. Partners running self-hosted inference can integrate it via our open inference adapter layer.