Expected Attention: KV Cache Compression by Estimating Attention (arxiv.org)
20 points by sonabinu 2 days ago | 3 comments
- yalok 2 days ago: The paper only mentions evals for Ruler 4K and 16K; I wish they'd gone further and measured longer context windows. I was also wondering whether this method could show a gain over the baseline (no compression). Their results for Qwen on Ruler 16K seem to allude to that: at small compression ratios the evals look better than baseline, which means they are not just improving inference speed/memory, but also addressing the attention dilution problem…
- tripplyons 2 days ago: Great work! I wonder if there is a way to combine similar cache items instead of dropping unlikely ones. Could the proposed attention estimation be used for that?
- yorwba 1 day ago: Yes, for example https://arxiv.org/pdf/2506.05410 merges the two neighboring tokens with the lowest sum of past attention scores, and this method would enable using future expected attention instead.
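A minimal sketch of the merging idea discussed above, assuming per-token importance scores are already available (how Expected Attention estimates them, and the exact merge rule of the linked paper, are not reproduced here). The function name, the score-weighted averaging, and the `target_len` parameter are illustrative assumptions, not either paper's method:

```python
import torch

def merge_lowest_neighbors(keys, values, scores, target_len):
    """Shrink a KV cache by repeatedly merging the adjacent pair of entries
    whose combined importance score is lowest, until target_len entries remain.
    keys/values: [seq_len, head_dim]; scores: [seq_len] (e.g. past attention
    sums, or expected future attention as suggested above)."""
    keys, values, scores = list(keys), list(values), list(scores)
    while len(keys) > target_len:
        # adjacent pair (i, i+1) with the lowest combined score
        pair = [float(scores[i] + scores[i + 1]) for i in range(len(keys) - 1)]
        i = min(range(len(pair)), key=pair.__getitem__)
        # score-weighted average keeps the more important token's contribution larger
        w = float(scores[i]) / (float(scores[i]) + float(scores[i + 1]) + 1e-8)
        keys[i] = w * keys[i] + (1 - w) * keys[i + 1]
        values[i] = w * values[i] + (1 - w) * values[i + 1]
        scores[i] = scores[i] + scores[i + 1]
        del keys[i + 1], values[i + 1], scores[i + 1]
    return torch.stack(keys), torch.stack(values), torch.stack(scores)

keys, values = torch.randn(16, 64), torch.randn(16, 64)
scores = torch.rand(16)  # placeholder importance scores
k, v, s = merge_lowest_neighbors(keys, values, scores, target_len=8)
print(k.shape, v.shape, s.shape)  # [8, 64], [8, 64], [8]
```

Unlike eviction, merging keeps some contribution from every token, so the choice of weighting (here score-proportional) is the main design decision; the merged entry's score is accumulated so it becomes harder to merge again.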