Show HN: KVSplit – Run 2-3× longer contexts on Apple Silicon

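KVSplit is a KV cache quantization scheme optimized for Apple Silicon. By quantizing keys and values at different precisions, it lets M1/M2/M3 Macs run LLMs with larger contexts. The headline result: 8-bit keys with 4-bit values (K8V4) cut KV cache memory by 59% at a cost of only 0.86% in perplexity, while also improving inference speed. The project ships benchmarks, visualization tools, and a one-command installer.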

🚀 KVSplit

Differentiated KV Cache Quantization for Apple Silicon

📌 Overview

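Run larger context windows and heavier LLMs on your Mac by applying different quantization precision to the keys and values in the attention mechanism's KV cache. KVSplit enables you to:

Reduce memory usage by up to 72% with minimal quality loss
Run 2-3× longer contexts within the same memory budget
Maintain or improve inference speed compared to FP16
Target Apple Silicon with full Metal support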

Key Findings

Configuration | VRAM @ 8K tokens (% of FP16) | Tokens/sec | Perplexity Change
---|---|---|---
FP16 (base) | 176.00 MB (100%) | 54,360 | --
K8V8 (8-bit) | 93.50 MB (53%) | 51,503 | +0.03%
K8V4 | 71.50 MB (41%) | 57,438 | +0.86%
K4V8 | 71.50 MB (41%) | 58,690 | +6.06%
K4V4 (4-bit) | 49.50 MB (28%) | 55,193 | +6.15%

Memory Savings by Sequence Length

Configuration | 128 tokens | 2048 tokens | 4096 tokens | 8192 tokens
---|---|---|---|---
FP16 (baseline) | 5.50 MB | 44.00 MB | 88.00 MB | 176.00 MB
K8V8 (8-bit) | 2.92 MB | 23.38 MB | 46.75 MB | 93.50 MB
K8V4 (mixed) | 2.23 MB | 17.88 MB | 35.75 MB | 71.50 MB
K4V8 (mixed) | 2.23 MB | 17.88 MB | 35.75 MB | 71.50 MB
K4V4 (4-bit) | 1.55 MB | 12.38 MB | 24.75 MB | 49.50 MB
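
The figures in the table above can be reproduced from first principles. The sketch below is an estimate under stated assumptions, none of which the README confirms: the model is TinyLlama-1.1B (22 layers, 4 KV heads, head dimension 64), the 8-bit and 4-bit modes use llama.cpp's `q8_0` and `q4_0` block formats (8.5 and 4.5 bits per element once the per-block scale is counted), and the cache is sized for the sequence length rounded up to a multiple of 256 tokens. The function name `kv_cache_mib` is illustrative only.

```python
# Sketch: estimate KV cache size when keys and values use different precisions.
# Assumptions (not confirmed by the README): TinyLlama-1.1B shapes, llama.cpp
# q8_0/q4_0 block formats, and context length padded to a multiple of 256.

# Effective bits per stored element, including the per-block fp16 scale.
BITS_PER_ELEMENT = {16: 16.0, 8: 8.5, 4: 4.5}

N_LAYERS = 22     # assumed: TinyLlama-1.1B
N_KV_HEADS = 4    # assumed: grouped-query attention with 4 KV heads
HEAD_DIM = 64     # assumed
PAD = 256         # assumed rounding granularity, in tokens


def kv_cache_mib(seq_len, key_bits, value_bits):
    """Return the estimated KV cache size in MiB."""
    padded = -(-seq_len // PAD) * PAD  # round up to a multiple of PAD
    elements = N_LAYERS * N_KV_HEADS * HEAD_DIM * padded  # per K or per V
    total_bits = elements * (BITS_PER_ELEMENT[key_bits] + BITS_PER_ELEMENT[value_bits])
    return total_bits / 8 / 2**20


if __name__ == "__main__":
    configs = {"FP16": (16, 16), "K8V8": (8, 8), "K8V4": (8, 4),
               "K4V8": (4, 8), "K4V4": (4, 4)}
    for name, (k_bits, v_bits) in configs.items():
        row = [f"{kv_cache_mib(n, k_bits, v_bits):.2f}" for n in (128, 2048, 4096, 8192)]
        print(name, " | ".join(row))
```

Under these assumptions the 128-token column comes out at the size of a 256-token cache, which would explain why it is exactly one eighth of the 2048-token column rather than one sixteenth.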

Features

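Independent quantization of keys and values in the KV cache
Optimized for Apple Silicon with Metal support
Comprehensive benchmarking suite with perplexity measurement
Memory usage and performance analysis tools
Publication-quality visualization tools
Easy setup and usage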

Prerequisites

macOS (tested on Apple Silicon)
Homebrew package manager
Xcode Command Line Tools

⚡ One-Command Installation

# Clone the repository
git clone https://github.com/dipampaul17/KVSplit.git
cd KVSplit
# Run the installer script
chmod +x scripts/install_kvsplit.sh
./scripts/install_kvsplit.sh

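The installer will:

Set up the project structure
Clone and build llama.cpp with Metal support
Configure differentiated KV cache quantization
Download a small test model (optional)
Set up a Python environment for visualization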

🏎️ Quick Comparison

Want to see the benefits right away? Run a quick comparison with your own model:

# Run quick comparison with different configurations
python scripts/quick_compare.py --model models/your-model.gguf

This shows a side-by-side comparison of FP16, K8V8, K8V4, K4V8, and K4V4, covering memory usage, speed, and quality metrics.

📊 Impressive Results

📉 Memory Reduction

Configuration | VRAM @ 8K tokens | Memory Savings | Quality Impact
---|---|---|---
FP16 (base) | 176.00 MB | -- | --
K8V8 (8-bit) | 93.50 MB | 47% | +0.03%
K8V4 | 71.50 MB | 59% | +0.86%
K4V8 | 71.50 MB | 59% | +6.06%
K4V4 (4-bit) | 49.50 MB | 72% | +6.15%

📈 Performance Impact

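KVSplit does more than save memory. In the benchmark below, three of the four quantized configurations also run faster than FP16, by 1.5% to 8.0%; K8V8 is the exception at 5.3% slower.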

Configuration | Tokens/sec (8K ctx) | Speedup vs FP16
---|---|---
FP16 | 54,360 | --
K8V8 | 51,503 | -5.3%
K8V4 | 57,438 | +5.7%
K4V8 | 58,690 | +8.0%
K4V4 | 55,193 | +1.5%
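
The speedup column is a plain ratio against the FP16 throughput. A minimal sketch of that arithmetic, using the tokens/sec values from the table above (the function name `speedup_pct` is illustrative):

```python
# Tokens/sec at 8K context, copied from the table above.
TOKENS_PER_SEC = {
    "FP16": 54360,
    "K8V8": 51503,
    "K8V4": 57438,
    "K4V8": 58690,
    "K4V4": 55193,
}


def speedup_pct(config, baseline="FP16"):
    """Relative throughput change of `config` versus `baseline`, in percent."""
    return (TOKENS_PER_SEC[config] / TOKENS_PER_SEC[baseline] - 1.0) * 100.0


if __name__ == "__main__":
    for name in TOKENS_PER_SEC:
        print(f"{name}: {speedup_pct(name):+.1f}%")
```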

🧠 Project Structure

kvsplit/
├── llama.cpp/   # Optimized llama.cpp build
├── models/     # LLM model files
├── scripts/    # Utility scripts
│  ├── benchmark_kvsplit.py  # Comprehensive benchmark tool
│  ├── install_kvsplit.sh   # One-command installer
│  ├── quick_compare.py    # Quick comparison utility
│  ├── capture_memory.sh    # GIF creation for memory visualization
│  └── visualize_results.py  # Generate publication-quality plots
├── results/    # Benchmark results (CSV/JSON)
├── plots/     # Generated visualizations
└── README.md    # This file

🔬 Scientific Insight

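KV cache memory is dominated by the key and value vectors stored for every token. Our experiments point to one key insight: keys are more sensitive to quantization than values.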

🔑 Key Findings

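Asymmetric impact: Keys need higher precision than values to maintain quality
Sweet spot: K8V4 (8-bit keys, 4-bit values) offers the best balance
- Perplexity rises by only 0.86% compared to FP16
- 59% memory reduction
- Faster inference than FP16
Confirmation: K4V8 shows about 7× the quality degradation of K8V4 despite using the same total number of bits

This asymmetry allows memory to be used more efficiently with little impact on model quality, enabling longer context windows and larger models on consumer hardware.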
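
To make the precision gap concrete, here is a small sketch that quantizes a random vector in blocks of 32 at 8 and 4 bits and measures the round-trip error. It uses a simplified symmetric absmax scheme, not llama.cpp's actual kernels, and the function names are illustrative. It only shows how much more error each element carries at 4 bits; it does not by itself demonstrate that keys tolerate that error worse than values, which is what the perplexity measurements address.

```python
import random


def quantize_roundtrip(block, bits):
    """Symmetric absmax quantization of one block, followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / qmax                   # one scale shared by the whole block
    return [round(x / scale) * scale for x in block]


def rms_error(values, bits, block_size=32):
    """Root-mean-square error after quantizing `values` block by block."""
    squared = 0.0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        restored = quantize_roundtrip(block, bits)
        squared += sum((a - b) ** 2 for a, b in zip(block, restored))
    return (squared / len(values)) ** 0.5


rng = random.Random(0)                    # fixed seed for repeatability
vector = [rng.gauss(0.0, 1.0) for _ in range(4096)]
err8 = rms_error(vector, 8)
err4 = rms_error(vector, 4)

if __name__ == "__main__":
    print(f"8-bit RMS error: {err8:.5f}")
    print(f"4-bit RMS error: {err4:.5f}")
```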

💻 Usage Examples

Running with Different KV Cache Precisions

# Baseline (FP16)
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
 -t 8 --flash-attn
# ⭐ RECOMMENDED: 8-bit keys, 4-bit values (K8V4)
# Best balance of quality and memory savings
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
 -t 8 --flash-attn --kvq-key 8 --kvq-val 4
# 4-bit keys, 8-bit values (K4V8)
# Shows why key precision matters more than value precision
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
 -t 8 --flash-attn --kvq-key 4 --kvq-val 8
# 4-bit keys and values (K4V4)
# Maximum memory savings (72% reduction) with acceptable quality
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
 -t 8 --flash-attn --kvq 4
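
When scripting several runs, it can help to derive the flags from a configuration label instead of typing them by hand. The helper below is a sketch that uses only the `--kvq-key` and `--kvq-val` flags described in this README; the function names `kv_flags` and `build_command` are hypothetical and not part of the project.

```python
import re
import shlex


def kv_flags(config):
    """Translate a label such as 'K8V4' into KV cache flags ('FP16' adds none)."""
    if config == "FP16":
        return []
    match = re.fullmatch(r"K(\d+)V(\d+)", config)
    if match is None:
        raise ValueError(f"unrecognized configuration: {config!r}")
    key_bits, value_bits = match.groups()
    return ["--kvq-key", key_bits, "--kvq-val", value_bits]


def build_command(model, prompt, config, threads=8):
    """Assemble the llama-cli argument list for one configuration."""
    return [
        "./llama.cpp/build/bin/llama-cli",
        "-m", model,
        "-p", prompt,
        "-t", str(threads),
        "--flash-attn",
        *kv_flags(config),
    ]


if __name__ == "__main__":
    for label in ("FP16", "K8V4", "K4V8", "K4V4"):
        cmd = build_command("models/your-model.gguf", "Your prompt", label)
        print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run` avoids shell quoting problems with prompts that contain spaces or quotes.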

Long Context Example (32K)

# Run with a 32K context (would require ~1.4GB in FP16, only ~400MB with K8V4)
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf \
 -c 32768 -n 4096 -t 8 --flash-attn --kvq-key 8 --kvq-val 4 \
 -f your-long-document.txt

🚩 Command-Line Arguments

Flag | Description | Recommendation
---|---|---
-t 8 | Number of threads | 8 is optimal for most Apple Silicon chips
--flash-attn | Enables optimized attention | Recommended for Apple Silicon
--kvq N | Sets both key and value bits to N | Use --kvq 8 for K8V8 or --kvq 4 for K4V4
--kvq-key N | Sets key bits only | Key precision has major quality impact; use --kvq-key 8 --kvq-val 4 for K8V4
--kvq-val N | Sets value bits only | Value precision has minor quality impact
-c N | Context size in tokens | Longer contexts benefit more from KVSplit
-n N | Number of tokens to generate | Adjust based on your needs
-f FILE | Input file | For processing documents
-m MODEL | Model path | Path to your .gguf model file

📏 Advanced Benchmarking

For comprehensive performance analysis, use the full benchmark suite:

# Run the full benchmark suite (all configurations and sequence lengths)
python scripts/benchmark_kvsplit.py
# Run a specific configuration test
python scripts/benchmark_kvsplit.py --config K8V4 --seq-len 4096
# Generate publication-quality visualizations
python scripts/visualize_results.py

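The benchmarking script provides comprehensive measurements of:

📊 Memory Usage: Total VRAM and the KV cache specifically
⚡ Performance: Tokens per second across different sequence lengths
🎯 Quality: Perplexity measurement using llama-perplexity
📈 Scaling: How memory usage and performance scale with sequence length

Results are saved in CSV/JSON format with automatic summary statistics, and the visualization script generates publication-quality plots that highlight the key findings.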
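
For readers unfamiliar with the quality metric: perplexity is the exponential of the average negative log-likelihood per token, and the "Perplexity Change" figures in this README are relative changes against the FP16 baseline. The sketch below shows that arithmetic in its textbook form. It is not the code of `llama-perplexity`, and the numbers in the usage section are made up for illustration, not measured results.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def percent_change(quantized_ppl, baseline_ppl):
    """Relative perplexity change versus the baseline, in percent."""
    return (quantized_ppl / baseline_ppl - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical log-probabilities for a four-token sequence.
    log_probs = [math.log(p) for p in (0.5, 0.25, 0.125, 0.25)]
    print(f"perplexity: {perplexity(log_probs):.3f}")
    # Hypothetical baseline and quantized perplexities.
    print(f"change: {percent_change(10.086, 10.0):+.2f}%")
```

Lower perplexity is better, so a positive change means the quantized cache predicts the test text slightly worse than FP16.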

License

MIT

🎬 Visual Memory Savings

You can visualize the memory savings with our capture tool:

# Capture memory reduction in Activity Monitor
./scripts/capture_memory.sh

🍎 Apple Silicon Optimization

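Metal Performance: Fully optimized for Apple's Metal framework
Memory Efficiency: Critical for memory-constrained M1/M2/M3 devices
Activity Monitor: Use our capture_memory.sh script to visualize memory reduction in real time
Alignment: The 256B page alignment in llama.cpp means actual memory savings may differ slightly from theoretical calculations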

⭐ Key Features

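Differentiated Precision: Independent key and value bit precision (K8V4, K4V8, etc.)
Apple Silicon Optimization: Full Metal support for M1/M2/M3 chips
Comprehensive Benchmarking: Memory, speed, and quality metrics
Publication-Quality Visualization: Polished plots for analysis
Simple User Interface: One-command install and quick comparison tools
Memory Visualization: Tools to capture and visualize memory savings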

🙏 Acknowledgments

This project implements ideas from recent research, including:

"More for Keys, Less for Values: Adaptive KV Cache Quantization" (2024)
"Unifying KV Cache Compression for Large Language Models with LeanKV" (2025)

Additional credits:

llama.cpp - Base implementation
TinyLlama - Test model

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

🧠 Configuration Recommendations

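Best Overall: 🌟 K8V4 🌟 (8-bit keys, 4-bit values)
- 59% memory reduction with only 0.86% quality loss
- Improved inference speed (+5.7% vs FP16)
- Good balance of quality and efficiency
Absolute Maximum Memory Savings: K4V4 (4-bit keys and values)
- 72% memory reduction with about 6% quality loss
- Suitable for memory-constrained devices
- Acceptable for applications that are less sensitive to quality
Best for Very Long Contexts: K8V4 or K4V4
- Absolute memory savings grow with context length
- Run 2-3× longer contexts within the same memory budget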
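
The "2-3× longer contexts" figure follows from the measured cache sizes. The sketch below derives the multiplier for each configuration from the 8K-token VRAM numbers reported in this README, assuming the cache grows linearly with context length (which the 2048 to 8192 token measurements suggest). The function names are illustrative.

```python
# KV cache size in MB at 8192 tokens, as reported in this README.
VRAM_MB_AT_8K = {
    "FP16": 176.00,
    "K8V8": 93.50,
    "K8V4": 71.50,
    "K4V8": 71.50,
    "K4V4": 49.50,
}


def context_multiplier(config):
    """How many times longer a context fits in the memory FP16 would use."""
    return VRAM_MB_AT_8K["FP16"] / VRAM_MB_AT_8K[config]


def tokens_in_budget(budget_mb, config):
    """Tokens that fit in `budget_mb` of KV cache, assuming linear scaling."""
    mb_per_token = VRAM_MB_AT_8K[config] / 8192
    return int(budget_mb / mb_per_token)


if __name__ == "__main__":
    for name in VRAM_MB_AT_8K:
        print(f"{name}: {context_multiplier(name):.2f}x, "
              f"{tokens_in_budget(176.0, name)} tokens in 176 MB")
```

By this estimate K8V4 fits roughly 2.5× the FP16 context and K4V4 roughly 3.6×, so the upper end of the range applies only if the larger quality loss of K4V4 is acceptable.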

🔮 Future Roadmap

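Adaptive Precision: Dynamic precision based on token importance
Layer-Specific Quantization: Different precision for different model layers
Model-Specific Optimizations: Tuning for Mistral, Phi-3, and others
Web Demo: Interactive testing environment
Mobile Support: Adaptation for iOS and iPadOS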
