
Run larger LLMs with longer context on Apple Silicon by quantizing the KV cache with differentiated precision. KVSplit enables 8-bit keys and 4-bit values, cutting memory by 59% with less than 1% quality loss. Includes benchmarks, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.


🚀 KVSplit

Differentiated KV Cache Quantization for Apple Silicon


📌 Overview

Fit larger models and heavier context windows on your Mac by applying different quantization precision to the keys and values in the attention mechanism's KV cache.

Key Findings

| Configuration | VRAM @ 8K tokens | Tokens/sec | Perplexity Change |
|---|---|---|---|
| FP16 (base) | 176.00 MB (100%) | 54,360 | -- |
| K8V8 (8-bit) | 93.50 MB (53%) | 51,503 | +0.03% |
| K8V4 | 71.50 MB (41%) | 57,438 | +0.86% |
| K4V8 | 71.50 MB (41%) | 58,690 | +6.06% |
| K4V4 (4-bit) | 49.50 MB (28%) | 55,193 | +6.15% |

Memory Savings by Sequence Length

| Configuration | 128 tokens | 2048 tokens | 4096 tokens | 8192 tokens |
|---|---|---|---|---|
| FP16 (baseline) | 5.50 MB | 44.00 MB | 88.00 MB | 176.00 MB |
| K8V8 (8-bit) | 2.92 MB | 23.38 MB | 46.75 MB | 93.50 MB |
| K8V4 (mixed) | 2.23 MB | 17.88 MB | 35.75 MB | 71.50 MB |
| K4V8 (mixed) | 2.23 MB | 17.88 MB | 35.75 MB | 71.50 MB |
| K4V4 (4-bit) | 1.55 MB | 12.38 MB | 24.75 MB | 49.50 MB |
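
These ratios fall directly out of llama.cpp's quantized cache formats: q8_0 and q4_0 store one fp16 scale per 32-element block, so their effective sizes are 8.5 and 4.5 bits per element. A quick sketch, assuming those standard block layouts:

```python
# llama.cpp quantized cache blocks hold 32 elements plus one fp16 scale:
# q8_0: (32*8 + 16) / 32 = 8.5 effective bits per element
# q4_0: (32*4 + 16) / 32 = 4.5 effective bits per element
BITS = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

def cache_fraction(key_type, val_type):
    """KV cache size as a fraction of the FP16 baseline."""
    return (BITS[key_type] + BITS[val_type]) / (2 * BITS["f16"])

for name, k, v in [("K8V8", "q8_0", "q8_0"), ("K8V4", "q8_0", "q4_0"),
                   ("K4V8", "q4_0", "q8_0"), ("K4V4", "q4_0", "q4_0")]:
    print(f"{name}: {cache_fraction(k, v):.1%} of FP16")
# K8V4 -> 40.6%, i.e. 71.50 MB / 176.00 MB from the table above
```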

Features

Prerequisites

⚡ One-Command Installation

```bash
# Clone the repository
git clone https://github.com/dipampaul17/KVSplit.git
cd KVSplit

# Run the installer script
chmod +x scripts/install_kvsplit.sh
./scripts/install_kvsplit.sh
```

The installer sets everything up for you.

🏎️ Quick Comparison

Want to see the benefits right away? Run a quick comparison with your model:

```bash
# Run quick comparison with different configurations
python scripts/quick_compare.py --model models/your-model.gguf
```

This gives you a side-by-side comparison of FP16, K8V8, K8V4, K4V8, and K4V4, including memory usage, speed, and quality metrics.

📊 Impressive Results

Memory vs Quality

📉 Memory Reduction

| Configuration | VRAM @ 8K tokens | Memory Savings | Quality Impact |
|---|---|---|---|
| FP16 (base) | 176.00 MB | — | — |
| K8V8 (8-bit) | 93.50 MB | 47% | +0.03% |
| K8V4 | 71.50 MB | 59% | +0.86% |
| K4V8 | 71.50 MB | 59% | +6.06% |
| K4V4 (4-bit) | 49.50 MB | 72% | +6.15% |

📈 Performance Impact

KVSplit doesn't just save memory — it often speeds up inference by 5-15% as well!

| Configuration | Tokens/sec (8K ctx) | Speedup vs FP16 |
|---|---|---|
| FP16 | 54,360 | — |
| K8V8 | 51,503 | -5.3% |
| K8V4 | 57,438 | +5.7% |
| K4V8 | 58,690 | +8.0% |
| K4V4 | 55,193 | +1.5% |
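
The savings and speedup columns in both tables can be re-derived from the raw measurements, which makes it easy to sanity-check or extend the comparison with new runs. A small sketch using the numbers above:

```python
# Raw numbers from the tables above: (config, VRAM @ 8K tokens in MB,
# tokens/sec, perplexity change in %)
rows = [("K8V8", 93.50, 51503, 0.03), ("K8V4", 71.50, 57438, 0.86),
        ("K4V8", 71.50, 58690, 6.06), ("K4V4", 49.50, 55193, 6.15)]
BASE_MEM, BASE_TPS = 176.00, 54360  # FP16 baseline

summary = {}
for name, mem, tps, dppl in rows:
    saving = 1 - mem / BASE_MEM      # fraction of KV memory saved
    speedup = tps / BASE_TPS - 1     # throughput change vs FP16
    summary[name] = (saving, speedup)
    print(f"{name}: {saving:.0%} less memory, {speedup:+.1%} speed, "
          f"+{dppl:.2f}% perplexity")
# K8V4 line: "K8V4: 59% less memory, +5.7% speed, +0.86% perplexity"
```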

🧠 Project Structure

```
kvsplit/
├── llama.cpp/                 # Optimized llama.cpp build
├── models/                    # LLM model files
├── scripts/                   # Utility scripts
│   ├── benchmark_kvsplit.py   # Comprehensive benchmark tool
│   ├── install_kvsplit.sh     # One-command installer
│   ├── quick_compare.py       # Quick comparison utility
│   ├── capture_memory.sh      # GIF creation for memory visualization
│   └── visualize_results.py   # Generate publication-quality plots
├── results/                   # Benchmark results (CSV/JSON)
├── plots/                     # Generated visualizations
└── README.md                  # This file
```

🔬 Scientific Insight

Configuration Summary

KV cache memory is dominated by storing a key and value vector for every token. Our research reveals a critical insight: keys are more sensitive to quantization than values.

🔑 Key Findings

This asymmetry allows memory to be used more efficiently without compromising model quality, enabling longer context windows and larger models on consumer hardware.
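
A toy, pure-Python illustration of where the asymmetry comes from (simulated round-to-nearest quantization on random data — not the actual llama.cpp kernels): key error perturbs the attention logits *before* the softmax, where it can be amplified, while value error only enters the final weighted average.

```python
import math
import random

random.seed(0)
d, n = 16, 8  # head dimension and number of cached tokens (toy sizes)

q = [random.gauss(0, 1) for _ in range(d)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

def quantize(vec, bits):
    """Simulated symmetric round-to-nearest quantization of one vector."""
    step = max(abs(x) for x in vec) / (2 ** (bits - 1) - 1)
    return [round(x / step) * step for x in vec]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, K, V):
    """Single-query scaled dot-product attention over the toy cache."""
    scale = 1 / math.sqrt(d)
    w = softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K])
    return [sum(w[i] * V[i][j] for i in range(n)) for j in range(d)]

def rel_err(ref, approx):
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, approx)))
    return num / math.sqrt(sum(a * a for a in ref))

base = attention(q, K, V)
err_keys = rel_err(base, attention(q, [quantize(k, 4) for k in K], V))
err_vals = rel_err(base, attention(q, K, [quantize(v, 4) for v in V]))
print(f"4-bit keys:   {err_keys:.2%} output error")
print(f"4-bit values: {err_vals:.2%} output error")
```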

💻 Usage Examples

Running with Different KV Cache Precisions

```bash
# Baseline (FP16)
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
  -t 8 --flash-attn

# ⭐ RECOMMENDED: 8-bit keys, 4-bit values (K8V4)
# Best balance of quality and memory savings
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
  -t 8 --flash-attn --kvq 8

# 4-bit keys, 8-bit values (K4V8)
# Shows why key precision matters more than value precision
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
  -t 8 --flash-attn --kvq-key 4 --kvq-val 8

# 4-bit keys and values (K4V4)
# Maximum memory savings (72% reduction) with acceptable quality
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
  -t 8 --flash-attn --kvq 4
```

Long Context Example (32K)

```bash
# Run with a 32K context (would require ~1.4GB in FP16, only ~400MB with K8V4)
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf \
  -c 32768 -n 4096 -t 8 --flash-attn --kvq 8 \
  -f your-long-document.txt
```

🚩 Command-Line Arguments

| Flag | Description | Recommendation |
|---|---|---|
| `-t 8` | Number of threads | 8 is optimal for most Apple Silicon chips |
| `--flash-attn` | Enables optimized attention | Recommended for Apple Silicon |
| `--kvq N` | Sets both key and value bits to N | Use `--kvq 8` for the K8V4 configuration |
| `--kvq-key N` | Sets key bits only | Key precision has a major quality impact |
| `--kvq-val N` | Sets value bits only | Value precision has a minor quality impact |
| `-c N` | Context size in tokens | Longer contexts benefit more from KVSplit |
| `-n N` | Number of tokens to generate | Adjust based on your needs |
| `-f FILE` | Input file | For processing documents |
| `-m MODEL` | Model path | Path to your .gguf model file |

📏 Advanced Benchmarking

For comprehensive performance analysis, use the full benchmark suite:

```bash
# Run the full benchmark suite (all configurations and sequence lengths)
python scripts/benchmark_kvsplit.py

# Run a specific configuration test
python scripts/benchmark_kvsplit.py --config K8V4 --seq-len 4096

# Generate publication-quality visualizations
python scripts/visualize_results.py
```

The benchmark script provides thorough measurements across configurations and sequence lengths.

Results are saved in CSV/JSON format with automatic summary statistics, and the visualization script produces publication-quality plots highlighting the key insights.
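
As one way to consume those results, here is a minimal sketch that aggregates a results CSV with only the standard library. The column names (`config`, `seq_len`, `vram_mb`) are hypothetical — check the actual header of the files in `results/` before relying on them:

```python
# Column names below are hypothetical; inspect the real CSV schema first.
import csv
import io
from collections import defaultdict
from statistics import mean

sample = io.StringIO(   # stand-in for opening a real file under results/
    "config,seq_len,vram_mb\n"
    "K8V4,4096,35.75\n"
    "K8V4,8192,71.50\n"
    "K4V4,8192,49.50\n"
)

vram_by_config = defaultdict(list)
for row in csv.DictReader(sample):
    vram_by_config[row["config"]].append(float(row["vram_mb"]))

for config, values in sorted(vram_by_config.items()):
    print(f"{config}: mean KV VRAM {mean(values):.2f} MB over {len(values)} run(s)")
```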


🎬 Visual Memory Savings

You can visualize the memory savings with the capture tool:

```bash
# Capture memory reduction in Activity Monitor
./scripts/capture_memory.sh
```

(Plots: Memory Usage · Key-Value Sensitivity · Quality Impact · Speed Impact)

🍎 Apple Silicon Optimization

⭐ Key Features

🙏 Acknowledgments

This project implements ideas from recent research, including:

Additional thanks to:


🧠 Configuration Recommendations

🔮 Future Roadmap

📜 License

MIT

🤝 Contributing

Contributions are welcome! Please open an issue or submit a pull request.
