>8 token/s DeepSeek R1 671B Q4_K_M with 1~2 Arc A770 on Xeon


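This article describes how to run the DeepSeek R1 671B Q4_K_M model at more than 8 token/s using the llama.cpp portable zip on a platform with a Xeon processor and 1-2 Arc A770 GPUs. It provides quickstart guides for Windows and Linux, covering prerequisites, download and extraction, runtime configuration, and running a GGUF model. It also covers multi-GPU usage, performance-related environment settings, and troubleshooting of common errors, and introduces the FlashMoE tool for running DeepSeek V3/R1 models.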


Run llama.cpp Portable Zip on Intel GPU with IPEX-LLM


Important:

You can now run DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 GPUs on a Xeon platform using the latest llama.cpp Portable Zip.

This guide demonstrates how to use the llama.cpp portable zip to run llama.cpp directly on Intel GPUs with ipex-llm (no manual installation required).

Note:

The llama.cpp portable zip has been verified on:

Intel Core Ultra processors
Intel Core 11th - 14th gen processors
Intel Arc A-Series GPU
Intel Arc B-Series GPU

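Windows Quickstart

Prerequisites

Check your GPU driver version and update it if needed:

For Intel Core Ultra processors (Series 2) or Intel Arc B-Series GPU, we recommend updating the GPU driver to the latest version
For other Intel iGPU/dGPU, we recommend GPU driver version 32.0.101.6078

Step 1: Download and unzip

Download the IPEX-LLM llama.cpp portable zip for Windows users from the link.

Then, extract the zip file to a folder.

Step 2: Runtime configuration

Open "Command Prompt" (cmd) and enter the extracted folder with cd /d PATH\TO\EXTRACTED\FOLDER
To use GPU acceleration, several environment variables are required or recommended before running llama.cpp.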

set SYCL_CACHE_PERSISTENT=1

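For multi-GPU users, see the Multi-GPU usage section under Tips & Troubleshooting for how to select specific GPUs.

Step 3: Run GGUF models

Here we provide a simple example showing how to run a community GGUF model with IPEX-LLM.

Model download

Before running, download or copy a community GGUF model to a local directory. For example, DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF.

Run GGUF model

Before running the command below, change PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf to your model path.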

llama-cli.exe -m PATH\TO\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0

Partial output:

Found 1 SYCL devices:
| |          |                    |    |Max  |    |Max |Global |           |
| |          |                    |    |compute|Max work|sub |mem  |           |
|ID|    Device Type|                  Name|Version|units |group  |group|size  |    Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|           Intel Arc Graphics| 12.71|  128|  1024|  32| 13578M|      1.3.27504|
llama_kv_cache_init:   SYCL0 KV buffer size =  138.25 MiB
llama_new_context_with_model: KV self size = 138.25 MiB, K (f16):  69.12 MiB, V (f16):  69.12 MiB
llama_new_context_with_model: SYCL_Host output buffer size =   0.58 MiB
llama_new_context_with_model:   SYCL0 compute buffer size = 1501.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size =  58.97 MiB
llama_new_context_with_model: graph nodes = 874
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 22 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 341519086
sampler params:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
    top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 2528, n_batch = 4096, n_predict = 2048, n_keep = 1
<think>
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
</think>
<answer>XXXX</answer> [end of text]

llama_perf_sampler_print:  sampling time =   xxx.xx ms / 1386 runs  (  x.xx ms per token, xxxxx.xx tokens per second)
llama_perf_context_print:    load time =  xxxxx.xx ms
llama_perf_context_print: prompt eval time =   xxx.xx ms /  129 tokens (  x.xx ms per token,  xxx.xx tokens per second)
llama_perf_context_print:    eval time =  xxxxx.xx ms / 1256 runs  (  xx.xx ms per token,  xx.xx tokens per second)
llama_perf_context_print:    total time =  xxxxx.xx ms / 1385 tokens

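Linux Quickstart

Prerequisites

Check your GPU driver version and update it if needed; we recommend following the Intel client GPU driver installation guide to install your GPU driver.

Step 1: Download and extract

Download the IPEX-LLM llama.cpp portable tgz for Linux users from the link.

Then, extract the tgz file to a folder.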
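
The extraction step can be sketched as a small shell helper. This is only an illustration: the archive and folder names in the usage comment are placeholders, not the actual release file name.

```shell
# Sketch: extract a downloaded portable tgz into its own folder.
# The function body runs in a subshell "( )" so its variables stay local.
extract_portable() (
    archive=$1
    dest=$2
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        exit 1
    fi
    mkdir -p "$dest"
    tar -xzf "$archive" -C "$dest"
)

# Example (hypothetical file and folder names):
#   extract_portable "$HOME/Downloads/llama-cpp-portable.tgz" "$HOME/llama-cpp"
#   cd "$HOME/llama-cpp"
```

Extracting into a dedicated folder keeps the bundled binaries together, which is the folder you cd into in the next step.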

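Step 2: Runtime configuration

Open a terminal and enter the extracted folder with cd /PATH/TO/EXTRACTED/FOLDER
To use GPU acceleration, several environment variables are required or recommended before running llama.cpp.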

export SYCL_CACHE_PERSISTENT=1

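For multi-GPU users, see the Multi-GPU usage section under Tips & Troubleshooting for how to select specific GPUs.

Step 3: Run GGUF models

Here we provide a simple example showing how to run a community GGUF model with IPEX-LLM.

Model download

Before running, download or copy a community GGUF model to a local directory. For example, DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF.

Run GGUF model

Before running the command below, change /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf to your model path.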

./llama-cli -m /PATH/TO/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -p "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: Question:The product of the ages of three teenagers is 4590. How old is the oldest? a. 18 b. 19 c. 15 d. 17 Assistant: <think>" -n 2048 -t 8 -e -ngl 99 --color -c 2500 --temp 0

Partial output:

Found 1 SYCL devices:
| |          |                    |    |Max  |    |Max |Global |           |
| |          |                    |    |compute|Max work|sub |mem  |           |
|ID|    Device Type|                  Name|Version|units |group  |group|size  |    Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|           Intel Arc Graphics| 12.71|  128|  1024|  32| 13578M|      1.3.27504|
llama_kv_cache_init:   SYCL0 KV buffer size =  138.25 MiB
llama_new_context_with_model: KV self size = 138.25 MiB, K (f16):  69.12 MiB, V (f16):  69.12 MiB
llama_new_context_with_model: SYCL_Host output buffer size =   0.58 MiB
llama_new_context_with_model:   SYCL0 compute buffer size = 1501.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size =  58.97 MiB
llama_new_context_with_model: graph nodes = 874
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 22 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 341519086
sampler params:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
    top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 2528, n_batch = 4096, n_predict = 2048, n_keep = 1
<think>
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
</think>
<answer>XXXX</answer> [end of text]
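
At the end of a run, llama-cli prints llama_perf_context_print lines (see the Windows output above); generation speed comes from the "eval time" line. As a rough sketch, the helper below recomputes tokens per second from that line. The function name and the log file name in the usage comment are placeholders, and it assumes the line has the shape "eval time = <ms> ms / <runs> runs".

```shell
# Sketch: recompute generation speed from the "eval time" perf line.
# Reads log text on stdin and prints tokens per second with two decimals.
eval_tokens_per_second() {
    awk '
        /llama_perf_context_print:/ && /eval time/ && !/prompt eval time/ {
            ms = 0; runs = 0
            # Find the "<ms> ms / <runs> runs" part of the line.
            for (i = 2; i < NF; i++)
                if ($i == "ms" && $(i + 1) == "/") { ms = $(i - 1); runs = $(i + 2) }
            # Skip lines whose timing is not numeric (e.g. masked as xxxxx.xx).
            if (ms + 0 > 0) printf "%.2f\n", runs / (ms / 1000)
        }
    '
}

# Example (run.log is a placeholder; perf lines may be written to stderr):
#   ./llama-cli -m /PATH/TO/model.gguf -p "Hello" -n 128 -ngl 99 2>&1 | tee run.log
#   eval_tokens_per_second < run.log
```

The prompt-processing line is excluded on purpose, since the headline token/s figure refers to generation rather than prompt evaluation.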

FlashMoE for DeepSeek V3/R1

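FlashMoE is a command-line tool built on llama.cpp and optimized for mixture-of-experts (MoE) models such as DeepSeek V3/R1. It is currently available for Linux.

Tested MoE GGUF models (other MoE GGUF models are also supported):

Run DeepSeek V3/R1 with FlashMoE

Requirements:

380GB CPU memory
1-8 ARC A770
500GB disk

Note:

Larger models and other precisions may require more resources.
For a platform with 1 ARC A770, reduce the context length (e.g., 1024) to avoid OOM by adding the option -c 1024 at the end of the command below.

Before running, download or copy a community GGUF model to a local directory, for example DeepSeek-R1-Q4_K_M.gguf.

Change /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf to your model path, then run DeepSeek-R1-Q4_K_M.gguf: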

./flash-moe -m /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf --prompt "What's AI?"
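
As an illustration of the single-GPU note above, the sketch below prints the flash-moe command line and appends -c 1024 only when one GPU is used. The helper name is hypothetical, and it only prints the command (a dry run) rather than launching it.

```shell
# Sketch: print the flash-moe command, adding "-c 1024" for a single A770.
# Arguments: <model path> <number of Arc A770 GPUs> <prompt>
# The body runs in a subshell "( )" so its variables stay local.
flash_moe_command() (
    model=$1
    gpus=$2
    prompt=$3
    extra=""
    if [ "$gpus" -eq 1 ]; then
        extra=" -c 1024"   # shorter context to avoid OOM on one GPU
    fi
    printf './flash-moe -m "%s" --prompt "%s"%s\n' "$model" "$prompt" "$extra"
)

# Example:
#   flash_moe_command /PATH/TO/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf 1 "What's AI?"
```

The quoting in the printed line is for readability only; a prompt that itself contains double quotes would need extra escaping before being pasted into a shell.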

Partial output:

llama_kv_cache_init:   SYCL0 KV buffer size = 1280.00 MiB
llama_kv_cache_init:   SYCL1 KV buffer size = 1280.00 MiB
llama_kv_cache_init:   SYCL2 KV buffer size = 1280.00 MiB
llama_kv_cache_init:   SYCL3 KV buffer size = 1280.00 MiB
llama_kv_cache_init:   SYCL4 KV buffer size = 1120.00 MiB
llama_kv_cache_init:   SYCL5 KV buffer size = 1280.00 MiB
llama_kv_cache_init:   SYCL6 KV buffer size = 1280.00 MiB
llama_kv_cache_init:   SYCL7 KV buffer size =  960.00 MiB
llama_new_context_with_model: KV self size = 9760.00 MiB, K (i8): 5856.00 MiB, V (i8): 3904.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size =   0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:   SYCL0 compute buffer size = 2076.02 MiB
llama_new_context_with_model:   SYCL1 compute buffer size = 2076.02 MiB
llama_new_context_with_model:   SYCL2 compute buffer size = 2076.02 MiB
llama_new_context_with_model:   SYCL3 compute buffer size = 2076.02 MiB
llama_new_context_with_model:   SYCL4 compute buffer size = 2076.02 MiB
llama_new_context_with_model:   SYCL5 compute buffer size = 2076.02 MiB
llama_new_context_with_model:   SYCL6 compute buffer size = 2076.02 MiB
llama_new_context_with_model:   SYCL7 compute buffer size = 3264.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 1332.05 MiB
llama_new_context_with_model: graph nodes = 5184 (with bs=4096), 4720 (with bs=1)
llama_new_context_with_model: graph splits = 125
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 48
system_info: n_threads = 48 (n_threads_batch = 48) / 192 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 2052631435
sampler params:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
    top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 4096, n_predict = -1, n_keep = 1
<think>
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
</think>
<answer>XXXX</answer> [end of text]

Tips & Troubleshooting

Error: Detected different sycl devices

You will see an error log like the following:

Found 3 SYCL devices:
| |          |                    |    |Max  |    |Max |Global |           |
| |          |                    |    |compute|Max work|sub |mem  |           |
|ID|    Device Type|                  Name|Version|units |group  |group|size  |    Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|        Intel Arc A770 Graphics| 12.55|  512|  1024|  32| 16225M|   1.6.31907.700000|
| 1| [level_zero:gpu:1]|        Intel Arc A770 Graphics| 12.55|  512|  1024|  32| 16225M|   1.6.31907.700000|
| 2| [level_zero:gpu:2]|         Intel UHD Graphics 770|  12.2|   32|   512|  32| 63218M|   1.6.31907.700000|
Error: Detected different sycl devices, the performance will limit to the slowest device. 
If you want to disable this checking and use all of them, please set environment SYCL_DEVICE_CHECK=0, and try again.
If you just want to use one of the devices, please set environment like ONEAPI_DEVICE_SELECTOR=level_zero:0 or ONEAPI_DEVICE_SELECTOR=level_zero:1 to choose your devices.
If you want to use two or more deivces, please set environment like ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"
See https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Overview/KeyFeatures/multi_gpus_selection.md for details. Exiting.

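Because the GPUs are not identical, jobs are allocated according to each device's memory. In the example above, the iGPU (Intel UHD Graphics 770) would receive about 2/3 of the computing tasks, so performance will be poor. You have two options:

Disable the iGPU to get the best performance. For details, see Multi-GPU usage.
Disable this check and use all GPUs by running one of the following commands:
- set SYCL_DEVICE_CHECK=0 (Windows users)
- export SYCL_DEVICE_CHECK=0 (Linux users)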
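
The 2/3 figure can be checked with a little arithmetic. The sketch below assumes the split is strictly proportional to the "Global mem size" column of the device table, which is a simplification of whatever the runtime actually does; the function name is hypothetical.

```shell
# Sketch: share of work per device if jobs are split in proportion to memory.
# Arguments: global memory sizes in MB, one per device, in device-ID order.
memory_shares() (
    [ "$#" -gt 0 ] || exit 1
    printf '%s\n' "$@" | awk '
        { mem[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++)
                printf "device %d: %.1f%%\n", i - 1, 100 * mem[i] / total
        }
    '
)

# Sizes from the error log above: A770, A770, UHD Graphics 770.
memory_shares 16225 16225 63218
# device 0: 17.0%
# device 1: 17.0%
# device 2: 66.1%
```

Under that assumption the slow iGPU ends up with roughly two thirds of the work simply because it reports the most shared memory, which is why excluding it is the better option.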

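Multi-GPU usage

If your machine has multiple Intel GPUs, llama.cpp runs on all of them by default. If you are unsure of your hardware configuration, you can read it from the output printed when running a GGUF model, like this: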

Found 3 SYCL devices:
| |          |                    |    |Max  |    |Max |Global |           |
| |          |                    |    |compute|Max work|sub |mem  |           |
|ID|    Device Type|                  Name|Version|units |group  |group|size  |    Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|        Intel Arc A770 Graphics| 12.55|  512|  1024|  32| 16225M|   1.6.31907.700000|
| 1| [level_zero:gpu:1]|        Intel Arc A770 Graphics| 12.55|  512|  1024|  32| 16225M|   1.6.31907.700000|
| 2| [level_zero:gpu:2]|         Intel UHD Graphics 770|  12.2|   32|   512|  32| 63218M|   1.6.31907.700000|

To specify which Intel GPU(s) llama.cpp should use, set the environment variable ONEAPI_DEVICE_SELECTOR before launching the llama.cpp command, as follows:

For Windows users:

set ONEAPI_DEVICE_SELECTOR=level_zero:0 (to run on one GPU; llama.cpp will use the first GPU)
set ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1" (to run on two GPUs; llama.cpp will use the first and second GPUs)

For Linux users:

export ONEAPI_DEVICE_SELECTOR=level_zero:0 (to run on one GPU; llama.cpp will use the first GPU)
export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1" (to run on two GPUs; llama.cpp will use the first and second GPUs)
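
For Linux, a selector value can also be assembled from the device table rather than typed by hand. The following is a minimal sketch with hypothetical helper names. Filtering on "Arc" in the device name is only a heuristic for this example (it keeps the two A770 cards and drops the UHD iGPU) and may not suit other machines.

```shell
# Sketch: print the Level Zero index of every device whose name contains "Arc".
# Reads the "Found N SYCL devices" table on stdin.
arc_device_indices() {
    awk -F'|' '
        $3 ~ /level_zero:gpu:/ && $4 ~ /Arc/ {
            if (match($3, /gpu:[0-9]+/))
                print substr($3, RSTART + 4, RLENGTH - 4)
        }
    '
}

# Sketch: join indices into a selector value such as "level_zero:0;level_zero:1".
# The body runs in a subshell "( )" so its variables stay local.
make_selector() (
    selector=""
    for idx in "$@"; do
        selector="${selector:+$selector;}level_zero:$idx"
    done
    printf '%s\n' "$selector"
)

# Example (devices.txt is a placeholder for a saved copy of the table):
#   export ONEAPI_DEVICE_SELECTOR="$(make_selector $(arc_device_indices < devices.txt))"
```

The index is taken from the [level_zero:gpu:N] column rather than the ID column, so the result matches the form used in the export commands above.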

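Performance Environment

SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS

To enable SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS, run the following command:

set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 (Windows users)
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 (Linux users)

Note:

The environment variable SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS determines whether immediate command lists are used when submitting tasks to the GPU. This mode usually improves performance, but there are exceptions, so try running both with and without the variable to find the best setting. For more details, you can refer to this article.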
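
One way to try both settings on Linux is to run the same command twice, once with the variable unset and once with it set to 1. The helper below is a sketch with a hypothetical name. It reports wall-clock seconds only, which include model load time, so the "eval time" line in the llama_perf_context_print output remains the better measure of generation speed.

```shell
# Sketch: run a command without and with immediate command lists.
# Uses "env -u", which is widely available but not strictly POSIX.
# The body runs in a subshell "( )" so its variables stay local.
compare_immediate_cmdlists() (
    var=SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS

    echo "== $var unset =="
    start=$(date +%s)
    env -u "$var" "$@" || echo "run failed" >&2
    echo "elapsed: $(( $(date +%s) - start )) s"

    echo "== $var=1 =="
    start=$(date +%s)
    env "$var=1" "$@" || echo "run failed" >&2
    echo "elapsed: $(( $(date +%s) - start )) s"
)

# Example (model path and prompt are placeholders):
#   compare_immediate_cmdlists ./llama-cli -m /PATH/TO/model.gguf -p "Hello" -n 128 -ngl 99
```

Each run starts from the same command, so any difference in the reported perf lines can be attributed to the variable rather than to changed arguments.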