Orpheus-3B: Emotive TTS by Canopy Labs


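Canopy Labs has released Orpheus TTS, a Speech-LLM built on the Llama architecture and designed to generate more human-sounding speech. The model comes in several sizes, including a 3B-parameter Medium version, and supports zero-shot voice cloning and emotion control. The post covers the model's architecture, training data, and emergent capabilities, demonstrates its strengths in handling disfluencies, expressing emotion, and real-time use, and links to GitHub, Hugging Face, and Google Colab so readers can try it.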

Canopy Labs


Contents

Introducing Orpheus TTS
Exploring Capabilities
Speaking Like a Human
Try a Demo
Natural Zero-Shot Voice Cloning
Guided Emotion and Intonation
In Production Usage
Stay Updated

Towards More Human-Sounding TTS

March 19, 2025

Introducing Orpheus Speech

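To date, open-source TTS models have not been able to compete with closed-source models on performance [1]. TTS models also lack the emotional resonance and emotional intelligence of humans.

(video)

We are introducing Orpheus, a family of state-of-the-art Speech-LLMs for human-level speech generation. We are also releasing pretrained and fine-tuned models, built on the Llama architecture, in four sizes: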

Medium – 3B parameters
Small – 1B parameters
Tiny – 400M parameters
Nano – 150M parameters

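Even at very small model sizes, we demonstrate extremely high-quality, pleasant-sounding speech generation.

Our fine-tuned models, trained on a selection of voices, can be used in production. We also offer the base models and sample fine-tuning scripts, which can be used for zero-shot voice cloning and for your own fine-tuning.

We also provide code for real-time streaming in a simple Python package. For the 3B-parameter model, streaming inference is faster than playback even on an A100 40GB (see our Google Colab notebook).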
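One way to check the "faster than playback" claim on your own hardware is to measure the real-time factor (generation time divided by audio duration); a value below 1.0 means audio is produced faster than it plays. The sketch below is illustrative only: `real_time_factor` is a hypothetical helper, and the 24 kHz sample rate and 16-bit mono PCM chunk format are assumptions, not details taken from the post.

```python
import time
from typing import Callable, Iterable

# Assumptions (not stated in the post): 24 kHz, 16-bit mono PCM chunks.
SAMPLE_RATE = 24_000
BYTES_PER_SAMPLE = 2


def real_time_factor(
    chunks: Iterable[bytes],
    sample_rate: int = SAMPLE_RATE,
    bytes_per_sample: int = BYTES_PER_SAMPLE,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Consume a stream of PCM chunks and return elapsed_time / audio_duration.

    `chunks` can be any iterable that yields raw audio bytes as they are
    generated, for example a streaming TTS generator.
    """
    start = clock()
    total_bytes = 0
    for chunk in chunks:
        total_bytes += len(chunk)
    elapsed = clock() - start

    audio_seconds = total_bytes / (sample_rate * bytes_per_sample)
    if audio_seconds == 0:
        raise ValueError("stream produced no audio")
    return elapsed / audio_seconds
```

Passing the clock as a parameter keeps the helper easy to test; in normal use the default `time.perf_counter` is enough.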

Try a Demo

We have set up simple inference flows for our pretrained and fine-tuned models. Check out the links below to see the models in action!

Technical Overview

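(image: Architecture)

Model Architecture

Our pretrained model uses Llama-3b as its backbone. We trained it on over 100k hours of English speech data and billions of text tokens. Training on text tokens improves its performance on TTS tasks because the model retains a strong understanding of language. Below we explore some of the model's interesting emergent capabilities.

We use exactly the same architecture and training method to train end-to-end speech models, and we may release an open-source end-to-end speech model in the coming weeks.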

Handling disfluencies

Orpheus (Ours) | ElevenLabs | PlayHT
---|---|---
(audio) | (audio) | (audio)
(audio) | (audio) | (audio)
(audio) | (audio) | (audio)

Natural Zero-Shot Voice Cloning (Pretrained Model)

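Although our pretrained model was not trained on any voice-cloning objective, zero-shot voice cloning can emerge from the large amount of pretraining data.

Our model picks up natural intonation and emotion at a level that matches or exceeds leading models.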

Voice of Prompt

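Our model has not seen this voice during training. The voice is passed into the prompt, and that is the first time the model is exposed to it.

(audio)

Orpheus | ElevenLabs | PlayHT
---|---|---
(audio) | (audio) | (audio)
(audio) | (audio) | (audio)
(audio) | (audio) | (audio)

Orpheus: the sample is passed into the prompt together with the text for generation

ElevenLabs & PlayHT: the sample is given to instant voice cloning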

Guided Emotion and Intonation

We can teach the base model to speak with a specific emotion using a few dozen high-quality fine-tuning examples. We give the model text-speech pairs that include emotion tags we collected manually.

Audio | Prompt
---|---
(audio) | He qualified for the national tournament.
(audio) | He qualified for the national tournament.
(audio) | He qualified for the national tournament.
(audio) | He qualified for the national tournament.
(audio) | He qualified for the national tournament.
(audio) | The, uhm, men at those, , fundraisers are always SO serious.

In Production Usage

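Thanks to our LLM architecture, the large Llama model family we build on, and large amounts of audio and text data, our models are highly accurate, expressive, and customizable.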

Realtime Usage

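Real-time usage enables conversational use cases. Our model supports real-time output streaming with very low latency of around 200 ms. For even lower latency, streaming the input text into the model's KV cache can bring latency down to around 25-50 ms.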

Model Design

We made two design choices that run against convention for real-time Speech-LLMs.

(image: Architecture)

SNAC samples tokens at different frequencies, which we flatten as shown

We get 7 tokens per frame which we decode as a single flattened sequence rather than using 7 LM heads. This increases the number of steps the model is required to generate. The model is able to generate tokens comfortably faster than realtime playback using a straightforward vLLM implementation on an A100 or H100 GPU, which means longer sequences are still generated in realtime.
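As a rough illustration of flattening multi-rate codec tokens into one sequence, the sketch below assumes three token levels that contribute 1, 2, and 4 tokens per frame (7 in total, matching the count above). The interleaving order, the function names, and the frame rate passed to `required_tokens_per_second` are assumptions for illustration; the original figure showing the actual ordering is not reproduced here.

```python
from typing import List, Sequence, Tuple

TOKENS_PER_FRAME = 7  # from the post: 7 tokens per frame


def flatten_frames(
    coarse: Sequence[int], medium: Sequence[int], fine: Sequence[int]
) -> List[int]:
    """Interleave three token levels (1, 2, 4 tokens per frame) into one list.

    The per-frame order used here (coarse, medium, fine, fine, medium, fine,
    fine) is an assumed depth-first ordering, not confirmed by the post.
    """
    n = len(coarse)
    if len(medium) != 2 * n or len(fine) != 4 * n:
        raise ValueError("expected 1, 2 and 4 tokens per frame")
    flat: List[int] = []
    for i in range(n):
        flat.extend([
            coarse[i],
            medium[2 * i], fine[4 * i], fine[4 * i + 1],
            medium[2 * i + 1], fine[4 * i + 2], fine[4 * i + 3],
        ])
    return flat


def unflatten_frames(flat: Sequence[int]) -> Tuple[List[int], List[int], List[int]]:
    """Invert flatten_frames, recovering the three token levels."""
    if len(flat) % TOKENS_PER_FRAME != 0:
        raise ValueError("length must be a multiple of 7")
    coarse: List[int] = []
    medium: List[int] = []
    fine: List[int] = []
    for i in range(0, len(flat), TOKENS_PER_FRAME):
        c, m0, f0, f1, m1, f2, f3 = flat[i:i + TOKENS_PER_FRAME]
        coarse.append(c)
        medium.extend([m0, m1])
        fine.extend([f0, f1, f2, f3])
    return coarse, medium, fine


def required_tokens_per_second(frames_per_second: float) -> float:
    """Tokens the LM must emit per second to keep up with playback."""
    return TOKENS_PER_FRAME * frames_per_second
```

With a single flattened sequence the model takes seven decoding steps per frame instead of one step with seven heads, so real-time playback requires a generation rate of at least seven times the codec frame rate.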

We use a non-streaming (CNN-based) tokenizer. Other speech LLMs which use SNAC as the decoder suffer from popping between frames fed into the detokenizer. We offer a simple, sliding window modification to the implementation of the detokenizer which enables streaming with no popping.
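The sliding-window idea can be sketched as follows: rather than decoding each frame in isolation, decode a short window of recent frames and emit only the audio of one interior frame, so the emitted slice is never produced at a window edge. This is a minimal sketch under stated assumptions, not the authors' implementation: `stream_decode`, the window size, the kept position, and the samples-per-frame value are all hypothetical, and `decode_window` stands in for whatever non-streaming decoder is used.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Sequence

Frame = Sequence[int]  # the tokens of one codec frame


def stream_decode(
    frames: Iterable[Frame],
    decode_window: Callable[[List[Frame]], Sequence[float]],
    window: int = 4,               # assumed window size, in frames
    keep_index: int = 1,           # assumed position of the frame to emit
    samples_per_frame: int = 2048,  # assumed audio samples per frame
) -> Iterator[Sequence[float]]:
    """Yield one frame of audio per incoming frame once the window is full.

    Each step decodes the last `window` frames together and keeps only the
    slice belonging to the frame at `keep_index`, which has neighbouring
    context inside the decoded window.
    """
    if not 0 <= keep_index < window:
        raise ValueError("keep_index must lie inside the window")
    buffer: Deque[Frame] = deque(maxlen=window)
    start = keep_index * samples_per_frame
    for frame in frames:
        buffer.append(frame)
        if len(buffer) < window:
            continue  # not enough context yet
        audio = decode_window(list(buffer))
        yield audio[start:start + samples_per_frame]
```

Note that this sketch never emits the frames before `keep_index` at the start of the stream or the frames after it at the end; a real implementation would need to pad or flush those edges. The trade-off is that each frame is decoded `window` times, in exchange for output slices that always have surrounding context.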

Stay Updated
