Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

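Abstract. Recent advances in text-based large language models (LLMs), particularly the GPT series and the o1 model, have demonstrated the effectiveness of scaling both train-time and inference-time compute. However, current state-of-the-art TTS systems that leverage LLMs are often multi-stage and require separate models (for example, diffusion models after the LLM), which complicates the decision of which model to scale during training or testing. This work makes the following contributions. First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose Llasa, a simple speech synthesis framework that uses a single-layer vector quantizer (VQ) codec and a single Transformer architecture, so that it fully aligns with standard LLMs such as Llama. Our experiments show that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. From the perspective of scaling inference-time compute, we use speech understanding models as verifiers during search, and find that scaling inference-time compute shifts the sampling modes toward the preferences of the specific verifier, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we publicly release the checkpoints and training code for our TTS models (1B, 3B, 8B) and our codec model.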

Comparison of Inference-Time Scaling Results Under Different Evaluation Metrics

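The left figure evaluates speaker similarity with a different speaker embedding model, speechbrain/spkrec-ecapa-voxceleb, as the reference evaluation metric. The right figure is the original Figure 2.

Image 1 | Image 2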
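As a rough illustration of how a speaker-similarity score can be derived from a speaker embedding model, the sketch below compares two embedding vectors by cosine similarity. This assumes the common practice of scoring similarity as the cosine between the prompt embedding and the synthesized-speech embedding; the page itself does not state the exact formula. The function name and the toy 3-dimensional vectors are hypothetical stand-ins for real model outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real speaker embeddings (assumption:
# a real system would obtain these from the embedding model).
prompt_embedding = [1.0, 0.0, 1.0]
synth_embedding = [1.0, 0.0, 1.0]
other_embedding = [0.0, 1.0, 0.0]

same_speaker_score = cosine_similarity(prompt_embedding, synth_embedding)
different_speaker_score = cosine_similarity(prompt_embedding, other_embedding)
```

Because the score depends on the embedding model, swapping in a different model can change the absolute numbers, which is why a second model is a useful cross-check.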

Comparison Results on the RAVDESS Benchmark

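RAVDESS contains only two texts: "Dogs are sitting by the door." is used as the prompt text, and "Kids are talking by the door." as the synthesis text. The results below for NaturalSpeech 3, NaturalSpeech 2, Voicebox (R), VALL-E (R), Mega-TTS 2, StyleTTS 2, and HierSpeech++ are taken from the official NaturalSpeech 3 demo page. (R) indicates systems that were reproduced by NaturalSpeech 3.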

Prompt Emotion | Prompt | Ground Truth | Llasa-1b-250k | Llasa-3b-250k | Llasa-8b-250k | FireRedTTS | F5-TTS | MaskGCT | E2-TTS | CosyVoice2 | CosyVoice | NaturalSpeech 3 | NaturalSpeech 2 | Voicebox (R) | VALL-E (R) | Mega-TTS 2 | StyleTTS 2 | HierSpeech++
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
neutral | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
happy | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
calm | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
sad | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
angry | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
fearful | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
disgust | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
surprised | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.

Scaling Train-Time Compute

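We randomly selected two samples from the English test set. All synthesized audio was generated from the input text alone (without any speech prompt), and each model was randomly sampled three times, specifically to evaluate its text understanding ability. The table below shows the results for models of different sizes and training data amounts.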

Sample | Llasa-1b-80k | Llasa-1b-160k | Llasa-1b-250k | Llasa-3b-250k | Llasa-8b-250k
---|---|---|---|---|---
"Uh, are you sure about this?" Tim asked nervously, looking at the steep slope before them. "Whoa, it’s higher than I thought," he continued, his voice filled with trepidation. "Aha, but look at the view," Emily responded with excitement, "it’s worth the climb!"
Random Sample 1 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Random Sample 2 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Random Sample 3 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Her hands shaking with excitement, Alice Monroe stuttered, "oh..I-I can’t believe it! Is this really my acceptance letter to Harvard?" Marco cannot believe it either: "God damn it! How did you pull this off?"
Random Sample 1 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Random Sample 2 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Random Sample 3 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.

Two samples randomly selected from the Chinese test set.

Sample | Llasa-1b-80k | Llasa-1b-160k | Llasa-1b-250k | Llasa-3b-250k | Llasa-8b-250k
---|---|---|---|---|---
帘外雨潺潺,春意阑珊。罗衾不耐五更寒。梦里不知身是客,一晌贪欢。独自莫凭栏,无限江山。别时容易见时难。流水落花春去也,天上人间。
Random Sample 1 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Random Sample 2 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Random Sample 3 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
人要是行,干一行行一行,一行行行行行,行行行干哪行都行,要是不行,干一行不行一行,一行不行行行不行,行行不行,干哪行都不行。
Random Sample 1 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Random Sample 2 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Random Sample 3 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.

Scaling Inference-Time Compute

Using the Llasa-1b-250k model, we compare direct inference with inference-time scaling. The two examples shown below were randomly selected from seed-tts-eval test-hard.

Target Text | Prompt | Direct Inference | Scaling Inference-Time Compute
---|---|---|---
喇嘛与哑巴 打南边来了个哑巴,腰里别了个喇叭; 打北边来了个喇嘛,手里提了个獭犸. 提着獭犸的喇嘛要拿獭犸换别着喇叭的哑巴的喇叭; 别着喇叭的哑巴不愿拿喇叭换提着獭犸的喇嘛的獭犸. 不知是别着喇叭的哑巴打了提着獭犸的喇嘛一喇叭; 还是提着獭犸的喇嘛打了别着喇叭的哑巴一獭犸. 喇嘛回家炖獭犸; 哑巴嘀嘀哒哒吹喇叭 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
高高山上一座庙,住了八个出家人,八个道人都有名:大弟子,叫凳大,二弟子,叫大凳,三弟子,叫猴三,四弟子,叫三猴,五弟子,叫瓶茶,六弟子,叫茶瓶,七弟子,叫冰别边,八弟子,叫边别冰。凳大会打鼓,大凳会撞钟,猴三会烧火,三猴会点灯;瓶茶会吹管,茶瓶会吹笙;冰别边会煮饭,边别冰会念经。大凳要打凳大鼓,凳大要撞大凳钟;三猴要烧猴三火,猴三要点三猴灯;茶瓶要吹瓶茶管,瓶茶要吹茶瓶笙;边别冰要煮冰别边的饭,冰别边要念边别冰的经。大凳打不好凳大的鼓,凳大撞不好大凳的钟;三猴烧不好猴三的火,猴三点不好三猴的灯;茶瓶吹不好瓶茶的管,瓶茶吹不好茶瓶的笙;边别冰煮不好冰别边的饭,冰别边念不好边别冰的经。凳大还打凳大鼓,大凳还撞大凳钟;猴三还烧猴三火,三猴还点三猴灯;瓶茶还吹瓶茶管,茶瓶还吹茶瓶笙;冰别边还煮冰别边的饭,边别冰还念边别冰的经。各人还干各一行,白白争个脸红脖子青。 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
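To make the contrast between direct inference and inference-time scaling concrete, the following is a minimal sketch of best-of-N selection with a verifier. Best-of-N is used here as a simple stand-in for verifier-guided search and is not necessarily the search procedure used for these samples. `toy_generate` and `toy_verifier` are hypothetical placeholders for a TTS sampler and a speech understanding model.

```python
import random

TARGET = "kids are talking by the door"


def toy_generate(rng):
    """Hypothetical sampler: drops each target character with 10% probability."""
    return "".join(ch for ch in TARGET if rng.random() > 0.1)


def toy_verifier(candidate):
    """Hypothetical verifier: higher is better, penalizes missing characters."""
    return -abs(len(TARGET) - len(candidate))


def best_of_n(generate, verifier, n, seed=0):
    """Draw n candidates and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=verifier)


# n=1 corresponds to direct inference; a larger n spends more compute.
direct = best_of_n(toy_generate, toy_verifier, n=1, seed=0)
scaled = best_of_n(toy_generate, toy_verifier, n=16, seed=0)
```

With a fixed seed, the first candidate is the same in both calls, so the selected output for a larger `n` can never score lower under the verifier than the direct sample. The output is only as good as the verifier's preferences, which matches the observation that sampling shifts toward what the verifier favors.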

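We run continuation experiments with Llasa-1b-250k on the LibriSpeech test-clean dataset. For each sample, the generated audio begins with the first 3 seconds of the ground-truth audio, followed by the model-generated continuation.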

Ground Truth | Direct Inference | Scaling Inference-Time Compute
---|---|---
Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
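The continuation setup can be sketched at the waveform level as follows: take the first 3 seconds of the ground-truth audio as the prompt, then append a generated continuation. This is an illustrative assumption about the data handling only; the model may operate on codec tokens rather than raw samples, and the helper names are hypothetical. LibriSpeech audio is sampled at 16 kHz.

```python
def split_prompt(samples, sample_rate, prompt_seconds=3.0):
    """Split a waveform into a prompt prefix and the remaining audio."""
    cut = int(round(prompt_seconds * sample_rate))
    if cut > len(samples):
        raise ValueError("audio is shorter than the requested prompt")
    return samples[:cut], samples[cut:]


def stitch(prompt, continuation):
    """Concatenate the ground-truth prompt with a generated continuation."""
    return list(prompt) + list(continuation)


# Five seconds of silence at 16 kHz as a stand-in for a real utterance.
sample_rate = 16000
utterance = [0.0] * (sample_rate * 5)
prompt, remainder = split_prompt(utterance, sample_rate)

# A hypothetical model output would replace `remainder` here.
output = stitch(prompt, remainder)
```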

Codec Reconstruction Samples

Sample | GT | Xcodec2 | StableCodec | WavTokenizer_40tps | WavTokenizer_75tps | Xcodec_nq1 | Xcodec_nq2 | BigCodec | DAC_16k_nq1 | DAC_16k_nq2 | DAC_16k_nq12 | Encodec_nq2 | Encodec_nq8 | Mimi_nq4 | Mimi_nq6 | Mimi_nq8 | SemanticCodec | SpeechTokenizer_nq1 | SpeechTokenizer_nq2
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Sample 1 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Sample 2 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Sample 3 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
Sample 4 | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element. | Your browser does not support the audio element.
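When comparing the codecs above, it can help to relate a configuration to its nominal bitrate. The sketch below assumes that the `nq` suffix denotes the number of quantizer codebooks and `tps` denotes tokens per second, which is a reading of the column names rather than something stated on this page. The frame rate and codebook size in the example are hypothetical values, not the settings of any listed codec.

```python
import math


def bitrate_bps(tokens_per_second, num_quantizers, codebook_size):
    """Nominal bitrate in bits per second for a token-based codec.

    Each frame emits one index per quantizer, and each index costs
    log2(codebook_size) bits.
    """
    if tokens_per_second <= 0 or num_quantizers <= 0 or codebook_size < 2:
        raise ValueError("invalid codec configuration")
    return tokens_per_second * num_quantizers * math.log2(codebook_size)


# Hypothetical configuration: 50 frames per second, 1 quantizer,
# codebook of 1024 entries (10 bits per token).
example_bps = bitrate_bps(50, 1, 1024)
```

Under this formula, doubling the number of quantizers at a fixed frame rate doubles the bitrate, so reconstruction quality should be compared with the bitrate in mind.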