Liquid: Language Models are Scalable and Unified Multi-modal Generators

Junfeng Wu1,2, Yi Jiang2,†, Chuofan Ma2,3, Yuliang Liu1, Hengshuang Zhao3, Zehuan Yuan2, Song Bai2, Xiang Bai1
1Huazhong University of Science and Technology, 2ByteDance Inc, 3The University of Hong Kong
*Corresponding author, †Project lead

Abstract

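We present Liquid, an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens in a shared feature space for both vision and language. Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration with a single large language model (LLM), eliminating the need for external pretrained visual embeddings such as CLIP. For the first time, Liquid uncovers a scaling law: the performance drop unavoidably brought by the unified training of visual and language tasks diminishes as model size increases. Furthermore, the unified token space enables visual generation and comprehension tasks to mutually enhance each other, effectively removing the typical interference seen in earlier models. We show that existing LLMs can serve as strong foundations for Liquid, saving 100× in training costs while outperforming Chameleon in multimodal capabilities and maintaining language performance comparable to mainstream LLMs like LLAMA2. Liquid also outperforms models like SD v2.1 and SD-XL (FID of 5.47 on MJHQ-30K), excelling in both vision-language and text-only tasks. This work demonstrates that LLMs such as Qwen2.5 and GEMMA2 are powerful multimodal generators, offering a scalable solution for enhancing both vision-language understanding and generation.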
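As a rough illustration of what a shared token space can look like, the sketch below appends discrete image codes to a text vocabulary so that a single model sees one flat sequence of token ids. The vocabulary size, codebook size, and the begin/end-of-image marker ids are all hypothetical values chosen for the example; they are not taken from the Liquid implementation.

```python
from typing import List

# All sizes and special ids below are illustrative assumptions,
# not values confirmed by the Liquid authors.
TEXT_VOCAB_SIZE = 32000        # assumed size of the text vocabulary
IMAGE_CODEBOOK_SIZE = 8192     # assumed number of discrete image codes
BOI_ID = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # hypothetical begin-of-image token
EOI_ID = BOI_ID + 1                             # hypothetical end-of-image token


def image_code_to_token(code: int) -> int:
    """Map an image codebook index into the shared id space by offsetting it."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError(f"image code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def is_image_token(token_id: int) -> bool:
    """Return True if the id falls in the range reserved for image codes."""
    return TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE


def build_sequence(text_ids: List[int], image_codes: List[int]) -> List[int]:
    """Concatenate text ids and offset image codes into one training sequence."""
    image_ids = [image_code_to_token(c) for c in image_codes]
    return list(text_ids) + [BOI_ID] + image_ids + [EOI_ID]


if __name__ == "__main__":
    print(build_sequence([5, 17], [0, 8191]))
```

With this layout, a text-to-image sample and an image-to-text sample differ only in the order of the text and image segments, so the same next-token objective could cover both directions.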


How Do LLMs Perform in Visual Generation?

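Compared with other auto-regressive methods, Liquid achieves a better overall score on GenAI-Bench under both basic and advanced prompts, which suggests that the images Liquid generates align more closely with the semantics of the input text prompts. On MJHQ-30K, Liquid not only attains a lower FID than all other auto-regressive methods but also surpasses most well-known diffusion models, showing that LLMs are also capable of generating high-quality images.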


Do Understanding and Generation Tasks Benefit Each Other?

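To answer this question, we ran three groups of experiments. In the first, we combined 10M text-only samples, 10M visual generation samples, and 10M visual understanding samples, for a total of 30M samples in the pre-training stage. This experiment, with a 1 : 1 : 1 data ratio across the three tasks, serves as the baseline. On top of this baseline, we separately added an extra 10M visual generation samples and an extra 10M visual understanding samples, giving 40M samples in each case with data ratios of 1 : 2 : 1 and 1 : 1 : 2, labeled "Add T2I" and "Add I2T" respectively.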
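The arithmetic behind the three data mixes can be checked with a few lines of Python. This is only a sketch of the bookkeeping described above; the function and key names are made up for the example, and sample counts are expressed in millions.

```python
from functools import reduce
from math import gcd
from typing import Dict, Tuple

BASE_MILLIONS = 10  # each task contributes 10M samples in the baseline


def build_mix(extra_t2i: int = 0, extra_i2t: int = 0) -> Dict[str, int]:
    """Return sample counts (in millions) for text, visual gen., and visual und."""
    return {
        "text": BASE_MILLIONS,
        "visual_gen": BASE_MILLIONS + extra_t2i,
        "visual_und": BASE_MILLIONS + extra_i2t,
    }


def ratio(mix: Dict[str, int]) -> Tuple[int, ...]:
    """Reduce the counts to the smallest integer ratio, in insertion order."""
    counts = list(mix.values())
    divisor = reduce(gcd, counts)
    return tuple(c // divisor for c in counts)


if __name__ == "__main__":
    settings = {
        "Baseline": build_mix(),
        "Add T2I": build_mix(extra_t2i=10),
        "Add I2T": build_mix(extra_i2t=10),
    }
    for name, mix in settings.items():
        print(name, sum(mix.values()), ratio(mix))
```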

[Table: results for the baseline, "Add T2I", and "Add I2T" settings]

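"Visual Gen." refers to data used for training text-to-image generation, while "Visual Und." refers to data used for visual understanding. Compared with the baseline, adding more visual understanding data strengthens visual generation, improving the semantic consistency between generated content and the prompt. Conversely, adding more visual generation data likewise strengthens the model's visual understanding. This suggests that once the tokens for visual generation and understanding are unified, the two tasks share a common optimization objective and can enhance each other.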

Does Multi-modal Generation Follow Scaling Laws?

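We explore the visual generation performance of LLMs ranging from 0.5B to 32B parameters after mixed training on language data and text-to-image data. As shown in the figures below, validation loss decreases smoothly as model size and the number of training iterations increase, while token accuracy and VQA Score rise consistently.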
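For readers unfamiliar with the two token-level metrics, the sketch below shows one common way to compute them from per-position logits. It assumes that validation loss is the mean next-token cross-entropy and that token accuracy is top-1 next-token accuracy; the page does not spell out the exact definitions used, so treat both as assumptions. VQA Score is a separate model-based metric and is not covered here.

```python
import math
from typing import List, Sequence, Tuple


def loss_and_token_accuracy(
    logits: Sequence[Sequence[float]], targets: Sequence[int]
) -> Tuple[float, float]:
    """Return (mean cross-entropy, top-1 accuracy) over a batch of positions.

    logits[i] holds unnormalized scores over the vocabulary at position i,
    and targets[i] is the id of the token that actually came next.
    """
    if len(logits) != len(targets) or not targets:
        raise ValueError("logits and targets must be non-empty and equal in length")

    total_loss = 0.0
    correct = 0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total_loss += log_z - row[target]
        # Top-1 prediction is the index with the largest logit.
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == target)

    n = len(targets)
    return total_loss / n, correct / n


if __name__ == "__main__":
    example_logits: List[List[float]] = [[2.0, 0.0], [0.0, 2.0]]
    print(loss_and_token_accuracy(example_logits, [0, 0]))
```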

[Figures: validation loss, token accuracy, and VQA Score versus model size and training iterations]

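Larger models ultimately produce stronger visual generation results. The samples come from Liquid models at 4 different sizes (0.5B, 1B, 2B, 9B) and 3 different training steps (5K, 15K, 40K).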

Does Visual Generation Compromise Language Ability?

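To check whether acquiring image understanding and generation abilities affects the original language ability of the LLM, we report overall zero-shot performance on a suite of popular benchmarks.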

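Compared with text-only training, Liquid largely preserves language ability. When the model is small, mixed multimodal training does hurt language performance; however, this degradation gradually vanishes as model size increases.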

Does Language Ability Limit Visual Generation Performance?

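For models of every size, mixed training leads to a higher validation loss on the visual generation task. However, its impact on VQA Score diminishes as model size increases.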


BibTeX

@article{wu2024liquid,
 title={Liquid: Language models are scalable multi-modal generators},
 author={Wu, Junfeng and Jiang, Yi and Ma, Chuofan and Liu, Yuliang and Zhao, Hengshuang and Yuan, Zehuan and Bai, Song and Bai, Xiang},
 journal={arXiv preprint arXiv:2412.04332},
 year={2024}
}
