Harnessing the Universal Geometry of Embeddings

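This work presents an unsupervised method for translating text embeddings between vector spaces without paired data or predefined matches. It maps any embedding into a universal latent representation and back out again. Experiments show that the translations achieve high cosine similarity across different models. The work also highlights a consequence for vector database security: an attacker who obtains only the embedding vectors can extract sensitive information about the underlying documents.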
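The cosine-similarity claim can be made concrete with a small sketch. This is not the authors' evaluation code. It assumes that, for evaluation only, each translated embedding can be compared against the embedding the target model would have produced for the same text. The function names `cosine_similarity` and `mean_cosine_similarity` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_cosine_similarity(translated, targets):
    """Average cosine similarity over aligned (translated, target) pairs."""
    if not translated or len(translated) != len(targets):
        raise ValueError("need two non-empty lists of equal length")
    scores = [cosine_similarity(t, g) for t, g in zip(translated, targets)]
    return sum(scores) / len(scores)


# Toy stand-ins (assumption): outputs of a translator and the matching
# embeddings from the target model for the same two texts.
translated = [[1.0, 0.0], [1.0, 1.0]]
targets = [[2.0, 0.0], [1.0, 0.0]]

score = mean_cosine_similarity(translated, targets)
print(f"mean cosine similarity: {score:.4f}")
```

A score near 1 means the translated vectors point in almost the same direction as the target model's own embeddings. Cosine similarity ignores vector length, so it measures agreement in direction only.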

arXiv:2505.12540 (cs) [Submitted on 18 May 2025 (v1), last revised 20 May 2025 (this version, v2)]

Title: Harnessing the Universal Geometry of Embeddings

Authors: Rishi Jha, Collin Zhang, Vitaly Shmatikov, John X. Morris

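Abstract: We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. Our unsupervised approach translates any embedding to and from a universal latent representation (i.e., a universal semantic structure conjectured by the Platonic Representation Hypothesis). Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets. The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for the security of vector databases. An adversary with access only to embedding vectors can extract sensitive information about the underlying documents, sufficient for classification and attribute inference.

Subjects: Machine Learning (cs.LG)

Cite as: arXiv:2505.12540 [cs.LG] (or arXiv:2505.12540v2 [cs.LG] for this version)

DOI: https://doi.org/10.48550/arXiv.2505.12540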

Submission history

From: Rishi Jha

[v1] Sun, 18 May 2025 20:37:07 UTC (3,179 KB)

[v2] Tue, 20 May 2025 15:38:41 UTC (3,180 KB)
