SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
arXiv:2503.11576 (cs) [Submitted on 14 Mar 2025]
Title: SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
Authors: Ahmed Nassar, Andres Marafioti, Matteo Omenetti, Maksym Lysak, Nikolaos Livathinos, Christoph Auer, Lucas Morin, Rafael Teixeira de Lima, Yusik Kim, A. Said Gurbuz, Michele Dolfi, Miquel Farré, Peter W. J. Staar
Abstract: We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers an end-to-end conversion for accurately capturing content, structure and spatial location of document elements in a 256M-parameter vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms -- significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for charts, tables, equations, and code recognition. Experimental results demonstrate that SmolDocling competes with other Vision Language Models that are up to 27 times larger in size, while reducing computational requirements substantially. The model is currently available; the datasets will be made publicly available soon.

Comments: 24 pages, 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2503.11576 [cs.CV] (or arXiv:2503.11576v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2503.11576
Submission history
From: Ahmed Nassar [v1] Fri, 14 Mar 2025 16:44:14 UTC (14,179 KB)
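Note: the abstract states that the model is currently available. As a minimal, unofficial sketch of how a reader might run it, the Python below assumes the checkpoint is published on Hugging Face under an ID like "ds4sd/SmolDocling-256M-preview" and that it follows the standard transformers vision-to-sequence interface; the model ID, the prompt wording, and the DocTags fragment in the final comment are illustrative assumptions, not details taken from this page.

    # Minimal, unofficial inference sketch for SmolDocling.
    # Assumptions: Hugging Face model ID, prompt wording, and that the checkpoint
    # uses the standard transformers vision-language interface.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForVision2Seq

    MODEL_ID = "ds4sd/SmolDocling-256M-preview"  # assumed ID; check the released checkpoint

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    # One page image in, one DocTags sequence out: the end-to-end conversion the
    # abstract describes (content + structure + location in a single pass).
    page = Image.open("page.png").convert("RGB")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."},  # assumed prompt
        ],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[page], return_tensors="pt")

    generated = model.generate(**inputs, max_new_tokens=4096)
    new_tokens = generated[:, inputs["input_ids"].shape[1]:]  # drop the echoed prompt
    doctags = processor.batch_decode(new_tokens, skip_special_tokens=False)[0]

    # The output is a DocTags sequence: XML-like element tags with quantized
    # location tokens, e.g. (illustrative only, not the exact tag vocabulary):
    #   <text><loc_58><loc_44><loc_426><loc_91>We introduce SmolDocling ...</text>
    print(doctags)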