VGGT: Visual Geometry Grounded Transformer

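VGGT is a CVPR 2025 paper that introduces the Visual Geometry Grounded Transformer, a model that infers a scene's 3D attributes, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or many images within seconds. The README provides a quick-start guide with code examples and dependency installation instructions, and describes several visualization tools, such as a Gradio web interface and a Viser 3D viewer, for inspecting reconstruction and tracking results. It also covers single-view reconstruction performance and reports runtime and GPU memory benchmarks.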

[CVPR 2025] VGGT: Visual Geometry Grounded Transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny
Visual Geometry Group, University of Oxford; Meta AI

@inproceedings{wang2025vggt,
 title={VGGT: Visual Geometry Grounded Transformer},
 author={Wang, Jianyuan and Chen, Minghao and Karaev, Nikita and Vedaldi, Andrea and Rupprecht, Christian and Novotny, David},
 booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
 year={2025}
}

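Overview

Visual Geometry Grounded Transformer (VGGT, CVPR 2025) is a feed-forward neural network that directly infers all key 3D attributes of a scene, including extrinsic and intrinsic camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views, within seconds.

Quick Start

First, clone this repository to your local machine, and install the dependencies (torch, torchvision, numpy, Pillow, and huggingface_hub).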

git clone git@github.com:facebookresearch/vggt.git 
cd vggt
pip install -r requirements.txt

Alternatively, you can install VGGT as a package. Now, try the model with just a few lines of code:

import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images
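device = "cuda" if torch.cuda.is_available() else "cpu"
# bfloat16 is supported on Ampere GPUs (Compute Capability 8.0+); older GPUs fall back to float16.
# The capability query is only made when CUDA is available, so this line does not fail on CPU-only machines.
dtype = torch.bfloat16 if device == "cuda" and torch.cuda.get_device_capability()[0] >= 8 else torch.float16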
# Initialize the model and load the pretrained weights.
# This will automatically download the model weights the first time it's run, which may take a while.
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)
# Load and preprocess example images (replace with your own image paths)
image_names = ["path/to/imageA.png", "path/to/imageB.png", "path/to/imageC.png"] 
images = load_and_preprocess_images(image_names).to(device)
with torch.no_grad():
  with torch.cuda.amp.autocast(dtype=dtype):
    # Predict attributes including cameras, depth maps, and point maps.
    predictions = model(images)

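The model weights will be automatically downloaded from Hugging Face. If you encounter issues such as slow loading, you can download the weights manually and load them, or use: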

model = VGGT()
_URL = "https://huggingface.co/facebook/VGGT-1B/resolve/main/model.pt"
model.load_state_dict(torch.hub.load_state_dict_from_url(_URL))

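Detailed Usage

You can also optionally choose which attributes (branches) to predict, as shown below. This achieves the same result as the example above. This example uses a batch size of 1 (processing a single scene), but it naturally works for multiple scenes.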

from vggt.utils.pose_enc import pose_encoding_to_extri_intri
from vggt.utils.geometry import unproject_depth_map_to_point_map
with torch.no_grad():
  with torch.cuda.amp.autocast(dtype=dtype):
    images = images[None] # add batch dimension
    aggregated_tokens_list, ps_idx = model.aggregator(images)
        
  # Predict Cameras
  pose_enc = model.camera_head(aggregated_tokens_list)[-1]
  # Extrinsic and intrinsic matrices, following OpenCV convention (camera from world)
  extrinsic, intrinsic = pose_encoding_to_extri_intri(pose_enc, images.shape[-2:])
  # Predict Depth Maps
  depth_map, depth_conf = model.depth_head(aggregated_tokens_list, images, ps_idx)
  # Predict Point Maps
  point_map, point_conf = model.point_head(aggregated_tokens_list, images, ps_idx)
    
  # Construct 3D Points from Depth Maps and Cameras
  # which usually leads to more accurate 3D points than point map branch
  point_map_by_unprojection = unproject_depth_map_to_point_map(depth_map.squeeze(0), 
                                extrinsic.squeeze(0), 
                                intrinsic.squeeze(0))
  # Predict Tracks
  # choose your own points to track, with shape (N, 2) for one scene
  query_points = torch.FloatTensor([[100.0, 200.0], 
                    [60.72, 259.94]]).to(device)
  track_list, vis_score, conf_score = model.track_head(aggregated_tokens_list, images, ps_idx, query_points=query_points[None])
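
The comment in the code above notes that 3D points built from depth maps and cameras are usually more accurate than the point map branch. As a minimal NumPy sketch of the geometry behind that step (not the implementation of `unproject_depth_map_to_point_map`), the hypothetical function `unproject_depth` below lifts a single depth map to world coordinates. It assumes a pinhole intrinsic matrix, a 3x4 camera-from-world extrinsic in the OpenCV convention, and depth measured along the camera z-axis.

```python
import numpy as np

def unproject_depth(depth, extrinsic, intrinsic):
    """Lift one depth map to world-space 3D points (illustrative sketch).

    depth:     (H, W) depth along the camera z-axis
    extrinsic: (3, 4) [R | t], camera-from-world (OpenCV convention)
    intrinsic: (3, 3) pinhole matrix K
    returns:   (H, W, 3) points in world coordinates
    """
    H, W = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project each pixel: K^-1 [u, v, 1]^T, then scale by depth.
    rays = pixels @ np.linalg.inv(intrinsic).T
    cam_points = rays * depth[..., None]
    R, t = extrinsic[:, :3], extrinsic[:, 3]
    # x_cam = R x_world + t  =>  x_world = R^T (x_cam - t)
    return (cam_points - t) @ R
```

With an identity rotation, zero translation, and a depth of 2 everywhere, the pixel at the principal point maps to (0, 0, 2), which is a quick sanity check for the conventions assumed here.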

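Furthermore, if certain pixels in the input frames are unwanted (e.g., reflective surfaces, sky, or water), you can simply mask them by setting the corresponding pixel values to 0 or 1. Precise segmentation masks are not necessary; simple bounding-box masks work effectively (an example is given in a GitHub issue on the repository).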
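
A bounding-box mask of this kind can be sketched in a few lines. The helper `mask_box` below is hypothetical and not part of VGGT; it assumes the frames are stored as an array of shape (S, 3, H, W) with values in [0, 1], and it uses NumPy for illustration (with a PyTorch tensor, `.clone()` would take the place of `.copy()`).

```python
import numpy as np

def mask_box(images, frame, box, fill=0.0):
    """Return a copy of `images` with one rectangle in one frame overwritten.

    images: (S, 3, H, W) array with values in [0, 1] (assumed layout)
    frame:  index of the frame to edit
    box:    (x0, y0, x1, y1) in pixels; x1 and y1 are exclusive
    fill:   0.0 (black) or 1.0 (white)
    """
    x0, y0, x1, y1 = box
    masked = images.copy()
    # Rows are indexed by y and columns by x; all channels get the same fill.
    masked[frame, :, y0:y1, x0:x1] = fill
    return masked
```

For example, `mask_box(images, 0, (0, 0, images.shape[-1], 100))` would blank the top 100 rows of the first frame, which is one rough way to suppress a sky region.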

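Visualization

We provide multiple ways to visualize the 3D reconstruction and tracking results. Before using these visualization tools, install the required dependencies: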

pip install -r requirements_demo.txt

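Interactive 3D Visualization

Please note: VGGT typically reconstructs a scene in less than 1 second. However, visualizing the 3D points may take tens of seconds because of third-party rendering, which is independent of VGGT's processing time. Visualization is especially slow when the number of images is large.

Gradio Web Interface

Our Gradio-based interface allows you to upload images or videos, run reconstruction, and interactively explore the 3D scene in your browser. You can launch it on your local machine or try it on Hugging Face.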

python demo_gradio.py

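Viser 3D Viewer

Run the following command to run reconstruction and visualize the point clouds in viser. Note that this script requires the path to a folder containing images, and it assumes the folder contains only image files. You can set --use_point_map to use the point cloud from the point map branch instead of the depth-based point cloud.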

python demo_viser.py --image_folder path/to/your/images/folder
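
Because the script assumes the folder holds nothing but images, it can help to check the folder first. The helper below is a hypothetical sketch, not part of the repository; the set of accepted suffixes is an assumption and should be adjusted to the formats you actually use.

```python
from pathlib import Path

# Assumed set of image suffixes; extend as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}

def list_images(folder):
    """Return sorted image paths, raising if the folder holds anything else."""
    entries = sorted(Path(folder).iterdir())
    extras = [
        p.name for p in entries
        if not p.is_file() or p.suffix.lower() not in IMAGE_SUFFIXES
    ]
    if extras:
        raise ValueError(f"non-image entries in {folder}: {extras}")
    return [str(p) for p in entries]
```

The returned list has the same form as `image_names` in the quick-start example, so it can also be passed to `load_and_preprocess_images`.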

Track Visualization

To visualize point tracks across multiple images:

from vggt.utils.visual_track import visualize_tracks_on_images
track = track_list[-1]
visualize_tracks_on_images(images, track, (conf_score>0.2) & (vis_score>0.2), out_dir="track_visuals")

This plots the tracks on the images and saves them to the specified output directory.
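
The third argument in the call above is a boolean mask that keeps only points whose confidence and visibility scores both exceed 0.2. As a small NumPy sketch of that filtering logic, the hypothetical helpers below build the mask and report how many query points survive in each frame. The (S, N) score layout, with one row per frame and one column per query point, is an assumption made for illustration.

```python
import numpy as np

def track_mask(vis_score, conf_score, threshold=0.2):
    """Boolean mask of track points that pass both score thresholds."""
    return (conf_score > threshold) & (vis_score > threshold)

def kept_fraction_per_frame(mask):
    """Fraction of query points kept in each frame; mask has shape (S, N)."""
    return mask.mean(axis=-1)
```

A frame with a very low kept fraction is a hint that most query points are occluded or unreliable there, and raising or lowering the threshold trades coverage against reliability.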

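Single-view Reconstruction

Our model shows surprisingly good performance on single-view reconstruction, although it was never trained for this task. The model does not need to duplicate the single-view image into a pair; instead, it can infer the 3D structure directly from the tokens of the single-view image. Feel free to try it with our demos, which naturally work for single-view reconstruction.

We have not quantitatively tested monocular depth estimation performance ourselves, but @kabouzeid generously provided a comparison of VGGT with recent methods. Compared with recent monocular methods such as DepthAnything v2 or MoGe, VGGT shows competitive or better results, despite never being explicitly trained for single-view tasks.

Runtime and GPU Memory

We benchmark the runtime and GPU memory usage of VGGT's aggregator on a single NVIDIA H100 GPU across various input sizes.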

Input Frames | 1 | 2 | 4 | 8 | 10 | 20 | 50 | 100 | 200
---|---|---|---|---|---|---|---|---|---
Time (s) | 0.04 | 0.05 | 0.07 | 0.11 | 0.14 | 0.31 | 1.04 | 3.12 | 8.75
Memory (GB) | 1.88 | 2.07 | 2.45 | 3.23 | 3.63 | 5.58 | 11.41 | 21.15 | 40.63

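Note that these results were obtained using Flash Attention 3, which is faster than the default Flash Attention 2 implementation while maintaining almost the same memory usage. Feel free to compile Flash Attention 3 from source to get better performance.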
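
To read the scaling behaviour out of the benchmark figures, the short script below recomputes two derived quantities from the reported numbers: time per frame, and the additional memory per extra frame between neighbouring measurements. The function names are illustrative only, and the values are copied from the benchmark above.

```python
frames = [1, 2, 4, 8, 10, 20, 50, 100, 200]
time_s = [0.04, 0.05, 0.07, 0.11, 0.14, 0.31, 1.04, 3.12, 8.75]
memory_gb = [1.88, 2.07, 2.45, 3.23, 3.63, 5.58, 11.41, 21.15, 40.63]

def per_frame(values, counts):
    """Divide each measurement by its frame count."""
    return [v / n for v, n in zip(values, counts)]

def marginal_memory(counts, memory):
    """Extra GB per additional frame between consecutive measurements."""
    pairs = list(zip(counts, memory))
    return [(m1 - m0) / (n1 - n0) for (n0, m0), (n1, m1) in zip(pairs, pairs[1:])]

if __name__ == "__main__":
    for n, t in zip(frames, per_frame(time_s, frames)):
        print(f"{n:>3} frames: {t * 1000:.1f} ms per frame")
    print([round(x, 3) for x in marginal_memory(frames, memory_gb)])
```

On these numbers, each extra frame costs roughly 0.19 to 0.20 GB, so memory grows close to linearly, while time per frame rises from about 14 ms at 8 to 10 frames to about 44 ms at 200 frames, which indicates that runtime grows faster than linearly with the number of frames.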

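Research Progression

Our work builds upon a series of previous research projects. If you are interested in how our research evolved, check out our previous works: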

Deep SfM Revisited ──┐
PoseDiffusion ───────► VGGSfM ──► VGGT
CoTracker ───────────┘

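Acknowledgements

Thanks to these great repositories: PoseDiffusion, VGGSfM, CoTracker, DINOv2, Dust3r, Moge, PyTorch3D, Sky Segmentation, Depth Anything V2, Metric3D, and many other inspiring works in the community.

Checklist

Release the training code
Release VGGT-500M and VGGT-200M

License

See the LICENSE file for details about the license under which this code is made available.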