Can we build 3D city models with Vision Transformers?

Emmanuel Larralde Ortiz

Centro de Investigación en Matemáticas

Abstract

In this work, we tested whether the Visual Geometry Grounded Transformer (VGGT), a 3D reconstruction model built mainly from Vision Transformer encoder blocks, can reconstruct urban sites in 3D using a set of color photographs as the sole source of information. We also analyzed the computational requirements of VGGT and propose a workflow for large-scale reconstructions that merges multiple prediction outputs while staying within a memory budget.

Introduction

With a virtual and dynamic representation of a city, it is possible to simulate different risk scenarios for autonomous robots, both ground and aerial vehicles. For example, we are interested in developing autonomous landing algorithms for teleoperated aerial robots, to be used when the connection is unstable, when the robot loses a mechanical part critical to its operation, or when it does not have enough energy to continue flying.

Works such as ViVa-SAFELAND use high-resolution videos of urban sites captured from an aerial view only, and a virtual drone moves across sections of these videos. Although practical, this approach is limited to a single viewpoint. 3D structure estimation techniques such as Structure from Motion (SfM) could in principle generate sparse 3D reconstructions and build a 3D flight world, but at this scale the computational resources required make them impractical. Thanks to the development of VGGT, however, this could become possible on consumer-grade computers.

Visual Geometry Grounded Transformer

VGGT is a transformer architecture proposed jointly by Meta AI and the Oxford Visual Geometry Group. Unlike Mast3r and Dust3r (VGGT's closest alternatives), it alternates per-frame attention blocks, which attend to the tokens of a single frame, with global attention blocks, which attend to the tokens of all frames, and it works with any number of frames (at least one). The architecture combines several blocks inspired by other works. Images are tokenized roughly as in the original Vision Transformer; the tokens are processed by L alternating attention blocks (per-frame and global), and the output feeds two branches: (1) a DPT (an architecture for depth estimation) and (2) the camera properties prediction head. The DPT is connected to separate prediction heads for the following outputs: depth maps, point maps, and tracks. Specifically, the tracks prediction head is an instance of CoTracker. The point map and depth map heads also predict uncertainty maps.

It is worth noting that there is some redundancy in the predictions, but the VGGT authors found that predicting all of these outputs improves model accuracy. Nevertheless:

💡 It is better to compute the point maps from the predicted extrinsic properties and depth maps than to use the directly predicted point maps. Furthermore, thanks to the model's accuracy and precision, the prediction results can be used to initialize optimization algorithms such as bundle adjustment, achieving convergence in a few (~10) iterations.
    
    
  

⚠️ There is a bug in VGGT's tracks prediction head that will be fixed in the second version. The authors' main demo does not include this feature, and when other demos need track predictions, they use a different model for that task.

Figure 1: VGGT architecture.
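
The recommendation above, deriving point maps from the depth maps and camera parameters instead of using the directly predicted point maps, can be sketched as follows. This is a minimal pure-Python illustration, not VGGT's own code: it assumes a pinhole camera, a 3x3 intrinsic matrix, and a 3x4 world-to-camera extrinsic matrix [R | t], and the function names are hypothetical.

```python
def invert_rigid(extrinsic):
    """Split a 3x4 world-to-camera matrix [R | t] into its inverse (R^T, -R^T t)."""
    rot = [row[:3] for row in extrinsic]
    trans = [row[3] for row in extrinsic]
    rot_t = [[rot[j][i] for j in range(3)] for i in range(3)]
    center = [-sum(rot_t[i][j] * trans[j] for j in range(3)) for i in range(3)]
    return rot_t, center


def unproject_depth(depth, intrinsic, extrinsic):
    """Turn an HxW depth map into an HxW grid of world-space (x, y, z) points.

    Assumption: depth[v][u] is the z coordinate of pixel (u, v) in the camera frame.
    """
    fx, fy = intrinsic[0][0], intrinsic[1][1]
    cx, cy = intrinsic[0][2], intrinsic[1][2]
    rot_t, center = invert_rigid(extrinsic)
    points = []
    for v, row in enumerate(depth):
        out_row = []
        for u, d in enumerate(row):
            # Back-project pixel (u, v) at depth d into the camera frame.
            cam = ((u - cx) * d / fx, (v - cy) * d / fy, d)
            # Move the point from the camera frame to the world frame.
            world = tuple(
                sum(rot_t[i][j] * cam[j] for j in range(3)) + center[i]
                for i in range(3)
            )
            out_row.append(world)
        points.append(out_row)
    return points
```

In practice the same computation would be vectorized over whole images; the loop form is used here only to make each step explicit.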

Computational requirements

The VGGT authors demonstrate that their model is fast and can process many images without a prohibitive increase in memory footprint. However, VGGT was implemented with transformer kernels optimized for the hardware found in the latest Nvidia GPU architectures. On less advanced GPUs, the amount of RAM needed, rather than the runtime, can be the limiting factor.

Initially, we tried to run VGGT on a GTX 1660 Super GPU with 6 GB of RAM, but the model required more RAM than that to make a prediction even with a single image. We therefore moved to a workstation with an RTX A6000 card with 48 GB of RAM, where VGGT was hosted as a web API and the amount of GPU RAM used could be logged (see Figure 2).

Figure 2: VGGT memory usage for inference with 1 frame (left) and 75 frames (right).

Hosting the model on an external computer creates a bottleneck because of the amount of information that must be transmitted over the internet. An alternative is to quantize all eligible layers to 8 bits and run the model on the CPU. We found this feasible in terms of runtime and memory usage, but prediction quality does degrade, as shown in Figure 3.

        
💡 In the official VGGT code, all eligible layers are quantized to 16-bit precision, either bfloat16 (if the hardware supports it) or half-precision floating point.
        
        
      

Figure 3: Prediction results using different quantization levels (left: 8-bit; right: 16-bit, as distributed).

💡 We recommend using a graphics card with at least 12 GB of RAM to run VGGT as distributed. This would allow inference with approximately 20 frames.
        
        
    

Linking scenes

To reconstruct a city block in a downtown area, we could simply run VGGT on hundreds of photos. It would probably work (whether prediction accuracy varies with the number of images still needs to be evaluated), and hardware with enough RAM to do so exists. However, if we want an arbitrarily large reconstruction, we have to merge many prediction results. For example, to model a street, instead of using professional graphics cards to run VGGT on hundreds of images at once, we could split the photos into any number of groups of fewer than 20 photos each and store the results for later processing.
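
As a rough sketch of this grouping step, the hypothetical helper below splits a list of photo filenames into groups of fewer than 20. It assumes that filenames sort in capture order and that consecutive groups overlap by one photo so that they can be aligned with each other later; neither assumption is taken from the VGGT code.

```python
def split_into_groups(photos, max_size=19):
    """Split photo filenames into groups of at most `max_size`, overlapping by one photo."""
    if max_size < 2:
        raise ValueError("max_size must be at least 2 so that groups can overlap")
    # Assumption: lexicographic filename order matches capture order.
    photos = sorted(photos)
    groups = []
    start = 0
    while start < len(photos):
        groups.append(photos[start:start + max_size])
        if start + max_size >= len(photos):
            break  # the last group reached the end of the list
        # The last photo of this group becomes the first photo of the next one.
        start += max_size - 1
    return groups
```

With 40 photos and the default size, this yields three groups of 19, 19, and 4 photos, each starting with the last photo of the previous group.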

⚠️ Dust3r, Mast3r, and VGGT produce predictions without units of measure and with ambiguous scale. Two predictions of very similar scenes could produce point clouds at different scales. Specifically, VGGT was trained to produce normalized results.

Therefore, we need an optimization method that estimates not only the rotations and translations required to join the groups but also the scale factors needed to rescale them. We could use the geographic location of all photos, the camera orientation, and a scale reference to estimate all transformations, but this is not necessary.

Consider two groups of predictions that share one photo. We call one the target group and the other the source group. The goal is to transform the source group so that it has the same scale as the target group and is expressed in the target group's global reference frame. Because both groups see exactly the same pixels of the shared photo, its depth maps and their respective confidence maps should be almost identical in both groups, up to scale, regardless of the reference frame.

💡 We estimate the scale factor between two groups of predictions as the value that minimizes the mean Huber distance between point maps of different groups but corresponding to the same images.
        
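A minimal sketch of this scale estimate, under simplifying assumptions: the shared image's predictions are flattened to one scalar per pixel (for example, its depth values), the optional weights stand in for the confidence maps, and the Huber threshold `delta` and the search bounds are illustrative choices rather than values from our code. Because the mean Huber loss is convex in the scale factor, a simple ternary search is enough.

```python
def huber(residual, delta):
    """Huber loss: quadratic near zero, linear for large residuals."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def mean_huber(scale, source, target, weights, delta):
    """Weighted mean Huber distance between scale * source and target."""
    total = sum(
        w * huber(scale * s - t, delta)
        for s, t, w in zip(source, target, weights)
    )
    return total / sum(weights)


def estimate_scale(source, target, weights=None, delta=0.1,
                   lo=1e-3, hi=1e3, iters=200):
    """Find the scale that minimizes the mean Huber distance by ternary search."""
    if weights is None:
        weights = [1.0] * len(source)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if (mean_huber(m1, source, target, weights, delta)
                < mean_huber(m2, source, target, weights, delta)):
            hi = m2  # the minimum lies to the left of m2
        else:
            lo = m1  # the minimum lies to the right of m1
    return 0.5 * (lo + hi)
```

The Huber loss limits the influence of outlier pixels: in a small example where four values agree on a scale of 2 and one value is far off, the estimate stays close to 2, whereas a least-squares fit would be pulled strongly toward the outlier.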
        
    

Once the groups' scales match, the problem reduces to finding the rigid transformation (rotation and translation) that aligns the two groups. Many methods, such as Iterative Closest Point (ICP), can perform this estimation. However, we know a priori that the groups share an image, so the camera pose of that image in the source group and its pose in the target group describe the same camera and should coincide once both are expressed in the global reference frame of the target group.

💡 We align two prediction groups using ICP with an initial rigid transformation estimate equal to the transformation needed so that the extrinsic properties of the shared image in the source group match the extrinsic properties of the same image in the target group.
        
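The initial estimate described above can be written in closed form. The sketch below assumes that the extrinsic properties are 4x4 world-to-camera matrices and that the source group has already been rescaled; the function names are hypothetical. If a source-world point X maps to camera coordinates E_src X, and the same camera has extrinsics E_tgt in the target group, then the transform from source world to target world is E_tgt^-1 E_src.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def invert_rigid4(m):
    """Invert a 4x4 rigid transform [R t; 0 1] as [R^T, -R^T t; 0 1]."""
    rot_t = [[m[j][i] for j in range(3)] for i in range(3)]
    trans = [m[i][3] for i in range(3)]
    out = [
        rot_t[i] + [-sum(rot_t[i][j] * trans[j] for j in range(3))]
        for i in range(3)
    ]
    out.append([0.0, 0.0, 0.0, 1.0])
    return out


def initial_alignment(extrinsic_src, extrinsic_tgt):
    """Source-world to target-world transform that makes the shared camera coincide.

    Assumption: both extrinsics are world-to-camera matrices of the same image.
    """
    return matmul4(invert_rigid4(extrinsic_tgt), extrinsic_src)
```

The resulting matrix would then be passed as the initial guess to an ICP implementation, which refines it using the point clouds of both groups.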
        
    

💡 Before merging the groups, we refine the predictions using bundle adjustment. This yields a better estimate of the depth maps and the intrinsic and extrinsic properties that will be used later to merge all groups.
        
        
    

Group tree

The previous section described how to align two groups that share at least one image. A large reconstruction involves a large number of connected groups. We choose an arbitrary root group, which sets the scale and the global reference frame for the entire reconstruction, and build the shallowest possible tree: each group of predictions is joined to a group already in the tree that is as close as possible to the root group. For this to be possible, every group must share at least one photo with at least one other group. At the time this report was written, the shared image must be the first image used as input to VGGT, and the images are ordered lexicographically by filename; this is a code limitation that should be revisited.

Figure 4: Construction of the tree of prediction groups for large reconstructions.
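
One way to obtain the shallowest tree is a breadth-first search over the graph whose nodes are groups and whose edges connect groups that share a photo. The sketch below is an illustration under that assumption, not the code in our fork: `build_group_tree` is a hypothetical name, groups are given as a mapping from group name to photo filenames, and the restriction that the shared image be the first input to VGGT is ignored.

```python
from collections import deque


def build_group_tree(groups, root):
    """Return {group: parent} for the shallowest tree rooted at `root`.

    Two groups are considered connected when they share at least one photo.
    """
    photo_sets = {name: set(photos) for name, photos in groups.items()}
    parent = {root: None}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        # Breadth-first order attaches every group as close to the root as possible.
        for name in sorted(photo_sets):
            if name not in parent and photo_sets[current] & photo_sets[name]:
                parent[name] = current
                queue.append(name)
    unreachable = sorted(set(photo_sets) - set(parent))
    if unreachable:
        raise ValueError(f"groups not connected to the root: {unreachable}")
    return parent
```

Each group is then aligned to its parent, and the transformations are composed along the path to the root to express every group in the global reference frame.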

More VGGT features

💡 In VGGT, theoretically, the order in which images are passed does not matter, except for the first one, which defines the reference frame for that prediction.
        
        
    

⚠️ Although VGGT should theoretically generate good predictions with images of different dimensions, in practice this is not always the case.

Code

For now, all code lives in a fork of the VGGT repository; in the future, it will be migrated to its own repository that integrates the VGGT code.

Future Work

Distill smaller models.
Outdoor SLAM.
Gaussian Splatting.
3D videos.
