Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

arXiv:2408.13912 - Published 8/29/2024 by Brandon Smart, Chuanxia Zheng, Iro Laina, Victor Adrian Prisacariu

Overview

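Splatt3R is a zero-shot, pose-free, feed-forward method for generating 3D Gaussian splats from uncalibrated image pairs.
It can reconstruct 3D scenes without requiring camera parameters or depth information at inference time.
Its key contributions are extending the MASt3R 3D reconstruction model to predict Gaussian attributes for each 3D point, a training scheme that optimizes geometry before novel view synthesis, and a loss masking strategy for extrapolated viewpoints.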

Plain English Explanation

The paper presents Splatt3R, a technique that can create 3D models from pairs of regular photos without any special setup.

Typically, building a 3D representation of a scene requires either many photos taken from known camera positions or specialized hardware such as depth sensors. Splatt3R avoids these requirements by predicting a 3D Gaussian splatting representation directly from just two ordinary 2D photos.

The key idea is to represent the scene as a collection of 3D Gaussian "splats", one attached to each predicted 3D point, which can be projected and rendered into 2D images. Splatt3R learns to predict these splats from training data (the ScanNet++ dataset), so at test time it can reconstruct the 3D structure of a new scene without any information about the camera setup or scene depth.
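
Projecting a 3D Gaussian into an image is usually done with a local linear approximation of the camera projection, the approach used by EWA splatting and the original 3D Gaussian Splatting renderer. The sketch below illustrates that general technique, assuming a pinhole camera with intrinsics `fx, fy, cx, cy` and a Gaussian already expressed in camera coordinates; `project_gaussian` and `splat_weight` are hypothetical names, not code from the Splatt3R paper.

```python
import numpy as np

def project_gaussian(mean_cam, cov_cam, fx, fy, cx, cy):
    """Project a 3D Gaussian (camera coordinates) to a 2D image-plane Gaussian.

    Uses the local affine (Jacobian) approximation of the pinhole projection,
    as in EWA splatting and 3D Gaussian Splatting.
    """
    x, y, z = mean_cam
    if z <= 0:
        raise ValueError("Gaussian centre must lie in front of the camera (z > 0)")
    # Pixel coordinates of the projected centre.
    mean_2d = np.array([fx * x / z + cx, fy * y / z + cy])
    # Jacobian of (u, v) with respect to (x, y, z), evaluated at the centre.
    J = np.array([
        [fx / z, 0.0, -fx * x / z**2],
        [0.0, fy / z, -fy * y / z**2],
    ])
    # The 3x3 covariance becomes a 2x2 covariance on the image plane.
    cov_2d = J @ cov_cam @ J.T
    return mean_2d, cov_2d

def splat_weight(pixel, mean_2d, cov_2d):
    """Unnormalised Gaussian falloff of a projected splat at a given pixel."""
    d = np.asarray(pixel, dtype=float) - mean_2d
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov_2d, d)))
```

For a Gaussian 2 units in front of a camera with `fx = fy = 500`, a 3D standard deviation of 0.1 maps to 500 x 0.1 / 2 = 25 pixels on screen, so the splat's weight drops to exp(-0.5) one such step away from its centre.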

This zero-shot capability makes Splatt3R a versatile tool that could enable 3D modeling from widely available 2D photos, with applications in areas like 3D content creation, virtual/augmented reality, and robot vision.

Technical Explanation

The key technical contributions of Splatt3R include:

Gaussian Splatting Formulation: Rather than representing the scene with voxels or meshes, the method attaches a 3D Gaussian primitive to each predicted 3D point. Besides its position, each Gaussian carries the attributes needed to render it, such as orientation, scale, opacity and color. These splats can be projected into new viewpoints and rendered in real time, which enables novel view synthesis.
Training Strategy: Splatt3R is first trained with a 3D point cloud geometry loss and only then with a novel view synthesis objective, which avoids the local minima that arise when training 3D Gaussian splats from stereo views. A loss masking strategy is empirically critical for strong performance on extrapolated viewpoints. Training uses the ScanNet++ dataset, but no camera parameters or depth are needed at test time.
Network Architecture: The method builds on MASt3R, a "foundation" 3D geometry reconstruction model that predicts a 3D point cloud from an image pair. Splatt3R extends it to handle appearance as well as structure by predicting the additional Gaussian attributes required to turn each point into a Gaussian primitive.
Evaluation: Trained on ScanNet++, Splatt3R generalizes to uncalibrated, in-the-wild images. It reconstructs scenes at 4 FPS at 512 x 512 resolution, and the resulting splats can be rendered in real time.
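
The per-point Gaussian idea can be made concrete. MASt3R predicts a "point map" with one 3D point per pixel, and Splatt3R adds per-point Gaussian attributes. The sketch below turns such a point map plus raw per-pixel predictions into Gaussian parameters. It assumes the activation conventions of the original 3D Gaussian Splatting (normalized quaternion for rotation, exponential for scale, sigmoid for opacity) and builds each covariance as Σ = R S Sᵀ Rᵀ; the attribute layout and the names `quat_to_rotmat` and `pixels_to_gaussians` are hypothetical, and the real Splatt3R head may predict a different set of attributes.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert unit quaternions (..., 4) in (w, x, y, z) order to rotation matrices (..., 3, 3)."""
    w, x, y, z = np.moveaxis(q, -1, 0)
    R = np.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y),
        2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y),
    ], axis=-1)
    return R.reshape(q.shape[:-1] + (3, 3))

def pixels_to_gaussians(points, raw_quat, raw_log_scale, raw_opacity, colors):
    """Turn a per-pixel point map plus raw per-pixel attributes into Gaussian parameters.

    points:        (H, W, 3) 3D point per pixel (e.g. from a MASt3R-style point head)
    raw_quat:      (H, W, 4) unnormalised rotation quaternion
    raw_log_scale: (H, W, 3) log of the per-axis standard deviation
    raw_opacity:   (H, W)    opacity logit
    colors:        (H, W, 3) RGB colour
    Returns flat arrays with one Gaussian per pixel.
    """
    q = raw_quat / np.linalg.norm(raw_quat, axis=-1, keepdims=True)
    R = quat_to_rotmat(q)
    S = np.exp(raw_log_scale)                       # strictly positive scales
    RS = R * S[..., None, :]                        # equals R @ diag(S): scales each column
    cov = RS @ np.swapaxes(RS, -1, -2)              # Sigma = R S S^T R^T
    opacity = 1.0 / (1.0 + np.exp(-raw_opacity))    # sigmoid into (0, 1)
    n = points.shape[0] * points.shape[1]
    return {
        "means": points.reshape(n, 3),
        "covs": cov.reshape(n, 3, 3),
        "opacities": opacity.reshape(n),
        "colors": colors.reshape(n, 3),
    }
```

Building Σ from a rotation and a scale keeps every covariance symmetric and positive semi-definite regardless of the raw network outputs. Assuming one Gaussian per pixel of each input image, a 512 x 512 stereo pair yields 2 x 512 x 512 = 524,288 Gaussians.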

Critical Analysis

The paper presents a compelling approach for 3D reconstruction from 2D images, with several notable strengths:

The zero-shot capability to reconstruct 3D scenes without any prior camera or depth information is a significant advance over feed-forward splatting methods that require known camera poses.
Representing the reconstruction as 3D Gaussians captures appearance as well as geometry, and the resulting splats can be rendered in real time.
Building on MASt3R and optimizing geometry before novel view synthesis is a simple way to avoid the local minima that arise when training 3D Gaussian splats from stereo views.
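
Training on novel views raises a practical question: parts of a target view may not be visible in either input image, so there is nothing to reconstruct them from. The Splatt3R abstract (listed under Related Papers) credits a loss masking strategy for strong performance on extrapolated viewpoints but does not define the mask. The sketch below is a minimal illustration under stated assumptions: ground-truth 3D points are available for each target pixel, the mask is a simple frustum test against an input camera that ignores occlusion, and the loss is a plain masked mean squared error. `frustum_mask` and `masked_mse` are hypothetical names, not Splatt3R code.

```python
import numpy as np

def frustum_mask(points_world, K, world_to_cam, img_h, img_w):
    """True where a target pixel's 3D point projects inside an input camera's image.

    Simplified, hypothetical visibility test: it ignores occlusion.
    points_world: (H, W, 3) ground-truth 3D point seen by each target pixel
    K:            (3, 3) pinhole intrinsics of the input camera
    world_to_cam: (4, 4) extrinsics of the input camera
    """
    H, W, _ = points_world.shape
    pts = points_world.reshape(-1, 3)
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    cam = (world_to_cam @ pts_h.T).T[:, :3]         # points in input-camera coordinates
    in_front = cam[:, 2] > 0
    z = np.where(in_front, cam[:, 2], 1.0)          # avoid dividing by zero or negative depth
    uv = (K @ cam.T).T
    u, v = uv[:, 0] / z, uv[:, 1] / z
    inside = in_front & (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    return inside.reshape(H, W)

def masked_mse(rendered, target, mask, eps=1e-8):
    """Mean squared error over only the pixels where mask is True.

    rendered, target: (H, W, 3) images
    mask:             (H, W) boolean, True where the target pixel is supervised
    """
    m = mask.astype(rendered.dtype)[..., None]      # (H, W, 1), broadcast over RGB
    sq_err = (rendered - target) ** 2 * m
    n_valid = m.sum() * rendered.shape[-1]          # number of supervised values
    return float(sq_err.sum() / max(n_valid, eps))
```

With two input views, one option is to compute a mask per input camera and combine them with a logical OR, so a target pixel is supervised if either input image could have observed it.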

However, some limitations and areas for future work remain:

The method may struggle with thin or transparent objects, as the Gaussian splat representation may not capture these well.
Reconstruction runs at 4 FPS at 512 x 512 resolution; because a Gaussian is predicted for every point, scaling the approach to higher-resolution inputs may be costly.
Further research is needed to extend the method to handle more than two input images, which could improve robustness and reconstruction quality.

Overall, Splatt3R represents an exciting advance in 3D reconstruction that could have significant impact across applications. Its technical contributions and results warrant further exploration and development.

Conclusion

Splatt3R presents a novel zero-shot method for generating high-quality 3D Gaussian splats from uncalibrated image pairs. By modeling the 3D scene as a collection of Gaussian primitives, one per predicted 3D point, and training the model first on a geometry loss and then on a novel view synthesis objective, the method can reconstruct 3D content without any prior knowledge about the camera setup or scene geometry.

This capability could enable a wide range of applications, from 3D content creation to robot vision, where 3D modeling from readily available 2D photos is highly valuable. While the method has some limitations, its technical contributions and empirical results show the potential of this approach to advance 3D reconstruction from 2D images.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper (arXiv:2408.13912) for details.

Related Papers

Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

Brandon Smart, Chuanxia Zheng, Iro Laina, Victor Adrian Prisacariu

In this paper, we introduce Splatt3R, a pose-free, feed-forward method for in-the-wild 3D reconstruction and novel view synthesis from stereo pairs. Given uncalibrated natural images, Splatt3R can predict 3D Gaussian Splats without requiring any camera parameters or depth information. For generalizability, we build Splatt3R upon a "foundation" 3D geometry reconstruction method, MASt3R, by extending it to deal with both 3D structure and appearance. Specifically, unlike the original MASt3R which reconstructs only 3D point clouds, we predict the additional Gaussian attributes required to construct a Gaussian primitive for each point. Hence, unlike other novel view synthesis methods, Splatt3R is first trained by optimizing the 3D point cloud's geometry loss, and then a novel view synthesis objective. By doing this, we avoid the local minima present in training 3D Gaussian Splats from stereo views. We also propose a novel loss masking strategy that we empirically find is critical for strong performance on extrapolated viewpoints. We train Splatt3R on the ScanNet++ dataset and demonstrate excellent generalisation to uncalibrated, in-the-wild images. Splatt3R can reconstruct scenes at 4 FPS at 512 x 512 resolution, and the resultant splats can be rendered in real-time.

8/29/2024

pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

David Charatan, Sizhe Li, Andrea Tagliasacchi, Vincent Sitzmann

We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D radiance fields parameterized by 3D Gaussian primitives from pairs of images. Our model features real-time and memory-efficient rendering for scalable training as well as fast 3D reconstruction at inference time. To overcome local minima inherent to sparse and locally supported representations, we predict a dense probability distribution over 3D and sample Gaussian means from that probability distribution. We make this sampling operation differentiable via a reparameterization trick, allowing us to back-propagate gradients through the Gaussian splatting representation. We benchmark our method on wide-baseline novel view synthesis on the real-world RealEstate10k and ACID datasets, where we outperform state-of-the-art light field transformers and accelerate rendering by 2.5 orders of magnitude while reconstructing an interpretable and editable 3D radiance field.

4/8/2024

TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers

Chuanrui Zhang, Yingshuang Zou, Zhuoling Li, Minmin Yi, Haoqian Wang

Compared with previous 3D reconstruction methods like NeRF, recent Generalizable 3D Gaussian Splatting (G-3DGS) methods demonstrate impressive efficiency even in the sparse-view setting. However, the promising reconstruction performance of existing G-3DGS methods relies heavily on accurate multi-view feature matching, which is quite challenging. Especially for the scenes that have many non-overlapping areas between various views and contain numerous similar regions, the matching performance of existing methods is poor and the reconstruction precision is limited. To address this problem, we develop a strategy that utilizes a predicted depth confidence map to guide accurate local feature matching. In addition, we propose to utilize the knowledge of existing monocular depth estimation models as prior to boost the depth estimation precision in non-overlapping areas between views. Combining the proposed strategies, we present a novel G-3DGS method named TranSplat, which obtains the best performance on both the RealEstate10K and ACID benchmarks while maintaining competitive speed and presenting strong cross-dataset generalization ability. Our code and demos will be available at: https://xingyoujun.github.io/transplat.

8/27/2024

Splatter Image: Ultra-Fast Single-View 3D Reconstruction

Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi

We introduce Splatter Image, an ultra-efficient approach for monocular 3D object reconstruction. Splatter Image is based on Gaussian Splatting, which allows fast and high-quality reconstruction of 3D scenes from multiple images. We apply Gaussian Splatting to monocular reconstruction by learning a neural network that, at test time, performs reconstruction in a feed-forward manner, at 38 FPS. Our main innovation is the surprisingly straightforward design of this network, which, using 2D operators, maps the input image to one 3D Gaussian per pixel. The resulting set of Gaussians thus has the form of an image, the Splatter Image. We further extend the method to take several images as input via cross-view attention. Owing to the speed of the renderer (588 FPS), we use a single GPU for training while generating entire images at each iteration to optimize perceptual metrics like LPIPS. On several synthetic, real, multi-category and large-scale benchmark datasets, we achieve better results in terms of PSNR, LPIPS, and other metrics while training and evaluating much faster than prior works. Code, models, demo and more results are available at https://szymanowiczs.github.io/splatter-image.

4/17/2024