pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

Read original: arXiv:2312.12337 - Published 4/8/2024 by David Charatan, Sizhe Li, Andrea Tagliasacchi, Vincent Sitzmann

Overview

This paper presents pixelSplat, a feed-forward model that reconstructs a 3D scene from a pair of images.
The model predicts 3D Gaussian primitives ("splats") from the input images, which makes reconstruction fast at inference time and rendering efficient enough for scalable training.
The authors report that pixelSplat outperforms state-of-the-art light field transformers on wide-baseline novel view synthesis on the real-world RealEstate10k and ACID datasets.

Plain English Explanation

Imagine you have two photos of a scene taken from different viewpoints. Wouldn't it be useful to turn those photos into a 3D model of the scene that you can view from new angles? That is what the researchers in this paper set out to do.

Their technique, called pixelSplat, takes a pair of 2D photos and uses a neural network to estimate where things sit in 3D. It does this by generating "splats": small, soft blobs (3D Gaussians) that together represent the scene. The approach is scalable in the sense that the splats render quickly and with little memory, which keeps training practical. It is generalizable in the sense that one trained network reconstructs new scenes in a single forward pass, rather than needing a separate optimization for each scene.

The researchers tested pixelSplat on two real-world datasets, RealEstate10k and ACID, and found that it produced better novel views than state-of-the-art light field transformers while rendering about 2.5 orders of magnitude faster.

Technical Explanation

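The core of pixelSplat is a feed-forward neural network that takes a pair of 2D images as input and outputs a set of 3D Gaussian primitives that parameterize a radiance field for the scene. Sparse, locally supported primitives like these are prone to local minima, so rather than regressing positions directly, the network predicts a dense probability distribution over 3D and samples the Gaussian means from it. A reparameterization trick makes the sampling step differentiable, so gradients can be back-propagated through the Gaussian splatting representation and the network can be trained end-to-end.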
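
As a rough illustration of the sampling step described in the paper's abstract (predicting a probability distribution and sampling Gaussian means from it), the sketch below places one Gaussian along a single pixel's ray. It is a minimal pure-Python sketch, not the authors' implementation: the function names, the equally wide depth buckets between `near` and `far`, and the choice to reuse the sampled bucket's probability as the Gaussian's opacity (one way to let gradients reach the distribution in an autodiff framework) are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Turn raw per-bucket scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sample_gaussian_mean(origin, direction, logits, near, far, rng):
    """Sample a depth bucket along one pixel ray.

    Returns the 3D mean of the Gaussian and an opacity equal to the
    probability of the sampled bucket (an illustrative assumption).
    """
    probs = softmax(logits)
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Assumed layout: equally wide depth buckets, sampled at their centers.
    width = (far - near) / len(probs)
    depth = near + (index + 0.5) * width
    mean = tuple(o + depth * d for o, d in zip(origin, direction))
    return mean, probs[index]


if __name__ == "__main__":
    rng = random.Random(0)
    mean, opacity = sample_gaussian_mean(
        origin=(0.0, 0.0, 0.0),
        direction=(0.0, 0.0, 1.0),
        logits=[0.0, 0.0, 50.0, 0.0],
        near=1.0,
        far=5.0,
        rng=rng,
    )
    print(mean, opacity)
```

Because the sampled index is discrete, it carries no gradient by itself; tying a rendered quantity such as opacity to the bucket's probability is what would let a training loss adjust the distribution in a real framework.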

The architecture pairs an image encoder, which extracts features from both input views, with a decoder that predicts the Gaussian parameters. The resulting representation renders in real time and with little memory, which is what makes training scalable, and reconstruction at inference time is fast. The model is benchmarked on the real-world RealEstate10k and ACID datasets rather than on synthetic scenes.
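
For concreteness, 3D Gaussian splatting commonly describes each primitive's shape with a per-axis scale and a rotation quaternion, from which the covariance is built as R S Sᵀ Rᵀ. The sketch below shows that construction in plain Python. Whether pixelSplat's decoder emits exactly these quantities is an assumption here, and the function names are hypothetical.

```python
import math


def quaternion_to_rotation(w, x, y, z):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def gaussian_covariance(scale, quaternion):
    """Build covariance = R * diag(scale^2) * R^T for one Gaussian."""
    rotation = quaternion_to_rotation(*quaternion)
    return [
        [
            sum(rotation[i][k] * scale[k] ** 2 * rotation[j][k] for k in range(3))
            for j in range(3)
        ]
        for i in range(3)
    ]


if __name__ == "__main__":
    # An axis-aligned Gaussian: the identity rotation leaves the scales on the diagonal.
    for row in gaussian_covariance((1.0, 2.0, 3.0), (1.0, 0.0, 0.0, 0.0)):
        print(row)
```

Factoring the covariance this way keeps it symmetric and positive semi-definite for any predicted scale and quaternion, which is the usual reason for not predicting the matrix entries directly.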

Experiments on wide-baseline novel view synthesis show that pixelSplat outperforms state-of-the-art light field transformers on RealEstate10k and ACID while accelerating rendering by 2.5 orders of magnitude. The output is also an explicit 3D radiance field, which the authors describe as interpretable and editable.
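
Benchmark comparisons for novel view synthesis, the task named in the paper's abstract, are commonly scored by comparing rendered images against held-out ground-truth views with metrics such as PSNR. The sketch below computes PSNR for two flattened images with pixel values in [0, 1]. The function name and the flat-list image format are assumptions for illustration, not details taken from the paper.

```python
import math


def psnr(rendered, target, max_value=1.0):
    """Peak signal-to-noise ratio in decibels between two flat images."""
    if len(rendered) != len(target) or not rendered:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    print(psnr([0.5, 0.5], [0.6, 0.4]))
```

Higher values indicate a closer match to the ground-truth view. A uniform error of 0.1 on a [0, 1] image corresponds to roughly 20 dB.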

Critical Analysis

The paper provides a compelling technical approach and demonstrates strong empirical results compared to prior work. However, some potential limitations and areas for future research are worth noting:

The authors benchmark on RealEstate10k and ACID, two real-world datasets. How well a trained model transfers to scenes unlike those it was trained on remains a natural open question, and further evaluation on more diverse real-world datasets would help validate the technique's practical applicability.

The paper does not provide much analysis on the types of scenes or objects where pixelSplat may struggle. Exploring the failure modes and robustness of the approach to factors like occlusions, textureless regions, or thin structures could yield valuable insights.

While the splat representation is efficient, it may not capture fine-grained details as well as dense point cloud or mesh-based 3D reconstruction methods. Investigating ways to combine pixelSplat with complementary techniques could lead to even higher fidelity 3D models.

Overall, the pixelSplat method represents an interesting and promising direction for scalable 3D reconstruction from image pairs. Further research to address the above limitations could help solidify its practical impact.

Conclusion

The pixelSplat technique introduced in this paper offers a novel approach to 3D reconstruction that efficiently generates 3D radiance fields from pairs of 2D images. By learning to predict 3D Gaussian primitives directly from images, the method outperforms state-of-the-art light field transformers on RealEstate10k and ACID while remaining scalable to train and generalizable to new scenes.

While the paper highlights some promising capabilities, further research is needed to fully validate the technique's real-world applicability and explore ways to combine it with complementary 3D reconstruction methods. Nevertheless, the pixelSplat work represents an important step forward in making 3D modeling more accessible and practical for a wide range of applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

David Charatan, Sizhe Li, Andrea Tagliasacchi, Vincent Sitzmann

We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D radiance fields parameterized by 3D Gaussian primitives from pairs of images. Our model features real-time and memory-efficient rendering for scalable training as well as fast 3D reconstruction at inference time. To overcome local minima inherent to sparse and locally supported representations, we predict a dense probability distribution over 3D and sample Gaussian means from that probability distribution. We make this sampling operation differentiable via a reparameterization trick, allowing us to back-propagate gradients through the Gaussian splatting representation. We benchmark our method on wide-baseline novel view synthesis on the real-world RealEstate10k and ACID datasets, where we outperform state-of-the-art light field transformers and accelerate rendering by 2.5 orders of magnitude while reconstructing an interpretable and editable 3D radiance field.

4/8/2024

Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

Brandon Smart, Chuanxia Zheng, Iro Laina, Victor Adrian Prisacariu

In this paper, we introduce Splatt3R, a pose-free, feed-forward method for in-the-wild 3D reconstruction and novel view synthesis from stereo pairs. Given uncalibrated natural images, Splatt3R can predict 3D Gaussian Splats without requiring any camera parameters or depth information. For generalizability, we build Splatt3R upon a "foundation" 3D geometry reconstruction method, MASt3R, by extending it to deal with both 3D structure and appearance. Specifically, unlike the original MASt3R which reconstructs only 3D point clouds, we predict the additional Gaussian attributes required to construct a Gaussian primitive for each point. Hence, unlike other novel view synthesis methods, Splatt3R is first trained by optimizing the 3D point cloud's geometry loss, and then a novel view synthesis objective. By doing this, we avoid the local minima present in training 3D Gaussian Splats from stereo views. We also propose a novel loss masking strategy that we empirically find is critical for strong performance on extrapolated viewpoints. We train Splatt3R on the ScanNet++ dataset and demonstrate excellent generalisation to uncalibrated, in-the-wild images. Splatt3R can reconstruct scenes at 4FPS at 512 x 512 resolution, and the resultant splats can be rendered in real-time.

8/29/2024

A Pixel Is Worth More Than One 3D Gaussians in Single-View 3D Reconstruction

Jianghao Shen, Nan Xue, Tianfu Wu

Learning 3D scene representation from a single-view image is a long-standing fundamental problem in computer vision, with the inherent ambiguity in predicting contents unseen from the input view. Built on the recently proposed 3D Gaussian Splatting (3DGS), the Splatter Image method has made promising progress on fast single-image novel view synthesis via learning a single 3D Gaussian for each pixel based on the U-Net feature map of an input image. However, it has limited expressive power to represent occluded components that are not observable in the input view. To address this problem, this paper presents a Hierarchical Splatter Image method in which a pixel is worth more than one 3D Gaussians. Specifically, each pixel is represented by a parent 3D Gaussian and a small number of child 3D Gaussians. Parent 3D Gaussians are learned as done in the vanilla Splatter Image. Child 3D Gaussians are learned via a lightweight Multi-Layer Perceptron (MLP) which takes as input the projected image features of a parent 3D Gaussian and the embedding of a target camera view. Both parent and child 3D Gaussians are learned end-to-end in a stage-wise way. The joint condition of input image features from eyes of the parent Gaussians and the target camera position facilitates learning to allocate child Gaussians to "see the unseen", recovering the occluded details that are often missed by parent Gaussians. In experiments, the proposed method is tested on the ShapeNet-SRN and CO3D datasets with state-of-the-art performance obtained, especially showing promising capabilities of reconstructing occluded contents in the input view.

6/4/2024

Splatter Image: Ultra-Fast Single-View 3D Reconstruction

Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi

We introduce the Splatter Image, an ultra-efficient approach for monocular 3D object reconstruction. Splatter Image is based on Gaussian Splatting, which allows fast and high-quality reconstruction of 3D scenes from multiple images. We apply Gaussian Splatting to monocular reconstruction by learning a neural network that, at test time, performs reconstruction in a feed-forward manner, at 38 FPS. Our main innovation is the surprisingly straightforward design of this network, which, using 2D operators, maps the input image to one 3D Gaussian per pixel. The resulting set of Gaussians thus has the form of an image, the Splatter Image. We further extend the method to take several images as input via cross-view attention. Owing to the speed of the renderer (588 FPS), we use a single GPU for training while generating entire images at each iteration to optimize perceptual metrics like LPIPS. On several synthetic, real, multi-category and large-scale benchmark datasets, we achieve better results in terms of PSNR, LPIPS, and other metrics while training and evaluating much faster than prior works. Code, models, demo and more results are available at https://szymanowiczs.github.io/splatter-image.

4/17/2024