Crafting 3D Worlds from 2D Vision: The Pioneering Role of Diffusion Models

The realm of generative artificial intelligence has seen astounding progress in recent years, particularly in text and 2D image generation. Large language models for text, and diffusion models for images and video, have revolutionized content creation. The generation of complex 3D objects, however, has historically presented a unique set of challenges and has developed at a comparatively slower pace. While we now have dedicated 3D generation models, the journey began with a clever approach: leveraging the power of existing 2D diffusion models.

The Foundational Role of 2D Diffusion

To understand 3D object generation, it is essential to first grasp the mechanics of 2D diffusion models. These models operate by taking an image that is pure noise, a canvas of random pixel values, and iteratively removing noise based on a given text prompt. Imagine providing a prompt such as "frogs on stilts," a creative idea that Mike introduced in his insightful video on 2D diffusion. The model then progressively denoises the random image, step by step, until a recognizable image matching the prompt emerges. What makes these 2D diffusion models so remarkable is their ability to synthesize abstract or even fantastical concepts, like frogs on stilts, which may not exist in real-world photographs. This capability stems from their training on billions of image-caption pairs, allowing them to learn individual concepts (like frogs, or stilts) and combine them in novel ways.
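To make the denoising loop concrete, here is a toy sketch in Python. Everything in it is a stand-in: `predict_noise` is a hypothetical placeholder for the trained, text-conditioned neural network, and the loop simply removes a fraction of the estimated noise at each step.

```python
import numpy as np

def predict_noise(noisy_image, target):
    # A trained diffusion model would estimate the noise given the image
    # and a text prompt; this toy stand-in points toward a fixed target.
    return noisy_image - target

def denoise(shape, target, steps=50, step_size=0.1, seed=0):
    rng = np.random.default_rng(seed)
    image = rng.normal(size=shape)          # start from pure noise
    for _ in range(steps):
        eps = predict_noise(image, target)  # estimate the noise to remove
        image = image - step_size * eps     # remove a fraction of it
    return image
```

Run for enough steps and the random canvas converges on an image the (toy) model considers consistent with its conditioning, which is the essential shape of the real sampling loop.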

Generating 3D objects, however, introduces layers of complexity. One primary hurdle is the sheer scale of training data. While 2D diffusion models benefit from datasets containing billions of images and their corresponding captions, 3D datasets are considerably smaller, often comprising only a few million entries. This scarcity means a 3D model would struggle to independently learn and combine concepts like "frogs" and "stilts" due to insufficient examples. Furthermore, a 3D object must maintain visual coherence and realism from every possible angle, a far more demanding task than generating a single 2D viewpoint.

Generating 3D Models with Diffusion - Computerphile

DreamFusion: A Bridge Between Dimensions

The breakthrough came in 2022 with the introduction of DreamFusion, a model that pioneered the use of 2D diffusion models to generate 3D content. This was a pivotal moment: it allowed researchers to harness the conceptual understanding of 2D models and project it into three-dimensional space, finally enabling the creation of abstract 3D objects like a frog on stilts.

The DreamFusion approach, and subsequent models that build upon it, combines two key components: a powerful 2D diffusion model, such as Stable Diffusion, and a view synthesis model for the 3D representation. Instead of traditional 3D meshes, these systems often employ modern view synthesis techniques such as Neural Radiance Fields (NeRFs) or the more recent 3D Gaussian Splatting. These view synthesis models are adept at reconstructing a 3D scene from a series of 2D images, effectively mapping 2D observations to a consistent 3D structure.
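The core of a NeRF-style renderer is volume rendering along camera rays: each sample along a ray contributes its colour weighted by its own opacity and by the transmittance of everything in front of it. A minimal sketch of that compositing step, with densities and colours given as plain arrays rather than queried from a trained network:

```python
import numpy as np

def render_ray(densities, colors, deltas):
    """Composite colour samples along one ray using transmittance.

    densities: (N,) non-negative volume density at each sample
    colors:    (N, 3) RGB colour at each sample
    deltas:    (N,) distance between consecutive samples
    """
    alpha = 1.0 - np.exp(-densities * deltas)   # opacity of each sample
    # Transmittance: probability the ray reaches each sample unblocked.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = alpha * trans
    return (weights[:, None] * colors).sum(axis=0)  # expected colour
```

A full NeRF evaluates a neural network to get `densities` and `colors` at each sample point and repeats this for every pixel's ray; the key property is that the whole computation is differentiable, so rendered images can drive gradients back into the 3D scene.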

The Score Distillation Sampling Process

The magic happens through a technique known as Score Distillation Sampling (SDS). The process typically begins with an empty or sparsely initialized 3D scene. Let us consider the example of generating a "frog on stilts":

  1. Initial Rendering: A 2D image is rendered from the current (initially blank or noisy) 3D scene, viewed from a specific, randomly chosen camera angle.
  2. Noise Injection: A controlled amount of noise is then added to this rendered 2D image.
  3. Diffusion Model Query: The noised image, along with the text prompt ("frogs on stilts"), is fed into the 2D diffusion model. The diffusion model then estimates the noise that needs to be removed to make the image conform to the prompt.
  4. 3D Scene Optimization: This estimated noise, or score, is used as a gradient signal to update the parameters of the 3D scene, nudging it so that renders from that viewpoint better match the prompt. Repeating these steps from many randomly chosen camera angles gradually sculpts a 3D object that looks correct from every direction.
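The four steps above can be sketched as a single optimisation step. This is a toy illustration, not DreamFusion's actual code: the "scene" is just an image rendered by an identity map, and `diffusion_predict_noise` is a hypothetical stand-in for a text-conditioned diffusion model that prefers a fixed target image.

```python
import numpy as np

def render(scene_params, camera):
    # A real system renders a NeRF / Gaussian splat from `camera`;
    # in this toy the scene IS an image, so rendering is the identity.
    return scene_params

def diffusion_predict_noise(noisy, true_noise, target, image):
    # Stand-in for a text-conditioned diffusion model: it "wants" the
    # underlying image to look like `target`, so its noise estimate is
    # the true noise plus the residual (image - target).
    return true_noise + (image - target)

def sds_step(scene_params, target, sigma=0.5, lr=0.05, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    camera = rng.uniform(0, 2 * np.pi)     # random viewpoint
    image = render(scene_params, camera)   # 1. render the current scene
    noise = rng.normal(size=image.shape)
    noisy = image + sigma * noise          # 2. inject controlled noise
    eps_hat = diffusion_predict_noise(noisy, noise, target, image)  # 3. query model
    grad = eps_hat - noise                 # 4. SDS gradient (score residual)
    return scene_params - lr * grad        # update the 3D scene
```

Iterating `sds_step` from many random cameras drives the scene toward something the diffusion prior considers consistent with the prompt from every view, which is the essence of Score Distillation Sampling.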