MPOD123: One Image to 3D Content Generation Using Mask-enhanced Progressive Outline-to-Detail Optimization


Recent advances in single-image-driven 3D content generation have been propelled by prior knowledge from pretrained 2D diffusion models. However, the 3D content generated by existing methods often exhibits distorted outline shapes and inadequate details. To solve this problem, we propose a novel two-stage framework called Mask-enhanced Progressive Outline-to-Detail optimization (MPOD123). In the first stage, MPOD123 uses a pretrained view-conditioned diffusion model to guide the outline shape optimization of the 3D content. Given a viewpoint, we estimate outline shape priors in the form of a 2D mask from the 3D content by leveraging opacity calculation. In the second stage, MPOD123 incorporates Detail Appearance Inpainting (DAI) to guide the refinement of local geometry and texture using the shape priors. The essence of DAI lies in Mask Rectified Cross-Attention (MRCA), which can be conveniently plugged into the Stable Diffusion model. The MRCA module uses the mask to rectify the attention map from each cross-attention layer. With this new module, DAI can guide the detail refinement of the 3D content while better preserving the outline shape. To assess applicability in practical scenarios, we contribute a new dataset modeled on real-world e-commerce environments. Extensive quantitative and qualitative experiments on this dataset and on open benchmarks demonstrate the effectiveness of MPOD123 over state-of-the-art methods.

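The opacity-based mask estimation described in the abstract can be illustrated with a minimal NumPy sketch. It assumes standard NeRF-style volume rendering, where the accumulated opacity of a ray is the sum of its compositing weights. The function name `opacity_mask`, the input layout, and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def opacity_mask(sigma, deltas, threshold=0.5):
    """Hypothetical helper: binary outline mask from per-ray densities.

    sigma:  (H, W, S) non-negative densities at S samples along each pixel's ray.
    deltas: (H, W, S) distances between consecutive samples.
    threshold: assumed cut-off on accumulated opacity (not from the paper).
    """
    # Per-sample opacity from density and step size.
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance reaching sample i: product of (1 - alpha_j) for j < i.
    trans = np.cumprod(1.0 - alpha, axis=-1)
    trans = np.concatenate([np.ones_like(trans[..., :1]), trans[..., :-1]], axis=-1)
    # Compositing weights; their sum is the accumulated opacity in [0, 1].
    opacity = (trans * alpha).sum(axis=-1)
    mask = (opacity > threshold).astype(np.float32)
    return mask, opacity

# Toy scene: a dense 2x2 block in the middle of an empty 4x4 image.
sigma = np.zeros((4, 4, 8))
sigma[1:3, 1:3, :] = 50.0
deltas = np.full((4, 4, 8), 0.1)
mask, opacity = opacity_mask(sigma, deltas)
```

Rays through empty space accumulate zero opacity and fall outside the mask, while rays through the dense block saturate near one, so the thresholded map traces the object's silhouette for that viewpoint.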

Overview of MPOD123. We generate high-quality 3D content from an input image through progressive optimization. In the first stage, we use a view-conditioned diffusion model (Zero-1-to-3) to guide the optimization of a neural radiance field (NeRF) in novel views. For a given viewpoint, the relative viewpoint transformation (R, T) and the input image serve as the conditional inputs to Zero-1-to-3. In the second stage, we initialize a textured 3D mesh from the NeRF and use our Detail Appearance Inpainting (DAI) approach to guide the optimization of the mesh in novel views. For a given viewpoint, DAI takes two conditional inputs: a 2D mask built from the NeRF at the same viewpoint and a text prompt derived from the input image. We impose a reference loss L_ref in both stages to ensure that the image rendered from the reference view fits the input image.
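
As a rough illustration of how a 2D mask might rectify a cross-attention map, the sketch below multiplies each spatial location's text attention by the mask value at that location. This multiplicative rule is an assumption made for illustration only; the text above does not specify the exact rectification formula, and the function names and tensor shapes here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resize_mask_nearest(mask, h, w):
    # Nearest-neighbour resize of an (H, W) mask to the attention resolution (h, w).
    H, W = mask.shape
    rows = np.arange(h) * H // h
    cols = np.arange(w) * W // w
    return mask[rows][:, cols]

def mask_rectified_cross_attention(q, k, v, mask, h, w):
    """Illustrative sketch, not the paper's exact MRCA.

    q: (h*w, d) image-feature queries; k, v: (T, d) text keys and values;
    mask: (H, W) binary outline mask for the current viewpoint.
    """
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)        # (h*w, T)
    m = resize_mask_nearest(mask, h, w).reshape(-1, 1)   # (h*w, 1)
    attn = attn * m  # assumed rule: suppress text attention outside the outline
    return attn @ v, attn

rng = np.random.default_rng(0)
h = w = 4
q = rng.normal(size=(h * w, 8))
k = rng.normal(size=(5, 8))
v = rng.normal(size=(5, 8))
outline = np.zeros((8, 8))
outline[2:6, 2:6] = 1.0
out, attn = mask_rectified_cross_attention(q, k, v, outline, h, w)
```

Under this assumed rule, locations outside the outline receive no text-driven update, which is one simple way a mask could help preserve the outline shape while details are refined inside it.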