[Figure: gallery of results. Panel labels: Original model (Sabrewulf), Cobalt metal, Titanium metal, African chameleon, Ant, Bald eagle, Bighorn, Chromium metal, Cock, Orchid plant, Shark, Fire salamander; Original model (Kinni), Aluminum metal, Egyptian cat, Theater building, Gold, Tomato, Lily plant, Indigo bunting, Broccoli, Sycamore tree, Triceratops, Orchid plant; Original model (Apatosaurus); Original model (Devil); Original model (Nesting doll); Original model (Tiger), Banana, Kit fox, Cock, Giraffe, Gold, Gazelle, Flamingo, African chameleon, King penguin, Polar bear, Peacock, King penguin.]
In this paper, we tackle a new task of 3D object synthesis, in which a 3D model is combined with the text of another object to create a novel 3D model. However, most existing text-, image-, or 3D-to-3D methods struggle to effectively integrate multiple content sources, often producing inconsistent textures and inaccurate shapes. To overcome these challenges, we propose a straightforward yet powerful approach, Text+3D-to-3D (T33D), for generating novel and compelling 3D models. Our method first renders multi-view images and normal maps from the input 3D model, and then generates a novel, surprising 2D object using ATIH, taking the front-view image and the object text as inputs. To ensure texture consistency, we introduce texture multi-view diffusion (TMDiff), which refines the textures of the remaining multi-view RGB images based on the novel 2D object. To enhance shape accuracy, we propose shape multi-view diffusion (SMDiff), which improves the 2D shapes of both the multi-view RGB images and the normal maps, also conditioned on the novel 2D object. Finally, these outputs are used to reconstruct a complete and novel 3D model. Extensive experiments demonstrate the effectiveness of our method, yielding impressive 3D creations.
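To make the pipeline concrete, the minimal Python sketch below composes the five stages in order; the callables render_views, atih_fuse, tmdiff_refine, smdiff_refine, and reconstruct are hypothetical placeholders standing in for the renderer, ATIH, TMDiff, SMDiff, and the multi-view reconstructor, not the actual implementation.

from typing import Callable

def t33d_pipeline(
    mesh,
    object_text: str,
    render_views: Callable,    # mesh -> (multi-view RGB images, normal maps)
    atih_fuse: Callable,       # (front-view RGB, object text) -> novel 2D object
    tmdiff_refine: Callable,   # (RGB views, condition) -> texture-refined RGB views
    smdiff_refine: Callable,   # (RGB views, normals, condition) -> shape-refined views
    reconstruct: Callable,     # (RGB views, normals) -> 3D model
):
    """Compose the five T33D stages described in the abstract.

    All five callables are hypothetical placeholders; they do not
    correspond to a released API.
    """
    # 1. Render multi-view RGB images and normal maps from the input 3D model.
    rgbs, normals = render_views(mesh)

    # 2. Fuse the front-view image with the object text via ATIH to obtain
    #    the novel 2D object that conditions the later stages.
    novel_front = atih_fuse(rgbs[0], object_text)

    # 3. TMDiff: refine the textures of the remaining views, conditioned on
    #    the novel 2D object, for texture consistency.
    rgbs = tmdiff_refine(rgbs, condition=novel_front)

    # 4. SMDiff: improve the 2D shapes of the RGB views and normal maps,
    #    also conditioned on the novel 2D object, for shape accuracy.
    rgbs, normals = smdiff_refine(rgbs, normals, condition=novel_front)

    # 5. Reconstruct a complete, novel 3D model from the refined views.
    return reconstruct(rgbs, normals)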
Our TMDiff achieves better texture consistency than ATIH.
Our SMDiff achieves better shape accuracy than Era3D.
We observe that Era3D, CRM, LGM, and VFusion3D
struggle with inconsistent textures and inaccurate shapes in the generated 3D object models. In contrast, our method successfully
synthesizes novel 3D objects, such as the Cat-Cock and Tiger-Egyptian Cat shown in the first and fourth columns, respectively.
We observe that ThemeStation produces 3D object models with inconsistent textures and inaccurate shapes, whereas our method successfully generates coherent and novel 3D object syntheses.