Category-Aware 3D Object Composition with Disentangled Texture and Shape Multi-view Diffusion

Anonymous authors


Teaser figure. Novel 3D meshes generated by our Text+3D-to-3D (T33D) approach. Original 3D models (Sabrewulf, Kinni character, Apatosaurus, Devil, Nesting doll, and Tiger) are fused with object text prompts including Cobalt metal, Titanium metal, African chameleon, Ant, Bald eagle, Bighorn, Chromium metal, Cock, Orchid plant, Shark, Fire salamander, Aluminum metal, Egyptian cat, Theater building, Gold, Tomato, Lily plant, Indigo bunting, Broccoli, Sycamore tree, Triceratops, Banana, Kit fox, Giraffe, Gazelle, Flamingo, King penguin, Polar bear, and Peacock.

Abstract

In this paper, we tackle a new task of 3D object synthesis, where an input 3D model is combined with a text prompt describing another object to create a novel 3D model. Most existing text-, image-, and 3D-to-3D methods struggle to effectively integrate multiple content sources, often resulting in inconsistent textures and inaccurate shapes. To overcome these challenges, we propose a straightforward yet powerful approach, Text+3D-to-3D (T33D), for generating novel and compelling 3D models. Our method begins by rendering multi-view RGB images and normal maps from the input 3D model, and then generates a novel, surprising 2D object using ATIH, with the front-view image and the object text as inputs. To ensure texture consistency, we introduce texture multi-view diffusion (TMDiff), which refines the textures of the remaining multi-view RGB images based on the novel 2D object. To enhance shape accuracy, we propose shape multi-view diffusion (SMDiff), which improves the 2D shapes of both the multi-view RGB images and the normal maps, also conditioned on the novel 2D object. Finally, these outputs are used to reconstruct a complete and novel 3D model. Extensive experiments demonstrate the effectiveness of our method, yielding impressive 3D creations.
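As a rough illustration of this pipeline, the sketch below walks through the five stages in order. All helper names (render_multiview, atih_fuse, tmdiff_refine, smdiff_refine, reconstruct_mesh) and their arguments are hypothetical placeholders for the corresponding components, not an actual implementation or API.

```python
# Minimal sketch of the T33D pipeline described in the abstract. All helper
# names (render_multiview, atih_fuse, tmdiff_refine, smdiff_refine,
# reconstruct_mesh) are hypothetical placeholders for the method's stages,
# not an actual API.

def t33d(input_mesh, object_text, num_views=6):
    # 1. Render multi-view RGB images and normal maps from the input 3D model.
    rgb_views, normal_views = render_multiview(input_mesh, num_views=num_views)

    # 2. Fuse the front-view rendering with the object text via ATIH to obtain
    #    a novel, surprising 2D object image.
    novel_object = atih_fuse(image=rgb_views[0], text=object_text)

    # 3. Texture multi-view diffusion (TMDiff): refine the textures of the
    #    remaining views, conditioned on the novel 2D object.
    rgb_views = tmdiff_refine(rgb_views, condition=novel_object)

    # 4. Shape multi-view diffusion (SMDiff): refine the 2D shapes of both the
    #    RGB views and the normal maps, also conditioned on the novel object.
    rgb_views, normal_views = smdiff_refine(rgb_views, normal_views,
                                            condition=novel_object)

    # 5. Reconstruct a complete, novel 3D model from the refined views.
    return reconstruct_mesh(rgb_views, normal_views)
```

The point this sketch highlights is that both refinement stages are conditioned on the same novel 2D object, which ties texture consistency and shape accuracy back to the fused front view.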



An example of texture consistency


Our TMDiff achieves better texture consistency than ATIH.




An example of shape accuracy


Our SMDiff demonstrates better shape accuracy than Era3D.




Comparisons with different image-to-3D methods


We observe that Era3D, CRM, LGM, and Vfusion3D produce inconsistent textures and inaccurate shapes in the generated 3D object models. In contrast, our method successfully synthesizes novel 3D objects, such as the Cat-Dock and Tiger-Egyptian Cat shown in the first and fourth columns, respectively.

Comparisons with a 3D-to-3D method


We observe that ThemeStation produces 3D object models with inconsistent textures and inaccurate shapes, whereas our method successfully generates coherent and novel 3D object syntheses.



Ablation Study

Different results with varying values of α


Ablation study of SMDiff and TMDiff
