Custyle: Gaussian Splatting Stylization with Customized Reference via Dynamic Style Score Distillation

arXiv 2024 (under review)

1ShanghaiTech University, 2AIGC Research (AI4C Team), 3Guangming Lab, 4Tsinghua University (THU), 5University of Electronic Science and Technology of China (UESTC), 6Shenzhen MSU-BIT University (SMBU)
banner image

Custyle enables versatile, high-quality 3D stylization across diverse styles.

Chair Image
Hotdog Image
Mic Image

Abstract

Recent research in 2D image stylization has shifted from traditional approaches based on universally pre-trained VGG networks or adversarial learning paradigms to diffusion models, which facilitate progressive and fine-grained style infusion. Nevertheless, this advancement has barely been explored for 3D stylization. In this paper, we introduce a comprehensive Gaussian Splatting (GS) stylization framework that enables style transfer from a customized reference image to an arbitrary 3D model. In general, we distill the style scores of a pre-trained, specialized diffusion model into GS optimization through an adaptive dynamic schedule. Specifically, we begin by embedding the style of a customized reference into the front view using an image stylization diffusion model. To ensure geometric consistency, the stylization adjustments of the front view are propagated to fixed perspectives using a multi-view diffusion model guided by the reference image. Furthermore, we introduce a straightforward yet effective score distillation strategy, termed style outpainting, to progressively supplement the remaining views without ground-truth supervision. Additionally, we find that eliminating outlier Gaussians with excessively high gradients effectively reduces the risk of stylization failure. We conduct extensive experiments on a collection of style references (i.e., artistic paintings and customized designs) and 3D models to validate our framework. Comprehensive visualizations and quantitative analyses demonstrate that our method achieves higher-fidelity, more geometry-consistent GS stylization than previous methods.

Video

Overview of Our Method

Method Pipeline Illustration

Overview of our pipeline for 3D stylization with three modules:

(a) Multi-view Style Enhancement: Style information is extracted from reference images, with Style Cleaning refining the style representation and Style Injection integrating it across multiple views to ensure geometric consistency.

(b) Dynamic Style Score Distillation (DSSD): This module employs dynamic noise scheduling and adaptive style guidance, integrating both latent- and pixel-space losses to achieve consistent stylization step by step (a code sketch follows this overview).

(c) Progressive Style Outpainting (PSO): Progressive outpainting propagates the style across the remaining views. Gaussian Refinement (e.g., scale regularization and high-gradient Gaussian culling) is conducted simultaneously during distillation, enhancing the final stylized 3D representation, particularly in artistic scenes (a second sketch below illustrates the refinement step).
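
To make the distillation step concrete, here is a minimal PyTorch-style sketch of one DSSD update, written against a latent-diffusion stylization model (VAE, UNet, and a DDPM-style noise scheduler such as those in diffusers). The shrinking timestep range, the decaying guidance scale, and the loss weights are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F


def dssd_loss(render, vae, unet, scheduler, style_emb, null_emb,
              step, max_steps, w_latent=1.0, w_pixel=0.1):
    """One hypothetical DSSD update: latent score distillation + pixel loss.

    `render` is the differentiable rendering of the current Gaussians
    (B x 3 x H x W, roughly in [-1, 1]); gradients flow back into the Gaussians.
    """
    latents = vae.encode(render).latent_dist.sample() * 0.18215

    # Dynamic noise schedule: the sampled timestep range shrinks as optimization
    # progresses, so early steps change style coarsely and later steps refine.
    progress = step / max_steps
    t_max = max(int(980 * (1.0 - progress)), 50)
    t = torch.randint(20, t_max + 1, (latents.shape[0],), device=latents.device)

    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    with torch.no_grad():
        # Adaptive style guidance: classifier-free guidance with a decaying scale.
        eps_style = unet(noisy, t, encoder_hidden_states=style_emb).sample
        eps_null = unet(noisy, t, encoder_hidden_states=null_emb).sample
        scale = 1.5 + 6.0 * (1.0 - progress)
        eps = eps_null + scale * (eps_style - eps_null)

        # Predicted clean latent under the DDPM forward process, then decode it.
        alpha_bar = scheduler.alphas_cumprod.to(latents.device)[t].view(-1, 1, 1, 1)
        x0_pred = (noisy - (1.0 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()
        target_img = vae.decode(x0_pred / 0.18215).sample

    # Latent loss: reparameterized score distillation
    # (its gradient w.r.t. `latents` is proportional to eps - noise).
    grad = eps - noise
    loss_latent = F.mse_loss(latents, (latents - grad).detach())

    # Pixel loss: pull the render toward the style-denoised decoded image.
    loss_pixel = F.mse_loss(render, target_img)

    return w_latent * loss_latent + w_pixel * loss_pixel
```

The returned loss would be backpropagated through the differentiable rasterizer into the Gaussian parameters; only the timestep range and guidance scale change over the course of optimization.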
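The Gaussian Refinement mentioned above can likewise be pictured as two small operations on the Gaussian parameters. This is a sketch under the assumption that the 3DGS model tracks per-Gaussian accumulated view-space gradient magnitudes (as standard 3DGS densification does) and exposes per-Gaussian scales; the quantile threshold and scale cap are illustrative values.

```python
import torch


def scale_regularization(scales, max_scale=0.05):
    """Penalize Gaussians whose largest axis exceeds an (assumed) threshold,
    discouraging elongated spike artifacts during stylized optimization."""
    return torch.relu(scales.max(dim=-1).values - max_scale).mean()


def high_gradient_culling_mask(grad_magnitudes, quantile=0.999):
    """Flag Gaussians whose accumulated position-gradient magnitudes are extreme
    outliers; pruning them reduces the risk of stylization failure."""
    threshold = torch.quantile(grad_magnitudes, quantile)
    return grad_magnitudes > threshold  # boolean mask of Gaussians to prune
```

A caller would pass the returned mask to whatever mask-based pruning utility the underlying 3DGS implementation provides.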

Multi-view Stylization

Multi-view Style Pipeline Illustration

Multi-view Style Enhancement:

(a) Style Cleaning: Isolates pure style information by removing content elements and enhancing style features.

(b) Style Injection: Integrates this refined style across multiple views, ensuring consistent and geometry-preserving 3D style enhancement (a high-level sketch of the data flow follows below).
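
As a rough illustration of the data flow only, the sketch below chains the two components. Every callable here (clean_style, stylize_front_view, propagate_multiview) is a hypothetical placeholder for the corresponding diffusion-model call, not part of any released API.

```python
def multiview_style_enhancement(reference_image, front_view, fixed_views,
                                clean_style, stylize_front_view, propagate_multiview):
    """Chain Style Cleaning and Style Injection; every callable is hypothetical.

    clean_style:          reference image -> content-free style representation
    stylize_front_view:   (front view, style) -> stylized front view
    propagate_multiview:  (stylized front, fixed views, style) -> stylized views
    """
    # (a) Style Cleaning: strip content elements, keep an enhanced style signal.
    style = clean_style(reference_image)

    # Embed the cleaned style into the front view with an image-stylization
    # diffusion model.
    stylized_front = stylize_front_view(front_view, style)

    # (b) Style Injection: propagate the front-view edit to fixed perspectives
    # with a reference-guided multi-view diffusion model, preserving geometry.
    stylized_views = propagate_multiview(stylized_front, fixed_views, style)
    return stylized_front, stylized_views
```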

Visual Results

Visual Result Illustration

Visual Results: This figure demonstrates the performance of our method across six distinct styles: Sky Painting, Cartoon, Drawing, Fire, Cloud, and Black Myth (Wukong), applied to three 3D objects: Chair, Hotdog, and Mic.

These results highlight the model's capability to handle two main categories of styles:

  • Non-photorealistic Art Styles (e.g., Cartoon, Drawing): These styles showcase traditional artistic expressions.
  • State-based Styles (e.g., Fire, Cloud): These styles capture physical properties.

Our method demonstrates versatility in stylizing 3D models while preserving both style fidelity and geometric consistency. For example:

  • The original texture on the Chair.
  • The lighting effects on the Hotdog.
  • The metallic sheen on the Mic.

These features are effectively preserved, ensuring high-quality results across diverse artistic and physical characteristics.

Qualitative Comparisons

Qualitative Comparison Illustration

Qualitative Comparisons: We compare our method against other state-of-the-art (SOTA) approaches, StyleGaussian (Liu et al., 2024), IGS2GS, and Ref-NPR (Zhang et al., 2023), on three datasets (Chair, Hotdog, and Mic) using three styles: Cartoon, Sky Painting, and Fire.

In the figure, columns correspond to the compared methods and rows to the different datasets. Our method effectively preserves the original model details, including:

  • The texture of the Chair.
  • The lighting effects on the Hotdog.
  • The metallic sheen of the Mic.

Compared to other methods, our approach demonstrates stronger semantic understanding, clearly distinguishing key elements such as the sausage, bun, and plate on the Hotdog. This results in more stable and visually coherent stylizations.

Quantitative Metrics

Average Metric
Detail Metric
CLIP Metric

Quantitative comparison with competing methods demonstrates the superiority of our approach.

Ablation Studies

Ablation: DSSD

DSSD Ablation Study

Ablation on DSSD (Dynamic Style Score Distillation)

Ablation: PSO

Outpainting Ablation Study

Ablation on PSO (Progressive Style Outpainting)

Ablation: Gaussian Refinement

Gaussian Refinement Ablation Study

Ablation on Gaussian Refinement