In this work, we introduce a novel method for generating high-fidelity full-head 3D avatars from a single image, regardless of perspective, style, expression, or accessories. Prior works often fail to preserve consistent head geometry and facial details, primarily due to their limited capacity for modeling fine-grained facial textures and maintaining identity information. To address these challenges, we construct a new high-quality dataset containing 227 sequences of digital human portraits captured from 96 different perspectives, totalling 21,792 frames with high-quality facial texture details. To further improve performance, we propose a novel multi-view diffusion model, named ID-TS diffusion, which integrates identity and expression information into a two-stage multi-view diffusion process. The low-resolution stage ensures structural consistency of heads across multiple views, while the high-resolution stage preserves facial detail fidelity and coherence. Finally, we propose an enhanced feed-forward Gaussian avatar reconstruction method that optimizes the network on the multi-view images of each individual subject, significantly improving 3D facial texture details. Extensive experiments demonstrate that our method performs robustly across challenging scenarios and is broadly applicable to numerous downstream tasks.
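To make the coarse-to-fine hand-off between the two stages concrete, the sketch below shows one way the structurally consistent low-resolution multi-view latents could be upsampled and perturbed with noise before high-resolution denoising. It is a minimal illustration only: the function name init_highres_latents, the noise scale, the upsampling mode, and the tensor shapes are assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def init_highres_latents(z_lowres, noise_scale=0.5, scale_factor=2):
    """Upsample stage-1 multi-view latents and add noise before stage-2 denoising.

    Sketch of the coarse-to-fine hand-off: the low-resolution latents initialise
    the high-resolution stage, which then denoises them to recover fine facial
    texture. `noise_scale`, `scale_factor`, and the bilinear mode are assumptions.
    """
    z_up = F.interpolate(z_lowres, scale_factor=scale_factor,
                         mode="bilinear", align_corners=False)
    # Element-wise addition of Gaussian latent noise, as described in the pipeline.
    return z_up + noise_scale * torch.randn_like(z_up)

# Example: 96 views with 4-channel 32x32 VAE latents, upsampled to 64x64.
z_lr = torch.randn(96, 4, 32, 32)
z_hr_init = init_highres_latents(z_lr)   # shape (96, 4, 64, 64)
```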
Overview of our inference pipeline. Phase 1: In the low-resolution stage, we embed the camera pose e and noise step t via positional encoding, concatenate them, and feed them into the U-Net's residual blocks. The conditional image x is encoded into the latent space by the VAE encoder E, concatenated with noise, and processed alongside CLIP and ArcFace embeddings via cross-attention in the transformer blocks, generating multi-view images with accurate head shapes. In the high-resolution stage, we upsample the previous outputs, add latent noise element-wise, and denoise them; the output comprises multi-view images with high-fidelity texture details. Phase 2: Using the front/left/back/right images as inputs and the remaining frames as supervision to optimize the 3D U-Net, we finally obtain a high-quality Gaussian head P.
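As a hedged illustration of the Phase 1 conditioning path, the PyTorch sketch below positionally encodes the camera pose e and noise step t, concatenates them, injects the result into a residual block, and attends over concatenated CLIP and ArcFace embeddings via cross-attention. All dimensions, the module layout, and the name ConditionedResBlock are illustrative assumptions, not the paper's exact architecture.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x, dim=128):
    """Standard sinusoidal positional encoding of one scalar per batch element."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = x[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class ConditionedResBlock(nn.Module):
    """Sketch of one conditioned U-Net block: pose/step embeddings are added to the
    feature map, and identity (ArcFace) plus semantic (CLIP) embeddings form the
    cross-attention context. Dimensions are placeholders."""
    def __init__(self, channels=64, emb_dim=128, ctx_dim=512):
        super().__init__()
        self.emb_proj = nn.Linear(2 * emb_dim, channels)   # pose ⊕ step -> channel bias
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = nn.MultiheadAttention(channels, num_heads=4, kdim=ctx_dim,
                                          vdim=ctx_dim, batch_first=True)

    def forward(self, h, e, t, clip_emb, arcface_emb):
        # Positionally encode camera pose e and noise step t, then concatenate.
        emb = torch.cat([sinusoidal_embedding(e), sinusoidal_embedding(t)], dim=-1)
        h = h + self.emb_proj(emb)[:, :, None, None]         # inject into residual block
        h = h + self.conv2(torch.relu(self.conv1(h)))        # residual update
        ctx = torch.cat([clip_emb, arcface_emb], dim=1)      # CLIP + ArcFace context
        b, c, H, W = h.shape
        tokens = h.flatten(2).transpose(1, 2)                # (B, H*W, C) query tokens
        attn_out, _ = self.attn(tokens, ctx, ctx)            # cross-attention
        return (tokens + attn_out).transpose(1, 2).reshape(b, c, H, W)

# Example with assumed shapes: 8 views, camera index as a scalar pose code.
h = torch.randn(8, 64, 32, 32)
e = torch.arange(8)                      # camera pose indices (assumption)
t = torch.full((8,), 500)                # shared diffusion step
clip_emb = torch.randn(8, 77, 512)       # CLIP tokens of the conditional image
arcface_emb = torch.randn(8, 1, 512)     # ArcFace identity embedding
out = ConditionedResBlock()(h, e, t, clip_emb, arcface_emb)
```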