High-Quality Full-Head 3D Avatar Generation from Any Single Portrait Image

AAAI 2026 Poster

Yujie Gao1, Chencheng Wang1, Xianbing Sun1, Jiahui Zhan1, Wentao Wang2, Yiyi Zhang1, Haohua Zhao1, Liqing Zhang1, Jianfu Zhang1
1Shanghai Jiao Tong University, 2Shanghai AI Laboratory


We generate multi-view images around the head from a single portrait image, and then reconstruct the identity's 3D Gaussian point cloud from these views. Our method supports frontal face prediction, stylized head generation, and facial expression animation, achieving compelling results across all of these tasks.

Abstract

In this work, we introduce a novel method for generating high-fidelity full-head 3D avatars from a single image, regardless of perspective, style, expression, or accessories. Prior works often fail to preserve consistent head geometry and facial details, primarily due to their limited capacity for modeling fine-grained facial textures and maintaining identity information. To address these challenges, we construct a new high-quality dataset containing 227 sequences of digital human portraits captured from 96 different perspectives, totaling 21,792 frames, featuring high-quality facial texture details. To further improve performance, we propose a novel multi-view diffusion model, named ID-TS diffusion, which integrates identity and expression information into a two-stage multi-view diffusion process. The low-resolution stage ensures structural consistency of heads across multiple views, while the high-resolution stage preserves facial detail fidelity and coherence. Finally, we propose an enhanced feed-forward Gaussian avatar reconstruction method that optimizes the network on the multi-view images of each individual subject, significantly improving 3D facial texture details. Extensive experiments demonstrate that our method performs robustly across challenging scenarios and applies broadly to numerous downstream tasks.
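As a rough illustration of the generation flow described above, the following Python sketch traces the two diffusion stages and the final reconstruction step. The function names (low_res_mvdiff, high_res_mvdiff, reconstruct_gaussian_head) are hypothetical placeholders for the stages described in this work, not an actual released API.

def generate_avatar(portrait, identity_embedding, camera_poses):
    # Stage 1 (low resolution): multi-view diffusion conditioned on the portrait
    # and identity enforces a structurally consistent head across all viewpoints.
    coarse_views = low_res_mvdiff(portrait, identity_embedding, camera_poses)
    # Stage 2 (high resolution): upsample the coarse views, combine them with
    # latent noise, and denoise again to recover fine, view-coherent facial texture.
    fine_views = high_res_mvdiff(coarse_views, portrait, identity_embedding, camera_poses)
    # Feed-forward Gaussian reconstruction, followed by per-subject optimization
    # on the generated multi-view images to sharpen 3D facial texture details.
    return reconstruct_gaussian_head(fine_views, optimize_per_subject=True)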

Method


Overview of our inference pipeline. Phase 1: In the low-resolution stage, we embed the camera pose e and noise step t via positional encoding, concatenate them, and feed them into the U-Net's residual blocks. The conditional image x is encoded into the latent space by the VAE encoder E, concatenated with noise, and processed alongside CLIP and ArcFace embeddings via cross-attention in the transformer blocks, generating multi-view images with accurate head shapes. In the high-resolution stage, we upsample the previous outputs, add them element-wise to latent noise, and denoise them, producing multi-view images with high-fidelity texture details. Phase 2: With the front/left/back/right images as inputs and the remaining frames as supervision for 3D U-Net optimization, we finally obtain a high-quality Gaussian head P.
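To make the Phase 1 conditioning path concrete, below is a minimal PyTorch-style sketch of how the camera pose and noise-step embeddings could be merged before being injected into the residual blocks, and how CLIP and ArcFace features could be assembled into a single cross-attention context. Module names and dimensions are illustrative assumptions, not the authors' implementation.

import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x, dim):
    # Standard positional encoding applied to a scalar per sample
    # (the noise step t, or a pose-angle component of e).
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=x.device).float() / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class PoseTimeEmbed(nn.Module):
    # Embeds camera pose e and noise step t, concatenates them, and projects the
    # result to the channel width expected by the U-Net residual blocks.
    def __init__(self, dim=256, out_dim=1024):
        super().__init__()
        self.dim = dim
        self.proj = nn.Sequential(nn.Linear(2 * dim, out_dim), nn.SiLU(),
                                  nn.Linear(out_dim, out_dim))

    def forward(self, pose, t):
        # pose, t: (B,) tensors; the output is added to residual-block activations.
        emb = torch.cat([sinusoidal_embedding(pose, self.dim),
                         sinusoidal_embedding(t, self.dim)], dim=-1)
        return self.proj(emb)

class IdentityContext(nn.Module):
    # Projects CLIP image tokens and an ArcFace identity vector into one token
    # sequence that the transformer blocks attend to via cross-attention.
    def __init__(self, clip_dim=1024, arcface_dim=512, ctx_dim=1024):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, ctx_dim)
        self.id_proj = nn.Linear(arcface_dim, ctx_dim)

    def forward(self, clip_tokens, arcface_vec):
        # clip_tokens: (B, N, clip_dim); arcface_vec: (B, arcface_dim)
        id_token = self.id_proj(arcface_vec).unsqueeze(1)  # (B, 1, ctx_dim)
        return torch.cat([self.clip_proj(clip_tokens), id_token], dim=1)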

Experiment

Qualitative Comparison: Single Image to 3D Head

More Qualitative Results

Video as Input for 4D Novel View Synthesis


Frontal Face Generation from a Side View

Two examples are shown; each includes the input image together with horizontal and vertical 3D orbit renderings.


Anime-Style Generation

Two examples are shown; each includes the input image, the generated novel views, and horizontal and vertical 3D orbit renderings.