High-Quality Full-Head 3D Avatar Generation from Any Single Portrait Image

AAAI 2026 Poster

Yujie Gao1, Chencheng Wang1, Xianbing Sun1, Jiahui Zhan1, Wentao Wang2, Yiyi Zhang1, Haohua Zhao1, Liqing Zhang1, Jianfu Zhang1
1Shanghai Jiao Tong University, 2Shanghai AI Laboratory


We generate multi-view images around the head from a single portrait image, and then reconstruct the identity's 3D Gaussian point cloud from these views. Our method supports frontal face prediction, stylized head generation, and facial expression animation, achieving compelling results across all of these tasks.

Abstract

In this work, we introduce a novel method for generating high-fidelity full-head 3D avatars from a single image, regardless of perspective, style, expression, or accessories. Prior works often fail to preserve consistent head geometry and facial details, primarily due to their limited capacity for modeling fine-grained facial textures and maintaining identity information. To address these challenges, we construct a new high-quality dataset containing 227 sequences of digital human portraits captured from 96 different perspectives, totaling 21,792 frames, featuring high-quality facial texture details. To further improve performance, we propose a novel multi-view diffusion model, named ID-TS diffusion, which integrates identity and expression information into a two-stage multi-view diffusion process. The low-resolution stage ensures structural consistency of heads across multiple views, while the high-resolution stage preserves facial detail fidelity and coherence. Finally, we propose an enhanced feed-forward Gaussian avatar reconstruction method that optimizes the network on the multi-view images of each individual subject, significantly improving 3D facial texture details. Extensive experiments demonstrate that our method performs robustly across challenging scenarios and applies broadly to numerous downstream tasks.
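As a rough illustration of the generation flow described above, the following Python sketch traces the two diffusion stages and the final reconstruction step. The function names (low_res_mvdiff, high_res_mvdiff, reconstruct_gaussian_head) are hypothetical placeholders for the stages described in this work, not an actual released API.

def generate_avatar(portrait, identity_embedding, camera_poses):
    # Stage 1 (low resolution): multi-view diffusion conditioned on the portrait
    # and identity enforces a structurally consistent head across all viewpoints.
    coarse_views = low_res_mvdiff(portrait, identity_embedding, camera_poses)
    # Stage 2 (high resolution): upsample the coarse views, combine them with
    # latent noise, and denoise again to recover fine, view-coherent facial texture.
    fine_views = high_res_mvdiff(coarse_views, portrait, identity_embedding, camera_poses)
    # Feed-forward Gaussian reconstruction, followed by per-subject optimization
    # on the generated multi-view images to sharpen 3D facial texture details.
    return reconstruct_gaussian_head(fine_views, optimize_per_subject=True)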

Method


Overview of our inference pipeline. Phase 1: In the low-resolution stage, we embed the camera pose e and noise step t via positional encoding, concatenate them, and feed them into the U-Net's residual blocks. The conditional image x is encoded into the latent space by the VAE encoder E, concatenated with noise, and processed alongside CLIP and ArcFace embeddings via cross-attention in the transformer blocks, generating multi-view images with accurate head shapes. In the high-resolution stage, we upsample the previous outputs, add them element-wise to latent noise, and denoise them, producing multi-view images with high-fidelity texture details. Phase 2: With the front/left/back/right images as inputs and the remaining frames as supervision for 3D U-Net optimization, we finally obtain a high-quality Gaussian head P.
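To make the Phase 1 conditioning path concrete, below is a minimal PyTorch-style sketch of how the camera pose and noise-step embeddings could be merged before being injected into the residual blocks, and how CLIP and ArcFace features could be assembled into a single cross-attention context. Module names and dimensions are illustrative assumptions, not the authors' implementation.

import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x, dim):
    # Standard positional encoding applied to a scalar per sample
    # (the noise step t, or a pose-angle component of e).
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=x.device).float() / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class PoseTimeEmbed(nn.Module):
    # Embeds camera pose e and noise step t, concatenates them, and projects the
    # result to the channel width expected by the U-Net residual blocks.
    def __init__(self, dim=256, out_dim=1024):
        super().__init__()
        self.dim = dim
        self.proj = nn.Sequential(nn.Linear(2 * dim, out_dim), nn.SiLU(),
                                  nn.Linear(out_dim, out_dim))

    def forward(self, pose, t):
        # pose, t: (B,) tensors; the output is added to residual-block activations.
        emb = torch.cat([sinusoidal_embedding(pose, self.dim),
                         sinusoidal_embedding(t, self.dim)], dim=-1)
        return self.proj(emb)

class IdentityContext(nn.Module):
    # Projects CLIP image tokens and an ArcFace identity vector into one token
    # sequence that the transformer blocks attend to via cross-attention.
    def __init__(self, clip_dim=1024, arcface_dim=512, ctx_dim=1024):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, ctx_dim)
        self.id_proj = nn.Linear(arcface_dim, ctx_dim)

    def forward(self, clip_tokens, arcface_vec):
        # clip_tokens: (B, N, clip_dim); arcface_vec: (B, arcface_dim)
        id_token = self.id_proj(arcface_vec).unsqueeze(1)  # (B, 1, ctx_dim)
        return torch.cat([self.clip_proj(clip_tokens), id_token], dim=1)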

Experiment

Qualitative Comparison: Single Image to 3D Head

More Qualitative Results

Video as Input for 4D Novel View Synthesis


Frontal Face Generation from a Side View

Two examples are shown; each includes the input image together with horizontal and vertical 3D orbit renderings.


Anime-Style Generation

Two examples are shown; each includes the input image, the generated novel views, and horizontal and vertical 3D orbit renderings.