Bringing Your Portrait to 3D Presence

Figure: Teaser of our method.
TL;DR

We present a unified framework capable of reconstructing animatable 3D avatars from a single portrait, supporting head, half-body, and full-body inputs.

Abstract

We present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation. We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts. We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency. A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head and upper-body reconstruction and competitive full-body results. Extensive experiments and analyses further validate the effectiveness of our approach.

Method Overview

Figure: Pipeline diagram.

We address three key bottlenecks: the reconstruction representation for 3D avatars, insufficient training data, and unstable proxy-mesh estimation.

1. Reconstruction Representation: We propose a Dual-UV representation for reconstructing 3D human avatars, mapping image features into a canonical UV space (see the sketch after this list).

2. Data Generation: We build a hybrid dataset by combining geometry-anchored 3D rendering with semantics-driven generative synthesis.

3. Mesh Estimation: We develop a robust mesh tracker that combines multiple estimators to produce consistent proxy-mesh estimates.
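
To make the Dual-UV mapping concrete, the sketch below shows one way image features can be carried into a canonical UV space: features are sampled at the 2D projections of proxy-mesh points and scattered into a fixed-resolution UV feature map indexed by each point's canonical UV coordinate, so the resulting tokens no longer shift with pose or framing. This is a minimal PyTorch sketch under our own assumptions (known camera intrinsics, per-point UV coordinates, nearest-neighbor splatting); interpreting Shell-UV as a surface inflated along the normals is our reading, and all names here are illustrative rather than the paper's API.

```python
import torch
import torch.nn.functional as F

def sample_uv_features(image_feats, points_3d, uv_coords, cam_K, uv_res=256):
    """Gather image features at projected 3D points and splat them into UV space.

    image_feats: (C, H, W) feature map from the image encoder
    points_3d:   (N, 3) proxy-mesh points in camera coordinates
    uv_coords:   (N, 2) canonical UV coordinates of those points, in [0, 1]
    cam_K:       (3, 3) camera intrinsics
    """
    C, H, W = image_feats.shape
    # Perspective projection of the proxy-mesh points into the image plane.
    proj = (cam_K @ points_3d.T).T                         # (N, 3)
    xy = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)        # pixel coordinates
    # Normalize to [-1, 1] for grid_sample.
    grid = torch.stack([xy[:, 0] / (W - 1), xy[:, 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(image_feats[None], grid[None, None],
                            align_corners=True)            # (1, C, 1, N)
    per_point = sampled[0, :, 0]                           # (C, N)

    # Scatter per-point features into a canonical, pose-independent UV map
    # (nearest-neighbor splatting here; the real system may use a learned decoder).
    uv_map = torch.zeros(C, uv_res, uv_res)
    uv_px = (uv_coords * (uv_res - 1)).long()
    uv_map[:, uv_px[:, 1], uv_px[:, 0]] = per_point
    return uv_map

# Dual-UV: one map anchored on the mesh surface (Core-UV) and one on an
# inflated shell offset along the vertex normals (Shell-UV); the 2 cm
# offset below is purely illustrative.
# core_uv  = sample_uv_features(feats, surface_pts, uvs, K)
# shell_uv = sample_uv_features(feats, surface_pts + 0.02 * normals, uvs, K)
```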

Reconstruction Process

We illustrate our proxy-mesh estimation pipeline with a single image for clarity; the same pipeline naturally supports parallel processing of multi-frame inputs. Starting from an input image, we first extract a foreground mask, then apply a pretrained human mesh recovery model to obtain an initial mesh estimate. This initial estimate is subsequently refined in three stages: body, head, and hand refinement.
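
The flow above can be expressed as a small orchestration function. The sketch below is a minimal outline under our own naming: the matting model, the pretrained human mesh recovery model, and the three per-part refiners are passed in as callables, since the specific components are not named here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Sequence

def estimate_proxy_mesh(image: Any,
                        segment_fg: Callable,    # image -> foreground mask
                        hmr_init: Callable,      # (image, mask) -> initial mesh
                        refine_body: Callable,   # (mesh, image, mask) -> mesh
                        refine_head: Callable,
                        refine_hands: Callable) -> Any:
    # 1. Preprocess: isolate the subject from the background.
    mask = segment_fg(image)
    # 2. Initialize with a pretrained human mesh recovery model.
    mesh = hmr_init(image, mask)
    # 3. Refine part by part: body first, then head, then hands.
    for refine in (refine_body, refine_head, refine_hands):
        mesh = refine(mesh, image, mask)
    return mesh

def estimate_proxy_meshes(frames: Sequence[Any], **components: Callable) -> list:
    # Frames are independent, so multi-frame inputs can be processed in parallel.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda f: estimate_proxy_mesh(f, **components), frames))
```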

Self Reenactment

Novel View Synthesis

Cross Reenactment

Generalization Ability

Text-Generated
Image-Editing
Out-of-Distribution-1
Out-of-Distribution-2

Ethics Statement

Our work focuses on reconstructing animatable 3D human avatars from single images. All training data used in this project are purely synthetic, without relying on real-world personal photos or biometric information. This reduces the risk of privacy violations and unauthorized data collection.

Nevertheless, we recognize that 3D avatar reconstruction technology can be misused, for example to create identity-mimicking content, deepfakes, or other forms of impersonation. To mitigate these risks, our implementation is intended solely for research purposes, and we strongly discourage any use that violates applicable laws, platform policies, or individual rights.

We encourage future deployments of such systems to follow strict data protection principles, obtain informed consent when real data are involved, and provide clear transparency about how avatars are generated and used. These measures are essential to foster responsible development of 3D avatar technologies and to maintain public trust.