Abstract and 1. Introduction

  2. Related Work

    2.1. Motion Reconstruction from Sparse Input

    2.2. Human Motion Generation

  3. SAGE: Stratified Avatar Generation and 3.1. Problem Statement and Notation

    3.2. Disentangled Motion Representation

    3.3. Stratified Motion Diffusion

    3.4. Implementation Details

  4. Experiments

    4.1. Dataset and Evaluation Metrics

    4.2. Quantitative and Qualitative Results

    4.3. Ablation Study

  5. Conclusion and References

Supplementary Material

A. Extra Ablation Studies

B. Implementation Details

B.1 Disentangled VQ-VAE

B.2 Stratified Diffusion

In our transformer-based models for upper-body and lower-body diffusion, we integrate an additional DiT block as described in [29]. Each model features 12 DiT blocks, each with 8 attention heads and an input embedding dimension of 512. The full-body decoder is structured with 6 transformer layers.
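
As a reference point, the sketch below shows one way to assemble such a denoiser branch in PyTorch. The hyperparameters (12 blocks, 8 heads, width 512, a 6-layer full-body decoder) come from the paragraph above; the adaLN-style conditioning wiring and all names here (DiTBlock, BranchDenoiser) are our assumptions based on the DiT design in [29], not the authors' released code.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One DiT block: self-attention + MLP, each modulated by the
    conditioning embedding via adaptive layer norm (adaLN), following
    the design in [29]. The exact wiring is our assumption."""
    def __init__(self, dim=512, num_heads=8, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        # adaLN modulation: shift/scale/gate for the attention and MLP branches.
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, c):
        # x: (batch, seq, dim) motion tokens; c: (batch, dim) condition
        # (e.g., diffusion timestep + sparse-observation features).
        s1, sc1, g1, s2, sc2, g2 = self.adaLN(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1.unsqueeze(1)) + s1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + sc2.unsqueeze(1)) + s2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)
        return x

class BranchDenoiser(nn.Module):
    """Stack of 12 DiT blocks, as used for the upper-body (and, with
    separate weights, the lower-body) diffusion model."""
    def __init__(self, dim=512, depth=12, num_heads=8):
        super().__init__()
        self.blocks = nn.ModuleList([DiTBlock(dim, num_heads) for _ in range(depth)])

    def forward(self, x, c):
        for blk in self.blocks:
            x = blk(x, c)
        return x

# Full-body decoder: 6 transformer layers, per the paragraph above
# (a plain encoder stack is our assumption about the layer type).
full_body_decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
)

# Smoke test with hypothetical shapes (batch=2, sequence length=32).
x, c = torch.randn(2, 32, 512), torch.randn(2, 512)
print(BranchDenoiser()(x, c).shape)  # torch.Size([2, 32, 512])
```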

B.3 Refiner

The complete loss term for training the refiner combines four weighted components, balanced by the coefficients α, β, γ, and δ.

We set α, β, γ, and δ to 0.01, 10, 0.05, and 0.01, respectively, so that the refiner focuses more on motion smoothness during training.
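
Purely as an illustrative sketch (the pairing of weights to terms is our assumption, not the paper's definition), a refiner objective of this kind typically combines a reconstruction loss with rotation, position, and temporal-smoothness penalties:

$$\mathcal{L}_{\text{refine}} \;=\; \mathcal{L}_{\text{rec}} \;+\; \alpha\,\mathcal{L}_{\text{rot}} \;+\; \beta\,\mathcal{L}_{\text{vel}} \;+\; \gamma\,\mathcal{L}_{\text{pos}} \;+\; \delta\,\mathcal{L}_{\text{acc}}$$

Under this reading, the comparatively large β = 10 on the velocity term is what pushes the refiner toward temporally smooth motion, consistent with the weighting rationale stated above.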

All experiments can be carried out on a single NVIDIA GeForce RTX 3090 GPU using the PyTorch framework.

Authors:

(1) Han Feng, Wuhan University (equal contribution, authors ordered alphabetically);

(2) Wenchao Ma, Pennsylvania State University (equal contribution, authors ordered alphabetically);

(3) Quankai Gao, University of Southern California;

(4) Xianwei Zheng, Wuhan University;

(5) Nan Xue, Ant Group ([email protected]);

(6) Huijuan Xu, Pennsylvania State University.


This paper is available on arXiv under the CC BY 4.0 DEED license.