Abstract

  1. Introduction

  2. Related Work

    2.1. Motion Reconstruction from Sparse Input

    2.2. Human Motion Generation

  3. SAGE: Stratified Avatar Generation

    3.1. Problem Statement and Notation

    3.2. Disentangled Motion Representation

    3.3. Stratified Motion Diffusion

    3.4. Implementation Details

  4. Experiments

    4.1. Dataset and Evaluation Metrics

    4.2. Quantitative and Qualitative Results

    4.3. Ablation Study

  5. Conclusion

References

Supplementary Material

A. Extra Ablation Studies

B. Implementation Details

5. Conclusion

We study the problem of human avatar generation from sparse observations. Our key finding is that upper-body and lower-body motions should be disentangled with respect to the input signals, which come only from upper-body joints. Based on this, we propose a novel stratified solution in which the upper-body motion is reconstructed first and the lower-body motion is then reconstructed conditioned on it. Our stratified solution achieves superior performance on publicly available benchmarks.
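At a high level, the stratified reconstruction described above is a two-stage conditional pipeline: sparse upper-body signals (head and hands) drive an upper-body model, whose output then conditions a lower-body model. The sketch below illustrates only this data flow; the function names, joint counts, and linear stand-ins are hypothetical (the actual method uses learned conditional diffusion models for each stage):

```python
import numpy as np

def reconstruct_upper(sparse_obs):
    """Stage 1 (stand-in): map sparse upper-body signals to full
    upper-body joint rotations. A learned conditional diffusion model
    in the actual method; a fixed random linear map here."""
    rng = np.random.default_rng(0)
    # Assumed 16 upper-body joints in a 6D rotation representation.
    W = rng.standard_normal((sparse_obs.shape[-1], 16 * 6))
    return sparse_obs @ W

def reconstruct_lower(upper_motion):
    """Stage 2 (stand-in): generate lower-body motion conditioned on
    the reconstructed upper-body motion."""
    rng = np.random.default_rng(1)
    # Assumed 6 lower-body joints in a 6D rotation representation.
    W = rng.standard_normal((upper_motion.shape[-1], 6 * 6))
    return upper_motion @ W

def stratified_reconstruction(sparse_obs):
    """Upper body first, then lower body conditioned on it."""
    upper = reconstruct_upper(sparse_obs)
    lower = reconstruct_lower(upper)
    return np.concatenate([upper, lower], axis=-1)

# Sparse observations: T frames x 3 trackers x 18 features (illustrative sizes).
obs = np.zeros((60, 3 * 18))
full_motion = stratified_reconstruction(obs)
print(full_motion.shape)  # (60, 132)
```

The point of the stratification is that the second stage never sees the raw sparse input directly; it is conditioned only on the already-reconstructed upper-body motion, matching the disentanglement finding above.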


Authors:

(1) Han Feng, Wuhan University (equal contribution, ordered alphabetically);

(2) Wenchao Ma, Pennsylvania State University (equal contribution, ordered alphabetically);

(3) Quankai Gao, University of Southern California;

(4) Xianwei Zheng, Wuhan University;

(5) Nan Xue, Ant Group ([email protected]);

(6) Huijuan Xu, Pennsylvania State University.


This paper is available on arXiv under a CC BY 4.0 DEED license.