This story on HackerNoon has a decentralized backup on Sia.
Transaction ID: vrOg3LwSl4b0NsJnmW4C6pCO2pXXBc6VHWvmCmhL6eo

How Fast Is PyJuice? Testing Compilation Speed Across GPUs and Batch Sizes

Written by @probabilistic | Published on 2025/8/25

TL;DR
This article presents experimental benchmarks for PyJuice, highlighting its efficiency in both compilation and runtime. Tests show that even models with nearly 1 billion parameters can be compiled in about 30 seconds, and PyJuice consistently outperforms baseline methods across different GPUs (RTX 4090, NVIDIA A40) and batch sizes. These results underline PyJuice’s speed, scalability, and advantage in real-world machine learning workloads.

Abstract and 1. Introduction

  2. Preliminaries and Related Work

  3. Key Bottlenecks in PC Parallelization

  4. Harnessing Block-Based PC Parallelization

    4.1. Fully Connected Sum Layers

    4.2. Generalizing To Practical Sum Layers

    4.3. Efficient Implementations by Compiling PC Layers

    4.4. Analysis: IO and Computation Overhead

  5. Optimizing Backpropagation with PC Flows

  6. Experiments

    6.1. Faster Models with PyJuice

    6.2. Better PCs At Scale

    6.3. Benchmarking Existing PCs

  7. Conclusion, Acknowledgements, Impact Statement, and References

A. Algorithm Details

B. Additional Technical Details

C. Experimental Details

D. Additional Experiments

D. Additional Experiments

D.1. Speed of the Compilation Process

Table 5 reports the compilation time of PCs with different structures and sizes. Experiments were run on a server with an AMD EPYC 7763 64-core processor and eight RTX 4090 GPUs, of which only one was used. The results show that compilation is efficient: even the PD model with close to 1B parameters compiles in around 30 seconds.

Table 5. Average (± standard deviation of 3 runs) runtime (in seconds) of the compilation process of four PCs.
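The "average ± standard deviation of 3 runs" protocol in the caption can be sketched as follows. This is a minimal illustration, not the authors' benchmarking harness: `time_compilation` and `fake_compile` are hypothetical names, the stand-in workload replaces a real PC compilation, and the use of the sample (rather than population) standard deviation is an assumption the source does not settle.

```python
import statistics
import time
from typing import Callable, Tuple


def time_compilation(compile_fn: Callable[[], object], runs: int = 3) -> Tuple[float, float]:
    """Return (mean, sample std) of wall-clock seconds over `runs` calls."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        compile_fn()  # hypothetical callable that builds and compiles one PC
        durations.append(time.perf_counter() - start)
    mean = statistics.mean(durations)
    # stdev needs at least two samples; report 0.0 for a single run.
    std = statistics.stdev(durations) if runs > 1 else 0.0
    return mean, std


def fake_compile() -> None:
    # Stand-in workload so the sketch runs without PyJuice installed.
    sum(i * i for i in range(100_000))


mean_s, std_s = time_compilation(fake_compile, runs=3)
print(f"{mean_s:.3f} ± {std_s:.3f} s")
```

To time an actual model, one would pass a callable that performs the real compilation in place of `fake_compile`.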

D.2. Runtime on Different GPUs

In addition to the RTX 4090 GPU used for the experiments in Table 1, we compare the runtime of PyJuice with the baselines on an NVIDIA A40 GPU. As shown in Table 6, PyJuice remains significantly faster than all baselines for PCs of different sizes.

Table 6. Average (± standard deviation of 5 runs) runtime (in seconds) per training epoch of 60K samples for PyJuice and the baselines on five RAT-SPNs (Peharz et al., 2020b) with different sizes. All other settings are the same as described in Section 6.1.
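When per-epoch runtimes are compared across GPUs, one practical detail is that GPU kernels are typically launched asynchronously, so a wall-clock timer should wait for the device before it starts and before it stops. The sketch below shows that pattern under stated assumptions: `time_epoch`, `step_fn`, and `sync` are hypothetical names, and the source does not describe how its timings were taken. With PyTorch, `torch.cuda.synchronize` could be passed as `sync`; the default no-op keeps the example runnable on a CPU.

```python
import time
from typing import Callable, Iterable


def time_epoch(
    step_fn: Callable[[object], None],
    batches: Iterable[object],
    sync: Callable[[], None] = lambda: None,
) -> float:
    """Wall-clock seconds for one pass over `batches`."""
    sync()  # drain any pending device work before starting the clock
    start = time.perf_counter()
    for batch in batches:
        step_fn(batch)  # hypothetical training step on one mini-batch
    sync()  # wait for queued device work to finish before stopping the clock
    return time.perf_counter() - start


# CPU stand-in: each "step" just sums the batch and records the result.
calls = []
elapsed = time_epoch(lambda b: calls.append(sum(b)), [[1, 2], [3, 4], [5, 6]])
print(f"{len(calls)} steps in {elapsed:.6f} s")
```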

D.3. Runtime on Different Batch Sizes

As a supplement to Table 1, we report the runtime for a RAT-SPN (Peharz et al., 2020b) with 465K nodes and 33.4M edges using batch sizes {8, 16, 32, 64, 128, 256, 512}. To isolate the cost of inference, we record only the time spent on the forward and backward passes and exclude the time used for EM updates. Results are shown in Table 7.

Table 7. Average (± standard deviation of 5 runs) runtime (in seconds) per training epoch (excluding EM updates) of 60K samples for PyJuice and the baselines on a RAT-SPN (Peharz et al., 2020b) with 465K nodes and 33.4M edges. All other settings are the same as described in Section 6.1. OOM denotes out-of-memory.
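A batch-size sweep of this kind can be sketched as below. The epoch size (60K samples) and the batch sizes come from the text; everything else is an assumption for illustration: `sweep`, `batches_per_epoch`, and `fake_fwd_bwd` are hypothetical names, the final partial batch is assumed to be kept rather than dropped, and Python's built-in `MemoryError` stands in for whatever out-of-memory exception a real GPU framework would raise.

```python
import math
import time
from typing import Callable, Dict, Sequence, Union

NUM_SAMPLES = 60_000  # epoch size stated in the table caption
BATCH_SIZES = (8, 16, 32, 64, 128, 256, 512)


def batches_per_epoch(batch_size: int, num_samples: int = NUM_SAMPLES) -> int:
    # Assumption: the last, possibly smaller, batch is still processed.
    return math.ceil(num_samples / batch_size)


def sweep(
    run_fwd_bwd: Callable[[int], None],
    batch_sizes: Sequence[int] = BATCH_SIZES,
) -> Dict[int, Union[float, str]]:
    """Time one epoch of forward/backward passes per batch size; mark OOM."""
    results: Dict[int, Union[float, str]] = {}
    for bs in batch_sizes:
        try:
            start = time.perf_counter()
            for _ in range(batches_per_epoch(bs)):
                run_fwd_bwd(bs)  # forward + backward only, no parameter update
            results[bs] = time.perf_counter() - start
        except MemoryError:  # stand-in for a framework-specific OOM error
            results[bs] = "OOM"
    return results


def fake_fwd_bwd(batch_size: int) -> None:
    # Simulated workload: pretend anything above 256 samples does not fit.
    if batch_size > 256:
        raise MemoryError("simulated out-of-memory")


results = sweep(fake_fwd_bwd)
for bs, value in results.items():
    print(bs, batches_per_epoch(bs), value)
```

Smaller batches mean many more iterations per epoch (7,500 at batch size 8 versus 118 at batch size 512), which is why per-epoch runtime can vary so strongly with batch size.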

Authors:

(1) Anji Liu, Department of Computer Science, University of California, Los Angeles, USA (liuanji@cs.ucla.edu);

(2) Kareem Ahmed, Department of Computer Science, University of California, Los Angeles, USA;

(3) Guy Van den Broeck, Department of Computer Science, University of California, Los Angeles, USA.


This paper is available on arXiv under the CC BY 4.0 DEED license.


