Authors:
(1) Mathias Brossard, Systems Group, Arm Research;
(2) Guilhem Bryant, Systems Group, Arm Research;
(3) Basma El Gaabouri, Systems Group, Arm Research;
(4) Xinxin Fan, IoTeX.io;
(5) Alexandre Ferreira, Systems Group, Arm Research;
(6) Edmund Grimley-Evans, Systems Group, Arm Research;
(7) Christopher Haster, Systems Group, Arm Research;
(8) Evan Johnson, University of California, San Diego;
(9) Derek Miller, Systems Group, Arm Research;
(10) Fan Mo, Imperial College London;
(11) Dominic P. Mulligan, Systems Group, Arm Research;
(12) Nick Spinale, Systems Group, Arm Research;
(13) Eric van Hensbergen, Systems Group, Arm Research;
(14) Hugo J. M. Vincent, Systems Group, Arm Research;
(15) Shale Xiong, Systems Group, Arm Research.
Editor's note: this is part 5 of 6 of a study detailing the development of a framework to help people collaborate securely. Read the rest below.
Table of Links
- Abstract and 1 Introduction
- 2 Hardware-backed Confidential Computing
- 3 IceCap
- 4 Veracruz
- 4.1 Attestation
- 4.2 Programming model
- 4.3 Ad hoc acceleration
- 4.4 Threat model
- 5 Evaluation and 5.1 Case-study: deep learning
- 5.2 Case-study: video object detection
- 5.3 Further comparisons
- 6 Closing remarks and References
5 Evaluation
This section uses the following test platforms: Intel Core i7-8700, 16GiB RAM, 1TB SSD (Core i7, henceforth); c5.xlarge AWS VM, 8GiB RAM, EBS storage (EC2, henceforth); Raspberry Pi 4, 4GiB RAM, 32GB µSD (RPi4, henceforth). We use GCC 9.3.0 for x86-64, GCC 7.5.0 for AArch64, and the Wasi SDK 14.0 with LLVM 13.0 for Wasm.
5.1 Case-study: deep learning
Training datasets, algorithms, and learnt models may be sensitive IP, and the learning and inference processes are vulnerable to malicious changes in model parameters that can negatively influence a model's behavior in ways that are hard to detect [10, 62]. We present two Veracruz case-studies in protecting deep learning (DL, henceforth) applications: privacy-preserving training and inference, and a privacy-preserving model aggregation service, a step toward federated DL. We use Darknet [63, 78] in both cases, and the Open Neural Network eXchange [11, 26] (ONNX, henceforth) as the aggregation format. We focus on the execution time of training, inference, and model aggregation on the Core i7 test platform.
In the training and inference case-study, the program receives input datasets from the respective data providers and a pre-learnt model from a model provider. Thereafter, the provisioned program starts training or inference, protected inside Veracruz. The results—that is, the trained model or prediction—are made available to a result receiver. In the model aggregation case-study, clients conduct local training with their favorite DL frameworks, convert the models to ONNX format, and provision these derived models into Veracruz. The program then aggregates the models, making the result available to all clients. By converting to ONNX locally, we support a broad range of local training frameworks—e.g., PyTorch [74], Tensorflow [1], Darknet, or similar.
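To make the programming model concrete, the following is a minimal sketch of what such a computation might look like when compiled to wasm32-wasi. The file paths and the `run_training` helper are illustrative placeholders rather than part of Veracruz's interface; in a real deployment the paths and capabilities are fixed by the global policy.

```rust
// A minimal sketch of a Veracruz computation, compiled to wasm32-wasi.
// The file paths and `run_training` are illustrative placeholders; real
// paths are fixed by the global policy and the Wasi capabilities granted.
use std::fs;

fn main() -> std::io::Result<()> {
    // Inputs provisioned over TLS by the data and model providers appear
    // as ordinary files in the in-enclave synthetic filesystem.
    let dataset = fs::read("/input/mnist-batch.bin")?;
    let model = fs::read("/input/lenet-weights.bin")?;

    // Training (or inference) runs entirely inside the isolate.
    let trained = run_training(&dataset, &model);

    // The result receiver later retrieves this file over TLS.
    fs::write("/output/trained-model.bin", trained)?;
    Ok(())
}

// Placeholder standing in for the Darknet training loop.
fn run_training(_dataset: &[u8], model: &[u8]) -> Vec<u8> {
    model.to_vec()
}
```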
We trained a LeNet [57] on MNIST [57], a dataset of handwritten digits consisting of 60,000 training and 10,000 validation images. Each image is 28×28 pixels and less than 1KiB; we used a batch size of 100 in training, obtaining a trained model of 186KiB. We report the average of 20 trials for training on 100 batches (hence, 10,000 images) followed by inference on one image. For aggregation, we use three copies of this Darknet model (186KiB each), converted into three ONNX models (26KiB each), and perform 200 trials, as aggregation times are significantly shorter. Results are presented in Fig. 3.
For all DL tasks we observe the same execution time under Wasmtime and Veracruz, as expected, with both around 2.1–4.1× slower than native CPU-only execution, likely due to the more aggressive code optimization available in native compilers. However, Wasmtime and Veracruz diverge for file operations such as loading and saving model data. Loading data from disk is 1.2–3.1× slower under Wasmtime than when executing natively. I/O in Veracruz, by contrast, is usually faster than under Wasmtime, and sometimes faster than native execution, e.g., when saving images during inference. This is likely because Veracruz's in-memory filesystem exhibits faster read and write speeds than the SSD of the test machine.
5.2 Case-study: video object detection
We have used Veracruz to prototype a Confidential FaaS, running on AWS Nitro Enclaves and using Kubernetes [53]. In this model, a cloud infrastructure or other delegate initializes an isolate containing only the Veracruz runtime and provides an appropriate global policy file. Confidential functions are registered in a Confidential Computing as a Service (CCFaaS, henceforth) component, which acts as a registry for clients wishing to use the service and which collaborates, on behalf of clients, with a Veracruz as a Service (VaaS, henceforth) component which manages the lifetime of any spawned Veracruz instances. Together, the CCFaaS and VaaS components draft policies and initialize Veracruz instances, while attestation is handled by clients, using the proxy attestation service.
Building atop this confidential FaaS infrastructure, we applied Veracruz in a full end-to-end encrypted video object detection flow (see Fig. 5). Our intent is to demonstrate that Veracruz can be applied to industrially-relevant use-cases: here, a video camera manufacturer wishes to offer an object detection service to their customers while providing believable guarantees that they cannot access customer video.
The encrypted video clips originating from an IoTeX Ucam video camera [47] are stored in an AWS S3 bucket. The encryption key is owned by the camera operator and perhaps generated by client software on their mobile phone or tablet. Independently, a video processing and object detection function, compiled to Wasm, is registered with the CCFaaS component which takes on the role of program provider in the Veracruz computation. This function makes use of the Cisco openh264 library as well as the Darknet neural network framework and a prebuilt YOLOv3 model, as previously discussed in §5.1, for object detection (our support for Wasi eased this porting).
Upon the request of the camera owner, the CCFaaS and VaaS infrastructure spawn a new AWS Nitro Enclave loaded with the Veracruz runtime and configured with an appropriate global policy that lists the camera owner as having the roles of data provider and result receiver. The confidential FaaS infrastructure forwards the global policy to the camera owner, where it is automatically analyzed by their client software, with the camera owner thereafter attesting the AWS Nitro Enclave instance. If the global policy is acceptable and attestation succeeds, the camera owner connects to the spawned isolate containing the Veracruz runtime and provisions their decryption key over TLS in their role as data provider. The encrypted video clip is then provisioned into the isolate by a dedicated AWS S3 application, also listed in the global policy as a data provider, and the computation can go ahead. Once complete, metadata containing the bounding boxes of any objects detected in the frames of the video clip can be securely retrieved by the camera owner over TLS, in their result receiver role, for interpretation by their client software.
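For illustration, a global policy for this flow might look roughly like the following excerpt. The field names and values here are illustrative assumptions, not the exact Veracruz policy schema; they capture only the information described above: who may provision or retrieve which files, which program may run, the expected enclave measurement, and the attestation endpoint.

```json
{
  "identities": [
    { "certificate": "<camera owner certificate>",
      "file_rights": [ { "path": "/input/decryption-key", "rights": "write" },
                       { "path": "/output/detections.json", "rights": "read" } ] },
    { "certificate": "<AWS S3 application certificate>",
      "file_rights": [ { "path": "/input/video.enc", "rights": "write" } ] }
  ],
  "programs": [
    { "program_file_name": "object-detection.wasm",
      "program_hash": "<SHA-256 of the Wasm binary>" }
  ],
  "runtime_manager_hash_nitro": "<expected enclave measurement>",
  "proxy_attestation_server_url": "https://attestation.example.com"
}
```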
Note that this FaaS infrastructure preserves desirable cloud application characteristics: the computation is on-demand and scalable, and our infrastructure allows multiple instances of Veracruz, running different functions, to execute concurrently. Only the AWS S3 application, the camera owner's client application, and the video decoding and object detection function are specific to this use-case; all other modules are generic, allowing other applications to be implemented. Moreover, note that no user credentials or passwords are shared directly with the FaaS infrastructure in realizing this flow, beyond the name of the video clip to retrieve from the AWS S3 bucket and a one-time access credential for the AWS S3 application. Decryption keys are shared only with the Veracruz runtime inside an attested isolate.
We benchmark by passing a 1920×1080 video to the object detection program, which decodes the video frame by frame, then converts, downscales, and passes each frame to the ML model. We compare four configurations on two different platforms:
• On EC2: a native x86-64 binary on Amazon Linux; a Wasm binary under Wasmtime-0.27; a Wasm binary inside Veracruz as a Linux process; a Wasm binary inside Veracruz on AWS Nitro Enclaves. The video is 240 frames long and fed to the YOLOv3-608 model [79].
• On RPi4: a native AArch64 binary on Ubuntu 18.04 Linux; a Wasm binary under Wasmtime-0.27; a Wasm binary inside Veracruz as a Linux process; a Wasm binary inside Veracruz on IceCap. Due to memory limits, the video is 240 frames long and fed to the YOLOv3-tiny model [79].
We take the native x86-64 configuration as our baseline, and present average runtimes for each configuration, along with observed extremes, in Fig. 4.
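The object detection function follows the per-frame pipeline just described; the Rust sketch below shows its shape. Every type and helper here is a hypothetical stand-in for the openh264 decoder and Darknet/YOLOv3 inference calls made by the actual Wasm binary.

```rust
// Schematic of the object detection pipeline: decode each frame, convert
// and downscale it, then run inference. All types and helpers below are
// hypothetical stand-ins for the openh264 and Darknet/YOLOv3 calls.

struct Frame { width: usize, height: usize, rgb: Vec<u8> }
struct Detection { label: String, bbox: (f32, f32, f32, f32) }

// Stand-in for pulling the next decoded frame out of openh264.
fn decode_next_frame(_stream: &mut &[u8]) -> Option<Frame> { None }

// Stand-in for colour conversion and downscaling to the network input size.
fn downscale(frame: &Frame, w: usize, h: usize) -> Frame {
    Frame { width: w, height: h, rgb: frame.rgb.clone() }
}

// Stand-in for Darknet/YOLOv3 inference over one frame.
fn yolo_detect(_frame: &Frame) -> Vec<Detection> { Vec::new() }

fn process_video(mut stream: &[u8]) -> Vec<Detection> {
    let mut detections = Vec::new();
    while let Some(frame) = decode_next_frame(&mut stream) {
        let input = downscale(&frame, 608, 608); // YOLOv3-608 input resolution
        detections.extend(yolo_detect(&input));  // bounding boxes per frame
    }
    detections
}
```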
EC2 results Wasm (with experimental SIMD support in Wasmtime) has an overhead of ∼39% over native code; most CPU cycles are spent in matrix multiplication, which the native compiler autovectorizes better than the Wasm compiler. The vast majority of execution time is spent in neural network inference rather than video decoding or image downscaling. Since execution time is dominated by Wasm execution, Veracruz's overhead is negligible. A ∼5% performance discrepancy exists between Nitro and Wasmtime, which may be explained by our observation that Nitro is slower at loading data into an enclave but faster at writing; however, Nitro runs a different kernel with a different configuration on a separate CPU, making the cause hard to pinpoint. Deployment overheads for Nitro are presented in Table 2, showing a breakdown of the costs of provisioning a new Veracruz instance.
RPi4 results The smaller ML model significantly improves inference performance at the expense of accuracy. Wasm has an overhead of ∼10% over native code, smaller than the gap on EC2, which could be due to reduced vectorization support in GCC's AArch64 backend. Veracruz overhead is again negligible, though IceCap induces an overhead of ∼3% over Veracruz-Linux. This approximately matches the ∼2% overhead for CPU-bound workloads measured in Fig. 1, explained by extra context switching through trusted resource-management services during scheduling operations.
Using “native modules”, introduced in §4.3, explicit support for neural network inference could be added to the Veracruz runtime, though our results above suggest at most a ∼38% performance boost from pursuing this, and likely less, owing to the cost of marshalling data between the native module and the Veracruz filesystem. For larger performance boosts, dedicated ML acceleration could be used, requiring support from the Veracruz runtime, though establishing trust in accelerators outside the isolate is hard, with PCIe attestation still a work in progress.
5.3 Further comparisons
PolyBench/C microbenchmarks We further evaluate the performance of Veracruz on compute-bound programs using the PolyBench/C suite (version 4.2.1-beta) [75], a collection of small, computationally intensive kernels. We compare the execution time of four configurations on the EC2 instance running Amazon Linux 2: a native x86-64 binary; a Wasm binary under Wasmtime-0.27; a Wasm binary under Veracruz as a Linux process; and a Wasm binary executing under Veracruz in an AWS Nitro Enclave. We take x86-64 as our baseline and present results in Fig. 6. Wasmtime's overhead against native CPU execution is relatively small, with a geometric mean of ∼13%, though some test programs execute even faster under Wasmtime than when natively compiled. Again, we compile our test programs with Wasmtime's experimental support for the SIMD proposal, though this boosts performance for only a few programs. Veracruz-Linux does not exhibit a visible overhead compared to Wasmtime, which is expected as most execution time is spent in Wasmtime, and the presence of the Veracruz VFS is largely irrelevant for CPU-bound programs. Veracruz-Nitro exhibits a small but noticeable overhead (∼3%) compared to Veracruz-Linux, likely for the reasons mentioned in §5.2.
VFS performance We evaluate Veracruz VFS I/O performance, previously discussed in §4.2. Performance is measured by timing common granular file-system operations and dividing by input size, to find the expected bandwidth.
Results, gathered on the Core i7 test platform with a swap size of zero so that measurements would not be invalidated by physical disk access, are presented in Fig. 7. Here, read denotes the bandwidth of file read operations, write the bandwidth of file write operations with no initial file, and update the bandwidth of file write operations over an existing file. We use two access patterns, in-order and random, to avoid measuring only file-system-friendly access patterns. All random inputs, for both data and access patterns, use reproducible, pseudorandom data generated by xorshift64 to ensure consistency between runs. All operations manipulate a 64MiB file with a 16KiB buffer size—in practice, we expect most files to be within an order of magnitude of this size.
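For reference, xorshift64 is a tiny deterministic generator, so the same seed reproduces the same data and access patterns on every run. The sketch below shows the variant commonly attributed to Marsaglia; whether the benchmark harness uses exactly these shift constants and this seed is an assumption.

```rust
// xorshift64 (Marsaglia's variant): deterministic, so the same seed always
// reproduces the same byte stream across benchmark runs.
struct XorShift64 { state: u64 }

impl XorShift64 {
    fn new(seed: u64) -> Self { Self { state: seed.max(1) } } // state must be non-zero
    fn next(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
}

fn main() {
    let mut rng = XorShift64::new(0x9e3779b97f4a7c15); // arbitrary example seed
    // Fill one 16KiB buffer, matching the benchmark's buffer size.
    let buf: Vec<u8> = (0..(16 * 1024 / 8))
        .flat_map(|_| rng.next().to_le_bytes())
        .collect();
    assert_eq!(buf.len(), 16 * 1024);
}
```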
We compare variations of our VFS against Linux's tmpfs, the standard in-memory filesystem for Linux. Veracruz copy moves data between the Wasm sandbox's memory and the VFS through two copies, one at the Wasi API layer and one at the internal VFS API layer. Veracruz no-copy improves on this by performing a single copy directly from the Wasm sandbox's memory into the destination in the VFS, made possible by Rust's borrow checker, which can express the temporarily shared ownership of the sandbox's memory without sacrificing memory or lifetime safety. In theory this overhead can be reduced to zero copies through memory mapping; however, this API is not available in standard Wasi. Veracruz no-copy+soo is our latest design, extending the no-copy implementation with a small-object-optimized (SOO) iovec implementation—a Wasi structure describing a set of buffers containing data to be operated on, which for the majority of operations references a single buffer. Through this, we inline two or fewer buffers into the iovec structure itself, completely removing memory allocations from the read and write path for all programs we tested. The performance impact is, however, negligible.
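A minimal sketch of what such a small-object-optimized iovec container might look like in Rust follows; the type and field names are illustrative, not the actual Veracruz API. Requests with at most two buffers—the overwhelmingly common case under Wasi—stay inline and allocate nothing; larger requests fall back to a heap-allocated vector.

```rust
// Sketch of a small-object-optimized iovec container. (Illustrative names,
// not the actual Veracruz API.)

#[derive(Clone, Copy)]
struct IoVec { offset: u32, len: u32 }   // references a region of guest memory

enum IoVecs {
    Inline { bufs: [IoVec; 2], count: u8 },  // 0, 1, or 2 buffers: no allocation
    Heap(Vec<IoVec>),                        // rare many-buffer case
}

impl IoVecs {
    fn from_slice(bufs: &[IoVec]) -> Self {
        if bufs.len() <= 2 {
            let mut inline = [IoVec { offset: 0, len: 0 }; 2];
            inline[..bufs.len()].copy_from_slice(bufs);
            IoVecs::Inline { bufs: inline, count: bufs.len() as u8 }
        } else {
            IoVecs::Heap(bufs.to_vec())
        }
    }

    fn as_slice(&self) -> &[IoVec] {
        match self {
            IoVecs::Inline { bufs, count } => &bufs[..*count as usize],
            IoVecs::Heap(v) => v,
        }
    }
}
```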
Being an in-memory filesystem, the internal representation is relatively simple: directories and a global inode table are implemented using hash tables, with each file represented as a vector of bytes. While apparently naïve, these data structures have seen decades of optimization for in-memory performance, and even sparse files perform efficiently due to RAM over-commitment by the runtimes. Still, we were surprised to see Veracruz perform so close to tmpfs, even nearly doubling tmpfs performance for reads, likely due to the overhead of the kernel syscalls needed to communicate with tmpfs in Linux. (Unfortunately, tmpfs is deeply integrated into the Linux VFS layer, so it is not possible to benchmark tmpfs in isolation.)
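A minimal sketch of this layout, using illustrative names rather than the actual Veracruz internals, might look as follows: a global inode table and per-directory name tables as hash maps, with file contents as plain byte vectors that grow on write.

```rust
// Minimal sketch of the in-memory layout described above.
// (Illustrative names, not the Veracruz internals.)
use std::collections::HashMap;

type Inode = u64;

enum Node {
    File { data: Vec<u8> },                    // file contents, grown on write
    Dir { entries: HashMap<String, Inode> },   // name -> inode lookup
}

struct Filesystem {
    inodes: HashMap<Inode, Node>,  // global inode table
    next_inode: Inode,
}

impl Filesystem {
    fn write_at(&mut self, inode: Inode, offset: usize, buf: &[u8]) {
        if let Some(Node::File { data }) = self.inodes.get_mut(&inode) {
            if data.len() < offset + buf.len() {
                data.resize(offset + buf.len(), 0);   // extend the byte vector
            }
            data[offset..offset + buf.len()].copy_from_slice(buf);
        }
    }
}
```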
Both Veracruz and tmpfs use hash tables to store directory information; the significant differences lie in the file data structure and memory allocator. In Veracruz we use byte vectors backed by the runtime's general-purpose allocator, whereas tmpfs uses a tree of pages backed by the Linux VFS's page cache, which acts as a cache-aware, fixed-size allocator. We expect this page cache to have a much cheaper allocation cost, at the disadvantage of storing file data in non-linear blocks of memory—observable in the difference between the write and update measurements. For write, tmpfs outperforms Veracruz due to faster memory allocations and no unnecessary copies, while update requires no memory allocation and shows more comparable performance.
Fully-homomorphic encryption An oft-suggested use-case for fully-homomorphic encryption (FHE, henceforth) is protecting delegated computations. We briefly compare Veracruz against SEAL [61], a leading FHE library, in computing a range of matrix multiplications over square matrices of various dimensions. The algorithms in both cases are written in C, though floating-point arithmetic is replaced by the SEAL multiplication function for use with FHE. Results are presented in Fig. 8, and demonstrate that FHE overheads remain impractical, even for simple computations.
Teaclave Apache Teaclave [33] is a privacy-preserving FaaS infrastructure built on Intel SGX, supporting Python and Wasm with a custom programming model using the Wamr [3] interpreter. We compare the performance of Teaclave running under Intel SGX with Veracruz as a Linux process, both on Core i7, and with Veracruz on AWS Nitro Enclaves on EC2—admittedly an imperfect comparison, due to significant differences in design, isolation technology, Wasm runtime, and hardware. We run the PolyBench/C suite with its mini dataset—Teaclave's default configuration fails on larger datasets—and measure end-to-end execution time, including initialization, provisioning, execution, and fetching of results, presented in Fig. 9. While Veracruz executes Wasm faster than Teaclave—with Veracruz under AWS Nitro exhibiting a mean 2.11× speed-up over Teaclave in simulation mode, and faster still relative to Teaclave under SGX—the fixed initial overhead of Veracruz, ∼4s in Linux and ∼2.7s in AWS Nitro, dominates the overall overhead in either case.
This paper is available on arxiv under CC BY 4.0 DEED license.