Authors:

(1) Pavan L. Veluvali, Max Planck Institute for Dynamics of Complex Technical Systems, Sandtorstr. 1, 39106 Magdeburg;

(2) Jan Heiland, Max Planck Institute for Dynamics of Complex Technical Systems, Sandtorstr. 1, 39106 Magdeburg;

(3) Peter Benner, Max Planck Institute for Dynamics of Complex Technical Systems, Sandtorstr. 1, 39106 Magdeburg.

Abstract and 1. Introduction

Existing Solutions

2. MaRDIFlow

Minimum working examples

Spinodal decomposition in a binary A-B alloy

Summary and Outlook, Acknowledgments, Data Availability, and References

Abstract: Numerical algorithms and computational tools are instrumental in navigating and addressing complex simulation and data processing tasks. The exponential growth of metadata and parameter-driven simulations has led to an increasing demand for automated workflows that can replicate computational experiments across platforms. In general, a computational workflow is defined as a sequential description for accomplishing a scientific objective, often expressed in terms of tasks and their associated data dependencies. If characterized through their input-output relations, workflow components can be structured to allow interchangeable use of individual tasks and their accompanying metadata. In the present work, we develop a novel computational framework, namely MaRDIFlow, that focuses on the automated abstraction of metadata embedded in an ontology of mathematical objects. This framework also addresses the inherent execution and environmental dependencies by incorporating them into multi-layered descriptions. Additionally, we demonstrate a working prototype with example use cases and methodically integrate them into our workflow tool and data provenance framework. Furthermore, we show how to best apply the FAIR principles to computational workflows, such that abstracted components are Findable, Accessible, Interoperable, and Reusable.

1 Introduction

Data-intensive computational studies are a substantial part of scientific endeavors across all disciplines. Computational workflows have been used as a systematic way of describing the methods needed, the data involved, as well as the computing resources and infrastructures. With ever more complex simulation models and ever larger primary data volumes, CSE (Computational Science and Engineering) workflow descriptions themselves have become an enabler for research beyond the execution of simulations, for example, to extract latent information from various data repositories and to compare methodologies across diverse data and computational frameworks [AGMT17].

The FAIR principles [WDA+16] describe a set of requirements for data management and stewardship to ensure that research data are Findable, Accessible, Interoperable, and Reusable. Each guiding principle is intended to define the degree of ‘FAIRness’ by describing distinct considerations for contemporary environments, tools, vocabularies, and data infrastructures. While the elements of the FAIR principles are related yet separable, they are equally applied to identify, describe, discover, and reuse metadata assets of scholarly outputs. Overall, the FAIR principles act as a guide to assist data stewards in evaluating their implementation choices. More recently, they have been adopted by funding agencies, such as the German Research Foundation [For22], for developing assessment metrics of research metadata across various disciplines [DHM+20].

While there is a knowledge base for CSE workflows from a (software) engineering point of view [HW09, BCG+19], and while it has been acknowledged that, for documentation, model descriptions and code can complement each other [FHHS16], an inclusive abstract description of CSE workflows has not yet been established. For combining models, code, and data in the description of CSE simulations in a virtual lab notebook, Jupyter notebooks have gained popularity [KRKP+16a]. Services like Code Ocean [CSFG19] also target the combination of code and model descriptions. Still, little effort has been made to use abstraction for CSE workflow components in view of documentation tools that are generally applicable and that scale well with ever more demanding and sophisticated simulations. Lately, with the advancement of data-intensive research, there has been a rise in the development of automated and reusable workflows that aim to seamlessly integrate computer-based and laboratory experiments through artificial intelligence [Nat22].

In this work, we analyze general and particular components and provide an abstract multi-layered description of CSE workflows: each component is characterized through an input/output description so that model, data, and code can be used interchangeably and, in the best case, redundantly. For that, we describe suitable metadata and a low-level language for the description of general CSE workflows [VHB23]. Additionally, we emphasize that the introduction of redundancy in the representation of models, code, and data is a positive feature for CSE workflows: it enhances their robustness by ensuring that compatible alternatives remain available when execution issues arise. With interchangeable and multi-level components, workflows become more adaptable and reproducible, contributing to the overall reliability of a scientific task. Generally, we understand a CSE workflow as a chain of one or more interconnected models used for simulations. In the existing literature, a CSE workflow is defined as a precise description of a multi-step procedure that coordinates multiple tasks and their metadata dependencies [GCBSR+20]. In workflow systems, each task is represented by the execution of a computational process, such as running a code, calling a command-line tool, accessing a database, submitting a job to an HPC cluster, or executing a data processing script.
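To make this input/output characterization concrete, the following minimal sketch (in Python) shows how a workflow component could be described purely by its I/O signature and equipped with several interchangeable realizations, for example an executable code and a fallback to cached data. The class names, fields, and file paths are hypothetical illustrations, not MaRDIFlow's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class IOSpec:
    """Input/output signature of a component: name -> (meta)data format."""
    inputs: Dict[str, str]     # e.g. {"mesh": "HDF5"}
    outputs: Dict[str, str]    # e.g. {"u": "VTK"}


@dataclass
class Component:
    """A workflow component with redundant, interchangeable realizations."""
    name: str
    io: IOSpec
    realizations: List[Callable[[Dict], Dict]] = field(default_factory=list)

    def run(self, inputs: Dict) -> Dict:
        # Try each realization in order; fall back to the next on failure.
        for realize in self.realizations:
            try:
                return realize(inputs)
            except (RuntimeError, FileNotFoundError):
                continue
        raise RuntimeError(f"no executable realization for '{self.name}'")


# Example: the first realization (an external solver) is unavailable,
# so the component transparently falls back to precomputed data.
def run_solver(inputs):
    raise FileNotFoundError("external solver not installed")


def load_cached(inputs):
    return {"u": "results/u_cached.vtk"}


step = Component("poisson_solve",
                 IOSpec(inputs={"mesh": "HDF5"}, outputs={"u": "VTK"}),
                 realizations=[run_solver, load_cached])
print(step.run({"mesh": "data/mesh.h5"}))   # -> {'u': 'results/u_cached.vtk'}
```

Because only the I/O signature is fixed, a model description, an alternative implementation, or previously recorded data can stand in for one another whenever one realization is unavailable.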

In the general treatment of such workflows, the following constraints need to be taken into account:

  1. In particular in a CSE context, each model might be arbitrarily complex and computationally demanding.

  2. Often, the particular numerical realization represents a compromise between accuracy and computational costs.

  3. Within a workflow, models are likely implemented in different frameworks or languages.

  4. In the case of commercial codes, which may well form one part of a workflow, some simulation models might not be fully available but only evaluable through interfaces.

  5. Possibly, the actual simulation code is not available at all; only descriptions and, in the better case, alternative implementations exist.

Nevertheless, the goal of any CSE workflow framework is to offer a specialized programming environment that minimizes the effort required by scientists or researchers to perform a computational experiment [VHB23]. In general, a CSE workflow description can be divided into the distinct phases listed below, which govern its functional operation.

• Composition and abstraction

• Execution

• Meta-data mapping and provenance

Firstly, during composition and abstraction, a CSE workflow is created either from scratch or by modifying a previously designed workflow, whereby the user relies on different workflow components and data catalogs. Well-known methods for editing and composing workflows are textual, graphical, or based on semantic models [DGST09]. The workflow then abstracts software components written by third parties and handles heterogeneity by shielding the user from run-time incompatibilities and complexities. Secondly, during execution, the workflow components are executed either by a computational engine or via a subsystem, wherein a static or an adaptive model is implemented to realize the metadata; repeatable and reproducible pipelines that manage the control and data flow of a simulation are an important aspect of this phase. Once the workflow is well defined, all or portions of it are sent for mapping. Finally, the data and all associated metadata and provenance information are recorded and placed in user-defined registries, which can then be accessed to design new workflow descriptions. Through these stages, CSE workflow descriptions act as modular building blocks with standardized interfaces, and are generally linked and run together by a computational framework.
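The brief sketch below (again in Python, with hypothetical function names and a made-up registry file) illustrates how these three phases could interact for a small two-step chain: components are composed into an ordered chain, executed with outputs fed forward, and the resulting execution trace is persisted as provenance. It is an illustrative sketch, not MaRDIFlow's actual implementation.

```python
import json
import time
from pathlib import Path


def compose(*components):
    """Phase 1: composition -- assemble components into an ordered chain."""
    return list(components)


def execute(chain, inputs):
    """Phase 2: execution -- run the chain, feeding outputs forward and
    recording what each step consumed and produced."""
    data, trace = dict(inputs), []
    for step in chain:
        outputs = step(data)
        trace.append({"component": step.__name__,
                      "consumed": sorted(data),
                      "produced": sorted(outputs),
                      "timestamp": time.time()})
        data.update(outputs)
    return data, trace


def record_provenance(trace, registry=Path("provenance.json")):
    """Phase 3: metadata mapping and provenance -- persist the execution
    trace to a user-defined registry for later reuse."""
    registry.write_text(json.dumps(trace, indent=2))


# Example: a two-step chain (a mesh-generation stub followed by a solver stub).
def generate_mesh(data):
    n = data["resolution"]
    return {"mesh": [i / n for i in range(n + 1)]}


def solve(data):
    return {"solution": [x ** 2 for x in data["mesh"]]}


final_state, trace = execute(compose(generate_mesh, solve), {"resolution": 4})
record_provenance(trace)
```

The recorded trace captures, per step, which inputs were consumed and which outputs were produced, which is the minimal information a later workflow would need to rediscover and reuse the components.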

The present article is organized as follows: in the following section, we discuss the current state of the art in Jupyter notebooks and computational workflows. Afterwards, we present our research data management tool, namely MaRDIFlow, and discuss its framework and its usage as a command-line tool in detail. Next, we illustrate our RDM tool via minimum working examples. Lastly, we put forward the conclusions and future directions of the present work.

This paper is available on arXiv under the CC BY 4.0 DEED license.