Provenance

The graph concept

Data provenance and logical provenance

AiiDA automatically stores entities in its database and links them, forming a directed graph. This directed graph automatically tracks the provenance of all data produced by calculations or returned by workflows. By tracking the provenance in this way, one can always fully retrace how a particular piece of data came into existence, thus ensuring its reproducibility.

In particular, we define two types of provenance:

  • The data provenance, i.e. the part of the graph that consists ONLY of data and calculations (without considering workflows), connected only by input links to calculations and output create links. The data provenance records the full history of how data has been generated. Due to the causality principle, the data provenance part of the graph is a directed acyclic graph (DAG), i.e. its nodes are connected by directed edges and it does not contain any cycles.
  • The logical provenance, consisting of calculations and workflows, and input links to workflows, output return links, and call links. The logical provenance is not acyclic (e.g., a workflow that acts as a filter can return one of its inputs).

The data provenance is essentially a log of which calculation generated each piece of data and with which inputs. The data provenance alone already guarantees reproducibility (one could rerun the calculations one by one with the provided inputs and would obtain the same outputs). The logical provenance gives additional information on why a specific calculation was run. Imagine the case in which you start from 100 structures, you have a filter operation that picks one, and then you run a simulation on it. The data provenance only shows the simulation you ran on the structure that was picked, while the logical provenance can also show that the specific structure was not picked at random but via a specific workflow logic.
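The distinction between the two provenance types can be sketched in plain Python, without AiiDA itself. The node names (S1, S2, W1, C1, R1), the link-kind strings and the helper functions below are assumptions made for this illustration, not AiiDA's API. The sketch models a filter workflow that returns one of its inputs, then checks each subgraph for cycles.

```python
# Each edge is (source, target, link_kind). All names are illustrative only.
EDGES = [
    ("S1", "W1", "input_work"),  # two candidate structures enter the filter workflow
    ("S2", "W1", "input_work"),
    ("W1", "S2", "return"),      # the workflow returns one of its own inputs
    ("S2", "C1", "input_calc"),  # the picked structure is the input of a calculation
    ("C1", "R1", "create"),      # the calculation creates a result
]

DATA_LINKS = {"input_calc", "create"}
LOGICAL_LINKS = {"input_work", "return", "call"}


def subgraph(edges, kinds):
    """Keep only the edges whose link kind is in `kinds`."""
    return [(src, dst) for src, dst, kind in edges if kind in kinds]


def has_cycle(edges):
    """Detect a cycle with a depth-first search using three node states."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    unvisited, in_progress, done = 0, 1, 2
    state = {node: unvisited for node in graph}

    def visit(node):
        state[node] = in_progress
        for child in graph[node]:
            if state[child] == in_progress:
                return True  # back edge: we returned to a node on the current path
            if state[child] == unvisited and visit(child):
                return True
        state[node] = done
        return False

    return any(state[node] == unvisited and visit(node) for node in graph)


data_provenance = subgraph(EDGES, DATA_LINKS)
logical_provenance = subgraph(EDGES, LOGICAL_LINKS)

print(has_cycle(data_provenance))     # False: the data provenance is a DAG
print(has_cycle(logical_provenance))  # True: S2 -> W1 -> S2
```

In this toy graph the data provenance keeps only the calculation and its input and output, while the logical provenance contains the loop introduced by the filter workflow returning S2.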

Other entities

Besides nodes (data and processes), AiiDA defines a few more entities, like computers (representing a computer, supercomputer or computer cluster where calculations are run or data is stored), groups (that group together nodes for organizational purposes) and users (to keep track of the user who first generated a given node, computer or group).

In the following section we describe in more detail how the general provenance concepts above are actually implemented in AiiDA, with specific reference to the python classes that implement them and the class-inheritance relationships.

The implementation

Graph nodes

The nodes of the AiiDA provenance graph can be grouped into two main types: process nodes (ProcessNode), that represent the execution of calculations or workflows, and data nodes (Data), that represent pieces of data.

In particular, process nodes are divided into two subcategories:

  • calculation nodes (CalculationNode): Represent code execution that creates new data. These are further subdivided into two subclasses:

    • CalcJobNode: Represents the execution of a calculation external to AiiDA, typically via a job batch scheduler (i.e. the execution of a simulation code on some computer).
    • CalcFunctionNode: Represents the execution of a python function (wrapped with the @calcfunction decorator) that takes AiiDA data nodes as input and creates AiiDA data nodes as output (see the description of calcfunctions), e.g. when manipulating and processing data objects in the python interpreter or inside a workflow, without the need to run an external code.
  • workflow nodes (WorkflowNode): Represent python code that orchestrates the execution of other (sub)workflows or of calculations and that optionally returns the data created by the calculations it called. These are further subdivided into two subclasses:

    • WorkChainNode: Represents the execution of a python class instance with built-in checkpoints, such that the process may be paused/stopped/resumed.
    • WorkFunctionNode: Represents the execution of a python function (wrapped with the @workfunction decorator).

The class hierarchy of the process nodes is shown in the figure below.

../_images/class_hierarchy.png

Figure 1: The hierarchy of the ORM classes for the process nodes. Only instances of the lowest level of classes will actually enter into the provenance graph. The two upper levels have a mostly taxonomical purpose as they allow us to refer to multiple classes at once when reasoning about the graph as well as a place to define common functionality (see section on processes).
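As a rough illustration of the hierarchy described above, the following sketch declares empty stand-in classes that reuse the names from the text. They are not the real AiiDA ORM classes, and the `leaf_classes` helper is a hypothetical function written for this example. It collects the lowest-level classes, which are the only ones whose instances enter the provenance graph.

```python
# Stand-in classes mirroring the names in the text; NOT the real AiiDA classes.
class ProcessNode: ...
class CalculationNode(ProcessNode): ...
class WorkflowNode(ProcessNode): ...
class CalcJobNode(CalculationNode): ...
class CalcFunctionNode(CalculationNode): ...
class WorkChainNode(WorkflowNode): ...
class WorkFunctionNode(WorkflowNode): ...


def leaf_classes(cls):
    """Return the classes below `cls` that have no subclasses of their own."""
    subclasses = cls.__subclasses__()
    if not subclasses:
        return [cls]
    leaves = []
    for sub in subclasses:
        leaves.extend(leaf_classes(sub))
    return leaves


leaf_names = sorted(c.__name__ for c in leaf_classes(ProcessNode))
print(leaf_names)

# The upper levels let us refer to several concrete classes at once.
print(issubclass(CalcJobNode, CalculationNode))  # True
print(issubclass(CalcJobNode, WorkflowNode))     # False
```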

As for data nodes, the base class (Data) is subclassed to provide functionality specific to each data type and python methods to operate on it. Often, the name of the subclass has the word “Data” appended to it, but this is not a requirement. A few examples:

  • Float, Int, Bool, Str, List, …
  • Dict: represents a dictionary of key-value pairs - these are parameters of a general nature that do not need to belong to more specific data sub-classes
  • StructureData: represents crystal structure data (containing chemical symbols, atomic positions of the atoms, periodic cell for periodic structures, …)
  • ArrayData: represents generic numerical arrays of data (python numpy arrays)

For more detailed information see AiiDA data types.

In the next section we introduce the links between nodes, creating the AiiDA graph, and then we show some examples to clarify what we introduced up to now.

Graph examples

With these basic definitions of AiiDA’s provenance graph in place, let’s take a look at some more interesting examples. Consider the sequence of computations that adds two numbers a and b and multiplies the result by a third number c. This sequence, as represented in the provenance graph, would look something like what is shown in Figure 4.

../_images/schematic_provenance_03_data_add_multiply.png

Figure 4: The DAG for computing (a+b)*c. We have two simple calculations: C1 represents the sum and C2 the multiplication. The two data nodes D1 and D2 are the inputs of C1, which creates the data node D4. Together with D3, D4 then forms the input of C2, which multiplies their values to create the product, represented by D5.
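To make the idea of retracing data concrete, the graph of Figure 4 can be written as a plain edge list and walked backwards from D5. This is a minimal sketch in plain Python. The `ancestors` helper is a hypothetical function for this example and does not use AiiDA's query tools.

```python
# Edges of the Figure 4 graph as (source, target) pairs.
EDGES = [
    ("D1", "C1"), ("D2", "C1"),  # inputs of the sum
    ("C1", "D4"),                # the sum creates D4
    ("D3", "C2"), ("D4", "C2"),  # inputs of the multiplication
    ("C2", "D5"),                # the multiplication creates D5
]


def ancestors(node, edges):
    """Collect every node from which `node` can be reached."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("D5", EDGES)))  # ['C1', 'C2', 'D1', 'D2', 'D3', 'D4']
print(sorted(ancestors("D4", EDGES)))  # ['C1', 'D1', 'D2']
```

The full history of D5 includes both calculations and all three original inputs, whereas D4 depends only on the sum and its two inputs.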

In this simple example, there was no external process that controlled the exact sequence of these operations. When we introduce a workflow that calls the two calculations in succession, we get the graph shown in Figure 5.

../_images/schematic_provenance_04_call_workflow.png

Figure 5: The same calculation (a+b)*c is performed using a workflow. Here the data nodes D1, D2 and D3 are the inputs of the workflow W1, which calls calculation C1 with inputs D1 and D2, and then calls calculation C2, using as inputs D3 and D4 (which was created by C1). Calculation C2 creates data node D5, which is finally returned by workflow W1.

Notice that if we were to omit the workflow node and all its links from the provenance graph in Figure 5, we would end up with the exact same graph as shown in Figure 4.
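This observation can be checked with a small plain-Python sketch. The edge lists below transcribe the two figure captions. The link-kind strings are labels chosen for this illustration rather than values taken from AiiDA.

```python
# Figure 4: data provenance only, as (source, target, link_kind) triples.
FIGURE_4 = {
    ("D1", "C1", "input_calc"), ("D2", "C1", "input_calc"),
    ("C1", "D4", "create"),
    ("D3", "C2", "input_calc"), ("D4", "C2", "input_calc"),
    ("C2", "D5", "create"),
}

# Figure 5: the same graph plus the workflow W1 and its links.
FIGURE_5 = FIGURE_4 | {
    ("D1", "W1", "input_work"), ("D2", "W1", "input_work"),
    ("D3", "W1", "input_work"),
    ("W1", "C1", "call"), ("W1", "C2", "call"),
    ("W1", "D5", "return"),
}


def without_node(edges, node):
    """Drop `node` together with every link that starts or ends at it."""
    return {edge for edge in edges if node not in (edge[0], edge[1])}


stripped = without_node(FIGURE_5, "W1")
print(stripped == FIGURE_4)  # True
```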