Implementation#

Graph nodes#

The nodes of the AiiDA provenance graph can be grouped into two main types: process nodes (ProcessNode), that represent the execution of calculations or workflows, and data nodes (Data), that represent pieces of data.

In particular, process nodes are divided into two sub categories:

calculation nodes (CalculationNode): Represent code execution that creates new data. These are further subdivided in two subclasses:
- CalcJobNode: Represents the execution of a calculation external to AiiDA, typically via a job batch scheduler (see the concept of calculation jobs).
- CalcFunctionNode: Represents the execution of a python function (see the concept of calculation functions).
workflow nodes (WorkflowNode): Represent python code that orchestrates the execution of other workflows and calculations, that optionally return the data created by the processes they called. These are further subdivided in two subclasses:
- WorkChainNode: Represents the execution of a python class instance with built-in checkpoints, such that the process may be paused/stopped/resumed (see the concept of work chains).
- WorkFunctionNode: Represents the execution of a python function calling other processes (see the concept of work functions).

The class hierarchy of the process nodes is shown in the figure below.

../../_images/class_hierarchy.png — Fig. 21 The hierarchy of the ORM classes for the process nodes. Only instances of the lowest level of classes will actually enter into the provenance graph. The two upper levels have a mostly taxonomical purpose as they allow us to refer to multiple classes at once when reasoning about the graph as well as a place to define common functionality (see section on processes).#

For what concerns data nodes, the base class (Data) is subclassed to provide functionalities specific to the data type and python methods to operate on it. Often, the name of the subclass contains the word “Data” appended to it, but this is not a requirement. A few examples:

Float, Int, Bool, Str, List, …
Dict: represents a dictionary of key-value pairs - these are parameters of a general nature that do not need to belong to more specific data sub-classes
StructureData: represents crystal structure data (containing chemical symbols, atomic positions of the atoms, periodic cell for periodic structures, …)
ArrayData: represents generic numerical arrays of data (python numpy arrays)
KpointsData: represents a numerical array of k-points data, is a sub-class of ArrayData

For more detailed information see AiiDA data types.

In the next section we introduce the links between nodes, creating the AiiDA graph, and then we show some examples to clarify what we introduced up to now.

Graph links#

Process nodes are connected to their input and output data nodes through directed links. Calculation processes can create data, while workflow processes can call calculations and return their outputs. Consider the following graph example, where we represent data nodes with circles, calculation nodes with squares and workflow nodes with diamond shapes.

../../_images/schematic_provenance_01_simple_workflow.png — Fig. 22 Simple provenance graph for a workflow (W₁) *calling* a calculation (C₁). The workflow takes a single **data node** (D₁) as input, and passes it to the calculation when *calling* it. The calculation *creates* a new **data node** (D₂) that is also *returned* by the **workflow node**.#

Notice that the different style and names for the two links coming into D₂ is intentional, because it was the calculation that created the new data, whereas the workflow merely returned it. This subtle distinction has big consequences. By allowing workflow processes to return data, it can also return data that was among its inputs.

../../_images/schematic_provenance_02_cycle.png — Fig. 23 Provenance graph example of a **workflow node** that receives three **data nodes** as input and *returns* one of those inputs. The input link from D₃ to W₁ and the return link from W₁ to D₃ introduce a cycle in the graph.#

A scenario like this, represented in Fig. 23, would create a cycle in the provenance graph, breaking the “acyclicity” of the DAG. To restore the directed acyclic graph, we separate the entire provenance graph into two planes as described above: the data provenance and the logical provenance. With this division, the acyclicity of the graph is restored in the data provenance plane.

An additional benefit of thinking of the provenance graph in these two planes, is that it allows you to inspect it with different layers of granularity. Imagine a high level workflow that calls a large number of calculations and sub-workflows, that each may also call more sub-processes, to finally produce and return one or more data nodes as its result.

Graph examples#

With these basic definitions of AiiDA’s provenance graph in place, let’s take a look at some examples. Consider the sequence of computations that adds two numbers x and y, and then multiplies the result with a third number z. This sequence as represented in the provenance graph would look something like what is shown in Fig. 24.

../../_images/add_multiply_calcfunction_data1.png — Fig. 24 The DAG for computing (x+y)*z. We have two simple calculations: C₁ represents the addition and C₂ the multiplication. The two data nodes D₁ and D₂ are the inputs of C₁, which *creates* the data node D₄. Together with D₃, D₄ then forms the input of C₂, which multiplies their values that *creates* the product, represented by D₅.#

In this simple example, there was no external process that controlled the exact sequence of these operations. This may be imagined however, by adding a workflow that calls the two calculations in succession, as shown in Fig. 25.

../../_images/add_multiply_calcfunction_full.png — Fig. 25 The same calculation (x+y)*z is performed using a workflow. Here the data nodes D₁, D₂, and D₃ are the inputs of the workflow W₁, which *calls* calculation C₁ with inputs D₁ and D₂. It then *calls* calculation C₂, using as inputs D₃ and D₄ (which was *created* by C₂). Calculation C₂ *creates* data node D₅, which is finally *returned* by workflow W₁.#

Notice that if we were to omit the workflow nodes and all its links from the provenance graph in Fig. 25, one would end up with the exact same graph as shown in Fig. 24 (the data provenance graph).