Implementation
Graph nodes
The nodes of the AiiDA provenance graph can be grouped into two main types: process nodes (ProcessNode), which represent the execution of calculations or workflows, and data nodes (Data), which represent pieces of data.
In particular, process nodes are divided into two subcategories:
- calculation nodes (CalculationNode): represent code execution that creates new data. These are further subdivided into two subclasses:
  - CalcJobNode: represents the execution of a calculation external to AiiDA, typically via a job batch scheduler (see the concept of calculation jobs).
  - CalcFunctionNode: represents the execution of a Python function (see the concept of calculation functions).
- workflow nodes (WorkflowNode): represent Python code that orchestrates the execution of other workflows and calculations and that optionally returns the data created by the processes it called. These are further subdivided into two subclasses:
  - WorkChainNode: represents the execution of a Python class instance with built-in checkpoints, such that the process may be paused, stopped, and resumed (see the concept of work chains).
  - WorkFunctionNode: represents the execution of a Python function calling other processes (see the concept of work functions).
The class hierarchy of the process nodes is shown in the figure below.
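The hierarchy described above can also be mirrored with plain Python stand-in classes. This is only an illustrative sketch: the classes below are empty placeholders that reuse the names from the text and are not imported from AiiDA, and the `lineage` helper is hypothetical.

```python
# Stand-in classes that mirror the process node hierarchy described in the text.
# They are NOT the real AiiDA classes; they only reproduce the inheritance tree.
class ProcessNode:
    """Execution of a calculation or a workflow."""


class CalculationNode(ProcessNode):
    """Code execution that creates new data."""


class CalcJobNode(CalculationNode):
    """Calculation run outside AiiDA, typically through a scheduler."""


class CalcFunctionNode(CalculationNode):
    """Execution of a Python function that creates data."""


class WorkflowNode(ProcessNode):
    """Code that orchestrates other processes and may return their data."""


class WorkChainNode(WorkflowNode):
    """Execution of a class instance with built-in checkpoints."""


class WorkFunctionNode(WorkflowNode):
    """Execution of a Python function that calls other processes."""


def lineage(cls):
    """Hypothetical helper: class names from `cls` up to the root, excluding `object`."""
    return [base.__name__ for base in cls.__mro__ if base is not object]


print(lineage(CalcJobNode))
print(lineage(WorkChainNode))
```

Checking `issubclass(WorkChainNode, CalculationNode)` on these stand-ins returns `False`, which reflects the split between processes that create data and processes that only orchestrate and return it.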
As for data nodes, the base class (Data) is subclassed to provide functionality specific to each data type, along with Python methods to operate on it.
Often, the name of the subclass ends with the word “Data”, but this is not a requirement. A few examples:
- Dict: represents a dictionary of key-value pairs. These are parameters of a general nature that do not need to belong to more specific data subclasses.
- StructureData: represents crystal structure data (chemical symbols, atomic positions, the periodic cell for periodic structures, …).
- ArrayData: represents generic numerical arrays of data (Python numpy arrays).
- KpointsData: represents a numerical array of k-points data; it is a subclass of ArrayData.
For more detailed information see AiiDA data types.
In the next section we introduce the links between nodes, which together create the AiiDA graph, and then show some examples to clarify the concepts introduced so far.
Graph links
Process nodes are connected to their input and output data nodes through directed links. Calculation processes can create data, while workflow processes can call calculations and return their outputs. Consider the following graph example, where we represent data nodes with circles, calculation nodes with squares and workflow nodes with diamond shapes.
Notice that the different styles and names of the two links coming into D2 are intentional: it was the calculation that created the new data, whereas the workflow merely returned it. This subtle distinction has big consequences. Because workflow processes are allowed to return data, a workflow can also return data that was among its own inputs.
A scenario like this, represented in Fig. 23, would create a cycle in the provenance graph, breaking the “acyclicity” of the DAG. To restore the directed acyclic graph, we separate the entire provenance graph into two planes as described above: the data provenance and the logical provenance. With this division, the acyclicity of the graph is restored in the data provenance plane.
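The effect of the split can be checked on a toy graph. The sketch below is not AiiDA code: the node names (`D1`, `D2`, `C1`, `W1`), the simplified link labels, and the helper functions are all assumptions made for illustration. It builds a graph in which a workflow returns one of its own inputs, then tests for cycles with a depth-first search, first on the full graph and then on the data provenance alone.

```python
from collections import defaultdict

# (source, target, label) triples; the labels are simplified stand-ins.
LINKS = [
    ("D1", "W1", "input"),
    ("D1", "C1", "input"),
    ("W1", "C1", "call"),
    ("C1", "D2", "create"),
    ("W1", "D2", "return"),
    ("W1", "D1", "return"),  # the workflow returns one of its own inputs
]
WORKFLOW_NODES = {"W1"}


def data_provenance(links):
    """Keep only the links that do not touch a workflow node."""
    return [
        link for link in links
        if link[0] not in WORKFLOW_NODES and link[1] not in WORKFLOW_NODES
    ]


def has_cycle(links):
    """Return True if the directed graph defined by `links` contains a cycle."""
    graph = defaultdict(list)
    for source, target, _ in links:
        graph[source].append(target)

    unvisited, in_progress, done = 0, 1, 2
    state = defaultdict(int)  # every node starts as unvisited

    def visit(node):
        state[node] = in_progress
        for neighbour in graph[node]:
            if state[neighbour] == in_progress:
                return True  # back edge: the current path loops onto itself
            if state[neighbour] == unvisited and visit(neighbour):
                return True
        state[node] = done
        return False

    # list(graph) avoids iterating over the dict while visit() adds keys to it.
    return any(state[node] == unvisited and visit(node) for node in list(graph))


print(has_cycle(LINKS))                   # full graph, return links included
print(has_cycle(data_provenance(LINKS)))  # data provenance plane only
```

In this toy graph the cycle comes entirely from the `D1` to `W1` input link paired with the `W1` to `D1` return link, so removing the workflow node and its links is enough to make the remaining graph acyclic.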
An additional benefit of thinking of the provenance graph in terms of these two planes is that it allows you to inspect it at different levels of granularity. Imagine a high-level workflow that calls a large number of calculations and sub-workflows, each of which may in turn call more sub-processes, to finally produce and return one or more data nodes as its result.
Graph examples
With these basic definitions of AiiDA’s provenance graph in place, let’s take a look at some examples. Consider the sequence of computations that adds two numbers x and y, and then multiplies the result by a third number z. Represented in the provenance graph, this sequence would look something like what is shown in Fig. 24.
In this simple example, there was no external process that controlled the exact sequence of these operations. Such a controlling process can be introduced by adding a workflow that calls the two calculations in succession, as shown in Fig. 25.
Notice that if we were to omit the workflow node and all its links from the provenance graph in Fig. 25, we would end up with exactly the same graph as shown in Fig. 24 (the data provenance graph).
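This observation can be reproduced with a small sketch in plain Python. The node names (`add`, `multiply`, `sum`, `product`, `add_multiply`), the simplified link labels, and the `without_node` helper are hypothetical and chosen only for illustration; they are not taken from the figures or from AiiDA's API.

```python
# Data provenance of (x + y) * z: two calculations and the data they consume and create.
DATA_PROVENANCE = {
    ("x", "add", "input"),
    ("y", "add", "input"),
    ("add", "sum", "create"),
    ("sum", "multiply", "input"),
    ("z", "multiply", "input"),
    ("multiply", "product", "create"),
}

# Extra links that appear once a workflow drives the two calculations.
WORKFLOW_LINKS = {
    ("x", "add_multiply", "input"),
    ("y", "add_multiply", "input"),
    ("z", "add_multiply", "input"),
    ("add_multiply", "add", "call"),
    ("add_multiply", "multiply", "call"),
    ("add_multiply", "product", "return"),
}


def without_node(links, node):
    """Drop every link that starts or ends at `node`."""
    return {link for link in links if node not in (link[0], link[1])}


full_graph = DATA_PROVENANCE | WORKFLOW_LINKS
recovered = without_node(full_graph, "add_multiply")
print(recovered == DATA_PROVENANCE)
```

Because the workflow contributes only input, call, and return links, removing it leaves the create links untouched, which is why the data provenance is recovered unchanged.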