Concepts
Nodes and links
Two of the most important concepts in AiiDA are data and processes. Data ranges from simple values, such as an integer or a float, to more complex objects, such as a dictionary of parameters, a folder of files or a crystal structure. Processes operate on this data in order to produce new data.
Processes come in two different forms:
Calculations are processes that are able to create new data. This is the case, for instance, for external simulation codes, which generate new data.
Workflows are processes that orchestrate other workflows and calculations, i.e. they manage the logical flow, being able to call other processes. Workflows have data inputs, but cannot generate new data. They can only return data that is already in the database (one typical case is to return data created by a calculation they called).
Data and processes are represented in the AiiDA provenance graph as the nodes of that graph. The graph edges are referred to as links and come in different forms:
input links: connect data nodes to the process nodes that used them as input, both calculations and workflows
create links: connect calculation nodes to the data nodes that they created
return links: connect workflow nodes to the data nodes that they returned
call links: connect workflow nodes to the process nodes that they directly called, be they calculations or workflows
Note that the create and return links are often collectively referred to as output links.
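The link rules above can be sketched as a small lookup table. This is a toy model in plain Python, not the AiiDA API: the names `NodeKind`, `LinkKind`, `ALLOWED`, `is_valid_link` and `is_output_link` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class NodeKind(Enum):
    DATA = "data"
    CALCULATION = "calculation"
    WORKFLOW = "workflow"


class LinkKind(Enum):
    INPUT = "input"
    CREATE = "create"
    RETURN = "return"
    CALL = "call"


# Allowed (source kind, target kind) pairs for each link kind,
# transcribed from the list of link types in the text.
ALLOWED = {
    LinkKind.INPUT: {
        (NodeKind.DATA, NodeKind.CALCULATION),
        (NodeKind.DATA, NodeKind.WORKFLOW),
    },
    LinkKind.CREATE: {(NodeKind.CALCULATION, NodeKind.DATA)},
    LinkKind.RETURN: {(NodeKind.WORKFLOW, NodeKind.DATA)},
    LinkKind.CALL: {
        (NodeKind.WORKFLOW, NodeKind.CALCULATION),
        (NodeKind.WORKFLOW, NodeKind.WORKFLOW),
    },
}


def is_valid_link(source, link, target):
    """Return True if a link of kind `link` may go from `source` to `target`."""
    return (source, target) in ALLOWED[link]


def is_output_link(link):
    """Create and return links are collectively called output links."""
    return link in (LinkKind.CREATE, LinkKind.RETURN)
```

In this sketch, a workflow with a create link to a data node is rejected, which mirrors the rule that workflows can return data but cannot create it.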
Data provenance and logical provenance
AiiDA automatically stores entities in its database and links them, forming a directed graph. This directed graph automatically tracks the provenance of all data produced by calculations or returned by workflows. By tracking the provenance in this way, one can always fully retrace how a particular piece of data came into existence, thus ensuring its reproducibility.
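To illustrate what retracing means on a directed graph, here is a minimal sketch that walks links backwards from a data node to collect everything it depends on. The graph, the node names and the `ancestors` helper are hypothetical; this models the idea in plain Python and does not use AiiDA itself.

```python
from collections import deque

# Hypothetical toy graph: each edge is a (source, link_label, target) triple.
EDGES = [
    ("structure", "input", "relax_calc"),
    ("parameters", "input", "relax_calc"),
    ("relax_calc", "create", "relaxed_structure"),
    ("relaxed_structure", "input", "bands_calc"),
    ("bands_calc", "create", "band_structure"),
]


def ancestors(edges, node):
    """Return every node from which `node` can be reached by following links."""
    # Index the edges by target so that we can walk them backwards.
    parents = {}
    for source, _link, target in edges:
        parents.setdefault(target, []).append(source)

    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Calling `ancestors(EDGES, "band_structure")` yields both calculations and all three upstream data nodes, i.e. the full history of that result in the toy graph.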
In particular, we define two types of provenance:
The data provenance, which consists of only the data and calculation nodes (i.e. without considering workflows) and the input and create links that connect them. The data provenance records the full history of how data has been generated. Due to the causality principle, the data provenance part of the graph is a directed acyclic graph (DAG), i.e. its nodes are connected by directed edges and it does not contain any cycles.
The logical provenance, which consists of workflow and data nodes, together with the input, return and call links that connect them. The logical provenance is not necessarily acyclic, e.g. a workflow that acts as a filter can return one of its own inputs, directly introducing a cycle.
The data provenance is essentially a log of which calculation generated what data using certain inputs. The data provenance alone already guarantees reproducibility (one could rerun the calculations one by one with the recorded inputs and obtain the same outputs). The logical provenance gives additional information on why a specific calculation was run. Imagine that you start from 100 structures, a filter operation picks one, and you then run a simulation on it. The data provenance only shows the simulation you ran on the structure that was picked, while the logical provenance can also show that the specific structure was not picked at random but via a specific workflow logic.
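The filter example can be sketched as a toy graph to show how the two kinds of provenance differ. Everything here is an assumption made for illustration: the node names (`structure_0` to `structure_99`, `filter_wf`, `simulation`, `result`), the choice of `structure_42` as the picked structure, and the helper functions. It is plain Python, not the AiiDA API.

```python
# Hypothetical node kinds for the filter example.
KIND = {f"structure_{i}": "data" for i in range(100)}
KIND.update({"filter_wf": "workflow", "simulation": "calculation", "result": "data"})

# Edges are (source, link_label, target) triples.
EDGES = [(f"structure_{i}", "input", "filter_wf") for i in range(100)]
EDGES += [
    ("filter_wf", "return", "structure_42"),   # the filter returns one of its inputs
    ("structure_42", "input", "simulation"),
    ("simulation", "create", "result"),
]


def subgraph(edges, kind, node_kinds, link_labels):
    """Keep edges with an allowed label whose endpoints both have an allowed kind."""
    return [
        (s, l, t) for s, l, t in edges
        if l in link_labels and kind[s] in node_kinds and kind[t] in node_kinds
    ]


def data_provenance(edges, kind):
    return subgraph(edges, kind, {"data", "calculation"}, {"input", "create"})


def logical_provenance(edges, kind):
    return subgraph(edges, kind, {"data", "workflow"}, {"input", "return", "call"})


def has_cycle(edges):
    """Kahn's algorithm: a cycle exists if some nodes can never be removed."""
    nodes = {n for s, _l, t in edges for n in (s, t)}
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for s, _l, t in edges:
        children[s].append(t)
        indegree[t] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return removed != len(nodes)
```

In this sketch the data provenance contains only the two links around `simulation` and is acyclic, whereas the logical provenance contains the cycle between `structure_42` and `filter_wf` and records that the structure was selected by the workflow.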
Other entities
Besides nodes (data and processes), AiiDA defines a few more entities, such as a Computer (representing a computer, supercomputer or computer cluster where calculations are run or data is stored), a Group (which groups nodes together for organizational purposes) and a User (to keep track of the user who first generated a given node, computer or group).
In the following section we describe in more detail how the general provenance concepts above are actually implemented in AiiDA, with specific reference to the Python classes that implement them and the class-inheritance relationships.