Data provenance and logical provenance
AiiDA automatically stores entities in its database and links them, forming a directed graph. This graph tracks the provenance of all data produced by calculations or returned by workflows. Because the provenance is tracked in this way, one can always fully retrace how a particular piece of data came into existence, which ensures its reproducibility.
In particular, we define two types of provenance:
The data provenance, which is the part of the graph that consists only of data and calculation nodes (i.e. without considering workflows) and the input and create links that connect them. The data provenance records the full history of how data has been generated. Due to the causality principle, the data provenance part of the graph is a directed acyclic graph (DAG), i.e. its nodes are connected by directed edges and it does not contain any cycles.
The logical provenance, which consists of workflow and data nodes, together with the input, return and call links that connect them. The logical provenance is not necessarily acyclic: for example, a workflow that acts as a filter can return one of its own inputs, which directly introduces a cycle.
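The distinction between the two subgraphs can be sketched in plain Python. This is a toy model and not AiiDA's implementation: the node names, the `subgraph` and `has_cycle` helpers, and the simplified link labels (`input`, `create`, `call`, `return`) are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy graph. Node names and link labels are illustrative,
# not AiiDA's actual identifiers.
NODE_KINDS = {
    "D1": "data",
    "D2": "data",
    "C1": "calculation",
    "W1": "workflow",
}

LINKS = [
    ("D1", "C1", "input"),   # D1 is an input of calculation C1
    ("C1", "D2", "create"),  # C1 creates D2
    ("D1", "W1", "input"),   # D1 is also an input of workflow W1
    ("W1", "C1", "call"),    # W1 calls C1
    ("W1", "D1", "return"),  # W1 returns one of its own inputs
]


def subgraph(links, node_kinds, kinds, link_types):
    """Keep links of the given types whose endpoints are both of the given kinds."""
    return [
        (src, dst, link_type)
        for src, dst, link_type in links
        if link_type in link_types
        and node_kinds[src] in kinds
        and node_kinds[dst] in kinds
    ]


def has_cycle(links):
    """Detect a directed cycle with a depth-first search."""
    adjacency = defaultdict(list)
    for src, dst, _ in links:
        adjacency[src].append(dst)

    white, grey, black = 0, 1, 2
    colour = defaultdict(int)  # every node starts as white (0)

    def visit(node):
        colour[node] = grey
        for neighbour in adjacency[node]:
            if colour[neighbour] == grey:
                return True  # back edge: we returned to a node on the current path
            if colour[neighbour] == white and visit(neighbour):
                return True
        colour[node] = black
        return False

    return any(colour[node] == white and visit(node) for node in list(adjacency))


data_provenance = subgraph(
    LINKS, NODE_KINDS, {"data", "calculation"}, {"input", "create"}
)
logical_provenance = subgraph(
    LINKS, NODE_KINDS, {"data", "workflow"}, {"input", "return", "call"}
)

print(has_cycle(data_provenance))     # the data provenance is acyclic
print(has_cycle(logical_provenance))  # D1 -> W1 -> D1 forms a cycle
```

In this sketch the data provenance keeps only `D1 -> C1 -> D2`, while the logical provenance keeps the input and return links between `D1` and `W1`, which is where the cycle arises.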
The data provenance is essentially a log of which calculation generated which data from which inputs. The data provenance alone already guarantees reproducibility: one could rerun the calculations one by one with the recorded inputs and obtain the same outputs. The logical provenance adds information on why a specific calculation was run. Imagine that you start from 100 structures, a filter operation picks one of them, and you then run a simulation on it. The data provenance shows only the simulation that was run on the picked structure, while the logical provenance also shows that the structure was not picked at random but through specific workflow logic.
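The filter example above can be mimicked with a few lines of plain Python. This is only a sketch: the structure names, the `filter_workflow` function and its arbitrary selection rule are hypothetical and do not use AiiDA's API.

```python
# Hypothetical toy model of the 100-structure example; all names are illustrative.
structures = [f"structure_{i}" for i in range(100)]


def filter_workflow(candidates):
    """Stand-in for the workflow logic: pick one structure (arbitrarily, index 42)."""
    return candidates[42]


picked = filter_workflow(structures)

# Data provenance: only the simulation, its single input and its output.
data_provenance = [
    (picked, "simulation", "input"),
    ("simulation", "result", "create"),
]

# Logical provenance: the workflow received all 100 structures and returned one.
logical_provenance = [(s, "filter_workflow", "input") for s in structures]
logical_provenance.append(("filter_workflow", picked, "return"))

structures_in_data = {
    src for src, _, _ in data_provenance if src.startswith("structure_")
}
structures_in_logical = {
    src for src, _, _ in logical_provenance if src.startswith("structure_")
}

print(len(structures_in_data), len(structures_in_logical))
```

Only the picked structure appears in the data provenance, whereas the logical provenance records that it was selected from all 100 candidates.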
Besides nodes (data and processes), AiiDA defines a few more entities, such as a Computer (representing a computer, supercomputer or computer cluster where calculations are run or data is stored), a Group (which groups nodes together for organizational purposes) and a User (to keep track of the user who first generated a given node, computer or group).
In the following section we describe in more detail how the general provenance concepts above are implemented in AiiDA, with specific reference to the Python classes that implement them and their class-inheritance relationships.