Caching and hashing¶
This section covers the more general considerations of the hashing/caching mechanism. For a more practical guide on how to enable and disable this feature, please visit the corresponding how-to section. If you want to know more about how the internal design of the mechanism is implemented, you can check the internals section instead.
How are nodes hashed¶
Hashing is turned on by default, i.e., all nodes in AiiDA are hashed. This means that even if you enable caching only after you have already completed a number of calculations, those calculations can still be used retroactively by the caching mechanism, since their hashes have already been computed.
The hash of a Data node is computed from:
all attributes of the node, except the _updatable_attributes and _hash_ignored_attributes
the __version__ of the package which defined the node class
the content of the repository folder of the node
the UUID of the computer, if the node is associated with one
The hash of a ProcessNode includes, on top of this, the hashes of all of its input Data nodes.
Once a node is stored in the database, its hash is stored in the _aiida_hash extra, and this extra is used to find matching nodes.
If a node of the same class with the same hash already exists in the database, this is considered a cache match.
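The ingredients listed above can be illustrated with a small standalone sketch. It does not use AiiDA and is not how AiiDA computes hashes internally: the function names (`toy_hash`, `data_node_hash`, `process_node_hash`, `find_cache_match`), the JSON-plus-SHA-256 scheme and the in-memory `stored` dictionary are all assumptions made for illustration, and the repository content is left out for brevity.

```python
import hashlib
import json


def toy_hash(objects):
    """Hash a JSON-serialisable list of objects with SHA-256 (illustrative only)."""
    # sort_keys makes the digest independent of dictionary key order
    payload = json.dumps(objects, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def data_node_hash(version, attributes, ignored=(), computer_uuid=None):
    """Combine package version, non-ignored attributes and computer UUID."""
    kept = {key: value for key, value in attributes.items() if key not in ignored}
    return toy_hash([version, kept, computer_uuid])


def process_node_hash(version, attributes, input_hashes):
    """A process node additionally folds in the hashes of its input data nodes."""
    return toy_hash([version, attributes, input_hashes])


# Hypothetical stand-in for the database: (class name, hash) -> node identifiers
stored = {}


def find_cache_match(class_name, node_hash):
    """Return stored nodes of the same class that carry the same hash."""
    return stored.get((class_name, node_hash), [])


# Two data nodes that differ only in an ignored attribute get the same hash
structure_hash = data_node_hash(
    "1.0.0", {"cell": [1, 2, 3], "comment": "a"}, ignored=("comment",)
)
same_content = data_node_hash(
    "1.0.0", {"cell": [1, 2, 3], "comment": "b"}, ignored=("comment",)
)

# The calculation hash depends on the hash of its input
calc_hash = process_node_hash(
    "1.0.0", {"parser_name": "cp2k"}, {"structure": structure_hash}
)
stored[("CalcJobNode", calc_hash)] = ["node-1"]
```

In this toy model, a lookup with `find_cache_match("CalcJobNode", calc_hash)` plays the role of the cache match described above: same class, same hash.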
You can use the get_hash() method to check the hash of any node.
In order to figure out why a calculation is not being reused, the _get_objects_to_hash() method may be useful:
In [5]: node = load_node(1234)
In [6]: node.get_hash()
Out[6]: '62eca804967c9428bdbc11c692b7b27a59bde258d9971668e19ccf13a5685eb8'
In [7]: node._get_objects_to_hash()
Out[7]:
[
    '1.0.0',
    {
        'resources': {'num_machines': 2, 'default_mpiprocs_per_machine': 28},
        'parser_name': 'cp2k',
        'linkname_retrieved': 'retrieved'
    },
    <aiida.common.folders.Folder at 0x1171b9a20>,
    '6850dc88-0949-482e-bba6-8b11205aec11',
    {
        'code': 'f6bd65b9ca3a5f0cf7d299d9cfc3f403d32e361aa9bb8aaa5822472790eae432',
        'parameters': '2c20fdc49672c3505cebabacfb9b1258e71e7baae5940a80d25837bee0032b59',
        'structure': 'c0f1c1d1bbcfc7746dcf7d0d675904c62a5b1759d37db77b564948fa5a788769',
        'parent_calc_folder': 'e375178ceeffcde086546d3ddbce513e0527b5fa99993091b2837201ad96569c'
    }
]
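When two calculations are expected to match but do not, one possible approach is to compare the lists returned by `_get_objects_to_hash()` entry by entry. The helper below is a hypothetical sketch, not part of AiiDA; it works on plain Python lists, and the shortened values stand in for real output.

```python
def diff_hash_objects(left, right):
    """Return (index, left_item, right_item) for each position where the lists differ."""
    differences = []
    for index in range(max(len(left), len(right))):
        # Use None as a placeholder when one list is shorter than the other
        a = left[index] if index < len(left) else None
        b = right[index] if index < len(right) else None
        if a != b:
            differences.append((index, a, b))
    return differences


# Shortened stand-ins for the output of node._get_objects_to_hash() on two nodes
objects_a = ["1.0.0", {"parser_name": "cp2k"}, {"structure": "c0f1"}]
objects_b = ["1.0.0", {"parser_name": "cp2k"}, {"structure": "9a7e"}]

for index, a, b in diff_hash_objects(objects_a, objects_b):
    print(f"entry {index}: {a!r} != {b!r}")
```

Here only the last entry differs, which points at the `structure` input. With real nodes, entries such as the repository folder object may not compare by content, so they may be reported as different even when the underlying files are identical.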
Controlling hashing¶
Data nodes¶
The hashing of Data nodes can be customized both when implementing a new data node class and during runtime.
In the Node subclass:
Use _hash_ignored_attributes to list the node attributes, e.g. ['attr1', 'attr2'], that should be excluded when computing the hash.
Include extra information in computing the hash by overriding the _get_objects_to_hash() method. Call the parent implementation through super(), and then append to the list of objects to hash.
You can also modify the hashing behavior at runtime by passing keyword arguments to get_hash(), which are forwarded to make_hash().
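The two class-level customizations can be sketched as follows. To keep the example self-contained it uses a stand-in base class, `ToyNode`, rather than the real AiiDA `Node`; only the names `_hash_ignored_attributes`, `_get_objects_to_hash()` and `get_hash()` come from the text above, while `ToyNode`, `ToyStructure`, the `units` field and the hashing scheme are assumptions.

```python
import hashlib
import json


class ToyNode:
    """Stand-in for a node base class; NOT the AiiDA implementation."""

    _hash_ignored_attributes = ()

    def __init__(self, **attributes):
        self.attributes = attributes

    def _get_objects_to_hash(self):
        # Drop every attribute that the class asked to ignore
        return [
            {
                key: value
                for key, value in self.attributes.items()
                if key not in self._hash_ignored_attributes
            }
        ]

    def get_hash(self):
        payload = json.dumps(self._get_objects_to_hash(), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyStructure(ToyNode):
    """Hypothetical data class that customizes its hash."""

    # A purely descriptive label should not influence the hash
    _hash_ignored_attributes = ("label",)

    def __init__(self, units="angstrom", **attributes):
        super().__init__(**attributes)
        self.units = units  # hypothetical extra information, not an attribute

    def _get_objects_to_hash(self):
        # Start from the parent's list, then append the extra information
        objects = super()._get_objects_to_hash()
        objects.append(self.units)
        return objects
```

With this sketch, two `ToyStructure` instances that differ only in `label` share a hash, while changing `units` produces a different one.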
Controlling Caching¶
Caching can be configured at runtime (see How to configure the caching mechanism) and when implementing a new process class:
The spec.exit_code method has a keyword argument invalidates_cache. If this is set to True, a calculation with this exit code will not be used as a cache source for another one, even if their hashes match.
The Process parent class from which calcjobs inherit has an is_valid_cache method, which can be overridden in the plugin to implement custom ways of invalidating the cache. When doing this, make sure to call super().is_valid_cache(node) and respect its output: if it is False, your implementation should also return False. If you do not comply with this, the invalidates_cache keyword on exit codes will not work.
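The rule about respecting the parent's verdict can be sketched with stand-in classes. Nothing below is AiiDA code: `ToyProcess`, `ToyCalculation`, the `exit_codes` mapping, the dictionary used as a node and the `walltime_exceeded` key are all hypothetical, and writing `is_valid_cache` as a classmethod is an assumption of this sketch.

```python
class ToyProcess:
    """Stand-in for a process base class; NOT the AiiDA implementation."""

    # Hypothetical mapping: exit status -> does this exit code invalidate the cache?
    exit_codes = {0: False, 400: True}

    @classmethod
    def is_valid_cache(cls, node):
        # A node is a valid cache source unless its exit code invalidates the cache
        return not cls.exit_codes.get(node["exit_status"], False)


class ToyCalculation(ToyProcess):
    """Hypothetical plugin class with an extra, custom invalidation rule."""

    @classmethod
    def is_valid_cache(cls, node):
        # Respect the parent first: if it says False, return False as well
        if not super().is_valid_cache(node):
            return False
        # Custom rule (assumption): never reuse runs that hit the walltime limit
        return not node.get("walltime_exceeded", False)
```

Because the parent check comes first and a `False` result is passed through unchanged, the exit-code based invalidation keeps working alongside the custom rule.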
Limitations and Guidelines¶
Workflow nodes are not cached. In the current design this follows from the requirement that the provenance graph be independent of whether caching is enabled or not:
Calculation nodes: Calculation nodes can have data inputs and create new data nodes as outputs. In order to make it look as if a cloned calculation produced its own outputs, the output nodes are copied and linked as well.
Workflow nodes: Workflows differ from calculations in that they can return an input node or an output node created by a calculation. Since caching does not care about the identity of input nodes but only their content, it is not straightforward to figure out which node to return in a cached workflow.
This limitation typically has no significant impact, since the runtime of AiiDA work chains is commonly dominated by expensive calculations.
The caching mechanism for calculations should trigger only when the inputs and the calculation to be performed are exactly the same. While AiiDA’s hashes include the version of the Python package containing the calculation/data classes, AiiDA cannot detect cases where the underlying Python code was changed without increasing the version number. Another scenario that can lead to an erroneous cache hit is if the parser and calculation are not implemented as part of the same Python package, because the calculation nodes store only the name of the parser used, not its version.
Note that while caching saves unnecessary computations, it does not save disk space: the output nodes of the cached calculation are full copies of the original outputs.
Finally, when modifying the hashing/caching behaviour of your classes, keep in mind that cache matches can go wrong in two ways:
False negatives, where two nodes should have the same hash but do not
False positives, where two different nodes get the same hash by mistake
False negatives are highly preferable because they only increase the runtime of your calculations, while false positives can lead to wrong results.
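Both failure modes can be reproduced in a few lines of plain Python. This is a toy illustration under assumed names (`digest`, `digest_ignoring`, the `cutoff` and `kpoints` keys) and an assumed JSON-plus-SHA-256 scheme; it is not how AiiDA hashes nodes.

```python
import hashlib
import json


def digest(obj, **dumps_kwargs):
    """SHA-256 of the JSON serialisation of obj (illustrative only)."""
    return hashlib.sha256(json.dumps(obj, **dumps_kwargs).encode("utf-8")).hexdigest()


def digest_ignoring(obj, ignored):
    """Hash a dictionary after dropping the keys listed in ignored."""
    kept = {key: value for key, value in obj.items() if key not in ignored}
    return digest(kept, sort_keys=True)


# False negative: same content, but key order leaks into a naive serialisation
a = {"cutoff": 30.0, "kpoints": [4, 4, 4]}
b = {"kpoints": [4, 4, 4], "cutoff": 30.0}
false_negative = digest(a) != digest(b)

# Sorting the keys removes that spurious difference
fixed = digest(a, sort_keys=True) == digest(b, sort_keys=True)

# False positive: ignoring an attribute that actually changes the result
c = {"cutoff": 30.0, "kpoints": [4, 4, 4]}
d = {"cutoff": 60.0, "kpoints": [4, 4, 4]}
false_positive = digest_ignoring(c, {"cutoff"}) == digest_ignoring(d, {"cutoff"})
```

The first case only costs a repeated calculation. The second would silently reuse results computed with a different cutoff, which is why ignored attributes should be chosen conservatively.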