Caching and hashing#

This section covers the general considerations of the hashing/caching mechanism. For a practical guide on how to enable and disable this functionality, please visit the corresponding how-to section. If you want to learn more about how the internal design of the mechanism is implemented, you can consult the internals section.

How nodes are hashed#

Hashing is enabled by default, meaning that all nodes in AiiDA are hashed. As a result, even calculations you completed before enabling caching can still be used retroactively by the caching mechanism, since their hashes have already been computed.

The hash of a Data node is computed from:

  • all attributes of the node, except the _updatable_attributes and _hash_ignored_attributes

  • the __version__ of the package that defines the node class

  • the content of the repository folder of the node

  • the UUID of the computer, if the node is associated with one

The hash of a ProcessNode includes, on top of this, the hashes of all of its input Data nodes.

Once a node is stored in the database, its hash is stored in the _aiida_hash extra, and this extra is used to find matching nodes. If a node of the same class with the same hash already exists in the database, this is considered a cache match. You can use the get_hash() method to check the hash of any node. In order to figure out why a calculation is not being reused, the get_objects_to_hash() method may be useful:

In [5]: node = load_node(1234)

In [6]: node.base.caching.get_hash()
Out[6]: '62eca804967c9428bdbc11c692b7b27a59bde258d9971668e19ccf13a5685eb8'

In [7]: node.base.caching.get_objects_to_hash()
Out[7]:
{
    'class': "<class 'aiida.orm.nodes.process.calculation.calcjob.CalcJobNode'>",
    'version': '2.6.0',
    'attributes': {
        'resources': {'num_machines': 2, 'default_mpiprocs_per_machine': 28},
        'parser_name': 'cp2k',
        'linkname_retrieved': 'retrieved'
    },
    'computer_uuid': '85faf55e-8597-4649-90e0-55881271c33c',
    'links': {
        'code': 'f6bd65b9ca3a5f0cf7d299d9cfc3f403d32e361aa9bb8aaa5822472790eae432',
        'parameters': '2c20fdc49672c3505cebabacfb9b1258e71e7baae5940a80d25837bee0032b59',
        'structure': 'c0f1c1d1bbcfc7746dcf7d0d675904c62a5b1759d37db77b564948fa5a788769',
        'parent_calc_folder': 'e375178ceeffcde086546d3ddbce513e0527b5fa99993091b2837201ad96569c'
    }
}

Changed in version 2.6: Version information removed from hash computation

Up until v2.6, the objects used to compute the hash of a ProcessNode included the version attribute. This attribute stores a dictionary of the installed versions of aiida-core and the plugin packages (if relevant) at the time of creation. When the caching mechanism was first introduced, this information was intentionally added to the hash to err on the safe side and prevent false positives as much as possible. This turned out to be too limiting, however, since each time the version of aiida-core or a plugin package is updated, all existing valid cache sources are essentially invalidated. Even if an identical process were to be run, its hash would be different, solely because the version information differs. Therefore, as of v2.6, the version information is no longer part of the hash computation. The most likely source of false positives due to changes in code is going to be CalcJob and Parser plugins. See this section for a mechanism to control the caching of CalcJob plugins.

Controlling hashing#

Data nodes#

The hashing of Data nodes can be customized both when implementing a new data node class and at runtime.

In the Node subclass:

  • Use _hash_ignored_attributes to exclude a list of node attributes ['attr1', 'attr2'] from the hash computation.

  • Include extra information in computing the hash by overriding the get_objects_to_hash() method. Use the super() method, and then append to the objects to hash (see the sketch below).

You can also modify the hashing behavior at runtime, by passing keyword arguments to get_hash(); these are forwarded to make_hash().
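As an illustration, the following sketch combines both class-level customization points. All class and attribute names are hypothetical, and the _CLS_NODE_CACHING hook used to attach the custom caching class (as well as its import location) is an assumption; adapt it to the aiida-core version you are targeting:

from aiida.orm import Data
from aiida.orm.nodes.caching import NodeCaching  # assumed import location

class CustomDataCaching(NodeCaching):
    """Hypothetical: include extra information in the objects that enter the hash."""

    def get_objects_to_hash(self):
        objects = super().get_objects_to_hash()
        # Depending on the aiida-core version this is a list or a mapping (see
        # the example output above); here a mapping is assumed.
        objects['extra'] = 'some-additional-information'
        return objects

class CustomData(Data):
    """Hypothetical data class whose 'color' attribute must not affect the hash."""

    # Attributes excluded from the hash computation.
    _hash_ignored_attributes = ('color',)

    # Attach the custom caching implementation (assumed hook).
    _CLS_NODE_CACHING = CustomDataCaching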

Process nodes#

The hashing of process nodes is fixed and can only be influenced indirectly, through the hashes of their inputs. For implementation details of the hashing mechanism for process nodes, see here.

Calculation jobs and parsers#

Added in version 2.6: Resetting the calculation job cache

When the implementation of a CalcJob or Parser plugin changes significantly, it can be the case that for identical inputs, significantly different outputs are expected. The following non-exhaustive list provides some examples:

  • The CalcJob.prepare_for_submission changes the input files that are written, independently of the input nodes

  • The Parser adds an output node for identical output files produced by the calculation

  • The Parser changes an existing output node even for identical output files produced by the calculation

In this case, existing completed nodes of the CalcJob plugin in question should be invalidated as a cache source, because they could constitute false positives. For that reason, the CalcJob and Parser base classes each have the CACHE_VERSION class attribute. By default it is set to None, but when set to an integer, it is included in the computed hash of their nodes. This allows a plugin developer to invalidate the cache of existing nodes by simply incrementing this attribute, for example:

from aiida.engine import CalcJob

class SomeCalcJob(CalcJob):

    CACHE_VERSION = 1

Note that the exact value of CACHE_VERSION does not really matter; all that matters is that changing it invalidates the existing cache. To keep things simple, it is recommended to treat it as a counter and simply increment it by 1 each time.

Controlling caching#

Within the caching mechanism, nodes play two distinct roles: the node that is currently being stored is called the target, while a node that is already stored in the database and is considered equivalent is called a source.

Targets#

Whether a node, when being stored, searches the database for existing equivalents is controlled at the class level. The How to configure caching section describes how this can be controlled globally through the profile configuration, or locally through a context manager.
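For instance, a minimal sketch of the context-manager approach (the code label 'add@localhost' is a hypothetical example):

from aiida import orm
from aiida.engine import run
from aiida.manage.caching import enable_caching
from aiida.plugins import CalculationFactory

ArithmeticAdd = CalculationFactory('core.arithmetic.add')

# Caching is enabled for this entry point only inside the context manager;
# outside of it, the profile-level configuration applies again.
with enable_caching(identifier='aiida.calculations:core.arithmetic.add'):
    run(ArithmeticAdd, x=orm.Int(1), y=orm.Int(2), code=orm.load_code('add@localhost'))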

#

When a node is being stored (the target) and caching is enabled for its node class (see the section above), a valid cache source can be obtained through the _get_same_node() method. This method calls the iterator _iter_all_same_nodes() and takes the first node it returns, if there is one. To find the list of source nodes that are equivalent to the target being stored, _iter_all_same_nodes() performs the following steps:

  1. It queries the database for all nodes that have the same hash as the target node.

  2. From the result, only those nodes are returned for which the is_valid_cache() property returns True.
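These two steps can be approximated with the public query API. The following is an illustrative sketch, not the actual internal implementation (the node identifier is hypothetical):

from aiida import orm

node = orm.load_node(1234)  # hypothetical target node

# Step 1: find all stored nodes of the same class with an identical hash,
# which is stored in the `_aiida_hash` extra.
builder = orm.QueryBuilder()
builder.append(type(node), filters={'extras._aiida_hash': node.base.caching.get_hash()})

# Step 2: keep only the nodes that are valid cache sources.
equivalent = [n for n in builder.all(flat=True) if n.base.caching.is_valid_cache]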

The is_valid_cache() property therefore allows one to control whether a stored node can be used as a source in the caching mechanism. By default, this property returns True for all nodes. It can, however, be changed on a per-node basis, by setting it to False:

node = load_node(<IDENTIFIER>)
node.base.caching.is_valid_cache = False

Setting the property to False will store an extra on the node in the database, such that is_valid_cache will also return False when the node is loaded at a later point:

node = load_node(<IDENTIFIER>)
assert node.base.caching.is_valid_cache is False

Through this method, one can guarantee that an individual node will never be used as a source in the caching mechanism.

The Process class overrides the is_valid_cache() property to give more fine-grained control over process nodes as caching sources. If either is_valid_cache() of the base class or is_finished() returns False, the process node is not a valid source. Likewise, if the process class cannot be loaded from the node through process_class(), the node is not a valid caching source. Finally, if the associated process class implements the is_valid_cache() method, it is called with the node as an argument. If that returns True, the node is considered to be a valid caching source.

The is_valid_cache() property is implemented on the Process class. It checks whether the exit code that is set on the node, if any, has the keyword argument invalidates_cache set to True, in which case the property returns False, indicating that the node is not a valid caching source. Whether an exit code invalidates the cache is controlled with the invalidates_cache argument when it is defined on the process spec through the spec.exit_code method.
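As a sketch, defining such an exit code could look as follows (the status code, label, and message are hypothetical):

from aiida.engine import CalcJob

class SomeCalcJob(CalcJob):

    @classmethod
    def define(cls, spec):
        super().define(spec)
        # A node that fails with this exit code will not be used as a caching
        # source, because of `invalidates_cache=True`.
        spec.exit_code(
            310,
            'ERROR_READING_OUTPUT_FILE',
            message='The output file could not be read.',
            invalidates_cache=True,
        )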

Warning

Process plugins can override the is_valid_cache() method to further control which nodes are considered valid caching sources. When doing so, make sure to call super().is_valid_cache(node) and respect its output: if it returns False, your implementation should also return False. If you do not, the invalidates_cache keyword of the exit codes will no longer work.
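A minimal sketch of such an override, with a hypothetical extra condition on the node's attributes:

from aiida.engine import CalcJob

class SomeCalcJob(CalcJob):

    @classmethod
    def is_valid_cache(cls, node):
        # Respect the base class: if it declares the node invalid, e.g. because
        # an exit code with `invalidates_cache=True` was set, we must do so too.
        if not super().is_valid_cache(node):
            return False
        # Hypothetical additional condition.
        return node.base.attributes.get('magic_number', 0) != 13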

Limitations and guidelines#

  1. Workflow nodes are not cached. The current design requires the provenance graph to be unaffected by whether caching is enabled or not:

    • Calculation nodes: Calculation nodes can have data inputs and create new data nodes as outputs. In order to make it appear as if the cloned calculation produced its own outputs, the output nodes are copied and linked as well.

    • Workflow nodes: Workflows differ from calculations in that they can return input nodes or output nodes created by calculations. Since caching does not care about the identity of input nodes, but only their content, it is not straightforward to determine which node should be returned by a cached workflow.

    Since the runtime of AiiDA work chains is usually dominated by expensive calculations, this limitation typically does not have a significant impact.

  2. The caching mechanism for calculations should only trigger when the inputs and the calculation are exactly identical. While AiiDA's hashes include the version of the Python package containing the calculation/data classes, it cannot detect cases where the underlying Python code was changed without increasing the version number. Another scenario that can lead to an erroneous cache hit is when the parser and calculation are not implemented as part of the same Python package, because the calculation nodes store only the name, not the version, of the parser being used.

  3. While caching saves unnecessary computations, it does not necessarily save disk space, because the cached calculation and its output nodes are duplicated in the provenance graph. However, AiiDA's default disk object store backend automatically deduplicates data at the object level, so with this backend the disk usage is not affected, apart from the node metadata stored at the database level.

  4. Finally, when modifying the hashing/caching behaviour of your classes, keep in mind that cache matches can go wrong in two ways:

    • False negatives, where two nodes should have the same hash but do not

    • False positives, where two different nodes mistakenly end up with the same hash

    False negatives are preferable, because they only increase the runtime of your calculations, whereas false positives can lead to wrong results.