Repository#

The file repository in AiiDA, often referred to simply as the repository, is the data store where all binary data that belongs to the nodes in the provenance graph is persisted. This chapter describes the design and implementation of the file repository, which largely draws from AEP 007: Abstract and improve the file repository and AEP 006: Efficient object store for the AiiDA repository.

The current architecture is heavily influenced by lessons learned from the original design of the file repository in the earliest versions of AiiDA, which had difficulty scaling to large numbers of files. For that reason, the end of the chapter describes the original design and its limitations, which can be instructive in understanding the design of the current solution.

Design#

The following requirements were considered during the design of the file repository implementation:

  • Scalability: the repository should be able to store millions of files, all the while permitting efficient backups.

  • Heterogeneity: the repository should operate efficiently for data that is heterogeneous in size, with objects ranging from a few bytes to multiple gigabytes.

  • Simplicity: the solution should not require an actively running server to operate.

  • Concurrency: the repository should support multiple concurrent reading and writing processes.

  • Efficiency: the repository should automatically deduplicate file content in an effort to reduce the total amount of required storage space.

These are merely the requirements for the data store that persists the content of the files, or the backend file repository. The frontend interface that users employ to store files has another set of requirements altogether. Users are used to thinking about file storage in terms of a file hierarchy, as they would on a normal file system, where files are stored in (nested) directories. Moreover, in the context of AiiDA, each node is expected to have its own set of files with its own hierarchy, an example of which is shown in Fig. 27. The frontend interface therefore needs to allow users to store and address files with such a hierarchy on a per-node basis, even if only virtually. With that guarantee, the backend implementation is free to store the files in any way imaginable in order to meet the requirements specified above.

../../_images/schematic_design_node_repo.png

Fig. 27 Schematic representation of a possible virtual file hierarchy of a node in the provenance graph. From a user’s perspective, each node can contain an arbitrary number of files and directories with a certain file hierarchy. The hierarchy is completely virtual, however, in the sense that the hierarchy is not necessarily maintained literally in the data store containing the file content.#

To satisfy the requirements of the frontend interface and the actual data store at the same time, the file repository solution in AiiDA is divided into two components: a frontend, which is agnostic of any particular storage technology, and a backend, which implements the interface to the storage.

The file repository frontend#

To understand how the file repository frontend integrates the ORM and the file repository backend, consider the following class diagram:

../../_images/schematic_design_class_hierarchy.png

Fig. 28 The file repository backend is interfaced through the Repository class. It maintains a reference to an instance of one of the available file repository backend implementations, be it the sandbox or disk object store variant, to store file objects and to retrieve the content of stored objects. Internally, it keeps a virtual file hierarchy, which allows clients to address files by their path in the file hierarchy as opposed to the unique key identifiers created by the backend. Finally, the Node class, which is the main point of interaction of users with the entire API, mixes in the NodeRepository class. The latter keeps an instance of the Repository class to which all repository operations are forwarded, after a check of the node's mutability is performed.#

As opposed to the backend interface, the frontend Repository does understand the concept of a file hierarchy and keeps it fully in memory. This allows clients to interact with this interface as if the files were stored with the indicated hierarchy and address them by their relative filepaths, as opposed to the unique string identifiers that are generated by the backend. It is important to note, however, that this virtual hierarchy is not initialized based on the actual contents of the file repository backend. In fact it cannot be, because the backend has no notion of a file hierarchy and so cannot provide its hierarchy to the Repository when it is constructed. This means that the Repository only exposes a subset of the files that are contained within a file repository backend.

The virtual hierarchy of the files stored for any particular node is persisted in the database. The node database model has a JSONB column called repository_metadata that contains the virtual file hierarchy in serialized form. This serialized form is generated by the serialize() method, and the from_serialized() class method can be used to reconstruct a repository instance with a pre-existing file hierarchy. Note that when constructing from a serialized file hierarchy, the Repository will not validate that the file objects contained within the hierarchy actually exist in the backend.
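
As an illustration, the following sketch builds a small virtual hierarchy on top of a sandbox backend and round-trips it through its serialized form. It is a minimal sketch, assuming the sandbox backend can be constructed and used without explicit initialisation, and it treats the serialized dictionary as an opaque value rather than relying on its exact format.

import io

from aiida.repository import Repository
from aiida.repository.backend import SandboxRepositoryBackend

# Build a virtual hierarchy on top of a temporary sandbox backend.
backend = SandboxRepositoryBackend()
repository = Repository(backend)
repository.put_object_from_filelike(io.BytesIO(b'content'), 'nested/file.txt')

# Serialize only the hierarchy (not the file content); this is the structure
# that AiiDA stores in the repository_metadata column of the node.
serialized = repository.serialize()

# Reconstruct a repository with the same virtual hierarchy. Note that this does
# not verify that the referenced objects actually exist in the backend.
restored = Repository.from_serialized(backend, serialized)
assert restored.get_object_content('nested/file.txt') == b'content'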

The final integration of the Repository class with the ORM is through the NodeRepository class, which is mixed into the Node class. This layer serves a couple of functions:

  • It implements the mutability rules of nodes.

  • It serves as a translation layer between string and byte streams.

The first is necessary because after a node has been stored, its content is considered immutable, which includes the content of its file repository. The Sealable mixin overrides the Repository methods that mutate repository content to ensure that process nodes can still mutate their content as long as they are not sealed.

The second raison d'être of the NodeRepository is to allow clients to work with string streams instead of byte streams. As explained in the section on the file repository backend, the backend only works with byte streams. However, users of the frontend API are often more used to working with strings and files with a given encoding. The NodeRepository overrides the put-methods to also accept string streams, and enables returning file handles to existing file objects that automatically decode the byte content. The only additional requirement for operating with strings instead of bytes is that the client specifies the encoding. Since the file repository backend does not store any sort of metadata, it is impossible to deduce the file encoding and decode the content automatically. Likewise, using the default file encoding of the system may yield the wrong result, since the file could have been imported and originally written on another system with a different encoding. Encoding and decoding of file objects is therefore the responsibility of the frontend user.
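
To make this concrete, the following sketch writes a string to an unstored node and reads it back as text, leaving the byte encoding and decoding to the NodeRepository. It is a minimal sketch that assumes a loaded profile and relies on the default UTF-8 handling rather than passing an explicit encoding.

import io

from aiida import load_profile, orm

load_profile()

node = orm.Data()

# The frontend accepts a string stream; the NodeRepository encodes it to bytes
# before handing it to the file repository backend.
node.base.repository.put_object_from_filelike(io.StringIO('alpha beta'), 'notes.txt')

# Reading back as text: the stored bytes are decoded transparently for the client.
with node.base.repository.open('notes.txt', mode='r') as handle:
    assert handle.read() == 'alpha beta'

# Once the node is stored, its repository content becomes immutable; process
# nodes remain mutable until they are sealed.
node.store()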

The file repository backend#

In a clear separation of responsibilities, the backend is solely tasked with storing the content of files and returning it upon request as efficiently as possible, both when retrieving files individually and in bulk. For simplicity, the repository backend only deals with raw byte streams and does not maintain any sort of file hierarchy. The interface that any backend file repository should implement is defined by the AbstractRepositoryBackend abstract class.

# Imports required by the excerpt below.
import abc
import contextlib
import hashlib
import io
import pathlib
from typing import BinaryIO, Iterable, Iterator, List, Optional, Tuple, Union

from aiida.common.hashing import chunked_file_hash


class AbstractRepositoryBackend(metaclass=abc.ABCMeta):
    """Class that defines the abstract interface for an object repository.

    The repository backend only deals with raw bytes, both when creating new objects as well as when returning a stream
    or the content of an existing object. The encoding and decoding of the byte content should be done by the client
    upstream. The file repository backend is also not expected to keep any kind of file hierarchy but must be assumed
    to be a simple flat data store. When files are created in the file object repository, the implementation will return
    a string-based key with which the content of the stored object can be addressed. This key is guaranteed to be unique
    and persistent. Persisting the key or mapping it onto a virtual file hierarchy is again up to the client upstream.
    """

    @property
    @abc.abstractmethod
    def uuid(self) -> Optional[str]:
        """Return the unique identifier of the repository."""

    @property
    @abc.abstractmethod
    def key_format(self) -> Optional[str]:
        """Return the format for the keys of the repository.

        Important for when migrating between backends (e.g. archive -> main), as if they are not equal then it is
        necessary to re-compute all the `Node.base.repository.metadata` before importing (otherwise they will not match
        with the repository).
        """

    @abc.abstractmethod
    def initialise(self, **kwargs) -> None:
        """Initialise the repository if it hasn't already been initialised.

        :param kwargs: parameters for the initialisation.
        """

    @property
    @abc.abstractmethod
    def is_initialised(self) -> bool:
        """Return whether the repository has been initialised."""

    @abc.abstractmethod
    def erase(self) -> None:
        """Delete the repository itself and all its contents.

        .. note:: This should not merely delete the contents of the repository but any resources it created. For
            example, if the repository is essentially a folder on disk, the folder itself should also be deleted, not
            just its contents.
        """

    @staticmethod
    def is_readable_byte_stream(handle) -> bool:
        return hasattr(handle, 'read') and hasattr(handle, 'mode') and 'b' in handle.mode

    def put_object_from_filelike(self, handle: BinaryIO) -> str:
        """Store the byte contents of a file in the repository.

        :param handle: filelike object with the byte content to be stored.
        :return: the generated fully qualified identifier for the object within the repository.
        :raises TypeError: if the handle is not a byte stream.
        """
        if (
            not isinstance(handle, io.BufferedIOBase)  # type: ignore[redundant-expr,unreachable]
            and not self.is_readable_byte_stream(handle)
        ):
            raise TypeError(f'handle does not seem to be a byte stream: {type(handle)}.')
        return self._put_object_from_filelike(handle)

    @abc.abstractmethod
    def _put_object_from_filelike(self, handle: BinaryIO) -> str:
        pass

    def put_object_from_file(self, filepath: Union[str, pathlib.Path]) -> str:
        """Store a new object with contents of the file located at `filepath` on this file system.

        :param filepath: absolute path of file whose contents to copy to the repository.
        :return: the generated fully qualified identifier for the object within the repository.
        :raises TypeError: if the handle is not a byte stream.
        """
        with open(filepath, mode='rb') as handle:
            return self.put_object_from_filelike(handle)

    @abc.abstractmethod
    def has_objects(self, keys: List[str]) -> List[bool]:
        """Return whether the repository has an object with the given key.

        :param keys:
            list of fully qualified identifiers for objects within the repository.
        :return:
            list of logicals, in the same order as the keys provided, with value True if the respective
            object exists and False otherwise.
        """

    def has_object(self, key: str) -> bool:
        """Return whether the repository has an object with the given key.

        :param key: fully qualified identifier for the object within the repository.
        :return: True if the object exists, False otherwise.
        """
        return self.has_objects([key])[0]

    @abc.abstractmethod
    def list_objects(self) -> Iterable[str]:
        """Return iterable that yields all available objects by key.

        :return: An iterable for all the available object keys.
        """

    @abc.abstractmethod
    def get_info(self, detailed: bool = False, **kwargs) -> dict:
        """Returns relevant information about the content of the repository.

        :param detailed:
            flag to enable extra information (detailed=False by default, only returns basic information).

        :return: a dictionary with the information.
        """

    @abc.abstractmethod
    def maintain(self, dry_run: bool = False, live: bool = True, **kwargs) -> None:
        """Performs maintenance operations.

        :param dry_run:
            flag to only print the actions that would be taken without actually executing them.

        :param live:
            flag to indicate to the backend whether AiiDA is live or not (i.e. if the profile of the
            backend is currently being used/accessed). The backend is expected then to only allow (and
            thus set by default) the operations that are safe to perform in this state.
        """

    @contextlib.contextmanager
    def open(self, key: str) -> Iterator[BinaryIO]:  # type: ignore[return]
        """Open a file handle to an object stored under the given key.

        .. note:: this should only be used to open a handle to read an existing file. To write a new file use the method
            ``put_object_from_filelike`` instead.

        :param key: fully qualified identifier for the object within the repository.
        :return: yield a byte stream object.
        :raise FileNotFoundError: if the file does not exist.
        :raise OSError: if the file could not be opened.
        """
        if not self.has_object(key):
            raise FileNotFoundError(f'object with key `{key}` does not exist.')

    def get_object_content(self, key: str) -> bytes:
        """Return the content of a object identified by key.

        :param key: fully qualified identifier for the object within the repository.
        :raise FileNotFoundError: if the file does not exist.
        :raise OSError: if the file could not be opened.
        """
        with self.open(key) as handle:
            return handle.read()

    @abc.abstractmethod
    def iter_object_streams(self, keys: List[str]) -> Iterator[Tuple[str, BinaryIO]]:
        """Return an iterator over the (read-only) byte streams of objects identified by key.

        .. note:: handles should only be read within the context of this iterator.

        :param keys: fully qualified identifiers for the objects within the repository.
        :return: an iterator over the object byte streams.
        :raise FileNotFoundError: if the file does not exist.
        :raise OSError: if a file could not be opened.
        """

    def get_object_hash(self, key: str) -> str:
        """Return the SHA-256 hash of an object stored under the given key.

        .. important::
            A SHA-256 hash should always be returned,
            to ensure consistency across different repository implementations.

        :param key: fully qualified identifier for the object within the repository.
        :raise FileNotFoundError: if the file does not exist.
        :raise OSError: if the file could not be opened.
        """
        with self.open(key) as handle:
            return chunked_file_hash(handle, hashlib.sha256)

    @abc.abstractmethod
    def delete_objects(self, keys: List[str]) -> None:
        """Delete the objects from the repository.

        :param keys: list of fully qualified identifiers for the objects within the repository.
        :raise FileNotFoundError: if any of the files does not exist.
        :raise OSError: if any of the files could not be deleted.
        """
        keys_exist = self.has_objects(keys)
        if not all(keys_exist):
            error_message = 'some of the keys provided do not correspond to any object in the repository:\n'
            for indx, key_exists in enumerate(keys_exist):
                if not key_exists:
                    error_message += f' > object with key `{keys[indx]}` does not exist.\n'
            raise FileNotFoundError(error_message)

    def delete_object(self, key: str) -> None:
        """Delete the object from the repository.

        :param key: fully qualified identifier for the object within the repository.
        :raise FileNotFoundError: if the file does not exist.
        :raise OSError: if the file could not be deleted.
        """
        return self.delete_objects([key])

put_object_from_filelike() is the main method: given a stream or file-like object of bytes, it writes the content as an object to the repository and returns a key. put_object_from_file() is a convenience method that stores an object directly from a file on the local file system, and simply calls through to put_object_from_filelike(). The key returned by the put-methods, which can be any type of string, uniquely identifies the stored object. Using the key, open() and get_object_content() can be used to obtain a handle to the object or to read its entire content into memory, respectively. Finally, has_object() and delete_object() can be used to determine whether the repository contains an object with a certain key, or to delete it, respectively.
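
The short sketch below exercises these methods against the sandbox backend described next; it treats the returned key as an opaque string and assumes the sandbox backend needs no explicit initialisation.

import io

from aiida.repository.backend import SandboxRepositoryBackend

backend = SandboxRepositoryBackend()

# Store a byte stream; the backend returns an opaque, unique key for the object.
key = backend.put_object_from_filelike(io.BytesIO(b'some binary content'))

assert backend.has_object(key)
assert backend.get_object_content(key) == b'some binary content'

# Stream the content instead of reading it fully into memory.
with backend.open(key) as handle:
    assert handle.read() == b'some binary content'

backend.delete_object(key)
assert not backend.has_object(key)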

The abstract repository backend interface is implemented for the disk object store (DiskObjectStoreRepositoryBackend) as well as for a scratch sandbox (SandboxRepositoryBackend). The latter implements the interface using a temporary scratch folder on the local file system to store the file content. File objects are stored in a flat manner, where the filename, which functions as the unique key, is based on a randomly generated UUID, as shown in Fig. 29.

../../_images/schematic_design_sandbox.png

Fig. 29 The file structure created by the SandboxRepositoryBackend implementation of the file repository backend. Files are stored in a completely flat structure with the name determined by a randomly generated UUID. This is the most efficient method for writing and reading files on a local file system. Since these sandbox repositories are intended to have very short lifetimes and contain relatively few objects, the typical drawbacks of a flat file store do not apply.#

The simple flat structure of this sandbox implementation is not a limitation, since this backend should only be used for short-lived temporary file repositories. The use case is to provide a file repository for unstored nodes. New node instances that are created in interactive shell sessions are often discarded before being stored, so it is important that not only the creation of new files, but also their deletion once the node is deleted, is as efficient as possible. The disk object store is not optimized for efficient ad-hoc object deletion; rather, deletion is implemented as a soft delete and the actual removal is performed during maintenance operations, such as the packing of loose objects. That is why a new node instance upon instantiation gets an instance of the SandboxRepositoryBackend repository. Only when the node gets stored are the files copied to the permanent backend file repository (such as the DiskObjectStoreRepositoryBackend).

The disk object store#

The disk object store was designed from scratch in order to satisfy the technical requirements of the file repository described in the previous section. The concept is simple: the file repository is represented by a container, which is a directory on the local file system that contains all the file content. When a file is written to the container, it is first written to the scratch directory. Once this operation has finished successfully, the file is moved atomically to the loose directory. It is called loose because each file in this directory is stored as an individual, or loose, object. The name of the object is given by the hash computed from its content, currently using the sha256 algorithm. The loose directory applies one level of sharding based on the first two characters of the object hashes, in order to make the lookup of objects more performant, as described in The original design. A schematic overview of the folder structure of a disk object store container is shown in Fig. 30.

../../_images/schematic_design_dos.png

Fig. 30 Schematic representation of the file hierarchy in a container of the disk object store package. When writing files to the container, they are first written to a scratch sandbox folder and then moved atomically to the loose directory. During maintenance operations, loose files can be concatenated to pack files that are stored in the packed directory.#

The approach of creating new files in the repository by first writing them to the scratch sandbox folder before atomically moving them to the loose object directory directly addresses the requirement of concurrency. By relying on the atomic file move operation of the operating system, all loose objects are guaranteed to be protected from data corruption, within the limits of the atomicity guarantees of the local file system. The use of the hash of the file content as the filename automatically fulfils the efficiency requirement: assuming that the hashing algorithm has no collisions, two objects with the same hash are guaranteed to have the same content and can therefore be stored as a single object. Although computing a file's hash before storing it incurs a non-negligible overhead, the chosen hashing algorithm is fast enough to justify that cost, given the significant reduction in required storage space due to the automatic and implicit data deduplication.

While the scratch and loose directories address the criteria of concurrency and efficiency, on their own they are not scalable. Just like the original design, this does not scale to file repositories of multiple millions of nodes, since every object is stored as an individual file on disk. As described there, this makes the repository impractical to back up, since merely constructing the list of files present is an expensive operation. To tackle this problem, the disk object store implements the concept of packing. In this maintenance operation, the contents of all objects stored in the loose directory are concatenated into single files that are stored in the packed folder. The pack files have a configurable maximum size; once it is reached, the next pack file is created, with filenames given by consecutive integers.
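
Although AiiDA wraps this functionality in the DiskObjectStoreRepositoryBackend, the behaviour can be sketched directly with the disk-objectstore package. The snippet below is a rough sketch: the method names are assumed to follow the package's public Container API, the temporary directory is purely illustrative, and sha256 keys are assumed to be the default.

import hashlib
import tempfile

from disk_objectstore import Container

# A container is just a directory on the local file system.
container = Container(tempfile.mkdtemp())
container.init_container()

# Objects are content-addressed: the key is the hash of the content, so storing
# the same bytes twice automatically deduplicates them.
key_one = container.add_object(b'duplicate payload')
key_two = container.add_object(b'duplicate payload')
assert key_one == key_two == hashlib.sha256(b'duplicate payload').hexdigest()

# Packing is a maintenance operation: loose objects are concatenated into pack
# files, after which the content remains retrievable by the same key.
container.pack_all_loose()
assert container.get_object_content(key_one) == b'duplicate payload'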

A sqlite database is used to track in which pack file each object is stored, the byte offset at which it starts and its total byte length. Such an index is necessary once individual objects are packed into a smaller number of files, and to respect the simplicity requirement a sqlite database was chosen, since it is serverless and efficient. The loose objects are concatenated in a random order, which is to say that the disk object store makes no effort to order objects according to their content size, for example to align them with blocks on the file system, unlike some other key-value store solutions. Files of any size are treated equally, and as such there is no optimization towards storing smaller files nor larger files. This is intentional, because the disk object store is expected to store files that are strongly heterogeneous in size and therefore cannot make optimizations for a particular range of file sizes.
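
Conceptually, each row of that index maps a hash key to a location inside a pack file, and reading an object back amounts to seeking to the recorded offset and reading the recorded number of bytes. The record below is a hypothetical sketch of that idea; the actual table and column names are internal to the disk object store.

from dataclasses import dataclass


@dataclass
class PackIndexEntry:
    """Hypothetical sketch of what the sqlite index records per packed object."""

    hashkey: str  # sha256 hash of the object content, i.e. its key
    pack_id: int  # identifier of the pack file the object lives in
    offset: int  # byte offset of the object within the pack file
    length: int  # size of the object in bytes


def read_packed_object(pack_path: str, entry: PackIndexEntry) -> bytes:
    """Retrieve an object's content from a pack file given its index entry."""
    with open(pack_path, 'rb') as handle:
        handle.seek(entry.offset)
        return handle.read(entry.length)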

Currently, the packing operation is considered a maintenance operation and, unlike the writing of new loose objects, cannot be performed concurrently by multiple processes. Despite this limitation, the packing mechanism satisfies the final requirement of scalability. By reducing the total number of files, the packing strategy allows the pack files to be copied to a backup destination very efficiently. Since new objects are appended to the end of existing pack files, and existing pack files are in principle never touched once they have reached their maximum size (unless they are forcefully repacked), backup tools such as rsync can reduce the transfer of content to the bare minimum.

The lifetime of a node#

Now that all the components have been described, here we describe how they are employed throughout the lifetime of a node. When a new node instance is constructed, it does not yet have an instance of the Repository. Instead, the repository is constructed lazily as soon as an operation on the file repository is executed. This is crucial for performance, since node instances are often initialized without their repository contents ever being accessed, and constructing the repository interface instances carries a non-negligible overhead. If the node is unstored, the Repository is constructed with an instance of the SandboxRepositoryBackend implementation. The advantage is that if the node object goes out of scope before it has been stored, any contents that may have been created in the repository are automatically cleaned from the local file system.

When a node gets stored, the Repository instance is replaced with a new instance, this time with the backend set to the DiskObjectStoreRepositoryBackend that is initialized to point to the container of the current profile. The contents of the sandbox repository are copied over to the disk object store through the Repository interface, and at the end its contents are serialized. The serialized file hierarchy is then stored on the node itself in the repository_metadata column. This allows the Repository instance to be reconstructed correctly once the node is reloaded from the database, by calling the from_serialized() class method with the stored repository_metadata.
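
In code, this lifecycle looks roughly as follows. The sketch assumes a loaded profile and only prints the serialized hierarchy, without relying on its exact format.

import io

from aiida import load_profile, orm

load_profile()

node = orm.Data()

# Before the node is stored, the file lives in a temporary sandbox repository.
node.base.repository.put_object_from_filelike(io.BytesIO(b'payload'), 'inputs/data.bin')

# Storing the node copies the content to the profile's disk object store and
# persists the serialized virtual hierarchy in the repository_metadata column.
node.store()
print(node.base.repository.metadata)

# Reloading the node reconstructs the Repository from the stored metadata.
reloaded = orm.load_node(node.pk)
with reloaded.base.repository.open('inputs/data.bin', mode='rb') as handle:
    assert handle.read() == b'payload'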

The original design#

The original file repository in AiiDA was implemented as a simple directory on the local file system. The files belonging to a node were written as-is to a subdirectory within that repository directory, without any compression or packing. The name of the directory was equal to the UUID of the node, guaranteeing that each subdirectory was unique and that the files of different nodes would not overwrite one another, even if they had identical names.

In anticipation of databases containing many nodes, and therefore many subdirectories, which would slow down the operation of finding the directory of a particular node, the repository was sharded. In the context of a file system, this means that instead of using a flat structure for the file repository, the node directories are stored in (nested) subdirectories. Which subdirectory is once again determined by the UUID: the first and second subdirectories are given by the first and second two characters, respectively, and the remaining characters of the UUID form the name of the final subdirectory. For example, given the UUID 4802585e-18da-42e1-b063-7504585ea9af, the relative path of the subdirectory would be 48/02/585e-18da-42e1-b063-7504585ea9af. With this structure, the top level of the file repository contains at most 256 directories, 00 through ff, and the same applies to each of those. Starting from the third level, however, the file hierarchy is once again flat. A schematic overview of the resulting file hierarchy is shown in Fig. 31.
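
The sharded path follows directly from the node UUID, as the short sketch below illustrates using the example from the text (the helper function is purely illustrative).

def sharded_repository_path(uuid: str) -> str:
    """Return the relative repository path of a node UUID in the original design."""
    return '/'.join((uuid[:2], uuid[2:4], uuid[4:]))


assert sharded_repository_path('4802585e-18da-42e1-b063-7504585ea9af') == '48/02/585e-18da-42e1-b063-7504585ea9af'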

../../_images/schematic_design_original.png

Fig. 31 Schematic representation of the original structure of the file repository. The files of each node are stored in a directory that is exclusive to that node, whose path is determined by the UUID of the node. The directory is sharded twice, using consecutive pairs of characters from the start of the UUID to create two levels of subdirectories. This limits the number of directories at any one level, making the lookup of the directory for a given node more efficient.#

Limitations#

While a simple and robust design, the original architecture of the file repository had various limitations, many of which start to play a significant role for larger projects that contain many files. The main limitation of the original design is that all files were stored as-is, which eventually leads to a large number of files stored in a large number of subdirectories. On a file system, each file or directory requires an inode, the label that allows the file system to map a filepath to its actual location on disk. The number of available inodes on a file system is limited, and certain AiiDA projects were hitting these limits, making it impossible to write more files to disk even though there might have been plenty of raw storage space left.

In addition, backing up a file repository with the original design was practically impossible. Due to the sheer number of files, even just determining the difference between the original and a backup of a repository, for example using rsync, could take days. And that is just computing the difference, let alone the time it would take to perform the actual backup.

However, it was not merely the number of files that was problematic; the number of directories that typical repositories would contain would already significantly slow down backup operations. Since the database kept no record of which files and directories were stored in the file repository for any given node, the original design always created a subdirectory in the file repository for every node, even if it contained no files whatsoever. Otherwise, it would have been impossible to know whether a node really did not contain any files, or whether its directory in the file repository was accidentally missing. This approach, however, led to a large number of empty directories, since many nodes, for example base type data nodes, contain no files at all.