Using the querytool
¶
We provide a Python class (aiida.orm.querytool.QueryTool
) to perform the most common types of queries
(mainly on nodes, links and their attributes) through an easy Python
class interface, without the need to know anything about the SQL query language.
Note
We are working a lot on the interface for querying through the QueryTool, so the interface could change significantly in the future to allow for more advanced querying capabilities.
To use it, in your script (or within the verdi shell)
you need first to load the QueryTool
class:
from aiida.orm.querytool import QueryTool
Then, create an instance of this class, which will represent your query (you need to create a new instance for each different query you want to execute):
q = QueryTool()
Now, you can call a set of methods on the q
object to decide the filters
you want to apply. The first type of filter one may want to apply is on the
type of nodes you want to obtain (the QueryTool, in the current version,
always queries only nodes in the DB). You can do so passing the correct
Node subclass to the set_class()
method, for instance:
q.set_class(Calculation)
Then, if you want to query only calculations within a given group:
q.set_group(group_name, exclude=False)
where group_name
is the name of the group you want to select.
The exclude
parameter, if True
,
negates the query (i.e., considers all objects not included in the
give group). You can call the
set_group()
method
multiple times to add more filters.
The most important query specification, though, is on the attributes of a given node.
If you want to query for attributes in the DbAttribute
table,
use the
add_attr_filter()
method:
q.add_attr_filter("energy", "<=", 0., relnode="res")
At this point, the query q
describes a query you still have to run, which
will return each calculation calc
for which the result node calc.res.energy
is less or equal to 0.
The relnode
parameter allows the user to perform queries not only
on the nodes you want to get out of the query (in this case, do not specify
any relnode
parameter) but also on the value of the attributes of
nodes linked to the result nodes. For instance, specifying "res"
as relnode
, one gets as result of the query nodes whose output result
has a negative energy.
Also in this case, you can add multiple filters on attributes, or you can
use the same syntax also on data you stored in the DbExtra
table
using add_extra_filter()
.
Note
We remind here that while attributes are properties that describe
a node, are used internally by AiiDA and cannot be changed
after the node is stored –
for instance, the coordinates of atoms in a crystal structure, the input
parameters for a calculation, ... – extras (stored in DbExtra
) have
the same format and are at full disposal of the user for adding metadata
to each node, tagging, and later quick querying.
Finally, to run the query and get the results, you can use the
run_query()
method, that will
return an iterator over the results of the query. For instance, if you
stored A
and B
as extra data of a given node, you can get a list
of the energy of each calculation, and the value of A
and B
, using
the following command:
res = [(node.res.energy,
node.get_extra("A"),
node.get_extra("B") )
for node in q.run_query()]
Note
After having run a query, if you want to run a new one, even if
it is a simple modification of the current one, please discard the q
object and create a new one with the new filters.
The transitive closure table¶
Another type of query that is very common is the discovery of whether two nodes are linked through a path in the AiiDA graph database, regardless of how many nodes are in between.
This is particularly important because, for instance, you may be interested in discovering which crystal structures have, say, all phonon frequencies that are positive; but the information on the phonon frequencies is in a node that is typically not directly linked to the crystal structure (you typically have in between at least a SCF calculation, a phonon calculation on a coarse grid, and an interpolation of the phonon bands on a denser grid; moreover, each calculation may include multiple restarts).
In order to make these queries very efficient (and since we expect that typical workflows, especially in Physics and Materials Science, involve a lot of relatively small, disconnected graphs), we have implemented triggers at the database SQL level to automatically generate a transitive closure table, i.e., a table that for each node contains all his parents (at any depth level) and all the children (at any depth level). This means that, every time two nodes are joined by a link, this table is automatically updated to contain all the new available paths.
With the aid of such a table, discovering if two nodes are connected or not
becomes a matter of a single query.
This table is accessible using Django commands, and is called
DbPath
.
Transitive closure paths contain a parent and a child.
Moreover, they also contain a depth
, giving how many nodes have to
be traversed to connect the two parent
and child
nodes (to make
this possible, an entry in the DbPath table is stored for each possible
path in the graph). The depth does not include the first and last node
(so, a depth of zero means that two nodes are directly connected through
a link).
Three further columns are stored, and they are mainly used to quickly (and recursively) discover which are the nodes that have been traversed.
Todo
The description of the exact meaning of the three additional
columns (entry_edge_id
, direct_edge_id
, and exit_edge_id
,
will be added soon; in the meatime, you can give a look to the
implementation of the expand()
method).
Finally, given a DbPath
object, we provide a
expand()
method to get a list
of all the nodes (in the correct order) that are traversed by
the specific path. List elements are AiiDA nodes.
Here we present a simple example of how you can use the transitive closure
table, imagining that you want to get the path between two nodes n1
and n2
.
We will assume that only a single path exists between the two nodes. If no
path exists, an exception will be raised in the line marked below.
If more than one path exists, only the first one will be returned.
The extension to manage the exception and to manage multiple paths
is straightforward:
n1 = load_node(NODEPK1)
n2 = load_node(NODEPK2)
# In the following line, we are choosing only the first
# path returned by the query (with [0]).
# Change here to manage zero or multiple paths!
dbpath = models.DbPath.objects.filter(parent=n1, child=n2)[0]
# Print all nodes in the path
print dbpath.expand()