cspy-db command

The cspy-db command is used to process and analyse the SQLite3 database files that are produced by a crystal structure prediction (CSP) simulation.

cspy-db <command> [<args>]
        For each command, run 'cspy-db <command> --help' for more information.
        
        Available commands are:
        prune            Remove duplicate structures from databases
        cluster          synonymous with 'prune'
        dump             Extract data from databases into other formats
        plot             Plot a landscape from a database
        info             Return information about the contents of a database
        remove_outliers  Remove gapped structures and undetected Buckingham catastrophies from a database
        convert          Convert an old 5-column database to a new 6-column format (adds molecule_id column)
        build            Build a new database from res files in the current directory

positional arguments

command - sub command to run

options

-h, --help - show this help message and exit
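Because the databases are ordinary SQLite3 files, their layout can also be inspected directly with Python's standard library. The following is a minimal sketch and is not part of cspy-db; the function name describe_tables is hypothetical, and no particular table or column names are assumed.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def describe_tables(db_path):
    """Return {table name: [column names]} for an SQLite database file."""
    # Open read-only so that a mistyped path raises an error instead of
    # silently creating a new, empty database file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        return {
            name: [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
            for name in names
        }
```

Calling describe_tables("output.db") on one of your own databases gives a quick way to check, for example, whether a molecule_id column is present before deciding to run the convert subcommand.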

Finding redundant structures via PXRD comparison

After a CSP calculation, the same structure will almost always have been found multiple times. These redundant structures may be found by using the cluster subprogram in cspy-db:

cspy-db cluster *.db

This will find redundant structures within all the database files, combine the unique structures into a new database file (defaulting to output.db), then find unique structures within the combined file.

Finding redundant structures with COMPACK

Note

This section is only relevant to those who have a license for the CSD Python API and have followed the installation instructions in CSD Python API.

In addition to PXRD clustering, clustering can be performed on a CSP database with the COMPACK algorithm. This uses the CSD Python API and requires a conda environment which combines the CSD Python API and mol-CSPy environments into one.

Clustering crystal structures

To cluster a database using COMPACK, the following command can be used:

cspy-db cluster input.db -m compack

Other optional flags include:

-cdt CLUSTER_DENSITY_THRESHOLD, --cluster-density-threshold CLUSTER_DENSITY_THRESHOLD
                      The density threshold used in clustering, within which
                      structures are considered the same.

-cet CLUSTER_ENERGY_THRESHOLD, --cluster-energy-threshold CLUSTER_ENERGY_THRESHOLD
                      The energy threshold used in clustering, within which
                      structures are considered the same.

-j JOBS, --jobs JOBS  Number of parallel processes/threads to use for
                      xrd/compack clustering

-rms CLUSTER_RMS_THRESHOLD, --cluster-rms-threshold CLUSTER_RMS_THRESHOLD
                      RMS difference threshold used in compack clustering.
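One way to picture the energy and density thresholds is as a cheap pre-screen: only pairs of structures whose energies and densities both agree to within the thresholds need to go through the more expensive PXRD or COMPACK comparison. The sketch below illustrates that idea only. It is an assumption about how the thresholds behave, not mol-CSPy's implementation, and the function name and argument order are hypothetical.

```python
def may_be_duplicates(
    energy_a, density_a, energy_b, density_b, energy_threshold, density_threshold
):
    """Return True if two structures are close enough to be worth comparing.

    Illustrative only: energies and densities must be in the same units as
    the corresponding thresholds.
    """
    close_in_energy = abs(energy_a - energy_b) <= energy_threshold
    close_in_density = abs(density_a - density_b) <= density_threshold
    return close_in_energy and close_in_density
```

Under this picture, tightening either threshold reduces the number of pairwise comparisons but risks missing duplicates whose energies or densities differ slightly after optimisation.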

There are a number of COMPACK search settings that can be tweaked. The defaults of these are recorded in the configuration.py file. Alternatively, user defined values can be read from a cspy.toml file. See below for an example:

[compack]
angle_tolerance = 30
distance_tolerance = 0.3
packing_shell_size = 60
ignore_hydrogen_counts = true
ignore_hydrogen_positions = true

Searching the database for a match

To search through a database or series of databases and compare to a given structure, the following command should be employed:

cspy-db cluster input.db -m compack --compack_exp_str NAME_OF_STRUCTURE

Here, NAME_OF_STRUCTURE is the name of the file containing the comparison crystal structure (typically an experimental SCXRD structure). Alternatively, the user may specify the CSD reference code. Be aware that some CSD structures contain disordered atoms or solvent molecules, which will affect the overlay comparison.

Extracting crystal structures from a database

If you’d prefer to work with a CSV file, you can dump the data for the unique structures by using the dump subprogram in cspy-db:

cspy-db dump output.db # only unique structures
cspy-db dump output.db --include-duplicates # all structures in the database

This will result in a data table being written to structures.csv, and an archive of SHELX res files being written to structures.zip. By default this will only export unique structures.
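The two output files can then be read with standard tools. As a minimal sketch, the hypothetical helper below reports the CSV header, the number of data rows, and the res files in the archive. It deliberately assumes nothing about the column names, which should be taken from the header of your own structures.csv.

```python
import csv
import zipfile


def summarise_dump(csv_path="structures.csv", zip_path="structures.zip"):
    """Return (header, number of data rows, res file names) for a dump."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # first row holds the column names
        n_rows = sum(1 for _ in reader)  # remaining rows are structures
    with zipfile.ZipFile(zip_path) as archive:
        res_files = [
            name for name in archive.namelist() if name.lower().endswith(".res")
        ]
    return header, n_rows, res_files
```

A quick sanity check is that the number of data rows agrees with the number of res files in the archive.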

If you want to dump only the structures within a specified energy range of the global minimum structure, this can be done with the -e flag:

cspy-db dump output.db -e 7 # only unique structures within 7 kJ mol^-1 of the global minimum

If you would prefer to have these structures in a database format, add the --copy-db flag.
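The energy window selected by -e is measured relative to the lowest-energy structure rather than on an absolute scale. The sketch below illustrates that selection on a plain list of energies. It is not the cspy-db implementation: the function name is hypothetical, and whether the real cutoff is inclusive is an assumption.

```python
def within_window(energies, window):
    """Return the energies lying within `window` of the lowest energy.

    Illustrative only: the cutoff is treated as inclusive here, which is
    an assumption rather than documented cspy-db behaviour.
    """
    if not energies:
        return []
    global_minimum = min(energies)
    return [e for e in energies if e - global_minimum <= window]


# Three structures with a 7 kJ/mol window: the structure at -92.0 lies
# 8 kJ/mol above the minimum, so it is excluded.
selected = within_window([-100.0, -95.0, -92.0], 7)
```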

Summarising information in a database

mol-CSPy’s CSP databases contain a lot of information and it is often more useful to summarise it. The info subprogram will return:

- The number of structures
- The number/percentage of unique structures (if the information is available)
- Statistics for energies and density
- Details of the five lowest energy structures

The percentage of unique structures is a convenient (albeit crude) means of measuring whether enough structures were sampled during CSP.

cspy-db info database.db
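To make the unique-structure percentage concrete, the arithmetic is sketched below. The function name is hypothetical and the numbers in the example are made up for illustration.

```python
def unique_percentage(n_unique, n_total):
    """Return the percentage of structures in a database that are unique."""
    if n_total <= 0:
        raise ValueError("the database must contain at least one structure")
    return 100.0 * n_unique / n_total


# Made-up example: 250 unique structures out of 1000 generated.
example = unique_percentage(250, 1000)
```

A low percentage means most newly generated structures duplicate ones already found, which suggests that the search is approaching completeness. A percentage close to 100 suggests that many structures remain to be found.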

Plotting a landscape from a database

mol-CSPy’s CSP databases contain all the information necessary to plot and render a crystal landscape. This is useful for on-the-fly analysis of CSP outputs. The contents of one or more databases may be plotted together as:

cspy-db plot database0.db database1.db

Labelling by space group

By default, data points are coloured and labelled according to the database they were sourced from. If you would instead prefer to colour and label according to space group, you need only append the --spg flag. Additionally, a list of space group numbers may optionally be provided after the --spg flag to instruct cspy-db plot to plot only data points belonging to those space groups.

To plot all space groups:

cspy-db plot database0.db --spg

To plot only space groups 1 and 14:

cspy-db plot database0.db --spg 1 14

Plotting from clustered databases

cspy-db plot will automatically search for clustering information in a database and adapt its behaviour if this information is found.

Note

The output databases from cspy-db cluster contain only unique crystal structures. Input databases (after clustering) keep track of which crystal structures are equivalent. This section refers to these input databases.

By default, only unique crystal structures will be plotted. If you wish to plot all crystal structures, append the --ignore-clustering flag.

Additionally, the --equivalents flag can be added to colour each unique crystal structure according to the number of equivalent structures in the database. This is another (crude) means of measuring whether enough structures were sampled during CSP. If the lowest energy structures have each been found only once, it may be worth running the CSP for longer.
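The same check can be written down explicitly. The following is a minimal sketch on hand-made data: the function name, the tuple layout (name, energy, number of times found) and the example values are all hypothetical and do not reflect the database schema.

```python
def singly_found_low_energy(structures, n_lowest=5):
    """Return names of the n_lowest lowest-energy structures found only once.

    `structures` is an iterable of (name, energy, times_found) tuples.
    """
    lowest = sorted(structures, key=lambda s: s[1])[:n_lowest]
    return [name for name, _energy, times_found in lowest if times_found == 1]


# Made-up landscape: of the three lowest-energy structures, "a" and "c"
# were each found only once.
landscape = [("a", -100.0, 1), ("b", -99.0, 4), ("c", -98.0, 1), ("d", -50.0, 1)]
rarely_found = singly_found_low_energy(landscape, n_lowest=3)
```

If most of the lowest-energy structures appear in this list, the low-energy region has probably not been sampled thoroughly.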