cspy-threshold-utils command

The cspy-threshold-utils command provides three subcommands to interact with threshold databases, generate disconnectivity graphs, and prepare the data needed to run threshold jobs with the cspy-threshold command.

cspy-threshold-utils [-h] {command} [args]

positional arguments

command - Available commands

options

-h, --help - show this help message and exit

Preparing the data

cspy-threshold-utils setup [-h] [--excluded-spgs EXCLUDED_SPGS [EXCLUDED_SPGS ...]]
                           [-e MAX_ENERGY] [-n NUM_STRUCTURES] [--xyz XYZ [XYZ ...]]
                           [-c CHARGES] [-m MULTIPOLES] [-a AXIS] [--fix] [-z MOLECULES]
                           [--smallest] [--limit LIMIT] [--niggli] [--table] [-tz TABLE_Z]
                           [-to TABLE_OUTPUT]
                           [--method {grow_shortest,double_shortest,split,prime_fact}]
                           [-o OUTPUT] [--dry] [--ignore] [-ll {INFO,DEBUG,WARNING,ERROR}]
                           crystals [crystals ...]

positional arguments

crystals - Crystal files (.res or .cif) or CSP databases (.db)

options

-h, --help - show this help message and exit
--excluded-spgs EXCLUDED_SPGS - Crystals in these space groups will be ignored. Values must be in the range [0-230]
--method METHOD - Set the method used to create the supercells. grow_shortest: each step grows the supercell by 1 (default: grow_shortest)
-o OUTPUT, --output OUTPUT - String to add to the end of output filenames (default: _thresh_setup)
--dry - Dry run, do not save the P1 crystals to output files
--ignore - Ignore any errors raised by bond and multipole mapping checks
-ll LOG_LEVEL, --log-level LOG_LEVEL - Set the logging level (default: INFO)

Read Crystals from Databases

Settings to modify how crystals are extracted from a database

-e MAX_ENERGY, --max-energy MAX_ENERGY - Select structures within this energy of the global minimum from any database files
-n NUM_STRUCTURES, --num-structures NUM_STRUCTURES - Select at most this many structures from a database, in order of increasing energy

Check Validity

Provide these files to check whether they can be mapped to the crystals. Bonding issues may be fixed using the --fix argument.

--xyz XYZ - .xyz file of molecule(s) that will be supplied to cspy-threshold
-c CHARGES, --charges CHARGES - Rank0 multipole file
-m MULTIPOLES, --multipoles MULTIPOLES - RankN multipole file
-a AXIS, --axis AXIS - Molecular axis file
--fix - Try and fix some possible issues

Number of Molecules

Modify the target number of molecules in the final supercells

-z MOLECULES, --molecules MOLECULES - Number of molecules in the P1 unit cells
--smallest - Create the smallest supercell that accommodates all initial structures provided
--limit LIMIT - Limit of number of molecules in supercell to check when setting --smallest (default: 200)
--niggli - Calculate Niggli cells of input crystals before finding smallest supercells

Table Output

Check multiple target number of molecules and save results to csv.
Ignores the --smallest and --molecules arguments if set.

--table - Create .csv file with permitted supercell Z values. If this flag is set, all other flags are ignored except --table-output and --table-z
-tz TABLE_Z, --table-z TABLE_Z - Max number Z to test in table output (default: 100)
-to TABLE_OUTPUT, --table-output TABLE_OUTPUT - Table output filename ending. The name of the method will be included at the beginning (default: table.csv)

In order to run threshold jobs, the starting crystals must be properly set up. This app accepts crystal files or CSPy databases and creates a new file for each crystal in the P1 space group, in the SHELXL file format. Its main benefit is that all output crystals have the same number of molecules in the cell, which is achieved by creating supercells where necessary. If the input crystals can’t all be given the same number of molecules in the cell, the app will refuse to run.

The crystals are generated with the same number of molecules in the cell so that, when the threshold databases are merged and clustered, matches can potentially be found between different trajectories. If your initial crystals can’t all have the same number of molecules in the cell, make sure that the molecule counts of the different cells are multiples of each other, for example 4, 8, 16.
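
As a rough illustration of the arithmetic, the smallest molecule count that every input cell can reach by integer replication is the least common multiple of the per-cell counts. The sketch below is not the algorithm cspy-threshold-utils uses (the --method options govern that). The function names are hypothetical, and it assumes a supercell can multiply the number of molecules by any positive integer.

```python
from functools import reduce
from math import gcd


def common_molecule_count(counts):
    """Return the smallest count every cell can reach by integer replication."""
    if not counts or any(c <= 0 for c in counts):
        raise ValueError("counts must be a non-empty list of positive integers")
    # Least common multiple, folded pairwise over the list.
    return reduce(lambda a, b: a * b // gcd(a, b), counts)


def are_mutual_multiples(counts):
    """Return True if, for every pair of counts, one divides the other."""
    ordered = sorted(counts)
    # Divisibility is transitive, so checking sorted neighbours is sufficient.
    return all(b % a == 0 for a, b in zip(ordered, ordered[1:]))
```

For cells with 2, 4 and 8 molecules, common_molecule_count gives 8. For counts such as 4 and 6, are_mutual_multiples returns False, which flags a set that does not follow the "multiples of each other" guideline.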

Interacting with databases

cspy-threshold-utils db [-h] (--minimize | --combine | --dump | --cluster | --split)
                        [--minimize-skip MINIMIZE_SKIP] [--minimize-increasing]
                        [--minimize-trial MINIMIZE_TRIAL] [-x XYZ_FILES [XYZ_FILES ...]]
                        [-c CHARGES] [-m MULTIPOLES] [-a AXIS]
                        [-p {fit,fit_disponly,fit_reponly,w99,fit_water_X,Day_halobenzenes,w99_orig_Halogens,w99_orig_H,w99rev_6311,w99rev_6311_s,w99rev_631,w99rev_pcm_6311,w99_s_cl,w99sp,w99rev_pcm_6311_and_Halides,isoPAHAP,PAHAP,nothing,gaff2_LJ,gaff2_fit}]
                        [--cutoff CUTOFF] [--keep-files] [--combine-db-name COMBINE_DB_NAME]
                        [--combine-only-valid] [--combine-serial] [--dump-format {cif,res}]
                        [--dump-kind {unique,trajectory}] [--cluster-superbasin]
                        [--cluster-method {compack}] [--workers WORKERS]
                        [-ll {INFO,DEBUG,WARNING,ERROR}]
                        databases [databases ...]

positional arguments

databases - Input databases (.db)

options

-h, --help - show this help message and exit
--minimize - Minimize the structures in the databases
--combine - Combine the databases into a single one
--dump - Dump the min structures of a database. The output will be a .zip file with the name of the input database(s)
--cluster - Further cluster the structures in a threshold database. Currently the only supported clustering method is COMPACK
--split - Split the input databases into their trials
--workers WORKERS - Number of workers to use for some actions (default: 1)
-ll LOG_LEVEL, --log-level LOG_LEVEL - Set the logging level (default: INFO)

Minimize

Configure how the minimization of the db will be done

--minimize-skip MINIMIZE_SKIP - Number of steps to skip between minimization of structures (default: 0)
--minimize-increasing - Minimize structures that have higher SPE than the previous MC step
--minimize-trial MINIMIZE_TRIAL - Filter which trial to minimize
-x XYZ_FILES, --xyz-files XYZ_FILES - Xyz files containing molecules for generation
-c CHARGES, --charges CHARGES - Rank0 multipole file
-m MULTIPOLES, --multipoles MULTIPOLES - RankN multipole file
-a AXIS, --axis AXIS - Axis filename for structure minimization
-p POTENTIAL, --potential POTENTIAL - intermolecular potential name (default: fit)
--cutoff CUTOFF - dmacrys real space/repulsion-dispersion cutoff (default: calculate)
--keep-files - Keep DMACRYS and NEIGHCRYS files which, for each structure, are stored in a new directory in the pwd.

Combine

Combine databases into a single one

--combine-db-name COMBINE_DB_NAME - Name of the combined output database (default: combined.db)
--combine-only-valid - Reduce amount of data added to new database
--combine-serial - Combine the databases in serial instead of parallel

Dump

Dump structures of a database into a folder

--dump-format DUMP_FORMAT - Set the format to dump the crystal files into (default: cif)
--dump-kind DUMP_KIND - Which structures to dump from the database (default: cif)

Cluster

Further cluster threshold database

--cluster-superbasin - Cluster structures in smallest basin containing all trajectory starts
--cluster-method CLUSTER_METHOD - Cluster algorithm (default: compack)

This utility provides the following actions on threshold databases:

--minimize minimizes the valid points along the trajectory. Flags can be set to skip holding points between minimizations or to minimize only the structures from a specific trial of the database.
--combine combines multiple threshold databases into a single database. This is useful when the trajectories for different starting points have been run in individual jobs. The databases must eventually be merged so that they can be clustered to find connections between the trajectories, and so that the disconnectivity graph can be generated.
--dump saves the valid minimized structures of a threshold database to a zip file, in CIF or SHELXL format.
--cluster runs further clustering on threshold databases using the COMPACK algorithm.
--split splits a single database into one database per trial.
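
The combine-then-cluster sequence described above can be scripted. The following is a minimal sketch that only builds the command lines from flags listed in the usage text. The helper names and the trial-1.db / trial-2.db file names are hypothetical. Whether --cluster updates the database in place or writes a new file is not stated here, so check the output before passing it on to disconn.

```python
import shlex


def combine_command(databases, combined_name="combined.db"):
    """Build the argv that merges per-trajectory databases into one."""
    return ["cspy-threshold-utils", "db", "--combine",
            "--combine-db-name", combined_name, *databases]


def cluster_command(database):
    """Build the argv that runs further clustering on a database."""
    return ["cspy-threshold-utils", "db", "--cluster", database]


if __name__ == "__main__":
    # Hypothetical per-trajectory databases produced by separate jobs.
    trajectory_dbs = ["trial-1.db", "trial-2.db"]
    for argv in (combine_command(trajectory_dbs), cluster_command("combined.db")):
        # Print the command; use subprocess.run(argv, check=True) to execute it.
        print(shlex.join(argv))
```

Keeping the commands as argument lists avoids shell quoting problems when database paths contain spaces.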

Generating disconnectivity graphs

cspy-threshold-utils disconn [-h] [-c CLUSTER_FILE | --plot-density | --colour-ini-basins]
                             [-e EN_MIN] [--up-limit UP_LIMIT] [--interval INTERVAL]
                             [-o OUTPUT] [--plot-branch] [--plot-nodes] [--plot-ini-labels]
                             [--plot-lid-heights] [--plot-all-labels] [--plot-by-unique-id]
                             [--plot-by-relative-energy] [--save-pickle]
                             [-ll {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                             input

positional arguments

input - Input clustered database file or pickled data

options

-h, --help - show this help message and exit
--save-pickle - Save pickle file of disconnectivity graph object
-ll LOG_LEVEL, --log-level LOG_LEVEL - Set the logging level (default: INFO)

Clustering

Options to colour the disconnectivity graph edges

-c CLUSTER_FILE, --cluster-file CLUSTER_FILE - Clustering file used to colour the disconnectivity graph; currently hard-coded so that the fourth row is the cluster label
--plot-density - Colour leaf edges by density of the crystal
--colour-ini-basins - Colour edges by basins of individual trajectories

Disconnectivity

Modify how the disconnectivity graph is generated

-e EN_MIN, --en-min EN_MIN - Minimum energy, only works when reading from a database, not a pickled object
--up-limit UP_LIMIT - Upper-lid energy limit, only works when reading from a database, not a pickled object
--interval INTERVAL - Energy interval for disconnectivity graph, only works when reading from a database, not a pickled object

Plotting

Control the plot settings and look

-o OUTPUT, --output OUTPUT - Name for the output file (default: disconnectivity_graph)
--plot-branch - Dump main branch disconnectivity graph instead of the full graph
--plot-nodes - Show leaf nodes
--plot-ini-labels - Plot trial numbers of the initial structures
--plot-lid-heights - Plot lid heights
--plot-all-labels - Plot labels of all nodes on graph
--plot-by-unique-id - Use the unique ids of the structures instead of the node ids
--plot-by-relative-energy - Plot using relative energy from the global energy minimum

Once all the trajectories have been run and minimized, disconnectivity graphs can be generated. To do so, pass the clustered trajectory database to this utility:

cspy-threshold-utils disconn database-1.db

This will generate a file with the default name disconnectivity_graph.png. Many options are provided to change what will be plotted and how:

--plot-branch plots only the disconnectivity of the structures from which trajectories were started, removing any structures that were found during the course of the trajectories.
--plot-nodes adds a node at the end of each branch of the graph. Nodes for trajectory starts are shown differently from those for structures found during the trajectories.

Generation of the disconnectivity graph can be a costly operation for very large databases. To avoid recalculating the graph every time, this utility provides the --save-pickle option:

cspy-threshold-utils disconn --save-pickle database-1.db

This generates a pickle file that can be used to replot the disconnectivity graph:

cspy-threshold-utils disconn disconnectivity_graph.pickle
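
One way to script this caching pattern is to choose the input based on whether the pickle already exists. This is a minimal sketch, not part of cspy: the helper name is hypothetical, and the default pickle name is assumed from the example above.

```python
from pathlib import Path


def disconn_command(database, pickle_file="disconnectivity_graph.pickle"):
    """Build the disconn argv, reusing the pickle when it already exists."""
    if Path(pickle_file).exists():
        # Replot from the cached graph instead of rebuilding it.
        return ["cspy-threshold-utils", "disconn", str(pickle_file)]
    # First run: build the graph from the database and save a pickle.
    return ["cspy-threshold-utils", "disconn", "--save-pickle", str(database)]
```

Because --en-min, --up-limit and --interval only take effect when reading from a database, the pickle has to be regenerated from the database to change any of them.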

This utility offers limited control over the appearance of the disconnectivity graphs. To change the look of the plot or how the disconnectivity graph is generated, users can write their own scripts using the cspy.threshold.disconnectivity_graph.DisconnectivityGraph class.

from cspy.db import CspDataStore
from cspy.threshold.disconnectivity_graph import DisconnectivityGraph, DrawFilters

ds = CspDataStore("database-1.db")
disconn = DisconnectivityGraph(ds=ds)

# Obtain data from the disconnectivity graph. Check the available methods
# that DisconnectivityGraph provides.
lid_id = disconn.barrier_between_ids("STRUCT_ID1", "STRUCT_ID2")

# DrawFilters provides some default utilities to change the looks of the
# disconnectivity graph. The user can also write their own filters.
disconn.draw("my_disconn_graph", DrawFilters.initial_structures())

# DisconnectivityGraph can also provide the figure and axes of the graph
fig, ax = disconn.create_plot()