cspy-threshold-utils command
The cspy-threshold-utils command provides three subcommands to prepare the
data needed to run threshold jobs with the cspy-threshold command, interact
with threshold databases, and generate disconnectivity graphs.
cspy-threshold-utils [-h] {command} [args]
positional arguments
command - Available commands
options
Preparing the data
cspy-threshold-utils setup [-h] [--excluded-spgs EXCLUDED_SPGS [EXCLUDED_SPGS ...]]
[-e MAX_ENERGY] [-n NUM_STRUCTURES] [--xyz XYZ [XYZ ...]]
[-c CHARGES] [-m MULTIPOLES] [-a AXIS] [--fix] [-z MOLECULES]
[--smallest] [--limit LIMIT] [--niggli] [--table] [-tz TABLE_Z]
[-to TABLE_OUTPUT]
[--method {grow_shortest,double_shortest,split,prime_fact}]
[-o OUTPUT] [--dry] [--ignore] [-ll {INFO,DEBUG,WARNING,ERROR}]
crystals [crystals ...]
positional arguments
crystals - Crystal files (.res or .cif) or CSP databases (.db)
options
--excluded-spgs EXCLUDED_SPGS - Crystals in these space groups will be ignored. Values must be in the range [0-230]
--method METHOD - Set the method used to create the supercells. grow_shortest: each step grows the supercell by 1 (default: grow_shortest)
-o OUTPUT, --output OUTPUT - String to add to the end of output filenames (default: _thresh_setup)
--dry - Dry run, do not save the P1 crystals to output files
--ignore - Ignore any errors raised by bond and multipole mapping checks
-ll LOG_LEVEL, --log-level LOG_LEVEL - Set the logging level (default: INFO)
Read Crystals from Databases
Settings to modify how crystals
are extracted from a database
-e MAX_ENERGY, --max-energy MAX_ENERGY - Select structures within this energy of the global minimum from any database files
-n NUM_STRUCTURES, --num-structures NUM_STRUCTURES - Select at most this many structures from a database, in order of increasing energy
Check Validity
Provide these files to check if they can be mapped to the crystals.
Bonding issues may be fixed using the --fix argument
Number of Molecules
Modify the target number of molecules in the final supercells
-z MOLECULES, --molecules MOLECULES - Number of molecules in the P1 unit cells
--smallest - Create the smallest supercell that accommodates all initial structures provided
--limit LIMIT - Limit on the number of molecules in the supercell to check when setting --smallest (default: 200)
--niggli - Calculate Niggli cells of input crystals before finding smallest supercells
Table Output
Check multiple target numbers of molecules and save the results to a CSV file.
Ignores the --smallest and --molecules arguments if set.
--table - Create .csv file with permitted supercell Z values. If this flag is set, all other flags are ignored except --table-output and --table-z
-tz TABLE_Z, --table-z TABLE_Z - Maximum Z to test in table output (default: 100)
-to TABLE_OUTPUT, --table-output TABLE_OUTPUT - Table output filename ending. The name of the method will be included at the beginning (default: table.csv)
To run threshold jobs, the starting crystals must be properly set up. This app accepts crystal files or CSPy databases and creates a new file for each crystal in the P1 space group and in the SHELXL file format. Its main benefit is that it ensures all output crystals have the same number of molecules in the cell, creating supercells where necessary. If the input crystals cannot all be made to have the same number of molecules in the cell, the app will refuse to run.
Generating the crystals with the same number of molecules in the cell means that, when the threshold databases are merged and clustered, matches can potentially be found between different trajectories. If your set of initial crystals cannot all have the same number of molecules in the cell, make sure that the different cells contain numbers of molecules that are multiples of each other, for example 4, 8 and 16.
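The "multiples of each other" rule can be checked before running the setup. The following is a minimal sketch, not part of cspy: the function names are hypothetical, and it only tests the arithmetic on the molecule counts (Z values). Whether a supercell with a given Z can actually be built for a particular crystal is decided by the setup subcommand itself. It assumes Python 3.9 or later for math.lcm.

```python
from math import lcm


def are_mutual_multiples(z_values):
    """Return True if, for every pair of Z values, one is a multiple of the other."""
    if not z_values or any(z <= 0 for z in z_values):
        raise ValueError("z_values must be a non-empty list of positive integers")
    ordered = sorted(set(z_values))
    # In sorted order it is enough that each value divides the next one:
    # divisibility is transitive, so all remaining pairs follow.
    return all(b % a == 0 for a, b in zip(ordered, ordered[1:]))


def smallest_common_z(z_values):
    """Return the smallest molecule count that is a multiple of every input Z."""
    if not z_values or any(z <= 0 for z in z_values):
        raise ValueError("z_values must be a non-empty list of positive integers")
    return lcm(*z_values)
```

For the example above, are_mutual_multiples([4, 8, 16]) holds, while a set such as [4, 6] fails the check even though both cells could in principle be expanded to 12 molecules.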
Interacting with databases
cspy-threshold-utils db [-h] (--minimize | --combine | --dump | --cluster | --split)
[--minimize-skip MINIMIZE_SKIP] [--minimize-increasing]
[--minimize-trial MINIMIZE_TRIAL] [-x XYZ_FILES [XYZ_FILES ...]]
[-c CHARGES] [-m MULTIPOLES] [-a AXIS]
[-p {fit,fit_disponly,fit_reponly,w99,fit_water_X,Day_halobenzenes,w99_orig_Halogens,w99_orig_H,w99rev_6311,w99rev_6311_s,w99rev_631,w99rev_pcm_6311,w99_s_cl,w99sp,w99rev_pcm_6311_and_Chloride,w99rev_pcm_6311_and_Bromide,w99rev_pcm_6311_and_Iodide,w99rev_pcm_6311_and_Halides,isoPAHAP,PAHAP,nothing,gaff2_LJ,gaff2_fit}]
[--cutoff CUTOFF] [--keep-files] [--combine-db-name COMBINE_DB_NAME]
[--combine-only-valid] [--combine-serial] [--dump-format {cif,res}]
[--dump-kind {unique,trajectory}] [--cluster-superbasin]
[--cluster-method {compack}] [--workers WORKERS]
[-ll {INFO,DEBUG,WARNING,ERROR}]
databases [databases ...]
positional arguments
databases - Input databases (.db)
options
--minimize - Minimize the structures in the databases
--combine - Combine the databases into a single one
--dump - Dump the min structures of a database. The output will be a .zip file with the name of the input database(s)
--cluster - Further cluster the structures in a threshold database. Currently the only supported clustering method is COMPACK
--split - Split the input databases into their trials
--workers WORKERS - Number of workers to use for some actions (default: 1)
-ll LOG_LEVEL, --log-level LOG_LEVEL - Set the logging level (default: INFO)
Minimize
Configure how the minimization of the db will be done
--minimize-skip MINIMIZE_SKIP - Number of steps to skip between minimization of structures (default: 0)
--minimize-increasing - Minimize structures that have higher SPE than the previous MC step
--minimize-trial MINIMIZE_TRIAL - Filter which trial to minimize
-x XYZ_FILES, --xyz-files XYZ_FILES - Xyz files containing molecules for generation
-m MULTIPOLES, --multipoles MULTIPOLES - RankN multipole file
-a AXIS, --axis AXIS - Axis filename for structure minimization
-p POTENTIAL, --potential POTENTIAL - Intermolecular potential name (default: fit)
--cutoff CUTOFF - DMACRYS real space/repulsion-dispersion cutoff (default: calculate)
--keep-files - Keep the DMACRYS and NEIGHCRYS files, which are stored in a new directory in the pwd for each structure
Combine
Combine databases into a single one
--combine-db-name COMBINE_DB_NAME - Name of the combined output database (default: combined.db)
--combine-only-valid - Reduce amount of data added to new database
--combine-serial - Combine the databases in serial instead of parallel
Dump
Dump structures of a database into a folder
--dump-format DUMP_FORMAT - Set the format to dump the crystal files into (default: cif)
--dump-kind DUMP_KIND - Which structures to dump from the database (default: cif)
Cluster
Further cluster threshold database
--cluster-superbasin - Cluster structures in smallest basin containing all trajectory starts
--cluster-method CLUSTER_METHOD - Cluster algorithm (default: compack)
This utility provides the following actions on threshold databases:
--minimize minimizes the valid points along the trajectory. Flags are available to skip holding points between minimizations or to minimize only the structures from a specific trial of the database.
--combine combines multiple threshold databases into a single database. This is useful when the trajectories for different starting points have been run in individual jobs. The databases must be merged at the end so that they can be clustered to find connections between the trajectories, and so that the disconnectivity graph can be generated.
--dump saves the valid minimized structures of a threshold database to a zip file in CIF or SHELXL format.
--cluster runs further clustering on threshold databases using the COMPACK algorithm.
--split splits a single database into one database for each trial.
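When trajectories have been run as separate jobs, the combine and cluster actions are typically chained. The sketch below assembles the two command lines as argument lists using only the flags documented above. The helper name build_db_workflow is hypothetical, and the sketch assumes that the cluster action is run on the combined database; it does not execute anything itself.

```python
def build_db_workflow(databases, combined_name="combined.db", only_valid=False):
    """Return the argument lists for combining, then clustering, threshold databases."""
    if not databases:
        raise ValueError("at least one input database is required")

    # Step 1: merge the per-job databases into a single database.
    combine = [
        "cspy-threshold-utils", "db", "--combine",
        "--combine-db-name", combined_name,
    ]
    if only_valid:
        combine.append("--combine-only-valid")
    combine.extend(str(db) for db in databases)

    # Step 2: cluster the merged database (assumed to be the combined file).
    cluster = ["cspy-threshold-utils", "db", "--cluster", combined_name]

    return [combine, cluster]
```

Each returned list can be passed to subprocess.run(cmd, check=True) in turn, so that the clustering step only starts if the combine step succeeded.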
Generating disconnectivity graphs
cspy-threshold-utils disconn [-h] [-c CLUSTER_FILE | --plot-density | --colour-ini-basins]
[-e EN_MIN] [--up-limit UP_LIMIT] [--interval INTERVAL]
[-o OUTPUT] [--plot-branch] [--plot-nodes] [--plot-ini-labels]
[--plot-lid-heights] [--plot-all-labels] [--plot-by-unique-id]
[--plot-by-relative-energy] [--save-pickle]
[-ll {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
input
positional arguments
input - Input clustered database file or pickled data
options
--save-pickle - Save pickle file of disconnectivity graph object
-ll LOG_LEVEL, --log-level LOG_LEVEL - Set the logging level (default: INFO)
Clustering
Options to colour the disconnectivity graph edges
-c CLUSTER_FILE, --cluster-file CLUSTER_FILE - Clustering file used to colour the disconnectivity graph. The fourth row is currently hard coded as the cluster label
--plot-density - Colour leaf edges by density of the crystal
--colour-ini-basins - Colour edges by basins of individual trajectories
Disconnectivity
Modify how the disconnectivity graph is generated
-e EN_MIN - Minimum energy. Only works when reading from a database, not a pickled object
--up-limit UP_LIMIT - Upper-lid energy limit. Only works when reading from a database, not a pickled object
--interval INTERVAL - Energy interval for disconnectivity graph. Only works when reading from a database, not a pickled object
Plotting
Control the plot settings and look
-o OUTPUT, --output OUTPUT - Name for the output file (default: disconnectivity_graph)
--plot-branch - Dump main branch disconnectivity graph instead of the full graph
--plot-nodes - Show leaf nodes
--plot-ini-labels - Plot trial numbers of the initial structures
--plot-lid-heights - Plot lid heights
--plot-all-labels - Plot labels of all nodes on graph
--plot-by-unique-id - Use the unique ids of the structures instead of the node ids
--plot-by-relative-energy - Plot using relative energy from global energy minimum
Once all the trajectories have been run and minimized, disconnectivity graphs can be generated. To do so, pass the clustered trajectory database to this utility:
cspy-threshold-utils disconn database-1.db
This will generate a file with the default name disconnectivity_graph.png. Many options are provided to change what will be plotted and how:
--plot-branch will only plot the disconnectivity of the structures from which trajectories were started. It will remove any structures that were found during the course of the trajectories.
--plot-nodes adds the node at the end of the branches of the graph. Nodes for trajectory starts are shown differently from those of structures found during the trajectories.
Generation of the disconnectivity graph can be a costly operation for very large databases. To avoid having to recalculate the graph every time, this utility provides the --save-pickle option:
cspy-threshold-utils disconn --save-pickle database-1.db
This generates a pickle file that can be used to replot the disconnectivity graph:
cspy-threshold-utils disconn disconnectivity_graph.pickle
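The choice between recalculating from the database and replotting from the pickle can be automated. The sketch below is a hypothetical helper, not part of cspy: it returns the replot command when the pickle file already exists, and otherwise the command that builds the graph from the database and saves the pickle. The default pickle name is taken from the example above.

```python
from pathlib import Path


def build_disconn_command(database, pickle_file="disconnectivity_graph.pickle"):
    """Return the disconn argument list, reusing a saved pickle when present."""
    pickle_path = Path(pickle_file)
    if pickle_path.is_file():
        # A pickle is available: replot without recalculating the graph.
        return ["cspy-threshold-utils", "disconn", str(pickle_path)]
    # No pickle yet: build the graph from the database and save it for later.
    return ["cspy-threshold-utils", "disconn", "--save-pickle", str(database)]
```

Note that the minimum energy, upper-lid limit and interval options only take effect when reading from a database, so the pickle should be deleted or bypassed when those values need to change.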
This utility offers limited control over the appearance of the disconnectivity graph. To change how the plot looks or how the graph is generated, users can write their own scripts using the cspy.threshold.disconnectivity_graph.DisconnectivityGraph class:
from cspy.db import CspDataStore
from cspy.threshold.disconnectivity_graph import DisconnectivityGraph, DrawFilters

ds = CspDataStore("database-1.db")
disconn = DisconnectivityGraph(ds=ds)

# Obtain data from the disconnectivity graph. Check the available methods
# that DisconnectivityGraph provides.
lid_id = disconn.barrier_between_ids("STRUCT_ID1", "STRUCT_ID2")

# DrawFilters provides some default utilities to change the looks of the
# disconnectivity graph. The user can also write their own filters.
disconn.draw("my_disconn_graph", DrawFilters.initial_structures())

# DisconnectivityGraph can also provide the figure and axes of the graph
fig, ax = disconn.create_plot()
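To give a feel for what the minimum energy, upper-lid limit and interval options control, the following toy sketch groups minima into superbasins at a series of energy lids. It is a generic illustration of the disconnectivity idea under stated assumptions, not cspy's implementation: the function name, the input format (minimum ids plus a list of pairwise barrier energies) and the union-find approach are all assumptions made for this example.

```python
def superbasins_by_lid(minima, barriers, en_min, up_limit, interval):
    """Group minima into superbasins at each lid from en_min up to up_limit.

    minima: iterable of hashable ids.
    barriers: list of (id_a, id_b, barrier_energy) tuples.
    Returns a list of (lid, groups) with groups as sorted lists of ids.
    """
    parent = {m: m for m in minima}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pending = sorted(barriers, key=lambda b: b[2])
    index = 0
    levels = []
    # Count steps as integers to avoid accumulating floating-point error.
    n_steps = int((up_limit - en_min) / interval + 1e-9)
    for step in range(n_steps + 1):
        lid = en_min + step * interval
        # Two minima join a superbasin once the lid reaches their barrier.
        while index < len(pending) and pending[index][2] <= lid:
            a, b, _ = pending[index]
            parent[find(a)] = find(b)
            index += 1
        groups = {}
        for m in parent:
            groups.setdefault(find(m), set()).add(m)
        levels.append((lid, sorted(sorted(g) for g in groups.values())))
    return levels
```

A smaller interval resolves more distinct merge levels, while the upper-lid limit caps the highest energy at which basins are allowed to join.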