

Julien Steffen edited this page May 22, 2024 · 17 revisions

This program enables fast and fully controlled selection of basis sets (lists of local reference configurations) for VASP machine-learning force fields. In addition, the training sets of several on-the-fly learning runs can be combined (a more detailed description with example applications will follow as part of a paper currently in preparation).

It takes one or several ML_AB files and writes a new file ML_AB_sel, which can be used for subsequent generations of ML-FFs (see here).

The objective of the program is to select as diverse a subset as possible of the atoms in the given configurations as basis functions. To accelerate the process, three successive selections of increasing complexity are performed, followed by an additional selection of outliers:

  1. The atoms are sorted into bins according to the number of neighbors in their environments within the cutoff. All atoms with 20 neighbors are sorted into one bin, all atoms with 21 neighbors are sorted into the next bin, and so forth. In this way, e.g., atoms at the surface of a slab or in different regions of a gas phase are separated from those of bulk systems.
  2. Each neighbor-number bin is subdivided into bins based on the neighborhood diversity. This diversity is calculated by multiplying the numbers of atoms of each element within the environment. If, e.g., 16 Ga atoms, 4 Pt atoms and 3 H atoms are within the environment of an atom, its neighborhood diversity is 16 × 4 × 3 = 192. The number of these bins is set by the keyword -neigh_classes, and the atoms are sorted into them by rising diversity. The chosen total number of basis functions per element is then divided among the resulting bins.
  3. Each neighborhood-diversity bin is analyzed by hierarchical cluster analysis of radial and angular distribution functions. The overlap integral matrix of both functions is obtained by calculating the integrals for all pairs of atoms within the bin. The clustering hierarchy is cut at the layer where the total number of clusters equals the number of basis functions allocated to this bin, and one atom of each cluster in that layer is chosen for the final basis set.
  4. Additionally, a predefined fraction of atoms is chosen from gradient norm outliers and from rare neighborhoods.
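
The first two selection steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program's actual implementation: the function names, the input format (one list of neighbor element symbols per atom) and the equal-population cut into diversity bins are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict
from math import prod


def neighborhood_diversity(neighbor_elements):
    """Product of the per-element neighbor counts (step 2)."""
    counts = Counter(neighbor_elements)
    return prod(counts.values())


def bin_by_neighbor_count(environments):
    """Group atom indices by their number of neighbors (step 1)."""
    bins = defaultdict(list)
    for atom_index, neighbors in enumerate(environments):
        bins[len(neighbors)].append(atom_index)
    return dict(bins)


def split_by_diversity(atom_indices, environments, neigh_classes):
    """Sort one neighbor-count bin by rising diversity and cut it into
    at most `neigh_classes` sub-bins of roughly equal population.
    The equal-population cut is an assumption of this sketch."""
    ordered = sorted(atom_indices,
                     key=lambda i: neighborhood_diversity(environments[i]))
    if not ordered:
        return []
    n_bins = min(neigh_classes, len(ordered))
    size, rest = divmod(len(ordered), n_bins)
    sub_bins, start = [], 0
    for b in range(n_bins):
        stop = start + size + (1 if b < rest else 0)
        sub_bins.append(ordered[start:stop])
        start = stop
    return sub_bins


# Example from the text: 16 Ga, 4 Pt and 3 H neighbors.
example_env = ["Ga"] * 16 + ["Pt"] * 4 + ["H"] * 3
print(neighborhood_diversity(example_env))  # 16 * 4 * 3 = 192
```

Note that with this definition an environment containing only one element has a diversity equal to its neighbor count, so the value separates environments by composition rather than by size alone.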

The program offers a number of keywords, which need to be given as command line arguments, e.g.:

mlff_select -keyword1 -keyword2 -keyword3

For an overview of all important keywords and the general functionality of the program, type:

mlff_select -help
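
When the program is driven from a script, the argument list can be assembled programmatically. The helper below is a hypothetical convenience function, not part of mlff_select; it only assumes the `-keyword=value` and bare `-keyword` syntax used on this page, and the cutoff value of 5.0 is an arbitrary placeholder.

```python
def build_command(keywords, program="mlff_select"):
    """Build an argument list from a dict of keywords.

    True becomes a bare switch (e.g. -rdf_only), False is skipped,
    lists are joined by commas, dicts become el:number pairs.
    """
    args = [program]
    for name, value in keywords.items():
        if value is True:
            args.append(f"-{name}")
        elif value is False:
            continue
        elif isinstance(value, dict):
            pairs = ",".join(f"{el}:{n}" for el, n in value.items())
            args.append(f"-{name}={pairs}")
        elif isinstance(value, (list, tuple)):
            args.append(f"-{name}=" + ",".join(str(v) for v in value))
        else:
            args.append(f"-{name}={value}")
    return args


cmd = build_command({
    "ml_ab": ["ML_AB_liquid", "ML_AB_solid1"],
    "nbasis_el": {"Ga": 8000, "Pt": 4000},
    "cutoff": 5.0,       # placeholder value, in Angstroms
    "rdf_only": True,
})
print(" ".join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` if the program is on the PATH.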

The following keywords (either required or optional) are available:

  • -ml_ab=[file1],[file2],... The list of ML_AB files that shall be analyzed and processed by the program. Up to 20 different files can be used. Example: -ml_ab=ML_AB_liquid,ML_AB_solid1,ML_AB_solid2
  • -nbasis=[number] (option 1) The desired total number of basis functions to be selected by the program. With this option, the same number is used for all involved elements. In principle, arbitrary numbers can be chosen, but there are constraints. The number must not be larger than the total number of atoms of the least abundant element (e.g., for 1000 configurations with 3 Pt atoms per configuration, nbasis must not exceed 3000). Further, very large numbers might lead to memory problems in the VASP refit calculation. Numbers larger than 10,000 are thus currently not recommended.
  • -nbasis_el=[el1:number1],[el2:number2],... (option 2) Number of basis functions, resolved by elements. This option should be used if one element is much less abundant than the others. If, e.g., a system with 200 Ga and 5 Pt atoms has been trained, a useful choice might be -nbasis_el=Ga:8000,Pt:4000. In general, however, there is little to gain from choosing a very low number for scarce elements, since the total amount of memory needed is determined by the element with the largest number of basis functions.
  • -cutoff=[value] The radial (and angular) cutoff (in Angstroms) to be used in the subsequent refit calculation. This keyword is analogous to the ML_RCUT1 and ML_RCUT2 keywords of VASP (see here). Currently, radial and angular cutoffs cannot be distinguished. With this, useful basis function selections can be provided for refits with different cutoffs.
  • -grad_frac=[value] (optional) The fraction of basis functions to be allocated to atoms with the largest gradient norms. The mlff_select program calculates the gradient norm of each atom in each configuration from the gradient data provided within the ML_AB file(s). From the global list of atoms with their gradients, the N atoms with the largest gradients are taken as basis functions, where N is the total number of desired basis functions times this fraction (evaluated separately for each element if nbasis_el is given). The idea is that atoms with large gradients might be part of extreme situations such as almost merging atoms or unwanted dissociations. Including a disproportionately high fraction of these outliers in the training set increases the overall stability of the force field (extrapolation problems become less likely). Default: 0.1
  • -train_div=[value] (optional) Increases the diversity of the selected basis set. All atoms within the given ML_AB file(s) are sorted by their number of neighbors. The fraction of the total basis functions given by this keyword is chosen by picking atoms uniformly from all neighbor-number bins, independent of the total number of atoms in each bin. If this value is raised, more atoms from less populated bins are taken, thus increasing the number of outliers in the training set. Default: 0.1
  • -rdf_only (optional) Switch for deactivating the calculation of angular distribution functions. If larger cutoffs are used for the selection, the calculation of ADFs can become extremely expensive, since it scales quadratically with the number of atoms within the cutoff. Because that number itself grows with the third power of the cutoff in a dense system, the cost grows roughly with the sixth power of the cutoff. If the keyword (without a parameter) is given, all ADF calculations are skipped and only RDFs are used for the hierarchical clustering. Default: not given
  • -rdf2adf=[value] (optional) Relative weight of radial distribution functions (RDF) and angular distribution functions (ADF) in the final clustering of atoms within each neighborhood-diversity bin. Example: -rdf2adf=2.0: RDF will have 67% weight, ADF will have 33% weight. The effect of this keyword has not been well tested so far, so the default (equal weight of both) should be a good choice. Default: 1.0
  • -neigh_basmax=[number] (optional) Maximum number of basis functions to be allocated to a single neighborhood-diversity subclass. This keyword is rather technical and has a large impact on the performance of the program. The atoms with a certain number of neighbor atoms within the cutoff are allocated to one neighborhood class, and these classes are subdivided into a number of neighborhood-diversity subclasses. The number of subclasses per neighborhood class is determined by this keyword: the number of basis functions assigned to the largest neighborhood class (by either linear or square-root scaling) is divided by the value given here to obtain the number of neighborhood-diversity subclasses for each neighborhood class. A smaller value reduces the influence of the hierarchical clustering based on RDFs and ADFs; a larger value raises the computational effort of the hierarchical clustering and thus the run time. The given default should be reasonable in most cases. Default: 10
  • -bas_scale=[word] (optional) word can be linear or root. Determines the algorithm by which the number of basis functions for each neighborhood subclass is chosen. If a certain number of neighbors is by far the most probable (more atoms within the given ML_AB file(s) fall into this category), more basis functions will be chosen for this number. If this number shall be determined by a linear function (a prefactor, depending on the total number of available basis functions and the total number of atoms in the ML_AB file(s), times the number of atoms in the subclass), choose linear. If a square-root function shall be used (prefactor times the square root of the number of atoms), choose root. The root option gives more weight to outliers (or less abundant neighborhoods), since fewer basis functions are allocated to the more abundant neighborhoods. Default: linear
  • -max_environ=[number] (optional) The maximum number of atoms within the cutoff region of each atom, needed for initial array initializations. If larger cutoffs shall be selected by the cutoff keyword, this number should be raised as well, which will increase the memory requirements. Default: 100
  • -s_grid=[number] (optional) The number of grid points for the numerical representation of radial and angular distribution functions within the final hierarchical clustering algorithm for neighborhood subclasses. A larger value gives a better representation of the functions, but also requires much more computational effort (calculation of numerical overlap integrals). The default value should be OK for most purposes. Default: 200
  • -rdf_exp=[value] (optional) The width of the Gaussians for the smoothed representation of the environments in the radial distribution function calculation. The default value should be OK for most purposes. Default: 20.0
  • -adf_exp=[value] (optional) The width of the Gaussians for the smoothed representation of the environments in the angular distribution function calculation. The default value should be OK for most purposes. Default: 0.5
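
The arithmetic behind -rdf2adf, -grad_frac and -bas_scale can be made concrete with a small Python sketch. The function names are hypothetical, and both the rounding of N and the normalization of the prefactor (chosen here so that the allocations sum to the available total) are assumptions of this example; the program's actual bookkeeping may differ.

```python
from math import sqrt


def rdf_adf_weights(rdf2adf):
    """Relative RDF and ADF weights for a given -rdf2adf ratio."""
    w_rdf = rdf2adf / (1.0 + rdf2adf)
    return w_rdf, 1.0 - w_rdf


def largest_gradient_atoms(gradients, nbasis, grad_frac):
    """Indices of the N atoms with the largest gradient norm, where
    N = nbasis * grad_frac (rounded to an integer in this sketch)."""
    n_pick = round(nbasis * grad_frac)
    norms = [sqrt(gx * gx + gy * gy + gz * gz) for gx, gy, gz in gradients]
    ranked = sorted(range(len(norms)), key=norms.__getitem__, reverse=True)
    return ranked[:n_pick]


def allocate_basis(bin_populations, n_available, bas_scale="linear"):
    """Distribute n_available basis functions over neighborhood bins.

    linear: proportional to the number of atoms in the bin
    root:   proportional to the square root of that number
    Returns unrounded (float) allocations.
    """
    if bas_scale == "linear":
        weights = {k: float(n) for k, n in bin_populations.items()}
    elif bas_scale == "root":
        weights = {k: sqrt(n) for k, n in bin_populations.items()}
    else:
        raise ValueError("bas_scale must be 'linear' or 'root'")
    prefactor = n_available / sum(weights.values())
    return {k: prefactor * w for k, w in weights.items()}


# -rdf2adf=2.0 gives about 67% RDF and 33% ADF weight.
print(rdf_adf_weights(2.0))

# 900 atoms with 20 neighbors, 100 atoms with 21 neighbors,
# 100 basis functions to distribute.
populations = {20: 900, 21: 100}
print(allocate_basis(populations, 100, "linear"))  # about 90 and 10
print(allocate_basis(populations, 100, "root"))    # about 75 and 25
```

In this toy case the root option raises the share of the rarer neighborhood from about 10 to about 25 basis functions, which illustrates why it gives more weight to less abundant environments.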

After running the mlff_select program, a refit calculation needs to be done to generate a new ML_FF file from the ML_AB file. In that case it is important to add the keyword ML_EPS_LOW = 1E-20 (or a similarly low value), since otherwise a significant part of the chosen basis functions might be removed from the force field again, which would defeat the purpose of the selection. See the ML-FF section for more details.
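
As a rough illustration, the machine-learning part of a refit INCAR could be generated as below. Only ML_EPS_LOW = 1E-20 is taken from this page; ML_LMLFF and ML_MODE = refit are the tags used by recent VASP versions to switch on a refit and are an assumption here, so they should be checked against the documentation of the installed VASP version. The helper name is hypothetical.

```python
def refit_incar_lines(eps_low="1E-20"):
    """Return the ML-related INCAR lines for a refit run.

    Apart from ML_EPS_LOW, the tag set is an assumption of this
    sketch and may need adjusting for the VASP version in use.
    """
    tags = {
        "ML_LMLFF": ".TRUE.",   # switch on the machine-learning force field
        "ML_MODE": "refit",     # refit from an existing ML_AB file
        "ML_EPS_LOW": eps_low,  # keep (nearly) all selected basis functions
    }
    return [f"{key} = {value}" for key, value in tags.items()]


print("\n".join(refit_incar_lines()))
```

These lines would be added to the other INCAR settings of the refit calculation.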