BREAKING CHANGE: mof interface, Graph featurizer, more BU featurizers #422

Open
wants to merge 17 commits into
base: main
from
3 changes: 2 additions & 1 deletion docs/source/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -10,4 +10,5 @@ API documentation
api/splitters
api/metrics
api/bench
api/helpers
api/helpers
api/structures
17 changes: 17 additions & 0 deletions docs/source/api/featurizers.rst
Original file line number Diff line number Diff line change
Expand Up @@ -94,4 +94,21 @@ Host Guest featurization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. automodule:: mofdscribe.featurizers.hostguest.host_guest_featurizer
:members:


Graph featurization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. automodule:: mofdscribe.featurizers.graph.graph_featurizer
:members:

.. automodule:: mofdscribe.featurizers.graph.dimensionality
:members:


Matminer adapter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. automodule:: mofdscribe.featurizers.matmineradapter
:members:
9 changes: 9 additions & 0 deletions docs/source/api/structures.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@

Structure inputs
-------------------

MOF
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. automodule:: mofdscribe.mof
:members:
27 changes: 20 additions & 7 deletions docs/source/background.rst
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,17 @@ descriptor that only considers the local environment (of e.g., 3 atoms). For thi
featurizers/global/*


Graph featurizers
--------------------

Graph featurizers are a special class of featurizers that are based on the structure graph. The structure graph is a periodic graph in which the edges are bonds and the nodes are atoms.

.. toctree::
:glob:
:maxdepth: 1

featurizers/graph/*


BU-centered featurizers
-----------------------------
Expand All @@ -75,11 +86,13 @@ mofdscribe can compute descriptors that are BU-centred, for instance, using RDKi
from matminer.featurizers.structure import SiteStatsFingerprint
from pymatgen.core import Structure
from mofdscribe.featurizers.bu import BUFeaturizer
from mofdscribe.mof import MOF
from mofdscribe.featurizers.matmineradapter import MatminerAdapter

base_feat = SiteStatsFingerprint(SOAP.from_preset("formation_energy"))
base_feat.fit([hkust_structure])
base_feat = MatminerAdapter(SiteStatsFingerprint(SOAP.from_preset("formation_energy")))
base_feat.fit([MOF(hkust_structure)])
featurizer = BUFeaturizer(base_feat, aggregations=("mean",))
features = featurizer.featurize(structure=hkust_structure)
features = featurizer.featurize(structure=MOF(hkust_structure))

For this, you can either provide your building blocks that you extracted with any of the available tools, or use our integration with our `moffragmentor <https://github.com/kjappelbaum/moffragmentor>`_ package. In this case, we will fragment the MOF into its building blocks and then compute the features for each building block and let you choose how you want to aggregate them.

Expand All @@ -100,15 +113,15 @@ This featurizer will automatically extract the host and guest structures from th
.. code-block:: python

from matminer.featurizers.structure.sites import SiteStatsFingerprint

from mofdscribe.featurizers.matmineradapter import MatminerAdapter
from mofdscribe.featurizers.hostguest import HostGuestFeaturizer

featurizer = HostGuestFeaturizer(
featurizer=SiteStatsFingerprint.from_preset("SOAP_formation_energy"),
featurizer=MatminerAdapter(SiteStatsFingerprint.from_preset("SOAP_formation_energy")),
aggregations=("mean",),
)
featurizer.fit([structure])
features = featurizer.featurize(structure)
featurizer.fit([mof])
features = featurizer.featurize(mof)

If you are interested in surface chemistry features, you might also find suitable featurizers in the `matminer <https://hackingmaterials.lbl.gov/matminer/>`_ package.

Expand Down
9 changes: 5 additions & 4 deletions docs/source/contributing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,12 @@ Extending and contributing to mofdscribe
Implementing a new featurizer
-----------------------------

To implement a new featurizer, you typically need to create a new class that inherits from the :py:class:`~mofdscribe.featurizers.base.MOFBaseFeaturizer`. In this class, you need to implement three methods:
:py:meth:`~mofdscribe.featurizers.base.MOFBaseFeaturizer.featurize`, :py:meth:`~mofdscribe.featurizers.base.MOFBaseFeaturizer.feature_labels` and :py:meth:`~mofdscribe.featurizers.base.MOFBaseFeaturizer.citation`.
To implement a new featurizer, you typically need to create a new class that inherits from the :py:class:`~mofdscribe.featurizers.base.MOFBaseFeaturizer`. In this class, you need to implement four methods:
:py:meth:`~mofdscribe.featurizers.base.MOFBaseFeaturizer.featurize`, :py:meth:`~mofdscribe.featurizers.base.MOFBaseFeaturizer._featurize`, :py:meth:`~mofdscribe.featurizers.base.MOFBaseFeaturizer.feature_labels` and :py:meth:`~mofdscribe.featurizers.base.MOFBaseFeaturizer.citation`.

The main featurization logic happens in :py:meth:`~mofdscribe.featurizers.base.MOFBaseFeaturizer._featurize`.
This method should accept as input a :py:class:`~pymatgen.core.Structure` (or :py:class:`~pymatgen.core.IStructure`, :py:class:`~pymatgen.core.Molecule`, :py:class:`~pymatgen.analysis.graphs.StructureGraph`) object and return a :py:class:`numpy.ndarray`. The public :py:meth:`~mofdscribe.featurizers.base.MOFBaseFeaturizer.featurize` method is supposed to extract the relevant object from a :py:class:`~mofdscribe.mof.MOF` object and pass it to ``_featurize``.

The main featurization logic happens in :py:meth:`~mofdscribe.featurizers.base.MOFBaseFeaturizer.featurize`.
Your method should accept as input a :py:class:`~pymatgen.core.Structure` object and return a :py:class:`numpy.array`.
The :py:meth:`mofdscribe.featurizers.base.MOFBaseFeaturizer.feature_labels` method should return a list of strings that describe the features returned by :py:meth:`~mofdscribe.featurizers.base.MOFBaseFeaturizer.featurize`. The number of feature names should match the number of features returned by :py:meth:`~mofdscribe.featurizers.base.MOFBaseFeaturizer.featurize` (i.e. the number of columns in the feature matrix). The :py:meth:`~mofdscribe.featurizers.base.MOFBaseFeaturizer.citation` method should return a list of strings of BibTeX citations for the featurizer.

Generally, you also want to decorate your structure with the
Expand Down
2 changes: 2 additions & 0 deletions docs/source/featurizers/atom_centered/racs.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,8 @@
Revised autocorrelation functions (RACs)
.............................................

.. _RACS:

Revised autocorrelation functions have originally been proposed for
metal-complexes [Janet2017]_. Autocorrelation functions have been widely used as
compact, fixed-length descriptors and are defined as
Expand Down
24 changes: 24 additions & 0 deletions docs/source/featurizers/bu_centered/angles.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
Angle-based description of BU shape
=======================================

The following featurizers compute the angles between all pairs of atoms in a building block.
We always compute the angles A-COM-B, where COM is the center of mass of the building block.

Given the distribution of the angles, we can compute fixed-length descriptors by either converting
the distribution to a histogram or computing some statistics (mean, standard deviation, etc.) of the distribution.
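
The construction described above can be sketched in plain Python. This is a minimal illustration and not the implementation used by the featurizers below: the function names, the bin count, and the toy coordinates are assumptions made for this example.

.. code-block:: python

    import math
    import statistics


    def pairwise_com_angles(coords, masses):
        """Return the angles A-COM-B (in degrees) for all atom pairs."""
        total_mass = sum(masses)
        # Center of mass as the mass-weighted mean of the coordinates.
        com = [
            sum(m * c[k] for m, c in zip(masses, coords)) / total_mass
            for k in range(3)
        ]
        vectors = [[c[k] - com[k] for k in range(3)] for c in coords]
        angles = []
        for i in range(len(vectors)):
            for j in range(i + 1, len(vectors)):
                a, b = vectors[i], vectors[j]
                norm = math.hypot(*a) * math.hypot(*b)
                if norm == 0:
                    continue  # an atom sitting on the COM defines no angle
                cos = sum(x * y for x, y in zip(a, b)) / norm
                # Clamp to [-1, 1] to guard against rounding errors.
                angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos)))))
        return angles


    def angle_histogram(angles, n_bins=18, lower=0.0, upper=180.0):
        """Convert the angle distribution to a fixed-length histogram."""
        width = (upper - lower) / n_bins
        counts = [0] * n_bins
        for angle in angles:
            index = min(int((angle - lower) / width), n_bins - 1)
            counts[index] += 1
        return counts


    # Toy square-planar fragment with four equal masses (hypothetical data).
    coords = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
    masses = [1.0, 1.0, 1.0, 1.0]
    angles = pairwise_com_angles(coords, masses)
    histogram = angle_histogram(angles)
    mean_angle = statistics.mean(angles)

Both ``histogram`` and ``mean_angle`` have a length that does not depend on the number of atoms, which is what makes them usable as fixed-length descriptors.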

.. featurizer:: PairWiseAngleHist
:id: PairWiseAngleHist
:considers_geometry: True
:considers_structure_graph: False
:encodes_chemistry: False
:scope: bu
:scalar: False

.. featurizer:: PairWiseAngleStats
:id: PairWiseAngleStats
:considers_geometry: True
:considers_structure_graph: False
:encodes_chemistry: False
:scope: bu
:scalar: False
25 changes: 25 additions & 0 deletions docs/source/featurizers/bu_centered/num_hops.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
Shortest-Path Based Description of Building Blocks
======================================================

For certain targets, the proximity of connecting groups in a building unit
(e.g. carboxy groups) can be an informative feature.

One way to describe this generally is to compute the distribution
of shortest paths between special sites in the building units that
our ``moffragmentor`` package calls "binding sites" and "branching sites".
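
The idea can be illustrated with a breadth-first search on a small molecular graph. This is only a sketch under stated assumptions: the adjacency-dictionary representation, the function names, and the toy six-membered ring are hypothetical and do not reflect how ``moffragmentor`` or the featurizers below store their data.

.. code-block:: python

    from collections import deque
    from itertools import combinations


    def hop_distances(adjacency, source):
        """Breadth-first search: number of bonds from source to each node."""
        distances = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distances:
                    distances[neighbor] = distances[node] + 1
                    queue.append(neighbor)
        return distances


    def pairwise_hops(adjacency, special_sites):
        """Shortest-path lengths between all pairs of special sites."""
        hops = []
        for start, end in combinations(special_sites, 2):
            distances = hop_distances(adjacency, start)
            if end in distances:  # skip pairs that are not connected
                hops.append(distances[end])
        return hops


    # Toy six-membered ring; nodes 0, 2 and 4 play the role of special sites.
    ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
    hops = pairwise_hops(ring, [0, 2, 4])

The resulting list of hop counts can then be summarized with statistics such as the mean or the maximum to obtain a fixed-length description.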

.. featurizer:: BranchingNumHopFeaturizer
:id: BranchingNumHopFeaturizer
:considers_geometry: False
:considers_structure_graph: True
:encodes_chemistry: False
:scope: bu
:scalar: False

.. featurizer:: BindingNumHopFeaturizer
:id: BindingNumHopFeaturizer
:considers_geometry: False
:considers_structure_graph: True
:encodes_chemistry: False
:scope: bu
:scalar: False
21 changes: 21 additions & 0 deletions docs/source/featurizers/bu_centered/racs.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@

Revised autocorrelation functions (RACs)
.............................................

See also :ref:`RACs <RACs>`.

This featurizer is a flavor of the :ref:`RACs <RACs>` featurizer that can split the computation over user-defined atom groups and automatically determined communities. For determining communities, we use ``networkx``'s implementation of greedy modularity maximization [NewmanModularity]_.
The detected communities often correspond to chemically meaningful parts of the structure (often ring systems), without requiring an explicit fragmentation algorithm.

In mofdscribe, you can customize the encodings :math:`P` (using all properties that are available in our `element-coder <https://github.com/kjappelbaum/element-coder>`_ package) as well as the aggregation functions.

.. featurizer:: ModularityCommunityCenteredRACS
:id: ModularityCommunityCenteredRACS
:considers_geometry: False
:considers_structure_graph: True
:encodes_chemistry: optionally
:scope: local
:scalar: False
:style: only-light

Initially described in [Janet2017]_ for metal complexes, extended to MOFs in [Moosavi2021]_.
4 changes: 4 additions & 0 deletions docs/source/featurizers/global/size.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
Extensive size descriptors
================================


18 changes: 18 additions & 0 deletions docs/source/featurizers/graph/dimensionality.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
Dimensionality
..................

Returns the dimensionality of the structure. This measure is based on [LarsenDimensionality]_, where the structure graph is analyzed.

In the case of MOFs, rod-like structures are considered 1D, sheet-like structures are considered 2D, and frameworks that extend in all three dimensions are considered 3D.

This can be interesting for the metal nodes, where typical SBUs such as Cu-paddlewheels are 0D. However, many well-known MOFs such as MOF-74 have infinite rod nodes, which this featurizer would consider 1D.

.. featurizer:: Dimensionality
:id: Dimensionality
:considers_geometry: False
:considers_structure_graph: True
:encodes_chemistry: False
:scope: global
:scalar: True

Returns the dimensionality of the structure. This measure is based on [LarsenDimensionality]_, where the structure graph is analyzed.
2 changes: 2 additions & 0 deletions docs/source/featurizers/host_guest/host_guest_aprdf.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,8 @@ This featurizer builds on the :ref:`APRDF` featurizer, but instead of using the
correlations between all atoms, it only considers the ones between the guest and all host atoms
(within some cutoff distance).

The APRDF is defined as:

.. math::

\operatorname{RDF}^{P}(R)=f \sum_{i, j}^{\text {all atom pairs }} P_{i} P_{j} \mathrm{e}^{-B\left(r_{i j}-R\right)^{2}}
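
A direct evaluation of this sum, restricted to guest-host pairs, could look like the following sketch. It is an illustration only: it ignores periodic boundary conditions, and the function name, the parameter names (``scale`` for :math:`f`, ``smoothing`` for :math:`B`), and the default values are assumptions made for this example rather than the featurizer's actual interface.

.. code-block:: python

    import math


    def host_guest_aprdf(
        guest_coords,
        guest_props,
        host_coords,
        host_props,
        radii,
        smoothing=10.0,
        scale=1.0,
        cutoff=20.0,
    ):
        """Evaluate RDF^P(R) over guest-host pairs for every R in radii."""
        values = []
        for radius in radii:
            total = 0.0
            for guest_xyz, guest_p in zip(guest_coords, guest_props):
                for host_xyz, host_p in zip(host_coords, host_props):
                    r_ij = math.dist(guest_xyz, host_xyz)
                    if r_ij > cutoff:
                        continue  # only pairs within the cutoff contribute
                    total += guest_p * host_p * math.exp(
                        -smoothing * (r_ij - radius) ** 2
                    )
            values.append(scale * total)
        return values


    # One guest atom and two host atoms; the second host lies beyond the cutoff.
    rdf = host_guest_aprdf(
        guest_coords=[(0.0, 0.0, 0.0)],
        guest_props=[2.0],
        host_coords=[(3.0, 0.0, 0.0), (50.0, 0.0, 0.0)],
        host_props=[3.0, 5.0],
        radii=[3.0, 4.0],
        smoothing=1.0,
    )

Each entry of ``rdf`` corresponds to one value of :math:`R`, so the length of the descriptor is set by the chosen grid of radii and not by the number of atoms.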
Expand Down
50 changes: 28 additions & 22 deletions docs/source/getting_started.rst
Original file line number Diff line number Diff line change
Expand Up @@ -8,45 +8,34 @@ Featurizing a MOF
.. code-block:: python

from mofdscribe.chemistry.racs import RACS
from pymatgen.core import Structure
from mofdscribe.mof import MOF

s = Structure.from_file(<my_cif>)
mof = MOF.from_file(<my_cif>)
featurizer = RACS()
features = featurizer.featurize(s)
features = featurizer.featurize(mof)

.. admonition:: mofdscribe base classes
:class: hint

Most featurizers in mofdscribe inherit from :py:class:`~mofdscribe.featurizers.base.MOFBaseFeaturizer`.
This class can also handle the conversion to primitive cells if you pass :code:`primitive=True` to the
constructor. This can be useful to save computational time but also make it possible to, e.g.,
use the :code:`sum` aggregation.

To avoid re-computation of the primitive cell, you should use the :py:class:`~mofdscribe.featurizers.base.MOFMultipleFeaturizer`
for combining multiple featurizers. This will accept a keyword argument :code:`primitive=True` in the constructor
and then compute the primitive cell once and use it for all the featurizers.

It is also easy to combine multiple featurizers into a single pipeline:

.. code-block:: python

from mofdscribe.chemistry.racs import RACS
from mofdscribe.pore.geometric_properties import PoreDiameters
from pymatgen.core import Structure
from mofdscribe.mof import MOF
from mofdscribe.featurizers.base import MOFMultipleFeaturizer

s = Structure.from_file(<my_cif>)
mof = MOF.from_file(<my_cif>)
featurizer = MOFMultipleFeaturizer([RACS(), PoreDiameters()])
features = featurizer.featurize(s)
features = featurizer.featurize(mof)

You can, of course, also pass multiple structures to the featurizer (and the
featurization is automatically parallelized via matminer):

.. code-block:: python

s = Structure.from_file(<my_cif>)
s2 = Structure.from_file(<my_cif2>)
features = featurizer.featurize_many([s, s2])
mof_1 = MOF.from_file(<my_cif>)
mof_2 = MOF.from_file(<my_cif2>)
features = featurizer.featurize_many([mof_1, mof_2])


And, clearly, you can also use the `mofdscribe` featurizers alongside ones from `matminer`:
Expand All @@ -55,9 +44,10 @@ And, clearly, you can also use the `mofdscribe` featurizers alongside ones from

from matminer.featurizers.structure import LocalStructuralOrderParams
from mofdscribe.chemistry.racs import RACS
from mofdscribe.featurizers.matmineradapter import MatminerAdapter

featurizer = MOFMultipleFeaturizer([RACS(), LocalStructuralOrderParams()])
features = featurizer.featurize_many([s, s2])
featurizer = MOFMultipleFeaturizer([RACS(), MatminerAdapter(LocalStructuralOrderParams())])
features = featurizer.featurize_many([mof_1, mof_2])


If you use the :code:`zeo++` or :code:`raspa2` packages, you can customize the temporary
Expand All @@ -72,6 +62,22 @@ directory.
and notebook in the `examples folder <https://github.com/kjappelbaum/mofdscribe/tree/main/examples>`_.


.. admonition:: Saving time using the MOF object
:class: tip

In our experience, the most time-consuming part of featurization is
the computation of the structure graph or the fragments.

Additionally, you often do not know in advance which featurizers you will
use.

If you want to save time in case you need to compute additional features later,
it can be practical to serialize the :py:class:`~mofdscribe.mof.MOF` objects
after the first featurization.
The objects will already contain the structure graph and the fragments (if
they have been computed in the first featurization).


Using a reference dataset
--------------------------

Expand Down
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@ dependencies, the featurizers are currently not integrated in matminer itself.
splitters
datasets
background
metrics
leaderboard
contributing
api
Expand Down
14 changes: 14 additions & 0 deletions docs/source/metrics.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
Metrics
===================

In order to compare our models we need to score them using a metric.
Commonly used scores include accuracy, precision, and recall for classification problems, and the mean absolute error for regression problems.

However, these metrics are not always the best choice.
It is well known, for instance, that accuracy is not a good metric for imbalanced datasets.
Beyond such considerations, it is also important to consider the purpose for which the model is used.

For materials discovery, this often implies that a metric that measures how many of the top materials we find is more
important than an averaged, overall score.

``mofdscribe`` provides some utilities to help with this in the ``mofdscribe.metrics`` subpackage.
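
To make the idea of a "top materials" metric concrete, the following sketch computes the fraction of the true top-:math:`k` materials that a model also ranks in its top :math:`k`. The function ``top_k_recall`` is a hypothetical helper written for this illustration and is not part of ``mofdscribe.metrics``; it assumes that larger target values are better.

.. code-block:: python

    def top_k_recall(y_true, y_pred, k):
        """Fraction of the true top-k items that appear in the predicted top-k."""
        if len(y_true) != len(y_pred):
            raise ValueError("y_true and y_pred must have the same length")
        if not 0 < k <= len(y_true):
            raise ValueError("k must be between 1 and the number of samples")
        indices = range(len(y_true))
        true_top = set(sorted(indices, key=lambda i: y_true[i], reverse=True)[:k])
        pred_top = set(sorted(indices, key=lambda i: y_pred[i], reverse=True)[:k])
        return len(true_top & pred_top) / k


    # Toy data: the model finds only one of the two best materials.
    y_true = [0.1, 0.9, 0.5, 0.8, 0.3]
    y_pred = [0.2, 0.7, 0.9, 0.6, 0.1]
    score = top_k_recall(y_true, y_pred, k=2)

Unlike an averaged error, this score is unaffected by how well the model predicts the materials outside the top :math:`k`.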
4 changes: 4 additions & 0 deletions docs/source/references.rst
Original file line number Diff line number Diff line change
Expand Up @@ -112,3 +112,7 @@ References
.. [Trappe] `Potoff, J. J.; Siepmann, J. I. Vapor–Liquid Equilibria of Mixtures Containing Alkanes, Carbon Dioxide, and Nitrogen. AIChE Journal 2001, 47 (7), 1676–1682. <https://doi.org/10.1002/aic.690470719>`_

.. [Varoquaux] `Varoquaux, G. Cross-Validation Failure: Small Sample Sizes Lead to Large Error Bars. NeuroImage 2018, 180, 68–77. <https://doi.org/10.1016/j.neuroimage.2017.06.061>`_

.. [LarsenDimensionality] `Larsen, P. M.; Pandey, M.; Strange, M.; Jacobsen, K. W. Definition of a Scoring Parameter to Identify Low-Dimensional Materials Components. Phys. Rev. Materials 2019, 3 (3), 034003. <https://doi.org/10.1103/PhysRevMaterials.3.034003>`_

.. [NewmanModularity] `Newman, M. E. J. Modularity and community structure in networks. Proceedings of the National Academy of Sciences 2006, 103 (23), 8577–8582. <https://doi.org/10.1073/pnas.0601602103>`_
7 changes: 4 additions & 3 deletions examples/build_model_using_mofdscribe.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,8 @@
"from mofdscribe.featurizers.base import MOFMultipleFeaturizer\n",
"from mofdscribe.datasets.thermal_stability_dataset import ThermalStabilityDataset\n",
"from mofdscribe.datasets.structuredataset import FrameDataset\n",
"from mofdscribe.splitters import HashSplitter"
"from mofdscribe.splitters import HashSplitter\n",
"from mofdscribe.mof import MOF"
]
},
{
Expand Down Expand Up @@ -19021,7 +19022,7 @@
],
"source": [
"feats = featurizer.featurize_many(\n",
" all_structures, ignore_errors=True\n",
" [MOF(s) for s in all_structures], ignore_errors=True\n",
") # we ignore errors here because some structures might not be fragmentable, those will then have NaNs in the feature matrix"
]
},
Expand Down Expand Up @@ -19633,7 +19634,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
"version": "3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:05:16) \n[Clang 12.0.1 ]"
},
"orig_nbformat": 4,
"vscode": {
Expand Down