-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BigChem implementation #199
Conversation
I reviewed the code and it looks great; @coltonbh could you please take a quick look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High level changes I would suggest:
- Do not use the global switches turning BigChem on/off. This will lead to great unhappiness! You've hidden when the system will/won't use BigChem and why inside of checking functions (
CheckBigChem
, etc.) that bury all exceptions and run calculations against a system that may have hours of work on a queue. You do not want this :) Instead I would simplify your architecture by adding abigchem: bool = False
keyword argument to the functions (such ascalc_cartensian_hessian
) where you added a BigChem option. This will maintain your existing API while adding new, optional functionality. Users (including your own code) can explicitly declare when they want to use bigchem like thiscalc_cartensian_hessian(..., bigchem=True)
. - Remove checks for BigChem readiness. If the system is not available you WANT to raise those exceptions and report back to the end user why they were not able to use BigChem, expecidally since they now explicitly passed a
bigchem = True
argument either to the function call or the command line arguments. - Definitely remove your
CheckBigChem
checks fromdef __init__(...)
functions. These calls would hang your program for potentially hours if there is work on the queue. You can assume BigChem is available, call it in your code, and if it is not available have it report back to end users the exception to the end user explaining why. This is the way 🙌
The mental model you should have for BigChem is simply that they are functions you can call to get back results. In a sense, it is akin to an engine
in your system, very much like the QCSchema
engine. You can ask BigChem for energy
, gradient
, or (parallel) hessian
objects, just like any other engine, and you can tell this engine what subprogram (i.e., terachem
, psi4
) to use. The difference is you can ask for many results in parallel (like the hessian or NEB beads).
This may seem like a lot of feedback but I think we can eliminate 80% of the code you've written and have a cleaner, easier to use, easier to understand implementation with a few architectural modifications :)
Feel free to ask any question for clarification!
geometric/nifty.py
Outdated
#==============================# | ||
#| BigChem stuff |# | ||
#==============================# | ||
|
||
BIGCHEM = False | ||
|
||
def CheckBigChem(engine): | ||
global BIGCHEM | ||
try: | ||
from bigchem import compute | ||
from qcio import Molecule as qcio_Molecule, ProgramInput | ||
|
||
molecule = qcio_Molecule( | ||
symbols=["O", "H", "H"], | ||
geometry=[ | ||
[0.0, 0.0, 0.0], | ||
[0.52421003, 1.68733646, 0.48074633], | ||
[1.14668581, -0.45032174, -1.35474466], | ||
], | ||
) | ||
|
||
prog_input = ProgramInput(molecule=molecule, | ||
calctype='energy', | ||
model={"method":'hf', "basis":'3-21g'}) | ||
output = compute.delay(engine, prog_input) | ||
|
||
BIGCHEM = output.get().success | ||
|
||
except: | ||
BIGCHEM = False | ||
logger.warning("BigChem/%s is not working properly. Calculations will be carried linearly." %engine) | ||
|
||
def BigChemReady(): | ||
global BIGCHEM | ||
return BIGCHEM | ||
|
||
def BigChemOff(): | ||
global BIGCHEM | ||
BIGCHEM = False | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not check BigChem with a calculation. The queue may have many calculations waiting and this computation could wait for hours, for example.
What is it you need to check? Perhaps I can help you think of a better way to implement it. You can simply try to run a calculation and if BigChem isn't running it will raise an exception because it will not be able to connect to Redis (your broker) to deliver the message. This error message will inform the end user about the issue that needs fixing instead of burying it with no output, as this function does here.
Burying exceptions with naked except:
statements is considered bad practice because it hides from the end user the cause of the problem and the won't know how to take action to fix it :)
Also, in Python the correct way to define function is with def snake_case
and not def CaptialCase
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, rather than having these global "switches" I would add bigchem: bool = False
to the functions where it might be used so that end users can easily use/not use BigChem in an explicit way, rather than hiding it's use in a buried global value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @coltonbh, I replaced the global BigChem switch with the function you suggested. It would be great if you could review the changes. Thank you!
geometric/normal_modes.py
Outdated
|
||
elif BigChemReady(): | ||
# If BigChem is ready, it will be used to parallelize the Hessian calculation. | ||
logger.info("BigChem will be used to calculate the Hessian. \n") | ||
from qcio import Molecule as qcio_Molecule, ProgramInput | ||
from bigchem import compute, group | ||
|
||
elems = molecule[0].elem | ||
|
||
# Uncommenting the following 6 lines and commenting out the rest of the lines will use BigChem's parallel_hessian function. | ||
#from bigchem.algos import parallel_hessian | ||
#qcio_M = qcio_Molecule(symbols=elems, geometry=coords.reshape(-1,3)) | ||
#input = ProgramInput(molecule=qcio_M, model={"method":engine.method, "basis": engine.basis}, calctype='hessian') | ||
#output = parallel_hessian("psi4", input).delay() | ||
#rec = output.get() | ||
#Hx = rec.results.hessian | ||
|
||
# Creating a list containing qcio Molecule obejcts with different geometries. | ||
molecules = [] | ||
for i in range(nc): | ||
coords[i] += h | ||
molecules.append(qcio_Molecule(symbols=elems, geometry=coords.reshape(-1,3))) | ||
coords[i] -= 2*h | ||
molecules.append(qcio_Molecule(symbols=elems, geometry=coords.reshape(-1,3))) | ||
coords[i] += h | ||
|
||
# Submitting calculations | ||
outputs = group(compute.s(engine.__class__.__name__.lower(), | ||
ProgramInput(molecule=qcio_M, calctype='gradient', | ||
model={"method":engine.method, "basis": engine.basis}, | ||
extras={"order":i}), | ||
) for i, qcio_M in enumerate(molecules)).apply_async() | ||
|
||
# Getting the records | ||
records = outputs.get() | ||
assert len(records) == nc*2 | ||
|
||
# Grouping the recrods | ||
grouped_records = list(zip(records[::2], records[1::2])) | ||
|
||
# Iterating through the grouped records to calculate the Hessian | ||
for i, (fwd_rec, bak_rec) in enumerate(grouped_records): | ||
# Double checking the order | ||
assert fwd_rec.input_data.extras["order"] == i*2 | ||
assert bak_rec.input_data.extras["order"] == fwd_rec.input_data.extras["order"] + 1 | ||
gfwd = fwd_rec.results.gradient.ravel() | ||
gbak = bak_rec.results.gradient.ravel() | ||
Hx[i] = (gfwd-gbak)/(2*h) | ||
|
||
# Deleting the records in the backend | ||
outputs.forget() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than using BigChem without the end user knowing, which can lead to surprising behavior, errors, and code paths, I would instead recommend adding a new keyword argument bigchem: bool = False
to calc_cartesian_hessian
and then have an if
bigchem: ` statement for this code block. Also, you can just use from bigchem import parallel_hessian and use it in place of your hand coded hessian assembly and then this would be a one-liner :)
geometric/params.py
Outdated
# Check BigChem. | ||
if kwargs.get('bigchem', False): | ||
CheckBigChem(kwargs.get('engine', 'terachem')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would strongly recommend removing this from your def __init__(...)
method. This CheckBigChem
function is executing an ab initio calculation against a distributed system that may have hundreds (or thousands!) of items on a job queue. Even without anything on the queue this will add at least a few seconds of overhead to a simple instantiation of the OptParams
object.
geometric/params.py
Outdated
if kwargs.get('bigchem', False): | ||
CheckBigChem(kwargs.get('engine', 'terachem')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto to the comment above. This will add seconds to hours of delay time to the instantiation of this object!! Do not do this! You do not need to check that BigChem is "ready" by performing a calculation against the system--this could results in hours of waiting if the queue is very full--you do not want this. BigChem can simply be trusted to be running and if someone tries a BigChem calculation without the system being on, it will raise the correct exception (telling the end user that the code cannot connect to the BigChem system and why).
grp_hessian.add_argument('--port', type=int, help='Work Queue port used to distribute Hessian calculations. Workers must be started separately.') | ||
grp_hessian.add_argument('--frequency', type=str2bool, help='Perform frequency analysis whenever Hessian is calculated, default is yes/true.\n ') | ||
grp_hessian.add_argument('--wqport', type=int, help='Work Queue port used to distribute Hessian calculations. Workers must be started separately. \n ') | ||
grp_hessian.add_argument('--bigchem', type=str2bool, help='Provide "Yes" to use BigChem for performing the Hessian calculation in parallel. \n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd have this flag communicate to your calc_cartesian_hessian
function by passing bigchem=True
, after you change the implementation of the calc_cartensian_hessian
function.
geometric/tests/test_neb.py
Outdated
# Turning off BigChem | ||
geometric.nifty.BigChemOff() | ||
assert not geometric.nifty.BigChemReady() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
geometric/tests/test_neb.py
Outdated
# Turning on BigChem | ||
geometric.nifty.CheckBigChem('qchem') | ||
assert geometric.nifty.BigChemReady() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
geometric/tests/test_neb.py
Outdated
# Turning off BigChem | ||
geometric.nifty.BigChemOff() | ||
assert not geometric.nifty.BigChemReady() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
geometric/tests/test_neb.py
Outdated
# Turning on BigChem | ||
geometric.nifty.CheckBigChem('terachem') | ||
assert geometric.nifty.BigChemReady() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
geometric/tests/test_neb.py
Outdated
# Turning off BigChem | ||
geometric.nifty.BigChemOff() | ||
assert not geometric.nifty.BigChemReady() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks dramatically better! Very nice!
I've added a note about using pytest fixtures so you can DRY up your test code and reduce the code you're adding while making your tests easier to reason about. You can extend this fixture
pattern to your other tests as well (in a separate PR) and make your whole tests suite much smaller.
This is already a huge improvement though! We've cut out over 100 lines of code and have something much easier to reason about.
At this point I'd say your integration with BigChem is much cleaner and easier to understand. Up to you if you'd like to clean up your tests with fixtures
. It's a good pattern to learn and will make writing tests much faster and easier.
You need to add bigchem
to your setup.py
file. If you don't want to install bigchem
by default, add it to extra_requires
. So either:
setup(
....
'numpy', # Main dependencies that are always installed
'bigchem', # Either here
],
extras_require={
'bigchem': ['bigchem'] # Or here (Optional dependencies)
}
)
Nice work 🙌
P.S. If you add it to extra_require
then you would install it with pip install geometric[bigchem]
. That's what those [someextras]
notation means on install--also install these extras. The array can contain a whole list of packages so for example if you had some visualization component to geometric and you want it to be optionally installable you could list all the packages required for that like this:
...
extra_require={
'visualization': ['pkg1', 'pkg2', 'pkg3']
}
And then if you wanted that visualization feature you'd do pip install geometric[visualization]
.
geometric/tests/addons.py
Outdated
using_bigchem_psi4 = pytest.mark.skipif( | ||
(not bigchem_found("psi4")), reason="BigChem/Psi4 is not working. please install the package to enable tests") | ||
using_bigchem_qchem = pytest.mark.skipif( | ||
(not bigchem_found("qchem")), reason="BigChem/QChem is not working. please install the package to enable tests") | ||
using_bigchem_terachem = pytest.mark.skipif( | ||
(not bigchem_found("terachem")), reason="BigChem/TeraChem is not working. please install the package to enable tests") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding to review!
geometric/tests/addons.py
Outdated
def bigchem_found(engine): | ||
geometric.nifty.CheckBigChem(engine) | ||
found = geometric.nifty.BigChemReady() | ||
geometric.nifty.BigChemOff() | ||
return found | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than perform checks, you may just want to use pytest's built in mark feature
@pytest.mark.bigchem # or maybe @pytest.mark.integration for all tests that utilize external programs
def my_bigchem_tests(...):
...
Then you can register these markets in your pyproject.toml
file (or wherever you keep your pytest
configuration:
[pytest]
markers =
slow: mark a test as slow running
integration: mark test as integration test
And then control which tests run with
pytest -m 'not integration`
or you can deselect integration test by default by updating your pytest configuration with something like this:
[tool.pytest.ini_options]
addopts = "-m 'not integration'"
markers = [
"integration: marks tests as integration (deselect with '-m \"not integration\"')",
]
Just a few additional ways to achieve the same outcome without needing to run checks over and over to see if a an program is available :)
geometric/tests/test_neb.py
Outdated
M, engine = geometric.prepare.get_molecule_engine( | ||
input=os.path.join(datad, "hcn_neb_input.psi4in"), | ||
chain_coords=os.path.join(datad, "hcn_neb_input.xyz"), | ||
images=11, | ||
neb=True, | ||
engine="psi4", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For pieces of code like this that are repeated boilerplate for your tests, I would recommend a pytest fixture. In your conftest.py
file (should live in top level of tests
directory):
@pytest.fixture
def molecule_engine_neb():
"""Return the Molecule and Engine for an NEB Calculation."""
def select_engine(engine: str):
return geometric.prepare.get_molecule_engine(
input=os.path.join(datad, "hcn_neb_input.psi4in"),
chain_coords=os.path.join(datad, "hcn_neb_input.xyz"),
images=11,
neb=True,
engine=engine,
)
return create_prog_input
Then you can use it in your function like this. Pytest will automatically inject the fixture as an argument at runtime:
@addons.using_psi4
@addons.using_bigchem
def test_psi4_bigchem(localizer, molecule_engine_neb):
M, engine = molecule_engine_neb('psi4')
# Continue with test
This helps reduce all the repeated boilerplate code in your test suite (DRY's up your code!) and lets your focus on what you want actually test rather than adding hundreds of lines of repeated code.
You can generalize a fixture like this further for other use cases or create other fixtures for repeated test prep boilerplate :) I'm mentioning this because it's a bit concerning to add ~500 lines of code for a few simple function calls in BigChem! That is a lot of code to have to maintain in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @coltonbh, Thank you for teaching me about pytest.fixture. I was able to remove repetitive lines in the test scripts.
geometric/tests/test_hessian.py
Outdated
def test_psi4_bigchem_hessian(): | ||
molecule, engine = geometric.prepare.get_molecule_engine(engine="psi4", input=os.path.join(datad, 'hcn_minimized.psi4in')) | ||
coords = molecule.xyzs[0].flatten()*geometric.nifty.ang2bohr | ||
hessian = geometric.normal_modes.calc_cartesian_hessian(coords, molecule, engine, tempfile.mkdtemp()) | ||
freqs, modes, G = geometric.normal_modes.frequency_analysis(coords, hessian, elem=molecule.elem, verbose=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another highly repeated section of code that is copy-and-pasted over and over again in the tests. I'd refactor this into a fixture and/or parent function to reduce replication. When we say "DRY your code" we mean "don't repeat yourself." So now you have something that works, go back and refactor repeated sections into shared functions or pytest fixtures (depending on how you want to use the function) so you have less code to maintain and it's easier to understand :)
elif self.params.bigchem: | ||
# If BigChem is available, it will be used to carry the single-point calculations. | ||
from qcio import Molecule as qcio_Molecule, ProgramInput | ||
from bigchem import compute, group | ||
elems = self.Structures[0].M.elem | ||
molecules = [] | ||
|
||
# Creating a list with qcio Molecule objects and submitting calculations. | ||
for Structure in self.Structures: | ||
molecules.append(qcio_Molecule(symbols=elems, geometry=Structure.cartesians.reshape(-1,3))) | ||
|
||
outputs = group(compute.s(self.engine.__class__.__name__.lower(), | ||
ProgramInput(molecule=qcio_M, calctype="gradient", | ||
model={"method":self.engine.method, "basis": self.engine.basis}, | ||
extras={"order":i}), | ||
) for i, qcio_M in enumerate(molecules)).apply_async() | ||
|
||
# Getting the records | ||
records = outputs.get() | ||
|
||
# Passing the results to chain.ComputeEnergyGradient() | ||
for i in range(len(self)): | ||
assert records[i].input_data.extras["order"] == i | ||
result = {"energy": records[i].results.energy, "gradient": np.array(records[i].results.gradient).ravel()} | ||
self.Structures[i].ComputeEnergyGradient(result=result) | ||
|
||
# Deleting the records | ||
outputs.forget() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do this instead if you want to make it simpler:
from bigchem import parallel_hessian
pi = ProgramInput(molecule=qcio_m, calctype='hessian', model={...}, extras={...})
fr = parallel_hessian(self.engine.basis, pi).delay()
output = fr.get()
fr.forget()
hessian = output.results.hessian
No worries if you want to keep your own implementation. You can see my implementation of this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part is for NEB, which won't need the Hessian calculation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see! I mistakenly thought it was a hessian :)
… test_hessian.py is updated.
… the broker connection.
Hi @leeping and @hjnpark. Any idea when this will be done? We were hoping to use this in our lab so wanted to check in :) @jandestrada |
@jandestrada its merged in now :) |
geomeTRIC will check to see if BigChem and a given engine are working properly when
--bigchem yes
argument is provided. If BigChem is ready, the Hessian and NEB calculation will be carried in parallel via BigChem. Currently TeraChem, Psi4, and QChem are available.Here are the changes:
--port
changed to--wqport
.method
andbasis
attributes were added to the engines.number_output
function was added to Psi4 engine.calc_new
function inQCEngineAPI
was modified to be compatible with TeraChem (this change is for QCEngine).nifty.py
.