Can you provide more information on the features in the predfeatures numpy file? #26

Open
ngszyba opened this issue May 18, 2022 · 6 comments

Comments

@ngszyba

ngszyba commented May 18, 2022

Dear dMaSIF team and users,

I am using the Google Colab version of dMaSIF to get surface predictions from model protein PDB files.
Among the outputs of dMaSIF I found a predfeatures_emb1.npy file with 34 columns and a corresponding .npy file containing the coordinates.
If I understand correctly, this is an array of surface patches with the biochemical features and coordinates of each patch.

Maybe I have overlooked it, but I couldn't find any hints on how to decipher the columns in the file.
Right now they are numbered from 0 to 33. Could you provide a list of columns/features, so that people can check, e.g., which feature is predicted in column 10 and how it changes between different patches?
It would dramatically increase the usability of these predictions.

Thanks!

@rmwu

rmwu commented May 18, 2022

Not sure if the authors are still replying, but you can find what it saves in data_iteration.py, line 94:

coloring = torch.cat([inputs, embedding, predictions, labels], axis=1)

Where inputs is computed from line 358 of model.py and embedding is the output from their model, computed at line 413 of the same file. If I recall correctly, embedding is only dim=16 or so.

@ngszyba
Author

ngszyba commented May 18, 2022

Hi, @rmwu

Thanks so much for such a quick answer and pointing towards data_iteration.py and model.py.
This is already a great hint. I assume that the columns could be deciphered from this section of model.py (line 481):

"P1": P1,
"P2": P2,
"R_values": R_values,
"conv_time": conv_time,
"memory_usage": memory_usage,

However, I still find it pretty difficult to work out what is packed into P1 and P2.
It would be great to get a feature list with brief descriptions, e.g. is the time in seconds?

cheers,

@jeanfeydy
Collaborator

jeanfeydy commented May 18, 2022

Hi @ngszyba, @rmwu,

Thanks for your interest in our work!

Just a quick word to say that we are still replying but:

  • Unfortunately, it’s hard to be as reactive as we would like to be (e.g. I’m now working mostly on public health problems, quite far away from proteins, and am discovering the joys of paperwork as a “tenured/senior” researcher…).
  • I’m only sharp on low-level problems related to e.g. KeOps or the convolutional layers.
  • @FreyrS is by far the sharpest person on the global structure of the dMaSIF repository, but he may be very busy at the moment - plus his PhD is coming to an end soon.

I have lost track of the structure of our files, but an important thing to note is that our dMaSIF model produces trained “neural” features that do not have a clear physical meaning. P1 and P2 just contain “one vector of trained features per surface point”, where our feature extractor has been optimized with respect to a certain task (see our three papers on this: Nature Methods, CVPR and MLDD). The lack of interpretability is unavoidable here, even though we’re trying to mitigate this as much as possible with our architectures.

I hope that @FreyrS will be able to confirm that predfeatures_emb1.npy does indeed contain the output of our convolutional layers :-)
Best regards,
Jean

@ngszyba
Author

ngszyba commented May 20, 2022

Hi @jeanfeydy!

Thanks for your input!
I understand that interpreting these features in an "absolute" manner is not advisable. In our project we are focusing on side-by-side comparisons of the surface features and projecting them onto other protein properties such as enzymatic activity, and we wanted to give dMaSIF a shot.

Would be great if @FreyrS can come back with the column/features list at a later time.

thank you guys,

cheers!

@FreyrS
Owner

FreyrS commented Jun 10, 2022

Hi @ngszyba,

What @rmwu said is absolutely correct.
The first 16 values are the "input features", consisting of 10 mean and Gaussian curvatures at different scales plus 6 learned chemical features. Next come 16 values, which are the point descriptors after the convolution (the "embedding"). These are followed by the interface predictions (1 value) and the interface labels (1 value), if there are any.
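
For readers who want to work with these columns directly, the layout described above could be sliced roughly as follows. This is only a sketch: it assumes the array has all 34 columns (so the label column is present) and that the columns appear in exactly the order listed in this comment. The names `COLUMN_SLICES` and `split_predfeatures` are made up for illustration and do not come from the dMaSIF code, and the order of the 10 curvature values within their block is not specified in this thread.

```python
import numpy as np

# Column layout as described in this thread: 16 + 16 + 1 + 1 = 34.
# The group names are illustrative, not identifiers from the dMaSIF repository.
COLUMN_SLICES = {
    "curvatures": slice(0, 10),   # mean and Gaussian curvatures at different scales
    "chemical": slice(10, 16),    # learned chemical features
    "embedding": slice(16, 32),   # point descriptors after the convolution
    "prediction": slice(32, 33),  # interface prediction
    "label": slice(33, 34),       # interface label (assumed to be present)
}


def split_predfeatures(features):
    """Split an (n_points, 34) array into named column groups."""
    features = np.asarray(features)
    if features.ndim != 2 or features.shape[1] != 34:
        raise ValueError(f"expected shape (n_points, 34), got {features.shape}")
    return {name: features[:, cols] for name, cols in COLUMN_SLICES.items()}


# Synthetic stand-in for the real file: 5 surface points, 34 columns.
demo = np.arange(5 * 34, dtype=np.float32).reshape(5, 34)
parts = split_predfeatures(demo)
for name, block in parts.items():
    print(name, block.shape)
```

With a real output, `demo` would be replaced by `np.load("predfeatures_emb1.npy")`. Under this layout, column 10 would be the first of the 6 learned chemical features.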

@BJWiley233

BJWiley233 commented Jul 21, 2024

So does this mean that dMaSIF doesn't use the radial grid system that MaSIF uses, in which neighborhood patches are mapped by radius and theta? As summarized in the picture below, MaSIF takes a neighborhood of 200 points for every vertex and then maps it to an [N_radii x N_theta] grid, correct?

image
