DockOpt treats protomers/tautomers as separate unique molecules when calculating enrichment #30

Open
svigneron opened this issue Apr 4, 2023 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@svigneron

The different protomers and tautomers of the same molecule get built as separate db2 files with the same ZINC ID but different numbers after the decimal points (e.g., ZINC00000000aBcD.0.0 vs. ZINC00000000aBcD.1.0). Each of these gets docked and scored on its own; however, only the best-scoring protomer/tautomer should be considered when calculating enrichment. DockOpt currently treats every separate db2 file as a unique 'active' molecule, which alters the calculated enrichment and allows for situations where a poorly scoring protomer brings down the enrichment score even though an alternative protomer scores well compared to the decoy compounds.
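
A minimal sketch of the requested behavior, under stated assumptions: each docked entry is an (ID, score) pair, a lower score is better, and everything before the first period in the ID identifies the molecule. The function name `best_per_molecule` and the scores are hypothetical and are not taken from DockOpt.

```python
def best_per_molecule(scored_entries):
    """Keep only the best (lowest) score for each parent molecule.

    Assumes IDs look like 'ZINC00000000aBcD.1.0', where the text before
    the first period identifies the molecule (an assumption based on the
    example IDs above), and that a lower score is better.
    """
    best = {}
    for entry_id, score in scored_entries:
        # Strip the protomer/tautomer suffix to get the shared molecule ID.
        molecule_id = entry_id.split(".", 1)[0]
        if molecule_id not in best or score < best[molecule_id]:
            best[molecule_id] = score
    return best


# Hypothetical scores for illustration only.
entries = [
    ("ZINC00000000aBcD.0.0", -12.5),
    ("ZINC00000000aBcD.1.0", -41.0),
    ("ZINC00000000eFgH.0.0", -30.2),
]
print(best_per_molecule(entries))
```

With this reduction, enrichment would be computed over one score per molecule instead of one score per db2 file.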

@ianscottknight added the bug label Apr 4, 2023
@ianscottknight self-assigned this Apr 4, 2023
@ianscottknight pinned this issue Apr 5, 2023
@ianscottknight
Collaborator

@jir322 Please comment on this if you have any remarks.

@jir322
Collaborator

jir322 commented Apr 5, 2023 via email

@ianscottknight
Collaborator

ianscottknight commented Apr 6, 2023

@jir322 From DockOpt's perspective, there is no such thing as "ZINC ID". There is only the id_num column in the OUTDOCK file, which corresponds to the zincname field encoded in the .db2 file of the molecule.

(Note that zincname and id_num are both misnomers for their data types, and are partly responsible for the confusion here. E.g., it is possible for built molecules to come from somewhere other than ZINC, such as the actives in the DUDE-Z dataset, which come from RCSB PDB.)

The real problem here is that there is no general ID for molecules in the DB2 file format. One possible solution is to just use the zincname field in the .db2 file as an actual molecule ID, since DockOpt currently treats the id_num column of OUTDOCK as a molecule ID, but doing so would almost certainly only create confusion in the long run.

Another solution is to update the .db2 file format to account for this ambiguity by adding a molecule_id field (and rectifying the zincname misnomer).

Yet another solution is to adopt a naming convention for zincname field entries of the same molecule, which would allow DockOpt to figure out what to treat as the same molecule. I would suggest a regex. E.g., ^.*\.\d$ would match any string that consists of an ID (such as a ZINC code) followed by a period and a single digit, where the digit would identify the protomer / tautomer.
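
As a rough illustration of how such a convention could be applied, the sketch below uses Python's standard `re` module with the suggested pattern. The capture groups, the `NAME_PATTERN` constant, and the `split_name` helper are additions made for this example and are not part of DockOpt.

```python
import re

# The pattern suggested above, with capture groups added (an addition
# for this sketch) so the molecule part and the variant digit can be
# extracted separately.
NAME_PATTERN = re.compile(r"^(.*)\.(\d)$")


def split_name(zincname):
    """Return (molecule_id, variant) if the name follows the convention, else None."""
    match = NAME_PATTERN.match(zincname)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_name("ZINC00000000aBcD.1"))    # ('ZINC00000000aBcD', '1')
print(split_name("ZINC00000000aBcD"))      # None
print(split_name("ZINC00000000aBcD.1.0"))  # ('ZINC00000000aBcD.1', '0')
```

Because `.*` is greedy, a name with two numeric suffixes, like the example IDs earlier in this thread, keeps the first suffix in the molecule part. Any convention would therefore need to settle which suffix identifies the protomer / tautomer.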

@jir322
Collaborator

jir322 commented Apr 7, 2023 via email
