Virtual screening of the SARS-CoV-2 main protease with rxDock and pose scoring

Authors: AvatarSimon Bray
Overview
Questions:
  • How can candidate ligands be generated and docked to a protein in Galaxy?

  • How can the poses of the docked ligands be evaluated?

  • How can a workflow for drug virtual screening be constructed in Galaxy?

Objectives:
  • Understand how Galaxy was used to perform docking and pose scoring on the SARS-CoV-2 main protease (MPro).

  • Replicate the study on a (very) small scale

  • Gain familiarity with the docking and scoring techniques involved.

Requirements:
Time estimation: 2 hours
Level: Intermediate Intermediate
Supporting Materials:
Last modification: Oct 18, 2022
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License The GTN Framework is licensed under MIT

Introduction

This tutorial provides a companion to the work performed in March 2020 by InformaticsMatters, the Diamond Light Source, and the European Galaxy Team to perform virtual screening on candidate ligands for the SARS-CoV-2 main protease (MPro). This work is described here.

In this tutorial, you will perform protein-ligand docking to MPro using rxDock (Ruiz-Carmona et al. 2014), a version of the popular rDock software, and score the results using two different methods. The same tools will be used as in the original study, but with a smaller dataset.

Agenda

In this tutorial, we will cover:

  1. Introduction
  2. Background
  3. Get data
  4. Preparation for docking
    1. Charge enumeration
    2. Generate three-dimensional conformations
    3. Splitting the SD-file into a collection
  5. Active site preparation
    1. Convert protein structure to MOL2 format
    2. Generate Frankenstein ligand
    3. Generate active site definition
  6. Docking and scoring
    1. Docking with rxDock
    2. Collapse collection to a single file
    3. Pose scoring with TransFS
    4. Pose scoring with SuCOS
  7. Compound selection
  8. Conclusion

Background

Early in March 2020, the Diamond Light Source completed a successful fragment screen on MPro, which provided 55 fragment hits (see their press release here). In an effort to identify candidate molecules for binding, InformaticsMatters, the XChem group and the European Galaxy team joined forces to construct and execute a Galaxy workflow for performing and evaluating molecular docking on a massive scale.

An initial list of 41,000 candidate molecules was assembled by using the Fragalysis fragment network to elaborate from the initial fragment hits, as described here. These were used as inputs for the docking and scoring workflow. The workflow consists of the following steps, each of which was carried out using tools installed on the European Galaxy server:

  1. Charge enumeration of the 41,000 candidate molecules selected based on the fragment hits.
  2. Generation of 3D conformations based on SMILES strings of the candidate molecules.
  3. Docking of molecules into each of the MPro structures using rxDock.
  4. Evaluation of the docking poses using a TransFS, a deep learning approach (Scantlebury 2019) developed by the XChem group and collaborators, and SuCOS scoring (Leong 2019), which compares the poses with the structures of the original fragment hits.

The original study required almost 20 years of CPU time, not counting GPU resources consumed. This is obviously not reproducible as a tutorial. Therefore, we will repeat the workflow with a small library of just 100 molecules, on a single MPro fragment structure. Links will be provided to original Galaxy histories, with notes to explain where and why things were done differently to the tutorial.

MPro structure, with a fragment bound.
Figure 1: Structure of MPro, with a fragment bound. Click to view in NGL. (Rose et al. 2018)

Get data

We require three datasets for the simulation and analysis:

  1. A list of 100 ligand candidates. These are the molecules which will be docking into the protein binding site.
  2. A PDB file of the receptor MPro protein (without ligand or solvent).
  3. A list of fragment hits (17 in total) in SDF format.

Of the initial 55 fragment hits, 17 were chosen for further study. From these, 41,587 compounds were generated using the Fragalysis fragment network for further study. The 100 compounds used in the tutorial are taken from this list.

Starting data is available from this Galaxy history: https://usegalaxy.eu/u/sbray/h/mpro-raw-data.

This history contains 103 files. One of these (Initial candidates for docking) contains the 41k candidate compounds in SMILES format. The remaining 102 files (all with names beginning with Mpro-x...) provide structural information on the fragment hits - 6 files per hit (hence 17 x 6 = 102).

The identity of the files is as follows:

  • the *_0.mol files contain the fragment structure in mol format.
  • the *_0.pdb files contain the fragment structure in pdb format.
  • the *_0_apo.pdb files contain the protein with solvent, but without ligand
  • the *_0_apo-desolv.pdb files contain the protein without either solvent or ligand
  • the *_0_apo-solv.pdb files contain only solvent
  • the *_0_bound.pdb file contain everything (protein, ligand and solvent)

The PDB file of the receptor that we are using is Mpro-x0195_0_apo-desolv.pdb. In other words, the structure is derived from just one fragment hit. In the original study, however, all compounds were docked against all of the fragment hit structures.

Hands-on: Data upload
  1. Create a new history for this tutorial
  2. Import the files from Zenodo:

    https://zenodo.org/record/3730474/files/candidates.smi
    https://zenodo.org/record/3730474/files/Mpro-x0195_0_apo-desolv.pdb
    https://zenodo.org/record/3730474/files/hits.sdf
    
    • Copy the link location
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window
  3. Rename the datasets Candidates SMILES, Receptor PDB and Hits SDF respectively.
  4. Check that the datatypes (smi, pdb, and sdf respectively) are correct. In particularly, check the Candidates SMILES file, as the SMILES datatype is not detected automatically by Galaxy.

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click on the galaxy-chart-select-data Datatypes tab on the top
    • Select datatypes
      • tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

Preparation for docking

Before docking, the candidate ligands need to be prepared for docking with the following steps: 1) charge enumeration, 2) generation of three-dimensional structures, and 3) splitting the SD-file into a collection.

This stage is carried out as described here, except of course with the full set of 42,000 compounds. See here for more details.

Charge enumeration

Many of the compounds may contain functional groups which can exist in multiple charge states, and this will affect the quality of binding to the receptor dramatically. Therefore, we perform ‘charge enumeration’, which means that we generate all charge forms of the compounds within a certain pH range.

Hands-on: Charge enumeration
  1. Enumerate charges Tool: toolshed.g2.bx.psu.edu/repos/bgruening/enumerate_charges/enumerate_charges/2020.03.4+galaxy0 with the following parameters:
    • “Input molecule data”: Candidate SMILES
    • “Minimum pH”: 4.4
    • “Maximum pH”: 10.4
  2. Rename the output file Enumerated candidates SMILES.

    Comment

    The Enumerate charges tool tool is based on the Dimorphite-DL program. (Ropp et al. 2019)

The output is another SMILES file, with several hundred entries.

Generate three-dimensional conformations

So far our list of enumerated candidate compounds is still in SMILES format; we need to produce three-dimensional structures in SDF format for docking. This can be done with the Compound conversion tool tool.

If you are not familiar with SMILES and SDF formats, consult the introductory protein-ligand docking tutorial for more details.

Hands-on: Convert to SDF format
  1. Compound conversion Tool: toolshed.g2.bx.psu.edu/repos/bgruening/openbabel_compound_convert/openbabel_compound_convert/3.1.1+galaxy0 with the following parameters:
    • “Molecular input file”: Enumerated candidates dataset.
    • “Output format”: MDL MOL format (sdf, mol)
      • “Generate 3D coordinates”: Yes
  2. Rename the output file Enumerated candidates SDF.

    Comment

    The Compound conversion tool tool is based on the OpenBabel toolkit. (O’Boyle et al. 2011)

Splitting the SD-file into a collection

The next stage is to split the SD-file with the candidate ligands into a set of smaller SD-files.

Question

Why is splitting the file necessary?

The rxDock tool performs one docking at a time (more technically: the task is not parallelized, as it uses only a single CPU). Therefore, splitting the large SD-file into many small files allows the work to be carried out by multiple Galaxy jobs in parallel, so it completes faster.

In the original study, this kind of parallelization was absolutely essential because of the enormous dataset; at some points, there were 5,000 docking jobs running concurrently on the European Galaxy server. Even on the much smaller scale of this tutorial, we can speed things up considerably using this trick.

Hands-on: Split the SD-files
  1. Split file to dataset collection Tool: toolshed.g2.bx.psu.edu/repos/bgruening/split_file_to_collection/split_file_to_collection/0.5.0 with the following parameters:
    • “Select the file type to split”: SD-files
    • “SD-file to split”: Enumerated candidates SDF
      • “Specify number of output files or number of records per file?”: Number of output files
        • “Number of new files”: 10
      • “Method to allocate records to new files”: Alternate output files

Active site preparation

The active site also needs to be prepared for docking, using the following steps: 1) conversion to MOL2 format, and 2) generation of the active site using the rbcavity tool tool.

This stage was carried out as described here. However, it was repeated for each of the fragment hit structures, not just the Mpro-x0195_0_apo-desolv.pdb file used here. See here for more details.

Convert protein structure to MOL2 format

The receptor file we are using is in PDB format, but the rxDock tool we use for docking requires an input in MOL2 format. Therefore, we first convert the file.

Hands-on: Conversion to MOL2 format
  1. Compound conversion Tool: toolshed.g2.bx.psu.edu/repos/bgruening/openbabel_compound_convert/openbabel_compound_convert/3.1.1+galaxy0 with the following parameters:
    • “Molecular input file”: Receptor PDB dataset.
    • “Output format”: Sybyl Mol2 format (mol2)
  2. Rename the output file Receptor MOL2.

Generate Frankenstein ligand

For docking with rxDock, a file needs to be created defining the active site. This requires two input files - one for the protein and one for the ligand. We want an active site generation that takes into account the features of all 17 fragments, and therefore need to generate a ‘Frankenstein ligand’ which possesses the properties of all the fragments. A very simple Galaxy tool is available for this.

Question

What is a ‘Frankenstein ligand’ and why do we need it?

The Frankenstein ligand combines the atoms of all the fragments and therefore occupies the entire space of the binding site. Therefore, it is the best choice for cavity definition. See the information provided by InformaticsMatters for more details.

Hands-on: Generate Frankenstein ligand
  1. Create Frankenstein ligand Tool: toolshed.g2.bx.psu.edu/repos/bgruening/ctb_frankenstein_ligand/ctb_frankenstein_ligand/2013.1-0+galaxy0 with the following parameters:
    • “Input file”: Hits SDF
  2. Rename the file to Frankstein SDF.

Generate active site definition

The active site can now be generated using the rbcavity tool tool, which requires the receptor in MOL2 format as input as well as a single reference ligand in Mol/SDF format. We use the Frankenstein ligand as the reference.

Hands-on: Active site preparation
  1. rxDock cavity definition Tool: toolshed.g2.bx.psu.edu/repos/bgruening/rxdock_rbcavity/rxdock_rbcavity/0.1.5 with the following parameters:
    • “Receptor”: Receptor MOL2
    • “Reference ligand”: Frankenstein SDF
    • “Mapper sphere radius”: 3.0
    • “Mapper small sphere radius”: 1.0
    • “Mapper minimum volume”: 100
    • “Mapper volume increment”: 0
    • “Mapper grid step”: 0.5
    • “Cavity weight”: 1.0
  2. Rename the output file Active site.
    Comment

    The meanings of these parameters are too complex to go into in this tutorial. If you are interested, see the rDock documentation for more details.

Docking and scoring

Docking and scoring are now performed, using the following steps: 1) docking using rxDock, 2) recombining the results into a single SDF file, 3) TransFS scoring, and 4) SuCOS scoring.

This section in the original study differed from this tutorial in the following ways:

  1. Docking was performed on over 100,000 enumerated candidates, rather than the 300 used here.
  2. 25 different poses were generated per candidate, rather than 5, as in this tutorial.
  3. Because of the large number of poses to score (more than a million), the scoring steps were parallelized by splitting into collections. This is skipped in the tutorial.
  4. The entire process was repeated 17 times, using a different fragment hit as the receptor structure each time. See here and here for more details. A full list of Galaxy histories generated is listed here.

Docking with rxDock

Hands-on: Docking
  1. rxDock docking Tool: toolshed.g2.bx.psu.edu/repos/bgruening/rxdock_rbdock/rxdock_rbdock/0.1.5 with the following parameters:
    • “Receptor”: Receptor MOL2
    • “Active site”: Active site
    • “Ligands”: Split file collection
    • “Number of dockings”: 5
    • “Number of best poses”: 5
    Comment

    For more information about docking, check out the introductory tutorial. It uses a different tool, AutoDock Vina, rather than rxDock, but the general principles are the same.

Collapse collection to a single file

Having created a collection to parallelize the docking procedure, we can now recombine the results to a single file.

Hands-on: Collapse collection
  1. Collapse Collection Tool: toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/4.2 with the following parameters:
    • “Collection of files to collapse into single dataset”: Output of docking step
  2. Rename to Docked poses SDF.

The output file should contain around 1,900 docked poses in SDF format.

Pose scoring with TransFS

In this step, we carry out scoring of the poses using TransFS. This is a deep learning approach developed at the University of Oxford, employing augmentation of training data with incorrectly docked ligands to prompt the model to learn from protein-ligand interactions. (Scantlebury 2019)

The TransFS scoring returns a value (saved as <TransFSScore> in the SDF file) between 0 (poor) and 1 (good).

Hands-on: TransFS scoring
  1. XChem TransFS pose scoring Tool: toolshed.g2.bx.psu.edu/repos/bgruening/xchem_transfs_scoring/xchem_transfs_scoring/0.4.0 with the following parameters:
    • “Receptor”: Receptor PDB
    • “Ligands”: Docked poses SDF
    • “Distance to waters”: 2

Pose scoring with SuCOS

This step involves scoring of the poses from each molecule against the original fragment screening hit ligands using the SuCOS MAX shape and feature overlay algorithm. (Leong 2019) The conformation and position of the poses are compared to known structures (i.e. the fragment hits) to determine a score.

SuCOS scoring returns a value (saved as <Max_SuCOS_Score> in the SDF file) between 0 (poor) and 1 (good).

Hands-on: SuCOS scoring
  1. Max SuCOS score Tool: toolshed.g2.bx.psu.edu/repos/bgruening/sucos_max_score/sucos_max_score/2020.03.4+galaxy0 with the following parameters:
    • “Ligands to be scored”: Output of the TransFS step
    • “Set of clusters to score against”: Hits SDF
  2. Rename the output file to Scored poses.

Compound selection

The aim of the original study was to select 500 candidate molecules for synthesis and experimental study. In order to do this, the data for all fragment hits had to be combined (i.e. so that each compound was assigned the lowest score from all the fragments). The resulting table was then compared with a list of compounds available from Enamine and Chemspace and the 500 highest scoring matching compounds were selected for purchase.

This step is skipped in the tutorial, as only 100 compounds were tested, using only a single fragment hit structure. If you want, though, check out the history and workflow used.

Conclusion

This tutorial guided you through docking and scoring of a small set of compounds to the MPro protein. Hopefully, you have a better understanding of how docking can be practically used, as well as how the original study was performed.

Key points
  • Galaxy can support large, rapid studies in computational chemistry

  • Protein-ligand docking contributes to the discovery of new drugs

Frequently Asked Questions

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Computational chemistry topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

  1. O’Boyle, N. M., M. Banck, C. A. James, C. Morley, T. Vandermeersch et al., 2011 Open Babel: An open chemical toolbox. Journal of Cheminformatics 3: 10.1186/1758-2946-3-33
  2. Ruiz-Carmona, S., D. Alvarez-Garcia, N. Foloppe, A. B. Garmendia-Doval, S. Juhos et al., 2014 rDock: A Fast, Versatile and Open Source Program for Docking Ligands to Proteins and Nucleic Acids. PLOS Computational Biology 10: 1–7. 10.1371/journal.pcbi.1003571
  3. Rose, A. S., A. R. Bradley, Y. Valasatava, J. M. Duarte, A. Prlić et al., 2018 NGL viewer: web-based molecular graphics for large complexes. Bioinformatics 34: 3755–3758. 10.1093/bioinformatics/bty419
  4. Leong, S. et al., 2019 SuCOS is Better than RMSD for Evaluating Fragment Elaboration and Docking Poses. 10.26434/chemrxiv.8100203.v1 https://chemrxiv.org/articles/SuCOS_is_Better_than_RMSD_for_Evaluating_Fragment_Elaboration_and_Docking_Poses/8100203/1
  5. Ropp, P. J., J. C. Kaminsky, S. Yablonski, and J. D. Durrant, 2019 Dimorphite-DL: an open-source program for enumerating the ionization states of drug-like small molecules. Journal of Cheminformatics 11: 10.1186/s13321-019-0336-9
  6. Scantlebury, J. et al., 2019 Dataset Augmentation Allows Deep Learning-Based Virtual Screening To Better Generalise To Unseen Target Classes, And Highlight Important Binding Interactions. https://www.biorxiv.org/content/10.1101/2020.03.06.979625v1

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Click here to load Google feedback frame

Citing this Tutorial

  1. Simon Bray, 2022 Virtual screening of the SARS-CoV-2 main protease with rxDock and pose scoring (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/computational-chemistry/tutorials/covid19-docking/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012


@misc{computational-chemistry-covid19-docking,
author = "Simon Bray",
title = "Virtual screening of the SARS-CoV-2 main protease with rxDock and pose scoring (Galaxy Training Materials)",
year = "2022",
month = "10",
day = "18"
url = "\url{https://training.galaxyproject.org/training-material/topics/computational-chemistry/tutorials/covid19-docking/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                   

Congratulations on successfully completing this tutorial!