Protein target prediction of a bioactive ligand with Align-it and ePharmaLib

Authors: Aurélien F. A. Moumbock, Simon Bray

Overview
Questions:

What is a pharmacophore model?

How can I perform protein target prediction with a multi-step workflow or the one-step Zauberkugel workflow?

Objectives:

Create a SMILES file of a bioactive ligand.

Screen the query ligand against a pharmacophore library.

Analyze the results of the protein target prediction.

Requirements:

Introduction to Galaxy Analyses

Time estimation: 2 hours

Level: Intermediate

Supporting Materials:

Datasets

Workflows

FAQs

Available on these Galaxies

Docker image
ChemicalToolbox, Galaxy Africa, Galaxy India, Street Science, UseGalaxy.eu, UseGalaxy.org (Main)

Last modification: Oct 18, 2022

License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Introduction

Historically, the pharmacophore concept was formulated in 1909 by the German physician and Nobel prize laureate Paul Ehrlich (Ehrlich 1909). According to the International Union of Pure and Applied Chemistry (IUPAC), a pharmacophore is defined as “an ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target and to trigger (or block) its biological response” (Wermuth et al. 1998). Starting from the cocrystal structure of a non-covalent protein–ligand complex (e.g. Figure 1), pharmacophore perception involves the extraction of the key molecular features of the bioactive ligand at the protein–ligand contact interface into a single model (Moumbock et al. 2019). These pharmacophoric features mainly include: H-bond acceptor (HACC or A), H-bond donor (HDON or D), lipophilic group (LIPO or H), negative center (NEGC or N), positive center (POSC or P), and aromatic ring (AROM or R) moieties. Moreover, receptor-based excluded spheres (EXCL) can be added in order to mimic spatial constraints of the binding pocket (Figure 2). Once a pharmacophore model has been generated, a query can be performed either in a forward manner, using several ligands to search for novel putative hits of a given target, or in a reverse manner, by screening a single ligand against multiple pharmacophore models in search of putative protein targets (Steindl et al. 2006).

Figure 1: Crystal structure of *Plasmodium falciparum* calcium-dependent protein kinase 2 (CDPK2) complexed with staurosporine (STU), PDB ID: [4MVF](https://www.rcsb.org/structure/4mvf). Image generated using Maestro (Schrödinger LLC, NY).

Bioactive compounds often bind to several target proteins, thereby exhibiting polypharmacology. However, experimentally determining these interactions is laborious, and structure-based virtual screening of bioactive compounds could expedite drug discovery by prioritizing hits for experimental validation. The recently reported ePharmaLib (Moumbock et al. 2021) dataset is a library of 15,148 e-pharmacophores modeled from solved structures of pharmaceutically relevant protein–ligand complexes of the screening Protein Data Bank (sc-PDB, Desaphy et al. 2014). ePharmaLib can be used for target fishing of phenotypic hits, side effect predictions, drug repurposing, and scaffold hopping.

Figure 2: Depiction of the 2D structure of staurosporine (STU, left) and its 3D structure (right) with key pharmacophoric features extracted from the STU–CDPK2 complex (PDB ID: [4MVF](https://www.rcsb.org/structure/4mvf)). Image generated using Maestro (Schrödinger LLC, NY).

In this tutorial, you will perform pharmacophore-based target prediction of a bioactive ligand known as staurosporine (Figure 2) with the ePharmaLib subset representing Plasmodium falciparum protein targets (138 pharmacophore models) and the open-source pharmacophore alignment program Align-it, formerly known as PHARAO (Taminau et al. 2008).

Staurosporine (PDB hetID: STU) is an indolocarbazole secondary metabolite isolated from several bacteria of the genus Streptomyces. It displays diverse biological activities such as anticancer and antiparasitic activities (Nakano and Ōmura 2009).

Agenda

In this tutorial, we will cover:

Introduction

Create a history

Get data

Fetching the ePharmaLib dataset

Creating a query ligand structure file

Pre-processing

Ligand hydration

Splitting ePharmaLib into individual pharmacophores

Ligand conformational flexibility

Pharmacophore alignment

Post-processing

Concatenating the pharmacophore alignment scores

Ranking the predicted protein targets

One-step Zauberkugel workflow vs. multi-step workflow

Further analysis

Conclusion

Create a history

As a first step, we create a new history for the analysis.

Hands-on 1: Create history

Create a new history.

Click the new-history icon at the top of the history panel.

If the new-history icon is missing:

Click on the gear icon (History options) at the top of the history panel

Select the option Create New from the menu

Rename it to Staurosporine target prediction.

Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel

Type the new name: Staurosporine target prediction

Press Enter

Get data

For this exercise, we need two datasets: the ePharmaLib pharmacophore library (PHAR format) and a query ligand structure file (SMI format).

Fetching the ePharmaLib dataset

Firstly, we will retrieve the concatenated ePharmaLib subset representing P. falciparum protein targets.

Hands-on 2: Upload ePharmaLib

Upload the dataset from the Zenodo link provided to your Galaxy history.

Copy the link location

Open the Galaxy Upload Manager (upload icon on the top-right of the tool panel)

Select Paste/Fetch Data

Paste the link into the text field

https://zenodo.org/record/6055897/files/ePharmaLib_PHARAO_plasmodium.phar

Press Start

Close the window

Comment: ePharmaLib versions

Two versions of the ePharmaLib (PHAR and PHYPO formats) have been created for use with the pharmacophore alignment programs Align-it and Phase, respectively. Both versions can be broken down into smaller datasets, e.g. for human targets. They are freely available on Zenodo at https://zenodo.org/record/6055897

Change the datatype from tabular to phar. This step is essential, as Galaxy does not automatically detect the datatype for PHAR files.

Click on the pencil icon for the dataset to edit its attributes

In the central panel, click on the Datatypes tab at the top

Select phar

tip: you can start typing the datatype into the field to filter the dropdown menu

Click the Save button

You can view the contents of the downloaded PHAR file by pressing the eye icon (View data) for this dataset.

A PHAR file is essentially a series of lines containing the three-dimensional coordinates of pharmacophoric features and excluded spheres. The first column specifies a feature type (e.g. HACC is a hydrogen bond acceptor). Subsequent columns specify the position of the feature center in a three-dimensional space. Individual pharmacophores are separated by lines containing four dollar signs ($$$$). The pharmacophores of the ePharmaLib dataset were labeled according to the following three-component code PDBID-hetID-UniprotEntryName.
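To make this layout concrete, the following minimal Python sketch parses PHAR-formatted text into named pharmacophores. It rests on several assumptions that are not stated above: that each record starts with a line holding the pharmacophore label, that feature lines are whitespace-separated with the feature type first and the x, y, z coordinates next, and that any further columns can be ignored. The sample record uses invented coordinates and is not taken from ePharmaLib; `parse_phar` and `split_label` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, float, float, float]

# Invented sample record: the coordinates are NOT real ePharmaLib data.
SAMPLE_PHAR = (
    "4mvf-STU-CDPK2_PLAFK\n"
    "HACC\t1.0\t2.0\t3.0\t0.7\t0\t0.0\t0.0\t0.0\n"
    "AROM\t-4.5\t0.5\t2.25\t0.7\t0\t0.0\t0.0\t0.0\n"
    "EXCL\t6.0\t-1.0\t0.0\t1.7\t0\t0.0\t0.0\t0.0\n"
    "$$$$\n"
)


def parse_phar(text: str) -> Dict[str, List[Feature]]:
    """Return {label: [(feature_type, x, y, z), ...]} for each record."""
    pharmacophores: Dict[str, List[Feature]] = {}
    label = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "$$$$":  # record separator
            label = None
            continue
        if label is None:  # assumption: first line of a record is its label
            label = line
            pharmacophores[label] = []
            continue
        fields = line.split()
        x, y, z = (float(value) for value in fields[1:4])
        pharmacophores[label].append((fields[0], x, y, z))
    return pharmacophores


def split_label(label: str) -> Tuple[str, str, str]:
    """Split 'PDBID-hetID-UniprotEntryName' into its three components."""
    pdb_id, het_id, uniprot_entry = label.split("-", 2)
    return pdb_id, het_id, uniprot_entry


parsed = parse_phar(SAMPLE_PHAR)
```

Here `split_label("4mvf-STU-CDPK2_PLAFK")` separates the PDB ID, the ligand hetID, and the UniProt entry name, which is convenient when the predicted targets are inspected later.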

Creating a query ligand structure file

In this step, we will manually create an SMI file containing the SMILES of staurosporine.

The simplified molecular-input line-entry system (SMILES) is a string notation for describing the 2D chemical structure of a compound. It only states the atoms present in a compound and the connectivity between them. As an example, the SMILES string of acetone is CC(=O)C. SMILES strings can be imported by most molecule editors and converted into either two-dimensional structural drawings or three-dimensional models of the compounds, and vice versa. For more information on how the notation works, please consult the OpenSMILES specification or the description provided by Wikipedia.

Hands-on 3: Create an SMI file
Create a new file using the Galaxy upload manager, with the following contents. Make sure to select the datatype (with Type) as smi. This step is essential, as Galaxy does not automatically detect the datatype for SMI files.
C[C@@]12[C@@H]([C@@H](C[C@@H](O1)N3C4=CC=CC=C4C5=C6C(=C7C8=CC=CC=C8N2C7=C53)CNC6=O)NC)OC	staurosporine
Open the Galaxy Upload Manager

Select Paste/Fetch Data

Paste the file contents into the text field

Change Type from “Auto-detect” to smi

Press Start and Close the window
A SMILES string can be generated automatically from a ligand name or 2D structure with a desktop molecule editor such as ChemDraw® or Marvin®, or with web-based molecule editors such as PubChem Sketcher and ChemDraw® JS. Moreover, the pre-computed SMILES strings of a large number of bioactive compounds can be retrieved from chemical databases such as PubChem, e.g.:
   https://pubchem.ncbi.nlm.nih.gov/compound/44259#section=Isomeric-SMILES&fullscreen=true
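For readers who prefer to prepare the file outside Galaxy, the sketch below builds the same single-line SMI content in Python. It assumes the layout used in this hands-on, a SMILES string followed by a tab and the compound name; `make_smi_line` and `read_smi_line` are hypothetical helper names.

```python
STAUROSPORINE_SMILES = "C[C@@]12[C@@H]([C@@H](C[C@@H](O1)N3C4=CC=CC=C4C5=C6C(=C7C8=CC=CC=C8N2C7=C53)CNC6=O)NC)OC"


def make_smi_line(smiles: str, name: str) -> str:
    """Format one SMI record: SMILES, a tab, then the molecule name."""
    return f"{smiles}\t{name}\n"


def read_smi_line(line: str):
    """Split one SMI record back into (smiles, name)."""
    smiles, _, name = line.rstrip("\n").partition("\t")
    return smiles, name


smi_line = make_smi_line(STAUROSPORINE_SMILES, "staurosporine")
```

Writing `smi_line` to a file named, for example, `staurosporine.smi` and uploading it with the datatype set to smi should be equivalent to pasting the contents into the upload manager.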

Question

Why do we specifically use a so-called isomeric SMILES string?

Staurosporine is a chiral molecule possessing four chiral centers. The SMILES notation allows the specification of the configuration at tetrahedral centers, by marking atoms with @ or @@, and of double bond geometry, by marking bonds with / or \. These are structural features that cannot be specified by connectivity alone, and therefore SMILES which encode this information are termed isomeric SMILES. A notable feature of these rules is that they allow rigorous partial specification of chirality.
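As a quick check of the answer above, the stereo marks in the staurosporine SMILES can be counted with a regular expression. This is only a text-level sketch: it counts bracket atoms that carry an @ or @@ mark and does not perceive stereochemistry the way a cheminformatics toolkit would. The function name is hypothetical.

```python
import re

STAUROSPORINE_SMILES = "C[C@@]12[C@@H]([C@@H](C[C@@H](O1)N3C4=CC=CC=C4C5=C6C(=C7C8=CC=CC=C8N2C7=C53)CNC6=O)NC)OC"
ACETONE_SMILES = "CC(=O)C"

# A bracket atom such as [C@@H] or [C@] that contains at least one '@'.
CHIRAL_ATOM = re.compile(r"\[[^\]]*@[^\]]*\]")


def count_marked_stereocenters(smiles: str) -> int:
    """Count bracket atoms that carry a tetrahedral chirality mark."""
    return len(CHIRAL_ATOM.findall(smiles))


n_staurosporine = count_marked_stereocenters(STAUROSPORINE_SMILES)
n_acetone = count_marked_stereocenters(ACETONE_SMILES)
```

The count for staurosporine matches the four chiral centers mentioned above, whereas the achiral acetone string contains no marks.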

Pre-processing

Prior to pharmacophore alignment, the predominant ionization state(s) of the query ligand as well as its 3D conformers should be generated. Also, the pharmacophore dataset will be split into a collection of individual pharmacophore files.

Ligand hydration

More often than not, the bioactive form of a compound is its predominant form at physiological pH (7.4). In this step, we predict the most probable ionization state(s) of the query ligand at pH 7.4 with the cheminformatics toolkit OpenBabel (O’Boyle et al. 2011).

Hands-on 4: Add hydrogen atoms

Add hydrogen atoms Tool: toolshed.g2.bx.psu.edu/repos/bgruening/openbabel_addh/openbabel_addh/3.1.1+galaxy1 with the following parameters:

param-file “Molecular input file”: staurosporine.smi (from Hands-on 3)

“Add hydrogens to polar atoms only (i.e. not to carbon atoms)”: Yes

Rename the output to staurosporine_hydrated.

Click on the pencil icon for the dataset to edit its attributes

In the central panel, change the Name field to staurosporine_hydrated

Click the Save button

Question

Nitrogen-containing functional groups are known to be basic. Which of them present in staurosporine (Figure 2) do you expect to be protonated at pH 7.4, and which not? And why?

Only the secondary N-methylamino group will be protonated because indoles, much like aromatic amides, are typically not basic.
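The reasoning behind this answer can be illustrated with the Henderson–Hasselbalch relation, which gives the fraction of a basic group present in its protonated form at a given pH. The sketch below is a generic calculation, not what OpenBabel does internally, and the pKa values are placeholders chosen for illustration; they are not measured values for staurosporine.

```python
def fraction_protonated(pka: float, ph: float = 7.4) -> float:
    """Fraction of a basic group present as its conjugate acid (BH+).

    Henderson-Hasselbalch: [BH+] / ([B] + [BH+]) = 1 / (1 + 10**(pH - pKa))
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Placeholder pKa values (assumptions for illustration only).
ASSUMED_BASIC_AMINE_PKA = 9.5   # a typical order of magnitude for an aliphatic amine
ASSUMED_NON_BASIC_PKA = 0.0     # stands in for a nitrogen that is essentially non-basic

amine_fraction = fraction_protonated(ASSUMED_BASIC_AMINE_PKA)
non_basic_fraction = fraction_protonated(ASSUMED_NON_BASIC_PKA)
```

With these placeholder values, the amine-like group is almost entirely protonated at pH 7.4, while the non-basic nitrogen is almost entirely neutral, which is the qualitative pattern described in the answer.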

Splitting ePharmaLib into individual pharmacophores

The ePharmaLib subset representing P. falciparum protein targets (ePharmaLib_PHARAO_plasmodium.phar) is a concatenated file containing 138 individual pharmacophore files. To speed up our analysis, it is preferable to split the dataset into individual files in order to perform several pharmacophore alignments in parallel, using Galaxy’s collection functionality.

Hands-on 5: Splitting ePharmaLib

Split file Tool: toolshed.g2.bx.psu.edu/repos/bgruening/split_file_to_collection/split_file_to_collection/0.5.0 with the following parameters:

“Select the file type to split”: Generic

param-file “File to split”: ePharmaLib_PHARAO_plasmodium.phar (from Hands-on 2)

“Method to split files”: Specify record separator as regular expression

“Regex to match record separator”: \$\$\$\$

“Split records before or after the separator?”: After

“Specify number of output files or number of records per file?”: Number of records per file ('chunk mode')

“Base name for new files in collection”: epharmalib

“Method to allocate records to new files”: Maintain record order

Rename the output to ePharmaLib_PLAF_split.

Click on the pencil icon for the dataset to edit its attributes

In the central panel, change the Name field to ePharmaLib_PLAF_split

Click the Save button
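The effect of these settings can be sketched in a few lines of Python. This is an illustration of the idea (regular-expression separator, split after the separator, one record per output), not the implementation of the Galaxy tool; the demo text and the function name are invented.

```python
import re
from typing import List


def split_records(text: str, separator_regex: str = r"\$\$\$\$") -> List[str]:
    """Split text into records, ending each record after its separator line."""
    pattern = re.compile(separator_regex)
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if pattern.search(line):  # separator found: close the current record
            records.append("".join(current))
            current = []
    if any(line.strip() for line in current):  # trailing record without separator
        records.append("".join(current))
    return records


# Invented two-record demo input.
demo_text = "phar_A\nHACC\t0\t0\t0\n$$$$\nphar_B\nAROM\t1\t1\t1\n$$$$\n"
chunks = split_records(demo_text)
```

Each element of `chunks` corresponds to one element of the resulting collection, and joining the chunks reproduces the original file.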

Ligand conformational flexibility

To reduce the calculation time, the Align-it (Taminau et al. 2008) tool performs rigid alignment rather than flexible alignment. Conformational flexibility of the ligand is accounted for by introducing a preliminary step, in which a set of energy-minimized conformers for the query ligand is generated with the RDConf (Koes) tool, which uses the RDKit (Landrum and others 2013) toolkit.

Hands-on 6: Low-energy ligand conformer search

RDConf: Low-energy ligand conformer search Tool: toolshed.g2.bx.psu.edu/repos/bgruening/rdconf/rdconf/2020.03.4+galaxy0 with the following parameters:

param-file “Input file”: staurosporine_hydrated (from Hands-on 4)

“Maximum number of conformers to generate per molecule”: 100

Rename the output to staurosporine_3D_conformers.

Click on the pencil icon for the dataset to edit its attributes

In the central panel, change the Name field to staurosporine_3D_conformers

Click the Save button

Comment: RDConf

It is recommended to use the default settings, except for the number of conformers, which should be changed to 100. As a rule of thumb, a threshold of 100 conformers appropriately represents the conformational flexibility of a compound with fewer than 10 rotatable bonds. The output SDF (structure data file) format encodes the three-dimensional atomic coordinates of each conformer, separated by lines containing four dollar signs ($$$$).

Question

Have a look at the contents of the created collection staurosporine_3D_conformers. Why were fewer than 100 conformers generated for staurosporine?

Staurosporine is a fused 8-ring system with only two rotatable bonds. Its planar aromatic 5-ring indolocarbazole scaffold confers a high structural rigidity upon the compound, i.e. it exists in relatively few energetically distinct 3D conformations.
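Because each conformer in an SDF file ends with a `$$$$` line, the number of generated conformers can be checked by counting those lines. The sketch below does this for a text string; the demo content is a placeholder rather than a valid SDF, and the function name is hypothetical.

```python
def count_sdf_records(sdf_text: str) -> int:
    """Count records in SDF text by counting '$$$$' terminator lines."""
    return sum(1 for line in sdf_text.splitlines() if line.strip() == "$$$$")


# Placeholder text standing in for three conformer records (not a valid SDF).
demo_sdf = (
    "conformer 1 (mol block omitted)\n$$$$\n"
    "conformer 2 (mol block omitted)\n$$$$\n"
    "conformer 3 (mol block omitted)\n$$$$\n"
)
n_conformers = count_sdf_records(demo_sdf)
```

For a downloaded output file, the same function can be applied to the file contents, for example `count_sdf_records(open("staurosporine_3D_conformers.sdf").read())`, where the file name is an assumption.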

Pharmacophore alignment

In this step, the ligand conformer dataset (SDF format) is converted on-the-fly to a pharmacophore dataset (PHAR format) and simultaneously aligned to the individual pharmacophores of the ePharmaLib dataset in a batch mode with Align-it (Taminau et al. 2008). The pharmacophoric alignments and thus the predicted targets are ranked in terms of a scoring metric, the Tversky index, which ranges from 0 to 1. The higher the Tversky index, the higher the likelihood of the predicted protein–ligand interaction.
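To give a feel for the scoring metric, the sketch below computes the three volume-overlap scores reported by Align-it. It assumes the definitions given in the Align-it documentation as we understand them: the overlap volume divided by the union of both volumes (TANIMOTO), by the reference volume (TVERSKY_REF), or by the database volume (TVERSKY_DB). The volumes in the example are invented.

```python
def tanimoto(v_overlap: float, v_ref: float, v_db: float) -> float:
    """Overlap relative to the union of reference and database volumes."""
    return v_overlap / (v_ref + v_db - v_overlap)


def tversky_ref(v_overlap: float, v_ref: float) -> float:
    """Overlap relative to the reference pharmacophore volume only."""
    return v_overlap / v_ref


def tversky_db(v_overlap: float, v_db: float) -> float:
    """Overlap relative to the database (query ligand) volume only."""
    return v_overlap / v_db


# Invented volumes for illustration.
V_REF, V_DB, V_OVERLAP = 10.0, 20.0, 7.5

scores = {
    "TANIMOTO": tanimoto(V_OVERLAP, V_REF, V_DB),
    "TVERSKY_REF": tversky_ref(V_OVERLAP, V_REF),
    "TVERSKY_DB": tversky_db(V_OVERLAP, V_DB),
}
```

Under these definitions, TVERSKY_REF measures how much of the reference (target-derived) pharmacophore is matched by the ligand, so a ligand is not penalized for features that lie outside the reference model.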

Hands-on 7: Pharmacophore alignment

Pharmacophore alignment Tool: toolshed.g2.bx.psu.edu/repos/bgruening/align_it/ctb_alignit/1.0.4+galaxy0 with the following parameters:

param-file “Defines the database of molecules that will be used to screen”: staurosporine_3D_conformers (from Hands-on 6)

param-file “Reference molecule”: ePharmaLib_PLAF_split (from Hands-on 5)

“No normal information is included during the alignment”: Yes

“Disable the use of hybrid pharmacophore points”: Yes

“Only structures with a score larger than this cutoff will be written to the files”: 0.0

“Maximum number of best scoring structures to write to the files”: 1

“This option defines the used scoring scheme”: TVERSKY_REF

Post-processing

The above pharmacophore alignment produces three types of outputs: the aligned pharmacophores (PHAR format), aligned structures (SMI format), and alignment scores (tabular format). Of these results, only the alignment scores are of interest and will be post-processed prior to analysis.

Concatenating the pharmacophore alignment scores

The alignment score of the best ranked ligand conformer aligned against each ePharmaLib pharmacophore is stored in an individual file. In total, this job generates a collection of 138 output files, which should be concatenated into a single file for a better overview of the predictions.

Hands-on 8: Concatenating the scores

Concatenate datasets Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cat/0.1.1 with the following parameters:

param-file “Datasets to concatenate”: scores (from Hands-on 7)

Rename the output to concatenated_scores.

Click on the pencil icon for the dataset to edit its attributes

In the central panel, change the Name field to concatenated_scores

Click the Save button
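Outside Galaxy, the same concatenation could be sketched as follows. The file names and contents are invented for the demonstration, which runs in a temporary directory; `concatenate` is a hypothetical helper and not part of any Galaxy tool.

```python
import tempfile
from pathlib import Path


def concatenate(paths, destination) -> Path:
    """Write the contents of all input files, one after another, to destination."""
    inputs = sorted(Path(p) for p in paths)  # materialize before creating the output
    destination = Path(destination)
    with destination.open("w") as out:
        for path in inputs:
            text = path.read_text()
            if text and not text.endswith("\n"):
                text += "\n"  # keep one record per line
            out.write(text)
    return destination


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Invented single-line score files standing in for the collection elements.
    for i in range(3):
        (tmp_dir / f"scores_{i}.tabular").write_text(f"ref_{i}\t0.{i}\n")
    merged = concatenate(
        tmp_dir.glob("scores_*.tabular"),
        tmp_dir / "concatenated_scores.tabular",
    )
    merged_lines = merged.read_text().splitlines()
```

The order in which the files are concatenated does not matter here, because the merged table is re-sorted by score in the next step.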

Ranking the predicted protein targets

The resulting concatenated_scores needs to be re-sorted according to the alignment metric, the Tversky index, i.e. the 10th column. The pharmacophores of the ePharmaLib dataset were labeled according to the following three-component code PDBID-hetID-UniprotEntryName. The contents of the concatenated_scores are as follows:

------    ---------------------------------------------------------------------
Column    Content
------    ---------------------------------------------------------------------
1         Id of the reference structure
2         Maximum volume of the reference structure
3         Id of the database structure
4         Maximum volume of the database structure
5         Maximum volume overlap of the two structures
6         Overlap between pharmacophore and exclusion spheres in the reference
7         Corrected volume overlap between database pharmacophore and reference
8         Number of pharmacophore points in the processed pharmacophore
9         TANIMOTO score
10        TVERSKY_REF score
11        TVERSKY_DB score
------    ---------------------------------------------------------------------
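The column layout above can be turned into a small parser that gives each field a name. The sketch assumes tab-separated values with exactly these eleven columns; the short column names are chosen here for readability, and the example row contains invented numbers.

```python
COLUMNS = [
    "ref_id",             # 1  Id of the reference structure
    "ref_volume",         # 2  Maximum volume of the reference structure
    "db_id",              # 3  Id of the database structure
    "db_volume",          # 4  Maximum volume of the database structure
    "overlap_volume",     # 5  Maximum volume overlap of the two structures
    "exclusion_overlap",  # 6  Overlap with exclusion spheres in the reference
    "corrected_overlap",  # 7  Corrected volume overlap
    "n_points",           # 8  Number of pharmacophore points
    "tanimoto",           # 9  TANIMOTO score
    "tversky_ref",        # 10 TVERSKY_REF score
    "tversky_db",         # 11 TVERSKY_DB score
]
TEXT_COLUMNS = {"ref_id", "db_id"}


def parse_score_row(line: str) -> dict:
    """Convert one tab-separated score line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    for key in COLUMNS:
        if key not in TEXT_COLUMNS:
            row[key] = float(row[key])
    return row


# Invented example row (values are not real results).
example_line = "4mvf-STU-CDPK2_PLAFK\t10.0\tstaurosporine\t20.0\t8.0\t0.5\t7.5\t6\t0.33\t0.75\t0.375\n"
example_row = parse_score_row(example_line)
```

In this layout the TVERSKY_REF score sits at position 10, which is why the sorting step operates on column c10.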

Hands-on 9: Sort Dataset

Sort Tool: sort1 with the following parameters:

param-file “Sort Dataset”: concatenated_scores (from Hands-on 8)

“on column”: c10

Rename the output to final_target_prediction_scores.

Click on the pencil icon for the dataset to edit its attributes

In the central panel, change the Name field to final_target_prediction_scores

Click the Save button

You can view the contents of the collection final_target_prediction_scores by pressing the eye icon (View data).
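The ranking itself amounts to a numerical sort on the tenth column, with the highest score first. A minimal Python sketch of this step is shown below; the rows are invented placeholders (all columns other than the identifier and TVERSKY_REF are set to zero), and the helper names are hypothetical.

```python
from typing import List


def rank_by_column(lines: List[str], column: int = 10, descending: bool = True) -> List[List[str]]:
    """Sort tab-separated rows numerically on a 1-based column index."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    return sorted(rows, key=lambda row: float(row[column - 1]), reverse=descending)


def demo_row(ref_id: str, tversky_ref: float) -> str:
    """Build an 11-column placeholder row; only columns 1, 3 and 10 carry meaning."""
    fields = [ref_id, "0", "staurosporine", "0", "0", "0", "0", "0", "0", str(tversky_ref), "0"]
    return "\t".join(fields)


# Invented scores for three hypothetical targets.
demo_lines = [demo_row("target_A", 0.41), demo_row("target_B", 0.73), demo_row("target_C", 0.05)]
ranked = rank_by_column(demo_lines)
top_target = ranked[0][0]
```

The values are converted to floating-point numbers before sorting so that the order is numerical rather than alphabetical.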

The top-ranked protein of our target prediction experiment is 4mvf-STU-CDPK2_PLAFK (Figures 1 & 2), with a Tversky index of 0.73. The general observation that can be made from this ranking of protein hits is the high self-retrieval rate of known targets, which supports the prediction accuracy of the method. The higher the Tversky index, the higher the likelihood of the predicted protein–ligand interaction. Note, however, that the index is a similarity score between 0 and 1 rather than a calibrated probability, so a value of 0.5 should not be read as a 50% likelihood.

Question

Why was a perfect pharmacophore alignment (Tversky index = 1) not achieved for the top-ranked protein target for which the cocrystallized ligand is staurosporine (STU)?

A perfect pharmacophore alignment was not achieved because a computational conformer generator (here RDConf in Hands-on 6) is unlikely to be able to reproduce a crystallographic (native) ligand pose with 100% accuracy.

One-step Zauberkugel workflow vs. multi-step workflow

For pharmacophore-based protein target prediction, you can choose to use Galaxy tools separately and in succession as described above, or alternatively use the one-step Zauberkugel workflow as described below (Figure 3).

Hands-on: Upload the Zauberkugel workflow

Upload the Zauberkugel workflow from the following URL:
https://github.com/galaxyproject/training-material/blob/main/topics/computational-chemistry/tutorials/zauberkugel/workflows/main_workflow.ga
Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.

Click on the upload icon at the top-right of the screen

Provide your workflow

Option 1: Paste the URL of the workflow into the box labelled “Archived Workflow URL”

Option 2: Upload the workflow file in the box labelled “Archived Workflow File”

Click the Import workflow button

The Zauberkugel workflow requires only two inputs: the ligand structure file (SMI format) and the ePharmaLib dataset (PHAR format). The output of the prediction of human targets of staurosporine performed with the ePharmaLib human target subset (https://zenodo.org/record/6055897) and this workflow is available as a Galaxy history.

Figure 3: Snapshot of the Zauberkugel workflow for protein target prediction of a bioactive ligand with Align-it and ePharmaLib.

Further analysis

To obtain a docking pose of a protein–ligand interaction predicted from pharmacophore-based protein target prediction, follow the Protein–ligand docking Galaxy training.

Conclusion

Key points

A pharmacophore is an abstract description of the molecular features of a bioactive ligand.

Pharmacophore-based target prediction is an efficient and cost-effective method.

Frequently Asked Questions

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Computational chemistry topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum.

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

Ehrlich, P., 1909 Über den jetzigen Stand der Chemotherapie. Berichte der deutschen chemischen Gesellschaft 42: 17–47. 10.1002/cber.19090420105
Wermuth, C. G., C. R. Ganellin, P. Lindberg, and L. A. Mitscher, 1998 Glossary of terms used in medicinal chemistry (IUPAC Recommendations 1998). Pure and Applied Chemistry 70: 1129–1143. 10.1351/pac199870051129
Steindl, T. M., D. Schuster, C. Laggner, and T. Langer, 2006 Parallel Screening: A Novel Concept in Pharmacophore Modeling and Virtual Screening. Journal of Chemical Information and Modeling 46: 2146–2157. 10.1021/ci6002043
Taminau, J., G. Thijs, and H. D. Winter, 2008 Pharao: Pharmacophore alignment and optimization. Journal of Molecular Graphics and Modelling 27: 161–169. 10.1016/j.jmgm.2008.04.003
Nakano, H., and S. Ōmura, 2009 Chemical biology of natural indolocarbazole products: 30 years since the discovery of staurosporine. The Journal of Antibiotics 62: 17–26. 10.1038/ja.2008.4
O’Boyle, N. M., M. Banck, C. A. James, C. Morley, T. Vandermeersch et al., 2011 Open Babel: An open chemical toolbox. Journal of Cheminformatics 3: 10.1186/1758-2946-3-33
Landrum, G., and others, 2013 RDKit: A software suite for cheminformatics, computational chemistry, and predictive modeling.
Desaphy, J., G. Bret, D. Rognan, and E. Kellenberger, 2014 sc-PDB: a 3D-database of ligandable binding sites—10 years on. Nucleic Acids Research 43: D399–D404. 10.1093/nar/gku928
Moumbock, A. F. A., J. Li, P. Mishra, M. Gao, and S. Günther, 2019 Current computational methods for predicting protein interactions of natural products. Computational and Structural Biotechnology Journal 17: 1367–1376. 10.1016/j.csbj.2019.08.008
Moumbock, A. F. A., J. Li, H. T. T. Tran, R. Hinkelmann, E. Lamy et al., 2021 ePharmaLib: A Versatile Library of e-Pharmacophores to Address Small-Molecule (Poly-)Pharmacology. Journal of Chemical Information and Modeling 61: 3659–3666. 10.1021/acs.jcim.1c00135
Koes, D. RDConf: Low-energy ligand conformer search. https://github.com/dkoes/rdkit-scripts


Citing this Tutorial

Aurélien F. A. Moumbock, Simon Bray, 2022 Protein target prediction of a bioactive ligand with Align-it and ePharmaLib (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/computational-chemistry/tutorials/zauberkugel/tutorial.html Online; accessed TODAY
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012

@misc{computational-chemistry-zauberkugel,
author = "Aurélien F. A. Moumbock and Simon Bray",
title = "Protein target prediction of a bioactive ligand with Align-it and ePharmaLib (Galaxy Training Materials)",
year = "2022",
month = "10",
day = "18",
url = "\url{https://training.galaxyproject.org/training-material/topics/computational-chemistry/tutorials/zauberkugel/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                   

Congratulations on successfully completing this tutorial!