Peptide Library Data Analysis

Overview
Questions:
  • How to utilize quantitative properties of amino acids and peptide sequence to analyse peptide data?

Objectives:
  • Calculate descriptors

  • Quantitative analysis of peptide sequence properties

Requirements:
Time estimation: 20 minutes
Level: Intermediate Intermediate
Supporting Materials:
Last modification: Oct 18, 2022
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License The GTN Framework is licensed under MIT

Introduction

Several computational methods have been proven very useful in the initial screening and prediction of peptides for various biological properties. These methods have emerged as effective alternatives to the lengthy and expensive traditional experimental approaches. Properties associated with a group of peptide sequences such as overall charge, hydrophobicity profile, or k-mer composition can be utilized to compare peptide sequences and libraries. In this tutorial, we will be discussing how peptide-based properties like charge, hydrophobicity, the composition of amino acids, etc. can be utilized to analyze the biological properties of peptides. Additionally, we will learn how to use different utilities of the Peptide Design and Analysis Under Galaxy (PDAUG) package to calculate various peptide-based descriptors, and use these descriptors and feature spaces to build informative plots.

Easy access to tools, workflows and data from the docker image

An easy way to install and use the PDAUG toolset, and follow this tutorial is via a prebuilt docker image equipped with a PDAUG toolset, workflow, and data library. A prebuilds docker image can be downloaded and run by typing a simple command at the terminal after installing docker software on any operating system.

Hands-on: Easy access of tools, workflows and data from docker image
  1. Downloading the docker image from the docker hub using docker pull jayadevjoshi12/galaxy_pdaug:latest command.
  2. Running the container with latest PDAUG tools docker run -i -t -p 8080:80 jayadevjoshi12/galaxy_pdaug:latest.
  3. Workflow is available under the workflow section, use admin as username and password as a password to login as an administrator of your galaxy instance.
  4. Use admin as username and password as a password to login galaxy instance, which is available at localhost to access workflow and data.
Agenda

In this tutorial, we will cover:

  1. Introduction
    1. Peptide Data
    2. Converting tabular data into fasta format
    3. Analyzing peptide libraries (AMPs and TMPs) based on features and feature space
    4. Assessing the relation between peptide features by 3D scatter plot
  2. Conclusion

Peptide Data

Several inbuilt data sets have been provided with the toolPDAUG Peptide Data Access. The antimicrobial peptides (AMPs) versus transmembrane peptides (TMPs) dataset was used as an example data set to understand the overall relation between features and biological properties of peptides. AMPs consist of an intersection of all activity annotations of the APD2 and CAMP databases, where gram-positive, gram-negative, and antifungal exact matches were observed. TMPs were extracted from alpha-helical transmembrane regions of proteins for classification.

Hands-on: Fetching inbuild data
  1. PDAUG Peptide Data Access Tool: toolshed.g2.bx.psu.edu/repos/jay/pdaug_peptide_data_access/pdaug_peptide_data_access/0.1.0 with the following parameters:
    • “Datasets”: AMPvsTMP

Converting tabular data into fasta format

PDAUG Peptide Data Access tool returns data as a tabular file that contains sequences from both the classes. In order to utilize this data in the next steps, first we need to convert tabular data into fasta format. If data contains sequences from two different classes PDAUG TSVtoFASTA tool converts and splits data into two separate files for each of the class, AMPs, and TMPs. The reason behind converting and splitting the data is that all the downstream tools require two separate files if we are comparing two different peptide classes or calculating features.

Hands-on: Converting tabular data into fasta formate
  1. PDAUG TSVtoFASTA Tool: toolshed.g2.bx.psu.edu/repos/jay/pdaug_tsvtofasta/pdaug_tsvtofasta/0.1.0 with the following parameters:
    • param-file “Input file”: PDAUG Peptide Data Access - AMPvsTMP (tabular) (output of PDAUG Peptide Data Access tool)
    • “Peptide Column”: name
    • “Method to convert data”: Split Data By Class Label
    • “Column with the class label”: class label

Analyzing peptide libraries (AMPs and TMPs) based on features and feature space

Summary Plot for peptide libraries

In this step, we utilize PDAUG Peptide Sequence Analysis tool to compare peptide sequences based on hydrophobicity, hydrophobic movement, charge, amino acid fraction, and sequence length and create a summary plot.

Hands-on: Generating a summary plot to assess peptide dataset
  1. PDAUG Peptide Sequence Analysis Tool: toolshed.g2.bx.psu.edu/repos/jay/pdaug_peptide_sequence_analysis/pdaug_peptide_sequence_analysis/0.1.0 with the following parameters:
    • “Analysis options”: Plot Summary
      • param-file “First input file”: PDAUG TSVtoFASTA on data 1 - first (fasta) (first output of PDAUG TSVtoFASTA tool)
      • param-file “Second input file”: PDAUG TSVtoFASTA on data 1 - second (fasta) (second output of PDAUG TSVtoFASTA tool)
      • “first input file”: TMPs
      • “Second input file”: AMPs
Question

What can be concluded from the summary plot based on different properties?

The summary plot represents differences between two sets of peptides based on an amino acid fraction, global charge, sequence length, global hydrophobicity, glocal hydrophobic movement. Additionally, 3D scattered plot shows the clustering of peptides based on three features.

  1. Leucine and Valine show relatively higher differences in terms of their fraction within both groups.
  2. TMPs show a global charge in the range of 0-5 in comparison to AMPs which show a global charge in a range of 0-14.
  3. AMPs show higher variability in terms of their length, global hydrophobic movement, and hydrophobicity in comparison to TMPs.
  4. Hydrophobic properties are important in determining transmembrane properties of proteins and peptides which is evident with this summary plot.
  5. Clustering of two different kinds of peptides can be observed with a 3D scattered plot based on their properties, however, we can also observe a few peptides with overlapping feature space.
Summary Plot.
Figure 1: Summary plot shows comparison between AMPs and TMPs

Assessing feature space distribution

In this tool, we have used PDAUG Fisher's Plot that compares two peptide libraries based on the feature space using the Fisher test.

Hands-on: Generating a Fisher's plot to assess peptide dataset
  1. PDAUG Fisher’s Plot Tool: toolshed.g2.bx.psu.edu/repos/jay/pdaug_fishers_plot/pdaug_fishers_plot/0.1.0 with the following parameters:
    • param-file “First fasta file”: PDAUG TSVtoFASTA on data 1 - first (fasta) (first output of PDAUG TSVtoFASTA tool)
    • param-file “Second fasta file”: PDAUG TSVtoFASTA on data 1 - second (fasta) (second output of PDAUG TSVtoFASTA tool)
    • “Label for first population”: TMPs
    • “Label for second population”: AMPs
Question

What does Fisher’s plot represents?

Fisher’s plot represents the difference between two groups of peptides based on their feature space. Each tiny square in this plot represents the feature space. Based on the sliding window Fisher’s test was performed for each feature space to assess the presence of peptides from two different groups on each of the tiny squares. The AMPs and TMPs in the feature space represented by their mean hydropathy and amino acid volume. Fisher’s plot shows that the sequences with larger hydrophobic amino acids are more frequent in TMPs in comparison to AMPs.

Fishers plot.
Figure 2: TMPs peptides show amino acids with larger hydrophobic residues in compare to AMPs

The AMPs and TMPs in the feature space represented by their mean hydropathy and amino acid volume. Fisher’s plot shows that the sequences with larger hydrophobic amino acids are more frequent in TMPs in comparison to AMPs.

Assessing the relation between peptide features by 3D scatter plot

Calculating Sequence Property-Based Descriptors

In this step we will calculate Composition, Transition and Distribution (CTD) descriptos. Composition describptors are defined as the number of amino acids of a particular property divided by total number of amino acids. Transition descriptors are representd as the number of transition from a particular property to different property divided by (total number of amino acids − 1). Distribution descriptors are derived by chain length and the amino acids of a particular property located on this length Govindan and Nair 2013.

Hands-on: Calculating descriptors for the peptide dataset
  1. PDAUG Sequence Property Based Descriptors Tool: toolshed.g2.bx.psu.edu/repos/jay/pdaug_sequence_property_based_descriptors/pdaug_sequence_property_based_descriptors/0.1.0 with the following parameters:
    • param-file “Input fasta file”: PDAUG TSVtoFASTA on data 1 - first (fasta) (first output of PDAUG TSVtoFASTA tool)
    • “Descriptor Type”: CTD
  2. PDAUG Sequence Property Based Descriptors Tool: toolshed.g2.bx.psu.edu/repos/jay/pdaug_sequence_property_based_descriptors/pdaug_sequence_property_based_descriptors/0.1.0 with the following parameters:
    • param-file “Input fasta file”: PDAUG TSVtoFASTA on data 1 - second (fasta) (second output of PDAUG TSVtoFASTA tool)
    • “Descriptor Type”: CTD

Adding the Class Label in both AMPs and TMPs

Class labels or target labels usually represents the class of peptides. Here in our data set, we have peptides, either as AMP or TMP. Since we have two classes we can represent these two classes with their actual labels AMPs and TMPs.

  • Adding Class Label (target labels) in AMPs and TMPs data
Hands-on: Adding Class Labels (target labels) to the tabular data
  1. PDAUG Add Class Label Tool: toolshed.g2.bx.psu.edu/repos/jay/pdaug_addclasslabel/pdaug_addclasslabel/0.1.0 with the following parameters:
    • param-file “Input file”: PDAUG Sequence Property Based Descriptors on data 2 - CTD (tabular) (output of PDAUG Sequence Property Based Descriptors tool)
    • “Class Label”: TMPs
  2. PDAUG Add Class Label Tool: toolshed.g2.bx.psu.edu/repos/jay/pdaug_addclasslabel/pdaug_addclasslabel/0.1.0 with the following parameters:
    • param-file “Input file”: PDAUG Sequence Property Based Descriptors on data 3 - CTD (tabular) (output of PDAUG Sequence Property Based Descriptors tool)
    • “Class Label”: AMPs

Merging the two tabular data files

We utilize PDAUG Merge Dataframes to merge two tabular data files.

Hands-on: Merging two tabular data files
  1. PDAUG Merge Dataframes Tool: toolshed.g2.bx.psu.edu/repos/jay/pdaug_merge_dataframes/pdaug_merge_dataframes/0.1.0 with the following parameters:
    • param-files “Input files”: PDAUG Add Class Label on data 6 - (tabular) (output of PDAUG Add Class Label tool), PDAUG Add Class Label on data 7 - (tabular) (output of PDAUG Add Class Label tool)
    • “Option to merg data”: Merge data without adding class label

Plotting CTD descriptor data as Scatter plot

Tool PDAUG Basic Plots will be used to compare two peptide libraries based on three CTD descriptors SecondaryStrD1100, SolventAccessibilityD2001, and NormalizedVDWVD3050 respectively. A 3D scatter plot will be generated.

Hands-on: Generating a scatter plot to assess features
  1. PDAUG Basic Plots Tool: toolshed.g2.bx.psu.edu/repos/jay/pdaug_basic_plots/pdaug_basic_plots/0.1.0 with the following parameters:
    • “Data plotting method”: Scatter Plot
      • param-file “Input file”: PDAUG Merge Dataframes on data 9 and data 8 - (tabular) (output of PDAUG Merge Dataframes tool)
      • “Scatter Plot type”: 3D
        • “First feature”: _SecondaryStrD1100
        • “Second feature”: _SolventAccessibilityD2001
        • “Third feature”: _NormalizedVDWVD3050
      • “Class label column”: Class_label
3D Scatter plot .
Figure 3: 3D scatter Plot shows relation between featues

Figure 3 Represent 3D scattered plot generated based on the CTD descriptors. Red dots represent TMPs and blue dots represent AMPs. Based on these 3 features, we can observe that both groups do not show any clear separation or cluster in the 3D space.

In this tutorial, we learned how to utilize inbuild data, calculate features, and utilize descriptors or features to assess biological properties. We also learned how to utilize various utilities of PDAUG to generate useful plots to include in our peptide research.

Conclusion

In this tutorial, we learned an example flexible and extensible analysis of peptide data using PDAUG tools. We generated various plots based on the quantitative properties of amino acids and peptide sequences.

Workflow.
Figure 4: Workflow used

Frequently Asked Questions

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Proteomics topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

  1. Govindan, G., and A. S. Nair, 2013 Bagging with CTD – A Novel Signature for the Hierarchical Prediction of Secreted Protein Trafficking in Eukaryotes. Genomics, Proteomics & Bioinformatics 11: 385–390. 10.1016/j.gpb.2013.07.005

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Click here to load Google feedback frame

Citing this Tutorial

  1. Jayadev Joshi, Daniel Blankenberg, 2022 Peptide Library Data Analysis (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/peptide-library-data-analysis/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012


@misc{proteomics-peptide-library-data-analysis,
author = "Jayadev Joshi and Daniel Blankenberg",
title = "Peptide Library Data Analysis (Galaxy Training Materials)",
year = "2022",
month = "10",
day = "18"
url = "\url{https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/peptide-library-data-analysis/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                   

Congratulations on successfully completing this tutorial!