Peptide and Protein ID using SearchGUI and PeptideShaker
Overview
Questions:
How to convert LC-MS/MS raw files?
How to identify peptides?
How to identify proteins?
How to evaluate the results?
Objectives:
Protein identification from LC-MS/MS raw files.
Time estimation: 45 minutes
Level: Introductory
Last modification: Oct 18, 2022
Introduction
Identifying the proteins contained in a sample is an important step in any proteomic experiment. However, in most experimental setups, proteins are digested to peptides before the LC-MS/MS analysis. In this so-called “bottom-up” procedure, only peptide masses are measured. Therefore, protein identification cannot be performed directly from raw data, but is a multi-step process:
- Raw data preparation
- Peptide-to-Spectrum matching
- Peptide inference
- Protein inference
A plethora of software solutions exist for each step. In this tutorial, we will show how to use the ProteoWizard tool msconvert and the OpenMS tool PeakPickerHiRes for step 1, and the Compomics tools SearchGUI and PeptideShaker for steps 2-4.
For an alternative identification pipeline using only tools provided by the OpenMS software suite, please consult this tutorial.
Input data
As an example dataset, we will use an LC-MS/MS analysis of HeLa cell lysate published in Vaudel et al., 2014, Proteomics. Detailed information about the dataset can be found on PRIDE. For step 2, we will use a validated human UniProt FASTA database without appended decoy sequences. If you already completed the tutorial on Database Handling, you can use the database as constructed prior to the DecoyDatabase tool step. You can find a prepared database, as well as the input proteomics data in different file formats, on Zenodo.
Preparing Raw Data
Raw data conversion is the first step of any proteomic data analysis. The most common converter is msconvert from the ProteoWizard software suite, and the usual target format is mzML. SearchGUI needs MGF format as input, but as we need the mzML format for several other tasks, we will convert to mzML first. For licensing reasons, msconvert runs only on Windows systems and will not work on most Galaxy servers.
Depending on your machine settings, raw data will be generated either in profile mode or centroid mode. For most peptide search engines, the tandem mass spectrometry (MS2) data have to be converted to centroid mode, a process called “peak picking” or “centroiding”. Machine vendors offer algorithms to extract peaks from profile raw data. These are implemented in the msconvert tool and can be run in parallel to the mzML conversion. However, the OpenMS tool PeakPickerHiRes is reported to generate slightly better results (Lange et al., 2006, Pac Symp Biocomput) and is therefore recommended for quantitative studies (Vaudel et al., 2010, Proteomics). If your data were generated on a low resolution mass spectrometer, use the PeakPickerWavelet tool instead.
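To make the idea of centroiding more concrete, here is a minimal Python sketch. It is a deliberately simple local-maximum approach with an intensity-weighted centre, and it is not the algorithm used by PeakPickerHiRes or by the vendor libraries in msconvert. The function name, the `min_intensity` parameter and the toy profile data are assumptions made for illustration only.

```python
def centroid(mz, intensity, min_intensity=0.0):
    """Reduce a profile spectrum to one (m/z, intensity) pair per local maximum."""
    peaks = []
    for i in range(1, len(mz) - 1):
        # A data point is an apex if it is higher than its left neighbour
        # and at least as high as its right neighbour.
        is_apex = intensity[i] > intensity[i - 1] and intensity[i] >= intensity[i + 1]
        if is_apex and intensity[i] >= min_intensity:
            # Intensity-weighted mean m/z over the apex and its two neighbours.
            window_mz = mz[i - 1:i + 2]
            window_int = intensity[i - 1:i + 2]
            centre = sum(m * w for m, w in zip(window_mz, window_int)) / sum(window_int)
            peaks.append((centre, intensity[i]))
    return peaks


# Hypothetical profile-mode signal describing a single symmetric peak.
profile_mz = [500.00, 500.01, 500.02, 500.03, 500.04, 500.05, 500.06]
profile_int = [0.0, 10.0, 80.0, 100.0, 80.0, 10.0, 0.0]

for peak_mz, peak_int in centroid(profile_mz, profile_int, min_intensity=50.0):
    print(f"{peak_mz:.4f}\t{peak_int:.1f}")
```

The seven profile points collapse into a single centroid, which is the kind of reduced peak list that peptide search engines expect for MS2 spectra.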
Hands-on: File Conversion and Peak Picking
We provide the input data in the original raw format and also already converted to MGF and mzML file formats. If the msconvert tool does not run on your Galaxy instance, please download the preconverted mzML as an input.
Create a new history for this Peptide and Protein ID exercise.
Click the new-history icon at the top of the history panel.
If the new-history icon is missing:
- Click on the galaxy-gear icon (History options) on the top of the history panel
- Select the option Create New from the menu
- Load the example dataset into your history from Zenodo: raw mzML
- Rename the dataset to something meaningful.
- (optional) Run the msconvert tool on the test data to convert to the mzML format.
- Run the PeakPickerHiRes tool on the resulting file. Click + Insert param.algorithm_ms_levels and change the entry to “2”. Thus, peak picking will only be performed on MS2 level.
- Run the FileConverter tool on the picked mzML. In the Advanced Options set the Output file type to MGF.
Comment: Local Use of MSConvert
The vendor libraries used by msconvert are only licensed for Windows systems and are therefore rarely implemented in Galaxy instances. If the msconvert tool is not available in your Galaxy instance, please install the software on a Windows computer and run the conversion locally. You can find a detailed description of the necessary steps here (“Peak List Generation”). Afterwards, upload the resulting mzML file to your Galaxy history.
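For readers who have not seen the MGF format that the FileConverter step produces, the following Python sketch writes one MS2 spectrum as an MGF block. It only illustrates the general layout of the format (a `BEGIN IONS`/`END IONS` pair around header lines and m/z-intensity pairs). The function name, the spectrum title and the peak values are hypothetical, and files written by FileConverter may contain additional header fields.

```python
def to_mgf(title, precursor_mz, charge, peaks):
    """Format a single centroided MS2 spectrum as an MGF text block."""
    lines = [
        "BEGIN IONS",
        f"TITLE={title}",
        f"PEPMASS={precursor_mz:.4f}",  # precursor m/z
        f"CHARGE={charge}+",            # precursor charge state
    ]
    # One fragment peak per line: m/z, then intensity.
    lines += [f"{mz:.4f} {intensity:.1f}" for mz, intensity in peaks]
    lines.append("END IONS")
    return "\n".join(lines)


# Hypothetical spectrum with three fragment peaks.
block = to_mgf("example_scan_1", 400.6867, 2, [(148.0604, 120.0), (227.1026, 300.0), (324.1554, 80.0)])
print(block)
```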
Peptide and Protein Identification
Mass spectrometry experiments identify peptides by isolating them, ionizing them and subsequently colliding them with a gas for fragmentation. This method generates a spectrum of peptide fragment masses for each isolated peptide - an MS2 spectrum. To find out the peptide sequences, the MS2 spectrum is compared to theoretical spectra generated from a protein database. This step is called peptide-to-spectrum (also: spectrum-to-sequence) matching. Accordingly, a successful match between a spectrum and a peptide sequence is termed a PSM (Peptide-Spectrum-Match). There can be multiple PSMs per peptide, if the peptide was fragmented several times. Different peptide search engines have been developed to perform the matching procedure.
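The matching principle can be sketched in a few lines of Python. The example below computes singly charged b- and y-ion m/z values for a candidate peptide from standard monoisotopic residue masses and counts how many of them are found in an observed peak list. This is only a toy illustration: real search engines such as X!Tandem or MS-GF+ use far more elaborate scoring, and the function names, the tolerance and the observed peak list are assumptions made here.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_mzs(peptide):
    """Singly charged b- and y-ion m/z values of an unmodified peptide."""
    masses = [MONO[aa] for aa in peptide]
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y_ions = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b_ions + y_ions


def count_matches(observed_mz, peptide, tol=0.02):
    """Number of theoretical fragments with an observed peak within tol (in m/z)."""
    return sum(
        any(abs(obs - theo) <= tol for obs in observed_mz)
        for theo in fragment_mzs(peptide)
    )


# Hypothetical centroided MS2 peak list and two candidate sequences.
observed = [148.060, 227.103, 324.155, 300.000]
for candidate in ("PEPTIDE", "PEPTIDK"):
    print(candidate, count_matches(observed, candidate))
```

Here the first candidate explains three of the observed peaks and the second only two, so the first would be the better peptide-spectrum match under this simple count.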
It is generally recommended to use more than one peptide search engine and use the combined results for the final peptide inference (Shteynberg et al., 2013, Mol. Cell. Proteomics). Again, there are several software solutions for this, e.g. iProphet (TPP) or ConsensusID (OpenMS). In this tutorial we will use the Search GUI tool, as it can automatically search the data using several search engines. Its partner tool, Peptide Shaker, is then used to combine and evaluate the search engine results.
In bottom-up proteomics, it is necessary to combine the identified peptides to proteins. This is not a trivial task, as proteins are redundant in most eukaryotic organisms. Thus, not every peptide can be assigned to only one protein. Luckily, the Peptide Shaker tool already takes care of protein inference and even gives us some information on validity of the protein identifications. We will discuss validation in a later step of this tutorial.
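Why shared peptides complicate protein inference can be shown with a small Python sketch. It separates peptides that map to exactly one protein from those that map to several. This is not how Peptide Shaker performs protein inference internally; the function name, the peptide labels and the protein labels are all hypothetical.

```python
from collections import defaultdict


def group_peptides(peptide_to_proteins):
    """Split the peptide evidence of each protein into unique and shared peptides."""
    unique = defaultdict(list)
    shared = defaultdict(list)
    for peptide, proteins in peptide_to_proteins.items():
        # A peptide is unique if it occurs in exactly one database protein.
        target = unique if len(proteins) == 1 else shared
        for protein in proteins:
            target[protein].append(peptide)
    return dict(unique), dict(shared)


# Hypothetical mapping of identified peptides to database proteins.
example = {
    "PEP_A": ["PROT_1"],
    "PEP_B": ["PROT_1", "PROT_2"],
    "PEP_C": ["PROT_2"],
}
unique, shared = group_peptides(example)
print("unique:", unique)
print("shared:", shared)
```

Peptides in the `shared` group cannot decide between the proteins they map to on their own, which is why a dedicated inference step is needed.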
Hands-on: Peptide and Protein Identification
- Copy the prepared protein database from the tutorial Database Handling into your current history by using the multiple history view or upload the ready-made database from this link.
- Open the Search GUI tool to search the MGF file against the protein database. In the Search Engine Options select X!Tandem and MS-GF+. In the Protein Modification Options add the Fixed Modifications: Carbamidomethylation of C and the Variable Modifications: Oxidation of M.
- Run the Peptide Shaker tool on the Search GUI output. Enable the following outputs: Zip File for import to Desktop App, mzidentML File, PSM Report, Peptide Report, Protein Report.
Comment: Search GUI Parameters
We ran the Search GUI tool with default settings. When you are processing files of a different experiment, you may need to adjust some of the parameters. Search GUI bundles numerous peptide search engines for matching MS/MS spectra to peptide sequences within a database. In practice, using 2-3 different search engines offers high confidence while keeping analysis time reasonable. In our hands, the X!Tandem, MS-GF+, OMSSA and Comet search algorithms offer good results. The Precursor Options have to be adjusted to the mass spectrometer which was used to generate the files. The default settings fit a high resolution Orbitrap instrument. In the Advanced Options you may set much more detailed settings for each of the used search engines. When using X!Tandem, we recommend switching off the advanced X!Tandem options Noise suppression, Quick Pyrolidone and Quick Acetyl. When using MS-GF+, we recommend selecting the correct Instrument type.
Comment: PeptideShaker Outputs
Peptide Shaker offers a variety of outputs. The Zip File for import to Desktop App can be downloaded to view and evaluate the search results in the Peptide Shaker viewer (Download). The several Reports contain tabular, human-readable information. Also, an mzidentML (= mzid) file can be created that contains all peptide sequence matching information and can be utilized by compatible downstream software. The Certificate of Analysis provides details on all parameter settings of both Search GUI and Peptide Shaker used for the analysis.
Question
- How many peptides were identified? How many proteins?
- How many peptides with oxidized methionine were identified?
- You should have identified 3,325 peptides and 1,170 proteins.
- 328 peptides contain an oxidized methionine (MeO). To get to this number, you can use Select tool on the Peptide Report and search for either “Oxidation of M” or “M<ox>”.
Analysis of Contaminants
The FASTA database used for the peptide-to-spectrum matching contained some entries that were not expected to stem from the HeLa cell lysate, but are common contaminants in LC-MS/MS samples. The main reason to add those is to avoid assigning their spectra to other proteins by mistake. However, it also enables you to check for contamination in your samples. Caution: in human samples, many proteins that are common contaminants may also stem from the real sample. Determining the real source of such human proteins might require further investigation.
Hands-on: Analysis of Contaminants
- Run Select tool on the Peptide Shaker Protein Report to select all lines that match the pattern “CONTAMINANT”.
- Remove all contaminants from your protein list by running Select tool on the Peptide Shaker Protein Report. Select only those lines that do not match the pattern “CONTAMINANT”.
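Outside of Galaxy, the two Select steps above correspond to a simple line filter. The Python sketch below keeps lines that match, or do not match, a pattern. The helper is a stand-in written for illustration and not the Galaxy tool itself, and the report rows are invented; a real Protein Report has many more columns.

```python
import re


def select(lines, pattern, matching=True):
    """Keep the lines that match the pattern (or, if matching=False, those that do not)."""
    regex = re.compile(pattern)
    return [line for line in lines if bool(regex.search(line)) == matching]


# Hypothetical, heavily shortened rows of a tabular protein report.
report = [
    "1\tTRY_BOVIN_CONTAMINANT\tConfident",
    "2\tEXAMPLE1_HUMAN\tConfident",
    "3\tALBU_BOVIN_CONTAMINANT\tConfident",
    "4\tEXAMPLE2_HUMAN\tDoubtful",
]

contaminants = select(report, "CONTAMINANT")
cleaned = select(report, "CONTAMINANT", matching=False)
print(len(contaminants), "contaminant rows,", len(cleaned), "rows kept")
```

The same helper with a different pattern, for instance “Doubtful”, mirrors the filtering steps used later in this tutorial.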
Question
- Which contaminants did you identify? Where do these contaminations come from?
- What other sources of contaminants exist?
- How many mycoplasma proteins did you identify? Does this mean that the analyzed HeLa cells were infected with mycoplasma?
- How many false positives do we expect in our list? How many of these are expected to match mycoplasma proteins?
- TRY_BOVIN is bovine trypsin. It was used to degrade the proteins to peptides. ALBU_BOVIN is bovine serum albumin. It is added to cell culture medium in high amounts.
- Contaminants often stem from the experimenter. These are typically keratins or other highly abundant human proteins. Basically any protein present in the room of the mass spectrometer might get into the ion source, if it is airborne. As an example, sheep keratins are sometimes found in proteomic samples, stemming from clothing made of sheep wool.
- There should be five Mycoplasma proteins in your protein list. However, all of them stem from different Mycoplasma species. Also, every protein was identified by one peptide only. You can see this in columns 17-19 of your output. These observations make it quite likely that we have identified false positives here.
- As we were allowing for a false discovery rate of 1 %, we would expect about 12 false positive proteins in our list of 1,170 proteins. False positives are expected to be randomly assigned to proteins in the FASTA database. Our database consists of about 20,000 human proteins and 4,000 mycoplasma proteins. Therefore, we would expect about 17 % of all false positives (= 2) to match mycoplasma proteins.
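The estimate in the last answer can be reproduced with a few lines of Python. All numbers are taken from the text above; like the text, the calculation ignores the contaminant entries in the database, and the variable names are chosen here for illustration.

```python
proteins_identified = 1170   # proteins reported by Peptide Shaker
fdr = 0.01                   # false discovery rate of 1 %
human_entries = 20000        # approximate number of human proteins in the database
mycoplasma_entries = 4000    # approximate number of mycoplasma proteins in the database

# Expected number of false positive proteins in the list.
expected_false_positives = proteins_identified * fdr

# Share of database entries that are mycoplasma proteins.
mycoplasma_fraction = mycoplasma_entries / (human_entries + mycoplasma_entries)

# Expected number of false positives that hit a mycoplasma entry by chance.
expected_mycoplasma_hits = expected_false_positives * mycoplasma_fraction

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"mycoplasma share of database: {mycoplasma_fraction:.1%}")
print(f"expected false mycoplasma hits: {expected_mycoplasma_hits:.1f}")
```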
Evaluation of Peptide and Protein IDs
The Peptide Shaker tool provides you with validation results for the identified PSMs, peptides and proteins. It classifies all these IDs into the categories “Confident” or “Doubtful”. On each level, the meaning of these terms differs to some extent:
- PSMs are marked as “Doubtful” when the measured MS2 spectrum did not fit well to the theoretical spectrum.
- Peptides have a combined scoring of their PSMs. They are marked as “Doubtful”, when the score is below a set threshold. The threshold is defined by the false discovery rate (FDR).
- Proteins are marked as “Doubtful”, when they were identified by only a single peptide or when they were identified solely by “Doubtful” peptides.
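How a score threshold can follow from a false discovery rate is sketched below in Python, assuming a target-decoy strategy in which hits against decoy sequences are used to estimate the number of false target hits. This is a generic textbook-style illustration and not Peptide Shaker's actual statistics; the function name, the scores and the decoy flags are invented.

```python
def score_threshold(hits, max_fdr=0.01):
    """Return the lowest score cutoff at which decoys / targets stays within max_fdr.

    hits is a list of (score, is_decoy) pairs, where higher scores are better.
    Returns None if no cutoff satisfies the requested FDR.
    """
    targets = decoys = 0
    threshold = None
    for score, is_decoy in sorted(hits, key=lambda hit: hit[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among all hits scoring at least as high as this one.
        if targets and decoys / targets <= max_fdr:
            threshold = score
    return threshold


# Hypothetical peptide scores; True marks a hit against a decoy sequence.
hits = [(90, False), (85, False), (80, False), (70, True), (60, False), (50, True)]
print(score_threshold(hits, max_fdr=0.25))
```

With these toy values, accepting everything down to a score of 60 gives one decoy per four target hits, which is exactly the allowed 25 %. In this picture, hits below the threshold would be the ones marked as “Doubtful”.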
Hands-on: Evaluation of Peptide and Protein IDs
- Remove all “Doubtful” proteins from your protein list by running Select tool on the Peptide Shaker Protein Report. Select only those lines that do not match the pattern “Doubtful”.
Question
- How to exclude mycoplasma proteins?
- How many “Confident” non-contaminant proteins were identified?
- Add another Select tool matching the pattern “HUMAN”.
- You should have identified 582 human non-contaminant proteins that were validated to be “Confident”.
Premade Workflow
A premade workflow for this tutorial can be found here
Further Reading
Key points
On most Galaxy servers, LC-MS/MS raw files have to be converted locally to MGF/mzML before further analysis.
SearchGUI can be used for running several peptide search engines at once.
PeptideShaker can be used to combine and evaluate the results, and to perform protein inference.
Useful literature
Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.
Citing this Tutorial
- Florian Christoph Sigloch, Björn Grüning, 2022 Peptide and Protein ID using SearchGUI and PeptideShaker (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/protein-id-sg-ps/tutorial.html Online; accessed TODAY
- Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
@misc{proteomics-protein-id-sg-ps,
  author = "Florian Christoph Sigloch and Björn Grüning",
  title = "Peptide and Protein ID using SearchGUI and PeptideShaker (Galaxy Training Materials)",
  year = "2022",
  month = "10",
  day = "18",
  url = "\url{https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/protein-id-sg-ps/tutorial.html}",
  note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
  doi = {10.1016/j.cels.2018.05.012},
  url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
  year = 2018,
  month = {jun},
  publisher = {Elsevier {BV}},
  volume = {6},
  number = {6},
  pages = {752--758.e1},
  author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
  title = {Community-Driven Data Analysis Training for Biology},
  journal = {Cell Systems}
}