Removal of human reads from SARS-CoV-2 sequencing data

Authors: AvatarWolfgang Maier
Overview
Questions:
  • How can you remove traces of human sequence data from a sequenced viral sample?

Objectives:
  • Obtain viral (SARS-CoV-2) sequencing data with contaminating human reads from public sources

  • Organize the data into a collection

  • Preprocess and map the data against the human reference genome

  • Eliminate reads/read pairs that map to the human genome from the original data

Requirements:
Time estimation: 1 hour
Level: Intermediate Intermediate
Supporting Materials:
Last modification: Oct 18, 2022
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License The GTN Framework is licensed under MIT

Introduction

Patient samples for pathogen detection are usually “contaminated” with human host DNA. Such contamination, if not removed from sequencing data, may pose an issue with certain types of data analyses. Another issue is that submitting the raw sequenced reads with the human sequence “contamination” included to a public database may simply violate national or institutional regulations for handling patient data. For this second reason in particular, researchers will regularly have to remove traces of human sequencing data even from samples that, by their nature, should be highly enriched for just pathogen reads, as is the case, for example, for ampliconic viral samples, in which viral sequences have been PCR-amplified with virus-specific primers before sequencing.

This tutorial will guide you through a typical workflow for clearing human sequences from any kind (ampliconic or not) of viral sequenced sample, which retains non-human reads in a format ready to be submitted to public databases.

Comment: Nature of the input data

We will use sequencing data of bronchoalveolar lavage fluid (BALF) samples obtained from early COVID-19 patients in China as our input data. Since such samples are expected to be contaminated signficantly with human sequenced reads, the effect of the cleaning steps will be much more apparent than for ampliconic sequencing data from diagnostic patient swabs. Nevertheless the cleaning processs does not rely on any particular sample isolation or preprcocessing method and could be used unaltered on, for example, ARTIC amplified SARS-CoV-2 samples.

Agenda

In this tutorial, we will deal with:

  1. Introduction
  2. Get data
    1. Getting the sequencing data
  3. Read trimming and mapping
  4. Obtaining the identifiers of non-human read pairs
  5. Use the non-human read identifiers to extract the reads of interest from the original inputs
  6. Conclusion

Get data

As always, it is best to give each analysis you are performing with Galaxy its own history, so let’s start with preparing this:

Hands-on: Prepare new history
  1. Create a new history for this tutorial and give it a proper name

    Click the new-history icon at the top of the history panel.

    If the new-history is missing:

    1. Click on the galaxy-gear icon (History options) on the top of the history panel
    2. Select the option Create New from the menu
    1. Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
    2. Type the new name
    3. Press Enter

Getting the sequencing data

The sequenced reads datasets used in this tutorial (4 files representing 2 Illumina paired-end sequenced samples) have been deposited on Zenodo and can be uploaded to Galaxy via their URLs. After that, we will arrange the uploaded data into a collection in our Galaxy history to facilitate its handling in the analysis.

Hands-on: Data upload and rearrangement
  1. Upload the data from Zenodo via URLs

    • Copy the link location
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    The URLs for our example data are these:

    https://zenodo.org/record/3732359/files/SRR10903401_r1.fq.gz
    https://zenodo.org/record/3732359/files/SRR10903401_r2.fq.gz
    https://zenodo.org/record/3732359/files/SRR10903402_r1.fq.gz
    https://zenodo.org/record/3732359/files/SRR10903402_r2.fq.gz
    
  2. Arrange the data into a paired dataset collection

    • Click on Operations on multiple datasets (check box icon) at the top of the history panel Operations on multiple datasets button
    • Check all the datasets in your history you would like to include
    • Click For all selected.. and choose Build List of Dataset Pairs

    • Change the text of unpaired forward to a common selector for the forward reads
    • Change the text of unpaired reverse to a common selector for the reverse reads
    • Click Pair these datasets for each valid forward and reverse pair.
    • Enter a name for your collection
    • Click Create List to build your collection
    • Click on the checkmark icon at the top of your history again

    For the example datasets this means:

    • You need to tell Galaxy about the suffix for your forward and reverse reads, respectively:
      • set the text of unpaired forward to: _r1.fq.gz
      • set the text of unpaired reverse to: _r2.fq.gz
      • click: Auto-pair

      All datasets should now be moved to the paired section of the dialog, and the middle column there should show that only the sample accession numbers, i.e. SRR10903401 and SRR10903402, will be used as the pair names.

    • Make sure Hide original elements is checked to obtain a cleaned-up history after building the collection.
    • Click Create Collection

Read trimming and mapping

In order to learn which of the reads of each of the input samples are of likely human origin, we need to map the reads to the human reference genome. To improve the quality of this step we will also trim low-quality bases from the ends of all reads before passing them to the read mapping software.

Comment

Do you want to learn more about the principles behind quality control (including trimming) and mapping? Follow our dedicated tutorials:

Hands-on: Trim reads and map them to the human reference genome
  1. Trimmomatic Tool: toolshed.g2.bx.psu.edu/repos/pjbriggs/trimmomatic/trimmomatic/0.38.0 with default settings and:
    • “Single-end or paired-end reads?”: Paired-end (as collection)
      • “Select FASTQ dataset collection with R1/R2 pair”: the collection of sequenced reads input datasets
  2. Map with BWA-MEM Tool: toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.2 with the following parameters:
    • “Will you select a reference genome from your history or use a built-in index?”: Use a built-in genome index
      • “Using reference genome”: Human Dec. 2013 (GRCh38/hg38)(hg38)

        The exact name of this version of the human reference genome may vary between Galaxy servers. Start typing hg38 in the select box to reveal available choices on your server.

    • “Single or Paired-end reads: Paired Collection
      • “Select a paired collection”: the paired output of Trimmomatic

Obtaining the identifiers of non-human read pairs

The mapping step above has produced a collection of BAM datasets that store, for each sample, which of its reads map to the human genome (and if so to which position on the genome though we are not interested in this information), and which ones could not be mapped to it. It is these unmapped reads that we are interested in, but how do we get at them?

The logic of the following steps is:

  1. Filter the BAM dataset for each sample for only those reads that could not be mapped to the human reference genome, and for which also the read mate in the pair could not be mapped to this genome.

    In other words, if only one read of a read pair can be mapped to the human genome we play it safe and discard the pair.

  2. Emit the remaining reads in a format (we choose FASTA here) from which it is easy to extract the read identifiers

  3. Extract the read identifiers

Hands-on: Filter for read pairs not mapping to the human genome and extract their identifiers
  1. Samtools fastx Tool: toolshed.g2.bx.psu.edu/repos/iuc/samtools_fastx/samtools_fastx/1.9+galaxy1 with the following parameter settings:
    • param-collection “BAM or SAM file to convert”: the output of Map with BWA-MEM
    • “Output format”: FASTA
    • “outputs”: select READ1

      With this setting we are choosing to retrieve only the forward reads of each pair as a single result dataset. Since the forward and reverse reads of a pair share their read identifier, we do not need the reverse reads.

    • “Omit or append read numbers”: Do not append /1 and /2 to read names

      Adding this first or second read of the pair indicator to the read name would make the result dependent on which half of the reads we retrieve. We do not want this.

    • “Require that these flags are set”:
      • read is unmapped
      • mate is unmapped

      This setting makes sure we retain only read pairs for which none of the two reads could be mapped to the human genome.

  2. Select lines that match an expression Tool: Grep1 with the following parameters:
    • param-collection “Select lines from”: the output of Samtools fastx
    • “that”: Matching
    • “the pattern”: ^>.+

    This will select only those lines from the FASTA inputs that start with a > character, i.e. will retain the identifier lines, but not the sequence lines.

  3. Replace Text in entire line Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/1.1.2 with the following parameters:
    • param-collection “File to process”: the output of Select
    • param-repeat “Replacement”:
      • “Find Pattern”: ^>(.+)
      • “Replace with”: \1

      This find and replacement pattern will work on lines starting with a > character and replace each such line with the content following the > character (the (.+) part that gets referenced as \1 in the replacement pattern).

Use the non-human read identifiers to extract the reads of interest from the original inputs

We are almost there! We already know the identifiers of all reads we would like to retain, we only need a way to filter the original input datasets for these identifiers.

Comment: Why not use the unmapped reads directly?

When you followed the previous steps you may have noticed that Samtools fastx offers fastq as an output format. Since we used this tool to report the unmapped reads, you may be wondering why we did not simply ask the tool to write the reads in the format we want and be done with it.

The answer is: we could have done that (and some people follow this approach), but remember that we trimmed the reads before mapping them, which means the extracted unmapped reads would not be identical to the input reads.

It is good practice to submit sequencing data to public databases in as raw a state as possible to leave it up to the people downloading that data to apply processing steps as they see fit.

The seqtk_subseq tool offfers the identifier-based filtering functionality we are looking for. As a slight complication, however, the tool is not prepared to handle paired read collections like the one we organized our input data into. For this reason, we need to first unzip our collection into two simple list collections - one with the forward reads, the other one with the reverse reads of all samples. Then, after processing both with seqtk_subseq, we zip the filtered collections back into one.

Hands-on: Identifier-based extraction of non-human reads from the input data
  1. Unzip Collection Tool: UNZIP_COLLECTION with the following setting:
    • “Input Paired Dataset”: the collection of original sequenced reads input datasets
  2. seqtk_subseq Tool: toolshed.g2.bx.psu.edu/repos/iuc/seqtk/seqtk_subseq/1.3.1 with the following parameters:
    • param-collection “Input FASTA/Q file”: the first (forward reads) output of Unzip Collection
    • “Select source of sequence choices”: FASTA/Q ID list
      • param-collection the collection of read identifiers; output of Replace Text
  3. seqtk_subseq Tool: toolshed.g2.bx.psu.edu/repos/iuc/seqtk/seqtk_subseq/1.3.1 with the following parameters:
    • param-collection “Input FASTA/Q file”: the second (reverse reads) output of Unzip Collection
    • “Select source of sequence choices”: FASTA/Q ID list
      • param-collection the collection of read identifiers; output of Replace Text
  4. Zip Collection Tool: ZIP_COLLECTION with the following settings:
    • param-collection “Input Dataset (Forward)”: the collection of cleaned forward reads; output of the first run of seqtk_subseq
    • param-collection “Input Dataset (Reverse)”: the collection of cleaned reverse reads; output of the second run of seqtk_subseq

Conclusion

Data cleaning is a standard procedure for clinical sequencing datasets and is a rather straightforward process in Galaxy. You just learnt how to remove human reads from any number of paired-end sequenced samples in parallel using collections and just a handful of tools. Cleaning of single-end data would use the same tools (with adjusted settings), but would work on data arranged in a simple list collection instead of in a paired collection.

Key points
  • Before submitting raw viral sequencing data to public databases you will want to remove human sequence traces

  • Human reads can be identified by mapping the data to the human reference genome

  • After mapping, you can extract the read identifiers of the non-human reads and use these to extract just the desired original sequenced reads from the input data

Frequently Asked Questions

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Sequence analysis topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Click here to load Google feedback frame

Citing this Tutorial

  1. Wolfgang Maier, 2022 Removal of human reads from SARS-CoV-2 sequencing data (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/human-reads-removal/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012


@misc{sequence-analysis-human-reads-removal,
author = "Wolfgang Maier",
title = "Removal of human reads from SARS-CoV-2 sequencing data (Galaxy Training Materials)",
year = "2022",
month = "10",
day = "18"
url = "\url{https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/human-reads-removal/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                   

Congratulations on successfully completing this tutorial!