Submitting sequence data to ENA

Overview
Questions:
  • How do you submit raw sequence reads and assembled genomes to the European Nucleotide Archive?

Objectives:
  • Submit raw sequencing reads and metadata to ENA’s test server

  • Submit consensus sequence and metadata to ENA’s test server

Time estimation: 1 hour
Level: Intermediate
Last modification: Oct 18, 2022
License: Tutorial content is licensed under the Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Introduction

Raw reads contain valuable information, such as coverage depth and quality scores, that is lost in a consensus sequence. Submission of raw reads to public repositories allows reuse of data and reproducibility of analysis and enables discovery of minor allelic variants and intrahost variation, for example during the recent COVID-19 pandemic (Maier et al. 2021).

The European Nucleotide Archive (ENA) is an Open and FAIR repository of nucleotide data. As part of the International Nucleotide Sequence Database Collaboration (INSDC), ENA also indexes data from the NCBI and DDBJ (Arita et al. 2020). Data submitted to ENA must be accompanied by sufficient metadata. You can learn more from this introductory slide deck or directly from ENA.

In this tutorial we will show you how to use Galaxy’s ‘ENA Upload tool’ to submit raw sequencing reads, consensus sequences and their associated metadata to ENA (Roncoroni et al. 2021). You will learn to add your ENA Webin credentials to Galaxy, input metadata interactively or via a metadata template, and submit the reads to ENA’s test server using Galaxy’s ‘ENA Upload tool’. Specifically, we will use one ONT sequencing file to demonstrate interactive metadata input and two sets of paired-end (PE) Illumina reads to demonstrate how to use the ENA metadata template. Finally, we will submit consensus sequences to ENA using the ‘Submit consensus sequence to ENA’ tool.

Data will be submitted to ENA’s test server and will not be public.

Comment: Nature of the input data

We will use data derived from sequencing data of bronchoalveolar lavage fluid (BALF) samples obtained from early COVID-19 patients in China as our input data. Human traces have been removed in Galaxy.

Agenda

In this tutorial, we will cover:

  1. Introduction
  2. Adding ENA Webin credentials to your Galaxy user information
  3. Submitting raw sequence data (reads) to the ENA
    1. Option 1: submitting to ENA using interactive metadata generator
    2. Option 2: submitting to ENA using a metadata template
  4. Submitting consensus sequences to ENA

Adding ENA Webin credentials to your Galaxy user information

In order to submit data to ENA, you need to have a valid Webin account. If you don’t have one already you can register for one here. Webin credentials need to be included in your Galaxy user information before you can use the ENA Upload tool.

Hands-on: Add Webin credentials to your Galaxy user information
  1. If you have not already done so, log in to usegalaxy.eu
  2. Navigate to “User” > “Preferences” on the top menu
    • Click on Manage Information
    • Scroll down to “Your ENA Webin account details” and fill in your ENA Webin ID and Password
Figure 1: ENA Webin Account details in Galaxy

Submitting raw sequence data (reads) to the ENA

Option 1: submitting to ENA using interactive metadata generator

In this first example, you will submit one ONT sequence file using the interactive metadata forms from the ENA Upload tool. This method is only convenient for small submissions. For bulk submissions, we recommend you use the metadata template described below in Option 2.

Hands-on: Data upload
  1. Upload the ONT data from Zenodo via URLs

    • Copy the link location
    • Open the Galaxy Upload Manager (the upload icon at the top right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    The URL for our example data is this:

    https://zenodo.org/record/6912963/files/SRR10902284_ONT.fq.gz
    

Once the data is uploaded, fill in the metadata using the ENA Upload tool. The interactive metadata forms are nested to fit ENA’s metadata model: you add Samples to a Study, Experiments to Samples and Runs to Experiments. The interactive metadata form supports only two ENA sample checklists: the basic minimal sample metadata and the ENA virus pathogen reporting standard checklist. Switch between the two under the “Does your submission contains viral samples?” question. If you wish to include additional metadata from a sample checklist, please use Option 2 below.
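The nesting described above can be pictured with a small Python sketch. This is only an illustration of the Study > Sample > Experiment > Run hierarchy, not the data structure used by the ENA Upload tool; the class names, attribute names and titles are assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    files: List[str]  # read files that belong to this run


@dataclass
class Experiment:
    title: str
    platform: str
    runs: List[Run] = field(default_factory=list)


@dataclass
class Sample:
    title: str
    tax_id: int
    experiments: List[Experiment] = field(default_factory=list)


@dataclass
class Study:
    title: str
    samples: List[Sample] = field(default_factory=list)


def count_objects(study: Study) -> dict:
    """Count the objects at each level below the study."""
    experiments = [e for s in study.samples for e in s.experiments]
    runs = [r for e in experiments for r in e.runs]
    return {
        "samples": len(study.samples),
        "experiments": len(experiments),
        "runs": len(runs),
    }


# One study, one sample, one ONT experiment and one run, as in Option 1.
# The titles are placeholders.
study = Study(
    title="Example study",
    samples=[
        Sample(
            title="Example sample",
            tax_id=2697049,
            experiments=[
                Experiment(
                    title="Example ONT experiment",
                    platform="Oxford Nanopore",
                    runs=[Run(files=["SRR10902284_ONT.fq.gz"])],
                )
            ],
        )
    ],
)
print(count_objects(study))
```

Each level of the hierarchy corresponds to one of the four metadata tables (Study, Sample, Experiment and Run) that the tool generates.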

We recommend always submitting to the test server before submitting to the public one. Once you have confirmed that all the data and metadata look correct, you can go ahead and submit to the public ENA server.

Hands-on: add metadata interactively and submit a single sequence to ENA
  1. ENA Upload tool Tool: toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload/0.6.1 :
    • “Action to execute”: Add new data
    • Under “Testing options”:
      • “Submit to test ENA server?”: yes
      • “Print the tables but do not submit the datasets”: no
    • “Would you like to submit pregenerated table files or interactively define the input structures?”: Interactive generation of the study structure
    • “Add .fastq.(gz.bz2) extension to the Galaxy dataset names to match the ones described in the input tables?”: No
    • “Does your submission contains viral samples?”: yes
  2. Fill all metadata boxes and make sure that:
    • “Please select the type of study”: Whole Genome Sequencing
    • “Enter the species of the sample”: Severe acute respiratory syndrome coronavirus 2
    • “Enter the taxonomic ID corresponding to the sample species”: 2697049
    • “Host common name”: human
    • “Host subject id”: avoid using an ID that can be used to trace samples back to patients
    • “Host scientific name”: Homo sapiens
    • “Library strategy”: RNA-Seq
    • “Select library source”: METAGENOMIC
    • “Library selection”: RANDOM
    • “Library layout”: SINGLE
    • “Select the sequencing platform used”: Oxford Nanopore
    • “Instrument model”: MinION
    • “Runs executed within this experiment”
      • “File(s) associated with this run”: SRR10902284_ONT.fq.gz
    • “Affiliation center”: your institution
Warning: Do not include personally identifiable data

In some cases, some information is requested by ENA that may classify as personal or could be used to identify persons (e.g. ‘host ID’ for checklist ERC000033). Make sure that you do not publish any personal metadata that infringes privacy protection regulations in your jurisdiction.
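One way to avoid traceable host IDs is to replace internal identifiers with random codes before filling in the metadata. The sketch below is a minimal illustration under that assumption: the function name, the `host-` prefix and the example identifiers are hypothetical, and whether this approach satisfies the regulations in your jurisdiction is something you need to check yourself.

```python
import secrets


def pseudonymize(host_ids):
    """Map each original host ID to a random, opaque code.

    The returned mapping is the only link back to the originals, so it
    should be stored securely and never included in the submission.
    """
    mapping = {}
    used = set()
    for original in host_ids:
        if original in mapping:
            continue  # the same host always gets the same code
        code = f"host-{secrets.token_hex(4)}"
        while code in used:  # regenerate in the unlikely event of a collision
            code = f"host-{secrets.token_hex(4)}"
        used.add(code)
        mapping[original] = code
    return mapping


# Hypothetical internal identifiers, for illustration only.
internal_ids = ["ward3-patient-0042", "ward3-patient-0043", "ward3-patient-0042"]
mapping = pseudonymize(internal_ids)
submitted_ids = [mapping[i] for i in internal_ids]
print(submitted_ids)
```

Because the codes are random rather than derived from the original IDs, they cannot be reversed without the mapping.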

Warning: Submit to the test server first

Make sure “Submit to test ENA server?” is set to yes. Otherwise your data will be submitted to the public server.

Four metadata tables (Study, Sample, Experiment and Run), and a metadata ticket with submission information are generated. You can confirm a successful submission at ENA test server (or the public server, if you chose it).

Upon successful submission, a metadata ticket is generated. It contains information about the submission, including parseable metadata. Importantly, it contains the Study, Sample, Run and Experiment accession numbers. You will use the first two (Study and Sample) later to link the consensus sequence to the raw data.
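If you want to pull the accession numbers out of the ticket programmatically, a regular-expression search is one option. The sketch below assumes only that the accessions appear somewhere in the ticket text with the ERP (study), ERS (sample), ERX (experiment) and ERR (run) prefixes; the example excerpt is hypothetical, and the layout of a real ticket may differ.

```python
import re

# Accession prefixes: ERP and ERS appear in this tutorial; ERX and ERR are
# the corresponding ENA prefixes for experiments and runs.
ACCESSION_PATTERNS = {
    "study": re.compile(r"\bERP\d+\b"),
    "sample": re.compile(r"\bERS\d+\b"),
    "experiment": re.compile(r"\bERX\d+\b"),
    "run": re.compile(r"\bERR\d+\b"),
}


def extract_accessions(receipt_text: str) -> dict:
    """Return the unique accessions of each type, in order of first appearance."""
    found = {}
    for kind, pattern in ACCESSION_PATTERNS.items():
        # dict.fromkeys drops duplicates while preserving order
        found[kind] = list(dict.fromkeys(pattern.findall(receipt_text)))
    return found


# Hypothetical ticket excerpt, for illustration only.
example_receipt = """
study accession: ERP139884
sample accession: ERS12519941
"""
accessions = extract_accessions(example_receipt)
print(accessions["study"], accessions["sample"])
```

The Study and Sample values found this way are the ones you enter later when submitting the consensus sequence.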

Option 2: submitting to ENA using a metadata template

For larger submissions, interactive metadata input can be tedious and impractical. In this second example, you will submit two sets of Illumina PE sequence files and input metadata using a template spreadsheet. This template contains all fields for the ENA sample checklist ‘ERC000033 - ENA virus pathogen reporting standard checklist’. Tabular (.tsv and .xlsx) metadata templates for all sample checklists can be found in this repository. For this tutorial, we provide you with a pre-filled template and encourage you to explore it.

Hands-on: Upload and inspect data
  1. Upload the Illumina data and the metadata template from Zenodo via URLs:

    https://zenodo.org/record/6912963/files/SRR10903401_1.fastq.gz
    https://zenodo.org/record/6912963/files/SRR10903401_2.fastq.gz
    https://zenodo.org/record/6912963/files/SRR10903402_1.fastq.gz
    https://zenodo.org/record/6912963/files/SRR10903402_2.fastq.gz
    https://zenodo.org/record/6912963/files/metadata_template_ERC000033_mock_complete.xlsx
    
  2. Arrange the data into a paired dataset collection

    • Click on “Operations on multiple datasets” (check box icon) at the top of the history panel
    • Check all the datasets in your history you would like to include
    • Click “For all selected...” and choose “Build List of Dataset Pairs”

    • Change the text of unpaired forward to a common selector for the forward reads
    • Change the text of unpaired reverse to a common selector for the reverse reads
    • Click Pair these datasets for each valid forward and reverse pair.
    • Enter a name for your collection
    • Click Create List to build your collection
    • Click on the checkmark icon at the top of your history again

    For the example datasets this means:

    • You need to tell Galaxy about the suffix for your forward and reverse reads, respectively:
      • set the text of unpaired forward to: _1.fastq.gz
      • set the text of unpaired reverse to: _2.fastq.gz
      • click: Auto-pair

      All datasets should now be moved to the paired section of the dialog, and the middle column there should show that only the sample accession numbers, i.e. SRR10903401 and SRR10903402, will be used as the pair names.

    Warning: Paired collection names

    It is very important that the paired collection names contain no suffix (e.g. _1, _R1, etc.) or file extensions (.fastq, .fastq.gz). The submission tool will add these at runtime and leaving them in the paired collection names will cause a mismatch with the filenames in the metadata table.

    • Make sure Hide original elements is checked to obtain a cleaned-up history after building the collection.
    • Give your collection a name
    • Click Create Collection
  3. Inspect the metadata_template_ERC000033_mock_complete.xlsx (filled-in template) file by clicking on the eye icon


    Question
    1. How many metadata sheets are there?
    2. How is the ‘Sample’ section in the template different from that in the interactive metadata input?

    Solution
    1. There are four metadata sheets, one per metadata object (Study, Sample, Experiment, Run).
    2. The Sample section is more extensive in the template spreadsheet because it contains all the fields from the ENA ERC000033 sample checklist. The template also includes all ‘Recommended’ and ‘Optional’ fields in the other sections.
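To see why the paired collection names must carry no suffix or extension, it can help to derive them from the file names yourself. The following sketch groups the example read files into pairs by stripping the `_1.fastq.gz` and `_2.fastq.gz` suffixes; it mimics what Auto-pair is expected to produce for these files and is not Galaxy's own pairing code. The function name is an assumption.

```python
from collections import defaultdict
from urllib.parse import urlparse
import posixpath

FORWARD_SUFFIX = "_1.fastq.gz"
REVERSE_SUFFIX = "_2.fastq.gz"


def pair_reads(urls):
    """Group read files into {pair_name: {"forward": ..., "reverse": ...}}."""
    pairs = defaultdict(dict)
    for url in urls:
        filename = posixpath.basename(urlparse(url).path)
        if filename.endswith(FORWARD_SUFFIX):
            pairs[filename[: -len(FORWARD_SUFFIX)]]["forward"] = filename
        elif filename.endswith(REVERSE_SUFFIX):
            pairs[filename[: -len(REVERSE_SUFFIX)]]["reverse"] = filename
        # files with neither suffix (e.g. the .xlsx template) are ignored
    # keep only complete forward/reverse pairs
    return {name: files for name, files in pairs.items() if len(files) == 2}


urls = [
    "https://zenodo.org/record/6912963/files/SRR10903401_1.fastq.gz",
    "https://zenodo.org/record/6912963/files/SRR10903401_2.fastq.gz",
    "https://zenodo.org/record/6912963/files/SRR10903402_1.fastq.gz",
    "https://zenodo.org/record/6912963/files/SRR10903402_2.fastq.gz",
    "https://zenodo.org/record/6912963/files/metadata_template_ERC000033_mock_complete.xlsx",
]
pairs = pair_reads(urls)
print(sorted(pairs))
```

The pair names that come out are the bare accessions, which is what the paired collection should show before you create it.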

As before, the submission is done to the test server before submitting to the public one.

Hands-on: use a metadata template and submit multiple sequences to ENA
  1. ENA Upload tool Tool: toolshed.g2.bx.psu.edu/repos/iuc/ena_upload/ena_upload/0.6.1 :
    • “Action to execute”: Add new data
    • Under “Testing options”:
      • “Submit to test ENA server?”: yes
      • “Print the tables but do not submit the datasets”: no
    • “Would you like to submit pregenerated table files or interactively define the input structures?”: User generated metadata tables based on Excel templates
    • “Select the metadata checklist”: ENA virus pathogen reporting standard checklist (ERC000033)
    • “Select Excel (xlsx) file based on template”: metadata_template_ERC000033_mock_complete.xlsx
    • “Select runs input format”: Input from a paired collection
    • “List of paired-end runs files”: select the collection you made above during data upload
    • “Affiliation center”: your institution
Warning: Do not include personally identifiable data

In some cases, some information is requested by ENA that may classify as personal or could be used to identify persons (e.g. ‘host ID’ for checklist ERC000033). Make sure that you do not publish any personal metadata that infringes privacy protection regulations in your jurisdiction.

Warning: Submit to the test server first

Make sure “Submit to test ENA server?” is set to yes. Otherwise your data will be submitted to the public server.

Four metadata tables (Study, Sample, Experiment and Run), and a metadata ticket with submission information are generated. You can confirm a successful submission at ENA test server (or the public server, if you chose it).

Upon successful submission, a metadata ticket is generated. It contains information about the submission, including parseable metadata. Importantly, it contains the Study, Sample, Run and Experiment accession numbers. You will use the first two (Study and Sample) later to link the consensus sequence to the raw data.

Submitting consensus sequences to ENA

We produced consensus sequences for the Illumina data from Option 2 above by following the SARS-CoV-2-PE-Illumina-WGS-variant-calling, SARS-CoV-2-variation-reporting and COVID-19-consensus-construction workflows.

In this step we will submit one consensus sequence to ENA and link it to the raw reads submitted in Option 2, using the Study and Sample accession numbers assigned to that submission. Galaxy’s ‘ENA Upload tool’ records these accession numbers in the metadata ticket.

Hands-on: submit consensus sequences to ENA
  1. Upload the consensus sequence to Galaxy from Zenodo via URLs:
    https://zenodo.org/record/6912963/files/SRR10903401.fasta
    
  2. Open the ‘ENA submission receipt’ (eye icon) and find the Study and Sample accession numbers from the raw data submission.
  3. Submit consensus sequence to ENA Tool: toolshed.g2.bx.psu.edu/repos/ieguinoa/ena_webin_cli/ena_webin_cli/7d751b5943b0 :
    • “Submit to test ENA server?”: yes
    • “Validate files and metadata but do not submit”: no
    • Fill the assembly metadata. For our assembly:
      • “Assembly type”: Clone
      • “Assembly program”: BWA-MEM
      • “Molecule type”: genomic RNA
      • “Coverage”: 1000
      • “Select the method to load study and sample metadata”: Fill in required metadata
      • “Assembly name”: give a name to your assembly.
      • “Study accession”: ERP139884 (you can find the Study accession number from your raw data submission metadata ticket)
      • “Sample accession”: ERS12519941 (you can find the Sample accession number from your raw data submission metadata ticket)
      • “Sequencing platform”: Illumina
    • Select the consensus sequence assembly file from your history: SRR10903401.fasta

The outputs are a list of manifests (one manifest in our example) and a ‘submission log’. The manifest file holds the metadata required by ENA’s Webin CLI tool. The ‘submission log’ can help you troubleshoot failed submissions.
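To give a feel for what a manifest looks like, the sketch below renders the assembly metadata from this tutorial as one field per line. The field names follow the Webin CLI manifest format for genome assemblies as we understand it and should be checked against ENA's documentation. The assembly name is a placeholder, the "clone or isolate" value is an assumed equivalent of the "Clone" form option, and the manifest written by the Galaxy tool may differ in content and layout.

```python
def build_manifest(fields):
    """Render manifest fields as one tab-separated NAME/value pair per line."""
    lines = []
    for name, value in fields.items():
        if value is None or str(value).strip() == "":
            raise ValueError(f"Manifest field {name} is empty")
        lines.append(f"{name}\t{value}")
    return "\n".join(lines) + "\n"


manifest_fields = {
    "STUDY": "ERP139884",                 # Study accession from the ticket
    "SAMPLE": "ERS12519941",              # Sample accession from the ticket
    "ASSEMBLYNAME": "example_assembly",   # hypothetical placeholder name
    "ASSEMBLY_TYPE": "clone or isolate",  # assumed to match the "Clone" form value
    "COVERAGE": 1000,
    "PROGRAM": "BWA-MEM",
    "PLATFORM": "Illumina",
    "MOLECULETYPE": "genomic RNA",
    "FASTA": "SRR10903401.fasta",
}
manifest_text = build_manifest(manifest_fields)
print(manifest_text)
```

Comparing a sketch like this with the manifest in your history is a quick way to check that the accession numbers you entered ended up in the right fields.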

Key points
  • Use Galaxy’s ‘ENA Upload tool’ to submit raw reads to ENA

  • Use Galaxy’s ‘Submit consensus sequence to ENA’ tool

  • You need to include your ENA Webin credentials in Galaxy

  • For small submissions, use the interactive metadata forms of the ‘ENA Upload tool’

  • For bulk submissions use a spreadsheet metadata template


References

  1. Arita, M., I. Karsch-Mizrachi, and G. Cochrane, 2020 The international nucleotide sequence database collaboration. Nucleic Acids Research 49: D121–D124. 10.1093/nar/gkaa967
  2. Maier, W., S. Bray, M. van den Beek, D. Bouvier, N. Coraor et al., 2021 Freely accessible ready to use global infrastructure for SARS-CoV-2 monitoring. 10.1101/2021.03.25.437046
  3. Roncoroni, M., B. Droesbeke, I. Eguinoa, K. D. Ruyck, F. D’Anna et al., 2021 A SARS-CoV-2 sequence submission tool for the European Nucleotide Archive (Z. Lu, Ed.). Bioinformatics. 10.1093/bioinformatics/btab421


Citing this Tutorial

  1. Miguel Roncoroni, Bert Droesbeke, 2022 Submitting sequence data to ENA (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/upload-data-to-ena/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012


@misc{galaxy-interface-upload-data-to-ena,
author = "Miguel Roncoroni and Bert Droesbeke",
title = "Submitting sequence data to ENA (Galaxy Training Materials)",
year = "2022",
month = "10",
day = "18",
url = "\url{https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/upload-data-to-ena/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                   

Congratulations on successfully completing this tutorial!