Chloroplast genome assembly

Authors: Anna Syme
Overview
Questions:
  • How can we assemble a chloroplast genome?

Objectives:
  • Assemble a chloroplast genome from long reads

  • Polish the assembly with short reads

  • Annotate the assembly and view

  • Map reads to the assembly and view

Time estimation: 2 hours
Last modification: Oct 18, 2022
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Introduction

What is genome assembly?

Genome assembly is the process of joining DNA sequencing fragments into longer pieces, ideally up to chromosome length. The DNA fragments are produced by DNA sequencing machines and are called “reads”. Reads range in length from about 150 nucleotides (base pairs) to a million or more nucleotides, depending on the sequencing technology used. Currently, most reads are from Illumina (short), PacBio (long) or Oxford Nanopore (long and extra-long).

It is difficult to assemble plant genomes as they are often large (for example, 3,000,000,000 base pairs), have many repeat regions (such as transposons), and may be polyploid. This tutorial shows genome assembly for a smaller data set: the plant chloroplast genome, a single circular chromosome that is typically about 160,000 base pairs long. It is thought that the chloroplast evolved from a cyanobacterium that was living in plant cells.

In this tutorial, we will use a subset of a real data set from sweet potato, from the paper Zhou et al. 2018. To find out more about each of the tools used here, see the tool panel page for a summary and links to more information.

Agenda

In this tutorial we will deal with:

  1. Introduction
    1. What is genome assembly?
  2. Upload data
  3. Check read quality
  4. Assemble reads
  5. Polish assembly
  6. Annotate the assembly
  7. View reads
  8. Repeat with new data
  9. Conclusion

Upload data

Let’s start with uploading the data.

Hands-on: Import the data
  1. Create a new history for this tutorial and give it a proper name

    Click the new-history icon at the top of the history panel.

    If the new-history icon is missing:

    1. Click on the galaxy-gear icon (History options) on the top of the history panel
    2. Select the option Create New from the menu

    To rename the history:

    1. Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
    2. Type the new name
    3. Press Enter
  2. Import from Zenodo or a data library (ask your instructor):

    • FASTQ file with Illumina reads: sweet-potato-chloroplast-illumina-reduced.fastq
    • FASTQ file with Nanopore reads: sweet-potato-chloroplast-nanopore-reduced.fastq
    • Note: make sure to import the files with “reduced” in the names, not the ones with “tiny” in the names.
      https://zenodo.org/record/3567224/files/sweet-potato-chloroplast-illumina-reduced.fastq
      https://zenodo.org/record/3567224/files/sweet-potato-chloroplast-nanopore-reduced.fastq
      
    • Copy the link location
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    • Go into Shared data (top panel) then Data libraries
    • Navigate to the correct folder as indicated by your instructor
    • Select the desired files
    • Click on the To History button near the top and select as Datasets from the dropdown menu
    • In the pop-up window, select the history you want to import the files to (or create a new one)
    • Click on Import

Check read quality

We will look at the quality of the nanopore reads.

Hands-on: Check read quality
  1. Nanoplot Tool: toolshed.g2.bx.psu.edu/repos/iuc/nanoplot/nanoplot/1.28.2+galaxy1 :
    • “Select multifile mode”: batch
    • “Type of file to work on”: fastq
    • “files”: select the nanopore FASTQ file
  2. View output:
    • There are five output files.
    • Look at the HTML report to learn about the read quality.
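
Some of the figures in the NanoPlot report can be approximated by hand, which may help when reading it. The following is a minimal Python sketch, not what NanoPlot itself runs: it assumes a plain four-line-per-record FASTQ, and the toy reads, the toy genome size, and the `read_fastq` and `summarise` helpers are hypothetical names introduced for illustration. The depth estimate is simply total bases divided by an assumed genome size.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy data: three short reads in four-line FASTQ format.
TOY_FASTQ = """@read1
ACGTACGT
+
IIIIIIII
@read2
ACGT
+
IIII
@read3
GGCC
+
IIII
"""


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        name = lines[i].strip()[1:]  # drop the leading "@"
        yield name, lines[i + 1].strip(), lines[i + 3].strip()


def summarise(lengths: List[int], genome_size: int) -> Dict[str, float]:
    """Basic read-set statistics plus a rough mean sequencing depth."""
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "bases": total,
        "mean_length": total / len(lengths) if lengths else 0.0,
        "longest": max(lengths, default=0),
        "approx_depth": total / genome_size,
    }


lengths = [len(seq) for _, seq, _ in read_fastq(TOY_FASTQ.splitlines())]
# Toy genome size of 8 bp; for this tutorial's data it would be about 160000.
summary = summarise(lengths, genome_size=8)
print(summary)
```

For real data, the same functions could be fed the lines of the downloaded nanopore FASTQ file with `genome_size=160000`.
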
Question

What summary statistics would be useful to look at?

This will depend on the aim of your analysis, but usually:

  • Sequencing depth (the number of reads covering each base position; also called “coverage”). Higher depth is usually better, but at very high depths it may be better to subsample the reads, as errors can swamp the assembly graph.
  • Sequencing quality (the quality score indicates the probability that a base call is correct). You may trim or filter reads on quality. Phred quality scores are logarithmic: Phred quality 10 = 90% chance of the base call being correct; Phred quality 20 = 99% chance of the base call being correct.
  • Read lengths (read lengths histogram, and reads lengths vs. quality plots). Your analysis or assembly may need reads of a certain length.
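
The logarithmic relationship between a Phred score and the chance of a wrong base call can be checked with a few lines of Python. This is a sketch under one assumption that is not stated in the tutorial: that the FASTQ quality characters use the common Phred+33 encoding. The function names are made up for illustration.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Turn a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in qual]


def mean_read_quality(qual: str) -> float:
    """Average the error probabilities, then convert back to a Phred score.

    Averaging the scores directly would overstate quality, because the
    scale is logarithmic.
    """
    probs = [phred_to_error_prob(q) for q in decode_quality(qual)]
    return error_prob_to_phred(sum(probs) / len(probs))


for q in (10, 20):
    correct = 1 - phred_to_error_prob(q)
    print(f"Phred {q}: {correct:.0%} chance the base call is correct")
```

One consequence of the logarithmic scale is that a read with one Phred 10 base and one Phred 40 base has a mean quality close to 13, not 25.
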

Optional further steps:

  • Find out the quality of your reads using other tools such as fastp or FastQC.
  • To visualize base quality using emoji you can also use FASTQE.
  • Run FASTQE for the Illumina reads. In the output, look at the mean values (the middle row).
  • Repeat FASTQE for the nanopore reads. In the tool settings, increase the maximum read length to 30000.
  • To learn more, see the Quality Control tutorial

Assemble reads

We will assemble the long nanopore reads.

Hands-on: Assemble reads
  1. Flye Tool: toolshed.g2.bx.psu.edu/repos/bgruening/flye/flye/2.6+galaxy0 :
    • “Input reads”: sweet-potato-chloroplast-nanopore-reduced.fastq
    • “Estimated genome size”: 160000
    • Leave other settings as default
  2. Re-name the consensus output file to flye-assembly.fasta

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field to flye-assembly.fasta
    • Click the Save button
  3. View output:
    • There are five output files.
    • Note: this tool is heuristic; your results may differ slightly from the results here, and if repeated.
    • View the log file and scroll to the end to see how many contigs (fragments) were assembled and the length of the assembly.
    • View the assembly_info file to see contig names and lengths.
Hands-on: View the assembly
  1. Bandage Info Tool: toolshed.g2.bx.psu.edu/repos/iuc/bandage/bandage_info/0.8.1+galaxy1
    • “Graphical Fragment Assembly”: the Flye output file Graphical Fragment Assembly (not the “assembly_graph” file)
    • Leave other settings as default
  2. Bandage Image Tool: toolshed.g2.bx.psu.edu/repos/iuc/bandage/bandage_image/0.8.1+galaxy2
    • “Graphical Fragment Assembly”: the Flye output file Graphical Fragment Assembly (not the “assembly_graph” file)
    • “Node length labels”: Yes
    • Leave other settings as default

Your assembly graph may look like this:

Figure 1: Assembly graph of the nanopore reads, for the sweet potato chloroplast genome. In your image, some of the labels may be truncated; this is a known bug under investigation.

Note: a newer version of the Flye assembly tool now resolves this assembly into a single circle.

Question

What is your interpretation of this assembly graph?

One interpretation is that this represents the typical circular chloroplast structure: There is a long single-copy region (the node of around 78,000 bp), connected to the inverted repeat (a node of around 28,000 bp), connected to the short single-copy region (of around 11,000 bp). In the graph, each end loop is a single-copy region (either long or short) and the centre bar is the collapsed inverted repeat which should have about twice the sequencing depth.
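
The arithmetic behind this interpretation can be written out. The sketch below uses the approximate node lengths quoted in the answer above and assumes, as that interpretation does, that the collapsed inverted repeat is present twice in the real genome. The dictionary and function names are hypothetical.

```python
from typing import Dict, Tuple

# (approximate node length in bp, assumed copy number in the genome)
NODES: Dict[str, Tuple[int, int]] = {
    "long single-copy": (78_000, 1),
    "inverted repeat": (28_000, 2),  # collapsed in the graph, so counted twice
    "short single-copy": (11_000, 1),
}


def graph_length(nodes: Dict[str, Tuple[int, int]]) -> int:
    """Total length of the nodes as drawn in the assembly graph."""
    return sum(length for length, _ in nodes.values())


def expanded_length(nodes: Dict[str, Tuple[int, int]]) -> int:
    """Genome length once each node is counted by its copy number."""
    return sum(length * copies for length, copies in nodes.values())


def copies_from_depth(node_depth: float, single_copy_depth: float) -> int:
    """Estimate copy number from depth relative to a single-copy node."""
    return round(node_depth / single_copy_depth)


print(graph_length(NODES))     # 117000
print(expanded_length(NODES))  # 145000
```

With these rounded figures the expanded genome comes to about 145,000 bp. The node lengths are approximate and come from an unpolished assembly, so this is only a rough check against the typical size of about 160,000 bp.
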

Comment: Further Learning
  • Repeat the Flye assembly with different parameters, and/or a filtered read set.
  • You can also try repeating the Flye assembly with an earlier version of the tool, to see the difference it makes. In the tool panel for Flye, click on the ‘Versions’ button at the top to change.
  • Try an alternative assembly tool, such as Canu or Unicycler.
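
One way to produce a filtered read set is to drop short reads and then cap the total depth. The following is only an illustrative Python sketch: the thresholds are hypothetical, keeping the longest reads first is one possible choice among several, and dedicated read-filtering tools offer more options.

```python
from typing import List, Tuple

Read = Tuple[str, str]  # (name, sequence)


def filter_by_length(reads: List[Read], min_length: int) -> List[Read]:
    """Keep only reads at least min_length bases long."""
    return [read for read in reads if len(read[1]) >= min_length]


def cap_depth(reads: List[Read], genome_size: int, target_depth: float) -> List[Read]:
    """Keep the longest reads until about target_depth x genome_size bases are kept."""
    wanted_bases = genome_size * target_depth
    kept: List[Read] = []
    bases = 0
    for read in sorted(reads, key=lambda r: len(r[1]), reverse=True):
        if bases >= wanted_bases:
            break
        kept.append(read)
        bases += len(read[1])
    return kept


# Hypothetical toy reads of length 10, 6, 4 and 2.
toy_reads = [("a", "A" * 10), ("b", "C" * 6), ("c", "G" * 4), ("d", "T" * 2)]
long_enough = filter_by_length(toy_reads, min_length=4)
subset = cap_depth(long_enough, genome_size=8, target_depth=2)
print([name for name, _ in subset])
```

The filtered reads would then be written back out as FASTQ and given to Flye in place of the full read set.
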

Polish assembly

Short Illumina reads are more accurate than the nanopore reads. We will use them to correct errors in the nanopore assembly.

First, we will map the short reads to the assembly and create an alignment file.

Hands-on: Map reads
  1. Map with BWA-MEM Tool: toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.1 :
    • “Will you select a reference genome from your history”: Use a genome from history
    • “Use the following dataset as the reference sequence”: flye-assembly.fasta
    • “Algorithm for constructing the BWT index”: Auto. Let BWA decide
    • “Single or Paired-end reads”: Single
    • “Select fastq dataset”: sweet-potato-chloroplast-illumina-reduced.fastq
    • “Set read groups information?”: Do not set
    • “Select analysis mode”: Simple Illumina mode
  2. Re-name output file:
    • Re-name this file illumina.bam

Next, we will compare the short reads to the assembly, and create a polished (corrected) assembly file.

Hands-on: Polish
  1. pilon Tool: toolshed.g2.bx.psu.edu/repos/iuc/pilon/pilon/1.20.1 :
    • “Source for reference genome used for BAM alignments”: Use a genome from history
    • “Select a reference genome”: flye-assembly.fasta
    • “Type automatically determined by pilon”: Yes
    • “Input BAM file”: illumina.bam
    • “Variant calling mode”: No
    • “Create changes file”: Yes
  2. View output:
    • What is in the changes file?
    • Rename the fasta output to polished-assembly.fasta
    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field to polished-assembly.fasta
    • Click the Save button
  3. Fasta Statistics Tool: toolshed.g2.bx.psu.edu/repos/iuc/fasta_stats/fasta-stats/1.0.1
    • Find and run the tool called “Fasta statistics” on both the original flye assembly and the polished version.
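
To make clearer what such statistics measure, here is a small Python sketch that computes a few of them from FASTA text. It is not the Galaxy tool's implementation; the toy sequences and the function names are hypothetical, and only a handful of common statistics are shown.

```python
from typing import Dict, List

# Hypothetical toy assembly with two contigs.
TOY_FASTA = """>contig_1
ACGTACGT
AC
>contig_2
GGCC
"""


def read_fasta(lines: List[str]) -> Dict[str, str]:
    """Return {sequence name: sequence}, joining wrapped sequence lines."""
    seqs: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif line and name is not None:
            seqs[name] += line
    return seqs


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def fasta_stats(seqs: Dict[str, str]) -> Dict[str, float]:
    lengths = [len(seq) for seq in seqs.values()]
    total = sum(lengths)
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in seqs.values())
    return {
        "num_seqs": len(lengths),
        "total_length": total,
        "n50": n50(lengths),
        "gc_percent": 100 * gc / total if total else 0.0,
    }


stats = fasta_stats(read_fasta(TOY_FASTA.splitlines()))
print(stats)
```

Running the same function on the Flye assembly and on the polished assembly, and subtracting the two `total_length` values, gives the change in length introduced by polishing.
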
Question

How does the polished assembly compare to the unpolished assembly?

This will depend on the settings, but as an example: your polished assembly might be about 10-15 kbp longer. Nanopore reads can have homopolymer deletions (a run of AAAA may be interpreted as AAA), so the more accurate Illumina reads may correct these parts of the long-read assembly. In the changes file, there may be many cases showing a supposed deletion (represented by a dot) being corrected to a base.
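
A changes file can be tallied rather than read line by line. The sketch below rests on an assumption about the file layout that should be checked against your own output: four whitespace-separated fields per line (original coordinate, new coordinate, original sequence, new sequence), with a dot standing for an empty sequence. The toy lines and the function name are hypothetical.

```python
from typing import Dict, Iterable

# Hypothetical example lines in the assumed four-column layout.
TOY_CHANGES = """contig_1:100 contig_1_pilon:100 . A
contig_1:250 contig_1_pilon:251 . TT
contig_1:300 contig_1_pilon:303 C G
contig_1:400-401 contig_1_pilon:403 AA .
"""


def tally_changes(lines: Iterable[str]) -> Dict[str, int]:
    """Count bases added, removed and substituted, plus net length change."""
    tally = {"inserted": 0, "deleted": 0, "substituted": 0, "net_length_change": 0}
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip anything that does not match the assumed layout
        old = "" if fields[2] == "." else fields[2]
        new = "" if fields[3] == "." else fields[3]
        if not old:
            tally["inserted"] += 1      # dot corrected to one or more bases
        elif not new:
            tally["deleted"] += 1       # bases removed from the assembly
        else:
            tally["substituted"] += 1   # bases replaced by other bases
        tally["net_length_change"] += len(new) - len(old)
    return tally


tally = tally_changes(TOY_CHANGES.splitlines())
print(tally)
```

If most entries fall in the `inserted` group and the net length change is positive, that would be consistent with polishing restoring bases lost to homopolymer deletions.
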

Optional further steps:

  • Run a second round (or more) of Pilon polishing. Keep track of file naming; you will need to generate a new bam file first before each round of Pilon.
  • Run an alternative polishing tool, such as Racon. This uses the long reads themselves to correct the long-read (Flye) assembly. It would be better to run this tool on the Flye assembly before running Pilon, rather than after Pilon.
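
Keeping track of file names across polishing rounds is easier with a written plan. The Python sketch below only builds command strings and does not run anything. The commands are assumed command-line equivalents of the Galaxy steps (`bwa index`, `bwa mem`, `samtools sort`, `samtools index`, and a `pilon` launcher with `--genome`, `--unpaired`, `--output` and `--changes`), not necessarily what the Galaxy wrappers execute, and the file-naming scheme is hypothetical.

```python
from typing import List, Tuple


def round_commands(assembly: str, reads: str, n: int) -> Tuple[List[str], str]:
    """Commands for polishing round n and the name of the FASTA it produces."""
    bam = f"illumina-round{n}.bam"
    prefix = f"polished-round{n}"
    commands = [
        f"bwa index {assembly}",
        f"bwa mem {assembly} {reads} | samtools sort -o {bam} -",
        f"samtools index {bam}",
        # --unpaired is used because the reads are treated as single-end here
        f"pilon --genome {assembly} --unpaired {bam} --output {prefix} --changes",
    ]
    return commands, f"{prefix}.fasta"


def plan_rounds(assembly: str, reads: str, rounds: int) -> Tuple[List[str], str]:
    """Chain rounds so each one maps reads to the previous round's output."""
    plan: List[str] = []
    current = assembly
    for n in range(1, rounds + 1):
        commands, current = round_commands(current, reads, n)
        plan.extend(commands)
    return plan, current


plan, final_assembly = plan_rounds(
    "flye-assembly.fasta", "sweet-potato-chloroplast-illumina-reduced.fastq", rounds=2
)
for command in plan:
    print(command)
```

The point of the design is that every round re-maps the reads to the assembly produced by the round before it, so a BAM file is never reused against a changed reference.
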

Annotate the assembly

We can now annotate our assembled genome with information about genomic features.

  • A chloroplast genome annotation tool is not yet available in Galaxy; for an approximation, here we can use the tool for bacterial genome annotation, Prokka.
Hands-on: Annotate with Prokka
  1. Prokka Tool: toolshed.g2.bx.psu.edu/repos/crs4/prokka/prokka/1.14.5+galaxy0 with the following parameters (leave everything else unchanged)
    • param-file “contigs to annotate”: polished-assembly.fasta
  2. View output:
    • The GFF and GBK files contain all of the information about the features annotated (in different formats).
    • The .txt file contains a summary of the number of features annotated.
    • The .faa file contains the protein sequences of the genes annotated.
    • The .ffn file contains the nucleotide sequences of the genes annotated.
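
The summary of feature counts can also be derived from the GFF file itself. The following Python sketch assumes standard GFF3 layout (nine tab-separated columns with the feature type in the third, `#` for comment lines, and an optional `##FASTA` section at the end). The toy records are invented and are not real annotation output.

```python
from collections import Counter
from typing import Iterable

# Hypothetical GFF3 records; real files hold many more features.
TOY_GFF = (
    "##gff-version 3\n"
    "contig_1\ttool\tCDS\t100\t400\t.\t+\t0\tID=gene_1\n"
    "contig_1\ttool\tCDS\t900\t1200\t.\t-\t0\tID=gene_2\n"
    "contig_1\ttool\ttRNA\t1500\t1572\t.\t+\t.\tID=gene_3\n"
    "##FASTA\n"
    ">contig_1\n"
    "ACGT\n"
)


def count_feature_types(lines: Iterable[str]) -> Counter:
    """Count features by type (column 3) in GFF3 text."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # the rest of the file is sequence, not features
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # skip lines that are not well-formed feature records
        counts[fields[2]] += 1
    return counts


counts = count_feature_types(TOY_GFF.splitlines())
print(dict(counts))
```
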

Alternatively, you might want to use a web-based tool designed for chloroplast genomes.

  • One option is the GeSeq tool, described here. Skip this step if you have already used Prokka above.
Hands-on: Annotate with GeSeq
  • Download polished-assembly.fasta to your computer (click on the file in your history; then click on the disk icon).
  • In a new browser tab, go to Chlorobox where we will use the GeSeq tool (Tillich et al. 2017) to annotate our sequence.
  • Upload the fasta file there. Information about how to use the tool is available on the page.
  • Once the annotation is completed, download the required files.
  • In Galaxy, import the annotation GFF3 file.

Now make a JBrowse file to view the annotations (the GFF3 file - produced from either Prokka or GeSeq) under the assembly (the polished-assembly.fasta file).

Hands-on: View annotations
  1. JBrowse genome browser Tool: toolshed.g2.bx.psu.edu/repos/iuc/jbrowse/jbrowse/1.16.4+galaxy3 :
    • “Reference genome to display”: Use a genome from history
      • “Select a reference genome”: polished-assembly.fasta
    • “Produce Standalone Instance”: Yes
    • “Genetic Code”: 11. The Bacterial, Archaeal and Plant Plastid Code
    • “JBrowse-in-Galaxy Action”: New JBrowse instance
    • “Insert Track Group”
      • “Insert Annotation Track”
        • “Track Type”: GFF/GFF3/BED Features
        • “GFF/GFF3/BED Track Data”: the GFF3 file
        • Leave the other track features as default
  2. Re-name output file:
    • JBrowse may take a few minutes to run. There is one output file: re-name it view-annotations
  3. View output:
    • Click on the eye icon to view the annotations file.
    • Select the right contig to view, in the drop down box.
    • Zoom out (with the minus button) until annotations are visible.

View reads

We will look at the original sequencing reads mapped to the genome assembly. In this tutorial, we will import very cut-down read sets so that they are easier to view.

Hands-on: Import cut-down read sets
  1. Import from Zenodo or a data library (ask your instructor):
    • FASTQ file with Illumina reads: sweet-potato-chloroplast-illumina-tiny.fastq
    • FASTQ file with Nanopore reads: sweet-potato-chloroplast-nanopore-tiny.fastq
    • Note: these are the “tiny” files, not the “reduced” files we imported earlier.
      https://zenodo.org/record/3567224/files/sweet-potato-chloroplast-illumina-tiny.fastq
      https://zenodo.org/record/3567224/files/sweet-potato-chloroplast-nanopore-tiny.fastq
      
Hands-on: Map the reads to the assembly
  • Map the Illumina reads (the new “tiny” dataset) to polished-assembly.fasta, the same way we did before, using BWA-MEM.
  • This creates one output file: re-name it illumina-tiny.bam
  • Map the Nanopore reads (the new “tiny” dataset) to polished-assembly.fasta. The settings will be the same, except that “Select analysis mode” should be Nanopore.
  • This creates one output file: re-name it nanopore-tiny.bam
Hands-on: Visualise mapped reads
  1. JBrowse genome browser Tool: toolshed.g2.bx.psu.edu/repos/iuc/jbrowse/jbrowse/1.16.4+galaxy3 :
    • “Reference genome to display”: Use a genome from history
      • “Select a reference genome”: polished-assembly.fasta
    • “Produce Standalone Instance”: Yes
    • “Genetic Code”: 11. The Bacterial, Archaeal and Plant Plastid Code
    • “JBrowse-in-Galaxy Action”: New JBrowse instance
    • “Insert Track Group”
      • “Insert Annotation Track”
        • “Track Type”: BAM pileups
        • “BAM track data”: nanopore-tiny.bam
        • “Autogenerate SNP track”: No
        • Leave the other track features as default
      • “Insert Annotation Track”
        • “Track Type”: BAM pileups
        • “BAM track data”: illumina-tiny.bam
        • “Autogenerate SNP track”: No
        • Leave the other track features as default
  2. Re-name output file:
    • JBrowse may take a few minutes to run. There is one output file: re-name it assembly-and-reads
  3. View output:
    • Click on the eye icon to view. (For more room, collapse Galaxy side menus with corner < > signs).
    • Make sure the bam files are ticked in the left hand panel.
    • Choose a contig in the drop down menu. Zoom in and out with + and - buttons.

Question
  1. What are the differences between the nanopore and the illumina reads?
  2. What are some reasons that the read coverage may vary across the reference genome?

  1. Nanopore reads are longer and have a higher error rate.
  2. There may be lots of reasons for varying read coverage. Some possibilities: In areas of high read coverage: this region may be a collapsed repeat. In areas of low or no coverage: this region may be difficult to sequence; or, this region may be a misassembly.
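
Regions of unusually high or low coverage can also be found numerically rather than by eye. The sketch below is a simplified Python illustration: it takes read alignments as hypothetical (start, end) intervals with 0-based, end-exclusive coordinates, computes per-base depth, and reports runs at or below a low threshold or at or above a high threshold. The thresholds are arbitrary values chosen for the toy data.

```python
from typing import List, Optional, Tuple


def depth_per_base(intervals: List[Tuple[int, int]], ref_length: int) -> List[int]:
    """Per-position read depth, using a difference array."""
    diff = [0] * (ref_length + 1)
    for start, end in intervals:
        s = max(0, min(start, ref_length))  # clamp to the reference
        e = max(s, min(end, ref_length))
        diff[s] += 1
        diff[e] -= 1
    depth, running = [], 0
    for change in diff[:ref_length]:
        running += change
        depth.append(running)
    return depth


def flag_regions(depth: List[int], low: int, high: int) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) runs where depth <= low or depth >= high."""

    def label(d: int) -> Optional[str]:
        if d <= low:
            return "low"
        if d >= high:
            return "high"
        return None

    regions = []
    start = 0
    for pos in range(1, len(depth) + 1):
        if pos == len(depth) or label(depth[pos]) != label(depth[start]):
            if label(depth[start]) is not None:
                regions.append((start, pos, label(depth[start])))
            start = pos
    return regions


# Hypothetical alignments on a 10 bp reference.
toy_alignments = [(0, 4), (2, 8), (2, 6), (3, 6)]
depth = depth_per_base(toy_alignments, ref_length=10)
regions = flag_regions(depth, low=0, high=3)
print(depth)
print(regions)
```

A "high" run might correspond to a collapsed repeat, and a "low" run to a region that is hard to sequence or to a misassembly, as discussed above.
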

Repeat with new data

Optional extension exercise

We can assemble another chloroplast genome using sequence data from a different plant species: the snow gum, Eucalyptus pauciflora. This data is from Wang et al. 2018. It is a subset of the original FASTQ read files (Illumina - SRR7153063, Nanopore - SRR7153095).

Hands-on: Assembly and annotation
  • Get data: at this Zenodo link, then upload to Galaxy.
  • Check reads: Run Nanoplot on the nanopore reads.
  • Assemble: Use Flye to assemble the nanopore reads, then get Fasta statistics. Note: this may take several hours.
  • Polish assembly: Use Pilon to polish the assembly with short Illumina reads. Note: Don’t forget to map these Illumina reads to the assembly first using bwa-mem, then use the resulting bam file as input to Pilon.
  • Annotate: Use the GeSeq tool at Chlorobox or the Prokka tool within Galaxy.
  • View annotations: Use JBrowse to view the assembled, annotated genome.

Conclusion

Key points
  • A chloroplast genome can be assembled with long reads and polished with short reads

  • The assembly graph is useful to look at and think about genomic structure

  • We can map raw reads back to the assembly and investigate areas of high or low read coverage

  • We can view an assembly, its mapped reads, and its annotations in JBrowse

Frequently Asked Questions

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Assembly topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

References

  1. Tillich, M., P. Lehwark, T. Pellizzer, E. S. Ulbricht-Jones, A. Fischer et al., 2017 GeSeq – versatile and accurate annotation of organelle genomes. Nucleic Acids Research 45: W6–W11. 10.1093/nar/gkx391
  2. Wang, W., M. Schalamun, A. Morales-Suarez, D. Kainer, B. Schwessinger et al., 2018 Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case. BMC Genomics 19: 10.1186/s12864-018-5348-8
  3. Zhou, C., T. Duarte, R. Silvestre, G. Rossel, R. O. M. Mwanga et al., 2018 Insights into population structure of East African sweetpotato cultivars from hybrid assembly of chloroplast genomes. Gates Open Research 2: 41. 10.12688/gatesopenres.12856.1

Citing this Tutorial

  1. Anna Syme, 2022 Chloroplast genome assembly (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/assembly/tutorials/chloroplast-assembly/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012


@misc{assembly-chloroplast-assembly,
author = "Anna Syme",
title = "Chloroplast genome assembly (Galaxy Training Materials)",
year = "2022",
month = "10",
day = "18",
url = "\url{https://training.galaxyproject.org/training-material/topics/assembly/tutorials/chloroplast-assembly/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                   

Congratulations on successfully completing this tutorial!