Genome assembly using PacBio data

Overview
Questions:
  • How to perform a genome assembly with PacBio data?

  • How to check assembly quality?

Objectives:
  • Assemble a genome with PacBio data

  • Assess assembly quality

Time estimation: 6 hours
Level: Intermediate
Last modification: Oct 18, 2022
License: Tutorial content is licensed under the Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Introduction

In this tutorial, we will assemble the genome of Mucor mucedo, a fungus in the family Mucoraceae, from PacBio sequencing data. These data were obtained from NCBI (SRR8534473, SRR8534474 and SRR8534475). We will then assess the quality of the resulting assembly, in particular by comparing it to a reference assembly that was produced with the Falcon assembler and is available on the JGI website.

Agenda

In this tutorial, we will cover:

  1. Introduction
  2. Get data
    1. Get data from Zenodo
    2. Get data from JGI website
  3. Genome Assembly with Flye
  4. Quality assessment
    1. Genome assembly metrics with Fasta Statistics
    2. Genome assemblies comparison with Quast
    3. Genome assembly assessment with BUSCO
  5. Conclusion

Get data

We will use long-read sequencing data: CLR (continuous long reads) from PacBio sequencing of the Mucor mucedo genome. These data are a subset of the data available from NCBI. Later, we will also use a reference genome assembly downloaded from the JGI website. This reference genome was assembled from the same PacBio data, so we will use it as a point of comparison for our own assembly.

Get data from Zenodo

Hands-on: Data upload from Zenodo
  1. Create a new history for this tutorial
  2. Import the files from Zenodo

    https://zenodo.org/api/files/d010d8f1-a1fd-4366-991f-916c2f0c55db/SRR8534473_subreads.fastq.gz
    https://zenodo.org/api/files/d010d8f1-a1fd-4366-991f-916c2f0c55db/SRR8534474_subreads.fastq.gz
    https://zenodo.org/api/files/d010d8f1-a1fd-4366-991f-916c2f0c55db/SRR8534475_subreads.fastq.gz
    
    • Copy the link location
    • Open the Galaxy Upload Manager (the upload icon at the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window
  3. Rename the datasets
  4. Check that the datatype is fastqsanger.gz

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click on the Datatypes tab at the top
    • Select fastqsanger.gz
      • tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button
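
Outside Galaxy, a similar sanity check (is the file really gzip-compressed FASTQ?) can be sketched in Python. This is a minimal sketch and not part of the tutorial workflow: the function names and the toy file are assumptions made for illustration, and the code assumes the usual four-line FASTQ records.

```python
import gzip
import os
import tempfile


def count_fastq_records(path):
    """Count records in a gzip-compressed FASTQ file made of four-line records.

    Raises ValueError if a record does not start with '@', lacks the '+'
    separator, or has sequence and quality strings of different lengths.
    """
    n_records = 0
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            sequence = handle.readline()
            separator = handle.readline()
            quality = handle.readline()
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"malformed record near read {n_records + 1}")
            if len(sequence.strip()) != len(quality.strip()):
                raise ValueError(f"length mismatch in read {n_records + 1}")
            n_records += 1
    return n_records


def write_toy_fastq(path):
    """Write a two-read toy file (hypothetical data, not taken from the SRA runs)."""
    with gzip.open(path, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n@read2\nGGCCA\n+\nIIIII\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        toy = os.path.join(tmp, "toy_subreads.fastq.gz")
        write_toy_fastq(toy)
        print(count_fastq_records(toy))
```

A file that is not gzip-compressed, or whose records are malformed, raises an error instead of returning a count.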

Get data from JGI website

Hands-on: Data upload from JGI website
  1. Create a JGI account on the JGI registration page
  2. Sign in to the JGI Genome Portal
  3. The genome assembly is available on the JGI Mucor mucedo page
  4. Download the fasta assembly file Mucmuc1_AssemblyScaffolds.fasta to your computer
  5. Upload this file to Galaxy
  6. Check that the datatype is fasta

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click on the Datatypes tab at the top
    • Select fasta
      • tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

Genome Assembly with Flye

We will use Flye, a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs. Flye also has a special mode for metagenome assembly. More information about the Flye assembler is available here: Flye.

Hands-on: Assembly
  1. Flye Tool: toolshed.g2.bx.psu.edu/repos/bgruening/flye/flye/2.9+galaxy0 with the following parameters:
    • param-file “Input reads”: the three sequencing datasets
    • “Mode”: PacBio raw
    • “Number of polishing iterations”: 1
    • “Reduced contig assembly coverage”: Disable reduced coverage for initial disjointing assembly

    The tool produces four datasets: consensus, assembly graph, graphical fragment assembly and assembly info.
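
For readers who also run Flye on the command line, the parameters above roughly correspond to the invocation sketched below. This is an illustration under assumptions: the output directory, the thread count and the helper name `build_flye_command` are hypothetical, and the mapping from the Galaxy form to flags (`--pacbio-raw`, `--iterations`, and no `--asm-coverage` when reduced coverage is disabled) should be checked against the Flye usage documentation for your installed version. The command is only built and printed, not executed.

```python
import shlex


def build_flye_command(read_files, out_dir="flye_out", threads=4, iterations=1):
    """Build (but do not run) a Flye command line for raw PacBio CLR reads.

    out_dir and threads are hypothetical defaults chosen for this sketch.
    """
    if not read_files:
        raise ValueError("at least one read file is required")
    command = ["flye", "--pacbio-raw", *read_files]
    command += ["--out-dir", out_dir]
    command += ["--threads", str(threads)]
    # One polishing iteration, as selected in the Galaxy form.
    command += ["--iterations", str(iterations)]
    # No --asm-coverage flag: reduced coverage for the initial assembly is disabled.
    return command


reads = [
    "SRR8534473_subreads.fastq.gz",
    "SRR8534474_subreads.fastq.gz",
    "SRR8534475_subreads.fastq.gz",
]
command = build_flye_command(reads)
print(shlex.join(command))
```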

Question

What are the different output datasets?

  • The first dataset (consensus) is a fasta file containing the final assembly (1461 contigs). You may notice that the number of contigs you obtained is slightly different from the one presented here. This is due to the Flye assembly algorithm, which doesn’t always give the exact same results.
  • The second and third datasets are assembly graph files. These graphs represent the final assembly of a genome; they are based on reads and their overlap information. Some tools, such as Bandage, can visualize the assembly graph.
  • The fourth dataset is a tabular file (assembly_info) containing extra information about contigs/scaffolds.
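
The tabular assembly_info dataset can also be summarized with a few lines of Python. The sketch below is hedged: it assumes the tab-separated column layout described in the Flye documentation (`#seq_name`, `length`, `cov.`, `circ.`, `repeat`, `mult.`, `alt_group`, `graph_path`, with `Y`/`N` flags), which may differ between Flye versions, and the rows shown are invented toy values rather than output from this tutorial.

```python
import csv
import io

# Toy content (hypothetical values) following the assumed assembly_info layout.
TOY_ASSEMBLY_INFO = (
    "#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n"
    "contig_1\t897000\t52\tN\tN\t1\t*\t1\n"
    "contig_2\t120000\t49\tN\tN\t1\t*\t2\n"
    "contig_3\t35000\t310\tY\tY\t6\t*\t3\n"
)


def summarize_assembly_info(handle):
    """Summarize contig count, total length and flags from an assembly_info table."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    lengths = [int(row["length"]) for row in rows]
    return {
        "n_contigs": len(rows),
        "total_length": sum(lengths),
        "longest": max(lengths, default=0),
        "n_circular": sum(row["circ."] == "Y" for row in rows),
        "n_repeat": sum(row["repeat"] == "Y" for row in rows),
    }


summary = summarize_assembly_info(io.StringIO(TOY_ASSEMBLY_INFO))
print(summary)
```

To use it on a real file, pass an open file handle for the downloaded assembly_info dataset instead of the `io.StringIO` toy table.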

Quality assessment

Genome assembly metrics with Fasta Statistics

Fasta statistics displays the summary statistics for a fasta file. In the case of a genome assembly, we need to calculate different metrics such as assembly size, scaffolds number or N50 value. These metrics will allow us to evaluate the quality of this assembly.
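
To make these metrics concrete, here is a minimal sketch of how the number of sequences, assembly size, N50 and GC content can be computed from a fasta file. It is not the implementation used by the Fasta Statistics tool; the function names and the toy sequences are assumptions for illustration. N50 is taken here as the length of the contig at which the running total of lengths, sorted from largest to smallest, first reaches half of the assembly size.

```python
import io

# Toy multi-fasta (hypothetical sequences), with one record split over two lines.
TOY_FASTA = ">c1\nACGT\nACGT\n>c2\nGGCCAT\n>c3\nATAT\n>c4\nGC\n"


def read_fasta(handle):
    """Return a list of sequences, joining lines that belong to the same record."""
    sequences, current = [], []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current:
                sequences.append("".join(current))
            current = []
        elif line:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences


def n50(lengths):
    """Length of the contig at which the sorted running total reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def fasta_stats(handle):
    sequences = read_fasta(handle)
    lengths = [len(seq) for seq in sequences]
    total = sum(lengths)
    gc = sum(seq.count("G") + seq.count("C") for seq in sequences)
    return {
        "n_sequences": len(sequences),
        "total_length": total,
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
        "gc_percent": 100 * gc / total if total else 0.0,
    }


stats = fasta_stats(io.StringIO(TOY_FASTA))
print(stats)
```

On the toy input the contig lengths are 8, 6, 4 and 2, so the total is 20 and the N50 is 6: the two largest contigs together (14 bases) are the first to cover at least half of the assembly.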

Hands-on: Fasta statistics on Flye assembly
  1. Fasta Statistics Tool: toolshed.g2.bx.psu.edu/repos/iuc/fasta_stats/fasta-stats/2.0 with the following parameters:
    • param-file “fasta or multifasta file”: consensus (output of Flye tool)

Hands-on: Fasta statistics on the reference assembly
  1. Fasta Statistics Tool: toolshed.g2.bx.psu.edu/repos/iuc/fasta_stats/fasta-stats/2.0 with the following parameters:
    • param-file “fasta or multifasta file”: Mucmuc1_AssemblyScaffolds.fasta

Question
  1. Compare the different metrics obtained for the Flye assembly and the reference genome.
  2. What can you conclude about the quality of this new assembly?

  1. We compare the metrics of the two genome assemblies:
    • The Flye assembly: 1461 contigs/scaffolds, N50 = 222 kb, length max = 897 kb, size = 48.6 Mb, 36.6% GC
    • The reference genome: 456 contigs/scaffolds, N50 = 202 kb, length max = 776 kb, size = 46.1 Mb, 36.7% GC
  2. Apart from the number of contigs (1461 versus 456), the metrics are very similar: Flye generated an assembly of a quality similar to that of the reference genome.

Genome assemblies comparison with Quast

Another way to calculate assembly metrics is to use QUAST (QUality ASsessment Tool). Quast evaluates genome assemblies by computing various metrics, and can compare a genome assembly with a reference genome. The Quast manual is available here: Quast

Hands-on: Quast on Flye assembly
  1. Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy3 with the following parameters:
    • “Use customized names for the input files?”: No, use dataset names
      • param-file “Contigs/scaffolds file”: consensus (output of Flye tool)
    • “Type of assembly”: Genome
      • “Use a reference genome?”: Yes
      • param-file “Reference genome”: Mucmuc1_AssemblyScaffolds.fasta
      • “Type of organism”: Fungus: use of GeneMark-ES for gene finding, ...

Question

What additional information does Quast generate, compared to the Fasta Statistics outputs?

Quast allows us to compare the Flye assembly to the reference genome:

  1. Genome fraction (90.192 %) is the percentage of bases in the reference genome that are covered by at least one alignment.
  2. Duplication ratio (1.094) is the total number of aligned bases in the assembly divided by the total number of aligned bases in the reference genome.
  3. Largest alignment (698452 bp) is the length of the largest continuous alignment in the assembly.
  4. Total aligned length (45.2 Mb) is the total number of aligned bases in the assembly.
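
As a rough illustration of how these four metrics relate to each other, the sketch below computes them from a toy set of alignments. It is not how Quast itself is implemented: it assumes a single reference sequence, half-open (start, end) reference coordinates, and alignments without indels, so that the aligned length in the assembly equals the reference span. All names and numbers are hypothetical.

```python
def covered_bases(intervals):
    """Number of reference bases covered by at least one half-open interval."""
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


def alignment_metrics(reference_length, alignments):
    """Compute Quast-like metrics from (ref_start, ref_end) alignments."""
    aligned_in_assembly = sum(end - start for start, end in alignments)
    aligned_in_reference = covered_bases(alignments)
    return {
        "genome_fraction": 100 * aligned_in_reference / reference_length,
        "duplication_ratio": aligned_in_assembly / aligned_in_reference,
        "largest_alignment": max(end - start for start, end in alignments),
        "total_aligned_length": aligned_in_assembly,
    }


# Toy example: a 100-base reference with three alignments, two of them overlapping.
metrics = alignment_metrics(100, [(0, 40), (30, 70), (80, 90)])
print(metrics)
```

In the toy example, 80 of the 100 reference bases are covered (genome fraction 80 %), while 90 assembly bases are aligned, so the duplication ratio is 90 / 80 = 1.125. A ratio above 1 therefore indicates reference regions that are represented more than once in the assembly.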

Quast also generates some plots:

  1. Cumulative length plot shows the growth of contig lengths. On the x-axis, contigs are ordered from the largest to the smallest. The y-axis gives the total size of the x largest contigs in the assembly.
  2. GC content plot shows the distribution of GC content in the contigs.
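
The values behind the cumulative length plot described above can be sketched in a few lines. This is only an illustration with hypothetical contig lengths, not Quast's plotting code:

```python
from itertools import accumulate


def cumulative_lengths(lengths):
    """Running total of contig lengths, ordered from the largest to the smallest."""
    return list(accumulate(sorted(lengths, reverse=True)))


# Hypothetical contig lengths; point x on the curve is the size of the x largest contigs.
points = cumulative_lengths([2, 8, 4, 6])
print(points)
```

The last value of the curve is always the total assembly size. A curve that rises steeply and flattens early indicates that a few large contigs hold most of the assembly.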

Genome assembly assessment with BUSCO

BUSCO (Benchmarking Universal Single-Copy Orthologs) provides a quantitative assessment of genome assembly completeness, based on evolutionarily informed expectations of gene content. Details about this tool are available here: Busco website

Hands-on: BUSCO on Flye assembly

First on the Flye assembly:

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.2.2+galaxy0 with the following parameters:
    • param-file “Sequences to analyse”: consensus (output of Flye tool)
    • “Auto-detect or select lineage”: Select lineage
      • “Lineage”: Mucorales

Then, on the reference assembly:

  1. Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.2.2+galaxy0 with the following parameters:
    • param-file “Sequences to analyse”: Mucmuc1_AssemblyScaffolds.fasta
    • “Auto-detect or select lineage”: Select lineage
      • “Lineage”: Mucorales

Question

Compare the number of BUSCO genes identified in the Flye assembly and the reference genome. What do you observe?

The short summary generated by BUSCO indicates that the reference genome contains:

  1. 2327 complete BUSCOs (of which 2302 are single-copy and 25 are duplicated),
  2. 13 fragmented BUSCOs,
  3. 109 missing BUSCOs.

The short summary generated by BUSCO indicates that the Flye assembly contains:

  1. 2348 complete BUSCOs (of which 2310 are single-copy and 38 are duplicated),
  2. 8 fragmented BUSCOs,
  3. 93 missing BUSCOs.

The BUSCO analysis confirms that these two assemblies are of similar quality, with similar numbers of complete, fragmented and missing BUSCO genes.
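
BUSCO results are often compared as percentages of the total number of BUSCO groups searched. As a minimal sketch, using only the counts quoted above (the function name `busco_percentages` is an assumption for illustration), the percentages can be recomputed as follows:

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Convert BUSCO counts into percentages of the total number of groups."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no BUSCO groups given")

    def pct(count):
        return 100 * count / total

    return {
        "C": pct(single + duplicated),  # complete = single-copy + duplicated
        "S": pct(single),
        "D": pct(duplicated),
        "F": pct(fragmented),
        "M": pct(missing),
        "n": total,
    }


# Counts taken from the short summaries quoted in this tutorial.
reference = busco_percentages(single=2302, duplicated=25, fragmented=13, missing=109)
flye = busco_percentages(single=2310, duplicated=38, fragmented=8, missing=93)

for name, result in (("reference", reference), ("Flye", flye)):
    print(f"{name}: C:{result['C']:.1f}% F:{result['F']:.1f}% "
          f"M:{result['M']:.1f}% n:{result['n']}")
```

With these counts, both assemblies are scored against the same 2449 BUSCO groups, and the complete fraction is about 95.0 % for the reference and 95.9 % for the Flye assembly.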

Conclusion

This pipeline shows how to generate and evaluate a genome assembly from PacBio long-read data. Once you are satisfied with your genome sequence, you might want to annotate it: have a look at the RepeatMasker and Funannotate tutorials to learn how to do it!

Key points
  • PacBio data make it possible to produce good-quality genome assemblies

  • Quast and BUSCO make it easy to compare the quality of assemblies

Citing this Tutorial

  1. Anthony Bretaudeau, Alexandre Cormier, Erwan Corre, Laura Leroi, Stéphanie Robin, Erasmus+ Programme, 2022 Genome assembly using PacBio data (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/assembly/tutorials/flye-assembly/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012


@misc{assembly-flye-assembly,
author = "Anthony Bretaudeau and Alexandre Cormier and Erwan Corre and Laura Leroi and Stéphanie Robin and Erasmus+ Programme",
title = "Genome assembly using PacBio data (Galaxy Training Materials)",
year = "2022",
month = "10",
day = "18",
url = "\url{https://training.galaxyproject.org/training-material/topics/assembly/tutorials/flye-assembly/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                   

Congratulations on successfully completing this tutorial!