Genome assembly using PacBio data
Overview
Questions:
- How to perform a genome assembly with PacBio data?
- How to check assembly quality?
Objectives:
- Assemble a genome with PacBio data
- Assess assembly quality
Requirements:
- Introduction to Galaxy Analyses
- Sequence analysis
- Quality Control (slides, hands-on tutorial)
Time estimation: 6 hours
Level: Intermediate
Last modification: Oct 18, 2022
Introduction
In this tutorial, we will assemble the genome of Mucor mucedo, a fungus in the family Mucoraceae, from PacBio sequencing data. The data were obtained from NCBI (SRR8534473, SRR8534474 and SRR8534475). We will then assess the quality of the resulting assembly, in particular by comparing it to a reference assembly that was produced with the Falcon assembler and is available on the JGI website.
Get data
We will use long-read sequencing data: CLR (continuous long reads) from PacBio sequencing of the Mucor mucedo genome. This dataset is a subset of the data available from NCBI. Later, we will also use a reference genome assembly downloaded from the JGI website. This reference genome was assembled from the same PacBio data, so we will use it as a point of comparison for our own assembly.
Get data from Zenodo
Hands-on: Data upload from Zenodo
- Create a new history for this tutorial
- Import the files from Zenodo:
  https://zenodo.org/api/files/d010d8f1-a1fd-4366-991f-916c2f0c55db/SRR8534473_subreads.fastq.gz
  https://zenodo.org/api/files/d010d8f1-a1fd-4366-991f-916c2f0c55db/SRR8534474_subreads.fastq.gz
  https://zenodo.org/api/files/d010d8f1-a1fd-4366-991f-916c2f0c55db/SRR8534475_subreads.fastq.gz
  - Copy the link location
  - Open the Galaxy Upload Manager (upload icon at the top-right of the tool panel)
  - Select Paste/Fetch Data
  - Paste the links into the text field
  - Press Start
  - Close the window
- Rename the datasets
- Check that the datatype is fastqsanger.gz. If it is not, change it:
  - Click on the pencil icon for the dataset to edit its attributes
  - In the central panel, click on the Datatypes tab on the top
  - Select fastqsanger.gz
    - tip: you can start typing the datatype into the field to filter the dropdown menu
  - Click the Save button
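If you also keep a local copy of the read files, a quick sanity check before assembling is to count reads and bases. The following is a minimal sketch, not part of the Galaxy workflow: it assumes standard four-line FASTQ records (no wrapped sequences), and the function names and toy reads are hypothetical.

```python
import gzip
import io


def fastq_summary(handle):
    """Count reads and bases in a FASTQ text stream (assumes 4 lines per record)."""
    reads = 0
    bases = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record holds the sequence
            reads += 1
            bases += len(line.strip())
    return reads, bases


def fastq_gz_summary(path):
    """Same summary for a gzip-compressed FASTQ file on disk."""
    with gzip.open(path, "rt") as handle:
        return fastq_summary(handle)


# Toy example with two made-up reads (not taken from the tutorial data).
toy = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
print(fastq_summary(toy))
```

For a downloaded file, `fastq_gz_summary("SRR8534473_subreads.fastq.gz")` would return the same pair of counts, assuming the file sits in the working directory.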
Get data from JGI website
Hands-on: Data upload from JGI website
- Create a JGI account on the JGI registration page
- Sign in to the JGI Genome Portal
- The genome assembly is available on the JGI Mucor mucedo page
- Download the fasta assembly file Mucmuc1_AssemblyScaffolds.fasta to your computer
- Upload this file to Galaxy
- Check that the datatype is fasta. If it is not, change it:
  - Click on the pencil icon for the dataset to edit its attributes
  - In the central panel, click on the Datatypes tab on the top
  - Select fasta
    - tip: you can start typing the datatype into the field to filter the dropdown menu
  - Click the Save button
Genome Assembly with Flye
We will use Flye, a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs. Flye also has a special mode for metagenome assembly. More information about the Flye assembler is available in the Flye documentation.
Hands-on: Assembly
- Flye Tool: toolshed.g2.bx.psu.edu/repos/bgruening/flye/flye/2.9+galaxy0 with the following parameters:
  - “Input reads”: the three sequencing datasets
  - “Mode”: PacBio raw
  - “Number of polishing iterations”: 1
  - “Reduced contig assembly coverage”: Disable reduced coverage for initial disjointing assembly
The tool produces four datasets: consensus, assembly graph, graphical fragment assembly and assembly info.
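For readers who want to reproduce this step outside Galaxy, the parameters above map roughly onto the Flye command line. The sketch below only builds and prints the command; it does not run Flye. The output directory name, thread count and helper name are assumptions, and leaving out `--asm-coverage` is taken here to correspond to the "Disable reduced coverage" setting.

```python
import shlex


def flye_command(reads, out_dir, iterations=1, threads=4):
    """Build a Flye command line for raw PacBio (CLR) reads.

    --asm-coverage is deliberately not passed, which is assumed to match
    the "Disable reduced coverage" choice in the Galaxy form.
    """
    return [
        "flye",
        "--pacbio-raw", *reads,            # "Mode": PacBio raw
        "--out-dir", out_dir,              # hypothetical output directory
        "--iterations", str(iterations),   # number of polishing iterations
        "--threads", str(threads),         # hypothetical thread count
    ]


reads = [
    "SRR8534473_subreads.fastq.gz",
    "SRR8534474_subreads.fastq.gz",
    "SRR8534475_subreads.fastq.gz",
]
print(shlex.join(flye_command(reads, "flye_out")))
```

Passing the list to `subprocess.run` would launch the assembly, provided Flye is installed and the read files are present locally.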
Question
What are the different output datasets?
- The first dataset (consensus) is a fasta file containing the final assembly (1461 contigs). You may notice that the number of contigs you obtained is slightly different from the one presented here. This is because the Flye assembly algorithm does not always give exactly the same results.
- The second and third datasets are assembly graph files. These graphs represent the final assembly of a genome and are based on reads and their overlap information. Tools such as Bandage can be used to visualize the assembly graph.
- The fourth dataset is a tabular file (assembly_info) containing extra information about contigs/scaffolds.
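The assembly info table can also be inspected programmatically, for example to list the longest contigs or flag repetitive ones. The following is a rough sketch that assumes the tab-separated layout Flye uses for this file, with a header line beginning with `#seq_name` and columns `length`, `cov.`, `circ.` and `repeat`. Check the header of your own file before relying on it. The contig values shown are made up.

```python
import csv
import io


def read_assembly_info(handle):
    """Parse a Flye assembly_info table into a list of dicts.

    Assumes a tab-separated file whose header contains the columns
    '#seq_name', 'length', 'cov.', 'circ.' and 'repeat'.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "name": row["#seq_name"],
            "length": int(row["length"]),
            "coverage": float(row["cov."]),
            "circular": row["circ."] == "Y",
            "repeat": row["repeat"] == "Y",
        })
    return rows


# Made-up example rows, for illustration only.
toy = io.StringIO(
    "#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n"
    "contig_1\t250000\t31\tN\tN\t1\t*\t1\n"
    "contig_2\t12000\t95\tN\tY\t3\t*\t2\n"
)
contigs = read_assembly_info(toy)
longest = max(contigs, key=lambda c: c["length"])
print(longest["name"], longest["length"])
```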
Quality assessment
Genome assembly metrics with Fasta Statistics
Fasta Statistics displays summary statistics for a fasta file. For a genome assembly, the metrics of interest include the assembly size, the number of scaffolds and the N50 value. These metrics allow us to evaluate the quality of the assembly.
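To make the N50 value concrete: it is the contig length L such that contigs of length L or longer together contain at least half of the total assembly size. The sketch below is a plain-Python illustration of that definition, not the implementation used by the Fasta Statistics tool. The function names and toy lengths are hypothetical.

```python
def fasta_lengths(handle):
    """Return the length of each sequence in a FASTA text stream."""
    lengths = []
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs >= L cover at least half of the assembly."""
    if not lengths:
        raise ValueError("no sequences given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths: total is 290, so half is 145.
# 100 alone is below 145; 100 + 80 = 180 reaches it, so N50 is 80.
print(n50([100, 80, 50, 30, 20, 10]))
```

With a local copy of the assembly, `n50(fasta_lengths(open("consensus.fasta")))` would give the value to compare with the tool output, assuming that file name.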
Hands-on: Fasta statistics on Flye assembly
- Fasta Statistics Tool: toolshed.g2.bx.psu.edu/repos/iuc/fasta_stats/fasta-stats/2.0 with the following parameters:
  - “fasta or multifasta file”: consensus (output of Flye tool)
Hands-on: Fasta statistics on the reference assembly
- Fasta Statistics Tool: toolshed.g2.bx.psu.edu/repos/iuc/fasta_stats/fasta-stats/2.0 with the following parameters:
  - “fasta or multifasta file”: Mucmuc1_AssemblyScaffolds.fasta
Question
- Compare the different metrics obtained for the Flye assembly and the reference genome.
- What can you conclude about the quality of this new assembly?
We compare the metrics of the two genome assemblies:
- The Flye assembly: 1461 contigs/scaffolds, N50 = 222 kb, length max = 897 kb, size = 48.6 Mb, 36.6% GC
- The reference genome: 456 contigs/scaffolds, N50 = 202 kb, length max = 776 kb, size = 46.1 Mb, 36.7% GC
- N50, maximum length, total size and GC content are very similar, although the Flye assembly is split into more contigs. Overall, Flye generated an assembly of a quality similar to that of the reference genome.
Genome assemblies comparison with Quast
Another way to calculate assembly metrics is to use QUAST (QUality ASsessment Tool). Quast evaluates genome assemblies by computing various metrics and can compare an assembly with a reference genome. More details are available in the Quast manual.
Hands-on: Quast on the Flye assembly
- Quast Tool: toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.0.2+galaxy3 with the following parameters:
  - “Use customized names for the input files?”: No, use dataset names
  - “Contigs/scaffolds file”: consensus (output of Flye tool)
  - “Type of assembly”: Genome
  - “Use a reference genome?”: Yes
  - “Reference genome”: Mucmuc1_AssemblyScaffolds.fasta
  - “Type of organism”: Fungus: use of GeneMark-ES for gene finding, ...
Question
What additional information is generated by Quast, compared to the Fasta Statistics outputs?
Quast allows us to compare the Flye assembly to the reference genome:
- Genome fraction (90.192 %) is the percentage of aligned bases in the reference genome.
- Duplication ratio (1.094) is the total number of aligned bases in the assembly divided by the total number of aligned bases in the reference genome.
- Largest alignment (698452) is the length of the largest continuous alignment in the assembly.
- Total aligned length (45.2 Mb) is the total number of aligned bases in the assembly.
Quast also generates some plots:
- Cumulative length plot shows the growth of contig lengths. On the x-axis, contigs are ordered from the largest to smallest. The y-axis gives the size of the x largest contigs in the assembly.
- GC content plot shows the distribution of GC content in the contigs.
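The alignment-based metrics and the cumulative length plot described above reduce to simple arithmetic once the aligned base counts are known. The sketch below illustrates those definitions only. Quast derives the aligned base counts from whole-genome alignments, which are not reproduced here, and all numbers in the example are made up.

```python
from itertools import accumulate


def genome_fraction(aligned_ref_bases, ref_size):
    """Percentage of reference bases covered by at least one alignment."""
    return 100.0 * aligned_ref_bases / ref_size


def duplication_ratio(aligned_asm_bases, aligned_ref_bases):
    """Aligned bases in the assembly divided by aligned bases in the reference."""
    return aligned_asm_bases / aligned_ref_bases


def cumulative_lengths(lengths):
    """Y-values of a cumulative length plot: contigs sorted largest first."""
    return list(accumulate(sorted(lengths, reverse=True)))


# Made-up toy numbers: a 1000-base reference with 900 bases aligned,
# covered by 990 aligned assembly bases.
print(genome_fraction(900, 1000))
print(duplication_ratio(990, 900))
print(cumulative_lengths([50, 100, 30]))
```

A duplication ratio above 1 means that some reference regions are covered more than once by the assembly, which is one way to read the 1.094 value reported above.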
Genome assembly assessment with BUSCO
BUSCO (Benchmarking Universal Single-Copy Orthologs) provides a quantitative assessment of genome assembly completeness based on evolutionarily informed expectations of gene content. Details about this tool are available on the BUSCO website.
Hands-on: BUSCO on Flye assembly
First on the Flye assembly:
- Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.2.2+galaxy0 with the following parameters:
  - “Sequences to analyse”: consensus (output of Flye tool)
  - “Auto-detect or select lineage”: Select lineage
  - “Lineage”: Mucorales
Then, on the reference assembly:
- Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.2.2+galaxy0 with the following parameters:
  - “Sequences to analyse”: Mucmuc1_AssemblyScaffolds.fasta
  - “Auto-detect or select lineage”: Select lineage
  - “Lineage”: Mucorales
Question
Compare the number of BUSCO genes identified in the Flye assembly and the reference genome. What do you observe?
The short summary generated by BUSCO indicates that the reference genome contains:
- 2327 complete BUSCOs (of which 2302 are single-copy and 25 are duplicated),
- 13 fragmented BUSCOs,
- 109 missing BUSCOs.
The short summary generated by BUSCO indicates that the Flye assembly contains:
- 2348 complete BUSCOs (of which 2310 are single-copy and 38 are duplicated),
- 8 fragmented BUSCOs,
- 93 missing BUSCOs.
BUSCO analysis confirms that these two assemblies are of similar quality, with similar numbers of complete, fragmented and missing BUSCO genes.
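BUSCO results are often easier to compare as percentages of the total number of genes searched. The sketch below converts the counts quoted above into percentages. The helper name is hypothetical, and the total is simply the sum of the four categories.

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of the total searched."""
    total = single + duplicated + fragmented + missing
    complete = single + duplicated
    return {
        "total": total,
        "complete": 100.0 * complete / total,
        "fragmented": 100.0 * fragmented / total,
        "missing": 100.0 * missing / total,
    }


# Counts taken from the BUSCO short summaries quoted in this tutorial.
reference = busco_percentages(single=2302, duplicated=25, fragmented=13, missing=109)
flye = busco_percentages(single=2310, duplicated=38, fragmented=8, missing=93)

for name, result in (("reference", reference), ("Flye", flye)):
    print(f"{name}: {result['complete']:.1f}% complete of {result['total']} BUSCOs")
```

Both assemblies come out at roughly 95 to 96% complete, which agrees with the conclusion that they are of similar quality.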
Conclusion
This pipeline shows how to generate and evaluate a genome assembly from PacBio long-read data. Once you are satisfied with your genome sequence, you might want to annotate it: have a look at the RepeatMasker and Funannotate tutorials to learn how to do it!
Key points
PacBio data make it possible to produce good-quality genome assemblies
Quast and BUSCO make it easy to compare the quality of assemblies
Citing this Tutorial
- Anthony Bretaudeau, Alexandre Cormier, Erwan Corre, Laura Leroi, Stéphanie Robin, Erasmus+ Programme, 2022 Genome assembly using PacBio data (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/assembly/tutorials/flye-assembly/tutorial.html Online; accessed TODAY
- Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
Do you want to extend your knowledge? Follow one of our recommended follow-up trainings:
- Genome Annotation
  - Masking repeats with RepeatMasker (hands-on tutorial)