
Genome assembly quality control.

Contributors

Questions

Last modification: Sep 20, 2022

Genome assembly quality control, or “the 3C”



.image-100[ The 3C for genome assembly quality control ]


.image-30[ Illustration of genome assembly contiguity ]


Contiguity

.pull-left[

Desire

.pull-right[ Metrics:

]

Sequences, i.e. a set of contigs and/or scaffolds


N50 & L50

.left[ N50: given a set of sequences of varying lengths, the N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least 50% of the assembly.

L50: given a set of sequences of varying lengths, the L50 is the smallest number of sequences whose summed length covers at least 50% of the assembly. ]

N50 describes a sequence length whereas L50 describes a number of sequences.
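The two definitions can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular tool; the function name `n50_l50` and the contig lengths are hypothetical.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    half = sum(ordered) / 2                  # 50% of the assembly size
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # `length` is the N50, `count` is the L50
            return length, count
    raise ValueError("lengths must not be empty")


# Hypothetical contig lengths (arbitrary units), total = 290
contigs = [80, 70, 50, 40, 30, 20]
n50, l50 = n50_l50(contigs)
print(f"N50 = {n50}, L50 = {l50}")
```

Here the two longest contigs (80 + 70 = 150) already cover more than half of the 290 units, so N50 is a length (70) and L50 is a count (2).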


N50 & L50

.pull-left[ Example:

.image-100[ Schematic explanation of N50 ]

N50 = 8 and L50 = 4

.footnote[Alhakami, H., Mirebrahim, H., & Lonardi, S. (2017). A comparative evaluation of genome assembly reconciliation tools. Genome biology, 18(1), 1-14.]


N50 & L50

However, these statistics may not reflect some assembly improvements. If we connect two sequences that are both longer than N50, or two sequences that are both shorter than N50 (and whose merged length stays below N50), N50 does not change. N50 typically improves only when we connect a sequence shorter than N50 with a sequence longer than N50.

.image-40[ Schematic explanation of N50 and its limits ]
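This limitation can be checked on a toy example. The sketch below uses hypothetical contig lengths and two illustrative helpers (`n50` and `merge`), and compares N50 before and after three kinds of joins.

```python
def n50(lengths):
    """Length of the sequence at which the cumulative sum reaches 50%."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must not be empty")


def merge(lengths, a, b):
    """Return a new list where one sequence of length a and one of length b are joined."""
    remaining = list(lengths)
    remaining.remove(a)
    remaining.remove(b)
    return remaining + [a + b]


# Hypothetical assembly, total = 450, N50 = 80
contigs = [100, 90, 80, 60, 40, 30, 20, 10, 10, 10]

both_longer = merge(contigs, 100, 90)   # two sequences longer than N50
both_shorter = merge(contigs, 20, 10)   # two sequences shorter than N50
mixed = merge(contigs, 100, 60)         # one longer, one shorter

for label, asm in [("original", contigs), ("both longer", both_longer),
                   ("both shorter", both_shorter), ("mixed", mixed)]:
    print(f"{label:>12}: N50 = {n50(asm)}")
```

Only the mixed join raises N50 in this example (from 80 to 90), even though all three joins reduce the number of sequences.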


Nx curve

« 50 » is a single point on the Nx curve. The entire Nx curve in fact gives us a better sense of contiguity.

.pull-left[ Example of Nx graph for Drosophila assemblies ]

.pull-right[ Example of cumulative sequence length graph for Drosophila assemblies ]
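As a rough sketch of how such a curve is obtained, the same cumulative-sum idea used for N50 can be repeated for every threshold x. The function name `nx_curve` and the contig lengths are hypothetical.

```python
def nx_curve(lengths, thresholds=range(10, 101, 10)):
    """Return {x: Nx} for each percentage x in thresholds."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    curve = {}
    for x in thresholds:
        running = 0
        for length in ordered:
            running += length
            # integer comparison avoids floating-point rounding at x = 100
            if running * 100 >= total * x:
                curve[x] = length
                break
    return curve


# Hypothetical contig lengths, total = 450
contigs = [100, 90, 80, 60, 40, 30, 20, 10, 10, 10]
for x, nx in nx_curve(contigs).items():
    print(f"N{x} = {nx}")
```

Plotting Nx against x for several assemblies gives the kind of comparison shown in the Nx graph; N50 is simply the value at x = 50.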


QUAST - A tool to evaluate genome assemblies

It also includes:


.image-30[ Illustration of genome assembly completeness ]


Types of completeness


Assembly size vs estimated

Proportion of the original genome represented by the assembly:

.image-100[ Formula to estimate assembly size completeness ]

“*” The genome size is an estimate, so this measure is not perfect. See “An introduction to get started in genome assembly and annotation” for methods to determine the genome size.
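Assuming the proportion is simply the assembly size divided by the estimated genome size, it could be computed as below. The function name and both sizes are hypothetical.

```python
def size_completeness(assembly_size, estimated_genome_size):
    """Percentage of the estimated genome size represented by the assembly."""
    if estimated_genome_size <= 0:
        raise ValueError("estimated genome size must be positive")
    return 100 * assembly_size / estimated_genome_size


# Hypothetical values: an 11.8 Mb assembly for a genome estimated at 12.1 Mb
pct = size_completeness(11_800_000, 12_100_000)
print(f"{pct:.1f}% of the estimated genome size")
```

A value above 100% is possible, for instance when the genome size is underestimated or when the assembly contains duplicated content.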


Known vs. unknown nucleotides

Proportion of A, T, G, C versus N (unknown nucleotide). Ideally, an assembly contains no unknown nucleotides (N).
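A minimal sketch of this check, assuming the assembly sequences are already loaded as plain strings (the function name and the sequences are hypothetical):

```python
from collections import Counter


def n_proportion(sequences):
    """Fraction of positions that are N across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())  # count soft-masked (lowercase) bases too
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no nucleotides found")
    return counts["N"] / total


# Hypothetical scaffolds: 2 unknown nucleotides out of 20
scaffolds = ["ACGTNNACGT", "acgtacgtac"]
print(f"N proportion: {n_proportion(scaffolds):.1%}")
```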


“Core” genes

.pull-left[ Quantitative assessment of genome assembly based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs. .image-70[ Formula to estimate assembly completeness for core genes ] ] .pull-right[ .image-70[ Example of BUSCO plot for Nosema species (Microsporidia) ]]

.footnote[Tips: Reference databases are constructed using known genomes. Species with few or no closely related genomes available can score very poorly.]
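BUSCO reports its result as a one-line summary of complete (C), single-copy (S), duplicated (D), fragmented (F) and missing (M) genes. The sketch below parses such a line; the regular expression assumes the usual `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` notation, and the example values are hypothetical.

```python
import re

# Assumed layout of the BUSCO one-line summary
SUMMARY = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return percentages for C, S, D, F, M and the number of genes n."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    result = {key: float(value) for key, value in match.groupdict().items()}
    result["n"] = int(match.group("n"))  # n is a count, not a percentage
    return result


# Hypothetical result against a 255-gene lineage set
scores = parse_busco_summary("C:96.5%[S:95.7%,D:0.8%],F:2.0%,M:1.5%,n:255")
print(f"Complete: {scores['C']}% of {scores['n']} genes")
```

Complete, fragmented and missing percentages should add up to about 100%, and a high duplicated fraction can hint at uncollapsed haplotypes.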


Core genes evaluation software

BUSCO: Assessing genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs

.pull-left[ .image-70[ Explanation of orthology classification for creation of BUSCO sets - Universality ]]

.pull-right[ .image-60[ Explanation of orthology classification for creation of BUSCO sets - Duplicability ]]

Eukaryota: 255 single-copy genes from 70 species; Arthropoda: 1013 single-copy genes from 90 species; Fungi: 758 single-copy genes from 549 species

.footnote[Waterhouse, R. M., Zdobnov, E. M. & Kriventseva, E. V. Correlating Traits of Gene Retention, Sequence Divergence, Duplicability and Essentiality in Vertebrates, Arthropods, and Fungi. Genome Biol Evol 3, 75–86 (2011).]


BUSCO limitations

A BUSCO score is only as good as its reference database.

Example with BUSCO Eukaryotic set: .pull-left[ .image-70[ Limitation of BUSCO orthoDB for eukaOrthoDB - part A ]] .pull-right[ .image-70[ Limitation of BUSCO orthoDB for eukaOrthoDB - part B ]]

.footnote[Saary, P., Mitchell, A. L. & Finn, R. D. Estimating the quality of eukaryotic genomes recovered from metagenomic analysis with EukCC. Genome Biol 21, 244 (2020).]


BUSCO limitations

Aligning the transcriptome of a closely related species, or a de novo RNA-Seq assembly of the same species, can be another proxy to assess the completeness of the assembly and address BUSCO limitations.


Assembly kmer content

The aim is to check the coherence of the assembly against the content of the reads that were used to produce it. In other words, how many k-mers of each frequency in the read spectrum ended up absent from the assembly, included once, included twice, etc.
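The idea can be illustrated with a toy tally: for every distinct k-mer in the reads, record its frequency in the reads and its copy number in the assembly. This is a simplified sketch (forward strand only, no canonical k-mers), not how Merqury or similar tools are implemented; all names and sequences are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, a simplification)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def spectra_cn(reads, assembly, k):
    """Count distinct read k-mers per (read frequency, assembly copy number)."""
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    asm_counts = Counter(km for contig in assembly for km in kmers(contig, k))
    table = Counter()
    for kmer, freq in read_counts.items():
        table[(freq, asm_counts[kmer])] += 1  # copy number 0 = absent
    return table


# Hypothetical toy data
reads = ["ACGTAC", "CGTACG", "TACGGA"]
assembly = ["ACGTACG"]
for (freq, copies), n in sorted(spectra_cn(reads, assembly, k=3).items()):
    print(f"read frequency {freq}, assembly copies {copies}: {n} k-mers")
```

In a spectra-cn plot, each assembly copy number becomes one coloured curve over the read frequency axis; k-mers with copy number 0 are the read content missing from the assembly.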


K-mer spectrum plots

.pull-left[ .image-70[ How to read kmer spectrum of reads ]] .pull-right[ .image-70[ How to read kmer spectrum of an assembly using reads ]]

.footnote[Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245 (2020).]


Assembly kmer content - Homozygous genomes

.image-70[ Merqury spectra-cn plot for S. cerevisiae S288C assembly ]

Good kmer representation of reads in the assembly


Assembly kmer content - Homozygous genomes

.image-70[ Merqury spectra-cn plot for S. cerevisiae S288C assembly in case of missing contigs ]

Bad kmer representation of reads in the assembly


Assembly kmer content - Heterozygous genomes


.image-70[ Merqury spectra-cn plot for a highly heterozygous assembly in which haplotigs were purged or collapsed. ]

Good kmer representation of reads in the assembly

The lost content (the black peak) represents the half of the heterozygous content that is lost when bubbles are collapsed.


Assembly kmer content - Heterozygous genomes

.image-80[ Merqury spectra-cn plot for second haplotype of C. gigas. ]

Bad kmer representation of reads in the assembly


Reads mapping and assembly coverage


.image-60[ Illustration of genome assembly correctness ]


.image-30[ Illustration of genome assembly correctness ]


Mistakes in the assembly

**Proportion of the assembly that is free from mistakes**


SNP / indels errors

.image-80[ Illustration of SNP errors in a genome assembly. ]


Other mis-assemblies

.image-80[ Illustration of rearrangements inversions assembly errors. ]


Other mis-assemblies

.image-80[ Illustration of collapsed and expanded repeats assembly errors. ]


Switch and hamming errors (phased assemblies)

.pull-left[ .image-70[ Illustration of switch and hamming errors in phased genome assembly. ] In red, heterozygous loci from the second haplotype. In blue, heterozygous loci from the first haplotype. ]

.pull-right[

.footnote[http://lh3.github.io/2021/04/17/concepts-in-phased-assemblies]
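As a hedged sketch, both error types can be counted on a single phased contig once each heterozygous site is labelled with the true haplotype its allele comes from. The encoding (a string of `1` and `2`) and the function name are assumptions made for this illustration: a switch is counted whenever the label changes between adjacent sites, and the hamming count is the number of sites that disagree with the majority haplotype.

```python
def phasing_errors(origin):
    """Return (switch count, hamming count) for one phased contig.

    origin: string of '1'/'2', the true haplotype of the allele
    placed at each heterozygous site (assumed encoding).
    """
    switches = sum(1 for a, b in zip(origin, origin[1:]) if a != b)
    hamming = min(origin.count("1"), origin.count("2"))
    return switches, hamming


# A long block from the wrong haplotype: few switches, many wrong sites
print(phasing_errors("1111222211"))
# A single flipped site: same number of switches, one wrong site
print(phasing_errors("1111121111"))
```

Both contigs contain two switches, but the first has four wrongly phased sites and the second only one, which is why the two metrics are reported separately.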


Evaluation against reference genome

.image-30[ Example of a dot plot between 2 genomes. ]


.pull-left[ .left[ Dot plots are widely used to quickly compare two sets of sequences. They provide a concise overview of:

</br>

.image-50[ Example of a dot plot between 2 genomes. ] ]

.pull-right[ .left[ A non-exhaustive list of tools for making dot plots:

</br> </br> </br>

.image-50[ Interpret Dot Plots here based on great explanation by Michael Schatz. ] ]
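The principle behind a dot plot can be sketched with exact k-mer matches: every k-mer shared by the two sequences gives one dot at (position in query, position in target). Real tools rely on alignments or minimizers and also handle the reverse strand; the function name and sequences below are hypothetical.

```python
def dot_plot_points(query, target, k):
    """Return (query position, target position) for every shared k-mer."""
    index = {}
    for j in range(len(target) - k + 1):
        index.setdefault(target[j:j + k], []).append(j)
    points = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            points.append((i, j))
    return points


# Identical toy sequences give an unbroken main diagonal
print(dot_plot_points("ACGTTGCA", "ACGTTGCA", k=4))
```

Dots lining up on the main diagonal indicate collinear sequence; gaps, shifts or off-diagonal runs are the patterns to look for when comparing an assembly with a reference.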


Tips


Key Points

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! This material is licensed under the Creative Commons Attribution 4.0 International License.