Genome assembly quality control.

Updated: Sep 20, 2022

1 / 37

Requirements

Before diving into this slide deck, we recommend that you have a look at:

2 / 37

Questions

  • Assembly: Is my genome assembly ready for scaffolding?

  • Annotation: Is my genome assembly ready for structural annotation?

3 / 37

Genome assembly quality control, or "the 3C"



The 3C for genome assembly quality control: contiguity, completeness and correctness

4 / 37

Illustration of genome assembly contiguity

5 / 37

Contiguity

Desired:

  • Fewer sequences
  • Longer sequences

Metrics:

  • Number of sequences
  • Average sequence length
  • Median sequence length
  • Minimum and maximum sequence length
  • N50, NG50, L50, LG50
  • GC content
  • Number and proportion of bases that are N

Here, "sequences" means a set of contigs and/or scaffolds.

6 / 37

N50 & L50

N50: given a set of sequences of varying lengths, the N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least 50% of the assembly.

L50: given the same set of sequences, the L50 is the smallest number of sequences whose summed length covers at least 50% of the assembly.

N50 describes a sequence length whereas L50 describes a number of sequences.

7 / 37

N50 & L50

Example:

  • Genome size = 100
  • Sequences sorted by decreasing size: L = (25, 10, 10, 8, 7, 7, 6, 5, 5, 5, 5, 3, 2, 2), total length = 100
  • 50% of the total length is contained within sequences of at least 8 bp: 25 + 10 + 10 + 8 = 53 ≥ 50

Schematic explanation of N50

N50 = 8 and L50 = 4

Alhakami, H., Mirebrahim, H., & Lonardi, S. (2017). A comparative evaluation of genome assembly reconciliation tools. Genome biology, 18(1), 1-14.
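
The example can be checked with a few lines of Python. This is a minimal sketch, not the implementation used by any particular tool; the function name `n50_l50` is chosen here for illustration.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Stop at the first sequence that brings the running sum to >= 50%
        if 2 * running >= total:
            return length, count
    raise ValueError("lengths must not be empty")


lengths = [25, 10, 10, 8, 7, 7, 6, 5, 5, 5, 5, 3, 2, 2]
n50, l50 = n50_l50(lengths)
print(n50, l50)  # 8 4
```

Replacing `total` with an estimated genome size would give NG50 and LG50 instead, under the same logic.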

8 / 37

N50 & L50

However, these statistics may not reflect some assembly improvements. If we connect two sequences that are both longer than N50, N50 does not change, and the same usually holds when we connect two sequences that are both shorter than N50. N50 typically improves only when we connect a sequence shorter than N50 to a sequence longer than N50.

Schematic explanation of N50 and its limits
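
To make this concrete, the sketch below reuses the lengths from the N50 example and merges different pairs of sequences. The helper names `n50` and `join` are illustrative only, and the chosen pairs are just one possible demonstration.

```python
def n50(lengths):
    """Length of the sequence at which the running sum reaches 50% of the total."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must not be empty")


def join(lengths, a, b):
    """Return a new list in which one sequence of length a and one of length b are merged."""
    merged = list(lengths)
    merged.remove(a)
    merged.remove(b)
    merged.append(a + b)
    return merged


base = [25, 10, 10, 8, 7, 7, 6, 5, 5, 5, 5, 3, 2, 2]
print(n50(base))                 # 8
print(n50(join(base, 25, 10)))   # both longer than N50: still 8
print(n50(join(base, 2, 2)))     # both shorter than N50: still 8
print(n50(join(base, 5, 10)))    # one shorter, one longer: 10
```

The first two joins are real improvements in contiguity, yet N50 does not register them.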

9 / 37

Nx curve

N50 is a single point (x = 50) on the Nx curve. The entire Nx curve gives a better sense of contiguity.

Example of Nx graph for Drosophila assemblies

Example of cumulative sequence length graph for Drosophila assemblies
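
An Nx curve is simply the N50 calculation repeated for other thresholds. A minimal sketch, assuming the same list of lengths as in the N50 example (the function name `nx` is hypothetical):

```python
def nx(lengths, x):
    """Length of the shortest sequence among the longest ones covering x% of the total."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding issues
        if 100 * running >= x * total:
            return length
    raise ValueError("lengths must not be empty and x must be <= 100")


lengths = [25, 10, 10, 8, 7, 7, 6, 5, 5, 5, 5, 3, 2, 2]
curve = {x: nx(lengths, x) for x in range(10, 101, 10)}
for x, value in curve.items():
    print(f"N{x} = {value}")
```

Plotting `curve` for several assemblies on the same axes gives the kind of comparison shown in the Nx graph.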

10 / 37

QUAST - A tool to evaluate genome assemblies

  • QUAST: for genome assemblies.
  • MetaQUAST: for metagenomic datasets.
  • QUAST-LG: for large genomes (e.g., mammalian genomes).
  • rnaQUAST: for RNA-Seq (transcriptome assemblies).
  • Icarus: an interactive visualizer for these tools.

It also includes:

  • Read mapping (mis-assembly evaluation).
  • Kmer representation (KMC).
  • Structural prediction modules (GeneMark, GlimmerHMM, Barrnap and BUSCO).
  • For metagenomic datasets: MetaGeneMark, Krona tools, BLAST, and the SILVA 16S rRNA database.
11 / 37

Illustration of genome assembly completeness

12 / 37

Types of completeness

  • Assembly size
  • Known vs. unknown nucleotides
  • "Core" genes
  • Assembly kmer content
  • Reads mapping and assembly coverage
13 / 37

Assembly size vs. estimated genome size

Proportion of the original genome represented by the assembly:

Formula to estimate assembly size completeness: completeness (%) = 100 × assembly size / estimated genome size*

* This is an estimate, so it is not exact. See "An introduction to get started in genome assembly and annotation" for methods to determine the genome size.

14 / 37

Known vs. unknown nucleotides

Proportion of A, T, G, C versus N (unknown nucleotide). Ideally, an assembly contains no unknown nucleotides (N).
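
A minimal sketch of how this proportion could be computed from assembly sequences held in memory. The function name `base_composition` and the toy sequences are illustrative; a real assembly would be read from a FASTA file first.

```python
from collections import Counter


def base_composition(sequences):
    """Count unknown bases (N) and GC content over a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())  # treat soft-masked (lowercase) bases like uppercase
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no bases found")
    acgt = sum(counts[base] for base in "ACGT")
    return {
        "n_count": counts["N"],
        "n_fraction": counts["N"] / total,
        # GC content is computed over known bases only
        "gc_fraction": (counts["G"] + counts["C"]) / acgt if acgt else 0.0,
    }


stats = base_composition(["ACGTNNAC", "GGCC"])
print(stats)
```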

15 / 37

"Core" genes

Quantitative assessment of genome assembly based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs.

Formula to estimate assembly completeness for core genes

Example of BUSCO plot for Nosema species (Microsporidia)

Tip: Reference databases are constructed using known genomes. Species with few or no closely related genomes available can score very poorly.
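
As a rough illustration, completeness can be expressed as the share of expected core genes that are found in the assembly. The sketch below assumes the usual BUSCO categories (complete single-copy, complete duplicated, fragmented, missing); the function name and the counts are hypothetical and do not come from a real run.

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Turn BUSCO category counts into percentages of the expected gene set."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("at least one expected gene is required")

    def pct(count):
        return round(100 * count / total, 1)

    return {
        "complete": pct(single + duplicated),  # complete = single-copy + duplicated
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
        "n": total,
    }


# Hypothetical counts for a 255-gene set
result = busco_percentages(single=240, duplicated=5, fragmented=4, missing=6)
print(result)
```

A high duplicated percentage is worth noting separately, since it can point to retained haplotigs rather than to a better assembly.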

16 / 37

Core genes evaluation software

BUSCO: Assessing genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs

Explanation of orthology classification for creation of BUSCO sets - Universality

Explanation of orthology classification for creation of BUSCO sets - Duplicability

Eukaryota: 255 single copy from 70 species; Arthropoda: 1013 single copy from 90 species; Fungi: 758 single copy from 549 species

Waterhouse, R. M., Zdobnov, E. M. & Kriventseva, E. V. Correlating Traits of Gene Retention, Sequence Divergence, Duplicability and Essentiality in Vertebrates, Arthropods, and Fungi. Genome Biol Evol 3, 75–86 (2011).

17 / 37

BUSCO limitations

A BUSCO score is only as good as its reference database.

Example with BUSCO Eukaryotic set:

Limitation of BUSCO orthoDB for eukaOrthoDB - part A

Limitation of BUSCO orthoDB for eukaOrthoDB - part B

Saary, P., Mitchell, A. L. & Finn, R. D. Estimating the quality of eukaryotic genomes recovered from metagenomic analysis with EukCC. Genome Biol 21, 244 (2020).

18 / 37

BUSCO limitations

Aligning the transcriptome of a closely related species, or a de novo RNA-Seq assembly of the same species, can be another proxy to assess the completeness of the assembly and address BUSCO limitations.

19 / 37

Assembly kmer content

The aim is to check the coherence of the assembly against the content of the reads that were used to produce it: for each frequency in the read kmer spectrum, how many kmers are missing from the assembly, included once, included twice, etc.

  • Tools: Merqury or KAT
  • The histogram is built from the kmer content of the reads.
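
The idea can be sketched in plain Python on toy data. This is a simplified illustration, not how Merqury or KAT work internally: it ignores reverse complements and holds every kmer in memory, and the names `kmers` and `spectra_cn` are chosen here for illustration.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def spectra_cn(reads, assembly, k):
    """Map assembly copy number -> {read multiplicity: number of distinct kmers}."""
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    asm_counts = Counter(km for seq in assembly for km in kmers(seq, k))
    table = defaultdict(Counter)
    for km, multiplicity in read_counts.items():
        copy_number = asm_counts[km]  # 0 if the kmer is absent from the assembly
        table[copy_number][multiplicity] += 1
    return table


reads = ["ACGTAC", "ACGTAC", "CGTACG"]
assembly = ["ACGTA"]
table = spectra_cn(reads, assembly, k=3)
for copy_number in sorted(table):
    print(copy_number, dict(table[copy_number]))
```

Each copy number corresponds to one coloured curve of the spectrum plot; copy number 0 holds the read kmers missing from the assembly.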
20 / 37

K-mer spectrum plots

How to read kmer spectrum of reads

How to read kmer spectrum of an assembly using reads

Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245 (2020).

21 / 37

Assembly kmer content - Homozygous genomes

Merqury spectra-cn plot for S. cerevisiae S288C assembly

Good kmer representation of reads in the assembly

22 / 37

Assembly kmer content - Homozygous genomes

Merqury spectra-cn plot for S. cerevisiae S288C assembly in case of missing contigs

Bad kmer representation of reads in the assembly

23 / 37

Assembly kmer content - Heterozygous genomes


Merqury spectra-cn plot for a highly heterozygous assembly in which haplotigs were purged or collapsed.

Good kmer representation of reads in the assembly

The lost content (the black peak) represents the half of the heterozygous content that is lost when bubbles are collapsed.

24 / 37

Assembly kmer content - Heterozygous genomes

Merqury spectra-cn plot for second haplotype of C. gigas.

Bad kmer representation of reads in the assembly

25 / 37

Reads mapping and assembly coverage


  • Proportion of mapped vs. unmapped reads, i.e. an indication of the parts missing from the assembly
  • Coverage in terms of redundancy (A): the number of reads that align to, or "cover", a given position of a known reference.
  • Coverage in terms of the percentage of a reference covered by reads (B): e.g. if 90% of a reference is covered by reads (and 10% is not), coverage is 90%.
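
The two meanings of coverage can be told apart with a small sketch. It assumes alignments are given as 0-based, half-open `(start, end)` intervals on a single reference sequence; the function name and the toy intervals are illustrative only.

```python
def coverage_stats(reference_length, alignments):
    """Return (mean depth, breadth in %) for a list of (start, end) alignments."""
    depth = [0] * reference_length
    for start, end in alignments:
        # Clip each alignment to the reference boundaries
        for pos in range(max(start, 0), min(end, reference_length)):
            depth[pos] += 1
    mean_depth = sum(depth) / reference_length                       # coverage (A)
    breadth = 100 * sum(1 for d in depth if d > 0) / reference_length  # coverage (B)
    return mean_depth, breadth


mean_depth, breadth = coverage_stats(10, [(0, 4), (2, 6), (3, 9)])
print(mean_depth, breadth)  # 1.4 90.0
```

For real data, a per-base array like this is only practical for small references; dedicated tools compute the same quantities from alignment files.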

Illustration of genome assembly correctness

26 / 37

Illustration of genome assembly correctness

27 / 37

Mistakes in the assembly

Proportion of the assembly that is free from mistakes

  • Indels / SNPs
  • Mis-joins
  • Repeat compressions
  • Unnecessary duplications
  • Rearrangements

    → Align the reads back to the assembly and check for inconsistencies
28 / 37

SNP / indel errors

Illustration of SNP errors in a genome assembly.

29 / 37

Other mis-assemblies

Illustration of rearrangement and inversion assembly errors.

30 / 37

Other mis-assemblies

Illustration of collapsed and expanded repeat assembly errors.

31 / 37

Switch and hamming errors (phased assemblies)

Illustration of switch and hamming errors in phased genome assembly.

Red: heterozygous loci from the second haplotype. Blue: heterozygous loci from the first haplotype.

  • Switch error: a change from one parental allele to another parental allele on a contig. This terminology has been used for measuring reference-based phasing accuracy for two decades. A haplotig is supposed to have no switch errors.
  • Yak hamming error: an allele not on the most supported haplotype of a contig. Its main purpose is to test how close a contig is to a haplotig. The yak definition is not widely accepted. The hamming error rate is arguably less important in practice.

http://lh3.github.io/2021/04/17/concepts-in-phased-assemblies
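
Both error types can be counted on a toy contig. The sketch assumes each heterozygous site has already been assigned a haplotype label ("A" or "B"); it is a simplified reading of the two definitions above, not yak's actual implementation.

```python
from collections import Counter


def switch_and_hamming(labels):
    """Count switch and hamming errors along a contig.

    labels: haplotype label of each heterozygous site, in contig order.
    """
    if not labels:
        return 0, 0
    # A switch is a change of haplotype between two consecutive sites
    switches = sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)
    # A hamming error is a site that is not on the most supported haplotype
    hamming = len(labels) - max(Counter(labels).values())
    return switches, hamming


print(switch_and_hamming("AAAABBBB"))  # (1, 4)
print(switch_and_hamming("AAABAAAA"))  # (2, 1)
```

The two contigs show why the metrics differ: a single switch in the middle of a contig costs many hamming errors, while an isolated wrong allele costs two switches but only one hamming error.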

32 / 37

Evaluation against reference genome

Example of a dot plot between 2 genomes.

33 / 37

Dot plots are widely used to quickly compare 2 sequence sets. They provide a synthetic overview of:

  • Similarity
  • Specificity
  • Repetitions, breaks and inversions
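
The principle behind a dot plot can be sketched with exact kmer matches: every shared word between the two sequences becomes a dot. This is an illustration only, with hypothetical function names; the tools listed on this slide rely on far more efficient matching or alignment methods.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def dot_plot(query, target, k):
    """Return (forward, reverse) lists of (query_pos, target_pos) kmer matches."""
    index = {}
    for j in range(len(target) - k + 1):
        index.setdefault(target[j:j + k], []).append(j)
    forward, reverse = [], []
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        forward += [(i, j) for j in index.get(word, [])]
        # Matches to the reverse complement reveal inverted segments
        reverse += [(i, j) for j in index.get(revcomp(word), [])]
    return forward, reverse


query = "ACGGTCAT"
same_fwd, same_rev = dot_plot(query, query, k=3)
inv_fwd, inv_rev = dot_plot(query, revcomp(query), k=3)
print(same_fwd)  # dots on the main diagonal
print(inv_rev)   # dots on the anti-diagonal
```

Identical sequences give an unbroken main diagonal, an inversion appears as a line running the other way, and repeats show up as extra off-diagonal dots.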


Example of a dot plot between 2 genomes.

A non-exhaustive list of tools for making dot plots:

  • MUMmer dotplot
  • Chromeister
  • D-Genies (not yet available in Galaxy)




To learn how to interpret dot plots, see the explanation by Michael Schatz.

34 / 37

Tips

  • The quality of an assembly is often validated by using other data from the same individual or from other individuals (RNA-Seq alignment, Hi-C alignment, DNA-Seq alignment, ...).

  • The positions of the telomeric repeats in the chromosome assemblies are also useful for evaluating correctness.

  • The identification of organelles (mitochondria, chloroplast, ...) can also inform us about the quality of the assembly in terms of completeness. However, the structure of the organelles may lead the assembler to treat them as repeats and discard them.

  • In the case of diploid organisms, one of the classical problems of assemblies is the retention of both haplotypes. This yields atypical BUSCO / kmer / assembly size metrics, which can be corrected by removing ("purging") the haplotigs.

35 / 37

Key points

  • We learned that it is essential to control the quality of an assembly

  • We learned that there are several quality criteria and tools to enable this assessment

  • Certain quality criteria are expected at the time of publication

36 / 37
