Genome annotation with Funannotate
Author(s) | Anthony Bretaudeau |
Editor(s) | Alexandre Cormier Laura Leroi Erwan Corre Stéphanie Robin |
OverviewQuestions:Objectives:
How to annotate an eukaryotic genome with Funannotate?
How to perform functional annotation?
How to evaluate and visualize annotated genomic features?
How to format the annotation for submission at NCBI?
Requirements:
Load genome into Galaxy
Annotate genome with Funannotate
Perform functional annotation using EggNOG-mapper and InterProScan
Evaluate annotation quality with BUSCO
View annotations in JBrowse
- Introduction to Galaxy Analyses
- Genome Annotation
- Masking repeats with RepeatMasker: tutorial hands-on
Time estimation: 8 hoursLevel: Intermediate IntermediateSupporting Materials:
- Topic Overview slides
- Datasets
- Workflows
- FAQs
- video Recordings
- instances Available on these Galaxies
Last modification: Oct 18, 2022
Introduction
Genome annotation of eukaryotes is a little more complicated than for prokaryotes: eukaryotic genomes are usually larger than prokaryotes, with more genes. The sequences determining the beginning and the end of a gene are generally less conserved than the prokaryotic ones. Many genes also contain introns, and the limits of these introns (acceptor and donor sites) are not highly conserved.
In this tutorial we will use a software tool called Funannotate (Palmer and Stajich 2020) to annotate the genome sequence of a small eukaryote: Mucor mucedo (a fungal plant pathogen).
As explained on Funannotate’s website, “it was originally written to annotate fungal genomes (small eukaryotes ~ 30 Mb genomes), but has evolved over time to accomodate larger genomes”. As other annotation tools like Maker (Campbell et al. 2014) or Braker (Brůna et al. 2021), it works by aligning as many evidences as possible along the genome sequence, and then reconciliating all these signals to determine probable gene structures.
The evidences can be transcript or protein sequences from the same (or closely related) organism. These sequences can come from public databases (like NR or GenBank) or from your own experimental data (transcriptome assembly from an RNASeq experiment for example). Funannotate is also able to take into account repeated elements.
Funannotate uses ab-initio predictors (Augustus, SNAP, glimmerHMM, CodingQuarry and GeneMark-ES/ET (optional due to licensing)) to improve its predictions: these software tools are able to make gene structure predictions by analysing only the genome sequence with a statistical model.
While for Maker you need to perform training steps for the ab-initio predictors, Funannotate is able to take care of that for you, which makes it much easier to use.
In this tutorial, you will learn how to perform a structural genome annotation, and how to evaluate its quality. Then, you will learn how to run functional annotation, using EggNOG-mapper and InterProScan to automatically assign names and functions to the annotated genes. And you will also learn how Funannotate can prepare files ready for submission of your annotation to the NCBI. Finally, you will learn how to use the JBrowse genome browser to visualise your new annotation.
AgendaIn this tutorial, we will cover:
Data upload
To annotate our genome using Funannotate, we will use the following files:
- The genome sequence in fasta format. For best results, the sequence should be soft-masked beforehand. You can learn how to do it by following the RepeatMasker tutorial. For this tutorial, we will try to annotate the genome assembled in the Flye assembly tutorial.
- Some RNASeq data in fastq format. We will align them on the genome, and Funannotate will use it as evidence to annotate genes.
- A set of protein sequences, like UniProt/SwissProt. It is important to have good quality, curated sequences here, that’s why, by default, Funannotate will use the UniProt/SwissProt databank. In this tutorial, we have prepared a subset of this databank to speed up computing, but you should use UniProt/SwissProt for real life analysis.
Funannotate will take into account the position of mapped RNASeq reads, and the alignment of protein sequences on the genome sequence to determine gene positions.
Hands-on: Data upload
Create a new history for this tutorial
Click the new-history icon at the top of the history panel.
If the new-history is missing:
- Click on the galaxy-gear icon (History options) on the top of the history panel
- Select the option Create New from the menu
Import the files from Zenodo or from the shared data library (
GTN - Material
->genome-annotation
->Genome annotation with Funannotate
):https://zenodo.org/api/files/8c2cc766-2b68-45bd-a2d3-391acf9bdb1b/genome_masked.fasta https://zenodo.org/api/files/8c2cc766-2b68-45bd-a2d3-391acf9bdb1b/rnaseq_R1.fq.gz https://zenodo.org/api/files/8c2cc766-2b68-45bd-a2d3-391acf9bdb1b/rnaseq_R2.fq.gz https://zenodo.org/api/files/8c2cc766-2b68-45bd-a2d3-391acf9bdb1b/SwissProt_subset.fasta https://zenodo.org/api/files/8c2cc766-2b68-45bd-a2d3-391acf9bdb1b/alternate_annotation.gbk https://zenodo.org/api/files/8c2cc766-2b68-45bd-a2d3-391acf9bdb1b/alternate_annotation.gff3
- Copy the link location
Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)
- Select Paste/Fetch Data
Paste the link into the text field
Press Start
- Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
- Go into Shared data (top panel) then Data libraries
- Navigate to the correct folder as indicated by your instructor
- Select the desired files
- Click on the To History button near the top and select as Datasets from the dropdown menu
- In the pop-up window, select the history you want to import the files to (or create a new one)
- Click on Import
Preparing the genome sequence
Before annotating the genome, we want to make sure that the fasta file is properly formatted. We do it now to make sure we will not encounter unexpected errors later in the annotation process.
Funannotate provides two little tools to help us. Let’s run the two tools, one after the other.
The first one (Funannotate assembly clean Tool: toolshed.g2.bx.psu.edu/repos/iuc/funannotate_clean/funannotate_clean/1.8.9+galaxy2 ) compares all the sequences between them, and removes the shorter ones that are already included in longer ones. This is to reduce unexpected redundancy in the genome. This step is recommended only for haploid genomes (we know our organism is haploid). This first tool also removes any suspicious sequence (like sequences made only of 1 or 2 letters, instead of the 5 expected (ATGCN).
The second tool will ensure that our fasta file is sorted, based on the length of the contigs (the longest ones first). It will also rename contigs to make sure the name are standard (they will all begin with scaffold_
, then a number).
Hands-on: Polish the assembly
- Funannotate assembly clean Tool: toolshed.g2.bx.psu.edu/repos/iuc/funannotate_clean/funannotate_clean/1.8.9+galaxy2 with the following parameters:
- param-file “Assembly to clean”:
genome_masked.fasta
(Input dataset)- Sort assembly Tool: toolshed.g2.bx.psu.edu/repos/iuc/funannotate_sort/funannotate_sort/1.8.9+galaxy2 with the following parameters:
- param-file “Assembly to sort”:
output
(output of Funannotate assembly clean tool)
After this step, the genome is clean, sorted, and ready for the structural annotation.
QuestionHow many sequences are removed by this cleaning step ?
The repeat masked genome contains 1461 sequences, while the cleand one only contains 1425, so 36 were removed.
Preparing RNASeq data
When you sequence a new genome, you usually sequence a few libraries of RNASeq data, from different tissues and in different conditions, because this data will help you in annotating the genome. Here, we are using data from one RNASeq dataset that is available on Sequence Read Archive (SRA): SRR8534859.
You would normally get the Fastq files directly from SRA and use them in the following step. To speed up the tutorial (without impairing too much the quality of the results), we have reduced the size of the dataset into a single pair of (smaller) fastq files, available from Zenodo (or the GTN Data Libraries).
To make use of this RNASeq data, we need to map it on the genome. We will use RNA star tool and get a result in the form of a BAM file, that we will use in the rest of the tutorial.
Hands-on: Hands-on
- RNA STAR Tool: toolshed.g2.bx.psu.edu/repos/iuc/rgrnastar/rna_star/2.7.8a+galaxy0 with the following parameters:
- “Single-end or paired-end reads”:
Paired-end (as individual datasets)
- param-file “RNA-Seq FASTQ/FASTA file, forward reads”:
rnaseq_R1.fq.gz
(Input dataset)- param-file “RNA-Seq FASTQ/FASTA file, reverse reads”:
rnaseq_R2.fq.gz
(Input dataset)- “Custom or built-in reference genome”:
Use reference genome from history and create temporary index
- param-file “Select a reference genome”:
genome
(output of Sort assembly tool)- “Length of the SA pre-indexing string”:
11
We select 11
for the “Length of the SA pre-indexing string” parameter as it is the recommended value for a small genome of this size. If you select 14 (the default value), STAR will advise you to use 11
instead in its logs.
Comment: What if I want to use multiple RNASeq librariesTo get the best possible annotation, it is adviced to use multiple RNASeq libraries, sequences from different tissues in different conditions. To use them, you can map each one individually using STAR, just as in this tutorial. You will get one BAM file per RNASeq library, and you can then easily merge them into a single BAM file by using the MergeSamFiles Tool: toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_MergeSamFiles/2.18.2.1 tool. This single BAM file can then be used by Funannotate like we do in the next steps.
Before we move on to the next step, we need to make sure that the mapping went well. Have a look at the log
output of RNA STAR
.
QuestionWhat proportion of reads were correctly mapped to the genome? Do you think it is enough to continue with this tutorial?
Look for
Uniquely mapped reads %
: you should get ~96%. This is a very good score, because while reducing the size of the dataset for this tutorial, we have kept mostly reads properly mapping. Anyway, with real data, you expect to have a vast majority of reads mapping uniquely on the genome. If it’s not the case, check that you’re using the correct RNASeq files, with the correct genome sequence.
Structural annotation
We can now run Funannotate predict annotation to perform the structural annotation of the genome.
We need to input the genome sequence, the mapped RNASeq data, and the proteins to align on the genome. We also specify the name of the species and strain (they can be used later for submission to NCBI).
There are other parameters to finely tune how Funannotate will run ab-initio predictors to predict genes, and to filter the final results based on various criteria. As Funannotate uses BUSCO (Benchmarking Universal Single-Copy Orthologs) for initial training of ab-initio predictors, we select datasets close to the species we are annotating: mucorales (orthodb 10)
and rhizopus_oryzae
.
Funannotate is also able to use GeneMark to predict new genes, but to due to licensing restrictions, this software is not available on every Galaxy instance. We will ignore this for this tutorial, it will not impact the results too much.
Hands-on: Hands-on
- Funannotate predict annotation Tool: toolshed.g2.bx.psu.edu/repos/iuc/funannotate_predict/funannotate_predict/1.8.9+galaxy2 with the following parameters:
- param-file “Assembly to annotate”:
genome
(output of Sort assembly tool)- “Funannotate database”: select the latest version available
- In “Organism”:
- “Name of the species to annotate”:
Mucor mucedo
- “Strain name”:
muc1
- “Is it a fungus species?”:
No
- In “Evidences”:
- param-file “RNA-seq mapped to genome to train Augustus/GeneMark-ET”:
mapped.bam
(output of RNA STAR tool)- “Select protein evidences”:
Custom protein sequences
- param-file “Proteins to map to genome”:
SwissProt_subset.fasta
(Input dataset)- In “Busco”:
- BUSCO models to align””:
mucorales (orthodb 10)
- “Initial Augustus species training set for BUSCO alignment”:
rhizopus_oryzae
- “Which outputs should be generated”: Select all
Comment: on parameters
- For “Select protein evidences” we select
Custom protein sequences
to reduce the computing time, but for real data analysis, you should select the default value:Use UniProtKb/SwissProt (from selected Funannotate database)
.- It is possible to enable the “Is it a fungus species?” option in Funannotate: it launches an additional ab initio predictor (CodingQuerry) dedicated to fungi genomes. However it has proved to be unstable on the genome studied in this tutorial, and it can create a lot of fragmented gene models depending on the RNASeq data available. For this tutorial we leave this option to
No
. You can test it with real data, but be sure to compare the result with and without this option.- For real data analysis you can consider enabling the “Augustus settings (advanced)” > “Run ‘optimize_augustus.pl’ to refine training (long runtime)”. If you have enough data, you might get better results as there will be an additional training step for augustus (at the cost of a longer runtime).
Comment: Don't waitThis step will take a bit of time to run. While it runs, we can already schedule the following functional annotation steps. Galaxy will run them automatically as soon as the structural annotation is ready.
This tool produces several output dataset, in particular:
- the full structural annotation in Genbank, GFF3 or NCBI tbl formats: these files contain the position of all the genes that were found on the genome.
- the CDS, transcript and protein sequences of all the genes predicted by Funannotate (fasta files)
- some statistics and reports
Let’s have a closer look at the output of our annotation. First display the stats
dataset: the first par of the file contains some information on how funannotate was launched. If you go to the bottom, you’ll find a few interesting numbers in the annotation
section:
- the total number of genes and mRNA
- the average length of genes, exons, proteins
- the number of single/multiple exon transcripts
These number alone are interesting, but not fully informative on the quality of the annotation. For example, for the number of genes, you want a ‘good’ number based on what you expect for this species (a too big number can mean that genes are fragmented, and a too small number can mean that some genes were not annotated at all). These numbers can help when comparing an annotation with other ones performed with other parameters or tools. You can also compare these numbers with annotation of closely related species.
To get a better picture of the quality of the result, we will run BUSCO in the next step.
Before moving on, have a quick look at the tbl2asn error summary report
output: it lists a few potential problems that were identified by Funannotate in the results it generated. For example, Funannotate can tell you when it predicted genes that contain very short exons, or that use a rare splice site sequence. You can have a detailed list of identified potential problems in the tbl2asn genome validation report
dataset. It does not mean that each listed gene is wrong, but it means that you might want to give a closer look at these ones. If you have time to manually check each gene, Apollo can help you in doing this, see the note in the conclusion for this.
Functional annotation
EggNOG Mapper
EggNOG Mapper compares each protein sequence of the annotation to a huge set of ortholog groups from the EggNOG database. In this database, each ortholog group is associated with functional annotation like Gene Ontology (GO) terms or KEGG pathways. When the protein sequence of a new gene is found to be very similar to one of these ortholog groups, the corresponding functional annotation is transfered to this new gene.
Hands-on
- eggNOG Mapper Tool: toolshed.g2.bx.psu.edu/repos/galaxyp/eggnog_mapper/eggnog_mapper/2.1.8+galaxy2.1.8 with the following parameters:
- param-file “Fasta sequences to annotate”:
protein sequences
(output of Funannotate predict annotation tool)”- “Version of eggNOG Database”: select the latest version available
- In “Output Options”:
- “Exclude header lines and stats from output files”:
No
The output of this tool is a tabular file, where each line represents a gene from our annotation, with the functional annotation that was found by EggNOG-mapper. It includes a predicted protein name, GO terms, EC numbers, KEGG identifiers, …
Display the file and explore which kind of identifiers were found by EggNOG Mapper.
InterProScan
InterPro is a huge integrated database of protein families. Each family is characterized by one or muliple signatures (i.e. sequence motifs) that are specific to the protein family, and corresponding functional annotation like protein names or Gene Ontology (GO). A good proportion of the signatures are manually curated, which means they are of very good quality.
InterProScan is a tool that analyses each protein sequence from our annotation to determine if they contain one or several of the signatures from InterPro. When a protein contains a known signature, the corresponding functional annotation will be assigned to it by InterProScan.
InterProScan itself runs multiple applications to search for the signatures in the protein sequences. It is possible to select exactly which ones we want to use when launching the analysis (by default all will be run).
Hands-on
- InterProScan Tool: toolshed.g2.bx.psu.edu/repos/bgruening/interproscan/interproscan/5.55-88.0+galaxy3 with the following parameters:
- param-file “Protein FASTA File”:
protein sequences
(output of Funannotate predict annotation tool)”- “InterProScan database”: select the latest version available
- “Use applications with restricted license, only for non-commercial use?”:
Yes
(set it toNo
if you run InterProScan for commercial use)- “Output format”:
Tab-separated values format (TSV)
andXML
To speed up the processing by InterProScan during this tutorial, you can disable
Pfam
andPANTHER
applications. When analysing real data, it is adviced to keep them enabled.When some applications are disabled, you will of course miss the corresponding results in the output of InterProScan.
The output of this tool is both a tabular file and an XML file. Both contain the same information, but the tabular one is more readable for a Human: each line represents a gene from our annotation, with the different domains and motifs that were found by InterProScan.
If you display the TSV file you should see something like this:
Each line correspond to a motif found in one of the annotated proteins. The most interesting columns are:
- Column 1: the protein identifier
- Column 5: the identifier of the signature that was found in the protein sequence
- Column 4: the databank where this signature comes from (InterProScan regroups several motifs databanks)
- Column 6: the human readable description of the motif
- Columns 7 and 8: the position where the motif was found
- Column 9: a score for the match (if available)
- Column 12 and 13: identifier of the signature integrated in InterPro (if available). Have a look an example webpage for IPR036859 on InterPro.
- The following columns contains various identifiers that were assigned to the protein based on the match with the signature (Gene ontology term, Reactome, …)
The XML output file contains the same information in a computer-friendly format, we will use it in the next step.
Submission to NCBI
If you plan to submit the final genome sequence and annotation to NCBI, there are a few steps to follow.
Warning: Please do not submit this genome to NCBI!In this tutorial we will not perform a real submission, as we don’t want to duplicate the annotation in Genbank every time someone follows this tutorial!
But here’s a description of the main steps, and how Funannotate can help you in this process. NCBI provides a complete documentation for genome and annotation submission.
- First, you should have created a BioProject and a BioSample on the NCBI portal, corresponding to your scientific project, and the sample(s) you have sequenced
- The raw reads used for the assembly, and the RNASeq ones should be deposited on SRA, and taggued with the BioProject and BioSample ids
- Ideally you should start submitting the assembly before performing the annotation. The reason is simple: NCBI performs some validation on the genome sequence before accepting it into GenBank. It means that you may be forced to make modifications to the genome sequence (e.g. remove contigs, split contigs where you have adapter contamination, etc). You will save time and trouble doing the annotation on a fully validated and frozen genome sequence.
- You should get a
locus_tag
for your genome: it is the unique prefix that is used at the beginning of each gene name. By default, Funannotate usesFUN_
(e.g.FUN_000001
), but to submit to NCBI you need to have a prefix specific to your genome. NCBI should provide it to you when you create your BioProject. - You then need to prepare a file containing a few metadata. This can be done online on https://submit.ncbi.nlm.nih.gov/genbank/template/submission/. You need to fill the form with some basic information, like that for example (for real data, you should of course write real information!):
When filled, you can click on the Create template button: you will be able to download a .sbt
file that you should then upload to your Galaxy history.
The next step will show you how Funannotate can use this .sbt
file, and the locus_tag
, to generate the needed file for the submission of the annotation to the NCBI. These files will need to be submitted on the NCBI genome submission portal.
Integrating structural and functional annotation
Now we have a structural annotation, and the results of both EggNOG Mapper and InterProScan. Each one is in a separate file, so we will now combine all this data into a single file that will contain both the structural and the functional annotation. This will be the final output of our annotation pipeline, ready to be submitted to the NCBI reference database.
Hands-on: Hands-on
- Funannotate functional Tool: toolshed.g2.bx.psu.edu/repos/iuc/funannotate_annotate/funannotate_annotate/1.8.9+galaxy2 with the following parameters:
- “Input format”:
GenBank (from 'Funannotate predict annotation' tool)
- param-file “Genome annotation in genbank format”:
annotation (genbank)
(output of Funannotate predict annotation tool)- “Funannotate database”: select the latest version available
- param-file “NCBI submission template file”:
template.sbt
(the file downloaded from NCBI website, or leave empty if you didn’t generate it)- param-file “Eggnog-mapper annotations file”:
annotations
(output of eggNOG Mapper tool)- param-file “InterProScan5 XML file”:
InterProScan XML
(output of InterProScan tool)- param-file “BUSCO models”:
mucorales (orthodb 10)
- “Strain name”:
muc1
- “locus_tag from NCBI to rename GFF gene models with”:
MMUCEDO_
(we consider NCBI has assigned us thislocus_tag
)- “Which outputs should be generated”: Select all
This tool produces several output dataset, in particular:
- The full structural and functional annotation in Genbank and GFF3 format
- NCBI tbl and sqn (
NCBI Sequin genome
) formats: these files can be used for submission to NCBI - The CDS, transcript and protein sequences of all the genes predicted by Funannotate
- Some statistics and reports
If you display the GFF3 output, you will notice that the functional information, including gene names, is now stored in this file.
Evaluation with Busco
We now have a complete annotation, including functional annotation, but it’s time to evaluate the quality of this annotation. BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool allowing to evaluate the quality of a genome assembly or of a genome annotation. By comparing genomes from various more or less related species, the authors determined sets of ortholog genes that are present in single copy in (almost) all the species of a clade (Bacteria, Fungi, Plants, Insects, Mammalians, …). Most of these genes are essential for the organism to live, and are expected to be found in any newly sequenced and annotated genome from the corresponding clade. Using this data, BUSCO is able to evaluate the proportion of these essential genes (also named BUSCOs) found in a set of (predicted) transcript or protein sequences. This is a good evaluation of the “completeness” of the annotation.
Hands-on: Hands-on
- Busco Tool: toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.2.2+galaxy2 with the following parameters:
- param-file “Sequences to analyse”:
protein sequences
(output of Funannotate functional tool)- “Mode”:
annotated gene sets (protein)
- “Auto-detect or select lineage?”:
Select lineage
- “Lineage”:
Mucorales
- In “Advanced Options”:
- “Which outputs should be generated”:
shortsummary text
andsummary image
QuestionHow many BUSCO genes were found complete in the annotation? Do you think the quality of the annotation is good?
On a total of 2449, you should find ~2312 BUSCO genes identifed as complete in the annotation, with 2281 being in single copy, and 31 being duplicated.
That’s a quite good result as running BUSCO on the genome itself gives a very close number (2327 Complete BUSCOs) (see Flye assembly tutorial). It means the annotation process was able to detect most of the genes it was supposed to find.
To improve the result you can consider using more RNASeq data, and using the “Augustus settings (advanced)” > “Run ‘optimize_augustus.pl’ to refine training (long runtime)” option.
Visualisation with a genome browser
With Galaxy, you can visualize the annotation you have generated using JBrowse genome browser. This allows you to navigate along the chromosomes of the genome and see the structure of each predicted gene. We also add an RNASeq track, using the BAM file created with RNA STAR tool.
Hands-on: Hands-on
- JBrowse Tool: toolshed.g2.bx.psu.edu/repos/iuc/jbrowse/jbrowse/1.16.11+galaxy1 with the following parameters:
- “Reference genome to display”:
Use a genome from history
- param-file “Select the reference genome”:
genome
(output of Sort assembly tool)- In “Track Group”:
- param-repeat “Insert Track Group”
- “Track Category”:
Annotation
- In “Annotation Track”:
- param-repeat “Insert Annotation Track”
- “Track Type”:
GFF/GFF3/BED Features
- param-file “GFF/GFF3/BED Track Data”:
gff3
(output of Funannotate functional tool)- param-repeat “Insert Track Group”
- “Track Category”:
RNASeq
- In “Annotation Track”:
- param-repeat “Insert Annotation Track”
- “Track Type”:
BAM Pileups
- param-file “BAM Track Data”:
mapped.bam
(output of RNA STAR tool)- “Autogenerate SNP Track”:
Yes
Click on the newly created dataset’s eye to display it. You will see a JBrowse genome browser. You can have a look at the JBrowse tutorial for a more in-depth description of JBrowse.
Enable the annotation track, and the SNPs/Coverage tracks on the left side of JBrowse, then navigate along the genome. You will see the different gene models predicted by Funannotate.
If you click on a gene, a popup will appear with detailed information on the selected gene: position and sequence, but also the functional information retrieved from the EggNOG mapper and InterProScan results.
If you zoom to a specific gene, and look at the RNASeq tracks, you will see light grey regions corresponding to portions of the genomes where RNASeq were mapped, and darker grey regions corresponding to introns (= regions were some reads were found to match both the end of an exon, and the start of the next one). You can enable the other RNASeq track to display each individual read that was mapped on the genome.
If you navigate along the genome, you will find genes with very low RNASeq coverage: this demonstrates how Funannotate is able to predict genes not only with RNASeq, but also by comparing to protein sequences (from SwissProt) and using ab initio predictors, trained using RNASeq data.
Comparing annotations
Earlier, we have seen how general statistics and BUSCO results can help to evaluate the quality of an annotation. But when annnotating a new genome, you might want to try different annotation methods and parameters. We will see now how to compare multiple annotations between them. As an example, we will compare the annotation we have generated, with an alternate one, that you have uploaded from Zenodo at the beginning of the tutorial. Comparison is only possible for annotations performed on the same genome sequence.
Comparing with AEGeAn ParsEval
Let’s run AEGeAn ParsEval first: it compares two annotations in GFF3 format.
Hands-on: Hands-on
- AEGeAn ParsEval Tool: toolshed.g2.bx.psu.edu/repos/iuc/aegean_parseval/aegean_parseval/0.16.0 with the following parameters:
- param-file “Reference GFF3 file”:
gff3
(output of Funannotate functional tool)- param-file “Prediction GFF3 file”:
alternate_annotation.gff3
(Input dataset)- “Select the output type”:
HTML
AEGeAn ParsEval tool compares gene loci spread all along the genome sequence, each locus containing gene(s) on the “reference” (=the annotation we have generated) and/or the “prediction” (=the alternate one) annotation.
The output is a web page where you can see how many loci or genes are identical or different between the two annotations. By clicking on the (+)
links, or on the scaffold names, you can see the results more in detail, even at the gene level.
Hands-on: Hands-on
- Display the report
- Click on
scaffold_11
- Click on the first locus (between positions 637 and 7730)
QuestionWhat can you tell from the comparison at this gene locus?
You should see something like this:
As you can see, at this position, the alternate annotation as found only one big gene, while the annotation you have generated has found 4 different genes, not all on the same strand.
This can be explained by the way the alternate annotation was performed: it was done by running Funannotate without any RNASeq data, and choosing a wrong value (
insecta
) in the busco parameter. The annotation you have generated is probably the good one here.
Comparing with Funannotate compare
Funannotate compare tool is another tool to compare several annotations. It uses genbank files, as generated by Funannotate functional tool.
Hands-on: Hands-on
- Funannotate compare Tool: toolshed.g2.bx.psu.edu/repos/iuc/funannotate_compare/funannotate_compare/1.8.9+galaxy2 with the following parameters:
- param-files “Genome annotations in genbank format”:
alternate_annotation.gbk
(Input dataset) andgbk
(output of Funannotate functional tool)- “Funannotate database”: select the latest version available
The output is a web page with different tabs:
Stats
: general statistics on the two annotations.Orthologs
: orthology relations between genes of the two annotations.InterPro
andPFAM
: count the number of times each InterPro/PFAM signature is identified in each annotation. You can download a PDF file at the top, showing an Non-metric multidimensional scaling (NMDS) analysis of this data.Merops
andCAZymes
: count the number of members of each Merops/CAZymes family in each annotation. You can download PDF files at the top, showing colorful representations of this data.GO
: list Gene Ontology terms that were found to be over or under represented in each annotation.
QuestionWhat can you tell from this report?
The alternate annotation contains less genes than the one you generated, and misses a lot of functional annotations. It confirms that this alternate annotation have a lower quality than the one you performed.
Conclusion
Congratulations for reaching the end of this tutorial! Now you know how to perform a structural and functional annotation of a new eukaryotic genome, using Funannotate, EggNOG mapper and InterProScan. You also learned how Funannotate can help you in the submission process to NCBI. And you learned how to visualise your new annotation using JBrowse, and how to compare it with another annotation.
An automatic annotation of an eukaryotic genome is unfortunately rarely perfect. If you inspect some predicted genes (or look at the tbl2asn genome validation report
output of Funannotate), you may find some mistakes made by Funannotate, or potential problems, e.g. wrong exon/intron limits, splitted genes, or merged genes. Setting up a manual curation project using Apollo can help a lot to manually fix these errors. Check out the Apollo tutorial for more details.
Key points
Funannotate allows to perform structural annotation of an eukaryotic genome.
Functional annotation can be performed using EggNOG-mapper and InterProScan.
BUSCO and JBrowse allow to inspect the quality of an annotation.
Funannotate allows to format an annotation for sumission at NCBI.
Frequently Asked Questions
Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Genome Annotation topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help ForumReferences
- Campbell, M. S., C. Holt, B. Moore, and M. Yandell, 2014 Genome annotation and curation using MAKER and MAKER-P. Current Protocols in Bioinformatics 48: 4–11. 10.1002/0471250953.bi0411s48
- Palmer, J. M., and J. Stajich, 2020 Funannotate v1.8.1: Eukaryotic genome annotation. 10.5281/zenodo.4054262
- Brůna, T., K. J. Hoff, A. Lomsadze, M. Stanke, and M. Borodovsky, 2021 BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genomics and Bioinformatics 3: lqaa108. 10.1093/nargab/lqaa108
Glossary
- NMDS
- Non-metric multidimensional scaling
Feedback
Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.
Citing this Tutorial
- , 2022 Genome annotation with Funannotate (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/funannotate/tutorial.html Online; accessed TODAY
- Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
@misc{genome-annotation-funannotate, author = "Anthony Bretaudeau", title = "Genome annotation with Funannotate (Galaxy Training Materials)", year = "2022", month = "10", day = "18" url = "\url{https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/funannotate/tutorial.html}", note = "[Online; accessed TODAY]" } @article{Batut_2018, doi = {10.1016/j.cels.2018.05.012}, url = {https://doi.org/10.1016%2Fj.cels.2018.05.012}, year = 2018, month = {jun}, publisher = {Elsevier {BV}}, volume = {6}, number = {6}, pages = {752--758.e1}, author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning}, title = {Community-Driven Data Analysis Training for Biology}, journal = {Cell Systems} }
Funding
These individuals or organisations provided funding support for the development of this resource
This project (2020-1-NL01-KA203-064717
) is funded with the support of the Erasmus+ programme of the European Union. Their funding has supported a large number of tutorials within the GTN across a wide array of topics.