Long non-coding RNAs (lncRNAs) annotation with FEELnc
Author(s): Stéphanie Robin
Editor(s): Anthony Bretaudeau
Overview
Questions:
- How to annotate lncRNAs with FEELnc?
- How to classify lncRNAs according to their location and direction of transcription relative to proximal RNA transcripts?
- How to update the genome annotation with these annotated lncRNAs?
Objectives:
- Load data (genome assembly, annotation and mapped RNASeq) into Galaxy
- Perform a transcriptome assembly with StringTie
- Annotate lncRNAs with FEELnc
- Classify lncRNAs according to their location
- Update genome annotation with lncRNAs
Requirements:
- Introduction to Galaxy Analyses
- Genome Annotation
- Genome annotation with Funannotate: tutorial hands-on
Time estimation: 2 hours
Level: Intermediate
Last modification: Oct 18, 2022
Introduction
Messenger RNAs (mRNAs) are not the only type of RNA present in organisms (such as mammals, insects or plants), and they represent only a small fraction of the transcripts. A vast repertoire of small non-coding RNAs (miRNAs, snRNAs) and long non-coding RNAs (lncRNAs) is also present. lncRNAs are generally defined as transcripts longer than 200 nucleotides that are not translated into functional proteins. They are important because of their major roles in the cellular machinery and because they are present in large numbers. Indeed, they are notably involved in gene expression regulation, control of translation and imprinting. Statistics from the GENCODE project reveal that the human genome contains more than 19,095 lncRNA genes, almost as many as its 19,370 protein-coding genes.
Using RNASeq data, we can reconstruct assembled transcripts (with or without a reference genome), which can then be annotated and identified individually as mRNAs or lncRNAs.
In this tutorial, we will use a software tool called StringTie (“StringTie enables improved reconstruction of a transcriptome from RNA-seq reads” 2015) to assemble the transcripts and then FEELnc (“FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome” 2017) to annotate the assembled transcripts of a small eukaryote: Mucor mucedo (a fungal plant pathogen).
StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts.
FEELnc (FlExible Extraction of Long non-coding RNA) is a pipeline to annotate lncRNAs from RNASeq assembled transcripts. It is composed of 3 modules:
- FEELnc_filter: Extract, filter candidate transcripts.
- FEELnc_codpot: Compute the coding potential of candidate transcripts.
- FEELnc_classifier: Classify lncRNAs based on their genomic localization with respect to other transcripts.
Data upload
To assemble the transcriptome with StringTie and annotate lncRNAs with FEELnc, we will use the following files:
- The genome sequence in FASTA format. For this tutorial, we will use the genome assembled in the Flye assembly tutorial.
- The genome annotation in GFF3 format. We will use the genome annotation obtained in the Funannotate tutorial.
- Some aligned RNASeq data in BAM format. Here, we will use RNASeq reads that were mapped to the genome with STAR.
Hands-on: Data upload
Create a new history for this tutorial
- Click the new-history icon at the top of the history panel.
- If the new-history icon is missing:
  - Click on the galaxy-gear icon (History options) on the top of the history panel
  - Select the option Create New from the menu
Import the files from Zenodo or from the shared data library (GTN - Material -> genome-annotation -> Long non-coding RNAs (lncRNAs) annotation with FEELnc):
https://zenodo.org/api/files/0f8d27c5-8c8d-4379-90c4-c3cd950de391/genome_assembly.fasta
https://zenodo.org/api/files/0f8d27c5-8c8d-4379-90c4-c3cd950de391/genome_annotation.gff3
https://zenodo.org/api/files/0f8d27c5-8c8d-4379-90c4-c3cd950de391/all_RNA_mapped.bam
- Copy the link location
- Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)
- Select Paste/Fetch Data
- Paste the link into the text field
- Press Start
- Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
- Go into Shared data (top panel) then Data libraries
- Navigate to the correct folder as indicated by your instructor
- Select the desired files
- Click on the To History button near the top and select as Datasets from the dropdown menu
- In the pop-up window, select the history you want to import the files to (or create a new one)
- Click on Import
Transcripts assembly with StringTie
StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It takes as input a SAM, BAM or CRAM file sorted by coordinate (genomic location). This file should contain spliced RNA-seq read alignments such as the ones produced by TopHat, HISAT2 or STAR. The TopHat output is already sorted, but the SAM output from other aligners should be sorted using the samtools program.
A reference annotation file in GTF or GFF3 format can also be provided to StringTie. Its transcripts are used as ‘guides’ during the assembly process and help improve the recovery of transcript structures for the annotated transcripts.
Hands-on: Transcripts assembly
StringTie Tool: toolshed.g2.bx.psu.edu/repos/iuc/stringtie/stringtie/2.1.7+galaxy1 with the following parameters:
- “Input options”: Short reads
- param-file “Input short mapped reads”: all_RNA_mapped.bam
- “Specify strand information”: Unstranded
- “Use a reference file to guide assembly?”: Use reference GTF/GFF3
- “Reference file”: Use a file from history
- param-file “GTF/GFF3 dataset to guide assembly”: genome_annotation.gff3
- “Use Reference transcripts only?”: No
- “Output files for differential expression?”: No additional output
- “Output coverage file”: No
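For readers working outside Galaxy, the parameters above correspond roughly to a plain StringTie command line. The sketch below only builds and prints that command; it does not run StringTie. The output file name assembled_transcripts.gtf and the helper name stringtie_command are assumptions made for illustration, and the Galaxy wrapper may pass additional options.

```python
import shlex


def stringtie_command(bam, reference, output_gtf):
    """Build a StringTie command line mirroring the Galaxy parameters above."""
    return [
        "stringtie", bam,   # coordinate-sorted RNA-seq alignments
        "-G", reference,    # reference annotation (GTF/GFF3) used as a guide
        "-o", output_gtf,   # assembled transcripts, written in GTF format
        # No --rf/--fr flag: the library is treated as unstranded.
        # No -e flag: novel transcripts are kept, not only reference ones.
    ]


# "assembled_transcripts.gtf" is a hypothetical output name.
cmd = stringtie_command(
    "all_RNA_mapped.bam", "genome_annotation.gff3", "assembled_transcripts.gtf"
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell where StringTie is installed. Building the command as a list of arguments, rather than a single string, avoids quoting problems if a file name contains spaces.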
We obtain an annotation file (GTF format) which contains all the assembled transcripts present in the RNASeq data.
After this step, the transcriptome is assembled and ready for lncRNA annotation.
Question
How many transcripts are assembled?
Specific features can be extracted from the GTF file using, for example, Extract features from GFF data Tool: Extract_features1. By selecting transcript from column 3 / Feature, we keep only the transcript elements present in this annotation file. The assembly contains 14,877 transcripts (corresponding to the number of lines in the filtered GTF file).
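The same count can be obtained with a few lines of Python, since the feature type is stored in the third tab-separated column of a GTF file. This is a minimal sketch run on a made-up excerpt; the identifiers and coordinates in example_gtf are hypothetical and only there to make the example self-contained.

```python
def count_features(lines, feature="transcript"):
    """Count GTF records whose third column (feature type) equals `feature`."""
    count = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == feature:
            count += 1
    return count


# Tiny made-up GTF excerpt (hypothetical identifiers and coordinates).
example_gtf = [
    "# made-up example\n",
    'contig_1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";\n',
    'contig_1\tStringTie\texon\t100\t400\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";\n',
    'contig_1\tStringTie\texon\t600\t900\t1000\t+\t.\tgene_id "STRG.1"; transcript_id "STRG.1.1";\n',
    'contig_2\tStringTie\ttranscript\t50\t700\t1000\t-\t.\tgene_id "STRG.2"; transcript_id "STRG.2.1";\n',
    'contig_2\tStringTie\texon\t50\t700\t1000\t-\t.\tgene_id "STRG.2"; transcript_id "STRG.2.1";\n',
]

print(count_features(example_gtf))          # transcripts: prints 2
print(count_features(example_gtf, "exon"))  # exons: prints 3
```

To apply it to a real file downloaded from the history, pass an open file handle instead of the list, for example `count_features(open("assembled_transcripts.gtf"))`, where the file name is again an assumption.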
lncRNAs annotation with FEELnc
FEELnc is a pipeline which is composed of 3 steps. These 3 steps are run automatically when running FEELnc within Galaxy. The first step (FEELnc_filter) consists in filtering out unwanted/spurious transcripts and/or transcripts overlapping (in sense) exons of the reference annotation, and especially protein coding exons as they more probably correspond to new mRNA isoforms.
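To make the idea of a sense overlap concrete, here is a simplified sketch of the test applied in the filtering step: a candidate is discarded if one of its exons overlaps a reference exon on the same sequence and the same strand. This is an illustration only, not FEELnc's actual implementation, which offers further criteria (such as transcript length) that are not modelled here. The names Exon, overlaps_in_sense and keep_candidate are hypothetical.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    seqid: str
    start: int   # 1-based, inclusive, as in GTF
    end: int     # inclusive
    strand: str  # "+" or "-"


def overlaps_in_sense(a, b):
    """True if two exons share at least one base on the same sequence and strand."""
    return (
        a.seqid == b.seqid
        and a.strand == b.strand
        and a.start <= b.end
        and b.start <= a.end
    )


def keep_candidate(candidate_exons, reference_exons):
    """Keep a candidate only if none of its exons overlaps a reference exon in sense."""
    return not any(
        overlaps_in_sense(c, r) for c in candidate_exons for r in reference_exons
    )


# Made-up coordinates for illustration.
reference = [Exon("contig_1", 100, 500, "+")]
sense_overlap = [Exon("contig_1", 450, 800, "+")]  # likely a new mRNA isoform
antisense = [Exon("contig_1", 450, 800, "-")]      # opposite strand: kept
intergenic = [Exon("contig_1", 600, 900, "+")]     # no overlap: kept

print(keep_candidate(sense_overlap, reference))
print(keep_candidate(antisense, reference))
print(keep_candidate(intergenic, reference))
```

The nested loop is quadratic and is only suitable for a toy example; a real filter over thousands of transcripts would index the reference exons by sequence and position first.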
To use FEELnc, we need a reference annotation file in GTF format that contains the protein-coding gene annotation. So far, we only have the reference annotation in GFF3 format (genome_annotation.gff3). To convert it from GFF3 to GTF, we will use gffread.
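Outside Galaxy, this conversion is a single gffread call, where -T requests GTF output and -o names the output file. The sketch below only assembles and prints the command; the helper name gffread_to_gtf_command is hypothetical, and the output name genome_annotation.gtf simply follows the naming used in this tutorial.

```python
import shlex


def gffread_to_gtf_command(gff3_path, gtf_path):
    """Build a gffread command line that converts a GFF3 file to GTF."""
    return [
        "gffread", gff3_path,  # input annotation (GFF3)
        "-T",                  # write GTF instead of the default GFF3
        "-o", gtf_path,        # output file
    ]


convert_cmd = gffread_to_gtf_command("genome_annotation.gff3", "genome_annotation.gtf")
print(shlex.join(convert_cmd))
```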
Hands-on: FEELnc
- gffread Tool: toolshed.g2.bx.psu.edu/repos/devteam/gffread/gffread/2.2.1.3+galaxy0 with the following parameters:
  - param-file “Input BED, GFF3 or GTF feature file”: genome_annotation.gff3
  - “Feature File Output”: GTF
- FEELnc Tool: toolshed.g2.bx.psu.edu/repos/iuc/feelnc/feelnc/0.2 with the following parameters:
  - param-file “Transcripts assembly”: Assembled transcript (output of StringTie tool)
  - param-file “Reference annotation”: genome_annotation.gtf (output of gffread tool)
  - param-file “Genome sequence”: genome_assembly.fasta
FEELnc provides 3 output files:
- lncRNA annotation file: annotation file in GTF format which contains the final set of lncRNAs
- mRNA annotation file: annotation file in GTF format which contains the final set of mRNAs
- Classifier output file: table containing the classification of lncRNAs based on their genomic localisation with respect to other transcripts (direction: sense or antisense; type: genic if the lncRNA gene overlaps an RNA gene from the reference annotation file, or intergenic (lincRNA) if not).
FEELnc also writes a summary to stdout.
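The classifier table can be summarised with a short script that tallies interactions by type and direction and collects the distinct lncRNA genes. This is a minimal sketch that assumes a tab-separated file with a header row containing the columns lncRNA_gene, type and direction; check the header of your own output before relying on these names. The rows in example_table are made up.

```python
import csv
import io
from collections import Counter


def tally_classes(handle):
    """Count interactions per (type, direction) and collect distinct lncRNA genes."""
    reader = csv.DictReader(handle, delimiter="\t")
    interactions = Counter()
    lncrna_genes = set()
    for row in reader:
        interactions[(row["type"], row["direction"])] += 1
        lncrna_genes.add(row["lncRNA_gene"])
    return interactions, lncrna_genes


# Made-up table; the column names are an assumption about the classifier output.
example_table = io.StringIO(
    "isBest\tlncRNA_gene\tlncRNA_transcript\tpartner_RNA_gene\t"
    "partner_RNA_transcript\tdirection\ttype\tdistance\tsubtype\tlocation\n"
    "1\tLNC1\tLNC1.1\tG1\tG1.t1\tsense\tintergenic\t1200\tsame_strand\tupstream\n"
    "0\tLNC1\tLNC1.1\tG2\tG2.t1\tantisense\tintergenic\t5400\tdivergent\tupstream\n"
    "1\tLNC2\tLNC2.1\tG3\tG3.t1\tsense\tgenic\t0\tnested\tintronic\n"
)

interactions, lncrna_genes = tally_classes(example_table)
for (kind, direction), n in sorted(interactions.items()):
    print(kind, direction, n)
print("interactions:", sum(interactions.values()))
print("lncRNA genes:", len(lncrna_genes))
```

In this toy table, one intergenic lncRNA has two partner transcripts while the genic one has a single partner, so the number of interactions (3) exceeds the number of lncRNA genes (2).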
Question
How many RNAs does this annotation contain? How many interactions between lncRNAs and mRNAs have been identified? Can you describe the different types of lncRNAs?
The summary file indicates that FEELnc annotated 104 lncRNAs and 0 new mRNAs. The initial annotation contains 13,795 mRNAs. Therefore, a total of 13,899 RNAs (13,795 + 104) are currently annotated.
The summary file indicates 652 interactions between lncRNAs and mRNAs. These interactions are described in the Classifier output file.
The different types of lncRNAs (intergenic (sense and antisense), intragenic (sense)) are described in the Classifier output file. We observe that the majority of the lncRNAs are intergenic. Each of these lncRNAs can interact with several mRNAs. Only 7 lncRNAs are genic (intragenic). Each of them has a single interaction, with the mRNA that contains it.
For future analyses, it would be useful to have an updated annotation containing both the mRNA and lncRNA annotations. Thus, we will merge the reference annotation with the lncRNA annotation obtained with FEELnc.
Hands-on: Merge the annotations
concatenate Tool: https://toolshed.g2.bx.psu.edu/view/bgruening/text_processing/f46f0e4f75c4 with the following parameters:
- param-file “Datasets to concatenate”: genome_annotation.gtf
- Insert Dataset
- param-file “Dataset”: lncRNA annotation with FEELnc
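Concatenating two GTF files amounts to writing the lines of one after the other. The sketch below does this in Python on two tiny temporary files so that it runs on its own; the file names lncRNA_annotation.gtf and merged_annotation.gtf and the records they contain are made up for illustration.

```python
import os
import tempfile


def concatenate(paths, output_path):
    """Write the lines of each input file, in order, to output_path."""
    with open(output_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    # Guard against a missing final newline, which would
                    # otherwise glue two records together.
                    out.write(line if line.endswith("\n") else line + "\n")


with tempfile.TemporaryDirectory() as workdir:
    reference = os.path.join(workdir, "genome_annotation.gtf")
    lncrna = os.path.join(workdir, "lncRNA_annotation.gtf")
    merged = os.path.join(workdir, "merged_annotation.gtf")

    # Made-up records; the first file deliberately lacks a trailing newline.
    with open(reference, "w") as handle:
        handle.write('contig_1\tref\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "G1.t1";')
    with open(lncrna, "w") as handle:
        handle.write('contig_2\tFEELnc\ttranscript\t50\t700\t.\t-\t.\tgene_id "LNC1"; transcript_id "LNC1.1";\n')

    concatenate([reference, lncrna], merged)
    with open(merged) as handle:
        merged_lines = handle.read().splitlines()

print(len(merged_lines))
```

The merged file keeps the records in input order and is not sorted by coordinate; some downstream tools may expect a sorted annotation.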
Conclusion
Congratulations on reaching the end of this tutorial! You now know how to annotate lncRNAs using RNASeq data.
Key points
- StringTie performs a transcriptome assembly from mapped RNASeq data and provides an annotation file describing the assembled transcripts.
- The FEELnc pipeline annotates long non-coding RNAs (lncRNAs).
- Annotation is based on transcripts reconstructed from RNA-seq data (either with or without a reference genome).
- Annotation can be performed without any training set of non-coding RNAs.
- FEELnc classifies lncRNAs by their location and direction of transcription relative to proximal RNA transcripts.
References
- StringTie enables improved reconstruction of a transcriptome from RNA-seq reads, 2015 Nature Biotechnology 33: 290–295. 10.1038/nbt.3122
- FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome, 2017 Nucleic Acids Research 45: e57. 10.1093/nar/gkw1306
Glossary
- lncRNAs: long non-coding RNAs
- mRNAs: messenger RNAs
Citing this Tutorial
- Stéphanie Robin, 2022 Long non-coding RNAs (lncRNAs) annotation with FEELnc (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/lncrna/tutorial.html Online; accessed TODAY
- Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
@misc{genome-annotation-lncrna,
  author = "Stéphanie Robin",
  title = "Long non-coding RNAs (lncRNAs) annotation with FEELnc (Galaxy Training Materials)",
  year = "2022",
  month = "10",
  day = "18",
  url = "\url{https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/lncrna/tutorial.html}",
  note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
  doi = {10.1016/j.cels.2018.05.012},
  url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
  year = 2018,
  month = {jun},
  publisher = {Elsevier {BV}},
  volume = {6},
  number = {6},
  pages = {752--758.e1},
  author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
  title = {Community-Driven Data Analysis Training for Biology},
  journal = {Cell Systems}
}