Visualization of RNA-Seq results with CummeRbund

Authors: Andrea Bagnacani
Overview
Questions:
  • How are RNA-Seq results stored?

  • Why are visualization techniques needed?

  • How to select genes for visualizing meaningful results of differential gene expression analysis?

Objectives:
  • Manage RNA-Seq results

  • Extract genes for producing differential gene expression analysis visualizations

  • Visualize meaningful information

Time estimation: 1 hour
Last modification: Oct 18, 2022
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.

Introduction

RNA-Seq analysis helps researchers annotate new genes and splice variants, and provides cell- and context-specific quantification of gene expression. RNA-Seq data, however, are complex and require both computer science and mathematical knowledge to be managed and interpreted.

Visualization techniques are key to overcoming the complexity of RNA-Seq data, and are valuable tools for gathering information and insights.

In this tutorial we will visualize RNA-Seq data produced by the CuffDiff tool.

Agenda

In this tutorial, we will deal with:

  1. Introduction
  2. Reasons for visualizing RNA-Seq results
  3. Importing RNA-Seq result data
  4. Filtering and sorting
  5. CummeRbund
  6. Conclusion

Reasons for visualizing RNA-Seq results

To make sense of the available RNA-Seq data, and to get an overview of the condition-specific gene expression levels of the provided samples, we need to visualize our results. Here we will use CummeRbund.

CummeRbund is an open-source tool that simplifies the analysis of a CuffDiff RNA-Seq output. In particular, it helps researchers with:

  • managing, integrating, and visualizing the data produced by CuffDiff
  • simplifying data exploration
  • providing a bird’s-eye view of the expression analysis by describing relationships between genes, transcripts, transcription start sites, and protein-coding regions
  • exploring subfeatures of individual genes or gene-sets
  • creating publication-ready plots

A typical workflow for the visualization of RNA-Seq data involving CummeRbund:

Figure: Workflow.

CummeRbund reads your RNA-Seq results from a SQLite database. This database has to be created using CuffDiff’s SQLite output option.

Instruct CuffDiff to organize its output in a SQLite database that CummeRbund can read.

Figure: SQLite output.
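
Because CuffDiff's output is an ordinary SQLite file, it can also be inspected outside Galaxy. The following is a minimal sketch using Python's standard sqlite3 module. The table name and columns in the demo database are hypothetical stand-ins, not the actual CuffDiff schema.

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in a SQLite database, sorted alphabetically."""
    cursor = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in cursor.fetchall()]


# Demo on an in-memory database. For a real file, connect to its path instead
# (the location of your downloaded database is up to you).
demo = sqlite3.connect(":memory:")
# Hypothetical table, created only so the demo has something to list.
demo.execute("CREATE TABLE demo_isoform_tests (test_id TEXT, q_value REAL)")
demo.execute("INSERT INTO demo_isoform_tests VALUES ('TX_A', 0.01)")

tables = list_tables(demo)
row_count = demo.execute("SELECT COUNT(*) FROM demo_isoform_tests").fetchone()[0]
print(tables, row_count)
```

Listing the tables first is a cheap way to learn what a database contains before writing any query against it.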

Importing RNA-Seq result data

Hands-on: Data upload
  1. Create a new history

    Click the new-history icon at the top of the history panel.

    If the new-history icon is missing:

    1. Click on the galaxy-gear icon (History options) on the top of the history panel
    2. Select the option Create New from the menu
  2. Import the CuffDiff SQLite dataset

    • Copy the link location
    • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    Rename the dataset to “RNA-Seq SQLite result data”

    • Click on the galaxy-pencil (pencil) icon for the dataset to edit its attributes
    • In the central panel, change the Name field to RNA-Seq SQLite result data
    • Click the Save button

By default, when data is imported via its link, Galaxy names it with its URL.

CuffDiff’s output data is organized in a SQLite database, so we need to extract it to be able to see what it looks like.

For this tutorial, we are interested in the transcripts that CuffDiff tested for differential expression.

Hands-on: Extract CuffDiff results
  1. Extract CuffDiff tool with the following parameters
    • “Select tables to output”: Transcript differential expression testing
  2. Inspect the table

    • Click on the galaxy-eye (eye) icon (“View data”) on the right of the file name in the history
    • Inspect the content of the file

Each entry is a transcript that was tested for differential expression, described in terms of the following attributes.

  • test_id: A unique identifier describing the transcript being tested
  • gene_id: The identifier of the gene being tested
  • gene: The name of the gene being tested
  • locus: The genomic coordinates of the gene transcript being tested
  • sample_1: Label of the 1st sample
  • sample_2: Label of the 2nd sample
  • status: The test’s status:
    • “OK” if test successful
    • “NOTEST” if not enough alignments for testing
    • “LOWDATA” if too complex or shallowly sequenced
    • “HIDATA” if too many fragments in locus
    • “FAIL” if a numerical exception prevented testing
  • value_1: The transcript’s FPKM in sample_1
  • value_2: The transcript’s FPKM in sample_2
  • log2(fold_change): The log2 of the fold change, computed as value_2/value_1. A positive value means higher expression in condition two (sample_2) than in condition one (sample_1)
  • test_stat: The test statistic’s value, used to determine the significance of the observed change in FPKM
  • p_value: The uncorrected p-value of the test statistic
  • q_value: The False-discovery-rate-adjusted p-value of the test statistic
  • significant: “yes” or “no”, depending on whether the q_value (the p_value after Benjamini-Hochberg correction for multiple testing) is below the allowed false discovery rate. This tells whether the difference between the expression levels in condition one (sample_1) and condition two (sample_2) is significant
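
To make the attribute list concrete, here is a minimal sketch that reads such a table with Python's csv module and tallies the test statuses. It assumes a tab-separated file with a single header line naming the columns as listed above. The two records are made up for illustration and do not come from the tutorial dataset.

```python
import csv
import io
from collections import Counter

# Column names as listed above, in table order.
COLUMNS = [
    "test_id", "gene_id", "gene", "locus", "sample_1", "sample_2", "status",
    "value_1", "value_2", "log2(fold_change)", "test_stat", "p_value",
    "q_value", "significant",
]

# Two made-up records (illustrative values only).
EXAMPLE = "\n".join([
    "\t".join(COLUMNS),
    "TX_A\tG1\tGENE1\tchr1:100-200\thits7\thits9\tOK\t10.0\t10.0\t0.0\t0.0\t1.0\t1.0\tno",
    "TX_B\tG2\tGENE2\tchr2:300-400\thits7\thits9\tNOTEST\t0.0\t0.0\t0.0\t0.0\t1.0\t1.0\tno",
]) + "\n"


def read_records(handle):
    """Read a tab-separated table with a one-line header into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


records = read_records(io.StringIO(EXAMPLE))
status_counts = Counter(record["status"] for record in records)
print(status_counts)
```

Counting the status values is a quick sanity check: records that are not “OK” were never actually tested, so they cannot be significant.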

We want to keep only the significant differentially expressed genes.

Question
  1. How to retain only the significant differentially expressed genes?
  2. Which column stores this information?
  Solution
  1. We need to filter on the column storing the record’s significance
  2. Column 14
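
The q_value column is described above as the false-discovery-rate-adjusted p_value. As an illustration, the textbook Benjamini-Hochberg adjustment can be sketched as follows. This is not CuffDiff's own code, the p-values are made up, and the 0.05 threshold is an assumption chosen for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase
    # as the rank decreases (the usual monotonicity step).
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.20]  # made-up p-values
q_values = benjamini_hochberg(p_values)
FDR_THRESHOLD = 0.05  # assumed threshold for this example
flags = ["yes" if q <= FDR_THRESHOLD else "no" for q in q_values]
print(q_values, flags)
```

With these made-up inputs only the first test remains at or below the assumed threshold after adjustment, although three of the four raw p-values are below 0.05.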

Filtering and sorting

We now want to highlight the transcripts whose expression difference, the log2(fold_change), has been statistically assessed as both high and significant.

Hands-on: Extract CuffDiff's significant differentially expressed genes
  1. Filter tool with the following parameters
    • “Filter”: the extracted table from the previous step
    • “With following condition”: an appropriate filter over the target column (see questions below when in doubt)
    • “Number of header lines to skip”: the number of rows used for the table’s header
    Question
    1. What column stores the information of significance for each record?
    2. Which conditional expression has to be set to filter all records on the selected column?
    3. How many rows are allocated for the table’s header?
    4. How many entries were originally stored in the table? And how many after the filtering operation?
    Solution
    1. Column 14
    2. c14=='yes'
    3. 1
    4. ~140,000 (before filtering) vs. 219 (after filtering)
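
The filtering step can be sketched in plain Python as follows. The function name and the demo rows are hypothetical. Passing the header through unchanged is a choice made for this sketch, not a statement about how the Galaxy Filter tool treats header lines.

```python
def filter_rows(lines, column, expected, header_lines=1):
    """Keep the header lines plus every row whose 1-based `column` equals `expected`."""
    kept = list(lines[:header_lines])
    for line in lines[header_lines:]:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column and fields[column - 1] == expected:
            kept.append(line)
    return kept


# Made-up 14-column table: only the first and last columns matter here.
header = "\t".join(f"col{i}" for i in range(1, 15))
rows = [
    "\t".join(["TX_A"] + ["-"] * 12 + ["yes"]),
    "\t".join(["TX_B"] + ["-"] * 12 + ["no"]),
    "\t".join(["TX_C"] + ["-"] * 12 + ["yes"]),
]
table = [header] + rows

significant = filter_rows(table, column=14, expected="yes", header_lines=1)
print(len(significant) - 1, "significant rows")
```

Skipping the header explicitly matters: without it, the header row would be compared against 'yes' like any other record and silently dropped.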

Review the meaning of each column, and look at your data.

  • The differential expression values, log2(fold_change), are stored in column 10
  • The statistical score assessing the significance of the differential expression, the q_value, is stored in column 13

We will sort all records on the basis of their q_value (column 13) and log2(fold_change) (column 10).

  1. Sort tool with the following parameters
    • “Sort Dataset”: the filtered table
    • “on column”: 13
    • “with flavor”: Numerical sort
    • “everything in”: Ascending order
    • param-repeat Insert Column selection, and parameterize the Sort tool to sort on column 10. Be careful of the sorting order!
    • Are there any rows allocated for the table’s header? In that case, set “Number of header lines to skip” accordingly!
    Question
    1. Which gene transcript has been statistically assessed as both high and significant?

    Solution
    1. LIMCH1
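
A two-key sort like the one above can be sketched in Python. The sketch assumes the secondary key, log2(fold_change), should be descending so that the largest fold changes come first among records with equal q_value. That ordering is an assumption of this example, and the records are made up.

```python
def sort_records(records):
    """Sort by q_value ascending, breaking ties by log2(fold_change) descending."""
    return sorted(
        records,
        key=lambda r: (float(r["q_value"]), -float(r["log2(fold_change)"])),
    )


# Made-up records holding only the columns used for sorting.
records = [
    {"gene": "GENE_A", "q_value": "0.01", "log2(fold_change)": "1.5"},
    {"gene": "GENE_B", "q_value": "0.001", "log2(fold_change)": "0.8"},
    {"gene": "GENE_C", "q_value": "0.01", "log2(fold_change)": "3.2"},
]

ordered = [r["gene"] for r in sort_records(records)]
print(ordered)
```

The values are converted with float() before comparison, which is the point of choosing a numerical sort: compared as text, "10" would sort before "9".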

CummeRbund

CummeRbund generates two outputs:

  • The plot, which visualizes our RNA-Seq results of interest
  • The ggplot object responsible for generating the plot

In this section we will parametrize CummeRbund to create different kinds of plots from our input data.

Hands-on: Visualization
  1. CummeRbund tool with the following parameters
    • param-repeat Insert plots
      • “Width”: 800
      • “Height”: 600
      • “Plot type”: Expression Plot
        • “Expression levels to plot”: Isoforms
        • “Gene ID”: NDUFV1

The input data used to create the visualization comprise 3 conditions: hits7 (Patient 1), hits8 (Patient 2), and hits9 (Control).

Our first CummeRbund plot is the “Expression Plot” of the isoforms of gene NDUFV1, which shows the expression differences of isoforms NM_001166102 and NM_007103 among the three conditions. Error bars capture the variability of the distribution of FPKM values: the broader the distribution of FPKM values, the larger the corresponding error bar.

Figure: Expression plot.

Our plot has a modest number of isoforms, and is therefore easy to read. However, with a high number of isoforms and high expression variability among conditions, the plot can look very busy. In that case we can switch to another type of plot.

Hands-on: Visualization
  1. CummeRbund tool with the following parameters
    • param-repeat Click on “Insert plots”
      • “Width”: 800
      • “Height”: 600
      • “Plot type”: Expression Bar Plot
        • “Expression levels to plot”: Isoforms
        • “Gene ID”: NDUFV1

Figure: Expression bar plot.

The Expression Bar Plot of gene NDUFV1’s isoforms NM_001166102 and NM_007103 shows the expression changes across the three aforementioned conditions.

Comment

These plots are also shown in this Galaxy video tutorial.

Let’s now create a heatmap to plot the expression levels of the significant differentially expressed gene isoforms obtained from our filter and sort operations. As a showcase example, let’s consider only the top 5 differentially expressed genes.
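
Since several isoforms of the same gene can appear in the sorted table, picking the top genes means collecting distinct gene names in order of appearance. A minimal sketch, assuming records with a "gene" field as in the table described earlier (the function name and the demo gene names are hypothetical):

```python
def top_genes(sorted_records, n):
    """Return the first n distinct gene names, preserving their order of appearance."""
    if n <= 0:
        return []
    seen = []
    for record in sorted_records:
        gene = record["gene"]
        if gene not in seen:
            seen.append(gene)
        if len(seen) == n:
            break
    return seen


# Made-up, already sorted records: GENE_A appears twice (two isoforms).
sorted_records = [
    {"gene": "GENE_A"},
    {"gene": "GENE_A"},
    {"gene": "GENE_B"},
    {"gene": "GENE_C"},
]

selected = top_genes(sorted_records, 2)
print(selected)
```

The resulting names are what would be typed into the “Gene ID” fields of the heatmap.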

Hands-on: Visualization
  1. CummeRbund tool with the following parameters
    • param-repeat Insert plots
      • “Width”: 800
      • “Height”: 600
      • “Plot type”: Heatmap
        • “Expression levels to plot”: Isoforms
        • “Gene ID”: LIMCH1
        • param-repeat Insert Genes
          • “Gene ID”: IFNL2
        • param-repeat Insert Genes
          • “Gene ID”: CXCL11
        • param-repeat Insert Genes
          • “Gene ID”: NUB1
      • “Cluster by”: Both

Figure: Heatmap.

Heatmap of the significant differentially expressed isoforms of genes LIMCH1, IFNL2, CXCL11, and NUB1. Differences in up- and down-regulation of certain gene isoforms across Patient 1 (hits7), Patient 2 (hits8), and Control (hits9) are visible in the lower and upper parts of the heatmap.

Comment

For more sophisticated visualizations of your RNA-Seq analysis results, try selecting different CummeRbund plot options and parametrizations. Also have a look at CummeRbund’s manual. Alternatively, you can modify a plot’s style by changing CummeRbund’s R output. CummeRbund’s R outputs are ggplot objects. Look here to learn how to change fonts, colors, error bars, and more.

Conclusion

Visualization tools help researchers make sense of data by providing a bird’s-eye view of the underlying analysis results. In this tutorial we reviewed the advantages of visualizing RNA-Seq results with CummeRbund, and gained insights into CuffDiff’s large output by plotting information about the most significant differentially expressed genes in our RNA-Seq analysis.

Key points
  • Extract information from a SQLite CuffDiff database

  • Filter and sort results to highlight differentially expressed genes of interest

  • Generate publication-ready visualizations of RNA-Seq analysis results

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

Citing this Tutorial

  1. Andrea Bagnacani, 2022 Visualization of RNA-Seq results with CummeRbund (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rna-seq-viz-with-cummerbund/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012


@misc{transcriptomics-rna-seq-viz-with-cummerbund,
author = "Andrea Bagnacani",
title = "Visualization of RNA-Seq results with CummeRbund (Galaxy Training Materials)",
year = "2022",
month = "10",
day = "18",
url = "\url{https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rna-seq-viz-with-cummerbund/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                   

Congratulations on successfully completing this tutorial!