2: RNA-seq counts to genes

Overview
Questions:
  • What are the differentially expressed genes in the mammary gland of pregnant versus lactating mice?

  • How to analyze RNA count data using limma-voom?

  • How to perform quality control (QC) of RNA-seq count data?

Objectives:
  • Analysis of RNA-seq count data using limma-voom

  • QC of count data

  • Visualisation and interactive exploration of count data

  • Identification of differentially expressed genes

Requirements:
Time estimation: 2 hours
Supporting Materials:
Last modification: Oct 18, 2022
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License The GTN Framework is licensed under MIT

Introduction

Measuring gene expression on a genome-wide scale has become common practice over the last two decades or so, with microarrays predominantly used pre-2008. With the advent of next generation sequencing technology in 2008, an increasing number of scientists use this technology to measure and understand changes in gene expression in often complex systems. As sequencing costs have decreased, using RNA-Seq to simultaneously measure the expression of tens of thousands of genes for multiple samples has never been easier. The cost of these experiments has now moved from generating the data to storing and analysing it.

There are many steps involved in analysing an RNA-Seq experiment. The analysis begins with sequencing reads (FASTQ files). These are usually aligned to a reference genome, if available. Then the number of reads mapped to each gene can be counted. This results in a table of counts, which is what we perform statistical analyses on to determine differentially expressed genes and pathways. The purpose of this tutorial is to demonstrate how to perform differential expression on count data with limma-voom. How to generate counts from reads (FASTQs) is covered in the accompanying tutorial RNA-seq reads to counts.

Mouse mammary gland dataset

The data for this tutorial comes from a Nature Cell Biology paper by Fu et al. 2015. Both the raw data (sequence reads) and processed data (counts) can be downloaded from Gene Expression Omnibus database (GEO) under accession number GSE60450.

This study examined the expression profiles of basal and luminal cells in the mammary gland of virgin, pregnant and lactating mice. Six groups are present, with one for each combination of cell type and mouse status. Note that two biological replicates are used here, two independent sorts of cells from the mammary glands of virgin, pregnant or lactating mice, however three replicates is usually recommended as a minimum requirement for RNA-seq. In this tutorial we will use the GEO counts file as a starting point for our analysis. Alternatively, you could create a count matrix from the raw sequence reads, as demonstrated in the RNA-seq reads to counts tutorial. The GEO count file was generated from aligning the reads to the mouse mm10 genome with the Rsubread aligner (Liao et al. 2019), followed by counting reads mapped to RefSeq genes with featureCounts (Liao et al. 2013), see the Fu paper for details.

We will use limma-voom (Law et al. 2014) for identifying differentially expressed genes here. Other popular alternatives are edgeR (Robinson et al. 2010) and DESeq2 (Love et al. 2014). Limma-voom has been shown to be perform well in terms of precision, accuracy and sensitivity (Costa-Silva et al. 2017) and, due to its speed, it’s particularly recommended for large-scale datasets with 100s of samples (Chen et al. 2016).

This is a Galaxy tutorial based on material from the COMBINE R RNAseq workshop, first taught here.

Tutorial Dataset.
Figure 1: Tutorial Dataset
Agenda

In this tutorial, we will deal with:

  1. Introduction
  2. Preparing the inputs
    1. Import data
    2. Get gene annotations
  3. Differential expression with limma-voom
    1. Filtering to remove lowly expressed genes
    2. Normalization for composition bias
    3. Specify Contrast(s) of interest
  4. QC of count data
    1. Multidimensional scaling plot
    2. Density plots
    3. Box plots
    4. Voom variance plot
    5. MD and Volcano plots for DE results
  5. Testing relative to a threshold (TREAT)
  6. Visualising results
    1. Heatmap of top genes
    2. Stripcharts of top genes
    3. Interactive DE plots (Glimma)
  7. Conclusion
Comment: Results may vary

Your results may be slightly different from the ones presented in this tutorial due to differing versions of tools, reference data, external databases, or because of stochastic processes in the algorithms.

Preparing the inputs

We will use three files for this analysis:

  • Count matrix (genes in rows, samples in columns)
  • Sample information file (sample id, group)
  • Gene annotation file (gene id, symbol, description)

Import data

Hands-on: Data upload
  1. Create a new history for this RNA-seq exercise e.g. RNA-seq with limma-voom

    Click the new-history icon at the top of the history panel.

    If the new-history is missing:

    1. Click on the galaxy-gear icon (History options) on the top of the history panel
    2. Select the option Create New from the menu
    1. Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
    2. Type the new name
    3. Press Enter
  2. Import the mammary gland counts table and the associated sample information file.

    To import the files, there are two options:

    • Option 1: From a shared data library if available (GTN - Material -> transcriptomics -> 2: RNA-seq counts to genes)
    • Option 2: From Zenodo

      • Copy the link location
      • Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)

      • Select Paste/Fetch Data
      • Paste the link into the text field

      • Press Start

      • Close the window
    • You can paste both links below into the Paste/Fetch box:
     https://zenodo.org/record/4273218/files/countdata.tsv
     https://zenodo.org/record/4273218/files/factordata.tsv
    
  3. Rename the counts dataset as countdata and the sample information dataset as factordata using the galaxy-pencil (pencil) icon.
  4. Check that the datatype is tabular. If the datatype is not tabular, please change the file type to tabular.

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click on the galaxy-chart-select-data Datatypes tab on the top
    • Select tabular
      • tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

Let’s take a look at the data. The countdata file contains information about genes (one gene per row), the first column has the Entrez gene id and the remaining columns contain information about the number of reads aligning to the gene in each experimental sample. There are two replicates for each cell type and time point (detailed sample info can be found in file “GSE60450_series_matrix.txt” from the GEO website). The first few rows and columns of the seqdata file are shown below.

countdata file.
Figure 2: Count file

The factordata file contains basic information about the samples that we will need for the analysis. See below. Note that the sample ids must match exactly with the ids in the counts file.

factordata file.
Figure 3: Sample information file

The files above have been formatted for you. If you are interested to know how they were formatted the information is below.

The original files are available at

```
https://zenodo.org/record/4273218/files/GSE60450_Lactation-GenewiseCounts.txt
https://zenodo.org/record/4273218/files/SampleInfo.txt
```
seqdata file.
Figure 4: Count file (before formatting)
sampleinfo file.
Figure 5: Sample information file (before formatting)

To create the file, countdata, that contains only the counts for the 12 samples i.e. we’ll remove the gene length column with the Cut columns from a table (cut) tool. The sample names are also pretty long so we’ll use the >Replace Text in entire line tool to shorten these to contain only the relevant information about each sample. We will also replace the hyphen in the sample names with a dot so they match the names in the sample information file.

Hands-on: Format the counts data
  1. Cut columns from a table (cut) tool with the following parameters:
    • param-file “File to cut”: seqdata
    • param-select “Operation”: Discard
    • param-select “List of fields”: Select Column:2
  2. Replace Text in entire line tool with the following parameters:
    • param-file “File to process”: output of Cut tool
    • param-text “Find pattern”: _B[A-Z0-9_]+
  3. Replace Text in entire line tool with the following parameters:
    • param-file “File to process”: output of Replace Text tool
    • param-text “Find pattern”: -
    • param-text “Replace with”: .
  4. Rename file as countdata using the galaxy-pencil (pencil) icon.

To create the file, factordata, that contains the groups information that we need for the limma-voom tool. We’ll combine the cell type and mouse status to make 6 groups e.g. we’ll combine the CellType basal with the Status pregnant for the group basalpregnant. We’ll use the Merge Columns tool to combine the cell type and mouse status columns in the sample information file, making a column with the 6 group names.

Hands-on: Format the sample information file
  1. Merge Columns together tool with the following parameters:
    • param-file “Select data”: sampleinfo
    • param-select “Merge column”: Column: 3
    • param-select “with column”: Column: 4
  2. Cut columns from a table (cut) tool with the following parameters:
    • param-file “File to cut”: output of Merge Columns tool
    • param-select “Operation”: Keep
    • param-select “List of fields”: Select Column:2 and Column:5
  3. Rename file as factordata using the galaxy-pencil (pencil) icon.

Get gene annotations

Optionally, gene annotations can be provided to the limma-voom tool and if provided the annotation will be available in the output files. We’ll get gene symbols and descriptions for these genes using the Galaxy annotateMyIDs tool, which provides annotations for human, mouse, fruitfly and zebrafish.

Hands-on: Get gene annotations
  1. annotateMyIDs tool with the following parameters:
    • param-file “File with IDs”: countdata
    • param-check “File has header”: Yes
    • param-select “Organism”: Mouse
    • param-select “ID Type”: Entrez
    • param-checkOutput columns”: tick
      • ENTREZID
      • SYMBOL
      • GENENAME
  2. Rename file as annodata using the galaxy-pencil (pencil) icon. The file should look like below.
    annodata file.
    Figure 6: Gene annotation file
  3. Check the number of lines shown on the datasets in the history, there should be 27,180 lines in both. There must be the same number of lines (rows) in the counts and annotation.

Differential expression with limma-voom

Filtering to remove lowly expressed genes

It is recommended to filter for lowly expressed genes when running the limma-voom tool. Genes with very low counts across all samples provide little evidence for differential expression and they interfere with some of the statistical approximations that are used later in the pipeline. They also add to the multiple testing burden when estimating false discovery rates, reducing power to detect differentially expressed genes. These genes should be filtered out prior to further analysis.

There are a few ways to filter out lowly expressed genes. When there are biological replicates in each group, in this case we have a sample size of 2 in each group, we favour filtering on a minimum counts-per-million (CPM) threshold present in at least 2 samples. Two represents the smallest sample size for each group in our experiment. In this dataset, we choose to retain genes if they are expressed at a CPM above 0.5 in at least two samples. The CPM threshold selected can be compared to the raw count with the CpmPlots (see below).

The limma tool uses the cpm function from the edgeR package (Robinson et al. 2010) to generate the CPM values which can then be filtered. Note that by converting to CPMs we are normalizing for the different sequencing depths for each sample. A CPM of 0.5 is used as it corresponds to a count of 10-15 for the library sizes in this data set. If the count is any smaller, it is considered to be very low, indicating that the associated gene is not expressed in that sample. A requirement for expression in two or more libraries is used as each group contains two replicates. This ensures that a gene will be retained if it is only expressed in one group. Smaller CPM thresholds are usually appropriate for larger libraries. As a general rule, a good threshold can be chosen by identifying the CPM that corresponds to a count of 10, which in this case is about 0.5. You should filter with CPMs rather than filtering on the counts directly, as the latter does not account for differences in library sizes between samples.

Normalization for composition bias

In an RNA-seq analysis, the counts are normalized for different sequencing depths between samples. Normalizing to eliminate composition biases between samples is also typically performed. Composition biases can occur, for example, if there are a few highly expressed genes dominating in some samples, leading to less reads from other genes. By default, TMM normalization (Robinson and Oshlack 2010) is performed by the limma tool using the edgeR calcNormFactors function (this can be changed under Advanced Options). TMM stands for Trimmed Mean of M values, where a weighted trimmed mean of the log expression ratios is used to scale the counts for the samples. See the figure from the TMM paper below. Note the plot (Figure 1c) that shows how a few highly expressed genes in the liver sample (where the arrow is) results in the majority of other genes in the sample having the appearance of being expressed lower in liver. The mid-line through the points is offset from the expected zero and the TMM normalization factor (red line) scales the counts to adjust for this.

TMM normalization.
Figure 7: TMM normalization (Robinson and Oshlack 2010)

Specify Contrast(s) of interest

Since we are interested in differences between groups, we need to specify which comparisons we want to test. For example, if we are interested in knowing which genes are differentially expressed between the pregnant and lactating group in the basal cells we specify basalpregnant-basallactate for the Contrast of Interest. Note that the group names in the contrast must exactly match the names of the groups in the factordata file. More than one contrast can be specified using the Insert Contrast button, so we could look at more comparisons of the groups here, but first we’ll take a look at basalpregnant-basallactate.

Hands-on: Differential expression with limma-voom
  1. limma tool with the following parameters:
    • param-file “Differential Expression Method”: limma-voom
    • param-select “Count Files or Matrix?”: Single Count Matrix
      • param-file “Count Matrix”: Select countdata
    • param-select “Input factor information from file?”: Yes
      • param-file “Factor File”: Select factordata
    • param-select “Use Gene Annotations?”: Yes
      • param-file “Factor File”: Select annodata
    • param-text “Contrast of Interest”: basalpregnant-basallactate
    • param-select “Filter lowly expressed genes?”: Yes
      • param-select “Filter on CPM or Count values?”: CPM
      • param-text “Minimum CPM”: 0.5
      • param-text “Minimum Samples”: 2
  2. Inspect the Report produced by clicking on the galaxy-eye (eye) icon

If we need to account for additional sources of variation, for example, batch, sex, genotype etc, we can input that information as additional factors. For example, if we were interested in the genes differentially expressed between the luminal and basal cell types, we could include an additional column to account for the variation due to the different stages.

Multiple factors.
Figure 8: Multiple factors

QC of count data

Before we check out the differentially expressed genes, we can look at the Report information to check that the data is good quality and that the samples are as we would expect.

Multidimensional scaling plot

By far, one of the most important plots we make when we analyse RNA-Seq data are MDS plots. An MDS plot is a visualisation of a principal components analysis, which determines the greatest sources of variation in the data. A principal components analysis is an example of an unsupervised analysis, where we don’t need to specify the groups. If your experiment is well controlled and has worked well, what we hope to see is that the greatest sources of variation in the data are the treatments/groups we are interested in. It is also an incredibly useful tool for quality control and checking for outliers. This Galaxy limma tool outputs an MDS plot by default in the Report and a link is also provided to a PDF version (MDSPlot_CellTypeStatus.pdf). A scree plot is also produced that shows how much variation is attributed to each dimension. If there was a batch effect for example, you may see high values for additional dimensions, and you may choose to include batch as an additional factor in the differential expression analysis. The limma tool plots the first two dimensions by default (1 vs 2), however you can also plot additional dimensions 2 vs 3 and 3 vs 4 using under Output Options Additional Plots MDS Extra. These are displayed in the Report along with a link to a PDF version (MDSPlot_extra.pdf). Selecting the Glimma Interactive Plots will generate an interactive version of the MDS plot, see the plots section of the report below. If outlier samples are detected you may decide to remove them. Alternatively, you could downweight them by choosing the option in the limma tool Apply voom with sample quality weights?. The voom sample quality weighting is described in the paper Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses (Liu et al. 2015).

MDS Plot.
Figure 9: MDS Plot

Take a look at the MDS plot coloured by group.

Question

Do you notice anything about the samples in this plot?

Two samples don’t appear to be in the right place.

It turns out that there has been a mix-up with two samples, they have been mislabelled in the sample information file. This shows how the MDS plot can also be useful to help identify if sample mix-ups may have occurred. We need to redo the limma-voom analysis with the correct sample information.

Hands-on: Use the Rerun button to redo steps
  1. Import the correct sample information file from https://zenodo.org/record/4273218/files/factordata_fixed.tsv and name as factordata_fixed
  2. Use the Rerun galaxy-refresh button in the History to run limma tool as before with the factordata_fixed file and adding the following parameters:
    • Output Options
      • param-check “Additional Plots” tick:
        • Glimma Interactive Plots
        • Density Plots (if filtering)
        • CpmsVsCounts Plots (if filtering on cpms)
        • Box Plots (if normalising)
        • MDS Extra (Dims 2vs3 and 3vs4)
        • MD Plots for individual samples
        • Heatmaps (top DE genes)
        • Stripcharts (top DE genes)
  3. Delete the incorrect factordata file and its limma outputs to avoid any confusion.

In the Report you should then see the correct MDS plot as below.

MDS Plot.
Figure 10: MDS Plot (correct samples)
Question

What is the greatest source of variation in the data (i.e. what does dimension 1 represent)? What is the second greatest source of variation in the data?

Dimension 1 represents the variation due to cell type, basal vs luminal. Dimension 2 represents the variation due to the stages, virgin, pregnant or lactating.

Next, scroll down the Report to take a look at the Additional information and Summary of experimental data sections near the bottom. It should look similar to below. Here you can check that the correct samples have been assigned to the correct groups, what settings were used (e.g. filters, normalization method) and also how many genes were filtered out due to low expression.

Report summary.

Question

How many genes have been filtered out for low expression?

11375 genes were filtered out as insignificant as they were without more than 0.5 CPM in at least 2 samples.

Density plots

Density plots can be output in the Report if Filter lowly expressed genes is selected. A link is also provided in the Report to a PDF version (DensityPlots.pdf). These plots allow comparison of the counts distributions before and after filtering. The samples are coloured by the groups. Count data is not normally distributed, so if we want to examine the distributions of the raw counts we need to log the counts. We typically check the distribution of the read counts on the log2 scale. A CPM value of 1 is equivalent to a log-CPM value of 0 and the CPM we used of 0.5 is equivalent to a log-CPM of -1. It can be seen in the Raw counts (before filtering) plot below, that a large proportion of genes within each sample are not expressed or lowly-expressed, and the Filtered counts plot shows our filter of CPM of 0.5 (in at least 2 samples) removes a lot of these uninformative genes.

Density Plots.
Figure 11: Density Plots

The Report provides links to PDFs of all plots shown in the Report and also to the rest of the additional plots selected to be output.

Report Outputs.
Figure 12: Report outputs

We can also have a look more closely to see whether our threshold of 0.5 CPM does indeed correspond to a count of about 10-15 reads in each sample with the plots of CPM versus raw counts.

Click on the CpmPlots.pdf link in the Report. You should see 12 plots, one for each sample. Two of the plots are shown below. From these plots we can see that 0.5 CPM is equivalent to ~10 counts in each of the 12 samples, so 0.5 seems to be an appropriate threshold for this dataset (these samples all have sequencing depth of 20-30 million, see the Library information file below, so a CPM value of 0.5 would be ~10 counts).

CPM threshold Plots.
Figure 13: CPM vs Raw Counts Plots
Comment: Thresholds
  • A threshold of 1 CPM in at least minimum group sample size is a good rule of thumb for samples with about 10 million reads. For larger library sizes increase the CPM theshold and for smaller library sizes decrease it. Check the CpmPlots to see if the selected threshold looks appropriate for the samples (equivalent to ~10 reads).

Box plots

We can also use box plots to check the distributions of counts in the samples. Box plots can be selected to be output by the Galaxy limma-voom tool if normalization is applied (TMM is applied by default). The plots are output in the Report and a link is also provided to a PDF version (BoxPlots.pdf). The samples are coloured by the groups. With the box plots for these samples we can see that overall the distributions are not identical but still not very different. If a sample is really far above or below the blue horizontal line we may need to investigate that sample further.

Box Plots.
Figure 14: Box Plots
Question

Compare the box plots before and after TMM normalisation. Can you see any differences?

After the normalization more of the samples are closer to the median horizontal line.

The TMM normalization generates normalization factors, where the product of these factors and the library sizes defines the effective library size. TMM normalization (and most scaling normalization methods) scale relative to one sample. The normalization factors multiply to unity across all libraries. A normalization factor below one indicates that the library size will be scaled down, as there is more suppression (i.e., composition bias) in that library relative to the other libraries. This is also equivalent to scaling the counts upwards in that sample. Conversely, a factor above one scales up the library size and is equivalent to downscaling the counts. We can see the normalization factors for these samples in the Library information file if we select to output it with “Output Library information file?”: Yes. Click on the galaxy-eye (eye) icon to view.

Library Info file.
Figure 15: Library information file
Question

Which sample has the largest normalization factor? Which sample has the smallest?

MCL1.LA has the largest normalization factor and MCL1.LE the smallest.

It is considered good practice to make mean-difference (MD) plots for all the samples as a quality check, as described in this edgeR workflow article. These plots allow expression profiles of individual samples to be explored more closely. An MD plot shows the log-fold change between a sample against the average expression across all the other samples. This visualisation can help you see if there are genes highly upregulated or downregulated in a sample. If we look at mean difference plots for these samples, we should be able to see the composition bias problem. The mean-difference plots show average expression (mean: x-axis) against log-fold-changes (difference: y-axis).

Click on the MDPlots_Samples.pdf link in the Report. You should see 12 MD plots, one for each sample. Let’s take a look at the plots for the two samples MCL1.LA and MCL1.LE that had the largest and smallest normalization factors. The MD plots on the left below show the counts normalized for library size and the plots on the right show the counts after the TMM normalization has been applied. MCL1.LA had the largest normalization factor and was above the median line in the unnormalized by TMM box plots. MCL1.LE had the smallest normalization factor and was below the median line in the box plots. These MD plots help show the composition bias problem has been addressed.

MD Plot LA.
Figure 16: MD Plots for MCL1.LA before and after TMM normalization
MD Plot LE.
Figure 17: MD Plots for MCL1.LE before and after TMM normalization

Voom variance plot

This plot is generated by the voom method and displayed in the Report along with a link to a PDF version (VoomPlot.pdf). Each dot represents a gene and it shows the mean-variance relationship of the genes in the dataset. This plot can help show if low counts have been filtered adequately and if there is a lot of variation in the data, as shown in the More details on Voom variance plots box below.

Voom Plot.
Figure 18: Voom Plot

The SA plot below plots log2 residual standard deviations against mean log-CPM values. The average log2 residual standard deviation is marked by a horizontal blue line. This plot shows how the dependence between the means and variances has been removed after the voom weights are applied to the data.

SA Plot.
Figure 19: SA Plot

If we didn’t filter this dataset for the lowly expressed genes the variance plot would look like below.

Voom Nofilter Plot.
Figure 20: Voom Plot unfiltered counts

If we look at the plot generated with the two samples mixed up we can see there’s more variation.

Voom Mixedup Plot.
Figure 21: Voom Plot mixed-up samples

More examples of the variation this plot can show can be seen in Figure 1 from the limma-voom paper (Law et al. 2014), shown below.

Voom Plot Examples.
Figure 22: Voom Plot more examples

Figure 1: Mean-variance relationships. Gene-wise means and variances of RNA-seq data are represented by black points with a LOWESS trend. Plots are ordered by increasing levels of biological variation in datasets. (a) voom trend for HBRR and UHRR genes for Samples A, B, C and D of the SEQC project; technical variation only. (b) C57BL/6J and DBA mouse experiment; low-level biological variation. (c) Simulation study in the presence of 100 upregulating genes and 100 downregulating genes; moderate-level biological variation. (d) Nigerian lymphoblastoid cell lines; high-level biological variation. (e) Drosophila melanogaster embryonic developmental stages; very high biological variation due to systematic differences between samples. (f) LOWESS voom trends for datasets (a)–(e). HBRR, Ambion’s Human Brain Reference RNA; LOWESS, locally weighted regression; UHRR, Stratagene’s Universal Human Reference RNA.

MD and Volcano plots for DE results

Genome-wide plots that are useful for checking differentially expressed (DE) results are MD plots (or MA plots) and Volcano plots. There are functions in limma for generating these plots and they are used by this tool. These plots are output by default and shown in the Report along with a link to PDF versions (MDPlot_basalpregnant-basallactate.pdf and VolcanoPlot_basalpregnant-basallactate.pdf). In the volcano plot the top genes (by adjusted p-value) are highlighted. The number of top genes is 10 by default and the user can specify the number of top genes to view (up to 100) under Advanced Options. For further information on volcano plots see the Volcano Plot tutorial

MDVol Plot.
Figure 23: MD Plot and Volcano Plot

The MD Plot highlighted genes are significant at an adjusted p-value (adj.P) threshold of 0.05 and exhibit log2-fold-change (lfc) of at least 0. These thresholds can be changed under Advanced Options.

Question

How many genes are differentially expressed at the default thresholds of adj.P=0.05 and lfc=0?

The number of DE genes at these adj.P and lfc thresholds is shown in the table in the Report as below.

DE counts.

Comment: on adjusted P value

A note about deciding how many genes are significant: In order to decide which genes are differentially expressed, we usually take a cut-off (e.g. 0.05 or 0.01) on the adjusted p-value, NOT the raw p-value. This is because we are testing many genes (more than 15000 genes here), and the chances of finding differentially expressed genes is very high when you do that many tests. Hence we need to control the false discovery rate, which is the adjusted p-value column in the results table. What this means is that, if we choose an adjusted p-value cut-off of 0.05, and if 100 genes are significant at a 5% false discovery rate, we are willing to accept that 5 will be false positives.

Testing relative to a threshold (TREAT)

When there is a lot of differential expression, sometimes we may want to cut-off on a fold change threshold, as well as a p-value threshold, so that we follow up on the most biologically significant genes. However, it is not recommended to simply rank by p-value and then discard genes with small logFC’s, as this has been shown to increase the false discovery rate. In other words, you are not controlling the false discovery rate at 5% any more. There is a function called treat in limma that performs this style of analysis correctly (McCarthy and Smyth 2009). TREAT will simply take a user-specified log fold change cut-off and recalculate the moderated t-statistics and p-values with the new information about logFC. There are thousands of genes differentially expressed in this basalpregnant-basallactate comparison, so let’s rerun the analysis applying TREAT and similar thresholds to what was used in the Fu paper: an adjusted P value of 0.01 (1% false discovery rate) and a log-fold-change cutoff of 0.58 (equivalent to a fold change of 1.5).

Hands-on: Testing relative to a threshold (TREAT)
  1. Use the Rerun galaxy-refresh button in the History to rerun limma tool with the following parameters:
    • Advanced Options
      • param-text “Minimum Log2 Fold Change”: 0.58
      • param-text “P-Value Adjusted Threshold”: 0.01
      • param-check “Test significance relative to a fold-change threshold (TREAT)”: Yes
  2. Add a tag #treat to the Report output and inspect the report
  • Click on the dataset
  • Click on galaxy-tags Edit dataset tags
  • Add a tag starting with #

    Tags starting with # will be automatically propagated to the outputs of tools using this dataset.

  • Check that the tag is appearing below the dataset name

We can see that much fewer genes are now highlighted in the MD plot and identified as differentially expressed.

TREAT MDVol Plot.
Figure 24: TREAT MD and Volcano plots
TREAT DE counts.
Figure 25: TREAT differentially expressed genes

Visualising results

In addition to the plots already discussed, it is recommended to have a look at the expression levels of the individual samples for the genes of interest, before following up on the DE genes with further lab work. The Galaxy limma tool can auto-generate heatmaps of the top genes to show the expression levels across the samples. This enables a quick view of the expression of the top differentially expressed genes and can help show if expression is consistent amongst replicates in the groups.

Heatmap of top genes

Click on the Heatmap_basalpregnant-basallactate.pdf link in the Report. You should see a plot like below.

Heatmap.
Figure 26: Heatmap of top genes

Stripcharts of top genes

The limma-voom tool can also auto-generate stripcharts to view the expression of the top genes across the groups. Click on the Stripcharts_basalpregnant-basallactate.pdf link in the Report. You should see 10 plots, one for each top gene. Four are shown below. Note that here you can see if replicates tend to group together and how the expression compares to the other groups.

Stripchart Plot.
Figure 27: Stripcharts of top genes

Interactive DE plots (Glimma)

Interactive versions of the MD and Volcano plots can be output by the limma-voom tool via the Glimma package (Su et al. 2017), if a gene annotation file is provided and Glimma Interactive Plots is selected. Links to the Glimma html pages are generated in the Report.

Let’s take a look at the interactive MD plot. Click on the Glimma_MDPlot_basalpregnant-basallactate.html link in the Report. You should see a two-panel interactive MD plot like below. The left plot shows the log-fold-change vs average expression. The right plot shows the expression levels of a particular gene of each sample by groups (similar to the stripcharts). Hovering over points on left plot will plot expression level for corresponding gene, clicking on points will fix the expression plot to gene. Clicking on rows on the table has the same effect as clicking on the corresponding gene in the plot.

Hands-on: Search for a gene of interest

Egf was a gene identifed as very highly expressed in the Fu paper and confirmed with qRT-PCR, see Fig. 6c from the paper below.

Fu EGF gene.
Figure 28: Fu et al, Nat Cell Biol 2015

Search for Egf in the Glimma interactive table. You should see something similar to below.

Glimma EGF gene.
Figure 29: Glimma EGF gene

Notice that in the plot on the right above, showing Egf expression for all samples, we can see it is highly expressed in the luminal lactate group but not the other samples.

Multiple contrasts can be run with the limma tool. For example, we can compare the pregnant and lactating conditions for both the basal and luminal cells. So let’s rerun the limma-voom TREAT analysis (adj.P <0.01 and lfc=0.58) and this time use the Insert Contrast button to include the additional contrast luminalpregnant - luminallactate. We can then see how the number of differentially expressed genes in the luminal cells compares to the basal cells.

Hands-on: Run multiple contrasts
  1. Use the Rerun galaxy-refresh button in the History to rerun the limma tool #treat analysis adding the following parameters (i.e. run with 2 contrasts):
    • param-text “Contrast of Interest”: basalpregnant-basallactate
    • param-text “Contrast of Interest”: luminalpregnant-luminallactate
  2. Add a tag #multiple-contrasts to the Report output and inspect the report

You should find that there are more genes differentially expressed for the luminal cells than the basal. There are ~274 genes DE in basal cells versus ~ 1610 in the luminal cells.

Basal Luminal DE counts.
Figure 30: Basal vs Luminal DE counts

This is similar to what Fu et al found, many more genes differentially expressed in the luminal cells on lactation, compared to the basal cells.

Fu DE genes.
Figure 31: Fu et al, Nat Cell Biol 2015

The tables of differentially expressed genes are output as links in the Report (limma-voom_basalpregnant-basallactate.tsv and limma-voom_luminalpregnant-luminallactate.tsv), see below, and also as datasets in the history (DE tables). With multiple contrasts, a plot for each contrast is generated for relevant plots, as shown below. This enables a quick and easy visual comparison of the contrasts.

Multiple contrasts.
Figure 32: Multiple contrasts output
  1. The Report with all the plots and tables can be downloaded by clicking on the floppy disk icon on the dataset in the history.
  2. If you want to see the R code used, or to explore some of the outputs in R, you can select to output the Rscript and/or RData files in the tool form.

To see some methods for identifying differentially expressed pathways in this dataset, see the follow-on tutorial RNA-seq genes to pathways. To see how to create a heatmap of custom genes using this dataset, see the tutorial Visualization of RNA-Seq results with heatmap2

Conclusion

In this tutorial we have seen how counts files can be converted into differentially expressed genes with limma-voom. This follows on from the accompanying tutorial, RNA-seq reads to counts, that showed how to generate counts from the raw reads (FASTQs) for this dataset. In this part we have seen ways to visualise the count data, and QC checks that can be performed to help assess the quality and results. We have also reproduced results similar to what the authors found in the original paper with this dataset. For further reading on analysis of RNA-seq count data and the methods used here, see the articles; RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR (Law et al. 2018) and From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline (Chen et al. 2016).

Key points
  • The limma-voom tool can be used to perform differential expression and output useful plots

  • Multiple comparisons can be input and compared

  • Results can be interactively explored with limma-voom via Glimma

Frequently Asked Questions

Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Transcriptomics topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

References

  1. McCarthy, D. J., and G. K. Smyth, 2009 Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics 25: 765–771. 10.1093/bioinformatics/btp053
  2. Robinson, M. D., D. J. McCarthy, and G. K. Smyth, 2010 edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26: 139–140.
  3. Robinson, M. D., and A. Oshlack, 2010 A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology 11: 1–9.
  4. Liao, Y., G. K. Smyth, and W. Shi, 2013 featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30: 923–930. https://academic.oup.com/bioinformatics/article/30/7/923/232889
  5. Law, C. W., Y. Chen, W. Shi, and G. K. Smyth, 2014 voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology 15: R29. 10.1186/gb-2014-15-2-r29
  6. Love, M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15: 550. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8
  7. Fu, N. Y., A. C. Rios, B. Pal, R. Soetanto, A. T. L. Lun et al., 2015 EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival. Nature Cell Biology 17: 365–375. 10.1038/ncb3117
  8. Liu, R., A. Z. Holik, S. Su, N. Jansz, K. Chen et al., 2015 Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses. Nucleic Acids Research 43: e97–e97. 10.1093/nar/gkv412
  9. Chen, Y., A. T. L. Lun, and G. K. Smyth, 2016 From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Research 5: 1438. 10.12688/f1000research.8987.2
  10. Costa-Silva, J., D. Domingues, and F. M. Lopes, 2017 RNA-Seq differential expression analysis: An extended review and a software tool (Z. Wei, Ed.). PLOS ONE 12: e0190152. 10.1371/journal.pone.0190152
  11. Su, S., C. W. Law, C. Ah-Cann, M.-L. Asselin-Labat, M. E. Blewitt et al., 2017 Glimma: interactive graphics for gene expression analysis (B. Berger, Ed.). Bioinformatics 33: 2050–2052. 10.1093/bioinformatics/btx094
  12. Law, C. W., M. Alhamdoosh, S. Su, X. Dong, L. Tian et al., 2018 RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Research 5: 1408. 10.12688/f1000research.9005.3
  13. Liao, Y., G. K. Smyth, and W. Shi, 2019 The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research 47: e47–e47. 10.1093/nar/gkz114

Feedback

Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.

Click here to load Google feedback frame

Citing this Tutorial

  1. Maria Doyle, Belinda Phipson, Jovana Maksimovic, Anna Trigos, Matt Ritchie, Harriet Dashnow, Shian Su, Charity Law, 2022 2: RNA-seq counts to genes (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rna-seq-counts-to-genes/tutorial.html Online; accessed TODAY
  2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012


@misc{transcriptomics-rna-seq-counts-to-genes,
author = "Maria Doyle and Belinda Phipson and Jovana Maksimovic and Anna Trigos and Matt Ritchie and Harriet Dashnow and Shian Su and Charity Law",
title = "2: RNA-seq counts to genes (Galaxy Training Materials)",
year = "2022",
month = "10",
day = "18"
url = "\url{https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rna-seq-counts-to-genes/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
    doi = {10.1016/j.cels.2018.05.012},
    url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
    year = 2018,
    month = {jun},
    publisher = {Elsevier {BV}},
    volume = {6},
    number = {6},
    pages = {752--758.e1},
    author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
    title = {Community-Driven Data Analysis Training for Biology},
    journal = {Cell Systems}
}
                   

Congratulations on successfully completing this tutorial!