
An introduction to scRNA-seq data analysis

Contributors

Authors: Mehmet Tekman

Questions

Objectives

Last modification: Nov 24, 2021

Single-cell RNA-seq

An introduction to scRNA-seq data analysis

Speaker Notes


Bulk RNA-Seq

.pull-left[Two blobs labelled tissue A and tissue B are shown, on the right they are summarised into tables of Gene A, B, and X and their different average expression per tissue.]

.pull-right[ .reduce90[

| Attribute | Summary |
|---|---|
| Resolution | Entire tissues |
| Signal | Average gene expression per tissue |
| Differential Expression | Difference in average gene expression between tissues |

] ]

Speaker Notes


Single Cell RNA-Seq

.pull-left[Red and blue clusters of cells are shown resembling the tissue blob from the previous slide. Now the graphs on the right for expression in Genes A, B, X are shown per cell instead of per tissue.]

.pull-right[ .reduce90[

| Attribute | Summary |
|---|---|
| Resolution | Individual cells within tissues |
| Signal | Individual gene expression per cell |
| Differential Expression | Some cells express the same set of genes in the same way; comparing one set of cells against another |

] ]

Speaker Notes


From Bulk RNA to Single Cell RNA

.image-50[Tissue A and B from the first slide are shown as the collections of cells from the second slide.]

.reduce90[

Speaker Notes


Cell Capture and Replicates

.center[How do we prepare samples for sequencing?]

Speaker Notes For example, how are cells captured and sequenced?

.pull-left[ .reduce90[

Bulk RNA-seq

  1. Cut a thin slice of a tissue
  2. Add enzyme to break down cell membranes
  3. Rinse out the unwanted DNA / RNA material
  4. Perform sequencing on leftover goop

] ]

Speaker Notes In bulk RNA-seq analysis, the process involves taking a sample, removing unwanted molecules and sequencing everything else.

.pull-left[ .reduce90[

Single-cell RNA-seq

  1. Cut a thin slice of a tissue
  2. Break down the tissue into cells
  3. Isolate each cell
    • Add enzyme to break down cell membranes
    • Perform barcoding
  4. Perform sequencing in a common pool

] ]

Speaker Notes

Biological Replicates

.center[ .reduce90[

| Type | Notes |
|---|---|
| Bulk RNA-seq | Each tissue slice is a sample; another slice can be taken as a replicate |
| Single-cell RNA-seq | Each cell is a sample; it cannot be directly replicated because every cell is unique |

] ]

Speaker Notes


Capture / Sorting:

How are cells isolated?

Speaker Notes Cell isolation can be performed in different ways.

.pull-right[.image-90[A black and white image of a woman in the lab using her mouth to pipette cells from one test tube to another.]]

.pull-left[ .reduce90[

Speaker Notes One method is manual pipetting, where wet lab scientists suction up individual cells using a long thin tube.

.pull-left[ .reduce90[

Speaker Notes They can do this hundreds of times to isolate hundreds of cells, but it is error-prone, and often multiple cells are isolated together.

.pull-left[ .reduce90[

Speaker Notes Another method is flow cytometry, which reduces the human-error component of this stage.


Capture / Sorting: Flow Cytometry

.pull-right[Cartoon of a fluidics system with two lasers pointing through the fluidics system and filters and detectors detecting the amount of light reflected out of the system with an optics system. This goes through a detector to an electronics system.]

.pull-left[ .reduce90[

.pull-left[ .reduce90[

.pull-left[ .reduce90[

Speaker Notes


Capture / Sorting: Size and Type

.pull-right[ The same cartoon as previously ]

.pull-left[ Optical Scatter

]

Speaker Notes


Capture / Sorting: Size and Type

.pull-left[ .reduce90[ Forward Scatter (FSC)

.image-75[.pull-right[A coloured scatter plot showing two clumps of points labelled monocytes and lymphocytes.]]

Speaker Notes

.pull-left[ .reduce90[

Side Scatter (SSC)

.image-75[.pull-right[The same scatter plot but now monocytes and granulocytes are shown as blobs.]]

Speaker Notes Side scatter is perpendicular to the main laser, and measures the granularity of the cell, ideal for distinguishing cells with less defined internal structures, such as the granulocytes on the Y-axis of the example image.


Capture / Sorting: FACS

.pull-left[ A scatter plot cut into four regions of CD4+/- and CD8+/- .footnote[.reduce70[Image from BD Biosciences]] ]

.pull-right[ .reduce90[ Fluorescence-Activated Cell Sorting (FACS)

] ]

Speaker Notes


Barcoding Cells

.center[Groups of GGG and TCT are added to two different cells to label them.]

.footnote[Add unique barcodes to every transcript in a cell]

Speaker Notes


Barcoding Cells

.footnote[Place cells into sequencing plate]

.pull-left[Cells with barcodes are plated into individual wells based on their barcode.]

.pull-right[ .reduce90[

Speaker Notes Once the RNA molecules have been tagged by cell barcodes, they can be amplified, either separately or pooled together, where the amplified products share the same cell barcodes as their original counterparts.


Sequencing Issues: Amplification

.center[.image-75[A cartoon of a cell with a red and blue strand. The red strand amplifies well, the blue does not.]]

.reduce90[

Speaker Notes


Sequencing Issues: Amp. + UMIs

.pull-left[The same cartoon but now red and blue strands are labelled with pink and grey adapters. The red and blue both amplify but at different rates.]

.pull-right[ .reduce90[

Speaker Notes


Sequencing Issues: Amp. + UMIs

.pull-left[The same cartoon, red and blue amplify at different rates.]

.pull-right[

.center[Counting Reads

| | Reads |
|---|---|
| Red | 6 |
| Blue | 3 |

] ]

Speaker Notes

.pull-left[

.center[Grouping Reads by Gene and UMI

| | UMIs | Reads |
|---|---|---|
| Red | Pink | 2 |
| | Cyan | 4 |
| Blue | Pink | 1 |
| | Green | 2 |

] ]

.pull-right[

.center[Counting de-duplicated Reads

| | UMIs (Grouped) | # UMIs |
|---|---|---|
| Red | {Pink, Cyan} | 2 |
| Blue | {Pink, Green} | 2 |

] ]

Speaker Notes However, if we group the reads by their UMIs and count only the number of unique UMIs per transcript, de-duplicating the reads that share the same transcript and UMI, we arrive at 2 red transcripts and 2 blue transcripts, which better represents the true number of transcripts.
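The de-duplication described here can be sketched in a few lines of Python. This is a minimal illustration only: the `reads` list is a hypothetical reconstruction of the counts in the tables on this slide, and the function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical reads from one cell as (transcript, UMI) pairs,
# chosen to match the read counts in the tables on this slide.
reads = (
    [("Red", "Pink")] * 2 + [("Red", "Cyan")] * 4
    + [("Blue", "Pink")] * 1 + [("Blue", "Green")] * 2
)

def count_reads(reads):
    """Naive counting: every amplified copy is counted."""
    counts = defaultdict(int)
    for transcript, _umi in reads:
        counts[transcript] += 1
    return dict(counts)

def count_umis(reads):
    """De-duplicated counting: each distinct UMI per transcript counts once."""
    umis = defaultdict(set)
    for transcript, umi in reads:
        umis[transcript].add(umi)
    return {transcript: len(seen) for transcript, seen in umis.items()}

read_counts = count_reads(reads)
umi_counts = count_umis(reads)
print(read_counts)  # amplification makes Red look twice as abundant as Blue
print(umi_counts)   # after de-duplication both have the same count
```

Note that the Pink UMI appears on both Red and Blue; it is only de-duplicated within a transcript, never across transcripts.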


Sequencing Issues: Unique UMIs?

.pull-left[The same cartoon, red and blue amplify at different rates.] .pull-right[

| | UMIs | # Reads |
|---|---|---|
| Red | {Pink, Cyan} | 2 |
| Blue | {Pink, Green} | 2 |

.reduce90[

]

Speaker Notes


.reduce90[

Speaker Notes This is because there are often more transcripts than available UMIs: the number of transcripts depends on the cell, and the number of available UMIs depends on the length of the barcode.


Sequencing Issues: Unique UMIs?

.center[Barcodes of length N with Edit Distance of B:]

.pull-left[

.center[N = 5 and B = 1]

AAAAA AAAAC AAAAG AAAAT AAACA ····
CCCCC CCCCA CCCCG CCCCT CCCAC ····
              ·
              ·
              ·

.center[4⁵ = 1024 barcodes]

]

.pull-right[

.center[N = 5 and B = 2]

AAAAA AAACC AAAGG AAATT AACCA ····
CCCCC CCCAA CCCGG CCCTT CCAAC ····
              ·
              ·
              ·

.center[4⁵⁻¹ = 256 barcodes]

]

.footnote[

Edit distances guard against sequencing errors.

]
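As a rough check of these numbers, the sketch below enumerates barcodes of length 5. It assumes that "edit distance" means substitutions only (Hamming distance), and it uses one possible construction, a checksum base, to obtain a set in which any two barcodes differ in at least two positions. That set contains 4⁵⁻¹ = 256 barcodes and is not necessarily the set drawn on the slide.

```python
from itertools import combinations, product

BASES = "ACGT"

def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))

def barcodes_min_distance_2(n):
    """Barcodes of length n whose base indices sum to 0 modulo 4.

    The last base acts as a checksum, so a single substitution always
    produces a sequence outside the set.
    """
    codes = []
    for prefix in product(range(4), repeat=n - 1):
        check = (-sum(prefix)) % 4
        codes.append("".join(BASES[i] for i in prefix + (check,)))
    return codes

all_barcodes = ["".join(p) for p in product(BASES, repeat=5)]
spaced_barcodes = barcodes_min_distance_2(5)
min_dist = min(hamming(a, b) for a, b in combinations(spaced_barcodes, 2))

print(len(all_barcodes), len(spaced_barcodes), min_dist)
```

With a minimum distance of 2, a single sequencing error can be detected because the erroneous barcode is not in the allowed set.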

Speaker Notes


Sequencing Issues: Unique UMIs?

.pull-left[The same cartoon, red and blue amplify at different rates.] .pull-right[

| | UMIs | # Reads |
|---|---|---|
| Red | {Pink, Cyan} | 2 |
| Blue | {Pink, Green} | 2 |

.reduce90[



]

]

.reduce90[ In what context are UMIs unique?

Speaker Notes In the context of amplification, UMIs do not need to be unique; they just need to be random enough to de-duplicate transcripts and so give a more accurate estimate of the number of transcripts within a cell.


Cell Barcodes and UMIs (Recap)

For Each Cell:

  1. Add Cell Barcodes to Cells

Groups of GGG and TCT are added to two different cells to label them.

Speaker Notes So let’s just recap what we’ve learned: first, a cell barcode is added to every RNA molecule in each cell.


Cell Barcodes and UMIs (Recap)

For Each Cell:

  1. Add Cell Barcodes to Cells
  2. Add UMIs to Cell Barcoded Cells

Random mixtures of three-letter barcodes are shown, in addition to the two cells from the last cartoon, which had GGG-labelled reads in one cell and TCT-labelled reads in the other. Now they all have random prefixes before the GGG in one cell and the TCT in the other.

Speaker Notes


QC: Overcoming Background Noise

.center[A matrix of Genes 1, 2, 3 and cells per column is changed into two matrices, one with counts of genes detected per cell, and counts of cells detected per gene]
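The two summaries in this figure can be computed directly from a count matrix. The sketch below uses a small hypothetical matrix and assumed thresholds (`MIN_GENES`, `MIN_CELLS`) purely for illustration; suitable cut-offs depend on the dataset.

```python
# Hypothetical count matrix: rows are genes, columns are cells.
counts = {
    "Gene1": [0, 3, 0, 1],
    "Gene2": [5, 0, 0, 2],
    "Gene3": [0, 0, 0, 4],
}

def genes_detected_per_cell(counts):
    """For each cell (column), count the genes with at least one read."""
    n_cells = len(next(iter(counts.values())))
    return [sum(1 for row in counts.values() if row[c] > 0) for c in range(n_cells)]

def cells_detected_per_gene(counts):
    """For each gene (row), count the cells in which it was detected."""
    return {gene: sum(1 for x in row if x > 0) for gene, row in counts.items()}

genes_per_cell = genes_detected_per_cell(counts)
cells_per_gene = cells_detected_per_gene(counts)

# Assumed thresholds, for illustration only.
MIN_GENES = 1  # keep cells with at least this many detected genes
MIN_CELLS = 2  # keep genes detected in at least this many cells
kept_cells = [c for c, n in enumerate(genes_per_cell) if n >= MIN_GENES]
kept_genes = [g for g, n in cells_per_gene.items() if n >= MIN_CELLS]

print(genes_per_cell, cells_per_gene)
print(kept_cells, kept_genes)
```

Cells with very few detected genes and genes seen in very few cells are typical candidates for removal as background noise.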

Speaker Notes


Normalisation: Bulk vs Single-Cell

.pull-left[

Bulk RNA-seq: High Coverage

| | T1 | T2 | T3 |
|---|---|---|---|
| GeneA | 100 | 80 | 40 |
| GeneB | 45 | 30 | 40 |

.reduce70[* Median Gene Expression is high]


scRNA-seq: Very Low Sequencing Depth

| | C1 | C2 | C3 | C4 | C5 |
|---|---|---|---|---|---|
| GeneA | 0 | 0 | 2 | 0 | 1 |
| GeneB | 2 | 0 | 15 | 0 | 0 |

.reduce70[* Median Gene Expression is zero]

]

.pull-right[

Why is this a problem?

.center[ \(R(s,g) = \frac{X_{sg}}{(\prod_{s} X_{sg})^{\frac{1}{n}}}\)

\[DESeq(s,g) = \frac{X_{sg}}{Med_{g}(R(s,g))}\]

] ]

Speaker Notes

.pull-right[ Can’t divide by zero! ]
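To see the problem concretely, the sketch below applies a median-of-ratios calculation to the two example matrices on this slide. It is a simplified illustration rather than the DESeq implementation: this version raises an error as soon as any gene has a geometric mean of zero across samples.

```python
import math
from statistics import median

def geometric_mean(values):
    """Geometric mean of a row; zero if any value is zero."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

def size_factors(matrix):
    """Median-of-ratios size factor per sample (column); rows are genes."""
    refs = [geometric_mean(row) for row in matrix]
    if any(ref == 0 for ref in refs):
        raise ZeroDivisionError("a gene has a geometric mean of zero")
    n_samples = len(matrix[0])
    return [
        median(row[s] / ref for row, ref in zip(matrix, refs))
        for s in range(n_samples)
    ]

# The example matrices from this slide (rows: GeneA, GeneB).
bulk = [[100, 80, 40], [45, 30, 40]]
single_cell = [[0, 0, 2, 0, 1], [2, 0, 15, 0, 0]]

bulk_factors = size_factors(bulk)
print(bulk_factors)

try:
    size_factors(single_cell)
    sc_failed = False
except ZeroDivisionError as err:
    sc_failed = True
    print("single-cell:", err)
```

Because every gene in the single-cell matrix contains at least one zero, the product in the denominator of R(s,g) is zero for every gene and no ratio can be formed.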

Speaker Notes


Normalisation: SCRAN method

.footnote[.small[Pooling across cells to normalize single-cell RNA sequencing data with many zero counts, Lun et al., 2016]]

.pull-left[Blue and red bubbles are mixed, then separated into two groups, and then arranged around a circle, red going from small to large around the right half, blue from small to large around the left. The bottom of the circle is labelled 6, the top is labelled 12.]

.pull-right[ .reduce90[

  1. Calculate the library sizes of all cells

  2. Calculate the library size of a pseudo reference cell (average)

  3. Separate cells with odd-ranked (red) and even-ranked (blue) library sizes into two groups

  4. Sort each group by library size and place on opposite sides of a “ring” ] ]
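The ring construction in these steps can be sketched as follows. The library sizes are hypothetical, and "odd" and "even" are interpreted here as the odd and even ranks of the cells after sorting by library size, which is an assumption about the slide's wording.

```python
def build_ring(library_sizes):
    """Return cell indices in ring order.

    Cells are ranked by library size; odd-ranked cells go up one side of
    the ring (small to large) and even-ranked cells come back down the
    other side (large to small), so neighbours have similar sizes.
    """
    ranked = sorted(range(len(library_sizes)), key=lambda i: library_sizes[i])
    odd_side = ranked[0::2]   # ranks 1, 3, 5, ...
    even_side = ranked[1::2]  # ranks 2, 4, 6, ...
    return odd_side + even_side[::-1]

# Hypothetical library sizes (total counts per cell).
library_sizes = [12, 6, 9, 7, 11, 8, 10]

# Pseudo reference cell: here simply the average library size.
reference = sum(library_sizes) / len(library_sizes)

ring = build_ring(library_sizes)
ring_sizes = [library_sizes[i] for i in ring]
print(reference)
print(ring_sizes)
```

The smallest library ends up opposite the largest, and the last cell in the list wraps around to sit next to the first.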

Speaker Notes


Normalisation: SCRAN method

.footnote[.small[Pooling across cells to normalize single-cell RNA sequencing data with many zero counts, Lun et al., 2016]]

.pull-right[The same final graph with blue and red circles of increasing size with an arrow pointing to a large number of formulas that overlap.]

.pull-left[ .reduce90[

  1. Define overlapping pools of adjacent cells of size k

  2. For each pool
    1. Sum the library sizes of all cells within
    2. Derive a size factor by dividing by the reference cell
  3. For each cell
    1. Find which pools it belongs to
    2. Build a linear model using these size factors
    3. Estimate the size factor of the cell on this linear model ] ]
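The pooling steps can be sketched as below, following the slide's simplification of working with library sizes. The pool size `k`, the ring of library sizes and the function names are assumptions for illustration. The final step, fitting a linear model to recover one size factor per cell, is not shown.

```python
def ring_pools(ring, k):
    """One pool of k adjacent cells starting at every ring position."""
    n = len(ring)
    return [[ring[(start + j) % n] for j in range(k)] for start in range(n)]

def pool_size_factors(pools, library_sizes, reference):
    """Sum the library sizes in each pool and divide by the reference cell."""
    return [sum(library_sizes[c] for c in pool) / reference for pool in pools]

# Hypothetical library sizes, already arranged in ring order.
library_sizes = [6, 8, 10, 12, 11, 9, 7]
ring = list(range(len(library_sizes)))
reference = sum(library_sizes) / len(library_sizes)

pools = ring_pools(ring, k=3)
factors = pool_size_factors(pools, library_sizes, reference)

# Which pools does each cell belong to? Each cell appears in exactly k pools.
membership = {c: [p for p, pool in enumerate(pools) if c in pool] for c in ring}

print(pools[0], factors[0])
print(membership[0])
```

Each pool factor is the sum of the unknown per-cell factors of its members, so the overlapping pools give a system of equations from which the per-cell factors can be estimated.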

Speaker Notes


Normalisation: SCRAN method

.footnote[.small[Pooling across cells to normalize single-cell RNA sequencing data with many zero counts, Lun et al., 2016]]

.center[The two previous graphs now in one graph.]

Speaker Notes


Wanted vs Unwanted Variation

.pull-right[Three overlapping line graphs mapping contributing variance to density. Top N genes is shown increasing in density as contributing variance increases, while genes per cell, transcripts, and batch source decrease.]

.pull-left[ .reduce90[ Wanted Variation

Unwanted Variation

Speaker Notes


Confounding Variation: Biological

.center[A cartoon on the left shows a question mark with arrows to nothing and to transcripts shown. On the right are the cell cycle phases and different amounts of transcripts in each phase.]

.pull-left[ .reduce90[ .center[Transcription Bursting]

.pull-right[ .reduce90[ .center[Cell Cycle]

Speaker Notes


Confounding Variation: Technical

.center[Library size variation points to two cells with red and blue transcripts in identical numbers. However during amplification in one cell it produces results, while in the other blue is dropped.]

.pull-left[ .reduce90[ Amplification Bias

.pull-left[ .reduce90[ Dropout Events

Speaker Notes


Confounding Variation: Technical

.center[Library size variation points to two cells with red and blue transcripts in identical numbers. However during amplification in one cell it produces results, while in the other blue is dropped.]

Library Size Variation

Speaker Notes


Relationships Between Cells

Consider:

Aim:

Note:

Speaker Notes


Distance Matrix

A count matrix of genes vs cells is plotted in N-dimensional space with each gene representing the different axes. A distance formula for 3 dimensions is shown, and then a final table is shown from the count matrix with the distances between each of the cells, based on their genes.
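A minimal sketch of this calculation, assuming the distance formula on the slide is the Euclidean distance. The three cells and their counts are hypothetical.

```python
import math

# Hypothetical counts: each cell is a point whose coordinates are its
# counts for three genes.
cells = {
    "C1": [1, 0, 2],
    "C2": [1, 4, 2],
    "C3": [4, 4, 2],
}

def euclidean(a, b):
    """Straight-line distance between two cells in gene space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(cells):
    """Pairwise distances between all cells, as a dict of dicts."""
    names = list(cells)
    return {a: {b: euclidean(cells[a], cells[b]) for b in names} for a in names}

dist = distance_matrix(cells)
for name, row in dist.items():
    print(name, row)
```

The same formula extends to any number of genes; each additional gene simply adds one more squared difference under the square root.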

Speaker Notes


Relatedness of Cells: KNN

A plot of cells across three genes is shown with the label high dimensional dataset of cells. This produces a distance matrix (symmetric), and then via KNN with k=2, a non-symmetric matrix. This is then plotted again in the gene-dimensional space to show connections between cells.
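The step from a symmetric distance matrix to a non-symmetric KNN matrix can be sketched as follows, using k = 2 as in the figure. The four cells and their one-dimensional positions are hypothetical and chosen only so that the asymmetry is easy to see.

```python
# Hypothetical cells on a line; D is an outlier far from the others.
positions = {"A": 0, "B": 1, "C": 2, "D": 10}
dist = {a: {b: abs(pa - pb) for b, pb in positions.items()}
        for a, pa in positions.items()}

def knn_matrix(dist, k):
    """0/1 matrix where row a marks the k nearest neighbours of cell a."""
    names = list(dist)
    adjacency = {a: {b: 0 for b in names} for a in names}
    for a in names:
        others = sorted((b for b in names if b != a), key=lambda b: dist[a][b])
        for b in others[:k]:
            adjacency[a][b] = 1
    return adjacency

adjacency = knn_matrix(dist, k=2)
for name, row in adjacency.items():
    print(name, row)
```

Cell D lists C as a neighbour, but C does not list D, because C has two closer cells. This is why the KNN matrix is not symmetric even though the distance matrix is.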

Speaker Notes


Dimensional Reduction

Matrix of genes vs cells is plotted in gene-dimensions, and then reduced into 2 dimensions.

.pull-left[ .reduce90[ Aim:

.pull-right[ .reduce90[ Constraint

Speaker Notes


Clustering

.pull-left[.image-100[A scatter plot with many groups of cells labelled by different colours. The cells are largely clustered well, with few outlying cells.]]

.pull-right[ .reduce90[

  1. 2D Projection
    • Each dot is a cell
    • Clustering colours the dots, where different coloured cells belong to different clusters
    • Different clusters represent different cell types ] ]

Speaker Notes


Clustering

.pull-left[.image-100[Same scatter plot with clustering as before, but now the clusters are labelled things like Neurons, NSC, Glial Prog., Astrocytes, etc.]]

.pull-right[ .reduce90[

  1. 2D Projection
  2. Discrete Cell Types
    • Each cluster should represent a different type
    • Look for the most DE genes in each cluster
    • Find the marker genes → Cell Type ] ]

Speaker Notes


Clustering

.pull-left[.image-100[The same labelled graph, but now arrows connect the next nearest groups of cell types.]]

.pull-right[ .reduce90[

  1. 2D Projection
  2. Discrete Cell Types
  3. Relationships infer Lineage
    • Neural Stem Cells differentiate into mature cell types
    • Lineage trees are constructed by taking into account
    • Entropy of cluster
    • Proximity of cluster ] ]

Speaker Notes We can also further derive the relationships between these clusters by computing lineage trees based on the amount of noise in each cluster, with the expectation that stem cells have noisy expression profiles yielding broader clusters, and mature cells have very clear expression profiles yielding tighter clusters.


Clustering: Hard vs Soft

   
| | |
|---|---|
| .image-100[Same set of distinct clusters with very clear separation] | .image-100[Clusters now bleed into one another, and the separation is not clear.] |
| .center[Hard] | .center[Soft] |
| Big spaces between clusters | Clusters bleed into one another |
| Cell types are well defined and the clustering reflects that | Cell types seem to intermingle with one another |

Speaker Notes


Continuous Phenotypes:

.center[The graph charts development time of reticulocytes as they pass through an intermediate or rare cell phase, into their final form: red blood cells.] .reduce90[

Speaker Notes Soft clustering is to be expected, since although clustering is a statistical method for discretely partitioning data, the underlying cell biology of the data is a continuous process, where cells transition from one well-defined state to another through intermediate stages which are represented in-between two cluster centres.


Performing Clustering

.pull-left[ Discrete expression profiles: Three mountains are shown with clouds, we just see three peaks. Cells in red, green, and blue are shown at the peaks. Continuous expression landscape: the clouds are removed and we see the mountains are actually connected and there are cells in between in various intermediate colours. ]

.pull-right[ .reduce90[ Dynamic datasets with continuously dynamic clusters

Variety of different clustering methods

Speaker Notes


Performing Clustering: K-means

.pull-right[An animated figure showing several iteration of an algorithm that is optimising a 3-way split between a scatter plot of cells. There is no clear boundary making the final result appear only marginally better.]

.pull-left[ .reduce90[ K-means

  1. Initialise k random positions
  2. Iteration Step:
    1. Calculate distance from each cell to each k position
    2. Assign each cell to its nearest k
    3. Set new k positions to the mean position of all cells in that group

K-medians

] ]
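The K-means steps listed above can be sketched as follows. This is a minimal illustration under some assumptions: the initial positions are drawn from the data points themselves, a fixed number of iterations is used instead of a convergence test, and the points are hypothetical.

```python
import math
import random

def kmeans(points, k, iterations=10, centres=None, seed=0):
    """Minimal k-means; returns (labels, centres)."""
    # Step 1: initialise k positions (here: k randomly chosen data points).
    if centres is None:
        centres = random.Random(seed).sample(points, k)
    centres = [tuple(c) for c in centres]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Steps 2.1 and 2.2: assign each cell to its nearest centre.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        # Step 2.3: move each centre to the mean position of its cells.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty group keeps its previous centre
                centres[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centres

# Hypothetical cells in a 2D expression space: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres = kmeans(points, k=2)
print(labels)
print(centres)
```

Passing explicit starting positions through `centres` makes a run reproducible, which is useful because different initialisations can lead to different final clusters.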

Speaker Notes


Performing Clustering: Hierarchical

.pull-left[A many-step figure starting with a number of individual dots. The text reads "identify the two clusters that are closest" and "merge the two most similar clusters." The process repeats a number of times until all clusters are absorbed into the one large blob.]

.pull-right[ .reduce90[

.image-90[Several points in a square are labelled A through F, on the right a dendrogram is shown with lengths indicating how close each letter is to each other.]
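The merge procedure in the figure can be sketched as below. The slide does not say how the distance between two clusters is measured, so this example assumes single linkage (the distance between the two closest members). The points are hypothetical.

```python
import math

def agglomerate(points):
    """Single-linkage agglomerative clustering; returns the merges in order."""
    clusters = [[name] for name in points]
    merges = []
    while len(clusters) > 1:
        # Identify the two clusters that are closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    math.dist(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Merge them and record the distance at which they joined.
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Hypothetical cells in a 2D space.
points = {"A": (0, 0), "B": (0, 1), "C": (5, 0), "D": (5, 2), "E": (20, 0)}
merges = agglomerate(points)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The recorded merge distances correspond to the branch heights of a dendrogram: clusters that join early are similar, and clusters that join late are far apart.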

Speaker Notes


Community Clustering: Louvain

.center[A graph is shown with dots connected by lines. Below, those dots have expanded and pink touches orange and nearly touches purple. It asks: pink by itself? And notes 4 external links and 0 internal links. Two hypothetical options are shown. If pink absorbs purple, we see 5 external connections and 1 internal, so it has added new connections. An X suggests this is wrong. Below is the option where pink absorbs orange, where we see 3 external and 1 internal connection, so one connection has become internal, and no new nodes are connected. A check mark indicates this was right.]

.reduce90[ Aim: Maximise internal links and minimise external links ]

Speaker Notes


Community Clustering: Louvain

.center[Same Graph as previously, but now there are more, larger clusters. Blue and purple were absorbed, yellow and red were absorbed, and we see a simplified 4 node graph.]

.reduce90[

Speaker Notes If the new configuration has instead increased the number of external links, then the configuration is rejected and another cell is picked and tested. By performing this multiple times, a community structure of cells is built to whichever degree of specificity the user desires.
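The accept or reject rule described here can be sketched by counting links. The graph below is a hypothetical one, built so that its counts match the figure (4 external links for pink alone, 5 after absorbing purple, 3 after absorbing orange). The Louvain method itself optimises a modularity score; counting links follows the simplified description on these slides.

```python
# Hypothetical graph as a list of undirected links between communities.
edges = [
    ("pink", "orange"),
    ("pink", "purple"),
    ("pink", "x"),
    ("pink", "y"),
    ("purple", "v"),
    ("purple", "w"),
]

def link_counts(edges, community):
    """Return (internal, external) link counts for a set of nodes."""
    internal = sum(1 for a, b in edges if a in community and b in community)
    external = sum(1 for a, b in edges if (a in community) != (b in community))
    return internal, external

def accept_merge(edges, community, candidate):
    """Reject a merge if it increases the number of external links."""
    _, before = link_counts(edges, community)
    _, after = link_counts(edges, community | candidate)
    return after <= before

pink = {"pink"}
print(link_counts(edges, pink))
print(link_counts(edges, pink | {"purple"}), accept_merge(edges, pink, {"purple"}))
print(link_counts(edges, pink | {"orange"}), accept_merge(edges, pink, {"orange"}))
```

Repeating this test over many candidate merges gradually builds larger communities, and the process can be stopped at whichever level of granularity is wanted.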


Summary

.pull-left[Red and blue clusters of cells are shown resembling the tissue blobs. Graphs on the right for expression in Genes A, B, X are shown per cell]

.pull-right[ .reduce90[

Speaker Notes


Further scRNA-seq Data Analysis

Screenshot of the galaxy training materials that cover single cell

Speaker Notes


Key Points

Do you want to extend your knowledge?

Follow one of our recommended follow-up trainings:

- [Transcriptomics](/training-material/topics/transcriptomics)
  - Pre-processing of Single-Cell RNA Data: [slides](/training-material/topics/transcriptomics/tutorials/scrna-preprocessing/slides.html) - [hands-on](/training-material/topics/transcriptomics/tutorials/scrna-preprocessing/tutorial.html)

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!

This material is licensed under the Creative Commons Attribution 4.0 International License.