This is an example of a workflow to process data in Seurat v5. Here we’re using a simple dataset consisting of a single set of cells which we believe should split into subgroups. In this exercise we will:

  • Load in the data

  • Do some basic QC and Filtering

  • Select genes which we believe are going to be informative

  • Perform dimensionality reduction

  • Detect clusters within the data

  • Find genes which define the clusters

  • Examine how robust the clusters and genes are

Setup

We’re going to start with some basic housekeeping. We’re going to load the packages we’re going to use, these will be:

  • Seurat (for general single cell loading and processing)

  • Tidyverse (for non-standard data manipulation and plotting)

  • Azimuth (for cell type prediction)

  • LoupeR (for exporting cloupe files)

We’re also going to set a nicer theme for ggplot in tidyverse so our graphs look nicer.

library(Seurat)
library(tidyverse)
library(Azimuth)
library(loupeR)
library(clustree)
theme_set(theme_bw(base_size = 14))

Loading Data

We have the cellranger output files already produced. The files we’re working with here are the raw output from the CellRanger pipeline. We didn’t do anything else to them. We can see the files we have to work with. We’re working with the filtered h5 file, which means that we have to also install the hdf5 package in R. This means we can load the data from a single file using the Read10x_h5 function in Seurat.

Read10X_h5("../filtered_feature_bc_matrix.h5") -> data

CreateSeuratObject(
  counts=data, 
  project="course", 
  min.cells = 3, 
  min.features=200
) -> data

data
## An object of class Seurat 
## 17136 features across 5070 samples within 1 assay 
## Active assay: RNA (17136 features, 0 variable features)
##  1 layer present: counts

QC

Before we do any analysis it’s really important to do some quality control and filtering to make sure we’re working with good data. Seurat automatically creates two metrics we can use:

  1. nCount_RNA the total number of reads (or more correctly UMIs) in the dataset

  2. nFeature_RNA the number of observed genes (anything with a nonzero count)

We can supplement this with other metrics which we can calculate ourselves.

Knee Plot

We can recreate the knee plot we saw in the 10X QC report. This is just the number of UMIs vs the index position of the reads.

data[[]] %>%
  arrange(desc(nCount_RNA)) %>%
  mutate(index=1:n()) %>%
  ggplot(aes(x=index,y=nCount_RNA)) + 
  geom_line() +
  scale_x_log10() +
  scale_y_log10() 

We can see that the default filtering has gone beyond the “knee” point and into the descending part of the distribution. When we look at the distribution of counts then we may want to remove some of the lower coverage cells. At the top end we may also have cells with high coverage which could represent duplets.

Amount of MT genes

Single cell datasets can be filled with large numbers of reads coming from mitochondria. These often indicate a sick cell undergoing apoptosis. We want to check for this. Seurat comes with a load of built-in functions for accessing certain aspects of your data, but you can also dig into the raw data fairly easily.

The raw data can be found with LayerData(data, layer="counts"). This is a matrix with genes as rownames and cell barcodes as columns. Later we will create a new “data” layer with a matrix of normalised counts in it.

We can find the gene names as the rownames of the Seurat object and we identify the mitochondrial genes by their names starting with “MT-”. Be aware that in other species the naming of the mitochondrial genes may not be the same so you’d need to adjust the pattern.

grep("^MT-",rownames(data),value = TRUE)
##  [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6" "MT-CO3" 
##  [8] "MT-ND3"  "MT-ND4L" "MT-ND4"  "MT-ND5"  "MT-ND6"  "MT-CYB"

We can calculate the percentage of the data coming from a set of genes using the PercentageFeatureSet function. We can store the result in the main metadata store for the object by defining a new column called “percent.MT”.

PercentageFeatureSet(data,pattern="^MT-") -> data$percent.MT

head(data$percent.MT)
## AAACCTGAGAAACCAT-1 AAACCTGAGATAGCAT-1 AAACCTGAGCGTGAAC-1 AAACCTGCAACGCACC-1 
##           3.151927           4.809843           3.133159           4.295333 
## AAACCTGCACACCGAC-1 AAACCTGCATCGATTG-1 
##           2.648676           3.776435

Amount of Ribosomal Genes

Ribosomal genes also tend to be very highly represented, and can vary between cell types, so it can be instructive to see how prevalent they are in the data. These are ribosomal protein genes rather than the actual rRNA, so they’re more a measure of the translational activity of the cell rather than the cleanliness of the polyA selection.

grep("^RP[LS]",rownames(data),value = TRUE)
##  [1] "RPL22"       "RPL11"       "RPS6KA1"     "RPS8"        "RPL5"       
##  [6] "RPS27"       "RPS6KC1"     "RPS7"        "RPS27A"      "RPL31"      
## [11] "RPL37A"      "RPL32"       "RPL15"       "RPSA"        "RPL14"      
## [16] "RPL29"       "RPL24"       "RPL22L1"     "RPL39L"      "RPL35A"     
## [21] "RPL9"        "RPL34-AS1"   "RPL34"       "RPS3A"       "RPL37"      
## [26] "RPS23"       "RPS14"       "RPL26L1"     "RPS18"       "RPS10"      
## [31] "RPL10A"      "RPL7L1"      "RPS12"       "RPS6KA2"     "RPS20"      
## [36] "RPL7"        "RPL30"       "RPL8"        "RPS6"        "RPL35"      
## [41] "RPL12"       "RPL7A"       "RPS24"       "RPLP2"       "RPL27A"     
## [46] "RPS13"       "RPS6KA4"     "RPS6KB2"     "RPS6KB2-AS1" "RPS3"       
## [51] "RPS25"       "RPS26"       "RPL41"       "RPL6"        "RPLP0"      
## [56] "RPL21"       "RPS29"       "RPL36AL"     "RPS6KL1"     "RPS6KA5"    
## [61] "RPS27L"      "RPL4"        "RPLP1"       "RPS17"       "RPS2"       
## [66] "RPS15A"      "RPL13"       "RPL26"       "RPL23A"      "RPL23"      
## [71] "RPL19"       "RPL27"       "RPS6KB1"     "RPL38"       "RPL17"      
## [76] "RPS15"       "RPL36"       "RPS28"       "RPL18A"      "RPS16"      
## [81] "RPS19"       "RPL18"       "RPL13A"      "RPS11"       "RPS9"       
## [86] "RPL28"       "RPS5"        "RPS21"       "RPL3"        "RPS19BP1"   
## [91] "RPS6KA3"     "RPS4X"       "RPS6KA6"     "RPL36A"      "RPL39"      
## [96] "RPL10"
PercentageFeatureSet(data,pattern="^RP[LS]") -> data$percent.Ribosomal

head(data$percent.Ribosomal)
## AAACCTGAGAAACCAT-1 AAACCTGAGATAGCAT-1 AAACCTGAGCGTGAAC-1 AAACCTGCAACGCACC-1 
##           47.34694           27.49441           35.16101           26.25958 
## AAACCTGCACACCGAC-1 AAACCTGCATCGATTG-1 
##           42.52874           48.79154

Percentage of Largest Gene

We can also go into the count matrix and make our own metrics. The data is stored in a “Sparse Matrix” which is more efficient for storing data with a large proportion of unobserved values (such as 10X data). In this example we run apply over the columns (cells) and calculate what percentage of the data comes from the single most observed gene. Again, having a high proportion of your data dominated by a single gene is a metric which could either give biological context or indicate a technical problem, depending on what the gene is.

When we calculate this we normally find that MALAT1 is normally the largest gene by some distance - it’s a non-coding nuclear gene expressed at very high levels. This has such a big effect that we’ll measure it separately, and exclude it from our analysis here.

We will get:

  • The count for the largest gene per cell

  • The index position of the gene with the largest count

  • The name of the most highly expressed gene per cell

data[rownames(data) != "MALAT1",] -> data.nomalat

apply(
  LayerData(data.nomalat, layer="counts"),
  2,
  max
) -> data.nomalat$largest_count

apply(
  LayerData(data.nomalat, layer="counts"),
  2,
  which.max
) -> data.nomalat$largest_index

rownames(data.nomalat)[data.nomalat$largest_index] -> data.nomalat$largest_gene

100 * data.nomalat$largest_count / data.nomalat$nCount_RNA -> data.nomalat$percent.Largest.Gene

data.nomalat$largest_gene -> data$largest_gene
data.nomalat$percent.Largest.Gene -> data$percent.Largest.Gene

rm(data.nomalat)

Seurat QC Plots

Single Metrics

Seurat comes with some convenience methods for plotting out certain types of visualisation, such as the distribution of certain QC metrics. We can view this on both a linear and log scale to see which looks more helpful.

VlnPlot(data, layer = "counts", features=c("nCount_RNA","percent.MT", "percent.Ribosomal","percent.Largest.Gene"))