This is an example of a workflow to process data in Seurat v5. Here we’re using a simple dataset consisting of a single set of cells which we believe should split into subgroups. In this exercise we will:
Load in the data
Do some basic QC and Filtering
Select genes which we believe are going to be informative
Perform dimensionality reduction
Detect clusters within the data
Find genes which define the clusters
Examine how robust the clusters and genes are
We’re going to start with some basic housekeeping. We’re going to load the packages we’re going to use, these will be:
Seurat (for general single cell loading and processing)
Tidyverse (for non-standard data manipulation and plotting)
Azimuth (for cell type prediction)
LoupeR (for exporting cloupe files)
We’re also going to set a nicer theme for ggplot in tidyverse so our graphs look nicer.
library(Seurat)
library(tidyverse)
library(Azimuth)
library(loupeR)
library(clustree)
theme_set(theme_bw(base_size = 14))
We have the cellranger output files already produced. The files we’re
working with here are the raw output from the CellRanger pipeline. We
didn’t do anything else to them. We can see the files we have to work
with. We’re working with the filtered h5 file, which means that we have
to also install the hdf5 package in R. This means we can load the data
from a single file using the Read10x_h5
function in
Seurat.
Read10X_h5("../filtered_feature_bc_matrix.h5") -> data
CreateSeuratObject(
counts=data,
project="course",
min.cells = 3,
min.features=200
) -> data
data
## An object of class Seurat
## 17136 features across 5070 samples within 1 assay
## Active assay: RNA (17136 features, 0 variable features)
## 1 layer present: counts
Before we do any analysis it’s really important to do some quality control and filtering to make sure we’re working with good data. Seurat automatically creates two metrics we can use:
nCount_RNA
the total number of reads (or more
correctly UMIs) in the dataset
nFeature_RNA
the number of observed genes (anything
with a nonzero count)
We can supplement this with other metrics which we can calculate ourselves.
We can recreate the knee plot we saw in the 10X QC report. This is just the number of UMIs vs the index position of the reads.
data[[]] %>%
arrange(desc(nCount_RNA)) %>%
mutate(index=1:n()) %>%
ggplot(aes(x=index,y=nCount_RNA)) +
geom_line() +
scale_x_log10() +
scale_y_log10()
We can see that the default filtering has gone beyond the “knee” point and into the descending part of the distribution. When we look at the distribution of counts then we may want to remove some of the lower coverage cells. At the top end we may also have cells with high coverage which could represent duplets.
Single cell datasets can be filled with large numbers of reads coming from mitochondria. These often indicate a sick cell undergoing apoptosis. We want to check for this. Seurat comes with a load of built-in functions for accessing certain aspects of your data, but you can also dig into the raw data fairly easily.
The raw data can be found with
LayerData(data, layer="counts")
. This is a matrix with
genes as rownames and cell barcodes as columns. Later we will create a
new “data” layer with a matrix of normalised counts in it.
We can find the gene names as the rownames of the Seurat object and we identify the mitochondrial genes by their names starting with “MT-”. Be aware that in other species the naming of the mitochondrial genes may not be the same so you’d need to adjust the pattern.
grep("^MT-",rownames(data),value = TRUE)
## [1] "MT-ND1" "MT-ND2" "MT-CO1" "MT-CO2" "MT-ATP8" "MT-ATP6" "MT-CO3"
## [8] "MT-ND3" "MT-ND4L" "MT-ND4" "MT-ND5" "MT-ND6" "MT-CYB"
We can calculate the percentage of the data coming from a set of
genes using the PercentageFeatureSet
function. We can store
the result in the main metadata store for the object by defining a new
column called “percent.MT”.
PercentageFeatureSet(data,pattern="^MT-") -> data$percent.MT
head(data$percent.MT)
## AAACCTGAGAAACCAT-1 AAACCTGAGATAGCAT-1 AAACCTGAGCGTGAAC-1 AAACCTGCAACGCACC-1
## 3.151927 4.809843 3.133159 4.295333
## AAACCTGCACACCGAC-1 AAACCTGCATCGATTG-1
## 2.648676 3.776435
Ribosomal genes also tend to be very highly represented, and can vary between cell types, so it can be instructive to see how prevalent they are in the data. These are ribosomal protein genes rather than the actual rRNA, so they’re more a measure of the translational activity of the cell rather than the cleanliness of the polyA selection.
grep("^RP[LS]",rownames(data),value = TRUE)
## [1] "RPL22" "RPL11" "RPS6KA1" "RPS8" "RPL5"
## [6] "RPS27" "RPS6KC1" "RPS7" "RPS27A" "RPL31"
## [11] "RPL37A" "RPL32" "RPL15" "RPSA" "RPL14"
## [16] "RPL29" "RPL24" "RPL22L1" "RPL39L" "RPL35A"
## [21] "RPL9" "RPL34-AS1" "RPL34" "RPS3A" "RPL37"
## [26] "RPS23" "RPS14" "RPL26L1" "RPS18" "RPS10"
## [31] "RPL10A" "RPL7L1" "RPS12" "RPS6KA2" "RPS20"
## [36] "RPL7" "RPL30" "RPL8" "RPS6" "RPL35"
## [41] "RPL12" "RPL7A" "RPS24" "RPLP2" "RPL27A"
## [46] "RPS13" "RPS6KA4" "RPS6KB2" "RPS6KB2-AS1" "RPS3"
## [51] "RPS25" "RPS26" "RPL41" "RPL6" "RPLP0"
## [56] "RPL21" "RPS29" "RPL36AL" "RPS6KL1" "RPS6KA5"
## [61] "RPS27L" "RPL4" "RPLP1" "RPS17" "RPS2"
## [66] "RPS15A" "RPL13" "RPL26" "RPL23A" "RPL23"
## [71] "RPL19" "RPL27" "RPS6KB1" "RPL38" "RPL17"
## [76] "RPS15" "RPL36" "RPS28" "RPL18A" "RPS16"
## [81] "RPS19" "RPL18" "RPL13A" "RPS11" "RPS9"
## [86] "RPL28" "RPS5" "RPS21" "RPL3" "RPS19BP1"
## [91] "RPS6KA3" "RPS4X" "RPS6KA6" "RPL36A" "RPL39"
## [96] "RPL10"
PercentageFeatureSet(data,pattern="^RP[LS]") -> data$percent.Ribosomal
head(data$percent.Ribosomal)
## AAACCTGAGAAACCAT-1 AAACCTGAGATAGCAT-1 AAACCTGAGCGTGAAC-1 AAACCTGCAACGCACC-1
## 47.34694 27.49441 35.16101 26.25958
## AAACCTGCACACCGAC-1 AAACCTGCATCGATTG-1
## 42.52874 48.79154
We can also go into the count matrix and make our own metrics. The
data is stored in a “Sparse Matrix” which is more efficient for storing
data with a large proportion of unobserved values (such as 10X data). In
this example we run apply
over the columns (cells) and
calculate what percentage of the data comes from the single most
observed gene. Again, having a high proportion of your data dominated by
a single gene is a metric which could either give biological context or
indicate a technical problem, depending on what the gene is.
When we calculate this we normally find that MALAT1 is normally the largest gene by some distance - it’s a non-coding nuclear gene expressed at very high levels. This has such a big effect that we’ll measure it separately, and exclude it from our analysis here.
We will get:
The count for the largest gene per cell
The index position of the gene with the largest count
The name of the most highly expressed gene per cell
data[rownames(data) != "MALAT1",] -> data.nomalat
apply(
LayerData(data.nomalat, layer="counts"),
2,
max
) -> data.nomalat$largest_count
apply(
LayerData(data.nomalat, layer="counts"),
2,
which.max
) -> data.nomalat$largest_index
rownames(data.nomalat)[data.nomalat$largest_index] -> data.nomalat$largest_gene
100 * data.nomalat$largest_count / data.nomalat$nCount_RNA -> data.nomalat$percent.Largest.Gene
data.nomalat$largest_gene -> data$largest_gene
data.nomalat$percent.Largest.Gene -> data$percent.Largest.Gene
rm(data.nomalat)
Seurat comes with some convenience methods for plotting out certain types of visualisation, such as the distribution of certain QC metrics. We can view this on both a linear and log scale to see which looks more helpful.
VlnPlot(data, layer = "counts", features=c("nCount_RNA","percent.MT", "percent.Ribosomal","percent.Largest.Gene"))