Background

The advent of single-cell transcriptomics has made it possible to capture cellular heterogeneity. However, it loses spatial information because the cells must be dissociated from the tissue before sequencing.

On the other hand, spatially resolved transcriptomic (SRT) technologies can profile the whole transcriptome, or a subset of it, at spot (a few cells), cellular, or sub-cellular resolution. To learn more about these approaches, please check out one of my previous blog posts here.

The increasing adoption of SRT technologies is driving the development of computational methods and frameworks to analyse these datasets. The first step of any data analysis workflow is quality control (QC), which makes sure that only high-quality data points are retained for downstream analyses.

Here, I introduce SpotSweeper, a collection of computational methods implemented in a freely available R/Bioconductor package, which incorporates spatial awareness into the QC step of the SRT data analysis workflow. I will briefly describe why it is novel and how it works.

Rationale

The QC of SRT data relies heavily on methods developed for single-cell data, which mainly use three QC metrics without any spatial awareness -

Library size or total read counts per cell
Number of unique genes per cell
Percentage of total counts mapped to mitochondrial genes per cell

Low-quality cells are filtered out by applying thresholds to these metrics. The thresholds can be user-defined (for example, at least 1000 counts or at most 5% mitochondrial counts) or data-driven (for example, no more than 3 median absolute deviations away from the median value).
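To make the two kinds of global thresholds concrete, here is a minimal sketch in Python/NumPy. It is not taken from any particular package: the function name `mad_outliers`, the toy numbers, and the 1.4826 scaling of the MAD (a common convention) are my own assumptions for illustration.

```python
import numpy as np

def mad_outliers(values, n_mads=3.0, direction="lower"):
    """Flag values more than n_mads scaled MADs from the global median."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    # Assumption: scale the MAD by 1.4826 so that it estimates the
    # standard deviation for normally distributed data.
    mad = 1.4826 * np.median(np.abs(values - med))
    if direction == "lower":
        return values < med - n_mads * mad
    if direction == "higher":
        return values > med + n_mads * mad
    raise ValueError("direction must be 'lower' or 'higher'")

# Toy per-cell QC metrics (made-up numbers, for illustration only).
lib_size = np.array([5200, 4800, 5100, 300, 4950, 5300])
mito_pct = np.array([2.1, 1.8, 2.5, 14.0, 2.0, 2.3])

# User-defined thresholds: at least 1000 counts, at most 5% mitochondrial.
fixed_fail = (lib_size < 1000) | (mito_pct > 5)

# Data-driven thresholds: 3 MADs from the median, in the "bad" direction.
mad_fail = (mad_outliers(lib_size, direction="lower")
            | mad_outliers(mito_pct, direction="higher"))
```

In both cases every cell is compared against the whole dataset, which is exactly the global behaviour discussed next.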

In most cases, the same approach is still used for the QC of SRT data. The QC metrics of each spot or cell are compared to those of all other spots or cells in the tissue slice, without considering its spatial context. Such a global approach thus lacks any spatial awareness.

This is problematic because, as figure 1 shows, the mitochondrial ratio/percentage varies across cortical layers (in the case of brain tissue samples). Hence, spatial biology acts as a confounding factor in this global approach.

Figure 1: Mitochondrial ratios confounded by spatial biology. The distribution of the value varies across the cortical layers and white matter. This is part of figure 1b in the original paper. See the reference section below.

SpotSweeper: how it works

SpotSweeper can perform QC at two different levels -

Spot/cell-level
Region-level

An overview of this method is illustrated in figure 2.
Figure 2: A schematic overview of the SpotSweeper method. This is figure 1a from the original paper.

The following two subsections describe how the method works.

Spot/cell-level QC

One of the novelties of SpotSweeper is that it first identifies the k nearest neighbors of each spot/cell i, NN_k(i), based on the spatial coordinates (figure 2). To identify local outliers, it then uses the following equation -

\[ z_i = \frac{0.6745 \, (x_i - m_i)}{\mathrm{MAD}_i} \]

where,
z_i = robust z-score for spot/cell i,
0.6745 = scaling constant that makes the MAD-based score comparable to a standard normal z-score,
x_i = value of the QC metric of interest for spot/cell i,
m_i = median of the QC metric among the neighbors of spot/cell i,
MAD_i = median absolute deviation of the QC metric among the neighbors of spot/cell i
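The formula above can be sketched in a few lines of Python/NumPy. This is my own illustration, not SpotSweeper's R implementation: the function name `local_robust_z`, the brute-force neighbor search, the choice to exclude the spot itself from its neighborhood, and the toy grid are all assumptions.

```python
import numpy as np

def local_robust_z(coords, metric, k=8):
    """Robust z-score of each spot relative to its k nearest spatial neighbors."""
    coords = np.asarray(coords, dtype=float)
    metric = np.asarray(metric, dtype=float)
    # Brute-force pairwise Euclidean distances; fine for a toy example.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # assumption: a spot is not its own neighbor
    nn_idx = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest neighbors
    neigh = metric[nn_idx]                    # shape (n_spots, k)
    m = np.median(neigh, axis=1)                          # m_i
    mad = np.median(np.abs(neigh - m[:, None]), axis=1)   # MAD_i
    mad = np.where(mad == 0, np.nan, mad)     # undefined score when MAD_i is zero
    return 0.6745 * (metric - m) / mad

# Toy 5 x 5 grid with a smooth spatial gradient in the metric (made-up values).
xs, ys = np.meshgrid(np.arange(5), np.arange(5))
coords = np.column_stack([xs.ravel(), ys.ravel()])
metric = 10 + 0.1 * coords[:, 0] + 0.2 * coords[:, 1]
metric[12] = 7.0                              # center spot (x=2, y=2) is artificially low

z = local_robust_z(coords, metric, k=8)
```

The center spot gets a strongly negative score because it is far below the median of its eight surrounding spots, while its neighbors stay close to zero: the median and MAD are barely moved by a single aberrant value.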

These normalized local z scores are not confounded by spatial biology (figure 3).
Figure 3: z-scores calculated with SpotSweeper for mitochondrial ratio across cortical layers. The values are not confounded by spatial biology. This is part of figure 1b in the original paper.

By default, a spot/cell is flagged as a local outlier if z_i < -3 for library size or gene counts, or if z_i > 3 for mitochondrial ratio.
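Note that these cutoffs are one-sided. A small Python sketch with made-up z-scores shows the idea. The helper name `flag` and the decision to combine the three flags with a logical OR are my assumptions, not something prescribed by the package.

```python
import numpy as np

def flag(z, direction, cutoff=3.0):
    """One-sided outlier flag on local z-scores."""
    z = np.asarray(z, dtype=float)
    if direction == "lower":
        return z < -cutoff
    if direction == "higher":
        return z > cutoff
    raise ValueError("direction must be 'lower' or 'higher'")

# Hypothetical local z-scores for five spots (made-up values).
z_lib = np.array([-0.4, -3.6, 0.2, 1.1, -0.9])    # library size
z_gene = np.array([-0.2, -3.1, 0.5, 0.8, -1.2])   # number of unique genes
z_mito = np.array([0.3, 0.9, 4.2, -3.5, 0.1])     # mitochondrial ratio

# Assumption: a spot is discarded if it fails any one of the three checks.
discard = flag(z_lib, "lower") | flag(z_gene, "lower") | flag(z_mito, "higher")
```

The fourth spot has an unusually low mitochondrial ratio (z = -3.5), but it is not flagged because only high mitochondrial ratios are treated as a sign of low quality.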

Region-level QC

SpotSweeper is also capable of detecting regional artifacts using a multiscale approach (figure 2), which is inspired by object detection methods in computer vision. Region-level QC is performed in three steps -

Step 1: Local variance of the mitochondrial ratio is calculated at multiple spatial scales, defined as concentric circles around each spot/cell. By default, first to fifth order neighbors are used. This creates a local variance matrix where each row is a spot/cell and each column is the local variance at a particular scale (figure 4).
Figure 4: Local variance matrix. Each row represents a spot/cell and the local variances across various scales (k = 1-5) are stored in the columns.
Step 2: PCA is performed on the local variance matrix, after applying the iteratively reweighted least squares (IRLS) algorithm to account for the mean-variance relationship.
Step 3: k-means clustering (k=2) is performed on the first two principal components to create separate clusters for artifacts and high-quality tissues.
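To get a feel for the three steps, here is a simplified sketch in Python/NumPy on a synthetic square grid. It is not SpotSweeper's implementation, and several things are my own assumptions: the helper names, the neighborhood size of 3n(n+1) spots for order n (the count on a hexagonal grid), the log transform of the variances, and the deterministic toy data. The IRLS adjustment from Step 2 is deliberately left out to keep the example short.

```python
import numpy as np

def knn_indices(coords, k):
    """Indices of the k nearest spots to each spot (the spot itself included)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return np.argsort(dist, axis=1)[:, :k]

def local_variance_matrix(coords, mito, orders=(1, 2, 3, 4, 5)):
    """Rows are spots; columns hold the log local variance at each scale."""
    ks = [3 * n * (n + 1) for n in orders]   # assumed neighborhood size per order
    nn = knn_indices(coords, max(ks))
    cols = [np.log(mito[nn[:, :k]].var(axis=1, ddof=1) + 1e-12) for k in ks]
    return np.column_stack(cols)

def pca_scores(M, n_components=2):
    """Scores on the leading principal components, via SVD of the centered matrix."""
    Mc = M - M.mean(axis=0)
    U, S, _ = np.linalg.svd(Mc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def two_means(X, n_iter=50):
    """Plain Lloyd's algorithm with k=2, started from the two extremes of PC1."""
    centers = X[[np.argmin(X[:, 0]), np.argmax(X[:, 0])]].copy()
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(2):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Synthetic 20 x 20 tissue: columns x < 6 mimic an artifact in which the
# mitochondrial ratio fluctuates strongly between adjacent spots.
xs, ys = np.meshgrid(np.arange(20), np.arange(20))
coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
parity = (coords[:, 0] + coords[:, 1]) % 2
in_artifact = coords[:, 0] < 6
mito = 0.10 + np.where(in_artifact, 0.2, 0.001) * parity

M = local_variance_matrix(coords, mito)   # Step 1
labels = two_means(pca_scores(M))         # Steps 2 (PCA only) and 3

# Call the cluster with the higher average local variance the artifact cluster.
artifact_label = int(M[labels == 1].mean() > M[labels == 0].mean())
is_artifact = labels == artifact_label
```

On this toy grid the artifact columns end up in one cluster and the tissue far away from them in the other. Spots close to the boundary can fall on either side, because their larger neighborhoods overlap both regions.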

Using this process, SpotSweeper can identify two types of regional artifacts (figure 5) -

Dry spots created by incomplete coverage of permeabilisation liquid (figure 5a)
Hangnails created by tissue damage during dissection (figure 5b)

Figure 5: Identification of regional artifacts. Two types of regional artifacts, namely (a) dry spots and (b) hangnails, can be detected with SpotSweeper.

Advantages of SpotSweeper

Compatible with SpatialExperiment
Detailed documentation
Easy to install and use (at least, I did not face any trouble installing and using it on Google Colab).

References and links to code

Click here to read the original paper on SpotSweeper
Click here to find the SpotSweeper Bioconductor package
Click here to access a GitHub directory with my code, in which I apply the SpotSweeper package to different SRT datasets. More notebooks will be uploaded as I continue my explorations.