Preprocessing has many functions, but the overarching goals include reducing nuisance variation, increasing variation of interest (signal), and enabling comparisons between samples. The choices made during preprocessing can significantly impact the results of downstream analyses. Important steps include loading data, peak recalibration, peak picking, peak alignment, region of interest (ROI) selection, and normalization.
library(dplyr)
library(tibble)
library(Cardinal)
source(file.path("Utilities.R"))
RNGkind("L'Ecuyer-CMRG")
setCardinalVerbose(verbose = FALSE)
gcinfo(verbose = FALSE)
This dataset is available on MassIVE under ID MSV000094448 as the
original dataset from Schurman et al.,
2025. This work is presented as a reanalysis. Running this workflow
starting with the raw files will produce each intermediate.
Alternatively, each dataset and all intermediates can be downloaded and
placed in the ./Data directory.
We will load individual images from ./Data/Raw. Some
images need to be flipped (to position cartilage on top, as it would be
anatomically). Two images have zones of zero pixel intensity and must be
trimmed. Sometimes images have coordinate dimensions that are redundant,
such as the z coordinate where all values are 1 for a 2D
image. Leaving these coordinates in can cause issues with Cardinal, so
we will set them to NULL.
DATA_PATH <- file.path("Data", "Raw")
MSE_FILENAMES <- list.files(DATA_PATH, pattern = "\\.imzML$",
recursive = TRUE)
mse_list_raw = list()
for (f in MSE_FILENAMES) {
just_fname <- isolate_filename(f)
agnostic_step_fname <- name_tibble |>
filter(raw_filename == f) |>
pull(run)
if (length(agnostic_step_fname) != 0){
m <- readMSIData(file.path(DATA_PATH, f),verbose=F)
if (agnostic_step_fname %in% c("6891-OA-lateral",
"6891-OA-medial",
"6949-OA-lateral",
"6949-OA-medial",
"6954-OA-lateral",
"6954-OA-medial")) {
m <- flip_vertically(m)
}
if (agnostic_step_fname == "6919-CAD-medial") {
m <- Cardinal::subset(m, x <= 390)
}
if (agnostic_step_fname == "6924-CAD-medial") {
m <- Cardinal::subset(m, x <= 367)
}
#best not to have extraneous or unused coordinates
coord(m)$z <- NULL
runNames(m) <- agnostic_step_fname
mse_list_raw[[agnostic_step_fname]] <- m
} else {
print(paste0("Skipping ", f))
}
}
## [1] "Skipping 6911a_lat_oa.imzML"
## [1] "Skipping 6911b_med_oa.imzML"
## [1] "Skipping 6920a_lat_cad.imzML"
## [1] "Skipping 6920b_med_cad.imzML"
This code will recreate figures from the manuscript.
png(filename = file.path("Recreated-Figures",
"Figure-Peakpicking-Raw.png"),
height = 500, width = 1000, pointsize = 20)
m <- mse_list_raw[[1]] |>
convertMSImagingArrays2Experiment() |>
summarizeFeatures(stat = c("mean", "sd"), verbose = F,
BPPARAM = SnowfastParam(workers = 10L, force.GC = T))
print(plot(m, "mean",
ylab = "Mean Pixel Intensity",
#xlim = c(850, 2500),
#ylim = c(0, 850),
grid = F))
dev.off()
as_tibble(fData(m)) |>
filter(mean != 0 | sd != 0) |>
mutate(cv = sd/mean) |>
summarise(mean_cv = mean(cv, na.rm = T))
MSI data are large and can be time- and resource-intensive to
preprocess. In these vignettes we make heavy use of parallel
processing via Cardinal, which itself uses BiocParallel.
Powerful computers, especially those with plenty of memory, will
benefit most from parallel processing. We will use matter’s
SnowfastParam as it is compatible with all major platforms
and interactive usage (unlike MulticoreParam) but is faster
than SnowParam. Performance depends on many factors, so
parameters must ultimately be tuned to the specific computer, dataset,
and processing step. Select parameters carefully. Keyboard interrupts
may fail to terminate commands that have launched multiple parallel
processes, which may then need to be killed manually.
Additionally, if overly ambitious parameter choices demand more physical
memory than the system can provide, the operating system will begin
using swap space. This shifts processing effort toward memory management
and can severely degrade performance. In extreme cases, the system may
become unresponsive while attempting to manage memory during
preprocessing, and the command may eventually terminate or the system
may crash.
More information detailing how to set parallel defaults and options in Cardinal can be found in the Cardinal-package section of the Cardinal reference manual.
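For example, rather than passing BPPARAM to every call, a default backend can be registered once per session. This is a sketch, not a prescription: the worker count is illustrative and should be tuned to your machine.

```r
library(Cardinal)  # also attaches matter, which provides SnowfastParam

# Register a session-wide default parallel backend for Cardinal;
# the worker count here is illustrative only.
setCardinalBPPARAM(SnowfastParam(workers = 4L, force.GC = TRUE))

# Inspect the currently registered default backend.
getCardinalBPPARAM()
```

Cardinal calls that accept a BPPARAM argument will fall back to this default when none is supplied explicitly.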
Computer memory, more than speed, is the limiting factor for working
with MSI datasets. Cardinal v3 addresses these issues using the matter
package as its backend. In general, when loading .imzML files, Cardinal
leaves memory-heavy components (for example, intensity matrices)
out-of-memory, instead using a matter_mat to provide an
“interface” for the data. When data is read, such as during a
preprocessing step, the matter matrix streams the data from disk as it
is needed. This lets us interact with data that would be too large to
hold in memory at once. This process will be slower for certain
operations, and for computers without solid-state drives. Out-of-memory
objects have some limitations. Results of operations, such as
recalibration, are by default NOT stored as out-of-memory objects.
Consider the following code:
mse_recalibrated <- recalibrate(mse,
ref = reference_peaks, tolerance = tol, units = "ppm")
Before recalibration, mse holds metadata in memory and
its intensity matrix out-of-memory. After recalibration,
mse holds metadata AND the recalibrated intensity matrix
(the result of the recalibration step) IN memory. We will address this
by writing each intermediate result to disk as .imzML files, garbage
collecting the object, and reloading the results via
readMSIData, leaving us with out-of-memory objects. To check
an object’s memory usage, including both in-memory and out-of-memory
usage, one can use matter::mem(object). See
?matter::mem for additional details.
Peak picking is the process of selecting informative peaks for later analysis. For multi-sample experiments, peak picking must yield feature sets with a shared mass axis. We will begin with smoothing and baseline reduction. Not all datasets require these steps. Smoothing can help denoise outlier pixels. Baseline reduction corrects the elevated baseline that often occurs at the low end of the mass range.
We will perform smoothing and baseline reduction individually on each sample, using parallel processing to speed up each sample.
calibrant_mz <- sort(c(1570.6896, 843.39, 1019.50, 1211.58, 1552, 2097))
invisible(gc())
for (n in names(mse_list_raw)) {
cat(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), ": Starting ",n," BLR + SMOOTH\n")
m <- mse_list_raw[[n]]|>
smooth(method = "gaussian", width = 8, sd = 2) |>
reduceBaseline(method = "median") |>
process(BPPARAM=SnowfastParam(workers = 10, force.GC = T))
cat(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), ": Starting ",n," WRITE\n")
writeMSIDataSafe(m,
file.path("Data",
"Preprocessing",
"Smoothed-BLR",
paste0(runNames(m), ".imzML")),
BPPARAM=SnowfastParam(workers = 5L, force.GC = T))
rm(m)
invisible(gc())
}
## 2026-04-07 09:28:41 : Starting 6891-OA-lateral BLR + SMOOTH
## 2026-04-07 09:29:06 : Starting 6891-OA-lateral WRITE
## 2026-04-07 09:29:49 : Starting 6891-OA-medial BLR + SMOOTH
## 2026-04-07 09:30:10 : Starting 6891-OA-medial WRITE
## 2026-04-07 09:30:40 : Starting 6908-OA-lateral BLR + SMOOTH
## 2026-04-07 09:31:01 : Starting 6908-OA-lateral WRITE
## 2026-04-07 09:31:32 : Starting 6908-OA-medial BLR + SMOOTH
## 2026-04-07 09:31:49 : Starting 6908-OA-medial WRITE
## 2026-04-07 09:32:12 : Starting 6919-CAD-medial BLR + SMOOTH
## 2026-04-07 09:32:40 : Starting 6919-CAD-medial WRITE
## 2026-04-07 09:33:23 : Starting 6919-CAD-lateral BLR + SMOOTH
## 2026-04-07 09:33:57 : Starting 6919-CAD-lateral WRITE
## 2026-04-07 09:34:51 : Starting 6924-CAD-lateral BLR + SMOOTH
## 2026-04-07 09:35:12 : Starting 6924-CAD-lateral WRITE
## 2026-04-07 09:35:45 : Starting 6924-CAD-medial BLR + SMOOTH
## 2026-04-07 09:36:03 : Starting 6924-CAD-medial WRITE
## 2026-04-07 09:36:28 : Starting 6938-CAD-medial BLR + SMOOTH
## 2026-04-07 09:36:52 : Starting 6938-CAD-medial WRITE
## 2026-04-07 09:37:31 : Starting 6938-CAD-lateral BLR + SMOOTH
## 2026-04-07 09:38:01 : Starting 6938-CAD-lateral WRITE
## 2026-04-07 09:38:52 : Starting 6944-CAD-medial BLR + SMOOTH
## 2026-04-07 09:39:13 : Starting 6944-CAD-medial WRITE
## 2026-04-07 09:39:45 : Starting 6944-CAD-lateral BLR + SMOOTH
## 2026-04-07 09:40:10 : Starting 6944-CAD-lateral WRITE
## 2026-04-07 09:40:50 : Starting 6949-OA-lateral BLR + SMOOTH
## 2026-04-07 09:41:13 : Starting 6949-OA-lateral WRITE
## 2026-04-07 09:41:50 : Starting 6949-OA-medial BLR + SMOOTH
## 2026-04-07 09:42:12 : Starting 6949-OA-medial WRITE
## 2026-04-07 09:42:48 : Starting 6954-OA-lateral BLR + SMOOTH
## 2026-04-07 09:43:10 : Starting 6954-OA-lateral WRITE
## 2026-04-07 09:43:47 : Starting 6954-OA-medial BLR + SMOOTH
## 2026-04-07 09:44:09 : Starting 6954-OA-medial WRITE
Reload smoothed and baseline-reduced intermediates so they are out of memory.
DATA_PATH <- file.path("Data","Preprocessing","Smoothed-BLR")
MSE_FILENAMES <- list.files(DATA_PATH, pattern = "\\.imzML$",
recursive = TRUE)
mse_list_smooth = list()
for (f in MSE_FILENAMES) {
m <- readMSIData(file.path(DATA_PATH, f),verbose=F)
mse_list_smooth[[runNames(m)]] <- m
}
Peak recalibration aligns spectra to a shared reference. We will perform this step twice - the first time with a set of estimated, sample-specific reference peaks, and the second time with a set of predefined reference peaks shared across all samples. The intention is for the first recalibration to align spectra within a sample, and for the second to align samples to each other.
Recalibration is the most computationally and memory-intensive step in preprocessing. It may take a long time and may consume all of your device’s memory. Adjust parallel processing parameters as necessary.
for (n in names(mse_list_smooth)){
m <- mse_list_smooth[[n]]
cat(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), ": Starting ",n," ESTREFPEAKS\n")
esti_ref_peaks <- estimateReferencePeaks(m,
BPPARAM = SnowfastParam(workers = 10L, force.GC = T))
cat(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), ": Starting ",n," RECALIBx2\n")
m_recalib_picked <- m |>
recalibrate(ref = esti_ref_peaks, tolerance = 80, units = "ppm") |>
recalibrate(ref = calibrant_mz, tolerance = 80, units = "ppm") |>
process(BPPARAM=SnowfastParam(workers = 15L, force.GC = T))
invisible(gc())
cat(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), ": Starting ",n," WRITE\n")
writeMSIDataSafe(m_recalib_picked,
file.path("Data",
"Preprocessing",
"Smoothed-BLR-Recalibrated",
paste0(runNames(m_recalib_picked), ".imzML")),
BPPARAM=SnowfastParam(workers = 5L, force.GC = T))
rm(m_recalib_picked)
invisible(gc())
}
## 2026-04-07 09:44:52 : Starting 6891-OA-lateral ESTREFPEAKS
## 2026-04-07 09:46:59 : Starting 6891-OA-lateral RECALIBx2
## 2026-04-07 09:48:18 : Starting 6891-OA-lateral WRITE
## 2026-04-07 09:49:01 : Starting 6891-OA-medial ESTREFPEAKS
## 2026-04-07 09:50:26 : Starting 6891-OA-medial RECALIBx2
## 2026-04-07 09:51:11 : Starting 6891-OA-medial WRITE
## 2026-04-07 09:51:42 : Starting 6908-OA-lateral ESTREFPEAKS
## 2026-04-07 09:53:09 : Starting 6908-OA-lateral RECALIBx2
## 2026-04-07 09:54:03 : Starting 6908-OA-lateral WRITE
## 2026-04-07 09:54:33 : Starting 6908-OA-medial ESTREFPEAKS
## 2026-04-07 09:55:40 : Starting 6908-OA-medial RECALIBx2
## 2026-04-07 09:56:18 : Starting 6908-OA-medial WRITE
## 2026-04-07 09:56:40 : Starting 6919-CAD-lateral ESTREFPEAKS
## 2026-04-07 09:59:22 : Starting 6919-CAD-lateral RECALIBx2
## 2026-04-07 10:01:21 : Starting 6919-CAD-lateral WRITE
## 2026-04-07 10:02:15 : Starting 6919-CAD-medial ESTREFPEAKS
## 2026-04-07 10:04:23 : Starting 6919-CAD-medial RECALIBx2
## 2026-04-07 10:05:28 : Starting 6919-CAD-medial WRITE
## 2026-04-07 10:06:10 : Starting 6924-CAD-lateral ESTREFPEAKS
## 2026-04-07 10:07:43 : Starting 6924-CAD-lateral RECALIBx2
## 2026-04-07 10:08:43 : Starting 6924-CAD-lateral WRITE
## 2026-04-07 10:09:15 : Starting 6924-CAD-medial ESTREFPEAKS
## 2026-04-07 10:10:32 : Starting 6924-CAD-medial RECALIBx2
## 2026-04-07 10:11:10 : Starting 6924-CAD-medial WRITE
## 2026-04-07 10:11:35 : Starting 6938-CAD-lateral ESTREFPEAKS
## 2026-04-07 10:13:44 : Starting 6938-CAD-lateral RECALIBx2
## 2026-04-07 10:15:21 : Starting 6938-CAD-lateral WRITE
## 2026-04-07 10:16:07 : Starting 6938-CAD-medial ESTREFPEAKS
## 2026-04-07 10:17:50 : Starting 6938-CAD-medial RECALIBx2
## 2026-04-07 10:18:45 : Starting 6938-CAD-medial WRITE
## 2026-04-07 10:19:19 : Starting 6944-CAD-lateral ESTREFPEAKS
## 2026-04-07 10:21:02 : Starting 6944-CAD-lateral RECALIBx2
## 2026-04-07 10:22:06 : Starting 6944-CAD-lateral WRITE
## 2026-04-07 10:22:42 : Starting 6944-CAD-medial ESTREFPEAKS
## 2026-04-07 10:24:06 : Starting 6944-CAD-medial RECALIBx2
## 2026-04-07 10:24:52 : Starting 6944-CAD-medial WRITE
## 2026-04-07 10:25:21 : Starting 6949-OA-lateral ESTREFPEAKS
## 2026-04-07 10:26:58 : Starting 6949-OA-lateral RECALIBx2
## 2026-04-07 10:27:48 : Starting 6949-OA-lateral WRITE
## 2026-04-07 10:28:22 : Starting 6949-OA-medial ESTREFPEAKS
## 2026-04-07 10:29:54 : Starting 6949-OA-medial RECALIBx2
## 2026-04-07 10:30:36 : Starting 6949-OA-medial WRITE
## 2026-04-07 10:31:07 : Starting 6954-OA-lateral ESTREFPEAKS
## 2026-04-07 10:32:52 : Starting 6954-OA-lateral RECALIBx2
## 2026-04-07 10:33:53 : Starting 6954-OA-lateral WRITE
## 2026-04-07 10:34:26 : Starting 6954-OA-medial ESTREFPEAKS
## 2026-04-07 10:35:59 : Starting 6954-OA-medial RECALIBx2
## 2026-04-07 10:36:46 : Starting 6954-OA-medial WRITE
rm(mse_list_smooth)
invisible(gc())
As before, reload intermediates to be out of memory.
DATA_PATH_PICKED <- file.path("Data","Preprocessing",
"Smoothed-BLR-Recalibrated")
MSE_FILENAMES_PICKED <- list.files(DATA_PATH_PICKED, pattern = "\\.imzML$",
recursive = TRUE)
mse_list_rp = list()
for (f in MSE_FILENAMES_PICKED) {
m <- readMSIData(file.path(DATA_PATH_PICKED, f), verbose=F)
mse_list_rp[[runNames(m)]] <- m
}
Next, let us assign metadata to each pixel. This will assist later
with modeling and tissue identification. This notation,
i.e. MSimage$var_name <- value, assigns
value as a column in the MSimage
PositionDataFrame object. value must be of
length 1 (to assign the same value to every pixel) or length n_pixels
(to assign a specific value to each pixel).
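The length rule can be illustrated with a plain data.frame standing in for the PositionDataFrame (a toy example, not Cardinal code):

```r
# Toy stand-in for pixel metadata: 4 "pixels"
pd <- data.frame(x = c(1, 2, 1, 2), y = c(1, 1, 2, 2))

# Length-1 value: recycled to every pixel
pd$condition <- "OA"

# Length-n_pixels value: one entry per pixel
pd$tissue_zone <- c("cartilage", "cartilage", "bone", "background")

pd$condition  # "OA" "OA" "OA" "OA"
```

Any other length is an error, which guards against silently misaligned metadata.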
for (n in names(mse_list_rp)) {
sample_info <- name_tibble |> filter(run == n)
mse_list_rp[[n]]$subject <- sample_info$subject
mse_list_rp[[n]]$condition <- sample_info$condition
mse_list_rp[[n]]$tissue <- sample_info$tissue
}
This chain conducts multiple steps, yielding a single
MSImagingExperiment object containing a run for every
sample, all with a shared mass axis.
- combine groups samples into a single MSImagingArrays object.
- peakPick performs peak picking on all spectra with the same method and signal-to-noise ratio (SNR) across all samples.
- peakAlign centroids the picked peaks, creating the unified mass axis for all samples. This also converts the MSImagingArrays to an MSImagingExperiment.
- process actually executes the above steps.
- summarizeFeatures summarizes pixel intensities by features. Keep in mind that these summaries are across all values from every sample - now represented as runs in the MSImagingExperiment object.
- The final step, changing the spectrumRepresentation, ensures that Cardinal writes the object to disk as centroided. (This is not always necessary and is done to ensure backwards compatibility with previous versions of Cardinal v3.)
After these steps, the resulting combined image will have an orders-of-magnitude smaller memory footprint, as we are no longer maintaining separate mass axes for each spectrum and non-centroided peaks. The object should easily fit in memory.
mse_all <- combine(mse_list_rp) |>
peakPick(method="diff", SNR=5) |>
peakAlign(tolerance = 80, units = "ppm") |>
process(BPPARAM=SnowfastParam(workers = 10L, force.GC = T)) |>
summarizeFeatures(stat = c(sd = "sd", mean = "mean", nnzero = "nnzero"),
na.rm = T, BPPARAM=SnowfastParam(workers = 5L))
rm(mse_list_rp)
invisible(gc())
# This is likely unnecessary, but may prevent
# issues with older versions of Cardinal
experimentData(mse_all)$spectrumRepresentation <- "centroid spectrum"
Cardinal automatically calculates freq during peak
picking, but this metric is computed across all runs. In our workflow,
we recommend filtering peaks by non-zero frequency on a per-run basis.
The function below can do this.
filterFeaturesByRunFreq <- function(mse, cutoff = 0.01,
run_col = "run",
spectra_name = "intensity"){
pd <- pixelData(mse)
run <- pd[[run_col]]
if (is.null(run)) stop("pixelData(mse)$", run_col, " is NULL")
run <- as.factor(run)
X <- spectra(mse, spectra_name)
keep <- rep_len(TRUE, nrow(mse))
for (lv in levels(run)) {
cols <- which(run == lv)
if (length(cols) == 0L) next
Xr <- X[, cols, drop = FALSE]
# present iff finite AND non-zero
present_counts <- rowSums(is.finite(Xr) & (Xr != 0))
freq <- present_counts / length(cols)
keep <- keep & (freq >= cutoff)
}
mse[keep, , drop = FALSE]
}
Using the function defined above, filter peaks so that only those with
freq >= 0.05 in every run are retained.
mse_all_filt <- mse_all |>
filterFeaturesByRunFreq(0.05) |>
summarizePixels(stat = c(tic = "sum", mean = "mean", nnzero = "nnzero"),
na.rm = T, BPPARAM=SnowfastParam(workers = 5L))
invisible(gc())
Write the uniform, peak-picked image to disk.
writeMSIDataSafe(mse_all_filt,
file.path("Data",
"Preprocessing",
"Smoothed-BLR-Recalibrated-Picked-Aligned-FreqFiltered05perRun.imzML"))
For other steps, it is useful to have each peak-picked image separate again.
for (rn in runNames(mse_all_filt)) {
m <- subsetPixels(mse_all_filt, run == rn)
writeMSIDataSafe(m,
file.path("Data",
"Preprocessing",
"Individual-Picked-Aligned-Filtered",
paste0(rn, ".imzML")))
}
Remove unnecessary objects from memory.
#CLEAN UP
rm(m, mse_all_filt, mse_all)
invisible(gc())
ROIs define the tissue areas relevant to the biological question to be analyzed in later steps. Region of interest (ROI) segmentation can be achieved through many means. Below we present a method that uses predefined marker ions to segment regions of interest. We will use this method to segment cartilage, bone, and background/extracellular matrix (ECM). It is important to involve subject matter experts (i.e., pathologists and biologists who know the tissue) in selection of ROI markers and review of segmentation results. Unsupervised multivariate segmentation methods that use all features, although useful for discovery, should not be used to create ROIs when investigating differential abundance, as doing so constitutes double dipping and overfits ROIs to biological variation.
library(dplyr)
library(ggplot2)
library(data.table)
library(tibble)
library(Cardinal)
source(file.path("Utilities.R"))
RNGkind("L'Ecuyer-CMRG")
setCardinalVerbose(verbose = FALSE)
# Patch for domain-filtered sparse_mat coercion bug
setMethod("as.vector", "sparse_arr",
function(x, mode = "any") {
if ( mode != "any" )
matter::type(x) <- mode
y <- as.vector(x[])
names(y) <- NULL
y
})
DATA_PATH <- file.path("Data",
"Preprocessing",
"Individual-Picked-Aligned-Filtered")
MSE_FILENAMES <- list.files(DATA_PATH, pattern = "\\.imzML$",
recursive = TRUE)
We will keep them in a named list for now.
mse_list_pp = list()
for (f in MSE_FILENAMES) {
sample_name <- sub('\\..*$', '', basename(f))
m <- readMSIData(file.path(DATA_PATH, f),verbose=F)
runNames(m) <- sample_name
mse_list_pp[[sample_name]] <- m
}
First, let’s subset a representative image. We will locate the features with m/z values nearest our approximate values for the representative features.
m <- readMSIData(file.path(DATA_PATH, MSE_FILENAMES[1]), verbose=F)
cartilage_feature_approx <- 1141.5556
cartilage_feature_ids <- features(m, mz = cartilage_feature_approx)
background_feature_approx <- c(1570.6786,
1571.6801,
1572.6808,
1592.6699,
1593.6739,
1594.6766,
1608.6582)
background_feature_ids <- features(m, mz = background_feature_approx)
m <- summarizeFeatures(m,
na.rm = T,
verbose = F,
BPPARAM = SnowfastParam(workers = 5L,
force.GC = T))
We chose m/z 1141.5556 as the cartilage marker, as it accurately defines cartilage across the greatest number of tissues. Below is a single tissue.
image(m, i = cartilage_feature_ids, key = F)
There were seven m/z values indicative of background/ECM: 1570.679, 1571.68, 1572.681, 1592.67, 1593.674, 1594.677, 1608.658.
M/z 1570.679, 1571.68, 1572.681 are Glufib and its +1 and +2 isotopes.
M/z 1592.67, 1593.674, 1594.677 are potentially Glufib +23 sodium adducts and their +1 and +2 isotopes.
M/z 1608.658 shows an identical spatial distribution, similar enough to include in ROI segmentation.
image(m, i = background_feature_ids,
scale = T,
key = F)
To begin we will select only the single cartilage marker as its own MSE.
cartilage_features <- m[cartilage_feature_ids,]
The spatial-Dirichlet Gaussian mixture model (sDGMM) (Guo et al. 2019) is an open-source, unsupervised clustering algorithm that groups pixels in single ion images into \(k\) prespecified clusters using pixel intensity and the spatial dependence of neighboring pixels. sDGMM is technically unsupervised but acts only on single features, facilitating a higher degree of control and replicability. We will use adaptive weights to better define segment edges.
Choosing the number of clusters is one way to affect a finer level of control over what sDGMM segments. In our testing we found 5 clusters, ordered by mean intensity with the first three assigned as cartilage, to yield the most consistent results across samples.
set.seed(12345)
sdgmm <- spatialDGMM(cartilage_features,
k = 5,
r = 5,
weights = "adaptive",
verbose = F)
cartilage_vector <- recode_values(as.character(sdgmm$class[[1]]),
"5" ~ "other",
"4" ~ "other",
"3" ~ "other",
"2" ~ "cartilage",
"1" ~ "cartilage")
pData(cartilage_features)$class <- cartilage_vector
Visualizing the 5 clusters, we can see that in this case segmenting more clusters gave finer control over cartilage boundaries.
image(sdgmm)
After collapsing the 5 segments into only two.
image(cartilage_features, "class")
Background/ECM and bone tissues are mutually exclusive with cartilage, so we will perform segmentation only on sample areas classified in the steps above as not cartilage. This offers another advantage: the markers used to differentiate bone and background/ECM do not also have to differentiate cartilage.
First we will select only the background/ECM features and pixels not classified as cartilage.
background_and_bone_features <- Cardinal::subset(m,
mz %in% mz(m)[background_feature_ids],
cartilage_vector=="other")
Unlike cartilage, we have selected seven features for background/ECM
segmentation. Instead of segmenting each individually, we will first
combine them into a single feature by using the pixel-wise mean.
condense_groups (supplied from Utilities.R)
uses the group_ids assigned by grouping_df to
combine features into groups. The agg_fun="mean" specifies
we want to combine groups via the pixel-wise mean.
grouping_df <- tibble(feature_id = 1:length(mz(background_and_bone_features)),
group_id = rep(1, length(mz(background_and_bone_features))),
mz = mz(background_and_bone_features))
background_and_bone_features_condensed <- condense_groups(background_and_bone_features,
grouping_df,
verbose = FALSE,
agg_fun="mean",
na.rm=T)
Aggregating these features into a single feature by the pixel-wise mean concentrates signal and smooths out noise. Feature aggregation for ROI selection also presents some interesting possibilities - such as segmenting the pixel-wise ratio of two features.
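As a toy illustration of these aggregation ideas, here is plain base R on a small intensity matrix (not the condense_groups helper, and the values are invented):

```r
# Toy intensity matrix: rows = 2 features, cols = 5 pixels
X <- rbind(feat_a = c(10, 20, 0, 40, 5),
           feat_b = c( 5, 10, 2,  0, 5))

# Pixel-wise mean aggregate (conceptually what agg_fun = "mean" produces)
agg_mean <- colMeans(X)

# Pixel-wise ratio of two features, guarding against division by zero
ratio <- ifelse(X["feat_b", ] != 0, X["feat_a", ] / X["feat_b", ], NA)
ratio  # 2 2 0 NA 1
```

The ratio variant could highlight regions where one marker dominates another, at the cost of undefined values wherever the denominator is zero.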
image(background_and_bone_features_condensed, key = F)
Next we will segment the background/ECM feature aggregate using sDGMM. Like before, varying the number of clusters affects the segmentation results, especially at the segment boundaries.
set.seed(12345)
sdgmm <- spatialDGMM(background_and_bone_features_condensed,
k = 4,
r = 1,
weights = "adaptive",
verbose = F)
bone_background_vector <- recode_values(as.character(sdgmm$class[[1]]),
"4" ~ "bone",
"3" ~ "background",
"2" ~ "background",
"1" ~ "background")
pData(background_and_bone_features)$class <- bone_background_vector
Visualizing the four clusters before collapsing.
image(sdgmm)
After collapsing into two classes, background and bone.
image(background_and_bone_features, "class")
There are likely more underlying structures, especially in the bone ROI, that could be sub-segmented further. For the purpose of this demonstration we will conclude here, with three ROIs. The following combines ROI labels hierarchically.
factor_tibble <- tibble(x = pData(cartilage_features)$x,
y = pData(cartilage_features)$y,
cartilage = cartilage_vector) |>
left_join(tibble(x = pData(background_and_bone_features)$x,
y = pData(background_and_bone_features)$y,
bone_background = bone_background_vector),
by = c("x","y")) |>
mutate(zone = as.factor(ifelse(is.na(bone_background),
"cartilage",
bone_background)))
pData(m)$tissue_zone <- factor_tibble$zone
After segmentation and label combination we can visualize the final result for this sample.
image(m, "tissue_zone")
Now we will repeat this procedure on all samples, saving the ROI labels in a named list and a data frame for visualization.
bp <- SnowfastParam(workers = 15L, force.GC = TRUE)
mse_list_segmented <- BiocParallel::bplapply(
mse_list_pp,
function(mse) {
source(file.path("Utilities.R"))
# bug in matter - needs to be here for now
setMethod("as.vector", "sparse_arr",
function(x, mode = "any") {
if ( mode != "any" )
matter::type(x) <- mode
y <- as.vector(x[])
names(y) <- NULL
y
})
cartilage_feature_approx <- 1141.5556
cartilage_feature_ids <- features(mse,
mz = cartilage_feature_approx)
background_feature_approx <- c(1570.6786,
1571.6801,
1572.6808,
1592.6699,
1593.6739,
1594.6766,
1608.6582)
background_feature_ids <- features(mse,
mz = background_feature_approx)
## --- cartilage vs other ---
cartilage_features <- mse[cartilage_feature_ids, ]
set.seed(12345)
sdgmm1 <- spatialDGMM(
cartilage_features,
k = 5,
r = 5,
weights = "adaptive",
verbose = F
)
cartilage_vector <- recode_values(
as.character(sdgmm1$class[[1]]),
"5" ~ "other",
"4" ~ "other",
"3" ~ "other",
"2" ~ "cartilage",
"1" ~ "cartilage"
)
background_and_bone_features <- Cardinal::subset(
mse,
background_feature_ids,
cartilage_vector == "other"
)
grouping_df <- tibble(
feature_id = seq_along(mz(background_and_bone_features)),
group_id = 1L,
mz = mz(background_and_bone_features)
)
background_and_bone_features_condensed <- condense_groups(
background_and_bone_features,
grouping_df,
verbose = FALSE,
agg_fun = "mean",
na.rm = TRUE
)
set.seed(12345)
sdgmm2 <- spatialDGMM(
background_and_bone_features_condensed,
k = 4,
r = 1,
weights = "adaptive",
verbose = FALSE
)
bone_background_vector <- recode_values(
as.character(sdgmm2$class[[1]]),
"4" ~ "bone",
"3" ~ "background",
"2" ~ "background",
"1" ~ "background"
)
## --- assemble zones ---
factor_tibble <- tibble(
x = pData(cartilage_features)$x,
y = pData(cartilage_features)$y,
cartilage = cartilage_vector
) |>
left_join(
tibble(
x = pData(background_and_bone_features)$x,
y = pData(background_and_bone_features)$y,
bone_background = bone_background_vector
),
by = c("x", "y")
) |>
mutate(
zone = factor(ifelse(is.na(bone_background),
"cartilage",
bone_background))
)
pData(mse)$tissue_zone <- factor_tibble$zone
mse
},
BPPARAM = bp
)
pdata_list <- lapply(names(mse_list_segmented), function(nm) {
dt <- as.data.table(pData(mse_list_segmented[[nm]]))
dt[, id := nm]
dt
})
hc_class_plotting_frame <- rbindlist(pdata_list)
roi_list <- hc_class_plotting_frame[, .(x, y, run, tissue_zone)]
save(roi_list, file = file.path("Outputs", "roi_list.Rdata"))
To illustrate the effect of segmentation method on P values, we will
compare the single-ion segmentation of the OA dataset to segmentation
results from multivariate segmentation methods, specifically spatial
shrunken centroids. Multiple values of k and s
were tested. Segmentation did not meaningfully improve past
k=5, and there were minimal differences between different
values of s, so it was kept at s = 0 (the
default). Even so, several tissues demonstrated poor segmentation of
cartilage. Additionally, because multivariate segmentation begins with
random class allocation, it will not produce the same class labels for
each tissue. Therefore, the author has annotated them manually,
including situations in which multiple classes must be combined into
one to form cartilage. We will use these results at the end of the
vignette to evaluate the sensitivity of t-tests between conditions
when performed on ROIs segmented with different methods.
bp <- SnowfastParam(workers = 16L, force.GC = TRUE)
sscs <- BiocParallel::bplapply(
mse_list_pp,
function(mse) {
set.seed(54321)
# bug in matter - needs to be here for now
setMethod("as.vector", "sparse_arr",
function(x, mode = "any") {
if ( mode != "any" )
matter::type(x) <- mode
y <- as.vector(x[])
names(y) <- NULL
y
})
ssc <- spatialShrunkenCentroids(mse, k = 5, weights = "adaptive")
ssc
},
BPPARAM = bp
)
manage_directory("sscs")
for (ssc in names(sscs)) {
image_png(sscs[[ssc]], "class", filename = file.path("sscs", paste0("ssc_",ssc,".png")))
}
mvs_cartilage_class <- list("6891-OA-lateral" = c(3),
"6891-OA-medial" = c(4),
"6908-OA-lateral" = c(3),
"6908-OA-medial" = c(5),
"6919-CAD-lateral" = c(5),
"6919-CAD-medial" = c(3),
"6924-CAD-lateral" = c(3,5),
"6924-CAD-medial" = c(4),
"6938-CAD-lateral" = c(1),
"6938-CAD-medial" = c(3),
"6944-CAD-lateral" = c(2),
"6944-CAD-medial" = c(1),
"6949-OA-lateral" = c(2),
"6949-OA-medial" = c(4),
"6954-OA-lateral" = c(3),
"6954-OA-medial" = c(3))
pdata_list_mvs <- lapply(names(mse_list_pp), function(nm) {
class_vec <- sscs[[nm]]$class
mse_list_pp[[nm]]$tissue_zone_mvs <- ifelse( class_vec %in% mvs_cartilage_class[[nm]] ,"cartilage","other")
dt <- as.data.table(pData(mse_list_pp[[nm]]))
dt[, id := nm]
dt
})
hc_class_plotting_frame_mvs <- rbindlist(pdata_list_mvs)
roi_list_mvs <- hc_class_plotting_frame_mvs[, .(x, y, run, tissue_zone_mvs)]
save(roi_list_mvs, file = file.path("Outputs", "roi_list_mvs.Rdata"))
plt <- ggplot(hc_class_plotting_frame_mvs |>
left_join(name_tibble, by = join_by(run)),
aes(x = x, y = -1*y, fill = tissue_zone_mvs)) +
geom_tile() +
theme_void() +
facet_wrap("tid", scales = "free", ncol = 2) +
scale_fill_brewer(palette = "Paired") +
theme(legend.position = "none") +
labs(fill = "ROI")
Visualization of all samples’ ROIs.
ggplot(hc_class_plotting_frame |>
left_join(name_tibble, by = join_by(run)) |>
mutate(roi = recode_values(tissue_zone,
"bone" ~ "Bone",
"background" ~ "Background",
"cartilage" ~ "Cartilage")),
aes(x = x, y = -1*y, fill = roi)) +
geom_tile() +
theme_void() +
facet_wrap("tid", scales = "free", ncol = 2) +
scale_fill_brewer(palette = "Paired") +
theme(legend.position = "none") +
labs(fill = "ROI")
An alternative approach would be to run the sDGMM segmentation on the
single MSImagingExperiment object, thereby forcing the
algorithm to consider every run’s spatial distribution in unison. This
could make segmentation more comparable across runs, but it would also
overfit to both technical noise (intensity differences that would have
been corrected during normalization) and biological differences caused
by experimental groups. This would confound the effect we intend to
study, so instead we cluster each run independently.
Normalization exists to decrease nuisance variation and enhance signal, but relies on a set of assumptions. Briefly, normalization schemes that scale by a global constant (e.g., TIC, median) assume that few analytes are truly changing and that the amount of signal is constant between spectra. Reference normalization assumes the signal of the reference is constant across spectra. Normalization remains an active field of research, with few definite recommendations available aside from the necessity of some normalization scheme.
library(dplyr)
library(tibble)
library(data.table)
library(Cardinal)
source(file.path("Utilities.R"))
RNGkind("L'Ecuyer-CMRG")
setCardinalVerbose(verbose = FALSE)
options(stringsAsFactors = FALSE)
We will perform normalization on the unified dataset.
m <- readMSIData(file.path("Data",
"Preprocessing",
"Smoothed-BLR-Recalibrated-Picked-Aligned-FreqFiltered05perRun.imzML"))
Normalization should only be performed on features that represent true biology. Additionally, features with very high intensity (far greater than endogenous analytes) will bias normalization and should be dropped. We will remove the mass calibrant and its isotopes.
bg_mz_approx <- c(1570.6786,
1571.6801,
1572.6808,
1592.6699,
1593.6739,
1594.6766)
bg_idx <- features(m, mz=bg_mz_approx)
m <- subsetFeatures(m, mz %ni% mz(m)[bg_idx])
Sparsity of pixels and features is ubiquitous in MSI. Sparsity is generally a mix of concepts: zero intensity, intensity below the limit of detection, and missing values. Although each is subtly different, they are highly interlinked due to the nature of mass spectrometry, and in practice they behave similarly. These conceptual similarities do not extend to preprocessing, however: zeros and missing values behave very differently mathematically. In general, zeros are retained while missing values are dropped. Retained zeros bias summaries, especially means, towards zero; dropped missing values bias summaries upward, towards the non-missing values. There is no good way around this predicament other than to reduce sparsity by defining more stringent ROIs and filtering highly sparse features. Regardless, at the time of normalization one must choose a convention (leaving zeros as they are, or replacing them with NAs) and stick with it throughout the analysis.
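A quick illustration of this bias, using a made-up intensity vector rather than data from this study:

```r
# A sparse feature: three detected pixels and two empty ones
ints <- c(0, 5, 0, 7, 9)

# Keeping the zeros pulls the mean down...
mean(ints)                   # 4.2

# ...while recoding zeros as NA and dropping them pulls it up
ints_na <- replace(ints, ints == 0, NA_real_)
mean(ints_na, na.rm = TRUE)  # 7
```

Neither summary is unbiased; the choice only determines the direction of the bias.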
In our testing, we found median normalization was less biased by outliers than TIC normalization; the trade-off is that TIC normalization is slightly more robust to sparsity. Median normalization is simple: compute the median intensity of each pixel's spectrum, then scale that spectrum so that all pixels share the same median. Although simple, this still presents issues. If the median of a spectrum is zero, the scaled intensities become undefined. If we instead replace zeros with missing values and recompute the median with missing values dropped, the median is biased larger. There is no easy answer. For this analysis, we chose to replace zeros with NA values, as leaving zeros in would zero-inflate the mean abundances used in statistical modeling, breaking the assumption that residuals are normally distributed.
Even so, we found that median normalization can still be biased by
outliers. True outliers are rare, but can be of very high magnitude.
To control them, we clip each spectrum's intensities at its 95th
percentile. These steps are performed in the chunk below using
matter's addProcessing function to queue steps for parallel
processing.
# recode zero intensities as missing so they do not drag medians to zero
zero_to_na <- function(x, mz) {
x[x == 0] <- NA_real_
x
}
# clip each spectrum's intensities at its own 95th percentile
clip_95 <- function(vector_ints, mz) {
if (all(is.na(vector_ints))) {
vector_ints
} else {
thresh <- quantile(vector_ints, 0.95, na.rm = TRUE)
pmin(vector_ints, thresh)
}
}
mse <- m |>
addProcessing(zero_to_na, label="zero_to_na_before_median_norm") |>
addProcessing(clip_95, label="clip_95percentile") |>
process(spectra="intensity",
index="mz",
BPPARAM = SnowfastParam(workers = 12L, force.GC = TRUE)) |>
summarizePixels(stat = c(tic = "sum", mean = "mean", nnzero = "nnzero"),
na.rm = T, BPPARAM=SnowfastParam(workers = 5L)) |>
summarizeFeatures(stat = c(sd = "sd", mean = "mean", nnzero = "nnzero"),
na.rm = T, BPPARAM=SnowfastParam(workers = 5L))
We also compute a grand median (the median, across runs, of each run's median of per-spectrum medians) to use as the final scaling constant. The final constant can be anything, so we chose this value to remain representative of the data.
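With made-up per-spectrum medians (not values from this study), the grand median reduces to a median of per-run medians, sketched here in base R:

```r
# Hypothetical per-spectrum medians for two runs, A and B
meds <- data.frame(run    = c("A", "A", "A", "B", "B"),
                   median = c(2, 4, 6, 10, 30))

# Median of spectrum medians within each run...
run_medians <- tapply(meds$median, meds$run, median)  # A = 4, B = 20

# ...then the median across runs gives the grand scaling constant
grand_median_toy <- median(run_medians)               # 12
```

Taking the median within each run first prevents runs with many pixels from dominating the constant.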
pData(mse)$median <- Cardinal::spectrapply(
mse,
FUN = function(x) stats::median(x, na.rm = TRUE),
spectra = "intensity",
simplify = TRUE,
BPPARAM = SnowfastParam(workers = 12L, force.GC = TRUE)
)
# sanity check
min(pData(mse)$median, na.rm = TRUE)
## [1] 1.934078
max(pData(mse)$median, na.rm = TRUE)
## [1] 118.7556
grand_median <- median(
as.data.table(pData(mse))[,
.(run_median = median(median, na.rm = TRUE)),
by = run][,
run_median]
)
scale_by_median <- function(x, mz, scaling_constant, na.rm = TRUE) {
med <- stats::median(x, na.rm = na.rm)
if (is.na(med)) {
warning("scale_by_median: median is NA; returning all NA for that spectrum")
return(rep(NA_real_, length(x)))
}
if (med == 0) {
stop("scale_by_median: median == 0 encountered")
}
(scaling_constant / med) * x
}
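For a quick sanity check on a toy vector (values invented for illustration), the function rescales a spectrum so that its median equals the scaling constant; the definition is restated here so the snippet stands alone:

```r
# restated from above so this snippet runs standalone
scale_by_median <- function(x, mz, scaling_constant, na.rm = TRUE) {
  med <- stats::median(x, na.rm = na.rm)
  if (is.na(med)) return(rep(NA_real_, length(x)))
  if (med == 0) stop("scale_by_median: median == 0 encountered")
  (scaling_constant / med) * x
}

# A hypothetical spectrum with median 4, rescaled to a target median of 12
scaled <- scale_by_median(c(2, 4, 8), mz = NULL, scaling_constant = 12)
scaled           # 6 12 24
median(scaled)   # 12
```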
We now apply normalization. Cardinal does not provide a native implementation of median normalization, so we wrote our own. Because the data are centroided and peak-picked, they fit easily in memory.
normalized <- mse |>
addProcessing(scale_by_median, label="scale_by_median",
scaling_constant=grand_median, na.rm=TRUE) |>
process(spectra="intensity", index="mz",
BPPARAM = SnowfastParam(workers = 12L, force.GC = TRUE))
normalized <- normalized |>
summarizePixels(stat = c(tic = "sum", mean = "mean", nnzero = "nnzero"),
na.rm = T, BPPARAM=SnowfastParam(workers = 5L)) |>
summarizeFeatures(stat = c(sd = "sd", mean = "mean", nnzero = "nnzero"),
na.rm = T, BPPARAM=SnowfastParam(workers = 5L))
Finally, save the normalized dataset.
writeMSIDataSafe(normalized, file.path("Data",
"Preprocessing",
"normalized.imzML"))
The code below recreates figures from the paper.
png(filename = file.path("Recreated-Figures",
"Figure-Peakpicking-Preprocessed.png"),
height = 500, width = 600, pointsize = 20)
print(plot(normalized, "mean",
ylab = "Mean Pixel Intensity",
run = "6891-OA-lateral",
grid = F))
dev.off()
knitr::kable(normalized |>
subsetPixels(run == "6891-OA-lateral") |>
summarizeFeatures(stat = c("mean", "sd"), na.rm = T) |>
fData() |>
as_tibble() |>
filter(mean != 0 | sd != 0) |>
mutate(cv = sd/mean) |>
summarise(mean_cv = mean(cv, na.rm = T)))
| mean_cv |
|---|
| 0.238479 |
For the figure comparing p-values from t-tests between ROIs created with univariate segmentation vs. multivariate segmentation, we will test each feature. These steps will be covered in greater detail later.
load(file.path("Outputs", "roi_list.Rdata"))
load(file.path("Outputs", "roi_list_mvs.Rdata"))
mse_norm <- readMSIData(file.path("Data",
"Preprocessing",
"normalized.imzML")) |>
addProcessing(zero_to_na, label="zero_to_na_before_filtering") |>
process(spectra="intensity", index="mz",
BPPARAM = SnowfastParam(workers = 12L, force.GC = TRUE),
verbose = F)
pData(mse_norm)$tissue_zone <- as.data.table(pData(mse_norm)) |>
left_join(roi_list, by = c("x","y","run")) |>
dplyr::pull(tissue_zone)
pData(mse_norm)$tissue_zone_mvs <- as.data.table(pData(mse_norm)) |>
left_join(roi_list_mvs, by = c("x","y","run")) |>
dplyr::pull(tissue_zone_mvs)
get_mse_frame <- function(mse_summ){
mse_summ |>
fData() |>
as_tibble() |>
mutate(feature_id = row_number()) |>
tidyr::pivot_longer(
X6891.OA.lateral.mean:X6954.OA.medial.mean,
names_to = "run",
values_to = "cartilage.mean",
names_pattern = "X(.+)\\.mean$",
names_transform = list(run = ~ stringr::str_replace_all(.x, "\\.", "-"))
) |>
left_join(name_tibble, by = join_by(run)) |>
mutate_at(vars(feature_id, subject, condition, tissue), factor) |>
arrange(feature_id, condition, tissue, subject) |>
select(feature_id, mz, run, cartilage.mean, subject, condition, tissue) |>
filter(tissue == "Medial")
}
mse_cartilage <- mse_norm |>
subsetPixels(tissue_zone == "cartilage")
mse_frame <- summarizeFeatures(mse_cartilage,
stat = "mean",
na.rm = T,
group = mse_cartilage$run) |>
get_mse_frame()
mse_cartilage_mvs <- mse_norm |>
subsetPixels(tissue_zone_mvs == "cartilage")
mse_frame_mvs <- summarizeFeatures(mse_cartilage_mvs,
stat = "mean",
na.rm = T,
group = mse_cartilage_mvs$run) |>
get_mse_frame()
modeling_frame <- mse_frame |>
mutate(cartilage.mean.mvs = mse_frame_mvs$cartilage.mean)
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:data.table':
##
## transpose
library(broom)
results <- modeling_frame %>%
group_by(feature_id) %>%
summarise(
t1 = list(tidy(t.test(cartilage.mean ~ condition, var.equal = T))),
t2 = list(tidy(t.test(cartilage.mean.mvs ~ condition, var.equal = T))),
.groups = "drop"
) %>%
tidyr::unnest(t1) %>%
tidyr::unnest(t2, names_sep = ".mvs")
results |>
ggplot(aes(x = -log10(p.value), y = -log10(t2.mvsp.value))) +
geom_hex(bins = 38) +
scale_fill_gradient(low = "#ffe6e4f7", high = "#8e0315") +
geom_abline(intercept = 0, slope = 1) +
labs(x = "Univariate segmentation of representative markers: -log10(P value)",
y = "Multivariate segmentation with all features: -log10(P value)",
fill = "Count") +
stat_smooth(method = "lm") +
theme_minimal() +
coord_equal() +
ggpubr::stat_regline_equation(aes(label = ..eq.label..), label.x = 1.25, label.y = .15, color = "blue")
## `geom_smooth()` using formula = 'y ~ x'
ggsave(file.path("Recreated-Figures", "Figure-ROI-Segmentation-Method-P-Values.png"), height = 5.0, width = 5.3)
## `geom_smooth()` using formula = 'y ~ x'