Filtering and aggregation of peak-picked and normalized MSI data further concentrate signal and reduce noise. Additionally, these steps reduce the number of features, and therefore the number of statistical comparisons. Fewer tests mean less stringent multiple hypothesis correction and greater experimental sensitivity. Unfortunately, feature clustering is computationally intensive and may require access to high-performance computing resources.
COMBINED_DATA_PATH <- file.path("Data","Preprocessing","normalized.imzML")
load(file.path("Outputs", "roi_list.Rdata"))
We will load the combined, normalized, peak-picked image and then replace zeros with NAs. The imzML format does not support NA values, so Cardinal reads them back as zeros on reload. We will also join the ROIs created in 1-Preprocessing.
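The zero_to_na helper is defined in an earlier section; a minimal sketch of what it does is below. The signature is an assumption based on how Cardinal's addProcessing() passes each spectrum's intensities and index to a processing function; the real helper may differ.

```r
# Hypothetical sketch of the zero_to_na helper (the real one is defined
# earlier in this workflow): replace zero intensities with NA so that
# downstream summaries can distinguish "missing" from "measured as zero".
zero_to_na <- function(x, t, ...) {
  x[x == 0] <- NA_real_
  x
}

zero_to_na(c(0, 1.5, 0, 2), NULL)
```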
mse <- readMSIData(COMBINED_DATA_PATH, verbose = F)
mse <- mse |>
  addProcessing(zero_to_na, label = "zero_to_na_before_filtering") |>
  process(spectra = "intensity", index = "mz",
          BPPARAM = SnowfastParam(workers = 12L, force.GC = TRUE),
          verbose = F)
pData(mse)$tissue_zone <- as.data.table(pData(mse)) |>
  left_join(roi_list, by = c("x", "y", "run")) |>
  dplyr::pull(tissue_zone)
Some features used for segmentation remain. Conducting analysis on these features is double dipping, as they have already been used to create the ROIs. You can explore this concept in greater detail in Simulation-2. The features() function will throw a warning if it cannot find a close enough m/z value; assuming that feature was already dropped, that is OK. (Warnings are muted for this chunk.)
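The tolerance matching that features() performs can be pictured in base R: for each query m/z, find the nearest value on the m/z axis and keep it only if it lies within the tolerance. This is an illustrative analogue, not Cardinal's implementation, and the function name is made up.

```r
# Illustrative base-R analogue of matching query m/z values against an
# m/z axis within an absolute tolerance (not Cardinal's actual code).
match_mz <- function(query, axis, tolerance = 0.001) {
  idx <- vapply(query, function(q) {
    i <- which.min(abs(axis - q))            # nearest axis point
    if (abs(axis[i] - q) <= tolerance) i else NA_integer_
  }, integer(1))
  idx[!is.na(idx)]                           # drop queries with no close match
}

axis <- c(1141.5556, 1150.0000, 1570.6786)
match_mz(c(1141.5556, 1999.0), axis)         # second query finds no match
```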
rounded_mz_of_ROI_segmentation_ions <- c(1141.5556, 1142.524, 1570.6786,
1571.6801, 1572.6808, 1592.6699,
1593.6739, 1594.6766, 1608.6582)
indices_of_ROI_segmentation_ions <- features(mse,
mz = rounded_mz_of_ROI_segmentation_ions,
tolerance = .001,
units = "mz")
# If following the vignettes exactly, the above step should yield two indices
if (length(indices_of_ROI_segmentation_ions) >= 1) {
  # workaround, as mse[-indices_to_drop] is not supported
  mse <- mse[setdiff(1:nrow(mse), indices_of_ROI_segmentation_ions), ]
}
We will be filtering by mean and standard deviation. We assume that noisy, non-informative features have low variation and low intensity. For simplicity, we will drop features whose sample-wide mean intensity and standard deviation both fall below their 25th percentiles in non-background pixels.
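On toy data the joint filter looks like this: a feature is dropped only when both its mean and its standard deviation fall below their 25th percentiles, so far fewer than 25% of features are removed (values below are random stand-ins, not real summaries).

```r
# Toy illustration of the joint mean/sd filter: drop a feature only when
# BOTH statistics fall below their own 25th percentiles.
set.seed(1)
m <- runif(100)    # stand-in per-feature means
s <- runif(100)    # stand-in per-feature standard deviations
keep <- !(m < quantile(m, 0.25) & s < quantile(s, 0.25))
mean(keep)         # most features survive; only the joint-low corner is dropped
```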
First, we will drop background pixels.
mse_no_background <- subsetPixels(mse, tissue_zone != "background")
Now we can compute the mean and standard deviation of each feature across all samples.
mse_no_background <- summarizeFeatures(mse_no_background,
stat = c("mean", "sd", "nnzero"),
na.rm = TRUE,
BPPARAM = SnowfastParam(workers = 12L,
force.GC = TRUE),
verbose = F)
Now we can identify features to drop.
mz_to_keep <- mz(subsetFeatures(mse_no_background,
!(sd < quantile(sd, p = 0.25) &
mean < quantile(mean, p = 0.25))))
Let’s visualize the filter.
as_tibble(fData(mse_no_background)) |>
  mutate(keep = !(mean < quantile(mean, p = 0.25) & sd < quantile(sd, p = 0.25))) |>
  ggplot(aes(x = mean, y = sd, color = keep)) +
  geom_point(alpha = 0.4)
Now apply the filter. The new MSE will have 684 features.
mse <- subsetFeatures(mse, mz %in% mz_to_keep)
Let’s save this for later.
writeMSIDataSafe(mse, file.path("Data","Filtering-Aggregation","Preprocessed-NSFiltered.imzML"),
verbose = F)
Clustering of related features is an active field of research with many implementations. A clustering algorithm should be chosen based on what you want to cluster (related molecules, spatial analogs, isotopes, adducts, precursors, etc.) and how you want to cluster it (spatial similarity, Kendrick mass defect, intensity ratios, etc.). As such, we treat feature clustering as an interchangeable step that takes in MSIs and returns meaningful feature clusters to the experiment.
As of version 3.6, Cardinal does not offer feature clustering methods. In this work we have used DeepION (ACS Paper, Github) to cluster isotopes. We have also explored mzClustering (Bioinformatics Paper, Original Github, Github of install-ready version, made possible by Nihira Golasangi). At the time of publication, these methods were experimentally tested but lacked easy-to-install implementations. mzClustering has been updated by Nihira Golasangi to be install-ready from Github. DeepION may require more work to reconcile package conflicts, especially if using Apple M series chipsets. Because these methods exist external to this workflow and are interchangeable with other clustering methodologies, we will prepare the input files for both methods and provide the output of DeepION for clustering.
Both methods use spatial similarity to cluster features. mzClustering creates groups of similar features, while DeepION tests the similarity of possible isotopes, as determined by mass differences. To adapt these methods to multi-tissue clustering, we must force them to cluster all the samples together. As all samples share the same m/z axis, we can achieve this by stitching together the samples’ intensity matrices. Both of these methods are computationally intensive, especially when used to cluster multiple tissues in unison. Additionally, they were designed to run on Nvidia GPUs (using CUDA) and must be adapted to run on the Apple M series Metal GPU framework.
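The stitching itself is just column-binding: because every sample shares the same m/z axis, each per-sample intensity matrix has identical rows (features), and binding columns concatenates their pixels. A minimal sketch with made-up values:

```r
# Two samples sharing the same m/z axis (rows = features) are stitched
# pixel-wise by binding columns.
sample_a <- matrix(1:6,  nrow = 3)    # 3 features x 2 pixels
sample_b <- matrix(7:15, nrow = 3)    # 3 features x 3 pixels
stitched <- cbind(sample_a, sample_b)
dim(stitched)                         # 3 features x 5 pixels
```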
First, extract the intensity matrix of each sample, and bind their columns. Both algorithms take a matrix of features and pixels as input - mzClustering requires features \(\times\) pixels while DeepION requires the transpose (pixels \(\times\) features).
k <- as.matrix(intensity(mse))
Each row of a MSImagingExperiment intensity matrix represents a feature, and each column a pixel. Matrix values are intensities. Keep in mind that MSImagingExperiment objects usually represent irregularly-shaped samples rastered into a coordinate system, and Cardinal only stores the intensities that are actual sample values. This presents an issue: both DeepION and mzClustering require rectangular images.
We could use sliceImage to get a 3D array of \(x \times y \times n_{features}\) - a cube with rectangular slices (features), where the non-sample background of each rectangular slice is NA - but it would make the images much larger (artificially, mostly NAs) and require more memory.
Instead, we will use the fact that, along with a 2D intensity matrix, DeepION and mzClustering take the dimensions of the image as a parameter. Using those dimensions, the algorithms extract the intensity vector of each feature (each row of the intensity matrix, a set of pixel intensities in the order \(P_{x=1,y=1}, P_{x=2,y=1}, \ldots, P_{x=dim(x),y=dim(y)}\)) and reshape the vector using the supplied dimensions. Therefore we shall zero-pad the bound intensity matrix so that each row can be reshaped into a square image. This disrupts the spatial relationship between pixels - when each feature image is reshaped, the new dimensions differ from the originals and the padding is not real sample background - but the spatial relationships between features are preserved.
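A toy example of the pad-and-reshape step, using made-up intensities: one feature's pixel vector is padded with zeros to the next perfect square and then refolded into a square "image". Only the last cells of the square are padding, so the same positions are padded for every feature.

```r
# One feature's intensities across 7 pixels, zero-padded to the next
# perfect square (9 = 3^2), then reshaped into a 3 x 3 "image".
v   <- c(5, 3, 8, 2, 9, 1, 4)
d   <- ceiling(sqrt(length(v)))           # side length of the square
pad <- c(v, rep(0, d^2 - length(v)))      # append padding zeros
M   <- matrix(pad, nrow = d, ncol = d)    # R fills column-wise
M
```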
# start with the length (# of columns) of the bound intensity matrix
len <- dim(k)[2]
# find the nearest square
new_col <- ceiling(sqrt(len))^2
# find how many more columns must be added to reach that square
delta <- new_col - len
# tack on a matrix of 0's so that the total number of columns is a perfect square
new_k <- cbind(k, matrix(0, nrow = nrow(k), ncol = delta))
In a twist of irony, DeepION and mzClustering cannot handle NA values. As MSIs experience intensity-dependent missingness, we will zero-fill NA values.
new_k <- replace(new_k, is.na(new_k), 0)
new_k <- as.data.frame(new_k)
The value below is the new dimension for the images.
new_dimension <- ceiling(sqrt(len))
print(paste("Square Image Dimension:",new_dimension))
## [1] "Square Image Dimension: 708"
Write the intensity matrices for the algorithms. Both algorithms take the file name as an argument. DeepION also requires a vector of m/z values to compute possible isotope candidates. This will take a moment, as these files are large.
write.table(new_k,
file=file.path("Outputs",paste0("super_image_",new_dimension,"_for_mzClustering.txt")),
row.names = F,
col.names = F)
write.table(t(new_k),
file=file.path("Outputs",paste0("super_image_",new_dimension,"_for_DeepION.txt")),
row.names = F,
col.names = F)
# for DeepION algo (needs the image matrix transposed and a vector of the MZ values)
write.table(mz(mse),
file=file.path("Outputs","mz_vector_for_DeepION.txt"),
row.names = F,
col.names = F)
# remove these large objects so they aren't taking up valuable memory
rm(k, new_k)
We have supplied the clustering output of DeepION for a specific stitched-image dimension. Each row in DeepION’s output represents an isotope grouping. The process below works for any clustering method, not just DeepION, whose output has a similar shape, though it may require some adaptation.
mse <- readMSIData(file.path("Data","Filtering-Aggregation","Preprocessed-NSFiltered.imzML")) |>
addProcessing(zero_to_na, label="zero_to_na_before_filtering") |>
process(spectra="intensity", index="mz",
BPPARAM = SnowfastParam(workers = 12L, force.GC = TRUE),
verbose = F)
deepion_iso_output_file <- file.path("Outputs", "output_super_image_708.txt")
iso <- readr::read_csv(deepion_iso_output_file, show_col_types = F) |>
mutate(group_id = row_number())
knitr::kable(head(iso))
| monoisotope | isotope_1 | isotope_2 | isotope_type_1 | isotope_type_2 | charge_1 | charge_2 | group_id |
|---|---|---|---|---|---|---|---|
| 1034.487 | 0.000 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1034.573 | 1035.577 | 0 | 1 | 0 | 1 | 0 | 2 |
| 1068.506 | 0.000 | 0 | 0 | 0 | 0 | 0 | 3 |
| 1116.515 | 0.000 | 0 | 0 | 0 | 0 | 0 | 4 |
| 1118.485 | 1119.487 | 0 | 1 | 0 | 1 | 0 | 5 |
| 1120.512 | 0.000 | 0 | 0 | 0 | 0 | 0 | 6 |
Reshape so that each row is a feature.
iso <- iso |>
  pivot_longer(monoisotope:isotope_2,
               names_to = "isotope",
               names_prefix = "isotope_",
               values_to = "mz") |>
  filter(mz > 0) |>
  dplyr::select(mz, group_id)
knitr::kable(head(iso))
| mz | group_id |
|---|---|
| 1034.487 | 1 |
| 1034.573 | 2 |
| 1035.577 | 2 |
| 1068.506 | 3 |
| 1116.515 | 4 |
| 1118.485 | 5 |
Unfortunately, DeepION does not preserve all decimal places of the original m/z values, so we will need a fuzzy merge to recover the original values for Cardinal.
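The idea behind the fuzzy merge can be sketched in base R: pair each truncated DeepION m/z with any full-precision Cardinal m/z within max_dist. This is an illustrative stand-in for fuzzyjoin's difference join, with made-up values and function name.

```r
# Base-R sketch of a tolerance ("fuzzy") merge: pair truncated m/z values
# with full-precision m/z values that lie within max_dist of them.
fuzzy_pairs <- function(truncated, full, max_dist = 1e-4) {
  hits <- lapply(seq_along(full), function(j) {
    i <- which(abs(truncated - full[j]) <= max_dist)
    if (length(i)) data.frame(trunc_mz = truncated[i], mz = full[j]) else NULL
  })
  do.call(rbind, hits)
}

pairs <- fuzzy_pairs(c(1034.487, 1035.577),            # truncated values
                     c(1034.48703, 1035.57696, 1200.1)) # full-precision axis
pairs
```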
iso_merged <- iso |>
difference_right_join(tibble(mz = as.numeric(mz(mse)),
feature_id = 1:nrow(fData(mse))),
by = 'mz',
max_dist = 0.0001,
distance_col = "mz_delta") |>
dplyr::select(group_id, feature_id, `mz.y`) |>
dplyr::rename("mz" = `mz.y` )
knitr::kable(head(iso_merged))
| group_id | feature_id | mz |
|---|---|---|
| 1 | 1 | 1034.487 |
| 2 | 2 | 1034.573 |
| 2 | 3 | 1035.577 |
| 3 | 4 | 1068.506 |
| 4 | 5 | 1116.515 |
| 5 | 6 | 1118.485 |
We will aggregate features by selecting the feature with the highest mean intensity to serve as each isotope group’s representative feature.
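The selection rule can be shown on a toy table: within each group_id, keep the row with the largest mean. The values below are illustrative, not from the real iso_merged.

```r
# Toy version of picking one representative feature per isotope group:
# within each group_id, keep the m/z with the largest mean intensity.
toy <- data.frame(group_id = c(1, 2, 2, 3),
                  mz       = c(1034.487, 1034.573, 1035.577, 1068.506),
                  mean     = c(10, 50, 8, 7))
reps <- do.call(rbind, lapply(split(toy, toy$group_id),
                              function(g) g[which.max(g$mean), ]))
reps$mz    # one representative m/z per group
```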
mse_no_background_filtered <- mse |>
  subsetPixels(tissue_zone != "background") |>
  subsetFeatures(mz %in% iso_merged$mz) |>
  summarizeFeatures(stat = "mean", na.rm = T)
iso_merged$mean <- as.vector(fData(mse_no_background_filtered)$mean)
Now let’s drop low abundance isotopes, keeping only the feature with the highest mean in each group.
# sanity check
all(mz(mse) == iso_merged$mz)
## [1] TRUE
isotopes_to_keep <- iso_merged |>
  group_by(group_id) |>
  filter(mean == max(mean)) |>
  ungroup() |>
  pull(mz)
mse <- subsetFeatures(mse, mz %in% isotopes_to_keep)
Save this intermediate.
writeMSIDataSafe(mse, file.path("Data","Filtering-Aggregation","Preprocessed-NSFiltered-Aggregated.imzML"),
verbose = F)
Like during initial peak picking, we will apply a filter that will save us a lot of trouble later during statistical modeling. Highly sparse features create unstable estimates of mean abundance. The models detailed in this work assume normal distributions of residuals. Zero inflation will cause serious deviation from this assumption, yielding poor fit and invalid results. To keep things simple, we will drop features that are very sparse in the cartilage ROI. This represents a conservative modeling approach and will drastically reduce the number of features we continue analysis with. Alternate options include using models that are invariant to zero inflation.
filterFeaturesByRunFreq <- function(mse, cutoff = 0.01,
run_col = "run",
spectra_name = "intensity"){
pd <- pixelData(mse)
run <- pd[[run_col]]
if (is.null(run)) stop("pixelData(mse)$", run_col, " is NULL")
run <- as.factor(run)
X <- spectra(mse, spectra_name)
keep <- rep_len(TRUE, nrow(mse))
for (lv in levels(run)) {
cols <- which(run == lv)
if (length(cols) == 0L) next
Xr <- X[, cols, drop = FALSE]
# present iff finite AND non-zero
present_counts <- rowSums(is.finite(Xr) & (Xr != 0))
freq <- present_counts / length(cols)
keep <- keep & (freq >= cutoff)
}
mse[keep, , drop = FALSE]
}
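The per-run loop in the function above behaves like this toy check (made-up values): a feature is kept only if its non-zero, finite frequency meets the cutoff in every run, and NAs never count as present.

```r
# Toy run of the per-run presence logic: keep a feature only if its
# non-zero, finite frequency meets the cutoff in EVERY run.
X   <- rbind(c(1, 0, 2,  5, 6),   # feature 1: present enough in both runs
             c(0, 0, NA, 3, 4))   # feature 2: absent in run A
run <- factor(c("A", "A", "A", "B", "B"))
keep <- rep(TRUE, nrow(X))
for (lv in levels(run)) {
  Xr   <- X[, run == lv, drop = FALSE]
  freq <- rowSums(is.finite(Xr) & Xr != 0) / ncol(Xr)  # NA counts as absent
  keep <- keep & (freq >= 0.25)
}
keep    # feature 2 fails the 25% cutoff in run A
```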
Drop features that are present (non-zero and finite) in fewer than 25% of pixels in any run.
mse_cartilage <- subsetPixels(mse, tissue_zone == "cartilage")
mse_cartilage <- summarizeFeatures(mse_cartilage,
stat = c("mean", "sd", "nnzero"),
na.rm = TRUE,
groups = mse_cartilage$run,
verbose = F)
mse_cartilage_filt <- filterFeaturesByRunFreq(mse_cartilage, 0.25)
Save these as .imzML files. Onward to modeling!
mse <- subsetFeatures(mse, mz %in% mz(mse_cartilage_filt))
writeMSIDataSafe(mse, file.path("Data","Filtering-Aggregation","Preprocessed-NSFiltered-Aggregated-CartilageFiltered25.imzML"),
verbose = F)
mse_pp <- readMSIData(file.path("Data", "Preprocessing", "Smoothed-BLR-Recalibrated-Picked-Aligned-FreqFiltered05perRun.imzML"),
verbose = F)
cat("After Peak Picking ",length(mz(mse_pp)), "\n")
bg_mz_approx <- c(1570.6786, 1571.6801, 1572.6808, 1592.6699, 1593.6739, 1594.6766)
cat("After dropping internal standard ions [-6] ", length(mz(mse_pp)) - length(bg_mz_approx), "\n")
mse_pre_filt <- readMSIData(COMBINED_DATA_PATH)
other_ions_in_roi_selection <- c(1141.5556, 1142.524, 1608.6582)
cat("After dropping other ions used in ROI selection [-3] and before NS filtering ",
length(mz(mse_pre_filt)) - length(other_ions_in_roi_selection),"\n")
mse_ns_filt <- readMSIData(file.path("Data","Filtering-Aggregation","Preprocessed-NSFiltered.imzML"),
verbose = F)
cat("After NS filtering ",length(mz(mse_ns_filt)),"\n")
mse_deiso <- readMSIData(file.path("Data","Filtering-Aggregation","Preprocessed-NSFiltered-Aggregated.imzML"),
verbose = F)
cat("After deiso ",length(mz(mse_deiso)),"\n")
mse_cart_filtered <- readMSIData(file.path("Data","Filtering-Aggregation","Preprocessed-NSFiltered-Aggregated-CartilageFiltered25.imzML"),
verbose = F)
cat("After cartilage sparsity filtering ", length(mz(mse_cart_filtered)), "\n")
cat("Table S3\n")
cat("Approximate m/z of preselected features used to segment cartilage (and isotopes/spatial analogs): ",
paste(round(c(1141.5556, 1142.524),3), collapse = ", "), "\n")
cat("Approximate m/z of preselected features used to segment background (and isotopes/spatial analogs): ",
paste(round(c(1570.6786, 1571.6801, 1572.6808, 1592.6699,1593.6739, 1594.6766, 1608.6582),3), collapse = ", "), "\n")
cat("Approximate m/z of features removed by non-specific filtering: ",
paste(setdiff(round(setdiff(mz(mse_pre_filt), mz(mse_ns_filt)),3), round(other_ions_in_roi_selection, 3)), collapse = ", "), "\n")
cat("Table S4\n")
cat("Approximate single ion m/z features: ",
paste(round(iso_merged |> group_by(group_id) |> filter(n() == 1) |> pull(mz),3), collapse = ", "), "\n")
cat("Approximate m/z features part of 2-member isotope groups: ",
paste(round(iso_merged |> group_by(group_id) |> filter(n() == 2) |> pull(mz),3), collapse = ", "), "\n")
cat("Approximate m/z features part of 3-member isotope groups: ",
paste(round(iso_merged |> group_by(group_id) |> filter(n() == 3) |> pull(mz),3), collapse = ", "), "\n")
cat("Approximate m/z features dropped during Cartilage ROI sparsity filtering: ",
paste(round(setdiff(mz(mse_deiso), mz(mse_cart_filtered)),3), collapse = ", "), "\n")
knitr::kable(
iso_merged |>
group_by(group_id) |>
mutate(`Isotopes in a group` = n()) |>
ungroup() |>
group_by(`Isotopes in a group`) |>
summarize(
`Count of ions` = n(),
`Count of groups` = n_distinct(group_id),
.groups = "drop"
)
)
m <- subsetPixels(mse_ns_filt, mse_ns_filt$run == "6891-OA-lateral")
jl_3 <- sort(unique(iso_merged |> group_by(group_id) |>
mutate(n = n()) |> ungroup() |> filter(n == 3) |> pull(group_id)))
id_1 <- iso_merged |> filter(group_id %in% c(258)) |> pull(feature_id)
id_2 <- iso_merged |> filter(group_id %in% c(312)) |> pull(feature_id)
id_3 <- iso_merged |> filter(group_id %in% c(566)) |> pull(feature_id)
png(filename = file.path("Recreated-Figures","Figure-clusters-3-isotope-sii.png"), width = 1500, height = 900, pointsize = 54)
image(m, i = c(id_1[1], id_2[1], id_3[1], id_1[2], id_2[2], id_3[2],id_1[3], id_2[3], id_3[3]),
scale = F,
superpose = F,
layout = c(3,3),
key = F,
grid = F,
xaxt = "n",
yaxt = "n",
ylab = NA,
xlab = NA)
dev.off()
######### 38, 41, 45, 50, 65
jl <- sort(unique(iso_merged |> group_by(group_id) |> mutate(n = n()) |> ungroup() |> filter(n == 2) |> pull(group_id)))
id_1 <- iso_merged |> filter(group_id == jl[3]) |> pull(feature_id)
id_2 <- iso_merged |> filter(group_id == jl[6]) |> pull(feature_id)
id_3 <- iso_merged |> filter(group_id == jl[11]) |> pull(feature_id)
png(filename = file.path("Recreated-Figures","Figure-clusters-2-isotope-sii.png"), width = 1500, height = 700, pointsize = 54)
image(m, i = c(id_1[1], id_2[1], id_3[1], id_1[2], id_2[2], id_3[2]),
scale = F,
superpose = F,
layout = c(2,3),
key = F,
grid = F,
xaxt = "n",
yaxt = "n",
ylab = NA,
xlab = NA)
dev.off()
paste0(iso_merged |>
group_by(group_id) |>
mutate(n = n()) |>
ungroup() |> filter(n == 1) |> mutate(mzr = round(mz, 4)) |> pull(mzr),
collapse = ", ")
paste0(iso_merged |>
group_by(group_id) |>
mutate(n = n()) |>
ungroup() |> filter(n == 2) |> mutate(mzr = round(mz, 4)) |> pull(mzr),
collapse = ", ")
paste0(iso_merged |>
group_by(group_id) |>
mutate(n = n()) |>
ungroup() |> filter(n == 3) |> mutate(mzr = round(mz, 4)) |> pull(mzr),
collapse = ", ")