Statistical modeling formalizes our assumptions about the data
mathematically and lets us estimate parameters, such as variances and
means, that we can later use for statistical inference. We will use
linear mixed effects models, as they are the simplest models that can
capture the relationships and sources of variation in our experimental
design. Once fit, we will assess the models for violations of their
assumptions. Once we have verified that our models fit the data well, we
can use them for statistical inference, which we will do in the next
vignette, 4-Statistical-Inference.
library(tidyverse)
library(ggrepel)
library(ggpubr)
library(data.table)
library(lmerTest)
library(emmeans)
library(Cardinal)
library(broom.mixed)
source("Utilities.R")
zero_to_na <- function(x, mz) {
x[x == 0] <- NA_real_
x
}
bp_param <- function(workers) {
if (interactive()) {
SnowfastParam(workers = as.integer(workers), force.GC = TRUE)
} else {
NULL
}
}
setCardinalVerbose(verbose = FALSE)
gcinfo(verbose = FALSE)
DATA_PATH <- file.path("Data","Filtering-Aggregation","Preprocessed-NSFiltered-Aggregated-CartilageFiltered25.imzML")
mse <- readMSIData(DATA_PATH)
mse <- mse |>
addProcessing(zero_to_na, label="zero_to_na_before_modeling") |>
process(spectra="intensity", index="mz", BPPARAM = bp_param(3L))
Models are specified based on the experimental design. In our case,
we have two fixed effects, condition (Osteoarthritis vs Control) and
tissue (Medial vs Lateral), and one random effect, subject. We will also
include the interaction between condition and tissue. Modeling can be
done within or outside of Cardinal. We will demonstrate two methods:
using Cardinal::meansTest and using lmer
directly.
Cardinal offers statistical modeling via meansTest.
Under the hood, meansTest uses nlme::lme or
lme4::lmer to fit models with random effects and
stats::lm for models without. meansTest
requires a few steps to prepare the dataset for modeling.
To confine statistical testing to the region of interest, we will subset the pixels to only those in the cartilage. This is the longest step and may take a couple of minutes.
pData(mse)$subject <- as.factor(pData(mse)$subject)
pData(mse)$tissue <- as.factor(pData(mse)$tissue)
pData(mse)$condition <- as.factor(pData(mse)$condition)
pData(mse)$tissue_zone <- as.factor(pData(mse)$tissue_zone)
pData(mse)$run <- as.factor(pData(mse)$run)
mse <- subsetPixels(mse, tissue_zone == "cartilage")
We are ready to fit the models. The fixed effects are specified in
the second argument. condition * tissue is shorthand for
condition + tissue + condition:tissue, which includes the
main effects and their interaction. The random effects are specified in
the third argument. ~ 1|subject indicates that we want a
random intercept for each subject. Specification of random effects for
lmer can be found here.
For compatibility with previous versions of Cardinal and
nlme::lme(), the fixed and random effects are separated.
The samples argument specifies the variable that delineates
which pixels are included in each sample or experimental unit.
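For readers less familiar with lme4 formula syntax, here is a minimal sketch of how meansTest's two separate formula arguments correspond to a single combined lmer formula. The data are simulated toy data (with condition crossed within subject for simplicity, unlike the real between-subject design); only the formula correspondence is the point.

```r
# Hypothetical illustration: meansTest(mse, ~ condition * tissue, ~ 1 | subject)
# corresponds to a single combined lme4 formula. Data below are simulated.
library(lme4)

set.seed(1)
toy <- expand.grid(subject   = factor(1:8),
                   condition = factor(c("Control", "Osteoarthritis")),
                   tissue    = factor(c("Lateral", "Medial")))
toy$intensity <- rnorm(nrow(toy)) + rep(rnorm(8, sd = 0.5), times = 4)

# Fixed effects (~ condition * tissue) and random intercept (~ 1 | subject)
# combined into one lmer formula:
fit <- lmer(intensity ~ condition * tissue + (1 | subject), data = toy)
fixef(fit)  # intercept, two main effects, and the interaction term
```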
For mixed effects models, we can run meansTest with the default
implementation, nlme::lme(), or select
lme4::lmer() by passing use_lmer=TRUE in newer
Cardinal releases. For basic random effects structures, the two
implementations are equivalent. lme4::lmer(), extended via
lmerTest, enables the Satterthwaite approximation for
calculating degrees of freedom. Because we have opted to replace zero
values with NAs, we must pass na.rm = TRUE to ensure NA
values are dropped during summarization. meansTest
summarizes all pixels in the object; to restrict analysis to the
cartilage ROI, we used subsetPixels to eliminate
non-cartilage pixels.
nlme::lme() in Cardinal
nlme_means_test <- meansTest(mse, ~ condition * tissue, ~ 1|subject, samples = run(mse), na.rm = TRUE)
nlme_means_test
## MeansTest of length 124
## model: lme
## DataFrame with 10 rows and 6 columns
## i mz fixed random statistic pvalue
## <integer> <numeric> <character> <character> <numeric> <numeric>
## 1 1 1126.52 intensity ~ conditio.. ~1 | subject 0.711532 0.870488
## 2 2 1143.53 intensity ~ conditio.. ~1 | subject 0.894013 0.826872
## 3 3 1302.63 intensity ~ conditio.. ~1 | subject 5.194768 0.158078
## 4 4 1304.61 intensity ~ conditio.. ~1 | subject 3.057369 0.382864
## 5 5 1325.60 intensity ~ conditio.. ~1 | subject 0.599056 0.896648
## 6 6 1343.65 intensity ~ conditio.. ~1 | subject 3.927915 0.269354
## 7 7 1366.62 intensity ~ conditio.. ~1 | subject 2.447975 0.484767
## 8 8 1367.62 intensity ~ conditio.. ~1 | subject 1.370194 0.712536
## 9 9 1381.63 intensity ~ conditio.. ~1 | subject 1.482458 0.686325
## 10 10 1388.60 intensity ~ conditio.. ~1 | subject 2.039564 0.564236
## ... and 114 more results
Note the P-value column. This is the P value of a likelihood-ratio
test (LRT), which assesses goodness of fit by comparing our model to a
reduced model. When no reduced model is given and random effects are
specified, the reduced model defaults to one with only the random
effects. The P value indicates whether the full model explains the
variation significantly better than the reduced model. In this context
the LRT is mostly a diagnostic test of model fit. When specified
correctly, LRTs can measure the variance-explaining capacity of a
specific term in the model, but they do not tell us about that term's
relevance to the outcome of interest. To get that information, we must
run contrasts, as we will do in the next vignette,
4-Statistical-Inference.
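The LRT that meansTest performs can be reproduced by hand with anova(), which refits both lmer models with ML before comparing them. A minimal sketch on simulated data (the variable names are illustrative, not taken from our dataset):

```r
library(lme4)

set.seed(1)
toy <- data.frame(subject   = factor(rep(1:8, each = 4)),
                  condition = factor(rep(c("Control", "OA"), each = 16)),
                  tissue    = factor(rep(c("Lateral", "Medial"), times = 16)),
                  intensity = rnorm(32))

full    <- lmer(intensity ~ condition * tissue + (1 | subject), data = toy)
reduced <- lmer(intensity ~ 1 + (1 | subject), data = toy)

# anova() on two lmer fits refits with ML and runs the likelihood-ratio test
anova(full, reduced)
```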
lme4::lmer() in Cardinal
lmer_means_test <- meansTest(
mse, ~ condition * tissue, ~ 1 | subject,
samples = run(mse),
na.rm = TRUE,
use_lmer = TRUE)
lmer_means_test
## MeansTest of length 124
## model: lmerModLmerTest
## DataFrame with 10 rows and 7 columns
## i mz fixed random statistic pvalue
## <integer> <numeric> <character> <character> <numeric> <numeric>
## 1 1 1126.52 intensity ~ conditio.. ~1 | subject NA NA
## 2 2 1143.53 intensity ~ conditio.. ~1 | subject NA NA
## 3 3 1302.63 intensity ~ conditio.. ~1 | subject NA NA
## 4 4 1304.61 intensity ~ conditio.. ~1 | subject NA NA
## 5 5 1325.60 intensity ~ conditio.. ~1 | subject NA NA
## 6 6 1343.65 intensity ~ conditio.. ~1 | subject NA NA
## 7 7 1366.62 intensity ~ conditio.. ~1 | subject NA NA
## 8 8 1367.62 intensity ~ conditio.. ~1 | subject NA NA
## 9 9 1381.63 intensity ~ conditio.. ~1 | subject NA NA
## 10 10 1388.60 intensity ~ conditio.. ~1 | subject NA NA
## singular
## <logical>
## 1 TRUE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 TRUE
## 6 FALSE
## 7 FALSE
## 8 TRUE
## 9 TRUE
## 10 FALSE
## ... and 114 more results
Note the singular column - Cardinal automatically
reports singularity for lme4::lmer() models. Models can be
singular for many reasons; in our case, this is likely due to low
biological variance relative to technical variance. More on this later.
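Outside Cardinal, the same check is available directly via lme4::isSingular(). A sketch on simulated data with no subject-level signal, where the random-intercept variance is likely to hit the zero boundary:

```r
library(lme4)

set.seed(2)
toy <- data.frame(subject = factor(rep(1:8, each = 4)),
                  y       = rnorm(32))  # pure noise: no between-subject signal

fit <- lmer(y ~ 1 + (1 | subject), data = toy)
isSingular(fit)  # often TRUE here: the subject variance is estimated near zero
VarCorr(fit)     # inspect the (near-zero) random-intercept standard deviation
```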
If we print the results, we can see information about the models. In
Cardinal, models fit with lme4::lmer() (via
use_lmer=TRUE) use restricted maximum likelihood
estimation. Means testing with nlme::lme() in Cardinal uses
maximum likelihood estimation to be compatible with LRTs.
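Outside Cardinal the REML/ML distinction is easy to see: lmer() fits with REML by default, and refitML() switches a fitted model to ML, which is what LRT-based comparisons require. A minimal sketch on simulated data:

```r
library(lme4)

set.seed(3)
toy <- data.frame(subject = factor(rep(1:8, each = 2)),
                  y       = rnorm(16))

fit_reml <- lmer(y ~ 1 + (1 | subject), data = toy)  # REML by default
fit_ml   <- refitML(fit_reml)                        # refit with ML

c(REML = isREML(fit_reml), ML = isREML(fit_ml))  # TRUE, FALSE
```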
Note: meansTest does not support LRTs for
lme4::lmer() models, so the statistic and
pvalue columns in the output will be NA.
Let’s save the combined image and lme4::lmer() results
for later.
save(lmer_means_test, file = file.path("Outputs", "meansTest_cartilage_lmer.RData"))
lmer
Alternatively, we can extract the data from the
MSImagingExperiment objects and run the models ourselves.
This allows for greater flexibility than using
Cardinal::meansTest.
First, let us create a helper function to run the models and extract the results tidily. In addition, we will calculate the intraclass correlation coefficient (ICC) for each model: the proportion of total variance attributable to between-subject (random-intercept) variance, with greater values (> 0.5) indicating more between-subject variance than residual variance. We will also extract whether the model is singular, along with its convergence code. Our function will return three objects: a list of models, a wide tibble of model results, and a long tibble of model results that is better suited for plotting.
# Helper function to model and catch errors
model_and_extract <- function(mse_frame, form) {
models <- list()
for (f in unique(mse_frame$feature_id)) {
data <- mse_frame |> filter(feature_id == f)
print(paste0("Running model for: ", f))
models[[as.character(f)]] <- lmer(form, data = data)
}
model_results <- bind_rows(lapply(models, FUN = function(m){
tidy(m, "fixed") |> mutate(model_id = "means",
isSingular = isSingular(m),
convergenceCode = get_convergence_code(m),
icc = as.data.frame(VarCorr(m))$vcov[1]/sum(as.data.frame(VarCorr(m))$vcov),
total_var = sum(as.data.frame(VarCorr(m))$vcov)) |>
pivot_wider(id_cols = c(isSingular, convergenceCode, icc, total_var),
names_from = term,
values_from = c(estimate, std.error, statistic, df, p.value))
}), .id = "feature_id")
# good for plotting
model_results_long <- model_results |>
pivot_longer(
cols = starts_with("estimate_") | starts_with("std.error_") | starts_with("statistic_") | starts_with("df_") | starts_with("p.value_"),
names_to = c("model_item", "parameter"),
names_pattern = "(.*)_(.*)",
values_to = "value")
return(list(models = models,
model_results = model_results,
model_results_long = model_results_long))
}
Now let’s summarize the pixel means per sample by region of interest (ROI).
# Summarize pixel means per sample by ROI
mse <- summarizeFeatures(mse,
    stat = c("mean", "nnzero"),
    na.rm = TRUE,
    verbose = FALSE)
mse <- summarizeFeatures(mse,
    stat = c("mean", "nnzero"),
    groups = mse$run,
    na.rm = TRUE,
    verbose = FALSE)
Next, we will extract the feature data from the MSE objects (roughly,
a data frame of each sample’s mean feature intensity by ROI) and add the
experimental variables to create a modeling data frame. It is important
to ensure that the sample IDs in the data frame match those in the
MSImagingExperiment object, and that the experimental
variables are factors. The models will still run if the variables are
numeric, but they will be treated as continuous rather than categorical
variables, which changes their interpretation.
# Extract means for modeling and specify experimental vars
mse_frame <- as_tibble(fData(mse)) |>
mutate(feature_id = row_number()) |>
pivot_longer(
X6891.OA.lateral.mean:X6954.OA.medial.mean,
names_to = "run",
values_to = "cartilage.mean",
names_pattern = "X(.+)\\.mean$",
names_transform = list(run = ~ str_replace_all(.x, "\\.", "-"))
) |>
left_join(name_tibble, by = join_by(run)) |>
mutate_at(vars(feature_id, subject, condition, tissue), factor) |>
arrange(feature_id, condition, tissue, subject) |>
select(feature_id, mz, run, cartilage.mean, subject, condition, tissue)
Now we can run the models, using the same formula as before.
Cardinal::meansTest takes fixed and random effects as
separate arguments, while lme4::lmer combines them into a
single formula. We will also save the models and results. We can access
a wide table of results with cartilage_res[[2]] and a list
of models with cartilage_res[[1]].
cartilage_res <- model_and_extract(mse_frame, "cartilage.mean ~ condition * tissue + (1|subject)")
save(cartilage_res, file = file.path("Outputs", "cartilage_results.RData"))
rmarkdown::paged_table(head(cartilage_res[[2]]))
As above, many of the models are singular. Let’s explore the ICC values of these models.
knitr::kable(cartilage_res[[2]] |>
summarize(`% models with ICC at or near zero` = sum(icc <= 0.0001)/n()*100, .by = isSingular))
| isSingular | % models with ICC at or near zero |
|---|---|
| TRUE | 100 |
| FALSE | 0 |
As we can see, most of the singular models have ICC values near zero
(less than 0.0001), indicating that the models are not able to
(meaningfully) partition out within-subject variance from residual
variance. Aside from the theoretical issues implied by singularity
(overly complex random effects structure, low biological variance
relative to technical variance, etc.), singularity also interferes with
calculation of degrees of freedom via the Satterthwaite approximation.
This method uses mean squares in the approximation, and when the model
is singular, the mean squares for the random effect will be zero.
lmer is aware of this problem and simply omits the random
effect if singularity is detected, making it equivalent to a fixed
effects model. Because we have the simplest random effects structure
already, this solution is acceptable. However, for more complex random
effects structures, one would instead iteratively remove random effect
components until the model is no longer singular.
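A sketch of that fallback on simulated data: fit the mixed model, and if it is singular, drop the random effect and refit as a plain fixed effects model.

```r
library(lme4)

set.seed(5)
toy <- data.frame(subject   = factor(rep(1:8, each = 2)),
                  condition = factor(rep(c("Control", "OA"), each = 8)),
                  y         = rnorm(16))

fit <- lmer(y ~ condition + (1 | subject), data = toy)
if (isSingular(fit)) {
  # simplest random effects structure already: fall back to fixed effects only
  fit <- lm(y ~ condition, data = toy)
}
```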
Linear mixed effects models make several assumptions that must be checked to ensure statistical validity. There are many ways to express these assumptions, but in general they are that random effects and residuals are independent, are Normally distributed, and have constant variance. We can assess these assumptions with residual quantile-quantile (QQ) plots.
Simple QQ plots can be generated with the following base R functions.
qqnorm(resid(cartilage_res[[1]][[1]]), main = "Q-Q Plot of Residuals")
qqline(resid(cartilage_res[[1]][[1]]), col = "red")
We will generate more informative QQ plots, as well as some other diagnostic plots. This code creates a tibble of residuals and fitted values from each model.
qq_plot_frame <- NULL
for (fid in names(cartilage_res[[1]])) {
model <- cartilage_res[[1]][[fid]]
qq_plot_frame <- bind_rows(qq_plot_frame,
cbind(tibble(residuals = residuals(model),
fitted = fitted(model),
feature_id = fid),
model.frame(model)))
}
Then we can compute theoretical normal quantiles for each model.
qq_plot_frame <-
qq_plot_frame |>
mutate(feature_id = as.numeric(feature_id)) |>
left_join(
tibble(mz = mz(mse)) |>
mutate(feature_id = row_number(),
mz = round(mz, 3)),
by = join_by("feature_id")
) |>
group_by(mz) |>
arrange(residuals) |>
mutate(
samp = residuals,
theo = qnorm(ppoints(n()))
) |>
ungroup()
qq_plot_frame$condition = factor(qq_plot_frame$condition,
levels = c("Osteoarthritis","Control"))
qq_plot_frame$tissue = factor(qq_plot_frame$tissue,
levels = c("Medial","Lateral"))
fids <- unique(qq_plot_frame$feature_id)
Let’s generate interaction plots that will help us see the signal and effect being modeled.
manage_directory("QQ-Itx-SII-Plots")
m <- readMSIData(DATA_PATH) |>
addProcessing(zero_to_na, label="zero_to_na_before_modeling") |>
process(spectra="intensity", index="mz", BPPARAM = bp_param(8L))
feature_ids <- fids
selected_mz <- fData(m)[feature_ids, , drop = FALSE]
m <- subsetFeatures(m, mz %in% selected_mz$mz)
k <- as.data.table(t(as.matrix(intensity(m))))
setnames(k, paste0("feature_", feature_ids))
pdata <- as.data.table(pData(m))[, .(x, y, tissue_zone, run, subject, condition, tissue)]
result <- cbind(pdata, k)
melted <- melt(result, id.vars = c("x", "y", "tissue_zone", "run", "subject", "condition", "tissue"),
measure.vars = patterns("^feature_"),
variable.name = "feature",
value.name = "intensity")
melted[, feature := sub("^feature_", "", feature)]
sii <- function(fid, bone_alpha = 0.3, background_alpha = 1){
df <- melted |>
filter(feature == as.character(fid)) |>
left_join(name_tibble |> select(run, tid), by = c("run")) |>
mutate(
intensity = ifelse(tissue_zone == "background", NA, intensity),
alpha_val = dplyr::case_when(
tissue_zone == "cartilage" ~ 1,
tissue_zone == "bone" ~ bone_alpha,
TRUE ~ background_alpha
),
tid = factor(tid, levels = c("6891-OM",
"6891-OL",
"6919-CM",
"6919-CL",
"6908-OM",
"6908-OL",
"6924-CM",
"6924-CL",
"6949-OM",
"6949-OL",
"6938-CM",
"6938-CL",
"6954-OM",
"6954-OL",
"6944-CM",
"6944-CL"))
)
df|>
arrange(subject, desc(tissue)) |>
ggplot(aes(x = x, y = -1 * y, fill = intensity, alpha = alpha_val)) +
geom_tile() +
coord_equal() +
theme_void() +
xlab(NULL) +
ylab(NULL) +
facet_wrap(~ tid, ncol = 4) +
scale_fill_viridis_c(
option = "turbo",
begin = 0,
end = 1,
na.value = "#bdbdbd"
) +
scale_alpha_identity(guide = "none") +
theme(
legend.position = "bottom",
panel.spacing.x = unit(0, "pt"), # gap between columns
panel.spacing.y = unit(1, "pt") # gap between rows
) +
labs(fill = "Intensity")
}
itx_fancy <- function(tmp_df){
tmp_df_alt <- tmp_df |>
left_join(name_tibble |>
select(condition, tissue, subject, tid) |>
mutate(subject = as.factor(subject)),
by = join_by(condition, tissue, subject))|>
mutate(tissue = factor(tissue, levels = c("Medial","Lateral")),
condition = factor(condition, levels = c("Osteoarthritis","Control")))
p <- tmp_df_alt |>
ggplot(aes(x = tissue, y = cartilage.mean, group = subject, color = condition, fill = condition)) +
ylab("Mean Intensity") +
xlab(NULL) +
#ylim(global_floor, global_ceiling) +
theme_classic() +
theme(legend.position = 'bottom') +
stat_summary(aes(group = condition, color = condition),fun = mean, geom = "point",
shape = 21, size = 3.5, fill = "white", stroke = 1.5, alpha = .6) +
stat_summary(aes(group = condition, color = condition), fun = mean, geom = "line",
linewidth = 3.25, alpha = 0.3, lineend = "round") +
geom_point() +
geom_line()
p +
geom_text_repel(data = bind_rows(tmp_df_alt |>
filter(tissue == "Lateral"),
tmp_df_alt |>
filter(tissue == "Lateral") |>
mutate(suffix = ifelse(condition == "Control", "CL", "OL")) |>
group_by(condition) |>
summarise(cartilage.mean = mean(cartilage.mean, na.rm = TRUE),
tissue = "Lateral",
suffix = dplyr::first(suffix),
tid = paste0("Mean ", suffix))|>
ungroup()),
aes(label = tid),
color = "black",
hjust = -0.4,
vjust = 0.5,
size = 2.5,
segment.alpha = 0) +
geom_text_repel(data = bind_rows(tmp_df_alt |>
filter(tissue == "Medial"),
tmp_df_alt |>
filter(tissue == "Medial") |>
mutate(suffix = ifelse(condition == "Control", "CM", "OM")) |>
group_by(condition) |>
summarise(cartilage.mean = mean(cartilage.mean, na.rm = TRUE),
tissue = "Medial",
suffix = dplyr::first(suffix),
tid = paste0("Mean ", suffix)) |>
ungroup()),
aes(label = tid),
color = "black",
hjust = 1.4,
vjust = 0.5,
size = 2.5,
segment.alpha = 0) +
labs(color = "Condition", fill = "Condition")
}
This code will generate a set of plots for every model, saving them
in the ./QQ-Itx-SII-Plots directory.
for (f in fids) {
tmp_df <- qq_plot_frame |> filter(feature_id == f)
qqs <- ggplot(tmp_df, aes(theo, samp, shape = tissue, color = condition)) +
geom_point(size = 3, alpha = 0.7) +
geom_abline(slope = 1, intercept = 0) +
labs(
x = "Theoretical (Normal) Quantiles",
y = "Residual Quantiles",
shape = "Tissue",
color = "Condition"
) +
theme_linedraw() +
theme(
legend.position = "bottom",
legend.box = "horizontal" # color + shape guides side-by-side
) +
guides(
color = guide_legend(
direction = "vertical", # entries stacked
ncol = 1,
byrow = TRUE,
title.position = "left" # title left of entries
),
shape = guide_legend(
direction = "vertical",
ncol = 1,
byrow = TRUE,
title.position = "left"
)
) +
theme(legend.key.width = unit(0.4, "cm"),
legend.spacing.x = unit(0.1, "cm"))
var_condition <- ggplot(tmp_df, aes(y = condition, x = residuals)) +
geom_boxplot(aes(group = condition), outlier.shape = NA, color = "black") +
geom_point(aes(shape = tissue, color = condition), size = 3, alpha = 0.7) +
labs(
x = NULL,
y = "Condition",
shape = "Tissue",
color = "Condition"
) +
theme_linedraw() +
theme(
legend.position = "none",
axis.text.y = element_text(angle = 90, hjust = 0.5, vjust = 0.5))
var_tissue <- ggplot(tmp_df, aes(y = tissue, x = residuals)) +
geom_boxplot(aes(group = tissue), outlier.shape = NA, color = "black") +
geom_point(aes(shape = tissue, color = condition), size = 3, alpha = 0.7) +
labs(
x = NULL,
y = "Tissue",
shape = "Tissue",
color = "Condition"
) +
theme_linedraw() +
theme(
legend.position = "none",
axis.text.y = element_text(angle = 90, hjust = 0.5, vjust = 0.5))
var_boxplots <- ggarrange(var_condition, var_tissue, ncol = 2) |>
annotate_figure(top = "Distribution of residuals by independent variable",
bottom = "Residuals",
fig.lab = "D",
fig.lab.size = 20,
fig.lab.face = "bold") +
theme(panel.border = element_rect(colour = "#999999", fill=NA, linewidth=0.25))
qqs <- annotate_figure(qqs, top = "Quantile-quantile plots of residuals",
fig.lab = "A",
fig.lab.size = 20,
fig.lab.face = "bold") +
theme(panel.border = element_rect(colour = "#999999", fill=NA, linewidth=0.25))
boxps <- itx_fancy(tmp_df)
boxps <- annotate_figure(boxps, top = "Mean intensity by tissue and condition",
fig.lab = "B",
fig.lab.size = 20,
fig.lab.face = "bold")+
theme(panel.border = element_rect(colour = "#999999", fill=NA, linewidth=0.25))
sii_plot <- annotate_figure(sii(f), top = "Single ion images with cartilage highlighted",
fig.lab = "C",
fig.lab.size = 20,
fig.lab.face = "bold")+
theme(panel.border = element_rect(colour = "#999999", fill=NA, linewidth=0.25))
ggarr <- ggarrange(ggarrange(qqs,
boxps,
nrow = 2),
ggarrange(sii_plot, var_boxplots, nrow = 2, heights = c(5,3)),
ncol = 2,
widths = c(1,2)) |>
annotate_figure(bottom = paste0("Feature: ", round(unique(tmp_df$mz)[1], 3), " m/z" ))
ggsave(file.path("QQ-Itx-SII-Plots",paste0("QQ-", f ,".png")), plot = ggarr, height = 9, width = 11, units = "in")
if (round(unique(tmp_df$mz)[1], 2) == 1438.66) {
ggsave(file.path("Recreated-Figures","Figure-Statistical-Modeling-Good-Fit.png"), plot = ggarr, height = 9, width = 11, units = "in")
}
}
Let’s take a look at the last one:
ggarr
Each plot can offer us some understanding of the fit of the model to our data.
Here is the manual part: since we only have 124 models, we can
manually diagnose assumption fit for each one using the set of plots in
the ./QQ-Itx-SII-Plots directory. I (the author) have done
this manual classification and added my determinations to
Outputs/feature-model-fit-evaluation.csv. We will refrain
from testing our hypotheses on models with poor fit, as we can no longer
be sure those tests would yield accurate results.
To demonstrate common QQ plot shapes, and especially what is considered “Normal” with only 16 samples, we can simulate some examples below.
set.seed(2025)
n <- 16
normal <- rnorm(n, mean = 0, sd = 1)
exp_dist <- rchisq(n, df = .5) # right-skewed (chi-squared) draws
normal_with_outlier <- c(rnorm(n - 1, mean = 0, sd = 1), 5)
qq_data <- function(x) {
qq <- qqnorm(x, plot.it = FALSE)
data.frame(
sample = qq$y,
theoretical = qq$x
)
}
df <- bind_rows(
qq_data(normal) |> mutate(scenario = "QQ plot from Normally distributed data"),
qq_data(exp_dist) |> mutate(scenario = "QQ plot from Right-skewed data"),
qq_data(log2(exp_dist)) |> mutate(scenario = "QQ plot from Right-skewed data\nafter log transformation")
) |>
mutate(scenario = factor(scenario, levels = c("QQ plot from Normally distributed data",
"QQ plot from Right-skewed data",
"QQ plot from Right-skewed data\nafter log transformation")))
ggplot(df, aes(sample = sample)) +
stat_qq() +
stat_qq_line(distribution = stats::qnorm) +
facet_wrap(~ scenario, scales = "free") +
labs(x = "Theoretical (Normal) Quantiles",
y = "Residual Quantiles")
ggsave(file.path("Recreated-Figures", "SI-Figure-Statistical-Modeling-QQ-Examples.png"), width = 8, height = 4)
Let us also look at Normality when we only have 8 biological replicates. Our models assume that the distribution of random effects is normal overall, and within each experimental variable. (In this example, we only have condition.)
set.seed(2025)
df <- NULL
n = 8
for (s in 1:8) {
df <- bind_rows(df,
tibble(sample = rnorm(n, mean = 0, sd = 1)) |>
mutate(subject = s,
condition = ifelse(s <= 4, "Control", "OA")))
}
ggarrange(
ggplot(df, aes(y = sample, x = condition)) +
geom_boxplot(alpha = 0.5, outliers = FALSE) +
geom_jitter(width = 0.1, alpha = 0.2) +
labs(y = "Residuals",
x = "Condition") +
ylim(-2,3),
ggplot(df, aes(y = sample, x = as.factor(subject))) +
geom_boxplot(alpha = 0.5, outliers = FALSE) +
geom_jitter(width = 0.1, alpha = 0.2) +
labs(x = "Subject",
y = "Residuals")+
ylim(-2,3),
nrow = 1, labels = c("A", "B"), widths = c(.35, 1))
ggsave(file.path("Recreated-Figures", "SI-Figure-Statistical-Modeling-Constant-Variance.png"), width = 6, height = 4)
As we can see, it is generally simple to diagnose serious deviations from model assumptions. However, even when we sample data from a genuinely Normal distribution, small sample size and expected sampling variation can inject uncertainty and apparent abnormality.
Additionally, Normality of a sample is subjective and is better described as a continuous scale from clearly abnormal to clearly normal. If we accept that all sampled data show some variation, and that model fit is not dichotomously good or bad but a continuous spectrum from well fit to poorly fit, then so long as deviations are not serious, we can rely on the robustness of linear models to handle small departures from their assumptions.