Models specify relationships and offer estimates of parameters. Statistical testing quantifies the strength of evidence for specific hypotheses, summarized as a signal-to-noise ratio (a t-statistic). The signal is the estimate of the comparison being tested, such as the difference between the means of two conditions. The noise is the standard error of that estimate. The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming the null hypothesis is true; the smaller the p-value, the stronger the evidence against the null hypothesis. Testing requires calculating the degrees of freedom or, more abstractly, the amount of information available to estimate different parameters. Closely related to the sample size, degrees of freedom generally increase with more data and are reduced when estimating more parameters. As we evaluate more comparisons, we increase the chance of false positives and must adjust our p-values accordingly using a multiple testing correction.
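The signal-to-noise framing can be made concrete with a small sketch on simulated data (made-up values, not from this workflow): compute the t-statistic by hand and check it against R's t.test().

```r
# Toy two-group comparison (simulated data, for illustration only)
set.seed(1)
control <- rnorm(10, mean = 30, sd = 5)
disease <- rnorm(10, mean = 25, sd = 5)

signal <- mean(control) - mean(disease)            # estimate of the comparison
noise  <- sqrt(var(control)/10 + var(disease)/10)  # standard error of the estimate
t_stat <- signal / noise                           # signal-to-noise ratio

# Matches the Welch t-statistic computed by t.test()
all.equal(t_stat, unname(t.test(control, disease)$statistic))
```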

Translate research question(s) to statistical hypotheses

We will test two hypotheses:

- Hypothesis A: the mean intensity differs between the Control and Osteoarthritis conditions within medial tissue.
- Hypothesis B: the mean intensity differs between lateral and medial tissues within the Osteoarthritis condition.

To test these hypotheses, we must express them in terms of the model parameters. We can use the estimated mean intensity of different sets of tissues and conditions. For both hypotheses, the null hypothesis is that there is no difference between the mean intensities of the two sets of tissues and conditions being compared. For example, under the null for Hypothesis A, we expect the estimated difference to be zero, i.e., \(\bar{Y}_{Control, Medial} - \bar{Y}_{Osteoarthritis, Medial} = 0\).
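As a minimal sketch of what "expressing a hypothesis in terms of model parameters" means, the toy data below (hypothetical values, not from this workflow) show that with R's default treatment coding, a fitted coefficient directly estimates a difference in group means, so testing "difference = 0" is a test on a model parameter.

```r
# Toy data (hypothetical, for illustration only)
d <- data.frame(
  intensity = c(10, 12, 11, 14, 16, 15),
  condition = rep(c("Control", "Osteoarthritis"), each = 3)
)
fit <- lm(intensity ~ condition, data = d)

# The 'conditionOsteoarthritis' coefficient equals the difference of group means
diff_means <- mean(d$intensity[d$condition == "Osteoarthritis"]) -
              mean(d$intensity[d$condition == "Control"])
all.equal(unname(coef(fit)[2]), diff_means)
```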

In practice, statistical software will do this translation using the structure of the model, but we must still specify to the software what tests we want it to perform.

Load required packages
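The package-loading chunk is not shown here; based on the functions called in this vignette, the following libraries are assumed (tidyverse supplies readr, dplyr, stringr, tibble, and ggplot2; knitr, kableExtra, and scales are called with `::` below).

```r
# Assumed package list, inferred from the calls used in this vignette
library(Cardinal)    # meansTest results, contrastTest, topFeatures
library(tidyverse)   # read_csv, dplyr verbs, str_remove, ggplot
library(emmeans)     # emmeans, contrast, pairs
```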

Test hypotheses of interest

Like statistical modeling, we can perform statistical testing in Cardinal or externally. Both approaches require specifying a contrast. Formally, a contrast is a linear combination of model coefficients whose weights sum to zero. More simply, a contrast is a set of weights that, when multiplied element-wise with the model parameter estimates and summed, yields the estimate of the comparison being tested. We generally don't have to specify the weights ourselves and can instead describe the comparison using the terms of the model with standard statistical packages.
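The weight-vector view of a contrast can be sketched in a few lines of base R (the cell means below are made-up values, not estimates from this workflow):

```r
# A contrast is a weight vector dotted with the parameter estimates.
# Suppose the model's estimated cell means (in this order) were:
cell_means <- c(ControlLateral = 40, OsteoLateral = 38,
                ControlMedial  = 36, OsteoMedial  = 44)  # made-up values

# Weights for "Control Medial minus Osteoarthritis Medial"
w <- c(0, 0, 1, -1)
sum(w)               # 0: the weights sum to zero, as required for a contrast
sum(w * cell_means)  # the estimated comparison: 36 - 44 = -8
```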

Method 1: Comparison testing in Cardinal

Cardinal::contrastTest() offers arguments to specify custom contrasts using emmeans syntax; the tests themselves are performed by emmeans. First, we will load the meansTest results object from 3-Statistical-Modeling. The results object contains a summary data frame accessible with mtops() and the underlying lmer models. We will also load the results of our manual diagnostics of model fit, creating a list of features that adequately meet the model assumptions.

load(file.path("Outputs", "meansTest_cartilage_lmer.RData"))

model_fit_eval <- read_csv(file.path("Outputs", "feature-model-fit-evaluation.csv"), show_col_types = FALSE) |> 
  mutate(across(qq_normality:outliers, as.factor))

passed <- model_fit_eval |> 
          filter(qq_normality == "OK" & 
            residual_variance == "OK" & 
            outliers == "OK" )

Run tests

In emmeans, which Cardinal uses for hypothesis testing and which we will use externally, contrasts can be specified with formula syntax: for example, ~ condition | tissue (i.e., "condition given tissue") isolates the comparisons for Hypothesis A. Here we pass ~ condition * tissue, which crosses the two factors, and supply "pairwise" as the contrast method to obtain every pairwise comparison among the condition-tissue combinations. emmeans offers tremendous functionality and versatility, with many possible ways to specify contrasts. We will explore other syntax when we repeat these analyses externally from Cardinal in the next part of this workflow.

measTest_models_with_acceptable_fit <- lmer_means_test[passed$feature_id]

h <- Cardinal::contrastTest(measTest_models_with_acceptable_fit, ~condition * tissue, method="pairwise")

Cardinal adds tests to the results object in a wide fashion, with each test having an identifying prefix. That is quite ugly, so below is an example of the emmeans object returned for each model, in this case the first feature model for 1366.62 m/z.

knitr::kable(h[[1]])
|contrast                                       |   estimate|        SE|       df|    t.ratio|   p.value|
|:----------------------------------------------|----------:|---------:|--------:|----------:|---------:|
|Control Lateral - Osteoarthritis Lateral       | -14.347788| 10.840585| 11.66542| -1.3235252| 0.2110136|
|Control Lateral - Control Medial               |  -2.866572|  9.880078|  6.00000| -0.2901366| 0.7814781|
|Control Lateral - Osteoarthritis Medial        | -11.556785| 10.840585| 11.66542| -1.0660665| 0.3079610|
|Osteoarthritis Lateral - Control Medial        |  11.481217| 10.840585| 11.66542|  1.0590956| 0.3109938|
|Osteoarthritis Lateral - Osteoarthritis Medial |   2.791004|  9.880078|  6.00000|  0.2824880| 0.7870612|
|Control Medial - Osteoarthritis Medial         |  -8.690213| 10.840585| 11.66542| -0.8016369| 0.4387877|

Notice there are six tests: every pairwise combination of the four (2x2) unique levels of condition and tissue. For our purposes, we only care about Control Medial - Osteoarthritis Medial (Hypothesis A) and Osteoarthritis Lateral - Osteoarthritis Medial (Hypothesis B). Each comparison reports an estimate, t.ratio (the t-statistic), and p.value. The estimate is the effect estimate, and the t.ratio is the ratio of the signal (the estimate) to the noise (the standard error of the estimate). To get the reverse comparison, say, Osteoarthritis Medial - Osteoarthritis Lateral, simply flip the signs of the estimate and t.ratio; all else stays the same.

We can see that the p-value for Hypothesis A is large for 1366.62 m/z, indicating that we cannot reject the null hypothesis; the difference in mean abundance between conditions in medial tissues is not statistically significant.

Correct for multiple testing

Running more tests increases the probability that at least one test will be a false positive. We can correct for this in a variety of ways, such as globally lowering the threshold for significance (e.g., the Bonferroni correction). The preferred method here, however, is to control the false discovery rate (FDR), the expected proportion of false positives among only the tests we call significant. By setting an acceptable rate of significant tests that are false (in our case 0.05, or 5%, independently per hypothesis), we can apply a less strict correction. Cardinal includes a function for FDR correction, topFeatures.
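The difference between the two corrections can be sketched with stats::p.adjust on a handful of made-up p-values (not values from this analysis):

```r
# Made-up p-values: FDR (Benjamini-Hochberg) vs. Bonferroni correction
p <- c(0.001, 0.004, 0.02, 0.03, 0.2, 0.5, 0.9)

bonf <- p.adjust(p, method = "bonferroni")  # multiply each p by the number of tests (capped at 1)
fdr  <- p.adjust(p, method = "fdr")         # BH step-up: a less strict, rank-based correction

# Bonferroni controls the chance of *any* false positive; FDR controls the
# expected fraction of false positives among the tests called significant,
# so it typically yields more discoveries at the same threshold
sum(fdr < 0.05)
sum(bonf < 0.05)
```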

kableExtra::kbl(topFeatures(h, n=5, sort.by = 2)) |>
  kableExtra::kable_styling(full_width = FALSE) |>
  kableExtra::scroll_box(width = "100%", height = "200px")
|  i|       mz|singular | Control.Lateral...Osteoarthritis.Lateral.estimate| Control.Lateral...Control.Medial.estimate| Control.Lateral...Osteoarthritis.Medial.estimate| Osteoarthritis.Lateral...Control.Medial.estimate| Osteoarthritis.Lateral...Osteoarthritis.Medial.estimate| Control.Medial...Osteoarthritis.Medial.estimate| Control.Lateral...Osteoarthritis.Lateral.pvalue| Control.Lateral...Control.Medial.pvalue| Control.Lateral...Osteoarthritis.Medial.pvalue| Osteoarthritis.Lateral...Control.Medial.pvalue| Osteoarthritis.Lateral...Osteoarthritis.Medial.pvalue| Control.Medial...Osteoarthritis.Medial.pvalue| Control.Lateral...Osteoarthritis.Lateral.fdr| Control.Lateral...Control.Medial.fdr| Control.Lateral...Osteoarthritis.Medial.fdr| Osteoarthritis.Lateral...Control.Medial.fdr| Osteoarthritis.Lateral...Osteoarthritis.Medial.fdr| Control.Medial...Osteoarthritis.Medial.fdr|
|--:|--------:|:--------|----------:|---------:|-----------:|----------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|
|  7| 1366.620|FALSE    | -14.347788| -2.866572| -11.5567846| 11.4812166| 2.7910036| -8.690213| 0.2110136| 0.7814781| 0.3079610| 0.3109938| 0.7870612| 0.4387877| 0.9677784| 0.8994715| 0.9556276| 0.9942336| 0.9110842| 0.8896473|
|  8| 1367.616|TRUE     | -12.151073| -7.833456|  -9.0934830|  4.3176173| 3.0575903| -1.260027| 0.3402864| 0.5340797| 0.4716782| 0.7303139| 0.8069040| 0.9196825| 0.9995921| 0.8151743| 0.9556276| 0.9942336| 0.9110842| 0.9580111|
|  9| 1381.628|TRUE     |  -3.268407| -3.078269|   0.4005776|  0.1901371| 3.6689841|  3.478847| 0.4757351| 0.5012576| 0.9295905| 0.9665416| 0.4246677| 0.4484445| 0.9995921| 0.8151743| 0.9556276| 0.9942336| 0.9110842| 0.8896473|
| 14| 1428.646|FALSE    |  -6.604205| -6.230537| -12.6595823|  0.3736678| -6.0553774| -6.429045| 0.6607162| 0.2291452| 0.4093875| 0.9800342| 0.2409500| 0.6690491| 0.9995921| 0.6645211| 0.9556276| 0.9942336| 0.9110842| 0.8896473|
| 17| 1437.631|FALSE    |  -1.851057| -3.691546|  -2.3300860| -1.8404899| -0.4790294|  1.361460| 0.5621189| 0.0148208| 0.4688661| 0.5642960| 0.6761225| 0.6679844| 0.9995921| 0.1777739| 0.9556276| 0.9942336| 0.9110842| 0.8896473|

topFeatures applies FDR correction to every set of p-values independently. This is important because FDR correction is sensitive to the distribution of p-values. We will explore this later. Now, let’s extract our p-values and see if any are significant.

results_tibble_h1 <- as_tibble(topFeatures(h)) |> 
  select(i, mz, starts_with("Control.Medial...Osteoarthritis.Medial.")) |> 
  rename_with(~ str_remove(.x, fixed("Control.Medial...Osteoarthritis.Medial.")), 
              starts_with("Control.Medial...Osteoarthritis.Medial.")) |> 
  arrange(pvalue) |> 
  mutate(hypo = "Hypothesis A")


results_tibble_h2 <- as_tibble(topFeatures(h)) |> 
  select(i, mz, starts_with("Osteoarthritis.Lateral...Osteoarthritis.Medial.")) |> 
  rename_with(~ str_remove(.x, fixed("Osteoarthritis.Lateral...Osteoarthritis.Medial.")), 
              starts_with("Osteoarthritis.Lateral...Osteoarthritis.Medial.")) |> 
  arrange(pvalue) |> 
  mutate(hypo = "Hypothesis B")

knitr::kable(rbind(results_tibble_h1, results_tibble_h2) |> 
    mutate(`Significant at 0.1 after FDR adjustment` = fdr < 0.1) |> 
    dplyr::summarize(Count = dplyr::n(), .by = c(hypo, `Significant at 0.1 after FDR adjustment`)))
|hypo         |Significant at 0.1 after FDR adjustment | Count|
|:------------|:---------------------------------------|-----:|
|Hypothesis A |FALSE                                   |    29|
|Hypothesis B |FALSE                                   |    29|

Unfortunately, after adjustment, no findings for either hypothesis are significant. Regardless, let's make a volcano plot using the unadjusted p-values. We will add a line at unadjusted p = 0.01. (Keep in mind this plot may look different from the one in the accompanying publication, as it uses the estimate rather than the percent difference and may use the flipped comparison. Neither affects the significance. We will show that plot below.)

rbind(results_tibble_h1, results_tibble_h2) |> 
    ggplot(aes(x = estimate, y = -log(pvalue))) +
    geom_point(alpha = 0.25) +
    facet_wrap("hypo") +
    geom_hline(yintercept = -log(0.01)) +
    xlab("Estimated Difference") +
    ylab("-Log(Unadjusted P Value)")

Method 2: Comparison testing externally

As in Cardinal, we will test comparisons externally using emmeans, showing multiple ways to specify the same comparison. First, let's load the models we fit in the previous vignette.

load(file.path("Outputs","cartilage_results.RData"))

cartilage_models_OK_fit <- cartilage_res[[1]][passed$feature_id]

We will write a function to test our hypotheses for each model, but first, let's experiment with different ways to specify contrasts in emmeans, again using the model for 1366.62 m/z.

m <- cartilage_models_OK_fit[[1]]

Run tests

We can specify the same pairwise comparison of conditions within tissues from above with the same formula syntax, this time using emmeans functions directly and selecting only the second contrast from each set.

#Hypothesis A
emmeans_grid <- emmeans(m, ~ condition | tissue, lmer.df = "satterthwaite")
emmeans::contrast(emmeans_grid, "pairwise", adjust = "none")[2]
##  contrast                 tissue estimate   SE   df t.ratio p.value
##  Control - Osteoarthritis Medial    -8.69 10.8 11.7  -0.802  0.4388
## 
## Degrees-of-freedom method: satterthwaite
#Hypothesis B
emmeans_grid <- emmeans(m, ~ tissue | condition, lmer.df = "satterthwaite")
emmeans::contrast(emmeans_grid, "pairwise", adjust = "none")[2]
##  contrast         condition      estimate   SE df t.ratio p.value
##  Lateral - Medial Osteoarthritis     2.79 9.88  6   0.282  0.7871
## 
## Degrees-of-freedom method: satterthwaite

These function calls use the same default arguments as Cardinal::meansTest: Satterthwaite degrees-of-freedom calculation and no initial p-value adjustment. They also reproduce the same two contrasts from the Cardinal results above. We can get the same results by passing the emmeans object to the pairwise wrapper function, pairs():

#Hypothesis A using pairs()
emmeans_grid <- emmeans(m, ~ condition | tissue, lmer.df = "satterthwaite")
pairs(emmeans_grid, adjust = "none")[2]
##  contrast                 tissue estimate   SE   df t.ratio p.value
##  Control - Osteoarthritis Medial    -8.69 10.8 11.7  -0.802  0.4388
## 
## Degrees-of-freedom method: satterthwaite
#Hypothesis B using pairs()
emmeans_grid <- emmeans(m, ~ tissue | condition, lmer.df = "satterthwaite")
pairs(emmeans_grid, adjust = "none")[2]
##  contrast         condition      estimate   SE df t.ratio p.value
##  Lateral - Medial Osteoarthritis     2.79 9.88  6   0.282  0.7871
## 
## Degrees-of-freedom method: satterthwaite

Or, we can directly specify the contrast vector for just the contrasts of interest, using the following specs and methods:

emmeans_grid <- emmeans(m, ~ condition * tissue, lmer.df = "satterthwaite")
contrast(emmeans_grid,
         list(control_vs_osteoarthritis_in_medial = c(0, 0, 1, -1),
         lateral_vs_medial_in_osteoarthritis = c(0, 1, 0, -1)),
         adjust = "none")
##  contrast                            estimate    SE   df t.ratio p.value
##  control_vs_osteoarthritis_in_medial    -8.69 10.80 11.7  -0.802  0.4388
##  lateral_vs_medial_in_osteoarthritis     2.79  9.88  6.0   0.282  0.7871
## 
## Degrees-of-freedom method: satterthwaite

Now, let's write a function to test Hypotheses A and B on all models. The function also extracts the estimated mean of the reference group so that we can compute the percent change.

signift_cartilage <- bind_rows(lapply(cartilage_models_OK_fit, FUN = function(m){
    emmeans_grid <- emmeans(m, ~ condition * tissue, lmer.df = "satterthwaite")
    k <- contrast(emmeans_grid,
         list(control_vs_osteoarthritis_in_medial = c(0, 0, 1, -1),
         lateral_vs_medial_in_osteoarthritis = c(0, 1, 0, -1)),
         adjust = "none")
    as_tibble(k) |> mutate(reference_estimate = as_tibble(emmeans_grid)  |>
                filter((condition == "Control" & tissue == "Medial" )| 
                (condition == "Osteoarthritis" & tissue == "Lateral"))  |> 
                arrange(desc(tissue)) |> pull(emmean)) |> 
                    mutate(pct_delta = estimate/reference_estimate)
    }), .id = "feature_id") |> 
  group_by(contrast) |> 
  mutate(p_value_fdr = p.adjust(p.value, method = "fdr")) |> 
  ungroup() |> 
  mutate(signif_after_fdr_adj = p_value_fdr < 0.1) |> 
  arrange(p.value)

save(signift_cartilage, file = file.path("Outputs", "signift_cartilage.RData"))

knitr::kable(head(signift_cartilage))
|feature_id |contrast                            |  estimate|        SE|        df|   t.ratio|   p.value| reference_estimate|  pct_delta| p_value_fdr|signif_after_fdr_adj |
|:----------|:-----------------------------------|---------:|---------:|---------:|---------:|---------:|------------------:|----------:|-----------:|:--------------------|
|18         |lateral_vs_medial_in_osteoarthritis | -5.973193| 3.2696362|  6.000000| -1.826868| 0.1174970|           45.63479| -0.1308912|   0.9110842|FALSE                |
|55         |control_vs_osteoarthritis_in_medial |  2.213277| 1.3382654| 12.000000|  1.653841| 0.1240612|           37.48495|  0.0590444|   0.8896473|FALSE                |
|55         |lateral_vs_medial_in_osteoarthritis |  2.143486| 1.3382654| 12.000000|  1.601690| 0.1352052|           37.41516|  0.0572892|   0.9110842|FALSE                |
|104        |control_vs_osteoarthritis_in_medial | -1.658258| 1.0278974|  8.994291| -1.613252| 0.1411693|           34.32677| -0.0483080|   0.8896473|FALSE                |
|78         |lateral_vs_medial_in_osteoarthritis |  4.198255| 2.6002345|  6.000000|  1.614568| 0.1575314|           43.73598|  0.0959909|   0.9110842|FALSE                |
|81         |control_vs_osteoarthritis_in_medial |  1.384963| 0.9209036| 10.744242|  1.503917| 0.1614164|           35.90810|  0.0385696|   0.8896473|FALSE                |

Correct for multiple testing

As above, we will correct for multiple testing using the FDR method and construct a volcano plot. We have also flipped the comparison to harmonize these results with the accompanying journal article (positive differences mean greater abundance in osteoarthritis conditions or medial tissues).

signift_cartilage |>
  ggplot(aes(x = -1 * pct_delta, y = -log10(p.value), color = signif_after_fdr_adj)) +
  geom_point(alpha = 0.6, size = 2) +
  geom_hline(yintercept = -log10(0.01), linetype = "dashed", color = "gray40") +
  facet_wrap(~contrast, labeller = labeller(contrast = c(
    "control_vs_osteoarthritis_in_medial" = "OA vs. Control in Medial",
    "lateral_vs_medial_in_osteoarthritis" = "Medial vs. Lateral in OA"
  ))) +
  scale_x_continuous(labels = scales::percent) +
  scale_color_manual(
    values = c("TRUE" = "#E41A1C", "FALSE" = "gray50"),
    labels = c("TRUE" = "Significant", "FALSE" = "Not significant"),
    name = "FDR-adjusted"
  ) +
  labs(
    x = "Estimated Percent Difference",
    y = expression(-log[10](italic(P)~value))
  ) +
  theme_bw(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "white", color = "black"),
    strip.text = element_text(face = "bold"),
    legend.position = "bottom"
  )

  ggsave(file.path("Recreated-Figures", "Figure-Statistical-Inference-Volcano.png"), height = 5.75, width = 6, units = "in")

Inspecting the distributions of unadjusted P values for test interdependence

When performing multiple tests, the distribution of p-values can reveal information about the independence of the tests. If the tests are independent, we expect either a uniform distribution (when few tests are significant) or a right-skewed distribution with a peak near zero (when many tests are significant). Deviations from this pattern can indicate that the tests are not independent, which can affect the performance of multiple testing corrections.
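The uniform-under-the-null expectation is easy to verify by simulation (a sketch with made-up data, independent of this workflow): when every null hypothesis is true and the tests are independent, about 5% of p-values fall below 0.05 and the histogram is flat.

```r
# Simulate many independent tests where the null is true:
# both groups drawn from the same distribution
set.seed(42)
null_p <- replicate(2000, t.test(rnorm(10), rnorm(10))$p.value)

# Approximately 5% of null p-values land below 0.05,
# and counts are roughly equal across histogram bins
mean(null_p < 0.05)
hist(null_p, breaks = 10, main = "Null p-values", xlab = "p-value")
```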

signift_cartilage |> 
  ggplot(aes(x = p.value)) +
  geom_histogram(bins = 7, fill = "gray50", color = "white", boundary = 0) +
  facet_wrap(~contrast, labeller = labeller(contrast = c(
    "control_vs_osteoarthritis_in_medial" = "OA vs. Control in Medial",
    "lateral_vs_medial_in_osteoarthritis" = "Medial vs. Lateral in OA"
  ))) +
  scale_x_continuous(breaks = seq(0, 1, by = 0.2)) +
  labs(
    x = expression(italic(P)~value),
    y = "Count"
  ) +
  theme_bw(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "white", color = "black"),
    strip.text = element_text(face = "bold")
  )

  ggsave(file.path("Recreated-Figures", "SI-Figure-Statistical-Inference-PValue-Dists.png"), height = 5.75, width = 6, units = "in")

As the distribution of p-values is neither uniform nor right-skewed, there may be some interdependence between tests, though this is hard to verify with so few tests overall. Interdependence would not be surprising, however, as the tests are performed on the same samples and many metabolites are biochemically related. Better clustering of features could eliminate some of this potential interdependence. Some software packages offer methods to account for interdependence in multiple testing correction, such as the locfdr package.

The code below recreates the tables and figures from the manuscript.

knitr::kable(model_fit_eval |> select(-c(feature_id)))

signift_cartilage |> 
  filter(contrast == "control_vs_osteoarthritis_in_medial") |>
  mutate(feature_id = as.integer(feature_id)) |>
  left_join(passed) |>
  mutate(`M/Z` = round(mz, 3),
          Estimate = round(estimate, 3),
          Std.Error = round(SE, 3),
          `P Value` = round(p.value, 3),
          `FDR Adj. P value` = round(p_value_fdr, 3)) |>
  select(`M/Z`, Estimate, Std.Error, `P Value`, `FDR Adj. P value`) |> 
  arrange(`P Value`) |>
  knitr::kable()

signift_cartilage |> 
  filter(contrast == "lateral_vs_medial_in_osteoarthritis") |>
  mutate(feature_id = as.integer(feature_id)) |>
  left_join(passed) |>
  mutate(`M/Z` = round(mz, 3),
          Estimate = round(estimate, 3),
          Std.Error = round(SE, 3),
          `P Value` = round(p.value, 3),
          `FDR Adj. P value` = round(p_value_fdr, 3)) |>
  select(`M/Z`, Estimate, Std.Error, `P Value`, `FDR Adj. P value`) |> 
  arrange(`P Value`) |>
  knitr::kable()