Models specify relationships and offer estimates of parameters. Statistical testing quantifies the strength of evidence for specific hypotheses, summarized as a signal-to-noise ratio (a t-statistic). The signal is the estimate of the comparison being tested, such as the difference between the means of two conditions. The noise is the standard error of that estimate. The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming the null hypothesis is true; the smaller the p-value, the stronger the evidence against the null hypothesis. Testing requires calculating the degrees of freedom or, more abstractly, the amount of information available to estimate different parameters. Closely related to the sample size, degrees of freedom generally increase with more data and are reduced when estimating more parameters. As we evaluate more comparisons, we increase the chance of false positives and must adjust our p-values accordingly using a multiple testing correction.
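The signal-to-noise framing can be made concrete with a small sketch on simulated data (made-up values, not from this workflow): compute the t-statistic by hand and check it against R's t.test().

```r
# Toy two-group comparison (simulated data, for illustration only)
set.seed(1)
control <- rnorm(10, mean = 30, sd = 5)
disease <- rnorm(10, mean = 25, sd = 5)

signal <- mean(control) - mean(disease)            # estimate of the comparison
noise  <- sqrt(var(control)/10 + var(disease)/10)  # standard error of the estimate
t_stat <- signal / noise                           # signal-to-noise ratio

# Matches the Welch t-statistic computed by t.test()
all.equal(t_stat, unname(t.test(control, disease)$statistic))
```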

Translate research question(s) to statistical hypotheses

We will test two hypotheses:

- Hypothesis A: the mean intensity differs between the Control and Osteoarthritis conditions within medial tissue.
- Hypothesis B: the mean intensity differs between lateral and medial tissues within the Osteoarthritis condition.

To test these hypotheses, we must express them in terms of the model parameters. We can use the estimated mean intensity of different sets of tissues and conditions. For both hypotheses, the null hypothesis is that there is no difference between the mean intensities of the two sets of tissues and conditions being compared. For example, under the null for Hypothesis A, we expect the estimated difference to be zero, i.e., \(\bar{Y}_{Control, Medial} - \bar{Y}_{Osteoarthritis, Medial} = 0\).
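As a minimal sketch of what "expressing a hypothesis in terms of model parameters" means, the toy data below (hypothetical values, not from this workflow) show that with R's default treatment coding, a fitted coefficient directly estimates a difference in group means, so testing "difference = 0" is a test on a model parameter.

```r
# Toy data (hypothetical, for illustration only)
d <- data.frame(
  intensity = c(10, 12, 11, 14, 16, 15),
  condition = rep(c("Control", "Osteoarthritis"), each = 3)
)
fit <- lm(intensity ~ condition, data = d)

# The 'conditionOsteoarthritis' coefficient equals the difference of group means
diff_means <- mean(d$intensity[d$condition == "Osteoarthritis"]) -
              mean(d$intensity[d$condition == "Control"])
all.equal(unname(coef(fit)[2]), diff_means)
```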

In practice, statistical software will do this translation using the structure of the model, but we must still specify to the software what tests we want it to perform.

Load required packages
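The package-loading chunk is not shown here; based on the functions called in this vignette, the following libraries are assumed (tidyverse supplies readr, dplyr, stringr, tibble, and ggplot2; knitr, kableExtra, and scales are called with `::` below).

```r
# Assumed package list, inferred from the calls used in this vignette
library(Cardinal)    # meansTest results, contrastTest, topFeatures
library(tidyverse)   # read_csv, dplyr verbs, str_remove, ggplot
library(emmeans)     # emmeans, contrast, pairs
```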

Test hypotheses of interest

Like statistical modeling, we can perform statistical testing in Cardinal or externally. Both approaches require specifying a contrast. Formally, a contrast is a linear combination of model coefficients whose weights sum to zero. More simply, a contrast is a set of weights that, when multiplied element-wise with the model parameter estimates and summed, yields the estimate of the comparison being tested. We generally don't have to specify the weights ourselves and can instead describe the comparison using the terms of the model with standard statistical packages.
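The weight-vector view of a contrast can be sketched in a few lines of base R (the cell means below are made-up values, not estimates from this workflow):

```r
# A contrast is a weight vector dotted with the parameter estimates.
# Suppose the model's estimated cell means (in this order) were:
cell_means <- c(ControlLateral = 40, OsteoLateral = 38,
                ControlMedial  = 36, OsteoMedial  = 44)  # made-up values

# Weights for "Control Medial minus Osteoarthritis Medial"
w <- c(0, 0, 1, -1)
sum(w)               # 0: the weights sum to zero, as required for a contrast
sum(w * cell_means)  # the estimated comparison: 36 - 44 = -8
```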

Method 1: Comparison testing in Cardinal

Cardinal::contrastTest() offers arguments to specify custom contrasts using emmeans syntax; the tests themselves are performed by emmeans. First, we will load the meansTest results object from 3-Statistical-Modeling. The results object contains a summary data frame accessible with mtops() and the underlying lmer models. We will also load the results of our manual diagnostics of model fit, creating a list of features that adequately meet the model assumptions.

load(file.path("Outputs", "meansTest_cartilage_lmer.RData"))

model_fit_eval <- read_csv(file.path("Outputs", "feature-model-fit-evaluation.csv"), show_col_types = FALSE) |> 
  mutate(across(qq_normality:outliers, as.factor))

passed <- model_fit_eval |> 
          filter(qq_normality == "OK" & 
            residual_variance == "OK" & 
            outliers == "OK" )

Run tests

In emmeans, which Cardinal uses for hypothesis testing and which we will use externally, contrasts can be specified with formula syntax: for example, ~ condition | tissue (i.e., "condition given tissue") isolates the comparisons for Hypothesis A. Here we pass ~ condition * tissue, which crosses the two factors, and supply "pairwise" as the contrast method to obtain every pairwise comparison among the condition-tissue combinations. emmeans offers tremendous functionality and versatility, with many possible ways to specify contrasts. We will explore other syntax when we repeat these analyses externally from Cardinal in the next part of this workflow.

measTest_models_with_acceptable_fit <- lmer_means_test[passed$feature_id]

h <- Cardinal::contrastTest(measTest_models_with_acceptable_fit, ~condition * tissue, method="pairwise")

Cardinal adds tests to the results object in a wide fashion, with each test having an identifying prefix. That is quite ugly, so below is an example of the emmeans object returned for each model, in this case the first feature model for 1366.62 m/z.

knitr::kable(h[[1]])
|contrast                                       |   estimate|        SE|       df|    t.ratio|   p.value|
|:----------------------------------------------|----------:|---------:|--------:|----------:|---------:|
|Control Lateral - Osteoarthritis Lateral       | -14.347788| 10.840585| 11.66542| -1.3235252| 0.2110136|
|Control Lateral - Control Medial               |  -2.866572|  9.880078|  6.00000| -0.2901366| 0.7814781|
|Control Lateral - Osteoarthritis Medial        | -11.556785| 10.840585| 11.66542| -1.0660665| 0.3079610|
|Osteoarthritis Lateral - Control Medial        |  11.481217| 10.840585| 11.66542|  1.0590956| 0.3109938|
|Osteoarthritis Lateral - Osteoarthritis Medial |   2.791004|  9.880078|  6.00000|  0.2824880| 0.7870612|
|Control Medial - Osteoarthritis Medial         |  -8.690213| 10.840585| 11.66542| -0.8016369| 0.4387877|

Notice there are six tests: every pairwise combination of the four (2x2) unique levels of condition and tissue. For our purposes, we only care about Control Medial - Osteoarthritis Medial (Hypothesis A) and Osteoarthritis Lateral - Osteoarthritis Medial (Hypothesis B). Each comparison reports an estimate, t.ratio (the t-statistic), and p.value. The estimate is the effect estimate, and the t.ratio is the ratio of the signal (the estimate) to the noise (the standard error of the estimate). To get the reverse comparison, say, Osteoarthritis Medial - Osteoarthritis Lateral, simply flip the signs of the estimate and t.ratio; all else stays the same.

We can see that the p-value for Hypothesis A is large for 1366.62 m/z, indicating that we cannot reject the null hypothesis; the difference in mean abundance between conditions in medial tissues is not statistically significant.

Correct for multiple testing

Running more tests increases the probability that at least one test will be a false positive. We can correct for this in a variety of ways, such as globally lowering the threshold for significance (e.g., the Bonferroni correction). The preferred method here, however, is to control the false discovery rate (FDR), the expected proportion of false positives among only the tests we call significant. By setting an acceptable rate of significant tests that are false (in our case 0.05, or 5%, independently per hypothesis), we can apply a less strict correction. Cardinal includes a function for FDR correction, topFeatures.
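The difference between the two corrections can be sketched with stats::p.adjust on a handful of made-up p-values (not values from this analysis):

```r
# Made-up p-values: FDR (Benjamini-Hochberg) vs. Bonferroni correction
p <- c(0.001, 0.004, 0.02, 0.03, 0.2, 0.5, 0.9)

bonf <- p.adjust(p, method = "bonferroni")  # multiply each p by the number of tests (capped at 1)
fdr  <- p.adjust(p, method = "fdr")         # BH step-up: a less strict, rank-based correction

# Bonferroni controls the chance of *any* false positive; FDR controls the
# expected fraction of false positives among the tests called significant,
# so it typically yields more discoveries at the same threshold
sum(fdr < 0.05)
sum(bonf < 0.05)
```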

kableExtra::kbl(topFeatures(h, n=5, sort.by = 2)) |>
  kableExtra::kable_styling(full_width = FALSE) |>
  kableExtra::scroll_box(width = "100%", height = "200px")
|  i|       mz|singular | Control.Lateral...Osteoarthritis.Lateral.estimate| Control.Lateral...Control.Medial.estimate| Control.Lateral...Osteoarthritis.Medial.estimate| Osteoarthritis.Lateral...Control.Medial.estimate| Osteoarthritis.Lateral...Osteoarthritis.Medial.estimate| Control.Medial...Osteoarthritis.Medial.estimate| Control.Lateral...Osteoarthritis.Lateral.pvalue| Control.Lateral...Control.Medial.pvalue| Control.Lateral...Osteoarthritis.Medial.pvalue| Osteoarthritis.Lateral...Control.Medial.pvalue| Osteoarthritis.Lateral...Osteoarthritis.Medial.pvalue| Control.Medial...Osteoarthritis.Medial.pvalue| Control.Lateral...Osteoarthritis.Lateral.fdr| Control.Lateral...Control.Medial.fdr| Control.Lateral...Osteoarthritis.Medial.fdr| Osteoarthritis.Lateral...Control.Medial.fdr| Osteoarthritis.Lateral...Osteoarthritis.Medial.fdr| Control.Medial...Osteoarthritis.Medial.fdr|
|--:|--------:|:--------|----------:|---------:|-----------:|----------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|---------:|
|  7| 1366.620|FALSE    | -14.347788| -2.866572| -11.5567846| 11.4812166| 2.7910036| -8.690213| 0.2110136| 0.7814781| 0.3079610| 0.3109938| 0.7870612| 0.4387877| 0.9677784| 0.8994715| 0.9556276| 0.9942336| 0.9110842| 0.8896473|
|  8| 1367.616|TRUE     | -12.151073| -7.833456|  -9.0934830|  4.3176173| 3.0575903| -1.260027| 0.3402864| 0.5340797| 0.4716782| 0.7303139| 0.8069040| 0.9196825| 0.9995921| 0.8151743| 0.9556276| 0.9942336| 0.9110842| 0.9580111|
|  9| 1381.628|TRUE     |  -3.268407| -3.078269|   0.4005776|  0.1901371| 3.6689841|  3.478847| 0.4757351| 0.5012576| 0.9295905| 0.9665416| 0.4246677| 0.4484445| 0.9995921| 0.8151743| 0.9556276| 0.9942336| 0.9110842| 0.8896473|
| 14| 1428.646|FALSE    |  -6.604205| -6.230537| -12.6595823|  0.3736678| -6.0553774| -6.429045| 0.6607162| 0.2291452| 0.4093875| 0.9800342| 0.2409500| 0.6690491| 0.9995921| 0.6645211| 0.9556276| 0.9942336| 0.9110842| 0.8896473|
| 17| 1437.631|FALSE    |  -1.851057| -3.691546|  -2.3300860| -1.8404899| -0.4790294|  1.361460| 0.5621189| 0.0148208| 0.4688661| 0.5642960| 0.6761225| 0.6679844| 0.9995921| 0.1777739| 0.9556276| 0.9942336| 0.9110842| 0.8896473|

topFeatures applies FDR correction to every set of p-values independently. This is important because FDR correction is sensitive to the distribution of p-values. We will explore this later. Now, let’s extract our p-values and see if any are significant.

results_tibble_h1 <- as_tibble(topFeatures(h)) |> 
  select(i, mz, starts_with("Control.Medial...Osteoarthritis.Medial.")) |> 
  rename_with(~ str_remove(.x, fixed("Control.Medial...Osteoarthritis.Medial.")), 
              starts_with("Control.Medial...Osteoarthritis.Medial.")) |> 
  arrange(pvalue) |> 
  mutate(hypo = "Hypothesis A")


results_tibble_h2 <- as_tibble(topFeatures(h)) |> 
  select(i, mz, starts_with("Osteoarthritis.Lateral...Osteoarthritis.Medial.")) |> 
  rename_with(~ str_remove(.x, fixed("Osteoarthritis.Lateral...Osteoarthritis.Medial.")), 
              starts_with("Osteoarthritis.Lateral...Osteoarthritis.Medial.")) |> 
  arrange(pvalue) |> 
  mutate(hypo = "Hypothesis B")

knitr::kable(rbind(results_tibble_h1, results_tibble_h2) |> 
    mutate(`Significant at 0.1 after FDR adjustment` = fdr < 0.1) |> 
    dplyr::summarize(Count = dplyr::n(), .by = c(hypo, `Significant at 0.1 after FDR adjustment`)))
|hypo         |Significant at 0.1 after FDR adjustment | Count|
|:------------|:---------------------------------------|-----:|
|Hypothesis A |FALSE                                   |    29|
|Hypothesis B |FALSE                                   |    29|

Unfortunately, after adjustment, no findings for either hypothesis are significant. Regardless, let's make a volcano plot using the unadjusted p-values. We will add a line at unadjusted p = 0.01. (Keep in mind this plot may look different from the one in the accompanying publication, as it uses the estimate rather than the percent difference and may use the flipped comparison. Neither affects the significance. We will show that plot below.)

rbind(results_tibble_h1, results_tibble_h2) |> 
    ggplot(aes(x = estimate, y = -log(pvalue))) +
    geom_point(alpha = 0.25) +
    facet_wrap("hypo") +
    geom_hline(yintercept = -log(0.01)) +
    xlab("Estimated Difference") +
    ylab("-Log(Unadjusted P Value)")

Method 2: Comparison testing externally

As in Cardinal, we will test comparisons externally using emmeans, showing multiple ways to specify the same comparison. First, let's load the models we fit in the previous vignette.

load(file.path("Outputs","cartilage_results.RData"))

cartilage_models_OK_fit <- cartilage_res[[1]][passed$feature_id]

We will write a function to test our hypotheses for each model, but first, let's experiment with different ways to specify contrasts in emmeans, again using the model for 1366.62 m/z.

m <- cartilage_models_OK_fit[[1]]

Run tests

We can specify the same pairwise comparison of conditions within tissues from above with the same formula syntax, this time using emmeans functions directly and selecting only the second contrast from each set.

#Hypothesis A
emmeans_grid <- emmeans(m, ~ condition | tissue, lmer.df = "satterthwaite")
emmeans::contrast(emmeans_grid, "pairwise", adjust = "none")[2]
##  contrast                 tissue estimate   SE   df t.ratio p.value
##  Control - Osteoarthritis Medial    -8.69 10.8 11.7  -0.802  0.4388
## 
## Degrees-of-freedom method: satterthwaite
#Hypothesis B
emmeans_grid <- emmeans(m, ~ tissue | condition, lmer.df = "satterthwaite")
emmeans::contrast(emmeans_grid, "pairwise", adjust = "none")[2]
##  contrast         condition      estimate   SE df t.ratio p.value
##  Lateral - Medial Osteoarthritis     2.79 9.88  6   0.282  0.7871
## 
## Degrees-of-freedom method: satterthwaite

These function calls use the same default arguments as Cardinal::meansTest: Satterthwaite degrees-of-freedom calculation and no initial p-value adjustment. They also reproduce the same two contrasts from the Cardinal results above. We can get the same results by passing the emmeans object to the pairwise wrapper function, pairs():

#Hypothesis A using pairs()
emmeans_grid <- emmeans(m, ~ condition | tissue, lmer.df = "satterthwaite")
pairs(emmeans_grid, adjust = "none")[2]
##  contrast                 tissue estimate   SE   df t.ratio p.value
##  Control - Osteoarthritis Medial    -8.69 10.8 11.7  -0.802  0.4388
## 
## Degrees-of-freedom method: satterthwaite
#Hypothesis B using pairs()
emmeans_grid <- emmeans(m, ~ tissue | condition, lmer.df = "satterthwaite")
pairs(emmeans_grid, adjust = "none")[2]
##  contrast         condition      estimate   SE df t.ratio p.value
##  Lateral - Medial Osteoarthritis     2.79 9.88  6   0.282  0.7871
## 
## Degrees-of-freedom method: satterthwaite

Or, we can directly specify the contrast vector for just the contrasts of interest, using the following specs and methods:

emmeans_grid <- emmeans(m, ~ condition * tissue, lmer.df = "satterthwaite")
contrast(emmeans_grid,
         list(control_vs_osteoarthritis_in_medial = c(0, 0, 1, -1),
         lateral_vs_medial_in_osteoarthritis = c(0, 1, 0, -1)),
         adjust = "none")
##  contrast                            estimate    SE   df t.ratio p.value
##  control_vs_osteoarthritis_in_medial    -8.69 10.80 11.7  -0.802  0.4388
##  lateral_vs_medial_in_osteoarthritis     2.79  9.88  6.0   0.282  0.7871
## 
## Degrees-of-freedom method: satterthwaite

Now, let's write a function to test Hypotheses A and B on all models. The function also extracts the estimated mean of the reference group so that we can compute the percent change.

signift_cartilage <- bind_rows(lapply(cartilage_models_OK_fit, FUN = function(m){
    emmeans_grid <- emmeans(m, ~ condition * tissue, lmer.df = "satterthwaite")
    k <- contrast(emmeans_grid,
         list(control_vs_osteoarthritis_in_medial = c(0, 0, 1, -1),
         lateral_vs_medial_in_osteoarthritis = c(0, 1, 0, -1)),
         adjust = "none")
    as_tibble(k) |> mutate(reference_estimate = as_tibble(emmeans_grid)  |>
                filter((condition == "Control" & tissue == "Medial" )| 
                (condition == "Osteoarthritis" & tissue == "Lateral"))  |> 
                arrange(desc(tissue)) |> pull(emmean)) |> 
                    mutate(pct_delta = estimate/reference_estimate)
    }), .id = "feature_id") |> 
  group_by(contrast) |> 
  mutate(p_value_fdr = p.adjust(p.value, method = "fdr")) |> 
  ungroup() |> 
  mutate(signif_after_fdr_adj = p_value_fdr < 0.1) |> 
  arrange(p.value)

save(signift_cartilage, file = file.path("Outputs", "signift_cartilage.RData"))

knitr::kable(head(signift_cartilage))
|feature_id |contrast                            |  estimate|        SE|        df|   t.ratio|   p.value| reference_estimate|  pct_delta| p_value_fdr|signif_after_fdr_adj |
|:----------|:-----------------------------------|---------:|---------:|---------:|---------:|---------:|------------------:|----------:|-----------:|:--------------------|
|18         |lateral_vs_medial_in_osteoarthritis | -5.973193| 3.2696362|  6.000000| -1.826868| 0.1174970|           45.63479| -0.1308912|   0.9110842|FALSE                |
|55         |control_vs_osteoarthritis_in_medial |  2.213277| 1.3382654| 12.000000|  1.653841| 0.1240612|           37.48495|  0.0590444|   0.8896473|FALSE                |
|55         |lateral_vs_medial_in_osteoarthritis |  2.143486| 1.3382654| 12.000000|  1.601690| 0.1352052|           37.41516|  0.0572892|   0.9110842|FALSE                |
|104        |control_vs_osteoarthritis_in_medial | -1.658258| 1.0278974|  8.994291| -1.613252| 0.1411693|           34.32677| -0.0483080|   0.8896473|FALSE                |
|78         |lateral_vs_medial_in_osteoarthritis |  4.198255| 2.6002345|  6.000000|  1.614568| 0.1575314|           43.73598|  0.0959909|   0.9110842|FALSE                |
|81         |control_vs_osteoarthritis_in_medial |  1.384963| 0.9209036| 10.744242|  1.503917| 0.1614164|           35.90810|  0.0385696|   0.8896473|FALSE                |

Correct for multiple testing

As above, we will correct for multiple testing using the FDR method and construct a volcano plot. We have also flipped the comparison to harmonize these results with the accompanying journal article (positive differences mean greater abundance in osteoarthritis conditions or medial tissues).

signift_cartilage |>
  ggplot(aes(x = -1 * pct_delta, y = -log10(p.value), color = signif_after_fdr_adj)) +
  geom_point(alpha = 0.6, size = 2) +
  geom_hline(yintercept = -log10(0.01), linetype = "dashed", color = "gray40") +
  facet_wrap(~contrast, labeller = labeller(contrast = c(
    "control_vs_osteoarthritis_in_medial" = "OA vs. Control in Medial",
    "lateral_vs_medial_in_osteoarthritis" = "Medial vs. Lateral in OA"
  ))) +
  scale_x_continuous(labels = scales::percent) +
  scale_color_manual(
    values = c("TRUE" = "#E41A1C", "FALSE" = "gray50"),
    labels = c("TRUE" = "Significant", "FALSE" = "Not significant"),
    name = "FDR-adjusted"
  ) +
  labs(
    x = "Estimated Percent Difference",
    y = expression(-log[10](italic(P)~value))
  ) +
  theme_bw(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "white", color = "black"),
    strip.text = element_text(face = "bold"),
    legend.position = "bottom"
  )

  ggsave(file.path("Recreated-Figures", "Figure-Statistical-Inference-Volcano.png"), height = 5.75, width = 6, units = "in")

Inspecting the distributions of unadjusted P values for test interdependence

When performing multiple tests, the distribution of p-values can reveal information about the independence of the tests. If the tests are independent, we expect either a uniform distribution (when few tests are significant) or a right-skewed distribution with a peak near zero (when many tests are significant). Deviations from this pattern can indicate that the tests are not independent, which can affect the performance of multiple testing corrections.
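The uniform-under-the-null expectation is easy to verify by simulation (a sketch with made-up data, independent of this workflow): when every null hypothesis is true and the tests are independent, about 5% of p-values fall below 0.05 and the histogram is flat.

```r
# Simulate many independent tests where the null is true:
# both groups drawn from the same distribution
set.seed(42)
null_p <- replicate(2000, t.test(rnorm(10), rnorm(10))$p.value)

# Approximately 5% of null p-values land below 0.05,
# and counts are roughly equal across histogram bins
mean(null_p < 0.05)
hist(null_p, breaks = 10, main = "Null p-values", xlab = "p-value")
```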

signift_cartilage |> 
  ggplot(aes(x = p.value)) +
  geom_histogram(bins = 7, fill = "gray50", color = "white", boundary = 0) +
  facet_wrap(~contrast, labeller = labeller(contrast = c(
    "control_vs_osteoarthritis_in_medial" = "OA vs. Control in Medial",
    "lateral_vs_medial_in_osteoarthritis" = "Medial vs. Lateral in OA"
  ))) +
  scale_x_continuous(breaks = seq(0, 1, by = 0.2)) +
  labs(
    x = expression(italic(P)~value),
    y = "Count"
  ) +
  theme_bw(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "white", color = "black"),
    strip.text = element_text(face = "bold")
  )

  ggsave(file.path("Recreated-Figures", "SI-Figure-Statistical-Inference-PValue-Dists.png"), height = 5.75, width = 6, units = "in")

As the distribution of p-values is neither uniform nor right-skewed, there may be some interdependence between tests, though this is hard to verify with so few tests overall. Interdependence would not be surprising, however, as the tests are performed on the same samples and many metabolites are biochemically related. Better clustering of features could eliminate some of this potential interdependence. Some software packages offer methods to account for interdependence in multiple testing correction, such as the locfdr package.

The code below recreates the tables and figures from the manuscript.

knitr::kable(model_fit_eval |> select(-c(feature_id)))

signift_cartilage |> 
  filter(contrast == "control_vs_osteoarthritis_in_medial") |>
  mutate(feature_id = as.integer(feature_id)) |>
  left_join(passed) |>
  mutate(`M/Z` = round(mz, 3),
          Estimate = round(estimate, 3),
          Std.Error = round(SE, 3),
          `P Value` = round(p.value, 3),
          `FDR Adj. P value` = round(p_value_fdr, 3)) |>
  select(`M/Z`, Estimate, Std.Error, `P Value`, `FDR Adj. P value`) |> 
  arrange(`P Value`) |>
  knitr::kable()

signift_cartilage |> 
  filter(contrast == "lateral_vs_medial_in_osteoarthritis") |>
  mutate(feature_id = as.integer(feature_id)) |>
  left_join(passed) |>
  mutate(`M/Z` = round(mz, 3),
          Estimate = round(estimate, 3),
          Std.Error = round(SE, 3),
          `P Value` = round(p.value, 3),
          `FDR Adj. P value` = round(p_value_fdr, 3)) |>
  select(`M/Z`, Estimate, Std.Error, `P Value`, `FDR Adj. P value`) |> 
  arrange(`P Value`) |>
  knitr::kable()