Models specify relationships and provide estimates of parameters. Statistical testing quantifies the strength of evidence for specific hypotheses, summarized as a signal-to-noise ratio (a t-statistic). The signal is the estimate of the comparison being tested, such as the difference between the means of two conditions. The noise is the standard error of that estimate. The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming the null hypothesis is true; the smaller the p-value, the stronger the evidence against the null hypothesis. Testing requires calculating the degrees of freedom, or, more abstractly, the amount of information available to estimate each parameter. Loosely related to the sample size, degrees of freedom decrease as we estimate more parameters and generally increase with more data. As we evaluate more comparisons, we increase the chance of false positives and must adjust our p-values accordingly using a multiple testing correction.
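To make the signal-to-noise idea concrete, here is a minimal sketch (using simulated, made-up intensities, not our data) that computes a Welch t-statistic by hand and checks it against `t.test()`:

```r
# Toy two-sample comparison with simulated intensities (not the cartilage data)
set.seed(1)
a <- rnorm(10, mean = 50, sd = 5)   # e.g., Control
b <- rnorm(10, mean = 45, sd = 5)   # e.g., Osteoarthritis

signal <- mean(a) - mean(b)                           # estimate of the comparison
noise  <- sqrt(var(a) / length(a) + var(b) / length(b))  # standard error (Welch)
t_stat <- signal / noise                              # signal-to-noise ratio

# Matches the t-statistic reported by t.test() (Welch's t-test by default)
all.equal(t_stat, unname(t.test(a, b)$statistic))
```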
We will test two hypotheses:
Hypothesis A, between-subjects: Control vs. Osteoarthritis, in Medial tissues
Hypothesis B, within-subjects: Lateral vs. Medial, in the Osteoarthritis condition
To test these hypotheses, we must express them in terms of the model parameters. We can use the estimated mean intensity for different sets of tissues and conditions. For both hypotheses, the null hypothesis is that there is no difference between the mean intensities of the compared groups. For example, for Hypothesis A, under the null hypothesis we expect the estimated difference to be zero, i.e., \(\bar{Y}_{Control, Medial} - \bar{Y}_{Osteoarthritis, Medial} = 0\).
In practice, statistical software will do this translation using the structure of the model, but we must still specify to the software what tests we want it to perform.
Like statistical modeling, we can perform statistical testing in Cardinal or externally. Both approaches require specifying a contrast. Formally, a contrast is a linear combination of model parameters whose coefficients sum to zero. More simply, a contrast is a set of coefficients that, when multiplied element-wise with the model parameter estimates and summed, yields the estimate of the comparison being tested. We generally don't have to specify the coefficients ourselves and can instead refer to the terms of the model using standard statistical packages.
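As a toy illustration (the cell means below are made up, not estimates from our models), a contrast vector picks out the comparison of interest when multiplied with the cell-mean estimates:

```r
# Hypothetical cell-mean estimates for the four condition x tissue combinations,
# in the order: Control Lateral, OA Lateral, Control Medial, OA Medial
cell_means <- c(50.1, 52.9, 48.3, 57.0)   # made-up values

# Contrast for "Control Medial - Osteoarthritis Medial":
# the coefficients sum to zero and select only the two cells being compared
cc <- c(0, 0, 1, -1)
sum(cc)                  # 0 -- a valid contrast
sum(cc * cell_means)     # the estimate of the comparison: 48.3 - 57.0
```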
Cardinal::contrast() offers arguments to specify custom
contrasts using emmeans syntax, and it performs the tests
using emmeans. First, we will load the meansTest
results object from 3-Statistical-Modeling. The results object contains
a summary data frame accessible with mtops() and the
lmer models. We will also load the results from our manual
diagnosis of model fit, creating a list of features that adequately meet
model assumptions.
load(file.path("Outputs", "meansTest_cartilage_lmer.RData"))
model_fit_eval <- read_csv(file.path("Outputs", "feature-model-fit-evaluation.csv"), show_col_types = FALSE) |>
mutate(across(qq_normality:outliers, as.factor))
passed <- model_fit_eval |>
filter(qq_normality == "OK" &
residual_variance == "OK" &
outliers == "OK" )
In emmeans, which Cardinal uses for hypothesis testing
and which we will use externally, we can specify
~ condition | tissue as contrast.specs for
Hypothesis A (i.e., “condition given tissue”). To get the actual
pairwise contrasts, we supply "pairwise" as the
contrast.method. emmeans offers tremendous
functionality and versatility, with many
possible ways to specify contrasts. We will explore other syntax
when we repeat these analyses externally from Cardinal in the next part
of this workflow.
meansTest_models_with_acceptable_fit <- lmer_means_test[passed$feature_id]
h <- Cardinal::contrastTest(meansTest_models_with_acceptable_fit, ~condition * tissue, method="pairwise")
Cardinal adds tests to the results object in a wide format, with
each test's columns carrying an identifying prefix. That is quite ugly, so below is
an example of the emmeans object returned for each model,
in this case the first feature model for 1366.62 m/z.
knitr::kable(h[[1]])
| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| Control Lateral - Osteoarthritis Lateral | -14.347788 | 10.840585 | 11.66542 | -1.3235252 | 0.2110136 |
| Control Lateral - Control Medial | -2.866572 | 9.880078 | 6.00000 | -0.2901366 | 0.7814781 |
| Control Lateral - Osteoarthritis Medial | -11.556785 | 10.840585 | 11.66542 | -1.0660665 | 0.3079610 |
| Osteoarthritis Lateral - Control Medial | 11.481217 | 10.840585 | 11.66542 | 1.0590956 | 0.3109938 |
| Osteoarthritis Lateral - Osteoarthritis Medial | 2.791004 | 9.880078 | 6.00000 | 0.2824880 | 0.7870612 |
| Control Medial - Osteoarthritis Medial | -8.690213 | 10.840585 | 11.66542 | -0.8016369 | 0.4387877 |
Notice there are six tests: every pairwise comparison among the four
(2x2) unique combinations of condition and tissue. For our purposes, we only
care about Control Medial - Osteoarthritis Medial
(Hypothesis A) and
Osteoarthritis Lateral - Osteoarthritis Medial (Hypothesis
B). Each comparison offers an estimate, a
t.ratio (aka t-statistic), and a p.value. The
estimate is the effect estimate, and the t.ratio is
the signal (estimate) to noise (standard error of the estimate) ratio.
To get the reverse comparison, say,
Osteoarthritis Medial - Osteoarthritis Lateral, simply flip
the signs of the estimate and t.ratio; all else stays the same.
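Rather than flipping signs by hand, emmeans can reverse the direction of pairwise comparisons for you. A quick sketch on a built-in dataset (warpbreaks, unrelated to our data):

```r
library(emmeans)

# Toy model: two-factor design from a built-in dataset
fit <- lm(breaks ~ wool * tension, data = warpbreaks)
emm <- emmeans(fit, ~ wool | tension)

pairs(emm, adjust = "none")[1]                  # A - B within tension L
pairs(emm, reverse = TRUE, adjust = "none")[1]  # B - A: estimate changes sign
```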
We can see that the p-value for 1366.62 m/z under Hypothesis A above is large, indicating that we cannot reject the null hypothesis: any difference in mean abundance between conditions in medial tissues is not statistically significant.
Running more tests increases the probability that one of the tests
will be a false positive. We can correct for this in a variety of ways,
such as globally lowering the threshold for significance (e.g., the
Bonferroni correction). The preferred method, however, is to control the
false discovery rate (FDR): the expected proportion of false positives
among the tests we call significant. By setting an acceptable rate of
significant tests that are false (in our case 0.05, or 5%, applied
independently per hypothesis), we can apply a less strict correction than
Bonferroni. Cardinal includes a function for FDR correction,
topFeatures.
kableExtra::kbl(topFeatures(h, n=5, sort.by = 2)) |>
kableExtra::kable_styling(full_width = FALSE) |>
kableExtra::scroll_box(width = "100%", height = "200px")
| i | mz | singular | Control.Lateral…Osteoarthritis.Lateral.estimate | Control.Lateral…Control.Medial.estimate | Control.Lateral…Osteoarthritis.Medial.estimate | Osteoarthritis.Lateral…Control.Medial.estimate | Osteoarthritis.Lateral…Osteoarthritis.Medial.estimate | Control.Medial…Osteoarthritis.Medial.estimate | Control.Lateral…Osteoarthritis.Lateral.pvalue | Control.Lateral…Control.Medial.pvalue | Control.Lateral…Osteoarthritis.Medial.pvalue | Osteoarthritis.Lateral…Control.Medial.pvalue | Osteoarthritis.Lateral…Osteoarthritis.Medial.pvalue | Control.Medial…Osteoarthritis.Medial.pvalue | Control.Lateral…Osteoarthritis.Lateral.fdr | Control.Lateral…Control.Medial.fdr | Control.Lateral…Osteoarthritis.Medial.fdr | Osteoarthritis.Lateral…Control.Medial.fdr | Osteoarthritis.Lateral…Osteoarthritis.Medial.fdr | Control.Medial…Osteoarthritis.Medial.fdr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 1366.620 | FALSE | -14.347788 | -2.866572 | -11.5567846 | 11.4812166 | 2.7910036 | -8.690213 | 0.2110136 | 0.7814781 | 0.3079610 | 0.3109938 | 0.7870612 | 0.4387877 | 0.9677784 | 0.8994715 | 0.9556276 | 0.9942336 | 0.9110842 | 0.8896473 |
| 8 | 1367.616 | TRUE | -12.151073 | -7.833456 | -9.0934830 | 4.3176173 | 3.0575903 | -1.260027 | 0.3402864 | 0.5340797 | 0.4716782 | 0.7303139 | 0.8069040 | 0.9196825 | 0.9995921 | 0.8151743 | 0.9556276 | 0.9942336 | 0.9110842 | 0.9580111 |
| 9 | 1381.628 | TRUE | -3.268407 | -3.078269 | 0.4005776 | 0.1901371 | 3.6689841 | 3.478847 | 0.4757351 | 0.5012576 | 0.9295905 | 0.9665416 | 0.4246677 | 0.4484445 | 0.9995921 | 0.8151743 | 0.9556276 | 0.9942336 | 0.9110842 | 0.8896473 |
| 14 | 1428.646 | FALSE | -6.604205 | -6.230537 | -12.6595823 | 0.3736678 | -6.0553774 | -6.429045 | 0.6607162 | 0.2291452 | 0.4093875 | 0.9800342 | 0.2409500 | 0.6690491 | 0.9995921 | 0.6645211 | 0.9556276 | 0.9942336 | 0.9110842 | 0.8896473 |
| 17 | 1437.631 | FALSE | -1.851057 | -3.691546 | -2.3300860 | -1.8404899 | -0.4790294 | 1.361460 | 0.5621189 | 0.0148208 | 0.4688661 | 0.5642960 | 0.6761225 | 0.6679844 | 0.9995921 | 0.1777739 | 0.9556276 | 0.9942336 | 0.9110842 | 0.8896473 |
topFeatures applies FDR correction to every set of
p-values independently. This is important because FDR correction is
sensitive to the distribution of p-values. We will explore this later.
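For reference, the underlying adjustment can be reproduced with base R's p.adjust(). With these hypothetical p-values (not from this analysis), Bonferroni flags two tests at 0.05 while BH/FDR flags four:

```r
# Hypothetical p-values (not from this analysis), sorted for readability
p <- c(0.001, 0.005, 0.01, 0.02, 0.3, 0.7)

# Bonferroni: multiply each p-value by the number of tests (capped at 1);
# controls the family-wise error rate, so it is very strict
p.adjust(p, method = "bonferroni")

# Benjamini-Hochberg ("fdr"): controls the expected proportion of false
# positives among significant tests; less strict, so more tests survive
p.adjust(p, method = "fdr")
```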
Now, let’s extract our p-values and see if any are significant.
results_tibble_h1 <- as_tibble(topFeatures(h)) |>
select(i, mz, starts_with("Control.Medial...Osteoarthritis.Medial.")) |>
rename_with(~ str_remove(.x, fixed("Control.Medial...Osteoarthritis.Medial.")),
starts_with("Control.Medial...Osteoarthritis.Medial.")) |>
arrange(pvalue) |>
mutate(hypo = "Hypothesis A")
results_tibble_h2 <- as_tibble(topFeatures(h)) |>
select(i, mz, starts_with("Osteoarthritis.Lateral...Osteoarthritis.Medial.")) |>
rename_with(~ str_remove(.x, fixed("Osteoarthritis.Lateral...Osteoarthritis.Medial.")),
starts_with("Osteoarthritis.Lateral...Osteoarthritis.Medial.")) |>
arrange(pvalue) |>
mutate(hypo = "Hypothesis B")
knitr::kable(rbind(results_tibble_h1, results_tibble_h2) |>
mutate(`Significant at 0.1 after FDR adjustment` = fdr < 0.1) |>
dplyr::summarize(Count = dplyr::n(), .by = c(hypo, `Significant at 0.1 after FDR adjustment`)))
| hypo | Significant at 0.1 after FDR adjustment | Count |
|---|---|---|
| Hypothesis A | FALSE | 29 |
| Hypothesis B | FALSE | 29 |
Unfortunately, after adjustment, no findings for either hypothesis are significant. Regardless, let’s make a volcano plot using the unadjusted p-values, with a line at unadjusted p = 0.01. (Keep in mind this plot may look different from the one in the accompanying publication, as it plots the raw estimate rather than the percent difference and may use the flipped comparison; neither affects the significance. We will recreate that plot below.)
rbind(results_tibble_h1, results_tibble_h2) |>
ggplot(aes(x = estimate, y = -log(pvalue))) +
geom_point(alpha = 0.25) +
facet_wrap("hypo") +
geom_hline(yintercept = -log(0.01)) +
xlab("Estimated Difference") +
ylab("-Log(Unadjusted P Value)")
As in Cardinal, we will test the comparisons externally with
emmeans, showing multiple ways to specify the same
comparison. First, let’s load the models we fit in the last
vignette.
load(file.path("Outputs","cartilage_results.RData"))
cartilage_models_OK_fit <- cartilage_res[[1]][passed$feature_id]
We will write a function to test both hypotheses for each model,
but first, let’s experiment with different ways to specify contrasts in
emmeans, again using the model for 1366.62 m/z.
m <- cartilage_models_OK_fit[[1]]
We can specify the same pairwise comparison of conditions within tissues
as above, using the same formula syntax but calling the emmeans
functions directly and selecting only the second contrast from each set of pairs.
#Hypothesis A
emmeans_grid <- emmeans(m, ~ condition | tissue, lmer.df = "satterthwaite")
emmeans::contrast(emmeans_grid, "pairwise", adjust = "none")[2]
## contrast tissue estimate SE df t.ratio p.value
## Control - Osteoarthritis Medial -8.69 10.8 11.7 -0.802 0.4388
##
## Degrees-of-freedom method: satterthwaite
#Hypothesis B
emmeans_grid <- emmeans(m, ~ tissue | condition, lmer.df = "satterthwaite")
emmeans::contrast(emmeans_grid, "pairwise", adjust = "none")[2]
## contrast condition estimate SE df t.ratio p.value
## Lateral - Medial Osteoarthritis 2.79 9.88 6 0.282 0.7871
##
## Degrees-of-freedom method: satterthwaite
These function calls use the same defaults as
Cardinal::meansTest: the Satterthwaite degrees-of-freedom
method and no initial p-value adjustment. They also produce the same
two contrasts as before. We can get the same results by passing the
emmeans object to the pairwise wrapper function,
pairs:
#Hypothesis A using pairs()
emmeans_grid <- emmeans(m, ~ condition | tissue, lmer.df = "satterthwaite")
pairs(emmeans_grid, adjust = "none")[2]
## contrast tissue estimate SE df t.ratio p.value
## Control - Osteoarthritis Medial -8.69 10.8 11.7 -0.802 0.4388
##
## Degrees-of-freedom method: satterthwaite
#Hypothesis B using pairs()
emmeans_grid <- emmeans(m, ~ tissue | condition, lmer.df = "satterthwaite")
pairs(emmeans_grid, adjust = "none")[2]
## contrast condition estimate SE df t.ratio p.value
## Lateral - Medial Osteoarthritis 2.79 9.88 6 0.282 0.7871
##
## Degrees-of-freedom method: satterthwaite
Or, we can directly specify the contrast vectors for just the two contrasts of interest. The coefficients below follow the row order of the emmeans reference grid (Control Lateral, Osteoarthritis Lateral, Control Medial, Osteoarthritis Medial):
emmeans_grid <- emmeans(m, ~ condition * tissue, lmer.df = "satterthwaite")
contrast(emmeans_grid,
list(control_vs_osteoarthritis_in_medial = c(0, 0, 1, -1),
lateral_vs_medial_in_osteoarthritis = c(0, 1, 0, -1)),
adjust = "none")
## contrast estimate SE df t.ratio p.value
## control_vs_osteoarthritis_in_medial -8.69 10.80 11.7 -0.802 0.4388
## lateral_vs_medial_in_osteoarthritis 2.79 9.88 6.0 0.282 0.7871
##
## Degrees-of-freedom method: satterthwaite
Now, let’s write a function to test Hypotheses A and B on all models. The function also extracts the estimated mean of the reference group so that we can compute the percent change.
signift_cartilage <- bind_rows(lapply(cartilage_models_OK_fit, FUN = function(m){
emmeans_grid <- emmeans(m, ~ condition * tissue, lmer.df = "satterthwaite")
k <- contrast(emmeans_grid,
list(control_vs_osteoarthritis_in_medial = c(0, 0, 1, -1),
lateral_vs_medial_in_osteoarthritis = c(0, 1, 0, -1)),
adjust = "none")
as_tibble(k) |> mutate(reference_estimate = as_tibble(emmeans_grid) |>
filter((condition == "Control" & tissue == "Medial" )|
(condition == "Osteoarthritis" & tissue == "Lateral")) |>
arrange(desc(tissue)) |> pull(emmean)) |>
mutate(pct_delta = estimate/reference_estimate)
}), .id = "feature_id") |>
group_by(contrast) |>
mutate(p_value_fdr = p.adjust(p.value, method = "fdr")) |>
ungroup() |>
mutate(signif_after_fdr_adj = p_value_fdr < 0.1) |>
arrange(p.value)
save(signift_cartilage, file = file.path("Outputs", "signift_cartilage.RData"))
knitr::kable(head(signift_cartilage))
| feature_id | contrast | estimate | SE | df | t.ratio | p.value | reference_estimate | pct_delta | p_value_fdr | signif_after_fdr_adj |
|---|---|---|---|---|---|---|---|---|---|---|
| 18 | lateral_vs_medial_in_osteoarthritis | -5.973193 | 3.2696362 | 6.000000 | -1.826868 | 0.1174970 | 45.63479 | -0.1308912 | 0.9110842 | FALSE |
| 55 | control_vs_osteoarthritis_in_medial | 2.213277 | 1.3382654 | 12.000000 | 1.653841 | 0.1240612 | 37.48495 | 0.0590444 | 0.8896473 | FALSE |
| 55 | lateral_vs_medial_in_osteoarthritis | 2.143486 | 1.3382654 | 12.000000 | 1.601690 | 0.1352052 | 37.41516 | 0.0572892 | 0.9110842 | FALSE |
| 104 | control_vs_osteoarthritis_in_medial | -1.658258 | 1.0278974 | 8.994291 | -1.613252 | 0.1411693 | 34.32677 | -0.0483080 | 0.8896473 | FALSE |
| 78 | lateral_vs_medial_in_osteoarthritis | 4.198255 | 2.6002345 | 6.000000 | 1.614568 | 0.1575314 | 43.73598 | 0.0959909 | 0.9110842 | FALSE |
| 81 | control_vs_osteoarthritis_in_medial | 1.384963 | 0.9209036 | 10.744242 | 1.503917 | 0.1614164 | 35.90810 | 0.0385696 | 0.8896473 | FALSE |
As above, we have corrected for multiple testing using the FDR method and will now construct a volcano plot. We have also flipped the comparison to harmonize these results with the accompanying journal article (positive differences mean greater abundance in osteoarthritis conditions or medial tissues).
signift_cartilage |>
ggplot(aes(x = -1 * pct_delta, y = -log10(p.value), color = signif_after_fdr_adj)) +
geom_point(alpha = 0.6, size = 2) +
geom_hline(yintercept = -log10(0.01), linetype = "dashed", color = "gray40") +
facet_wrap(~contrast, labeller = labeller(contrast = c(
"control_vs_osteoarthritis_in_medial" = "OA vs. Control in Medial",
"lateral_vs_medial_in_osteoarthritis" = "Medial vs. Lateral in OA"
))) +
scale_x_continuous(labels = scales::percent) +
scale_color_manual(
values = c("TRUE" = "#E41A1C", "FALSE" = "gray50"),
labels = c("TRUE" = "Significant", "FALSE" = "Not significant"),
name = "FDR-adjusted"
) +
labs(
x = "Estimated Percent Difference",
y = expression(-log[10](italic(P)~value))
) +
theme_bw(base_size = 12) +
theme(
panel.grid.minor = element_blank(),
strip.background = element_rect(fill = "white", color = "black"),
strip.text = element_text(face = "bold"),
legend.position = "bottom"
)
ggsave(file.path("Recreated-Figures", "Figure-Statistical-Inference-Volcano.png"), height = 5.75, width = 6, units = "in")
When performing multiple tests, the distribution of p-values can reveal information about the independence of the tests. If the tests are independent, we expect either a uniform distribution (when few tests are significant) or a right-skewed distribution with a peak near zero (when many tests are significant). Deviations from this pattern can indicate that the tests are not independent, which can affect the performance of multiple testing corrections.
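A quick simulation (toy data, unrelated to the cartilage analysis) shows why we expect uniformity when the null holds for independent tests:

```r
# Simulate p-values from 5,000 independent t-tests where the null is true
# (both groups drawn from the same distribution)
set.seed(42)
null_p <- replicate(5000, t.test(rnorm(10), rnorm(10))$p.value)

# The histogram is approximately flat: under the null, p-values are uniform
hist(null_p, breaks = 20, main = "P-values under the null", xlab = "p-value")
```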
signift_cartilage |>
ggplot(aes(x = p.value)) +
geom_histogram(bins = 7, fill = "gray50", color = "white", boundary = 0) +
facet_wrap(~contrast, labeller = labeller(contrast = c(
"control_vs_osteoarthritis_in_medial" = "OA vs. Control in Medial",
"lateral_vs_medial_in_osteoarthritis" = "Medial vs. Lateral in OA"
))) +
scale_x_continuous(breaks = seq(0, 1, by = 0.2)) +
labs(
x = expression(italic(P)~value),
y = "Count"
) +
theme_bw(base_size = 12) +
theme(
panel.grid.minor = element_blank(),
strip.background = element_rect(fill = "white", color = "black"),
strip.text = element_text(face = "bold")
)
ggsave(file.path("Recreated-Figures", "SI-Figure-Statistical-Inference-PValue-Dists.png"), height = 5.75, width = 6, units = "in")
As the distribution of p-values is neither uniform nor right-skewed,
there may be some interdependence between tests, though this is hard to
verify with so few tests overall. Interdependence would not be surprising,
however, as the tests are performed on the same samples
and many metabolites are biochemically related. Better clustering of
features could remove some of this potential interdependence. Some
software packages, such as the locfdr package, offer
methods that account for interdependence in multiple testing
correction.
The code below recreates the tables and figures from the manuscript.
knitr::kable(model_fit_eval |> select(-c(feature_id)))
signift_cartilage |>
filter(contrast == "control_vs_osteoarthritis_in_medial") |>
mutate(feature_id = as.integer(feature_id)) |>
left_join(passed) |>
mutate(`M/Z` = round(mz, 3),
Estimate = round(estimate, 3),
Std.Error = round(SE, 3),
`P Value` = round(p.value, 3),
`FDR Adj. P value` = round(p_value_fdr, 3)) |>
select(`M/Z`, Estimate, Std.Error, `P Value`, `FDR Adj. P value`) |>
arrange(`P Value`) |>
knitr::kable()
signift_cartilage |>
filter(contrast == "lateral_vs_medial_in_osteoarthritis") |>
mutate(feature_id = as.integer(feature_id)) |>
left_join(passed) |>
mutate(`M/Z` = round(mz, 3),
Estimate = round(estimate, 3),
Std.Error = round(SE, 3),
`P Value` = round(p.value, 3),
`FDR Adj. P value` = round(p_value_fdr, 3)) |>
select(`M/Z`, Estimate, Std.Error, `P Value`, `FDR Adj. P value`) |>
arrange(`P Value`) |>
knitr::kable()