mediation_sim_may_WIP
Building a generative model step by step
My goal here is to build a model that describes the essential aspects of the data-generating process that our statistical test will attempt to estimate.
Each individual will be part of either the IU treatment group or the MC treatment group. For the duration of the treatment, the mediator is assumed to gradually change in an approximately linear way. Before considering comparisons between groups, let’s first imagine a patient in one group with a changing mediator, \(m^a\). The residuals of the model, with standard deviation \(\sigma\), are assumed to be normally distributed with a mean of 0. \(t\) is a variable that represents time, measured in weeks from 0 to 9.
\[ m_i \sim N(\mu_i,\sigma) \\ \mu_i = \alpha + \beta t_i \]
For now let’s just imagine the data-generating process for a single patient. When generating data, parameters like \(\alpha\) and \(\beta\) will have “priors” that determine the distributions the data points are drawn from. For our purposes the inputs to those distribution functions are somewhat arbitrary, and they do not represent carefully thought-through Bayesian priors. Because I’m not yet trying to model uncertainty, I’ll skip specifying \(\sigma\) as a distribution; instead it will simply be set to 0.3.
\[ \alpha \sim N(0,1) \\ \beta \sim N(-0.1,0.05) \]
The distribution of \(\alpha\) now specifies the spread of mediator intercepts (week = 0), while the distribution of \(\beta\) specifies both the average slope of the mediator and the heterogeneity of slopes. Let’s simulate and plot 1 patient:
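n <- 1
mediator_intercept_sd <- 1
avg_mediator_slope <- -0.1
mediator_slope_sd <- 0.05
weeks <- 0:9
sigma <- 0.3

m_1 <- rnorm(n, 0, mediator_intercept_sd) + rnorm(n, avg_mediator_slope, mediator_slope_sd)*weeks + rnorm(n*length(weeks), 0, sigma)

plot(weeks, m_1, ylim = c(-2,2))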
Some additional things to be explicit about: \(m_i\) would change linearly if not for \(\sigma\). This \(\sigma\) can be seen as representing both “measurement error” and exogenous influences; while our treatment is hypothesized to affect the mediator, it is not likely to be the only thing affecting it. We also imagine that the change is heterogeneous: the treatment is not expected to work equally well for everyone.
Now let’s model how this might look for many different patients \(j=1,...,n\). The value of a data point \(\mu_i\) (if not for its error/exogenous influences) can now be given by this formula:
\[ m_i \sim N(\mu_i,\sigma) \\ \mu_{i} = \alpha_{j[i]} \ + \ \beta_{j[i]}t_i \\ \alpha_j \sim N(0,1) \\ \beta_j \sim N(-0.1,0.05) \\ \]
The only thing that has changed is that we’re now imagining that the slopes and intercepts belong to many different patients. To generate this we could simply loop the code many times.
n <- 7 #note that the code calls nr patients "n" instead of j.
mediator_intercept_sd <- 1
avg_mediator_slope <- -0.1
mediator_slope_sd <- 0.05
weeks <- 0:9
sigma <- 0.3
rho <- -0.15

df <- data.frame(m = c(), week = c(), id = c())

for(i in 1:n){
  m_1 <- rnorm(1, 0, mediator_intercept_sd) + rnorm(1, avg_mediator_slope, mediator_slope_sd)*weeks + rnorm(length(weeks), 0, sigma)
  df <- rbind.data.frame(df, data.frame(m = m_1, week = weeks, id = rep(i, times = 10)))
}
In real data it is common to see a negative slope-intercept correlation, where a higher value on an individual’s \(\alpha\) is associated with a more negative \(\beta\). Our data-generating process should model this as well. To achieve this we need to let the intercepts and slopes be drawn from a multivariate normal distribution.
\[ \begin{align} y_i &\sim N(\mu_i,\sigma) \\ \mu_i &= \alpha_{j[i]} + \beta_{j[i]}t_i \\ \begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix} &\sim \text{MVN}\left(\begin{pmatrix} \mu_\alpha \\ \mu_\beta \end{pmatrix}, \Sigma\right) \\ \mu_\alpha &= 0 \\ \mu_\beta &= -0.1 \\ \Sigma &= \begin{pmatrix} \sigma_\alpha^2 & \rho\sigma_\alpha\sigma_\beta \\ \rho\sigma_\alpha\sigma_\beta & \sigma_\beta^2 \end{pmatrix} \\ \rho &= -0.15 \\ \sigma_\alpha &= 1 \\ \sigma_\beta &= 0.05 \end{align} \]
\(\Sigma\) now represents a common covariance matrix for the slopes and intercepts. The diagonal of the matrix contains the intercept and slope variances from before, but we now also have a \(\rho\sigma_\alpha \sigma_\beta\) term that represents their covariance. When generating data I’ve set \(\rho\) to -0.15.
Updating the code and looping for 7 patients looks like this:
#Now with slope-intercept correlation
n <- 7 #note that the code calls nr patients "n" instead of j.
mediator_intercept_sd <- 1
avg_mediator_slope <- -0.1
mediator_slope_sd <- 0.05
weeks <- 0:9
sigma <- 0.3
rho <- -0.15

# Set up multivariate normal parameters
mu <- c(0, avg_mediator_slope) # means for intercept and slope
Sigma <- matrix(c(mediator_intercept_sd^2,
                  rho * mediator_intercept_sd * mediator_slope_sd,
                  rho * mediator_intercept_sd * mediator_slope_sd,
                  mediator_slope_sd^2),
                nrow = 2, ncol = 2)

df <- data.frame(m = c(), week = c(), id = c())

for(i in 1:n){
  # Generate correlated intercept and slope for this participant
  params <- MASS::mvrnorm(1, mu = mu, Sigma = Sigma)
  alpha_i <- params[1] # intercept
  beta_i <- params[2] # slope

  # Calculate values for all weeks for this participant
  m_1 <- alpha_i + beta_i * weeks + rnorm(length(weeks), 0, sigma)

  df <- rbind.data.frame(df, data.frame(m = m_1, week = weeks, id = rep(i, times = length(weeks))))
}
Let’s plot it:
library(ggplot2)

p <- ggplot(df, aes(y = m, x = week, group = as.factor(id), color = as.factor(id))) +
  geom_point(aes(y = m), alpha = 0.8) +
  geom_line(aes(y = m), alpha = 0.8) +
  theme_minimal()
print(p)
Modeling the outcome
We imagine that the outcome (worry symptoms) will be affected by treatment in a similar way, decreasing linearly as patients spend time working with the treatment protocol, while being measured with some degree of error. Importantly, the outcome will be partially affected by the mediator and partially by exogenous influences.
We can expand the model to describe two variables, \(y_i\) and \(m_i\). Since we model two “outcomes” we also get two residual terms, \(\sigma_y\) and \(\sigma_m\).
\[ \begin{align} y_i &\sim N(\mu_{y,i}, \sigma_y) \\ m_i &\sim N(\mu_{m,i}, \sigma_m) \\ \mu_{y,i} &= \alpha_{y,j[i]} + \beta_{y,j[i]}t_i + \gamma_{j[i]}(m_i-\bar{m_j}) \\ \mu_{m,i} &= \alpha_{m,j[i]} + \beta_{m,j[i]}t_i \\ \begin{pmatrix} \alpha_{m,j} \\ \beta_{m,j} \\ \alpha_{y,j} \\ \beta_{y,j} \\ \gamma_j \end{pmatrix} &\sim \text{MVN}\left(\begin{pmatrix} \mu_{\alpha,m} \\ \mu_{\beta,m} \\ \mu_{\alpha,y} \\ \mu_{\beta,y} \\ \mu_{\gamma} \end{pmatrix}, \Sigma\right) \\ \end{align} \]
A lot of things have happened in this step. Our alphas and betas now have subscripts that relate them to either the outcome or the mediator. We’ve also added a \(\gamma_j\), which describes the effect of the mediator on the outcome. Note that this assumes a constant mediator-outcome relationship for each participant \(j\): constant in the sense that it doesn’t vary over time, although the relationship is allowed to vary by participant. Note also that the mediator enters the outcome model centered within participant, \(m_i-\bar{m_j}\). (A working note on centering: not centering the mediator meant it basically pushed the \(y\) values further from zero; centered, it instead represents some sort of average slope of the mediator, balancing on an axis in the middle of the dataset.)
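As a minimal sketch of what that within-participant centering does (the numbers below are made up for illustration, not taken from the simulation):
# Hypothetical mediator values for one participant, just to illustrate the centering
m_j     <- c(2.0, 1.8, 1.5, 1.1)
gamma_j <- 0.8

gamma_j * m_j                 # uncentered: adds a large overall shift to the y values
gamma_j * (m_j - mean(m_j))   # centered: only deviations from the participant's mean affect y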
Importantly, we now have a joint covariance matrix \(\Sigma\) that structures the multivariate normal distribution from which all individual-level parameters are drawn.
Expanded covariance matrix:
\[ \Sigma = \begin{pmatrix} \sigma_{\alpha_m}^2 & \rho_{\alpha_m,\beta_m}\sigma_{\alpha_m}\sigma_{\beta_m} & \rho_{\alpha_m,\alpha_y}\sigma_{\alpha_m}\sigma_{\alpha_y} & \rho_{\alpha_m,\beta_y}\sigma_{\alpha_m}\sigma_{\beta_y} & \rho_{\alpha_m,\gamma}\sigma_{\alpha_m}\sigma_{\gamma} \\ \cdot & \sigma_{\beta_m}^2 & \rho_{\beta_m,\alpha_y}\sigma_{\beta_m}\sigma_{\alpha_y} & \rho_{\beta_m,\beta_y}\sigma_{\beta_m}\sigma_{\beta_y} & \rho_{\beta_m,\gamma}\sigma_{\beta_m}\sigma_{\gamma} \\ \cdot & \cdot & \sigma_{\alpha_y}^2 & \rho_{\alpha_y,\beta_y}\sigma_{\alpha_y}\sigma_{\beta_y} & \rho_{\alpha_y,\gamma}\sigma_{\alpha_y}\sigma_{\gamma} \\ \cdot & \cdot & \cdot & \sigma_{\beta_y}^2 & \rho_{\beta_y,\gamma}\sigma_{\beta_y}\sigma_{\gamma} \\ \cdot & \cdot & \cdot & \cdot & \sigma_{\gamma}^2 \end{pmatrix} \]
where \(\alpha_{m,j}\) and \(\beta_{m,j}\) are the mediator intercept and slope for participant \(j\), \(\alpha_{y,j}\) and \(\beta_{y,j}\) are their outcome intercept and slope, and \(\gamma_j\) is their mediator-outcome relationship.
These slopes and intercepts are drawn from a multivariate normal distribution. Like before, \(\sigma_{\alpha}\) and \(\sigma_{\beta}\) describe the intercept standard deviation and the heterogeneity of the change over time, but they now have subscripts that specify whether they belong to the mediator or the outcome. We describe heterogeneity of the mediator-outcome relationship with \(\sigma_{\gamma}\).
We now also have ten (!) \(\rho\) terms that capture the correlations between individuals’ intercepts, slopes, and mediator effects. (A compact way to construct \(\Sigma\) from these pieces is sketched after the list below.)
Let’s think through these one by one!
- \(\rho_{\alpha_m,\beta_m}\) is the slope-intercept correlation that we previously defined in the model without the outcome. This should probably be slightly negative to reflect that people who are already at a high level in the mediator tend to have less room to get even worse (and vice versa). A “regression towards the mean”-like effect.
- \(\rho_{\alpha_m,\alpha_y}\) is the correlation between the intercept of the mediator and the intercept of the outcome. This should be moderately positive to reflect that individuals with a high level of our mediating variable (e.g. intolerance of uncertainty) tend to be more worried (at the start of treatment).
- \(\rho_{\alpha_m,\beta_y}\) is more conceptually tricky. It represents whether individuals who are high on the mediator at the start of treatment tend to change more in the outcome. My stab at it would be that this should also be a slight negative correlation, but probably weaker than the slope-intercept correlation for the mediator itself.
- \(\rho_{\alpha_m,\gamma}\) represents the relationship between an individual’s mediator-outcome relationship and their initial level of the mediator. For example, do individuals with a higher level of negative metacognitions beforehand also have a stronger overall relationship between their negative metacognitions and their worry? I’m not sure what to make of that. My first hunch is no; while there would be an overall relationship between the mediator and the outcome, \(\mu_\gamma\), it wouldn’t necessarily vary by initial negative metacognition level.
- \(\rho_{\beta_m,\alpha_y}\) is similarly tricky, but like \(\rho_{\alpha_m,\beta_y}\) I would argue for “slightly negative, but less so than the within-variable intercept-slope correlation”.
- \(\rho_{\beta_m,\beta_y}\) is potentially a very relevant parameter of interest. This should be positive: individuals whose mediator changes more should have outcomes that change more.
- \(\rho_{\beta_m,\gamma}\) is also tricky. It represents how mediator slope relates to the mediator-outcome relationship, for a given participant. One could perhaps frame it as a type of individual differences in treatment response. Perhaps participants where the mediator changes more are the same participants where the mediator is strongly related to the outcome. Now that I type it out, that sounds pretty reasonable.
- \(\rho_{\alpha_y,\beta_y}\) is the slope-intercept correlation for the outcome. In line with previous reasoning: probably slightly negative.
- \(\rho_{\alpha_y,\gamma}\) is the relationship between initial worry and the mediator-outcome relationship. Since that’s the flip side of \(\rho_{\alpha_m,\gamma}\), similar reasoning should hold.
- \(\rho_{\beta_y,\gamma}\) is also tricky. It should probably be similar to \(\rho_{\beta_m,\gamma}\), so that those whose outcome changes more tend to have a stronger mediator-outcome relationship (since we’re imagining they change because of the changing mediator).
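As mentioned above, a more compact way to assemble this covariance matrix is \(\Sigma = D\,R\,D\), where \(D\) is a diagonal matrix of the standard deviations and \(R\) is the correlation matrix. The sketch below builds the same matrix as the element-by-element code further down, using the same values; it is just an alternative construction, not part of the original code.
# Compact alternative: Sigma = D %*% R %*% D, with D = diag(SDs) and R the correlation matrix.
# The values mirror those specified in the data-generating code below.
sds <- c(1, 0.06, 0.95, 0.04, 0.2)   # sd_alpha_m, sd_beta_m, sd_alpha_y, sd_beta_y, sd_gamma

R <- diag(5)
R[1,2] <- R[2,1] <- -0.15  # rho_alpha_m_beta_m
R[1,3] <- R[3,1] <-  0.90  # rho_alpha_m_alpha_y
R[1,4] <- R[4,1] <- -0.10  # rho_alpha_m_beta_y
R[2,4] <- R[4,2] <-  0.24  # rho_beta_m_beta_y
R[2,5] <- R[5,2] <-  0.25  # rho_beta_m_gamma
R[3,4] <- R[4,3] <- -0.16  # rho_alpha_y_beta_y
R[4,5] <- R[5,4] <-  0.20  # rho_beta_y_gamma
# (all other off-diagonal correlations are 0, as in the code below)

Sigma_RE_alt <- diag(sds) %*% R %*% diag(sds)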
Finally, we can complicate the model even further by adding a correlation between the residuals of the mediator and the outcome. The outcome is then written with an explicit residual, \(y_i = \mu_{y,i} + \epsilon_{y,i}\) (and likewise for \(m_i\)), where the residuals come from a multivariate normal distribution whose covariance matrix contains \(\sigma^2_m\), \(\sigma^2_y\), and their covariance \(\rho_{\epsilon}\sigma_m\sigma_y\).
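Written out with the notation above, the residual structure is:
\[ \begin{pmatrix} \epsilon_{m,i} \\ \epsilon_{y,i} \end{pmatrix} \sim \text{MVN}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_m^2 & \rho_{\epsilon}\sigma_m\sigma_y \\ \rho_{\epsilon}\sigma_m\sigma_y & \sigma_y^2 \end{pmatrix}\right) \]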
Should we posit such a relationship? I think it’s unclear. We’re then saying that despite all these things we’re describing, our \(\boldsymbol\mu\)s and our \(\rho\)s and our \(\gamma\)s, there’s still an additional, not yet captured, relationship between the mediator and the outcome.
The code below contains it, but I will set the relationship to zero when generating data.
(Note: unlike the previous code, I haven’t done any checks on whether this fully works.)
n <- 7 #note that the code calls nr patients "n" instead of j.

# Parameters for the 5-dimensional MVN
mu_alpha_m <- 0 # specify mediator intercept mean
mu_beta_m <- -0.1 # specify mediator slope mean
mu_alpha_y <- 0 # specify outcome intercept mean
mu_beta_y <- -0.11 # specify outcome slope mean
mu_gamma <- 0.8 # specify mediator-outcome relationship mean

# Standard deviations for random effects
sd_alpha_m <- 1 # specify mediator intercept SD
sd_beta_m <- 0.06 # specify mediator slope SD
sd_alpha_y <- 0.95 # specify outcome intercept SD
sd_beta_y <- 0.04 # specify outcome slope SD
sd_gamma <- 0.2 # specify mediator-outcome relationship SD

# Residual parameters
sigma_m <- 0.3
sigma_y <- 0.4
rho_epsilon <- 0 # residual correlation

weeks <- 0:9

# Set up 5-dimensional multivariate normal parameters
mu_vec <- c(mu_alpha_m, mu_beta_m, mu_alpha_y, mu_beta_y, mu_gamma)

# Create 5x5 covariance matrix (you can specify correlations as needed)
rho_alpha_m_beta_m <- -0.15 # mediator slope-intercept correlation
rho_alpha_m_alpha_y <- 0.9 # mediator-outcome intercept correlation
rho_alpha_m_beta_y <- -0.1 # mediator-intercept outcome-slope correlation
rho_alpha_m_gamma <- 0 # mediator intercept-mediator effect correlation
rho_beta_m_alpha_y <- 0 # mediator slope-outcome intercept correlation
rho_beta_m_beta_y <- 0.24 # mediator slope-outcome slope correlation
rho_beta_m_gamma <- 0.25 # mediator slope-mediator effect correlation

rho_alpha_y_beta_y <- -0.16 # outcome slope-intercept correlation
rho_alpha_y_gamma <- 0 # outcome intercept-mediator effect correlation
rho_beta_y_gamma <- 0.20 # outcome slope-mediator effect correlation

Sigma_RE <- matrix(0, nrow = 5, ncol = 5)
diag(Sigma_RE) <- c(sd_alpha_m^2, sd_beta_m^2, sd_alpha_y^2, sd_beta_y^2, sd_gamma^2)

# Fill in correlations (symmetric matrix)
Sigma_RE[1,2] <- Sigma_RE[2,1] <- rho_alpha_m_beta_m * sd_alpha_m * sd_beta_m
Sigma_RE[1,3] <- Sigma_RE[3,1] <- rho_alpha_m_alpha_y * sd_alpha_m * sd_alpha_y
Sigma_RE[1,4] <- Sigma_RE[4,1] <- rho_alpha_m_beta_y * sd_alpha_m * sd_beta_y
Sigma_RE[1,5] <- Sigma_RE[5,1] <- rho_alpha_m_gamma * sd_alpha_m * sd_gamma

Sigma_RE[2,3] <- Sigma_RE[3,2] <- rho_beta_m_alpha_y * sd_beta_m * sd_alpha_y
Sigma_RE[2,4] <- Sigma_RE[4,2] <- rho_beta_m_beta_y * sd_beta_m * sd_beta_y
Sigma_RE[2,5] <- Sigma_RE[5,2] <- rho_beta_m_gamma * sd_beta_m * sd_gamma

Sigma_RE[3,4] <- Sigma_RE[4,3] <- rho_alpha_y_beta_y * sd_alpha_y * sd_beta_y
Sigma_RE[3,5] <- Sigma_RE[5,3] <- rho_alpha_y_gamma * sd_alpha_y * sd_gamma

Sigma_RE[4,5] <- Sigma_RE[5,4] <- rho_beta_y_gamma * sd_beta_y * sd_gamma

# Residual covariance matrix
Sigma_res <- matrix(c(sigma_m^2, rho_epsilon * sigma_m * sigma_y,
                      rho_epsilon * sigma_m * sigma_y, sigma_y^2),
                    nrow = 2, ncol = 2)

df <- data.frame(m = c(), y = c(), week = c(), id = c())

for(i in 1:n){
  # Generate 5 correlated random effects for this participant
  params <- MASS::mvrnorm(1, mu = mu_vec, Sigma = Sigma_RE)
  alpha_m_i <- params[1] # mediator intercept
  beta_m_i <- params[2] # mediator slope
  alpha_y_i <- params[3] # outcome intercept
  beta_y_i <- params[4] # outcome slope
  gamma_i <- params[5] # mediator-outcome relationship

  # Generate correlated residuals for all timepoints
  residuals <- MASS::mvrnorm(length(weeks), mu = c(0, 0), Sigma = Sigma_res)
  epsilon_m <- residuals[, 1]
  epsilon_y <- residuals[, 2]

  # Calculate mediator values
  mu_m <- alpha_m_i + beta_m_i * weeks
  m_vals <- mu_m + epsilon_m

  # CENTER THE MEDIATOR BY PARTICIPANT
  m_centered <- m_vals - mean(m_vals)

  # Calculate outcome values using CENTERED mediator
  mu_y <- alpha_y_i + beta_y_i * weeks + gamma_i * m_centered
  y_vals <- mu_y + epsilon_y

  df <- rbind.data.frame(df, data.frame(m = m_vals, y = y_vals, week = weeks,
                                        id = rep(i, times = length(weeks))))
}
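Given the caveat above that this block hasn’t been fully checked, one simple sanity check (not part of the original code) is to simulate with a much larger n, fit per-participant regressions of the mediator on week, and see whether the correlation between the estimated intercepts and slopes lands near the specified \(\rho_{\alpha_m,\beta_m}\) of -0.15 (it will be somewhat attenuated by estimation noise):
# Rough check (assumes df from above and a larger n, e.g. n <- 200):
# per-participant OLS fits of m on week, then the empirical intercept-slope correlation.
med_re_hat <- t(sapply(split(df, df$id), function(d) coef(lm(m ~ week, data = d))))
colnames(med_re_hat) <- c("alpha_m_hat", "beta_m_hat")
round(cor(med_re_hat), 2)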
My aim for this code is that it can be used as a tool to test and make sense of different available methods for mediation analysis.
For now, let’s just plot both the mediator and the outcome:
p <- ggplot(df, aes(x = week, group = as.factor(id), color = as.factor(id))) +
  geom_point(aes(y = m), alpha = 0.8) +
  geom_line(aes(y = m), alpha = 0.8) +
  geom_point(aes(y = y), alpha = 0.4) +
  geom_line(aes(y = y), alpha = 0.4, linetype = "dashed") +
  labs(color = "ID", y = "Value") +
  theme_minimal()
print(p)