The Causal Model and Notation

Modified

June 14, 2024

Consider a typical setting where we have measured a treatment \(A\), a set of pre-treatment variables \(L\), and an outcome \(Y\). We will assume that the data are generated by the following nonparametric structural equation model (NPSEM); analogous definitions can be given under other causal models:

\[ \begin{align} L &= f_L(U_L)\\ A &= f_A(L, U_A)\\ Y &= f_Y(A, L, U_Y) \end{align} \]

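We will often refer to \(L\) as confounders, \(A\) as the treatment or exposure, and \(Y\) as the outcome. The variables \(U_L\), \(U_A\), and \(U_Y\) represent unmeasured exogenous factors, and the functions \(f_L\), \(f_A\), and \(f_Y\) are left unspecified.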
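To make the three structural equations concrete, here is a minimal simulation sketch in Python. The specific forms of `f_L`, `f_A`, and `f_Y`, the coefficients, and the distributions of the exogenous variables are all assumptions chosen for illustration; the causal model itself places no such restrictions on them.

```python
import math
import random


def expit(x):
    """Inverse logit, used to turn a linear score into a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical structural functions (illustrative only).
def f_L(u_l):
    # A single confounder equal to its exogenous input.
    return u_l


def f_A(l, u_a):
    # Binary treatment that is more likely when the confounder is large.
    return 1 if u_a < expit(0.5 * l) else 0


def f_Y(a, l, u_y):
    # Binary outcome that depends on both treatment and confounder.
    return 1 if u_y < expit(-1.0 + 0.8 * a + 0.6 * l) else 0


def draw_unit(rng):
    # Draw the exogenous variables, then evaluate the equations in order.
    u_l = rng.gauss(0.0, 1.0)
    u_a = rng.random()
    u_y = rng.random()
    l = f_L(u_l)
    a = f_A(l, u_a)
    y = f_Y(a, l, u_y)
    return {"L": l, "A": a, "Y": y}


rng = random.Random(2024)
sample = [draw_unit(rng) for _ in range(5)]
```

Each element of `sample` plays the role of one row of observed data: the exogenous draws are discarded and only \(L\), \(A\), and \(Y\) are kept.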


Take for example the following sample of data:


Let \(L\) = (sex, bmi, age, smoke), \(A\) = trt, and \(Y\) = event. When we say that we assume the previous NPSEM, we are positing that:

Note that in some cases, we may know one or more of the functions \(f\). For example, if our data came from a randomized clinical trial for \(A\), then we know the function \(f_A\).
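As a sketch of what it means to know \(f_A\), suppose (hypothetically) that the trial assigned treatment by simple randomization with a fixed probability. The function name `f_A_randomized` and the value `p_treat = 0.5` are assumptions for illustration, not details of any particular trial.

```python
import random


def f_A_randomized(l, u_a, p_treat=0.5):
    # Under simple randomization the assignment ignores the confounders l;
    # it depends only on the exogenous draw u_a and the design probability.
    return 1 if u_a < p_treat else 0


rng = random.Random(1)
# Units with very different confounder values share the same assignment rule.
assignments = [f_A_randomized(l, rng.random()) for l in (-2.0, 0.0, 2.0)]
```

Because the argument `l` never enters the rule, two units with the same draw `u_a` receive the same treatment regardless of their confounders.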


Central to how we will define causal effects is the concept of counterfactual random variables.

Counterfactual random variables

Hypothetical random variables that would have been observed, possibly contrary to fact, in an alternative world.

For example, consider a scenario where we are interested in the value of \(Y\) in a hypothetical situation where, instead of the variable \(A\) being equal to its observed value, \(A\) is set to some other value.

Returning to the data example, imagine we are interested in the value of event if trt were replaced with the output of a function \(\dd\) that always returns 1:

\[ \begin{align} L &= f_L(U_L)\\ A^{\dd} &= \dd(A, L) = 1 \\ Y^{\dd} &= f_Y(1, L, U_Y) \end{align} \]

Here we introduce some new notation \(A^{\dd}\) to refer to the post-intervention exposure. If we had the ability to collect data from this alternative NPSEM, the data may instead look like this:

Unfortunately, we are never able to collect data from this alternative world. This is called the fundamental problem of causal inference.
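A simulation is the one place where both worlds can be seen side by side, because we control the exogenous variables. The sketch below reuses the same draws of \(U_L\), \(U_A\), and \(U_Y\) to compute the observed outcome and the outcome under an intervention that always returns 1. The structural functions and their coefficients are hypothetical, chosen only for illustration.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical structural functions (illustrative only).
def f_A(l, u_a):
    return 1 if u_a < expit(0.5 * l) else 0


def f_Y(a, l, u_y):
    return 1 if u_y < expit(-1.0 + 0.8 * a + 0.6 * l) else 0


def d(a, l):
    # The intervention: ignore the observed treatment and always return 1.
    return 1


def draw_both_worlds(rng):
    # One set of exogenous draws is shared by both worlds.
    u_l, u_a, u_y = rng.gauss(0.0, 1.0), rng.random(), rng.random()
    l = u_l
    a = f_A(l, u_a)          # observed treatment
    y = f_Y(a, l, u_y)       # observed outcome
    a_d = d(a, l)            # post-intervention treatment
    y_d = f_Y(a_d, l, u_y)   # counterfactual outcome
    return {"L": l, "A": a, "Y": y, "A_d": a_d, "Y_d": y_d}


rng = random.Random(7)
units = [draw_both_worlds(rng) for _ in range(1000)]
```

In real data only the `Y` column is recorded. Under this model the counterfactual `Y_d` coincides with `Y` for units whose observed treatment already equals the value returned by `d`; for everyone else it is unobserved.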


The previous NPSEM is the simplest causal model we will assume in this workshop. However, real data are often much more complex and may be characterized by:

As such, we need to modify and introduce some additional notation:

| Symbol | Definition |
|---|---|
| \(i\) | The index of an observation (i.e., a row) in a dataset with \(n\) total units (i.e., \(n\) total rows) |
| \(t\) | The index of time, for a total of \(\tau\) time points |
| \(L_t\) | Confounders at time \(t\) |
| \(A_t\) | A vector of intervention variables (i.e., treatment or exposure) at time \(t\) |
| \(Y\) | An outcome variable observed at the end of the study, that is, at time \(\tau + 1\). Earlier measures of the outcome can be included in \(L_t\). |
| \(C_t\) | An indicator variable that a unit is observed (not censored) at time \(t+1\) |
| \(O_1, ..., O_n\) | A sample of \(n\) i.i.d. observations with \(O = (L_1, A_1, C_1, L_2, A_2, C_2, ..., L_\tau, A_\tau, C_\tau, Y)\) |
| \(\bar{X}_t = (X_1, ..., X_t)\) | The history of a variable up to and including time \(t\) |
| \(\underline{X}_t = (X_t, ..., X_\tau)\) | The future of a variable, including time \(t\) |
| \(H_t = (\bar{A}_{t-1}, \bar{L}_t)\) | The history of all variables up until just before \(A_t\) |
| \(\epsilon_t\) | A randomizer |
| \(\dd(a_t, h_t, \epsilon_t)\) | A function that maps \(A_t\), \(H_t\), and \(\epsilon_t\) to a new value of treatment \(A^{\dd}_t\) |
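The history and future notation can be read as simple slicing operations on a unit's time-ordered measurements. The sketch below illustrates this for one unit with \(\tau = 3\). The helper names `history`, `future`, and `H`, the example values, and the particular stochastic policy `d` are all hypothetical and serve only to illustrate the notation; censoring is omitted for brevity.

```python
def history(x, t):
    """bar{X}_t = (X_1, ..., X_t), where x[0] holds X_1."""
    return tuple(x[:t])


def future(x, t):
    """underline{X}_t = (X_t, ..., X_tau)."""
    return tuple(x[t - 1:])


def H(a, l, t):
    """H_t = (bar{A}_{t-1}, bar{L}_t): everything measured before A_t."""
    return (history(a, t - 1), history(l, t))


def d(a_t, h_t, eps_t):
    # Hypothetical stochastic policy: when the randomizer eps_t falls
    # below 0.5, set treatment to 1; otherwise keep the observed value.
    return 1 if eps_t < 0.5 else a_t


# One unit observed at tau = 3 time points (made-up values).
L = [0.2, 0.5, 0.1]
A = [0, 1, 0]

h_1 = H(A, L, 1)  # ((), (0.2,))
h_2 = H(A, L, 2)  # ((0,), (0.2, 0.5))
a_2_future = future(A, 2)  # (1, 0)
```

Note that \(H_1\) contains no treatment history, since no treatment has been assigned before \(A_1\). The randomizer argument lets `d` return different values for two units with identical \(A_t\) and \(H_t\); a deterministic policy would simply ignore it.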

References

Díaz, Iván, Nicholas Williams, Katherine L. Hoffman, and Edward J. Schenck. 2023. “Nonparametric Causal Effects Based on Longitudinal Modified Treatment Policies.” Journal of the American Statistical Association 118 (542): 846–57.