Estimators
\[ \renewcommand{\P}{\mathsf{P}} \newcommand{\m}{\mathsf{m}} \newcommand{\p}{\mathsf{p}} \newcommand{\q}{\mathsf{q}} \newcommand{\bb}{\mathsf{b}} \newcommand{\g}{\mathsf{g}} \newcommand{\rr}{\mathsf{r}} \newcommand{\IF}{\mathbb{IF}} \newcommand{\dd}{\mathsf{d}} \newcommand{\Pn}{\mathsf{P}_n} \newcommand{\E}{\mathsf{E}} \]
To recap, we have:

- defined a causal parameter \(\theta\) that represents the outcome \(Y\) under a general hypothetical intervention, \(\dd_t(a_t, h_t, \epsilon)\), and
- defined the assumptions necessary to identify the expected value of \(Y\) under an intervention that replaces \(A\) with the output of \(\dd_t(a_t, h_t, \epsilon)\).
The parameter \(\theta\) is a functional of the distribution of the observed data (e.g., of the outcome regressions) that could be computed exactly if we knew that distribution. Because we do not know this distribution, we need to estimate \(\theta\) from a sample of observed data. We now discuss how to do so.
Sequential Regression Estimator
One possible estimator is a simple plug-in estimator based on the identification result. This estimator is often referred to as G-computation, and it proceeds by estimating the sequential regressions described in the identification result. Another way to describe this algorithm is as follows:
Algorithm: Sequential regressions
1. Set \(\m_{i,\tau +1} = Y_i\).
2. For \(t = \tau, ..., 1\):
    a. Using a pre-specified parametric model, regress \(\m_{i,t+1}\) on \(\{A_{i, t}, H_{i,t}\}\).
    b. Generate predictions from this model with \(A_{i,t}\) changed to \(A^{\dd}_{i,t}\). Set \(\m_{i, t}\) to be these predicted values.
    c. Repeat (iterate) the two previous steps until \(\m_{i, 1}\) is set to the predicted values.
3. Take the final estimate as \(\hat{\theta} = \frac{1}{n}\sum_{i=1}^n \hat\m_{i, 1}\).
4. Compute standard errors using a bootstrap of steps 1-3.
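To make the recursion concrete, below is a minimal sketch of this algorithm for \(\tau = 2\) time points with a binary outcome. The data frame `dat`, its columns `L1`, `A1`, `L2`, `A2`, `Y`, and the function `shift_d()` implementing the intervention are hypothetical placeholders; this illustrates the algorithm, not the `lmtp` implementation.

```r
# Sketch of the sequential regression (g-computation) estimator with tau = 2.
# Assumes a data.frame `dat` with columns L1, A1, L2, A2, Y (hypothetical names)
# and a user-supplied function shift_d() that returns the intervened exposure.
gcomp_2tp <- function(dat, shift_d) {
  # Step 1: initialize the pseudo-outcome at tau + 1 with the observed outcome
  m3 <- dat$Y

  # Step 2, t = 2: regress m3 on {A2, H2} = {A2, L2, A1, L1},
  # then predict with A2 replaced by its intervened value
  fit2 <- glm(m3 ~ A2 + L2 + A1 + L1, data = dat, family = binomial())
  m2   <- predict(fit2, newdata = transform(dat, A2 = shift_d(dat$A2)), type = "response")

  # Step 2, t = 1: regress m2 on {A1, H1} = {A1, L1}; quasibinomial() because m2 is in (0, 1)
  fit1 <- glm(m2 ~ A1 + L1, data = dat, family = quasibinomial())
  m1   <- predict(fit1, newdata = transform(dat, A1 = shift_d(dat$A1)), type = "response")

  # Step 3: the estimate is the sample average of the final predictions
  mean(m1)
}

# Step 4: bootstrap standard errors by repeating the whole procedure on resampled data
# boots <- replicate(500, gcomp_2tp(dat[sample(nrow(dat), replace = TRUE), ], shift_d))
# sd(boots)
```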
| Pros ✅ | Cons ❌ |
|---|---|
|  |  |
A substitution estimator is nice because its estimates are guaranteed to stay within the valid range of the outcome and it is simple to compute. That said, its cons are major:
- In studies with multiple time points, the adjustment set can become large very quickly. For instance, consider a study with 3 covariates measured at every time point and 10 time points. Even though we have only 3 covariates at each time point, the number of covariates in the regression of \(Y\) is 33.
- Imagine trying to correctly specify (e.g., include the appropriate interactions in) a parametric model (e.g., logistic, Cox) with 33 variables.
Density-ratio estimation
The next three estimators all rely on estimating the density ratio
\[ r_t(a_t, h_t) = \frac{\g_t^\dd(a_t \mid h_t)}{\g_t(a_t \mid h_t)}. \]
Recall that \(\g_t^\dd(a_t \mid h_t)\) denotes the density of treatment post-intervention and that \(\g_t(a_t \mid h_t)\) denotes the density of the observed treatment under no intervention.
We will often refer to this ratio as the intervention mechanism throughout the workshop. Estimation of \(r_t(a_t, h_t)\) is fully automated in `lmtp` and hidden from the user. A comprehensive understanding of this process isn’t necessary to use `lmtp`!
We can estimate this density ratio directly using a classification trick: the density ratio (which, as we will see, is also an odds) can be computed by solving a classification problem in an augmented dataset with \(2n\) observations. In this new dataset, the outcome is a new variable that we create, \(\Lambda\) (defined below), and the predictors are the variables \(A_t\) and \(H_t\). In this new dataset, the data structure at time \(t\) is redefined as
\[ (H_{\lambda, i, t}, A_{\lambda, i, t}, \Lambda_{\lambda, i} : \lambda = 0, 1; i = 1, ..., n) \]
\(\Lambda_{\lambda, i} = \lambda\) indexes the duplicated observations: \(\Lambda_{\lambda, i} = 1\) if observation \(i\) is a duplicated value and \(\Lambda_{\lambda, i} = 0\) otherwise.
- For duplicated and non-duplicated observations with the same \(i\) (i.e., \(\lambda\in\{0,1\}\)), \(H_{\lambda, i, t}\) is the same.
- For the non-duplicated observations (\(\lambda = 0\)), \(A_{\lambda=0, i, t}\) equals the observed exposure values \(A_{i, t}\).
- For the duplicated observations (\(\lambda=1\)), \(A_{\lambda=1, i, t}\) equals the exposure values under the intervention \(\dd\), \(A^{\dd}_{i,t}\).
We then estimate the conditional probability that \(\Lambda=1\) in this dataset, and divide it by the corresponding estimate of the conditional probability that \(\Lambda=0\). Specifically, denoting \(p^\lambda\) to be the distribution of the data in the augmented dataset, we have:
\[ r_t(a_t, h_t) = \frac{p^\lambda(a_t, h_t \mid \Lambda = 1)}{p^\lambda(a_t, h_t \mid \Lambda = 0)}=\frac{p^\lambda(\Lambda = 1\mid A_t=a_t, H_t=h_t)}{p^\lambda(\Lambda = 0\mid A_t=a_t, H_t=h_t)} \]
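For intuition, here is a minimal sketch of the classification trick at a single time point. The data frame `dat`, its columns `A1` and `L1`, and the shift function `shift_d()` are hypothetical placeholders; `lmtp` carries out an analogous computation internally.

```r
# Sketch of density-ratio estimation via the classification trick (one time point).
# Augmented dataset with 2n rows: lambda = 0 keeps the observed exposure,
# lambda = 1 replaces it with the exposure under the intervention d
aug <- rbind(
  data.frame(lambda = 0, A1 = dat$A1,          L1 = dat$L1),
  data.frame(lambda = 1, A1 = shift_d(dat$A1), L1 = dat$L1)
)

# Estimate P(lambda = 1 | A1, H1) with any classifier; here, a logistic regression
fit <- glm(lambda ~ A1 + L1, data = aug, family = binomial())

# Evaluate at the observed (A1, H1) values and convert the probability to an odds
p  <- predict(fit, newdata = dat, type = "response")
r1 <- p / (1 - p)   # estimated density ratio r_1(A_{i,1}, H_{i,1})
```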
Inverse Probability Weighting
We call this estimator IPW due to its similarity with the inverse-probability weighted estimator for binary treatments, but it would be more accurately referred to as simply a reweighted estimator. It is based on the following alternative identification formula:
\[ \theta = \E \bigg[ \bigg\{\prod_{t=1}^\tau r_t(a_t, h_t) \bigg\} Y \bigg] \]
Algorithm: IPW Estimator
1. Construct estimates of \(r_{i,t}(a_t, h_t)\) using the density ratio classification trick and a pre-specified parametric model.
2. Define the weights \(w_{i} = \prod_{t=1}^\tau r_{i,t}(a_t, h_t)\).
3. Take the final estimate as \(\hat{\theta} = \frac{1}{n}\sum_{i=1}^n \hat{w}_{i}\times y_i\).
4. Compute standard errors using a bootstrap of steps 1-3.
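Continuing the single-time-point sketch above (with the hypothetical `dat` and estimated density ratio `r1`), the IPW estimate is just a weighted mean of the outcome, and the bootstrap repeats the whole procedure on resampled data.

```r
# Sketch of the IPW estimator with tau = 1, reusing r1 from the classification trick above
w     <- r1                 # with multiple time points, w_i = prod_t r_t
theta <- mean(w * dat$Y)    # reweighted mean of the observed outcome

# Bootstrap standard errors: redo the density-ratio fit, weights, and weighted mean
ipw_boot <- function(dat, shift_d) {
  b   <- dat[sample(nrow(dat), replace = TRUE), ]
  aug <- rbind(
    data.frame(lambda = 0, A1 = b$A1,          L1 = b$L1),
    data.frame(lambda = 1, A1 = shift_d(b$A1), L1 = b$L1)
  )
  fit <- glm(lambda ~ A1 + L1, data = aug, family = binomial())
  p   <- predict(fit, newdata = b, type = "response")
  mean(p / (1 - p) * b$Y)
}
# se <- sd(replicate(500, ipw_boot(dat, shift_d)))
```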
| Pros ✅ | Cons ❌ |
|---|---|
|  |  |
Doubly Robust Estimators
G-computation and IPW estimators require the estimation of nuisance parameters with correctly specified parametric models. We will now turn our attention to two non-parametric estimators:

- a targeted minimum-loss based estimator (TMLE), and
- a sequentially doubly robust estimator (SDR).
Wait, what does it mean for an estimator to be doubly robust?
- For the simple case of a single time point, an estimator is considered doubly robust if it is able to produce a consistent estimate of the target parameter as long as either the outcome model is consistently estimated or the treatment (and censoring) model(s) are consistently estimated. For example:
| Time | 1 |
|---|---|
| Treatment | Correct |
| Outcome | Wrong |
- For the time-varying setting, an estimator is doubly robust if, for some time \(s\), all outcome regressions for \(t >s\) are consistently estimated and all intervention mechanisms (treatment + censoring) for \(t \leq s\) are consistently estimated. Consider for example \(5\) time points:
| Time | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Treatment | Correct | Correct | Wrong | Wrong | Wrong |
| Outcome | Wrong | Wrong | Correct | Correct | Correct |
- Sequential double robustness (often also referred to as \(2^\tau\)-multiply robust) means that an estimator is consistent if, at each time point, either the outcome regression or the intervention mechanism (treatment + censoring) is consistently estimated. For example:
| Time | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Treatment | Correct | Wrong | Correct | Wrong | Correct |
| Outcome | Wrong | Correct | Wrong | Correct | Wrong |
Efficient Influence Function
Key to constructing the TMLE and SDR is the efficient influence function (EIF).
- The EIF characterizes the asymptotic behavior of all regular and efficient estimators.
- The EIF characterizes the first-order bias of pathwise differentiable estimands.
Before we introduce the EIF, it’s necessary to make some additional assumptions on \(A\) and \(\dd\):

- The treatment \(A\) is discrete, or
- if \(A\) is continuous, the function \(\dd\) is piecewise smooth invertible.
- The function \(\dd\) does not depend on the observed distribution \(\P\). (This means that we cannot apply these methods to an odds ratio IPSI, although we can apply them to a risk ratio IPSI.)
These assumptions ensure that the efficient influence function of \(\theta\) for interventions \(\dd\) has a structure similar to the influence function for the effect of a dynamic regime. This allows for multiply robust estimation, which is not generally possible for interventions \(\dd\) that depend on \(\P\).
Define the function
\[ \phi_t: O \mapsto \sum_{s=t}^\tau \bigg( \prod_{k=t}^s r_k(a_k, h_k)\bigg) \big\{\m_{s+1}(a_{s+1}^\dd, h_{s+1}) - \m_s(a_s, h_s) \big\} + \m_t(a_t^\dd, h_t). \]
The efficient influence function for estimating \(\theta = \E[\m_1(A^\dd, L_1)]\) in the non-parametric model is given by \(\phi_1(O) - \theta\).
In the case of a single time point, the influence function simplifies to
\[ r(a, w)\{Y - \m(a,w)\} + \m(a^{\dd},w) - \theta. \]
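For reference, the single-time-point influence function can be turned into a simple "one-step" estimator by averaging its uncentered version. The sketch below combines the hypothetical objects from the earlier sketches (`dat`, `r1`, `shift_d()`); it illustrates the formula above rather than the estimators implemented in `lmtp`.

```r
# Sketch of a one-step estimator based on the single-time-point influence function.
# Assumes dat (columns L1, A1, Y), the density ratio r1 from the classification-trick
# sketch, and the shift function shift_d() (all hypothetical names).
out_fit <- glm(Y ~ A1 + L1, data = dat, family = binomial())               # m(a, w)
m_obs   <- predict(out_fit, newdata = dat, type = "response")              # m(A, W)
m_shift <- predict(out_fit, newdata = transform(dat, A1 = shift_d(dat$A1)),
                   type = "response")                                      # m(A^d, W)

eif   <- r1 * (dat$Y - m_obs) + m_shift       # uncentered influence function
theta <- mean(eif)                            # one-step estimate of theta
se    <- sqrt(var(eif) / nrow(dat))           # standard error based on the EIF
```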
Targeted Minimum-Loss Based Estimation
Algorithm: TMLE
1. Construct estimates of \(r_{i,t}(a_t, h_t)\) using the density ratio classification trick and your favorite regression method.
2. For \(t = 1, ..., \tau\), compute the weights \(w_{i,t} = \prod_{k=1}^t r_{i,k}(a_{i,k}, h_{i,k})\).
3. Set \(\tilde{\m}_{i,\tau +1}(A^\dd_{i,\tau+1}, H_{i,\tau+1}) = Y_i\).
4. For \(t = \tau, ..., 1\):
    a. Regress \(\tilde{\m}_{i,t+1}(A^\dd_{i,t+1}, H_{i,t+1})\) on \(\{A_{i, t}, H_{i,t}\}\).
    b. Using this regression, generate predictions at \(\{A_{i, t}, H_{i,t}\}\) and \(\{A^\dd_{i, t}, H_{i,t}\}\). Denote these predictions \(\tilde{\m}_t(A_{i,t}, H_{i,t})\) and \(\tilde{\m}_t(A^\dd_{i,t}, H_{i,t})\), respectively.
    c. Fit the generalized linear tilting model
    \(\text{link }\tilde{\m}^\epsilon_t(A_{i,t}, H_{i,t}) = \epsilon + \text{link }\tilde{\m}_{i,t}(A_{i,t}, H_{i,t})\)
    with weights \(w_{i,t}\), where \(\text{link }\tilde{\m}_{i,t}(A_{i,t}, H_{i,t})\) enters as an offset (i.e., a covariate whose coefficient is fixed at one). In practice, \(\epsilon\) may be estimated by fitting a generalized linear model for \(\tilde{\m}_{i,t+1}(A^\dd_{i,t+1}, H_{i,t+1})\) with only an intercept term, an offset equal to \(\text{link }\tilde{\m}_{i,t}(A_{i,t}, H_{i,t})\), and weights \(w_{i,t}\).
    d. Let \(\hat\epsilon\) be the maximum likelihood estimate, and update the estimates as:
    \(\text{link }\tilde{\m}^{\hat\epsilon}_t(A^\dd_{i,t}, H_{i,t}) = \hat\epsilon + \text{link }\tilde{\m}_t(A^\dd_{i,t}, H_{i,t})\)
    \(\text{link }\tilde{\m}^{\hat\epsilon}_t(A_{i,t}, H_{i,t}) = \hat\epsilon + \text{link }\tilde{\m}_t(A_{i,t}, H_{i,t})\)
    e. Update \(\tilde{\m}_{i,t} = \tilde{\m}^{\hat\epsilon}_{i,t}\), set \(t = t-1\), and iterate.
5. The final estimate is defined as \(\hat\theta = \frac{1}{n}\sum_{i=1}^n\tilde{\m}_{i, 1}(A^\dd_{i, 1}, L_{i, 1})\).
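The tilting model in step 4(c) is just an intercept-only weighted GLM with an offset. The sketch below shows how \(\hat\epsilon\) could be obtained and used to update the predictions at one time point with a logit link; the object names (`m_next`, `m_obs`, `m_shift`, `w_t`) are hypothetical placeholders for the quantities defined in the algorithm.

```r
# Sketch of the TMLE tilting step at one time point (hypothetical object names).
# m_next  : pseudo-outcome \tilde m_{t+1}(A^d_{t+1}, H_{t+1})
# m_obs   : current predictions \tilde m_t(A_t, H_t) at the observed exposure
# m_shift : current predictions \tilde m_t(A^d_t, H_t) at the intervened exposure
# w_t     : cumulative density-ratio weights prod_{k <= t} r_k

tilt <- glm(
  m_next ~ offset(qlogis(m_obs)),   # intercept-only model; the offset has coefficient fixed at 1
  weights = w_t,
  family  = quasibinomial()         # logit link; tolerates non-integer pseudo-outcomes
)
eps <- coef(tilt)[1]                # estimated fluctuation parameter epsilon-hat

# Updated (targeted) predictions, computed on the link scale and back-transformed
m_shift_star <- plogis(eps + qlogis(m_shift))
m_obs_star   <- plogis(eps + qlogis(m_obs))
```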
| Pros ✅ | Cons ❌ |
|---|---|
|  |  |
Sequentially Doubly Robust Estimator
Algorithm: SDR Estimator
1. Construct estimates of \(r_{i,t}(a_t, h_t)\) using the density ratio classification trick and your favorite regression method.
2. Initialize \(\phi_{\tau +1}(O_i) = Y_i\).
3. For \(t = \tau, ..., 1\):
    a. Compute the pseudo-outcome \(\check{Y}_{i,t+1} = \phi_{t+1}(O_i)\).
    b. Regress \(\check{Y}_{i,t+1}\) on \(\{A_{i, t}, H_{i,t}\}\).
    c. Let \(\check\m_{i,t}\) denote the predicted values. Update \(\m_{i,t} = \check\m_{i,t}\) and iterate.
4. The final estimate is defined as \(\hat\theta = \frac{1}{n}\sum_{i=1}^n\phi_1(O_i)\).
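To see how the pieces fit together, here is a minimal sketch of the SDR estimator for \(\tau = 2\), reusing the hypothetical `dat`, `shift_d()`, and density ratios `r1`, `r2` (evaluated at the observed data) from the earlier sketches. It uses the fact that \(\phi_t\) can be computed recursively as \(\phi_t = r_t\{\phi_{t+1} - \check\m_t(a_t, h_t)\} + \check\m_t(a_t^\dd, h_t)\).

```r
# Sketch of the SDR estimator with tau = 2 (hypothetical names).
# t = 2: the pseudo-outcome is Y; regress it on {A2, H2} and predict at the
# observed and intervened values of A2
phi3 <- dat$Y
fit2 <- glm(phi3 ~ A2 + L2 + A1 + L1, data = dat, family = binomial())
m2_o <- predict(fit2, newdata = dat, type = "response")
m2_d <- predict(fit2, newdata = transform(dat, A2 = shift_d(dat$A2)), type = "response")
phi2 <- r2 * (phi3 - m2_o) + m2_d    # phi_2(O), the next pseudo-outcome

# t = 1: regress phi_2 on {A1, H1}; phi_2 can fall outside [0, 1], so use a linear model here
fit1 <- lm(phi2 ~ A1 + L1, data = dat)
m1_o <- predict(fit1, newdata = dat)
m1_d <- predict(fit1, newdata = transform(dat, A1 = shift_d(dat$A1)))
phi1 <- r1 * (phi2 - m1_o) + m1_d    # phi_1(O)

theta <- mean(phi1)                  # SDR point estimate
se    <- sqrt(var(phi1) / nrow(dat)) # standard error based on the EIF
```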
| Pros ✅ | Cons ❌ |
|---|---|
|  |  |
Choosing an Estimator
In general we never recommend using the IPW or sequential regression estimator. Both require the use of correctly pre-specified parametric models for valid statistical inference (ya right 😂).
The TMLE and SDR estimators, however, are both doubly or sequentially doubly robust and can be used with machine-learning algorithms while remaining \(\sqrt{n}\)-consistent under reasonable assumptions.
|  | IPW | G-comp. | TMLE | SDR |
|---|---|---|---|---|
| Uses outcome regression |  | ⭐ | ⭐ | ⭐ |
| Uses the propensity score | ⭐ |  | ⭐ | ⭐ |
| Valid inference with machine-learning |  |  | ⭐ | ⭐ |
| Substitution estimator |  | ⭐ | ⭐ |  |
| Doubly robust |  |  | ⭐ | ⭐ |
| Sequentially doubly robust |  |  |  | ⭐ |
While the SDR estimator may be more robust to model misspecification, the TMLE does have the advantage of being a substitution estimator. Because of this, estimates from the TMLE are guaranteed to stay within the valid range of the outcome. Taken together, this leads to the following recommendations for choosing between the TMLE and SDR:
- If treatment is not time-varying, use the TMLE.
- If treatment is time-varying and the parameter \(\theta\) has clear bounds, such as probabilities, be wary of the SDR estimator, which is not guaranteed to respect those bounds; prefer the TMLE in this case.
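In practice, both estimators are available in `lmtp` through `lmtp_tmle()` and `lmtp_sdr()`, which share a common interface. The calls below are only schematic: the data, column names, shift function, and number of folds are placeholders, and argument names and defaults may vary across `lmtp` versions, so consult the package documentation.

```r
library(lmtp)

# Schematic calls: `dat`, the column names, and the shift function `d` are placeholders
fit_tmle <- lmtp_tmle(
  data      = dat,
  trt       = c("A1", "A2"),            # time-varying treatment columns
  outcome   = "Y",
  time_vary = list(c("L1"), c("L2")),   # time-varying covariates at each time point
  shift     = d,                        # the intervention function d
  folds     = 10                        # cross-fitting folds
)

# The sequentially doubly robust estimator uses the same interface
fit_sdr <- lmtp_sdr(
  data      = dat,
  trt       = c("A1", "A2"),
  outcome   = "Y",
  time_vary = list(c("L1"), c("L2")),
  shift     = d,
  folds     = 10
)
```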
Cross-fitting
When estimating nuisance parameters with data-adaptive algorithms, you should perform a process similar to cross-validation called cross-fitting. Cross-fitting:

- helps ensure that standard errors will be correct, and
- can help reduce estimator bias and improve coverage of the confidence intervals.
Cross-fitting is fully automated in `lmtp`, but for more information we recommend reviewing Chernozhukov et al. (2018), Díaz (2020), and Zivich and Breskin (2021).