Probably Approximately Correct

MSc Machine Learning @ UCL | Alumnus @ IIT Madras | Google DeepMind Scholar | Interests: Machine learning

Friday, May 26, 2023

Causal Research (#POST 4)

Paper 1: Double machine learning and automated confounder selection: A cautionary tale

Summary:

Double/debiased machine learning (DML) is a method for variable selection in high-dimensional causal inference settings. It uses regularization techniques such as the LASSO or L2-boosting to identify covariates that are strongly associated with both the treatment and the outcome. By doing so, it addresses omitted variable bias and yields consistent estimates of the treatment's causal effect. DML comes in two main variants, partialling out and double selection, both of which account for the link between treatment and covariates. These designs rely on doubly robust moment conditions and are robust to the approximation errors introduced by regularization. However, DML still requires the ignorability assumption: when the covariates are not fully exogenous, the procedure breaks down. "Bad controls" that violate ignorability can undermine DML's effectiveness. The paper's simulation studies show that even minor deviations from ignorability make DML sensitive, producing results that resemble those of a naïve LASSO. As an illustration, applying DML to estimate gender wage differences yields substantially different estimates once endogeneity due to marital status is taken into account.
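As a rough sketch of the partialling-out idea (not the paper's exact estimator), the snippet below simulates a simple setting and cross-fits LASSO nuisance models with scikit-learn; the data-generating coefficients (true effect 0.5) are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
d = X[:, 0] + rng.normal(size=n)             # treatment depends on a confounder
y = 0.5 * d + X[:, 0] + rng.normal(size=n)   # true treatment effect is 0.5

# Partialling out with cross-fitting: residualize y and d on X with
# LASSO on held-out folds, then regress y-residuals on d-residuals.
res_y, res_d = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    res_y[test] = y[test] - LassoCV(cv=3).fit(X[train], y[train]).predict(X[test])
    res_d[test] = d[test] - LassoCV(cv=3).fit(X[train], d[train]).predict(X[test])

theta = LinearRegression().fit(res_d.reshape(-1, 1), res_y).coef_[0]
print(round(theta, 2))  # close to the true effect 0.5
```

Note that this only works because X here really does satisfy ignorability; the paper's point is that the same machinery silently fails when it does not.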


Further reading

- https://towardsdatascience.com/double-machine-learning-for-causal-inference-78e0c6111f9d

- To help the layperson grasp what it means to break the ignorability assumption by including "bad controls" in the DML algorithm, consider an example. Suppose we are studying the effect of a cloud workshop program (treatment variable D) on employee performance (outcome variable Y). We collect several control variables X, such as demographic characteristics, educational background, work experience, and job satisfaction. By mistake, however, we also include "motivation" as a control, even though it is not assigned independently of treatment: motivation could be affected by the workshop itself, or by unobserved characteristics that jointly influence treatment assignment and job performance. Running DML with "motivation" as a control therefore violates the ignorability assumption. Because this variable is not independent of how treatment is assigned, it can bias the estimated treatment effect: the "bad control" mixes the effect of the workshop with motivation's own influence on job performance.
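The bad-control story can be demonstrated in a toy simulation (all coefficients are hypothetical). Regressing performance on the workshop alone recovers the total effect, while also controlling for the post-treatment "motivation" variable shifts the estimate away from it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 5000
d = rng.binomial(1, 0.5, n).astype(float)   # workshop assignment
m = 1.0 * d + rng.normal(size=n)            # motivation: affected by the treatment
y = 2.0 * d + 1.5 * m + rng.normal(size=n)  # performance: total effect = 2 + 1.5 = 3.5

good = LinearRegression().fit(d.reshape(-1, 1), y).coef_[0]
bad = LinearRegression().fit(np.column_stack([d, m]), y).coef_[0]
print(round(good, 1), round(bad, 1))  # ~3.5 without the bad control, ~2.0 with it
```

Conditioning on the mediator strips out the part of the workshop's effect that flows through motivation, so the "controlled" estimate no longer answers the total-effect question.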

Paper 2: ParKCa: Causal Inference with Partially Known Causes

Summary:

ParKCa is a causal discovery approach that departs slightly from traditional stacking models by combining multiple causal estimates. In typical stacking, individual causal discovery methods are learned separately and their results compared; in ParKCa, their outputs become features for a classification model that acts as the meta-learner. The objective is to identify new causes from a few examples of known causes. Each learner (causal discovery method) receives the input data in a given format, and its output corresponds to one entry per potential cause. Rather than cross-validation, the level-0 transposed data is bootstrapped to create the level-1 data used to train the meta-learner; this avoids violating assumptions such as causal sufficiency. Diversity among the learners is the most important ingredient of ParKCa. It can be measured with the Q-statistic, which quantifies the difference between a pair of classifiers' performance, and the average Q-statistic over all pairs indicates the overall diversity across learners.
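A minimal sketch of the pairwise Q-statistic mentioned above, assuming each learner's output has been scored as correct/incorrect against a set of known causes (the correctness vectors here are made up for illustration):

```python
import numpy as np
from itertools import combinations

def q_statistic(correct_a, correct_b):
    """Yule's Q between two classifiers' correctness vectors (1 = correct)."""
    a, b = np.asarray(correct_a, bool), np.asarray(correct_b, bool)
    n11 = np.sum(a & b)    # both correct
    n00 = np.sum(~a & ~b)  # both wrong
    n10 = np.sum(a & ~b)   # only the first correct
    n01 = np.sum(~a & b)   # only the second correct
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

# Three hypothetical learners' correctness on ten candidate causes.
learners = [
    [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 0, 1, 0, 1],
]
avg_q = np.mean([q_statistic(p, q) for p, q in combinations(learners, 2)])
print(round(avg_q, 3))  # Q near -1: learners err on different examples (diverse)
```

Q ranges from -1 to 1; values near 1 mean two learners succeed and fail on the same examples, so a low average Q signals the diversity that ParKCa relies on.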



Paper 3: Causal Inference with Non-IID Data using Linear Graphical Models

Summary:

An interaction model is a causal model with an adaptation network, in which nodes represent explicit variables within a directed acyclic graph and directed edges depict the causal relationships between those variables. The data-generating process for the observed explicit variables is determined by structural equations. In this context, the paper introduces the "isolated interaction model", an "ideal" model obtained by removing all interactions between units, in order to study the bias caused by interactions. The paper also examines symmetry assumptions, focusing in particular on the ancestral same-distribution condition (ASDC). ASDC relaxes the independent and identically distributed (IID) assumption common in traditional causal inference: instead, sets of variables need only share certain distributional properties to satisfy the condition. The research concentrates on the true average causal effect (TACE), an extension of the average causal effect (ACE) to the non-IID setting. TACE captures only the component of the effect transmitted through the cause variable to the outcome variable, excluding any influence along non-causal paths from one unit to another.
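As a rough illustration (not the paper's estimator), the toy simulation below shows how interaction between paired units biases a naive per-unit regression: the naive coefficient absorbs the neighbour's treatment effect, whereas adjusting for the neighbour's treatment recovers the within-unit effect of the kind TACE targets. All coefficients are invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 4000
# Paired units with interaction: unit j's treatment leaks into unit i's outcome.
d_i = rng.normal(size=n)
d_j = 0.8 * d_i + 0.6 * rng.normal(size=n)        # treatments correlated across units
y_i = 1.0 * d_i + 0.5 * d_j + rng.normal(size=n)  # 0.5 * d_j is the interaction term

naive = LinearRegression().fit(d_i.reshape(-1, 1), y_i).coef_[0]
adjusted = LinearRegression().fit(np.column_stack([d_i, d_j]), y_i).coef_[0]
print(round(naive, 2), round(adjusted, 2))  # naive ~1.4 absorbs the leak; adjusted ~1.0
```

The gap between the two estimates is exactly the kind of interaction bias the paper's theorems quantify and remove.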

The article then goes deeper into quantifying, detecting, and eliminating interaction bias in TACE estimation. It identifies two types of graphical structures that introduce bias when estimating TACE: deflecting bias structures and reflecting bias structures. Several theorems and corollaries are proved to quantitatively assess and detect interaction bias based on the presence of these structures in the interaction network. To correct the bias, Theorem 2 provides a linear regression-based method for computing an unbiased estimate of TACE from a set of samples satisfying the bias-free condition, and Algorithm 1 gives an algorithmic approach to selecting the largest bias-free subset from an interaction network.

The article also discusses how these theorems apply in practice, covering sample size, strength of connections, and sparsity, among other considerations that affect both bias reduction and estimation quality.


Paper 4: CounteRGAN: Generating Counterfactuals for Real-Time Recourse and Interpretability using Residual GANs

Summary:



CounteRGAN is designed to produce plausible counterfactual examples that give users recourse and improve interpretability. It does this using an RGAN (Residual Generative Adversarial Network) together with a fixed target classifier C. CounteRGAN generates counterfactuals with specific objectives: they must be actionable, realistic, belong to a specified class, and be produced with low computational latency. There are two versions of the CounteRGAN value function, depending on whether the classifier's gradients are available. The value function is maximized over D and minimized over G. When the classifier is known to be differentiable, the CounteRGAN value function is defined as follows:


VCounteRGAN(G, D) = VRGAN(G, D) + VCF(G, C, t) + Reg(G(x)),

where t represents the target class. The first term, VRGAN, utilizes a specialized RGAN to encourage realistic outputs. The second term, VCF, drives the counterfactual towards the desired target class. The third term, Reg(G(x)), controls the sparsity and magnitude of the residuals, serving as a proxy for counterfactual actionability.

The VRGAN term, which contributes to realistic outputs, is computed based on the discriminator and generator's performance on real and synthetic data samples drawn from the same probability distribution. The VCF term is responsible for aligning the counterfactual examples with the desired target class. It uses the classifier's prediction function Ct to guide the generation process. In cases where the target classifier is non-differentiable or unknown (black-box), a variant called CounteRGANbb is introduced. This variant does not rely on the classifier's gradients. Instead, it weights the first term of the RGAN value function by the classifier's prediction score, Ct(xi). The rest of the value function remains the same. The regularization term, Reg(G, {xi}), controls the sparsity and magnitude of the residuals by combining L1 and L2 regularization terms. It helps to ensure that the generated counterfactual examples are actionable and feasible.
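A minimal numeric sketch of how these terms combine on the generator side, assuming scalar discriminator and classifier scores and made-up regularization weights (the real method trains a GAN end to end; this only illustrates the bookkeeping of the value function):

```python
import numpy as np

def residual_reg(residual, l1_weight=0.1, l2_weight=0.1):
    """Reg(G(x)): combined L1/L2 penalty on the generator's residual,
    encouraging sparse, small edits (a proxy for actionability).
    The weights here are hypothetical."""
    r = np.asarray(residual, dtype=float)
    return l1_weight * np.abs(r).sum() + l2_weight * np.square(r).sum()

def generator_loss(d_score, c_score, residual, eps=1e-8):
    """Generator side of the objective for one example: fool the
    discriminator (realism), push the classifier towards the target
    class t, and keep the residual small."""
    fool_term = -np.log(d_score + eps)    # from the V_RGAN term
    target_term = -np.log(c_score + eps)  # V_CF: classifier score C_t for class t
    return fool_term + target_term + residual_reg(residual)

# d_score: discriminator's belief the counterfactual is real;
# c_score: classifier's probability of the target class.
loss = generator_loss(d_score=0.9, c_score=0.8, residual=[0.0, 0.2, -0.1])
print(round(loss, 3))  # ~0.364: all three terms are already small
```

In the black-box CounteRGANbb variant, the classifier score would instead reweight the RGAN term, since no gradient of C is available to backpropagate.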
The convergence properties of CounteRGAN are formalized by Theorem 1, which states that under certain conditions, the minimax optimization of the value function leads to the convergence of the generator's output distribution to a distribution defined by pCt(x), where pCt represents the desired class distribution.

