Describe the bug
The nuisance estimation (DoubleMLPLPR._nuisance_est) fails when the data preprocessing (DoubleMLPLPR._transform_data()) drops ids with only one datapoint.
The problem seems to be the following: d_mean is computed from the original data (before the transformation),
but m_hat_star is computed from the predictions and means on the transformed data, which potentially has fewer rows.
# general cre adjustment
if self._approach == "cre_general":
    d_mean = self._d_mean[:, self._i_treat]
    df_m_hat = pd.DataFrame({"id": self._dml_data.id_var, "m_hat": m_hat["preds"]})
    m_hat_mean = df_m_hat.groupby(["id"]).transform("mean")
    m_hat_star = m_hat["preds"] + d_mean - m_hat_mean["m_hat"]  # <-- this raised an error
    m_hat["preds"] = m_hat_star
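The shape mismatch can be reproduced in isolation with a toy panel (a minimal sketch, not the library code; the arrays and lengths are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy panel: ids 3 and 4 each have only one row and get dropped by the transform.
ids_original = np.array([1, 1, 2, 2, 3, 4])
ids_transformed = ids_original[:4]  # singleton ids dropped

d = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 1.1])

# d_mean computed on the ORIGINAL data -> length 6
d_mean = pd.Series(d).groupby(ids_original).transform("mean").values

# predictions only exist for the TRANSFORMED data -> length 4
preds = np.zeros(len(ids_transformed))
m_hat_mean = pd.Series(preds).groupby(ids_transformed).transform("mean").values

try:
    preds + d_mean - m_hat_mean
except ValueError as e:
    print(e)  # operands could not be broadcast together ...
```

This is exactly the (2836,) vs. (2838,) situation from the trace, just with smaller numbers.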
Here is the stack trace:
C:\Users\siebert\PycharmProjects\tree-growth\.venv\Lib\site-packages\doubleml\plm\plpr.py:275: UserWarning: The data contains 2 id(s) with only one row. These row(s) have been dropped.
warnings.warn(
Traceback (most recent call last):
File "C:\Users\siebert\AppData\Local\JetBrains\PyCharm Community Edition 2024.3.3\plugins\python-ce\helpers\pydev\pydevd.py", line 1647, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\siebert\AppData\Local\JetBrains\PyCharm Community Edition 2024.3.3\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:\Users\siebert\PycharmProjects\tree-growth\tree_growth_analysis\estimation\estimation_doubleml.py", line 79, in <module>
estimate_df, sensitivity_params = estimation(
^^^^^^^^^^^
File "C:\Users\siebert\PycharmProjects\tree-growth\tree_growth_analysis\estimation\estimation_doubleml.py", line 48, in estimation
dml_plr_obj.fit(n_jobs_cv=4)
File "C:\Users\siebert\PycharmProjects\tree-growth\.venv\Lib\site-packages\doubleml\double_ml.py", line 593, in fit
nuisance_predictions = self._fit_nuisance_and_score_elements(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\siebert\PycharmProjects\tree-growth\.venv\Lib\site-packages\doubleml\double_ml.py", line 1306, in _fit_nuisance_and_score_elements
score_elements, preds = self._nuisance_est(
^^^^^^^^^^^^^^^^^^^
File "C:\Users\siebert\PycharmProjects\tree-growth\.venv\Lib\site-packages\doubleml\plm\plpr.py", line 401, in _nuisance_est
m_hat_star = m_hat["preds"] + d_mean - m_hat_mean["m_hat"]
~~~~~~~~~~~~~~~^~~~~~~~
ValueError: operands could not be broadcast together with shapes (2836,) (2838,)
My guess is that the computation of d_mean uses the original data instead of the transformed data in the following function:
def _set_d_mean(self):
    if self._approach in ["cre_general", "cre_normal"]:
        data = self._original_dml_data.data
        d_cols = self._original_dml_data.d_cols
        id_col = self._original_dml_data.id_col
        help_d_mean = data.loc[:, [id_col] + d_cols]
        d_mean = help_d_mean.groupby(id_col).transform("mean").values
        self._d_mean = d_mean
    else:
        self._d_mean = None
From what I understand, the transformed data is stored in self.data_transform (also self._dml_data)
and should already exist when d_mean is computed, so the code could look like this:
def _set_d_mean(self):
    if self._approach in ["cre_general", "cre_normal"]:
        data = self._dml_data.data  # or self.data_transform.data
        d_cols = self._dml_data.d_cols  # or self.data_transform.d_cols
        id_col = self._dml_data.id_col  # or self.data_transform.id_col
        help_d_mean = data.loc[:, [id_col] + d_cols]
        d_mean = help_d_mean.groupby(id_col).transform("mean").values
        self._d_mean = d_mean
    else:
        self._d_mean = None
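Continuing the toy example from above, once d_mean is computed on the already-transformed data (singleton ids dropped), the lengths agree and the adjustment goes through (again just a sketch with made-up values):

```python
import numpy as np
import pandas as pd

# Transformed data only: the singleton ids are already dropped.
df_transformed = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "d":  [0.1, 0.3, 0.5, 0.7],
})

# d_mean computed on the transformed data has the same length as the predictions.
d_mean = df_transformed.groupby("id")["d"].transform("mean").values

preds = np.zeros(len(df_transformed))
m_hat_mean = pd.Series(preds).groupby(df_transformed["id"].values).transform("mean").values

m_hat_star = preds + d_mean - m_hat_mean  # no broadcasting error
print(m_hat_star.shape)  # (4,)
```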
Minimum reproducible code snippet
I cannot share the data I am working with, but you should be able to reproduce it with a panel data set where some individuals have only one observation (e.g., a single measurement over time).
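A synthetic panel with this property could be built like this (column names and sizes are hypothetical, chosen only to match the reproduction described above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical panel: ids 0-9 have 3 periods each,
# ids 10 and 11 have only a single observation.
n_full, n_periods = 10, 3
ids = np.repeat(np.arange(n_full), n_periods).tolist() + [10, 11]
t = np.tile(np.arange(n_periods), n_full).tolist() + [0, 0]
n = len(ids)

df = pd.DataFrame({
    "id": ids,
    "t": t,
    "x1": rng.normal(size=n),
    "d": rng.normal(size=n),
})
df["y"] = 0.5 * df["d"] + df["x1"] + rng.normal(size=n)

# Two ids have a single row, which should trigger the UserWarning about dropped rows.
counts = df["id"].value_counts()
print((counts == 1).sum())  # 2
```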
These are the objects I am using:
obj_dml_data = dml.DoubleMLPanelData(
    df[columns_to_keep].copy().reset_index(drop=True),
    y_col=y_col,    # Outcome
    d_cols=d_cols,  # Treatment
    x_cols=x_cols,  # Confounders
    t_col=t_col,    # Time
    id_col=id_col,  # How to cluster the data for fixed effects
    static_panel=True,
)
ml_l = RandomForestRegressor(n_estimators=100, max_depth=5)  # Model for Y
ml_m = RandomForestRegressor(n_estimators=100, max_depth=5)  # Model for D
dml_plr_obj = dml.DoubleMLPLPR(
    obj_dml_data,
    ml_l=ml_l,
    ml_m=ml_m,
    n_folds=5,
    n_rep=3,
    score='partialling out',
    approach='cre_general',
)
logger.debug("fitting")
dml_plr_obj.fit(n_jobs_cv=4)
Expected Result
Well, no error ;)
Actual Result
m_hat_star = m_hat["preds"] + d_mean - m_hat_mean["m_hat"]
~~~~~~~~~~~~~~~^~~~~~~~
ValueError: operands could not be broadcast together with shapes (2836,) (2838,)
Versions
Windows-11-10.0.26200-SP0
Python 3.12.6 (tags/v3.12.6:a4a2d2b, Sep 6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]
DoubleML 0.11.2
Scikit-Learn 1.6.1