[Bug]: Issue in computing d_mean when rows are dropped (DoubleMLPanelData, DoubleMLPLPR, approach "cre_general") #391

@siebert-julien

Description

Describe the bug

The nuisance estimation (DoubleMLPLPR._nuisance_est) fails when the data preprocessing (DoubleMLPLPR._transform_data()) drops the rows of ids that have only a single observation.

The problem seems to be the following: d_mean is computed on the original data (before the transformation),

but m_hat_star is computed from the predictions and means of the transformed data (which potentially has fewer rows).

            # general cre adjustment
            if self._approach == "cre_general":
                d_mean = self._d_mean[:, self._i_treat]
                df_m_hat = pd.DataFrame({"id": self._dml_data.id_var, "m_hat": m_hat["preds"]})
                m_hat_mean = df_m_hat.groupby(["id"]).transform("mean")
                m_hat_star = m_hat["preds"] + d_mean - m_hat_mean["m_hat"] # <-- this raised an error
                m_hat["preds"] = m_hat_star
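
The mismatch can be reproduced in isolation with a toy pandas frame (a sketch; the variable names mirror the snippet above, but the data is made up):

```python
import numpy as np
import pandas as pd

# Toy panel: ids 1 and 2 have two rows each, id 3 has only one row.
original = pd.DataFrame({"id": [1, 1, 2, 2, 3], "d": [0.2, 0.4, 0.1, 0.3, 0.9]})

# d_mean computed on the ORIGINAL data: one value per row, 5 entries.
d_mean = original.groupby("id")["d"].transform("mean").values

# The preprocessing drops ids with a single row, so the predictions
# only cover the 4 remaining rows.
transformed = original[original.groupby("id")["id"].transform("size") > 1]
m_hat_preds = np.zeros(len(transformed))  # stand-in for m_hat["preds"]

try:
    m_hat_preds + d_mean  # shapes (4,) and (5,) do not align
except ValueError as err:
    print(err)  # operands could not be broadcast together ...
```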

Here is the stack trace:

C:\Users\siebert\PycharmProjects\tree-growth\.venv\Lib\site-packages\doubleml\plm\plpr.py:275: UserWarning: The data contains 2 id(s) with only one row. These row(s) have been dropped.
  warnings.warn(
Traceback (most recent call last):
  File "C:\Users\siebert\AppData\Local\JetBrains\PyCharm Community Edition 2024.3.3\plugins\python-ce\helpers\pydev\pydevd.py", line 1647, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\siebert\AppData\Local\JetBrains\PyCharm Community Edition 2024.3.3\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:\Users\siebert\PycharmProjects\tree-growth\tree_growth_analysis\estimation\estimation_doubleml.py", line 79, in <module>
    estimate_df, sensitivity_params = estimation(
                                      ^^^^^^^^^^^
  File "C:\Users\siebert\PycharmProjects\tree-growth\tree_growth_analysis\estimation\estimation_doubleml.py", line 48, in estimation
    dml_plr_obj.fit(n_jobs_cv=4)
  File "C:\Users\siebert\PycharmProjects\tree-growth\.venv\Lib\site-packages\doubleml\double_ml.py", line 593, in fit
    nuisance_predictions = self._fit_nuisance_and_score_elements(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\siebert\PycharmProjects\tree-growth\.venv\Lib\site-packages\doubleml\double_ml.py", line 1306, in _fit_nuisance_and_score_elements
    score_elements, preds = self._nuisance_est(
                            ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\siebert\PycharmProjects\tree-growth\.venv\Lib\site-packages\doubleml\plm\plpr.py", line 401, in _nuisance_est
    m_hat_star = m_hat["preds"] + d_mean - m_hat_mean["m_hat"]
                 ~~~~~~~~~~~~~~~^~~~~~~~
ValueError: operands could not be broadcast together with shapes (2836,) (2838,) 

My guess is that the computation of d_mean uses the original data and not the transformed data in the following function:

    def _set_d_mean(self):
        if self._approach in ["cre_general", "cre_normal"]:
            data = self._original_dml_data.data
            d_cols = self._original_dml_data.d_cols
            id_col = self._original_dml_data.id_col
            help_d_mean = data.loc[:, [id_col] + d_cols]
            d_mean = help_d_mean.groupby(id_col).transform("mean").values
            self._d_mean = d_mean
        else:
            self._d_mean = None

From what I understand, the transformed data is stored in self.data_transform (also self._dml_data)
and should already exist when d_mean is computed, so the code could look like this:

    def _set_d_mean(self):
        if self._approach in ["cre_general", "cre_normal"]:
            data = self._dml_data.data      # or self.data_transform.data
            d_cols = self._dml_data.d_cols  # or self.data_transform.d_cols
            id_col = self._dml_data.id_col  # or self.data_transform.id_col
            help_d_mean = data.loc[:, [id_col] + d_cols]
            d_mean = help_d_mean.groupby(id_col).transform("mean").values
            self._d_mean = d_mean
        else:
            self._d_mean = None
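
With the group means and the predictions both coming from the same (transformed) frame, the shapes line up by construction. A toy sketch of the aligned computation (made-up data, zeros standing in for the predictions):

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"id": [1, 1, 2, 2, 3], "d": [0.2, 0.4, 0.1, 0.3, 0.9]})

# Drop ids with a single row, as _transform_data() does.
transformed = original[original.groupby("id")["id"].transform("size") > 1]

# Means taken from the transformed frame have one entry per remaining
# row, so they align with predictions made on the transformed data.
d_mean = transformed.groupby("id")["d"].transform("mean").values
m_hat_preds = np.zeros(len(transformed))
m_hat_star = m_hat_preds + d_mean  # shapes (4,) and (4,): no error
```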

Minimum reproducible code snippet

I cannot share the data I am working with, but you should be able to reproduce the issue with any panel data set where some individuals have only one observation (e.g. a single measurement over time).
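
A synthetic panel along these lines could be generated as follows (a sketch; the column names id, t, x1, x2, d, y are illustrative and would map onto id_col, t_col, x_cols, d_cols, and y_col in the snippet below):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_ids, n_periods = 50, 4
rows = []
for i in range(n_ids):
    # Give the last two individuals only a single observation to
    # trigger the "dropped rows" warning in _transform_data().
    periods = 1 if i >= n_ids - 2 else n_periods
    for t in range(periods):
        x = rng.normal(size=2)
        d = x.sum() + rng.normal()
        y = 0.5 * d + x.sum() + rng.normal()
        rows.append({"id": i, "t": t, "x1": x[0], "x2": x[1], "d": d, "y": y})
df = pd.DataFrame(rows)
```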

These are the objects I am using:

    obj_dml_data = dml.DoubleMLPanelData(
        df[columns_to_keep].copy().reset_index(drop=True),
        y_col=y_col,  # Outcome
        d_cols=d_cols,  # Treatment
        x_cols=x_cols,  # Confounders
        t_col=t_col,  # Time
        id_col=id_col,  # How to cluster the data for fixed effects
        static_panel=True,
    )


    ml_l = RandomForestRegressor(n_estimators=100, max_depth=5)  # Model for Y
    ml_m = RandomForestRegressor(n_estimators=100, max_depth=5)  # Model for D

    dml_plr_obj = dml.DoubleMLPLPR(
        obj_dml_data,
        ml_l=ml_l,
        ml_m=ml_m,
        n_folds=5,
        n_rep=3,
        score='partialling out',
        approach='cre_general',
    )

    logger.debug("fitting")
    dml_plr_obj.fit(n_jobs_cv=4)

Expected Result

Well, no error ;)

Actual Result

m_hat_star = m_hat["preds"] + d_mean - m_hat_mean["m_hat"]
             ~~~~~~~~~~~~~~~^~~~~~~~

ValueError: operands could not be broadcast together with shapes (2836,) (2838,)

Versions

Windows-11-10.0.26200-SP0
Python 3.12.6 (tags/v3.12.6:a4a2d2b, Sep 6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]
DoubleML 0.11.2
Scikit-Learn 1.6.1
