Describe the bug
The nuisance estimation (DoubleMLPLPR._nuisance_est) fails when the data preprocessing (DoubleMLPLPR._transform_data()) drops ids with only one datapoint.
The problem seems to be the following: d_mean is computed from the original data (before the transformation),
but m_hat_star is computed from the predictions and means on the transformed data, which potentially has fewer rows.
# general cre adjustment
if self._approach == "cre_general":
    d_mean = self._d_mean[:, self._i_treat]
    df_m_hat = pd.DataFrame({"id": self._dml_data.id_var, "m_hat": m_hat["preds"]})
    m_hat_mean = df_m_hat.groupby(["id"]).transform("mean")
    m_hat_star = m_hat["preds"] + d_mean - m_hat_mean["m_hat"]  # <-- this raised an error
    m_hat["preds"] = m_hat_star
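The shape mismatch can be reproduced in isolation with a toy panel (a minimal sketch, not the library code; the arrays and lengths are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy panel: ids 3 and 4 each have only one row and get dropped by the transform.
ids_original = np.array([1, 1, 2, 2, 3, 4])
ids_transformed = ids_original[:4]  # singleton ids dropped

d = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 1.1])

# d_mean computed on the ORIGINAL data -> length 6
d_mean = pd.Series(d).groupby(ids_original).transform("mean").values

# predictions only exist for the TRANSFORMED data -> length 4
preds = np.zeros(len(ids_transformed))
m_hat_mean = pd.Series(preds).groupby(ids_transformed).transform("mean").values

try:
    preds + d_mean - m_hat_mean
except ValueError as e:
    print(e)  # operands could not be broadcast together ...
```

This is exactly the (2836,) vs. (2838,) situation from the trace, just with smaller numbers.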
Here is the stack trace:
C:\Users\siebert\PycharmProjects\tree-growth\.venv\Lib\site-packages\doubleml\plm\plpr.py:275: UserWarning: The data contains 2 id(s) with only one row. These row(s) have been dropped.
warnings.warn(
Traceback (most recent call last):
File "C:\Users\siebert\AppData\Local\JetBrains\PyCharm Community Edition 2024.3.3\plugins\python-ce\helpers\pydev\pydevd.py", line 1647, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\siebert\AppData\Local\JetBrains\PyCharm Community Edition 2024.3.3\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:\Users\siebert\PycharmProjects\tree-growth\tree_growth_analysis\estimation\estimation_doubleml.py", line 79, in <module>
estimate_df, sensitivity_params = estimation(
^^^^^^^^^^^
File "C:\Users\siebert\PycharmProjects\tree-growth\tree_growth_analysis\estimation\estimation_doubleml.py", line 48, in estimation
dml_plr_obj.fit(n_jobs_cv=4)
File "C:\Users\siebert\PycharmProjects\tree-growth\.venv\Lib\site-packages\doubleml\double_ml.py", line 593, in fit
nuisance_predictions = self._fit_nuisance_and_score_elements(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\siebert\PycharmProjects\tree-growth\.venv\Lib\site-packages\doubleml\double_ml.py", line 1306, in _fit_nuisance_and_score_elements
score_elements, preds = self._nuisance_est(
^^^^^^^^^^^^^^^^^^^
File "C:\Users\siebert\PycharmProjects\tree-growth\.venv\Lib\site-packages\doubleml\plm\plpr.py", line 401, in _nuisance_est
m_hat_star = m_hat["preds"] + d_mean - m_hat_mean["m_hat"]
~~~~~~~~~~~~~~~^~~~~~~~
ValueError: operands could not be broadcast together with shapes (2836,) (2838,)
My guess is that the computation of d_mean uses the original data instead of the transformed data in the following function:
def _set_d_mean(self):
    if self._approach in ["cre_general", "cre_normal"]:
        data = self._original_dml_data.data
        d_cols = self._original_dml_data.d_cols
        id_col = self._original_dml_data.id_col
        help_d_mean = data.loc[:, [id_col] + d_cols]
        d_mean = help_d_mean.groupby(id_col).transform("mean").values
        self._d_mean = d_mean
    else:
        self._d_mean = None
From what I understand, the transformed data is stored in self.data_transform (also self._dml_data)
and should already exist when d_mean is computed, so the code could look like this:
def _set_d_mean(self):
    if self._approach in ["cre_general", "cre_normal"]:
        data = self._dml_data.data  # or self.data_transform.data
        d_cols = self._dml_data.d_cols  # or self.data_transform.d_cols
        id_col = self._dml_data.id_col  # or self.data_transform.id_col
        help_d_mean = data.loc[:, [id_col] + d_cols]
        d_mean = help_d_mean.groupby(id_col).transform("mean").values
        self._d_mean = d_mean
    else:
        self._d_mean = None
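Continuing the toy example from above, once d_mean is computed on the already-transformed data (singleton ids dropped), the lengths agree and the adjustment goes through (again just a sketch with made-up values):

```python
import numpy as np
import pandas as pd

# Transformed data only: the singleton ids are already dropped.
df_transformed = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "d":  [0.1, 0.3, 0.5, 0.7],
})

# d_mean computed on the transformed data has the same length as the predictions.
d_mean = df_transformed.groupby("id")["d"].transform("mean").values

preds = np.zeros(len(df_transformed))
m_hat_mean = pd.Series(preds).groupby(df_transformed["id"].values).transform("mean").values

m_hat_star = preds + d_mean - m_hat_mean  # no broadcasting error
print(m_hat_star.shape)  # (4,)
```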
Minimum reproducible code snippet
I cannot share the data I am working with, but you should be able to reproduce it with a panel data set where some individuals have only one observation (e.g., a single measurement over time).
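A synthetic panel with this property could be built like this (column names and sizes are hypothetical, chosen only to match the reproduction described above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical panel: ids 0-9 have 3 periods each,
# ids 10 and 11 have only a single observation.
n_full, n_periods = 10, 3
ids = np.repeat(np.arange(n_full), n_periods).tolist() + [10, 11]
t = np.tile(np.arange(n_periods), n_full).tolist() + [0, 0]
n = len(ids)

df = pd.DataFrame({
    "id": ids,
    "t": t,
    "x1": rng.normal(size=n),
    "d": rng.normal(size=n),
})
df["y"] = 0.5 * df["d"] + df["x1"] + rng.normal(size=n)

# Two ids have a single row, which should trigger the UserWarning about dropped rows.
counts = df["id"].value_counts()
print((counts == 1).sum())  # 2
```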
These are the objects I am using:
obj_dml_data = dml.DoubleMLPanelData(
    df[columns_to_keep].copy().reset_index(drop=True),
    y_col=y_col,    # Outcome
    d_cols=d_cols,  # Treatment
    x_cols=x_cols,  # Confounders
    t_col=t_col,    # Time
    id_col=id_col,  # How to cluster the data for fixed effects
    static_panel=True,
)
ml_l = RandomForestRegressor(n_estimators=100, max_depth=5)  # Model for Y
ml_m = RandomForestRegressor(n_estimators=100, max_depth=5)  # Model for D
dml_plr_obj = dml.DoubleMLPLPR(
    obj_dml_data,
    ml_l=ml_l,
    ml_m=ml_m,
    n_folds=5,
    n_rep=3,
    score='partialling out',
    approach='cre_general',
)
logger.debug("fitting")
dml_plr_obj.fit(n_jobs_cv=4)
Expected Result
Well, no error ;)
Actual Result
m_hat_star = m_hat["preds"] + d_mean - m_hat_mean["m_hat"]
~~~~~~~~~~~~~~~^~~~~~~~
ValueError: operands could not be broadcast together with shapes (2836,) (2838,)
Versions
Windows-11-10.0.26200-SP0
Python 3.12.6 (tags/v3.12.6:a4a2d2b, Sep 6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]
DoubleML 0.11.2
Scikit-Learn 1.6.1