Using Int64 type for features in BetaGeoModel causes expected_purchases() to fail with TypeError #1471

stochastic1 · 2025-02-05T17:15:41Z

Environment: jupyter notebook on GCP VertexAI instance, 16 vCPUs, 104GB RAM
Python version: 3.10.15
pymc-marketing version: <module 'pymc._version' from '/opt/conda/lib/python3.10/site-packages/pymc/_version.py'>
pandas version: 2.2.3
numpy version: 1.26.4

Expectation: Passing integer types for frequency, T and recency into the BetaGeo model (as in a setup with daily activity and daily prediction steps) should allow the expected_purchases() method to execute successfully.

Observed result: Passing integer types for frequency, T and recency to the BetaGeo model causes the expected_purchases() method to fail with TypeError: 'PandasExtensionArray' object is not callable. Passing float64 types for these three features allows expected_purchases() to succeed.

Hypothesis: The _extract_predictive_variables() method used by expected_purchases() in the BetaGeoModel class only returns a usable result if float values are passed for T, frequency and recency. Passing any of these as Int64 forces a conversion of that feature to an xarray.core.extension_array, which causes the function to fail.
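The dtype distinction behind this hypothesis can be checked before calling the model. The following is a minimal sketch using pandas only; the helper name extension_columns and the sample values are hypothetical, and it does not reproduce the library's internals. It flags columns backed by a pandas extension dtype such as the nullable Int64, as opposed to a plain NumPy dtype such as float64.

```python
import pandas as pd

# Hypothetical five-customer frame using the column names from this issue.
data_small = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4, 5],
        "frequency": pd.array([3, 0, 7, 1, 2], dtype="Int64"),
        "recency": pd.array([30, 0, 88, 12, 45], dtype="Int64"),
        "T": pd.array([90, 90, 90, 60, 75], dtype="Int64"),
    }
)


def extension_columns(df, columns=("frequency", "recency", "T")):
    """Return the columns whose dtype is a pandas extension dtype (e.g. Int64)."""
    return [
        col
        for col in columns
        if isinstance(df[col].dtype, pd.api.extensions.ExtensionDtype)
    ]


# Nullable Int64 (capital I) is an extension dtype; NumPy float64 is not.
flagged = extension_columns(data_small)
print(flagged)
```

Any column listed in flagged is a candidate for casting to float64 before prediction.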

Context: I read in a dataframe with frequency, T and recency as Int64 values, intending to execute the BetaGeo model for daily observations.
The BetaGeo model converged on 100k unique customers in about 20 minutes. After model fit, I created a sample dataframe of 5 customers following the tutorial. Note the frequency, recency and T columns are Int64 type.

I called the expected_purchases() method for these five customers, again following the tutorial, and received the error:
TypeError: 'PandasExtensionArray' object is not callable

StackTrace below, but I traced it back to the _extract_predictive_variables method treating Int64 features differently from float64 features:
When data_small has Int64 columns, _extract_predictive_variables converts them to an xarray extension-array type:

When data_small is float64, _extract_predictive_variables persists them as float64:

StackTrace below:


stochastic1 · 2025-02-05T17:16:52Z

@ColtAllen , I wonder if this is something you've seen.

ColtAllen · 2025-02-05T17:46:49Z

Hey @stochastic1,

float64 types are required for the xarray operations happening under the hood. The frequency feature is always a whole number, but this need not be the case for recency & T. An example of this would be raw data at a daily granularity, but summarized via clv.rfm_summary() for modeling at the weekly level.
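To illustrate why recency and T need not be whole numbers, here is a rough sketch in plain pandas of daily transactions rescaled to weeks. It does not reproduce what clv.rfm_summary() does internally; the transaction data, the period_end date, and the simplified frequency definition (distinct purchase dates minus one) are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw transactions at daily granularity.
transactions = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2],
        "date": pd.to_datetime(
            ["2024-01-01", "2024-01-10", "2024-02-08", "2024-01-20"]
        ),
    }
)
period_end = pd.Timestamp("2024-03-01")  # assumed end of the observation period

grouped = transactions.groupby("customer_id")["date"]
summary = pd.DataFrame(
    {
        # Repeat purchases: distinct purchase dates minus the first one.
        "frequency": grouped.nunique() - 1,
        # Days between first and last purchase.
        "recency_days": (grouped.max() - grouped.min()).dt.days,
        # Days between first purchase and the end of the observation period.
        "T_days": (period_end - grouped.min()).dt.days,
    }
)

# Rescaling days to weeks generally produces fractional values.
summary["recency_weeks"] = summary["recency_days"] / 7
summary["T_weeks"] = summary["T_days"] / 7
print(summary)
```

The rescaled columns come out as float64 even though the day counts are integers, which is the situation described above.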

Also, training data is saved as a model attribute, so unless you need to run predictions on out-of-sample customers, you can run BetaGeoModel.expected_purchases() without passing data as an argument.

stochastic1 · 2025-02-05T20:12:23Z

Thanks for the explanation. I'll use clv.rfm_summary() in my next pass.
Is the intended use case a weekly-level summary rather than daily? That would inform our strategy.

I did try executing with frequency as a whole number (int64) but recency and T as float64 type and the same TypeError emerged regarding PandasExtensionArray objects.

Thanks for the tip on training data as an attribute for BetaGeoModel.expected_purchases(). My use case is actually to collapse the probabilistic estimates into point estimates at various steps to compare forecasts to observed results in a holdout set. Executing wide-open the matrix is enormous and I run out of memory. Are there point-estimate functions built in for the model outputs? Otherwise I'd expect to use methods from xarray.

ColtAllen · 2025-02-05T20:51:38Z

> Is the intended use case a weekly-level summary rather than daily? That would inform our strategy.

For daily raw data spanning many years, summarizing to weekly or monthly can help with model convergence, but ultimately it depends on your specific use case. If this model is to be used in a monthly business report for example, monthly predictions might make more sense. Summarizing to weekly would also make sense if your data has strong seasonality trends for days of the week.

> I did try executing with frequency as a whole number (int64) but recency and T as float64 type and the same TypeError emerged regarding PandasExtensionArray objects.

I used the term "whole number" loosely, to cover both 1.0 and 1. All variables require float64 dtypes regardless. clv.rfm_summary() will handle this type casting automatically.
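For data that does not come from clv.rfm_summary(), such as a hand-built frame of out-of-sample customers, the cast can be done manually. This is a minimal sketch, not library code: the helper name to_float64 is hypothetical, and the column names are taken from this issue. It refuses to cast when values are missing, since a nullable Int64 column can hold missing entries that would otherwise turn into NaN.

```python
import pandas as pd

MODEL_COLUMNS = ["frequency", "recency", "T"]  # column names from this issue


def to_float64(df: pd.DataFrame, columns=MODEL_COLUMNS) -> pd.DataFrame:
    """Return a copy of df with the given columns cast to NumPy float64."""
    missing = df[columns].isna().any()
    if missing.any():
        # Fail loudly rather than silently passing NaN into a prediction call.
        raise ValueError(f"missing values in: {list(missing[missing].index)}")
    return df.astype({col: "float64" for col in columns})


# Hypothetical frame with nullable Int64 columns, as in the original report.
data_small = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "frequency": pd.array([3, 0, 7], dtype="Int64"),
        "recency": pd.array([30, 0, 88], dtype="Int64"),
        "T": pd.array([90, 90, 90], dtype="Int64"),
    }
)
data_float = to_float64(data_small)
print(data_float.dtypes)
```

The resulting data_float frame is what would be passed as the data argument; the original frame is left unchanged.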

> My use case is actually to collapse the probabilistic estimates into point estimates at various steps
> . . .
> Executing wide-open the matrix is enormous and I run out of memory. Are there point-estimate functions built in for the model outputs? Otherwise I'd expect to use methods from xarray.

model.fit(fit_method='map') will fit a model in seconds and return point estimates instead of full posterior distributions. However, you'll lose the credibility intervals for predictions. To get point estimates from a full posterior, the xarray syntax is something like this: point_preds = model.expected_purchases(future_t=10).mean(("chain","draw"))
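As a rough illustration of that reduction, the sketch below applies the same .mean(("chain", "draw")) call to a simulated array rather than real model output. The array shape, the gamma-distributed values, and the customer IDs are assumptions made so the example runs without a fitted model; the quantile call is added to show one way to keep an interval alongside the point estimate.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Simulated stand-in for posterior predictions: 4 chains x 500 draws x 5 customers.
draws = xr.DataArray(
    rng.gamma(shape=2.0, scale=0.5, size=(4, 500, 5)),
    dims=("chain", "draw", "customer_id"),
    coords={"customer_id": [101, 102, 103, 104, 105]},
    name="expected_purchases",
)

# Collapse the sampling dimensions into one point estimate per customer.
point_preds = draws.mean(("chain", "draw"))

# Optionally keep a 94% interval instead of discarding the uncertainty.
interval = draws.quantile([0.03, 0.97], dim=("chain", "draw"))

print(point_preds.to_series())
```

After the reduction only the customer_id dimension remains, so the result is far smaller than the full chain-by-draw array.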

> to observed results in a holdout set.

clv.rfm_train_test_split can be used for train/test splits, and the results plotted with plot_expected_purchases_over_time. That plotting function is a recent addition and hasn't seen much testing yet for train/test splits, so if you have any problems with it, let me know and I'll get a work issue created.

Be mindful of any seasonal/holiday events that may bias results in the train/test periods. I'm planning to add high/low seasonality support, but not until Q4 this year:

stochastic1 · 2025-02-09T17:21:53Z

@ColtAllen , thank you for the guidance. I am working through these and will follow up by 2/12.

github-actions bot added the Needs Triage label Feb 5, 2025

ColtAllen added the invalid and question labels and removed the Needs Triage label Feb 5, 2025

ColtAllen removed the invalid label Feb 6, 2025
