Using Int64 type for features in BetaGeoModel causes expected_purchases() to fail with TypeError #1471

Open
stochastic1 opened this issue Feb 5, 2025 · 5 comments
Labels
question Further information is requested

Comments

stochastic1 commented Feb 5, 2025

Environment: jupyter notebook on GCP VertexAI instance, 16 vCPUs, 104GB RAM
Python version: 3.10.15
pymc-marketing version: <module 'pymc._version' from '/opt/conda/lib/python3.10/site-packages/pymc/_version.py'>
pandas version: 2.2.3
numpy version: 1.26.4

Expectation: Passing integer types for frequency, T and recency into the BetaGeo model (as in the case of daily activity with daily expectation steps) should allow the expected_purchases() method to execute successfully.

Observed result: Passing integer types for frequency, T and recency to the BetaGeo model causes the expected_purchases() method to fail with TypeError: 'PandasExtensionArray' object is not callable. Passing float64 types for these three features allows expected_purchases() to succeed.

Hypothesis: The _extract_predictive_variables() method called by expected_purchases() in the BetaGeoModel class only returns a usable result if float values are passed for T, frequency and recency. Passing any of these as Int64 converts that feature to an xarray.core.extension_array type, which causes the function to fail.
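A quick way to check whether a dataframe carries the nullable pandas dtypes described in the hypothesis is sketched below. It uses only pandas; the sample values and the helper name `extension_columns` are made up for illustration and are not part of pymc-marketing.

```python
import pandas as pd

# Hypothetical sample frame; the three RFM columns use the nullable Int64 dtype.
data_small = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "frequency": pd.array([2, 0, 5], dtype="Int64"),
        "recency": pd.array([30, 0, 120], dtype="Int64"),
        "T": pd.array([200, 150, 365], dtype="Int64"),
    }
)


def extension_columns(df):
    """Return the columns backed by a pandas extension dtype (e.g. Int64)."""
    return [
        col
        for col in df.columns
        if pd.api.types.is_extension_array_dtype(df[col].dtype)
    ]


flagged = extension_columns(data_small)
print(flagged)  # ['frequency', 'recency', 'T']
```

Columns built from plain Python lists, such as `customer_id` here, get a NumPy dtype and are not flagged.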

Context: I read in a dataframe with frequency, T and recency as Int64 values, intending to execute the BetaGeo model for daily observations.
The BetaGeo model converged on 100k unique customers in about 20 minutes. After model fit, I created a sample dataframe of 5 customers following the tutorial. Note the frequency, recency and T columns are Int64 type.

I called the expected_purchases() method for these five customers, again following the tutorial, and received the error:
TypeError: 'PandasExtensionArray' object is not callable

StackTrace below, but I traced it back to the _extract_predictive_variables method treating Int64 features differently from float64 features:
When data_small has Int64 columns:

Image
_extract_predictive_variables converts them to an xarray extension array type:

Image

When data_small is float64, _extract_predictive_variables persists them as float64:

Image

StackTrace below:
[Image: stack trace, three screenshots]

@stochastic1
Author

@ColtAllen, I wonder if this is something you've seen.

@ColtAllen
Collaborator

ColtAllen commented Feb 5, 2025

Hey @stochastic1,

float64 types are required for the xarray operations happening under the hood. The frequency feature is always a whole number, but this need not be the case for recency & T. An example of this would be raw data at a daily granularity, but summarized via clv.rfm_summary() for modeling at the weekly level.

Also, training data is saved as a model attribute, so unless you need to run predictions on out-of-sample customers, you can run BetaGeoModel.expected_purchases() without passing data as an argument.

ColtAllen added the "invalid" and "question" labels and removed the "Needs Triage" label on Feb 5, 2025
@stochastic1
Author

Thanks for the explanation. I'll use clv.rfm_summary() in my next pass.
Is the intended use case a weekly-level summary rather than daily? That would inform our strategy.

I did try executing with frequency as a whole number (int64) but recency and T as float64 type and the same TypeError emerged regarding PandasExtensionArray objects.

Thanks for the tip on training data as an attribute for BetaGeoModel.expected_purchases(). My use case is actually to collapse the probabilistic estimates into point estimates at various steps to compare forecasts to observed results in a holdout set. Executing wide-open the matrix is enormous and I run out of memory. Are there point-estimate functions built in for the model outputs? Otherwise I'd expect to use methods from xarray.

@ColtAllen
Collaborator

ColtAllen commented Feb 5, 2025

> Is the intended use case a weekly-level summary rather than daily? That would inform our strategy.

For daily raw data spanning many years, summarizing to weekly or monthly can help with model convergence, but ultimately it depends on your specific use case. If this model is to be used in a monthly business report, for example, monthly predictions might make more sense. Summarizing to weekly would also make sense if your data has strong day-of-week seasonality.

> I did try executing with frequency as a whole number (int64) but recency and T as float64 type and the same TypeError emerged regarding PandasExtensionArray objects.

I used the term "whole number" loosely, to cover both 1.0 and 1. All variables require float64 datatypes regardless. clv.rfm_summary() will handle this type casting automatically.
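For a dataframe built by hand rather than by clv.rfm_summary(), the cast can be done explicitly before prediction. The following is a minimal sketch using only pandas; the helper name `to_float64` and the sample values are hypothetical, and the column names are the ones discussed in this thread.

```python
import pandas as pd

RFM_COLS = ["frequency", "recency", "T"]


def to_float64(df, cols=RFM_COLS):
    """Return a copy of df with the given columns cast to NumPy float64."""
    return df.astype({col: "float64" for col in cols})


# Hypothetical out-of-sample customers stored with nullable Int64 columns.
data_small = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "frequency": pd.array([2, 0, 5], dtype="Int64"),
        "recency": pd.array([30, 0, 120], dtype="Int64"),
        "T": pd.array([200, 150, 365], dtype="Int64"),
    }
)

data_small_float = to_float64(data_small)
print(data_small_float.dtypes)
```

The cast frame, rather than the original, would then be passed as the data argument to expected_purchases(). Note that the lowercase "float64" string selects the NumPy dtype; the capitalized pandas "Float64" is itself a nullable extension dtype.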

> My use case is actually to collapse the probabilistic estimates into point estimates at various steps
> . . .
> Executing wide-open the matrix is enormous and I run out of memory. Are there point-estimate functions built in for the model outputs? Otherwise I'd expect to use methods from xarray.

model.fit(fit_method='map') will fit a model in seconds and return point estimates instead of full posterior distributions. However, you'll lose the credible intervals for predictions. To get point estimates from a full posterior, the xarray syntax is something like this: point_preds = model.expected_purchases(future_t=10).mean(("chain","draw"))
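The reduction over "chain" and "draw" can be tried on a synthetic array without fitting a model. The sketch below assumes the prediction output is an xarray DataArray with dims ("chain", "draw", "customer_id"); the random values, the customer ids, and the 3%/97% interval bounds are illustrative choices, not taken from the library.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for posterior predictions: 4 chains, 250 draws, 5 customers.
rng = np.random.default_rng(0)
draws = xr.DataArray(
    rng.gamma(shape=2.0, scale=1.0, size=(4, 250, 5)),
    dims=("chain", "draw", "customer_id"),
    coords={"customer_id": [101, 102, 103, 104, 105]},
)

# Collapse the sampling dimensions into one point estimate per customer.
point_preds = draws.mean(("chain", "draw"))

# Optional: keep an interval alongside the mean instead of the full posterior.
interval = draws.quantile([0.03, 0.97], dim=("chain", "draw"))

print(point_preds.dims)  # ('customer_id',)
print(interval.dims)     # ('quantile', 'customer_id')
```

Converting the result with point_preds.to_series() gives a pandas Series indexed by customer_id, which is convenient for joining against observed holdout values.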

> to observed results in a holdout set.

clv.rfm_train_test_split can be used for train/test splits, and the results can be plotted with plot_expected_purchases_over_time. That plotting function is a recent addition and hasn't seen much testing yet for train/test splits, so if you have any problems with it let me know and I'll get a work issue created.

Be mindful of any seasonal/holiday events that may bias results in the train/test periods. I'm planning to add high/low seasonality support, but not until Q4 this year:

Image

ColtAllen removed the "invalid" label on Feb 6, 2025
@stochastic1
Author

@ColtAllen, thank you for the guidance. I am working through these and will follow up by 2/12.
