correct handling of categorical in and outliers and typing correction #6
JHogenboom wants to merge 0 commit into
Pull Request Overview
This pull request corrects handling of categorical data type detection and resolves typing inconsistencies throughout the codebase. The changes focus on improving robustness when working with pandas categorical data and ensuring proper type handling.
- Replaces `hasattr(dtype, "categories")` checks with explicit `isinstance(dtype, pd.CategoricalDtype)` checks
- Fixes categorical data processing by removing automatic category constraint application
- Corrects numpy array handling for ExtensionArrays and improves type safety
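To illustrate the first point, here is a minimal sketch of why an explicit `isinstance` check is stricter than the duck-typed `hasattr` check. The helper `is_categorical` and the class `LookalikeDtype` are hypothetical names made up for this illustration; they are not taken from the changed files.

```python
import pandas as pd


class LookalikeDtype:
    """Hypothetical object that merely happens to expose `categories`."""

    categories = ["a", "b"]


def is_categorical(dtype) -> bool:
    # Explicit type check: only genuine pandas categorical dtypes pass.
    return isinstance(dtype, pd.CategoricalDtype)


cat_series = pd.Series(["a", "b", "a"], dtype="category")
num_series = pd.Series([1, 2, 3])
lookalike = LookalikeDtype()

# The duck-typed check accepts anything with a `categories` attribute.
duck_typed_result = hasattr(lookalike, "categories")

# The explicit check rejects the lookalike and accepts the real dtype.
explicit_lookalike = is_categorical(lookalike)
explicit_categorical = is_categorical(cat_series.dtype)
explicit_numeric = is_categorical(num_series.dtype)
```

The explicit check also gives static type checkers something to narrow on, which fits the typing corrections described in this overview.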
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/vantage6_strongaya_general/privacy_measures.py | Updates categorical dtype detection and fixes typing for datetime bins |
| src/vantage6_strongaya_general/miscellaneous.py | Corrects string dtype and removes problematic categorical inliers handling |
| src/vantage6_strongaya_general/general_statistics.py | Adds numpy array conversion for ExtensionArrays and improves variable handling |
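As a rough sketch of what a numpy conversion for ExtensionArrays can look like, the example below converts a nullable `Int64` column with `Series.to_numpy`. The choice of `float64` and `np.nan` as the missing-value marker is an assumption for illustration; the actual conversion in `general_statistics.py` may differ.

```python
import numpy as np
import pandas as pd

# A nullable integer column is backed by a pandas ExtensionArray, not a plain ndarray.
nullable = pd.Series([1, 2, None], dtype="Int64")

# Assumed conversion: request a float ndarray and map pd.NA to np.nan.
as_numpy = nullable.to_numpy(dtype="float64", na_value=np.nan)

# NaN-aware numpy functions now work on the result.
mean_value = float(np.nanmean(as_numpy))
```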
    inliers_series = column_values[column_values.index.isin(inliers)]
    outliers_series = column_values[~column_values.index.isin(inliers)]
The logic for categorical inliers/outliers filtering is incorrect. For categorical data, you should filter based on the values themselves, not the index. This should be `column_values.isin(inliers)` and `~column_values.isin(inliers)` respectively.
    - inliers_series = column_values[column_values.index.isin(inliers)]
    - outliers_series = column_values[~column_values.index.isin(inliers)]
    + inliers_series = column_values[column_values.isin(inliers)]
    + outliers_series = column_values[~column_values.isin(inliers)]
This is incorrect: we should use the index, because value counts are passed to this function.
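A small sketch of the point made in this reply, assuming `column_values` is the result of `Series.value_counts()` and `inliers` is a list of category labels (both the sample data and the `inliers` list are made up here). In a value-counts series the categories live in the index and the values are counts, so filtering on the values compares counts against labels.

```python
import pandas as pd

raw = pd.Series(["red", "red", "blue", "green", "red", "blue"])

# value_counts() puts the categories in the index and the counts in the values.
column_values = raw.value_counts()

# Hypothetical list of categories considered inliers.
inliers = ["red", "blue"]

# Filtering on the index matches category labels, as in the code under review.
inliers_series = column_values[column_values.index.isin(inliers)]
outliers_series = column_values[~column_values.index.isin(inliers)]

# Filtering on the values compares counts (3, 2, 1) against labels and matches nothing.
filtered_by_value = column_values[column_values.isin(inliers)]
```

If the function received the raw column instead of its value counts, filtering on the values would be the right choice, so the two positions differ only in what they assume about the input.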
55bec77 to 3551085
Resolve categorical inliers handling and consequential changes in typing hints