correct handling of categorical in and outliers and typing correction #6

Closed
JHogenboom wants to merge 0 commits into main from update-inlier-handling

Conversation

@JHogenboom

Contributor

Resolve categorical inliers handling and consequential changes in typing hints

@JHogenboom JHogenboom requested a review from Copilot September 9, 2025 08:01
@JHogenboom JHogenboom added the enhancement New feature or request label Sep 9, 2025

Copilot AI left a comment

Pull Request Overview

This pull request corrects handling of categorical data type detection and resolves typing inconsistencies throughout the codebase. The changes focus on improving robustness when working with pandas categorical data and ensuring proper type handling.

  • Replaces hasattr(dtype, "categories") checks with explicit isinstance(dtype, pd.CategoricalDtype) checks
  • Fixes categorical data processing by removing automatic category constraint application
  • Corrects numpy array handling for ExtensionArrays and improves type safety
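As a minimal sketch of the first point, the snippet below contrasts the attribute check with the explicit type check. The helper `is_categorical` and the class `FakeDtype` are hypothetical names used only for illustration; they are not taken from the repository.

```python
import pandas as pd


def is_categorical(series: pd.Series) -> bool:
    """Return True only when the series has a pandas categorical dtype."""
    return isinstance(series.dtype, pd.CategoricalDtype)


class FakeDtype:
    """Hypothetical object that merely happens to expose `categories`."""

    categories = ["a", "b"]


categorical_series = pd.Series(["a", "b", "a"], dtype="category")
object_series = pd.Series(["a", "b", "a"])

# The explicit check accepts a real categorical dtype and rejects others.
print(is_categorical(categorical_series))
print(is_categorical(object_series))

# An attribute check would accept any object with a `categories` attribute,
# while the isinstance check does not.
print(hasattr(FakeDtype(), "categories"))
print(isinstance(FakeDtype(), pd.CategoricalDtype))
```

The `isinstance` form also lets static type checkers narrow the dtype, which fits the typing corrections mentioned in this pull request.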

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File: Description
src/vantage6_strongaya_general/privacy_measures.py: Updates categorical dtype detection and fixes typing for datetime bins
src/vantage6_strongaya_general/miscellaneous.py: Corrects string dtype and removes problematic categorical inliers handling
src/vantage6_strongaya_general/general_statistics.py: Adds numpy array conversion for ExtensionArrays and improves variable handling
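The diff for general_statistics.py is not shown in this conversation, so the following is only a sketch of what converting an ExtensionArray-backed column to a plain numpy array can look like. The variable names and the choice of `float64` with `NaN` for missing values are assumptions, not the code from the pull request.

```python
import numpy as np
import pandas as pd

# A nullable integer column is backed by a pandas ExtensionArray, not an ndarray.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(isinstance(nullable.array, pd.api.extensions.ExtensionArray))

# Convert explicitly, stating the target dtype and how missing values are encoded
# (assumed here: float64 with NaN).
values = nullable.to_numpy(dtype="float64", na_value=np.nan)
print(type(values), values.dtype)

# numpy routines that expect a plain ndarray can now be applied directly.
print(np.nanmean(values))
```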

Comment on lines +836 to +837
inliers_series = column_values[column_values.index.isin(inliers)]
outliers_series = column_values[~column_values.index.isin(inliers)]

Copilot AI Sep 9, 2025

The logic for categorical inliers/outliers filtering is incorrect. For categorical data, you should filter based on the values themselves, not the index. This should be column_values.isin(inliers) and ~column_values.isin(inliers) respectively.

Suggested change
- inliers_series = column_values[column_values.index.isin(inliers)]
- outliers_series = column_values[~column_values.index.isin(inliers)]
+ inliers_series = column_values[column_values.isin(inliers)]
+ outliers_series = column_values[~column_values.isin(inliers)]

Contributor Author

This is incorrect: we should use the index, because value counts are passed to this function.
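A small sketch of the author's point, assuming `column_values` is the result of `Series.value_counts()` and `inliers` is a list of category labels (the sample data and the name `by_value` are made up for illustration). In a value-counts series the categories are in the index and the values are counts, so filtering on the values compares counts against labels.

```python
import pandas as pd

raw = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# value_counts() puts the categories in the index and the counts in the values.
column_values = raw.value_counts()
inliers = ["red", "blue"]  # assumed: category labels, not counts

# Filtering on the index, as in the pull request, selects by category label.
inliers_series = column_values[column_values.index.isin(inliers)]
outliers_series = column_values[~column_values.index.isin(inliers)]
print(inliers_series.to_dict())
print(outliers_series.to_dict())

# Filtering on the values compares the integer counts with the labels,
# so nothing matches and every category would be treated as an outlier.
by_value = column_values[column_values.isin(inliers)]
print(by_value.empty)
```

The suggested change would be the right one if the function received the raw column rather than its value counts.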

@JHogenboom JHogenboom closed this Sep 9, 2025
@JHogenboom JHogenboom force-pushed the update-inlier-handling branch from 55bec77 to 3551085 Compare September 9, 2025 09:40
Labels

enhancement New feature or request

2 participants