correct handling of categorical in and outliers and typing correction by JHogenboom · Pull Request #6 · STRONGAYA/v6-tools-general

JHogenboom · 2025-09-09T08:01:11Z

Resolve categorical inliers handling and consequential changes in typing hints

Copilot

Pull Request Overview

This pull request corrects handling of categorical data type detection and resolves typing inconsistencies throughout the codebase. The changes focus on improving robustness when working with pandas categorical data and ensuring proper type handling.

Replaces hasattr(dtype, "categories") checks with explicit isinstance(dtype, pd.CategoricalDtype) checks
Fixes categorical data processing by removing automatic category constraint application
Corrects numpy array handling for ExtensionArrays and improves type safety
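The first bullet above can be illustrated with a minimal sketch. The PR does not state why the attribute check was replaced; one plausible reason, assumed here, is that `hasattr(dtype, "categories")` accepts any object exposing that attribute, while `isinstance` matches only pandas' own categorical dtype. The helper `is_categorical` and the class `LooksCategorical` are hypothetical names for illustration and are not taken from the repository.

```python
import pandas as pd


def is_categorical(dtype) -> bool:
    """Explicit check: True only for pandas' CategoricalDtype."""
    return isinstance(dtype, pd.CategoricalDtype)


class LooksCategorical:
    """Hypothetical non-pandas object that merely has a `categories` attribute."""

    categories = ["a", "b"]


cat_series = pd.Series(["x", "y", "x"], dtype="category")
int_series = pd.Series([1, 2, 3])

# The attribute-based check is satisfied by the look-alike object...
duck_typed = hasattr(LooksCategorical(), "categories")
# ...whereas the explicit isinstance check rejects it.
explicit = is_categorical(LooksCategorical())

print(is_categorical(cat_series.dtype), is_categorical(int_series.dtype))
print(duck_typed, explicit)
```

The explicit check also gives static type checkers a concrete type to narrow to, which fits the typing corrections mentioned in the PR description.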

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
src/vantage6_strongaya_general/privacy_measures.py	Updates categorical dtype detection and fixes typing for datetime bins
src/vantage6_strongaya_general/miscellaneous.py	Corrects string dtype and removes problematic categorical inliers handling
src/vantage6_strongaya_general/general_statistics.py	Adds numpy array conversion for ExtensionArrays and improves variable handling
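The actual conversion added to general_statistics.py is not shown on this page. As a rough sketch of what converting a pandas ExtensionArray to a NumPy array can look like, the hypothetical helper `as_float_ndarray` below (not a function from the repository) turns a nullable integer array into a plain float array, with missing values mapped to NaN.

```python
import numpy as np
import pandas as pd


def as_float_ndarray(values) -> np.ndarray:
    """Hypothetical helper: return a float64 ndarray for array-like input."""
    if isinstance(values, pd.api.extensions.ExtensionArray):
        # ExtensionArrays such as nullable Int64 use pd.NA for missing
        # values, so the replacement value is stated explicitly.
        return values.to_numpy(dtype="float64", na_value=np.nan)
    return np.asarray(values, dtype="float64")


nullable = pd.array([1, 2, None], dtype="Int64")  # a pandas ExtensionArray
converted = as_float_ndarray(nullable)

# NumPy routines can now treat the missing entry as NaN.
mean_without_missing = np.nanmean(converted)
print(converted.dtype, mean_without_missing)
```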


Copilot · 2025-09-09T08:01:41Z

+        inliers_series = column_values[column_values.index.isin(inliers)]
+        outliers_series = column_values[~column_values.index.isin(inliers)]


The logic for categorical inliers/outliers filtering is incorrect. For categorical data, you should filter based on the values themselves, not the index. This should be column_values.isin(inliers) and ~column_values.isin(inliers) respectively.

Suggested change

-        inliers_series = column_values[column_values.index.isin(inliers)]
-        outliers_series = column_values[~column_values.index.isin(inliers)]
+        inliers_series = column_values[column_values.isin(inliers)]
+        outliers_series = column_values[~column_values.isin(inliers)]

This is incorrect: we should use the index, because value counts are passed to this function.
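A small sketch shows why the index is the right thing to filter on when the input is the output of `value_counts()`: the category labels sit in the index and the values are counts. The data and the `inliers` list are made up for illustration. The variable names follow the diff, but the surrounding function is not shown on this page.

```python
import pandas as pd

raw = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# value_counts() puts the category labels in the index and counts in the values.
column_values = raw.value_counts()

inliers = ["red", "blue"]  # hypothetical list of accepted categories

# Filtering on the index compares labels with labels.
inliers_series = column_values[column_values.index.isin(inliers)]
outliers_series = column_values[~column_values.index.isin(inliers)]

# Filtering on the values compares counts with labels, so nothing matches.
by_value = column_values[column_values.isin(inliers)]

print(inliers_series.to_dict())
print(outliers_series.to_dict())
print(by_value.empty)
```

The value-based filter suggested in the review would fit a Series of raw observations, which is not what this function receives according to the reply.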

JHogenboom requested a review from Copilot September 9, 2025 08:01

JHogenboom added the enhancement label Sep 9, 2025

Copilot AI reviewed Sep 9, 2025


JHogenboom closed this Sep 9, 2025

JHogenboom force-pushed the update-inlier-handling branch from 55bec77 to 3551085 September 9, 2025 09:40

JHogenboom wants to merge 0 commits into main from update-inlier-handling

2 participants
