Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(SemanticLayerSchema): Refactoring using SemanticLayerSchema all over the code instead of the dictionary #1519

Closed
wants to merge 8 commits into from

Conversation

scaliseraoul
Copy link

@scaliseraoul scaliseraoul commented Jan 14, 2025

  1. Some of the unit tests were removed as duplicates of the newly created for SemanticLayerSchema
  2. Some checks on the scheme were removed because they are now redundant with pydantic checks

Important

Introduces SemanticLayerSchema for schema validation and updates components to utilize it, with corresponding test updates.

  • SemanticLayerSchema:
    • Introduces SemanticLayerSchema class in semantic_layer_schema.py for schema validation using Pydantic.
    • Validates source type, connection, and table for remote sources.
    • Validates column types and transformation types.
  • DatasetLoader:
    • Updates DatasetLoader in loader.py to use SemanticLayerSchema for schema handling.
    • Removes redundant source type validation.
    • Refactors _apply_transformations() to use SemanticLayerSchema.
  • QueryBuilder:
    • Updates QueryBuilder in query_builder.py to use SemanticLayerSchema.
    • Refactors query building methods to use schema attributes directly.
  • DataFrame:
    • Updates DataFrame in base.py to use SemanticLayerSchema for schema handling.
    • Refactors _create_yml_template() to use SemanticLayerSchema.
  • VirtualDataFrame:
    • Updates VirtualDataFrame in virtual_dataframe.py to use SemanticLayerSchema.
  • Tests:
    • Updates unit tests in test_loader.py, test_query_builder.py, test_semantic_layer_schema.py, and test_dataframe.py to reflect changes in schema handling.
    • Adds tests for SemanticLayerSchema validation and functionality.

This description was created by Ellipsis for a43c1db. It will automatically update as commits are pushed.

scaliseraoul and others added 8 commits January 13, 2025 15:32
…e, removes unreachable code and adds tests for _load_from_local_source
…into release/v3

* 'release/v3' of https://github.com/scaliseraoul/pandas-ai:
  ci: fix lint
  refactor: remove workspace
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
…into release/v3

* 'release/v3' of https://github.com/scaliseraoul/pandas-ai:
  refactor(loader): renames _load_from_source to _load_from_local_source removes unreachable code and adds tests for _load_from_local_source (Sinaptik-AI#1514)
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jan 14, 2025
Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Reviewed everything up to a43c1db in 1 minute and 44 seconds

More details
  • Looked at 944 lines of code in 9 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 drafted comments based on config settings.
1. pandasai/data_loader/loader.py:65
  • Draft comment:
    Add a validation for unsupported source_type before proceeding with the else block. This will prevent potential runtime errors if the source_type is neither local nor remote.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The code has been refactored to use a SemanticLayerSchema class which would handle validation during instantiation. The old _validate_source_type method was removed because this validation is now handled at schema creation. The else block is safe because invalid types would fail earlier during schema creation.
    I could be wrong about what validation the SemanticLayerSchema class performs. I don't have access to that class's code.
    Even without seeing the SemanticLayerSchema class, the removal of the explicit validation and the use of a strongly-typed schema class strongly suggests this validation is handled elsewhere. The change appears intentional and well-structured.
    The comment should be deleted as the validation concern is already handled by the SemanticLayerSchema class during instantiation.
2. pandasai/data_loader/loader.py:231
  • Draft comment:
    Consider adding a check for unsupported cache formats in _cache_data and raise an exception if encountered. This will ensure robustness against unexpected formats.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The format validation is already handled by the schema validation and other methods. Adding another check here would be redundant. The code is already robust against unexpected formats through the schema validation layer. The comment is suggesting defensive programming that isn't necessary given the architecture.
    I could be wrong about the schema validation - I don't see the SemanticLayerSchema class definition. Maybe it doesn't actually validate the format.
    Even if the schema doesn't validate it, the _get_cache_file_path() method on line 91-93 enforces only parquet or csv formats, so by the time we reach _cache_data(), the format is guaranteed to be valid.
    The comment should be deleted because format validation is already handled earlier in the code flow, making additional validation redundant and unnecessary.
3. pandasai/data_loader/query_builder.py:28
  • Draft comment:
    Converting the table name to lowercase might not be necessary or correct for all databases, as table names can be case-sensitive. Consider handling this conversion based on specific database requirements.
  • Reason this comment was not posted:
    Comment was on unchanged code.
4. pandasai/dataframe/base.py:185
  • Draft comment:
    The _create_yml_template method uses hardcoded values for source and destination types and paths. Consider parameterizing these values or providing a way to customize them for better flexibility.
  • Reason this comment was not posted:
    Comment was on unchanged code.
5. pandasai/data_loader/loader.py:156
  • Draft comment:
    The error message can be improved for clarity and grammar:
            raise ValueError(f"Unsupported file format: {format}.")
  • Reason this comment was not posted:
    Comment was on unchanged code.

Workflow ID: wflow_1NzYHgQxAg6alDnD


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@field_validator("format")
@classmethod
def is_format_supported(cls, format: str) -> str:
if format not in LOCAL_SOURCE_TYPES:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The is_format_supported validator checks if the format is in LOCAL_SOURCE_TYPES, which might be misleading. Consider using a more appropriate constant or list for format validation.

@scaliseraoul scaliseraoul changed the title feat(SemanticLayerSchema): Adding a SemanticLayerSchema class with logical and business validation feat(SemanticLayerSchema): Refactoring using SemanticLayerSchema all over the code instead of the dictionary Jan 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant