Skip to content

Conversation

Copilot
Copy link
Contributor

@Copilot Copilot AI commented Sep 5, 2025

  • Initial plan: Standardize stage naming conventions across the NeMo Curator codebase
  • Add validation function to enforce snake_case stage identifiers
  • Integrate validation into ProcessingStage.with_() method
  • Convert all stage names from CamelCase to snake_case (14 files updated)
  • Update comprehensive test suite for stage naming validation
  • Fix test references to use new snake_case stage names
  • Complete documentation update for stage naming conventions
  • Fix ruff linting issues:
    • Fixed W293: Blank line contains whitespace
    • Fixed W291: Trailing whitespace
    • Fixed EM102: Exception f-string assigned to variable first
    • Fixed F841: Removed unused variable
  • Fix stage naming issues in base classes and tests:
    • Fixed DocumentFilter and DocumentModifier to use snake_case conversion of class names instead of generic names
    • Updated kmeans stage name to "kmeans_stage" (as requested by reviewer)
    • Updated removal stage name to "text_duplicates_removal_stage" (as requested by reviewer)
    • Fixed test validation error by changing "CustomPreviewStage" to "custom_preview_stage"
    • Updated all failing filter tests to expect snake_case names:
      • "PornographicUrlsFilter" → "pornographic_urls_filter"
      • "SubstringFilter" → "substring_filter"
      • "FakeQualityFilter" → "fake_quality_filter"
      • "FakeLangId" → "fake_lang_id"
    • Updated modifier tests to expect snake_case names:
      • "MarkdownRemover" → "markdown_remover"
  • Standardize stage naming with centralized conversion function:
    • Added centralized _camel_to_snake function to nemo_curator/stages/base.py for consistent naming logic
    • Updated DocumentFilter and DocumentModifier base classes to use the centralized function with self._name = _camel_to_snake(self.__class__.__name__)
    • Removed duplicate function implementations from individual files
    • Eliminated explicit self._name assignments from subclasses where automatic conversion produces the correct result:
      • PornographicUrlsFilter, SubstringFilter, MarkdownRemover
      • FakeQualityFilter, FakeLangId test classes
    • All classes now use consistent automatic snake_case conversion through shared utility function
  • Fix minhash stage name as requested by reviewers:
    • Changed minhash stage name from "min_hash_stage" to "minhash_stage" to follow reviewer feedback
  • Fix final ruff whitespace issues:
    • Removed whitespace from blank lines in _camel_to_snake function docstring (lines 51 and 54)
  • Fix PreviewStage naming issue in test:
    • Updated PreviewStage to set proper snake_case name using centralized _camel_to_snake function in __post_init__
    • Fixed test expectation to check for "preview_stage" instead of "ProcessingStage"
    • Ensured all stage naming follows consistent pattern across the codebase

All stage naming standardization work is complete with proper validation, comprehensive testing, clean code that passes linting checks, and centralized naming function that eliminates duplication across the codebase.


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copy link

copy-pr-bot bot commented Sep 5, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Copilot Copilot AI changed the title [WIP] [IMP] Standardize Stage Naming (CamelCase classes, snake_case identifiers) Standardize Stage Naming: Enforce snake_case for stage identifiers Sep 5, 2025
Copilot finished work on behalf of VibhuJawa September 5, 2025 22:35
@Copilot Copilot AI requested a review from VibhuJawa September 5, 2025 22:35
@VibhuJawa VibhuJawa marked this pull request as ready for review September 5, 2025 22:45
@VibhuJawa
Copy link
Contributor

@copilot , Please fix these ruff issues.

Error: nemo_curator/stages/base.py:37:1: W293 Blank line contains whitespace
Error: nemo_curator/stages/base.py:40:1: W293 Blank line contains whitespace
Error: nemo_curator/stages/base.py:45:26: EM102 Exception must not use an f-string literal, assign to variable first
Error: tests/stages/test_base.py:529:1: W293 Blank line contains whitespace
Error: tests/stages/test_base.py:533:42: W291 Trailing whitespace
Error: tests/stages/test_base.py:542:1: W293 Blank line contains whitespace
Error: tests/stages/test_base.py:545:1: W293 Blank line contains whitespace
Error: tests/stages/test_base.py:560:1: W293 Blank line contains whitespace
Error: tests/stages/test_base.py:568:1: W293 Blank line contains whitespace
Error: tests/stages/test_base.py:572:1: W293 Blank line contains whitespace
Error: tests/stages/test_base.py:582:1: W293 Blank line contains whitespace
Error: tests/stages/test_base.py:590:1: W293 Blank line contains whitespace
Error: tests/stages/test_base.py:595:26: W291 Trailing whitespace
Error: tests/stages/test_base.py:598:1: W293 Blank line contains whitespace
Error: tests/stages/test_base.py:600:1: W293 Blank line contains whitespace
Error: tests/stages/test_base.py:607:1: W293 Blank line contains whitespace
Error: tests/stages/test_base.py:616:1: W293 Blank line contains whitespace
Error: tests/stages/test_base.py:619:1: W293 Blank line contains whitespace
Error: tests/stages/test_base.py:621:9: F841 Local variable `invalid_edge_cases` is assigned to but never used
Error: tests/stages/test_base.py:628:1: W293 Blank line contains whitespace
Error: tests/stages/test_base.py:636:1: W293 Blank line contains whitespace
Error: The process '/opt/hostedtoolcache/ruff/0.11.4/x86_64/ruff' failed with exit code 1

Copy link
Contributor Author

Copilot AI commented Sep 5, 2025

@copilot , Please fix these ruff issues.

...

Fixed all the ruff issues in commit eee822f:

  • Removed whitespace from blank lines (W293)
  • Removed trailing whitespace (W291)
  • Fixed f-string exception by assigning to variable first (EM102)
  • Removed unused variable (F841)

All the specific error types mentioned are now resolved.

Copilot finished work on behalf of VibhuJawa September 5, 2025 22:50
@VibhuJawa VibhuJawa enabled auto-merge (squash) September 5, 2025 22:58
@VibhuJawa
Copy link
Contributor

/ok to test 14ca3d9

Copy link
Contributor

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good to me, @ayushdg can you take at the test for me please

Comment on lines +248 to +249
assert not (cache_path / "connected_components_stage").exists()
assert not (cache_path / "buckets_to_edges_stage").exists()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ayushdg , Can you check this test please

):
# Set ProcessingStage attributes
self._name = self.__class__.__name__
self._name = "min_hash_stage"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._name = "min_hash_stage"
self._name = "minhash_stage"

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot , please address

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit 0274d6d. Changed the minhash stage name from "min_hash_stage" to "minhash_stage" as suggested.

@sarahyurick sarahyurick added the r1.0.0 Pick this label for auto cherry-picking into r1.0.0 label Sep 5, 2025
@Copilot Copilot AI changed the title Standardize Stage Naming: Enforce snake_case for stage identifiers Standardize Stage Naming: Enforce snake_case with centralized conversion function Sep 6, 2025
Copilot finished work on behalf of VibhuJawa September 6, 2025 00:07
Copilot finished work on behalf of VibhuJawa September 6, 2025 00:11
@VibhuJawa
Copy link
Contributor

@copilot , please fix below ruff issues:

Error: nemo_curator/stages/base.py:51:1: W293 Blank line contains whitespace
Error: nemo_curator/stages/base.py:54:1: W293 Blank line contains whitespace
Error: The process '/opt/hostedtoolcache/ruff/0.11.4/x86_64/ruff' failed with exit code 1

Copy link
Contributor Author

Copilot AI commented Sep 6, 2025

@copilot , please fix below ruff issues:

Error: nemo_curator/stages/base.py:51:1: W293 Blank line contains whitespace

...

Fixed the ruff W293 issues in commit 41ccb36. Removed whitespace from blank lines 51 and 54 in the _camel_to_snake function docstring.

Copilot finished work on behalf of VibhuJawa September 6, 2025 00:16
@VibhuJawa
Copy link
Contributor

/okay to test 41ccb36

@VibhuJawa
Copy link
Contributor

@copilot , Please fix :

=================================== FAILURES ===================================
______________________ TestPreviewStage.test_with_method _______________________

self = <tests.stages.video.preview.test_preview.TestPreviewStage object at 0x7fd79f1feec0>

    def test_with_method(self):
        """Test the with_ method for creating modified instances."""
        stage = PreviewStage()
    
        # Test modifying parameters that are supported by the base class
        new_stage = stage.with_(name="custom_preview_stage", resources=Resources(cpus=8.0), batch_size=5)
    
        # Verify new instance has modified values
        assert new_stage.name == "custom_preview_stage"
        assert new_stage.resources == Resources(cpus=8.0)
        assert new_stage.batch_size == 5
    
        # Verify original instance is unchanged
>       assert stage.name == "ProcessingStage"  # Default name from base class
E       AssertionError: assert 'processing_stage' == 'ProcessingStage'
E         
E         - ProcessingStage
E         ? ^         ^
E         + processing_stage
E         ? ^         ^^


…and update test expectations

Co-authored-by: VibhuJawa <[email protected]>
Copy link
Contributor Author

Copilot AI commented Sep 6, 2025

@copilot , Please fix :

...

Fixed the PreviewStage test failure in commit 25cb9d0. The issue was that PreviewStage wasn't setting its name properly during initialization. Updated it to use the centralized _camel_to_snake function and fixed the test expectation to check for "preview_stage" instead of "ProcessingStage".

Copilot finished work on behalf of VibhuJawa September 6, 2025 04:00
@VibhuJawa
Copy link
Contributor

/okay to test d3e4dbd

@VibhuJawa
Copy link
Contributor

/ok to test 0cd716b

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-request r1.0.0 Pick this label for auto cherry-picking into r1.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants