Commit efe76b9
ci: Migrate ray-curator docker to uv (#928)
* New API Spec with Ray Backend (#726)
* Create package + reorganize (#2)
* fc
Signed-off-by: Praateek <[email protected]>
* remove per file ignore
Signed-off-by: Praateek <[email protected]>
* sc
Signed-off-by: Praateek <[email protected]>
* ruff
Signed-off-by: Praateek <[email protected]>
* use curator_id_str
Signed-off-by: Praateek <[email protected]>
---------
Signed-off-by: Praateek <[email protected]>
* fc
Signed-off-by: Praateek <[email protected]>
* kmeans works
Signed-off-by: Praateek <[email protected]>
* Fuzzy dedup fixes (#11)
* high level method for each step
Signed-off-by: Ayush Dattagupta <[email protected]>
* Fixes/changes after testing
Signed-off-by: Ayush Dattagupta <[email protected]>
* Updates to existing fuzzy_dedup modules
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add high level fuzzy dedup api and e2e example
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add e2e example
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add config
Signed-off-by: Ayush Dattagupta <[email protected]>
---------
Signed-off-by: Ayush Dattagupta <[email protected]>
* fc
Signed-off-by: Praateek <[email protected]>
* fc
Signed-off-by: Praateek <[email protected]>
* removal works
Signed-off-by: Praateek <[email protected]>
* bug fix
Signed-off-by: Praateek <[email protected]>
* working streaming embedding with id generator
Signed-off-by: Praateek <[email protected]>
* Dump high level skeleton
Signed-off-by: Ayush Dattagupta <[email protected]>
* update xenna executor
Signed-off-by: Ayush Dattagupta <[email protected]>
* More changes
Signed-off-by: Ayush Dattagupta <[email protected]>
* working example
Signed-off-by: Praateek <[email protected]>
* Revert "working example"
This reverts commit 7b3e65173dd1df92b0de9431fcfebdbc0b93d6c9.
* [WIP] Add reader + utf modifier (#31)
* Dump high level skeleton
Signed-off-by: Ayush Dattagupta <[email protected]>
* update xenna executor
Signed-off-by: Ayush Dattagupta <[email protected]>
* More changes
Signed-off-by: Ayush Dattagupta <[email protected]>
* Updates for utfModifier+ high level updates
Signed-off-by: Ayush Dattagupta <[email protected]>
* Remove old examples and add new modifier and stages
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add modify stage
Signed-off-by: Ayush Dattagupta <[email protected]>
* More updates
Signed-off-by: Ayush Dattagupta <[email protected]>
---------
Signed-off-by: Ayush Dattagupta <[email protected]>
* Revert "[WIP] Add reader + utf modifier (#31)" (#32)
This reverts commit ef25e3eff6502cb9bfc4a57ba48f0939284fd49b.
* rebase
Signed-off-by: Praateek <[email protected]>
* rebase continue
Signed-off-by: Praateek <[email protected]>
* Remove older file versions
Signed-off-by: Sarah Yurick <[email protected]>
* Final changes as per the meeting
* refactor
Signed-off-by: Praateek <[email protected]>
* example works
Signed-off-by: Praateek <[email protected]>
* add base classes
Signed-off-by: Praateek <[email protected]>
* example works
Signed-off-by: Praateek <[email protected]>
* ..
Signed-off-by: Praateek <[email protected]>
* more google style
Signed-off-by: Praateek <[email protected]>
* add init for backends
Signed-off-by: Praateek <[email protected]>
* Update example script
* add impl
Signed-off-by: Sarah Yurick <[email protected]>
* ruff
Signed-off-by: Sarah Yurick <[email protected]>
* add suggestions
Signed-off-by: Sarah Yurick <[email protected]>
* add another check
Signed-off-by: Sarah Yurick <[email protected]>
* Move changes one level deeper in ray-curator, add pyproject toml
Signed-off-by: Ayush Dattagupta <[email protected]>
* Update dependencies to include cosmos-xenna and pyarrow explicitly
Signed-off-by: Ayush Dattagupta <[email protected]>
* Update python upper bound
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add a simple contributing file with instructions
Signed-off-by: Ayush Dattagupta <[email protected]>
* Remove pyarrow check since it's an explicit dependency
Signed-off-by: Ayush Dattagupta <[email protected]>
* Remove unused file utils
Signed-off-by: Ayush Dattagupta <[email protected]>
---------
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Co-authored-by: Praateek Mahajan <[email protected]>
Co-authored-by: Praateek <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Co-authored-by: Abhinav Garg <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
* [Ray] Allow loguru to be serialized #729
* [Ray] Add Jsonl / Parquet Writer Stage (#730)
* Update CI testing workflow for ray branch (#739)
* Update ci workflow to build ray-curator package instead
Signed-off-by: Ayush Dattagupta <[email protected]>
* Split out CPU and GPU modules
Signed-off-by: Ayush Dattagupta <[email protected]>
* Update pytest command
Signed-off-by: Ayush Dattagupta <[email protected]>
* update crossfit dep to use pinned version (avoiding absl dep issues)
Signed-off-by: Ayush Dattagupta <[email protected]>
* Explicitly add absl-py dependency to avoid python 3.10 errors
Signed-off-by: Ayush Dattagupta <[email protected]>
* Update paths for codecov
Signed-off-by: Ayush Dattagupta <[email protected]>
---------
Signed-off-by: Ayush Dattagupta <[email protected]>
* Initial API design doc (#737)
* Initial API design doc
Signed-off-by: Abhinav Garg <[email protected]>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
* Update ray-curator/api-design.md
Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
* Update ray-curator/api-design.md
Co-authored-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
* Refine map-style execution description in API design document to clarify task transformation and mapping flexibility.
* Remove redundant sections on Tasks, Stages, and Pipelines from the API design document to streamline content and improve clarity.
* Add quickstart example and update API design documentation
- Introduced a new quickstart example in `ray_curator/examples/quickstart.py` demonstrating a sentiment analysis pipeline with three stages: TaskCreationStage, WordCountStage, and SentimentStage.
- Updated `api-design.md` to include a new section for examples, linking to the quickstart for user reference.
- Clarified resource requirements in `resources.py` documentation for GPU usage and constraints.
* Ruff related changes
Signed-off-by: Abhinav Garg <[email protected]>
* PR changes
Signed-off-by: Abhinav Garg <[email protected]>
* Update DocumentTask to DocumentBatch in API design for improved type flexibility
Signed-off-by: Abhinav Garg <[email protected]>
* Add fault tolerance requirements to API design documentation
- Introduced a new section outlining the necessity for fault tolerance and retry safety in all stages.
- Highlighted critical aspects such as task preemption and handling of partial operations to ensure robustness during execution.
Signed-off-by: Abhinav Garg <[email protected]>
---------
Signed-off-by: Abhinav Garg <[email protected]>
Co-authored-by: Praateek Mahajan <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>
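Note on the quickstart added in #737: the message above describes a three-stage sentiment pipeline (TaskCreationStage, WordCountStage, SentimentStage). The following is only a plain-Python sketch of that shape. The `Task` dataclass, the `process` method signatures, the keyword lists, and `run_pipeline` are hypothetical stand-ins for illustration; the real stages in `ray_curator/examples/quickstart.py` build on the library's own stage and task classes, which are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical stand-in for a task: a batch of sentences plus derived columns."""
    sentences: list
    columns: dict = field(default_factory=dict)


class TaskCreationStage:
    """Hypothetical: wraps raw input sentences into a Task."""

    def process(self, raw):
        return Task(sentences=list(raw))


class WordCountStage:
    """Hypothetical: adds a whitespace-based word count per sentence."""

    def process(self, task):
        task.columns["word_count"] = [len(s.split()) for s in task.sentences]
        return task


class SentimentStage:
    """Hypothetical: toy keyword-based sentiment, not a real model."""

    POSITIVE = {"good", "great", "love"}
    NEGATIVE = {"bad", "awful", "hate"}

    def process(self, task):
        labels = []
        for sentence in task.sentences:
            words = set(sentence.lower().split())
            score = len(words & self.POSITIVE) - len(words & self.NEGATIVE)
            if score > 0:
                labels.append("positive")
            elif score < 0:
                labels.append("negative")
            else:
                labels.append("neutral")
        task.columns["sentiment"] = labels
        return task


def run_pipeline(raw, stages):
    """Feed the output of each stage into the next, in order."""
    data = raw
    for stage in stages:
        data = stage.process(data)
    return data
```

Each stage takes the previous stage's output and returns an enriched task, which is the map-style flow the design doc refers to; an executor backend would schedule the same chain across workers instead of a local loop.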
* Refactor XennaExecutor by removing the cluster initialization function and deleting the associated ray_cluster_init.py file. This streamlines the execution process by eliminating unnecessary setup code. (#768)
Signed-off-by: Abhinav Garg <[email protected]>
* [Ray] Add Ray Data as an experimental backend (#740)
* [Ray] Add integration test to test backends for a specified pipeline (#770)
* Adding with_ for options in ProcessingStage and CompositeStage (#764)
* [Ray] `DocumentFilter` and `Filter`/`Score`/`ScoreFilter` (#746)
* add documentfilter implementation
Signed-off-by: Sarah Yurick <[email protected]>
* fix nits and ruff
Signed-off-by: Sarah Yurick <[email protected]>
* add additional logic for setup, setup_on_node, and process_batch
Signed-off-by: Sarah Yurick <[email protected]>
* add pytests
Signed-off-by: Sarah Yurick <[email protected]>
* add dep
Signed-off-by: Sarah Yurick <[email protected]>
* more dep edits
Signed-off-by: Sarah Yurick <[email protected]>
* another dep
Signed-off-by: Sarah Yurick <[email protected]>
* add fasttext dep
Signed-off-by: Sarah Yurick <[email protected]>
* add jieba and mecab
Signed-off-by: Sarah Yurick <[email protected]>
* add default None params for setup_on_node and setup functions
Signed-off-by: Sarah Yurick <[email protected]>
* add praateek's suggestions
Signed-off-by: Sarah Yurick <[email protected]>
* organize imports
Signed-off-by: Sarah Yurick <[email protected]>
* remove process_batch
Signed-off-by: Sarah Yurick <[email protected]>
* add _metadata to result
Signed-off-by: Sarah Yurick <[email protected]>
* add praateek's suggestions
Signed-off-by: Sarah Yurick <[email protected]>
* ruff and post init for _name
Signed-off-by: Sarah Yurick <[email protected]>
* modify test
Signed-off-by: Sarah Yurick <[email protected]>
---------
Signed-off-by: Sarah Yurick <[email protected]>
* [Ray] Add Download Extract Base Class + Common Crawl Stage (#738)
* [Ray] Use Ray Actors where viable (#792)
* Extract and download for Wikipedia (#795)
* copy over
Signed-off-by: Praateek <[email protected]>
* copy over
Signed-off-by: Praateek <[email protected]>
* add init to download
Signed-off-by: Praateek <[email protected]>
* move justext
Signed-off-by: Praateek <[email protected]>
* move resiliparse
Signed-off-by: Praateek <[email protected]>
* move trafilatura
Signed-off-by: Praateek <[email protected]>
* move get_stop_list_dict
Signed-off-by: Praateek <[email protected]>
* move download_utils.py to utils/download_utils.py
Signed-off-by: Praateek <[email protected]>
* move out to download.py
Signed-off-by: Praateek <[email protected]>
* move WarcIterator to warc_reader.py
Signed-off-by: Praateek <[email protected]>
* move CommonCrawlWARCExtractor to html_extractor
Signed-off-by: Praateek <[email protected]>
* remove commoncrawl.py
Signed-off-by: Praateek <[email protected]>
* create url_generation.py from download_utils
Signed-off-by: Praateek <[email protected]>
* tests dir
Signed-off-by: Praateek <[email protected]>
* copy over test_download.py as test_common_crawl.py
Signed-off-by: Praateek <[email protected]>
* add html_extractors/__init__
Signed-off-by: Praateek <[email protected]>
* move html_extractor to ProcessingStage
Signed-off-by: Praateek <[email protected]>
* update WarcReader to use ProcessingStage
Signed-off-by: Praateek <[email protected]>
* move to classes for url generation
Signed-off-by: Praateek <[email protected]>
* typo in name
Signed-off-by: Praateek <[email protected]>
* bug fixes in justext; rename resiliparse func; utils modular
Signed-off-by: Praateek <[email protected]>
* init file in for download/text
Signed-off-by: Praateek <[email protected]>
* justext minor change
Signed-off-by: Praateek <[email protected]>
* support str in htmlextractor
Signed-off-by: Praateek <[email protected]>
* add a working example
Signed-off-by: Praateek <[email protected]>
* set source_files so that write can be hashed
Signed-off-by: Praateek <[email protected]>
* use pprint in example
Signed-off-by: Praateek <[email protected]>
* update comment
Signed-off-by: Praateek <[email protected]>
* all tests migrated + work
Signed-off-by: Praateek <[email protected]>
* update defaults in example; comments in stage
Signed-off-by: Praateek <[email protected]>
* add tests for url generation + PR review
Signed-off-by: Praateek <[email protected]>
* update download for aws
Signed-off-by: Praateek <[email protected]>
* rename aws to use_aws_to_download
Signed-off-by: Praateek <[email protected]>
* update resources
Signed-off-by: Praateek <[email protected]>
* change url generation to have ray-stage-spec
Signed-off-by: Praateek <[email protected]>
* make download fault tolerant
Signed-off-by: Praateek <[email protected]>
* refactor as per pr reviews; with tests
Signed-off-by: Praateek <[email protected]>
* add readme
Signed-off-by: Praateek <[email protected]>
* bug fix; update tests
Signed-off-by: Praateek <[email protected]>
* update record limit to None
Signed-off-by: Praateek <[email protected]>
* bug fixes
Signed-off-by: Praateek <[email protected]>
* pr comments
Signed-off-by: Praateek <[email protected]>
* add back test html extractor implementations
Signed-off-by: Praateek <[email protected]>
* remove cc example
Signed-off-by: Praateek <[email protected]>
* add column utils
Signed-off-by: Praateek <[email protected]>
* add todos
Signed-off-by: Praateek <[email protected]>
* Add Wikipedia download and extract stage
This commit introduces a comprehensive pipeline for downloading and processing Wikipedia dump files within the ray-curator framework. Key components include:
- **WikipediaUrlGenerator**: Generates URLs for Wikipedia dump files.
- **WikipediaDownloader**: Downloads .bz2 dump files using wget.
- **WikipediaIterator**: Parses Wikipedia XML dumps and extracts article content.
- **WikipediaExtractor**: Cleans Wikipedia markup and extracts meaningful text.
Additionally, an example script demonstrating the usage of the new stage is included, along with tests for each component to ensure functionality and reliability.
Documentation for the new stage is also provided to guide users in implementation and usage.
Signed-off-by: Abhinav Garg <[email protected]>
* merge from main
Signed-off-by: Praateek <[email protected]>
* move deps to text
Signed-off-by: Praateek <[email protected]>
* update dev
Signed-off-by: Praateek <[email protected]>
* update pyproject and test.yml
Signed-off-by: Praateek <[email protected]>
* remove cugraph extra pyproject
Signed-off-by: Praateek <[email protected]>
* move text to optional deps
Signed-off-by: Praateek <[email protected]>
* Refactor pyproject.toml: Remove unused dependencies and clean up dev section
Signed-off-by: Abhinav Garg <[email protected]>
* Remove unused Wikipedia example and related README documentation from the download text stages.
Signed-off-by: Abhinav Garg <[email protected]>
* Add method to fetch JSON dump data for Wikipedia and refactor dump date retrieval logic
- Introduced `_get_data_for_dump` method to handle fetching and parsing JSON dump data.
- Refactored logic in `_get_wikipedia_urls` to iterate through available dumps and check their status.
- Improved error handling for cases where dump data cannot be loaded or is not finished.
Signed-off-by: Abhinav Garg <[email protected]>
* Add README for custom download pipelines and remove Wikipedia stage documentation
- Introduced a new README.md file detailing the structure and implementation of custom download pipelines.
- Removed the outdated README.md for the Wikipedia download and extract stage to streamline documentation.
Signed-off-by: Abhinav Garg <[email protected]>
* Add num_workers_per_node method to DocumentDownloader and WikipediaDownloader
- Implemented num_workers_per_node method in DocumentDownloader to define the number of workers per node for downloading tasks.
- Overridden num_workers_per_node in WikipediaDownloader to return a fixed value of 1.
- Updated xenna_stage_spec method in DocumentDownloadStage to include the number of workers per node.
Signed-off-by: Abhinav Garg <[email protected]>
* Update WikipediaDownloader to use 2 workers and change logging level in WikipediaIterator
- Modified num_workers_per_node in WikipediaDownloader to return 2, allowing for increased parallelism during downloads.
- Changed logging from info to debug level in WikipediaIterator for extracted articles to reduce log verbosity.
Signed-off-by: Abhinav Garg <[email protected]>
---------
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
Co-authored-by: Praateek <[email protected]>
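Note on `num_workers_per_node` from #795: the commits above add the method to DocumentDownloader, override it in WikipediaDownloader (first returning 1, then 2), and pass the value into the stage spec. A minimal sketch of that override pattern follows. The class bodies, the `None` default in the base class, the `download` placeholder, and the `stage_spec` helper are assumptions made for illustration and are not the library's actual code.

```python
from abc import ABC, abstractmethod
from typing import Optional


class DocumentDownloader(ABC):
    """Hypothetical stand-in for the downloader base class."""

    @abstractmethod
    def download(self, url: str) -> str:
        """Return a local path for the downloaded file."""

    def num_workers_per_node(self) -> Optional[int]:
        # Assumed default: None means "no per-node cap, let the executor decide".
        return None


class WikipediaDownloader(DocumentDownloader):
    def download(self, url: str) -> str:
        # Placeholder only: derives a local path and performs no network I/O.
        return "/tmp/" + url.rsplit("/", 1)[-1]

    def num_workers_per_node(self) -> Optional[int]:
        # Matches the final value in the commit message (2 workers per node).
        return 2


def stage_spec(downloader: DocumentDownloader) -> dict:
    """Hypothetical helper: include the cap in the spec only when it is set."""
    spec = {}
    workers = downloader.num_workers_per_node()
    if workers is not None:
        spec["num_workers_per_node"] = workers
    return spec
```

Keeping the cap on the downloader lets each data source throttle itself (for example, to stay polite to a single download host) without the stage needing source-specific logic.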
* Fixing tests (#827)
* Refactor Wikipedia extraction and URL generation logic
- Removed redundant return statement in `WikipediaExtractor` class.
- Simplified status check for dump data in `WikipediaUrlGenerator` by directly accessing the dictionary keys.
- Updated logging level in tests to ensure accurate assertions on log calls.
- Enhanced test cases for URL generation to cover various dump statuses.
These changes improve code clarity and maintainability while ensuring robust error handling in the Wikipedia download and extraction process.
Signed-off-by: Abhinav Garg <[email protected]>
* Add mwparserfromhell dependency to pyproject.toml
- Included `mwparserfromhell==0.6.5` in the text dependencies section of `pyproject.toml` to support parsing Wikipedia markup.
This addition enhances the functionality of the project by ensuring the necessary tools for processing Wikipedia data are available.
Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
---------
Signed-off-by: Abhinav Garg <[email protected]>
Signed-off-by: [Your Name] <[email protected]>
* Update ray version to 2.48 #839
* Re-enable CI/CD for Ray API branch (#840)
* CI/CD for Ray API branch
Signed-off-by: Sarah Yurick <[email protected]>
* add text dependencies
Signed-off-by: Sarah Yurick <[email protected]>
* only run cpu tests
Signed-off-by: Sarah Yurick <[email protected]>
* comment instead of delete
Signed-off-by: Sarah Yurick <[email protected]>
---------
Signed-off-by: Sarah Yurick <[email protected]>
* Ray Video Pipeline : Video Reader (#775)
* Add video io reader
* Add test
* Add VideoReaderStage to video reading pipeline and update VideoDownloadStage to accept VideoTask. Enhance video reading capabilities with new tests for VideoReaderStage.
Signed-off-by: Ao Tang <[email protected]>
* Update VideoDownloadStage to support verbose logging and modify video_read_example to include verbose argument.
Signed-off-by: Ao Tang <[email protected]>
* Update outputs for VideoDownloadStage and VideoReaderStage to include additional metadata fields.
Signed-off-by: Ao Tang <[email protected]>
* Update CI workflow to include video dependencies for testing
Signed-off-by: Ao Tang <[email protected]>
* Add tests for video tasks module
- Introduced a new test package for tasks with an initial test suite for the video tasks module, including tests for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes.
- Implemented various test cases to validate initialization, property calculations, metadata extraction, and size calculations.
This enhances the testing coverage for video-related functionalities in the ray-curator project.
Signed-off-by: Ao Tang <[email protected]>
* Enhance video tasks module with additional test cases
- Expanded the test suite for the video tasks module by adding new test cases for the Clip, ClipStats, Video, VideoMetadata, and VideoTask classes.
- Improved coverage for various functionalities including initialization, property calculations, and metadata extraction.
This update strengthens the reliability of video-related features in the ray-curator project.
Signed-off-by: Ao Tang <[email protected]>
* Update pyproject.toml to include a trailing comma for pynvml dependency
Signed-off-by: Ao Tang <[email protected]>
* Refactor video processing stages to introduce a composite VideoReaderDownloadStage
- Replaced separate VideoReaderStage and VideoDownloadStage with a new VideoReaderDownloadStage that combines both functionalities.
- Updated the video_read_example to utilize the new composite stage.
- Adjusted inputs and outputs for VideoDownloadStage to reflect changes in the pipeline.
- Added tests for the new VideoReaderDownloadStage to ensure proper functionality and integration.
This refactor simplifies the video reading and downloading process within the ray-curator framework.
Signed-off-by: Ao Tang <[email protected]>
---------
Signed-off-by: Ao Tang <[email protected]>
* chore: Add new trustees and vetters to the copy-pr-bot configuration (#841) (#842)
* chore: Add new trustees and vetters to the copy-pr-bot configuration
* chore: Remove empty line in copy-pr-bot configuration
* chore: Remove ryantwolf from additional trustees and vetters in copy-pr-bot configuration
---------
Signed-off-by: Ao Tang <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: Ao Tang <[email protected]>
* ci: Add community-bot (#846) (#849)
Signed-off-by: oliver könig <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: oliver könig <[email protected]>
* Ray Video Reader Enhancement (#848)
* Refactor video reading stages: Rename VideoReaderStage to VideoListStage and update VideoReaderDownloadStage to use the new class. Adjust tests accordingly to reflect the changes in stage names and functionality.
Signed-off-by: Ao Tang <[email protected]>
* Rename test_video_reader to test_video_list
Signed-off-by: Ao Tang <[email protected]>
* Update VideoListStage name and corresponding tests to reflect new naming convention
- Changed the internal name of VideoListStage from "video_reader" to "video_list".
- Updated assertions in the test for VideoListStage to match the new name.
- Adjusted configuration in the VideoReaderDownloadStage to use "video_list" instead of "video_reader".
This ensures consistency across the codebase following the recent refactor.
Signed-off-by: Ao Tang <[email protected]>
* Update test assertions in VideoReaderDownloadStage to use "video_list" instead of "video_reader"
Signed-off-by: Ao Tang <[email protected]>
* Refactor video processing stages: Replace VideoDownloadStage with VideoReaderStage in VideoReaderDownloadStage. Update related tests to reflect the new structure and ensure consistency across the codebase.
Signed-off-by: Ao Tang <[email protected]>
* Enhance VideoListStage and VideoReaderStage documentation
Signed-off-by: Ao Tang <[email protected]>
* Refactor video reading pipeline: Introduce VideoLoadingStage as a composite stage that combines VideoListStage and VideoReaderStage.
Signed-off-by: Ao Tang <[email protected]>
* Remove SplitPipeTask from video module and update imports accordingly.
Signed-off-by: Ao Tang <[email protected]>
* Refactor video task imports: Update import statements in video_list, video_loading, video_reader, and related test files to use the new video module structure.
Signed-off-by: Ao Tang <[email protected]>
* ruff fix
Signed-off-by: Ao Tang <[email protected]>
* Implement FilePartitioningStage: Introduce a new stage for partitioning files into groups based on specified criteria, including a limit on the number of groups. Update VideoLoadingStage to utilize FilePartitioningStage instead of the deprecated VideoListStage. Refactor VideoReaderStage to accept FileGroupTask as input and adjust related tests to ensure functionality and correctness.
Signed-off-by: Ao Tang <[email protected]>
* Refactor video reading stages: Replace VideoLoadingStage with VideoReader as a composite stage that combines FilePartitioningStage and VideoReaderStage. Update related tests to ensure functionality and correctness. Remove deprecated VideoLoadingStage and its associated tests.
Signed-off-by: Ao Tang <[email protected]>
* Update video_limit type in VideoReader to support None: Changed the type of video_limit from int to int | None to allow for more flexible configuration. This enhances the usability of the VideoReader class.
Signed-off-by: Ao Tang <[email protected]>
* Refactor file partitioning limit check
Signed-off-by: Ao Tang <[email protected]>
* Remove redundant tests from TestVideoReader: Deleted tests for video limit values, verbose flag, file extensions, and files per partition configuration to streamline the test suite and focus on essential functionality.
Signed-off-by: Ao Tang <[email protected]>
---------
Signed-off-by: Ao Tang <[email protected]>
* Enhance FilePartitioningStage to enforce task limit check earlier in the process. (#867)
Signed-off-by: Ao Tang <[email protected]>
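Note on FilePartitioningStage (#848, #867): the messages above describe grouping files into partitions, with an optional limit on the number of groups that is checked early. The sketch below shows one way to do that in plain Python. The function name `partition_files` and its parameters are hypothetical and are not the stage's real interface.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional


def partition_files(
    files: Iterable[str],
    files_per_partition: int,
    limit: Optional[int] = None,
) -> Iterator[List[str]]:
    """Yield groups of up to `files_per_partition` files, at most `limit` groups."""
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be at least 1")
    iterator = iter(files)
    emitted = 0
    # The limit is checked before the next group is built, so no extra
    # files are consumed once the limit has been reached.
    while limit is None or emitted < limit:
        group = list(islice(iterator, files_per_partition))
        if not group:
            return
        yield group
        emitted += 1
```

For example, five files with `files_per_partition=2` yield three groups, and adding `limit=2` stops after the first two groups without reading the fifth file.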
* Initialize and shutdown ray session in each executor (#844)
* Remove pynvml dependency from pyproject.toml (#872)
* docs: refactor all the things (#826) (#859)
* docs: refactor all the things
* remove auto api docs
* api docs to gitignore
* updated readme
* python linting fixes batch 1
* batch 2
* batch 3
* update
---------
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: L.B. <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
* ci(fix): Use GITHUB_TOKEN for community bot (#853) (#854)
* ci(fix): Use GITHUB_TOKEN for community bot
* f
---------
Signed-off-by: oliver könig <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>
* update LLM PII redaction file - fix issue 828 (#868) (#871)
* update LLM PII redaction file - fix 828
* Fix ruff check LLM PII redaction file - fix 828
* update LLM PII redaction Enron-file - fix 828
* update LLM-PII redaction README - fix 828
* updated LLM PII redaction Enron-file - fix 828
* updated LLM PII redaction file - fix 828
* Update tutorials/curator-llm-pii/README.md
* removed typo from README file - fix 828
* updated LLM redaction tutorial - fix 828
* updated LLM redaction-Enron file - fix 828
* updated LLM redaction-Enron file - fix 828
* Update tutorials/curator-llm-pii/PII-LLM-modification-Enron.ipynb
* Update tutorials/curator-llm-pii/PII-LLM-modification-Enron.ipynb
---------
Signed-off-by: Adeola Adesoba <[email protected]>
Signed-off-by: aadesoba-nv <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: aadesoba-nv <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
* [Tutorials] Lazy import GPU modules in the Llama Nemotron tutorial (#831) (#875)
Signed-off-by: Mehran Maghoumi <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: Mehran Maghoumi <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
* docs: changelog update (#860) (#887)
* docs: changelog update
* formatting
* remove item
---------
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: L.B. <[email protected]>
* linkfixes (#865) (#882)
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: L.B. <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>
* docs: Fixing version switcher issues (#885) (#886)
Signed-off-by: Andrew Schilling <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: Andrew Schilling <[email protected]>
* [Ray] Download and extract ArXiv (#805)
* remove dask arxiv
Signed-off-by: Sarah Yurick <[email protected]>
* first pass for entire arxiv implementation
Signed-off-by: Sarah Yurick <[email protected]>
* ruff
Signed-off-by: Sarah Yurick <[email protected]>
* fix circular import
Signed-off-by: Sarah Yurick <[email protected]>
* working module
Signed-off-by: Sarah Yurick <[email protected]>
* add downloader tests
Signed-off-by: Sarah Yurick <[email protected]>
* remove unused noqa
Signed-off-by: Sarah Yurick <[email protected]>
* add test_iterator
Signed-off-by: Sarah Yurick <[email protected]>
* add extractor tests
Signed-off-by: Sarah Yurick <[email protected]>
* fix failing download tests
Signed-off-by: Sarah Yurick <[email protected]>
* add test_stage
Signed-off-by: Sarah Yurick <[email protected]>
* sort
Signed-off-by: Sarah Yurick <[email protected]>
* add url generator tests
Signed-off-by: Sarah Yurick <[email protected]>
* remove noqa
Signed-off-by: Sarah Yurick <[email protected]>
* remove nemo_curator/download, outdated scripts, outdated examples
Signed-off-by: Sarah Yurick <[email protected]>
---------
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
* [Ray] Classifiers (#753)
* [Ray] Classifiers
Signed-off-by: Sarah Yurick <[email protected]>
* fix ruff
Signed-off-by: Sarah Yurick <[email protected]>
* add utils file
Signed-off-by: Sarah Yurick <[email protected]>
* commit quality classifier benchmark helpers
Signed-off-by: Sarah Yurick <[email protected]>
* use basictokenizer as cpu tokenizer, add crossfit config
Signed-off-by: Sarah Yurick <[email protected]>
* some ruff
Signed-off-by: Sarah Yurick <[email protected]>
* merge upstream
Signed-off-by: Praateek <[email protected]>
* use _name, remove gpu resources from labeler
Signed-off-by: Sarah Yurick <[email protected]>
* consolidate praateek's work with distributeddataclassifier for quality classifier
Signed-off-by: Sarah Yurick <[email protected]>
* ruff
Signed-off-by: Sarah Yurick <[email protected]>
* add content type, domain, multilingual domain, and filter_by support
Signed-off-by: Sarah Yurick <[email protected]>
* support for fineweb, fineweb mixtral, and fineweb nemotron classifiers
Signed-off-by: Sarah Yurick <[email protected]>
* ruff
Signed-off-by: Sarah Yurick <[email protected]>
* add prompt task complexity support
Signed-off-by: Sarah Yurick <[email protected]>
* remove noqa
Signed-off-by: Sarah Yurick <[email protected]>
* padding_size does not need to be exposed to user
Signed-off-by: Sarah Yurick <[email protected]>
* max_seq_length does not need to be exposed to the user, set default micro_batch_sizes
Signed-off-by: Sarah Yurick <[email protected]>
* add max_chars, edit docstring
Signed-off-by: Sarah Yurick <[email protected]>
* ruff
Signed-off-by: Sarah Yurick <[email protected]>
* aegis functionality, start working on instruction data guard
Signed-off-by: Sarah Yurick <[email protected]>
* nit fixes
Signed-off-by: Sarah Yurick <[email protected]>
* add working pytests for all classifiers
Signed-off-by: Sarah Yurick <[email protected]>
* remove existing pytest file
Signed-off-by: Sarah Yurick <[email protected]>
* add more comments to tests
Signed-off-by: Sarah Yurick <[email protected]>
* address review, add mem conversation, add README
Signed-off-by: Sarah Yurick <[email protected]>
* move redundant test code
Signed-off-by: Sarah Yurick <[email protected]>
* ruff
Signed-off-by: Sarah Yurick <[email protected]>
* model_inference_batch_size and format_name_with_suffix
Signed-off-by: Sarah Yurick <[email protected]>
* add missing hf_token usage, remove test file, restructure dirs and files
Signed-off-by: Sarah Yurick <[email protected]>
* delete old examples and scripts
Signed-off-by: Sarah Yurick <[email protected]>
---------
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek <[email protected]>
Co-authored-by: Praateek <[email protected]>
* [RAY] Add ID Module (#876)
* Add id initial working IMP
Signed-off-by: Vibhu Jawa <[email protected]>
* working add_id
Signed-off-by: Vibhu Jawa <[email protected]>
* Add ID
Signed-off-by: Vibhu Jawa <[email protected]>
* Update ray-curator/ray_curator/tasks/tasks.py
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
* Add prefix feature, overwrite, warnings
Signed-off-by: Vibhu Jawa <[email protected]>
* rename id_prefix to user_prefix
Signed-off-by: Vibhu Jawa <[email protected]>
* Add in test for tasks and fix task id
Signed-off-by: VibhuJawa <[email protected]>
---------
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: VibhuJawa <[email protected]>
Co-authored-by: Copilot <[email protected]>
* Add video splitting pipeline with fixed stride extraction and transcoding Stage (#783)
* Add video splitting pipeline with fixed stride extraction and transcoding stages
- Introduced `video_split_clip_example.py` to demonstrate video splitting functionality.
- Added `ClipTranscodingStage` and `FixedStrideExtractorStage` for processing video clips.
- Implemented command-line arguments for configuring video processing parameters.
- Created utility functions for grouping iterables in `grouping.py`.
- Added unit tests for the new stages in `test_clip_transcoding_stage.py` and `test_fixed_stride_extractor_stage.py`.
Signed-off-by: Ao Tang <[email protected]>
* Refactor video splitting pipeline to remove debug mode and enhance stage integration
Signed-off-by: Ao Tang <[email protected]>
* Add video limit argument to video split clip example
Signed-off-by: Ao Tang <[email protected]>
* Refactor video processing stages to enhance resource management and integrate new functionalities
- Replaced separate VideoReaderStage and VideoDownloadStage with a composite VideoReaderDownloadStage, streamlining the video reading and downloading process.
- Updated ClipTranscodingStage to improve GPU resource allocation and added detailed arguments for better configurability.
- Adjusted tests to reflect changes in resource management, ensuring accurate assertions on GPU usage.
These changes improve the clarity and efficiency of video processing within the ray-curator framework.
Signed-off-by: Ao Tang <[email protected]>
* Add mock GPU classes and enhance ClipTranscodingStage tests
- Introduced MockGpuInfo and MockGpuResources classes to simulate GPU information and resources for testing.
- Updated test_resources_gpu_encoder and test_resources_hwaccel_enabled methods to utilize mocks, ensuring accurate resource assertions without dependency on actual GPU hardware.
- Enhanced test_different_encoder_configurations to validate resource requirements for various encoder configurations, including GPU settings.
These changes improve the robustness of the ClipTranscodingStage tests by isolating them from hardware dependencies, facilitating easier testing and validation.
Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
* Remove deprecated GPU resource tests from ClipTranscodingStage
Signed-off-by: Ao Tang <[email protected]>
* Remove unused test for processing in debug mode from ClipTranscodingStage tests
Signed-off-by: Ao Tang <[email protected]>
* Add unit tests for grouping utilities in the ray_curator.utils module
Signed-off-by: Ao Tang <[email protected]>
* Enhance video processing stages with ray stage specifications
- Added `ray_stage_spec` method to `ClipTranscodingStage`, `VideoDownloadStage`, and `VideoReaderStage` to define stage characteristics for Ray integration.
- Updated input and output methods in `ClipTranscodingStage` to include additional input parameters.
- Modified `SplitPipeTask` to return properties from `data` instead of `video`, ensuring consistency in task data handling.
- Added unit tests to verify the correctness of the new `ray_stage_spec` implementations.
These changes improve the integration of video processing stages with Ray's architecture and enhance test coverage for the new functionalities.
Signed-off-by: Ao Tang <[email protected]>
* Refactor video processing imports and update pipeline stages
Signed-off-by: Ao Tang <[email protected]>
* Remove unused `IS_ACTOR_STAGE` key from `ray_stage_spec` in `ClipTranscodingStage` and clean up commented-out code. This simplifies the stage specification and prepares for future enhancements.
Signed-off-by: Ao Tang <[email protected]>
* Remove redundant check for video source bytes in ClipTranscodingStage. This simplifies the process method by eliminating unnecessary error handling when source bytes are not available.
Signed-off-by: Ao Tang <[email protected]>
* Refactor ClipTranscodingStage to use a class variable for the stage name and implement post-initialization resource setup. Added error handling for None source bytes in the process method. Updated tests to remove redundant checks and ensure proper functionality.
Signed-off-by: Ao Tang <[email protected]>
* Remove unnecessary error handling for None source bytes in ClipTranscodingStage's process method.
Signed-off-by: Ao Tang <[email protected]>
* remove redundant test
Signed-off-by: Ao Tang <[email protected]>
* precommit fix
Signed-off-by: Ao Tang <[email protected]>
---------
Signed-off-by: Ao Tang <[email protected]>
Signed-off-by: [Your Name] <[email protected]>
* docs: ray curator api autodoc updates (#896)
Signed-off-by: Lawrence Lane <[email protected]>
* Move all text stages to `stages/text/` (#891)
* first pass
Signed-off-by: Sarah Yurick <[email protected]>
* ruff
Signed-off-by: Sarah Yurick <[email protected]>
* fix tests
Signed-off-by: Sarah Yurick <[email protected]>
* fix after merge
Signed-off-by: Sarah Yurick <[email protected]>
---------
Signed-off-by: Sarah Yurick <[email protected]>
* Add Ray Actor Pool Executor (#893)
* Initial Minhash implementation on Ray (#837)
* Initial minhash logic without Stage API
Signed-off-by: Ayush Dattagupta <[email protected]>
* update args and support passing in pre-batched files
Signed-off-by: Ayush Dattagupta <[email protected]>
* Remove old minhash impl
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add Class to do GPU IO for dedup
Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add ID Generator class
Co-authored-by: Praateek Mahajan <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
* Move MinHashActor to a GPUMinHash class and create a GPUMinHash Processing stage
Signed-off-by: Ayush Dattagupta <[email protected]>
* Remove minhash method in favor of minhashProcessingStage
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add mkdir logic to the writer
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add file partitioning stage to __init__.py
Signed-off-by: Ayush Dattagupta <[email protected]>
* Update cuda12x extra to deduplication. Bump pynvml to avoid conflicts
Signed-off-by: Ayush Dattagupta <[email protected]>
* Update stage name
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add initial minhash tests
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add rmm pool arg to MinhashStage, default to false in the parent actor
Signed-off-by: Ayush Dattagupta <[email protected]>
* Move IO and ID generator logic to the Stage rather than the parent GPUMinHash class
Signed-off-by: Ayush Dattagupta <[email protected]>
* Update GPUMinHash Tests
Signed-off-by: Ayush Dattagupta <[email protected]>
* Standardize Id generator actor name
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add GPUMinHashStage tests
Signed-off-by: Ayush Dattagupta <[email protected]>
* Rename GPUMinHashStage to MinHashStage
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add marker for GPU tests
Signed-off-by: Ayush Dattagupta <[email protected]>
* update cpu ci workflow to skip GPU tests
Signed-off-by: Ayush Dattagupta <[email protected]>
* Skip tests if imports fail
Signed-off-by: Ayush Dattagupta <[email protected]>
* move cudf import checks before stage imports
Signed-off-by: Ayush Dattagupta <[email protected]>
* Use storage options from read_kwargs directly
Signed-off-by: Ayush Dattagupta <[email protected]>
---------
Signed-off-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Praateek Mahajan <[email protected]>
* docs: curate text load data content updates for ray (#895)
* docs: load text data article updates
Signed-off-by: Lawrence Lane <[email protected]>
* remove "ray-curator" for curator
Signed-off-by: Lawrence Lane <[email protected]>
* simplify naming
Signed-off-by: Lawrence Lane <[email protected]>
* imports
Signed-off-by: Lawrence Lane <[email protected]>
* imports
Signed-off-by: Lawrence Lane <[email protected]>
* imports
Signed-off-by: Lawrence Lane <[email protected]>
* linkfix
Signed-off-by: Lawrence Lane <[email protected]>
* read through
Signed-off-by: Lawrence Lane <[email protected]>
* simplification
Signed-off-by: Lawrence Lane <[email protected]>
* remove placeholder concept details
Signed-off-by: Lawrence Lane <[email protected]>
* pipeline verbiage
Signed-off-by: Lawrence Lane <[email protected]>
* initial feedback round
Signed-off-by: Lawrence Lane <[email protected]>
* reduce admonition noise
Signed-off-by: Lawrence Lane <[email protected]>
* minor updates
Signed-off-by: Lawrence Lane <[email protected]>
* minor updates
Signed-off-by: Lawrence Lane <[email protected]>
* feedback
Signed-off-by: Lawrence Lane <[email protected]>
---------
Signed-off-by: Lawrence Lane <[email protected]>
* Adding function decorator for very simple functions to be converted into stages (#835)
* Revert 'Add utility decorators for ProcessingStage creation' (empty cherry-pick)
Signed-off-by: Abhinav Garg <[email protected]>
* Add utility decorators for ProcessingStage creation
This commit introduces a new module containing the `processing_stage` decorator, which allows users to easily convert plain Python functions into `ProcessingStage` instances. The decorator supports configuration options such as stage name, resource allocation, and batch size. Additionally, unit tests have been added to validate the functionality of the decorator and ensure proper handling of task processing.
Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
* test commit
Signed-off-by: Sarah Yurick <[email protected]>
* add test_stage_registry, other nits
Signed-off-by: Sarah Yurick <[email protected]>
* overwrite stage registry
Signed-off-by: Sarah Yurick <[email protected]>
* ruff
Signed-off-by: Sarah Yurick <[email protected]>
* propagate _metadata and _stage_perf
Signed-off-by: Sarah Yurick <[email protected]>
* accept resources dict
Signed-off-by: Sarah Yurick <[email protected]>
* reformat
Signed-off-by: Sarah Yurick <[email protected]>
* add process_batch tests
Signed-off-by: Sarah Yurick <[email protected]>
* ruff
Signed-off-by: Sarah Yurick <[email protected]>
* remove todo
Signed-off-by: Sarah Yurick <[email protected]>
* add pipeline example
Signed-off-by: Sarah Yurick <[email protected]>
---------
Signed-off-by: Abhinav Garg <[email protected]>
Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
* Add Text Embedding Model (#899)
* Add Ray curator dockerfile and enable testing (#879)
* Add Ray curator dockerfile and enable testing
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Fix indentation issues
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Update dockerfile and add cuda12x
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Update coverage paths
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Update gpu tests runner
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Add gpu testing scripts and update
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Cd into ray-curator for coverage
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Create dev layer and install dev packages
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Update coverage paths
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Install opencv
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Address syntax error
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Update cv2 ubuntu dependencies
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Fix typo
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Add cudf placeholder test
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Space after import
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Add gpu_only_import
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Remove import utils for now
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Fix spacing
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Skip gpu tests for cpu
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Update unit test coverage path
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Skip gpu coverage report for now
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Use pixi
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Fix dockerfile syntax
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Try ffmpeg only
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Add extra index url for pixi
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Address typos
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Install git
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Update entrypoint
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Fix typo
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Use env var for dev install
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Resolve syntax error
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Fix env var and verbose install
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Update pixi entrypoint and pyproject install
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Trigger entrypoint before tests
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Update test entrypoint
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Source entrypoint
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Update list of dev install pixi
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Add back cuda12x and index-strategy
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Turn off verbose install
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Skip gpu coverage for now
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Support arm
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Set timeout for dockerbuild and update pyproject
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Remove retry github config
Signed-off-by: Dong Hyuk Chang <[email protected]>
---------
Signed-off-by: Dong Hyuk Chang <[email protected]>
* ci: Install ray-curator module (#905)
* Add ray curator as pypi dependency
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Add package info and test import
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Update pyproject.toml
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Copy src for pixi install
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Update test import
Signed-off-by: Dong Hyuk Chang <[email protected]>
* Revert temp test
Signed-off-by: Dong Hyuk Chang <[email protected]>
---------
Signed-off-by: Dong Hyuk Chang <[email protected]>
* [REVIEW] Add modifiers to ray curator (#898)
* Initial WIP modifier workflows
Signed-off-by: VibhuJawa <[email protected]>
* Moved tests and also moved modifiers to text sub module
Signed-off-by: VibhuJawa <[email protected]>
* Add tests for the meta class and modifier and improve docstring
Signed-off-by: VibhuJawa <[email protected]>
* Update ray-curator/ray_curator/stages/text/modifiers/slicer.py
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
* Update ray-curator/ray_curator/stages/text/modifiers/line_remover.py
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
* Delete files from dask dir and remove optional download fields
Signed-off-by: Vibhu Jawa <[email protected]>
* Add pytest as requested
Signed-off-by: Vibhu Jawa <[email protected]>
---------
Signed-off-by: VibhuJawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Co-authored-by: Copilot <[email protected]>
* Allow users to fuse multiple `DocumentFilter` objects into a single `ScoreFilter` stage (#850)
* Allow users to fuse multiple `DocumentFilter` objects into a single `ScoreFilter` stage
Signed-off-by: Sarah Yurick <[email protected]>
* remove old example and scripts file
Signed-off-by: Sarah Yurick <[email protected]>
* add suggestions
Signed-off-by: Sarah Yurick <[email protected]>
* add init
Signed-off-by: Sarah Yurick <[email protected]>
* fix csv path
Signed-off-by: Sarah Yurick <[email protected]>
* clearer error messages
Signed-off-by: Sarah Yurick <[email protected]>
---------
Signed-off-by: Sarah Yurick <[email protected]>
* Fix exception when blocksize is set (#892) (#904)
If blocksize is set instead of files_per_partition, this line raised an exception.
Signed-off-by: Yurii Paniv <[email protected]>
Signed-off-by: NeMo Bot <[email protected]>
Co-authored-by: Yurii Paniv <[email protected]>
* docs: curate text - process data - language dir (#900)
* docs: curate text - process data - language dir
Signed-off-by: Lawrence Lane <[email protected]>
* remove extra content
Signed-off-by: Lawrence Lane <[email protected]>
* another pass
Signed-off-by: Lawrence Lane <[email protected]>
* remove pool
Signed-off-by: Lawrence Lane <[email protected]>
* formatting
Signed-off-by: Lawrence Lane <[email protected]>
* feedback
Signed-off-by: Lawrence Lane <[email protected]>
* clarification and alternative as pipeline stage. removed extra section
Signed-off-by: Lawrence Lane <[email protected]>
* Update docs/curate-text/process-data/language-management/language.md
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: L.B. <[email protected]>
---------
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
* docs: add README for experimental scripts directory (#910)
Signed-off-by: Abhinav Garg <[email protected]>
* Add IdGenerator to JsonlReader + IdGenerator tests / write_to_disk / from_disk (#907)
* Initial buckets to edges stage (#909)
* Initial buckets to edges stage
Signed-off-by: Ayush Dattagupta <[email protected]>
* re-add file utils from lsh pr
Signed-off-by: Ayush Dattagupta <[email protected]>
* Handle directory cleanup/creation logic in the stage
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add tests for buckets to edgelist
Signed-off-by: Ayush Dattagupta <[email protected]>
* Rename doc_id_column to doc_id_field, update storage_options to read/write_kwargs instead
Signed-off-by: Ayush Dattagupta <[email protected]>
* Fix indentation
Signed-off-by: Ayush Dattagupta <[email protected]>
* Fix kwargs args
Signed-off-by: Ayush Dattagupta <[email protected]>
* Add copyright headers
Signed-off-by: Ayush Dattagupta <[email protected]>
* remove previous curator impl
Signed-off-by: Ayush Dattagupta <[email protected]>
---------
Signed-off-by: Ayush Dattagupta <[email protected]>
* [SemDedup] Add KMeans (#912)
* S3 Client (#903)
* WIP
Signed-off-by: Ao Tang <[email protected]>
* WIP
Signed-off-by: Ao Tang <[email protected]>
* Refactor S3 client configuration and enhance video reading logging
- Updated S3_PROFILE_PATH to use an environment variable for better flexibility in specifying the S3 credentials file location.
- Improved logging in VideoReaderStage to provide more informative messages about video byte downloads, including the size of the downloaded video.
Signed-off-by: Ao Tang <[email protected]>
* Enhance VideoReader functionality with S3 support and improve validation checks
- Updated VideoReader to conditionally use ClientPartitioningStage for S3 paths and FilePartitioningStage for local paths, improving flexibility in handling video sources.
- Enhanced validation in VideoTask to check for the existence of input videos when provided as pathlib.Path, ensuring better error handling.
- Removed unused methods from S3Client to streamline the codebase.
Signed-off-by: Ao Tang <[email protected]>
* Remove redundant exception raising in VideoReaderStage to improve error handling during video reading. This change prevents unnecessary propagation of exceptions while still logging errors effectively.
Signed-off-by: Ao Tang <[email protected]>
* Refactor ClientPartitioningStage and enhance S3 client configuration
- Rearranged import statements for better organization and readability in `client_partitioning.py` and `video_reader.py`.
- Updated `S3ClientConfig` and `BaseClientConfig` to use `@dataclass` for improved data handling.
- Added comprehensive unit tests for `ClientPartitioningStage`, covering initialization, setup, and processing methods with various scenarios.
- Improved error handling and validation in the `_read_list_json` function.
This refactor enhances the maintainability and test coverage of the codebase, ensuring better functionality and reliability in handling client partitioning tasks.
Signed-off-by: Ao Tang <[email protected]>
* Remove SPDX license comments from S3 client, storage client, and storage utilities files to streamline code readability. This change simplifies the file headers while retaining essential module documentation.
Signed-off-by: Ao Tang <[email protected]>
* Use Fsspec instead of boto3
Signed-off-by: Ao Tang <[email protected]>
* Refactor file handling and enhance video reading capabilities
- Introduced a new `FSPath` class in `client_utils.py` for improved file operations with fsspec.
- Updated `ClientPartitioningStage` and `VideoReaderStage` to utilize the new `FSPath` class for better handling of file paths.
- Removed unused imports and streamlined code in `client_partitioning.py` and `video_reader.py`.
- Enhanced error handling in `VideoReaderStage` to support various input types for video sources.
This refactor improves the maintainability and flexibility of file handling in the video processing pipeline.
Signed-off-by: Ao Tang <[email protected]>
* move client_partitioning.py
Signed-off-by: Ao Tang <[email protected]>
* ruff check
Signed-off-by: Ao Tang <[email protected]>
* Fix broken tests
Signed-off-by: Ao Tang <[email protected]>
* Add `as_posix` method to `FSPath` class and implement comprehensive test suite
- Introduced `as_posix` method in the `FSPath` class to convert filesystem paths to POSIX format, accommodating various protocols.
- Created a new test suite for `FSPath` in `test_client_utils.py`, covering initialization, string representation, file operations, and edge cases.
- Enhanced tests for `get_bytes_cat_ranges` to handle different file sizes and error scenarios.
This update improves the functionality and test coverage of the `FSPath` class, ensuring robust file handling across different filesystems.
Signed-off-by: Ao Tang <[email protected]>
* Remove logging of downloaded video size in VideoReaderStage to streamline error handling and reduce unnecessary output.
Signed-off-by: Ao Tang <[email protected]>
* Refactor video reading and splitting pipeline examples for improved readability
- Reformatted the `create_video_reading_pipeline` and `create_video_splitting_pipeline` functions to enhance code clarity by aligning parameters and removing unnecessary line breaks.
- Updated the `VideoReader` and `ClipTranscodingStage` instantiation to follow a consistent style.
- Made minor adjustments in the `ClientPartitioningStage` to ensure consistent formatting and improved readability.
These changes contribute to a cleaner and more maintainable codebase for video processing pipelines.
Signed-off-by: Ao Tang <[email protected]>
---------
Signed-off-by: Ao Tang <[email protected]>
* Add ClipWriterStage to video splitting pipeline Clean (#897)
* WIP
Signed-off-by: Ao Tang <[email protected]>
* WIP
Signed-off-by: Ao Tang <[email protected]>
* Update ClipWriterStage to clarify local storage usage
Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
* Enhance video clip processing with new GenericClipWriterStage and required output path argument
- Introduced a new GenericClipWriterStage for writing video clips and their metadata, consolidating the writing process and improving resource management.
- Updated the video_split_clip_example to require an output clip path, ensuring that users specify where to save the generated clips.
- The new stage supports parallel writing of clips and metadata, enhancing performance and flexibility in video processing workflows.
Signed-off-by: Ao Tang <[email protected]>
* Enhance ClipWriterStage with additional metadata handling
- Improved `ClipWriterStage` to support writing additional metadata during video processing.
- Updated related utility functions to accommodate new metadata fields.
- Refined unit tests to cover the new functionality and ensure reliability.
Signed-off-by: Ao Tang <[email protected]>
* Add ClipWriterStage to video splitting pipeline
- Introduced `ClipWriterStage` for writing clips and metadata during video processing.
- Updated `video_split_clip_example.py` to include the new stage, allowing for clip writing functionality.
- Enhanced command-line argument parsing for output clip path.
- Added utility functions for managing storage paths and writing data in various formats.
- Implemented unit tests for `ClipWriterStage` to ensure functionality and reliability.
Signed-off-by: Ao Tang <[email protected]>
* ruff fix
Signed-off-by: Ao Tang <[email protected]>
* ruff format
Signed-off-by: Ao Tang <[email protected]>
* Refactor S3 client configuration and enhance video reading logging
- Updated S3_PROFILE_PATH to use an environment variable for better flexibility in specifying the S3 credentials file location.
- Improved logging in VideoReaderStage to provide more informative messages about video byte downloads, including the size of the downloaded video.
Signed-off-by: Ao Tang <[email protected]>
* Enhance VideoReader functionality with S3 support and improve validation checks
- Updated VideoReader to conditionally use ClientPartitioningStage for S3 paths and FilePartitioningStage for local paths, improving flexibility in handling video sources.
- Enhanced validation in VideoTask to check for the existence of input videos when provided as pathlib.Path, ensuring better error handling.
- Removed unused methods from S3Client to streamline the codebase.
Signed-off-by: Ao Tang <[email protected]>
* Remove redundant exception raising in VideoReaderStage to improve error handling during video reading. This change prevents unnecessary propagation of exceptions while still logging errors effectively.
Signed-off-by: Ao Tang <[email protected]>
* Refactor ClientPartitioningStage and enhance S3 client configuration
- Rearranged import statements for better organization and readability in `client_partitioning.py` and `video_reader.py`.
- Updated `S3ClientConfig` and `BaseClientConfig` to use `@dataclass` for improved data handling.
- Added comprehensive unit tests for `ClientPartitioningStage`, covering initialization, setup, and processing methods with various scenarios.
- Improved error handling and validation in the `_read_list_json` function.
This refactor enhances the maintainability and test coverage of the codebase, ensuring better functionality and reliability in handling client partitioning tasks.
Signed-off-by: Ao Tang <[email protected]>
* Remove SPDX license comments from S3 client, storage client, and storage utilities files to streamline code readability. This change simplifies the file headers while retaining essential module documentation.
Signed-off-by: Ao Tang <[email protected]>
* Use Fsspec instead of boto3
Signed-off-by: Ao Tang <[email protected]>
* Refactor file handling and enhance video reading capabilities
- Introduced a new `FSPath` class in `client_utils.py` for improved file operations with fsspec.
- Updated `ClientPartitioningStage` and `VideoReaderStage` to utilize the new `FSPath` class for better handling of file paths.
- Removed unused imports and streamlined code in `client_partitioning.py` and `video_reader.py`.
- Enhanced error handling in `VideoReaderStage` to support various input types for video sources.
This refactor improves the maintainability and flexibility of file handling in the video processing pipeline.
Signed-off-by: Ao Tang <[email protected]>
* move client_partitioning.py
Signed-off-by: Ao Tang <[email protected]>
* ruff check
Signed-off-by: Ao Tang <[email protected]>
* Fix broken tests
Signed-off-by: Ao Tang <[email protected]>
* Remove unused `generic_clip_writer.py`, `storage_client.py`, and related utility files; refactor `writer_utils.py` to eliminate storage client dependencies and streamline file writing functions. Update tests to reflect these changes and ensure compatibility with the new structure.
Signed-off-by: Ao Tang <[email protected]>
* Remove test file `test_client_utils.py` for the `FSPath` class, cleaning up unused test cases and ensuring the test suite reflects the current codebase structure.
Signed-off-by: Ao Tang <[email protected]>
* Refactor ClipWriterStage to remove storage client dependencies and streamline file writing methods. Updated method signatures to eliminate storage client parameters, enhancing code clarity and maintainability.
Signed-off-by: Ao Tang <[email protected]>
* Remove unused import of ClipWriterStage in video_split_clip_example.py to streamline the code and improve clarity.
Signed-off-by: Abhinav Garg <[email protected]>
* Remove unused `input_s3_profile_name` attribute from `VideoReader` class to streamline the code and improve clarity.
Signed-off-by: Ao Tang <[email protected]>
* Remove unused `input_s3_profile_name` attribute from `VideoReader` class to streamline the code and improve clarity.
Signed-off-by: Abhinav Garg <[email protected]>
* Refactor video metadata writing in ClipWriterStage by removing an unnecessary blank line for improved code clarity. Update get_full_path function signature for consistency in type hinting. Enhance test case formatting in TestVideoReaderStage to improve readability and maintainability.
Signed-off-by: Ao Tang <[email protected]>
---------
Signed-off-by: Ao Tang <[email protected]>
Signed-off-by: [Your Name] <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
Co-authored-by: Abhinav Garg <[email protected]>
* Add motion filtering stages to video splitting pipeline (#797)
* Add video io reader
* Add test
* Add video splitting pipeline with fixed stride extraction and transcoding stages
- Introduced `video_split_clip_example.py` to demonstrate video splitting functionality.
- Added `ClipTranscodingStage` and `FixedStrideExtractorStage` for processing video clips.
- Implemented command-line arguments for…
1 parent 2f273bc commit efe76b9
File tree
- .github
- actions/test-template
- workflows
- ray-curator
- docker
- common
7 files changed, +7201 -53 lines changed
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
67 | 67 | | |
68 | 68 | | |
69 | 69 | | |
70 | | - | |
71 | | - | |
72 | | - | |
73 | | - | |
74 | | - | |
75 | | - | |
76 | | - | |
77 | | - | |
78 | | - | |
79 | 70 | | |
80 | 71 | | |
81 | 72 | | |
82 | 73 | | |
83 | 74 | | |
84 | | - | |
85 | 75 | | |
86 | 76 | | |
| 77 | + | |
87 | 78 | | |
88 | | - | |
89 | 79 | | |
90 | 80 | | |
91 | | - | |
92 | | - | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
93 | 85 | | |
94 | 86 | | |
| 87 | + | |
| 88 | + | |
95 | 89 | | |
96 | 90 | | |
97 | 91 | | |
98 | 92 | | |
99 | 93 | | |
100 | | - | |
101 | | - | |
102 | | - | |
103 | | - | |
104 | | - | |
105 | | - | |
106 | 94 | | |
| 95 | + | |
107 | 96 | | |
108 | 97 | | |
109 | 98 | | |
| 99 | + | |
110 | 100 | | |
111 | 101 | | |
112 | | - | |
113 | | - | |
114 | | - | |
115 | | - | |
116 | | - | |
117 | | - | |
118 | | - | |
119 | | - | |
120 | | - | |
121 | | - | |
122 | | - | |
123 | | - | |
124 | | - | |
125 | | - | |
126 | | - | |
127 | | - | |
128 | | - | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
| 138 | + | |
| 139 | + | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
129 | 148 | | |
130 | | - | |
131 | | - | |
132 | 149 | | |
133 | 150 | | |
134 | 151 | | |
| |||