-
Notifications
You must be signed in to change notification settings - Fork 676
Open
Labels
P1Important tasks that we should complete soonImportant tasks that we should complete soonPerformance πPerformance related issues and pull requests.Performance related issues and pull requests.
Description
No specific reproducer attached, but this is observable on a variety of workloads using large DFs.
Modin on Ray uses excessive amounts of memory for a wide range of operations, and we should investigate how this can be reduced. Consider this diagram from the authors of Dias [paper + GH] and PandasBench, where Modin uses obscene amounts of memory for some notebooks:
While testing an implementation for parallel writes from Ray datasets to Snowflake, we also observed the upload of a ~3GB in-memory dataframe to be using well over 100GB RAM.
Further areas of investigation:
- Is this entirely a Ray issue, or are there things Modin can do to address it?
- Is excessive memory usage also an issue for Modin on Dask? The PandasBench paper indicates that dask DFs' peak memory consumption is relatively low.
See also: #5524
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
P1Important tasks that we should complete soonImportant tasks that we should complete soonPerformance πPerformance related issues and pull requests.Performance related issues and pull requests.