[FEA] category dtype support in parquet reader #12497

@mattf

Description

Is your feature request related to a problem? Please describe.
I am writing code that uses import cudf as pd in place of import pandas as pd.

Describe the solution you'd like
The same behavior as with import pandas as pd: a category column written to Parquet should be read back with dtype category.

In [1]: import cudf as pd

In [2]: pd.__version__
Out[2]: '22.12.01'

In [3]: df = pd.DataFrame({'a': ['one','two','three'] * 10})

In [4]: df.info()
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       30 non-null     object
dtypes: object(1)
memory usage: 234.0+ bytes

In [5]: df.a = df.a.astype('category')

In [6]: df.info()
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       30 non-null     category
dtypes: category(1)
memory usage: 57.0 bytes

In [7]: %ls df.parquet
ls: cannot access 'df.parquet': No such file or directory

In [8]: df.to_pandas().to_parquet('df.parquet')

In [9]: %ls df.parquet
df.parquet

In [10]: pd.read_parquet('df.parquet').info()
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       30 non-null     object
dtypes: object(1)
memory usage: 234.0+ bytes

In [11]: import pandas

In [12]: pandas.read_parquet('df.parquet').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   a       30 non-null     category
dtypes: category(1)
memory usage: 290.0 bytes

In [13]: pd.DataFrame(pandas.read_parquet('df.parquet')).info()
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       30 non-null     category
dtypes: category(1)
memory usage: 57.0 bytes
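
The cuDF memory figures in the session above (234 bytes as object, 57 bytes as category) show why keeping the category dtype matters. The sketch below reproduces both numbers under an assumed layout: UTF-8 character bytes plus one 32-bit offset per row (and one extra), and one-byte integer codes for the categorical column. The function names and the layout are assumptions made for illustration, not taken from the cuDF source.

```python
def string_column_bytes(values, offset_width=4):
    """Assumed string layout: UTF-8 bytes plus (n + 1) fixed-width offsets."""
    chars = sum(len(v.encode("utf-8")) for v in values)
    offsets = (len(values) + 1) * offset_width
    return chars + offsets


def category_column_bytes(values, code_width=1, offset_width=4):
    """Assumed categorical layout: one code per row plus a string column
    holding each distinct value once."""
    categories = sorted(set(values))
    codes = len(values) * code_width
    return codes + string_column_bytes(categories, offset_width)


values = ["one", "two", "three"] * 10
plain = string_column_bytes(values)      # 110 char bytes + 31 * 4 offset bytes
encoded = category_column_bytes(values)  # 30 code bytes + (11 + 4 * 4) bytes
print(plain, encoded)  # prints: 234 57
```

Under these assumptions, reading the file back as object costs roughly four times the memory of the categorical column for this data.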

The cuDF parquet reader, however, returns the column as dtype=object instead of category:

In [10]: pd.read_parquet('df.parquet').info()
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       30 non-null     object
dtypes: object(1)
memory usage: 234.0+ bytes
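
Until the reader supports this, one possible workaround is to restore the dtype by hand. When pandas writes Parquet through pyarrow, it stores a JSON description of the original dtypes in the file's key-value metadata under the key `pandas`, and categorical columns are marked with `"pandas_type": "categorical"`. The helper below is a minimal sketch that extracts those column names from the JSON; `categorical_columns` is a hypothetical name, and the sample metadata is a hand-written, trimmed illustration rather than the contents of the file above.

```python
import json


def categorical_columns(pandas_metadata):
    """Return names of columns that pandas recorded as categorical.

    `pandas_metadata` is the JSON stored under the `pandas` key of the
    Parquet schema metadata, given as str or bytes.
    """
    if isinstance(pandas_metadata, bytes):
        pandas_metadata = pandas_metadata.decode("utf-8")
    meta = json.loads(pandas_metadata)
    return [
        col["name"]
        for col in meta.get("columns", [])
        if col.get("pandas_type") == "categorical" and col.get("name") is not None
    ]


# Hand-written sample in the shape pandas uses (trimmed for illustration).
sample = json.dumps({
    "columns": [
        {"name": "a", "field_name": "a", "pandas_type": "categorical",
         "numpy_type": "int8",
         "metadata": {"num_categories": 3, "ordered": False}},
        {"name": "b", "field_name": "b", "pandas_type": "unicode",
         "numpy_type": "object", "metadata": None},
    ]
})
print(categorical_columns(sample))  # prints: ['a']
```

With pyarrow installed, the metadata bytes would typically come from `pyarrow.parquet.read_schema('df.parquet').metadata[b'pandas']`, and each returned column could then be converted with `astype('category')` after `cudf.read_parquet`. This restores the dtype only; it does not by itself preserve the stored category order or the `ordered` flag, and it still materializes the strings on read.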

Metadata

Assignees

No one assigned

    Labels

    0 - Backlog: In queue waiting for assignment
    Python: Affects Python cuDF API.
    cuIO: cuIO issue
    feature request: New feature or request
    libcudf: Affects libcudf (C++/CUDA) code.

    Type

    No type

    Projects

    Status

    Todo

    Relationships

    None yet

    Development

    No branches or pull requests