
"Vectorize" ColumnGeneratorCachedByIndex #80

@MischaPanch


ColumnGeneratorCachedByIndex is the recommended base class for new cached column generators, but it can be significantly slower than the discouraged approach of first creating a ColumnGenerator and then adding a cache by wrapping it in IndexCachedColumnGenerator.

The reason is that IndexCachedColumnGenerator first finds all non-cached values and then processes them at once (i.e., batch-wise), whereas ColumnGeneratorCachedByIndex always loops through all values. Thus, the initial filling of the cache can be much slower.
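To make the difference concrete, here is a minimal sketch in plain Python. It does not use the library's actual classes: `fill_per_value`, `fill_batch`, the dict-based cache and the call counters are hypothetical stand-ins, chosen only to show how many times the underlying computation is invoked under each strategy.

```python
from typing import Any, Callable, Dict, Hashable, List, Sequence, Tuple


def fill_per_value(
    index: Sequence[Hashable],
    cache: Dict[Hashable, Any],
    generate_value: Callable[[Hashable], Any],
) -> Tuple[List[Any], int]:
    """Per-value strategy (hypothetical): loop over every index entry and
    call generate_value once for each entry missing from the cache."""
    calls = 0
    result = []
    for key in index:
        if key not in cache:
            cache[key] = generate_value(key)
            calls += 1
        result.append(cache[key])
    return result, calls


def fill_batch(
    index: Sequence[Hashable],
    cache: Dict[Hashable, Any],
    generate_batch: Callable[[List[Hashable]], Sequence[Any]],
) -> Tuple[List[Any], int]:
    """Batch strategy (hypothetical): collect all missing keys first, then
    compute them in a single call to generate_batch."""
    # dict.fromkeys removes duplicate keys while keeping first-seen order
    missing = [key for key in dict.fromkeys(index) if key not in cache]
    calls = 0
    if missing:
        values = generate_batch(missing)
        cache.update(zip(missing, values))
        calls = 1
    return [cache[key] for key in index], calls
```

With an empty cache and N distinct index entries, the per-value variant invokes the computation N times, while the batch variant invokes it once. If the computation can be vectorized, that single call is where the speedup comes from.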

Not sure what to do here. One would need to redesign ColumnGeneratorCachedByIndex so that it no longer relies on _generate_value, but that's a breaking change. Another way would be to write a new class à la VectorizedColumnGeneratorCachedByIndex, but I honestly feel that batch-wise processing of missing values should be the default behavior.
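As a rough illustration of what a separate "vectorized" class could look like, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: the class name `VectorizedCachedGeneratorSketch`, the abstract hook `_generate_values`, the in-memory dict cache and the `generate` method are hypothetical and do not reflect the library's real interfaces, which operate on data frames and persistent caches.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Hashable, List, Sequence


class VectorizedCachedGeneratorSketch(ABC):
    """Toy stand-in (hypothetical): subclasses implement a batch hook
    instead of a per-value hook."""

    def __init__(self) -> None:
        # Hypothetical in-memory cache, keyed by index value
        self._cache: Dict[Hashable, Any] = {}

    @abstractmethod
    def _generate_values(self, keys: List[Hashable]) -> Sequence[Any]:
        """Compute the values for all given keys in one call, in key order."""

    def generate(self, index: Sequence[Hashable]) -> List[Any]:
        # Determine the distinct keys that are not cached yet
        missing = [key for key in dict.fromkeys(index) if key not in self._cache]
        if missing:
            values = self._generate_values(missing)
            if len(values) != len(missing):
                raise ValueError("_generate_values must return one value per key")
            self._cache.update(zip(missing, values))
        # Serve the full result from the cache
        return [self._cache[key] for key in index]


class SquareGenerator(VectorizedCachedGeneratorSketch):
    """Example subclass that records which batches it was asked to compute."""

    def __init__(self) -> None:
        super().__init__()
        self.batch_calls: List[List[Hashable]] = []

    def _generate_values(self, keys: List[Hashable]) -> Sequence[Any]:
        self.batch_calls.append(list(keys))
        return [key * key for key in keys]
```

In this sketch, a first call to `generate` computes all requested keys in one batch, and a later call only computes the keys that are still missing from the cache. Whether such a hook should live in a new class or become the default behavior is exactly the open question above.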
