Skip to content

Add support for faster shuffles #280

@velvia

Description

@velvia

Currently u32x8 shuffle1_dyn are not optimized and fallback is used which results in a whole mess of extract intrinsics. It is not very fast.

Can we please add support for _mm256_permutevar8x32_epi32 and similar variants at the u32x8 (and f32x8, etc.) levels? It is a fairly large speedup.

Thanks

Metadata

Metadata

Assignees

No one assigned

    Labels

    EnhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions