Description
I'm the author of rawloader and am trying to figure out the best way to speed up image operations without a lot of code duplication. Ideally one could write a function once and have the best SIMD implementation used on each of several architectures. For this a few things would need to happen:
1. When SIMD isn't available at all, fall back to a non-SIMD implementation of the same instructions
2. Have the same function call the ideal implementation depending on whether the target has SSE/AVX/etc
3. Auto-generate all the target variations (e.g., with and without AVX on x86-64) and dispatch between them at runtime
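To make the three points concrete, here is a rough hand-written sketch of what I'd like to have generated automatically. It is not taken from rawloader; the function names (`mul_add`, `mul_add_generic`, `mul_add_avx2`) are made up for illustration, and it assumes a Rust toolchain where `#[target_feature]` and `is_x86_feature_detected!` are available in `std`. The loop is written once, compiled a second time with AVX2 enabled, and the choice between the two copies is made at runtime.

```rust
/// Written once: out[i] = a[i] * b[i] + c[i] (a plain multiply-add, not fused).
/// `inline(always)` lets this body be recompiled inside each wrapper below.
#[inline(always)]
fn mul_add_generic(a: &[f32], b: &[f32], c: &[f32], out: &mut [f32]) {
    for (((o, &x), &y), &z) in out.iter_mut().zip(a).zip(b).zip(c) {
        *o = x * y + z;
    }
}

/// Hypothetical AVX2 variant: same source, but the compiler may use AVX2
/// instructions when vectorizing the inlined loop.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn mul_add_avx2(a: &[f32], b: &[f32], c: &[f32], out: &mut [f32]) {
    mul_add_generic(a, b, c, out)
}

/// Single entry point: picks a variant at runtime and otherwise falls back
/// to the baseline build of the same loop.
pub fn mul_add(a: &[f32], b: &[f32], c: &[f32], out: &mut [f32]) {
    assert!(a.len() == out.len() && b.len() == out.len() && c.len() == out.len());

    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed on the running CPU.
            unsafe { mul_add_avx2(a, b, c, out) };
            return;
        }
    }

    mul_add_generic(a, b, c, out);
}
```

Calling `mul_add(&a, &b, &c, &mut out)` looks the same on every target. The wrapper and the dispatch are exactly the boilerplate I'd rather not write by hand for every function and every feature level.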
Having 1) would make it much easier to add SIMD support to applications without adding special cases everywhere, although some applications don't strictly need it because they would never be run on a CPU that basic. Having 2) would take a lot of the effort out of writing SIMD implementations, but it only makes sense if the overhead of mixing SIMD with non-SIMD code doesn't make the result slower than the fully non-SIMD version. I've proposed the equivalent of 3) for normal LLVM generation here:
I'm curious what the general opinion on this is. At least for basic operations like doing an FMA on a bunch of values, it would be great to be able to write the code once and target most architectures efficiently with good fallbacks. Having a way to also use OpenCL with the same code would be even nicer, and is probably possible for a few simple operations.