Open
Description
I ran some benchmarks dispatching jobs to the IAA using a single work queue and 8 cores, sweeping the number of engines assigned to the work queue from 1 to 8. With one engine I measured an inbound throughput on the IAA (checked with pcm_accel) of ~1.9 GB/s and with two engines 3.8 GB/s, but throughput then stayed flat at 3.8 GB/s as I added more engines. I tested with input sizes up to 512 MB and up to 32 threads. Those are only the maximums I used for the benchmark; the plateau already appeared at much lower thread counts, input sizes, and core counts. Is this expected? My impression from some recent papers on Intel's on-chip accelerators is that the maximum throughput should be closer to 30 GB/s.
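To make the plateau concrete, here is a minimal sketch that compares the numbers above against ideal linear scaling from the single-engine result. The linear-scaling baseline is only an assumption for illustration, not a documented property of the IAA, and the helper name `scaling_table` is made up for this example. The values for 3 to 8 engines are entered as 3.8 GB/s because throughput stayed flat there.

```python
# Measured inbound throughput in GB/s, keyed by number of engines.
# 1 and 2 engines are the reported values; 3-8 are 3.8 because it stayed flat.
PER_ENGINE_GBPS = 1.9
measured = {n: (1.9 if n == 1 else 3.8) for n in range(1, 9)}


def scaling_table(measured, per_engine):
    """Return (engines, measured, ideal_linear, efficiency) rows.

    ideal_linear assumes each engine adds `per_engine` GB/s, which is a
    hypothetical baseline rather than a hardware guarantee.
    """
    rows = []
    for engines, gbps in sorted(measured.items()):
        ideal = per_engine * engines
        rows.append((engines, gbps, ideal, gbps / ideal))
    return rows


for engines, gbps, ideal, eff in scaling_table(measured, PER_ENGINE_GBPS):
    print(f"{engines} engines: measured {gbps:.1f} GB/s, "
          f"linear {ideal:.1f} GB/s, efficiency {eff:.0%}")
```

Under that assumption, scaling is perfect up to 2 engines and drops to 25% of the linear baseline at 8 engines.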