cpu-gpu-arch/gpu/NVidia-Maxwell.md at main · azhirnov/cpu-gpu-arch · GitHub

Examples

GTX 9xx
Titan X
Quadro Mxxxx
Tegra X1
Nintendo Switch
Shield TV
MX110, MX130

References

Notes

FP16 data storage.
Unified L1/Texture cache.
Native shared memory atomic operations for 32-bit integer arithmetic, along with native 32 or 64-bit compare-and-swap (CAS).
Async transfer queue.
instructions IMAD and IMUL have a long latency because they are emulated. ref

Specs

ops/clock per SM: [7]
- 128 fp32 FMA
- 4 fp64 FMA
- 32 fp32 rcp, rsqrt, log2, exp2, sin, cos
- 32 i32 add/sub
- 64 i32 shift
- 64 cmp, min, max
- 64 i32 bit reverse
- 64 i32 bit extract, insert
- 128 i32 and, or, xor
- 32 MSB
- 32 pop count
- 32 i8/i16 to i32
- 4 to/from i64/fp64
- 32 type conversions