@@ -106,6 +106,7 @@ def AMDGPU_ExtPackedFp8Op :
If the passed-in vector has fewer than four elements, or the input is scalar,
the remaining values in the <4 x i8> will be filled with
undefined values as needed.
+
#### Example
```mlir
// Extract single FP8 element to scalar f32
@@ -171,6 +172,7 @@ def AMDGPU_PackedTrunc2xFp8Op :
sub-registers, and so the conversion intrinsics (which are currently the
only way to work with 8-bit float types) take packed vectors of 4 8-bit
values.
+
#### Example
```mlir
%result = amdgpu.packed_trunc_2xfp8 %src1, %src2 into %dest[word 1]
@@ -234,6 +236,7 @@ def AMDGPU_PackedStochRoundFp8Op :
sub-registers, and so the conversion intrinsics (which are currently the
only way to work with 8-bit float types) take packed vectors of 4 8-bit
values.
+
#### Example
```mlir
%result = amdgpu.packed_stoch_round_fp8 %src + %stoch_seed into %dest[2]
@@ -364,6 +367,7 @@ def AMDGPU_RawBufferLoadOp :
- If `boundsCheck` is false and the target chipset is RDNA, OOB_SELECT is set
to 2 to disable bounds checks, otherwise it is 3
- The cache coherency bits are off
+
#### Example
```mlir
// Load scalar f32 from 1D buffer
@@ -413,6 +417,7 @@ def AMDGPU_RawBufferStoreOp :

See `amdgpu.raw_buffer_load` for a description of how the underlying
instruction is constructed.
+
#### Example
```mlir
// Store scalar f32 to 1D buffer
@@ -465,6 +470,7 @@ def AMDGPU_RawBufferAtomicCmpswapOp :

See `amdgpu.raw_buffer_load` for a description of how the underlying
instruction is constructed.
+
#### Example
```mlir
// Atomic compare-swap
@@ -510,6 +516,7 @@ def AMDGPU_RawBufferAtomicFaddOp :

See `amdgpu.raw_buffer_load` for a description of how the underlying
instruction is constructed.
+
#### Example
```mlir
// Atomic floating-point add
@@ -710,6 +717,7 @@ def AMDGPU_SwizzleBitModeOp : AMDGPU_Op<"swizzle_bitmode",

Supports arbitrary int/float/vector types, which will be repacked to i32 and
one or more `rocdl.ds_swizzle` ops during lowering.
+
#### Example
```mlir
%result = amdgpu.swizzle_bitmode %src 1 2 4 : f32
@@ -740,6 +748,7 @@ def AMDGPU_LDSBarrierOp : AMDGPU_Op<"lds_barrier"> {
(those which will implement this barrier by emitting inline assembly),
use of this operation will impede the usability of memory watches (including
breakpoints set on variables) when debugging.
+
#### Example
```mlir
amdgpu.lds_barrier
@@ -782,6 +791,7 @@ def AMDGPU_SchedBarrierOp :
`amdgpu.sched_barrier` serves as a barrier that could be
configured to restrict movements of instructions through it as
defined by sched_barrier_opts.
+
#### Example
```mlir
// Barrier allowing no dependent instructions
@@ -888,6 +898,7 @@ def AMDGPU_MFMAOp :

The negateA, negateB, and negateC flags are only supported for double-precision
operations on gfx94x.
+
#### Example
```mlir
%result = amdgpu.mfma %a * %b + %c
@@ -935,6 +946,7 @@ def AMDGPU_WMMAOp :

The `clamp` flag is used to saturate the output of type T to numeric_limits<T>::max()
in case of overflow.
+
#### Example
```mlir
%result = amdgpu.wmma %a * %b + %c
@@ -1062,6 +1074,7 @@ def AMDGPU_ScaledMFMAOp :
are omitted from this wrapper.
- The `negateA`, `negateB`, and `negateC` flags in `amdgpu.mfma` are only supported for
double-precision operations on gfx94x and so are not included here.
+
#### Example
```mlir
%result = amdgpu.scaled_mfma