Update stable diffusion benchmark for TensorRT EP (microsoft#16560)
### Description

Add Stable Diffusion Text2Image pipelines for the TensorRT and CUDA
execution providers. They automatically export and optimize the ONNX
models, then create an ONNX Runtime session that uses the TensorRT or
CUDA execution provider.
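A minimal sketch of how such a session might be configured. The helper below is illustrative, not code from this PR; `trt_fp16_enable`, `trt_cuda_graph_enable`, and `enable_cuda_graph` are ONNX Runtime provider options, but availability depends on the installed onnxruntime-gpu version.

```python
def select_providers(engine: str, enable_cuda_graph: bool = False):
    """Build an ONNX Runtime provider list for "tensorrt" or "cuda".

    Illustrative helper (not from this PR); option names assume a
    recent onnxruntime-gpu build.
    """
    if engine == "tensorrt":
        trt_options = {"trt_fp16_enable": True}
        if enable_cuda_graph:
            trt_options["trt_cuda_graph_enable"] = True
        # TensorRT EP falls back to CUDA EP for unsupported nodes.
        return [("TensorrtExecutionProvider", trt_options),
                ("CUDAExecutionProvider", {})]
    return [("CUDAExecutionProvider", {"enable_cuda_graph": enable_cuda_graph})]

# Creating the session (requires onnxruntime-gpu and an exported model):
# import onnxruntime as ort
# session = ort.InferenceSession("unet.onnx",
#                                providers=select_providers("tensorrt", True))
```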

Add support for benchmarking TensorRT.

Add support for CUDA graph. This feature is currently only available in
the nightly package.

| Engine/Provider to test | Command line |
| --- | --- |
| CUDA EP | `python benchmark.py -v 1.5` |
| CUDA EP with CUDA graph | `python benchmark.py -v 1.5 --enable_cuda_graph` |
| TensorRT EP | `python benchmark.py -v 1.5 -r tensorrt` |
| TensorRT EP with CUDA graph | `python benchmark.py -v 1.5 -r tensorrt --enable_cuda_graph` |
| TensorRT | `python benchmark.py -v 1.5 -e tensorrt` |

Add benchmark numbers for a T4 GPU using CUDA 11.7, cuDNN 8.5, PyTorch
1.13.1+cu117, TensorRT 8.6.1, and onnxruntime-gpu 1.15.1 (or
ort-nightly-gpu 1.16 for CUDA graph).

TODO: add benchmark numbers for A100-80GB.

### Motivation and Context
tianleiwu authored Jul 10, 2023
1 parent 2fd5e1c commit b8f6235
Showing 9 changed files with 2,257 additions and 65 deletions.
4 changes: 4 additions & 0 deletions onnxruntime/python/tools/transformers/fusion_transpose.py
```diff
@@ -55,6 +55,10 @@ def fuse(
         cast_children = self.model.get_children(cast_node, input_name_to_nodes)
         if cast_children and len(cast_children) > 1:
             return
+
+        if cast_node.input[0] not in output_name_to_node:
+            return
+
         transpose_a = output_name_to_node[cast_node.input[0]]

         if transpose_a.op_type != "Transpose":
```
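The added check guards the dictionary lookup on the following line: if the Cast node's input is a graph input or initializer, it has no producing node, so indexing `output_name_to_node` directly would raise a `KeyError`. A standalone sketch of the pattern (the variable names mirror the fusion code, but the mapping contents here are hypothetical stand-ins for ONNX nodes):

```python
# Hypothetical tensor-name -> producer-node mapping, standing in for the
# real output_name_to_node dict built over an ONNX graph.
output_name_to_node = {"cast_in": "transpose_node"}

def producer_of(name, output_name_to_node):
    # Mirror the added check: return None instead of raising KeyError
    # when the tensor has no producing node (e.g. it is a graph input).
    if name not in output_name_to_node:
        return None
    return output_name_to_node[name]
```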
