diff --git a/DistributedTraining/README.md b/DistributedTraining/README.md
index 2b7ef17..a4c4023 100644
--- a/DistributedTraining/README.md
+++ b/DistributedTraining/README.md
@@ -54,7 +54,9 @@ It implements everything that are described in [ZeRO paper [1]](https://arxiv.or
 
 ### Integrate Huggingface with Deepspeed
 
-[This page](https://huggingface.co/docs/transformers/main_classes/deepspeed?highlight=deepspeed#deepspeed-integration) contains the descriptions and codes for integrating the Deepspeed with Huggingface for distributed training with Huggingface models.
+[This page](https://huggingface.co/docs/transformers/main_classes/deepspeed) contains the documentation and code for integrating DeepSpeed with Huggingface for distributed training of Huggingface models.
+
+[This notebook](./src/Huggingface_DeepSpeed_CLI.ipynb) is an example of using DeepSpeed with Huggingface.
 
 ### DeepSpeed ZeRO
 
diff --git a/DistributedTraining/src/Huggingface_DeepSpeed_CLI.ipynb b/DistributedTraining/src/Huggingface_DeepSpeed_CLI.ipynb
new file mode 100644
index 0000000..f7dfc74
--- /dev/null
+++ b/DistributedTraining/src/Huggingface_DeepSpeed_CLI.ipynb
@@ -0,0 +1,1478 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "accelerator": "GPU",
+    "colab": {
+      "provenance": [],
+      "toc_visible": true,
+      "machine_shape": "hm"
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.7.8"
+    },
+    "gpuClass": "premium"
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "o6S7Z35-TkSR"
+      },
+      "source": [
+        "\n",
+        "\n",
+        "## Setting up the correct environment\n",
+        "\n",
+        "In order to run `transformers` with `deepspeed`, you need:\n",
+        "1. enough general RAM. Different users seem to get instances with different amounts of allocated general RAM. Try `!free -h`; if your process gets killed, you have probably run out of memory. If you can't get enough memory, you can turn `cpu_offload` off in `ds_config.json` below.\n",
+        "2. matching cuda versions. Your pytorch needs to be built with the same major cuda version as your system-wide installed cuda. This is normally not needed to run `pytorch` alone, but it is needed for building CUDA extensions, like DeepSpeed. You will find full documentation [here](https://huggingface.co/transformers/main_classes/trainer.html#installation-notes).\n",
+        "\n",
+        "Since we can't control which cuda version colab has, it can be tricky to find the matching pytorch version. This notebook saves you time by showing all the versions you need to install.\n",
+        "\n",
+        "This notebook will inevitably become outdated over time, so make sure you check for the latest version of it at https://github.com/stas00/porting/blob/master/transformers/deepspeed/ and please let me know if deepspeed stops building and the notebook needs to be updated.\n",
+        "\n",
+        "As mentioned earlier, if Deepspeed builds but the training gets killed, you got a colab instance with too little RAM. There is no need to contact me in that case, as there is nothing I can do about it."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "Kr8bCfITLOQe",
+        "outputId": "5393e389-f3b1-4618-83b4-798ef64ad595"
+      },
+      "source": [
+        "# Free colab seems to give different amounts of general RAM to different users, or even to the same user at different times.\n",
+        "\n",
+        "!free -h"
+      ],
+      "execution_count": 1,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "              total        used        free      shared  buff/cache   available\n",
+            "Mem:            83G        683M         79G        1.2M        2.9G         82G\n",
+            "Swap:            0B          0B          0B\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "xJ1iecs6SWgk",
+        "outputId": "12de3be8-3e08-4662-b9ad-f5d521ec4c16"
+      },
+      "source": [
+        "# check which nvidia driver and cuda version are running\n",
+        "\n",
+        "!nvidia-smi"
+      ],
+      "execution_count": 2,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Wed Dec 14 13:46:25 2022       \n",
+            "+-----------------------------------------------------------------------------+\n",
+            "| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |\n",
+            "|-------------------------------+----------------------+----------------------+\n",
+            "| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |\n",
+            "| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |\n",
+            "|                               |                      |               MIG M. |\n",
+            "|===============================+======================+======================|\n",
+            "|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |\n",
+            "| N/A   35C    P0    56W / 400W |      0MiB / 40536MiB |      0%      Default |\n",
+            "|                               |                      |             Disabled |\n",
+            "+-------------------------------+----------------------+----------------------+\n",
+            "                                                                               \n",
+            "+-----------------------------------------------------------------------------+\n",
+            "| Processes:                                                                  |\n",
+            "|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |\n",
+            "|        ID   ID                                                   Usage      |\n",
+            "|=============================================================================|\n",
+            "|  No running processes found                                                 |\n",
+            "+-----------------------------------------------------------------------------+\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "_9mmhJzcgHy1",
+        "outputId": "2c0bf435-4dce-42f9-9d13-771579989b69"
+      },
+      "source": [
+        "# need to match the system-wide installed cuda-11 for deepspeed to compile\n",
+        "# so install the matching pytorch\n",
+        "\n",
+        "# pt-1.8.1 works too\n",
+        "# !pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html\n",
+        "\n",
+        "# pt-1.11\n",
+        "!pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html\n",
+        "\n"
+      ],
+      "execution_count": 3,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
+            "Looking in links: https://download.pytorch.org/whl/cu113/torch_stable.html\n",
+            "Collecting torch==1.11.0+cu113\n",
+            "  Downloading https://download.pytorch.org/whl/cu113/torch-1.11.0%2Bcu113-cp38-cp38-linux_x86_64.whl (1637.0 MB)\n",
+            "\u001b[K     |████████████████▎               | 834.1 MB 1.2 MB/s eta 0:11:36tcmalloc: large alloc 1147494400 bytes == 0x3a606000 @  0x7fdda8916615 0x5d6f4c 0x51edd1 0x51ef5b 0x4f750a 0x4997a2 0x4fd8b5 0x4997c7 0x4fd8b5 0x49abe4 0x4f5fe9 0x55e146 0x4f5fe9 0x55e146 0x4f5fe9 0x55e146 0x5d8868 0x5da092 0x587116 0x5d8d8c 0x55dc1e 0x55cd91 0x5d8941 0x49abe4 0x55cd91 0x5d8941 0x4990ca 0x5d8868 0x4997a2 0x4fd8b5 0x49abe4\n",
+            "\u001b[K     |████████████████████▋           | 1055.7 MB 1.2 MB/s eta 0:08:23tcmalloc: large alloc 1434370048 bytes == 0x7ec5c000 @  0x7fdda8916615 0x5d6f4c 0x51edd1 0x51ef5b 0x4f750a 0x4997a2 0x4fd8b5 0x4997c7 0x4fd8b5 0x49abe4 0x4f5fe9 0x55e146 0x4f5fe9 0x55e146 0x4f5fe9 0x55e146 0x5d8868 0x5da092 0x587116 0x5d8d8c 0x55dc1e 0x55cd91 0x5d8941 0x49abe4 0x55cd91 0x5d8941 0x4990ca 0x5d8868 0x4997a2 0x4fd8b5 0x49abe4\n",
+            "\u001b[K     |██████████████████████████▏     | 1336.2 MB 1.2 MB/s eta 0:04:14tcmalloc: large alloc 1792966656 bytes == 0x3a8e000 @  0x7fdda8916615 0x5d6f4c 0x51edd1 0x51ef5b 0x4f750a 0x4997a2 0x4fd8b5 0x4997c7 0x4fd8b5 0x49abe4 0x4f5fe9 0x55e146 0x4f5fe9 0x55e146 0x4f5fe9 0x55e146 0x5d8868 0x5da092 0x587116 0x5d8d8c 0x55dc1e 0x55cd91 0x5d8941 0x49abe4 0x55cd91 0x5d8941 0x4990ca 0x5d8868 0x4997a2 0x4fd8b5 0x49abe4\n",
+            "\u001b[K     |████████████████████████████████| 1637.0 MB 133.9 MB/s eta 0:00:01tcmalloc: large alloc 1636999168 bytes == 0x6e876000 @  0x7fdda89151e7 0x4d30a0 0x4d312c 0x5d6f4c 0x51edd1 0x51ef5b 0x4f750a 0x4997a2 0x55cd91 0x5d8941 0x4997a2 0x55cd91 0x5d8941 0x4997a2 0x55cd91 0x5d8941 0x4997a2 0x55cd91 0x5d8941 0x4997a2 0x55cd91 0x5d8941 0x4997a2 0x5d8868 0x4997a2 0x55cd91 0x5d8941 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91\n",
+            "tcmalloc: large alloc 2046255104 bytes == 0xd01a0000 @  0x7fdda8916615 0x5d6f4c 0x51edd1 0x51ef5b 0x4f750a 0x4997a2 0x55cd91 0x5d8941 0x4997a2 0x55cd91 0x5d8941 0x4997a2 0x55cd91 0x5d8941 0x4997a2 0x55cd91 0x5d8941 0x4997a2 0x55cd91 0x5d8941 0x4997a2 0x5d8868 0x4997a2 0x55cd91 0x5d8941 0x49abe4 0x4fd8b5 0x49abe4 0x55cd91 0x5d8941 0x4fe318\n",
+            "\u001b[K     |████████████████████████████████| 1637.0 MB 12 kB/s \n",
+            "\u001b[?25hCollecting torchvision==0.12.0+cu113\n",
+            "  Downloading https://download.pytorch.org/whl/cu113/torchvision-0.12.0%2Bcu113-cp38-cp38-linux_x86_64.whl (22.3 MB)\n",
+            "\u001b[K     |████████████████████████████████| 22.3 MB 64.2 MB/s \n",
+            "\u001b[?25hCollecting torchaudio==0.11.0+cu113\n",
+            "  Downloading https://download.pytorch.org/whl/cu113/torchaudio-0.11.0%2Bcu113-cp38-cp38-linux_x86_64.whl (2.9 MB)\n",
+            "\u001b[K     |████████████████████████████████| 2.9 MB 93.3 MB/s \n",
+            "\u001b[?25hRequirement already satisfied: typing-extensions in /usr/local/lib/python3.8/dist-packages (from torch==1.11.0+cu113) (4.4.0)\n",
+            "Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from torchvision==0.12.0+cu113) (1.21.6)\n",
+            "Requirement already satisfied: requests in /usr/local/lib/python3.8/dist-packages (from torchvision==0.12.0+cu113) (2.23.0)\n",
+            "Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.8/dist-packages (from torchvision==0.12.0+cu113) (7.1.2)\n",
+            "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests->torchvision==0.12.0+cu113) (1.24.3)\n",
+            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests->torchvision==0.12.0+cu113) (2022.9.24)\n",
+            "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->torchvision==0.12.0+cu113) (2.10)\n",
+            "Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->torchvision==0.12.0+cu113) (3.0.4)\n",
+            "Installing collected packages: torch, torchvision, torchaudio\n",
+            "  Attempting uninstall: torch\n",
+            "    Found existing installation: torch 1.13.0+cu116\n",
+            "    Uninstalling torch-1.13.0+cu116:\n",
+            "      Successfully uninstalled torch-1.13.0+cu116\n",
+            "  Attempting uninstall: torchvision\n",
+            "    Found existing installation: torchvision 0.14.0+cu116\n",
+            "    Uninstalling torchvision-0.14.0+cu116:\n",
+            "      Successfully uninstalled torchvision-0.14.0+cu116\n",
+            "  Attempting uninstall: torchaudio\n",
+            "    Found existing installation: torchaudio 0.13.0+cu116\n",
+            "    Uninstalling torchaudio-0.13.0+cu116:\n",
+            "      Successfully uninstalled torchaudio-0.13.0+cu116\n",
+            "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
+            "torchtext 0.14.0 requires torch==1.13.0, but you have torch 1.11.0+cu113 which is incompatible.\u001b[0m\n",
+            "Successfully installed torch-1.11.0+cu113 torchaudio-0.11.0+cu113 torchvision-0.12.0+cu113\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "8aNVOVxab2Ds",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "156e0c9d-e8d2-473a-c066-f31a693acbdc"
+      },
+      "source": [
+        "# either install the release\n",
+        "#!pip install deepspeed\n",
+        "# or the master branch\n",
+        "!pip install git+https://github.com/microsoft/deepspeed\n",
+        "\n",
+        "# remove any previously cached deepspeed objects as they can be incompatible with this new build\n",
+        "#!rm -r /root/.cache/torch_extensions/"
+      ],
+      "execution_count": 4,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
+            "Collecting git+https://github.com/microsoft/deepspeed\n",
+            "  Cloning https://github.com/microsoft/deepspeed to /tmp/pip-req-build-i78nfd45\n",
+            "  Running command git clone -q https://github.com/microsoft/deepspeed /tmp/pip-req-build-i78nfd45\n",
+            "Collecting hjson\n",
+            "  Downloading hjson-3.1.0-py3-none-any.whl (54 kB)\n",
+            "\u001b[K     |████████████████████████████████| 54 kB 3.1 MB/s \n",
+            "\u001b[?25hCollecting ninja\n",
+            "  Downloading ninja-1.11.1-py2.py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (145 kB)\n",
+            "\u001b[K     |████████████████████████████████| 145 kB 67.7 MB/s \n",
+            "\u001b[?25hRequirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from deepspeed==0.8.0+384f17b0) (1.21.6)\n",
+            "Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from deepspeed==0.8.0+384f17b0) (21.3)\n",
+            "Requirement already satisfied: psutil in /usr/local/lib/python3.8/dist-packages (from deepspeed==0.8.0+384f17b0) (5.4.8)\n",
+            "Collecting py-cpuinfo\n",
+            "  Downloading py_cpuinfo-9.0.0-py3-none-any.whl (22 kB)\n",
+            "Requirement already satisfied: pydantic in /usr/local/lib/python3.8/dist-packages (from deepspeed==0.8.0+384f17b0) (1.10.2)\n",
+            "Requirement already satisfied: torch in /usr/local/lib/python3.8/dist-packages (from deepspeed==0.8.0+384f17b0) (1.11.0+cu113)\n",
+            "Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from deepspeed==0.8.0+384f17b0) (4.64.1)\n",
+            "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->deepspeed==0.8.0+384f17b0) (3.0.9)\n",
+            "Requirement already satisfied: typing-extensions>=4.1.0 in /usr/local/lib/python3.8/dist-packages (from pydantic->deepspeed==0.8.0+384f17b0) (4.4.0)\n",
+            "Building wheels for collected packages: deepspeed\n",
+            "  Building wheel for deepspeed (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+            "  Created wheel for deepspeed: filename=deepspeed-0.8.0+384f17b0-py3-none-any.whl size=784202 sha256=48e3a1ae11775c34a33488c640cbd6e6ccf82422a64e77434b8848dc4dfdefd3\n",
+            "  Stored in directory: /tmp/pip-ephem-wheel-cache-qm4hults/wheels/1a/40/69/bd5d5b2f963f453c2734682e231eba4592f6ebf6e93720b8da\n",
+            "Successfully built deepspeed\n",
+            "Installing collected packages: py-cpuinfo, ninja, hjson, deepspeed\n",
+            "Successfully installed deepspeed-0.8.0+384f17b0 hjson-3.1.0 ninja-1.11.1 py-cpuinfo-9.0.0\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "BZQAIH70Yykn",
+        "outputId": "a88f53e3-2b19-4e27-a892-6853a49fcf32"
+      },
+      "source": [
+        "%%bash\n",
+        "git clone https://github.com/huggingface/transformers\n",
+        "cd transformers\n",
+        "# examples change a lot so let's pick a sha that we know this notebook will work with\n",
+        "# comment out/remove the next line if you want the master\n",
+        "git checkout 0aac9ba2dabcf9\n",
+        "pip install -e .\n",
+        "pip install -r examples/pytorch/translation/requirements.txt\n",
+        "\n",
+        "# if needed, free up some space used by cached pip packages\n",
+        "# rm -rf /root/.cache/pip\n"
+      ],
+      "execution_count": 5,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
+            "Obtaining file:///content/transformers\n",
+            "  Installing build dependencies: started\n",
+            "  Installing build dependencies: finished with status 'done'\n",
+            "  Getting requirements to build wheel: started\n",
+            "  Getting requirements to build wheel: finished with status 'done'\n",
+            "    Preparing wheel metadata: started\n",
+            "    Preparing wheel metadata: finished with status 'done'\n",
+            "Requirement already satisfied: requests in /usr/local/lib/python3.8/dist-packages (from transformers==4.18.0.dev0) (2.23.0)\n",
+            "Collecting huggingface-hub<1.0,>=0.1.0\n",
+            "  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)\n",
+            "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.8/dist-packages (from transformers==4.18.0.dev0) (2022.6.2)\n",
+            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.8/dist-packages (from transformers==4.18.0.dev0) (6.0)\n",
+            "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.8/dist-packages (from transformers==4.18.0.dev0) (1.21.6)\n",
+            "Collecting tokenizers!=0.11.3,>=0.11.1\n",
+            "  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)\n",
+            "Requirement already satisfied: filelock in /usr/local/lib/python3.8/dist-packages (from transformers==4.18.0.dev0) (3.8.0)\n",
+            "Collecting sacremoses\n",
+            "  Downloading sacremoses-0.0.53.tar.gz (880 kB)\n",
+            "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.8/dist-packages (from transformers==4.18.0.dev0) (21.3)\n",
+            "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.8/dist-packages (from transformers==4.18.0.dev0) (4.64.1)\n",
+            "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.8/dist-packages (from huggingface-hub<1.0,>=0.1.0->transformers==4.18.0.dev0) (4.4.0)\n",
+            "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging>=20.0->transformers==4.18.0.dev0) (3.0.9)\n",
+            "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests->transformers==4.18.0.dev0) (2.10)\n",
+            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests->transformers==4.18.0.dev0) (2022.9.24)\n",
+            "Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests->transformers==4.18.0.dev0) (3.0.4)\n",
+            "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests->transformers==4.18.0.dev0) (1.24.3)\n",
+            "Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from sacremoses->transformers==4.18.0.dev0) (1.15.0)\n",
+            "Requirement already satisfied: click in /usr/local/lib/python3.8/dist-packages (from sacremoses->transformers==4.18.0.dev0) (7.1.2)\n",
+            "Requirement already satisfied: joblib in /usr/local/lib/python3.8/dist-packages (from sacremoses->transformers==4.18.0.dev0) (1.2.0)\n",
+            "Building wheels for collected packages: sacremoses\n",
+            "  Building wheel for sacremoses (setup.py): started\n",
+            "  Building wheel for sacremoses (setup.py): finished with status 'done'\n",
+            "  Created wheel for sacremoses: filename=sacremoses-0.0.53-py3-none-any.whl size=895260 sha256=a89855b8fe6e7331aa80ab714e1e3433f5504f6264d4bc9b90b9d8d8400fca26\n",
+            "  Stored in directory: /root/.cache/pip/wheels/82/ab/9b/c15899bf659ba74f623ac776e861cf2eb8608c1825ddec66a4\n",
+            "Successfully built sacremoses\n",
+            "Installing collected packages: tokenizers, sacremoses, huggingface-hub, transformers\n",
+            "  Running setup.py develop for transformers\n",
+            "Successfully installed huggingface-hub-0.11.1 sacremoses-0.0.53 tokenizers-0.13.2 transformers\n",
+            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
+            "Collecting accelerate\n",
+            "  Downloading accelerate-0.15.0-py3-none-any.whl (191 kB)\n",
+            "Collecting datasets>=1.8.0\n",
+            "  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)\n",
+            "Collecting sentencepiece!=0.1.92\n",
+            "  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)\n",
+            "Requirement already satisfied: protobuf in /usr/local/lib/python3.8/dist-packages (from -r examples/pytorch/translation/requirements.txt (line 4)) (3.19.6)\n",
+            "Collecting sacrebleu>=1.4.12\n",
+            "  Downloading sacrebleu-2.3.1-py3-none-any.whl (118 kB)\n",
+            "Collecting py7zr\n",
+            "  Downloading py7zr-0.20.2-py3-none-any.whl (65 kB)\n",
+            "Requirement already satisfied: torch>=1.3 in /usr/local/lib/python3.8/dist-packages (from -r examples/pytorch/translation/requirements.txt (line 7)) (1.11.0+cu113)\n",
+            "Collecting multiprocess\n",
+            "  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)\n",
+            "Requirement already satisfied: pandas in /usr/local/lib/python3.8/dist-packages (from datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (1.3.5)\n",
+            "Requirement already satisfied: dill<0.3.7 in /usr/local/lib/python3.8/dist-packages (from datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (0.3.6)\n",
+            "Collecting responses<0.19\n",
+            "  Downloading responses-0.18.0-py3-none-any.whl (38 kB)\n",
+            "Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (21.3)\n",
+            "Requirement already satisfied: pyarrow>=6.0.0 in /usr/local/lib/python3.8/dist-packages (from datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (9.0.0)\n",
+            "Requirement already satisfied: aiohttp in /usr/local/lib/python3.8/dist-packages (from datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (3.8.3)\n",
+            "Requirement already satisfied: huggingface-hub<1.0.0,>=0.2.0 in /usr/local/lib/python3.8/dist-packages (from datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (0.11.1)\n",
+            "Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.8/dist-packages (from datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (4.64.1)\n",
+            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.8/dist-packages (from datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (6.0)\n",
+            "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.8/dist-packages (from datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (2.23.0)\n",
+            "Requirement already satisfied: fsspec[http]>=2021.11.1 in /usr/local/lib/python3.8/dist-packages (from datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (2022.11.0)\n",
+            "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.8/dist-packages (from datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (1.21.6)\n",
+            "Collecting xxhash\n",
+            "  Downloading xxhash-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)\n",
+            "Requirement already satisfied: regex in /usr/local/lib/python3.8/dist-packages (from sacrebleu>=1.4.12->-r examples/pytorch/translation/requirements.txt (line 5)) (2022.6.2)\n",
+            "Requirement already satisfied: tabulate>=0.8.9 in /usr/local/lib/python3.8/dist-packages (from sacrebleu>=1.4.12->-r examples/pytorch/translation/requirements.txt (line 5)) (0.8.10)\n",
+            "Requirement already satisfied: lxml in /usr/local/lib/python3.8/dist-packages (from sacrebleu>=1.4.12->-r examples/pytorch/translation/requirements.txt (line 5)) (4.9.1)\n",
+            "Collecting colorama\n",
+            "  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)\n",
+            "Collecting portalocker\n",
+            "  Downloading portalocker-2.6.0-py2.py3-none-any.whl (15 kB)\n",
+            "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.8/dist-packages (from torch>=1.3->-r examples/pytorch/translation/requirements.txt (line 7)) (4.4.0)\n",
+            "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.8/dist-packages (from aiohttp->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (22.1.0)\n",
+            "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.8/dist-packages (from aiohttp->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (1.8.2)\n",
+            "Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.8/dist-packages (from aiohttp->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (4.0.2)\n",
+            "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.8/dist-packages (from aiohttp->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (6.0.3)\n",
+            "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.8/dist-packages (from aiohttp->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (1.3.1)\n",
+            "Requirement already satisfied: charset-normalizer<3.0,>=2.0 in /usr/local/lib/python3.8/dist-packages (from aiohttp->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (2.1.1)\n",
+            "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from aiohttp->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (1.3.3)\n",
+            "Requirement already satisfied: filelock in /usr/local/lib/python3.8/dist-packages (from huggingface-hub<1.0.0,>=0.2.0->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (3.8.0)\n",
+            "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (3.0.9)\n",
+            "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.8/dist-packages (from requests>=2.19.0->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (1.24.3)\n",
+            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.8/dist-packages (from requests>=2.19.0->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (2022.9.24)\n",
+            "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.8/dist-packages (from requests>=2.19.0->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (2.10)\n",
+            "Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.8/dist-packages (from requests>=2.19.0->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (3.0.4)\n",
+            "Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1\n",
+            "  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)\n",
+            "Requirement already satisfied: psutil in /usr/local/lib/python3.8/dist-packages (from accelerate->-r examples/pytorch/translation/requirements.txt (line 1)) (5.4.8)\n",
+            "Collecting brotli>=1.0.9\n",
+            "  Downloading Brotli-1.0.9-cp38-cp38-manylinux1_x86_64.whl (357 kB)\n",
+            "Collecting pybcj>=0.6.0\n",
+            "  Downloading pybcj-1.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (50 kB)\n",
+            "Collecting pyppmd<1.1.0,>=0.18.1\n",
+            "  Downloading pyppmd-1.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (139 kB)\n",
+            "Collecting texttable\n",
+            "  Downloading texttable-1.6.7-py2.py3-none-any.whl (10 kB)\n",
+            "Collecting multivolumefile>=0.2.3\n",
+            "  Downloading multivolumefile-0.2.3-py3-none-any.whl (17 kB)\n",
+            "Collecting pyzstd>=0.14.4\n",
+            "  Downloading pyzstd-0.15.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (378 kB)\n",
+            "Collecting inflate64>=0.3.1\n",
+            "  Downloading inflate64-0.3.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (94 kB)\n",
+            "Collecting pycryptodomex>=3.6.6\n",
+            "  Downloading pycryptodomex-3.16.0-cp35-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (2.3 MB)\n",
+            "Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (2.8.2)\n",
+            "Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (2022.6)\n",
+            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.8/dist-packages (from python-dateutil>=2.7.3->pandas->datasets>=1.8.0->-r examples/pytorch/translation/requirements.txt (line 2)) (1.15.0)\n",
+            "Installing collected packages: urllib3, xxhash, texttable, responses, pyzstd, pyppmd, pycryptodomex, pybcj, portalocker, multivolumefile, multiprocess, inflate64, colorama, brotli, sentencepiece, sacrebleu, py7zr, datasets, accelerate\n",
+            "  Attempting uninstall: urllib3\n",
+            "    Found existing installation: urllib3 1.24.3\n",
+            "    Uninstalling urllib3-1.24.3:\n",
+            "      Successfully uninstalled urllib3-1.24.3\n",
+            "Successfully installed accelerate-0.15.0 brotli-1.0.9 colorama-0.4.6 datasets-2.7.1 inflate64-0.3.1 multiprocess-0.70.14 multivolumefile-0.2.3 portalocker-2.6.0 py7zr-0.20.2 pybcj-1.0.1 pycryptodomex-3.16.0 pyppmd-1.0.0 pyzstd-0.15.3 responses-0.18.0 sacrebleu-2.3.1 sentencepiece-0.1.97 texttable-1.6.7 urllib3-1.25.11 xxhash-3.1.0\n"
+          ]
+        },
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "Cloning into 'transformers'...\n",
+            "Note: checking out '0aac9ba2dabcf9'.\n",
+            "\n",
+            "You are in 'detached HEAD' state. You can look around, make experimental\n",
+            "changes and commit them, and you can discard any commits you make in this\n",
+            "state without impacting any branches by performing another checkout.\n",
+            "\n",
+            "If you want to create a new branch to retain commits you create, you may\n",
+            "do so (now or later) by using -b with the checkout command again. Example:\n",
+            "\n",
+            "  git checkout -b <new-branch-name>\n",
+            "\n",
+            "HEAD is now at 0aac9ba2d Add Flaubert OnnxConfig to Transformers (#16279)\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "FYG2EDAQdFxt"
+      },
+      "source": [
+        "%%bash\n",
+        "\n",
+        "cd transformers\n",
+        "\n",
+        "cat <<'EOT' > ds_config.json\n",
+        "{\n",
+        "    \"fp16\": {\n",
+        "        \"enabled\": \"auto\",\n",
+        "        \"loss_scale\": 0,\n",
+        "        \"loss_scale_window\": 1000,\n",
+        "        \"initial_scale_power\": 16,\n",
+        "        \"hysteresis\": 2,\n",
+        "        \"min_loss_scale\": 1\n",
+        "    },\n",
+        "\n",
+        "    \"optimizer\": {\n",
+        "        \"type\": \"AdamW\",\n",
+        "        \"params\": {\n",
+        "            \"lr\": \"auto\",\n",
+        "            \"betas\": \"auto\",\n",
+        "            \"eps\": \"auto\",\n",
+        "            \"weight_decay\": \"auto\"\n",
+        "        }\n",
+        "    },\n",
+        "\n",
+        "    \"scheduler\": {\n",
+        "        \"type\": \"WarmupLR\",\n",
+        "        \"params\": {\n",
+        "            \"warmup_min_lr\": \"auto\",\n",
+        "            \"warmup_max_lr\": \"auto\",\n",
+        "            \"warmup_num_steps\": \"auto\"\n",
+        "        }\n",
+        "    },\n",
+        "\n",
+        "    \"zero_optimization\": {\n",
+        "        \"stage\": 2,\n",
+        "        \"offload_optimizer\": {\n",
+        "            \"device\": \"cpu\",\n",
+        "            \"pin_memory\": true\n",
+        "        },\n",
+        "        \"allgather_partitions\": true,\n",
+        "        \"allgather_bucket_size\": 2e8,\n",
+        "        \"overlap_comm\": true,\n",
+        "        \"reduce_scatter\": true,\n",
+        "        \"reduce_bucket_size\": 2e8,\n",
+        "        \"contiguous_gradients\": true\n",
+        "    },\n",
+        "\n",
+        "    \"gradient_accumulation_steps\": \"auto\",\n",
+        "    \"gradient_clipping\": \"auto\",\n",
+        "    \"steps_per_print\": 2000,\n",
+        "    \"train_batch_size\": \"auto\",\n",
+        "    \"train_micro_batch_size_per_gpu\": \"auto\",\n",
+        "    \"wall_clock_breakdown\": false\n",
+        "}\n",
+        "EOT\n"
+      ],
+      "execution_count": 6,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "uJKdum6zdxLE"
+      },
+      "source": [
+        "# !ls -l transformers\n",
+        "# !cat transformers/ds_config.json"
+      ],
+      "execution_count": 9,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "_XJEYx1sVuAJ"
+      },
+      "source": [
+        "## Running Training + Evaluation CLI style"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "ghNfx0nNZSfq",
+        "outputId": "2e84e5c5-86f0-4f68-af5c-b3757aabf912"
+      },
+      "source": [
+        "!cd transformers; export BS=16; rm -rf output_dir; \\\n",
+        "PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0 deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \\\n",
+        "--model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --evaluation_strategy=steps \\\n",
+        "--do_train --do_eval --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 \\\n",
+        "--max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir  \\\n",
+        "--per_device_train_batch_size $BS --per_device_eval_batch_size $BS --predict_with_generate --sortish_sampler \\\n",
+        "--val_max_target_length 128 --warmup_steps 500 --max_train_samples 2000 --max_eval_samples 500 \\\n",
+        "--dataset_name wmt16 --dataset_config ro-en --source_lang en --target_lang ro \\\n",
+        "--source_prefix \"translate English to Romanian: \" --deepspeed ds_config.json --fp16"
+      ],
+      "execution_count": 10,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "[2022-12-14 13:50:37,125] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.\n",
+            "Detected CUDA_VISIBLE_DEVICES=0 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.\n",
+            "[2022-12-14 13:50:37,965] [INFO] [runner.py:508:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --evaluation_strategy=steps --do_train --do_eval --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --predict_with_generate --sortish_sampler --val_max_target_length 128 --warmup_steps 500 --max_train_samples 2000 --max_eval_samples 500 --dataset_name wmt16 --dataset_config ro-en --source_lang en --target_lang ro --source_prefix translate English to Romanian:  --deepspeed ds_config.json --fp16\n",
+            "[2022-12-14 13:50:39,901] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.8.4-1+cuda11.2\n",
+            "[2022-12-14 13:50:39,901] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.8.4-1\n",
+            "[2022-12-14 13:50:39,901] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.8.4-1\n",
+            "[2022-12-14 13:50:39,901] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.8.4-1+cuda11.2\n",
+            "[2022-12-14 13:50:39,901] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev\n",
+            "[2022-12-14 13:50:39,901] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2\n",
+            "[2022-12-14 13:50:39,901] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.8.4-1\n",
+            "[2022-12-14 13:50:39,902] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}\n",
+            "[2022-12-14 13:50:39,902] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0\n",
+            "[2022-12-14 13:50:39,902] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})\n",
+            "[2022-12-14 13:50:39,902] [INFO] [launch.py:162:main] dist_world_size=1\n",
+            "[2022-12-14 13:50:39,902] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0\n",
+            "[2022-12-14 13:50:43,516] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl\n",
+            "12/14/2022 13:50:43 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True\n",
+            "12/14/2022 13:50:43 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(\n",
+            "_n_gpu=1,\n",
+            "adafactor=False,\n",
+            "adam_beta1=0.9,\n",
+            "adam_beta2=0.999,\n",
+            "adam_epsilon=1e-06,\n",
+            "bf16=False,\n",
+            "bf16_full_eval=False,\n",
+            "data_seed=None,\n",
+            "dataloader_drop_last=False,\n",
+            "dataloader_num_workers=0,\n",
+            "dataloader_pin_memory=True,\n",
+            "ddp_bucket_cap_mb=None,\n",
+            "ddp_find_unused_parameters=None,\n",
+            "debug=[],\n",
+            "deepspeed=ds_config.json,\n",
+            "disable_tqdm=False,\n",
+            "do_eval=True,\n",
+            "do_predict=False,\n",
+            "do_train=True,\n",
+            "eval_accumulation_steps=None,\n",
+            "eval_steps=1000,\n",
+            "evaluation_strategy=IntervalStrategy.STEPS,\n",
+            "fp16=True,\n",
+            "fp16_backend=auto,\n",
+            "fp16_full_eval=False,\n",
+            "fp16_opt_level=O1,\n",
+            "generation_max_length=None,\n",
+            "generation_num_beams=None,\n",
+            "gradient_accumulation_steps=1,\n",
+            "gradient_checkpointing=False,\n",
+            "greater_is_better=None,\n",
+            "group_by_length=False,\n",
+            "half_precision_backend=auto,\n",
+            "hub_model_id=None,\n",
+            "hub_strategy=HubStrategy.EVERY_SAVE,\n",
+            "hub_token=<HUB_TOKEN>,\n",
+            "ignore_data_skip=False,\n",
+            "label_names=None,\n",
+            "label_smoothing_factor=0.1,\n",
+            "learning_rate=3e-05,\n",
+            "length_column_name=length,\n",
+            "load_best_model_at_end=False,\n",
+            "local_rank=0,\n",
+            "log_level=-1,\n",
+            "log_level_replica=-1,\n",
+            "log_on_each_node=True,\n",
+            "logging_dir=output_dir/runs/Dec14_13-50-42_30e9222e4cae,\n",
+            "logging_first_step=True,\n",
+            "logging_nan_inf_filter=True,\n",
+            "logging_steps=1000,\n",
+            "logging_strategy=IntervalStrategy.STEPS,\n",
+            "lr_scheduler_type=SchedulerType.LINEAR,\n",
+            "max_grad_norm=1.0,\n",
+            "max_steps=-1,\n",
+            "metric_for_best_model=None,\n",
+            "mp_parameters=,\n",
+            "no_cuda=False,\n",
+            "num_train_epochs=1.0,\n",
+            "optim=OptimizerNames.ADAMW_HF,\n",
+            "output_dir=output_dir,\n",
+            "overwrite_output_dir=True,\n",
+            "past_index=-1,\n",
+            "per_device_eval_batch_size=16,\n",
+            "per_device_train_batch_size=16,\n",
+            "predict_with_generate=True,\n",
+            "prediction_loss_only=False,\n",
+            "push_to_hub=False,\n",
+            "push_to_hub_model_id=None,\n",
+            "push_to_hub_organization=None,\n",
+            "push_to_hub_token=<PUSH_TO_HUB_TOKEN>,\n",
+            "remove_unused_columns=True,\n",
+            "report_to=['tensorboard'],\n",
+            "resume_from_checkpoint=None,\n",
+            "run_name=output_dir,\n",
+            "save_on_each_node=False,\n",
+            "save_steps=500,\n",
+            "save_strategy=IntervalStrategy.STEPS,\n",
+            "save_total_limit=None,\n",
+            "seed=42,\n",
+            "sharded_ddp=[],\n",
+            "skip_memory_metrics=True,\n",
+            "sortish_sampler=True,\n",
+            "tf32=None,\n",
+            "tpu_metrics_debug=False,\n",
+            "tpu_num_cores=None,\n",
+            "use_legacy_prediction_loop=False,\n",
+            "warmup_ratio=0.0,\n",
+            "warmup_steps=500,\n",
+            "weight_decay=0.0,\n",
+            "xpu_backend=None,\n",
+            ")\n",
+            "12/14/2022 13:50:46 - INFO - datasets.utils.file_utils - https://huggingface.co/datasets/wmt16/resolve/main/wmt16.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/tmpiyr2y22g\n",
+            "Downloading builder script: 100% 2.81k/2.81k [00:00<00:00, 2.56MB/s]\n",
+            "12/14/2022 13:50:47 - INFO - datasets.utils.file_utils - storing https://huggingface.co/datasets/wmt16/resolve/main/wmt16.py in cache at /root/.cache/huggingface/datasets/downloads/19f1a5dd850571443650011364e61353891328b635bd714cd786431d7d7f6774.b46b8fa175644cc034666818bf00117ae0208bef8a6c6a9c56d86cd22e9dfdb1.py\n",
+            "12/14/2022 13:50:47 - INFO - datasets.utils.file_utils - creating metadata file for /root/.cache/huggingface/datasets/downloads/19f1a5dd850571443650011364e61353891328b635bd714cd786431d7d7f6774.b46b8fa175644cc034666818bf00117ae0208bef8a6c6a9c56d86cd22e9dfdb1.py\n",
+            "12/14/2022 13:50:48 - INFO - datasets.utils.file_utils - https://huggingface.co/datasets/wmt16/resolve/main/dataset_infos.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/tmpxf13ryz7\n",
+            "Downloading metadata: 100% 18.6k/18.6k [00:00<00:00, 83.7kB/s]\n",
+            "12/14/2022 13:50:49 - INFO - datasets.utils.file_utils - storing https://huggingface.co/datasets/wmt16/resolve/main/dataset_infos.json in cache at /root/.cache/huggingface/datasets/downloads/52bd443219d9fe96771868215aa4b606128c42183ca39e1d3eb5f577ff586f6b.fd5f57db9d7e08ef8fb6acca6cb641609f1cc9c0cac2d5d703644362789cb676\n",
+            "12/14/2022 13:50:49 - INFO - datasets.utils.file_utils - creating metadata file for /root/.cache/huggingface/datasets/downloads/52bd443219d9fe96771868215aa4b606128c42183ca39e1d3eb5f577ff586f6b.fd5f57db9d7e08ef8fb6acca6cb641609f1cc9c0cac2d5d703644362789cb676\n",
+            "12/14/2022 13:50:50 - INFO - datasets.utils.file_utils - https://huggingface.co/datasets/wmt16/resolve/main/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/tmp0dj9ftw0\n",
+            "Downloading readme: 100% 9.90k/9.90k [00:00<00:00, 6.87MB/s]\n",
+            "12/14/2022 13:50:51 - INFO - datasets.utils.file_utils - storing https://huggingface.co/datasets/wmt16/resolve/main/README.md in cache at /root/.cache/huggingface/datasets/downloads/a90ff794674a61205bbd8736c42a16add4b834a3020767e5dd42cc930caf0ee7.30e46efd2178e0948c8c5dd2746108df7ffb79c57ca0815a7eef80561e095f27\n",
+            "12/14/2022 13:50:51 - INFO - datasets.utils.file_utils - creating metadata file for /root/.cache/huggingface/datasets/downloads/a90ff794674a61205bbd8736c42a16add4b834a3020767e5dd42cc930caf0ee7.30e46efd2178e0948c8c5dd2746108df7ffb79c57ca0815a7eef80561e095f27\n",
+            "12/14/2022 13:50:52 - INFO - datasets.utils.file_utils - https://huggingface.co/datasets/wmt16/resolve/main/wmt_utils.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/tmpjkj45es6\n",
+            "Downloading extra modules: 100% 41.4k/41.4k [00:00<00:00, 184kB/s] \n",
+            "12/14/2022 13:50:53 - INFO - datasets.utils.file_utils - storing https://huggingface.co/datasets/wmt16/resolve/main/wmt_utils.py in cache at /root/.cache/huggingface/datasets/downloads/aa40e1459f58a91114fb9f909a397d963413c8dbbe43e787c681a4df539a12b1.07b8014c6b947f2ccdd1e0f90d18964e3a276ef3ca0b9b0d028e79d7e6cff3cd.py\n",
+            "12/14/2022 13:50:53 - INFO - datasets.utils.file_utils - creating metadata file for /root/.cache/huggingface/datasets/downloads/aa40e1459f58a91114fb9f909a397d963413c8dbbe43e787c681a4df539a12b1.07b8014c6b947f2ccdd1e0f90d18964e3a276ef3ca0b9b0d028e79d7e6cff3cd.py\n",
+            "12/14/2022 13:50:53 - INFO - datasets.info - Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/wmt16/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227\n",
+            "12/14/2022 13:50:53 - INFO - datasets.builder - Generating dataset wmt16 (/root/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227)\n",
+            "Downloading and preparing dataset wmt16/ro-en to /root/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227...\n",
+            "12/14/2022 13:50:54 - INFO - datasets.builder - Dataset not on Hf google storage. Downloading and preparing it from source\n",
+            "Downloading data files:   0% 0/4 [00:00<?, ?it/s]12/14/2022 13:50:55 - INFO - datasets.utils.file_utils - https://huggingface.co/datasets/wmt/wmt16/resolve/main-zip/translation-task/training-parallel-ep-v8.zip not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/tmp88opyxko\n",
+            "\n",
+            "Downloading data:   0% 0.00/225M [00:00<?, ?B/s]\u001b[A\n",
+            "Downloading data:   2% 4.13M/225M [00:00<00:05, 41.0MB/s]\u001b[A\n",
+            "Downloading data:   4% 9.04M/225M [00:00<00:04, 44.8MB/s]\u001b[A\n",
+            "Downloading data:   6% 14.6M/225M [00:00<00:04, 49.2MB/s]\u001b[A\n",
+            "Downloading data:   9% 20.1M/225M [00:00<00:03, 51.5MB/s]\u001b[A\n",
+            "Downloading data:  11% 25.3M/225M [00:00<00:04, 46.2MB/s]\u001b[A\n",
+            "Downloading data:  14% 30.9M/225M [00:00<00:03, 49.2MB/s]\u001b[A\n",
+            "Downloading data:  16% 35.8M/225M [00:00<00:04, 47.3MB/s]\u001b[A\n",
+            "Downloading data:  18% 40.6M/225M [00:00<00:04, 45.9MB/s]\u001b[A\n",
+            "Downloading data:  20% 45.3M/225M [00:00<00:03, 46.0MB/s]\u001b[A\n",
+            "Downloading data:  22% 49.9M/225M [00:01<00:04, 42.2MB/s]\u001b[A\n",
+            "Downloading data:  24% 54.5M/225M [00:01<00:03, 43.1MB/s]\u001b[A\n",
+            "Downloading data:  26% 59.3M/225M [00:01<00:03, 44.6MB/s]\u001b[A\n",
+            "Downloading data:  28% 63.8M/225M [00:01<00:04, 39.1MB/s]\u001b[A\n",
+            "Downloading data:  30% 67.9M/225M [00:01<00:04, 38.4MB/s]\u001b[A\n",
+            "Downloading data:  32% 71.8M/225M [00:01<00:04, 31.9MB/s]\u001b[A\n",
+            "Downloading data:  34% 76.6M/225M [00:01<00:04, 35.8MB/s]\u001b[A\n",
+            "Downloading data:  36% 81.9M/225M [00:01<00:03, 40.1MB/s]\u001b[A\n",
+            "Downloading data:  38% 86.2M/225M [00:02<00:03, 40.4MB/s]\u001b[A\n",
+            "Downloading data:  40% 90.4M/225M [00:02<00:03, 40.5MB/s]\u001b[A\n",
+            "Downloading data:  42% 94.8M/225M [00:02<00:03, 41.6MB/s]\u001b[A\n",
+            "Downloading data:  44% 99.1M/225M [00:02<00:03, 40.7MB/s]\u001b[A\n",
+            "Downloading data:  46% 103M/225M [00:02<00:03, 36.2MB/s] \u001b[A\n",
+            "Downloading data:  48% 107M/225M [00:02<00:03, 37.5MB/s]\u001b[A\n",
+            "Downloading data:  50% 112M/225M [00:02<00:02, 38.7MB/s]\u001b[A\n",
+            "Downloading data:  52% 116M/225M [00:02<00:02, 41.2MB/s]\u001b[A\n",
+            "Downloading data:  54% 121M/225M [00:02<00:02, 43.7MB/s]\u001b[A\n",
+            "Downloading data:  56% 126M/225M [00:03<00:02, 43.7MB/s]\u001b[A\n",
+            "Downloading data:  58% 130M/225M [00:03<00:02, 41.8MB/s]\u001b[A\n",
+            "Downloading data:  60% 134M/225M [00:03<00:02, 36.4MB/s]\u001b[A\n",
+            "Downloading data:  62% 140M/225M [00:03<00:02, 42.1MB/s]\u001b[A\n",
+            "Downloading data:  64% 145M/225M [00:03<00:01, 42.3MB/s]\u001b[A\n",
+            "Downloading data:  66% 150M/225M [00:03<00:01, 42.9MB/s]\u001b[A\n",
+            "Downloading data:  68% 154M/225M [00:03<00:01, 39.4MB/s]\u001b[A\n",
+            "Downloading data:  70% 158M/225M [00:03<00:01, 39.0MB/s]\u001b[A\n",
+            "Downloading data:  73% 164M/225M [00:03<00:01, 44.8MB/s]\u001b[A\n",
+            "Downloading data:  75% 169M/225M [00:04<00:01, 46.9MB/s]\u001b[A\n",
+            "Downloading data:  77% 174M/225M [00:04<00:01, 45.2MB/s]\u001b[A\n",
+            "Downloading data:  80% 180M/225M [00:04<00:00, 47.5MB/s]\u001b[A\n",
+            "Downloading data:  82% 184M/225M [00:04<00:00, 44.7MB/s]\u001b[A\n",
+            "Downloading data:  84% 189M/225M [00:04<00:00, 42.3MB/s]\u001b[A\n",
+            "Downloading data:  86% 194M/225M [00:04<00:00, 45.8MB/s]\u001b[A\n",
+            "Downloading data:  88% 199M/225M [00:04<00:00, 40.4MB/s]\u001b[A\n",
+            "Downloading data:  91% 204M/225M [00:04<00:00, 42.6MB/s]\u001b[A\n",
+            "Downloading data:  93% 210M/225M [00:04<00:00, 46.3MB/s]\u001b[A\n",
+            "Downloading data:  95% 215M/225M [00:05<00:00, 43.2MB/s]\u001b[A\n",
+            "Downloading data:  97% 219M/225M [00:05<00:00, 40.6MB/s]\u001b[A\n",
+            "Downloading data: 100% 225M/225M [00:05<00:00, 41.4MB/s]\n",
+            "12/14/2022 13:51:02 - INFO - datasets.utils.file_utils - storing https://huggingface.co/datasets/wmt/wmt16/resolve/main-zip/translation-task/training-parallel-ep-v8.zip in cache at /root/.cache/huggingface/datasets/downloads/643db6243683334ce4b1c245cfc4b0b4c0ea2d8e1ece9cbfb1db13ecf6e46a7b\n",
+            "12/14/2022 13:51:02 - INFO - datasets.utils.file_utils - creating metadata file for /root/.cache/huggingface/datasets/downloads/643db6243683334ce4b1c245cfc4b0b4c0ea2d8e1ece9cbfb1db13ecf6e46a7b\n",
+            "Downloading data files:  25% 1/4 [00:07<00:22,  7.46s/it]12/14/2022 13:51:05 - INFO - datasets.utils.file_utils - https://opus.nlpl.eu/download.php?f=SETIMES/v2/tmx/en-ro.tmx.gz not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/tmpie08w1rm\n",
+            "\n",
+            "Downloading data:   0% 0.00/23.5M [00:00<?, ?B/s]\u001b[A\n",
+            "Downloading data:   0% 16.4k/23.5M [00:00<05:38, 69.3kB/s]\u001b[A\n",
+            "Downloading data:   0% 65.5k/23.5M [00:00<02:35, 150kB/s] \u001b[A\n",
+            "Downloading data:   0% 115k/23.5M [00:00<02:12, 176kB/s] \u001b[A\n",
+            "Downloading data:   1% 262k/23.5M [00:00<01:06, 351kB/s]\u001b[A\n",
+            "Downloading data:   2% 557k/23.5M [00:01<00:34, 671kB/s]\u001b[A\n",
+            "Downloading data:   5% 1.15M/23.5M [00:01<00:17, 1.28MB/s]\u001b[A\n",
+            "Downloading data:  10% 2.34M/23.5M [00:01<00:08, 2.51MB/s]\u001b[A\n",
+            "Downloading data:  16% 3.85M/23.5M [00:01<00:04, 4.55MB/s]\u001b[A\n",
+            "Downloading data:  19% 4.49M/23.5M [00:01<00:04, 4.43MB/s]\u001b[A\n",
+            "Downloading data:  22% 5.06M/23.5M [00:02<00:04, 4.25MB/s]\u001b[A\n",
+            "Downloading data:  24% 5.57M/23.5M [00:02<00:04, 4.29MB/s]\u001b[A\n",
+            "Downloading data:  26% 6.07M/23.5M [00:02<00:04, 4.32MB/s]\u001b[A\n",
+            "Downloading data:  28% 6.54M/23.5M [00:02<00:03, 4.36MB/s]\u001b[A\n",
+            "Downloading data:  30% 7.01M/23.5M [00:02<00:03, 4.32MB/s]\u001b[A\n",
+            "Downloading data:  32% 7.47M/23.5M [00:02<00:04, 3.95MB/s]\u001b[A\n",
+            "Downloading data:  34% 7.88M/23.5M [00:02<00:03, 3.95MB/s]\u001b[A\n",
+            "Downloading data:  35% 8.30M/23.5M [00:02<00:03, 4.02MB/s]\u001b[A\n",
+            "Downloading data:  37% 8.71M/23.5M [00:02<00:03, 3.99MB/s]\u001b[A\n",
+            "Downloading data:  39% 9.14M/23.5M [00:03<00:03, 4.07MB/s]\u001b[A\n",
+            "Downloading data:  41% 9.58M/23.5M [00:03<00:03, 4.16MB/s]\u001b[A\n",
+            "Downloading data:  43% 10.0M/23.5M [00:03<00:03, 4.25MB/s]\u001b[A\n",
+            "Downloading data:  45% 10.5M/23.5M [00:03<00:02, 4.34MB/s]\u001b[A\n",
+            "Downloading data:  47% 11.0M/23.5M [00:03<00:02, 4.40MB/s]\u001b[A\n",
+            "Downloading data:  49% 11.4M/23.5M [00:03<00:02, 4.42MB/s]\u001b[A\n",
+            "Downloading data:  51% 11.9M/23.5M [00:03<00:02, 3.92MB/s]\u001b[A\n",
+            "Downloading data:  52% 12.3M/23.5M [00:03<00:02, 3.80MB/s]\u001b[A\n",
+            "Downloading data:  54% 12.7M/23.5M [00:03<00:02, 3.80MB/s]\u001b[A\n",
+            "Downloading data:  56% 13.1M/23.5M [00:04<00:02, 3.91MB/s]\u001b[A\n",
+            "Downloading data:  58% 13.5M/23.5M [00:04<00:02, 4.00MB/s]\u001b[A\n",
+            "Downloading data:  59% 13.9M/23.5M [00:04<00:02, 4.08MB/s]\u001b[A\n",
+            "Downloading data:  61% 14.4M/23.5M [00:04<00:02, 4.15MB/s]\u001b[A\n",
+            "Downloading data:  63% 14.8M/23.5M [00:04<00:02, 4.21MB/s]\u001b[A\n",
+            "Downloading data:  65% 15.3M/23.5M [00:04<00:01, 4.19MB/s]\u001b[A\n",
+            "Downloading data:  67% 15.7M/23.5M [00:04<00:01, 4.16MB/s]\u001b[A\n",
+            "Downloading data:  69% 16.1M/23.5M [00:04<00:01, 4.15MB/s]\u001b[A\n",
+            "Downloading data:  70% 16.5M/23.5M [00:04<00:01, 4.07MB/s]\u001b[A\n",
+            "Downloading data:  72% 16.9M/23.5M [00:04<00:01, 4.08MB/s]\u001b[A\n",
+            "Downloading data:  74% 17.4M/23.5M [00:05<00:01, 4.07MB/s]\u001b[A\n",
+            "Downloading data:  76% 17.8M/23.5M [00:05<00:01, 4.03MB/s]\u001b[A\n",
+            "Downloading data:  77% 18.2M/23.5M [00:05<00:01, 4.04MB/s]\u001b[A\n",
+            "Downloading data:  79% 18.6M/23.5M [00:05<00:01, 4.04MB/s]\u001b[A\n",
+            "Downloading data:  81% 19.0M/23.5M [00:05<00:01, 4.01MB/s]\u001b[A\n",
+            "Downloading data:  83% 19.4M/23.5M [00:05<00:00, 4.10MB/s]\u001b[A\n",
+            "Downloading data:  85% 19.9M/23.5M [00:05<00:00, 4.18MB/s]\u001b[A\n",
+            "Downloading data:  86% 20.3M/23.5M [00:05<00:00, 4.06MB/s]\u001b[A\n",
+            "Downloading data:  88% 20.7M/23.5M [00:05<00:00, 3.64MB/s]\u001b[A\n",
+            "Downloading data:  90% 21.1M/23.5M [00:06<00:00, 3.63MB/s]\u001b[A\n",
+            "Downloading data:  91% 21.4M/23.5M [00:06<00:00, 3.64MB/s]\u001b[A\n",
+            "Downloading data:  93% 21.8M/23.5M [00:06<00:00, 3.47MB/s]\u001b[A\n",
+            "Downloading data:  95% 22.2M/23.5M [00:06<00:00, 3.57MB/s]\u001b[A\n",
+            "Downloading data:  96% 22.6M/23.5M [00:06<00:00, 3.68MB/s]\u001b[A\n",
+            "Downloading data:  98% 23.0M/23.5M [00:06<00:00, 3.68MB/s]\u001b[A\n",
+            "Downloading data: 100% 23.5M/23.5M [00:06<00:00, 3.51MB/s]\n",
+            "12/14/2022 13:51:14 - INFO - datasets.utils.file_utils - storing https://opus.nlpl.eu/download.php?f=SETIMES/v2/tmx/en-ro.tmx.gz in cache at /root/.cache/huggingface/datasets/downloads/836c8a8ca41f6cde5f353d1e9fe29eba1bfa7ce4d5840b7f3176131b65a64b7b\n",
+            "12/14/2022 13:51:14 - INFO - datasets.utils.file_utils - creating metadata file for /root/.cache/huggingface/datasets/downloads/836c8a8ca41f6cde5f353d1e9fe29eba1bfa7ce4d5840b7f3176131b65a64b7b\n",
+            "Downloading data files:  50% 2/4 [00:19<00:20, 10.15s/it]12/14/2022 13:51:15 - INFO - datasets.utils.file_utils - https://huggingface.co/datasets/wmt/wmt19/resolve/main-zip/translation-task/dev.zip not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/tmp532r828c\n",
+            "\n",
+            "Downloading data:   0% 0.00/38.7M [00:00<?, ?B/s]\u001b[A\n",
+            "Downloading data:  17% 6.58M/38.7M [00:00<00:00, 65.7MB/s]\u001b[A\n",
+            "Downloading data:  34% 13.1M/38.7M [00:00<00:00, 64.8MB/s]\u001b[A\n",
+            "Downloading data:  54% 20.8M/38.7M [00:00<00:00, 70.3MB/s]\u001b[A\n",
+            "Downloading data: 100% 38.7M/38.7M [00:00<00:00, 79.4MB/s]\n",
+            "12/14/2022 13:51:16 - INFO - datasets.utils.file_utils - storing https://huggingface.co/datasets/wmt/wmt19/resolve/main-zip/translation-task/dev.zip in cache at /root/.cache/huggingface/datasets/downloads/11ba2d79cea2e93248b0441726361c9bb586ed8c2b366d3eca0e2a09b09560cf\n",
+            "12/14/2022 13:51:16 - INFO - datasets.utils.file_utils - creating metadata file for /root/.cache/huggingface/datasets/downloads/11ba2d79cea2e93248b0441726361c9bb586ed8c2b366d3eca0e2a09b09560cf\n",
+            "Downloading data files: 100% 4/4 [00:21<00:00,  5.49s/it]\n",
+            "12/14/2022 13:51:16 - INFO - datasets.download.download_manager - Downloading took 0.0 min\n",
+            "12/14/2022 13:51:17 - INFO - datasets.download.download_manager - Checksum Computation took 0.0 min\n",
+            "12/14/2022 13:51:17 - INFO - datasets.utils.py_utils - Spawning 4 processes for 4 objects in slices of [1, 1, 1, 1]\n",
+            "Extracting data files #0:   0% 0/1 [00:00<?, ?obj/s]\n",
+            "\n",
+            "Extracting data files #2:   0% 0/1 [00:00<?, ?obj/s]\u001b[A\u001b[A\n",
+            "\n",
+            "\n",
+            "Extracting data files #3:   0% 0/1 [00:00<?, ?obj/s]\u001b[A\u001b[A\u001b[A\n",
+            "Extracting data files #1:   0% 0/1 [00:00<?, ?obj/s]\u001b[A\n",
+            "Extracting data files #1: 100% 1/1 [00:00<00:00,  1.82obj/s]\n",
+            "\n",
+            "\n",
+            "\n",
+            "Extracting data files #3: 100% 1/1 [00:00<00:00,  1.17obj/s]\n",
+            "\n",
+            "\n",
+            "Extracting data files #2: 100% 1/1 [00:01<00:00,  1.75s/obj]\n",
+            "Extracting data files #0: 100% 1/1 [00:04<00:00,  4.90s/obj]\n",
+            "12/14/2022 13:51:22 - INFO - datasets.utils.py_utils - Finished 4 processes\n",
+            "12/14/2022 13:51:22 - INFO - datasets.utils.py_utils - Unpacked 4 objects\n",
+            "Extracting data files: 0it [00:00, ?it/s]\n",
+            "12/14/2022 13:51:22 - INFO - datasets.utils.info_utils - Unable to verify checksums.\n",
+            "12/14/2022 13:51:22 - INFO - datasets.builder - Generating train split\n",
+            "12/14/2022 13:51:40 - INFO - datasets.builder - Generating validation split\n",
+            "12/14/2022 13:51:40 - INFO - datasets.builder - Generating test split\n",
+            "12/14/2022 13:51:40 - INFO - datasets.utils.info_utils - All the splits matched successfully.\n",
+            "Dataset wmt16 downloaded and prepared to /root/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227. Subsequent calls will reuse this data.\n",
+            "100% 3/3 [00:00<00:00, 256.60it/s]\n",
+            "[INFO|file_utils.py:2241] 2022-12-14 13:51:41,536 >> https://huggingface.co/t5-small/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpjoyvsqzp\n",
+            "Downloading: 100% 1.18k/1.18k [00:00<00:00, 959kB/s]\n",
+            "[INFO|file_utils.py:2245] 2022-12-14 13:51:42,474 >> storing https://huggingface.co/t5-small/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.d67b370cd9d75f81ad4eb421ee7b8db09e0b6a6c693b8c2b423af5d7bcac6205\n",
+            "[INFO|file_utils.py:2253] 2022-12-14 13:51:42,474 >> creating metadata file for /root/.cache/huggingface/transformers/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.d67b370cd9d75f81ad4eb421ee7b8db09e0b6a6c693b8c2b423af5d7bcac6205\n",
+            "[INFO|configuration_utils.py:649] 2022-12-14 13:51:42,474 >> loading configuration file https://huggingface.co/t5-small/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.d67b370cd9d75f81ad4eb421ee7b8db09e0b6a6c693b8c2b423af5d7bcac6205\n",
+            "[INFO|configuration_utils.py:685] 2022-12-14 13:51:42,476 >> Model config T5Config {\n",
+            "  \"_name_or_path\": \"t5-small\",\n",
+            "  \"architectures\": [\n",
+            "    \"T5ForConditionalGeneration\"\n",
+            "  ],\n",
+            "  \"d_ff\": 2048,\n",
+            "  \"d_kv\": 64,\n",
+            "  \"d_model\": 512,\n",
+            "  \"decoder_start_token_id\": 0,\n",
+            "  \"dropout_rate\": 0.1,\n",
+            "  \"eos_token_id\": 1,\n",
+            "  \"feed_forward_proj\": \"relu\",\n",
+            "  \"initializer_factor\": 1.0,\n",
+            "  \"is_encoder_decoder\": true,\n",
+            "  \"layer_norm_epsilon\": 1e-06,\n",
+            "  \"model_type\": \"t5\",\n",
+            "  \"n_positions\": 512,\n",
+            "  \"num_decoder_layers\": 6,\n",
+            "  \"num_heads\": 8,\n",
+            "  \"num_layers\": 6,\n",
+            "  \"output_past\": true,\n",
+            "  \"pad_token_id\": 0,\n",
+            "  \"relative_attention_max_distance\": 128,\n",
+            "  \"relative_attention_num_buckets\": 32,\n",
+            "  \"task_specific_params\": {\n",
+            "    \"summarization\": {\n",
+            "      \"early_stopping\": true,\n",
+            "      \"length_penalty\": 2.0,\n",
+            "      \"max_length\": 200,\n",
+            "      \"min_length\": 30,\n",
+            "      \"no_repeat_ngram_size\": 3,\n",
+            "      \"num_beams\": 4,\n",
+            "      \"prefix\": \"summarize: \"\n",
+            "    },\n",
+            "    \"translation_en_to_de\": {\n",
+            "      \"early_stopping\": true,\n",
+            "      \"max_length\": 300,\n",
+            "      \"num_beams\": 4,\n",
+            "      \"prefix\": \"translate English to German: \"\n",
+            "    },\n",
+            "    \"translation_en_to_fr\": {\n",
+            "      \"early_stopping\": true,\n",
+            "      \"max_length\": 300,\n",
+            "      \"num_beams\": 4,\n",
+            "      \"prefix\": \"translate English to French: \"\n",
+            "    },\n",
+            "    \"translation_en_to_ro\": {\n",
+            "      \"early_stopping\": true,\n",
+            "      \"max_length\": 300,\n",
+            "      \"num_beams\": 4,\n",
+            "      \"prefix\": \"translate English to Romanian: \"\n",
+            "    }\n",
+            "  },\n",
+            "  \"transformers_version\": \"4.18.0.dev0\",\n",
+            "  \"use_cache\": true,\n",
+            "  \"vocab_size\": 32128\n",
+            "}\n",
+            "\n",
+            "[INFO|tokenization_auto.py:345] 2022-12-14 13:51:43,378 >> Could not locate the tokenizer configuration file, will try to use the model config instead.\n",
+            "[INFO|configuration_utils.py:649] 2022-12-14 13:51:44,307 >> loading configuration file https://huggingface.co/t5-small/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.d67b370cd9d75f81ad4eb421ee7b8db09e0b6a6c693b8c2b423af5d7bcac6205\n",
+            "[INFO|configuration_utils.py:685] 2022-12-14 13:51:44,307 >> Model config T5Config {\n",
+            "  \"_name_or_path\": \"t5-small\",\n",
+            "  \"architectures\": [\n",
+            "    \"T5ForConditionalGeneration\"\n",
+            "  ],\n",
+            "  \"d_ff\": 2048,\n",
+            "  \"d_kv\": 64,\n",
+            "  \"d_model\": 512,\n",
+            "  \"decoder_start_token_id\": 0,\n",
+            "  \"dropout_rate\": 0.1,\n",
+            "  \"eos_token_id\": 1,\n",
+            "  \"feed_forward_proj\": \"relu\",\n",
+            "  \"initializer_factor\": 1.0,\n",
+            "  \"is_encoder_decoder\": true,\n",
+            "  \"layer_norm_epsilon\": 1e-06,\n",
+            "  \"model_type\": \"t5\",\n",
+            "  \"n_positions\": 512,\n",
+            "  \"num_decoder_layers\": 6,\n",
+            "  \"num_heads\": 8,\n",
+            "  \"num_layers\": 6,\n",
+            "  \"output_past\": true,\n",
+            "  \"pad_token_id\": 0,\n",
+            "  \"relative_attention_max_distance\": 128,\n",
+            "  \"relative_attention_num_buckets\": 32,\n",
+            "  \"task_specific_params\": {\n",
+            "    \"summarization\": {\n",
+            "      \"early_stopping\": true,\n",
+            "      \"length_penalty\": 2.0,\n",
+            "      \"max_length\": 200,\n",
+            "      \"min_length\": 30,\n",
+            "      \"no_repeat_ngram_size\": 3,\n",
+            "      \"num_beams\": 4,\n",
+            "      \"prefix\": \"summarize: \"\n",
+            "    },\n",
+            "    \"translation_en_to_de\": {\n",
+            "      \"early_stopping\": true,\n",
+            "      \"max_length\": 300,\n",
+            "      \"num_beams\": 4,\n",
+            "      \"prefix\": \"translate English to German: \"\n",
+            "    },\n",
+            "    \"translation_en_to_fr\": {\n",
+            "      \"early_stopping\": true,\n",
+            "      \"max_length\": 300,\n",
+            "      \"num_beams\": 4,\n",
+            "      \"prefix\": \"translate English to French: \"\n",
+            "    },\n",
+            "    \"translation_en_to_ro\": {\n",
+            "      \"early_stopping\": true,\n",
+            "      \"max_length\": 300,\n",
+            "      \"num_beams\": 4,\n",
+            "      \"prefix\": \"translate English to Romanian: \"\n",
+            "    }\n",
+            "  },\n",
+            "  \"transformers_version\": \"4.18.0.dev0\",\n",
+            "  \"use_cache\": true,\n",
+            "  \"vocab_size\": 32128\n",
+            "}\n",
+            "\n",
+            "[INFO|file_utils.py:2241] 2022-12-14 13:51:46,160 >> https://huggingface.co/t5-small/resolve/main/spiece.model not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmphxremg_9\n",
+            "Downloading: 100% 773k/773k [00:01<00:00, 702kB/s]\n",
+            "[INFO|file_utils.py:2245] 2022-12-14 13:51:48,218 >> storing https://huggingface.co/t5-small/resolve/main/spiece.model in cache at /root/.cache/huggingface/transformers/65fc04e21f45f61430aea0c4fedffac16a4d20d78b8e6601d8d996ebefefecd2.3b69006860e7b5d0a63ffdddc01ddcd6b7c318a6f4fd793596552c741734c62d\n",
+            "[INFO|file_utils.py:2253] 2022-12-14 13:51:48,218 >> creating metadata file for /root/.cache/huggingface/transformers/65fc04e21f45f61430aea0c4fedffac16a4d20d78b8e6601d8d996ebefefecd2.3b69006860e7b5d0a63ffdddc01ddcd6b7c318a6f4fd793596552c741734c62d\n",
+            "[INFO|file_utils.py:2241] 2022-12-14 13:51:49,133 >> https://huggingface.co/t5-small/resolve/main/tokenizer.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp770jl9mw\n",
+            "Downloading: 100% 1.32M/1.32M [00:01<00:00, 1.03MB/s]\n",
+            "[INFO|file_utils.py:2245] 2022-12-14 13:51:51,427 >> storing https://huggingface.co/t5-small/resolve/main/tokenizer.json in cache at /root/.cache/huggingface/transformers/06779097c78e12f47ef67ecb728810c2ae757ee0a9efe9390c6419783d99382d.8627f1bd5d270a9fd2e5a51c8bec3223896587cc3cfe13edeabb0992ab43c529\n",
+            "[INFO|file_utils.py:2253] 2022-12-14 13:51:51,427 >> creating metadata file for /root/.cache/huggingface/transformers/06779097c78e12f47ef67ecb728810c2ae757ee0a9efe9390c6419783d99382d.8627f1bd5d270a9fd2e5a51c8bec3223896587cc3cfe13edeabb0992ab43c529\n",
+            "[INFO|tokenization_utils_base.py:1786] 2022-12-14 13:51:54,209 >> loading file https://huggingface.co/t5-small/resolve/main/spiece.model from cache at /root/.cache/huggingface/transformers/65fc04e21f45f61430aea0c4fedffac16a4d20d78b8e6601d8d996ebefefecd2.3b69006860e7b5d0a63ffdddc01ddcd6b7c318a6f4fd793596552c741734c62d\n",
+            "[INFO|tokenization_utils_base.py:1786] 2022-12-14 13:51:54,210 >> loading file https://huggingface.co/t5-small/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/06779097c78e12f47ef67ecb728810c2ae757ee0a9efe9390c6419783d99382d.8627f1bd5d270a9fd2e5a51c8bec3223896587cc3cfe13edeabb0992ab43c529\n",
+            "[INFO|tokenization_utils_base.py:1786] 2022-12-14 13:51:54,210 >> loading file https://huggingface.co/t5-small/resolve/main/added_tokens.json from cache at None\n",
+            "[INFO|tokenization_utils_base.py:1786] 2022-12-14 13:51:54,210 >> loading file https://huggingface.co/t5-small/resolve/main/special_tokens_map.json from cache at None\n",
+            "[INFO|tokenization_utils_base.py:1786] 2022-12-14 13:51:54,210 >> loading file https://huggingface.co/t5-small/resolve/main/tokenizer_config.json from cache at None\n",
+            "[INFO|configuration_utils.py:649] 2022-12-14 13:51:55,135 >> loading configuration file https://huggingface.co/t5-small/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.d67b370cd9d75f81ad4eb421ee7b8db09e0b6a6c693b8c2b423af5d7bcac6205\n",
+            "[INFO|configuration_utils.py:685] 2022-12-14 13:51:55,136 >> Model config T5Config {\n",
+            "  \"_name_or_path\": \"t5-small\",\n",
+            "  \"architectures\": [\n",
+            "    \"T5ForConditionalGeneration\"\n",
+            "  ],\n",
+            "  \"d_ff\": 2048,\n",
+            "  \"d_kv\": 64,\n",
+            "  \"d_model\": 512,\n",
+            "  \"decoder_start_token_id\": 0,\n",
+            "  \"dropout_rate\": 0.1,\n",
+            "  \"eos_token_id\": 1,\n",
+            "  \"feed_forward_proj\": \"relu\",\n",
+            "  \"initializer_factor\": 1.0,\n",
+            "  \"is_encoder_decoder\": true,\n",
+            "  \"layer_norm_epsilon\": 1e-06,\n",
+            "  \"model_type\": \"t5\",\n",
+            "  \"n_positions\": 512,\n",
+            "  \"num_decoder_layers\": 6,\n",
+            "  \"num_heads\": 8,\n",
+            "  \"num_layers\": 6,\n",
+            "  \"output_past\": true,\n",
+            "  \"pad_token_id\": 0,\n",
+            "  \"relative_attention_max_distance\": 128,\n",
+            "  \"relative_attention_num_buckets\": 32,\n",
+            "  \"task_specific_params\": {\n",
+            "    \"summarization\": {\n",
+            "      \"early_stopping\": true,\n",
+            "      \"length_penalty\": 2.0,\n",
+            "      \"max_length\": 200,\n",
+            "      \"min_length\": 30,\n",
+            "      \"no_repeat_ngram_size\": 3,\n",
+            "      \"num_beams\": 4,\n",
+            "      \"prefix\": \"summarize: \"\n",
+            "    },\n",
+            "    \"translation_en_to_de\": {\n",
+            "      \"early_stopping\": true,\n",
+            "      \"max_length\": 300,\n",
+            "      \"num_beams\": 4,\n",
+            "      \"prefix\": \"translate English to German: \"\n",
+            "    },\n",
+            "    \"translation_en_to_fr\": {\n",
+            "      \"early_stopping\": true,\n",
+            "      \"max_length\": 300,\n",
+            "      \"num_beams\": 4,\n",
+            "      \"prefix\": \"translate English to French: \"\n",
+            "    },\n",
+            "    \"translation_en_to_ro\": {\n",
+            "      \"early_stopping\": true,\n",
+            "      \"max_length\": 300,\n",
+            "      \"num_beams\": 4,\n",
+            "      \"prefix\": \"translate English to Romanian: \"\n",
+            "    }\n",
+            "  },\n",
+            "  \"transformers_version\": \"4.18.0.dev0\",\n",
+            "  \"use_cache\": true,\n",
+            "  \"vocab_size\": 32128\n",
+            "}\n",
+            "\n",
+            "[INFO|file_utils.py:2241] 2022-12-14 13:51:56,159 >> https://huggingface.co/t5-small/resolve/main/pytorch_model.bin not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpejaka531\n",
+            "Downloading: 100% 231M/231M [00:05<00:00, 44.4MB/s]\n",
+            "[INFO|file_utils.py:2245] 2022-12-14 13:52:01,663 >> storing https://huggingface.co/t5-small/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/fee5a3a0ae379232608b6eed45d2d7a0d2966b9683728838412caccc41b4b0ed.ddacdc89ec88482db20c676f0861a336f3d0409f94748c209847b49529d73885\n",
+            "[INFO|file_utils.py:2253] 2022-12-14 13:52:01,664 >> creating metadata file for /root/.cache/huggingface/transformers/fee5a3a0ae379232608b6eed45d2d7a0d2966b9683728838412caccc41b4b0ed.ddacdc89ec88482db20c676f0861a336f3d0409f94748c209847b49529d73885\n",
+            "[INFO|modeling_utils.py:1432] 2022-12-14 13:52:01,664 >> loading weights file https://huggingface.co/t5-small/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/fee5a3a0ae379232608b6eed45d2d7a0d2966b9683728838412caccc41b4b0ed.ddacdc89ec88482db20c676f0861a336f3d0409f94748c209847b49529d73885\n",
+            "[INFO|modeling_utils.py:1703] 2022-12-14 13:52:02,616 >> All model checkpoint weights were used when initializing T5ForConditionalGeneration.\n",
+            "\n",
+            "[INFO|modeling_utils.py:1711] 2022-12-14 13:52:02,617 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at t5-small.\n",
+            "If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.\n",
+            "Running tokenizer on train dataset:   0% 0/2 [00:00<?, ?ba/s]12/14/2022 13:52:02 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227/cache-f97c1e1f534477fa.arrow\n",
+            "Running tokenizer on train dataset: 100% 2/2 [00:00<00:00, 12.04ba/s]\n",
+            "Running tokenizer on validation dataset:   0% 0/1 [00:00<?, ?ba/s]12/14/2022 13:52:02 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227/cache-725bbb12606c2914.arrow\n",
+            "Running tokenizer on validation dataset: 100% 1/1 [00:00<00:00, 28.45ba/s]\n",
+            "examples/pytorch/translation/run_translation.py:494: FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate\n",
+            "  metric = load_metric(\"sacrebleu\")\n",
+            "12/14/2022 13:52:04 - INFO - datasets.utils.file_utils - https://raw.githubusercontent.com/huggingface/datasets/2.7.1/metrics/sacrebleu/sacrebleu.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/tmpt9ps2cqo\n",
+            "Downloading builder script: 7.65kB [00:00, 7.90MB/s]       \n",
+            "12/14/2022 13:52:04 - INFO - datasets.utils.file_utils - storing https://raw.githubusercontent.com/huggingface/datasets/2.7.1/metrics/sacrebleu/sacrebleu.py in cache at /root/.cache/huggingface/datasets/downloads/e1bb01015c85785f9a06b106f5850e5cd2e127973589326444733a028bba305f.1ffdbb824af2d651e73e1b4f5816c5725c013d8f9d9a77a293308b84d8b3579a.py\n",
+            "12/14/2022 13:52:04 - INFO - datasets.utils.file_utils - creating metadata file for /root/.cache/huggingface/datasets/downloads/e1bb01015c85785f9a06b106f5850e5cd2e127973589326444733a028bba305f.1ffdbb824af2d651e73e1b4f5816c5725c013d8f9d9a77a293308b84d8b3579a.py\n",
+            "[INFO|trainer.py:457] 2022-12-14 13:52:04,433 >> Using amp half precision backend\n",
+            "[2022-12-14 13:52:04,436] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0+384f17b0, git-hash=384f17b0, git-branch=master\n",
+            "[2022-12-14 13:52:09,320] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...\n",
+            "Creating extension directory /root/.cache/torch_extensions/py38_cu113/cpu_adam...\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "Detected CUDA files, patching ldflags\n",
+            "Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/cpu_adam/build.ninja...\n",
+            "Building extension module cpu_adam...\n",
+            "Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\\\"_gcc\\\" -DPYBIND11_STDLIB=\\\"_libstdcpp\\\" -DPYBIND11_BUILD_ABI=\\\"_cxxabi1011\\\" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o \n",
+            "[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\\\"_gcc\\\" -DPYBIND11_STDLIB=\\\"_libstdcpp\\\" -DPYBIND11_BUILD_ABI=\\\"_cxxabi1011\\\" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o \n",
+            "[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/usr/local/lib/python3.8/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so\n",
+            "Loading extension module cpu_adam...\n",
+            "Time to load cpu_adam op: 28.35541605949402 seconds\n",
+            "Adam Optimizer #0 is created with AVX2 arithmetic capability.\n",
+            "Config: alpha=0.000030, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1\n",
+            "[2022-12-14 13:52:40,508] [INFO] [logging.py:68:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer\n",
+            "[2022-12-14 13:52:40,513] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam\n",
+            "[2022-12-14 13:52:40,513] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>\n",
+            "[2022-12-14 13:52:40,513] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer\n",
+            "[2022-12-14 13:52:40,513] [INFO] [stage_1_and_2.py:140:__init__] Reduce bucket size 200000000\n",
+            "[2022-12-14 13:52:40,513] [INFO] [stage_1_and_2.py:141:__init__] Allgather bucket size 200000000\n",
+            "[2022-12-14 13:52:40,513] [INFO] [stage_1_and_2.py:142:__init__] CPU Offload: True\n",
+            "[2022-12-14 13:52:40,513] [INFO] [stage_1_and_2.py:143:__init__] Round robin gradient partitioning: False\n",
+            "Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...\n",
+            "Creating extension directory /root/.cache/torch_extensions/py38_cu113/utils...\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/utils/build.ninja...\n",
+            "Building extension module utils...\n",
+            "Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\n",
+            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+            "To disable this warning, you can either:\n",
+            "\t- Avoid using `tokenizers` before the fork if possible\n",
+            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+            "[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\\\"_gcc\\\" -DPYBIND11_STDLIB=\\\"_libstdcpp\\\" -DPYBIND11_BUILD_ABI=\\\"_cxxabi1011\\\" -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o \n",
+            "[2/2] c++ flatten_unflatten.o -shared -L/usr/local/lib/python3.8/dist-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so\n",
+            "Loading extension module utils...\n",
+            "Time to load utils op: 14.856906175613403 seconds\n",
+            "Rank: 0 partition count [1] and sizes[(60492288, False)] \n",
+            "[2022-12-14 13:52:55,716] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states\n",
+            "[2022-12-14 13:52:55,717] [INFO] [utils.py:828:see_memory_usage] MA 0.14 GB         Max_MA 0.14 GB         CA 0.24 GB         Max_CA 0 GB \n",
+            "[2022-12-14 13:52:55,717] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 5.39 GB, percent = 6.5%\n",
+            "[2022-12-14 13:52:55,987] [INFO] [utils.py:827:see_memory_usage] After initializing optimizer states\n",
+            "[2022-12-14 13:52:55,987] [INFO] [utils.py:828:see_memory_usage] MA 0.14 GB         Max_MA 0.14 GB         CA 0.24 GB         Max_CA 0 GB \n",
+            "[2022-12-14 13:52:55,988] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 6.1 GB, percent = 7.3%\n",
+            "[2022-12-14 13:52:55,988] [INFO] [stage_1_and_2.py:525:__init__] optimizer state initialized\n",
+            "[2022-12-14 13:52:56,061] [INFO] [utils.py:827:see_memory_usage] After initializing ZeRO optimizer\n",
+            "[2022-12-14 13:52:56,062] [INFO] [utils.py:828:see_memory_usage] MA 0.14 GB         Max_MA 0.14 GB         CA 0.24 GB         Max_CA 0 GB \n",
+            "[2022-12-14 13:52:56,062] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 6.1 GB, percent = 7.3%\n",
+            "[2022-12-14 13:52:56,065] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw\n",
+            "[2022-12-14 13:52:56,066] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupLR\n",
+            "[2022-12-14 13:52:56,066] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7f7f49631760>\n",
+            "[2022-12-14 13:52:56,066] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[3e-05], mom=[[0.9, 0.999]]\n",
+            "[2022-12-14 13:52:56,066] [INFO] [config.py:1008:print] DeepSpeedEngine configuration:\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   activation_checkpointing_config  {\n",
+            "    \"partition_activations\": false, \n",
+            "    \"contiguous_memory_optimization\": false, \n",
+            "    \"cpu_checkpointing\": false, \n",
+            "    \"number_checkpoints\": null, \n",
+            "    \"synchronize_checkpoint_boundary\": false, \n",
+            "    \"profile\": false\n",
+            "}\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   amp_enabled .................. False\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   amp_params ................... False\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   autotuning_config ............ {\n",
+            "    \"enabled\": false, \n",
+            "    \"start_step\": null, \n",
+            "    \"end_step\": null, \n",
+            "    \"metric_path\": null, \n",
+            "    \"arg_mappings\": null, \n",
+            "    \"metric\": \"throughput\", \n",
+            "    \"model_info\": null, \n",
+            "    \"results_dir\": \"autotuning_results\", \n",
+            "    \"exps_dir\": \"autotuning_exps\", \n",
+            "    \"overwrite\": true, \n",
+            "    \"fast\": true, \n",
+            "    \"start_profile_step\": 3, \n",
+            "    \"end_profile_step\": 5, \n",
+            "    \"tuner_type\": \"gridsearch\", \n",
+            "    \"tuner_early_stopping\": 5, \n",
+            "    \"tuner_num_trials\": 50, \n",
+            "    \"model_info_path\": null, \n",
+            "    \"mp_size\": 1, \n",
+            "    \"max_train_batch_size\": null, \n",
+            "    \"min_train_batch_size\": 1, \n",
+            "    \"max_train_micro_batch_size_per_gpu\": 1.024000e+03, \n",
+            "    \"min_train_micro_batch_size_per_gpu\": 1, \n",
+            "    \"num_tuning_micro_batch_sizes\": 3\n",
+            "}\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   bfloat16_enabled ............. False\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   checkpoint_parallel_write_pipeline  False\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   checkpoint_tag_validation_enabled  True\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   checkpoint_tag_validation_fail  False\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f7f49631220>\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   communication_data_type ...... None\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   curriculum_enabled_legacy .... False\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   curriculum_params_legacy ..... False\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}\n",
+            "[2022-12-14 13:52:56,067] [INFO] [config.py:1012:print]   data_efficiency_enabled ...... False\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   dataloader_drop_last ......... False\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   disable_allgather ............ False\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   dump_state ................... False\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   eigenvalue_enabled ........... False\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   eigenvalue_gas_boundary_resolution  1\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   eigenvalue_layer_name ........ bert.encoder.layer\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   eigenvalue_layer_num ......... 0\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   eigenvalue_max_iter .......... 100\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   eigenvalue_stability ......... 1e-06\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   eigenvalue_tol ............... 0.01\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   eigenvalue_verbose ........... False\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   elasticity_enabled ........... False\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   flops_profiler_config ........ {\n",
+            "    \"enabled\": false, \n",
+            "    \"profile_step\": 1, \n",
+            "    \"module_depth\": -1, \n",
+            "    \"top_modules\": 1, \n",
+            "    \"detailed\": true, \n",
+            "    \"output_file\": null\n",
+            "}\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   fp16_auto_cast ............... False\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   fp16_enabled ................. True\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   fp16_master_weights_and_gradients  False\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   global_rank .................. 0\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   grad_accum_dtype ............. None\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   gradient_accumulation_steps .. 1\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   gradient_clipping ............ 1.0\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   gradient_predivide_factor .... 1.0\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   initial_dynamic_scale ........ 65536\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   load_universal_checkpoint .... False\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   loss_scale ................... 0\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   memory_breakdown ............. False\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   monitor_config ............... <deepspeed.monitor.config.DeepSpeedMonitorConfig object at 0x7f7f496313a0>\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   nebula_config ................ {\n",
+            "    \"enabled\": false, \n",
+            "    \"persistent_storage_path\": null, \n",
+            "    \"persistent_time_interval\": 100, \n",
+            "    \"num_of_version_in_retention\": 2, \n",
+            "    \"enable_nebula_load\": true, \n",
+            "    \"load_path\": null\n",
+            "}\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   optimizer_legacy_fusion ...... False\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   optimizer_name ............... adamw\n",
+            "[2022-12-14 13:52:56,068] [INFO] [config.py:1012:print]   optimizer_params ............. {'lr': 3e-05, 'betas': [0.9, 0.999], 'eps': 1e-06, 'weight_decay': 0.0}\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   pld_enabled .................. False\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   pld_params ................... False\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   prescale_gradients ........... False\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   scheduler_name ............... WarmupLR\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 3e-05, 'warmup_num_steps': 500}\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   sparse_attention ............. None\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   sparse_gradients_enabled ..... False\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   steps_per_print .............. 2000\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   train_batch_size ............. 16\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   train_micro_batch_size_per_gpu  16\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   use_node_local_storage ....... False\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   wall_clock_breakdown ......... False\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   world_size ................... 1\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   zero_allow_untested_optimizer  False\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=200000000 allgather_partitions=True allgather_bucket_size=200000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   zero_enabled ................. True\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:1012:print]   zero_optimization_stage ...... 2\n",
+            "[2022-12-14 13:52:56,069] [INFO] [config.py:997:print_user_config]   json = {\n",
+            "    \"fp16\": {\n",
+            "        \"enabled\": true, \n",
+            "        \"loss_scale\": 0, \n",
+            "        \"loss_scale_window\": 1000, \n",
+            "        \"initial_scale_power\": 16, \n",
+            "        \"hysteresis\": 2, \n",
+            "        \"min_loss_scale\": 1\n",
+            "    }, \n",
+            "    \"optimizer\": {\n",
+            "        \"type\": \"AdamW\", \n",
+            "        \"params\": {\n",
+            "            \"lr\": 3e-05, \n",
+            "            \"betas\": [0.9, 0.999], \n",
+            "            \"eps\": 1e-06, \n",
+            "            \"weight_decay\": 0.0\n",
+            "        }\n",
+            "    }, \n",
+            "    \"scheduler\": {\n",
+            "        \"type\": \"WarmupLR\", \n",
+            "        \"params\": {\n",
+            "            \"warmup_min_lr\": 0, \n",
+            "            \"warmup_max_lr\": 3e-05, \n",
+            "            \"warmup_num_steps\": 500\n",
+            "        }\n",
+            "    }, \n",
+            "    \"zero_optimization\": {\n",
+            "        \"stage\": 2, \n",
+            "        \"offload_optimizer\": {\n",
+            "            \"device\": \"cpu\", \n",
+            "            \"pin_memory\": true\n",
+            "        }, \n",
+            "        \"allgather_partitions\": true, \n",
+            "        \"allgather_bucket_size\": 2.000000e+08, \n",
+            "        \"overlap_comm\": true, \n",
+            "        \"reduce_scatter\": true, \n",
+            "        \"reduce_bucket_size\": 2.000000e+08, \n",
+            "        \"contiguous_gradients\": true\n",
+            "    }, \n",
+            "    \"gradient_accumulation_steps\": 1, \n",
+            "    \"gradient_clipping\": 1.0, \n",
+            "    \"steps_per_print\": 2.000000e+03, \n",
+            "    \"train_batch_size\": 16, \n",
+            "    \"train_micro_batch_size_per_gpu\": 16, \n",
+            "    \"wall_clock_breakdown\": false\n",
+            "}\n",
+            "Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...\n",
+            "No modifications detected for re-loaded extension module utils, skipping build step...\n",
+            "Loading extension module utils...\n",
+            "Time to load utils op: 0.0004124641418457031 seconds\n",
+            "[INFO|trainer.py:1288] 2022-12-14 13:52:56,070 >> ***** Running training *****\n",
+            "[INFO|trainer.py:1289] 2022-12-14 13:52:56,070 >>   Num examples = 2000\n",
+            "[INFO|trainer.py:1290] 2022-12-14 13:52:56,070 >>   Num Epochs = 1\n",
+            "[INFO|trainer.py:1291] 2022-12-14 13:52:56,070 >>   Instantaneous batch size per device = 16\n",
+            "[INFO|trainer.py:1292] 2022-12-14 13:52:56,070 >>   Total train batch size (w. parallel, distributed & accumulation) = 16\n",
+            "[INFO|trainer.py:1293] 2022-12-14 13:52:56,070 >>   Gradient Accumulation steps = 1\n",
+            "[INFO|trainer.py:1294] 2022-12-14 13:52:56,070 >>   Total optimization steps = 125\n",
+            "2022-12-14 13:52:56.233329: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n",
+            "  0% 0/125 [00:00<?, ?it/s][2022-12-14 13:52:59,704] [INFO] [stage_1_and_2.py:1765:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 65536\n",
+            "  1% 1/125 [00:01<03:44,  1.81s/it][WARNING|trainer_pt_utils.py:806] 2022-12-14 13:52:59,706 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0\n",
+            "{'loss': 3.3668, 'learning_rate': 0, 'epoch': 0.01}\n",
+            "  1% 1/125 [00:01<03:44,  1.81s/it][2022-12-14 13:52:59,814] [INFO] [stage_1_and_2.py:1765:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768.0\n",
+            "  2% 2/125 [00:01<01:39,  1.24it/s][2022-12-14 13:52:59,978] [INFO] [timer.py:197:stop] 0/3, RunningAvgSamplesPerSec=100.93228052760607, CurrSamplesPerSec=100.93228052760607, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            "  2% 3/125 [00:02<01:02,  1.94it/s][2022-12-14 13:53:00,142] [INFO] [timer.py:197:stop] 0/4, RunningAvgSamplesPerSec=100.51390720541595, CurrSamplesPerSec=100.09898795540143, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            "  3% 4/125 [00:02<00:45,  2.66it/s][2022-12-14 13:53:00,307] [INFO] [timer.py:197:stop] 0/5, RunningAvgSamplesPerSec=100.37581965251519, CurrSamplesPerSec=100.10077966865201, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            "  4% 5/125 [00:02<00:35,  3.34it/s][2022-12-14 13:53:00,470] [INFO] [timer.py:197:stop] 0/6, RunningAvgSamplesPerSec=100.48854641777947, CurrSamplesPerSec=100.82825101866653, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            "  5% 6/125 [00:02<00:30,  3.95it/s][2022-12-14 13:53:00,637] [INFO] [timer.py:197:stop] 0/7, RunningAvgSamplesPerSec=100.11956665739702, CurrSamplesPerSec=98.67035222574441, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            "  6% 7/125 [00:02<00:26,  4.44it/s][2022-12-14 13:53:00,804] [INFO] [timer.py:197:stop] 0/8, RunningAvgSamplesPerSec=99.83657428746905, CurrSamplesPerSec=98.44527473752616, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            "  6% 8/125 [00:02<00:24,  4.84it/s][2022-12-14 13:53:00,983] [INFO] [timer.py:197:stop] 0/9, RunningAvgSamplesPerSec=98.63212933484701, CurrSamplesPerSec=91.97455204304553, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            "  7% 9/125 [00:03<00:22,  5.05it/s][2022-12-14 13:53:01,160] [INFO] [timer.py:197:stop] 0/10, RunningAvgSamplesPerSec=97.89077831305485, CurrSamplesPerSec=92.99776890266041, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            "  8% 10/125 [00:03<00:22,  5.22it/s][2022-12-14 13:53:01,325] [INFO] [timer.py:197:stop] 0/11, RunningAvgSamplesPerSec=98.12283050305555, CurrSamplesPerSec=100.01961973718137, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            "  9% 11/125 [00:03<00:20,  5.46it/s][2022-12-14 13:53:01,491] [INFO] [timer.py:197:stop] 0/12, RunningAvgSamplesPerSec=98.20155966207906, CurrSamplesPerSec=98.91584861830896, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 10% 12/125 [00:03<00:20,  5.61it/s][2022-12-14 13:53:01,658] [INFO] [timer.py:197:stop] 0/13, RunningAvgSamplesPerSec=98.25159013855225, CurrSamplesPerSec=98.75471303846216, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 10% 13/125 [00:03<00:19,  5.72it/s][2022-12-14 13:53:01,825] [INFO] [timer.py:197:stop] 0/14, RunningAvgSamplesPerSec=98.28307645442669, CurrSamplesPerSec=98.6307625605338, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 11% 14/125 [00:03<00:19,  5.80it/s][2022-12-14 13:53:01,993] [INFO] [timer.py:197:stop] 0/15, RunningAvgSamplesPerSec=98.28591102148417, CurrSamplesPerSec=98.31993858379386, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 12% 15/125 [00:04<00:18,  5.85it/s][2022-12-14 13:53:02,157] [INFO] [timer.py:197:stop] 0/16, RunningAvgSamplesPerSec=98.39972283305173, CurrSamplesPerSec=99.90362880655496, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 13% 16/125 [00:04<00:18,  5.91it/s][2022-12-14 13:53:02,323] [INFO] [timer.py:197:stop] 0/17, RunningAvgSamplesPerSec=98.45905377941865, CurrSamplesPerSec=99.29726354540585, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 14% 17/125 [00:04<00:18,  5.95it/s][2022-12-14 13:53:02,502] [INFO] [timer.py:197:stop] 0/18, RunningAvgSamplesPerSec=98.0322214622036, CurrSamplesPerSec=92.04671130776849, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 14% 18/125 [00:04<00:18,  5.84it/s][2022-12-14 13:53:02,670] [INFO] [timer.py:197:stop] 0/19, RunningAvgSamplesPerSec=98.0566250130752, CurrSamplesPerSec=98.44874080007277, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 15% 19/125 [00:04<00:18,  5.88it/s][2022-12-14 13:53:02,834] [INFO] [timer.py:197:stop] 0/20, RunningAvgSamplesPerSec=98.17354136093077, CurrSamplesPerSec=100.20465927450061, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 16% 20/125 [00:04<00:17,  5.93it/s][2022-12-14 13:53:03,001] [INFO] [timer.py:197:stop] 0/21, RunningAvgSamplesPerSec=98.19985193027061, CurrSamplesPerSec=98.67586539507774, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 17% 21/125 [00:05<00:17,  5.95it/s][2022-12-14 13:53:03,168] [INFO] [timer.py:197:stop] 0/22, RunningAvgSamplesPerSec=98.20457745561694, CurrSamplesPerSec=98.29444892805826, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 18% 22/125 [00:05<00:17,  5.96it/s][2022-12-14 13:53:03,338] [INFO] [timer.py:197:stop] 0/23, RunningAvgSamplesPerSec=98.16114732027464, CurrSamplesPerSec=97.30054066436907, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 18% 23/125 [00:05<00:17,  5.94it/s][2022-12-14 13:53:03,507] [INFO] [timer.py:197:stop] 0/24, RunningAvgSamplesPerSec=98.11183311522139, CurrSamplesPerSec=97.08756112381008, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 19% 24/125 [00:05<00:17,  5.93it/s][2022-12-14 13:53:03,681] [INFO] [timer.py:197:stop] 0/25, RunningAvgSamplesPerSec=97.97594158706156, CurrSamplesPerSec=95.07875051004217, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 20% 25/125 [00:05<00:17,  5.88it/s][2022-12-14 13:53:03,849] [INFO] [timer.py:197:stop] 0/26, RunningAvgSamplesPerSec=97.98023584056564, CurrSamplesPerSec=98.07910767108233, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 21% 26/125 [00:05<00:16,  5.90it/s][2022-12-14 13:53:04,019] [INFO] [timer.py:197:stop] 0/27, RunningAvgSamplesPerSec=97.94352764611229, CurrSamplesPerSec=97.07070905777186, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 22% 27/125 [00:06<00:16,  5.90it/s][2022-12-14 13:53:04,188] [INFO] [timer.py:197:stop] 0/28, RunningAvgSamplesPerSec=97.90716780405262, CurrSamplesPerSec=97.0068647503957, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 22% 28/125 [00:06<00:16,  5.90it/s][2022-12-14 13:53:04,356] [INFO] [timer.py:197:stop] 0/29, RunningAvgSamplesPerSec=97.911064920614, CurrSamplesPerSec=98.01249895939371, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 23% 29/125 [00:06<00:16,  5.91it/s][2022-12-14 13:53:04,533] [INFO] [timer.py:197:stop] 0/30, RunningAvgSamplesPerSec=97.73130281431368, CurrSamplesPerSec=93.1154506830802, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 24% 30/125 [00:06<00:16,  5.84it/s][2022-12-14 13:53:04,702] [INFO] [timer.py:197:stop] 0/31, RunningAvgSamplesPerSec=97.72453065380168, CurrSamplesPerSec=97.53529046707565, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 25% 31/125 [00:06<00:16,  5.86it/s][2022-12-14 13:53:04,871] [INFO] [timer.py:197:stop] 0/32, RunningAvgSamplesPerSec=97.7022106755048, CurrSamplesPerSec=97.05933722001501, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 26% 32/125 [00:06<00:15,  5.87it/s][2022-12-14 13:53:05,039] [INFO] [timer.py:197:stop] 0/33, RunningAvgSamplesPerSec=97.70851333549035, CurrSamplesPerSec=97.89797198533037, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 26% 33/125 [00:07<00:15,  5.89it/s][2022-12-14 13:53:05,206] [INFO] [timer.py:197:stop] 0/34, RunningAvgSamplesPerSec=97.74121842513598, CurrSamplesPerSec=98.76604957077029, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 27% 34/125 [00:07<00:15,  5.92it/s][2022-12-14 13:53:05,387] [INFO] [timer.py:197:stop] 0/35, RunningAvgSamplesPerSec=97.51537556624366, CurrSamplesPerSec=90.80152407343263, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 28% 35/125 [00:07<00:15,  5.80it/s][2022-12-14 13:53:05,557] [INFO] [timer.py:197:stop] 0/36, RunningAvgSamplesPerSec=97.50330979885666, CurrSamplesPerSec=97.10680772063559, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 29% 36/125 [00:07<00:15,  5.83it/s][2022-12-14 13:53:05,721] [INFO] [timer.py:197:stop] 0/37, RunningAvgSamplesPerSec=97.57756443672217, CurrSamplesPerSec=100.17130463727403, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 30% 37/125 [00:07<00:14,  5.90it/s][2022-12-14 13:53:05,885] [INFO] [timer.py:197:stop] 0/38, RunningAvgSamplesPerSec=97.65117629751978, CurrSamplesPerSec=100.29945970990232, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 30% 38/125 [00:07<00:14,  5.96it/s][2022-12-14 13:53:06,050] [INFO] [timer.py:197:stop] 0/39, RunningAvgSamplesPerSec=97.71280503425812, CurrSamplesPerSec=99.98445156609202, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 31% 39/125 [00:08<00:14,  5.99it/s][2022-12-14 13:53:06,214] [INFO] [timer.py:197:stop] 0/40, RunningAvgSamplesPerSec=97.78451098998667, CurrSamplesPerSec=100.51368138525804, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 32% 40/125 [00:08<00:14,  6.02it/s][2022-12-14 13:53:06,392] [INFO] [timer.py:197:stop] 0/41, RunningAvgSamplesPerSec=97.64121769291246, CurrSamplesPerSec=92.49085067939409, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 33% 41/125 [00:08<00:14,  5.90it/s][2022-12-14 13:53:06,569] [INFO] [timer.py:197:stop] 0/42, RunningAvgSamplesPerSec=97.50732988271751, CurrSamplesPerSec=92.55756706434039, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 34% 42/125 [00:08<00:14,  5.82it/s][2022-12-14 13:53:06,743] [INFO] [timer.py:197:stop] 0/43, RunningAvgSamplesPerSec=97.43901867139148, CurrSamplesPerSec=94.78291647069956, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 34% 43/125 [00:08<00:14,  5.80it/s][2022-12-14 13:53:06,909] [INFO] [timer.py:197:stop] 0/44, RunningAvgSamplesPerSec=97.48381607826424, CurrSamplesPerSec=99.35665660893588, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 35% 44/125 [00:09<00:13,  5.86it/s][2022-12-14 13:53:07,074] [INFO] [timer.py:197:stop] 0/45, RunningAvgSamplesPerSec=97.53712674575597, CurrSamplesPerSec=99.83006436775453, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 36% 45/125 [00:09<00:13,  5.92it/s][2022-12-14 13:53:07,241] [INFO] [timer.py:197:stop] 0/46, RunningAvgSamplesPerSec=97.56654172657346, CurrSamplesPerSec=98.84839013296371, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 37% 46/125 [00:09<00:13,  5.95it/s][2022-12-14 13:53:07,410] [INFO] [timer.py:197:stop] 0/47, RunningAvgSamplesPerSec=97.55963144484514, CurrSamplesPerSec=97.25654510962741, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 38% 47/125 [00:09<00:13,  5.93it/s][2022-12-14 13:53:07,579] [INFO] [timer.py:197:stop] 0/48, RunningAvgSamplesPerSec=97.55609308009853, CurrSamplesPerSec=97.39713188093866, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 38% 48/125 [00:09<00:12,  5.93it/s][2022-12-14 13:53:07,756] [INFO] [timer.py:197:stop] 0/49, RunningAvgSamplesPerSec=97.45183404893723, CurrSamplesPerSec=92.88552644327257, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 39% 49/125 [00:09<00:13,  5.84it/s][2022-12-14 13:53:07,937] [INFO] [timer.py:197:stop] 0/50, RunningAvgSamplesPerSec=97.30092274448778, CurrSamplesPerSec=90.69954304821037, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 40% 50/125 [00:10<00:13,  5.74it/s][2022-12-14 13:53:08,105] [INFO] [timer.py:197:stop] 0/51, RunningAvgSamplesPerSec=97.318820531769, CurrSamplesPerSec=98.18572647109171, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 41% 51/125 [00:10<00:12,  5.81it/s][2022-12-14 13:53:08,271] [INFO] [timer.py:197:stop] 0/52, RunningAvgSamplesPerSec=97.34672802114179, CurrSamplesPerSec=98.73408141616876, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 42% 52/125 [00:10<00:12,  5.86it/s][2022-12-14 13:53:08,437] [INFO] [timer.py:197:stop] 0/53, RunningAvgSamplesPerSec=97.38777283455998, CurrSamplesPerSec=99.4850940313863, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 42% 53/125 [00:10<00:12,  5.92it/s][2022-12-14 13:53:08,606] [INFO] [timer.py:197:stop] 0/54, RunningAvgSamplesPerSec=97.38849365660226, CurrSamplesPerSec=97.42526973508473, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 43% 54/125 [00:10<00:12,  5.91it/s][2022-12-14 13:53:08,773] [INFO] [timer.py:197:stop] 0/55, RunningAvgSamplesPerSec=97.40983685576079, CurrSamplesPerSec=98.53272293204049, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 44% 55/125 [00:10<00:11,  5.93it/s][2022-12-14 13:53:08,940] [INFO] [timer.py:197:stop] 0/56, RunningAvgSamplesPerSec=97.4366410335728, CurrSamplesPerSec=98.87868407055264, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 45% 56/125 [00:11<00:11,  5.95it/s][2022-12-14 13:53:09,106] [INFO] [timer.py:197:stop] 0/57, RunningAvgSamplesPerSec=97.46918902957462, CurrSamplesPerSec=99.25966507615804, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 46% 57/125 [00:11<00:11,  5.97it/s][2022-12-14 13:53:09,275] [INFO] [timer.py:197:stop] 0/58, RunningAvgSamplesPerSec=97.47059867536238, CurrSamplesPerSec=97.54819203553716, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 46% 58/125 [00:11<00:11,  5.96it/s][2022-12-14 13:53:09,443] [INFO] [timer.py:197:stop] 0/59, RunningAvgSamplesPerSec=97.47972117759583, CurrSamplesPerSec=97.9933209799206, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 47% 59/125 [00:11<00:11,  5.96it/s][2022-12-14 13:53:09,611] [INFO] [timer.py:197:stop] 0/60, RunningAvgSamplesPerSec=97.49170509476033, CurrSamplesPerSec=98.17969338625461, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 48% 60/125 [00:11<00:10,  5.96it/s][2022-12-14 13:53:09,779] [INFO] [timer.py:197:stop] 0/61, RunningAvgSamplesPerSec=97.50248122108036, CurrSamplesPerSec=98.13159888048243, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 49% 61/125 [00:11<00:10,  5.96it/s][2022-12-14 13:53:09,945] [INFO] [timer.py:197:stop] 0/62, RunningAvgSamplesPerSec=97.52978589153727, CurrSamplesPerSec=99.16828452661896, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 50% 62/125 [00:12<00:10,  5.98it/s][2022-12-14 13:53:10,111] [INFO] [timer.py:197:stop] 0/63, RunningAvgSamplesPerSec=97.55317833344864, CurrSamplesPerSec=98.97755969210394, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 50% 63/125 [00:12<00:10,  5.99it/s][2022-12-14 13:53:10,280] [INFO] [timer.py:197:stop] 0/64, RunningAvgSamplesPerSec=97.55646711694308, CurrSamplesPerSec=97.7575030991299, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 51% 64/125 [00:12<00:10,  5.97it/s][2022-12-14 13:53:10,445] [INFO] [timer.py:197:stop] 0/65, RunningAvgSamplesPerSec=97.59135483380417, CurrSamplesPerSec=99.80423108097055, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 52% 65/125 [00:12<00:10,  6.00it/s][2022-12-14 13:53:10,609] [INFO] [timer.py:197:stop] 0/66, RunningAvgSamplesPerSec=97.63201343317623, CurrSamplesPerSec=100.2636451105892, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 53% 66/125 [00:12<00:09,  6.02it/s][2022-12-14 13:53:10,781] [INFO] [timer.py:197:stop] 0/67, RunningAvgSamplesPerSec=97.60152490210652, CurrSamplesPerSec=95.68909003282373, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 54% 67/125 [00:12<00:09,  5.96it/s][2022-12-14 13:53:10,946] [INFO] [timer.py:197:stop] 0/68, RunningAvgSamplesPerSec=97.63461262324671, CurrSamplesPerSec=99.83451973442467, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 54% 68/125 [00:13<00:09,  5.99it/s][2022-12-14 13:53:11,110] [INFO] [timer.py:197:stop] 0/69, RunningAvgSamplesPerSec=97.67476243242315, CurrSamplesPerSec=100.39969659658567, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 55% 69/125 [00:13<00:09,  6.02it/s][2022-12-14 13:53:11,276] [INFO] [timer.py:197:stop] 0/70, RunningAvgSamplesPerSec=97.69195984352083, CurrSamplesPerSec=98.85814624948442, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 56% 70/125 [00:13<00:09,  6.02it/s][2022-12-14 13:53:11,444] [INFO] [timer.py:197:stop] 0/71, RunningAvgSamplesPerSec=97.69489192763089, CurrSamplesPerSec=97.89468739834375, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 57% 71/125 [00:13<00:09,  6.00it/s][2022-12-14 13:53:11,609] [INFO] [timer.py:197:stop] 0/72, RunningAvgSamplesPerSec=97.72895345762099, CurrSamplesPerSec=100.13797219785008, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 58% 72/125 [00:13<00:08,  6.02it/s][2022-12-14 13:53:11,775] [INFO] [timer.py:197:stop] 0/73, RunningAvgSamplesPerSec=97.74841419576335, CurrSamplesPerSec=99.13019792400628, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 58% 73/125 [00:13<00:08,  6.02it/s][2022-12-14 13:53:11,943] [INFO] [timer.py:197:stop] 0/74, RunningAvgSamplesPerSec=97.75189826502778, CurrSamplesPerSec=97.99990361938418, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 59% 74/125 [00:14<00:08,  6.00it/s][2022-12-14 13:53:12,106] [INFO] [timer.py:197:stop] 0/75, RunningAvgSamplesPerSec=97.79049133994862, CurrSamplesPerSec=100.6516204120322, MemAllocated=0.14GB, MaxMemAllocated=1.45GB\n",
+            " 60% 75/125 [00:14<00:08,  6.03it/s][2022-12-14 13:53:12,278] [INFO] [timer.py:197:stop] 0/76, RunningAvgSamplesPerSec=97.75985858485085, CurrSamplesPerSec=95.57434445514153, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 61% 76/125 [00:14<00:08,  5.97it/s][2022-12-14 13:53:12,445] [INFO] [timer.py:197:stop] 0/77, RunningAvgSamplesPerSec=97.7773182737493, CurrSamplesPerSec=99.08687338783622, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 62% 77/125 [00:14<00:08,  5.98it/s][2022-12-14 13:53:12,610] [INFO] [timer.py:197:stop] 0/78, RunningAvgSamplesPerSec=97.80285149815478, CurrSamplesPerSec=99.75660818817765, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 62% 78/125 [00:14<00:07,  6.00it/s][2022-12-14 13:53:12,778] [INFO] [timer.py:197:stop] 0/79, RunningAvgSamplesPerSec=97.8049131188717, CurrSamplesPerSec=97.96185101547184, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 63% 79/125 [00:14<00:07,  5.99it/s][2022-12-14 13:53:12,954] [INFO] [timer.py:197:stop] 0/80, RunningAvgSamplesPerSec=97.74538023239519, CurrSamplesPerSec=93.36924397597485, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 64% 80/125 [00:15<00:07,  5.89it/s][2022-12-14 13:53:13,119] [INFO] [timer.py:197:stop] 0/81, RunningAvgSamplesPerSec=97.77433302899652, CurrSamplesPerSec=100.08674629795065, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 65% 81/125 [00:15<00:07,  5.94it/s][2022-12-14 13:53:13,295] [INFO] [timer.py:197:stop] 0/82, RunningAvgSamplesPerSec=97.71715374059926, CurrSamplesPerSec=93.40199166312918, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 66% 82/125 [00:15<00:07,  5.86it/s][2022-12-14 13:53:13,459] [INFO] [timer.py:197:stop] 0/83, RunningAvgSamplesPerSec=97.74336677295744, CurrSamplesPerSec=99.88697443469356, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 66% 83/125 [00:15<00:07,  5.92it/s][2022-12-14 13:53:13,624] [INFO] [timer.py:197:stop] 0/84, RunningAvgSamplesPerSec=97.77431479913615, CurrSamplesPerSec=100.34790254230953, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 67% 84/125 [00:15<00:06,  5.97it/s][2022-12-14 13:53:13,789] [INFO] [timer.py:197:stop] 0/85, RunningAvgSamplesPerSec=97.79439091138117, CurrSamplesPerSec=99.46916863058976, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 68% 85/125 [00:15<00:06,  5.99it/s][2022-12-14 13:53:13,966] [INFO] [timer.py:197:stop] 0/86, RunningAvgSamplesPerSec=97.73516104666628, CurrSamplesPerSec=93.05721777643426, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 69% 86/125 [00:16<00:06,  5.89it/s][2022-12-14 13:53:14,142] [INFO] [timer.py:197:stop] 0/87, RunningAvgSamplesPerSec=97.68315810140393, CurrSamplesPerSec=93.50402042047565, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 70% 87/125 [00:16<00:06,  5.83it/s][2022-12-14 13:53:14,306] [INFO] [timer.py:197:stop] 0/88, RunningAvgSamplesPerSec=97.71034893847359, CurrSamplesPerSec=100.07823861704476, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 70% 88/125 [00:16<00:06,  5.90it/s][2022-12-14 13:53:14,471] [INFO] [timer.py:197:stop] 0/89, RunningAvgSamplesPerSec=97.73624706023953, CurrSamplesPerSec=100.01604218301263, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 71% 89/125 [00:16<00:06,  5.95it/s][2022-12-14 13:53:14,647] [INFO] [timer.py:197:stop] 0/90, RunningAvgSamplesPerSec=97.68620618498892, CurrSamplesPerSec=93.52043869367392, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 72% 90/125 [00:16<00:05,  5.87it/s][2022-12-14 13:53:14,813] [INFO] [timer.py:197:stop] 0/91, RunningAvgSamplesPerSec=97.69734830711103, CurrSamplesPerSec=98.68790946028814, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 73% 91/125 [00:16<00:05,  5.91it/s][2022-12-14 13:53:14,993] [INFO] [timer.py:197:stop] 0/92, RunningAvgSamplesPerSec=97.62674307152616, CurrSamplesPerSec=91.72689498316056, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 74% 92/125 [00:17<00:05,  5.81it/s][2022-12-14 13:53:15,173] [INFO] [timer.py:197:stop] 0/93, RunningAvgSamplesPerSec=97.55338883722011, CurrSamplesPerSec=91.37431971886025, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 74% 93/125 [00:17<00:05,  5.73it/s][2022-12-14 13:53:15,282] [INFO] [stage_1_and_2.py:1765:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0\n",
+            "[2022-12-14 13:53:15,282] [INFO] [timer.py:197:stop] 0/94, RunningAvgSamplesPerSec=97.9349760660304, CurrSamplesPerSec=152.06188622469557, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 75% 94/125 [00:17<00:04,  6.45it/s][2022-12-14 13:53:15,462] [INFO] [timer.py:197:stop] 0/95, RunningAvgSamplesPerSec=97.86093909914129, CurrSamplesPerSec=91.49728134901807, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 76% 95/125 [00:17<00:04,  6.16it/s][2022-12-14 13:53:15,637] [INFO] [timer.py:197:stop] 0/96, RunningAvgSamplesPerSec=97.81472820124776, CurrSamplesPerSec=93.6998510215565, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 77% 96/125 [00:17<00:04,  6.01it/s][2022-12-14 13:53:15,802] [INFO] [timer.py:197:stop] 0/97, RunningAvgSamplesPerSec=97.83631716043324, CurrSamplesPerSec=99.9091319177191, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 78% 97/125 [00:17<00:04,  6.03it/s][2022-12-14 13:53:15,970] [INFO] [timer.py:197:stop] 0/98, RunningAvgSamplesPerSec=97.83888757955137, CurrSamplesPerSec=98.08369482607425, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 78% 98/125 [00:18<00:04,  6.01it/s][2022-12-14 13:53:16,134] [INFO] [timer.py:197:stop] 0/99, RunningAvgSamplesPerSec=97.86483882675847, CurrSamplesPerSec=100.4219318867103, MemAllocated=0.14GB, MaxMemAllocated=1.53GB\n",
+            " 79% 99/125 [00:18<00:04,  6.03it/s][2022-12-14 13:53:16,300] [INFO] [timer.py:197:stop] 0/100, RunningAvgSamplesPerSec=97.88224614593238, CurrSamplesPerSec=99.60070586943196, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 80% 100/125 [00:18<00:04,  6.03it/s][2022-12-14 13:53:16,466] [INFO] [timer.py:197:stop] 0/101, RunningAvgSamplesPerSec=97.89022030987742, CurrSamplesPerSec=98.67804181573023, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 81% 101/125 [00:18<00:03,  6.02it/s][2022-12-14 13:53:16,635] [INFO] [timer.py:197:stop] 0/102, RunningAvgSamplesPerSec=97.88673104302609, CurrSamplesPerSec=97.54252059604998, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 82% 102/125 [00:18<00:03,  5.99it/s][2022-12-14 13:53:16,800] [INFO] [timer.py:197:stop] 0/103, RunningAvgSamplesPerSec=97.90462667139984, CurrSamplesPerSec=99.72784877845807, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 82% 103/125 [00:18<00:03,  6.01it/s][2022-12-14 13:53:16,966] [INFO] [timer.py:197:stop] 0/104, RunningAvgSamplesPerSec=97.91697119049019, CurrSamplesPerSec=99.18000936986893, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 83% 104/125 [00:19<00:03,  6.02it/s][2022-12-14 13:53:17,131] [INFO] [timer.py:197:stop] 0/105, RunningAvgSamplesPerSec=97.9339588133802, CurrSamplesPerSec=99.6982172569527, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 84% 105/125 [00:19<00:03,  6.03it/s][2022-12-14 13:53:17,296] [INFO] [timer.py:197:stop] 0/106, RunningAvgSamplesPerSec=97.95129907480779, CurrSamplesPerSec=99.77084578445802, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 85% 106/125 [00:19<00:03,  6.04it/s][2022-12-14 13:53:17,462] [INFO] [timer.py:197:stop] 0/107, RunningAvgSamplesPerSec=97.96141384796663, CurrSamplesPerSec=99.02487985043463, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 86% 107/125 [00:19<00:02,  6.03it/s][2022-12-14 13:53:17,628] [INFO] [timer.py:197:stop] 0/108, RunningAvgSamplesPerSec=97.9771691808996, CurrSamplesPerSec=99.66016659315659, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 86% 108/125 [00:19<00:02,  6.04it/s][2022-12-14 13:53:17,794] [INFO] [timer.py:197:stop] 0/109, RunningAvgSamplesPerSec=97.98746528718597, CurrSamplesPerSec=99.09126266347529, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 87% 109/125 [00:19<00:02,  6.03it/s][2022-12-14 13:53:17,959] [INFO] [timer.py:197:stop] 0/110, RunningAvgSamplesPerSec=98.0007185607853, CurrSamplesPerSec=99.4398379537362, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 88% 110/125 [00:20<00:02,  6.03it/s][2022-12-14 13:53:18,125] [INFO] [timer.py:197:stop] 0/111, RunningAvgSamplesPerSec=98.01640217587799, CurrSamplesPerSec=99.74029927055032, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 89% 111/125 [00:20<00:02,  6.04it/s][2022-12-14 13:53:18,289] [INFO] [timer.py:197:stop] 0/112, RunningAvgSamplesPerSec=98.0369025323169, CurrSamplesPerSec=100.32405022416698, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 90% 112/125 [00:20<00:02,  6.05it/s][2022-12-14 13:53:18,453] [INFO] [timer.py:197:stop] 0/113, RunningAvgSamplesPerSec=98.06054518179232, CurrSamplesPerSec=100.73275123197445, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 90% 113/125 [00:20<00:01,  6.07it/s][2022-12-14 13:53:18,627] [INFO] [timer.py:197:stop] 0/114, RunningAvgSamplesPerSec=98.02725557329867, CurrSamplesPerSec=94.46750310744365, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 91% 114/125 [00:20<00:01,  5.97it/s][2022-12-14 13:53:18,794] [INFO] [timer.py:197:stop] 0/115, RunningAvgSamplesPerSec=98.02828697114596, CurrSamplesPerSec=98.1439410336658, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 92% 115/125 [00:20<00:01,  5.97it/s][2022-12-14 13:53:18,959] [INFO] [timer.py:197:stop] 0/116, RunningAvgSamplesPerSec=98.04591514254567, CurrSamplesPerSec=100.07958183892447, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 93% 116/125 [00:21<00:01,  6.00it/s][2022-12-14 13:53:19,124] [INFO] [timer.py:197:stop] 0/117, RunningAvgSamplesPerSec=98.05878101888149, CurrSamplesPerSec=99.54796064888984, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 94% 117/125 [00:21<00:01,  6.01it/s][2022-12-14 13:53:19,292] [INFO] [timer.py:197:stop] 0/118, RunningAvgSamplesPerSec=98.06258288872708, CurrSamplesPerSec=98.50177309128485, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 94% 118/125 [00:21<00:01,  6.00it/s][2022-12-14 13:53:19,456] [INFO] [timer.py:197:stop] 0/119, RunningAvgSamplesPerSec=98.08319859870743, CurrSamplesPerSec=100.53491291620601, MemAllocated=0.14GB, MaxMemAllocated=1.6GB\n",
+            " 95% 119/125 [00:21<00:00,  6.03it/s][2022-12-14 13:53:19,621] [INFO] [timer.py:197:stop] 0/120, RunningAvgSamplesPerSec=98.093446385494, CurrSamplesPerSec=99.30740238158306, MemAllocated=0.14GB, MaxMemAllocated=1.65GB\n",
+            " 96% 120/125 [00:21<00:00,  6.03it/s][2022-12-14 13:53:19,786] [INFO] [timer.py:197:stop] 0/121, RunningAvgSamplesPerSec=98.11045897194816, CurrSamplesPerSec=100.16024118863011, MemAllocated=0.14GB, MaxMemAllocated=1.65GB\n",
+            " 97% 121/125 [00:21<00:00,  6.05it/s][2022-12-14 13:53:19,952] [INFO] [timer.py:197:stop] 0/122, RunningAvgSamplesPerSec=98.11818628925121, CurrSamplesPerSec=99.04651028929273, MemAllocated=0.14GB, MaxMemAllocated=1.65GB\n",
+            " 98% 122/125 [00:22<00:00,  6.04it/s][2022-12-14 13:53:20,118] [INFO] [timer.py:197:stop] 0/123, RunningAvgSamplesPerSec=98.12993625991835, CurrSamplesPerSec=99.56066167198279, MemAllocated=0.14GB, MaxMemAllocated=1.65GB\n",
+            " 98% 123/125 [00:22<00:00,  6.04it/s][2022-12-14 13:53:20,281] [INFO] [timer.py:197:stop] 0/124, RunningAvgSamplesPerSec=98.15251366936872, CurrSamplesPerSec=100.96325783903978, MemAllocated=0.14GB, MaxMemAllocated=1.65GB\n",
+            " 99% 124/125 [00:22<00:00,  6.06it/s][2022-12-14 13:53:20,448] [INFO] [timer.py:197:stop] 0/125, RunningAvgSamplesPerSec=98.15548442096753, CurrSamplesPerSec=98.51927037839029, MemAllocated=0.14GB, MaxMemAllocated=1.65GB\n",
+            "100% 125/125 [00:22<00:00,  6.04it/s][INFO|trainer.py:1526] 2022-12-14 13:53:20,449 >> \n",
+            "\n",
+            "Training completed. Do not forget to share your model on huggingface.co/models =)\n",
+            "\n",
+            "\n",
+            "{'train_runtime': 24.3788, 'train_samples_per_second': 82.038, 'train_steps_per_second': 5.127, 'train_loss': 2.748264869689941, 'epoch': 1.0}\n",
+            "100% 125/125 [00:22<00:00,  5.54it/s]\n",
+            "[INFO|trainer.py:2162] 2022-12-14 13:53:20,451 >> Saving model checkpoint to output_dir\n",
+            "[INFO|configuration_utils.py:440] 2022-12-14 13:53:20,452 >> Configuration saved in output_dir/config.json\n",
+            "[INFO|modeling_utils.py:1085] 2022-12-14 13:53:20,755 >> Model weights saved in output_dir/pytorch_model.bin\n",
+            "[INFO|tokenization_utils_base.py:2094] 2022-12-14 13:53:20,756 >> tokenizer config file saved in output_dir/tokenizer_config.json\n",
+            "[INFO|tokenization_utils_base.py:2100] 2022-12-14 13:53:20,756 >> Special tokens file saved in output_dir/special_tokens_map.json\n",
+            "[INFO|tokenization_t5_fast.py:162] 2022-12-14 13:53:20,790 >> Copy vocab file to output_dir/spiece.model\n",
+            "***** train metrics *****\n",
+            "  epoch                    =        1.0\n",
+            "  train_loss               =     2.7483\n",
+            "  train_runtime            = 0:00:24.37\n",
+            "  train_samples            =       2000\n",
+            "  train_samples_per_second =     82.038\n",
+            "  train_steps_per_second   =      5.127\n",
+            "12/14/2022 13:53:20 - INFO - __main__ - *** Evaluate ***\n",
+            "[INFO|trainer.py:2412] 2022-12-14 13:53:20,803 >> ***** Running Evaluation *****\n",
+            "[INFO|trainer.py:2414] 2022-12-14 13:53:20,803 >>   Num examples = 500\n",
+            "[INFO|trainer.py:2417] 2022-12-14 13:53:20,803 >>   Batch size = 16\n",
+            "100% 32/32 [00:34<00:00,  1.24s/it]12/14/2022 13:53:56 - INFO - datasets.metric - Removing /root/.cache/huggingface/metrics/sacrebleu/default/default_experiment-1-0.arrow\n",
+            "100% 32/32 [00:34<00:00,  1.09s/it]\n",
+            "***** eval metrics *****\n",
+            "  epoch                   =        1.0\n",
+            "  eval_bleu               =    23.8121\n",
+            "  eval_gen_len            =     39.342\n",
+            "  eval_loss               =     3.3844\n",
+            "  eval_runtime            = 0:00:35.78\n",
+            "  eval_samples            =        500\n",
+            "  eval_samples_per_second =     13.971\n",
+            "  eval_steps_per_second   =      0.894\n",
+            "[2022-12-14 13:53:59,120] [INFO] [launch.py:350:main] Process 492 exits successfully.\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "vSlYvQWLwblN"
+      },
+      "source": [],
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}
\ No newline at end of file