memri · koenvanderveen · Sep 13, 2022 · Oct 7, 2022 · Oct 28, 2022
diff --git a/nbs/finetune_model.ipynb b/nbs/finetune_model.ipynb
@@ -1 +1 @@
-{"cells":[{"cell_type":"markdown","source":["# Load a dataset and train your model\n","Once you have labeled your data on the Memri platform, you can use it to train your model in [this Google Colab notebook](https://colab.research.google.com/drive/189JJ2gLHAtxlmzc5XI3HhB9_VE3fT6DT)*.\n","\n","In this guide you will:\n","\n","1.   Load a labeled dataset from the POD\n","2.   Train a distilRoBERTa text classifier model on a labelled dataset\n","3.   Upload a trained model to use in a plugin for a data app\n","\n","\n","\n","\n","> * If you are unfamiliar with Google Colab notebooks, have a look at [this quick intro.](https://colab.research.google.com/)\n","* Make sure to run the below cells, one by one, in the correct order to avoid errors!\n","* In this guide we are helping you connect your own personal data from your Memri POD, alternitively you can use the [Tweet eval emoji](https://huggingface.co/datasets/tweet_eval#source-data) datasets, which is available from 🤗 [Hugging Face](https://huggingface.co/docs/datasets/index.html).\n","* If you don't wish to use your personal data, or you don't want to spend time training a model, you can simply use our [sentiment-plugin](https://gitlab.memri.io/koenvanderveen/sentiment-plugin/-/packages/6), which uses a pre-trained model from 🤗 Hugging Face. Just paste the  plugin repo address at project set-up step on the Memri platform, and skip this process."],"metadata":{"id":"suH5kbdZ-Zfi"}},{"cell_type":"markdown","metadata":{"id":"A22x_JOU5XVk"},"source":["## Setup"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"1ov_QUj2qnZy"},"outputs":[],"source":["from IPython.display import clear_output\n","!pip install pandas transformers torch git+https://gitlab.memri.io/memri/[email protected]\n","clear_output()\n","print(\"Installed\")"]},{"cell_type":"markdown","metadata":{"id":"j8t1xaae8uQJ"},"source":["1. 
Import the libraries needed to train your model\n","\n","> * Make sure to run the installation step above first to avoid errors!\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"zNkau9-8q9rC"},"outputs":[],"source":["import os\n","import random\n","import textwrap\n","\n","import pandas as pd\n","import torch\n","import transformers\n","from transformers import AutoModelForSequenceClassification, AutoTokenizer\n","from transformers.utils import logging\n","\n","from pymemri.data.itembase import Edge, Item\n","from pymemri.data.schema import Dataset, Message, CategoricalLabel\n","from pymemri.data.oauth import OauthFlow\n","from pymemri.data.loader import write_model_to_package_registry\n","from pymemri.pod.client import PodClient\n","from getpass import getpass\n","transformers.utils.logging.set_verbosity_error()\n","os.environ[\"WANDB_DISABLED\"] = \"true\""]},{"cell_type":"markdown","metadata":{"id":"gby4o-am6Lsa"},"source":["## 1. Load your dataset from the POD"]},{"cell_type":"markdown","source":["1. Run the cell\n","2. Copy your Dataset Name, Login Key and Password Key from your app.memri.io screen, and paste them below as prompted to load your connected dataset from you POD."],"metadata":{"id":"CqWDS-4kWG3s"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"M-bZ2bW-6ATQ"},"outputs":[],"source":["### *Define your pod url here*, this is the one for dev.app.memri.io ####\n","pod_url = \"https://dev.pod.memri.io\"\n","### *Define your dataset here* ####\n","dataset_name = input(\"dataset_name:\") if \"dataset_name\" not in locals() else dataset_name\n","### *Define your login key here* ####\n","owner_key = getpass(\"owner key:\") if \"owner_key\" not in locals() else owner_key\n","### *Define your password key here* ####\n","database_key = getpass(\"database_key:\") if \"database_key\" not in locals() else database_key"]},{"cell_type":"markdown","source":["2.   
Connect your POD to load your data"],"metadata":{"id":"v6jxJtPwSMg3"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"EB2zI24IrEPH"},"outputs":[],"source":["# Connect to pod\n","client = PodClient(\n","    url=pod_url,\n","    owner_key=owner_key,\n","    database_key=database_key,\n",")\n","client.add_to_schema(CategoricalLabel, Message, Dataset, OauthFlow);"]},{"cell_type":"markdown","metadata":{"id":"7liaUCkM59U5"},"source":["3.   Download and inspect the dataset\n","\n","> * All entries in the dataset can be found via the Dataset.entry edge\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"dWwpzbQ27EQX"},"outputs":[],"source":["dataset = client.get_dataset(dataset_name)\n","\n","num_entries = len(dataset.entry)\n","print(f\"number of items in the dataset: {num_entries}\")"]},{"cell_type":"markdown","metadata":{"id":"Cid4dUXO7GCy"},"source":["4.   Export the dataset to a format compatible with Python and inspect in a table\n","\n","\n","> * In pymemri, the `Dataset` class can format your dataset to different datatypes using the `Dataset.to` method; here we will use Pandas.\n","> * The columns of the dataset are inferred automatically. If you want to use custom columns, you can use the `columns` argument. See the [dataset documentation](https://docs.memri.io/component-architectures/plugins/datasets/) for more info."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Bz8XW0fB7Prl"},"outputs":[],"source":["data = dataset.to(\"pandas\")\n","data.head()"]},{"cell_type":"markdown","metadata":{"id":"c4sisozj7eTm"},"source":["## 2. Fine-tune a model\n","\n"]},{"cell_type":"markdown","source":["1.   
Configure the distilRoBERTa model on your dataset\n","> The transformers library contains all code to do the training, you only need to define a torch Dataset that contains our data and handles tokenization.\n"],"metadata":{"id":"KL2VtYuDX9YZ"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"KjKuK2wi7iAN"},"outputs":[],"source":["# Hyperparameters\n","model_name = \"distilroberta-base\"\n","batch_size = 32\n","learning_rate = 1e-3\n","\n","class TransformerDataset(torch.utils.data.Dataset):\n","    def __init__(self, data: pd.DataFrame, tokenizer: transformers.PreTrainedTokenizerBase):\n","        self.data = data\n","        self.label2idx, self.idx2label = self.get_label_map()\n","        self.num_labels = len(self.label2idx)\n","        self.tokenizer = tokenizer\n","        \n","    def tokenize(self, message, label=None):\n","        tokenized = self.tokenizer(message, padding=\"max_length\", truncation=True)\n","        if label:\n","            tokenized[\"label\"] = self.label2idx[label]\n","        return tokenized\n","\n","    def get_label_map(self):\n","        unique_labels = data[\"annotation.labelValue\"].unique()\n","        return {l: i for i, l in enumerate(unique_labels)}, {i: l for i, l in enumerate(unique_labels)}\n","        \n","    def __len__(self):\n","        return len(self.data)\n","        \n","    def __getitem__(self, idx):\n","        # Get the row from self.data, and skip the first column (id).\n","        return self.tokenize(*self.data.iloc[idx][1:])\n","\n","tokenizer = AutoTokenizer.from_pretrained(model_name)\n","dataset = TransformerDataset(data, tokenizer)"]},{"cell_type":"markdown","metadata":{"id":"OFFbLaAK7nli"},"source":["2.   Train and finetune the model\n","\n","> * The 🤗 Transformers library provides all code needed to train a RoBERTa model. 
Read their [tutorial on fine-tuning models](https://huggingface.co/docs/transformers/training)\n","* We use Trainer class, as it handles all training, monitoring and integration with [Weights & Biases](https://wandb.ai/site)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"F2ludJvo7qD8"},"outputs":[],"source":["# Load model\n","model = AutoModelForSequenceClassification.from_pretrained(\n","    model_name,\n","    num_labels=dataset.num_labels,\n","    id2label=dataset.idx2label\n",")\n","\n","# To increase training speed, we will freeze all layers except the classifier head.\n","for param in model.base_model.parameters():\n","    param.requires_grad = False\n","training_args = transformers.TrainingArguments(\n","    \"twitter-emoji-trainer\",\n","    learning_rate=learning_rate,\n","    per_device_train_batch_size=batch_size,\n","    logging_steps=1,\n","    optim=\"adamw_torch\"\n",")\n","\n","trainer = transformers.Trainer(\n","    model=model,\n","    args=training_args,\n","    train_dataset=dataset\n",")\n","logging.set_verbosity(40)\n","trainer.train()"]},{"cell_type":"markdown","metadata":{"id":"XMFRgMo38Eah"},"source":["## 3. Upload your model to a data app plugin"]},{"cell_type":"markdown","metadata":{"id":"nbJOIR8g75_c"},"source":["Now that your model is trained, it will be uploaded to your new GitLab project.\n","\n","\n","1.   Run the cell\n","2.   Copy and paste the GitLab project name from the your screen on app.memri.io\n","\n","\n","> * To avoid errors, make sure your GitLab project does not have any full stops in the name/URL"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"xm2PgsAP77BX"},"outputs":[],"source":["project_name = input(\"project name:\") if \"project_name\" not in locals() else project_name\n","write_model_to_package_registry(model, project_name=project_name, client=client)"]},{"cell_type":"markdown","metadata":{"id":"8qVHRe9Y8AlU"},"source":["That's it! 
🎉\n","\n","You have trained a ML model and made it accesible via the package registry, ready to be used in your data app.\n","\n","Check out the next step to see how to [build a plugin and deploy a data app](https://docs.memri.io/tutorials/build_a_sentiment_analysis_app/#deploy-your-data-app/)."]}],"metadata":{"accelerator":"GPU","colab":{"collapsed_sections":[],"provenance":[{"file_id":"1WX1VYwoAQ_2yOMzqvrIz77x52vaNmrtW","timestamp":1655375191426}]},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":0}
+{"cells":[{"cell_type":"markdown","source":["# Load a dataset and train your model\n","Once you have labeled your data on the Memri platform, you can use it to train your model in [this Google Colab notebook](https://colab.research.google.com/drive/189JJ2gLHAtxlmzc5XI3HhB9_VE3fT6DT)*.\n","\n","In this guide you will:\n","\n","1.   Load a labeled dataset from the POD\n","2.   Train a distilRoBERTa text classifier model on a labelled dataset\n","3.   Upload a trained model to use in a plugin for a data app\n","\n","\n","\n","\n","> * If you are unfamiliar with Google Colab notebooks, have a look at [this quick intro.](https://colab.research.google.com/)\n","* Make sure to run the below cells, one by one, in the correct order to avoid errors!\n","* In this guide we are helping you connect your own personal data from your Memri POD, alternitively you can use the [Tweet eval emoji](https://huggingface.co/datasets/tweet_eval#source-data) datasets, which is available from 🤗 [Hugging Face](https://huggingface.co/docs/datasets/index.html).\n","* If you don't wish to use your personal data, or you don't want to spend time training a model, you can simply use our [sentiment-plugin](https://gitlab.memri.io/koenvanderveen/sentiment-plugin/-/packages/6), which uses a pre-trained model from 🤗 Hugging Face. Just paste the  plugin repo address at project set-up step on the Memri platform, and skip this process."],"metadata":{"id":"suH5kbdZ-Zfi"}},{"cell_type":"markdown","metadata":{"id":"A22x_JOU5XVk"},"source":["## Setup"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"1ov_QUj2qnZy"},"outputs":[],"source":["from IPython.display import clear_output\n","!pip install pandas transformers torch git+https://gitlab.memri.io/memri/[email protected]\n","clear_output()\n","print(\"Installed\")"]},{"cell_type":"markdown","metadata":{"id":"j8t1xaae8uQJ"},"source":["1. 
Import the libraries needed to train your model\n","\n","> * Make sure to run the installation step above first to avoid errors!\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"zNkau9-8q9rC"},"outputs":[],"source":["import os\n","import random\n","import textwrap\n","\n","import pandas as pd\n","import torch\n","import transformers\n","from transformers import AutoModelForSequenceClassification, AutoTokenizer\n","from transformers.utils import logging\n","\n","from pymemri.data.itembase import Edge, Item\n","from pymemri.data.schema import Dataset, Message, CategoricalLabel\n","from pymemri.data.oauth import OauthFlow\n","from pymemri.data.loader import write_model_to_package_registry\n","from pymemri.pod.client import PodClient\n","from getpass import getpass\n","transformers.utils.logging.set_verbosity_error()\n","os.environ[\"WANDB_DISABLED\"] = \"true\""]},{"cell_type":"markdown","metadata":{"id":"gby4o-am6Lsa"},"source":["## 1. Load your dataset from the POD"]},{"cell_type":"markdown","source":["1. Run the cell\n","2. Copy your Dataset Name, Login Key and Password Key from your app.memri.io screen, and paste them below as prompted to load your connected dataset from you POD."],"metadata":{"id":"CqWDS-4kWG3s"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"M-bZ2bW-6ATQ"},"outputs":[],"source":["### *Define your pod url here*, this is the one for qa.app.memri.io ####\n","pod_url = \"https://qa.pod.memri.io\"\n","### *Define your dataset here* ####\n","dataset_name = input(\"dataset_name:\") if \"dataset_name\" not in locals() else dataset_name\n","### *Define your login key here* ####\n","owner_key = getpass(\"owner key:\") if \"owner_key\" not in locals() else owner_key\n","### *Define your password key here* ####\n","database_key = getpass(\"database_key:\") if \"database_key\" not in locals() else database_key"]},{"cell_type":"markdown","source":["2.   
Connect your POD to load your data"],"metadata":{"id":"v6jxJtPwSMg3"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"EB2zI24IrEPH"},"outputs":[],"source":["# Connect to pod\n","client = PodClient(\n","    url=pod_url,\n","    owner_key=owner_key,\n","    database_key=database_key,\n",")\n","client.add_to_schema(CategoricalLabel, Message, Dataset, OauthFlow);"]},{"cell_type":"markdown","metadata":{"id":"7liaUCkM59U5"},"source":["3.   Download and inspect the dataset\n","\n","> * All entries in the dataset can be found via the Dataset.entry edge\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"dWwpzbQ27EQX"},"outputs":[],"source":["dataset = client.get_dataset(dataset_name)\n","\n","num_entries = len(dataset.entry)\n","print(f\"number of items in the dataset: {num_entries}\")"]},{"cell_type":"markdown","metadata":{"id":"Cid4dUXO7GCy"},"source":["4.   Export the dataset to a format compatible with Python and inspect in a table\n","\n","\n","> * In pymemri, the `Dataset` class can format your dataset to different datatypes using the `Dataset.to` method; here we will use Pandas.\n","> * The columns of the dataset are inferred automatically. If you want to use custom columns, you can use the `columns` argument. See the [dataset documentation](https://docs.memri.io/component-architectures/plugins/datasets/) for more info."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Bz8XW0fB7Prl"},"outputs":[],"source":["data = dataset.to(\"pandas\")\n","data.head()"]},{"cell_type":"markdown","metadata":{"id":"c4sisozj7eTm"},"source":["## 2. Fine-tune a model\n","\n"]},{"cell_type":"markdown","source":["1.   
Configure the distilRoBERTa model on your dataset\n","> The transformers library contains all code to do the training, you only need to define a torch Dataset that contains our data and handles tokenization.\n"],"metadata":{"id":"KL2VtYuDX9YZ"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"KjKuK2wi7iAN"},"outputs":[],"source":["# Hyperparameters\n","model_name = \"distilroberta-base\"\n","batch_size = 32\n","learning_rate = 1e-3\n","\n","class TransformerDataset(torch.utils.data.Dataset):\n","    def __init__(self, data: pd.DataFrame, tokenizer: transformers.PreTrainedTokenizerBase):\n","        self.data = data\n","        self.label2idx, self.idx2label = self.get_label_map()\n","        self.num_labels = len(self.label2idx)\n","        self.tokenizer = tokenizer\n","        \n","    def tokenize(self, message, label=None):\n","        tokenized = self.tokenizer(message, padding=\"max_length\", truncation=True)\n","        if label:\n","            tokenized[\"label\"] = self.label2idx[label]\n","        return tokenized\n","\n","    def get_label_map(self):\n","        unique_labels = data[\"annotation.labelValue\"].unique()\n","        return {l: i for i, l in enumerate(unique_labels)}, {i: l for i, l in enumerate(unique_labels)}\n","        \n","    def __len__(self):\n","        return len(self.data)\n","        \n","    def __getitem__(self, idx):\n","        # Get the row from self.data, and skip the first column (id).\n","        return self.tokenize(*self.data.iloc[idx][1:])\n","\n","tokenizer = AutoTokenizer.from_pretrained(model_name)\n","dataset = TransformerDataset(data, tokenizer)"]},{"cell_type":"markdown","metadata":{"id":"OFFbLaAK7nli"},"source":["2.   Train and finetune the model\n","\n","> * The 🤗 Transformers library provides all code needed to train a RoBERTa model. 
Read their [tutorial on fine-tuning models](https://huggingface.co/docs/transformers/training)\n","* We use Trainer class, as it handles all training, monitoring and integration with [Weights & Biases](https://wandb.ai/site)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"F2ludJvo7qD8"},"outputs":[],"source":["# Load model\n","model = AutoModelForSequenceClassification.from_pretrained(\n","    model_name,\n","    num_labels=dataset.num_labels,\n","    id2label=dataset.idx2label\n",")\n","\n","# To increase training speed, we will freeze all layers except the classifier head.\n","for param in model.base_model.parameters():\n","    param.requires_grad = False\n","training_args = transformers.TrainingArguments(\n","    \"twitter-emoji-trainer\",\n","    learning_rate=learning_rate,\n","    per_device_train_batch_size=batch_size,\n","    logging_steps=1,\n","    optim=\"adamw_torch\"\n",")\n","\n","trainer = transformers.Trainer(\n","    model=model,\n","    args=training_args,\n","    train_dataset=dataset\n",")\n","logging.set_verbosity(40)\n","trainer.train()"]},{"cell_type":"markdown","metadata":{"id":"XMFRgMo38Eah"},"source":["## 3. Upload your model to a data app plugin"]},{"cell_type":"markdown","metadata":{"id":"nbJOIR8g75_c"},"source":["Now that your model is trained, it will be uploaded to your new GitLab project.\n","\n","\n","1.   Run the cell\n","2.   Copy and paste the GitLab project name from the your screen on app.memri.io\n","\n","\n","> * To avoid errors, make sure your GitLab project does not have any full stops in the name/URL"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"xm2PgsAP77BX"},"outputs":[],"source":["project_name = input(\"project name:\") if \"project_name\" not in locals() else project_name\n","write_model_to_package_registry(model, project_name=project_name, client=client)"]},{"cell_type":"markdown","metadata":{"id":"8qVHRe9Y8AlU"},"source":["That's it! 
🎉\n","\n","You have trained a ML model and made it accesible via the package registry, ready to be used in your data app.\n","\n","Check out the next step to see how to [build a plugin and deploy a data app](https://docs.memri.io/tutorials/build_a_sentiment_analysis_app/#deploy-your-data-app/)."]}],"metadata":{"accelerator":"GPU","colab":{"collapsed_sections":[],"provenance":[{"file_id":"1WX1VYwoAQ_2yOMzqvrIz77x52vaNmrtW","timestamp":1655375191426}]},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":0}
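One detail in the notebook on both sides of this diff: `TransformerDataset.get_label_map` reads the module-level `data` variable rather than `self.data`, so the label maps depend on whatever `data` holds when the class is instantiated. The following is a minimal sketch, not the notebook's code, of building the same two maps from an explicitly passed label sequence. The helper name `build_label_maps` and the example labels are hypothetical. It uses only the standard library and keeps labels in first-seen order, which is the order pandas' `Series.unique()` returns.

```python
from typing import Dict, Hashable, Iterable, Tuple


def build_label_maps(
    labels: Iterable[Hashable],
) -> Tuple[Dict[Hashable, int], Dict[int, Hashable]]:
    """Return (label2idx, idx2label) for the distinct labels, in first-seen order."""
    # dict.fromkeys drops duplicates while keeping insertion order,
    # mirroring the order-of-appearance behaviour of pandas' Series.unique().
    unique_labels = list(dict.fromkeys(labels))
    label2idx = {label: idx for idx, label in enumerate(unique_labels)}
    idx2label = {idx: label for idx, label in enumerate(unique_labels)}
    return label2idx, idx2label


# Hypothetical labels, standing in for data["annotation.labelValue"].
example_labels = ["positive", "negative", "positive", "neutral"]
label2idx, idx2label = build_label_maps(example_labels)
print(label2idx)
print(idx2label)
```

Inside the class, the equivalent change would be to pass `self.data["annotation.labelValue"]` to such a helper, so the dataset no longer relies on a global variable.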
Original file line number	Diff line number	Diff line change
		@@ -1 +1 @@
		{"cells":[{"cell_type":"markdown","source":["# Load a dataset and train your model\n","Once you have labeled your data on the Memri platform, you can use it to train your model in [this Google Colab notebook](https://colab.research.google.com/drive/189JJ2gLHAtxlmzc5XI3HhB9_VE3fT6DT).\n","\n","In this guide you will:\n","\n","1. Load a labeled dataset from the POD\n","2. Train a distilRoBERTa text classifier model on a labelled dataset\n","3. Upload a trained model to use in a plugin for a data app\n","\n","\n","\n","\n","> If you are unfamiliar with Google Colab notebooks, have a look at [this quick intro.](https://colab.research.google.com/)\n","* Make sure to run the below cells, one by one, in the correct order to avoid errors!\n","* In this guide we are helping you connect your own personal data from your Memri POD, alternitively you can use the [Tweet eval emoji](https://huggingface.co/datasets/tweet_eval#source-data) datasets, which is available from 🤗 [Hugging Face](https://huggingface.co/docs/datasets/index.html).\n","* If you don't wish to use your personal data, or you don't want to spend time training a model, you can simply use our [sentiment-plugin](https://gitlab.memri.io/koenvanderveen/sentiment-plugin/-/packages/6), which uses a pre-trained model from 🤗 Hugging Face. Just paste the plugin repo address at project set-up step on the Memri platform, and skip this process."],"metadata":{"id":"suH5kbdZ-Zfi"}},{"cell_type":"markdown","metadata":{"id":"A22x_JOU5XVk"},"source":["## Setup"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"1ov_QUj2qnZy"},"outputs":[],"source":["from IPython.display import clear_output\n","!pip install pandas transformers torch git+https://gitlab.memri.io/memri/[email protected]\n","clear_output()\n","print(\"Installed\")"]},{"cell_type":"markdown","metadata":{"id":"j8t1xaae8uQJ"},"source":["1. 
Import the libraries needed to train your model\n","\n","> * Make sure to run the installation step above first to avoid errors!\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"zNkau9-8q9rC"},"outputs":[],"source":["import os\n","import random\n","import textwrap\n","\n","import pandas as pd\n","import torch\n","import transformers\n","from transformers import AutoModelForSequenceClassification, AutoTokenizer\n","from transformers.utils import logging\n","\n","from pymemri.data.itembase import Edge, Item\n","from pymemri.data.schema import Dataset, Message, CategoricalLabel\n","from pymemri.data.oauth import OauthFlow\n","from pymemri.data.loader import write_model_to_package_registry\n","from pymemri.pod.client import PodClient\n","from getpass import getpass\n","transformers.utils.logging.set_verbosity_error()\n","os.environ[\"WANDB_DISABLED\"] = \"true\""]},{"cell_type":"markdown","metadata":{"id":"gby4o-am6Lsa"},"source":["## 1. Load your dataset from the POD"]},{"cell_type":"markdown","source":["1. Run the cell\n","2. Copy your Dataset Name, Login Key and Password Key from your app.memri.io screen, and paste them below as prompted to load your connected dataset from you POD."],"metadata":{"id":"CqWDS-4kWG3s"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"M-bZ2bW-6ATQ"},"outputs":[],"source":["### Define your pod url here, this is the one for dev.app.memri.io ####\n","pod_url = \"https://dev.pod.memri.io\"\n","### Define your dataset here ####\n","dataset_name = input(\"dataset_name:\") if \"dataset_name\" not in locals() else dataset_name\n","### Define your login key here ####\n","owner_key = getpass(\"owner key:\") if \"owner_key\" not in locals() else owner_key\n","### Define your password key here ####\n","database_key = getpass(\"database_key:\") if \"database_key\" not in locals() else database_key"]},{"cell_type":"markdown","source":["2. 
Connect your POD to load your data"],"metadata":{"id":"v6jxJtPwSMg3"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"EB2zI24IrEPH"},"outputs":[],"source":["# Connect to pod\n","client = PodClient(\n"," url=pod_url,\n"," owner_key=owner_key,\n"," database_key=database_key,\n",")\n","client.add_to_schema(CategoricalLabel, Message, Dataset, OauthFlow);"]},{"cell_type":"markdown","metadata":{"id":"7liaUCkM59U5"},"source":["3. Download and inspect the dataset\n","\n","> * All entries in the dataset can be found via the Dataset.entry edge\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"dWwpzbQ27EQX"},"outputs":[],"source":["dataset = client.get_dataset(dataset_name)\n","\n","num_entries = len(dataset.entry)\n","print(f\"number of items in the dataset: {num_entries}\")"]},{"cell_type":"markdown","metadata":{"id":"Cid4dUXO7GCy"},"source":["4. Export the dataset to a format compatible with Python and inspect in a table\n","\n","\n","> * In pymemri, the `Dataset` class can format your dataset to different datatypes using the `Dataset.to` method; here we will use Pandas.\n","> * The columns of the dataset are inferred automatically. If you want to use custom columns, you can use the `columns` argument. See the [dataset documentation](https://docs.memri.io/component-architectures/plugins/datasets/) for more info."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Bz8XW0fB7Prl"},"outputs":[],"source":["data = dataset.to(\"pandas\")\n","data.head()"]},{"cell_type":"markdown","metadata":{"id":"c4sisozj7eTm"},"source":["## 2. Fine-tune a model\n","\n"]},{"cell_type":"markdown","source":["1. 
Configure the distilRoBERTa model on your dataset\n","> The transformers library contains all code to do the training, you only need to define a torch Dataset that contains our data and handles tokenization.\n"],"metadata":{"id":"KL2VtYuDX9YZ"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"KjKuK2wi7iAN"},"outputs":[],"source":["# Hyperparameters\n","model_name = \"distilroberta-base\"\n","batch_size = 32\n","learning_rate = 1e-3\n","\n","class TransformerDataset(torch.utils.data.Dataset):\n"," def __init__(self, data: pd.DataFrame, tokenizer: transformers.PreTrainedTokenizerBase):\n"," self.data = data\n"," self.label2idx, self.idx2label = self.get_label_map()\n"," self.num_labels = len(self.label2idx)\n"," self.tokenizer = tokenizer\n"," \n"," def tokenize(self, message, label=None):\n"," tokenized = self.tokenizer(message, padding=\"max_length\", truncation=True)\n"," if label:\n"," tokenized[\"label\"] = self.label2idx[label]\n"," return tokenized\n","\n"," def get_label_map(self):\n"," unique_labels = data[\"annotation.labelValue\"].unique()\n"," return {l: i for i, l in enumerate(unique_labels)}, {i: l for i, l in enumerate(unique_labels)}\n"," \n"," def __len__(self):\n"," return len(self.data)\n"," \n"," def __getitem__(self, idx):\n"," # Get the row from self.data, and skip the first column (id).\n"," return self.tokenize(self.data.iloc[idx][1:])\n","\n","tokenizer = AutoTokenizer.from_pretrained(model_name)\n","dataset = TransformerDataset(data, tokenizer)"]},{"cell_type":"markdown","metadata":{"id":"OFFbLaAK7nli"},"source":["2. Train and finetune the model\n","\n","> The 🤗 Transformers library provides all code needed to train a RoBERTa model. 
Read their [tutorial on fine-tuning models](https://huggingface.co/docs/transformers/training)\n","* We use Trainer class, as it handles all training, monitoring and integration with [Weights & Biases](https://wandb.ai/site)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"F2ludJvo7qD8"},"outputs":[],"source":["# Load model\n","model = AutoModelForSequenceClassification.from_pretrained(\n"," model_name,\n"," num_labels=dataset.num_labels,\n"," id2label=dataset.idx2label\n",")\n","\n","# To increase training speed, we will freeze all layers except the classifier head.\n","for param in model.base_model.parameters():\n"," param.requires_grad = False\n","training_args = transformers.TrainingArguments(\n"," \"twitter-emoji-trainer\",\n"," learning_rate=learning_rate,\n"," per_device_train_batch_size=batch_size,\n"," logging_steps=1,\n"," optim=\"adamw_torch\"\n",")\n","\n","trainer = transformers.Trainer(\n"," model=model,\n"," args=training_args,\n"," train_dataset=dataset\n",")\n","logging.set_verbosity(40)\n","trainer.train()"]},{"cell_type":"markdown","metadata":{"id":"XMFRgMo38Eah"},"source":["## 3. Upload your model to a data app plugin"]},{"cell_type":"markdown","metadata":{"id":"nbJOIR8g75_c"},"source":["Now that your model is trained, it will be uploaded to your new GitLab project.\n","\n","\n","1. Run the cell\n","2. Copy and paste the GitLab project name from the your screen on app.memri.io\n","\n","\n","> * To avoid errors, make sure your GitLab project does not have any full stops in the name/URL"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"xm2PgsAP77BX"},"outputs":[],"source":["project_name = input(\"project name:\") if \"project_name\" not in locals() else project_name\n","write_model_to_package_registry(model, project_name=project_name, client=client)"]},{"cell_type":"markdown","metadata":{"id":"8qVHRe9Y8AlU"},"source":["That's it! 
🎉\n","\n","You have trained a ML model and made it accesible via the package registry, ready to be used in your data app.\n","\n","Check out the next step to see how to [build a plugin and deploy a data app](https://docs.memri.io/tutorials/build_a_sentiment_analysis_app/#deploy-your-data-app/)."]}],"metadata":{"accelerator":"GPU","colab":{"collapsed_sections":[],"provenance":[{"file_id":"1WX1VYwoAQ_2yOMzqvrIz77x52vaNmrtW","timestamp":1655375191426}]},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":0}
		{"cells":[{"cell_type":"markdown","source":["# Load a dataset and train your model\n","Once you have labeled your data on the Memri platform, you can use it to train your model in [this Google Colab notebook](https://colab.research.google.com/drive/189JJ2gLHAtxlmzc5XI3HhB9_VE3fT6DT).\n","\n","In this guide you will:\n","\n","1. Load a labeled dataset from the POD\n","2. Train a distilRoBERTa text classifier model on a labelled dataset\n","3. Upload a trained model to use in a plugin for a data app\n","\n","\n","\n","\n","> If you are unfamiliar with Google Colab notebooks, have a look at [this quick intro.](https://colab.research.google.com/)\n","* Make sure to run the below cells, one by one, in the correct order to avoid errors!\n","* In this guide we are helping you connect your own personal data from your Memri POD, alternitively you can use the [Tweet eval emoji](https://huggingface.co/datasets/tweet_eval#source-data) datasets, which is available from 🤗 [Hugging Face](https://huggingface.co/docs/datasets/index.html).\n","* If you don't wish to use your personal data, or you don't want to spend time training a model, you can simply use our [sentiment-plugin](https://gitlab.memri.io/koenvanderveen/sentiment-plugin/-/packages/6), which uses a pre-trained model from 🤗 Hugging Face. Just paste the plugin repo address at project set-up step on the Memri platform, and skip this process."],"metadata":{"id":"suH5kbdZ-Zfi"}},{"cell_type":"markdown","metadata":{"id":"A22x_JOU5XVk"},"source":["## Setup"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"1ov_QUj2qnZy"},"outputs":[],"source":["from IPython.display import clear_output\n","!pip install pandas transformers torch git+https://gitlab.memri.io/memri/[email protected]\n","clear_output()\n","print(\"Installed\")"]},{"cell_type":"markdown","metadata":{"id":"j8t1xaae8uQJ"},"source":["1. 
Import the libraries needed to train your model\n","\n","> * Make sure to run the installation step above first to avoid errors!\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"zNkau9-8q9rC"},"outputs":[],"source":["import os\n","import random\n","import textwrap\n","\n","import pandas as pd\n","import torch\n","import transformers\n","from transformers import AutoModelForSequenceClassification, AutoTokenizer\n","from transformers.utils import logging\n","\n","from pymemri.data.itembase import Edge, Item\n","from pymemri.data.schema import Dataset, Message, CategoricalLabel\n","from pymemri.data.oauth import OauthFlow\n","from pymemri.data.loader import write_model_to_package_registry\n","from pymemri.pod.client import PodClient\n","from getpass import getpass\n","transformers.utils.logging.set_verbosity_error()\n","os.environ[\"WANDB_DISABLED\"] = \"true\""]},{"cell_type":"markdown","metadata":{"id":"gby4o-am6Lsa"},"source":["## 1. Load your dataset from the POD"]},{"cell_type":"markdown","source":["1. Run the cell\n","2. Copy your Dataset Name, Login Key and Password Key from your app.memri.io screen, and paste them below as prompted to load your connected dataset from you POD."],"metadata":{"id":"CqWDS-4kWG3s"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"M-bZ2bW-6ATQ"},"outputs":[],"source":["### Define your pod url here, this is the one for qa.app.memri.io ####\n","pod_url = \"https://qa.pod.memri.io\"\n","### Define your dataset here ####\n","dataset_name = input(\"dataset_name:\") if \"dataset_name\" not in locals() else dataset_name\n","### Define your login key here ####\n","owner_key = getpass(\"owner key:\") if \"owner_key\" not in locals() else owner_key\n","### Define your password key here ####\n","database_key = getpass(\"database_key:\") if \"database_key\" not in locals() else database_key"]},{"cell_type":"markdown","source":["2. 
Connect your POD to load your data"],"metadata":{"id":"v6jxJtPwSMg3"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"EB2zI24IrEPH"},"outputs":[],"source":["# Connect to pod\n","client = PodClient(\n"," url=pod_url,\n"," owner_key=owner_key,\n"," database_key=database_key,\n",")\n","client.add_to_schema(CategoricalLabel, Message, Dataset, OauthFlow);"]},{"cell_type":"markdown","metadata":{"id":"7liaUCkM59U5"},"source":["3. Download and inspect the dataset\n","\n","> * All entries in the dataset can be found via the Dataset.entry edge\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"dWwpzbQ27EQX"},"outputs":[],"source":["dataset = client.get_dataset(dataset_name)\n","\n","num_entries = len(dataset.entry)\n","print(f\"number of items in the dataset: {num_entries}\")"]},{"cell_type":"markdown","metadata":{"id":"Cid4dUXO7GCy"},"source":["4. Export the dataset to a format compatible with Python and inspect in a table\n","\n","\n","> * In pymemri, the `Dataset` class can format your dataset to different datatypes using the `Dataset.to` method; here we will use Pandas.\n","> * The columns of the dataset are inferred automatically. If you want to use custom columns, you can use the `columns` argument. See the [dataset documentation](https://docs.memri.io/component-architectures/plugins/datasets/) for more info."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Bz8XW0fB7Prl"},"outputs":[],"source":["data = dataset.to(\"pandas\")\n","data.head()"]},{"cell_type":"markdown","metadata":{"id":"c4sisozj7eTm"},"source":["## 2. Fine-tune a model\n","\n"]},{"cell_type":"markdown","source":["1. 
Configure the distilRoBERTa model on your dataset\n","> The transformers library contains all the code needed for training; you only need to define a torch Dataset that contains your data and handles tokenization.\n"],"metadata":{"id":"KL2VtYuDX9YZ"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"KjKuK2wi7iAN"},"outputs":[],"source":["# Hyperparameters\n","model_name = \"distilroberta-base\"\n","batch_size = 32\n","learning_rate = 1e-3\n","\n","class TransformerDataset(torch.utils.data.Dataset):\n","    def __init__(self, data: pd.DataFrame, tokenizer: transformers.PreTrainedTokenizerBase):\n","        self.data = data\n","        self.label2idx, self.idx2label = self.get_label_map()\n","        self.num_labels = len(self.label2idx)\n","        self.tokenizer = tokenizer\n","\n","    def tokenize(self, message, label=None):\n","        tokenized = self.tokenizer(message, padding=\"max_length\", truncation=True)\n","        if label:\n","            tokenized[\"label\"] = self.label2idx[label]\n","        return tokenized\n","\n","    def get_label_map(self):\n","        # Build the label <-> index maps from this dataset's own labels.\n","        unique_labels = self.data[\"annotation.labelValue\"].unique()\n","        return {l: i for i, l in enumerate(unique_labels)}, {i: l for i, l in enumerate(unique_labels)}\n","\n","    def __len__(self):\n","        return len(self.data)\n","\n","    def __getitem__(self, idx):\n","        # Get the row from self.data, skip the first column (id), and\n","        # unpack the remaining values as (message, label).\n","        return self.tokenize(*self.data.iloc[idx][1:])\n","\n","tokenizer = AutoTokenizer.from_pretrained(model_name)\n","dataset = TransformerDataset(data, tokenizer)"]},{"cell_type":"markdown","metadata":{"id":"OFFbLaAK7nli"},"source":["2. Train and finetune the model\n","\n","> The 🤗 Transformers library provides all code needed to train a RoBERTa model. 
Read their [tutorial on fine-tuning models](https://huggingface.co/docs/transformers/training).\n","* We use the Trainer class, as it handles all training, monitoring and integration with [Weights & Biases](https://wandb.ai/site)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"F2ludJvo7qD8"},"outputs":[],"source":["# Load model\n","model = AutoModelForSequenceClassification.from_pretrained(\n","    model_name,\n","    num_labels=dataset.num_labels,\n","    id2label=dataset.idx2label\n",")\n","\n","# To increase training speed, we will freeze all layers except the classifier head.\n","for param in model.base_model.parameters():\n","    param.requires_grad = False\n","training_args = transformers.TrainingArguments(\n","    \"twitter-emoji-trainer\",\n","    learning_rate=learning_rate,\n","    per_device_train_batch_size=batch_size,\n","    logging_steps=1,\n","    optim=\"adamw_torch\"\n",")\n","\n","trainer = transformers.Trainer(\n","    model=model,\n","    args=training_args,\n","    train_dataset=dataset\n",")\n","logging.set_verbosity(40)\n","trainer.train()"]},{"cell_type":"markdown","metadata":{"id":"XMFRgMo38Eah"},"source":["## 3. Upload your model to a data app plugin"]},{"cell_type":"markdown","metadata":{"id":"nbJOIR8g75_c"},"source":["Now that your model is trained, it will be uploaded to your new GitLab project.\n","\n","\n","1. Run the cell\n","2. Copy and paste the GitLab project name from your screen on app.memri.io\n","\n","\n","> * To avoid errors, make sure your GitLab project does not have any full stops in the name/URL"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"xm2PgsAP77BX"},"outputs":[],"source":["project_name = input(\"project name:\") if \"project_name\" not in locals() else project_name\n","write_model_to_package_registry(model, project_name=project_name, client=client)"]},{"cell_type":"markdown","metadata":{"id":"8qVHRe9Y8AlU"},"source":["That's it! 
🎉\n","\n","You have trained an ML model and made it accessible via the package registry, ready to be used in your data app.\n","\n","Check out the next step to see how to [build a plugin and deploy a data app](https://docs.memri.io/tutorials/build_a_sentiment_analysis_app/#deploy-your-data-app/)."]}],"metadata":{"accelerator":"GPU","colab":{"collapsed_sections":[],"provenance":[{"file_id":"1WX1VYwoAQ_2yOMzqvrIz77x52vaNmrtW","timestamp":1655375191426}]},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":0}