{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "sKB26q-4-X8r" }, "source": [ "# CS224N: Hugging Face Transformers Tutorial (Spring '24)\n", "Original Author: Ben Newman\n", "\n", "Thanks to Anna Goldie for feedback!\n", "\n", "This notebook gives an introduction to the Hugging Face Transformers Python library and some common patterns that you can use to take advantage of it. It is most useful for using or fine-tuning pretrained transformer models for your projects.\n", "\n", "\n", "Hugging Face provides access to models (both the code that implements them and their pre-trained weights, including the latest LLMs such as Llama 3 and DBRX), model-specific tokenizers, and pipelines for common NLP tasks, as well as datasets and metrics in a separate `datasets` package. It has implementations in PyTorch, TensorFlow, and Flax (though we'll be using the PyTorch versions here!).\n", "\n", "\n", "We're going to go through a few use cases:\n", "* Overview of Tokenizers and Models\n", "* Fine-tuning for your own task. We'll use a sentiment-classification example.\n", "\n", "\n", "The professor spoke about a few main project types in last Thursday's lecture:\n", "1. Applying an existing pre-trained model to a new application or task and exploring how to approach/solve it\n", "2. Implementing a new or complex neural architecture and demonstrating its performance on some data\n", "3. Analyzing the behavior of a model: how it represents linguistic knowledge, what kinds of phenomena it can handle, or what errors it makes\n", "\n", "Of these, `transformers` will be the most help for (1) and (3). (2) involves a bit of a learning curve, but once you master it, you will find it very convenient to design a model based on the existing ones provided by Hugging Face. 
We won't be covering it here; please refer to [this example](https://huggingface.co/docs/transformers/en/custom_models) instead.\n", "\n", "\n", "\n", "Here are additional resources introducing the library that were used to make this tutorial:\n", "\n", "* [Hugging Face Docs](https://huggingface.co/docs/transformers/index)\n", " * Clear documentation\n", " * Tutorials, walk-throughs, and example notebooks\n", " * List of available models\n", "* [Hugging Face Course](https://huggingface.co/course/)\n", "* [Hugging Face Examples](https://github.com/huggingface/transformers/tree/main/examples/pytorch): you can find very similar code structures across very different downstream tasks and models.\n", "* [Hugging Face O'Reilly Book](https://www.oreilly.com/library/view/natural-language-processing/9781098103231/)\n", " * Students have FREE access through the Stanford Library!\n", "\n" ], "id": "sKB26q-4-X8r" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "collapsed": true, "id": "9EhWoZef-X8u", "outputId": "2cc17953-33a8-4628-c951-40b07dc0b0e0" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.40.1)\n", "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.14.0)\n", "Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.23.0)\n", "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.25.2)\n", "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.0)\n", "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.1)\n", "Requirement already satisfied: regex!=2019.12.17 in 
/usr/local/lib/python3.10/dist-packages (from transformers) (2023.12.25)\n", "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.31.0)\n", "Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.19.1)\n", "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.3)\n", "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.2)\n", "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.19.3->transformers) (2023.6.0)\n", "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.19.3->transformers) (4.11.0)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.3.2)\n", "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.7)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.0.7)\n", "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2024.2.2)\n", "Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (2.19.0)\n", "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.14.0)\n", "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (1.25.2)\n", "Requirement already satisfied: pyarrow>=12.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (14.0.2)\n", "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets) (0.6)\n", 
"Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.8)\n", "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (2.0.3)\n", "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.31.0)\n", "Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (4.66.2)\n", "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.4.1)\n", "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.16)\n", "Requirement already satisfied: fsspec[http]<=2024.3.1,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2023.6.0)\n", "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.9.5)\n", "Requirement already satisfied: huggingface-hub>=0.21.2 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.23.0)\n", "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets) (24.0)\n", "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0.1)\n", "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)\n", "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (23.2.0)\n", "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.4.1)\n", "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.0.5)\n", "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.9.4)\n", "Requirement already satisfied: 
async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)\n", "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.21.2->datasets) (4.11.0)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.3.2)\n", "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.7)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2.0.7)\n", "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2024.2.2)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)\n", "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2023.4)\n", "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.1)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n", "Requirement already satisfied: accelerate in /usr/local/lib/python3.10/dist-packages (0.30.0)\n", "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate) (1.25.2)\n", "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate) (24.0)\n", "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate) (5.9.5)\n", "Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate) (6.0.1)\n", "Requirement already satisfied: torch>=1.10.0 in 
/usr/local/lib/python3.10/dist-packages (from accelerate) (2.2.1+cu121)\n", "Requirement already satisfied: huggingface-hub in /usr/local/lib/python3.10/dist-packages (from accelerate) (0.23.0)\n", "Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from accelerate) (0.4.3)\n", "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (3.14.0)\n", "Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (4.11.0)\n", "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (1.12)\n", "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (3.3)\n", "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (3.1.3)\n", "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (2023.6.0)\n", "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (12.1.105)\n", "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (12.1.105)\n", "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (12.1.105)\n", "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (8.9.2.26)\n", "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (12.1.3.1)\n", "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/dist-packages (from 
torch>=1.10.0->accelerate) (11.0.2.54)\n", "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (10.3.2.106)\n", "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (11.4.5.107)\n", "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (12.1.0.106)\n", "Requirement already satisfied: nvidia-nccl-cu12==2.19.3 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (2.19.3)\n", "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (12.1.105)\n", "Requirement already satisfied: triton==2.2.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate) (2.2.0)\n", "Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch>=1.10.0->accelerate) (12.4.127)\n", "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from huggingface-hub->accelerate) (2.31.0)\n", "Requirement already satisfied: tqdm>=4.42.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub->accelerate) (4.66.2)\n", "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.10.0->accelerate) (2.1.5)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub->accelerate) (3.3.2)\n", "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub->accelerate) (3.7)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub->accelerate) (2.0.7)\n", "Requirement 
already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub->accelerate) (2024.2.2)\n", "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.10.0->accelerate) (1.3.0)\n" ] } ], "source": [ "!pip install transformers\n", "!pip install datasets\n", "!pip install accelerate" ], "id": "9EhWoZef-X8u" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Q9th7mpc-X8v" }, "outputs": [], "source": [ "from collections import defaultdict, Counter\n", "import json\n", "\n", "from matplotlib import pyplot as plt\n", "import numpy as np\n", "import torch\n", "\n", "def print_encoding(model_inputs, indent=4):\n", " indent_str = \" \" * indent\n", " print(\"{\")\n", " for k, v in model_inputs.items():\n", " print(indent_str + k + \":\")\n", " print(indent_str + indent_str + str(v))\n", " print(\"}\")" ], "id": "Q9th7mpc-X8v" }, { "cell_type": "markdown", "metadata": { "id": "qmXezUMg2idv" }, "source": [ "## Part 0: Common Pattern for using Hugging Face Transformers\n", "\n", "We're going to start off with a common usage pattern for Hugging Face Transformers, using the example of Sentiment Analysis.\n", "\n", "First, find a model on [the hub](https://huggingface.co/models). Anyone can upload their model for other people to use. 
(I'm using a sentiment analysis model from [this paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3489963)).\n", "\n", "Then, there are two objects that need to be initialized - a **tokenizer**, and a **model**\n", "\n", "* Tokenizer converts strings to lists of vocabulary ids that the model requires\n", "* Model takes the vocabulary ids and produces a prediction" ], "id": "qmXezUMg2idv" }, { "cell_type": "markdown", "metadata": { "id": "ySLmJ0Z-oD35" }, "source": [ "![full_nlp_pipeline.png](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)\n", "From [https://huggingface.co/course/chapter2/2?fw=pt](https://huggingface.co/course/chapter2/2?fw=pt)" ], "id": "ySLmJ0Z-oD35" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Mcsii_O42Z8Q", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "5bba7776-ee44-4d8b-9467-e32def806389" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. 
If you want to force a new download, use `force_download=True`.\n", " warnings.warn(\n" ] } ], "source": [ "from transformers import AutoTokenizer, AutoModelForSequenceClassification\n", "\n", "# Initialize the tokenizer\n", "tokenizer = AutoTokenizer.from_pretrained(\"siebert/sentiment-roberta-large-english\")\n", "# Initialize the model\n", "model = AutoModelForSequenceClassification.from_pretrained(\"siebert/sentiment-roberta-large-english\")" ], "id": "Mcsii_O42Z8Q" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kT_zeWRBoD36", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "d54a618e-14dc-4c0e-ef03-5472889f88c6" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Input:\n", "I'm excited to learn about Hugging Face Transformers!\n", "\n", "Tokenized Inputs:\n", "{\n", " input_ids:\n", " tensor([[ 0, 100, 437, 2283, 7, 1532, 59, 30581, 3923, 12346,\n", " 34379, 328, 2]])\n", " attention_mask:\n", " tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])\n", "}\n", "\n", "Model Outputs:\n", "SequenceClassifierOutput(loss=None, logits=tensor([[-3.7605, 2.9262]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)\n", "\n", "The prediction is POSITIVE\n" ] } ], "source": [ "inputs = \"I'm excited to learn about Hugging Face Transformers!\"\n", "tokenized_inputs = tokenizer(inputs, return_tensors=\"pt\")\n", "outputs = model(**tokenized_inputs)\n", "\n", "labels = ['NEGATIVE', 'POSITIVE']\n", "prediction = torch.argmax(outputs.logits)\n", "\n", "\n", "print(\"Input:\")\n", "print(inputs)\n", "print()\n", "print(\"Tokenized Inputs:\")\n", "print_encoding(tokenized_inputs)\n", "print()\n", "print(\"Model Outputs:\")\n", "print(outputs)\n", "print()\n", "print(f\"The prediction is {labels[prediction]}\")" ], "id": "kT_zeWRBoD36" }, { "cell_type": "markdown", "metadata": { "id": "a7jvH9haoD37" }, "source": [ "### 0.1 Tokenizers" ], "id": "a7jvH9haoD37" }, { "cell_type": "markdown", "metadata": { "id": "43FLbwgz-X83" }, 
"source": [ "Pretrained models are implemented along with **tokenizers** that are used to preprocess their inputs. The tokenizers take raw strings or lists of strings and output what are effectively dictionaries that contain the model inputs.\n", "\n", "\n", "You can access tokenizers either with the Tokenizer class specific to the model you want to use (here DistilBERT), or with the AutoTokenizer class.\n", "Fast Tokenizers are written in Rust, while their slow versions are written in Python." ], "id": "43FLbwgz-X83" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Pu6L0lWG-X83", "scrolled": true, "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "546092b8-a3a2-4fc6-ed1d-f06e3f5f86ff" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "DistilBertTokenizer(name_or_path='distilbert/distilbert-base-cased', vocab_size=28996, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={\n", "\t0: AddedToken(\"[PAD]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "\t100: AddedToken(\"[UNK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "\t101: AddedToken(\"[CLS]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "\t102: AddedToken(\"[SEP]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "\t103: AddedToken(\"[MASK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "}\n", "DistilBertTokenizerFast(name_or_path='distilbert/distilbert-base-cased', vocab_size=28996, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', 
special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={\n", "\t0: AddedToken(\"[PAD]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "\t100: AddedToken(\"[UNK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "\t101: AddedToken(\"[CLS]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "\t102: AddedToken(\"[SEP]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "\t103: AddedToken(\"[MASK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "}\n", "DistilBertTokenizerFast(name_or_path='distilbert/distilbert-base-cased', vocab_size=28996, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={\n", "\t0: AddedToken(\"[PAD]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "\t100: AddedToken(\"[UNK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "\t101: AddedToken(\"[CLS]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "\t102: AddedToken(\"[SEP]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "\t103: AddedToken(\"[MASK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "}\n" ] } ], "source": [ "from transformers import DistilBertTokenizer, DistilBertTokenizerFast, AutoTokenizer\n", "name = \"distilbert/distilbert-base-cased\"\n", "# name = \"user/name\" when loading from the Hub\n", "# name = local_path when loading files written by the save_pretrained() 
method\n", "\n", "tokenizer = DistilBertTokenizer.from_pretrained(name) # written in Python\n", "print(tokenizer)\n", "tokenizer = DistilBertTokenizerFast.from_pretrained(name) # written in Rust\n", "print(tokenizer)\n", "tokenizer = AutoTokenizer.from_pretrained(name) # convenient! Defaults to Fast\n", "print(tokenizer)" ], "id": "Pu6L0lWG-X83" }, { "cell_type": "code", "source": [], "metadata": { "id": "mUWigfZD1yIQ" }, "id": "mUWigfZD1yIQ", "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zrPzbBhR-X84", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "cc8c7486-6957-4304-fcce-0fbdcbcaf4f6" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Vanilla Tokenization\n", "{\n", " input_ids:\n", " [101, 20164, 10932, 10289, 25267, 1110, 1632, 106, 102]\n", " attention_mask:\n", " [1, 1, 1, 1, 1, 1, 1, 1, 1]\n", "}\n", "\n", "[101, 20164, 10932, 10289, 25267, 1110, 1632, 106, 102]\n", "[101, 20164, 10932, 10289, 25267, 1110, 1632, 106, 102]\n" ] } ], "source": [ "# This is how you call the tokenizer\n", "input_str = \"Hugging Face Transformers is great!\"\n", "tokenized_inputs = tokenizer(input_str) # https://huggingface.co/learn/nlp-course/en/chapter6/6\n", "\n", "\n", "print(\"Vanilla Tokenization\")\n", "print_encoding(tokenized_inputs)\n", "print()\n", "\n", "# Two ways to access:\n", "print(tokenized_inputs.input_ids)\n", "print(tokenized_inputs[\"input_ids\"])" ], "id": "zrPzbBhR-X84" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "E_8C6L2G-X85", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "8b0bd9ba-cc08-4ecc-86a4-e686c3dbbfa1" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "start: Hugging Face Transformers is great!\n", "tokenize: ['Hu', '##gging', 'Face', 'Transformers', 'is', 'great', '!']\n", "convert_tokens_to_ids: [20164, 10932, 10289, 25267, 1110, 1632, 106]\n", "add special tokens: [101, 20164, 
10932, 10289, 25267, 1110, 1632, 106, 102]\n", "--------\n", "decode: [CLS] Hugging Face Transformers is great! [SEP]\n" ] } ], "source": [ "cls = [tokenizer.cls_token_id]\n", "sep = [tokenizer.sep_token_id]\n", "\n", "# Tokenization happens in a few steps:\n", "input_tokens = tokenizer.tokenize(input_str)\n", "input_ids = tokenizer.convert_tokens_to_ids(input_tokens)\n", "input_ids_special_tokens = cls + input_ids + sep\n", "\n", "decoded_str = tokenizer.decode(input_ids_special_tokens)\n", "\n", "print(\"start: \", input_str)\n", "print(\"tokenize: \", input_tokens)\n", "print(\"convert_tokens_to_ids:\", input_ids)\n", "print(\"add special tokens: \", input_ids_special_tokens)\n", "print(\"--------\")\n", "print(\"decode: \", decoded_str)\n", "\n", "# NOTE: tokenize() and convert_tokens_to_ids() don't create the attention mask or add the\n", "# special tokens, which is why [CLS] and [SEP] are added by hand above" ], "id": "E_8C6L2G-X85" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tdoZ3EEU-X86", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "8d7aa737-db24-45aa-de6e-8a8699af3404" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Hugging Face Transformers is great!\n", "-----\n", "Number of tokens: 9\n", "Ids: [101, 20164, 10932, 10289, 25267, 1110, 1632, 106, 102]\n", "Tokens: ['[CLS]', 'Hu', '##gging', 'Face', 'Transformers', 'is', 'great', '!', '[SEP]']\n", "Special tokens mask: [1, 0, 0, 0, 0, 0, 0, 0, 1]\n", "\n", "char_to_token gives the wordpiece of a character in the input\n", "For example, the 9th character of the string is 'F', and it's part of wordpiece 3, 'Face'\n" ] } ], "source": [ "# For Fast Tokenizers, there's another option too:\n", "# `_tokenizer` is a private attribute holding the underlying `tokenizers` object;\n", "# `tokenizer.backend_tokenizer` is the public way to reach it\n", "inputs = tokenizer._tokenizer.encode(input_str)\n", "\n", "print(input_str)\n", "print(\"-\"*5)\n", "print(f\"Number of tokens: {len(inputs)}\")\n", "print(f\"Ids: {inputs.ids}\")\n", "print(f\"Tokens: {inputs.tokens}\")\n", "print(f\"Special tokens mask: {inputs.special_tokens_mask}\")\n", "print()\n", "print(\"char_to_token gives 
the wordpiece of a character in the input\")\n", "char_idx = 8\n", "print(f\"For example, the {char_idx + 1}th character of the string is '{input_str[char_idx]}',\"+\\\n", " f\" and it's part of wordpiece {inputs.char_to_token(char_idx)}, '{inputs.tokens[inputs.char_to_token(char_idx)]}'\")" ], "id": "tdoZ3EEU-X86" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vt5WV-6S-X87", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "f2c8eba1-2b7e-49e9-ca90-f24bee18c226" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "PyTorch Tensors:\n", "{\n", " input_ids:\n", " tensor([[ 101, 20164, 10932, 10289, 25267, 1110, 1632, 106, 102]])\n", " attention_mask:\n", " tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])\n", "}\n" ] } ], "source": [ "# Other cool tricks:\n", "# The tokenizer can return pytorch tensors\n", "model_inputs = tokenizer(\"Hugging Face Transformers is great!\", return_tensors=\"pt\")\n", "print(\"PyTorch Tensors:\")\n", "print_encoding(model_inputs)" ], "id": "vt5WV-6S-X87" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HI3bAzpeoD3_", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "9902a9a7-602f-4f3f-c8c3-7b04f026c193" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. 
Default to no truncation.\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "Pad token: [PAD] | Pad token id: 0\n", "Padding:\n", "{\n", " input_ids:\n", " tensor([[ 101, 20164, 10932, 10289, 25267, 1110, 1632, 106, 102, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0],\n", " [ 101, 1109, 3613, 3058, 17594, 15457, 1166, 1103, 16688, 3676,\n", " 119, 1599, 1103, 3676, 1400, 1146, 1105, 1868, 1283, 1272,\n", " 1131, 1238, 112, 189, 1176, 17594, 1279, 119, 102]])\n", " attention_mask:\n", " tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0],\n", " [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1]])\n", "}\n" ] } ], "source": [ "# You can pass multiple strings into the tokenizer and pad them as you need\n", "model_inputs = tokenizer([\"Hugging Face Transformers is great!\",\n", " \"The quick brown fox jumps over the lazy dog.\" +\\\n", " \"Then the dog got up and ran away because she didn't like foxes.\",\n", " ],\n", " return_tensors=\"pt\",\n", " padding=True,\n", " truncation=True)\n", "print(f\"Pad token: {tokenizer.pad_token} | Pad token id: {tokenizer.pad_token_id}\")\n", "print(\"Padding:\")\n", "print_encoding(model_inputs)" ], "id": "HI3bAzpeoD3_" }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false, "id": "iSZat-nkoD3_", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "724a6d4b-777e-4438-ee03-d6f8862bbefb" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Batch Decode:\n", "['[CLS] Hugging Face Transformers is great! [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]', \"[CLS] The quick brown fox jumps over the lazy dog. Then the dog got up and ran away because she didn't like foxes. 
[SEP]\"]\n", "\n", "Batch Decode: (no special characters)\n", "['Hugging Face Transformers is great!', \"The quick brown fox jumps over the lazy dog. Then the dog got up and ran away because she didn't like foxes.\"]\n" ] } ], "source": [ "# You can also decode a whole batch at once:\n", "print(\"Batch Decode:\")\n", "print(tokenizer.batch_decode(model_inputs.input_ids))\n", "print()\n", "print(\"Batch Decode: (no special characters)\")\n", "print(tokenizer.batch_decode(model_inputs.input_ids, skip_special_tokens=True))" ], "id": "iSZat-nkoD3_" }, { "cell_type": "markdown", "metadata": { "id": "JmZNLz_noD4A" }, "source": [ "For more information about tokenizers, you can look at the\n", "[Hugging Face Transformers Docs](https://huggingface.co/docs/transformers/main_classes/tokenizer) and the [Hugging Face Tokenizers Library](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) (for the Fast Tokenizers). The Tokenizers Library even lets you train your own tokenizers!" ], "id": "JmZNLz_noD4A" }, { "cell_type": "markdown", "metadata": { "id": "6juLjnNt-X87" }, "source": [ "### 0.2 Models\n", "\n", "\n", "\n", "\n", "Initializing models is very similar to initializing tokenizers. You can either use the model class specific to your model or you can use an AutoModel class. I tend to prefer AutoModel, especially when I want to compare models, because it's easy to specify the models as strings.\n", "\n", "While most pretrained transformers have similar architectures, there are additional weights, called \"heads\", that you have to train if you're doing sequence classification, question answering, or some other task. Hugging Face automatically sets up the architecture you need when you specify the model class. For example, we are doing sentiment analysis, so we are going to use `DistilBertForSequenceClassification`. 
If we were going to continue training DistilBERT on its masked-language modeling training objective, we would use `DistilBertForMaskedLM`, and if we just wanted the model's representations, maybe for our own downstream task, we could just use `DistilBertModel`.\n", "\n", "\n", "Here's a stylized picture of a model recreated from one found here: [https://huggingface.co/course/chapter2/2?fw=pt](https://huggingface.co/course/chapter2/2?fw=pt).\n", "![model_illustration.png](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/transformer_and_head.svg)\n", "\n", "\n", "Here are some examples.\n", "```\n", "*Model\n", "*ForMaskedLM\n", "*ForSequenceClassification\n", "*ForTokenClassification\n", "*ForQuestionAnswering\n", "*ForMultipleChoice\n", "...\n", "```\n", "where `*` can be `AutoModel` or a specific pretrained model (e.g. `DistilBert`). The base Auto class is simply `AutoModel`.\n", "\n", "\n", "There are three types of models:\n", "* Encoders (e.g. BERT)\n", "* Decoders (e.g. GPT-2)\n", "* Encoder-Decoder models (e.g. BART or T5)\n", "\n", "The task-specific classes you have available depend on what type of model you're dealing with.\n", "\n", "\n", "A full list of choices is available in the [docs](https://huggingface.co/docs/transformers/model_doc/auto). 
Note that not every model is compatible with every task-specific class. For example, DistilBERT is not compatible with the Seq2Seq classes because it consists only of an encoder.\n" ], "id": "6juLjnNt-X87" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RXm1K2sF-X88", "scrolled": true, "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "87db5173-398c-48c7-9df1-476144504613" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Loading base model\n", "Loading classification model from base model's checkpoint\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n", "Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" ] } ], "source": [ "from transformers import AutoModelForSequenceClassification, DistilBertForSequenceClassification, DistilBertModel\n", "print('Loading base model')\n", "base_model = DistilBertModel.from_pretrained('distilbert-base-cased')\n", "print(\"Loading classification model from base model's checkpoint\")\n", "model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=2)\n", "# Equivalent: the Auto class picks the DistilBERT classification class from the checkpoint name\n", "model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=2)\n" ], "id": "RXm1K2sF-X88" }, { "cell_type": "markdown", "source": [ "You can also initialize a model with random weights from a configuration:" ], 
"metadata": { "id": "v_5IMvUvgEWU" }, "id": "v_5IMvUvgEWU" }, { "cell_type": "code", "source": [ "from transformers import DistilBertConfig, DistilBertModel\n", "\n", "# Initializing a DistilBERT configuration\n", "configuration = DistilBertConfig()\n", "configuration.num_labels=2\n", "# Initializing a model (with random weights) from the configuration\n", "model = DistilBertForSequenceClassification(configuration)\n", "\n", "# Accessing the model configuration\n", "configuration = model.config" ], "metadata": { "id": "IkqSkIhKgIrZ" }, "id": "IkqSkIhKgIrZ", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "1opFV7Vi-X88" }, "source": [ "We get a warning here because the sequence classification parameters haven't been trained yet.\n", "\n", "Passing inputs to the model is super easy. They take inputs as keyword arguments" ], "id": "1opFV7Vi-X88" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TDZ72k-U-X89", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "52285d1b-6d62-4d2d-827e-df23c61bbf8f" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "{'input_ids': tensor([[ 101, 20164, 10932, 10289, 25267, 1110, 1632, 106, 102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}\n", "\n", "SequenceClassifierOutput(loss=None, logits=tensor([[0.0368, 0.0659]], grad_fn=), hidden_states=None, attentions=None)\n", "\n", "Distribution over labels: tensor([[0.4927, 0.5073]], grad_fn=)\n" ] } ], "source": [ "model_inputs = tokenizer(input_str, return_tensors=\"pt\")\n", "\n", "# Option 1\n", "model_outputs = model(input_ids=model_inputs.input_ids, attention_mask=model_inputs.attention_mask)\n", "\n", "# Option 2 - the keys of the dictionary the tokenizer returns are the same as the keyword arguments\n", "# the model expects\n", "\n", "# f({k1: v1, k2: v2}) = f(k1=v1, k2=v2)\n", "\n", "model_outputs = model(**model_inputs)\n", "\n", "print(model_inputs)\n", "print()\n", 
"print(model_outputs)\n", "print()\n", "print(f\"Distribution over labels: {torch.softmax(model_outputs.logits, dim=1)}\")" ], "id": "TDZ72k-U-X89" }, { "cell_type": "markdown", "metadata": { "id": "oRRqhVUloD4C" }, "source": [ "If you notice, it's a bit weird that we have two classes for a binary classification task - you could easily have a single class and just choose a threshold. It's like this because of how huggingface models calculate the loss. This will increase the number of parameters we have, but shouldn't otherwise affect performance." ], "id": "oRRqhVUloD4C" }, { "cell_type": "markdown", "metadata": { "id": "rvie4gYD-X8-" }, "source": [ "These models are just Pytorch Modules! You can can calculate the loss with your `loss_func` and call `loss.backward`. You can use any of the optimizers or learning rate schedulers that you used" ], "id": "rvie4gYD-X8-" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Irxo7sDboD4C", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "36e02d3d-e69b-46e0-c747-461bde1a60d5" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "tensor(0.6787, grad_fn=)\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "('distilbert.embeddings.word_embeddings.weight',\n", " Parameter containing:\n", " tensor([[-2.5130e-02, -3.3044e-02, -2.4396e-03, ..., -1.0848e-02,\n", " -4.6824e-02, -9.4855e-03],\n", " [-4.8244e-03, -2.1486e-02, -8.7145e-03, ..., -2.6029e-02,\n", " -3.7862e-02, -2.4103e-02],\n", " [-1.6531e-02, -1.7862e-02, 1.0596e-03, ..., -1.6371e-02,\n", " -3.5670e-02, -3.1419e-02],\n", " ...,\n", " [-9.6466e-03, 1.4814e-02, -2.9182e-02, ..., -3.7873e-02,\n", " -4.6263e-02, -1.6803e-02],\n", " [-1.3170e-02, 6.5378e-05, -3.7222e-02, ..., -4.3558e-02,\n", " -1.1252e-02, -2.2152e-02],\n", " [ 1.1905e-02, -2.3293e-02, -2.2506e-02, ..., -2.7136e-02,\n", " -4.3556e-02, 1.0529e-04]], requires_grad=True))" ] }, "metadata": {}, "execution_count": 56 } ], "source": [ "# You 
can calculate the loss like normal\n", "label = torch.tensor([1])\n", "loss = torch.nn.functional.cross_entropy(model_outputs.logits, label)\n", "print(loss)\n", "loss.backward()\n", "\n", "# You can get the parameters\n", "list(model.named_parameters())[0]" ], "id": "Irxo7sDboD4C" }, { "cell_type": "markdown", "metadata": { "id": "mpHeG1zDoD4D" }, "source": [ "Hugging Face provides an additional easy way to calculate the loss as well:" ], "id": "mpHeG1zDoD4D" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "S148gCyG-X8-", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "b2d3d761-ecce-4d96-abf3-acedc2c460e3" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "SequenceClassifierOutput(loss=tensor(0.6787, grad_fn=), logits=tensor([[0.0368, 0.0659]], grad_fn=), hidden_states=None, attentions=None)\n", "\n", "Model predictions: POSITIVE\n" ] } ], "source": [ "# To calculate the loss, we need to pass in a label:\n", "model_inputs = tokenizer(input_str, return_tensors=\"pt\")\n", "\n", "labels = ['NEGATIVE', 'POSITIVE']\n", "model_inputs['labels'] = torch.tensor([1])\n", "\n", "model_outputs = model(**model_inputs)\n", "\n", "\n", "print(model_outputs)\n", "print()\n", "print(f\"Model predictions: {labels[model_outputs.logits.argmax()]}\")" ], "id": "S148gCyG-X8-" }, { "cell_type": "markdown", "metadata": { "id": "7Y6E3IxzoD4E" }, "source": [ "One final note - you can get the hidden states and attention weights from the models really easily. This is particularly helpful if you're working on an analysis project. (For example, see [What does BERT look at?](https://arxiv.org/abs/1906.04341))." 
], "id": "7Y6E3IxzoD4E" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5WzqhpquoD4E", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "7b8e84ba-4ffd-4a8f-920a-843c14fdcd70" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Hidden state size (per layer): torch.Size([1, 9, 768])\n", "Attention head size (per layer): torch.Size([1, 12, 9, 9])\n" ] } ], "source": [ "from transformers import AutoModel\n", "\n", "model = AutoModel.from_pretrained(\"distilbert-base-cased\", output_attentions=True, output_hidden_states=True)\n", "model.eval()\n", "\n", "model_inputs = tokenizer(input_str, return_tensors=\"pt\")\n", "with torch.no_grad():\n", "    model_output = model(**model_inputs)\n", "\n", "\n", "print(\"Hidden state size (per layer): \", model_output.hidden_states[0].shape)\n", "print(\"Attention head size (per layer):\", model_output.attentions[0].shape) # attentions[layer] has shape (batch, head, query_word_idx, key_word_idx)\n", " # y-axis is query, x-axis is key\n", "# print(model_output)" ], "id": "5WzqhpquoD4E" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SH_MAK-soD4F", "colab": { "base_uri": "https://localhost:8080/", "height": 693 }, "outputId": "b5a45a91-b019-4b2f-af37-366ca60dc058" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "['[CLS]', 'Hu', '##gging', 'Face', 'Transformers', 'is', 'great', '!', '[SEP]']\n" ] }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ], "source": [ "tokens = tokenizer.convert_ids_to_tokens(model_inputs.input_ids[0])\n", "print(tokens)\n", "\n", "\n", "n_layers = len(model_output.attentions)\n", "n_heads = len(model_output.attentions[0][0])\n", "fig, axes = plt.subplots(6, 12)\n", "fig.set_size_inches(18.5*2, 10.5*2)\n", "for layer in range(n_layers):\n", " for i in range(n_heads):\n", " axes[layer, i].imshow(model_output.attentions[layer][0, i])\n", " axes[layer][i].set_xticks(list(range(9)))\n", " axes[layer][i].set_xticklabels(labels=tokens, rotation=\"vertical\")\n", " axes[layer][i].set_yticks(list(range(9)))\n", " axes[layer][i].set_yticklabels(labels=tokens)\n", "\n", " if layer == 5:\n", " axes[layer, i].set(xlabel=f\"head={i}\")\n", " if i == 0:\n", " axes[layer, i].set(ylabel=f\"layer={layer}\")\n", "\n", "plt.subplots_adjust(wspace=0.3)\n", "plt.show()" ], "id": "SH_MAK-soD4F" }, { "cell_type": "markdown", "metadata": { "id": "uumcErs2-X80" }, "source": [ "## Part 1: Finetuning\n", "\n", "For your projects, you are much more likely to want to finetune a pretrained model. This is a little bit more involved, but is still quite easy." ], "id": "uumcErs2-X80" }, { "cell_type": "markdown", "metadata": { "id": "WDdGp4Ua-X81" }, "source": [ "### 2.1 Loading in a dataset\n", "\n", "In addition to having models, the [the hub](https://huggingface.co/datasets) also has datasets." 
], "id": "WDdGp4Ua-X81" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OTsW-Wwi-X81", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "20d75c4d-702f-4231-ef4f-44534fe051d3" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "DatasetDict({\n", " train: Dataset({\n", " features: ['text', 'label'],\n", " num_rows: 25000\n", " })\n", " test: Dataset({\n", " features: ['text', 'label'],\n", " num_rows: 25000\n", " })\n", " unsupervised: Dataset({\n", " features: ['text', 'label'],\n", " num_rows: 50000\n", " })\n", "})" ] }, "metadata": {}, "execution_count": 60 } ], "source": [ "from datasets import load_dataset, DatasetDict\n", "\n", "\n", "\n", "# DataLoader(zip(list1, list2))\n", "dataset_name = \"stanfordnlp/imdb\"\n", "\n", "imdb_dataset = load_dataset(dataset_name)\n", "\n", "\n", "# Just take the first 50 tokens for speed/running on cpu\n", "def truncate(example):\n", " return {\n", " 'text': \" \".join(example['text'].split()[:50]),\n", " 'label': example['label']\n", " }\n", "\n", "imdb_dataset" ], "id": "OTsW-Wwi-X81" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vCaX-gNo0OEV", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "0d043c74-6e73-405f-91a5-823d98a0a973" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "DatasetDict({\n", " train: Dataset({\n", " features: ['text', 'label'],\n", " num_rows: 128\n", " })\n", " val: Dataset({\n", " features: ['text', 'label'],\n", " num_rows: 32\n", " })\n", "})" ] }, "metadata": {}, "execution_count": 61 } ], "source": [ "\n", "# Take 128 random examples for train and 32 validation\n", "small_imdb_dataset = DatasetDict(\n", " train=imdb_dataset['train'].shuffle(seed=1111).select(range(128)).map(truncate),\n", " val=imdb_dataset['train'].shuffle(seed=1111).select(range(128, 160)).map(truncate),\n", ")\n", "small_imdb_dataset" ], "id": "vCaX-gNo0OEV" }, { "cell_type": "code", 
"execution_count": null, "metadata": { "id": "bBS4c44A-X82", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "3d6cf0d8-3a62-43a6-87f2-bc2d039b44a1" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'text': [\"Probably Jackie Chan's best film in the 1980s, and the one that put him on the map. The scale of this self-directed police drama is evident from the opening and closing scenes, during which a squatters' village and shopping mall are demolished. There are, clearly, differences between the original Chinese\",\n", " 'A wonderful movie! Anyone growing up in an Italian family will definitely see themselves in these characters. A good family movie with sadness, humor, and very good acting from all. You will enjoy this movie!! We need more like it.',\n", " 'HORRENDOUS! Avoid like the plague. I would rate this in the top 10 worst movies ever. Special effects, acting, mood, sound, etc. appear to be done by day care students...wait, I have seen programs better than this. Opens like a soft porn show with a blurred nude female doing a',\n", " 'And I absolutely adore Isabelle Blais!!! She was so cute in this movie, and far different from her role in \"Quebec-Montreal\" where she was more like a man-eater. I think she should have been nominated for a Jutra. I mean, Syvlie Moreau was good, but Isabelle was far superior, IMO.',\n", " 'Must confess to having seen a few howlers in my time, but this one is up there with the worst of them. Plot troubling to follow. Sex and violence thrown in to disorient and distract from the really poorly put together film.

I can only imagine that the cast',\n", " \"I pity people calling kamal hassan 'ulaganaayakan' maybe for them ulagam is tollywood ! comeon guys..this movie is a thriller without thrill..

come out of your ulagam and just watch some high class thrillers like The Usual Suspects or even The Silence of the Lambs.

technically good but\",\n", " 'I remember my parents not understanding Saturday Night Live when I was 15. They also did not understand Rock n Roll and many other things. Now that I am approaching their age, I still remember, and find I understand many of the things my kids love. But this is pathetic.',\n", " 'This animated movie is a masterpiece! The narration, music, animation, and storyline where all remarkable. My girlfriend and I saw it again for a second time and we got more insight from it. We invited a couple friends to see Spirit with us and they really enjoyed it a lot.',\n", " 'I vaugely recall seeing this when I was 3 years old, then my parents accidentally taped over all but a few seconds of it with some other cartoon. Then I was about 8 or 9 years old when I rediscovered it and since I was then able to comprehend things',\n", " 'I, like many people, saw this film in the theatre when it first came out in \\'97. It was a below average film at best, defiantly not the \"masterpiece\" that all these \"Titanic\" fanboys like to make it out as. 
First off, DiCaprio is a terrible actor no matter which'],\n", " 'label': [1, 1, 0, 1, 0, 0, 0, 1, 1, 0]}" ] }, "metadata": {}, "execution_count": 62 } ], "source": [ "small_imdb_dataset['train'][:10]" ], "id": "bBS4c44A-X82" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3bjqop3N-X8_", "colab": { "base_uri": "https://localhost:8080/", "height": 49, "referenced_widgets": [ "65a533fb60804956b220793116238e88", "a0cd7a1e5bee4d9a9fe55cafa03e972b", "808a343a24cd4fa19ce8942e366b0d81", "f591cd9700af440ebe81fd24d103d7f0", "e9eecd62333840d2a937a10b616985fb", "83f278f11f6d46c993f9605518451a31", "3164bb6d64394c569b434058b83ec0cb", "ab8c5db7bb4246d6bda6b1e900e50600", "5d8cd6e7192140cdbf0b20af03d17c4a", "62ab925b526942f0bc34c52ba2457da5", "cdadcb68bdd04d34b120c347822787aa" ] }, "outputId": "69eaa47f-23c6-4f86-d2f9-b6f08762bb64" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "Map: 0%| | 0/32 [00:00" ], "text/html": [ "\n", "
\n", " \n", " \n", " [16/16 00:12, Epoch 2/2]\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EpochTraining LossValidation LossAccuracy
1No log0.6907630.687500
2No log0.6893980.656250

" ] }, "metadata": {} }, { "output_type": "execute_result", "data": { "text/plain": [ "TrainOutput(global_step=16, training_loss=0.6855354309082031, metrics={'train_runtime': 12.879, 'train_samples_per_second': 19.877, 'train_steps_per_second': 1.242, 'train_loss': 0.6855354309082031, 'epoch': 2.0})" ] }, "metadata": {}, "execution_count": 70 } ], "source": [ "# train the model\n", "trainer.train()" ], "id": "tunwonc2-X9C" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_Q185j9V-X9C", "colab": { "base_uri": "https://localhost:8080/", "height": 17 }, "outputId": "c0c5a89f-1137-4139-e4a6-19f150e32343" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "text/html": [] }, "metadata": {} } ], "source": [ "# evaluating the model is very easy\n", "\n", "# results = trainer.evaluate() # just gets evaluation metrics\n", "results = trainer.predict(small_tokenized_dataset['val']) # also gives you predictions" ], "id": "_Q185j9V-X9C" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UJ0aGxeh-X9D", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "48c19364-5b9c-4601-fc36-5a629bf29b35" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "PredictionOutput(predictions=array([[ 0.00131953, 0.00580434],\n", " [ 0.03595451, 0.03271278],\n", " [ 0.02868645, -0.02640614],\n", " [ 0.01930159, 0.04312824],\n", " [ 0.006476 , 0.03497026],\n", " [ 0.02455846, -0.03357419],\n", " [ 0.03111352, -0.02113312],\n", " [ 0.01838659, -0.02196582],\n", " [-0.0029597 , 0.03812057],\n", " [ 0.0017085 , 0.01325338],\n", " [ 0.01675256, 0.02372436],\n", " [ 0.05900048, -0.06402317],\n", " [ 0.02501691, -0.00633026],\n", " [ 0.03639492, -0.01839869],\n", " [ 0.02455333, 0.01321811],\n", " [ 0.06327322, -0.07626322],\n", " [ 0.04167526, -0.06864192],\n", " [ 0.03178786, 0.01866044],\n", " [ 0.04430217, -0.02746258],\n", " [ 0.00703178, 0.0555813 ],\n", " [ 0.04485226, -0.03188433],\n", " [ 
0.00631561, 0.00044989],\n", " [ 0.00466784, 0.03991104],\n", " [ 0.04301776, -0.03423208],\n", " [ 0.02306983, 0.03840486],\n", " [ 0.01598634, 0.03111538],\n", " [ 0.00623961, 0.00687651],\n", " [ 0.03995002, -0.017272 ],\n", " [-0.00964497, 0.02151189],\n", " [ 0.03343549, -0.00809752],\n", " [ 0.01264727, 0.04387456],\n", " [ 0.0167619 , 0.00577193]], dtype=float32), label_ids=array([1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0,\n", " 0, 1, 1, 0, 1, 1, 1, 0, 0, 1]), metrics={'test_loss': 0.6893976926803589, 'test_accuracy': 0.65625, 'test_runtime': 0.13, 'test_samples_per_second': 246.122, 'test_steps_per_second': 15.383})" ] }, "metadata": {}, "execution_count": 72 } ], "source": [ "results" ], "id": "UJ0aGxeh-X9D" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kSGsZo3xoD4O", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "65d9c731-435c-4924-da02-cba2765cd58c" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "NEGATIVE\n" ] } ], "source": [ "# To load our saved model, we can pass the path to the checkpoint into the `from_pretrained` method:\n", "test_str = \"I enjoyed the movie!\"\n", "\n", "finetuned_model = AutoModelForSequenceClassification.from_pretrained(\"sample_hf_trainer/checkpoint-8\")\n", "model_inputs = tokenizer(test_str, return_tensors=\"pt\")\n", "prediction = torch.argmax(finetuned_model(**model_inputs).logits)\n", "print([\"NEGATIVE\", \"POSITIVE\"][prediction])" ], "id": "kSGsZo3xoD4O" }, { "cell_type": "markdown", "metadata": { "id": "MllSTgehoD4O" }, "source": [ "Here are also some practical tips for fine-tuning:\n", "\n", "**Good default hyperparameters.** The best hyperparameters will depend on your task and dataset. You should do a hyperparameter search to find the best ones. 
That said, here are some good initial values for fine-tuning.\n", "* Epochs: {2, 3, 4} (larger amounts of data need fewer epochs)\n", "* Batch size (bigger is better: as large as you can make it)\n", "* Optimizer: AdamW\n", "* AdamW learning rate: {2e-5, 5e-5}\n", "* Learning rate scheduler: linear warm up for first {0, 100, 500} steps of training\n", "* weight_decay (l2 regularization): {0, 0.01, 0.1}\n", "\n", "You should monitor your validation loss to decide when you've found good hyperparameters." ], "id": "MllSTgehoD4O" }, { "cell_type": "markdown", "metadata": { "id": "gsWGQfrm-X9D" }, "source": [ "There's a lot more that we can integrate into the Trainer to make it more useful including logging, saving model checkpoints, and more! You can even sub-class it to add your own personalized components. You can check out [this link](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) for more information about the Trainer." ], "id": "gsWGQfrm-X9D" }, { "cell_type": "markdown", "metadata": { "id": "9nCrUosgoD4P" }, "source": [ "## Appendix 0: Generation\n", "\n", "In the example above we finetuned the model on a classification task, but you can also finetune models on generation tasks. The `generate` function makes it easy to generate from these models. For example." ], "id": "9nCrUosgoD4P" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QfQEV8EKoD4P", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "fbd14511-e845-46a4-f23d-e0b533e85efa" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. 
If you want to force a new download, use `force_download=True`.\n", " warnings.warn(\n" ] } ], "source": [ "from transformers import AutoModelForCausalLM\n", "\n", "gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')\n", "\n", "gpt2 = AutoModelForCausalLM.from_pretrained('distilgpt2')\n", "gpt2.config.pad_token_id = gpt2.config.eos_token_id # Prevents warning during decoding" ], "id": "QfQEV8EKoD4P" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "G5wo61xmoD4Q", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "05d6ce0c-624b-4bba-c371-e35377951568" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n", "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "1) Once upon a time when I was trying to get a job, I started feeling like the company would end up making a living. I was starting to feel that if it wasn't for the staff, they would lose, and the other staff would find\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "2) Once upon a time, we thought that the first and last names of each player were not related to each other. But I had a feeling that we'd never get to the moment of discovery before. And when I heard it, I felt I'd\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "3) Once upon a time of chaos, a group of monsters fell on the field and a few soldiers entered the air. One of them was the man who had been waiting for reinforcements, and he didn't seem to want to be seen. 
He was carrying\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "4) Once upon a time of need, one would be forced to commit the most serious crime.\n", "\n", "\n", "\n", "The problem isn't just about the number of dead and dying, it's also about the importance of ensuring that people who are not dead will\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "5) Once upon a time of war, you are a soldier, and if there is anything that can be done, I think it's enough.\"\n", "\n", "\n", "The Great War\n", "It wasn't a long time before the Great War began. There was a\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "6) Once upon a time in which the sun had risen, a large man of all stripes with red and blue stripes sat on a bench facing towards a throne, his head turned red and a little black, which was a familiar sight to me.\n", "\"\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "7) Once upon a time of my life, the people were really interested in my life. I went in a certain way and didn't want to go into politics or politics, I always wanted to be a good writer. 
I always wanted to make movies,\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "8) Once upon a time of great sadness the day before I would have said goodbye to her.”\n", "I had seen her in her parents’ home before. It seemed like I was so old. The family was only seven. I had seen\n" ] }, { "output_type": "stream", "name": "stderr", "text": [ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "9) Once upon a time the question becomes \"Can the system allow me to run things without running the system?\" and \"Does that mean I can run things without running the system?\"\n", "\n", "\n", "\n", "That seems to me, the idea of the \"aut\n", "10) Once upon a time, there was an entire community who knew their future would be up and running. The city of Seattle's first full-blown municipal government became a full-blown government.\n", "\n", "\n", "In fact, it's not the only city\n" ] } ], "source": [ "prompt = \"Once upon a time\"\n", "\n", "tokenized_prompt = gpt2_tokenizer(prompt, return_tensors=\"pt\")\n", "\n", "for i in range(10):\n", "    output = gpt2.generate(**tokenized_prompt,\n", "                           max_length=50,\n", "                           do_sample=True,\n", "                           top_p=0.9)\n", "\n", "    print(f\"{i + 1}) {gpt2_tokenizer.batch_decode(output)[0]}\")" ], "id": "G5wo61xmoD4Q" }, { "cell_type": "markdown", "metadata": { "id": "QLAHLU4q9HYQ" }, "source": [ "## Appendix 1: Defining Custom Datasets\n", "\n", "There are a few ways to go about defining datasets, but I'm going to show an example using PyTorch DataLoaders. This example uses an encoder-decoder dataset, the [E2E Dataset](https://arxiv.org/abs/1706.09254), which maps structured information about restaurants to natural language descriptions." 
], "id": "QLAHLU4q9HYQ" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MLqz11UioD4Q", "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "outputId": "5754856e-288a-4d93-aecc-a0c14f581f1b" }, "outputs": [ { "output_type": "error", "ename": "FileNotFoundError", "evalue": "[Errno 2] No such file or directory: 'e2e-dataset/trainset.csv'", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mdatasets\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mDataset\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"e2e-dataset/trainset.csv\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 7\u001b[0m \u001b[0mcustom_dataset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mDataset\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfrom_pandas\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py\u001b[0m in \u001b[0;36mread_csv\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, 
quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)\u001b[0m\n\u001b[1;32m 910\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkwds_defaults\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 911\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 912\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_read\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 913\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 914\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py\u001b[0m in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m 575\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 576\u001b[0m \u001b[0;31m# Create the parser.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 577\u001b[0;31m \u001b[0mparser\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mTextFileReader\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 578\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 579\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mchunksize\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0miterator\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[1;32m 1405\u001b[0m 
\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1406\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhandles\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mIOHandles\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1407\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_make_engine\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mengine\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1408\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1409\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/pandas/io/parsers/readers.py\u001b[0m in \u001b[0;36m_make_engine\u001b[0;34m(self, f, engine)\u001b[0m\n\u001b[1;32m 1659\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;34m\"b\"\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mmode\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1660\u001b[0m \u001b[0mmode\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;34m\"b\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1661\u001b[0;31m self.handles = get_handle(\n\u001b[0m\u001b[1;32m 1662\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1663\u001b[0m \u001b[0mmode\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 
"\u001b[0;32m/usr/local/lib/python3.10/dist-packages/pandas/io/common.py\u001b[0m in \u001b[0;36mget_handle\u001b[0;34m(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)\u001b[0m\n\u001b[1;32m 857\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mioargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mencoding\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;34m\"b\"\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mioargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmode\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 858\u001b[0m \u001b[0;31m# Encoding\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 859\u001b[0;31m handle = open(\n\u001b[0m\u001b[1;32m 860\u001b[0m \u001b[0mhandle\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 861\u001b[0m \u001b[0mioargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmode\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: 'e2e-dataset/trainset.csv'" ] } ], "source": [ "# Option 1: Load into Hugging Face Datasets\n", "\n", "import pandas as pd\n", "from datasets import Dataset\n", "\n", "df = pd.read_csv(\"e2e-dataset/trainset.csv\")\n", "custom_dataset = Dataset.from_pandas(df)" ], "id": "MLqz11UioD4Q" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lRauc5JBoD4R" }, "outputs": [], "source": [ "# Option 2: Load into a PyTorch Dataset\n", "# Note: this `Dataset` shadows the `datasets.Dataset` imported in Option 1\n", "\n", "import csv\n", "from torch.utils.data import Dataset, DataLoader\n", "\n", "class E2EDataset(Dataset):\n", "    \"\"\"Tokenize data when we call __getitem__\"\"\"\n", "    def __init__(self, path, tokenizer):\n", "        with open(path, newline=\"\") as f:\n", "            reader = csv.reader(f)\n", "            next(reader) # skip the heading\n", "            self.data = [{\"source\": row[0], \"target\": row[1]} for row in reader]\n", "        self.tokenizer = tokenizer\n", "\n", "    def __len__(self):\n", "        # Needed so a DataLoader knows how many examples there are\n", "        return len(self.data)\n", "\n", "    def __getitem__(self, i):\n", "        inputs = self.tokenizer(self.data[i]['source'])\n", "        labels = self.tokenizer(self.data[i]['target'])\n", "        inputs['labels'] = labels.input_ids\n", "        return inputs\n" ], "id": "lRauc5JBoD4R" }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "id": "eRu5mFIpoD4R" }, "outputs": [], "source": [ "bart_tokenizer = AutoTokenizer.from_pretrained('facebook/bart-base')" ], "id": "eRu5mFIpoD4R" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "I32zL1nEoD4S" }, "outputs": [], "source": [ "dataset = E2EDataset(\"e2e-dataset/trainset.csv\", bart_tokenizer)" ], "id": "I32zL1nEoD4S" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "I50Sh862oD4T" }, "outputs": [], "source": [ "# `prepare_seq2seq_batch` is deprecated; pass the targets via `text_target` instead\n", "bart_tokenizer([\"This is the first test.\", \"This is the second test.\"], text_target=[\"Target 1\", \"Target 2\"], padding=True, return_tensors=\"pt\")" ], "id": "I50Sh862oD4T" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "w8frTRD3oD4T" }, "outputs": [], "source": [ "dataset[0]" ], "id": "w8frTRD3oD4T" }, { "cell_type": "markdown", "metadata": { "id": "tHI3KuNZ-X8w" }, "source": [ "## Appendix 2: Pipelines\n", "\n", "There are some standard NLP tasks like sentiment classification or question answering where there are already pre-trained (and fine-tuned!) 
models available through Hugging Face Transformers' [_Pipeline_](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/pipelines#transformers.pipeline) interface.\n", "\n", "For your projects, you likely won't be using it too much, but it's still worth knowing about!\n", "\n", "Here's an example with sentiment analysis:" ], "id": "tHI3KuNZ-X8w" }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "id": "gOj5ODS0-X8x" }, "outputs": [], "source": [ "from transformers import pipeline\n", "\n", "sentiment_analysis = pipeline(\"sentiment-analysis\", model=\"siebert/sentiment-roberta-large-english\")" ], "id": "gOj5ODS0-X8x" }, { "cell_type": "markdown", "metadata": { "id": "D5wZuMG2-X8y" }, "source": [ "You can run the pipeline by calling it directly on a string:" ], "id": "D5wZuMG2-X8y" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GrygLkiQ-X8y" }, "outputs": [], "source": [ "sentiment_analysis(\"Hugging Face Transformers is really cool!\")" ], "id": "GrygLkiQ-X8y" }, { "cell_type": "markdown", "metadata": { "id": "0e2E8qKH-X8z" }, "source": [ "Or on a list of strings:" ], "id": "0e2E8qKH-X8z" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EpBvCpVM-X8z" }, "outputs": [], "source": [ "sentiment_analysis([\"I didn't know if I would like Hákarl, but it turned out pretty good.\",\n", "                    \"I didn't know if I would like Hákarl, and it was just as bad as I'd heard.\"])" ], "id": "EpBvCpVM-X8z" }, { "cell_type": "markdown", "metadata": { "id": "Ptc0BViy-X80" }, "source": [ "More information on pipelines, including a list of the available ones, is in the [pipelines documentation](https://huggingface.co/docs/transformers/main_classes/pipelines)." ], "id": "Ptc0BViy-X80" }, { "cell_type": "markdown", "metadata": { "id": "3oRAmG_w-X9H" }, "source": [ "## Appendix 4: Masked Language Modeling" ], "id": "3oRAmG_w-X9H" }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "id": 
"ZXD2-Gsu-X9H" }, "outputs": [], "source": [ "from transformers import AutoModelForMaskedLM, AutoTokenizer\n", "\n", "# The keyword is `use_fast` (not `fast`)\n", "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\", use_fast=True)\n", "bert = AutoModelForMaskedLM.from_pretrained(\"bert-base-cased\")" ], "id": "ZXD2-Gsu-X9H" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fvfHGKrq-X9I" }, "outputs": [], "source": [ "prompt = \"I am [MASK] to learn about HuggingFace!\"\n", "model = pipeline(\"fill-mask\", model=\"bert-base-cased\")\n", "model(prompt)" ], "id": "fvfHGKrq-X9I" }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0B2qek-v-X9I", "scrolled": true }, "outputs": [], "source": [ "import torch\n", "\n", "inputs = tokenizer(prompt, return_tensors=\"pt\")\n", "# Locate the [MASK] token; torch.where keeps the indices as tensors\n", "mask_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)\n", "outputs = bert(**inputs)\n", "top_5_predictions = torch.softmax(outputs.logits[mask_index], dim=1).topk(5)\n", "\n", "print(prompt)\n", "for i in range(5):\n", "    prediction = tokenizer.decode(top_5_predictions.indices[0, i])\n", "    prob = top_5_predictions.values[0, i]\n", "    print(f\" {i+1}) {prediction}\\t{prob:.3f}\")" ], "id": "0B2qek-v-X9I" } ], "metadata": { "colab": { "provenance": [], "gpuType": "T4" }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "accelerator": "GPU", "widgets": { "application/vnd.jupyter.widget-state+json": { "65a533fb60804956b220793116238e88": { "model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "model_module_version": "1.5.0", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": 
"HBoxView", "box_style": "", "children": [ "IPY_MODEL_a0cd7a1e5bee4d9a9fe55cafa03e972b", "IPY_MODEL_808a343a24cd4fa19ce8942e366b0d81", "IPY_MODEL_f591cd9700af440ebe81fd24d103d7f0" ], "layout": "IPY_MODEL_e9eecd62333840d2a937a10b616985fb" } }, "a0cd7a1e5bee4d9a9fe55cafa03e972b": { "model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "model_module_version": "1.5.0", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_83f278f11f6d46c993f9605518451a31", "placeholder": "​", "style": "IPY_MODEL_3164bb6d64394c569b434058b83ec0cb", "value": "Map: 100%" } }, "808a343a24cd4fa19ce8942e366b0d81": { "model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "model_module_version": "1.5.0", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_ab8c5db7bb4246d6bda6b1e900e50600", "max": 32, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_5d8cd6e7192140cdbf0b20af03d17c4a", "value": 32 } }, "f591cd9700af440ebe81fd24d103d7f0": { "model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "model_module_version": "1.5.0", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": 
"IPY_MODEL_62ab925b526942f0bc34c52ba2457da5", "placeholder": "​", "style": "IPY_MODEL_cdadcb68bdd04d34b120c347822787aa", "value": " 32/32 [00:00<00:00, 866.41 examples/s]" } }, "e9eecd62333840d2a937a10b616985fb": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "model_module_version": "1.2.0", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "83f278f11f6d46c993f9605518451a31": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "model_module_version": "1.2.0", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, 
"grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "3164bb6d64394c569b434058b83ec0cb": { "model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "model_module_version": "1.5.0", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "ab8c5db7bb4246d6bda6b1e900e50600": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "model_module_version": "1.2.0", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, 
"visibility": null, "width": null } }, "5d8cd6e7192140cdbf0b20af03d17c4a": { "model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "model_module_version": "1.5.0", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "" } }, "62ab925b526942f0bc34c52ba2457da5": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "model_module_version": "1.2.0", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "cdadcb68bdd04d34b120c347822787aa": { "model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "model_module_version": "1.5.0", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", 
"_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "7439178f023b4baf99dbc5af2c0685f5": { "model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "model_module_version": "1.5.0", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_6d708b3b1e8c4e56b82d0704463f80c0", "IPY_MODEL_d4734564984c4cc090879870e909a8e0", "IPY_MODEL_9059b94266944ccc9fe3fd3d1fe19ab5" ], "layout": "IPY_MODEL_673080d7448948f8ba0358d1e1f6efb4" } }, "6d708b3b1e8c4e56b82d0704463f80c0": { "model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "model_module_version": "1.5.0", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_6872dbb410804e42aae532de6857e4c8", "placeholder": "​", "style": "IPY_MODEL_8c89cfc617594a1fbb673ca305cc123f", "value": "100%" } }, "d4734564984c4cc090879870e909a8e0": { "model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "model_module_version": "1.5.0", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_fd643ce0c2ab462aa8aaa236b8ef4688", "max": 8, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_8377ee4797e7430388e841daea16f252", "value": 8 
} }, "9059b94266944ccc9fe3fd3d1fe19ab5": { "model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "model_module_version": "1.5.0", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_9c65b39e6e614695885b546cf79c5c1b", "placeholder": "​", "style": "IPY_MODEL_fd6d0fab8cce44e79d4e177273d26840", "value": " 8/8 [00:45<00:00,  5.33s/it]" } }, "673080d7448948f8ba0358d1e1f6efb4": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "model_module_version": "1.2.0", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "6872dbb410804e42aae532de6857e4c8": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "model_module_version": "1.2.0", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": 
"1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "8c89cfc617594a1fbb673ca305cc123f": { "model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "model_module_version": "1.5.0", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "fd643ce0c2ab462aa8aaa236b8ef4688": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "model_module_version": "1.2.0", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, 
"grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "8377ee4797e7430388e841daea16f252": { "model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "model_module_version": "1.5.0", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "" } }, "9c65b39e6e614695885b546cf79c5c1b": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "model_module_version": "1.2.0", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": 
null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "fd6d0fab8cce44e79d4e177273d26840": { "model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "model_module_version": "1.5.0", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "ab935be7be504e3784de4d7f2cd7bff6": { "model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "model_module_version": "1.5.0", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_c0a01a008c5e4caa915220ab3db7862e", "IPY_MODEL_5494f014a89943fbaa52a3b760f7c9be", "IPY_MODEL_0e00aedefc414e7a936d1e835ce4326a" ], "layout": "IPY_MODEL_24764907a2b94fea9389650226a5bec2" } }, "c0a01a008c5e4caa915220ab3db7862e": { "model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "model_module_version": "1.5.0", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_d2e472a543f94284a77fc019b9e7b9b0", "placeholder": "​", "style": "IPY_MODEL_81d5e9b9b59648aa96bcdcf176e9e04f", "value": "Map: 100%" } }, "5494f014a89943fbaa52a3b760f7c9be": { "model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "model_module_version": "1.5.0", "state": { "_dom_classes": [], "_model_module": 
"@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_1ef0ff13297d43e2a8fbe07e8894a13f", "max": 32, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_ed8c860c02f545469eb9e47d1f097443", "value": 32 } }, "0e00aedefc414e7a936d1e835ce4326a": { "model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "model_module_version": "1.5.0", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_236ad067eca04d60abf593d0cddc8f2f", "placeholder": "​", "style": "IPY_MODEL_e1681735974b44a2a7a6c92097239ceb", "value": " 32/32 [00:00<00:00, 786.32 examples/s]" } }, "24764907a2b94fea9389650226a5bec2": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "model_module_version": "1.2.0", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": 
null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "d2e472a543f94284a77fc019b9e7b9b0": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "model_module_version": "1.2.0", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "81d5e9b9b59648aa96bcdcf176e9e04f": { "model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "model_module_version": "1.5.0", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "1ef0ff13297d43e2a8fbe07e8894a13f": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "model_module_version": "1.2.0", "state": { 
"_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "ed8c860c02f545469eb9e47d1f097443": { "model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "model_module_version": "1.5.0", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "" } }, "236ad067eca04d60abf593d0cddc8f2f": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "model_module_version": "1.2.0", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, 
"grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "e1681735974b44a2a7a6c92097239ceb": { "model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "model_module_version": "1.5.0", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } } } } }, "nbformat": 4, "nbformat_minor": 5 }