{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following cell to ensure numba is installed (skip if you are sure)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Defaulting to user installation because normal site-packages is not writeable\n", "Requirement already satisfied: numba in c:\\users\\laveneau\\appdata\\roaming\\python\\python39\\site-packages (0.55.1)\n", "Requirement already satisfied: numpy<1.22,>=1.18 in c:\\program files\\python\\lib\\site-packages (from numba) (1.21.2)\n", "Requirement already satisfied: setuptools in c:\\program files\\python\\lib\\site-packages (from numba) (57.4.0)\n", "Requirement already satisfied: llvmlite<0.39,>=0.38.0rc1 in c:\\users\\laveneau\\appdata\\roaming\\python\\python39\\site-packages (from numba) (0.38.0)\n" ] } ], "source": [ "!pip install numba" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following cell to verify that numba can use Cuda..." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "is cuda available? yes\n" ] } ], "source": [ "%%python\n", "from numba import cuda\n", "\n", "def bool_to_str(predicate: bool) -> str:\n", " if predicate:\n", " return 'yes'\n", " return 'no'\n", "\n", "def is_cuda_available_str() -> str:\n", " return bool_to_str(cuda.is_available())\n", "\n", "print(f'is cuda available? {is_cuda_available_str()}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another simple solution to detect and print the devices from [numba documentation](https://numba.readthedocs.io/en/stable/cuda-reference/host.html#device-management):" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 1 CUDA devices\n", "id 0 b'Quadro T2000 with Max-Q Design' [SUPPORTED]\n", " Compute Capability: 7.5\n", " PCI Device ID: 0\n", " PCI Bus ID: 1\n", " UUID: GPU-d0ef4cd4-a0bf-4d14-21df-26d411426265\n", " Watchdog: Enabled\n", " Compute Mode: WDDM\n", " FP32/FP64 Performance Ratio: 32\n", "Summary:\n", "\t1/1 devices are supported\n" ] } ], "source": [ "%%python\n", "from numba import cuda\n", "\n", "if not cuda.detect():\n", " raise Exception(\"we do not have cuda :-(\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following cell for a short test of numba/cuda..." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "it woks!\n" ] } ], "source": [ "%%python\n", "# This is a MAP example\n", "from numba import cuda\n", "import numpy as np \n", "\n", "\n", "@cuda.jit\n", "def increment_by_one(an_array):\n", " # Thread id in a 1D block\n", " tx = cuda.threadIdx.x\n", " # Block id in a 1D grid\n", " ty = cuda.blockIdx.x\n", " # Block width, i.e. number of threads per block\n", " bw = cuda.blockDim.x\n", " # Compute flattened index inside the array\n", " pos = tx + ty * bw\n", " if pos < an_array.size: # Check array boundaries\n", " an_array[pos] += 1\n", " \n", "\n", "if __name__ == \"__main__\":\n", " # build a big vector\n", " h_array = np.arange(1<<16)\n", " \n", " d_array = cuda.to_device(h_array)\n", " \n", " threads_per_block = 32*8 # 8 warps, 256 threads per block\n", " blocks_per_grid = (h_array.size + (threads_per_block - 1)) // threads_per_block\n", " increment_by_one[blocks_per_grid, threads_per_block](d_array)\n", " \n", " d_array.copy_to_host(h_array)\n", " \n", " for i in range(len(h_array)):\n", " assert h_array[i] == i+1, f'bad value at index {i}: {h_array[i]+1}'\n", " print('it woks!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From here, it is clear that some code are easy to do:\n", "- binary transform\n", "- gather\n", "- scatter\n", "\n", "That's what you did during the first week... Now let us try some simple PRAM algorithms. \n", "\n", "The main problem with PRAM model is that it is too simple to be applied directly to GPU. Indeed a GPU is not a simple vector processor but a set of vector processor. \n", "\n", "You already saw that a Cuda Grid contains some Cuda Blocks, and each Cuda Block contains some threads. \n", "The threads are organized into **warps**.\n", "\n", "A warp is the logical Cuda view of a vector processor, with 32 elementary processors. Hence, threads are working in vector mode by group of 32... \n", "\n", "So, how to program PRAM algorithms in Cuda? Good question, thanks. \n", "\n", "Today we are working into a single block (we will see how to use different blocks concurrently later) with some warps. Remember that the maximum number of threads you can use into a block is limited to 1024. The warp is a group of 32 threads, so the maximum number of warps is 32!\n", "\n", "Now, how to simulate PRAM into a block? For that purpose we need to synchronize the warps, to avoid race condition onto data. The simple solution to synchronizer all the threads into a block is to use the instruction `cuda.syncthreads()`. \n", "\n", "**Warning**: this instruction needs to be used by **all** the threads of the block. It is like a barrier, and it opens when all the threads reach it only...\n", "\n", "To avoid high latency and synchronisation problem, notice that all the data should be first loaded into block shared memory... This kind of memory should be allocated using the `numba.cuda.array(shape, dtype)` method." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First exercise\n", "The objective here is to write a Cuda algorithm that calculates the maximum of 32 values into a single block. Your implementation should be added at line 21!\n", "\n", "32 threads means a single warp, so no synchronisation is needed here..." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "maximum seems to work\n" ] } ], "source": [ "from __future__ import annotations\n", "\n", "import numpy as np\n", "from numba import cuda, core\n", "from numba.np.numpy_support import from_dtype\n", "\n", "\n", "class CudaMaximum:\n", " _kernels_cache = {}\n", "\n", " def __init__(self: CudaMaximum) -> None:\n", " pass\n", "\n", " @staticmethod\n", " def _gpu_kernel_factory(np_type):\n", " \"\"\"Factory of kernels for the maximum problem...\n", "\n", " This function returns a Cuda Kernel that does the maximum of some data using a single block.\"\"\"\n", "\n", " def kernel(d_input, d_maximum) -> None:\n", " tid = cuda.threadIdx.x\n", " shared = cuda.shared.array(shape=32, dtype=d_input.dtype)\n", " # do it here\n", "\n", " return cuda.jit(kernel)\n", "\n", " @staticmethod\n", " def _compile(dtype):\n", "\n", " key = dtype\n", " if key not in CudaMaximum._kernels_cache:\n", " CudaMaximum._kernels_cache[key] = CudaMaximum._gpu_kernel_factory(from_dtype(dtype))\n", "\n", " return CudaMaximum._kernels_cache[key]\n", "\n", " def __call__(self, buffer, maximum, stream=cuda.default_stream()):\n", " \"\"\"Computes the maximum.\n", "\n", " :param buffer: A device array, containing the data.\n", " :param stream: Optional CUDA stream in which to perform the calculations.\n", " If no stream is specified, the default stream of 0 is used.\n", " :return: ``None``\n", " \"\"\"\n", "\n", " # ensure 1d array\n", " if buffer.ndim != 1:\n", " raise TypeError(\"only support 1D array\")\n", "\n", " # ensure size > 0\n", " if buffer.size < 1:\n", " raise ValueError(\"array's length is 0\")\n", "\n", " # ensure size == 1\n", " if maximum.size != 1:\n", " raise ValueError(\"array's length is not zero\")\n", "\n", " # ensure size <= 32\n", " if buffer.size > 32:\n", " raise ValueError(\"array's length is too big\")\n", "\n", " kernel = self._compile(buffer.dtype)\n", "\n", " # Perform the maximum on the GPU\n", " nb_threads = buffer.size\n", " nb_blocks = 1\n", "\n", " start_event = cuda.event(True)\n", " stop_event = cuda.event(True)\n", "\n", " start_event.record(stream=stream)\n", " kernel[nb_blocks, nb_threads, stream](buffer, maximum)\n", " stop_event.record(stream=stream)\n", " stop_event.synchronize()\n", " ct = cuda.event_elapsed_time(start_event, stop_event)\n", " print(f\"kernel computation time is {ct} ms\")\n", "\n", "\n", "if __name__ == \"__main__\":\n", " def check(maximum, array, msg):\n", " error = 0\n", " for x in array:\n", " if x > maximum:\n", " error = error + 1\n", " if error > 0:\n", " print(f\"{msg} does not work: {error} errors generated\")\n", " else:\n", " print(f\"{msg} seems to work\")\n", "\n", "\n", " def test(h_buffer):\n", " d_buffer = cuda.to_device(h_buffer)\n", " d_maximum = cuda.device_array(shape=1, dtype=d_buffer.dtype)\n", "\n", " maxer = CudaMaximum()\n", " maxer(d_buffer, d_maximum)\n", "\n", " h_maximum = d_maximum.copy_to_host()\n", "\n", " check(h_maximum[0], h_buffer, \"maximum\")\n", "\n", "\n", " core.config.CUDA_LOW_OCCUPANCY_WARNINGS = False\n", " buffer = np.random.randint(low=0, high=1 << 20, size=32, dtype=np.int32)\n", " test(buffer)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Second exercise\n", "The objective here is to write a Cuda algorithm that calculates the maximum of 1024 values into a single block. Your implementation should be added at line 21!\n", "\n", "Take care: use threads' synchronization to simulate the PRAM algorithm with multiple warps!" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "maximum seems to work\n" ] } ], "source": [ "from __future__ import annotations\n", "\n", "import numpy as np\n", "from numba import cuda, core\n", "from numba.np.numpy_support import from_dtype\n", "\n", "\n", "class CudaMaximum:\n", " _kernels_cache = {}\n", "\n", " def __init__(self: CudaMaximum) -> None:\n", " pass\n", "\n", " @staticmethod\n", " def _gpu_kernel_factory(np_type, nb_threads):\n", " \"\"\"Factory of kernels for the maximum problem...\n", "\n", " This function returns a Cuda Kernel that does the maximum of some data using a single block.\"\"\"\n", "\n", " def kernel(d_input, d_maximum) -> None:\n", " tid = cuda.threadIdx.x\n", " shared = cuda.shared.array(shape=nb_threads, dtype=d_input.dtype)\n", " # do it here\n", "\n", " return cuda.jit(kernel)\n", "\n", " @staticmethod\n", " def _compile(dtype, nb_threads):\n", "\n", " key = dtype, nb_threads\n", " if key not in CudaMaximum._kernels_cache:\n", " CudaMaximum._kernels_cache[key] = CudaMaximum._gpu_kernel_factory(from_dtype(dtype), nb_threads)\n", "\n", " return CudaMaximum._kernels_cache[key]\n", "\n", " def __call__(self, buffer, maximum, stream=cuda.default_stream()):\n", " \"\"\"Computes the maximum.\n", "\n", " :param buffer: A device array, containing the data.\n", " :param stream: Optional CUDA stream in which to perform the calculations.\n", " If no stream is specified, the default stream of 0 is used.\n", " :return: ``None``\n", " \"\"\"\n", "\n", " # ensure 1d array\n", " if buffer.ndim != 1:\n", " raise TypeError(\"only support 1D array\")\n", "\n", " # ensure size > 0\n", " if buffer.size < 1:\n", " raise ValueError(\"array's length is 0\")\n", "\n", " # ensure size == 1\n", " if maximum.size != 1:\n", " raise ValueError(\"array's length is not zero\")\n", "\n", " # ensure size < 1024+1\n", " if buffer.size > 1024:\n", " raise ValueError(\"array's length is too big\")\n", "\n", " # Perform the maximum on the GPU\n", " nb_threads = 1024\n", " nb_blocks = 1\n", "\n", " kernel = self._compile(buffer.dtype, nb_threads)\n", "\n", " start_event = cuda.event(True)\n", " stop_event = cuda.event(True)\n", " start_event.record(stream=stream)\n", " kernel[nb_blocks, nb_threads, stream](buffer, maximum)\n", " stop_event.record(stream=stream)\n", " stop_event.synchronize()\n", " ct = cuda.event_elapsed_time(start_event, stop_event)\n", " print(f\"kernel computation time is {ct} ms\")\n", "\n", "\n", "if __name__ == \"__main__\":\n", " def check(maximum, array, msg):\n", " error = 0\n", " for x in array:\n", " if x > maximum:\n", " error = error + 1\n", " if error > 0:\n", " print(f\"{msg} does not work: {error} errors generated\")\n", " else:\n", " print(f\"{msg} seems to work\")\n", "\n", "\n", " def test(h_buffer):\n", " d_buffer = cuda.to_device(h_buffer)\n", " d_maximum = cuda.device_array(shape=1, dtype=d_buffer.dtype)\n", "\n", " maxer = CudaMaximum()\n", " maxer(d_buffer, d_maximum)\n", "\n", " h_maximum = d_maximum.copy_to_host()\n", "\n", " check(h_maximum[0], h_buffer, \"maximum\")\n", "\n", "\n", " core.config.CUDA_LOW_OCCUPANCY_WARNINGS = False\n", " test(np.random.randint(low=0, high=1 << 20, size=1024, dtype=np.int32))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Third exercise\n", "You may have notice that the first two exercises works with 1024 values only. \n", "If you remove the line:\n", "```python\n", " core.config.CUDA_LOW_OCCUPANCY_WARNINGS = False\n", "```\n", "Then you will see a warning from Cuda saying that the GPU is under-utilized.\n", "Indeed, we ran a single block, onto a single SMP (*Streaming Multi-Processor*), while there is plenty of SMP (from 4 to tens)...\n", "To overcome this problem, the solution is quite simple but not obvious.\n", "\n", "First, you have to launch multiple block, but no more than 256 to avoid registers' pressure (shared memory is simulated using registers, and so too big shared memory implied low number of registers per thread, and so reduced efficiency...). \n", "\n", "Then, you obtain one maximum value per block, and this value should be saved into a specific isolated memory location: so you need an array of maximums of size equals to the number of blocks...\n", "\n", "At last, well you have something like 256 maximum values... Hum, it is where the first exercise is useful ;-)\n", "\n", "Modify you implementation to work with many values (something like 1 millions, at least)..." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "launch 65536 for 16777216 threads\n", "launch 256 for 65536 threads\n", "Maximum is 1048575\n", "maximum seems to work\n" ] } ], "source": [ "from __future__ import annotations\n", "\n", "import numpy as np\n", "from numba import cuda, core\n", "from numba.np.numpy_support import from_dtype\n", "\n", "\n", "class CudaMaximum:\n", " _kernels_1_cache = {}\n", " _kernels_2_cache = {}\n", "\n", " def __init__(self: CudaMaximum) -> None:\n", " pass\n", "\n", " @staticmethod\n", " def _gpu_kernel_factory_1(np_type, nb_threads):\n", " \"\"\"Factory of kernels for the maximum problem...\n", "\n", " This function returns a Cuda Kernel that does the maximum of some data using a multiple block.\"\"\"\n", "\n", " def kernel(d_input, d_maximum) -> None:\n", " gtid = cuda.grid(1)\n", " ltid = cuda.threadIdx.x\n", " bdim = cuda.blockDim.x\n", " \n", " shared = cuda.shared.array(shape=nb_threads, dtype=d_input.dtype)\n", " # do it here\n", "\n", " return cuda.jit(kernel)\n", "\n", " @staticmethod\n", " def _compile_step1(dtype, nb_threads):\n", "\n", " key = dtype, nb_threads\n", " if key not in CudaMaximum._kernels_1_cache:\n", " CudaMaximum._kernels_1_cache[key] = CudaMaximum._gpu_kernel_factory_1(from_dtype(dtype), nb_threads)\n", "\n", " return CudaMaximum._kernels_1_cache[key]\n", "\n", " @staticmethod\n", " def _gpu_kernel_factory_2(np_type, nb_threads):\n", " \"\"\"Factory of kernels for the maximum problem...\n", "\n", " This function returns a Cuda Kernel that does the maximum of some data using a single block.\"\"\"\n", "\n", " def kernel(d_input, d_maximum) -> None:\n", " tid = cuda.threadIdx.x\n", " shared = cuda.shared.array(shape=nb_threads, dtype=d_input.dtype)\n", " # do it here\n", "\n", " return cuda.jit(kernel)\n", "\n", " @staticmethod\n", " def _compile_step2(dtype, nb_threads):\n", "\n", " key = dtype\n", " if key not in CudaMaximum._kernels_2_cache:\n", " CudaMaximum._kernels_2_cache[key] = CudaMaximum._gpu_kernel_factory_2(from_dtype(dtype), nb_threads)\n", "\n", " return CudaMaximum._kernels_2_cache[key]\n", "\n", " def __call__(self, buffer, maximum, stream=0):\n", " \"\"\"Computes the maximum.\n", "\n", " :param buffer: A device array, containing the data.\n", " :param stream: Optional CUDA stream in which to perform the calculations.\n", " If no stream is specified, the default stream of 0 is used.\n", " :return: ``None``\n", " \"\"\"\n", "\n", " # ensure 1d array\n", " if buffer.ndim != 1:\n", " raise TypeError(\"only support 1D array\")\n", "\n", " # ensure size > 0\n", " if buffer.size < 1:\n", " raise ValueError(\"array's length is 0\")\n", "\n", " # ensure size == 1\n", " if maximum.size != 1:\n", " raise ValueError(\"array's length is not zero\")\n", "\n", " # Perform the maximum per block on the GPU\n", " nb_threads = 256\n", " kernel_1 = self._compile_step1(buffer.dtype, nb_threads)\n", "\n", " start_event = cuda.event(True)\n", " stop_event = cuda.event(True)\n", " start_event.record(stream=stream)\n", "\n", " while buffer.size > 256:\n", " nb_blocks = (buffer.size + nb_threads - 1) // nb_threads\n", " print(f\"launch {nb_blocks} for {nb_threads * nb_blocks} threads\")\n", "\n", " temp = cuda.device_array(shape=nb_blocks, dtype=buffer.dtype)\n", "\n", " kernel_1[nb_blocks, nb_threads, stream](buffer, temp)\n", "\n", " cuda.synchronize()\n", "\n", " buffer = temp\n", "\n", " # second step...\n", " kernel_2 = self._compile_step2(buffer.dtype, buffer.size)\n", " kernel_2[1, buffer.size, stream](buffer, maximum)\n", "\n", " stop_event.record(stream=stream)\n", " stop_event.synchronize()\n", " ct = cuda.event_elapsed_time(start_event, stop_event)\n", " print(f\"kernel computation time is {ct} ms\")\n", "\n", "\n", "\n", "if __name__ == \"__main__\":\n", " def check(maximum, array, msg):\n", " error = 0\n", " for x in array:\n", " if x > maximum:\n", " error = error + 1\n", " if error > 0:\n", " print(f\"{msg} does not work: {error} errors generated\")\n", " else:\n", " print(f\"{msg} seems to work\")\n", "\n", "\n", " def test(h_buffer):\n", " d_buffer = cuda.to_device(h_buffer)\n", " d_maximum = cuda.device_array(shape=1, dtype=d_buffer.dtype)\n", "\n", " maxer = CudaMaximum()\n", " maxer(d_buffer, d_maximum)\n", "\n", " h_maximum = d_maximum.copy_to_host()\n", " print(f\"Maximum is {h_maximum[0]}\")\n", "\n", " check(h_maximum[0], h_buffer, \"maximum\")\n", "\n", "\n", " core.config.CUDA_LOW_OCCUPANCY_WARNINGS = False\n", " buffer = np.random.randint(low=0, high=1 << 20, size=1 << 25, dtype=np.int32)\n", " test(buffer)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.7 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "9b4d75ac280b6c7c3aa43866cb82dc88915409b55fec83a093dd0284cb58708e" } } }, "nbformat": 4, "nbformat_minor": 2 }