Are you looking for the best GPU for deep learning? This guide from LEARNS.EDU.VN explains the GPU features that matter most, such as Tensor Cores and memory bandwidth, and offers expert recommendations to help you make a cost-efficient choice for your deep learning projects.
1. Overview
This article explains what makes a GPU fast for deep learning. It covers CPUs vs. GPUs, Tensor Cores, memory bandwidth, and the GPU memory hierarchy, and how they relate to deep learning performance. Understanding these aspects will help you evaluate future GPUs. We’ll discuss the unique features of the new NVIDIA RTX 40 Ada GPU series and provide GPU recommendations for different scenarios. Finally, we’ll answer common questions, address misconceptions, and cover topics like cloud vs. desktop, cooling, and AMD vs. NVIDIA.
2. How Do GPUs Work?
If you frequently use GPUs, it’s beneficial to understand how they work. This knowledge helps you identify when GPUs are fast or slow, and understand why you need a GPU and how future hardware options might compete. If you prefer performance numbers and recommendations, you can skip this section. The best high-level explanation for how GPUs work is found in Tim Dettmers’ Quora answer:
Read Tim Dettmers’ answer to Why are GPUs well-suited to deep learning? on Quora
This answer explains why GPUs are better than CPUs for deep learning. By examining the details, we can understand what makes one GPU better than another.
3. The Most Important GPU Specs for Deep Learning Processing Speed
This section helps you intuitively understand deep learning performance, enabling you to evaluate future GPUs. The components are ranked by importance: Tensor Cores, memory bandwidth, cache hierarchy, and FLOPS.
3.1. Tensor Cores
Tensor Cores are efficient matrix multiplication units. Since matrix multiplication is the most expensive part of deep neural networks, Tensor Cores are highly valuable. GPUs without Tensor Cores are not recommended due to their significant impact on performance.
Understanding how Tensor Cores work helps appreciate their importance. Here’s a simplified example of A*B=C matrix multiplication, where matrices are 32×32, demonstrating computational patterns with and without Tensor Cores. This example isn’t an exact representation of a high-performing matrix multiplication kernel, but it covers the basics. A CUDA programmer would optimize it with double buffering, register optimization, occupancy optimization, and instruction-level parallelism.
To fully understand this example, you need to understand cycles. A 1GHz processor can perform 10^9 cycles per second. Each cycle represents a computational opportunity. However, operations often take longer than one cycle, creating a queue where operations wait to finish. This is the latency of the operation.
Here are latency cycle timings for operations on Ampere GPUs, which have relatively slow caches, according to NVIDIA:
- Global memory access (up to 80GB): ~380 cycles
- L2 cache: ~200 cycles
- L1 cache or shared memory access (up to 128 KB per Streaming Multiprocessor): ~34 cycles
- Fused multiplication and addition, a*b+c (FFMA): 4 cycles
- Tensor Core matrix multiply: 1 cycle
Each operation is performed by a warp of 32 threads, which operate synchronously. Memory operations are optimized for warps. For example, loading from global memory happens at a granularity of 32*4 bytes. Up to 32 warps (1024 threads) can be in a streaming multiprocessor (SM), the GPU-equivalent of a CPU core. SM resources are divided among active warps, so sometimes running fewer warps increases registers/shared memory/Tensor Core resources per warp.
For the following examples, we assume identical computational resources. We use 8 SMs (about 10% of an RTX 3090) and 8 warps per SM for a 32×32 matrix multiply.
To understand how cycle latencies interact with resources like threads per SM and shared memory per SM, let’s look at matrix multiplication examples. These examples roughly follow the computational steps with and without Tensor Cores, but they are simplified. Real cases involve larger shared memory tiles and slightly different computational patterns.
3.1.1. Matrix Multiplication Without Tensor Cores
To perform an A*B=C matrix multiply with 32×32 matrices, repeatedly accessed memory is loaded into shared memory because its latency is much lower (34 cycles vs. 200 cycles for the L2 cache). A memory block in shared memory is often called a memory tile, or just tile. Loading two 32×32 floats into a shared memory tile can occur in parallel using 2*32 warps. With 8 SMs and 8 warps each, we only need a single sequential load from global to shared memory, which takes 200 cycles.
To perform the matrix multiplication, we load a vector of 32 numbers from shared memory A and B and perform a fused multiply-and-accumulate (FFMA), storing the outputs in registers C. Each SM does 8x dot products (32×32) to compute 8 outputs of C. For the technical reasons behind this, refer to Scott Gray’s blog post on matrix multiplication. This means 8x shared memory accesses at 34 cycles each and 8 FFMA operations (32 in parallel) at 4 cycles each. In total, the cost is:
200 cycles (global memory) + 8*34 cycles (shared memory) + 8*4 cycles (FFMA) = 504 cycles
Now, let’s look at the cycle cost with Tensor Cores.
3.1.2. Matrix Multiplication With Tensor Cores
With Tensor Cores, a 4×4 matrix multiplication can be performed in one cycle. First, memory needs to be in the Tensor Core. Similar to the above, we read from global memory (200 cycles) and store in shared memory. To do a 32×32 matrix multiply, we need 8×8=64 Tensor Core operations. A single SM has 8 Tensor Cores. So with 8 SMs, we have 64 Tensor Cores, which is exactly what we need. Data transfer from shared memory to Tensor Cores takes 1 memory transfer (34 cycles) and then 64 parallel Tensor Core operations (1 cycle). The total cost for Tensor Cores matrix multiplication is:
200 cycles (global memory) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 235 cycles.
Thus, Tensor Cores significantly reduce the matrix multiplication cost from 504 cycles to 235 cycles. In this simplified case, Tensor Cores reduced the cost of both shared memory access and FFMA operations.
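To make the arithmetic explicit, here is a minimal Python tally of the two scenarios, using the approximate Ampere latencies listed above. It mirrors the simplified example only; it is not a model of a real CUDA kernel.

```python
# Back-of-the-envelope cycle tally for the simplified 32x32 A*B=C example.
# Latencies are the approximate Ampere figures listed earlier; the counts
# (8 shared-memory loads and 8 FFMAs vs. 1 load and 1 Tensor Core op)
# follow the text and ignore real-kernel optimizations such as double
# buffering or instruction-level parallelism.

GLOBAL_MEM = 200   # cycles for the single parallel global->shared load
SHARED_MEM = 34    # cycles per shared-memory access
FFMA = 4           # cycles per fused multiply-add
TENSOR_CORE = 1    # cycles per Tensor Core matrix multiply

without_tc = GLOBAL_MEM + 8 * SHARED_MEM + 8 * FFMA       # 504 cycles
with_tc = GLOBAL_MEM + 1 * SHARED_MEM + 1 * TENSOR_CORE   # 235 cycles

print(f"without Tensor Cores: {without_tc} cycles")
print(f"with Tensor Cores:    {with_tc} cycles")
print(f"speedup: {without_tc / with_tc:.2f}x")            # ~2.1x
```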
This example is simplified. In real cases, each thread must calculate which memory to read and write during the transfer from global to shared memory. The new Hopper (H100) architecture includes the Tensor Memory Accelerator (TMA) to compute these indices in hardware, allowing each thread to focus on more computation.
3.1.3. Matrix Multiplication With Tensor Cores, Asynchronous Copies (RTX 30/RTX 40), and TMA (H100)
The RTX 30 Ampere and RTX 40 Ada series GPUs support asynchronous transfers between global and shared memory. The H100 Hopper GPU extends this with the Tensor Memory Accelerator (TMA) unit, combining asynchronous copies and index calculation for reads and writes. Threads no longer need to calculate the next element to read, focusing instead on matrix multiplication.
The TMA unit fetches memory from global to shared memory (200 cycles). Once data arrives, the TMA unit fetches the next block asynchronously. While this occurs, threads load data from shared memory and perform matrix multiplication via the Tensor Core. Once finished, threads wait for the TMA unit to finish the next data transfer, repeating the sequence.
Due to the asynchronous nature, the second global memory read by the TMA unit progresses as the threads process the current shared memory tile. Thus, the second read takes only 200 – 34 – 1 = 165 cycles.
Since we do many reads, only the first memory access is slow. On average, we reduce the time by 35 cycles.
165 cycles (wait for async copy to finish) + 34 cycles (shared memory) + 1 cycle (Tensor Core) = 200 cycles.
This accelerates matrix multiplication by another 15%.
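Extending the same toy tally, the sketch below shows where the asynchronous numbers come from: only the first load pays the full 200 cycles, and every subsequent load is hidden behind compute except for a 165-cycle wait.

```python
# Steady-state cycle cost with asynchronous global->shared copies.
# Numbers follow the simplified example in the text, not a real kernel.

GLOBAL_MEM, SHARED_MEM, TENSOR_CORE = 200, 34, 1

sync_cost = GLOBAL_MEM + SHARED_MEM + TENSOR_CORE           # 235 cycles
wait_for_copy = GLOBAL_MEM - SHARED_MEM - TENSOR_CORE       # 165 cycles
async_cost = wait_for_copy + SHARED_MEM + TENSOR_CORE       # 200 cycles

print(f"synchronous:  {sync_cost} cycles")
print(f"asynchronous: {async_cost} cycles")
# ~18% in this toy model, roughly the ~15% quoted above
print(f"additional speedup: {(sync_cost / async_cost - 1) * 100:.0f}%")
```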
These examples illustrate why memory bandwidth is crucial for Tensor-Core-equipped GPUs. Since global memory has the largest cycle cost for matrix multiplication with Tensor Cores, reducing global memory latency would result in even faster GPUs. This can be achieved by increasing the clock frequency of the memory (more cycles per second) or by increasing the number of elements transferred at once (bus width).
3.2. Memory Bandwidth
From the previous section, we see that Tensor Cores are very fast, so fast that they are often idle while waiting for data from global memory. During GPT-3-sized training, which uses huge matrices (the larger, the better for Tensor Cores), Tensor Core TFLOPS utilization is about 45-65%, meaning the Tensor Cores are idle roughly half the time.
When comparing Tensor Core GPUs, memory bandwidth is a key indicator of performance. For example, the A100 GPU has 1,555 GB/s memory bandwidth versus the V100’s 900 GB/s. Therefore, a basic estimate of the speedup of an A100 vs. V100 is 1555/900 = 1.73x.
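As a rule of thumb, when two GPUs both have Tensor Cores and the workload is memory-bound, you can sketch the expected speedup directly from the bandwidth ratio:

```python
# Rough bandwidth-based speedup estimate between two Tensor Core GPUs.
# Only meaningful when matrix multiplication is memory-bandwidth bound,
# which is the common case for large deep learning workloads.

def bandwidth_speedup(bw_new_gbs: float, bw_old_gbs: float) -> float:
    return bw_new_gbs / bw_old_gbs

print(f"A100 vs. V100: ~{bandwidth_speedup(1555, 900):.2f}x")  # ~1.73x
```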
3.3. L2 Cache / Shared Memory / L1 Cache / Registers
Since memory transfers to Tensor Cores limit performance, we seek GPU attributes that enable faster memory transfer. L2 cache, shared memory, L1 cache, and the number of registers are all related. Understanding how a memory hierarchy enables faster memory transfers requires understanding how matrix multiplication is performed on a GPU.
To perform matrix multiplication, we use the GPU memory hierarchy: slow global memory, faster L2 memory, fast local shared memory, and lightning-fast registers. Faster memory is smaller.
Logically, L1 and L2 memory work the same way, but the L2 cache is larger, which increases the average physical distance a cache line must travel. Think of L1 and L2 caches as organized warehouses: the larger the warehouse, the longer it takes to retrieve an item, even if you know exactly where it is stored. This is the essential difference: large = slow, small = fast.
For matrix multiplication, this hierarchical separation into smaller, faster memory chunks allows very fast matrix multiplications. We chunk the big matrix multiplication into smaller sub-matrix multiplications called memory tiles.
We perform matrix multiplication across these smaller tiles in local shared memory, which is fast and close to the streaming multiprocessor (SM) – the equivalent of a CPU core. With Tensor Cores, we load parts of these tiles into Tensor Cores directly addressed by registers. A matrix memory tile in L2 cache is 3-5x faster than global GPU memory (GPU RAM), shared memory is ~7-10x faster, and Tensor Cores’ registers are ~200x faster.
Larger tiles mean more memory reuse. This is detailed in my TPU vs GPU blog post. TPUs have very large tiles for each Tensor Core, allowing them to reuse more memory with each global memory transfer, making them slightly more efficient at matrix multiplications than GPUs.
Each tile size is determined by the memory per streaming multiprocessor (SM) and the L2 cache across all SMs. The shared memory sizes on various architectures are:
- Volta (Titan V): 128 KB shared memory / 6 MB L2
- Turing (RTX 20s series): 96 KB shared memory / 5.5 MB L2
- Ampere (RTX 30s series): 128 KB shared memory / 6 MB L2
- Ada (RTX 40s series): 128 KB shared memory / 72 MB L2
Ada has a much larger L2 cache, enabling larger tile sizes and reducing global memory access. For example, the input and weight matrices of any single matrix multiplication in BERT Large training fit neatly into Ada’s L2 cache, but not into that of other GPUs. As such, data only needs to be loaded from global memory once, making matrix multiplication about 1.5-2.0x faster on Ada. Larger models see lower speedups during training, but certain sweet spots can make particular models much faster. Inference with batch sizes larger than 8 can also benefit from the larger L2 cache.
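As a sanity check of the L2-cache argument, here is a rough Python estimate of whether the operands of a single BERT Large feed-forward matrix multiplication fit into L2. The dimensions (hidden size 1024, FFN size 4096, sequence length 512, batch size 8, FP16) are assumptions chosen for illustration, not the exact benchmark configuration.

```python
# Rough check: do the activation and weight matrices of one BERT Large
# feed-forward matmul fit into L2 cache? All dimensions are illustrative
# assumptions; FP16 means 2 bytes per element.

BYTES = 2
batch, seq, hidden, ffn = 8, 512, 1024, 4096

activations = batch * seq * hidden * BYTES    # input matrix
weights = hidden * ffn * BYTES                # weight matrix
total_mb = (activations + weights) / 1024**2  # ~16 MB

for name, l2_mb in [("Ampere (RTX 30), 6 MB L2", 6), ("Ada (RTX 40), 72 MB L2", 72)]:
    print(f"{name}: operands ({total_mb:.0f} MB) fit = {total_mb <= l2_mb}")
```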
4. Estimating Ada / Hopper Deep Learning Performance
This section is for those who want the technical details behind the performance estimates for Ada and Hopper GPUs. If you do not care about these technical aspects, it is safe to skip this section.
4.1. Practical Ada / Hopper Speed Estimates
If we have an estimate for one GPU of an architecture like Hopper, Ada, Ampere, Turing, or Volta, extrapolating those results to other GPUs of the same architecture/series is straightforward. NVIDIA has already benchmarked the A100 vs. V100 vs. H100 across various computer vision and natural language understanding tasks. However, NVIDIA made sure the numbers are not directly comparable by using different batch sizes and different numbers of GPUs wherever possible to favor the H100. So the benchmark numbers are partly honest and partly marketing. Using larger batch sizes is fair, as the H100/A100 has more memory, but comparing GPU architectures requires an unbiased evaluation at the same batch size.
To get an unbiased estimate, we can scale the data center GPU results by: (1) accounting for batch size differences, and (2) accounting for differences in using 1 vs. 8 GPUs. We can find such an estimate for both biases in NVIDIA’s data.
Doubling the batch size increases throughput in terms of images/s (CNNs) by 13.6%. Benchmarking transformers on my RTX Titan yielded the same result: 13.5% — it appears that this is a robust estimate.
Parallelizing networks across more GPUs results in performance loss due to networking overhead. The A100 8x GPU system has better networking (NVLink 3.0) than the V100 8x GPU system (NVLink 2.0) — another confounding factor. Looking at the data from NVIDIA, we find that for CNNs, an 8x A100 system has 5% lower overhead than an 8x V100 system. This means if going from 1x A100 to 8x A100 gives you a speedup of, say, 7.00x, then going from 1x V100 to 8x V100 only gives you a speedup of 6.67x. For transformers, the figure is 7%.
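The sketch below shows how these two correction factors can be applied to a raw benchmark ratio. The raw ratio and the number of batch-size doublings are placeholders; the 13.5% and 5%/7% figures are the estimates from the text.

```python
# De-bias a vendor benchmark ratio using the two correction factors above:
# ~13.5% throughput gain per batch-size doubling and the difference in
# multi-GPU parallelization overhead (5% for CNNs, 7% for transformers).

def unbiased_speedup(raw_ratio: float,
                     batch_doublings: int,
                     parallel_overhead: float) -> float:
    """Remove the newer GPU's advantage from a larger batch size and
    from better 8x scaling."""
    batch_correction = 1.135 ** batch_doublings
    return raw_ratio / batch_correction / (1 + parallel_overhead)

# Hypothetical transformer example: one batch-size doubling, 7% overhead gap.
print(f"{unbiased_speedup(raw_ratio=2.0, batch_doublings=1, parallel_overhead=0.07):.2f}x")
```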
Using these figures, we can estimate the speedup for specific deep learning architectures from NVIDIA’s data. The Tesla A100 offers the following speedup over the Tesla V100:
- SE-ResNeXt101: 1.43x
- Mask R-CNN: 1.47x
- Transformer (12 layer, Machine Translation, WMT14 en-de): 1.70x
The figures are lower than the theoretical estimate for computer vision. This might be due to smaller tensor dimensions, overhead from operations needed to prepare the matrix multiplication (like img2col or Fast Fourier Transform (FFT)), or operations that cannot saturate the GPU (final layers are often relatively small). It could also be artifacts of specific architectures (grouped convolution).
The practical transformer estimate is very close to the theoretical estimate. This is probably because algorithms for huge matrices are very straightforward. I will use these practical estimates to calculate the cost efficiency of GPUs.
4.2. Possible Biases in Estimates
The estimates above are for H100, A100, and V100 GPUs. In the past, NVIDIA snuck unannounced performance degradations into the “gaming” RTX GPUs: (1) Decreased Tensor Core utilization, (2) gaming fans for cooling, (3) disabled peer-to-peer GPU transfers. There might be unannounced performance degradations in the RTX 40 series compared to the full Hopper H100.
One such degradation was found for Ampere GPUs: Tensor Core performance was decreased so that RTX 30 series GPUs are not as good as Quadro cards for deep learning. This was also done for the RTX 20 series, so it is nothing new, but this time it was also done for the Titan equivalent card, the RTX 3090. The RTX Titan did not have performance degradation enabled.
Currently, no degradations are known for Ada GPUs, but I will update this post with any news and let my followers on Twitter know.
5. Advantages and Problems for RTX 40 and RTX 30 Series
The new NVIDIA Ampere RTX 30 series has benefits over the NVIDIA Turing RTX 20 series, such as sparse network training and inference. Other features, such as the new data types, should be seen more as ease-of-use features, since they provide the same performance boost as Turing but without any extra programming required.
The Ada RTX 40 series has further advances, such as 8-bit Float (FP8) Tensor Cores. The RTX 40 series has power and temperature issues similar to the RTX 30 series. The issue of melting power connector cables on the RTX 40 can be easily prevented by connecting the power cable correctly.
5.1. Sparse Network Training
Ampere allows fine-grained structured automatic sparse matrix multiplication at dense speeds. How does this work? Take a weight matrix and slice it into pieces of 4 elements. Now imagine 2 of these 4 elements are zero. Figure 1 shows what this could look like.
Figure 1: Structure supported by the sparse matrix multiplication feature in Ampere GPUs. The figure is taken from Jeff Pool’s GTC 2020 presentation on Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA.
When you multiply this sparse weight matrix with dense inputs, the sparse matrix Tensor Core feature in Ampere automatically compresses the sparse matrix to a dense representation that is half the size, as shown in Figure 2. After this compression, the densely compressed matrix tile is fed into the Tensor Core, which computes a matrix multiplication of twice the usual size. This effectively yields a 2x speedup, since the bandwidth requirements during matrix multiplication from shared memory are halved.
Figure 2: The sparse matrix is compressed to a dense representation before the matrix multiplication is performed. The figure is taken from Jeff Pool’s GTC 2020 presentation on Accelerating Sparsity in the NVIDIA Ampere Architecture by the courtesy of NVIDIA.
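To make the 2:4 pattern concrete, here is a small NumPy sketch that prunes a weight matrix to the 2-out-of-4 structure and packs it into the half-size values-plus-indices form sketched in Figures 1 and 2. It mimics the data layout idea only; it is not NVIDIA’s actual hardware path or API.

```python
import numpy as np

def prune_2_of_4(w: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude weights in every group of 4."""
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(w.shape)

def compress_2_of_4(w_pruned: np.ndarray):
    """Store only the 2 kept values per group plus their positions."""
    groups = w_pruned.reshape(-1, 4)
    keep = np.sort(np.argsort(np.abs(groups), axis=1)[:, 2:], axis=1)
    values = np.take_along_axis(groups, keep, axis=1)   # half-size matrix
    return values, keep.astype(np.uint8)                # metadata indices

w = np.random.randn(8, 8).astype(np.float32)
values, indices = compress_2_of_4(prune_2_of_4(w))
print(values.shape, indices.shape)   # (16, 2) (16, 2): half the elements remain
```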
I was working on sparse network training in my research and I also wrote a blog post about sparse training. One criticism of my work was that “You reduce the FLOPS required for the network, but it does not yield speedups because GPUs cannot do fast sparse matrix multiplication.” Well, with the addition of the sparse matrix multiplication feature for Tensor Cores, my algorithm, or other sparse training algorithms, now actually provide speedups of up to 2x during training.
Figure 3: The sparse training algorithm that I developed has three stages: (1) Determine the importance of each layer. (2) Remove the smallest, unimportant weights. (3) Grow new weights proportional to the importance of each layer. Read more about my work in my sparse training blog post.
While this feature is still experimental and training sparse networks is not yet commonplace, having it on your GPU means you are ready for the future of sparse training.
5.2. Low-precision Computation
In my work, I’ve previously shown that new data types can improve stability during low-precision backpropagation.
Figure 4: Low-precision deep learning 8-bit datatypes that I developed. Deep learning training benefits from highly specialized data types. My dynamic tree datatype uses a dynamic bit that indicates the beginning of a binary bisection tree that quantizes the range [0, 0.9], while all preceding bits are used for the exponent. This allows numbers that are both large and small to be represented dynamically with high precision.
Currently, the big problem with stable backpropagation in 16-bit floating point (FP16) is that ordinary FP16 only supports numbers in the range [-65,504, 65,504]. If your gradients slip past this range, they explode into NaN values. To prevent this during FP16 training, we usually perform loss scaling, where the loss is multiplied by a scaling factor before backpropagation so that gradients stay within the representable FP16 range.
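For reference, this is what loss scaling typically looks like in a framework. The sketch below uses PyTorch’s torch.cuda.amp utilities; the model, optimizer, and data are placeholders.

```python
import torch

# Minimal sketch of FP16 training with loss scaling. The scaler multiplies
# the loss by a scale factor before backprop so gradients stay within
# FP16's representable range, and skips the step when it detects inf/NaN.

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(x).pow(2).mean()       # placeholder loss
    scaler.scale(loss).backward()           # backprop on the scaled loss
    scaler.step(optimizer)                  # unscales grads, skips on inf/NaN
    scaler.update()                         # adapts the scale factor
```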
The BrainFloat 16 format (BF16) uses more bits for the exponent, so its range of representable numbers is the same as FP32: [-3*10^38, 3*10^38]. BF16 has less precision, that is, fewer significant digits, but gradient precision is not that important for learning. So with BF16 you no longer need to do loss scaling or worry about gradients blowing up. As such, we should see an increase in training stability with the BF16 format at the cost of a slight loss of precision.
What this means for you: with BF16 precision, training may be more stable than with FP16 precision while providing the same speedups. With TensorFloat 32 (TF32) precision, you get near-FP32 stability with speedups close to FP16. The good thing is that to use these data types you can just replace FP32 with TF32 and FP16 with BF16, with no code changes required.
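A minimal sketch of that drop-in usage in PyTorch, assuming an Ampere or newer GPU: enable TF32 for FP32 matrix multiplies and run the forward pass under BF16 autocast, with no GradScaler needed.

```python
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for FP32 matmuls
torch.backends.cudnn.allow_tf32 = True         # TF32 inside cuDNN convolutions

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(32, 1024, device="cuda")

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    out = model(x)                             # matmuls run in BF16
out.float().pow(2).mean().backward()           # no loss scaling required
```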
Overall, though, these new data types can be seen as lazy data types in the sense that you could have gotten all the benefits with the old data types with some additional programming efforts (proper loss scaling, initialization, normalization, using Apex). As such, these data types do not provide speedups but rather improve ease of use of low precision for training.
5.3. Fan Designs and GPUs Temperature Issues
While the new fan design of the RTX 30 series performs very well at cooling the GPU, different fan designs on non-Founders-Edition GPUs might be more problematic. If your GPU heats up beyond 80°C, it will throttle itself and slow down its computational speed/power. This overheating happens in particular if you stack multiple GPUs next to each other. A solution is to use PCIe extenders to create space between GPUs.
Spreading GPUs with PCIe extenders is very effective for cooling, and other fellow PhD students at the University of Washington and I use this setup with great success. It does not look pretty, but it keeps your GPUs cool! This rig has been running with no problems at all for 4 years now. Extenders can also help if you do not have enough space to fit all GPUs in the PCIe slots. For example, if you can find the space within a desktop computer case, it might be possible to buy standard 3-slot-width RTX 4090s and spread them with PCIe extenders within the case. With this, you might solve both the space issue and the cooling issue for a 4x RTX 4090 setup with a single simple solution.
Figure 5: 4x GPUs with PCIe extenders. It looks like a mess, but it is very effective for cooling. I used this rig for 4 years and cooling is excellent despite problematic RTX 2080 Ti Founders Edition GPUs.
5.4. 3-slot Design and Power Issues
The RTX 3090 and RTX 4090 are 3-slot GPUs, so you will not be able to use them in a 4x setup with NVIDIA’s default fan design. This is somewhat justified: they run at 350W TDP or more, and it is difficult to cool them in a multi-GPU 2-slot setting. The RTX 3080 is only slightly better at 320W TDP, and cooling a 4x RTX 3080 setup will also be very difficult.
It is also difficult to power a 4x 350W = 1400W or 4x 450W = 1800W system in the 4x RTX 3090 or 4x RTX 4090 case. Power supply units (PSUs) of 1600W are readily available, but having only 200W left to power the CPU and motherboard can be too tight. The components’ maximum power is only drawn if they are fully utilized, and in deep learning the CPU is usually only under weak load. With that, a 1600W PSU might work quite well with a 4x RTX 3080 build, but for a 4x RTX 3090 build it is better to look for high-wattage PSUs (1700W+). Some of my followers have had great success with cryptomining PSUs; have a look in the comment section for more info. Otherwise, note that not all outlets support PSUs above 1600W, especially in the US, which is why there are currently few standard desktop PSUs above 1600W on the US market. If you get a server or cryptomining PSU, beware of the form factor and make sure it fits into your computer case.
5.5. Power Limiting: An Elegant Solution to Solve the Power Problem?
It is possible to set a power limit on your GPUs. For example, you can programmatically set the power limit of an RTX 3090 to 300W instead of its standard 350W. In a 4x GPU system, that is a saving of 200W, which might just be enough to make a 4x RTX 3090 system with a 1600W PSU feasible. It also helps keep the GPUs cool. So setting a power limit can solve the two major problems of a 4x RTX 3080 or 4x RTX 3090 setup (cooling and power) at the same time. For a 4x setup, you still need effective blower GPUs (and the standard design may prove adequate for this), but power limiting resolves the PSU problem.
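In practice, the power limit can be set with nvidia-smi (this requires root/administrator rights and resets on reboot unless persistence mode is enabled). A small sketch, with the 300 W target and four GPU indices as example values:

```python
import subprocess

POWER_LIMIT_W = 300          # example target; check your card's supported range
for gpu_index in range(4):   # example: a 4x GPU system
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(POWER_LIMIT_W)],
        check=True,
    )
```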
Figure 6: Reducing the power limit has a slight cooling effect. Reducing the RTX 2080 Ti power limit by 50-60 W decreases temperatures slightly and fans run more silent.
You might ask, “Doesn’t this slow down the GPU?” Yes, it does, but the question is by how much. I benchmarked the 4x RTX 2080 Ti system shown in Figure 5 under different power limits, measuring the time for 500 mini-batches of BERT Large during inference (excluding the softmax layer). I chose BERT Large inference since, in my experience, it is the deep learning model that stresses the GPU the most, so I would expect power limiting to cause the largest slowdown for this model. As such, the slowdowns reported here are probably close to the maximum slowdowns you can expect. The results are shown in Figure 7.
Figure 7: Measured slowdown for a given power limit on an RTX 2080 Ti. Measurements taken are mean processing times for 500 mini-batches of BERT Large during inference (excluding softmax layer).
As we can see, setting the power limit does not seriously affect performance. Limiting the power by 50W — more than enough to handle 4x RTX 3090 — decreases performance by only 7%.
5.6. RTX 4090s and Melting Power Connectors: How to Prevent Problems
There was a misconception that RTX 4090 power cables melt because they were bent. However, it was found that only 0.1% of users had this problem, and it occurred due to user error. This video shows that the main problem is cables that were not inserted correctly.
So using RTX 4090 cards is perfectly safe if you follow these installation instructions:
- If you use an old cable or old GPU, make sure the contacts are free of debris and dust.
- Use the power connector and stick it into the socket until you hear a *click* — this is the most important part.
- Test for good fit by wiggling the power cable left to right. The cable should not move.
- Visually check the contact with the socket; there should be no gap between cable and socket.
5.7. 8-bit Float Support in H100 and RTX 40 series GPUs
Support for the 8-bit Float (FP8) data type is a huge advantage for the RTX 40 series and H100 GPUs. With 8-bit inputs, you can load the data for matrix multiplication twice as fast and store twice as many matrix elements in your caches, which are very large in the Ada and Hopper architectures. With FP8 Tensor Cores, an RTX 4090 reaches 0.66 PFLOPS of compute, more FLOPS than the world’s fastest supercomputer in 2007. Four RTX 4090s with FP8 compute rival the fastest supercomputer in the world in 2010 (deep learning only started to work in 2009).
The main problem with 8-bit precision is that transformers can become very unstable with so few bits and crash during training or generate nonsense during inference. I have written a paper about the emergence of instabilities in large language models, and I have also written a more accessible blog post.
The main takeaway is this: using 8-bit instead of 16-bit makes things very unstable, but if you keep a couple of dimensions in high precision, everything works just fine.
Main results from my work on 8-bit matrix multiplication for Large Language Models (LLMs). We can see that the best 8-bit baseline fails to deliver good zero-shot performance. The method that I developed, LLM.int8(), can perform Int8 matrix multiplication with the same results as the 16-bit baseline.
But Int8 was already supported by RTX 30 / A100 / Ampere-generation GPUs, so why is FP8 in the RTX 40 another big upgrade? The FP8 data type is much more stable than Int8, and it is easy to use in functions like layer norm or non-linear functions, which are difficult to do with integer data types. This will make it very straightforward to use in training and inference. I think this will make FP8 training and inference relatively common within a couple of months.
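The sketch below illustrates the general intuition with a toy NumPy simulation: a float-style format keeps roughly constant relative precision across magnitudes, while absmax Int8 quantization lets a few outliers blow up the step size and wipe out the small values. It does not reproduce the real FP8/Int8 hardware formats or LLM.int8().

```python
import numpy as np

def int8_absmax_quantize(x: np.ndarray) -> np.ndarray:
    """Uniform steps, sized by the largest (outlier) value."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale) * scale

def floatlike_quantize(x: np.ndarray, mantissa_bits: int = 3) -> np.ndarray:
    """Round the mantissa only: roughly constant relative error."""
    mant, exp = np.frexp(x)
    mant = np.round(mant * 2**mantissa_bits) / 2**mantissa_bits
    return np.ldexp(mant, exp)

rng = np.random.default_rng(0)
x = rng.normal(scale=0.01, size=10_000)
x[:10] = 50.0   # a few large outliers, as seen in transformer activations

for name, q in [("int8 absmax", int8_absmax_quantize(x)),
                ("float-like", floatlike_quantize(x))]:
    err = np.abs(q - x)[10:].mean()   # error on the typical (small) values
    print(f"{name}: mean error on non-outliers = {err:.1e}")
```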
If you want to read more about the advantages of float vs. integer data types, you can read my recent paper on k-bit inference scaling laws. Below you can see one relevant result for float vs. integer data types from this paper. We can see that, bit for bit, the FP4 data type preserves more information than the Int4 data type and thus improves the mean LLM zero-shot accuracy across 4 tasks.
4-bit inference scaling laws for Pythia Large Language Models for different data types. We see that, bit for bit, 4-bit float data types have better zero-shot accuracy than Int4 data types.
6. Raw Performance Ranking of GPUs
Below is a chart of raw relative performance across all GPUs. We see that there is a gigantic gap between the 8-bit performance of H100 GPUs and older cards optimized for 16-bit performance.
Shown is the raw relative transformer performance of GPUs. For example, an RTX 4090 has about 0.33x the performance of an H100 SXM for 8-bit inference. In other words, an H100 SXM is three times faster for 8-bit inference than an RTX 4090.
For this data, I did not model 8-bit compute for older GPUs. I did