FP8 vs. FP16: Choosing the Right Precision for Deep Learning
Nathanael Chiang
As deep learning models push the limits of memory and computational power, choosing the proper precision format, whether FP16 or FP8, is key to balancing performance, memory, and processing speed. So, how do you decide which is best?
Understanding Floating Point Numbers
What Are Floating Point Numbers?
Floating-point numbers let us represent a wide range of values efficiently. A floating-point number consists of three parts: the sign, the exponent (which sets the range), and the mantissa (which sets the precision). The introduction of floating-point formats was a major breakthrough in numerical computing, with origins tracing back over a century.
Today, the IEEE 754-2019 specification sets an internationally recognized standard for representing floating-point numbers. Integers, by contrast, have no exponent: every bit goes toward significant digits, which limits their ability to represent extremely large or extremely small numbers efficiently.
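To make the three parts concrete, here is a minimal sketch that splits a 32-bit IEEE 754 float into its sign, exponent, and mantissa fields and then rebuilds the value from them. The helper name float32_fields is made up for this illustration, and the reconstruction formula only covers normal (non-zero, finite) numbers.

```python
import struct

def float32_fields(x):
    """Return (sign, biased exponent, mantissa bits) of x stored as a 32-bit float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits, fraction after the implicit leading 1
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(-6.5)

# Rebuild the value from its fields (valid for normal numbers only).
value = (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)
print(sign, exponent, mantissa, value)
```

Here -6.5 is stored as -1.625 x 2^2, so the sign bit is 1 and the stored exponent is 2 + 127 = 129.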
Precision Options for Deep Learning Models
Traditionally, 32-bit floating point (FP32) has been the standard for training deep learning models. Compared to FP16 and FP8, FP32 represents real numbers more accurately, which reduces the effect of rounding in complex calculations.
There are many scenarios where high precision is necessary. For example, in medical imaging, subtle differences in pixel intensity can affect a doctor's ability to accurately identify anomalies such as tumors.
However, when training deep learning models, lower-precision floating-point formats are more memory- and compute-efficient because they use fewer bits. FP16 halves the width from 32 bits to 16, shrinking the exponent from 8 bits to 5 and the mantissa from 23 bits to 10. FP8 halves it again, to 8 bits. Since FP32 uses four times as many bits as FP8 and twice as many as FP16, FP8 can provide up to a 4x improvement in speed and memory usage over FP32 and up to a 2x improvement over FP16.
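The memory side of this arithmetic can be sketched in a few lines. The 7-billion-parameter model size is a hypothetical example, and the estimate counts weights only; activations, optimizer state, and any per-tensor scale factors are ignored.

```python
# Bytes needed to store one value in each format.
BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "FP8": 1}

def weight_memory_gib(num_params, fmt):
    """Approximate memory (GiB) needed to hold num_params weights in the given format."""
    return num_params * BYTES_PER_VALUE[fmt] / 2**30

num_params = 7_000_000_000  # hypothetical model size, for illustration only
for fmt in ("FP32", "FP16", "FP8"):
    print(f"{fmt}: {weight_memory_gib(num_params, fmt):.1f} GiB")
```

The 4x and 2x figures are upper bounds: actual speedups depend on whether the hardware has native support for the format and on how much of the workload is limited by memory bandwidth.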
There are two standard versions of FP8: E4M3 and E5M2. E4M3 has four exponent bits and three mantissa bits, and E5M2 has five exponent bits and two mantissa bits; each also has one sign bit. With its extra exponent bit, E5M2 can store a wider range of values, but with less precision. E4M3 is used more frequently because it strikes a better balance between precision and range for most tensors.
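The range difference between the two variants can be worked out directly from the bit counts. The sketch below makes an assumption that is not stated in this article: E5M2 reserves its top exponent code for infinities and NaNs in the IEEE 754 style, while E4M3 follows the widely used convention of having no infinities and reserving only a single mantissa pattern for NaN. The function name fp8_limits is invented for this example.

```python
def fp8_limits(exp_bits, man_bits, ieee_like):
    """Return (largest finite value, smallest normal, smallest subnormal)."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (assumed for E5M2).
        max_exp = (2**exp_bits - 2) - bias
        max_frac = 2 - 2.0 ** -man_bits
    else:
        # Top exponent code is usable; only the all-ones mantissa is NaN (assumed for E4M3).
        max_exp = (2**exp_bits - 1) - bias
        max_frac = 2 - 2.0 ** -(man_bits - 1)
    largest = max_frac * 2.0**max_exp
    smallest_normal = 2.0 ** (1 - bias)
    smallest_subnormal = smallest_normal * 2.0 ** -man_bits
    return largest, smallest_normal, smallest_subnormal

e4m3 = fp8_limits(4, 3, ieee_like=False)
e5m2 = fp8_limits(5, 2, ieee_like=True)
print("E4M3:", e4m3)
print("E5M2:", e5m2)
```

Under these assumptions E4M3 tops out at 448 while E5M2 reaches 57344, but E4M3 has twice as many representable values between any two consecutive powers of two.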
As we can see, formats with fewer exponent bits cover a smaller range, which makes them more prone to numerical instability during training. When a number is too small to be represented in a given floating-point format, it is rounded to zero. This is called underflow, and it can be problematic in deep learning because computations often involve tiny values.
Underflow can exacerbate the vanishing gradient problem. During backpropagation, tiny gradients may be rounded to zero, so the corresponding weights receive no update and the affected neurons effectively stop learning. Similarly, overflow occurs when a number is too large for the format's exponent range, in which case it typically becomes infinity.
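Both failure modes are easy to reproduce with NumPy's float16 type. This is only an illustration: the two input values are arbitrary, chosen to fall just outside what FP16 can represent.

```python
import numpy as np

info = np.finfo(np.float16)
print("largest FP16 value:", info.max)           # 65504
print("smallest normal FP16 value:", info.tiny)  # about 6.1e-05

# Silence NumPy's overflow/underflow warnings so the casts run quietly.
with np.errstate(over="ignore", under="ignore"):
    tiny = np.float32(1e-8).astype(np.float16)     # too small: underflows to zero
    huge = np.float32(70000.0).astype(np.float16)  # too large: overflows to infinity

print(tiny, huge)
```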
We have examined the key trade-offs of low and high-precision floating points. What if we combine the two?
Mixed Precision Training for Efficient Deep Learning
While high precision may be required in certain key components, it may not be necessary at every stage. Mixed precision training is a technique that uses FP32 where necessary and FP16 in all other steps. Combining these data types preserves the benefits of lower-precision formats, such as lower memory use and shorter training time, while maintaining the model's accuracy.
To prevent underflow, loss scaling is used to preserve small gradient magnitudes. Before backpropagation starts, the loss is multiplied by a scale factor so that small gradients do not round to zero. After backpropagation, the gradients are divided by the same factor before the weight update. With this method, deep learning models retain the memory and speed benefits of lower-precision formats such as FP16 and FP8 without sacrificing training stability.
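The effect of loss scaling on a single gradient value can be sketched with NumPy. This is a simplification: in real training the scale is applied to the loss and reaches every gradient through the chain rule, whereas here the gradient is multiplied directly. The gradient value and the scale factor of 65536 are hypothetical.

```python
import numpy as np

true_grad = np.float32(1e-8)      # a gradient too small for FP16
loss_scale = np.float32(65536.0)  # hypothetical scale factor (2**16)

with np.errstate(under="ignore"):
    # Without scaling, the gradient underflows to zero when stored in FP16.
    unscaled_fp16 = true_grad.astype(np.float16)
    # With scaling, the value lands inside FP16's representable range.
    scaled_fp16 = (true_grad * loss_scale).astype(np.float16)

# Unscale in FP32 before the weight update.
recovered = scaled_fp16.astype(np.float32) / loss_scale

print(unscaled_fp16, scaled_fp16, recovered)
```

If the scale is too large, the scaled gradients can overflow instead, which is why frameworks typically adjust the factor during training rather than fixing it up front.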
Training Deep Learning Models
Fixed-Point Representation
An alternative 8-bit format to FP8 is INT8, a fixed-point representation in which values are integers rather than floating-point numbers. INT8 scales and rounds parameters to one of 256 possible values between -128 and 127, which reduces memory and power usage. However, this fixed, evenly spaced range creates a key challenge: the trade-off for memory efficiency is precision. This is especially problematic when converting a model trained in floating point to INT8, which is why FP8's higher dynamic range is often more appealing.
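A minimal sketch of symmetric per-tensor INT8 quantization shows where the precision goes. The function names and the sample weights are made up for this illustration, and the code assumes the tensor contains at least one non-zero value. Real toolchains typically use calibration data and per-channel scales.

```python
import numpy as np

def quantize_int8(x):
    """Map floats to INT8 using a single scale for the whole tensor."""
    x = np.asarray(x, dtype=np.float32)
    scale = np.max(np.abs(x)) / 127.0  # assumes x is not all zeros
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Map INT8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

# Three small weights and one outlier (hypothetical values).
weights = np.array([0.01, -0.02, 0.03, 8.0], dtype=np.float32)
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

print(q)         # the small weights all collapse to 0
print(restored)
```

Because the 256 levels are evenly spaced, a single outlier stretches the scale so far that the small weights all round to zero.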
E4M3 or E5M2?
As described above, the two FP8 variants split their bits differently: E4M3 spends four bits on the exponent and three on the mantissa, while E5M2 spends five on the exponent and two on the mantissa. E5M2 therefore covers a wider range at the cost of precision, and E4M3 offers finer precision over a narrower range. E4M3 is the more common default because it achieves a good compromise between the two; a typical convention is to use E4M3 for weights and activations and to reserve E5M2 for gradients, which tend to need range more than precision.
Quantization
Quantization is the method of reducing the precision of the numbers in a model to save memory and improve speed. It is much simpler to do this with FP8 than with INT8, because FP8 retains a floating-point representation. This means that not only an LLM's weights but also its activations and KV cache can be quantized, leading to more efficient inference. For example, converting a model from FP16 to FP8 can cut memory usage and reduce the cost of inference-time operations.
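To show why a floating-point 8-bit format is more forgiving, the sketch below simulates rounding to E4M3 in plain NumPy and applies it with a single per-tensor scale. It is a simulation under stated assumptions, not a real FP8 kernel: the function fake_quant_e4m3 is invented for this example, values are still stored as float32, the largest finite value is taken to be 448, and out-of-range inputs are clamped rather than turned into NaN.

```python
import numpy as np

E4M3_MAX = 448.0  # assumed largest finite E4M3 value

def fake_quant_e4m3(x):
    """Round values to the nearest E4M3-representable number (simulation only)."""
    x = np.clip(np.asarray(x, dtype=np.float32), -E4M3_MAX, E4M3_MAX)
    _, e = np.frexp(x)     # x = m * 2**e with 0.5 <= |m| < 1
    e = np.maximum(e, -5)  # below 2**-6, use the fixed subnormal spacing
    # 4 significant bits (1 implicit + 3 stored) gives a spacing of 2**(e - 4).
    spacing = np.ldexp(np.float32(1.0), e - 4)
    return (np.round(x / spacing) * spacing).astype(np.float32)

# Three small weights and one outlier (hypothetical values).
weights = np.array([0.01, -0.02, 0.03, 8.0], dtype=np.float32)
scale = np.max(np.abs(weights)) / E4M3_MAX  # map the largest weight to 448
restored = fake_quant_e4m3(weights / scale) * scale

print(restored)
```

Because the spacing between representable values shrinks along with the magnitude, the small weights survive with a relative error of a few percent even though the outlier sets the scale.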
Optimizing Model Inference with FP8 and FP16
FP8 and FP16 have become standard in many AI pipelines because of their ability to optimize large-scale model inference. The Modular Accelerated Xecution (MAX) platform is an integrated suite of AI libraries, tools, and technologies that natively supports both formats, allowing developers to achieve scalable inference efficiency on cutting-edge hardware. The MAX platform supports PyTorch and HuggingFace models, with FP8 and FP16 optimizations integrated directly into their APIs. FP8 quantization shrinks the memory footprint of large models, making it well suited for large-scale deployments on platforms such as MAX. FP16, meanwhile, remains essential for tasks such as mixed precision training within deep learning frameworks, where it accelerates training while numerical stability in gradient computations is preserved.
Choosing the right precision can be tricky. Formats such as FP32 may be necessary for high-accuracy applications, but at the cost of reduced computational efficiency. Lower-precision formats such as INT8 may be sufficient for applications with limited computational resources, but they may require careful tuning to maintain accuracy.
In the next generation of AI systems, FP8 and FP16 will play a critical role in optimizing deep learning models. By understanding different precision formats, you can optimize model deployment while preserving the necessary accuracy.



