A New IEEE 754 Standard for Floating-Point Arithmetic in an Ever-Changing World
Since 1985, most computational scientists—or anyone who uses floating-point (FP) arithmetic—have assumed that their computing platforms implement arithmetic operations according to the Institute of Electrical and Electronics Engineers (IEEE) 754 Standard for Floating-Point Arithmetic. This standard has made it much easier for researchers to write correct and portable code, since computers no longer round results or handle exceptions with the level of variety that existed among companies such as Digital Equipment Corporation, IBM, Cray, and Intel in 1985. Almost all differences became user-controlled under the standard, which also defined interchange formats to ease data porting between platforms and debugging efforts. The 2019 version of the IEEE standard provides new capabilities for reliable scientific computing, fixes bugs, and clarifies exceptional cases in operations and predicates.
The ever-changing world of technology motivates IEEE to periodically update all of its standards. One can only imagine an arithmetic standard that was (literally) set in stone and still uses base 60 instead of binary [6], as we do for keeping time. Beyond the few inevitable bug fixes, what changes in the world motivated updates in the most recent version of the IEEE 754 standard [4]? And what changes are still underway, unpredictable, and left to future versions of IEEE 754 or other arithmetic standards?
At a high level, one significant change is the burgeoning demand for reliability. A growing number of groups now depend on computing to make important decisions. Sometimes it takes a disastrous rocket launch [2], naval propulsion failure [8], or robotic car crash [7] (see Figure 1), all caused by faulty exception handling, to rouse public attention. The growing prominence of autonomous devices like cars and health monitors makes reliability even more critical.
One corresponding change in IEEE 754 is the addition of several new recommended operations: augmented addition, subtraction, and multiplication. Augmented addition takes two arguments—\(x\) and \(y\)—and returns two results: \(h=x+y\) rounded in a new way, and \(t=(x+y)-h\) exactly. Here, \(h\) stands for head and \(t\) stands for tail, since (barring exceptions) \(h+t=x+y\) exactly; \(h\) represents the leading bits of the sum (the head) and \(t\) represents the trailing bits (the tail). The new rounding mode for this specific instruction rounds \(h\) to the nearest FP number, breaking ties toward zero (as opposed to the nearest even number, which is the standard approach). This new instruction accelerates two high-level operations that both support reliability.
Augmented addition is also known as two sum, which programmers have long used to simulate double precision via single precision or quadruple precision via double precision [5]. When done appropriately, performing some operations in higher precision can significantly improve the error bounds and increase a calculation’s reliability. In software, Donald Knuth’s original algorithm for computing \(h\) and \(t\) costs six FP operations. A “fast” version requires three operations but assumes that \(|x| \geq |y|\). Neither algorithm handles exceptional cases uniformly. If augmented addition is implemented in hardware, however, it requires only one or two instructions and provides both significant speedups and uniform exception definitions.
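The two software algorithms described above can be sketched in Python as follows. This is a minimal illustration, assuming that Python floats are IEEE 754 binary64 values under the default round-to-nearest-even mode, so it does not reproduce the ties-toward-zero rounding of the new augmented addition. The names `two_sum`, `fast_two_sum`, and `is_exact` are illustrative, and overflow and other exceptional cases are ignored.

```python
from fractions import Fraction


def two_sum(x, y):
    """Branch-free two sum: six FP operations, no ordering assumption."""
    h = x + y
    y_virtual = h - x            # the part of y that made it into h
    x_virtual = h - y_virtual    # the part of x that made it into h
    y_error = y - y_virtual
    x_error = x - x_virtual
    t = x_error + y_error        # rounding error of x + y
    return h, t


def fast_two_sum(x, y):
    """Fast two sum: three FP operations, valid only when |x| >= |y|."""
    h = x + y
    t = y - (h - x)
    return h, t


def is_exact(x, y, h, t):
    """Check h + t == x + y in exact rational arithmetic."""
    return Fraction(h) + Fraction(t) == Fraction(x) + Fraction(y)


x, y = 0.1, 1e-20
h, t = two_sum(x, y)
print(h, t, is_exact(x, y, h, t))
```

The rational-arithmetic check confirms the head-plus-tail identity without relying on FP arithmetic itself. Swapping the arguments of `fast_two_sum` so that the smaller one comes first can break the identity, which is why the ordering assumption matters.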
The new definition of augmented addition employs a novel (for binary) rounding mode—rounding halfway cases to the nearest result that is smaller in magnitude (i.e., towards zero)—to support a new use case: bitwise reproducible FP summation [1]. Parallel and vector processing is now ubiquitous, and codes can no longer assume a fixed summation order. Because FP addition is not associative, the final results can differ substantially between runs. A prior SIAM News article summarizes real-life applications that range from debugging efforts to the detection of underground nuclear tests [3]. A portable algorithm uses the fast two-sum algorithm and bitwise-reproducibly sums \(n\) numbers—independent of the summation order—in approximately \(9n\) FP operations and \(3n\) bitwise operations in the common case [1]. Hardware-accelerated augmented addition reduces this calculation to \(4n\) or \(6n\) FP operations and no bitwise operations. Unlike rounding to nearest even, fixing the rounding mode to be independent of the result eliminates the bitwise operations. With parallel synchronization overhead, the utilization of four to six operations per entry enables programmers to make all summations reproducible by default with negligible cost.
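The order dependence described above is easy to observe. The sketch below, assuming binary64 Python floats, sums the same four numbers in two orders with a plain left-to-right loop and gets two different answers. It then uses `math.fsum` from the Python standard library as a stand-in for an order-independent result. Note that `math.fsum` is a different technique from the reproducible summation algorithm of [1]: it tracks exact partial sums to return an accurately rounded total, which is more expensive than the operation counts quoted above.

```python
import math


def naive_sum(values):
    """Left-to-right summation with one rounding per addition."""
    total = 0.0
    for v in values:
        total += v
    return total


BIG = 2.0 ** 53  # in [2**53, 2**54) consecutive doubles are 2 apart

order_a = [BIG, 1.0, -BIG, 1.0]   # the first 1.0 is lost when added to BIG
order_b = [1.0, 1.0, BIG, -BIG]   # 1.0 + 1.0 = 2.0 survives the addition

sum_a = naive_sum(order_a)
sum_b = naive_sum(order_b)
accurate_a = math.fsum(order_a)
accurate_b = math.fsum(order_b)

print(sum_a, sum_b)            # the naive results disagree
print(accurate_a, accurate_b)  # the accurately rounded results agree
```

The explicit loop is deliberate: the built-in `sum` may use a compensated algorithm for floats in recent Python versions, which would hide the effect.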
Another requirement for reliability is “consistent exception handling.” This concept’s definition may depend on context, but everyone agrees that computing the maximum or minimum of an array of numbers should yield the same result regardless of the argument order. Due to an oversight in the interaction of two sections of IEEE 754-2008, the definitions of max and min did not have this property when one argument was a “signaling NaN.” The 2019 standard deprecates these old definitions, and new recommended operations guarantee that min and max are associative.
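A small Python sketch illustrates why order independence needs an explicit rule for NaN. It does not model the signaling-NaN behavior of IEEE 754-2008. Instead it contrasts a purely comparison-based maximum, whose result depends on where the NaN sits, with a NaN-propagating maximum in the spirit of the `maximum` operation of the 2019 standard. The function names are illustrative, and the sketch ignores the ordering of signed zeros.

```python
import math
from functools import reduce


def comparison_max(a, b):
    """Keep a unless b compares greater; any comparison with NaN is false."""
    return b if b > a else a


def maximum_sketch(a, b):
    """NaN-propagating maximum: a NaN in either argument yields NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return b if b > a else a


nan_first = [math.nan, 1.0, 2.0]
nan_last = [1.0, 2.0, math.nan]

# The comparison-based reduction depends on the position of the NaN.
print(reduce(comparison_max, nan_first), reduce(comparison_max, nan_last))

# The NaN-propagating reduction gives NaN for every ordering.
print(reduce(maximum_sketch, nan_first), reduce(maximum_sketch, nan_last))
```

With the comparison-based version, a leading NaN is never displaced and a trailing NaN is never selected, so the two orderings disagree.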
Ensuring that higher-level software behaves consistently and portably with exceptional values requires work that falls outside of standard IEEE 754 arithmetic. For example, the reference implementation of the Basic Linear Algebra Subprograms (BLAS) routine NRM2, which computes the 2-norm of vector \(x\), may return NaN if \(x\) contains two or more infinite entries and no NaNs, where infinity would be the expected result; some implementations, like Intel’s Math Kernel Library, have repaired this issue. The reference implementation of the BLAS routine \(\rm{ISAMAX}\), which returns the index of the largest entry in terms of absolute value of input array \(x\), returns \(\rm{ISAMAX}([0,NaN,2])=3\) and \(\rm{ISAMAX}([NaN,0,2])=1\), so the NaN is ignored in one ordering and selected in the other. Even more examples of this phenomenon exist in BLAS and other widely used software. Carefully defining “consistency” and automating the identification and repair of such cases is a work in progress.
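Both behaviors can be reproduced with short Python sketches. These are assumptions about the general structure of such routines, not transcriptions of the reference BLAS source: `isamax_sketch` scans with a strict greater-than comparison, and `nrm2_sketch` uses a running scale with a scaled sum of squares, a common way to avoid overflow when squaring.

```python
import math


def isamax_sketch(x):
    """1-based index of the largest |x[i]|, found by a '>' scan."""
    if not x:
        return 0
    best_index, best_value = 1, abs(x[0])
    for i in range(1, len(x)):
        if abs(x[i]) > best_value:   # always false if either side is NaN
            best_index, best_value = i + 1, abs(x[i])
    return best_index


def nrm2_sketch(x):
    """2-norm via a running scale and a scaled sum of squares."""
    scale, ssq = 0.0, 1.0
    for xi in x:
        if xi != 0.0:
            a = abs(xi)
            if scale < a:
                r = scale / a
                ssq = 1.0 + ssq * r * r
                scale = a
            else:
                r = a / scale        # Inf / Inf produces NaN here
                ssq += r * r
    return scale * math.sqrt(ssq)


nan, inf = math.nan, math.inf
print(isamax_sketch([0.0, nan, 2.0]), isamax_sketch([nan, 0.0, 2.0]))
print(nrm2_sketch([inf, 1.0]), nrm2_sketch([inf, inf]))
```

In the scan, a NaN in the first position can never be displaced, while a NaN in a later position can never be selected. In the norm, the first infinity becomes the scale, and the second one triggers the division of infinity by infinity.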
An interesting challenge when defining consistency is that not all high-level languages agree on the definitions of basic operations. For example, multiplying two complex numbers \(x = (\textrm{Inf} + i*0)\) and \(y = (\textrm{Inf} + i*\textrm{Inf})\) yields \((\textrm{NaN} + i*\textrm{NaN}) \) in Fortran and \((-\textrm{NaN} + i*\textrm{Inf})\) in C. Backward compatibility may prevent languages from agreeing on the correct answer. Fortunately, the C and Fortran standardization committees are currently updating their definitions of max and min to match the new IEEE standard.
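To see where the NaNs come from, the textbook formula for a complex product can be evaluated component by component. The sketch below is an illustration only: whether a particular compiler evaluates the product this way is an assumption, and the extra recovery step that C's Annex G describes for infinite operands is not shown. The function name is hypothetical.

```python
import math


def textbook_complex_multiply(a_re, a_im, b_re, b_im):
    """(a_re + i*a_im) * (b_re + i*b_im) using four real multiplications."""
    re = a_re * b_re - a_im * b_im
    im = a_re * b_im + a_im * b_re
    return re, im


inf = math.inf

# x = Inf + i*0 and y = Inf + i*Inf, as in the text.
# Real part: Inf*Inf - 0*Inf = Inf - NaN
# Imaginary part: Inf*Inf + 0*Inf = Inf + NaN
re, im = textbook_complex_multiply(inf, 0.0, inf, inf)
print(re, im)
```

Both components become NaN because the product of zero and infinity is NaN, and that NaN then propagates through the addition or subtraction.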
Other novel recommended operations in the IEEE 754 standard include “payload” operations that read or write information stored in the fraction bits of a NaN, which allow more customized exception handling, as well as the previously undefined trigonometric functions \(\tan\textrm{Pi}(x) = \tan(\pi*x)\), \(\textrm{asin}\textrm{Pi}(x) = \textrm{asin}(x)/\pi\), and \(\textrm{acos}\textrm{Pi}(x) = \textrm{acos}(x)/\pi\). The standard also clarifies additional exceptional cases, such as those in the menagerie of \(x^y\) functions. Clarifications and additions to decimal arithmetic focus on the “quantum” that formalizes useful fixed-point aspects like dollars and cents. These items and other details are discussed in corresponding IEEE documents.
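The idea behind NaN payloads can be illustrated by manipulating the bits of a binary64 value directly. The sketch below is not the standard's payload operations themselves. It assumes the usual binary64 layout, treats the 51 fraction bits below the quiet bit as the payload, and uses the hypothetical helper names `nan_with_payload` and `payload_of`. Whether a payload survives arithmetic operations is platform dependent and is not tested here.

```python
import math
import struct

QUIET_NAN_BITS = 0x7FF8000000000000  # exponent all ones, quiet bit set
PAYLOAD_MASK = 0x0007FFFFFFFFFFFF    # the 51 fraction bits below the quiet bit


def nan_with_payload(payload):
    """Build a quiet NaN whose low fraction bits hold `payload`."""
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload does not fit in 51 bits")
    bits = QUIET_NAN_BITS | payload
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def payload_of(x):
    """Read the low fraction bits of a NaN back as an integer."""
    if not math.isnan(x):
        raise ValueError("not a NaN")
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & PAYLOAD_MASK


tagged = nan_with_payload(42)   # 42 could encode, say, an error-site identifier
print(math.isnan(tagged), payload_of(tagged))
```

A program could use such a tag to record where an invalid operation first occurred, which is one form of the customized exception handling mentioned above.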
Now we turn to the future. During the standard’s finalization, there was an explosion of 16-bit and smaller precisions for machine learning (ML) applications. IEEE 754-2008 formalized binary16 (which has 10 bits of precision plus one implicit bit, five bits of exponent, and one sign bit) with input from graphics hardware manufacturers. ML applications benefit from a wider exponent range to represent smaller probabilities, thus leading to formats like Google’s bfloat16 (with \(7(+1)\) bits of precision, eight bits of exponent, and one sign bit). Other ML architectures implement different partitionings of the 16 bits, and researchers are investigating the use of even fewer bits to accelerate both ML training and inference.
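The trade-off between precision and exponent range follows directly from the bit counts given above. The sketch below derives a few parameters from the precision (including the implicit bit) and the exponent width. It assumes IEEE-style encoding rules for both formats: a biased exponent, the all-ones exponent reserved for infinities and NaNs, and the all-zeros exponent reserved for zeros and subnormals. Applying these rules to bfloat16 is an assumption of this sketch, and the function name is illustrative.

```python
def format_parameters(precision_bits, exponent_bits):
    """Derive basic parameters; precision_bits counts the implicit bit."""
    bias = 2 ** (exponent_bits - 1) - 1
    emax = bias              # largest exponent of a finite number
    emin = 1 - bias          # smallest exponent of a normal number
    epsilon = 2.0 ** (1 - precision_bits)   # gap between 1.0 and the next number
    return {
        "epsilon": epsilon,
        "min_normal": 2.0 ** emin,
        "max_finite": (2.0 - epsilon) * 2.0 ** emax,
    }


binary16 = format_parameters(precision_bits=11, exponent_bits=5)
bfloat16 = format_parameters(precision_bits=8, exponent_bits=8)

for name, params in (("binary16", binary16), ("bfloat16", bfloat16)):
    print(name, params)
```

Under these assumptions, bfloat16 gives up three bits of precision relative to binary16 and gains the exponent range of a 32-bit format, which is the wider range for small probabilities that the text describes.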
ML optimizations are one example in which understanding arithmetic requirements is important for novel architectures. Other architectures work by distributing the FP load between control processors and memory-side processors. In the past, programmers have failed to ensure the reliability and reproducibility of these results for smart network interfaces that only accelerate the Message Passing Interface (MPI) and similar standards. Although distributed hardware supports newer memory-centric programming interfaces—which are intended to be transparent to programmers—they must accommodate the same semantic assumptions as sequential codes. Furthermore, developing arithmetic that is more amenable to low-power and high-error situations like interstellar probes requires additional end-to-end analysis.
Some incredibly novel architectures are pushing the limits of current numerical analysis. Bridging the gap between analog computing (like quantum) and the binary domain is an open field with many historic precedents. Advances in stochastic and semi-stochastic arithmetic also accentuate all of the issues that accompany the composition of different rounding and truncation methods. Though this matter lies beyond IEEE 754 and possibly beyond the rectilinear interval standard IEEE 1788.1, it still merits consideration.
Many opportunities exist for students and other researchers in these areas. It is also important to remember that not everything must live within one standard. IEEE 754 does not limit other ideas; instead, this evolving standard supports and inspires comparison. We encourage your undoubtedly vigorous comments; perhaps some aspects will appear in IEEE 754-2029 or other future editions. We all have work to do!
References
[1] Ahrens, P., Demmel, J., & Nguyen, H.D. (2020). Algorithms for efficient reproducible floating point summation. ACM Trans. Math. Soft., 46(3), 1-49.
[2] Arnold, D.N. (2000, August 23). The explosion of the Ariane 5. Some disasters attributable to bad numerical computing. Retrieved from http://www-users.math.umn.edu/~arnold/disasters/ariane.html.
[3] Demmel, J., Riedy, J. & Ahrens, P. (2018, October 1). Reproducible BLAS: Make addition associative again! SIAM News, 51(8), p. 8.
[4] IEEE. (2019). IEEE standard for floating-point arithmetic. In IEEE Std 754-2019 (Revision of IEEE 754-2008), pp. 1-84.
[5] Muller, J.-M., Brunie, N., de Dinechin, F., Jeannerod, C.-P., Joldes, M., Lefèvre, V., …, Torres, S. (2018). Handbook of floating-point arithmetic (2nd ed.). Cham, Switzerland: Springer Birkhäuser.
[6] Nield, D. (2017, August 24). This 3,700-year-old Babylonian clay tablet just changed the history of maths. ScienceAlert. Retrieved from https://www.sciencealert.com/scientists-just-solved-a-maths-problem-on-this-3-700-year-old-clay-tablet.
[7] SIT Autonomous [Grouchy-Big9198]. (2020, October 29). During this initialization lap something happened which apparently caused the steering control signal to go to NaN. [Comment on the online forum post [OT Roborace] Driverless racecar drives straight into a wall]. Reddit. Retrieved from https://www.reddit.com/r/formula1/comments/jk9jrg/ot_roborace_driverless_racecar_drives_straight/gai295l.
[8] Slabodkin, G. (1998, July 13). Software glitches leave Navy Smart Ship dead in the water. GCN. Retrieved from https://gcn.com/Articles/1998/07/13/Software-glitches-leave-Navy-Smart-Ship-dead-in-the-water.aspx.
About the Authors
James Demmel
Professor, University of California, Berkeley
James Demmel is a professor of mathematics and electrical engineering and computer sciences (EECS) at the University of California, Berkeley. He is the former chair of the Department of EECS.
Jason Riedy
Technical Staff Member, Lucata Corporation
Jason Riedy is a member of the technical staff at Lucata Corporation, where he applies novel memory-centric architectures to data analysis problems.