BLIS: BLAS and So Much More
The Basic Linear Algebra Subprograms (BLAS) have profoundly impacted scientific software development. This application programming interface (API) supports basic linear algebra functionality, including vector-vector operations like dot product and axpy, matrix-vector operations like matrix-vector multiplication and rank-1 updates, and matrix-matrix operations like matrix-matrix multiplication and the solution of triangular systems with multiple right-hand sides. It enables portable high performance and provides a level of abstraction that supports the development, maintenance, and readability of high-quality applications. Here we discuss the BLAS-like Library Instantiation Software (BLIS), which facilitates rapid instantiation of BLAS and BLAS-like operations and provides both the original interface as well as alternatives [9].
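For concreteness, the sketch below exercises one operation from each BLAS level through the widely used CBLAS C bindings. The header name and link flags are assumptions that depend on the particular BLAS installation; treat this as an illustrative example rather than a recipe.

```c
/* A minimal sketch: one operation from each BLAS level via the CBLAS
 * bindings. Matrices are stored in column-major order. The header name
 * and link flags are assumptions that depend on the installed BLAS. */
#include <stdio.h>
#include <cblas.h>

int main( void )
{
    double x[3] = { 1.0, 2.0, 3.0 };
    double y[3] = { 4.0, 5.0, 6.0 };

    /* Level 1 (vector-vector): y := 2.0*x + y, then dot := x^T y. */
    cblas_daxpy( 3, 2.0, x, 1, y, 1 );
    double dot = cblas_ddot( 3, x, 1, y, 1 );

    /* Level 2 (matrix-vector): y := 1.0*A*x + 0.0*y with A = I (3x3). */
    double A[9] = { 1,0,0, 0,1,0, 0,0,1 };
    cblas_dgemv( CblasColMajor, CblasNoTrans, 3, 3,
                 1.0, A, 3, x, 1, 0.0, y, 1 );

    /* Level 3 (matrix-matrix): C := 1.0*A*B + 0.0*C with 3x3 operands. */
    double B[9] = { 1,1,1, 1,1,1, 1,1,1 }, C[9] = { 0 };
    cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                 3, 3, 3, 1.0, A, 3, B, 3, 0.0, C, 3 );

    printf( "x . y = %.1f, C[0][0] = %.1f\n", dot, C[0] );
    return 0;
}
```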
Scientists have long expected hardware and middleware vendors to deliver high-performance implementations of BLAS; until the mid-1990s, only these proprietary solutions were available. The open-source Portable, High-Performance ANSI C (PHiPAC) project [2]—and later the first widely-used Automatically Tuned Linear Algebra Software (ATLAS) implementation [10]—included important developments, such as the achievement of reasonably high performance via coding in C, autogenerating code, and autotuning parameters. In the late 1990s, Kazushige Goto introduced a new way of blocking matrix-matrix multiplication that took advantage of two levels of cache memory in central processing units [3]. At the time, this method improved performance by up to 10 percent over ATLAS and existing vendor libraries. Goto’s approach—now called the Goto algorithm—became the foundation for GotoBLAS and still forms the basis for most of the highest-performing BLAS implementations, both proprietary and open source. His implementation lives on as a fork of GotoBLAS called OpenBLAS.
One drawback of GotoBLAS was that much of the source code targeted specific architectures in addition to specific BLAS operations. BLIS restructures Goto’s algorithm so that this low-level, machine-specific code is limited to the implementation of a few “microkernels” within five levels of generic cache- and register-blocking loops (see Figure 1). Once the microkernels are implemented, all BLAS and BLAS-like functionality is instantiated and ready for use. As with Goto’s algorithm, the rearrangement or packing of data at strategic points within the algorithm improves data locality and reduces the number of translation lookaside buffer (TLB) misses. Notably, analytical models, rather than empirical tuning, determine the various parameters in BLIS [5]. This approach yields a more flexible, portable, and easily maintained framework that achieves nearly optimal performance, as reported on the BLIS GitHub repository performance pages.
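To make this structure concrete, the sketch below spells out the five blocking loops around a scalar microkernel in plain C. It is a minimal illustration, not BLIS code: the block sizes, packing layouts, and microkernel are placeholders chosen for readability, and edge cases (dimensions that are not multiples of the block sizes) are omitted.

```c
/* A self-contained sketch of the five-loop Goto/BLIS structure for
 * C := C + A*B with column-major storage. Block sizes, packing layouts,
 * and the scalar "microkernel" are illustrative placeholders; BLIS derives
 * its block sizes analytically per architecture and writes its microkernels
 * with vector intrinsics or in assembly. For brevity, m, n, and k are
 * assumed to be multiples of the block sizes. */
#include <stdio.h>
#include <stdlib.h>

#define NC 256  /* columns of B kept resident in the L3 cache */
#define KC 128  /* "depth" of the packed panels (L1/L2 caches) */
#define MC  64  /* rows of A kept resident in the L2 cache     */
#define NR   4  /* register-block columns                      */
#define MR   4  /* register-block rows                         */

/* Microkernel: update an MR x NR block of C from packed micro-panels. */
static void microkernel( int kc, const double *Ap, const double *Bp,
                         double *C, int ldC )
{
    for ( int p = 0; p < kc; p++ )
        for ( int j = 0; j < NR; j++ )
            for ( int i = 0; i < MR; i++ )
                C[ i + j*ldC ] += Ap[ i + p*MR ] * Bp[ j + p*NR ];
}

static void gemm_sketch( int m, int n, int k,
                         const double *A, int ldA,
                         const double *B, int ldB,
                         double *C, int ldC )
{
    double *Ap = malloc( MC * KC * sizeof( double ) );  /* packed block of A */
    double *Bp = malloc( KC * NC * sizeof( double ) );  /* packed block of B */

    for ( int jc = 0; jc < n; jc += NC )                /* loop 5 */
    for ( int pc = 0; pc < k; pc += KC ) {              /* loop 4 */
        /* Pack B( pc:pc+KC-1, jc:jc+NC-1 ) into row micro-panels of width NR. */
        for ( int jr = 0; jr < NC; jr += NR )
            for ( int p = 0; p < KC; p++ )
                for ( int j = 0; j < NR; j++ )
                    Bp[ jr*KC + p*NR + j ] = B[ (pc+p) + (jc+jr+j)*ldB ];

        for ( int ic = 0; ic < m; ic += MC ) {          /* loop 3 */
            /* Pack A( ic:ic+MC-1, pc:pc+KC-1 ) into column micro-panels of height MR. */
            for ( int ir = 0; ir < MC; ir += MR )
                for ( int p = 0; p < KC; p++ )
                    for ( int i = 0; i < MR; i++ )
                        Ap[ ir*KC + p*MR + i ] = A[ (ic+ir+i) + (pc+p)*ldA ];

            for ( int jr = 0; jr < NC; jr += NR )       /* loop 2 */
            for ( int ir = 0; ir < MC; ir += MR )       /* loop 1 */
                microkernel( KC, &Ap[ ir*KC ], &Bp[ jr*KC ],
                             &C[ (ic+ir) + (jc+jr)*ldC ], ldC );
        }
    }
    free( Ap );
    free( Bp );
}

int main( void )
{
    const int m = 128, n = 256, k = 256;   /* multiples of the block sizes */
    double *A = malloc( (size_t)m * k * sizeof( double ) );
    double *B = malloc( (size_t)k * n * sizeof( double ) );
    double *C = calloc( (size_t)m * n, sizeof( double ) );

    for ( int i = 0; i < m*k; i++ ) A[ i ] = 1.0;
    for ( int i = 0; i < k*n; i++ ) B[ i ] = 1.0;

    gemm_sketch( m, n, k, A, m, B, k, C, m );
    printf( "C[0][0] = %.1f (expected %d)\n", C[ 0 ], k );

    free( A ); free( B ); free( C );
    return 0;
}
```

Even in this toy version, the key design point is visible: only the innermost microkernel would need to touch machine-specific detail, while the surrounding loops and packing routines remain generic.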
The native interface that BLIS supports is object based, meaning that attributes like matrix domain (real versus complex), precision (half, single, double, or extended), dimensions, and storage format are encapsulated within the objects rather than passed explicitly. BLIS also provides a native “typed” interface that resembles BLAS while offering extensions for various features, including the specification of separate row and column strides for each matrix operand. Finally, BLIS contains a BLAS compatibility interface that implements the conventional BLAS API. Users can easily create their own interfaces that map to either of the underlying BLIS interfaces (object or typed).
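As a rough illustration of the object interface, the following sketch computes C := A*B using calls documented in the BLIS repository; the commented lines indicate what the corresponding typed and BLAS-compatible invocations would look like. Build and link details are assumed rather than prescribed.

```c
/* Sketch: C := A*B via the BLIS object API (calls as documented in the
 * BLIS repository; build/link details are assumptions). */
#include "blis.h"

int main( void )
{
    dim_t m = 4, n = 4, k = 4;
    obj_t a, b, c;

    /* Row/column strides of (0, 0) request the default (column-major) layout. */
    bli_obj_create( BLIS_DOUBLE, m, k, 0, 0, &a );
    bli_obj_create( BLIS_DOUBLE, k, n, 0, 0, &b );
    bli_obj_create( BLIS_DOUBLE, m, n, 0, 0, &c );

    bli_randm( &a );
    bli_randm( &b );
    bli_setm( &BLIS_ZERO, &c );

    /* Object interface: domain, precision, dimensions, and storage are
       carried by the objects themselves. */
    bli_gemm( &BLIS_ONE, &a, &b, &BLIS_ZERO, &c );

    /* The typed interface expresses the same update on raw arrays, e.g.
       bli_dgemm( BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE, m, n, k,
                  &alpha, A, rs_a, cs_a, B, rs_b, cs_b, &beta, C, rs_c, cs_c );
       while the BLAS compatibility layer accepts the conventional
       dgemm_( "N", "N", &m, &n, &k, &alpha, A, &lda, B, &ldb, &beta, C, &ldc ); */

    bli_printm( "C", &c, "%7.3f", "" );

    bli_obj_free( &a );
    bli_obj_free( &b );
    bli_obj_free( &c );
    return 0;
}
```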
BLIS exports more than top-level interfaces; it also exposes lower-level components that researchers can compose in innovative ways. For example, BLIS can mix the domains and precisions of operands [8]. To support this feature, it incorporates the necessary conversions into the packing component that is already part of the basic algorithm. This packing technique has also been employed in a separate library for tensor contraction called TBLIS [6], which avoids the explicit rearrangement of data that would otherwise be required when such operations are cast in terms of traditional BLAS. In addition, BLIS has enabled the practical implementation of Strassen’s algorithm [4]. These examples demonstrate that BLIS is as much a conceptual toolbox for innovation as a tangible framework for code instantiation.
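As a hypothetical illustration of mixing precisions, the object interface can accept operands of different datatypes directly. The sketch below multiplies single-precision inputs into a double-precision result, assuming a BLIS build with mixed-datatype support enabled (see docs/MixedDatatypes.md in the BLIS repository).

```c
/* Hypothetical sketch: mixed-precision gemm via the BLIS object API.
 * Assumes a BLIS build with mixed-datatype support enabled. */
#include "blis.h"

int main( void )
{
    dim_t m = 4, n = 4, k = 4;
    obj_t a, b, c;

    /* Single-precision inputs, double-precision output. */
    bli_obj_create( BLIS_FLOAT,  m, k, 0, 0, &a );
    bli_obj_create( BLIS_FLOAT,  k, n, 0, 0, &b );
    bli_obj_create( BLIS_DOUBLE, m, n, 0, 0, &c );

    bli_randm( &a );
    bli_randm( &b );

    /* The necessary typecasts happen inside the packing step of the
       algorithm; no extra copies of A or B are made in user code. */
    bli_gemm( &BLIS_ONE, &a, &b, &BLIS_ZERO, &c );

    bli_printm( "C (double, from float inputs)", &c, "%8.4f", "" );

    bli_obj_free( &a );
    bli_obj_free( &b );
    bli_obj_free( &c );
    return 0;
}
```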
Since its debut on GitHub in 2012, BLIS has amassed a sizeable online community that includes volunteers who maintain packages for various distributions of Linux and other operating systems. Supported operating systems include Debian Linux, Ubuntu Linux, Gentoo Linux, Extra Packages for Enterprise Linux/Fedora Linux, openSUSE Linux, and GNU Guix. BLIS also offers macOS and Conda support, along with limited support for Windows dynamic-link libraries (DLLs). Users may download and compile the source code themselves via either git-cloned repositories or source snapshots in .zip or tarball formats. In addition to the “vanilla” distribution of BLIS that The University of Texas at Austin manages (in collaboration with academic, industry, and community partners around the globe), Advanced Micro Devices, Inc. (AMD) maintains1 a separate fork of BLIS that contains optimizations that are specific to Zen-based microarchitectures, as well as enhancements that their corporate customers request. An annual workshop called the BLIS Retreat provides a forum for many stakeholders from academia, industry, and government laboratories to exchange ideas and discuss the latest research.
While creating an entire framework—with support from the National Science Foundation and industry gifts—has taken the better part of a decade, the techniques that underlie BLIS are remarkably simple. During a four-week massive open online course (MOOC) titled “LAFF-On Programming for High Performance,” we examine BLIS’s matrix-matrix multiplication to illustrate important issues in high-performance computing. This course, which is part of a series of MOOCs, attracts both novices and experts and lowers the barrier to entry into the field, much as BLIS lowers the barrier to porting BLAS-like operations to new environments and architectures.
BLIS continues to grow by rapidly incorporating support for emerging architectures and instruction sets—such as Intel® Advanced Vector Extensions 512, AMD EPYC, ARM Scalable Vector Extension, and IBM POWER10—and addressing dense linear algebra operations beyond the traditional BLAS interface. The simplicity of BLIS’s underlying techniques and concepts has enabled a myriad of improvements. However, the BLIS framework—as a concrete instantiation of these ideas—has begun to reach certain limits. Instead, alternative one-off instantiations of the BLIS concept have enabled efficient tensor contraction [6], machine learning primitives [11], and even operations that are relevant to biostatistics [1]. But is it necessary to look beyond BLIS to address these exciting problems? We believe that the answer is “no” — or rather, the next great adventure for BLIS is to make it so.
1AMD selected BLIS as the foundation for its hardware-optimized BLAS library circa 2015.
This article is based on Robert van de Geijn’s invited talk at the 2020 SIAM Conference on Parallel Processing for Scientific Computing, which took place in Seattle, Wash., last year. Van de Geijn, Van Zee, and others received the 2020 SIAM Activity Group on Supercomputing Best Paper Prize for their paper on the BLIS framework [9]. Van de Geijn’s presentation is available on SIAM’s YouTube channel.
References
[1] Alachiotis, N., Popovici, D.T., & Low, T.M. (2016). Efficient computation of linkage disequilibria as dense linear algebra operations. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (pp. 418-427). Chicago, IL: IEEE.
[2] Bilmes, J., Asanović, K., Chin, C.-W., & Demmel, J. (1997). Optimizing matrix multiply using PHiPAC: A portable, high-performance, ANSI C coding methodology. In U. Banerjee (Ed.), ACM international conference on supercomputing 25th anniversary volume (pp. 253-260). Association for Computing Machinery.
[3] Goto, K., & van de Geijn, R. (2008). Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw., 34(3), 12.
[4] Huang, J., Smith, T.M., Henry, G.M., & van de Geijn, R.A. (2016). Strassen’s algorithm reloaded. In SC’16: Proceedings of the international conference for high performance computing, networking, storage, and analysis (pp. 690-701). Salt Lake City, UT: IEEE.
[5] Low, T.M., Igual, F.D., Smith, T.M., & Quintana-Orti, E.S. (2016). Analytical modeling is enough for high-performance BLIS. ACM Trans. Math. Softw., 43(2), 12.
[6] Matthews, D.A. (2018). High-performance tensor contraction without transposition. SIAM J. Sci. Comput., 40(1), C1-C24.
[7] Van Zee, F.G., & Smith, T.M. (2017). Implementing high-performance complex matrix multiplication via the 3m and 4m methods. ACM Trans. Math. Softw., 44(1), 7.
[8] Van Zee, F.G., Parikh, D.N., & van de Geijn, R.A. (2021). Supporting mixed-domain mixed-precision matrix multiplication within the BLIS framework. ACM Trans. Math. Softw. To be published.
[9] Van Zee, F.G., Smith, T.M., Marker, B., Low, T.M., van de Geijn, R.A., Igual, F.D., Smelyanskiy, M., Zhang, X., Kistler, M., Austel, V., Gunnels, J.A., & Killough, L. (2016). The BLIS framework: Experiments in portability. ACM Trans. Math. Softw., 42(2), 12.
[10] Whaley, R.C., Petitet, A., & Dongarra, J.J. (2001). Automated empirical optimizations of software and the ATLAS project. Parallel Comput., 27(1-2), 3-35.
[11] Yu, C.D., Huang, J., Austin, W., Xiao, B., & Biros, G. (2015). Performance optimization for the k-nearest neighbors kernel on x86 architectures. In SC’15: Proceedings of the international conference for high performance computing, networking, storage, and analysis (pp. 1-12). New York, NY: Association for Computing Machinery.
About the Authors
Field Van Zee
Chief Architect, BLAS-like Library Instantiation Software
Field Van Zee, the chief architect of the BLAS-like Library Instantiation Software, is a research scientist at the Oden Institute for Computational Engineering and Sciences at The University of Texas at Austin.
Robert van de Geijn
Faculty member, The University of Texas at Austin
Robert van de Geijn is a faculty member at The University of Texas at Austin who is affiliated with the Department of Computer Science and the Oden Institute for Computational Engineering and Sciences. He received his Ph.D. in applied mathematics from the University of Maryland and is a leading expert in the areas of high-performance computing, linear algebra libraries, and parallel processing.
Maggie Myers
Senior Research Fellow, Oden Institute for Computational Engineering and Sciences
Maggie Myers is a Senior Research Fellow at the Oden Institute for Computational Engineering and Sciences at The University of Texas at Austin. She holds a Ph.D. in mathematical statistics from the University of Maryland and has extensive experience developing educational materials for K-12, college, and graduate-level courses.
Devangi Parikh
Assistant professor, The University of Texas at Austin
Devangi Parikh is an assistant professor of instruction in the Department of Computer Science at The University of Texas at Austin.
Devin Matthews
Associate professor, Southern Methodist University
Devin Matthews is an associate professor of chemistry at Southern Methodist University who specializes in high-performance computing in the context of molecular structures and spectroscopy.