Volume 54 Issue 03 April 2021
Research

BLIS: BLAS and So Much More

The Basic Linear Algebra Subprograms (BLAS) have profoundly impacted scientific software development. This application programming interface (API) supports basic linear algebra functionality, including vector-vector operations like dot product and axpy, matrix-vector operations like matrix-vector multiplication and rank-1 updates, and matrix-matrix operations like matrix-matrix multiplication and solutions for triangular systems with multiple right-hand sides. It enables portable high performance and a level of abstraction that promotes the development, maintenance, and readability of high-quality applications. Here we discuss the BLAS-like Library Instantiation Software (BLIS), which facilitates rapid instantiation of BLAS and BLAS-like operations and provides both the original interface and alternatives [9].
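To make these three levels concrete, the sketch below gives unoptimized reference loops (in C, assuming column-major storage) for one operation from each level. The routine names are purely illustrative and do not follow the BLAS naming conventions; optimized BLAS implementations are, of course, far more sophisticated.

    #include <stddef.h>

    /* Level 1: y := alpha*x + y (axpy) */
    void ref_axpy(size_t n, double alpha, const double *x, double *y) {
        for (size_t i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }

    /* Level 2: y := A*x + y (matrix-vector multiplication) */
    void ref_gemv(size_t m, size_t n, const double *A, size_t ldA,
                  const double *x, double *y) {
        for (size_t j = 0; j < n; j++)
            for (size_t i = 0; i < m; i++)
                y[i] += A[i + j*ldA] * x[j];
    }

    /* Level 3: C := A*B + C (matrix-matrix multiplication) */
    void ref_gemm(size_t m, size_t n, size_t k,
                  const double *A, size_t ldA,
                  const double *B, size_t ldB,
                  double *C, size_t ldC) {
        for (size_t j = 0; j < n; j++)
            for (size_t p = 0; p < k; p++)
                for (size_t i = 0; i < m; i++)
                    C[i + j*ldC] += A[i + p*ldA] * B[p + j*ldB];
    }

Only the level-3 operation performs on the order of n³ computation on n² data, which is why it is the natural target for careful blocking for caches and registers.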

BLAS-like Library Instantiation Software (BLIS): Working at the intersection of algorithms and architecture. Photo by Robert van de Geijn and Maggie Myers.

Scientists have long expected hardware and middleware vendors to deliver high-performance implementations of BLAS; until the mid-1990s, only these proprietary solutions were available. The open-source Portable, High-Performance ANSI C (PHiPAC) project [2]—and later the first widely-used Automatically Tuned Linear Algebra Software (ATLAS) implementation [10]—included important developments, such as the achievement of reasonably high performance via coding in C, autogenerating code, and autotuning parameters. In the late 1990s, Kazushige Goto introduced a new way of blocking matrix-matrix multiplication that took advantage of two levels of cache memory in central processing units [3]. At the time, this method improved performance by up to 10 percent over ATLAS and existing vendor libraries. Goto’s approach—now called the Goto algorithm—became the foundation for GotoBLAS and still forms the basis for most of the highest-performing BLAS implementations, both proprietary and open source. His implementation lives on as a fork of GotoBLAS called OpenBLAS.

One drawback of GotoBLAS was that much of the source code targeted specific architectures in addition to specific BLAS operations. BLIS restructures Goto’s algorithm so that this low-level, machine-specific code is limited to the implementation of a few “microkernels” within five levels of generic cache- and register-blocking loops (see Figure 1). Once the microkernels are implemented, all BLAS and BLAS-like functionality becomes instantiated and is ready for use. As with Goto’s algorithm, the rearrangement or packing of data at strategic points within the algorithm improves data locality and reduces the number of translation lookaside buffer (TLB) misses. It is important to note that analytical models, rather than empirical tuning, determine the various parameters in BLIS [5]. This approach yields a more flexible, portable, and easily maintained framework that achieves nearly optimal performance, as reported on the BLIS GitHub repository performance pages.
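The simplified C sketch below outlines this loop structure (compare with Figure 1). It is a conceptual outline rather than BLIS source code: it assumes column-major storage and matrix dimensions that are multiples of the register blocking parameters MR and NR, uses placeholder blocking values rather than analytically derived ones, and indicates the packing of A and B only in comments.

    #include <stddef.h>

    /* Placeholder blocking parameters; real values are derived analytically
       for each architecture [5]. */
    #define NC 4096
    #define KC 256
    #define MC 144
    #define NR 8
    #define MR 8
    #define MIN(a,b) ((a) < (b) ? (a) : (b))

    /* Microkernel: updates an MR x NR block of C. A real microkernel keeps
       this block in registers and is written in assembly or with intrinsics. */
    static void microkernel(size_t kc,
                            const double *A, size_t ldA,
                            const double *B, size_t ldB,
                            double *C, size_t ldC) {
        for (size_t p = 0; p < kc; p++)
            for (size_t j = 0; j < NR; j++)
                for (size_t i = 0; i < MR; i++)
                    C[i + j*ldC] += A[i + p*ldA] * B[p + j*ldB];
    }

    /* C := A*B + C, column-major; m and n are assumed to be multiples of
       MR and NR, respectively. */
    void gemm_sketch(size_t m, size_t n, size_t k,
                     const double *A, size_t ldA,
                     const double *B, size_t ldB,
                     double *C, size_t ldC) {
        for (size_t jc = 0; jc < n; jc += NC) {              /* loop 5 */
            size_t nc = MIN(NC, n - jc);
            for (size_t pc = 0; pc < k; pc += KC) {          /* loop 4 */
                size_t kc = MIN(KC, k - pc);
                /* ...pack B(pc:pc+kc-1, jc:jc+nc-1) into a contiguous buffer... */
                for (size_t ic = 0; ic < m; ic += MC) {      /* loop 3 */
                    size_t mc = MIN(MC, m - ic);
                    /* ...pack A(ic:ic+mc-1, pc:pc+kc-1) into a contiguous buffer... */
                    for (size_t jr = 0; jr < nc; jr += NR)       /* loop 2 */
                        for (size_t ir = 0; ir < mc; ir += MR)   /* loop 1 */
                            microkernel(kc,
                                        &A[(ic + ir) + pc*ldA], ldA,
                                        &B[pc + (jc + jr)*ldB], ldB,
                                        &C[(ic + ir) + (jc + jr)*ldC], ldC);
                }
            }
        }
    }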

The native interface that BLIS supports is object based, meaning that attributes like matrix domain (real versus complex), precision (half, single, double, or extended), dimensions, and storage format are encapsulated within the objects themselves rather than exposed in the function signatures. BLIS also provides a native “typed” interface that resembles BLAS while offering extensions for various features, including the specification of separate row and column strides for each matrix operand. Finally, BLIS contains a BLAS compatibility interface that implements the conventional BLAS API. Users can easily create their own interfaces that map to either of the underlying BLIS interfaces (object or typed).
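As a rough illustration, the sketch below performs the same update, C := A*B + C with double-precision real matrices, through each of the three interfaces. The calls follow the BLIS documentation at the time of writing; exact names, types, and argument orders may differ across versions, and the final call assumes that the BLAS compatibility layer was enabled at configure time (the default).

    #include "blis.h"

    void gemm_three_ways(void) {
        dim_t m = 4, n = 4, k = 4;

        /* Object interface: domain, precision, dimensions, and storage are
           carried inside the obj_t structures. */
        obj_t a, b, c;
        bli_obj_create(BLIS_DOUBLE, m, k, 0, 0, &a);  /* 0, 0 requests default storage */
        bli_obj_create(BLIS_DOUBLE, k, n, 0, 0, &b);
        bli_obj_create(BLIS_DOUBLE, m, n, 0, 0, &c);
        bli_randm(&a); bli_randm(&b); bli_setm(&BLIS_ZERO, &c);
        bli_gemm(&BLIS_ONE, &a, &b, &BLIS_ONE, &c);   /* c := 1*a*b + 1*c */
        bli_obj_free(&a); bli_obj_free(&b); bli_obj_free(&c);

        /* Typed interface: BLAS-like, but with explicit row and column strides
           for each matrix operand. */
        double alpha = 1.0, beta = 1.0;
        double A[16], B[16], C[16];                   /* column-major 4x4 buffers */
        for (int i = 0; i < 16; i++) { A[i] = 1.0; B[i] = 1.0; C[i] = 0.0; }
        bli_dgemm(BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE, m, n, k,
                  &alpha, A, 1, m, B, 1, k, &beta, C, 1, m);

        /* BLAS compatibility interface: the conventional Fortran-style API. */
        f77_int mm = 4, nn = 4, kk = 4;
        dgemm_("N", "N", &mm, &nn, &kk, &alpha, A, &mm, B, &kk, &beta, C, &mm);
    }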

Figure 1. The BLAS-like Library Instantiation Software (BLIS) refactoring of the GotoBLAS algorithm as five loops around the microkernel. Image courtesy of [7].

BLIS does more than export top-level interfaces; it also exposes lower-level components that researchers can compose in innovative ways. For example, BLIS can mix the domains and precisions of operands [8]. To support this feature, it incorporates the necessary conversions into the packing component that already forms part of the basic algorithm. This technique has also been employed in TBLIS [6], a separate library for tensor contraction that avoids the explicit rearrangement of data that is otherwise necessary when such operations are cast in terms of traditional BLAS. In addition, BLIS has enabled a practical implementation of Strassen’s algorithm [4]. These examples demonstrate that BLIS is as much a conceptual toolbox for innovation as a tangible framework for code instantiation.
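A conceptual sketch (not BLIS source code) suggests how such a conversion can be folded into packing: here, a single-precision block of A is promoted to double precision as it is copied into the contiguous buffer of micro-panels that the microkernel consumes, so no separate conversion pass over the data is needed. The layout and the placeholder parameter MR mirror the sketch of the five-loop algorithm above.

    #include <stddef.h>

    #define MR 8  /* placeholder register blocking parameter */

    /* Pack an mc x kc block of a single-precision, column-major matrix A into
       a contiguous double-precision buffer of MR-row micro-panels, converting
       each element as it is copied. Assumes mc is a multiple of MR. */
    void pack_A_s2d(size_t mc, size_t kc,
                    const float *A, size_t ldA,
                    double *Ap) {
        for (size_t ir = 0; ir < mc; ir += MR)             /* one micro-panel at a time */
            for (size_t p = 0; p < kc; p++)
                for (size_t i = 0; i < MR; i++)
                    *Ap++ = (double) A[(ir + i) + p*ldA];  /* convert while copying */
    }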

Since its debut on GitHub in 2012, BLIS has amassed a sizeable online community that includes volunteers who maintain packages for various distributions of Linux and other operating systems. Supported operating systems include Debian Linux, Ubuntu Linux, Gentoo Linux, Extra Packages for Enterprise Linux/Fedora Linux, openSUSE Linux, and GNU Guix. BLIS also offers macOS and Conda support, along with limited support for Windows dynamic-link libraries (DLLs). Users may download and compile the source code themselves via either git-cloned repositories or source snapshots in .zip or tarball formats. In addition to the “vanilla” distribution of BLIS that The University of Texas at Austin manages (in collaboration with academic, industry, and community partners around the globe), Advanced Micro Devices, Inc. (AMD) maintains¹ a separate fork of BLIS that contains optimizations that are specific to Zen-based microarchitectures, as well as enhancements that its corporate customers request. An annual workshop called the BLIS Retreat provides a forum for many stakeholders from academia, industry, and government laboratories to exchange ideas and discuss the latest research.

While creating an entire framework—with support from the National Science Foundation and industry gifts—has taken the better part of a decade, the techniques that underlie BLIS are remarkably simple. During a four-week massive open online course (MOOC) titled “LAFF-On Programming for High Performance,” we examine BLIS’s matrix-matrix multiplication to illustrate important issues in high-performance computing. This course, which is part of a series of MOOCs, attracts both novices and experts and lowers barriers to entry into the field, much like BLIS lowers barriers to porting BLAS-like operations to new environments and architectures.

BLIS continues to grow by rapidly incorporating support for emerging architectures and instruction sets—such as Intel® Advanced Vector Extensions 512, AMD EPYC, ARM Scalable Vector Extension, and IBM POWER10—and addressing dense linear algebra operations beyond the traditional BLAS interface. The simplicity of BLIS’s underlying techniques and concepts has enabled a myriad of improvements. However, the BLIS framework—as a concrete instantiation of these ideas—has begun to reach certain limits. Instead, alternative one-off instantiations of the BLIS concept have enabled efficient tensor contraction [6], machine learning primitives [11], and even operations that are relevant to biostatistics [1]. But is it necessary to look beyond BLIS to address these exciting problems? We believe that the answer is “no” — or rather, the next great adventure for BLIS is to make it so.

¹ AMD selected BLIS as the foundation for its hardware-optimized BLAS library circa 2015.

This article is based on Robert van de Geijn’s invited talk at the 2020 SIAM Conference on Parallel Processing for Scientific Computing, which took place in Seattle, Wash., last year. Van de Geijn, Van Zee, and others received the 2020 SIAM Activity Group on Supercomputing Best Paper Prize for their paper on the BLIS framework [9]. Van de Geijn’s presentation is available on SIAM’s YouTube channel.

References

[1] Alachiotis, N., Popovici, D.T., & Low, T.M. (2016). Efficient computation of linkage disequilibria as dense linear algebra operations. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (pp. 418-427). Chicago, IL: IEEE.
[2] Bilmes, J., Asanović, K., Chin, C.-W., & Demmel, J. (1997). Optimizing matrix multiply using PHiPAC: A portable, high-performance, ANSI C coding methodology. In U. Banerjee (Ed.), ACM international conference on supercomputing 25th anniversary volume (pp. 253-260). Association for Computing Machinery.
[3] Goto, K., & van de Geijn, R. (2008). Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw., 34(3), 12.
[4] Huang, J., Smith, T.M., Henry, G.M., & van de Geijn, R.A. (2016). Strassen’s algorithm reloaded. In SC’16: Proceedings of the international conference for high performance computing, networking, storage, and analysis (pp. 690-701). Salt Lake City, UT: IEEE.
[5] Low, T.M., Igual, F.D., Smith, T.M., & Quintana-Orti, E.S. (2016). Analytical modeling is enough for high-performance BLIS. ACM Trans. Math. Softw., 43(2), 12.
[6] Matthews, D.A. (2018). High-performance tensor contraction without transposition. SIAM J. Sci. Comput., 40(1), C1-C24.
[7] Van Zee, F.G., & Smith, T.M. (2017). Implementing high-performance complex matrix multiplication via the 3m and 4m methods. ACM Trans. Math. Softw., 44(1), 7.
[8] Van Zee, F.G., Parikh, D.N., & van de Geijn, R.A. (2021). Supporting mixed-domain mixed-precision matrix multiplication within the BLIS framework. ACM Trans. Math. Softw. To be published.
[9] Van Zee, F.G., Smith, T.M., Marker, B., Low, T.M., van de Geijn, R.A., Igual, F.D., Smelyanskiy, M., Zhang, X., Kistler, M., Austel, V., Gunnels, J.A., & Killough, L. (2016). The BLIS framework: Experiments in portability. ACM Trans. Math. Softw., 42(2), 12. 
[10] Whaley, R.C., Petitet, A., & Dongarra, J.J. (2001). Automated empirical optimizations of software and the ATLAS project. Parallel Comput., 27(1-2), 3-35.
[11] Yu, C.D., Huang, J., Austin, W., Xiao, B., & Biros, G. (2015). Performance optimization for the k-nearest neighbors kernel on x86 architectures. In SC’15: Proceedings of the international conference for high performance computing, networking, storage, and analysis (pp. 1-12). New York, NY: Association for Computing Machinery.

About the Authors