- Contents
- Introduction
- Sub-configurations
- Configuration families
- Configuration registry
- Adding a new kernel set
- Adding a new configuration family
- Adding a new sub-configuration
- Further development topics
This document describes how to manage, edit, and create BLIS framework configurations. The target audience is primarily BLIS developers who wish to add support for new types of hardware, and developers who write (or tinker with) BLIS kernels.
The BLIS Build System guide introduces the concept of a BLIS configuration. There are actually two types of configurations: sub-configurations and configuration families.
A sub-configuration encapsulates all of the information needed to build BLIS for a particular microarchitecture. For example, the haswell
configuration allows a user or developer to build a BLIS library that targets hardware based on Intel Haswell (or Broadwell or Skylake/Kabylake desktop) microprocessors. Such a sub-configuration typically includes optimized kernels as well as the corresponding cache and register blocksizes that allow those kernels to work well on the target hardware.
A configuration family simply specifies a collection of other registered sub-configurations. For example, the intel64
configuration allows a user or developer to build a BLIS library that includes several Intel x86_64 configurations, and hence supports multiple microarchitectures simultaneously. The appropriate configuration information (e.g. kernels and blocksizes) will be selected via some hardware detection heuristic (e.g. the CPUID
instruction) at runtime. (Note: Prior to 290dd4a, configuration families could only be defined in terms of sub-configurations. Starting with 290dd4a, configuration families may be defined in terms of other families.)
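To make the idea of runtime selection concrete, here is a minimal C sketch. It is not BLIS's detection code: the names arch_id, cpu_features, and select_arch(), as well as the specific feature tests, are assumptions chosen for illustration. The real heuristic lives inside the framework and queries the hardware directly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical identifiers for the sub-configurations built into a family. */
typedef enum { ARCH_HASWELL, ARCH_SANDYBRIDGE, ARCH_PENRYN, ARCH_GENERIC } arch_id;

/* Hypothetical summary of what a CPUID-style query reported. */
typedef struct
{
    int has_avx2;
    int has_fma3;
    int has_avx;
    int has_ssse3;
} cpu_features;

/* Pick the most capable sub-configuration the hardware appears to support,
   falling back to the generic one. The tests are illustrative only. */
static arch_id select_arch( const cpu_features* f )
{
    assert( f != NULL );

    if ( f->has_avx2 && f->has_fma3 ) return ARCH_HASWELL;
    if ( f->has_avx )                 return ARCH_SANDYBRIDGE;
    if ( f->has_ssse3 )               return ARCH_PENRYN;

    return ARCH_GENERIC;
}
```

The returned identifier would then be used to look up the context that was initialized for that sub-configuration.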
Both of these configuration types are organized as directories of files and then "registered" into a configuration registry file named config_registry
, which resides in the top-level directory.
A sub-configuration is represented by a sub-directory of the config
directory in the top-level of the BLIS distribution:
$ ls config
amd64 cortexa15 excavator intel64 old power7 template
bgq cortexa57 generic knc penryn sandybridge zen
bulldozer cortexa9 haswell knl piledriver steamroller
Let's inspect the haswell
configuration as an example:
$ ls config/haswell
bli_cntx_init_haswell.c bli_family_haswell.h make_defs.mk
A sub-configuration (haswell, in this case) usually contains just three files:
- bli_cntx_init_haswell.c. This file contains the initialization function for a context targeting the hardware in question, in this case, Intel Haswell. A context, or cntx_t object, in BLIS encapsulates all of the hardware-specific information--including kernel function pointers and cache and register blocksizes--necessary to support all of the main computational operations in BLIS. The initialization function inside this file should be named the same as the filename (excluding the .c suffix), which should begin with the prefix bli_cntx_init_ and end with the (lowercase) name of the sub-configuration. The context initialization function (in this case, bli_cntx_init_haswell()) is used internally by BLIS when setting up the global kernel structure--a mechanism for managing and supporting multiple microarchitectures simultaneously, so that the choice of which context to use can be deferred until the computation is ready to execute.
- bli_family_haswell.h. This header file is #included when the configuration in question, in this case haswell, is the target of ./configure. This is where you would specify certain global parameters and settings. For example, if you wanted to specify custom implementations of malloc() and free(), this is where you would specify them. The file is oftentimes empty. (In the case of configuration families, the definitions in this file apply to the entire build, and not any specific sub-configuration, but for consistency we support them for all configuration targets, whether they be singleton sub-configurations or configuration families.)
- make_defs.mk. This makefile fragment defines the compiler and compiler flags to use during compilation. Specifically, the values defined in this file are used whenever compiling source code specific to the sub-configuration (i.e., reference kernels and optimized kernels). If the sub-configuration is the target of configure, then these flags are also used to compile general framework code.
Providing these three components constitutes a complete sub-configuration. A more detailed description of each file will follow.
As mentioned above, the kernels used by a sub-configuration are specified in the bli_cntx_init_
function. This function is flexible in that the context is typically initialized with a set of "reference" kernels. Then, the kernel developer overwrites the fields in the context that correspond to kernel operations that have optimized counterparts that should be used instead.
Let's use the following hypothetical function definition to guide our walkthrough.
#include "blis.h"

void bli_cntx_init_fooarch( cntx_t* cntx )
{
    blksz_t blkszs[ BLIS_NUM_BLKSZS ];

    // Set default kernel blocksizes and functions.
    bli_cntx_init_fooarch_ref( cntx );

    // -------------------------------------------------------------------------

    // Update the context with optimized native gemm microkernels and
    // their storage preferences.
    bli_cntx_set_l3_nat_ukrs
    (
      5,
      BLIS_GEMM_UKR,       BLIS_DOUBLE, bli_dgemm_bararch_asm,       FALSE,
      BLIS_GEMMTRSM_L_UKR, BLIS_DOUBLE, bli_dgemmtrsm_l_bararch_asm, FALSE,
      BLIS_GEMMTRSM_U_UKR, BLIS_DOUBLE, bli_dgemmtrsm_u_bararch_asm, FALSE,
      BLIS_TRSM_L_UKR,     BLIS_DOUBLE, bli_dtrsm_l_bararch_asm,     FALSE,
      BLIS_TRSM_U_UKR,     BLIS_DOUBLE, bli_dtrsm_u_bararch_asm,     FALSE,
      cntx
    );

    // Update the context with optimized packm kernels.
    bli_cntx_set_packm_kers
    (
      2,
      BLIS_PACKM_4XK_KER, BLIS_DOUBLE, bli_dpackm_bararch_asm_4xk,
      BLIS_PACKM_8XK_KER, BLIS_DOUBLE, bli_dpackm_bararch_asm_8xk,
      cntx
    );

    // Update the context with optimized level-1f kernels.
    bli_cntx_set_l1f_kers
    (
      5,
      BLIS_AXPY2V_KER,    BLIS_DOUBLE, bli_daxpy2v_fooarch_asm,
      BLIS_DOTAXPYV_KER,  BLIS_DOUBLE, bli_ddotaxpyv_fooarch_asm,
      BLIS_AXPYF_KER,     BLIS_DOUBLE, bli_daxpyf_fooarch_asm,
      BLIS_DOTXF_KER,     BLIS_DOUBLE, bli_ddotxf_fooarch_asm,
      BLIS_DOTXAXPYF_KER, BLIS_DOUBLE, bli_ddotxaxpyf_fooarch_asm,
      cntx
    );

    // Update the context with optimized level-1v kernels.
    bli_cntx_set_l1v_kers
    (
      2,
      BLIS_AXPYV_KER, BLIS_DOUBLE, bli_daxpyv_fooarch_asm,
      BLIS_DOTV_KER,  BLIS_DOUBLE, bli_ddotv_fooarch_asm,
      cntx
    );

    // Initialize level-3 blocksize objects with architecture-specific values.
    //                                           s     d     c     z
    bli_blksz_init_easy( &blkszs[ BLIS_MR ],     8,    8,    8,    4 );
    bli_blksz_init_easy( &blkszs[ BLIS_NR ],     8,    4,    4,    4 );
    bli_blksz_init_easy( &blkszs[ BLIS_MC ],   128,  128,  128,  128 );
    bli_blksz_init_easy( &blkszs[ BLIS_KC ],   256,  256,  256,  256 );
    bli_blksz_init_easy( &blkszs[ BLIS_NC ],  4096, 4096, 4096, 4096 );

    // Update the context with the current architecture's register and cache
    // blocksizes (and multiples) for native execution.
    bli_cntx_set_blkszs
    (
      5,
      BLIS_NC, &blkszs[ BLIS_NC ], BLIS_NR,
      BLIS_KC, &blkszs[ BLIS_KC ], BLIS_KR,
      BLIS_MC, &blkszs[ BLIS_MC ], BLIS_MR,
      BLIS_NR, &blkszs[ BLIS_NR ], BLIS_NR,
      BLIS_MR, &blkszs[ BLIS_MR ], BLIS_MR,
      cntx
    );
}
Function name/signature. This function always takes one argument, a pointer to a cntx_t
object. As with the name of the file, it should be named with the prefix bli_cntx_init_
followed by the lowercase name of the configuration--in this case, fooarch
.
Blocksize object array. The blkszs
array declaration is needed later in the function and should generally be consistent (and unchanged) across all configurations.
Reference initialization. The first function call, bli_cntx_init_fooarch_ref()
, initializes the context cntx
with function pointers to reference implementations of all of the kernels supported by BLIS (as well as cache and register blocksizes, and other fields). This function is automatically generated by BLIS for every sub-configuration enabled at configure-time. The function prototype is generated by a preprocessor macro in frame/include/bli_arch_config.h
.
Level-3 microkernels. The second function call is to a variable argument function, bli_cntx_set_l3_nat_ukrs()
, which updates cntx
with five optimized double-precision real level-3 microkernels. The first argument encodes the number of individual kernels being registered into the context. Every subsequent line, except for the last line, is associated with the registration of a single kernel. These lines are independent of one another and can occur in any order, provided that the kernel parameters within each line occur in the same order--kernel ID, followed by datatype, followed by function name, followed by storage preference boolean (i.e., whether the microkernel prefers row storage). The last argument of the function call is the address of the context being updated, cntx
. Notice that we are registering microkernels written for another type of hardware, bararch
, because in our hypothetical universe bararch
is very similar to fooarch
and so we recycle the code between the two configurations. After the function returns, the context contains pointers to optimized double-precision real level-3 microkernels. Note that the context will still contain reference microkernels for single-precision real and complex, and double-precision complex computation, as those kernels were not updated.
Note: Currently, BLIS only allows the kernel developer to signal a preference (row or column) for gemm
microkernels. The preference of the gemmtrsm
and trsm
microkernels can (and must) be set, but it is ignored by the framework during execution.
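The calling convention described above (a count, then one group of parameters per kernel, then the context) can be sketched in standalone C. Everything here is a toy stand-in and an assumption for illustration: toy_cntx, toy_cntx_init_ref(), toy_cntx_set_ukrs(), and the TOY_ constants are not part of BLIS, and the real bli_cntx_set_l3_nat_ukrs() may be implemented quite differently.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical kernel IDs and datatypes; not the real BLIS enums. */
enum { TOY_GEMM_UKR, TOY_TRSM_L_UKR, TOY_NUM_UKRS };
enum { TOY_FLOAT, TOY_DOUBLE, TOY_NUM_DTS };

typedef int (*toy_ukr_fp)( void );

typedef struct
{
    toy_ukr_fp ukrs[ TOY_NUM_UKRS ][ TOY_NUM_DTS ];
    int        row_pref[ TOY_NUM_UKRS ][ TOY_NUM_DTS ];
} toy_cntx;

/* Stand-ins for kernels: the return value identifies which one ran. */
static int toy_ref_ukr( void )   { return 0; }
static int toy_dgemm_asm( void ) { return 1; }

/* Fill every slot with the reference kernel. */
static void toy_cntx_init_ref( toy_cntx* cntx )
{
    for ( int k = 0; k < TOY_NUM_UKRS; ++k )
        for ( int d = 0; d < TOY_NUM_DTS; ++d )
        {
            cntx->ukrs[ k ][ d ]     = toy_ref_ukr;
            cntx->row_pref[ k ][ d ] = 0;
        }
}

/* Arguments: a count, then that many (id, datatype, function, preference)
   groups, then the context to update. */
static void toy_cntx_set_ukrs( int n_ukrs, ... )
{
    enum { MAX_UKRS = TOY_NUM_UKRS * TOY_NUM_DTS };
    int        ids[ MAX_UKRS ], dts[ MAX_UKRS ], prefs[ MAX_UKRS ];
    toy_ukr_fp fps[ MAX_UKRS ];
    va_list    args;

    assert( n_ukrs >= 0 && n_ukrs <= MAX_UKRS );

    /* The context comes last, so buffer the groups until it has been read. */
    va_start( args, n_ukrs );
    for ( int i = 0; i < n_ukrs; ++i )
    {
        ids[ i ]   = va_arg( args, int );
        dts[ i ]   = va_arg( args, int );
        fps[ i ]   = va_arg( args, toy_ukr_fp );
        prefs[ i ] = va_arg( args, int );
    }
    toy_cntx* cntx = va_arg( args, toy_cntx* );
    va_end( args );

    /* Overwrite only the slots that were named; the rest keep their
       reference kernels. */
    for ( int i = 0; i < n_ukrs; ++i )
    {
        cntx->ukrs[ ids[ i ] ][ dts[ i ] ]     = fps[ i ];
        cntx->row_pref[ ids[ i ] ][ dts[ i ] ] = prefs[ i ];
    }
}
```

A call such as toy_cntx_set_ukrs( 1, TOY_GEMM_UKR, TOY_DOUBLE, toy_dgemm_asm, 0, &cntx ) replaces one slot and leaves every other slot pointing at the reference kernel, which mirrors the behavior described for the real function.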
Level-1m (packm) kernels. The third function call is to another variable argument function, bli_cntx_set_packm_kers()
. This function works very similarly to bli_cntx_set_l3_nat_ukrs()
, except that it expects a different set of kernel IDs (because now we are registering level-1m kernels) and it does not take a storage preference boolean. After this function returns, cntx
contains function pointers to optimized double-precision real packm
kernels. These kernels, like the level-3 kernels previously, are also borrowed from the bararch
kernel set. Unregistered packm
kernels will continue to point to reference code.
Level-1f kernels. The fourth function call is to yet another variable argument function, bli_cntx_set_l1f_kers()
. This function has the same signature as bli_cntx_set_packm_kers()
, except that it expects a different set of kernel IDs (because now we are registering level-1f kernels). After this function returns, cntx
contains function pointers to optimized double-precision real level-1f kernels. These kernels are written for fooarch
specifically. The unregistered level-1f kernels will continue to point to reference code.
Level-1v kernels. The fifth function call is to bli_cntx_set_l1v_kers()
, which operates similarly to bli_cntx_set_l1f_kers()
, except here we are registering level-1v kernels. After the function returns, most kernels will continue to point to reference code, except double-precision real instances of axpyv
and dotv
.
For a complete list of kernel IDs, please see the definitions of l3ukr_t
, l1mkr_t
, l1fkr_t
, l1vkr_t
in frame/include/bli_type_defs.h.
Setting blocksizes. The next block of code initializes the blkszs
array with register and cache blocksize values for each datatype. The values here are used by the level-3 operations that employ the level-3 microkernels we registered previously. We use bli_blksz_init_easy()
when initializing only the primary value. If the auxiliary value needs to be set to a different value than the primary, bli_blksz_init()
should be used instead, as in:
//                                           s     d     c     z
bli_blksz_init_easy( &blkszs[ BLIS_MR ],     0,    8,    0,    0 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ],     0,    4,    0,    0 );
bli_blksz_init     ( &blkszs[ BLIS_MC ],     0,  128,    0,    0,
                                             0,  160,    0,    0 );
bli_blksz_init     ( &blkszs[ BLIS_KC ],     0,  256,    0,    0,
                                             0,  288,    0,    0 );
bli_blksz_init_easy( &blkszs[ BLIS_NC ],     0, 4096,    0,    0 );
Here, we use bli_blksz_init()
to set different auxiliary (maximum) cache blocksizes for MC and KC. The same function could be used to set auxiliary (packing) register blocksizes for MR and NR, which correspond to the PACKMR and PACKNR parameters. Other blocksizes, particularly those corresponding to level-1f operations, may be set. For a complete list of blocksize IDs, please see the definitions of bszid_t
in frame/include/bli_type_defs.h. For more information on interpretations of the auxiliary blocksize value, see the digressions below.
Note that we set level-3 blocksizes even for datatypes that retain reference code kernels; however, by passing in 0
for those blocksizes, we indicate to bli_blksz_init()
and bli_blksz_init_easy()
that the current value should be left untouched. In the example above, this leaves the blocksizes associated with the reference kernels (set by bli_cntx_init_fooarch_ref()
) intact for the single real, single complex, and double complex datatypes.
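The "zero means leave untouched" rule can be illustrated with a small standalone sketch. The toy_blksz type and both toy_ functions are hypothetical stand-ins rather than BLIS code. The sketch also assumes that the easy variant applies the same value to both the primary and auxiliary fields, which this document does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>

enum { DT_S, DT_D, DT_C, DT_Z, NUM_DT };

/* Toy blocksize object: one primary and one auxiliary value per datatype. */
typedef struct
{
    int v[ NUM_DT ];  /* primary values   */
    int e[ NUM_DT ];  /* auxiliary values */
} toy_blksz;

/* Set primary and auxiliary values; a 0 leaves the stored value as it was. */
static void toy_blksz_init( toy_blksz* b,
                            int vs, int vd, int vc, int vz,
                            int es, int ed, int ec, int ez )
{
    const int v[ NUM_DT ] = { vs, vd, vc, vz };
    const int e[ NUM_DT ] = { es, ed, ec, ez };

    assert( b != NULL );

    for ( int dt = 0; dt < NUM_DT; ++dt )
    {
        if ( v[ dt ] != 0 ) b->v[ dt ] = v[ dt ];
        if ( e[ dt ] != 0 ) b->e[ dt ] = e[ dt ];
    }
}

/* Assumption: the easy variant mirrors each primary value into the
   auxiliary slot. */
static void toy_blksz_init_easy( toy_blksz* b, int s, int d, int c, int z )
{
    toy_blksz_init( b, s, d, c, z, s, d, c, z );
}
```

Starting from an object whose fields already hold reference values, a call that passes 0 for s, c, and z changes only the double-precision real entries.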
Digression: Auxiliary blocksize values for register blocksizes are interpreted as the "packing" register blocksizes. PACKMR and PACKNR serve as "leading dimensions" of the packed micropanels that are passed into the microkernel. Oftentimes, PACKMR = MR and PACKNR = NR, and thus the developer does not typically need to set these values manually. (See the implementation notes for gemm in the BLIS Kernel guide for more details on these topics.)
Digression: Auxiliary blocksize values for cache blocksizes are interpreted as the maximum cache blocksizes. The maximum cache blocksizes are a convenient and portable way of smoothing performance of the level-3 operations when computing with a matrix operand that is just slightly larger than a multiple of the preferred cache blocksize in that dimension. In these "edge cases," iterations run with highly sub-optimal blocking. We can address this problem by merging the "edge case" iteration with the second-to-last iteration, such that the cache blocksizes are slightly larger--rather than significantly smaller--than optimal. The maximum cache blocksizes allow the developer to specify the maximum size of this merged iteration; if the edge case causes the merged iteration to exceed this maximum, then the edge case is not merged and is instead computed in a separate (final) iteration.
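The merging rule can be written down in a few lines. The following is a sketch of the idea as described in the digression, not the framework's actual partitioning code; the function names next_blocksize() and partition_dim() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Size of the next block, given how much of the dimension remains. If the
   remainder fits within the maximum blocksize, take it all in one (merged)
   iteration; otherwise take a block of the preferred size. */
static int next_blocksize( int dim_left, int b_alg, int b_max )
{
    return ( dim_left <= b_max ) ? dim_left : b_alg;
}

/* Partition a dimension into blocks, recording each block size in out[].
   Returns the number of blocks written (at most out_cap). */
static int partition_dim( int dim, int b_alg, int b_max, int* out, int out_cap )
{
    int n = 0;

    assert( out != NULL );
    assert( b_alg > 0 && b_max >= b_alg );

    for ( int i = 0; i < dim && n < out_cap; )
    {
        int b = next_blocksize( dim - i, b_alg, b_max );
        out[ n++ ] = b;
        i += b;
    }
    return n;
}
```

With a preferred blocksize of 128 and a maximum of 160, a dimension of 280 is split into blocks of 128 and 152, since the 24-element edge case is absorbed into the last full block. A dimension of 300 is split into 128, 128, and 44, because merging would produce a block of 172, which exceeds the maximum.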
Committing blocksizes. Finally, we commit the values in blkszs
to the context by calling the variable argument function bli_cntx_set_blkszs()
. This function call generally should be considered boilerplate and thus should not be changed unless you are altering the matrix multiplication algorithm as specified in the control tree. If this is your goal, please get in contact with BLIS developers via the blis-devel mailing list for guidance, if you have not done so already.
Availability of kernels. Note that any kernel made available to the fooarch
configuration within config_registry
may be referenced inside bli_cntx_init_fooarch()
. In this example, we referenced fooarch
kernels as well as kernels native to another configuration, bararch
. Thus, the config_registry
would contain a line such as:
fooarch: fooarch/fooarch/bararch
Interpreting the line left-to-right: the fooarch
configuration family contains only itself, fooarch
, but must be able to refer to kernels from its own kernel set (fooarch
) as well as kernels belonging to the bararch
kernel set. The configuration registry is described more completely in a later section.
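As a way of checking one's reading of such a line, the following C sketch splits a singleton registry line into its parts. It is purely illustrative: configure does not use this code, the names reg_entry and parse_singleton() are hypothetical, and the rule that a line with no slash uses the kernel set of the same name is an assumption here. The sketch does not handle umbrella-family lines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_KERNELS 8
#define NAME_LEN    32

typedef struct
{
    char family[ NAME_LEN ];
    char subconfig[ NAME_LEN ];
    char kernels[ MAX_KERNELS ][ NAME_LEN ];
    int  n_kernels;
} reg_entry;

/* Parse a line of the form "name: subconfig[/kernelset[/kernelset...]]".
   Returns 0 on success, -1 if the line does not match that shape. */
static int parse_singleton( const char* line, reg_entry* e )
{
    char rhs[ 96 ];

    assert( line != NULL && e != NULL );

    /* Family name up to the colon, then the first token after it. */
    if ( sscanf( line, " %31[^: ] : %95s", e->family, rhs ) != 2 ) return -1;

    /* The first slash-separated field is the sub-configuration. */
    char* tok = strtok( rhs, "/" );
    if ( tok == NULL || strlen( tok ) >= NAME_LEN ) return -1;
    strcpy( e->subconfig, tok );

    /* Any remaining fields are kernel sets. */
    e->n_kernels = 0;
    while ( ( tok = strtok( NULL, "/" ) ) != NULL )
    {
        if ( e->n_kernels == MAX_KERNELS || strlen( tok ) >= NAME_LEN ) return -1;
        strcpy( e->kernels[ e->n_kernels++ ], tok );
    }

    /* Assumption: with no explicit kernel sets, the sub-configuration
       uses the kernel set that shares its name. */
    if ( e->n_kernels == 0 )
    {
        strcpy( e->kernels[ 0 ], e->subconfig );
        e->n_kernels = 1;
    }
    return 0;
}
```

Applied to the line above, the sketch yields the family fooarch, the sub-configuration fooarch, and the kernel sets fooarch and bararch.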
This file is conditionally #included
only for the configuration family targeted at configure-time. For example, if you run ./configure haswell
, bli_family_haswell.h
will be #included
, and if you run ./configure intel64
, bli_family_intel64.h
will be #included
. The header file is #included
by frame/include/bli_arch_config.h.
This header file is oftentimes empty. This is because the parameters specified here usually work fine with their default values, which are defined in frame/include/bli_kernel_macro_defs.h. However, there may be some configurations for which a kernel developer will wish to adjust some of these parameters. Furthermore, when creating a configuration family, the parameters set in the corresponding bli_family_*.h
file must work for all sub-configurations in the family.
A description of the parameters that may be set in bli_family_*.h
follows.
Memory allocation functions. BLIS allows the developer to customize the functions called for memory allocation for three different categories of memory: user, pool, and internal. The functions for user allocation are called any time the creation of a BLIS matrix or vector obj_t
requires that a matrix buffer be allocated, such as via bli_obj_create()
. The functions for pool allocation are called only when allocating blocks to the memory pools used to manage packed matrix buffers. The functions for internal allocation are called by BLIS when allocating internal data structures, such as control trees. By default, the three pairs of parameters are defined via preprocessor macros to call the implementation of malloc()
and free()
provided by stdlib.h
:
#define BLIS_MALLOC_USER malloc
#define BLIS_FREE_USER free
#define BLIS_MALLOC_POOL malloc
#define BLIS_FREE_POOL free
#define BLIS_MALLOC_INTL malloc
#define BLIS_FREE_INTL free
Any substitute for malloc()
and free()
defined by customizing these parameters must use the same function prototypes as the original functions. Namely:
void* malloc( size_t size );
void free( void* p );
Furthermore, if a header file needs to be included, such as my_malloc.h
, it should be #included
within the bli_family_*.h
file (before #defining
any of the BLIS_MALLOC_
and BLIS_FREE_
macros).
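A minimal sketch of such a substitution follows. The functions my_malloc() and my_free() and the counter my_live_blocks are hypothetical; in a real configuration they would be declared in a header such as my_malloc.h and only the two #define lines would appear in the bli_family_*.h file. Here everything is placed in one file so that the example stands alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping: number of blocks currently allocated. */
static size_t my_live_blocks = 0;

/* Same prototype as malloc(). */
static void* my_malloc( size_t size )
{
    void* p = malloc( size );
    if ( p != NULL ) ++my_live_blocks;
    return p;
}

/* Same prototype as free(). */
static void my_free( void* p )
{
    if ( p == NULL ) return;
    assert( my_live_blocks > 0 );
    --my_live_blocks;
    free( p );
}

/* Route pool allocations through the custom functions. The user and
   internal categories would keep their defaults. */
#define BLIS_MALLOC_POOL my_malloc
#define BLIS_FREE_POOL   my_free
```

Because the macros expand to plain function names, any code that calls BLIS_MALLOC_POOL( size ) ends up calling my_malloc( size ) without further changes.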
SIMD register file. BLIS allows you to specify the maximum number of SIMD registers available for use by your kernels, as well as the maximum size (in bytes) of those registers. These values default to:
#define BLIS_SIMD_MAX_NUM_REGISTERS 32
#define BLIS_SIMD_MAX_SIZE 64
These macros are used in computing the maximum amount of temporary storage (typically allocated statically, on the function stack) that will be needed to hold a single micro-tile of any datatype (and for any induced method):
#define BLIS_STACK_BUF_MAX_SIZE ( BLIS_SIMD_MAX_NUM_REGISTERS * BLIS_SIMD_MAX_SIZE * 2 )
These temporary buffers are used when handling edge cases (m % MR != 0 || n % NR != 0) within the level-3 macrokernels, and also in the virtual microkernels of various implementations of induced methods for complex matrix multiplication. It is very important that these values be set correctly; otherwise, you may experience undefined behavior as stack data is overwritten at run-time. A kernel developer may set BLIS_SIMD_MAX_NUM_REGISTERS
and BLIS_SIMD_MAX_SIZE
, which will indirectly affect BLIS_STACK_BUF_MAX_SIZE
, or they may set BLIS_STACK_BUF_MAX_SIZE
directly. Notice that the default values are already set to work with modern x86_64 systems.
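As a quick arithmetic check, the defaults give 32 * 64 * 2 = 4096 bytes. The sketch below reproduces that computation and adds a simplified test of whether a micro-tile fits. The helper microtile_fits() is hypothetical and deliberately ignores packing dimensions and induced methods, so it is a rough sanity check rather than the framework's own rule.

```c
#include <assert.h>
#include <stddef.h>

#define BLIS_SIMD_MAX_NUM_REGISTERS 32
#define BLIS_SIMD_MAX_SIZE          64
#define BLIS_STACK_BUF_MAX_SIZE     ( BLIS_SIMD_MAX_NUM_REGISTERS * \
                                      BLIS_SIMD_MAX_SIZE * 2 )

/* Returns 1 if an mr x nr micro-tile of elements, each elem_size bytes,
   fits within the temporary stack buffer, and 0 otherwise. */
static int microtile_fits( size_t mr, size_t nr, size_t elem_size )
{
    assert( elem_size > 0 );
    return mr * nr * elem_size <= (size_t)BLIS_STACK_BUF_MAX_SIZE;
}
```

For example, an 8 x 4 micro-tile of doubles needs 256 bytes and fits easily, whereas a hypothetical 32 x 32 tile of doubles would need 8192 bytes and would not.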
Memory alignment. BLIS implements memory alignment internally, rather than relying on a function such as posix_memalign()
, and thus it can provide aligned memory even with functions that adhere to the malloc()
and free()
API in the standard C library.
#define BLIS_SIMD_ALIGN_SIZE BLIS_SIMD_MAX_SIZE
#define BLIS_PAGE_SIZE 4096
#define BLIS_STACK_BUF_ALIGN_SIZE BLIS_SIMD_ALIGN_SIZE
#define BLIS_HEAP_ADDR_ALIGN_SIZE BLIS_SIMD_ALIGN_SIZE
#define BLIS_HEAP_STRIDE_ALIGN_SIZE BLIS_SIMD_ALIGN_SIZE
#define BLIS_POOL_ADDR_ALIGN_SIZE_A BLIS_PAGE_SIZE
#define BLIS_POOL_ADDR_ALIGN_SIZE_B BLIS_PAGE_SIZE
#define BLIS_POOL_ADDR_ALIGN_SIZE_C BLIS_PAGE_SIZE
#define BLIS_POOL_ADDR_ALIGN_SIZE_GEN BLIS_PAGE_SIZE
The value BLIS_STACK_BUF_ALIGN_SIZE
defines the alignment of stack memory used as temporary internal buffers, such as for output matrices to the microkernel when computing edge cases. (See implementation notes for the gemm
microkernel for details.) This value defaults to BLIS_SIMD_ALIGN_SIZE
, which defaults to BLIS_SIMD_MAX_SIZE
.
The value BLIS_HEAP_ADDR_ALIGN_SIZE
defines the alignment used when allocating memory via the malloc()
function defined by BLIS_MALLOC_USER
. Setting this value to BLIS_SIMD_ALIGN_SIZE
may speed up certain level-1v and -1f kernels.
The value BLIS_HEAP_STRIDE_ALIGN_SIZE
defines the alignment used for so-called "leading dimensions" (i.e. column strides for column-stored matrices, and row strides for row-stored matrices) when creating BLIS matrices via the object-based API (e.g. bli_obj_create()
). While setting BLIS_HEAP_ADDR_ALIGN_SIZE
guarantees alignment for the first column (or row), creating a matrix with certain dimension values (m and n) may cause subsequent columns (or rows) to be misaligned. Setting this value to BLIS_SIMD_ALIGN_SIZE
is usually desirable. Additional alignment may or may not be beneficial.
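The effect of stride alignment can be seen with a small calculation. The helper below is a sketch under the assumption that the stride is simply rounded up to the next multiple of the alignment; the name align_stride() is hypothetical and BLIS's own rounding logic may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Round a leading dimension (in elements) up so that each column of a
   column-stored matrix starts on an align-byte boundary, given that the
   first column does. Requires align to be a multiple of elem_size. */
static size_t align_stride( size_t m, size_t elem_size, size_t align )
{
    assert( elem_size > 0 && align > 0 && align % elem_size == 0 );

    size_t bytes  = m * elem_size;
    size_t padded = ( ( bytes + align - 1 ) / align ) * align;

    return padded / elem_size;
}
```

With 64-byte alignment, a column of 10 doubles (80 bytes) is padded to a stride of 16 elements (128 bytes), while a column of 8 doubles already occupies exactly 64 bytes and keeps a stride of 8.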
The values BLIS_POOL_ADDR_ALIGN_SIZE_*
define the alignments used when allocating blocks to the memory pools used to manage internal packing buffers for matrices A, B, C, and for general use. Any block of memory returned by the memory allocator is guaranteed to be aligned to this value. Aligning these blocks to the virtual memory page size (usually 4096 bytes) is standard practice.
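Since alignment is layered on top of a plain malloc()-style allocator, it may help to see one common way this can be done. The sketch below over-allocates, rounds the address up, and stashes the original pointer just before the aligned block. It illustrates the general technique only; the names aligned_malloc() and aligned_free() are hypothetical and the framework's own implementation may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate size bytes aligned to align bytes using only malloc(). The
   alignment must be a power of two no smaller than sizeof( void* ). */
static void* aligned_malloc( size_t size, size_t align )
{
    assert( align >= sizeof( void* ) && ( align & ( align - 1 ) ) == 0 );

    /* Extra room for the padding and for the saved original pointer. */
    void* raw = malloc( size + align + sizeof( void* ) );
    if ( raw == NULL ) return NULL;

    uintptr_t start   = (uintptr_t)raw + sizeof( void* );
    uintptr_t aligned = ( start + align - 1 ) & ~( (uintptr_t)align - 1 );

    /* Remember what malloc() returned so it can be handed to free(). */
    ( (void**)aligned )[ -1 ] = raw;

    return (void*)aligned;
}

/* Release a block obtained from aligned_malloc(). */
static void aligned_free( void* p )
{
    if ( p != NULL ) free( ( (void**)p )[ -1 ] );
}
```

Requesting a block with an alignment of 4096 returns an address that is a multiple of the page size, regardless of what the underlying malloc() returned.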
The make_defs.mk
file primarily contains compiler and compiler flag definitions used by make
when building a BLIS library.
The format of the file is mostly self-explanatory. However, we will expound on the contents here, using the make_defs.mk
file for the haswell
configuration as an example:
# Declare the name of the current configuration and add it to the
# running list of configurations included by common.mk.
THIS_CONFIG := haswell
ifeq ($(CC),)
CC := gcc
CC_VENDOR := gcc
endif
CPPROCFLAGS := -D_POSIX_C_SOURCE=200112L
CMISCFLAGS := -std=c99 -m64
CPICFLAGS := -fPIC
CWARNFLAGS := -Wall -Wno-unused-function -Wfatal-errors
ifneq ($(DEBUG_TYPE),off)
CDBGFLAGS := -g
endif
ifeq ($(DEBUG_TYPE),noopt)
COPTFLAGS := -O0
else
COPTFLAGS := -O3
endif
CKOPTFLAGS := $(COPTFLAGS)
ifeq ($(CC_VENDOR),gcc)
CVECFLAGS := -mavx2 -mfma -mfpmath=sse -march=core-avx2
else
ifeq ($(CC_VENDOR),icc)
CVECFLAGS := -xCORE-AVX2
else
ifeq ($(CC_VENDOR),clang)
CVECFLAGS := -mavx2 -mfma -mfpmath=sse -march=core-avx2
else
$(error gcc, icc, or clang is required for this configuration.)
endif
endif
endif
# Store all of the variables here to new variables containing the
# configuration name.
$(eval $(call store-make-defs,$(THIS_CONFIG)))
Configuration name. The first statement reaffirms the name of the configuration. The THIS_CONFIG
variable is used later to attach the configuration name as a suffix to the remaining variables so that they can co-exist with variables read from other make_defs.mk
files during multi-configuration builds. Note that if the configuration name defined here does not match the name of the directory in which make_defs.mk
is stored, make
will output an error when executing the top-level Makefile
.
Compiler definitions. Next, we set the values of CC
and CC_VENDOR
. The former is the name of (or path to) the actual compiler executable to use during compilation. The latter is the compiler family. Currently, BLIS generally supports three compiler families: gcc
, clang
, and icc
. CC_VENDOR
is used when conditionally setting various variables based on the type of flags available--flags that might not vary across different versions or installations of the same compiler (e.g. gcc-4.9
vs gcc-5.0
, or gcc
vs /usr/local/bin/gcc
), but may vary across compiler families (e.g. gcc
vs. icc
). If the compiler you wish to use is in your PATH
environment variable, CC
and CC_VENDOR
will usually contain the same value.
Basic compiler flags. The variables CPPROCFLAGS
and CWARNFLAGS
should be assigned C preprocessor flags and compiler warning flags, respectively, while CPICFLAGS
should be assigned the flags that enable position-independent code (needed for shared libraries). Finally, CMISCFLAGS
may be assigned any miscellaneous flags that do not neatly fit into any other category, such as language flags and 32-/64-bit flags. These four categories of flags are usually recognized across compiler families.
Debugging flags. The CDBGFLAGS
variable should be assigned to contain flags that insert debugging symbols into the object code emitted by the compiler. Typically, this amounts to no more than the -g
flag, but some compilers or situations may call for different (or additional) flags. This variable is conditionally set only if $(DEBUG_TYPE)
, which is set by the configure
script, is not equal to off
.
Optimization flags. The COPTFLAGS
variable should be assigned any flags relating to general compiler optimization. Usually this takes the form of -O2
or -O3
, but more specific optimization flags may be included as well, such as -fomit-frame-pointer
. Note that, as with CDBGFLAGS
, COPTFLAGS
is conditionally assigned based on the value of $(DEBUG_TYPE)
. A separate CKOPTFLAGS
variable tracks optimization flags used when compiling kernels. For most configurations, CKOPTFLAGS
is assigned as a copy of COPTFLAGS
, but if the kernel developer needs different optimization flags to be applied when compiling kernel source code, CKOPTFLAGS
should be set accordingly.
Vectorization flags. The second-to-last block sets the CVECFLAGS
, which should be assigned any flags that must be given to the compiler in order to enable use of a vector instruction set needed or assumed by the kernel source code. Also, if you wish to enable automatic use of certain instruction sets (e.g. -mfpmath=sse
for many Intel architectures), this is where you should set those flags. These flags often differ among compiler families, especially between icc
and gcc
/clang
.
Variable storage/renaming. Finally, the last statement commits the variables defined in the file to "storage". That is, they are copied to variable names that contain THIS_CONFIG
as a suffix. This allows the variables for one configuration to co-exist with variables of another configuration.
A configuration family is represented similarly to a sub-configuration: as a sub-directory of the config
directory. Additionally, there are two types of families: singleton families and umbrella families.
A singleton family simply refers to a sub-configuration. The configure
script only targets configuration families. But since every sub-configuration is also a valid configuration family, every sub-configuration is a valid configuration target.
An umbrella family is the more interesting type of configuration family. These families are defined as collections of architecturally related sub-configurations. (Important: an umbrella family should always be named something different than any of its constituent sub-configurations.) BLIS provides a mechanism to define umbrella families so that users and developers can build a single instance of BLIS that supports multiple configurations, where some heuristic is used at runtime to choose among the configurations. For example, you may wish to deploy a BLIS library on a storage device that is shared among several computers, each of which is based on a different x86_64 microarchitecture.
Throughout the remainder of this document, we will sometimes refer to "umbrella families" as simply "families". Similarly, we will refer to "singleton families" and "sub-configurations" interchangeably. To the extent that any ambiguity may remain, context should clarify which type of family is germane to the discussion.
Let's inspect the amd64
configuration family as an example:
$ ls config/amd64
bli_family_amd64.h make_defs.mk
A configuration family contains a subset of the files contained within a sub-configuration: A bli_family_*.h
header file and a make_defs.mk
makefile fragment:
- bli_family_amd64.h. This header file is #included only when the configuration family in question, in this case amd64, is the target of ./configure. The file serves a similar purpose as with sub-configurations--a place to define various parameters, such as those relating to memory allocation and alignment. However, in the context of configuration families, the uniqueness of this file makes a bit more sense. Importantly, the definitions in this file will affect all sub-configurations within the family. Thus, it is useful to think of these as "global" parameters. For example, if custom implementations of malloc() and free() are specified in the bli_family_amd64.h file, these implementations will be used for every sub-configuration member of the amd64 family. (The configuration registry, described in the next section, specifies each configuration family's membership.) As with sub-configurations, this file may be empty, in which case reasonable defaults are selected by the framework.
- make_defs.mk. This makefile fragment defines the compiler and compiler flags in a manner identical to that of sub-configurations. However, these configuration flags are used when compiling source code that is not specific to any one particular sub-configuration. (The build system compiles a set of reference kernels and optimized kernels for each sub-configuration, during which it uses flags read from the individual sub-configurations' make_defs.mk files. By contrast, the general framework code is compiled once--using the flags read from the family's make_defs.mk file--and executed by all sub-configurations.)
For a more detailed walkthrough of these files' expected/allowed contents, please see the descriptions provided in the section on sub-configurations.
With these two files defined and present, the configuration family is properly constituted and ready to be registered within the configuration registry.
The configuration registry is the official place for declaring a sub-configuration or configuration family. Unless a configuration (singleton or family) is declared within the registry, configure
will not accept it as a valid configuration target at configure-time.
Before describing the syntax and semantics of the registry, we'll first briefly describe three types of information we wish to encode into the registry:
Configuration list. First and foremost, the registry needs to enumerate the registered sub-configurations. That is, it needs to list the sub-configurations (or, singleton families) that are available to be targeted by configure
. The registry also needs to specify configuration family membership--that is, the (umbrella) families to which those sub-configurations belong.
Kernel list. Next, the registry needs to specify the list of kernel sets that will be needed by each sub-configuration, and by proxy, each configuration family. It's easy to think of different configurations as corresponding to different microarchitectures, and that generally holds true. However, sometimes we use the same configuration for multiple microarchitectures (e.g. haswell
is used for Intel Haswell, Broadwell, and non-server Skylake variants). It might also be tempting to think of each microarchitecture as having its own set of kernels. However, in practice, we find that some microarchitectures' kernels are identical to those of a previous microarchitectural revision, or to those of another vendor's microarchitecture. Thus, sometimes a sub-configuration will wish to use a kernel set that is "native" to a different configuration. In these cases, there is not a one-to-one mapping of sub-configuration names to kernel set names, and therefore the configuration registry must separately specify the kernel sets needed by any sub-configuration (and by proxy, any configuration family).
Kernel-to-configuration map. Lastly, and most subtly, for each kernel set in the kernel list, the registry needs to specify the sub-configuration(s) that depend on that particular kernel set. Notice that the kernel list can be obtained by mapping sub-configurations to kernel sets they require. By contrast, the kernel-to-configuration map tracks the reverse dependency and helps us answer: for any given kernel set, which sub-configurations caused the kernel set to be pulled into the build? This mapping is needed when determining which sub-configuration's compiler flags (as defined in its make_defs.mk
file) to use when compiling that kernel set. The most obvious solution to this problem would have been to associate compiler flags with the individual kernel sets. However, given the desire to share kernel sets among sub-configurations, we needed the flexibility of applying different compiler flags to any given kernel set based on the sub-configuration that would be utilizing that kernel set. In the case that multiple sub-configurations pull in the same kernel set, a set of heuristics is used to choose between the sub-configurations so that a single set of compiler flags can be chosen for use when compiling that kernel set.
The configuration registry exists as a human-readable file, config_registry
, located at the top-level of the BLIS distribution. What follows is an example of a config_registry
file that is based on actual contents in a BLIS commit recent as of this writing. Note that lines containing only whitespace are ignored. Furthermore, any characters that appear after (and including) a #
are treated as comments and also ignored.
#
# config_registry
#
# Processor families.
x86_64: intel64 amd64
intel64: haswell sandybridge penryn generic
amd64: zen excavator steamroller piledriver bulldozer generic
arm64: cortexa57 generic
arm32: cortexa15 cortexa9 generic
# Intel architectures.
haswell: haswell
sandybridge: sandybridge
penryn: penryn
knl: knl
# AMD architectures.
zen: zen/haswell/sandybridge
excavator: excavator/piledriver
steamroller: steamroller/piledriver
piledriver: piledriver
bulldozer: bulldozer
# ARM architectures.
cortexa57: cortexa57/armv8a
cortexa15: cortexa15/armv7a
cortexa9: cortexa9/armv7a
# Generic architectures.
generic: generic
Generally speaking, the registry can be thought of as defining a very simple grammar. (However, as you'll soon see, there are nuances that are un-grammar-like.) The registry can contain two kinds of lines. The first type defines a singleton configuration family. For example, the line
haswell: haswell
defines a configuration family haswell
(the left side of the :
) as containing only itself: the sub-configuration by the same name, haswell
(the right side of the :
). When singleton families are defined in this way, it implicitly pulls in the kernel set by the same name as the sub-configuration (in this case, haswell
). More specifically, the haswell
sub-configuration depends on the kernels residing in the kernels/haswell
sub-directory.
The second type of line defines an umbrella configuration family. For example, the line
intel64: haswell sandybridge penryn generic
defines the configuration family intel64
as containing the haswell
, sandybridge
, penryn
, and generic
sub-configurations as members (technically speaking, it is more accurate to think of the family as containing singleton families rather than their corresponding sub-configurations). Thus, if the user runs ./configure intel64
, the library will be built to support all sub-configurations defined within the intel64
family.
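The two kinds of registry lines, together with the comment and whitespace rules mentioned earlier, can be sketched in a few lines of C. This is only an illustration of the grammar described above, not how configure (a shell script) actually parses the file; every demo_* name is hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not part of BLIS): strip a trailing '#' comment in
   place and report whether anything other than whitespace remains. */
bool demo_strip_comment(char *line)
{
    char *hash = strchr(line, '#');
    if (hash != NULL) *hash = '\0';            /* drop the comment text */
    for (const char *p = line; *p != '\0'; ++p)
        if (!isspace((unsigned char)*p)) return true;
    return false;                              /* blank or comment-only line */
}

/* Hypothetical helper: split "name: a b c" into the family name and its
   members. Returns the member count, or -1 if the line is malformed.
   The line is modified in place. */
int demo_parse_line(char *line, char **name, char *members[], int max_members)
{
    char *colon = strchr(line, ':');
    if (colon == NULL) return -1;
    *colon = '\0';                             /* separate left and right sides */
    *name = strtok(line, " \t\r\n");
    if (*name == NULL) return -1;
    int n = 0;
    for (char *tok = strtok(colon + 1, " \t\r\n");
         tok != NULL && n < max_members;
         tok = strtok(NULL, " \t\r\n"))
        members[n++] = tok;
    return n;
}

/* A singleton family has exactly one member whose sub-configuration name
   (the text before any '/') equals the family name. */
bool demo_is_singleton(const char *name, char *const members[], int n)
{
    if (n != 1) return false;
    size_t len = strcspn(members[0], "/");
    return strlen(name) == len && strncmp(name, members[0], len) == 0;
}
```

Under these assumptions, the line "intel64: haswell sandybridge penryn generic" yields four members and is classified as an umbrella family, while "haswell: haswell" yields one member and is classified as a singleton.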
Note: generic
is a somewhat special sub-configuration that uses only reference kernels and reference blocksizes. It is included in every umbrella family so that when those families are instantiated into BLIS libraries and linked to an application, the application will be able to run even if none of the other sub-configurations (haswell
, sandybridge
, penryn
) are chosen at runtime by the hardware detection heuristic.
Some sub-configurations, for various reasons, do not rely on their own set of kernels and instead use the kernel set that is native to another sub-configuration. For example, the excavator
and steamroller
configurations each correspond to hardware that is very similar to the hardware targeted by the piledriver
configuration. In fact, the former two configurations rely exclusively on kernels written for the latter configuration. (Presently, there are no excavator
or steamroller
kernel sets in BLIS.) We denote this kernel dependency with a /
character:
excavator: excavator/piledriver
steamroller: steamroller/piledriver
Here, the first line (reading from left-to-right) defines the excavator
singleton family as containing only itself, the excavator
sub-configuration, and also specifies that this sub-configuration must have access to the piledriver
kernel set. The second line defines the steamroller
singleton family in a similar manner.
Note: Specifying non-native kernel sets via the /
character is only allowed when defining singleton configuration families. They may NOT appear in the definitions of umbrella families! When an umbrella family includes a singleton family that is defined to require non-native kernels, this will be accounted for during the parsing of the config_registry
file.
Sometimes, a sub-configuration may need access to more than one kernel set. If additional kernel sets are needed, they should be listed with additional /
characters:
zen: zen/haswell/sandybridge
The line above defines the zen
singleton family as containing only itself, the zen
sub-configuration, and also specifies that this sub-configuration must have access to the haswell
kernel set as well as the sandybridge
kernel set. What if there exists a zen
kernel set as well, which the zen
sub-configuration must access in addition to those of haswell
and sandybridge
? In this case, it would need to be annotated explicitly as:
zen: zen/zen/haswell/sandybridge
This line (which is hypothetical and does not appear in the config_registry
example above) defines the zen
singleton family in terms of only the zen
sub-configuration, and provides that sub-configuration access to zen
, haswell
, and sandybridge
kernel sets. (Also: the kernel sets may appear in any order.)
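As a rough sketch of the / notation, the following C function splits the right-hand side of a singleton definition into the sub-configuration name and its kernel sets. It encodes only the rules stated above (no / means the native kernel set of the same name; with /, every kernel set must be listed explicitly). The function name and signature are hypothetical and do not correspond to anything in configure.

```c
#include <string.h>

/* Hypothetical helper (not part of BLIS): given the right-hand side of a
   singleton definition such as "zen/haswell/sandybridge", store the
   sub-configuration name in *config and the kernel set names in ksets[].
   With no '/', the kernel set defaults to the sub-configuration's own name.
   Returns the number of kernel sets. The input is modified in place. */
int demo_split_ksets(char *rhs, char **config, char *ksets[], int max_ksets)
{
    *config = rhs;
    char *slash = strchr(rhs, '/');
    if (slash == NULL) {                   /* e.g. "haswell" */
        if (max_ksets < 1) return 0;
        ksets[0] = rhs;                    /* native kernel set, same name */
        return 1;
    }
    *slash = '\0';                         /* terminate the config name */
    int n = 0;
    for (char *tok = strtok(slash + 1, "/");
         tok != NULL && n < max_ksets;
         tok = strtok(NULL, "/"))
        ksets[n++] = tok;                  /* explicitly listed kernel sets */
    return n;
}
```

With this sketch, "zen/haswell/sandybridge" produces the kernel sets haswell and sandybridge only, whereas "zen/zen/haswell/sandybridge" also produces zen, matching the explicit annotation discussed above.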
Notice that while kernel sets usually correspond to a sub-configuration, they do not always. For example, while the armv7a
and armv8a
kernel sets are referenced in the example config_registry
file, there do not exist any registered sub-configurations by those names. However, the kernel directories exist and the kernel sets appear in the definitions of a few cortex
singleton families.
One last thing to point out: take a look at the x86_64
configuration family:
x86_64: intel64 amd64
Unlike most of the registered families, which are defined in terms of sub-configurations, x86_64
is defined in terms of other families--specifically, intel64
and amd64
:
intel64: haswell sandybridge penryn generic
amd64: zen excavator steamroller piledriver bulldozer generic
This multi-level style of specifying sub-configurations became available starting in 290dd4a. The behavior of configure
in this situation is as you would expect; that is, including intel64
and amd64
in the definition of x86_64
is equivalent to:
x86_64: haswell sandybridge penryn zen excavator steamroller piledriver bulldozer generic
Any duplicates that may result are removed automatically.
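The expansion of families defined in terms of other families, including duplicate removal, might be sketched as follows. The table holds the three umbrella definitions from the example registry; the table layout, the demo_* names, and the assumption that definitions are never circular are all hypothetical. This sketch keeps each name at its first appearance, so the order of its output may differ from what configure prints.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_NAME 32

/* A toy registry: each entry maps an umbrella family to its members. */
typedef struct { const char *name; const char *members; } demo_entry;

static const demo_entry demo_registry[] = {
    { "x86_64",  "intel64 amd64" },
    { "intel64", "haswell sandybridge penryn generic" },
    { "amd64",   "zen excavator steamroller piledriver bulldozer generic" },
};

/* Return the member list of an umbrella family, or NULL if the name is
   not an umbrella family (and is thus treated as a sub-configuration). */
static const char *demo_lookup(const char *name, size_t len)
{
    size_t count = sizeof demo_registry / sizeof demo_registry[0];
    for (size_t i = 0; i < count; ++i)
        if (strlen(demo_registry[i].name) == len &&
            strncmp(demo_registry[i].name, name, len) == 0)
            return demo_registry[i].members;
    return NULL;
}

/* Append the sub-configurations reachable from `members` to out[],
   skipping names already present. Returns the updated count.
   Assumes the registry contains no circular definitions. */
int demo_expand(const char *members, char out[][DEMO_MAX_NAME], int n, int max)
{
    const char *p = members;
    while (*p != '\0') {
        p += strspn(p, " ");                       /* skip separators */
        size_t len = strcspn(p, " ");
        if (len == 0) break;
        const char *nested = demo_lookup(p, len);
        if (nested != NULL) {
            n = demo_expand(nested, out, n, max);  /* family inside a family */
        } else if (len < DEMO_MAX_NAME) {
            int seen = 0;
            for (int i = 0; i < n; ++i)
                if (strlen(out[i]) == len && strncmp(out[i], p, len) == 0)
                    seen = 1;
            if (!seen && n < max) {                /* first occurrence only */
                memcpy(out[n], p, len);
                out[n][len] = '\0';
                ++n;
            }
        }
        p += len;
    }
    return n;
}
```

Expanding "intel64 amd64" with this sketch produces nine distinct sub-configurations, with generic appearing once even though both families list it.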
The configuration list, kernel list, and kernel-to-configuration map are constructed internally by configure
, but these structures can be inspected by running configure
with the -c
(which is the short form of --show-config-lists
) option. This can be useful as a sanity check to make sure configure
is properly parsing and interpreting the config_registry
file.
The first thing printed is the configuration list:
$ ./configure -c amd64
configure: reading configuration registry...done.
...
configure: configuration list:
configure: amd64: zen excavator steamroller piledriver bulldozer generic
configure: arm32: cortexa15 cortexa9 generic
configure: arm64: cortexa57 generic
configure: bulldozer: bulldozer
configure: cortexa15: cortexa15
configure: cortexa57: cortexa57
configure: cortexa9: cortexa9
configure: excavator: excavator
configure: generic: generic
configure: haswell: haswell
configure: intel64: haswell sandybridge penryn generic
configure: knl: knl
configure: penryn: penryn
configure: piledriver: piledriver
configure: sandybridge: sandybridge
configure: skx: skx
configure: steamroller: steamroller
configure: x86_64: haswell sandybridge penryn zen excavator steamroller piledriver bulldozer generic
This simply lists the sub-configurations associated with each defined configuration family (singleton or umbrella). Note that they are sorted alphabetically.
Next, the kernel list (actually, all kernel lists) is printed:
configure: kernel list:
configure: amd64: zen piledriver bulldozer generic
configure: arm32: armv7a generic
configure: arm64: armv8a generic
configure: bulldozer: bulldozer
configure: cortexa15: armv7a
configure: cortexa57: armv8a
configure: cortexa9: armv7a
configure: excavator: piledriver
configure: generic: generic
configure: haswell: haswell zen
configure: intel64: haswell zen sandybridge penryn generic
configure: knl: knl
configure: penryn: penryn
configure: piledriver: piledriver
configure: sandybridge: sandybridge
configure: skx: skx
configure: steamroller: piledriver
configure: x86_64: haswell sandybridge penryn zen piledriver bulldozer generic
configure: zen: zen
This shows the kernel sets that are pulled in by each configuration family. For singleton families, this is specified in a straightforward manner via the /
character described in the previous section. For umbrella families, this is determined indirectly by looking up the definitions of the singleton families that are members of the umbrella family.
Next, the full kernel-to-configuration map is printed:
configure: kernel-to-config map for 'amd64':
configure: bulldozer: bulldozer
configure: generic: generic
configure: piledriver: excavator steamroller piledriver
configure: zen: zen
For each of the kernel sets required of the selected configuration family above, the kernel-to-configuration map shows the sub-configurations that required that kernel set. Notice that sometimes a single kernel set may be pulled in by more than one sub-configuration, as with the piledriver
kernel set.
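Since the kernel-to-configuration map is the reverse of the kernel list, it can be derived by inverting the sub-configuration-to-kernel-set relation. The sketch below does this for the members of amd64, using the kernel sets shown in the kernel list output above. The table and the demo_* functions are hypothetical and serve only to illustrate the inversion; they say nothing about the heuristics configure uses afterwards to pick a single sub-configuration.

```c
#include <stdio.h>
#include <string.h>

/* Sub-configuration -> kernel sets for the members of amd64. */
typedef struct { const char *config; const char *ksets; } demo_pair;

static const demo_pair demo_amd64[] = {
    { "zen",         "zen"        },
    { "excavator",   "piledriver" },
    { "steamroller", "piledriver" },
    { "piledriver",  "piledriver" },
    { "bulldozer",   "bulldozer"  },
    { "generic",     "generic"    },
};

/* Returns 1 if the space-separated list contains the given word. */
static int demo_has_word(const char *list, const char *word)
{
    size_t wlen = strlen(word);
    const char *p = list;
    while (*p != '\0') {
        p += strspn(p, " ");
        size_t len = strcspn(p, " ");
        if (len == wlen && strncmp(p, word, len) == 0) return 1;
        p += len;
    }
    return 0;
}

/* Write into buf the sub-configurations that pull in kernel set `kset`,
   separated by spaces. Returns how many were found. */
int demo_configs_for_kset(const char *kset, char *buf, size_t bufsize)
{
    int found = 0;
    if (bufsize == 0) return 0;
    buf[0] = '\0';
    size_t count = sizeof demo_amd64 / sizeof demo_amd64[0];
    for (size_t i = 0; i < count; ++i) {
        if (!demo_has_word(demo_amd64[i].ksets, kset)) continue;
        size_t used = strlen(buf);
        snprintf(buf + used, bufsize - used, "%s%s",
                 found ? " " : "", demo_amd64[i].config);
        ++found;
    }
    return found;
}
```

Querying this sketch for the piledriver kernel set reports three sub-configurations (excavator, steamroller, and piledriver), which is the same row shown in the map above.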
Lastly, we print a version of the kernel-to-configuration map in which we've used a set of heuristics to select a single sub-configuration for each kernel set in the map:
configure: kernel-to-config map for 'amd64' (chosen pairs):
configure: bulldozer:bulldozer
configure: generic:generic
configure: piledriver:piledriver
configure: zen:zen
This variant of the kernel-to-config map is formatted as a series of "kernel-set:sub-configuration" pairs. These pairs are used during the processing of the top-level Makefile
to determine which sub-configuration's compiler flags should be used when compiling the source code within each kernel set.
Adding support for a new set of kernels in BLIS is easy and can be done via the following steps.
-
Create and populate the kernel set directory. First, we must create a directory in kernels that corresponds to the new kernel set. Suppose we wanted to add kernels for Intel's Knights Landing microarchitecture. In BLIS, this corresponds to the knl configuration, and so we should name the directory knl. This is because we want the knl kernel set to be pulled by default into builds that include the knl sub-configuration.
$ mkdir kernels/knl
$ ls kernels
armv7a  bgq        generic  knc  old     piledriver  sandybridge
armv8a  bulldozer  haswell  knl  penryn  power7
Next, we must write the knl kernels and place them inside kernels/knl. (For more information on writing BLIS kernels, please see the Kernels Guide.) We recommend separating level-1v, level-1f, and level-3 kernels into separate 1, 1f, and 3 sub-directories, respectively. The kernel files and functions therein do not need to follow any particular naming convention, though we strongly recommend using the conventions already used by other kernel sets. Take a look at other kernel files, such as those for haswell, for examples. Finally, for the knl kernel set, you should insert a file named bli_kernels_knl.h into kernels/knl that prototypes all of your new kernel set's kernel functions. You are welcome to write your own prototypes, but to make the prototyping of kernels easier we recommend using the prototype-generating macros for level-1v, level-1f, level-1m, and level-3 functions defined in frame/1/bli_l1v_ker_prot.h, frame/1f/bli_l1f_ker_prot.h, frame/1m/bli_l1m_ker_prot.h, and frame/3/bli_l3_ukr_prot.h, respectively. The following example illustrates how a select subset of these macros can be used to generate kernel function prototypes.
GEMM_UKR_PROT( double, d, gemm_knl_asm_24x8 )
PACKM_KER_PROT( double, d, packm_knl_asm_24xk )
PACKM_KER_PROT( double, d, packm_knl_asm_8xk )
AXPYF_KER_PROT( dcomplex, z, axpyf_knl_asm )
DOTXF_KER_PROT( dcomplex, z, dotxf_knl_asm )
AXPYV_KER_PROT( float, s, axpyv_knl_asm )
DOTXV_KER_PROT( float, s, dotxv_knl_asm )
The first line generates a function prototype for a double-precision real gemm microkernel named bli_dgemm_knl_asm_24x8(). Notice how the macro takes three arguments: the C language datatype, the single character corresponding to the datatype, and the base name of the function, which includes the operation (gemm), the kernel set name (knl), and a substring specifying its implementation (asm_24x8). The second and third lines generate prototypes for double-precision real packm kernels to go along with the gemm microkernel above. The fourth and fifth lines generate prototypes for double-precision complex instances of the level-1f kernels axpyf and dotxf. The last two lines generate prototypes for single-precision real instances of the level-1v kernels axpyv and dotxv.
-
Add support within the framework source code. We also need to make a minor update to the framework to support the new kernels--specifically, to pull in the kernels' function prototypes.
frame/include/bli_arch_config.h. When adding support for the knl kernel set to the framework, we must modify this file to #include the bli_kernels_knl.h header file:
#ifdef BLIS_KERNELS_KNL
#include "bli_kernels_knl.h"
#endif
The BLIS_KERNELS_KNL macro, which guards the #include directive, is automatically defined by the build system when the knl kernel set is required by any sub-configuration.
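To make the naming scheme of the prototype-generating macros in the first step more concrete, here is a simplified stand-in written with the preprocessor's token-pasting operator. It is a sketch only: DEMO_AXPYV_KER_PROT, the demo_ref base name, and the four-argument signature are hypothetical, and the real BLIS macros and kernel signatures differ. Only the pattern of combining the bli_ prefix, the datatype character, and the base name is taken from the description above.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a prototype-generating macro.
   It pastes "bli_", the datatype character, and the base name together. */
#define DEMO_AXPYV_KER_PROT(ctype, ch, opname) \
    void bli_##ch##opname(size_t n, const ctype *alpha, \
                          const ctype *x, ctype *y);

/* Generates prototypes for bli_saxpyv_demo_ref() and bli_daxpyv_demo_ref(). */
DEMO_AXPYV_KER_PROT(float,  s, axpyv_demo_ref)
DEMO_AXPYV_KER_PROT(double, d, axpyv_demo_ref)

/* Plain C definitions matching the generated prototypes:
   y := y + alpha * x. */
void bli_saxpyv_demo_ref(size_t n, const float *alpha,
                         const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i) y[i] += *alpha * x[i];
}

void bli_daxpyv_demo_ref(size_t n, const double *alpha,
                         const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) y[i] += *alpha * x[i];
}
```

Because the prototypes come from one macro, a definition whose signature drifts from the expected form is caught by the compiler as a conflicting declaration.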
Adding support for a new umbrella configuration family in BLIS is fairly straightforward and can be done via the following steps. The hypothetical examples used in these steps assume you are trying to create a new configuration family intelavx
that supports only Intel microarchitectures that support the Intel AVX instruction set.
-
Create and populate the family directory. First, we must create a directory in config that corresponds to the new family. Since we are adding a new family named intelavx, we would name our directory intelavx.
$ mkdir config/intelavx
$ ls config
amd64      cortexa15  excavator  intel64   knl     piledriver   steamroller
bgq        cortexa57  generic    intelavx  old     power7       template
bulldozer  cortexa9   haswell    knc       penryn  sandybridge  zen
We also need to create bli_family_intelavx.h and make_defs.mk files inside our new sub-directory. Since they will be very similar to those of the intel64 family's files, we can copy those files over and then modify them accordingly:
$ cp config/intel64/bli_family_intel64.h config/intelavx/bli_family_intelavx.h
$ cp config/intel64/make_defs.mk config/intelavx/
First, we update the configuration name inside of make_defs.mk:
THIS_CONFIG := intelavx
and while we're editing the file, we can make any other changes to compiler flags we wish (if any). Similarly, the bli_family_intelavx.h header file should be updated, though in our case it does not need any changes; the original file is empty and thus the copied file can remain empty as well. Note that other configuration families may have different needs. Remember that all of the parameters set in this file, either explicitly or implicitly (via their defaults), must work for all sub-configurations in the family. When creating or modifying a family, it's worth reviewing the parameters' defaults, which are set in frame/include/bli_kernel_macro_defs.h, and convincing yourself that each parameter default (or overriding definition in bli_family_*.h) will work for each sub-configuration.
-
Add support within the framework source code. Next, we need to update the BLIS framework source code so that the new configuration family is recognized and supported. Configuration families require updates to two files.
-
frame/include/bli_arch_config.h. This file must be updated to #include the bli_family_intelavx.h header file. Notice that the preprocessor directive should be guarded as follows:
#ifdef BLIS_FAMILY_INTELAVX
#include "bli_family_intelavx.h"
#endif
The BLIS_FAMILY_INTELAVX macro will automatically be defined by the build system whenever the family targeted by configure is intelavx. (In general, if the user runs ./configure foobar, the C preprocessor macro BLIS_FAMILY_FOOBAR will be defined.)
-
frame/base/bli_arch.c. This file must be updated so that bli_arch_query_id() returns the correct arch_t microarchitecture ID value to the caller. This function is called when the framework is trying to choose which sub-configuration to use at runtime. For x86_64 architectures, this is supported via the CPUID instruction, as implemented via bli_cpuid_query_id(). Thus, you can simply mimic what is done for the intel64 family by inserting lines such as:
#ifdef BLIS_FAMILY_INTELAVX
id = bli_cpuid_query_id();
#endif
This results in bli_cpuid_query_id() being called, which will return the arch_t ID value corresponding to the hardware detected by CPUID. (If your configuration family does not consist of x86_64 architectures, then you'll need some other heuristic to determine how to choose the correct sub-configuration at runtime. When in doubt, please open an issue to begin a dialogue with developers.)
-
Update the configuration registry. The last step is to update the config_registry file so that it defines the new family. Since we want the family to include only Intel sub-configurations that support AVX, we would add the following line:
intelavx: haswell sandybridge
Notice that we left out the Core2-based penryn sub-configuration since it targets hardware that only supports SSE vector instructions.
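The way the family macro steers bli_arch_query_id(), as described in the second step, can be mocked up in isolation. In the sketch below every name is a hypothetical stand-in (demo_arch_t for arch_t, DEMO_FAMILY_INTELAVX for the macro that the build system would define, and so on). The detection function is a placeholder that always reports Sandy Bridge, and the generic fallback is an assumption made only so that the function always returns a value.

```c
/* Stand-ins for types and values that BLIS defines itself. */
typedef enum {
    DEMO_ARCH_HASWELL,
    DEMO_ARCH_SANDYBRIDGE,
    DEMO_ARCH_GENERIC
} demo_arch_t;

/* Normally supplied by the build system; defined here so that the
   sketch compiles on its own. */
#define DEMO_FAMILY_INTELAVX

/* Placeholder for CPUID-based detection: pretends that the hardware
   is Sandy Bridge. */
static demo_arch_t demo_cpuid_query_id(void)
{
    return DEMO_ARCH_SANDYBRIDGE;
}

demo_arch_t demo_arch_query_id(void)
{
    demo_arch_t id = DEMO_ARCH_GENERIC;   /* assumed fallback */

#ifdef DEMO_FAMILY_INTELAVX
    id = demo_cpuid_query_id();           /* umbrella family: ask the hardware */
#endif
#ifdef DEMO_FAMILY_HASWELL
    id = DEMO_ARCH_HASWELL;               /* singleton family: fixed at build time */
#endif

    return id;
}
```

Since only one family macro is defined per build, exactly one of the guarded assignments is compiled in.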
Adding support for a new sub-configuration to BLIS is similar to adding support for a family, though there are a few additional steps. Throughout this section, we will use the knl
(Knights Landing) configuration as an example to illustrate the typical changes necessary to various files in BLIS.
-
Create and populate the sub-configuration directory. First, we must create a directory in config that corresponds to the new sub-configuration.
$ mkdir config/knl
$ ls config
amd64      cortexa15  excavator  intel64  old         power7       template
bgq        cortexa57  generic    knc      penryn      sandybridge  zen
bulldozer  cortexa9   haswell    knl      piledriver  steamroller
We also need to create bli_cntx_init_knl.c, bli_family_knl.h, and make_defs.mk files inside our new sub-directory. Since they will be very similar to those of the haswell sub-configuration's files, we can copy those files over and then modify them accordingly:
$ cp config/haswell/bli_cntx_init_haswell.c config/knl/bli_cntx_init_knl.c
$ cp config/haswell/bli_family_haswell.h config/knl/bli_family_knl.h
$ cp config/haswell/make_defs.mk config/knl/
First, we update the configuration name inside of make_defs.mk:
THIS_CONFIG := knl
and while we're editing the file, we can make any other changes to compiler flags we wish (if any). Similarly, the bli_family_knl.h header file should be updated as needed. Since the number of vector registers and the vector register size on knl differ from the defaults, we must explicitly set them. (The role of these parameters was explained in a previous section.) Furthermore, provided that a macro BLIS_NO_HBWMALLOC is not set, we use a different implementation of malloc() and free() and #include that implementation's header file.
#define BLIS_SIMD_MAX_NUM_REGISTERS 32
#define BLIS_SIMD_MAX_SIZE 64
#ifdef BLIS_NO_HBWMALLOC
#include <stdlib.h>
#define BLIS_MALLOC_POOL malloc
#define BLIS_FREE_POOL free
#else
#include <hbwmalloc.h>
#define BLIS_MALLOC_POOL hbw_malloc
#define BLIS_FREE_POOL hbw_free
#endif
Finally, we update bli_cntx_init_knl.c to initialize the context with the appropriate kernel function pointers and blocksize values. The functions used to perform this initialization are explained in an earlier section.
-
Add support within the framework source code. Next, we need to update the BLIS framework source code so that the new sub-configuration is recognized and supported. Sub-configurations require updates to four files--six if hardware detection logic is added.
-
frame/include/bli_type_defs.h. First, we need to define an ID to associate with the microarchitecture for which we are adding support. All microarchitecture type IDs are defined in bli_type_defs.h as an enumerated type that we typedef to arch_t. To support knl, we add a new enumerated type value BLIS_ARCH_KNL:
typedef enum
{
    BLIS_ARCH_KNL,
    BLIS_ARCH_KNC,
    BLIS_ARCH_HASWELL,
    BLIS_ARCH_SANDYBRIDGE,
    BLIS_ARCH_PENRYN,
    BLIS_ARCH_ZEN,
    BLIS_ARCH_EXCAVATOR,
    BLIS_ARCH_STEAMROLLER,
    BLIS_ARCH_PILEDRIVER,
    BLIS_ARCH_BULLDOZER,
    BLIS_ARCH_CORTEXA57,
    BLIS_ARCH_CORTEXA15,
    BLIS_ARCH_CORTEXA9,
    BLIS_ARCH_POWER7,
    BLIS_ARCH_BGQ,
    BLIS_ARCH_GENERIC,
    BLIS_NUM_ARCHS
} arch_t;
Notice that the total number of arch_t values, BLIS_NUM_ARCHS, is updated automatically.
-
frame/base/bli_gks.c. We must also update the global kernel structure, or gks, to register the new sub-configuration during library initialization. Sub-configuration registration occurs in bli_gks_init(). For knl, updating this function amounts to inserting the following lines:
#ifdef BLIS_CONFIG_KNL
bli_gks_register_cntx( BLIS_ARCH_KNL, bli_cntx_init_knl,
                                      bli_cntx_init_knl_ref,
                                      bli_cntx_init_knl_ind );
#endif
This function submits pointers to various context initialization functions to the global kernel structure, which are then stored and called at the appropriate time. The functions must be named strictly according to the format shown in the example above, with knl replaced with the sub-configuration name. Also, note the call to bli_gks_register_cntx is guarded by BLIS_CONFIG_KNL. This macro is automatically #defined by the build system if and when the knl sub-configuration is enabled at configure-time, either directly as a singleton family or indirectly via an umbrella family.
-
frame/include/bli_arch_config.h. This file must be updated in two places. First, we must modify it to generate prototypes for the bli_cntx_init_*() functions, including the developer-provided function bli_cntx_init_knl() (defined in config/knl/bli_cntx_init_knl.c), by inserting:
#ifdef BLIS_CONFIG_KNL
CNTX_INIT_PROTS( knl )
#endif
Here, the CNTX_INIT_PROTS macro generates the appropriate prototypes based on the name of the sub-configuration. Next, we must #include the bli_family_knl.h header file, just as we would if we were adding support for an umbrella family:
#ifdef BLIS_FAMILY_KNL
#include "bli_family_knl.h"
#endif
As before with umbrella families, the BLIS_FAMILY_KNL macro is automatically defined by the build system when knl is the family targeted by configure. (That is, if the user runs ./configure foobar, the C preprocessor macro BLIS_FAMILY_FOOBAR will be defined.)
-
frame/base/bli_arch.c. This file must be updated so that bli_arch_query_id() returns the correct arch_t architecture ID value to the caller. bli_arch_query_id() is called when the framework is trying to choose which sub-configuration to use at runtime. When adding support for a sub-configuration as a singleton family, this amounts to adding a block of code such as:
#ifdef BLIS_FAMILY_KNL
id = BLIS_ARCH_KNL;
#endif
The BLIS_FAMILY_KNL macro is automatically #defined by the build system if the knl sub-configuration was targeted directly (as a singleton family) at configure-time. Other ID values are returned only if their respective family macros are defined. (Recall that only one family is ever enabled at a time.) If, however, the knl sub-configuration was enabled indirectly via an umbrella family, bli_arch_query_id() will return the arch_t ID value via lines similar to the following:
#ifdef BLIS_FAMILY_INTEL64
id = bli_cpuid_query_id();
#endif
#ifdef BLIS_FAMILY_AMD64
id = bli_cpuid_query_id();
#endif
Supporting runtime detection of knl microarchitectures requires adding knl support to bli_cpuid_query_id(), which is addressed in the next step (bli_cpuid.c). Before we finish editing the bli_arch.c file, we need to add a string label to the static array config_name:
static char* config_name[ BLIS_NUM_ARCHS ] =
{
    "knl",
    "knc",
    "haswell",
    "sandybridge",
    "penryn",
    "zen",
    "excavator",
    "steamroller",
    "piledriver",
    "bulldozer",
    "cortexa57",
    "cortexa15",
    "cortexa9",
    "power7",
    "bgq",
    "generic"
};
This array is used by bli_arch_string() when mapping arch_t values to the strings associated with that architecture ID. Because the arch_t value is used as the index of each string, the relative order of the strings in this array is important. Be sure to insert the new string (in our case, "knl") at the same relative location as the arch_t value inserted in bli_type_defs.h. This will ensure that each arch_t value will map to its corresponding string in the config_name array.
-
frame/base/bli_cpuid.c. To support the aforementioned runtime microarchitecture detection, the function bli_cpuid_query_id(), defined in bli_cpuid.c, will need to be updated. Specifically, we need to insert logic that will detect the presence of the new hardware based on the results of the CPUID instruction (assuming the new microarchitecture belongs to the x86_64 architecture family). For example, when support for knl was added, this entailed adding the following code block to bli_cpuid_query_id():
#ifdef BLIS_CONFIG_KNL
if ( bli_cpuid_is_knl( family, model, features ) )
    return BLIS_ARCH_KNL;
#endif
Additionally, we had to define the function bli_cpuid_is_knl(), which checks for various processor features known to be present on knl systems and returns a boolean TRUE if all relevant feature checks are satisfied by the hardware. Note that the order in which we check for the sub-configurations is important. We must check for microarchitectural matches from most recent to most dated. This prevents an older sub-configuration from being selected on newer hardware when a newer sub-configuration would have also matched.
-
frame/base/bli_cpuid.h. After defining the function bli_cpuid_is_knl(), we must also update bli_cpuid.h to contain a prototype for the function.
-
Update the configuration registry. Lastly, we update the config_registry file so that it defines the new sub-configuration. For example, if we want to define a sub-configuration called knl that used only knl kernels, we would add the following line:
knl: knl
If, when defining bli_cntx_init_knl(), we referenced kernels from a non-native kernel set--say, those of haswell--in addition to knl-specific kernels, we would need to explicitly pull in both knl and haswell kernel sets:
knl: knl/knl/haswell
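Two of the framework updates above depend on ordering: the config_name strings must line up with the arch_t values, and the hardware checks must run from newest to oldest. The sketch below illustrates both with hypothetical stand-ins (demo_arch_t, demo_config_name, demo_features). The feature tests are deliberately simplified to one instruction-set flag per microarchitecture, which is not how bli_cpuid_is_knl() and its siblings decide, and the compile-time size check is an extra safeguard added here rather than something the BLIS sources are known to contain.

```c
#include <stdbool.h>

typedef enum {
    DEMO_ARCH_KNL,
    DEMO_ARCH_HASWELL,
    DEMO_ARCH_SANDYBRIDGE,
    DEMO_ARCH_GENERIC,
    DEMO_NUM_ARCHS             /* always last, so it counts the entries above */
} demo_arch_t;

/* One string per enum value, in exactly the same order. */
static const char *const demo_config_name[] = {
    "knl", "haswell", "sandybridge", "generic"
};

/* C11: refuse to compile if a value was added without a matching string. */
_Static_assert(sizeof demo_config_name / sizeof demo_config_name[0]
                   == DEMO_NUM_ARCHS,
               "demo_config_name needs one string per demo_arch_t value");

const char *demo_arch_string(demo_arch_t id)
{
    return demo_config_name[id];
}

/* Hypothetical feature summary; real detection inspects CPUID results. */
typedef struct { bool avx; bool avx2; bool avx512f; } demo_features;

demo_arch_t demo_query_id(demo_features f)
{
    /* Newest first: hardware with AVX-512 also reports AVX2 and AVX, so
       checking in the opposite order would select an older configuration. */
    if (f.avx512f) return DEMO_ARCH_KNL;
    if (f.avx2)    return DEMO_ARCH_HASWELL;
    if (f.avx)     return DEMO_ARCH_SANDYBRIDGE;
    return DEMO_ARCH_GENERIC;
}
```

The size check catches a missing string but not two strings swapped, so the ordering still has to be verified by eye.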
If you are ever unsure which configuration is "active", or the configuration parameters that were specified (or implied by default) at configure-time, simply run:
$ make showconfig
configuration family: intel64
sub-configurations: haswell sandybridge penryn
requisite kernels: haswell sandybridge penryn
kernel-to-config map: haswell:haswell penryn:penryn sandybridge:sandybridge
-----------------------
BLIS version string: 0.2.2-73
install prefix: /home/field/blis
debugging status: off
multithreading status: no
enable BLAS API? yes
enable CBLAS API? no
build static library? yes
build shared library? no
This will tell you the current configuration name, the configuration registry lists, as well as other information stored by configure
in the config.mk
file.
Due to the way the BLIS framework handles header files, any change to any header file will result in the entire library being rebuilt. This policy is in place mostly out of an abundance of caution. If two or more files use definitions in a header that is modified, and one or more of those files somehow does not get recompiled to reflect the updated definitions, you could end up sinking hours of time trying to track down a bug that didn't ever need to be an issue to begin with. Thus, to prevent developers (including the framework developer(s)) from shooting themselves in the foot with this problem, the BLIS build system recompiles all object files if any header file is touched. We apologize for the inconvenience this may cause.
If you have further questions about BLIS configurations, please do not hesitate to contact the BLIS developer community. To do so, simply join and post to the blis-devel mailing list.