
Kronecker Products #1454

Open
mike-lawrence opened this issue Oct 10, 2024 · 9 comments

Labels
feature New feature or request

@mike-lawrence commented Oct 10, 2024

Intro

I've been playing around with different approaches to efficiently computing Kronecker products, with particular interest in the cases where both input matrices are either covariance matrices or correlation matrices (common in multi-output Gaussian Processes). These special cases have symmetries that might be leveraged to achieve more efficient compute.

Implementations

Below are "user defined function" implementations of kprod() (arbitrary square inputs), kprod_cov() (covariance inputs), and kprod_corr() (correlation inputs). After looking at the Kronecker product functions currently in Eigen (in an unsupported module here), I also wrote *_blocked versions of each; the idea, as I gather it, is that blocked computation achieves better memory locality.

Here's the collection of functions:

matrix kprod(matrix A, matrix B){
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// Fill Z entry-by-entry: Z[(i-1)*n_B + p, (j-1)*n_B + q] = A[i, j] * B[p, q]
	for (i in 1:n_A) {
		for (j in 1:n_A) {
			for (p in 1:n_B) {
				for (q in 1:n_B) {
					int row = (i - 1) * n_B + p ;
					int col = (j - 1) * n_B + q ;
					Z[row, col] = A[i, j] * B[p, q] ;
				}
			}
		}
	}
	return(Z) ;
}
matrix kprod_cov(matrix A, matrix B) {
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// Loop over the upper triangular part of Z (includes diagonal)
	for (row in 1:n_Z) {
		int i = ((row - 1) %/% n_B) + 1 ;
		int p = ((row - 1) % n_B) + 1 ;
		for (col in row:n_Z) {
			int j = ((col - 1) %/% n_B) + 1 ;
			int q = ((col - 1) % n_B) + 1 ;
			// Compute the Kronecker product element
			Z[row, col] = A[i, j] * B[p, q] ;
			// Fill the symmetric counterpart
			// Note: this is redundant on the diagonal, but unsure it's worth a wholly separate loop to copy just the off-diagonal
			Z[col, row] = Z[row, col] ;
		}
	}
	return(Z) ;
}
matrix kprod_corr(matrix A, matrix B) {
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// Set diagonal elements to 1
	for (row in 1:n_Z) {
		Z[row, row] = 1.0 ;
	}
	// Loop over the upper triangular off-diagonal elements
	for (row in 1:(n_Z - 1)) {
		int i = ((row - 1) %/% n_B) + 1 ;
		int p = ((row - 1) % n_B) + 1 ;
		for (col in (row + 1):n_Z) {
			int j = ((col - 1) %/% n_B) + 1 ;
			int q = ((col - 1) % n_B) + 1 ;
			// Compute the product of corresponding elements
			Z[row, col] = A[i, j] * B[p, q] ;
			// Fill the symmetric counterpart
			Z[col, row] = Z[row, col] ;
		}
	}
	return(Z) ;
}
matrix kprod_blocked(matrix A, matrix B) {
	int r_A = rows(A) ;
	int c_A = cols(A) ;
	int r_B = rows(B) ;
	int c_B = cols(B) ;
	matrix[r_A * r_B, c_A * c_B] Z ;
	for (i in 1:r_A) {
		for (j in 1:c_A) {
			// Compute the starting row and column indices in C for the current block
			int row_start = (i - 1) * r_B + 1 ;
			int col_start = (j - 1) * c_B + 1 ;
			// Assign the block to the appropriate position in C
			Z[row_start:(row_start + r_B - 1), col_start:(col_start + c_B - 1)] = A[i, j] * B ;
		}
	}
	return Z ;
}
matrix kprod_cov_blocked(matrix A, matrix B){
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// First loop: Handle diagonal blocks (i == j)
	for (i in 1:n_A) {
		int row_start = (i - 1) * n_B + 1 ;
		int col_start = (i - 1) * n_B + 1 ;
		Z[row_start:(row_start + n_B - 1), col_start:(col_start + n_B - 1)] = A[i, i] * B ;
	}
	// Second loop: Handle off-diagonal blocks (i < j)
	for (i in 1:(n_A - 1)) {
		for (j in (i + 1):n_A) {
			matrix[n_B, n_B] block = A[i, j] * B ;
			int row_i = (i - 1) * n_B + 1 ;
			int col_j = (j - 1) * n_B + 1 ;
			int row_j = (j - 1) * n_B + 1 ;
			int col_i = (i - 1) * n_B + 1 ;
			// Assign block to position (i, j)
			Z[row_i:(row_i + n_B - 1), col_j:(col_j + n_B - 1)] = block ;
			// Assign symmetric block to position (j, i)
			Z[row_j:(row_j + n_B - 1), col_i:(col_i + n_B - 1)] = block ;
		}
	}
	return Z ;
}
matrix kprod_corr_blocked(matrix A, matrix B){
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// First loop: Handle diagonal blocks (i == j)
	for (i in 1:n_A) {
		int row_start = (i - 1) * n_B + 1 ;
		int col_start = (i - 1) * n_B + 1 ;
		Z[row_start:(row_start + n_B - 1), col_start:(col_start + n_B - 1)] = B ;
	}
	// Second loop: Handle off-diagonal blocks (i < j)
	for (i in 1:(n_A - 1)) {
		for (j in (i + 1):n_A) {
			matrix[n_B, n_B] block = A[i, j] * B ;
			int row_i = (i - 1) * n_B + 1 ;
			int col_j = (j - 1) * n_B + 1 ;
			int row_j = (j - 1) * n_B + 1 ;
			int col_i = (i - 1) * n_B + 1 ;
			// Assign block to position (i, j)
			Z[row_i:(row_i + n_B - 1), col_j:(col_j + n_B - 1)] = block ;
			// Assign symmetric block to position (j, i)
			Z[row_j:(row_j + n_B - 1), col_i:(col_i + n_B - 1)] = block ;
		}
	}
	return Z ;
}

Benchmarks

I've attempted to benchmark these implementations across a variety of input sizes and with both default model compilation options:

stanc_options = list()
cpp_options = list()

as well as "fast" model compilation options:

stanc_options = list('O1')
cpp_options = list(
	stan_threads=FALSE
	, STAN_CPP_OPTIMS=TRUE
	, STAN_NO_RANGE_CHECKS=TRUE
	, CXXFLAGS_OPTIM = "-O3 -march=native -mtune=native"
)

Using the unblocked kprod() performance as baseline, here's some relative-performance data (note: I'm still computing more values and more samples for the larger sizes and will update this plot as things finish; at present there are 20 samples for each point in the 2-65 range as well as 127-129):
[kronecker benchmark plot: performance of each implementation relative to kprod() across input sizes]

Benchmark take-aways

  • So far there seems to be little benefit to the correlation-input functions over the covariance-input functions, especially with the "fast" compilation options.
  • The non-blocked kprod_cov performs better than kprod at the smallest input sizes, but eventually falls back to equal performance at larger input sizes.
  • The blocked kprod_cov_blocked performs worse than the non-blocked kprod at the smallest input sizes, but reaches parity by moderate input sizes, and its relative advantage grows monotonically (if asymptotically) thereafter, yielding large benefits at large input sizes.
  • kprod_blocked has a similar relative-performance trajectory to kprod_cov_blocked, but the latter still seems to hold an edge, especially under the default compilation options.
  • There are spikes of sudden performance change (relative to kprod) in many of the kprod alternatives at N=64 and N=128 (note: N=63, 65, 127 & 129 are also present in the graph, showing that the spikes occur at 64 and 128 specifically). I'm not sure what to make of these, nor why the blocked functions spike to higher relative performance at those values while the non-blocked functions spike lower.

Questions

I welcome any thoughts on this. Are any of these benchmarks even pertinent, or would implementation in Stan directly have performance implications that make this benchmarking of user-defined functions irrelevant? Also, given the minor performance bump of kprod_cov_blocked over kprod_blocked seen here, maybe it makes sense to start by simply using the existing-but-unsupported Eigen implementation, which is akin to kprod_blocked?

mike-lawrence added the feature label Oct 10, 2024
@WardBrian (Member)

The FFT functions we use from Eigen are also in their unsupported module, so that seems like an okay place to start

@mike-lawrence (Author)

To integrate the Kronecker Product from Eigen, should I start by emulating the changes to stan-dev/math when the FFT stuff was added?

@bob-carpenter (Contributor)

> Are any of these benchmarks even pertinent, or would implementation in Stan directly have performance implications that make this benchmarking of user-defined functions irrelevant?

We probably don't want to explicitly construct a Kronecker product. Eigen uses the obvious expression template implementation in C++ where you just hold a reference to the argument matrices and then lazily evaluate entries as needed. For larger scale problems, we want to be able to do things like Cholesky factorization efficiently on Kronecker products. It's a rabbit hole, for sure.

@WardBrian and @SteveBronder are probably the best people to get involved.

@mike-lawrence (Author)

@bob-carpenter can you clarify: @WardBrian suggested using Eigen's KroneckerProduct(); are you saying you agree or disagree with this?

@bob-carpenter (Contributor)

Sure. The key idea is that Eigen's Kronecker product implementation is an expression template. See:

https://en.wikipedia.org/wiki/Expression_templates

All it does when you call b ⊗ c is create a C++ object that stores a reference (pointer) to b and a reference to c. To evaluate the value at (b ⊗ c)(m, n), it's easier to think the other way around, with

(b ⊗ c)(p * r + v, q * s + w) = b(r, s) * c(v, w),

where c is of size p x q and all indices are zero-based.

The modular arithmetic to go the other way is at the top of the Wikipedia page: for entry (m, n), take r = m div p, v = m mod p, s = n div q, and w = n mod q (integer division).

https://en.wikipedia.org/wiki/Kronecker_product

Ideally, we will never explicitly construct b ⊗ c, because if b is size m x n and c is size p x q, then b ⊗ c is of size (m * p) x (n * q), which quickly gets out of hand (e.g., two 10 x 10 matrices produce a 100 x 100 Kronecker product with 10,000 entries).
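
To make the laziness concrete, here's a minimal C++ sketch (illustrative only, not stan-math code) using Eigen's unsupported module; kroneckerProduct() and the coeff() element accessor are the actual Eigen API, the rest is just for demonstration:

#include <iostream>
#include <Eigen/Dense>
#include <unsupported/Eigen/KroneckerProduct>

int main() {
	Eigen::MatrixXd b(2, 2), c(2, 2);
	b << 1, 2,
	     3, 4;
	c << 0, 1,
	     1, 0;

	// No 4 x 4 allocation happens here: kp is an expression template
	// that just stores references to b and c.
	auto kp = Eigen::kroneckerProduct(b, c);

	// Single entries can be evaluated on demand:
	// (b ⊗ c)(p * r + v, q * s + w) = b(r, s) * c(v, w)
	std::cout << kp.coeff(1, 2) << "\n";  // b(0, 1) * c(1, 0) = 2

	// Assignment to a concrete matrix forces full evaluation
	// (allocation plus a copy of every entry).
	Eigen::MatrixXd z = kp;
	std::cout << z.rows() << " x " << z.cols() << "\n";  // 4 x 4
}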

The real trick is to make sure we can do things like Cholesky factorization efficiently, because the Cholesky factor of a Kronecker product is the Kronecker product of the Cholesky factors.
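
That identity is cheap to sanity-check numerically; here's a small sketch (arbitrary SPD test inputs, not stan-math code):

#include <iostream>
#include <Eigen/Dense>
#include <unsupported/Eigen/KroneckerProduct>

int main() {
	Eigen::MatrixXd a(2, 2), b(2, 2);
	a << 4.0, 1.0,
	     1.0, 3.0;
	b << 2.0, 0.5,
	     0.5, 1.0;

	// Cholesky-factor the small inputs (O(n_A^3 + n_B^3) work) ...
	Eigen::MatrixXd la = a.llt().matrixL();
	Eigen::MatrixXd lb = b.llt().matrixL();
	Eigen::MatrixXd l_small = Eigen::kroneckerProduct(la, lb);

	// ... and compare against factoring the full Kronecker product
	// (O((n_A * n_B)^3) work).
	Eigen::MatrixXd ab = Eigen::kroneckerProduct(a, b);
	Eigen::MatrixXd l_big = ab.llt().matrixL();

	std::cout << (l_small - l_big).norm() << "\n";  // ~0 up to rounding
}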

@WardBrian (Member)

Assigning to a variable in a Stan program will always evaluate the expression, so using these will be very difficult to do efficiently if just implemented as a normal function. A working but difficult to use implementation might be a good first step in thinking about a better one, though

@bob-carpenter (Contributor)

> Assigning to a variable in a Stan program will always evaluate the expression

Good point. We can have function returns that don't collapse template expressions, but if you assign the Kronecker product to a Stan variable, it'll be a problem in that it'll do the allocation and copy. I think that's true even if the variable being assigned to is a function argument.

Everything's going to have to remain unevaluated (in the no calls to .eval() sense) to keep up efficiency. But even then, chaining with something like multivariate normal will be a problem, as in multi_normal(mu, a ⊗ b) will have to be set up to make sure we can accept a template parameter as an argument and then do the right thing when we Cholesky factor internally.
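
To illustrate that distinction in plain Eigen terms (hypothetical helper functions, not Stan or stan-math code):

#include <iostream>
#include <Eigen/Dense>
#include <unsupported/Eigen/KroneckerProduct>

// Eager: binding to a concrete MatrixXd forces the expression to be
// materialized into a temporary at the call boundary (allocation plus
// a copy of every entry).
double first_entry_eager(const Eigen::MatrixXd& m) { return m(0, 0); }

// Lazy: a generic template accepts the unevaluated expression type
// directly, so crossing the function boundary allocates nothing.
template <typename T>
double first_entry_lazy(const T& m) { return m.coeff(0, 0); }

int main() {
	Eigen::MatrixXd a = Eigen::MatrixXd::Identity(3, 3);
	Eigen::MatrixXd b = Eigen::MatrixXd::Constant(3, 3, 2.0);
	std::cout << first_entry_eager(Eigen::kroneckerProduct(a, b)) << "\n";  // 2
	std::cout << first_entry_lazy(Eigen::kroneckerProduct(a, b)) << "\n";   // 2
}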

@WardBrian (Member)

> I think that's true even if the variable being assigned to is a function argument.

This is a bit more complicated. Built-in functions can accept unevaluated templates and keep them lazy. User-defined functions are code generated such that they can accept these, but we immediately call to_ref on them, which will evaluate them.

@bob-carpenter (Contributor)

Thanks. I should have clarified that I meant user-defined functions. I hadn't realized we'd generalized the argument templates---but I guess that doesn't matter if we immediately evaluate!
