
Kronecker Products #1454

Open
mike-lawrence opened this issue Oct 10, 2024 · 9 comments

Labels
feature New feature or request

@mike-lawrence commented Oct 10, 2024

Intro

I've been playing around with different approaches to efficiently computing Kronecker products, with particular interest in the cases where both input matrices are either covariance matrices or correlation matrices (common in multi-output Gaussian Processes). These special cases have symmetries that might be leveraged to achieve more efficient compute.

Implementations

Below are "user defined function" implementations of kprod() (arbitrary square inputs), kprod_cov() (covariance inputs), and kprod_corr() (correlation inputs). After looking at the Kronecker product functions currently in Eigen (in an unsupported module here), I also wrote *_blocked versions of each; the idea, as I gather it, is that blocked computation achieves better memory locality.

Here's the collection of functions:

matrix kprod(matrix A, matrix B){
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// Fill Z entry-by-entry: Z[(i-1)*n_B + p, (j-1)*n_B + q] = A[i, j] * B[p, q]
	for (i in 1:n_A) {
		for (j in 1:n_A) {
			for (p in 1:n_B) {
				for (q in 1:n_B) {
					int row = (i - 1) * n_B + p ;
					int col = (j - 1) * n_B + q ;
					Z[row, col] = A[i, j] * B[p, q] ;
				}
			}
		}
	}
	return(Z) ;
}
matrix kprod_cov(matrix A, matrix B) {
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// Loop over the upper triangular part of Z (includes diagonal)
	for (row in 1:n_Z) {
		int i = ((row - 1) %/% n_B) + 1 ;
		int p = ((row - 1) % n_B) + 1 ;
		for (col in row:n_Z) {
			int j = ((col - 1) %/% n_B) + 1 ;
			int q = ((col - 1) % n_B) + 1 ;
			// Compute the Kronecker product element
			Z[row, col] = A[i, j] * B[p, q] ;
			// Fill the symmetric counterpart
			// Note: this is redundant on the diagonal, but unsure it's worth a wholly separate loop to copy just the off-diagonal
			Z[col, row] = Z[row, col] ;
		}
	}
	return(Z) ;
}
matrix kprod_corr(matrix A, matrix B) {
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// Set diagonal elements to 1
	for (row in 1:n_Z) {
		Z[row, row] = 1.0 ;
	}
	// Loop over the upper triangular off-diagonal elements
	for (row in 1:(n_Z - 1)) {
		int i = ((row - 1) %/% n_B) + 1 ;
		int p = ((row - 1) % n_B) + 1 ;
		for (col in (row + 1):n_Z) {
			int j = ((col - 1) %/% n_B) + 1 ;
			int q = ((col - 1) % n_B) + 1 ;
			// Compute the product of corresponding elements
			Z[row, col] = A[i, j] * B[p, q] ;
			// Fill the symmetric counterpart
			Z[col, row] = Z[row, col] ;
		}
	}
	return(Z) ;
}
matrix kprod_blocked(matrix A, matrix B) {
	int r_A = rows(A) ;
	int c_A = cols(A) ;
	int r_B = rows(B) ;
	int c_B = cols(B) ;
	matrix[r_A * r_B, c_A * c_B] Z ;
	for (i in 1:r_A) {
		for (j in 1:c_A) {
			// Compute the starting row and column indices in C for the current block
			int row_start = (i - 1) * r_B + 1 ;
			int col_start = (j - 1) * c_B + 1 ;
			// Assign the block to the appropriate position in C
			Z[row_start:(row_start + r_B - 1), col_start:(col_start + c_B - 1)] = A[i, j] * B ;
		}
	}
	return Z ;
}
matrix kprod_cov_blocked(matrix A, matrix B){
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// First loop: Handle diagonal blocks (i == j)
	for (i in 1:n_A) {
		int row_start = (i - 1) * n_B + 1 ;
		int col_start = (i - 1) * n_B + 1 ;
		Z[row_start:(row_start + n_B - 1), col_start:(col_start + n_B - 1)] = A[i, i] * B ;
	}
	// Second loop: Handle off-diagonal blocks (i < j)
	for (i in 1:(n_A - 1)) {
		for (j in (i + 1):n_A) {
			matrix[n_B, n_B] block = A[i, j] * B ;
			int row_i = (i - 1) * n_B + 1 ;
			int col_j = (j - 1) * n_B + 1 ;
			int row_j = (j - 1) * n_B + 1 ;
			int col_i = (i - 1) * n_B + 1 ;
			// Assign block to position (i, j)
			Z[row_i:(row_i + n_B - 1), col_j:(col_j + n_B - 1)] = block ;
			// Assign symmetric block to position (j, i)
			Z[row_j:(row_j + n_B - 1), col_i:(col_i + n_B - 1)] = block ;
		}
	}
	return Z ;
}
matrix kprod_corr_blocked(matrix A, matrix B){
	int n_A = rows(A) ;
	int n_B = rows(B) ;
	int n_Z = n_A * n_B ;
	matrix[n_Z, n_Z] Z ;
	// First loop: Handle diagonal blocks (i == j)
	for (i in 1:n_A) {
		int row_start = (i - 1) * n_B + 1 ;
		int col_start = (i - 1) * n_B + 1 ;
		Z[row_start:(row_start + n_B - 1), col_start:(col_start + n_B - 1)] = B ;
	}
	// Second loop: Handle off-diagonal blocks (i < j)
	for (i in 1:(n_A - 1)) {
		for (j in (i + 1):n_A) {
			matrix[n_B, n_B] block = A[i, j] * B ;
			int row_i = (i - 1) * n_B + 1 ;
			int col_j = (j - 1) * n_B + 1 ;
			int row_j = (j - 1) * n_B + 1 ;
			int col_i = (i - 1) * n_B + 1 ;
			// Assign block to position (i, j)
			Z[row_i:(row_i + n_B - 1), col_j:(col_j + n_B - 1)] = block ;
			// Assign symmetric block to position (j, i)
			Z[row_j:(row_j + n_B - 1), col_i:(col_i + n_B - 1)] = block ;
		}
	}
	return Z ;
}

Benchmarks

I've attempted to benchmark these implementations across a variety of input sizes and with both default model compilation options:

stanc_options = list()
cpp_options = list()

as well as "fast" model compilation options:

stanc_options = list('O1')
cpp_options = list(
	stan_threads=FALSE
	, STAN_CPP_OPTIMS=TRUE
	, STAN_NO_RANGE_CHECKS=TRUE
	, CXXFLAGS_OPTIM = "-O3 -march=native -mtune=native"
)

Using the unblocked kprod() performance as baseline, here's some relative-performance data (note: I'm still computing more values and more samples for the larger sizes and will update this plot as things finish; at present there are 20 samples for each point in the 2-65 range as well as 127-129):
[kronecker benchmark plot: performance of each implementation relative to kprod() across input sizes]

Benchmark take-aways

  • So far there seems to be little benefit to the correlation-input functions over the covariance-input functions, especially with the "fast" compilation options.
  • The non-blocked kprod_cov performs better than kprod at the smallest input sizes, but eventually falls back to equal performance at larger input sizes.
  • The blocked kprod_cov_blocked performs worse than the non-blocked kprod at the smallest input sizes, but reaches parity by moderate input sizes, and its relative advantage grows monotonically (if asymptotically) thereafter, yielding large benefits at large input sizes.
  • kprod_blocked has a similar relative-performance trajectory to kprod_cov_blocked, but the latter still seems to hold an edge, especially under the default compilation options.
  • There are spikes of sudden performance change (relative to kprod) in many of the kprod alternatives at N=64 and N=128 (note: N=63, 65, 127 & 129 are also present in the graph, showing that the spikes occur at 64 and 128 specifically). I'm not sure what to make of these, nor why the blocked functions spike to higher relative performance at those values while the non-blocked functions spike lower.

Questions

I welcome any thoughts on this. Are any of these benchmarks even pertinent, or would implementation in Stan directly have performance implications that make this benchmarking of user-defined functions irrelevant? Also, given the minor performance bump of kprod_cov_blocked over kprod_blocked seen here, maybe it makes sense to start by simply using the existing-but-unsupported Eigen implementation, which is akin to kprod_blocked?

mike-lawrence added the feature label Oct 10, 2024
@WardBrian (Member)

The FFT functions we use from Eigen are also in their unsupported module, so that seems like an okay place to start

@mike-lawrence (Author)

To integrate the Kronecker Product from Eigen, should I start by emulating the changes to stan-dev/math when the FFT stuff was added?

@bob-carpenter (Contributor)

> Are any of these benchmarks even pertinent, or would implementation in Stan directly have performance implications that make this benchmarking of user-defined functions irrelevant?

We probably don't want to explicitly construct a Kronecker product. Eigen uses the obvious expression template implementation in C++ where you just hold a reference to the argument matrices and then lazily evaluate entries as needed. For larger scale problems, we want to be able to do things like Cholesky factorization efficiently on Kronecker products. It's a rabbit hole, for sure.

@WardBrian and @SteveBronder are probably the best people to get involved.

@mike-lawrence (Author)

@bob-carpenter can you clarify: @WardBrian suggested using Eigen's KroneckerProduct(); are you saying you agree or disagree with this?

@bob-carpenter (Contributor)

Sure. The key idea is that Eigen's Kronecker product implementation is an expression template. See:

https://en.wikipedia.org/wiki/Expression_templates

All it does when you call b ⊗ c is create a C++ object that stores a reference (pointer) to b and a reference to c. To evaluate the value at (b ⊗ c)(m, n), it's easier to think the other way around, with

(b ⊗ c)(p * r + v, q * s + w) = b(r, s) * c(v, w),

where c is of size p x q and all indices are zero-based.

The modular arithmetic to go the other way is at the top of the Wikipedia page: for entry (m, n), take r = m div p, v = m mod p, s = n div q, and w = n mod q (integer division).

https://en.wikipedia.org/wiki/Kronecker_product

Ideally, we will never explicitly construct b ⊗ c, because if b is size m x n and c is size p x q, then b ⊗ c is of size (m * p) x (n * q), which quickly gets out of hand (e.g., two 10 x 10 matrices produce a 100 x 100 Kronecker product with 10,000 entries).
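
To make the laziness concrete, here's a minimal C++ sketch (illustrative only, not stan-math code) using Eigen's unsupported module; kroneckerProduct() and the coeff() element accessor are the actual Eigen API, the rest is just for demonstration:

#include <iostream>
#include <Eigen/Dense>
#include <unsupported/Eigen/KroneckerProduct>

int main() {
	Eigen::MatrixXd b(2, 2), c(2, 2);
	b << 1, 2,
	     3, 4;
	c << 0, 1,
	     1, 0;

	// No 4 x 4 allocation happens here: kp is an expression template
	// that just stores references to b and c.
	auto kp = Eigen::kroneckerProduct(b, c);

	// Single entries can be evaluated on demand:
	// (b ⊗ c)(p * r + v, q * s + w) = b(r, s) * c(v, w)
	std::cout << kp.coeff(1, 2) << "\n";  // b(0, 1) * c(1, 0) = 2

	// Assignment to a concrete matrix forces full evaluation
	// (allocation plus a copy of every entry).
	Eigen::MatrixXd z = kp;
	std::cout << z.rows() << " x " << z.cols() << "\n";  // 4 x 4
}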

The real trick is to make sure we can do things like Cholesky factorization efficiently, because the Cholesky factor of a Kronecker product is the Kronecker product of the Cholesky factors.
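
That identity is cheap to sanity-check numerically; here's a small sketch (arbitrary SPD test inputs, not stan-math code):

#include <iostream>
#include <Eigen/Dense>
#include <unsupported/Eigen/KroneckerProduct>

int main() {
	Eigen::MatrixXd a(2, 2), b(2, 2);
	a << 4.0, 1.0,
	     1.0, 3.0;
	b << 2.0, 0.5,
	     0.5, 1.0;

	// Cholesky-factor the small inputs (O(n_A^3 + n_B^3) work) ...
	Eigen::MatrixXd la = a.llt().matrixL();
	Eigen::MatrixXd lb = b.llt().matrixL();
	Eigen::MatrixXd l_small = Eigen::kroneckerProduct(la, lb);

	// ... and compare against factoring the full Kronecker product
	// (O((n_A * n_B)^3) work).
	Eigen::MatrixXd ab = Eigen::kroneckerProduct(a, b);
	Eigen::MatrixXd l_big = ab.llt().matrixL();

	std::cout << (l_small - l_big).norm() << "\n";  // ~0 up to rounding
}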

@WardBrian (Member)

Assigning to a variable in a Stan program will always evaluate the expression, so using these will be very difficult to do efficiently if just implemented as a normal function. A working but difficult to use implementation might be a good first step in thinking about a better one, though

@bob-carpenter (Contributor)

> Assigning to a variable in a Stan program will always evaluate the expression

Good point. We can have function returns that don't collapse template expressions, but if you assign the Kronecker product to a Stan variable, it'll be a problem in that it'll do the allocation and copy. I think that's true even if the variable being assigned to is a function argument.

Everything's going to have to remain unevaluated (in the no calls to .eval() sense) to keep up efficiency. But even then, chaining with something like multivariate normal will be a problem, as in multi_normal(mu, a ⊗ b) will have to be set up to make sure we can accept a template parameter as an argument and then do the right thing when we Cholesky factor internally.
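
To illustrate that distinction in plain Eigen terms (hypothetical helper functions, not Stan or stan-math code):

#include <iostream>
#include <Eigen/Dense>
#include <unsupported/Eigen/KroneckerProduct>

// Eager: binding to a concrete MatrixXd forces the expression to be
// materialized into a temporary at the call boundary (allocation plus
// a copy of every entry).
double first_entry_eager(const Eigen::MatrixXd& m) { return m(0, 0); }

// Lazy: a generic template accepts the unevaluated expression type
// directly, so crossing the function boundary allocates nothing.
template <typename T>
double first_entry_lazy(const T& m) { return m.coeff(0, 0); }

int main() {
	Eigen::MatrixXd a = Eigen::MatrixXd::Identity(3, 3);
	Eigen::MatrixXd b = Eigen::MatrixXd::Constant(3, 3, 2.0);
	std::cout << first_entry_eager(Eigen::kroneckerProduct(a, b)) << "\n";  // 2
	std::cout << first_entry_lazy(Eigen::kroneckerProduct(a, b)) << "\n";   // 2
}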

@WardBrian (Member)

> I think that's true even if the variable being assigned to is a function argument.

This is a bit more complicated. Built-in functions can accept unevaluated templates and keep them lazy. User-defined functions are code generated such that they can accept these, but we immediately call to_ref on them, which will evaluate them.

@bob-carpenter (Contributor)

Thanks. I should have clarified that I meant user-defined functions. I hadn't realized we'd generalized the argument templates---but I guess that doesn't matter if we immediately evaluate!
