HAL server failing with large kernels #514

ikabadzhov · 2024-08-05T13:50:21Z

Version

latest master / 4.0.0

What behaviour are you expecting?

I was reproducing the client-server setup via the HAL server, as described in https://github.com/codeplaysoftware/oneapi-construction-kit/tree/main/examples/hal_cpu_remote_server, and noticed that my large kernels fail on both of my RISC-V devices. I am sure both devices have sufficient memory; in fact, the allocation itself succeeds as expected.

What actual behaviour are you seeing?

On the local client I see the following (the first run, with 1<<25 elements, behaves as expected; the second, with 1<<26 elements, fails):

$ HAL_REMOTE_PORT=5906 ./test $((1<<25))
Running on ock cpu
Allocated 128 MB

$ HAL_REMOTE_PORT=5906 ./test $((1<<26))
Running on ock cpu
Allocated 256 MB
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
Aborted (core dumped)

On the RISC-V server, I get a segmentation fault shortly afterwards: Segmentation fault (core dumped). After that, attempting to restart the server on the same port fails with: Unable to start server on requested port 5906, node 127.0.0.1.

By contrast, an empty kernel, or no kernel at all, works fine.
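To narrow down the restart failure reported above, it may help to check whether port 5906 can still be bound on the server machine at all. The sketch below is a hypothetical diagnostic, not part of the HAL server: the helper name `can_bind_loopback` is made up for illustration, and it assumes a POSIX system and that the server binds a plain TCP socket on 127.0.0.1. Whether the port is really still held (for example by a lingering process) is an assumption this report does not confirm.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>

// Hypothetical diagnostic helper (not part of the HAL server):
// returns true if a TCP socket can currently be bound to 127.0.0.1:<port>.
// Passing port 0 asks the OS for any free ephemeral port.
bool can_bind_loopback(std::uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return false;  // could not even create a socket
  }

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

  // bind() fails (typically with EADDRINUSE) while something else holds the port.
  bool ok = bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == 0;
  close(fd);
  return ok;
}
```

Calling `can_bind_loopback(5906)` on the server right after the crash would show whether the port is free again; if it returns false, something on the machine is still holding the port.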

What steps are required to reproduce the bug?

To reproduce, on the client side:

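#include <iostream>
#include <string>

#include <sycl/sycl.hpp>

int main(int argc, char **argv) {
  // Number of float elements (not bytes); can be overridden on the command line.
  unsigned long long len = 1ULL << 28;
  if (argc > 1) {
    len = std::stoull(argv[1]);
  }

  // Select an accelerator device (reported as "ock cpu" in the runs above).
  sycl::queue queue(sycl::accelerator_selector_v);
  std::cout << "Running on " << queue.get_device().get_info<sycl::info::device::name>() << std::endl;

  // The device allocation succeeds, including for the failing size.
  float *d_a = sycl::malloc_device<float>(len, queue);
  queue.wait();
  std::cout << "Allocated " << len * sizeof(float) / 1024 / 1024 << " MB" << std::endl;

  // One work-item per element; this launch is what fails for len = 1<<26.
  queue.parallel_for(sycl::range<1>(len), [=](sycl::id<1> idx) {
    d_a[idx] = idx;
  }).wait();
  return 0;
}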

On the server, simply listen on a port as usual.
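Note that the command-line argument of the reproducer is an element count, not a byte count: it allocates `len` floats, so 1<<25 elements correspond to the 128 MB run that worked and 1<<26 elements to the 256 MB run that failed. A minimal sketch of that arithmetic, using a hypothetical helper `alloc_mib` (not part of the reproducer) and assuming a 4-byte `float`:

```cpp
#include <cstdint>

// Hypothetical helper mirroring the "Allocated ... MB" line of the reproducer:
// size in MiB of a device buffer holding `len` floats.
constexpr std::uint64_t alloc_mib(std::uint64_t len) {
  return len * sizeof(float) / 1024 / 1024;
}

static_assert(sizeof(float) == 4, "sketch assumes a 4-byte float");
static_assert(alloc_mib(1ULL << 25) == 128, "size of the run that worked");
static_assert(alloc_mib(1ULL << 26) == 256, "size of the run that failed");
```

The failing launch therefore involves 2^26 work-items writing to a 2^28-byte buffer, which may be useful when looking for a size limit on the server side.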

Minimal test case

No response

Anything else we should know?

No response

ikabadzhov added the bug Something isn't working label Aug 5, 2024
