Add device mask checks for event commands #8484

Open
wants to merge 2 commits into
base: main

@r-potter (Contributor) commented Sep 2, 2024

Check that the command buffer's current device mask has exactly one bit set
when calling vkCmd*Event commands.

Implemented VUIDs:

* VUID-vkCmdSetEvent-commandBuffer-01152
* VUID-vkCmdSetEvent2-commandBuffer-03826
* VUID-vkCmdWaitEvents-commandBuffer-01167
* VUID-vkCmdWaitEvents2-commandBuffer-03846
* VUID-vkCmdResetEvent-commandBuffer-01157
* VUID-vkCmdResetEvent2-commandBuffer-03833
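
For illustration, the condition these VUIDs describe can be sketched as a single-bit test on the device mask. This is a minimal standalone sketch, not the code from this PR; the function name `HasExactlyOneDevice` is hypothetical and the mask is assumed to be a plain 32-bit value.

```cpp
#include <cstdint>

// Hypothetical helper (not from the PR): returns true when exactly one bit of
// `device_mask` is set, i.e. the mask names exactly one physical device.
constexpr bool HasExactlyOneDevice(uint32_t device_mask) {
    // Clearing the lowest set bit of a value with a single set bit leaves zero.
    // Zero itself has no bits set, so it is rejected explicitly.
    return device_mask != 0 && (device_mask & (device_mask - 1)) == 0;
}

static_assert(HasExactlyOneDevice(0x1), "one device selected");
static_assert(!HasExactlyOneDevice(0x3), "two devices selected");
```

A mask such as 0x3 selects two devices and would fail this test, which is the case the negative test is meant to exercise.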
@r-potter r-potter requested a review from a team as a code owner September 2, 2024 15:49
@ci-tester-lunarg (Collaborator)

CI Vulkan-ValidationLayers build queued with queue ID 247687.

@ci-tester-lunarg (Collaborator)

CI Vulkan-ValidationLayers build # 17353 running.

@ci-tester-lunarg (Collaborator)

CI Vulkan-ValidationLayers build # 17353 failed.

vk::EndCommandBuffer(m_command_buffer.handle());
}

TEST_F(PositiveSyncVal, EventCmds2ValidDeviceMask) {
Contributor

this is failing CI with errors like

[ VUID-VkDeviceGroupCommandBufferBeginInfo-deviceMask-00106 ] Object 0: handle = 0x29b6b553f30, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x274fbde | vkBeginCommandBuffer(): pBeginInfo->pNext.deviceMask (0x3) is invalid, Physical device count is 1. The Vulkan spec states: deviceMask must be a valid device mask value
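
To make the message above concrete: a mask of 0x3 names devices 0 and 1, which cannot be valid when the logical device was created with a single physical device. The following is a rough sketch of that relationship under stated assumptions, not the validation layer's actual implementation; `AllDevicesMask` and `IsValidDeviceMask` are hypothetical names, and the sketch also folds in the non-zero requirement for simplicity.

```cpp
#include <cstdint>

// Hypothetical helper: mask with one bit set per physical device in the
// logical device. The shift is guarded because shifting a 32-bit value by
// 32 or more is undefined behavior.
constexpr uint32_t AllDevicesMask(uint32_t physical_device_count) {
    return physical_device_count >= 32 ? 0xFFFFFFFFu
                                       : ((1u << physical_device_count) - 1u);
}

// Hypothetical helper: a mask is treated as valid here when it is non-zero
// and uses no bit at or above physical_device_count.
constexpr bool IsValidDeviceMask(uint32_t device_mask, uint32_t physical_device_count) {
    return device_mask != 0 &&
           (device_mask & ~AllDevicesMask(physical_device_count)) == 0;
}

static_assert(!IsValidDeviceMask(0x3, 1), "0x3 needs two physical devices");
static_assert(IsValidDeviceMask(0x3, 2), "0x3 fits a two-device group");
```

With a physical device count of 1, only the mask 0x1 passes, which matches the error reported by CI.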

@r-potter (Contributor Author) Sep 6, 2024

@spencer-lunarg Is this coming from a CI machine with two GPUs in it by any chance? The only way I can explain this is if vkEnumeratePhysicalDevices returns two devices, but the physical_device_count member of the state tracker is equal to 1 (due to not having created a device group when we initialized the device).

That sounds like a real bug in the new tests. I see what would be required to fix it based on other device group tests, but I wanted to confirm first that the setup is what I think it must be.

Contributor

the machine has 1 dedicated GPU in it, but I am pretty sure there is also an integrated GPU on the CPU

I unfortunately have zero experience with device groups, and I'm not sure how well things are tested, but this is some real unexplored territory... if you think there are driver bugs, we have an internal "don't run on this GPU config" YAML file I can update for any tests you want (happy to do so for any, using your best judgement)

Contributor Author

That would probably explain it. I think the latest iteration of the test is correct, but it also looks like it hangs a CI device (Pixel?). That seems like a fairly plausible outcome of a negative test that generates invalid sync code.

The NV crash is a bit less obvious. I'm not sure why that didn't replicate locally, but I'll investigate more and come back once I'm sure it's not an error on my side. This is definitely a less robustly exercised part of the API.

Contributor

> That seems like a fairly plausible outcome of a negative test that generates invalid sync code

so remember that these tests do an `if (skip == true) skip_dispatch`, so a common issue is a crash because we are not correctly catching the VU and the call gets into the driver and blows up
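
The skip-dispatch behavior described above can be sketched roughly as follows. This is a simplified stand-in, not the layer's real chassis code: `CallResult`, `SetEventThroughLayer`, and the `driver_set_event` callback are all hypothetical names used only to show why a missed VU lets an invalid call reach the driver.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical record of what happened to one call passing through a layer.
struct CallResult {
    bool error_reported;  // validation flagged the call
    bool driver_called;   // the call was forwarded down the chain
};

// Hypothetical layer entry point: validate first, forward only when clean.
inline CallResult SetEventThroughLayer(uint32_t device_mask,
                                       const std::function<void()>& driver_set_event) {
    // Validation step: flag masks that do not have exactly one bit set.
    const bool skip = device_mask == 0 || (device_mask & (device_mask - 1)) != 0;
    if (skip) {
        // Error caught: the invalid call never reaches the driver.
        return {true, false};
    }
    // Nothing flagged: the call is dispatched to the driver as-is.
    driver_set_event();
    return {false, true};
}
```

If the validation step fails to flag an invalid mask, `skip` stays false and the driver receives the invalid call, which is one way a negative test can end in a crash or hang.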

@@ -6560,3 +6560,136 @@ TEST_F(NegativeSyncVal, ResourceHandleIndexStability) {

m_default_queue->Wait();
}

TEST_F(NegativeSyncVal, EventCmdsInvalidDeviceMask) {
Contributor

so these tests belong in sync_object.cpp

Sync Object is what we call anything around validating the synchronization objects in Vulkan (fence, semaphore, etc.)

Sync Val is what we call the separate, optional add-on for synchronization validation
https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/main/docs/synchronization_usage.md

@@ -1930,3 +1930,107 @@ TEST_F(PositiveSyncVal, AtomicAccessFromTwoSubmits2) {
m_errorMonitor->VerifyFound();
m_default_queue->Wait();
}

TEST_F(PositiveSyncVal, EventCmdsValidDeviceMask) {
Contributor

as mentioned above, move to sync_object_positive.cpp

@ci-tester-lunarg (Collaborator)

CI Vulkan-ValidationLayers build queued with queue ID 250862.

@ci-tester-lunarg (Collaborator)

CI Vulkan-ValidationLayers build # 17396 running.

@ci-tester-lunarg (Collaborator)

CI Vulkan-ValidationLayers build # 17396 failed.

@spencer-lunarg (Contributor)

> CI Vulkan-ValidationLayers build # 17396 failed.

The Android stack trace can be seen. For the Linux NVIDIA machine, here is the stack trace from the crash on EventCmdsInvalidDeviceMask:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000079c0acbeb10e in ?? ()
   from /lib/x86_64-linux-gnu/libnvidia-glcore.so.550.78
[Current thread is 1 (Thread 0x79c0b06b4600 (LWP 3656320))]
#0  0x000079c0acbeb10e in ?? ()
   from /lib/x86_64-linux-gnu/libnvidia-glcore.so.550.78
#1  0x000079c0acb1d6e5 in ?? ()
   from /lib/x86_64-linux-gnu/libnvidia-glcore.so.550.78
#2  0x000079c0b0373da0 in ?? () from /lib/x86_64-linux-gnu/libGLX_nvidia.so.0
#3  0x000079c0b066067a in ?? () from /lib/x86_64-linux-gnu/libvulkan.so
#4  0x000079c08215c3db in vulkan_layer_chassis::CreateDevice (
    gpu=0x55bc256b6060, pCreateInfo=0x7ffd1216d5a0, pAllocator=0x0,
    pDevice=0x7ffd1216d558)
    at /home/lunarg/.jenkins/vz3/Debug64/Vulkan-ValidationLayers/layers/vulkan/generated/chassis.cpp:725
#5  0x000079c0b065dcca in ?? () from /lib/x86_64-linux-gnu/libvulkan.so
#6  0x000079c0b06613ed in ?? () from /lib/x86_64-linux-gnu/libvulkan.so
#7  0x000079c0b06776fa in vkCreateDevice ()
   from /lib/x86_64-linux-gnu/libvulkan.so
#8  0x000055bc204bf899 in vkt::Device::init (this=0x55bc2bc68160, info=...)
    at /home/lunarg/.jenkins/vz3/Debug64/Vulkan-ValidationLayers/tests/framework/binding.cpp:264
#9  0x000055bc204bf7db in vkt::Device::init (this=0x55bc2bc68160,
    extensions=std::vector of length 0, capacity 0, features=0x55bc2c43b988,
    create_device_pnext=0x7ffd1216dc00, all_queue_count=false)
    at /home/lunarg/.jenkins/vz3/Debug64/Vulkan-ValidationLayers/tests/framework/binding.cpp:258
#10 0x000055bc2048a419 in vkt::Device::Device (this=0x55bc2bc68160,
    phy=0x55bc2c311ce0, extension_names=std::vector of length 0, capacity 0,
    features=0x55bc2c43b988, create_device_pnext=0x7ffd1216dc00,
    all_queue_count=false)
    at /home/lunarg/.jenkins/vz3/Debug64/Vulkan-ValidationLayers/tests/framework/binding.h:226
#11 0x000055bc20483bf9 in VkRenderFramework::InitState (this=0x55bc2c43b2a0,
    features=0x55bc2c43b988, create_device_pnext=0x7ffd1216dc00, flags=2)
    at /home/lunarg/.jenkins/vz3/Debug64/Vulkan-ValidationLayers/tests/framework/render.cpp:648
#12 0x000055bc20d2c83c in NegativeSyncObject_EventCmdsInvalidDeviceMask_Test::TestBody (this=0x55bc2c43b2a0)
    at /home/lunarg/.jenkins/vz3/Debug64/Vulkan-ValidationLayers/tests/unit/sync_object.cpp:3695

@lunarpapillo (Contributor)

The LunarG CI Checkrun failed for both the Windows-NVIDIA configuration (Spencer posted the stack trace above) and for the Android GalaxyS24 configuration. The latter failure seems to be due to a bad device that we're repairing.

I'm not restarting this run because of the Windows-NVIDIA failure. The next time you push a fix, the LunarG CI Checkrun should start normally.
