EGD driver can potentially be locked in JSD_EGD_STATE_MACHINE_STATE_FAULT #65

Open
d-loret opened this issue Jan 13, 2023 · 2 comments
Labels
bug Something isn't working

d-loret commented Jan 13, 2023

The EGD driver can end up locked in the JSD_EGD_STATE_MACHINE_STATE_FAULT state if the following sequence of events occurs:

  1. A fault occurs in the EGD. State transitions to JSD_EGD_STATE_MACHINE_STATE_FAULT.
  2. The timeout to retrieve the error expires (i.e. state->pub.fault_code is set to JSD_EGD_FAULT_UNKNOWN).
  3. Driver transitions out of JSD_EGD_STATE_MACHINE_STATE_FAULT.
  4. Before a reset is issued, another fault occurs and cannot be retrieved either.
  5. Because state->pub.fault_code is still JSD_EGD_FAULT_UNKNOWN at this point, the timeout handling never runs: the if clause that acts on the timeout requires state->pub.fault_code to differ from JSD_EGD_FAULT_UNKNOWN. The driver therefore stays in JSD_EGD_STATE_MACHINE_STATE_FAULT.
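
The sequence above can be replayed with a small stand-alone model. This is only a sketch of the reasoning, not the driver code: every type, function, and constant below (`model_t`, `enter_fault`, `process_fault_guarded`, `FAULT_TIMEOUT_S`, the `SM_NOT_FAULT` state, and so on) is a hypothetical stand-in, and the guarded condition is assumed from the description in step 5.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's enums; names and values are assumed. */
typedef enum {
  FAULT_CODE_NONE,    /* hypothetical "no fault recorded" value */
  FAULT_CODE_UNKNOWN  /* stands in for JSD_EGD_FAULT_UNKNOWN */
} fault_code_t;

typedef enum {
  SM_NOT_FAULT, /* hypothetical: whatever state follows the fault state */
  SM_FAULT      /* stands in for JSD_EGD_STATE_MACHINE_STATE_FAULT */
} sm_state_t;

typedef struct {
  sm_state_t   sm_state;
  fault_code_t fault_code;   /* plays the role of state->pub.fault_code */
  double       fault_time_s; /* time at which the fault state was entered */
} model_t;

static const double FAULT_TIMEOUT_S = 1.0; /* assumed timeout value */

/* A fault occurs: enter the fault state. The fault code is NOT cleared here. */
static void enter_fault(model_t* m, double now_s) {
  m->sm_state     = SM_FAULT;
  m->fault_time_s = now_s;
}

/* One pass through the fault state when the error could not be retrieved.
 * The timeout is only acted on if the fault code is not already "unknown". */
static void process_fault_guarded(model_t* m, double now_s) {
  if (m->sm_state != SM_FAULT) {
    return;
  }
  bool timed_out = (now_s - m->fault_time_s) > FAULT_TIMEOUT_S;
  if (timed_out && m->fault_code != FAULT_CODE_UNKNOWN) {
    m->fault_code = FAULT_CODE_UNKNOWN;
    m->sm_state   = SM_NOT_FAULT;
  }
}

/* Replays steps 1-5. Returns true if the first fault is left via the timeout
 * but the second fault leaves the model stuck in SM_FAULT. */
bool replay_sequence_gets_stuck(void) {
  model_t m = {SM_NOT_FAULT, FAULT_CODE_NONE, 0.0};

  enter_fault(&m, 0.0);            /* step 1 */
  process_fault_guarded(&m, 2.0);  /* steps 2-3: timeout expires, state is left */
  bool left_first_fault = (m.sm_state == SM_NOT_FAULT);

  enter_fault(&m, 3.0);            /* step 4: second fault, no reset in between */
  process_fault_guarded(&m, 10.0); /* step 5: timeout elapsed, guard blocks it */

  return left_first_fault && m.sm_state == SM_FAULT;
}
```

In this model, no later call to `process_fault_guarded` can leave `SM_FAULT`, because nothing ever changes `fault_code` away from `FAULT_CODE_UNKNOWN`.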

I think the solution is to remove the state->pub.fault_code check from the if clause. That check does not seem to be needed for the code to function properly.
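
As a rough illustration of the proposed change, the two conditions can be compared side by side. The predicate names, the parameters, and the enum are assumptions made for this sketch; the actual if clause in the driver may be shaped differently.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the driver's fault codes. */
typedef enum {
  CODE_NONE,    /* hypothetical "no fault recorded" value */
  CODE_UNKNOWN  /* stands in for JSD_EGD_FAULT_UNKNOWN */
} code_t;

/* Condition as described in the issue: the timeout is only acted on when
 * the fault code is not already "unknown". */
bool timeout_fires_with_check(double elapsed_s, double timeout_s, code_t code) {
  return elapsed_s > timeout_s && code != CODE_UNKNOWN;
}

/* Proposed condition: act on the timeout regardless of the stored code. */
bool timeout_fires_without_check(double elapsed_s, double timeout_s) {
  return elapsed_s > timeout_s;
}
```

With a stale `CODE_UNKNOWN` left over from an earlier fault, `timeout_fires_with_check(5.0, 1.0, CODE_UNKNOWN)` is false no matter how much time passes, while `timeout_fires_without_check(5.0, 1.0)` is true. Before the timeout elapses, both conditions are false, so dropping the check only changes behavior in the stale-code case.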

@alex-brinkman, can you corroborate the above reasoning or clarify why the check on state->pub.fault_code is done?

d-loret added the bug label Jan 13, 2023
alex-brinkman commented

I believe you are correct in identifying the potential issue and I agree with your fix to remove the check in the if clause.

Out of curiosity, did you encounter this as a real issue in the wild or just find it via code inspection?

d-loret commented Feb 2, 2023

I found it through code inspection.

But I later saw an issue in EELS where the code was stuck in JSD_EGD_FAULT_UNKNOWN. We had to take down the node to get out of that state. I didn't look into it in detail, but it seemed very much like this issue.
