Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Error during OAuth callback / missing constraints for role_membership table #3811

Closed
tillprochaska opened this issue Jul 11, 2024 · 3 comments
Labels
backend Issues related to Aleph’s backend, API, CLI etc. bug Things that should work, but don’t Moderate Issue that may require attention

Comments

@tillprochaska
Copy link
Contributor

Describe the bug
Some users experience a 500 error when signing in using OAuth.

The role_membership table stores which groups a user is part of. It has two columns, group_id and role_id (migration). However, it doesn’t define any constraints (composite primary key or a unique constraint, so it is possible to add duplicate rows to the table.

Duplicate rows in this table can cause errors during the OAuth callback. After authenticating a user, Aleph "syncs" the groups the user is part of. It does this by first deleting all rows in the role_membership for the user, then adding new rows for all of the groups the users is currently part of. Groups are encoded in the OAuth token returned by the identity provider, e.g. Keycloak.

In case of duplicate rows, there’s a mismatch between the number of rows SQLAlchemy expects to delete and the number of rows that are actually deleted. This results in the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask_cors/extension.py", line 176, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/aleph/aleph/views/sessions_api.py", line 123, in oauth_callback
    role = handle_oauth(oauth.provider, oauth_token)
  File "/aleph/aleph/oauth.py", line 100, in handle_oauth
    role.clear_roles()
  File "/aleph/aleph/model/role.py", line 109, in clear_roles
    db.session.flush()
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/scoping.py", line 897, in flush
    return self._proxied.flush(objects=objects)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/session.py", line 4179, in flush
    self._flush(objects)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/session.py", line 4315, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/session.py", line 4275, in _flush
    flush_context.execute()
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/unitofwork.py", line 466, in execute
    rec.execute(self)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/unitofwork.py", line 591, in execute
    self.dependency_processor.process_saves(uow, states)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/dependency.py", line 1195, in process_saves
    self._run_crud(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/dependency.py", line 1220, in _run_crud
    raise exc.StaleDataError(
sqlalchemy.orm.exc.StaleDataError: DELETE statement on table 'role_membership' expected to delete 1 row(s); Only 2 were matched.

To Reproduce
Steps to reproduce the behavior:

  1. Set up a local Keycloak instance for development as explained in this guide.
  2. Create a user and add the user to a user group as explained in the second section of the group.
  3. Open Aleph and sign in using OAuth.
  4. You should be able to sign in successfully.
  5. When inspecting the contents of the role_membership table, you should see one row representing the group membership set up in step 2.
  6. Manually add another duplicate row to the table.
  7. Sign out of Aleph and sign in again using OAuth.
  8. You should see an internal server error. The API logs should contain the error from above.

Aleph version
3.17.0 and 4.0.0-rc33

Additional context

  • It is unclear how exactly you’d end up with duplicate rows in that table. Possibly some weird edge case, but it seems to have happened before.
  • A quick workaround in these cases is to remove the duplicate rows.
@tillprochaska tillprochaska added bug Things that should work, but don’t backend Issues related to Aleph’s backend, API, CLI etc. triage These issues need to be reviewed by the Aleph team labels Jul 11, 2024
@stchris
Copy link
Contributor

stchris commented Jul 11, 2024

SQL query hint:

select group_id, member_id, count(*) from role_membership group by group_id, member_id having count(*) > 1

It looks like a UNIQUE constraint on (group_id, member_id) would avoid these situations (but I'm not sure of more underlying issues).

@tillprochaska tillprochaska added Moderate Issue that may require attention and removed triage These issues need to be reviewed by the Aleph team labels Jul 15, 2024
@tillprochaska
Copy link
Contributor Author

tillprochaska commented Jul 16, 2024

For future reference: While looking a little closer at logs when testing a fix for this, I noticed that in one case, there were two overlapping OAuth requests happening at the same time, processed by different pods. Both requests had a separate auth code and state, i.e. both requests were part of separate OAuth flows.

In order to sync groups, Aleph first clears all current group memberships, then inserts new group memberships based on the data in the OAuth token retrieved from the ID provider. However, Aleph does this in a single transaction, so I wouldn’t expect this to be the issue.

@tillprochaska
Copy link
Contributor Author

tillprochaska commented Jul 17, 2024

For future reference: While looking a little closer at logs when testing a fix for this, I noticed that in one case, there were two overlapping OAuth requests happening at the same time, processed by different pods. Both requests had a separate auth code and state, i.e. both requests were part of separate OAuth flows.

Some additional context:

  • The affected user was part of many groups, but there was only one duplicate group membership.
  • The user was added to the group associated with the duplicate membership shortly before the error occurred. The two OAuth requests were the first OAuth requests since the user was added to the group.

I think the issue is that while deleting and inserting the role memberships is part of a single transaction, determining which role memberships to delete is not. When clearing role memberships for a given user, SQLAlchemy doesn’t construct a query like this:

DELETE FROM role_membership WHERE member_id = 123

Instead, it deletes specific rows based on data read before:

DELETE FROM role_membership WHERE member_id = 123 AND group_id = 456
DELETE FROM role_membership WHERE member_id = 123 AND group_id = 789

If a user is added to a new group, the following race condition can occur when two overlapping OAuth requests are handled:

t Request 1 Request 2
1 Begin transaction
2 Fetch role memberships for user (Group A, Group B)
3 End transaction
4 Begin transaction
5 Fetch role memberships for user (Group A, Group B)
6 End transaction
7 Begin transaction
8 Delete role memberships for user (Group A, Group B)
9 Insert new role memberships (Group A, Group B, Group C)
10 End transaction
11 Begin transaction
12 Delete role memberships for user (Group A, Group B)
13 Insert new role memberships (Group A, Group B, Group C)
14 End transaction

As a result, the role membership for Group C is stored twice.

tillprochaska added a commit that referenced this issue Jul 17, 2024
This likely isn't the root cause for #3811, but it will help with debugging in the future.
tillprochaska added a commit that referenced this issue Jul 17, 2024
While this doesn't fix the root cause for #3811, it ensures that this error doesn't result in inconsistent data that have previously permanently blocked users from sign-in. Having sensible constraints on tables is a good idea anyway.
tillprochaska added a commit that referenced this issue Nov 4, 2024
* Make name/email customizable in OAuth mocks

* Fix OAuth test setup

This was an oversight and should have always called `super().setUp()` instead of `super().setUpClass()`. This didn’t fail when running tests for the entire file (`pytest test_sessions_api.py`), but it did always fail when running individual tests (`pytest test_sessions_api.py -k oauth_callback`).

* Test OAuth group sync

* Log warning when trying to add duplicate role membership

This likely isn't the root cause for #3811, but it will help with debugging in the future.

* Add primary key constraint on `role_membership` table

While this doesn't fix the root cause for #3811, it ensures that this error doesn't result in inconsistent data that have previously permanently blocked users from sign-in. Having sensible constraints on tables is a good idea anyway.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend Issues related to Aleph’s backend, API, CLI etc. bug Things that should work, but don’t Moderate Issue that may require attention
Projects
None yet
Development

No branches or pull requests

2 participants