Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Passing a username as "organizations" config value crashes the tap #173

Open
laurentS opened this issue Dec 9, 2022 · 5 comments
Open
Labels
bug Something isn't working

Comments

@laurentS
Copy link
Contributor

laurentS commented Dec 9, 2022

The following config:

{
    "auth_token": access_token,
    "start_date": "2022-12-09",
    "organizations": ["laurents"]
}

will crash the tap with the following traceback (trimmed to the useful part):

2022-12-09 17:36:11,534 Beginning incremental sync of 'repositories'...
2022-12-09 17:36:11,534 Tap has custom mapper. Using 1 provided map(s).
2022-12-09 17:36:11,900 Tap will run with 1 auth tokens
Traceback (most recent call last):
[...]
  File "tap_github/repository_streams.py", line 182, in get_records
    yield from super().get_records(context)
  File "singer_sdk/streams/rest.py", line 537, in get_records
    for record in self.request_records(context):
  File "singer_sdk/streams/rest.py", line 357, in request_records
    resp = decorated_request(prepared_request, context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "backoff/_sync.py", line 105, in retry
    ret = target(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "singer_sdk/streams/rest.py", line 253, in _request
    self.validate_response(response)
  File "/tap_github/client.py", line 233, in validate_response
    raise FatalAPIError(msg)
singer_sdk.exceptions.FatalAPIError: 404 Client Error: b'{"message":"Not Found","documentation_url":"https://docs.github.com/rest/reference/repos#list-organization-repositories"}' (Reason: Not Found) for path: /orgs/laurents/repos

This is because my username is not an org from github's perspective. Calling the users/laurents/repos endpoint works fine on the other hand.

I'm not sure what the best solution is:

  • let calling code check that the values are valid. I don't think this is right, the tap shouldn't crash.
  • run a preliminary query like we do for repositories and use it to filter out invalid org names.
  • handle the error where it's raised, but it's difficult to have enough context to do it properly.
  • any other ideas?
@laurentS laurentS added the bug Something isn't working label Dec 9, 2022
@aaronsteers
Copy link
Contributor

@laurentS - Just to check here, it is desireable in this case to be able to pull streams for a "personal" org as if it were a "real" org?

If yes, and if there is a parent-child relationship, perhaps we could check a property of the org and skip/omit child streams that are non-applicable.

That would allow a hybrid of orgs and usernames to be provided in the organizations list.

However, if we'd rather not accept personal usernames as orgs, then (again assuming we have a parent-child relationship) probably a similar check that results in hard failure when iterating on the organizations themselves would do the trick. "Error: Detected that 'laurents' is not a valid organization ID. Personal usernames are not allowed in the organizations list."

What do you think?

@laurentS
Copy link
Contributor Author

My feeling is that github mixes users and orgs in the web app, to the point where behaviour is basically the same when listing an org's repo or a user's repos. Using the api, I get essentially the same results as well, whether I call github.com/orgs/meltano/repos or github.com/users/laurents/repos: a list of repos, so it is possible to follow up with the same child streams.

Taking a step back, the lack of consistency is somewhat annoying here. It is impossible to know if joeblog in github.com/joeblog or joeblog/superpackage is a user or an org without hitting the api and possibly getting an error back (and crashing the tap, abandoning all further processing).
I think having to write code outside the tap to figure this out defeats the purpose of the tap, yet it is the behaviour we currently have (as we can pass users or organizations in the config file).

So I'd lean towards making it transparent in the tap as well, so that it is possible to pass "organizations": ["meltano", "laurents"] as a config option, and the tap figures things out.
This would work for the repositories stream and its children (which is my use case, and the only use case available in the tap right now), but if we were to add a stream to fetch data about the org itself using the orgs/ORG endpoint, then we'd have a problem again, and we would need a different behaviour there. For consistency with the "symmetric" users stream (if you pass "users": ["meltano"], it will skip it), we could use the graphql endpoint in a new Organizations stream to filter out invalid ones.

What do you think?
@ericboucher any further thoughts?

@ericboucher
Copy link
Contributor

ericboucher commented Feb 19, 2023

It is indeed quite annoying. As a first approach I think we could try to make it transparent and run successfully for repos. And log a more explicit warning on "Not Found" for the other streams. Eg. "meltano" was not found, are you sure this GitHub organization exists and it is not a user?

@laurentS
Copy link
Contributor Author

I wonder if we could preprocess the organizations config param with the graphql endpoint, like we do for users and repos, but do it in such a way that we end up with 2 lists: "real" organizations, and valid users (and implicitly, invalid input that is discarded, basically whatever was neither org nor user). I think we could manage this with a single graphql query passing the list as both orgs and users, and processing results+errors in the response.

Then:

  • for specifically org streams (teams, team members, team roles), we use the "real_orgs" as input,
  • for repo streams, we can use both validated orgs and users as inputs, changing the url as required so the tap runs transparently on both.
  • And for invalid input, we just log a warning during preprocessing, like we currently do for repos and users: "invalid org name , discarding".

@ericboucher
Copy link
Contributor

ericboucher commented Mar 5, 2023

@laurentS sounds good, let's try something like this. Random thought - we could imagine another type of input called users_and_orgs or users_or_orgs that explicitly has this behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants