Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GDrive remote support #2551

Merged
merged 1 commit into from
Nov 29, 2019
Merged

GDrive remote support #2551

merged 1 commit into from
Nov 29, 2019

Conversation

maxhora
Copy link
Contributor

@maxhora maxhora commented Sep 29, 2019

  • Have you followed the guidelines in our
    Contributing document?

  • Does your PR affect documented changes or does it add new functionality
    that should be documented? If yes, have you created a PR for
    dvc.org documenting it or at
    least opened an issue for it? If so, please add a link to it.

Docs update will be tracked under iterative/dvc.org#381


User should create locally GDrive settings file (format expected by PyDrive) and put his/here own Google Project credentials into new created settings file. For test purposes following settings file content might be used:

client_config_backend: settings
client_config:
    client_id: 719861249063-v4an78j9grdtuuuqg3lnm0sugna6v3lh.apps.googleusercontent.com
    client_secret: 2fy_HyzSwkxkGzEken7hThXb

save_credentials: True
save_credentials_backend: file
save_credentials_file: credentials.json

get_refresh_token: True

oauth_scope:
  - https://www.googleapis.com/auth/drive
  - https://www.googleapis.com/auth/drive.appdata

User should configure dvc repo with GDrive remote and path to settings file created in previous step as following:

dvc remote add gdrive -d gdrive gdrive://root/test
dvc remote modify gdrive keyfile YOUR_SETTINGS_FILE.yaml

Above configuration will push dvc files to the new created directory test located at user's GDrive root directory.


Implementation details:

  • PyDrive library is used to access Google Drive API v2. To handle and avoid getting API usage limits errors ratelimit and backoff libraries are used.
  • Environment variable PYDRIVE_USER_CREDENTIALS_DATA can be used to store user account's token data to skip manual login action. Should be useful for automated tested, but the actual value of PYDRIVE_USER_CREDENTIALS_DATA is supposed to be encrypted and secured from unauthorized usage.

@maxhora maxhora self-assigned this Sep 29, 2019
@efiop efiop mentioned this pull request Sep 29, 2019
2 tasks
dvc/remote/gdrive/__init__.py Outdated Show resolved Hide resolved
tests/conftest.py Outdated Show resolved Hide resolved
tests/func/test_data_cloud.py Outdated Show resolved Hide resolved
Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks neat! I see that it is still WIP and has some things like commented-out code, but decided to do a quick glimpse over for now 🙂

@janchorowski
Copy link

Thanks, for the PR, I started using it and it works great. However, I guess the following might be considered (@Maxris ping me for help if you want), ordered by severity:

  1. add rate limiting, I hacked smth here, can send you a PR to the PR: https://github.com/janchorowski/dvc/blob/2ef9a4c0a03923567c86df81be833eeb96f71b7d/dvc/remote/gdrive/__init__.py#L38-L52
  2. change the way cache_exists works by implementing the all() method - from a quick read it seems that you are listing each directory several times, this requires throttling down the requests to the API to levels below the actual quota (theoretically the throttling should be set to 1k requests/100s, it only worked for me with about 1 request/sec due to the way file listing is implemented).
  3. make client_id configurable and add instructions how to configure custom OAuth2 applications - in a way similar to RClone (https://rclone.org/drive/#making-your-own-client-id), otherwise users will hit your quota quite often.

@maxhora
Copy link
Contributor Author

maxhora commented Oct 7, 2019

Hey @janchorowski , great findings!
Sure, PR to PR sounds good, or you should be able to push your commits directly to remote branch.

2 and 3 sounds reasonable. I'm looking for the moment at point 3 to figure out the best way to configure OAuth2 token to pass test_data_cloud.py

@maxhora
Copy link
Contributor Author

maxhora commented Oct 9, 2019

@janchorowski I'm going to address things you have mentioned above as next step. Will just grab your changes from https://github.com/janchorowski/dvc/blob/2ef9a4c0a03923567c86df81be833eeb96f71b7d/dvc/remote/gdrive/__init__.py#L38-L52 and give them a try locally. thanks!

@maxhora
Copy link
Contributor Author

maxhora commented Oct 12, 2019

@janchorowski

1. add rate limiting, I hacked smth here, can send you a PR to the PR: https://github.com/janchorowski/dvc/blob/2ef9a4c0a03923567c86df81be833eeb96f71b7d/dvc/remote/gdrive/__init__.py#L38-L52

Thank you, proposed solution seems works great so far! I'm going to do more testing with higher amount of files.

2. change the way `cache_exists` works by implementing the `all()` method - from a quick read it seems that you are listing each directory several times, this requires throttling down the requests to the API to levels below the actual quota (theoretically the throttling should be set to 1k requests/100s, it only worked for me with about 1 request/sec due to the way file listing is implemented).

According to the doc from the link you have mentioned (https://rclone.org/drive/#making-your-own-client-id) "The default Google quota is 10 transactions per second". Since DCV sends batch_exists queries in multiple threads, I believe, it reaches above quota pretty easy. Also I was not able to find direct mention about this quota in their doc. It might be that 1k requests/100s means just 10 requests per second in real use.

Intermediate directories are already cached and their remote IDs are retrieved from runtime cache collection (get_path_id_from_cache). Also ListFile calls are updated to search remote file by expected name and iterating over all files in directory doesn't happen any more (need to mention that this haven't decreased the number of queries to GDrive API).

Could you please let me know if you can see here the way to optimize number of API queries further.

3. make client_id configurable and add instructions how to configure custom OAuth2 applications - in a way similar to RClone (https://rclone.org/drive/#making-your-own-client-id), otherwise users will hit your quota quite often.

Going to create docs for this.

@maxhora maxhora changed the title [WIP] GDrive remote support GDrive remote support Oct 14, 2019
@maxhora
Copy link
Contributor Author

maxhora commented Oct 14, 2019

PR should be ready for the review.

So far known issues:

  • The problem mentioned by @janchorowski to change the way cache_exists works by implementing the all() method still should be addressed. At the moment dvc sends maximum 10 requests per second (have tested with recent changes in this PR and seems it works without exceeding the API's limits) or, by another words, it checks existence on remote of 10 files max per 1 second. That's very slow for repos with huge number of files.
    As suggested, it should be possible to reimplement all method to query API for 1000 files per request. That, theoretically, should allow us to check 1000 x 10 files existence per second.

I might will implement above just after docs PR creation and add to current PR.

  • Time to time (randomly) uploading speed of file to remote might become very low. It happens from the start of upload till the end for single file. Not yet sure what might cause such issue.

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Please see some minor comments up above.

@maxhora

This comment has been minimized.

@maxhora

This comment has been minimized.

dvc/remote/gdrive/__init__.py Outdated Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Outdated Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Outdated Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Outdated Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Outdated Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Outdated Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Outdated Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Outdated Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Outdated Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Outdated Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Outdated Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Outdated Show resolved Hide resolved
@Suor

This comment has been minimized.

Comment on lines +252 to +257
if remote_ids:
return remote_ids[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this deterministic between runs? Or will it return a random id?

Copy link
Contributor Author

@maxhora maxhora Nov 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not able to find relevant info about order of returned results by Drive API v2.

https://developers.google.com/drive/api/v2/reference/files/list#parameters

Above link says for optional orderBy param: Please note that there is a current limitation for users with approximately one million files in which the requested sort order is ignored. And it seems DVC can't rely on usage of orderBy because of above.

So far I haven't noticed that results might be returned in different order.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't run into millions, may just sort/take minimum one instead of first to make it deterministic. I will add it to #2865

dvc/remote/gdrive/__init__.py Show resolved Hide resolved
dvc/remote/gdrive/__init__.py Show resolved Hide resolved
@efiop
Copy link
Contributor

efiop commented Nov 28, 2019

@Maxris Please don't forget to install your git hooks, so that restyled doesn't have to create all of those annoying PR's each time 🙂

@shcheklein
Copy link
Member

@efiop @Suor please review again, keep in mind that we have created a separate ticket to address some refactoring/improvements.

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! For the record, #2865 is for the rest of TODOs.

@shcheklein
Copy link
Member

Approved! Thanks, @Maxris 🎉

@shcheklein
Copy link
Member

@Maxris there was some conversation I can't find - to provide a mechanism to reset (cached) credentials in case they are expired or something. Should we move it to the #2865 ? To keep track of it.

@maxhora
Copy link
Contributor Author

maxhora commented Nov 29, 2019

@shcheklein this conversation seems happened there #2551 (comment) and already added to the #2865 under number 2 in the list of To-Dos.

@shcheklein
Copy link
Member

@Maxris kk, I see. Thanks!

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @Maxris, this was big.

@Suor
Copy link
Contributor

Suor commented Nov 29, 2019

Added a couple of points to #2865

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants