
S3 pull ends up with PermissionError: The difference between the request time and the server's #77

Open
petrchmelar opened this issue Oct 4, 2023 · 2 comments


petrchmelar commented Oct 4, 2023

Hi, a few days ago we encountered this exception when pulling DVC-tracked data using the dvc get command. We were pulling about 800 files of roughly 4 GB each.
The issue may be related to this comment: iterative/dvc#9070 (comment)

Traceback (most recent call last):                                                                                                                                                                                                                                               
  File "reproduce_pull.py", line 50, in <module>
    copy(
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/dvc_objects/fs/generic.py", line 93, in copy
    return _get(
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/dvc_objects/fs/generic.py", line 241, in _get
    raise result
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/dvc_objects/executors.py", line 134, in batch_coros
    result = fut.result()
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/dvc_objects/fs/generic.py", line 220, in _get_one_coro
    return await get_coro(
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/dvc_objects/fs/callbacks.py", line 84, in func
    return await wrapped(path1, path2, **kw)
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/dvc_objects/fs/callbacks.py", line 52, in wrapped
    res = await fn(*args, **kwargs)
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/s3fs/core.py", line 1224, in _get_file
    body, content_length = await _open_file(range=0)
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/s3fs/core.py", line 1215, in _open_file
    resp = await self._call_s3(
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/s3fs/core.py", line 348, in _call_s3
    return await _error_wrapper(
  File "/srv/workplace/pchmelar/repos/dvc/venv/lib/python3.8/site-packages/s3fs/core.py", line 140, in _error_wrapper
    raise err
PermissionError: The difference between the request time and the server's time is too large.

We found that the command fails semi-deterministically after about 20 minutes of pulling.
We also found that setting the --jobs parameter to 10 (e.g. dvc get <url> <path> --jobs 10) instead of the default of 4 × cpu_count worked around the problem. On our machine that default amounts to 4 × 56 = 224 jobs:

>>> import os
>>> os.cpu_count()
56

I did a little digging around and may have found the root cause of the problem:

=> That means that with a large number of jobs (hundreds), a session can end up waiting longer than 15 minutes before its request is actually sent, which causes the "The difference between the request time and the server's time is too large." error: by the time the request reaches the server, its timestamp differs from the server's time by more than the 15-minute skew AWS tolerates.
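As a back-of-the-envelope model of that queueing (all numbers below are illustrative assumptions, not measurements): if requests are timestamped up front and then wait for one of max_pool_connections connections, the last queued request waits roughly (jobs / pool_size - 1) times the per-file download time before it is sent.

# Rough queueing model -- illustrative numbers, not measured values.
jobs = 4 * 56             # DVC's default of 4 * cpu_count() on the machine above
pool_size = 10            # botocore's default max_pool_connections
per_file_seconds = 120    # assumed download time for one ~4 GB file

# The last queued request is only sent after all the batches ahead of it finish.
worst_wait_minutes = (jobs / pool_size - 1) * per_file_seconds / 60
print(f"worst-case wait before sending: {worst_wait_minutes:.0f} minutes")
# ~43 minutes, well past AWS's 15-minute clock-skew tolerance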

Reproduction

I was not able to reproduce the issue by simulating the download process, so I needed to pull an actual file from S3.
The following code creates an S3 bucket, uploads a file, and tries to copy it from S3 using dvc-data in 1000 threads.
If some of the 1000 threads do not finish within 15 minutes, the request will probably fail. In our case it failed after 17 minutes.

# %%
import logging
from pathlib import Path
from s3fs import S3FileSystem
from tempfile import TemporaryDirectory
from dvc.fs import FileSystem
from dvc_objects.fs.generic import copy
from dvc_objects.fs.local import LocalFileSystem

import tqdm
from dvc.fs.callbacks import Callback

logging.basicConfig(level=logging.INFO)

s3_fs = S3FileSystem(endpoint_url="your-endpoint-url")
dvc_local_fs = LocalFileSystem()

bucket_name = "test-bucket"
no_requests = 1000  # 1000 download requests in total
no_chars = 1_000_000_000  # ~1.9 GB file
no_jobs = 1000  # 1000 async jobs scheduled at once, to exceed the 15-minute skew limit

# %%
if s3_fs.exists(bucket_name):
    logging.info(f"Deleting {bucket_name}")
    s3_fs.rm(bucket_name, recursive=True)

logging.info("Creating bucket")
s3_fs.mkdir(bucket_name)

with TemporaryDirectory() as tmp_dir:  # avoid shadowing the builtin dir()
    file = Path(tmp_dir, "test-file")
    file.write_text(" ".join(["a"] * no_chars))
    with Callback.as_tqdm_callback(
        unit="file",
        desc="Put",
    ) as cb:
        s3_fs.put(tmp_dir, bucket_name, recursive=True, callback=cb)
    logging.info(f"File uploaded to {bucket_name}")

# %%
with TemporaryDirectory(dir=".") as tmp_dir:
    logging.info("Loading files from s3")
    dvc_s3_fs = FileSystem(s3_fs)
    bucket_dir = s3_fs.ls(bucket_name)[0]
    file = f"{bucket_dir}/test-file"
    src_files = [file] * no_requests
    dst_files = [str(Path(tmp_dir, str(idx))) for idx in range(no_requests)]
    with Callback.as_tqdm_callback(unit="file", desc="pull", size=len(dst_files)) as cb:
        copy(
            dvc_s3_fs,
            src_files,
            dvc_local_fs,
            dst_files,
            callback=cb,
            batch_size=no_jobs,
        )

# %%

DVC doctor

╰─❯ dvc doctor
DVC version: 2.55.0
-------------------
Platform: Python 3.8.10 on Linux-5.15.0-78-generic-x86_64-with-glibc2.17
Subprojects:
        dvc_data = 0.47.5
        dvc_objects = 0.25.0
        dvc_render = 0.3.1
        dvc_task = 0.3.0
        scmrepo = 1.3.1
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.9.2, boto3 = 1.28.17)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: https
Workspace directory: xfs on /dev/mapper/data-srv
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/77821165453cef8903abded466242330
skshetry transferred this issue from iterative/dvc-data on Mar 7, 2024
shcheklein (Member) commented

Excellent research, @petrchmelar!

Can you try it with the most recent DVC version? (It seems the whole stack of libraries is quite outdated.)

I don't see an option on the DVC side to alter max_pool_connections. Only the options from this config are taken into account, and none of them affect the number of actual connections made to S3. We should probably provide a way to pass additional config options (e.g. max_pool_connections). I don't immediately see a way to specify it via the AWS config file or AWS environment variables.

It's even a bit more complicated considering that a single file can probably be downloaded using multiple concurrent requests, so the total number of connections needed can be quite large in general.

I wouldn't add any custom logic to try to predict the needed value, though. Just being able to pass max_pool_connections down should probably be enough.
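
For reference, outside of DVC the pool can already be sized when constructing the filesystem directly, since s3fs forwards config_kwargs to botocore's client Config. A minimal sketch against the repro script above (the endpoint URL and pool size are illustrative):

from s3fs import S3FileSystem

# Sketch: s3fs passes config_kwargs through to botocore's Config, so the
# connection pool can be sized to match the number of concurrent jobs.
s3_fs = S3FileSystem(
    endpoint_url="your-endpoint-url",
    config_kwargs={"max_pool_connections": 1000},  # botocore's default is 10
)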

@pmrowla if you have time, since you touched this recently: does my logic sound reasonable to you? :)

pmrowla (Contributor) commented Mar 8, 2024

Exposing max_pool_connections should be fine here, but I'm not sure this is still an issue, given that there have been a ton of significant changes between DVC 2.55 and the current release.
