
DVC tool to push data to HDFS #18

Open
subodhdere opened this issue Jun 27, 2023 · 12 comments

@subodhdere commented Jun 27, 2023

Bug Report

Issue name

DVC tool to push data to HDFS

dvc push -r myremote -v

Description

I get the error below while pushing data to HDFS.

singhab@jupyter-singhab-jupyter:~/dvc-example$ dvc push -r myremote -v
2023-06-20 07:12:42,730 DEBUG: v2.58.2 (conda), CPython 3.8.16 on Linux-3.10.0-957.1.3.el7.x86_64-x86_64-with-glibc2.10
2023-06-20 07:12:42,730 DEBUG: command: /opt/conda/bin/dvc push -r myremote -v
2023-06-20 07:12:43,153 DEBUG: Preparing to transfer data from '/home/singhab/dvc-example/.dvc/cache' to '/user/halo/dvc-usecase'
2023-06-20 07:12:43,153 DEBUG: Preparing to collect status from '/user/halo/dvc-usecase'
2023-06-20 07:12:43,154 DEBUG: Collecting status from '/user/halo/dvc-usecase'
2023-06-20 07:12:43,155 DEBUG: Querying 1 oids via object_exists                                                                                    
  0% Checking cache in '/user/halo/dvc-usecase'|                                                                         |0/? [00:00<?,    ?files/s]loadFileSystems error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
hdfsBuilderConnect(forceNewInstance=1, nn=hdfs://<name-node>, port=8020, kerbTicketCachePath=FILE:/tmp/krb5cc_1000, [email protected]) error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
/arrow/cpp/src/arrow/status.cc:137: Failed to disconnect hdfs client: IOError: HDFS hdfsFS::Disconnect failed. Detail: [errno 9] Bad file descriptor
2023-06-20 07:12:43,435 ERROR: unexpected error - HDFS connection failed                                                                            
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/dvc/cli/__init__.py", line 210, in main
    ret = cmd.do_run()

Reproduce

dvc push -r myremote -v

Expected

Data should be pushed to HDFS.

Output of dvc doctor:

$ dvc doctor

Additional Information (if any):

@efiop transferred this issue from iterative/dvc Jun 27, 2023
@efiop (Contributor) commented Jun 27, 2023

Hi @subodhdere, please post the output of dvc doctor.

Also, the log you've posted seems to be partial. Please post the full log.

@subodhdere (Author) commented Jun 29, 2023

Hello, please see below the output of dvc doctor:

singhab@jupyter-singhab-jupyter:~$ dvc doctor
DVC version: 2.58.2 (conda)
---------------------------
Platform: Python 3.8.16 on Linux-3.10.0-957.1.3.el7.x86_64-x86_64-with-glibc2.10
Subprojects:
        dvc_data = 0.51.0
        dvc_objects = 0.22.0
        dvc_render = 0.5.3
        dvc_task = 0.2.1
        scmrepo = 1.0.3
Supports:
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3)
Config:
        Global: /home/singhab/.config/dvc
        System: /etc/xdg/dvc

@subodhdere (Author) commented Jun 29, 2023

Adding the full log:

singhab@jupyter-singhab-jupyter:~/dvc-example$ dvc push -r myremote -v
2023-06-29 10:42:46,833 DEBUG: v2.58.2 (conda), CPython 3.8.16 on Linux-3.10.0-957.1.3.el7.x86_64-x86_64-with-glibc2.10
2023-06-29 10:42:46,833 DEBUG: command: /opt/conda/bin/dvc push -r myremote -v
2023-06-29 10:42:47,284 DEBUG: Preparing to transfer data from '/home/singhab/dvc-example/.dvc/cache' to '/user/halo/dvc-usecase'
2023-06-29 10:42:47,284 DEBUG: Preparing to collect status from '/user/halo/dvc-usecase'
2023-06-29 10:42:47,284 DEBUG: Collecting status from '/user/halo/dvc-usecase'
2023-06-29 10:42:47,285 DEBUG: Querying 1 oids via object_exists                                                                                    
  0% Checking cache in '/user/halo/dvc-usecase'|                                                                         |0/? [00:00<?,    ?files/s]loadFileSystems error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
hdfsBuilderConnect(forceNewInstance=1, nn=hdfs://am01.halo-telekom.com, port=8020, kerbTicketCachePath=FILE:/tmp/krb5cc_1000, [email protected]) error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
/arrow/cpp/src/arrow/status.cc:137: Failed to disconnect hdfs client: IOError: HDFS hdfsFS::Disconnect failed. Detail: [errno 9] Bad file descriptor
2023-06-29 10:42:47,541 ERROR: unexpected error - HDFS connection failed                                                                            
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/dvc/cli/__init__.py", line 210, in main
    ret = cmd.do_run()
  File "/opt/conda/lib/python3.8/site-packages/dvc/cli/command.py", line 26, in do_run
    return self.run()
  File "/opt/conda/lib/python3.8/site-packages/dvc/commands/data_sync.py", line 60, in run
    processed_files_count = self.repo.push(
  File "/opt/conda/lib/python3.8/site-packages/dvc/repo/__init__.py", line 65, in wrapper
    return f(repo, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dvc/repo/push.py", line 92, in push
    result = self.cloud.push(
  File "/opt/conda/lib/python3.8/site-packages/dvc/data_cloud.py", line 154, in push
    return self.transfer(
  File "/opt/conda/lib/python3.8/site-packages/dvc/data_cloud.py", line 135, in transfer
    return transfer(src_odb, dest_odb, objs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dvc_data/hashfile/transfer.py", line 203, in transfer
    status = compare_status(
  File "/opt/conda/lib/python3.8/site-packages/dvc_data/hashfile/status.py", line 178, in compare_status
    dest_exists, dest_missing = status(
  File "/opt/conda/lib/python3.8/site-packages/dvc_data/hashfile/status.py", line 149, in status
    odb.oids_exist(hashes, jobs=jobs, progress=pbar.callback)
  File "/opt/conda/lib/python3.8/site-packages/dvc_objects/db.py", line 406, in oids_exist
    return list(wrap_iter(remote_oids, callback))
  File "/opt/conda/lib/python3.8/site-packages/dvc_objects/db.py", line 36, in wrap_iter
    for index, item in enumerate(iterable, start=1):
  File "/opt/conda/lib/python3.8/site-packages/dvc_objects/db.py", line 354, in list_oids_exists
    in_remote = self.fs.exists(paths, batch_size=jobs)
  File "/opt/conda/lib/python3.8/site-packages/dvc_objects/fs/base.py", line 352, in exists
    if self.fs.async_impl:
  File "/opt/conda/lib/python3.8/site-packages/funcy/objects.py", line 47, in __get__
    return prop.__get__(instance, type)
  File "/opt/conda/lib/python3.8/site-packages/funcy/objects.py", line 25, in __get__
    res = instance.__dict__[self.fget.__name__] = self.fget(instance)
  File "/opt/conda/lib/python3.8/site-packages/dvc_hdfs/__init__.py", line 58, in fs
    return HadoopFileSystem(**self.fs_args)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/spec.py", line 79, in __call__
    obj = super().__call__(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/arrow.py", line 278, in __init__
    fs = HadoopFileSystem(
  File "pyarrow/_hdfs.pyx", line 95, in pyarrow._hdfs.HadoopFileSystem.__init__
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: HDFS connection failed

2023-06-29 10:42:47,630 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-06-29 10:42:47,630 DEBUG: Removing '/home/singhab/.FxZzopJoqjfYPoAjwtVGz6.tmp'
2023-06-29 10:42:47,636 DEBUG: Removing '/home/singhab/.FxZzopJoqjfYPoAjwtVGz6.tmp'
2023-06-29 10:42:47,645 DEBUG: Removing '/home/singhab/.FxZzopJoqjfYPoAjwtVGz6.tmp'
2023-06-29 10:42:47,650 DEBUG: Removing '/home/singhab/dvc-example/.dvc/cache/.hEYTKG7bHugmicwMbkzNsk.tmp'
2023-06-29 10:42:47,665 DEBUG: Version info for developers:
DVC version: 2.58.2 (conda)
---------------------------
Platform: Python 3.8.16 on Linux-3.10.0-957.1.3.el7.x86_64-x86_64-with-glibc2.10
Subprojects:
        dvc_data = 0.51.0
        dvc_objects = 0.22.0
        dvc_render = 0.5.3
        dvc_task = 0.2.1
        scmrepo = 1.0.3
Supports:
        hdfs (fsspec = 2023.6.0, pyarrow = 12.0.0),
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.6.0, boto3 = 1.26.161)
Config:
        Global: /home/singhab/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: nfs4 on fs-a461edff.efs.eu-central-1.amazonaws.com:/halo-claim-singhab-jupyter-pvc-2486a8de-d78a-4515-8ea0-cf2fe89befe2
Caches: local
Remotes: hdfs, s3
Workspace directory: nfs4 on fs-a461edff.efs.eu-central-1.amazonaws.com:/halo-claim-singhab-jupyter-pvc-2486a8de-d78a-4515-8ea0-cf2fe89befe2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/e621ece895c6241383df59f56935951d

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2023-06-29 10:42:47,669 DEBUG: Analytics is enabled.
2023-06-29 10:42:47,708 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpdlov741k']'
2023-06-29 10:42:47,711 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpdlov741k']'
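
Since the traceback ends inside pyarrow._hdfs.HadoopFileSystem.__init__, before DVC transfers anything, the failure can be reproduced without DVC at all. Below is a minimal isolation sketch using the same arguments that appear in the hdfsBuilderConnect line of the log; expect the same OSError until the underlying environment problem is fixed:

# Repro of the constructor call dvc_hdfs makes, per the traceback above;
# host/port/user/ticket are copied from the hdfsBuilderConnect log line.
python -c "
from pyarrow.fs import HadoopFileSystem
fs = HadoopFileSystem(host='am01.halo-telekom.com', port=8020,
                      user='[email protected]',
                      kerb_ticket='FILE:/tmp/krb5cc_1000')
print(fs.get_file_info('/user/halo/dvc-usecase'))
"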

@efiop (Contributor) commented Jun 29, 2023

Looks like something is wrong with your credentials/config. Does the hdfs CLI work?

@subodhdere (Author) commented Jun 30, 2023

Hello,
the HDFS CLI is working. I've also shared the .dvc/config file below for more info.

singhab@jupyter-singhab-jupyter:~/dvc-example$ hdfs dfs -ls hdfs://am01.halo-telekom.com:8020/user/halo/dvc-usecase
Found 1 items
-rw-rw-r--+  3 singhab hadoop          0 2023-06-15 09:45 hdfs://am01.halo-telekom.com:8020/user/halo/dvc-usecase/test.txt

===================================================

singhab@jupyter-singhab-jupyter:~/dvc-example$ cat .dvc/config
[core]
    remote = myremotes3
['remote "myremote"']
    url = hdfs://am01.halo-telekom.com:8020/user/halo/dvc-usecase
    user = singhab
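
One thing to note here: the hdfs CLI assembles its own Java classpath internally, so it can work fine even when CLASSPATH is empty in the shell, which is the environment pyarrow's libhdfs inherits from the dvc process. A quick way to compare the two (run in the same shell used for dvc push):

echo "${CLASSPATH:-<unset>}"      # what libhdfs will see
hadoop classpath --glob           # what the hdfs CLI effectively uses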

@efiop (Contributor) commented Jun 30, 2023

@subodhdere So what credentials are you using, and how? Kerberos, maybe?

Overall this seems like a configuration issue.
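
For what it's worth, this particular java.lang.NoClassDefFoundError from loadFileSystems is usually an environment problem rather than a credentials one: libhdfs (which pyarrow uses) starts a JVM through JNI, and that JVM can only find Hadoop's classes if the jars are on CLASSPATH. A sketch of the environment pyarrow expects before running dvc push, with hypothetical install paths:

# An empty CLASSPATH produces exactly the NoClassDefFoundError in the logs above.
export HADOOP_HOME=/opt/hadoop                       # hypothetical install path
export JAVA_HOME=/usr/lib/jvm/java                   # hypothetical
export CLASSPATH="$("$HADOOP_HOME/bin/hadoop" classpath --glob)"
export ARROW_LIBHDFS_DIR="$HADOOP_HOME/lib/native"   # directory containing libhdfs.so
dvc push -r myremote -v

In a Jupyter pod like the one in the logs, these need to be set in the environment of the process that runs dvc, not just in a login shell.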

@subodhdere (Author)

Hello Team, we are using Kerberos for authentication.

@efiop (Contributor) commented Jul 3, 2023

@subodhdere (Author)

Hello Team,
I have executed the provided commands related to Kerberos authentication:

dvc remote modify --local myremote kerb_ticket FILE:/tmp/krb5cc_1000
dvc remote add -d myremote hdfs://am01.halo-telekom.com:8020/user/halo/dvc-usecase
dvc remote modify --local myremote user "[email protected]"

You can see the resulting message in the error output:

hdfsBuilderConnect(forceNewInstance=1, nn=hdfs://, port=8020, kerbTicketCachePath=FILE:/tmp/krb5cc_1000, userName=[email protected]) error:
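
Independent of the config above, it's worth confirming that the ticket cache kerb_ticket points at is valid and unexpired; klist can inspect a specific cache (the path and realm below are the ones from this thread):

klist -c FILE:/tmp/krb5cc_1000    # should show an unexpired krbtgt for HALO-TELEKOM.COM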

@efiop (Contributor) commented Jul 5, 2023

@subodhdere Seems like the error is cut off.

@subodhdere (Author) commented Jul 7, 2023

Hello Team, can you please take a look at the full error? It is the same log as the one posted above on Jun 29.


@loveis98 commented Nov 24, 2023

@subodhdere @efiop Hi! I have the same error. Any updates?
@subodhdere How did you deal with the error?

My traceback:

loadFileSystems error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
hdfsBuilderConnect(forceNewInstance=1, nn=hdfs://cdp, port=8020, kerbTicketCachePath=FILE:/tmp/krb5cc_298426831_298426831, userName=user05) error:
(unable to get root cause for java.lang.NoClassDefFoundError)
(unable to get stack trace for java.lang.NoClassDefFoundError)
/arrow/cpp/src/arrow/status.cc:137: Failed to disconnect hdfs client: IOError: HDFS hdfsFS::Disconnect failed. Detail: [errno 9] Bad file descriptor
ERROR: unexpected error - HDFS connection failed
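
For the log above: the same NoClassDefFoundError points at the same missing-jars issue, and since nn=hdfs://cdp looks like an HA nameservice rather than a single NameNode host, the client will additionally need the cluster's hdfs-site.xml resolvable once the jars are in place. With a standard install the Hadoop conf dir is already part of hadoop classpath; a quick sanity check, with a hypothetical conf path:

hadoop classpath | tr ':' '\n' | grep -m1 conf           # the conf dir should be on the classpath
ls "${HADOOP_CONF_DIR:-/etc/hadoop/conf}/hdfs-site.xml"  # HA nameservice definition lives here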
