Saving Checkpoints to S3 bucket #1186

Open
2 tasks done
mhmgad opened this issue Dec 11, 2022 · 1 comment
Labels
🛑 Checkpoints (issues related to checkpoints and resuming training), enhancement (New feature or request)

mhmgad commented Dec 11, 2022

Problem Statement

When training on AWS SageMaker, it is expensive to keep checkpoints on the notebook instance. Being able to upload checkpoints directly to an S3 bucket would save time.

Describe the solution you'd like

Allow S3 bucket URLs as the checkpoint directory:

from pykeen.pipeline import pipeline

result = pipeline(
    model="transe",
    dataset="nations",
    training_kwargs=dict(
        num_epochs=10,
        checkpoint_name="test_checkpoint.pt",
        checkpoint_directory="s3://bucket/checkpoints/",
    ),
)

Describe alternatives you've considered

Uploading checkpoints at the end of training for backup.
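
For reference, this workaround could look roughly like the sketch below. It assumes training ran with a local `checkpoint_directory`, that `boto3` is installed, and that AWS credentials are configured. The helper names `split_s3_url` and `upload_checkpoint` are made up for illustration and are not part of PyKEEN.

```python
from pathlib import Path
from typing import Tuple
from urllib.parse import urlsplit


def split_s3_url(url: str) -> Tuple[str, str]:
    """Split 's3://bucket/prefix/' into (bucket, prefix without surrounding slashes)."""
    parts = urlsplit(url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URL: {url!r}")
    return parts.netloc, parts.path.strip("/")


def upload_checkpoint(local_file: Path, s3_url: str, client=None) -> str:
    """Upload one local checkpoint file below the given S3 URL and return its key."""
    bucket, prefix = split_s3_url(s3_url)
    key = f"{prefix}/{local_file.name}" if prefix else local_file.name
    if client is None:
        # Assumption: boto3 is available and credentials are set up.
        import boto3

        client = boto3.client("s3")
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
    client.upload_file(str(local_file), bucket, key)
    return key
```

After the pipeline finishes, something like `upload_checkpoint(Path("checkpoints/test_checkpoint.pt"), "s3://bucket/checkpoints/")` would copy the file, where the local path is whatever directory was passed as `checkpoint_directory`. The drawback is the one that motivates this issue: nothing reaches S3 until training ends.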

Additional information

No response

Issue Template Checks

  • This is not a bug report (use a different issue template if it is)
  • This is not a question (use the discussions forum instead)
@mhmgad mhmgad added the enhancement New feature or request label Dec 11, 2022
mberr (Member) commented Jan 9, 2023

Hi @mhmgad,

We currently have pykeen.pipeline.PipelineResult.save_to_s3 to upload the final results to S3, but there is no support for checkpoints yet.

We would be happy to see a PR for this. A good place to start would be the method pykeen.training.training_loop.py:TrainingLoop._save_state.
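
One possible shape for such a change, as a rough sketch only: write the checkpoint to a temporary local file first, then hand that file to an uploader. The function `save_then_upload` and its parameters are hypothetical; how this would hook into `_save_state` is left open, and the actual signature of that method is not assumed here.

```python
import tempfile
from pathlib import Path
from typing import Callable


def save_then_upload(
    save_fn: Callable[[Path], None],
    upload_fn: Callable[[Path, str], None],
    s3_url: str,
    name: str,
) -> None:
    """Write a checkpoint to a temporary local file, then pass it to an uploader."""
    with tempfile.TemporaryDirectory() as tmp:
        local_path = Path(tmp) / name
        # save_fn stands in for whatever writes the checkpoint to disk today.
        save_fn(local_path)
        # upload_fn stands in for an S3 upload, e.g. one built on boto3.
        upload_fn(local_path, s3_url)
    # The temporary directory and the local copy are removed on exit.
```

Keeping the save and upload steps as injected callables means the existing local-path behaviour stays untouched and the S3 path can be tested without network access.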

@cthoyt cthoyt added the 🛑 Checkpoints issues related to checkpoints and resuming training label Sep 24, 2023