
Feature: Python script and module to check dataset readiness for data preservation #235

Open
8 of 10 tasks
astrochun opened this issue Jul 8, 2021 · 0 comments
Labels
curation (Pertains to aspects of curation, including workflow management), enhancement (New feature or request), preservation (Related to data preservation), scripts (Script development)
Milestone
v1.2.0

Comments

@astrochun
Contributor

astrochun commented Jul 8, 2021

Summary

With multiple datasets now published, we need to ensure that the curated copies are identical to the published versions. This requires the following steps:

  • A `Preserve` class as part of a sub-package called `preserve` (see the sketch after this list)
    • Save a JSON file of the published metadata from the Figshare API in METADATA
    • Perform checksum verification to ensure the files on the curation server are identical. If they are not, warn about the differences
    • If a checksum is incorrect, retrieve the correct file and save it in DATA (this is partially done, but still needs work and requires testing)
    • Generate symbolic links in DATA from ORIGINAL_DATA if files are unchanged and not present in DATA (see the housekeeping sketch below). See: Enhancement: Symbolic links to ORIGINAL_DATA for DATA when publishing #192
    • Delete all hidden files (e.g., .DS_Store, ._*)
    • Delete old README.txt files (CL: My recollection is that this was tested with a couple of old README.txt files)
    • To avoid overwrites, change the file permissions to read-only (this will be done in ReBACH in 6.Archived)
  • A Python script called `preserve_checks` that takes two main inputs/parameters: --article-id and --version-no (see the CLI sketch below)
  • Add an option to do a full run with updates. By default, the script would perform a dry run, which allows checking a dataset without retrieving/updating the published dataset or creating new metadata files for the record.
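
A minimal sketch of what the `Preserve` class could look like, assuming the public Figshare API (in particular its `files`/`computed_md5` fields) and a simple DATA/METADATA folder layout. The class name comes from this issue; the method names, directory layout, and JSON file naming below are illustrative assumptions only.

```python
# Sketch of the proposed ``preserve`` sub-package (assumed layout and names).
import hashlib
import json
from pathlib import Path

import requests

FIGSHARE_API = "https://api.figshare.com/v2"


class Preserve:
    def __init__(self, article_id: int, version_no: int, curation_dir: Path):
        self.article_id = article_id
        self.version_no = version_no
        self.data_dir = curation_dir / "DATA"          # assumed folder layout
        self.metadata_dir = curation_dir / "METADATA"  # assumed folder layout

    def get_published_metadata(self) -> dict:
        """Retrieve the published article metadata from the Figshare API."""
        url = f"{FIGSHARE_API}/articles/{self.article_id}/versions/{self.version_no}"
        response = requests.get(url)
        response.raise_for_status()
        return response.json()

    def save_metadata(self, metadata: dict) -> None:
        """Save a JSON copy of the published metadata in METADATA."""
        self.metadata_dir.mkdir(parents=True, exist_ok=True)
        out_file = self.metadata_dir / f"{self.article_id}_v{self.version_no}.json"
        out_file.write_text(json.dumps(metadata, indent=2))

    def verify_checksums(self, metadata: dict) -> list:
        """Compare curated files against the MD5s reported by Figshare.

        Returns the names of files that are missing or whose checksums differ.
        """
        mismatches = []
        for file_record in metadata.get("files", []):
            local_file = self.data_dir / file_record["name"]
            if not local_file.exists():
                mismatches.append(file_record["name"])
                continue
            # Whole-file read is fine for a sketch; chunked hashing would be
            # preferable for large datasets.
            md5 = hashlib.md5(local_file.read_bytes()).hexdigest()
            if md5 != file_record["computed_md5"]:
                print(f"WARNING: checksum differs for {file_record['name']}")
                mismatches.append(file_record["name"])
        return mismatches

    def remove_hidden_files(self) -> None:
        """Delete hidden files such as .DS_Store and ._* resource forks."""
        for hidden in self.data_dir.rglob(".*"):
            if hidden.is_file():
                hidden.unlink()
```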
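
The symbolic-link and read-only steps could be handled by small helpers along these lines. The ORIGINAL_DATA/DATA names follow the issue, while the function names are hypothetical, and the read-only step may instead live in ReBACH as noted above.

```python
# Housekeeping helpers (hypothetical names; directory layout assumed as above).
import stat
from pathlib import Path


def link_unchanged_files(original_dir: Path, data_dir: Path) -> None:
    """Create symbolic links in DATA for files present only in ORIGINAL_DATA."""
    for src in original_dir.iterdir():
        dest = data_dir / src.name
        if not dest.exists():
            dest.symlink_to(src)  # DATA entry points back at the original copy


def set_read_only(path_root: Path) -> None:
    """Drop write permission so preserved files cannot be overwritten."""
    for path in path_root.rglob("*"):
        if path.is_file():
            path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
```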
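
A possible `preserve_checks` entry point. The --article-id and --version-no parameters are from this issue; the --full-run and --curation-dir flags and the import path are assumptions made only to show how the dry-run default could work.

```python
# Hypothetical command-line entry point for ``preserve_checks``.
import argparse
from pathlib import Path

from preserve import Preserve  # assumed import path; depends on package layout


def main():
    parser = argparse.ArgumentParser(
        description="Check dataset readiness for data preservation")
    parser.add_argument("--article-id", type=int, required=True,
                        help="Figshare article ID")
    parser.add_argument("--version-no", type=int, required=True,
                        help="Published version number")
    parser.add_argument("--full-run", action="store_true",
                        help="Retrieve/update files and write metadata "
                             "(default is a dry run that only reports)")
    parser.add_argument("--curation-dir", type=Path, default=Path("."),
                        help="Root of the curation folder for this dataset")
    args = parser.parse_args()

    prs = Preserve(args.article_id, args.version_no, args.curation_dir)
    metadata = prs.get_published_metadata()
    mismatches = prs.verify_checksums(metadata)

    if not args.full_run:
        print(f"Dry run: {len(mismatches)} file(s) would need attention")
        return

    prs.save_metadata(metadata)
    prs.remove_hidden_files()
    # Retrieval of corrected files and symbolic-link generation would go here.


if __name__ == "__main__":
    main()
```

For example, a dry run might be invoked as `preserve_checks --article-id <id> --version-no 1`, with --full-run added once the report looks correct.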

Objectives

Proposal

Testing notes

Additional notes

Implemented in: TBD

@astrochun added the curation, enhancement, preservation, and scripts labels on Jul 8, 2021
@astrochun added this to the v1.2.0 milestone on Jul 8, 2021
@astrochun self-assigned this on Jul 8, 2021
@astrochun changed the title from "Feature: Script to check dataset readiness for data preservation" to "Feature: Python script and module to check dataset readiness for data preservation" on Jul 9, 2021
@astrochun removed their assignment on Oct 12, 2021