Data management #8

Open
eboileau opened this issue Jul 20, 2022 · 1 comment

eboileau commented Jul 20, 2022

Issues and features related to data upload.


  1. Data upload (H5AD)
  • [BUG] Currently, if uploading in H5AD format, the original data is left under uploads/files. We can either handle this case differently, or just remove the original data.
  • [FEATURE] We need to determine how best to make use of existing unstructured metadata, layers, or observation/variable-level matrices (such as UMAP, etc.).
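
For the [BUG] item, one option is simply to discard the uploaded original once it has been processed. A minimal sketch, assuming the processing step can be wrapped in a callable; `process_and_discard` and `process` are hypothetical names, not part of the current upload script:

```python
from pathlib import Path


def process_and_discard(upload_path, process):
    """Run `process` on an uploaded file, then delete the original.

    `process` is a hypothetical callable standing in for whatever
    converts or imports the uploaded H5AD file. The original is removed
    whether processing succeeds or fails, so nothing is left behind in
    the uploads directory.
    """
    upload_path = Path(upload_path)
    try:
        return process(upload_path)
    finally:
        # missing_ok avoids an error if `process` already moved the file.
        upload_path.unlink(missing_ok=True)
```

Deleting in `finally` also covers failed uploads. If the original should be kept for debugging after a failure, the `unlink` call would have to move to the success path instead.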

  1. Data upload (large data) - [ENHANCEMENT]

For relatively large datasets (e.g. a 10 GB H5AD file), the current upload is not suitable: it will take forever or be interrupted.
Meanwhile, I added a new apache2 config, unlimited_uploads.conf, with LimitRequestBody 0, and further raised the PHP limits:

post_max_size = 0
upload_max_filesize = 30000M
max_execution_time = 300

but there may be other timeout configurations that could also interrupt PHP execution. We need to think of a longer-term solution. See #14; I think this will work, at least for now. If we keep this solution, we should clean up the PHP upload script and add proper logging.
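
As a quick sanity check on the size limits above, here is a rough sketch that converts PHP shorthand sizes (K, M, G, in powers of 1024) to bytes and tests whether a file of a given size would pass. The helper names are hypothetical, the parsing is stricter than PHP's own, and it ignores multipart overhead and the timeout settings, so it is only an approximation:

```python
import re

# Multipliers for PHP shorthand byte values (powers of 1024).
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def php_size_to_bytes(value):
    """Convert a PHP ini size such as '30000M' to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?)\s*", value, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognised PHP size value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.upper()]


def fits_limits(file_size, upload_max_filesize, post_max_size):
    """Return True if a file of `file_size` bytes passes both limits."""
    if file_size > php_size_to_bytes(upload_max_filesize):
        return False
    post_limit = php_size_to_bytes(post_max_size)
    # post_max_size = 0 disables the limit on the POST body.
    if post_limit != 0 and file_size > post_limit:
        return False
    return True


# A 10 GiB file checked against the settings quoted above.
print(fits_limits(10 * 1024**3, "30000M", "0"))
```

With these settings the size limits themselves should not be the blocker for a 10 GB file, which points back at max_execution_time and other timeouts as the more likely cause of interrupted uploads.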


  1. Data upload (general)
  • [QUESTION] It looks like the original metadata is left under uploads/files. This might raise security issues. Should we remove it?
  • [DOCUMENTATION] We need to update the documentation, in particular for H5AD (and prioritize this format, at least for scRNA-seq).
  • [REMARK] File names must match exactly, otherwise the upload fails without any meaningful error message, e.g. when using gene.tab instead of genes.tab. We should either document this clearly, allow some fuzziness in file names during upload, or make sure an appropriate error message is displayed.
  • [REMARK] For failed uploads, some files may remain under /tmp or files/uploads.
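
For the file name remark, a meaningful error message could be produced with a small check before processing. A minimal sketch using `difflib` from the standard library; only genes.tab is named in this issue, so the other entries in `EXPECTED_FILES` are assumptions, and `check_upload_names` is a hypothetical helper:

```python
import difflib

# Assumption: only genes.tab is confirmed above; the other two names
# are placeholders for whatever the uploader actually requires.
EXPECTED_FILES = ("expression.tab", "genes.tab", "observations.tab")


def check_upload_names(uploaded, expected=EXPECTED_FILES):
    """Return a list of human-readable problems with uploaded file names."""
    uploaded = list(uploaded)
    messages = []
    for name in uploaded:
        if name in expected:
            continue
        # Suggest the closest expected name, if any is similar enough.
        close = difflib.get_close_matches(name, expected, n=1, cutoff=0.8)
        if close:
            messages.append(
                f"Unexpected file '{name}': did you mean '{close[0]}'?"
            )
        else:
            messages.append(f"Unexpected file '{name}'.")
    for name in expected:
        if name not in uploaded:
            messages.append(f"Missing required file '{name}'.")
    return messages


for problem in check_upload_names(
    ["expression.tab", "gene.tab", "observations.tab"]
):
    print(problem)
```

This only reports problems and does not rename anything. Accepting near matches automatically would be the "fuzziness" option, which seems riskier than a clear message.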
@eboileau eboileau added the issue list Generic issue/feature list by theme label Jul 20, 2022
@eboileau eboileau self-assigned this Jul 20, 2022

eboileau commented Aug 4, 2022

Commit 1981d80 addresses H5AD data and metadata upload.

As for using existing unstructured metadata, layers, or observation/variable-level matrices, I need more time to figure out exactly how this is handled. Available display types are taken from the columns (adata.obs) if primary, and/or from obsm if there are stored analyses. However, if obsm entries such as 'X_pca', 'X_tsne', or 'X_umap' are present (but not in the columns), they are shown as display parameters (e.g. X, Y), but the data is not actually accessible, i.e. we get ERROR: Value of 'x' is not the name of a column in 'data_frame'. So observation matrices need to be in the columns to be usable in the curator view, and the remaining unstructured metadata, layers, etc. are unused in primary analyses.
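
As a possible workaround until this is handled properly, the obsm coordinates could be copied into the columns before upload. A rough sketch, assuming `obs` supports column assignment (a pandas DataFrame such as adata.obs, or a plain dict) and `obsm[key]` is a dense 2D array or nested list; the function name and the `X_umap_1`, `X_umap_2` column naming are hypothetical:

```python
def obsm_to_obs_columns(obs, obsm, key, prefix=None):
    """Copy each dimension of obsm[key] into its own column of obs.

    Returns the list of column names that were added. Dense data only:
    sparse matrices would need converting first.
    """
    prefix = prefix or key
    # Materialise rows so both nested lists and array rows work.
    rows = [list(row) for row in obsm[key]]
    if not rows:
        return []
    added = []
    for dim in range(len(rows[0])):
        column = f"{prefix}_{dim + 1}"
        obs[column] = [row[dim] for row in rows]
        added.append(column)
    return added


# Toy stand-ins for adata.obs and adata.obsm, two cells with 2D coordinates.
obs = {"cell_type": ["a", "b"]}
obsm = {"X_umap": [[0.1, 0.2], [0.3, 0.4]]}
print(obsm_to_obs_columns(obs, obsm, "X_umap"))
```

With a real AnnData object this would be called as `obsm_to_obs_columns(adata.obs, adata.obsm, "X_umap")`. I have not checked how the curator view treats the extra columns beyond what is described above.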
