Optimize file repository zip exports #1668
Thanks for the useful feature. One thing I would like to suggest is to NOT use multi-part (split) zip files, so that individual parts can be downloaded and consumed separately. For data analysis and testing, we used to pull contracts from the sanctuary project, but it is no longer updated/maintained. A (manually triggered) GH workflow can then run tests/analysis against a blockchain across a few small jobs. Each job analyzes a subset of contracts in that blockchain (example), reporting back a combined result. Sourcify zip exports would be immensely helpful for us to continue testing against up-to-date contracts, but currently they cannot be used in GitHub Actions (or most other CI providers), as the combined zip file (32 GB) is too big for a single job/machine.
Thanks for the feedback! @OmarTawfik Are you running the test/analysis on the whole repo or on a subset? You mention "Each job analyzes a subset of contracts", but my question is: when all jobs are done, have you analyzed all contracts in the repo? From what I understand, what you need is not the diffs between versions (or between commits, in sanctuary's case), but a way to "shard" or split the repo. In sanctuary's case it's just the first two characters of the address; I think we can easily do it similarly. Are you also aware of the DB dumps? https://docs.sourcify.dev/docs/repository/sourcify-database/ I don't know what exactly your analysis does, but for a tool like slang, I think they contain much more information, like the bytecodes, some compiler outputs, etc.
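For illustration, a minimal sketch of that sanctuary-style sharding (two hex characters = 256 buckets), with jobs claiming buckets round-robin. All names here are hypothetical, not anything Sourcify or sanctuary ship:

```python
# Bucket contracts by the first two hex characters of the address (256
# buckets) and give each CI job an even, deterministic slice.

def shard_key(address: str) -> str:
    """Two-hex-char bucket for a 0x-prefixed contract address."""
    return address.removeprefix("0x")[:2].lower()

def addresses_for_job(addresses: list[str], job_index: int, total_jobs: int) -> list[str]:
    """Round-robin the 256 buckets across jobs, then filter the address list."""
    buckets = [f"{i:02x}" for i in range(256)]
    mine = set(buckets[job_index::total_jobs])
    return [a for a in addresses if shard_key(a) in mine]

# e.g. job 3 of 16 analyzes buckets 03, 13, 23, ..., f3
print(addresses_for_job(["0x03deadbeef", "0x04cafe"], 3, 16))  # ['0x03deadbeef']
```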
Thank you for the tip! I looked at compiled_contracts_50000_55000.parquet for example. It has the compilation version, options, and source files, which are perfect. For our analysis, it sounds like we can just shard over the individual parquet files. I see 6M […].
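For reference, a sketch of how one of those parquet shards could be consumed in a single job with pyarrow, assuming the file was already downloaded. The column names are not documented here, so the sketch inspects the schema rather than trusting any hard-coded names:

```python
# Stream one DB-dump shard so a CI job never holds the whole file in memory.
import pyarrow.parquet as pq

def analyze(row: dict) -> None:
    """Hypothetical per-contract analysis hook."""
    print(sorted(row.keys()))

pf = pq.ParquetFile("compiled_contracts_50000_55000.parquet")
print(pf.schema_arrow)  # discover the real column names first

for batch in pf.iter_batches(batch_size=1_000):
    for row in batch.to_pylist():
        analyze(row)
```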
Yes exactly. Essentially a […].
Oh! This would actually block us from using it for the same reason. IIUC, that means even if a job can fetch a single part of […], it still couldn't get the sources on its own.
Oh no :( Right, good that I mentioned it. I don't think we can change our exports to include sources in […]. cc: @marcocastignoli
In that case, what do you think of the original suggestion for the File Repository v2 to export individual zip files instead of zip parts that need to be concatenated before unzipping? https://repo-backup.sourcify.app/manifest.json I don't know if there is a limitation/requirement on your end for how many parts it is split into, but your idea above of splitting into 256 parts (the first two hex characters of the address) would work well for us.
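To make the concatenation problem concrete, a rough sketch of what a consumer has to do today, assuming a manifest that lists the part paths. The real manifest.json schema may differ; the `files` key and the path layout are assumptions:

```python
import json
import shutil
import urllib.request

BASE = "https://repo-backup.sourcify.app"

with urllib.request.urlopen(f"{BASE}/manifest.json") as resp:
    manifest = json.load(resp)

# With split parts, every part must be fetched and concatenated before a
# single contract can be unzipped. This is what breaks per-job CI downloads.
with open("repo.zip", "wb") as out:
    for part_path in manifest["files"]:  # assumed field name
        with urllib.request.urlopen(f"{BASE}/{part_path}") as part:
            shutil.copyfileobj(part, out)
```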
Just to recap the conversation here, we can optimize the archives by:
1. Splitting the .zips by chain and the first byte of the address, so that each part is an independently usable zip.
2. Providing incremental exports, so that consumers only need to download the latest changes.
My take on this: […]
Assuming the zip file would contain the full data of these contracts (source files + compilation metadata) without needing to fetch other zip files, this would work for us.
@kuzdogan if you also think this works, I can create an issue for implementing this.
The 1st point of splitting the .zips based on the chains and the first byte sounds good to me.
However, I am not sure about the incremental exports. How do they reconcile with the first point? Do you want to split them first by the export date and then by chain and address? So something like this:
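A hypothetical sketch of such a date-prefixed layout; the path format and names are assumptions made only to visualize the idea, not anything Sourcify ships:

```python
from datetime import date

def archive_path(export_date: date, chain_id: int, first_byte: int) -> str:
    # <export date>/<chain id>/<first address byte>.zip
    return f"{export_date.isoformat()}/{chain_id}/{first_byte:02x}.zip"

print(archive_path(date(2024, 1, 1), 1, 0x00))  # 2024-01-01/1/00.zip
print(archive_path(date(2024, 1, 1), 1, 0xFF))  # 2024-01-01/1/ff.zip
```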
I just wonder if we are trying to solve an already solved problem. Maybe there's a tool or library to handle such exports? Still, it should not be too much work to do this ourselves.
Honestly, I’m not a fan of incremental exports in this context. While I see the advantage of being able to download only the latest changes, it also comes with several drawbacks, such as creating a ‘dirty repository’ with duplicate partial and full matches, as well as increased difficulty in downloading and assembling the repository from multiple files. Aside from this, I'm searching for existing tools.
We decided to split the archive by chain and first byte, as explained above: #1695. E.g.:
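A sketch of what this buys a CI consumer: each job fetches only the shards it was assigned, independently of all others. The host and the file naming scheme below are assumptions, not the format shipped in #1695:

```python
import urllib.request

BASE = "https://repo-backup.sourcify.app"  # assumed host

def fetch_shard(chain_id: int, first_byte: int) -> None:
    name = f"{chain_id}.{first_byte:02x}.zip"  # assumed naming scheme
    urllib.request.urlretrieve(f"{BASE}/{name}", name)

# A job assigned bytes 0x00-0x0f of mainnet makes only 16 small downloads.
for b in range(0x10):
    fetch_shard(1, b)
```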
Regarding incremental exports, we have opted to provide the same functionality through a different method. This decision will be the subject of further research in a separate issue.
Following #1586, we have received some feedback on how the filesystem repo zips are structured. We can optimize this a little more for users:
- Split the zips into smaller chunks (`1-0-10000.zip`, `1-10000-20000.zip`...) and combine the long tail of chains together (Revisit and update Sourcify integration modes otterscan/otterscan#2472 (comment))
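As a rough illustration of the split-plus-long-tail idea, a sketch with illustrative thresholds. The archive names for large chains follow the `1-0-10000.zip` pattern from this issue; the threshold and the long-tail archive name are assumptions:

```python
RANGE_SIZE = 10_000      # contracts per archive for large chains
TAIL_THRESHOLD = 10_000  # chains below this go into a combined archive

def archive_names(chain_id: int, contract_count: int) -> list[str]:
    if contract_count < TAIL_THRESHOLD:
        return ["long-tail.zip"]  # shared archive for small chains (assumed name)
    return [
        f"{chain_id}-{start}-{start + RANGE_SIZE}.zip"
        for start in range(0, contract_count, RANGE_SIZE)
    ]

print(archive_names(1, 25_000))   # ['1-0-10000.zip', '1-10000-20000.zip', '1-20000-30000.zip']
print(archive_names(424242, 12))  # ['long-tail.zip']
```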