
Optimize file repository zip exports #1668

Closed
kuzdogan opened this issue Sep 27, 2024 · 13 comments
@kuzdogan
Member

Following #1586 we have received some feedback on how the filesystem repo zips are structured. We can optimize this a little more for users:

  1. Incremental exports: We can pack the zips according to the timestamps of files (a minimal sketch of selecting files by timestamp follows this list). If we run the export every week, then a user would only need to download the new week's zips instead of downloading everything from scratch. (See Discord msg). One problem with this approach is when a contract goes from partial to perfect: the contract will appear as perfect in the new week's zip but will also exist as partial in the previous weeks' zips.
  2. Zipping by chain: Usually people are interested in a subset of all chains. It would make sense to put chains with many verified contracts into their own zips and split them (e.g. for Mainnet `1-0-10000.zip`, `1-10000-20000.zip`, ...) and to combine the long tail of chains together (Revisit and update Sourcify integration modes otterscan/otterscan#2472 (comment))
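
As a rough illustration of the timestamp idea in point 1, a minimal sketch (the repository path, cutoff date, and output zip name are assumptions, not the actual export job):

```python
import os
from datetime import datetime, timezone

# Collect only files modified since the previous export run, so each week's
# zip contains just the new or changed contracts.
last_export = datetime(2024, 9, 20, tzinfo=timezone.utc)
new_files = []
for root, _dirs, files in os.walk("repository/contracts"):
    for name in files:
        path = os.path.join(root, name)
        mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
        if mtime > last_export:
            new_files.append(path)
# new_files would then be packed into e.g. 2024-09-27.zip
```
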
@OmarTawfik

Thanks for the useful feature. One thing I would like to suggest is to NOT use partial zip files, so that individual parts can be downloaded/consumed separately.

For data analysis or testing, we used to pull contracts from the sanctuary project, but it is no longer updated/maintained. A (manually triggered) GH workflow can then run tests/analysis against a blockchain across a few small jobs. Each job analyzes a subset of contracts in that blockchain (example), reporting back a combined result.

Sourcify zip exports would be immensely helpful for us to continue testing against up-to-date contracts, but currently they cannot be used in GitHub Actions (or most other CI providers) as the combined zip file (32 GB) is too big for a single job/machine.

@kuzdogan
Member Author

Thanks for the feedback! @OmarTawfik

Are you running the test/analysis on the whole repo or on a subset? You mention "Each job analyzes a subset of contracts", but my question is: when all jobs are done, have you analyzed all contracts in the repo?

From what I understand, what you need is not the diffs between versions (or each commit in sanctuary's case), but a way to "shard" or split the repo. In sanctuary's case it's just the first two chars of the address. I think we can easily do it similarly.
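
For illustration, a minimal sketch of that kind of address-prefix sharding (the helper name is made up for this example):

```python
# Group contracts by the first two hex chars (i.e. the first byte) of the
# address, similar to sanctuary's directory layout.
def shard_key(address: str) -> str:
    return address.lower().removeprefix("0x")[:2]

shard_key("0x00a329c0648769a73afac7f9381e08fb43dbea72")  # -> "00"
```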

Are you also aware of the DB dumps? https://docs.sourcify.dev/docs/repository/sourcify-database/

I don't know what exactly your analysis does, but for a tool like slang I think it contains much more information, like the bytecodes, some compiler outputs, etc.

@OmarTawfik

Thank you for the tip!

I looked at compiled_contracts_50000_55000.parquet for example. It has the compilation version, options, and source files, which are perfect. For our analysis, it sounds like we can just shard over the individual compiled_contracts parts (173 so far), as long as the individual .parquet files can be read separately.
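
For example, a minimal sketch of reading one part on its own (assuming pandas with pyarrow; the exact columns depend on the dump schema):

```python
import pandas as pd

# Read a single compiled_contracts part by itself; a CI job would only
# download the part it is assigned instead of the whole dump.
df = pd.read_parquet("compiled_contracts_50000_55000.parquet")
print(len(df), "compilations in this part")
print(list(df.columns))  # compiler version, settings, sources, etc.
```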

I see 6M verified_contracts. I assume that is all the "known" contracts in Ethereum mainnet. But there are only 865K compiled_contracts. Is that because of deduplication (5M+ duplicates)?

@kuzdogan
Member Author

Yes exactly. Essentially a verified_contract is a mapping between a compiled_contract and a contract_deployment. A lot of deployments are from the same compilation. We'll have even more deduplication after we spin out the sources table and deduplicate sources #1615
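
To make the mapping concrete, a rough sketch against the dumps (the verified_contracts file name and the compilation_id column below are assumptions, not the exact schema):

```python
import pandas as pd

# A verified_contract links a compiled_contract to a contract_deployment, and
# many deployments share one compilation; counting distinct compilation ids
# shows the deduplication. File and column names are assumptions.
verified = pd.read_parquet("verified_contracts_0_50000.parquet")
print(len(verified), "verified contracts in this part")
print(verified["compilation_id"].nunique(), "distinct compilations referenced")
```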

@OmarTawfik

Oh! This would actually block us from using it, for the same reason. IIUC, that means even if a job can fetch a single part of compiled_contracts to test it separately, it still needs to download all sources parts to reconstruct the compilation, which means we will run out of space again 😢

@kuzdogan
Member Author

Oh no :( right, good that I mentioned it.

I don't think we can change our exports to include the sources in compiled_contracts; that's the whole point of deduplicating them... What's your size limit?

cc: @marcocastignoli

@OmarTawfik

OmarTawfik commented Sep 30, 2024

In that case, what do you think of the original suggestion for the File Repository v2 to export individual zip files instead of zip parts that need to be concatenated to be unzipped? https://repo-backup.sourcify.app/manifest.json

I don't know if there is a limitation/requirement on your end for how many parts it is split into, but your idea above of splitting into 256 parts (00-FF, based on the first two chars) sounds great!

@marcocastignoli
Member

Just to recap the conversation here:

We can optimize the archives by:

  1. Making each part of the archive unzippable on its own, without concatenating it with the others, by splitting the archive per chain, per first byte of the address, or both
  2. Performing incremental exports

My take on this:

  1. I think it totally makes sense. We just need to find the best way to split the archive. IMO we can split per chain and per first byte, i.e. `match_type.chain.first_byte.zip` (e.g. `full_match.1.00.zip` will contain all mainnet full_match contracts with the address starting with `00`); see the sketch after this list.
  2. Even if there are old partial matches in the repository, by default our APIs first check for a `full_match` before trying the `partial_match`. So I don't think it's a huge problem other than having a dirty repository, at least for us. For other use cases this could be a known limitation of our repository export.
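
A tiny sketch of that naming scheme (the helper is illustrative only):

```python
# Build the proposed match_type.chain.first_byte.zip file name.
def zip_name(match_type: str, chain_id: int, address: str) -> str:
    first_byte = address.lower().removeprefix("0x")[:2]
    return f"{match_type}.{chain_id}.{first_byte}.zip"

zip_name("full_match", 1, "0x00a329c0648769a73afac7f9381e08fb43dbea72")
# -> "full_match.1.00.zip"
```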

@OmarTawfik

Assuming the zip file would contain the full data of these contracts (source files + compilation metadata) without needing to fetch other zip files, `match_type.chain.first_byte.zip` sounds great! Thanks 👍

@marcocastignoli
Member

> My take on this:
>
> 1. I think it totally makes sense. We just need to find the best way to split the archive. IMO we can split per chain and per first byte `match_type.chain.first_byte.zip` (e.g. `full_match.1.00.zip` will contain all mainnet full_match contracts with the address starting with `00`)
>
> 2. Even if there are old partial matches in the repository, by default our APIs check first the `full_match` existence before trying the `partial_match`. So I don't think it's a huge problem other than having a dirty repository, at least for us. For other use cases this could be a known limitation of our repository export.

@kuzdogan if you also think this works I can create an issue for implementing this

@kuzdogan
Member Author

kuzdogan commented Oct 8, 2024

The first point, splitting the .zips based on the chain and the first byte (0xab..), makes sense and is straightforward, so I think we can do that already. Something like this:

11155111/
├── full_match/
│   ├── 0x00/
│   │   ├── 0x00a329c0648769a73afac7f9381e08fb43dbea72/
│   │   └── 0x00b5ce2620d3f46c145b38f5b0d8d0e145952f6e/
│   └── 0xff/
│       ├── 0xff98a63f0a8662c3014f0d5a5077371a81e4e341/
│       └── 0xffb52bc3d04dfad89490f31011a4a56f5d4f6985/
└── partial_match/
    ├── 0x00/
    │   ├── 0x00a32.../
    │   └── 0x00b5c.../
    └── 0xff/
        ├── 0xff98a.../
        └── 0xffb52.../

However, I am not sure about the incremental exports. How do they reconcile with the first one? Do you want to split first by the export date and then by chain and address? Something like this:

├── 2023-10-08/
│   └── 11155111/
│       ├── full_match/
│       │   ├── 0x00/
│       │   │   ├── 0x00a329c0648769a73afac7f9381e08fb43dbea72/
│       │   │   └── 0x00b5ce2620d3f46c145b38f5b0d8d0e145952f6e/
│       │   └── 0xff/
│       │       ├── 0xff98a63f0a8662c3014f0d5a5077371a81e4e341/
│       │       └── 0xffb52bc3d04dfad89490f31011a4a56f5d4f6985/
│       └── partial_match/
│           ├── 0x00/
│           │   ├── 0x00a32.../
│           │   └── 0x00b5c.../
│           └── 0xff/
│               ├── 0xff98a.../
│               └── 0xffb52.../
└── 2023-10-15/
    └── 11155111/
        ├── full_match/
        │   ├── 0x00/
        │   │   ├── 0x00a329c0648769a73afac7f9381e08fb43dbea72/
        │   │   └── 0x00b5ce2620d3f46c145b38f5b0d8d0e145952f6e/
        │   └── 0xff/
        │       ├── 0xff98a63f0a8662c3014f0d5a5077371a81e4e341/
        │       └── 0xffb52bc3d04dfad89490f31011a4a56f5d4f6985/
        └── partial_match/
            ├── 0x00/
            │   ├── 0x00a32.../
            │   └── 0x00b5c.../
            └── 0xff/
                ├── 0xff98a.../
                └── 0xffb52.../

I just wonder if we are trying to solve an already solved problem. Maybe there's a tool or library to handle such exports? Still, it should not be too much work to do this.

@marcocastignoli
Member

Honestly, I’m not a fan of incremental exports in this context. While I see the advantage of being able to download only the latest changes, it also comes with several drawbacks, such as creating a ‘dirty repository’ with duplicate partial and full matches, as well as increased difficulty in downloading and assembling the repository from multiple files.

Aside from this, I'm searching for existing tools.

@kuzdogan kuzdogan assigned marcocastignoli and unassigned kuzdogan Oct 10, 2024
@marcocastignoli
Member

marcocastignoli commented Oct 14, 2024

We decided to split the archive by chain and first byte as explained above: #1695

E.g.

11155111/
├── full_match/
│   ├── 0x00/
│   │   ├── 0x00a329c0648769a73afac7f9381e08fb43dbea72/
│   │   └── 0x00b5ce2620d3f46c145b38f5b0d8d0e145952f6e/
│   └── 0xff/
│       ├── 0xff98a63f0a8662c3014f0d5a5077371a81e4e341/
│       └── 0xffb52bc3d04dfad89490f31011a4a56f5d4f6985/
└── partial_match/
    ├── 0x00/
    │   ├── 0x00a32.../
    │   └── 0x00b5c.../
    └── 0xff/
        ├── 0xff98a.../
        └── 0xffb52.../

Regarding incremental exports, we have opted to provide the same functionality through a different method; this will be the subject of further research in a separate issue.
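
For consumers this means fetching only the parts they need, along these lines (a minimal sketch; the download URL pattern is an assumption, not the final location):

```python
import urllib.request
import zipfile

# Fetch and unpack a single part; the URL below is an assumed location for
# illustration, not the final one.
name = "full_match.1.00.zip"
urllib.request.urlretrieve(f"https://repo-backup.sourcify.app/{name}", name)
with zipfile.ZipFile(name) as zf:
    zf.extractall("repository/")
```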
