
Optimize file repository zip exports #1668

Closed
kuzdogan opened this issue Sep 27, 2024 · 13 comments
@kuzdogan
Member

Following #1586 we have received some feedback on how the filesystem repo zips are structured. We can optimize this a little more for users:

  1. Incremental exports: We can pack the zips according to the timestamps of files (a minimal sketch of selecting files by timestamp follows this list). If we run the export every week, then a user would only need to download the new week's zips instead of downloading everything from scratch. (See Discord msg). One problem with this approach is when a contract goes from partial to perfect: the contract will appear as perfect in the new week's zip but will also exist as partial in the previous weeks' zips.
  2. Zipping by chain: Usually people are interested in a subset of all chains. It would make sense to put chains with many verified contracts into their own zips and split them (e.g. for Mainnet `1-0-10000.zip`, `1-10000-20000.zip`, ...) and to combine the long tail of chains together (Revisit and update Sourcify integration modes otterscan/otterscan#2472 (comment))
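
As a rough illustration of the timestamp idea in point 1, a minimal sketch (the repository path, cutoff date, and output zip name are assumptions, not the actual export job):

```python
import os
from datetime import datetime, timezone

# Collect only files modified since the previous export run, so each week's
# zip contains just the new or changed contracts.
last_export = datetime(2024, 9, 20, tzinfo=timezone.utc)
new_files = []
for root, _dirs, files in os.walk("repository/contracts"):
    for name in files:
        path = os.path.join(root, name)
        mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
        if mtime > last_export:
            new_files.append(path)
# new_files would then be packed into e.g. 2024-09-27.zip
```
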
@OmarTawfik

Thanks for the useful feature. One thing I would like to suggest is to NOT use partial zip files, so that individual parts can be downloaded/consumed separately.

For data analysis or testing, we used to pull contracts from the sanctuary project, but it is no longer updated/maintained. A (manually triggered) GH workflow can then run tests/analysis against a blockchain across a few small jobs. Each job analyzes a subset of contracts in that blockchain (example), reporting back a combined result.

Sourcify zip exports would be immensely helpful for us to continue testing against up-to-date contracts, but currently they cannot be used in GitHub Actions (or most other CI providers) as the combined zip file (32 GB) is too big for a single job/machine.

@kuzdogan
Member Author

Thanks for the feedback! @OmarTawfik

Are you running the test/analysis on the whole repo or on a subset? You mention "Each job analyzes a subset of contracts", but my question is: when all jobs are done, have you analyzed all contracts in the repo?

From what I understand, what you need is not the diffs between versions (or each commit in sanctuary's case), but a way to "shard" or split the repo. In sanctuary's case it's just the first two chars of the address. I think we can easily do it similarly.
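
For illustration, a minimal sketch of that kind of address-prefix sharding (the helper name is made up for this example):

```python
# Group contracts by the first two hex chars (i.e. the first byte) of the
# address, similar to sanctuary's directory layout.
def shard_key(address: str) -> str:
    return address.lower().removeprefix("0x")[:2]

shard_key("0x00a329c0648769a73afac7f9381e08fb43dbea72")  # -> "00"
```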

Are you also aware of the DB dumps? https://docs.sourcify.dev/docs/repository/sourcify-database/

I don't know what exactly your analysis does, but for a tool like slang I think it contains much more information, like the bytecodes, some compiler outputs, etc.

@OmarTawfik

Thank you for the tip!

I looked at compiled_contracts_50000_55000.parquet for example. It has the compilation version, options, and source files, which are perfect. For our analysis, it sounds like we can just shard over the individual compiled_contracts parts (173 so far), as long as the individual .parquet files can be read separately.
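
For example, a minimal sketch of reading one part on its own (assuming pandas with pyarrow; the exact columns depend on the dump schema):

```python
import pandas as pd

# Read a single compiled_contracts part by itself; a CI job would only
# download the part it is assigned instead of the whole dump.
df = pd.read_parquet("compiled_contracts_50000_55000.parquet")
print(len(df), "compilations in this part")
print(list(df.columns))  # compiler version, settings, sources, etc.
```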

I see 6M verified_contracts. I assume that is all the "known" contracts in Ethereum mainnet. But there are only 865K compiled_contracts. Is that because of deduplication (5M+ duplicates)?

@kuzdogan
Member Author

Yes exactly. Essentially a verified_contract is a mapping between a compiled_contract and a contract_deployment. A lot of deployments are from the same compilation. We'll have even more deduplication after we spin out the sources table and deduplicate sources #1615
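
To make the mapping concrete, a rough sketch against the dumps (the verified_contracts file name and the compilation_id column below are assumptions, not the exact schema):

```python
import pandas as pd

# A verified_contract links a compiled_contract to a contract_deployment, and
# many deployments share one compilation; counting distinct compilation ids
# shows the deduplication. File and column names are assumptions.
verified = pd.read_parquet("verified_contracts_0_50000.parquet")
print(len(verified), "verified contracts in this part")
print(verified["compilation_id"].nunique(), "distinct compilations referenced")
```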

@OmarTawfik

Oh! This would actually block us from using it, for the same reason. IIUC, that means even if a job can fetch a single part of compiled_contracts to test it separately, it still needs to download all sources parts to reconstruct the compilation, which means we will run out of space again 😢

@kuzdogan
Member Author

Oh no :( right, good that I mentioned it.

I don't think we can change our exports to include the sources in compiled_contracts; that's the whole point of deduplicating them... What's your size limit?

cc: @marcocastignoli

@OmarTawfik

OmarTawfik commented Sep 30, 2024

In that case, what do you think of the original suggestion for the File Repository v2 to export individual zip files instead of zip parts that need to be concatenated to be unzipped? https://repo-backup.sourcify.app/manifest.json

I don't know if there is a limitation/requirement on your end for how many parts it is split into, but your idea above of splitting into 256 parts (00-FF, based on the first two chars) sounds great!

@marcocastignoli
Member

Just to recap the conversation here:

We can optimize the archives by:

  1. Making each part of the archive unzippable on its own, without concatenating it with the others, by splitting the archive per chain, per first byte of the address, or both
  2. Performing incremental exports

My take on this:

  1. I think it totally makes sense. We just need to find the best way to split the archive. IMO we can split per chain and per first byte, i.e. `match_type.chain.first_byte.zip` (e.g. `full_match.1.00.zip` will contain all mainnet full_match contracts with the address starting with `00`); see the sketch after this list.
  2. Even if there are old partial matches in the repository, by default our APIs first check for a `full_match` before trying the `partial_match`. So I don't think it's a huge problem other than having a dirty repository, at least for us. For other use cases this could be a known limitation of our repository export.
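
A tiny sketch of that naming scheme (the helper is illustrative only):

```python
# Build the proposed match_type.chain.first_byte.zip file name.
def zip_name(match_type: str, chain_id: int, address: str) -> str:
    first_byte = address.lower().removeprefix("0x")[:2]
    return f"{match_type}.{chain_id}.{first_byte}.zip"

zip_name("full_match", 1, "0x00a329c0648769a73afac7f9381e08fb43dbea72")
# -> "full_match.1.00.zip"
```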

@OmarTawfik

Assuming the zip file would contain the full data of these contracts (source files + compilation metadata) without needing to fetch other zip files, `match_type.chain.first_byte.zip` sounds great! Thanks 👍

@marcocastignoli
Member

> My take on this:
>
> 1. I think it totally makes sense. We just need to find the best way to split the archive. IMO we can split per chain and per first byte `match_type.chain.first_byte.zip` (e.g. `full_match.1.00.zip` will contain all mainnet full_match contracts with the address starting with `00`)
>
> 2. Even if there are old partial matches in the repository, by default our APIs check first the `full_match` existence before trying the `partial_match`. So I don't think it's a huge problem other than having a dirty repository, at least for us. For other use cases this could be a known limitation of our repository export.

@kuzdogan if you also think this works I can create an issue for implementing this

@kuzdogan
Member Author

kuzdogan commented Oct 8, 2024

The first point, splitting the .zips based on the chain and the first byte (0xab..), makes sense and is straightforward, so I think we can do that already. Something like this:

11155111/
├── full_match/
│   ├── 0x00/
│   │   ├── 0x00a329c0648769a73afac7f9381e08fb43dbea72/
│   │   └── 0x00b5ce2620d3f46c145b38f5b0d8d0e145952f6e/
│   └── 0xff/
│       ├── 0xff98a63f0a8662c3014f0d5a5077371a81e4e341/
│       └── 0xffb52bc3d04dfad89490f31011a4a56f5d4f6985/
└── partial_match/
    ├── 0x00/
    │   ├── 0x00a32.../
    │   └── 0x00b5c.../
    └── 0xff/
        ├── 0xff98a.../
        └── 0xffb52.../

However, I am not sure about the incremental exports. How do they reconcile with the first one? Do you want to split first by the export date and then by chain and address? Something like this:

├── 2023-10-08/
│   └── 11155111/
│       ├── full_match/
│       │   ├── 0x00/
│       │   │   ├── 0x00a329c0648769a73afac7f9381e08fb43dbea72/
│       │   │   └── 0x00b5ce2620d3f46c145b38f5b0d8d0e145952f6e/
│       │   └── 0xff/
│       │       ├── 0xff98a63f0a8662c3014f0d5a5077371a81e4e341/
│       │       └── 0xffb52bc3d04dfad89490f31011a4a56f5d4f6985/
│       └── partial_match/
│           ├── 0x00/
│           │   ├── 0x00a32.../
│           │   └── 0x00b5c.../
│           └── 0xff/
│               ├── 0xff98a.../
│               └── 0xffb52.../
└── 2023-10-15/
    └── 11155111/
        ├── full_match/
        │   ├── 0x00/
        │   │   ├── 0x00a329c0648769a73afac7f9381e08fb43dbea72/
        │   │   └── 0x00b5ce2620d3f46c145b38f5b0d8d0e145952f6e/
        │   └── 0xff/
        │       ├── 0xff98a63f0a8662c3014f0d5a5077371a81e4e341/
        │       └── 0xffb52bc3d04dfad89490f31011a4a56f5d4f6985/
        └── partial_match/
            ├── 0x00/
            │   ├── 0x00a32.../
            │   └── 0x00b5c.../
            └── 0xff/
                ├── 0xff98a.../
                └── 0xffb52.../

I just wonder if we are trying to solve an already solved problem. Maybe there's a tool or library to handle such exports? Still, it should not be too much work to do this.

@marcocastignoli
Member

Honestly, I’m not a fan of incremental exports in this context. While I see the advantage of being able to download only the latest changes, it also comes with several drawbacks, such as creating a ‘dirty repository’ with duplicate partial and full matches, as well as increased difficulty in downloading and assembling the repository from multiple files.

Aside from this, I'm searching for existing tools.

@kuzdogan kuzdogan assigned marcocastignoli and unassigned kuzdogan Oct 10, 2024
@marcocastignoli
Member

marcocastignoli commented Oct 14, 2024

We decided to split the archive by chain and first byte as explained above: #1695

E.g.

11155111/
├── full_match/
│   ├── 0x00/
│   │   ├── 0x00a329c0648769a73afac7f9381e08fb43dbea72/
│   │   └── 0x00b5ce2620d3f46c145b38f5b0d8d0e145952f6e/
│   └── 0xff/
│       ├── 0xff98a63f0a8662c3014f0d5a5077371a81e4e341/
│       └── 0xffb52bc3d04dfad89490f31011a4a56f5d4f6985/
└── partial_match/
    ├── 0x00/
    │   ├── 0x00a32.../
    │   └── 0x00b5c.../
    └── 0xff/
        ├── 0xff98a.../
        └── 0xffb52.../

Regarding incremental exports, we have opted to provide the same functionality through a different method; this will be the subject of further research in a separate issue.
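
For consumers this means fetching only the parts they need, along these lines (a minimal sketch; the download URL pattern is an assumption, not the final location):

```python
import urllib.request
import zipfile

# Fetch and unpack a single part; the URL below is an assumed location for
# illustration, not the final one.
name = "full_match.1.00.zip"
urllib.request.urlretrieve(f"https://repo-backup.sourcify.app/{name}", name)
with zipfile.ZipFile(name) as zf:
    zf.extractall("repository/")
```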
