
Streaming Blob Contents #1595

Open
willstott101 opened this issue Sep 14, 2024 · 3 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

willstott101 (Contributor) commented Sep 14, 2024

Summary 💡

I'd like to get a stream of bytes from the odb without having to load the entire blob into memory. I've had a peruse of the gitoxide pack parsing code and the official git docs for the format, and it seems like there are no guarantees that the chunks are stored in order in the packfile. Would you expect that to be a sticking point here, or would they in reality always be in order?

I'm intending to explore implementing this myself.

Motivation 🔦

I'm starting to use gitoxide to serve files from git repos on a server. I was previously shelling out to git cat-file from Node.js, but Rust is proving to be a nicer environment for the few other bits this service performs. However, I would like to keep memory usage to a minimum, and we have some tens-of-megabyte blobs which are a bit awkward to have to load into memory before serving them.

willstott101 added the enhancement label on Sep 14, 2024
Byron added the help wanted label on Sep 14, 2024
Byron (Member) commented Sep 14, 2024

You are not the first to desire such a feature, and here are my thoughts on it:

  • object streaming isn't reasonably possible for delta-packed objects
  • object streaming is quite straightforward for
    • loose objects
    • non-deltified objects

gitoxide doesn't have an API that would allow streaming, but for two out of three cases it could have one. It's just a question of how often an object can actually be streamed, but that could be something the server side can influence.

I can imagine, for example, that the server finds big objects that it wants to make streamable, and then writes them out as loose objects. Unfortunately, these would probably be removed the next time a gc runs, but that would be under the control of the server as well.

From an implementation point of view, I can imagine offering a way to stream an object, which then either streams it from an opened loose file, or 'fake'-streams it from an in-memory copy. The first version of the implementation probably wouldn't want to deal with streaming from a pack as most packed objects are deltified.
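
To make that more concrete, here is a minimal sketch of what the return type of such an API could look like (the `ObjectStream` enum is hypothetical and not existing gitoxide API; it assumes the `flate2` crate for zlib decompression, and a real implementation would also need to parse and skip the `<type> <size>\0` header at the start of a loose object):

```rust
use std::fs::File;
use std::io::{Cursor, Read};

use flate2::read::ZlibDecoder;

/// Hypothetical return type of a streaming read: either a real stream over a
/// loose-object file, or a 'fake' stream over an object that had to be fully
/// decoded into memory first.
enum ObjectStream {
    /// Loose objects are zlib-compressed files, so their content can be
    /// decompressed incrementally while it is being read.
    Loose(ZlibDecoder<File>),
    /// Deltified packed objects have to be assembled in full before anything
    /// can be handed out; "streaming" them only pretends.
    InMemory(Cursor<Vec<u8>>),
}

impl Read for ObjectStream {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        match self {
            ObjectStream::Loose(decoder) => decoder.read(buf),
            ObjectStream::InMemory(cursor) => cursor.read(buf),
        }
    }
}
```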

My recommendation is to start with a non-streaming version, and see how you do in the memory-usage department right away. I suspect real memory usage to be pretty low and very much defined by the size of objects that are held in memory during the streaming, assuming not much else is going on.
From there, with the effort described here, one should be able to lower the peak memory usage.

The reason I am saying this is that there might be other factors that drive memory usage, and it would be good to find out quickly before optimizing something that in reality isn't a problem.

willstott101 (Contributor, Author) commented Sep 18, 2024

I can't pretend I'm not partly pursuing this for academic reasons. I'm sure that for 99% of files stored in git repos it would never result in a perceptible improvement, but at the same time there's a chicken-and-egg problem: tooling proficiency influences usage patterns, which in turn influence tooling engineering effort. Anyway, that's all a bit philosophical...

I'm not clear how delta compression would affect the "stream-ability" of a blob; my understanding from reading the spec is that a delta-compressed blob is a list of chunks, some of which exist in other objects, some of which are in the blob data directly. By making the assumption that all these chunks appear in-order I would imagine one could recursively dig into the other objects one by one as you build and return the resulting buffer... essentially a stack of decoders. IDK if a recursive descent parser is an appropriate analogy here but that's what comes to mind.

You make a great point about controlling the way objects are packed to keep large files as loose objects - that would definitely be possible in my situation. However, frankly, I think the most pragmatic solution in my case is a disk cache of brotli-encoded blob data ready to just hand to the web server as a file handle for later requests - there happens to be a limited number of "large" files in my deployment and this should work very well.

My next question is whether you would plausibly accept a PR implementing a counterpart to gix_odb/trait.Write/write_stream that was something like:

fn read_stream<'a>(&'a self, id: &oid) -> Result<&'a mut dyn Read, Error>

Presumably with many caveats, like initially only streaming loose objects, or dynamically allocating a hidden Vec somewhere for the entire object in other cases. This would provide a breeding ground for additional capabilities; for instance, there are plausibly some objects small enough that they could use the slice given to fn read(&mut self, buf: &mut [u8]) -> Result<usize> to decode directly from the pack without ever allocating a Vec.

Byron (Member) commented Sep 19, 2024

By making the assumption that all these chunks appear in-order I would imagine one could recursively dig into the other objects one by one as you build and return the resulting buffer... essentially a stack of decoders. IDK if a recursive descent parser is an appropriate analogy here but that's what comes to mind.

It's a reverse chain of instruction buffers, each pointing back at its base object, which is another instruction buffer, up to an undeltified base. From there, in reverse, these instructions have to be applied in order to assemble the final object. So one will be forced to do all of the expensive work upfront anyway, which is decompressing the objects. Anything I can imagine regarding trying to stream under these circumstances is super-complex and likely will not save anything most of the time (quite the opposite). But by all means…do try it, I may be wrong. If I were to approach it, I'd try to find typical deltified objects and visualize their delta instructions to see if anything can be done at all.
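
To illustrate with a simplified model (this mirrors the two kinds of instructions in git's pack delta format, but is not gitoxide's actual representation): a copy instruction may reference any offset of the base, in any order, which is why the base has to be fully materialized before the final object can be produced.

```rust
/// Simplified delta instructions: copy a range out of the base object, or
/// insert literal bytes carried inside the delta itself.
enum DeltaOp<'a> {
    CopyFromBase { offset: usize, len: usize },
    Insert(&'a [u8]),
}

/// Applying a delta needs random access into the fully assembled base, so the
/// entire chain of bases must be decompressed and resolved up front; only the
/// final write-out happens in order.
fn apply_delta(base: &[u8], ops: &[DeltaOp<'_>]) -> Vec<u8> {
    let mut out = Vec::new();
    for op in ops {
        match op {
            DeltaOp::CopyFromBase { offset, len } => {
                out.extend_from_slice(&base[*offset..*offset + *len]);
            }
            DeltaOp::Insert(data) => out.extend_from_slice(data),
        }
    }
    out
}
```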

You make a great point about controlling the way objects are packed to keep large files as loose objects - that would definitely be possible in my situation.

I am saying even more than that. It is also possible to let objects be packed, but not delta-compressed. That way they can easily be streamed, just as if they were loose objects.
It might not be feasible to always mark such objects though (in .gitattributes).
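
For reference, the attribute in question is `delta`: unsetting it in `.gitattributes` tells git not to attempt delta compression for matching blobs, so they stay packed but undeltified (the patterns below are only examples):

```
# keep large binary assets packed but undeltified
*.bin      -delta
assets/**  -delta
```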

You make a great point about controlling the way objects are packed to keep large files as loose objects - that would definitely be possible in my situation. However, frankly, I think the most pragmatic solution in my case is a disk cache of brotli-encoded blob data ready to just hand to the web server as a file handle for later requests - there happens to be a limited number of "large" files in my deployment and this should work very well.

I agree, that's a very practical approach.

My next question is whether you would plausibly accept a PR implementing a counterpart to gix_odb/trait.Write/write_stream that was something like: [..]

I think it would be something like this:

fn read_stream<'a>(&'a self, id: &oid) -> Result<Option<Box<dyn Read + 'a>>, Error>

I don't particularly like the Box, but I also know I would like type parameters on the trait even less :D.
This method wouldn't try to allocate an internal Vec, but would instead return None if the object can't be streamed.
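
Purely as an illustration of how a caller could use such a method, with a fallback to the existing in-memory path when streaming isn't possible (the `StreamingOdb` trait, `read_all`, and the error alias below are stand-ins, not gix_odb API):

```rust
use std::io::{Read, Write};

type Error = Box<dyn std::error::Error>;

/// Hypothetical trait combining the proposed streaming read with a stand-in
/// for the existing non-streaming lookup.
trait StreamingOdb {
    /// Returns `None` when the object cannot be streamed, e.g. because it is
    /// deltified inside a pack.
    fn read_stream<'a>(&'a self, id: &gix_hash::oid) -> Result<Option<Box<dyn Read + 'a>>, Error>;
    /// Stand-in for the existing in-memory lookup.
    fn read_all(&self, id: &gix_hash::oid) -> Result<Vec<u8>, Error>;
}

fn serve_blob(
    odb: &impl StreamingOdb,
    id: &gix_hash::oid,
    response: &mut impl Write,
) -> Result<(), Error> {
    match odb.read_stream(id)? {
        // Streamable: copy through without holding the whole blob in memory.
        Some(mut stream) => {
            std::io::copy(&mut stream, response)?;
        }
        // Not streamable: load it fully, as before.
        None => {
            let data = odb.read_all(id)?;
            response.write_all(&data)?;
        }
    }
    Ok(())
}
```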
