Streaming Blob Contents #1595
You are not the first to desire such a feature, and here are my thoughts on it:
I can imagine, for example, that the server finds big objects that it wants to make streamable, and then writes them out as loose objects. Unfortunately, these would probably be removed the next time the repository is repacked.

From an implementation point of view, I can imagine offering a way to stream an object, which then either streams it from an opened loose file or 'fake'-streams it from an in-memory copy. The first version of the implementation probably wouldn't want to deal with streaming from a pack, as most packed objects are deltified.

My recommendation is to start with a non-streaming version and see how you do in the memory-usage department right away. I suspect real memory usage to be pretty low and very much defined by the size of the objects held in memory during streaming, assuming not much else is going on. The reason I am saying this is that there might be other factors that drive memory usage, and it would be good to find out quickly before optimizing something that in reality isn't a problem.
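A minimal sketch of what such a dual-mode reader could look like, assuming the `flate2` crate for zlib decompression; all names here are hypothetical and not part of gitoxide's API:

```rust
use std::fs::File;
use std::io::{self, BufReader, Read};

use flate2::read::ZlibDecoder;

/// Either a true stream over a loose-object file, or a 'fake' stream over
/// an object that was already decoded into memory. (Hypothetical type.)
enum BlobStream {
    /// Zlib-decompresses the loose-object file lazily as it is read.
    Loose(ZlibDecoder<BufReader<File>>),
    /// Serves bytes from a buffer that was produced up front.
    InMemory(io::Cursor<Vec<u8>>),
}

impl Read for BlobStream {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        match self {
            BlobStream::Loose(inner) => inner.read(buf),
            BlobStream::InMemory(inner) => inner.read(buf),
        }
    }
}
```

A real loose-object reader would additionally have to consume the `<type> <size>\0` header from the decompressed stream before exposing the blob's payload.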
I can't pretend I'm not partly pursuing this for academic reasons. I'm sure for 99% of files stored in git repos it would never result in a perceivable improvement, but at the same time there's a chicken-and-egg problem - tooling proficiency influences usage patterns, which influence tooling engineering effort. Anyway, that's all a bit philosophical...

I'm not clear how delta compression would affect the "stream-ability" of a blob. My understanding from reading the spec is that a delta-compressed blob is a list of chunks, some of which exist in other objects and some of which are in the blob data directly. Assuming all these chunks appear in order, I would imagine one could recursively dig into the other objects one by one while building and returning the resulting buffer... essentially a stack of decoders. I don't know if a recursive descent parser is an appropriate analogy here, but that's what comes to mind.

You make a great point about controlling the way objects are packed to keep large files as loose objects - that would definitely be possible in my situation. However, frankly, I think the most pragmatic solution in my case is a disk cache of brotli-encoded blob data, ready to just hand to the web server as a file handle for later requests - there happens to be a limited number of "large" files in my deployment, and this should work very well.

My next question is whether you would plausibly accept a PR implementing a pairing to

fn read_stream(&'a self, id: &[oid]) -> Result<&'a mut dyn Read, Error>

presumably with many caveats, like initially only streaming loose objects, or dynamically allocating a hidden Vec somewhere for the entire object in other cases. It would provide a breeding ground for additional capabilities; for instance, there are plausibly some objects small enough that they could use the slice given to
It's a reverse chain of instruction buffers, pointing back at their base objects, which are more instruction buffers, up to an undeltified base. From there, in reverse, these instructions have to be applied to assemble the final objects, in order. So one will be forced to do all of the expensive work upfront anyway, which is decompressing the objects. Anything I can imagine regarding trying to stream under these circumstances is super complex and likely will not save anything most of the time (quite the opposite). But by all means, do try it; I may be wrong. If I were to approach it, I'd try to find typical deltified objects and visualize their delta instructions to see if anything can be done at all.
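To make that concrete, here is a simplified sketch of delta resolution; the instruction encoding is deliberately abstracted (the real pack format packs copy offsets and sizes into variable-length bytes), and none of these names come from gitoxide:

```rust
/// A simplified view of a pack delta instruction (the on-disk encoding
/// is a compact binary format, abstracted away here).
enum DeltaInstruction {
    /// Copy `len` bytes starting at `offset` from the fully resolved base object.
    CopyFromBase { offset: usize, len: usize },
    /// Append literal bytes carried inside the delta itself.
    Insert(Vec<u8>),
}

/// Applies a delta on top of an already-resolved base object.
/// Copy instructions may reference arbitrary offsets in the base, which is
/// why the base has to be materialized before the final object can be
/// produced in order.
fn apply_delta(base: &[u8], instructions: &[DeltaInstruction]) -> Vec<u8> {
    let mut out = Vec::new();
    for instruction in instructions {
        match instruction {
            DeltaInstruction::CopyFromBase { offset, len } => {
                out.extend_from_slice(&base[*offset..*offset + *len]);
            }
            DeltaInstruction::Insert(data) => out.extend_from_slice(data),
        }
    }
    out
}
```

For a chain of deltas this has to be repeated once per link, starting from the undeltified base, so the whole chain is decompressed and applied before the first in-order byte of the final blob is available.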
I am saying even more than that. There is the possibility to let objects be packed, but not delta-compressed. That way they can easily be streamed, just as if they were loose objects.
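As one possible illustration with stock git (assuming server-side control over how the repositories are packed), the standard `delta` attribute can keep selected blobs packed but undeltified:

```
# .gitattributes: opt these paths out of delta compression; the blobs
# stay zlib-compressed in the pack but are never deltified.
*.bin       -delta
assets/**   -delta
```

Alternatively, the `core.bigFileThreshold` configuration value makes git skip delta compression for any blob above a size threshold.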
I agree, that's a very practical approach.
I think it would be something like this:

fn read_stream<'a>(&'a self, id: &[oid]) -> Result<Option<Box<dyn Read + 'a>>, Error>

I don't particularly like the Box, but I also know I would like type parameters on the trait even less :D.
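For what it's worth, a caller-side sketch of how a signature like that could be used to serve a blob without buffering it wholesale; the function and its parameters are purely illustrative:

```rust
use std::io::{self, Read, Write};

/// Hypothetical caller: stream a blob into a response body if it exists.
/// `stream` stands in for whatever a `read_stream`-style method would return.
fn serve_blob<W: Write>(
    stream: Option<Box<dyn Read + '_>>,
    response: &mut W,
) -> io::Result<()> {
    match stream {
        Some(mut reader) => {
            // io::copy moves data through a fixed-size internal buffer,
            // so peak memory does not grow with the size of the blob.
            io::copy(&mut reader, response)?;
            Ok(())
        }
        None => response.write_all(b"object not found\n"),
    }
}
```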
Summary 💡
I'd like to get a stream of bytes from the odb without having to load the entire blob into memory. I've had a peruse of the gitoxide pack-parsing code and the official git docs for the format, and it seems there are no guarantees that the chunks are stored in order in the packfile. Would you expect that to be a sticking point here, or in reality would they always be in order?
I'm intending to explore implementing this myself.
Motivation 🔦
I'm starting to use gitoxide to serve files from git repos on a server. I was previously shelling out to git cat-file in Node.js, but Rust is proving to be a nicer environment for the few other bits this service performs. I would like to keep memory usage to a minimum, however, and we have some tens-of-megabyte blobs which are a bit awkward to have to load into memory before serving them.