Large File Support: Use I/O, not filepath wherever possible #162

atz · 2017-06-05T22:39:37Z

It is costly to pull down files and write them to disk unnecessarily. For sufficiently large files, this will break the ingest/derivative pipeline. This is made worse by attempts at job parallelization, where each job (potentially serviced on a different worker box) incurs this cost. But it is possible to avoid this problem.

Even though we shell out for many of the non-Ruby derivative processors, we should avoid forcing the input (and ideally the output) to be literal filesystem files when there is no legitimate need for it.

FFmpeg, for example, supports pipe/STDIN for the formats we care about: https://stackoverflow.com/questions/12999674/ffmpeg-which-file-formats-support-stdin-usage
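As a rough sketch of what feeding a processor from an IO could look like (not existing code in this gem): the helper name `pipe_through` and its parameters are made up for illustration, and only Ruby's standard `Open3` is assumed. It streams any readable IO into a child process's STDIN in chunks, so the source never has to exist as a file on the worker.

```ruby
require "open3"
require "stringio"

# Hypothetical helper: stream `input` (any object responding to #read)
# into the STDIN of `command` and return whatever the child writes to STDOUT.
def pipe_through(command, input, chunk_size: 64 * 1024)
  Open3.popen3(*command) do |stdin, stdout, stderr, wait_thr|
    stdin.binmode
    stdout.binmode
    # Drain both output pipes in threads so the child never blocks on a full pipe.
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }
    begin
      while (chunk = input.read(chunk_size))
        stdin.write(chunk)
      end
    rescue Errno::EPIPE
      # The child closed its STDIN early; it has read all it wanted.
    ensure
      stdin.close
    end
    status = wait_thr.value
    raise "command failed: #{err_reader.value}" unless status.success?
    out_reader.value
  end
end

# Possible usage, assuming ffmpeg is installed and the input format can be
# read from a pipe (some containers need seekable input):
#   pipe_through(%w[ffmpeg -i pipe:0 -f mp3 pipe:1], source_io)
```

The output here is still collected into one string for simplicity; for large derivatives the same pattern would presumably hand `stdout` to the caller as a stream instead.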

This also allows optimizations for processors that don't use the bulk of a large file (e.g., only the metadata and first 2 minutes of, say, a 6-hour video). They can read until satisfied and then reset/close the IO. Most of the GBs are never pulled down, never put in memory, and never written to disk.
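To make "read until satisfied" concrete, here is a minimal sketch. `read_head` is a hypothetical name, not an existing API; it pulls at most `limit` bytes from an IO in chunks and then stops. The savings only materialize if the IO is lazily backed (for example, a streamed response from remote storage), which is an assumption about the caller.

```ruby
require "stringio"

# Hypothetical helper: read at most `limit` bytes from `io`, chunk by chunk.
# Nothing past the limit is requested from the underlying source.
def read_head(io, limit, chunk_size: 16 * 1024)
  head = String.new # empty binary (ASCII-8BIT) buffer
  while head.bytesize < limit
    chunk = io.read([chunk_size, limit - head.bytesize].min)
    break if chunk.nil? || chunk.empty? # source ended before the limit
    head << chunk
  end
  head
end

# A processor that only needs the start of the file can stop here and
# close the IO, leaving the remainder unread:
#   head = read_head(source_io, 10 * 1024 * 1024)
#   source_io.close
```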

With a cloud-based platform like Hyku, it is very conceivable that this derivatives code is the tightest bottleneck in supporting large files.

atz mentioned this issue Jun 5, 2017

Hyrax::WorkingDirectory puts object in memory as string samvera/hyrax#1128

Closed
