add video support to segment-anything-2 pipeline #181

Draft: wants to merge 1 commit into main


@eliteprox (Collaborator) commented Aug 29, 2024

This change renames the request field image to media_file and loads the appropriate segment-anything-2 inference model (image or video) based on the file's content type. For video files, it uses ffmpeg to split the video into image frames and loads them together with the inference values from the request.
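A minimal sketch of the dispatch and frame-extraction steps described above, assuming the pipeline decides between the image and video model from the upload's MIME type. The function names select_model_kind, build_ffmpeg_frame_cmd and extract_frames are hypothetical, not the PR's code. The ffmpeg flags follow the pattern used in the SAM2 video examples (JPEG frames numbered from 0 so file names line up with frame indices), but treat the exact options as an assumption to check against the pipeline's needs.

```python
import mimetypes
import subprocess
from pathlib import Path
from typing import List, Optional


def select_model_kind(filename: str, content_type: Optional[str] = None) -> str:
    """Pick the hypothetical "image" or "video" model from the upload's MIME type.

    If the client sent no Content-Type, fall back to guessing from the extension.
    """
    if not content_type:
        content_type, _ = mimetypes.guess_type(filename)
    if content_type is None:
        raise ValueError(f"cannot determine content type of {filename!r}")
    if content_type.startswith("video/"):
        return "video"
    if content_type.startswith("image/"):
        return "image"
    raise ValueError(f"unsupported content type: {content_type}")


def build_ffmpeg_frame_cmd(video_path: str, frames_dir: str) -> List[str]:
    """Build an ffmpeg command that writes frames as 00000.jpg, 00001.jpg, ..."""
    pattern = str(Path(frames_dir) / "%05d.jpg")
    # -q:v 2 keeps JPEG quality high; -start_number 0 makes file names match frame indices.
    return ["ffmpeg", "-i", video_path, "-q:v", "2", "-start_number", "0", pattern]


def extract_frames(video_path: str, frames_dir: str) -> None:
    """Run ffmpeg (must be on PATH) to split the video into JPEG frames."""
    Path(frames_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_frame_cmd(video_path, frames_dir), check=True)
```

Keeping the command builder separate from the subprocess call makes the ffmpeg invocation easy to inspect or test without ffmpeg installed.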

Some adjustments still need to be made to the request/response parameters. I think "frame index" and "object id" may be two request parameters to add to this pipeline for video requests; I've hard-coded some values for now.
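One way the hard-coded values could become request parameters is sketched below. The field names frame_idx and obj_id are chosen to mirror the keyword arguments of SAM2's video point-prompt call (add_new_points_or_box in the releases I know of), but the request field names, the defaults (frame 0, object 1) and the VideoPrompt / parse_video_prompt helpers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class VideoPrompt:
    """Hypothetical per-request video prompt settings."""

    frame_idx: int  # frame on which the point prompts are placed
    obj_id: int  # client-chosen ID for the object being tracked


def parse_video_prompt(kwargs: Mapping[str, Any], num_frames: int) -> VideoPrompt:
    """Read frame_idx/obj_id from request kwargs instead of hard-coding them.

    The defaults (frame 0, object 1) are placeholders, not values from the PR.
    """
    frame_idx = int(kwargs.get("frame_idx", 0))
    obj_id = int(kwargs.get("obj_id", 1))
    if not 0 <= frame_idx < num_frames:
        raise ValueError(f"frame_idx {frame_idx} out of range for {num_frames} frames")
    return VideoPrompt(frame_idx=frame_idx, obj_id=obj_id)
```

The parsed values would then be passed to the point-prompt call that ends with labels=kwargs.get('point_labels', None) in the diff excerpt, alongside the points and labels already read from kwargs.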

labels=kwargs.get('point_labels', None),
)
video_segments = {} # video_segments contains the per-frame segmentation results
for out_frame_idx, out_obj_ids, out_mask_logits in self.tm_vid.propagate_in_video(inference_state):
Collaborator


I have a feeling we should return the full triple (frame index, object IDs, mask logits) instead of building video segments, leaving post-processing to the consumer of the API (though I recognize this is a good, quick way to validate the sanity of the mask outputs).
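A minimal sketch of passing the raw triples through instead of building a video_segments dict, assuming the response must be JSON-serializable. The helper names are hypothetical; the only library behavior relied on is that torch tensors and NumPy arrays both provide .tolist().

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple

# (frame index, object IDs, mask logits), the triple each propagation step yields
FrameResult = Tuple[int, Sequence[int], Any]


def _to_plain(value: Any) -> Any:
    """Convert tensors/arrays to nested lists; plain lists pass through unchanged."""
    return value.tolist() if hasattr(value, "tolist") else value


def triples_to_response(results: Iterable[FrameResult]) -> List[Dict[str, Any]]:
    """Return each raw triple to the API consumer without thresholding the logits."""
    response = []
    for frame_idx, obj_ids, mask_logits in results:
        response.append(
            {
                "frame_idx": int(frame_idx),
                "obj_ids": [int(obj_id) for obj_id in obj_ids],
                "mask_logits": _to_plain(mask_logits),
            }
        )
    return response
```

One trade-off worth weighing: full-resolution logits per object per frame can make responses large, which is one argument for letting consumers opt into thresholded masks instead.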

Collaborator Author


Thanks for the guidance on this. I see how we can just return the results of self.tm_vid.propagate_in_video(inference_state).

@eliteprox (Collaborator Author) commented Sep 10, 2024

> I have a feeling we should return the full triple instead of creating a video segment, leaving post-processing to the consumer of the API (though I recognize this is a good quick way to validate the sanity of the mask outputs)

I had some issues trying to return the correct values. I added frame index as an input parameter. Normally, propagate_in_video loops, returning results for each frame from the frame index to the end of the video; now it should return only a single frame. But the data doesn't look correct. Can you take a look, @pschroedl?
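Because propagate_in_video is a generator, a single frame can be taken by consuming it lazily and stopping early. The sketch below (result_for_frame is a hypothetical helper) also checks the yielded frame index rather than assuming the first item is the requested frame, which can help show whether propagation actually started where expected. It assumes forward propagation, so indices arrive in increasing order. Recent SAM2 releases also appear to accept start_frame_idx and max_frame_num_to_track on propagate_in_video; verify that against the installed version before relying on it.

```python
from typing import Any, Iterable, Sequence, Tuple

# (frame index, object IDs, mask logits), as yielded by propagate_in_video
FrameResult = Tuple[int, Sequence[int], Any]


def result_for_frame(results: Iterable[FrameResult], frame_idx: int) -> FrameResult:
    """Return the propagation result for exactly `frame_idx`.

    Assumes forward propagation (increasing frame indices). Stops consuming the
    iterator as soon as the frame is found or has been passed.
    """
    for out_frame_idx, out_obj_ids, out_mask_logits in results:
        if out_frame_idx == frame_idx:
            return out_frame_idx, out_obj_ids, out_mask_logits
        if out_frame_idx > frame_idx:
            break
    raise LookupError(f"frame {frame_idx} was not produced by propagation")
```

In the pipeline this might be called as result_for_frame(self.tm_vid.propagate_in_video(inference_state), frame_idx); a LookupError would indicate that propagation began after the requested frame.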

@eliteprox marked this pull request as draft on September 3, 2024, 15:40
@eliteprox changed the base branch from segment_anything_2_pipeline_image to main on September 10, 2024, 03:28