Clarification on JSON Lines Dataset for Multi-Task Fine-Tuning of Florence-2 #323
Replies: 2 comments 3 replies
-
Hi @mariaalfaroc 👋 I don't have an answer, but I suggest looking at maestro. That's our newest project, aimed explicitly at fine-tuning multimodal models. Note that the next two weeks are intense for the maintainers, so they might not respond. Here's where @SkalskiP talks about the data format for Florence-2: YouTube.
-
Hi @LinasKo,
Thanks for your response! I've reviewed the maestro documentation and the YouTube tutorial. However, in both of them, the fine-tuning process for Florence-2 is focused on a single task at a time: Object Detection (OD) or Visual Question Answering (VQA).
For OD, a sample annotation from the dataset looks like this:
{
"image": "IMG_20220316_165139_jpg.rf.e4c229a9128494d17992cbe88af575df.jpg",
"prefix": "<OD>",
"suffix": "9 of diamonds<loc_141><loc_18><loc_404><loc_465>jack of diamonds<loc_589><loc_120><loc_789><loc_454>queen of diamonds<loc_308><loc_482><loc_570><loc_966>king of diamonds<loc_549><loc_477><loc_777><loc_904>10 of diamonds<loc_396><loc_75><loc_613><loc_458>"
}
For VQA, it appears as:
{
"image": "IMG_20220316_165139_jpg.rf.e4c229a9128494d17992cbe88af575df.jpg",
"prefix": "<VQA> How many cards are in the image?",
"suffix": "5"
}
What I'd like to know is: how should the annotations be structured if I want to fine-tune Florence-2 for both OD and VQA simultaneously? Is this even possible? Would this structure be valid?
{
"image": "IMG_20220316_165139_jpg.rf.e4c229a9128494d17992cbe88af575df.jpg",
"prefix": ["<OD>", "<VQA> How many cards are in the image?"],
"suffix": [
"9 of diamonds<loc_141><loc_18><loc_404><loc_465>jack of diamonds<loc_589><loc_120><loc_789><loc_454>queen of diamonds<loc_308><loc_482><loc_570><loc_966>king of diamonds<loc_549><loc_477><loc_777><loc_904>10 of diamonds<loc_396><loc_75><loc_613><loc_458>",
"5"
]
}
Thank you so much again! :)
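Nothing in this thread confirms whether list-valued prefix and suffix fields are accepted. One alternative that stays within the single-task format quoted above is to write one record per task, so the same image appears on several lines of the JSON Lines file. The sketch below only illustrates that data transformation; the helper names `flatten_record` and `to_jsonl` and the file name `cards.jpg` are hypothetical, and whether a given trainer mixes tasks correctly from such a file is an assumption that would need checking.

```python
import json


def flatten_record(record):
    """Split a record with list-valued prefix/suffix into one record per task."""
    prefixes, suffixes = record["prefix"], record["suffix"]
    # Records that already use plain strings pass through unchanged.
    if isinstance(prefixes, str) and isinstance(suffixes, str):
        return [dict(record)]
    # Mixed string/list fields or unequal lengths cannot be paired up.
    if isinstance(prefixes, str) or isinstance(suffixes, str) or len(prefixes) != len(suffixes):
        raise ValueError("prefix and suffix must be paired one-to-one")
    return [
        {"image": record["image"], "prefix": p, "suffix": s}
        for p, s in zip(prefixes, suffixes)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one single-task record per line."""
    lines = []
    for record in records:
        for flat in flatten_record(record):
            lines.append(json.dumps(flat))
    return "\n".join(lines) + "\n"


multi_task = {
    "image": "cards.jpg",  # hypothetical file name
    "prefix": ["<OD>", "<VQA> How many cards are in the image?"],
    "suffix": ["9 of diamonds<loc_141><loc_18><loc_404><loc_465>", "5"],
}
print(to_jsonl([multi_task]), end="")
```

Each output line has the same image/prefix/suffix shape as the OD and VQA samples above, with the task identified by the prefix string.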
-
Hi everyone,
I came across the notebook discussing how to fine-tune Florence-2 for Object Detection, and I have a question regarding the structure of the JSON Lines dataset when fine-tuning for multiple tasks.
Specifically, how should the dataset be formatted if I want to fine-tune for more than one task?
Should the prefix field be a list of task string IDs, while the suffix field contains a list of strings that represent the answers for each task? For example, would the following structure be correct?
Additionally, is there a guide available on how to format datasets for each task?
I appreciate any guidance on this!
Thank you!
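On the per-task format: the OD suffix quoted in this thread alternates a label with four `<loc_N>` tokens. A minimal sketch for checking that a suffix follows that pattern is shown here; it is inferred only from the sample annotation in this thread, not from an official format guide, and the function name `parse_od_suffix` is hypothetical.

```python
import re

# One entry: a label (any text without "<") followed by exactly four <loc_N> tokens.
ENTRY = re.compile(r"([^<]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>")


def parse_od_suffix(suffix):
    """Return a list of (label, (n1, n2, n3, n4)) tuples, or raise ValueError."""
    entries = []
    pos = 0
    for match in ENTRY.finditer(suffix):
        # Any gap between consecutive matches means the suffix is malformed.
        if match.start() != pos:
            raise ValueError(f"unexpected text at offset {pos}")
        label = match.group(1)
        locs = tuple(int(value) for value in match.groups()[1:])
        entries.append((label, locs))
        pos = match.end()
    # Trailing text that did not match a full entry is also an error.
    if pos != len(suffix):
        raise ValueError(f"unexpected text at offset {pos}")
    return entries


sample = (
    "9 of diamonds<loc_141><loc_18><loc_404><loc_465>"
    "jack of diamonds<loc_589><loc_120><loc_789><loc_454>"
)
for label, locs in parse_od_suffix(sample):
    print(label, locs)
```

Running such a check over every OD line before training can catch truncated or mistyped location tokens early.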