Finetuning Florence2 Token Length Limit #289

kevinjeswani · 2024-06-28T22:28:28Z

Hi,

I've been trying to follow the fine-tuning notebooks below, and I'm getting stuck on token length issues.
https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-finetune-florence-2-on-detection-dataset.ipynb?ref=blog.roboflow.com#scrollTo=zqDWEWDcaSxN
https://colab.research.google.com/drive/1Y8GVjwzBIgfmfD3ZypDX5H1JA_VG0YDL?usp=sharing
https://colab.research.google.com/drive/1hKDrJ5AH_o7I95PtZ9__VlCTNAo1Gjpf?usp=sharing

I typically get token lengths in the 1040-1200 range, which throw an error during training:
Training Epoch 1/1: 0%| | 0/10 [00:00<?, ?it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (1038 > 1024). Running this sequence through the model will result in indexing errors
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [174,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed.

I was initially trying to use two image datasets, with images capped at 960px and 1600px respectively (the originals can be up to 4000px before scaling). The images are large because they are construction drawings, which I would ideally avoid splitting up so that related objects keep their context in the same frame. The larger images have considerably more annotations (up to 150 per frame), so I experimented with the smaller-image dataset, which has a maximum of 40 annotations in a single frame. I tried resizing those images from 960px down to a 320px maximum, which didn't seem to do very much. I then tried capping the number of annotations per frame, and the model was at least able to fine-tune.
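One possible reason resizing has little effect, sketched below under assumptions: if the OD target string follows the `label<loc_x1><loc_y1><loc_x2><loc_y2>` layout with coordinates quantized into 1000 bins (my understanding of the format used in the notebooks, not something verified here), then the string depends on the relative box positions and the number of boxes, not on pixel dimensions. The helper name `od_suffix` and the sample box are hypothetical.

```python
def od_suffix(boxes, width, height, bins=1000):
    """Build an OD target string from integer pixel boxes.

    boxes: list of (label, x1, y1, x2, y2) in pixels.
    The <loc_N> layout and the 1000-bin quantization are assumptions.
    """
    parts = []
    for label, x1, y1, x2, y2 in boxes:
        # Map each pixel coordinate to a bin index in [0, bins - 1].
        quantized = [
            min(bins - 1, x1 * bins // width),
            min(bins - 1, y1 * bins // height),
            min(bins - 1, x2 * bins // width),
            min(bins - 1, y2 * bins // height),
        ]
        parts.append(label + "".join(f"<loc_{q}>" for q in quantized))
    return "".join(parts)


# The same box on a 960px image and on its 320px downscaled copy.
full = od_suffix([("door", 96, 192, 480, 576)], width=960, height=960)
small = od_suffix([("door", 32, 64, 160, 192)], width=320, height=320)
print(full)           # door<loc_100><loc_200><loc_500><loc_600>
print(full == small)  # True
```

If that assumption holds, only reducing the number of boxes or shortening the class labels would shrink the target sequence.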

Are there any suggestions for getting around the 1024-token maximum? Would it be technically sound to have multiple copies of each image in the dataset, with each copy carrying annotations for only a single class, to reduce the input token length? I fear this would cause issues, both from having multiple copies of the same image and from leaving the other trained classes unlabeled in that frame.
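To make the splitting idea concrete, here is a minimal sketch that greedily packs one image's annotations into groups that each stay under a token budget. It does not answer whether training on partially labeled copies is sound. `chunk_annotations`, `count_tokens`, and the `reserve` margin are hypothetical names; in practice `count_tokens` would wrap the real tokenizer, and the margin is there because summing per-annotation counts can differ slightly from tokenizing the joined string.

```python
def chunk_annotations(annotations, count_tokens, budget=1024, reserve=16):
    """Greedily split annotations into groups that fit a token budget.

    annotations:  list of per-annotation target strings.
    count_tokens: callable returning the token count of one string
                  (hypothetical; wrap your tokenizer here).
    reserve:      safety margin for special tokens and merge effects.
    """
    limit = budget - reserve
    chunks, current, used = [], [], 0
    for ann in annotations:
        cost = count_tokens(ann)
        if cost > limit:
            raise ValueError("a single annotation exceeds the token budget")
        # Start a new group when adding this annotation would overflow.
        if current and used + cost > limit:
            chunks.append(current)
            current, used = [], 0
        current.append(ann)
        used += cost
    if current:
        chunks.append(current)
    return chunks


# Toy example: every annotation "costs" 5 tokens, budget leaves room for 5.
demo = chunk_annotations(["a"] * 10, count_tokens=lambda s: 5, budget=30, reserve=5)
print([len(group) for group in demo])  # [5, 5]
```

Grouping by budget rather than strictly by class keeps the number of copies per image as small as the limit allows; sorting `annotations` by class beforehand would approximate the one-class-per-copy variant.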

I actually wanted to train not only detection but also deeper descriptions for each class, so the model can understand the construction/engineering context for these novel classes (similar to this question: https://huggingface.co/microsoft/Florence-2-large/discussions/32). I initially tried adding two types of annotations to the jsonl for each image: one line with the "OD" prefix and another line with "DENSE_REGION_CAPTION". However, if I can't even get the number of annotations to work for "OD", this definitely won't work. Any suggestions?
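For reference, the two-lines-per-image layout described above might look like the sketch below. The `image` / `prefix` / `suffix` keys and the `<OD>` / `<DENSE_REGION_CAPTION>` task tokens are assumptions about the dataset format, and the file name, label, and caption are made-up placeholders.

```python
import json


def make_records(image_name, od_suffix, caption_suffix):
    """Return one OD record and one dense-caption record for an image.

    Key names and task tokens are assumptions about the JSONL format.
    """
    return [
        {"image": image_name, "prefix": "<OD>", "suffix": od_suffix},
        {"image": image_name, "prefix": "<DENSE_REGION_CAPTION>", "suffix": caption_suffix},
    ]


# Placeholder values for illustration only.
records = make_records(
    "sheet_001.png",
    "valve<loc_100><loc_200><loc_150><loc_260>",
    "gate valve on a supply line<loc_100><loc_200><loc_150><loc_260>",
)
lines = [json.dumps(record) for record in records]
print("\n".join(lines))
```

If each JSONL line is loaded as its own training sample, each one is tokenized separately, so the longer caption strings would hit the same length limit sooner than the OD strings for the same number of regions.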
