Are timestamp tokens used in previous text? #2140
George0828Zhang started this conversation in General
According to Figure 1 in the Whisper paper, during training the previous-text tokens do not contain timestamp tokens. However, during `transcribe()`, if `without_timestamps=False` and `condition_on_previous_text=True`, the prompt tokens (which contain the previous text) are passed into the model together with their timestamp tokens. I confirmed this by printing `options.prompt` right before this line:

whisper/whisper/transcribe.py, line 195 in ba3f3cd

There are several timestamp tokens in there. If this is indeed the case, wouldn't it cause a train-test mismatch?
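For reference, here is a minimal sketch (not the library's own debugging code) of how the captured prompt can be decoded to flag timestamp tokens; `prompt_tokens` is a hypothetical placeholder for the value printed from `options.prompt`:

```python
from whisper.tokenizer import get_tokenizer

# Minimal sketch: decode captured prompt token IDs and flag timestamp tokens.
# `prompt_tokens` below is a hypothetical placeholder, not actual output.
tokenizer = get_tokenizer(multilingual=True)
prompt_tokens = [tokenizer.timestamp_begin,        # <|0.00|>
                 1396,                             # some ordinary text token
                 tokenizer.timestamp_begin + 1500]  # <|30.00|>

for tok in prompt_tokens:
    if tok >= tokenizer.timestamp_begin:
        # timestamp tokens encode 0.00 s upward in 0.02 s steps
        seconds = (tok - tokenizer.timestamp_begin) * 0.02
        print(f"<|{seconds:.2f}|>  (timestamp token {tok})")
    else:
        print(repr(tokenizer.decode([tok])))
```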
Current transcript:

If both the previous text and the current segment have timestamps, then the current segment's timestamps should start from 30 s. But the vocabulary has no timestamp tokens for > 30 s.
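A quick way to check that the timestamp vocabulary tops out at 30 s, assuming the timestamp tokens `<|0.00|>` through `<|30.00|>` occupy the tail of the vocabulary as in the tiktoken-based tokenizer (a sketch, not an official check):

```python
from whisper.tokenizer import get_tokenizer

# Sketch: count the timestamp tokens and compute the largest encodable time.
# Assumes the timestamp tokens sit at the end of the vocabulary.
tokenizer = get_tokenizer(multilingual=True)
n_timestamps = tokenizer.encoding.n_vocab - tokenizer.timestamp_begin
print(n_timestamps)               # 1501 tokens, in 0.02 s steps
print((n_timestamps - 1) * 0.02)  # 30.0 -- no token exists for > 30 s
```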