Running the same code twice giving two different results #1085

Open
bchinnari opened this issue Oct 24, 2024 · 3 comments

@bchinnari

Hi, I am running faster-whisper on an audio file as follows:

segments, info = model.transcribe(wav, task="transcribe", language="hi", beam_size=1, word_timestamps=True, max_new_tokens=50)

The same code sometimes gives two segments and sometimes one segment on the same audio file. I find this weird. Is this expected? Whenever it gives two segments, the second one is always an "insertion": there is no speech there, but the model outputs some words anyway.

However, if I slightly modify the above call so that it does not output word timestamps, as follows:

segments, info = model.transcribe(wav, task="transcribe", language="hi", beam_size=1, max_new_tokens=50)

I always get only one segment in the output, with good accuracy.
Is the presence of word_timestamps=True causing this?
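
To pin down whether the output really changes from run to run, the same call can be repeated and the results compared. The sketch below is only an illustration: `summarize` and `runs_agree` are hypothetical helper names, and the only faster-whisper behaviour it relies on is that `transcribe` returns a lazy segment iterator that has to be consumed before segments can be compared.

```python
from typing import Callable, Iterable, List, Tuple

Summary = List[Tuple[float, float, str]]


def summarize(segments: Iterable) -> Summary:
    """Consume the (lazy) segment iterator and keep only comparable fields."""
    return [(float(s.start), float(s.end), s.text) for s in segments]


def runs_agree(transcribe_once: Callable[[], Iterable], n_runs: int = 5) -> bool:
    """Call transcribe_once n_runs times and report whether all runs match."""
    results = [summarize(transcribe_once()) for _ in range(n_runs)]
    return all(r == results[0] for r in results[1:])


# Hypothetical usage with an already loaded faster-whisper model:
# same = runs_agree(lambda: model.transcribe(
#     wav, task="transcribe", language="hi", beam_size=1,
#     word_timestamps=True, max_new_tokens=50)[0])
```

If the runs disagree, one thing that may be worth ruling out is temperature fallback, by passing a single `temperature=0.0` to `transcribe`. That is a guess at a possible source of randomness, not a confirmed cause.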

@bchinnari
Author

Is this possible? Has anyone else observed this?

@bchinnari
Author

Ok. Here is what I did. I took a pretrained HF model (https://huggingface.co/vasista22/whisper-hindi-small) and fine-tuned it using my data. Then I converted the checkpoint to faster-whisper format.

If I pass word_timestamps=True to the transcribe function, I get extra (useless) segments in the output. I don't know why.

This does not happen when I use the original Whisper model directly for transcription. It only happens with my fine-tuned model.

@bchinnari
Author

bchinnari commented Oct 25, 2024

When word_timestamps=False, the output is as follows:
Segment(id=1, seek=600, start=0.0, end=6.0, text='सितम्बर 19', tokens=[50364, 45938, 33279, 36158, 48521, 27099, 3941, 105, 25411, 1294], temperature=0.0, avg_logprob=-0.17912933772260492, compression_ratio=0.6857142857142857, no_speech_prob=1.3633834695708693e-14, words=None)

When it is True, the output looks like this:
Segment(id=1, seek=252, start=np.float64(0.0), end=np.float64(2.52), text='सितम्बर 19', tokens=[50364, 45938, 33279, 36158, 48521, 27099, 3941, 105, 25411, 1294], temperature=0.0, avg_logprob=-0.17907507040283896, compression_ratio=0.6857142857142857, no_speech_prob=1.3633834695708693e-14, words=[Word(start=np.float64(0.0), end=np.float64(2.16), word='सितम्बर', probability=np.float64(0.999978095293045)), Word(start=np.float64(2.16), end=np.float64(2.52), word=' 19', probability=np.float64(0.9993481040000916))])
Segment(id=2, seek=496, start=np.float64(2.52), end=np.float64(4.96), text='सितम्बर', tokens=[50364, 45938, 33279, 36158, 48521, 27099, 3941, 105, 25411], temperature=0.0, avg_logprob=-0.4705956637859344, compression_ratio=0.65625, no_speech_prob=0.02291429601609707, words=[Word(start=np.float64(2.52), end=np.float64(4.96), word='सितम्बर', probability=np.float64(0.741854028776288))])
Segment(id=3, seek=598, start=np.float64(4.96), end=np.float64(5.98), text='सितम्बर 19', tokens=[50364, 45938, 33279, 36158, 48521, 27099, 3941, 105, 25411, 1294], temperature=0.0, avg_logprob=-0.4931728406385942, compression_ratio=0.6857142857142857, no_speech_prob=0.2716793119907379, words=[Word(start=np.float64(4.96), end=np.float64(5.98), word='सितम्बर', probability=np.float64(0.7634602943435311)), Word(start=np.float64(5.98), end=np.float64(5.98), word=' 19', probability=np.float64(4.705471383203985e-06))])
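
One pattern in the output above: every `seek` value equals the segment `end` time multiplied by 100. That is consistent with Whisper's usual 10 ms feature frames (16 kHz audio, hop length 160), which is assumed here rather than read from this model's configuration. A small sketch of the arithmetic, using only the numbers printed above:

```python
# Assumption: standard Whisper front end (16 kHz audio, hop length 160).
SAMPLE_RATE = 16_000
HOP_LENGTH = 160
FRAMES_PER_SECOND = SAMPLE_RATE // HOP_LENGTH  # 100 frames per second


def seek_to_seconds(seek: int) -> float:
    """Convert a seek position in feature frames to seconds."""
    return seek / FRAMES_PER_SECOND


# (seek, end) pairs copied from the Segment outputs above.
observed = {
    "word_timestamps=False": [(600, 6.0)],
    "word_timestamps=True": [(252, 2.52), (496, 4.96), (598, 5.98)],
}

for label, pairs in observed.items():
    for seek, end in pairs:
        print(f"{label}: seek={seek} -> {seek_to_seconds(seek):.2f} s, reported end={end} s")
```

Read this way, with word timestamps enabled the decoder appears to resume at 2.52 s (the end of the last aligned word) instead of jumping to the end of the 6 s clip, so the remaining audio is decoded again. That would account for where the extra segments come from, but it is an interpretation of the numbers, not a confirmed diagnosis.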

  1. When the flag is False, the text and the number of segments are both correct, but the end of the segment is reported as 6.0, which is incorrect: 6 s is the duration of the whole wave file.
  2. When the flag is True, the text and the end time of the first segment are both correct, but two more segments are produced, which is incorrect.

Is there something obviously wrong here?
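
As a stopgap while the cause is unclear, the extra segments could be filtered out after transcription using the confidence fields already present in the output above. The sketch below is a workaround, not a fix. `DemoWord` and `DemoSegment` are hypothetical stand-ins that mirror a few of the printed fields (values rounded), and both thresholds are guesses chosen only because they separate the three segments shown above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DemoWord:  # stand-in for the printed Word fields
    start: float
    end: float
    word: str
    probability: float


@dataclass
class DemoSegment:  # stand-in for the printed Segment fields
    start: float
    end: float
    text: str
    avg_logprob: float
    words: Optional[List[DemoWord]] = None


def looks_hallucinated(seg, min_avg_logprob: float = -0.4,
                       min_word_prob: float = 0.01) -> bool:
    """Flag a segment with low average log-probability or a near-zero word probability."""
    if seg.avg_logprob < min_avg_logprob:
        return True
    return any(w.probability < min_word_prob for w in (seg.words or []))


def keep_plausible(segments) -> list:
    """Return only the segments that are not flagged."""
    return [s for s in segments if not looks_hallucinated(s)]


# Values rounded from the word_timestamps=True output above.
demo = [
    DemoSegment(0.0, 2.52, "सितम्बर 19", -0.179,
                [DemoWord(0.0, 2.16, "सितम्बर", 0.99998), DemoWord(2.16, 2.52, " 19", 0.99935)]),
    DemoSegment(2.52, 4.96, "सितम्बर", -0.471,
                [DemoWord(2.52, 4.96, "सितम्बर", 0.74185)]),
    DemoSegment(4.96, 5.98, "सितम्बर 19", -0.493,
                [DemoWord(4.96, 5.98, "सितम्बर", 0.76346), DemoWord(5.98, 5.98, " 19", 4.7e-06)]),
]

kept = keep_plausible(demo)
print([(s.start, s.end, s.text) for s in kept])
```

The helpers only use attribute access, so they should also work on real faster-whisper segments once the iterator has been turned into a list. The `vad_filter=True` option of `transcribe` might be worth trying as well, since the extra segments fall in a region without speech.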
