Dataset bias ("❤️ Translated by Amara.org Community") #928
-
Hello, I noticed multiple biases when using Whisper. For example, it sometimes outputs (in French) the equivalent of "❤️ Translated by Amara.org Community". There are also leftovers of "soustitreur.com", which suggests that OpenAI trained on subtitles from soustitreur.com (or used it as a contractor). And at the end of a music video, when the music faded away, it output "Thank you for watching". Have you noticed any bias like this?
-
Related to #651
-
I saw this today in a transcription from Portuguese. It was at the end of a video (newscast) with no voice present and closing theme music playing:
-
This is our list:
{
"en": [
" www.mooji.org"
],
"nl": [
" Ondertitels ingediend door de Amara.org gemeenschap",
" Ondertiteld door de Amara.org gemeenschap",
" Ondertiteling door de Amara.org gemeenschap"
],
"de": [
" Untertitelung aufgrund der Amara.org-Community",
" Untertitel im Auftrag des ZDF für funk, 2017",
" Untertitel von Stephanie Geiges",
" Untertitel der Amara.org-Community",
" Untertitel im Auftrag des ZDF, 2017",
" Untertitel im Auftrag des ZDF, 2020",
" Untertitel im Auftrag des ZDF, 2018",
" Untertitel im Auftrag des ZDF, 2021",
" Untertitelung im Auftrag des ZDF, 2021",
" Copyright WDR 2021",
" Copyright WDR 2020",
" Copyright WDR 2019",
" SWR 2021",
" SWR 2020"
],
"fr": [
" Sous-titres réalisés para la communauté d'Amara.org",
" Sous-titres réalisés par la communauté d'Amara.org",
" Sous-titres fait par Sous-titres par Amara.org",
" Sous-titres réalisés par les SousTitres d'Amara.org",
" Sous-titres par Amara.org",
" Sous-titres par la communauté d'Amara.org",
" Sous-titres réalisés pour la communauté d'Amara.org",
" Sous-titres réalisés par la communauté de l'Amara.org",
" Sous-Titres faits par la communauté d'Amara.org",
" Sous-titres par l'Amara.org",
" Sous-titres fait par la communauté d'Amara.org",
" Sous-titrage ST' 501",
" Sous-titrage ST'501",
" Cliquez-vous sur les sous-titres et abonnez-vous à la chaîne d'Amara.org",
" ❤️ par SousTitreur.com"
],
"it": [
" Sottotitoli creati dalla comunità Amara.org",
" Sottotitoli di Sottotitoli di Amara.org",
" Sottotitoli e revisione al canale di Amara.org",
" Sottotitoli e revisione a cura di Amara.org",
" Sottotitoli e revisione a cura di QTSS",
" Sottotitoli e revisione a cura di QTSS.",
" Sottotitoli a cura di QTSS"
],
"es": [
" Subtítulos realizados por la comunidad de Amara.org",
" Subtitulado por la comunidad de Amara.org",
" Subtítulos por la comunidad de Amara.org",
" Subtítulos creados por la comunidad de Amara.org",
" Subtítulos en español de Amara.org",
" Subtítulos hechos por la comunidad de Amara.org",
" Subtitulos por la comunidad de Amara.org",
" Más información www.alimmenta.com",
" www.mooji.org"
],
"gl": [
" Subtítulos realizados por la comunidad de Amara.org"
],
"pt": [
" Legendas pela comunidade Amara.org",
" Legendas pela comunidade de Amara.org",
" Legendas pela comunidade do Amara.org",
" Legendas pela comunidade das Amara.org",
" Transcrição e Legendas pela comunidade de Amara.org"
],
"la": [
" Sottotitoli creati dalla comunità Amara.org",
" Sous-titres réalisés para la communauté d'Amara.org"
],
"ln": [
" Sous-titres réalisés para la communauté d'Amara.org"
],
"pl": [
" Napisy stworzone przez społeczność Amara.org",
" Napisy wykonane przez społeczność Amara.org",
" Zdjęcia i napisy stworzone przez społeczność Amara.org",
" napisy stworzone przez społeczność Amara.org",
" Tłumaczenie i napisy stworzone przez społeczność Amara.org",
" Napisy stworzone przez społeczności Amara.org",
" Tłumaczenie stworzone przez społeczność Amara.org",
" Napisy robione przez społeczność Amara.org",
" www.multi-moto.eu"
],
"ru": [
" Редактор субтитров А.Синецкая Корректор А.Егорова"
],
"tr": [
" Yorumlarınızıza abone olmayı unutmayın."
],
"su": [
" Sottotitoli creati dalla comunità Amara.org"
],
"zh": [
"字幕由Amara.org社区提供",
"小編字幕由Amara.org社區提供"
]
}
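A list like this can be applied as a post-processing filter. Here is a minimal sketch, not the poster's actual code: it assumes segments are dicts with a `text` key (the shape Whisper's `transcribe()` returns under `segments`), and `KNOWN_HALLUCINATIONS` and `drop_known_hallucinations` are hypothetical names, with only a small excerpt of the list above.

```python
# Hypothetical excerpt of the list above, keyed by language code.
KNOWN_HALLUCINATIONS = {
    "de": [
        " Untertitel der Amara.org-Community",
        " Untertitel im Auftrag des ZDF für funk, 2017",
    ],
    "fr": [
        " Sous-titres réalisés par la communauté d'Amara.org",
        " ❤️ par SousTitreur.com",
    ],
}


def _normalize(text):
    # Compare case-insensitively and ignore surrounding whitespace,
    # since the listed strings (and segment text) often start with a space.
    return text.strip().casefold()


def drop_known_hallucinations(segments, language, blocklist=KNOWN_HALLUCINATIONS):
    """Return the segments whose text is not an exact blocklist match."""
    blocked = {_normalize(s) for s in blocklist.get(language, [])}
    return [seg for seg in segments if _normalize(seg["text"]) not in blocked]


# Made-up segments for illustration only.
segments = [
    {"start": 0.0, "end": 2.5, "text": " Bonjour à tous."},
    {"start": 58.0, "end": 60.0, "text": " Sous-titres réalisés par la communauté d'Amara.org"},
]
kept = drop_known_hallucinations(segments, "fr")
print([seg["text"] for seg in kept])
```

Exact matching is deliberately conservative: it never removes real speech that merely mentions Amara.org, but it also misses new variants, which is probably why the list contains so many near-duplicates.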
-
It happened to me too, strangely enough only with the large-v2 model.
-
A funny one: when using The funny thing is that if I input So I guess both Whisper and Google Translate are using the same bad sources.
-
I got rid of these with --suppress_tokens "". See details here: Continued in #1488
-
I just ended up ignoring rows with a high
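The field name is missing from the comment above. As an assumption, the sketch below filters on `no_speech_prob`, which Whisper reports for each segment. The 0.6 cutoff and the name `drop_probable_silence` are illustrative choices, not values from this thread.

```python
def drop_probable_silence(segments, max_no_speech_prob=0.6):
    """Keep only segments whose no_speech_prob is at or below the cutoff.

    Segments without the field are kept, since there is nothing to judge them by.
    """
    return [
        seg for seg in segments
        if seg.get("no_speech_prob", 0.0) <= max_no_speech_prob
    ]


# Made-up rows for illustration only.
rows = [
    {"text": " Good evening.", "no_speech_prob": 0.02},
    {"text": " Thank you for watching.", "no_speech_prob": 0.91},
]
speech_rows = drop_probable_silence(rows)
print([row["text"] for row in speech_rows])
```

A cutoff like this trades recall for safety: quiet but real speech can also score high, so it is worth checking a sample of the dropped rows before trusting it.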
-
This is just additional info: this also happens with the ChatGPT Voice feature. I got the string "Untertitel im Auftrag des ZDF für funk, 2017" (which translates to "Subtitles commissioned by ZDF for funk, 2017"; funk is the content network of the public broadcasters ARD and ZDF) three times since yesterday. Every time this happened, the room was silent, and Whisper seemed to think it had recognized something and sent this prompt to the chat.
-
Has anyone tested it with the new large-v3 model? Do you think they filtered their dataset to fix this issue?
-
Hi, Thanks!
-
Yes, I have seen this when using the app to talk with ChatGPT on Android (the one with 10M downloads as of today). I believe the problem is managerial rather than a programming one. It seems that conversations are being rerouted via the API to an intermediary who performs the translations, sometimes really badly, so I tend to use push-to-talk or to type rather than the continuous speech feature identified by the "headphones" icon, which produces even worse dictations/translations, especially for Castilian/European Spanish. Sometimes, during these audio conversations, I can hear clicks and beeps in the background when there should be none, so it really seems to me that they are sending my data straight through a click farm in an offshore call center, like in the last two episodes of the third season of "Silicon Valley". These "features" are temperamental: sometimes they stop responding altogether and cut the conversation off, and other times they randomly alter the contents of your text (e.g. from "morning" to "afternoon"). I am really worried about the possibility of eavesdropping, IP theft, and data security issues, and, of course, about performance and effectiveness. It is also really annoying that the "Amara.org" tag is inserted randomly into conversations: it looks as if this company wants to claim credit for some of the features, whether justifiably or not I do not know, but certainly in a technically terrible and obnoxious manner. If you have to choose, as I said, the push-to-talk feature gives the best transcription accuracy.
-
Now, after a push-to-talk, it sometimes spits out stuff about a "2020 Copyright" instead of transcribing... really weird.
-
Am I right to think that we should collect the audio segments that produce those hallucinated lines, and then fine-tune the Whisper model on those audio segments? Has anyone already experimented with that?
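The collection step could look roughly like the sketch below. It only gathers the time ranges whose transcript contains a known marker and serializes them as JSON Lines; the fine-tuning itself is out of scope here. The marker list, the `collect_hallucinated_ranges` name, the manifest fields, and the file name `newscast.mp3` are all assumptions for illustration.

```python
import json

# Hypothetical markers: lowercase substrings of strings reported in this thread.
MARKERS = ("amara.org", "soustitreur.com", "im auftrag des zdf")


def collect_hallucinated_ranges(audio_path, segments, markers=MARKERS):
    """Return manifest rows for segments whose text contains a marker."""
    rows = []
    for seg in segments:
        text = seg["text"].casefold()
        if any(marker in text for marker in markers):
            # Whether the range is truly speech-free still has to be
            # verified by listening; this only flags candidates.
            rows.append({
                "audio": audio_path,
                "start": seg["start"],
                "end": seg["end"],
                "hallucinated_text": seg["text"].strip(),
            })
    return rows


def to_jsonl(rows):
    # ensure_ascii=False keeps accented characters readable in the manifest.
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


# Made-up segments for illustration only.
segments = [
    {"start": 0.0, "end": 4.0, "text": " Boa noite."},
    {"start": 118.0, "end": 120.0, "text": " Legendas pela comunidade Amara.org"},
]
manifest = collect_hallucinated_ranges("newscast.mp3", segments)
print(to_jsonl(manifest))
```

Substring matching is used here instead of exact matching because, for building a dataset, catching unseen variants matters more than avoiding the occasional false positive, which a manual review pass would catch anyway.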
-
It makes sense that this would happen. The model was obviously trained on subtitles, and those specific credit lines typically appear at the end of a video, when nothing is being said. I can think of two ways to fix this:
I have 2.5K transcriptions with "Amara" in them (on gliglish.com); I will test things on my end and report results.
-
Any news about this issue?