Translation Keeps randomly outputting a word #82

Closed
onlygary opened this issue Apr 6, 2024 · 8 comments

Comments

@onlygary

onlygary commented Apr 6, 2024

I am using the Whisper Medium EN model, translating from English to Japanese.

The translation seems to work fine while I am speaking, but when I stop talking (even when the mic is off), it keeps outputting "レイアウト" every few seconds. I tried this with other target languages too, and it just keeps outputting one random word.

No idea why this happens. It stops if I turn translation off.

@royshil
Collaborator

royshil commented Apr 10, 2024

@onlygary
This is caused by noise that the model mistakes for speech; it then assumes the noise is saying something like "thank you".
Add a noise cancellation filter before (above) the localvocal filter.

The new version coming out also has a built-in VAD that will improve this situation.
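
To illustrate the general idea of gating on voice activity, here is a minimal sketch that uses a plain RMS energy threshold. It is not the plugin's built-in VAD, and `rms`, `looks_like_speech`, and the `0.01` threshold are hypothetical names and values chosen only for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Root-mean-square level of a block of float samples (0.0 for an empty block).
inline float rms(const std::vector<float> &samples)
{
	if (samples.empty())
		return 0.0f;
	double sum_of_squares = 0.0;
	for (float s : samples)
		sum_of_squares += static_cast<double>(s) * s;
	return static_cast<float>(std::sqrt(sum_of_squares / samples.size()));
}

// Hypothetical gate: treat a block as speech only if its level reaches the
// threshold. The default threshold is an assumption, not a tuned value.
inline bool looks_like_speech(const std::vector<float> &samples,
			      float threshold = 0.01f)
{
	return rms(samples) >= threshold;
}
```

A caller would run this check on each audio block and skip transcription when it returns false. A simple energy gate like this still lets loud non-speech noise through, which is why a noise filter ahead of it helps.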

@royshil
Collaborator

royshil commented Apr 23, 2024

@onlygary can you please test #92?

@royshil
Collaborator

royshil commented Apr 30, 2024

This is fixed in #95.

@takoyaro

Came here to say that I'm also experiencing this issue.
For context, I have tried muting my mic entirely but unfortunately the issue still arises as the random output is clearly coming from an empty input passed to the translator:
[obs-localvocal] Translation: '' -> 'レイアウト

I'd love to contribute but my C++ knowledge hits its limits here.

11:36:04.341: [obs-localvocal] found 576000 bytes, 144000 frames in input buffer, need >= 576000, processing
11:36:04.341: [obs-localvocal] with 144000 remaining to full segment, popped 144000 info-frames, pushing at 0 (overlap)
11:36:04.341: [obs-localvocal] first segment, no overlap exists, 144000 frames to process
11:36:04.341: [obs-localvocal] processing 144000 frames (3000 ms), start timestamp 1025374744210200 
11:36:04.341: [obs-localvocal] 2 channels, 48000 frames, 3000.000000 ms
11:36:04.347: [obs-localvocal] VAD detected no speech in 48000 frames
11:36:04.347: [obs-localvocal] skipping inference
11:36:04.347: [obs-localvocal] Translating text. __en__ -> __ja__
11:36:04.501: [obs-localvocal] audio processing of 0 ms data took 159 ms
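
The numbers in the log above are consistent with each other, which a little arithmetic confirms. The sketch below assumes a 48 kHz input rate and 4 bytes per frame, both inferred from the logged values rather than stated in the thread. The drop from 144000 to 48000 frames would match resampling to 16 kHz, the rate Whisper models expect, but that reading is also an inference. `frames_to_ms` and `bytes_per_frame` are hypothetical helpers, not plugin functions.

```cpp
#include <cstdint>

// Duration in milliseconds of `frames` audio frames at `sample_rate_hz`.
constexpr double frames_to_ms(std::uint64_t frames, std::uint32_t sample_rate_hz)
{
	return 1000.0 * static_cast<double>(frames) / sample_rate_hz;
}

// Size of one frame in bytes, given a buffer's byte and frame counts.
constexpr std::uint64_t bytes_per_frame(std::uint64_t bytes, std::uint64_t frames)
{
	return bytes / frames;
}

// Values taken from the log; the sample rates are assumptions.
static_assert(bytes_per_frame(576000, 144000) == 4, "4 bytes per frame");
static_assert(frames_to_ms(144000, 48000) == 3000.0, "3 s at 48 kHz");
static_assert(frames_to_ms(48000, 16000) == 3000.0, "3 s at 16 kHz");
```

So a full 3000 ms segment was buffered and handed to the VAD, which found no speech in it. The "0 ms data" in the last line reflects that nothing was transcribed, yet the translation step still ran.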

I feel like when "VAD detected no speech in [...]" is logged, the empty output shouldn't be sent to the translation model.
You already have that implemented for speech inference as seen in the logs so that might be an easy win:
[obs-localvocal] skipping inference
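
The suggestion above amounts to a guard in front of the translation call. Here is a minimal sketch of such a guard, not the plugin's actual code: `is_blank` and `translate_if_not_blank` are hypothetical names, and the translator is passed in as a generic callback because the thread does not show the real translation API.

```cpp
#include <algorithm>
#include <cctype>
#include <functional>
#include <optional>
#include <string>

// True when the text contains no non-whitespace characters (ASCII whitespace only).
inline bool is_blank(const std::string &text)
{
	return std::all_of(text.begin(), text.end(),
			   [](unsigned char c) { return std::isspace(c) != 0; });
}

// Hypothetical wrapper: calls the translator only when there is something to
// translate, and returns std::nullopt otherwise.
inline std::optional<std::string> translate_if_not_blank(
	const std::string &text,
	const std::function<std::string(const std::string &)> &translate)
{
	if (is_blank(text))
		return std::nullopt; // nothing was transcribed, so skip translation
	return translate(text);
}
```

With a guard like this, a segment where the VAD found no speech would produce no translation output at all, instead of feeding an empty string to the model. It requires C++17 for `std::optional`.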

@royshil
Collaborator

royshil commented May 10, 2024

Yep, this is fixed on master but not yet in a released version; a release is coming shortly.

@royshil
Collaborator

royshil commented May 14, 2024

@takoyaro @onlygary can you test the latest version?

@takoyaro

Issue is fixed on my end. Thank you Roy!

@royshil
Collaborator

royshil commented Jun 6, 2024

closing this, resolved

@royshil royshil closed this as completed Jun 6, 2024