Handle batch processing when few files fails in the whole batch #50

Open
BBC-Esq opened this issue Mar 11, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@BBC-Esq

BBC-Esq commented Mar 11, 2024

When my script batch processes a bunch of audio files using the approach you gave me (passing a list of files and their settings), a single file failing for any reason prevents the transcriptions of all the files from being done. I created a workaround that passes each file to the transcribe_with_vad method individually (each with its own tqdm) and added error handling, which works. I was wondering if there's a way to use your most efficient approach and still have error handling for a specific audio file? Here is the original script and a comparison with the single audio file processing with error handling:

import os
from PySide6.QtCore import QThread, Signal
from pathlib import Path
import whisper_s2t
import time

class Worker(QThread):
    finished = Signal(str)
    progress = Signal(str)

    def __init__(self, directory, recursive, output_format, device, size, quantization, beam_size, batch_size, task):
        super().__init__()
        self.directory = directory
        self.recursive = recursive
        self.output_format = output_format
        self.device = device
        self.size = size
        self.quantization = quantization
        self.beam_size = beam_size
        self.batch_size = batch_size
        self.task = task.lower()

    def run(self):
        directory_path = Path(self.directory)
        patterns = ['*.mp3', '*.wav', '*.flac', '*.wma']
        audio_files = []

        if self.recursive:
            for pattern in patterns:
                audio_files.extend(directory_path.rglob(pattern))
        else:
            for pattern in patterns:
                audio_files.extend(directory_path.glob(pattern))

        max_threads = os.cpu_count()
        cpu_threads = max((2 * max_threads) // 3, 4) if max_threads is not None else 4

        model_identifier = f"ctranslate2-4you/whisper-{self.size}-ct2-{self.quantization}"
        model = whisper_s2t.load_model(model_identifier=model_identifier, backend='CTranslate2', device=self.device, compute_type=self.quantization, asr_options={'beam_size': self.beam_size}, cpu_threads=cpu_threads)

        audio_files_str = [str(file) for file in audio_files]
        output_file_paths = [str(file.with_suffix(f'.{self.output_format}')) for file in audio_files]

        lang_codes = 'en'
        tasks = self.task
        initial_prompts = None

        start_time = time.time()

        if audio_files_str:
            self.progress.emit(f"Processing {len(audio_files_str)} files...")
            out = model.transcribe_with_vad(audio_files_str, lang_codes=lang_codes, tasks=tasks, initial_prompts=initial_prompts, batch_size=self.batch_size)
            whisper_s2t.write_outputs(out, format=self.output_format, op_files=output_file_paths)

            for original_audio_file, output_file_path in zip(audio_files, output_file_paths):
                self.progress.emit(f"{tasks.capitalize()} {original_audio_file} to {output_file_path}")

        processing_time = time.time() - start_time
        self.finished.emit(f"Total processing time: {processing_time:.2f} seconds")

image
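
For readers following along, the per-file workaround described above can be sketched roughly as follows. This is a minimal sketch, not the code from the release: `transcribe_each` and `transcribe_batch` are hypothetical names, and `transcribe_batch` stands in for a wrapper around the `model.transcribe_with_vad(...)` call from the script. It assumes a failing file raises an exception and that the call returns one output per input file.

```python
from typing import Callable, Sequence


def transcribe_each(files: Sequence[str], transcribe_batch: Callable[[list], list]):
    """Call transcribe_batch on one file at a time so a failure stays local.

    transcribe_batch is a hypothetical stand-in for something like
    lambda fs: model.transcribe_with_vad(fs, lang_codes=..., tasks=..., ...).
    """
    results = {}  # path -> transcription output for files that succeeded
    errors = {}   # path -> exception raised for files that failed
    for path in files:
        try:
            # A single-element list keeps the same call shape as the batch call.
            results[path] = transcribe_batch([path])[0]
        except Exception as exc:  # broad on purpose: record any per-file failure
            errors[path] = exc
    return results, errors
```

The trade-off is the one raised in this issue: every file is isolated, but the model never sees more than one file per call, so the efficiency of passing the whole list at once is lost.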

@BBC-Esq
Author

BBC-Esq commented Mar 11, 2024

Here's the final version that I ended up incorporating into my latest release to avoid the issue, but I would still be very interested in knowing if there's a way to keep a single file from causing the entire batch processing of multiple files to fail...

https://github.com/BBC-Esq/WhisperS2T-transcriber/releases/tag/v1.1.0
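
One possible middle ground, sketched here purely as an illustration and not as the maintainer's fix or anything in the linked release: try the whole list in one call, and only when that raises, split the list in half and retry each half until the failing file is isolated. The names `transcribe_isolating_failures` and `transcribe_batch` are hypothetical; `transcribe_batch` again stands in for a wrapper around `model.transcribe_with_vad(...)`. The sketch assumes a bad file raises an exception and that outputs come back in the same order as the input files.

```python
from typing import Callable, Sequence


def transcribe_isolating_failures(
    files: Sequence[str], transcribe_batch: Callable[[list], list]
):
    """Try the full batch first; on failure, bisect to find the bad files."""
    results = {}  # path -> output for files that were transcribed
    errors = {}   # path -> exception for files that fail even on their own

    def attempt(chunk):
        if not chunk:
            return
        try:
            outputs = transcribe_batch(chunk)
        except Exception as exc:
            if len(chunk) == 1:
                # The failure is pinned to this one file; record it and move on.
                errors[chunk[0]] = exc
                return
            # Otherwise retry each half separately to narrow down the culprit.
            mid = len(chunk) // 2
            attempt(chunk[:mid])
            attempt(chunk[mid:])
            return
        # Assumes outputs are returned in the same order as the inputs.
        results.update(zip(chunk, outputs))

    attempt(list(files))
    return results, errors
```

When nothing fails this costs a single batched call. When a file does fail, the work done before the exception is thrown away and repeated on the smaller chunks, so this only pays off if failures are rare.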

@shashikg
Owner

Hey @BBC-Esq! I think there can be a simple fix for this. I will add the fix in the next release.

PS: I'm slightly stuffed with my office work. Expect some delay in the next release (end of March probably).

PPS: The next release will also include an end-to-end, deployment-ready server for WhisperS2T!!

@shashikg changed the title from "ERROR when batch processing?" to "Handle batch processing when a few files fails in the whole batch" on Mar 20, 2024
@shashikg added the bug label on Mar 20, 2024
@BBC-Esq
Author

BBC-Esq commented May 24, 2024

> Hey @BBC-Esq! I think there can be a simple fix for this. I will add the fix in the next release.
>
> PS: I'm slightly stuffed with my office work. Expect some delay in the next release (end of March probably).
>
> PPS: The next release will also include an end-to-end, deployment-ready server for WhisperS2T!!

Do you have time to continue working on this repository? CTranslate2 just implemented flash attention, BTW.
