Randomly getting error while generating word timestamps #59

Open
rahulmate opened this issue Apr 9, 2024 · 2 comments

Comments

@rahulmate

Code:

```python
model = whisper_s2t.load_model(model_identifier="large-v2",
                               asr_options={'word_timestamps': True},
                               backend='TensorRT-LLM')

files = ['output.wav']
lang_codes = ['en']
tasks = ['transcribe']
initial_prompts = [None]

out = model.transcribe_with_vad(files,
                                lang_codes=lang_codes,
                                tasks=tasks,
                                initial_prompts=initial_prompts,
                                batch_size=16)
```

The above code sometimes throws the error below for the same file. Is there any explanation for this?
```
RuntimeError                              Traceback (most recent call last)
Cell In[15], line 10
      8 initial_prompts = [None]
      9 start = time.time()
---> 10 out = model.transcribe_with_vad(files,
     11     lang_codes=lang_codes,
     12     tasks=tasks,
     13     initial_prompts=initial_prompts,
     14     batch_size=16)
     15 end = time.time()
     16 print(f"batch :: {16} time:: {end-start}")

File ~/temp_triton/triton_env/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/temp_triton/triton_env/lib/python3.10/site-packages/whisper_s2t/backends/__init__.py:171, in WhisperModel.transcribe_with_vad(self, audio_files, lang_codes, tasks, initial_prompts, batch_size)
    169 for signals, prompts, seq_len, seg_metadata, pbar_update in self.data_loader(audio_files, lang_codes, tasks, initial_prompts, batch_size=batch_size):
    170     mels, seq_len = self.preprocessor(signals, seq_len)
--> 171     res = self.generate_segment_batched(mels.to(self.device), prompts, seq_len, seg_metadata)
    173 for res_idx, _seg_metadata in enumerate(seg_metadata):
    174     responses[_seg_metadata['file_id']].append({**res[res_idx],
    175         'start_time': round(_seg_metadata['start_time'], 3),
    176         'end_time': round(_seg_metadata['end_time'], 3)})

File ~/temp_triton/triton_env/lib/python3.10/site-packages/whisper_s2t/backends/tensorrt/model.py:248, in WhisperModelTRT.generate_segment_batched(self, features, prompts, seq_lens, seg_metadata)
    246 text_tokens = [[_t for _t in x[0] if _t < self.tokenizer.eot]+[self.tokenizer.eot] for x in result]
    247 sot_seqs = [tuple(_[-4:]) for _ in prompts]
--> 248 word_timings = self.align_words(features, texts, text_tokens, sot_seqs, seq_lens, seg_metadata)
    250 for _response, _word_timings in zip(response, word_timings):
    251     _response['word_timestamps'] = _word_timings

File ~/temp_triton/triton_env/lib/python3.10/site-packages/whisper_s2t/backends/tensorrt/model.py:200, in WhisperModelTRT.align_words(self, features, texts, text_tokens, sot_seqs, seq_lens, seg_metadata)
    198 token_alignments = [[] for _ in seg_metadata]
    199 for start_seq, req_idx in start_seq_wise_req.items():
--> 200     res = self.aligner_model.align(ctranslate2.StorageView.from_array(features[req_idx]),
    201         start_sequence=list(start_seq),
    202         text_tokens=[text_tokens[_] for _ in req_idx],
    203         num_frames=list(seq_lens[req_idx].detach().cpu().numpy()),
    204         median_filter_width=7)
    206     for _res, _req_idx in zip(res, req_idx):
    207         token_alignments[_req_idx] = _res

RuntimeError: No position encodings are defined for positions >= 448, but got position 454
```
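For reference, this error comes from the fixed size of Whisper's decoder position-embedding table: learned encodings are defined only for positions 0–447 (448 positions, which appears to be `MAX_TEXT_TOKEN_LENGTH` in whisper_s2t), so a token sequence that reaches position 454 has no defined encoding. The snippet below is a minimal illustration of that bound; `position_encoding_ok` is a hypothetical helper for demonstration, not part of whisper_s2t:

```python
# Whisper's text decoder has a fixed table of learned position embeddings;
# 448 is the decoder context length reported by the error above.
MAX_TEXT_TOKEN_LENGTH = 448

def position_encoding_ok(token_positions, table_size=MAX_TEXT_TOKEN_LENGTH):
    """Return True iff every token position has a defined position encoding."""
    return all(p < table_size for p in token_positions)

print(position_encoding_ok(range(448)))  # True: positions 0..447 are all defined
print(position_encoding_ok(range(454)))  # False: positions 448..453 are undefined
```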


aleksandr-smechov commented Apr 9, 2024

You can try adjusting the align_words method here to this:

```python
for start_seq, req_idx in start_seq_wise_req.items():
    # adding adjusted_num_frames
    adjusted_num_frames = [min(frame, MAX_TEXT_TOKEN_LENGTH) for frame in seq_lens[req_idx].detach().cpu().numpy()]
    res = self.aligner_model.align(
        ctranslate2.StorageView.from_array(features[req_idx]),
        start_sequence=list(start_seq),
        text_tokens=[text_tokens[_] for _ in req_idx],
        num_frames=adjusted_num_frames,
        median_filter_width=7
    )
```
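The clamp behind `adjusted_num_frames` can be shown standalone. The sketch below uses hypothetical frame counts and assumes `MAX_TEXT_TOKEN_LENGTH = 448` as in whisper_s2t:

```python
import numpy as np

MAX_TEXT_TOKEN_LENGTH = 448  # decoder position-embedding table size

# Hypothetical per-request frame counts; 454 and 512 exceed the table size
# and would trigger the "No position encodings are defined" RuntimeError.
seq_lens = np.array([300, 448, 454, 512])

# Same clamp as in the adjusted align_words: cap every count at 448.
adjusted_num_frames = [min(int(f), MAX_TEXT_TOKEN_LENGTH) for f in seq_lens]
print(adjusted_num_frames)  # [300, 448, 448, 448]
```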

and adjusting data_collate_fn here to:

```python
def data_collate_fn(self, batch):
    # adding max_seq_len_samples
    max_seq_len_samples = MAX_TEXT_TOKEN_LENGTH * (HOP_LENGTH * INPUT_STRIDE)
    if self.use_dynamic_time_axis:
        max_len = min(max([_[3] for _ in batch]) + self.dta_padding, N_SAMPLES, max_seq_len_samples)
    else:
        max_len = min(N_SAMPLES, max_seq_len_samples)
```
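As a sanity check on that cap, assuming Whisper's usual constants (`HOP_LENGTH = 160`, the mel hop at a 16 kHz sample rate, and `INPUT_STRIDE = 2` for the encoder's convolutional downsampling), `max_seq_len_samples` works out to 448 × 320 = 143 360 samples, i.e. roughly 9 seconds of audio per segment:

```python
# Assumed constants (Whisper's standard feature-extraction parameters):
HOP_LENGTH = 160             # mel-spectrogram hop length in samples
INPUT_STRIDE = 2             # encoder conv downsampling factor
MAX_TEXT_TOKEN_LENGTH = 448  # decoder position-embedding table size
SAMPLE_RATE = 16000          # Whisper's input sample rate in Hz

samples_per_frame = HOP_LENGTH * INPUT_STRIDE                   # 320 samples per encoder frame
max_seq_len_samples = MAX_TEXT_TOKEN_LENGTH * samples_per_frame

print(max_seq_len_samples)                # 143360 samples
print(max_seq_len_samples / SAMPLE_RATE)  # 8.96 seconds of audio
```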

Let me know if that fixes anything @rahulmate


rahulmate commented Apr 11, 2024

Thanks @aleksandr-smechov, the change to the `align_words` function solved the issue. I haven't benchmarked yet, but I will run it to check the timestamps. With the `data_collate_fn` change I was getting an error from the TensorRT model:

```
Could not set shape torch.Size([16, 80, 896]) for tensor x. Please check the profile range for which your model was build.
```

Currently I'm only using the `align_words` change, since my original issue was with the align model itself.

brunjo added a commit to ai-avatar/WhisperS2T2 that referenced this issue May 26, 2024
See shashikg#59 (comment)

Error: No position encodings are defined for positions >= 448, but got position 454