WhisperS2T now supports the TensorRT-LLM backend, achieving double the inference speed compared to the CTranslate2 backend! The current optimal configuration on an A30 GPU achieves transcription of 1-hour files in approximately 18 seconds.
After TensorRT-LLM optimization, the exported model only works on NVIDIA GPUs with the same cuda_compute_capability. This means a model exported on a T4 GPU won't work on an A100, and vice versa.
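The compatibility constraint above can be checked up front before loading a cached engine. Here is a minimal stdlib-only sketch (the function name is illustrative, not part of WhisperS2T) comparing the compute capability an engine was exported with against that of the current GPU:

```python
def engines_compatible(export_cc, runtime_cc):
    """A TensorRT-LLM engine built on one GPU only loads on GPUs whose
    CUDA compute capability matches the one it was exported with.

    Capabilities are (major, minor) tuples, e.g. (7, 5) for a T4."""
    return tuple(export_cc) == tuple(runtime_cc)

# T4 is compute capability 7.5 and A100 is 8.0, so a T4 export
# won't load on an A100.
print(engines_compatible((7, 5), (8, 0)))  # False
print(engines_compatible((7, 5), (7, 5)))  # True
```

At runtime, the current GPU's capability could be read with, e.g., `torch.cuda.get_device_capability()` before deciding whether a cached engine is usable.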
Help Needed: Model export takes about 3-6 minutes. I need volunteers to export the model for their specific GPUs and share it. It would be a huge help to the community! I have access to A30, A100, and T4 GPUs, for which I will add the exported models.
PS: I will update this discussion in a few weeks on how to contribute your exported model.
Thanks,
Shashi
@shashikg It's a very nice and exciting project. If you're not in a hurry, I will be available to help you after 20 days. One concern is that, with updates to the tensorrt-llm C++ runtime, these exported models might become obsolete. It would be best to have a one-button script to handle this, so we can regularly generate the latest models and then upload them to Hugging Face, like this: https://huggingface.co/csukuangfj/k2/tree/main/cuda with tensorrt-version-gpu-model-name.pt
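The tensorrt-version-gpu-model-name naming convention suggested above can be sketched as a small helper (the function name and the version string are illustrative assumptions, not an agreed-upon scheme):

```python
def engine_filename(trt_version, gpu_name, model_name):
    """Build a shareable filename following the suggested
    tensorrt-version-gpu-model-name.pt convention."""
    safe_gpu = gpu_name.lower().replace(" ", "-")
    return f"{trt_version}-{safe_gpu}-{model_name}.pt"

# Hypothetical example: TensorRT-LLM version and GPU name are assumptions.
print(engine_filename("0.7.1", "Tesla T4", "large-v2"))
# 0.7.1-tesla-t4-large-v2.pt
```

Encoding all three fields in the filename makes it obvious on Hugging Face which runtime version and GPU an uploaded engine targets.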
Hi @yuekaizhang thanks for showing interest in this project!
> If you're not in a hurry, I will be available to help you after 20 days.
Yeah, no hurry... I am planning to maintain this project for the long term!
> It would be best to have a one-button script to handle this, so we can regularly generate the latest models and then upload them to Hugging Face.
Good suggestion. At present, exported models are cached locally in a tmp directory (even this helps: if anyone uses the package again on the same device, it picks up the exported model from the cache directory). I am working on generating a tarred format for the exported model, which can be used to load the model or share it. I will also integrate the huggingface_hub client library to automate the upload process.
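The tar-and-upload flow described above might look roughly like this. A minimal sketch, assuming an export directory on disk and a hypothetical target repo id; the function names are illustrative, not WhisperS2T's actual API:

```python
import tarfile
from pathlib import Path


def tar_exported_model(export_dir, output_path):
    """Bundle a TensorRT-LLM export directory into one shareable tarball."""
    export_dir = Path(export_dir)
    with tarfile.open(output_path, "w:gz") as tar:
        # Store the directory under its own name inside the archive.
        tar.add(export_dir, arcname=export_dir.name)
    return Path(output_path)


def upload_to_hub(tar_path, repo_id):
    """Push the tarball to a Hugging Face repo (repo_id is hypothetical)."""
    # Imported lazily; requires the huggingface_hub package and a login token.
    from huggingface_hub import HfApi

    api = HfApi()
    api.upload_file(
        path_or_fileobj=str(tar_path),
        path_in_repo=Path(tar_path).name,
        repo_id=repo_id,
    )
```

`HfApi.upload_file` handles the commit to the model repo, so a single script can cover export, tarring, and upload end to end.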