Xuan Son NGUYEN’s Post

Xuan Son NGUYEN

Network and Security Engineer

This is huge. We will soon be able to "just talk" to a computer like a homie, without any internet connection.

Thomas Wolf

Co-founder and CSO at 🤗 Hugging Face

Today's demo of Kyutai's fully end-to-end audio model is a huge deal that many people in the room missed.

Mostly irrelevant:
- it comes a few weeks after OpenAI's GPT-4o
- the demo was less polished than the 4o one (voice quality, voice timing…)

Relevant:
- the training pipeline and model architecture are simple and hugely scalable; a tiny team of 8+ people at Kyutai built it in 4 months. Synthetic data is a huge enabler here
- laser focus on local devices: Moshi will soon be everywhere. Frontier model builders have little incentive to let you run smaller models locally (price per token…), but non-profits like Kyutai have very different incentives. The Moshi demo is already online while OpenAI's 4o voice mode is still in limbo
- going under 300 ms of latency while keeping Llama-8B-or-better answer quality is a key enabler for interactivity; it's game-changing. That feeling when the model answers your question before you've even finished asking it, or when you interrupt the model mid-sentence and it reacts, is quite crazy. Predictive coding in a model: an instantly updated model of what you're about to say...

Basically, they nailed the fundamentals. It's here. This interactive voice tech will be everywhere. It will soon be an obvious commodity.
