cross-posted from: https://lemmit.online/post/1261677
This is an automated archive made by the Lemmit Bot.
The original was posted on /r/machinelearning by /u/pvp239 on 2023-10-31 18:49:08.
Hey r/MachineLearning!
At Hugging Face, we’ve worked hard over the last few months to create a powerful but fast distilled version of Whisper. We’re excited to share our work with you now!
Distil-Whisper is 6x faster than Whisper-large-v2 and performs within 1% WER on out-of-distribution datasets. On long-form audio, we even achieve better results thanks to a reduction in hallucinations.
For more information, please have a look:
GitHub page: https://github.com/huggingface/distil-whisper/tree/main
Paper: https://github.com/huggingface/distil-whisper/blob/main/Distil_Whisper.pdf
Quick summary:
- Distillation Process
We’ve kept the whole encoder, but reduced the decoder to just 2 layers. Encoding takes O(1) forward passes, while decoding takes O(N), one pass per generated token. To improve speed, all that matters is the decoder! The encoder is frozen during distillation while we fine-tune all of the decoder. Both a KL loss and pseudo-labeled next-word prediction are used.
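To make the two training signals concrete, here is a minimal sketch in plain Python of how a KL term and a pseudo-label next-word term could be combined per decoding step. The function names and the weights `alpha_kl` and `alpha_pl` are assumptions for illustration; the post does not say how the two terms are weighted, and the real training code works on batched tensors.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one vocabulary-sized list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs):
    # KL(teacher || student): how far the student's distribution is from the teacher's.
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)

def cross_entropy(student_probs, target_id):
    # Next-word prediction loss against the teacher's pseudo-label token.
    return -math.log(student_probs[target_id])

def distillation_loss(teacher_logits, student_logits, pseudo_labels,
                      alpha_kl=1.0, alpha_pl=1.0):
    # alpha_kl and alpha_pl are hypothetical weights, not values from the post.
    total = 0.0
    for t_logits, s_logits, label in zip(teacher_logits, student_logits, pseudo_labels):
        t = softmax(t_logits)
        s = softmax(s_logits)
        total += alpha_kl * kl_divergence(t, s) + alpha_pl * cross_entropy(s, label)
    return total / len(pseudo_labels)

# Toy example: one decoding step, vocabulary of 3 tokens, pseudo-label is token 0.
teacher = [[2.0, 0.0, 0.0]]
matched = distillation_loss(teacher, [[2.0, 0.0, 0.0]], [0])
mismatched = distillation_loss(teacher, [[0.0, 2.0, 0.0]], [0])
```

When the student reproduces the teacher's logits, the KL term vanishes and only the pseudo-label term remains, so `matched` is smaller than `mismatched` in this toy case.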
- Data
We use 20,000h of open-sourced audio data coming from 9 diverse audio datasets. A WER-filter is used to make sure low-quality training data is thrown out.
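A rough sketch of what a WER-filter can look like, assuming it compares each sample's dataset transcript with the teacher's pseudo-label and drops samples whose word error rate is too high. The field names `reference` and `pseudo_label` and the `max_wer=0.1` threshold are illustrative assumptions, not details given in the post.

```python
def word_error_rate(reference, hypothesis):
    # Word-level edit distance (substitutions, insertions, deletions)
    # divided by the number of reference words.
    ref = reference.split()
    hyp = hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

def wer_filter(samples, max_wer=0.1):
    # Keep only samples whose pseudo-label agrees closely with the transcript.
    # max_wer is a hypothetical threshold chosen for this example.
    return [s for s in samples
            if word_error_rate(s["reference"], s["pseudo_label"]) <= max_wer]

samples = [
    {"reference": "the cat sat on the mat", "pseudo_label": "the cat sat on the mat"},
    {"reference": "the cat sat on the mat", "pseudo_label": "a dog sat in the hat"},
]
kept = wer_filter(samples)
```

In practice both strings would be normalized first (casing, punctuation), which this sketch leaves out.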
- Results
We’ve evaluated the model only on out-of-distribution datasets and are only 1% worse than Whisper-large-v2 on short-form evals (CHiME-4, Earnings-22, FLEURS, SPGISpeech). On long-form evals (Earnings, Meanwhile, Rev 16) we beat Whisper-large-v2 thanks to a reduction in hallucinations.
- Robust to noise
Distil-Whisper is very robust to noise (similar to its teacher). We credit this to keeping the original encoder frozen during training.
- Pushing for max inference speed
Distil-Whisper is 6x faster than Whisper on both short-form and long-form audio. In addition, we employ Flash Attention and chunked decoding, which help us achieve a real-time factor of 0.01!
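For a sense of scale: the real-time factor is processing time divided by audio duration, so 0.01 means one hour of audio is transcribed in about 36 seconds. The sketch below shows that arithmetic along with one simple way to split long audio into overlapping chunks. The 15-second chunk length and 2.5-second overlap are assumptions for illustration, not settings taken from the post.

```python
def real_time_factor(processing_seconds, audio_seconds):
    # RTF below 1.0 means faster than real time.
    return processing_seconds / audio_seconds

def chunk_spans(total_seconds, chunk_seconds=15.0, overlap_seconds=2.5):
    # Split [0, total_seconds] into fixed-length windows that overlap slightly,
    # so the windows can be transcribed in a batch and stitched back together.
    # chunk_seconds and overlap_seconds are hypothetical defaults.
    step = chunk_seconds - overlap_seconds
    if step <= 0:
        raise ValueError("overlap must be smaller than the chunk length")
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return spans

rtf = real_time_factor(36.0, 3600.0)  # one hour of audio processed in 36 seconds
spans = chunk_spans(40.0)             # a 40-second clip split into overlapping windows
```

Because every chunk is independent, the decoder can process many of them at once instead of walking through the recording sequentially.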
- Checkpoints?!
Checkpoints will be released this Thursday and will be directly integrated into Transformers. All checkpoints will be licensed under MIT.
Whisper is a speech recognition AI by OpenAI. It's used for the conversation mode in ChatGPT for mobile, and it's crazy good compared to other models, as far as I've tested with some non-English languages. It might be the one Microsoft includes in Teams to automate note-taking.
Speech recognition is way faster than text generation, but 6 times faster is still a big leap in technology. It could eventually make way for near-natural cross-language conversations, and all the performance upgrades may also help make AI costs more efficient.