cross-posted from: https://lemmit.online/post/1261677

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/pvp239 on 2023-10-31 18:49:08.


Hey r/MachineLearning!

At Hugging Face, we’ve worked hard over the past months to create a powerful but fast distilled version of Whisper. We’re excited to share our work with you now!

Distil-Whisper is 6x faster than Whisper-large-v2 and performs within 1% WER on out-of-distribution datasets. On long-form audio, we even achieve better results thanks to a reduction in hallucinations.

For more information, please have a look:

Quick summary:

  1. Distillation Process

We’ve kept the whole encoder but reduced the decoder to just 2 layers. Encoding takes O(1) forward passes, while decoding takes O(N), one per generated token, so to improve speed, all that matters is the decoder! The encoder is frozen during distillation while we fine-tune the entire decoder. Both a KL loss and next-word prediction on pseudo-labels are used.
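To make the two training signals concrete, here is a minimal per-token sketch in plain Python. It assumes the objective is a weighted sum of a KL term (student vs. teacher distribution) and a cross-entropy term on the teacher's pseudo-label. The names `distillation_loss`, `kl_weight` and `pl_weight` are hypothetical, and the post does not state how the two terms are weighted.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student): how far the student is from the teacher."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )


def distillation_loss(student_logits, teacher_logits, pseudo_label,
                      kl_weight=1.0, pl_weight=1.0):
    """Per-token loss: weighted KL term plus cross-entropy on the pseudo-label.

    kl_weight and pl_weight are illustrative defaults, not values from the post.
    """
    student_probs = softmax(student_logits)
    teacher_probs = softmax(teacher_logits)
    kl_term = kl_divergence(teacher_probs, student_probs)
    # Cross-entropy against the token the teacher transcribed (the pseudo-label).
    pl_term = -math.log(student_probs[pseudo_label])
    return kl_weight * kl_term + pl_weight * pl_term
```

When the student reproduces the teacher's logits exactly, the KL term is zero and only the pseudo-label term remains. A real training loop would average this over every token in a batch and backpropagate through the decoder only, since the encoder is frozen.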

  2. Data

We use 20,000 hours of open-source audio data from 9 diverse audio datasets. A WER filter is used to make sure low-quality training data is thrown out.
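A WER filter of this kind could look roughly like the sketch below. It assumes the filter compares each dataset transcript with the teacher's pseudo-label and drops pairs whose word error rate exceeds a threshold. The post does not say what is compared or which cut-off is used, so `filter_by_wer` and `max_wer=0.10` are hypothetical, and the lowercase-and-split normalisation is deliberately simplistic.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def filter_by_wer(examples, max_wer=0.10):
    """Keep (ground_truth, pseudo_label) pairs whose WER is at most max_wer.

    max_wer is an illustrative threshold, not a value from the post.
    """
    return [
        (truth, pseudo)
        for truth, pseudo in examples
        if word_error_rate(truth, pseudo) <= max_wer
    ]
```

For example, a pair whose pseudo-label matches the transcript word for word is kept, while a pair where every word differs has a WER of 1.0 and is discarded.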

  3. Results

We’ve evaluated the model only on out-of-distribution datasets, and we are within 1% WER of Whisper-large-v2 on short-form evals (CHiME-4, Earnings-22, FLEURS, SPGISpeech). On long-form evals (Earnings, Meanwhile, Rev 16) we beat Whisper-large-v2 thanks to a reduction in hallucinations.

  4. Robust to noise

Distil-Whisper is very robust to noise (similar to its teacher). We credit this to keeping the original encoder frozen during training.

  5. Pushing for max inference speed

Distil-Whisper is 6x faster than Whisper on both short-form and long-form audio. In addition, we employ Flash Attention and chunked decoding, which help us achieve a real-time factor of 0.01!
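For readers unfamiliar with the terms, here is a small sketch of what a real-time factor measures and how a long recording might be split for chunked decoding. The function names, the 15-second chunk length and the 2.5-second overlap are all hypothetical; the post does not give the chunking parameters.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def chunk_boundaries(audio_seconds, chunk_seconds=15.0, overlap_seconds=2.5):
    """Split a long recording into overlapping (start, end) windows, in seconds.

    chunk_seconds and overlap_seconds are illustrative values only.
    """
    if chunk_seconds <= overlap_seconds:
        raise ValueError("chunk_seconds must exceed overlap_seconds")
    stride = chunk_seconds - overlap_seconds
    boundaries = []
    start = 0.0
    while start < audio_seconds:
        end = min(start + chunk_seconds, audio_seconds)
        boundaries.append((start, end))
        if end >= audio_seconds:
            break  # the last window already reaches the end of the audio
        start += stride
    return boundaries
```

A real-time factor of 0.01 means one hour of audio (3,600 seconds) is transcribed in about 36 seconds. Because the windows are independent, they can be batched through the model together, and the overlapping regions give a stitching step some context for merging neighbouring transcripts.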

  6. Checkpoints?!

Checkpoints will be released this Thursday and will be directly integrated into Transformers. All checkpoints will be licensed under MIT.

    • webghost0101

      Whisper is a speech recognition AI by OpenAI. It's used for the conversation mode in ChatGPT for mobile, and it's crazy good compared to other models, as far as I've tested with some non-English speech. It might be the one Microsoft includes in Teams to automate note-taking.

      Speech recognition is way faster than text generation, but 6 times faster is still a big leap in technology. It could eventually make way for near-natural cross-language conversations, and all performance upgrades may also help make AI costs more efficient.