There are really compelling open-source models like Zonos coming out; ElevenLabs will need to figure out how to thread the needle to keep everyone happy while other solutions eat into the pie.
Oh I’m glad this tech went somewhere useful! I remember reading the paper and toying with the models they released as a proof of concept like… 8 years ago?
It was really powerful back then. The ability to do TTS of someone’s voice given literally 3 seconds of training data?! (In fact, I found it worked better with short, nonsense audio clips than with actual sentences. Saying “test test test” worked way better than reading real text.)
But now it looks like it can actually handle tone well. It’s also probably way better now, and less… asthmatic sounding.