f5-tts #4
-
I gave it a cursory glance the other day; my initial impression is that it's decently comparable to this repo's output for the voice sample I grabbed and tested. My only ick right now is that, like some other solutions (Bark, and I think some implementation(s) of VALL-E X), it requires transcribing the input audio prompt. While I can't complain if it works, it feels like a limitation, but I suppose it's necessary for "flow matching" models, since those are more primed for speech editing where TTS is a byproduct (at least going by Meta's Voicebox paper).

As for the dataset, I'm a bit puzzled. While I want to say another good job to Emilia for being a decent dataset, 100K hours of audio is extremely steep when typical models seem to get by with the ~50K hours LibriSpeech has. I don't know if the model actually needs that much or if it just helped to have the extra audio, especially since SESD boasts <1K hours, and my VALL-E got by with <20K hours of unique audio (I don't have the total amount of audio fed over the various training sessions off the top of my head). I suppose the paper might explain.

When I get a chance I'll take a deeper look through both the architecture and outputs. I'm curious about things like:
But for sure I'll compare it against my VALL-E in the demo page whenever I get the chance.
-
Color me impressed; while it performs to my expectations, it does perform pretty well. I'll do what I can to try and express my thoughts:
For what it's worth, it's a good model and gets my approval, and I feel a bit better knowing that my VALL-E is comparable to a SOTA model (ignoring the hiccups my VALL-E has sometimes).
-
As an addendum from doing a cross-eval on my VALL-E:
I definitely commend F5 for being a good model without really having many quirks. I swear every time I do any eval on my VALL-E, it's some different quirk I have to attest to, and F5 felt quirkless for the most part.
-
Thank you for sharing your thoughts in detail. Appreciate it!
-
You might have already seen it: https://github.com/SWivid/F5-TTS
The zero-shot results are quite good. Do you have any take on this? @e-c-k-e-r