this post was submitted on 10 Aug 2023
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/artificial by /u/Successful-Western27 on 2023-08-09 14:20:43.


If you're creating voice-enabled products, I hope this will help you choose which model to use!

I read the papers and docs for Bark and Tortoise TTS - two text-to-speech models that look similar on the surface but differ quite a bit in what they can actually do.

Here's what Bark can do:

  • It can synthesize natural, human-like speech in multiple languages.
  • Bark can also generate music, sound effects, and other audio.
  • The model supports generating laughs, sighs, and other non-verbal sounds. I find these really compelling, because the imperfections make the speech sound much more real. Check out an example here (scroll down to "pizza.webm").
  • Bark allows control over tone, pitch, speaker identity and other attributes through text prompts.
  • The model learns directly from text-audio pairs.
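To make the non-verbal cues concrete, here is a minimal sketch of how a prompt with laughs or sighs could be assembled before handing it to Bark. It assumes Bark reads bracketed cue tags such as `[laughs]` and `[sighs]` inline in the text prompt, as its README describes; the helper `build_bark_prompt` and the `NONVERBAL_TAGS` table are hypothetical names made up for this example and are not part of Bark.

```python
# Cue tags as listed in Bark's README; treat the exact spellings as an assumption.
NONVERBAL_TAGS = {
    "laughs": "[laughs]",
    "sighs": "[sighs]",
    "gasps": "[gasps]",
    "clears_throat": "[clears throat]",
}


def build_bark_prompt(segments):
    """Join text and cue segments into one prompt string.

    Each segment is either a plain string (spoken text) or a
    ("cue", name) tuple naming a key of NONVERBAL_TAGS.
    This helper is hypothetical; it only does string assembly.
    """
    parts = []
    for segment in segments:
        if isinstance(segment, str):
            parts.append(segment.strip())
        else:
            kind, name = segment
            if kind != "cue" or name not in NONVERBAL_TAGS:
                raise ValueError(f"unknown segment: {segment!r}")
            parts.append(NONVERBAL_TAGS[name])
    # Drop empty pieces so stray blanks don't produce double spaces.
    return " ".join(p for p in parts if p)


prompt = build_bark_prompt([
    "I like pizza.",
    ("cue", "laughs"),
    "But I also have other interests.",
])
print(prompt)
```

The resulting string is what you would pass as the text prompt to Bark's generation call. Keeping the cues in a lookup table means a typo in a cue name fails loudly instead of being read out as literal text.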

Whereas for Tortoise TTS:

  • It excels at cloning voices using just short audio samples of a target speaker. This makes it easy to produce speech in many distinct voices (like celebrities). I think voice cloning is the best use case for this tool.
  • The quality of the synthesized voices is pretty high.
  • Tortoise supports fine-grained control of speech characteristics like tone, emotion, and pacing through priming text.
  • Tortoise is only trained on English, and it can't produce sound effects.
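As a rough illustration of priming text, the sketch below builds a Tortoise-style prompt. It assumes the behaviour described in the Tortoise README, where text wrapped in square brackets shapes the delivery but is not spoken; the helper `prime_prompt` is a hypothetical name for this example, not a Tortoise function.

```python
def prime_prompt(spoken_text, priming_text):
    """Prefix spoken_text with bracketed priming text.

    Assumption: the TTS model uses the bracketed part to set the
    emotional tone and leaves it out of the generated audio.
    """
    priming = priming_text.strip()
    spoken = spoken_text.strip()
    if not priming:
        # Nothing to prime with, so return the text unchanged.
        return spoken
    if "[" in priming or "]" in priming:
        raise ValueError("priming text must not contain brackets")
    return f"[{priming}] {spoken}"


primed = prime_prompt("Please feed me.", "I am really sad,")
print(primed)
```

The same sentence can then be rendered with different moods by swapping only the priming text, which keeps the spoken script itself unchanged.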

Here's how they compare to the other speech-related models I've taken a look at so far:

| Model | Best Use Cases | Key Strengths |
| --- | --- | --- |
| Bark | Voice assistants, audio generation | Flexibility, multilingual |
| Tortoise TTS | Audiobooks, voice cloning | Natural prosody, voice cloning |
| AudioLDM (full guide) | Voice assistants | High-quality speech and SFX |
| Whisper | Transcription | Accuracy, flexibility |
| Free VC | Voice conversion | Retains speech style |
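If you want to use the comparison programmatically, for example to shortlist models for a product requirement, here is a small sketch that encodes the table as data and filters by use case. The `SpeechModel` class and `models_for` function are hypothetical names for this example, and the entries simply restate the table rather than adding any new evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechModel:
    name: str
    use_cases: tuple
    strengths: tuple


# Rows copied from the comparison table, lowercased for matching.
MODELS = [
    SpeechModel("Bark", ("voice assistants", "audio generation"),
                ("flexibility", "multilingual")),
    SpeechModel("Tortoise TTS", ("audiobooks", "voice cloning"),
                ("natural prosody", "voice cloning")),
    SpeechModel("AudioLDM", ("voice assistants",),
                ("high-quality speech and sfx",)),
    SpeechModel("Whisper", ("transcription",),
                ("accuracy", "flexibility")),
    SpeechModel("Free VC", ("voice conversion",),
                ("retains speech style",)),
]


def models_for(use_case):
    """Return the names of models whose listed use cases include use_case."""
    wanted = use_case.strip().lower()
    return [m.name for m in MODELS if wanted in m.use_cases]


print(models_for("voice cloning"))
print(models_for("Voice assistants"))
```

Matching is exact on the use-case label, so a query like "cloning" returns nothing; that keeps the lookup faithful to the table instead of guessing at near matches.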

I have a full write-up here if you want to read more; it's about a 10-minute read. I also looked at the model inputs and outputs and speculated on some products you can build with each tool.
