this post was submitted on 01 Aug 2024
43 points (95.7% liked)

Selfhosted


I'm using https://github.com/rhasspy/piper mostly to create some audiobooks and read some posts/news, but the voices available are not always comfortable to listen to.

Do you guys have any recommendation for a voice changer to process these audio files?
Preferably it'll have a CLI so I can include it in my pipeline to process RSS feeds, but I don't mind having to work through a UI.
Bonus points if it can process the audio streams.

top 13 comments
[–] [email protected] 22 points 1 month ago (1 children)

That's called text to speech, not a voice changer. A voice changer is the thing in the Darth Vader halloween masks.

There's been discussion on TTS programs here recently: https://lemm.ee/search?q=tts&type=All&listingType=All&communityId=185&page=1&sort=TopAll

Or you can search via your local instance/interface.

[–] [email protected] 9 points 1 month ago (2 children)

Text to speech is what piper is doing.
What I'm looking for is called a voice changer, since I want to change a voice that has already read something.

That's exactly what I want: "the thing in the Darth Vader halloween masks" but for Linux, preferably via a CLI that ingests audio files and lets me configure the voice however I want, not only Darth Vader.

[–] [email protected] 20 points 1 month ago

Oh, I see. I think it would still be easier to either use a different voice in piper (the github page talks about this) or use a different tts program entirely.

[–] [email protected] 4 points 1 month ago

So, all of the awkward pauses, the lack of inflection - you're saying keep those, just change who it sounds like is speaking?

[–] [email protected] 7 points 1 month ago

In case you wanted to try other TTS providers, here's a leaderboard based on user votes.

https://huggingface.co/spaces/TTS-AGI/TTS-Arena

[–] [email protected] 5 points 1 month ago (1 children)

"Do you guys have any recommendation for a voice changer to process these audio files?"

I'm not totally sure what you're going for.

If you want to transform spoken audio to a different sort of voice, then that's one problem.

But this Piper thing appears to be a text-to-speech software package, and I'd think that it'd be easier and provide a more-capable system to just obtain a different voice and re-generate the audio from the text, rather than generating the audio and then transforming it, unless I'm not getting what you're going for.

Like, here's a project -- which I have not used -- to generate Piper voices from audio samples of speech.

[–] [email protected] 3 points 1 month ago (1 children)

I haven't fully looked into creating a model for piper, but dealing with a dataset is not something I look forward to: gathering the data and everything that implies.

So I'm thinking it's easier to take an existing model and adjust its output to better fit what I'd like to hear all the time.

[–] [email protected] 4 points 1 month ago* (last edited 1 month ago)

I haven't used Piper, but I do want to let you know that it may be a lot easier than you think. I have used TortoiseTTS, and there, you can just feed in a handful (like, four or so) of short clips (maybe six seconds each, arbitrary speech), and that's adequate to let it do a reasonable facsimile of the voice in the recordings. Like, it doesn't involve long recording sessions speaking back pre-recorded speech, and you can even combine samples from different people to "mix" their voices. I grabbed a few clean short recordings from video of someone speaking, and that was sufficient. TortoiseTTS doesn't even retain the model; it rebuilds it from scratch from the samples you provided every time it renders voice (which is a testament to how little data it pulls in). It's not on par with, say, the tremendous amount of work involved in creating a voice for Festival or similar. The "Option B" for Piper on the page I linked to has:

"I have built usable voice models with as few as 50 samples of the target voice."

...which is more than the tiny handful that I was using on TortoiseTTS, but might open up a lot of options and provide control over what you're hearing, especially if you have a voice that you really like.

But, okay. Say you decide that you want to go the post-text-to-speech transform route. Do you have any idea how you want to process them? The most-obvious things I can think of are:

  • Pitch-shifting, like if you want the voice to sound more feminine or masculine.

  • Tempo-shifting, like if you want the voice to speak more-quickly or more-slowly, but without altering the pitch.

Those are straightforward transforms that people do do on voice recordings; if you want a command-line tool that can do this in a pipeline, sox is a good choice that I've used in the past.
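As a rough sketch of how that could slot into a pipeline, something like the following builds a sox command line using its `pitch` effect (shift in cents, 100 cents per semitone) and its `tempo` effect (speed factor, with `-s` to tune it for speech). The function names, file names, and default values are made up for illustration, and it assumes sox is installed and on the PATH.

```python
import subprocess
from typing import List


def build_sox_command(src: str, dst: str,
                      pitch_cents: int = 0, tempo: float = 1.0) -> List[str]:
    """Build a sox command that pitch-shifts and/or tempo-shifts src into dst.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    cmd = ["sox", src, dst]
    if pitch_cents != 0:
        # "pitch" takes a shift in cents; negative values lower the voice.
        cmd += ["pitch", str(pitch_cents)]
    if tempo != 1.0:
        # "tempo" changes speed without changing pitch; -s is the speech preset.
        cmd += ["tempo", "-s", str(tempo)]
    return cmd


def transform(src: str, dst: str,
              pitch_cents: int = 0, tempo: float = 1.0) -> None:
    """Run sox on a file; raises CalledProcessError if sox exits non-zero."""
    subprocess.run(build_sox_command(src, dst, pitch_cents, tempo), check=True)
```

Calling `transform("piper_out.wav", "adjusted.wav", pitch_cents=-200, tempo=1.15)` would then lower the voice by two semitones and speed it up by about 15%; the right numbers are a matter of taste and of the voice you start from.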

I can imagine that maybe you just want to apply some kind of effect to it (sounding like a robot in an echoey cave? Someone talking over an old radio? Shifting perceptual 3D position in space of the audio source?). There's a Linux -- I'm assuming, given your preference for a CLI, and the community, that this is a Linux environment -- audio plugin system called LADSPA and a successor system called LV2. Most Linux audio software, including sox, can run these on audio streams.

You can maybe do automated removal of silent bits, if there are excessive pauses...sox has silence-removal functionality.
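For the silence part, a minimal sketch using sox's `silence` effect might look like this. The arguments follow the pattern from the sox manual for shortening long pauses: `-l` keeps a bit of each pause instead of removing it entirely, `1 0.1 1%` trims leading silence, and `-1 <duration> 1%` repeats the treatment for every pause in the file. The helper name and the default values are assumptions that would need tuning by ear.

```python
from typing import List


def build_silence_command(src: str, dst: str,
                          keep_seconds: float = 0.3,
                          threshold: str = "1%") -> List[str]:
    """Build a sox command that shortens long pauses in src and writes dst.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    return [
        "sox", src, dst,
        "silence", "-l",           # -l: leave part of each pause intact
        "1", "0.1", threshold,     # trim leading silence until 0.1 s of sound
        "-1", str(keep_seconds), threshold,  # apply to every later pause too
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)`. A threshold that is too high will start eating quiet consonants, so it is worth testing on a short clip first.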

But most other things that I can think of that one might want to do to a voice, more-sophisticated stuff, like making it sound happy, say, or giving it a different accent or something...I think that it's going to be a lot harder to do that after the text-to-speech phase rather than before.

[–] [email protected] 1 points 1 month ago* (last edited 1 month ago)

Coincidentally, I just found this other thread that mentions EasyEffects: https://programming.dev/post/17612973

You might be able to use a virtual device to get it working for your use case.

[–] [email protected] 1 points 1 month ago (1 children)
[–] [email protected] 2 points 1 month ago (2 children)

I don't want to manage piper voices, I can handle that directly in my file system as I only have a few.
The issue is none of the ones I've found are good for me, so what I need is something to change the voice once it has been generated by piper.

[–] [email protected] 2 points 1 month ago

What you're looking for is called RVC. It's integrated into some voice-cloning GitHub projects, but I don't use it. Here, for example: https://github.com/codename0og/rvc-realtime-voice-changer

[–] [email protected] 1 points 1 month ago

There are a few voices included with pied which is why I suggested it.