VO BOSS

Business:Entrepreneurship

Voice and AI: Respeecher

2021-12-16


What if you could perform beyond the limitations of your own voice? Anne is joined by special guest Alex Serdiuk for a bonus Voice and AI episode. They discuss Respeecher’s speech-to-speech technology, the limitations of your natural voice, and how a synthesized voice is similar to a printing press. The future isn’t just on its way - the future is here - and creative possibilities are endless when human voices and technology work together...

Transcript

>> It’s time to take your business to the next level, the BOSS level! These are the premier Business Owner Strategies and Successes being utilized by the industry’s top talent today. Rock your business like a BOSS, a VO BOSS! Now let’s welcome your host, Anne Ganguzza.

Anne: Hey everyone. Welcome to the VO BOSS podcast for another episode of the AI and Voice series. I'm your host, Anne Ganguzza, and today I'm excited to bring you special guest Alex Serdiuk. Alex is the founder and CEO of Respeecher, an AI speech-to-speech based company that creates voice cloning for content creators. Respeecher's technology was the first synthetic speech adopted by big Hollywood productions starting around 2019. And their primary focus is in improving the voice cloning technology in many directions, including the tech democratization to let sound professionals and creators have access to it. And as a voice talent, we love that. So Alex, thank you so much for joining me today. It's a pleasure having you.

Alex: Hey Anne, everyone. It's so great to be here. Thank you for having me.

Anne: Yes. So I have so many questions. You're a relatively young company founded in 2018, correct?

Alex: Yes. That's correct, yes.

Anne: Yeah. So, but you seem to have come a really long way in a very short amount of time. So if you don't mind, tell us a little bit about your company and how you got started.

Alex: Yeah, actually for us, it felt like a very long amount of time, like eternity. But yeah, we started a bit earlier than 2018 with the idea we were playing around for several years. So we actually participated in one hackathon in Kiev, in Ukraine, and everyone was picking these ideas of applying deep learning AI, quite sophisticated machine learning techniques to do something with visuals, to do something with pictures. And we thought that would be cool to try doing something with speech, and that's harder task because we are much pickier about the stuff we hear, unlike the stuff we see. And we ended up winning that hackathon with a very simple prototype of voice conversion technology that allowed one voice sound like another voice. Then we started to play around with the technology, started to speak to some folks we thought who could be our first clients, if we start this company. And they told us that it's all about quality. So if you talk about high quality voice cloning, it should be really high. So it should be indistinguishable for listener, whether it's synthesized or not.

And given that we are quite picky about the sounds, that we spot all the tiny little artifacts in sound, the task has been challenging. So we launched the company in 2018 and it took us about a year to get to the level where it could actually be of interest to some big sound engineers in Hollywood. And since then we've been improving the technology in several directions, usability, quality of the sound, speed, all that stuff. We try to make it better on a constant basis.

Anne: Got it, got it. So, all right. What might seem like a simple question, because I think a lot of us in the voice industry, we've heard about text-to-speech. And as a matter of fact, we've been doing it for a very long time, you know, TTS projects. But now speech-to-speech is different. And so tell us exactly what is the difference between text-to-speech and speech-to-speech.

Alex: Yeah. The difference is in input, right? So when you use text-to-speech, you type words, and there is some AI that tries to make those words sound like they were spoken by human. The thing is there are two, in my opinion, holistic problems with text-to-speech. And that's one of the reasons why we do speech-to-speech. The first holistic problem would be text-to-speech being so limited to language models, to vocabularies. So if you want to try something different from what is in the vocabulary, it would fail. So if you try to pronounce some unusual name or street address, text-to-speech doesn't know where to take it from. That's one problem.

The second one would be emotional control. And this one is huge. So text-to-speech can offer you few emotions, right? It can sound excited or sad, but that's it. And we humans are best in terms of producing emotions as we use our vocal apparatus. And we are the best in terms of being guided, how to produce emotions. So if you try to imagine very sophisticated text-to-speech that would allow you to have all these triggers our vocal apparatus has from the day we were born, that would be a very comprehensive tool. That would be extremely hard to use. It would be just simpler to say it in the exact way you want to say it. And that's where speech-to-speech comes in.

So the idea of speech-to-speech is to enable a human speaking in the voice of another human, speaking in another timbre, and all the emotions, all the inflections, all this stuff is being taken from source speaker. That means that you act, but you remove this boundary of being attached to the vocal apparatus you were born with, the voice you have at the particular moment of your life. You can sound very different and that would be natural because emotions, inflections, acting would be yours. The timbre would be different.

Anne: So then you require an actor to be a model for whatever voice that gets applied to? Is that correct?

Alex: That's correct. We heavily rely on the actors.

Anne: So then I would think that it's a different process because what I'm familiar with in terms of synthetic voices is that we record a whole bunch of prompts and then there becomes this voice that's created from that. And your technology basically has a source voice, is that correct, that is the actor? And then you can apply any different voice to that voice model? And so for every script, you would have an actor speaking those words, and then you would be able to apply any voice to that?

Alex: Yeah, that's right. So basically our model compares voices. So it compares your voice to another voice you want to sound like, and it understands the difference between your timbre and the timbre you want to sound like. And then after model learned those differences, you can actually feed there recordings in your voice. And those recordings would be converted into the voice of your desire.

Anne: So then let's talk about the target voice, first of all. Is that something that let's say when you have different target voices, if I want it to be a target voice, I would say, how do I create that target voice? Is that similar to how most people create their synthetic voices? Meaning I record a series of prompts, and it becomes part of the data model, and then a voice is created, and then that is how you create your target voices?

Alex: Yeah, that's correct. It's similar to text-to-speech. So basically you would need to record your voice in very good condition for some time, though speech-to-speech requirements are lower usually than text-to-speech. You don't need to go in studio and spend like hours, say, on a particular script. We can take existing recordings of your voice, and that would be enough. We just need observations of your voice saying different things in different emotions so model would learn it and then it's good to go.

Anne: Interesting. So then it's basically your model, which is the actor, would be any good either audio example that you have of acting, but it doesn't have to be the exact script?

Alex: Correct.

Anne: Is that correct? Okay.

Alex: Yeah. They can read a lullaby for their baby or whatever. And in many of our projects, in many of our film projects, we had to deal with old recordings because we used to do a lot of de-aging or resurrecting projects. And that's cool about speech-to-speech that we can take existing recordings in quite a small amount. So currently we require like 40 minutes, but in plenty of projects, we had to deal with much less data.

Anne: Wow. So then, so this is an additional layer that you do. So not only do you create the target voices in a traditional like text-to-speech kind of way where you're creating the synthetic voice, but you're also creating that speech-to-speech model, which is the acting. And that, again, like you're saying, doesn't necessarily have to be the same script that you want to be repeated. Let's say there's a new movie out, and you want to have a particular target voice on it. Would the actor model have to go in and say all the lines first so that the speech-to-speech target could kind of, I guess, mimic it or reiterate it?

Alex: Yeah. So the -- the way how our system works, we would on the first stage, on the training stage, we would need just examples of a target voice, someone we impersonate, and source voice, a voice actor who would be doing impersonation. And we don't care much about what is the content, what are the spoken words? So it could consist of the content that needs to be converted further for the movie, but it could be something different. But then once the model is trained, you can say exact lines in the exact performance that are needed for the movie. And that would be converted into a target voice within minutes.

Anne: Got it. That's pretty impressive. What are the applications that you see for your speech-to-speech software?

Alex: Yeah, we've been focused on very high quality content because what's special about our technology, it can produce very high quality results, not just because of sound quality itself, quality of the sound files, but also because of the control you would have over emotional content. So you can make it sound exactly as you want it to sound. We've been applying our technology for films, animation, TV series, where we helped content creators get voices they cannot get in any other way. Like we did some work for Mandalorian season two, where we helped with making the voice, synthesizing the voice of young Mark Hamill, young Luke Skywalker --

Anne: Yeah.

Alex: -- who appeared in the very last scene. And you cannot get this voice anymore. You have recordings of 40 years old, but the voice of Mark Hamill is drastically different --

Anne: Yes.

Alex: -- from what he had 40 years ago.

Anne: 40 years ago.

Alex: That would be one application. We did some resurrection projects. One of them audience might have heard of would be Super Bowl opening where Vince Lombardi came and said some encouraging things about all the challenges our society needs to go through in this quite, quite hard time.

Anne: I remember that.

Alex: Yeah, that was a powerful piece we did together with NFL, Digital Domain, 72andSunny. And the idea was to resurrect the voice of this person. We also did one cool project in resurrection where we made quite famous announcer -- not just announcer, but basketball commentator in Puerto Rico, who died 20 years ago, to voiceover the whole game in August, when --

Anne: Wow.

Alex: -- Puerto Rico made it to Olympics.

Anne: Wow.

Alex: And that was huge for us because we were focused on short form content for quite a while. Our technology has been heavy and we required a lot of takes. And that might have been one of the first projects when we had like an hour and a half (?) of voiceover in one take that had to be converted overnight for putting on TV the next day. And it worked out. So it sounded good. And recordings for target voice for Manolo were extremely bad. So it was quite, quite complicated, but it turned out to be working, and Telemundo put it on stream.

Anne: Wow. So then that's very impressive. Now it's also very scary, not just for me as a voice actor, but I'm thinking for the consumer, right, who's listening to the voice. So what sort of steps are taken to, I guess, notify the listener that maybe, especially if you're resurrecting voices. I would imagine that there's gotta be some sort of a protocol where you're allowing people or letting people know that this voice is resurrected or like, what are your thoughts on that?

Alex: Yeah. I mean, we basically build some guiding principles, guiding ethics principles from the very beginning when we started. And the first thing we always ask our prospective clients, when they want to do a project, whether they have permission or going to obtain one from owner of the voice they're going to clone. And in case if that person would be deceased, we would require permission from their relatives or estate or if that's a president, from president library, from company or individual that owns the right. And that would be the very first step. Then we actually need to be sure that the project is not controversial in general, because it might be not wrong to do something with permission. But if it's very attached to politics or very controversial content, even with permission, we can just say no, because there is a lot of fear to this technology --

Anne: Yes.

Alex: -- in general and --

Anne: And deep fakes, I'm thinking. Right?

Alex: And deep fakes. Yeah. And the thing is, I mean, the technology itself is neither good nor bad. It's just an instrument like a Photoshop, like hammer, like printing press. The thing is that we used to be scared of something new. And our goal is to showcase exciting, cool projects, creative opportunities, opportunities for voice actors using this technology without some bad projects to be in the news, because bad news travels so far, right? Everyone's heard about this Anthony Bourdain project that is --

Anne: Yes.

Alex: -- very controversial. Right?

Anne: Yes.

Alex: But I guess much less people heard about the amazing work we did for Mandalorian --

Anne: Yeah.

Alex: -- even though Mandalorian is the biggest TV series of 2020.

Anne: That's very true. That's very true. So then maybe you can answer this question. As a voice actor, what are the opportunities for me, as a voice actor -- number one, I like that you have an ethics statement on your website, and that you say that you are not allowing any deceptive uses of the technology. But number one, how can voice actors use this to let's say enhance our opportunities? And also how are you protecting the voice actor from any type of misuse or deepfakes or ethics?

Alex: Yeah, I mean, in terms of protection, we do have quite strict protocols that are required from us when we work with biggest Hollywood studios, right? So we have data security and stuff in place. In terms of opportunities, look, let's think about this technology from the point of view that the technology itself removes limitation you have. You -- you've been attached to your voice, and you're attached to your voice you have in particular moment of your life. So you can act, you can, you can work only with the particular vocal timbre you have been born with, right? The technology allows you to sound very different. So you can sound like 70 years old woman, or like 12 years old kid. And it would sound like 12 years old kid or 70 years old woman in terms of naturalness. The thing is you would, you would act those voices.

And that means that, in my opinion, in future, the distribution of load between voice actors could be significantly improved in future. Because when voice actor is being hired, they're hired for two things, their ability to act and their vocal timbre, the unique timbre they have. And now we can remove the timbre part from equation, and voice actors would be hired because their ability to perform. And that's amazing because some voice actors who meet very high demand for their particular vocal timbre can give this timbre, can license this timbre to other voice actors who can use it with their approval. But also the voice actors who cannot get jobs just because their vocal timbre does not match this particular character can actually get these jobs because they can sound like, like a different person.

Anne: So then they would buy a license for that target person? Is that correct? How does that work?

Alex: Yeah, that's correct. I mean, our company has been focused on like one-off projects for quite a while because the technology has been heavy, but this year we launched what we call a voice marketplace, and that would be a self-serve product. There -- it's been a roller coaster for us to make this heavy technology we used to operate manually work in self-serve mode. But voice marketplace is out and it works. And it's really cool piece of technology where we try to democratize access to such a fine tool, to smaller creators and to voice actors.

And the idea of the voice marketplace is that, as a user of the voice marketplace, you can speak in 40, more than 40 different vocal timbres we created there for you. And we actually hired people. We paid them money. We got their release and consent to use their voice in the voice marketplace. And those voices we have in the voice marketplace so far belong to average people because the most important part is this --

Anne: The timbre.

Alex: -- timbre. Yeah. But acting could be done by user --

Anne: Interesting.

Alex: -- and that means that you can sound exactly like any of those voices we have in the system and just utilize opportunities in terms of acting and performance, instead of being limited to the vocal timbre you own. So that's one way how --

Anne: Got it.

Alex: -- voice actors can benefit from this technology right now.

Anne: So then I can have an account in your marketplace, and then I can purchase additional timbres. Is that correct?

Alex: Yeah, that's correct. And you can get access to all the voices we have on the voice marketplace, try it out, but that's like a starting point.

Anne: Interesting.

Alex: We started with some like average voices, but in future, we want to add other voices, professional voices, because I mean, when system has not seen some particular emotions like singing, or crying, or whispering, it performs suboptimal, right? And people who are not professionally trained to be voice actors cannot produce many emotions. And that means for getting very high quality and professional voices in the output, you would want to see in target voices some professional voices.

Anne: Yes.

Alex: We want to invite voice actors in future as well as we want to get licensing deals with some famous voices and even voices from the past.

Anne: Sure.

Alex: But the thing is this kind of improvement to the voice marketplace as a product requires us to build two more layers. The first one would be approval layer. So as target voice, when you supply your voice to the system, you should feel secure that your voice is not used for something that you feel is inappropriate.

Anne: Sure.

Alex: So you need to be able to approve the content that is being created --

Anne: Yes.

Alex: -- with your voice or approve the user, the company, or the individual who want to use your voice. That's first thing. The second layer would be building compensation model --

Anne: Yes.

Alex: -- because there should be economics that's built on usage.

Anne: Sure.

Alex: It shouldn't be just one time licensing deal.

Anne: Right.

Alex: And those layers, they require some time to be built as well as some attention. And they should work very properly because it should be trusted.

Anne: Yes. And I do believe that for a voice talent, if they were a target voice or the source voice, I think they would want to number one, it should be a permission-based model. Or they would want that. Also they would want fair compensation. And I, I agree with you saying that that compensation would be on a per job basis because there is, you know, the way that we determine usage now, if we're doing a McDonald's commercial, right, we have a certain time that we can use that. And we aren't able to use our voice for a competitor. So I think on a per job usage basis is wise, and that is going to be, from what I understand -- I mean, especially for you, because you're doing the AI development, right, and the products. And so now also to have a marketplace, that's a whole other ball game. So kudos to you for wanting to build that marketplace and to do it in a fair and ethical way. So when any of us go onto your website or marketplace, and we are, let's say recording on it or inputting our voice or sending you files, what is your policy in terms of who owns that voice?

Alex: Yeah. Voice is owned by the person who, whose voice it is. Right? And there is quite clear legislation around that. So that's your IP and you own it. And without your permission, your voice cannot be used for something you have not authorized. So your recordings as a source speaker belong only to you. Recordings of converted speech, you get them. So you own the recordings of converted speech, if you're, if you use our voice marketplace on paid basis and that's quite clear and fair.

Anne: Great. Okay. So how, going back to the ethics where you say that we don't allow any misuse of our technology, how do you actually prevent anybody from misusing your technology?

Alex: Yeah. I mean, on example of the voice marketplace, you cannot introduce any target voice, right? You cannot just put there the voice of Donald Trump and try to say something in his voice because system does not allow it.

Anne: Okay.

Alex: And we do not have any public API or even non-public API that would allow users or our partners to create target voices themselves. In those cases, when you need a particular voice to be cloned, you always need to go through us. And we would require permission. And we actually require written permission, or in cases when we've worked with big and legit studios, we can put it on their shoulders. So they would need to get the permission themselves. The second part of protecting our technology from misuse is actually bringing awareness about existence of this technology. And we did plenty of projects that were focused more -- mostly on bringing awareness like Nixon project we did in 2019 with MIT.

And the whole idea of the project was to make Richard Nixon say the speech that was written in case if moon landing (?) goes wrong, actually showcase what modern technologies can do to change our understanding of history. And this educational part is extremely important because we all understand that this type of very fine technology could -- would fall in wrong hands in the future --

Anne: Absolutely.

Alex: -- and that's in quite foreseeable future. And the thing is we can protect ourselves only being aware that voice can be manipulated.

Anne: Yes.

Alex: Like if we're aware that something that is typed in the newspaper could not be true, though our grandparents or great-grandparents used to believe in everything that was typed. So that's, that's about how we treat the information we receive. And that's about awareness. Another thing we work on is to create a watermark, and the idea of watermark --

Anne: Yes.

Alex: -- the watermark is to be able to tell Respeecher generated content from any other content. That's been quite complex and hard task because with our technology, you can generate a very small file and to put there a legit watermark, you will need to have this balance of watermark being not hearable --

Anne: Right.

Alex: -- but being not easily removable.

Anne: Right.

Alex: And keeping this balance in very short chunks is quite hard task, but I hope in next year, we would release the watermark. Another thing we are doing, we are actually working in several communities that are designed with the idea of building detection of synthetic speech, algorithms that would detect synthetic speech or synthetic images. And we are providing our samples, we are providing our recordings that sound very indistinguishable in order to improve those algorithms. And the idea is those algorithms should be created and adopted as soon as possible, and big platforms --

Anne: Yes.

Alex: -- that distribute content like YouTube or Facebook should have this stuff embedded there. So it would just notify people that this recording or this video might have been manipulated. And that's quite important thing to do.

Anne: I agree, especially after hearing samples on your webpage, how really good your technology is, because it is encapsulating like the emotion. And I can only imagine for us, it makes us like doubly scared. You know, text-to-speech, synthetic voices is already scary, but this is an extra kind of step where it sounds so real that -- and especially how can you tell? Let's say that, you know, somehow my voice gets out there, or somehow the model of what I said gets out there, and how do I know that I approved that and allowed that to happen or allowed that usage? So I think it's great that yes, you should get those models out there and that watermarking out there as soon as possible on all platforms. Because I also think for us to be able to give the permission and to know where our voice is being used and for the people listening, they need to know that what they're listening to may not be human or may be altered. So good stuff.

Alex: Yeah. That's correct.

Anne: Yeah.

Alex: However, I want to contradict you a bit about letting viewers of the film be obligatorily notified about synthetic speech being part of that. I mean, viewers are not notified about effects, about postproduction that has been made to speech. And you can think about some cases --

Anne: True.

Alex: -- when our stuff is more like a postproduction technique, like we de-age some voice. So an actor acts themselves, but they sound younger, right? It's nothing bad with this use case and you don't obligatory need to have like a huge notification --

Anne: Right.

Alex: -- on the center of the screen that --

Anne: Right.

Alex: -- this audio has been manipulated. Because if you think about dinosaurs in Jurassic Park, you don't have --

Anne: Yeah.

Alex: -- and you don't expect to have those --

Anne: Sure.

Alex: -- notifications that this creature does not exist, or Terminator, or like that's a creative part of things. And in cases, if it's used in postproduction or as a creative tool, it shouldn't be there in my opinion. But in cases when it's, it might consist of controversial content, it might consist of alternative history content, when someone like Anthony Bourdain never actually said these lines, even if he wrote it himself, the notification should be in place. Because in such cases, we always encourage our clients and documentary creators to be very straightforward and tell their listeners that voice has been modified, synthesized.

Anne: Excellent point, excellent point. Thank you for that. Wow. So this has just been a wonderful conversation. Thank you so much for educating us and talking about your product, Respeecher. How can BOSSes get in touch with you if they're interested to find out more, or maybe try it out, or maybe be a voice, how can they get in touch with you?

Alex: Yeah, so you just basically go to our website, respeecher.com, and you can hear a lot of examples, read our blog, read our ethics statements, look at some projects we finished and we can actually talk about, because there are plenty of projects that have delayed PR rights for us. And you can easily try voice marketplace. You can try the same core technology that we are using for Hollywood for your needs. And we would really appreciate the feedback because voice marketplace is something quite new for us --

Anne: Yes.

Alex: -- but we want this to be a very good creative tool and tool that would let voice actors do what they do best, act, without being limited to their timbre, and creators be focused on creative opportunities without being limited to necessity of finding a particular vocal timbre. And sometimes it's very hard to find.

Anne: Wow. Well, thank you so very much for joining me today. I'm going to give a great, big shout-out to our sponsor ipDTL that allowed me to connect with Alex today. You can find out more at ipdtl.com. You guys, have an amazing week, and I'll see you next week. Thanks so much, Alex.

Alex: Thank you, Anne.

Anne: Bye-Bye.

Alex: Bye.

>> Join us next week for another edition of VO BOSS with your host Anne Ganguzza. And take your business to the next level. Sign up for our mailing list at voBOSS.com and receive exclusive content, industry revolutionizing tips and strategies, and new ways to rock your business like a BOSS. Redistribution with permission. Coast to coast connectivity via ipDTL.