How do you know that the voice you are hearing right now is human? Most of you have no idea what I look like, so how can you tell I’m a real person? What if your favorite YouTuber is actually an AI?
2023 is shaping up to be the year of artificial intelligence. Between the controversy swirling around various image generators and all the hype about ChatGPT, AI has been dominating news headlines for months. And for good reason.
Known as “generative AI,” these programs are capable of performing tasks previously reserved for humans, namely the generation of text, images, video, and other creative media.
YouTube’s new CEO, Neal Mohan, has even said the company is looking to expand AI’s role in content creation. In a letter outlining YouTube’s yearly goals, he stated, “The power of AI is just beginning to emerge in ways that will reinvent video and make the seemingly impossible possible.” In a few months, you may be listening not to my voice but to one created by an AI.
Of course, this technology isn’t exactly new. The AI video platform Synthesia has been around since 2017 and has partnered with major brands like Nike, Reuters, BBC, and Google. Starting at just $30 a month, you can use its service to create your very own “digital twin” — an AI-generated avatar that both looks and sounds just like you. The process is simple. First, you record yourself reading eight pages of prewritten scripts, each one capturing a different tone like instructional, professional, or cheerful. Next, after a bit of hair and makeup, you stand in front of a green screen, working with a director and film crew to record various movements. The whole thing only takes three hours, and afterward, you gain access to a platform where you can insert text or upload audio files to the avatar. You can even tweak the audio to more accurately represent your natural speaking pattern.
The advantages of this technology are obvious. On the most basic level, digital avatars don’t have to worry about camera shyness. They always look presentable, and never need reshoots. Simply assign the parameters, hit a button, and you’ve got a piece of publishable content. Not only does this allow creators to manage their workflow better, it also allows them to oversee multiple projects simultaneously. Rather than being limited to a single production, creators can practically be in several places at the same time.
Some YouTubers are actually already doing this, albeit in a more analog fashion. With over 130 million followers, MrBeast is the most popular YouTube celebrity on the planet. His videos feature expensive stunts, competitive challenges, “Let’s Plays,” and a wide variety of other fun content. In order to maintain his demanding production schedule, MrBeast created a clone of himself, only instead of using Synthesia, he hired a living, breathing person. MrBeast 2.0 was trained seven hours a day for two years to learn how to make the exact same decisions that MrBeast himself would make. This allowed the YouTuber to essentially be in two places at once, effectively doubling his creative output. Since then, the two MrBeasts have gone on to make some of the most amazing videos on the platform and start an entire fast-food chain.
This “cloning” strategy offers us a hint of the potential of generative AI. Having multiple creators working under the same name, whether they are a look-alike or artificial intelligence, opens up completely new avenues to explore. Take social media influencers, for instance. Their name is their product, and they sell that product to prospective companies looking to market their goods and services. Normally, influencers are limited to a single IP – themselves – but with generative AI, they can create dozens of digital avatars, each with its own talent agent and associated brands and licenses. These clones can then be sold to corporate partners who could use them to create advertisements without the influencer ever having to show up to work.
Not only does this increase the potential output of creators, as dozens of videos could be pumped out in the time it used to take to make one, but it also lowers the cost of production. Instead of hiring an entire team of writers, videographers, editors, makeup artists, and other industry professionals, you only need to pay for a single piece of software. The potential payoff is absolutely staggering.
Imagine a world where your digital twin runs around the Metaverse doing your work for you or AI-generated celebrity avatars interact with fans through virtual reality. Thanks to artificial intelligence, all of this will soon be possible. Plus, it’ll help MrBeast avoid putting himself into dangerous situations.
Synthesia has worked with over 15,000 businesses and created more than 4.5 million videos. Though to be candid, these videos tend to be fairly corporate and are limited to a single avatar standing in front of a background. While this is fine for HR training videos or marketing promotions, the platform lacks the crucial tools necessary for more creative media. You won’t be making an entire short film using Synthesia, at least not yet. Still, the technology offers us a peek into what’s possible. The pieces are all there.
Attempting to put them together is Snapchat, which recently announced the launch of its own chatbot. Dubbed “My AI” and powered by ChatGPT, My AI is able to interact with users and respond with natural-sounding dialogue. However, unlike Microsoft’s new Bing AI or Google’s Bard, it’s not meant to serve as a search engine. Rather, Snapchat’s AI is presented more like a personality, even appearing in your friend list with its own profile page and Bitmoji. Snapchat CEO Evan Spiegel has indicated that the company’s goal is to humanize AI and to normalize these kinds of interactions, saying, “[t]he big idea is that in addition to talking to our friends and family every day, we’re going to talk to AI every day.”
It seems as if it’s only a matter of time before AI-generated personas will be popping up in your feed. Though, for some of us, that may already be the case. Meet Xierra Vega. Created by the LA-based production studio Corridor Crew, Xierra is a 100 percent AI-generated social media influencer. Their videos have been posted to Instagram and TikTok for a little over a year, amassing an audience of around 30 thousand followers between platforms. Everything from the dialogue and animation to the tone and camera angles is AI-generated, and the results have been, well, mixed.
If you scroll through Xierra’s videos, most are a bit nonsensical. The character’s speech is odd, their movements are jerky, and each video ends with a random dance sequence, perhaps as a homage to early TikTok dances. Most of the videos are filled with the kinds of bugs that you’d see in a video game from the early 2000s. Xierra’s avatar frequently walks through walls, jumps around the room and makes painfully awkward facial expressions. Despite all this, what Corridor Crew has accomplished is actually pretty remarkable.
The trickiest part of generative AI is successfully combining different elements to form something new and cohesive – making sure a character’s lips sync to the audio, that their interactions with locations and objects are organic, and that their decisions form a logical narrative. For all their quirks, Xierra has been doing all of this. Their videos contain multiple ongoing stories that build off each other, including one where they get a jet ski and another where they become trapped in their basement only to discover that they are, in fact, an AI.
The biggest technological hurdle that both Xierra Vega and Synthesia still need to overcome is what’s referred to as “the uncanny valley.” It’s the psychological gap that we humans experience when seeing something that is a close, but still imperfect, replica of ourselves. Xierra’s behavior is almost human-like but lacks coherence. The digital avatars created by Synthesia are convincing, but when you watch them, it’s clear something is off.
The voices are a little too Siri-like, and the avatars are somehow moving both too much and not enough. It’s like they’re trying to overcompensate for the fact that they’re not real. But this is just a limitation of current technology. Generative AI is still very new, and given a few years, the uncanny valley will inevitably be crossed.
In reality, there are much bigger problems that everyone, not just content creators, should be worried about. In a previous video, I talked about AI bias. Since the launch of generative AI programs, many of them have demonstrated clear racial prejudices, likely as a result of the way these programs are trained. But more disturbingly, other programs have acted aggressively or erratically towards users who attempt to stress-test their systems.
While Synthesia and other companies claim to have installed guardrails to prevent these sorts of behaviors, others haven’t been as diligent. Meta’s large language model LLaMA was leaked online in early March 2023. Since then, it’s been downloaded by plenty of people looking to exploit the technology for their own purposes. A group of programmers on Discord created a version of the AI made specifically to spit out racial obscenities and hate speech. Groups like these claim that by exposing vulnerabilities in the programs, they’re fighting back against the companies behind them, companies that are becoming increasingly secretive about their technology.
OpenAI, the company behind ChatGPT, has done a complete 180 on its original open-source principles, choosing instead to keep the latest iteration of the chatbot behind closed doors. Microsoft has also made some worrying decisions, including firing the entire ethics and society team in its AI department.
This is concerning given the recent wave of lawsuits against generative AI image programs like Midjourney and Stable Diffusion, both of which have been accused of training their AIs by using copyrighted works of art without obtaining consent from the artists. Visual artists have been sounding the alarm about this for months, but it’s now a problem that other creators are waking up to as well. It’s bad enough when another human steals your idea, but imagine being a comedian and hearing ChatGPT rip off one of your jokes or being a celebrity and seeing an AI impersonating you online.
In fact, this has already happened. ElevenLabs is an AI that generates voice clips using audio uploaded by users. You enter a recording of whoever you want, input some text, and suddenly you have the ability to, say, make Joe Biden and Donald Trump argue about video games. Or you could make a dead YouTuber say whatever you wanted. This is what happened to John Bain, otherwise known as TotalBiscuit, a YouTube commentator who passed away in 2018. In March of 2023, an AI voice model impersonating Bain appeared online, making various inflammatory statements, including transphobic comments. While Bain will never have to endure hearing his voice used as a tool to promote bigotry, Bain’s widow has. She’s now faced with the choice of whether to remove Bain’s 3,000+ videos from YouTube or leave them online, vulnerable to abuse.
Other celebrities have fallen victim to AI impersonation, too. One video showed Emma Watson reading sections of Hitler’s Mein Kampf, and another showed Mary Elizabeth Winstead using transphobic slurs and repeating 4chan memes. Besides becoming a platform for trolls to create deepfakes that spew hate speech, generative AI is also being used by governments as a tool for propaganda.
In January, it emerged that someone had used Synthesia to generate a series of videos of a newscaster expressing support for Burkina Faso’s new military dictatorship. A few weeks later, state-run television stations in Venezuela began playing a video they claimed was of an American newscaster debunking negative claims about the Venezuelan economy, even though the country has been facing a terrible economic crisis. In reality, the man featured in the video was one of Synthesia’s avatars. Pro-China videos clearly produced using Synthesia have emerged online as well. Fortunately, these videos were flagged as AI-generated thanks to their obvious flaws, but it’s only a matter of time before the technology creates avatars that are indistinguishable from humans.
So, what happens when this technology becomes so good that you can no longer tell the difference between a person and a program? The promise of generative AI is that it will give creators more opportunities to monetize their work and explore new ideas. More than that, it lowers the barrier to entry. In the same way that digital audio workstations like Ableton effectively act as an entire orchestra with a DJ as composer, platforms like ChatGPT and Synthesia allow everyone the opportunity to become a director without needing to get a job in Hollywood.
You don’t need writers, actors or a film crew. You just need a laptop and an idea. We might see a new wave of creative media as millions of people find novel ways to express themselves through these programs. That said, the potential for abuse of this technology is extraordinarily high, and in the race for technological supremacy, safety has become an afterthought for many companies.
Stronger guardrails need to be implemented, legislation protecting artists’ works and individuals’ likenesses needs to be passed, and the companies responsible for this technology need to operate with greater transparency. OpenAI recently published a report claiming that 80 percent of the American workforce will be impacted by ChatGPT in some way. And that doesn’t include the various image, video, and audio generators out there.
If artificial intelligence forever changes how we live and work, then we all should have a say in how it’s developed and where it’s used. Audiences should never have to guess whether or not the voice they are listening to is human.