By now, anyone with even a passing interest in AI is familiar with the process of typing a message into a chatbot and getting a long stream of text back in response. Today's announcement of GPT-4o, which lets users converse with ChatGPT using real-time audio and video, might seem like just a lateral expansion of that basic interaction model.
But after watching today's announcement and the more than a dozen video demos OpenAI posted alongside it, I think we're on the verge of a major shift in how we think about and work with large language models. Although we don't yet have hands-on access to GPT-4o's audiovisual features, the nonverbal cues on display (from both GPT-4o and the users) instantly make the chatbot feel more human. And I'm not sure I'm entirely prepared for how the average user will respond to that.
It feels human
In one video, a soon-to-be father asks GPT-4o for its opinion on a dad joke (“What do you call a giant pile of kittens? A meow-ntain!”). The old text-based ChatGPT could easily have typed out the same response: “Congratulations on the upcoming addition to your family! That's hilarious. Definitely a top-notch dad joke.” But it hits differently when GPT-4o delivers the same lines in the video, accompanied by gentle laughter and the rising and falling intonation of a lifelong friend.
Or watch this video of GPT-4o reacting to an image of an adorable white dog. The AI assistant immediately shifts into the high-pitched, baby-talk register that anyone who has met a cute pet for the first time will recognize instantly. It's a convincing demonstration of the scenario xkcd's Randall Munroe famously identified as “You're a kitty!”, and it goes a long way toward making GPT-4o feel human.
Next, there's a staged birthday party demo, in which GPT-4o sings “Happy Birthday” with deadpan dramatic pauses, self-aware laughter, and even some lightly altered lyrics before trailing off into silly raspberry-blowing gibberish. Having an AI assistant sing “Happy Birthday” to you is a little depressing, but this particular rendition has a lovely tenderness to it that doesn't feel mechanical.
As I watched OpenAI's GPT-4o demos this afternoon, I found myself grinning more than once as I stumbled on yet another surprising example of its audio capabilities. Whether it's adopting a stereotypical sportscaster voice or doing a sarcastic Aubrey Plaza impression, it's all incredibly jarring, especially for those of us used to LLM interactions that resemble text conversations.
If these demos are at all indicative of GPT-4o's voice capabilities, we're likely to see a whole new level of parasocial relationship develop between this AI assistant and its users. For years, text-based chatbots have exploited human “cognitive weaknesses” to convince people that they are sentient. Add in the emotional register of GPT-4o's finely tuned vocal inflections, and a wide swath of users will likely become convinced that there really is a ghost in the machine.
See me, feel me, touch me, heal me
Beyond GPT-4o's newfound nonverbal emotional range, the model's sheer response speed also looks set to change how we interact with chatbots. The drop from ChatGPT-4's two-to-three seconds of latency to GPT-4o's claimed 320 ms average response time may not sound like much, but it adds up over the course of a conversation. The real-time translation demo shows the difference: two interlocutors can converse far more naturally when they don't have to wait awkwardly between the end of one sentence and the start of its translation.
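If you want a feel for what latency numbers like that mean in practice, here's a minimal sketch that times a round trip to the text side of GPT-4o using OpenAI's official Python client. (The low-latency audio pipeline shown in the demos wasn't part of the public API at announcement time, and the prompt here is just an illustrative placeholder, not OpenAI's benchmark method.)

```python
# Minimal latency sketch using the official `openai` Python package.
# Assumes OPENAI_API_KEY is set in the environment; the prompt is a
# placeholder, and this measures text-only round trips, not the
# audio pipeline OpenAI demoed.
import time

from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    stream=True,  # stream tokens so we can time the first one separately
)

first_token_ms = None
for chunk in stream:
    # Early chunks may carry only the role; wait for actual content.
    if first_token_ms is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_ms = (time.perf_counter() - start) * 1000

total_ms = (time.perf_counter() - start) * 1000
print(f"time to first token: {first_token_ms:.0f} ms, full response: {total_ms:.0f} ms")
```

For conversational feel, time to first token is the number that matters most: once a reply can begin streaming, as audio or text, the rest of the generation can trail behind it, which is presumably part of how GPT-4o closes in on that 320 ms figure.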