ChatGPT Advanced Voice Mode Wows Testers with Realistic Sound Effects and Breath Control

On Tuesday, OpenAI began rolling out an alpha version of its new Advanced Voice Mode to a select group of ChatGPT Plus subscribers. This feature, previewed in May alongside the launch of GPT-4o, aims to enhance the naturalness and responsiveness of conversations with the AI. While the initial preview faced criticism for its simulated emotional expressiveness and a public dispute with actress Scarlett Johansson over alleged voice imitation, early feedback on social media has been largely positive.

In early tests, users with access to Advanced Voice Mode report that it enables real-time conversations with ChatGPT, including the ability to interrupt the AI mid-sentence almost instantly. The mode can detect and react to emotional cues in a speaker's tone and delivery, and it adds sound effects to storytelling.

A surprising feature is how the voices simulate breathing pauses while speaking. Tech writer Cristiano Giardina remarked on X about the realism of the voice taking a breath as if it were human: "ChatGPT Advanced Voice Mode counting as fast as it can to 10, then to 50 (this blew my mind—it stopped to catch its breath like a human would)."

Advanced Voice Mode mimics audible breathing pauses because its training included vast amounts of human speech audio, from which the model learned to simulate inhalations at natural points. Large language models like GPT-4o excel at imitation, and that capability now extends to audio as well.

Giardina also noted the mode’s performance on X, including its handling of accents in various languages and its use of sound effects. "It’s very fast, there’s virtually no latency from when you stop speaking to when it responds," he said. "When you ask it to make noises, it always has the voice ‘perform’ the noises (with funny results). It can do accents, but when speaking other languages it always has an American accent. (In the video, ChatGPT is acting as a soccer match commentator)."

X user Kesku, a moderator of OpenAI's Discord server, shared examples of ChatGPT voicing multiple characters and, in response to a prompt requesting an exciting sci-fi action story, narrating it with atmospheric sound effects.

AI advocate Manuel Sainsily posted a video showcasing Advanced Voice Mode reacting to camera input while giving advice on kitten care. "It feels like face-timing a super knowledgeable friend, which in this case was super helpful—reassuring us with our new kitten," he said. "It can answer questions in real-time and use the camera as input too!"

As with any LLM-based system, Advanced Voice Mode might occasionally generate incorrect responses due to limitations in its training data. However, as a tech demo or AI-powered entertainment tool, it appears to perform many tasks as demonstrated by OpenAI.

Regarding safety, an OpenAI spokesperson informed Ars Technica that the company collaborated with over 100 external testers speaking 45 languages from 29 regions for the release. The system is designed to prevent impersonation by restricting outputs to four preset voices and includes filters to block requests for copyrighted music or audio. Giardina noted some audio "leakage" with unintended background music, indicating the model’s training included a diverse range of audio sources.

OpenAI plans to expand access to more ChatGPT Plus users over the coming weeks, with a full rollout expected this fall. Alpha test participants will receive instructions via the ChatGPT app and email. OpenAI claims to have improved the model's capacity to handle millions of simultaneous, real-time voice conversations with low latency and high quality, preparing for a substantial increase in usage.
