I know you've seen it.

The funny videos of babies talking, the meeting notetakers, AI-powered answering machines, and even deepfake voice notes. These are all powered by Voice AI.

In this article, I'm going to break down the technology and business behind Voice AI.

The information in this post is based on my theoretical and practical research on the topic, my experience as a member of various AI training and development communities, and my experience working on a team that has handled conversation intelligence for organizations like Samsung, DHL, and Formula 1, among others.

I’ve also added a Q&A section based on my experience at the end.

Why Voice AI Matters

Voice is the most developed form of shared communication that humans have, one that we have spent thousands of years perfecting. As a result, most humans are more comfortable with speaking and listening than they are with typing and reading.

This means that Voice AI leads to better requests (input) and better results (output) from computers. It has unlocked what I think is the biggest technology UX opportunity we’ve seen in decades.

For the first time ever, people who are illiterate, visually impaired, or elderly may have the opportunity to use computers at the same level as everyone else. It’s a fundamental shift in who gets to participate in the digital world, and in how they get to do it.

It’s also great for multitasking.

Use Cases

Some popular use cases of Voice AI include:

- Meeting notetakers and transcription tools (e.g., fireflies.ai, otter.ai)
- AI-powered answering machines and phone agents for businesses
- Voice modes in chatbots and assistants (e.g., ChatGPT’s voice mode)
- Live captions on platforms like YouTube and Zoom
- AI-generated podcasts, audiobook narration, and spoken translation
- Accessibility tools for people who are illiterate, visually impaired, or elderly

How It Works (The Tech)

Modern Voice AI has been implemented in a lot of ways across various products, but at its core, it’s powered by one or a mixture of three things: Speech To Text (STT), Text To Speech (TTS), and Speech To Speech (STS).

Speech To Text (STT)

This is the technology that converts speech into text. Products that rely on it include meeting notetakers (fireflies.ai, otter.ai), ChatGPT’s voice mode, and the live captions on YouTube and Zoom.

This is how it works:

1. Audio is captured or uploaded, e.g., a meeting recording or a live microphone stream.
2. The audio is cleaned and pre-processed: noise reduction, normalization, and sometimes separating speakers into channels.
3. The audio is broken into short frames, and acoustic features are extracted from each frame.
4. A trained model maps those features to the most likely words, using a language model (and any provided context) to resolve ambiguity.
5. The final text is assembled, punctuated, and returned as the transcript.

The accuracy of the final text depends on a number of things: the quality of the model and its training data, the number of voices on the recording, the kind of recording (e.g., mono or stereo), and the context provided (if any).

Of these things, I think context is the most underutilized element. In my experience, the more context is provided about the audio content, the higher the transcription accuracy. This is similar to how the human brain fills in the gaps using context when it hears a new accent for the first time. However, context may not always be available, and it can affect transcription latency.
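To make the context point concrete, here’s a minimal sketch using the open-source openai-whisper package. The model size, file name, and prompt text are illustrative; hosted STT APIs expose similar prompt or keyword-boosting parameters.

```python
# A minimal STT sketch using the open-source openai-whisper package
# (pip install openai-whisper). Model size, file name, and prompt are illustrative.
import whisper

model = whisper.load_model("base")  # smaller = faster, larger = more accurate

# Plain transcription: the model relies only on the audio itself.
plain = model.transcribe("earnings_call.mp3")

# Same audio, but with context. Domain terms and speaker names in the
# initial prompt nudge the decoder toward the right words and spellings.
with_context = model.transcribe(
    "earnings_call.mp3",
    initial_prompt="Quarterly earnings call for Acme Corp. Speakers: Ada Lovelace (CEO), Alan Turing (CFO).",
)

print(plain["text"])
print(with_context["text"])
```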

For most AI products with voice capabilities, the text output generated in step 5 above is further processed and then fed into an LLM to carry out some action, e.g., summarize.
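Here’s a sketch of that hand-off, assuming an OpenAI-style chat API; the model name, file name, and prompts are illustrative, and any chat-completion endpoint would look much the same.

```python
# A sketch of handing the transcript (the step 5 output) to an LLM for a
# downstream action, here summarization. Uses the openai SDK (pip install openai).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

transcript = open("meeting_transcript.txt").read()  # text produced by the STT step

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Summarize meeting transcripts into key decisions and action items."},
        {"role": "user", "content": transcript},
    ],
)
print(response.choices[0].message.content)
```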

Text To Speech (TTS)

The AI-generated podcasts, videos of animals talking, Google Translate’s spoken output, the narration of some audiobooks, etc., are all powered by Text To Speech (TTS) technology. This is how it works, at a high level:

1. The input text is normalized: numbers, abbreviations, and symbols are expanded into words.
2. The normalized text is converted into phonemes, the basic units of speech sound.
3. The system predicts prosody: the intonation, rhythm, pauses, and emphasis the sentence should carry.
4. A synthesis model (often a neural vocoder) generates the final audio waveform from that representation.

Step 3 doesn’t happen in all TTS systems, but it’s where most of the magic happens in the great ones. The ability to properly weave intonation, rhythm, and disfluencies/fillers (where relevant) into the generated speech makes a big difference in how realistic it sounds, e.g., delivering a sad sentence in a sad tone and a happy sentence in an excited one.
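As a concrete example of that kind of control, here’s a minimal sketch using Google Cloud’s Text-to-Speech client with SSML to add a pause and a slower, lower delivery. The voice name is illustrative, SSML support varies by voice, and the call requires Google Cloud credentials.

```python
# A minimal prosody-control sketch using google-cloud-texttospeech
# (pip install google-cloud-texttospeech; requires Google Cloud credentials).
# The voice name is illustrative and SSML support varies by voice.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# SSML marks up pauses, pacing, and pitch explicitly: the step 3 "magic".
ssml = """
<speak>
  I checked the results. <break time="600ms"/>
  <prosody rate="slow" pitch="-2st">I'm afraid it's not good news.</prosody>
</speak>
"""

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(ssml=ssml),
    voice=texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Neural2-C"),
    audio_config=texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3),
)

with open("sad_news.mp3", "wb") as f:
    f.write(response.audio_content)
```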

Some platforms are leading in this, while others still have work to do. For example, with Google’s Gemini 2.5 Pro TTS model, I can get it to pause and even sigh.

Speech To Speech (STS)

This is the technology that will deliver the Jarvis-like personal assistant you see in sci-fi movies. It’s what allows you to speak to a computer and get speech back.

I was skeptical about giving it a section of its own since most implementations of this just put together STT + LLM + TTS, but there have been some unique advancements that make it worthy of having its own section.

This is how it works:

1. The system listens and detects when you start and stop speaking (voice activity detection and turn-taking).
2. Your speech is transcribed into text (STT), often in a streaming fashion while you’re still talking.
3. An LLM interprets the text and generates a response.
4. The response is converted back into speech (TTS) and played to you.

Some newer systems skip the intermediate text steps and process audio natively, which is part of what makes STS more than a simple bundle of the other two technologies.

Latency optimizations and the implementation of “turn-taking” play crucial roles in STS, i.e., the system’s ability to identify that you’re done speaking, understand that it’s its turn to speak, and respond quickly enough to what you said. In most cases, processing starts before you’re done speaking.
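To make the turn-taking idea concrete, here’s a deliberately simplified cascade sketch. The record_chunk, transcribe, generate_reply, and speak callables are hypothetical stand-ins for real audio I/O and model calls, and the fixed silence threshold is a stand-in for the trained end-of-turn models and streaming pipelines real systems use.

```python
# A simplified STS cascade (STT -> LLM -> TTS) with naive, energy-based turn
# detection. record_chunk, transcribe, generate_reply, and speak are hypothetical
# stand-ins; production systems stream every stage and use trained end-of-turn
# models rather than a fixed silence threshold.
import numpy as np

SILENCE_RMS = 0.01       # below this energy, a chunk counts as silence (tune per mic)
END_OF_TURN_CHUNKS = 15  # this many silent chunks in a row = the user is done speaking

def is_silent(chunk: np.ndarray) -> bool:
    """Crude voice-activity check: root-mean-square energy of an audio chunk."""
    return float(np.sqrt(np.mean(chunk ** 2))) < SILENCE_RMS

def conversation_loop(record_chunk, transcribe, generate_reply, speak):
    """Run one user turn, then one assistant turn, forever."""
    while True:
        chunks, silent_run = [], 0
        # 1. Collect audio until the end of the user's turn is detected.
        while silent_run < END_OF_TURN_CHUNKS:
            chunk = record_chunk()  # e.g., 20 ms of microphone samples
            chunks.append(chunk)
            silent_run = silent_run + 1 if is_silent(chunk) else 0
        # 2-4. STT -> LLM -> TTS. Real systems start steps 2 and 3 while the
        # user is still speaking, which is how they hide latency.
        text = transcribe(np.concatenate(chunks))
        reply = generate_reply(text)
        speak(reply)
```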

Here’s how far along we are: I once watched the Voice AI agent of one business talking to the AI answering machine of another. Neither system could tell it wasn’t speaking to a human, not least because of how smooth the conversation was. That is probably what the future looks like. (The caller was an AI agent built on Bland.ai.)

The Business Behind it

Players

The market is broadly made up of three categories of players: Hyperscalers and Labs (e.g., Google and OpenAI), Voice AI Specialists (e.g., ElevenLabs), and Application and Enterprise Players that build voice features into end-user products (e.g., fireflies.ai, otter.ai, Bland.ai). Each category optimizes for different things (speed, accuracy, language support, cost), and those trade-offs are often where their market differentiation lives.

Monetization

Monetization works differently depending on where in the stack a player sits.

Hyperscalers and Labs primarily charge on a usage basis, i.e., per minute, per token, or per API request, sometimes varying by model. The second layer of monetization is bundling voice features into existing products to drive subscriptions: ChatGPT’s voice mode is a good example of this.

Voice AI Specialists are also primarily usage-based. A few, like ElevenLabs, have added B2C subscription tiers on top.

Application and Enterprise Players typically absorb the cost of underlying Voice AI services and charge users for the core product via monthly or annual plans, outcome-based pricing, or per-seat pricing. They almost always have fair usage policies around voice features to keep unit economics healthy.
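To see why those fair usage policies exist, here’s a back-of-the-envelope calculation with entirely hypothetical prices; real rates vary widely by provider and volume.

```python
# Hypothetical unit economics for an application player that absorbs Voice AI
# costs and charges a flat per-seat plan. All numbers are made up for illustration.
STT_PER_MIN = 0.006   # hypothetical speech-to-text price per audio minute
TTS_PER_MIN = 0.015   # hypothetical text-to-speech price per generated minute
PLAN_PRICE = 20.00    # hypothetical monthly per-seat price charged to the customer

minutes_per_seat = 600  # hypothetical monthly voice usage for a heavy user

voice_cost = minutes_per_seat * (STT_PER_MIN + TTS_PER_MIN)
gross_margin = (PLAN_PRICE - voice_cost) / PLAN_PRICE

print(f"Voice cost per seat: ${voice_cost:.2f}")                     # $12.60
print(f"Margin on the ${PLAN_PRICE:.0f} plan: {gross_margin:.0%}")   # 37%
```

Double that usage and the seat is underwater, which is exactly what fair usage policies are there to prevent.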

Where This Goes (The Future)

Voice AI is going to be enormous. According to Straits Research, the AI voice generator market is projected to grow from $6.4 billion in 2025 to over $54 billion by 2033, and even the more conservative estimates from other analysts point to significant growth, with projections varying based on how the market is defined.

On the tech side, a lot of that value will come from increased speed and lower costs, which will drive adoption across various platforms. This will be most visible in Speech To Speech (STS) technology.

Being able to ask your coffee machine to make a latte and have it tell you when it’s done, all without touching your phone or the machine itself: that’s not a distant future.

Voice is going to become the default way people interact with technology.

Ethics

While working with major Voice AI providers, I’ve noticed that several of them are actively trying to make AI-generated speech sound exactly like humans. I understand the rationale (it reduces friction and makes interactions feel more natural), but I believe it’s the wrong path.

There is more to lose by making AI sound exactly like humans than there is to gain. It’s fine to know you’re on a call with an AI agent; most people don’t care, as long as they’re getting value. What they do care about, or should, is being deceived. The real risks are voice scams and a deepening romantic dependency on AI bots, both of which are already happening.

My view is that some elements of the robotic-sounding tone should be deliberately preserved as a signal, a clear, audible marker that you are talking to a machine. Similar to how AI image generators embed watermarks, AI voice should have its own equivalent.

On the other side of this coin, there’s also research suggesting that AI may shape the way humans speak in the future. It’s based on the concept of lexical entrainment, the natural human tendency to mirror the language of whoever we’re speaking to. If we talk to AI systems constantly, we may start adapting our speech patterns to sound more like them. That’s a conversation worth having.

Fun Q&A: Things I’ve Seen and Learned from Experience

Q: How do Voice AI systems handle and understand accents?

A: By diversifying the training data, i.e., training models on the voices of people with a wide range of accents. I’m part of communities that regularly engage in AI training and testing, and requests like this come in frequently.

Q: What’s the funniest thing/most interesting thing you’ve seen working with Voice AI?

A: The one I mentioned earlier: a Voice AI agent of one business speaking to the Voice AI answering machine of another business. Neither system could tell that it wasn’t a human on the other end. The caller was an AI agent built on Bland.ai. It was both impressive and unsettling, and probably a preview of how a lot of routine business communication will work within the next few years.

Q: What are some common issues that you’ve found working with Voice AI?

A: Here are some of the interesting ones:

- Accuracy drops sharply when multiple speakers overlap, especially on mono recordings where voices can’t be separated by channel.
- Transcription quality suffers when no context about the audio is available, and adding context can increase latency.
- Turn-taking is hard to get right: agents that interrupt you, or pause too long before responding, break the feel of a real conversation.
- Accents that are underrepresented in training data are still transcribed and spoken less accurately.

I enjoyed writing this post, and I hope you enjoyed reading it too. If you’ve made it this far, you now have a solid foundation on both the technical and business sides of Voice AI.

What I’d love to know: Is there a use case or a question I didn’t cover here that you think deserves more attention? Drop a comment, reach out directly, or share this with someone who would find it useful.