AI-powered voice chat in June 2024: Lipstick on a pig

Table of contents

  • Tech stack

    The Tech Stack used

  • What I did

    What was involved in the demo system?

  • How I did it

    Tying it all together

  • Challenges in AI voice generation

    Right now, at best, you're talking to a competent robot.

  • Potential and limitations for voice chatbots

    Be cautious and clear.

  • Future directions and challenges

    Address some of the current shortcomings

  • Conclusion

    It’s a pig.

  • Code and links

    Link to all the code.

June 12, 2024
By Edward.
[Hero image: a pig wearing lipstick next to large black shoes.]

Tech stack

The Tech Stack used

  • LLM host: Groq
  • LLM: Llama 3
  • TTS: Deepgram
  • STT: SpeechRecognition API
  • Web: NextJS (React front-end, Express API)

What I did

What was involved in the demo system?

I built a simple version of OpenAI's voice functionality using free APIs. You can talk to, listen to, and converse with LLMs. You can set the mood of the persona that responds and change the context and knowledge the API plugs into (simple RAG).
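To give a feel for how little machinery that takes, here is a rough sketch of my own rather than the demo's exact code; the helper name and fields are hypothetical. The mood and the retrieved knowledge end up as nothing more than strings folded into the system prompt:

```ts
// A minimal sketch (my own illustration, not the demo's code): the mood and
// the extra knowledge are just strings folded into the system prompt that is
// sent along with the transcribed speech.

interface ChatOptions {
  mood: string;        // e.g. "grumpy pirate", "patient drama teacher"
  knowledge: string[]; // pre-selected snippets for the simple RAG step
}

function buildSystemPrompt({ mood, knowledge }: ChatOptions): string {
  const contextBlock = knowledge.length
    ? `Use the following context when relevant:\n${knowledge.join("\n---\n")}`
    : "";
  return [
    `You are a voice assistant. Respond in the tone of a ${mood}.`,
    `Keep answers short enough to be spoken aloud.`,
    contextBlock,
  ].join("\n\n");
}

// The transcript from the browser then becomes the user message.
const messages = [
  {
    role: "system",
    content: buildSystemPrompt({
      mood: "cheerful coach",
      knowledge: ["Our returns policy is 30 days."],
    }),
  },
  { role: "user", content: "Can I send this back?" },
];
```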

All the code is linked at the end of the blog.

Feel free to play around and tell me what you think. 

How I did it

Tying it all together

I used NextJS for the demo. There are several advantages to this: I like that the client-side and server-side code live in the same repo. It keeps things easier to link together, especially when you are a one-man band, and it makes creating prototypes and handling AI use cases comparatively easy.

For this demo I just followed a few simple steps (a sketch of steps 2 and 3 follows the list):

1. Init NextJS
2. Create an AudioWidget client component
3. Integrate the Web SpeechRecognition API (currently only stable in Chrome; for production, replace it with the Whisper API or an equivalent)
4. Create two API endpoints, one for Groq and one for Deepgram
5. Stream speech from the browser to Groq
6. Stream content from Groq to Deepgram
7. Play the audio response from Deepgram in the AudioWidget
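Here is that sketch, assuming Chrome's prefixed webkitSpeechRecognition; the /api/chat route name is illustrative rather than the demo's actual endpoint:

```tsx
"use client";
// A sketch of the AudioWidget's recognition side (steps 2 and 3), assuming
// Chrome's prefixed webkitSpeechRecognition. The /api/chat route is
// illustrative, not necessarily the demo's actual endpoint.

import { useRef, useState } from "react";

export function AudioWidget() {
  const recognitionRef = useRef<any>(null);
  const [listening, setListening] = useState(false);

  const start = () => {
    const SpeechRecognition =
      (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
    if (!SpeechRecognition) return; // only stable in Chrome today

    const recognition = new SpeechRecognition();
    recognition.lang = "en-US";
    recognition.interimResults = false;

    recognition.onresult = async (event: any) => {
      const transcript = event.results[0][0].transcript;
      // Hand the transcript to the server, which talks to Groq and Deepgram.
      await fetch("/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ transcript }),
      });
    };
    recognition.onend = () => setListening(false);

    recognition.start();
    recognitionRef.current = recognition;
    setListening(true);
  };

  return (
    <button onClick={listening ? () => recognitionRef.current?.stop() : start}>
      {listening ? "Stop" : "Talk"}
    </button>
  );
}
```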

The streaming functionality with Vercel lets you optimize API chaining and stream data immediately, be it text content from Groq or audio from Deepgram. This was really helpful, and it feels like the next-generation approach for AI-based systems: traditional request/response handshakes are unlikely to handle the increasingly complex, bi-directional API interactions that AI requires.
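For the Groq leg, a handler along these lines re-exposes the token stream as it arrives. This is a sketch written as a Next.js route handler and assuming the groq-sdk package's OpenAI-style client; the demo's own API layer, model id, and file location may differ:

```ts
// app/api/chat/route.ts — a sketch of streaming Groq output straight back to
// the caller, assuming the groq-sdk package's OpenAI-style client.
// The model id and file location are illustrative.

import Groq from "groq-sdk";

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

export async function POST(req: Request) {
  const { transcript } = await req.json();

  const completion = await groq.chat.completions.create({
    model: "llama3-70b-8192",
    stream: true,
    messages: [
      { role: "system", content: "You are a concise voice assistant." },
      { role: "user", content: transcript },
    ],
  });

  // Re-expose the token stream so the client (or the TTS step) can start
  // consuming text before the full response has been generated.
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      for await (const chunk of completion) {
        const token = chunk.choices[0]?.delta?.content ?? "";
        if (token) controller.enqueue(encoder.encode(token));
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```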

It is important to note that my approach was optimized for "free" rather than for quality and speed (although Groq helpfully provides both!). Deepgram is not a free API but has generous starting credits; Groq is free, although rate-limited.

You'll see that a surprisingly small amount of code is required to set up this demo. This was one of the pleasant surprises. My general evaluation of the stack and the way it works together is as follows:

Review

  • NextJS: Just a wonderful piece of technology; client-side frameworks have come a long way.
  • Groq: Setting new benchmarks in terms of speed and cost. A breath of fresh air.
  • Llama 3: You can notice the difference between GPT-4o and Llama 3, but you can also notice the difference in price. Great for cheap requests (and demos).
  • Deepgram: I haven't tried all the TTS service providers. Generous starting credits, the platform works fine, and latency is a strong point. It suffers from the same problems as all TTS providers (the technology is very green).

Building the demo felt great, and I'm not particularly concerned about adapting it for stability and scale; it all feels robust. (Although the demo itself is of course prone to rate-limiting and running out of credits, and could do with more polish around the audio player itself.)

However, the tech felt great… The outcome felt… Errrm. Underwhelming? 

Challenges in AI voice generation

Right now, at best, you're talking to a competent robot.

The first problem in the STT and TTS AI audio flow is the most obvious. Latency. 

Imagine you are having a conversation and there's a delay before the other person responds. It disrupts the flow and makes the interaction feel unnatural. There is no breathing, no fillers, no pausing. The tone and pitch go up and down seemingly at random. I know folk are actively addressing these issues, but even in the most advanced stacks the fact that we are speaking to a machine is obvious. Despite response times dropping from seconds to hundreds of milliseconds, it's not yet the seamless, real-time experience we're aiming for. It is quite clearly a person talking to a machine.

The problem of latency is being solved as we speak. However, beneath the surface there is a more existential issue: what I like to call "lipstick on a pig."

There are three different systems, all glued together with APIs (and string, given the stability and response times of these APIs!), but decoupled under the hood:

1. Convert speech to text (STT: Whisper API, SpeechRecognition, etc.)
2. Generate a response (LLM: Llama 3, etc.)
3. Convert the response back to audio (TTS: Deepgram, etc.)

All these APIs are doing is converting audio to text, processing it through a language model, and then converting it back to audio. It might seem sophisticated on the surface, but underneath it's just basic text generation in a robot's voice. Each individual system is comprehensive and reasonably mature, but glue them all together on our proverbial pig and there is no real understanding of the nuances of audio interaction.
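To make the point concrete, the whole round trip can be sketched in a single function; everything between the microphone and the speaker is plain text. The Groq call below uses their documented OpenAI-compatible endpoint, while the Deepgram /v1/speak call and the aura-asteria-en model name reflect my understanding of their TTS API in mid-2024 and should be treated as assumptions:

```ts
// A sketch of the whole round trip. The Groq call uses their OpenAI-compatible
// endpoint; the Deepgram /v1/speak call and the aura-asteria-en model name are
// my assumptions about their TTS API, not taken from the demo code.

async function voiceTurn(transcript: string): Promise<ArrayBuffer> {
  // 1. STT already happened in the browser; all the server ever sees is text.

  // 2. Generate a text response (Groq / Llama 3, non-streaming for brevity).
  const chat = await fetch("https://api.groq.com/openai/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama3-70b-8192",
      messages: [{ role: "user", content: transcript }],
    }),
  }).then((r) => r.json());
  const reply: string = chat.choices[0].message.content;

  // 3. Turn the text back into audio (Deepgram TTS).
  const speech = await fetch(
    "https://api.deepgram.com/v1/speak?model=aura-asteria-en",
    {
      method: "POST",
      headers: {
        Authorization: `Token ${process.env.DEEPGRAM_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ text: reply }),
    }
  );
  return speech.arrayBuffer(); // audio bytes for the AudioWidget to play
}
```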

The real disconnect here is the inability to understand and blend audio signals. When we communicate, we rely on more than just words: we pick up on tone, sentiment, and other subtle cues. These systems don't grasp any of that, or if one of them does, it then has to convert what it heard into a format that its dependencies can understand.

Deepgram, for example, publishes a sentiment extraction API. It's a necessary workaround to the fundamental problem that the recipient of the audio data (the LLM) has no way of interpreting the social cues in what is being said.
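As I understand it, the workaround ends up looking something like the sketch below: the sentiment is extracted server-side and then flattened into yet more text for the LLM to read. The sentiment=true parameter and the response shape are assumptions based on my reading of Deepgram's documentation rather than on the demo code:

```ts
// A sketch of the workaround: extract sentiment alongside the transcript and
// flatten both into text for the LLM. The sentiment=true parameter and the
// response shape below are assumptions from Deepgram's docs, not the demo.

async function transcribeWithSentiment(audio: Buffer): Promise<string> {
  const res = await fetch("https://api.deepgram.com/v1/listen?sentiment=true", {
    method: "POST",
    headers: {
      Authorization: `Token ${process.env.DEEPGRAM_API_KEY}`,
      "Content-Type": "audio/wav",
    },
    body: audio,
  }).then((r) => r.json());

  const transcript =
    res.results?.channels?.[0]?.alternatives?.[0]?.transcript ?? "";
  const sentiment = res.results?.sentiments?.average?.sentiment ?? "neutral";

  // The LLM never hears the audio; it just gets another line of text.
  return `The user sounds ${sentiment}. They said: "${transcript}"`;
}
```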

But extra APIs and more complexity are not going to help us build a smooth, production-ready system. I fear that it walks like a horse and talks like a horse, but we've stuck a horn on it and people think it's a unicorn. I think we are much further away from our goal of genuine audio interactions than most people are aware.

The technology feels like smoke and mirrors. We need models that don't just play with text but truly understand and replicate audio interactions. While companies are working on these new audio approaches, we're not there yet. And solving each problem in isolation will just increase complexity. 

Potential and limitations for voice chatbots

Be cautious and clear.

I had envisioned using this demo for creative applications, like interactive drama teaching or sales training. The idea was to leverage LLMs' creativity for fun and practical purposes. But the current limitations make this impossible, or at the very least, rubbish. LLMs can't grasp tone, sentiment, or context because under the hood they are processing text inputs. We can tell them, but then we are really missing the point of what an audio interaction should be, and so we get the result we have right now: underwhelming, robotic conversations.

This brings me to the larger issue: overhyping the technology. Some businesses are jumping on the AI bandwagon, even replacing their call centers with these systems. I fear for these companies. In my view, that's a huge mistake. Right now, at best, you're talking to a competent robot that might respond accurately some of the time. But the experience isn't natural, and the remaining percentage of failures can be disastrous.

For any company considering this move, I'd say you might as well shut down your communication channels; it's cheaper and less risky. The technology isn't ready for natural, functional conversations yet. Be cautious and clear about what you're trying to achieve with these technologies, and understand that the last 5% of achieving seamless interactions will take exponentially more effort. A big part of me feels like the 95% we are currently using is, in fact, salad dressing. You are not a pioneer adopting this technology; at best you are a guinea pig, and your customers will be the ones to suffer.

While there's immense potential, we're not there yet. In fact, I’m not convinced we’ve really started. Audio feels like a side quest at the moment, and LLMs themselves have to advance significantly before the fundamentals will be ready. There is a long way to go before we can truly replicate the intricacies of human interactions. Until then, it's crucial for businesses to manage their expectations and not over-rely on these tools for critical customer-facing roles. The dream is there, and I'm hopeful we'll achieve it someday—but right now, it is very much a work in progress.

Future directions and challenges

Address some of the current shortcomings

Looking ahead, there is some reason to be hopeful in the realm of audio recognition technology. There are exciting developments on the horizon, especially from major players who are teasing upcoming audio advancements. These companies are continuously iterating and pushing the boundaries of what's possible. I'm particularly eager to see how they might address some of the current shortcomings, like the inability to grasp tone, sentiment, and contextual nuance. It feels like we need native audio models: trained on audio, understanding audio, and responding in audio. That raises more questions, but it feels like the right path. Until I can test and build on top of that, there are fundamental implementation issues that, for me, are effectively unsolvable.

The current systems completely miss the richness of human communication—the inflections, the pauses, the emotions. If future models can start to interpret and replicate these nuances, we could see a significant leap forward in how natural and effective these interactions can feel.

Conclusion

It’s a pig.

It's 1903, and the Wright brothers have just taken the first step towards motorized aviation. There were five witnesses as they flew for 12 seconds and 120 feet. This was the beginning of aviation as we know it today.


In our gliding experiments, we had had a number of experiences in which we had landed upon one wing, but the crushing of the wing had absorbed the shock so that we were not uneasy about the motor in case of a landing of that kind...

The Wright Brothers.

In the 2020s, we published the first AI-powered voice recognition systems. The problem is that there weren't five witnesses, but the whole world. We now have everyone from engineers to business leaders making drastic, revolutionary decisions to rework their systems and processes on top of this innovative new technology.

I think it’s madness.

I'm hopeful that we will make further advances. The possibilities for genuine AI-powered voice interactions are endless: drama, sales, training, hell, even psychotherapy and relationship counseling. I get that this is an exciting vision, and I'm all on board.

I’m as excited by the innovation and progress as everyone else. 

But if it looks like a pig, squeals like a pig and walks like a pig. It’s a pig. 

Even if it’s wearing lipstick. 

Code and links

Link to all the code.