The Future of Conversational AI – Where Voice is Only a Single Component
By Eitan Cohen, CEO and Co-founder of TechSee
Anyone who has ever called a bank, an airline, or any business with a customer service department has likely interacted with a Voice AI: a tool trained on datasets relevant to that industry so it can respond to questions and direct the caller accordingly. The idea is that simple requests and tedious information-gathering tasks are absorbed by the AI agent.
Consumer feedback on how well current AI support works is a mixed bag. For shopping, a Voice assistant can be beneficial, with 80% of shoppers satisfied with their AI assistant. Then there are the 75% of us who say we prefer speaking to a live agent, and the more than half who don't trust a chatbot. A major concern for callers is Voice AI's inability to deal with complex issues.
Beyond tweaking and enhancing Voice AI to make it more responsive, there is a mounting case for Voice AI to sit alongside other AI modalities. This unlocks a more dynamic, autonomous user experience, a key facet of Agentic AI, which can investigate requests and support callers without the hiccups that Voice AI alone might encounter.
This movement to incorporate Voice AI into a broader solution is where AI, especially in a customer service setting, will flourish.
More Than Just Words
Voice AI processes and responds to spoken language. It’s used to streamline customer service conversations, help with troubleshooting, assist in product selection, and much more.
Voice AI is also constantly evolving, with new models of AI agents emerging, from Microsoft's Copilot with advanced voice capabilities to Meta's inclusion of voice in Meta AI. This subset of AI is primed to expand in application and, in the process, to become more "emotionally" in sync with the end user, intuiting what is needed in real time.
Conversation is more than just talking, though. If we refer to the commonly held assertion that body language accounts for some 90% of conversation, words become less prominent. In fact, Albert Mehrabian, the architect of this conversational breakdown, found more specifically that it is 55% nonverbal, 38% vocal, and 7% words only. Put simply, seeing the person you are engaging with is more powerful than only hearing them.
Developers have taken these principles and are adapting them to the future of conversational AI. Interaction with that bank, airline, or ticketing company will no longer be just an exchange of words but a three-dimensional experience, engaging users through audio, visuals, and text. Voice AI will have its place as one part of a multimodal encounter.
Why the Many Modalities of AI Make a Better Agent
Voice AI isn't graduating into a multimodal offering merely as part of an inevitable transition; it is actively being trialed alongside other forms of AI. It works best when combined with other modalities such as Visual AI, text-based AI like large language models (LLMs), and action-oriented Agentic AI. By integrating these elements, businesses can deliver deeper context, richer interactions, and smarter automation across the customer lifecycle.
Think of it this way: how many brands do we recognize by image or logo rather than by their written slogans alone? How many websites would we linger on if there were no visuals? How many news articles would we read with as much dedication without accompanying photos? One study found that readers spend more time on text-based articles when a photo is included. Cognitive psychology has plenty to say on why images are more memorable than words, but for the purposes of the future of AI, especially in a customer service setting, let's just agree they're vital for amplifying attention and intelligently interpreting an issue alongside the verbal instruction.
Take Google's Gemini, a multimodal tool to which you can send a picture of a plate of food and receive a written recipe in response. Now transpose this concept to the customer contact center. A caller sends in an image of a broken appliance, and a multimodal AI can diagnose the problem based on a reservoir of information it has amassed from similar interactions, with similar imagery and similar outcomes.
By integrating Visual AI alongside Voice, an image or video can be requested and analyzed live with the caller on the line, while LLMs backed by strong cognitive services can parse product documentation, knowledge bases, and past service interactions to provide detailed, contextually relevant information. The AI can identify the make and model, pinpoint the root cause, guide the customer through the fix using augmented reality, and verify that the issue is resolved. This enhances the precision and efficiency of customer interactions. Furthermore, multimodal Agentic AI can use augmented reality overlays and generative Visual AI to guide the user visually, which is critical for tasks like setup and onboarding.
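To make that flow concrete, here is a minimal sketch of how such an orchestration might look in code. Every class and function below is a hypothetical placeholder standing in for real Visual AI, retrieval, and LLM components; it illustrates the sequence of steps, not any vendor's actual API.

```python
"""A minimal, illustrative sketch of the multimodal flow described above.

Every class here is a hypothetical stand-in for a real Visual AI,
retrieval, or LLM component; none of it is any vendor's actual API.
"""

from dataclasses import dataclass


@dataclass
class Diagnosis:
    make: str           # device manufacturer identified from the image
    model: str          # device model identified from the image
    root_cause: str     # most likely fault
    confidence: float   # 0.0 to 1.0


class VisualModel:
    """Stand-in for an image model that classifies the appliance and fault."""

    def diagnose(self, image: bytes) -> Diagnosis:
        return Diagnosis("AcmeCo", "WX-200", "clogged drain pump", 0.82)


class KnowledgeBase:
    """Stand-in for retrieval over manuals and past service interactions."""

    def search(self, query: str) -> str:
        return f"Top documents matching: {query}"


class LLM:
    """Stand-in for a language model grounded in retrieved context."""

    def generate(self, prompt: str, context: str) -> str:
        return f"Step-by-step fix for '{prompt}', based on: {context}"


def handle_service_call(transcript: str, image: bytes) -> str:
    """Orchestrate Voice, Visual, and LLM components for one interaction."""
    visual, kb, llm = VisualModel(), KnowledgeBase(), LLM()

    # 1. Visual AI identifies the make, model, and likely root cause.
    diagnosis = visual.diagnose(image)

    # 2. Retrieval pulls documentation and similar past cases for context.
    context = kb.search(f"{diagnosis.make} {diagnosis.model} {diagnosis.root_cause}")

    # 3. The LLM turns that context into guidance for the caller.
    guidance = llm.generate(transcript, context)

    # 4. If confidence is low, hand off to a human with the compiled case.
    if diagnosis.confidence < 0.7:
        return f"Escalating to a human agent with case notes: {context}"
    return guidance


if __name__ == "__main__":
    print(handle_service_call("My washing machine won't drain", b"<image bytes>"))
```

The design choice to thread the visual diagnosis into both the retrieval query and the escalation payload mirrors the point above: the human agent, when needed, arrives with the case already documented.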
This is the direction in which customer service should be moving: an AI of many faces, which listens, scans images, extrapolates findings from historic data, and presents acceptable solutions in a quicker timeframe, allowing human agents and technicians to swoop in when they're needed, armed with the well-documented and ordered information the AI has compiled on the case.
How Companies Can Incorporate Voice AI into a Wider Offering
When positioned within a multimodal strategy, Voice AI excels at enhancing service and sales operations. In customer service, it can work in tandem with Visual AI, LLMs, and Agentic AI as a frontline tool, quickly resolving Tier 1 inquiries and even tackling more complex issues. In sales, it can guide customers through the buying journey, making recommendations, answering questions, and even facilitating purchases.
Often, customers who begin their journey with Voice technology need to switch modalities, adding visual capabilities to a previously Voice-only interaction. Integrated multimodal solutions enable a smooth transition across text, Voice, and visual interactions, making service and sales experiences seamless and personalized.
Voice AI in tandem with Visual AI and LLMs can automate routine inquiries such as product troubleshooting, onboarding, and account setup. This not only reduces operational costs but also delivers a more efficient, dynamic customer experience.
To build a multimodal solution, companies should ensure it is part of a comprehensive strategy that includes Visual AI and LLMs. Deploying Voice AI as part of an Agentic AI system ensures it doesn't just respond to customers but solves problems and completes tasks across different channels.
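As a rough illustration of that "solve, don't just respond" principle, the sketch below shows how an agentic layer might route a recognized Voice intent to a direct answer, a modality switch, a completed task, or a human handoff. The intents and routing rules are invented for illustration; a production system would rely on learned models and real channel integrations rather than hand-written rules.

```python
# A rough illustration of "solve, don't just respond": an agentic layer
# that routes a recognized Voice intent to an answer, a modality switch,
# a completed task, or a human handoff. All intents and rules here are
# invented for illustration, not a real framework's API.

from enum import Enum, auto


class NextStep(Enum):
    ANSWER = auto()          # Voice AI resolves the inquiry directly
    REQUEST_VISUAL = auto()  # ask the customer to share an image or video
    EXECUTE_TASK = auto()    # complete an account action on the caller's behalf
    HANDOFF = auto()         # route to a human agent with full context


def route(intent: str) -> NextStep:
    """Simplified intent-to-action policy for a multimodal agent."""
    if intent in {"billing_question", "store_hours"}:
        return NextStep.ANSWER
    if intent in {"device_fault", "setup_help"}:
        return NextStep.REQUEST_VISUAL
    if intent in {"reset_password", "change_plan"}:
        return NextStep.EXECUTE_TASK
    return NextStep.HANDOFF  # anything unrecognized goes to a person


if __name__ == "__main__":
    for intent in ("device_fault", "reset_password", "legal_dispute"):
        print(f"{intent} -> {route(intent).name}")
```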
A Multisensory AI Experience
Voice AI will continue to play a crucial role in service automation and customer experience, but its greatest potential is unlocked when it is merged with other modalities. By integrating vision, text, and data-driven AI alongside Voice, businesses can build a cohesive, AI-driven customer engagement strategy.
More personalized interactions, more satisfied customers, reduced workforce churn, and accelerated business growth: this is the potential impact of AI in service and sales conversations when it mobilizes multiple modalities to understand and connect with customers. According to Gartner, 80% of B2C companies will have shifted from traditional customer service models to AI-driven, automated customer engagement channels by 2025. How many of these will transition to a multimodal Agentic AI offering remains to be seen, but the more businesses appreciate how it can add depth and clarity to a customer interaction, the greater the likelihood of widespread adoption.
Eitan is the CEO and Co-founder of TechSee. He has 20 years of experience as a high-tech entrepreneur and executive building companies, teams, and products from the ground up. Enterprises that have benefited from his expertise and hands-on leadership include Local Sciences (CEO, acquired), Blue Pumpkin (GM, acquired), Merchant Circle (GM, acquired), Nice Systems, and Amdocs. Eitan holds a BSc in Computer Science and a BA in Psychology from Tel Aviv University.