Part II: Multimodal capabilities unlock new opportunities in Vertical AI
Vertical AI applications born out of novel audio, voice, and vision capabilities will fundamentally change the way we work.
Since ChatGPT first came onto the scene and captivated entrepreneurs (and the world), we’ve witnessed a massive explosion in products and services using LLMs to address the “lowest-hanging fruit” use cases for generative AI: text-based tasks — spanning everything from creating legal contracts and job descriptions to drafting emails and website copy.
Demand for text-based AI solutions remains high. AI can take over time-consuming tasks like creating first drafts, refocusing employee efforts on more complex functions. But much more of our day-to-day work requires data types and capabilities beyond text, such as speaking to customers and reasoning over complex images and graphical data. Today, use cases like these are no longer off the table.
The emergence of multimodal models has created opportunities for vertical AI to impact a much larger share of the economy than previously imagined by expanding beyond text-based tasks and workflows. In Part II, we report on new models that support a variety of data types across audio, video, voice, and vision; promising early applications of new and improved voice and vision capabilities; and the potential of AI agents to change how businesses operate.
Exciting developments in multimodal architecture
In the past 12 months, new models have emerged that demonstrate significant advancements in terms of their ability to understand context and reduce hallucinations, as well as their overall reasoning capabilities. The performance we’re seeing across speech recognition, image processing, and voice generation in certain models is approaching (or, in some cases, surpassing) human capabilities, unlocking many new use cases for AI.
Voice capabilities
We’ve seen rapid progress made on two core components of the conversational voice stack: speech-to-text models (automatic speech recognition) and text-to-speech models (generative voice). Dozens of vendors are now providing models with these capabilities, which has led to a flurry of new AI applications, particularly in the case of conversational voice.
Most of these applications rely on what’s called a “cascading architecture”: voice is first transcribed to text, that text is fed into an LLM to generate a response, and the text output is then passed to a generative voice model to produce an audio response. Until very recently, this has been the best way to build conversational voice applications. However, the approach has a few drawbacks — primarily that it introduces additional latency and that some of the non-textual context (e.g., the end user’s emotion and sentiment) gets lost in the transcription process.
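To make the pattern concrete, here’s a minimal sketch of a single cascading turn, illustrated with OpenAI’s Python SDK. The model names, prompt, and file handling are our own assumptions; any speech-to-text, LLM, and text-to-speech vendor could be slotted into the three stages.

```python
# Minimal sketch of one turn through a cascading voice pipeline.
# Model names and the system prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def cascading_voice_turn(audio_path: str, out_path: str) -> str:
    # Stage 1: speech-to-text, transcribe the caller's audio.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    # Stage 2: LLM, generate a text response to the transcript.
    # Tone and emotion are already lost at this point, which is the
    # drawback discussed above.
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful phone agent."},
            {"role": "user", "content": transcript},
        ],
    ).choices[0].message.content

    # Stage 3: text-to-speech, synthesize the response as audio.
    # Each stage adds a round trip, which is where the latency comes from.
    speech = client.audio.speech.create(
        model="tts-1", voice="alloy", input=reply
    )
    with open(out_path, "wb") as out:
        out.write(speech.read())
    return reply
```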
As of this writing, a new generation of speech-native models is being released, including the model behind OpenAI’s Realtime API, which supports speech-to-speech interactions via GPT-4o, as well as several open-source projects such as Kyutai’s Moshi. Developing models capable of processing and reasoning on raw audio has been an active area of research for many years, and it’s been widely acknowledged that speech-native models would eventually replace the cascading architecture.
Speech-native models have substantially lower latency (< 500 milliseconds) than previous approaches. They can also capture much more context from users (e.g., their tone, sentiment, and emotion) and generate responses reflecting that context, making exchanges feel more natural and increasing the likelihood that they address the user’s needs. Over the next few years, we anticipate a step-function change in the speed and quality of conversational voice applications as more of them are built on these new and improved models.
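Below is a rough sketch of a single speech-to-speech turn against a speech-native model, modeled loosely on OpenAI’s Realtime API at the time of writing. The WebSocket URL, event names, and audio framing are assumptions drawn from the launch documentation and may change; treat this as illustrative rather than a reference implementation.

```python
# Hypothetical speech-to-speech turn over a WebSocket, based on our
# reading of OpenAI's Realtime API at launch; verify event names and
# audio formats against current documentation before use.
import base64
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def speech_to_speech_turn(pcm16_audio: bytes) -> bytes:
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # Stream the caller's raw audio straight into the model's input
        # buffer: no transcription step, so tone and emotion survive.
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_audio).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))

        # Collect the model's audio reply as it streams back.
        reply = b""
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                reply += base64.b64decode(event["delta"])
            elif event["type"] == "response.done":
                break
        return reply
```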
Use cases for voice
There’s already been dramatic progress in transcription-based applications, where the underlying speech-to-text models are more mature. More recently, we’ve seen notable early progress in end-to-end conversational voice agents — what we consider the “next frontier” for voice AI solutions. Let’s look at four initial use cases:
- Transcription frees up time for users to facilitate next steps in workflows: Bessemer portfolio company Abridge has pioneered a best-in-class medical transcription application that can generate medical notes based on clinical conversations, and identify appropriate follow-ups including prescription ordering, scheduling appointments with specialists, and referencing billing codes. When doctors don’t have to complete those tasks manually, they can redirect that time and attention towards patient care.
Another great example is Rillavoice, a company that’s bringing AI to the home services vertical. Rillavoice’s transcription application records conversations between salespeople and customers for training purposes, so that sales managers can provide valuable coaching feedback without having to go on time-consuming in-person “ride-alongs.”
- Fielding inbound calls to capture incremental revenue: One of the most compelling use cases for end-to-end voice agents we’ve seen so far is inbound sales, particularly when the solution is purpose-built for a specific vertical (e.g., home services businesses or automotive dealerships). Voice agents can ensure that a business never misses a valuable lead by fielding customer calls after hours or when other sales representatives are busy. Some solutions are able to book an appointment for the customer and even interact with the customer’s system of record to quote a price. These capabilities — combined with the dramatic improvement in conversational voice models compared to prior voice bots — have made it possible for some AI sales agents to close inbound leads at an impressive rate, without requiring interactions or interventions from a sales representative.
- Upleveling the customer support experience with AI: Customer support has long been a target of automation, but many users found prior versions of Interactive Voice Response (IVR) technology quite frustrating to use. Modern voice agents have proven to be much more effective. While traditional IVR products could only understand a customer’s intent in response to specific phrasing, modern voice agents are able to provide a correct answer regardless of how customers ask questions or make requests. And as with all these use cases, automating phone calls gives time back to customer service representatives to focus on solving complex customer problems and answering nuanced questions (vs. FAQs).
- Automating outbound calls to increase top of funnel: Multiple solutions have emerged to automate outbound calls for sales and recruiting teams. Typically, the voice agents use the customer’s stated criteria to identify the highest-potential sales leads or candidates, make the initial call to those leads, and then route them to the next meeting with a salesperson or recruiter. Having AI take over the outbound workflow significantly increases the number of leads that can be contacted, and the company’s top of funnel as a result. And with their time freed up, salespeople and recruiters have a better chance of closing the highest-potential leads.
It will be critical to monitor regulation limiting unwanted robocalls, and to ensure that solutions facilitate outbound calls only to leads and candidates who have opted in to sales outreach.
Across all voice use cases, we expect that low latency and understanding a user’s sentiment and emotion will become table stakes, and that more sophisticated solutions will differentiate along other dimensions, such as orchestrating conversations across multiple underlying models in real time to optimize cost and performance; supporting omnichannel communications, multiple languages, and real-time translation; and building effective conversational guardrails, particularly in highly regulated use cases.
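As a toy illustration of the orchestration point, a solution might route each conversational turn between a cheaper and a stronger model based on simple signals. Everything here (the model names, threshold, and frustration signal) is hypothetical:

```python
# Toy model router for a conversational voice stack. The models,
# threshold, and signals are hypothetical placeholders.
def pick_model(turn_text: str, caller_frustrated: bool) -> str:
    if caller_frustrated or len(turn_text.split()) > 60:
        return "large-reasoning-model"  # stronger model for hard or tense turns
    return "small-fast-model"  # lower cost and latency for routine turns
```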
Vision capabilities
In vision, we’ve seen the development of models like GPT-4 with vision (GPT-4V) that can interpret images and respond to questions about them, as well as multimodal models like GPT-4o that can process raw images and video. It’s expected that GPT-5 will also be able to reason more accurately across an increased context window, deriving nuanced insights from image inputs and potentially adding video processing capabilities too. Google’s multimodal model Gemini 1.5 Pro can already understand both image and video input and retain contextual understanding across a context window of up to one million input tokens.
We expect that these and similar models will continue to improve in performance and come down in cost — great news for application builders.
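For reference, here’s a minimal sketch of asking a vision-capable model a question about an image, using OpenAI’s Python SDK for illustration. The model name and prompt are assumptions, and other multimodal vendors expose similar message formats.

```python
# Minimal sketch: ask a multimodal model a question about an image.
# The model name and prompt are illustrative assumptions.
import base64

from openai import OpenAI

client = OpenAI()

def ask_about_image(image_path: str, question: str) -> str:
    # Encode the image as a base64 data URL the API can consume.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```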
Use cases for vision and video
Initial use cases for vision within vertical applications tend to fall into one of four categories: data extraction, visual inspection, design, and video analytics. While data extraction is the most mature use case for vision models so far, we’re seeing progress in the other areas, and we’re still just scratching the surface of the potential use cases where vision models could be applied.
- Data extraction from pictures, PDFs, or images of other unstructured documents: AI can relieve humans of tedious data entry tasks and open up downstream workflows by applying structure to currently unstructured data. For example, Raft’s platform for the freight forwarding industry uses a combination of computer vision and LLMs to extract critical information from PDF invoices, populate its customers’ enterprise resource planning platforms (ERPs), and automate downstream tasks like invoice reconciliation and preparation of customs declarations (see the extraction sketch after this list).
- Augmenting jobs that currently involve human visual inspection: A number of companies have emerged that use AI to streamline manual visual inspection processes and deliver faster results. The AI construction platform xBuild generates scope-of-work packages for residential construction and restoration projects, and then partners with insurance companies to get them approved for reimbursement. xBuild uses photos of damaged roofs and blueprints of houses to generate reports that outline the scope of repair required to restore the roofs to proper condition, in accordance with local building codes. Other applications have used AI and computer vision to automate quality assurance reviews of construction drawings, helping to catch errors early and prevent costly changes later in the construction process.
- Generating 2D and 3D designs: There has been a steep increase in AI platforms serving the architecture, engineering, and construction (AEC) industry. Some companies are using AI to create feasibility assessments that combine visual depictions of proposed sites (buildings, parking lots, etc.) with the associated costs of supplies, adjusting the former based on the constraints of the latter, and vice versa. Other solutions like Snaptrude create detailed 3D designs of buildings, taking over the repetitive work typically done by structural engineers and giving them time back to focus on higher-level design work (rather than tedious tasks like putting pipes in the right place). Automating aspects of detailed product and infrastructure design not only saves customers valuable engineering time but can also strengthen sales proposals and increase project win rates.
- Video analytics: Models that generate and/or understand video are the least mature among vision models, but they’re improving rapidly. For example, video understanding models have become fairly capable when it comes to object tracking, classification, and even natural language search of video content. There have already been some compelling commercial applications built on top of these models, such as those that monitor video feeds for safety violations in manufacturing or industrial settings. But given how fast video models are maturing, we expect to see even more impressive applications in the coming years, and an expansion into more use cases, particularly in robotics where video understanding is a critical component of robotic perception.
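To illustrate the data extraction pattern from the first bullet above, here’s a rough sketch of how a vision-capable LLM can turn an invoice image into structured fields ready to populate an ERP. The field list, prompt, and model name are illustrative assumptions, not a description of Raft’s actual implementation.

```python
# Hypothetical invoice extraction: a vision-capable LLM returns the
# fields as JSON so downstream systems (e.g., an ERP) can consume them.
import base64
import json

from openai import OpenAI

client = OpenAI()

FIELDS = ["invoice_number", "supplier", "currency", "total_amount", "due_date"]

def extract_invoice_fields(image_path: str) -> dict:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # force parseable output
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract these fields from the invoice as JSON: "
                         + ", ".join(FIELDS)
                         + ". Use null for any field that is missing."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```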
Across all vision use cases, founders should avoid mistaking complexity for value. While solutions may be the most defensible when they automate particularly complex workflows—say, the creation of detailed 3D designs vs. 2D error checking—the value to customers will almost always be directly tied to how well the automation fits within a user's existing workflow.
If a design automation solution requires burdensome integrations with difficult-to-replace core systems (such as Revit) and has low initial ROI, it will be hard to drive sales and adoption, regardless of how robust the solution is. Early-stage companies may be better off starting with a product that’s less technically complex and narrower in scope and then expanding from there. Of course, the best path will vary sector by sector and use case, but the trade-offs are important to keep in mind.
The promise of AI agents
While the early hype around AI agents outpaced reality, we’ve witnessed real progress recently, as teams find ways to more effectively constrain tasks for AI agents in order to reduce compounding errors in multi-step reasoning. We’re especially bullish on agents given the amount of research and resources being dedicated to reasoning-focused foundation models, such as OpenAI’s o1. Most LLMs focus purely on predicting the next token based on patterns seen in training data, but models like o1 take a fundamentally different approach to problem-solving. These models are designed to “think” more at inference time, using chain-of-thought reasoning to plan and assess their approach before arriving at an answer. While it’s early, these models are demonstrating impressive performance on more complex reasoning tasks.
Today, agents are playing a valuable role in text, voice-based, and vision-based workflows that involve repetitive tasks and communication (as we show below). But over the coming year, we expect applications built on newer reasoning-based models to emerge and deliver on the true potential of AI agents: addressing complex workflows autonomously.
- Sales and marketing: Many companies have come out with AI agents that can source and contact potential customers for sales teams. What’s promising about these agents is that they’re able to conduct a significant amount of research to identify high-quality prospects (through a detailed web search of the target company, its employees, and relevant industry news), and then use those research findings to craft relevant and highly personalized emails. Because agents can execute the research and outreach portions of the job at relatively high quality, sales representatives can redirect their time toward closing warm leads.
- Negotiations: AI agents have shown promise in automating negotiations across multiple parties. Companies like Pactum have developed AI agents that can negotiate legal and commercial terms for supply chain use cases. Pactum’s application maps out the value function for the user while the agent conducts simultaneous negotiations with suppliers to optimize deal terms. We’ve seen similar approaches taken by other vertical AI companies in the sales and promotions space. Here, agents negotiate with buyers and suppliers on set criteria such as discounts for bulk purchases or rapid payment plans.
- Investigations: Enterprise cybersecurity teams are often overwhelmed by the high volume of security alerts they receive, but there are now AI agents that can assist with the initial phase of an alert investigation: gathering information about an event from multiple disparate systems, researching the malicious behaviors that might have been involved, and summarizing the incident and grading its severity. While most teams tend to use agents for lower-stakes workflows, it's clear that more sophisticated agents can (and likely will) address more and more workflows that require information gathering and synthesis.
We believe that agents addressing tasks and workflows that require more complex reasoning across multiple modalities will be significantly more defensible than solutions that don’t. In particular, we’re seeing that it’s possible to drive higher performance in agentic workflows through clever architectural decisions and by stitching together the right models, guardrails, and feedback loops to deliver consistent outcomes. Agent performance is not purely based on the scale of data and compute that’s thrown at a problem (as is the case with LLM training), and so this is a more compelling opportunity for early-stage startups. In all cases, it will be key to strike the right balance between building a technical moat and ensuring flexibility given the fast-paced development of the underlying models.
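As a concrete (and deliberately simplified) sketch of that stitching, consider a constrained agent loop: one model plans, a whitelisted tool set executes, and a guardrail plus a hard step cap keep compounding errors in check. The tools, prompt format, and guardrail heuristic below are hypothetical.

```python
# Deliberately simplified agent loop: plan, check against a guardrail,
# execute a whitelisted tool, and feed the result back. The tools and
# guardrail heuristic are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

TOOLS = {
    "lookup_account": lambda arg: f"account record for {arg}",
    "summarize_alerts": lambda arg: f"summary of recent alerts for {arg}",
}

def guardrail_ok(step: str) -> bool:
    # Hypothetical guardrail: only allow steps that name a whitelisted tool.
    return any(step.startswith(name) for name in TOOLS)

def run_agent(task: str, max_steps: int = 5) -> list:
    history = [f"TASK: {task}"]
    for _ in range(max_steps):  # hard step cap limits compounding errors
        plan = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": "\n".join(history)
                + "\nReply with the next step as 'tool_name: argument', or DONE.",
            }],
        ).choices[0].message.content.strip()
        if plan.startswith("DONE"):
            break
        if not guardrail_ok(plan):
            history.append(f"BLOCKED: {plan}")  # feedback loop on failure
            continue
        tool, _, arg = plan.partition(":")
        history.append(f"{plan} -> {TOOLS[tool.strip()](arg.strip())}")
    return history
```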
Vertical AI expands its horizons
Vertical AI founders have already begun taking advantage of new capabilities, putting them to use to address a much wider range of real-world tasks and workflows — far beyond what most of us could have imagined even two years ago. As happened with text, underlying models in voice and vision will increasingly become commoditized, making it more sustainable for companies to build applications on top of powerful foundation models. Based on early signals, we believe that this wave of vertical AI applications will not only change the industries they serve and the vertical software landscape; it will fundamentally change the way we work and interact with the world.
Up next: Novel business models
Advancements in LLMs and generative AI have driven business model innovation as much as product innovation. They’ve catalyzed novel software business models that have opened up opportunities in industries that were previously off limits for vertical software, and facilitated new use cases that have allowed existing vertical software incumbents to continue building a “layer cake” of products and services. In the next article, we’ll take a deep dive into three of these emerging business models—copilots, agents, and AI-enabled services—and the roles, potential applications, and pricing strategies for each.
If you are working on a Vertical AI application, we would love to hear from you! Please reach out to our team at VerticalAI@bvp.com.