I am exploring the latest conversational AI technologies, and what happens when prompts alone aren’t enough.
One topic has been simmering for quite some time in my long list of blog drafts: What is the ideal interface for communicating with a generative AI?
I was inspired again this morning after reading Stefano Gatti’s consistently excellent post (link). His insights—particularly the interview with Federica Fragapane on Data Visualization (I was there at that TED Talk! 😊)—and his discussion on voice interfaces and smart objects sparked some thoughts I’d like to share about AI and its interfaces.
The History of Interfaces: From Mainframes to GUIs
I began my career helping a company transition from mainframe terminals to graphical interfaces. At just 19 years old, I introduced forty COBOL programmers, whose careers had started with punch cards, to the GUI concepts that Windows and the Web were bringing in. I’m no UI wizard, but since then I’ve had a deep passion for how humans interact with technology.
Today, just as back then, we’re facing a transformation: making AI understandable and usable for those without technical skills. The evolution of interfaces opened computing up to (almost) everyone, making it (almost) intuitive and visually accessible.
With AI, the principle is similar: to fully leverage its potential (which, as I often say, is not simple Software 1.0), we must create interfaces that are just as natural: interfaces that require no specialized knowledge from users and build on the natural language we can now use to communicate with AI.
This challenge is even more complex because we expect AI to understand our intentions and respond intelligently to diverse inputs.
Traditional Interfaces vs. Generative AI
Until now, keyboards and mice (along with countless other devices) have been our interfaces for software (and I’m pretty fond of the keyboard). However, generative AI brings new requirements and possibilities, especially with the advent of multimodal models that accept and produce text, images, and sound; it demands different kinds of input.
Thus, the challenges for those designing new interfaces are substantial. It’s time to transcend the limitations of traditional interfaces with entirely new approaches.
For those interested in specifics, ISO standards (notably ISO 9241-11) define the key factors for measuring usability: effectiveness, efficiency, and satisfaction are no longer subjective but measurable. You can explore this in this paper or the paper by Luera et al., from which I drew the illustration above.
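As a rough illustration of how those three factors become measurable, here is a minimal sketch; the metric definitions follow the spirit of ISO 9241-11, while the data shape and names are my own assumptions, not taken from the papers mentioned above.

```python
from statistics import mean

# Hypothetical usability trials: task completion, time on task,
# and a 1-5 satisfaction rating collected after each session.
trials = [
    {"completed": True,  "seconds": 42.0, "rating": 4},
    {"completed": True,  "seconds": 35.5, "rating": 5},
    {"completed": False, "seconds": 90.0, "rating": 2},
]

# Effectiveness: share of tasks completed successfully.
effectiveness = sum(t["completed"] for t in trials) / len(trials)

# Efficiency: mean time to completion, counting successful trials only.
efficiency = mean(t["seconds"] for t in trials if t["completed"])

# Satisfaction: mean self-reported rating (a real study would use a
# standardized questionnaire such as SUS).
satisfaction = mean(t["rating"] for t in trials)

print(f"effectiveness={effectiveness:.0%}  "
      f"efficiency={efficiency:.1f}s  satisfaction={satisfaction:.1f}/5")
```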
My Pursuit of a Better Interface with AI
My attempt to “Jarvis-ify” everything, that is, to create an AI assistant similar to Iron Man’s Jarvis that recognizes me and lets me carry my private AI with me, reflects my ongoing search for a ‘better way’ to interact with data, algorithms, and AI.
I’ve personally encountered the challenges of interfacing with all these models. I’ll discuss this on the blog soon, as I believe the topic of Personal AIs will be critically important in the future.
What Are the Limits of Current AI Interfaces?
The trends we’ve observed in recent months are revealing—and may sound familiar to you:
Those using AI via chat-style bots (a relic of Software 1.0) often struggle because it’s difficult for the average person to ask well-structured questions; the results are frequently mediocre.
When AI asks questions of us, we’re often caught off guard or find its requests frustrating and insufficient. (This area is worth exploring; I always recommend allowing LLMs to ask clarifying questions to improve interactions; see the sketch below.)
Once a conversation exceeds 100k tokens of context (the length of a short novel), we struggle to track all the steps. Many conversations become “tangled”, whether because of our own contextual limits or those of the LLMs.
The arrival of LLMs has allowed us to ask virtually anything, yet we’ve realized that not all questions are effective—poorly phrased requests yield inconsistent answers.
Hence the rise of prompt engineers, who teach us how to phrase questions so the AI gives better results. But here, too, the interface’s limitations are evident: we are adapting to the way AI understands rather than demanding that it adapt to human needs. In other words, we carry forward the limitation of the past 50 years, in which users have had to (ineffectively) adapt to software, spending time on complex, unproductive tasks.
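Here is the sketch promised above: one minimal way to let an LLM ask clarifying questions before answering. The system-prompt wording is mine, and `call_llm` is a placeholder for whichever chat API you use, not a real library function.

```python
# Minimal sketch: have the model ask clarifying questions before
# answering, instead of guessing at a vague request.
SYSTEM_PROMPT = (
    "Before answering, check whether the user's request is ambiguous. "
    "If it is, ask up to three short clarifying questions and wait for "
    "the answers. Only then give your final response."
)

def call_llm(messages: list[dict]) -> str:
    """Placeholder for a real chat-completion call (OpenAI, a local model, etc.)."""
    raise NotImplementedError

def converse(user_request: str) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_request},
    ]
    reply = call_llm(messages)
    # Crude heuristic: if the model replies with a question, let the
    # human answer it and feed the answer back into the conversation.
    while reply.strip().endswith("?"):
        answer = input(f"{reply}\n> ")
        messages += [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": answer},
        ]
        reply = call_llm(messages)
    return reply
```

Even this crude loop changes the feel of the interaction: the burden of producing a perfectly structured question no longer falls entirely on the user.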
To help illustrate the interaction challenge, I tried an intriguing experiment with Leonardo.ai using a Realtime Canvas that interprets what we draw in real time. Providing a prompt for context helps us achieve precise results, even if, like me, you can’t draw. Check out the video (it’s entertaining, too!) to see how it works: the AI interprets our drawings and refines them based on the instructions given.
It’s becoming clear that we must move beyond chat as the sole interface.
Ongoing Solutions
Over the past months, several attempts have been made to overcome these limits:
OpenAI has adapted ChatGPT with the Canvas feature, turning it into a text or code editor.
OpenAI’s o1 promises to reason better from minimal textual input to understand our intentions, as I discussed here.
Midjourney introduced an editor for modifying images generated from prompts, similar to what Adobe and DALL-E 3 offer with Generative Fill.
Microsoft is integrating Copilot across its suite, tackling the longstanding issue of interfaces in Office and business systems, aiming to blend Software 1.0 and 2.0.
Google created NotebookLM, reducing the need for chat and introducing a podcast-like feature in which two AI hosts turn your documents into a spoken knowledge environment.
Notion and other tools like Visual Studio, Photoshop, and Canva are now infused with copilots. Examples abound.
In essence, we’re surrounded by assistants that blend Software 1.0 with AI, adding a new layer of usability challenges.
Is Voice the Solution?
From “2001: A Space Odyssey” with HAL 9000 to Iron Man’s Jarvis, it seems natural to interact with AI through voice. It feels intuitive, spontaneous, and effective—right?
For instance, companies are implementing voice assistants to aid sales teams in real-time CRM data access, simply by asking direct questions like, “What are this month’s sales?” or “Show me the clients who haven’t received follow-ups yet.” In call centers, voice bots enable operators to quickly retrieve information or record orders without interrupting the flow of conversation, enhancing efficiency and customer satisfaction. These small revolutions simplify data access and speed up daily operations.
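To make the plumbing concrete, here is a minimal sketch of how such a voice query might be wired up. The `transcribe` placeholder, the regex intents, and the table layout are all my own assumptions for illustration; a production system would use a real speech-to-text service and an NLU model or LLM instead of regexes.

```python
import re
import sqlite3

def transcribe(audio_path: str) -> str:
    """Placeholder for a real speech-to-text step (Whisper, a cloud API, ...)."""
    raise NotImplementedError

# Map a transcribed question onto a CRM query. Regexes keep the
# sketch self-contained; real systems would use an NLU model or LLM.
INTENTS = [
    (re.compile(r"this month'?s sales", re.I),
     "SELECT SUM(amount) FROM sales "
     "WHERE strftime('%Y-%m', sold_on) = strftime('%Y-%m', 'now')"),
    (re.compile(r"clients .*follow[- ]?ups?", re.I),
     "SELECT name FROM clients WHERE last_follow_up IS NULL"),
]

def answer(question: str, db: sqlite3.Connection) -> str:
    for pattern, sql in INTENTS:
        if pattern.search(question):
            rows = db.execute(sql).fetchall()
            return f"{question!r} -> {rows}"
    return "Sorry, I didn't catch that; could you rephrase?"
```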
Significant strides have been made recently:
ElevenLabs has created an almost perfect voice bot, meeting many of the criteria from my AI assistant series. The company focuses on voice accuracy and on replicating inflections, making conversations more realistic and natural.
HeyGen introduced interactive avatars with human-like appearances and cloned voices, adding a visual element for more engaging interaction—ideal for presentations or demos (learn more at my workshops!).
OpenAI released advanced voice mode, which is promising but has limitations. While initially exciting, “knowing it’s artificial” might limit its impact. However, it’s a helpful complement for rapid, daily interactions.
Google’s NotebookLM is helpful for podcasts, but as a one-way interface from the AI to us, it isn’t interactive. This quick-summarizing feature is ideal for time-pressed people seeking concise information.
My Meta Ray-Ban glasses let me talk to Meta’s Llama AI, though the interactions are brief. They’re useful for quick exchanges, but the dialogue lacks context and continuity.
It seems that voice could resolve many issues, but not all. For instance, giving precise verbal instructions for editing an image or video remains challenging.
So What...
Not only are complex tasks like describing image or video edits challenging to do by voice; even more straightforward actions like text editing reveal that voice alone is insufficient.
If the examples above aren’t enough, imagine modifying a complex table verbally: describing each cell’s changes, content, formulas, and format would take more time and precision than using a keyboard and mouse.
Imagine telling a text editor, “Bold the word ‘difficulty’ in the sentence before the cursor.” A single click would be faster.
Voice is certainly a valuable tool for those who find the keyboard slow, especially for on-the-go or contextual tasks like a voice-based help desk or CRM queries. However, it may be limited to brief or specific contexts since it lacks the effectiveness of face-to-face conversation, where non-verbal and empathetic cues play a role.
We’re still at the beginning of a long research journey. Interfaces for interacting with AI are still finding their path and are starting to “move beyond the screen” to better meet our needs. Yet I don’t anticipate a definitive solution soon; even human-to-human communication has its difficulties.
Looking to the future, AI interactions might move beyond voice and text. Emerging technologies, like neural interfaces, open fascinating possibilities: imagine issuing commands simply by thinking without needing to touch a keyboard or mouse, as Neuralink or Meta’s experiments suggest. Or, with emotion recognition tools, an AI could interpret our moods from voice tone, real-time images, or writing style, adjusting responses for a more empathetic, natural interaction. These advancements are still in their early stages, but they could redefine human-machine interfaces, transforming AI from a tool into a genuine “virtual collaborator.”
Glimpse, my novel on the birth and evolution of a Super AI, explores interfaces extensively and starts with a note on this concept: The best Interface is No Interface. In the book, I explore the possibility that AI could solve the interface challenge by becoming intuitively comprehensible. But I won’t spoil it for those who haven’t read it.
I don’t expect an immediate, perfect solution; after all, even humans struggle to connect, and misunderstandings are commonplace, often with severe consequences.
Maybe AI will someday help us solve this fundamental issue. But John Lennon’s Imagine comes to mind; perhaps it’s just a utopia. 😊
But...
I’d love to hear your perspective.
Do you think voice is truly the ideal method of interaction, or do you envision more advanced alternatives?
What stresses or bothers you most when working with AI?
If you create AI solutions, what challenges do you face in designing the ideal AI interface?
Let me know in the comments, or explore more innovative ideas on the blog. Don’t miss the latest updates—subscribe to the newsletter!