Written by Philo van Kemenade
Designing trustworthy interactions with large language models
How can we leverage LLMs effectively while addressing their limitations in design? What are the functional shortcomings of LLMs, and how do they impact design? What strategies can be used to ensure trustworthy AI-driven interactions?

Intro

Access and exploration of information supported by generative AI and Large Language Models (LLMs) is becoming more and more ubiquitous in search engines, websites, and software. As LLMs are integrated into various domains, how do we leverage their strengths while also addressing their weaknesses and prioritizing user needs in design? As a design agency specializing in data-driven experiences, we aim to help audiences gain a deeper understanding of the information that shapes their reality. We see a lot of potential to use AI in our work, and it’s important to ensure this is done with transparency and care.

In this article we’ve compiled key insights from our research work on the potential of LLMs in our projects, including how LLMs can enable more fluent natural language interactions, why their outputs should be viewed with caution, and four strategies we used to guide LLMs toward more reliable results.

Unlocking the potential of language in data interaction

The idea of computers processing natural language isn’t new, but recent advancements in deep learning—particularly the development of transformer architecture powering LLMs—have brought this vision closer to reality. For the first time, computers can interpret, process, and respond to us in ways that feel intuitive and human-like. This breakthrough not only enhances the way we interact with machines but also unlocks vast amounts of information, such as historical records, niche industry data, or unstructured text, that were previously inaccessible to computational methods.

One promising application is text transformation tasks, such as summarization, translation, reformatting, or adjusting tone, allowing users to refine or repurpose content with ease. Another is the extraction of information from unstructured sources, like identifying entities, topics, or relationships from written documents or even noisy optical character recognition output. Both use cases emphasize augmenting existing sources rather than generating entirely new material, resulting in more predictable and reliable outcomes.
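To make these two use cases concrete, here is a minimal, hypothetical sketch in Python. The call_llm() helper is a placeholder for whichever LLM API a project actually uses, and the prompts are illustrative rather than prescriptive.

```python
# Hypothetical sketch of the two "augmenting" use cases described above.
# call_llm(prompt) -> str is a placeholder for any LLM API client.

def adjust_tone(text: str, call_llm) -> str:
    # Text transformation: the output is a reworked version of the input,
    # not newly invented material.
    prompt = f"Rewrite the following text in plain, friendly language:\n\n{text}"
    return call_llm(prompt)

def extract_entities(text: str, call_llm) -> str:
    # Information extraction: pull named entities out of an unstructured
    # source, such as noisy OCR output.
    prompt = (
        "List every person, organization, and place mentioned in the text "
        f"below, one per line:\n\n{text}"
    )
    return call_llm(prompt)
```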

Yet, as we unlock this new-found potential, it’s essential to design systems that balance innovation with dependability. LLMs are powerful but imperfect, and building meaningful applications requires a thorough understanding of their strengths and weaknesses. Addressing these limitations is critical—not to undermine the promise of LLMs but to ensure that their integration fosters trust and delivers value in real-world scenarios.

Identifying the Pitfalls of LLMs

While LLMs have revolutionized natural language processing, they also exhibit significant limitations, especially in contexts that demand precision and reliability. In addition to these functional challenges, the substantial energy and water usage required to train and run large AI models raises critical questions about sustainability. This makes it all the more important to assess whether an LLM is the right tool for the task, particularly when designing production systems for large-scale use. The following points focus on functional shortcomings and, while not exhaustive, have proven valuable in informing the design of LLM-based experiences.

Lack of relevance & specificity

While LLMs are trained on vast amounts of language data providing them with broad general knowledge, they often fall short when it comes to domain-specific information. This gap becomes apparent in contexts where users require precise, detailed answers related to specialized fields.

In the domain of aviation safety, for example, an LLM might be able to make statements about the general importance of safety measures. However, user queries might concern specific flight details or historical incidents. While this information could be accessible to an airline developing an LLM-powered app, it is unlikely to be part of the LLM’s training data, preventing the model from providing the precise and relevant answers users are looking for.

Bridging this gap is critical for ensuring that LLMs can deliver accurate and domain-specific information. Without this alignment, LLM output risks being too generic or even irrelevant to user needs.

Inaccuracies

LLMs generate text by predicting the most likely next word in a sequence—a process that enables fluid and convincing responses but can also result in significant errors. While AI tool builders often refer to these errors as “hallucinations,” many experts caution against this term, as it implies intent or agency that machine learning models simply do not possess. Inaccuracies can take several forms:

  • Falsehoods: Statements that are entirely untrue and lack any factual foundation
  • Irrelevance: Outputs may include irrelevant or unsolicited details that don’t align with the prompt
  • Source conflation: The model may blend information from different sources, leading to contradictions or misleading results

When the issue of inaccuracies is overlooked, users may place undue trust in seemingly authoritative responses, leading to misinformed decisions. In critical contexts the risks are even higher, as erroneous information can compromise safety, efficiency, or credibility.

Knowledge cut-off

The datasets that LLMs are trained on contain content collected up to a specific point in time, known as the knowledge cut-off date. This means the model has no awareness of events or developments that occurred beyond that moment.

Asking OpenAI’s GPT-4o about its cut-off date

For instance, if you asked an LLM trained on data up to early 2022 about the outcomes of COP27, the United Nations climate conference held later that year, it might confidently discuss the Paris Agreement or the commitments made at COP26 in Glasgow. However, it would fail to provide any insights into the agreements, pledges, or debates from COP27, as that information lies beyond its training horizon.

Such limitations can lead to misinformation or gaps in understanding, especially in fields like climate policy, where staying current with developments is critical for informed decision-making and advocacy.

When chat interfaces aren’t the right tool for the job

The chat interface has been instrumental in popularizing LLMs like ChatGPT, offering a simple and flexible way for users to interact with these models. While a chat interface lends itself to many different use cases, it has a couple of major drawbacks.

First of all, a chat box can feel like an intimidating blank page. Without clear guidance, users may struggle to understand how to use the tool effectively and navigate its strengths and weaknesses. Most AI companies that offer a ‘talk to our model’ interface recognize this challenge, and a common solution is to provide suggested prompts to help users get started.

Examples of the "blank page" text inputs of ChatGPT, Claude, Gemini, and Perplexity, along with suggested prompts to overcome this hurdle

But even when we step past this blank page, there are many occasions when a chat interface isn’t the right tool for the job. Imagine designing a website’s color palette. Of course we can use textual color representations like “#A7F3D0” or “Sage Green” and throw prompts around to “make the first color a bit darker” or “increase the saturation ever so slightly”, but this approach feels too imprecise and indirect for what we want to achieve.

Generally speaking, a chat interface’s lack of structure often leaves users guessing about how to phrase their inputs or interpret the outputs. In critical domains, where accuracy and reliability are essential, a generic text box can place an undue cognitive burden on users. Thoughtful UI design can address this by providing contextual cues, guiding user input, and surfacing an application’s capabilities in ways that are most intuitive and approachable for the task at hand. By doing so, we can transform interactions with an AI model from uncertain trial-and-error into empowering and context-specific experiences.

Strategies for trustworthy interactions

To effectively harness the power of LLMs and provide users with accurate, trustworthy information, we must adopt targeted strategies that counteract their inherent pitfalls. In this section, we outline four key strategies that address the limitations of the use of LLMs, improve the reliability of outputs and ensure that our AI applications remain aligned with user needs and expectations.

Grounding responses with relevant data

Standard LLM chatbots generate responses based on user prompts and all the information they’ve been trained on. As we saw in the first pitfall above, this generic training data doesn’t include sources that are specific to our business domain, so the models are unable to answer detailed questions about that content accurately.

The Retrieval-Augmented Generation (RAG) architecture addresses this limitation by grounding LLM responses in specific, up-to-date information from curated sources. In this approach, when a user asks a question about a specific topic like aviation safety, the system first retrieves relevant snippets from a curated set of documents. These pieces of text are then included in the prompt to the LLM along with the original user query, allowing the model to generate an answer informed by our verified sources.

A regular LLM Chat Bot architecture above and a Retrieval-Augmented Generation (RAG) architecture below
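As a rough illustration, the RAG flow described above can be sketched in a few lines of Python. The retriever and the call_llm() helper are placeholders for whichever components a project actually uses, such as a vector search over indexed documents and an LLM API client; this is a sketch of the pattern, not a production implementation.

```python
def answer_with_rag(question: str, retriever, call_llm) -> str:
    # 1. Retrieve the snippets from our curated sources that are most
    #    relevant to the user's question (retriever is a placeholder,
    #    e.g. a vector search over a curated document set).
    snippets = retriever.top_k(question, k=5)

    # 2. Include those snippets in the prompt, so the answer is grounded
    #    in verified sources rather than generic training data alone.
    context = "\n\n".join(snippet.text for snippet in snippets)
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Generate the grounded answer.
    return call_llm(prompt)
```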

Anticipated answers with manipulation on top

When users land on a general-purpose chat interface, it’s difficult to know their specific needs or how to serve them effectively. In contrast, most software operates within a defined context—designed for particular users, tasks, and domains. As interface designers, we can harness this context to anticipate user needs and even address them proactively.

Consider a platform that provides access to a wealth of lengthy documents. Rather than waiting for users to request a summary, we can anticipate their desire for a quick overview by presenting a summary as soon as they arrive. This proactive approach provides users with relevant insights without requiring them to craft a prompt or even recognize that a summary would be useful. The LLM works seamlessly in the background, reducing the cognitive effort for users and improving their experience.

But we don’t have to stop here. Summaries can vary in length, tone, or focus depending on the user’s expertise, interests, or goals. By prompting the LLM to consider these factors, we can offer users tailored summaries and empower them to refine the output through dedicated input controls. This design not only enhances usability but also highlights the potential of LLMs to deliver personalized, context-aware solutions that meet users where they are.

Concept design that lets users specify their preferences for a personalized summary
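A hypothetical sketch of this pattern: a default summary is generated the moment a document is opened, and regenerated whenever the user adjusts the dedicated controls. The generate_summary() and render() callables are placeholders for the application’s LLM backend and UI layer.

```python
DEFAULT_PREFERENCES = {
    "length": "short",
    "familiarity": "novice",
    "interests": "general overview",
}

def on_document_opened(document_text: str, generate_summary, render) -> None:
    # Anticipate the user's need: show an overview right away,
    # without asking them to write a prompt.
    render(generate_summary(document_text, **DEFAULT_PREFERENCES))

def on_preferences_changed(document_text: str, preferences: dict,
                           generate_summary, render) -> None:
    # Dedicated controls (sliders, dropdowns) let users refine the summary
    # without ever phrasing a prompt themselves.
    render(generate_summary(document_text, **preferences))
```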

Templated prompts for consistent quality

To ensure the quality and consistency of LLM outputs, we can leverage templated prompts that provide structure and focus to the model’s responses. These templates are the result of prompt engineering—an iterative process of testing and refining prompt phrasing to achieve the most effective results.

A prompt template gives structure to what we send to the LLM and incorporates two types of input:

  • Source input: This includes relevant documents or retrieved snippets to ground the LLM in accurate, specific information.
  • User input: This captures the user’s preferences for how the output should be tailored. In our summarization example, these are the user-specified preferences for length, familiarity, and interests.

When combined in a prompt template, these inputs ensure each summary is grounded in relevant content and personalized to meet user needs, creating a more tailored and user-friendly experience.

Prompt template, combining Source input (blue) and User input (orange)
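In code, such a template might look like the following sketch. The placeholder names are illustrative, not taken from a real project; in practice the exact wording is refined through prompt engineering and evaluation.

```python
SUMMARY_TEMPLATE = """\
You are writing a summary of a document.

Source input (verified content to summarize):
{source_snippets}

User input (how to tailor the summary):
- Desired length: {length}
- Reader's familiarity with the topic: {familiarity}
- Topics of particular interest: {interests}

Write a summary that stays strictly within the source input above
and follows the user's preferences."""

def build_summary_prompt(source_snippets: str, length: str,
                         familiarity: str, interests: str) -> str:
    # Combining source input and user input in one consistent template
    # keeps the structure and quality of responses predictable.
    return SUMMARY_TEMPLATE.format(
        source_snippets=source_snippets,
        length=length,
        familiarity=familiarity,
        interests=interests,
    )
```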

Validation before publication

So far, we've explored real-time generation, where LLMs create content based on user input, which works well for personalized tasks like our summarization example. However, some tasks prioritize accuracy over personalization, such as extracting key entities for later presentation. This is where pre-validated generation can ensure correctness before publication.

The two approaches can be combined in a two-step process. First, source data is processed by the model to generate enhanced data, such as extracted entities or categorized content. The output is then validated—either manually or through automated checks—to ensure its accuracy. Once verified, the enhanced data is published and made accessible to users. In the second step, users can interact with this validated content, with real-time generation providing personalized outputs that adapt to their preferences.
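A minimal sketch of this two-step flow, assuming hypothetical helpers: extract_entities() wraps an LLM extraction prompt, review() stands in for a manual or automated validation step, publish() stores only approved results, and generate_summary() is a real-time LLM call.

```python
# Step 1: pre-validated generation, run before anything reaches users.
def prepare_enhanced_data(documents, extract_entities, review, publish) -> None:
    for document in documents:
        candidate = extract_entities(document)   # LLM-generated enhancement
        if review(candidate):                    # validate before publication
            publish(document, candidate)         # only verified data goes live

# Step 2: real-time generation builds personalized output
# (for example a tailored summary) on top of the validated data only.
def personalized_view(published_data, preferences, generate_summary) -> str:
    return generate_summary(published_data, **preferences)
```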

By distinguishing between these two approaches we can design systems that are both adaptive and dependable. This balance ensures that interfaces meet diverse user needs without compromising reliability.

Pre-validated generation, followed by real-time generation in a two-step process

Empowering users through responsible AI design

Navigating the evolving landscape of artificial intelligence, particularly in the realm of Large Language Models, requires a careful balance of innovation and responsibility. Our explorations have highlighted both the potential of LLMs to enhance user interactions and the inherent challenges that must be addressed to ensure accuracy and reliability.

We can adopt several strategies to improve the quality of AI-driven outputs: grounding responses in relevant data, utilizing templated prompts for consistency, designing specific graphical user interfaces that align with user needs, and implementing robust validation processes. These methods foster a more trustworthy experience for users and ensure that AI applications meet the complex demands of domains where accuracy is critical.

As we continue to refine our approaches, our focus remains on designing interactions that empower users with trustworthy insights while maintaining a responsible use of AI technology. By emphasizing transparency, accuracy, and user-centric design, we can harness the power of LLMs to create meaningful and impactful interactions that empower individuals to make informed decisions grounded in reliable information.

Acknowledgements

Thanks to Oscar Senar, Luke Noothout, Seowoo Nam, Eda Saridogan, Klava Fadeeva, Arimit Bhattacharya, Jan-Wouter Dekker, Luca Séllei, Maira Ribelles and Wouter van Dijk for helpful conversations and feedback on early versions of the article. Special thanks to Luke Noothout for capturing ideas in graphic form.
