Written by Iona Keeley and Luke Noothout
AI scoring beyond technical performance
What if AI models came with a label? A simple way to compare not just what they can do, but also their ethical and environmental impact? As LLMs become woven into our daily tools, it’s time to expand the conversation beyond performance alone and give people clear, transparent ways to make informed choices.

For most people, the launch of ChatGPT in November 2022 was the moment AI became real. What started as a simple website quickly transformed into a global phenomenon, sparking a wave of competition among labs like Anthropic, Google, and Mistral. In just a few years, AI has shifted from futuristic promise to a multi-billion-dollar industry, with new models and products released at breakneck speed. But while performance is easy to market, ethics and sustainability remain harder to measure and easier to ignore.

AI choice overload

Consumers can now choose from an ever-increasing number of AI chatbots, products and services. In an effort to win over consumers, companies promote their products as the newest, most intelligent and fastest AI products around.

But the increasing capability of AI technology is only one side of the story. There are many concerns about the environmental impact of these systems, which require huge data centers that use up vast amounts of resources. Furthermore, the data that is used to produce these models is increasingly being scrutinized, raising ethical questions about the fair use of authored work. The result is an information overload. Consumers have more choice than ever when it comes to AI tools, but struggle to make an informed decision. As a consumer, it has become practically impossible to choose a product on anything other than the novelty and claimed performance of a model.

Performance, sustainability and ethics

To bring the environmental and ethical ramifications of AI models to the forefront, we propose a universal rating system that evaluates models not only on technical performance but also on these factors. We see three major impacts such a universal rating could have on AI technology:

  • It helps increase AI literacy and encourages wider discourse that goes beyond the technological and performance-related merits of AI models
  • It empowers consumers to consciously choose to use AI technology that aligns with their values
  • It encourages producers of products that make use of AI technology to incorporate sustainability and ethical perspectives

There are already many benchmarks in the field of AI, yet few combine these three crucial aspects. In our research, we found plenty of benchmarks and leaderboards evaluating performance and effectiveness metrics, as well as some focused on responsible AI metrics or environmental impact. But these frameworks seemed to operate in isolation.

Additionally, while these databases are valuable in developing a rich understanding of the wider impact of AI systems, their reach is limited. The information usually lives in dedicated websites, reports and tables that make it possible to do a thorough comparison between AI models within a given framework, but chances are small that the average consumer will find and consult these databases before using AI products.

A universal rating system meets consumers where they can make a difference: when choosing and using products. Such a rating system needs to fulfill a number of requirements. First, the rating system should be identifiable as an independent score. Second, it should be easy for the user to tell whether a rating is good or bad. Finally, the system should balance simplicity with richness, providing consumers with enough insight to make informed decisions without overwhelming them.

Visualizing three perspectives on AI

Through an extensive design exploration, we’ve developed a first proposal of what a universal AI rating system could look like.

This rating consists of three parts. The performance pillar describes the capability of an AI model and looks at metrics such as accuracy and speed, compared to state-of-the-art benchmarks. The sustainability pillar scores the environmental impact of a model through metrics such as energy used and carbon emitted during both development and use. The ethics pillar scores the ethical impact of a model by analyzing biases and examining privacy and transparency standards.

The system uses a familiar A to F scale, with A representing the highest overall quality and F the lowest, following established conventions that consumers recognize from other rating systems.
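As a rough illustration of how such a label could be computed, the sketch below maps one numeric score per pillar to a letter grade. The 0 to 100 scale, the six letters (including E), the cut-off values, and the names `THRESHOLDS`, `to_grade` and `label` are all assumptions made for this example; the proposal itself does not define them.

```python
# Assumed cut-offs on an assumed 0-100 scale; the proposal does not define these.
THRESHOLDS = [(85, "A"), (70, "B"), (55, "C"), (40, "D"), (25, "E")]


def to_grade(score: float) -> str:
    """Return the letter grade for a pillar score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # Walk the cut-offs from best to worst and return the first one reached.
    for cutoff, letter in THRESHOLDS:
        if score >= cutoff:
            return letter
    return "F"


def label(performance: float, sustainability: float, ethics: float) -> dict[str, str]:
    """Grade each pillar separately, so the three ratings stay distinct."""
    return {
        "performance": to_grade(performance),
        "sustainability": to_grade(sustainability),
        "ethics": to_grade(ethics),
    }


# Illustrative numbers only, not based on real measurements.
example = label(performance=92, sustainability=48, ethics=63)
print(example)
```

Keeping one grade per pillar, rather than collapsing everything into a single number, matches the idea that a strong performance score should not hide a weak sustainability or ethics score.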

Each of the three pillars, Performance, Ethics, and Sustainability, is scored separately, with a clear grade and an intuitive green-to-red color scale for instant recognition. While the three scores are combined into a single triangular visual, the individual ratings remain distinct, making it easy to spot where an AI tool excels and where it falls short. This balance allows users to grasp the overall impact at a glance, while still seeing the nuances across different dimensions. The design mirrors familiar consumer labels (like energy and nutrition ratings), lowering the barrier to understanding while opening up a more holistic view of AI evaluation.

Grades A to F for each of the three pillars: Ethics, Performance, and Sustainability

Our rating system treats sustainability and ethics as equally important as performance. This means an AI model might achieve an "A" in performance while receiving lower grades in sustainability or ethics, making it immediately visible how these shortcomings affect the model's overall profile.

Models with the same performance grade can present dramatically different profiles based on their ethics and sustainability scores

While all three pillars are weighted equally in our system, we recognize that performance often drives initial decision-making. That's why performance is positioned at the top of our visualization, with ethics and sustainability as the foundational pillars below, emphasizing that strong performance should be built on ethical and sustainable practices.
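To make the effect of equal weighting concrete, here is a minimal sketch that compares two hypothetical models with the same performance score. The 0 to 100 scale, the plain average used as a summary, and the names `PILLARS`, `overall`, `model_x` and `model_y` are assumptions for illustration; the proposal does not prescribe a combined number.

```python
from statistics import mean

PILLARS = ("performance", "sustainability", "ethics")


def overall(scores: dict[str, float]) -> float:
    """Equal-weight average: each pillar counts for one third."""
    return mean(scores[pillar] for pillar in PILLARS)


# Two hypothetical models with identical performance scores (illustrative data).
model_x = {"performance": 90, "sustainability": 80, "ethics": 85}
model_y = {"performance": 90, "sustainability": 30, "ethics": 45}

for name, scores in (("model_x", model_x), ("model_y", model_y)):
    print(name, scores, "overall:", overall(scores))
```

Under these assumptions, the two models end up 30 points apart overall even though their performance is identical, which is the kind of difference the label is meant to surface.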

Informing across contexts

The power of a universal rating system is that it informs consumers in the moments where they can make a choice. This means the visual should live comfortably in a myriad of contexts. High-level tools such as browser plugins or local applications can help users keep track of integrated AI models they encounter in websites and software. Model picker dropdowns have already become commonplace in many AI-powered products and services. In addition, platforms such as marketplaces and comparison tools can tap into the rating system to provide an extra layer of information for consumers. All of these places could redirect to a single database providing more in-depth information and help consumers find alternatives.
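One way to picture the shared database is as a simple lookup that any plugin, model picker or marketplace could call. The sketch below assumes a hypothetical in-memory registry (`RATINGS`) and a hypothetical helper (`picker_label`); the model names and grades are invented for illustration and are not actual ratings.

```python
# Hypothetical registry standing in for a shared ratings database.
# Entries are invented for illustration and are not actual ratings.
RATINGS = {
    "model-a": {"performance": "A", "sustainability": "C", "ethics": "B"},
    "model-b": {"performance": "A", "sustainability": "E", "ethics": "D"},
}


def picker_label(model_id: str) -> str:
    """Build the text a model picker could show next to a model name."""
    rating = RATINGS.get(model_id)
    if rating is None:
        # Unrated models are shown explicitly rather than hidden.
        return f"{model_id} (not rated)"
    return (
        f"{model_id} | Performance {rating['performance']}"
        f" | Sustainability {rating['sustainability']}"
        f" | Ethics {rating['ethics']}"
    )


for model_id in ("model-a", "model-b", "model-c"):
    print(picker_label(model_id))
```

Because every context reads from the same source, a grade shown in a browser extension would match the one shown in a marketplace listing.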

Please note that all the following visuals are purely illustrative and do not represent actual ratings based on data.

A desktop app that lives in your toolbar, showing scores for the AI tools you’re actively using, like Figma or Slack.
A browser extension that displays performance, sustainability, and ethics scores for AI websites and web apps.
An integration within AI tools like Cursor, helping users compare and choose between different models.

Built into existing AI benchmarking platforms, such as the Epoch AI Benchmarking Hub or LMArena
A dedicated AI rankings site where people can explore AI tools and their rankings and dig deeper into the metrics

Conclusion

We acknowledge the technical challenges posed by a universal AI rating system, and we do not claim that our proposal is a perfect solution. Establishing comparable metrics will require substantial research, and today's high-scoring models could be outperformed and obsolete within months. Moreover, implementation would demand extensive collaboration between industry, academia, and policymakers.

Yet AI literacy has never been more urgent. As AI's influence accelerates, consumers need better tools to make informed choices about which models they use. And these choices should not be limited to performance. On the contrary: as the AI industry scales up, ethical and sustainable considerations are more important than ever. A universal rating system could spark the broader public discourse we need about AI's growing impact.

This vision requires partnership. We're eager to connect with experts and stakeholders who share our commitment to AI literacy and want to help build more nuanced conversations around AI technology.
