January 5, 2025

Beyond the Thumbs Up: Rethinking Metrics for the Internet of Agents

Let’s dive into a quick story.

Mike is a logistics manager for a big supply chain operation, overseeing shipments, inventory, and distribution schedules. One afternoon, an autonomous AI agent on his team flags an unexpected issue — a manufacturing delay at a supplier two levels down the chain.

Normally, Mike wouldn’t have caught the problem until shipments were already affected. But the AI agent doesn’t just dump data on his desk. Instead, it flags the risk like a sharp-eyed colleague, offering a recommendation.

“This supplier has a pattern of delays during peak season. Based on historical reliability and cost-effectiveness, I suggest switching to an alternate supplier with minimal disruption.”

Mike hesitates. Switching suppliers mid-cycle is a risk. He needs more confidence. Instead of making a knee-jerk decision, he turns back to the AI agent. “How accurate have your past disruption predictions been?”

The AI pulls up its track record. “Over the last 18 months, I’ve correctly identified disruptions 92% of the time.”

That helps, but Mike isn’t convinced yet. “What about supplier B? They’ve been reliable before.”

“Supplier B has a longer lead time and higher costs,” the AI replies. “However, if speed is your top priority, I can adjust my recommendation.”

Mike considers this. “What about supplier C? Any chance they can handle this?”

“Supplier C is an option, but they have a 15% failure rate on last-minute orders. If reducing risk is the goal, supplier A is still your best bet.”

Now, with a clear picture of his choices, Mike approves the switch to supplier A.

This is what a human-in-the-loop workflow looks like: leaders like Mike don’t just use AI as a tool; they collaborate with it. They bounce solutions back and forth, refining decisions the way they would with a trusted colleague.

But trust and collaboration don’t happen overnight. They’re built over time, shaped by every interaction.

While Mike’s story is made up, the challenge is real. As AI shifts from automation to autonomous collaboration, the way we interact with it has to evolve. In Designing Human-Machine Interactions in an Autonomous Agent World, I explored how AI isn’t just a tool but a collaborator, requiring new design principles to create meaningful human-agent interactions.

But designing better interactions is only half the equation; measuring their success is just as critical. At Outshift, we’re focused on exactly this: how do we build The Internet of Agents to work seamlessly with people and prove that agents are reliable partners? That’s why measuring trust and collaboration in AI isn’t just a technical challenge; it’s a fundamental shift in how we define success.

Would Mike rely on the agent next time? Would his colleagues? If supply chain teams were to track AI-assisted interventions, they wouldn’t just look at whether the AI flagged disruptions accurately; they’d also need to understand its impact on decision-making confidence, operational speed, and overall efficiency.

We’re used to tracking how often AI gets the right answer, but what about how it actually affects decision-making? Does it offer new insights that push people to think differently? Does it make experts more confident in their choices? These are the questions we should be asking.

Why Traditional AI Metrics Fall Short

Accuracy, latency, efficiency, hallucination rate: traditional AI metrics like these revolve around the model and its performance.

Measuring only model performance is like rating a chef just on how fast they chop onions. It’s one metric, but it hardly tells the whole story.

Did the AI actually improve the way people made decisions? Did it make teamwork smoother or lighten the mental load? AI isn’t just about getting the right answer — it’s about fitting into real workflows, making people more capable, and proving itself as a reliable partner over time.

So how do we measure what actually matters for The Internet of Agents?

Measuring What Matters: New Metrics for AI-Human Collaboration

If we really want AI to work as a teammate, not just a tool, we need a new way to measure its performance. At Outshift, we’re thinking a lot about how to measure agentic collaboration in the Internet of Agents. Here are some key areas where we believe new kinds of metrics are needed:

Decision-Making Metrics

  • Decision Latency: How long does it take from problem identification to resolution when AI is involved?
  • Trust Reinforcement: How often do people validate or refine AI suggestions, and does trust in AI improve over time?
  • Consensus Speed: How quickly can a team (including AI) align on a decision?
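As a concrete (and purely hypothetical) sketch, a metric like decision latency could be computed from timestamped workflow events. The event names and log format here are assumptions for illustration, not part of any existing system:

```python
from datetime import datetime

# Hypothetical event log for a single incident: (event_type, timestamp).
events = [
    ("problem_flagged", datetime(2025, 1, 5, 9, 0)),
    ("ai_recommendation", datetime(2025, 1, 5, 9, 1)),
    ("human_decision", datetime(2025, 1, 5, 9, 24)),
]

def decision_latency_minutes(events):
    """Minutes from problem identification to the final human decision."""
    times = dict(events)
    delta = times["human_decision"] - times["problem_flagged"]
    return delta.total_seconds() / 60

print(decision_latency_minutes(events))  # 24.0
```

Averaging this across incidents, with and without the agent involved, would give a baseline for whether AI participation actually speeds decisions up.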

Workflow & Efficiency Metrics

  • Task Hand-off Effectiveness: Does AI seamlessly transition tasks between humans and agents, or does it create bottlenecks?
  • Cognitive Load Reduction: Does AI meaningfully reduce the mental effort required to complete a task, or does it add more work?
  • Hybrid Intelligence Utilization: Is AI being used to elevate human expertise, rather than just automate tasks?

Collaboration & Interaction Metrics

  • Human Override Rate: How often do users reject or correct AI-generated actions?
  • Agent Coordination Efficiency: How well do multiple AI agents collaborate on a shared task?
  • AI Adoption Rate: How frequently do users accept AI suggestions without modification?
  • Self-Correction Rate: How often does AI revise its own output before human intervention is required?
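Several of these metrics reduce to simple ratios over interaction records. Here’s a minimal sketch, assuming each AI suggestion is logged with whether it was accepted, modified, or self-corrected (the record schema is an assumption for illustration):

```python
# Hypothetical interaction records: one dict per AI suggestion.
interactions = [
    {"accepted": True,  "modified": False, "self_corrected": False},
    {"accepted": True,  "modified": True,  "self_corrected": True},
    {"accepted": False, "modified": False, "self_corrected": False},
    {"accepted": True,  "modified": False, "self_corrected": False},
]

def rate(records, predicate):
    """Fraction of records matching a predicate."""
    return sum(predicate(r) for r in records) / len(records)

# Rejected or corrected suggestions count as overrides.
human_override_rate = rate(interactions, lambda r: not r["accepted"] or r["modified"])
# Accepted without modification counts as adoption.
ai_adoption_rate = rate(interactions, lambda r: r["accepted"] and not r["modified"])
self_correction_rate = rate(interactions, lambda r: r["self_corrected"])

print(human_override_rate, ai_adoption_rate, self_correction_rate)  # 0.5 0.5 0.25
```

The hard part isn’t the arithmetic; it’s instrumenting workflows so that acceptances, overrides, and corrections are captured consistently in the first place.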

Moving Beyond the Thumbs Up

Today’s AI feedback mechanisms are basic at best. A thumbs-up/thumbs-down rating system isn’t enough to understand whether AI is truly working for us. We need structured, contextual feedback loops that offer deeper insights:

  • Contextual Feedback: Tracking how often AI contributions actually improve human decision-making, not just how quickly they respond.
  • Collaboration Satisfaction Surveys: Gauging how well AI integrates into workflows from the user’s perspective.
  • Adaptive Feedback Loops: AI that learns from human revisions, pauses, and escalations to continuously improve its performance.
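To make the contrast with thumbs-up/thumbs-down concrete, here’s one way a structured, adaptive feedback loop could look. This is a sketch under assumptions: the event kinds, the confidence-threshold mechanism, and the adjustment sizes are all invented for illustration, not a real system:

```python
from dataclasses import dataclass

@dataclass
class FeedbackEvent:
    # Richer than a thumbs up/down: what happened, and in what context.
    kind: str        # "accepted" | "revised" | "escalated" | "paused"
    task: str        # the workflow context the suggestion appeared in
    notes: str = ""  # optional free-text from the user

class AdaptiveAgent:
    """Toy agent that demands more confidence when humans keep revising it."""

    def __init__(self, confidence_threshold=0.7):
        self.confidence_threshold = confidence_threshold

    def record(self, event: FeedbackEvent):
        if event.kind in ("revised", "escalated"):
            # Humans are correcting us: require more confidence to act alone.
            self.confidence_threshold = min(0.95, self.confidence_threshold + 0.05)
        elif event.kind == "accepted":
            # Trust is building: relax the threshold slightly.
            self.confidence_threshold = max(0.5, self.confidence_threshold - 0.01)

agent = AdaptiveAgent()
agent.record(FeedbackEvent("revised", task="supplier-switch"))
agent.record(FeedbackEvent("accepted", task="supplier-switch"))
print(round(agent.confidence_threshold, 2))  # 0.74
```

The point isn’t this particular update rule; it’s that revisions, pauses, and escalations carry signal that a binary rating throws away.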

Designing the Right Framework

It’s not just about picking the right metrics; it’s about designing AI systems that can be observed, evaluated, and improved based on those metrics. At Outshift, we’re considering ways to do this, including:

  • Building an AI-Human Performance Dashboard that visualizes trust levels, task handoff success, and collaboration trends.
  • Identifying Friction Points in workflows to refine AI behaviors and reduce inefficiencies.
  • Developing Real-Time Observability Tools to capture AI-human interaction data in a structured way.
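The foundation for all three is capturing interaction data in a structured, machine-readable form. A minimal sketch of what such an event emitter might look like, with field names that are assumptions for illustration:

```python
import json
import time

def emit(event_type, **fields):
    """Append one structured AI-human interaction event as a JSON line.

    In practice this would feed a log pipeline or dashboard backend;
    here we just print the line for the sketch.
    """
    record = {"ts": time.time(), "type": event_type, **fields}
    print(json.dumps(record))
    return record

evt = emit(
    "task_handoff",
    source="agent:forecaster",
    target="human:mike",
    task="supplier-switch",
    outcome="accepted",
)
```

Once events like this exist, override rates, handoff success, and trust trends become queries over a log rather than guesses.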

The Future of AI Measurement

Right now, AI is mostly graded on how fast and accurate it is — like a student acing multiple-choice tests but never working in a group project. To really achieve The Internet of Agents, we need to shift the focus to how well AI actually collaborates with people.

If we measure that, we’re measuring the future of work itself.