Measuring the Value of AI
We've deployed it everywhere. Now what?
The honeymoon is over. After months of aggressive rollouts, enterprises across the globe have given their employees access to AI coding assistants, copilots, and chat interfaces. The credit card bills are arriving. And now the inevitable question lands on someone's desk: What are we getting for this?
Organizations are scrambling to measure the return on their AI investment. They're building dashboards tracking token consumption, counting AI-assisted pull requests, and monitoring how many lines of code were generated versus written by hand. Some are going further — tracking tokens that led to PRs that led to production deployments, as if the pipeline from keystrokes to customer value were a simple conveyor belt.
It isn't. And the uncomfortable truth is that this measurement problem isn't new at all.
This isn't a new problem
Long before AI assistants existed, businesses struggled to quantify the value of knowledge work. Agile coaches spent decades watching organizations try — and fail — to measure the output of Scrum teams. Lines of code, velocity points, features shipped: none of these correlated reliably with business outcomes.
The problem is structural. As Barber (2008) demonstrated in research on value chain measurement, the further you move from tangible, physical processes toward intangible value creation, the harder measurement becomes. Software development, creative work, and strategic thinking exist almost entirely in that intangible space. Baruch Lev's seminal work Intangibles: Management, Measurement, and Reporting (2000) — cited over 6,000 times — established that organizations systematically fail to account for intangible value creation, precisely because it resists the neat quantification that accounting systems demand.
AI hasn't created a new measurement problem. It has amplified an existing one and made it impossible to ignore.
The Traps We Keep Falling Into
Goodhart's Law: When the Metric Becomes the Target
"When a measure becomes a target, it ceases to be a good measure." Charles Goodhart formulated this principle in 1975 about monetary policy, but Fritz (2016) demonstrated its devastating applicability to software development. The moment lines of code became a productivity metric, developers produced more lines of code — without producing more value.
We're walking into the same trap with AI. The moment "tokens consumed per PR" becomes a metric, teams will optimize for it. They'll either starve their AI of context to appear efficient, or feed it bloated prompts to appear productive, depending on which direction management rewards. Neither behavior correlates with value produced.
The McNamara Fallacy: Counting What's Countable
During the Vietnam War, Secretary of Defense Robert McNamara insisted on measuring success through body counts and territory captured — metrics that were quantifiable but ultimately meaningless for determining who was winning. The McNamara Fallacy, as van Nieuwenhuizen describes it, is "prioritizing quantitative metrics while ignoring equally vital qualitative factors that resist easy measurement."
Token usage is today's proxy. It tells you something is happening. It tells you resources are being consumed. It tells you absolutely nothing about whether the right things are being built, whether customers are happier, or whether your product is improving.
The fallacy is seductive because the alternative — measuring qualitative outcomes — is genuinely hard. So organizations default to measuring what their billing systems already capture.
Self-Serving Attribution Bias: We Believe Our Own Hype
There's another problem lurking beneath the measurement challenge. Research by Shepperd, Malone, and Sweeny (2008) on self-serving attribution bias shows that people systematically over-attribute positive outcomes to their own efforts and under-attribute them to external factors. Libby and Rennekamp (2012) demonstrated this in forecasting contexts: after a success, people attribute it internally, leading to overconfident predictions about future performance.
When a developer says "AI made me 10x more productive," they genuinely believe it. When a manager says "Our AI investment led to a 30% increase in shipping velocity," they genuinely believe it too. But these are uncontrolled observations colored by confirmation bias and the sunk-cost pressure of having championed the investment in the first place.
Even when organizations attempt to estimate value beforehand, those estimates are rarely tracked after the fact. The hypothesis that a feature would produce X value is tested at go/no-go time, then forgotten. Nobody goes back six months later to verify whether the value actually materialized. Estimates don't correlate with reality — they correlate with what was politically convenient to claim at funding time.
The Shifted Filter: From Pre-Build to Post-Build
Here's what AI has fundamentally changed about value creation in organizations: the capacity constraint used to be a natural quality filter.
In the pre-AI world, building anything required convincing a team or a product owner to spend their limited time on it. Plenty of good ideas died in this filter, but the most important ones usually survived. Scarcity imposed discipline.
Now that filter is evaporating. With AI-assisted development, the cost of building something has dropped dramatically. When capacity is infinite (or feels that way), everything gets built. Business stakeholders who used to have to fight for engineering time can now point agents at problems themselves.
The result? The filter moves from pre-build to post-build. Instead of asking "Should we build this?" teams must now ask "Now that we built this, should we ship it?" And crucially: most organizations have never developed the muscle for this.
A/B testing, eye tracking, funnel optimization — these post-build validation techniques exist in marketing departments and growth teams. But they're almost entirely absent from internal business software development. And you can tell just by using most enterprise software.
Imagine a product like SAP with infinite capacity to add even more schemas, tables, and configuration options. The complexity crisis of enterprise software isn't a resource problem — it's a judgment problem. AI removes the resource constraint without addressing the judgment gap.
The Polishing Problem
There's a related dysfunction: AI makes it trivially easy to generate an MVP-quality implementation of anything, but finishing things — truly finishing them — still requires human judgment, taste, and iteration.
I recently rewrote a piece of software I'd been maintaining for ten years. The AI-assisted rewrite took roughly a week. Making myself proud of the result took three more months of tweaks and refinements. The initial rewrite was the easy part. The value was in the polish.
Organizations optimizing for velocity, whether they count story points, things shipped, tokens consumed, or features delivered, are implicitly de-prioritizing this polish. Every half-baked, AI-generated feature that ships without refinement represents a small quality-of-life tax on every user who has to interact with it.
The Non-Technical Blind Spot
If measuring AI value for developers is hard, measuring it for non-technical departments can be even harder. At least with engineers, we have crude proxies like commits and deployments. With HR, finance, and operations teams paying per seat for AI assistants, organizations typically have no visibility into usage or impact, and no way to directly correlate AI usage with measurable productivity improvements.
Yet sometimes the value story for non-technical work is actually simpler. AI can enable teams to quickly automate tasks they only occasionally do by hand, and the payoff multiplies when "occasionally" suddenly turns into "frequently." I recently supported a team that applies content updates to a website once in a while and was suddenly faced with a request to update 75 pages at once, all from Word templates. Automating that away saved two weeks of manual labor, and it also removes the pain of smaller batches in the future.
It's the ambient, non-targeted AI features that defy measurement. The auto-complete suggestions, the email drafts, the meeting summaries. Each one might save 30 seconds. Multiply by thousands of employees and the theoretical value is enormous. But theoretical value isn't the same as realized value, and nobody is checking whether those 30-second savings actually compound into better outcomes rather than simply expanding the time available for other low-value work.
Plus, sometimes the devil is in the details that the summary of a two-hour Teams call leaves out: the quirked eyebrow of a participant as something is said, or the person dropping from the call right after a comment was made.
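To make the gap between theoretical and realized value from those 30-second savings concrete, here is a minimal back-of-the-envelope sketch. Only the 30 seconds comes from the text above; the number of uses per day, the headcount, the working days, and the "realized fraction" are hypothetical inputs chosen purely for illustration.

```python
def theoretical_hours_saved(seconds_per_use, uses_per_day, employees, working_days):
    """Upper bound: assumes every saved second is real and adds up linearly."""
    return seconds_per_use * uses_per_day * employees * working_days / 3600

# 30 seconds per use is the figure from the text; the rest is assumed.
theoretical = theoretical_hours_saved(
    seconds_per_use=30, uses_per_day=10, employees=5_000, working_days=220
)
print(f"theoretical: {theoretical:,.0f} hours per year")

# The unknown that nobody measures: which share of that time ends up in
# work that matters. These fractions are guesses, not data.
for realized_fraction in (0.05, 0.25, 0.75):
    realized = theoretical * realized_fraction
    print(f"if {realized_fraction:.0%} is realized: {realized:,.0f} hours per year")
```

The headline number is almost entirely a function of the one input that is not measured, which is why it makes a poor basis for an ROI claim.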
How to Measure Better
If pure output metrics are misleading and self-reported value is biased, what alternatives exist? Here are approaches grounded in measurement science:
1. Measure Outcomes, Not Outputs
Douglas Hubbard's framework from How to Measure Anything (2014) argues that most "immeasurable" things can be measured — but only if you define what you actually mean by value. Stop counting tokens, PRs, and features shipped. Instead, identify the business outcomes you care about (customer retention, time-to-resolution, revenue per feature, support ticket volume) and measure those over time. The AI investment is a confounding variable, not the thing you're measuring.
2. Use Controlled Comparisons
Rather than measuring before/after (which conflates AI impact with everything else that changed), run controlled experiments. Give some teams AI access for specific tasks and others not. Rotate. Compare outcomes, not effort. This is operationally inconvenient, which is exactly why almost nobody does it — and why everyone's "measurement" is actually just observational storytelling.
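As a rough sketch of what "compare outcomes, not effort" could look like in practice, the snippet below runs a simple permutation test on one outcome metric for teams with and without AI access. The data is made up, the metric (days to resolve a ticket) is an assumption, and a real experiment would also need the rotation described above and enough teams to mean anything.

```python
import random
from statistics import mean

def permutation_test(treated, control, n_iter=10_000, seed=0):
    """Return (observed difference in means, two-sided p-value).

    The p-value is the share of random relabelings whose difference in
    means is at least as extreme as the one actually observed.
    """
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_iter

# Hypothetical outcome data: average days from ticket opened to resolved, per team.
ai_teams = [4.1, 3.8, 5.0, 4.4, 3.9, 4.7]
control_teams = [4.3, 4.9, 5.2, 4.0, 4.8, 5.1]

diff, p_value = permutation_test(ai_teams, control_teams)
print(f"difference in mean days to resolve: {diff:+.2f} (p = {p_value:.3f})")
```

A large p-value here would mean the observed difference is the kind of gap that random team assignment produces on its own, which is precisely the information a before/after dashboard cannot give you.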
3. Track Value Hypothesis Closure Rates
For every feature or project that gets built, require a falsifiable value hypothesis before build starts. Then measure whether that hypothesis proved true after deployment. Track the ratio of hypotheses confirmed versus falsified. If AI is helping you ship more things, but the hypothesis hit rate is falling, you're producing more waste faster. That's negative ROI regardless of what the token dashboard says.
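One way to keep score is a plain list of hypotheses that gets revisited after deployment. The sketch below is illustrative only: the `ValueHypothesis` record, its field names, and the example entries are all hypothetical, and it deliberately separates "how many did we go back and check" from "how many of those held up."

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ValueHypothesis:
    feature: str
    claim: str                        # falsifiable statement written before build
    period: str                       # when it shipped, e.g. "2024-Q1"
    confirmed: Optional[bool] = None  # None means nobody went back to check

def closure_stats(hypotheses):
    """Per period: shipped count, share reviewed, and hit rate among the reviewed."""
    stats = {}
    for h in hypotheses:
        s = stats.setdefault(h.period, {"shipped": 0, "reviewed": 0, "confirmed": 0})
        s["shipped"] += 1
        if h.confirmed is not None:
            s["reviewed"] += 1
            s["confirmed"] += int(h.confirmed)
    for s in stats.values():
        s["closure_rate"] = s["reviewed"] / s["shipped"]
        s["hit_rate"] = s["confirmed"] / s["reviewed"] if s["reviewed"] else None
    return stats

# Made-up log: more features ship in Q2, but fewer of the claims hold up.
log = [
    ValueHypothesis("bulk-export", "cuts export tickets by 20%", "2024-Q1", True),
    ValueHypothesis("dark-mode", "raises weekly active use by 5%", "2024-Q1", False),
    ValueHypothesis("ai-summary", "halves time spent on reports", "2024-Q2", True),
    ValueHypothesis("smart-tags", "doubles search success rate", "2024-Q2", False),
    ValueHypothesis("auto-reply", "cuts response time by 30%", "2024-Q2", False),
    ValueHypothesis("new-wizard", "lifts onboarding completion", "2024-Q2", None),
]

for period, s in sorted(closure_stats(log).items()):
    print(period, s)
```

In this made-up log, shipping doubles from Q1 to Q2 while the hit rate drops from one in two to one in three, which is the "more waste, faster" pattern described above.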
4. Measure What Didn't Get Built
One of the most important potential values of AI is compression — not building more things, but building the same things with fewer resources, freeing capacity for quality, exploration, or rest. If your team is shipping the same amount as before but with less overtime, less burnout, and fewer shortcuts, that's value. It just doesn't show up in velocity charts.
5. Adopt the Post-Build Filter
If AI is shifting your constraint from capacity to judgment, invest in judgment infrastructure. Feature flags, gradual rollouts, user research, usage analytics. The organizations that will get the most value from AI aren't the ones building the most — they're the ones with the tightest feedback loop between shipping and learning whether they should have shipped.
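For readers who have not built this kind of judgment infrastructure before, a gradual rollout can be as small as the sketch below: a deterministic hash puts each user into a stable bucket, and a feature is shown only to buckets under the current percentage. The function name, the bucket count, and the feature and user identifiers are assumptions for illustration; dedicated feature-flag tools do considerably more.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Deterministically decide whether a user sees a feature.

    Each (feature, user) pair hashes to one of 10,000 buckets, so the same
    user always gets the same answer, and raising `percent` only ever adds
    users to the exposed group.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100

# Hypothetical usage: expose a new feature to about a quarter of users.
users = [f"user-{i}" for i in range(10_000)]
exposed = [u for u in users if in_rollout(u, "new-report", 25)]
print(f"{len(exposed)} of {len(users)} users see the feature")
```

The flag itself is the easy half. The value comes from comparing outcomes between the exposed and unexposed groups before raising the percentage, and from being willing to set it back to zero.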
6. Accept Irreducible Uncertainty
Some value is genuinely unmeasurable in the moment. How much value was produced by allowing Microsoft Word to italicize text? Probably immense — if it never had basic formatting, we'd all be using something else. But no ROI calculation could have predicted that in 1985. Some infrastructure investments only reveal their value in retrospect, through the things they enabled that nobody foresaw. Not everything needs a dashboard. Some things need trust and taste.
The Honest Answer
Companies want a clean number. "Our AI investment produced €X in value." That number doesn't exist and it won't exist, for the same reason that organizations never successfully quantified the value of email, spreadsheets, or the internet itself. These are enabling technologies whose value is diffused across every process they touch, modified by how skillfully they're used, and contingent on decisions that haven't been made yet.
What you can measure is cost. What you can observe is whether your customers are happier, your employees are less burned out, your products are more refined, and your judgment about what to build is getting sharper.
The hardest pill to swallow: maybe that feature your AI-powered team just shipped in record time should never have been built in the first place. And no token dashboard in the world will tell you that.
References:
- Barber, E. (2008). "How to measure the 'value' in value chains." International Journal of Physical Distribution & Logistics Management, 38(9), 685.
- Fritz, T. (2016). "Measuring individual productivity." Perspectives on Data Science for Software Engineering. Elsevier.
- Hubbard, D.W. (2014). How to Measure Anything: Finding the Value of Intangibles in Business. Wiley.
- Lev, B. (2000). Intangibles: Management, Measurement, and Reporting. Brookings Institution Press.
- Libby, R. & Rennekamp, K. (2012). "Self-serving attribution bias, overconfidence, and the issuance of management forecasts." Journal of Accounting Research.
- Shepperd, J., Malone, W., & Sweeny, K. (2008). "Exploring causes of the self-serving bias." Social and Personality Psychology Compass, 2(2), 895-908.
- van Nieuwenhuizen, P. "The McNamara Fallacy: Relearning Old Lessons."