Performance metrics are among managers’ most powerful tools: Setting the right goals and tracking progress accurately can help you take your business where you want it to go. To be effective, though, goals and metrics need to be clear and simple, and the fewer the better.

In our growing company, we’ve learned that simplicity increases our odds of achieving what we want. When our goals were too numerous and too complex, employees’ decisions didn’t sync up within or across teams, which meant groups and individuals were tugging in various directions and failing to produce desired outcomes at scale. So we set out to identify a single key performance indicator that would unify behavior within one major customer-facing group — the unit that designs, creates, and manages our online storefront — and could also serve as a shared currency across teams, enabling us to make smarter investments in the business. But we realized that going all in on a KPI without putting some sort of check on it could have serious unintended consequences. We knew that any primary goal had to be bounded by a constraint, as in, “Maximize X without reducing Y.”

This is the story of how Agoda, the Asia-based subsidiary of the Booking Holdings online travel group, made its way to a single “KPI + constraint” approach that helps us run a good bit of our business.1 It took some trial and error to get there; the KPIs we developed and tested along the way promoted both positive and counterproductive behaviors and outcomes. Gradually, however, we found our guiding principles, implemented them, improved our business outcomes, and fostered a culture of learning and cooperation in the process.

Because we believe that our experiences and insights may be helpful to other companies, both within e-commerce and beyond, we’re sharing them here.

We Tested Our Way Toward a Metric

Since the company’s earliest days, Agoda has focused on continuously improving its front end, or storefront — the website and mobile apps through which customers search for, select, and purchase travel products — to convert website visits into sales. By increasing its conversion rate, a digital business can target its marketing more effectively, because existing customers are easier to engage than new ones. Revenues from the converted customers can be deployed to make the marketing increasingly efficient, which pays off in further conversions.

In light of these benefits, conversion rate may seem like the perfect primary KPI for our business, with return on investment as an obvious constraint. (After all, why chase down conversions that are exceedingly hard to get and not likely to pay off later?) To calculate conversion rate, you typically divide the number of sales by the traffic numbers (visits). Unfortunately, measuring traffic isn’t as straightforward as it sounds; it’s complicated by numerous factors, including unwieldy bots, aggressive marketing, and the many indirect ways customers can come to the site.

Still, we knew we wanted to reap the benefits of increased conversions — and do it faster than the competition. But how would we go about it and track our progress accurately, given how imprecise those traffic numbers could be? To sort that out, we started conducting small experiments, one after another, to test platform changes that we had reason to believe would increase conversions. We used A/B testing, which involved changing a single element (such as a color, a button, an image, or a message — like “Rooms are limited!” or “Good choice!”) for a subset of our users and comparing the results with a control group’s. “Winning” experiments — those that generated more bookings than the control — were incorporated into our coding and rolled out to customers more broadly.
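
As a minimal sketch of how one such comparison might be scored, the snippet below applies a standard two-proportion z-test to booking conversion in the control and the variant. The choice of test, the function name `compare_variants`, its parameters, and the visitor and booking counts are all illustrative assumptions; the article does not say which statistical test was used.

```python
import math


def compare_variants(control_visitors, control_bookings,
                     variant_visitors, variant_bookings):
    """Compare booking conversion in a control (A) and a variant (B).

    Hypothetical helper: uses a pooled two-proportion z-test, which is
    one common choice and not necessarily what any real system runs.
    """
    rate_control = control_bookings / control_visitors
    rate_variant = variant_bookings / variant_visitors

    # Pooled conversion rate under the assumption of no real difference.
    pooled = (control_bookings + variant_bookings) / (
        control_visitors + variant_visitors)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / control_visitors + 1 / variant_visitors))

    # Guard against a zero standard error (no bookings, or all bookings).
    z = (rate_variant - rate_control) / std_err if std_err > 0 else 0.0

    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "rate_control": rate_control,
        "rate_variant": rate_variant,
        "z": z,
        "p_value": p_value,
    }


if __name__ == "__main__":
    # Made-up counts: same traffic in each arm, slightly more bookings in B.
    result = compare_variants(10_000, 500, 10_000, 560)
    print(result)
```

A variant would count as a candidate "winner" only if it shows more bookings and the p-value falls below whatever significance threshold the team has agreed on in advance.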

Finding those winners was difficult. Often, our hypotheses about what would improve sales turned out to be wrong. Around 80% to 90% of our experiments were good ideas by smart people but did not improve our business; many actually made the results worse. For example, if we told customers, “Book now or this room will be gone!” we might motivate them to purchase the room — or we might annoy them and cause them to abandon the booking process. In addition, making a software change could introduce a bug we weren’t aware of. But those types of failures were part of the process.

Experimentation helped us determine what was going to work and what wasn’t. Initially, we ran experiments as one-offs, but to learn from them more effectively, we built a centralized system that allowed us to log and analyze test results in detail and then make and measure changes to the website. This experiment “engine” gave us a consistent, scalable way to make decisions and assess their impact on conversion, and it gave us an early primary metric to work with: speed of experimentation.

We Discovered Velocity’s Pros and Cons

Given how important experiments were to our front-end decisions, we realized quite early that we needed to increase the speed of our experiment engine. So we set velocity as a primary KPI to focus our conversion efforts, defining it as the number of experiments we ran every quarter. We knew there was no way we could grow at a meaningful pace if we ran too few of them, no matter how positive the results for each one.

By dicing our experiments more finely, testing one element at a time so that we could isolate and fix problems faster, and bringing in managers who were skilled in engineering and process management to streamline things, we went from a few dozen front-end experiments per quarter to well over 1,000 just a few quarters later. While this approach helped us identify many ways to increase our conversion rate, it also caused our site’s programming code to deteriorate because of how frequently changes were made. In short, the number of bugs exploded. As a result, we introduced a constraint on the KPI: code quality, as measured by the number and severity of bugs.

The push for velocity, with the quality constraint, changed how we operate. It led us to take a fresh look at our deployment architecture to see how we could implement changes faster, but not so fast that we caused damage. We created software tools to unify, automate, and speed up code deployment and monitoring to hundreds of thousands of servers at our data centers around the world.

Many companies deploy new code only weekly or even less often; we do it four or five times per day. To support that pace, we rethought our systems, staff, and organizational structure. As the number of bugs increased, we had to build better quality assurance techniques. We invested heavily in our network operations center, which tracks our platform’s performance to alert us when any significant deviations from normal website behavior occur.

There are always deviations, which may be caused by internal changes or by external factors; only a strong command of statistics can help you determine which deviations are relatively normal and which are indications of real problems. So we developed systems to find statistically significant changes, even in minor patterns. We also built monitoring tools to map code changes to performance patterns so that we could analyze root causes and resolve problems faster. And we invested in data systems to track traffic in every market, identify significant anomalies, and alert the network operations center, helping us see things we would otherwise have missed and respond to them quickly.
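
To make the idea of separating normal deviations from real problems concrete, here is a small sketch that flags recent measurements lying far from a historical baseline, using a simple z-score rule. The function name `find_anomalies`, the threshold of three standard deviations, and the sample numbers are assumptions chosen for illustration; production monitoring would typically also account for seasonality and trends.

```python
from statistics import mean, stdev


def find_anomalies(history, recent, threshold=3.0):
    """Return (index, value, z_score) for recent points far from baseline.

    `history` is a list of past measurements treated as "normal";
    `recent` is the list of new measurements to check.
    """
    baseline = mean(history)
    spread = stdev(history)  # sample standard deviation
    anomalies = []
    for index, value in enumerate(recent):
        # With zero spread no z-score can be computed; treat as not anomalous.
        z = (value - baseline) / spread if spread > 0 else 0.0
        if abs(z) >= threshold:
            anomalies.append((index, value, round(z, 2)))
    return anomalies


if __name__ == "__main__":
    # Hypothetical hourly bookings for one market.
    history = [100, 102, 98, 101, 99, 100, 103, 97]
    recent = [101, 99, 88, 100]
    print(find_anomalies(history, recent))  # [(2, 88, -6.0)]
```

Mapping a flagged point back to the code changes deployed shortly before it is what would turn an alert like this into a root-cause lead.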

We Revised Our Primary KPI

Focusing on increasing velocity served a critical purpose: It helped us run many more experiments that boosted bookings to some degree. But it sometimes created the wrong incentives. Teams could score high on velocity (and be rewarded for it) by running many small experiments, regardless of whether these experiments correlated with meaningful increases to the platform’s conversion rate.

To determine which changes provided the biggest opportunities for conversion, we built stronger data analysis tools.2 Gradually, it became clear that to foster those larger opportunities, we needed to change our primary KPI.

So we moved to a metric we call incremental bookings per day, or IBPDs. We kept the same constraint: minimizing the quantity and severity of bugs. Our new KPI was rather simple: If an A/B test had n1 bookings in variant A (the control group, with no change from earlier practices) and n2 bookings in variant B (where the change was made), the IBPD impact could be calculated as follows:

IBPD = (n2 - n1) / number of days the experiment ran

Both A and B versions were likely to produce bookings; we wanted to select the more productive variant.
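
A minimal sketch of this calculation follows. It assumes that IBPD is the difference in bookings between the variant and the control divided by the number of days the test ran, and that both variants received the same share of traffic. The function name `ibpd` and the figures are hypothetical.

```python
def ibpd(control_bookings, variant_bookings, days):
    """Incremental bookings per day for one A/B test.

    control_bookings: n1, bookings in variant A (no change)
    variant_bookings: n2, bookings in variant B (with the change)
    days: length of the test in days (assumed equal for both variants)
    """
    if days <= 0:
        raise ValueError("days must be positive")
    return (variant_bookings - control_bookings) / days


if __name__ == "__main__":
    # Hypothetical two-week test: 7,000 bookings in A, 7,140 in B.
    print(ibpd(7_000, 7_140, 14))  # 10.0 extra bookings per day
    # A negative value means the change lost bookings relative to control.
    print(ibpd(7_000, 6_930, 14))  # -5.0
```

Because the result is expressed in bookings, values from different teams can be summed or compared directly, which is what lets one number serve as a shared currency.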

This KPI was designed to reward teams according to the value their experiments generated. We set up agile scrum teams and gave them a target for IBPDs. We expected this approach would push teams to run bigger and more ambitious experiments rather than simply more experiments. We were seeking the experiments that would generate the largest number of conversions. To counteract fear of failure, which can lead people to make safe bets and create less value, we actively encouraged teams to try big things even if they were risky and adjusted targets for those that did, giving them room to fail without penalty.

And, indeed, the IBPD approach generated benefits. As expected, teams shifted their focus from quantity to quality. They also started developing and sharing best practices. These insights were easier to identify once our experiments were more precise, and teams were eager to swap success stories, since everyone was looking for good ideas to build on. For example, they learned that it was critical to provide customers with the right amount of choice: With too few or too many options, people would not select a room. After gleaning this principle from our centralized system for logging and analyzing results, teams began to apply it to decisions about numerous features of the website — figuring out how many hotels to show in response to a search, the right number of photos to display in the gallery, and so on.

The new KPI also had important management implications. For starters, it allowed us to compare conversion rates generated by different teams so that we could determine where to invest more and where to pull back or change tactics. While this approach made teams more competitive, it also made them more cooperative: They discovered that if they shared what they learned, other teams would share insights with them, which helped everyone meet their targets.

Moving to IBPDs for the storefront also helped us see how the KPI could aid decision-making throughout the company. Take marketing, for instance. There, the primary metric was initially the number of site visitors that campaigns brought in during a specific period, and ROI (our visitor yield for our marketing costs) was the constraint. That way of measuring success had its limits, because a boost in traffic didn’t necessarily increase conversions. As the marketing team began running experiments against IBPDs instead, conversion results improved. If a promotional campaign for a specific market at a specific time brought in visitors, A/B testing different approaches within that same market and time period showed which types of campaigns actually produced more bookings. While external factors could help or hinder a particular campaign, overall the A/B approach helped us determine more precisely which campaigns worked. We could then test them in other markets as well.

Across groups, the IBPD metric allowed management to make smarter trade-offs (for example, engineering head count versus marketing spend) and thus optimize company investments. We could determine not only how much conversion we had added in each quarter and how that compared with the previous quarter but also which factors had contributed to that success.

Sharing a KPI across multiple groups allowed us to aggregate results on our company dashboard as well. Having that view further fueled decision-making about internal investments. In addition to using IBPDs to compare teams’ contributions and allocate resources, we could assess product managers’ performance. When they fell short of their goals, our first response was to move them to a more productive area to see whether their performance improved. Effectively, we were A/B testing our product managers.

We Continue to Fight Bugs and Bias

While moving to incremental bookings per day as our primary KPI was a significant step forward, we constantly debated whether we were headed in the right direction: What if, as with velocity, the new metric caused some behaviors we didn’t want? Should we be measuring something else? What were we missing?

At any rate, we knew that IBPDs did not solve every problem once and for all. For example, the bug issue had never completely gone away. We were good at fixing large bugs that were pretty visible. But addressing small bugs wasn’t attractive to teams, because they were harder to find and fixing them didn’t produce a significant increase in bookings. With enough small bugs, however, the site would suffer death by a thousand cuts, since the aggregate impact was significant — almost every user would encounter a bug. Eventually we realized that we had to establish a clear threshold even for small bugs; we would tolerate only a certain number at any one time. If the number increased beyond the limit, teams would have to fix the bugs to earn their full KPI bonuses, even if the IBPD impact of individual bugs was small.

More contentious — and more difficult — was the debate about how to use statistics to promote behavior that was truly beneficial for our business. In a well-optimized platform, the vast majority of winning experiments produce IBPDs of less than 1%. (The more we optimized the program, the tougher it became to identify changes that would improve it further.) And it was hard to detect whether these small effects were in fact real wins or just statistical noise. We needed to address this issue to ensure that product managers were making the right decisions.

Once again, incentives proved problematic: We wanted teams to validate that apparent wins were directly related to real business value, but we found that when bonuses were tied to IBPDs, teams were biased to treat any experiment with positive results as a win, regardless of whether the impact was significant or simply noise. This was a natural behavioral response — and it did not produce results that benefited the company as a whole.

So we further refined our KPI, calling it unbiased IBPDs, or UBIs. This is how it works: Every time we flag an experiment as a win, before we roll out the change, we run the same experiment again for a certain period (usually a week). The results of that subsequent evaluation run are then factored into the team’s performance. If the first run appeared positive only because of statistical noise, the evaluation run is as likely to come out negative as positive, so across a large enough number of experiments, the noise averages out. When this happens, the UBIs net out to roughly zero, and the team makes no progress toward achieving its KPI goals (or its bonus). Recognizing the risk that their experiments could produce zero or even negative UBIs, teams now have an incentive to treat only those experiments that produce truly positive results as wins. Rather than just moving ahead with all experiments that appear positive, teams now look much more closely at their experiments to determine whether the changes they’re making to the site actually alter customer behavior.
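
The effect of the evaluation run can be illustrated with a small simulation. The sketch below assumes experiments whose measured IBPD is the true effect plus normally distributed noise, flags any positive first run as a win, and then reruns the flagged experiments. The function name `simulate_ubi`, the noise model, and every parameter value are assumptions made for illustration.

```python
import random
from statistics import mean


def simulate_ubi(n_experiments, true_effect=0.0, noise_sd=10.0, seed=0):
    """Compare first-run and evaluation-run averages for flagged wins.

    Each experiment's measured result is drawn from a normal distribution
    centered on `true_effect` (hypothetical noise model). An experiment is
    flagged as a win when its first run is positive; only flagged
    experiments get an independent evaluation run.
    """
    rng = random.Random(seed)
    first_runs, evaluation_runs = [], []
    for _ in range(n_experiments):
        first = rng.gauss(true_effect, noise_sd)
        if first > 0:  # flagged as a win
            first_runs.append(first)
            evaluation_runs.append(rng.gauss(true_effect, noise_sd))
    return mean(first_runs), mean(evaluation_runs)


if __name__ == "__main__":
    # No experiment has any real effect, yet about half look like wins.
    biased, unbiased = simulate_ubi(20_000)
    print(f"average first-run result of flagged wins: {biased:.2f}")
    print(f"average evaluation-run result:            {unbiased:.2f}")
```

With no true effect, the first-run average of the flagged wins is clearly positive because only lucky results were selected, while the evaluation-run average sits close to zero. Calling the function with a much smaller `n_experiments` shows how noisy the evaluation average becomes when only a few experiments are available.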

Even this approach is not a perfect solution. At scale, across multiple teams, UBIs work well. But if an individual team runs only 10 or 20 experiments per quarter, it may encounter a lot of noise, since this one team doesn’t have enough tests to cancel the variance effectively. (Our rule of thumb: It takes at least 50 experiments to eliminate variance.) In addition, teams may be reluctant to accept results that are unclear, running experiments again and again to get better statistics. This reduces velocity — bringing us back to where we started. As with choice for our customers, there is a right amount of testing: Too little yields inaccurate results, whereas too much slows us down and impedes growth.

All this is to say, our primary KPI remains a work in progress. Despite its limitations, though, we have found the UBI system to be quite useful overall. Teams’ behaviors are now more aligned with creating business value through conversions, without significantly hurting velocity. UBIs allow us to measure how much teams and individuals contribute to the company’s revenues quarter by quarter, since every incremental booking can be given a dollar value. Managers can now make data-based decisions about whose performance to reward and where to invest more resources in the business.

Changing the Culture

Our work to define a primary KPI + constraint on the front end and in marketing has permeated the company. Today, this approach to decision-making can be found in every part of Agoda, even in areas that traditionally aren’t experiment focused. In our inventory-purchasing organization, for example, we allocate tasks to staff members in 35 countries, and the impact of each task on business results is estimated in margin points (a proxy for UBIs in a part of the business that doesn’t move the needle on bookings). The constraint we’ve imposed is to reduce adverse partner behavior. (We want our inventory-purchasing team to be aggressive so that we can decrease costs and increase profit margins, but if we push too hard, hotel partners might stop working with us altogether.) We run experiments to assess which strategies correlate with the greatest margin increases while maintaining fruitful partner relationships.

One predictable cultural outcome of our focus on KPIs is that we spend a lot of time talking about measurement at the beginning of every project — and then we never stop talking about it, to hold teams accountable. Another outcome is a bit more surprising: Hierarchy now matters less in our decision-making. Arguing about how to measure success is much less ego-based than arguing about the decisions themselves, and continual testing proves that even the most senior, experienced people are wrong most of the time. It’s hard for leaders to be arrogant and pull rank when we often find that other people’s ideas work better than our own. The skill of persuasion doesn’t matter as much as it used to either, since we defer to experiment results.

Perhaps one of the greatest cultural benefits of unifying and continually refining our KPIs and constraints is a shared sense of purpose. Rather than having hundreds of KPIs, and the confusion and silos they generate, striving for streamlined metrics gets people moving in the same direction. Because that helps employees understand how their work contributes to the company’s overall success, it also fosters cooperation, collaboration, and learning. Employees recognize that they benefit from what others do and the knowledge they produce and share.

For these reasons, we believe our obsession with setting and tracking the right KPIs and constraints has been our most effective weapon in a tough market. This approach defines us and will define our future, and we think it can work for others as well.