by Jesse Wolfersberger | February 20, 2018
Among the new-age stats that have taken over baseball, none has caught on better, or caused more controversy, than Wins Above Replacement, more commonly known as WAR. The goal of this measure is to quantify everything a player does on the field, offensively and defensively, and come up with one number that summarizes his value to his team -- how many wins he created. That grandiose scope is why WAR has gained both popularity and enthusiastic pushback from fans and media members. It is also why the stat is so relevant to measuring the results of your incentive programs.

This column isn't about WAR, but I'll spend a few sentences on it for context. Mike Trout smacks a single to right field in the 9th inning. That event could win the game if the score were tied and there were a runner on third base. If his team were down 15 runs with no one on base, the same single would be almost inconsequential. The key to WAR is that Trout gets the same credit for each of those singles, because the only difference is the context of the at bat, which he had no control over. Every single is credited as an average single, every double as an average double, and so on. The player gets credit only for what he does, independent of his team's quality or where he hits in the lineup.

The final step of the WAR calculation is the one I want to highlight -- replacement level. If Paul Goldschmidt gets injured, the Diamondbacks don't fall back onto an average player; they fall back onto a bench player or a minor-leaguer. That's what replacement level means in baseball: a player who could be acquired for little or no cost -- somewhere between an average player and zero. When measuring an incentive program, whether it's a contest, a promotion, or an annual program, you need to ask yourself, "What is replacement level?"

It's easy to measure the top-line activity. If it was a contest, for example, how many people entered and how much did they sell? It might be harder to answer the replacement level question. If the contest didn't run, certainly the sales force would have still sold something. Just as in baseball, zero is rarely the baseline to use in the incentive industry. 

The best way to determine a baseline is with a random control group. That's great for academics, but often difficult to pull off in a business context. The methodology people often rely on instead is year-over-year comparison: if the contest ran in January, you compare to last January. Although I see this method often, I am here to tell you it is not a good idea. Real-world data is too messy to rely on last year's results. Every industry is going through major changes right now, and using last year's data as the sole measure of a baseline is incomplete at best.
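When a random control group is feasible, the baseline math is simple: the control group's average is your replacement level, and the program's lift is the difference in means between the two groups. Here is a minimal sketch of that logic, using illustrative synthetic numbers rather than data from any real program (the sales figures and the simulated 5-unit effect are invented for demonstration):

```python
import random
from statistics import mean

random.seed(42)

# Illustrative synthetic data (not from any real program): baseline
# monthly sales for 1,000 reps, roughly normal around 100 units.
reps = [random.gauss(100, 20) for _ in range(1000)]

# Randomly assign half to the contest (test) and half to control.
random.shuffle(reps)
test, control = reps[:500], reps[500:]

# Simulate the contest lifting each test rep's sales by about 5 units.
test = [sales + 5 for sales in test]

# Replacement level is the control group's average; lift is the gap.
lift = mean(test) - mean(control)
print(f"Estimated incremental lift per rep: {lift:.1f} units")
```

Because assignment is random, any preexisting differences between the groups wash out on average, which is exactly what a year-over-year comparison cannot guarantee.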

When a random control group isn't possible, it's time to engage the data scientists. There is a whole family of analytical techniques for statistically determining baseline behavior. One is to build a predictive model, based on past behavior, participant details, and market factors, that estimates what each person would have been expected to do without the contest. Another is control matching. There are several flavors of this technique, but the idea is to create a virtual control group, made up of other real participants, weighted and matched to share the same characteristics as the test group.
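One simple flavor of control matching is greedy one-to-one nearest-neighbor matching: for each participant, find the most similar non-participant on a few characteristics and use those matches as the virtual control group. The sketch below assumes hypothetical records with just two features, prior sales and tenure; the names and numbers are invented for illustration, and a production version would standardize features and likely match on a propensity score instead:

```python
import math

# Hypothetical participant records: (prior_sales, tenure_years).
# test_group = contest participants; pool = non-participants.
test_group = [(120.0, 3.0), (95.0, 1.5), (140.0, 6.0)]
pool = [(118.0, 2.5), (80.0, 1.0), (150.0, 7.0),
        (96.0, 2.0), (135.0, 5.5), (60.0, 0.5)]

def distance(a, b):
    # Euclidean distance on the raw features; real work would
    # standardize each feature so neither one dominates.
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Greedy 1:1 nearest-neighbor matching without replacement.
matched = []
available = list(pool)
for person in test_group:
    nearest = min(available, key=lambda c: distance(person, c))
    matched.append(nearest)
    available.remove(nearest)

print(matched)  # the virtual control group, one match per participant
```

The matched group's subsequent sales then play the role of replacement level, just as a true control group's would.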

Two pieces of advice for creating a sophisticated replacement level:

 1) Call your shot. This means choosing your methodology at the start of the program. Only then can you watch the effect of the program rise above the baseline over time. This is psychologically much more powerful than waiting until the end, by which point people have already made up their minds about the program.

 2) Get familiar with political science. This is the field where much of the control-matching work comes from. Unlike medical researchers, political scientists usually can't create control groups, so they have to deal with messy real-world data. Most of what I've learned about artificial control groups comes from a former colleague who was a political science major.

In this industry, much time and effort is spent on measurement. No matter what vertical your business is in, I would wager you've spent several conference calls discussing topics such as tagging, sales allocations, or invoice matching. Before you draw any conclusions about the performance of your program, you should spend some of that effort making sure you have the right baseline too. Then you can create a WAR for your program -- except instead of wins, you're measuring dollars above replacement.

Jesse Wolfersberger leads the Decision Sciences team for Maritz Motivation Solutions and specializes in merging the fields of behavioral science and artificial intelligence. Contact him at [email protected] to discuss whether you are using data in your programs to make them smarter.