Making the Multi-Armed Bandit work for you

The multi-armed bandit (MAB) is a famous thought experiment in Data Science, more specifically reinforcement learning, where a gambler has to choose among multiple slot machines, each with a different, unknown probability of winning, and the gambler’s task is to maximize their return while playing the different slot machines.
Since there are multiple slot machines to choose from, you could determine the pay-out probabilities by taking many chances on each machine and collecting data until you know for sure which one is best. Doing this reveals the win rates of all the slot machines to a high degree of certainty (a typical statistical experiment), but in the process you waste a lot of money on machines with low rates of return. The alternative is to narrow in on a few promising machines more quickly, continuously evaluating your winnings and concentrating your bets on those machines for higher returns. This is the approach the MAB follows.
The two opposing forces at play with bandits are ‘exploration’ and ‘exploitation’. Exploration is necessary to discover the return on investment of each slot machine. Exploitation maximizes profits: play the machine with the highest observed win rate more often than the rest. As play progresses, the algorithm balances exploration against exploitation to optimize the return on investment, eventually converging on a clear indication of the best slot machine to play.
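To make that balance concrete, here is a minimal sketch of one popular bandit algorithm, Thompson sampling, run against three simulated slot machines. The win probabilities and number of rounds are made-up values for illustration only.

```python
import random

# Hypothetical win probabilities for three slot machines (unknown to the gambler).
true_win_rates = [0.04, 0.11, 0.08]

# One Beta(1, 1) prior per machine, tracked as win/loss counts.
wins = [0, 0, 0]
losses = [0, 0, 0]

for _ in range(10_000):
    # Exploration and exploitation in a single step: draw a plausible win
    # rate for each machine from its posterior and play the highest draw.
    sampled = [random.betavariate(wins[i] + 1, losses[i] + 1) for i in range(3)]
    arm = sampled.index(max(sampled))

    # Pull the chosen arm and record the outcome.
    if random.random() < true_win_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1

# Most pulls should have concentrated on the best machine (index 1 here).
print("pulls per machine:", [wins[i] + losses[i] for i in range(3)])
```

Early on the posteriors are wide, so every machine gets played (exploration); as evidence accumulates, the draws concentrate on the machine with the highest win rate (exploitation).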
The most common use case for bandits is marketing campaigns, but interesting recent use cases include optimal dose-finding in clinical trials and optimizing Covid-19 testing policy under limited testing capacity. Finding the optimal policy is essentially what reinforcement learning is all about, which makes bandits a natural fit for these applications.
MAB is perfect for cases where:
· There is no need to interpret the results or performance of the under-performing arms, and all you care about is maximizing output with respect to some optimization criterion.
· The window of opportunity for optimization is short-lived and there is not enough time to gather statistically significant results. For example, optimizing the pricing of a limited-period offer, or short-term campaigns where the opportunity cost of traditional statistical testing (A/B testing) is simply too large.
· The optimization horizon is long, especially when selecting policies. MAB has been described as a “set it and forget it” approach, since it optimizes automatically with respect to the chosen criterion.
· The problem has several dimensions. In more complex situations, multiple MABs can be set up together, one per dimension, to optimize a multidimensional problem (see the sketch after this list).
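One simple way to set this up, sketched below under the assumption that the dimensions can be optimized independently, is to run one bandit per dimension. The epsilon-greedy rule, arm labels, and price points are invented for illustration.

```python
import random

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy bandit; one instance per decision dimension."""

    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.pulls = {a: 0 for a in self.arms}
        self.reward = {a: 0.0 for a in self.arms}

    def choose(self):
        # Explore a random arm with probability epsilon; otherwise
        # exploit the arm with the best average reward so far.
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        return max(self.arms, key=lambda a: self.reward[a] / max(1, self.pulls[a]))

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.reward[arm] += reward

# Hypothetical campaign optimized along two dimensions at once.
headline_bandit = EpsilonGreedyBandit(["A", "B", "C"])
price_bandit = EpsilonGreedyBandit([99, 129, 149])

headline = headline_bandit.choose()
price = price_bandit.choose()
# ... show the offer to a customer and observe the outcome ...
converted = 1.0  # stand-in for the observed conversion (1 = sale, 0 = no sale)
headline_bandit.update(headline, converted)
price_bandit.update(price, converted)
```

If the dimensions interact strongly, a single bandit over the combinations of arms (or a contextual bandit) is usually the safer design.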
If you’re curious about how this technology can be applied to your unique situation, get in touch with our Data Science team, who will gladly walk you through the technique in detail and assist you with setting up bespoke bandits to optimize various aspects of your business.